**3.1. The scorecards method**

Regarding *Fit*, it is no stretch of the imagination to state that the scorecard methodology *"fits like a glove"* within the overall ALIZA process. It enables the benchmarking processes to rapidly define the current and target situations from a business perspective and provides a relatively seamless interface to the OSE Calculator.

But the *Form* and *Function* are a different matter entirely. Most of the Business Users, after a cursory initial inspection, jumped to the conclusion that the I4-PS and I4-ES scorecards are in an ideal form. The fact that the scorecards clearly outline the key metrics defined by a reputable industry organization (the VDMA), that the graphics are easy to understand, and that the scales form a continuum and are measurable appealed strongly to them. One user even went so far as to declare *"Great, now we can manage Industry 4.0".* However, this work has delved considerably deeper and found issues with their *Function* which, although not insurmountable, are not insignificant and must be addressed before widespread adoption.

An experiment was designed to validate the accuracy of these scorecards. The objective was to determine whether they were *Repeatable* (the same inspector obtaining the same result when evaluating the same item more than once) and *Reproducible* (different inspectors obtaining the same result when evaluating the same item) as gauges [21]. The first stage of validating the scorecards was conducted at the end of the first semester of the MEng in Mechatronics programme at the University of Limerick in 2017. Eight students worked as a group and used the I4-PS and I4-ES to rate two pieces of equipment. The second stage was conducted at the end of the second semester. Four randomly selected students, all members of the original team, were asked to use the I4-PS and I4-ES again to rate the same two pieces of equipment. The results were analyzed, and significant variation was observed. On the five-point scale of the scorecards, the Lower Control Limit (LCL) lay close to 0 across all metrics and equipment, while the Upper Control Limit (UCL) ranged between 3 and 5. Several factors, such as the group dynamic versus individual scoring, new knowledge attained, knowledge forgotten, or simple confusion, may have influenced these outcomes. Regardless of the root cause of the variability, these results highlight the fact that gauges which appeal to our desire not to increase cognitive load [15] and which are easy to memorize [16] in no way guarantee accuracy.
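The control-limit analysis described above can be sketched in a few lines of code. The ratings below are hypothetical placeholders, not the study's actual data, and the mean ± 3σ formulation (clipped to the five-point scale) is an assumed, conventional way of deriving control limits; the study's exact computation may differ.

```python
# Illustrative sketch: control limits for repeated scorecard ratings of
# one piece of equipment on a five-point scale. All values hypothetical.
from statistics import mean, stdev

def control_limits(ratings, scale_max=5):
    """Return (LCL, UCL) as mean +/- 3 standard deviations,
    clipped to the bounds of the rating scale."""
    m = mean(ratings)
    s = stdev(ratings)
    lcl = max(0, m - 3 * s)
    ucl = min(scale_max, m + 3 * s)
    return lcl, ucl

# Twelve hypothetical ratings of a single metric (eight first-stage,
# four second-stage), showing the kind of spread reported above:
ratings = [2, 2, 3, 1, 4, 2, 3, 1, 4, 1, 3, 2]
lcl, ucl = control_limits(ratings)
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}")  # limits span most of the scale
```

With rater-to-rater spread this large, the ±3σ band covers nearly the whole five-point scale, which is the pattern the experiment observed: the gauge cannot distinguish between equipment at different maturity levels.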

But all is not lost. A detailed review with the students revealed that they had significantly different interpretations of the iconography; the words were simply not descriptive enough and were open to interpretation (e.g. what does *"connected"* really mean?), while many of the categories were not *mutually exclusive*. Thus, it can be concluded that, with further experiments, the content of the scorecards can be optimized to minimize variability and increase accuracy to a point where the scorecard methods can be generally relied upon to achieve their *Function*.
