Estimating Interrater Reliability of Examiner Scoring for a State Quality Award

October 2002
Volume 9 • Number 4



by Garry D. Coleman, University of Tennessee, Eileen M. Van Aken, Virginia Tech, and Jianming Shen, Lennox Industries

Examiner scores for two years of a state quality award were analyzed by sector to estimate interrater reliability. The intraclass correlation coefficients (ICC), ICC(1,1) and ICC(2,1), were chosen as the statistics to estimate reliability to enable the researchers to generalize results from specific examiner teams to the larger pool of examiners, thus providing an assessment of the overall scoring effectiveness in the state quality award. These forms of ICC address interchangeability of examiners—that is, they provide an estimate of the reliability of the pool of examiners no matter which individual examiners were selected to evaluate applicants. Interrater reliability, as shown by ICC values, ranged from low to moderate, with significant values ranging from 0.18 to 0.58. These low correlation coefficient values were likely due to the inferential limitations of the field data used as much as any lack of consistency among examiners.

Key words: quality award examiners, rater consistency, rater reliability


Quality awards and organizational assessment processes based on quality awards have become commonly used tools for evaluation, improvement, and recognition in the past 10 to 15 years. Many benefits can be realized from participating in structured assessment processes, including recognition from receiving an award, receiving feedback from experts via an external evaluation, facilitating self-assessment and improvement in results, and promoting economic development (Coleman and Davis 1999). There is mounting evidence that organizations that have scored relatively high in quality award processes achieve superior performance levels compared to those that do not (for example, see the analysis of stock price in NIST 2001a). This doesn’t imply that winning the award causes a company’s stock price to increase, but rather supports the idea that the strategies and actions that score well against the award criteria also support superior financial performance (Coleman and Davis 1999).

Although there continue to be relatively few applicants for the Malcolm Baldrige National Quality Award, state quality awards are receiving healthy numbers of applicants (NIST 2001b; Vokurka 2001). This may be because winning the state awards seems more feasible and within reach for most applicants, as well as providing useful learning for an organization prior to entering the more intense and expensive Baldrige process. A second reason for participating in state level awards is the perception of additional benefits available to applicants (for example, opportunity to discuss feedback face-to-face with examiners, increased probability of receiving a site visit, opportunity to network with other applicants and award recipients in the same geographic area). Many of the state quality awards have gained face validity by adopting the Baldrige Criteria for Performance Excellence (CPE) as their award criteria. The widespread availability and use of the Baldrige CPE provides the additional benefit of a consistent standard for internal and external use. Bemowski and Stratton (1995, 43) found that of those organizations that obtained copies of the Baldrige Award criteria and used them, more than 70 percent used them as a “source of information on business excellence.” In addition, Van der Wiele et al. (2000) found that 71 percent of the greater Boston area firms responding to their survey reported using self-assessment against quality award criteria or other standards. So without ever applying for an award, an organization can use the award criteria as a low-cost training, assessment, and learning tool.

Organizations invest considerable resources in preparing for and conducting these assessments. The findings from these assessments are used to justify changes in the organization, its priorities, and resource allocation. While the Baldrige CPE are an example of a consistent standard used for self-assessment (see Van der Wiele et al. 2000, 16), the process used for scoring and evaluating the organization against a standard is likely more variable. Who evaluates, how they do it, and what level of effort is applied varies not only for internal organizational self-assessments, but also among the many local and state quality award processes. Furthermore, for such an important measurement and evaluation tool, little research has been published regarding the psychometric properties of this form of measurement. The amount of, and impact of, variability in scoring and evaluation processes is ripe for investigation. The research that has been published tends to be exploratory, likely due to the perception of a limited theoretical basis for the methods employed. By conducting empirical investigations of the various methods used and publishing results, one can begin to justify greater uniformity and consistency among self-assessment and quality award processes. This is not to advocate uniformity for uniformity’s sake, but to provide data and analysis that justify the adoption of best practices for assessment, evaluation, and scoring.

This article reports the findings from an investigation of rating effectiveness for a state-level quality award. While the article does not report all findings from this research, it does report on two studies (Sienknecht 1999; Shen 2001) focused on interrater reliability—a desirable characteristic that is always present to some degree. For the purpose of these studies, interrater reliability was conceptually defined as the similarity of scores among examiners when scoring the same applicants on a single category (dimension). The purpose of the two studies was to estimate the interrater reliability of examiners’ scores of applicants in a specific state quality award (SQA). These studies were modeled on analogous studies from the industrial psychology literature. In particular, this article reports the estimated interrater reliability by category for each sector for each of two consecutive years in the SQA. Estimating interrater reliability was of interest for several reasons.

The first reason was the desire to create visibility for the amount of variability existing among examiners’ scores, so that those processes using individuals or small teams of examiners have greater knowledge of the potential process capability they are working within. Given the decisions that are made using these scores (such as providing feedback to applicants, making changes to the organization, and/or selecting organizations to receive site visits in a quality award process), this would appear important. A second reason was to add to the public database on examiner score variability, covering not only the amount of variability but also ways to reduce it. Having visibility for the amount of variability creates the ability to test various controls (for example, examiner training, examiner characteristics and selection, and so on) intended to reduce variability. Once a published, replicable body of knowledge has been established, this knowledge can be used to optimize each process at the level of reliability and accuracy appropriate for its purposes. A third reason was to provide information to help those conducting self-assessments understand why they should expect differences between self-assessment scores (even when examined by a third party) and the scores of actual award examiners (or any second team of examiners); that is, how much of the difference might be random, how much might be due to the examiner, and how much might reflect true performance differences among applicant organizations.

Lastly, interrater reliability was studied to explore the issue of interchangeability of examiners. There are several approaches used in state quality awards to manage consistency of examiner scoring. Many SQAs have identified different sectors (for example, manufacturing, service, and so on) and may assign examiners to sector teams with the intent that all examiners on a given sector team would evaluate all applicants in that sector during the independent review (scoring) stage. Practical issues relating to the number of applicants, the available pool of examiners, and/or examiner turnover may make this practice infeasible—in other words, the same set of individuals may not score all applicants in a sector in a given year, or in subsequent years. Not all state award processes use this sector approach, nor do all awards expect the same team of examiners to score all the applicants in a given cohort group. In these cases, the awards may use different, but perceived equivalent, teams of examiners to consistently evaluate each of the applicants. Thus, in these cases, reliable scoring is an issue. Therefore, the estimate of interrater reliability needed to be in a form that recognized the reality of examiner interchangeability. In other words, rather than estimating the absolute level of interrater reliability for a sector team (that is, the reliability of a specific mix of examiners who scored a specific set of applicants), the interest here was in generalizing the estimate of interrater reliability to the larger pool of examiners, providing a reliability estimate for the overall scoring “system” in the award process. Thus, the estimate for the interrater reliability for a sector would be valid even if a different mix of examiners had been used to rate that set of applicants. This interest dictated the selection of the particular measures of interrater reliability as discussed later in the article.


This study is concerned with differences in the ratings (scores) provided by two or more raters (examiners) independently evaluating the same applicant organization (ratee). Reliability in this context means consistency of examiners, or interrater reliability. Interrater agreement, although similar, is a distinctly different measure of interrater convergence (Shrout and Fleiss 1979). Consistency is the relative lack of variation among raters and may be measured in many ways. Specifically, interrater reliability is the extent to which two or more raters independently provide similar ratings on given aspects of the same individual’s behaviors (Saal, Downey, and Lahey 1980). Where multiple ratees are involved, interrater reliability includes low variation among raters’ scores for each applicant and correlation of raters’ scores among applicants. Interrater reliability is a popular way of measuring consistency, although Saal, Downey, and Lahey (1980) identified at least five distinct operational definitions for interrater reliability. Consistency and reliability are assessed not for individual raters, but for groups of raters that have independently rated the same ratee(s).

Descriptive studies of the interrater reliability of state quality award examiners have produced modest results. Keinath and Gorski (1999) found little statistical difference between the mean item scores of experienced and inexperienced examiners on ten teams of Minnesota Quality Award examiners and suggested this was evidence of interrater reliability. Each team had at least two inexperienced examiners and more than two experienced examiners. While there is logic to their argument, failing to find a statistical difference between groups could be a function of within-group variance as much as between-group location (that is, mean scores). Working within the restricted scoring range of zero to 100 percent, teams of examiners with poor interrater reliability would have produced the same statistical result. Keinath and Gorski did not cite any previous research using their operational definition of interrater reliability.

Industrial and organizational psychologists have long studied interrater reliability in the context of performance appraisal research (Berk 1979; Bernardin 1977; Bernardin, Alvares, and Cranny 1976; Borman 1975; Friedman and Cornelius 1976; Heneman et al. 1975; Ivancevich 1984; Rothstein 1990; Saal, Downey, and Lahey 1980; Shrout and Fleiss 1979; and Towstopiat 1984). Saal, Downey, and Lahey (1980) conducted a historical survey of research on the psychometric qualities of rating data and found five common operational definitions of interrater reliability. These ranged from simply calculating the variance of the ratings assigned to one ratee on one dimension by several raters (Bernardin 1977) to methods employing various statistics derived from a rater x ratee analysis of variance (ANOVA) (Friedman and Cornelius 1976; Heneman et al. 1975; Shrout and Fleiss 1979). Given the lack of agreement on operational definitions, a triangulation approach may seem logical—that is, using two or more of these operational definitions to look for consistent results. However, Saal, Downey, and Lahey (1980) point out that the inherent differences in the quantification strategies associated with these operational definitions should be expected to produce different, sometimes even diametrically opposed, results. Instead, researchers should choose their operational definition based on close alignment with the conceptual definition of the psychometric quality being studied. While seemingly obvious, Saal, Downey, and Lahey cite multiple examples where this practice was not followed. Comparing statistical results with graphical analysis that illustrates the location and dispersion of the ratings would enable patterns consistent with both the conceptual and operational definitions to be seen.

For interrater reliability as it applies to organizational assessments, the following operational definitions seem particularly relevant:

1. Measured by the variance (or standard deviation) of the ratings assigned to a particular ratee by several raters for a given dimension (Bernardin 1977):

Smaller standard deviations for each behavior dimension reflect greater interrater reliability. This definition is sometimes operationalized as boxplots to support comparison of interquartile range as a surrogate measure for variance.

2. Measured by the intraclass correlation coefficients (ICC), the most frequently used reliability coefficients. Six methods for estimating coefficients of interrater reliability were critiqued by Berk (1979). Five of the reliability statistics produced precise estimates, yet possessed similar limitations. Only one of the six, the intraclass correlation-generalizability theory approach, seemed to offer the precision, comprehensiveness, and flexibility required to deal with the complexity of reliability assessment. Shrout and Fleiss (1979) provide guidelines for selecting among the different forms of the ICCs. In a typical interrater reliability study, each of a random sample of n ratees is rated independently by k raters on one or more dimensions of interest. Three different cases of this kind of study can be defined (Shrout and Fleiss 1979). In each case, the larger the ICC value, the higher the interrater reliability.

  • Case 1: Each ratee is rated by a different set of k raters, randomly selected from a larger population of raters. The reliability of each set of raters’ ratings is estimated by (Shrout and Fleiss 1979):

    ICC(1,1) = (BMS − WMS) / [BMS + (k − 1)WMS]    (2)

    BMS—between-mean square in a rater x ratee ANOVA

    WMS—within-mean square in a rater x ratee ANOVA

    For the SQA, this form of ICC was necessary when a different mix of examiners scored each of the applicants in a sector and when the authors wished to estimate reliability for the entire pool of examiners. A sector ICC value was calculated for each dimension of interest—in this case, each category of the award criteria.

  • Case 2: A random sample of k raters is selected from a larger population, and each rater rates each ratee. That is, each rater rates n ratees altogether. The reliability of this set of raters’ ratings is estimated by (Shrout and Fleiss 1979):

    ICC(2,1) = (BMS − EMS) / [BMS + (k − 1)EMS + k(JMS − EMS)/n]    (3)

    EMS—error mean square in a rater x ratee ANOVA

    JMS—rater’s mean square in a rater x ratee ANOVA

    For the SQA, this form of ICC was necessary when the same mix of examiners scored all the applicants in a sector and when the authors wished to estimate reliability for the entire pool of examiners. Again, a sector ICC value was calculated for each category in the award criteria.

  • Case 3: Each ratee is rated by each of the same k raters, who are the only raters of interest. The reliability of this set of raters’ ratings is estimated by (Shrout and Fleiss 1979):

    ICC(3,1) = (BMS − EMS) / [BMS + (k − 1)EMS]    (4)

    For the SQA, this form of ICC would be used when the same mix of examiners scored all the applicants in a sector and when the authors wished to estimate the absolute level of reliability for that specific team of examiners (for example, an examiner team scoring applicants in one sector). Software, such as SPSS, can be used to calculate these coefficients. Alternatively, Futrell (1995) provides examples that show the procedures for calculating these coefficients manually.

3. Measured by a statistically significant ratee main effect, especially one that explains a sizable proportion of the rating variance, in a rater x ratee ANOVA (Friedman and Cornelius 1976):

This last approach “is identical to one of the operational definitions of range restriction. The presence of a significant ratee main effect can be interpreted as both high interrater agreement and the absence of range restriction, whereas the absence of a significant ratee main effect can indicate both range restriction and a lack of interrater agreement” (Saal, Downey, and Lahey 1980, 419).
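The three ICC forms described under the second operational definition can be computed directly from the mean squares of a rater x ratee ANOVA. The following is a minimal Python sketch, not the SPSS procedure used in this study. The function names and the small ratings table are illustrative assumptions (the table is not SQA data), and the sketch assumes a complete design with no missing scores.

```python
from statistics import mean


def anova_mean_squares(scores):
    """Mean squares for a complete ratee-by-rater table.

    scores[i][j] is the score given to ratee i in column j.
    Assumption of this sketch: n ratees, k scores each, no missing cells.
    JMS and EMS are only meaningful when each column is the same rater
    (Cases 2 and 3); Case 1 uses BMS and WMS only.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    ratee_means = [mean(row) for row in scores]
    rater_means = [mean(row[j] for row in scores) for j in range(k)]

    ss_between = k * sum((m - grand) ** 2 for m in ratee_means)
    ss_within = sum((x - ratee_means[i]) ** 2
                    for i, row in enumerate(scores) for x in row)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_within - ss_raters

    return {
        "BMS": ss_between / (n - 1),
        "WMS": ss_within / (n * (k - 1)),
        "JMS": ss_raters / (k - 1),
        "EMS": ss_error / ((n - 1) * (k - 1)),
        "n": n,
        "k": k,
    }


def icc_1_1(ms):
    """Case 1: a different set of k raters for each ratee."""
    return (ms["BMS"] - ms["WMS"]) / (ms["BMS"] + (ms["k"] - 1) * ms["WMS"])


def icc_2_1(ms):
    """Case 2: the same k raters, treated as a sample from a larger pool."""
    k, n = ms["k"], ms["n"]
    denom = ms["BMS"] + (k - 1) * ms["EMS"] + k * (ms["JMS"] - ms["EMS"]) / n
    return (ms["BMS"] - ms["EMS"]) / denom


def icc_3_1(ms):
    """Case 3: the same k raters, who are the only raters of interest."""
    return (ms["BMS"] - ms["EMS"]) / (ms["BMS"] + (ms["k"] - 1) * ms["EMS"])


# Illustrative ratings only (6 ratees x 4 raters); not SQA scores.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
ms = anova_mean_squares(ratings)
print(round(icc_1_1(ms), 2), round(icc_2_1(ms), 2), round(icc_3_1(ms), 2))
```

For this illustrative table the three coefficients come out at roughly 0.17, 0.29, and 0.71. The same scores therefore look far more reliable when rater differences are excluded from the error term (Case 3) than when any rater from the pool might be used (Cases 1 and 2).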


Participants

Participants were the 1998 and 1999 examiners for an SQA. Twenty-three volunteer examiners participated in each of the 1998 and 1999 award processes, although they were not the same 23 individuals in both years. Many of the examiners were members of the board that oversees the award, predominantly employees of previous award winners. For example, in the 1999 award process, two-thirds of the examiners were also board members. The SQA board provided the researchers with the examiners’ scores for each applicant under the condition that applicants and examiners remain anonymous.

The 1998 examiners received four hours of nonmandatory examiner training in one session, with approximately 75 percent of the examiners participating. In the 1999 cycle, examiners received five hours of training spread over two sessions; 65 percent of examiners attended both sessions while 91 percent attended one of the two sessions. Training in both years included presentation and discussion of the scoring bands and examples of what might be observed across the bands in different types of organizations. Training in 1999 also included establishing a frame-of-reference through individual scoring of items from a common case study followed by discussion of scores. Of the 1999 set of 23 examiners, 11 were first-time examiners, while the rest were repeat or experienced examiners from 1998 (or earlier). However, even first-time examiners frequently had a lot of experience in the area of quality and organizational assessment processes through their formal job positions. Furthermore, many of the first-time examiners were from organizations having won the SQA the previous year and were subsequently asked to join the board; therefore, by definition, these individuals were familiar with the award criteria. Additionally, the examiners about equally represented the four sectors in terms of their current organization, while several were in organizations not necessarily aligned with one of the sectors (that is, university faculty and consultants).

In principle and historically, the SQA has assigned examiners primarily to one of four sectors: private sector manufacturing (PSMfg), private sector service (PSSv), public sector local agencies (PSLA), or public sector state and federal (PSSFed). In practice and particularly more recently, due to the practical issues mentioned earlier, examiners have often been asked to score applications in other sectors. For example, in 1999, only eight of 23 examiners evaluated applications in only one sector, while the rest evaluated applications in two or more sectors. For three of the sectors in 1998, the examiners assigned to a particular sector did evaluate all the applicants in that sector. The exception to this in 1998 was PSMfg, which had the most applicants. In 1998, the PSMfg team received 12 applications, nearly as many as the other three sectors combined (see Table 1 for a summary of the number of applicants and examiners by sector for both years). To avoid overload of the PSMfg examiners in 1998, examiners from the other sectors evaluated some of the applicants. This resulted in a total of 16 examiners being used for the sector, with only one examiner evaluating all 12 applicants, four examiners evaluating seven applicants, three examiners evaluating six applicants, and the other examiners evaluating five or fewer applicants each. This data structure imposed many limitations on the statistical analysis of the PSMfg data, discussed under Data Analysis Procedure.

The procedure for assigning examiners in 1999 was similar to the procedure used for the 1998 PSMfg examiners. That is, different sets of examiners from the larger examiner population evaluated each applicant organization, even within a sector. Thus, in 1999 at least, the implicit approach was to treat the pool of examiners as just that—a pool of examiners from which to draw team members for each applicant. Examiners were assigned to applicants partly because of the examiner’s experience, partly to give examiners applicants in only one sector, and partly at random, to spread the workload. There were typically some common examiners within each sector, but no intact examiner team evaluated more than one applicant. As mentioned previously, this placed limitations on the analysis of the data. Every team for an applicant had at least one experienced examiner, and most had more than one. Every applicant had at least four examiners scoring the application.

In summary, examiners were not assigned to a sector, and thus the applicants, on a truly random basis. Generally, examiners were assigned to produce heterogeneous teams mixing levels of experience with the award process, job function, and industrial background. In reality, availability and convenience were often key factors in assignments, resulting in a process that was neither purely systematic nor random. As described earlier, the dispersion of examiners varied widely in 1999, with virtually every applicant being evaluated by a different team. While the violation of the assumption of random assignment of examiners is of concern in using ICC, the actual assignment process reflects the realities of a true award process. From the applicant’s perspective, particularly within a sector, the probability of being assigned any specific examiner was equal among applicants.


The dependent variables in this research were the individual examiners’ percent scores for each applicant on the evaluation criteria of the SQA. Prior to assigning percent scores, individual examiners read the application and identified strengths and opportunities for improvement to provide the basis for feedback to each applicant. These qualitative comments were not examined as part of this study. The percent scores were used to calculate statistics estimating the interrater reliability (IRR) of the examiners on each category of the criteria. The independent variables were the evaluation categories (eight in 1998, seven in 1999) and the 37 applicants (25 in 1998, 12 in 1999) organized by sector. This produced an examiner by applicant by dimension (category) design for each sector, although this design was incomplete for some sectors.

The SQA evaluation criteria for 1998 and 1999 are compared in Table 2, and the 1998 criteria are described in Appendix A. The criteria are similar to those used in other quality awards, although the origins of the 1998 criteria predate the Baldrige Award (that is, SQA was established prior to the Baldrige Award). For the purpose of the award process, percent scores were weighted with the appropriate point values and aggregated to arrive at a total score (out of 1000 points). For the purpose of this study, all ICC analyses were conducted using category-level percent scores. The 1998 criteria, unlike the Baldrige criteria, were not broken down into Items and, therefore, scoring occurred at the category level. These percent scores were used to calculate ICC values. In 1999, however, when the SQA adopted a condensed version of the Baldrige Award criteria, scoring occurred at the item level to align with Baldrige scoring processes. Therefore, weighted average percent scores were calculated for each category (based on the item-level percent scores and associated point values for each item); these data were used to calculate ICC values for 1999. Use of percent scores facilitated comparison between categories and removed the weighting effect of different category point values from the statistical analyses.
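The 1999 roll-up from item-level percent scores to a category-level percent score can be sketched as a point-weighted average. This is a minimal illustration under stated assumptions: the function name and the item point values below are hypothetical, and the actual SQA item weights are not given in this article.

```python
def category_percent(item_percents, item_points):
    """Point-weighted average of item-level percent scores for one category.

    item_percents: item scores as decimals (for example, 0.40 for 40 percent).
    item_points:   point value of each item (hypothetical values here).
    """
    if len(item_percents) != len(item_points):
        raise ValueError("each item needs exactly one point value")
    total_points = sum(item_points)
    earned = sum(p * pts for p, pts in zip(item_percents, item_points))
    return earned / total_points


# Hypothetical category with two items worth 40 and 60 points.
score = category_percent([0.40, 0.60], [40, 60])
print(round(score, 2))  # 0.52
```

Because the result is again a percent score, categories with different point values remain directly comparable.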

The SQA scoring process uses two sets of rating scales (scoring guidelines). One set of scoring guidelines applies to the organization’s Approach and Deployment, the second to the organization’s Results. The 1998 scoring guidelines were similar in structure to those used for the Baldrige Award (NIST 1999). One noticeable difference between the 1998 SQA and the Baldrige scoring guidelines was the existence of category-specific guidelines (shown as bullets within the standard guidelines). The 1998 SQA scoring guidelines for Approach/Deployment are shown in Appendix B and the scoring guidelines for Results are shown in Appendix C. In 1999, SQA adopted the Baldrige Award’s scoring guidelines for Approach/Deployment and Results without modification.

Individual examiners used these guidelines as descriptive anchors along the 0 to 100 percent scale to score each applicant on each dimension (that is, item for 1999 and category for 1998 as described earlier). In practice, examiners were told to provide scores in 10-percent increments, although some examiners used five-point increments when struggling to decide on a score. In the next stage of the process, individual examiners’ scores were aggregated by applicant and reviewed by the examiner team assigned to that sector. The output of this review was a consensus decision on which applicants in each sector proceeded to the next stage of the process, a site visit. Consensus scores generated at this stage were not utilized in this study.

The Approach/Deployment scoring guidelines were used to score all of the 1998 SQA criteria except category 8 (results over time), for which the Results scoring guidelines were used. These Results scoring guidelines were also used to score category 3 (employee involvement, development, and management of participation). Category 3 is unusual in the 1998 process in that examiners scored this criterion using both sets of scoring guidelines and computed the average of these two scores to produce an aggregate category 3 score. For 1999, only category 7, business results, was scored using the Results scoring guidelines.

Data Analysis Procedure

Upon receipt of the SQA board’s permission to use examiner scores for this research, the individual examiners’ scorebooks were obtained and scores transcribed to a spreadsheet. Individual percent scores from 1998 were collected in a full examiner (rater) by applicant (ratee) by category (dimension) design, with the exception of the PSMfg applicants (due to the limitations described under Participants). The 1998 PSMfg scores and all 1999 scores produced a partial examiner by applicant by category design, as described earlier. Percent scores were converted to their decimal equivalents for computer-based analysis. SPSS 8.0 and SPSS 10.0.7 for Windows and Microsoft Excel 97 were used for all calculations.

Intraclass correlations (Shrout and Fleiss 1979) were calculated for each category by sector for each of the two years of the award. The ICC seeks to compare the applicant-based variance in scores to the sum of all the variance components for a given dimension of interest (in this case, the categories in the award criteria). Of the forms of ICC for reliability studies compared by Shrout and Fleiss, ICC(2,1) (formula (3) as defined earlier) best fit the questions raised and the data in the full design, where the same set of examiners rated each applicant in the sector. This condition was met in three of the sectors in the 1998 SQA rating process (PSSv, PSLA, and PSSFed). This form calculates ICC over both examiners and applicants, thus providing greater degrees of freedom and most likely a more favorable estimate of reliability, if it is present. For the PSMfg applicants in 1998 and all 1999 sectors, ICC(1,1) (formula (2) as defined earlier) was used because each applicant was rated by a different set of examiners. Even though the selection of examiners for each sector was not random, each team was selected from the population of SQA examiners, and the results for each team are expected to be generalizable to this population of SQA examiners. The authors selected these forms of ICC because they were interested in the overall scoring effectiveness for the SQA, rather than the absolute level of reliability for a particular team of examiners. The absolute level of reliability may be useful for evaluating the past performance of a specific team of examiners; however, it is not inferential and cannot be used to predict the reliability of future teams.

The ideal condition for using ICC(1,1) is when an equal number of examiners rate each applicant in a sector. When the number of examiners rating each applicant is not equal (as was the case for 1998 PSMfg and for 1999 PSMfg and PSLA), ICC(1,1) can be used with some manipulation of the data, although this can result in lower estimates of reliability due to fewer degrees of freedom. Where necessary, examiner scores were randomly omitted to make the number of scores for each applicant equal in the analysis. For example, in the case of one 1999 PSMfg applicant, four examiners scored this application, yet seven, six, five, and six examiners, respectively, rated the other four applicants in this sector. Individual examiner scores for these other applicants were randomly deleted to yield a total of four scores for each of the five applicants in this sector. The same process was used for the 1999 PSLA (four or five examiner scores reduced to four) and 1998 PSMfg (five to seven examiner scores reduced to five) applicants. In the remaining 1999 sectors, there were an equal number of scores from examiners and the random deletion process was unnecessary.
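The random deletion step can be sketched as follows. This is an illustration only: the function name, the seed, and the scores are hypothetical, and only the number of examiners per applicant (four, seven, six, five, and six) follows the 1999 PSMfg example described above. The article does not state how the random omission was carried out in practice.

```python
import random


def balance_scores(scores_by_applicant, seed=None):
    """Randomly drop scores so every applicant keeps the same number.

    The target is the smallest number of scores any applicant received.
    Order within an applicant is not preserved, which does not matter
    for ICC(1,1) because raters are not matched across applicants.
    """
    rng = random.Random(seed)
    target = min(len(s) for s in scores_by_applicant.values())
    return {applicant: rng.sample(scores, target)
            for applicant, scores in scores_by_applicant.items()}


# Hypothetical percent scores (as decimals) for five applicants.
psmfg_1999 = {
    "A": [0.4, 0.5, 0.3, 0.5],
    "B": [0.6, 0.5, 0.7, 0.6, 0.4, 0.5, 0.6],
    "C": [0.3, 0.4, 0.2, 0.3, 0.5, 0.4],
    "D": [0.5, 0.6, 0.5, 0.7, 0.6],
    "E": [0.2, 0.3, 0.4, 0.3, 0.2, 0.4],
}
balanced = balance_scores(psmfg_1999, seed=1)
print({applicant: len(s) for applicant, s in balanced.items()})
```

Fixing the seed makes the deletion reproducible. Because different random draws can yield different ICC estimates, repeating the draw is one way to check how sensitive a result is to the omitted scores.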

Note that the use of ICC(1,1) and ICC(2,1) precludes a statistical analysis of between-applicants differences, given the across-applicants design necessary to obtain the error term. That is, the authors are not calculating reliability estimates for the ratings of each applicant and then comparing between applicants, but, rather, the authors are calculating estimates of interrater reliability across applicants in a given sector. This issue is endemic to correlational analyses and variance analyses, which seek to measure interrater reliability, given that the data used are not test-retest or longitudinal in design (Sienknecht 1999). This formulation of interrater reliability, which effectively looks across applicants (ratees) as the source of the error term, is consistent with the definition of the intraclass correlation established by Guilford (1954, 395):

“If each of k raters has rated n [ratees] on some [dimension] on one occasion, we have the possibility of obtaining intercorrelations of ratings of the n [ratees] from all possible pairs of the k raters. This suggests the use of the statistic known as the intraclass correlation, which gives essentially an average intercorrelation.”

However, this is not necessarily in alignment with the conceptual definition of interrater reliability, which expects a measure of rater consistency for all raters (examiners) on a single dimension for a single ratee (applicant). Because the ICC looks across applicants, it yields an estimate of reliability at a higher level of analysis (that is, for all applicants in a sector) than would be the case for alternate measures of interrater reliability evaluating each combination of applicant and dimension. This is acceptable given the interest in assessing the overall scoring effectiveness for the SQA.

To supplement the ICC and provide a measure of examiner consistency for all examiners on a single dimension for each applicant, boxplots of scores were constructed and interquartile ranges were calculated. Additional graphical analysis included the use of line graphs to illustrate the variation of scores associated with differing levels of ICC values. These line graphs illustrate the alignment between the authors’ conceptual definition of interrater reliability and the outputs of the operational definition in the form of ICC values.


The ICC values calculated for each sector with both 1998 and 1999 scoring data were tested against zero to ensure the estimated values were meaningful. That is, a significant result confirms there is sufficient evidence to believe the ICC value is significantly different from zero and implies some level of correlation among the examiner scores. The results for 1998 and 1999 are shown in Tables 3 and 4, respectively. An example SPSS output reporting ICC values for category 1 of 1998 PSMfg is shown in Appendix E.
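For readers unfamiliar with the test against zero, the standard form given by Shrout and Fleiss (1979) is an F ratio of ANOVA mean squares. The notation below is an assumption introduced for illustration and does not appear in this article: n applicants, each rated by k examiners, with BMS the between-applicants mean square, WMS the within-applicants mean square, and EMS the residual mean square. Whether the SPSS output used exactly this form is presumed rather than confirmed by the text.

```latex
% One-way random effects model, used for ICC(1,1); H_0: \rho = 0
F_0 = \frac{\mathrm{BMS}}{\mathrm{WMS}},
\qquad \text{df} = \bigl(n-1,\; n(k-1)\bigr)

% Two-way random effects model, used for ICC(2,1); H_0: \rho = 0
F_0 = \frac{\mathrm{BMS}}{\mathrm{EMS}},
\qquad \text{df} = \bigl(n-1,\; (n-1)(k-1)\bigr)
```

With few applicants, n - 1 is small and the test has little power, which is consistent with the caution expressed about the 1999 results.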


For the 1998 PSMfg, ICC values for all dimensions except category 4 (rewards and recognition) were significant. While significant, most of these values are in the low to moderate range. The ICC value for category 4 implies extremely low interrater reliability. 1998 SQA PSSv exhibited a similar level of reliability, with ICC values in six out of eight categories significant at the 0.05 level and a seventh significant at the 0.10 level. Only category 3 (employee involvement, development, and management of participation) was not significantly greater than zero. Overall, the scores from the PSSv examiners produced ICC values implying low to moderate interrater reliability. The results were not as favorable for the two 1998 public sectors. Scores from the PSLA produced ICC values that were significant at the 0.05 level in only three of eight dimensions. Scores from the PSSFed examiners produced only one ICC value significant at the 0.10 level.

ICC values from the 1999 data indicated overall low interrater reliability. Only three ICC values were significantly different from zero for PSMfg applicants (categories corresponding to leadership, process management, and business results), as shown in Table 4. For the PSS applicants, only one ICC value was significant—category 3, customer and market focus. For the PSLA applicants, only the ICC value for leadership (category 1) was significant. No ICC values for the PSSF applicants were significant. The small number of applicants, particularly in the three latter sectors, likely contributed to low ICC values. While these 1999 ICC values suggest low interrater reliability, they may be as much a function of the small sample sizes as actual reliability.

Plotting the examiners’ raw scores for all the applicants in a sector provides a visual representation of the level of interrater reliability reflected in the ICC values. Figure 1 shows a line graph of examiner scores from the 1998 PSMfg on category 1 (maturity of effort). The line graphs are not continuous across the applicants because not all examiners rated all the applicants in this sector. Line graphs were used because they illustrate the correlational aspect of IRR as well as the variance among examiner scores. The scores in Figure 1 produced an ICC(1,1) value of 0.47, significant at the 0.05 level. Compare this with Figure 2, the examiner scores from the 1998 PSMfg on category 4 (rewards and recognition). The Figure 2 scores produced a negative ICC(1,1), which was not statistically different from zero. Overlaying the line graphs in Figures 1 and 2 are plots of the examiners’ mean scores for each applicant in the respective category. Comparing the mean score to the line graphs of examiner scores clearly shows why the category 1 scores received the higher ICC value and can be said to reflect moderate interrater reliability. The individual examiner scores used to construct Figures 1 and 2 are shown in Appendices D and F, respectively.

Others (Coleman and Koelling 1998; Keinath and Gorski 1999) have used boxplots to illustrate the consistency of examiner scores. While boxplots are well suited for portraying the level of agreement, line graphs overlaid with the mean scores are better suited for portraying interrater reliability. Boxplots depict the level of variance in a set of scores, but omit the correlational element of interrater reliability. Figures 3 and 4 show the boxplots corresponding to the 1998 PSMfg scores for categories 1 and 4, respectively. In this case, the boxplots do provide support for the findings from the ICC(1,1) values and the line graphs. The interquartile ranges of the scores in Figure 4 are consistently larger than those seen in Figure 3. Such a clear distinction was not apparent in other boxplots for other categories compared with their corresponding ICC values and line graphs.
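The interquartile ranges that underlie such boxplots can be computed with a few lines of code. The sketch below uses invented scores and hypothetical names. It relies on the default quartile convention of Python's `statistics.quantiles`, which may differ slightly from the convention used by the authors' software.

```python
from statistics import quantiles

def interquartile_range(scores):
    """Return Q3 - Q1 for one applicant's examiner scores."""
    q1, _median, q3 = quantiles(scores, n=4)  # three quartile cut points
    return q3 - q1

# Hypothetical percent scores for one category; not data from the study.
category_scores = {
    "Applicant A": [30, 40, 45, 50, 60],   # examiners fairly close together
    "Applicant B": [10, 30, 45, 60, 80],   # examiners widely spread
}

iqr_by_applicant = {
    applicant: interquartile_range(scores)
    for applicant, scores in category_scores.items()
}
for applicant, iqr in iqr_by_applicant.items():
    print(f"{applicant}: IQR = {iqr:.1f} points")
```

An interquartile range summarizes spread for one applicant at a time. It says nothing about whether examiners rank the applicants similarly, which is the correlational element the line graphs capture.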

The 1998 PSMfg data were chosen for this example because they formed an incomplete data structure (that is, overlapping, but different, sets of examiners rated each applicant), requiring the use of ICC(1,1), yet included a relatively large sample size. Thus, this sector and year provided adequate data to illustrate interrater reliability under the least ideal condition. Categories 1 and 4 were chosen because they represented the extreme ICC values observed from this sector.

What is seen in the 1998 PSMfg category 1 scores is a moderate level of IRR that implies similar scoring among examiners, although the range of examiner scores for some applicants approaches 50 percent on a 0 to 100 percent scale. With scores like these, an award administrator might feel that consensus could be easily achieved in most cases and consensus scores would fall well within the range of individual examiner scores. For the few applicants receiving scores over a relatively large range (for example, 45 percent to 50 percent), the administrator might look for explanations such as a relatively ambiguous application. In these cases, consensus may require more involved discussion by the examiners.

At the other extreme, one can see the low level of IRR exhibited by the 1998 PSMfg category 4 scores. Scores such as these might cause concern for an award administrator. A good consensus process is needed to effectively deal with these large score variances and poor correlations. This level of IRR for a specific category may imply a need for improved examiner training on the content of the category or improvement in the criterion used, assuming a frame of reference has been established for scoring. Award administrators (or team leaders) should consider using a graphical analysis like Figures 1 and 2 to diagnose team interrater reliability and calculate ICC values to track relative improvement from year to year.


Interrater reliability of examiner scores was estimated for each sector of the 1998 and 1999 SQA. In most cases where adequate data were available, the examiners exhibited low to moderate interrater reliability. In those cases with small data sets (that is, three or fewer applicants), the results consistently indicated low interrater reliability. This is not surprising, given that these instances have more examiners than applicants. In cases where the small data set was further aggravated by an incomplete data structure (such as all the 1999 sectors), interrater reliability coefficients were particularly low. Such inconclusive results are part of the risk of using field data from an actual quality award. The researcher does not have control of the design and must accept incomplete data structures, trading off internal validity for external validity. When the number of applicants and the availability of examiners allow, award administrators should assign the same examiners to all the applicants in a single sector. Not only does this increase the likelihood of higher interrater reliability, it also increases the likelihood that a consistent frame of reference is used among applicants. That is, a score of 50 percent for one applicant is comparable in meaning to a similar score for another applicant competing in that sector.

Additional care should be taken when evaluating the ICC reliability coefficients. The assumptions of the study likely lowered these values by an order of magnitude compared to the values one might expect from this type of process and study. Because the interchangeability of examiners was assumed important for SQA and other quality awards, ICC(2,1) and ICC(1,1) were chosen to enable generalizing these reliability estimates to the larger pool of examiners and the scoring “system” for the award process. Had only the absolute level of interrater reliability for each team of examiners been of interest, ICC(3,1) would have been used where possible and substantially higher coefficient values expected. Incomplete data structures in 1999 forced the use of ICC(1,1) and, as expected, produced lower coefficient values than either of the other two methods. In an example prepared by Shrout and Fleiss (1979, 424) to illustrate the differences between the forms of ICC, the same data set produced correlation estimates of 0.17 with ICC(1,1), 0.29 with ICC(2,1), and 0.71 with ICC(3,1). Had this study focused on team reliability rather than examiner interchangeability, an alternate form of ICC that estimates the reliability of the team’s mean score would have been used.
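The gap between the three single-rater forms can be reproduced with a short calculation. The sketch below is an illustration rather than the authors' SPSS procedure. The function name `icc_forms` is hypothetical, the formulas are the mean-square expressions from Shrout and Fleiss (1979), and the ratings (six targets, four judges) are the example data as recalled from that paper and should be checked against the original table.

```python
def icc_forms(ratings):
    """Single-rater ICCs from a complete targets-by-raters table.

    ratings: list of n rows (targets), each a list of k rater scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the one-way and two-way ANOVA decompositions
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_between = k * sum((m - grand) ** 2 for m in row_means)  # targets
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)   # judges
    ss_within = ss_total - ss_between
    ss_error = ss_within - ss_raters

    bms = ss_between / (n - 1)
    wms = ss_within / (n * (k - 1))
    jms = ss_raters / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))

    return {
        "ICC(1,1)": (bms - wms) / (bms + (k - 1) * wms),
        "ICC(2,1)": (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),
        "ICC(3,1)": (bms - ems) / (bms + (k - 1) * ems),
    }

# Example ratings as recalled from Shrout and Fleiss (1979):
# rows are targets, columns are judges.
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

results = icc_forms(shrout_fleiss)
for name, value in results.items():
    print(f"{name} = {value:.2f}")
```

ICC(3,1) is the largest because it removes systematic differences between judges from the error term, whereas ICC(1,1) and ICC(2,1) count those differences against reliability. This is the sense in which requiring examiner interchangeability lowers the coefficient.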

One might speculate about differences in interrater reliability among sectors. For example, do the examiner scores from certain sectors exhibit less consistency than others? Here, the PSSFed was the only sector where scores exhibited consistently low interrater reliability in both 1998 and 1999. Based on the authors’ 35 years of combined experience in the quality field, they suggest this may be due in part to the difficulty in evaluating organizational approaches and results in relatively large government agencies/offices. While this may be a plausible explanation, the small sample sizes for this sector preclude any definitive conclusions.

One might also speculate about differences in interrater reliability between years. Specifically, one of the reasons the SQA board adopted the Baldrige CPE in the 1999 cycle was to improve the rating effectiveness of examiners’ scores. From these data, however, one cannot conclude that there was any meaningful change in interrater reliability from 1998 to 1999. As noted earlier, small numbers of applicants in all sectors in 1999 preclude any definitive conclusions based on these data.

As Figure 1 illustrated, interrater reliability among SQA examiners can be good when evaluated on a pragmatic basis. The next step in the SQA process after individual scoring is the aggregation of individual examiner scores for each sector. These aggregate results are presented to the board and examiners at a meeting where sector teams come to consensus on whether each applicant will progress in the process—that is, receive a site visit. In considering the results presented earlier, a team of examiners should find it relatively easy to reach a consensus score on category 1 for each of the applicants in Figure 1. On the other hand, reaching consensus on the applicants’ category 4 scores (see Figure 2) may be more challenging. For each applicant shown in Figures 1 and 2, the same examiner teams evaluated them on these two dimensions with widely different reliability. Consideration of consensus scores is planned for future analyses. Sector teams should superimpose consensus scores on box plots of the team scores when presenting their recommendations for site visits to the entire board.

It is possible that the content of the categories had some effect on reliability. Coleman and Koelling (1998) found less variance among scores of categories with relatively quantitative content (such as process management and business results) compared to those with more qualitative content. Review of the 1998 SQA criteria implies that category 1 (maturity of effort) is more quantitative and perhaps less subjective than category 4 (rewards and recognition). This is anecdotal evidence, as the authors’ results do not display the clear pattern observed by Coleman and Koelling.

Quality award administrators should initiate analyses of rating effectiveness and use these findings to make process improvements (for example, examiner selection, examiner training, scoring process, documentation, and criteria selection). Rating effectiveness studies are a clear example of managing by fact, an often-cited principle of quality awards. Ideally, applicants should expect that an analysis of rating effectiveness (for example, interrater reliability and accuracy) for examiners be conducted and published, just as applicants often require their suppliers to provide validated quality control data. Given that best practices for rating effectiveness studies in this context are not well established, the scrutiny of peer-review publication provides additional assurance that the data and results are reasonable and correct.

The SQA board has made a number of improvements in their process based on this and related analyses. The board has successfully recruited more experienced examiners. Examiner training requirements are more structured and the training has been lengthened. The scoring process has been made more systematic and explicit, including the use of standard templates and checklists. Baldrige-based criteria have been adopted, including tailored item notes for each of the sectors. These changes will likely reduce some of the score variance in future years. For example, the extensive development and more widespread knowledge of the Baldrige Award criteria should reduce some of the ambiguity associated with qualitative dimensions and thus have a positive impact on interrater reliability.

The authors propose several areas for future research related to interrater reliability, rating effectiveness, and examiner scoring from award programs in general. A number of these research areas continue and extend the research described here, including investigating some of the authors’ speculations, while others represent new research. First, alternative approaches to assessing rating effectiveness of quality awards should be developed, applied, and compared to the approaches used here and in other recent studies (Coleman, Koelling, and Geller 2001; Keinath and Gorski 1999; Coleman and Koelling 1998; van der Wiele, Williams, Kolb, and Dale 1995). For example, in future studies of interrater reliability, researchers should use alternate forms of the ICC when justifiable to estimate the absolute level of interrater reliability for specific teams of examiners, estimate the interrater reliability of the teams’ average scores, and compare these results to the forms of the ICC used here. To enable these types of studies, researchers should strive to obtain larger samples than those obtained here. Where possible, complete data structures (that is, a full rater x ratee design, with all examiners rating all applicants in a given sector) should be employed to produce better estimates of true reliability.

Second, future research should investigate the amount of variation in examiner scores across different types of organizations. For example, if it were found that there is more variation in organizations in certain sectors (for example, in government organizations), such knowledge could be used to improve examiner training by increasing the emphasis on and number of examples from organizations of this type. Additionally, future research should investigate whether organizations that have been involved in the award process longer exhibit less variation in examiner scores than “newer” applicants, which would reflect learning how to prepare an application. Such knowledge could be used to improve information shared in educational workshops for potential applicants.

Third, future research should investigate interrater reliability with respect to the criteria—in other words, are there differences in the relative amount of variation in examiner scores across the criteria? The purpose here would be to assess whether some categories or items require additional explanation or clarification in examiner training or in educational workshops with potential applicants.

Fourth, longitudinal research should be conducted to investigate whether changes in rating effectiveness, including interrater reliability, can be observed. Although one could not conclude there was improvement in interrater reliability due to SQA adopting the Baldrige Award criteria in the 1999 cycle, further analysis using data from subsequent years is planned to investigate whether improvement is observed over time controlling for examiner experience. Longitudinal studies of ways to evaluate the effectiveness of individual examiners are also proposed, including the presence of rating errors such as halo or leniency/severity. The impact of controls over time, such as examiner training or changes in examiner selection, on individual examiner rating effectiveness should be investigated.

A fifth area relates to the overall scoring process. In the SQA and other award programs, examiners generate qualitative comments in addition to the quantitative percent scores. Future research should investigate the relationship between these comments (for example, the quantity and nature of comments) and quantitative scores. This triangulation of quantitative and qualitative data could provide insight into the level of and variation of quantitative scores. Examining the rating effectiveness of these comments, such as their interrater reliability, is also ripe for further exploration.

A last area is the replication of this study on interrater reliability using other quality awards with both similar and different scoring processes. This research will build the empirical basis for estimating the reliability of future award processes. Furthermore, the proposed research areas could also be studied in specific quality award programs, as well as compared across award programs.

Although the study here focused on examiner scoring in a state quality award, the approach used to estimate interrater reliability and the issue of examiner interchangeability addressed through the selection of ICC measures are applicable beyond state quality awards. The same issues apply to any type of third-party, or external, evaluation process, such as supplier or corporate awards based on similar quality award criteria. There are even implications here for those assessment processes that use binary (that is, “pass/fail”) scoring rather than quantitative scoring, such as quality system certification audits and institutional accreditation. The issue of auditor or rater interchangeability is still valid, and future research might examine how to estimate interrater reliability for these different, but related, processes. Estimating and understanding interrater reliability, particularly given the practical need to address examiner interchangeability, are very important to these types of processes as well.


The authors would like to thank the SQA board for their support of scholarly study of the award process. The SQA board is to be commended for allowing results of examiner rating effectiveness to be studied and published. The authors would also like to thank Theodore Sienknecht for his contributions to this research.


Bemowski, K., and B. Stratton. 1995. How do people use the Baldrige Award criteria? Quality Progress (May): 43-47.

Berk, R. A. 1979. Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency 83, no. 5: 460-472.

Bernardin, H. J. 1977. Behavioral expectation scales versus summated scales: A fairer comparison. Journal of Applied Psychology 62, no. 4: 422-427.

Bernardin, H. J., K. M. Alvares, and C. J. Cranny. 1976. A recomparison of behavioral expectation scales to summated scales. Journal of Applied Psychology 61: 564-570.

Borman, W. C. 1975. Effects of instructions to avoid halo effect on reliability and validity of performance evaluation ratings. Journal of Applied Psychology 60, no. 5: 556-560.

Coleman, G. D., C. P. Koelling, and E. S. Geller. 2001. Training and scoring accuracy of organisational self-assessments. International Journal of Quality and Reliability Management 18, no. 5: 512-527.

Coleman, G. D., and J. Davis. 1999. State quality and productivity awards: Tools for revitalizing the organization. In Proceedings of the 1999 World Productivity Congress, Edinburgh, Scotland: World Academy of Productivity Sciences.

Coleman, G. D., and C. P. Koelling. 1998. Estimating the consistency of third-party evaluator scoring of organizational self-assessments. Quality Management Journal 5, no. 3: 31-53.

Friedman, B. A., and E. T. Cornelius III. 1976. Effect of rater participation in scale construction on the psychometric characteristics of two rating scale formats. Journal of Applied Psychology 61: 210-216.

Futrell, D. 1995. When quality is a matter of taste, use reliability indexes. Quality Progress (May): 81-86.

Guilford, J. P. 1954. Psychometric methods. New York: McGraw-Hill.

Heneman, H. G., D. P. Schwab, D. L. Huett, and J. J. Ford. 1975. Interviewer validity as a function of interview structure, biographical data, and interview order. Journal of Applied Psychology 60: 748-753.

Ivancevich, J. M. 1984. Longitudinal study of the effects of rater training on psychometric error in ratings. Journal of Applied Psychology 69, no. 1: 85-98.

Keinath, B. J., and B. A. Gorski. 1999. An empirical study of the Minnesota Quality Award evaluation process. Quality Management Journal 6, no. 1: 29-38.

National Institute of Standards and Technology (NIST). 2001a. 29 August. Baldrige Stock Study 2001. See URL .

National Institute of Standards and Technology (NIST). 2001b. 2000 State, Regional, and Local Quality Award Program Statistics, compiled by the Baldrige National Quality Program. Gaithersburg, Md.: NIST.

National Institute of Standards and Technology (NIST). 1999. Baldrige National Quality Program 1999 Criteria for Performance Excellence. Gaithersburg, Md.: NIST.

Rothstein, H. R. 1990. Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology 75, no. 3: 322-327.

Saal, F. E., R. G. Downey, and M. A. Lahey. 1980. Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin 88, no. 2: 413-428.

Shen, J. 2001. Empirical analyses of rating effectiveness for a state quality and productivity award, Master’s thesis, University of Tennessee.

Shrout, P. E., and J. L. Fleiss. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86, no. 2: 420-428.

Sienknecht, R. T., Jr. 1999. An empirical analysis of rating effectiveness for a state quality award, Master’s thesis, Virginia Polytechnic Institute and State University.

Towstopiat, O. 1984. A review of reliability procedures for measuring observer agreement. Contemporary Educational Psychology 9: 333-352.

Van der Wiele, T., A. Brown, R. Millen, and D. Whelan. 2000. Improvement in organizational performance and self-assessment practices by selected American firms. Quality Management Journal 7, no. 4: 8-22.

Van der Wiele, T., R. Williams, F. Kolb, and B. Dale. 1995. Assessor training for the European Quality Award: An examination. Quality World Technical Supplement (March): 53-62.

Vokurka, R. J. 2001. The Baldrige at 14. The Journal of Quality and Participation (summer): 13-19.


Garry D. Coleman is an assistant professor of industrial engineering at the University of Tennessee. His research interests include examining the effectiveness of organizational assessments and third-party review, developing approaches for evaluating and improving organizational performance measurement systems, and linking each of these areas to a continual process of planning, implementing, measuring, and improvement. He has studied how organizations use planning and measurement to improve performance since he joined The Virginia Productivity Center (later known as The Performance Center) in 1986. He received his doctorate from Virginia Tech in industrial and systems engineering. He also received his master’s degree in industrial engineering and operations research and a bachelor’s degree in mining engineering from Virginia Tech. He is a senior member of the Institute of Industrial Engineers, a member of ASQ, NSPE, ASEM, and SEMS, and a 2001 examiner for the Tennessee Quality Award. Coleman is a licensed professional engineer and a Fellow of the World Academy of Productivity Sciences. He can be reached at University of Tennessee, 411 B. H. Goethert Pkwy., Tullahoma, TN 37388-9700, or by e-mail at .

Eileen M. Van Aken is an assistant professor in the Grado Department of Industrial and Systems Engineering at Virginia Tech and director of the Enterprise Engineering Research Lab. Her research interests include performance measurement, organizational and work system assessment, organizational transformation, lean production, and team-based work system design. She was employed for seven years at the Center for Organizational Performance Improvement at Virginia Tech and for two years as a process/product engineer at AT&T Microelectronics. She received her bachelor’s, master’s, and doctorate degrees from Virginia Tech in industrial and systems engineering. She is a member of IIE, ASEM, and ASQ, and is an examiner and current vice chair of the board for the U. S. Senate Productivity and Quality Award for Virginia. Van Aken is a Fellow of the World Academy of Productivity Sciences.

Jianming Shen is a reliability engineer for Lennox Industries in Carrollton, Texas. He received his master’s degree in industrial engineering from the University of Tennessee. He received a master’s degree in mechanical engineering from the Shanghai Academy of Space Flight Technology and his bachelor’s degree in mechanical engineering from Shanghai Jiaotong University in China. Prior to attending the University of Tennessee, he worked in the space industry for five years as a product system engineer. He is a member of ASQ and is an ASQ Certified Reliability Engineer and Certified Quality Engineer.

