Annals of Emergency Medicine
Volume 54, Issue 6 , Pages 843-853, December 2009

A Consideration of the Measurement and Reporting of Interrater Reliability:

Answers to the July 2009 Journal Club Questions

University of California, Los Angeles, Los Angeles, CA


Discussion Points 


1. Cruz et al1 contains 2 parts, a comparison of the values gathered by trained research assistants and physicians about historical information in chest pain patients and the comparison of these participants' recordings with a “correct” value for each item.

A. For each part, indicate whether the authors are studying reliability or validity and explain the difference between these concepts.

B. What did the authors use as their criterion standard for the validity analysis?

C. What are potential problems with their method of defining the criterion (gold) standard? Can you think of alternative approaches?

D. The authors report crude agreement and interquartile range for their validity analysis. What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and why they might be preferable to reporting crude agreement.

2. Crude percentage agreement is a simple way to report reliability. Consider the contingency table for the question, “Was the quality of the chest pain crushing? (yes or no)”:

                     MD recorded “yes”   MD recorded “no”   Total
RA recorded “yes”          117                  6            123
RA recorded “no”            18                  2             20
Total                      135                  8            143

MD, Medical doctor; RA, research assistant.



A. Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?

B. Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as –1, 0, and 1.

C. What other measures can be used to measure reliability for binary, categorical, and continuous data?

3. Cruz et al quote the oft-cited Landis and Koch2 article stating that a κ of “less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement.” Consider studies of the agreement of airline pilots deciding whether it is safe to land and psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

4. A. Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true/false statements such as “red is a color,” “2+2=5,” etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for “false” and in the right for “true.” Questions are not repeated and the respondents are expected to offer a response for each statement. Verify that if they agree on all 100 answers, percentage agreement is 100 and κ is 1.0, regardless of how many statements are true and how many are false. Now imagine that the testing site is under the final approach for a major airport, and, at times, noise from jets flying overhead drowns out the statements from the loudspeaker. When this occurs, respondents agree, on average, only half the time (as one would expect).

Recalculate percentage agreement and κ for the same 100-statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; and (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings.

B. Consider the 2 tables below and calculate percentage agreement and κ for each. Why is κ lower on the right? What does this mean?


Imagine that the right-hand table was from the true/false experiment described above and that planes were flying so frequently that every question was somewhat difficult to hear. Imagine 2 scenarios: in the first, both raters are told that there are 80 true statements and 20 false statements. In the second, raters are told that there could be 100 true statements with no false statements, 100 false statements with no true statements, or any combination in between, with each having an equal probability of occurring. Does κ mean the same thing in these 2 situations?

C. To further consider the meaning of κ, imagine that planes flew overhead such that 60 statements were heard perfectly and 40 were barely comprehensible or not heard at all. Below are separate tables for the 60 audible and 40 incomprehensible statements.


Calculate percentage agreement and κ for these tables. Which is the better measure for each? Consider the confidence level of the raters in the different scenarios presented in this exercise. Should rater confidence be considered when interrater reliability is described? How might this be done?

5. Finally, the following graph shows percentage agreement versus κ for the first 50 items in Table 1 of Cruz et al. The points are shaded to indicate how many subjects fall into the smallest cell in the 2×2 table.
Tables 1 and 2.
A. Four lines in the table are denoted with square markers (near the arrow) on the graph (Is pain burning? Does it radiate to the back? Does it radiate to the jaw? Does it radiate to the left arm?). Create (approximate) 2×2 tables for these 4 points. Can you explain why these tables have similar percentage agreement but varying κs? Which do you believe is the better measure? Why do the κs differ?


B. Can you comment on the relationship between the size of the smallest cell in the 2×2 table and the extent to which κ may deviate from percentage agreement?

C. Given the problems with both percentage agreement and κ illustrated in these examples, do you think it would be better if investigators presented the 4 numbers in the inner cells of each 2×2 table instead of reporting the percentage agreement or κ?


Answer 1 

Q1. Cruz et al contains 2 parts, a comparison of the values gathered by trained research assistants and physicians regarding historical information in chest pain patients, and the comparison of these participants' recordings with a “correct” value for each item.

Q1.a For each part, indicate whether the authors are studying reliability or validity and explain the difference between these concepts.

The first part is an assessment of reliability, and the second is an assessment of validity. The distinction between reliability and validity is an important one. At the racetrack handicappers may unanimously agree (100% interrater reliability) that Galloping George will win the third race. When he comes in dead last, however, track aficionados receive a painful reminder that even a perfectly reliable analysis does not guarantee a valid result. The reliability of a test speaks only to the agreement obtained when multiple fallible observers independently conduct the test on the same persons, specimens, or images. In contrast, an assessment of validity compares a fallible observer against a criterion, or “gold,” standard. Because the criterion standard is assumed to be correct, validity studies typically report the comparative performance of the fallible observer using statistics such as sensitivity and specificity, or likelihood ratios, not reliability metrics such as percentage agreement or κ.

Q1.b What did the authors use as their criterion standard for the validity analysis?

When the physician and the research assistant agree, it is assumed that their answer is correct. When they disagree, a different research assistant has the patient select which of the 2 discrepant answers is “correct.”

Q1.c What are potential problems with their method of defining the gold (criterion) standard? Can you think of any alternative approaches?

Defining a gold (criterion) standard for this study is not trivial. For any item, there can be 2 truths: what the patient answers and what is actually true. For example, a patient asked “Do you have pain in the epigastric region” might say yes, thinking that “epigastric” is a fancy word for butt cheeks when in fact the true answer is “no.” Or a patient might say that his cholesterol is normal despite its being 300 because he does not understand the laboratory results his physician shared with him.

What then is the criterion standard for this study: what the patient said, what the patient should have said, or what the patient would say if the information were optimally elicited? The answer, of course, is that we have no way to know what the “true” answer is. Most emergency medicine residents have had the experience of reporting some part of a patient's history to their attending physician and soon thereafter hearing the patient give the attending physician a completely different history! This “answer drift” could be because one or the other of the physicians asked the question in a manner that was clearer to the patient, or because extra time or reflection resulted in the patient expressing a different answer. We do not know if the answers given on the second or third interview are more truthful, or whether they are provided to appease the interviewers and end the questioning. Some patients may be too confused, in too much pain, or too distracted to give an accurate reply.

A better approach, which the authors acknowledge, would have been to randomize the order in which the research assistant and physician interviewed each patient. That should equally distribute and minimize the effect of any bias related to a change in answer accuracy with repeated questioning.

Q1.d The authors report crude agreement and interquartile range for their validity analysis. What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and why they might be preferable to reporting crude agreement.

The interquartile range (IQR) refers to the middle 50% of a distribution:

The original definition of IQR is a single number that represents the distance from the 25th percentile to the 75th percentile, though this format is seldom used. Instead, investigators typically present the 25th and 75th percentile (in the format [25th percentile, 75th percentile]) from which the “real” IQR can easily be gleaned by subtraction.

In statistics, the term “quartiles” refers to the 3 points that divide a distribution into 4 equal parts. In epidemiology, the term is typically used to signify these 4 equal parts. The second quartile is called the “median,” the first the “25th percentile,” and the third the “75th percentile.” The IQR is the difference between the third and first quartiles. This is a more robust (less influenced by outlier observations) descriptive statistic than the range of a distribution, and it is more relevant when data are not in the shape of a classic bell curve (ie, not “normally distributed”).

When data are skewed and thus not normally distributed (Figure 2), the mean, median, and IQR convey the center of the observed values. In this case, the mean±2 SDs (central tendency statistics for a normal bell curve distribution) goes from roughly –0.8 to 2.8. Note that the left-sided value of –0.8 is well outside the range of the data. Consequently, if the only thing readers were told about this distribution was that “the mean is 1 and the mean±2 SD is –0.8 to 2.8,” they would likely imagine a curve very different from Figure 2 and would likely assume that values between 0 and –0.8 existed.

Figure 2. (Adapted from: http://en.wikipedia.org/wiki/Interquartile_range.) [From Wikipedia: “In the Creative Commons Attribution and Share Alike license (CC-BY-SA), re-users are free to make derivative works and copy, distribute, display, and perform the work, even commercially. When re-using the work or distributing it, you must attribute the work to the author(s) and you must mention the license terms or a link to them. You must make your version available under CC-BY-SA.”]
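The contrast between the mean±2 SDs and the quartile-based summary can be checked numerically. The following Python sketch uses a small, invented right-skewed sample (it is not the data behind Figure 2) and the standard library's `statistics` module; the choice of the "inclusive" quantile method is an assumption made for simplicity.

```python
import statistics

# Hypothetical right-skewed sample; values are invented for illustration only.
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.8, 4.0]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

# The 3 quartile cut points that divide the sorted data into 4 equal parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # single-number IQR: distance from the 25th to the 75th percentile

print(f"mean = {mean:.2f}, mean +/- 2 SD = {mean - 2 * sd:.2f} to {mean + 2 * sd:.2f}")
print(f"median = {median:.2f}, [25th, 75th percentile] = [{q1:.2f}, {q3:.2f}], IQR = {iqr:.2f}")
```

In this sample the lower bound of the mean±2 SDs is negative even though no observation is below 0.1, whereas the 25th and 75th percentiles are values that actually occur in the data.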

For those questions in which the research assistant and physician did not agree, the authors report (by category) the percentage agreement with the “correct” answers (as determined by the tiebreaker criterion standard). Percentage agreement is a reasonable statistic for a reliability assessment, but is not the appropriate statistic to best describe this validity assessment (comparison of a fallible observer with a criterion standard). Studies that are designed to estimate a test's validity should report statistics such as sensitivity and specificity, or likelihood ratios, not reliability metrics such as percentage agreement or κ.
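As a minimal sketch of the validity statistics named above, the following Python function derives sensitivity, specificity, and likelihood ratios from a 2×2 table that compares a fallible observer with a criterion standard. The function name and the counts are hypothetical; they are not taken from Cruz et al.

```python
def validity_stats(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios for an observer
    compared with a criterion standard.

    tp/fn: criterion-positive items the observer called positive/negative.
    tn/fp: criterion-negative items the observer called negative/positive.
    Undefined (division by zero) if specificity is exactly 0 or 1.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return sensitivity, specificity, lr_positive, lr_negative

# Hypothetical counts, invented for illustration only.
sens, spec, lr_pos, lr_neg = validity_stats(tp=45, fp=10, fn=5, tn=40)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
```

Unlike crude agreement, these statistics report errors on criterion-positive and criterion-negative items separately.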


Answer 2 

Q2. Crude percentage agreement is a simple way to report reliability. Consider the contingency table for the question, “Was the quality of the chest pain crushing? (yes or no)”:

                     MD recorded “yes”   MD recorded “no”   Total
RA recorded “yes”        117 [a]              6 [b]          123
RA recorded “no”          18 [c]              2 [d]           20
Total                    135                  8              143

Q2.a Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?

The 2 observers both agreed “yes” 117 times, and “no” 2 times. Thus, crude percentage agreement for this table is (117+2)/143=83.2%. Percentage agreement can range between 0% and 100%.

Q2.b Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as –1, 0, and 1.

The κ statistic, introduced by Cohen3 in 1960, is defined as:

$$\kappa = \frac{\%\ \text{agreement observed} - \%\ \text{agreement expected due to chance}}{1 - \%\ \text{agreement expected due to chance}}$$

κ is easily calculated with statistical software, but we will discuss the manual method as well. For 2×2 contingency tables, it is customary to refer to the inner cells by letters, with [a] and [b] on the top row and [c] and [d] just below. The outer 5 cells represent various row and column totals of the 4 inner cells a to d. Because these five cells occupy the margin of the table, they are often referred to as “marginal totals.” Note that if one knows the values of inner cells a to d, then one can calculate all 5 marginal totals. The reverse is not true; in most circumstances one cannot determine the inner cells from the marginal totals. The agreement cells for this table (where the 2 raters both recorded the same thing, either yes or no) are [a] and [d]. κ uses the marginal totals to calculate the percentage of expected agreement due to chance for each agreement cell, and these are summed to determine the expected percent agreement. Using the values from Table 1, κ is calculated as follows:

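(i) The expected value of cell “a” due to chance alone is:

$$\frac{123 \times 135}{143} = 116.1 \text{ of 143 observations, or } \frac{116.1}{143} = 0.812$$

(ii) The expected value of cell “d” due to chance alone is:

$$\frac{20 \times 8}{143} = 1.1 \text{ of 143 observations, or } \frac{1.1}{143} = 0.008$$

(iii) The percentage agreement due to chance alone is:

$$0.812 + 0.008 = 0.820$$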

At the beginning of this section, we calculated that observed agreement was 83.2%, or 0.832. Now, we have determined that the expected agreement due to chance alone is 0.820. Plugging these 2 numbers into the κ formula yields

$$\kappa = \frac{0.832 - 0.820}{1 - 0.820} = 0.07$$
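For readers who want to check this arithmetic by computer, the manual method can be sketched in a few lines of Python. This is an illustration under the cell-labeling convention described above (a and d are the agreement cells), not a reference implementation; the function name `cohen_kappa_2x2` is hypothetical.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Return (observed agreement, chance-expected agreement, kappa) for a 2x2 table.

    a and d are the agreement cells; b and c are the disagreement cells.
    Kappa is undefined (division by zero) when expected agreement equals 1.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance-expected proportions for the 2 agreement cells, from the marginal totals.
    expected_a = ((a + b) / n) * ((a + c) / n)
    expected_d = ((c + d) / n) * ((b + d) / n)
    expected = expected_a + expected_d
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa

observed, expected, kappa = cohen_kappa_2x2(a=117, b=6, c=18, d=2)
print(f"observed = {observed:.3f}, expected = {expected:.3f}, kappa = {kappa:.2f}")
# prints: observed = 0.832, expected = 0.820, kappa = 0.07
```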

κ was introduced as a “coefficient of agreement for nominal scales”3 intended to measure agreement beyond chance. κ can range from –1 (with negative numbers indicating that observed agreement occurs less often than expected by chance) to 1 (perfect agreement when observed percentage agreement=1, regardless of percentage agreement expected due to chance). A κ of zero signifies that observed agreement is exactly that expected by chance alone (percentage observed agreement=percentage expected agreement). An inherent assumption of the κ statistic is that the marginal totals of the observed agreement table adequately define “chance” agreement. This assumption, like many of the assumptions in classic statistics, implies that all observations are independent, identically distributed, and drawn from the same probability density function. Under these very limited and strict conditions, “chance” agreement will be a function of the observed marginals. We explain these assumptions in layman's terms in subsequent questions.

Q2.c What other measures can be used to measure reliability for binary, categorical, and continuous data?

Reliability can be measured with a multitude of methods. An excellent review4 emphasizes that there is little consensus about which is “best” and that no one method is appropriate for all occasions.

It is important to consider what kinds of data are being compared. Categorical (also called discrete) variables take on a small, finite number of values. These qualitative variables include nominal (no meaningful order, such as disposition=admitted, transferred, or home) and ordinal (ordered in a meaningful sequence, such as Glasgow Coma Scale score=3 to 15). A binary variable is a categorical variable with only 2 options (female or male; yes or no). Continuous variables (such as pulse rate) can theoretically take on an infinite number of values (a patient's pulse could be precisely 88.228 beats/min), but both clinical relevance and measurement accuracy effectively categorize most continuous variables (pulse rate is estimated to the nearest integer). Many reliability measurements are intended for use only with continuous variables, and one must decide whether a variable is “continuous enough” to permit their use.

Several measures of correlation are available for use as reliability metrics, but there are important limitations with using correlation to measure agreement. First, correlation is frequently used colloquially to indicate any association between 2 variables, but in statistics, correlation implies only a linear association. Consider the following scatterplots (that graph observed values for 2 variables) and their associated correlation coefficients (in these examples, the Pearson product-moment, which ranges between –1 and 1):

Because correlation coefficients measure only how well observed data fit with a straight line, a correlation coefficient of zero may indicate that the 2 variables are not associated with each other (or are “independent,” as in the middle of the top row) or may be missing a more complex but potentially meaningful nonlinear association (as in the bottom row).

Another limitation with using correlation is that 2 judges' scores could be highly correlated but show little agreement, as in the following example5:

Subject     1    2    3    4    5
Rater A    10    8    6    4    2
Rater B     6    5    3    2    1
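The gap between correlation and agreement in this example can be verified directly. The Python sketch below computes the Pearson product-moment coefficient by hand for the 2 raters' scores and counts how often the raters gave the same score; the helper name `pearson_r` is hypothetical.

```python
from math import sqrt

rater_a = [10, 8, 6, 4, 2]
rater_b = [6, 5, 3, 2, 1]

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(rater_a, rater_b)
exact_matches = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"Pearson r = {r:.2f}; identical scores on {exact_matches} of {len(rater_a)} subjects")
# prints: Pearson r = 0.99; identical scores on 0 of 5 subjects
```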

The Kendall6 and Spearman7 coefficients measure the degree of correlation between 2 rankings. These coefficients require ordinal and not simply nominal data. Kendall S is a simple way to measure the strength of a relationship in a 2×2 table: S=C (the number of agreement pairs)–D (the number of disagreement pairs). A preponderance of agreement pairs (resulting in a large positive value of S) indicates a strong positive correlation between 2 variables; a preponderance of disagreement pairs (resulting in a large negative value of S) indicates a strong negative (inverse) correlation, and a value of S near zero indicates weak correlation. A disadvantage of S is that its range depends on the sample size, but a simple standardization (computed as τ=2S/[n(n–1)]) gets around this problem, and Kendall's τ always ranges between –1 and 1. Spearman's ρ involves a more complicated, less-intuitive calculation8 and is equivalent to Kendall's τ in terms of ability to measure correlation.
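The calculation of S and its standardization can be sketched as follows. This Python example counts pairs of subjects that 2 raters rank in the same or in opposite order and then applies τ=2S/[n(n–1)]. It assumes the simple form in which tied pairs count toward neither total (often called tau-a), and the function name is hypothetical.

```python
from itertools import combinations

def kendall_s_and_tau(x, y):
    """Kendall's S (agreement pairs minus disagreement pairs) and tau = 2S / (n(n-1)).

    Tied pairs count toward neither total (the simple tau-a form).
    """
    s = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            s += 1  # both raters order this pair of subjects the same way
        elif product < 0:
            s -= 1  # the raters order this pair in opposite ways
    n = len(x)
    return s, 2 * s / (n * (n - 1))

# Scores for 5 subjects from 2 raters who rank the subjects identically.
s, tau = kendall_s_and_tau([10, 8, 6, 4, 2], [6, 5, 3, 2, 1])
print(s, tau)  # prints: 10 1.0
```

Reversing one rater's scores makes every pair a disagreement pair, giving S=–10 and τ=–1.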

The intraclass correlation (ICC) can also be used to measure reliability.9 The ICC compares the variance among multiple raters (within a subject) to the overall variance (across all ratings and all subjects). Imagine that 4 physicians (raters) use a decision aid to independently estimate the likelihood of acute coronary syndrome in each of 20 patients (subjects). The 2 graphs (Figure 4) show 4 estimates (1 dot for each physician) for each patient. In the upper graph, the 4 raters give similar ratings for each patient. The variation in ratings for any given patient is small compared with the total variance of all the ratings. Said another way, there is more variance in the ratings among patients than there is in the ratings within patients. A high ICC suggests that the raters have good correlation (when one rater scores a subject high, so do the others). Ratings for each patient tend to be clustered. In the bottom graph, ratings within each subject are all over the place. Here the ICC would be lower as the raters are not highly correlated. A number of ICC estimators have been proposed within the framework of ANOVA. Unfortunately, the various ICC statistics can produce markedly different results when applied to the same data. We believe that the pictures tell the most complete story about agreement and are free from the assumptions made by the various statistics.
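To make the within-subject versus between-subject comparison concrete, the following Python sketch computes one of the ANOVA-based estimators, the one-way random-effects ICC, for 2 small sets of ratings. The choice of this particular estimator is an assumption (other ICC forms can give different values, as noted above), and the ratings are invented for illustration; they are not the data shown in Figure 4.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for a list of per-subject rating lists.

    Assumes every subject is rated by the same number of raters (k).
    """
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of raters per subject
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical likelihood estimates (%) from 4 raters for each of 4 patients.
clustered = [[10, 12, 11, 9], [50, 52, 48, 51], [80, 78, 82, 81], [30, 29, 31, 33]]
scattered = [[10, 60, 35, 85], [50, 15, 90, 40], [80, 20, 55, 30], [30, 75, 5, 65]]

print(f"clustered ratings: ICC = {icc_oneway(clustered):.2f}")
print(f"scattered ratings: ICC = {icc_oneway(scattered):.2f}")
```

When ratings within each patient cluster tightly, the ICC approaches 1; when they are scattered, it falls toward zero or below.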

The Bland-Altman approach is a graphic presentation of agreement data that plots the difference in measurements for each subject pair against their mean.10 Consider a study that measures peak expiratory flow rate, using 2 different meters in 17 patients:

The top scatterplot suggests that the results from each of these 2 meters are similar. However, the bottom graph (a Bland-Altman plot) examines this association in more detail by plotting the differences between the paired measures for each patient (y axis) stratified by the mean of each pair. In this case, this plot confirms that the average difference between the 2 meters is very close to zero. However, the Bland-Altman plot also shows that the difference in readings between the 2 meters can vary up to 80 L/min in some subjects, and that the meters seem to perform differently for lower flow rates than they do with higher flow rates. The principal advantage of this method is that the observed disagreement data can be put into a clinical context. Would differences in measured peak expiratory flow rate up to 80 L/min (especially in sicker patients with lower flow rates) affect patient management? A potential problem with Bland-Altman plots is that patterns can be obscured if the scale of the y axis is not carefully selected. The scale of the y axis needs to be appropriate for the concentration range of the x data (show absolute differences if the range is small but percentage or log-scale differences if the range is larger).11
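The quantities plotted in a Bland-Altman graph are simple to compute. The Python sketch below derives the plotted points (mean of each pair on the x axis, difference on the y axis) for a small set of paired readings; the readings are hypothetical and are not the 17-patient data described above.

```python
# Hypothetical paired peak expiratory flow readings (L/min), invented for illustration.
meter_1 = [490, 400, 510, 260, 650, 250, 420]
meter_2 = [520, 410, 480, 330, 640, 180, 420]

# Each subject contributes one point: x = mean of the pair, y = difference.
points = [((m1 + m2) / 2, m1 - m2) for m1, m2 in zip(meter_1, meter_2)]

mean_difference = sum(diff for _, diff in points) / len(points)
largest_gap = max(abs(diff) for _, diff in points)

for pair_mean, diff in points:
    print(f"mean = {pair_mean:6.1f}  difference = {diff:+d}")
print(f"average difference = {mean_difference:.1f} L/min, largest = {largest_gap} L/min")
```

In this invented sample the average difference is zero, yet individual readings differ by as much as 70 L/min, which is the kind of pattern the plot is meant to expose.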


Answer 3 

Q3. Cruz et al quote the oft-cited Landis and Koch article2 stating that a κ of “less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement.” Consider studies of the agreement of airline pilots deciding whether it is safe to land and psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

Many investigators contrast their κ values to arbitrary guidelines originally proposed by Landis and Koch2 and further popularized by Fleiss.12 As we hope our example demonstrates, the mechanical mapping of numeric values of κ to the adjectives poor, fair, moderate, good, and excellent is fraught with problems. A κ of .75 might be good enough if the cost of being wrong is low (such as categorizing subjects into personality types), but nothing less than near-perfect agreement is requisite if the decision has important consequences. We would not be pleased if our airplane's copilots attained a κ of 0.75 on “is it safe to land?” Some tests (eg, a set of historical questions that are used to identify patients at high risk for alcohol addiction) might be useful even if their results are only somewhat reliable. Other tests, however (eg, a set of history and physical examination data that are used to identify which patients with traumatic neck pain can safely forgo cervical spine radiography), will be useful only if they are highly reliable. This is because no poorly reliable test will ever be highly valid when used by multiple fallible observers. Conceptualizing any specific degree of agreement as poor, excellent, or anywhere in between regardless of the test's clinical context is, therefore, a dangerous oversimplification.


Answer 4 

Q4.a Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true-false statements such as “red is a color,” “2+2=5”, etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for “false” and in the right for “true.” Questions are not repeated and the respondents are expected to offer a response for each statement. Verify that if they agree on all 100 answers, percentage agreement is 100 and κ is 1.0, regardless of how many statements are true and how many are false. Now imagine that the testing site is under the final approach for a major airport, and, at times, noise from jets flying overhead drowns out the statements from the loudspeaker. When this occurs, respondents agree, on average, only half the time (as one would expect). Recalculate percentage agreement and κ for the same 100-statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings.

This example is designed to show that in certain situations, κ can underestimate actual agreement, particularly when there are skewed marginals and when percentage agreement is fairly high.

Recall that κ=(% observed agreement–% expected agreement)/(100–% expected agreement). Thus, when 2 raters agree on all observations, regardless of how these are distributed between true and false or what the expected agreements are, κ=1. Tables 1 and 2 show 2 of the possible results under condition 1. Note that we do not know whether the one incomprehensible statement was true or false or how each rater will classify it. We do know that because only 1 of the 100 observations was made with low confidence, all possible results will yield very similar percentage agreement and κ.

Tables 3 and 4 show 2 possible results under condition 2. The preponderance of true questions has skewed the marginal totals. Consequently, how each rater classifies the one unheard question affects κ a bit more than when the marginals are roughly equal.

Tables 3 and 4.

Recognize that in these first 2 sets of conditions, the raters are asked to rate 99 easy statements and 1 hard (plane flying overhead) statement. The expected agreement should be the same in all 4 tables, yet the value of percentage agreement expected due to chance, as calculated for κ, changes. κ is lower in Table 4 than in Table 2, even though raters are performing equally well in both. Observed percentage agreement also varies, but only slightly.

Table 5 shows the most likely result under condition 3 (50% of statements are true and 20% are incomprehensible). This can be derived by considering the 80 high-confidence and 20 low-confidence classifications separately (Tables 6 and 7). The audible questions will result in Table 6. If each rater has no knowledge about how often inaudible statements are true, then the modal result for the 20 inaudible questions is that depicted in Table 7. Summing Tables 6 and 7 results in Table 5.

Tables 5, 6 and 7.

Of course, by chance alone, the 20 unheard questions might result in a more skewed table, like either Table 8 (all observations falling into disagreement cells b or c) or Table 9 (all observations falling into agreement cells a or d). Summing these results to Table 6 yields Tables 10 and 11 respectively. Thus, depending on how the incomprehensible 20 questions end up being classified, percentage agreement and κ could range between 80% and 0.615 to 100% and 1.

Tables 8, 9, 10, and 11.

Table 12 shows a possible result under condition 4 (90% of statements are true and 20% are incomprehensible), again derived by considering the 80 high-confidence and 20 low-confidence classifications separately. The audible questions will result in Table 13. If both raters believe that 90% of the unheard questions are true, those 20 questions might result in Table 14. κ is negative for this table because the observed agreement (80%) is less than that expected due to chance ((0.9×0.9)+(0.1×0.1)=0.82=82%). Summing Tables 13 and 14 results in Table 12.

Tables 12, 13, and 14.
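Using the observed (80%) and chance-expected (82%) agreement stated above for the 20 unheard questions, the negative κ follows directly from the κ formula:

$$\kappa = \frac{0.80 - 0.82}{1 - 0.82} = \frac{-0.02}{0.18} \approx -0.11$$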

The 20 unheard questions might again result in a more skewed table. It seems unlikely that either rater would believe that 90% of all the questions were true and still classify all of the unheard questions as false (Table 15), but we have no assurance that this could not happen. Both raters could simply classify all the unheard questions as true (Table 16). Thus, depending on how the incomprehensible 20 questions end up being classified, percentage agreement and κ could range between 80% and 0.365 to 100% and 1 (Tables 15, 16, 17, and 18).

Tables 15, 16, 17, and 18.

We summarize the range of percentage agreement and κ results for the 4 scenarios:

Scenario type (range % agreement, range κ)

Percentage of T/F statements   1% inaudible        20% inaudible
50/50                          99–100%, 0.98–1     80–100%, 0.62–1
90/10                          99–100%, 0.95–1     80–100%, 0.37–1

We remind readers that the skill of the raters is the same within each column; when raters can hear the statement, they always agree. Despite this, simply varying the percentage of true statements can result in a wide range of κ, even though agreement should be identical. Also, the range of κ widens as the proportion of low-confidence classification increases. The implicit assumption of the κ statistic is that expected agreement due to chance, as calculated from the marginal totals, is an unbiased estimate of actual chance agreement. Our examples illustrate that that assumption is often false and can lead to κ values that underestimate actual agreement, particularly when agreement is fairly high.13, 14
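The lower and upper ends of these ranges can be reproduced with a short Python sketch. It rests on 2 assumptions that are ours rather than stated in the tables: the split of audible statements into true and false is as listed in `scenarios`, and the worst case places every unheard statement in a single disagreement cell while the best case places every unheard statement in an agreement cell. All names are hypothetical.

```python
def agreement_and_kappa(a, b, c, d):
    """Percentage agreement and Cohen's kappa for a 2x2 table (a, d = agreement cells)."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return 100 * observed, (observed - expected) / (1 - expected)

# Assumed counts: (audible true, audible false, unheard) out of 100 statements.
scenarios = {
    "50/50, 1% inaudible": (49, 50, 1),
    "90/10, 1% inaudible": (89, 10, 1),
    "50/50, 20% inaudible": (40, 40, 20),
    "90/10, 20% inaudible": (72, 8, 20),
}

results = {}
for label, (heard_true, heard_false, unheard) in scenarios.items():
    # Worst case: every unheard statement falls in one disagreement cell (b).
    worst = agreement_and_kappa(heard_true, unheard, 0, heard_false)
    # Best case: every unheard statement falls in an agreement cell (a).
    best = agreement_and_kappa(heard_true + unheard, 0, 0, heard_false)
    results[label] = (worst, best)
    print(f"{label}: {worst[0]:.0f}-{best[0]:.0f}%, kappa {worst[1]:.2f}-{best[1]:.2f}")
```

Within each column the raters' skill on audible statements is identical, so any difference in the lower bound of κ between rows comes only from the skew of the marginal totals.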

Q4.b Consider the 2 tables below and calculate percentage agreement and κ for each. Why is κ lower on the right? What does this mean?

This question reinforces the concepts illustrated in question 4a. The percentage agreement is the same in both of these tables, but κ is lower in the right table because its marginal totals are skewed. κ multiplies the a+b and a+c marginals together (and then sums this to the multiplied c+d and b+d marginals) to determine agreement expected by chance. Calculated expected agreement is higher when marginals are skewed compared with when the marginals are equal.

κ is lower when the marginals are skewed because of how κ defines “chance” agreement. κ assumes that when there are skewed marginal totals, there is likely to be more chance agreement. In certain situations, this makes sense. If 2 blindfolded, independent raters say yes or no on each of 100 spins of a fair roulette wheel with 90% of slots marked “yes” and 10% marked “no,” they will likely have marginals of 90 and 10, and we would expect their agreement due to chance to be 82% (0.9×0.9+0.1×0.1). They certainly would have to do far better than 82% for us to start wondering about clairvoyance or cheating. But if those same raters are told that the roulette wheel could be marked any way from all slots being “yes” to all slots being “no,” then we would expect our raters to have marginals of 0.5 and 0.5 and to agree by chance 50% of the time (0.5×0.5+0.5×0.5). κ assumes that skewed marginals always indicate agreement due to chance, but there are many situations (as in the examples in question 4a) in which skewed marginals result from the raters making high-quality judgments.
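The roulette wheel arithmetic, and its effect on κ, can be sketched in Python. The function names and the observed agreement of 90% are hypothetical, chosen only to show how the same observed agreement yields different κ values under the 2 definitions of chance.

```python
def chance_agreement(p_yes_1, p_yes_2):
    """Probability that 2 independent raters agree by chance alone."""
    return p_yes_1 * p_yes_2 + (1 - p_yes_1) * (1 - p_yes_2)

def kappa(observed, expected):
    """Cohen's kappa from observed and chance-expected agreement (as proportions)."""
    return (observed - expected) / (1 - expected)

informed = chance_agreement(0.9, 0.9)    # raters know the wheel is 90% "yes"
uninformed = chance_agreement(0.5, 0.5)  # raters know nothing about the wheel

observed = 0.90  # hypothetical observed agreement
print(f"chance agreement: informed = {informed:.2f}, uninformed = {uninformed:.2f}")
print(f"kappa: informed = {kappa(observed, informed):.2f}, "
      f"uninformed = {kappa(observed, uninformed):.2f}")
```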

We find it particularly irksome when κ is used to measure the interrater reliability of inanimate mechanical devices. Imagine 2 bedside pregnancy tests that produce a positive or negative test based on some form of immunoassay. The tests do not “know” what the correct marginals should be, so that any deviation of the marginals cannot be assumed to produce chance agreement in the inner cells. Certainly if both tests were wholly invalid (there was no reagent on the cards, so they read “negative” every time), then there would be 100% agreement, but this would not be chance agreement. For tests that do not “know” what the right marginals are, percentage agreement is a simple and sufficient summary statistic, although, as noted below, reporting the actual 2×2 table is the best method of communicating reliability results.

Q4.c To further consider the meaning of κ, imagine (for the right-hand panel above) that planes flew overhead such that 60 statements were heard perfectly and 40 were barely comprehensible or not heard at all. Below are separate tables for the 60 audible and 40 incomprehensible statements.

Calculate percentage agreement and κ for these tables. Which is the better measure for each? Consider the confidence level of the raters in the different scenarios presented in this exercise. Should rater confidence be considered when interrater reliability is described? How might this be done?

κ fails to distinguish between 2 phenomena. In the first, raters make ratings with high confidence. This produces a high percentage agreement. Depending on whether the values of what they are rating are evenly distributed (a coin toss) or skewed (a die roll in which 1 is true and the other 5 numbers are false), the marginals will be even or skewed. Note that in this example, the confidence of the raters determines the values of the inner cells and the marginals reflect the values of these inner cells.

In the alternate phenomenon, raters have low confidence. Because they do not know what to make of individual observations (akin to having to say “true” or “false” when they cannot hear the statement because of the airplanes), they rely on their knowledge of the marginals to guide their choices. As a result, it is the value of the marginals that determines the value of the inner cells, and agreement is largely mediated by chance.

The problem is that we typically cannot tell which phenomenon is occurring. We get to see one table and we have no way to break it down into high-confidence and low-confidence subtables. The left table has perfect agreement, and the margins here are highly skewed. Percentage agreement and κ approach 100% and 1, respectively, and the choice of statistic makes little difference.

In analyzing the right-hand table, we encounter the same problems discussed in question 4b. These raters had no ability to discern the individual statements, and presumably guessed based on their knowledge of the marginals (ie, their experience with the audible statements). Observed agreement here is thus highly subject to chance occurrence based on the marginal values, and thus κ is an appropriate statistic for this table.


Answer 5 

Q5. Finally, the following graph shows percentage agreement versus κ for the first 50 items in Table 1 of Cruz et al. The points are shaded to indicate how many subjects fall into the smallest cell in the 2×2 table.

Q5.a Four lines in the table are denoted with square markers (near the arrow) on the graph (Is pain burning? Does it radiate to the back? Does it radiate to the jaw? Does it radiate to the left arm?). Create (approximate) 2×2 tables for these 4 points. Can you explain why these tables have similar percentage agreement but varying κs? Which do you believe is the better measure? Why do the κs differ?

These tables all show similar percentage agreement. Each successive table has an increasing κ because its marginals are less skewed than those of the previous table. Compare the sums of the a+b and a+c marginals with the sums of the c+d and b+d marginals for each table: 263 and 23, 259 and 27, 250 and 36, 228 and 58. In each successive table, these 2 numbers get closer together (less skewed). κ uses these marginal values to calculate percentage agreement due to “chance.”
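The effect of these marginal sums on κ can be approximated as follows. This is only a sketch under two assumptions that are not in the article: that both raters have identical marginals (so each rater's marginal is half of the quoted sum), and that observed agreement is a hypothetical 90% in every table. The total of 143 ratings follows from each pair of sums adding to 286.

```python
N = 143  # ratings per table; each pair of marginal sums below adds to 2 * N

# (sum of the two "yes" marginals, sum of the two "no" marginals) from the text.
marginal_sums = [(263, 23), (259, 27), (250, 36), (228, 58)]

ASSUMED_AGREEMENT = 0.90  # hypothetical observed agreement, not from Cruz et al


def kappa_from_marginal_sums(yes_sum, no_sum, p_o, n=N):
    """Approximate kappa, assuming both raters share the same marginals."""
    p_yes = (yes_sum / 2) / n
    p_no = (no_sum / 2) / n
    p_e = p_yes**2 + p_no**2  # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)


kappas = [
    kappa_from_marginal_sums(yes_sum, no_sum, ASSUMED_AGREEMENT)
    for yes_sum, no_sum in marginal_sums
]
for (yes_sum, no_sum), kappa in zip(marginal_sums, kappas):
    print(f"marginal sums {yes_sum} and {no_sum}: kappa is about {kappa:.2f}")
```

Under these assumptions κ rises from table to table even though observed agreement is held constant, which is the pattern the square markers show on the graph.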

The “best” metric of reliability is that which summarizes the raw reliability data in the most useful way. Percentage agreement is easy to understand, but it has an important limitation: it does not account for agreement that occurred simply by chance. How, then, should we think of and define chance agreement? Like all classic statistical techniques, κ makes a rigid and narrow assumption (that all observations are independent, identically distributed, and drawn from the same probability density function) so that it can use the observed marginal values to calculate chance agreement. It follows, then, that if each physician and research assistant asked these 4 questions, but overflying airplanes prevented any of them from hearing the responses clearly, and they had no previous information about the quality or radiation of chest pain that might guide their guesses toward some target marginals, then the assumptions of κ are reasonably valid and we could infer that the question, “Does it radiate to the left arm?” performed better than “Is pain burning?” because the first showed less chance agreement (despite equal raw agreement).

It could be more meaningful and useful, however, to think of and define chance agreement as that which occurs under conditions of low confidence. Imagine that resident physicians assess 100 patients with pericardial effusion for ultrasonographic evidence of right ventricular diastolic collapse. Each rater makes a dichotomous (yes/no) assessment of right ventricle collapse for each patient and also records his or her subjective confidence for that assessment. Confidence can be assessed categorically (high versus low, high/medium/low, quartiles) or continuously (mark a point on a confidence line, with anchors “not at all confident” and “extremely confident”). If confidence were converted to a number ranging between 0 and 100, corresponding to where the users put a mark on a continuous confidence line, study data might result in:

Patient #   Rater A   RAC (0-100)   Rater B   RBC (0-100)
1           Yes       42            Yes       96
2           No        12            Yes       28
3           Yes       65            No        70
4           No        90            No        77
...
100         No        62            No        48

RAC, Rater A confidence; RBC, Rater B confidence.

The numbers in these tables were deliberately chosen to illustrate a point: although the overall table suggests high agreement (and κ) for these ultrasonographers, the stratified data tell a different story. Most of the agreement here occurred with observations for which one or both raters reported low confidence (analogous to answering a simple true/false question that was obscured by an overflying airplane). It seems intuitive to assume that some chance agreement occurs under conditions of uncertainty. In contrast, there was only 43% agreement for the high-confidence assessments (analogous to answering “Does 2+2=4?” with no airplanes nearby). If assessments are truly being made with high confidence, should any part of that be attributed to chance?

An assessment of rater confidence has several obvious and important limitations. Different raters likely assess their own confidence in markedly different ways and may tend to cluster their confidence assessments in a particular range. There is also no single criterion standard to assess the validity of confidence ratings. It may be that many of the ultrasonographers in this example were discerning RV function very accurately but just did not feel confident in their assessments (perhaps because of lack of experience). Despite these limitations, assessing the subjective confidence of ratings may be a more useful approach to defining and accounting for the effect of chance agreement than κ.
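As a minimal sketch of how confidence-stratified agreement might be computed, the code below uses only the 5 rows displayed in the example table. The cutoff of 50 is a hypothetical choice, as is the rule that a pair counts as high confidence only when both raters meet the cutoff; 5 rows are far too few to be meaningful and serve only to show the mechanics.

```python
# Rows shown in the example table:
# (patient, rater A call, rater A confidence, rater B call, rater B confidence)
ratings = [
    (1, "Yes", 42, "Yes", 96),
    (2, "No", 12, "Yes", 28),
    (3, "Yes", 65, "No", 70),
    (4, "No", 90, "No", 77),
    (100, "No", 62, "No", 48),
]

HIGH_CONFIDENCE_CUTOFF = 50  # hypothetical threshold, not from the article


def agreement_by_confidence(rows, cutoff=HIGH_CONFIDENCE_CUTOFF):
    """Return crude agreement within high- and low-confidence strata.

    A pair of ratings is 'high' only when both raters are at or above the
    cutoff; if either rater falls below it, the pair is 'low'.
    """
    strata = {"high": [], "low": []}
    for _patient, call_a, conf_a, call_b, conf_b in rows:
        key = "high" if min(conf_a, conf_b) >= cutoff else "low"
        strata[key].append(call_a == call_b)
    return {
        key: (sum(matches) / len(matches) if matches else None)
        for key, matches in strata.items()
    }


result = agreement_by_confidence(ratings)
print(result)
```

The same stratification could be applied to the full data set, with a 2×2 table and percentage agreement reported separately for each confidence stratum.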

Q5.b Can you comment on the relationship between the size of the smallest cell in the 2×2 table and the extent to which κ may deviate from percentage agreement?

The graphic shows that κ is most likely to deviate from a linear relationship to percentage agreement when (1) percentage agreement is high and (2) there is at least 1 cell with a small N. The presence of a cell with sparse data suggests that the marginals are skewed, and, as discussed in the answer to 5a, as marginals become skewed, expected agreement increases and κ (for a given percentage agreement) decreases. This graphic shows that the variation in κ among the questions in Table 1 of the Cruz et al article may have less to do with the difficulty of the question than with the rarity of “yes” (or “no”) answers. Those questions in which the majority of respondents provide the same answer are likely to have lower κs, regardless of the true reliability of the measure.

Q5.c Given the problems with both percentage agreement and κ illustrated in these examples, do you think it would be better if investigators presented the 4 numbers in the inner cells of each 2×2 table, instead of reporting the percentage agreement or κ?

We hope that this Journal Club has made readers aware of the oversimplifications and distortions that can occur when a 2×2 table (or more complex data structure) is reduced to a single reliability metric such as κ. If an experimental design warrants consideration of interrater reliability, then investigators should strongly consider reporting the actual interrater reliability data rather than percentage agreement or κ. This information could go in an online-only supplement if it is too bulky to go in the main article.
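One way to follow this recommendation is to print the full 2×2 table with the summary statistics beside it rather than in place of it. The sketch below uses the counts from the “crushing chest pain” table in question 2; the function name and layout are illustrative assumptions, not a reporting standard.

```python
def reliability_report(a, b, c, d, row_label="RA", col_label="MD"):
    """Format the full 2x2 table together with percentage agreement and kappa.

    a = both yes, b = row yes / column no, c = row no / column yes, d = both no.
    """
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    lines = [
        f"{'':<10}{col_label + ' yes':>8}{col_label + ' no':>8}{'Total':>8}",
        f"{row_label + ' yes':<10}{a:>8}{b:>8}{a + b:>8}",
        f"{row_label + ' no':<10}{c:>8}{d:>8}{c + d:>8}",
        f"{'Total':<10}{a + c:>8}{b + d:>8}{n:>8}",
        f"Percentage agreement: {p_o:.1%}",
        f"Cohen's kappa: {kappa:.2f}",
    ]
    return "\n".join(lines)


# Counts from the "crushing chest pain" table in question 2.
report = reliability_report(117, 6, 18, 2)
print(report)
```

Presenting the 4 inner cells lets readers recompute either statistic, or any alternative, for themselves.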


References 

  1. Cruz CO, Meshberg EG, Shofer FS, et al. Interrater reliability and accuracy of clinicians and trained research assistants performing prospective data collection in emergency department patients with potential acute coronary syndrome. Ann Emerg Med. 2009;54:1–7
  2. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174
  3. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measure. 1960;20:37–46
  4. Uebersax J. Statistical methods for rater agreement. http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm. Accessed May 31, 2009
  5. Wuensch KL. Inter-rater agreement. http://core.ecu.edu/psyc/wuenschk/docs30/InterRater.doc. Accessed May 18, 2009
  6. Kendall M. A new measure of rank correlation. Biometrika. 1938;30:81–89
  7. Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15:72–101
  8. Noether GE. Why Kendall tau? http://rsscse.org.uk/ts/bts/noether/text.html. Accessed May 18, 2009
  9. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428
  10. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–310
  11. Dewitte K, Fierens C, Stockl D, et al. Application of the Bland-Altman plot for interpretation of method-comparison studies: a critical investigation of its practice. Clin Chem. 2002;48:799–801
  12. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons; 1981
  13. Feinstein AR, Cicchetti DV. High agreement but low kappa, I: the problems of two paradoxes. J Clin Epidemiol. 1990;43:543–549
  14. Cicchetti DV, Feinstein AR. High agreement but low kappa, II: resolving the paradoxes. J Clin Epidemiol. 1990;43:551–558

 Section editors: Tyler W. Barrett, MD; David L. Schriger, MD, MPH

 Editor's Note: This 10th installment of Annals of Emergency Medicine Journal Club departs slightly from previous installments by focusing on a single methodological issue, the measurement of reliability. We use the Cruz et al article as a jumping-off point for our discussion.1 Although this installment may be appropriate for some residency journal clubs (particularly if they use our more basic questions and add some clinical questions about the article), we suspect that it will be of greater value to research fellows and researchers. Readers should recognize that these are suggested answers and, although it is hoped that they are correct, are by no means comprehensive. There are many other points that could be made about these questions or about the article in general. Questions are rated “novice,” “intermediate,” and “advanced.”

PII: S0196-0644(09)01258-X

doi:10.1016/j.annemergmed.2009.07.013
