
Volume 54, Issue 1, Pages 9-11 (July 2009)



Journal Club: The Measurement of Reliability

Frank C. Day, MD, MPH, David L. Schriger, MD, MPH

Refers to article:
Interrater Reliability and Accuracy of Clinicians and Trained Research Assistants Performing Prospective Data Collection in Emergency Department Patients With Potential Acute Coronary Syndrome, 30 January 2009
Carlos O. Cruz, Emily B. Meshberg, Frances S. Shofer, Christine M. McCusker, Anna Marie Chang, Judd E. Hollander
Annals of Emergency Medicine
July 2009 (Vol. 54, Issue 1, Pages 1-7)

Editor's Capsule Summary for Cruz et al [1]

What is already known on this topic

Valid clinical research requires high-quality data collection. Physicians are commonly considered the standard by which valid prospective data are obtained.

What question this study addressed

This study determined whether non–medically trained research assistants could reliably collect subjective historical data from emergency department patients with chest pain.

What this study adds to our knowledge

This prospective comparative study included 33 research assistants, 39 physicians, and 143 patients. Research assistants demonstrated fair to excellent reliability (as defined by crude agreement and kappa) when obtaining cardiac histories and cardiac risk factors.

How this might change clinical practice

The results of this study will not change clinical practice. They do, however, provide evidence to support the use of trained research assistants for the collection of certain types of clinical data.


Discussion Points 



1. Cruz et al contains 2 parts, a comparison of the values gathered by trained research assistants and physicians about historical information in chest pain patients and the comparison of these participants' recordings with a “correct” value for each item.
A. For each part, indicate whether the authors are studying reliability or validity and explain the difference between these concepts.
B. What did the authors use as their criterion standard for the validity analysis?
C. What are potential problems with their method of defining the criterion (gold) standard? Can you think of alternative approaches?
D. The authors report crude agreement and interquartile range for their validity analysis. What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and why they might be preferable to reporting crude agreement.

2. Crude percentage agreement is a simple way to report reliability. Consider the contingency table for the question, “Was the quality of the chest pain crushing?” (yes or no):

                    MD recorded “yes”   MD recorded “no”   Total
RA recorded “yes”          117                  6           123
RA recorded “no”            18                  2            20
Total                      135                  8           143

MD, Medical doctor; RA, Research assistant.

A. Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?
B. Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as –1, 0, and 1.
C. What other measures can be used to measure reliability for binary, categorical, and continuous data?
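Both statistics in question 2 can be computed mechanically from the four inner cells of the table. The sketch below is an illustration added here, not part of the original Journal Club. The function names `percent_agreement` and `cohens_kappa` and the cell labels `a`, `b`, `c`, `d` are assumptions made for this example. It uses the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's marginal totals.

```python
def percent_agreement(a, b, c, d):
    """Observed agreement (as a proportion) for a 2x2 table.

    a = both raters say yes, d = both raters say no,
    b and c = the two kinds of disagreement.
    """
    n = a + b + c + d
    return (a + d) / n


def cohens_kappa(a, b, c, d):
    """Cohen's kappa for two raters making a binary assessment."""
    n = a + b + c + d
    p_o = (a + d) / n
    # Chance agreement, built from the marginal totals of each rater.
    p_yes = ((a + b) / n) * ((a + c) / n)
    p_no = ((c + d) / n) * ((b + d) / n)
    p_e = p_yes + p_no
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Cells from the chest pain "crushing" table: rows are RA, columns are MD.
a, b, c, d = 117, 6, 18, 2
print(f"percentage agreement = {100 * percent_agreement(a, b, c, d):.1f}%")
print(f"kappa = {cohens_kappa(a, b, c, d):.3f}")
```

Running the block prints both values for the table, which makes it easy to compare how the two statistics behave on the same data.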

3. Cruz et al quote the oft-cited Landis and Koch [2] article stating that a κ of “less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement.” Consider studies of the agreement of airline pilots deciding whether it is safe to land and psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

4. A. Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true/false statements such as “red is a color,” “2+2=5,” etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for “false” and in the right for “true.” Questions are not repeated and the respondents are expected to offer a response for each statement. Verify that if they agree on all 100 answers, percentage agreement is 100 and κ is 1.0, regardless of how many statements are true and how many are false.
Now imagine that the testing site is under the final approach for a major airport, and, at times, noise from jets flying overhead drowns out the statements from the loudspeaker. When this occurs, respondents agree, on average, only half the time (as one would expect). Recalculate percentage agreement and κ for the same 100 statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings.
B. Consider the 2 tables below and calculate percentage agreement and κ for each. Why is κ lower on the right? What does this mean? Imagine that the right-hand table was from the true/false experiment described above and that planes were flying so frequently that every question was somewhat difficult to hear. Imagine 2 scenarios: in the first, both raters are told that there are 80 true statements and 20 false statements. In the second, raters are told that there could be 100 true statements with no false statements, 100 false statements with no true statements, or any combination in between, with each having an equal probability of occurring. Does κ mean the same thing in these 2 situations?
C. To further consider the meaning of κ, imagine (for the table immediately above) that planes flew overhead such that 60 statements were heard perfectly and 40 were barely comprehensible or not heard at all. Below are separate tables for the 60 audible and 40 incomprehensible statements. Calculate percentage agreement and κ for these tables. Which is the better measure for each? Consider the confidence level of the raters in the different scenarios presented in this exercise. Should rater confidence be considered when interrater reliability is described? How might this be done?
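For part A of question 4, the four noise conditions can be explored with a small expected-value calculation. The sketch below is one possible reading of the thought experiment, not the authors' worked answer. It assumes that audible statements are always answered correctly by both raters, that on drowned-out statements each rater independently guesses “true” or “false” with probability 0.5 (which gives agreement half the time), and that noise is unrelated to whether a statement is true. The function name `expected_agreement` and its parameters are hypothetical.

```python
def expected_agreement(p_true, p_noise):
    """Expected proportion agreement and kappa for the loudspeaker test.

    Assumptions (not stated in the source): audible statements are always
    answered correctly by both raters, each rater guesses true/false with
    probability 0.5 independently on drowned-out statements, and noise is
    unrelated to whether a statement is true.
    """
    heard = 1 - p_noise
    # Agreement cells: correct answers plus matching guesses.
    both_true = p_true * heard + p_noise / 4
    both_false = (1 - p_true) * heard + p_noise / 4
    # Each of the two disagreement cells receives p_noise / 4.
    p_o = both_true + both_false
    # Marginal probability that a given rater answers "true".
    m_true = both_true + p_noise / 4
    p_e = m_true ** 2 + (1 - m_true) ** 2
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa


# The 4 conditions from the question: (proportion true, proportion drowned out).
conditions = [(0.5, 0.01), (0.9, 0.01), (0.5, 0.20), (0.9, 0.20)]
for i, (p_true, p_noise) in enumerate(conditions, start=1):
    p_o, kappa = expected_agreement(p_true, p_noise)
    print(f"({i}) true={p_true:.0%} noise={p_noise:.0%}: "
          f"agreement={p_o:.1%}, kappa={kappa:.3f}")
```

Under this model, expected percentage agreement depends only on the noise level (it equals 1 − p_noise / 2), so any difference in κ between conditions with the same noise level comes entirely from the marginal totals.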


5. Finally, the following graph shows percentage agreement versus κ for the first 50 items in Table 1 of Cruz et al. The points are shaded to indicate how many subjects fall into the smallest cell in the 2 × 2 table.
A. Four lines in the table are denoted with square markers (near the arrow) on the graph (Is pain burning? Does it radiate to the back? Does it radiate to the jaw? Does it radiate to the left arm?). Create (approximate) 2 × 2 tables for these 4 points. Can you explain why these tables have similar percentage agreement but varying κs? Which do you believe is the better measure? Why do the κs differ?
B. Can you comment on the relationship between the size of the smallest cell in the 2 × 2 table and the extent to which κ may deviate from percentage agreement?
C. Given the problems with both percentage agreement and κ illustrated in these examples, do you think it would be better if investigators presented the 4 numbers in the inner cells of each 2 × 2 table, instead of reporting the percentage agreement or κ?

References 


1. Cruz CO, Meshberg EB, Shofer FS, et al. Interrater reliability and accuracy of clinicians and trained research assistants performing prospective data collection in emergency department patients with potential acute coronary syndrome. Ann Emerg Med. 2009;54:1–7.

2. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.

University of California, Los Angeles, CA

 Section editors: Tyler W. Barrett, MD; David L. Schriger, MD, MPH

 SEE RELATED ARTICLE, P. 1.

 Editor's Note: This 10th installment of Annals of Emergency Medicine Journal club departs slightly from previous installments by focusing on a single methodological issue, the measurement of reliability. We use this issue's Cruz et al article as a jumping-off point for our discussion. This bimonthly feature seeks to improve the critical appraisal skills of emergency physicians and other interested readers through a guided critique of actual Annals of Emergency Medicine articles. Each Journal Club will pose questions that encourage readers—be they clinicians, academics, residents, or medical students—to critically appraise the literature.

During a 2- to 3-year cycle, we plan to ask questions that cover the main topics in research methodology and critical appraisal of the literature. To do this, we will select articles that use a variety of study designs and analytic techniques. These may or may not be the most clinically important articles in a specific issue, but they are articles that serve the mission of covering the clinical epidemiology curriculum. Journal Club entries are published in 2 phases. In the first phase, a list of questions about the article is published in the issue in which the article appears. Questions are rated “novice,” “intermediate,” and “advanced” so that individuals planning a journal club can assign the right question to the right student. The answers to this journal club will be published in the December 2009 issue. US residency directors will have immediate access to the answers through the Council of Emergency Medicine Residency Directors Share Point Web site. International residency directors can gain access to the questions by going to http://www.emergencymedicine.ucla.edu/annalsjc/ and following the directions. Thus, if a program conducts its journal club within 5 months of the publication of the questions, no one will have access to the published answers except the residency director. The purpose of delaying the publication of the answers is to promote discussion and critical review of the literature by residents and medical students and discourage regurgitation of the published answers.

It is our hope that the Journal Club will broaden Annals of Emergency Medicine's appeal to residents and medical students. We are interested in receiving feedback about this feature. Please e-mail journalclub@acep.org with your comments.

PII: S0196-0644(09)00521-6

doi:10.1016/j.annemergmed.2009.05.012

