Annals of Emergency Medicine
Volume 55, Issue 6, Pages 570-577, June 2010

The Conduct and Reporting of Meta-Analyses of Studies of Diagnostic Tests, and a Consideration of ROC Curves:

Answers to the January 2010 Journal Club Questions


Discussion Points 


1. Please review the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement and explanatory document (both available at http://www.prisma-statement.org).
A. Review the Hlibczuk et al article against each element in PRISMA. According to PRISMA, what items are reported well? Reported poorly?

B. For this particular study's question (the diagnostic accuracy of noncontrast computed tomography [CT] in suspected appendicitis), which elements of PRISMA seem most important? Are these handled well by the article?


2. Talk to the radiologists at your hospital. What imaging studies are routinely done when appendicitis is suspected? What is the rationale for whatever approach is used? Have other approaches been considered? Does remuneration to the hospital or radiologists play a role?

3. The authors report that “[un]enhanced CT test performance was assessed with the traditional summary receiver operating characteristic (SROC) curve analysis, with independently pooled sensitivity and specificity values across studies using a random effects model.”
A. What is a receiver operating characteristic (ROC) curve (for a single study)? How is it helpful? What are its failings?

B. Consider the data in the Figure for a new blood test for appendicitis. What can be said from the Figure? Use the data in the Figure to make an ROC curve. Label each point with the appropriate cut point. Draw a chance line. Interpret your ROC curve.

C. How does a summary ROC (SROC) curve differ from an ROC curve for a single study?

D. What does the sentence quoted above mean?

E. The authors “independently pooled the sensitivity and specificity values.” In general, what is the potential problem with separately pooling these measures (see Authors’ Appendix E2 and their reference 51)? Is it a problem in this study?


4. If you are reviewing this article for a formal journal club, assign presenters to debate the following topics. If you are reading on your own, put a Bluetooth earpiece on and argue both sides; people will think you are on the telephone.
A. A systematic review/meta-analysis that does well according to PRISMA will likely answer the question posed by the analysts.

B. The mathematical combination of the results of the 7 articles in the meta-analysis adds value beyond the systematic review itself.



Answer 1 

Q1. Please review the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement and explanatory document (both available at http://www.prisma-statement.org).

Q1.a Review the Hlibczuk et al article against each element in PRISMA. According to PRISMA, what items are reported well? Reported poorly?

PRISMA Checklist Items by Section 

Title (PRISMA Checklist Item 1). PRISMA requires that the title of the article “identify the report as a systematic review, meta-analysis, or both.” The title of this article, “Diagnostic Accuracy of Noncontrast Computed Tomography for Appendicitis in Adults: A Systematic Review,” describes the review component but does not indicate that the article contains a meta-analysis as well. A “systematic review” tries to gather all data relevant to a given research question with an explicit, prespecified, systematic and replicable search strategy. The use of statistical methods to combine the results of the included studies constitutes a meta-analysis. Because this article contains summary statistics, such as the pooled estimates for sensitivity and specificity, it should also be identified as a meta-analysis.

Abstract (PRISMA Checklist Item 2). This abstract provides a structured summary that includes the objectives, data sources, eligibility criteria, synthesis methods, results, and conclusions and makes some note of the implications of key findings. The abstract does not include background and limitations or the systematic review registration number called for by PRISMA. (The authors do provide substantial discussion of background and limitations in subsequent sections, and the PRISMA explanatory document describes this simplified abstract form as an acceptable alternative.)

Introduction (PRISMA Checklist Items 3 to 4). This article's introduction establishes the impetus for using CT in the diagnosis of acute appendicitis (clinical examination is insensitive and laboratory diagnosis nonspecific, and CT may reduce negative appendectomy rates) and specifically for using noncontrast CT (time considerations, decreased adverse effects). The description of the objectives, however, is very general and states only a plan to “assess the evidence.” PRISMA calls for “an explicit statement of questions being addressed with reference to participants, interventions, comparisons, outcomes, and study design” in the introduction.

Methods (PRISMA Checklist Items 5 to 16). A “written systematic review protocol” (checklist item 5) is mentioned on pages 51 to 52 but not provided. No registration information is provided.

The eligibility criteria (checklist item 6) can be described using the PICOS approach (participants, interventions, comparators, outcomes, and study design; see Item 11 and Box 2 of the PRISMA statement). This study provides a fair description of participant criteria (ED patients older than 16 years—excluding mixed pediatric and adult populations—with “pain suspicious for acute appendicitis, but not immediate candidates for surgery”). The “suspicion for acute appendicitis” is not further described and may have been determined differently in the various studies because there is a wide range for the percentage of subjects diagnosed with appendicitis (20.1% to 84.5%). In addition, the review targets only the population of patients who are “not immediate candidates for surgery.” This criterion is not further elucidated. If it is intended to exclude patients too ill for CT, this could also be accomplished by excluding patients who did not undergo CT. It may also serve to include only patients in whom CT may change management, ie, excluding those who are ill enough to require operative management regardless of CT results. The rationale for this criterion is not explicitly discussed. No age, language, or publication status restrictions were applied to the search.

Information sources (checklist item 7) are listed on page 52 but are incomplete because date ranges are given only for MEDLINE. The MEDLINE search strategy (checklist item 8) is provided as Appendix E1 of the original article.

A limited description of the study selection process (checklist item 9) is provided, though as mentioned above, the full review protocol is not. Agreement between the 2 initial reviewers for the initial relevance screen is evaluated with a κ statistic (0.62, provided in the Results section), and disagreements are said to be mediated by a third reviewer, though the number of cases for which there was disagreement is not provided. In addition, although there is a table identifying reasons for exclusion after full article review, some inconsistencies remain. It is unclear, for example, why the Yetkin study was excluded in the first round of full article review (see Table 1) for the reason “Could not verify length of clinical follow-up,” whereas others given this reason for exclusion (eg, Cakirer, Heaston, Hershko, Lane, Peck, and Togawa) were not excluded until the following round, after author contact was attempted. In addition, no details are given about the mode of attempted author contact or about confirmation of data obtained by these means. The process of data extraction (checklist item 10) is partially characterized. The same 2 reviewers abstract independently, though “consensus was reached by conference,” and disagreements are mediated by a third party. The data extraction form is not provided, and no piloting of the abstraction or review form is described.

The authors describe which variables they consider to be key (“patient spectrum” and use of an “appropriate reference standard”) among those for which they collect data, though there is not a clear list of all the variables collected (checklist item 11). They use the QUADAS tool1 to assess the general “quality” of studies, identify the 6 QUADAS criteria that they consider to be most relevant, and report methodological quality in a summary figure (Figure 3 in the article by Hlibczuk et al, page 56). They provide limited explicit discussion of the risk of bias in individual studies (checklist item 12) with reference to verification bias and blinding. Because surgical findings and pathology are the reference standard for patients undergoing surgery, and not all patients will undergo surgery, there is the potential for differential verification bias, ie, the possibility that the 2 reference standards differ in their representation of the truth about whether a patient has appendicitis. Clinical follow-up, for example, may miss some mild, self-limited cases of appendicitis that would be deemed positive by surgical pathology. Although these clinically mild cases may be unlikely to involve “patient-important outcomes,” as the authors argue, this reassurance does not entirely address the issue of differential verification bias. In other words, although from a clinical point of view it is safe to discharge a patient with a disease that will never be worse than “mild,” it is still a problem for a given study that the same patient would be deemed to have a positive result by one criterion standard (surgical pathology) and a negative result by the other (clinical follow-up). This means that the disease state assigned to the patient depends on the choice of criterion standard and would be different in each case. We can still claim that noncontrast CT allows us to safely discharge patients, but differential verification (the differential performance of the 2 criterion standards) compromises our evaluation of the performance (sensitivity and specificity) of noncontrast CT.

The summary findings presented (checklist item 13) are “independently pooled sensitivity and specificity” measures, which assumes that there is no relationship between sensitivity and specificity. See question 3 for a discussion of the reasonableness of this assumption and of the statistical methods used to generate the summary measures (checklist item 14).

There is no discussion of the risk of bias that (as described in the PRISMA checklist item 15) “may affect the cumulative evidence (eg, publication bias, selective reporting within studies).” The authors argue against the need for an analysis of publication bias based on the arguable validity of the relevant statistical tests but do not address the potential effect of publication bias on their results. They do not evaluate heterogeneity for similar reasons, arguing that the “validity of using formal statistical tests for assessing heterogeneity in diagnostic meta-analysis has been recently questioned.”

Results (PRISMA Checklist Items 17 to 23). As suggested by PRISMA, a flow diagram is provided (Hlibczuk et al, Figure 1, page 53). The initial relevance screen yields 32 articles; 15 are excluded by full article review for the reasons listed in Table 1, and another 10 are excluded after attempts to contact authors. (See also the discussion of checklist item 9 above for further issues.)

Each study for which data were extracted is described and cited (checklist item 18), though only limited PICOS information (called for by the PRISMA checklist) is given for individual studies. Specifically, see above for a discussion of the inclusion criteria “pain suspicious for acute appendicitis, but not immediate candidates for surgery.” Table 2 provides a summary of study characteristics and Figure 2 provides a forest plot (checklist item 20).

Checklist item 19 calls for “data on risk of bias of each study,” and checklist item 22 calls for the “results of any assessment of risk of bias across studies.” See the discussions of checklist items 12 and 15 above.

The authors present random-effects pooled estimates for sensitivity and specificity and generate 95% confidence intervals (CIs) and positive and negative likelihood ratios (checklist item 21). See question 3 for a discussion of the validity of this approach.

Discussion (PRISMA Checklist Items 24 to 26). The authors present more background here than elsewhere in the article. As called for by PRISMA (checklist items 24 to 25), they summarize their main findings but do not discuss the strength of evidence for their outcome. They provide only limited discussion of the relevance of their results “to key groups (eg, health care providers, users, and policy makers).” They do present some of the difficulties associated with the diagnosis of acute appendicitis and the pitfalls of various contrast modalities and place their review in the context of other reviews (checklist item 26).

Funding (PRISMA Checklist Item 27). By Annals of Emergency Medicine policy, all authors are required to report relevant funding relationships, and none were reported for this article. Author statements on funding are not routinely subject to verification.

Q1.b For this particular study's question (the diagnostic accuracy of noncontrast computed tomography [CT] in suspected appendicitis), which elements of PRISMA seem most important? Are these handled well by the article?

As the authors point out, important issues for this meta-analysis include the “reference standard” and the “patient spectrum.” As discussed above in the discussion of checklist item 12, there is a substantial risk for verification bias when different reference standards are used in each group.

The inclusion criteria that guide study selection determine the “patient spectrum”—the characteristics of the included study population—and therefore, the generalizability or “external validity” of the meta-analysis. Here, the studies are said to include adult ED patients with “pain suspicious for acute appendicitis, but not immediate candidates for surgery.” However, individual study inclusion criteria are not described in either the text or the study characteristics table, and this limits readers' ability to evaluate potential bias in individual studies. (See Table 2, “Example Table: Summary of Included Studies,” in the PRISMA statement.) The very different prevalence of appendicitis in some studies would suggest that the populations are quite different, despite sharing a “pain suspicious for appendicitis.” The combination of “heterogeneous” studies also creates a risk for bias, and these authors do not offer an assessment of the risk of comparability bias across studies (PRISMA statement item 22).2 See also question 3 for a discussion of summary effects.


Answer 2 

Q2. Talk to the radiologists at your hospital. What imaging studies are routinely done when appendicitis is suspected? What is the rationale for whatever approach is used? Have other approaches been considered? Does remuneration to the hospital or radiologists play a role?

Current imaging strategies for the diagnosis of acute appendicitis include ultrasonography, noncontrast CT, and CT with oral, rectal, or intravenous contrast. Considerations that may affect this choice include diagnostic performance of the test (for appendicitis and other conditions that may present similarly), patient safety issues (possible adverse effects of exposure to contrast or radiation), and ED flow issues (time required to complete an imaging study, the wait time for laboratory results needed before imaging, or the need to call in an ultrasonography technician if not available in-house at all times). In addition, facility, technician, and radiologist charges may vary considerably for each of these options.


Answer 3 

Q3. The authors report that “[un]enhanced CT test performance was assessed with the traditional summary receiver operating characteristic (SROC) curve analysis, with independently pooled sensitivity and specificity values across studies using a random effects model.”

Q3.a What is an ROC curve (for a single study)? How is it helpful? What are its failings?

ROC (receiver operating characteristic) curves are used in medicine to examine the tradeoff between sensitivity and specificity for diagnostic tests that produce continuous results (eg, WBC count, β-human chorionic gonadotropin level). The odd name comes from electrical engineering. Radios can be designed to increase sensitivity—which makes it easier to detect weaker signals but harder to tune out static and interference (decreased specificity)—or to suppress noise (increase specificity) at the expense of being unable to detect weaker stations (decreased sensitivity). A radio that is very specific will detect only the strongest radio stations but will do so with little static. A radio that is very sensitive will detect many stations but they may all sound fuzzy. A great radio optimizes this signal-to-noise ratio. The ROC curve for a given radio shows its tradeoff between signal and noise.

Before we tackle ROC curves in medical care, let us define some terms. We will refer to positive test results in persons with disease (as determined by an appropriate criterion standard) as “true positives” (TP). Those who have the disease but have negative test results are “false negatives” (FN). “True negatives” (TN) and “false positives” (FP) are similarly defined in persons who, by the criterion standard test, do not have the disease. The sensitivity is the fraction of persons with the disease who have a positive test result (TP/(TP+FN)), analogous to the fraction of stations that a radio can detect. Specificity (TN/(TN+FP)) is the fraction of negative test results in persons who do not have the disease. Now technically, in epidemiologic terms, a “rate” is a fraction with time in the denominator (incidence rate=new cases per year). However, sensitivity is commonly (though erroneously) referred to as the “true-positive rate” (TPR), whereas it should be the “TP proportion.” Because these terms are in widespread usage, however, we will use them here. If specificity, then, is the “true-negative rate,” then “1–specificity” is the “false-positive rate” (FPR).
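The definitions above reduce to a few lines of arithmetic. The following is a minimal Python sketch; the function name and the counts are hypothetical, chosen only to illustrate the formulas, and do not come from any study discussed here.

```python
def diagnostic_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, false-positive rate) from 2x2 counts."""
    sensitivity = tp / (tp + fn)             # TP/(TP+FN), the "true-positive rate"
    specificity = tn / (tn + fp)             # TN/(TN+FP), the "true-negative rate"
    false_positive_rate = 1 - specificity    # equivalently FP/(FP+TN)
    return sensitivity, specificity, false_positive_rate


# Hypothetical counts, for illustration only.
sens, spec, fpr = diagnostic_rates(tp=45, fn=5, tn=80, fp=20)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} FPR={fpr:.2f}")
# prints: sensitivity=0.90 specificity=0.80 FPR=0.20
```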

The jump from radio performance to medical diagnostic testing is straightforward. We want a test that optimizes the signal (TP) while minimizing noise (FP). The ideal test would achieve a sensitivity (TP/(TP+FN)) of 100% regardless of what specificity (TN/(TN+FP)) was selected and would maintain a specificity of 100% regardless of what sensitivity is desired. If we graph the TP rate on the y axis and the FP rate (1–specificity) on the x axis, a perfect test would hug the y axis (perfect specificity) from 0% to 100% and then go horizontally at 100% sensitivity across all values of specificity (red line in Figure 1).

A 45° line from the origin to the point (100%, 100%) represents the chance line, the curve of a test that is no better or worse than chance. One can think of that line as being constructed from games of chance that involve a different number of possible outcomes. For example, if our diagnostic test is a fair coin with 2 equally likely outcomes, “diseased” and “normal,” we can expect it to have a sensitivity (TPR) of 50% (because it will flip “diseased” 50% of the time when it is used on diseased patients) and an FPR of 50% (because it will flip “diseased” 50% of the time on nondiseased patients). This coin creates the point (50%, 50%) on the chance line. By similar logic, a fair die, labeled “diseased” on one side and “normal” on the other 5, will produce the point (1/6, 1/6), or (16.7%, 16.7%), because it will have a TP on one sixth of diseased patients and an FP on one sixth of nondiseased patients. One can envision populating the remainder of the 45° line by using test objects with various numbers of sides that are labeled for disease and normal results in various proportions.

The ROC curve for a given test is constructed by selecting a cut point (say, cut point 1) and calculating the TPR and FPR (see the next question for an example of this). This activity produces a single point on the curve (FPR1, TPR1). We then shift to a new cut point and repeat the calculation, producing a second data point (FPR2, TPR2). By repeating this process across the range of possible cut point values, we generate the full ROC curve for that test.

One can measure the global performance of a test by considering the area under the ROC curve (AUC). From Figure 1, we observe that the area under the chance line is 0.5. A test that is better than chance will have an AUC between 0.5 and 1, and a good test will have an AUC much closer to 1 than to 0.5. We can compare tests by comparing their AUCs.3, 4
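The construction of an ROC curve and its AUC can be sketched in Python as follows. This is an illustration under stated assumptions, not a method taken from any of the cited sources: a result is treated as positive when it is at or above the cut point, the AUC is approximated with the trapezoid rule, and the function names and test values are hypothetical.

```python
def roc_points(diseased, normal, cut_points):
    """Return one (FPR, TPR) pair per cut point.

    Assumption: a result at or above the cut point counts as positive.
    """
    points = []
    for cut in cut_points:
        tpr = sum(v >= cut for v in diseased) / len(diseased)
        fpr = sum(v >= cut for v in normal) / len(normal)
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points):
    """Approximate the area under an ROC curve with the trapezoid rule.

    The corners (0, 0) and (1, 1) are added so the curve spans the full x axis.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test values, for illustration only.
diseased = [3, 5, 6, 7, 8, 9]
normal = [1, 2, 4, 5, 6, 7]
curve = roc_points(diseased, normal, cut_points=range(0, 11))
print(auc_trapezoid(curve))

# Sanity checks against the two reference curves described in the text.
print(auc_trapezoid([(0.5, 0.5)]))  # a point on the chance line: 0.5
print(auc_trapezoid([(0.0, 1.0)]))  # a perfect test: 1.0
```

The two sanity checks correspond to the chance line and to the perfect test that hugs the y axis; any real test should fall between them.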

ROC curves are useful, but not without problems. Some of these are the following:

1) They are only as good as the data they come from. Just as sensitivity and specificity measurements are subject to spectrum bias (problems with patient selection) so too are ROC curves. A test could have an AUC near 1 in one patient sample and a much lower value when applied to a different population.

2) ROC curves are not loss functions. By design, the ROC curve gives TPR and FPR equal weight (the x and y axes are symmetric and the plot forms a square with the chance line as a diagonal). With many diagnostic tests, however, we are far more concerned about one of the 2 rates. For example, in many ED situations, we are unwilling to miss cases and therefore insist that our test have a high TPR (sensitivity). Consider the 2 curves shown in Figure 2. They have the same AUC but are not equivalent. If we value TPR above FPR, then we much prefer test 1. This example is representative of the larger principle that knowledge of TPR and FPR (or sensitivity and specificity) is insufficient to guide test selection (or choice of cut point within a test). To make such choices, one must first specify a loss function. This function stipulates how one values FPs relative to FNs. Until we specify a loss function, we cannot determine the optimal cut point for a given test and can only state which test is preferable in situations in which one test's ROC curve is above the other's for every value of FPR (x axis).

In summary, ROC curves are useful for comparing tests but have their limitations and require the stipulation of a loss function if they are to be truly useful.

Q3.b Consider the data in the Figure for a new blood test for appendicitis. What can be said from the Figure? Use the data in the Figure to make an ROC curve. Label each point with the appropriate cut point. Draw a chance line. Interpret your ROC curve.

The Figure shows data for 49 normal patients and 52 patients with appendicitis and suggests that those with appendicitis are more likely to have high values of this test. An article describing this new test might state that “the mean value for those with appendicitis (7.3) was significantly higher than for normal subjects (5.3), a difference of 2.0 (95% CI 1.0 to 3.1; P=.0002), providing strong support for the merits of this blood test.” We hope that readers of this journal club will question such a statement and recognize that a difference between means does not imply that the test classifies patients effectively.

To make the ROC curve, we follow the method discussed in question 3.a. Let us begin by setting the cut point at 0. We examine the histograms and observe that all patients have values above zero. Therefore, the sensitivity (TPR) is 100% (because all patients with appendicitis will be identified), and the FPR is also 100% (because all patients without appendicitis will also be deemed “positive”). This point is represented in the upper right corner of the ROC curve with the label “0.”

We now move the cut point to 1. From the graph, we see that 1 patient with appendicitis has a value below 1 and therefore the test now has a TPR of 51/52, or 98%. Similarly, only 1 nondiseased patient has a value below 1, and the FPR is therefore 48/49, or 98%. These values create the point (98%, 98%), which is labeled with the cut point “1.” We continue counting the TPs and FPs at each cut point, generating an (FPR, TPR) pair of x and y values for each cut point and labeling each point with the cut point used to generate it.

We did not actually make the curve this way. We entered the data into Stata 11 (StataCorp, College Station, TX) and had Stata draw both the histogram and the ROC curve. The code to do this is posted as an online attachment to this article.
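The hand-counting method for the first 2 cut points can be checked with a short Python sketch. This is not the Stata code referred to above. It uses only the counts stated in the text (52 patients with appendicitis, 49 without, and 1 patient in each group with a value below 1); the function name is hypothetical.

```python
N_APPENDICITIS = 52  # patients with appendicitis in the Figure
N_NORMAL = 49        # patients without appendicitis in the Figure


def rates_at(below_appendicitis, below_normal):
    """Return (FPR, TPR) given how many patients in each group fall below the cut point."""
    tpr = (N_APPENDICITIS - below_appendicitis) / N_APPENDICITIS
    fpr = (N_NORMAL - below_normal) / N_NORMAL
    return fpr, tpr


# Counts below each cut point, as read from the histograms in the text.
for cut, (below_app, below_norm) in {0: (0, 0), 1: (1, 1)}.items():
    fpr, tpr = rates_at(below_app, below_norm)
    print(f"cut point {cut}: FPR={fpr:.0%}, TPR={tpr:.0%}")
# prints:
# cut point 0: FPR=100%, TPR=100%
# cut point 1: FPR=98%, TPR=98%
```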

The area under the curve is 0.71 (95% CI 0.61 to 0.81). Because the CI excludes 0.5, the test is better than chance. Unfortunately, despite the statistically significant difference in mean values discussed above, there is no cut point for which sensitivity and specificity are both sufficiently high that we would be interested in using this test on our patients.

Q3.c How does a summary ROC (SROC) curve differ from an ROC curve for a single study?

Jones and Athanasiou5 have done a terrific job of answering this question in a statistical note in the Annals of Thoracic Surgery, and we refer interested readers there. In brief, an ROC curve shows the tradeoff of sensitivity and specificity based on results from a single study. An SROC curve has the same axes as an ROC curve, but each point is derived from a different study. The process begins with the meta-analyst culling sensitivity and specificity values from each study, with the goal of using the same cut point (threshold) across studies. This is important because if one study reports sensitivity and specificity values based on a test result above 3 being abnormal and another study reports sensitivity and specificity values based on a test result above 6 being abnormal, then we cannot compare the results effectively. The test characteristics may differ because the thresholds were different or because the test actually performed differently. The sensitivity and specificity values for each study are then mathematically transformed (typically to the logit scale) to facilitate analysis, and the SROC curve is created by fitting the best smooth curve to the TPR/FPR values for each study. The curve provides a sense of the test characteristics according to all the available evidence.
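To make the logit transformation and curve fitting concrete, here is a minimal Python sketch of one common approach, usually called the Moses-Littenberg method. That this is the fitting method meant by "traditional" SROC analysis is an assumption; the text above does not name one. For each study, D = logit(TPR) - logit(FPR) and S = logit(TPR) + logit(FPR) are computed, a straight line D = a + bS is fitted by ordinary least squares, and the line is transformed back to the ROC axes. The function names and study values are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def fit_sroc(pairs):
    """Fit D = a + b*S by ordinary least squares.

    pairs: (sensitivity, specificity) per study, each strictly between 0 and 1.
    """
    d_vals, s_vals = [], []
    for sens, spec in pairs:
        tpr, fpr = sens, 1 - spec
        d_vals.append(logit(tpr) - logit(fpr))  # log diagnostic odds ratio
        s_vals.append(logit(tpr) + logit(fpr))  # proxy for the threshold used
    n = len(pairs)
    s_mean = sum(s_vals) / n
    d_mean = sum(d_vals) / n
    sxx = sum((s - s_mean) ** 2 for s in s_vals)
    sxy = sum((s - s_mean) * (d - d_mean) for s, d in zip(s_vals, d_vals))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b


def sroc_tpr(fpr, a, b):
    """TPR on the fitted SROC curve at a given FPR (back-transformed from the logit scale)."""
    return inv_logit((a + (1 + b) * logit(fpr)) / (1 - b))


# Hypothetical (sensitivity, specificity) pairs, for illustration only.
studies = [(0.90, 0.95), (0.93, 0.92), (0.88, 0.97), (0.95, 0.90)]
a, b = fit_sroc(studies)
for fpr in (0.05, 0.10, 0.20):
    print(f"FPR={fpr:.2f} -> TPR on SROC curve={sroc_tpr(fpr, a, b):.3f}")
```

Studies reporting a sensitivity or specificity of exactly 0% or 100% cannot be logit-transformed without a continuity correction, which this sketch omits.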

Q3.d What does the sentence quoted above mean?

Let's parse the sentence into comprehensible phrases. The first phrase, “Unenhanced CT test performance was assessed with the traditional summary receiver operating characteristic (SROC) curve analysis,” was explained in question 3.c. The second phrase, “with independently pooled sensitivity and specificity values across studies,” was also discussed in question 3.c, where we explained how the investigators examined each study and culled sensitivity and specificity values based on a uniform cut point across studies. The final part of this phrase, “across studies using a random effects model,” explains how the authors took their list of sensitivity values (one for each study) and pooled them to get a single overall sensitivity (and then repeated the process for specificities). There are a variety of mathematical models for combining data across studies. Most methods calculate a weighted average using weights based on each study's size and variance. Averaging methods differ with respect to assumptions made about the structure of the data in relation to the underlying population. As Lang puts it, “A fixed-effects model assumes that there is a single ‘fixed’ effect that every study will approximate. That is, if every study were infinitely large, every study would yield an identical result. A random-effects model, on the other hand, assumes that the results of individual studies form a distribution of effects that has some central value and some degree of variability. The random effects model makes fewer assumptions about the variability in the analysis and so is more conservative than the fixed effects model.”6
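The difference between the 2 models can be illustrated with a short Python sketch. It assumes inverse-variance weighting and the DerSimonian-Laird estimate of between-study variance, which is one widely used random-effects method; the text does not say which estimator the authors used. The function name and input values are hypothetical, and the estimates could be on any scale the analyst chooses to pool (for example, logit sensitivity).

```python
import math


def pool(estimates, variances):
    """Pool study estimates under fixed- and random-effects models.

    Returns (fixed, se_fixed, random_effects, se_random, tau2).
    Requires at least 2 studies.
    """
    k = len(estimates)
    w = [1 / v for v in variances]  # fixed-effects weights
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    se_fixed = math.sqrt(1 / sum(w))

    # Cochran's Q and the DerSimonian-Laird between-study variance (tau squared).
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to every study's variance.
    w_re = [1 / (v + tau2) for v in variances]
    random_effects = sum(wi * y for wi, y in zip(w_re, estimates)) / sum(w_re)
    se_random = math.sqrt(1 / sum(w_re))
    return fixed, se_fixed, random_effects, se_random, tau2


# Hypothetical study estimates and within-study variances, for illustration only.
estimates = [2.0, 2.5, 3.5, 1.8]
variances = [0.10, 0.15, 0.30, 0.08]
fixed, se_f, rand, se_r, tau2 = pool(estimates, variances)
print(f"fixed={fixed:.3f} (SE {se_f:.3f}), random={rand:.3f} (SE {se_r:.3f}), tau2={tau2:.3f}")
```

When the studies disagree by more than chance would predict, tau2 is positive, the weights become more equal, and the standard error of the pooled estimate grows. That is the sense in which the random-effects model is the more conservative of the 2.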

Q3.e The authors “independently pooled the sensitivity and specificity values.” In general, what is the potential problem with separately pooling these measures (see Authors’ Appendix E2 and their reference 51)? Is it a problem in this study?

As discussed in the answers to the other parts of this question, sensitivity and specificity values are not independent because they both depend on the choice of cut point and change in a predictable way as the cut point is moved. If the cut point is raised, the sensitivity goes down (because it is more difficult for a test to come back positive) and the specificity goes up (because, for the same reason, there are fewer false positives). Separately analyzing sensitivity and specificity may produce spurious results if the cut points used in different studies vary in important ways. For example, 4 studies of equal size might contribute sensitivities of 86%, 87%, 88%, and 99% for an average sensitivity of 90%. The interpretation of this value depends on the corresponding specificity values. If all 4 were 94%, then we might conclude that 90% was a reasonable estimate of the test's sensitivity; however, if the values were 94%, 94%, 94%, and 72%, we might throw out the 99% value as produced by a study that had a lower cut point than the others. A better estimate of the sensitivity would then be 87%.
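The arithmetic of this example can be reproduced in a few lines of Python. The sensitivity and specificity values are those given in the text; the 90% specificity threshold used to flag the outlying study is an arbitrary illustrative choice, not a rule from the source.

```python
sensitivities = [86, 87, 88, 99]  # percent, one per study
specificities = [94, 94, 94, 72]  # percent, second scenario in the text


def mean(values):
    return sum(values) / len(values)


# Pooling sensitivity alone ignores what the paired specificities reveal.
print(mean(sensitivities))  # 90.0

# Keep only studies whose specificity suggests a comparable cut point.
# The threshold of 90 is an assumption made for this illustration.
comparable = [se for se, sp in zip(sensitivities, specificities) if sp >= 90]
print(mean(comparable))  # 87.0
```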

This example highlights one of the difficulties of meta-analyses: what to do with heterogeneity. The prevalence of appendicitis in these studies varies from 20% to 85%, ample evidence of heterogeneity. Should we be concerned that the two studies that have the best combination of sensitivity and specificity also have the highest prevalence of appendicitis? Including or excluding studies could shift the values of sensitivity and specificity from the low 90s to 100%. Can we learn anything more from this analysis than to say that both sensitivity and specificity are likely higher than 90%? Can any valid definitive statement be made about the relative value of contrast-enhanced and unenhanced CT? Unfortunately, after all of the technical analysis and mathematical modeling we are still left with a subjective judgment about the relative value of these modalities.


Answer 4 

Q4. If you are reviewing this article for a formal journal club, assign presenters to debate the following topics. If you are reading on your own, put a Bluetooth earpiece on and argue both sides; people will think you are on the telephone.

Q4.a A systematic review/meta-analysis that does well according to PRISMA will likely answer the question posed by the analysts.

PRISMA is a set of guidelines designed to improve the reporting of systematic reviews according to the premise that complete reporting is essential for conveying science and that clear reporting will decrease the possibility of unappreciated biases. As such, it provides a list of elements that must be included for completeness and transparency.

The pro side for this argument might suggest that such completeness will necessarily increase the chances of answering the question posed (ie, if the search strategy, results, and calculations are described in a complete manner, then their implications for the research question will be clear and the summary calculations more useful).

The con side, however, can argue that the researchers may have impeccable reporting by the terms of PRISMA and still reach a conclusion that is fundamentally biased. This could be due to an underlying literature whose individual studies are biased, an underlying literature that suffers from publication bias, or a search strategy that fails to identify all of the relevant literature. PRISMA offers no guarantees that a chosen search strategy is appropriate for a given question. A meta-analysis that does well by PRISMA will provide readers with a complete and, assuming that the researchers are honest, accurate description of what the authors have done and make it possible to reproduce the study. Although this is all very important, PRISMA compliance does not eliminate the possibility of bias, and it is bias that we are most concerned with.

Q4.b The mathematical combination of the results of the 7 articles in the meta-analysis adds value beyond the systematic review itself.

Pro 

Pooling studies in a meta-analysis should average individual outliers and narrow the range of results. In other words, combining studies provides a larger overall study population and greater precision in the summary effect. With a larger sample size, the effect of random variation is decreased. As seen here, the 95% CIs are substantially narrower for the combined result than for most of the individual studies.
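The precision argument can be illustrated with a minimal sketch. The counts below are hypothetical and are not taken from the 7 articles in the meta-analysis, and the pooling shown is a naive summation of counts rather than the random-effects model the authors describe. The helper name `wilson_interval` is likewise our own. The sketch only shows that an interval computed on the combined sample is narrower than the interval from any single study.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval (about 95% for z=1.96) for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical (true positives, patients with disease) per study.
# These numbers are illustrative assumptions, not the article's data.
studies = [(45, 50), (28, 30), (88, 95), (37, 40)]

widths = []
for tp, n in studies:
    low, high = wilson_interval(tp, n)
    widths.append(high - low)
    print(f"n={n:3d}  sensitivity={tp / n:.3f}  95% CI {low:.3f} to {high:.3f}")

# Naive pooling: add up the counts as if all patients came from one study.
pooled_tp = sum(tp for tp, _ in studies)
pooled_n = sum(n for _, n in studies)
pooled_low, pooled_high = wilson_interval(pooled_tp, pooled_n)
pooled_width = pooled_high - pooled_low
print(f"pooled n={pooled_n}  sensitivity={pooled_tp / pooled_n:.3f}  "
      f"95% CI {pooled_low:.3f} to {pooled_high:.3f}")
```

The narrower pooled interval reflects only the larger sample size; it says nothing about whether the studies were similar enough to be combined.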

Con 

If, however, we combine one good study whose results actually approach the truth with several biased studies, the summary effect may actually be farther from the truth than the results of the good study alone. In addition, if study populations are too heterogeneous or the statistical methods used to generate a summary effect are flawed (see question 3), then the process of combination itself can introduce significant bias. We might improve the precision (the narrowness of the results range) while actually decreasing accuracy (or increasing bias) if the studies or the combination methods are biased.

Repeating a biased study on an ever-larger population, although it may increase precision, does nothing to increase the accuracy of the study, that is, the agreement between the measured outcome and the actual outcome (the truth). Bias affects accuracy and cannot be countered by increasing precision. One could design an extremely precise but biased rocket model, for example, in which all rockets fall within 10 feet of one another, though miles from the target (the truth). (See Figure 4.)
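The distinction between precision and accuracy can be sketched numerically. The values below are assumptions chosen for illustration: a hypothetical true sensitivity of 0.90 and a flawed study design whose estimate is expected to settle at 0.95. The code computes the approximate (Wald) 95% interval around that biased expectation for increasing sample sizes.

```python
import math

TRUE_SENSITIVITY = 0.90    # hypothetical "truth" (an assumption)
BIASED_EXPECTATION = 0.95  # hypothetical value a biased design converges to

def expected_interval(p, n, z=1.96):
    """Approximate (Wald) 95% interval centered on the expected estimate p."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

rows = []
for n in (50, 500, 5000):
    low, high = expected_interval(BIASED_EXPECTATION, n)
    covers_truth = low <= TRUE_SENSITIVITY <= high
    bias = BIASED_EXPECTATION - TRUE_SENSITIVITY
    rows.append((n, high - low, bias, covers_truth))
    print(f"n={n:5d}  CI width={high - low:.3f}  bias={bias:.2f}  "
          f"interval covers truth: {covers_truth}")
```

Under these assumptions the interval narrows as the sample grows while the bias stays fixed, so the larger studies exclude the true value with increasing apparent confidence.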

In this article, there is inadequate discussion and evaluation of heterogeneity, and the external validity of the summary effect is unclear.

Application 

References 

  1. Whiting P, Rutjes AW, Reitsma JB, et al. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol. 2003;3:25
  2. Eddy DM, Hasselblad V, Shachter RD. Meta-analysis by the Confidence Profile Method: The Statistical Synthesis of Evidence. Boston, MA: Academic Press; 1992
  3. McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med Decis Making. 1984;4:137–150
  4. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36
  5. Jones CM, Athanasiou T. Summary receiver operating characteristic curve analysis techniques in the evaluation of diagnostic tests. Ann Thorac Surg. 2005;79:16–20
  6. Lang TA, Secic M. How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers. 2nd ed. New York, NY: American College of Physicians; 2006

 Section editors: Tyler W. Barrett, MD; David L. Schriger, MD, MPH

 Editor's Note: You are reading the 13th installment of Annals of Emergency Medicine Journal Club. The questions and the article they are about (Hlibczuk et al. Ann Emerg Med. 2010;55:51-59) were published in the January 2010 issue. Information about journal club can be found at http://www.annemergmed.com/content/journalclub. Readers should recognize that these are suggested answers. We hope they are accurate; we know that they are not comprehensive. There are many other points that could be made about these questions or about the article in general. Questions are rated “novice,” “intermediate,” and “advanced” so that individuals planning a journal club can assign the right question to the right student. The “novice” rating does not imply that a novice should be able to spontaneously answer the question. “Novice” means we expect that someone with little background should be able to do a bit of reading, formulate an answer, and teach the material to others. Intermediate and advanced questions also will likely require some reading and research, and that reading will be sufficiently difficult that some background in clinical epidemiology will be helpful in understanding the reading and concepts. We are interested in receiving feedback about this feature. Please e-mail journalclub@acep.org with your comments.

PII: S0196-0644(10)00121-6

doi:10.1016/j.annemergmed.2010.02.008

Refers to article:

  • Diagnostic Accuracy of Noncontrast Computed Tomography for Appendicitis in Adults: A Systematic Review (Continuing Medical Education; Journal Club questions), 07 September 2009

    Veronica Hlibczuk, Judith A. Dattaro, Zhezhen Jin, Louise Falzon, Michael D. Brown
    Annals of Emergency Medicine January 2010 (Vol. 55, Issue 1, Pages 51-59.e1)

  • Journal Club: The Conduct and Reporting of Meta-Analyses of Studies of Diagnostic Tests, and a Consideration of ROC Curves

    David L. Schriger, Teri A. Reynolds
    Annals of Emergency Medicine January 2010 (Vol. 55, Issue 1, Pages 60-61)
