The use of dedicated methodology and statistical reviewers for peer review: A content analysis of comments to authors made by methodology and regular reviewers☆☆☆★
Abstract
Study objective: In 1997, Annals of Emergency Medicine initiated a protocol by which every original research article, in addition to each regular review, was concurrently evaluated by 1 of 2 methodology and statistical reviewers. We characterized and contrasted comments made by the methodology and regular peer reviewers. Methods: After pilot testing, interrater reliability assessment, and revision, we finalized a 99-item taxonomy of reviewer comments organized in 8 categories. Two authors, uninvolved in the writing of reviews, classified each comment from a random sample of methodology reviews from 1999. For 30 of these reviews (15 for each methodology reviewer), the 2 authors also scored all (range 2 to 5) regular reviews. Results: Sixty-five reviews by methodologist A, 60 by methodologist B, and 68 by regular reviewers were analyzed. Comments by methodologist A most frequently concerned the presentation of results (33% of all comments) and the presentation of methods (17%). Methodologist B commented most frequently on presentation of results (28%) and statistical methods (16%). Regular reviewers most frequently made non-methodology/statistical comments (45%) and comments on presentation of results (18%). Of note, comments made by methodology and regular reviewers about methods issues were often contradictory. Conclusion: The distributions of comments made by the 2 methodology and statistical reviewers were similar, although reviewer A emphasized presentation and reviewer B stressed statistical issues. The regular reviewers (most of whom were unaware that a dedicated methodology and statistical reviewer would be reviewing the article) paid much less attention to methodology issues. The 2 dedicated methodology and statistical reviewers created reviews that were similarly focused and emphasized methodology issues that were distinct from the issues raised by regular reviewers. [Ann Emerg Med. 2002;40:329-333.]
See related articles, p. 313, p. 317, p. 323, and p. 334, and abstracts, p. 338.
Introduction
Prepublication manuscript review is intended to identify high-quality submitted research, generate useful feedback to authors and editors, and ultimately result in the publication of high-quality manuscripts. An analysis of the effectiveness of editorial peer review concludes that "peer review and editing lead to better reports of research results"1 on the basis of studies comparing the quality of manuscripts before and after peer review.2, 3, 4 Critics of this process note that peer review is unstandardized, subject to bias, expensive, and insufficiently validated.5
The quality of a submitted manuscript needs to be assessed in terms of both context and content. An expert reviewer, intimately familiar with the relevant field, is needed to provide context (ie, to evaluate the originality and importance of the research). These reviewers are typically academicians identified by the editorial committee as experts in their fields. Many journals also rely on these reviewers to evaluate the methodology and to judge the internal and external validity of the manuscript. Their expertise in the content area is no guarantee that these individuals have the skills needed to assess the research and analytic methods contained in an article. For this reason, some journals supplement content reviews with methodology reviews, which are typically performed by statisticians or clinicians who have additional training or experience in biostatistics or clinical epidemiology.
From 1997 to 1999, every original research article submitted to Annals of Emergency Medicine was reviewed by 1 of 2 dedicated methodology and statistical reviewers (RLW, DLS) in addition to the usual content reviewers. We performed this analysis to describe the kinds of comments made by the 2 methodology and statistical reviewers and how they differ and to contrast the content of their reviews with the content of the regular peer reviews.
Materials and methods
On the basis of existing taxonomies used in peer-review research,4, 6, 7 we developed and pilot tested a 17-category, 70-item taxonomy for classifying comments made in manuscript reviews. We designed the taxonomy with the goal of facilitating the detailed classification of reviewer comments about methodology but included sufficient categories to accommodate the classification of all comments. Two of the authors who were not involved in the peer-review process of the journal (FCD, CT) served as the raters responsible for classifying the content of reviews. They began by independently parsing a random sample of 10 methodology reviews into comments. We defined a comment as a distinct statement or idea found in a review, regardless of whether that statement was presented in isolation or was included in a paragraph that contained several statements. After parsing the reviews, the raters independently classified each comment using the taxonomy.
We identified and discussed classification discrepancies between the 2 raters to clarify differences among taxonomy items and revised the taxonomy accordingly. After 2 iterations of this process, each time using reviews (a total of 31) that had not been previously categorized, we finalized a 99-item taxonomy of reviewer comments organized in 8 categories (Appendix; online only).
To assess the interrater reliability of the finalized taxonomy, the raters independently scored 7 additional methodology reviews.
Journal staff randomly selected reviews from 1998 and 1999 written by methodology reviewers A and B (DLS, RLW) and redacted the reviewer's identity. The 2 raters used the final 99-item taxonomy to classify an equal number of randomly assigned methodology reviews. In addition, to compare methodology and regular reviews, the raters also classified all (range 2 to 5) regular reviews of 30 manuscripts (15 that had been reviewed by methodology reviewer A and 15 that had been reviewed by reviewer B).
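The tallying step implied by this classification scheme can be illustrated with a minimal Python sketch. It assumes each parsed comment has already been assigned a taxonomy item code; the lookup below covers only a handful of items copied from the Appendix, and the example review is hypothetical rather than drawn from the study data.

```python
from collections import Counter

# Item-to-category lookup for a few taxonomy items (symbols as listed in the Appendix).
ITEM_CATEGORY = {
    "7a": "D",    # No power calculation
    "12a": "PR",  # Inadequate or unclear description of analytic methods
    "12i": "S",   # Alternate analytic strategy suggested
    "16a": "I",   # No limitations section
    "17d": "O",   # Language and style problems
}

def category_percentages(item_codes):
    """Return {category symbol: percent of comments}, rounded to whole percents."""
    counts = Counter(ITEM_CATEGORY[code] for code in item_codes)
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}

# Hypothetical review parsed into five comments, each already given an item code.
example_review = ["12a", "12a", "7a", "17d", "12i"]
print(category_percentages(example_review))
```

Pooling the item codes from all reviews written by one reviewer group and passing them to the same function would yield a category distribution of the kind reported in the Results.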
Results
The median number of comments made per review by methodologists A and B and by the regular reviewers was similar, but the range of comments per review written by the regular reviewers was much wider than that of A and B (Table 1).
Table 1. Comparison of methodology reviewers A and B and the regular peer reviewers.
| Variables | Methodology Reviewer A | Methodology Reviewer B | Regular Reviewers |
|---|---|---|---|
| No. of reviews (No. of comments) | 65 (623) | 60 (628) | 68 (996) |
| Median No. of comments per review (range) | 9 (0–28) | 9.5 (0–32) | 10 (2–77) |
Table 2. Distribution of reviewers' comments, by category.
| Symbol | Category Description | Methodology Reviewer A, % | Methodology Reviewer B, % | Regular Reviewers, % |
|---|---|---|---|---|
| H | Hypothesis/purpose/theoretic model | 7 | 3 | 3 |
| D | Study design/power | 8 | 10 | 7 |
| M | Research and analytic methods | 8 | 11 | 5 |
| S | Statistical methods | 6 | 16 | 3 |
| PM | Presentation of methods | 17 | 13 | 11 |
| PR | Presentation of results | 33 | 28 | 18 |
| I | Interpretation of results/limitations | 7 | 8 | 8 |
| O | Other comments (non-methodology/statistical) | 14 | 11 | 45 |
Regular reviewers commented much more frequently on non-methodology and statistical issues than did the methodology reviewers. All reviewers made comments that concerned study design, the interpretation of results, and study limitations with nearly equal frequency. Importantly, comments made by methodology and regular reviewers about the same methodology issue were often contradictory (eg, the methodology reviewer advised the authors to eliminate statistical tests, whereas the regular reviewer suggested that more P values be added).
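One informal way to quantify how close the three distributions in Table 2 are to one another is the total variation distance, that is, half the sum of the absolute differences in category percentages. This measure is not part of the study's analysis; the sketch below is offered only as an illustration, using the percentages transcribed from Table 2.

```python
# Category percentages transcribed from Table 2 (order: H, D, M, S, PM, PR, I, O).
TABLE_2 = {
    "A": [7, 8, 8, 6, 17, 33, 7, 14],
    "B": [3, 10, 11, 16, 13, 28, 8, 11],
    "regular": [3, 7, 5, 3, 11, 18, 8, 45],
}

def total_variation(p, q):
    """Half the sum of absolute differences between two percentage distributions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

for first, second in [("A", "B"), ("A", "regular"), ("B", "regular")]:
    distance = total_variation(TABLE_2[first], TABLE_2[second])
    print(f"{first} vs {second}: {distance} percentage points")
```

On these figures the two methodology reviewers differ from each other by 16 percentage points, whereas each differs from the regular reviewers by 32 to 34 percentage points, most of it in the "Other" and "Presentation of results" categories.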
The interrater reliability of the finalized taxonomy is shown in Table 3. Thirty-five of the 78 comments were assigned to the identical item from the 99-item taxonomy by both raters (45% overall agreement). Forty-eight of the 78 items were assigned to the identical category (62% category agreement). Twenty-one of the 30 discrepancies involved the “Presentation of results” and “Other” categories (Table 3); 9 of the 14 misclassifications into the "Presentation of results" category involved a single item ("12a—Inadequate or unclear description of analytic methods").
Table 3. Interrater reliability: Agreement between the 2 raters for major categories.*
| Symbol | O | I | PR | PM | M | S | D | H |
|---|---|---|---|---|---|---|---|---|
| H | 1 | 3 | 1 | |||||
| D | 5 | 4 | ||||||
| S | 2 | 2 | 5 | |||||
| M | 1 | 2 | 2 | 5 | ||||
| PM | 5 | 2 | 9 | |||||
| PR | 2 | 1 | 14 | |||||
| I | 2 | 5 | ||||||
| O | 5 | |||||||
*See Table 2 for definition of abbreviations. This table depicts the frequency of agreement (cells designating agreement between raters are in bold) and disagreement among the 8 categories. For example, there were 5 instances when one reviewer scored a comment as "PR (Presentation of results)" and the other reviewer scored it as "D (Study design/power)." Because the categories are not ordered, no importance should be attached to how close each disagreement cell is to the diagonal.
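The agreement percentages quoted with Table 3 are simple proportions and can be rechecked in a few lines of Python. The counts below are taken from the Results text; the per-category agreement counts assume that the last entry in each row of Table 3 is the agreement (diagonal) cell, an assumption that is consistent with their sum of 48.

```python
# Counts reported in the Results text for the 78 doubly scored comments.
TOTAL_COMMENTS = 78
SAME_ITEM = 35      # both raters chose the identical taxonomy item
SAME_CATEGORY = 48  # both raters chose the same one of the 8 categories

# Assumed agreement cells of Table 3 (last value in each row).
DIAGONAL = {"H": 1, "D": 4, "S": 5, "M": 5, "PM": 9, "PR": 14, "I": 5, "O": 5}

def percent_agreement(agreements, total):
    """Raw percent agreement, rounded to the nearest whole percent."""
    return round(100 * agreements / total)

item_agreement = percent_agreement(SAME_ITEM, TOTAL_COMMENTS)
category_agreement = percent_agreement(SAME_CATEGORY, TOTAL_COMMENTS)
discrepancies = TOTAL_COMMENTS - SAME_CATEGORY

print(item_agreement, category_agreement, discrepancies)
print(sum(DIAGONAL.values()) == SAME_CATEGORY)
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic would require the complete 8 by 8 cross-classification of the two raters' scores.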
Discussion
The distributions of comments made by the 2 methodology and statistical reviewers were similar, although A emphasized presentation issues and B stressed statistical issues. Not surprisingly, the methodology and statistical reviewers frequently commented on methodology issues, whereas the regular reviewers (most of whom were unaware that a dedicated methodology reviewer would be reviewing the manuscript) focused most on topical content and, in general, paid much less attention to methodology.
This study provides evidence that dedicated methodology reviewers add something to the review process that is likely omitted in their absence. Although methodology reviewers are known to have limited interrater and intrarater reliability,7, 8 there is evidence that they find major methodologic flaws.9 In a companion study in this issue of Annals, we demonstrated that methodology comments have a positive, although incomplete, effect on published manuscript quality.10 We conclude that dedicated methodology and statistical reviewers generate different feedback to authors and editors concerning prepublished manuscripts than do regular peer reviewers.
We also found that the methodology reviewers themselves focus on slightly different areas when they review manuscripts. Is this a problem? If there were universal agreement regarding the best methods for analyzing and reporting investigational data, it would be. In the absence of such guidelines, some heterogeneity might be both unavoidable and desirable. It is important to note that the companion study in this issue provides evidence that both reviewers (although their priorities differed) were successful in detecting fatal flaws in manuscripts.10
The 99-item taxonomy demonstrated limited reliability. This reflects the general difficulty in distinguishing between a comment that addresses the process of the research and a comment that addresses the presentation of research. The following reviewer comment exemplifies the type of comment that raters typically categorized differently: "It is never stated whether a paired or unpaired test for the differences in success proportions was used—the paired test (McNemar's) would be appropriate." One rater categorized this comment as "12a—Inadequate or unclear description of analytic methods," whereas the other rater categorized the same comment as "12i–Alternate analytic strategy (choice of stats tests, etc.) suggested." The first categorization indicates that the comment refers to the presentation of results, whereas the second categorization indicates that the comment refers to the process of statistical analysis that occurred. Although both raters categorized this example as a single comment, this statement could be considered to include 2 comments. We asked each rater to independently determine what constituted a comment, and they rarely parsed the review content identically. This lowered reliability below that which could be achieved had the raters assigned a taxonomy item to comments that had already been parsed.
Our confidence in the findings of this analysis is limited primarily by the reliability of our reviewer comment taxonomy. Although the mediocre reliability might produce a somewhat biased estimate of the difference in comment types between methodology and content reviewers, it is unlikely that this bias is large enough to eliminate the observed difference. Additionally, this study focuses on 2 reviewers at a single journal in a single specialty, and our results might not generalize to other reviewers at other journals.
In summary, this small study provides evidence that 2 dedicated methodology and statistical reviewers provided reviews that were generally consistent and emphasized methodology issues that were distinct from those raised by regular reviewers. Although these findings are insufficient to establish the value of dedicated methodology review, they do highlight the potential of such reviews to improve the methodologic quality of manuscripts.
Appendix
The 99-item taxonomy of reviewer comments.
| Symbol | Category/Item | Comments |
|---|---|---|
| 1.General | ||
| PM | a. Failure to comply with CONSORT | See also 5b |
| O | b. Inadequate definition of non-statistical terms | See 12b for statistical terms |
| O | c. Not an experiment or observational study (case series, etc.) | |
| PR | d. Abstract does not adequately/fairly portray paper | If other reporting problems treat as if they are in main text |
| O | e. No novel content, no novel methods | |
| 2.Purpose | ||
| H | a. No statement of purpose | |
| H | b. Purpose too general | |
| M | c. Failure to differentiate between a descriptive and experimental purpose. Typically, a paper with no hypothesis but lots of stats tests. | Consider 12.l |
| D | d. Study question does not match stated purpose of study | |
| H | e. Paper asks the wrong question | |
| 3.Theoretical Model/Experimental Design | ||
| H | a. Absence of theoretical model | An influence diagram of what is being studied |
| H | b. Theoretical model present, but not explicitly stated and should be | |
| H | c. Theoretical model incomplete/overly simplified | Model fails to include key factors |
| PM | d. Failure to adequately describe study design or study methods | |
| D | e. Failure to justify choice of study design | Why is a retrospective study better than a prospective one |
| D | f. Failure to include relevant control/contrast/other groups | |
| D | g. Failure to collect data on potential confounding variables | |
| 4.Hypothesis | ||
| H | a. No hypothesis (when there should be) | |
| H | b. Hypothesis seemingly present but not stated | Usually experiments |
| H | c. Hypothesis, but not stated in testable format | |
| 5.Methods—Sampling | ||
| D | a. Failure to define and/or follow clear, executable inclusion/exclusion criteria | |
| PM | b. Failure to show how sample was developed from population (no Figure 1 per CONSORT statement) | |
| D | c. Nonrandom sample accrued in systematic (biased) way | |
| PM | d. Failure to describe randomization | |
| D | e. Random allocation claimed but is actually systematic | |
| D | f. Randomization should have been stratified | |
| 6.Methods—Blinding | ||
| D | a. No statement regarding blinding when one is needed | |
| D | b. Blinding discussed, but discussion is not adequate | |
| D | c. Blinding discussed, but the design is not adequate to truly blind | |
| 7.Methods—Power | ||
| D | a. No power calculation | |
| PM | b. Power calculation—insufficient description | |
| D | c. Power calculation—wrong | |
| D | d. Power calculated on different model than that used for analysis | |
| D | e. Power calculated on outcome other than primary outcome measure | |
| 8.Methods—Outcomes | ||
| D | a. Failure to be explicit about what outcome is primary/important | |
| D | b. Outcome measure not validated | |
| D | c. Outcome measure is irrelevant (intermediate outcome) | |
| D | d. Failure to present reliability data or problem with reliability metric | |
| 9.Results—Presentation of data | ||
| S | a. Statistics used to compare baseline characteristics (and they should not be) (eg, P values in Table) | |
| PR | b. Failure to present stratified data or data adjusted for baseline differences | This is presentation, not analysis (see 12h) |
| M | c. Need to clarify or correct data that seem odd or in error | |
| PR | d. Change format (show raw data instead of percentages, show denominators) | |
| PR | e. Watch significant figures | |
| M | f. Request for additional data/information | |
| 10.Results—Accrual of data | ||
| PR | a. Failure to account for all subjects | Also see 5b for patient entry |
| PR | b. Failure to describe missing/poor data problems | |
| D | c. Reporting/recall bias | |
| 11.Results—Management of data | ||
| PM | a. Failure to describe process for cleaning/refereeing data (includes categorization) | Includes cutpoints, etc. |
| 12.Analysis—Methods and results | ||
| PR | a. Inadequate or unclear description of analytic methods | |
| S | b. Unusual or incorrect definition of statistical terms | |
| PR | c. Failure to name statistical analysis software or cite methods | |
| PR | d. Results and methods out of synch (analytic methods presented but never used, etc.) | |
| M | e. Analytic model does not match theoretical model (too simplistic, variables omitted, etc.) | |
| I | f. Statistical significance emphasized over clinical significance | P values presented instead of magnitude |
| S | g. Multiple tests instead of single test of overall model | |
| M | h. Failure to analyze data for confounding/effect modification (through stratification or modeling) | This is analysis; for presentation use 9b |
| S | i. Alternate analytic strategy (choice of stats tests, etc.) suggested | |
| S | j. Additional analytic strategies suggested | |
| PR | k. P values presented without CIs or means without P values or CIs | |
| M | l. Multiple stats tests in the absence of H's (P values used when no H stated) | See 2c |
| PR | m. Should present CI of differences between values, not CIs of values | |
| S | n. Parametric statistics used on nonparametric data | |
| S | o. Continuous statistics used on categoric data | |
| M | p. Statistical testing of underpowered secondary outcomes (eg, side effects) OR use of more variables than the N can support | |
| M | q. Statistical significance used to create model (stepwise regression, etc.) | |
| M | r. Predictive model created, but not validated | |
| 13.Studies of diagnostic tests | ||
| D | a. Spectrum bias (wrong cases included in analysis) | |
| D | b. Failure to independently measure gold standard | |
| S | c. Predictive statistics used when testing statistics more appropriate | |
| PR | d. No CIs on testing parameters | |
| D | e. CI too wide (N too small) | |
| I | f. Failure to consider/discuss/adjust for misclassification bias | |
| 14.Graphics (figures and tables) | ||
| PR | a. Failure to show distribution (mean graphed as bar) [alter graph]; also “add CIs to a table” | |
| PR | b. Failure to depict/describe potential confounders [alter graph] | |
| PR | c. Remove chart junk (3D nonsense, gridlines, redundant labels) | |
| PR | d. Modify/correct graphic or table (in ways other than 14a, 14b, or 14c) | |
| PR | e. Add a figure/table | |
| PR | f. Drop a figure/table | |
| PR | g. Contradiction between text and a table/figure | |
| 15.Conclusions | ||
| I | a. Not supported by the data | |
| I | b. Go beyond target population of sample | |
| 16.Limitations | ||
| I | a. No limitations section (“Add a limitations section”) | |
| I | b. Failure to consider alternative explanations | |
| I | c. Failure to consider problems with internal validity | |
| I | d. Failure to consider problems with external validity | |
| I | e. Failure to conduct sensitivity analysis | |
| 17.Non-methodology/statistical comments | ||
| O | a. Introduction and discussion longer than methods/results (more a review paper than research). Also includes “Discussion too long” | |
| O | b. Reference is incorrect | |
| O | c. References not up to date | |
| O | d. Language and style problems | |
| O | e. Has been done before (not sufficiently original) | |
| O | 18.Other (statistical/methods not covered in 1-16) | |
References
1. The effectiveness of editorial peer review. In: Jefferson T editors. Peer Review in Health Sciences. London, United Kingdom: BMJ Books; 1999; p. 45–56
2. How well does a journal's peer review process function? A survey of authors' opinions. JAMA. 1994;272:152–153
3. Readers' evaluation of effect of peer review and editing on quality of articles in the Nederlands Tijdschrift voor Geneeskunde. Lancet. 1996;348:1480–1483
4. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med. 1994;121:11–21
5. Editorial peer review: its development and rationale. In: Jefferson T editors. Peer Review in Health Sciences. London, United Kingdom: BMJ Books; 1999; p. 3–13
6. Instruments for assessing the quality of drug studies published in the medical literature. JAMA. 1994;272:101–104
7. Do readers and peer reviewers agree on manuscript quality? JAMA. 1994;272:117–119
8. Reproducibility of peer review in clinical neuroscience. Is agreement between reviewers any greater than would be expected by chance alone? Brain. 2000;123:1964–1969
9. An exploratory study of statistical assessment of papers published in the British Medical Journal. JAMA. 1990;263:1355–1357
10. The effect of dedicated methodology and statistical review on published manuscript quality. Ann Emerg Med. 2002;40:334–337
☆ Author contributions: DLS, RLW, and FCD designed the study. CT, FCD, and DLS collected and analyzed the data. FCD wrote the initial draft, and FCD, DLS, and RLW edited the manuscript. FCD and DLS take responsibility for the paper as a whole.
☆☆ The 99-item taxonomy of reviewer comments is included as an Appendix in the full-text, online version of this article. Access the Annals' Web site at www.mosby.com/AnnEmergMed. Information is also available at ACEP's home page at www.acep.org/AnnEmergMed.
★ Reprints not available from the authors. Address for correspondence: Frank C. Day, MD, MPH, 924 Westwood Boulevard, Suite 300, Los Angeles, CA 90048; fax 310-794-0599; E-mail fday@ucla.edu
PII: S0196-0644(02)00048-3
doi:10.1067/mem.2002.127326
© 2002 American College of Emergency Physicians. Published by Elsevier Inc. All rights reserved.
