Annals of Emergency Medicine
Volume 40, Issue 3, Pages 329-333, September 2002

The use of dedicated methodology and statistical reviewers for peer review: A content analysis of comments to authors made by methodology and regular reviewers☆☆

Presented at the Fourth International Congress on Peer Review in Biomedical Publication, Barcelona, Spain, September 2001.

UCLA Emergency Medicine Center, UCLA School of Medicine, Los Angeles, CA (Day, Schriger, Todd), and the Department of Emergency Medicine, University of Florida Health Science Center Jacksonville, Jacksonville, FL (Wears)

Received 19 March 2002; accepted 6 June 2002.

Abstract 

Study objective: In 1997, Annals of Emergency Medicine initiated a protocol by which every original research article, in addition to receiving regular peer reviews, was concurrently evaluated by 1 of 2 methodology and statistical reviewers. We characterized and contrasted comments made by the methodology and regular peer reviewers.

Methods: After pilot testing, interrater reliability assessment, and revision, we finalized a 99-item taxonomy of reviewer comments organized in 8 categories. Two authors, uninvolved in the writing of reviews, classified each comment from a random sample of methodology reviews from 1999. For 30 of these reviews (15 for each methodology reviewer), the 2 authors also scored all (range 2 to 5) regular reviews.

Results: Sixty-five reviews by methodologist A, 60 by methodologist B, and 68 by regular reviewers were analyzed. Comments by methodologist A most frequently concerned the presentation of results (33% of all comments) and the presentation of methods (17%). Methodologist B commented most frequently on presentation of results (28%) and statistical methods (16%). Regular reviewers most frequently made non-methodology/statistical comments (45%) and comments on presentation of results (18%). Of note, comments made by methodology and regular reviewers about methods issues were often contradictory.

Conclusion: The distributions of comments made by the 2 methodology and statistical reviewers were similar, although reviewer A emphasized presentation and reviewer B stressed statistical issues. The regular reviewers (most of whom were unaware that a dedicated methodology and statistical reviewer would be reviewing the article) paid much less attention to methodology issues. The 2 dedicated methodology and statistical reviewers created reviews that were similarly focused and emphasized methodology issues that were distinct from the issues raised by regular reviewers. [Ann Emerg Med. 2002;40:329-333.]

 

See related articles, p. 313, p. 317, p. 323, and p. 334, and abstracts, p. 338.

Introduction 

Prepublication manuscript review is intended to identify high-quality submitted research, generate useful feedback to authors and editors, and ultimately result in the publication of high-quality manuscripts. An analysis of the effectiveness of editorial peer review concludes that "peer review and editing lead to better reports of research results"1 on the basis of studies comparing the quality of manuscripts before and after peer review.2, 3, 4 Critics of this process note that peer review is unstandardized, subject to bias, expensive, and insufficiently validated.5

The quality of a submitted manuscript needs to be assessed in terms of both context and content. An expert reviewer, intimately familiar with the relevant field, is needed to provide context (ie, to evaluate the originality and importance of the research). These reviewers are typically academicians identified by the editorial committee as experts in their fields. Many journals also rely on these reviewers to evaluate the methodology and to judge the internal and external validity of the manuscript. Their expertise in the content area is no guarantee that these individuals have the skills needed to assess the research and analytic methods contained in an article. For this reason, some journals supplement content reviews with methodology reviews, which are typically performed by statisticians or clinicians who have additional training or experience in biostatistics or clinical epidemiology.

From 1997 to 1999, every original research article submitted to Annals of Emergency Medicine was reviewed by 1 of 2 dedicated methodology and statistical reviewers (RLW, DLS) in addition to the usual content reviewers. We performed this analysis to describe the kinds of comments made by the 2 methodology and statistical reviewers and how they differ and to contrast the content of their reviews with the content of the regular peer reviews.

Materials and methods 

On the basis of existing taxonomies used in peer-review research,4, 6, 7 we developed and pilot tested a 17-category, 70-item taxonomy for classifying comments made in manuscript reviews. We designed the taxonomy with the goal of facilitating the detailed classification of reviewer comments about methodology but included sufficient categories to accommodate the classification of all comments. Two of the authors who were not involved in the peer-review process of the journal (FCD, CT) served as the raters responsible for classifying the content of reviews. They began by independently parsing a random sample of 10 methodology reviews into comments. We defined a comment as a distinct statement or idea found in a review, regardless of whether that statement was presented in isolation or was included in a paragraph that contained several statements. After parsing the reviews, the raters independently classified each comment using the taxonomy.

We identified and discussed classification discrepancies between the 2 raters to clarify differences among taxonomy items and revised the taxonomy accordingly. After 2 iterations of this process, each time using reviews (a total of 31) that had not been previously categorized, we finalized a 99-item taxonomy of reviewer comments organized in 8 categories (Appendix; online only).

To assess the interrater reliability of the finalized taxonomy, the raters independently scored 7 additional methodology reviews.

Journal staff randomly selected reviews from 1998 and 1999 written by methodology reviewers A and B (DLS, RLW) and redacted the reviewer's identity. The 2 raters used the final 99-item taxonomy to classify an equal number of randomly assigned methodology reviews. In addition, to compare methodology and regular reviews, the raters also classified all (range 2 to 5) regular reviews of 30 manuscripts (15 that had been reviewed by methodology reviewer A and 15 that had been reviewed by reviewer B).
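
The selection and assignment steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the random draw was performed, and the review identifiers, sample size, rater labels, and seed below are all hypothetical.

```python
import random


def assign_reviews(review_ids, n_sample, raters=("rater_1", "rater_2"), seed=0):
    """Draw a simple random sample of reviews and deal it out evenly to the raters."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    # Sampling without replacement; the result is already in random order.
    sample = rng.sample(list(review_ids), n_sample)
    assignment = {rater: [] for rater in raters}
    # Dealing a randomly ordered sample round-robin gives each rater a random share.
    for position, review_id in enumerate(sample):
        assignment[raters[position % len(raters)]].append(review_id)
    return assignment


# Hypothetical pool of redacted review identifiers (not data from the study).
pool = [f"review_{i:03d}" for i in range(1, 201)]
assignment = assign_reviews(pool, n_sample=20)
for rater, ids in assignment.items():
    print(rater, len(ids))
```

Dealing the shuffled sample in turn keeps the two workloads equal whenever the sample size is even, which matches the "equal number" of reviews described in the text.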

Results 

The median number of comments per review was similar for methodologists A and B and the regular reviewers, but the range of comments per review written by the regular reviewers was much larger than that of A and B (Table 1).
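
The review and comment counts in Table 1 also allow a per-review average to be derived. The short Python sketch below does that arithmetic; the derived averages are not reported in the article and are shown only as a cross-check on the counts.

```python
# (number of reviews, number of comments) for each group, as given in Table 1
table1 = {
    "Methodology reviewer A": (65, 623),
    "Methodology reviewer B": (60, 628),
    "Regular reviewers": (68, 996),
}


def mean_comments(n_reviews, n_comments):
    """Average number of comments per review."""
    return n_comments / n_reviews


means = {who: round(mean_comments(*counts), 1) for who, counts in table1.items()}
for who, value in means.items():
    print(f"{who}: about {value} comments per review on average")
```

For the regular reviewers the derived average (about 14.6) sits well above the value of 10 shown in Table 1, which is what a long upper tail in their range (2 to 77) would produce.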

Table 1. Comparison of methodology reviewers A and B and the regular peer reviewers.

Variables                                   Methodology   Methodology   Regular
                                            Reviewer A    Reviewer B    Reviewers
No. of reviews (No. of comments)            65 (623)      60 (628)      68 (996)
Median No. of comments per review (range)   9 (0–28)      9.5 (0–32)    10 (2–77)

Comments by methodologist A most frequently concerned the presentation of results and the presentation of methods (Table 2). Methodologist B commented most frequently on the presentation of results and statistical methods. The vast majority (86% to 89%) of each methodology reviewer's comments concerned methodology and statistical issues.

Table 2. Distribution of reviewers' comments, by category.

                                                 Total Comments by Category, %
Symbol  Category Description                     Methodology   Methodology   Regular
                                                 Reviewer A    Reviewer B    Reviewers
H       Hypothesis/purpose/theoretic model       7             3             3
D       Study design/power                       8             10            7
M       Research and analytic methods            8             11            5
S       Statistical methods                      6             16            3
PM      Presentation of methods                  17            13            11
PR      Presentation of results                  33            28            18
I       Interpretation of results/limitations    7             8             8
O       Other comments (non-methodology/         14            11            45
        statistical)

Regular reviewers commented much more frequently on non-methodology/statistical issues than did the methodology reviewers (45% of their comments, compared with 14% for reviewer A and 11% for reviewer B). All reviewers made comments that concerned study design, the interpretation of results, and study limitations with nearly equal frequency. Importantly, comments made by methodology and regular reviewers about the same methodology issue were often contradictory (eg, the methodology reviewer advised the authors to eliminate statistical tests, whereas the regular reviewer suggested that more P values be added).
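
The summary figures quoted for Table 2 can be reproduced directly from its percentages. The following Python sketch is illustrative only; the function name and data layout are assumptions, while the numbers are copied from Table 2.

```python
CATEGORIES = ["H", "D", "M", "S", "PM", "PR", "I", "O"]

# Percentage of each group's comments by category, copied from Table 2.
table2 = {
    "Methodology reviewer A": [7, 8, 8, 6, 17, 33, 7, 14],
    "Methodology reviewer B": [3, 10, 11, 16, 13, 28, 8, 11],
    "Regular reviewers": [3, 7, 5, 3, 11, 18, 8, 45],
}


def summarize(percentages):
    """Return (share of methodology/statistical comments, two most frequent categories)."""
    by_category = dict(zip(CATEGORIES, percentages))
    # Each column of Table 2 should account for all of that group's comments.
    assert sum(by_category.values()) == 100
    # Everything except "O" counts as a methodology or statistical comment.
    methodology_share = 100 - by_category["O"]
    top_two = sorted(by_category, key=by_category.get, reverse=True)[:2]
    return methodology_share, top_two


for group, percentages in table2.items():
    share, top_two = summarize(percentages)
    print(f"{group}: {share}% methodology/statistical; most frequent: {top_two}")
```

Treating every category other than "O" as methodology or statistical content is the reading that yields the 86% to 89% range quoted for the two methodology reviewers.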

The interrater reliability of the finalized taxonomy is shown in Table 3. Thirty-five of the 78 comments were assigned to the identical item from the 99-item taxonomy by both raters (45% overall agreement). Forty-eight of the 78 comments were assigned to the identical category (62% category agreement). Twenty-one of the 30 discrepancies involved the “Presentation of results” and “Other” categories (Table 3); 9 of the 14 misclassifications into the "Presentation of results" category involved a single item ("12a. Inadequate or unclear description of analytic methods").
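
The agreement percentages above are simple proportions of the 78 comments scored by both raters. A minimal Python check of that arithmetic, using only the counts stated in the text (the variable names are illustrative):

```python
def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


n_comments = 78      # comments scored by both raters
same_item = 35       # assigned to the identical taxonomy item
same_category = 48   # assigned to the identical category

item_agreement = percent(same_item, n_comments)
category_agreement = percent(same_category, n_comments)

# Comments on which the raters chose different categories.
discrepancies = n_comments - same_category
# Share of those discrepancies that involved "Presentation of results" or "Other".
pr_or_other_share = percent(21, discrepancies)

print(item_agreement, category_agreement, discrepancies, pr_or_other_share)
```

Agreement at the category level is necessarily at least as high as agreement at the item level, because two raters who pick the same item have also picked the same category.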

Table 3. Interrater reliability: Agreement between the 2 raters for major categories.*
SymbolOIPRPMMSDH
H 1 31
D 5 4
S 2 25
M 1225
PM5 29
PR2114
I25
O5
*See Table 2 for definition of abbreviations. This table depicts the frequency of agreement (cells designating agreement between raters are in bold) and disagreement among the 8 categories. For example, there were 5 instances when one rater scored a comment as "PR (Presentation of results)" and the other rater scored it as "D (Study design/power)." Because the categories are not ordered, no importance should be attached to how close each disagreement cell is to the diagonal.

Discussion 

The distributions of comments made by the 2 methodology and statistical reviewers were similar, although A emphasized presentation issues and B stressed statistical issues. Not surprisingly, the methodology and statistical reviewers frequently commented on methodology issues, whereas the regular reviewers (most of whom were unaware that a dedicated methodology reviewer would be reviewing the manuscript) focused most on topical content and, in general, paid much less attention to methodology.

This study provides evidence that dedicated methodology reviewers add something to the review process that is likely omitted in their absence. Although methodology reviewers are known to have limited interrater and intrarater reliability,7, 8 there is evidence that they find major methodologic flaws.9 In a companion study in this issue of Annals, we demonstrated that methodology comments have a positive, although incomplete, effect on published manuscript quality.10 We conclude that dedicated methodology and statistical reviewers generate different feedback to authors and editors concerning prepublished manuscripts than do regular peer reviewers.

We also found that the methodology reviewers themselves focus on slightly different areas when they review manuscripts. Is this a problem? If there were universal agreement regarding the best methods for analyzing and reporting investigational data, it would be. In the absence of such guidelines, some heterogeneity might be both unavoidable and desirable. It is important to note that the companion study in this issue provides evidence that both reviewers (although their priorities differed) were successful in detecting fatal flaws in manuscripts.10

The 99-item taxonomy demonstrated limited reliability. This reflects the general difficulty in distinguishing between a comment that addresses the process of the research and a comment that addresses the presentation of research. The following reviewer comment exemplifies the type of comment that raters typically categorized differently: "It is never stated whether a paired or unpaired test for the differences in success proportions was used—the paired test (McNemar's) would be appropriate." One rater categorized this comment as "12a—Inadequate or unclear description of analytic methods," whereas the other rater categorized the same comment as "12i–Alternate analytic strategy (choice of stats tests, etc.) suggested." The first categorization indicates that the comment refers to the presentation of results, whereas the second categorization indicates that the comment refers to the process of statistical analysis that occurred. Although both raters categorized this example as a single comment, this statement could be considered to include 2 comments. We asked each rater to independently determine what constituted a comment, and they rarely parsed the review content identically. This lowered reliability below that which could be achieved had the raters assigned a taxonomy item to comments that had already been parsed.
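
The difference between item-level and category-level agreement can be made concrete with a small Python sketch. The item-to-category mapping below is copied from the Appendix for a handful of items; the paired codes are hypothetical and are not data from the study.

```python
# Category symbol for a few taxonomy items, copied from the Appendix.
ITEM_CATEGORY = {
    "12a": "PR",  # Inadequate or unclear description of analytic methods
    "12i": "S",   # Alternate analytic strategy suggested
    "9b": "PR",   # Failure to present stratified or adjusted data
    "12h": "M",   # Failure to analyze data for confounding/effect modification
    "7a": "D",    # No power calculation
}


def agreement(pairs):
    """Return (item-level, category-level) agreement proportions for paired codes."""
    same_item = sum(a == b for a, b in pairs)
    same_category = sum(ITEM_CATEGORY[a] == ITEM_CATEGORY[b] for a, b in pairs)
    return same_item / len(pairs), same_category / len(pairs)


# Hypothetical codes assigned by two raters to the same four comments.
# The first pair mirrors the McNemar example discussed above (12a versus 12i).
pairs = [("12a", "12i"), ("9b", "12a"), ("7a", "7a"), ("12h", "9b")]
item_rate, category_rate = agreement(pairs)
print(item_rate, category_rate)
```

In this toy data the second pair disagrees on the item but lands in the same category, whereas the 12a versus 12i pair disagrees at both levels, which is the kind of process-versus-presentation split described in the paragraph above.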

Our confidence in the findings of this analysis is limited primarily by the reliability of our reviewer comment taxonomy. Although the mediocre reliability might produce a somewhat biased estimate of the difference in comment types between methodology and content reviewers, it is unlikely that this bias is large enough to eliminate the observed difference. Additionally, this study focuses on 2 reviewers at a single journal in a single specialty, and our results might not generalize to other reviewers at other journals.

In summary, this small study provides evidence that 2 dedicated methodology and statistical reviewers provided reviews that were generally consistent and emphasized methodology issues that were distinct from those raised by regular reviewers. Although these findings are insufficient to establish the value of dedicated methodology review, they do highlight the potential of such reviews to improve the methodologic quality of manuscripts.

Appendix 

The 99-item taxonomy of reviewer comments.
Symbol  Category/Item

        1. General
PM      a. Failure to comply with CONSORT
           Comment: See also 5b
O       b. Inadequate definition of non-statistical terms
           Comment: See 12b for statistical terms
O       c. Not an experiment or observational study (case series, etc.)
PR      d. Abstract does not adequately/fairly portray paper
           Comment: If other reporting problems, treat as if they are in main text
O       e. No novel content, no novel methods

        2. Purpose
H       a. No statement of purpose
H       b. Purpose too general
M       c. Failure to differentiate between a descriptive and experimental purpose. Typically, a paper with no hypothesis but lots of stats tests.
           Comment: Consider 12l
D       d. Study question does not match stated purpose of study
H       e. Paper asks the wrong question

        3. Theoretical Model/Experimental Design
H       a. Absence of theoretical model
           Comment: An influence diagram of what is being studied
H       b. Theoretical model present, but not explicitly stated and should be
H       c. Theoretical model incomplete/overly simplified
           Comment: Model fails to include key factors
PM      d. Failure to adequately describe study design or study methods
D       e. Failure to justify choice of study design
           Comment: Why is a retrospective study better than a prospective one
D       f. Failure to include relevant control/contrast/other groups
D       g. Failure to collect data on potential confounding variables

        4. Hypothesis
H       a. No hypothesis (when there should be)
H       b. Hypothesis seemingly present but not stated
           Comment: Usually experiments
H       c. Hypothesis, but not stated in testable format

        5. Methods: Sampling
D       a. Failure to define and/or follow clear, executable inclusion/exclusion criteria
PM      b. Failure to show how sample was developed from population (no Figure 1 per CONSORT statement)
D       c. Nonrandom sample accrued in systematic (biased) way
PM      d. Failure to describe randomization
D       e. Random allocation claimed but is actually systematic
D       f. Randomization should have been stratified

        6. Methods: Blinding
D       a. No statement regarding blinding when one is needed
D       b. Blinding discussed, but discussion is not adequate
D       c. Blinding discussed, but the design is not adequate to truly blind

        7. Methods: Power
D       a. No power calculation
PM      b. Power calculation: insufficient description
D       c. Power calculation: wrong
D       d. Power calculated on different model than that used for analysis
D       e. Power calculated on outcome other than primary outcome measure

        8. Methods: Outcomes
D       a. Failure to be explicit about what outcome is primary/important
D       b. Outcome measure not validated
D       c. Outcome measure is irrelevant (intermediate outcome)
D       d. Failure to present reliability data or problem with reliability metric

        9. Results: Presentation of data
S       a. Statistics used to compare baseline characteristics (and they should not be) (eg, P values in Table)
PR      b. Failure to present stratified data or data adjusted for baseline differences
           Comment: This is presentation, not analysis (see 12h)
M       c. Need to clarify or correct data that seem odd or in error
PR      d. Change format (show raw data instead of percentages, show denominators)
PR      e. Watch significant figures
M       f. Request for additional data/information

        10. Results: Accrual of data
PR      a. Failure to account for all subjects
           Comment: Also see 5b for patient entry
PR      b. Failure to describe missing/poor data problems
D       c. Reporting/recall bias

        11. Results: Management of data
PM      a. Failure to describe process for cleaning/refereeing data (includes categorization)
           Comment: Includes cutpoints, etc.

        12. Analysis: Methods and results
PR      a. Inadequate or unclear description of analytic methods
S       b. Unusual or incorrect definition of statistical terms
PR      c. Failure to name statistical analysis software or cite methods
PR      d. Results and methods out of synch (analytic methods presented but never used, etc.)
M       e. Analytic model does not match theoretical model (too simplistic, variables omitted, etc.)
I       f. Statistical significance emphasized over clinical significance
           Comment: P values presented instead of magnitude
S       g. Multiple tests instead of single test of overall model
M       h. Failure to analyze data for confounding/effect modification (through stratification or modeling)
           Comment: This is analysis; for presentation use 9b
S       i. Alternate analytic strategy (choice of stats tests, etc.) suggested
S       j. Additional analytic strategies suggested
PR      k. P values presented without CIs or means without P values or CIs
M       l. Multiple stats tests in the absence of H's (P values used when no H stated)
           Comment: See 2c
PR      m. Should present CI of differences between values, not CIs of values
S       n. Parametric statistics used on nonparametric data
S       o. Continuous statistics used on categoric data
M       p. Statistical testing of underpowered secondary outcomes (eg, side effects) OR use of more variables than the N can support
M       q. Statistical significance used to create model (stepwise regression, etc.)
M       r. Predictive model created, but not validated

        13. Studies of diagnostic tests
D       a. Spectrum bias (wrong cases included in analysis)
D       b. Failure to independently measure gold standard
S       c. Predictive statistics used when testing statistics more appropriate
PR      d. No CIs on testing parameters
D       e. CI too wide (N too small)
I       f. Failure to consider/discuss/adjust for misclassification bias

        14. Graphics (figures and tables)
PR      a. Failure to show distribution (mean graphed as bar) [alter graph]
           Comment: Also “add CIs to a table”
PR      b. Failure to depict/describe potential confounders [alter graph]
PR      c. Remove chart junk (3D nonsense, gridlines, redundant labels)
PR      d. Modify/correct graphic or table (in ways other than 3a, 3b, or 3c)
PR      e. Add a figure/table
PR      f. Drop a figure/table
PR      g. Contradiction between text and a table/figure

        15. Conclusions
I       a. Not supported by the data
I       b. Go beyond target population of sample

        16. Limitations
I       a. No limitations section (“Add a limitations section”)
I       b. Failure to consider alternative explanations
I       c. Failure to consider problems with internal validity
I       d. Failure to consider problems with external validity
I       e. Failure to conduct sensitivity analysis

        17. Non-methodology/statistical comments
O       a. Introduction and discussion longer than methods/results (more a review paper than research). Also includes “Discussion too long”
O       b. Reference is incorrect
O       c. References not up to date
O       d. Language and style problems
O       e. Has been done before (not sufficiently original)

O       18. Other (statistical/methods not covered in 1-16)

PM, Presentation of methods; O, other comments (non-methodology/statistical); PR, presentation of results; H, hypothesis/purpose/theoretic model; M, research and analytic methods; D, study design/power; S, statistical methods; I, interpretation of results/limitations.

References 

  1. Fletcher R, Fletcher S. The effectiveness of editorial peer review. In: Jefferson T editors. Peer Review in Health Sciences. London, United Kingdom: BMJ Books; 1999;p. 45–56
  2. Sweitzer BJ, Cullen DJ. How well does a journal's peer review process function? A survey of authors' opinions. JAMA. 1994;272:152–153
  3. Pierie JP, Walvoort HC, Overbeke AJ. Readers' evaluation of effect of peer review and editing on quality of articles in the Nederlands Tijdschrift voor Geneeskunde. Lancet. 1996;348:1480–1483
  4. Goodman SN, Berlin J, Fletcher SW, et al. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med. 1994;121:11–21
  5. Rennie D. Editorial peer review: its development and rationale. In: Jefferson T editors. Peer Review in Health Sciences. London, United Kingdom: BMJ Books; 1999;p. 3–13
  6. Cho MK, Bero LA. Instruments for assessing the quality of drug studies published in the medical literature. JAMA. 1994;272:101–104
  7. Justice AC, Berlin JA, Fletcher SW, et al. Do readers and peer reviewers agree on manuscript quality? JAMA. 1994;272:117–119
  8. Rothwell PM, Martyn CN. Reproducibility of peer review in clinical neuroscience. Is agreement between reviewers any greater than would be expected by chance alone? Brain. 2000;123:1964–1969
  9. Gardner MJ, Bond J. An exploratory study of statistical assessment of papers published in the British Medical Journal. JAMA. 1990;263:1355–1357
  10. Schriger DL, Cooper RJ, Wears RL, et al. The effect of dedicated methodology and statistical review on published manuscript quality. Ann Emerg Med. 2002;40:334–337

 Author contributions: DLS, RLW, and FCD designed the study. CT, FCD, and DLS collected and analyzed the data. FCD wrote the initial draft, and FCD, DLS, and RLW edited the manuscript. FCD and DLS take responsibility for the paper as a whole.

☆☆ The 99-item taxonomy of reviewer comments is included as an Appendix in the full-text, online version of this article. Access the Annals' Web site at www.mosby.com/AnnEmergMed. Information is also available at ACEP's home page at www.acep.org/AnnEmergMed.

 Reprints not available from the authors. Address for correspondence: Frank C. Day, MD, MPH, 924 Westwood Boulevard, Suite 300, Los Angeles, CA 90048; fax 310-794-0599; E-mail fday@ucla.edu

PII: S0196-0644(02)00048-3

doi:10.1067/mem.2002.127326

Refers to article:

  • Research into peer review and scientific publication: Journals look in the mirror

    Michael L. Callaham
    Annals of Emergency Medicine September 2002 (Vol. 40, Issue 3, Pages 313-316)

  • Graphical literacy: The quality of graphs in a large-circulation journal

    Richelle J. Cooper, David L. Schriger, Reb J.H. Close
    Annals of Emergency Medicine September 2002 (Vol. 40, Issue 3, Pages 317-322)

  • Effect of structured workshop training on subsequent performance of journal peer reviewers

    Michael L. Callaham, David L. Schriger
    Annals of Emergency Medicine September 2002 (Vol. 40, Issue 3, Pages 323-328)

  • The effect of dedicated methodology and statistical review on published manuscript quality

    David L. Schriger, Richelle J. Cooper, Robert L. Wears, Joseph F. Waeckerle
    Annals of Emergency Medicine September 2002 (Vol. 40, Issue 3, Pages 334-337)
