Outcome Measures, Interim Analyses, and Bayesian Approaches to Randomized Trials:
Answers to the September 2009 Journal Club Questions
Discussion Points
A. What are the qualities of an ideal outcome measure?
B. There are multiple types of outcome measures in clinical trials, including biomarkers, surrogate (aka intermediate) endpoints, and clinical (aka patient-centered) endpoints. Discuss the benefits and limitations of each of these outcome measures. Can you identify a recent high-profile case in which a drug with successful surrogate marker trials failed to demonstrate success when clinical endpoints were measured?
C. In this small, randomized, controlled trial (RCT) of tamsulosin for the treatment of distal ureterolithiasis, the authors chose to study the primary outcome of successful spontaneous passage of renal calculi by 14 days. Is this a reasonable outcome measure to evaluate tamsulosin's efficacy? How might a patient's failure to recognize stone passage affect results? Postulate why the authors used the passage of calculi by 14 days rather than time to passage as the primary outcome.
D. Large clinical trials often use a composite outcome measurement. For example, a new heart failure trial might measure a composite outcome of death, emergency department (ED) visits, and/or unplanned hospitalization within 30 days. What are the advantages of using a composite endpoint? How might a composite outcome affect the trial's conclusions?
E. This small trial also examines many secondary outcomes such as days of work missed, time to stone passage, adverse events, number of pain episodes, amount of pain medication used, and the number of return visits. What are the benefits of secondary outcomes, and what does a lack of a clinically important change in secondary outcomes usually mean?
A. Speculate why the editors deemed this study sufficiently important to warrant publication in Annals.
B. How did the ED-based study population and the other methodology factors discussed in answer 2a affect this study's results compared with previous studies?
A. It goes without saying that randomization is a central issue in the conduct of randomized clinical trials. Discuss the pros and cons of the following randomization techniques: simple randomization, prerandomization, block randomization, and stratified block randomization. What randomization technique did these authors select? How might a different randomization strategy have influenced the study results?
B. Do you feel that the lack of placebo tablet in this trial altered the results? What if the primary outcome were patient ratings of their pain? Do you think it would have been more or less important to blind patients to their treatment assignment through the use of a placebo tablet?
C. Imagine you are designing a placebo-controlled drug trial to treat renal colic in the ED. What practical considerations about the inclusion of the placebo do you need to weigh during the trial design? How might the study's conclusions be affected if the participants or the investigators measuring the outcome can differentiate active drug from placebo? How might investigators measure whether study participants are truly blinded to whether the treatment is active or placebo?
A. What are the reasons that investigators perform interim analysis and why might a trial be stopped? How do planned interim analyses affect the study sample size? Why is it important to specify planned interim analyses in a clinical trial that is using a classic statistical analysis?
B. The Bayesian statistical approach provides an alternative, and many would argue more robust, strategy for trial design and analysis. Describe the major differences between the Bayesian statistical approach and the more commonly used frequentist statistical approach with regard to hypothesis testing, interim analyses, analyzing the data, and interpreting the study's results.
Answer 1
Q1. A well-designed clinical trial will be meaningless if the outcome measure is inappropriate, irrelevant, or unhelpful.
Q1.a What are the qualities of an ideal outcome measure?
We find it helpful to consider separately the conceptual (is the right thing being measured) and technical (is the measure reliable and valid) aspects of outcome measures. An ideal outcome measure gets high marks in both categories. A conceptually ideal outcome directly measures an important change in the patients' health status that is the result of the study intervention. This includes outcomes such as pain, quality of life measurements, and death. Clinical (patient-oriented) outcomes are always conceptually preferred to intermediate outcomes that do not directly measure changes in patient health status. The differences between these outcome measures will be further discussed in answer 1b.
Important technical aspects of an ideal outcome include reliability (does the measure produce the same result, given the same conditions) and validity (does the outcome accurately measure the intended construct). An unfortunate truth in clinical research is that outcomes with the best technical characteristics often measure inconsequential phenomena, whereas measures of important outcomes often have poor reliability or validity. One of the great challenges of clinical research is to find an outcome that scores high marks in both areas.
Q1.b There are multiple types of outcome measures in clinical trials, including biomarkers, surrogate (aka intermediate) endpoints, and clinical (aka patient-centered) endpoints. Discuss the benefits and limitations of each of these outcome measures. Can you identify a recent high-profile case in which a drug with successful surrogate marker trials failed to demonstrate success when clinical endpoints were measured?
From a conceptual perspective, patient-centered (aka clinical) outcomes are always preferred because they directly measure how the intervention affects the patient's quality of life. A child with asthma is bothered by cough or dyspnea or has limited exercise tolerance or misses school because of his condition. It is these outcomes, not the change in 1-second forced expiratory volume, that matter to him. The professional baseball pitcher measures the success of Tommy John surgery by freedom from pain when pitching and the velocity of his fastball, not the limits of his range of motion as measured by a goniometer. Clinical outcomes capture changes in the patient's feelings, functioning, or survival. A trial that demonstrates a clinically important difference in a validly measured, clinically important outcome provides us with the best possible evidence.
Unfortunately, there is no such thing as a free lunch, and the conceptual advantages of clinical outcomes butt up against an important list of logistic and technical limitations. For many conditions, patient-centered outcomes change slowly. The required measurements, conducted during a long period, increase logistic complexity and cost. Many patient-centered outcomes are subjective. The accurate measurement of these outcomes is predicated on the existence of reliable, valid measurement instruments whose development is often time consuming and costly. The Table lists common benefits and limitations associated with patient-centered and surrogate outcome measures.
Table. Benefits and limitations of clinical and surrogate outcomes.
| Outcome Measure | Benefits | Limitations |
|---|---|---|
| Patient-centered (clinical) | Reflect patient's quality of life; decreased likelihood of reporting false clinical benefit | Often require long duration of follow-up; increased trial expense; larger sample size often required |
| Surrogate/biomarker | Shorter duration to measurable outcome; ease of measurement; useful for detecting outcomes before morbid endpoints; useful when clinical outcomes are rare | Only as reliable as the evidence that links it to a clinical outcome; may overstate the effect of a treatment |
Surrogate or intermediate outcomes “function as a substitute for a patient-based clinical outcome that may be difficult to obtain due to cost, study size, or duration of follow-up needed.”1 Commonly used surrogate markers include angiographic evidence of coronary atherosclerosis as a proxy for cardiovascular mortality, tumor size in place of cancer-associated mortality, and CD4 count instead of incidence of AIDS or death.2 Surrogates are especially useful in safety and proof-of-concept trials because of the shorter time to measurable change, ease of measurement, and lower cost compared with clinical outcomes.3 However, definitive efficacy trials require a true clinical outcome because surrogates have the conceptual limitation of not measuring a change in patient quality of life.
Biomarkers are one type of surrogate outcome that objectively measure normal biological processes, pathologic processes, or pharmacologic responses to a therapeutic intervention.1 Biomarkers share all of the limitations of other intermediate outcomes. They do not directly measure a change in clinical status of the patient. They are only meaningful if there is a strong link between the marker and the clinical outcome.3, 4 In addition to the logistic benefits of all surrogate outcomes, biomarkers theoretically facilitate an improved understanding of disease pathophysiology, including the interaction of genetics, environmental influences, and clinical outcomes.
When a therapy is accepted on the basis of surrogate outcome trials, there is a risk that subsequent adequately powered patient-centered outcome trials may demonstrate a lack of effect or even harm.5 One recent compelling example of this issue involves the medication torcetrapib. Torcetrapib, a novel cholesterol-decreasing medication, functions by inhibiting cholesteryl ester transfer protein.6 The initial clinical trial by Brousseau et al, published in 2004, focused on the intermediate outcomes of high-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol and demonstrated that the medication, when given with or without a statin medication, decreased LDL and increased HDL.6 This was a promising result because previous studies had linked increased HDL and decreased LDL to decreased cardiovascular events.7, 8 A second small study of 162 patients, published in 2006, demonstrated dose-dependent response in HDL and LDL and found the medication to be well tolerated in a small cohort.9 In late 2006, the manufacturer abruptly halted production of the medication and clinical trials because of evidence that had not yet been published, demonstrating increased cardiovascular events and mortality in patients taking the combination of torcetrapib and a statin compared with a statin alone. These findings were later published as part of the Investigation of Lipid Level Management to Understand its Impact in Atherosclerotic Events (ILLUMINATE) trial.10 The Investigation of Lipid Level Management Using Coronary Ultrasound to Assess Reduction of Atherosclerosis by CETP Inhibition and HDL Elevation (ILLUSTRATE) trial added further doubt to the benefit of torcetrapib by demonstrating that the addition of torcetrapib did not halt the progression of atherosclerosis compared with a statin alone.11
These trials underscore the importance of clinical outcomes. All new interventions have a host of effects on the body's physiology, only some of which can be predicted according to known mechanisms and surrogates. Only with extensive clinical outcomes research and postmarketing research can practitioners and patients evaluate the true risk:benefit ratio.
Q1.c In this small, randomized, controlled trial (RCT) of tamsulosin for the treatment of distal ureterolithiasis, the authors chose to study the primary outcome of successful spontaneous passage of renal calculi by 14 days. Is this a reasonable outcome measure to evaluate tamsulosin's efficacy? How might a patient's failure to recognize stone passage affect results? Postulate why the authors used the passage of calculi by 14 days rather than time to passage as the primary outcome.
Patients with ureterolithiasis are concerned about pain and how long it will last. They may also want to know whether their stone will pass spontaneously and whether there is any chance of kidney damage. Conceptually ideal patient-oriented outcome measures address these concerns and might include the number of episodes of pain, the duration of pain, pain severity, the number or percentage of patients requiring procedures for stone removal, and the percentage of patients with a permanent change in renal function. In this study, the authors used stone passage at 14 days as the primary outcome.12 This outcome does not directly measure the symptoms and complications that are most concerning to patients.
Although the 14-day mark may be useful in defining a group that is at low risk for having a renal complication from ureterolithiasis, it may be an insensitive measure of time to stone passage in a group of patients who generally had small stones that were likely to pass quickly. The authors may have chosen the dichotomous 14-day cut point out of concern that patients could not determine exactly when the stone passed.
In this trial, 8 patients (5 in the control and 3 in the treatment arm) were unsure whether they passed a stone by the 14-day follow-up. Misclassification bias occurs when a patient passes a stone but fails to recognize spontaneous stone passage and is erroneously classified in the “No stone passed” group or vice versa. If the misclassification occurs equally between the tamsulosin and control groups, the intervention's effect is diluted toward the null hypothesis, or no difference.13 If it occurs more commonly in one group, then the study might not detect a true association or report a spurious one.14 The authors acknowledged this potential bias and performed a sensitivity analysis, finding no impact on the overall conclusions.
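The arithmetic behind these two kinds of misclassification can be sketched in a few lines. The passage rates and recognition probabilities below are hypothetical values chosen for illustration, not figures from the trial, and the function name is our own.

```python
def observed_rate(true_rate, sensitivity, specificity):
    """Proportion recorded as 'stone passed' when recognition is imperfect.

    sensitivity: chance that a patient who passed a stone reports it.
    specificity: chance that a patient who did not pass a stone reports none.
    """
    return true_rate * sensitivity + (1 - true_rate) * (1 - specificity)


# Hypothetical true 14-day passage rates (assumed, not from the trial).
true_treatment, true_control = 0.80, 0.60
true_diff = true_treatment - true_control

# Nondifferential: both arms recognize 90% of passed stones.
nondifferential_diff = (
    observed_rate(true_treatment, 0.90, 1.00)
    - observed_rate(true_control, 0.90, 1.00)
)

# Differential: no true effect, but one arm recognizes passage more often.
spurious_diff = observed_rate(0.70, 0.95, 1.00) - observed_rate(0.70, 0.85, 1.00)

print(f"true difference:            {true_diff:.2f}")
print(f"nondifferential (observed): {nondifferential_diff:.2f}")
print(f"differential (observed):    {spurious_diff:.2f}")
```

Under these assumptions, equal misclassification shrinks the observed difference by the factor (sensitivity + specificity - 1), whereas unequal misclassification produces an apparent difference where none exists.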
Q1.d Large clinical trials often use a composite outcome measurement. For example, a new heart failure trial might measure a composite outcome of death, ED visits, and/or unplanned hospitalization within 30 days. What are the advantages of using a composite endpoint? How might a composite outcome affect the trial's conclusions?
Much has been written about the advantages15, 16, 17 and disadvantages15, 18, 19, 20, 21 of composite outcome measures. The main theoretic appeal is that a well-constructed composite outcome may capture a more comprehensive picture of the effect of an intervention. Unfortunately, that is not why they are usually used. More typically, they are used to combine a series of rare events (death, need for intervention, or recurrent event), with the hopes of having a sufficient number of “positive” cases to increase the power of a study.
It is troubling that, when composite outcomes are used in this way, the component outcomes are given equal weight; a repeat visit counts just as much as a death,15 which makes little sense. Furthermore, the least significant event, such as the need for an ED visit, is often the most common and drives the differences observed between the groups. Therefore, “statistically significant” differences in the composite outcome often do not imply important differences in the most important components of that outcome. A study that uses the best single outcome may be infeasible because of the N required and its consequent cost. A study that uses the composite outcome may be feasible but risks producing results that distort reality.
Readers must understand that a difference in the composite outcome does not mean that the intervention affects each individual component or even that it affects each of the components in the same direction. Composites can hide an intervention's positive or negative effect on an individual outcome. As Freemantle et al15 detailed in their review of this topic, “the measure of a treatment effect can be diluted by an outcome that exhibits no effect being combined with a more critical measure that individually shows some evidence of benefit.” Therefore, the reader or editor must examine all components of a composite outcome carefully to judge the true clinical utility.
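A small numeric sketch may make this concrete. The event counts below are invented for a hypothetical heart failure trial with 1,000 patients per arm; we assume for simplicity that each patient is counted once, under the first event experienced.

```python
# Hypothetical 30-day event counts per 1,000 patients (illustrative only).
control = {"death": 20, "unplanned hospitalization": 50, "ED visit": 150}
treatment = {"death": 22, "unplanned hospitalization": 50, "ED visit": 110}


def composite(events):
    """Total composite events, assuming one counted event per patient."""
    return sum(events.values())


def component_differences(control_events, treatment_events):
    """Events avoided by treatment for each component (negative means more events)."""
    return {
        name: control_events[name] - treatment_events[name]
        for name in control_events
    }


composite_reduction = composite(control) - composite(treatment)
by_component = component_differences(control, treatment)

print(f"composite events avoided: {composite_reduction}")
for name, avoided in by_component.items():
    print(f"  {name}: {avoided}")
```

In this made-up example the composite favors treatment, yet the entire reduction comes from ED visits, and deaths are slightly more frequent in the treatment arm. Examining each component separately is what reveals this.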
Q1.e This small trial also examines many secondary outcomes such as days of work missed, time to stone passage, adverse events, number of pain episodes, amount of pain medication used, and the number of return visits. What are the benefits of secondary outcomes, and what does a lack of a clinically important change in secondary outcomes usually mean?
Secondary outcomes help us understand the full effect of an intervention in the context of the primary outcome. Investigators should clearly define both primary and secondary outcomes before study initiation and register them in an online database such as http://www.clinicaltrials.gov. Unplanned analysis of secondary outcomes must always be thought of as hypothesis-generating rather than hypothesis-testing.
There is an important caveat when interpreting results for secondary outcomes, particularly those related to patient safety. The general concern is that a trial that is adequately powered to detect important differences in the primary outcome may be underpowered to do the same for secondary outcomes. Consider a 2-limb, placebo-controlled, randomized trial with a binary primary outcome and secondary outcomes. The investigators believe that 40% of the placebo group will have a good outcome and are interested in seeing (with 80% power) whether the intervention can achieve good outcomes in 60% of cases. By typing “sampsi .4 .6, p(.8)” into Stata (StataCorp, College Station, TX), we find that 107 subjects are needed in each limb. But what if the secondary outcome was gastrointestinal bleeding, a safety concern, and we were worried that the gastrointestinal bleeding rate would double from 2% to 4%? If we calculated how many patients were needed to have 80% power to find this difference (sampsi .02 .04, p(.8)), we would find that 1,239 subjects were needed in each limb! An investigator who used the nonsignificance of the difference in gastrointestinal bleeding in the 214-patient study to proclaim “safety” would be misrepresenting the study's findings because that trial had almost no chance (actually about 6%) of finding such a difference if it did exist.
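For readers without Stata, the same calculation can be sketched in Python. We assume here that the Stata figures come from the usual normal-approximation formula for comparing 2 proportions (2-sided alpha of .05) with a continuity correction; the function names are our own, and the power function is an approximation that may not match Stata's output exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()  # standard normal distribution


def n_per_group(p1, p2, power=0.80, alpha=0.05):
    """Subjects per limb to detect p1 vs p2 (2-sided), continuity corrected."""
    z_alpha = _NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _NORMAL.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    # Uncorrected sample size from the normal approximation.
    n = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / delta**2
    # Continuity correction for equal-sized groups.
    n_corrected = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return ceil(n_corrected)


def approx_power(p1, p2, n, alpha=0.05):
    """Approximate power with n subjects per limb (continuity corrected)."""
    z_alpha = _NORMAL.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2) - 1 / n  # continuity correction, equal groups
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return _NORMAL.cdf((delta - z_alpha * se_null) / se_alt)


print(n_per_group(0.40, 0.60))   # primary outcome
print(n_per_group(0.02, 0.04))   # gastrointestinal bleeding
print(round(approx_power(0.02, 0.04, 107), 3))
```

Under these assumptions the first 2 calls return 107 and 1,239 subjects per limb, matching the figures quoted above, and the approximate power of the 107-per-limb trial to detect a doubling of bleeding from 2% to 4% is well under 10%.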
The bottom line is that one must carefully interpret results for secondary outcomes and place them in proper context. Secondary outcomes provide an important part of the story, as long as they are interpreted properly. The safety of most interventions cannot be established through RCTs but only through surveillance databases that monitor adverse effects in large numbers of patients; think Fen-Phen.22
Answer 2
Q2. Clarke et al23 have argued that every published RCT should begin with a systematic review of previous studies and conclude with a revised systematic review that shows how the RCT changes our belief about the topic. They suggest that this structure would foster the conduct of meaningful research.
Q2.a Speculate why the editors deemed this study sufficiently important to warrant publication in Annals.
Because we were not the editors who accepted this paper, we cannot state why it was accepted. Instead, we offer our speculations about the rationale for publication. Consider the following arguments made by 2 hypothetical editors:
Which editor is right? Clarke et al23 urge us to begin the research process with a formal meta-analysis. Although this concept is clear, its execution is not. When attempting to follow this suggestion, should we include only those studies that enrolled unselected ED patients? Only those that studied this drug at this dose? Only those that used stone passage within 14 days as the primary outcome? Your answer will depend on your view of the relevance of the previous knowledge. The pretrial analysis could contain 10,000 patients, no patients, or anything in between, depending on one's inclusion rules. Editor A implies that the previous knowledge is weak and this study's unique population and inclusion criteria render it important. Editor B gives greater weight to the relevance of existing knowledge, thereby diminishing the value of this 80-patient study. Who is right?
One solution is to perform an individual patient meta-analysis. By combining results for individual patients from studies rather than results of entire studies, we might be able to develop previous knowledge most relevant to this trial.24, 25 Without a methodology for defining the relevant previous knowledge, it is difficult to know how to place this article in context. We suspect that the accepting editor was aware of this dilemma but decided that it was important to get this information into the literature, even though we cannot be sure whether the findings are real or represent type II error. Such a decision reflects the growing consensus that it is important to publish methodologically sound “negative” trials.
Q2.b How did the ED-based study population and the other methodology factors discussed in answer 2a affect this study's results compared with previous studies?
As discussed in the previous answer, it is impossible to know why this trial found no difference between tamsulosin and the control group, whereas previous studies have demonstrated α-blockers to be effective.26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36 Discordant results between small clinical trials could be due to chance or could reflect differences in study design. This trial differed from previous α-blocker trials in the following: primary outcome, population selection and setting, treatment comparison, combination drug treatments, and choice of specific α-blocker. These differences in study design may alter the distributions of variables associated with stone passage—stone size, stone location, duration of symptoms, and affected side—thereby affecting study results.37, 38
This trial's primary outcome measure, proportion of stones passed in 14 days, differs from that of many trials that use time until expulsion as the primary metric.26, 27, 28, 29, 30, 31, 32, 33, 35 Most patients with small stones will pass them within 14 days, rendering the outcome measure insensitive. The authors did compare time to stone passage as a secondary outcome and found that the median number of days to stone passage was 1 day (95% confidence interval 0 to 2 days) in the tamsulosin group compared with 3 days (95% confidence interval 2 to 4 days) in the standard therapy group. This difference, although not “statistically significant,” may be clinically important and the lack of statistical significance may have more to do with the small sample size than the absence of an effect.
The most important factor in stone passage is stone size.37, 38 A commonly cited previous study included only patients with stones greater than 5 mm, whereas other trials included subgroups with greater than 5-mm stones.34, 36 The mean stone size in previous studies ranged from 4.3 mm (Han et al33) to 7.8 mm (Resim et al39) compared with the 3.5 and 3.8 mm observed in this study. α-Blocker therapy is less likely to be beneficial in populations with smaller stones because of increased spontaneous passage with small stones.40 Contrary to previous trials, Ferre et al12 explicitly excluded patients with previous urologic surgery or endoscopic intervention.28, 32, 36 Previous intervention or surgery likely corresponds with more difficult stone passage, thus increasing the potential for tamsulosin to aid with stone expulsion. Many of the differences mentioned above, as well as other unidentified factors, can be attributed to the referral bias of previous urology clinic–based studies.27, 28, 30, 32, 36, 39, 41, 42 Trials that enroll patients from urology clinics likely reflect populations that have had more difficulty passing their stones.
Answer 3
Q3. In this study the authors randomized patients to a 10-day course of standard analgesic therapy or tamsulosin plus standard therapy. Study participants were not blinded to their treatment assignment because the investigators elected not to administer placebo tablets to patients randomized to the standard analgesic-only arm.
Q3.a It goes without saying that randomization is a central issue in the conduct of randomized clinical trials. Discuss the pros and cons of the following randomization techniques: simple randomization, prerandomization, block randomization, and stratified block randomization. What randomization technique did these authors select? How might a different randomization strategy have influenced the study results?
Randomization, conducted properly on a sufficient number of subjects, should produce study groups that are likely to have the same outcome if given the same treatment. In other words, there should be no confounding. Unfortunately, these conditions are seldom met, and the typical clinical study has insufficient numbers of subjects to ensure that groups are similar (see the May 2008 Journal Club answers for further discussion on sufficient sample size in clinical trials).43 Although statistical techniques can attempt to account for imbalances in potential confounding variables among limbs in a clinical trial, there is no way to adjust for unknown or unmeasured confounders. Consequently, investigators should choose the randomization technique that maximizes the likelihood of producing comparable groups.
Simple randomization assigns subjects to a treatment limb using some type of random number generator and a rule that says how each random number should be handled. For example, if one wanted equal numbers of patients in each limb of a 3-limb RCT, one would produce a random number between 0 and 1 for each subject and use the following rule: assign numbers between 0 and 0.333 to group 1, between 0.333 and 0.666 to group 2, and between 0.666 and 1 to group 3. Although this is the most straightforward method of randomization, in small trials, by chance alone, it can produce limbs with widely varying numbers of subjects and considerable variability in the characteristics of subjects in each limb. For example, when 60 patients are randomized to 3 limbs, 10 of whom have a severe form of the disease, it is quite possible that 7 of the sicker patients will be randomized to one limb. This would render the trial difficult to interpret, even though it is a legitimately performed RCT.
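The rule described above, and the chance of the imbalance in the example, can be sketched as follows. The function names are our own, and the probability calculation assumes each patient is assigned independently with equal probability to each limb.

```python
import random
from math import comb


def simple_randomize(n_subjects, rng):
    """Assign each subject to limb 1, 2, or 3 from a uniform random number."""
    groups = []
    for _ in range(n_subjects):
        u = rng.random()  # uniform on [0, 1)
        if u < 1 / 3:
            groups.append(1)
        elif u < 2 / 3:
            groups.append(2)
        else:
            groups.append(3)
    return groups


def prob_some_limb_gets_at_least(k, n_severe=10, n_limbs=3):
    """Exact chance that any one limb receives >= k of the severe patients.

    Valid only when 2 * k > n_severe, so that at most one limb can qualify.
    """
    if 2 * k <= n_severe:
        raise ValueError("requires 2 * k > n_severe")
    p = 1 / n_limbs
    tail = sum(
        comb(n_severe, j) * p**j * (1 - p) ** (n_severe - j)
        for j in range(k, n_severe + 1)
    )
    return n_limbs * tail


assignments = simple_randomize(60, random.Random(2009))  # arbitrary seed
print([assignments.count(g) for g in (1, 2, 3)])  # limb sizes need not be 20 each
print(round(prob_some_limb_gets_at_least(7), 3))
```

Under these assumptions, the chance that 7 or more of the 10 severely ill patients land in a single limb is roughly 6%, which is small for any one trial but far from negligible across the many small trials that are conducted.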
If an imbalance in the number of subjects per study limb is of concern, block randomization can be used. Block randomization ensures near-equal numbers of patients in each limb by performing the randomization on blocks of patients. For example, in a 3-limb trial, we might randomize 9 patients at a time, ensuring that 3 will be assigned to each limb. In that way, at the end of every 9 patients, the N for each limb will be the same (assuming no dropouts, protocol failures, etc) and at no time will the N for one limb exceed another by more than 3.
The danger of block randomization is that it can defeat allocation concealment. Imagine in our 3-limb trial that patients assigned to placebo had no immediate adverse effects, whereas patients assigned to treatment 1 got a red blotchy rash and those assigned to treatment 2 got severe chills. If we randomized in blocks of 9 and in the last 8 patients, 3 had no adverse effects, 3 had a blotchy rash, and 2 had chills, we can suspect that the next enrolled patient will be assigned to treatment 2. This corruption of allocation concealment could affect trial results if persons enrolling subjects offered enrollment selectively according to this knowledge, thereby handpicking the subject to get treatment 2. One can attempt to minimize this threat to allocation concealment by avoiding very small block sizes and by varying the block size so that one cannot assume that any particular patient completes a block.
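Both the balancing property and the threat to allocation concealment can be illustrated with a short sketch. The arm labels follow the example above; the function names and the choice of block sizes are our own assumptions.

```python
import random
from collections import Counter

ARMS = ("placebo", "treatment 1", "treatment 2")


def permuted_block(per_arm, rng):
    """One block containing each arm per_arm times, in random order."""
    block = [arm for arm in ARMS for _ in range(per_arm)]
    rng.shuffle(block)
    return block


def block_randomize(n_subjects, rng, per_arm_choices=(3,)):
    """Allocation schedule built from permuted blocks.

    per_arm_choices=(3,) gives fixed blocks of 9; several values, such as
    (2, 3, 4), vary the block size at random to protect concealment.
    """
    schedule = []
    while len(schedule) < n_subjects:
        schedule.extend(permuted_block(rng.choice(per_arm_choices), rng))
    return schedule[:n_subjects]


def forced_next_arm(seen_in_block, per_arm=3):
    """Arm that must come next in a fixed-size block, or None if not forced."""
    counts = Counter(seen_in_block)
    remaining = [arm for arm in ARMS if counts[arm] < per_arm]
    return remaining[0] if len(remaining) == 1 else None


schedule = block_randomize(27, random.Random(0))
print(Counter(schedule))  # 9 per arm after 3 complete blocks

# The scenario in the text: 8 patients into a block of 9.
seen = ["placebo"] * 3 + ["treatment 1"] * 3 + ["treatment 2"] * 2
print(forced_next_arm(seen))  # -> treatment 2
```

With randomly varying block sizes, an observer cannot know where a block ends, so a calculation like `forced_next_arm` no longer identifies the next assignment with certainty.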
Stratified randomization can be used when there is concern that a certain type of patient (eg, those with a more severe form of the disease) might randomly be assigned to a particular treatment, thereby creating an imbalanced, potentially confounded study. By conducting separate randomizations for patients with severe and nonsevere disease, investigators can ensure that such patients will be evenly distributed among groups. This method assumes that patients can be easily and reliably sorted into strata before randomization takes place. It also assumes that the important confounders are known. One cannot stratify on an unknown potential confounder.
Stratified block randomization combines stratified randomization and block randomization in an attempt to create equal-sized limbs with comparable patients (at least with respect to the characteristics used in the stratification).
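One way to implement this combination is to keep a separate sequence of permuted blocks for each stratum, as in the minimal sketch below. The class name, arm labels, and strata are our own illustrative choices; a real trial would also need to conceal the schedule from enrolling staff.

```python
import random
from collections import Counter, defaultdict

ARMS = ("treatment", "control")


class StratifiedBlockRandomizer:
    """Permuted-block randomization carried out separately in each stratum."""

    def __init__(self, per_arm=2, seed=None):
        self.per_arm = per_arm
        self.rng = random.Random(seed)
        # Unused assignments remaining in the current block of each stratum.
        self.pending = defaultdict(list)

    def assign(self, stratum):
        """Return the next assignment for a patient in the given stratum."""
        if not self.pending[stratum]:
            block = [arm for arm in ARMS for _ in range(self.per_arm)]
            self.rng.shuffle(block)
            self.pending[stratum] = block
        return self.pending[stratum].pop()


randomizer = StratifiedBlockRandomizer(per_arm=2, seed=1)
arrivals = ["severe", "nonsevere", "nonsevere", "severe",
            "severe", "nonsevere", "severe", "nonsevere"]
tally = defaultdict(Counter)
for stratum in arrivals:
    tally[stratum][randomizer.assign(stratum)] += 1

for stratum, counts in tally.items():
    print(stratum, dict(counts))
```

Because each stratum completes its own blocks, patients with severe and nonsevere disease are each split evenly between the limbs whenever a block is full, regardless of the order in which patients arrive.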
In prerandomization, the randomization takes place before the subject is approached for enrollment. In that way, the patient knows what treatment he will receive before he offers consent. In some trials, uncertainty about which treatment will be received results in many patients refusing to participate. This produces a study population that might not match the general population (a threat to external validity). It also may make the study difficult to conduct because enrollment proceeds slowly. Prerandomization can solve this problem but at the price of introducing imbalance among limbs if there is differential refusal to participate among those offered each limb. For example, in a study comparing medical and surgical management of abdominal aortic aneurysms, patients with aneurysms greater than 10 cm who are offered medical management and patients with aneurysms less than 6 cm who are offered surgical management might refuse to participate. The limbs would then differ systematically in aneurysm size, and the consequences of this imbalance for study interpretation are obvious.
This study used simple randomization conducted after enrollment. With only 80 subjects, this cannot be considered a large trial. It is therefore quite possible, even likely, that the groups will differ by chance in some important way. For instance, it can be seen in Table 1 in the article by Ferre et al12 that the percentage of men and women was not equal between the groups. Although this might not be clinically important, similar differences for another variable might be. The most obvious potential confounding variable in this study is stone size, and there is no reason why patients could not have been randomized in 2 or more strata according to the size of their obstructing stone. This would have increased the likelihood that the groups were comparable in this respect. There is no obvious downside to stratified randomization in this study.
Q3.b Do you feel that the lack of placebo tablet in this trial altered the results? What if the primary outcome were patient ratings of their pain? Do you think it would have been more or less important to blind patients to their treatment assignment through the use of a placebo tablet?
Placebos are used to reduce bias. It is well established that patients who receive sham treatment feel and do better than those who receive nothing. A trial of “something” versus “nothing” will always favor “something” even if the “something” is a placebo. Furthermore, by using a placebo control investigators are able to maintain blinding so that patients, providers, outcome assessors, and analysts do not know which treatment each patient received. Placebo controls help investigators design experiments that are not confounded, meaning that if both groups received the same treatment, they would have the same outcome. When that is true, any difference between groups can be attributed to the treatment.
In this trial, it is possible but unlikely that the lack of placebo tablet altered results for the primary outcome in an important way. The lack of a placebo could bias the outcome in the following ways:
Although we do not believe that any of these phenomena exerted an important effect, we are somewhat puzzled about why the investigators did not use some type of placebo pill. It would have been easy and inexpensive (the placebo pills would not have to be identical to the tamsulosin pills because the goal was to blind the patients, not the investigational staff) and there was no downside. A placebo would also have reduced the possibility of confounding for the secondary outcomes “pain severity” and “days until pain free,” subjective outcomes that are more vulnerable to placebo effect, especially if participants believed that the experimental tablet would provide pain relief.
Q3.c Imagine you are designing a placebo-controlled drug trial to treat renal colic in the ED. What practical considerations about the inclusion of the placebo do you need to weigh during the trial design? How might the study's conclusions be affected if the participants or the investigators measuring the outcome can differentiate active drug from placebo? How might investigators measure whether study participants are truly blinded to whether the treatment is active or placebo?
Placebo controls should be used only when there is no evidence that an existing treatment is better than placebo. In renal colic, we know that certain analgesics are better than placebo in relieving pain. For this reason, it would be unethical to test a new analgesic against placebo. It should be tested against the standard treatment. In this case, the effect of tamsulosin-type agents on stone passage is sufficiently uncertain that a placebo-controlled trial is both ethical and preferred. A study that tested tamsulosin against another drug in its class would not determine whether either agent is better than placebo.
The benefits of placebo control were discussed in the previous answer. All of the benefits of placebo are predicated on the placebo being indistinguishable from the active drug. In single-blind studies, this means that patients should not be able to tell which drug they are getting. Had this study been placebo controlled, it is possible that those receiving active medication would have experienced mild orthostatic symptoms that would have told them that they were receiving the active drug. This would render the treatment patients unblinded and could introduce bias. It would be difficult to design a placebo that perfectly mimicked the drug's adverse effect profile without having any other clinical influences. In other situations, it is easier to create a placebo that mimics the active drug. For example, if a treatment has a certain taste, it might be possible to make a placebo that has the same taste but no active ingredient.
In double-blind studies, the construction of the placebo is more difficult because it has to be sufficiently similar to the treatment in both appearance and adverse effects that treatment and placebo are indistinguishable to clinicians and research staff. For many studies, this simply means making a pill or intravenous solution that is identical in appearance to the treatment; for others, it is far more complicated. What is the placebo for laminectomy for sciatica? What is the placebo control for acupuncture? Massage therapy? Oral activated charcoal?
One way to measure the success of a placebo is to ask patients and providers which treatment had been given. If the placebo worked well, patients and providers should be wrong just as often as they are right. If either group consistently identifies the assignment correctly, then the placebo was ineffective. Subjects can then be asked how they knew so that future studies can avoid the problem.
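One simple way to quantify "wrong just as often as right" is to compare the number of correct guesses with the 50% expected under successful blinding. The sketch below uses an exact 2-sided binomial test; the counts (80 patients, 42 or 60 correct guesses) are hypothetical, and this is only one of several possible approaches to assessing blinding.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact 2-sided binomial P value: total probability of all outcomes
    that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

# Hypothetical results: each of 80 patients guesses his or her assignment.
well_blinded = binomial_two_sided_p(42, 80)  # 42 of 80 guesses correct
unblinded = binomial_two_sided_p(60, 80)     # 60 of 80 guesses correct

print(f"42/80 correct: P = {well_blinded:.3f}")
print(f"60/80 correct: P = {unblinded:.2g}")
```

A guess rate near 50% is compatible with intact blinding, whereas a rate far above 50% suggests that something (adverse effects, taste, or the clinical response itself) revealed the assignment.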
Answer 4
Q4. In this study the investigators performed an interim analysis and found no issues prompting early stoppage of the trial.
Q4.a What are the reasons that investigators perform interim analyses and why might a trial be stopped? How do planned interim analyses impact the study sample size? Why is it important to pre-specify planned interim analyses in a clinical trial that is using a classical statistical analysis?
First, it is important to distinguish “interim monitoring” from “interim statistical analyses.” Many clinical trials have ongoing interim monitoring to verify study protocol compliance, ensure accurate data management, and monitor the frequency of adverse events. Unless problems are discovered, monitoring can be done without disclosing interim findings to the investigators. Typically, an independent data and safety monitoring board is given access to all data and performs these analyses. The main purpose of this type of monitoring is to optimize the quality of the trial. In rare instances, the board will find results so profoundly good or bad that they share them with the investigators, with the goal of stopping the trial in the name of patient safety (if the intervention is clearly harmful) or ethical treatment of the nonintervention group (if the intervention is overwhelmingly beneficial).
In contrast to this form of data monitoring, some large trials have prespecified interim analyses that check results at designated stages of the trial (eg, when a certain number of patients have been enrolled) to determine whether the trial should be continued or terminated. Termination rules are specified a priori and can be designed to stop a trial that is so obviously positive that there is no need to accumulate a larger number of patients. They can also be designed to terminate a trial whose results are so equivocal that there is no possibility of the trial showing meaningful benefit, even if the trial were run until the target N was reached. Stopping rules are mathematically precise and are designed to provide ultraconservative recommendations for stopping the trial. Trialists favor stopping rules because they reduce the costs of trials and may shorten the time it takes to get a drug to market. Skeptics worry that trials are stopped short the minute they reach statistical significance and that the stopping rule may be inadequately conservative to ensure that a “positive” result is not a random high blip on a curve that generally trends toward no effect.
We briefly consider the mathematics of stopping rules in Appendix E1 (available online at http://www.annemergmed.com).
Q4.b The Bayesian statistical approach provides an alternative, and many would argue more robust, strategy for trial design and analysis. Describe the major differences between the Bayesian statistical approach and the more commonly used frequentist statistical approach with regard to hypothesis testing, interim analyses, analyzing the data, and interpreting the study's results.
That frequentist statistics still predominate in the analysis of medical trials has more to do with squatter's rights than any theoretic advantage. Frequentist statistics have little to offer over Bayesian statistics, other than an aura of objectivity that does not withstand critical scrutiny. Bayesian statistics are slowly gaining a foothold as researchers and clinicians recognize that Bayesian analyses are a better match for the questions being asked in clinical research. The goal of Bayesian statistics is to modify one's knowledge according to new information, a process akin to the one used by clinicians treating patients. This homology is desirable because the statistical process is mirroring the cognitive processes of the discipline.44, 45
Frequentist statistical testing was developed by behaviorists who wanted rules that would dictate behavior.46 Such rules are particularly useful in situations in which one is forced to make a series of binary decisions and one wants to maximize the correctness of these decisions during a long series of decisions. Quality control decisions—in which one takes a random sample of the products off an assembly line on a daily basis and, according to the testing of those objects, determines whether that day's product is acceptable or should be destroyed—are an excellent example of a situation in which frequentist statistics are highly appropriate.
Unfortunately, clinical medical research does not really fit this model. Seldom are we forced to behave according to the results of a single trial. Instead, we allow evidence to accumulate and, according to much information, such as trial results, epidemiological studies, and clinical experience, determine the best way to proceed. Bayesian clinical trials, unlike their frequentist counterparts, attempt to estimate the magnitude of the difference between treatments rather than test whether that difference is statistically significant. Furthermore, by incorporating some estimate of previous knowledge, they provide an indication of how the new information—the trial results—modifies our previous belief, in the same way that a new test result modifies our differential diagnosis about a patient.
Frequentist statistical inferences are designed to be done only at prespecified interim analyses or the final analysis. The timing of Bayesian inferences need not be specified a priori. One can perform a Bayesian analysis at any point in the study without incurring any statistical penalty for repeated analyses. In the Bayesian perspective, each new patient modifies the existing knowledge to produce a new posterior estimate, and there is no harm in looking to see what that new estimate is. Because the Bayesian approach permits constant monitoring of data in a trial, investigators have the flexibility to refine the study to ensure that the trial most appropriately measures the intended study question. For example, suppose that in a 3-limb trial measuring whether 2 new treatments are better than the standard treatment for lung cancer, an interim Bayesian analysis shows that one of the new treatments performs substantially worse than standard treatment. In this framework, it is perfectly legitimate for the investigators to drop that limb from the protocol and randomize future patients to the 2 remaining limbs. This strategy has ethical advantages because as many patients as possible are given the better treatment and the trial's efficiency is maximized.
Apart from historical precedent, 2 logistical problems have slowed the acceptance of the Bayesian approach to clinical trials analysis. First, Bayesian methods are computationally intensive, and it has been only during the past 10 years that software has been developed to execute these analyses on standard desktop computers. Second, until recently, the Food and Drug Administration required frequentist analysis in trials presented in support of a new drug application. Fortunately, the Food and Drug Administration now has a more enlightened approach and welcomes trials with Bayesian analyses and appears to be ready to consider trials that use adaptive methodologies.
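The notion that each new patient simply updates the posterior can be illustrated with a small sketch. It assumes a binary outcome (eg, stone passed or not), independent uniform Beta(1, 1) priors for each group, and hypothetical cumulative counts at 3 looks; none of these numbers come from the trial, and a real analysis would justify its prior explicitly.

```python
import random

def posterior_prob_superior(succ_t, n_t, succ_c, n_c,
                            prior=(1.0, 1.0), draws=20000, seed=1):
    """Monte Carlo estimate of P(p_treatment > p_control | data) when each
    success probability has an independent Beta prior (conjugate update)."""
    rng = random.Random(seed)
    a, b = prior
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(a + successes, b + failures).
        p_t = rng.betavariate(a + succ_t, b + n_t - succ_t)
        p_c = rng.betavariate(a + succ_c, b + n_c - succ_c)
        wins += p_t > p_c
    return wins / draws

# Hypothetical cumulative data at 3 looks:
# (treatment successes, treatment n, control successes, control n)
looks = [(7, 10, 6, 10), (15, 20, 12, 20), (31, 40, 25, 40)]

probs = [posterior_prob_superior(*look) for look in looks]
for look, prob in zip(looks, probs):
    print(f"data {look}: P(treatment better) is about {prob:.2f}")
```

The same function can be called after every patient without any adjustment; the posterior at each look is just the current state of knowledge given the prior and the data so far.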
This Journal Club answer was intended to introduce the topics of Bayesian theory and Bayesian clinical trials and is not meant to be an exhaustive review of the Bayesian approach. Although we plan to introduce more Bayesian concepts in future Journal Clubs, readers who desire a greater understanding of the approach are directed toward the references.45, 46, 47, 48, 49
Appendix E1
In a classical statistical hypothesis-testing framework, interim analyses must be planned a priori and take into account the mathematical consequences of taking multiple looks at the data. Interim analyses to test for significance of the intervention on the outcome should be clearly defined in the study protocol before trial initiation. First, planned interim analyses and repeated looks at the data need to be accounted for during the sample size estimation. Most basic sample size estimates use an α value of 0.05, meaning that the investigators accept a 5% chance that a statistically significant outcome (ie, rejection of the null hypothesis) might occur because of chance rather than true treatment efficacy. This 5% risk of type I error, rejecting the null hypothesis when it is true, is based on a single look at the data. The likelihood of type I error increases as one looks at the data more frequently. For example, if the accumulating data were analyzed twice, halfway through enrollment and on completion, the probability of type I error increases from 5% to 8%. If the analysis is repeated 5 or 10 times during study enrollment, the probability of finding a significant result without adjusting for the repeated looks increases to 14% and nearly 20%, respectively.2
Second, care must be taken to avoid attaching significance to random variation. If one flips a fair coin 100 times, there is a 73% chance that the final counts of heads and tails will differ by no more than 10 (that is, fall between 45:55 and 55:45) once all flips are completed. That does not mean that the difference never exceeded 10% at some point on the way to 100 flips. Certain data trends might appear and subsequently disappear with repeated data testing for significance, and investigators must be cautious in interpreting these temporary trends as true associations.2 Therefore, trials that intend to perform repeated analyses of the data must adjust their sample size estimate or criteria for declaring a “statistically significant” difference to ensure that the overall significance level is maintained.
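The inflation of type I error with repeated looks can be checked by simulation. The sketch below assumes a normally distributed test statistic, equally spaced looks, no true treatment effect, and an unadjusted critical value of 1.96 at every look; the number of simulated trials and the random seed are arbitrary.

```python
import random
from math import sqrt

def type1_error_unadjusted(n_looks, z_crit=1.96, trials=20000, seed=0):
    """Fraction of simulated null trials in which |Z| exceeds z_crit
    at any of n_looks equally spaced analyses."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        s = 0.0
        for k in range(1, n_looks + 1):
            s += rng.gauss(0.0, 1.0)       # standardized data accrued since last look
            if abs(s / sqrt(k)) > z_crit:  # Z statistic on all data so far
                rejected += 1
                break                      # trial stops at first "significant" look
    return rejected / trials

rates = {k: type1_error_unadjusted(k) for k in (1, 2, 5, 10)}
for k, rate in rates.items():
    print(f"{k:2d} look(s): overall type I error is about {rate:.3f}")
```

Within simulation error, the estimates should fall close to the 5%, 8%, 14%, and nearly 20% figures quoted above.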
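The coin-flip example can be reproduced directly. The sketch below computes the exact probability of finishing between 45 and 55 heads and then simulates how often a sequence that finishes in that range nonetheless showed a gap of more than 10 percentage points at some earlier point. The choice to begin monitoring at flip 20 is our assumption (very early flips are almost always lopsided), as are the number of simulated sequences and the seed.

```python
import random
from math import comb

# Exact probability that 100 fair flips end with 45 to 55 heads (inclusive).
p_final_close = sum(comb(100, h) for h in range(45, 56)) / 2**100

def simulate(n_flips=100, first_look=20, gap_percent=10, trials=20000, seed=0):
    """Return (fraction of sequences ending 45-55 heads,
    fraction of those that exceeded the gap at some look from first_look on)."""
    rng = random.Random(seed)
    ended_close = 0
    wandered_and_ended_close = 0
    for _ in range(trials):
        heads = 0
        wandered = False
        for k in range(1, n_flips + 1):
            heads += rng.random() < 0.5
            # |heads - tails| as a percentage of flips so far exceeds the gap
            if k >= first_look and abs(2 * heads - k) * 100 > gap_percent * k:
                wandered = True
        if 45 <= heads <= 55:
            ended_close += 1
            wandered_and_ended_close += wandered
    return ended_close / trials, wandered_and_ended_close / ended_close

frac_close, frac_wandered = simulate()
print(f"exact P(45 to 55 heads) = {p_final_close:.3f}")
print(f"simulated fraction ending 45 to 55 heads = {frac_close:.3f}")
print(f"of those, fraction with an earlier gap over 10% = {frac_wandered:.3f}")
```

An investigator who tested at each of those intermediate points would have had many opportunities to mistake a transient imbalance for a real difference.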
There are multiple approaches to adjust the α and the required sample size for repeated testing that include classical sequential methods, Haybittle-Peto procedure, Pocock's method, O'Brien-Fleming, and Bayesian approaches. The Bayesian approach does not involve adjusting α and was discussed in answer 4b. The classical sequential model's goal is to minimize the number of subjects enrolled in a trial. This model assumes that the only decision is whether the trial enrollment should be continued or terminated early because one of the study groups is significantly superior or inferior to the other.2 This model requires the outcome to be known very soon after the intervention, and outcome data are monitored after every pair (one subject for each treatment option) is entered. A problem with this model is the need to enter subjects in pairs because these subjects might be very different in baseline characteristics. Furthermore, this method, also known as an “open plan,” has no maximum sample size and there is no guarantee that a decision to end the study will ever be reached. Therefore, this method is not often used in clinical trials.2
Alternative methods were designed that did not have the same limitations as the classical sequential model. The Haybittle-Peto procedure recommends using a very large critical value (ie, Z=±3.0) for all interim analyses, thus preserving the α for the final test. Another method was developed by Pocock that computes an adjusted α value dependent on the number of interim analyses and the overall desired significance level, which typically is an α of 0.05. The O'Brien-Fleming method is the most commonly used procedure in clinical trials and involves adjusting the critical value according to the number of repeated looks at the data. A more conservative, higher critical value is used for the first look and smaller critical values are used for the later analyses as enrollment approaches the planned study sample. The more often an investigator looks at the data, the higher the critical value and the more conservative the values required to terminate the study early. An advantage of the O'Brien-Fleming method is that the critical value used at the final analysis of the data closely approximates the value that would be used if only a single analysis were performed. Figure E1 shows a graphic representation of these approaches.
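To see how these adjustments behave, the sketch below simulates a trial with 3 equally spaced looks under the null hypothesis and estimates the overall type I error for each set of critical values. The Pocock constant (2.289) and the O'Brien-Fleming constant (2.004) are commonly tabulated values for 3 looks and a 2-sided α of 0.05; they are our assumption and are not taken from the article or from Figure E1, whose exact values may differ.

```python
import random
from math import sqrt

K = 3  # 2 interim analyses plus the final analysis

# Critical |Z| values at looks 1..K (constants assumed from standard tables).
boundaries = {
    "unadjusted": [1.96] * K,
    "Haybittle-Peto": [3.0, 3.0, 1.96],
    "Pocock": [2.289] * K,
    "O'Brien-Fleming": [2.004 * sqrt(K / k) for k in range(1, K + 1)],
}

def overall_type1(crit, trials=40000, seed=0):
    """Fraction of simulated null trials that cross the boundary at any look."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        s = 0.0
        for k, c in enumerate(crit, start=1):
            s += rng.gauss(0.0, 1.0)   # standardized data accrued since last look
            if abs(s / sqrt(k)) > c:   # Z statistic on all data so far
                rejected += 1
                break
    return rejected / trials

results = {name: overall_type1(crit) for name, crit in boundaries.items()}
for name, crit in boundaries.items():
    values = ", ".join(f"{c:.2f}" for c in crit)
    print(f"{name:16s} critical values [{values}]  overall alpha is about {results[name]:.3f}")
```

Under these assumptions, the unadjusted approach should exceed 5% noticeably, whereas the Pocock and O'Brien-Fleming boundaries should hold the overall error near 5%, and the Haybittle-Peto rule should exceed it only slightly. The O'Brien-Fleming final critical value (about 2.00) is also visibly close to the single-look value of 1.96.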

Figure E1.
Summary of adjusted critical values based on repeated data analyses. This hypothetical example includes 3 looks (i=number of looks) at the data, including 2 interim analyses and the final data analysis on completion of enrollment.
References
- preferred definitions and conceptual framework (Biomarkers Definitions Working Group). Clin Pharmacol Ther. 2001;69:89–95
- Fundamentals of Clinical Trials. 3rd ed. New York, NY: Springer Science; 1998
- The contribution of clinical pharmacology surrogates and models to drug development—a critical appraisal. Br J Clin Pharmacol. 1997;44:219–225
- Measurement validity in physical therapy research. Phys Ther. 1993;73:102–115
- Are surrogate markers adequate to assess cardiovascular disease drugs? JAMA. 1999;282:790–795
- Effects of an inhibitor of cholesteryl ester transfer protein on HDL cholesterol. N Engl J Med. 2004;350:1505–1515
- High density lipoprotein as a protective factor against coronary heart disease (The Framingham Study). Am J Med. 1977;62:707–714
- Predicting coronary heart disease in middle-aged and older persons (The Framingham Study). JAMA. 1977;238:497–499
- Efficacy and safety of torcetrapib, a novel cholesteryl ester transfer protein inhibitor, in individuals with below-average high-density lipoprotein cholesterol levels. J Am Coll Cardiol. 2006;48:1774–1781
- Effects of torcetrapib in patients at high risk for coronary events. N Engl J Med. 2007;357:2109–2122
- Effect of torcetrapib on the progression of coronary atherosclerosis. N Engl J Med. 2007;356:1304–1316
- Tamsulosin for ureteral stones in the emergency department: a randomized, controlled trial. Ann Emerg Med. 2009;54:432–439
- Bias due to non-differential misclassification of polytomous confounders. J Clin Epidemiol. 1993;46:57–63
- Bias due to misclassification in the estimation of relative risk. Am J Epidemiol. 1977;105:488–495
- Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA. 2003;289:2554–2559
- International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH harmonised tripartite guideline: statistical principles for clinical trials. Stat Med. 1999;18:1905–1942
- Sample size calculation for complex clinical trials with survival endpoints. Control Clin Trials. 1995;16:395–407
- Choice of endpoints in antiplatelet trials: which outcomes are most relevant to stroke patients? Neurology. 2000;54:1022–1028
- Platelet glycoprotein inhibitors in patients with medically managed acute coronary syndrome: does the enthusiasm exceed the science? Ann Emerg Med. 2001;38:249–255
- Key issues in end point selection for heart failure trials: composite end points. J Card Fail. 2005;11:567–575
- Composite outcomes can distort the nature and magnitude of treatment benefits in clinical trials. Ann Intern Med. 2009;150:566–567
- Valvular heart disease associated with fenfluramine-phentermine. N Engl J Med. 1997;337:581–588
- Reports of clinical trials should begin and end with up-to-date systematic reviews of other relevant evidence: a status report. J R Soc Med. 2007;100:187–190
- Metaanalysis of individual patient data from randomized trials: a review of methods used in practice. Clin Trials. 2005;2:209–217
- Prospective meta-analysis using individual patient data in intensive care medicine. Intensive Care Med. 2009 Sep 18; [Epub ahead of print]
- Efficacy of tamsulosin in the medical management of juxtavesical ureteral stones. J Urol. 2003;170:2202–2205
- Nifedipine versus tamsulosin for the management of lower ureteral stones. J Urol. 2004;172:568–571
- The use of tamsulosin in the medical treatment of ureteral calculi: where do we stand? Urol Res. 2005;33:460–464
- Randomized trial of the efficacy of tamsulosin, nifedipine, and spasmolytic in medical expulsive therapy for distal ureteral calculi. J Urol. 2005;174:167–172
- Survey of the effect of tamsulosin and nifedipine on facilitating juxtavesical ureteral stone passage. [abstract MP03-06] J Endourol. 2005;19(suppl 1):A9
- The comparison and efficacy of 3 different α1-adrenergic blockers for distal ureteral stones. J Urol. 2005;173:2010–2012
- Medical expulsive treatment of distal ureteral stones using tamsulosin: a single center experience. J Endourol. 2006;20:12–16
- Effect of tamsulosin on the expectant treatment of lower ureteral stones. Korean J Urol. 2006;47:708–711
- Doxazosin for the management of distal-ureteral stones. J Endourol. 2007;21:538–541
- Efficacy of terazosin as a facilitator agent for expulsion of the lower ureteral stones. Saudi Med J. 2006;27:838–840
- Corticosteroids and tamsulosin in the medical expulsive therapy of symptomatic distal ureter stones: single drug or association? Eur Urol. 2006;50:339–344
- A novel approach for accurate prediction of spontaneous passage of ureteral stones: support vector machines. Kidney Int. 2006;69:157–160
- Time to stone passage for observed ureteral calculi: a guide for patient education. J Urol. 1999;162:688–690
- Effect of tamsulosin on the number and intensity of ureteral colic in patients with lower ureteral calculus. Int J Urol. 2005;12:615–620
- Hermanns T, Sauermann P, Rufibach K, et al. Is there a role for tamsulosin in the treatment of distal ureteral stones of 7 mm or less? Results of a randomised, double-blind, placebo-controlled trial. Eur Urol. In press.
- Speedy elimination of ureterolithiasis in lower part of ureters with the α1 blocker tamsulosin. Int Urol Nephrol. 2002;34:25–29
- Tamsulosin in the treatment of patients with ureteroliths of the lower third of the ureter clinical and pharmacoeconomic grounds. Urologiia. 2005;4:36–39
- Acutely decompensated heart failure in a county emergency department: a double-blind randomized controlled comparison of nesiritide versus placebo treatment: answers to May 2008 Journal Club questions. Ann Emerg Med. 2008;52:458–472
- Introduction to Bayesian statistics. In: Rothman KJ, Greenland S, Lash TL, editors. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008; p. 328–344
- Bayesian Approaches to Clinical Trials and Health-Care Evaluation. West Sussex, England: John Wiley & Sons Ltd; 2004
- Problems with current methods of data analysis and reporting, and suggestions for moving beyond incorrect ritual. Eur J Emerg Med. 2002;9:203–207
- Scientific Reasoning: The Bayesian Approach. La Salle, IL: Open Court; 1989
- Statistics: A Bayesian Perspective. Belmont, CA: Duxbury Press; 1996
- Bayesian perspectives for epidemiological research, I: foundations and basic methods. Int J Epidemiol. 2006;35:765–775
Section editors: Tyler W. Barrett, MD; David L. Schriger, MD, MPH
Editor's Note: You are reading answers to the eleventh installment of Annals of Emergency Medicine Journal Club. The questions and the article they are about (Ferre et al. Ann Emerg Med. 2009;54:432-439.) were published in the September 2009 issue. Information about journal club can be found at http://www.annemergmed.com/content/journalclub. Readers should recognize that these are suggested answers. We hope they are accurate; we know that they are not comprehensive. There are many other points that could be made about these questions or about the article in general. Questions are rated “novice,” “intermediate,” and “advanced” so that individuals planning a journal club can assign the right question to the right student. The “novice” rating does not imply that a novice should be able to spontaneously answer the question. “Novice” means we expect that someone with little background should be able to do a bit of reading, formulate an answer, and teach the material to others. Intermediate and advanced questions also will likely require some reading and research, and that reading will be sufficiently difficult that some background in clinical epidemiology will be helpful in understanding the reading and concepts. We are interested in receiving feedback about this feature. Please e-mail journalclub@acep.org with your comments.
PII: S0196-0644(09)01617-5
doi:10.1016/j.annemergmed.2009.09.027
© 2009 Published by Elsevier Inc.
Refers to article:
- Tamsulosin for Ureteral Stones in the Emergency Department: A Randomized, Controlled Trial, 06 February 2009
- Journal Club: Outcome Measures, Interim Analyses, and Bayesian Approaches to Randomized Trials
