Department of Anesthesiology, University of Michigan, Ann Arbor
Address correspondence and reprint requests to Andrew L. Rosenberg, MD, The University of Michigan, Department of Anesthesiology, UH 1H247 Box 0048, 1500 East Medical Center Dr., Ann Arbor, MI 48109-0048. Address e-mail to arosen{at}med.umich.edu.
| Abstract |
|---|
| Introduction |
|---|
| Methods |
|---|
These 279 articles were photocopied, and all identifiers were removed from all pages by three investigators (MDN, AS, and MJS) who were not involved in further evaluation; identifiers included names and affiliations of authors, journal name, corresponding authors, or any other unique identifiers. Articles were presented to other reviewers (MLG and AR) who were blinded or masked to all unique identifiers. Articles were offered in a random order using a computer-generated randomization scheme. Both reviewers have had formal training in research design, epidemiology, and biostatistics.
We used a modified version of Chalmers' quality assessment tool (9,12) to evaluate each article. This tool, in its modified or original form, has been used extensively to evaluate RCTs (13-16). It uses a scale to evaluate eight domains associated with the study protocol and six domains related to data analysis (Table 1). The weighted scores for each domain have precise requirements for what must be recorded to achieve a certain score. Each of these domains was evaluated, and a numeric value was assigned to each depending on the quality. The scores were then used to generate a quality score for each domain. Percentages (total score divided by total possible score) were assigned because some items were not applicable to the study under review. Thus, the scores were proportions, with the lowest possible score being 0% and the highest possible score being 100%. This method of scoring has been validated in numerous studies of clinical research (17-21).
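The percentage scoring described above can be illustrated with a minimal sketch. It assumes each item is recorded as an (awarded, maximum) pair, with `None` marking an item that is not applicable; the function name and the item weights are hypothetical and are not the actual weights of the Chalmers tool.

```python
from __future__ import annotations

from typing import Optional


def quality_score(items: list[tuple[Optional[float], float]]) -> float:
    """Return the percentage score over applicable items only.

    Each item is (awarded, maximum). awarded=None marks an item that is
    not applicable to the study; it is left out of both totals.
    """
    applicable = [(a, m) for a, m in items if a is not None]
    total_possible = sum(m for _, m in applicable)
    if total_possible == 0:
        raise ValueError("no applicable items to score")
    total_awarded = sum(a for a, _ in applicable)
    return 100.0 * total_awarded / total_possible


# Hypothetical weights for illustration (not the Chalmers weights):
# the third item is not applicable, so the score is 5 / 17.
example = [(3, 3), (0, 10), (None, 8), (2, 4)]
print(round(quality_score(example), 1))  # 29.4
```

Excluding non-applicable items from the denominator is what keeps every article on the same 0% to 100% scale regardless of how many items apply to it.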
We derived an overall quality score for each article. The two reviewers came to a consensus on those items on which they disagreed (22). Analysis of variance with Duncan correction for multiple comparisons was used to test overall quality score differences among the four journals; Pearson's χ² test or Fisher's exact test was used, as appropriate, for comparisons of individual score assignment for each quality item. Statistical analyses were conducted using the Statistical Analysis System (SAS 8.0; SAS Institute, Cary, NC).
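For readers unfamiliar with the exact test named above, the following is a minimal sketch of a two-sided Fisher's exact test for a 2 × 2 table, written with the Python standard library only. It is not the SAS procedure used in this study, and the counts in the usage line are hypothetical, not data from the reviewed articles.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7)
    )


# Hypothetical counts: an item rated adequate/inadequate in two journals
print(round(fisher_exact_two_sided(3, 1, 1, 3), 3))  # 0.486
```

The exact test is typically preferred over the χ² test when expected cell counts are small, which is presumably what "as appropriate" refers to.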
| Results |
|---|
In 20% (41 of 201) of the papers in which a control population was feasible, the control group had either a different control appearance or regimen (i.e., the experimental group and the control group did not have the identical appearance for their course of therapy or treatment) or the control regimen was unstated and could not be determined from the context of the article. In the 279 studies reviewed, the final sample size differed from the number enrolled in 10% of the studies. This was either because of significant withdrawals within studies or was not explained at all. Pretreatment variable distributions, such as demographics, and important clinical predictors, such as comorbidities or previous opioid use in pain patients, were not present in any form in 32% (86 of 271) of the study articles. Side effects were not addressed in 11% of the studies, and an additional 21% listed side effects but did not discuss the impact on study findings.
Thirty-five percent of the studies adequately reported appropriate methodology for how the process of randomization was blinded to the study investigators. For example, it could not be determined from these articles if an appropriate method of randomization (e.g., computer-generated or random numbers tables) was used or how the randomization was blinded (e.g., opaque envelopes). Twenty-two percent of studies did not report appropriate blinding of patients. In some articles, authors reported that studies were "single-blinded" but did not include a description of the blinding methodology. Other articles reported the blinding methodology, but there was evidence through patient side effects that the patients might be able to discern their treatment assignment. Thirty-seven percent of studies did not describe how observers who might influence the outcome reporting (such as a nurse in the postanesthesia care unit recording postoperative pain scores) were blinded to patient study enrollment or treatment assignment. Also, 98% of articles presented no details of efforts to blind the observers to continuing study results.
Three of 279 studies (1%) included neither P values nor test statistics, whereas 55% included either the P value or the test statistic, but not both. Forty-one percent of studies received a good rating for statistical analysis, and an excellent rating was assigned to 11% of studies. Standard deviations or standard errors were presented in 83% of the studies; confidence intervals were presented properly in only 11%. Forty-eight percent of the studies evaluated reported no analysis of the number of patients required to detect differences proposed to be important by the authors; 52% (145 of 279) conducted sample size estimates. Fifty-three percent (149 of 279) of studies had negative results. Of these 149 negative studies, 107 (72%) contained no explanation of how a Type II error may have accounted for a lack of statistical significance in key outcomes. Eighteen of 149 studies (12%) alluded to the problem or admitted to the necessity for more patients, whereas 24 of 149 (16%) estimated the statistical possibility of a Type II error after the fact (a power analysis) or commented on the confidence interval around the differences.
| Discussion |
|---|
In this study, it was important to use a quality assessment tool that had been validated for RCTs and that had been more broadly used for evaluating clinical trials in a variety of medical and other scientific disciplines. The modified version of Chalmers' quality assessment tool (9,12) was selected to evaluate RCTs in the major anesthesiology journals because it has been extensively used to evaluate the quality of articles in clinical journals as well as other scientific publications. We chose the modified Chalmers tool in our study because of its attention to the details of protocol design and data analysis. Would another tool have produced similar results? Does the subjectivity that accompanies the implementation of any evaluative tool introduce bias into our assessment? Work by Detsky et al. (7) has demonstrated that, in a comparison of 18 RCTs using different tools, the overall quality assessment of articles did not change significantly from tool to tool.
Our study suggests that significant deficiencies in the quality of RCT reporting are especially prevalent in the way randomization and blinding techniques are performed or reported. Our results suggest that these two domains may also be subject to significant review or editorial bias, as indicated by the significant differences among the four journals in the proportion of studies adequately reporting these methodologies. Randomization and blinding are particularly important because these techniques are the basis for reducing bias and are the hallmarks of the RCT. Proper randomization requires that patients have an equal chance of being assigned to either the treatment or the control group and that the method used to assign them is free of bias. Any method in which the investigator can determine or influence the group to which the next study patient will be assigned is to be avoided. For example, randomization assignments should be determined by an individual not involved in the actual treatment; random numbers tables and computer-generated randomization assignments are less amenable to manipulation than the toss of a die or the drawing or shuffling of a deck of cards. It is also important that the assignment codes, even when generated properly, are concealed from the investigator (i.e., randomization concealment). For example, the study assignments should be placed in opaque envelopes or communicated by telephone by someone not involved in study implementation so that the investigator cannot determine which treatment is next in line.
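The two requirements described above, an unbiased computer-generated sequence and concealment of the next assignment, can be sketched as follows. This is an illustrative sketch only: the class name, arm labels, and patient identifiers are hypothetical, and a real trial would rely on an independent randomization service rather than on Python's convention of private attributes.

```python
import random


class ConcealedAllocator:
    """Holds a computer-generated allocation list and reveals one
    assignment at a time, only once a patient has been enrolled."""

    def __init__(self, n_patients, arms=("treatment", "control"), seed=None):
        rng = random.Random(seed)
        # Simple randomization: every patient has an equal chance of each arm
        self._sequence = [rng.choice(arms) for _ in range(n_patients)]
        self._next = 0
        self._log = []

    def enroll(self, patient_id):
        """Record the enrollment, then reveal that patient's assignment."""
        if self._next >= len(self._sequence):
            raise RuntimeError("allocation list exhausted")
        arm = self._sequence[self._next]
        self._next += 1
        self._log.append((patient_id, arm))
        return arm

    def audit_trail(self):
        """Return a copy of the enrollment log for later verification."""
        return list(self._log)


# The list is generated by someone not involved in treatment; the
# investigator only ever calls enroll(), the analog of opening one
# opaque envelope after the patient is committed to the study.
allocator = ConcealedAllocator(n_patients=4, seed=2004)
for pid in ("P01", "P02"):
    print(pid, allocator.enroll(pid))
```

Recording the patient identifier before the arm is revealed mirrors the purpose of sealed envelopes or telephone allocation: the investigator cannot see which treatment is next in line and then choose whom to enroll.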
Most investigators understand the importance of randomization, but we found that the details of the methods used were frequently inadequate. For example, authors reported that patients were assigned by "random envelope method" or by "choosing colored balls from an opaque bag" as well as by "a number of randomly allocated cards," by the "shuffling of sequentially numbered envelopes," or by a "systematic random sample technique." Not only are these descriptions vague, but the methods are also not truly unbiased because they allow a variety of factors to interfere with the treatment assignments. Moreover, we found few trials that adequately reported randomization concealment despite its fundamental importance to the validity of the RCT. Of the 279 articles we evaluated, <5% reported both the method of randomization and that the randomization results were blinded to the investigators. This proportion is even lower than that reported by Pua et al. (11), probably because this study also evaluated randomization blinding.
It is often not possible or ethical to blind patients or investigators in certain studies, and the Chalmers assessment tool does not penalize studies in which it is impossible to blind patients or investigators. In our review, in those studies in which blinding was possible, investigators often reported that a study was single-blind or double-blind, but rarely described the actual methods used to blind those involved. Other times, investigators described little else than that the observer was independent. Ninety-eight percent of the studies presented did not mention a safety monitoring committee and gave no details of efforts to blind any observers to continuing study results.
This study also found significant flaws in the analytic methods described in anesthesiology RCTs. Nearly half of the journal articles (48%) did not state whether an estimate of sample size was made before the study began. Reporting appropriate sample size estimates to avoid Type II errors (that is, ensuring a priori that the study design has adequate power to detect a statistically significant difference between the outcomes in the study groups if there is such a difference) is not only methodologically required to improve the validity of a study; it has ethical implications as well. Many clinicians would consider it unreasonable to expose patients to potentially harmful treatments, or side effects from treatments, without knowing before the study begins how many patients are required. Clinical research is currently undergoing more scrutiny by institutional review boards to justify the reasons for a research project and to provide evidence that the least risky protocol is being applied to the fewest required subjects. Of the 149 clinical trials that reported negative results, only 28% addressed the possibility that such findings may have been caused by a small sample size or a Type II error. Researchers and readers alike often interpret the absence of statistical significance to mean that there is no relationship between treatment and outcome (27). In our review, we found that many authors concluded that because there was no difference in a study outcome, the treatments were equivalent or equally effective or that the new treatment was an acceptable alternative to the standard treatment. No proof of a difference is not equivalent to proof of no difference (28).
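As an illustration of the a priori sample size estimate discussed above, the following minimal sketch uses the standard normal-approximation formula for a two-sided comparison of two means with equal group sizes, n = 2[(z(1-α/2) + z(power))σ/δ]². The function name and the numbers in the usage line are hypothetical; they are not taken from any of the reviewed trials, and a t-based calculation would give slightly larger values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sigma: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients needed per group to detect a mean difference `delta`
    given a common standard deviation `sigma` (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided Type I error
    z_beta = z.inv_cdf(power)           # power = 1 - Type II error rate
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical example: detect a 1-point difference in a pain score
# with a standard deviation of 2 points, alpha 0.05, power 80%
print(n_per_group(delta=1.0, sigma=2.0))  # 63
```

A trial enrolling far fewer patients than such an estimate has a high probability of a Type II error, which is why a negative result from it cannot be read as evidence of equivalence.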
There are several limitations to our approach of measuring the quality of reporting among these RCTs. The first is that failure to report an item does not necessarily mean that it was not performed in the actual trial. This concern has been expressed by others regarding the quality of reporting versus the quality of the design and conduct of RCTs (29-32). However, it seems to be generally accepted that the methodological rigor of a study is reflected in its reporting and that incomplete reporting frequently represents poor-quality studies (33). Second, because these journal articles were published several years ago, it is possible that they may not reflect current reporting practices. Journal quality may have improved, as evidenced by the work of Moher et al. (34) comparing studies before and after suggestions for improving the quality of RCTs. Studies by Pua et al. (11) evaluating the quality of RCTs from periods between 1980 and 2000, as well as the accompanying editorial by Todd (35) urging improvement in the conduct and reporting of RCTs, also indicate that some attention is being given to improving the quality of reporting of RCT articles. This study does not evaluate more recent articles to determine whether quality has improved. However, it is important to note that only two of the major general anesthesiology journals reviewed in this article have adopted CONSORT guidelines in their instructions to authors. Finally, there is no clear "gold standard" for evaluating controlled research reporting or quality, as evidenced by the plethora of tools available for use (36), and this study represents the use of only one such tool. Although evidence suggests that validated quality tools generally agree with each other (7), it is possible that the use of another assessment tool would have resulted in different findings.
In conclusion, our data indicate that investigators, reviewers, editorial boards, and readers alike should recognize that the quality of reporting of RCTs in the anesthesiology literature has scope for improvement. Because the results of RCTs are considered the gold standard for implementing changes in clinical practice, our findings suggest that significant improvement in the quality of the most important published articles may be achieved by a more rigorous application of the CONSORT guidelines. It would be instructive for future studies to compare, over time, journals that have adopted CONSORT guidelines with those that have not. Efforts to improve the reporting and the conduct of RCTs should focus on randomization methodology; the blinding of patients, investigators, and observers; and sample size estimation and power analysis.
The authors thank Ann E. Nadeau for her assistance in the preparation of this manuscript.
| Footnotes |
|---|
Accepted for publication October 29, 2004.
| References |
|---|