*Department of Anesthesiology and Intensive Care, Nizam's Institute of Medical Sciences, Hyderabad, India; Departments of Anesthesia and Critical Care and Statistics and Health Studies, University of Chicago, Chicago, Illinois; and §Department of Anesthesiology, Johns Hopkins School of Medicine, Baltimore, Maryland
Address correspondence and reprint requests to Srinivas Mantha, MD, 13/4 RT, LIGH, Barkatpura, Hyderabad 500027, India. Address e-mail to smantha{at}satyam.net.in
Abstract
In this era of medical technology assessment and evidence-based medicine, evaluating new methods to measure physiologic variables is facilitated by standardization of reporting results. It has been proposed that assessing repeatability be followed by assessing agreement with an established technique. If the "limits of agreement" (mean bias ± 2SD) are not clinically important, then one could use two measurements interchangeably. Generalizability to larger populations is facilitated by reporting confidence intervals. We identified 44 studies that compared methods of clinical measurement published during 1996 to 1998 in seven anesthesia journals. Although 42 of 44 (95.4%) used the limits of agreement methodology for analysis, several inadequacies and inconsistencies in reporting the results were noted. Limits of agreement were defined a priori in 7.1%, repeatability was evaluated in 21.4%, and relationship (pattern) between difference and average was evaluated in 7.1%. Only one of the articles reported confidence intervals. A computer macro for the Minitab statistical package (State College, PA) is described to facilitate reporting of Bland and Altman analysis with confidence intervals. We propose standardization of nomenclature in clinical measurement comparison studies.
Implications: A literature review of anesthesia journals revealed several inadequacies and inconsistencies in statistical reports of results of comparison studies with regard to interchangeability of measurement methods. We encourage journal editors to evaluate submissions on this subject carefully to ensure that their readers can draw valid conclusions about the value of new technologies.
Validation of new technology for application to clinical medicine requires comparison with older techniques or assessment of outcomes. These processes, known as medical technology assessment and evidence-based medicine, have gained prominence through publication frequency (1,2). A standard nomenclature has evolved for reporting results after comparison of new methods to monitor physiologic variables with established ones. Thus, for example, the performance of a new monitor to measure cardiac output is compared with an established thermodilution technique.
Statistical evaluations of such comparison studies are not simple. The primary aim of comparison studies is to determine whether the two methods agree sufficiently to be used interchangeably. Because analysis with correlation and least squares linear regression (also known as calibration statistics) is fundamentally misleading, Bland and Altman favored a different statistical method for assessing agreement between two methods of measurement (3–5). Their analysis first calculates the difference in measurement values obtained by two methods on the same subject. The mean of such differences in a sample of subjects is the estimated bias (difference between methods), and the standard deviation (SD) of the differences measures random fluctuations around this mean. If the "limits of agreement" (mean difference ± 2SD) between two methods are not clinically important, one can use the two methods interchangeably. Another essential feature of the analysis is graphical representation of the data with between-method difference (y axis) plotted against the average (x axis). Such a graph allows one to evaluate any relationship between the measurement error (difference) and the assumed true value (average). Because results obtained in a study furnish only the sample statistics, it is necessary for generalizability of results to other populations to report confidence intervals (CIs) (6,7). CIs show a range of values based on the observed data within which, with a specified probability, the population value lies. In Bland and Altman analysis (4), CIs for mean bias, mean bias - 2SD, and mean bias + 2SD are of particular interest. We reviewed the statistical reporting of measurement comparison studies published in the anesthesia literature according to Bland and Altman analysis.
Methods
We examined the tables of contents of seven anesthesia journals (Anesthesiology, Anesthesia & Analgesia, Journal of Cardiothoracic and Vascular Anesthesia, Journal of Clinical Anesthesia, British Journal of Anesthesia, Anesthesia, and Canadian Journal of Anesthesia) published between January 1996 and December 1998. Articles with titles indicating evaluation of a new measurement technique were read. The primary goal was to identify comparison studies in which interchangeability of a new measurement technique with an established method was evaluated. Animal studies were excluded. To ensure accurate data transcription, each eligible study was read at least twice by one author (SM) and graded against written criteria by using an extraction chart for each article. A second author's (JFF) opinion was taken in case of confusion regarding data transcription. From each study, data were retrieved based on written evaluation standards. Random audits to ensure accuracy of some data from each article were done by a third author (MFR).
We evaluated the comparison studies according to Bland and Altman methodology (3,4) for the following five items: repeatability, definition of limits of agreement, representation of the x axis on the Bland and Altman graph, evaluation of the relationship (pattern) between difference (y axis data) and average (x axis data), and report of CIs. For repeatability assessment of each study, we first determined whether assessing repeatability was feasible (or practical), and then we determined whether repeatability was evaluated. Repeatability is determined by taking repeated measurements on a series of patients and calculating the mean and SD of differences. According to the definition of repeatability coefficient given by the British Standards Institute, the mean difference must not be significantly different from zero, and 95% of the differences are expected to lie within the range from -2SD to +2SD of the mean (4). When reviewing a study for limits of agreement, two aspects were evaluated. We determined whether the authors correctly defined the limits as "mean bias ± 2SD." In the methods section of each article, we looked for a statement defining the maximum width for limits of agreement that would not impair medical care, i.e., an a priori definition of the limits. We determined the x axis of the Bland and Altman graph for each study because of the potential for authors to erroneously use the x axis to represent the values of the established method rather than the average values of the two methods. The relationship (correlation) between difference in measurement values and their average is evaluated to verify whether differences vary in any systematic manner over the range of measurement (3,4).
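The repeatability check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `repeatability_summary` and the duplicate readings are hypothetical, and the t statistic still has to be compared with tabulated values for n - 1 degrees of freedom.

```python
import math
import statistics


def repeatability_summary(first, second):
    """Summarize paired repeat measurements made with the same method."""
    diffs = [x - y for x, y in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    # t statistic for H0: mean difference = 0 (undefined if all differences are equal).
    t_stat = mean_diff / (sd / math.sqrt(n)) if sd > 0 else float("nan")
    return {
        "mean_diff": mean_diff,
        "sd": sd,
        "t_stat": t_stat,
        # Range expected to contain about 95% of the differences.
        "range_95": (mean_diff - 2 * sd, mean_diff + 2 * sd),
    }


# Hypothetical duplicate systolic pressure readings (mm Hg) on five subjects.
first_reading = [120, 118, 130, 111, 104]
second_reading = [118, 120, 130, 110, 105]
summary = repeatability_summary(first_reading, second_reading)
print(summary)
```

A mean difference close to zero together with a narrow `range_95` would suggest acceptable repeatability; what counts as narrow remains a clinical judgment.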
Bland and Altman (4) derived the following formulas for CIs needed in the analysis:
For 95% CIs, t is the critical value for a 5% two-sided test drawn from tables of t distribution with n - 1 degrees of freedom (df), where n is the sample size.
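The formula for calculating CI for mean bias (mean difference = d) is:
$$d \pm t \times \frac{\mathrm{SD}}{\sqrt{n}}$$
The formula for calculating CI for limits of agreement (d - 2SD and d + 2SD) is:
$$(d - 2\,\mathrm{SD}) \pm t \times \sqrt{\frac{3\,\mathrm{SD}^2}{n}} \quad \text{and} \quad (d + 2\,\mathrm{SD}) \pm t \times \sqrt{\frac{3\,\mathrm{SD}^2}{n}}$$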
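As a rough sketch of how these intervals might be computed outside Minitab, the Python fragment below uses the standard errors given by Bland and Altman (4): SD/√n for the mean bias and approximately √(3SD²/n) for each limit of agreement. The function name `bland_altman_with_cis` and the readings are hypothetical, and the critical t value is supplied by the caller from t tables rather than computed.

```python
import math
import statistics


def bland_altman_with_cis(new, established, t_crit):
    """Return bias and limits of agreement, each as (estimate, CI low, CI high).

    t_crit is the two-sided critical t value for n - 1 degrees of freedom,
    looked up by the caller (for example, from printed t tables).
    """
    diffs = [a - b for a, b in zip(new, established)]
    n = len(diffs)
    d = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    se_bias = sd / math.sqrt(n)            # standard error of the mean bias
    se_limit = math.sqrt(3 * sd ** 2 / n)  # approximate standard error of each limit

    def with_ci(estimate, se):
        return (estimate, estimate - t_crit * se, estimate + t_crit * se)

    return {
        "bias": with_ci(d, se_bias),
        "lower_limit": with_ci(d - 2 * sd, se_limit),
        "upper_limit": with_ci(d + 2 * sd, se_limit),
    }


# Hypothetical cardiac output readings (L/min) from four subjects.
new_monitor = [5.0, 4.0, 6.0, 5.0]
thermodilution = [4.0, 5.0, 5.0, 6.0]
# 3.182 is the tabulated two-sided 5% critical value of t for 3 df.
result = bland_altman_with_cis(new_monitor, thermodilution, t_crit=3.182)
for name, (estimate, low, high) in result.items():
    print(f"{name}: {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With so few subjects the intervals around the limits are wide, which is the reason for reporting them: the sample limits alone would overstate how well the agreement is known.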
Finally, we also tried to infer the definitions of some terms peculiar to measurement, such as accuracy, precision, and parameter (8–10). However, we did not evaluate the studies based on the use of these terms.
Results
We identified 66 articles in which a new measurement method was evaluated. Three animal studies were excluded, as were 19 studies in which interchangeability was not the primary goal. In two other studies, conclusions were based on correlation and regression analysis. These exclusions left 42 articles for further examination (11–52). In all these studies, Bland and Altman analysis was used to present the results. Table 1 summarizes the statistical reporting in these studies. We noted the use of the Bland and Altman plot (difference versus average) in 38 articles (90.5%). Data transcription for evaluation and summarization was possible from all but two studies (27,51). In these two studies, the opinion of one of the coauthors (JFF) was sought to solve the problem.
Discussion
Error quantification is an important component in the evaluation of new measurement techniques. Bland and Altman analysis is a statistical technique that quantifies error for repeatability and limits of agreement (3,4). Our study identified several inadequacies and inconsistencies in the statistical reporting of studies in which new measurement systems were evaluated, although 95% of the studies used Bland and Altman methodology for analysis.
Repeatability is relevant in measurement comparison studies because poor repeatability (considerable variation in repeated measurements on the same subject) precludes the assessment of agreement between the two methods of measurement. Therefore, repeatability must be demonstrated before agreement between methods can be established.
A conclusion about interchangeability should not be based on mean bias alone but also should consider limits of agreement. For example, if a new instrument for noninvasive blood pressure measurement records systolic pressure as 120, 140, 110, 120, and 130 mm Hg in a sample of five subjects and the corresponding values obtained by direct arterial monitoring are 140, 110, 110, 100, and 160 mm Hg, respectively, then mean bias ± 2SD is 0 ± 51. This example illustrates that one can be misled in agreement evaluation if the conclusion is based on the mean bias alone disregarding the limits of agreement. This survey identified one study with such an error (28).
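The arithmetic of this example can be checked with a short Python sketch. The function name `limits_of_agreement` is hypothetical; the pressures are the ones quoted in the text, and the SD is the sample SD of the paired differences.

```python
import statistics


def limits_of_agreement(new, reference):
    """Return (mean bias, bias - 2SD, bias + 2SD) for paired measurements."""
    diffs = [a - b for a, b in zip(new, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    return bias, bias - 2 * sd, bias + 2 * sd


# Systolic pressures (mm Hg) from the five-subject example in the text.
noninvasive = [120, 140, 110, 120, 130]
arterial = [140, 110, 110, 100, 160]
bias, lower, upper = limits_of_agreement(noninvasive, arterial)
print(f"bias = {bias:.0f} mm Hg, limits of agreement = {lower:.0f} to {upper:.0f} mm Hg")
# prints "bias = 0 mm Hg, limits of agreement = -51 to 51 mm Hg"
```

The individual differences are -20, 30, 0, 20, and -30 mm Hg, so they cancel in the mean while the SD of about 25.5 mm Hg exposes the disagreement.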
Ideally, the limits of agreement need to be defined a priori in the methods, but such a definition was given in only three studies (20,29,30). The American National Standards of the Association for the Advancement of Medical Instrumentation recommend that maximal bias of noninvasive arterial pressure, obtained from at least 85 patients, should not exceed 5 mm Hg ± 8 SD from a noninvasive reference method (53). The British Hypertension Society considered the above criterion too liberal and proposed an alternative grading system according to the percentage of readings within 5, 10, and 15 mm Hg of a noninvasive reference method (54). Unfortunately, neither criterion is readily applicable in perioperative settings because these guidelines were planned for evaluating blood pressure instruments used in outpatient clinics. In perioperative settings, an invasive reference standard is usual. One cardiac output study defined the limits of agreement a priori as ±1 L/min (20). Although not described in methods, two studies used valid criteria for limits of agreement while evaluating results (23,39). The intraarterial blood gas monitoring study (23) used published guidelines (55) to evaluate its results. The limits of agreement for blood gas measurements are as follows: PO2 range, 30.4 to 152 mm Hg; PCO2 range, 20.5 to 80.56 mm Hg; the limits must be ±4.6 mm Hg of the reference. In another study in which an intraoperative hemoglobin monitor was evaluated (39), the limits were empirically defined as ±1 g/dL from the laboratory reference method. Defining the limits of agreement for different physiologic variables may be a difficult aspect in designing the measurement comparison studies, especially in perioperative and critical care settings, because action limits (clinically important) depend upon the clinical scenario and the status of other related variables. Nevertheless, an attempt must be made to define such limits, at a minimum after pooling data from other studies. Alternatively, a Delphi survey (opinion from experts) may be used to design the study. Without a priori setting of limits, widely discrepant limits of agreement have been chosen (Table 2). Such varying limits seem too difficult to accept in practice and may mislead clinicians who are inexperienced in technology of evidence-based analysis.
The x axis of the Bland and Altman analysis should ideally be represented by the average of measurement values obtained by two different methods because true value is unknown. Bland and Altman proved mathematically that the x axis must represent the average values of two methods (5). Three studies used values obtained by the established method alone on the x axis.
The plot of difference against average in Bland and Altman analysis also allows us to investigate any possible relationship (correlation) between measurement error (difference between two methods) and the assumed true value (average value of two methods). Bland and Altman's suggestions are subject to the assumption that there is no pattern in the plot of difference versus average (3,4). The correlation coefficient could be tested against the null hypothesis of r = 0 for a formal test of independence. Ideally, such independence should also be demonstrated during a repeatability experiment for each of the two methods. In other words, it is important to ensure that within-subject repeatability is not associated with the size of measurements. Otherwise, results of subsequent analysis might be misleading (3).
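One way such a test of independence might be carried out is sketched below in Python. It assumes the usual Pearson correlation coefficient and the standard t statistic for testing r = 0, t = r√((n - 2)/(1 - r²)), with n - 2 degrees of freedom; the function name `difference_vs_average` and the data are hypothetical, and the statistic is meant to be compared with tabulated critical values.

```python
import math
import statistics


def difference_vs_average(method_a, method_b):
    """Pearson r between difference and average, with its t statistic.

    Assumes at least three pairs and that both the differences and the
    averages vary (otherwise r is undefined).
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    avgs = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    n = len(diffs)
    mean_d, mean_a = statistics.mean(diffs), statistics.mean(avgs)
    sxy = sum((x - mean_a) * (y - mean_d) for x, y in zip(avgs, diffs))
    sxx = sum((x - mean_a) ** 2 for x in avgs)
    syy = sum((y - mean_d) ** 2 for y in diffs)
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1:
        # Perfect linear relationship: the t statistic is unbounded.
        return r, math.copysign(math.inf, r)
    # t statistic for H0: r = 0, referred to t tables with n - 2 df.
    return r, r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical readings in which the difference grows with the magnitude.
method_a = [2.0, 4.0, 6.0, 8.0]
method_b = [2.0, 3.5, 4.0, 5.5]
r, t_stat = difference_vs_average(method_a, method_b)
print(f"r = {r:.3f}, t = {t_stat:.2f} on {len(method_a) - 2} df")
```

A t statistic beyond the tabulated critical value would indicate that the differences depend on the size of the measurement, in which case a single pair of limits of agreement would not describe the whole measurement range.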
Although the computational scheme for CIs for Bland and Altman statistics is easy to comprehend, the algebraic calculations are tedious for repeated use. We devised a macro (see Appendix 1) for Minitab (Release 10 and above; Minitab Inc., State College, PA) to facilitate such computation and present it graphically. Minitab is statistical software that can be used for medical applications (56).
Finally, standardization of nomenclature is an important issue in scientific writing. It is common to find the terms "accuracy" and "precision" in measurement comparison studies (8–10). Accuracy is defined as closeness of a measurement to its true value, and the term is used when a method is compared with an external standard. In practice, one is rarely comparing a measurement with the true value because a "gold-standard" method need not necessarily give the true value. Therefore it may be preferable to avoid the word accuracy in these contexts, and use of the term "agreement" may be preferable (D. G. Altman and M. J. Bland, written communication, 1999). Precision refers to closeness of values on repeated measurements obtained by the same method, i.e., a measure of repeatability. Confusion may arise with the use of the term "precision" because of another definition found in statistical literature. A statistical dictionary (57) defines it as follows: "precision of an estimator is its tendency to have its values cluster closely about the mean of its sampling distribution." Thus, precision is related inversely to the variance of this sampling distribution: the smaller the variance, the greater the precision. In fact, Bland and Altman used the term "precision" in the context of reporting CIs (4). In our survey of articles, "precision" was the most common incorrectly defined term and was used in contexts other than repeatability or reporting CIs. Therefore, in measurement comparison studies, avoiding the term "precision" and using the term "repeatability" may seem reasonable. If used, the term must clearly be defined (D. G. Altman, written communication, 1999). In medical literature, it is also common to find the word "parameter" used for "variable," as in "We measured the following parameters: temperature, arterial blood pressure, pulse oximetry, end-tidal carbon dioxide and cardiac output."
In statistical literature, the term "variable" refers to quantities that vary from individual to individual. The term "parameter" refers to quantities defining a theoretical model (58) and is used to indicate numerical characteristics of a population that are analogous to the numerical characteristics of a sample (statistics). The unknown population parameter is estimated from a sample of values of a variable. Therefore substitution of the specific statistical term "parameter" for "variable" must be avoided.
In this era of evidence-based medicine, standardization of statistical reporting of studies facilitates easy appraisal of published material. This survey has identified several inadequacies and inconsistencies in statistical reporting of measurement comparison studies. Such inadequacies render the validity of the conclusions in each of the articles in doubt. We encourage journal editors to evaluate submissions on this subject carefully to ensure that their readers can draw valid conclusions about the value of new technologies.
Appendix 1
The macro files in Minitab use the default extension MAC. For example, this macro can be baa.mac. It must be stored in the macros subdirectory (in the Windows version) or a folder (the Macintosh version) under the main Minitab directory or folder. The macro is invoked by the following command: %baa c4 c6, if the measurement values for the two methods are entered in Columns 4 and 6 of Minitab's worksheet. After the macro is invoked, the user is asked whether the graph should be plotted with confidence intervals or just with mean bias, bias - 2SD, and bias + 2SD. After the appropriate response (yes or no) from the user, the macro performs the required calculations. The text output of the macro includes confidence intervals no matter which graphical output is chosen. The macro is available for downloading from our Web site: http://mantha.uchicago.edu.
Acknowledgments
Supported by a grant from Clinical Practice Enhancement and Anesthesia Research Foundation, Chicago, IL.
The authors thank Sally Kozlik for editorial assistance.