



Division of Management Consulting and Departments of Anesthesia and Health Management and Policy, University of Iowa, Iowa City; Department of Industrial Maintenance, Jean Monnet University, Roanne Cedex, France; and Department of Anesthesiology, Jefferson Medical College
Address correspondence and reprint requests to Franklin Dexter, Anesthesia 6-JCP, University of Iowa, Iowa City, Iowa 52242. Address e-mail to franklin-dexter{at}uiowa.edu.
| Abstract |
|---|
χ² test, Fisher's exact test, Rao and Scott test, Student's t-test, Clopper-Pearson confidence intervals, and Chen and Tipping modification of the Clopper-Pearson confidence intervals. Discrete-event computer simulation over many years was used to represent surgical suites with an unchanging cancellation rate. Because the true cancellation rate was fixed, the accuracy of the statistical methods could be determined. Cancellations caused by medical events, rare events, cases lasting longer than scheduled, and full postanesthesia or intensive care unit beds were modeled. We found that applying Student's two-sample t-test to the transformation of the numbers of cases and canceled cases from each of six 4-wk periods was valid for most conditions. We recommend that clinicians and managers use this method in their quality monitoring reports. The other methods gave inaccurate results. For example, using the χ² or Fisher's exact test, hospitals may erroneously determine that cancellation rates have increased when they really are unchanged. Conversely, if inappropriate statistical methods are used, administrators may claim success at reducing cancellation rates when, in fact, the problem remains unresolved, affecting patients and clinicians.

| Introduction |
|---|
There have been many research studies evaluating causes of cancellations on the day of surgery (e.g., 6–10). However, actually monitoring case cancellation rates and determining change over time or differences among specialties is difficult. For example, when the intensive care unit (ICU) fills, there are many cancellations both for services whose patients require postoperative ICU care and for services using the postanesthesia care unit (PACU), the ICU overflow site. If the ICU fills once or twice a month, and cancellations are being compared from one month to the next, the cancellation rate may seem to vary markedly from month to month, leading to poor management decisions.
In research studies, when statistical methods are used to compare cancellations among groups, often the χ² test is chosen (6–8). Confidence intervals for odds ratios are estimated by logistic regression to adjust for patients' baseline characteristics (9,10). These methods are fine for analyses of patient risk of medical events because each patient's risk of a medical event is statistically independent of all other patients' risks of the event. When the risk of one patient's case being canceled is correlated to that of another patient, such methods break down.
In the reality of clinicians' and managers' surgical suites, many cancellations result from nonmedical causes (e.g., full ICU, full PACU, surgeon unavailable, bad weather, or urgent cases). Whenever one of these nonmedical causes occurs, more than one case can be canceled. For example, at a university hospital with outpatient preoperative evaluation, when adults had their surgery canceled on the day of surgery, nonmedical causes were responsible for 80% of cancellations (9). At a Veterans Affairs Hospital, nonmedical reasons accounted for 67% of cancellations before the introduction of outpatient preoperative evaluations (6) and 81% of cancellations after a year of experience with this process (7). Among inpatients, 43% of cancellations were caused by nonmedical factors (11). Among all patients at a tertiary teaching hospital, 68% of cancellations had nonmedical causes (12). The issue likely is less relevant to pediatrics because percentages for nonmedical causes of case cancellation are lower than for adults: 15% in one study (10) and 33% in another (4).
We studied several statistical methods for analyzing case cancellations to determine which methods can be used accurately for clinicians' and managers' routine monitoring needs. In the Discussion, we include a worked example demonstrating the recommended method so that readers can easily implement the findings of the study.
| Methods |
|---|
When the type I error rate (α) is set equal to 0.05 (i.e., P < 0.05 is significant), a test should not achieve significance more often than on 5% of occasions unless there are true differences between groups. Because decisions based on faulty analysis often result in the implementation of processes that waste everyone's time (e.g., additional paperwork, phone calls, and laboratory and diagnostic testing), these type I errors can have a detrimental effect. Similarly, a type I error may lead some administrators to claim success at reducing cancellation rates when, in fact, there has been no change. Type II errors occur when significant differences are not detected, even though there are true differences between groups. Statistical power is high when type II error rates are low. For example, type II errors occur when some services suffer from full ICUs, but statistical tests show those cancellations do not differ significantly from those of other services. Evaluation of type II errors is relevant provided statistical methods have appropriate type I error rates.
Descriptions of Statistical Methods
Cancellations caused by medical events (Table 1) were used to represent all types of cancellations for which the fact that one patient was canceled does not change the probability that other patients have their surgeries canceled. Mr. Jones developing chest pain in the holding area before his inguinal hernia repair does not influence the probability that Mrs. Smith will have an increased temperature and white count before her total hip replacement.
Fisher's exact test will have an appropriate type I error rate (i.e., equal to its nominal, correct value) when comparing rates of cancellations from medical events between groups (e.g., between services or between 6-mo periods). If P < 0.05 is considered significant, then 5% of comparisons should demonstrate a statistical change purely based on chance. The χ² test will behave similarly, provided there are at least five cancellations in each of the groups being compared. The χ² test can be performed in a spreadsheet such as Excel using built-in functions. A corresponding method for calculating confidence intervals for proportions is the method of Clopper-Pearson (13), implementable in Excel as one formula.
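As an illustration of the χ² comparison described above, the following is a minimal Python sketch, not the spreadsheet formula referred to in the text. The function name `chi_square_2x2` and the example counts are hypothetical. It uses the Pearson statistic without continuity correction and the fact that the upper tail of a χ² distribution with 1 degree of freedom equals erfc(√(x/2)).

```python
import math

def chi_square_2x2(cancel_a, total_a, cancel_b, total_b):
    """Pearson chi-square test (no continuity correction) comparing two
    cancellation rates. Returns (statistic, two-sided P-value)."""
    table = [[cancel_a, total_a - cancel_a],
             [cancel_b, total_b - cancel_b]]
    row_totals = [total_a, total_b]
    col_totals = [cancel_a + cancel_b,
                  (total_a - cancel_a) + (total_b - cancel_b)]
    grand_total = total_a + total_b
    if 0 in col_totals:
        # All cases canceled or none canceled: the rates cannot differ.
        return 0.0, 1.0
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

if __name__ == "__main__":
    # Hypothetical counts: 10 of 100 cases canceled versus 20 of 100.
    print(chi_square_2x2(10, 100, 20, 100))
```

The sketch is appropriate only under the independence condition stated above, which holds for cancellations from medical events.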
Statistical methods to analyze nonmedical causes of cancellations can consider variations in cancellation rates within and among short periods (14). The principal determinant of OR workload by subspecialty is the day of the week (15,16). Vacations, meetings, variations in clinics, etc., are often 2 weeks long. Consequently, we considered 4 weeks the shortest data collection period that would be used without considering variation by day of the week (17–20). Our choice of 4 weeks was similar to previously published periods of multiples of months: 1 mo (8), 3 mo (7,9,10), 4 mo (12), and 6 mo (6,11).
The statistical methods estimate the variance in cancellation rates among different 4-week periods and add it to the estimate of the variance in cancellation rates among cases within the same period. The Rao and Scott method (21) has the highest statistical power among competing methods for comparing two groups, without exceeding nominal rates (21–23). Chen and Tipping (24) described an analogous method for modifying Clopper-Pearson confidence intervals. We used sets of six 4-week periods for our comparisons. A sample size of six is small statistically, but even that duration is the longest period pooled in practice when studying cancellations (6–12).
Alternatively, the uncertainty in the true percentage cancellation rate within each of the 4-week periods can be ignored (25), and Student's two-sample t-test with unequal variances applied to 2 samples of 6 numbers each. Confidence intervals for the means of single sets of six 4-week periods are calculated with the Student's t distribution (Appendix). This approach has been used widely for the statistical analysis of other OR management data, including staffing costs (17,18), ORs in use at different times of the day (19), and OR workload for purposes of OR allocation (20). However, those values are not percentage cancellation rates with values that can be close to zero. The method may work poorly when percentages are nearly equal to zero. Consequently, we followed Shirley and Hickling (25) in using the Student's t-test after transforming the percentages (26), using Equation (1) of the Appendix.
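The two-sample t-test with unequal variances (Welch's test) applied to two sets of six per-period values can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the input values are assumed to have been transformed already, and the P-value is obtained by numerically integrating the Student's t density with Simpson's rule so that only the standard library is needed.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided P-value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def welch_t_test(x, y):
    """Welch's two-sample t-test. Returns (t, degrees of freedom, P-value)."""
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t_stat = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t_stat, df, two_sided_p(t_stat, df)

if __name__ == "__main__":
    # Hypothetical transformed values for two sets of six 4-week periods.
    before = [0.41, 0.38, 0.45, 0.40, 0.36, 0.43]
    after = [0.33, 0.37, 0.30, 0.35, 0.31, 0.34]
    print(welch_t_test(before, after))
```

Each input list holds one number per 4-week period, so the test sees only the variation among periods, which is the design choice the paragraph above describes.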
Testing Statistical Methods
The validity of statistical methods is evaluated using computer simulation. A set of real data can be used to investigate whether different statistical methods give the same answer, but that does not show which answer, if either, is correct. The underlying statistical distribution used to generate the real data would be unknown, and the real data would be only one realization of the underlying statistical distribution.
We used the above-referenced research (3,4,6–9,12) and other papers to ensure that the conditions simulated were realistic (Appendix). Computer simulation provided known, correct answers to which the results of the statistical methods could be compared.
Simulated data to test the statistical methods were obtained using ARENA version 7.01 (Rockwell Software, Sewickley, PA). For each of eight different combinations of parameter values (Appendix; Tables 2–7), simulation output was counts of canceled and noncanceled cases for 65,000 4-week periods of 20 workdays. Because the cancellation rate was fixed over these 5,200 years of data, we could evaluate whether statistical tests would have a type I error rate exceeding 5% (the expected value) with a P < 0.05 criterion.
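The logic of estimating a type I error rate from simulated data with a fixed cancellation rate can be illustrated with a toy Python sketch. It is not the ARENA model used in the study: the suite size (25 cases per day), the number of cases lost per rare event (5), the 1% rate, and all function names are assumptions chosen only for illustration. Two consecutive periods are generated from the same process and compared with the χ² test; any "significant" result is a type I error.

```python
import math
import random

DAYS = 20            # workdays per 4-week period, as in the text
CASES_PER_DAY = 25   # hypothetical suite size
BLOCK = 5            # hypothetical number of cases canceled per rare event

def chi_square_p(c1, n1, c2, n2):
    """Two-sided P-value of the Pearson chi-square test for a 2x2 table."""
    total_c, total_n = c1 + c2, n1 + n2
    if total_c == 0 or total_c == total_n:
        return 1.0  # no information to distinguish the two periods
    stat = total_n * (c1 * (n2 - c2) - c2 * (n1 - c1)) ** 2 / (
        n1 * n2 * total_c * (total_n - total_c))
    return math.erfc(math.sqrt(stat / 2.0))  # chi-square with 1 df

def independent_period(rng, rate=0.01):
    """Each case is canceled independently (e.g., medical events)."""
    n = DAYS * CASES_PER_DAY
    return sum(rng.random() < rate for _ in range(n)), n

def clustered_period(rng, rate=0.01):
    """A rare event cancels a block of cases on the same day."""
    n = DAYS * CASES_PER_DAY
    p_event = rate * CASES_PER_DAY / BLOCK  # keeps the per-case rate at `rate`
    events = sum(rng.random() < p_event for _ in range(DAYS))
    return events * BLOCK, n

def type_i_error_rate(make_period, replications=2000, seed=1):
    """Fraction of period-to-period comparisons with P < 0.05 although
    the underlying cancellation rate never changes."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        c1, n1 = make_period(rng)
        c2, n2 = make_period(rng)
        if chi_square_p(c1, n1, c2, n2) < 0.05:
            rejections += 1
    return rejections / replications

if __name__ == "__main__":
    print("independent:", type_i_error_rate(independent_period))
    print("clustered:  ", type_i_error_rate(clustered_period))
```

Because the per-case cancellation rate is identical in both scenarios, any excess of rejections in the clustered scenario reflects the correlation among cancellations rather than a true change.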
| Results |
|---|
When testing for differences from one 4-week period to the next, both the χ² test and Fisher's exact test had high type I error rates caused by rare events (Table 3). Results were similar when comparing six 4-week periods to the next six 4-week periods (Table 4) and when comparing cancellation rates between services (Table 5). The type I error rate for all four types of cancellations combined represented the mixture between achieving statistical significance too often when rare events were present and too infrequently from other causes (Tables 3–5). Likewise, Clopper-Pearson confidence intervals included the true rate of cancellations caused by rare events far too infrequently (Table 6).
Rao and Scott (21) and Chen and Tipping (24) methods were more accurate than the χ² and Clopper-Pearson methods, respectively, but still had type I error rates exceeding the nominal value of 5% when cancellations were caused by rare events (Tables 4 and 6). When all four types of cancellations were present, performance was sensitive to the incidence of cancellations caused by rare events.
Student's t-test and analogous methods were generally accurate (Tables 4–6). The same finding was obtained when the counts were first transformed. The latter method had the smallest absolute difference from the expected 5% type I error rate for confidence intervals in the simulation of the five OR surgical suites with cancellations caused by rare events only (Table 6). In that circumstance, 19% of the 4-week periods had no observed cancellations, and 60% had 4 or fewer (see Discussion).
Table 7 addresses type II errors, as described in the first section of Methods. Statistical power to detect differences in cancellation rates between services was significantly higher for Student's t-test applied to transformed counts than without transformation.
| Discussion |
|---|
Table 8 provides an example of the method using real data from an academic medical center. Table 8 also shows the usefulness of the method. The method can be implemented in a few lines of computer code and a spreadsheet (e.g., Excel). A manager can test the answer provided to him or her by computer software using small amounts of data (e.g., that in Table 8). Finally, the method is based simply on the numbers of canceled versus performed cases. Although we studied effects of different types of cancellations in this paper, the usefulness of the method is unaffected by the ability of a facility to track and categorize the reason for each of its case cancellations.
Different Statistical Methods
We do not recommend excluding days with cancellations caused by rare events for three reasons. First, administrators can be motivated to reduce cancellations because of their economic importance. Excluding days with rare events understates the benefit of preventing cancellations. For example, if anesthesiologists want to show that transplant cases occurring early on weekdays markedly disrupt the elective schedule, excluding those days from the report makes no sense. Second, our experience is that many rare events are not caused by snowstorms but rather by events that are hard to identify clearly in practice. For example, although a surgeon's flight home may be delayed, resulting in the cancellation of his cases scheduled for the next day, that reason may not be reported in the OR log sheet. Third, the definitions and types of rare events are likely to vary among facilities and/or surgical populations. Trying to set systematic and valid policies for exclusion of rare events will be challenging. An advantage of not excluding rare events is that precisely which types of cancellations are rare does not need to be defined.
We do not recommend using Fisher's exact test or similar methods to compare cancellation rates, even when review of the data suggests that few of the observed cancellations were caused by rare events. A 1% cancellation rate attributable to rare events (Table 2) was sufficient to affect statistical methods markedly (Tables 3–6). Some surgical suites will have an incidence of cancellations caused by rare events of <1%. Yet, they are unlikely to know their true incidence because the upper bound on the incidence of cancellations from rare events cannot be estimated accurately using methods appropriate for medical events (Table 6). Thus, we recommend simply using Student's t-test applied to transformed data for OR cancellations.
Rao and Scott and Chen and Tipping methods performed worse than we expected (21–24). Our results probably differed from those previously reported because the previous papers used sample sizes applicable to toxicology studies, not case cancellations. First, we studied only six 4-week periods versus toxicology studies with 30 or so litters of pups, for which those methods perform well. Our sample size of six was probably too small for accurate estimation of the variances in cancellation rates among 4-week periods. Second, we had hundreds of scheduled cases within each 4-week period versus toxicology with litters of 2–12 pups. Consequently, there was relatively little uncertainty in cancellation rates within 4-week periods, just uncertainty among periods. This pattern explains why Student's t-test and analogous methods performed quite well.
We did not study nonparametric methods such as Mann-Whitney-Wilcoxon (23) because parametric methods like Student's t-test have higher statistical power, and, for our application, performed well after data transformation (Tables 4 and 5).
Limitations
Our results are likely valid because they seem unaffected by the characteristics of our mathematical model of a surgical suite other than with respect to the incidence of a rare event and the resulting number of case cancellations from each rare event. Thus, our results apply fully to the many surgical suites with virtually no daily cancellations because of cases running late or full PACUs. Fine tuning our model to be more realistic for any one surgical suite would likely be of little or no value other than to the extent that we more realistically model the characteristics of rare cancellations at the specific suite. The pattern of such events likely varies depending on each suite's unique circumstances, such that additional realism would make results less valid for most other sites. This limitation is moot provided a statistical method is used that is robust to rare events. That is why we recommend that (almost) all facilities that monitor cancellation rates use such a method (e.g., as in Table 8).
Cancellation rates likely vary among facilities, depending partly on the types of patients receiving care. For example, some published cancellation rates (including those on the day before surgery) include 4.6% for outpatients (9), 6.6% for outpatients (6), 9% for outpatients (11), 10% among outpatients (12), 10% among pediatric outpatients (10), 12% among plastic surgery patients (8), 13% overall (7), 17% among inpatients (11), 19% among inpatients (6), and 30% among inpatients (12). We studied cancellation rates on the day of surgery between 0.8% and 6.4% (Table 2). We recommend that our results not be applied by facilities lacking at least one observed cancellation in each of the 6 studied 4-week periods. We repeated the simulations with just rare events, using only two ORs and only the service with two-hour average case durations. Confidence intervals were created using Student's t distribution with the transformation applied to six 4-week periods, as in Table 6. The 95% confidence intervals failed to contain the true cancellation rate for 11.9% ± 0.3% of comparisons. This unacceptably high type I error rate occurred because 57% of 4-week periods had no cancellations. We doubt that our inability to consider fewer than one cancellation every four weeks is a major limitation, because when the incidence is so infrequent, most clinicians and managers would be uninterested in quantifying cancellations.
Summary
Clinicians and managers interested in routine monitoring of OR cancellation rates generally need a robust method that can be applied automatically, without a formal statistical assessment like a research study. We recommend calculating the number of canceled and performed cases during each four-week period, transforming each period's cancellation rate, and then applying Student's t-test. Methods such as Fisher's exact test and the χ² test can give highly misleading results, resulting in inappropriate management decisions.
| Appendix |
|---|
Discrete-event computer simulation (27) was used to represent the random flow of patients from ORs through the PACU. Each workday was simulated independently of all other workdays. Simulation was performed for 5 OR and 15 OR surgical suites.
Scheduled case durations were described using different log-normal distributions for each of three services. Each service had a mean scheduled duration of 1.0, 2.0, or 3.0 h, with a common standard deviation of the logarithm of case duration in hours equal to 0.725 (28). After calculation, the scheduled durations were bounded between 0.3 and 1.9 h for the 1-h service, between 0.6 and 3.9 h for the 2-h service, and between 0.9 and 5.9 h for the 3-h service. The actual case durations were calculated using the method described by Kennedy (29) to include the differences between scheduled and actual case durations that were measured by Goldman et al (30). Specifically, actual case durations were set equal to the scheduled case duration multiplied by a normally distributed random number with a mean of 1.00 and sd of 0.25 (31).
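The sampling of scheduled and actual case durations described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the simulation code used in the study: it assumes that each service's "mean" is the mean of the untruncated log-normal distribution (so that μ = ln(mean) − σ²/2) and that "bounded" means clipping to the stated limits rather than resampling. The function names are hypothetical.

```python
import math
import random

SIGMA_LOG = 0.725  # sd of the logarithm of case duration in hours (from the text)

# Service mean (h) -> (lower, upper) bounds on scheduled duration (from the text).
BOUNDS = {1.0: (0.3, 1.9), 2.0: (0.6, 3.9), 3.0: (0.9, 5.9)}

def scheduled_duration(rng, service_mean):
    """Draw one scheduled case duration (hours) for a service."""
    # Assumption: service_mean is the mean of the untruncated log-normal.
    mu = math.log(service_mean) - SIGMA_LOG ** 2 / 2.0
    low, high = BOUNDS[service_mean]
    # Assumption: bounding is done by clipping, not by resampling.
    return min(max(rng.lognormvariate(mu, SIGMA_LOG), low), high)

def actual_duration(rng, scheduled):
    """Scheduled duration times a normal multiplier (mean 1.00, sd 0.25)."""
    return scheduled * rng.gauss(1.00, 0.25)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        s = scheduled_duration(rng, 2.0)
        print(round(s, 2), round(actual_duration(rng, s), 2))
```

Clipping changes the realized mean slightly relative to the nominal service mean; resampling out-of-range draws would be an equally plausible reading of the text.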
Each turnover time ("patient out" to "patient in") was assigned a time duration generated randomly from a log-normal distribution with mean ± sd = 0.30 ± 0.20 h, bounded between 0.17 and 1.50 h.
Each OR in the surgical suite had two surgeons. The first surgeon completed his or her cases, followed by the second surgeon. The cases were divided randomly, with equal probability, between the two surgeons. Often this resulted in an unequal number of cases performed by the two surgeons in each OR. For the 5 OR and 15 OR surgical suites, 2 ORs and 5 ORs were allocated for the service with a mean duration of 1.0 h, respectively. Cases were scheduled sequentially using an 8-h workday. Adjusted use (OR time plus turnovers) was 83.7% ± 0.1% (se). For the 5 OR and 15 OR surgical suites, 2 ORs and 5 ORs were allocated for the service with a mean duration of 2.0 h. Adjusted use was 77.6% ± 0.1%. For the 5 OR and 15 OR surgical suites, 1 OR and 5 ORs were allocated for the service with a mean duration of 3.0 h. Adjusted use was 71.2% ± 0.1%.
Cancellations caused by rare events were represented by the unexpected absence of a surgeon. Whether a surgeon was unavailable was determined by a Bernoulli distributed random number. If the surgeon was unavailable, all of that surgeon's cases for the day, from the preceding paragraph, were canceled. The achieved risk of any one case being canceled from this cause was 1.0% (Table 2).
Cancellations from medical causes were simulated by generating a Bernoulli distributed random number with a 0.8% probability. Such cancellations occurred equally frequently for the three services, unlike cancellations from other causes.
Cancellations caused by cases running late were used to represent cancellations from any cause providing correlation in risks within services. If a case was expected, from its scheduled duration, to finish more than 0.5 h after the end of the 8-h workday, the case was canceled.
Cancellations caused by a full PACU were used to represent cancellations from any cause providing correlation in risks among services. Ten PACU beds were planned for the 5 OR surgical suite and 30 PACU beds for the 15 OR suite. Each patient's time in the PACU was generated from a lognormal statistical distribution with a mean of 1.0 h and sd of 1.2 h, bounded between 0.5 and 3.0 h. If the PACU was full, discharges from ORs into the PACU were delayed in original sequence. A case was canceled if the patient was expected to enter the PACU more than 1.5 h after the end of the 8-h workday.
Whether a case was canceled was determined in the sequence of medical cause, rare event, cases running late, and then full PACU.
Freeman-Tukey Double Arcsin Transformation
The Freeman-Tukey double arcsin transformation (26) equals

$$y = \arcsin\sqrt{\frac{c}{n+1}} + \arcsin\sqrt{\frac{c+1}{n+1}} \qquad (1)$$

where c is the number of cancellations and n is the number of scheduled cases during a 4-week period. Table 8 gives an example of applying the transformation.
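A minimal Python sketch of the transformation follows. It assumes the standard Freeman-Tukey form, arcsin√(c/(n+1)) + arcsin√((c+1)/(n+1)), with c and n as defined above; the function name and the example counts are hypothetical and are not taken from Table 8.

```python
import math

def freeman_tukey(c, n):
    """Freeman-Tukey double arcsin transformation of c cancellations
    among n scheduled cases (standard form, in radians)."""
    return (math.asin(math.sqrt(c / (n + 1)))
            + math.asin(math.sqrt((c + 1) / (n + 1))))

if __name__ == "__main__":
    # Hypothetical 4-week periods: (cancellations, scheduled cases).
    periods = [(12, 410), (9, 395), (0, 402), (15, 420), (11, 388), (7, 405)]
    for c, n in periods:
        print(c, n, round(freeman_tukey(c, n), 4))
```

The transformed value stays well defined when a period has no cancellations, which is the situation in which untransformed percentages behave poorly.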
The inverse of the transformation is required only when calculating 95% confidence intervals for the cancellation rate, as in Table 6. Calculate the sample mean $\bar{y}$ and sd $s$ of the transformed values $y_1, y_2, \ldots, y_p$ from each of the $p$ four-week periods. Estimate confidence intervals by

$$\bar{y} \pm t \, \frac{s}{\sqrt{p}} \qquad (2)$$

where $t$ is the inverse of the Student t-distribution with $p-1$ degrees of freedom. Report the value of $\bar{y}$ and its confidence intervals after taking the inverse of the transformation of Equation (1).
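The confidence interval calculation on the transformed scale can be sketched in Python as follows. The function name and input values are hypothetical. The critical value 2.571 is the two-sided 95% point of Student's t distribution with 5 degrees of freedom, which applies only to six periods; for another number of periods a different critical value must be supplied.

```python
import math
from statistics import mean, stdev

T_CRIT_DF5 = 2.571  # two-sided 95% critical value, Student's t, 5 degrees of freedom

def transformed_ci(values, t_crit=T_CRIT_DF5):
    """Return (lower, center, upper) on the transformed scale:
    mean +/- t * sd / sqrt(p), where p is the number of periods."""
    p = len(values)
    center = mean(values)
    half_width = t_crit * stdev(values) / math.sqrt(p)
    return center - half_width, center, center + half_width

if __name__ == "__main__":
    # Hypothetical transformed values from six 4-week periods.
    print(transformed_ci([0.41, 0.38, 0.45, 0.40, 0.36, 0.43]))
```

All three returned values are still on the transformed scale and need the inverse transformation before being reported as cancellation rates.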
To calculate the inverse of the transformation, we use the bisection method in our Visual Basic for Excel code (32). For convenience, we show the steps for the sample mean of the transformed values, $\bar{y}$. The same steps are applied to the lower and upper confidence limits.
Readers can check their implementation of the steps by using the transformed and nontransformed cancellation rates in Table 8.
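A bisection inverse can be sketched in Python as follows. This is not the authors' Visual Basic for Excel code, whose steps are not reproduced here. The sketch assumes the standard Freeman-Tukey form of the transformation, a single representative number of scheduled cases n per period (for example, the mean across periods), and that the number of cancellations can be treated as the continuous quantity r·n when searching for the rate r. All names are hypothetical.

```python
import math

def freeman_tukey(c, n):
    """Freeman-Tukey double arcsin transformation (standard form)."""
    return (math.asin(math.sqrt(c / (n + 1)))
            + math.asin(math.sqrt((c + 1) / (n + 1))))

def inverse_freeman_tukey(y, n, tol=1e-10):
    """Find the cancellation rate r in [0, 1] whose transformed value,
    computed with c = r * n, equals y. Uses bisection, relying on the
    transformation being increasing in c."""
    if y <= freeman_tukey(0.0, n):
        return 0.0  # at or below the value for zero cancellations
    if y >= freeman_tukey(n, n):
        return 1.0  # at or above the value for all cases canceled
    low, high = 0.0, 1.0
    while high - low > tol:
        middle = (low + high) / 2.0
        if freeman_tukey(middle * n, n) < y:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0

if __name__ == "__main__":
    # Round trip with hypothetical counts: 12 cancellations among 400 cases.
    y = freeman_tukey(12, 400)
    print(inverse_freeman_tukey(y, 400))
```

A round trip through both functions, as in the usage lines, is one way to check an implementation when a published worked example is not at hand.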