## 2019

STATISTICS ROUNDTABLE

# The Folly of Youth

## Ask the right questions up front and along the way during data analysis

by Connie M. Borror

When I began analyzing data for various companies and researchers a long time ago, I learned an extremely valuable lesson: When given data, be sure to ask lots of questions about them before beginning any analysis. This may seem completely obvious to readers of this column; however, I want to share a scenario based on what happened to me.

Working with two researchers, I was given a set of data similar to the following: 5.0, 4.0, 6.0, 3.0, 4.0, 6.0, 8.0, 8.0, 7.0 and 7.0.

Not much data to work with, but I suppose it was what they had. The first thing I thought to do was graph the data using a histogram (Figure 1). Obviously, this was not helpful, given the data set was so small. Next, a dot plot (Figure 2) and a box plot (Figure 3) were constructed. Not so useful again, but I thought, "Well, that’s about the best I can do with such a small set of data." Then some summary statistics were calculated, and a 95% confidence interval on the mean was produced (Table 1).
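The summary statistics and interval described here can be reproduced in a few lines. This is a minimal sketch, assuming a two-sided t-based interval (the column does not say which method produced Table 1) and treating the values as measurements, as I did at the time. The critical value 2.262 is the standard tabled t quantile for nine degrees of freedom; the variable names are illustrative.

```python
import math
import statistics

# Data as listed in the article, treated here as continuous measurements.
data = [5.0, 4.0, 6.0, 3.0, 4.0, 6.0, 8.0, 8.0, 7.0, 7.0]

n = len(data)
mean = statistics.mean(data)
sd = statistics.stdev(data)   # sample standard deviation (n - 1 divisor)
se = sd / math.sqrt(n)        # standard error of the mean

# Assumption: two-sided 95% t interval; 2.262 is the tabled t(0.975, df = 9).
T_CRIT_9DF = 2.262
half_width = T_CRIT_9DF * se
ci = (mean - half_width, mean + half_width)

print(f"n = {n}, mean = {mean:.2f}, sd = {sd:.3f}")
print(f"95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The arithmetic is straightforward, but a correct calculation says nothing about whether the data were understood correctly in the first place.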

After I had gathered basic information, I summarized my findings in a simple short report. Being quite proud of myself, I presented the results to the researchers, thinking they would be enamored with my initial report. I could not have been more wrong. The researchers informed me the results made no sense.

As it turns out, the data were counts of outcomes, not actual measurements of an item (such as length, height or width), which is how I had treated them. Looking at the original data again, I finally noticed that the decimal values were all zero. The data were really 5, 4, 6, 3, 4, 6, 8, 8, 7 and 7.

It turned out that the computer package reported every data value with one decimal place (for example, "5.0") by convention. My thought was that some of the information in the report, such as the graphs and some summary statistics, would still be useful. I soon realized, however, that I was again incorrect.

### Off the mark

The count data given were the number of prescription errors during a 10-day period in which there were five errors on day one, four errors on day two and so on. Based on that information, looking at the raw data, it appeared as though there was an increasing trend in the number of errors over this time period. A simple time series plot of the raw data revealed the same trend (Figure 4).
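One way to put a number on the apparent trend is the least-squares slope of the daily counts against the day number. This is a sketch for illustration only: the column relies on a time series plot rather than a fitted line, and the function name `ls_slope` is hypothetical.

```python
def ls_slope(y):
    """Least-squares slope of y against day numbers 1..len(y)."""
    n = len(y)
    x = range(1, n + 1)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Daily prescription error counts from the article, days 1 through 10.
errors = [5, 4, 6, 3, 4, 6, 8, 8, 7, 7]

slope = ls_slope(errors)
print(f"slope: {slope:.2f} errors per day")
```

A positive slope matches the visual impression from the raw counts, but it describes only the number of errors, not the rate at which they occur.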

As you’re probably thinking, this interpretation is incorrect as well because the total number of prescriptions filled varied across the 10 days. The correct results to analyze and interpret involve proportions: the number of errors divided by the number of prescriptions filled that day (Table 2). Figure 5 shows the proportions during the 10-day period. This display shows no upward trend. If the same number of prescriptions had been filled on each of the 10 days, Figure 4 would be correct and useful, although the proportion of prescription errors may still be of interest and is reported more often than the number of errors.
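Converting counts to proportions is a single division per day. The sketch below uses the error counts from this column, but the daily totals are hypothetical placeholders: Table 2 is not reproduced here, and only the day-one total of 320 prescriptions appears in the text. The remaining totals were chosen solely to illustrate how a rising count can coincide with a flat proportion.

```python
# Daily prescription error counts from the article, days 1 through 10.
errors = [5, 4, 6, 3, 4, 6, 8, 8, 7, 7]

# HYPOTHETICAL daily totals, for illustration only. Only the first value (320)
# comes from the article; the rest are NOT the article's Table 2 data.
totals = [320, 260, 380, 190, 250, 390, 510, 500, 450, 440]

# Proportion of errors relative to prescriptions filled on each day.
proportions = [e / t for e, t in zip(errors, totals)]

for day, (e, t, p) in enumerate(zip(errors, totals, proportions), start=1):
    print(f"day {day:2d}: {e} errors / {t} prescriptions = {p:.4f}")
```

With these made-up totals, the counts rise over the period while the proportions stay within a narrow band, which is the kind of pattern a plot of proportions would reveal.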

At this point, it seemed obvious what should have been known or communicated in the first place:

• What is the goal of this project?
• What do these data truly represent?
• Who collected the data?
• What was the operational definition of an error for a prescription?
• Was one created and used by all who collected the data?
• Were all parties involved trained using this definition?

I wanted an additional question answered: Why were the data recorded in a way that suggests the values might be continuous? (At first glance, a value such as 8.2 appears possible.)

With this new information in hand, it would be nice to say that you now have everything you need to prepare an accurate summary report on the data. You know the data represent the number of prescription errors out of the total number of prescriptions filled. There are still some questions, however, that should be asked:

• Could a single prescription have more than one possible error?
• If so, does the number of errors represent different prescriptions in error or the total number of errors seen over all prescriptions processed that day?

For example, on day one, five errors were recorded on 320 prescriptions filled that day. Does this mean that five different prescriptions out of 320 were categorized as "error"? Or were there only three prescriptions with errors out of 320, with the first containing three errors and the second and third containing one error each, for a total of five errors? These two situations are quite different and would require different analyses. An operational definition of a prescription error would eliminate this confusion.
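The two readings of day one lead to different statistics. Here is a minimal sketch using only the figures from the example (five errors, 320 prescriptions, and either five or three affected prescriptions); the variable names are illustrative.

```python
prescriptions_filled = 320
total_errors = 5

# Reading 1: each of the five errors is on a different prescription.
prescriptions_in_error_r1 = 5

# Reading 2: three prescriptions carry the five errors (3 + 1 + 1).
errors_by_prescription_r2 = [3, 1, 1]
prescriptions_in_error_r2 = len(errors_by_prescription_r2)

# Proportion of prescriptions with at least one error, under each reading.
prop_r1 = prescriptions_in_error_r1 / prescriptions_filled
prop_r2 = prescriptions_in_error_r2 / prescriptions_filled

# Errors per prescription filled is the same under both readings.
errors_per_prescription = total_errors / prescriptions_filled

print(f"reading 1: proportion of prescriptions in error = {prop_r1:.6f}")
print(f"reading 2: proportion of prescriptions in error = {prop_r2:.6f}")
print(f"errors per prescription filled = {errors_per_prescription:.6f}")
```

The proportion of prescriptions in error and the number of errors per prescription are different quantities, and they are typically modeled under different assumptions (for example, binomial versus Poisson), which is why the operational definition matters.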

The scenario described still happens today to varying degrees. With better training in our universities and on the job, it is less likely that someone working on a project will run away with the data and begin analysis without understanding what the data represent or the goal of the project, as I did.

But asking the right questions up front and along the way is not just helpful; it’s essential for success.

Connie M. Borror is a professor in the school of mathematical and natural sciences at Arizona State University West in Glendale. She earned her doctorate in industrial engineering from Arizona State University in Phoenix. Borror is a fellow of both ASQ and the American Statistical Association and past editor of Quality Engineering.
