Can You Trust Your Data?
How to find manual data entry errors
by Jeff Lackey
You have a deadline to complete a report, and you're in the middle of analyzing your experimental data when you realize many of the results that were manually entered into the computer from the original handwritten data sheets appear to be typos. It's not possible for you to find and correct all of them and still meet your deadline. Now what do you do? The right thing to do is to be honest: request more time to correct the errors so you can report reliable results.
What to do in the future
Errors will occur when someone inputs data manually. No one is perfect. The question is how to discover which entries are wrong or unusual enough to warrant validation.
Techniques for finding typos have evolved from manually checking each entry, to entering the data twice in separate files and comparing them electronically, to having someone else do the checks. You name it, it's been tried. All of these methods share the same weakness: they reconfirm that 100% inspection is not 100% effective. So how can the discovery of potential data entry errors be made more effective?
When you are dealing with numbers
The objective when dealing with numbers is to discover potential special causes. What better way to find special causes than to use control chart methodology to assist in this process?
The type of data to be checked and the way they are grouped may influence you to choose a particular control chart. The most universal approach for checking many types of data is using an individuals chart with a moving range chart. Occasionally x-bar charts are used with range or standard deviation charts. It depends on how the data are grouped and the magnitude of differences between groups. Through experience, you will know which chart to use, but when in doubt, use an individuals chart.
Selective sorting may help when the magnitude of differences between data groups is large. Make an individuals chart for each column if several columns of data are included in the file. Note the line numbers where there are verifiable errors in one column, as the same lines may contain errors in other columns.
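As a rough sketch of this screening in plain Python (rather than any particular SPC package), the individuals-chart limits can be computed as the mean plus or minus 2.66 times the average moving range and applied column by column. The column names and data below are illustrative, not from the article:

```python
def xmr_signals(values):
    """Return the 1-based row numbers falling outside individuals-chart limits.

    Limits use the standard X-mR scaling: mean +/- 2.66 * average moving range.
    """
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    lower, upper = mean - 2.66 * avg_mr, mean + 2.66 * avg_mr
    return [i + 1 for i, v in enumerate(values) if v < lower or v > upper]

# Illustrative columns; row 4 of "temperature" has a misplaced decimal
# point (7.23 entered instead of 72.3).
table = {
    "temperature": [72.1, 72.4, 72.2, 7.23, 72.5, 72.3, 72.0, 72.4, 72.1, 72.3],
    "pressure":    [101.3, 101.1, 101.4, 101.2, 101.3, 101.5],
}
for column, values in table.items():
    rows = xmr_signals(values)
    if rows:
        print(f"{column}: validate rows {rows}")  # flags temperature row 4
```

One caveat on the design: because the typo itself inflates the average moving range, a screen based on the median moving range (commonly scaled by 3.145 instead of 2.66) is more robust when several typos cluster together.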
After you identify out-of-control signals, it's time to validate whether these are real or possible errors. Here are some examples of typical errors that will stand out:
- Missing or misplaced decimal points (61.38 instead of 6.138).
- Missing zeros immediately after the decimal point (2.54 instead of 2.054).
- A whole number instead of a decimal number, probably indicating the enter key was pressed rather than the decimal point key (6.000 instead of 6.138).
- Transposed digits that make the entry significantly different from other data entries (2,817 instead of 8,217).
- Missing minus signs (3.14 instead of -3.14).
Even conscientious efforts result in error rates of 1 to 2% for these types of errors. When an experimenter is trying to detect an effect at the 5% significance level, such errors can trigger wasteful forays when none are warranted, especially when several hundred manual data entries have been made.
Of course this technique will not find all possible data entry errors. It's entirely possible some errors will go undetected. Indeed, what was originally a special cause value could be entered incorrectly and end up looking like all the other numbers.
What about alphanumeric data?
If you are manipulating non-numeric or categorical data in Microsoft Excel, one technique is to use the CODE function. It returns the numeric (ASCII) code of the first character of a text string, which can then be used as input to control charts.
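The same idea can be sketched in Python, where `ord` plays the role of Excel's CODE function; the labels below are illustrative:

```python
# Mimic Excel's CODE function: take the character code of the first
# character of each entry, yielding numbers a control chart can plot.
labels = ["Pass", "Pass", "pass", "Pass", "Fail", "Pass"]
codes = [ord(label[0]) for label in labels]
print(codes)  # → [80, 80, 112, 80, 70, 80]
```

The lowercase "pass" and the "Fail" entry stand apart from the rest. Note that, like CODE, this looks only at the first character; summing the codes of every character would also catch typos past the first position.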
Another technique is simply to sort the column and scan it for odd strings. Or use Excel's built-in AutoFilter on each column and click through each unique alphanumeric string to find oddballs. This approach even works for numerical data if the data set is not too large.
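That scan can be sketched in Python (with illustrative data) by tallying each unique string and listing the rare ones:

```python
from collections import Counter

entries = ["blue", "blue", "bleu", "green", "blue", "green", "blue"]
counts = Counter(entries)
# Strings that occur only once are the likely oddballs worth validating.
oddballs = sorted(s for s, n in counts.items() if n == 1)
print(oddballs)  # → ['bleu']
```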
You must trust the data
Analyzing data from any experiment or quality improvement activity requires trust in the data, especially when they have been manually entered into your statistical software. Likewise, recommendations resulting from an analysis of experimental data cannot be trusted if the data entry is unreliable. Thus, it is imperative that the analyst devote sufficient time to making sure data inputs are as reliable as possible.
JEFFERY I. LACKEY is a senior quality engineer at Vistakon in Jacksonville, FL. He earned a master's degree in engineering from the University of Michigan and is a Senior Member of ASQ. Lackey is an ASQ certified quality engineer, reliability engineer and quality auditor and a Registrar Accreditation Board quality systems provisional auditor. He is also chair of ASQ's Jacksonville section.
If you would like to comment on this article, please post your remarks on the Quality Progress Discussion Board on www.asqnet.org, or e-mail them to firstname.lastname@example.org.