3.4 PER MILLION
An indirect way to find quantities you need for your investigation
by Joseph D. Conklin
The ideal approach in Six Sigma is to directly measure your inputs and outputs. What can a quality practitioner do when that’s not possible or practical? If the difficult-to-measure variable has a known or theoretically reasonable relationship to another quantity that is readily available, one option is to use this relationship to indirectly estimate the challenging variable.
Suppose you want to estimate the total tons of cement used in construction for 10 counties in a certain state. The direct approach requires drawing up a list of construction companies and contractors in the state and surveying them about their cement use. This may not be viable for several reasons.
A master list of construction companies may not exist, and there may not be the money or time to draw one up and visit most or all companies. Companies may be unwilling to release internal data to outside parties if they fear publishing it might help local competitors figure out their business.
Skipping the measurement may not be an option. In the case of cement use, let’s say this number factors into models of economic activity used to forecast future business trends in the state. Without it, the forecast is incomplete or unavailable. How might an indirect approach work?
Building a solution
To illustrate, suppose the construction companies are willing to work with an industry trade association that occasionally asks a random sample of them about their cement use. Suppose further that the results are published at the state level. The companies are more willing to share their data in this way because:
- They are confident in the trade association’s ability to protect it.
- State-level numbers are aggregated at a high level so competitors can’t figure out what any single company is doing locally.
In this situation, what you want (county-level numbers) is unavailable. The trade association has state-level numbers that it collects for its own purposes. It’s willing to share them to help you meet your goal. How do you indirectly estimate the county-level numbers?
To keep things simple, say there are two uses that consume almost all of the cement in a state: building and maintaining sidewalks, and new building construction. With a few exceptions, sidewalk work will be managed by local governments. They will need to keep records of how many miles are covered, or maybe the local budget has a line item for dollars spent.
New buildings require permits. These are likely on file with the planning or licenses office. While gathering public records involves considerable effort, they are at least more likely to be accessible than private company data.
After spending considerable energy explaining your needs to county governments, say you obtain the information displayed in Table 1, which provides estimates of cement use derived from the more easily obtained data using a three-step process:
- Find the total permits and total sidewalk spending for all counties in the state.
- Compute the percentage shares of the state totals for the 10 counties.
- Apply the 10 county shares to the cement data from the trade association.
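The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `allocate_by_share`, the county names and every number are hypothetical (none come from Table 1), and the sketch assumes the trade association reports state tonnage for new building construction separately, so that permits can serve as the proxy for that use.

```python
def allocate_by_share(state_tons, county_values, state_value_total):
    """Split a state cement total across counties in proportion to a proxy.

    county_values maps county name -> proxy value (permits or sidewalk dollars).
    state_value_total is the proxy total over ALL counties in the state (step 1).
    """
    if state_value_total <= 0:
        raise ValueError("state proxy total must be positive")
    # Steps 2 and 3: county share of the state proxy, applied to state tonnage.
    return {
        county: state_tons * value / state_value_total
        for county, value in county_values.items()
    }


# Hypothetical numbers for illustration only.
state_building_tons = 50_000.0   # assumed state tonnage for new buildings
state_permits = 4_000            # permits across all counties in the state
county_permits = {"County A": 400, "County B": 100}  # two of the 10 counties

building_estimates = allocate_by_share(
    state_building_tons, county_permits, state_permits
)
print(building_estimates)
```

The same function could be applied to sidewalk spending and the state sidewalk tonnage, with the two results added per county to give a total.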
What cautions accompany this procedure? First, you need to appreciate the important underlying assumption: Cement tonnage has a constant relationship with sidewalk spending or total building permits. Figure 1 shows this type of relationship. In pure form, the relationship says a given change in spending or permits is associated with the same change in tonnage, whatever the current level of tonnage.
This is a stricter assumption than saying tonnage should move in the same direction as either spending or permits. The true relationship may be better illustrated in Figure 2, which shows tonnage changes at a constant rate over some range of spending or permits. But different rates apply as you move from low to medium to high levels of these variables. The true relationship may be even more complicated, as shown in Figure 3: Spending or permits may have a curved relationship to tonnage.
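One way to write the constant-relationship assumption down is to read it as strict proportionality, a straight line through the origin. That reading, and the symbols used here ($T_c$ for a county’s tonnage, $x_c$ for its spending or permits, $k$ for the constant rate), are illustrative assumptions rather than notation from the article.

```latex
T_c = k\,x_c
\quad\Longrightarrow\quad
\frac{T_c}{T_{\text{state}}}
  = \frac{k\,x_c}{k\sum_{j} x_j}
  = \frac{x_c}{\sum_{j} x_j}
```

Here the sum runs over all counties in the state. Under this reading, a county’s share of spending or permits equals its share of tonnage, which is what the share-based procedure relies on. If the line had a nonzero intercept, or if the rate $k$ varied by level as in Figures 2 and 3, the two shares would no longer match exactly.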
The only sure way to know is to have the actual county-level cement tonnage numbers, but then you wouldn’t need to go through the exercise. If indirect estimation appears to be the best choice for getting at hard-to-obtain or otherwise unobtainable data, the best way to be sure it is really a good option is to create a checklist of tests in advance.
Concerns about confirmation
In general, the longer the checklist and the more tests on it that are confirmed, the more secure you can feel using indirect estimation. The items on the checklist will depend on the variable you are trying to estimate. With respect to the example, here are some possible tests for a checklist:
- The relationship that the indirect estimation assumes: Do people with experience in producing and using cement think it’s reasonable?
- Does the estimation produce values for a county that local producers and users believe are reasonable, considering current conditions and recent history?
- Are there additional variables or special adjustments that producers and users in a particular county can offer to improve the quality of that county’s estimate?
- Is there a county whose market is so large that companies in that county can be confident the local competitors cannot figure out their business if they report their actual use to the trade association? Can the trade association survey that county alone to produce a figure for comparison to the indirect estimate?
- Is it possible to take some fraction of the sidewalk or building projects for the county, work with the cement users to determine how many tons they used, and compare this figure for consistency against what a corresponding fraction of the county’s indirect estimate would suggest?
- This example starts at state-level numbers and estimates down to counties. Can the same indirect estimation procedure start with a national number and estimate down to states, assuming national and all state tonnage numbers are readily available? Are state-level numbers from the indirect estimation consistent with any actual, readily available state-level numbers?
- Is there more than one variable that can be used for indirect estimates? Do most or all of them produce county-level estimates that approximately agree?
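The last test on the checklist, agreement among estimates from different variables, could be checked with something as simple as the sketch below. The function name `relative_spread`, the three estimates and the 20% tolerance are all hypothetical choices made for illustration; the article does not prescribe a measure of agreement.

```python
def relative_spread(estimates):
    """Return (max - min) / mean for one county's indirect estimates."""
    mean = sum(estimates) / len(estimates)
    return (max(estimates) - min(estimates)) / mean


# Hypothetical tons for one county, each from a different proxy variable.
county_estimates = [5_000.0, 5_400.0, 4_600.0]

spread = relative_spread(county_estimates)
agrees = spread <= 0.20  # the 20% tolerance is an arbitrary illustration
print(f"spread = {spread:.2f}, approximately agree: {agrees}")
```

A small spread does not prove the estimates are right, only that the variables tell a consistent story; a large one is a signal to revisit the assumed relationship.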
The larger truth underlying a checklist such as this one is that indirect estimation is a form of modeling the data. The ultimate test of a model is how well it reflects reality. Consider Figures 4 and 5, based on the fifth test on the checklist, to see how you might assess indirect estimation in practice.
Figures 4 and 5 show two ways the fifth test on the checklist might work out. Normally, in the example, the local companies are not willing to report their cement tonnages out of confidentiality concerns. Suppose in the first year of indirect estimates, or every few years afterward, you are fortunate to find some fraction of the local businesses in select counties willing to report their tonnage.
Maybe you can get special funding for extra measures to keep this data protected that would not normally be available on a regular basis. Whatever fraction of the local business is reported under this special data collection, you extrapolate from that data to produce an independent estimate of the county’s total cement tonnage.
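The extrapolation described above amounts to scaling the reported tonnage up by the fraction of local business it covers. A minimal sketch follows, assuming the covered fraction is known (for example, from the share of permits or projects the respondents account for) and that the respondents are representative of the county. The function name and the numbers are hypothetical.

```python
def extrapolate_total(reported_tons, fraction_covered):
    """Scale reported tonnage up to a county total.

    fraction_covered is the share of the county's business represented
    by the companies that reported, as a number in (0, 1].
    """
    if not 0 < fraction_covered <= 1:
        raise ValueError("fraction_covered must be in (0, 1]")
    return reported_tons / fraction_covered


# Hypothetical: respondents covering 25% of local business report 1,200 tons.
independent_estimate = extrapolate_total(1_200.0, 0.25)
print(independent_estimate)
```

If the respondents differ systematically from the companies that decline to report, this simple scaling will be biased, so the result is best treated as a rough cross-check.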
If the estimates from this independent survey track well with the indirect estimates, as in Figure 4, you have evidence the indirect estimates are a viable option. The opposite situation is shown in Figure 5.
In Figure 4, an independent check based on a sample of projects in each county leads to results that track well with the results of the indirect estimation. In this case, you can proceed to any other checklist items you have not yet completed.
Figure 5 is a case of results that don’t track well. In the Figure 5 case, there’s reason to believe the indirect approach needs replacement or adjustment because the pattern of points on the graph shows large disagreements between the two data sets.
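Whether two sets of estimates track well can also be examined numerically, alongside the graphs. The sketch below computes the relative difference between the indirect and independent estimates for each county and flags large gaps. The county names, tonnages and the 15% threshold are hypothetical and are not taken from Figures 4 or 5.

```python
def relative_differences(indirect, independent):
    """Return (indirect - independent) / independent for each county."""
    return {
        county: (indirect[county] - independent[county]) / independent[county]
        for county in independent
    }


# Hypothetical tons per county from the two approaches.
indirect = {"County A": 5_000.0, "County B": 1_250.0, "County C": 3_000.0}
independent = {"County A": 4_800.0, "County B": 1_000.0, "County C": 3_000.0}

diffs = relative_differences(indirect, independent)
# Flag counties where the two estimates differ by more than 15% (arbitrary).
flagged = sorted(c for c, d in diffs.items() if abs(d) > 0.15)
print(flagged)
```

Many flagged counties would correspond to the Figure 5 situation, where the indirect approach needs adjustment or replacement; few or none would correspond to Figure 4.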
In addition to the steps on the checklist, one more is needed to finish the job. You should inform the customer in this story—the people turning out the state economic forecast—about any indirect estimation procedure you are using. The customer can determine whether the procedure has any unforeseen negative consequences for the quality of its work.
Indirect estimation is an instance of a useful principle in Six Sigma and quality management in general. When the best option is not available, make the best use of the second-best option. Put another way, a good solution that’s implemented is better than a perfect one that’s not.
Joseph D. Conklin is a mathematical statistician in Washington, D.C. He earned a master’s degree in statistics from Virginia Tech in Blacksburg, VA, and is a senior member of ASQ. Conklin is an ASQ-certified quality manager, quality engineer, quality auditor, reliability engineer and Six Sigma Black Belt.