Quality Assurance Alternatives and Techniques: A Defect-Based Survey and Analysis


Jeff Tian, Department of Computer Science and Engineering, Southern Methodist University

This article surveys commonly used quality assurance (QA) alternatives and techniques, including preventive actions, inspection, formal verification, testing, fault tolerance, and failure impact minimization. The generic ways to deal with defects, including prevention, detection and removal, and containment, are used as the basis to classify these QA alternatives. Each QA alternative is then compared by its cost, applicability, and effectiveness over different product types and application environments. Based on these, the author recommends an integrated approach for software quality assurance and improvement.

Key words: defect, error removal, failure prevention and containment, fault detection and removal, QA alternatives and techniques

INTRODUCTION

With the pervasive use of software systems in modern society, the negative impact of software defects is also increasing. Consequently, one central activity for quality assurance (QA) is to ensure that few, if any, defects remain in the software when it is delivered to its customers or released to the market. Furthermore, one wants to ensure that, if possible, these remaining defects will cause minimal disruption or damage.

Most modern software systems beyond limited personal use have become progressively larger and more complex because of the increased need for automation, functions, features, and services.

It is nearly impossible to completely prevent or eliminate defects in such large complex systems. Instead, various QA alternatives and related techniques can be used in a concerted effort to effectively and efficiently assure their quality.

Testing is among the most commonly performed QA activities for software. It detects execution problems so that underlying causes can be identified and fixed. Inspection, on the other hand, directly detects and corrects software problems without resorting to execution. Other QA alternatives, such as formal verification, defect prevention, and fault tolerance, deal with defects in their own ways. Close examination of how different QA alternatives deal with defects can help one better use them for specific applications.

This article examines the generic ways to deal with defects and classifies QA alternatives accordingly. Existing QA alternatives are surveyed and then compared by their cost, applicability, and effectiveness under different application environments and for different product types. The article concludes with the author’s recommendation for an integrated approach for effective quality assurance and improvement.

DEFECTS AND GENERIC WAYS TO DEAL WITH DEFECTS


This section clarifies various meanings of the term defect, and then examines the generic ways to deal with defects.

Defect-Related Definitions

The term defect generally refers to some problem with the software, either with its external behavior or with its internal characteristics. The IEEE Standard 610.12 (IEEE 1990) defines the following terms related to defects:

• Failure: The inability of a system or component to perform its required functions within specified performance requirements
• Fault: An incorrect step, process, or data definition in a computer program
• Error: A human action that produces an incorrect result
The term failure refers to a behavioral deviation from the user requirement or the product specification; fault refers to an underlying condition within software that causes certain failure(s) to occur; error refers to a missing or incorrect human action resulting in certain fault(s) being injected into software. Sometimes error is also used to refer to human misconceptions or other misunderstandings or ambiguities that are the root cause for the missing or incorrect actions.

With these definitions, one can see that failures, faults, and errors are different aspects of defects. A causal relation exists among these three aspects; that is, errors may cause faults to be injected into the software, and faults may cause failures when the software is executed. This relationship is not necessarily 1-to-1. A single error may cause many faults, such as when a wrong algorithm is applied in multiple modules and causes multiple faults, and a single fault may cause many failures in repeated executions. Conversely, the same failure may be caused by several faults, such as an interface or interaction failure involving multiple modules, and the same fault may be there because of different errors. Figure 1 illustrates some of these situations: the error e3 causes multiple faults, f2 and f3, and the fault f1 is caused by multiple errors, e1 and e2.

Dealing With Defects


With the previous definitions, one can view different QA activities as attempting to prevent, eliminate, reduce, or contain various problems associated with different aspects of defects. One can classify these QA alternatives into the following three generic categories:
Defect prevention through error removal. These QA activities prevent certain types of faults from being injected into the software, which can be done in two generic ways:
1. Eliminating certain error sources by eliminating ambiguity or correcting human misconceptions
2. Fault prevention, or breaking the causal relation between error sources and faults, by correcting the missing/incorrect human actions through the use of certain tools and technologies or the enforcement of certain process and product standards

Because errors are the missing or incorrect human actions, both the elimination of their causes through error-source elimination and the direct correction of these actions through fault prevention contribute to error removal.

Defect reduction through fault detection and removal. These QA alternatives detect and remove faults. In fact, most traditional QA activities fall into this category. For example, inspection directly detects and removes faults in the software, while testing removes faults based on related failure observations.

Defect containment through failure prevention. These QA alternatives break the causal relation between faults and failures so that local faults will not cause global failures, thus "tolerating" these faults. A related extension is containment measures to avoid catastrophic consequences in case of failures.
These QA activities are illustrated in Figure 1, forming a series of barriers used to remove or block defect sources and prevent undesirable consequences. These barriers are depicted as the broken lines between the error sources and the software system, and between the software system and the results. Figure 1 also shows the relationship between these QA activities and related errors, faults, and failures. For example, through the error-removal activity, some of the human conceptual errors, for example, e6, are directly removed, while other incorrect actions or errors, for example, e5, are blocked and removed. Some faults, for example, f4, are directly detected through inspection and removed, while others, such as f3, are detected through testing and removed. Still others, for example, f2, are blocked through fault tolerance.

Different QA alternatives can be viewed as a concerted effort to deal with errors, faults, or failures to achieve the common goal of quality assurance and improvement. Defect prevention and defect reduction activities directly deal with the competing processes of defect injection and removal during the software development process (Humphrey 1995). They affect the defect contents, or the number of faults, in the finished software products. On the other hand, defect containment activities aim at minimizing the negative impact of these remaining faults. The author next surveys these alternatives and examines how they deal with defects in their specific ways.

DEFECT PREVENTION THROUGH ERROR REMOVAL

The QA alternatives commonly referred to as defect prevention activities can be used for most software systems to reduce the chance for defect injections and the subsequent cost to deal with these injected defects. They attempt to remove errors through error-source elimination and fault prevention. Specific alternatives for defect prevention are discussed next.

Education and Training: People-Based Solutions for Error-Source Elimination


It has long been observed by software practitioners that the people factor is the most important factor that determines the quality and, ultimately, the success or failure of most software projects. Education and training of software professionals, such as through the personal software process® (PSP) (Humphrey 1995), can help them control, manage, and improve the way they work. Such activities can also help ensure that they have few, if any, misconceptions related to the product and the product development. Eliminating these human misconceptions will help prevent certain types of faults from being injected into software products. The education and training effort for error-source elimination should focus on the following areas:
Product and domain-specific knowledge. If the people involved are not familiar with the product type or application domain, there is a good chance that wrong solutions will be implemented. For example, if programmers who only had experience with numerical computation were asked to design and implement telecommunication software systems, they might not recognize the importance of making the software work within the existing infrastructure, thus creating incompatible software.

Software development methodology expertise. This plays an important role in developing high-quality software products. For example, lack of expertise with requirement analysis and product specification usually leads to problems and rework in subsequent design, coding, and testing activities. A related issue is the required expertise with relevant software technologies and tools. For example, in an implementation of cleanroom technology (Mills, Dyer, and Linger 1987), if the developers are not familiar with the key components of formal verification or statistical testing, there is little chance for producing high-quality products.

Development process knowledge. If the project personnel do not have a good understanding of the development process, there is little chance that the process can be implemented correctly. For example, if the people involved in incremental software development do not know how the individual development efforts for different pieces or increments fit together, the uncoordinated increment development may lead to interface or interaction problems.

Formal Method: Error-Source Elimination and Fault Absence Verification

Formal development methods, or formal methods, include formal specification and formal verification. Formal specification is concerned with producing an unambiguous set of product specifications so customer requirements, as well as environmental constraints and design intentions, are correctly reflected, thus reducing the chances of accidental fault injections. Formal verification checks the conformance of software design or code to these formal specifications, thus ensuring that the software is fault-free with respect to its formal specifications.

Various techniques exist to specify and verify the "correctness" of software systems, namely, to answer the question: "What is the correct behavior and how do we verify it?" The most influential ones include axiomatic correctness, predicate transforms, and functional correctness, described in (Zelkowitz 1993). The basic ideas of axiomatic correctness can be summarized as follows:

The program states before and after executing a program segment S can be described by its pre-condition P and post-condition Q, respectively, and denoted as {P}S{Q}, indicating that "if P is true before executing S and S terminates normally, then Q will be true." This pair of logical predicates constitutes the formal specifications for the program, against which the implemented program needs to be verified. As a practical example, if a program accepts non-negative input for its input variable x, and computes its output y as the square root of x, the precondition can then be described by the logical predicate {x >= 0}, and the post-condition can be described by {y = √x}.
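Runtime assertions offer an executable, though much weaker, analogue of this static proof obligation: instead of proving {P}S{Q} once for all inputs, each execution checks P on entry and Q on exit. A minimal Python sketch for the square-root example (the function name and the floating-point tolerance are illustrative assumptions):

```python
import math

def sqrt_checked(x):
    """Compute the square root of x, checking the pre-condition {x >= 0}
    and the post-condition {y*y = x} at run time (with a small tolerance
    to allow for floating-point rounding)."""
    assert x >= 0, "pre-condition P = {x >= 0} violated"
    y = math.sqrt(x)
    assert abs(y * y - x) < 1e-9, "post-condition Q = {y*y = x} violated"
    return y

print(sqrt_checked(16.0))  # 4.0
```

Unlike a proof, such checks only cover the inputs actually executed, which is why formal verification offers a stronger guarantee.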

There are axioms or inference rules to link different predicates, such as the following axiom:
Axiom A1:

{P} ⇒ {R}, {R}S{Q}
_________________
{P}S{Q}

where "⇒" is the logical relation "implies." This kind of rule is interpreted as, "if we know that the expressions above the line are true, then we can infer that the expression below the line follows." Axiom A1 states that if a program works for a given precondition, it also works for a more restrictive (or stronger) precondition. In the previous example, if one has already proven that the program S works for all nonnegative inputs, or {R}S{Q}, with R = {x >= 0}, then by applying axiom A1, one can conclude that it also works for a positive input of bounded value, that is, {P}S{Q}, with P = {0 < x <= 1000}, because {P} ⇒ {R} in this case.

There is an axiom stating the pre- and post-conditions for each fundamental element of a language, for example, an assignment, an if-statement, and so on. The first type of axiom is simply in the form of {P}S{Q}. For example, the axiom for the assignment statement is given by:
Axiom A2:

{P′} x := y {P}

where {P′} is derived from expression P with all free occurrences of x (occurrences of x not bound by other conditions) replaced by y. As a practical example, consider a program that balances a banking account: If no negative balance is allowed after each transaction, that is, {b >= 0} is the post-condition P, then the precondition P′ before the withdrawal of money, as represented by the assignment statement b := b - w, is {b - w >= 0}, or {b >= w}, by the preceding axiom. That is, the precondition for maintaining a nonnegative balance is that sufficient funds exist before each withdrawal transaction.
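The derived precondition can again be turned into an executable check. A minimal Python sketch of the banking example (the function name is illustrative, not from the article):

```python
def withdraw(balance, amount):
    """Withdrawal transaction b := b - w, guarded by the pre-condition
    {b >= w} that axiom A2 derives from the post-condition {b >= 0}."""
    assert balance >= amount, "pre-condition {b >= w} violated"
    balance = balance - amount   # the assignment b := b - w
    assert balance >= 0          # post-condition {b >= 0} now holds
    return balance

print(withdraw(100, 30))  # 70
```

If the precondition holds on entry, the post-condition assertion can never fail, which is exactly what the assignment axiom guarantees.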

Another type of axiom defines the inference rules for multipart statements. For example, the following axiom gives the "meaning" for the if-then-else statement:

Axiom A3:

{P ∧ B} S1 {Q}, {P ∧ ¬B} S2 {Q}
_________________________
{P} if B then S1 else S2 {Q}



As a practical example, consider the following statement:

if x >= 0 then y := x else y := -x

with post-condition Q = {y = |x|}, precondition P = TRUE, and B = {x >= 0}. To verify this statement, one must verify: {P ∧ B} S1 {Q} and {P ∧ ¬B} S2 {Q}.

The first branch (B) to verify is:
{x >= 0} y := x {y = |x|}

Applying axiom A2, one has: {x = |x|} y := x {y = |x|}

Combined with the logical relation {x >= 0} ⇒ {x = |x|}, by applying axiom A1, this branch is verified. The second branch (¬B) can be verified similarly. Therefore, through these verification steps, one has verified the above conditional statement.
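The conditional statement just verified can also be exercised dynamically. A small Python sketch (the function name is illustrative) mirrors the two branches and checks the post-condition on every execution; an exhaustive sweep over a small input domain stands in for the formal proof:

```python
def abs_value(x):
    """{P} if B then S1 else S2 {Q} with P = TRUE, B = {x >= 0},
    and Q = {y = |x|}."""
    if x >= 0:
        y = x    # branch B:  {x >= 0} y := x  {y = |x|}
    else:
        y = -x   # branch not-B: {x < 0} y := -x {y = |x|}
    assert y == abs(x)   # post-condition Q
    return y

# Exhaustive check over a small domain (testing, not proof: the formal
# verification above covers ALL inputs, this covers only 201 of them).
assert all(abs_value(x) == abs(x) for x in range(-100, 101))
```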

The verification process, often referred to as the proof of correctness, is a bottom-up process much like the preceding verification example for the conditional statement: One starts from individual statements, verifies intermediate conditions through axioms or inference rules, and finally verifies the pre- and post-conditions for the complete program.

The axiomatic correctness surveyed previously, as well as several other formal specification and verification techniques, are described in (Zelkowitz 1993), together with examples, discussions, comparisons, and references for additional literature. So far, the biggest obstacle to formal methods is the high cost associated with performing these human-intensive activities correctly without adequate automated support, because the proofs are typically one order of magnitude longer than the programs or designs themselves.

Defect Prevention Based on Technologies, Tools, Processes, and Standards

Besides the formal methods described previously, appropriate use of other software technologies can also help reduce the chances of fault injections. For example, the use of the information hiding principle (Parnas 1972) can help reduce the complexity of program interfaces and interactions among different components, thus reducing the possibility of interface or interaction problems.

A better-managed or more suitable process can also eliminate many systematic problems. Not following the selected process, however, also leads to some faults being injected into the software. For example, not following the defined process for system configuration and revision control may lead to inconsistencies or interface problems among different software versions or components. Therefore, ensuring appropriate process selection and conformance helps eliminate such error sources. Similarly, enforcement of selected product or development standards also reduces fault injections.

Sometimes, specific software tools can also help reduce the chances of fault injections. For example, a syntax-directed editor that automatically balances each open brace, "{", with a close brace, "}", can help reduce syntactic problems in programs written in the C language.
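The brace-balancing check such an editor performs can be sketched in a few lines of Python (a simplified illustration; a real editor must also skip braces inside string literals and comments):

```python
def braces_balanced(source):
    """Return True if every '{' in source has a matching '}', the kind of
    check a syntax-directed editor runs as the programmer types.
    (String-literal and comment handling is deliberately omitted.)"""
    depth = 0
    for ch in source:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth < 0:      # a '}' appears before its matching '{'
                return False
    return depth == 0          # every '{' must eventually be closed

print(braces_balanced("int main(void) { if (x) { return 1; } return 0; }"))  # True
```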

Additional work is needed to guide the selection of appropriate processes, standards, tools, and technologies, or to tailor existing ones to fit the specific application environment. Effective monitoring and enforcement systems are also needed to ensure that the selected process or standard is followed, or the selected tool or technology is used properly, to reduce the chances of fault injection.

Root-Cause Analysis for Defect Prevention


Notice that many of the error-removal activities described previously implicitly assume that there are known error sources or missing/incorrect actions that result in fault injections, as follows:
• If human misconceptions are the error sources, education and training should be part of the solution.
• If imprecise designs and implementations that deviate from product specifications or design intentions are the causes for faults, formal methods should be part of the solution.
• If nonconformance to selected processes or standards is the problem that leads to fault injections, then process conformance or standard enforcement should be part of the solution.
• If there is empirical or logical evidence that certain tools or technologies can reduce fault injections under similar environments, these tools or technologies should be adopted.
Therefore, root-cause analyses are needed to establish these preconditions, so that appropriate defect prevention activities can be applied for error removal. These analyses usually take two forms: logical analysis and statistical analysis. Logical analysis examines the logical link between the faults (effects) and the corresponding errors (causes), and establishes general causal relations.

This analysis is human intensive, and should be performed by experts with thorough knowledge of the product, the development process, the application domain, and the general environment.

Statistical analysis is based on empirical evidence collected either locally or from other similar projects. These data can be fed to various models to establish the predictive relations between causes and effects. Once such causal relations are established, appropriate QA activities can then be selected and applied for error removal.

DEFECT REDUCTION THROUGH FAULT DETECTION AND REMOVAL

For most large software systems in use today, it is unrealistic to expect that error-removal or defect prevention activities can be 100 percent effective in preventing accidental fault injections. Therefore, there is a need for effective techniques to remove as many of the injected faults as possible under project constraints.

Inspection: Direct Fault Detection and Removal


Software inspections are critical examinations of software artifacts by human inspectors aimed at discovering and fixing faults in the software systems. Inspection is a well-known QA alternative familiar to most software quality professionals. The earliest and most influential work in software inspection is Fagan inspection (Fagan 1976), which organizes inspection into the following six steps:
1. Planning: Deciding what to inspect and if inspection is ready to start.
2. Overview meeting: The author meets with and gives an overview of the inspection object to the inspectors. Assignment of individual pieces among the inspectors is also done.
3. Preparation: Individual inspection is performed by each inspector.
4. Inspection meeting to collect and consolidate individual inspection results: Fault identification in this meeting is carried out as a consensus-building process.
5. Rework: The author fixes the identified problems or provides other responses.
6. Follow-up: Close the inspection process by final validation or reinspection.
Therefore, faults are detected directly in inspection, and removed as part of the inspection process.

Other variations have been proposed and used to effectively conduct inspection under different environments. A detailed discussion about inspection processes and techniques, applications and results, and related topics can be found in (Gilb and Graham 1993).

Inspection is most commonly applied to code, but it could also be applied to requirement specifications, designs, test plans and test cases, user manuals, and other documents or software artifacts. Another important benefit is the opportunity to conduct causal analysis during the inspection process, for example, as an added step in Gilb inspection (Gilb and Graham 1993). These causal analysis results can be used to guide defect prevention activities by removing identified error sources or correcting identified missing/incorrect human actions.

Testing: Failure Observation and Fault Removal

Testing is one of the most important parts of QA and the most commonly performed QA activity.

Testing involves the execution of software and the observation of the program behavior or outcome. If a failure is observed, the execution record is then analyzed to locate and fix the fault(s) that caused the failure. Various individual testing activities and techniques can be classified using various criteria, as discussed next, with special attention paid to how they deal with defects.

When can a specific testing activity be performed and related faults be detected?

Because testing is an execution-based QA activity, a prerequisite to actual testing is the existence of the implemented software units, components, or system to be tested, although preparation for testing can be carried out in earlier phases of software development. As a result, actual testing can be divided into various subphases starting from the coding phase up to post-release product support, including: unit testing, component testing, integration testing, system testing, acceptance testing, beta testing, and so on. The observation of failures can be associated with these subphases, and the identification and removal of related faults can be associated with corresponding individual units, components, or the complete system.

What to test, and what kind of faults are found?


Black-box (or functional) testing verifies the correct handling of the external functions by the software, or whether the observed behavior conforms to user expectations or product specifications.

White-box (or structural) testing verifies the correct implementation of internal units, structures, and relations among them. Various techniques can be used to build models and generate test cases to perform systematic testing (Beizer 1990; Musa 1998). Failures related to specific external functions or internal implementations could be observed, resulting in corresponding faults being detected and removed.
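The contrast between the two views can be illustrated with a toy function under test (the triangle-classification example is a classic teaching illustration, not from the article): black-box cases are derived purely from the specification, while white-box cases are chosen to exercise each internal branch.

```python
def classify_triangle(a, b, c):
    """Toy function under test: classify a triangle by its side lengths."""
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

# Black-box (functional): cases derived only from the external specification.
assert classify_triangle(3, 3, 3) == "equilateral"
assert classify_triangle(3, 4, 5) == "scalene"

# White-box (structural): cases chosen to cover every internal branch,
# including each of the three sub-conditions of the isosceles test.
assert classify_triangle(3, 3, 4) == "isosceles"
assert classify_triangle(4, 3, 3) == "isosceles"
assert classify_triangle(3, 4, 3) == "isosceles"
```

A black-box tester would have no reason to try all three isosceles orderings; the white-box view reveals that each comparison is a distinct condition that could hide a distinct fault.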

When, or at what defect level, to stop testing?


Most of the traditional testing techniques and testing subphases use some kind of coverage information as the stopping criteria, with the assumption that higher coverage means higher quality or lower defect levels. For example, checklists are often used to make sure major functions and usage scenarios are tested before product release. Every statement or unit in a component must be covered before subsequent integration testing can proceed in many organizations. More formal testing techniques include control flow testing that attempts to cover execution paths and domain testing that attempts to cover boundaries between different input subdomains (Beizer 1990). Such formal coverage information can only be obtained by using expensive coverage analysis and testing tools. Rough coverage measurement, however, can be obtained easily by examining the proportion of tested items in various checklists.
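The rough checklist-based coverage measurement mentioned above amounts to a simple proportion, as this Python sketch shows (the checklist items are hypothetical):

```python
def checklist_coverage(checklist):
    """Proportion of checklist items marked as tested: the rough coverage
    measurement obtainable without dedicated coverage analysis tools."""
    return sum(checklist.values()) / len(checklist)

# Hypothetical major-function checklist for a product release.
checklist = {"login": True, "search": True, "checkout": False, "logout": True}
print(f"coverage = {checklist_coverage(checklist):.0%}")  # coverage = 75%
```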

On the other hand, product reliability goals can be used as a more objective criterion to stop testing. The use of this criterion requires testing to be performed under an environment that resembles actual use by target customers so that realistic reliability assessment can be obtained, resulting in the so-called statistical usage-based testing (Musa 1998).

The coverage criterion ensures that certain types of faults are detected and removed, thus reducing the number of defects, although quality is not directly assessed. The usage-based testing and the related reliability criterion ensure that the faults that are most likely to cause problems are detected and removed, and the reliability of the software reaches certain targets before testing stops.
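The core mechanism of usage-based testing is sampling test inputs in proportion to an operational profile of expected field use. A minimal Python sketch (the operation names and probabilities are illustrative assumptions, not from the article):

```python
import random

# Hypothetical operational profile: estimated probability of each
# operation type in actual customer use.
profile = {"query": 0.70, "update": 0.25, "admin": 0.05}

def sample_test_cases(n, seed=0):
    """Draw n test operations in proportion to the operational profile,
    as done in statistical usage-based testing (seed fixed for repeatability)."""
    rng = random.Random(seed)
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)

cases = sample_test_cases(1000)
print({op: cases.count(op) for op in profile})
```

Because test effort tracks expected usage, the failures observed (and the faults removed) are those most likely to affect customers, which is what makes the resulting reliability estimates realistic.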

Other Techniques for Fault Detection and Removal


Inspection is the most commonly used static technique for defect detection and removal.

Various other static techniques are available, including various formal model-based analyses such as algorithm analysis, decision-table analysis, boundary value analysis, finite-state machine and Petri-net modeling, control and data-flow analyses, software fault trees, and so on.

Similarly, in addition to testing, other dynamic, execution-based techniques also exist for fault detection and removal. For example, symbolic execution, simulation, and prototyping can help one detect and remove defects early in the software development process, before large-scale testing becomes a viable alternative. On the other hand, in-field measurement and related analyses, such as timing and performance monitoring and analysis for real-time systems, and accident reconstruction using software event trees for safety-critical systems, can also help one locate and remove related defects.

A comprehensive survey of techniques for fault detection and removal, including those mentioned previously, can be found in (Wallace, Ippolito, and Cuthill 1996).

Risk Identification and Defect Reduction


Fault distribution is highly uneven for most software products, regardless of their size, functionality, implementation language, and other characteristics. Much empirical evidence has accumulated over the years to support the so-called 80/20 rule, which states that 20 percent of the software components are responsible for 80 percent of the problems. These problematic components can generally be characterized by specific measurement properties about their design, size, complexity, change history, and other product or process characteristics. Because of the uneven fault distribution among software components, there is a great need for risk identification techniques to analyze these measurement data so that inspection, testing, and other defect detection and reduction activities can be more effectively focused on those potentially high-defect components.

A survey of these risk identification techniques and their comparison can be found in (Tian 2000), including: traditional statistical analysis techniques, principal component analysis and discriminant analysis, neural networks, tree-based modeling, pattern-matching techniques, and learning algorithms. These techniques were compared according to several criteria, including: accuracy, simplicity, early availability and stability, ease of result interpretation, constructive information and guidance for quality improvement, and availability of tool support. Appropriate risk identification techniques can be selected to fit specific application environments in order to identify high-risk software components for focused inspection and testing.
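As a crude stand-in for the statistical models surveyed, a composite risk score over a few product and process metrics can rank components so that the top 20 percent receive focused inspection and testing. In this Python sketch the metric names, weights, and module data are illustrative assumptions, not results from the survey:

```python
def top_risk_components(metrics, fraction=0.2):
    """Rank components by a simple weighted risk score and return the
    names of the top fraction (a toy substitute for techniques such as
    discriminant analysis or tree-based modeling)."""
    def score(m):
        # Illustrative weights over complexity, change history, and size.
        return 0.5 * m["complexity"] + 0.3 * m["changes"] + 0.2 * m["size_kloc"]
    ranked = sorted(metrics, key=score, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return [m["name"] for m in ranked[:k]]

modules = [
    {"name": "parser",  "complexity": 42, "changes": 30, "size_kloc": 12},
    {"name": "ui",      "complexity": 10, "changes": 5,  "size_kloc": 8},
    {"name": "network", "complexity": 25, "changes": 12, "size_kloc": 6},
    {"name": "logging", "complexity": 4,  "changes": 2,  "size_kloc": 1},
    {"name": "db",      "complexity": 30, "changes": 20, "size_kloc": 9},
]
print(top_risk_components(modules))  # ['parser']
```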

DEFECT CONTAINMENT THROUGH FAILURE PREVENTION


Because of the large size and high complexity of most software systems in use today, the aforementioned defect reduction activities can greatly reduce the number of faults but not completely eliminate them. For software systems where failure impact is substantial, such as the real-time control software used in medical, nuclear, transportation, and other embedded systems, this low defect level and failure risk may still not be adequate. Some additional QA alternatives are needed.

On the other hand, these few remaining faults may be triggered under rare conditions or unusual dynamic scenarios, making it unrealistic to try to generate the huge number of test cases to cover all these conditions or to perform exhaustive inspection or analysis based on all possible scenarios. Instead, some other means must be used to prevent failures by breaking the causal relations between these faults and the resulting failures, thus "tolerating" these faults, or to contain the failures to reduce the resulting damage.

Fault Tolerance with Recovery Blocks

Software fault tolerance ideas originate from fault tolerance designs in traditional hardware systems requiring higher levels of reliability, availability, or dependability. In such systems, spare parts and backup units are commonly used to keep the systems in operational condition, perhaps at reduced capability, in the presence of unit or part failures. The primary software fault tolerance techniques include recovery blocks and N-version programming (NVP), covered in detail in (Lyu 1995). The author next briefly describes these techniques and examines how they deal with failures and related faults.

The use of recovery blocks introduces duplication of software executions so occasional failures only cause loss of partial computational results but not complete execution failures. For example, the ability to dynamically back up and recover from occasional lost or corrupted transactions is built into many critical databases used in financial, insurance, health care, and other industries. Figure 2 illustrates this technique, and depicts the four major activities involved:
1. Periodic checkpointing and refreshing to save the dynamic contents of software executions
2. Failure detection: If a failure is detected, the following two steps are performed.
3. Rollback by restoring the saved dynamic contents associated with the latest checkpoint
4. Rerun the lost computation, and the normal activity continues
One key decision in this technique is the checkpointing frequency: higher frequency leads to higher cost associated with frequent refreshing of the saved dynamic contents, while lower frequency leads to longer and more costly recovery. An optimal frequency balances the two and incurs minimal overall cost.

In using recovery blocks, failures are detected, but the underlying faults are not removed, although off-line activities can be carried out to identify and remove the faults in case of repeated failures. One hopes the dynamic condition or external disturbance that accompanied the original failure will not repeat, thus subsequent rerun of the lost computation can succeed and normal operation can resume. In this respect, faults are tolerated in the system, with occasional minor delays—a loss of performance tolerable under many circumstances. Repeated failures, however, have to be dealt with off-line, or by using other fault tolerance techniques, such as NVP discussed next.
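The four recovery-block activities can be sketched as a small in-memory Python class (illustrative only: real systems persist checkpoints durably, and the class and method names are assumptions, not from the article):

```python
import copy

class RecoveryBlock:
    """Minimal in-memory sketch of the recovery-block scheme in Figure 2."""

    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)        # 1. initial checkpoint

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)   # 1. periodic refresh

    def run(self, transaction):
        try:
            transaction(self.state)              # normal execution
        except Exception:                        # 2. failure detected
            self.state = copy.deepcopy(self.saved)   # 3. rollback
            transaction(self.state)              # 4. rerun lost computation

rb = RecoveryBlock({"balance": 100})
attempts = {"n": 0}

def flaky_deposit(state):
    """Transaction that fails once due to a transient disturbance."""
    state["balance"] += 50
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")

rb.run(flaky_deposit)
print(rb.state["balance"])  # 150
```

As the example shows, the fault behind the first failure is never removed; the rerun simply succeeds because the transient condition does not repeat, which is precisely the sense in which the fault is "tolerated."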

Fault Tolerance with N-version Programming

NVP is another way to tolerate software faults by directly introducing duplications into the software itself (Lyu 1995). NVP is generally more suitable than recovery blocks when timely decisions or performance are critical, such as in many real-time control systems. The basic technique is illustrated in Figure 3 and briefly described here:
1. The basic functional units of the software system consist of N parallel, independent versions of programs with identical functionality: version 1, version 2, …, version N.
2. The system input is distributed to all the N versions.
3. The individual output for each version is fed to a decision unit.
4. The decision unit determines the system output based on its inputs using a specific decision algorithm (often a majority vote, but other algorithms are also possible).
The basic assumption in NVP is that faults in different versions are independent, which implies that it is rare to have the same fault triggered by the same input and cause the same failure among different versions. Therefore, even if there is a fault that causes a local failure in version i, the whole system is likely to function correctly because the other (independent) versions are likely to function correctly under the same dynamic environment. In this way, the causal relation between local faults and system failures is broken for most local faults under most situations, thus improving the quality and reliability of the software system. One of the main research topics in NVP is to ensure that the software versions are as independent as possible so local faults can be tolerated and the resulting local failures can be contained effectively.
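The steps above can be sketched as follows. The three "versions", the seeded fault in one of them, and the majority-vote decision unit are all hypothetical illustrations, not examples from the article:

```python
from collections import Counter

# Three independently written versions of the same function (absolute value).
def version_1(x):
    return abs(x)

def version_2(x):
    # Hypothetical local fault: this version answers wrongly for x == 0.
    if x == 0:
        return -1
    return x if x > 0 else -x

def version_3(x):
    return max(x, -x)

def decision_unit(outputs):
    # Majority vote over the individual outputs (other algorithms possible).
    value, votes = Counter(outputs).most_common(1)[0]
    if votes > len(outputs) // 2:
        return value
    raise RuntimeError("no majority among versions: system-level failure")

def nvp_system(x):
    # The same input is distributed to all N versions;
    # the decision unit determines the system output.
    outputs = [version(x) for version in (version_1, version_2, version_3)]
    return decision_unit(outputs)

# The local fault in version_2 is outvoted: the system still answers correctly.
print(nvp_system(0))   # -> 0
print(nvp_system(-4))  # -> 4
```

The sketch also shows why independence matters: if two versions shared the same fault for x == 0, the wrong answer would win the vote.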

Safety Assurance and Failure Containment


The concerted effort of the previously described QA activities should reduce the system failure probability to a very low level. For safety-critical systems, however, the primary concern is the ability to prevent accidents, where an accident is a failure with a severe consequence. Even such a low failure probability is not tolerable in these systems if most failures may lead to accidents. Therefore, in addition to the aforementioned QA techniques, specific techniques are also used for safety-critical systems based on analysis of hazards, or logical preconditions for accidents. These safety assurance and improvement techniques are discussed in detail in (Leveson 1995). Following is a brief discussion of them and an analysis of how each technique deals with defects:
Hazard elimination through substitution, simplification, decoupling, elimination of specific human errors, and reduction of hazardous materials or conditions. This is similar to the error removal techniques described before but with a focus on those error sources involved in hazardous situations.
Hazard reduction through design for controllability (for example, automatic pressure release in boilers), use of barriers (for example, hardware/software interlocks), and failure minimization using safety margins and redundancy. These techniques are similar to the fault tolerance techniques discussed previously, where local failures are contained without leading to system failures.
Hazard control through reducing exposure, isolation and containment (for example, barriers between the system and the environment), protection systems (active protection activated in case of hazard), and fail-safe design (passive protection, fail in a safe state without causing further damages). These techniques reduce the severity of failures, therefore weakening the link between failures and accidents.
Damage control through escape routes, safe abandonment of products and materials, and devices for limiting physical damages to equipment or people. These techniques reduce the severity of accidents thus limiting the damage caused by these accidents.
Notice that both hazard control and damage control are post-failure activities not generally covered in the QA activities described before. These activities are specific to safety-critical systems. On the other hand, many techniques for hazard elimination and reduction can also be used in general systems to reduce fault injection and to tolerate local faults.
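The fail-safe and interlock ideas can be made concrete in a few lines of code. The heater controller below, its names, and its 100-degree threshold are illustrative assumptions, not an example from the article:

```python
# Fail-safe design sketch: on any failure, the actuator drops to its
# passive safe state instead of continuing to operate blindly.

class HeaterController:
    SAFE_STATE = "OFF"   # passive safe state: heating off

    def __init__(self, max_temp=100.0):
        self.max_temp = max_temp       # safety-margin threshold (interlock)
        self.output = self.SAFE_STATE

    def control(self, read_sensor):
        try:
            temp = read_sensor()
            # Interlock: never heat at or beyond the threshold.
            self.output = "OFF" if temp >= self.max_temp else "ON"
        except Exception:
            # Fail-safe: any failure (for example, a lost sensor) leaves
            # the system in its safe state without causing further damage.
            self.output = self.SAFE_STATE
        return self.output

def broken_sensor():
    raise IOError("sensor lost")

ctl = HeaterController()
print(ctl.control(lambda: 80.0))   # normal operation -> ON
print(ctl.control(broken_sensor))  # failure -> OFF (safe state)
```

The interlock weakens the link between a fault (a bad reading or lost sensor) and an accident (overheating), which is exactly the role of hazard control in the classification above.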

COMPARISON AND RECOMMENDATIONS

The author next compares the different QA activities by examining their cost, applicability under different environments and development phases, and effectiveness in dealing with different types of problems. Based on this comparison, the author also provides some general recommendations.

Cost and Applicability

Testing is among the standard activities that make up the whole software development process, regardless of the process choice or the product type. Therefore, the cost and applicability of other QA alternatives are examined using testing as the baseline for comparison.

In general, the longer a fault remains in a software system, the higher the total cost (more than linear increase) associated with fixing the related problems (Boehm 1981; Humphrey 1995). In addition to fixing the original fault, the problems that must be resolved include the failures caused by the original fault, as well as other related faults that may be injected in a chain reaction because of the presence of the original fault, such as in a module that needs to interface with the module containing the original fault. Therefore, fixing problems early in the development process, or even better, preventing the injection of faults through error removal, are generally more cost-effective than dealing with the problems later in other QA activities.

Unlike testing, which can only be performed after the software system is at least partially implemented, inspection can be performed throughout the software development process and on almost any software artifact. The cost of conducting different variations of inspection ranges from very low for informal reviews to roughly comparable to testing for formal inspections. According to data compiled in (Gilb and Graham 1993), inspection typically brings a return-on-investment (ROI) ratio of around 10-to-1. This effect is particularly strong in the earlier phases of software development.

Formal verification can be viewed as an extremely structured kind of inspection where all the formally specified elements of the design or the code are formally verified. As mentioned before, the proof of correctness for a program or a design is typically one order of magnitude longer than the program or the design itself (Zelkowitz 1993), thus such human-intensive proofs cost significantly more than most inspections, and usually cost more than testing. Fault tolerance techniques cost significantly more because of the built-in duplications (Lyu 1995). Safety assurance activities cost even more because of all the associated actions taken to address both pre-failure and post-failure issues to ensure not only low probability of failure, but also to limit the failure consequences and damages (Leveson 1995). For systems requiring higher levels of quality and reliability, or for safety critical applications, however, the associated high cost is usually justified. A careful cost-benefit analysis must be performed based on historical data from the same or similar software development organizations to choose the appropriate QA alternatives for different types of software products.

Problem Types, Defect Levels, and Choice of QA Alternatives

In general, if systematic problems exist in an organization and its products, preventive action is the most effective way to deal with them. Such systematic problems are generally associated with common failures traceable to common faults, and these common faults can be traced in turn to some common errors through causal analysis. As pointed out in (Humphrey 1995): "While detecting and fixing defects is critically important, it is an inherently defensive strategy. To make significant quality improvements, you should identify the causes of these defects and take steps to eliminate them."

On the other hand, sporadic problems can generally be dealt with by other QA alternatives. One key difference between inspection and testing is the way faults are identified: inspection identifies them directly by examining the software artifact, while failures are observed during testing and related faults are identified later by using the recorded execution information.

This key difference leads to the different types of faults commonly detected using these two techniques: inspection is usually good at detecting static and localized faults, while testing is good at detecting dynamic and global faults involving multiple components in interactions (Beizer 1990; Gilb and Graham 1993). In addition, hidden faults that are not going to cause any failures in the current execution environment, for example, compatibility problems with the intended future platforms, could be detected by inspection but not by testing.

For existing products with relatively high defect levels or with many common faults, inspection is most likely to be more effective than testing, because inspection can continue after the initial fault is detected, but further testing is often blocked once a fault is encountered and a failure is observed.

In addition, when defect levels are high, execution of most test cases will result in failure observations, and the subsequent effort to locate and remove the underlying faults is similar to that for inspection. Analysis of existing high-defect projects, commonly conducted in conjunction with inspection such as Gilb inspection (Gilb and Graham 1993), can often point to systematic problems. Such systematic problems can be most effectively addressed by defect prevention activities in successor projects.

A proof of correctness or a formal verification can only be produced if the program is fault-free with respect to its formal specifications. When verification cannot be successfully completed, further analysis often reveals accidental logical or functional faults. This is not, however, an effective method for fault detection because of the substantial effort involved in the failed verification attempt. Therefore, formal verification does not work for software with high defect levels. Fortunately, the use of formal methods, with formal specification focusing on error-source elimination and formal verification focusing on verifying the conformance in designs and code, generally results in low defect levels (Zelkowitz 1993).
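To make concrete what is being proved, the standard proof obligation in such verification is a Hoare triple. The swap example below is a textbook illustration, not taken from the article:

```latex
% A Hoare triple {P} S {Q} asserts: if statement S starts in a state
% satisfying precondition P, it terminates in a state satisfying
% postcondition Q. For an in-place swap via a temporary variable:
\[
\{\, x = a \wedge y = b \,\} \quad
t := x;\; x := y;\; y := t \quad
\{\, x = b \wedge y = a \,\}
\]
% The proof succeeds only if the code conforms to its specification;
% a fault in S surfaces as an unprovable obligation, which is why a
% completed verification implies a fault-free program (with respect to
% that specification).
```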

Fault tolerance techniques generally involve the observations of dynamic local failures and the tolerance of the related faults but not the identification and removal of these faults. These techniques only work when defect levels are very low, because multiple fault encounters or frequent failures cannot be effectively tolerated (Lyu 1995). Therefore, other QA alternatives must be used to reduce the defects to a very low level before fault tolerance techniques can be used to further reduce the probability of system failures.

On the other hand, many software safety assurance techniques attempt to weaken the link between failures and accidents or reduce the damage associated with accidents. The focus of these activities is the post-failure accidents and the related hazard analysis and resolution. Defect levels are expected to be extremely low because these expensive techniques are generally applied as the last guard against system safety problems after traditional QA activities have been performed (Leveson 1995).

Comparison Summary and Recommendations

The previous comparison is summarized in Figure 4. Based on the comparison and analysis presented so far, the author makes the following recommendations:
• In general, a concerted effort is necessary with many different QA activities to be used in an integrated fashion to effectively and efficiently deal with defects and ensure product quality.
• Error removal greatly reduces the chance of fault injections. Therefore, such preventive actions should be an integral part of any QA plan. Causal analyses can be performed to identify systematic problems and select preventive actions to deal with the problems.
• Inspection and testing are applicable to different situations and effective for different defect types at different defect levels. Therefore, inspection can be performed first to lower defect levels, and then testing can be performed to remove the remaining faults related to dynamic scenarios and global interactions. To maximize the benefit-to-cost ratio, various risk identification techniques can be used to focus inspection and testing effort on identified high-risk product components.
• Software safety assurance (especially hazard and damage control), fault tolerance, and formal verification techniques cost significantly more to implement than traditional QA techniques. If the consequences of failures are severe and the potential damage is high, however, they can be used to further reduce the failure probability or to reduce the accident probability or severity.

CONCLUSIONS AND PERSPECTIVES

Because of the pervasive use and reliance on software systems today, there is a great need for effective QA alternatives and related techniques. According to the different ways these QA alternatives deal with defects, they can be classified into three categories: defect prevention, defect reduction, and defect containment.

Existing software quality literature generally covers defect reduction techniques such as testing and inspection in more detail than defect prevention activities, while largely ignoring the role of defect containment in QA. The survey and classification of different QA alternatives in this article bring together information from diverse sources to offer a common starting point and information base for software quality professionals. The comparison of their applicability, effectiveness, and cost can help them choose appropriate alternatives and tailor or integrate them for specific applications.

As an immediate follow-up to this study, the author plans to collect additional data from industry to quantify the cost and benefit of different QA alternatives to better support the related cost-benefit analysis. He also plans to package application experience from industry to guide future applications. These efforts will help advance the state-of-practice in industry, where appropriate QA alternatives can be selected, tailored, and integrated by software quality professionals for effective quality assurance and improvement.

Acknowledgment

This work is supported in part by NSF CAREER award CCR-9733588, THECB/ATP award 003613-0030-1999, and Nortel Networks.

The author wishes to thank the anonymous reviewers for their constructive comments and suggestions that led to a better article.

References

Beizer, B. 1990. Software testing techniques, second edition. Boston, Mass.: International Thomson Computer Press.

Boehm, B. W. 1981. Software engineering economics. Englewood Cliffs, N. J.: Prentice Hall.

Fagan, M. E. 1976. Design and code inspections to reduce errors in program development. IBM Systems Journal 15, no. 3: 182-211.

Gilb, T., and D. Graham. 1993. Software inspection. London: Addison-Wesley Longman.

Humphrey, W. S. 1995. A discipline for software engineering. Reading, Mass.: Addison-Wesley.

IEEE. 1990. IEEE standard glossary of software engineering terminology, IEEE Std 610.12-1990. New York: Institute of Electrical and Electronics Engineers.

Leveson, N. G. 1995. Safeware: System safety and computers. Reading, Mass.: Addison-Wesley.

Lyu, M. R., ed. 1995. Software fault tolerance. New York: John Wiley & Sons.

Mills, H. D., M. Dyer, and R. C. Linger. 1987. Cleanroom software engineering. IEEE Software 4, no. 5: 19-24.

Musa, J. D. 1998. Software reliability engineering. New York: McGraw-Hill.

Parnas, D. L. 1972. On the criteria to be used in decomposing systems into modules. Communications of the ACM 15, no. 12: 1053-1058.

Tian, J. 2000. Risk identification techniques for defect reduction and quality improvement. Software Quality Professional 2, no. 2: 32-41.

Wallace, D. R., L. M. Ippolito, and B. Cuthill. 1996. Reference information for the software verification and validation process. NIST Special Publication 500-234. Gaithersburg, Md.: National Institute of Standards and Technology.

Zelkowitz, M. V. 1993. Role of verification in the software specification process. In Advances in Computers 36, ed. M. C. Yovits, 43-109. San Diego, Calif.: Academic Press.

BIOGRAPHY

Jeff (Jianhui) Tian has a bachelor’s degree in electrical engineering from Xi’an Jiaotong University, a master’s degree in engineering science from Harvard University, and a doctorate in computer science from the University of Maryland. He worked for IBM Software Solutions Toronto Laboratory between 1992 and 1995 as a software quality and process analyst. Since 1995, he has been an assistant professor of computer science and engineering at Southern Methodist University in Dallas, Texas. His current research interests include software testing, measurement, reliability, safety, complexity, and telecommunication software and systems. Tian is a member of IEEE and ACM. He can be reached at Southern Methodist University, Dept. of Computer Science and Engineering, Dallas, TX 75275, or by e-mail at tian@engr.smu.edu.
