June 2001
Volume 3 • Number 3
QUALITY MANAGEMENT
Quality Assurance Alternatives and Techniques: A Defect-Based
Survey and Analysis
by: Jeff Tian, Department of Computer Science and Engineering,
Southern Methodist University
This article surveys commonly used quality assurance
(QA) alternatives and techniques, including preventive actions,
inspection, formal verification, testing, fault tolerance,
and failure impact minimization. The generic ways to deal
with defects, including prevention, detection and removal,
and containment, are used as the basis to classify these
QA alternatives. Each QA alternative is then compared by
its cost, applicability, and effectiveness over different
product types and application environments. Based on these,
the author recommends an integrated approach for software
quality assurance and improvement.
Key words: defect, error removal, failure prevention and
containment, fault detection and removal, QA alternatives
and techniques
INTRODUCTION
With the pervasive use of software systems in modern society,
the negative impact of software defects is also increasing.
Consequently, one central activity for quality assurance (QA)
is to ensure that few, if any, defects remain in the software
when it is delivered to its customers or released to the market.
Furthermore, one wants to ensure that, if possible, these
remaining defects will cause minimal disruption or damage.
Most modern software systems beyond limited personal use have
become progressively larger and more complex because of the
increased need for automation, functions, features, and services.
It is nearly impossible to completely prevent or eliminate
defects in such large complex systems. Instead, various QA
alternatives and related techniques can be used in a concerted
effort to effectively and efficiently assure their quality.
Testing is among the most commonly performed QA activities
for software. It detects execution problems so that underlying
causes can be identified and fixed. Inspection, on the other
hand, directly detects and corrects software problems without
resorting to execution. Other QA alternatives, such as formal
verification, defect prevention, and fault tolerance, deal
with defects in their own ways. Close examination of how different
QA alternatives deal with defects can help one better use
them for specific applications.
This article examines the generic ways to deal with defects
and classifies QA alternatives accordingly. Existing QA alternatives
are surveyed and then compared by their cost, applicability,
and effectiveness under different application environments
and for different product types. The article concludes with
the author's recommendation for an integrated approach
for effective quality assurance and improvement.
DEFECTS AND GENERIC WAYS TO DEAL WITH DEFECTS
This section clarifies various meanings of the term defect,
and then examines the generic ways to deal with defects.
Defect-Related Definitions
The term defect generally refers to some problem with
the software, either with its external behavior or with its
internal characteristics. The IEEE Standard 610.12 (IEEE 1990)
defines the following terms related to defects:
Failure: The inability of a system or component
to perform its required functions within specified performance
requirements
Fault: An incorrect step, process, or data definition
in a computer program
Error: A human action that produces an incorrect result
The term failure refers to a behavioral deviation from
the user requirement or the product specification; fault
refers to an underlying condition within software that causes
certain failure(s) to occur; error refers to a missing
or incorrect human action resulting in certain fault(s) being
injected into software. Sometimes error is also used to refer
to human misconceptions or other misunderstandings or ambiguities
that are the root cause for the missing or incorrect actions.
With these definitions, one can see that failures, faults, and
errors are different aspects of defects. A causal relation exists
among these three aspects; that is, errors may cause faults
to be injected into the software, and faults may cause failures
when the software is executed. This relationship is not necessarily
1-to-1. A single error may cause many faults, such as when a
wrong algorithm is applied in multiple modules and causes multiple
faults, and a single fault may cause many failures in repeated
executions. Conversely, the same failure may be caused by several
faults, such as an interface or interaction failure involving
multiple modules, and the same fault may be there because of
different errors. Figure 1 illustrates some of these situations:
the error e3 causes multiple faults, f2 and f3, and the fault f1
is caused by multiple errors, e1 and e2.
Dealing With Defects
With the previous definitions, one can view different QA activities
as attempting to prevent, eliminate, reduce, or contain various
problems associated with different aspects of defects. One can
classify these QA alternatives into the following three generic
categories:
Defect prevention through error removal.
These QA activities prevent certain types of faults from being
injected into the software, which can be done in two generic
ways:
1. Eliminating certain error sources by
eliminating ambiguity or correcting human misconceptions
2. Fault prevention, or breaking the causal relation
between error sources and faults by correcting the missing/incorrect
human actions through the use of certain tools and technologies
or enforcement of certain process and product standards
Because errors are the missing or incorrect human actions,
both the elimination of the causes for them through error
source elimination and the direct correction of these actions
through fault prevention contribute to error removal.
Defect reduction through fault detection and removal.
These QA alternatives detect and remove faults. In fact, most
traditional QA activities fall into this category. For example,
inspection directly detects and removes faults in the software,
while testing removes faults based on related failure observations.
Defect containment through failure
prevention. These QA alternatives break the causal relation
between faults and failures so that local faults will not
cause global failures, thus "tolerating" these faults.
A related extension is containment measures to avoid catastrophic
consequences in case of failures.

These QA activities are illustrated in Figure 1, forming a series
of barriers used to remove or block defect sources and prevent
undesirable consequences. These barriers are depicted as the broken
lines between the error sources and the software system, and between
the software system and the results. Figure 1 also shows the
relationship between these QA activities and related errors, faults,
and failures. For example, through the error-removal activity, some
of the human conceptual errors, for example, e6, are directly removed,
while other incorrect actions or errors, for example, e5, are blocked
and removed. Some faults, for example, f4, are directly detected
through inspection and removed, while others, such as f3, are detected
through testing and removed. Still others, for example, f2, are blocked
through fault tolerance.
Different QA alternatives can be viewed as a concerted effort
to deal with errors, faults, or failures to achieve the common
goal of quality assurance and improvement. Defect prevention
and defect reduction activities directly deal with the competing
processes of defect injection and removal during the software
development process (Humphrey 1995). They affect the defect
contents, or the number of faults, in the finished software
products. On the other hand, defect containment activities aim
at minimizing the negative impact of these remaining faults.
The author next surveys these alternatives and examines how
they deal with defects in their specific ways.
DEFECT PREVENTION THROUGH ERROR REMOVAL
The QA alternatives commonly referred to as defect prevention
activities can be used for most software systems to reduce the
chance for defect injections and the subsequent cost to deal
with these injected defects. They attempt to remove errors through
error-source elimination and fault prevention. Specific alternatives
for defect prevention are discussed next.
Education and Training: People-Based Solutions for Error-Source
Elimination
It has long been observed by software practitioners that the
people factor is the most important factor that determines the
quality and, ultimately, the success or failure of most software
projects. Education and training of software professionals,
such as through the personal software process® (PSP) (Humphrey
1995), can help them control, manage, and improve the way they
work. Such activities can also help ensure that they have few,
if any, misconceptions related to the product and the product
development. Eliminating these human misconceptions will help
prevent certain types of faults from being injected into software
products. The education and training effort for error-source
elimination should focus on the following areas:
Product and domain-specific knowledge.
If the people involved are not familiar with the product type
or application domain, there is a good chance that wrong solutions
will be implemented. For example, if programmers who only
had experience with numerical computation were asked to design
and implement telecommunication software systems, they may
not recognize the importance of making the software work within
the existing infrastructure, thus creating incompatible software.
Software development methodology expertise.
This plays an important role in developing high-quality software
products. For example, lack of expertise with requirement
analysis and product specification usually leads to problems
and rework in subsequent design, coding, and testing activities.
A related issue is the required expertise with relevant software
technologies and tools. For example, in an implementation
of cleanroom technology (Mills, Dyer, and Linger 1987), if
the developers are not familiar with the key components of
formal verification or statistical testing, there is little
chance for producing high-quality products.
Development process knowledge. If the project
personnel do not have a good understanding of the development
process, there is little chance that the process can be implemented
correctly. For example, if the people involved in incremental
software development do not know how the individual development
efforts for different pieces or increments fit together, the
uncoordinated increment development may lead to interface
or interaction problems.
Formal Method: Error-Source Elimination and Fault Absence
Verification
Formal development methods, or formal methods, include formal
specification and formal verification. Formal specification
is concerned with producing an unambiguous set of product specifications
so customer requirements, as well as environmental constraints
and design intentions, are correctly reflected, thus reducing
the chances of accidental fault injections. Formal verification
checks the conformance of software design or code to these formal
specifications, thus ensuring that the software is fault-free
with respect to its formal specifications.
Various techniques exist to specify and verify the "correctness"
of software systems, namely, to answer the question: "What
is the correct behavior and how do we verify it?" The most
influential ones include axiomatic correctness, predicate transforms,
and functional correctness, described in (Zelkowitz 1993). The
basic ideas of axiomatic correctness can be summarized as follows:
The program states before and after executing a program segment
S can be described by its pre-condition P and post-condition
Q, respectively, and denoted as {P}S{Q}, indicating that "if P
is true before executing S and S terminates normally, then Q
will be true." This pair of logical predicates constitutes the
formal specifications for the program, against which the implemented
program needs to be verified. As a practical example, if a program
accepts non-negative input for its input variable x, and computes
its output y as the square root of x, the precondition can then
be described by the logical predicate {x >= 0}, and the post-condition
can be described by {y = √x}.
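The square-root specification can be illustrated with run-time checks. The following is a minimal sketch, not a formal proof: it only checks the pre- and post-conditions for the inputs actually supplied. The function name checked_sqrt and the floating-point tolerance are assumptions introduced for illustration.

```python
import math


def checked_sqrt(x: float, tol: float = 1e-9) -> float:
    """Compute y = sqrt(x), checking {x >= 0} before and {y = sqrt(x)} after."""
    # Pre-condition P: the input must be non-negative.
    if x < 0:
        raise ValueError("pre-condition {x >= 0} violated")
    y = math.sqrt(x)
    # Post-condition Q: y is the non-negative root, so y * y equals x
    # up to a floating-point tolerance (the tolerance is an assumption
    # of this sketch, not part of the specification in the text).
    assert y >= 0 and math.isclose(y * y, x, rel_tol=tol, abs_tol=tol), \
        "post-condition {y = sqrt(x)} violated"
    return y
```

Calling checked_sqrt(9.0) returns 3.0, while checked_sqrt(-1.0) is rejected because the precondition does not hold.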
There are axioms or inference rules to link different predicates,
such as the following axiom:
Axiom A1:
{P} ⇒ {R}, {R}S{Q}
_________________
{P}S{Q}
where "⇒" is the logical relation "implies." This kind of rule
is interpreted as, "if we know that the expressions above
the line are true, then we can infer that the expression below
the line follows." Axiom A1 states that if a program works
for a given precondition, it also works for a more restrictive
(or stronger) precondition. In the previous example, if one
has already proven that the program S works for all nonnegative
inputs, or {R}S{Q}, with R = {x >= 0}, then by applying axiom A1,
one can conclude that it also works for a positive input of bounded
value, that is, {P}S{Q}, with P = {0 < x <= 1000}, because
P ⇒ R in this case.
There is an axiom stating the pre- and post-conditions for
each fundamental element of a language, for example, an
assignment, an if-statement, and so on. The first type of axiom
is simply in the form of {P}S{Q}. For example, the axiom
for the assignment statement is given by:
Axiom A2:
{P[y/x]} x ← y {P}
where {P[y/x]} is derived from expression P with all free occurrences
of x (x is not bound to other conditions) replaced by y. As a
practical example, consider a program that balances a banking
account: If no negative balance is allowed after each transaction,
that is, {b >= 0} is the post-condition P, the precondition
P[(b - w)/b], before the withdrawal of money as represented by the
assignment statement, b ← b - w, is then represented by {b - w >= 0},
or {b >= w}, by the preceding axiom. That is,
the precondition for maintaining nonnegative balance is that
sufficient funds exist before each withdrawal transaction.
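The substitution in the assignment axiom can be sketched in code by treating predicates as functions over a program state. This is an illustrative sketch under stated assumptions: the state is modeled as a dictionary of integer variables, and the names assignment_precondition, non_negative_balance, and withdraw are hypothetical. Evaluating the post-condition in the state after the assignment gives the same answer as evaluating the substituted precondition in the state before it.

```python
from typing import Callable, Dict

State = Dict[str, int]
Predicate = Callable[[State], bool]
Expression = Callable[[State], int]


def assignment_precondition(var: str, expr: Expression, post: Predicate) -> Predicate:
    """Return the post-condition with every use of `var` replaced by `expr`."""
    # Evaluating `post` in the updated state is equivalent to evaluating
    # the substituted predicate in the original state.
    def pre(state: State) -> bool:
        return post({**state, var: expr(state)})
    return pre


def non_negative_balance(state: State) -> bool:
    """Post-condition P = {b >= 0}."""
    return state["b"] >= 0


def withdraw(state: State) -> int:
    """Right-hand side of the assignment b <- b - w."""
    return state["b"] - state["w"]


pre = assignment_precondition("b", withdraw, non_negative_balance)

# The derived precondition should agree with {b >= w} on every sampled state.
agrees = all(
    pre({"b": b, "w": w}) == (b >= w)
    for b in range(0, 20)
    for w in range(0, 20)
)
```

Here the sampled comparison only spot-checks the equivalence of {b - w >= 0} and {b >= w}; the axiom itself establishes it for all values.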
Another type of axiom defines the inference rules for multipart
statements. For example, the following axiom gives the "meaning"
for the if-then-else statement:
Axiom A3:
{P ^ B}S1{Q}, {P ^ -B}S2{Q}
_________________________
{P} if B then S1 else S2 {Q}
As a practical example, consider the following statement:
if x >= 0 then y ← x else y ← -x
with post-condition Q = {y = |x|}, precondition P = TRUE, and
B = {x >= 0}. To verify this statement, one must verify:
{P ^ B}S1{Q} and {P ^ -B}S2{Q}.
The first branch (B) to verify is:
{x >= 0} y ← x {y = |x|}
Applying axiom A2, one has:
{x = |x|} y ← x {y = |x|}
Combined with the logical relation {x >= 0} ⇒ {x = |x|}, by applying
axiom A1, this branch is verified. The second branch (-B) can be
verified similarly. Therefore, through these verification steps,
one has verified the above conditional statement.
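The two premises of axiom A3 for this statement can also be spot-checked by execution. The sketch below is a test over sample inputs, not a proof of correctness; the function names abs_by_cases and check_branches are assumptions made for illustration.

```python
def abs_by_cases(x: int) -> int:
    """The statement under verification: if x >= 0 then y <- x else y <- -x."""
    if x >= 0:      # branch S1, taken when B holds
        y = x
    else:           # branch S2, taken when -B holds
        y = -x
    return y


def check_branches(samples) -> bool:
    """Check both premises of axiom A3 on the given sample inputs."""
    for x in samples:
        y = abs_by_cases(x)
        if x >= 0:
            # Premise {P ^ B} S1 {Q}: with x >= 0, Q reduces to y = x.
            branch_ok = (y == x)
        else:
            # Premise {P ^ -B} S2 {Q}: with x < 0, Q reduces to y = -x.
            branch_ok = (y == -x)
        # Q itself: y = |x|.
        if not (branch_ok and y == abs(x)):
            return False
    return True
```

For example, check_branches(range(-1000, 1001)) exercises both branches, whereas the axiomatic proof covers every integer input at once.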
The verification process, often referred to as the proof of
correctness, is a bottom-up process much like the preceding
verification example for the conditional statement: One starts
from individual statements, verifies intermediate conditions
through axioms or inference rules, and finally verifies the
pre- and post-conditions for the complete program.
The axiomatic correctness surveyed previously, as well as several
other formal specification and verification techniques, are
described in (Zelkowitz 1993), together with examples, discussions,
comparisons, and references for additional literature. So far,
the biggest obstacle to formal methods is the high cost associated
with performing these human-intensive activities correctly without
adequate automated support, because the proofs are typically
one order of magnitude longer than the programs or designs themselves.
Defect Prevention Based on Technologies, Tools, Processes,
and Standards
Besides the formal methods described previously, appropriate
use of other software technologies can also help reduce the
chances of fault injections. For example, the use of the information
hiding principle (Parnas 1972) can help reduce the complexity
of program interfaces and interactions among different components,
thus reducing the possibility of interface or interaction problems.
A better-managed or more suitable process can also eliminate
many systematic problems. Not following the selected process,
however, also leads to some faults being injected into the software.
For example, not following the defined process for system configuration
and revision control may lead to inconsistencies or interface
problems among different software versions or components. Therefore,
ensuring appropriate process selection and conformance helps
eliminate such error sources. Similarly, enforcement of selected
product or development standards also reduces fault injections.
Sometimes, specific software tools can also help reduce the
chances of fault injections. For example, a syntax-directed
editor that automatically balances each open brace, "{,"
with a close brace, "}," can help reduce syntactical
problems in programs written in the C language.
Additional work is needed to guide the selection of appropriate
processes, standards, tools, and technologies, or to tailor
existing ones to fit the specific application environment. Effective
monitoring and enforcement systems are also needed to ensure
that the selected process or standard is followed, or the selected
tool or technology is used properly, to reduce the chances of
fault injection.
Root-Cause Analysis for Defect Prevention
Notice that many of the error-removal activities described previously
implicitly assume that there are known error sources or missing/incorrect
actions that result in fault injections, as follows:
If human misconceptions are the error sources,
education and training should be part of the solution.
If imprecise designs and implementations that deviate
from product specifications or design intentions are the causes
for faults, formal methods should be part of the solution.
If nonconformance to selected processes or standards
is the problem that leads to fault injections, then process
conformance or standard enforcement should be part of the
solution.
If there is empirical or logical evidence that certain
tools or technologies can reduce fault injections under similar
environments, these tools or technologies should be adopted.
Therefore, root-cause analyses are needed to establish these
preconditions, so that appropriate defect prevention activities
can be applied for error removal. These analyses usually take
two forms: logical analysis and statistical analysis. Logical
analysis examines the logical link between the faults (effects)
and the corresponding errors (causes), and establishes general
causal relations.
This analysis is human intensive, and should be performed by
experts with thorough knowledge of the product, the development
process, the application domain, and the general environment.
Statistical analysis is based on empirical evidence collected
either locally or from other similar projects. These data can
be fed to various models to establish the predictive relations
between causes and effects. Once such causal relations are established,
appropriate QA activities can then be selected and applied for
error removal.
DEFECT REDUCTION THROUGH FAULT DETECTION AND REMOVAL
For most large software systems in use today, it is unrealistic
to expect that error-removal or defect prevention activities
can be 100 percent effective in preventing accidental fault
injections. Therefore, there is a need for effective techniques
to remove as many of the injected faults as possible under project
constraints.
Inspection: Direct Fault Detection and Removal
Software inspections are critical examinations of software artifacts
by human inspectors aimed at discovering and fixing faults in
the software systems. Inspection is a well-known QA alternative
familiar to most software quality professionals. The earliest
and most influential work in software inspection is Fagan inspection
(Fagan 1976), which organizes inspection into the following
six steps:
1. Planning: Deciding what to inspect and if inspection
is ready to start.
2. Overview meeting: The author meets with and gives an overview
of the inspection object to the inspectors. Assignment of
individual pieces among the inspectors is also done.
3. Preparation: Individual inspection is performed by each
inspector.
4. Inspection meeting to collect and consolidate individual
inspection results: Fault identification in this meeting is
carried out as a consensus-building process.
5. Rework: The author fixes the identified problems or provides
other responses.
6. Follow-up: Close the inspection process by final validation
or reinspection.
Therefore, faults are detected directly in inspection, and removed
as part of the inspection process.
Other variations have been proposed and used to effectively
conduct inspection under different environments. A detailed
discussion about inspection processes and techniques, applications
and results, and related topics can be found in (Gilb and Graham
1993).
Inspection is most commonly applied to code, but it could also
be applied to requirement specifications, designs, test plans
and test cases, user manuals, and other documents or software
artifacts. Another important benefit is the opportunity to conduct
causal analysis during the inspection process, for example,
as an added step in Gilb inspection (Gilb and Graham 1993).
These causal analysis results can be used to guide defect prevention
activities by removing identified error sources or correcting
identified missing/incorrect human actions.
Testing: Failure Observation and Fault Removal
Testing is one of the most important parts of QA and the most
commonly performed QA activity.
Testing involves the execution of software and the observation
of the program behavior or outcome. If a failure is observed,
the execution record is then analyzed to locate and fix the
fault(s) that caused the failure. Various individual testing
activities and techniques can be classified using various criteria,
as discussed next, with special attention paid to how they
deal with defects.
When can a specific testing activity be performed and
related faults be detected?
Because testing is an execution-based QA activity, a prerequisite
to actual testing is the existence of the implemented software
units, components, or system to be tested, although preparation
for testing can be carried out in earlier phases of software
development. As a result, actual testing can be divided into
various subphases starting from the coding phase up to post-release
product support, including: unit testing, component testing,
integration testing, system testing, acceptance testing, beta
testing, and so on. The observation of failures can be associated
with these subphases, and the identification and removal of
related faults can be associated with corresponding individual
units, components, or the complete system.
What to test, and what kind of faults are found?
Black-box (or functional) testing verifies the correct handling
of the external functions by the software, or whether the observed
behavior conforms to user expectations or product specifications.
White-box (or structural) testing verifies the correct implementation
of internal units, structures, and relations among them. Various
techniques can be used to build models and generate test cases
to perform systematic testing (Beizer 1990; Musa 1998). Failures
related to specific external functions or internal implementations
could be observed, resulting in corresponding faults being detected
and removed.
When, or at what defect level, to stop testing?
Most of the traditional testing techniques and testing subphases
use some kind of coverage information as the stopping criteria,
with the assumption that higher coverage means higher quality
or lower defect levels. For example, checklists are often used
to make sure major functions and usage scenarios are tested
before product release. Every statement or unit in a component
must be covered before subsequent integration testing can proceed
in many organizations. More formal testing techniques include
control flow testing that attempts to cover execution paths
and domain testing that attempts to cover boundaries between
different input subdomains (Beizer 1990). Such formal coverage
information can only be obtained by using expensive coverage
analysis and testing tools. Rough coverage measurement, however,
can be obtained easily by examining the proportion of tested
items in various checklists.
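The rough, checklist-based coverage measurement mentioned above amounts to a simple proportion. A minimal sketch follows; the checklist contents and the function name checklist_coverage are hypothetical.

```python
def checklist_coverage(checklist: dict) -> float:
    """Return the proportion of checklist items marked as tested."""
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    tested = sum(1 for done in checklist.values() if done)
    return tested / len(checklist)


# Hypothetical checklist of major functions; names are illustrative only.
release_checklist = {
    "login": True,
    "search": True,
    "checkout": False,
    "report export": True,
}
coverage = checklist_coverage(release_checklist)  # 3 of 4 items tested
```

Such a figure says how much of the checklist was exercised, not how many defects remain.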
On the other hand, product reliability goals can be used as
a more objective criterion to stop testing. The use of this
criterion requires testing to be performed under an environment
that resembles actual use by target customers so that realistic
reliability assessment can be obtained, resulting in the so-called
statistical usage-based testing (Musa 1998).
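One way to approximate usage by target customers is to select test operations at random, weighted by how often each operation is expected to be used. The sketch below assumes a hypothetical operational profile; the operation names, probabilities, and the function sample_operations are illustrative and are not taken from (Musa 1998).

```python
import random
from collections import Counter


def sample_operations(profile: dict, n: int, seed: int = 0) -> list:
    """Draw n operations at random, weighted by their usage probabilities."""
    rng = random.Random(seed)  # seeded so that a test run is repeatable
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)


# Hypothetical operational profile: operation -> probability of use.
profile = {"query": 0.70, "update": 0.25, "admin": 0.05}
test_run = sample_operations(profile, n=1000)
counts = Counter(test_run)  # how often each operation was selected
```

Frequently used operations are exercised most often, so failures observed in such a run reflect what customers are most likely to encounter.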
The coverage criterion ensures that certain types of faults
are detected and removed, thus reducing the number of defects,
although quality is not directly assessed. The usage-based testing
and the related reliability criterion ensure that the faults
that are most likely to cause problems are detected and removed,
and the reliability of the software reaches certain targets
before testing stops.
Other Techniques for Fault Detection and Removal
Inspection is the most commonly used static technique
for defect detection and removal.
Various other static techniques are available, including various
formal model-based analyses such as algorithm analysis, decision-table
analysis, boundary value analysis, finite-state machine and
Petri-net modeling, control and data-flow analyses, software
fault trees, and so on.
Similarly, in addition to testing, other dynamic, execution-based
techniques also exist for fault detection and removal. For example,
symbolic execution, simulation, and prototyping can help one
detect and remove defects early in the software development
process, before large-scale testing becomes a viable alternative.
On the other hand, in-field measurement and related analyses,
such as timing and performance monitoring and analysis for real-time
systems, and accident reconstruction using software event trees
for safety-critical systems, can also help one locate and remove
related defects.
A comprehensive survey of techniques for fault detection and
removal, including those mentioned previously, can be found
in (Wallace, Ippolito, and Cuthill 1996).
Risk Identification and Defect Reduction
Fault distribution is highly uneven for most software products,
regardless of their size, functionality, implementation language,
and other characteristics. Much empirical evidence has accumulated
over the years to support the so-called 80/20 rule, which states
that 20 percent of the software components are responsible for
80 percent of the problems. These problematic components can
generally be characterized by specific measurement properties
about their design, size, complexity, change history, and other
product or process characteristics. Because of the uneven fault
distribution among software components, there is a great need
for risk identification techniques to analyze these measurement
data so that inspection, testing, and other defect detection
and reduction activities can be more effectively focused on
those potentially high-defect components.
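The uneven distribution can be checked against a project's own defect data. The sketch below computes the smallest share of components that accounts for a target share of defects; the component names and defect counts are hypothetical, chosen only to illustrate the calculation.

```python
def smallest_share_covering(defects_by_component: dict, target: float = 0.8) -> float:
    """Return the fraction of components that together hold `target` of all defects."""
    total = sum(defects_by_component.values())
    if total == 0:
        raise ValueError("no defects recorded")
    covered = 0
    # Walk components from most to least defect-prone.
    ranked = sorted(defects_by_component.values(), reverse=True)
    for used, count in enumerate(ranked, start=1):
        covered += count
        if covered >= target * total:
            return used / len(ranked)
    return 1.0


# Hypothetical defect counts for ten components (100 defects in total).
defects = {"A": 55, "B": 30, "C": 4, "D": 3, "E": 2,
           "F": 2, "G": 1, "H": 1, "I": 1, "J": 1}
share = smallest_share_covering(defects)  # components A and B hold 85 defects
```

In this made-up data set, two of ten components cover at least 80 percent of the defects, so those two would be the first candidates for focused inspection and testing.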
A survey of these risk identification techniques and their comparison
can be found in (Tian 2000), including: traditional statistical
analysis techniques, principal component analysis and discriminant
analysis, neural networks, tree-based modeling, pattern-matching
techniques, and learning algorithms. These techniques were compared
according to several criteria, including: accuracy, simplicity,
early availability and stability, ease of result interpretation,
constructive information and guidance for quality improvement,
and availability of tool support. Appropriate risk identification
techniques can be selected to fit specific application environments
in order to identify high-risk software components for focused
inspection and testing.
DEFECT CONTAINMENT THROUGH FAILURE PREVENTION
Because of the large size and high complexity of most software
systems in use today, the aforementioned defect reduction activities
can greatly reduce the number of faults but not completely eliminate
them. For software systems where failure impact is substantial,
such as real-time control software used in medical, nuclear,
transportation, and other embedded systems, this low defect
level and failure risk may still not be adequate. Some additional
QA alternatives are needed.
On the other hand, these few remaining faults may be triggered
under rare conditions or unusual dynamic scenarios, making it
unrealistic to try to generate the huge number of test cases
to cover all these conditions or to perform exhaustive inspection
or analysis based on all possible scenarios. Instead, some other
means must be used to prevent failures by breaking the causal
relations between these faults and the resulting failures, thus
"tolerating" these faults, or to contain the failures
to reduce the resulting damage.
Fault Tolerance with Recovery Blocks
Software fault tolerance ideas originate from fault tolerance
designs in traditional hardware systems requiring higher levels
of reliability, availability, or dependability. In such systems,
spare parts and backup units are commonly used to keep the systems
in operational condition, perhaps at a reduced capability, in
the presence of unit or part failures. The primary software
fault tolerance techniques include recovery blocks and N-version
programming (NVP) covered in detail in (Lyu 1995). The author
next briefly describes these techniques and examines how they
deal with failures and related faults.

The use of recovery blocks introduces duplication of software executions
so occasional failures only cause loss of partial computational
results but not complete execution failures. For example, the
ability to dynamically back up and recover from occasional lost
or corrupted transactions is built into many critical databases
used in financial, insurance, health care, and other industries.
Figure 2 illustrates this technique,
and depicts the four major activities involved:
1. Periodic checkpointing and refreshing to save
the dynamic contents of software executions
2. Failure detection: If a failure is detected, the following
two steps are performed.
3. Rollback by restoring the saved dynamic contents associated
with the latest checkpoint
4. Rerun the lost computation, and the normal activity continues
One key decision in this technique is the checkpointing frequency:
higher frequency leads to higher cost associated with frequent
refreshing of the saved dynamic contents, while lower frequency
leads to longer and more costly recovery. An optimal frequency
balances the two and incurs minimal overall cost.
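The four activities (checkpointing, failure detection, rollback, and rerun) can be sketched as a small loop. This is an illustrative sketch only: the exception class TransientFailure stands in for whatever failure detector a real system uses, and the checkpoint interval, retry limit, and banking-style steps are assumptions.

```python
import copy


class TransientFailure(Exception):
    """Signals a detected failure (hypothetical failure detector)."""


def run_with_checkpoints(state, steps, checkpoint_every=2, max_retries=3):
    """Apply each step to state, rolling back and rerunning on failure."""
    saved_state, saved_index = copy.deepcopy(state), 0    # initial checkpoint
    i, retries = 0, 0
    while i < len(steps):
        try:
            steps[i](state)                     # normal activity
            i += 1
            if i % checkpoint_every == 0:       # 1. periodic checkpointing
                saved_state, saved_index = copy.deepcopy(state), i
                retries = 0
        except TransientFailure:                # 2. failure detection
            retries += 1
            if retries > max_retries:
                raise                           # repeated failure: handle off-line
            state = copy.deepcopy(saved_state)  # 3. rollback to latest checkpoint
            i = saved_index                     # 4. rerun the lost computation
    return state


attempts = {"count": 0}


def deposit(state):
    state["balance"] += 10


def flaky_deposit(state):
    attempts["count"] += 1
    state["balance"] += 10          # partial work done before the failure
    if attempts["count"] == 1:      # fail only on the first attempt
        raise TransientFailure("simulated disturbance")


final = run_with_checkpoints(
    {"balance": 0}, [deposit, deposit, deposit, flaky_deposit]
)
```

A smaller checkpoint_every means more frequent copies of the state but less work to redo after a failure, which is the cost trade-off described above.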
In using recovery blocks, failures are detected, but the underlying
faults are not removed, although off-line activities can be
carried out to identify and remove the faults in case of repeated
failures. One hopes the dynamic condition or external disturbance
that accompanied the original failure will not repeat, thus
subsequent rerun of the lost computation can succeed and normal
operation can resume. In this respect, faults are tolerated
in the system, with occasional minor delays, a loss of performance
tolerable under many circumstances. Repeated failures, however,
have to be dealt with off-line, or by using other fault tolerance
techniques, such as NVP discussed next.
Fault Tolerance with N-version Programming

NVP
is another way to tolerate software faults by directly introducing
duplications into the software itself (Lyu 1995). NVP is generally
more suitable than recovery blocks when timely decisions or
performance are critical, such as in many real-time control
systems. The basic technique is illustrated in Figure 3 and briefly
described here:
1. The basic functional units of the software system
consist of N parallel independent versions of programs with
identical functionality: version 1, version 2, ..., version N.
2. The system input is distributed to all the N versions.
3. The individual output for each version is fed to a decision
unit.
4. The decision unit determines the system output based on
its inputs using a specific decision algorithm (often a majority
vote, but other algorithms are also possible).
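The four steps above can be sketched as follows. This is a minimal illustration, assuming a strict-majority decision algorithm and hashable outputs; the names run_nvp, majority_vote, and the three toy versions are hypothetical, and the versions run sequentially here rather than in parallel.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output produced by a strict majority of versions."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among version outputs")
    return value


def run_nvp(versions, system_input, decide=majority_vote):
    """Feed the same input to every version and let the decision unit choose."""
    outputs = [version(system_input) for version in versions]  # steps 2 and 3
    return decide(outputs)                                     # step 4


# Step 1: three versions intended to compute the absolute value.
def version_1(x):
    return abs(x)


def version_2(x):
    return x if x >= 0 else -x


def version_3(x):
    return x  # seeded fault: wrong for negative inputs


system_output = run_nvp([version_1, version_2, version_3], -5)
```

Here the local failure in version_3 is outvoted, so the system output is still 5. A version that crashes instead of returning a wrong value is not handled in this sketch; a fuller decision unit would have to treat that case as well.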
The basic assumption in NVP is that faults in different versions
are independent, which implies that it is rare to have the same
fault triggered by the same input and cause the same failure
among different versions. Therefore, even if there is a fault
that causes a local failure in version i, the whole system is
likely to function correctly because the other (independent)
versions are likely to function correctly under the same dynamic
environment. In this way, the causal relation between local
faults and system failures is broken for most local faults under
most situations, thus improving the quality and reliability
of the software system. One of the main research topics in NVP
is to ensure that the software versions are as independent as
possible so local faults can be tolerated and the resulting
local failures can be contained effectively.
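The effect of the independence assumption can be quantified with a short calculation. The sketch below is an idealization introduced here, not a result from the article: it assumes every version fails on a given input with the same probability p, that failures are fully independent, and that the system fails whenever the correct versions do not form a strict majority. The function name system_failure_probability is hypothetical.

```python
from math import comb


def system_failure_probability(n, p):
    """Probability that correct versions do not form a strict majority.

    Assumes n versions that each fail independently with probability p.
    """
    threshold = (n + 1) // 2  # this many failures leaves no correct majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


single = system_failure_probability(1, 0.01)   # one version, no redundancy
triple = system_failure_probability(3, 0.01)   # three-version system
```

Under these assumptions a three-version system fails with probability 3p^2(1 - p) + p^3, or about 0.0003 for p = 0.01, compared with 0.01 for a single version. When the versions share common faults the independence assumption no longer holds and the real benefit is smaller, which is why version independence is a central research topic.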
Safety Assurance and Failure Containment
The concerted effort of the previously described QA activities
should reduce the system failure probability to a very low level.
For safety-critical systems, however, the primary concern is
the ability to prevent accidents, where an accident is a failure
with a severe consequence. Even such a low failure probability
is not tolerable in these systems if most failures may lead to
accidents. Therefore, in addition to the aforementioned QA techniques,
specific techniques are also used for safety-critical systems
based on analysis of hazards, or logical preconditions for accidents.
These safety assurance and improvement techniques are discussed
in detail in (Leveson 1995). Following is a brief discussion
of them and an analysis of how each technique deals with defects:
Hazard elimination through substitution,
simplification, decoupling, elimination of specific human
errors, and reduction of hazardous materials or conditions.
This is similar to the error removal techniques described
before but with a focus on those error sources involved in
hazardous situations.
Hazard reduction through design for controllability
(for example, automatic pressure release in boilers), use
of barriers (for example, hardware/software interlocks), and
failure minimization using safety margins and redundancy.
These techniques are similar to the fault tolerance techniques
discussed previously, where local failures are contained without
leading to system failures.
Hazard control through reducing exposure, isolation
and containment (for example, barriers between the system
and the environment), protection systems (active protection
activated in case of hazard), and fail-safe design (passive
protection, failing in a safe state without causing further damage).
These techniques reduce the severity of failures, therefore
weakening the link between failures and accidents.
Damage control through escape routes, safe abandonment
of products and materials, and devices for limiting physical
damages to equipment or people. These techniques reduce the
severity of accidents thus limiting the damage caused by these
accidents.
Notice that both hazard control and damage control are post-failure
activities not generally covered in the QA activities described
before. These activities are specific to safety-critical systems.
On the other hand, many techniques for hazard elimination and
reduction can also be used in general systems to reduce fault
injection and to tolerate local faults.
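The interlock and fail-safe ideas listed above can be illustrated with a small software sketch. It is a hypothetical example built on the boiler scenario mentioned earlier, not a technique taken from (Leveson 1995): the function heater_command, the constant SAFE_PRESSURE_LIMIT, and the sensor callback are all assumptions, and a real safety-critical design would rest on a full hazard analysis and would normally include hardware protection as well.

```python
import math

SAFE_PRESSURE_LIMIT = 10.0  # hypothetical limit in arbitrary units


def heater_command(read_pressure, requested_on):
    """Decide whether the heater may run; any doubt resolves to off."""
    try:
        pressure = read_pressure()
    except Exception:
        return False          # fail-safe: a sensor failure turns the heater off
    if math.isnan(pressure) or pressure < 0:
        return False          # fail-safe: an implausible reading is not trusted
    if pressure >= SAFE_PRESSURE_LIMIT:
        return False          # interlock: never heat at or above the limit
    return requested_on


def broken_sensor():
    """Simulated sensor failure."""
    raise IOError("sensor not responding")


normal = heater_command(lambda: 5.0, True)          # heating allowed
over_limit = heater_command(lambda: 12.0, True)     # blocked by the interlock
sensor_failed = heater_command(broken_sensor, True) # fails in the safe state
```

The design choice is that the safe state (heater off) is the default outcome, so a failure of the sensor or of the reading logic cannot by itself lead to the hazardous condition.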
COMPARISON AND RECOMMENDATIONS
The author next compares the different QA activities by examining
their cost, applicability under different environments and development
phases, and effectiveness in dealing with different types of
problems. Based on this comparison, the author also provides
some general recommendations.
Cost and Applicability
Testing is among the standard activities that make up the whole
software development process, regardless of the process choice
or the product type. Therefore, the cost and applicability of
other QA alternatives are examined using testing as the baseline
for comparison.
In general, the longer a fault remains in a software system,
the higher the total cost (more than linear increase) associated
with fixing the related problems (Boehm 1981; Humphrey 1995).
In addition to fixing the original fault, the problems that
must be resolved include the failures caused by the original
fault, as well as other related faults that may be injected
in a chain reaction because of the presence of the original
fault, such as in a module that needs to interface with the
module containing the original fault. Therefore, fixing problems
early in the development process or, even better, preventing
the injection of faults through error removal is generally
more cost-effective than dealing with the problems later in
other QA activities.
Unlike testing, which can only be performed after the software
system is at least partially implemented, inspection can be
performed throughout the software development process and on
almost any software artifacts. The cost for conducting different
variations of inspection ranges from very low for informal reviews
to that comparable to testing for formal inspections. According
to data compiled in (Gilb and Graham 1993), inspection typically
brings in a return-on-investment (ROI) ratio of around 10-to-1.
This effect is particularly strong in the earlier phases of
software development.
Formal verification can be viewed as an extremely structured
kind of inspection where all the formally specified elements
of the design or the code are formally verified. As mentioned
before, the proof of correctness for a program or a design is
typically one order of magnitude longer than the program or
the design itself (Zelkowitz 1993), thus such human-intensive
proofs cost significantly more than most inspections, and usually
cost more than testing. Fault tolerance techniques cost significantly
more because of the built-in duplications (Lyu 1995). Safety
assurance activities cost even more because of all the associated
actions taken to address both pre-failure and post-failure issues
to ensure not only low probability of failure, but also to limit
the failure consequences and damages (Leveson 1995). For systems
requiring higher levels of quality and reliability, or for
safety-critical applications, however, the associated high cost is
usually justified. A careful cost-benefit analysis must be performed
based on historical data from the same or similar software development
organizations to choose the appropriate QA alternatives for
different types of software products.
Problem Types, Defect Levels, and Choice of QA Alternatives
In general, if systematic problems exist in an organization
and its products, preventive action is the most effective way
to deal with them. Such systematic problems are generally associated
with common failures traceable to common faults, and these common
faults can be traced in turn to some common errors through causal
analysis. As pointed out in (Humphrey 1995): "While detecting
and fixing defects is critically important, it is an inherently
defensive strategy. To make significant quality improvements,
you should identify the causes of these defects and take steps
to eliminate them."
On the other hand, sporadic problems can generally be dealt
with by other QA alternatives. One key difference between inspection
and testing is the way faults are identified: inspection identifies
them directly by examining the software artifact, while failures
are observed during testing and related faults are identified
later by using the recorded execution information.
This key difference leads to the different types of faults commonly
detected using these two techniques: inspection is usually good
at detecting static and localized faults, while testing is good
at detecting dynamic and global faults involving multiple components
in interactions (Beizer 1990; Gilb and Graham 1993). In addition,
hidden faults that are not going to cause any failures in the
current execution environment, for example, compatibility problems
with the intended future platforms, could be detected by inspection
but not by testing.
For existing products with relatively high defect levels or
with many common faults, inspection is most likely to be more
effective than testing, because inspection can continue after
the initial fault is detected, but further testing is often
blocked once a fault is encountered and a failure is observed.
In addition, when defect levels are high, execution of most
test cases will result in failure observations, and the subsequent
effort to locate and remove the underlying faults is similar
to that for inspection. Analysis of existing high-defect projects,
commonly conducted in conjunction with inspection, such as in
Gilb inspection (Gilb and Graham 1993), can often point to systematic
problems. Such systematic problems can be most effectively addressed
by defect prevention activities in successor projects.
A proof of correctness or a formal verification can only be
produced if the program is fault-free with respect to its formal
specifications. When verification cannot be successfully completed,
further analysis often reveals accidental logical or functional
faults. This is not, however, an effective method for fault
detection because of the substantial effort involved in the
failed verification attempt. Therefore, formal verification
does not work for software with high defect levels. Fortunately,
the use of formal methods, with formal specification focusing
on error-source elimination and formal verification focusing
on verifying the conformance in designs and code, generally
results in low defect levels (Zelkowitz 1993).
Fault tolerance techniques generally involve the observations
of dynamic local failures and the tolerance of the related faults
but not the identification and removal of these faults. These
techniques only work when defect levels are very low, because
multiple fault encounters or frequent failures cannot be effectively
tolerated (Lyu 1995). Therefore, other QA alternatives must
be used to reduce the defects to a very low level before fault
tolerance techniques can be used to further reduce the probability
of system failures.
On the other hand, many software safety assurance techniques
attempt to weaken the link between failures and accidents or
reduce the damage associated with accidents. The focus of these
activities is the post-failure accidents and the related hazard
analysis and resolution. Defect levels are expected to be extremely
low because these expensive techniques are generally applied
as the last guard against system safety problems after traditional
QA activities have been performed (Leveson 1995).
Comparison Summary and Recommendations

The previous comparison is summarized in Figure 4. Based on the
comparison and analysis presented so far, the author makes the
following recommendations:
In general, a concerted effort is necessary
with many different QA activities to be used in an integrated
fashion to effectively and efficiently deal with defects and
ensure product quality.
Error removal greatly reduces the chance of fault injections.
Therefore, such preventive actions should be an integral part
of any QA plan. Causal analyses can be performed to identify
systematic problems and select preventive actions to deal
with the problems.
Inspection and testing are applicable to different
situations and effective for different defect types at different
defect levels. Therefore, inspection can be performed first
to lower defect levels, and then testing can be performed
to remove the remaining faults related to dynamic scenarios
and global interactions. To maximize the benefit-to-cost ratio,
various risk identification techniques can be used to focus
inspection and testing effort on identified high-risk product
components.
Software safety assurance (especially hazard and damage
control), fault tolerance, and formal verification techniques
cost significantly more to implement than traditional QA techniques.
If the consequences of failure are severe and the potential damage
is high, however, they can be used to further reduce the failure
probability or to reduce the accident probability or severity.
CONCLUSIONS AND PERSPECTIVES
Because of the pervasive use of and reliance on software systems
today, there is a great need for effective QA alternatives
and related techniques. According to the different ways these
QA alternatives deal with defects, they can be classified
into three categories: defect prevention, defect reduction,
and defect containment.
Existing software quality literature generally covers defect
reduction techniques such as testing and inspection in more
detail than defect prevention activities, while largely ignoring
the role of defect containment in QA. The survey and classification
of different QA alternatives in this article bring together
information from diverse sources to offer a common starting
point and information base for software quality professionals.
The comparison of the applicability, effectiveness, and cost
can help them choose appropriate alternatives and tailor or
integrate them for specific applications.
As an immediate follow-up to this study, the author plans
to collect additional data from industry to quantify the cost
and benefit of different QA alternatives to better support
the related cost-benefit analysis. He also plans to package
application experience from industry to guide future applications.
These efforts will help advance the state-of-practice in industry,
where appropriate QA alternatives can be selected, tailored,
and integrated by software quality professionals for effective
quality assurance and improvement.
Acknowledgment
This work is supported in part by NSF CAREER award CCR-9733588,
THECB/ATP award 003613-0030-1999, and Nortel Networks.
The author wishes to thank the anonymous reviewers for their
constructive comments and suggestions that led to a better
article.
References
Beizer, B. 1990. Software testing techniques, second edition.
Boston, Mass.: International Thomson Computer Press.
Boehm, B. W. 1981. Software engineering economics.
Englewood Cliffs, N. J.: Prentice Hall.
Fagan, M. E. 1976. Design and code inspections to reduce errors
in program development. IBM Systems Journal 15, no. 3: 182-211.
Gilb, T., and D. Graham. 1993. Software inspection.
London: Addison-Wesley Longman.
Humphrey, W. S. 1995. A discipline for software engineering.
Reading, Mass.: Addison-Wesley.
IEEE Standard 610.12. IEEE standard glossary of software
engineering terminology. 1990. New York: Institute of
Electrical and Electronics Engineers.
Leveson, N. G. 1995. Safeware: System safety and computers.
Reading, Mass.: Addison-Wesley.
Lyu, M. R., ed. 1995. Software fault tolerance. New
York: John Wiley & Sons.
Mills, H. D., M. Dyer, and R. C. Linger. 1987. Cleanroom software
engineering. IEEE Software 4, no. 5: 19-24.
Musa, J. D. 1998. Software reliability engineering.
New York: McGraw-Hill.
Parnas, D. L. 1972. On the criteria to be used in decomposing
systems into modules. Communications of the ACM 15,
no. 12: 1053-1058.
Tian, J. 2000. Risk identification techniques for defect reduction
and quality improvement. Software Quality Professional
2, no. 2: 32-41.
Wallace, D. R., L. M. Ippolito, and B. Cuthill. 1996. Reference
information for the software verification and validation process.
NIST Special Publication 500-234.
Zelkowitz, M. V. 1993. Role of verification in the software
specification process. In Advances in Computers 36,
ed. M. C. Yovits, 43-109. San Diego, Calif.: Academic Press.
BIOGRAPHY
Jeff (Jianhui) Tian has a bachelor's degree in
electrical engineering from Xi'an Jiaotong University,
a master's degree in engineering science from Harvard
University, and a doctorate in computer science from the University
of Maryland. He worked for IBM Software Solutions Toronto
Laboratory between 1992 and 1995 as a software quality and
process analyst. Since 1995, he has been an assistant professor
of computer science and engineering at Southern Methodist
University in Dallas, Texas. His current research interests
include software testing, measurement, reliability, safety,
complexity, and telecommunication software and systems. Tian
is a member of IEEE and ACM. He can be reached at Southern
Methodist University, Dept. of Computer Science and Engineering,
Dallas, TX 75275, or by e-mail at
tian@engr.smu.edu.