Chapter 1: Introduction


Michael R. Lyu (AT&T Bell Laboratories)

1.1 The Need for Reliable Software

Since the first electronic digital computer was invented almost fifty years ago[Burk46a], human beings have become dependent on computers in their daily lives. The computer revolution has created the fastest technological advancement that the world has ever seen. Today, computer hardware and software permeate our modern society. The newest cameras, VCRs, and automobiles could not be controlled and operated without computers. Computers are embedded in wristwatches, telephones, home appliances, buildings, and aircraft. Science and technology have been demanding high-performance hardware and high-quality software for making improvements and breakthroughs. Virtually every industry (automotive, avionics, oil, telecommunications, banking, semiconductors, pharmaceuticals) relies highly, if not totally, on computers for its functioning.

The size and complexity of computer-intensive systems have grown dramatically during the past decade, and the trend will certainly continue in the future. Contemporary examples of highly complex hardware/software systems can be found in projects undertaken by NASA, the Department of Defense, the Federal Aviation Administration, the telecommunications industry, and a variety of other private industries. For instance, the NASA Space Shuttle flies with approximately 500,000 lines of software code on board and approximately 3.5 million lines of code in ground control and processing. After being scaled down significantly from its original plan, the International Space Station Alpha is still projected to have millions of lines of software to operate innumerable hardware pieces for its navigation, communication, and experimentation. In the telecommunications industry, operations for phone carriers are supported by hundreds of software systems, with hundreds of millions of lines of source code. In the avionics industry, almost all new payload instruments contain their own microprocessor system with extensive embedded software. A massive amount of hardware and complicated software also exists in the Federal Aviation Administration's Advanced Automation System, the new generation air traffic control system. In our offices and homes, many personal computers cannot function without operating systems (e.g., Windows) ranging from 1 to 5 million lines of code, and many other shrink-wrapped software packages of similar size support our daily use of these computers in a variety of applications.

The demand for complex hardware/software systems has increased more rapidly than the ability to design, implement, test, and maintain them. When the requirements for and dependencies on computers increase, the possibility of crises from computer failures also increases. The impact of these failures ranges from inconvenience (e.g., malfunctions of home appliances) through economic damage (e.g., interruptions of banking systems) to loss of life (e.g., failures of flight systems or medical software). Needless to say, the reliability of computer systems has become a major concern for our society.

Within the computer revolution, achievement has been unbalanced: software continues to shoulder a larger burden while making less progress. It is the integrating potential of software that has allowed designers to contemplate more ambitious systems encompassing a broader and more multidisciplinary scope, and it is the growth in utilization of software components that is largely responsible for the high overall complexity of many system designs. However, in stark contrast with the rapid advancement of hardware technology, proper development of software technology has failed miserably to keep pace in all measures, including quality, productivity, cost, and performance. When we entered the last decade of the 20th century, computer software had already become the major source of reported outages in many systems[Gray90a]. Consequently, recent literature is replete with horror stories of projects gone awry, generally as a result of problems traced to software.

Software failures have been in the spotlight in several major programs. In the NASA Voyager project, the Uranus encounter was in jeopardy because of late software deliveries and reduced capability in the Deep Space Network. Several Space Shuttle missions have been delayed due to hardware/software interaction problems. In one DoD project, software problems caused the first flight of the AFTI/F-16 jet fighter to be delayed over a year, and none of the advanced modes originally planned could be used. Critical software failures have also affected numerous civil and scientific applications. The ozone hole over Antarctica would have received attention sooner from the scientific community if a data analysis program had not suppressed the anomalous data because it was "out of range." Software glitches in an automated baggage-handling system forced Denver International Airport to sit empty more than a year after airplanes were to fill its gates and runways[Gibb94a].

Unfortunately, software can also kill people. The massive Therac-25 radiation therapy machine had enjoyed a perfect safety record until software errors in its sophisticated control systems caused malfunctions that claimed several patients' lives in 1985 and 1986[Lee92a]. On October 26, 1992, the Computer Aided Dispatch system of the London Ambulance Service broke down right after its installation, paralyzing the capability of the world's largest ambulance service to handle 5000 daily requests for carrying patients in emergency situations[SWTR93a]. In the aviation industry, although the real causes of several airliner crashes in the past few years remain mysteries, experts have pointed out that software control could be the chief suspect in some of these incidents, due to its inappropriate response to the pilots' desperate inquiries during abnormal flight conditions.

Software failures have also led to serious consequences in business. On January 15, 1990, a fault in a switching system's newly released software caused massive disruption of a major carrier's long-distance network, and another series of local phone outages traced to software problems occurred during the summer of 1991[Lee92a]. These critical incidents caused enormous revenue losses to thousands of companies relying on telecommunications companies to support their businesses.

Many software systems and packages are distributed and installed in identical or similar copies, all of which are vulnerable to the same software failure. This is why even the most powerful software companies like Microsoft are fearful of the "killer bugs" which can easily wipe out all the profits of a glorious product if a recall is required on the tens of millions of copies they have sold for the product[Cusu95a]. Consequently, many software companies see a major share of project development costs identified with the design, implementation, and assurance of reliable software, and they recognize a tremendous need for systematic approaches using software reliability engineering techniques. Clearly, developing the required techniques for software reliability engineering is a major challenge to computer engineers, software engineers, and engineers of various disciplines for now and the decades to come.

1.2 Software Reliability Engineering Concepts

Software reliability engineering is centered around a very important software attribute: reliability. Software reliability is defined as the probability of failure-free software operation for a specified period of time in a specified environment[ANSI91a]. It is one of the attributes of software quality, a multidimensional property that includes other customer satisfaction factors such as functionality, usability, performance, serviceability, capability, installability, maintainability, and documentation[Grad87a,Grad92a]. Software reliability, however, is generally accepted as the key factor in software quality since it quantifies software failures - the most unwanted events, which make software useless or even harmful to the whole system; malfunctioning software may even kill people. As a result, it is regarded as the most important factor contributing to customer satisfaction. In fact, ISO 9000-3 specifies field failures as the basic requirement for quality metrics: "... at a minimum, some metrics should be used which represent reported field failures and/or defects from the customer's viewpoint. ... The supplier of software products should collect and act on quantitative measures of the quality of these software products." (See Section 6.4.1 of[ISO91a].)

Example 1.1 shows the impact of high-severity failures on customer satisfaction.

	( Figure 1.1 - Not Shown Here )

Figure 1.1:  Correlations between Software Quality and High-Severity Failures

Example 1.1

A survey of nine large software projects was taken in[Merc94a] to study the factors contributing to customer satisfaction. These projects were telecommunications systems responsible for day-to-day operations in the U.S. local telephone business. The survey requested telephone customers to assign a quality score between 0 and 100 to each system. The average size of these projects was 1 million lines of source code.

Meanwhile, Trouble Reports (i.e., failure reports in the field) were collected from these projects. Figure 1.1 shows the overall quality score from the survey of these projects, plotted against the number of high-severity Trouble Reports.

From Figure 1.1 we can observe a high negative correlation (-0.86) between the overall quality score and the number of high-severity failures for each project. This example illustrates that in the telecommunications industry, the number of critical software failures directly indicates negative customer perception of overall software quality. This quality indicator is also generally applicable to many other industries. []

Software reliability is also one of the system dependability concepts which are discussed in detail in Chapter 2. Example 1.2 demonstrates the impact of software reliability on system reliability.

Example 1.2

A military distributed processing system has an MTTF (mean time to failure, see definition in Section 1.4) requirement of 100 hours and an availability requirement of 0.99. The overall architecture of the system is shown in Figure 1.2, indicating that the system consists of three subsystems, SYS1, SYS2, SYS3, a local area network, LAN, and a 10 KW power generator GEN. In order for the system to work, all the components (except SYS2) have to work. In the early phase of system testing, hardware reliability parameters are predicted according to MIL-HDBK-217, and shown for each system component. Namely, above each component block in Figure 1.2 two numbers appear. The upper number represents the predicted MTTF for that component, and the lower number represents its MTTR (mean time to repair). The units are hours. For example, SYS1 has 280 hours for MTTF and 0.53 hours for MTTR, while SYS2 and SYS3 have 387 hours for MTTF and 0.50 hours for MTTR. Note that SYS2 is configured as a triple modular redundant system, shown in the dotted-line block, where the subsystem will work as long as two or more modules work. Due to this fault-tolerant capability, its MTTF improves to 5.01x10**4 hours and MTTR becomes 0.25 hours.

	( Figure 1.2 - Not Shown Here )

Figure 1.2: An Example of Predicting System Reliability

To calculate the overall system reliability, all the components in the system have to be considered. If we assume the software does not fail (a mistake often made by system reliability engineers!), the resulting system MTTF would be 125.9 hours, and MTTR would be 0.62 hours, achieving system availability of 0.995. It looks as if the system already meets its original requirements.

But the software does fail. Both SYS2 and SYS3 software contain 300,000 lines of source code, and following the prediction model described in Chapter 3 (Section 3.8.3) and[RADC87a], the predicted initial failure rates for SYS2 software and SYS3 software are both 2.52 failures per execution hour. (Note that the three SYS2 S/W are identical software copies and not fault-tolerant.) Even without considering SYS1 software failures, the system MTTF would have become 11.9 CPU minutes! If we assume that MTTR is still 0.62 hours (although it should be higher since it generally takes longer to reinitialize software), and that CPU time and calendar time are close to each other (which is true for this distributed system), the system availability becomes 0.24, far less than was predicted earlier! []
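The arithmetic in Example 1.2 can be retraced with a short sketch. It assumes, as a simplification not spelled out in the example, that the hardware and the two software elements fail independently at constant rates and form a series system, so that their failure rates add. The function names are illustrative only; the numbers are those quoted in the example.

```python
def series_mttf(mttfs_hours):
    """MTTF of a series system whose elements fail independently
    at constant rates: the element failure rates simply add."""
    return 1.0 / sum(1.0 / m for m in mttfs_hours)


def availability(mttf, mttr):
    """Steady-state availability, MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


HW_MTTF = 125.9  # hours, hardware-only system MTTF from Example 1.2
MTTR = 0.62      # hours, system MTTR from Example 1.2
SW_RATE = 2.52   # failures per execution hour, SYS2 and SYS3 software each

# The three SYS2 software copies are identical and not fault-tolerant,
# so SYS2 software is treated as a single series element (assumption).
sw_mttf = 1.0 / SW_RATE
system_mttf = series_mttf([HW_MTTF, sw_mttf, sw_mttf])

print(round(availability(HW_MTTF, MTTR), 3))      # 0.995 (software ignored)
print(round(system_mttf * 60, 1))                 # 11.9  (minutes)
print(round(availability(system_mttf, MTTR), 2))  # 0.24
```

Under these assumptions the two software elements dominate the sum of failure rates, which is why the hardware MTTF of 125.9 hours has almost no influence on the result.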

Note that the system presented in Example 1.2 was a real-world example, and the reliability parameters were estimated following actual practice based on military handbooks[Lyu89a]. This example is not an extreme case. In fact, many existing large systems face the same situation: software reliability is the bottleneck of system reliability, and the maturity of software always lags behind that of hardware. Accurately modeling software reliability and predicting its trend have become critical, since this effort provides essential information for decision making and reliability engineering for most projects.

Reliability engineering is a technique practiced daily in many engineering disciplines. Civil engineers use it to build bridges, and computer hardware engineers use it to design chips and computers. Using a concept similar to that in these disciplines, we define Software Reliability Engineering (SRE) as the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability[IEEE95a]. SRE therefore includes:

(1) software reliability measurement, which includes estimation and prediction, with the help of software reliability models established in the literature;

(2) the attributes and metrics of product design, development process, system architecture, software operational environment, and their implications on reliability; and

(3) the application of this knowledge in specifying and guiding system software architecture, development, testing, acquisition, use, and maintenance.

Based on the above definitions, this book details current SRE techniques and practices.

1.3 Book Overview

Mature engineering fields classify and organize proven solutions in handbooks so that most engineers can consistently handle complicated but routine designs. Handbooks of software engineering practice would be just as helpful as the handbooks used in other engineering fields. For a long time we have not had such a thing for software, so mistakes are repeated on project after project, year after year. This is mostly because software development is an art. Although we understand a very large part of this art, it is still not a practiced engineering discipline. Software crises identified more than 25 years ago are still the well-known crises of today[Gibb94a].

Fortunately, the reliability component of software engineering has evolved from an art into a practical engineering discipline. It is time to begin to codify our knowledge in SRE and make it available - this is the main purpose of this handbook. This handbook provides information on the key methods and methodologies used in SRE, covering its state-of-the-art techniques and state-of-practice approaches. The book is divided into three parts and 17 chapters. Each chapter is written by SRE experts, including researchers and practitioners. These chapters cover the theory, design, methodology, modeling, evaluation, experience, and assessment of SRE techniques and applications.

Part I of the book, composed of five chapters, sets up the technical foundations for software reliability modeling techniques, in which system-level dependability and reliability concepts, software reliability prediction and estimation models, model evaluation and recalibration techniques, and operational profile techniques are presented. In particular,

(1) Chapter 1 gives an introduction to the book, where its framework is outlined and the main contents of each chapter are conveyed. Basic ideas, terminologies, and techniques in SRE are presented.

(2) Chapter 2 provides a general overview of the system dependability concept, and shows that the classical reliability theory can be extended in order to be interpreted from both hardware and software viewpoints.

(3) Chapter 3 reviews the major software reliability models that have appeared in the literature from both historical and applications perspectives. Each model is presented with its motivation, model assumptions, data requirements, model form, estimation procedure, and general comments about its usage.

(4) Chapter 4 presents a systematic framework to conduct model evaluation of several competing reliability models, using advanced statistical criteria. Recalibration techniques which can greatly improve model performance are also introduced.

(5) Chapter 5 details a technique which is essential to SRE: the operational profile. The operational profile shows you how to increase productivity and reliability and speed development by allocating project resources to functions on the basis of how a system will be used.

Part II contains SRE practices and experiences in six chapters. This part of the book consists of practical experiences from major organizations like AT&T, Jet Propulsion Laboratory, Tandem, IBM, NASA, Northern Telecom, and other international organizations. Various SRE procedures are implemented for particular requirements under different environments. The authors of each chapter in Part II describe the practical procedures which work for them, and convey to you their experiences and lessons learned. Specifically,

(1) Chapter 6 describes the best current practice in SRE adopted by over 70 projects in AT&T. This practice allows you to analyze, manage, and improve the reliability of software products, to balance customer needs in terms of cost, schedule, and quality, and to minimize the risks of releasing software with serious problems.

(2) Chapter 7 conveys the measurement experience in applying software reliability models to several large-scale projects at Jet Propulsion Laboratory (JPL). We discuss the SRE procedures, data collection efforts, modeling approaches, data analysis methods, reliability measurement results, lessons learned, and future directions. A practical scheme to improve measurement accuracy by linear combination models is also presented.

(3) Chapter 8 shows measurement-based analysis techniques which directly measure software reliability through monitoring and recording failure occurrences in a running system under various user workloads. Experiences with Tandem GUARDIAN, IBM MVS, and VAX VMS operating systems are explored.

(4) Chapter 9 proposes a defect classification scheme which extracts semantic information from software defects such that it provides a measurement of the software development process. This chapter explains the framework, procedure, and advantages of this scheme and its successful application and deployment in many projects at IBM.

(5) Chapter 10 addresses software reliability trend analysis which can help project managers control the progress of the development activities and determine the efficiency of the test programs. Application results from a number of studies including switching systems and avionic applications are reported.

(6) Chapter 11 provides insight into the process of collecting and analyzing software reliability field data through a discussion of the underlying principles and case study illustrations. Included in the field data analysis are projects from IBM, Hitachi, Northern Telecom, and Space Shuttle Flight Software.

Emerging techniques which have been used to advance the SRE research field are addressed by the six chapters in Part III. These techniques include software metrics, testing schemes, fault-tolerant software, fault-tree analysis, simulation, and neural networks. After explicitly explaining these techniques in concrete terms, authors of the chapters in Part III establish the relationships between these techniques and software reliability. Potential research topics and their directions are also addressed in detail. In summary,

(1) Chapter 12 presents the technique to incorporate software metrics for reliability assessment. This chapter makes the connection between software complexity and software reliability, in which both functional complexity and operational complexity of a program are examined for the development and maintenance of reliable software.

(2) Chapter 13 explores the relationship between software testing and reliability. In addressing the impact of testing on reliability, this chapter applies program structure metrics and code coverage data for the estimation of software reliability and the assessment of the risk associated with software.

(3) Chapter 14 focuses on the software fault tolerance approach as a potential technique to improve software reliability. Issues regarding the architecture, design, implementation, modeling, failure behavior, and cost of fault-tolerant systems are discussed.

(4) Chapter 15 introduces the fault tree technique for the reliability analysis of software systems. This technique helps you to analyze the impact of software failures on a system, to combine off-line and on-line tests to prevent or detect software failures, and to compare different design alternatives for fault tolerance with respect to both reliability and safety.

(5) Chapter 16 demonstrates how the simulation technique can be applied to a typical software reliability engineering process, in which many simplifying assumptions in reliability modeling could be lifted. This chapter shows the power, flexibility, and potential benefits that the simulation technique offers, together with methods for representing artifacts, activities, and events of the reliability process.

(6) Chapter 17 elaborates on how neural network technology can be used in software reliability engineering applications, including its use as a general reliability growth model for better predictive accuracy, and its use as a classifier to identify fault-prone software modules.

In addition to these book chapters, two appendices and an MS-DOS diskette are enclosed in the book. Appendix A surveys the currently available tools which encapsulate software reliability models and techniques. These tools include AT&T Toolkit, SMERFS, SRMP, SoRel, CASRE, and SoftRel. Appendix B reviews the analytical modeling techniques, statistical techniques, and reliability theory commonly used in SRE studies. The MS-DOS disk, called Data and Tool Disk (or Data Disk), includes two directories: the DATA directory and the TOOL directory. The DATA directory contains more than 40 published and unpublished software failure data sets used in the book chapters, and the TOOL directory contains the SMERFS, CASRE, and SoftRel software reliability tools.

Finally, at the end of each book chapter is a Problems Section which provides exercises for you to work through after reading the chapter.

1.4 Basic Definitions

Note the three major components in the definition of software reliability: failure, time, and operational environment. We now define these terms and other related SRE terminologies. We begin with the notions of a software system and its expected service.

Software Systems. A software system is an interacting set of software subsystems that is embedded in a computing environment that provides inputs to the software system and accepts service (outputs) from the software. A software subsystem itself is composed of other subsystems, and so on, to a desired level of decomposition into the smallest meaningful elements (e.g., modules or files).

Service. Expected service (or "behavior") of a software system is a time-dependent sequence of outputs that agrees with the initial specification from which the software implementation has been derived (for the verification purpose), or that agrees with what system users have perceived the correct values to be (for the validation purpose).

Now we observe the following situation: a software system named program is delivering an expected service to an environment or a person named user.

Failures. A failure occurs when the user perceives that the program ceases to deliver the expected service.

The user may choose to identify several severity levels of failures, such as catastrophic, major, and minor, depending on their impact on the system service. The definitions of these severity levels vary from system to system.

Outages. An outage is a special case of a failure, which is defined as a loss or degradation of service to a customer for a period of time (called "outage duration"). In general, outages can be caused by hardware or software failures, human errors, and environmental variables (e.g., lightning, power failures, fire). A failure resulting in the loss of functionality of the entire system is called a "system outage." An example of quantifying a system outage in the telecommunications industry is to define the outage duration of telephone switching systems to be "greater than 3 seconds (due to failures that result in loss of stable calls) or greater than 30 seconds (for failures that do not result in loss of stable calls)."[Bell90a]

Faults. A fault is uncovered when either a failure of the program occurs, or an internal error (e.g., an incorrect state) is detected within the program. The cause of the failure or the internal error is said to be a fault. It is also referred to as a "bug."

In most cases the fault can be identified and removed; in some cases it remains a hypothesis that cannot be adequately verified (e.g., timing faults in distributed systems).

In summary, a software failure is an incorrect result with respect to the specification or an unexpected software behavior perceived by the user at the boundary of the software system, while a software fault is the identified or hypothesized cause of the software failure.

Defects. When the distinction between "fault" and "failure" is not critical, defect can be used as a generic term to refer to either a fault (cause) or a failure (effect). Chapter 9 provides a complete and practical classification of software defects from various perspectives.

Errors. The term "error" has two different meanings:

(1) A discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition. Errors occur when some part of the computer software produces an undesired state. Examples include exceptional conditions raised by the activation of existing software faults, and incorrect computer status due to an unexpected external interference. This term is especially useful in fault-tolerant computing to describe an intermediate stage in between faults and failures.

(2) A human action that results in software containing a fault. Examples include omission or misinterpretation of user requirements in a software specification, and incorrect translation or omission of a requirement in the design specification. However, this is not a preferred usage, and the term "mistake" is used instead to avoid the confusion.

Time. Reliability quantities are defined with respect to time, although it is possible to define them with respect to other bases like program runs. We are concerned with three types of time: the execution time for a software system is the CPU time that is actually spent by the computer in executing the software, the calendar time is the time people normally experience in terms of years, months, weeks, days, etc., and the clock time is the elapsed time from start to end of computer execution in running the software. In measuring clock time, the periods during which the computer is shut down are not counted.

It is generally accepted that execution time is more adequate than calendar time for software reliability measurement and modeling. However, reliability quantities must ultimately be related back to calendar time for easy human interpretation, particularly when managers, engineers, and customers want to compare them across different systems. As a result, translations between calendar time and execution time are required. The technique for such translations is described in [Musa87a]. If execution time is not readily available, approximations such as clock time, weighted clock time, staff working time, or units that are natural to the application, such as transactions or test cases executed, may be used.

Failure Functions. When a time basis is determined, failures can be expressed in several ways: the cumulative failure function, the failure intensity function, the failure rate function, and the mean time to failure function. The cumulative failure function (also called the mean value function) denotes the average cumulative failures associated with each point of time. The failure intensity function represents the rate of change of the cumulative failure function. The failure rate function (also called the hazard rate, or the rate of occurrence of failures) is defined as the instantaneous failure rate at a time t, given that the system has not failed up to t. The mean time to failure (MTTF) function represents the expected time until the next failure is observed. (MTTF is also known as MTBF, mean time between failures.) Note that these measures are closely related and can be translated into one another. Appendix B provides the mathematics of these functions in detail.
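As a small illustration of how such measures relate, the sketch below assumes the simplest possible case, a constant failure rate (exponentially distributed time to failure). This is an assumption made here for illustration, not a model prescribed by the text. Under it, the reliability is R(t) = exp(-rate * t), the hazard rate equals the constant rate, and the MTTF is the integral of R(t), which is 1/rate. The function names and the rate of 0.01 failures per hour are illustrative.

```python
import math


def reliability(t, rate):
    """Probability of failure-free operation up to time t,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-rate * t)


def hazard(t, rate, dt=1e-6):
    """Numerical hazard rate: -(dR/dt) / R(t), i.e. the instantaneous
    failure rate given that no failure occurred up to t."""
    r0 = reliability(t, rate)
    r1 = reliability(t + dt, rate)
    return (r0 - r1) / (dt * r0)


def mttf(rate, horizon_factor=50, steps=100_000):
    """MTTF as the trapezoidal integral of R(t) from 0 to a horizon
    chosen long enough that R(t) is negligible beyond it."""
    horizon = horizon_factor / rate
    h = horizon / steps
    total = 0.5 * (reliability(0.0, rate) + reliability(horizon, rate))
    total += sum(reliability(i * h, rate) for i in range(1, steps))
    return total * h


RATE = 0.01  # failures per hour (illustrative value)
print(reliability(10.0, RATE))  # chance of surviving a 10-hour mission
print(hazard(5.0, RATE))        # close to RATE at any t
print(mttf(RATE))               # close to 1 / RATE = 100 hours
```

For models in which the failure rate changes over time, the same numerical definitions of hazard and MTTF apply, but the closed-form shortcuts do not.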

Mean Time To Repair and Availability. Another quantity related to time is mean time to repair (MTTR), which represents the expected time until a system will be repaired after a failure is observed. When the MTTF and MTTR for a system are measured, its availability can be obtained. Availability is the probability that a system is available when needed. Typically, it is measured by

                                    MTTF
                Availability  =  -----------
                                 MTTF + MTTR

Chapter 2 (Section 2.4.4) gives a theoretical model for availability, while Chapter 11 (Section 11.8) provides some practical examples of this measure.

Operational Profile. The operational profile of a system is defined as the set of operations that the software can execute along with the probability with which they will occur. An operation is a group of runs which typically involve similar processing. A sample operational profile is illustrated in Figure 1.3. Note that, without loss of generality, the operations can be located on the x-axis in order of the probabilities of their occurrence.

Chapter 5 provides a detailed description of the structure, development, illustration, and project application of the operational profile. In general, the number of possible software operations is quite large. When it is not practical to determine all the operations and their probabilities in complete detail, operations based on grouping or partitioning of input states (or system states) into domains are determined. In situations where an operational profile is not available or only an approximation can be obtained, you may make use of code coverage data generated during reliability growth testing to obtain reliability estimates. Chapter 13 describes some methods for doing so.

               ( Figure 1.3 - Not Shown Here )

Figure 1.3: Operational Profile
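A minimal sketch of how an operational profile might be represented and used follows. The operation names and probabilities are hypothetical (they are not taken from Figure 1.3 or from any project in this book), and the helper names are illustrative. The sketch checks that the probabilities form a valid distribution, ranks the operations by probability of occurrence as on the x-axis described above, and draws operations for testing in proportion to expected field usage.

```python
import random

# Hypothetical operations with assumed occurrence probabilities.
PROFILE = {
    "process_call": 0.70,
    "update_record": 0.20,
    "run_audit": 0.08,
    "recover": 0.02,
}


def validate(profile, tol=1e-9):
    """Raise ValueError unless the profile is a probability distribution."""
    if any(p < 0 for p in profile.values()):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(profile.values()) - 1.0) > tol:
        raise ValueError("probabilities must sum to 1")


def ranked(profile):
    """Operations ordered from most to least probable."""
    return sorted(profile.items(), key=lambda item: item[1], reverse=True)


def select_operations(profile, n, seed=0):
    """Draw n operations at random, weighted by the profile."""
    rng = random.Random(seed)
    return rng.choices(list(profile), weights=list(profile.values()), k=n)


validate(PROFILE)
for name, prob in ranked(PROFILE):
    print(f"{name:15s} {prob:.2f}")
print(select_operations(PROFILE, 10))
```

Selecting test runs this way concentrates effort on the operations users exercise most, which is the resource-allocation idea attributed to the operational profile in Chapter 5.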

Failure Data Collection. Two types of failure data, namely, failure-count data and time-between-failures data, can be collected for the purpose of software reliability measurement.

Failure-Count (or Failures Per Time Period) Data. This type of data tracks the number of failures detected per unit of time. Typical failure-count data is shown in Table 1.1.

                     Table 1.1:  Failure-count Data

                  Time (hours)   Failures in   Cumulative
                                 the Period     Failures

                       8             4             4
                      16             4             8
                      24             3            11
                      32             5            16
                      40             3            19
                      48             2            21
                      56             1            22
                      64             1            23
                      72             1            24


Time-Between-Failures (or Inter-Failure Times) Data. This type of data tracks the intervals between consecutive failures. Typical time-between-failures data can be seen in Table 1.2.

             Table 1.2:  Time-between-failures Data

           Failure       Failure           Failure
           Number    Interval (hours)   Times (hours)

              1             0.5              0.5
              2             1.2              1.7
              3             2.8              4.5
              4             2.7              7.2
              5             2.8             10.0
              6             3.0             13.0
              7             1.8             14.8
              8             0.9             15.7
              9             1.4             17.1
             10             3.5             20.6
             11             3.4             24.0
             12             1.2             25.2
             13             0.9             26.1
             14             1.7             27.8
             15             1.4             29.2
             16             2.7             31.9
             17             3.2             35.1
             18             2.5             37.6
             19             2.0             39.6
             20             4.5             44.1
             21             3.5             47.6
             22             5.2             52.8
             23             7.2             60.0
             24            10.7             70.7


Many reliability modeling programs have the capability to estimate model parameters from either failure-count or time-between-failures data, as statistical modeling techniques can be applied to both. However, if a program accommodates only one type of data, it may be required to transform the other type.

Transformations Between Data Types. If the expected input is failure-count data, it may be obtained by transforming time-between-failures data to cumulative failure times and then simply counting the number of failures whose cumulative times occur within a specified time period. If the expected input is time-between-failures data, converting the failure-count data can be achieved by either randomly or uniformly allocating the failures within the specified time intervals, and then calculating the time periods between adjacent failures. Some software reliability tools surveyed in Appendix A (e.g., "SMERFS" and "CASRE") incorporate the capability to do these data transformations.
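
Both transformations can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the procedure used by SMERFS or CASRE: the function names are hypothetical, a failure falling exactly on a period boundary is counted in the period that ends there, and the "uniform" allocation places the k failures of a period at the midpoints of k equal sub-intervals (one of several reasonable conventions). The failure times are the cumulative values from Table 1.2.

```python
import math

# Cumulative failure times (hours) copied from Table 1.2.
FAILURE_TIMES = [0.5, 1.7, 4.5, 7.2, 10.0, 13.0, 14.8, 15.7, 17.1, 20.6,
                 24.0, 25.2, 26.1, 27.8, 29.2, 31.9, 35.1, 37.6, 39.6,
                 44.1, 47.6, 52.8, 60.0, 70.7]


def times_to_counts(failure_times, period):
    """Count failures per period of the given length.

    A failure at exactly a period boundary belongs to the period ending there.
    """
    n_periods = math.ceil(max(failure_times) / period)
    counts = [0] * n_periods
    for t in failure_times:
        index = max(math.ceil(t / period) - 1, 0)
        counts[index] += 1
    return counts


def counts_to_intervals(counts, period):
    """Spread each period's failures evenly, then difference adjacent times."""
    times = []
    for i, k in enumerate(counts):
        for j in range(k):
            # Midpoint of the j-th of k equal sub-intervals in period i.
            times.append(i * period + (j + 0.5) * period / k)
    return [later - earlier for earlier, later in zip([0.0] + times, times)]


counts = times_to_counts(FAILURE_TIMES, period=8.0)
print(counts)  # [4, 4, 3, 5, 3, 2, 1, 1, 1]

intervals = counts_to_intervals(counts, period=8.0)
print([round(x, 2) for x in intervals])
```

The first transformation loses information (the exact position of each failure inside its period), so the second cannot recover the original intervals; it only produces a sequence consistent with the counts.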

Software reliability measurement includes two types of activities, reliability estimation and reliability prediction:

Estimation. This activity determines current software reliability by applying statistical inference techniques to failure data obtained during system test or during system operation. This is a measure regarding the achieved reliability from the past until the current point. Its main purpose is to assess the current reliability, and determine whether a reliability model is a good fit in retrospect.

Prediction. This activity determines future software reliability based upon available software metrics and measures. Depending on the software development stage, prediction involves different techniques:

(1) When failure data are available (e.g., software is in system test or operation stage), the estimation techniques can be used to parameterize and verify software reliability models, which can perform future reliability prediction.

(2) When failure data are not available (e.g., software is in the design or coding stage), the metrics obtained from the software development process and the characteristics of the resulting product can be used to determine reliability of the software upon testing or delivery.

The first definition is also referred to as "reliability prediction," and the second definition as "early prediction." When there is no ambiguity in the text, only the word "prediction" will be used.

Most current software reliability models fall in the estimation category and use estimation to do reliability prediction. Nevertheless, a few early prediction models have been proposed and described in the literature. A survey of existing estimation models and some early prediction models can be found in Chapter 3. Chapter 12 provides some product complexity metrics which can be used for early prediction purposes.

Software Reliability Models. A software reliability model specifies the general form of the dependence of the failure process on the principal factors that affect it: fault introduction, fault removal, and the operational environment. Figure 1.4 shows the basic ideas of software reliability modeling.

	( Figure 1.4 -- Not Shown Here )

Figure 1.4: Basic Ideas on Software Reliability Modeling

In Figure 1.4, the failure rate of a software system is generally decreasing due to the discovery and removal of software faults. At any particular time (say, the point marked "present time"), it is possible to observe a history of the failure rate of the software. Software reliability modeling forecasts the curve of the failure rate from statistical evidence. The purpose of this measure is two-fold: 1) to predict the extra time needed to test the software to achieve a specified objective; 2) to predict the expected reliability of the software when the testing is finished.
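
To make the two purposes concrete, the sketch below assumes, purely for illustration, that the failure rate decays exponentially with test time, rate(t) = rate0 * exp(-decay * t), and that the failure rate stays constant once testing stops. Neither assumption comes from Figure 1.4; actual models are surveyed in Chapter 3. The function names and parameter values are hypothetical.

```python
import math


def failure_rate(t, rate0, decay):
    """Assumed failure rate after t hours of testing (exponential decay)."""
    return rate0 * math.exp(-decay * t)


def extra_test_time(present_rate, target_rate, decay):
    """Additional test time needed to go from the present rate to the target."""
    if target_rate >= present_rate:
        return 0.0  # objective already met
    return math.log(present_rate / target_rate) / decay


def reliability(rate, duration):
    """Probability of failure-free operation for `duration` at a constant rate."""
    return math.exp(-rate * duration)


# Hypothetical values: 0.5 failures/hour now, objective 0.05 failures/hour,
# decay parameter 0.02 per test hour (as if estimated from past failure data).
present, target, decay = 0.5, 0.05, 0.02
t_extra = extra_test_time(present, target, decay)
print(f"extra test time: {t_extra:.1f} hours")
print(f"reliability over 10 hours at release: {reliability(target, 10.0):.3f}")
```

In practice the decay parameter is not known in advance; it is estimated from the observed failure history, which is why the quality of the collected failure data matters so much.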

Software reliability is similar to hardware reliability in that both are stochastic processes and can be described by probability distributions. However, software reliability is different from hardware reliability in the sense that software does not wear out, burn out, or deteriorate, i.e., its reliability does not decrease with time. Moreover, software generally enjoys reliability growth during testing and operation since software faults can be detected and removed when software failures occur. On the other hand, software may experience reliability decrease due to abrupt changes of its operational usage or incorrect modifications to the software. Software is also continuously modified throughout its life-cycle. The malleability of software makes it inevitable for us to consider variable failure rates.

Unlike hardware faults, which are mostly physical faults, software faults are design faults, which are harder to visualize, classify, detect, and correct. As a result, software reliability is a much more difficult measure to obtain and analyze than hardware reliability. Usually hardware reliability theory relies on the analysis of stationary processes, because only physical faults are considered. However, with the increase of system complexity and the introduction of design faults in software, reliability theory based on stationary processes becomes unsuitable for addressing non-stationary phenomena such as the reliability growth or reliability decrease experienced in software. This makes software reliability a challenging problem which requires several methods of attack.

1.5 Technical Areas Related to the Book

Achieving highly reliable software from the customer's perspective is a demanding job for all software engineers and reliability engineers. Adopting a notation similar to that of[Lapr85a,Avi86a] for system dependability, four technical methods are applicable for you to achieve reliable software systems:

(1) fault avoidance: to prevent, by construction, fault occurrences;

(2) fault removal: to detect, by verification and validation, the existence of faults and eliminate them;

(3) fault tolerance: to provide, by redundancy, service complying with the specification in spite of faults having occurred or occurring;

(4) fault/failure forecasting: to estimate, by evaluation, the presence of faults and the occurrence and consequences of failures. This has been the main focus of software reliability modeling.

The detailed discussions on these technical areas are provided in the following sections. You can also refer to Chapter 2 (Section 2.2) for a complete list of dependability and reliability related concepts.

1.5.1 Fault Avoidance

The interactive refinement of the user's system requirement, the engineering of the software specification process, the use of good software design methods, the enforcement of structured programming discipline, and the encouragement of writing clear code are the general approaches to avoid faults in the software. These guidelines have been, and will continue to be, the fundamental techniques in preventing software faults from being created.

Recently, formal methods have been attempted in the research community in attacking the software quality problem. In formal methods approaches, requirement specifications are developed and maintained using mathematically tractable languages and tools. Current studies in this area have been focused on language issues and environmental supports, which include at least the following goals: (1) executable specifications for systematic and precise evaluation, (2) proof mechanisms for software verification and validation, (3) development procedures which follow incremental refinement for step-by-step verification, and (4) mathematical verification of every work item, be it a specification or a test case, for its correctness and appropriateness.

Another fault avoidance technique, particularly popular in the software development community, is software reuse. The crucial measure of success in this area is the capability to prototype and evaluate reusable synthesis techniques. This is why object-oriented paradigms and techniques are receiving much attention nowadays - largely due to their inherent properties in enforcing software reuse.

1.5.2 Fault Removal

When formal methods are in full swing, formal design proofs might be available to achieve mathematical proof-of-correctness for programs. Also fault-monitoring assertions could be employed through executable specifications, and test cases could be automatically generated to achieve efficient software verification. However, before this happens, practitioners will have to rely mostly on software testing techniques to remove existing faults. Microsoft, for example, allocates as many software testers as software developers, and employs a "buddy" system which binds the developer of every software component with its tester for their daily work[Cusu95a]. The key question to reliability engineers, then, is how to derive testing quality measures (e.g., test coverage factors) and establish their relationships to reliability.

Another practical fault removal scheme which has been widely implemented in industry is formal inspection[Faga76a]. A formal inspection is a rigorous process focused on finding faults, correcting faults, and verifying the corrections. Formal inspection is carried out by a small group of peers with a vested interest in the work product during pretest phases of the life cycle. Many companies have claimed its success[Grad92a].

1.5.3 Fault Tolerance

Fault tolerance is the survival attribute of computing systems or software in their ability to deliver continuous service to their users in the presence of faults[Avi78a]. Software fault tolerance is concerned with all the techniques necessary to enable a system to tolerate software faults remaining in the system after its development. These software faults may or may not manifest themselves during system operations, but when they do, software fault tolerance techniques should provide the necessary mechanisms to the software system to prevent system failure from occurring.

In a single-version software environment, the techniques for partially tolerating software design faults include monitoring techniques, atomicity of actions, decision verification, and exception handling. In order to fully recover from activated design faults, multiple versions of software developed via design diversity[Avi86a] are introduced, in which functionally equivalent yet independently developed software versions are applied in the system to provide ultimate tolerance to software design faults. The main approaches include the recovery blocks technique[Rand75a], the N-version programming technique[Avi77a], and the N self-checking programming technique[Lapr87a]. These approaches have found a wide range of applications in the aerospace industry, the nuclear power industry, the health care industry, the telecommunications industry, and the ground transportation industry.
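
The core control flow of two of these approaches can be sketched briefly. The Python code below is a simplified illustration, not the formulation given in the cited papers: the function names are hypothetical, the N-version voter uses a plain majority over exact result equality, and the recovery block omits the state checkpointing and restoration that a real implementation performs before trying each alternate.

```python
from collections import Counter


def n_version_vote(versions, x):
    """Run every version on x and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a version that crashes casts no vote
    if not results:
        raise RuntimeError("all versions failed")
    value, votes = Counter(results).most_common(1)[0]
    if 2 * votes <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


def recovery_block(alternates, acceptance_test, x):
    """Try alternates in order; return the first result that is accepted."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue
        if acceptance_test(x, result):
            return result
    raise RuntimeError("no alternate passed the acceptance test")


# Three hypothetical "independently developed" versions of squaring;
# the third contains a design fault for inputs above 10.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x * x + 1 if x > 10 else x * x


print(n_version_vote([version_a, version_b, version_c], 12))
print(recovery_block([version_c, version_a], lambda x, r: r == x * x, 12))
```

The voter masks the faulty version only as long as a majority of versions agree on the correct result, which is why independence of the development efforts is central to design diversity.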

1.5.4 Fault/Failure Forecasting

Fault/failure forecasting involves formulation of the fault/failure relationship, an understanding of the operational environment, the establishment of reliability models, the collection of failure data, the application of reliability models by tools, the selection of appropriate models, the analysis and interpretation of results, and the guidance for management decisions. The concepts and techniques laid out in[Musa87a] have provided an excellent foundation for this area. Other reference texts include[Xie91a,Neuf93a]. In addition, the July 1992 issue of IEEE Software, the November 1993 issue of IEEE Transactions on Software Engineering, and the December 1994 issue of IEEE Transactions on Reliability are all devoted to this aspect of SRE. This handbook provides a comprehensive treatment of this subject.

1.5.5 Scope of this Handbook

Due to the intrinsic complexity of modern software systems, software reliability engineers have to apply a combination of the above methods for the delivery of reliable software systems. These four areas are also the main themes of the state of the art in software engineering, covering a wide range of disciplines. In addition to focusing on the fault/failure forecasting area, this book attempts to address the other three technical areas as well. However, instead of incorporating all possible techniques available in software engineering, this book examines and emphasizes mature as well as emerging techniques that could be quantitatively related to software reliability.

As a general guideline, most chapters of the book are concerned with fault/failure forecasting, in which Chapters 1 to 5 provide technical foundations, while Chapters 6, 7, 10, and 11 present project practices and experiences, and Chapters 16 and 17 describe two emerging techniques. In addition, Chapters 9 and 12 are related to fault avoidance, and Chapters 8 and 13 address fault removal techniques. Note that fault avoidance and removal techniques are the subject of discussion in many software engineering books. Finally, Chapters 14 and 15 cover fault tolerance techniques and the associated modeling work. For a detailed treatment of software fault tolerance, interested readers are referred to the book[Lyu95a].

The scope of the handbook is summarized in Table 1.4, which provides a guideline in using this book according to various subjects of interest, including the four technical areas we have discussed, and some special topics which you may want to study in depth. For example, if you are interested in the topic of software reliability modeling theory (Topic 1), readings of Chapters 1, 2, 3, 4, 9, 10, 12, 14, and 16 are recommended. Note that Topics 1 and 2 in Table 1.4 are mutually exclusive. So are Topics 3 and 4, and Topics 5 and 6. Be aware that the classification of the book chapters into various topics in Table 1.4 is only for your reading convenience. This classification could be rough and subjective.


      Technical Foundations   Practices and Experiences       Emerging Techniques


      Chapter   1    2    3    4    5   6   7    8   9   10    11   12   13   14   15   16   17


      Area 1                                         X              X


      Area 2                                     X                       X


      Area 3                                                                  X    X


      Area 4    X    X    X    X    X   X   X             X    X                        X    X


      Topic 1   X    X    X    X                     X    X         X         X         X


      Topic 2                       X   X   X    X             X         X         X         X


      Topic 3   X    X    X         X   X            X              X    X    X         X


      Topic 4                  X            X    X        X    X                   X         X


      Topic 5   X    X              X   X            X                   X    X    X    X


      Topic 6             X    X            X    X        X    X    X                        X


      Topic 7                  X    X   X   X    X        X    X    X              X    X    X


      Topic 8        X         X                 X   X    X    X         X    X    X         X



          Area 1 - Fault Avoidance

          Area 2 - Fault Removal

          Area 3 - Fault Tolerance

          Area 4 - Fault/Failure Forecasting

          Topic 1 - Modeling Theory

          Topic 2 - Modeling Experience

          Topic 3 - Metrics

          Topic 4 - Measurement

          Topic 5 - Process Issues

          Topic 6 - Product Issues

          Topic 7 - Reliability Data

          Topic 8 - Analysis Techniques

1.6 Summary

The growing trend of software criticality and the unbearable consequences of software failures force us to plead urgently for software reliability engineering. This book codifies our knowledge in SRE and puts together a comprehensive and organized repository for our daily practice in software reliability. The structure of the book and key contents of each chapter are described. The definitions of major terms in SRE are provided, and fundamental concepts in software reliability modeling and measurement are discussed. Finally, the related technical areas in software engineering and a reading guide are provided for your convenience.


(1) Some hardware faults are not physical faults and have similar nature to software faults. What are they?

(2) What are the main differences between software failures and hardware failures?

(3) Give several examples of software faults and software failures.

(4) Some people argue that the modeling technique for software reliability is similar to that for hardware reliability, while other people disagree. List the commonalities and differences between them.

(5) Give a couple of examples of definitions of failure severity levels, one qualitative and one quantitative.

(6) What is the mapping relationship between faults and failures? Is it one-to-one mapping (one fault leading to one failure), one-to-many, many-to-one, or many-to-many? Discuss the mapping relationship in different conditions. What is the preferred mapping relationship? Why? How to achieve it?

(7) The term "ultra-reliability" has been used to denote highly reliable systems. This could be expressed, for example, as R(10 hour) = 0.9999999. That is, the probability that a system will fail in a 10-hour operation is 10^-7. Some people proposed to make it a reliability requirement for software. Discuss the implication of this kind of reliability requirement and its practicality.

(8) What are the difficulties and issues involved in the data collection of failure-count data and time-between-failures data?

(9) Regarding the failure data collection process, consider the following situations:

(a) How do you adjust the failure times for an evolving program, i.e., a software program which changes over time through various releases?

(b) How do you handle multiple sites or versions of the software?

(10) Show that the time-between-failures data in Table 1.2 can be transformed to failure-count data in Table 1.1. Assuming random distribution, transform the failure-count data in Table 1.1 to time-between-failures data. Compare your results with Table 1.2.

(11) For the data in Table 1.1 and Table 1.2:

(a) Calculate failure intensity at the end of each time period (for Table 1.1) or failure interval (for Table 1.2).

(b) Plot the failure intensity quantities along with the time axis.

(c) Try to fit a curve on the plots manually.

(d) What are your estimates on (i) the failure rate of the next time period after observing the data in Table 1.1, and (ii) the time to next failure after observing the data in Table 1.2?

(e) What should be the relationship between the two estimates you obtained in (d)? Verify it.

(12) Compare the MTTR measure for hardware and software and discuss the difference.

(13) Refer to Example 1.2 and Figure 1.2:

(a) What is the failure rate of each component in Figure 1.2? What is the reliability function of each component?

(b) What assumption is made to calculate the MTTF for SYS2 in the triple module redundant configuration? If the reliability function for SYS2 is R_2(t), what is the reliability function for SYS2 in the triple module redundant configuration? How is its MTTF calculated? How is its MTTR calculated?

(c) How is the overall system MTTF calculated? Verify that it is 125.9 hours when software failures are not considered, and that it is 11.9 minutes when software failures are considered.

(d) How is the system MTTR calculated? Verify that it is 0.62 hours.

(e) Does the triplication of SYS2 software help in improving its software MTTF? Why? If not, what techniques could be employed to improve the software MTTF?

(14) (a) What is the difference between reliability estimation and reliability prediction? Draw a figure to show their difference.

(b) What is the difference between reliability prediction and early prediction? Summarize their differences in a comparison table.

(15) Section 1.3.2 lists several criteria to evaluate software reliability models. Can you think of a good way to quantify each criterion?