A Generalized Software Reliability Process Simulation Technique and Tool

Robert C. Tausworthe
Jet Propulsion Laboratory
4800 Oak Grove Drive
Pasadena, CA 91109
(818) 306-6284

Michael R. Lyu
AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974
(908) 582-5366

Key Words: Software Reliability, Simulation, Software Process, Tool

ABSTRACT

This paper describes the structure and rationale of the generalized software reliability process and a set of simulation techniques that may be applied for the purpose of software reliability modeling. These techniques establish a convenient means for studying a realistic, end-to-end software life cycle that includes intricate subprocess interdependencies, multiple defect categories, many factors of influence, and schedule and resource dependencies, subject to only a few fundamental assumptions, such as the independence of causes of failures. The goals of this research are twofold: first, to generate data that truly satisfy the simplifying assumptions of various existing models, for the purpose of studying their comparative merits; and second, to enable these models to extend their merits to a less idealized, more realistic reliability life cycle. The simulation technique has been applied to data from a spacecraft project at the Jet Propulsion Laboratory; the results indicate that it may lead to more accurate tracking and more timely prediction of software reliability than is obtainable from analytic modeling techniques.

1. INTRODUCTION

Software reliability has been the subject of wide study over the past 20 years. At least 40 different models have been published in the literature so far [1]. The primary focus of these studies has been on proposing, analyzing, and evaluating the performance of models that assess current reliability and forecast future operability from observable failure data, using statistical inference techniques. However, none of these models extends over the entire reliability process; most tend to focus only on failure observance during testing or operations. Moreover, none of these reliability models has emerged as "the best" predictor in all cases [2]. This may be due to a number of factors, such as oversimplification of the failure process, the quality of observed data, the lack of sufficient data to make sound inferences, and serious differences between the proposed model and the true underlying reliability process(es). It is conceivable that the basic nature of the failure process(es) may differ among individual software developments.

This paper proposes a general simulation technique that relaxes or removes many of the usual reliability modeling assumptions and expands the reliability process to encompass the entire software life cycle. Typical assumptions in reliability modeling are:

(1) Testing (or operations) randomly encounters failures.
(2) Failures in non-overlapping time intervals are independent.
(3) The test space "covers" the use space (i.e., the operational profile).
(4) All failures are observed when they occur.
(5) Faults are immediately removed upon failure, or not counted again.
(6) Execution time is the relevant independent variable.

Many of these assumptions are controversial or require further qualification; there is therefore an advantage in dismissing them.
In particular, the second assumption above can be weakened to

(2) Faults produce independent failures.

and the final four assumptions are not needed at all by the technique presented here. The degree of commonality between the test space and the use space is rarely known, but can be modeled if needed. Simulation can mimic the failure to observe an error when it has, in fact, occurred, and can additionally mimic any system outage due to an observed failure [3]. Furthermore, simulation can easily distinguish between faults that have been removed and those that have not, so multiple failures from the same unremoved fault are readily reproduced. Finally, while execution time is pertinent to some activities in the software life cycle, it is not appropriate to all (e.g., inspections); simulation can translate all model-pertinent times to wall-clock (or calendar) time through appropriate use of work load, computer utilization, and resource schedules.

This composite process is embodied in a Monte Carlo simulation tool, SoftRel [4], available through NASA's Computer Software Management Information Center and on a diskette accompanying [5].

But of what interest is a reliability process simulation tool to the software practitioner? One powerful way of understanding a pattern in nature is to recreate it in a simulation or other representative model. Because reliability is one of the first-cited indicators of quality, a tool that can reproduce the characteristics of the process that builds reliability offers a potential for optimization via tradeoffs involving scheduling, allocation of resources, and the use of alternative technologies and methodologies. The parameters that characterize that process become metrics to be managed as means to achieve prescribed levels of quality.

A simulation tool may vary from simple to complex, depending on the scope of the process being modeled and the fidelity required by the user. Most of the analytic models referred to above would require only a few inputs; the more general tool reported here, however, can use up to about 70. The former models would report only a few facts about the unfolding process, whereas the latter can report up to about 90. Of course, the more complex tool can simulate the simple models as well.

The 70 input parameters are spread over the 14 activities (described later) that comprise the reliability process. Each subprocess therefore uses only about 5 parameters on average, some of which quantify interrelationships among activities. Each of the activity submodels is thus fairly straightforward.

It is not necessary to simulate the entire process all at once. If data are available only on the failure and repair portions of the process, then only the parameters that characterize these activities need be input to the simulator. And while giving values to all of the development environment parameters and producing a project resource allocation schedule may seem daunting if never done before, these tasks should become progressively easier as histories and experience identify the stable and the volatile elements of the projects being undertaken. It can be argued that the elements that characterize and control the process need to be estimated anyway, whether used by the simulator or not, in order to understand and manage the reliability process effectively. Parameters such as the expected fault density and the average defect repair cost are familiar values extracted from prior project histories.
The remainder of this paper is organized as follows: Section 2 provides the basis for simulating the software reliability process; Section 3 briefly describes the structure and interactions of the reliability simulation package SoftRel; Section 4 presents a case study in which the implemented simulation technology was applied to a real-world project. Conclusions and future directions are presented in Section 5.

2. SIMULATION BUILDING BLOCKS

2.1 Event Simulation

The very definition of conditional event-rate processes suggests the rather simple computer simulation illustrated in the following C language segment [6]:

    /* t and dt are assumed set prior to this point */
    events = 0;
    T = 0.;
    while (T < t) {
        T += dt;
        if (chance(beta(events, T) * dt))
            events++;
    }
    /* the event has occurred `events' times at this point */

The $dt$ in such simulations must always be chosen so that the variation in the event rate $\beta(t)$ (see the Theory Sidebar) over each incremental interval $(t, t + dt)$ is negligible, and so that $\beta(t)\,dt < 1$ (so that the instantaneous event probability never reaches unity). In the code segment above, chance(x) compares a $[0, 1)$-uniform random() value with x, thus attaining the specified instantaneous probability function. The form of beta(events, T) acknowledges that the event rate function may change over time and may be sensitive to the number of event occurrences up to the current time.

The computational complexity of this algorithm is $O(\beta T / dt)$, in constant space, where the $\beta$ factor represents the maximum time required to compute $\beta_n(t)$. This complexity presents no problem for computers of even moderate speed today.

The above illustration of simulation is simple, yet very powerful. For example, some published analytic models treat (or approximate) the overall reliability growth as an NHPP in execution time, while others focus on Markoff execution-time interval statistics. Many of these differ only in the forms of their rate functions [1]. Some examples are:

1. The Jelinski-Moranda model [7] deals with adjacent time-interval subprocesses in which $\beta_n(t) = \phi\,(n_0 - n)$, where $n_0$ is the (unknown) number of initial software faults and $\phi$ is the per-fault failure rate.

2. The Goel-Okumoto model [8] deals with the overall reliability growth process, in which $\beta(t) = n_0\,\phi\,e^{-\phi t}$, where $n_0$ and $\phi$ are constant parameters. It can be shown that this model produces results very much like the Jelinski-Moranda model with $n = n_0\,(1 - e^{-\phi t})$.

3. The Musa-Okumoto model [9] describes the overall reliability growth process, in which $\beta(t) = \beta_0 / (1 + \theta t)$, where $\beta_0$ is the initial failure rate and $\theta$ is a rate decay factor. Both $\beta_0$ and $\theta$ are constant parameters.

4. The Duane model [10] is an overall reliability growth model with $\beta(t) = k\,b\,t^{\,b-1}$, where $k$ and $b$ are constant parameters.

5. The Littlewood-Verrall inverse linear model [11] is an overall reliability growth model with $\beta(t) \approx \phi / \sqrt{1 + kt}$, where $\phi$ and $k$ are constant parameters.
6. The Yamada delayed S-shape model [12] is yet another overall reliability growth model, with $\beta(t) = \phi\,\gamma\,t\,e^{\,1 - \gamma t}$, where $\phi$ (the maximum failure rate) and $\gamma$ are constant parameters.

Simulating the reliability process underlying these models is straightforward. Figure 1 illustrates the results obtained from simulating the above six models. In each of the six simulation diagrams, the rate function ($\beta$) and its associated parameters are listed. The parameters are set so that roughly 25 software faults initially remain in the system. The value of 25 was chosen to emphasize the variability of the failure processes at low fault counts; at higher fault concentrations, the decreasing $\sigma / \bar{n}$ produces smaller relative deviations.

[Insert Figure 1 about here.]
Figure 1: Simulation Results Based on Six Software Reliability Models

A number of simulations were conducted for each model to simulate the occurrences of failures versus testing time. Each diagram shows the mean of the simulation results (the line marked "m"), together with bands one standard deviation above and below the mean (the lines marked "m+s" and "m-s"). The standard deviation itself is also plotted along the time line (the line marked "s" at the bottom). These simulations neither validate nor invalidate whether a particular model fits an actual project's data. They merely illustrate the ease with which the characteristics of such processes can be comparatively analyzed.

2.2 Poisson Process Simulation

The NHPP is also easily simulated when the hazard function $\lambda(t_a, t_b)$ is known in closed form. The program for counting the overall number of NHPP events that will occur over a given time interval is merely

    #define produce(x) random_poisson(x)

    events = produce(lambda(ta, tb));

where random_poisson(x) is a subprogram that produces a Poisson-distributed random value when passed the parameter x. An algorithm for generating Poisson random numbers may be found in [13].

The time profile of an NHPP may be simulated by slicing the $(0, t)$ interval into $dt$ time slots, recording the behavior in each slot, and progressively accumulating the details to obtain the overall event-count profile, as in the following algorithm:

    t = 0.;
    while (t < t_max) {
        n = produce(lambda(t, t + dt));    /* n is the fine structure */
        events += n;
        t += dt;
    }

The form of the cumulative rate function lambda(t, t + dt) may be extended to include a dependence on events, thereby causing the algorithm above to approximate a non-homogeneous Markoff event-count process with increasing fidelity as dt becomes sufficiently small that multiple events per dt interval become rare. As mentioned above, however, the behavior of such simulations may be indistinguishable, even at larger dt, on the basis of single realizations of the event process. This hybrid form can speed up the simulation by removing the necessity of slicing time into extremely small intervals. The modified form of the simulation algorithm is called the piecewise-Poisson approximation of the Markoff event-count process.
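To make the fragments above concrete, the following self-contained sketch shows one way the produce() step might be realized. It is an illustration only, not SoftRel code: random_poisson() uses the familiar multiplication method (adequate for the small per-slot means that arise when dt is small), and the cumulative rate lambda() is taken, purely as an example, from the Goel-Okumoto model of Section 2.1, with illustrative parameter values.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define produce(x) random_poisson(x)

    /* Uniform variate in [0, 1). */
    static double uniform01(void)
    {
        return rand() / (RAND_MAX + 1.0);
    }

    /* Poisson variate by the multiplication method; adequate for small means. */
    static int random_poisson(double mean)
    {
        double limit = exp(-mean), product = uniform01();
        int n = 0;

        while (product > limit) {
            product *= uniform01();
            n++;
        }
        return n;
    }

    /* Example cumulative rate: the Goel-Okumoto model of Section 2.1, for which
       lambda(ta, tb) = n0 * (exp(-phi*ta) - exp(-phi*tb)).  Values illustrative. */
    static double lambda(double ta, double tb)
    {
        const double n0 = 25.0, phi = 0.05;
        return n0 * (exp(-phi * ta) - exp(-phi * tb));
    }

    int main(void)
    {
        double t = 0., dt = 1., t_max = 100.;
        int events = 0;

        while (t < t_max) {
            events += produce(lambda(t, t + dt));   /* per-slot fine structure */
            t += dt;
        }
        printf("simulated events in (0, %g): %d\n", t_max, events);
        return 0;
    }

Repeating such a run many times and averaging the per-slot counts produces the kind of mean and standard-deviation bands plotted in Figure 1.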
2.3 Multiple Categories of Events

If the set of events {$\epsilon_i : i = 1, \ldots, n$} that were classed together above is now partitioned into subsets according to some given differentiation criteria (for example, faults distinguished as being critical, major, or minor), then the partitioning of events into categories likewise partitions their rate functions into corresponding categories, to which integer indices may be assigned. When an event occurs, the algorithm in the Theory Sidebar produces the index of a rate function. Finding this index among the categorized subsets of integers relates the event to the distinguished category of occurrences. The behavior of multiple categories of events is thus easily simulated by changing from a single event counter, events, to an array of event counters, events[ ], and altering the program as follows:

    i = event_index(n, t);
    c = event_category(n, i);
    events[c]++;

The overall event classification scheme is thus encapsulated within a single event_category() function for the entire categorization of events.

2.4 Other Event Processes

In the software life cycle it is often the case that, if an event of one type occurs, there is a uniform probability $p < 1$ that another event of a different type will be triggered. (For example, suppose that for each unit of code generated, there is a probability $p$ that a fault is created.) If there are $n$ events of the first type, then the number $k$ of events of the second type is governed by the binomial distribution, which is also easily simulated [14]:

    #define select(n, p) random_binomial(n, p)
    ...
    n = produce(lambda(t, t + dt));
    k = select(n, p);

Finally, when there is an ultimate number of events $N$ that a Poisson process may reach before it is terminated, and $N$ is specified in advance, the growth of events over time must be stopped after the $N$th occurrence. This type of goal-limited process is also easily simulated.

2.5 General Event-Rate Processes

The simulation method of this paper is more general than is required for the mere production of Markoff processes and NHPPs. Since the algorithm of Subsection 2.2 springs directly from the definition, the method is capable of simulating all event-rate random processes. It is thus possible to simulate life cycle activities that may have event-count dependencies between non-overlapping time intervals and rate functions that depend on variable schedules and other irregularities over time. Whenever the rate functions produce homogeneous Markoff processes in a piecewise fashion, the event processes simulated during each such segment will follow the piecewise-Poisson approximation. The programs presented above are thus capable of simulating a much more general and realistic reliability process than has been hypothesized by any analytic model known to the authors.
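As a concrete illustration of the produce/select pattern of Section 2.4, the following self-contained sketch draws the number of triggered second-type events (say, faults injected along with newly produced code units) from the binomial distribution. The helper names and numeric values are illustrative assumptions, not SoftRel parameters.

    #include <stdio.h>
    #include <stdlib.h>

    #define select(n, p) random_binomial(n, p)

    /* Uniform variate in [0, 1). */
    static double uniform01(void)
    {
        return rand() / (RAND_MAX + 1.0);
    }

    /* Binomial variate: count successes in n Bernoulli trials of probability p. */
    static int random_binomial(int n, double p)
    {
        int i, k = 0;

        for (i = 0; i < n; i++)
            if (uniform01() < p)
                k++;
        return k;
    }

    int main(void)
    {
        int units_this_slot = 40;    /* e.g., output of produce() for one dt slot */
        double fault_prob = 0.02;    /* illustrative fault probability per unit   */
        int new_faults = select(units_this_slot, fault_prob);

        printf("units produced: %d, faults injected: %d\n",
               units_this_slot, new_faults);
        return 0;
    }

A goal-limited variant would simply stop counting successes once the specified limit N had been reached.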
Programs such as those shown above are typical of the methods traditionally used to analyze stochastic processes over a variety of input conditions. From a programming perspective, then, very little sophistication is required to simulate a reliability process. Insight, care, and validation are required, however, in modeling the intricate system-dynamic interrelationships among the various rate functions that characterize that process.

3. STRUCTURE OF THE SIMULATION TOOL

3.1 Overall Simulation Context

The techniques described in the previous Section are embodied in a software reliability process simulation package called SoftRel. SoftRel simulates the entire software reliability life cycle, including the effects of interrelationships among activities. For example, SoftRel provides for an increased likelihood of faults injected into code as the result of missing or defective requirements specifications. SoftRel also acknowledges that testing requires the preparation and consumption of test cases, and that repairs must follow identification and isolation. SoftRel further requires that human and computer resources be scheduled for all activities.

The SoftRel package is a prototype, currently configured to simulate processes having constant event rates per causal unit. The authors do not advocate that constant-rate processes necessarily model software reliability, nor do they endorse the prototype as a model ready for industrial use. Rather, it is regarded as a framework for experimentation, for generating data typical of analytic model assumptions for comparison with actual collected project data, and for inference of project characteristics from such comparisons. Other event-rate functions will be accommodated in later versions by changing the current constant rates and other parameters to properly defined functions indicated by project histories.

The current input to SoftRel consists of a single file that specifies the dt time slice, about 70 traits of the software project and its reliability process, and a list of activity, schedule, and resource allocations. Internally, these form a data structure called the model. Also internally, the set of status monitors at any given time is stored in a data structure called facts, which records the overall clock time, the time and resources consumed by each activity (42 measures in total), and a snapshot of 48 measures of project status. The output from SoftRel is a single file containing the series of facts produced at each dt interval of time.

SoftRel simulates two types of failure events, namely, defects in specification documents and faults in code. Figure 2 shows the execution context of SoftRel.

3.2 The Major Components of the Simulator

SoftRel is initialized by setting the sizes of items for construction, integration, and inspection. These could have been designed simply to equal the goal values given in the model, but the model values are considered only approximate. Sizes are therefore set to Poisson random values, with the model input values as means.

[Figure 2: SoftRel Execution Context]
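The fragment below is a purely illustrative toy rendering of the execution context of Figure 2; none of the names or structures correspond to actual SoftRel source, which is documented in [4]. It is meant only to suggest how a model input, a per-dt loop over the activity submodels, and a stream of facts records fit together.

    #include <stdio.h>

    #define N_ACTIVITIES 14

    typedef struct { double dt, t_end; } Model;                /* ~70 traits in SoftRel   */
    typedef struct { double t, effort[N_ACTIVITIES]; } Facts;  /* ~90 measures in SoftRel */

    /* Placeholder for one dt step of activity a (hypothetical). */
    static void advance_activity(int a, const Model *m, Facts *f)
    {
        f->effort[a] += m->dt;
    }

    int main(void)
    {
        Model m = { 1.0, 30.0 };              /* dt and end time, illustrative */
        Facts f = { 0.0, {0.0} };
        int a;

        while (f.t < m.t_end) {
            for (a = 0; a < N_ACTIVITIES; a++)
                advance_activity(a, &m, &f);  /* each activity submodel        */
            f.t += m.dt;
            printf("facts record: t = %4.1f\n", f.t);  /* one record per dt    */
        }
        return 0;
    }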
In a typical software engineering life cycle, several interrelated software reliability subprocesses take place concurrently. The activities in these subprocesses are characterized by 14 major components in the simulator, with appropriate staffing and resource levels devoted to each activity:

(1) Document Construction: Document generation and integration are assumed to be piecewise-Poisson approximations with constant mean rates per workday specified in the model, not to exceed the goal values. Defects are assumed to be injected at a constant probability per documentation unit. At each injection of a defect, the document hazard increases according to the defect detection characteristic, which is discussed further under Document Inspection.

(2) Document Integration: Document integration consists of the acquisition of reusable documentation, deletion of unwanted portions, addition of new material, and minor changes. Each of these subactivities is assumed to be a goal-limited piecewise-Poisson approximation of a type similar to the construction process described above. Defects are created as a result of each subactivity. Documentation is integrated at a constant mean rate per workday, and defects are injected at a constant probability per documentation unit. Hazard increases at each defect injection according to the assumed defect detection characteristic. The total current documentation consists of new units, reused units less deletions, and added units; mere changes are deemed not to alter the total volume of documentation.

(3) Document Inspection: Document inspection is a goal-limited piecewise-Poisson approximation of a type similar to document construction; both new and integrated reused documentation are assumed to be inspected at the same rate and with the same efficiency. Documentation is inspected at a constant mean rate per workday. Inspected units are allocated between new and reused documents in proportion to the relative amounts of documentation in these two categories. Defects detected during inspections may not exceed those injected; the discovery of defects is characterized as a goal-limited binomial process. The defect discovery rate is assumed to be proportional to the current accumulated document hazard and the inspection efficiency.

(4) Document Correction: Defect corrections are produced at a rate determined by the staff level and the attempted-fix rate given in the model; actual corrections take place according to the defect-fix adequacy, not to exceed the actual number of defects discovered (a goal-limited binomial situation). Attempted fixes can also inject new defects and can change the overall amount of documentation via the numbers of documentation units deleted, added, and changed. True corrections decrease the document hazard, while the injection of new defects increases it.

(5) Code Construction: Production of code follows the same formulation as document construction. However, the average pace at which faults are created is influenced not only by the usual fault density that occurs as a normal consequence of coding, but also by the density of undiscovered defects in the documentation and by the amount of missing documentation. Each fault injected increases the code hazard. But whereas document defects are found only by inspection, code faults may be found by both inspection and testing, and at different rates.

(6) Code Integration: Simulation of code integration is comparable in structure to document integration, except that code units replace documentation units and coding rates replace documentation rates.
The fault injection rate is of the same form as that for code construction, above. Each fault increases the code hazard.

(7) Code Inspection: Code inspection mirrors the document inspection process. The number of faults discovered may not exceed the total number of as-yet undiscovered faults. The fault discovery rate is assumed to be proportional to the current accumulated fault hazard and the inspection efficiency. Since previously discovered faults may not yet have been removed at the time of discovery, the number of newly discovered faults is assumed to be in proportion to the number of as-yet undiscovered faults.

(8) Code Correction: Code correction follows the same algorithm given for document correction, translated to code units. Fault hazard is reduced upon correction of a fault, and increased if any new faults are injected by the correction process. Documentation changes are produced at assumed constant mean rates per attempted correction.

(9) Test Preparation: Test preparation consists merely of producing a number of test cases in each dt slot in proportion to the test preparation rate, a constant mean number of test cases per workday.

(10) Testing: The testing activity simulation has two parts. If a test outage is in effect, the outage time indicator decrements while time and effort increment. If an outage is not in effect, failures occur at the modeled rate; the number observed is computed as a binomial process regulated by the probability of observation. The failure rate function returns a value proportional to the current hazard level. The testing activity additionally consumes computer resources and test cases, the latter at a constant mean rate. (A sketch of this per-slot logic appears at the end of Section 3.3.)

(11) Fault Identification: The total number of failures analyzed may not exceed the number of failures observed. Failures are analyzed at a constant mean rate per workday. The identification of faults is limited in number to those still remaining in the system. The isolation process is regulated by the fraction of faults remaining undiscovered, the adequacy of the analysis process, and the probability of faithful isolation.

(12) Fault Repair: The number of attempted repairs may not exceed the number of faults identified by inspections and testing, less those corrected after inspection, plus those identified for rework by validation and retesting. Of those attempted, a select number (in the binomial sense of Section 2.4) will really be repaired, while the rest will mistakenly be reported as repaired. Repairs are assumed to be made first on faults identified for rework. A select number of new faults may be created by the attempt, and code units may be altered (deleted, added, or changed). Attempted repairs take place at a constant mean rate per workday.

(13) Validation of Repairs: The validation of attempted repairs takes place at an assumed constant mean rate per workday. The number of repairs validated may not exceed the number of attempted repairs. The number of faulty repairs detected is a select number determined by the probability that validation will recognize an unrepaired fault when one exists and by the probability that unrepaired faults are among the attempted repairs being validated (the repair adequacy); the detected bad fixes cannot exceed the actual number of mis-repaired faults. Those detected are designated for rework and removed from the unrepaired, undiscovered fault count.
(14) Retesting: Retesting takes place at a constant mean number of retests per workday and consumes computer resources at the scheduled rate per day. No new test cases are generated (or consumed), since the original test cases are assumed to be available for regression. Retesting is assumed to encounter only those failures due to unrepaired faults.

3.3 Simulator Input and Output

It is beyond the scope of this paper to describe each of the 70 input model parameters and the 90 output facts parameters. Interested readers will find them described more fully in [4].

The input file additionally contains a list of staffing and computer-resource packets, each of which allocates resources to specified activities and time slots. Slot times may overlap or leave gaps, at the discretion of the user. Such schedules are the natural outcome of development process planning and are of fundamental importance in shaping the reliability process. At least 14 schedule packets are needed to allocate resources and time slots to the 14 assumed reliability process activities. More packets may appear when an activity repeats or has a non-constant resource allocation profile.

Output values consist of all product, work, CPU, resource, fault, failure, and outage values. These are time-tagged in the form of a facts data structure and written to the output file at each dt time interval for later scrutiny (e.g., plotting, trending, and model readjustment) by other application programs, such as spreadsheets.

The reliability process embodied in the prototype is meant to be fairly comprehensive with respect to what really transpires during a software development. The simulator therefore requires parameters relating to the ways in which people and processes interact. The large number of parameters in the simulator might, at first, seem to present an overwhelming, impractical barrier to modeling, but it must be remembered that the true reliability process is even more complex. The number of parameters used was felt to be the least capable of reproducing the realism hoped for; reducing the number of parameters would reduce either the fidelity of the simulation or the generality of the reliability process model. This belief may change after sufficient experimentation has taken place, whereupon selective alterations or combinations of the parameters may be indicated. In any case, these parameters can be independently estimated and continuously refined with ease.

If a project does not have sufficient data about past projects to give values to certain parameters, then sensitivity analyses using SoftRel can indicate which parameters are the most influential, and thereby where a metrics effort may prove most useful in reliability management. Alternatively, users may simplify the model to focus on only one or two activities at a time by making some of the parameters inactive. This may be done by assigning typical or default values (usually 0 or 1) to them, thereby reducing the number of measured parameters to only those deemed pertinent and realistic within the project context.
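To tie the schedule packets described above to the per-slot testing logic of Section 3.2, item (10), the following purely hypothetical sketch represents one resource packet and applies it for a single dt slot. None of the structure fields, helper names, or numeric values are SoftRel's own; its actual input format and algorithms are documented in [4].

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical schedule packet: resources for one activity over one time slot. */
    typedef struct {
        int    activity;        /* which of the 14 activities the packet feeds */
        double t_begin, t_end;  /* time slot boundaries, in days               */
        double staff;           /* staff level applied during the slot         */
        double cpu;             /* computer resource applied during the slot   */
    } SchedulePacket;

    static double uniform01(void) { return rand() / (RAND_MAX + 1.0); }

    /* Poisson variate (multiplication method, small means), as in Section 2.2. */
    static int random_poisson(double mean)
    {
        double limit = exp(-mean), product = uniform01();
        int n = 0;
        while (product > limit) { product *= uniform01(); n++; }
        return n;
    }

    /* Binomial selection, as in Section 2.4. */
    static int select_binomial(int n, double p)
    {
        int i, k = 0;
        for (i = 0; i < n; i++)
            if (uniform01() < p) k++;
        return k;
    }

    int main(void)
    {
        SchedulePacket pkt = { 10, 140.0, 420.0, 2.0, 1.0 }; /* testing activity   */
        double t = 150.0, dt = 1.0;   /* one dt slot inside the packet's time slot */
        double hazard = 3.0;          /* illustrative current fault hazard         */
        double p_observe = 0.9;       /* probability an occurring failure is seen  */

        if (t >= pkt.t_begin && t < pkt.t_end) {
            /* failures occur at a rate proportional to hazard and applied CPU */
            int failures = random_poisson(hazard * pkt.cpu * dt);
            int observed = select_binomial(failures, p_observe);
            printf("day %.0f: %d failures occurred, %d observed\n",
                   t, failures, observed);
        }
        return 0;
    }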
4. APPLICATIONS: A PROJECT CASE STUDY

This Section describes the application of SoftRel to a real-world project. Project data and parameters from a subsystem of the Galileo Flight Project at the Jet Propulsion Laboratory were collected as a case study for evaluating the simulation technique. The remainder of this Section describes the project, applies the simulation technique, and compares the results with those obtained from several traditional reliability models.

4.1 Project Description

Galileo is an outer-planet spacecraft project that began at the start of fiscal year 1977 as a mission originally entitled "Jupiter Orbiter and Probe," or JOP. Unlike previous outer solar system missions, the Galileo orbiter was intended to remain in Jovian orbit for an extended interval of time. This would allow observations of variations in planetary and satellite features over time, augmenting the information obtained from the single-observation opportunities afforded by previous fly-by missions. Galileo was launched in October of 1989 and will reach the Jovian system in 1995.

There are two major on-board flight computers in the Galileo spacecraft: the Attitude and Articulation Control Subsystem (AACS) and the Command and Data System (CDS). A significant portion of each of these systems is embodied in software. This case study focuses on the CDS software reliability profile. The CDS flight software is real-time embedded software, comprising 17,000 lines of assembly language code (16,500 lines new, 500 lines reused), with about 1400 pages of documentation (1300 pages new, 100 pages reused), produced over a period of approximately 1500 days (300 weeks). The project went through several design reviews and code inspections, performed structured analysis and design, and recorded and tracked failures during its software testing phase.

4.2 Parameter Estimation and Simulation Results

This Subsection presents a simulation of an end-to-end development project based on data from the Galileo CDS project. Some of the CDS project parameters were taken from project records; other values were estimated by personnel within the project; the remaining values were chosen by the authors as probably typical of this project's behavior, but with no immediately available data to verify them. Believed-typical values were adopted, for example, for parameters related to the injection of faults in the correction and repair processes. None of the model input parameters was set to zero. Thus, even though few verifiable model parameters were available outside the software testing phase, an entire plausible hypothetical model was nevertheless formed in order to illustrate simulation of an end-to-end reliability process.

For lack of better development life cycle data, all CDS activities other than testing (e.g., construction, inspection, and anomaly removal) were presumed to take place serially, merely to observe what their simulated behaviors would be. The study also presumed that each activity took place without resource and schedule variations, in order to view typical Markoff reliability behavior.

Figures 3 through 6 show the simulated documentation, code, defect, and fault profiles of the software, sampled every 10 days. Of particular note are the behaviors of the documentation, code, injected defects, and injected faults (precisely those activities for which no project data exist). Because the numbers of units are comparatively large, the relative irregularity levels are low, as predicted by (Eq. 3).
Figure 3 shows that the volume of documentation units (DU_t) did reach its goal, but in this case only about 63% of the documentation was actually inspected (DI_t), even though the model placed a goal of 95% on inspection. This is an instance in which inadequate resources were allocated to the inspection process; more resources would have been required to reach the goal. The effects of defect correction on page count are not visible. The second rise in documentation is due to the integration of the 100 reused pages.

Figure 4 similarly shows that the volume of code units (CU_t) did reach its goal, and that the 90% inspection goal was met as well. The effects of correcting and repairing faults on code size, however, are again not visible.

The injection, detection, and removal of defects (E_d, D, d), shown in Figure 5, are a little noisier than documentation and code production, but not much. All of the detected defects were corrected (D = d), but a sizable number of defects were inserted during the correction period (days 520-580). In the end, more than 100 defects were left in the documents.

The fault activity, shown in Figure 6, exhibits the noisiest behavior of all, but is still fairly regular. The initial rise in injected faults (E_f) is due to construction; the second rise, due to integration, is not visible; the third, again a sharp rise, is due to the imperfect fault correction process; and the final gradual rise is due to the imperfect fault repair process. By the end of the 1500-day project, about 7 faults per kiloline of code (e) had been found by inspections and corrected (h), and about 22 faults per kiloline of code (f) had been uncovered by testing and removed (r); the fault density at delivery was about 0.2 faults per kiloline of code.

[Figure 3: CDS Simulated Document Construction, Integration, and Inspection]
[Figure 4: CDS Simulated Code Construction, Integration, and Inspection]
[Figure 5: CDS Simulated Defect Discovery and Correction]
[Figure 6: CDS Simulated Fault Injection, Discovery, Correction, and Repair]

Although the final fault discovery count is considered to be accurate, the time profile of the simulation results does not appear to be as irregular as the actual project data. It seems likely, then, that the fault discovery process here is not homogeneous, either. On the basis of this case study, it appears that the simulation of all reliability subprocesses will require the use of non-homogeneous event-rate models that reflect irregular workloads and schedules of life cycle activities. An example is given in the next Subsection.

4.3 Comparisons with Other Software Reliability Models

To simulate the details of the Galileo CDS testing activity, its testing phase was separated into five subactivities with constant staffing but irregular CPU and schedule allocations, as shown in Table 1. These schedule parameters were obtained as those necessary to fit the simulator output to project data. The fit appears adequate to describe the underlying nature of the random failure process.
    Activity               Accumulated failures   Begin week   End week   Staffing   CPU
    1 (functional test)             90                 0           5         2.0     0.4
    2 (feature test)               150                 5          13         2.0     0.4
    3 (operation test 1)           300                13          23         2.0     1.2
    4 (operation test 2)           325                23          33         2.0     1.0
    5 (operation test 3)           341                33          40         2.0     2.0

    Table 1: Schedule of the CDS Testing

Figure 7 shows the field data and the results obtained from the piecewise-homogeneous simulation process, as well as those from three other models: the Jelinski-Moranda model (JM), the Musa-Okumoto model (MO), and the Littlewood-Verrall model (LV). For better visibility of process granularity, the data are shown as failures per week rather than cumulatively. The JM, MO, and LV statistics were calculated as "one-week-ahead" predictions, in which all of the failure data up to a given week were used to predict the number of failures for the next week.

[Figure 7: Various Predictions for CDS Failures-per-Week Data]

As shown in the figure, the simulation technique produced a very good forecast that could have been used for tracking the reliability status during the entire testing phase. The Sum of Square Errors (SSE) for the simulation results in Figure 7 is 24.5; by comparison, the SSEs for the analytic JM, MO, and LV models are 38.8, 43.3, and 99.7, respectively. We conjecture that the reliability forecast could have been accurately simulated even prior to the start of testing, had the actual schedule and resource plans been used a priori. The other models were inadequate for predicting even one week ahead, particularly the LV model.

5. CONCLUSION

Reliability modelers seldom have the luxury of several realizations of the same failure process with which to test their hypotheses concerning the nature of a system's reliability. Nor are they ever provided with data that faithfully match the assumed natures of their models. Nor are they able to probe the underlying error and failure mechanisms in a controlled way. Rather, they are faced not only with the problem of guessing the forms and particulars of the underlying error and failure random processes from the scant, uncertain data they possess, but also with the problem of best forecasting future failures from this single data set.

The assumptions of the simulation approach are certainly less restrictive than those underlying analytic models. The simulation approach addresses software reliability prediction problems by producing data that conform precisely to the stated software failure assumptions. Simulation enables investigation of questions such as "How does a project's observed data compare with data emanating from an NHPP having the following characteristics?" and "Which analytic prediction model is best under the following assumptions?" We believe that the SoftRel tool and its offspring offer significant potential to researchers and practitioners in answering such questions, in evaluating the sensitivity of predictions to various error and failure modeling assumptions, and in forecasting software project status profiles, such as time-lines of work products and the progress of testing, fault isolation, repair, validation, and retest efforts. Simulation of a real-world project has reinforced our confidence in the validity of the approach.
We believe that homogeneous Markoff event-count models that consume resources uniformly do not adequately model the statistical failure profile of an actual project. The non-homogeneous, variable-resource-schedule, event-rate simulation model was capable of producing very good early forecasts of reliability growth that could prove very useful for process status assessment.

Acknowledgements

We wish to thank Prof. Yi-Bing Lin of National Chiao Tung University and Dr. Yu-Yun Ho of Bellcore for their valuable inputs. Portions of the research described in this paper were carried out by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

****************** Theory Sidebar ******************

• Framework of the Discrete Event Simulation

The fundamental assumption of reliability process simulation is that every stochastic event is the result of an underlying instantaneous conditional event-rate random process [14]. A conditional event-rate process is one for which the probability that an event occurs in the interval $(t, t + dt)$, given that it has not occurred prior to time $t$, is equal to $\beta(t)\,dt$ for some function $\beta(t)$. The statistical behavior of this process is well known [15]: the probability that an event $\epsilon$ will have occurred prior to a given time $t$ is given by

    \mathrm{Prob}\{\epsilon \mbox{ occurs in } (0, t)\} \;=\; P(t)
        \;=\; 1 - \exp\!\left(-\int_0^t \beta(\tau)\,d\tau\right)
        \;=\; 1 - e^{-\lambda(0,\,t)}                                        (Eq. 1)

When the events of interest are failures, $\beta(t)$ is often referred to as the process hazard rate and $\lambda(0, t)$ as the total hazard. If $\lambda(0, t)$ is known in closed form, the event probability can be analyzed as a function of time. But if many related events are intricately combined in $\beta(t)$, the likelihood of a closed-form solution for the event statistics dims considerably. The expressions to be solved can easily become so convoluted that calculating results requires a computer programmed with comparatively complex algorithms.

Of special interest here are discrete event-count processes, which merely record the occurrences of rate-controlled events over time. The function $\beta_n(t)$ denotes the conditional occurrence rate, given that the $n$th event has already occurred by time $t$; the integral of $\beta_n(t)$ is $\lambda_n(0, t)$. These processes are termed non-homogeneous when $\beta_n(t)$ depends explicitly on $t$. The probability $P_n(t)$ that $n$ events occur in $(0, t)$ is much more difficult to express than (Eq. 1), and does not concern us here.

One important event-rate process is the discrete Markoff process. A Markoff process is said to be homogeneous when its rate function is sensitive only to time differences, rather than to absolute time values. The notation $\beta_n(t)$, in these cases, signifies that $t$ is measured from the occurrence time $t_n$ of the $n$th event.
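For example (a simple special case, not stated explicitly in the sidebar): when the rate is a constant, $\beta_n(t) = \beta$, (Eq. 1) reduces to

    P(t) \;=\; 1 - \exp\!\left(-\int_0^t \beta\,d\tau\right) \;=\; 1 - e^{-\beta t},

so successive interarrival times are exponentially distributed with mean $1/\beta$; this is the familiar homogeneous Poisson behavior recovered below as the special case $\lambda = \beta t$.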
When the hazard rate $\beta_n(t)$ of a Markoff event-count process is independent of $n$, one may readily verify that the general event-count behavior is a non-homogeneous Poisson process (NHPP) whose mean and variance are given by

    \bar{n} \;=\; \lambda(0,\,t)\,; \qquad \sigma^2 \;=\; \lambda(0,\,t)                      (Eq. 2)

    \frac{\sigma}{\bar{n}} \;=\; \frac{1}{\sqrt{\lambda(0,\,t)}} \;=\; \frac{1}{\sqrt{\bar{n}}}          (Eq. 3)

The homogeneous (constant event rate) Poisson process is described by $\lambda = \beta t$. Homogeneous Poisson process statistics thus apply to the homogeneous Markoff event-count process only when the Markoff rate $\beta_n(t) = \beta$ is constant.

One may note from (Eq. 3) that as $\bar{n}$ increases, the percentage deviation of the process decreases. In fact, any event process with independence among events in non-overlapping time intervals will exhibit relative fluctuations that behave as $O(1/\sqrt{\bar{n}})$, a quantity that becomes increasingly smaller for larger $\bar{n}$. This trend signifies that Poisson and Markoff processes involving large numbers of event occurrences will tend to become relatively regular in percentage terms. Conversely, if physical processes appear to be very irregular, then it will not be possible to simulate them using independent-increment assumptions with regular rate functions.

There is a sense in which the NHPP form is inappropriate for describing the overall software reliability profile. The reliability of software grows only as faults are discovered and repaired, and these events occur at only a finite number of times during the life cycle. The true hazard rate presumably changes discontinuously at these times, whereas the NHPP rate changes continuously. In any case, the event-count Markoff model of software reliability is more general than the NHPP form, in that there is no assumption that its cumulative rate $\lambda$ is independent of $n$ or $t_n$.

• Multiple Event Processes

Conditional event-rate processes are also characterized by the property that the occurrences of several independent classes of events, $\epsilon_1, \ldots, \epsilon_f$, with rate functions $\beta_n^{[1]}(t), \ldots, \beta_n^{[f]}(t)$, respectively, together behave as if $f$ algorithms of the single-event variety were running simultaneously, each with its own separate rate function, beta[i](n, t), controlling the $n$th occurrence of event $\epsilon_i$ at time $t$. That is, the event occurrence process is equivalent to a single event-rate process governed by the composite rate function

    \beta_n(0,\,t) \;=\; \sum_{i=1}^{f} \beta_n^{[i]}(0,\,t)                                  (Eq. 4)

When event occurrences in non-overlapping intervals are independent, each $(t_a, t_b)$ interval is governed by a non-homogeneous Markoff process with rate

    \beta_n(t,\,t_n) \;=\; \sum_{i=1}^{f} \beta_{n_i}^{[i]}(t,\,t_{n_i})                       (Eq. 5)

When a new event $\epsilon_i$ is added to (or deleted from) the distinguished class of events, $\beta_n(t, t_n)$ readjusts to include (or exclude) the corresponding $\beta_{n_i}^{[i]}(t, t_{n_i})$ function, and the simulation proceeds. This characteristic provides a simple and straightforward method for simulating the effects of fault and defect injections and removals.
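The composite-rate property of (Eq. 4) suggests a simple simulation step, sketched below under illustrative assumptions: in each dt slot an event is generated with probability equal to the total rate times dt, and the class of the event is then chosen with probability proportional to its component rate. This is the standard technique for superposed event classes, not SoftRel source; the class rates and other values are hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    #define N_CLASSES 3

    static double uniform01(void) { return rand() / (RAND_MAX + 1.0); }

    /* Per-class rates at time t; constant here purely for illustration. */
    static double class_rate(int i, double t)
    {
        static const double rate[N_CLASSES] = { 0.05, 0.02, 0.01 };
        (void)t;                      /* a real rate function may depend on t */
        return rate[i];
    }

    int main(void)
    {
        double t = 0., dt = 0.1, t_max = 1000.;
        int events[N_CLASSES] = { 0, 0, 0 };
        int i;

        while (t < t_max) {
            double total = 0., u, cum = 0.;
            for (i = 0; i < N_CLASSES; i++)
                total += class_rate(i, t);           /* composite rate, (Eq. 4) */

            if (uniform01() < total * dt) {          /* did any event occur?    */
                u = uniform01() * total;             /* pick the class in       */
                for (i = 0; i < N_CLASSES; i++) {    /* proportion to its rate  */
                    cum += class_rate(i, t);
                    if (u < cum) { events[i]++; break; }
                }
            }
            t += dt;
        }
        for (i = 0; i < N_CLASSES; i++)
            printf("class %d: %d events\n", i, events[i]);
        return 0;
    }

Adding or removing a class (for instance, when a fault is injected or repaired) amounts to adding or removing one term of the composite rate, exactly as described above.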
References

[1] J.D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, New York, 1987.

[2] M.R. Lyu and A. Nikora, "Using Software Reliability Models More Effectively," IEEE Software, July 1992, pp. 43-52.

[3] A. von Mayrhauser, Y.K. Malaiya, J. Keables, and P.K. Srimani, "On the Need for Simulation for Better Characterization of Software Reliability," Proceedings of the International Symposium on Software Reliability Engineering, Denver, Colorado, November 1993.

[4] R. Tausworthe, "A General Software Reliability Process Simulation Technique," Technical Report 91-7, Jet Propulsion Laboratory, Pasadena, California, March 1991.

[5] M.R. Lyu (ed.), Handbook of Software Reliability Engineering, McGraw-Hill and IEEE Computer Society Press, New York, 1996.

[6] W. Kreutzer, System Simulation: Programming Styles and Languages, International Computer Science Series, Addison-Wesley, Menlo Park, California, 1986.

[7] Z. Jelinski and P.B. Moranda, "Software Reliability Research," in Statistical Computer Performance Evaluation, W. Freiberger (ed.), Academic Press, New York, 1972, pp. 465-484.

[8] A.L. Goel and K. Okumoto, "Time-Dependent Error-Detection Rate Model for Software Reliability and Other Performance Measures," IEEE Transactions on Reliability, vol. R-28, 1979, pp. 206-211.

[9] J.D. Musa and K. Okumoto, "A Logarithmic Poisson Execution Time Model for Software Reliability Measurement," Proceedings of the Seventh International Conference on Software Engineering, Orlando, Florida, 1984, pp. 230-238.

[10] J.T. Duane, "Learning Curve Approach to Reliability Monitoring," IEEE Transactions on Aerospace, vol. AS-2, 1964, pp. 563-566.

[11] B. Littlewood and J.L. Verrall, "A Bayesian Reliability Growth Model for Computer Software," Journal of the Royal Statistical Society, vol. 22, 1973, pp. 332-346.

[12] S. Yamada, M. Ohba, and S. Osaki, "S-Shaped Reliability Growth Modeling for Software Error Detection," IEEE Transactions on Reliability, vol. R-32, December 1983, pp. 475-478.

[13] D.E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, Addison-Wesley, Reading, Massachusetts, 1970.

[14] N. Roberts et al., Introduction to Computer Simulation, Addison-Wesley, Reading, Massachusetts, 1983.

[15] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, 1965.