Using Software Reliability Models More Effectively

Michael R. Lyu
ECE Department
The University of Iowa
Iowa City, IA 52242

Allen Nikora
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, CA 91109

Abstract

By making simplifying assumptions about the complicated software development and operational processes, many existing software reliability models obtain a closed-form formula describing the software failure process for the purpose of reliability measurement and prediction. However, when these models are applied to real-world data, many discrepancies between actual and predicted reliability are observed. We have not seen a model that consistently performs best when applied to a wide variety of project data. This is partly because real-world projects normally do not comply completely with modeling assumptions, and partly because project data often contain noise irrelevant to the modeling process, making prediction a formidable task. Given these observations, this paper proposes a different approach to the software reliability measurement problem. Rather than attempting to formulate new models which can cope with project-related information, we investigate ways in which existing models might be used to better advantage than they currently are. The basic idea is to combine the results of multiple models, based on determinations of the models' applicability, to obtain a better measurement of software reliability. To that end, a family of linear combination models is proposed and investigated. The predictive validity of this approach is evaluated on several historical and newly collected failure data sets. The unsuitability of any single model can be observed in its fluctuating behavior over these data sets, whereas our linear combination models consistently demonstrate satisfactory performance. Finally, to facilitate the whole procedure, a computer-aided software engineering tool that automates a large portion of the software reliability measurement task is proposed and described.

1. Introduction

The complexity and size of software systems are growing dramatically. The reliability of the software component is increasingly the determining factor of overall system reliability. These trends make the measurement of software reliability one of the major challenges for software engineers. Traditionally, software reliability modeling is a set of techniques that apply probability theory and statistical analysis to predict the reliability of software products, both quantitatively and objectively. A software reliability model specifies the general form of the dependence of the failure process on the principal factors that affect it: fault introduction, fault manifestation, failure detection and recovery, fault removal, and the operational environment [1].

The first software reliability model was formulated almost 20 years ago [2]. Since then, over 40 models have appeared in the literature [1], and it is likely that many more unpublished models are in use. The primary goal of these models is to assess current reliability and forecast future reliability by applying statistical inference techniques to the observed failure data under rational assumptions. A major difficulty in software reliability engineering practice is to analyze the particular context in which reliability measurement is to take place so as to decide a priori which model is likely to be trustworthy.
Due to the intricacy of the human activities involved in the software development and operation process, as well as the uncertain nature of software failure patterns, such a priori determinations have never been conclusive. It has been shown that there is no single best software reliability model for every case under all circumstances [3]. As a result, practitioners are left in a dilemma as to which software reliability models to choose, which procedures to apply, and which prediction results to trust, while contending with varying software development and operation practices.

2. Our Approach

The major objective of this research is to propose a new and practical approach to software reliability measurement, one that tends to produce better predictive validity. Although the software reliability measurement problem, by its nature, involves significant uncertainties, the proposed scheme should make more accurate software reliability predictions, at least in an average sense, than the traditional approach of relying on and advocating any single model.

To increase the predictive validity of software reliability measurements, we have formalized the following general combination modeling approach:

1. Identify a basic set of models (called component models). If the project testing environments are well known, the models whose assumptions are closest to the real environments should be selected.

2. Select models whose biased predictions (if any) tend to cancel each other out.

3. Apply each of the component models separately to the failure data.

4. Apply certain criteria to weight the selected component models and form one or more linear combination models for the final predictions. The weights can be either static or dynamic.

In general, this model is expressed as the following mixed distribution:

$$f_i(t) = \sum_{j=1}^{n} w_j \, f_i^{(j)}(t)$$

where $f_i^{(j)}(t)$ is the predictive probability density function of the $j$th component model, given that $i-1$ times between successive failures have been observed. Note that $\sum_{j=1}^{n} w_j = 1$ for all $t$'s. It is also noted that this linear combination approach tends to preserve the features inherited from its component models. It does not further complicate modeling practice, since each component model performs its reliability calculations independently; the component models enter the combination only at the last stage, for the final predictions.

Since the selection of component models is deemed very important, we suggest that the basic component models be selected from those implemented in two major software reliability measurement tools, SMERFS and SRMP. They include the following software reliability models: (1) Bayesian Jelinski-Moranda Model (BJM) [4], (2) Brooks and Motley Model (BM) [5], (3) Duane Model (DU) [4], (4) Geometric Model (GM) [5], (5) Goel-Okumoto Model (GO) [6], (6) Jelinski-Moranda Model (JM) [2], (7) Keiller-Littlewood Model (KL) [4], (8) Littlewood Model (LM) [4], (9) Littlewood Non-Homogeneous Poisson Process Model (LNHPP) [4], (10) Littlewood-Verrall Model (LV) [7], (11) Musa-Okumoto Model (MO) [8], (12) Generalized Poisson Model (PM) [5], (13) Schneidewind Model (SM) [5], and (14) Yamada Delayed S-Shape Model (YM) [5]. Note that the parameter estimation methods chosen to implement these models could, to a certain extent, affect the predictive validity of the modeling results.

As an example to illustrate this combination approach, we chose GO, MO, and LV as the three component models to form a set of linear combination models.
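Before turning to why these three models were chosen, here is a minimal sketch of the mixed-distribution formulation above, in Python. The component-model interface (a predictive density as a plain callable) and the exponential stand-in densities are our own illustrative assumptions, not part of SMERFS or SRMP:

```python
import math
from typing import Callable, Sequence

def combined_pdf(weights: Sequence[float],
                 component_pdfs: Sequence[Callable[[float], float]]
                 ) -> Callable[[float], float]:
    """Mixed distribution f_i(t) = sum_j w_j * f_i^(j)(t).

    component_pdfs[j] is the j-th model's predictive density for the
    i-th time between failures, already conditioned on t_1 .. t_{i-1}.
    The weights must sum to one.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"

    def f(t: float) -> float:
        return sum(w * pdf(t) for w, pdf in zip(weights, component_pdfs))

    return f

# Stand-in exponential predictive densities with hypothetical rates:
go = lambda t: 0.10 * math.exp(-0.10 * t)  # in place of GO's f_i(t)
mo = lambda t: 0.08 * math.exp(-0.08 * t)  # in place of MO's f_i(t)
lv = lambda t: 0.05 * math.exp(-0.05 * t)  # in place of LV's f_i(t)

f_combined = combined_pdf([1/3, 1/3, 1/3], [go, mo, lv])
print(f_combined(10.0))  # combined predictive density at t = 10
```

Because each component density is computed independently and only summed at the end, the combination adds essentially no cost to the modeling step itself, consistent with the point made above.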
Reasons for choosing them as component models are:

1. Their predictive validity has been observed in our recent investigation. In fact, they are judged to perform well by many practitioners, and they have been widely used [3].

2. They represent different categories of models: GO (similar to JM and SM) represents the exponential-shape non-homogeneous Poisson process (NHPP) models, MO represents the logarithmic-shape NHPP models, and LV represents the inverse-polynomial-shape Bayesian models.

3. Over the set of failure data we analyzed, their predictive biases tend to cancel: GO tends to be optimistic, LV tends to be pessimistic, and MO may go either way.

As a result, we formulated the following set of four combination models (a sketch of the weighting arithmetic follows the list):

1. ELC: Equally-Weighted Linear Combination Model

This model is formed by assigning the three component models a constant, equal weight; the arithmetic average of the component models' predictions is taken as the ELC prediction, namely

$$\mathrm{ELC} = \frac{1}{3}\,\mathrm{GO} + \frac{1}{3}\,\mathrm{MO} + \frac{1}{3}\,\mathrm{LV}$$

where the weights remain constant and unchanged throughout the modeling process. This is similar to a Delphi survey, in which authorities working independently are asked for an opinion on a subject and an average of the results is taken. The motivation of this approach is to reduce the risk of relying on a specific model which may produce grossly inaccurate predictions, while retaining much of the simplicity of using the component models.

2. MLC: Median-Oriented Linear Combination Model

Instead of taking the arithmetic mean as in ELC, the component model whose predicted value lies between the optimistic and pessimistic values is selected as the output of this model. The justification for this approach is that the median may be more moderate than the mean in some cases, since it better tolerates an erroneous prediction that lies far from the others.

3. ULC: Unequally-Weighted Linear Combination Model

This model is similar to MLC except that, instead of the output being determined solely by the median value, the optimistic and pessimistic predictions also make small contributions to the final prediction. Here we use weightings similar to those of the Program Evaluation and Review Technique (PERT), i.e., the formulation of this model is

$$\mathrm{ULC} = \frac{1}{6}\,O + \frac{4}{6}\,M + \frac{1}{6}\,P$$

where O represents an optimistic prediction, P represents a pessimistic prediction, and M represents the median prediction.

4. DLC: Dynamically-Weighted Linear Combination Model

In this model, we assume that the applicability of any individual model with respect to the data may change as the testing effort progresses. The weights of the component models will therefore change, based on changes in specific measures of a model's applicability throughout the test effort. A Bayesian interpretation of the prequential likelihood ratio [9] is used as the a posteriori odds that one model is more valid than another at a given time, dynamically determining the weights of the component models. Note that the prequential likelihood measure can be taken over a short or a long time window; here we take the one time frame prior to each prediction as the reference in assigning weights.
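The sketch below illustrates these weighting schemes, assuming each component model has already produced a point prediction (say, the predicted next time between failures). The function names and sample values are illustrative only; DLC's weights come from the prequential likelihood machinery sketched in Section 3.1:

```python
def elc(predictions):
    """ELC: arithmetic mean of the component predictions."""
    return sum(predictions) / len(predictions)

def mlc(predictions):
    """MLC: the median prediction, i.e. the one lying between the
    optimistic and pessimistic extremes."""
    return sorted(predictions)[len(predictions) // 2]

def ulc(predictions):
    """ULC: PERT-style weighting (O + 4M + P) / 6.  The formula is
    symmetric in O and P, so sorting the three predictions suffices."""
    lo, mid, hi = sorted(predictions)
    return (lo + 4.0 * mid + hi) / 6.0

def dlc(predictions, weights):
    """DLC: weights supplied externally, e.g. normalized prequential
    likelihoods over the chosen time window."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical next-time-between-failures predictions from GO, MO, LV:
preds = [120.0, 95.0, 70.0]
print(elc(preds), mlc(preds), ulc(preds), dlc(preds, [0.5, 0.3, 0.2]))
```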
3. Preliminary Evaluations

3.1. Model Comparison Criteria

Of the 14 possible component models, which we evaluated with respect to six selection criteria, six (JM, GO, MO, DU, LM, LV) were judged to perform sufficiently well to warrant further investigation [3]. The six criteria are [1]: 1) Model Validity; 2) Ease of Measuring Parameters; 3) Quality of Assumptions; 4) Applicability; 5) Simplicity; and 6) Insensitivity to Noise. Criteria (2), (4), and (5) were included because they have been important concerns for many practitioners. However, it is important to note that our judgements of whether a model met these criteria were qualitative rather than quantitative. In any case, Model Validity was of particular interest to us. For further comparisons on Model Validity, four formally defined measures have been adopted [4].

o Accuracy: Defined by the prequential likelihood (PL) measure, as follows. Let the observed data be a sequence of times between successive failures, denoted $t_1, t_2, \ldots, t_{i-1}$. The objective is to use these data to predict the future, unobserved $T_i$. More precisely, we want a good estimate of $F(t)$, defined as $P(T_i \le t)$.
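To make the PL measure concrete, and to show how it can drive the DLC weights of Section 2, here is a minimal sketch; the function names and the exponential stand-in densities are our own illustrative assumptions:

```python
import math

def prequential_likelihood(predictive_pdfs, observed_times):
    """PL of one model: the product of its one-step-ahead predictive
    densities, each evaluated at the time between failures actually
    observed next.  predictive_pdfs[k] is the density conditioned on
    the first k observations; observed_times[k] is t_{k+1}."""
    pl = 1.0
    for f, t in zip(predictive_pdfs, observed_times):
        pl *= f(t)
    return pl

def dlc_weights(pls):
    """Normalize the component models' PLs (computed over the chosen
    window) into weights: the posterior-odds reading of the
    prequential likelihood ratio."""
    total = sum(pls)
    return [pl / total for pl in pls]

# Two stand-in models, each issuing exponential predictive densities:
obs = [50.0, 60.0, 80.0]
model_a = [lambda t: 0.02 * math.exp(-0.02 * t)] * 3
model_b = [lambda t: 0.01 * math.exp(-0.01 * t)] * 3
pls = [prequential_likelihood(m, obs) for m in (model_a, model_b)]
print(dlc_weights(pls))  # the better predictor earns the larger weight
```

A model whose predictive densities consistently assign high probability to the observations that actually occur accumulates a larger PL, and hence a larger weight in the combination.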