Using Software Reliability Models More Effectively

Michael R. Lyu                      Allen Nikora
ECE Department                      Jet Propulsion Laboratory
The University of Iowa              California Institute of Technology
Iowa City, IA 52242                 Pasadena, CA 91109
Abstract
By making simplifying assumptions about the complicated software
development and operational processes, many existing software
reliability models were able to obtain a closed-form formula
describing the software failure process for the purpose of
reliability measurement and prediction. However, when these models
were applied to real-world data, many discrepancies between actual
and predicted reliability were observed. We have not seen a model
that consistently performs best when applied to a wide variety of
project data. This is partly because real-world projects normally
do not completely comply with modeling assumptions, and partly
because project data often contain noise irrelevant to the modeling
process, making the prediction effort a formidable task. Given
these observations, this paper proposes
a different approach toward the software reliability measurement
problem. Rather than attempting to formulate new models which can
cope with project-related information, we investigate ways in
which existing models might be used to better advantage than they
currently are. The basic idea is to combine the results of mul-
tiple models, based on determinations of the applicability of the
models, for a better measurement of software reliability. As a
result, a family of linear combination models is proposed and
investigated. The predictive validity of this approach is evaluated
on several historical and newly collected failure data sets. The
unsuitability of any single model can be observed in its fluctuating
behavior across these data sets, whereas our linear combination
models consistently demonstrate satisfactory performance. Finally,
to facilitate the whole procedure, a computer-aided software
engineering tool that automates a large portion of the software
reliability measurement task is proposed and described.
1. Introduction
The complexity and size of software systems are growing dramati-
cally. The reliability of the software component is increasingly
the determining factor of overall system reliability. These
trends make the measurement of software reliability one of the
major challenges for software engineers. Traditionally, software
reliability modeling is a set of techniques that apply probabil-
ity theory and statistical analysis to predict the reliability of
software products, both quantitatively and objectively. A
software reliability model specifies the general form of the
dependence of the failure process on the principal factors that
affect it: fault introduction, fault manifestation, failure
detection and recovery, fault removal, and operational environ-
ment[1].
The first software reliability model was formulated almost 20
years ago[2]. Since then, over 40 models have appeared in the
literature[1]. It is likely that many more unpublished
models are in use. The primary goal of these models is to assess
current reliability and forecast future reliability, based on
rational assumptions for the application of statistical inference
techniques to the observed failure data. A major difficulty in
software reliability engineering practice is to analyze the par-
ticular context in which reliability measurement is to take place
so as to decide a priori which model is likely to be trustworthy.
Due to the intricacy of the human activities involved in the
software development and operation processes, as well as the
uncertain nature of software failure patterns, such a priori
determinations have never been conclusive. It has been shown that
there is no
best software reliability model for every case under all cir-
cumstances [3]. As a result, practitioners are left in a dilemma
as to which software reliability models to choose, which pro-
cedures to apply, and which prediction results to trust, while
contending with varying software development and operation prac-
tices.
2. Our Approach
The major objective of this research is to propose a new and
practical approach to software reliability measurement that tends
to yield better predictive validity. Although the software
reliability measurement problem, by its nature, involves
significant uncertainties, the proposed scheme should make more
accurate software reliability predictions, at least in an average
sense, than the traditional approach of relying on and advocating
any single model.
To increase the predictive validity of software reliability meas-
urements, we have formalized the following general combination
modeling approach:
1. Identify a basic set of models (called component models). If
   the project testing environments are well known, the models
   whose assumptions are closest to the real environments should
   be selected.
2. Select models whose biased predictions (if any) tend to cancel
   one another out.
3. Separately apply each of the component models to the failure
   data.
4. Apply certain criteria to weight the selected component models
   and form one or more linear combination models for the final
   predictions. The weights can be either static or variable.
In general, this model is expressed as the following mixed
distribution:

    f_i(t) = w_1 f_i^(1)(t) + w_2 f_i^(2)(t) + ... + w_n f_i^(n)(t)

where f_i^(j)(t) is the predictive probability density function of
the jth component model, given that the i-1 times between
successive failures have been observed. Note that
w_1 + w_2 + ... + w_n = 1 for all t's. It is also noted that this
linear combination approach tends to preserve the features
inherited from its component models. It does not further complicate
modeling practice, since each component model performs its
reliability calculations independently; the component models enter
the combination only at the last stage, for the final predictions.
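The mixture idea can be sketched in a few lines of Python. The
component densities below are hypothetical exponential densities
standing in for real model outputs (the rates are made up), and
`combine` is our own illustrative helper, not part of any tool
discussed in this paper.

```python
from math import exp

def combine(pdfs, weights):
    """Return the mixture predictive pdf: sum_j w_j * f_j(t)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to one"
    return lambda t: sum(w * f(t) for w, f in zip(weights, pdfs))

# Hypothetical stand-ins for the one-step-ahead predictive densities
# of three component models (simple exponentials, illustrative rates).
f_a = lambda t: 0.5 * exp(-0.5 * t)
f_b = lambda t: 0.3 * exp(-0.3 * t)
f_c = lambda t: 0.2 * exp(-0.2 * t)

# Equal weights give the simplest constant-weight combination.
mixture = combine([f_a, f_b, f_c], [1/3, 1/3, 1/3])
```

Each component model is evaluated independently; only the final
weighted sum couples them, which is what keeps the combination
cheap relative to formulating a new monolithic model.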
Since the selection of component models is deemed very important,
we suggest that the basic component models be selected from those
implemented in two major software reliability measurement tools,
SMERFS and SRMP. They include the following software reliability
models: (1) Bayesian Jelinski-Moranda Model (BJM)[4], (2) Brooks
and Motley Model (BM)[5], (3) Duane Model (DU) [4], (4) Geometric
Model (GM) [5], (5) Goel-Okumoto Model (GO) [6], (6) Jelinski-
Moranda Model (JM) [2], (7) Keiller-Littlewood Model (KL)[4], (8)
Littlewood Model (LM) [4], (9) Littlewood non-homogeneous Poisson
Process Model (LNHPP) [4], (10) Littlewood-Verrall Model (LV)
[7], (11) Musa-Okumoto Model (MO) [8], (12) Generalized Poisson
Model (PM)[5], (13) Schneidewind Model (SM)[5], and (14) Yamada
Delayed S-Shape Model (YM)[5]. Note that different parameter
estimation methods selected to implement these models could, to a
certain extent, affect the predictive validity of the modeling
results.
As an example to illustrate this combination approach, we chose
GO, MO, and LV as the three component models to form a set of
linear combination models. Reasons for choosing them as com-
ponent models are:
1. Their predictive validity has been observed in our recent
   investigation. In fact, they are judged to perform well by
   many practitioners, and they have been widely used[3].
2. They represent different categories of models: GO (similar to
   JM and SM) represents the exponential-shape non-homogeneous
   Poisson process (NHPP) model, MO represents the
   logarithmic-shape NHPP model, and LV represents the
   inverse-polynomial-shape Bayesian model.
3. Over the set of failure data we analyzed, their predictive
   biases tend to cancel: GO tends to be optimistic, LV tends to
   be pessimistic, and MO might go either way.
As a result, we formulated the following set of four combination
models:
1. ELC - Equally-Weighted Linear Combination Model
   This model is formed by assigning the three component models a
   constant, equal weight. The arithmetic average of all component
   models' predictions is taken as the ELC model prediction,
   namely, ELC = (GO + MO + LV)/3, where the equal weights of 1/3
   remain constant and unchanged throughout the modeling process.
   This is similar to a Delphi survey, in which authorities working
   independently are asked for an opinion on a subject, and an
   average of the results is taken. The motivation of this approach
   is to reduce the risk of relying on a specific model which may
   produce grossly inaccurate predictions, while retaining much of
   the simplicity of using the component models.
2. MLC - Median-Oriented Linear Combination Model
   Instead of choosing the arithmetic mean for the prediction as in
   ELC, the component model whose predicted value lies between the
   optimistic and pessimistic values is selected as the output of
   this model. The justification for this approach is that the
   median may be more moderate than the mean in some cases, since
   it better tolerates an erroneous prediction which is far away
   from the others.
3. ULC - Unequally-Weighted Linear Combination Model
   This model is similar to MLC except that, instead of the output
   being solely determined by the median value, the optimistic and
   pessimistic predictions make small contributions to the final
   prediction. Here we use weightings similar to the Program
   Evaluation and Review Technique (PERT), i.e., the formulation
   of this model is ULC = (O + 4M + P)/6, where O represents an
   optimistic prediction, P represents a pessimistic prediction,
   and M represents the median prediction.
4. DLC - Dynamically-Weighted Linear Combination Model
   In this model, we assume that the applicability of any
   individual model with respect to the data may change as the
   testing effort progresses. The weights of the component models
   will therefore change, based on changes in specific measures of
   a model's applicability throughout the test effort. A Bayesian
   interpretation of the prequential likelihood ratio [9] is used
   as the posterior odds that one model is temporally more valid
   than another, and these odds dynamically determine the weights
   of the component models. Note that the prequential likelihood
   measure could be taken over a short or long time window. Here we
   pick one time frame prior to each prediction as the reference in
   assigning weights.
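Viewed as operations on three point predictions, the four weighting
schemes can be sketched as below. The prediction values and
prequential likelihoods are invented for illustration, and the DLC
function is a deliberate simplification of the Bayesian
prequential-likelihood scheme: it simply normalizes recent
likelihood values into weights.

```python
def elc(preds):
    # ELC: arithmetic mean with constant, equal weights
    return sum(preds) / len(preds)

def mlc(preds):
    # MLC: select the median of the component predictions
    return sorted(preds)[len(preds) // 2]

def ulc(preds):
    # ULC: PERT-style weighting (O + 4M + P) / 6; assumes exactly
    # three predictions, sorted into optimistic/median/pessimistic
    o, m, p = sorted(preds)
    return (o + 4 * m + p) / 6

def dlc(preds, prequential_likelihoods):
    # DLC (simplified): weights proportional to each model's recent
    # prequential likelihood, so the currently more valid model
    # dominates the combined prediction
    total = sum(prequential_likelihoods)
    return sum(pl / total * x
               for pl, x in zip(prequential_likelihoods, preds))

# Hypothetical next time-to-failure predictions from three models:
preds = [12.0, 15.0, 30.0]
```

For these inputs, ELC gives 19.0, MLC gives 15.0, and ULC gives
17.0; DLC depends on the supplied likelihood values, drifting
toward whichever model has recently predicted best.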
3. Preliminary Evaluations
3.1. Model Comparison Criteria
Of the 14 possible component models which we evaluated with
respect to six selection criteria, six models (JM, GO, MO, DU,
LM, LV) were judged to perform sufficiently well to warrant
further investigation[3]. The six criteria are[1]: 1) Model
Validity; 2) Ease of Measuring Parameters; 3) Quality of
Assumptions; 4) Applicability; 5) Simplicity; and 6) Insensitivity
to Noise. Criteria (2), (4), and (5) were included because they
have been important concerns to many practitioners. However, it is
important to note that our judgements of whether a model met these
criteria were qualitative rather than quantitative. In any case,
Model Validity was of particular interest to us. For further
comparisons on Model Validity, four formally defined measures have
been adopted[4].
o Accuracy:
Defined via the prequential likelihood (PL) measure as follows.
Let the observed data be a sequence of times between successive
failures, denoted by t_1, t_2, ..., t_{i-1}. The objective is to
use the data to predict the future unobserved T_i. More precisely,
we want a good estimate of F(t), defined as P(T_i < t).