Using Software Reliability Models More Effectively

Michael R. Lyu
ECE Department
The University of Iowa
Iowa City, IA 52242

Allen Nikora
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, CA 91109

Abstract

By making simplifying assumptions about the complicated software development and operational processes, many existing software reliability models obtain a closed-form formula describing the software failure process for the purpose of reliability measurement and prediction. However, when these models are applied to real-world data, many discrepancies between actual and predicted reliability are observed. We have not seen a model that consistently performs best when applied to a wide variety of project data. This is partly because real-world projects normally do not comply completely with modeling assumptions, and partly because project data often contain noise irrelevant to the modeling process, making prediction a formidable task. Given these observations, this paper proposes a different approach to the software reliability measurement problem. Rather than attempting to formulate new models which can cope with project-related information, we investigate ways in which existing models might be used to better advantage than they currently are. The basic idea is to combine the results of multiple models, based on determinations of the models' applicability, to obtain a better measurement of software reliability. To that end, a family of linear combination models is proposed and investigated. The predictive validity of this approach is evaluated on several historical and newly collected failure data sets. The unsuitability of any single model can be observed in its fluctuating behavior over these data sets, whereas our linear combination models consistently demonstrate satisfactory performance. Finally, to facilitate the whole procedure, a computer-aided software engineering tool that automates a large portion of the software reliability measurement task is proposed and described.

1. Introduction

The complexity and size of software systems are growing dramatically. The reliability of the software component is increasingly the determining factor of overall system reliability. These trends make the measurement of software reliability one of the major challenges for software engineers. Traditionally, software reliability modeling is a set of techniques that apply probability theory and statistical analysis to predict the reliability of software products, both quantitatively and objectively. A software reliability model specifies the general form of the dependence of the failure process on the principal factors that affect it: fault introduction, fault manifestation, failure detection and recovery, fault removal, and the operational environment [1].

The first software reliability model was formulated almost 20 years ago [2]. Since then, over 40 models have appeared in the literature [1], and it is likely that many more unpublished models are in use. The primary goal of these models is to assess current reliability and forecast future reliability by applying statistical inference techniques to the observed failure data under rational assumptions. A major difficulty in software reliability engineering practice is to analyze the particular context in which reliability measurement is to take place so as to decide a priori which model is likely to be trustworthy.
Due to the intricacy of the human activities involved in the software development and operation process, as well as the uncertain nature of software failure patterns, such a priori determinations have never been conclusive. It has been shown that there is no single best software reliability model for every case under all circumstances [3]. As a result, practitioners are left in a dilemma as to which software reliability models to choose, which procedures to apply, and which prediction results to trust, while contending with varying software development and operation practices.

2. Our Approach

The major objective of this research is to propose a new and practical approach to software reliability measurement, one that tends to produce better predictive validity. Although the software reliability measurement problem, by its nature, involves significant uncertainties, the proposed scheme should make more accurate software reliability predictions, at least in an average sense, than the traditional approach of relying on and advocating any single model.

To increase the predictive validity of software reliability measurements, we have formalized the following general combination modeling approach:

1. Identify a basic set of models (called component models). If the project testing environments are well known, the models whose assumptions are closest to the real environments should be selected.

2. Select models whose biased predictions (if any) tend to cancel each other out.

3. Apply each of the component models separately to the failure data.

4. Apply certain criteria to weight the selected component models and form one or more linear combination models for the final predictions. The weights can be either static or dynamic.

In general, this model is expressed as the following mixed distribution:

$$f_i(t) = \sum_{j=1}^{n} w_j \, f_i^{(j)}(t)$$

where $f_i^{(j)}(t)$ is the predictive probability density function of the $j$th component model, given that $i-1$ times between successive failures have been observed. Note that $\sum_{j=1}^{n} w_j = 1$ for all $t$'s. It is also noted that this linear combination approach tends to preserve the features inherited from its component models. It does not further complicate modeling practice, since each component model performs its reliability calculations independently; the component models enter the combination only at the last stage, for the final predictions.

Since the selection of component models is deemed very important, we suggest that the basic component models be selected from those implemented in two major software reliability measurement tools, SMERFS and SRMP. They include the following software reliability models: (1) Bayesian Jelinski-Moranda Model (BJM) [4], (2) Brooks and Motley Model (BM) [5], (3) Duane Model (DU) [4], (4) Geometric Model (GM) [5], (5) Goel-Okumoto Model (GO) [6], (6) Jelinski-Moranda Model (JM) [2], (7) Keiller-Littlewood Model (KL) [4], (8) Littlewood Model (LM) [4], (9) Littlewood Non-Homogeneous Poisson Process Model (LNHPP) [4], (10) Littlewood-Verrall Model (LV) [7], (11) Musa-Okumoto Model (MO) [8], (12) Generalized Poisson Model (PM) [5], (13) Schneidewind Model (SM) [5], and (14) Yamada Delayed S-Shape Model (YM) [5]. Note that the parameter estimation methods chosen to implement these models could, to a certain extent, affect the predictive validity of the modeling results.

As an example to illustrate this combination approach, we chose GO, MO, and LV as the three component models to form a set of linear combination models.
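Before turning to why these three models were chosen, here is a minimal sketch of the mixed-distribution formulation above, in Python. The component-model interface (a predictive density as a plain callable) and the exponential stand-in densities are our own illustrative assumptions, not part of SMERFS or SRMP:

```python
import math
from typing import Callable, Sequence

def combined_pdf(weights: Sequence[float],
                 component_pdfs: Sequence[Callable[[float], float]]
                 ) -> Callable[[float], float]:
    """Mixed distribution f_i(t) = sum_j w_j * f_i^(j)(t).

    component_pdfs[j] is the j-th model's predictive density for the
    i-th time between failures, already conditioned on t_1 .. t_{i-1}.
    The weights must sum to one.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"

    def f(t: float) -> float:
        return sum(w * pdf(t) for w, pdf in zip(weights, component_pdfs))

    return f

# Stand-in exponential predictive densities with hypothetical rates:
go = lambda t: 0.10 * math.exp(-0.10 * t)  # in place of GO's f_i(t)
mo = lambda t: 0.08 * math.exp(-0.08 * t)  # in place of MO's f_i(t)
lv = lambda t: 0.05 * math.exp(-0.05 * t)  # in place of LV's f_i(t)

f_combined = combined_pdf([1/3, 1/3, 1/3], [go, mo, lv])
print(f_combined(10.0))  # combined predictive density at t = 10
```

Because each component density is computed independently and only summed at the end, the combination adds essentially no cost to the modeling step itself, consistent with the point made above.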
Reasons for choosing them as component models are:

1. Their predictive validity has been observed in our recent investigation. In fact, they are judged to perform well by many practitioners, and they have been widely used [3].

2. They represent different categories of models: GO (similar to JM and SM) represents the exponential-shape non-homogeneous Poisson process (NHPP) models, MO represents the logarithmic-shape NHPP models, and LV represents the inverse-polynomial-shape Bayesian models.

3. Over the set of failure data we analyzed, their predictive biases tend to cancel: GO tends to be optimistic, LV tends to be pessimistic, and MO may go either way.

As a result, we formulated the following set of four combination models (a sketch of the weighting arithmetic follows the list):

1. ELC: Equally-Weighted Linear Combination Model

This model is formed by assigning the three component models a constant, equal weight; the arithmetic average of the component models' predictions is taken as the ELC prediction, namely

$$\mathrm{ELC} = \frac{1}{3}\,\mathrm{GO} + \frac{1}{3}\,\mathrm{MO} + \frac{1}{3}\,\mathrm{LV}$$

where the weights remain constant and unchanged throughout the modeling process. This is similar to a Delphi survey, in which authorities working independently are asked for an opinion on a subject and an average of the results is taken. The motivation of this approach is to reduce the risk of relying on a specific model which may produce grossly inaccurate predictions, while retaining much of the simplicity of using the component models.

2. MLC: Median-Oriented Linear Combination Model

Instead of taking the arithmetic mean as in ELC, the component model whose predicted value lies between the optimistic and pessimistic values is selected as the output of this model. The justification for this approach is that the median may be more moderate than the mean in some cases, since it better tolerates an erroneous prediction that lies far from the others.

3. ULC: Unequally-Weighted Linear Combination Model

This model is similar to MLC except that, instead of the output being determined solely by the median value, the optimistic and pessimistic predictions also make small contributions to the final prediction. Here we use weightings similar to those of the Program Evaluation and Review Technique (PERT), i.e., the formulation of this model is

$$\mathrm{ULC} = \frac{1}{6}\,O + \frac{4}{6}\,M + \frac{1}{6}\,P$$

where O represents an optimistic prediction, P represents a pessimistic prediction, and M represents the median prediction.

4. DLC: Dynamically-Weighted Linear Combination Model

In this model, we assume that the applicability of any individual model with respect to the data may change as the testing effort progresses. The weights of the component models will therefore change, based on changes in specific measures of a model's applicability throughout the test effort. A Bayesian interpretation of the prequential likelihood ratio [9] is used as the a posteriori odds that one model is more valid than another at a given time, dynamically determining the weights of the component models. Note that the prequential likelihood measure can be taken over a short or a long time window; here we take the one time frame prior to each prediction as the reference in assigning weights.
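The sketch below illustrates these weighting schemes, assuming each component model has already produced a point prediction (say, the predicted next time between failures). The function names and sample values are illustrative only; DLC's weights come from the prequential likelihood machinery sketched in Section 3.1:

```python
def elc(predictions):
    """ELC: arithmetic mean of the component predictions."""
    return sum(predictions) / len(predictions)

def mlc(predictions):
    """MLC: the median prediction, i.e. the one lying between the
    optimistic and pessimistic extremes."""
    return sorted(predictions)[len(predictions) // 2]

def ulc(predictions):
    """ULC: PERT-style weighting (O + 4M + P) / 6.  The formula is
    symmetric in O and P, so sorting the three predictions suffices."""
    lo, mid, hi = sorted(predictions)
    return (lo + 4.0 * mid + hi) / 6.0

def dlc(predictions, weights):
    """DLC: weights supplied externally, e.g. normalized prequential
    likelihoods over the chosen time window."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical next-time-between-failures predictions from GO, MO, LV:
preds = [120.0, 95.0, 70.0]
print(elc(preds), mlc(preds), ulc(preds), dlc(preds, [0.5, 0.3, 0.2]))
```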
3. Preliminary Evaluations

3.1. Model Comparison Criteria

Of the 14 possible component models, which we evaluated with respect to six selection criteria, six (JM, GO, MO, DU, LM, LV) were judged to perform sufficiently well to warrant further investigation [3]. The six criteria are [1]: 1) Model Validity; 2) Ease of Measuring Parameters; 3) Quality of Assumptions; 4) Applicability; 5) Simplicity; and 6) Insensitivity to Noise. Criteria (2), (4), and (5) were included because they have been important concerns for many practitioners. However, it is important to note that our judgements of whether a model met these criteria were qualitative rather than quantitative. In any case, Model Validity was of particular interest to us. For further comparisons on Model Validity, four formally defined measures have been adopted [4].

o Accuracy: Defined by the prequential likelihood (PL) measure, as follows. Let the observed data be a sequence of times between successive failures, denoted $t_1, t_2, \ldots, t_{i-1}$. The objective is to use these data to predict the future, unobserved $T_i$. More precisely, we want a good estimate of $F(t)$, defined as $P(T_i \le t)$.
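To make the PL measure concrete, and to show how it can drive the DLC weights of Section 2, here is a minimal sketch; the function names and the exponential stand-in densities are our own illustrative assumptions:

```python
import math

def prequential_likelihood(predictive_pdfs, observed_times):
    """PL of one model: the product of its one-step-ahead predictive
    densities, each evaluated at the time between failures actually
    observed next.  predictive_pdfs[k] is the density conditioned on
    the first k observations; observed_times[k] is t_{k+1}."""
    pl = 1.0
    for f, t in zip(predictive_pdfs, observed_times):
        pl *= f(t)
    return pl

def dlc_weights(pls):
    """Normalize the component models' PLs (computed over the chosen
    window) into weights: the posterior-odds reading of the
    prequential likelihood ratio."""
    total = sum(pls)
    return [pl / total for pl in pls]

# Two stand-in models, each issuing exponential predictive densities:
obs = [50.0, 60.0, 80.0]
model_a = [lambda t: 0.02 * math.exp(-0.02 * t)] * 3
model_b = [lambda t: 0.01 * math.exp(-0.01 * t)] * 3
pls = [prequential_likelihood(m, obs) for m in (model_a, model_b)]
print(dlc_weights(pls))  # the better predictor earns the larger weight
```

A model whose predictive densities consistently assign high probability to the observations that actually occur accumulates a larger PL, and hence a larger weight in the combination.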