Handbook of Software Reliability Engineering


Foreword by Alfred V. Aho xix
Foreword by Richard A. DeMillo xxi
Preface xxiii


Chapter 1. Introduction

Michael R. Lyu (AT&T Bell Labs.)

1.1 The Need for Reliable Software 3
1.2 Software Reliability Engineering Concepts 5
1.3 Book Overview 8
1.4 Basic Definitions 12
1.5 Technical Areas Related to the Book 19
      1.5.1 Fault Prevention 19
      1.5.2 Fault Removal 20
      1.5.3 Fault Tolerance 20
      1.5.4 Fault/Failure Forecasting 21
      1.5.5 Scope of this Handbook 21
1.6 Summary 22
      Problems 22


Chapter 2. Software Reliability and System Reliability

Jean-Claude Laprie and Karama Kanoun (LAAS-CNRS, France)

2.1 Introduction 27
2.2 The Dependability Concept 28
      2.2.1 Basic Definitions 28
      2.2.2 On the Impairments to Dependability 28
      2.2.3 On the Attributes of Dependability 32
      2.2.4 On the Means for Dependability 33
2.3 Failure Behavior of an X-Ware System 35
      2.3.1 Atomic Systems 35
      2.3.2 Systems Made up of Components 41
2.4 Failure Behavior of an X-Ware System with Service Restoration 49
      2.4.1 Characterization of System Behavior 50
      2.4.2 Maintenance Policies 51
      2.4.3 Reliability Modeling 53
      2.4.4 Availability Modeling 60
2.5 Situation with Respect to the State-of-the-Art in Reliability Evaluation 64
2.6 Summary 68
      Problems 68


Chapter 3. Software Reliability Modeling Survey

William Farr (Naval Surface Warfare Center)

3.1 Introduction 71
3.2 Historical Perspective and Implementation 72
      3.2.1 Historical Background 72
      3.2.2 Model Classification Scheme 73
      3.2.3 Model Limitations and Implementation Issues 76
3.3 Exponential Failure Time Class of Models 77
      3.3.1 Jelinski-Moranda "De-Eutrophication" Model 77
      3.3.2 Nonhomogeneous Poisson Process Model 80
      3.3.3 Schneidewind's Model 82
      3.3.4 Musa's Basic Execution Time Model 87
      3.3.5 Hyperexponential Model 90
      3.3.6 Others 92
3.4 Weibull and Gamma Failure Time Class of Models 93
      3.4.1 Weibull Model 93
      3.4.2 S-Shaped Reliability Growth Model 95
3.5 Infinite Failure Category Models 98
      3.5.1 Duane's Model 98
      3.5.2 Geometric Model 99
      3.5.3 Musa-Okumoto Logarithmic Poisson 102
3.6 Bayesian Models 104
      3.6.1 Littlewood-Verrall Reliability Growth Model 105
      3.6.2 Other Bayesian Models 109
3.7 Model Relationships 109
      3.7.1 Generalized Exponential Model Class 109
      3.7.2 Exponential Order Statistic Model Class 111
3.8 Software Reliability Prediction in Early Phases of the Life Cycle 111
      3.8.1 Phase-Based Model 111
      3.8.2 Predicting Software Defects from Ada Design 112
      3.8.3 Rome Laboratory Work 113
3.9 Summary 114
      Problems 115


Chapter 4. Techniques for Prediction Analysis and Recalibration

Sarah Brocklehurst, Bev Littlewood (City University of London)

4.1 Introduction 119
4.2 Examples of Model Disagreement and Inaccuracy 120
      4.2.1 Simple Short Term Predictions 120
      4.2.2 Longer Term Predictions 123
      4.2.3 Model Accuracy Varies from Data Source to Data Source 126
      4.2.4 Why We Cannot Select the Best Model a Priori 126
      4.2.5 Discussion - a Possible Way Forward 127
4.3 Methods of Analyzing Predictive Accuracy 128
      4.3.1 Basic Ideas - Recursive Comparison of Predictions with Eventual Outcomes 128
      4.3.2 The Prequential Likelihood Ratio (PLR) 131
      4.3.3 The U-Plot 135
      4.3.4 The Y-Plot 140
      4.3.5 Discussion: the Likely Nature of Prediction Errors, and How We can Detect Inaccuracy 141
4.4 Recalibration 145
      4.4.1 The U-Plot as a Means of Detecting 'Bias' 145
      4.4.2 The Recalibration Technique 146
      4.4.3 Examples of the Power of Recalibration 147
4.5 A Worked Example 150
4.6 Discussion 156
      4.6.1 Summary of the Good News: Where We Are Now 156
      4.6.2 Limitations of Present Techniques 159
      4.6.3 Possible Avenues for Improvement of Methods 160
      4.6.4 Best Advice to Potential Users 162
4.7 Summary 163
      Problems 164


Chapter 5. The Operational Profile

John Musa, Bruce Juhlin, Gene Fuoco, Diane Kropfl, and Nancy Irving (AT&T Bell Labs.)

5.1 Introduction 167
5.2 Concepts 168
5.3 Development Procedure 170
      5.3.1 Customer Type List 173
      5.3.2 User Type List 173
      5.3.3 System Mode List 174
      5.3.4 Functional Profile 176
      5.3.5 Operational Profile 183
5.4 Test Selection 194
      5.4.1 Selecting Operations 195
      5.4.2 Regression Test 196
5.5 Special Issues 197
      5.5.1 Indirect Input Variables 197
      5.5.2 Updating the Operational Profile 197
      5.5.3 Distributed Systems 198
5.6 Other Uses 199
5.7 Application to DEFINITY 200
      5.7.1 Project Description 200
      5.7.2 Development Process Description 200
      5.7.3 Describing Operational Profiles 201
      5.7.4 Implementing Operational Profiles 203
      5.7.5 Conclusion 204
5.8 Application to FASTAR (Fast Automated Restoration) 204
      5.8.1 System Description 204
      5.8.2 FASTAR: SRE Implementation 206
      5.8.3 FASTAR: SRE Benefits 210
5.9 Application to the Power Quality Resource System 210
      5.9.1 Project Description 210
      5.9.2 Developing the Operational Profile 211
      5.9.3 Testing 213
      5.9.4 Conclusion 214
5.10 Summary 215
      Problems 215


Chapter 6. Best Current Practice of SRE

Mary Donnelly, Bill Everett, John Musa, and Geoff Wilson (AT&T Bell Labs.)

6.1 Introduction 219
6.2 Benefits and Approaches to SRE 220
      6.2.1 Importance and Benefits 221
      6.2.2 An SRE Success Story 221
      6.2.3 SRE Costs 222
      6.2.4 SRE Activities 223
      6.2.5 Implementing SRE Incrementally 223
      6.2.6 Implementing SRE on Existing Projects 224
      6.2.7 Implementing SRE on Short-Cycle Projects 226
6.3 SRE During Feasibility and Requirements Phase 226
      6.3.1 Feasibility Stage 226
      6.3.2 Requirements Stage 228
6.4 SRE during Design and Implementation Phase 232
      6.4.1 Design Stage 232
      6.4.2 Implementation Stage 233
6.5 SRE during the System Test and Field Trial Phase 235
      6.5.1 Determine Operational Profile 236
      6.5.2 System Test Stage 237
      6.5.3 Field Trial Stage 241
6.6 SRE during Post-Delivery and Maintenance Phase 242
      6.6.1 Project Post-Release Staff Needs 242
      6.6.2 Monitor Field Reliability vs. Objectives 243
      6.6.3 Track Customer Satisfaction 245
      6.6.4 Time New Feature Introduction by Monitoring Reliability 245
      6.6.5 Guide Product and Process Improvement with Reliability Measures 246
6.7 Getting Started with SRE 246
      6.7.1 Prepare Your Organization for SRE 247
      6.7.2 Find More Information or Support 250
      6.7.3 Do an SRE Self-Assessment 250
6.8 Summary 252
      Problems 253


Chapter 7. Software Reliability Measurement Experience

Allen Nikora (Jet Propulsion Laboratory) and Michael R. Lyu (AT&T Bell Labs.)

7.1 Introduction 255
7.2 Measurement Framework 256
      7.2.1 Establishing Software Reliability Requirements 259
      7.2.2 Setting up a Data Collection Process 266
      7.2.3 Defining Data to be Collected 267
      7.2.4 Choosing a Preliminary Set of Software Reliability Models 272
      7.2.5 Choosing Reliability Modeling Tools 273
      7.2.6 Model Application and Application Issues 273
      7.2.7 Dealing with Evolving Software 276
      7.2.8 Practical Limits in Modeling Ultrareliability 277
7.3 Investigation at JPL 278
      7.3.1 Project Selection and Characterization 278
      7.3.2 Characterization of Available Data 280
      7.3.3 Experimental Results 280
7.4 Investigation at Bellcore 281
      7.4.1 Project Characteristics 281
      7.4.2 Data Collection 284
      7.4.3 Application Results 285
7.5 Linear Combination of Model Results 289
      7.5.1 Statically-Weighted Linear Combinations 290
      7.5.2 Weight Determination Based on Ranking Model Results 290
      7.5.3 Weight Determination Based on Changes in Prequential Likelihood 291
      7.5.4 Modeling Results 291
      7.5.5 Overall Project Results 292
      7.5.6 Extensions and Alternatives 295
      7.5.7 Long-Term Prediction Capability 298
7.6 Summary 299
      Problems 300


Chapter 8. Measurement Based Analysis of Software Reliability

Ravi K. Iyer (University of Illinois) and Inhwan Lee (Tandem, Inc.)

8.1 Introduction 303
8.2 Framework 304
      8.2.1 Overview 304
      8.2.2 Operational vs. Development Phase Evaluation 306
      8.2.3 Past Work 306
8.3 Measurement Techniques 307
      8.3.1 On-Line Machine Logging 308
      8.3.2 Manual Reporting 310
8.4 Preliminary Analysis of Data 312
      8.4.1 Data Processing 312
      8.4.2 Fault and Error Classification 314
      8.4.3 Error Propagation 317
      8.4.4 Error and Recovery Distributions 320
8.5 Detailed Analysis of Data 323
      8.5.1 Dependency Analysis 324
      8.5.2 Hardware-Related Software Errors 327
      8.5.3 Evaluation of Software Fault Tolerance 328
      8.5.4 Recurrences 329
8.6 Model Identification and Analysis of Models 333
      8.6.1 Impact of Failures on Performance 333
      8.6.2 Reliability Modeling in the Operational Phase 335
      8.6.3 Failure/Error/Recovery Model 339
      8.6.4 Multiple Error Model 344
8.7 Impact of System Activity 345
      8.7.1 Statistical Models from Measurements 345
      8.7.2 Overall System Behavior Model 348
8.8 Summary 352
      Problems 353


Chapter 9. Orthogonal Defect Classification

Ram Chillarege (IBM Research)

9.1 Introduction 359
9.2 Measurement and Software 360
      9.2.1 Software Defects 361
      9.2.2 The Spectrum of Defect Analysis 364
9.3 Principles of ODC 367
      9.3.1 The Intuition 367
      9.3.2 The Design of Orthogonal Defect Classification 370
      9.3.3 Necessary Condition 371
      9.3.4 Sufficient Conditions 373
9.4 The Defect-Type Attribute 374
9.5 Relative Risk Assessment Using Defect Types 376
      9.5.1 Subjective Aspects of Growth Curves 377
      9.5.2 Combining ODC and Growth Modeling 379
9.6 The Defect Trigger Attribute 384
      9.6.1 The Trigger Concept 384
      9.6.2 System Test Triggers 387
      9.6.3 Review and Inspection Triggers 387
      9.6.4 Function Test Triggers 388
      9.6.5 The Use of Triggers 389
9.7 Multidimensional Analysis 393
9.8 Deploying ODC 396
9.9 Summary 398
      Problems 399


Chapter 10. Trend Analysis

Karama Kanoun and Jean-Claude Laprie (LAAS-CNRS, France)

10.1 Introduction 401
10.2 Reliability Growth Characterization 402
      10.2.1 Definitions of Reliability Growth 403
      10.2.2 Graphical Interpretation of the Subadditive Property 404
      10.2.3 Subadditive Property Analysis 406
      10.2.4 Subadditive Property and Trend Change 407
      10.2.5 Some Particular Situations 408
      10.2.6 Summary 409
10.3 Trend Analysis 410
      10.3.1 Trend Tests 410
      10.3.2 Example 419
      10.3.3 Typical Results That Can Be Drawn from Trend Analyses 422
      10.3.4 Summary 424
10.4 Application to Real Systems 424
      10.4.1 Software of System SS4 425
      10.4.2 Software of System S27 427
      10.4.3 Software of System SS1 427
      10.4.4 Software of System SS2 429
      10.4.5 SAV 429
10.5 Extension to Static Analysis 431
      10.5.1 Static Analysis Conduct 431
      10.5.2 Application 433
10.6 Summary 433
      Problems 435


Chapter 11. Field Data Analysis

Wendell Jones (BNR, Inc.) and Mladen Vouk (NCSU)

11.1 Introduction 439
11.2 Data Collection Principles 441
      11.2.1 Introduction 441
      11.2.2 Failures, Faults, and Related Data 442
      11.2.3 Time 444
      11.2.4 Usage 445
      11.2.5 Data Granularity 446
      11.2.6 Data Maintenance and Validation 447
      11.2.7 Analysis Environment 448
11.3 Data Analysis Principles 449
      11.3.1 Plots and Graphs 450
      11.3.2 Data Modeling and Diagnostics 454
      11.3.3 Diagnostics for Model Determination 455
      11.3.4 Data Transformations 458
11.4 Important Topics in Analysis of Field Data 459
      11.4.1 Calendar Time 461
      11.4.2 Usage Time 461
      11.4.3 An Example 462
11.5 Calendar-Time Reliability Analysis 463
      11.5.1 Case Study (IBM Corp.) 464
      11.5.2 Case Study (Hitachi) 466
      11.5.3 Further Examples 468
11.6 Usage-Based Reliability Analysis 469
      11.6.1 Case Study (Northern Telecom Telecommunication Systems) 469
      11.6.2 Further Examples 470
11.7 Special Events 472
      11.7.1 Rare Event Models 473
      11.7.2 Case Study (Space Shuttle Flight Software) 476
11.8 Availability 479
      11.8.1 Introduction 479
      11.8.2 Measuring Availability 480
      11.8.3 Empirical Unavailability 481
      11.8.4 Models 483
11.9 Summary 486
      Problems 487


Chapter 12. Software Metrics for Reliability Assessment

John Munson (University of Idaho) and Taghi Khoshgoftaar (Florida Atlantic University)

12.1 Introduction 493
12.2 Static Program Complexity 495
      12.2.1 Software Metrics 495
      12.2.2 A Domain Model of Software Attributes 496
      12.2.3 Principal Components Analysis 497
      12.2.4 The Usage of Metrics 499
      12.2.5 Relative Program Complexity 500
      12.2.6 Software Evolution 502
12.3 Dynamic Program Complexity 504
      12.3.1 Execution Profile 505
      12.3.2 Functional Complexity 505
      12.3.3 Dynamic Aspects of Functional Complexity 507
      12.3.4 Operational Complexity 509
12.4 Software Complexity and Software Quality 510
      12.4.1 An Overview 510
      12.4.2 An Application and Its Metrics 512
      12.4.3 Multivariate Analysis in Software Quality Control 514
      12.4.4 Fault Prediction Models 518
      12.4.5 Enhancing Predictive Models with Increased Domain Coverage 520
12.5 Software Reliability Modeling 523
      12.5.1 Reliability Modeling with Software Complexity Metrics 524
      12.5.2 The Incremental Build Problem 526
12.6 Summary 527
      Problems 527


Chapter 13. Software Testing and Reliability

Joseph R. Horgan (Bellcore) and Aditya P. Mathur (Purdue University)

13.1 Introduction 531
13.2 Overview of Software Testing 532
      13.2.1 Kinds of Software Testing 532
      13.2.2 Concepts from White-Box and Black-Box Testing 532
13.3 Operational Profiles 534
      13.3.1 Difficulties in Estimating the Operational Profile 535
      13.3.2 Estimating Reliability 537
13.4 Time/Structure Based Software Reliability Estimation 539
      13.4.1 Definitions and Terminology 539
      13.4.2 Basic Assumptions 540
      13.4.3 Testing Methods and Saturation Effect 541
      13.4.4 Testing Effort 541
      13.4.5 Limits of Testing Methods 542
      13.4.6 Empirical Basis of the Saturation Effect 543
      13.4.7 Reliability Overestimation due to Saturation 545
      13.4.8 Incorporating Coverage in Reliability Estimation 546
      13.4.9 Filtering Failure Data Using Coverage Information 547
      13.4.10 Selecting the Compression Ratio 551
      13.4.11 Handling Rare Events 553
13.5 A Microscopic Model of Software Risk 554
      13.5.1 A Testing-Based Model of Risk Decay 554
      13.5.2 Risk Assessment: An Example 555
      13.5.3 A Simple Risk Computation 558
      13.5.4 A Risk Browser 560
      13.5.5 The Risk Model and Software Reliability 561
13.6 Summary 563
       Problems 563


Chapter 14. Fault-Tolerant Software Reliability Engineering

David McAllister and Mladen Vouk (NCSU)

14.1 Introduction 567
14.2 Present Status 568
14.3 Principles and Terminology 569
      14.3.1 Result Verification 570
      14.3.2 Redundancy 574
      14.3.3 Failures and Faults 575
      14.3.4 Adjudication by Voting 577
      14.3.5 Tolerance 578
14.4 Basic Techniques 581
      14.4.1 Recovery Blocks 581
      14.4.2 N-Version Programming 582
14.5 Advanced Techniques 583
      14.5.1 Consensus Recovery Block 583
      14.5.2 Acceptance Voting 584
      14.5.3 N Self-Checking Programming 584
14.6 Reliability Modeling 585
      14.6.1 Diversity and Dependence of Failures 586
      14.6.2 Data-Domain Modeling 589
      14.6.3 Time-Domain Modeling 594
14.7 Reliability in the Presence of Inter-Version Failure Correlation 596
      14.7.1 An Experiment 596
      14.7.2 Failure Correlation 598
      14.7.3 Consensus Voting 599
      14.7.4 Consensus Recovery Block 601
      14.7.5 Acceptance Voting 603
14.8 Development and Testing of Multi-Version Fault-Tolerant Software 604
      14.8.1 Requirements and Design 605
      14.8.2 Verification, Validation and Testing 606
      14.8.3 Cost of Fault-Tolerant Software 607
14.9 Summary 609
        Problems 609


Chapter 15. Software Reliability Analysis using Fault Trees

Joanne Bechta Dugan (University of Virginia)

15.1 Introduction 615
15.2 Fault Tree Modeling 615
      15.2.1 Cutset Generation 617
      15.2.2 Fault Tree Analysis 619
15.3 Fault Trees as a Design Aid for Software Systems 622
15.4 Safety Validation Using Fault Trees 623
15.5 Analysis of Fault Tolerant Software Systems 627
      15.5.1 Fault Tree Model for Recovery Block System 629
      15.5.2 Fault Tree Model for N-Version Programming System 630
      15.5.3 Fault Tree Model for N Self-Checking Programming System 632
15.6 Qualitative Analysis of Fault Tolerant Software 635
      15.6.1 Methodology for Parameter Estimation from Experimental Data 635
      15.6.2 A Case Study in Parameter Estimation 639
      15.6.3 Comparative Analysis of Three Software Fault Tolerant Systems 642
15.7 System-Level Analysis of Hardware and Software System 645
      15.7.1 System Reliability/Safety Model for DRB 647
      15.7.2 System Reliability/Safety Model for NVP 648
      15.7.3 System Reliability/Safety Model for NSCP 650
      15.7.4 A Case Study in System-Level Analysis 651
15.8 Summary 657
        Problems 657


Chapter 16. Software Reliability Simulation

Robert Tausworthe (Jet Propulsion Laboratory) and Michael R. Lyu (AT&T Bell Labs.)

16.1 Introduction 661
16.2 Reliability Simulation 662
      16.2.1 The Need for Dynamic Simulation 663
      16.2.2 Dynamic Simulation Approaches 664
16.3 The Reliability Process 665
      16.3.1 The Nature of the Process 666
      16.3.2 Structures and Flows 667
      16.3.3 Interdependencies among Elements 668
      16.3.4 Software Environment Characteristics 669
16.4 Artifact-Based Simulation 669
      16.4.1 Simulator Architecture 670
      16.4.2 Results 675
16.5 Rate-Based Simulation 676
      16.5.1 Event Process Statistics 677
      16.5.2 Single-Event Process Simulation 678
      16.5.3 Recurrent Event Statistics 679
      16.5.4 Recurrent Event Simulation 681
      16.5.5 Secondary Event Simulation 682
      16.5.6 Limited Growth Simulation 683
      16.5.7 The General Simulation Algorithm 684
16.6 Rate-Based Reliability 686
      16.6.1 Rate Functions of Conventional Models 686
      16.6.2 Simulator Architecture 687
      16.6.3 Display of Results 689
16.7 The Galileo Project Application 690
      16.7.1 Simulation Experiments and Results 691
      16.7.2 Comparisons with Other Software Reliability Models 694
16.8 Summary 696
        Problems 697


Chapter 17. Neural Networks for SRE

Nachimuthu Karunanithi (Bellcore) and Yashwant Malaiya (Colorado State University)

17.1 Introduction 699
17.2 Neural Networks 700
      17.2.1 Processing Unit 700
      17.2.2 Architecture 702
      17.2.3 Learning Algorithms 705
      17.2.4 Backpropagation Learning 705
      17.2.5 Cascade-Correlation Learning Architecture 707
17.3 Application of Neural Networks for Software Reliability 709
      17.3.1 Dynamic Reliability Growth Modeling 709
      17.3.2 Identifying Fault-Prone Modules 710
17.4 Software Reliability Growth Modeling 710
      17.4.1 Training Regimes 712
      17.4.2 Data Representation Issue 712
      17.4.3 A Prediction Experiment 713
      17.4.4 Analysis of Neural Network Models 718
17.5 Identification of Fault-Prone Software Modules 718
      17.5.1 Identification of Fault-Prone Modules Using Software Metrics 719
      17.5.2 Data Set Used 719
      17.5.3 Classifiers Compared 720
      17.5.4 Data Representation 722
      17.5.5 Training Data Selection 723
      17.5.6 Experimental Approach 723
      17.5.7 Results 723
17.6 Summary 726
        Problems 726


Appendix A. Software Reliability Tools 729


Appendix B. Review of Reliability Theory, Analytical Techniques, and Basic Statistics 747


References 781


Index 821