A Linear Combination Software Reliability Modeling Tool with a Graphically-Oriented User Interface

Michael R. Lyu
Electrical and Computer Engineering Dep't.
University of Iowa
Iowa City, IA 52242
e-mail: lyu@hitchcock.eng.uiowa.edu

Allen P. Nikora
Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive
Pasadena, CA 91109-8099
e-mail: bignuke@spa1.jpl.nasa.gov

Thomas M. Antczak
Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive
Pasadena, CA 91109-8099

Abstract

In previous papers, we have shown that forming linear combinations of model results tends to yield more accurate predictions of software reliability. Using linear combinations also simplifies the software reliability practitioner's task of deciding which model or models to apply to a particular development effort. However, no commercially available tool currently permits such combinations to be formed within the environment provided by the tool. In addition, most software reliability modeling tools do not take advantage of the high-resolution displays available today: performing actions within the tool may be awkward, and the output may be understandable only to a specialist. We propose a software reliability modeling tool that allows users to formulate linear combination models, that can be operated by non-specialists, and that produces results in a form understandable to software development and management personnel.

1. Introduction

Over the past twenty years, many software reliability models have appeared in the literature [Littlewood et al. 86]. Many of these models have been shown to apply to a sufficiently large number of failure data sets that development efforts can have some degree of confidence in using one or more of them. Techniques for recalibrating models [Littlewood 90] and for combining model results in a linear fashion [Lyu and Nikora] have been developed that appear to yield more accurate predictions than the use of a single model. However, these models have not been used as widely as one might expect. A principal factor is that it does not seem possible to determine a priori which model or models will be best suited to a particular development effort. Another difficulty has been the lack of modeling tools that are easy for the non-specialist to use. For instance, many of the tools currently available were initially developed prior to the widespread availability of high-resolution displays, and therefore employ character-oriented user interfaces [SMERFS, SRMP, ITT Tool]. This can result in terse and cryptic command sequences, making it difficult for non-specialists or casual users to perform modeling actions with the tool. Because the results are displayed in a character-oriented fashion, they tend to be expressed in ways that are not easily understood by non-specialists (e.g. model parameter values, or tabular displays of interfailure times rather than failure rate curves). Considering the schedule pressures under which software developers and managers frequently operate, there is little incentive to learn how to operate a complicated new tool. Finally, the tools available today do not allow users to form linear combinations of model results within the tool environment.
In earlier papers, we have shown that linear combinations of individual models can yield more accurate reliability predictions than the individual models themselves. To form linear combinations with current tools, the tool must be run several times to obtain the results of the desired component models, and those results must then be combined in an application separate from the tool. This consumes more time than would be required if linear combinations could be formed within the tool environment. In this paper we propose an architecture for a software reliability modeling tool that:

1. Supports the formation of linear combinations of model results within the tool environment.
2. Allows non-specialists to operate the tool and easily interpret the model results.

We refer to this tool as a Computer-Aided Software Reliability Estimation (CASRE) tool.

2. User Analysis

In developing the user interface for CASRE, it was necessary to identify the types of users that would make use of the tool. The following six types of users were identified:

1. Project managers
2. Line managers
3. Software development staff (system, software, and test engineers)
4. Software support staff (configuration management and product assurance personnel)
5. Consultants
6. Researchers

For each of these user categories, we describe below their role in the software reliability measurement task, and further classify them according to schemes suggested in [Sutcliffe and Schneiderman], using the following categories and value ranges:

User Knowledge
- task: knowledge of software reliability measurement techniques, rated as novice, skilled, or expert
- computer: knowledge in the use of computers to accomplish the task, rated as novice, skilled, or expert
- syntax: knowledge of the syntax of actions required to accomplish the task, rated as novice, skilled, or expert

Frequency: how often the user is involved in the software reliability measurement task - hourly, daily, weekly, monthly, or intermittent

Discretion: rated as compulsory or optional

Workload: proportion of time estimated to be dedicated to software reliability measurement - rated as low, medium, or high

Interaction: data entry, low-level functions (e.g. synthesis of new combination models), high-level functions (e.g. execution of one or more pre-specified models), all functions, or uses output only

2.1. Project Managers

Project Managers are typically former engineers who have made the transition to management. As such, they are familiar with basic techniques for interpreting statistical information, but may not be familiar with the details of statistical modeling. These individuals typically receive reports generated by the support staff and use them as input to their decision-making process.

Knowledge - task: skilled-; computer: skilled-; syntax: novice. Frequency: monthly. Discretion: optional. Workload: low. Interaction: receives hardcopy reports; rarely interacts directly with the tool.

2.2. Line Managers

Line Managers are usually also former engineers who have made the transition to management. As with Project Managers, these individuals are familiar with basic techniques for interpreting statistical information. Line Managers receive reports from their support staff and use them as input to their decision process. Since Line Managers are usually closer to the actual development effort, they tend to request reports more frequently than Project Managers. They may also use some of the basic tool capabilities (e.g. running pre-specified models, but not creating new ones).

Knowledge - task: skilled-; computer: skilled-; syntax: novice. Frequency: biweekly. Discretion: optional. Workload: low. Interaction: receives hardcopy reports; occasional use of the tool's high-level functions.
2.3. Development Staff

Users of software reliability measurement techniques within the development organization include system engineers, software engineers, programmers, and test engineers. These individuals typically have degrees in technical disciplines, extensive software development experience, and additional training in the methods and tools that apply to their assignments. They use modern software development tools on a regular basis. They are familiar with the basics of probability theory and statistics, and may have advanced training in statistical modeling techniques. Currently, however, they have rarely had training in software reliability theory, methods, or tools.

Knowledge - task: skilled-; computer: expert; syntax: expert. Frequency: weekly. Discretion: compulsory, subject to Project or Line Management policy. Workload: low+. Interaction: uses the tool's high-level functions.

2.4. Support Staff

Users of software reliability measurement techniques within the support staff include configuration management specialists and quality assurance personnel. These individuals include both clerical staff and technical personnel. Most have extensive experience in configuration management and quality assurance activities across a wide range of projects. Consequently, some support staff members have training or experience with software reliability measurement techniques at various levels.

Knowledge - task: novice+; computer: skilled; syntax: skilled. Frequency: weekly. Discretion: compulsory, subject to Project and Line Management policy. Workload: low+. Interaction: primarily high-level functions; some low-level functions.

2.5. Consultants

A software reliability consultant typically has an advanced degree in a technical discipline, an extensive background in all aspects of software reliability measurement, and significant software development experience. This individual plays a key role in introducing software reliability measurement techniques into a project at all levels. This includes assisting Project and Line Managers in setting software reliability goals and interpreting results, and assisting the development and support staffs in selecting and using models and support tools. Consultants may also work with researchers to transfer academic findings to specific application domains.

Knowledge - task: expert-; computer: expert; syntax: expert. Frequency: intermittent. Discretion: optional. Workload: high. Interaction: all functions.

2.6. Researchers

Researchers are typically members of a university faculty who develop or refine reliability models. Researchers may work with consultants in transferring knowledge from the academic environment to specific application domains.

Knowledge - task: expert; computer: skilled; syntax: expert-. Frequency: daily. Discretion: optional. Workload: high. Interaction: all functions.

2.7. User Analysis Summary and Recommendations

Table 1 summarizes the user analysis given above.
User         Task      Computer  Syntax   Frequency     Discretion  Workload  Interaction
-----------  --------  --------  -------  ------------  ----------  --------  -----------------------------------------
Proj Mgr     skilled-  skilled-  novice   monthly       optional    low       hardcopy reports
Line Mgr     skilled-  skilled-  novice   bi-weekly     optional    low       hardcopy; occasional high-level functions
Dev Staff    skilled-  expert    expert   weekly        compulsory  low+      high-level functions
Supp. Staff  novice+   skilled   skilled  weekly        compulsory  low+      high-level functions; some low-level
Consultant   expert-   expert    expert   intermittent  optional    high      high- and low-level functions
Researcher   expert    skilled   expert   daily         optional    high      high- and low-level functions

Table 1 - Summary User Profiles (Task, Computer, and Syntax are the three user-knowledge ratings)

We see from the table that the reliability measurement task is performed within a software development effort on, at best, a weekly basis (discounting the time that may have been spent with a consultant in setting up a reliability measurement program). Also, some of the users performing the task most frequently have the lowest level of reliability measurement knowledge. Given these factors, low learning time and good retention over time were the primary concerns in designing the CASRE interface. These findings suggest that a menu-oriented or direct-manipulation style of interaction, or perhaps a combination of the two, is appropriate. While important, good speed of performance was not deemed as critical as the other two goals. When running a complicated model, such as the Littlewood-Verrall model, on a set of failure data, it is to be expected that results will not be immediately available. We therefore specified the goal that the throughput of the modeling section be at least comparable to that of the more popular tools currently in use.

3. The CASRE Tool - High-Level Structure and Functionality

In this section we describe the high-level architecture and basic functionality of the CASRE tool. To implement the recommendations resulting from the user analysis, we plan to implement CASRE on top of a windowing system (e.g. X Windows/MOTIF, DOS Windows 3.0). Figure 1 shows the proposed high-level architecture for CASRE, whose major functional areas are:

- Data Modification
- Failure Data Analysis
- Modeling and Measurement
- Modeling/Measurement Results Display

Figure 1: High-Level Architecture for CASRE

Much of CASRE's functionality is available in current software reliability tools [SMERFS, SRMP]. However, a feature unique to CASRE allows users to combine the results of several models in addition to executing a single model. Feedback from the Model Evaluation block assists users in identifying the model or combination of models best suited to the failure data being analyzed. Moreover, the I/O facilities, the user interface, and the measurement procedures are greatly enhanced in this tool.

3.1. Data Modification

CASRE allows users to create new failure data files, modify existing files, and perform global operations on files.

Editing

CASRE allows users to create or alter failure history data files. A simplified spreadsheet-like user interface allows users to enter times between failures, or test interval lengths and failure counts, from the keyboard. Users are also allowed to invoke a preferred editor (e.g. emacs or vi).

3.2. Smoothing

Since input data to the models is often fairly noisy, the following smoothing techniques are proposed:

- Sliding rectangular window
- Hann window
- General polynomial fit
- Cardinal spline
- Specific cubic-polynomial fits (e.g. B-spline, Bezier curve)

Users select smoothing techniques appropriate to the failure data being analyzed. The smoothed input data can be plotted, used as input to a reliability model, or written out to a new file for later use. Summary statistics for the smoothed data can also be displayed (see "Failure Data Analysis" below).
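To make the first two of these techniques concrete, the sketch below applies a sliding rectangular window and a Hann window to failure counts. It is a minimal illustration only: the data layout (a flat list of failure counts per test interval) and the function names are assumptions, not CASRE's actual interfaces; the Hann coefficients follow the standard raised-cosine definition.

```python
# Sketch of two of the proposed smoothing filters: a sliding
# rectangular window (moving average) and a Hann window.
import math

def rectangular_smooth(data, width=5):
    """Replace each point with the mean of a window centered on it."""
    half = width // 2
    out = []
    for i in range(len(data)):
        window = data[max(0, i - half):i + half + 1]
        out.append(sum(window) / float(len(window)))
    return out

def hann_smooth(data, width=5):
    """Weighted moving average using Hann (raised-cosine) weights:
    w(n) = 0.5 * (1 - cos(2*pi*n / (width - 1))), n = 0..width-1."""
    half = width // 2
    weights = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (width - 1)))
               for n in range(width)]
    out = []
    for i in range(len(data)):
        num = den = 0.0
        for n in range(width):
            j = i - half + n          # data index under weight n
            if 0 <= j < len(data):    # truncate the window at the edges
                num += weights[n] * data[j]
                den += weights[n]
        out.append(num / den)
    return out

# Illustrative failure counts per test interval:
failures_per_interval = [3, 7, 2, 9, 4, 6, 1, 5]
print(rectangular_smooth(failures_per_interval))
print(hann_smooth(failures_per_interval))
```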
3.3. Data Transformation

In some situations, logarithmic, exponential, or linear transformations of the failure data produce better or more understandable results. The following operations, currently available in some tools, allow users to transform an entire set of failure data in this manner (x(i) represents a failure data item; a and b are user-selectable scale factors):

- log(a * x(i) + b)
- exp(a * x(i) + b)
- x(i) ** a
- x(i) + a
- x(i) * a

User-specified transformations might also be allowed. As with smoothing, users select a specific transformation and are able to manipulate transformed data as they would smoothed data.

3.4. Failure Data Analysis

The "Summary Statistics" block in Figure 1 allows users to display the failure data's summary statistics, including the mean and median of the failure data, the 25% and 75% hinge points, skewness, and kurtosis [Hogg and Craig].

3.5. Modeling and Measurement

Figure 1 shows two modeling functions. The "Models" block executes single software reliability models on a set of failure data. The "Model Combination" block allows users to execute several models on the failure data and combine their results. We include this capability because our experience in combining the results of more than one model indicates that such "combination models" may provide more accurate reliability predictions than single models. The block labeled "Model Evaluation" allows users to determine the applicability of a model to a set of failure data.

Single Model Execution

Based on our experience in applying software reliability models, we include the following models in CASRE: BJM, GO, JM, KL, LM, LNHPP, LV, MO, PM, SM, and YM. The models should be implemented to allow input in the form of either interfailure times or failure frequencies. CASRE allows users to choose the parameter estimation method (maximum likelihood, least squares, or method of moments). Model outputs include:

- Current estimates of the failure rate/interfailure time
- Current estimates of reliability
- Model parameter values, including high and low parameter values for a user-selectable confidence bound
- Current values of the pdf and cdf
- The probability integral transform u_i [Littlewood 1986 IEEE]
- The normalized logarithmic transform of u_i, denoted y_i [Littlewood 1986 IEEE]

Users can display these quantities on-screen or write them to disk.
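As an illustration of single-model execution under maximum-likelihood estimation, the sketch below fits the Jelinski-Moranda (JM) model, one of the models listed above, to a set of interfailure times. It follows the standard JM likelihood equations; the function name and data layout are illustrative assumptions, and the implemented tool is expected to rely on existing library routines instead (as discussed later).

```python
# Sketch of maximum-likelihood fitting for the Jelinski-Moranda (JM)
# model, in which the failure rate after the (i-1)-th failure is
# phi * (N - i + 1) for N initial faults.

def jm_fit(times, n_hi=1.0e6, iters=100):
    """Return (N, phi), or None if the data admit no finite estimate."""
    n = len(times)
    total = sum(times)                                  # T = sum of t_i
    weighted = sum(i * t for i, t in enumerate(times))  # sum (i-1)*t_i

    def score(cap):
        # Profile likelihood equation in N alone:
        #   sum_{i=1..n} 1/(N - i + 1) = n*T / (N*T - sum (i-1)*t_i)
        lhs = sum(1.0 / (cap - i) for i in range(n))
        return lhs - n * total / (cap * total - weighted)

    lo = float(n)
    if score(lo) <= 0.0:
        n_hat = lo               # likelihood peaks at the boundary N = n
    elif score(n_hi) >= 0.0:
        return None              # no finite maximum: no growth visible
    else:
        hi = n_hi
        for _ in range(iters):   # bisect for the root of the score
            mid = 0.5 * (lo + hi)
            if score(mid) > 0.0:
                lo = mid
            else:
                hi = mid
        n_hat = 0.5 * (lo + hi)
    phi = n / (n_hat * total - weighted)
    return n_hat, phi

# Interfailure times showing mild reliability growth:
print(jm_fit([10.0, 8.0, 12.0, 14.0]))
```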
3.6. Combination Models

CASRE allows users to combine the results of several models according to the Equally-Weighted Linear Combination (ELC), Median-Oriented Linear Combination (MLC), Unequally-Weighted Linear Combination (ULC), or Dynamically-Weighted Linear Combination (DLC) schemes described in earlier papers. Users may also be allowed to define their own weighting schemes. The resulting combination models can themselves be used as component models to form further combination models.
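The sketch below illustrates the four schemes for a single prediction step, assuming each component model contributes a point prediction such as the predicted next interfailure time. The ELC, MLC, and ULC computations follow directly from the scheme names; for DLC we assume, consistent with the model-evaluation feedback described in the next subsection, that the dynamic weights are proportional to each component's recent prequential likelihood. Names and values are illustrative only.

```python
# Sketch of the four linear combination schemes for one prediction step.
# 'predictions' holds each component model's point prediction.

def elc(predictions):
    """Equally-Weighted Linear Combination: simple arithmetic mean."""
    return sum(predictions) / float(len(predictions))

def mlc(predictions):
    """Median-Oriented Linear Combination: the middle prediction."""
    s = sorted(predictions)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

def ulc(predictions, weights):
    """Unequally-Weighted Linear Combination with fixed user weights."""
    total = float(sum(weights))
    return sum(w * p for w, p in zip(weights, predictions)) / total

def dlc(predictions, recent_pl):
    """Dynamically-Weighted Linear Combination: weights assumed
    proportional to each component's recent prequential likelihood,
    so they change as the models' relative performance changes."""
    return ulc(predictions, recent_pl)

preds = [105.0, 92.0, 131.0]          # three component models
print(elc(preds))                     # 109.33...
print(mlc(preds))                     # 105.0
print(ulc(preds, [1.0, 2.0, 1.0]))    # 105.0
print(dlc(preds, [0.2, 0.5, 0.3]))    # 106.3
```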
3.7. Model Evaluation

CASRE includes the following statistical methods to help users determine the applicability of a model (including "combination models") to a specific failure data set:

- Computation of the prequential likelihood (PL) function (the "Accuracy" criterion)
- Determination of the probability integral transform u_i, plotted as the u-plot (the "Bias" criterion)
- Computation of y_i to produce the y-plot (the "Trend" criterion)
- Noisiness of model predictions (the "Noise" criterion)
- The Akaike Information Criterion (AIC) [Akaike IEEE], similar in concept to prequential likelihood, could also be implemented

The model evaluation function also computes goodness-of-fit measures (e.g. the Chi-Square test). The PL and AIC outputs are used as input to "Model Combination" to determine the relative contributions of individual models when the user has specified a combination model.
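The sketch below shows how the "Bias" and "Accuracy" criteria might be computed, assuming each one-step-ahead prediction has already been reduced to the probability integral transform u_i = F_i(t_i) and the predictive density value f_i(t_i). The u-plot's Kolmogorov distance and the (log) prequential likelihood follow the definitions in [Littlewood 1986 IEEE]; the function names and numbers are illustrative.

```python
# Sketch of the "Bias" and "Accuracy" computations from u_i and f_i(t_i).
import math

def u_plot_distance(us):
    """Kolmogorov distance between the empirical cdf of the u_i and
    the line of unit slope; a large value indicates a biased model."""
    us = sorted(us)
    n = len(us)
    dist = 0.0
    for i, u in enumerate(us):
        # compare u with the empirical cdf just before and at its step
        dist = max(dist, abs(u - i / float(n)), abs(u - (i + 1) / float(n)))
    return dist

def log_prequential_likelihood(densities):
    """Log of the product of one-step-ahead predictive densities at
    the observed times; larger values mean more accurate predictions."""
    return sum(math.log(f) for f in densities)

# Illustrative values only:
us = [0.31, 0.72, 0.55, 0.90, 0.12]
fs = [0.0040, 0.0031, 0.0012, 0.0057, 0.0044]
print(u_plot_distance(us))
print(log_prequential_likelihood(fs))
```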
3.8. Display of Results

CASRE graphically displays model results in the following forms:

- Interfailure times/failure frequencies, actual and estimated
- Cumulative failures, actual and estimated
- Reliability growth, actual and estimated

Actual and estimated quantities are available on the same plot, and plots include user-specified confidence limits. Users are able to control the range of data to be plotted as well as the usual cosmetic aspects of the plot (e.g. X and Y scaling, titles). In a windowing environment, multiple plots could be displayed simultaneously. CASRE allows users to save plots displayed on-screen as a disk file or to print them. One public-domain tool, SMERFS version 4 [SMERFS], can write the data used to produce a plot to a file that can be imported by a spreadsheet, a DBMS, or a statistics package for further analysis; CASRE includes this capability. The plotting function also produces u-plots and y-plots from Model Evaluation's u_i and y_i outputs. These plots indicate the degree and direction of model bias and the way in which the bias changes over time.

4. Application Procedure

Figures 2-8 show a series of screen images of the described CASRE tool, using simulated failure data. They illustrate how a project's software reliability measurement activity, including extensive use of the linear combination approach elaborated in this paper, can be systematically investigated and engineered to whatever extent the user wishes.

Screen 1 - opening a failure data file

The screen is shown in Figure 2. To choose a set of failure data on which a reliability model will be run, the user selects the "File" menu with the mouse. After selecting the "Open" option in the File menu, a dialogue box for selecting a file appears on the screen. The current directory appears in the editable text window at the top of the dialogue box, and the failure history files in that directory are listed in the scrolling text window. The user selects a file by highlighting its name (scrolling the file name window if necessary) and then pressing the "Open" button. To change the current directory, the user enters the name of the new directory in the "Current Directory" window and presses the "Change Directory" button. Pressing the "Cancel" button removes the dialogue box from the screen.

Screen 2 - preliminary failure data analysis

The screen is shown in Figure 3. After a failure history file is opened, its contents are displayed in tabular and graphic forms. The tabular representation resembles a spreadsheet, and the user can perform similar types of operations (e.g. selecting a range of data, deleting one or more rows of data). All of the fields can be changed by the user except the "Interval Number" field (or the "Error Number" field if the data is interfailure times). In this example, the selected data set is in the form of test interval lengths and numbers of failures per test interval. The user can scroll up and down through this tabular representation and resize it according to the MOTIF or DOS Windows conventions.

The large graphics window displays the same data as the worksheet. If the failure data set is interfailure times, the initial graphical display is interfailure times; if, as in this example, the failure data set is test interval lengths and failure counts, the initial graphical display is the number of failures per test interval. The display type can be changed by selecting one of the items from the "Display Type" menu associated with the graphics window. The user can move forward and backward through the data set by pressing the right-arrow or left-arrow buttons at the bottom of the graphics window.

Finally, the iconified window at the lower left corner of the screen lists the summary statistics for the data. To open this window, the user clicks on the icon. The following information is then displayed in a separate window:

- Number of observations in the data set
- Type of observations made (interfailure times, or test interval lengths and failure counts)
- Mean value of the observations
- Minimum and maximum values
- Median
- 25% and 75% hinges
- Standard deviation and variance
- Skewness and kurtosis

Screen 3 - failure data selection and editing

The screen is shown in Figure 4. The user will frequently use only a portion of the data set to estimate the current reliability of the software, either because testing methods changed during the testing effort or because different portions of the data set represent failures in different portions of the software. To use only a subset of the selected data set, the user may simply "click and drag" on the tabular representation of the data set to highlight a specific range of observations. The user may also select previously-defined data ranges. To do this, the user chooses the "Select Range" option of the Edit menu, which brings up a dialogue box containing a scrolling text window listing the names of previously-defined data ranges and the points they represent. To select a particular range, the user highlights the name of the range in the scrolling text window and presses the "OK" button. Pressing the "Cancel" button removes the dialogue box and the Edit menu from the screen. Once a range has been selected, all future modeling operations apply only to that range. The selected data range is highlighted in the tabular representation, and the graphics display changes to show only the highlighted data range; all other observations are removed from the graphics display.

Screen 4 - data filtering

The screen is shown in Figure 5. After selecting a data range, the user may wish to transform or smooth the data. Software failure data is frequently very noisy; smoothing the data or otherwise transforming it may improve the modeling results. To do this, the user selects one of the options in the "Filter" menu. There are five transformations which the user may apply to the data, and six types of smoothing. Transformations and smoothing operations may be pipelined - for example, the user could select the "ln(A * X(i) + B)" transformation followed by the B-spline smoothing operation. The number of filters that may be pipelined is limited only by the amount of available memory. The tabular representation of the failure data changes to reflect the filter, as does the graphical display of the data. The type of filter applied to the data is listed at the right-hand edge of the graphics display window. In this example, we have applied a B-spline to the data. Once a series of filters has been applied to the data, the user may remove the effect of the most recent filter by selecting the "Undo" option of the Filter menu. To remove the effect of the entire series of filters, the user selects the "Undo All Filters" option of the Filter menu.

Screen 5 - applying software reliability models

The screen is shown in Figure 6. After the user has opened a file, selected a data range, and done any smoothing or other transformation of the data, a software reliability model can be run on the data. In the Model menu, the user has the choice of 13 individual models or a set of models which combine the results of two or more of the individual models. The user may also choose the method of parameter estimation (maximum likelihood, least squares, or method of moments), the confidence bounds that will be calculated for the selected model, and the interval of time over which predictions of future failure behavior will be made.

Screen 6 - selecting the best model(s)

The screen is shown in Figure 7. There are many models from which to choose in this tool, and the user may not know which model is most appropriate for the data set being analyzed. Using CASRE, the user can request, in effect, "display the results of the individual model which best meets the four prioritized criteria of accuracy (based on prequential likelihood), bias, trend, and noisiness of prediction." To do this, the user first selects the "Individual" option of the Model menu. A submenu then appears listing the 13 individual models as well as a "Choose Best" option. Selecting the "Choose Best" option displays a "Selection Criteria" dialogue box. The user moves the four sliders in this dialogue box back and forth to establish the relative priorities of the four criteria; numerical values of the priorities are displayed in the text boxes on the right side of the dialogue box. Once the priorities have been established, the user presses the "OK" button. CASRE then runs all of the individual models against the data set, first warning the user that this is a time-consuming operation and allowing cancellation. If the user continues, CASRE provides the opportunity to cancel at any time if the user decides that the operation is taking too much time. One plausible way of combining the four prioritized criteria into a single selection is sketched below.
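The paper does not fix how the four prioritized criteria are combined into a single choice, so the following sketch shows one plausible realization: rank the models on each criterion (rank 1 = best, with each criterion's values oriented so that smaller is better, e.g. negative log prequential likelihood for accuracy), then select the model with the smallest priority-weighted rank sum. All names and numbers are illustrative.

```python
# Hypothetical realization of "Choose Best": priority-weighted rank sum.

def choose_best(scores, priorities):
    """scores: {model: {criterion: value}}, smaller value = better.
    priorities: {criterion: weight} taken from the four slider settings."""
    ranks = {model: 0.0 for model in scores}
    for criterion, weight in priorities.items():
        # rank models on this criterion, best (smallest value) first
        ordering = sorted(scores, key=lambda m: scores[m][criterion])
        for rank, model in enumerate(ordering, start=1):
            ranks[model] += weight * rank
    return min(ranks, key=ranks.get)

# Illustrative values only:
scores = {
    "JM": {"accuracy": 410.2, "bias": 0.11, "trend": 0.09, "noise": 2.1},
    "GO": {"accuracy": 408.7, "bias": 0.19, "trend": 0.06, "noise": 2.7},
    "LV": {"accuracy": 412.5, "bias": 0.14, "trend": 0.08, "noise": 1.8},
}
priorities = {"accuracy": 4.0, "bias": 3.0, "trend": 2.0, "noise": 1.0}
print(choose_best(scores, priorities))   # "GO" for these numbers
```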
Screen 7 - displaying final results

The screen is shown in Figure 8. Once a model has been run on the failure data, the results are graphically displayed. Actual and predicted data points are shown, as are confidence bounds. The model is identified in the window's title bar; the percent confidence bounds are given at the right side of the graphics window. This concludes one round of software reliability measurement with CASRE.

5. General Experiences and On-Going Work

A prototype of the CASRE interface was first presented at the 14th Minnowbrook Workshop on Software Engineering. Remarkably, none of the suggestions made there would have required any significant reorganization of the tool. Currently, the Air Force Operational Test and Evaluation Center (AFOTEC) is funding the implementation of this tool for a DOS Windows 3.0 environment. The modeling capability of CASRE will be based on the mathematical library of SMERFS version 4. We decided that this would be the most effective way of accomplishing the task within the allocated resources: rather than writing a new set of modeling routines, it made more sense to use an existing modeling library that had been extensively tested. Implementing the linear combination modeling facility will be a straightforward task, since all that is needed is a control mechanism to sequence through the selected models and assign weights to the results of individual models (a sketch of such a mechanism appears at the end of this section).

Since the Minnowbrook presentation, some changes to the original concept have been made. The most significant change is in the model selection and application area (illustrated in Figures 6 and 7). Recall from the previous section that to execute a model, users would choose a model from a sub-menu of the Model pull-down menu. This would have resulted in a sub-menu for individual models and a set of control panels and sub-menus for the linear combination models. Running more than one model would be a tedious exercise with this type of interaction: users would have to choose one model, wait for it to complete, then choose the next model, and so forth. A discussion with the AFOTEC sponsors revealed that a more sensible model selection and execution mechanism would be a checklist, on which users indicate all of the models, individual or combination, to be executed during a modeling run. Upon completing the checklist, users would select an item on the checklist that starts execution of the models, freeing them to perform other tasks while the models execute. As with other applications involving possibly lengthy computations, users would be given the option to terminate execution of the chosen set of models at any time.

This change has led to modifications of the drawing window in which modeling results are displayed. As originally conceived, this window would display the raw data and the results of only one model. Now that the user will be allowed to execute more than one model at a time, the drawing window will allow users to specify which models' results are displayed, via a checklist similar to that used to specify models for execution. In this "display selection" checklist, however, only the models that have been executed are listed. This facility will allow users to easily compare the outputs, and hence the behavior, of two or more models.
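A minimal sketch of that control mechanism follows: run every model ticked on the checklist, collect the results, and form any requested weighted combinations from them. The model registry and runner signature are assumptions for illustration; in the actual implementation the component models would be SMERFS library routines, and execution would run in the background and be cancellable at any time.

```python
# Sketch of the checklist-driven control mechanism. 'registry' maps a
# model name to a callable that fits the model and returns a point
# prediction; real component models would be SMERFS library routines.

def run_checklist(checked, registry, failure_data):
    """Run every checked model and collect its prediction; a real
    implementation would background these runs and allow cancellation."""
    return {name: registry[name](failure_data) for name in checked}

def combine(results, weights):
    """Form a weighted linear combination of already-computed results."""
    total = float(sum(weights.values()))
    return sum(w * results[name] for name, w in weights.items()) / total

# Stand-in component models (stubs returning fixed predictions):
registry = {"JM": lambda data: 101.0, "GO": lambda data: 95.0}
results = run_checklist(["JM", "GO"], registry, [10.0, 8.0, 12.0])
print(combine(results, {"JM": 0.5, "GO": 0.5}))   # 98.0
```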
6. Conclusions and Future Work

We have proposed a set of linear-combination models for more accurate measurement of software reliability. These models have shown promising results when compared with the traditional single-model approaches. To relieve the tedious work involved in applying these approaches, a CASE tool, called CASRE, is proposed to automate the software reliability measurement task.

For the purposes of model validation and determining tool applicability, we need to obtain enough data to compare software reliability models and predictions across various types of software projects. Some data sets can be found in [Musa Data], [Misra IBM 1983], [Troy IEEE 1985 Models], [Troy IEEE 1986], [Levendel 1989], [Levendel 1990], [Stampfel], [Keller Shuttle], [Zinnel], and [Rapp]. In future investigations, we will apply more data sets to the proposed combination models for the purpose of validating them, and for refining the structure and functionality of the CASRE tool. Prototypes of the CASRE tool will be prepared and refined; potential users will be identified and asked to evaluate the tool based on interaction with the prototype. These evaluations will also be used in refining the structure and functionality of the tool.