Process performance modelling statistical probabilistic



The presentation details different process performance modelling techniques that can be applied in a software context, with explanations of how to construct and validate them. It also provides guidance on how to select a modelling technique for a given context.


  2. 2. MODELLING IN CMMI In the CMMI model, the process area ‘Organizational Process Performance’ calls for establishing (and calibrating) useful Process Performance Models (PPMs). The Quantitative Project Management and Organizational Performance Management process areas benefit from these models by using them to predict outcomes and to understand uncertainties, which helps reduce risk by controlling the relevant processes/sub-processes. PPMs are built to predict the Quality and Process Performance Objectives and sometimes the Business Objectives (using integrated PPMs). Modelling therefore plays a vital role in CMMI in the form of Process Performance Models. In fact, we have seen organizations decide on their goals and immediately start asking what their Process Performance Model should be. This is partly because of a lack of options and clarity, since in software the available data sets tend to be small, and partly because of process variation.
  3. 3. WHAT ARE THE CHARACTERISTICS OF A GOOD PPM? One or more of the measurable attributes represent controllable inputs tied to a sub-process, to enable ‘what-if’ analyses for planning, dynamic re-planning, and problem resolution. Process performance models include statistical, probabilistic and simulation-based models that predict interim or final results by connecting past performance with future outcomes. They model the variation of the factors and provide insight into the expected range and variation of the predicted results. A process performance model can be a collection of models that, when combined, meet the criteria of a process performance model.
  4. 4. THE ROLE OF SIMULATION & OPTIMIZATION Simulation: The activity of studying the virtual behaviour of a system using a representative model/miniature, by introducing the expected variations in the model factors/attributes. Simulation helps us gain confidence in the results or understand the uncertainty levels. Optimization: In the context of modelling, optimization is a technique in which the model outcome is maximized, minimized or driven to a target by varying the factors (with or without constraints) and applying relevant decision rules. The factor values for which the outcome meets the expected values are then used as targets for planning/composing the process/sub-process. This helps us plan for success.
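The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear model `predicted_defect_density`, its coefficients, the factor names, the spreads and the candidate settings are all hypothetical and do not come from the presentation. Simulation varies the factors around their planned values; optimization here is a plain grid search that picks the setting with the highest probability of meeting the target.

```python
import random

def predicted_defect_density(review_rate, test_coverage):
    # Hypothetical linear PPM; the coefficients are made up for illustration.
    return 1.0 + 0.02 * review_rate - 0.8 * test_coverage

def simulate(review_mean, coverage_mean, runs=5000, seed=1):
    """Monte Carlo simulation: vary each factor around its planned value."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(runs):
        review = rng.gauss(review_mean, 10.0)      # assumed spread of review rate
        coverage = rng.gauss(coverage_mean, 0.05)  # assumed spread of coverage
        outcomes.append(predicted_defect_density(review, coverage))
    return outcomes

def probability_of_meeting(outcomes, upper_limit):
    """Share of simulated outcomes at or below the upper specification limit."""
    return sum(1 for y in outcomes if y <= upper_limit) / len(outcomes)

def optimize(upper_limit):
    """Grid search over candidate factor settings (a simple stand-in for an optimizer)."""
    best = None
    for review_mean in (100, 150, 200):
        for coverage_mean in (0.6, 0.7, 0.8):
            outcomes = simulate(review_mean, coverage_mean)
            p = probability_of_meeting(outcomes, upper_limit)
            if best is None or p > best[0]:
                best = (p, review_mean, coverage_mean)
    return best

probability, review_target, coverage_target = optimize(upper_limit=2.5)
print(probability, review_target, coverage_target)
```

The factor values returned by `optimize` play the role of the planning targets mentioned above; a commercial tool would normally replace the grid search with a proper optimizer.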
  5. 5. TYPES OF MODELS - DEFINITIONS Physical Modelling: The physical state of a system is represented using scaled dimensions, with or without similar components. Such models are common in applied physics. Example: prototype of a bridge, a satellite map, etc. Mathematical Modelling: The attributes of interest are used, with the help of data, to form a representation of the system. These models are often used when people are largely involved in producing the outcome, or when the outcome cannot be replicated in a laboratory. Example: productivity model, storm prediction, stock market prediction, etc. Process Modelling: The entire flow of a process, with its factors and conditions, is modelled. These models are often useful for understanding and correcting bottlenecks in the process/system. Example: airport queue prediction, supply chain prediction, etc.
  6. 6. TREE OF MODELS Models: Mathematical (Statistical, Probabilistic, Simulation); Process Model (Discrete Event, Continuous); Physical Model; Others. Variations in the models: Static vs Dynamic; Deterministic vs Stochastic; Discrete vs Continuous; Parametric vs Nonparametric. Examples: Statistical: Regression Models, Artificial Neural Network, ARIMAX, Reliability. Probabilistic: Bayesian Belief Network, Markov Network, PNN. Simulation: Monte Carlo Simulation, Discrete Event Simulation. Process Model, Discrete Event: Queuing Model + Discrete Event Simulation. Process Model, Continuous: System Dynamics models. We often observe a mix of techniques or a combination of models being used to build less error-prone models for predicting the system.
  7. 7. PROCESS OF MODELLING (Limited to mathematical and process modelling.) Model Objective -> Collection of Data on Relevant Factors/Components -> Formulation of Representation using Techniques -> Prediction of Parameters -> Compare with Actuals (Validation). Yes: Use for Prediction. No: Refine the Representation, or Refine the Factors/Components, and repeat. With Time & Data, the model in use goes through the same refinement.
  8. 8. MODELLING UNDER OUR PURVIEW We will see the following models in this presentation: Regression Based Models, Bayesian Belief Networks, Neural Networks, Fuzzy Logic, Reliability Modelling, Process Modelling (Discrete Event Simulation), Monte Carlo Simulation, System Dynamics (Continuous Simulation).
  9. 9. REGRESSION Regression is the process of estimating the relationship between dependent and independent variables and explaining the dependent variable in terms of the conditional values of the independent variables. As a model it is represented as Y = f(X) + error, where f has unknown parameters, Y is the dependent variable and the X’s are the independent variables. A few assumptions related to regression: the sample of data represents the population; the variables are random and their errors are also random; there is no multicollinearity (correlation amongst independent variables). Here we are working with multiple regression (with many X’s) and assuming linear regression (non-linear regression models also exist). Each X factor is either a measure of a sub-process/process or a factor that influences the data set/project/sample. Regression models are often static models built from historical data that comes from multiple uses of the processes (many similar projects/activities).
  10. 10. REGRESSION - STEPS Perform a logical analysis (e.g. brainstorming with a fishbone diagram) to identify the independent variables (X) for a given dependent variable (Y). Collect relevant data and draw scatter plots of X vs Y, X1 vs X2, and so on. This helps us see whether there is a relationship (correlation) between X and Y, and also check for multicollinearity issues. Perform a subset study to find the best subset, which gives a higher R² value and a lower standard error. Develop a model, indicating to the tool which data are continuous and which are categorical. From the results, study the R² value (greater than 0.7 is good), which tells how much of Y is explained by the X’s; the more the better. Study the P values of the individual independent variables; each should be less than 0.05, which means there is a significant relationship with Y.
  11. 11. REGRESSION - STEPS Study the P value from the ANOVA to understand the model fit; it should be less than 0.05. The VIF (Variance Inflation Factor) should be less than 5 when the sample size is less than 50, and less than 10 otherwise; if this is violated, the possibility of multicollinearity is high and the X factors have to be relooked at. Study the residual plot; the residuals should be normally distributed, which means the prediction equation produces a best-fit line with variation on either side. R² alone does not say that a model is the right fit in our context: it indicates that the X’s are relevant to the variation of Y, but it never says that all relevant X’s are part of the model or that there is no outlier influence. Hence, beyond that, we recommend validating the model. The Durbin-Watson statistic is used to check for autocorrelation in the residuals, and its value ranges from 0 to 4. A value of 0 indicates strong positive autocorrelation (the previous data point pushes the next period’s data up), 4 indicates strong negative autocorrelation (the previous data point pushes the next period’s data down), and 2 indicates no serial correlation.
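As a rough illustration of three of the checks above (R², VIF and Durbin-Watson), here is a small standard-library Python sketch. The helper names and the data are made up for illustration, and `vif` is restricted to the case of exactly two predictors, where VIF reduces to 1 / (1 - r²) of one predictor regressed on the other. A statistical package would report all of these directly.

```python
def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(y, fitted):
    """Share of the variation of y explained by the fitted values."""
    my = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(x_target, x_other):
    """VIF of one predictor when the model has exactly one other predictor."""
    a, b = fit_line(x_other, x_target)
    fitted = [a + b * v for v in x_other]
    return 1.0 / (1.0 - r_squared(x_target, fitted))

def durbin_watson(residuals):
    """Near 0: positive autocorrelation; near 2: none; near 4: negative."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print(r_squared(y, fitted), durbin_watson(residuals))
```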
  12. 12. REGRESSION - EXAMPLE Assume a case where Build Productivity is Y, and Size (X1), Design Complexity (X2) and Technology (X3, categorical data) form a model because the organization believes they are logically correlated. They collect data from 20 projects, follow the steps given in the earlier slides and build a regression model. The results are as follows. Dummy variable regression is used because the categorical data ‘Technology’ is present; it is coded as 0 and 1 in the regression. The low P value for the regression means the model is significant. The high adjusted R² means the X’s together explain 86.44% of Y, with a low standard error. The low P values for all independent variables mean they are significant in the prediction. The low VIF (<5) indicates no multicollinearity risk. Two regression equations are formed, one per technology. The residual plot is approximately normally distributed. Tool: Minitab outputs.
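To show how a 0/1 dummy variable leads to one regression equation per technology, here is a minimal sketch in standard-library Python. It is simplified to one continuous X (size) plus the technology dummy, and the six data rows and the technology labels are made up; they are not the 20 projects behind the Minitab output. The least-squares fit is done with the normal equations, which is fine for a tiny example but is not what a statistical package would use internally.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data: (size, technology, build productivity).
projects = [(10, "Java", 25.0), (20, "Java", 45.0), (30, "Java", 65.0),
            (10, ".NET", 30.0), (20, ".NET", 50.0), (30, ".NET", 70.0)]

# Design matrix: intercept, size, dummy (1 for .NET, 0 for Java).
rows = [[1.0, float(size), 1.0 if tech == ".NET" else 0.0]
        for size, tech, _ in projects]
y = [productivity for _, _, productivity in projects]

intercept, size_coef, tech_coef = fit_ols(rows, y)

# The dummy coefficient only shifts the intercept, giving two equations.
print(f"Java: Y = {intercept:.2f} + {size_coef:.2f} * Size")
print(f".NET: Y = {intercept + tech_coef:.2f} + {size_coef:.2f} * Size")
```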
  13. 13. VALIDATING MODEL ACCURACY It is important to ensure that the model we develop not only represents the system but can also predict the outcomes with small residuals. In fact, this is the part where we can actually understand whether the model meets its purpose. To check the accuracy we can use the commonly used MAPE (Mean Absolute Percentage Error), which calculates the percentage error between the actual and predicted values across observations: $\mathrm{MAPE} = \frac{100}{n}\sum_{k=1}^{n}\left|\frac{A_k - F_k}{A_k}\right|$, where $A_k$ is the actual value, $F_k$ is the forecast value and $n$ is the number of observations. An error value of less than 10% is acceptable. However, if the observed values are close to 0, it is better to avoid MAPE and use the Symmetric Mean Absolute Percentage Error (SMAPE) instead. Interpolation & Extrapolation: Regression models are developed using a certain range of X values, and the relationship holds true within that region. Hence we can rely more on any prediction made within the existing range of the X’s (interpolation). However, the benefit of a model also lies in its ability to predict a situation that has not been seen yet. In such cases we expect the model to predict for a range it never encountered, or for a region in which the whole relationship between the X’s and Y could change significantly; this is extrapolation. Small extrapolations can be considered with the uncertainty in mind, but predictions for X values far away from the data used to develop the model should be avoided, as the uncertainty increases.
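A short sketch of both accuracy measures in Python follows. The MAPE function uses the usual definition with the actual value in the denominator. SMAPE has several variants in the literature; the one below (denominator |A| + |F|, scaled by 2) is just one common choice and is an assumption here, as are the sample numbers.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))

def smape(actual, forecast):
    """Symmetric MAPE, in percent (one common variant; bounded by 200)."""
    n = len(actual)
    return 100.0 / n * sum(2.0 * abs(f - a) / (abs(a) + abs(f))
                           for a, f in zip(actual, forecast))

# Made-up actual and predicted values, for illustration only.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
print(mape(actual, forecast))   # both errors are 10% of the actual value
print(smape(actual, forecast))
```

With the 10% rule of thumb from the slide, a model with these two predictions would sit right at the acceptance boundary.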
  14. 14. VARIANTS IN REGRESSION Statistical relationship modelling techniques are selected mainly based on the type of data we have. Whether the X factors and the Y factor are continuous or discrete determines the technique to be used in developing the statistical model. By data type of the X’s and Y: Discrete X’s with continuous Y: ANOVA & MANOVA. Discrete X’s with discrete Y: Chi-Square & Logit. Continuous X’s with continuous Y: Correlation & Regression (simple/multiple/CART, etc.). Continuous X’s with discrete Y: Logistic Regression. A few discrete plus a few continuous X’s with continuous Y: Dummy Variable Regression. A few discrete plus a few continuous X’s with discrete Y: Ordinal Logit. By linearity, we can classify a regression as linear, quadratic, cubic or exponential. Based on the type of distribution in the correlation space, we can use the relevant regression model.
  15. 15. TOOLS FOR REGRESSION Regression can easily be performed using the Trendline functions of MS Excel. In addition, there are many free plug-ins available on the internet. From the point of view of professional statistical tools, Minitab 17 has features that are easy for users to pick up and control. The tool has added profilers and optimizers, which are useful for simulation and optimization (earlier we depended on external tools for simulation). SAS JMP is another versatile tool with loads of features. Anyone who has used this tool for some time tends to get hooked on its level of detail and responsiveness. JMP has had interactive profilers for a long time and can handle most of the calculations. In addition, there are SPSS and Matlab, which are also well known. R is the open source statistical package, which can be extended with relevant add-ins to develop many models. We recommend considering the experience and competency level of the users, the licensing cost, the complexity of the modelling and the ability to simulate and optimize when deciding on the right tool. Some organizations decide to develop their own tools because their existing data is in other formats; however, we have seen such attempts rarely sustain and succeed. The reasons include too much elapsed time, changing priorities, the complexity of algorithm development, limited usage, etc. Since most of the tools support common formats, organizations can consider producing reports/data in these formats to feed into proven tools/plug-ins (just a word of free advice).
  16. 16. BAYESIAN BELIEF NETWORKS A Bayesian network is a construct in which the probabilistic relationships between variables are used to model and calculate the joint probability of a target. The network is based on nodes and arcs (edges). Each variable is represented by a node, and its relationship with another node is expressed using an arc. If a node depends on another variable, that variable is its parent node. Similarly, if some other node depends on this node, that node is its child node. Each node carries certain states (e.g. Skill is a node that carries the states High, Medium and Low), and each state has a probability of occurrence (e.g. High 0.5, Medium 0.3, Low 0.2). When there is conditional dependence (the node has a parent), its joint probability is calculated by considering the parent nodes (e.g. whether Analyze Time is ‘Less than 4 hrs’ or more depends on Skill being High/Medium/Low, which gives 6 different probability values). The central idea of using this in modelling is that the posterior probability can be calculated from the prior probability of a network that has been developed from beliefs (learning). It is based on Bayes’ theorem. Bayesian networks are widely used in the medical field, speech recognition, fraud detection, etc. Constraints: The method and the supporting learning need assistance, and the computational needs are also high. Hence its usage in the IT industry is minimal; however, with relevant tools in place it becomes more practical to use in IT.
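The Skill and Analyze Time example can be worked through by hand or in a few lines of Python. In the sketch below the prior for Skill is the one quoted on the slide (High 0.5, Medium 0.3, Low 0.2), while the conditional probabilities of Analyze Time being under 4 hours are hypothetical values chosen only for illustration. The code computes the six joint probabilities, the marginal probability of a fast analysis, and the posterior for Skill via Bayes' theorem.

```python
# Prior from the slide's example.
prior_skill = {"High": 0.5, "Medium": 0.3, "Low": 0.2}

# Hypothetical conditional probability table: P(Analyze Time < 4 hrs | Skill).
p_fast_given_skill = {"High": 0.9, "Medium": 0.6, "Low": 0.3}

def joint(skill, fast):
    """P(Skill = skill, AnalyzeTime < 4 hrs is `fast`)."""
    p_fast = p_fast_given_skill[skill]
    return prior_skill[skill] * (p_fast if fast else 1.0 - p_fast)

def marginal_fast():
    """P(Analyze Time < 4 hrs), summing the joint over all skill states."""
    return sum(joint(skill, True) for skill in prior_skill)

def posterior_skill_given_fast():
    """Bayes' theorem: P(Skill | Analyze Time < 4 hrs)."""
    evidence = marginal_fast()
    return {skill: joint(skill, True) / evidence for skill in prior_skill}

# The six joint probabilities (3 skill states x 2 analyze-time states).
for skill in prior_skill:
    for fast in (True, False):
        print(skill, fast, round(joint(skill, fast), 3))

print(round(marginal_fast(), 3))
print(posterior_skill_given_fast())
```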
  17. 17. BAYESIAN BELIEF NETWORKS- STEPS We are going to discuss BBN mainly using the BayesiaLab tool, which has all the expected features to build a comprehensive model, optimize the network and indicate the variables for optimization. We discuss other tools in an upcoming slide. A) In Bayesian modelling, the data for the variables can be in discrete or continuous form; however, they will be discretised using techniques like K-means, Equal Distance, Manual and other methods. B) The data has to be complete for all observations of the variables in the data set; otherwise the tool helps us fill in the missing data. C) The structure of the network is important, as it determines the relationships between variables; however, an arc often does not represent a cause-and-effect relationship but rather a dependency. Domain experts along with process experts can define the structure (with relationships) manually. D) As an alternative, machine learning is available in the tool: a set of observations is passed to the tool and, using the learning options (structured and unstructured), the tool plots the possible relationships. The tool uses MDL (Minimum Description Length) to identify the best possible structure. However, we can logically modify the flow by adding/deleting arcs (and then perform parameter estimation to update the conditional probabilities).
  18. 18. BAYESIAN BELIEF NETWORKS- STEPS E) To ensure that the network is fit for prediction, we have to check the network performance. Normally this is done using test data (separated from the overall data set) to check the accuracy; otherwise the tool takes the whole set to compare the model-predicted values with the actual values. This gives the accuracy of the network in prediction. Anything above 70% is good for prediction. F) In other models we perform simulation to see the uncertainty in achieving a target, but in a probability model that step is not required, as the model directly gives the probability of achieving it. G) To perform what-if analysis and understand the role of each variable in maximizing the probability of the target, or in improving the mean of the target, we can do target optimization. This runs a number of trials within the boundaries of variation and finds the best-fit values of the variables, which give a high probability of achieving the target. Using these values we can compose the process and monitor the sub-process statistically. H) When we know some of the parameters with certainty, we can set them as hard evidence and calculate the probability. (Example: if design complexity or skill is a known value, it can be set as hard evidence and the probability of productivity can be calculated.) I) The Arc Influence diagram helps us understand the sensitivity of the variables in determining the target.
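Steps G and H can be illustrated without a dedicated tool on a very small network. In the sketch below, two root nodes (Skill and KEDB use) feed one target (TTAT being Good); all priors and conditional probabilities are hypothetical. Setting hard evidence simply fixes a node to a known state, and the "target optimization" is reduced to enumerating every combination of states and keeping the one with the highest probability of the target. A real tool searches far larger networks, so treat this only as a picture of the idea.

```python
from itertools import product

# Hypothetical priors for the two root nodes.
priors = {
    "Skill": {"High": 0.4, "Low": 0.6},
    "KEDB": {"Yes": 0.5, "No": 0.5},
}

# Hypothetical conditional probability table: P(TTAT = Good | Skill, KEDB).
p_good = {
    ("High", "Yes"): 0.9, ("High", "No"): 0.7,
    ("Low", "Yes"): 0.6, ("Low", "No"): 0.3,
}

def probability_good(evidence=None):
    """P(TTAT = Good), optionally with hard evidence such as {"Skill": "High"}."""
    evidence = evidence or {}
    total = 0.0
    weight = 0.0
    for skill, kedb in product(priors["Skill"], priors["KEDB"]):
        state = {"Skill": skill, "KEDB": kedb}
        if any(state[name] != value for name, value in evidence.items()):
            continue  # this combination contradicts the hard evidence
        w = priors["Skill"][skill] * priors["KEDB"][kedb]
        weight += w
        total += w * p_good[(skill, kedb)]
    return total / weight

def best_setting():
    """Enumerate all state combinations and keep the most favourable one."""
    candidates = product(priors["Skill"], priors["KEDB"])
    return max(candidates, key=lambda c: p_good[c])

print(probability_good())                   # no evidence
print(probability_good({"Skill": "High"}))  # hard evidence on Skill
print(best_setting())
```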
  19. 19. BAYESIAN - SAMPLE Assume a case in which we have a goal for Total Turn Around Time (TTAT) with the states Good (<=8 hrs) and Bad (>8 hrs). The influencing variables are Skill, KEDB (Known Error Database) Use, and ATAT (Analyse Turn Around Time) with the states Met (<=1.5 hrs) and Not met (>1.5 hrs). How do we go about Bayesian modelling based on the previous steps? (Each incident is captured with this data, and around 348 incidents from a project are used.) Notes on the screenshots: The Directed Acyclic Graph (DAG), or network, with nodes (Skill, etc.) and arcs (connectors) leading to TTAT. The probability of each variable and its states, using conditional and joint probability. Total precision is 66.38%, which compares the actual with the model-predicted values; in this case it is marginal to accept. Example: the actual count of Bad is 149, and the model predicted Bad 104 times. The current probability of the Good state of TTAT is 57.46% (refer to the first picture), and after optimization the optimal probability is 70% (ATAT 0 refers to Met, Skill 1 is High and KEDB 1 is Yes). Target optimization is set to maximize the probability.
  20. 20. BAYESIAN TOOLS There are a few tools we have worked with to get hands-on experience. When selecting a tool for Bayesian modelling, it is important to check that the tool can machine learn, analyze and compare networks, and validate the models. In addition, the tool should have optimization capabilities. GENIE is a tool from the University of Pittsburgh, which can help us learn the model from the data. The joint probability is calculated in the tool, and using hard evidence we can see the final change in probabilities. However, the optimization part (what-if) is more trial and error and is not performed with a specialized option. We can use Excel to develop the joint probabilities and verify the values and the accuracy of the network against GENIE. The Excel sheet can be used as input for simulation and optimization with any other tool (e.g. Crystal Ball), and what-if analysis can be performed. For sample sheets, please connect with us at the mail id given under Contact Us. In addition, we have seen Bayes Server, which is also simple for building the model; however, the optimization part is not as easy as we thought.
  21. 21. NEURAL NETWORK In general we call it an ‘Artificial Neural Network (ANN)’, as it works in a way similar to the neurons of the human brain (a simpler version of it). The network is made of input nodes and output nodes, which are connected through hidden nodes and links (which carry weights). Just as the human brain trains its neurons through various instances/situations and shapes its reaction to them, the network learns the inputs and the corresponding reaction in the outputs through an algorithm, using machine learning. There are single layer feed forward, multilayer feed forward and recurrent network architectures. We will look at the single layer feed forward architecture here, in which a single layer of nodes uses the inputs to learn towards the outputs. In a neural network we need the network to learn and develop the patterns and reduce the overall network error. Then we validate the network using a proportion of the data to check the accuracy. If the total mean squared error in learning and validation is low (backpropagation method: the weights of the links are adjusted recursively through forward and backward passes), then the network is stable. In general we are expected to use continuous variables; however, discrete data is also supported by the newer tools. An artificial neural network is a black box technique in which the inputs are used to determine the outputs through hidden nodes, which cannot be explained by mathematical relationships/formulas. It is a non-linear method, which tends to give better results than linear models.
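To make the forward pass, backward pass and weight adjustment concrete, here is a tiny network with one hidden layer written in standard-library Python. Everything specific in it is an assumption chosen for illustration: the tanh activation, the linear output node, the learning rate, the number of hidden nodes and the made-up training data (y = x squared). It is a sketch of backpropagation by stochastic gradient descent, not how JMP or any other tool trains its networks.

```python
import math
import random

def forward(x, w_hidden, w_out):
    """Forward pass: inputs -> tanh hidden nodes -> linear output."""
    hidden = [math.tanh(sum(w * v for w, v in zip(ws, x)) + ws[-1])
              for ws in w_hidden]           # ws[-1] is the bias weight
    y = sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1]
    return hidden, y

def train(samples, hidden_nodes=3, epochs=500, rate=0.1, seed=0):
    """Backpropagation; returns weights and the mean squared error per epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(hidden_nodes)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_nodes + 1)]
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, target in samples:
            hidden, y = forward(x, w_hidden, w_out)
            err = y - target
            squared_error += err * err
            # Backward pass: push the error back through each link.
            for j in range(hidden_nodes):
                grad_hidden = err * w_out[j] * (1.0 - hidden[j] ** 2)
                for i in range(n_in):
                    w_hidden[j][i] -= rate * grad_hidden * x[i]
                w_hidden[j][-1] -= rate * grad_hidden
                w_out[j] -= rate * err * hidden[j]
            w_out[-1] -= rate * err
        history.append(squared_error / len(samples))
    return w_hidden, w_out, history

# Made-up training data: learn y = x^2 on the range [-1, 1].
samples = [([x / 5.0], (x / 5.0) ** 2) for x in range(-5, 6)]
w_hidden, w_out, history = train(samples)
print(history[0], history[-1])  # error at the start and at the end of training
```

In practice part of the data would be held back and the same error computed on it, as the slide describes, to judge whether the network is stable.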
  22. 22. NEURAL NETWORKS - STEPS We are going to explain neural networks using the JMP tool from SAS. As we discussed under regression, this tool is versatile and provides detailed statistics. A) Collect the data, check for any high variations and check its accuracy. B) Use Analyze->Modeling->Neural in the tool and provide the X and Y details. In JMP we can also give discrete data without any problem. C) In the next step we are expected to specify the number of hidden nodes we want. Since the normal version of JMP allows only a single layer of nodes, we may specify, as a rule of thumb, (count of X’s * 2). D) We need to specify the method by which the data will be validated. If we have enough data (rule of thumb: data count > count of X’s * 20), we can go ahead with the ‘Holdback’ method, where a certain percentage of the data is kept only for validating the network; otherwise we can use KFold and give the number of folds (each fold will also be used for validation). In the Holdback method keep 0.2 (20%) for validation. E) We get the results with the Generalized RSquare; a value near 1 means the network is contributing to the prediction (the variables explain the output well using this neural network). We also have to check the validation RSquare to see how good the results are. Only when the training and validation results are nearly the same is the network stable and usable for prediction. In fact, the validation results in a way give the accuracy of the model, and their error rate is critical to observe.
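The two rules of thumb in steps C and D are easy to encode. The sketch below is plain Python and independent of JMP; the function names and the default of 5 folds are assumptions, while the multipliers (2 hidden nodes per X, 20 rows per X, 20% holdback) are the ones given above.

```python
def plan_validation(n_rows, n_inputs, holdback_fraction=0.2, folds=5):
    """Apply the rules of thumb: hidden nodes = 2 * X's; holdback if rows > 20 * X's."""
    plan = {"hidden_nodes": 2 * n_inputs}
    if n_rows > 20 * n_inputs:
        plan["method"] = "holdback"
        plan["validation_rows"] = round(n_rows * holdback_fraction)
    else:
        plan["method"] = "kfold"
        plan["folds"] = folds  # assumed default; the slide leaves the count open
    return plan

def kfold_splits(n_rows, folds):
    """Yield (training, validation) row indices; each fold validates once."""
    indices = list(range(n_rows))
    for k in range(folds):
        validation = indices[k::folds]
        training = [i for i in indices if i % folds != k]
        yield training, validation

# 170 rows and 3 inputs, as in the TTAT sample discussed in this presentation.
print(plan_validation(n_rows=170, n_inputs=3))

# A smaller, made-up data set falls back to k-fold validation.
print(plan_validation(n_rows=40, n_inputs=3))
for training, validation in kfold_splits(10, 5):
    print(validation)
```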
  23. 23. NEURAL NETWORKS - STEPS F) The Root Mean Squared Error should be minimal. Typically you can use the Fit Model option in JMP, which finds the best-fitting linear model, and compare its RSquare value with the neural network outcome. G) The best part of JMP is its interactive profiler, which shows the X values and the Y outcome graphically. We can interactively move the values of the X’s and see the change in Y, and also how the other X’s react at that combination of values. H) The profiler comes with a sensitivity indicator (triangle based) and a desirability indicator. This acts as an optimizer: we set the value of Y we want, using specification limits/graphical targets, and see the range of X’s with which we can achieve it. Maximization, minimization and target values are available for Y. I) Simulation is available as part of the profiler itself. We can fix the values of the X’s (with variation), and the tool provides results using the Monte Carlo simulation technique, which helps us understand the uncertainties.
  24. 24. NEURAL NETWORKS - SAMPLE Assume a case in which we have a goal for Total Turn Around Time (TTAT), where less than 8 hrs is the target. The influencing variables are Skill (H, M, L), KEDB (Known Error Database) Use (Yes, No) and ATAT (Analyse Turn Around Time). How do we go about neural networks based on the previous steps? (Around 170 data points collected from a project are used.) In this case, Skill and KEDB are discrete, and ATAT and TTAT are continuous. Since we give ‘TTAT’ as Y, the machine performs structured learning. Here we use the Holdback method with 20% of the data for validation, and give 6 hidden nodes for the activation function. Notes on the screenshots: RSquare for training and validation is more than 0.9, which means Y is explained well by this network and the network is stable. The structure of the inputs, hidden nodes/activation function and output. RMSE is lower compared to other models. The simulation function is set up with 5000 runs, and spec limits for Y are given (an output table is possible). Each variable can be interactively specified with a random or fixed value; this helps in prediction with known/unknown values. The desirability function shows the specs of the X’s for which TTAT achieves the expected value, and the simulation shows the confidence level.
  25. 25. NEURAL NETWORK TOOLS Matlab has a neural network toolbox, which seems user friendly and has many options and logical steps for understanding and improving the modelling. What we are not sure about is its simulation and optimization capabilities. The best part is that it provides relevant scripts, which can be modified or run along with existing tools. JMP has limitations when it comes to neural networks, as only a single hidden layer can be created and the options to modify the learning algorithm are limited. However, JMP Pro has the relevant features, with many options to fit our customization needs. Minitab at this moment does not have neural networks. However, SPSS contains neural networks with the capability to form multiple hidden layers. Nuclass 7.1 is a free tool (the professional version has a cost) that specializes in neural networks. There are many options available to customize the model; however, it is not as easy to use as JMP or SPSS. PEERForecaster and Alyuda Forecaster are Excel-based neural network forecasting tools. They are easy to use for building the model; however, simulation and optimization with controllable variables is a question mark with these tools.
  26. 26. RELIABILITY MODELLING Reliability is an attribute of a software product that expresses the probability of performing at the expected level without any failure. The longer the software works without failure, the better the reliability. Reliability modelling is used in software under different conditions, such as defect prediction based on phase-wise defect arrival or the testing defect arrival pattern, warranty defect analysis, forecasting the reliability, etc. Reliability is measured on a scale of 0 to 1, where 1 is the most reliable. There is time-dependent reliability, where time is an important measure because defects occur with time, wear-out, etc. There is also non-time-dependent reliability; here, although time is the measure used to communicate the defect, the defect does not happen just because time passes but because faulty programs/code are executed over a span of time. This concept is used in the software industry for MTTR (Mean Time To Repair), Incident Arrival Rate, etc. Software reliability models are normally designed around a distribution curve whose shape shows defect identification/arrival reducing over time from a peak towards a low and flatter trajectory. The shape of the curve gives the best-fit model, and most commonly we fit Weibull, logistic, lognormal or smallest extreme value probability distributions. In software it is also possible that every phase or period has a different probability distribution. Typically the defect data can be used as a count of defects in a period (e.g. 20/40/55 in a day) or as defect arrival time (e.g. 25, 45, 60 minutes between successive defects). The PDF (Probability Density Function) and CDF (Cumulative Distribution Function) are important measures for understanding the pattern of defects and predicting the probability of defects in a period/time, etc.
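For the Weibull distribution named above, the PDF, CDF and reliability function have simple closed forms, sketched here in Python using the two-parameter form with a shape and a scale. The parameter values in the usage lines (shape 1.5, scale 250 hours) are hypothetical and only show how the functions would be read: the CDF gives the probability that a defect has arrived by time t, and the reliability is its complement.

```python
import math

def weibull_pdf(t, shape, scale):
    """Probability density of the time to failure (two-parameter Weibull)."""
    return ((shape / scale) * (t / scale) ** (shape - 1)
            * math.exp(-((t / scale) ** shape)))

def weibull_cdf(t, shape, scale):
    """Probability that the failure (defect arrival) happens by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def reliability(t, shape, scale):
    """Probability of running without failure up to time t (0 to 1)."""
    return 1.0 - weibull_cdf(t, shape, scale)

# Hypothetical parameters, for illustration only.
shape, scale = 1.5, 250.0
print(weibull_cdf(100.0, shape, scale))   # chance of a defect within 100 hrs
print(reliability(100.0, shape, scale))   # chance of surviving 100 hrs
```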
  27. 27. RELIABILITY MODELLING- STEPS We will work on reliability again using JMP, which is pretty good for this type of modelling. We will apply reliability modelling to the arrival of defects in a maintenance engagement, where the application design complexity and the skill of the people maintaining the software vary. Remember that when we develop a model, we assume something controllable is present; if not, these models are only time-dependent ones and can help only in prediction, not in controlling. In reliability we call the influencers that impact the failure Accelerators. We can use the weights or priority of defects as the frequency, and for a data point for which we are not sure about the time of failure, we use Censor. Right censoring is for a value for which you know only the minimum time beyond which it failed, and left censoring is for the maximum time within which it failed. If you know the exact value, then by default it is uncensored. There are many variants within reliability modelling; here we are going to use only Fit Life by X modelling. A) Collect the data as defect arrival times or as defect counts per time period. In this case we are going to use Fit Life by X, so we collect it as time between defects. Also record the application’s complexity and the team’s skill level with each data entry. B) Select ‘Time to Event’ as Y, select the accelerator (complexity measure) and use skill as the separator. C) Different distributions, categorized by application complexity, are available. Here we have to check the Wilcoxon Group Homogeneity Test for the P value (should be less than 0.05) and the ChiSquare value (should be minimal).
  28. 28. RELIABILITY MODELLING- STEPS D) To select the best-fit distribution, look at the comparison criteria given in the tool, which show the -2LogLikelihood, AICc and BIC values. The AICc (corrected Akaike’s Information Criterion) should be minimal for the selected distribution. BIC is the Bayesian Information Criterion, which is stricter, as it takes the sample size into consideration. (In other tools we might have Anderson-Darling values; in that case select the distribution whose value is less than or around 3, or the lowest.) E) For the best-fit distribution, study the results for the P value and look at the distribution of the residuals in the residual plot (Cox-Snell residual P-plot). F) The Quantile tab in this tool is used for extrapolation (e.g. in Minitab we can provide new parameters in a column and predict the values using the Estimate option) and for predicting the probability. G) The variation of the accelerator can be configured. The probability is normally kept at 0.5, to see the 50% chance (the median), and the expected mean time can then be set as the LSL and/or USL accordingly. The simulation results tell us the mean and SD, with graphical results. H) To optimize the level at which the accelerator is maintained, we can use the Set Desirability function, give a target for Y and check the values. I) Under the Parametric Survival option in JMP, we can check the probability of a defect arriving within a given time, using application complexity and skill level.
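The comparison criteria in step D follow directly from the log-likelihood, the number of fitted parameters k and the sample size n. The sketch below shows the standard formulas in Python, plus a maximum-likelihood fit of the exponential distribution as the simplest possible example of where a log-likelihood comes from. The time-between-defects data and the log-likelihood values listed for the other distributions are made up for illustration; a tool such as JMP would compute them from the actual fit.

```python
import math

def aicc(log_likelihood, k, n):
    """Corrected Akaike Information Criterion (smaller is better)."""
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion; the penalty grows with sample size."""
    return -2.0 * log_likelihood + k * math.log(n)

def exponential_log_likelihood(times):
    """Log-likelihood of an exponential fit at its maximum-likelihood rate."""
    rate = len(times) / sum(times)
    return len(times) * math.log(rate) - rate * sum(times)

# Made-up times between defects (hours), for illustration only.
times = [10.0, 20.0, 30.0, 40.0, 15.0, 25.0, 35.0, 45.0, 12.0, 28.0]
n = len(times)

# (log-likelihood, number of parameters) per candidate distribution.
# Only the exponential value is computed; the others are hypothetical.
candidates = {
    "Exponential": (exponential_log_likelihood(times), 1),
    "Weibull": (-40.1, 2),
    "Lognormal": (-40.6, 2),
}

for name, (loglik, k) in candidates.items():
    print(name, round(aicc(loglik, k, n), 2), round(bic(loglik, k, n), 2))

best = min(candidates, key=lambda name: aicc(*candidates[name], n))
print("Lowest AICc:", best)
```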
  29. 29. RELIABILITY MODELLING- SAMPLE Let us consider the previous example, where the complexity of the applications is maintained at different levels (controllable, assuming the code and design complexity are altered with preventive fixes and analysers) and acts as an accelerator for defect arrival time (Y), and where the skill of the team also plays a role (assuming the applications have been running for quite some time and many fixes have been made). In this case, we want to know the probability of the mean time between defect/incident arrivals being beyond 250 hrs… Notes on the screenshots: Select Y and the X’s. Select the default Arrhenius Celsius as the relationship between X and Y. The distributions are ordered by AICc value in JMP, best first. You can select Nonparametric if the fit is not proper. Check the P value and ChiSquare value of the best distribution (here it is the Weibull). Study the Cox-Snell residual plot for normality; in this case it is normal, so it is a good fit. Select the Quantile profiler and see the code complexity variation and the probability level desirability for the Y value. We can simulate and see, for the given X and probability, where the confidence interval of the data falls. Using the Parametric Survival plot (another option under reliability), we have estimated, for a sample value of 0.54 code complexity and skill levels 3 and 4, the probability of a mean time of 200…; the probability of survival (or happening) is higher for skill 3.
  30. 30. RELIABILITY MODELLING - TOOLS Minitab also has reliability modelling and can perform almost all the types of modelling that other professional tools offer. People who are comfortable with Minitab can use these options. However, we have to remember that simulation and optimization are also needed for modelling in CMMI, so we may need to generate outputs, create ranges, and then simulate and optimize using Crystal Ball (or any other simulation tool). ReliaSoft RGA is another tool with extensive reliability modelling features. It is comparatively user friendly and worth a try if reliability is our key concern. R: though we don't talk much about this free statistical package, it comes with a large number of add-on packages for every need. We have never tried it, perhaps because we did not want to leave the comfort of the GUI capabilities of other professional tools. CASRE and SMERFS are free tools which we have used in some contexts. However, we never tried accelerators with these tools, so we are not sure whether they offer life-fit-by-X modelling. For reliability forecasting and growth they are useful at no cost. The MATLAB Statistics Toolbox also contains reliability modelling features. SPSS reliability features are good enough for our needs in the software industry. However, JMP has the advantage that you need only one tool for modelling, simulation and optimization.
  31. 31. PROCESS MODELLING (QUEUING SYSTEM) A queuing system is one in which arriving entities create demand that has to be served by the limited resources assigned to the system. The system distributes its resources to handle the various events at any given point in time, and the events are handled as discrete events. Many kinds of queuing system can be created; all of them are based on the arrival of elements, server utilization, and the wait time/time spent in the system flows (between servers and with the servers). Discrete events help the queuing model to capture the time stamps of different events and to model their variation along with the queue system. This model helps us to understand the resource utilization of servers, bottlenecks in the system, idle time, etc. Discrete event simulation with queues is used in many places such as banks, hospitals, airport queue management, manufacturing lines, supply chains, etc. In the software industry we can use it in application maintenance incident/problem handling, dedicated service teams/functions (e.g. estimation team, technical review team, procurement), standard change request handling, and in many other contexts where the arrival rate and team size play a role in delivering on time. We also need to remember that in the software context an element that enters the queue stays there until it is serviced and then departs, unlike in a bank or hospital, where a customer or patient who arrives late may not be serviced and may leave the queue.
  32. 32. PROCESS MODELLING - STEPS We will discuss queuing system modelling using the tool “Processmodel”. Setting up the flow: A) It is important to understand the actual flow of activities and resources in the system, then draw a graphical flow and verify it. B) Once we are sure about the graphical representation, we have to provide the distribution of time, the entity arrival pattern, the resource capacity and assignment, and the input and output queue for each entity. For the first time, these can be obtained through a time and motion study of the system. The tool has Stat-fit, which helps to calculate the distributions. C) The system now contains entities arriving in a pattern; by adding storage, the entities are retained until they are resolved. Resources can be assigned in shifts, and their usage conditions can be modified to suit the actual conditions by using the get and free functions (which can be coded in a simple manner) and by defining scenarios (the controllable variables are defined as scenarios and mapped to values). Simulation: D) The system can be simulated with replications (keep around 5) and for a period of 1 month or more (a month helps in monitoring and control with monthly values). E) The simulation can be run with or without animation. The results are displayed as output details. The reports can be customized by adding new metrics and formulas. F) A “Hot Spot” in the output summary refers to idle time of entities or waiting time in a queue. This is the immediate area in which to work on process change and improve the condition. If there is no Hot Spot, we need to study the individual activities that have a high standard deviation, a high mean or both; they become our critical sub processes to control.
  33. 33. PROCESS MODELLING - STEPS Validating results: G) It is important to validate whether the system replicates real-life conditions by comparing the actuals with the values predicted by the model. We can use MAPE (mean absolute percentage error), which should be less than 10%. Optimization: H) To find the best combination of resource assignments (e.g. with variation in skill and count) across the different activities, we can run “SimRunner”. The scenarios we defined earlier become the controllable factors, and a range (LSL and USL) is provided for each in the tool. Similarly, the objective, which could be to minimize resource usage and increase entity servicing or to reduce elapsed time, can be set in the tool. J) The default values for convergence and simulation length can be left as they are while the optimization is performed. The tool tries various combinations of scenario values with the existing system and picks the one which meets our target. These values (activity and time taken, resource skill, etc.) can be used for composition of processes.
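The validation check in step G can be written in a few lines. The following is a minimal Python sketch of MAPE with the 10% threshold applied; the turnaround-time figures and variable names are hypothetical and only illustrate the calculation.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actuals) != len(predictions) or not actuals:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actuals, predictions))
    return 100.0 * total / len(actuals)

# Hypothetical monthly turnaround times in hours (actuals vs. model output).
actual_tat = [20.0, 25.0, 40.0, 50.0]
predicted_tat = [22.0, 24.0, 38.0, 55.0]

error = mape(actual_tat, predicted_tat)
model_is_acceptable = error < 10.0  # threshold from step G
print(f"MAPE = {error:.2f}%, acceptable = {model_is_acceptable}")
```

Because each error is divided by the actual value, MAPE is undefined when an actual is zero, so measures that can legitimately be zero need a different error metric.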
  34. 34. PROCESS MODELLING - VALIDATION A maintenance project receives incidents of different severities (P1, P2, P3, P4), around 100 a day with hourly variation, and there are 2 shifts with 15 people each (of similar skill). The different activities have been studied, and their elapsed time, count, etc. are given as distributions (with mean, SD, median, and 10% and 90% values). The project team wants to understand its turnaround time and SLA compliance. They also want to know their bottlenecks and which process to control. Flow configuration is performed within the tool window, and the distributions for activities are set along with resource assignment. The window provides information on entities processed, time taken, etc., and can be customized. 2 replications and 720 hrs have been set in this case. The replications give us distributions for the different measures. Entity movement can be studied visually, and the simulation can be slowed down or speeded up to see what happens in a period. In this case there is no non-value-added Hot Spot, so we will have to monitor and control the Apply Fix step of Priority 2 tickets. The resolution and response percentages of SLAs are predicted in this case. By selecting both replications, we can also see the variation.
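For readers without access to Processmodel, the idea of a queue simulated as discrete events can be sketched in plain Python. The sketch below is a much simplified stand-in, not the tool's method: it assumes a single ticket class, exponential inter-arrival and service times, 15 people on duty at all times, and a hypothetical mean service time of 3 hours. Only the arrival volume (about 100 a day), the 720-hour run length and the use of several replications come from the slides.

```python
import heapq
import random

def simulate_queue(arrivals_per_hour, mean_service_hours, servers, hours, seed=1):
    """Simulate a first-come-first-served queue with identical servers.

    Tickets stay in the queue until served (no abandonment), as in the
    software maintenance context described above.
    """
    rng = random.Random(seed)
    free_at = [0.0] * servers          # time at which each server is next free
    heapq.heapify(free_at)
    clock, waits, turnarounds, busy = 0.0, [], [], 0.0
    while True:
        clock += rng.expovariate(arrivals_per_hour)   # time of the next arrival
        if clock > hours:
            break
        start = max(clock, heapq.heappop(free_at))    # earliest available server
        service = rng.expovariate(1.0 / mean_service_hours)
        heapq.heappush(free_at, start + service)
        waits.append(start - clock)
        turnarounds.append(start + service - clock)
        busy += service
    served = max(len(waits), 1)        # guard against an empty run
    return {
        "tickets": len(waits),
        "mean_wait": sum(waits) / served,
        "mean_turnaround": sum(turnarounds) / served,
        # Approximate: service running past the horizon is counted in full.
        "utilization": busy / (servers * hours),
    }

# Five replications of a 720-hour month, each with a different random seed.
replications = [
    simulate_queue(100 / 24, 3.0, servers=15, hours=720, seed=s) for s in range(5)
]
for result in replications:
    print(result)
```

Comparing the replications gives a feel for the variation in turnaround time and utilization; a fuller model would add priorities, shifts and separate activities, which is what the dedicated tool provides.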
  35. 35. PROCESS MODELLING - TOOLS MATLAB and SAS JMP have their own process flow building capabilities. However, specific to queuing models, we have seen the BPMN process simulation tool, which is quite exhaustive and used by many. The tool has the ability to build and simulate the model. The ARIS simulation tool is another good tool for developing a process system and performing simulation. While considering tools, we also need to look at their optimization capabilities, without which we would have to do a lot of trial and error for our what-if analysis.
  36. 36. FUZZY LOGIC Fuzzy logic represents a model in terms of linguistic variables and handles the fuzziness/vagueness of their values in order to take decisions. It removes the sharp boundaries used to describe a stratification and allows overlapping. The main idea behind fuzzy systems is that truth values (in fuzzy logic) or membership values are indicated by a value in the range [0,1], with 0 for absolute falsity and 1 for absolute truth. Fuzzy set theory differs from conventional set theory in that it allows each element to belong to a given set to some degree (0 to 1), whereas in conventional set theory an element either belongs to the set or does not. For example, suppose we calculated someone's skill index as 3.9, and we have a Medium group which covers skill 2.5 to 4 and a High group which covers 3.5 to 5. In this case the person is a member of the Medium group to a degree of around 0.07 and of the High group to a degree of around 0.22 (illustrative figures, not calculated values). This shows the fuzziness. Remember that this is not a probability; it is a certainty that expresses the degree of membership in a group. In fuzzy logic the problem is stated in terms of linguistic variables, but the underlying solution is a mathematical (numerical) relationship determined by user-given fuzzy rules. For example, “if Skill level is High and KEDB usage is High, then Turn Around Time (TAT) is Met” is a rule; to set up this rule, we should study to what extent this has happened in the past. At the same time, the same case will also belong to the Not Met group of TAT to some degree. In software we use the fuzziness of data (overlapping values) but not exactly the fuzzy rules; in most cases we allow a mathematical/stochastic relationship to determine the Y. We can call this a partial application of fuzzy logic combined with Monte Carlo simulation.
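The degree of membership in the skill example can be made concrete with a membership function. The Python sketch below assumes triangular membership functions peaking at the midpoint of each group's range; that shape is an assumption made here for illustration, so the degrees it produces differ from the illustrative figures quoted above, which the slide states were not calculated.

```python
def triangular_membership(x, low, peak, high):
    """Degree of membership (0 to 1) of x in a triangular fuzzy set."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)

# Group ranges from the example; the peak positions (midpoints) are an assumption.
skill_groups = {
    "Medium": (2.5, 3.25, 4.0),
    "High": (3.5, 4.25, 5.0),
}

skill_index = 3.9
degrees = {
    name: triangular_membership(skill_index, *shape)
    for name, shape in skill_groups.items()
}
for name, degree in degrees.items():
    print(f"{name}: {degree:.2f}")
```

The two degrees do not have to add up to 1, which is one practical way to see that membership is not a probability.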
  37. 37. FUZZY LOGIC - SAMPLE To understand fuzzy logic, we will use the tool qtfuzzylite in this case. Assume that a project uses different review techniques whose defect-finding results overlap with each other. Similarly, it uses different test methods, which also yield overlapping results. The total number of defects found is the target, and it is met under a particular combination of review and test method; we can use fuzzy logic in a modified form to demonstrate this. a) Study the distributions by review type and configure them as input. If there is fuzziness in the data, there can be overlap. b) Study the test methods and their results, and configure their distributions in the tool. c) In the output window, configure the defect target (Met/Not Met) with target values. d) The tool helps to form the rules for the different combinations; the user has to replace the question marks and give the expected target outcome. e) In the control section, moving the values of review and test method (especially in the overlapping area) makes the tool generate a score, which indicates the degree of membership in Met and Not Met. The combination with the higher value shows a stronger association with the result. f) One way to deploy this is to simulate the entire scenario multiple times, making the relationship stochastic rather than deterministic. This means using Monte Carlo simulation to get the range of possible results, or the probability of meeting the target, using fuzzy logic. Often in the software industry we don't apply fuzzy logic to its full extent or model it as it is; instead, the fuzziness of the elements is taken and modelled using a statistical or mathematical relationship to identify the range of outputs. This is more of a hybrid version than true fuzzy logic modelling.
  38. 38. FUZZY LOGIC - SAMPLE The input and output variables are described. Each variable here has different techniques with their distributions. We assumed triangular distributions here, but in practice the true distributions should be used. The rule statements can be generated by pressing the magic wand button; we then just have to replace the question marks with the values of Total Defect (Met or Not Met) to complete the rules. As we move across the different methods, the degree to which we can meet the defect target increases. This also shows that the target can be achieved with more than one type or combination. Using an optimization technique here will reveal the best combination. The distributions of review types and test methods are configured, and the rule we set earlier is used to determine the degree of the result.
  39. 39. MONTE CARLO SIMULATION Monte Carlo simulation is used mainly to study the uncertainty in a value of interest. It is a statistical method of simulation which uses distributions and randomness. In a simulation model, the assumptions about the system are built in and a conceptual model is created; using the Monte Carlo method, the system is then studied over a number of trials with variation drawn from the distributions, which results in a range of outputs. For example, to study the life of a car engine we can't wait until it really wears out; instead the engine is simulated under various conditions and assumptions and the wear-out time is noted. With the Monte Carlo method, it is as if we test another 100 such engines and finally plot the results in a histogram. The benefit is that the outcome is not a single point but a range, so we can understand the variation in engine life. Similarly, since we test many engines, we can understand the probability of an engine having a life beyond a particular value (e.g. 15 years). Computers have made life easy for us: instead of struggling for 100 outcomes, we can simulate 5000, 10000 or any number of trials using Monte Carlo tools. This method helps us to convert a mathematical, deterministic relationship into a stochastic model by allowing ranges/distributions for the factors involved, thereby getting the outcome as a range as well. The model gives us the probability of achieving a target, which is in other words the uncertainty level. Assume a deterministic relationship Design Effort (X1) + Code Effort (X2) = Overall Effort (Y). It can be made stochastic by building assumptions (variation and distribution) for the variables X1 and X2, running the simulation 1000 times, storing all the results of Y and building a histogram from them. What we get is a range of Y.
The input values of X1 and X2 are selected randomly from their given ranges. For example, if code effort varies over (10, 45) hrs, then a random value from that range is selected in each trial and fed into the equation to get a value of Y.
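The effort example can be run with a few lines of Python using only the standard library. In this sketch the code effort range of 10 to 45 hrs comes from the text, while the uniform shape of that range, the triangular distribution for design effort and the 60-hour target are assumptions added for illustration.

```python
import random
import statistics

def simulate_overall_effort(trials=1000, seed=42):
    """Sample Y = X1 + X2 repeatedly and return the list of outcomes."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        design_effort = rng.triangular(15, 40, 25)  # X1: low, high, mode (assumed)
        code_effort = rng.uniform(10, 45)           # X2: range from the text
        outcomes.append(design_effort + code_effort)
    return outcomes

efforts = simulate_overall_effort()
target_hours = 60  # hypothetical upper limit (USL) for overall effort
certainty = sum(y <= target_hours for y in efforts) / len(efforts)

print(f"mean = {statistics.mean(efforts):.1f} h, SD = {statistics.stdev(efforts):.1f} h")
print(f"range = {min(efforts):.1f} to {max(efforts):.1f} h")
print(f"certainty of Y <= {target_hours} h: {certainty:.1%}")
```

The list `efforts` holds every simulated value of Y, so it can be passed to any plotting tool to draw the histogram described above.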
  40. 40. MONTE CARLO SIMULATION - STEPS The Monte Carlo technique can also be demonstrated using Excel formulas; however, we will discuss the relevant topics based on the Crystal Ball tool (from Oracle), which is an Excel plug-in. Performing simulation: A) The data for any variable can be studied for its distribution, central tendency and variation using Minitab or Excel formulas. B) The names of the influencing variables (X's) are entered in Excel cells and their assumptions (the distributions and their values) are given. C) Define the outcome variable (Y) and, in the next cell, give the relationship of the X's with Y. It can be a regression formula, a mathematical equation, etc. (with the X's assumption cells mapped into the formula). D) Define the outcome variable formula cell as a Forecast cell. This only requires naming the cell and providing a unit for the outcome. E) In the preferences, we can set the number of simulation trials we want the tool to perform. If there are many X's, increase the number of trials from 1000 to 10000, etc. As a rule of thumb, use 1000 trials per X. F) Start the simulation; the tool runs the trials one by one, keeps the outcomes in memory and then plots a histogram of the probability of occurrence against the values. We can enter our LSL/USL targets manually and read the certainty as a percentage, or vice versa. This helps us to understand the risk of not achieving the target.
  41. 41. MONTE CARLO SIMULATION - STEPS Optimization: G) Although the simulation shows the uncertainty of the outcome, we have to remember that some X's are controllable (hopefully we have modelled them that way), and by controlling them we can achieve a better outcome. The OptQuest feature in the tool helps us to optimize by picking the right combination of X's. H) At least one Decision Variable has to be created to run OptQuest. Decision variables are simply controllable variables, and without them we can't optimize. I) Define the objective (maximize/minimize/etc., with or without an LSL/USL); the tool detects the Decision Variables automatically. We can introduce constraints on the decision variables (e.g. a particular range within which each has to be simulated). Run the optimization (which is based on simulation): the tool tries values within the range of each decision variable, records the outcome, and keeps the combination of X's that best meets the target for Y as the best choice until a better one turns up within the simulation cycles. J) The best combination of X's gives the target values to be achieved in the project, and the processes which have the capability to achieve these X's are composed in the project.
  42. 42. MONTE CARLO SIMULATION - SAMPLE A project team works on medium-size (200-250 FP) development activities, and they have seen that whenever the internal defects found reach 90 or more, UAT results in fewer defects. They use different review and testing techniques based on the nature of work/sub-domains; the methods give overlapping results for defects identified, and there is no distinctness in their ranges. We are expected to find the certainty of finding more than 90 defects and to see with which combination of review and test type the project will find more defects. Green cells mark the assumptions, blue cells mark the forecast and amber cells mark the decision variables. After 5000 trials with the predetermined combination, the simulation shows a certainty of 96.88% that the defects found exceed 90. The sensitivity chart shows that test results influence around 70% of the outcome, followed by reviews. Next, define the objective, decision variables and constraints, and run the optimization. In the decision variables, the review method and test method are given as discrete values, so the tool tries the different methods and combinations and finds the one which gives the maximum number of defects. We can now compose our process with the resulting methods, as this will help us to achieve a higher outcome in this case.
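The search over discrete decision variables in this sample can be imitated in Python. The sketch below simply enumerates every review and test combination and simulates each one, which is a plain exhaustive search and not the algorithm OptQuest uses. The method names and their triangular defect-yield parameters are hypothetical; only the target of 90 defects and the 5000 trials come from the sample.

```python
import itertools
import random

# Hypothetical defect yields (low, mode, high) per method; the ranges overlap on purpose.
REVIEW_METHODS = {
    "peer review": (20, 30, 45),
    "walkthrough": (15, 28, 40),
    "inspection": (25, 38, 50),
}
TEST_METHODS = {
    "exploratory": (40, 55, 75),
    "scripted": (45, 62, 80),
}

def certainty_above(review, test, target=90, trials=5000, seed=7):
    """Fraction of trials in which review + test defects exceed the target."""
    rng = random.Random(seed)
    r_low, r_mode, r_high = REVIEW_METHODS[review]
    t_low, t_mode, t_high = TEST_METHODS[test]
    hits = 0
    for _ in range(trials):
        total = rng.triangular(r_low, r_high, r_mode) + rng.triangular(t_low, t_high, t_mode)
        hits += total > target
    return hits / trials

# Decision variables: every combination of review method and test method.
results = {
    combo: certainty_above(*combo)
    for combo in itertools.product(REVIEW_METHODS, TEST_METHODS)
}
best_combo = max(results, key=results.get)
for combo, certainty in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{combo[0]} + {combo[1]}: {certainty:.1%}")
```

With only six combinations, enumeration is enough; a search heuristic becomes worthwhile when the decision variables are continuous or the number of combinations is large.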
  43. 43. MONTE CARLO TOOLS Tools like JMP, Processmodel and BayesiaLab have built-in simulation features, so with them we don't need a tool like Crystal Ball. Recently, in Minitab 17, profilers and optimizers have been added to the regression models, which reduces the need for additional tools; however, they are limited to regression. Simulacion is a free tool with acceptable capability: up to 65000 iterations and 150 input variables. It is another Excel add-on. Risk Analyzer is another tool which is similar to Crystal Ball and is capable of performing most of the same actions; however, it is paid software. Many free Excel plug-ins are available for Monte Carlo simulation, and we can also build our own simulation macros in Excel.
  44. 44. SYSTEM DYNAMICS Systems thinking is the basis of system dynamics models. We all operate in systems which continuously have inputs and outputs and are influenced by many factors. The system is represented as stocks and flows, whose state changes continuously over time. The model is made of causal loops which represent the system. It is often difficult to interpret the behaviour of a causal loop within the system without adequate simulation. System dynamics offers us the ability to model the system, simulate it and study its behaviour. The model is used for strategic decision making and planning. A system dynamics model is deterministic, but noise can be added to it, and with the relevant software we can model stochastic behaviour. Here we don't study the behaviour of individual elements in an event, but aggregates at time slices. Since our engagements are becoming bigger and our organizations need decision-making tools, system dynamics is an important technique for modelling the dynamic behaviour of our systems. A basic study of causal loops, flows, stocks and auxiliary variables is recommended for the audience. However, modern tools have made it a less complex modelling technique.
  45. 45. SYSTEM DYNAMICS - STEPS We are going to explain system dynamics using “Insightmaker”, a free online tool which gives all the basic functionality expected at no cost. The same can be modelled using “Vensim”, which is another powerful tool. A) Understand the problem or goal we want to study and identify the causes relevant to it. Developing a basic causal loop diagram is recommended, although the tool does not require it. B) Every system has inputs (here, “flows”), and these inputs accumulate to be processed (here, in a “stock”) at the initial stage. A system should contain at least one flow and one stock. C) As a method, we can keep building stocks and flows to model the system, with auxiliaries as intermediate variables representing the in-between states and influencing factors. Connectors/links are the means by which variables are connected. D) In this tool, create a new insight and draw your system components using the tools given. Each component is highly customizable in terms of size, colour, font, etc. E) Establish links between the variables and name all the variables in an identifiable manner. F) We can write the relationships using formulas (in the “Value” option) and feed in noise using “Distributions” based on the data studied. The tool offers most mathematical and logical functions. The formulas are normally simple, and anyone who understands Excel can easily write the relationships. If we use Vensim, the tool dynamically validates the links between variables against the use of those variables in the current formula.
  46. 46. SYSTEM DYNAMICS - STEPS G) Once all the relevant relationships are established, we can simulate the model. Here simulation refers to the time period over which we study the deterministic behaviour of the system. For example, if we configure daily data in the system, we may simulate up to a month or a quarter. This helps us to understand whether the system is accurate. Use a measure like MAPE to understand how far the predicted values differ from the actual values. H) Monte Carlo simulation, which is required to study the stochastic behaviour of the system, is available under sensitivity analysis. In Vensim, this is available in the Professional version (& PLE Plus). This is important for understanding the confidence level of meeting the target and also the variation of the factors/variables (which helps in sub process management). We can get the data and graphs from the tool. I) Goal optimization is part of this online tool; we can select the variable we want to maximize or minimize, with the relevant constraints on other variables. The optimization results are helpful in composing the process and fixing the goals for sub processes. Vensim offers calibration and policy options under optimization. J) The models created using Insightmaker always remain available on its site; access to them can be controlled, and good revision management and cloning features are available for sharing with others online.
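The stock-and-flow idea behind steps B to G can be illustrated without a modelling tool. The Python sketch below steps a single stock (an incident backlog) forward one day at a time with one inflow and one outflow. All figures and names are hypothetical, and the one-day Euler step is a simplification of what tools such as Insightmaker or Vensim do.

```python
def simulate_backlog(days, initial_backlog, arrivals_per_day, staff, closures_per_person):
    """Step a single stock (backlog) forward with one inflow and one outflow."""
    backlog = float(initial_backlog)
    history = [backlog]
    for _ in range(days):
        capacity = staff * closures_per_person        # auxiliary variable
        inflow = arrivals_per_day                     # flow into the stock
        outflow = min(backlog + inflow, capacity)     # cannot close more than exists
        backlog = backlog + inflow - outflow          # stock update, time step = 1 day
        history.append(backlog)
    return history

# Hypothetical maintenance project: 100 incidents arrive per day, 15 people.
understaffed = simulate_backlog(30, 50, 100, staff=15, closures_per_person=6)
adequately_staffed = simulate_backlog(30, 50, 100, staff=15, closures_per_person=7)

print("backlog after 30 days at 6 closures per person:", understaffed[-1])
print("backlog after 30 days at 7 closures per person:", adequately_staffed[-1])
```

With these assumed numbers the backlog grows by 10 tickets a day in the first case and drains to zero in the second. Replacing the fixed arrival figure with a random draw and repeating the run many times would give the stochastic view described in step H.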
  47. 47. SYSTEM DYNAMICS - SAMPLE Assume a case where incident tickets closed per person is the productivity measure considered and backlog is targeted in a maintenance project. The incidents differ significantly by priority (P1, P2 and P3, P4 as two groups) in terms of effort taken and time spent. We are requested by management to understand the behaviour of these variables and optimize the factors… Blue boxes are stocks and pink ellipses are variables. There are links which connect them. The backlog and productivity are highlighted in separate colours. Sensitivity and optimization options are in the drop-down. Simulation here refers to time-based variation. The tool gives the various factors and their values. The variation shown here is for backlog. In sensitivity analysis the tool applies Monte Carlo simulation and gives the confidence level with values. Optimization results for productivity in terms of the identified variables; here it is effort by sub process by skill. A simple formula window for writing formulas and logic.
  48. 48. SYSTEM DYNAMICS TOOLS Having seen the online tool and its functions, we can look at other tools. The first one which comes to mind is Vensim. The tool has a free version, Vensim PLE, which doesn't have Monte Carlo simulation and optimization. We would recommend Vensim Professional, which is paid software but has all the possible formulas and relevant support groups to resolve our doubts. This is one of the best software packages available in the market today. We have tried Simantics, which is a free, open-source, code-based tool. We were happy with its results. However, the formulas for probability distributions are not available by default. The tool uses the Modelica language, so if you have a developer free for a few days, it can be made really powerful with the relevant algorithms. The tool has Monte Carlo simulation for sensitivity analysis. We have worked with STELLA, which is also good software; however, we couldn't find the simulation and optimization parts. We have also used AnyLogic, but it requires a better understanding to make it work for system dynamics, though it has lots of options. Apart from these, Powersim, GoldSim and many other tools are available in the market. However, from our understanding, the Insightmaker online tool and Vensim Professional are the two we can consider from a CMMI point of view.
  49. 49. MODEL SELECTION - THOUGHT [Diagram: each QPPO type (Time, Defect, Effort, Resource) passes through a decision D into a project-level (PRJ) or organization-level (ORG) branch, and each branch lists candidate models.] D is based on: • High frequency of data in project • Seasonal variation or time dependency • Project characteristic differentiates from others. Candidate models, in transcript order: Queuing Models, Neural Networks, Bayesian Belief, Regression Model, Bayesian Belief, Simulation Models, Other Statistical Models; Reliability Models, Fuzzy Logic, System Dynamics, Neural Networks, Regression Model, Bayesian Belief, Simulation Models, Other Statistical Models; Neural Networks, Simulation Models, System Dynamics, Regression Models, Other Statistical Models; Regression Models, Bayesian Belief, Queuing System, Simulation Models, Fuzzy Logic, Bayesian Model, Simulation Model. Linear and non-linear models are to be selected based on the linearity of “Y”.
  50. 50. KEY CHARACTERISTICS TO DETERMINE MODEL • Robustness of the model • Prediction accuracy of the model • Flexibility in varying the factors in the model • Calibration abilities of the model • Availability of a relevant tool for building the model • Availability of data in the prescribed manner • Data type of the variables and factors involved in the model • Ability to include all critical factors in their primary data type (without converting them to a different scale)
  52. 52. TEAM COMPRISES OF Thirumal Shunmugaraj Sunil Shirurkar Snehal Pardhe Thanks To: Koel Bhattacharya (System Dynamics)
  53. 53. SCREENSHOTS CONTRIBUTION FROM Minitab 17 Processmodel 5.5 BayesiaLab SAS JMP 11.0 Qtfuzzylite Crystal Ball 11