Using CART For Beginners with A Telco Example Dataset (Salford Systems)
Familiarize yourself with CART Decision Tree technology in this beginner's tutorial using a telecommunications example dataset from the 1990s. By the end of this tutorial you should feel comfortable using CART on your own with sample or real-world data.
The ARIMA analytical method predicts future values of a time series using a linear combination of past values and a series of errors. It is suitable for univariate data, whether stationary or non-stationary, with any type of data pattern. It produces accurate, dependable forecasts for short-term planning, and provides forecasted values of target variables for user-specified periods to illustrate results for planning, production, sales and other factors.
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc... (ESEM 2014)
Context: The number of defects fixed in a given month is used as an input for several project management decisions such as release timing, maintenance effort estimation and software quality assessment. Past activity of developers and testers may help us understand the future number of reported defects. Goal: To find a simple and easy-to-implement solution for predicting defect exposure. Method: We propose a temporal collaboration network model that uses the history of collaboration among developers, testers, and other issue originators to estimate the defect exposure for the next month. Results: Our empirical results show that the temporal collaboration model can be used to predict the number of exposed defects in the next month with an R² value of 0.73. We also show that temporality gives a more realistic picture of the collaboration network compared to a static one. Conclusions: We believe that our novel approach may be used to better plan for upcoming releases, helping managers make evidence-based decisions.
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin... (ThinkInnovation)
Primary Goals
1. To determine what factors are driving the lead conversion process.
2. To identify which leads are more likely to convert to paying customers.
Data Description
3. Dataset consists of 4613 rows and 15 columns.
Modelling Strategies
4. Plan
4.1 Perform Dummy Encoding
4.2 List Variables for Modeling
4.3 Identify metric of interest to judge model's performance
5. Build
5.1 Build Logistic Regression Model (Preliminary Model)
5.2 Observe the metrics of the model
6. Improve
6.1 Identify the significant variables
6.2 Rebuild model
6.3 Observe the metrics of the models
7. Decide
7.1 Compare the results of Logistic Regression model (Base model) and Decision Tree Model
7.2 Conclude on best model for this project
8. Recommend
8.1 Determine factors driving the lead conversion process
8.2 Recommend actions that may help identify which leads are more likely to convert to paying customers
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
In this presentation I review various data science techniques and discuss their usefulness to pricing actuaries working in general insurance.
This presentation was originally given at the TIGI webinar in 2020.
https://www.actuaries.org.uk/learn-develop/attend-event/tigi-2020-technical-issues-general-insurance
SLALOM Webinar Final Technical Outcomes Explained "Using the SLALOM Technica... (Oliver Barreto Rodríguez)
SLALOM organized two live sessions to present the final versions of our legal terms and technical specifications for #Cloud #SLAs. The sessions provide examples showing how to practically apply SLALOM to improve current industry practice for #Cloud #SLAs and to support the development of cloud computing metrics.
The first webinar covered SLALOM Technical track "Using metrics to improve Cloud SLAs".
Introductory course on concepts used in predictive control. For more files and MATLAB supporting information go to:
http://controleducation.group.shef.ac.uk/OER_index.htm
Reinforcement Learning (RL) refers to a branch of Artificial Intelligence (AI) that is able to achieve complex goals by maximizing a reward function in real time. Given that RL-based approaches can be applied to almost any optimization problem, enterprise adoption is picking up fast. In this talk, we will focus on Industrial Control Systems, and show why RL is a 'best fit' for many control optimization problems, from controlling combustion engines, to robotic arms cutting metals, to air conditioning systems in buildings.
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6... (ThinkInnovation)
Project’s Primary Goals
1. To analyse past sales data to generate insights into which features of mobile phones drive sales.
2. To use these insights to efficiently plan the inventory in the next 6 months.
Data Description
3. Dataset consists of sales and product-related features.
4. Dataset contains descriptions of the top 5 most popular mobile brands.
5. Dataset consists of 418 row-instances and 16 column-features.
Strategies Deployed for Modelling
6. Check for, and treat with suitable methods, missing values in the dataset.
7. Check for outliers and take suitable steps to treat them.
8. Check for multicollinearity amongst variables and use suitable steps to treat highly correlated variables.
9. Build a Linear Regression Model to predict the sales of mobile phones.
10. Report on the metrics of the models.
11. Identify the significant variables, and rebuild and report on the model using only these variables.
12. Based on the final model outcomes, determine the features driving mobile phone sales.
13. List recommendations to help with inventory planning for the next 6 months.
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
Stock Market Trends Prediction after Earning Release.pptx
Big Data Project - Final version
1.
2. Machine Learning Analysis of Micro & Macroeconomic Variables
Alakar Srinivasan
Pinank Shah
Tjisana Kerr
Mihir Sanghavi
Vinayak Kishanchandani
3. Background & Objectives
Background:
The S&P Global 500 is a popular index consisting of 500 of the largest listed global companies in terms of market capitalization.
Stock index prediction is a relatively old but challenging problem to solve using new techniques.
The other challenging problem is to devise portfolio optimization strategies that outperform the index.
Key Challenges:
Stock markets can be extremely volatile and reactive to new events and information.
The Efficient Market Hypothesis suggests that stock markets cannot be predicted or outperformed.
Objective:
Compare the performance of various statistical and ML prediction techniques in estimating future daily returns of the S&P index, since stock market prediction is enticing and continues to remain one of the ultimate challenges.
[Chart] S&P Global 100 Companies: Market Capitalization ($ Billion) by year:
2009: 8.4, 2010: 12, 2011: 13, 2012: 12.9, 2013: 13.6, 2014: 15, 2015: 16.2
4. List of Prediction Methods For Comparison
1. Linear Regression
2. Lasso Regression
3. Holt Winters Filtering
4. K Nearest Neighbors Algorithm
5. Support Vector Regression
5. 1. Linear Regression
About Linear Regression:
• A linear model that determines the relationship between a dependent variable (daily returns) and a set of explanatory variables, in this case sector-specific returns, exchange rates, commodity prices, etc.
• One of the first regression techniques to be studied extensively
• Linear models are fitted using the Ordinary Least Squares (OLS) approach
• Pros: Simple to execute, easy to implement, easy to interpret model results, fast in processing
• Cons: A linear model will not be able to fit the non-linearity that exists in the context of stock market prediction
• Results are not reliable when very few data points are available, while the model overfits when a large amount of data is used for training
Observation / Result:
The model included all variables except one (the enterprise index), whose p-value was greater than 0.05 and which was hence removed.
With all other factors considered, the performance of the model was as follows - RMSE: 47.44
[Chart] Actual vs Predicted Values
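To make the step concrete, here is a minimal Python sketch (not the project's actual code) of an OLS fit with the p-value-based pruning described above; the factor names and synthetic data are placeholders for the real sector, exchange-rate and commodity series.

```python
# Minimal sketch: fit OLS on daily returns, drop predictors with p-value > 0.05,
# and report RMSE. Factor names and data are illustrative placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["sector_ret", "fx_rate", "oil"])
df["sp_return"] = 0.5 * df["sector_ret"] - 0.3 * df["oil"] + rng.normal(scale=0.1, size=500)

X = sm.add_constant(df[["sector_ret", "fx_rate", "oil"]])
y = df["sp_return"]
model = sm.OLS(y, X).fit()

# Mirror the slide: keep only predictors significant at the 5% level, then refit.
keep = [c for c in X.columns if c == "const" or model.pvalues[c] <= 0.05]
refit = sm.OLS(y, X[keep]).fit()

rmse = float(np.sqrt(np.mean((y - refit.predict(X[keep])) ** 2)))  # in-sample, for brevity
print("kept:", keep, "RMSE:", rmse)
```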
6. 2. Lasso Regression
About Lasso Regression:
• Lasso stands for Least Absolute Shrinkage and Selection Operator and belongs to the class of generalized linear models
• Lasso regression performs both variable selection and regularization in order to improve prediction accuracy
• Pros: Lasso regression models are typically better than linear regression models, since Lasso includes variable selection as well as regularization, which helps reduce overfitting
• Lasso is better than ridge regression in that it performs both parameter shrinkage and selection, while ridge can perform only regularization
• Cons: Although Lasso can be extended to generalized linear models, its performance in predicting non-linear relationships is constrained
Observation / Result:
Using LASSO, a few additional factors - (1) the US 10-year bond, (2) Health, and (3) Telecom - turned out not to be significant for the model and were eliminated due to high p-values.
Due to this further variable selection and coefficient shrinkage, the prediction performance of the model improved to RMSE: 43.34
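A hedged sketch of this step: scikit-learn's cross-validated LassoCV on synthetic placeholder factors. Coefficients shrunk exactly to zero play the role of the factors the slide eliminates.

```python
# Sketch: LassoCV picks the L1 penalty by cross-validation. Standardization
# matters because the L1 penalty is scale-sensitive. X and y are synthetic
# stand-ins for the factor matrix and daily returns.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # 6 placeholder factors
y = 0.8 * X[:, 0] - 0.4 * X[:, 2] + rng.normal(scale=0.1, size=500)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso.fit(X, y)

coefs = lasso.named_steps["lassocv"].coef_
print("eliminated factor indices:", np.where(coefs == 0)[0])  # variables selected out
```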
7. 3. Holt Winters Filtering
About Holt Winters:
• Holt Winters filtering can be used to perform exponential smoothing, a simple forecasting method for time series data
• The method can apply as many as three low-pass filters recursively with exponential windows, defined by the parameters alpha, beta and gamma
• Exponential smoothing differs from a simple moving average in that it assigns exponentially decreasing weights to historical data instead of equal weights within the time window
• Pros: Better than SMA in that it determines the right parameters to assign exponentially decreasing weights to historical data, improving prediction performance
• Cons: Although simpler to execute than more complex time series forecasting techniques such as ARIMA, prediction accuracy is lower in comparison
Observation / Result:
Since the daily returns data is stationary and non-seasonal, only the alpha parameter was used to perform exponential smoothing.
An alpha of 0.191 was found to minimize the SSE, although the prediction performance of exponential smoothing was found to be much lower than that of the other techniques.
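The alpha-only setting above can be reproduced with statsmodels' simple exponential smoothing; this sketch uses a synthetic returns series, not the project's data.

```python
# Sketch: simple exponential smoothing with only the alpha (level) parameter,
# matching the slide's stationary, non-seasonal setting.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

returns = pd.Series(np.random.default_rng(0).normal(scale=0.01, size=500))
fit = SimpleExpSmoothing(returns).fit()  # alpha chosen by minimizing the SSE
print("alpha:", fit.params["smoothing_level"])
print("one-step forecast:", fit.forecast(1).iloc[0])
```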
8. 4. K Nearest Neighbors
About K-NN:
• The KNN algorithm is a method for classifying objects based on the closest training examples in the feature space
• KNN is the fundamental and simplest classification technique when there is little or no prior knowledge about the distribution of the data
• In addition to classification, KNN can also be used for regression and prediction: the average response of the K nearest neighbors is taken as the predicted value
• Pros: Computationally efficient, and can provide higher accuracy if there is good correlation between historical and future events
• Pros: A lazy, non-parametric learner, so there is no explicit model training phase
• Cons: Input parameters and dimensions need to be scaled appropriately in order for KNN to perform well
Observation / Result:
• KNN regression with k=5 gives the best RMSE of 15.5498 on the test data
• To predict tomorrow's close of the S&P 500, we find the k nearest neighbors of today's observations from the beginning of the data (t=1) and then compute the mean of the next day's closing prices of those k nearest neighbors
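A short scikit-learn sketch of the k=5 nearest-neighbor regression described above; the data is synthetic, and the features are standardized per the scaling caveat in the cons.

```python
# Sketch: k-NN regression with k=5 as on the slide. Features are scaled first
# because k-NN is distance-based; X and y stand in for today's observations
# and tomorrow's close.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)  # preserve time order
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_tr, y_tr)
print("test RMSE:", np.sqrt(mean_squared_error(y_te, knn.predict(X_te))))
```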
9. 5. Support Vector Regression
About SVR:
The SVM technique can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin)
In the case of regression, a margin of tolerance (epsilon) is set in the approximation to the SVM to predict real values
This makes the algorithm more complicated than SVM classification, which should be taken into consideration
Pros: Can function well with a high number of variables and a high number of observations, and can model non-linearity better than simple regression
Cons: A complicated algorithm and output, which makes it difficult to interpret results and identify drivers
Also very time-consuming compared to other models during model training
Observation / Result:
As observed from the charts, SVR is able to predict the daily returns better than other techniques such as exponential smoothing and linear regression
SVR took about 2 hours to run compared to a few seconds for most other models. However, performance was as follows - RMSE: 11.25
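A sketch of epsilon-SVR with an RBF kernel in scikit-learn; the hyperparameters and data below are illustrative, not the values behind the reported RMSE.

```python
# Sketch: epsilon-SVR with an RBF kernel. The non-linear kernel is what lets
# it capture structure that linear models miss; C and epsilon are illustrative.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
svr.fit(X_tr, y_tr)
print("test RMSE:", np.sqrt(mean_squared_error(y_te, svr.predict(X_te))))
```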
10. Comparison of Techniques / Results
Parameter           | Linear Regression | Lasso Regression | Holt Winters Filtering | K-NN   | Support Vector Regression
RMSE                | 47.44             | 43.34            | 71.65                  | 15.5   | 11.25
Processing Time     | Very Low          | Low              | Low                    | Medium | High
Actionability       | High              | High             | Low                    | Low    | Medium
Implementation Ease | High              | Medium           | High                   | Medium | Low
11. Conclusion and Further Research
Key takeaways from the stock market prediction analysis:
Linear Regression has a considerably high RMSE value, possibly because the stock market does not follow a linear model.
Best Performing Model: Support Vector Regression is the best model, which is clearly evident from the RMSE values shown in the previous slide.
SVR, being non-linear, is a better model than all the others that were used.
In terms of processing time, however, Support Vector Regression took the maximum time to process, as shown in the previous slide.
12. List of Time Series Models for Equity Factors
1. Linear Approximation
2. I.I.D. Analysis
3. Technical Indicators
4. Neural Networks
13. 1. Linear Approximation
About Linear Approximations:
• Linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data.
• Pros: Simple model.
• Cons: Patches of positive and negative residuals if the data is curved.
• In polynomial regression, the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial. It fits a nonlinear relationship between the value of x and the corresponding conditional mean of y.
• Pros: A polynomial fits curvilinear terms better.
• Cons: There is a likelihood of overfitting when using the polynomial regression method.
Observations - R²:
Equity | Linear   | Poly
MS     | 0.000575 | 0.125526
JPM    | 0.132070 | 0.655651
GS     | 0.092261 | 0.279183
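A sketch of the linear-vs-polynomial comparison behind this table, computing R² for both fits; the data and polynomial degree are illustrative, not the deck's.

```python
# Sketch: compare linear and polynomial least-squares fits by R-squared,
# in the spirit of the per-equity table above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 0.4 * x ** 2 - 0.2 * x + rng.normal(scale=0.3, size=300)

def fit_r2(degree: int) -> float:
    # Expand x into polynomial terms, fit OLS, and score the fit in-sample.
    X = PolynomialFeatures(degree).fit_transform(x.reshape(-1, 1))
    return r2_score(y, LinearRegression().fit(X, y).predict(X))

print("linear R2:", fit_r2(1))
print("poly R2:  ", fit_r2(3))  # degree is illustrative
```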
14. 2. I.I.D. Analysis
About I.I.D. Analysis:
A sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
Observation / Result:
Notice that the scatter plot is symmetrical with respect to the reference axes and resembles a circular cloud, which implies that all the terms are identically distributed. Thus, the returns for both sets of data are an invariant; in particular, all the terms in the series are independent of each other.
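The visual check described above can be reproduced with a lag-1 scatter plot; this sketch uses a synthetic returns series rather than the project's data.

```python
# Sketch of the slide's visual check: scatter each return against its lagged
# value. A symmetric, roughly circular cloud around the reference axes is
# consistent with the terms being i.i.d. invariants.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

returns = pd.Series(np.random.default_rng(0).normal(scale=0.01, size=1000))
plt.scatter(returns.shift(1), returns, s=5, alpha=0.5)
plt.axhline(0, linewidth=0.5)  # reference axes
plt.axvline(0, linewidth=0.5)
plt.xlabel("return at t-1")
plt.ylabel("return at t")
plt.title("Lag-1 scatter: i.i.d. check")
plt.show()
```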
15. 3. Technical Indicators
About Technical Indicators:
• Momentum: Momentum is the rate of acceleration of a security's price or volume. The idea of momentum in securities is that their price is more likely to keep moving in the same direction than to change directions.
• Relative Strength Indicator: The relative strength index (RSI) is a technical momentum indicator that compares the magnitude of recent gains to recent losses in an attempt to determine overbought and oversold conditions of an asset.
• Price Rate of Change: The price rate of change (ROC) is a technical indicator that measures the percentage change between the most recent price and the price "n" periods in the past.
• Price Volume Trend: A technical indicator consisting of a cumulative volume line that adds or subtracts a multiple of the percentage change in share price trend and current volume, depending upon their upward or downward movements.
• Bollinger Bands of 50-Day Moving Average: A Bollinger Band® is a band plotted two standard deviations away from a simple moving average, developed by famous technical trader John Bollinger.
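A pandas sketch of how three of these indicators might be computed; the synthetic price series and the window lengths (14 for RSI/ROC, 50 for the Bollinger moving average) are conventional defaults, not values from the deck.

```python
# Sketch: RSI, ROC, and Bollinger Bands from a "close" price column.
import numpy as np
import pandas as pd

px = pd.DataFrame({"close": 100 + np.random.default_rng(0).normal(0, 1, 300).cumsum()})

n = 14
delta = px["close"].diff()
gain = delta.clip(lower=0).rolling(n).mean()          # average gain over the window
loss = (-delta.clip(upper=0)).rolling(n).mean()       # average loss over the window
px["rsi"] = 100 - 100 / (1 + gain / loss)             # Relative Strength Index
px["roc"] = px["close"].pct_change(periods=n) * 100   # Price Rate of Change, in %
ma50 = px["close"].rolling(50).mean()                 # 50-day simple moving average
sd50 = px["close"].rolling(50).std()
px["bb_upper"], px["bb_lower"] = ma50 + 2 * sd50, ma50 - 2 * sd50  # Bollinger Bands
print(px.tail(3))
```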