The Data Behind Football
An analysis by Apostolos Mourouzis
11 December 2016
Introduction
Sport has always been considered a very emotional endeavour, where instinct and intuition take precedence over
fact. With the recent boom in data availability there are finally resources that enable an objective analysis of
the game. In this project we have chosen one particular dataset and use it to try to derive relationships and
correlations that provide a different perspective on the world of football.
Our data has been obtained from Football-Data, which gives the results of all football games in the top
divisions across Europe.
Our data set also includes betting odds from a range of betting houses. Let's take a first glimpse at our data
set (betting odds have been filtered out for the sake of clarity):
## Observations: 380
## Variables: 23
## $ Div <fctr> E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, ...
## $ Date <fctr> 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08...
## $ HomeTeam <fctr> Bournemouth, Chelsea, Everton, Leicester, Man United...
## $ AwayTeam <fctr> Aston Villa, Swansea, Watford, Sunderland, Tottenham...
## $ FTHG <int> 0, 2, 2, 4, 1, 1, 0, 2, 0, 0, 0, 0, 1, 2, 2, 0, 1, 1,...
## $ FTAG <int> 1, 2, 2, 2, 0, 3, 2, 2, 1, 3, 1, 3, 3, 0, 2, 0, 2, 2,...
## $ FTR <fctr> A, D, D, H, H, A, A, D, A, A, A, A, A, H, D, D, A, A...
## $ HTHG <int> 0, 2, 0, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1,...
## $ HTAG <int> 0, 1, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 2, 0, 0, 0, 2, 1,...
## $ HTR <fctr> D, H, A, H, H, A, A, D, D, A, A, A, A, H, H, D, A, D...
## $ Referee <fctr> M Clattenburg, M Oliver, M Jones, L Mason, J Moss, S...
## $ HS <int> 11, 11, 10, 19, 9, 17, 22, 9, 7, 9, 5, 17, 6, 19, 13,...
## $ AS <int> 7, 18, 11, 10, 9, 11, 8, 15, 8, 19, 9, 10, 19, 4, 16,...
## $ HST <int> 2, 3, 5, 8, 1, 6, 6, 4, 1, 2, 1, 4, 2, 6, 7, 5, 3, 4,...
## $ AST <int> 3, 10, 5, 5, 4, 7, 4, 5, 3, 7, 2, 4, 6, 2, 7, 0, 6, 7...
## $ HF <int> 13, 15, 7, 13, 12, 14, 12, 9, 9, 12, 14, 11, 7, 11, 1...
## $ AF <int> 13, 16, 13, 17, 12, 20, 9, 12, 16, 9, 10, 10, 7, 8, 1...
## $ HC <int> 6, 4, 8, 6, 1, 1, 5, 6, 3, 6, 3, 9, 6, 4, 4, 2, 8, 6,...
## $ AC <int> 3, 8, 2, 3, 2, 4, 4, 6, 5, 6, 5, 9, 6, 4, 3, 4, 4, 6,...
## $ HY <int> 3, 1, 1, 2, 2, 1, 1, 2, 2, 4, 2, 4, 1, 2, 2, 1, 1, 1,...
## $ AY <int> 4, 3, 2, 4, 3, 0, 3, 4, 4, 1, 2, 2, 2, 1, 2, 2, 3, 1,...
## $ HR <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ AR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
There are two things we can immediately see:
• The column names are not very descriptive.
• We have a lot of integer variables, which are useful for regression analysis.
With regard to the first observation, a guide to the column headings is available in the appendix. Interpreting
these, we can see that the data available is relatively high level for a football game, encompassing goals, shots,
fouls, corners and cards.
The structure of this study will be partitioned into two independent investigations, each answering a
different question about the football world.
Investigations
Which is the Dirtiest League?
This is a topic of much debate among football fans, and we can finally do some statistical analysis to find which of the big
leagues is actually the dirtiest. To investigate this we have several applicable variables:
• Fouls committed in a game
• Yellow cards given in a game
• Red cards given in a game
As red cards are signs of a larger infringement than yellow cards, we weight them accordingly using the
formula below:
$$\text{CardScore} = 5 \times \text{Yellow} + 10 \times \text{Red}$$
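As a minimal sketch of this weighting, the snippet below computes a per-match card score. It is written in Python purely for illustration (the report's output suggests the original analysis was done in R). It assumes that the cards of both teams are summed for each game, which the text does not state explicitly. The function name `card_score` is hypothetical, and the two sample rows are the first two matches shown in the data glimpse.

```python
# Weights taken from the CardScore formula in the text.
YELLOW_WEIGHT = 5
RED_WEIGHT = 10


def card_score(match):
    """Card score for one match, summing home and away cards (an assumption)."""
    yellows = match["HY"] + match["AY"]
    reds = match["HR"] + match["AR"]
    return YELLOW_WEIGHT * yellows + RED_WEIGHT * reds


# First two matches from the data glimpse (card columns only).
matches = [
    {"HomeTeam": "Bournemouth", "AwayTeam": "Aston Villa",
     "HY": 3, "AY": 4, "HR": 0, "AR": 0},
    {"HomeTeam": "Chelsea", "AwayTeam": "Swansea",
     "HY": 1, "AY": 3, "HR": 1, "AR": 0},
]

scores = [card_score(m) for m in matches]
for m, s in zip(matches, scores):
    print(m["HomeTeam"], "v", m["AwayTeam"], "card score:", s)
```

Averaging these per-match scores within each league would give a league-level figure comparable to the average card score discussed in the text.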
So let's investigate the variation in fouls and cards across the different leagues in 2015:
[Figure: average fouls committed and average card score per game, by league (Bundesliga, La Liga, Premier, Serie A)]
As expected, there are variations between the leagues. Let's draw out some insights from the chart above:
• As many football fans would expect, the Italian Serie A seems to be the dirtiest, at first glance at least!
• The Bundesliga, although high in fouls committed, seems to receive fewer cards per game. This could
mean that the fouls committed are mostly tactical.
• The Premier League seems to be the "cleanest" league, with the fewest fouls and cards per game.
• Although La Liga has fewer fouls on average than most of the other leagues, it has the highest card score.
When someone in La Liga decides to foul, he fouls hard!
In order to check whether there is a real difference between the leagues, we will perform an analysis of
variance (ANOVA). Our null hypothesis is that there is no difference in mean fouls or mean card score between the
leagues.
The ANOVA test for Fouls committed:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 16335 5445 144.4 <2e-16 ***
## Residuals 1262 47603 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the ANOVA test for the Card Score:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 34785 11595 90.99 <2e-16 ***
## Residuals 1262 160817 127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values for both ANOVAs are below the 0.05 threshold, which means we can reject our null hypothesis
and conclude that there is a statistically significant difference in both fouls and card score between the leagues.
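To make the link between the table columns explicit, the following Python sketch recomputes the F value from the printed sums of squares and degrees of freedom. The function name `f_statistic` is hypothetical. Because the printed sums of squares are rounded, the recomputed values agree with the tables only approximately.

```python
def f_statistic(ss_between, df_between, ss_residual, df_residual):
    """F = (between-group mean square) / (residual mean square)."""
    ms_between = ss_between / df_between
    ms_residual = ss_residual / df_residual
    return ms_between / ms_residual


# Sums of squares and degrees of freedom copied from the two ANOVA tables.
f_fouls = f_statistic(16335, 3, 47603, 1262)
f_cards = f_statistic(34785, 3, 160817, 1262)

print("F (fouls):", round(f_fouls, 1))
print("F (card score):", round(f_cards, 2))
```

A large F value means that the variation between league means is large relative to the variation within leagues, which is what drives the very small p-values.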
However, this conclusion is only applicable to the 2015/2016 season. Let's investigate whether it is
consistent across the last 5 years.
A two-way ANOVA of fouls across the leagues over the last 5 years is performed below:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 103962 34654 903.109 < 2e-16 ***
## Year 4 704 176 4.589 0.00106 **
## Div:Year 12 2829 236 6.144 7.37e-11 ***
## Residuals 7029 269716 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
So the resulting two-factor ANOVA indicates that the different leagues do consistently have significant
differences between them, but the year itself also affects the fouls committed. Although this is initially slightly
counterintuitive, we can see a significant difference in the means between 2011 (27.9 fouls a game) and 2015
(26.6 fouls a game). Furthermore, although the leagues do have different values, there is an interaction effect
with the year, which means that the number of fouls committed within a league changes depending on the year.
Now, although we know that there are differences in the "dirtiness" of the leagues, we can't say with confidence
what the order is. To do so we can run a Tukey analysis to see where the significant differences lie.
[Figure: Tukey Analysis on Card Score. 95% family-wise confidence intervals for the pairwise differences between leagues: Serie A-Premier, Serie A-La Liga, Premier-La Liga, Serie A-Bundesliga, Premier-Bundesliga, La Liga-Bundesliga]
[Figure: Tukey Analysis on Fouls. 95% family-wise confidence intervals for the same six pairwise differences between leagues]
The derived rankings for Fouls and Card Score are:

| Rank | Fouls | Cards |
|---|---|---|
| 1 | Serie A | La Liga |
| 2 | Bundesliga | Serie A |
| 3 | La Liga | Bundesliga |
| 4 | Premier League | Premier League |
We can now assert with confidence that:
Serie A is the dirtiest league, having the highest combined ranking across fouls and cards, which bears out its
age-old reputation. The Premier League is the least aggressive league, with the lowest ranking for both variables.
However, this may be a result of referee leniency as opposed to play style.
A Logistical Victory
Let's assume that you can obtain all the variables of a specific game: shots, corners, cards, fouls and betting
odds for both teams. Having these, is it possible to accurately predict whether a team has won?
In order to answer this question, we will perform a logistic regression using all these variables as input. Due
to the different nature of the European leagues, we will restrict this investigation to the Premier League to remove
variations based on play styles. The binary output for this regression will be 1 if the home team won and 0 if
it didn't.
Let's look at the distribution of home wins before we start the regression:
[Figure: histogram of HomeWin (x-axis, values 0 and 1) against count (y-axis)]
We can see that the two outcomes are not evenly split, so we would expect the logistic regression we perform
below to generate a similar distribution.
As we have several years' worth of data, there is no need to partition one data set into training and test sets. We
will use one year to fit the regression (in this particular case the 2015 season) and then test it on a
previous year's data (the 2014 season).
First, let's attempt our first regression utilising all the variables:
## [1] TRUE
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.31979143 1.47614950 0.8940771 3.712807e-01
## WHH -0.33415258 0.23461233 -1.4242754 1.543668e-01
## WHD 0.29607628 0.63993231 0.4626681 6.436023e-01
## WHA 0.15754079 0.18260871 0.8627233 3.882896e-01
## AR 1.55774944 0.46699806 3.3356657 8.509541e-04
## HR -0.75961151 0.56302408 -1.3491634 1.772845e-01
## HC -0.24655953 0.05295213 -4.6562721 3.219864e-06
## AC 0.06158437 0.06253096 0.9848621 3.246918e-01
## HS -0.12670429 0.03985025 -3.1795102 1.475242e-03
## AS -0.05803611 0.04335286 -1.3386918 1.806710e-01
## HST 0.47069370 0.07674571 6.1331599 8.615050e-10
## AST -0.29635324 0.08560794 -3.4617495 5.366764e-04
## HY -0.04661090 0.12225064 -0.3812732 7.030005e-01
## AY -0.04089751 0.10651777 -0.3839501 7.010154e-01
It seems our first attempt has identified many non-significant variables, which we will remove one by one (starting
with the highest p-value) to see how the model changes with each variable removed. Another interesting
observation is that there is no apparent relationship between the betting odds for the game and the result.
This is because the odds are distributed unevenly, with the majority closer to 1, similar to a 1/x function. To
linearise them, we simply apply the same operation and take the reciprocal of the odds. The charts of the original
and linearised odds are shown below:
[Figure: William Hill Home Odds (x-axis) against William Hill Away Odds (y-axis), in two panels: Original and Linear]
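The reciprocal transformation described above can be sketched in a few lines of Python. The function name `reciprocal_odds` and the sample odds are hypothetical and not taken from the dataset. The reciprocal of decimal odds is commonly read as the bookmaker's implied probability (ignoring the bookmaker's margin), which is one way to see why it behaves more linearly than the raw odds.

```python
def reciprocal_odds(decimal_odds):
    """Return 1/odds, the transformation used to linearise the odds."""
    if decimal_odds <= 0:
        raise ValueError("decimal odds must be positive")
    return 1.0 / decimal_odds


# Hypothetical decimal odds for illustration only.
sample_odds = [1.25, 2.0, 4.0, 8.0]
transformed = [reciprocal_odds(o) for o in sample_odds]

for odds, value in zip(sample_odds, transformed):
    print(odds, "->", value)
```

Raw odds are bunched up just above 1 with a long tail of large values, whereas the transformed values are spread between 0 and 1.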
Removing the highest p-value variables one by one until all remaining inputs are significant leaves the following
regression coefficients. The variables were removed in the following order:
Home Yellow, Away Yellow, Draw Odds (Win Odds suddenly became significant at this stage), Lose Odds,
Away Corners, Away Shots, Home Reds
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8155647 0.53610105 -1.521289 1.281873e-01
## I(1/WHH) 3.7902130 0.89015019 4.257948 2.063122e-05
## AR 1.3210913 0.45618447 2.895958 3.780024e-03
## HC -0.2860810 0.04844828 -5.904874 3.529168e-09
## HST 0.3359608 0.05887963 5.705893 1.157349e-08
## AST -0.3508141 0.06979585 -5.026289 5.000619e-07
Testing the variance inflation of the variables reveals that Home Shots (HS) has a high correlation with another variable.
This is logical, as Home Shots are intuitively related to Home Shots on Target (HST); in fact, investigating
the relationship between the two variables shows a 67% correlation. We therefore remove HS from our
regression model, which is why it is absent from the coefficients above. The variance inflation factors of the remaining variables are:
## I(1/WHH) AR HC HST AST
## 1.231508 1.012846 1.417507 1.208117 1.084360
Our final formula is:
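$$\text{LogisticRegression} = -0.82 + 3.79 \times \frac{1}{\text{WinningOdds}} + 1.32 \times \text{AwayReds} - 0.29 \times \text{HomeCorners} + 0.34 \times \text{HomeShotsOnTarget} - 0.35 \times \text{AwayShotsOnTarget}$$
where WinningOdds is the William Hill home-win odds (WHH), which enters the model as its reciprocal.
And therefore the match outcome can be predicted by:
$$\text{MatchOutcome} = \frac{1}{1 + e^{-\left(-0.82 + 3.79/\text{WHH} + 1.32\,\text{AR} - 0.29\,\text{HC} + 0.34\,\text{HST} - 0.35\,\text{AST}\right)}}$$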
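As an illustration, the formula can be evaluated with the Python sketch below, using the coefficients from the reduced regression output. The function names, the sample match and the 0.5 classification threshold are assumptions made for this example; the text does not state which threshold was used.

```python
import math

# Coefficients copied from the reduced regression output.
COEFFICIENTS = {
    "intercept": -0.8155647,
    "inv_whh": 3.7902130,   # multiplies 1 / WHH (William Hill home odds)
    "AR": 1.3210913,        # away red cards
    "HC": -0.2860810,       # home corners
    "HST": 0.3359608,       # home shots on target
    "AST": -0.3508141,      # away shots on target
}


def home_win_probability(whh, ar, hc, hst, ast):
    """Predicted probability of a home win from the fitted logistic model."""
    c = COEFFICIENTS
    z = (c["intercept"] + c["inv_whh"] / whh + c["AR"] * ar
         + c["HC"] * hc + c["HST"] * hst + c["AST"] * ast)
    return 1.0 / (1.0 + math.exp(-z))


def predict_home_win(whh, ar, hc, hst, ast, threshold=0.5):
    """Classify as a home win (1) or not (0); the threshold is an assumption."""
    return 1 if home_win_probability(whh, ar, hc, hst, ast) >= threshold else 0


# Hypothetical match, not taken from the dataset.
p = home_win_probability(whh=2.0, ar=0, hc=6, hst=5, ast=3)
print("P(home win) =", round(p, 3))
```

Note the signs: an away red card or an extra home shot on target raises the predicted probability, while extra away shots on target or home corners lower it.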
To evaluate our model, the most important step is to test its classification accuracy. Running
the derived regression formula on our 2014 dataset results in the following "predictions":
## Predicted Result
## Actual Result 0 1
## 0 168 40
## 1 67 105
Our logistic regression has classified 273 matches correctly out of a total of 380 games. This
gives an accuracy of 71.8%. Testing on several other years gives 75% in 2013, 66% in 2012 and 70% in 2011.
This supports the accuracy of our logistic model, with a mean prediction accuracy of 70.7%.
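The arithmetic behind these figures can be reproduced with a short Python sketch. The matrix is copied from the 2014 predictions table and the yearly accuracies from the text; the variable names are hypothetical.

```python
# Rows are actual results (0, 1); columns are predicted results (0, 1).
confusion = [[168, 40],
             [67, 105]]

# Correct predictions sit on the diagonal of the confusion matrix.
correct = confusion[0][0] + confusion[1][1]
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# Accuracies (%) quoted in the text for 2014, 2013, 2012 and 2011.
yearly = [71.8, 75.0, 66.0, 70.0]
mean_accuracy = sum(yearly) / len(yearly)

print(correct, "of", total, "correct")
print("accuracy (%):", round(100 * accuracy, 1))
print("mean accuracy (%):", round(mean_accuracy, 1))
```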
Let's do one final check by calculating a pseudo R-squared value for the model. We want the
McFadden value, which ranges from 0 to 1, to be as high as possible:
## llh llhNull G2 McFadden r2ML
## -186.6825419 -257.6351796 141.9052754 0.2753996 0.3116342
## r2CU
## 0.4198202
Although the McFadden value isn't particularly high, it is also not close to zero, so we can say that our model does
have some predictive power. Given the accuracy of around 70% over the several test sets, this model can be considered
satisfactory.
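For reference, the pseudo R-squared values in the output can be recomputed from the two log-likelihoods using the standard definitions. This is sketched below in Python. The function name is hypothetical, and the sample size of 380 is an assumption based on the 380 observations in one Premier League season.

```python
import math


def pseudo_r_squared(llh, llh_null, n):
    """Pseudo R-squared measures from the fitted and null log-likelihoods."""
    g2 = -2.0 * (llh_null - llh)                # likelihood-ratio statistic
    mcfadden = 1.0 - llh / llh_null             # McFadden
    r2_ml = 1.0 - math.exp(-g2 / n)             # maximum likelihood (Cox-Snell)
    r2_cu = r2_ml / (1.0 - math.exp(2.0 * llh_null / n))  # Cragg-Uhler
    return {"G2": g2, "McFadden": mcfadden, "r2ML": r2_ml, "r2CU": r2_cu}


# Log-likelihoods copied from the output; n = 380 matches is an assumption.
stats = pseudo_r_squared(llh=-186.6825419, llh_null=-257.6351796, n=380)
for name, value in stats.items():
    print(name, round(value, 4))
```

McFadden's measure compares the fitted log-likelihood with that of an intercept-only model, so a value of 0 would mean the predictors add nothing.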
Appendix
Column Name Translations

| Column Name | Meaning |
|---|---|
| FTHG | Full Time Home Goals |
| FTAG | Full Time Away Goals |
| FTR | Full Time Result |
| HTHG | Half Time Home Goals |
| HTAG | Half Time Away Goals |
| HTR | Half Time Result |
| HS | Home Shots |
| AS | Away Shots |
| HST | Home Shots on Target |
| AST | Away Shots on Target |
| HF | Home Fouls |
| AF | Away Fouls |
| HC | Home Corners |
| AC | Away Corners |
| HY | Home Yellows |
| AY | Away Yellows |
| HR | Home Reds |
| AR | Away Reds |