Niall Brooke
The Benefits of data mining to
aid in sports betting
Degree: BSc (Hons) Cyber Security
Student Number: 462109
Student Name: Niall Brooke
Project Start Date: September 17th 2012 (17/9/12)
Project End Date: April 19th 2013 (19/5/13)
Project Keywords: Data mining, Algorithms, Equations, Spreadsheets, Statistics, Sports.
Project Word Count: 15,000
This project investigates the possible benefits of data mining previous football statistics to
aid in sports betting. This will be achieved by researching many aspects of data mining and
sports betting, along with possible algorithmic techniques. Using this research, a theoretical
proposal will be developed.
Three seasons of football statistics will be collected for two different players who play for
teams in the Spanish premier league. Both players will play in similar positions so that the
results are accurate and fair. The statistics for each player will cover the main aspects of
their individual match performances, plus the impact they have on their team and vice versa.
Algorithms will be created to find common patterns within the data that correspond to the
wagers the end user will be able to make. Equations will then be written that use the patterns
found in the data to create a percentage-based risk assessment. This will give the user a clear
visual aid when deciding upon possible bets.
Contents
1 Analysis of problem
1.1 Statement of the Problem
1.2 Problem detailed
1.2.1 Anytime Goal Scorer
1.2.2 First goal scorer
1.2.3 In play betting
1.3 Significance of the Problem
1.3.1 Misleading odds
1.3.2 Media effect
1.3.3 Transfer market effects
1.3.4 Impulse betting
1.4 Resources Available
1.4.1 Software
1.4.2 Image manipulation
1.4.3 Spreadsheet application
1.4.4 Software development
1.4.5 Website development applications
1.4.6 Graphical developmental platforms
1.4.7 Database applications
1.4.8 Mobile development applications
1.5 Advantages of specific software
1.5.1 Excel Advantages
1.5.2 Excel Disadvantages
1.5.3 Flash Advantages
1.5.4 Flash Disadvantages
1.5.5 Dreamweaver Advantages
1.5.6 Dreamweaver Disadvantages
1.6 Current Solutions
2 Literature Review
2.1 Machine Learning
2.1.1 Bayesian Networks
2.1.2 Graphics model link
2.1.3 Bayesian networks in sport
2.2 Bookmakers
2.2.1 Gambling market
2.3 Sports Prediction
2.3.1 Prediction markets
2.3.2 Forecasting methods
3 Problem Requirements
3.1 Problem Specification
3.1.1 Data Source
3.1.2 User Interface
3.1.3 Calculations
3.1.4 Output
3.2 Possible Solutions
3.2.1 Excel Datasheet
3.2.2 Windows Application
3.2.3 Dynamic Website
3.2.4 Mobile Application
3.3 Chosen Solution
4 Design
4.1 Initial decisions
4.2 Data
4.3 Algorithm
4.3.1 Statistical Percentage
4.3.2 Bookmakers Percentage
4.4 Wireframe
4.4.1 Statistic data design
4.4.2 Calculations input design
4.5 Process structure
4.6 User Interface
4.6.1 Calculations page
4.6.2 Ronaldo datasheet
4.6.3 Messi Datasheet
4.7 User Interactions
4.7.1 Drop down variables
4.7.2 Player Variable
4.7.3 Venue Variable
4.7.4 Competition Variable
4.7.5 Goal type variable
4.7.6 Odds input
4.7.7 Calculated Percentages
4.7.8 Prediction advice
5 Testing
5.1 Methodology
5.1.1 Theory
5.2 Results
5.3 Result Analysis
5.3.1 Method Analysis
5.3.2 Method Flaws
5.3.3 Bookmakers calculations
6 Conclusion
6.1 Problem Analysis Review
6.2 Review of Literature Review
6.3 Problem Requirements Review
6.4 Design Review
6.5 Results Review
6.6 Future development
6.6.1 The Variables
6.6.2 Real time betting
6.6.3 The platforms
6.6.4 Business opportunity
6.6.5 Global Reach
7 References
8 Appendix
1 Analysis of problem
1.1 Statement of the Problem
The sports gambling industry has shown substantial growth in modern times due to the
introduction of online betting. This has led to the development of multiple sports
forecasting systems and models. Sports forecasting has been going on for decades, but it has
never received the amount of mainstream attention it does today. The problem with current
sports forecasting systems is that the majority try to predict either the result or the exact
score of a football match rather than focusing on the individual players.
1.2 Problem detailed
There are currently no publicly available sports forecasting systems that use the previous
statistics of individual players to predict whether they will score in a particular game.
Bookmakers now let their customers bet on a wide variety of possible outcomes for players and
for the team in general. This is a gap in the market where a system could be developed to
track players’ statistics and generate advice for gamblers on particular wagers.
1.2.1 Anytime Goal Scorer
The system would need to scan through a player’s goal scoring history, taking into
consideration many different variables and factors, to determine a reliable percentage for how
likely they are to score in a game. This percentage would then need to be compared with the
odds being offered for that outcome to determine whether or not it is statistically a good
investment.
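The comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the system designed later in this report: the function names, the list-of-goals input and the use of fractional odds are all assumptions made for the example, and the scoring rate here ignores the extra variables the text mentions.

```python
from fractions import Fraction


def implied_probability(fractional_odds: str) -> float:
    """Convert fractional odds such as '5/2' into the probability they imply."""
    numerator, denominator = fractional_odds.split("/")
    odds = Fraction(int(numerator), int(denominator))
    # Odds of a/b pay a for every b staked, so the implied chance is b / (a + b).
    return float(1 / (odds + 1))


def scoring_rate(goals_per_match: list[int]) -> float:
    """Share of past matches in which the player scored at least once."""
    matches_scored_in = sum(1 for goals in goals_per_match if goals > 0)
    return matches_scored_in / len(goals_per_match)


def is_value_bet(goals_per_match: list[int], fractional_odds: str) -> bool:
    """True when the historical rate is higher than the odds imply."""
    return scoring_rate(goals_per_match) > implied_probability(fractional_odds)
```

A bet is flagged only when the historical percentage is higher than the percentage implied by the odds, which is the sense of "good investment" used in this section.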
1.2.2 First goal scorer
Similar to the anytime goal scorer prediction, the system would need to be able to determine
how likely a player is to score first. This would require a similar algorithm to check through
the statistics of the previous encounters with that team. The percentages generated would be
significantly lower, and the odds on offer correspondingly longer. Because of this, slight
alterations may need to be made to some of the equations.
1.2.3 In play betting
Another issue that needs to be resolved is how a system can make predictions about players in
play. This would need to take the percentages already generated and feed them into a larger
equation that reflects what has happened during the match, including the current time of the
match in question. In theory, the later it is in the match the less likely a player is to
score, so their percentage chance of scoring should drop as the odds lengthen. However, this is
where data mining could aid a gambler and possibly give them the edge over a bookmaker. For
example, suppose a selected player scores in the last 5 minutes of games more often than the
average player does when their team is losing and the opponents are not performing well in the
league. A system could then inform the gambler that this bet has a statistically high chance
of paying off, even if the odds are high and the outcome seems unlikely.
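The late-goal example above amounts to measuring a scoring rate over only those past matches that satisfy a set of conditions. The sketch below shows one way this could be expressed; the `MatchRecord` fields and the `conditional_rate` helper are hypothetical names chosen for illustration and are not taken from the project’s datasheets.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MatchRecord:
    scored_late: bool           # player scored in the last 5 minutes
    team_losing_late: bool      # player's team was behind going into the last 5 minutes
    opponent_out_of_form: bool  # opponent was performing poorly in the league


def conditional_rate(
    records: list[MatchRecord],
    condition: Callable[[MatchRecord], bool],
) -> Optional[float]:
    """Late-goal rate among the matches that satisfy the condition.

    Returns None when no past match satisfies it, since no rate can be measured.
    """
    matching = [record for record in records if condition(record)]
    if not matching:
        return None
    return sum(record.scored_late for record in matching) / len(matching)
```

Comparing the rate under the condition with the rate over all matches shows whether the situation really makes a late goal more likely for that player. With only a handful of matching games the rate will be noisy, so a real system would also need to report how many matches it is based on.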
1.3 Significance of the Problem
Initially there may seem to be little significance to this particular problem in the grand
scale of things; however, when the key factors and the repercussions linked to it are analysed,
it is a big issue. Many forecasting systems have already been developed, but none solve this
particular problem. Several areas are affected by the problem, which can cause a ripple
effect. Most importantly, the customers of bookmakers deserve a truly representative
probability for their bets.
1.3.1 Misleading odds
Bookmakers, like the majority of companies, are trying to make as much profit as possible.
This can lead them to use questionable marketing strategies in an attempt to entice customers
to spend more money. Bookmakers tend to advertise odds in large colourful letters with the aim
of making customers take bets that seem better than they actually are. The bet being
advertised could in fact be statistically very unlikely to happen, but because it seems like a
good offer people will take it. Bookmakers can even use the players, managers or teams
themselves to create hype and increase the number of wagers on a particular bet. For instance,
if a player was playing his first match against a former team, bookmakers could use this to
offer a bet on that player to score, even when the data shows that the player very rarely
scores against teams that are currently in form. They can use this in many different
scenarios, such as goals, yellow cards or even man of the match awards. Lotteries have a
similar effect on customers. People who do not play the lottery on a regular basis may play
when there is a massive rollover jackpot. Even though statistically they have a minute chance
of winning, the advertising and hysteria it generates cause people to gamble with their money
thinking they have a good chance of winning, when they would be more likely to win by spending
their money on a scratch card with better odds but still a substantial prize.
1.3.2 Media effect
The media can also be used by bookmakers to take advantage of their customers by creating bets
around news stories. If a player has been in the news this week, the bookmakers will use this
to draw in customers and offer a bet that is not statistically fair. For example, suppose Theo
Walcott of Arsenal FC was quoted saying he wants to be the Premiership’s top goal scorer this
season. The bookmakers may offer ‘evens’ for him to score 2 or more goals against low level
opposition. However, statistically Theo Walcott only scores 2 or more goals in 3% of the
matches he has played. Because evens is the equivalent of a 50% chance, this bet is massively
stacked in favour of the bookmaker. The average football fan will remember seeing Theo
Walcott’s name in the news and how he is aiming to score more goals. This could then entice
them to place a bet thinking they have a good chance of winning when statistically it is very
unlikely.
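The size of the bookmaker’s advantage in this example can be worked out directly from the two figures given above (evens, and a 3% historical rate). The following is a minimal sketch; the function names and the stake of 10 are assumptions made for illustration only.

```python
def expected_profit(stake: float, win_probability: float, fractional_odds: float) -> float:
    """Average profit per bet: winnings weighted by the chance of winning,
    minus the stake weighted by the chance of losing."""
    winnings = stake * fractional_odds
    return win_probability * winnings - (1 - win_probability) * stake


def fair_fractional_odds(win_probability: float) -> float:
    """Fractional odds at which the bet would break even on average."""
    return (1 - win_probability) / win_probability


# Figures from the example above: a bet at evens (1/1) on an outcome
# that has happened in 3% of matches. The stake of 10 is illustrative.
evens = 1.0
p_two_or_more = 0.03

print(round(expected_profit(10, p_two_or_more, evens), 2))
print(round(fair_fractional_odds(p_two_or_more), 1))
```

On these figures the gambler loses about 9.40 on average for every 10 staked, and odds of roughly 32/1 would be needed before the bet broke even.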
1.3.3 Transfer market effects
Only in recent times have bookmakers started to offer wagers on player transfers between
teams. This demonstrates how much control and influence bookmakers can have over their
customers. News outlets can base their stories about the likelihood of a player being
transferred solely upon the odds currently being offered. This is especially evident when
bookmakers suspend the betting on a transfer, meaning that the bookmaker has stopped taking
bets because the transfer is almost certain to happen. However, a few days later they may
reopen the market, causing many customers to get worse odds than they originally should have.
1.3.4 Impulse betting
Customers can constantly be swayed by advertisements that seem too good to be true. It is now
common for adverts offering live in-play odds to be shown on TV during football matches. This
can lead many people to make impulse bets without thinking through how good the odds really
are. The bookmakers will offer seemingly good and realistic odds for the game currently being
played; however, when the statistics are examined it is clear that the customers are being
given a bad deal.
1.4 Resources Available
For a technological problem it is essential that the correct resources are available; these
include hardware, software and reading material. Liverpool John Moores University (LJMU) makes
a wide variety of educationally licensed software available to all computing students. The
software can be accessed in any of the computer science labs or remotely through the
university’s servers. There are several computer labs on campus with up-to-date hardware that
meets the specifications needed to run the software. There are two LJMU libraries, which are
open 24 hours a day, 7 days a week during term time. This allows students to access reading
material through journals, research papers and textbooks. The vast majority of resources are
available both physically and digitally. All of these factors provide a suitable work
environment for this project.
1.4.1 Software
For this project a variety of software will need to be considered, as there are many different
ways to approach this particular problem. It is essential to have word processing software,
which will be required to develop the final report and testing. For solving the problem in
question a selection of category-based software would need to be considered.
1.4.2 Image manipulation
For this problem an image manipulation application may be needed for many different tasks, as
graphics are highly common in modern applications. Adobe Photoshop is the industry leader in
this area and is available at LJMU. Image manipulation may be needed to construct website
templates, create icon graphics and edit screen captures. It is therefore important that this
software is constantly available for use during this project.
1.4.3 Spreadsheet application
A spreadsheet application would be very beneficial for a problem such as this, because of the
mathematical and equation elements that may present themselves. These tasks can be done in
other pieces of software; however, it would be much more streamlined and organised to create
them in a spreadsheet. Spreadsheet applications are designed for data manipulation, and the
problem is heavily related to this subject area. The Microsoft Office suite is available,
which contains the spreadsheet application Excel.
1.4.4 Software development
Having a software development platform is vital before starting to plan a solution for a
technological problem. Visual Studio, a suite developed by Microsoft, has been made available;
it allows users to create applications in a variety of programming languages. It could be used
to create, design and then publish various solutions to this problem. The software also
supports many different plugins and extensions, which can aid and customise the development
process. Code-based or design-based applications can be created, as the software contains a
simple-to-use GUI development kit for building user interfaces.
1.4.5 Website development applications
Many modern systems are designed solely for the internet, so web development platforms are
very important when collecting resources. It is very plausible that a website may need to be
created in order to manipulate data and present a user-friendly interface. There are several
web development platforms at the disposal of LJMU students; however, the most commonly used is
Adobe Dreamweaver, as it can easily be combined with other Adobe software.
1.4.6 Graphical developmental platforms
It may be necessary to use a graphical development platform, possibly in tandem with other
software, to create a better overall user experience. Software such as Adobe Flash could be
used to present the user with a rich multimedia application. Silverlight is also available;
however, given current trends it would be more beneficial to avoid this particular software
due to its small user base.
1.4.7 Database applications
The problem is heavily reliant on data, so a database may be required, either as a standalone
entity or connected to a website or application that calls upon the database to present the
data dynamically. Microsoft Access is available in the Office suite; however, MySQL Workbench,
which is also on the LJMU servers, would be a much more suitable choice as it is used much more
widely than its competitors. MySQL can also be used directly from other programs, so it isn’t
essential to have any particular software.
1.4.8 Mobile development applications
With smartphones becoming ever more popular, an app solution may be a very beneficial option.
Mobile applications can be built on several platforms. Adobe Flash, which has already been
mentioned, can output mobile applications; however, these are not the standard. Android mobile
applications can be developed in Java using the Android SDK in an environment such as Eclipse,
both of which are currently available among this project’s resources.
1.5 Advantages of specific software
1.5.1 Excel Advantages
The main positive of Excel is that it is a complete program; it doesn’t require anything else
to aid it. It is also very universal, as the file can be sent and viewed in emails and most
smartphones can open the documents without any problems. Visual Studio can also be integrated
with Excel to allow password-protected documents. Excel is known for its ease of use and its
helpful pop-up tutorials, which help when inputting equations.
1.5.2 Excel Disadvantages
A bad feature of Excel is that it can be susceptible to viruses, because Excel can run macros,
which can contain malicious code. Because Excel projects normally hold a large amount of data
in one file, they can end up running slowly due to the file size. This leads to a different
problem: if a user decides to split the project into multiple files, the overall chance of
losing data increases. Finally, Excel is limited in the amount of data it can hold because of
its fixed maximum number of rows and columns.
1.5.3 Flash Advantages
The biggest advantage of Flash is that it allows the creation of highly interactive multimedia
content, which can produce very good user interactions. Flash can also be integrated in many
different situations, such as websites and standalone applications. A great benefit of Flash is
that it is cross-browser compatible: it isn’t a problem if the user’s browser doesn’t support
the correct version of HTML, because as long as a Flash player is installed the content should
run fine.
1.5.4 Flash Disadvantages
Flash has a reputation for being an unreliable piece of software; it is quite prone to
crashing. Flash can also slow down websites because it runs something much more intensive than
usual. This can be combated with preloaders and multiple SWF files that are all loaded into
one main file when called upon.
1.5.5 Dreamweaver Advantages
Dreamweaver is great for website management; it allows users to switch between different sites
with a simple option on the side. The software has a built-in FTP connection tool, which makes
it very simple to update websites. Dreamweaver is also known for its good support for CSS style
sheets, which makes customisation very simple.
1.5.6 Dreamweaver Disadvantages
Dreamweaver’s simple interface can also be a drawback, as its point-and-click approach limits
how much code the user learns. Dreamweaver can also be restrictive, as it sometimes insists on
connecting to things in a certain way and will not allow other options.
1.6 Current Solutions
It is impossible to predict the result of every single football match, as there are far too
many variables that could change at any given moment. However, there are many different ways
to greatly increase the chance of predicting the result correctly. Some current solutions have
already been researched and developed. The most common approach to predicting a football match
is to review the current form of the two teams playing and any past fixtures they have played
against one another.
Mathew Tucker of the University of Southampton developed a website that used multiple
algorithms in an attempt to predict football results with a higher success rate by factoring in
the players themselves. The system would constantly pull data from the BBC Sport website and
store it in a database, taking into account how the team’s winning percentage changes
depending upon each individual player. Essentially the system would become more accurate over
time, as there would be more data to compare. He was also able to factor in the bookmakers by
pulling data from an odds comparison website, which allowed him to check whether the odds
matched his system and whether it was worth placing a wager. [1]
The system isn’t entirely unique; however, it does combine many different variables that can
provide some interesting results. After testing the system, it was clear there were many flaws
that needed to be fixed. The main one was that players who were in form at the present time
were in fact lowering the winning percentage of their team, because the past data showed them
to be a weak link. It did still correctly predict some results, but these were in reality
obvious bookmakers’ favourites, so it was clear the system wasn’t very reliable.
Seoul National University published a paper proposing a very comprehensive framework for
sports prediction. The framework is unusual in 2 ways: it uses a rule-based reasoner and a
Bayesian network component. The reasoner allows the system to forecast the results of matches
even if there is very little data available. The framework also takes into account real-time
scores, which is very rare in other predictors; in essence it is more of a simulator.
Strategies and tactics that teams use are also reviewed and incorporated into the framework,
to create a very carefully constructed system. [2]
Both of these systems are good in their own way; however, they both focus only on the result
for a team. Each team has 11 players, and each of the 22 players on the pitch can affect the
result of the game, so that is 21 more variables than in a system that focuses on 1 player.
Each one of these players has hundreds of possible variables of their own, which makes
calculating an exact percentage or likelihood very unreliable. A solution could, however, also
incorporate the team’s overall likelihood of winning to add more accuracy to an algorithm.
2 Literature Review
There have been many research papers on sports prediction, which go into depth about the
variables and algorithms that can be used to forecast the result of a sporting event. There is
also a vast amount of research into the odds and statistics involved in gambling, which can be
applied to sports wagering. This literature review covers all of these topics and also takes
an in-depth look at machine learning in general, showcasing any methods that could be used in a
sports environment.
2.1 Machine Learning
Machine learning is a branch of artificial intelligence concerned with systems that can learn
from data. This can be useful for tasks such as detecting spam email messages or finding
patterns in sports statistics. There are many different methods of machine learning that can be
implemented in a system; two of the most popular are decision trees and Bayesian networks.
Decision trees are used as a predictive model that maps observations of previous data to a
target outcome. Bayesian networks, on the other hand, are a mix of influence diagrams and
Bayes’ theorem. They display the conditional probabilities of, and the relationships between,
different variables.
2.1.1 Bayesian Networks
In modern times Bayesian networks have increasingly been used as a tool for modelling
statistical problems. They have also frequently been used in the reliability analysis
community. A reliability analyst tends to be someone who provides input to a decision
problem. Studies in this area tend to be uncertain because of random fluctuations, so the
end result should be a statistical model of random variables. The model should be
mathematically accurate but also clear and easy for the decision maker to understand. It is
vital that the set of parameters is fully specified from statistical data or expert
judgement. Because both of these sources are less than perfect, it is important to use a
formalism that reduces the number of parameters needed in the model [3].
In a statistical environment the figures sought tend to be conditional probabilities, or
values deduced from them. Because of these requirements there has been a lot more focus on
both traditional frameworks, such as fault trees, and more flexible ones. One framework that
has stood out is the Bayesian network. Bayesian networks originally came from the field of
artificial intelligence, where they were used as an effective framework for reasoning with
uncertain knowledge. Their use in reliability analysis can be traced back to Almond and
Barlow [4].
The suggestion of using Bayesian networks for reliability analysis has led to a flurry of
research comparing classical reliability formalisms with Bayesian networks. The modelling
and analysis features of reliability block diagrams have also been compared with Bayesian
networks, which have been shown to have many advantages over the traditional frameworks.
More recently, other more general reliability models have been compared with
Bayesian-related formalisms [5], [6].
Bayesian networks have had many different uses over the years, such as assessing software
reliability [7], finding faults in systems and maintenance modelling [8]. They are, however,
most commonly used to assess the reliability of software systems, where they are popular
because they combine multiple sources of information into a single overall picture. Fenton
[9] showed that the well-founded underlying theory of Bayesian networks provides significant
advantages. Woff [10] designed a series of software tests using Bayesian networks and
concluded that they are well suited to these problems.
Discrete variables are typical in the Bayesian network community. However, it must be noted
that reliability analysis would be very limited if only discrete variables were considered,
so it is important for analysts not to restrict themselves in that way and instead to use
models with both continuous and discrete variables. Overall, Bayesian networks are
considered a modelling framework that is easy for domain experts to use, including those in
the reliability field. In research, the Bayesian network community is working towards these
common aims and goals.
Bayesian networks belong to the family of graphical models. These are graphical structures
used to represent uncertain knowledge. Each node represents a random variable, while the
edges between nodes represent the dependencies among those variables. The dependencies in a
graphical model are determined using known statistical methods. Graphical models with
undirected edges are typically referred to as Markov networks. These networks provide a
simple definition of independence between any two distinct nodes, based on the concept of
the Markov blanket, and are commonly used in computer vision [11].
2.1.2 Graphics model link
Bayesian networks are also linked to another graphical model, the directed acyclic graph.
This structure is commonly found in machine learning and artificial intelligence. Bayesian
networks have a strong mathematical basis but are also distinctive: they allow an effective
representation of the joint probability distribution over a set of random variables [12].
The structure of a directed acyclic graph has two parts: the nodes and the directed edges.
The nodes represent random variables and are drawn as circles labelled with the variable
name. The edges represent direct dependence among the variables and are drawn as arrows
between nodes [13]. By design, no node in a directed acyclic graph can be its own ancestor
or descendant. This condition is important because it allows the joint probability over the
collection of nodes to be factorised [14].
A Bayesian network encodes a simple conditional independence statement: each variable is
independent of its non-descendants, given the state of its parents. This property can
greatly reduce the number of parameters required to define the joint probability
distribution of the variables [15].
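The factorisation and the saving in parameters can be illustrated with a minimal Python
sketch. The three variables (home, strong_opponent, scores) and every probability in the
tables are hypothetical values chosen purely for illustration; they are not taken from the
project data or from any cited source.

```python
from itertools import product

# Hypothetical network: each node maps to the list of its parents.
PARENTS = {
    "home": [],
    "strong_opponent": [],
    "scores": ["home", "strong_opponent"],
}

# Hypothetical conditional probability tables: P(node is True | parent values).
CPT = {
    "home": {(): 0.5},
    "strong_opponent": {(): 0.3},
    "scores": {
        (True, True): 0.35,
        (True, False): 0.60,
        (False, True): 0.20,
        (False, False): 0.45,
    },
}


def joint_probability(assignment):
    """Multiply P(node | parents) over every node (the DAG factorisation)."""
    probability = 1.0
    for node, parents in PARENTS.items():
        parent_values = tuple(assignment[p] for p in parents)
        p_true = CPT[node][parent_values]
        probability *= p_true if assignment[node] else 1.0 - p_true
    return probability


def parameter_counts():
    """Return (full joint table size, factorised size) for binary variables."""
    full = 2 ** len(PARENTS) - 1
    factorised = sum(2 ** len(parents) for parents in PARENTS.values())
    return full, factorised


# Summing the factorised joint over every assignment should give 1.
total = sum(
    joint_probability(dict(zip(PARENTS, values)))
    for values in product([True, False], repeat=len(PARENTS))
)
print(round(total, 10))
print(parameter_counts())  # (7, 6)
```

With only three variables the saving is small (six parameters instead of seven), but the
gap widens quickly as more variables with few parents are added.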
2.1.3 Bayesian networks in sport
Bayesian networks are also used to create predictive models for sports such as football.
Knowledge can be gathered from experts or from statistics to help determine the main factors
that can affect the result of a football match. These factors are complex, and they are
evidence that it is not just luck that determines a match. It is in situations like this
that Bayesian networks excel. A domain expert can collaborate with a Bayesian network expert
to create a network that captures the relationships between the main factors involved and
the direction of each effect in the network. Fenton [16] used an expert-built Bayesian
network to analyse the results of Tottenham Hotspur in the 1995/1996 and 1996/1997 seasons.
(Fig 1 a Bayesian network for Tottenham Hotspur in the 1995/1996 and 1996/1997 seasons)
Fenton used an expert Bayesian network, which provided excellent results over the
cross-season test periods. The model did not take into account the players’ training regime
in between the two seasons, ignoring some key attributes. This helped to show that an expert
Bayesian network can be used to select the key features. Fenton concluded that the overall
procedure of machine learning with Bayesian networks provides two main positive outcomes:
understanding and prediction. Logically, the more understanding we have, the better our
predictions should be; it is, however, possible to make accurate predictions without a
substantial amount of knowledge. The understanding gained from learning different processes
allows the construction of models that demonstrate the relationship between two different
models. [16]
2.2 Bookmakers
Gambling has become an integral part of the sports ecosystem. In recent years sports betting
companies have seen a massive boom in their market because of online gambling. However, for
all its success, the sports betting market has never been far from controversy.
Sports wagering can be traced all the way back to the original Olympic Games in Greece,
where athletes competed in many different events such as foot races, hurdles and even
freestyle fighting. The athletes were rewarded with winnings; however, the majority of the
money was won in the crowd. Spectators would wager on the outcomes of these events, winning
or losing entire estates at a time. Gambling can also be traced back to the Romans, who
swore it was a metaphor for life itself.
Sports betting can be found in almost every sport today. Bookmakers offer hundreds of
different ways to bet on a single event to entice more wagers. They use algorithms to find
patterns, make predictions and produce their odds, and they always set the odds in their own
favour so that they can be sure of a profit. From time to time, however, they may make a
mistake, or inside information may be leaked, which can cause a flurry of bets and force the
bookmakers to close bets.
2.2.1 Gambling market
Financial markets and sports wagering are very similar: both involve investors with
heterogeneous beliefs looking to make a profit. Sports betting is a zero-sum game with two
traders, one on each side of the transaction, and large amounts of money at stake.
With financial and sporting markets being so similar, it is peculiar that they are organised
in such different ways. In a financial market the price constantly fluctuates to match
supply and demand, and the main goal of the market makers is to match buyers with sellers.
Bookmakers in sports wagering, however, tend to announce a price (such as the odds for a
horse to win a race or a team to win a game), after which changes in the price are rare and
sometimes do not occur at all [17]. If the price chosen is not the market-clearing price,
the bookmaker could take a big loss [18]. If bettors notice that a bookmaker has made a
mistake with a price they can exploit it, leading to a big loss for the bookmaker. The risk
bookmakers face is categorically different from the risk casinos face on games of chance
such as roulette and blackjack. Those games have odds that favour the house, so with a large
number of people playing the casino is guaranteed to make a profit. Bookmakers, by contrast,
can make a big loss if they make a mistake with the odds, even in the long run. Small groups
of skilled gamblers who can consistently make a profit can be financially disastrous for
bookmakers. These gamblers could either amass a huge bankroll or sell their winning formula
to others, making the overall problem worse.
Although the system that bookmakers use is peculiar, there are a few ways in which it can be
very profitable. Bookmakers are very knowledgeable and have vast resources, allowing them to
equalise the amount of money wagered on each outcome. This presents them with a profit
regardless of the outcome, because the bookmaker charges a commission on bets known as ‘the
vig’ [19]. With this strategy bookmakers tend to focus not on the winner of the event but on
forecasting how wagers will be placed. Popular depictions of how bookmakers operate stress
this strategy [20].
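The arithmetic behind a balanced book can be sketched in a few lines of Python. The decimal
odds of 1.91 and the stakes of 100 per side are hypothetical figures chosen for
illustration, and the function names are not taken from any cited source.

```python
def bookmaker_profit(stakes, decimal_odds):
    """Bookmaker's profit for each possible winner.

    The bookmaker keeps every stake and pays stake * odds on the winning side.
    """
    total_staked = sum(stakes.values())
    return {
        outcome: total_staked - stakes[outcome] * decimal_odds[outcome]
        for outcome in stakes
    }


def overround(decimal_odds):
    """Amount by which the implied probabilities exceed 1 (the margin)."""
    return sum(1.0 / odds for odds in decimal_odds.values()) - 1.0


# Hypothetical two-outcome market with equal money on each side.
odds = {"team_a": 1.91, "team_b": 1.91}
stakes = {"team_a": 100.0, "team_b": 100.0}
print(bookmaker_profit(stakes, odds))  # roughly 9 units whichever side wins
print(overround(odds))                 # positive margin built into the odds
```

If the stakes are not balanced, for example 300 on one side and 100 on the other, the same
function shows a loss when the heavily backed side wins, which is why forecasting how wagers
will be placed matters to the bookmaker.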
The second main strategy applies if bookmakers are statistically more accurate than their
customers at predicting the outcome of sporting events. In this scenario the bookmaker would
always be able to set the correct price, one that equalises the probability that a bet
placed on either side of a wager is a winner. However, the bookmaker would then earn only
the commission and not the wagers themselves, as these would be cancelled out by covering
other losses. Unlike the previous strategy, the bookmaker would actually lose if the
gamblers were more skilled at predicting the outcome of an event.
The final strategy can be very profitable as it combines the advantages of both previous
strategies. If the bookmaker is better at predicting the results of games and can also
effectively predict the behaviour of the bettors themselves, it can make profits beyond its
commission. By systematically setting the ‘wrong’ prices for games, it can start to control
how wagers are placed and so create a larger profit margin for itself. For example, if a
local bookmaker knew that customers tended to bet on the local team, it could skew the odds
against that team. There are constraints on this, however: it could not be done on a large
scale, as bettors who know the ‘correct’ odds could generate a profit if the posted price
deviates too far from its true value [21].
2.3 Sports Prediction
Sports forecasting is becoming ever more important in the world of sport, affecting teams,
sponsors, the media and above all the fans who place wagers online. There has been a surge
in demand for professional advice regarding the results of sporting events, typically
delivered by tipsters or pundits [22]. Betting odds themselves are now used as a form of
forecasting, as they provide an overall prediction. Fixed odds, on the other hand, are
sourced from the expert predictions of the bookmakers [23].
2.3.1 Prediction markets
Prediction markets were originally used to forecast political election results [24] and
later business outcomes [25]. They are now increasingly used to predict all types of
sporting events, providing their own method of sports forecasting. Essentially a prediction
market is a large group of individuals connected via the Internet who trade virtual stocks
whose future value depends on a particular market situation. When the outcome linked to that
market situation occurs, each virtual stock held receives a cash payoff; for example, £1 if
a team wins and £0 otherwise. In a prediction market each individual contributes their own
knowledge, so the stock prices represent the combined wisdom of everyone involved, thus
creating the prediction [26].
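The link between a stock price and a forecast can be made concrete with a short Python
sketch. It assumes the £1-or-£0 payoff described above; the price of 0.62 and the trader’s
belief of 0.70 are hypothetical, as are the function names.

```python
def implied_probability(price, payoff=1.0):
    """Probability implied by the price of a stock paying `payoff` on a win."""
    if not 0.0 <= price <= payoff:
        raise ValueError("price must lie between 0 and the payoff")
    return price / payoff


def expected_profit(belief, price, payoff=1.0):
    """Expected profit per share for a trader who believes P(win) = belief."""
    return belief * payoff - price


# Hypothetical market: the stock pays 1 if the team wins and trades at 0.62.
price = 0.62
print(implied_probability(price))    # the market's aggregate forecast
print(expected_profit(0.70, price))  # positive, so this trader would buy
```

A trader whose belief is below the price has a negative expected profit from buying and
would sell instead; it is this trading that moves the price towards the group’s combined
view.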
Because of the large number of different forecasting methods, questions have been raised as
to how effective they are. Studies have examined the performance of betting odds and
tipsters [22]. However, there has been no extensive research comparing these methods with
prediction markets [27]. There is also little research into possible similarities between
multiple forecasting methods, which could be very effective if combined in an overall
weighting-based method. Such a method could be beneficial on a wider scale: if it were
openly available the average sports fan might be able to take advantage of it, and it could
also help sports betting companies such as bookmakers to improve their own forecasts.
2.3.2 Forecasting methods
2.3.2.1 Prediction markets
The general consensus on prediction markets is that markets do in fact solve information
problems [28]. A competitive market can achieve efficiency through the price mechanism,
which is regarded as the most effective means of aggregating the information dispersed among
individuals [29]. This means that the prices in a competitive market reflect the public and
private information held by individuals, thus providing a good predictor [26]. These
qualities make prediction markets a very promising method for solving many different
information problems [30].
There are many online prediction markets based around sport. They trade virtual stocks
related to future market situations, which are directly linked to the results of sporting
events. The cash pay-out of a share of virtual stock depends on the actual outcome of the
fixture. This means that the price of a virtual stock should match the prediction market’s
aggregate prediction of the event outcome.
The participants in a prediction market use their own judgement and expectations of the
result to work out the true value of the related share of virtual stock. They then compare
their expected cash payoff with the prediction market’s aggregate expectation. If they
expect to profit from the difference, it is in their interest to trade, and those
transactions reveal their view to the market. Participants therefore reveal their true
expectations through their buying and selling activities [22]. Because individuals make
their expectations tradable, a prediction market creates a market for future situations in
which participants compete according to their own expectations. Empirical studies support
the informational efficiency of such markets [30].
2.3.2.2 Tipsters
Professional forecasts are often made by sports pundits or tipsters, whose predictions are
normally broadcast through the media. Tipsters tend to be experts in a particular sport who
do not use any type of model to predict a result but rely on their personal experience [31].
They tend to offer forecasts only on popular fixtures, typically those with a close
connection to betting. Because of the nature of these forecasts, tipsters face no financial
consequences from the results of their predictions.
There is clear evidence from research that the forecast accuracy of tipsters is very limited
[31]. Tipsters have been shown to perform better overall than random forecasting methods,
but worse than systems that always forecast a home win. In one study three tipsters achieved
average accuracies of 42%, 41% and 43.5%, while a home-win system achieved 47.5% [27]. The
study also showed that football tipsters tend to predict fewer results correctly on average
than an average football fan.
2.3.2.3 Betting odds
A large body of economics and business research suggests that betting odds can provide a
very effective forecasting method [32]. Bookmakers set fixed odds according to their
expectations of a match’s outcome, based on probabilities. Fixed odds rarely change;
however, the growing number of ‘in play’ betting odds change constantly.
3 Problem Requirements
3.1 Problem Specification
The problem itself can be split into multiple sections. These can in turn be solved individually
to create a final solution. The main areas which need to be resolved are: the data source, user
interaction, the calculations and finally the output.
3.1.1 Data Source
Statistical football data on individual players needs to be collected and stored so that it
can later be manipulated. The data needs to detail every aspect that may affect a player’s
chance of scoring a goal in a game. This will cover the competitions in which the player
scores the most goals and whether each game was at home or away. The data will also need to
include the minutes in which the player has previously scored, to aid in match betting.
Once the data has determined how likely a player is to score at any given moment in a game,
data on the opposing team will also need to be collected in relation to the selected player.
This data will show how often, on average, the player scores against a team of that
standard. Even match attendance will be considered, as it may have an effect on how a player
performs. All this data needs to be combined in order to create a definitive prediction.
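One way to picture the data described above is as one record per match. The Python sketch
below is only an illustration of that idea: the field names, team names and figures are
hypothetical, and the project itself stores its data in a spreadsheet rather than in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchRecord:
    """One row of hypothetical match data for a single player."""
    competition: str
    opponent: str
    home: bool
    attendance: int
    goal_minutes: tuple = ()  # minutes in which the player scored, if any

    @property
    def scored(self):
        return len(self.goal_minutes) > 0


def select(records, competition=None, home=None):
    """Keep only the records matching the given competition and venue."""
    return [
        r for r in records
        if (competition is None or r.competition == competition)
        and (home is None or r.home == home)
    ]


# Made-up sample rows, for illustration only.
records = [
    MatchRecord("League", "Team A", True, 70000, (12, 78)),
    MatchRecord("League", "Team B", False, 35000),
    MatchRecord("Cup", "Team C", True, 60000, (55,)),
]
home_games = select(records, home=True)
print(len(home_games), sum(r.scored for r in home_games))
```

Keeping the goal minutes on each record means the same rows can serve both the per-match
statistics and the time-of-goal analysis.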
3.1.2 User Interface
The user interface will need to allow the user to select a particular player and input the
required match data in order to generate a prediction. The user will need to be supplied
with the appropriate forms and menus to input any required data.
3.1.3 Calculations
The calculations required for this problem need to use the relevant data collected to create
multiple equations which, when combined, form an overall algorithm leading to an accurate
prediction. The first equation calculates a percentage for each individual statistic linked
to a player. This can be achieved by averaging the binary values recorded for that
statistic. The main set of equations generates the actual percentage prediction; it has many
factors and combines multiple data averages. The final equation converts the bookmaker’s
odds into a comparable percentage. Once all of these calculations have been created they
should be hidden from the user, which will help protect the overall algorithm against
plagiarism.
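The first and last of these equations can be sketched in Python. This is an illustrative
sketch rather than the project’s spreadsheet formulas: it assumes the binary values are
stored as 1 (scored) or 0 (did not score) and that the bookmaker quotes fractional odds such
as "3/1". The function names are hypothetical.

```python
def stat_percentage(binary_values):
    """Average a list of 1/0 values and express the result as a percentage."""
    if not binary_values:
        raise ValueError("at least one match is required")
    return 100.0 * sum(binary_values) / len(binary_values)


def fractional_odds_to_percentage(odds):
    """Convert fractional odds such as "3/1" into an implied percentage."""
    numerator, denominator = (int(part) for part in odds.split("/"))
    return 100.0 * denominator / (numerator + denominator)


print(stat_percentage([1, 0, 1, 1]))         # 75.0
print(fractional_odds_to_percentage("3/1"))  # 25.0
```

Expressing both the player’s record and the bookmaker’s odds as percentages puts them on
the same scale, which is what allows the two to be compared directly.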
3.1.4 Output
The output needs to provide the user with an accurate percentage chance of the chosen player
scoring in the selected game. This will be displayed in three main ways: first, the exact
percentage chance of that player scoring with the selected variables; secondly, the
percentage chance implied by the odds given by the bookmaker; and finally, a text-based
message informing the user whether the bet is worthwhile. Once the user has been given the
prediction they will be able to reset the forms and selectable menus so that they can use
the system again.
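The three outputs can be combined into a single message, as in the minimal Python sketch
below. The decision rule used here, that a bet is worthwhile when the predicted percentage
is higher than the percentage implied by the bookmaker’s odds, is an assumption made for
illustration, as are the function name and the wording of the message.

```python
def bet_message(predicted_pct, bookmaker_pct):
    """Show both percentages and say whether the bet looks worthwhile.

    Assumed rule: worthwhile only if the prediction beats the bookmaker's
    implied percentage.
    """
    for value in (predicted_pct, bookmaker_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError("percentages must lie between 0 and 100")
    verdict = "worthwhile" if predicted_pct > bookmaker_pct else "not worthwhile"
    return (
        f"Predicted {predicted_pct:.1f}% vs bookmaker {bookmaker_pct:.1f}%: "
        f"{verdict} bet"
    )


print(bet_message(40.0, 25.0))
print(bet_message(20.0, 25.0))
```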
3.2 Possible Solutions
There are many different possible solutions to this problem, which need to be examined
before making a final decision. Each solution presents its own advantages and flaws. It will
be virtually impossible to find a ‘perfect’ system; however, once a solution has been
implemented and tested there will be room for improvement. In this project we have narrowed
the options down to four solutions: a Windows application, an Excel datasheet, a dynamic
website and a phone application. These four solutions will be briefly analysed, showcasing
their potential features and functionality.
3.2.1 Excel Datasheet
A very feasible solution would be to develop a datasheet that users could run on their own
systems. The datasheet would be created in a spreadsheet application such as Microsoft
Excel and would contain three main sheets.
The first two sheets would contain a database of statistics for the specific players and the
remaining sheet would display the user interface. The system will allow a user to use the
statistical data stored in the datasheet to determine whether a particular wager is
statistically a worthwhile gamble.
The design of the datasheet would make it possible to lock or hide the data from the user;
however, this would prevent the user from importing statistical data for new players or
adding to the existing data. The datasheet approach allows the application to grow and be
expanded to suit the users’ needs and personal requirements.
The user interface of the datasheet would be simple; however, the user would also be given
options to edit parameters that may improve their experience with the system. The user would
be given drop-down menus to select the players currently linked to the databases and would
then enter the variables to be presented with a result.
The installation of the system may cause problems for computer-illiterate users. There are
two possible solutions. The first would be to release the system as a raw datasheet file,
which would require the user to have suitable software on their system to run it. This could
lead to complications if the user does not have the required software. However, Excel is a
very common piece of software that often comes included with home computer systems. There is
also OpenOffice, a free alternative that would be able to run the datasheet. Another
solution would be to use software to convert the datasheet into an executable Windows/iOS
file, which would allow users to install the system as an application and launch it as a
standalone piece of software.
Developing the system in Excel has many advantages. Firstly, the software structure is
already in place, as the system is being developed in software designed for this purpose.
This will cut down on development time and allow more time to focus on the equations and
algorithms that are the main point of the system. The datasheet solution can also be edited
and customised by the user, which will give the application longevity.
3.2.2 Windows Application
The most commonly used solution to technological problems would be a Windows application,
because the vast majority of computer systems run the Windows operating system. This
solution would be written in a programming language such as C# or VB (Visual Basic). A
software development suite such as Visual Studio would also be required to create such an
application with maximum efficiency.
The application itself would have a GUI containing multiple drop-down menus and radio
buttons, allowing a user to select the two players being tested, the opposing team and
finally the odds being offered. The application would have the statistical data hard-coded
for demonstration purposes; with future development, however, it would need to be connected
to a data source such as a database hosted on the internet.
The application would be quite user friendly since, as stated, the majority of users are
familiar with the Windows operating system. It would need to be installed on the user’s
system by following an installation wizard. At this stage it would request network access if
the internet feature were implemented. The application would be designed so that even a
computer-illiterate user could use it, allowing it to be used by the masses.
The main advantage of a Windows-based application would be the very large target market,
which would be beneficial in a marketing and business sense, and development of the software
could return a profit. The software could be released for free, allowing users to trial it
and then pay a subscription or one-off fee for access to the data that fuels the
application.
There are, however, a few problems with developing a Windows application. Creating a complex
application synced to a server in C# would require an extensive amount of coding and
experience, so the development stage would be time consuming. The application would also
need to pass security checks, otherwise users’ anti-virus software would flag the executable
file as ‘suspicious’ and attempt to prevent the install, which may deter users from trying
the application. The longevity of this solution is nevertheless very positive: if the online
connection is implemented, there would always be updated statistics for users.
(Fig 1 windows application example)
The above image displays an example of a Windows application. The software is called ‘odds
wizard’ and shows the user the best available odds for football matches being played within
a given time frame. It has a similar user interface to the possible Windows solution for
this problem.
3.2.3 Dynamic Website
An alternative solution would be to host the entire system online, making it available to
users through a web address. The website would be dynamic, allowing the data to be updated
constantly. This would require a back-end database connected remotely to the server.
The design of the website would be quite simple in order to improve usability and
performance. This is important because if the website contained many large graphics or
intensive multimedia it could hinder the user’s experience by causing the site to run
slowly. The main purpose of the system is to provide the user with useful gambling
information, so it is important that this is presented clearly.
(Fig 2 dynamic website example)
The above image shows the dynamic website ‘footy brain’ mentioned in current solutions. The
website follows a similar design to this possible solution. It contains just two tabs, which
display the predictions and the data. The website does not, however, contain any user
interaction, which would be added for our solution.
The solution would be developed primarily in HTML, with the template created in an image
manipulation package such as Adobe Dreamweaver. To keep everything in sync, the web
development would be done using the same suite, Adobe Dreamweaver. The back-end database
would be developed using SQL, as it is the most commonly used language and the industry
standard for database development.
The main benefit of a web-based solution would be that users could access the system
remotely from any location with an internet connection. This would be very user friendly,
allowing users to use the service on a variety of devices. Being hosted on a server also
means that updating the system would be very easy, which would benefit both the user and the
developer. The user would not need to download or install the system onto their machine,
avoiding many issues that other solutions may encounter.
There are, however, some issues with this solution. The biggest is that any problem with the
hosting server would cause problems for the users. This would be hard to avoid as it would
be out of the developer’s hands. One remedy would be a backup server hosted locally;
however, this would still cause noticeable drops in website performance. The server and
domain name would also cost a yearly fee, so the system would constantly lose money if the
service were free. However, the service could contain premium elements to generate income,
or advertisements could generate income if the website received high traffic volumes.
3.2.4 Mobile Application
A combination of the previous solutions could be used to develop a mobile solution. With
smartphones and apps becoming more common among the general public, a mobile app has very
good potential. There are currently three main mobile operating systems: Android, Apple iOS
and Windows.
For this problem an Android-based solution would be most suitable, because Android is
currently the most used mobile operating system and is free to develop for. Development on
iOS requires a fee to become an official app developer and is much more time consuming.
Windows Phone 8 is currently new and does not have a large market.
This solution would need to be developed with the Android SDK, which is available for the
Eclipse development environment. Before starting, it is vital to determine the version of
Android to target, as many devices will not be able to use some of the features in newer
versions of the operating system.
Android applications are developed in Java, but Eclipse offers an easy GUI for building the
front end of applications. To test the application an emulator needs to be installed on the
development system, which creates a virtual Android device. Android applications declare
permissions, which the user must accept before downloading and installing the app. These
permissions allow the app to access information from the internet or from the device itself.
This solution would only require internet access, to pull live data from a hosted server, as
an included database would make the app too large.
(Fig 3 Android application example)
The above image shows a live application currently available on the Android Play Store. The
possible solution for this problem would be designed slightly differently, as it would allow
users to interact with the application, whereas this one simply shows information pulled
from a server. Our solution would contain two main tabs, similar to the dynamic web-based
solution, but would be designed primarily for touch-screen interaction.
The main advantages of a mobile solution are that it would be very user friendly and would
target a massive emerging market. The application would be free to host and could make a
profit through either advertising revenue or a flat fee. Making it a phone application would
also increase the usability of the system. However, like the other systems, the application
would require a server to pull the data from and a solid internet connection. This would
make the application unusable in areas with no Wi-Fi or cellular network connection.
3.3 Chosen Solution
We have decided upon using an Excel datasheet based solution. This has been chosen as it
has many advantages and few negatives compared to the alternative solutions for this particular
problem. When developing a system it is beneficial to have a good understanding of the
program or programming language required to create such an application prior to the design
stage. Taking into consideration that we have multiple years’ experience with Excel, it is a
comfortable development platform compared to using Java, C# or PHP, which would require
additional time to learn the languages.
Excel is the ideal solution as it fits all of the criteria for the problem specification. Firstly,
like all the solutions it will have an interactive user interface, which is vital to the
system. This will allow the user to select from several different players using a drop down
menu. It will allow the user to input the current score of the game being played. It will also
have true/false buttons so the user can determine the first goal scorer odds. Finally, it will be able
to dynamically display a percentage and a message informing the user if the bet is worthwhile.
The main reason for using Excel, however, is that the entire solution is based on an algorithm that
contains multiple equations. Because of this Excel is the standout choice, as its main purpose is to
manipulate data using formulas. The application itself allows equations to interact with each
other to create new data values. Excel is also designed to store data, which is an essential part
of the solution. It is very simple to store the statistical data in binary form so that data values
can be generated and used.
The main problem with the Windows application solution, and where Excel surpasses it, is the
limitation of operating systems. Because the Windows application only runs on Windows
machines, it limits the users that can use the system. With Excel, however, the file can be
opened on any operating system that can run a datasheet. Excel and OpenOffice are both
available for many different operating systems, therefore expanding the reach of the
application.
The dynamic web solution is very effective, however the main flaw is that the users need to be
connected to the internet to use the system. This is not a problem with the Excel solution as it
can be run offline on a user’s machine.
A mobile application is possibly the best alternative solution in terms of usability, however it
runs into the same problems as the previously mentioned solutions, as it would
initially be developed for only one version of a particular operating system and requires online
access to reach the data. Excel avoids both of these issues.
As initially mentioned there is no ‘perfect’ solution, and an Excel datasheet has its own flaws,
such as being less user friendly than the other solutions. However, with everything taken into
consideration, the main purpose of this system is to solve the problem presented. The main
problem was to find evidence that data mining could in fact aid in sports betting. The main
part of the design for this solution will be based around creating the series of equations and
algorithms needed to prove this statement. Due to all these factors an Excel datasheet has a
good balance of features and usability to create a functional system to solve this problem.
4 Design
This chapter will showcase every aspect of the design and implementation of the theoretical
system developed. Excel has been selected as the platform on which the solution to the
problem will be executed. This was due to the number of mathematical equations and the
simple output required to solve the issue. This solution does have a user interface and
interactions, however the solution is based more around the theoretical algorithm and
equations which run in the background.
4.1 Initial decisions
It was important when selecting the two players to test the system that they both played
in the same league and were as similar as possible, to create a good comparison. The league itself was
a big issue. Initially the English Premier League seemed the optimal choice due to the media
attention and the number of bookmakers who offer a variety of bets for each game. However, after
looking for two candidate players who were very similar, a match couldn’t be found. The
majority of research into sports forecasting also tends to be based around the English league
already, so the Spanish league seemed a better-suited choice. The position of the players was
debated: the obvious striker position was initially discussed, but it was decided that an
attacking winger may provide more interesting results. The system will allow the user to
compare the statistical data of Cristiano Ronaldo and Lionel Messi. These two players were
chosen to test the system for several reasons. They are of similar
ages and currently at the peak of their football careers, playing for the top two clubs in Spain.
They are constantly in the media spotlight, which can have an effect on performance.
They both have exceptional goal scoring records, meaning that bookmakers will try to tempt
customers to put money on poor odds on the assumption that they are guaranteed to
score in every game.
4.2 Data
The data collected was the backbone of the system, as the solution is based around
data mining. The data was sourced from a reliable historical sports website [ref] and then
entered into two different datasheets, one for each of the players. Each datasheet was set out with
all the dates of matches and the time in between. The time in between was added to allow for
future calculations that take into account how long the player has
rested or travelled from game to game. The performance of a player could be massively affected
if they had an international fixture on the other side of the world a few days before a league
fixture. Fatigue is an element that will be incorporated in the future and will be discussed
in more detail later in the paper. The data collected was stored in the form of simple digits
between 1 and 3 to make calculations and data input easier. If a player scored a goal in a
game, a variety of data would be logged: first, whether the player started the game and whether the match
itself was played at home or away; then the competition in which the player was competing,
i.e. the league or cup; and finally information regarding any goals the player scored, such as the total
and the minute in which each was scored. The datasheet can easily be modified to incorporate more
variables as the system expands.
4.3 Algorithm
The algorithm is the heart of the solution. It takes the values generated from multiple
different equations to create an overall figure. In this project the overall aim was to generate a
percentage of how likely a football player was to score in a game depending upon a selection
of variables the user has chosen.
4.3.1 Statistical Percentage
Initially the stats of Wayne Rooney of Manchester United were looked at to see whether data mining
could in fact help predict if a player would score. A test
was done by taking Rooney’s previous stats to create an average which could then be used to
make a prediction. The game in question was in 2012 between Manchester United and Stoke
City. Using statistics from the previous 3 seasons, taken a week before the game, it was shown that
he was very likely to score at least 1 goal. This was because Rooney had a 61% chance of
scoring against any team, with a 24% chance of it being the first goal. There was an 87%
chance of Manchester United scoring at least 1 goal against any team and a 29% chance of
Stoke City not conceding a goal against any team, meaning that there was a 79%
chance Manchester United would score at least 1 goal vs. Stoke City. Rooney then had a 48%
chance of scoring against Stoke City, with a 19% chance of it being the first goal. In the
game Rooney scored 2 goals, however he wasn’t the first goal scorer, which is consistent with the
statistics. This initial test encouraged the current algorithm, which takes into account
the individual’s performances in different competitions and how their goal scoring ratio
fluctuates depending on whether the player is playing at home or away. Depending on which
variables have been selected, a different calculation is made by combining different equations from
the datasheet.
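The figures quoted in the Rooney test can be reproduced with a short calculation. The sketch below is one reading that happens to match the numbers in the text, not a method stated in the project: the fixture figure is assumed to be the average of Manchester United's scoring rate and Stoke City's conceding rate (100% minus the 29% clean sheet rate), and the player's figures are then scaled by it. The function names are illustrative only.

```python
def team_scoring_chance(team_scores_pct, opponent_clean_sheet_pct):
    """Average the team's scoring rate with the opponent's conceding rate (assumed method)."""
    opponent_concedes_pct = 100 - opponent_clean_sheet_pct
    return (team_scores_pct + opponent_concedes_pct) / 2


def player_chance_in_fixture(player_pct, fixture_pct):
    """Scale a player's general percentage by the team's chance in this fixture (assumed method)."""
    return player_pct * fixture_pct / 100


# Percentages quoted in the text for Manchester United vs. Stoke City.
united_vs_stoke = team_scoring_chance(87, 29)
rooney_anytime = player_chance_in_fixture(61, united_vs_stoke)
rooney_first = player_chance_in_fixture(24, united_vs_stoke)

print(round(united_vs_stoke), round(rooney_anytime), round(rooney_first))  # 79 48 19
```

Under these assumptions the three results round to the 79%, 48% and 19% given above.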
4.3.2 Bookmakers Percentage
To be able to make a fair comparison with the percentage generated by the system it is
essential to convert the bookmaker’s fractional odds into a percentage as well. This percentage is worked out
slightly differently from the method of converting a standard fraction into a percentage, which
is dividing the first number by the second and then multiplying the result by 100. Using
this method provided uneven results, as the percentages were reversing as the odds dropped
below evens. It produced results such as a 300% likelihood of happening, which is
unusable. To resolve this it was first important to create a scale which could be balanced
overall. A starting point was fixing evens at 50%, under the logic that the outcome is just as likely
to happen as not. From this it was then possible to determine the equation to convert any
possible odds from the bookmakers to a comparable percentage.
Evens: 1 / 1 = 1, then 1 * 50 = 50%
Below evens: Low / High * 50 = X, then Y = 100 - X
Above evens: Low / High * 50 = X
An example of this would be Cristiano Ronaldo at 1/3 to score against Real Betis. Using the
equation, 1 / 3 = 0.33, then 0.33 * 50 = 16.7 and 100 - 16.7 = 83.3, giving him an 83% chance of
scoring according to the bookmakers.
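The conversion above can be sketched as a small function. This is a minimal illustration of the scale described in this section (evens fixed at 50%), with the "below evens" step read as 100 minus X so that the Ronaldo example comes out at 83%. The function name and the input check are assumptions added for the example; in the project itself the calculation is carried out by formulas in the Excel datasheet.

```python
def bookmaker_percentage(numerator, denominator):
    """Convert fractional odds (numerator/denominator) to the 50%-at-evens scale."""
    if numerator <= 0 or denominator <= 0:
        raise ValueError("odds must be positive numbers")
    low, high = sorted((numerator, denominator))
    x = low / high * 50
    if numerator < denominator:
        # Below evens (odds-on), e.g. 1/3: the outcome is rated above 50%.
        return 100 - x
    # Evens or above evens (odds-against), e.g. 3/1: rated at or below 50%.
    return x


print(round(bookmaker_percentage(1, 3)))  # 83
print(bookmaker_percentage(1, 1))         # 50.0
```

Sorting the two figures keeps the "Low / High" step from the equations, so the same function handles odds on either side of evens.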
4.3.2.1 Prediction Advice
Once both percentages have been calculated the next step is to determine whether the bet is
within a certain range of being classed as a worthwhile gamble. Bets on a player to score a goal
anytime in a match tend to offer poor odds for the customer. However, the more specific the
calculation becomes by adding more variables, the better the value for money the customer will receive per
bet. A calculation will be made determining if the bet is very good, good, average, bad or not
recommended. The brackets extend 10 percentage points either side of the bookmaker’s percentage, so for instance if the bookmaker’s odds were 66% the
brackets would be between 56% and 76%.
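The rating step can be sketched as follows. Only the five labels and the 10 point bracket come from the text; how the bracket is divided between the labels is not specified, so the cut-offs below (including the hypothetical `neutral` margin of 2 points) are assumptions made purely for illustration.

```python
def prediction_advice(stats_pct, bookmaker_pct, band=10.0, neutral=2.0):
    """Rate a bet by how far the statistical percentage sits from the bookmaker's.

    band    -- half-width of the bracket around the bookmaker's percentage (from the text)
    neutral -- hypothetical margin treated as 'average' (an assumption, not from the text)
    """
    difference = stats_pct - bookmaker_pct
    if difference >= band:
        return "very good"
    if difference > neutral:
        return "good"
    if difference >= -neutral:
        return "average"
    if difference > -band:
        return "bad"
    return "not recommended"


print(prediction_advice(72, 66))  # good
print(prediction_advice(56, 66))  # not recommended
```

With the bookmaker at 66%, a statistical percentage of 76% or more falls above the bracket and 56% or less falls below it, matching the example in the text.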
4.4 Wireframe
Before actually designing the beta system in Excel it is vital to create a basic design.
Wireframe designs help to create a skeleton view of a system or application. This helps
to get a good understanding of how it will eventually look. For Excel there are limitations
regarding design, however it can still be laid out in multiple different ways. This helps to
create a template which can be followed for future design alterations.
4.4.1 Statistic data design
The design of this page was made to be quite simplistic and generic so that it could be added
to with ease. The top variables bar showed all of the headings for the data and was made to be
frozen so when a user scrolled down the data it would always stay snapped to the top. This
was done to prevent data being entered in incorrectly and increase the user experience. Along
the side are the dates for the period of time the data was referring to.
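(Wireframe of the statistics datasheet: tabs for Player 1, Player 2 and User Interface; a Variables header row across the top, a Dates column down the side, and the Player Data in the grid)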
4.4.2 Calculations input design
The design of the calculations page was made to give the user simple options to select the
variables they want and get a fast result. There are 4 drop down variables which are clear
for the user. The percentages were placed in the centre of the interface as they could be
considered the most important visual the user sees. The percentage which is produced is the final
piece of output data, leading to the prediction advice and the end of the process.
(Wireframe of the calculations page: tabs for Player 1, Player 2 and User Interface; Player, Venue, Competition and Goal type dropdowns; Odds input; Bookie percentage and Stats percentage; Prediction Advice)
4.5 Process structure
This is a diagram showing the structure of how a run through of the system by a user would
work. The first main step is loading the initial user interface datasheet, which will present the
designed template. The user will then be prompted to enter the odds they have been given
for a particular bet. The user will then select the 4 variables from the drop down menus to
match the stipulations of the bet. Once this has been done the system will use all of the
variables and user input in several equations to create a scenario. This will then tell the
system to gather the required data from the statistics datasheet. The data will then be used in
an algorithm to generate the two percentages. Finally the percentages will be compared and
given a rating which will then be displayed to the user.
(Process diagram: Open data sheet, Input Odds, four Variables, Equations, Player Data, Algorithm, Advice)
4.6 User Interface
The user interface for this system has been made very simple, as the statistics it generates
are the key to solving the problem. In total there are three different sections which are
divided into datasheets. Two of these are full of data for each individual player while the third
sheet has all of the main algorithm and equations.
4.6.1 Calculations page
This datasheet only has a small section which is visible. This puts focus on what a potential
user would see and interact with to use the system. This is a prototype version of the system
which demonstrates the algorithm and equations which solve the original problem.
4.6.2 Ronaldo datasheet
This is the datasheet for Cristiano Ronaldo’s statistics; it contains all the variables used in the
main algorithm. New data can be imported or added in manually to increase the accuracy of
the system.
4.6.3 Messi Datasheet
This is the other datasheet, for Messi’s statistics. As mentioned above, it contains all the
information needed to aid the calculations page. It should also be mentioned that new variables
can easily be added to create an updated algorithm.
4.7 User Interactions
The system has several user interactions which allow users to alter the variables to change the
statistical percentage they are generating. Effective, well designed user interactions are very
important in any system, but for this particular solution they would need to be slick and fast to allow for the
possibility of in-play betting, which has fluctuating odds. New users who may not be
computer literate would also need to be able to use this easily in the future. The only selectable
cells in the datasheet are the user interactions; everything else is locked to prevent any
formula deletion.
4.7.1 Drop down variables
Each of the variables has its own drop down menu, which allows users to mix and match
different combinations to create their ideal bet. The selection was originally designed to use
radio buttons, but drop down menus seemed much more efficient.
4.7.2 Player Variable
This allows the user to select either Messi or Ronaldo. This interaction can be expanded upon
to have several players by simply adding in a new datasheet with player statistics.
4.7.3 Venue Variable
The venue variable simply determines whether the percentage should be calculated on the
basis that the player in question is playing at home or away from their home city. Away
should be selected if the game is being played at a neutral ground, such as in a cup final.
4.7.4 Competition Variable
This allows the user to select which competition the player is going to be participating in. This
is currently the variable with the most options and will have the most effect on the overall
percentage. This is due to the fact that a player may rarely play in a particular competition
such as the cup, and therefore their goal ratio statistics may be significantly lower than for the
other options.
4.7.5 Goal type variable
The final variable available to users in this prototype system is the goal type, for which they can
select either anytime or first. This option determines what type of bet is being made.
Typically it is more common for people to bet on the first goal scorer before the match as it
will have significantly better odds. This also means that users will see a very large change in
percentages when this variable is changed, sometimes even more so than with the competition.
4.7.6 Odds input
The odds input supplies two boxes which allow the user to type in the odds that they have
been given by the bookmaker. The two figures are the source data from which the percentage
for the bookmaker is calculated. There is also validation on these cells as they are the only
typed user input. This prevents any possible errors and crashes if the wrong type or number of
characters is entered.
4.7.7 Calculated Percentages
This displays the calculated percentages for both the bookmaker and the statistical prediction
to the user. This is considered to be a user interaction as a user may wish to modify the
variables or odds until they find a good bet. In essence the user is controlling the percentages,
which is why it is classed as an interaction.
4.7.8 Prediction advice
This is the overall result of everything involved in the system; it shows the user if a particular
bet is statistically worth it. This will be shown in a variety of different colours representing the
level of risk involved with placing a bet. A traffic light approach has been used as it is
a common and clear way of indicating ranking levels. It is important to have
clear and bold colours as a user may need to quickly see if a bet is worth it before the odds
suddenly change.
5 Testing
5.1 Methodology
For this project the aim was to prove that data mining could in fact be used to aid in sports
betting. The idea has been proposed and researched in depth before, with multiple systems
being developed, however none of them focused on individual players. The aim was to
develop an algorithm which would be able to generate a percentage that takes
multiple variables into consideration to make a solid prediction. This prediction would then
be compared with the odds given to see if any advantages could be found.
5.1.1 Theory
The theory was essentially that the raw data from an individual player’s previous seasons
could be used to calculate how likely said player was to score a goal in a certain situation.
The algorithm itself would need to use solid data averages, which would first need to be
calculated using multiple equations. These equations would work out the individual’s
variables one by one. For example, the following equation was used to work out the
individual percentage chance for 1 variable.
Variable / Total games played * 100 = X
Using this initial simple equation it was possible to work out preliminary statistics which
could later be used in the overall algorithm. An example of this would be:
Ronaldo has played a total of 31 games this season and scored 20, 5 and 6 goals across all competitions
20 + 5 + 6 = 31
31 / 31 = 1
This simply shows that, surprisingly, on average Ronaldo has an exact goal per game ratio of
1.0, which is the equivalent of 100%. From this initial calculation it would seem that any odds
given by bookmakers which are less than 100% would be a sure win. However, these stats
represent a general average for Ronaldo; for instance, if we add the variable of only
Champions League games his average will drop considerably, to 80%.
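The single-variable calculation above can be sketched in a few lines. This is a minimal illustration that reads the equation as the variable total divided by games played, which is what the Ronaldo example computes; the function name is an assumption made for the example.

```python
def variable_percentage(variable_total, games_played):
    """Express a per-game average of one variable as a percentage."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return variable_total / games_played * 100


# Ronaldo example from the text: 20 + 5 + 6 goals in 31 games.
season_goals = 20 + 5 + 6
print(variable_percentage(season_goals, 31))  # 100.0
```

Restricting the inputs to one competition, venue or goal type gives the narrower percentages that the full algorithm combines.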
Taking everything into consideration an algorithm can then be constructed to create very
precise percentages that represent a player’s likelihood of scoring a goal within a match. For
instance using the system it is possible to work out how likely Messi is to score in the last 10
minutes of an away match in the Spanish league.
((5 + 3) / 41 * 100 + 22 + 51 + 14) / 3 ≈ 36%
This takes into account all of the variables mentioned to create a percentage prediction. This
would be high for an average player, however due to Messi’s abnormal goal scoring
record it is very realistic by his standards.
5.2 Results
To test the results of this project a series of matches and odds were used from 2 different
bookmakers to test the accuracy of the predictions. Both players were tested so that a
comparison could be made. A total of 5 fixtures were tested for each player to give a good
spread of data to analyse.
The following table shows testing for Cristiano Ronaldo over a 5 game period
Date      Fixture   Prediction   William Hill   Ladbrokes   Outcome
16/3/13   La Liga   94%          88%            91%         1 Goal
30/3/13   La Liga   92%          86%            90%         1 Goal
6/4/13    La Liga   92%          88%            89%         1 Goal
9/4/13    Europe    88%          88%            90%         2 Goals
14/5/13   La Liga   92%          90%            92%         2 Goals
From these 5 games tested for Ronaldo the statistical prediction was in fact more accurate
than the bookmakers’ predictions, which was very surprising. It was beneficial for the test
that Ronaldo scored in every test match. Between the two bookmakers, Ladbrokes stayed
quite firm with their percentage while William Hill was more flexible. This may be due to
them trying to offer better odds to entice customers to place bets with them. The only blip
in the testing for the system was the game in Europe, which had a percentage drop for
Ronaldo. This is because Ronaldo is currently on form and scoring freely in all competitions,
so his previous data for Europe wasn’t as accurate due to there being a large gap since the last
match.
The following table shows testing for Lionel Messi over a 5 game period
Date      Fixture   Prediction   William Hill   Ladbrokes   Outcome
12/3/13   La Liga   92%          90%            91%         2 Goals
17/3/13   La Liga   94%          94%            94%         2 Goals
30/3/13   La Liga   95%          95%            95%         1 Goal
2/4/13    Europe    94%          94%            95%         1 Goal
10/5/13   Europe    94%          65%            60%         0 Goals
The test results for Lionel Messi showed some interesting results and possible flaws with
the current design of the system. The final test match had a massive drop from the two
bookmakers. This was due to Messi being injured in a previous match and being doubtful to
start the game. The current system has no variables for injuries and therefore predicted a very
high percentage even though Messi was unlikely to play the full game, giving him much less
chance of scoring. The bookmakers still gave him good odds to score for an
injured player, but given his goal ratio this wasn’t too surprising. The overall results, however,
were positive, with some predictions matching or surpassing the bookmakers, which showed
the system can be very effective if there are no unusual variables.
5.3 Result Analysis
The test results showed on average that the developed system was more effective than the
bookmakers’ algorithm, however this may have just been coincidence as only a small number
of games were tested. From viewing the odds the bookmakers gave out it is clear that they try to
keep a similar pattern unless something drastic happens, such as an injury. This is an
interesting approach, as they always have low odds for Messi and Ronaldo to score no matter who they
are playing. Even though these two players will end up scoring a high ratio of goals, they will
never score in 100% of their games. This is because, even though for instance
Messi’s goal total is higher than his overall games played, he may have scored multiple times
in one match, thereby skewing the statistics.
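The difference between a goals per game ratio and the share of games in which a player actually scores can be shown with a small sketch. The match list below is hypothetical illustrative data, not taken from the project's datasheets, and the function names are assumptions.

```python
def goals_per_game_pct(goals_by_match):
    """Total goals divided by games played, as a percentage."""
    return sum(goals_by_match) / len(goals_by_match) * 100


def scoring_game_pct(goals_by_match):
    """Share of games in which at least one goal was scored, as a percentage."""
    games_with_goal = sum(1 for goals in goals_by_match if goals > 0)
    return games_with_goal / len(goals_by_match) * 100


# Hypothetical run of six matches: a hat-trick and a brace inflate the ratio.
sample_matches = [3, 0, 2, 0, 1, 0]
print(goals_per_game_pct(sample_matches))  # 100.0
print(scoring_game_pct(sample_matches))    # 50.0
```

In this made-up sample the goals per game ratio is 100% even though the player scored in only half of the matches, which is the skew described above.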
5.3.1 Method Analysis
The solution which has been developed and implemented has many different levels of depth
and complexity. The player variables are the key factor that sets this system apart from other systems
currently available, which focus only on team variables and very rarely attempt to
include the players in their equations. However, there may be a reason for this, as players
can be very unpredictable.
5.3.2 Method Flaws
One flaw which the system will come across is that individual players can be very
unpredictable, as each is one person with their own free will. This means that even if all
the statistics point to a player having a fantastic game and scoring at least 1 goal, there may be
other factors off the pitch. For instance a player may just have had some very bad news and
therefore not be playing to their full potential. This may be why there are many team
based forecasting systems, as overall they create an average of stability compared to 1
possibly unpredictable player. The system also has a drawback with the
5.3.3 Bookmakers calculations
Bookmakers have their own way of working out their odds, and each match is individually
calculated. They work out every possible outcome before setting any odds. This is an
example of how they calculate the distribution rate.
100% x (1 / [(1 / odds of [1]) + (1 / odds of [X]) + (1 / odds of [2])])
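The distribution rate formula can be sketched as a short function. This assumes the odds for the three outcomes (1 = home win, X = draw, 2 = away win) are given as decimal odds; the function name and the example odds are hypothetical and chosen only to illustrate the arithmetic.

```python
def distribution_rate(odds_home, odds_draw, odds_away):
    """Percentage of stakes returned to customers across a three-way market.

    Assumes decimal odds, so 1 / odds is the share implied for each outcome.
    """
    implied_total = 1 / odds_home + 1 / odds_draw + 1 / odds_away
    return 100 / implied_total


# Hypothetical decimal odds for a home win, draw and away win.
rate = distribution_rate(2.0, 3.5, 4.0)
print(f"{rate:.2f}%")  # 96.55%
```

A rate below 100% reflects the margin the bookmaker builds into the odds for that match.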
Taking in all of the information from the bookmaker and the system that was developed, it is
clear that both have smart systems which can compete against each other without ever
finding a winner. Using a system such as this would definitely be more beneficial for
customers than just betting on bets they believe look good, as these could just be a ploy to get them to
spend money.
6 Conclusion
Overall this project has been a successful endeavour; the original problem has been
theoretically solved with evidence from testing and calculations. This was achieved through
multiple different factors which all contributed equally. There were a few problems which
led to alterations to the designs and modifications to the equations, however they
improved the system. The research conducted looked at a variety of different research papers
and alternative solutions already available. The entire project had its strengths and weaknesses,
which will be analysed to determine what was done well and what could have been done
differently. There is much room for improvement in this project, and it has only touched
the surface of an idea that may become mainstream in the not so distant future. In conclusion the
project was a success and will definitely be expanded on in the future.
6.1 Problem Analysis Review
The problem was determined from the project question and from that a firm target was made.
The goal was to create a system or method of using previous sports statistics to aid in sports
betting. This was discussed in detail and a few current solutions were found and briefly
looked at. There was good analysis of the gap in the market for a possible system to be
created and what techniques it could use. However, it would have been better to compare and
contrast some current systems in more detail. This may have led to an improved overall
system if parts from other systems were incorporated.
6.2 Review of Literature Review
The research conducted was good but more should have been added overall. Some good
interesting theories and information were found which helped when designing the solution
however there were several more areas which possibly should have been covered.
6.3 Problem Requirements Review
This section of the report was based upon all the possibilities and requirements that would be
in the project. A good selection of possibilities was discussed and each one had its own
strengths and weaknesses. A final solution was
6.4 Design Review
The designs were very simple, as the project was mainly about the equations and logic of the
algorithm rather than creating a perfectly designed user interface. The designs could have
included more planning and the use of a Gantt chart.
6.5 Results Review
Overall the results were very promising. They showed that the system works and that it is
reliable in most scenarios. However, they did show that there were a few minor problems regarding
variables that could not be controlled, such as injuries and red cards. These would only be able to
be incorporated once the system was live and streaming data constantly.
6.6 Future development
This project in theory solved the problem with which it was presented, however there is a
massive possibility for expansion and improvement which could lead to a new generation of
sports betting. The solution depicted would just be the bare bones of a fantastic concept
which could expand onto many different platforms.
6.6.1 The Variables
The data initially collected contained many more variables which could have enhanced the
overall accuracy of the prediction. One of these variables was the attendance at
games; this could be a major factor as statistically players play much better with a home
crowd. A very small crowd could either make the player play
better because there is less pressure or worse because there would be less adrenaline. The
form of the team the player is playing against could also be a variable that has a
very large effect. This could be presented in different ways, such as the team’s current form or
their overall position last year. Players may only tend to score goals against teams that always
finish in the bottom half of the table. The form of the team the player
plays for could also be analysed to see how much of an effect that has. In the system developed we only took into
account whether a player played in a match, however they can also be substituted on or off. This
would lead to a better implementation using the exact number of minutes a player has
played to determine more accurate goal averages.
6.6.2 Real time betting
A large improvement that could be made in the future is to incorporate the minutes in which a
player scores goals to allow real time betting. The system would allow the user to not only
select the initial variables but also to select the current data from a live game. This would
mean that a percentage could be calculated about a very specific situation that could then be
bet on. New data such as the current score of the game, how many
minutes have been played and even any red cards could then be incorporated into the
algorithm. An example of this would be if Chelsea were playing Liverpool, in the 66th minute
and with the score at 2-1, the system would be able to work out the percentage chance of
Liverpool’s Luis Suarez scoring the next goal. Using all of the new additions this would look
at how many times Suarez has scored against teams who finished in the top 5 last season after
the 70th minute when Liverpool were losing. This would present a unique percentage which
would then be compared to the bookmaker’s odds. This has a massive potential for success as
bookmakers will tend to lower the odds of a player scoring depending solely on factors about
the game. This would include points such as the current score and overall performance of the
player. Another major factor is the actual bets which are being placed on the player. If the
player gets a flurry of bets the odds will drop, and vice versa. This means that having statistical
data could prove to be a much more accurate way to make short term predictions about live
games.
6.6.3 The platforms
A system such as this could be developed in the future for several platforms in order to
reach a large market. The first step, once the Excel prototype system has
been perfected, would be to develop a website based system which requires a login so
that a user base can be formed. The online system would have a large database running
behind it which would constantly pull data from a reliable statistics website to stay up to date.
The system would not only allow users to create custom odds but also automatically search for
and compare the best differences currently available between its percentages and the
bookmakers' odds. Every player from the top 6 leagues
would have statistics so that users would have a wide variety of choices and could find the best
possible bets. This automatic feature would benefit customers who
have no football knowledge and just want to know what is statistically likely to
happen and which bet would be the best value for their money. Following on from this, a
Windows application could be realised which syncs with the servers and allows users to
use the system from their desktop. Finally, a mobile application could be created which
would work on a similar premise to the Windows application but have a simpler design and
a touchscreen friendly layout.
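The automatic search for the best value bets mentioned above could work along the following lines. This is a minimal sketch rather than a finished design: the Candidate structure, the labels and all of the numbers are invented for illustration, and the ranking rule (statistical percentage minus the percentage implied by the fractional odds) is an assumption about how "best value" would be measured.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A possible bet. All names and figures here are hypothetical."""
    label: str
    statistical_pct: float  # percentage calculated from past statistics
    odds_num: int           # fractional odds numerator, e.g. 3 in 3/1
    odds_den: int           # fractional odds denominator, e.g. 1 in 3/1


def implied_pct(odds_num, odds_den):
    """Convert fractional odds to the percentage they imply (evens = 50%)."""
    return 100.0 * odds_den / (odds_num + odds_den)


def rank_by_value(candidates):
    """Return (label, difference) pairs, largest positive difference first.

    Bets where the bookmaker's implied percentage is at least as high as
    the statistical percentage are dropped, as they offer no value.
    """
    scored = [(c.statistical_pct - implied_pct(c.odds_num, c.odds_den), c)
              for c in candidates]
    positive = [(diff, c) for diff, c in scored if diff > 0]
    positive.sort(key=lambda pair: pair[0], reverse=True)
    return [(c.label, round(diff, 1)) for diff, c in positive]


bets = [
    Candidate("Player A anytime scorer", 40.0, 3, 1),  # implied 25%
    Candidate("Player B anytime scorer", 30.0, 1, 1),  # implied 50%
    Candidate("Player C first scorer", 25.0, 4, 1),    # implied 20%
]
best = rank_by_value(bets)
```

A user with no football knowledge would then only need to look at the first entry in the ranked list.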
6.6.4 Business opportunity
This system has the potential to earn a good profit in the future. A subscription based service
would be offered to supply the data and percentages. People may be more willing to
pay for a service that has a good chance of earning them more money in the not so distant
future. Gamblers are also looking for any advantage they can get, so this is a
plausible business idea. The database and algorithms would never be made available in raw form to
customers, to prevent them from being copied. One idea would be to offer the service
free for a limited time to new customers so that they can get a taste of the system before
subscribing. This could work by letting them view a total of 10 calculated percentages per
registered email account. During the early stages of the website, advertisements could be used
to generate a steady income from users browsing and clicking on them. When the application
is finally released, a small charge such as 59p could be made for the download, which
would be low enough to encourage new users to try it. This seems like a small fee, but if the app were
downloaded over 500,000 times it would bring in more than £295,000 before any fees or costs.
6.6.5 Global Reach
Once the system has been fully established, the next step would be to expand beyond one
particular sport. America has a massive market for sports betting and there are many systems
that attempt to forecast the results of different sporting events. However, an altered
version of this algorithm could still gain a lot of attention. The system could be used in
many different sports such as American football and basketball. These sports are highly
tactical and their statistics are heavily tracked. For example, American football teams tend to have
players for particular roles in the game, and may bring a player on for only 5
minutes to do a certain job. Even though American football is very different from football, the system has great
potential to be expanded into that area.
Summary
The aim of this project was to find out whether it was possible to use data mining to increase the
chances of winning sports bets. Overall the project proved that it was possible; however, it
would take more time to fine tune the algorithm.
A system was set up in Excel so that the formulas and equations could be tested
with a semi-automatic process. This was made to demonstrate the impact that this or a
similar algorithm could have on the betting industry.
Niall_Brooke_Project_final.docx

  • 1. Niall Brooke The Benefits of data mining to aid in sports betting Degree: BSc (hons) Cyber Security Student Number:462109 Student Name: Niall Brooke Project Start Date: September 17th 2012 (17/9/12) Project End Date: April 19th 2013 (19/5/13) Project Keywords: Data mining, Algorithms, Equations, Spread sheets, Statistics, Sports. Project Word Count: 15,000 Investigating the possible benefits of data mining previous football statistics to aid in sports betting. This will be achieved by researching many aspects of data mining and sports betting along with possible algorithm techniques. Using this research a theoretical proposal will be developed. Three seasons of football statistics will be collected from two different players who play for teams which partake in the Spanish premier league. Both players in question will be of similar playing positions to create accurate and fair results. The statistics for each player will cover the main aspects of their individual match performances plus the impact they have on their team and vice versa. Algorithms will be created to find common patterns within the data which are incoherence with wagers the end user will be able to make. Equations will then be written which will use the patterns found in the data to create a percentage based risk assessment. This will provide the user a clear visual aid when deciding upon possible bets.
  • 2. Niall Brooke Contents 1 Analysis of problem............................................................................................................5 1.1 Statement of the Problem............................................................................................5 1.2 Problem detailed..........................................................................................................5 1.2.1 Anytime Goal Scorer ...........................................................................................5 1.2.2 First goal scorer....................................................................................................5 1.2.3 In play betting ......................................................................................................5 1.3 Significance of the Problem........................................................................................6 1.3.1 Misleading odds...................................................................................................6 1.3.2 Media effect .........................................................................................................6 1.3.3 Transfer market effects ........................................................................................7 1.3.4 Impulse betting.....................................................................................................7 1.4 Resources Available....................................................................................................7 1.4.1 Software...............................................................................................................7 1.4.2 Image manipulation .............................................................................................7 1.4.3 Spreadsheet application .......................................................................................8 1.4.4 Software development 
.........................................................................................8 1.4.5 Website development applications ......................................................................8 1.4.6 Graphical developmental platforms.....................................................................8 1.4.7 Database applications...........................................................................................8 1.4.8 Mobile development applications........................................................................9 1.5 Advantages of specific software .................................................................................9 1.5.1 Excel Advantages.................................................................................................9 1.5.2 Excel Disadvantages............................................................................................9 1.5.3 Flash Advantages.................................................................................................9 1.5.4 Flash Disadvantages.............................................................................................9 1.5.5 Dreamweaver Advantages ...................................................................................9 1.5.6 Dreamweaver Disadvantages.............................................................................10 1.6 Current Solutions.......................................................................................................10 2 Literature Review.............................................................................................................11 2.1 Machine Learning .....................................................................................................11 2.1.1 Bayesian Networks ............................................................................................11 2.1.2 Graphics model link...........................................................................................12
  • 3. Niall Brooke 2.1.3 Bayesian networks in sport................................................................................13 2.2 Bookmakers...............................................................................................................14 2.2.1 Gambling market ...............................................................................................14 2.3 Sports Prediction .......................................................................................................15 2.3.1 Prediction markets .............................................................................................16 2.3.2 Forecasting methods ..........................................................................................16 3 Problem Requirements.....................................................................................................17 3.1 Problem Specification ...............................................................................................17 3.1.1 Data Source........................................................................................................17 3.1.2 User Interface.....................................................................................................18 3.1.3 Calculations........................................................................................................18 3.1.4 Output ................................................................................................................18 3.2 Possible Solutions .....................................................................................................18 3.2.1 Excel Datasheet..................................................................................................18 3.2.2 Windows Application ........................................................................................19 3.2.3 Dynamic Website...............................................................................................21 3.2.4 Mobile 
Application............................................................................................22 3.3 Chosen Solution ........................................................................................................23 4 Design ..............................................................................................................................24 4.1 Initial decisions .........................................................................................................24 4.2 Data ...........................................................................................................................25 4.3 Algorithm ..................................................................................................................25 4.3.1 Statistical Percentage .........................................................................................25 4.3.2 Bookmakers Percentage.....................................................................................26 4.4 Wireframe..................................................................................................................26 4.4.1 Statistic data design............................................................................................27 4.4.2 Calculations input design...................................................................................27 4.5 Process structure........................................................................................................28 4.6 User Interface............................................................................................................29 4.6.1 Calculations page...............................................................................................30 4.6.2 Ronaldo datasheet..............................................................................................30 4.6.3 Messi Datasheet .................................................................................................30 4.7 
User Interactions .......................................................................................................30
  • 4. Niall Brooke 4.7.1 Drop down variables..........................................................................................31 4.7.2 Player Variable...................................................................................................31 4.7.3 Venue Variable ..................................................................................................31 4.7.4 Competition Variable.........................................................................................31 4.7.5 Goal type variable..............................................................................................32 4.7.6 Odds input..........................................................................................................32 4.7.7 Calculated Percentages ......................................................................................32 4.7.8 Prediction advice................................................................................................32 5 Testing..............................................................................................................................33 5.1 Methodology .............................................................................................................33 5.1.1 Theory................................................................................................................33 5.2 Results.......................................................................................................................34 5.3 Result Analysis..........................................................................................................35 5.3.1 Method Analysis................................................................................................35 5.3.2 Method Flaws.....................................................................................................35 5.3.3 Bookmakers calculations ...................................................................................35 
6 Conclusion .......................................................................................................................36 6.1 Problem Analysis Review.........................................................................................36 6.2 Review of Literature Review ....................................................................................36 6.3 Problem Requirements Review.................................................................................36 6.4 Design Review ..........................................................................................................36 6.5 Results Review..........................................................................................................36 6.6 Future development...................................................................................................37 6.6.1 The Variables.....................................................................................................37 6.6.2 Real time betting................................................................................................37 6.6.3 The platforms.....................................................................................................38 6.6.4 Business opportunity..........................................................................................38 6.6.5 Global Reach......................................................................................................38 7 References........................................................................................................................39 8 8 Appendix.......................................................................................................................43
  • 5. Niall Brooke 1 Analysis of problem 1.1 Statement of the Problem The sports gambling industry has shown substantial growth in modern times due to the introduction of online betting. This has then lead to the development of multiple sports forecasting systems and models. Sports forecasting has however been going on for decades prior to this but has never received the amount of mainstream attention it does today. The problem with current sports forecasting systems is that the majority try to predict either the result of the exact score of a football match rather than focusing on the individual players. 1.2 Problem detailed There are currently no publicly available sports forecasting systems that focus on the previous statistics of individual players to predict whether they will score in a particular game or not. Bookmakers now let their customers bet on a vast variety of different possible outcomes for players and the team in general. This is a gap in the market where a system could be developed, which would track player’s statistics and generate advice to gamblers on particular wagers. 1.2.1 Anytime Goal Scorer The system would need to scan through a player’s goal scoring history taking into consideration many different variables and factors to determine a solid percentage of how likely they are to score in a game. This percentage would then need to be compared to the odds that are being given for that outcome to happen to determine whether or not statistically it is a good investment. 1.2.2 First goal scorer Similar to the anytime goal scorer prediction the system would need to be able to determine how likely a player is to score first. This would require a similar algorithm to check through the statistics of the previous encounters with that team. The percentages generated from this would be significantly lower as the odds would be better overall. Due to this a slight alterations may be made to some of the equations. 
1.2.3 In play betting Another issue that needs to be resolved is a system that can make predictions about players in play. This would need to use the current percentages generated and then add them to a larger equation to correspond with what has happened during the match. This would need to include the current time of the match in question. In theory the deeper the match is into the current game the less likely a player is to score and therefore their percentage of scoring should drop as the odds increase. However this is where data mining could aid a gambler and possibly give them the edge over a bookmaker. For example if a selected player statistically scores in the last 5 minutes of games against teams who are currently not performing well in the league when their team is losing, more than the average player does a system could be made that informs the gambler that this bet has a statistically high chance of paying off even if the odds are high and seems unlikely.
  • 6. Niall Brooke 1.3 Significance of the Problem Initially there may seem to be little significance to this particular problem in the grand scale of things however when analysing the key factors and repercussions linked it is a big issue. There are many forecasting systems that have already been developed but none which solve this unique problem. There are several areas which are troubled by this current problem which can cause a ripple effect. Mainly the customers of bookmakers deserve a truly representative probability of their bets. 1.3.1 Misleading odds Bookmakers like the majority of companies are trying to make as much profit as possible. This can lead them to use questionable marketing strategies, in an attempt to entice the customers to spend more money. Bookmakers will tend to advertise odds in large colourful letters with the aim of making customers take bets which seem to be better than they actually are. The bet that may be being advertised could in fact be statistically very unlikely to happen however because it seems like a good offer people will take the bet. Bookmakers can even use the players, managers or teams themselves to create hype to increase the amount of wagers on a particular bet. For instance if a player who used to play for a team was playing their first match against them, bookmakers could use this to offer a bet for that player to score when the statistically the data shows that the player in question very rarely scores goals against teams who are currently in form. They can use this in many different scenarios such as goals, yellow cards or even man of the match awards. It is a similar effect that lotteries have on customers. People who may not play the lottery on a regular basis may play when there is a massive rollover jackpot. 
Even though statistically they have a minute chance of winning because of the advertisement and hysteria it generates it causes people to gamble with their money thinking they have a good chance of winning. When consumers would be more likely to win by spending their money on a scratch card with better odds but still with a substantial prize. 1.3.2 Media effect The media can also be used by bookmakers to take advantage of their customers by creating bets around news stories. If a player was involved in the media or newspapers this week the bookmakers will use this to their advantage to draw in customers in order to offer a bet which statistically isn’t fair. An example of this would be that a player such as Theo Walcott for Arsenal FC was quoted saying he wants to win the premiership top goal scorer this season. The bookmakers may have an offer at ‘evens’ for him to score 2 or more goals against a low level opposition. However statistically Theo Walcott only scores 2 or more goals in 3% of the matches he has played. Due to evens being the equivalent of 50% this bet is massively stacked in the favour of the bookmaker. The average football fan will remember seeing Theo Walcott’s name in the news and how he is aiming to score more goals. This could then entice them to put a bet on thinking they have a good chance of winning when statistically it is very unlikely.
  • 7. Niall Brooke 1.3.3 Transfer market effects Only in recent times have bookmakers started to offer wagers on player transfers between teams. This can demonstrates how much control and influence bookmakers can have on the customers. News outlets can base their stories on the likelihood of player being transferred base just open the current odds that are being given. This is ever so evident when bookmakers suspend the betting on a transfer. This means when the bookmaker decided they have stopped taking bets on a transfer because it is almost certain to happen. However a few days later they may reopen the transfers causing loads of customers to get worse odds that what they originally should of. 1.3.4 Impulse betting Customers can constantly be swayed by advertisements which seem to be too good to be true. It is now common for adverts to be down on TV in-between football matches which offer live in play odds. This can lead to many people making impulse bets without thinking through how good the odds really are. The bookmakers will offer seemingly good and realistic odds for the game that is currently being played however when the statistics are looked at it is clear that the customers are being given a bad deal. 1.4 Resources Available For a technological problem it is essential that the correct resources are available, these would include, hardware, software and reading material. Liverpool John Moores University has a wide variety of educational licenced software made available to all computing students the software can be accessed via any of the computer science labs or remotely though the university’s servers. There are several computer labs based on campus with up to date hardware that has the correct specifications to run the specific software. There are two LJMU libraries, which are open 24 hours a day 7 days a week during term time. This will allow students to access reading material through journals, research papers and textbooks. 
The vast majority of resources are available both physically and digitally. All of these factors will provide a suitable work environment for this project.

1.4.1 Software

For this project a variety of software will need to be considered, as there are many different ways to approach this particular problem. Word processing software is essential, as it will be required to produce the final report and document the testing. To solve the problem itself, several categories of software need to be considered.

1.4.2 Image manipulation

An image manipulation application may be needed for many different tasks, as graphics are very common in modern applications. Adobe Photoshop is the industry leader in this area and is available at LJMU. Image manipulation may be needed to construct website templates, create icon graphics and edit screen captures. This software is essential, and it is important that it is constantly available for use during this project.
1.4.3 Spreadsheet application

A spreadsheet application would be very beneficial for a problem such as this, because of the mathematical and equation-based elements that may present themselves. These tasks can be done in other pieces of software; however, it would be much more streamlined and organised to create them in a spreadsheet. Spreadsheet applications are designed for data manipulation, and the problem is heavily related to this subject area. The Microsoft Office suite is available, which contains the spreadsheet application Excel.

1.4.4 Software development

Having a software development platform is vital before starting to plan a solution to a technological problem. Visual Studio, a suite developed by Microsoft, has been made available; it allows users to create applications in a variety of programming languages. It could be used to design, create and publish various solutions to the current problem. The software also supports many plugins and extensions, which can aid and customise the development process. Both code-based and design-based applications can be created, as the software contains a simple-to-use GUI development kit for building user interfaces.

1.4.5 Website development applications

Many modern systems are designed solely for the internet, so web development platforms are very important when collecting resources. It is very plausible that a website may need to be created in order to manipulate data and present a user-friendly interface. There are several web development platforms at the disposal of LJMU students; however, the most commonly used is Adobe Dreamweaver, as it can easily be combined with other Adobe software.

1.4.6 Graphical development platforms

It may be necessary to use a graphical development platform, possibly in tandem with other software, to create a better overall user experience.
Software such as Adobe Flash could be used to present the user with a rich multimedia application. Silverlight is also available; however, given current trends it would be better to avoid this particular software because of its small user base.

1.4.7 Database applications

The problem is heavily reliant on data, so a database may be required, either as a standalone entity or connected to a website or application which would then call upon the database to present the data dynamically. As stated before, the Microsoft Office suite is available and includes Microsoft Access; however, MySQL Workbench, which is also on the LJMU servers, would be a much more suitable choice as it is used far more widely than its competitors. MySQL can also be used directly from other programs, so it isn’t essential to have any particular software.
1.4.8 Mobile development applications

With smartphones becoming ever more popular, an app may be a very beneficial solution. Mobile applications can be built on several platforms. Adobe Flash, which has already been mentioned, can output mobile applications; however, these are not the standard. Android applications can be developed in Java using the Android SDK within an environment such as Eclipse, both of which are currently available among this project’s resources.

1.5 Advantages of specific software

1.5.1 Excel Advantages

The main positive of Excel is that it is a complete program; it doesn’t require anything else to aid it. It is also very universal, as the file can be sent and viewed in emails and most smartphones can open the documents without any problems. Visual Studio can also be integrated with Excel to allow password-protected documents. Excel is known for its ease of use and for the helpful pop-up tutorials which assist when inputting equations.

1.5.2 Excel Disadvantages

A bad feature of Excel is that it can be susceptible to viruses, because Excel can run macros which may contain malicious code. Because Excel projects normally hold a large amount of data in one file, they can end up running slowly due to the file size. This leads to a different problem: if a user decides to split the project into multiple files, the overall chance of losing data increases. Finally, Excel is limited in capacity by its fixed maximum number of rows and columns.

1.5.3 Flash Advantages

The biggest advantage of Flash is that it allows the creation of highly interactive multimedia content, which can produce very good user interactions. Flash can also be integrated in many different settings, such as websites and standalone applications.
A great benefit of Flash is that it is cross-browser compatible: it isn’t a problem if the user’s browser doesn’t support the correct version of HTML, because as long as a Flash player is installed the content should run fine.

1.5.4 Flash Disadvantages

Flash has a reputation for being an unreliable piece of software, in that it is quite prone to crashing. Flash can also slow down a website because it is running something much more intensive than usual. This can be combated with preloaders and multiple SWF files which are all loaded into one main file when called upon.

1.5.5 Dreamweaver Advantages

Dreamweaver is great for website management; it allows users to switch between different sites with a simple option in the side panel. The software has a built-in FTP connection tool which makes it very simple to update websites. Dreamweaver is known for its good CSS style sheet support, which makes customisation very simple.
1.5.6 Dreamweaver Disadvantages

Dreamweaver’s simple interface can also be a drawback, as its point-and-click approach limits the learning of code. Dreamweaver can also be restrictive: it sometimes tries to connect to things in a certain way and will not allow other options.

1.6 Current Solutions

It is impossible to predict the result of every single football match, as there are far too many variables that could change at any given moment. However, there are many different ways to greatly increase the chance of predicting the result. Some current solutions have been researched and developed. The most common approach to predicting a football match is to review the current form of the two teams playing and any past fixtures they have played against one another.

Mathew Tucker of the University of Southampton developed a website that used multiple algorithms in an attempt to predict football results with a higher success rate by factoring in the players themselves. The system would constantly pull data from the BBC Sport website and store it in a database, taking into account how the team’s winning percentage is altered by each individual player. Essentially the system would become more accurate over a longer period of time, as there would be more data to compare. He was also able to factor in the bookmakers by pulling data from an odds comparison website, which allowed him to check whether the odds matched his system and whether it was worth placing a wager [1]. The system isn’t entirely unique; however, it does combine many different variables that can provide some interesting results. After testing the system it was clear there were many flaws that needed to be fixed. The main one was that players who were currently in form were in fact lowering the winning percentage of their team, because the past data showed them to be a weak link.
However, it did still correctly predict some results, although in reality these were obvious bookmakers’ favourites, so it was clear the system wasn’t very reliable.

Seoul National University published a paper proposing a very comprehensive framework for sports prediction. The framework is unusual in two ways: it uses a rule-based reasoner and a Bayesian network component. The reasoner allows the system to forecast the results of matches even if there is very little data available. The framework also takes into account real-time scores, which is very rare in other predictors; in essence it is more of a simulator. The strategies and tactics that teams use are also reviewed and incorporated into the framework, creating a very carefully constructed system [2].

Both of these systems are good in their own way; however, both focus only on the result of a team. Each team has 11 players, each of whom can affect the result of the game, so with 22 players on the pitch there are 21 extra variables compared to a system that focuses on 1 player. Each of these players would have hundreds of possible variables of their own, which makes calculating an exact percentage or likelihood very unreliable. A solution could, however, also incorporate the team’s overall likelihood of winning to add more accuracy to an algorithm.
2 Literature Review

There have been many research papers on sports prediction which go into depth about the variables and algorithms that can be used to forecast the result of a sporting event. There is also a vast amount of research into the odds and statistics involved in gambling, which can be factored into sports wagering. This literature review covers all of these topics and also takes an in-depth look at machine learning in general, showcasing any methods that could be used in a sporting environment.

2.1 Machine Learning

Machine learning is a form of artificial intelligence in which a system learns from data. This can be useful for tasks such as detecting spam email messages or finding patterns in sports statistics. There are many different methods of machine learning that can be implemented in a system; two of the most popular are decision trees and Bayesian networks. Decision trees are used as a predictive model which maps observations from previous data to a target value. Bayesian networks, on the other hand, are a mix of influence diagrams and Bayes’ theorem. They display conditional probabilities and the relationships between different variables.

2.1.1 Bayesian Networks

In modern times Bayesian networks have increasingly been used as a tool for modelling statistical problems. They have also frequently been used in the reliability analysis community. A reliability analyst tends to be someone who gives their input to a decision problem. Problems in this area tend to be uncertain because of random fluctuations, so the end result should be a statistical model built from random variables. The model should be mathematically accurate but also clear and easy for the decision maker to understand. It is vital that its set of parameters is fully specified from statistical data or expert judgement.
Because both of these sources of data are less than perfect, it is important to use a formalism that reduces the number of parameters needed in the model [3]. In a statistical environment the figures that are sought tend to be conditional probabilities or quantities deduced from them. Due to all of these requirements there has been much more focus on traditional and flexible frameworks such as fault trees. One framework that has stood out is the family of Bayesian network models.

Bayesian networks originally came from the area of artificial intelligence, where they were used as an effective framework for reasoning with uncertain knowledge. Their application to reliability analysis can be traced to Almond and Barlow [4]. The suggestion of using Bayesian networks for reliability analysis has led to a flurry of research comparing classical reliability formalisms with Bayesian networks. The modelling and analysis features of reliability block diagrams have also been compared with Bayesian networks, which have been shown to have many advantages over the traditional frameworks. More recently, other more general reliability models have been compared with Bayesian-related formalisms [5], [6].
Over the years Bayesian networks have had many different uses, such as assessing software reliability [7], finding faults in systems and maintenance modelling [8]. They are, however, most commonly used for software system reliability. They are popular for this purpose because they combine multiple sources of information into a single global picture. Fenton [9] showed that the well-founded underlying theory of Bayesian networks can provide significant advantages. Woff [10] designed a series of software tests using Bayesian networks and concluded that they are well suited to these problems.

It is common to see only discrete variables in the Bayesian network community. However, it must be noted that reliability analysis would be very limited if only discrete variables were considered, so it is important to use models with both continuous and discrete variables. Overall, Bayesian networks are considered to be a modelling framework which is typically easy for domain experts to use, including those in the reliability field. The research in the two communities targets many common aims and goals.

Bayesian networks belong to the family of graphical models. These are graphical structures used to represent knowledge about an uncertain domain. Each node represents a random variable, while the edges between nodes represent the dependencies among those variables. The dependencies in a graphical model are determined using known statistical methods. Graphical models with undirected edges are typically referred to as Markov networks. These networks provide a simple definition of independence between any two distinct nodes based on the concept of the Markov blanket, and they are commonly used in computer vision [11].
2.1.2 Graphical model link

Bayesian networks correspond to another graphical model structure known as a directed acyclic graph. This structure tends to be found in the machine learning and artificial intelligence communities. Bayesian networks are mathematically rigorous yet also intuitive to understand. They allow an effective representation of the joint probability distribution over a set of random variables [12]. The structure of a directed acyclic graph is made up of two parts: the nodes (vertices) and the directed edges. The nodes represent random variables and are drawn as circles labelled with the variable name. The edges represent direct dependence among the variables and are drawn as arrows between nodes [13]. By the design of the directed acyclic graph it is impossible for an individual node to be its own ancestor or descendant. This condition is very important because it makes possible the factorization of the joint probability of a collection of nodes [14].
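The factorization described above can be illustrated with a small sketch. The four-node network below (form, home, starts, scores), its structure and every probability in it are hypothetical values chosen for illustration; none of them come from the project data or from the cited papers.

```python
from itertools import product

# Parents of each node in a hypothetical directed acyclic graph.
PARENTS = {
    "form": (),                    # player is in good form
    "home": (),                    # match is played at home
    "starts": ("form",),           # player is in the starting eleven
    "scores": ("starts", "home"),  # player scores in the match
}

# P(node is True | parent values), keyed by the tuple of parent values.
# All numbers are made up for the example.
CPT = {
    "form": {(): 0.6},
    "home": {(): 0.5},
    "starts": {(True,): 0.9, (False,): 0.5},
    "scores": {
        (True, True): 0.45,
        (True, False): 0.30,
        (False, True): 0.10,
        (False, False): 0.05,
    },
}


def joint(assignment):
    """Joint probability as the product of each node's probability given its parents."""
    probability = 1.0
    for node, parents in PARENTS.items():
        p_true = CPT[node][tuple(assignment[p] for p in parents)]
        probability *= p_true if assignment[node] else 1.0 - p_true
    return probability


def marginal(node):
    """P(node is True), found by summing the joint over every assignment."""
    names = list(PARENTS)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[node]:
            total += joint(assignment)
    return total


def parameter_count():
    """Free parameters needed by the factorized model (binary variables only)."""
    return sum(2 ** len(parents) for parents in PARENTS.values())


if __name__ == "__main__":
    print(marginal("scores"))
    print(parameter_count(), "parameters, against", 2 ** len(PARENTS) - 1, "for the full joint table")
```

With four binary variables the factorized form needs 8 parameters where an unrestricted joint table needs 15, and the saving grows quickly as more variables are added.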
A Bayesian network encodes a simple conditional independence statement: each variable is independent of its non-descendants given the state of its parents. This can be used to greatly reduce the number of parameters required to define the joint probability distribution of the variables [15].

2.1.3 Bayesian networks in sport

Bayesian networks are also used to create predictive models for sports such as football. Knowledge can be gathered from experts or statistics to help determine the main factors that can affect the result of a football match. These factors are very complex and are evidence that it isn’t just luck that determines a match. It is in situations like this that Bayesian networks excel. A domain expert can collaborate with a Bayesian network expert to create a network showing the relationships between the main factors involved and the direction of each effect in the network.

Fenton [16] used an expert-constructed Bayesian network to analyse the results of Tottenham Hotspur in the 1995/1996 and 1996/1997 seasons.

(Fig. 1: A Bayesian network for Tottenham Hotspur in the 1995/1996 and 1996/1997 seasons)

The expert Bayesian network provided excellent results over the cross-season test periods. The model didn’t take into account the players’ training regime in between the two seasons, ignoring some key attributes. The work helped to show that an expert Bayesian network can be used to select the key features. Fenton concluded that the overall procedure of machine learning with Bayesian networks provides two main positive outcomes: understanding and prediction. Logically speaking, the more understanding we have the better our predictions should be. It is, however, possible to make accurate predictions without a substantial amount of knowledge.
The understanding that is gained from learning different processes allows the construction of models that demonstrate the relationship between two different models. [16]
2.2 Bookmakers

Gambling has now become an integral part of the sports ecosystem. In recent years sports betting companies have experienced a massive boom in their market due to online gambling. However, for all of its success the sports betting market has never been far from controversy. Sports wagering can be traced all the way back to the original Olympic Games in Greece, where athletes would compete in many different events such as foot races, hurdles and even freestyle fighting. These athletes would be rewarded with winnings; however, the majority of the money would be won in the crowd. The spectators would wager on the outcomes of these events, winning or losing entire estates at a time. Gambling was also prominent among the Romans, who swore it was a metaphor for life itself.

Sports betting can be found in almost every sport today. Bookmakers offer hundreds of different ways to bet on a single event to entice more wagers. Bookmakers use algorithms to find patterns which they use to predict outcomes and produce their odds. They will always set the odds in their own favour so that they can be sure to make a profit. However, from time to time they may make a mistake, or inside information may be leaked, which can cause a flurry of bets and force the bookmakers to close the market.

2.2.1 Gambling market

Financial markets and sports wagering are very similar: both involve investors with heterogeneous beliefs looking to make a profit. Sports betting is a zero-sum game with a trader on each side of the transaction and large amounts of money at stake. With financial and sports betting markets being so similar, it is peculiar that they are organised in such different ways. In a financial market the price constantly fluctuates to match supply and demand. The main goal of the market makers is to match up buyers and sellers.
However, bookmakers in sports wagering tend to announce a price (such as the odds for a horse to win a race or a team to win a game), after which changes in the price are very rare and sometimes do not happen at all [17]. If the price chosen is not the market-clearing price then the bookmaker could take a big loss [18]. If bettors notice that a bookmaker has made a mistake with its price they can exploit it, leaving the bookmaker with a big loss.

The risk bookmakers face is categorically different from the risk casinos face on games of chance such as roulette and blackjack. Those games have odds that favour the house, and with a large number of people playing the casino is guaranteed to make a profit. A bookmaker, by contrast, can make a big loss if it makes a mistake with the odds, even in the long run. Small groups of skilled gamblers who can consistently make a profit can be financially disastrous for bookmakers. These gamblers could either amass a huge bankroll or sell their winning formula to others and make the problem worse.
Although the system which bookmakers implement is peculiar, there are a few ways in which it can be very profitable. The first strategy relies on bookmakers being very knowledgeable and having vast resources, which allows them to equalise the amount of money wagered on each outcome. This gives them a profit regardless of the result, as the bookmaker charges a commission on bets known as ‘the vig’ [19]. Under this strategy bookmakers tend not to focus directly on the winner of the event but on forecasting how wagers will be placed. This is the popular depiction of how bookmakers behave [20].

The second main strategy applies if the bookmakers are statistically more accurate at predicting the outcome of sporting events than the customers placing wagers. In this scenario the bookmaker would always be able to set the correct price, as it would equalise the probability that a bet placed on either side of a wager is a winner. However, this would mean that the bookmaker would only earn the commission and not the overall amount wagered, as that would be cancelled out by covering other losses. Unlike the previous strategy, the bookmakers would actually lose if the gamblers were more skilled at predicting the outcome of an event.

The final strategy can be very profitable for bookmakers, as it combines the positives of both previous strategies. If the bookmaker is better at predicting the results of games and can also effectively predict the behaviour of the bettors themselves, it can make profits beyond its commission. By systematically setting the ‘wrong’ prices for games, bookmakers can start to control how wagers are placed and create a larger profit margin for themselves. For example, if a local bookmaker knew that customers tended to bet on the local team, it could skew the odds against that particular team.
However, there are also constraints on this: it couldn’t be done on a mass scale, as bettors who know the ‘correct’ odds could generate a profit if the posted price deviates too far from its true value [21].

2.3 Sports Prediction

Sports forecasting is becoming ever more important in the world of sport, affecting teams, sponsors, the media and above all the fans who are making wagers online. There has been a surge in demand for professional advice regarding the results of sporting events. This is typically delivered by tipsters or pundits [22]. Betting odds themselves are now used as a form of forecasting, as they provide an overall prediction. Fixed odds, on the other hand, are sourced from the expert predictions of the bookmakers [23].
2.3.1 Prediction markets

Prediction markets were originally used to forecast political election results [24] and later business outcomes [25]. They are now increasingly used in an attempt to predict all different types of sporting events, which shows that prediction markets provide their own method of sports forecasting. Essentially, a prediction market is a large group of individuals connected via the Internet who trade virtual stocks whose future value depends on a particular market situation. When the outcome linked to a specific market situation occurs, each virtual stock bought receives a cash payoff. An example would be £1 if a team wins and £0 otherwise. In a prediction market each individual contributes their own knowledge, so the stock prices represent the combined wisdom of everyone involved, thus creating the prediction [26].

Because of the vast number of different forecasting methods, questions have been raised as to how effective they are. There have been studies of the performance of betting odds and of tipsters [22]. However, there has never been extensive research comparing these methods with prediction markets [27]. There is also not much research into possible similarities between multiple forecasting methods, which could be very effective if they were combined in a weighted overall method. Such research could be widely beneficial: if it were openly available, the average sports fan might be able to take advantage of it. It could also help sports betting companies such as bookmakers to improve their own forecasts.

2.3.2 Forecasting methods

2.3.2.1 Prediction markets

The general consensus on prediction markets is that markets do in fact solve information problems [28]. A competitive market can achieve market efficiency through its price mechanism.
The price mechanism is regarded as the most effective method of aggregating the information dispersed among individuals [29]. This means that the prices in a competitive market reflect the public and private information held by the individuals, thus providing a good predictor [26]. These qualities make prediction markets a very promising method for solving many different information problems [30].

There are many online prediction markets based around sport. They trade virtual stocks related to future market situations, which are directly linked to the results of sporting events. The cash pay-out of the shares of virtual stock depends upon the actual outcome of the fixture. This means that the price of one of the virtual stocks should match the prediction market’s aggregate prediction of the event outcome. The participants of a prediction market use their own judgment and expectations of the result to work out the true value of the related share of virtual stock. They then compare their expected cash pay-out with the prediction market’s aggregate expectation. If a participant expects to profit from their virtual portfolio, it is in their interest to trade accordingly, and in doing so their transactions reveal that expectation to the market. This leads to participants revealing their true expectations of the market through their buying and selling activities [22].

Because individuals make their expectations tradable, a prediction market creates a market of its own about future situations, in which participants compete according to their own expectations. Research studying empirical data supports the informational efficiency of such markets [30].

2.3.2.2 Tipsters

Professional forecasts are often made by sports pundits or tipsters, whose predictions are normally broadcast through the media. Tipsters tend to be experts in a particular sport who do not use any type of model to predict a result but rather rely on their personal experience [31]. They tend to offer forecasts only on popular fixtures, typically those with a close connection to betting. Because of the nature of these forecasts, tipsters face no financial consequences from the results. There is clear evidence from research that the actual forecast accuracy of a tipster is very limited [31]. It is shown that tipsters tend to perform better overall than random forecasting methods; however, they perform worse than systems that always forecast a home win. One study found that 3 tipsters achieved on average 42%, 41% and 43.5%, while a home-win system achieved 47.5% [27]. The study also showed that football tipsters tend to predict a lower proportion of results correctly than an average football fan.

2.3.2.3 Betting odds

A vast amount of analysis in economics and business research suggests that betting odds can provide a very effective forecasting method [32]. Bookmakers determine a set of fixed odds based upon their expectations of a match’s outcome, expressed as probabilities. Fixed odds rarely change; however, the growing number of ‘in-play’ betting odds change constantly.
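To read a set of fixed odds as a forecast, the bookmaker’s commission first has to be stripped out, because the probabilities implied by the raw odds add up to more than 100%. The sketch below assumes decimal odds and a simple proportional normalisation; that choice of method and the example prices are illustrative assumptions rather than anything taken from the cited studies.

```python
def implied_forecast(decimal_odds):
    """Turn decimal odds for every outcome of one event into probabilities.

    Returns the normalised probabilities and the bookmaker margin, which is
    the amount by which the raw implied probabilities exceed 1.
    """
    raw = {outcome: 1.0 / price for outcome, price in decimal_odds.items()}
    booksum = sum(raw.values())
    forecast = {outcome: p / booksum for outcome, p in raw.items()}
    return forecast, booksum - 1.0


if __name__ == "__main__":
    # Hypothetical prices for a single match.
    odds = {"home": 2.10, "draw": 3.40, "away": 3.60}
    forecast, margin = implied_forecast(odds)
    for outcome, probability in forecast.items():
        print(f"{outcome}: {probability:.1%}")
    print(f"bookmaker margin: {margin:.1%}")
```

The normalised figures sum to 100% and can then be compared directly with the forecasts of tipsters or prediction markets.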
3 Problem Requirements

3.1 Problem Specification

The problem itself can be split into multiple sections, which can in turn be solved individually to create a final solution. The main areas which need to be resolved are the data source, the user interaction, the calculations and finally the output.

3.1.1 Data Source

Statistical football data on individual players needs to be collected and stored so that it can later be manipulated. The data collected needs to detail every aspect that may affect a player’s chance of scoring a goal in a game. This will cover which competitions the player scores the most goals in and whether the game was at home or away. The data will also need to include the minutes in which a player has previously scored, to aid in-match betting. As well as data determining how likely a player is to score at any given moment in a game, data on the opposing team will need to be collected in relation to the selected player. This data will show how often, on average, the player scores against a team of that standard. Even the attendance at matches will be considered, as it may have an effect on how a player performs. All of this data needs to be combined in order to create a definitive prediction.

3.1.2 User Interface

The user interface will need to allow the user to select a particular player and input the required match data in order to generate a prediction. The user will need to be supplied with the correct forms and menus to input any required data.

3.1.3 Calculations

The calculations required for this problem need to use the relevant data collected to create multiple equations which, when combined, form an overall algorithm leading to an accurate prediction. The first main equation calculates the percentage for each individual statistic linked to a player. This could be achieved by averaging the binary values recorded for that statistic. The main set of equations generates the actual percentage prediction. This will have many factors and will combine multiple different data averages. The final equation converts the bookmaker’s odds into a comparable percentage. Once all of these calculations have been created they should be hidden from the user; this will help protect the overall algorithm against any attempt at plagiarism.

3.1.4 Output

The overall output needs to provide the user with an accurate percentage chance of the chosen player scoring in the selected game. This will be displayed in three main ways: first, the exact percentage chance of that player scoring with the selected variables; secondly, the percentage chance implied by the odds given by the bookmaker; and finally a text-based message informing the user whether it is a worthwhile bet. Once the user is given the prediction they will be able to reset the forms and selectable menus so that they can use the system again.
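The three calculation stages and the three outputs described above could be sketched as follows. This is only an outline under stated assumptions: the function names are hypothetical, the sample match records are invented, and combining the statistics with an unweighted mean is a placeholder, since the specification does not fix how the averages are combined.

```python
from statistics import mean


def scoring_rate(match_flags):
    """Percentage of matches in which the player scored (1 = scored, 0 = did not)."""
    return 100.0 * mean(match_flags)


def combined_prediction(rates):
    """Placeholder combination of several percentages: an unweighted mean (assumption)."""
    return mean(rates)


def odds_to_percentage(numerator, denominator):
    """Convert fractional odds such as 2/1 into the percentage chance they imply."""
    return 100.0 * denominator / (numerator + denominator)


def verdict(predicted, bookmaker):
    """Text message for the user comparing the prediction with the bookmaker's odds."""
    if predicted > bookmaker:
        return "Worthwhile bet"
    return "Not a worthwhile bet"


if __name__ == "__main__":
    # Invented records: did the player score in each past match of this type?
    home_games = [1, 0, 1, 1]
    against_similar_opposition = [0, 1, 0, 1]

    predicted = combined_prediction(
        [scoring_rate(home_games), scoring_rate(against_similar_opposition)]
    )
    bookmaker = odds_to_percentage(2, 1)  # hypothetical odds of 2/1

    print(f"Predicted chance of scoring: {predicted:.1f}%")
    print(f"Chance implied by the odds: {bookmaker:.1f}%")
    print(verdict(predicted, bookmaker))
```

Keeping each stage in its own function mirrors the way the specification separates the per-statistic percentages, the overall prediction and the odds conversion, so any one of them can be replaced without touching the others.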
3.2 Possible Solutions

There are many different possible solutions to this particular problem, which need to be examined before making a final decision. Each solution presents its own advantages and flaws. It will be virtually impossible to find a ‘perfect’ system; however, once a solution has been implemented and tested there will be room for improvement. In this project the options have been narrowed down to 4 different solutions: a Windows application, an Excel datasheet, a dynamic website and a phone application. These four solutions will be briefly analysed, showcasing their potential features and functionality.

3.2.1 Excel Datasheet

A very feasible solution would be to develop a datasheet which users could then run on their own system. The datasheet would be created in a spreadsheet application such as Microsoft Excel. The datasheet would contain 3 main sheets.
The first two sheets would contain a database of statistics for the specific players and the remaining sheet would display the user interface. The system will allow a user to use statistical data stored in the datasheets to determine if a particular wager is statistically a worthwhile gamble. The design of the datasheet would make it possible to lock/hide the datasheet from the user; however, this would prevent the user from importing new statistical data for new players or adding to the current data. Keeping it open allows the application to grow and be expanded upon to suit the users’ needs and personal requirements.

The user interface of the datasheet would be simple; however, the user would also be given options to edit/modify parameters that may improve their experience with the system. The user would be given drop down menus to select the players which are currently linked to the databases and then enter the variables to be presented with a solution.

The installation of the system may cause some problems for computer illiterate users. There are two possible solutions for this situation. The first would be to release the system as a raw datasheet file, which would require the user to have the software on their system to run it. This could lead to complications as the user may not have the required software. However, Excel is a very common piece of software that often comes included with home computer systems. There is also OpenOffice, a free alternative that would be able to run the datasheet. Another solution would be to use software to convert the datasheet into an executable Windows/iOS file; this would then allow users to install the system as an application and launch it as a standalone piece of software.

Developing the system in Excel does however have many advantages. Firstly, the software structure is already in place, as the system is being developed on software built specifically for this purpose.
This will cut down on development time and allow more time to focus on the actual equations/algorithms that are the main point of the system. The datasheet solution can also be edited and customized by the user, which will give the application longevity.

3.2.2 Windows Application

The most commonly used solution to technological problems would be a Windows application. This is due to the fact that the vast majority of computer systems run on the Windows operating system. This possible solution would be written in a programming language such as C# or VB (Visual Basic). A software development suite such as Visual Studio would also be required to create such an application with maximum efficiency.

The application itself would have a GUI containing multiple drop down menus/radio buttons, which would allow a user to select the two different players being tested, the opposing team and finally the odds that are being offered. The application would have the statistical data hard coded for demonstration purposes, however with future development the
application would need to be connected to a data source such as a database hosted on the internet.

The application itself would be quite user friendly since, as stated, the majority of users would be familiar with the Windows operating system. The application would need to be installed on the user’s system by following an installation wizard. At this stage it would request network access if the online feature was implemented. The application would be designed so that even a computer illiterate user would be able to use its functionality, allowing it to be used by the masses.

The main advantage of a Windows-based application would be that the target market would be very large. This would be beneficial in a marketing/business sense, as the development of the software could return a profit. The software could be released for free, allowing users to trial it and then pay a subscription or one-off fee for access to the data which fuels the application.

There are however a few problems that developing a Windows application would come across. Creating a complex application in C# which is synced to a server would require an extensive amount of coding and experience, so the development stage would be time consuming. The application would also need to go through security checks, otherwise users’ anti-virus software would flag the executable file as ‘suspicious’ and attempt to prevent the install. This may deter users from trying the application. The longevity of this solution is very positive, however: if the online connection is implemented there would always be updated statistics for users to use.

(Fig 1 windows application example)

The above image displays an example of a Windows application. The software is called ‘odds wizard’ and shows a user the best possible odds for football matches being played in a time frame.
This system has a similar user interface to the possible windows solution for this particular problem.
3.2.3 Dynamic Website

An alternative solution would be to develop the entire system to be hosted online. The system would be available to users through a web address. The website would be dynamic, allowing the data to be constantly updated. This would require a backend database which is remotely connected to the server.

The design of the website would be quite simplistic in order to improve usability and performance. This is important because if the website contained many large graphics or intensive multimedia it could hinder the user’s experience by causing the website to run slower. The main purpose of the system is to provide the user with useful gambling information, so it is important that it is presented in a clear fashion.

(Fig 2 dynamic website example)

The above image shows the dynamic website ‘footy brain’ mentioned in current solutions. The website follows a similar design to this possible solution. It contains just two tabs, which display the predictions and the data. The website however doesn’t contain any user interactions, which would be added for our solution.

The solution would be developed primarily in HTML, the template being created in an image manipulation software such as Adobe Dreamweaver. To keep everything synced, the web development aspect will be done using the same suite, Adobe Dreamweaver. The backend database would be developed using SQL as it is the most commonly used language and the industry standard for database development.

The main benefit of implementing a web-based solution would be that users could remotely access the system from any location with an internet connection. This would be very user friendly, allowing users to use the service on a variety of different devices. Being online and on a server also means that updating the system would be very easy, which would benefit both the user and developer.
The user would not need to download/install the system onto their machine and so would avoid many issues other solutions may come across.

There are however some issues with this solution. The big issue with the system being hosted online would be that any problems with the hosting server would cause problems for the users. This problem would be hard to avoid as it would be out of the
developer’s hands. A solution to this would be to have a backup server hosted locally; however, this would still cause noticeable drops in website performance. The server and domain name would also cost a yearly fee. This means that the system would constantly be losing money if the service was free. However, the service could contain premium elements that generate income. Another solution would be to use advertisements to generate income if the website received high traffic volumes.

3.2.4 Mobile Application

A combination of different solutions could be used to develop a mobile solution. With smartphones and apps becoming more common with the average person, a mobile app solution has very good potential. There are currently three main mobile operating systems: Android, Apple and Windows. For this problem an Android based solution would be most suited. This is due to the fact that Android is currently the most used mobile operating system and is free to develop for. Development on iOS requires a fee to become an official app developer and is much more time consuming. Windows mobile 8 is currently new and does not have a large market.

This solution would need to be developed with the Android SDK, which is available for the software Eclipse. Before starting the solution it is vital to determine the version of Android it should be developed for. This is important as many devices won’t be able to use some of the features in newer versions of the operating system. Android applications are developed in Java, but Eclipse offers an easy GUI to develop the front end of applications. To test the application an emulator needs to be installed on the development system, which then creates a virtual Android device. Android development includes permissions, which the user must accept before downloading and installing the app. These permissions allow the app to access information from the internet or even from the device itself.
This solution would only require internet access to pull live data from a hosted server as having a database included would be too large for an app. (Fig 3 Android application example)
The above image is a current live application available on the Android Play Store. The possible solution for this problem would be designed slightly differently, as it would allow users to interact with the application whereas this one simply shows information pulled off a server. Our solution would contain two main tabs similar to those of the dynamic web based solution, but be primarily designed for touch screen user interactions.

The main advantages of a mobile-based solution would be that it would be very user friendly and target a massive emerging market. The application would be free to host and could make a profit through either ad based revenue or a flat fee. Making it a phone application would also increase the usability of the system.

However, the application, like the other systems, would require a server to pull the data from and a solid internet connection. This would make the application unusable in areas where there is no Wi-Fi or cellular network connection.

3.3 Chosen Solution

We have decided upon an Excel datasheet based solution. This has been chosen as it has many advantages and few negatives compared to the alternative solutions for this particular problem. When developing a system it is beneficial to have a good understanding of the program/programming language required prior to the design stage. Taking into consideration that we have multiple years’ experience with Excel, it is a comfortable development platform compared to Java, C# or PHP, which would require additional time to learn.

Excel is the ideal solution as it fits all of the criteria for the problem specification. Firstly, like all the solutions, it will have an interactive user interface, which is vital to the system. This will allow the user to select from several different players using a drop down menu. It will allow the user to input the current score of the game being played.
It will also have true/false buttons so the user can determine the first goal scorer odds, and it will be able to dynamically display a percentage and a message informing the user if the bet is worthwhile.

The main reason for using Excel, however, is that the entire solution is based on an algorithm that contains multiple equations. Due to this Excel is the standout choice, as its main purpose is to manipulate data using formulas. The application itself allows equations to interact with each other to create new data values. Excel is also designed to store data, which is an essential part of the solution. It is very simple to store the statistical data in binary form so that data values can be generated and used.

The main area in which Excel surpasses the Windows application solution is the limitation of operating systems. Because the Windows application only runs on Windows machines, it limits the users that can use the system. With Excel, however, the file can be opened on any operating system that can run a datasheet. Excel and OpenOffice are both available for many different operating systems, therefore expanding the reach of the application.
The dynamic web solution is very effective; however, its main flaw is that users need to be connected to the internet to use the system. This is not a problem with the Excel solution as it can be run offline on a user’s machine.

A mobile application is possibly the best alternative solution in terms of usability; however, it comes across the same problems as the previously mentioned solutions, as it would initially be developed for only one version of a particular operating system and requires online access to reach the data. Excel outdoes it on both of these issues.

As initially mentioned, there is no ‘perfect’ solution, and an Excel datasheet has its own flaws, such as being less user friendly than the other solutions. However, with everything taken into consideration, the main purpose of this system is to solve the problem presented. The main problem was to find evidence that data mining could in fact aid in sports betting. The main part of the design for this solution will be based around creating the series of equations and algorithms needed to prove this statement. Due to all these factors an Excel datasheet has a good balance of features and usability to create a functional system to solve this problem.

4 Design

This chapter will showcase every aspect of the design and implementation of the theoretical system developed. Excel has been selected as the platform on which the solution to the problem will be executed. This was due to the number of mathematical equations and the simple output required to solve the issue. This solution does have a user interface and interactions; however, it is based more around the theoretical algorithm and equations which run in the background.

4.1 Initial decisions

It was important when selecting the two players to test the system that they were both playing in the same league and as similar as possible, to create a good comparison.
The league itself was a big issue. Initially the English Premier League seemed the optimal choice due to the media attention and the number of bookmakers who offer a variety of bets for each game. However, after looking for two candidate players who were very similar, a match couldn’t be found. The majority of research into sports forecasting also tends to be based around the English league already, so the Spanish league seemed a more suitable choice. The position of the players was debated; the obvious striker position was initially discussed, but it was decided that an attacking winger may provide more interesting results.

The system will allow the user to compare the statistical data of Cristiano Ronaldo and Lionel Messi. These two players were chosen to test the system because of several important factors. They are of similar ages, currently at the peak of their football careers and playing for the top two clubs in Spain. They are constantly in the media spotlight, which can have an overall effect on performance. They both have exceptional goal scoring records, meaning that bookmakers will try to tempt customers to put money on poor odds with the assumption that they are guaranteed to score in every game.
4.2 Data

The data collected was the backbone of the system, due to the solution being based around data mining. The data was sourced from a reliable historical sports website [ref] and then entered into a separate datasheet for each of the two players. Each datasheet was set out with all the dates of matches and the time in between. The time in between was added for the possibility of future calculations that take into account how long the player has rested/travelled from game to game. This could massively affect the performance of a player if they had an international fixture on the other side of the world a few days before a league fixture. Fatigue is an element that will be incorporated in the future and will be discussed in more detail later in the paper.

The data collected was stored in the form of simple digits between 1 and 3; this was to make calculations and data input easier. For each game a variety of data would be logged: first, whether the player started the game and whether the match was played at home or away; then the competition in which the player was competing, i.e. the league or cup; and finally information regarding any goals the player scored, such as the total and the minute in which each was scored. The datasheet can easily be modified to incorporate more variables as the system expands.

4.3 Algorithm

The algorithm is the heart of the solution. It will take the values generated from multiple different equations to create an overall figure. In this project the overall aim was to generate a percentage of how likely a football player was to score in a game depending upon a selection of variables the user has chosen.

4.3.1 Statistical Percentage

Initially the stats of Wayne Rooney of Manchester United were looked at to see whether data mining could in fact help predict if a player would score. A test was done by taking Rooney’s previous stats to create an average which could then be used to make a prediction.
The game in question was in 2012 between Manchester United and Stoke City. Using statistics from the previous 3 seasons, taken a week before the game, it was shown that he was very likely to score at least 1 goal. This was because Rooney had a 61% chance of scoring against any team, with a 24% chance of it being the first goal. There was an 87% chance of Manchester United scoring at least 1 goal against any team and a 29% chance of Stoke City not conceding at least 1 goal against any team, meaning that there was a 79% chance Manchester United would score at least 1 goal vs. Stoke City. Rooney then had a 48% chance of scoring against Stoke City, with a 19% chance of it being the first goal. In the game Rooney scored 2 goals, however he wasn’t the first goal scorer, which mirrors the statistics.

This initial test encouraged the current algorithm, which takes into account the individual’s performances in different competitions and how their goal scoring ratio fluctuates depending on whether the player is playing at home or away. Depending on which variables have been selected, a different calculation is made by combining different equations from the datasheet.
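The text does not spell out how the 79%, 48% and 19% figures were reached. The following Python sketch is one reading that reproduces them, assuming the team figure is the average of Manchester United's scoring rate and Stoke City's conceding rate, and that the player's general percentages are then scaled by that team figure. The function names are illustrative and not taken from the datasheet.

```python
def team_scoring_chance(team_scores_pct, opponent_clean_sheet_pct):
    """Average the team's scoring rate with the opponent's conceding rate.

    Assumption: the opponent's conceding rate is 100 minus its
    clean-sheet percentage, and the two rates are weighted equally.
    """
    opponent_concedes_pct = 100 - opponent_clean_sheet_pct
    return (team_scores_pct + opponent_concedes_pct) / 2


def player_chance_vs_opponent(player_pct, team_chance_pct):
    """Scale a player's general percentage by the team's chance in this fixture."""
    return player_pct * team_chance_pct / 100


# Figures quoted in the text for Manchester United vs. Stoke City.
team = team_scoring_chance(87, 29)              # (87 + 71) / 2
anytime = player_chance_vs_opponent(61, team)   # Rooney scoring at any time
first = player_chance_vs_opponent(24, team)     # Rooney scoring first

print(round(team), round(anytime), round(first))  # prints: 79 48 19
```

Under these assumptions the rounded results match the three percentages quoted above, which suggests, but does not confirm, that this is how they were derived.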
4.3.2 Bookmakers Percentage

To be able to make a fair comparison with the percentage generated by the system it is essential to convert the fractional odds into a percentage as well. This percentage is worked out slightly differently from the usual method of converting a fraction into a percentage, which is dividing the top number by the bottom number and then multiplying the result by 100. Using that method provided uneven results, as the percentages were reversing as the odds moved past evens. It produced results such as a 300% likelihood of happening, which is very unreliable. To resolve this it was first important to create a scale which would be balanced overall. The starting point was defining evens as 50%, under the logic that the event is just as likely to happen as not. From this it was then possible to determine the equations to convert any odds from the bookmakers to a comparable percentage, where Low and High are the smaller and larger of the two numbers in the odds:

Evens: 1 / 1 = 1, and 1 * 50 = 50%
Below evens (odds-on): X = Low / High * 50, then Y = 100 - X
Above evens (odds against): Y = Low / High * 50

An example of this would be Cristiano Ronaldo at 1/3 to score against Real Betis. Using the equation, 1 / 3 = 0.33 and 0.33 * 50 = 16.6, then 100 - 16.6 = 83.4, giving him an 83% chance of scoring according to the bookmakers.

4.3.2.1 Prediction Advice

Once both percentages have been calculated the next step is to determine whether the bet is within a certain ratio of being classed as a worthwhile gamble. A bet on a player to score a goal at any time in a match tends to receive bad odds for the customer. However, the more specific the calculation is made by adding more variables, the better value for money the customer will receive per bet. A calculation will be made determining if the bet is very good, good, average, bad or not recommended. The brackets will span 10% either side of the bookmaker’s percentage, so for instance if the bookmaker’s odds were 66% the brackets would be between 56% and 76%.

4.4 Wireframe

Before actually designing the beta system in Excel it is vital to create a basic design.
Wireframe designs help to create a skin-like perception of a system/application. This helps to give a good understanding of how it will eventually look. For Excel there are limitations regarding design; however, it can still be laid out in multiple different ways. This helps to create a template which can be followed for future design alterations.
4.4.1 Statistic data design

The design of this page was made to be quite simplistic and generic so that it could be added to with ease. The top variables bar showed all of the headings for the data and was frozen, so that when a user scrolled down the data it would always stay snapped to the top. This was done to prevent data being entered incorrectly and to improve the user experience. Along the side are the dates for the period of time the data refers to.

(Wireframe: sheet tabs Player 1, Player 2 and User Interface; Dates down the side, Variables along the top, Player Data in the body)

4.4.2 Calculations input design

The design of the calculations page was made to give the user simple options to select the variables they wanted and get a fast result. There are 4 drop down variables which are clear for the user. The percentages were placed in the centre of the interface as they could be considered the most important visual the user sees. The percentage which is produced is the final piece of output data, leading to the prediction advice and the end of the process.
(Wireframe: sheet tabs Player 1, Player 2 and User Interface; Player, Venue, Competition and Goal type dropdowns; Odds input; Bookie percentage; Stats percentage; Prediction Advice)

4.5 Process structure

This is a diagram showing the structure of how a run through of the system by a user would work. The first main step is loading the initial user interface datasheet, which will present the designed template. The user will then be prompted to enter the odds they have been given for a particular bet. The user will then select the 4 variables from the drop down menus to match the stipulations of the bet. Once this has been done the system will use all of the variables and user input in several equations and create a scenario. This will then tell the system to gather the required data from the statistics datasheet. The data will then be used in an algorithm to generate the two percentages. Finally the percentages will be compared and given a rating, which will then be displayed to the user.
(Process diagram: Open data sheet, Input Odds, four Variable selections, Equations, Player Data, Algorithm, Advice)

4.6 User Interface

The user interface for this system has been made very simple, as the statistics it generates are the key to solving the problem. In total there are three different sections, which are divided by datasheets. Two of these are full of data for each individual player, while the third sheet holds the main algorithm and equations.
4.6.1 Calculations page

This datasheet only has a small section which is visible. This puts focus on what a potential user would see and interact with to use the system. This is a prototype version of the system which demonstrates the algorithm and equations which solve the original problem.

4.6.2 Ronaldo datasheet

This is the datasheet for Cristiano Ronaldo’s statistics; it contains all the variables used in the main algorithm. New data can be imported or added manually to increase the accuracy of the system.

4.6.3 Messi Datasheet

This is the other datasheet, for Messi’s statistics. As mentioned above, it contains all the information needed to aid the calculations page. It should also be mentioned that new variables can easily be added to create an updated algorithm.

4.7 User Interactions

The system has several user interactions which allow the user to alter the variables and so change the statistical percentage being generated. Effective, well designed user interactions are very important in any system, but this particular solution would need to be slick and fast to allow for in play betting, which has fluctuating odds. New users who may not be computer literate would also need to be able to use this easily in the future. The only selectable cells in the datasheet are the user interactions; everything else is locked to prevent any formula deletion.
4.7.1 Drop down variables

Each of the variables has its own drop down menu; this allows users to mix and match different combinations to create their ideal bet. These were originally designed as radio buttons, but drop down menus seemed much more efficient.

4.7.2 Player Variable

This allows the user to select either Messi or Ronaldo. This interaction can be expanded to cover several players by simply adding a new datasheet with player statistics.

4.7.3 Venue Variable

The venue variable simply determines whether the percentage should be calculated on the basis that the player in question is playing at home or away from their home city. Away should be selected if the game is being played at a neutral ground, such as in a cup final.

4.7.4 Competition Variable

This allows the user to select which competition the player is going to be participating in. This is currently the variable with the most options and will have the most effect on the overall percentage. This is due to the fact that a player may rarely play in a particular competition such as the cup, and therefore their goal ratio statistics may be significantly lower than for the other options.
4.7.5 Goal type variable

The final variable available to users in this prototype system is the goal type, for which they can select either anytime or first. This option determines what type of bet is being made; typically it is more common for people to bet on the first goal scorer before the match as it will have significantly better odds. This also means that users will see a very large change in percentages when this variable is changed, sometimes even more so than with the competition.

4.7.6 Odds input

The odds input supplies two boxes which allow the user to type in the odds that they have been given by the bookmaker. The two figures are the source data from which the percentage for the bookmaker is calculated. There is also validation on these cells as they are the only user input. This prevents any possible errors and crashes if the wrong type or number of characters is entered.

4.7.7 Calculated Percentages

This displays the calculated percentages for both the bookmaker and the statistical prediction to the user. This is considered to be a user interaction as a user may wish to modify the variables/odds until they find a good bet. In essence the user is controlling the percentages, which is why it is an interaction.

4.7.8 Prediction advice

This is the overall result of everything involved in the system; it shows the user if a particular bet is statistically worth it. This will be shown in a variety of different colours representing the level of risk involved with placing a bet. A traffic light approach has been used as it is a common and clear way of indicating ranking levels. It is important to have clear and bold colours as a user may need to quickly see if a bet is worth it before the odds suddenly change.
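The bookmaker conversion from section 4.3.2 and the five-level advice from section 4.3.2.1 can be sketched in Python as follows. The conversion follows the rule given in the text (evens = 50%). The advice thresholds are an assumption: the text only states that the brackets sit 10% either side of the bookmaker's percentage, so the half-band split used here for "good" and "bad" is hypothetical, as are the function names.

```python
def bookmaker_percentage(numerator, denominator):
    """Convert fractional odds to the report's percentage scale (evens = 50%)."""
    if numerator <= 0 or denominator <= 0:
        raise ValueError("odds must be positive numbers")
    low, high = sorted((numerator, denominator))
    scaled = low / high * 50
    # Odds-on (e.g. 1/3): the outcome is rated above 50%, so mirror the value.
    if numerator < denominator:
        return 100 - scaled
    # Evens and odds against (e.g. 3/1) use the scaled value directly.
    return scaled


def prediction_advice(stat_pct, bookie_pct, band=10):
    """Rate a bet by how far the statistical percentage sits from the bookmaker's.

    Assumption: the five ratings are spread over the +/- band around the
    bookmaker's percentage; the exact cut-offs are not given in the text.
    """
    diff = stat_pct - bookie_pct
    if diff > band:
        return "very good"
    if diff > band / 2:
        return "good"
    if diff >= -band / 2:
        return "average"
    if diff >= -band:
        return "bad"
    return "not recommended"


ronaldo_odds = bookmaker_percentage(1, 3)   # about 83.3
print(round(ronaldo_odds, 1), prediction_advice(94, ronaldo_odds))
```

This scale is the report's own. The conventional implied probability for fractional odds a/b is b / (a + b), which would give 75% for odds of 1/3, so percentages from the two scales are not directly interchangeable.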
5 Testing

5.1 Methodology

For this project the aim was to prove that data mining could in fact be used to aid in sports betting. The idea has been proposed and researched in depth before, with multiple systems being developed; however, none of them focused on individual players. The aim was to develop an algorithm which would generate a percentage that takes multiple variables into consideration to make a solid prediction. This prediction would then be compared with the odds given to see if any advantages could be found.

5.1.1 Theory

The theory was essentially that the raw data from an individual player’s previous seasons could be used to calculate how likely said player was to score a goal in a certain situation. The algorithm itself would need to use solid data averages, which would first need to be calculated using multiple equations. These equations would work out the individual’s variables one by one. For example, the following equation was used to work out the individual percentage chance of one variable:

Variable total / Total games played * 100 = X

Using this initial simple equation it was possible to work out preliminary statistics which could later be used in the overall algorithm. An example of this would be: Ronaldo has played a total of 31 games this season and scored 20, 5 and 6 goals across all competitions.

20 + 5 + 6 = 31
31 / 31 = 1

This simply shows that, surprisingly, on average Ronaldo has an exact goal per game ratio of 1.0, which is equivalent to 100%. From this initial calculation it would seem that any odds given by bookmakers which are less than 100% would be a sure win; however, these stats represent a general average for Ronaldo. For instance, if we add the variable of only Champions League games his average drops markedly (80%). Taking everything into consideration, an algorithm can then be constructed to create very precise percentages that represent a player’s likelihood of scoring a goal within a match.
For instance, using the system it is possible to work out how likely Messi is to score in the last 10 minutes of an away match in the Spanish league.

((5 + 3) / 41 * 100 + 22 + 51 + 14) / 3 = 35.5, or roughly 36%

This takes into account all of the variables mentioned to create a percentage prediction. This would seem high for an average player; however, due to Messi’s abnormal goal scoring ratio it is very realistic by his standards.
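The two calculations in this section can be checked with a short Python sketch. It assumes the per-variable equation is the variable total divided by games played, and that the Messi figure is read as the sum of one calculated percentage and three further percentages, divided by 3. That grouping is the one that reproduces the quoted 36%, and the text does not say what the values 5, 3, 22, 51 and 14 each represent. The function names are illustrative.

```python
def variable_percentage(occurrences, games_played):
    """Percentage for one variable: occurrences per game played, times 100."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return occurrences / games_played * 100


def combined_percentage(first_pct, other_pcts, divisor):
    """Combine several percentages as in the Messi example.

    Assumption: the divisor (3 in the text) is applied to the whole sum.
    """
    return (first_pct + sum(other_pcts)) / divisor


# Ronaldo: 20 + 5 + 6 goals in 31 games.
ronaldo = variable_percentage(20 + 5 + 6, 31)

# Messi: (5 + 3) over 41 games, combined with the three quoted percentages.
messi_first_term = variable_percentage(5 + 3, 41)
messi = combined_percentage(messi_first_term, [22, 51, 14], divisor=3)

print(round(ronaldo), round(messi))  # prints: 100 36
```

Dividing four terms by 3 is unusual for an average, so the grouping should be confirmed against the formulas in the datasheet before it is relied on.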
5.2 Results

To test the results of this project a series of matches and odds from 2 different bookmakers were used to test the accuracy of the predictions. Both players were tested so that a comparison could be made. A total of 5 fixtures were tested for each player to give a good spread of data to analyse.

The following table shows testing for Cristiano Ronaldo over a 5 game period.

Date      Fixture   Prediction   William Hill   Ladbrokes   Outcome
16/3/13   La Liga   94%          88%            91%         1 Goal
30/3/13   La Liga   92%          86%            90%         1 Goal
6/4/13    La Liga   92%          88%            89%         1 Goal
9/4/13    Europe    88%          88%            90%         2 Goals
14/5/13   La Liga   92%          90%            92%         2 Goals

From these 5 games tested for Ronaldo, the statistical prediction was in fact more accurate than the bookmakers’ predictions, which was very surprising. It was beneficial for the test that Ronaldo scored in every test match. Between the two bookmakers, Ladbrokes stayed quite firm with their percentage while William Hill was more flexible. This may be due to them trying to offer better odds to entice customers to place bets with them. The only blip in the testing for the system was the game in Europe, which had a percentage drop for Ronaldo. This is because Ronaldo is currently on form and scoring freely in all competitions, so his previous data for Europe wasn’t as accurate due to there being a large gap since the last match.

The following table shows testing for Lionel Messi over a 5 game period.

Date      Fixture   Prediction   William Hill   Ladbrokes   Outcome
12/3/13   La Liga   92%          90%            91%         2 Goals
17/3/13   La Liga   94%          94%            94%         2 Goals
30/3/13   La Liga   95%          95%            95%         1 Goal
2/4/13    Europe    94%          94%            95%         1 Goal
10/5/13   Europe    94%          65%            60%         0 Goals

The test results for Lionel Messi showed some interesting results and possible flaws with the current design of the system. The final test match had a massive drop from the two bookmakers; this was due to Messi being injured in a previous match and being doubtful to start the game.
The current system has no variables for injuries and therefore predicted a very high percentage even though Messi was unlikely to play the full game, giving him much less chance of scoring. The bookmakers still gave him good odds to score for an injured player, although given his goal ratio this was not too surprising. The overall results, however, were positive, with some predictions matching or surpassing the bookmakers, which showed the system can be very effective if there are no unusual variables.
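One way to put a number on "more accurate" is the Brier score, the mean squared gap between a forecast probability and what actually happened (lower is better). This is a swapped-in measure, not the method used in the project. The sketch below applies it to the percentages from the two tables above. It assumes the bookmaker percentages can be read directly as probabilities and treats the outcome as simply "scored" or "did not score".

```python
def brier_score(percentages, scored):
    """Mean squared gap between forecast probability and the 0/1 outcome."""
    pairs = list(zip(percentages, scored))
    return sum((p / 100 - int(s)) ** 2 for p, s in pairs) / len(pairs)


# Percentages copied from the Ronaldo and Messi results tables.
ronaldo = {
    "system": [94, 92, 92, 88, 92],
    "william_hill": [88, 86, 88, 88, 90],
    "ladbrokes": [91, 90, 89, 90, 92],
}
ronaldo_scored = [True, True, True, True, True]

messi = {
    "system": [92, 94, 95, 94, 94],
    "william_hill": [90, 94, 95, 94, 65],
    "ladbrokes": [91, 94, 95, 95, 60],
}
messi_scored = [True, True, True, True, False]

ronaldo_scores = {name: brier_score(p, ronaldo_scored) for name, p in ronaldo.items()}
messi_scores = {name: brier_score(p, messi_scored) for name, p in messi.items()}

for player, scores in (("Ronaldo", ronaldo_scores), ("Messi", messi_scores)):
    for name, score in sorted(scores.items(), key=lambda item: item[1]):
        print(f"{player:8} {name:13} {score:.4f}")
```

On these figures the system has the lowest score for Ronaldo and the highest for Messi. The Messi result is driven entirely by the final fixture, which matches the injury flaw described above. With only 5 fixtures per player this is an illustration, not evidence.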
5.3 Result Analysis

The 5 results showed on average that the developed system was more effective than the bookmakers’ algorithm; however, this may have just been coincidence, as only a small number of games were tested. From viewing the odds the bookmakers gave out, it is clear that they try to keep a similar pattern unless something drastic happens, such as an injury. This is an interesting approach, as they always have low odds for Messi and Ronaldo to score no matter who they are playing. Even though these two players score a high ratio of goals, they will never score in 100% of their games. For instance, Messi’s goal total may be higher than his overall games played, but he may have scored multiple times in one match, which skews the statistics.

5.3.1 Method Analysis

The solution which has been developed and implemented has many different levels of depth and complexity. The player variables are the key factor that sets this system apart from other systems currently available, which focus only on team variables and very rarely attempt to include the players in their equations. However, there may be a reason for this, as players can be very unpredictable.

5.3.2 Method Flaws

One flaw the system will come across is that individual players can be very unpredictable, as each is one person with their own free will. This means that even if all the statistics point to a player having a fantastic game and scoring at least 1 goal, there may be other factors off the pitch. For instance, a player may have just had some very bad news and therefore not be playing to their full potential. This may be why there are many team-based forecasting systems, as a team overall provides an average of stability compared to 1 possibly unpredictable player. The system also has a drawback with the

5.3.3 Bookmakers calculations

Bookmakers have a way of working out their odds; each single match is individually calculated.
They work out every possible outcome before setting any odds. This is an example of how they calculate the distribution rate, where [1], [X] and [2] are the home win, draw and away win:

100% x 1 / [(1 / odds of [1]) + (1 / odds of [X]) + (1 / odds of [2])]

Taking in all of the information from the bookmakers and the system that was developed, it is clear that both have smart systems which can compete against each other without ever finding a winner. Using a system such as this would definitely be more beneficial for customers than just betting on bets they believe look good, as these could all just be a ploy to get them to spend money.
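The distribution-rate formula above can be sketched in a few lines. This is a minimal illustration that assumes decimal odds; the function name and the example odds are hypothetical and not taken from any bookmaker.

```python
def distribution_rate(odds_home, odds_draw, odds_away):
    """Percentage of stakes paid back across the three 1X2 outcomes.

    Assumes decimal odds (for example 2.10), which must all be positive.
    """
    odds = (odds_home, odds_draw, odds_away)
    if any(o <= 0 for o in odds):
        raise ValueError("decimal odds must be positive")
    # Each reciprocal is the probability implied by that price.
    implied_total = sum(1 / o for o in odds)
    return 100 / implied_total


# Hypothetical prices for a single match.
rate = distribution_rate(2.10, 3.40, 3.60)
print(f"distribution rate: {rate:.1f}%")
print(f"bookmaker margin:  {100 - rate:.1f}%")
```

A rate below 100% means the implied probabilities sum to more than 1, and the difference is the bookmaker’s margin. This is one reason a bookmaker percentage is not directly comparable with a prediction that carries no margin.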
6 Conclusion

Overall this project has been a successful endeavour; the original problem has been theoretically solved, with evidence from testing and calculations. This was achieved through multiple factors which all contributed equally. There were a few problems which led to alterations to the designs and modifications to the equations; however, these improved the system. The research conducted looked at a variety of research papers and alternative solutions already available. The project had its strengths and weaknesses, which will be analysed to determine what was done well and what could have been done differently. There is much room for improvement in this project, which has only touched the surface of an idea that may become mainstream in the not so distant future. In conclusion, the project was a success and will definitely be expanded on in the future.

6.1 Problem Analysis Review

The problem was determined from the project question, and from that a firm target was set. The goal was to create a system or method of using previous sports statistics to aid in sports betting. This was discussed in detail, and a few current solutions were found and briefly looked at. There was good analysis of the gap in the market for a possible system and of the techniques it could use. However, it would have been better to compare and contrast some current systems in more detail. This may have led to an improved overall system if parts from other systems had been incorporated.

6.2 Review of Literature Review

The research conducted was good, but more should have been added overall. Some interesting theories and information were found which helped when designing the solution; however, there were several more areas which possibly should have been covered.

6.3 Problem Requirements Review

This section of the report was based upon all the possibilities and requirements that would be in the project.
A good selection of possibilities was discussed, and each one had its own strengths and weaknesses. A final solution was

6.4 Design Review

The designs were very simple, as the project was mainly about the equations and logic of the algorithm rather than creating a perfectly designed user interface. The designs could have included more planning and the use of a Gantt chart.

6.5 Results Review

Overall the results were very promising; they showed that the system works and that it is reliable in most scenarios. However, they did show that there were a few minor problems regarding variables that could not be controlled, such as injuries and red cards. Accounting for these would only be possible once the system was live and streaming data constantly.
6.6 Future development

This project in theory solved the problem it was presented with; however, there is a massive possibility for expansion and improvement, which could lead to a new generation of sports betting. The solution depicted would just be the bare bones of a fantastic concept which could expand onto many different platforms.

6.6.1 The Variables

The data initially collected contained many more variables which could have enhanced the overall accuracy of the prediction. One of these variables was the attendance at games; this could be a major factor, as statistically players play much better with a home crowd. A very small crowd could either make the player play better, because there is less pressure, or worse, because there is less adrenaline.

The form of the team the player is playing against could also be a variable with a very large effect. This could be represented in different ways, such as the team’s current form or their overall position last year. Players may only tend to score goals against teams that always finish in the bottom half of the table. The form of the team the player plays for could also be analysed to see how much of an effect that has.

The system developed only took into account whether a player played in a match; however, players can also be substituted on or off. Using the exact number of minutes a player has played would allow more accurate goal averages to be determined.

6.6.2 Real time betting

A large improvement that could be made in the future is to incorporate the minutes in which a player scores goals, to allow real time betting. The system would allow the user not only to select the initial variables but also to enter the current data from a live game. This would mean that a percentage could be calculated for a very specific situation that could then be bet on.
New data could then be incorporated into the algorithm, such as the current score of the game, how many minutes have been played and even any red cards. An example of this would be if Chelsea were playing Liverpool in the 66th minute and the score was 2-1; the system would be able to work out the percentage chance of Liverpool’s Luis Suarez scoring the next goal. Using all of the new additions, this would look at how many times Suarez has scored against teams who finished in the top 5 last season, after the 70th minute, when Liverpool were losing. This would present a unique percentage which would then be compared to the bookmaker’s odds.

This has massive potential for success, as bookmakers tend to lower the odds of a player scoring depending solely on factors about the game, such as the current score and the overall performance of the player. Another major factor is the actual bets being placed on the player: if the player gets a flurry of bets the odds will drop, and vice versa. This means that statistical data could prove to be a much more accurate way to make short-term predictions about live games.
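The live-game lookup described above amounts to filtering a player’s past matches by the current situation and counting how often he scored. The sketch below is one possible shape for that step. The record fields, the function name and the sample rows are all hypothetical; no real match data is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchRecord:
    # Hypothetical fields describing one past match for a single player.
    opponent_finished_top5: bool
    team_losing_at_70: bool
    scored_after_70: bool


def late_goal_percentage(records):
    """Share of matching past situations in which the player scored late.

    Returns None when no past match fits the situation, so that an empty
    sample is not mistaken for a 0% chance.
    """
    relevant = [
        r for r in records
        if r.opponent_finished_top5 and r.team_losing_at_70
    ]
    if not relevant:
        return None
    hits = sum(r.scored_after_70 for r in relevant)
    return 100 * hits / len(relevant)


# Made-up rows, for illustration only.
sample = [
    MatchRecord(True, True, True),
    MatchRecord(True, True, False),
    MatchRecord(True, False, True),
    MatchRecord(False, True, True),
    MatchRecord(True, True, False),
    MatchRecord(True, True, True),
]

print(late_goal_percentage(sample))
```

The more conditions are stacked, the fewer past matches qualify, so a live percentage of this kind may rest on very few games. Reporting the sample size alongside the percentage would make that visible to the user.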
6.6.3 The platforms

A system such as this could be developed in the future for many different platforms to target the large market. Once the Excel prototype has been perfected, the first step would be to develop a website-based system requiring a login, so that a user base could be formed. The online system would have a vast database running behind it, which would constantly pull data from a reliable statistics website to stay up to date. The system would not only allow users to create custom odds but also automatically search for and compare the best current differences in odds available. Every player from the top 6 leagues would have statistics, so that users would have a wide variety of choices and get the best possible bets. This automatic feature would benefit customers who have no football knowledge and just want to know what is statistically likely to happen and which bet would be the best value for their money.

Following on from this, a Windows application could be released which syncs with the servers and allows users to use the system from their desktop. Finally, a mobile application could be created which would work on a similar premise to the Windows application but have a simpler design and a touchscreen-friendly layout.

6.6.4 Business opportunity

This system has the potential to earn a good profit in the future. A subscription-based service would be offered to supply the data and percentages. People may be more willing to pay for a service that has a good chance of earning them more money in the not so distant future. People are also trying to get any type of advantage they can, so this would be a very plausible business idea. The database and algorithms would never be available in their raw form to customers, to prevent any attempt to steal them.
One idea would be to offer the service free for a limited time to new customers so that they can get a taste of the system before subscribing. This could work by letting them view a total of 10 calculated percentages per registered email account. During the early stages of the website, advertisements could be used to generate a steady income from users browsing and clicking on them. When the application is finally released, a small charge such as 59p could be made for the download, which would encourage new users to try it. This seems like a small fee, but if the app were downloaded over 500,000 times it would make a very good profit.

6.6.5 Global Reach

Once the system has been fully established, the next step would be to expand beyond just one sport. America has a massive market for sports betting, and there are many systems that attempt to forecast the results of different sporting events. However, an altered version of this algorithm could gain a lot of attention. The system could be used in many different sports, such as American football and basketball. These sports are highly tactical and statistics are heavily tracked. For example, American football teams tend to have players just for particular roles in the game, and will bring a player on for perhaps only 5 minutes to do a certain job. Even though this is very different from football, the system has great potential to be expanded into that area.
Summary

The aim of this project was to find out whether it was possible to use data mining to increase the chances of winning sports bets. Overall the project proved that it was possible; however, it would take more time to fine-tune a perfect algorithm. A system was set up in Excel so that the formulas and equations could be tested with a semi-automatic process. This was made to demonstrate the impact that this or a similar algorithm could have on the betting industry.