EXTRACTING AND ANALYZING CRICKET STATISTICS WITH R AND PYSPARK
Parag Ahire
February 16, 2020
OUR GOALS TODAY
• Gather data, analyze it, visualize it, and draw conclusions to answer a specific question
• Data domain - Test match data from the game of cricket
• Use R to gather the data
• Use PySpark/Python to summarize, visualize, and analyze the data
OUR PROGRESSION TODAY
• Get familiar with the domain and data
• Define the specific question
• Get familiar with
    - The data source
    - The subjects of focus
• Install the tools needed
• Write a program to fetch the data
• Run the program to fetch the data
• Write a program to summarize, visualize, and analyze the data
• Answer the specific question
DATA DOMAIN
• The game of cricket
    - Get familiar with the game
    - Understand some terms, formats, player roles, and rules
• Focus on four all-rounders
    - Placed in the top 25 legends of the game
    - Sir Ian Botham - England
    - Imran Khan - Pakistan
    - Kapil Dev - India
    - Sir Richard Hadlee - New Zealand
• Use their bowling data
SPECIFIC QUESTION
Which of the four all-rounders was the most effective bowler over his entire career?
Consider statistics such as wickets taken, wickets per inning, average, economy, and strike rate.
Rank the bowlers in descending order of their ability to intimidate batsmen and deliver outstanding performances year after year.
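For readers new to cricket, the statistics named above can be written as simple ratios of career totals. The sketch below is a minimal illustration, not the presenter's code: the name `bowling_summary` and the totals in the usage line are made up for the example. It uses the standard definitions, in which average is runs conceded per wicket, economy is runs conceded per six-ball over, and strike rate is balls bowled per wicket.

```python
from collections import namedtuple

BowlingSummary = namedtuple(
    "BowlingSummary", "wickets wickets_per_inning average economy strike_rate"
)

def bowling_summary(total_runs, total_balls, total_wickets, total_innings):
    """Derive career bowling statistics from four career totals."""
    if total_wickets <= 0 or total_balls <= 0 or total_innings <= 0:
        raise ValueError("totals must be positive to compute the ratios")
    return BowlingSummary(
        wickets=total_wickets,
        wickets_per_inning=round(total_wickets / total_innings, 3),
        average=round(total_runs / total_wickets, 3),      # runs per wicket
        economy=round(total_runs * 6 / total_balls, 3),    # runs per six-ball over
        strike_rate=round(total_balls / total_wickets, 3), # balls per wicket
    )

# Illustrative totals only, not real career figures
example = bowling_summary(total_runs=1000, total_balls=2400,
                          total_wickets=40, total_innings=20)
print(example)
```

For a bowler, lower is better for average, economy, and strike rate, while higher is better for wickets and wickets per inning.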
DATA SOURCE
• ESPNCRICINFO - A sports news website devoted exclusively to the game of cricket
• StatsGuru - A database of historical matches and players from the 18th century to the present
• Sample Test Scorecard
TOOLS USED
R – Version 3.1.2 or higher
RStudio - Optional
CRAN cricketr package
Java – Version 1.8
Spark – Version 3.0.0
Anaconda for Python 3
winutils.exe
WinRAR or 7-Zip
INSTALL R / RSTUDIO / CRICKETR
WINDOWS
• R - https://cran.r-project.org/bin/windows/base/R-3.6.2-win.exe
• RStudio - https://rstudio.com/products/rstudio/download/
• CRAN cricketr package
    - install.packages("cricketr")
• Test the package
    - library(cricketr)
INSTALL JDK / ANACONDA / SPARK
WINDOWS or MAC or LINUX
• Installing Apache Spark and Python
    - https://sundog-education.com/spark-python/
SAMPLE R CODE
library(cricketr)

# Kapil Dev's player id : 30028
# Fetch Test bowling data: home and away (1, 2); won, lost and drawn matches (1, 2, 4)
kapilDev = getPlayerData(30028, dir="C:/Meetup", file="Kapil.csv",
                         type="bowling", homeOrAway=c(1,2), result=c(1,2,4))
write.csv(kapilDev, 'C:/Meetup/Kapil.csv', row.names=FALSE, quote=FALSE)

# Imran Khan's player id : 40560
# Ian Botham's player id : 9163
# Richard Hadlee's player id : 37224
SAMPLE PYSPARK CODE
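def getBalls(overs):
    # Convert an overs string such as "23.4" (23 overs and 4 balls) to balls, six balls per over
    dotLocation = overs.find('.')
    if dotLocation == -1:
        # No partial over: the whole string is the number of completed overs
        return int(overs) * 6
    return int(overs[0:dotLocation]) * 6 + int(overs[(dotLocation + 1):])

def getBallsByColumn(overs, bpo):
    # Same conversion, with the balls per over (bpo) taken from a column, e.g. 8 for eight-ball overs
    bpo = int(bpo)
    dotLocation = overs.find('.')
    if dotLocation == -1:
        return int(overs) * bpo
    return int(overs[0:dotLocation]) * bpo + int(overs[(dotLocation + 1):])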
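Scorecards write overs as completed overs, a dot, then the balls of the unfinished over, so "23.4" means 23 overs and 4 balls rather than 23.4 decimal overs. The sketch below restates that conversion in plain Python, without Spark, and adds the round trip back to scorecard notation. The names `overs_to_balls` and `balls_to_overs` and the `balls_per_over` parameter are introduced here for illustration and are not part of the original program.

```python
def overs_to_balls(overs, balls_per_over=6):
    """Convert scorecard overs notation (e.g. "23.4") to a ball count."""
    whole, _, part = str(overs).partition(".")
    return int(whole) * balls_per_over + (int(part) if part else 0)

def balls_to_overs(balls, balls_per_over=6):
    """Convert a ball count back to scorecard overs notation."""
    return "{}.{}".format(balls // balls_per_over, balls % balls_per_over)

print(overs_to_balls("23.4"))     # 23 * 6 + 4 = 142
print(overs_to_balls("20"))       # 120, no partial over
print(overs_to_balls("10.3", 8))  # 10 * 8 + 3 = 83 for eight-ball overs
print(balls_to_overs(142))        # 23.4
```

Summing balls rather than overs strings avoids treating "0.5" plus "0.5" as one over; five balls plus five balls is ten balls, or "1.4" in scorecard notation.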
SAMPLE PYSPARK CODE
from collections import namedtuple
from pyspark.sql import functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# sqlContext, eightBallSchema and sixBallSchema are defined elsewhere in the full program
CareerBowlingStatistics = namedtuple("CareerBowlingStatistics", "Wickets, WicketsPerInning, Average, Economy, StrikeRate")

def calculateKeyStatistics(player, isBPOColumnPresent):
    print("\nKey Statistics for " + player + " :")
    file = "C:/Meetup/" + player + ".csv"
    # Files with a BPO (balls per over) column use eightBallSchema, the rest use sixBallSchema
    schema = eightBallSchema if isBPOColumnPresent else sixBallSchema
    bowler = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
        inferschema='false').schema(schema).load(file)
    # Drop innings in which the player did not bowl (DNB / TDNB)
    bowlerValidRows = bowler.filter((bowler["Overs"] != "DNB") & (bowler["Overs"] != "TDNB"))
    # Add a Balls column derived from the Overs notation
    if isBPOColumnPresent:
        bowlerValidRows = bowlerValidRows.withColumn('Balls', udf(getBallsByColumn)("Overs", "BPO"))
    else:
        bowlerValidRows = bowlerValidRows.withColumn('Balls', udf(getBalls)("Overs"))
SAMPLE PYSPARK CODE
def calculateKeyStatistics(player, isBPOColumnPresent):
    # .... (the statements from the previous slide go here)
    # Career totals over the innings in which the player bowled
    totalBalls = bowlerValidRows.agg(func.sum("Balls")).collect()[0][0]
    totalOvers = str(int(int(totalBalls) / 6)) + "." + str(int(totalBalls) % 6)
    totalInnings = bowlerValidRows.count()
    totalWickets = bowlerValidRows.agg(func.sum("Wkts")).collect()[0][0]
    wicketPerInning = round((float(totalWickets) / totalInnings), 3)
    totalRuns = bowlerValidRows.agg(func.sum("Runs")).collect()[0][0]
    # Average: runs conceded per wicket
    average = round((float(totalRuns) / totalWickets), 3)
    # Economy: runs conceded per six-ball over
    economy = round((float((totalRuns) * 6) / totalBalls), 3)
    bowlerValidRows = bowlerValidRows.withColumn("Balls", bowlerValidRows["Balls"].cast(IntegerType()))
    # Strike rate: balls bowled per wicket
    strikeRate = round((float(totalBalls) / totalWickets), 3)
    careerBowlingStatistics = CareerBowlingStatistics(int(totalWickets), float(wicketPerInning),
        float(average), float(economy), float(strikeRate))
    return careerBowlingStatistics
SAMPLE PYSPARK CODE
import matplotlib.pyplot as plt
import numpy as np

def plotStatistic(bowlers, values, xlabel, ylabel, title, fileName):
    plt.style.use('ggplot')
    width = 0.35
    fig, ax = plt.subplots()
    x = np.arange(len(bowlers))
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.set_xticks(x)
    ax.set_xticklabels(bowlers)
    # International All-Rounder Colors
    # A single series is plotted, so the bars are centered on the tick positions
    rects = ax.bar(x, values, width, label=xlabel,
                   color=['xkcd:bright blue', 'xkcd:lime green', 'xkcd:light blue', 'xkcd:grey'])
    # Label each bar with its value
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height), xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points", ha='center', va='bottom')
    fig.tight_layout()
    plt.savefig('C:/Meetup/' + fileName + '.png', dpi=300, bbox_inches='tight')
CAREER - WICKETS
CAREER - WICKETS PER INNING
CAREER – STRIKE RATE
CAREER - AVERAGE
CAREER - ECONOMY
RANK – STATISTICAL CRITERIA
CRITERIA             1        2        3        4
STRIKE RATE          Hadlee   Imran    Botham   Kapil
WICKETS PER INNING   Hadlee   Imran    Botham   Kapil
AVERAGE              Hadlee   Imran    Botham   Kapil
ECONOMY              Imran    Hadlee   Kapil    Botham
WICKETS              Kapil    Hadlee   Botham   Imran
OVERALL RANKING – TOP STATISTICAL CRITERIA
CRITERIA             1        2        3        4
STRIKE RATE          Hadlee   Imran    Botham   Kapil
WICKETS PER INNING   Hadlee   Imran    Botham   Kapil
AVERAGE              Hadlee   Imran    Botham   Kapil
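The deck does not spell out how the per-criterion rankings are combined into an overall ranking. One simple possibility, offered here only as an assumption, is to sum each bowler's rank position across the chosen criteria and sort by the total, with lower totals ranking higher. The sketch below applies that idea to the rankings in the tables above; the function name `overall_ranking` is hypothetical.

```python
# Rankings copied from the tables above, best (1) to worst (4) per criterion
RANKINGS = {
    "STRIKE RATE":        ["Hadlee", "Imran", "Botham", "Kapil"],
    "WICKETS PER INNING": ["Hadlee", "Imran", "Botham", "Kapil"],
    "AVERAGE":            ["Hadlee", "Imran", "Botham", "Kapil"],
    "ECONOMY":            ["Imran", "Hadlee", "Kapil", "Botham"],
    "WICKETS":            ["Kapil", "Hadlee", "Botham", "Imran"],
}

def overall_ranking(rankings, criteria):
    """Sum rank positions over the given criteria; lower totals rank higher."""
    totals = {}
    for criterion in criteria:
        for position, bowler in enumerate(rankings[criterion], start=1):
            totals[bowler] = totals.get(bowler, 0) + position
    # sorted() is stable, so tied bowlers keep their first-seen order
    return sorted(totals.items(), key=lambda item: item[1])

top_three = ["STRIKE RATE", "WICKETS PER INNING", "AVERAGE"]
print(overall_ranking(RANKINGS, top_three))
print(overall_ranking(RANKINGS, list(RANKINGS)))
```

With the three top criteria the totals are 3, 6, 9 and 12, which reproduces the order Hadlee, Imran, Botham, Kapil. With all five criteria Hadlee and Imran stay first and second, while Botham and Kapil tie on 16.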
MOST DESTRUCTIVE BOWLER
UNFAIR TO COMPARE AND RANK THESE GREATS
THEY ARE ALL HIGH ACHIEVERS
REFERENCES
What is Cricket? Get to know the sport
Introduction to Cricket – 1
Introduction to Cricket – 2
Cricket: The Basic Introduction
Cricket for Americans
ESPN's Legends of Cricket - The Top 25 Cricketers Of All Time
ESPN Legends of Cricket : Sir Ian Botham
ESPN Legends of Cricket : Imran Khan
ESPN Legends of Cricket : Kapil Dev
ESPN Legends of Cricket : Sir Richard Hadlee
REFERENCES
ESPNCRICINFO
StatsGuru
Sample Test Match Scorecard
The R Project for Statistical Computing
RStudio (Optional)
CRAN – Package cricketr
Installing Apache Spark and Python – Sundog Education
Sundog Education – Installing Apache Spark and Python
cricketr – getPlayerData
Code
