This document outlines a project to analyze cricket statistics for four all-rounders (Ian Botham, Imran Khan, Kapil Dev, and Richard Hadlee) using R and PySpark. The goals are to gather test match data, analyze it to answer a specific question, and visualize the results. The question is which bowler was most effective over their career based on statistics like wickets, average, economy, and strike rate. The document describes collecting data from online sources using R, processing it with PySpark, calculating key statistics for each player, and creating visualizations to compare the players and rank their bowling ability.
2. OUR GOALS TODAY
Gather data, analyze the data, visualize the data and draw some
conclusions to answer a specific question
Data Domain – Test match data of the game of cricket
Utilize R to gather the data
Utilize PySpark/Python to summarize, visualize and analyze the
data
3. OUR PROGRESSION TODAY
Get familiar with the domain and data
Define the specific question
Get familiar with the data source and the subjects of focus
Install the tools needed
Write program to fetch the data
Execute program to fetch the data
Write program to summarize, visualize and analyze the data
Answer the specific question
4. DATA DOMAIN
The game of cricket
Get familiar with the game
Understand some terms, formats, player roles and rules
Focus on four all-rounders
Placed in the top 25 legends of the game
Sir Ian Botham – England
Imran Khan – Pakistan
Kapil Dev – India
Sir Richard Hadlee – New Zealand
Use their bowling data
5. SPECIFIC QUESTION
Which of the four all-rounders was the most effective as a
bowler over their entire career? Consider statistics like wickets
taken, wickets per innings, average, economy and strike rate.
Rank the bowlers in descending order of their ability to
intimidate batsmen and deliver outstanding performances
year after year.
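The statistics named in the question follow the standard cricket definitions: bowling average is runs conceded per wicket, economy is runs conceded per over, and strike rate is balls bowled per wicket. Lower is better for all three. A minimal Python sketch of these formulas is given here. The class name, the helper function, the six-ball-over assumption and the sample figures are all illustrative placeholders, not real career data, and ranking by average alone is only one of several reasonable choices.

```python
from dataclasses import dataclass

BALLS_PER_OVER = 6  # assumption: six-ball overs throughout


@dataclass
class BowlingTotals:
    """Career bowling totals for one player (hypothetical container)."""
    name: str
    innings: int
    balls: int
    runs: int
    wickets: int

    @property
    def average(self) -> float:
        """Runs conceded per wicket (lower is better)."""
        return self.runs / self.wickets if self.wickets else float("inf")

    @property
    def economy(self) -> float:
        """Runs conceded per over (lower is better)."""
        return self.runs / (self.balls / BALLS_PER_OVER)

    @property
    def strike_rate(self) -> float:
        """Balls bowled per wicket (lower is better)."""
        return self.balls / self.wickets if self.wickets else float("inf")

    @property
    def wickets_per_innings(self) -> float:
        """Wickets taken per innings bowled (higher is better)."""
        return self.wickets / self.innings


def rank_by_average(bowlers):
    """Order bowlers from best (lowest) to worst bowling average."""
    return sorted(bowlers, key=lambda b: b.average)


# Placeholder figures for illustration only, not real statistics.
sample = [
    BowlingTotals("Bowler A", innings=100, balls=12000, runs=6000, wickets=200),
    BowlingTotals("Bowler B", innings=120, balls=15000, runs=6250, wickets=250),
]

ranked = rank_by_average(sample)
for b in ranked:
    print(f"{b.name}: avg={b.average:.2f} econ={b.economy:.2f} "
          f"sr={b.strike_rate:.1f} wpi={b.wickets_per_innings:.2f}")
```

A fuller ranking would probably weigh several of these measures together rather than sorting on one; the single sort key just keeps the sketch short.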
6. DATA SOURCE
ESPNCRICINFO – A sports news website exclusively for the
game of cricket
StatsGuru – A database of historical matches and players
from the 18th century to the present
Sample Test Scorecard
7. TOOLS USED
R – Version 3.1.2 or higher
RStudio - Optional
CRAN cricketr package
Java – Version 1.8
Spark – Version 3.0.0
Anaconda for Python 3
winutils.exe
WINRAR or 7-Zip
8. INSTALL R / R STUDIO / CRICKETR
WINDOWS
R – https://cran.r-project.org/bin/windows/base/R-3.6.2-win.exe
RStudio - https://rstudio.com/products/rstudio/download/
CRAN cricketr package
• install.packages("cricketr")
Test the package
• library(cricketr)
9. INSTALL JDK / ANACONDA / SPARK
WINDOWS or MAC or LINUX
Installing Apache Spark and Python
• https://sundog-education.com/spark-python/
10. SAMPLE R CODE
library(cricketr)
# Kapil Dev's player id : 30028
# Fetch his Test bowling record: home and away (1,2); won, lost and drawn (1,2,4)
kapilDev = getPlayerData(30028, dir="C:/Meetup", file="Kapil.csv",
type="bowling", homeOrAway=c(1,2), result=c(1,2,4))
write.csv(kapilDev, "C:/Meetup/Kapil.csv", row.names=FALSE,
quote=FALSE)
# Imran Khan's player id : 40560
# Ian Botham's player id : 9163
# Richard Hadlee's player id : 37224
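Once the R script has written a CSV per player, the summarizing step can be sketched in plain Python before moving to PySpark. The sketch below assumes the export has `Overs`, `Runs`, `Wkts` and `Start Date` columns, that overs are written in cricket notation (`23.4` meaning 23 overs and 4 balls), that dates end with a four-digit year, and that innings in which the player did not bowl carry non-numeric cells such as `DNB`. These are assumptions about the file layout, and the inline rows are made-up placeholders, so check them against the actual file.

```python
import csv
import io
from collections import defaultdict

BALLS_PER_OVER = 6  # assumption: six-ball overs in every match


def overs_to_balls(overs):
    """Convert overs notation ('23.4' = 23 overs and 4 balls) to balls."""
    whole, _, part = overs.partition(".")
    return int(whole) * BALLS_PER_OVER + (int(part) if part else 0)


def summarize(rows):
    """Return career totals and wickets per calendar year."""
    totals = {"innings": 0, "balls": 0, "runs": 0, "wickets": 0}
    wickets_by_year = defaultdict(int)
    for row in rows:
        try:
            balls = overs_to_balls(row["Overs"])
            runs = int(row["Runs"])
            wickets = int(row["Wkts"])
        except ValueError:
            # Non-numeric cells (e.g. "DNB") mean the player did not bowl.
            continue
        totals["innings"] += 1
        totals["balls"] += balls
        totals["runs"] += runs
        totals["wickets"] += wickets
        # Assumed date format: day, month, year (e.g. "16 Oct 1978").
        year = int(row["Start Date"].split()[-1])
        wickets_by_year[year] += wickets
    return totals, dict(wickets_by_year)


# Made-up placeholder rows in the assumed layout, not real match data.
SAMPLE_CSV = """\
Overs,Mdns,Runs,Wkts,Econ,Opposition,Ground,Start Date
16.0,2,71,1,4.43,Team X,Ground 1,16 Oct 1978
DNB,-,-,-,-,Team X,Ground 1,16 Oct 1978
23.4,5,60,4,2.53,Team Y,Ground 2,2 Feb 1979
10.0,1,30,2,3.00,Team Y,Ground 2,2 Feb 1979
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
totals, wickets_by_year = summarize(rows)
print(totals)
print(wickets_by_year)
```

To run this against a real export, replace the `io.StringIO(SAMPLE_CSV)` argument with an open file handle for the player's CSV. The per-year wicket counts give a rough view of consistency over a career.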
23. REFERENCES
What is Cricket? Get to know the sport
Introduction to Cricket – 1
Introduction to Cricket – 2
Cricket: The Basic Introduction
Cricket for Americans
ESPN's Legends of Cricket - The Top 25 Cricketers Of All Time
ESPN Legends of Cricket : Sir Ian Botham
ESPN Legends of Cricket : Imran Khan
ESPN Legends of Cricket : Kapil Dev
ESPN Legends of Cricket : Sir Richard Hadlee
24. REFERENCES
ESPNCRICINFO
StatsGuru
Sample Test Match Scorecard
The R Project for Statistical Computing
RStudio (Optional)
CRAN – Package cricketr
Installing Apache Spark and Python – Sundog Education
cricketr – getPlayerData
Code