Naribole
Dec 14, 2016
/r/Soccer Activity and User Distribution Analysis
Project Web Scraping
Sharan Naribole
/r/Soccer Introduction
• Large community
o Over 500,000 subscribers globally
o Hundreds of posts discussed daily
o Goals; pre-match, post-match and live match discussion; news articles, etc.
/r/Soccer Flairs
• User Flair
o Team crest appears beside the username on all comments
o Provides an additional dimension of context to the comments
Objective
To scrape and analyze /r/soccer top posts for:
a) user flair distribution and
b) its relationship with comment activity, submission score and submission type
Outline
Data Collection
Data Processing
Data Analysis
Data Collection
• Every post in the top 1000 posts from Nov 12, 2016 to Dec 12, 2016
o Constraint: Minimum of 100 comments
o Matches occur on a weekly basis!
o Over 500 posts/submissions met the constraints
• Per-submission Information
o Title
o Submission score (~= Upvotes - Downvotes)
o Number of comments
o Unique user-flair mapping in top 500 comments
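The selection rule and the per-submission record above can be sketched in plain Python. This is a minimal sketch, not the author's code: the `Submission` class, its field names, and the choice to skip commenters without a flair are assumptions made for illustration.

```python
from dataclasses import dataclass, field

MIN_COMMENTS = 100        # constraint stated on the slide
MAX_COMMENTS_READ = 500   # only the top 500 comments are inspected


@dataclass
class Submission:
    """Per-submission record; the field names are hypothetical."""
    title: str
    score: int
    num_comments: int
    # (username, flair) pairs in the order the comments were scraped
    comments: list = field(default_factory=list)


def meets_constraint(sub):
    """True if the submission has at least the minimum number of comments."""
    return sub.num_comments >= MIN_COMMENTS


def user_flair_mapping(sub):
    """Map each commenter to one flair, using at most the top 500 comments."""
    mapping = {}
    for user, flair in sub.comments[:MAX_COMMENTS_READ]:
        if flair:  # assumption: commenters without a flair are skipped
            mapping.setdefault(user, flair)  # count each user only once
    return mapping
```

Using a dictionary keyed by username keeps the mapping unique per user, so a user who comments many times in one thread is counted once.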
Data Collection Tools
• Scrapy Crawl Spider
o /r/soccer home page sorted by top
o Comments page of each submission
• SelectorGadget
Per-Submission Computation
• Flair Diversity
o Number of unique flairs
• Percentage share per flair
o 100 × (number of commenters with a flair) / (total commenter-flair mappings)
• Top percentage share
o Highest percentage share among all the flairs
• Comments
o Language processing to find total number of comments
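The three flair measures defined above can be computed from a single user-to-flair mapping. The following is a minimal sketch under that assumption; the function name `flair_metrics` and the shape of its input are hypothetical, not taken from the slides.

```python
from collections import Counter


def flair_metrics(user_flair):
    """Return (diversity, shares, top_share) for one submission.

    user_flair: dict mapping username -> flair, one entry per commenter.
    """
    counts = Counter(user_flair.values())
    total = sum(counts.values())          # total commenter-flair mappings
    diversity = len(counts)               # number of unique flairs
    # Percentage share per flair: 100 * commenters with the flair / total
    shares = {f: 100.0 * n / total for f, n in counts.items()} if total else {}
    top_share = max(shares.values(), default=0.0)
    return diversity, shares, top_share
```

For example, a thread with two Arsenal commenters, one Chelsea commenter and one Barcelona commenter has a diversity of 3 and a top share of 50%.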
Computation Framework
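• Per-submission table: one row per submission (Submission 1, Submission 2, Submission 3, …)
o Columns: Title, Flair Diversity, Top Share, Comments
• Flair-share table: one row per flair (Flair 1, Flair 2, Flair 3, …)
o Columns: one per submission (Submission 1, Submission 2, …)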
• Pandas DataFrames
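A minimal sketch of how the two tables might be held as pandas DataFrames. The variable names and all numbers here are illustrative placeholders, not data from the project.

```python
import pandas as pd

# Hypothetical per-submission flair shares (percent); values are illustrative only.
shares_by_submission = {
    "Submission 1": {"Flair 1": 50.0, "Flair 2": 30.0, "Flair 3": 20.0},
    "Submission 2": {"Flair 1": 40.0, "Flair 3": 60.0},
}
comments = {"Submission 1": 320, "Submission 2": 150}  # illustrative counts

# Flair-share table: rows are flairs, columns are submissions.
# A flair that is absent from a submission gets a share of 0.
flair_shares = pd.DataFrame(shares_by_submission).fillna(0.0)

# Per-submission table: one row per submission.
summary = pd.DataFrame({
    "Flair Diversity": (flair_shares > 0).sum(axis=0),
    "Top Share": flair_shares.max(axis=0),
    "Comments": pd.Series(comments),
})
summary.index.name = "Title"
```

Deriving the per-submission columns from the flair-share table keeps the two tables consistent with each other.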
Data Analysis
Diversity - Comments - Score Relationship
Submission Type Analysis
Flair Distribution
Diversity - Comments - Score Relationship
• Comments increase with Diversity
• Submission Score increases with Diversity
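One way to quantify trends like these is a Pearson correlation coefficient between flair diversity and comments, or between flair diversity and score. The slides do not say which statistic was used, so the sketch below is an assumption, and the helper name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

A value near +1 would match the "increases with Diversity" reading, while a value near 0 would indicate no linear relationship.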
Submission Type Analysis
• Goal video submission
o Posted in near real time, with high-quality goals getting the top scores
o Hypothesis:
★ High flair diversity, as goals are discussed for their quality
★ Comments proportional to score
• Match Thread, Post-Match Thread submissions
o Discussion during and after live match
o Hypothesis:
★ Low flair diversity, as the teams taking part in the match are expected to have a high share
★ High number of comments, as users comment on various events, not just goals
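Comparing these submission types requires labelling each submission first. The slides do not state how this was done; the sketch below assumes a simple title-based heuristic, and both the function name `submission_type` and the keyword rules are hypothetical.

```python
def submission_type(title):
    """Classify a submission by its title (heuristic, not the author's rule)."""
    # Normalise case and hyphens so "Post-Match" and "Post Match" are treated alike.
    t = title.lower().replace("-", " ")
    if t.startswith("post match thread"):
        return "Post-Match Thread"
    if t.startswith("match thread"):
        return "Match Thread"
    if "goal" in t:
        return "Goal"
    return "Other"
```

The substring test for goals is deliberately crude: a title containing "goalkeeper" would also be labelled "Goal", so a real run would need a stricter rule.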
Submission Type - Comments Analysis
• As expected, Match Threads receive more comments
Submission Type - Flair Diversity Analysis
• Unexpectedly, Match Threads have higher flair diversity
• Match Threads among the top posts are more likely to be discussed by fans of other clubs
Submission Type - Score Analysis
• Match and Post-Match Threads are most active during and just after a match
• Goal submissions are rated for the quality of the goal, so their score keeps increasing over time
Flair Distribution
• English Premier League clubs lead the /r/soccer table!
Conclusion
• Scraped, visualized and analyzed /r/soccer top posts from the past month
★ Flair Diversity, Score and Top Flair Share relationships
★ Submission type-analysis
★ Flair distributions

Scraping Soccer Popularity on Reddit
