More than a reddit N-gram viewer
JERRY PRAWIHARJO
INSIGHT DATA ENGINEERING FELLOW
nerddit
Motivation
N-grams:
Allow data scientists to do topic-trend analysis and language analysis
Allow for a “type-ahead” feature
Subreddit network graph:
[Diagram: subreddits SR1, SR2, SR3 linked through users U]
1-grams
My name is Jerry
“My” “name” “is” “Jerry”
2-grams
My name is Jerry
“My name” “name is” “is Jerry”
3-grams
My name is Jerry
“My name is” “name is Jerry”
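The three examples above can be reproduced with a sliding window over the words of a sentence. This is a minimal sketch, assuming plain whitespace tokenization; the function name `ngrams` is illustrative and not taken from the slides.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as space-joined strings.

    Assumes words are separated by whitespace (no punctuation handling).
    """
    words = text.split()
    # Slide a window of n words across the sentence.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "My name is Jerry"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```

A sentence of w words yields w - n + 1 n-grams, so the four-word example gives 4, 3, and 2 items respectively.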
Pipeline
[Figure: pipeline diagram annotated with cluster sizes, costs, and data volumes]
6x m4.xlarge: $1.43/hour
5x m4.xlarge: $28.7/day
t2.micro: free
~10GB
10/2007-12/2015, >1TB uncompressed
4x m4.large: $11.5/day
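The cost figures above mix hourly and daily rates. As a rough sketch, they can be put on a common per-day basis, assuming (this is not stated on the slide) that every cluster runs around the clock and at the same time.

```python
HOURS_PER_DAY = 24

# Figures copied from the pipeline slide; the hourly one is converted to daily.
# Assumption: each cluster is up 24 hours a day.
daily_costs = {
    "6x m4.xlarge": 1.43 * HOURS_PER_DAY,
    "5x m4.xlarge": 28.7,
    "4x m4.large": 11.5,
    "t2.micro": 0.0,
}

total_per_day = round(sum(daily_costs.values()), 2)
```

Under those assumptions the 6-node cluster is the largest single cost item.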
Reddit Statistics
Year  Date        Comments  Unique authors  Unique subreddits
2015  2015-12-01  10000     25000           60000
2014  2014-12-01  50000     35000           40000
N-gram
Time series n-grams:
Ngram            Date     N  Count  Percentage
Hallows          2011-04  1  10     0.1
Deathly Hallows  2011-04  2  50     0.1
N-gram cluster against subreddits:
Ngram            N  Subreddit  Count (counter type)
Hallows          1  movies     1000
Deathly Hallows  2  movies     5000
Input record:
(“2011-04”, [“old”, “lady”, …, “Deathly”, “Hallows”, …], “movies”)
Key-value pairs emitted per word:
(“2011-04::old::movies”, 1)
(“2011-04::lady::movies”, 1)
…
(“2011-04::Deathly::movies”, 1)
After summing by key:
(“2011-04::old::movies”, 10)
(“2011-04::lady::movies”, 5)
…
(“2011-04::Deathly::movies”, 2)
Job took ~2 days to complete
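The counting flow above follows the usual map-then-sum-by-key pattern. The sketch below imitates it in plain Python rather than Spark, so it runs without a cluster; the helper names `emit_pairs` and `sum_by_key` and the sample records are hypothetical.

```python
from collections import Counter


def emit_pairs(record):
    """Map step: turn one (month, words, subreddit) record into (key, 1) pairs."""
    month, words, subreddit = record
    return [("::".join((month, word, subreddit)), 1) for word in words]


def sum_by_key(pairs):
    """Reduce step: add up the counts of pairs that share a key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up sample records in the shape shown on the slide.
records = [
    ("2011-04", ["old", "lady", "Deathly", "Hallows"], "movies"),
    ("2011-04", ["old", "Deathly"], "movies"),
]
pairs = [pair for record in records for pair in emit_pairs(record)]
counts = sum_by_key(pairs)
```

Packing month, n-gram, and subreddit into one `::`-delimited key lets a single sum-by-key pass produce counts at that granularity; coarser views would then need a further aggregation over those keys.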
Regex filters: URLs, image links, Unicode characters
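A minimal sketch of such filters follows. The patterns are assumptions, not the ones used in the project: image links are treated as ordinary URLs, and the Unicode filter is read as dropping every non-ASCII character.

```python
import re

# Hypothetical patterns; the slides do not show the actual expressions.
URL_PATTERN = re.compile(r"https?://\S+")          # also covers image links
NON_ASCII_PATTERN = re.compile(r"[^\x00-\x7F]+")   # any run of non-ASCII characters


def clean_comment(text):
    """Strip URLs and non-ASCII characters, then collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_ASCII_PATTERN.sub(" ", text)
    return " ".join(text.split())


cleaned = clean_comment("See http://i.imgur.com/abc.jpg \u2603 great movie")
```

Running the filters before tokenization keeps URL fragments out of the n-gram counts.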
Subreddits Graph
Year  node1     node2
2011  movies    {politics: 10, games: 5, …}
2014  politics  {games: 3, conservative: 2, …}

Year  Distinct authors  Subreddit  Comments
2011  TheOceldoc        movies     100
2011  JohnDoe           politics   200
(TheOceldoc, (movies, politics, games, …))
(JohnDoe, (politics, conservative, …))
(“movies::politics”, 10)
(“movies::games”, 5)
(“politics::games”, 3)
…
(“politics::conservative”, 2)
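One way to get from the per-author subreddit lists to the pair counts is to emit every pair of subreddits an author commented in and count how often each pair occurs. The sketch below assumes that reading (the pair count equals the number of shared authors); the function name and the alphabetical ordering inside each key are choices made here, not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def subreddit_edges(author_subreddits):
    """Count, for each pair of subreddits, how many authors posted in both."""
    edges = Counter()
    for subreddits in author_subreddits.values():
        # Sort so that each unordered pair always produces the same key.
        for a, b in combinations(sorted(set(subreddits)), 2):
            edges[f"{a}::{b}"] += 1
    return edges


authors = {
    "TheOceldoc": ["movies", "politics", "games"],
    "JohnDoe": ["politics", "conservative"],
}
edges = subreddit_edges(authors)
```

An author active in k subreddits contributes k(k-1)/2 pairs, so very prolific authors dominate the number of emitted pairs.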
Edge weight
Filter degree < 100
Clustering
Force Atlas 2 layout
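The slide does not say whether "Filter degree < 100" drops or keeps the low-degree nodes. The sketch below assumes they are dropped before layout, and uses a threshold of 2 only because the sample graph is tiny; `filter_by_degree` is a hypothetical helper.

```python
from collections import defaultdict


def filter_by_degree(edges, min_degree):
    """Keep only edges whose two endpoints both have at least `min_degree` neighbours."""
    neighbours = defaultdict(set)
    for key in edges:
        a, b = key.split("::")
        neighbours[a].add(b)
        neighbours[b].add(a)
    keep = {node for node, adjacent in neighbours.items() if len(adjacent) >= min_degree}
    return {
        key: weight
        for key, weight in edges.items()
        if all(node in keep for node in key.split("::"))
    }


# Edge weights copied from the slide's example.
sample_edges = {
    "movies::politics": 10,
    "movies::games": 5,
    "politics::games": 3,
    "politics::conservative": 2,
}
filtered = filter_by_degree(sample_edges, min_degree=2)
```

Degrees are computed once on the full graph, so this is a single-pass filter rather than an iterative pruning.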
Spark Tuning
[Bar chart: run time in minutes (axis 0 to 18) for cases A, B, C, D]
Case  RDD compress  Kryo
A     FALSE         FALSE
B     TRUE          FALSE
C     TRUE          TRUE
D     FALSE         TRUE
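The four cases can be written down as Spark configuration values. The sketch assumes that "Rdd Compress" refers to the `spark.rdd.compress` setting and that "Kryo" means setting `spark.serializer` to the Kryo serializer instead of the default Java one; the helper functions are illustrative and only build the settings, they do not start Spark.

```python
KRYO = "org.apache.spark.serializer.KryoSerializer"
JAVA = "org.apache.spark.serializer.JavaSerializer"  # Spark's default serializer

# Cases as listed in the table on the slide.
CASES = {
    "A": {"rdd_compress": False, "kryo": False},
    "B": {"rdd_compress": True, "kryo": False},
    "C": {"rdd_compress": True, "kryo": True},
    "D": {"rdd_compress": False, "kryo": True},
}


def spark_conf(case):
    """Return the Spark settings for one tuning case as a dict of strings."""
    settings = CASES[case]
    return {
        "spark.rdd.compress": str(settings["rdd_compress"]).lower(),
        "spark.serializer": KRYO if settings["kryo"] else JAVA,
    }


def submit_flags(case):
    """Render the settings as `--conf key=value` flags for spark-submit."""
    return " ".join(f"--conf {key}={value}" for key, value in spark_conf(case).items())


flags_c = submit_flags("C")
```

Keeping the cases in one table makes it easy to rerun the same job under each combination and compare run times.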
Jerry Prawiharjo
PhD in Optoelectronics from Southampton, England
◦ Distributed computation on Beowulf cluster (MPI)
Product Development Engineer at Neophotonics
◦ Test software development and data analysis
Senior Test Development Engineer at Cisco
◦ Test station development (hardware and software) for 100G transceiver module
Back Up
Challenges
Sheer amount of data: >1TB
◦ Scoping the project: monthly time buckets (as opposed to daily or weekly)
◦ Filter out foreign-language subreddits
◦ Spark tuning
S3 rate limit: process data on a file-by-file basis
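The monthly bucketing mentioned above can be sketched as a conversion from a comment timestamp to a "YYYY-MM" key, matching the "2011-04" keys used in the n-gram slides. It assumes each comment carries a Unix timestamp in seconds (UTC); the function name `month_bucket` is hypothetical.

```python
from datetime import datetime, timezone


def month_bucket(created_utc):
    """Map a Unix timestamp (seconds, UTC) to its 'YYYY-MM' bucket."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")


# 1301616000 is 2011-04-01 00:00:00 UTC.
bucket = month_bucket(1301616000)
```

Monthly keys give roughly 30 times fewer distinct time buckets than daily ones, which is one way the scoping reduces the size of the aggregated output.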

Nerddit Demo Presentation
