More than a reddit N-gram viewer
JERRY PRAWIHARJO
INSIGHT DATA ENGINEERING FELLOW
nerddit
Motivation
N-grams:
Allow data scientists to analyze topic trends and language use
Enable a “type-ahead” feature
Subreddits network graph:
[Diagram: subreddit nodes SR1, SR2, SR3 connected to nodes labeled U (presumably users)]
1-grams
My name is Jerry
“My” “name” “is” “Jerry”
2-grams
My name is Jerry
“My name”
“name is”
“is Jerry”
3-grams
My name is Jerry
“My name is”
“name is Jerry”
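The 1-, 2- and 3-gram examples can be reproduced with a few lines of Python. This is a minimal sketch that assumes plain whitespace tokenization; the function name `ngrams` is illustrative and not taken from the project.

```python
def ngrams(text, n):
    """Return the n-grams of text as space-joined strings (whitespace tokenization)."""
    tokens = text.split()
    # Slide a window of n tokens across the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "My name is Jerry"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```

A sentence with k tokens yields k - n + 1 n-grams, so the four-word example gives four 1-grams, three 2-grams and two 3-grams.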
Pipeline
[Pipeline diagram; component names not recoverable. Surviving annotations:]
6x m4.xlarge: $1.43/hour
5x m4.xlarge: $28.7/day
4x m4.large: $11.5/day
t2.micro: free
~10GB
10/2007-12/2015, >1TB uncompressed
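As a quick consistency check on the cluster costs, the quoted figures can be converted to an implied hourly rate per instance. This is only arithmetic on the slide's own numbers; the pairing of each cost with the instance group listed next to it is an assumption based on their order on the slide.

```python
# (instance count, quoted cost in USD, hours the quoted cost covers)
# Pairing of cost and instance group is assumed from the slide's ordering.
clusters = {
    "6x m4.xlarge": (6, 1.43, 1),
    "5x m4.xlarge": (5, 28.7, 24),
    "4x m4.large": (4, 11.5, 24),
}


def hourly_rate_per_instance(count, cost, hours):
    """Implied USD per instance per hour."""
    return cost / (count * hours)


rates = {name: round(hourly_rate_per_instance(*spec), 3) for name, spec in clusters.items()}
for name, rate in rates.items():
    print(name, rate)
```

Both m4.xlarge groups come out at roughly $0.24 per instance-hour, which suggests the hourly and daily figures describe the same pricing.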
Reddit Statistics
Year  Date        Comments  Unique authors  Unique subreddits
2015  2015-12-01  10000     25000           60000
2014  2014-12-01  50000     35000           40000
N-gram
Time series n-grams
Ngram            Date     N  Count  Percentage
Hallows          2011-04  1  10     0.1
Deathly Hallows  2011-04  2  50     0.1
N-gram cluster against subreddits
Ngram            N  Subreddit  Count (counter type)
Hallows          1  movies     1000
Deathly Hallows  2  movies     5000
(“2011-04”, [“old”, “lady”, …, “Deathly”, “Hallows”, …], “movies”)
(“2011-04::old::movies”, 1)
(“2011-04::lady::movies”, 1)
…
(“2011-04::Deathly::movies”, 1)
(“2011-04::old::movies”, 10)
(“2011-04::lady::movies”, 5)
…
(“2011-04::Deathly::movies”, 2)
Job took ~2 days to complete
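The tuples above read like an emit-then-sum pattern: each comment produces one ("month::ngram::subreddit", 1) pair per n-gram, and pairs with the same key are then added up. The sketch below imitates that in plain Python with a `Counter`; the function names and sample records are hypothetical, and in Spark the same idea would typically be written with `flatMap` followed by `reduceByKey`.

```python
from collections import Counter


def emit_keys(month, tokens, subreddit, n=1):
    """Yield one ("month::ngram::subreddit", 1) pair per n-gram in a comment."""
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        yield (f"{month}::{ngram}::{subreddit}", 1)


def count_ngrams(records, n=1):
    """Sum the emitted 1s per key, standing in for a reduce-by-key step."""
    counts = Counter()
    for month, tokens, subreddit in records:
        for key, one in emit_keys(month, tokens, subreddit, n):
            counts[key] += one
    return counts


# Hypothetical sample records: (month bucket, tokens, subreddit).
records = [
    ("2011-04", ["old", "lady", "Deathly", "Hallows"], "movies"),
    ("2011-04", ["Deathly", "Hallows", "old"], "movies"),
]
counts = count_ngrams(records)
print(counts["2011-04::old::movies"])
```

Packing month, n-gram and subreddit into one string key keeps the reduce step a simple sum, at the cost of having to split the key again when writing the result tables.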
Regex filters
URLs, image links, Unicode characters
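A regex-based cleaning step of this kind might look like the sketch below. The slide does not give the actual expressions, so both patterns are hypothetical, and reading "unicodes" as "non-ASCII characters" is an assumption.

```python
import re

# Hypothetical patterns; the original expressions are not shown on the slide.
URL_RE = re.compile(r"https?://\S+")         # URLs, including links to images
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # assumed meaning of "unicodes"


def clean_comment(text):
    """Strip URLs and non-ASCII runs, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    return " ".join(text.split())


print(clean_comment("Deathly Hallows ★ http://example.com/poster.jpg was great"))
```

Filtering before tokenization keeps link fragments and symbol runs from showing up as spurious n-grams.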
Subreddits Graph
Year  node1     node2
2011  movies    {politics: 10, games: 5, …}
2014  politics  {games: 3, conservative: 2, …}
Year  Distinct authors  Subreddit  Comments
2011  TheOceldoc        movies     100
2011  JohnDoe           politics   200
(TheOceldoc, (movies, politics, games, …))
(JohnDoe, (politics, conservative, …))
(“movies::politics”, 10)
(“movies::games”, 5)
(“politics::games”, 3)
…
(“politics::conservative”, 2)
Edge weight
Filter degree < 100
Clustering
Force Atlas 2 layout
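One way to read the tuples on this slide is that two subreddits are linked whenever the same author comments in both, with the edge weight counting such authors. The sketch below implements that reading in plain Python; it is an assumption rather than the confirmed method, and the function name and sample data are illustrative.

```python
from collections import Counter
from itertools import combinations


def subreddit_edges(author_subreddits):
    """Count, for each pair of subreddits, the authors active in both."""
    edges = Counter()
    for author, subs in author_subreddits.items():
        # Sort so each pair gets one canonical key regardless of input order.
        for a, b in combinations(sorted(set(subs)), 2):
            edges[f"{a}::{b}"] += 1
    return edges


# Hypothetical sample mirroring the author rows on the slide.
authors = {
    "TheOceldoc": ["movies", "politics", "games"],
    "JohnDoe": ["politics", "conservative"],
}
edges = subreddit_edges(authors)
for key, weight in sorted(edges.items()):
    print(key, weight)
```

Because the pair is sorted, the key for movies and games comes out as "games::movies" rather than "movies::games"; any fixed ordering works as long as it is applied consistently.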
Spark Tuning
[Bar chart: time (minutes, axis 0 to 18) for cases A, B, C, D]
Case  RDD compress  Kryo
A     FALSE         FALSE
B     TRUE          FALSE
C     TRUE          TRUE
D     FALSE         TRUE
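The four cases correspond to two standard Spark settings: `spark.rdd.compress` and `spark.serializer` set to the Kryo serializer class. The sketch below only builds the key-value pairs for each case so it runs without a Spark installation; the helper name `spark_settings` is hypothetical, and how the original job applied the settings is not shown on the slide.

```python
KRYO = "org.apache.spark.serializer.KryoSerializer"

# Case -> (RDD compress, Kryo), as in the table on the slide.
CASES = {
    "A": (False, False),
    "B": (True, False),
    "C": (True, True),
    "D": (False, True),
}


def spark_settings(case):
    """Return Spark config key-value pairs for one tuning case."""
    rdd_compress, kryo = CASES[case]
    settings = [("spark.rdd.compress", str(rdd_compress).lower())]
    if kryo:
        # Without this entry Spark keeps its default Java serializer.
        settings.append(("spark.serializer", KRYO))
    return settings


for case in CASES:
    print(case, spark_settings(case))

# With PySpark installed, the pairs could be applied roughly like this:
#   from pyspark import SparkConf
#   conf = SparkConf().setAll(spark_settings("C"))
```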
Jerry Prawiharjo
PhD in Optoelectronics from Southampton, England
◦ Distributed computation on Beowulf cluster (MPI)
Product Development Engineer at Neophotonics
◦ Test software development and data analysis
Senior Test Development Engineer at Cisco
◦ Test station development (hardware and software) for 100G transceiver module
Back Up
Challenges
Sheer amount of data: >1TB
◦ Scoping the project: monthly time bucket (as opposed to daily or weekly)
◦ Filter foreign-language subreddits
◦ Spark tuning
S3 rate limit: process data on a file-by-file basis
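The monthly time bucket mentioned above amounts to truncating each comment's timestamp to a "YYYY-MM" key, which matches the "2011-04" style used in the n-gram tables. This is a minimal sketch that assumes each comment carries a Unix timestamp in seconds (UTC); the function name `month_bucket` is hypothetical.

```python
from datetime import datetime, timezone


def month_bucket(created_utc):
    """Map a Unix timestamp (seconds, UTC) to a "YYYY-MM" bucket key."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")


# 1301616000 is 2011-04-01 00:00:00 UTC.
print(month_bucket(1301616000))
```

Monthly keys give at most 12 buckets per year per n-gram, compared with about 52 for weekly and 365 for daily buckets, which keeps the number of distinct keys much smaller.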

Nerddit Demo Presentation
