Subreddit Subcultures
Insight Data Engineering Fellowship, Silicon Valley
David Lyon
Find your Reddit Subculture
2007 - Impersonal Web
2017 - Personal Web
Reddit Comment Dataset
2 billion comments
1 million subreddits
Personalization of Reddit Over Time
Reddit Clustering App
https://youtu.be/XHczo0TM17E
Data Pipeline
Ingestion / Processing User Interface
Challenge 1: Data Size
Every month on Reddit: 60k subreddits, 3 million unique authors
● Reddit is too big to cluster directly!
● The raw clustering matrix has 200 billion elements.
Solution 1a: Filtering
Every month on Reddit: 6k active subreddits, 30k active authors
● Filter for activity: at least 100 comments/month
● Active clustering matrix has 200 million elements
● Now 1000 times faster to cluster
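The activity filter can be sketched in a few lines. The slide gives only the threshold of 100 comments/month; applying it to both authors and subreddits, and the name `filter_active`, are assumptions made for illustration.

```python
from collections import Counter

def filter_active(comments, min_comments=100):
    """Keep comments whose author and subreddit both reach min_comments.

    comments: (author, subreddit) pairs for one month (hypothetical layout).
    """
    comments = list(comments)
    per_author = Counter(author for author, _ in comments)
    per_subreddit = Counter(sub for _, sub in comments)
    return [
        (author, sub)
        for author, sub in comments
        if per_author[author] >= min_comments
        and per_subreddit[sub] >= min_comments
    ]

# Toy month: one busy author in one busy subreddit, plus a one-off comment.
month = [("alice", "nfl")] * 120 + [("bob", "tinysub")]
active = filter_active(month)
print(len(active))
```

Only the 120 comments from the busy author survive; the one-off comment is dropped before the clustering matrix is built.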
Solution 1b: PCA
Every month on Reddit: 6k active subreddits, 300 shared interests
● PCA transforms author space to shared interest space by finding correlations
● PCA shrinks dimensionality by another 100 times
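The matrix sizes quoted on these slides follow from multiplying the rounded counts. A quick check, using only the numbers given in the deck:

```python
raw = 60_000 * 3_000_000   # subreddits x unique authors
active = 6_000 * 30_000    # after the activity filter
reduced = 6_000 * 300      # after PCA down to 300 shared interests

print(raw, active, reduced)               # 180000000000 180000000 1800000
print(raw // active, active // reduced)   # 1000 100
```

The products come to 180 billion and 180 million, which the slides round to 200 billion and 200 million; the reduction factors of 1000 and 100 match the slides exactly.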
Challenge 2: Slow PCA
Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers.
PCA scales as O(MI), where M is the number of matrix elements and I is the number of interests after PCA.
PCA takes over 80% of the total time!
Solution 2: Random PCA
Use Facebook Research's Random PCA (FBPCA, 2014) on a single node.
FBPCA is O(M ln(I)).
For 250 interests, FBPCA is 45 times faster! One FBPCA worker is 5x faster than 9 full PCA workers for an average-sized month.
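The idea behind randomized PCA can be sketched with NumPy. This is a minimal illustration of the random-projection approach, not FBPCA's actual code; the function name `randomized_pca` and the parameters `oversample` and `n_iter` are assumptions, and the input is assumed to be already normalized.

```python
import numpy as np

def randomized_pca(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k singular triplets of A via a random projection."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = min(k + oversample, m, n)
    # Project A onto l random directions and orthonormalize the result.
    Q = np.linalg.qr(A @ rng.standard_normal((n, l)))[0]
    # A few power iterations sharpen the captured subspace.
    for _ in range(n_iter):
        Q = np.linalg.qr(A.T @ Q)[0]
        Q = np.linalg.qr(A @ Q)[0]
    # Exact SVD of the small l x n projected matrix.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Toy check on an exactly rank-5 matrix (made-up data).
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 60))
U, s, Vt = randomized_pca(A, k=5)
print(U.shape, s.shape, Vt.shape)
```

The expensive decomposition runs on the small projected matrix `B` rather than on `A` itself, which is where the savings over a full PCA come from.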
Challenge 3: Finding K for K-Means Clustering
Number of clusters is not the same as number of PCA shared interests
Clustering can happen on more than one scale
Football, Baseball, TV, Movies
Solution 3: Silhouette Analysis
Silhouette Analysis reveals clustering scale at small k
Also reveals a second clustering scale of around 400 clusters in this case
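For reference, the silhouette score of a clustering can be computed as below. This is the standard definition, not necessarily the implementation used in the project; the function name and the toy points are hypothetical.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette coefficient of a clustering (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)

# Two tight, well-separated groups: a sensible and a poor labeling.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

Running K-means for a range of k and plotting this score against k is one way to spot the scales at which the data clusters well.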
A Happy Medium
Too impersonal Too personalized
David Lyon
PhD Physics from the University of Illinois
Doing GPU simulations
I love hiking, table tennis, and astrophysics
Next Steps - Random PCA for Spark.ml
Step 1: Learn Scala!
Step 2: Contribute to Open Source community
Step 3: Streaming Random PCA?
Next Steps - Popular Topics by Cluster
Find the popular topics within each cluster using Term-Frequency Inverse-Document-Frequency (TF-IDF) or LDA
Terms are 1-grams and 2-grams used in each cluster, and the document frequency is over all of Reddit for that month.
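A minimal TF-IDF sketch over 1-grams and 2-grams follows. It assumes each cluster's combined text is treated as one document and that document frequency is counted across all clusters for the month; the function names and sample texts are hypothetical.

```python
from collections import Counter
from math import log

def ngrams(text, max_n=2):
    """Yield the 1-grams and 2-grams of a whitespace-tokenized text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def top_terms(cluster_texts, top=3):
    """Rank each cluster's terms by TF-IDF, one document per cluster."""
    counts = {name: Counter(ngrams(text)) for name, text in cluster_texts.items()}
    # Number of clusters in which each term appears at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(counts)
    ranked = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores = {t: (f / total) * log(n_docs / doc_freq[t]) for t, f in c.items()}
        # Highest score first; ties broken alphabetically.
        ranked[name] = sorted(scores, key=lambda t: (-scores[t], t))[:top]
    return ranked

clusters = {
    "sports": "the game was a great game",
    "film": "the movie was a great movie",
}
ranked = top_terms(clusters)
print(ranked)
```

Terms shared by every cluster, such as "the" and "was", score zero, so only cluster-specific terms rise to the top.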
Challenge 2:
Every month on Reddit: 6k active subreddits, 30k active authors
● Too many individual authors
● Need to cluster by shared interests, not author
Random PCA
Complexity of PCA is O(mnk) for m rows, n input columns, k output columns
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions (Halko, Martinsson, and Tropp, 2009)
Fast Randomized SVD (Facebook Research, 2014)
Complexity of Random PCA is O(mn ln(k))
For k=100, Random PCA is more than 20x faster!
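The quoted speedups follow from dividing the two complexities, which leaves k / ln(k) once constant factors are ignored (an idealization; real timings also depend on those constants). The helper name `speedup` is hypothetical.

```python
from math import log

def speedup(k):
    """Ratio of O(mnk) to O(mn ln k) cost, ignoring constant factors."""
    return k / log(k)

for k in (100, 250):
    print(k, round(speedup(k), 1))
# 100 21.7
# 250 45.3
```

This reproduces both figures in the deck: more than 20x for k = 100 and about 45x for 250 interests.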
Before PCA
Football 2 1
Baseball 3 1 15
TV 5 2 22
Movies 1 21 1 2
Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 Auth999,999 Auth1,000,000
After PCA
Sub       Sporting  Fictional  Political
Football  80        2          1
Baseball  90        3          2
TV        6         80         77
Movies    2         80         20
Anatomy of a Reddit Comment
Body, Author, Date, Subreddit
Group by Month
Group by Subreddit
Count #comments by author per subreddit
Normalize authors so each author has mean = 0 and variance = 1
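The counting and normalization steps above can be sketched as follows. It assumes each author's counts are standardized across the listed subreddits using the population variance; the function name and toy data are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def author_vectors(comments, subreddits):
    """Count comments per (author, subreddit), then standardize each author."""
    counts = Counter(comments)  # (author, subreddit) -> number of comments
    authors = sorted({author for author, _ in comments})
    vectors = {}
    for author in authors:
        row = [counts[(author, sub)] for sub in subreddits]
        mu, sigma = mean(row), pstdev(row)
        # Authors with identical counts everywhere get an all-zero vector.
        vectors[author] = [(x - mu) / sigma if sigma else 0.0 for x in row]
    return vectors

comments = [("alice", "nfl")] * 3 + [("alice", "movies")]
vectors = author_vectors(comments, ["nfl", "movies"])
print(vectors)
```

After this step a heavy commenter and a light commenter with the same relative interests produce the same vector, so prolific authors do not dominate the PCA.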
Growth in Number of Subreddits
40 subreddits
1 million subreddits
Week 4 Challenges
● Spark for iterative machine learning, because Spark can run MapReduce-style jobs in memory
● By reducing the dimension of data,
● No streaming: clustering requires lots of data and clusters change slowly, but the time window was reduced from monthly to daily
Clustering is Universal
● Galaxies cluster into superclusters of ~100k members. The red dot is our galaxy.
● Human knowledge is clustered: purple for physics, blue for chemistry, green for biology and medicine.
● The big blob to the upper left is Liberal Arts.
Subreddit Clustering
Monthly graph from 10k subreddits × 2 million authors = 20 billion matrix entries
Drastically reduce the size of data using Principal Component Analysis, normalized so that larger subreddits aren’t favored
Cluster in reduced dimensional space using K-means
Topics within Clusters based on relative frequency of 1-grams and
Social media brings us closer
Continual contact with over 1 billion people
We can find people who share our exact interests
...and separates us
● Less tolerance for differences: unfriend or ban from community!
● Online communities become bubbles isolated from each other
