SlideShare a Scribd company logo
1 of 10
Ranking User Influence in the Reddit Social Network
Adam James Costarino | Insight Data Science
Influence and Similarity Search
1
Technologies Hadoop, Spark, Parquet, Cassandra
2.5 Terabytes
Challenges
Data Size
Product
Motivations
Performing analysis on Social Network composed of over 100
Million distinct users and 1.2 Billion edges.
2
User Graph
Username : ‘adam’
Subreddits : [‘AskReddit’,‘history’…
Id : 30n42a
Username : ‘james
Subreddits : [‘AskReddit’, ‘politics’ …
Link_Id : t3_30n42a
Edge Orientation : Commenter → Poster
2017 Data Size:
40 million interactions per month
Over 4 million distinct users in monthly graph
3
Subreddit Graph
Edge Weight :
[Some Long]
There is no
directionality between
subreddit connections
Edge Weights are the number of intersecting users.
The Subreddit Graph is represented with an adjacency
matrix.
4
Data Pipeline
2.5
Terabytes
of Raw
JSON data
in S3
Features
extract
then
compress
to HDFS.
(50 – 1)
All complex
graph
creation and
processing is
performed in
Spark
Lots of data → Quick writes
to Cassandra
5
Solution: Use Scala mutable Map
Challenge 1: Storing Data in Cassandra Requiring Object
Parameterization
Module initialization
$ is used for third party
language tools. It is generally a
reserved identifier
6
Challenge 1: Storing Data in Cassandra Requiring Object
Parameterization
Solution: Write an Interface that extends map
Now Simply Use:
MapStringLong.class
7
Solution: Use .mapToPair to Map Scala mutable Map to Java Map
Challenge 1: Storing Data in Cassandra Requiring Object
Parameterization
8
Challenge 2: Creating PageRank Algorithm
Solution: Create my own PageRank
Initialize Ranks to 1
Number of Total Users (N)
Sum of the in coming PageRanks
divided by the number outgoing links
Dampening factor is set to 0.15 as
recommended by Stanford
9
• Physics
• Java Enthusiast
• Collegiate Skier in USCSA
• Other Interesting Projects
• Virtual Machine for LC-4 in C
• Compiler for J Programming Language in
C → LC-4

More Related Content

Similar to Insight presentation

Subreddit Subcultures
Subreddit SubculturesSubreddit Subcultures
Subreddit SubculturesDave Lyon
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
Management and analysis of social media data
Management and analysis of social media dataManagement and analysis of social media data
Management and analysis of social media dataWeining Qian
 
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...yashbheda
 
Using Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsUsing Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsNeo4j
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Symeon Papadopoulos
 
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworksAmal Targhi
 
2009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 2007
2009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 20072009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 2007
2009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 2007Marc Smith
 
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013Amazon Web Services
 
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczIoan Toma
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
 
Social Friend Overlying Communities Based on Social Network Context
Social Friend Overlying Communities Based on Social Network ContextSocial Friend Overlying Communities Based on Social Network Context
Social Friend Overlying Communities Based on Social Network ContextIRJET Journal
 
Analyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projectsAnalyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projectsMarco Brambilla
 
Profiling User Interests on the Social Semantic Web
Profiling User Interests on the Social Semantic WebProfiling User Interests on the Social Semantic Web
Profiling User Interests on the Social Semantic WebFabrizio Orlandi
 
Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014
Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014
Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014Trieu Nguyen
 
SPAR 2015 - Civil Maps Presentation by Sravan Puttagunta
SPAR 2015 - Civil Maps Presentation by Sravan PuttaguntaSPAR 2015 - Civil Maps Presentation by Sravan Puttagunta
SPAR 2015 - Civil Maps Presentation by Sravan PuttaguntaSravan Puttagunta
 

Similar to Insight presentation (20)

Subreddit Subcultures
Subreddit SubculturesSubreddit Subcultures
Subreddit Subcultures
 
GitConnect
GitConnectGitConnect
GitConnect
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
Management and analysis of social media data
Management and analysis of social media dataManagement and analysis of social media data
Management and analysis of social media data
 
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
 
0629venmoplus
0629venmoplus0629venmoplus
0629venmoplus
 
Using Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale AnalyticsUsing Graphs to Enable National-Scale Analytics
Using Graphs to Enable National-Scale Analytics
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
 
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworks
 
2009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 2007
2009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 20072009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 2007
2009 - Node XL v.84+ - Social Media Network Visualization Tools For Excel 2007
 
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
GraphLab: Large-Scale Machine Learning on Graphs (BDT204) | AWS re:Invent 2013
 
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczFOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Social Friend Overlying Communities Based on Social Network Context
Social Friend Overlying Communities Based on Social Network ContextSocial Friend Overlying Communities Based on Social Network Context
Social Friend Overlying Communities Based on Social Network Context
 
Analyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projectsAnalyzing rich club behavior in open source projects
Analyzing rich club behavior in open source projects
 
VenmoPlus demo week6
VenmoPlus demo week6VenmoPlus demo week6
VenmoPlus demo week6
 
Profiling User Interests on the Social Semantic Web
Profiling User Interests on the Social Semantic WebProfiling User Interests on the Social Semantic Web
Profiling User Interests on the Social Semantic Web
 
Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014
Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014
Reactive Reatime Big Data with Open Source Lambda Architecture - TechCampVN 2014
 
SPAR 2015 - Civil Maps Presentation by Sravan Puttagunta
SPAR 2015 - Civil Maps Presentation by Sravan PuttaguntaSPAR 2015 - Civil Maps Presentation by Sravan Puttagunta
SPAR 2015 - Civil Maps Presentation by Sravan Puttagunta
 

Recently uploaded

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Recently uploaded (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Insight presentation

  • 1. Ranking User Influence in the Reddit Social Network Adam James Costarino | Insight Data Science
  • 2. Influence and Similarity Search 1 Technologies Hadoop, Spark, Parquet, Cassandra 2.5 Terabytes Challenges Data Size Product Motivations Performing analysis on Social Network composed of over 100 Million distinct users and 1.2 Billion edges.
  • 3. 2 User Graph Username : ‘adam’ Subreddits : [‘AskReddit’,‘history’… Id : 30n42a Username : ‘james Subreddits : [‘AskReddit’, ‘politics’ … Link_Id : t3_30n42a Edge Orientation : Commenter → Poster 2017 Data Size: 40 million interactions per month Over 4 million distinct users in monthly graph
  • 4. 3 Subreddit Graph Edge Weight : [Some Long] There is no directionality between subreddit connections Edge Weights are the number of intersecting users. The Subreddit Graph is represented with an adjacency matrix.
  • 5. 4 Data Pipeline 2.5 Terabytes of Raw JSON data in S3 Features extract then compress to HDFS. (50 – 1) All complex graph creation and processing is performed in Spark Lots of data → Quick writes to Cassandra
  • 6. 5 Solution: Use Scala mutable Map Challenge 1: Storing Data in Cassandra Requiring Object Parameterization Module initialization $ is used for third party language tools. It is generally a reserved identifier
  • 7. 6 Challenge 1: Storing Data in Cassandra Requiring Object Parameterization Solution: Write an Interface that extends map Now Simply Use: MapStringLong.class
  • 8. 7 Solution: Use .mapToPair to Map Scala mutable Map to Java Map Challenge 1: Storing Data in Cassandra Requiring Object Parameterization
  • 9. 8 Challenge 2: Creating PageRank Algorithm Solution: Create my own PageRank Initialize Ranks to 1 Number of Total Users (N) Sum of the in coming PageRanks divided by the number outgoing links Dampening factor is set to 0.15 as recommended by Stanford
  • 10. 9 • Physics • Java Enthusiast • Collegiate Skier in USCSA • Other Interesting Projects • Virtual Machine for LC-4 in C • Compiler for J Programming Language in C → LC-4

Editor's Notes

  1. Steps to developing small molecule: Target validation and selection; chemical hit and lead generation; lead optimization to identify a clinical drug candidate; and finally hypothesis-driven, biomarker-led clinical trials