Near Real Time Analysis
of Web Scale Social Data
Nathan Halko, Ph.D, @nhalko
Data Scientist @SpotRight
New Algorithms for Complex Data
Friday, March 20, 2015
Santa Fe, New Mexico
Near Real Time
Analysis
Web Scale
Social Data
• publicly available user generated content
• connections between users, actions,
events (graph)
• data: meta, profile, demographic, etc
• Twitter, Pinterest, blogs, articles, etc
• everybody about everything
• US consumer population
• “actionable” profiles
• streaming computation
• searchable indexed results
• If a tree falls in the woods…
• collect, process, deliver
• web crawling, graph algorithms, Solr queries
• creation of a SpotRight ‘social profile’
Goals
1. Put in perspective some algorithms and how
they are used
2. Discuss architecture
3. Qua[nt,l]ify measures of performance
cosmic microwave Background
Me (then)
SpotRight (then) — SpotInfluence
• Randomized methods in Linear Algebra
• Prof. Gunnar Martinsson, CU Boulder
• Large scale implementation -> Hadoop, Mahout
• Experts in systems, architecture, coding
• Cutting edge technology, no legacy
• Big ideas: all pairs shortest path on graph of
everyone about everything
• Influence scoring, Google for people
Me (now)
• Data Scientist - good title to have on LinkedIn
• In reality - sys admin, code monkey, analyst, fix bugs,
rev code, monitoring
• Design complex algorithms that run at scale on
distributed systems based on what's cool.
SpotRight 10 second elevator pitch
*** from an engineer’s perspective ***
• Collect and organize information about
people on the web:
1. what they say
2. who they interact with
3. how they behave
• Provide a means to link online data to
traditional offline data. key=emailMd5
• Deliver timely aggregate and record level
data to our clients
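As a minimal sketch of the emailMd5 linking key mentioned above: the slides only name the key, so the normalization step here (trim and lowercase before hashing) and the names `EmailKey` and `emailMd5` are assumptions for illustration, not SpotRight's actual implementation.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.Locale

object EmailKey {
  // Assumption: addresses are trimmed and lowercased before hashing,
  // so that the same address always produces the same key.
  def emailMd5(email: String): String = {
    val normalized = email.trim.toLowerCase(Locale.ROOT)
    val digest = MessageDigest
      .getInstance("MD5")
      .digest(normalized.getBytes(StandardCharsets.UTF_8))
    // Render the 16 digest bytes as 32 lowercase hex characters.
    digest.map(b => f"${b & 0xff}%02x").mkString
  }
}
```

With a key like this, online profiles and offline records can be joined on a shared string without either side storing the raw address.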
SpotRight work flow
Deliver
Organize and
Link
Collect
• Crawl the web
• Twitter Graph Builder - 500M accounts
• Polite (crawl delay), Respectful (robots.txt) and
Onymous (Hi! I’m SpotBot/1.0)
• Identity resolution - Profile creation
• Many-to-many relationships
emailMd5 -> Urls -> SocialProfiles
• Distributed and local graph algorithms
• Spark and Solr
• $$$
Open Source Software
Graph Builder
Hadoop Map Reduce
Map HDFS
Fetcher
Async http
Parser
html, json, regex
Cassandra
Kafka
Reduce
Driver
Crawler
HighWaterQueue
Crawler
Links
Tweet Frequency
• Window of 20 tweets
• Predict the Nth tweet and revisit
• Classify accounts N=5,10,15
• Auto refresh
• Corner cases, errors, exceptions
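One way to read "predict the Nth tweet and revisit" is sketched below. It assumes the prediction is a simple average of the gaps in the recent window; the slides do not say which estimator is used, and the names `TweetSchedule`, `meanGapSeconds` and `nextVisit` are hypothetical.

```scala
object TweetSchedule {
  // timestamps: epoch seconds of the most recent tweets, oldest first
  // (for example the window of 20 tweets mentioned on the slide).
  def meanGapSeconds(timestamps: Seq[Long]): Option[Double] =
    if (timestamps.size < 2) None // not enough history to estimate a rate
    else Some((timestamps.last - timestamps.head).toDouble / (timestamps.size - 1))

  // Assumption: the account keeps tweeting at its recent average rate,
  // so the n-th future tweet is expected n mean gaps after the last one.
  def nextVisit(timestamps: Seq[Long], n: Int): Option[Long] =
    meanGapSeconds(timestamps).map(gap => timestamps.last + math.round(gap * n))
}
```

An account with too little history returns `None`, which is one of the corner cases a real scheduler would have to handle separately.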
Author
Follower
Following
@Mention
Retweet
Me Links
#Hashtag
• Directed weighted
• Heterogeneous
• Chronological
• 500M -> 250M -> 100M users
• 400M edges, 200M tweets per month
• Always running
• Extensible - networks with graph structure
• Scalable - Kraken
Profile Creation
Hadoop, Giraph, ScalaGraph
[Figure: graph diagrams with nodes labeled Me, You, and Shared; labels include "Every One We Know"]
• Shared nodes can be massive: KimKardashian, Orkut
• Direction/Presence of edge untrustworthy
• Whitelist networks to form seeds
• Heuristics based on business logic
• Over aggregation, black holes
https://twitter.com/jjshoutout
http://t.co/j4FrloXTZc http://www.shoutlet.com/
https://twitter.com/andredavidson
http://t.co/68mDP3dR0I http://www.qriously.com/
https://twitter.com/jenrjohn
http://t.co/noMxAuzd3D
https://twitter.com/aaroneverson https://twitter.com/djprohaska
https://twitter.com/ericjlance
http://t.co/l2oMuaXdNH
https://twitter.com/rjugs
http://t.co/erhbMRtfTe http://uk.linkedin.com/in/rjugessur/
https://twitter.com/swayinc
https://twitter.com/shoutlet http://twitter.com/shoutlet http://t.co/gOgCuiZZym
http://t.co/jPnD7equKJ http://t.co/NlhMO0TUWB
http://t.co/iIrDPHNCbo
https://twitter.com/alisonshoutlet
https://twitter.com/pirving
http://shoutlet.com/ https://twitter.com/robertarwel
http://www.monetate.com/ http://t.co/pYUo4m8BKx http://www.tagman.com/
https://twitter.com/meandurr
http://www.asos.com/ http://t.co/Zz6yyQf7wr
http://t.co/i5HpI9j9qL
http://pinterest.com/shoutlet
https://twitter.com/hershel_miller
http://t.co/ArEfyYD1ni http://www.madisonwisconsinliving.com/
https://twitter.com/davejfrito
http://t.co/z2si0afweo http://t.co/Z2sI0AFWeo
https://twitter.com/zachshoutlet
http://t.co/n6sXxcns5P
http://pinterest.com/ericjlance
https://twitter.com/brandonshoutlet
http://t.co/aPzCCL0K2Q
https://twitter.com/asaengel https://twitter.com/paulpucci2
http://t.co/gtCtEyM0Y9
https://twitter.com/joeshout https://twitter.com/geneshoutlet
http://t.co/Ic2wmpKXHD
https://twitter.com/stevenjheld https://twitter.com/stevevanegeren
https://twitter.com/iainmasson
http://t.co/nwE2O91Iz3
https://twitter.com/allison_hope_27
Examples
https://twitter.com/nhalko
http://www.strava.com/athletes/2642700 http://www.spotright.com/
http://spotright.com/
http://t.co/BovIYuLJ3h
https://twitter.com/jp_lind
http://www.linkedin.com/in/jplind http://t.co/ntUOWCX1bh
http://pinterest.com/heyrich
https://twitter.com/heyrich http://www.facebook.com/heyrich
http://www.spotinfluence.com/ http://www.gpo.co/ http://t.co/pm5Uzw0oDT https://twitter.com/spotright
http://t.co/o5n9lcVcwi https://twitter.com/graphmassive https://twitter.com/spot_dev
http://t.co/JRRYs8Vezr
https://twitter.com/edmessman
http://t.co/i6xJl8lJ7g
https://twitter.com/sptrght2
http://t.co/UWf5YNv95c
https://plus.google.com/108483248177909869830
http://www.talkrank.com/
https://twitter.com/richgrote
http://t.co/v7nkEuLXbF
https://twitter.com/ed_messman
http://t.co/wAP1dDMxOn
http://pinterest.com/spotright
https://twitter.com/sptrght
http://plancast.com/nhalko
http://plancast.com/heyrich
https://twitter.com/talkrank
http://t.co/npJ08gFUKy
https://twitter.com/spotinfluence
Nearest Neighbor
Index
Cassandra
Giraph - BSP - Connected Components
For each superstep:
send min id to neighbors
or decide to halt
Unique min id for CC
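The superstep rule above (send min id to neighbors, halt when nothing changes) can be sketched on a single machine with plain Scala collections. This is not Giraph code: it is a simplified simulation in which every vertex re-sends its label each round, and the names `MinIdComponents` and `connectedComponents` are hypothetical.

```scala
object MinIdComponents {
  // Edges are treated as undirected. The result maps each vertex to its
  // component id, which is the smallest vertex id in that component.
  def connectedComponents(vertices: Set[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    val neighbors: Map[Long, Set[Long]] =
      edges
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    // One superstep: every vertex takes the minimum of its own label
    // and the labels its neighbors held in the previous superstep.
    @annotation.tailrec
    def superstep(labels: Map[Long, Long]): Map[Long, Long] = {
      val next = labels.map { case (v, label) =>
        val incoming = neighbors.getOrElse(v, Set.empty[Long]).map(labels)
        v -> (incoming + label).min
      }
      if (next == labels) labels else superstep(next) // halt when stable
    }

    val allVertices = vertices ++ edges.flatMap { case (a, b) => Seq(a, b) }
    superstep(allVertices.map(v => v -> v).toMap)
  }
}
```

For example, edges (1,2), (2,3) and (4,5) yield labels 1, 1, 1, 4, 4. The number of supersteps grows with the longest shortest path in a component, which is one reason very large shared nodes matter in practice.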
Map
1. Assemble local graph
2. key by ccID
Peoplify
Reduce
1. Assemble connected component
2. Lift into ScalaGraph (in memory)
3. Edge cut to form seeds
4. Assign people ids to seeds
5. Grow seeds to share urls
New, existing, merge, delete,
destroy!?
https://twitter.com/nhalko
• followers/following
• me-links
• ccId (singular)
• peopleIds (plural)
Cassandra
Work Avoidance
the extreme preference of leisure as opposed to work
Url_a
Url_b
Url_c
1. Mark urls of interest with hot ccID ~5M
wtd < zzz
2. CC alg pushes hot ccID to other urls
3. Peoplify can lift and assemble only a subset
10M instead of 750M
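The selection in step 3 can be illustrated with a small sketch. It assumes component ids have already been computed and only shows how a set of urls of interest narrows the work to the components they touch; it does not reproduce how the hot ccID is propagated, and the names `WorkAvoidance` and `hotSubset` are hypothetical.

```scala
object WorkAvoidance {
  // ccIdByUrl: url -> connected component id (output of the CC step).
  // urlsOfInterest: the urls marked as hot.
  // Returns only the components that contain at least one hot url,
  // each with all of its member urls.
  def hotSubset(
      ccIdByUrl: Map[String, String],
      urlsOfInterest: Set[String]
  ): Map[String, Set[String]] = {
    val hotCcIds = urlsOfInterest.flatMap(ccIdByUrl.get)
    ccIdByUrl.toSeq
      .collect { case (url, ccId) if hotCcIds(ccId) => ccId -> url }
      .groupBy(_._1)
      .map { case (ccId, pairs) => ccId -> pairs.map(_._2).toSet }
  }
}
```

Everything outside the returned components is left untouched, which is where a reduction like the one on the slide would come from.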
Url_a
Url_b
Url_c
Url_a
Url_b
Url_c
Datasize
3 hrs (orig MR recursive)
3 days (orig MR recursive)
40 mins
3 hrs
Delivery
Spark, Solr
Given a set of identifiers (emailMd5s) create aggregate
level insights and record level appends.
Example: Following behavior, Twitter as an Interest Graph
Example: Tracking mentions and sentiment
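A minimal sketch of the "following behavior" aggregate, written with plain Scala collections rather than Spark so it runs on its own. The data shape (one set of followed accounts per emailMd5) and the names `FollowingInsights` and `topFollowed` are assumptions for illustration.

```scala
object FollowingInsights {
  // audience: the emailMd5 identifiers supplied by a client.
  // followsByEmailMd5: hypothetical lookup from an emailMd5 to the
  // accounts that the linked social profiles follow.
  // Returns the k most-followed accounts within the audience.
  def topFollowed(
      audience: Set[String],
      followsByEmailMd5: Map[String, Set[String]],
      k: Int
  ): Seq[(String, Int)] =
    audience.toSeq
      .flatMap(id => followsByEmailMd5.getOrElse(id, Set.empty[String]).toSeq)
      .groupBy(identity)
      .map { case (account, hits) => account -> hits.size }
      .toSeq
      .sortBy { case (account, count) => (-count, account) } // ties broken by name
      .take(k)
}
```

The same pipeline of flatMap, group and count would carry over to RDDs with little change, since RDDs feel like Scala collections.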
1. Set number of cores and heap per job
2. Internal map tasks reuse JVM
3. RDDs feel like Scala collections
4. I can read the source!!
5. Persist RDDs at will
6. Joins
7. Streaming (listening), GraphX (Giraph rep.)
Kafka queue
stream.map { item =>
  // do work
}
head —
offset —
Next generation Hadoop
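The Kafka queue picture (a head that advances as items arrive and an offset that marks how far a consumer has read) can be sketched without any Kafka or Spark dependency. This is a toy in-memory model under that assumption, not the Kafka client API; `OffsetConsumer` and `poll` are hypothetical names.

```scala
object OffsetConsumer {
  // log: the queue contents so far; its length plays the role of the head.
  // offset: index of the next item this consumer has not processed yet.
  // Applies `work` to at most maxBatch items and returns the results
  // together with the new offset to resume from.
  def poll[A, B](log: Vector[A], offset: Int, maxBatch: Int)(work: A => B): (Vector[B], Int) = {
    val batch = log.slice(offset, offset + maxBatch)
    (batch.map(work), offset + batch.size)
  }
}
```

Because the consumer carries its own offset, it can stop and resume at the same position, and calling `poll` once it has caught up with the head returns an empty batch.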
In the 2012 United States presidential election
between Barack Obama and Mitt Romney, he correctly
predicted the winner of all 50 states and the District of
Columbia.
Nate Silver
1. library release velocity - keeping up with new code versions
2. simplicity so others can maintain
3. make it work, make it right, make it fast
4. corner cases, outliers, the 0.01% - if it can possibly happen
it will in abundance
5. on average, nothing interesting happens - results plagued
by majority
6. waterfall, abysmal fill rates
7. Record level view (CC vs MLR)
8. Most time spent waiting on database
Algorithmic considerations
Algorithm Complexity
vs
Unique Input Data
• Student discount verification vs Consumer data modeling
• Rdio vs Netflix
Solr
3.5 B docs
2.7 TB
Cassandra
500M nodes/
30B edges
1.2 PB
HDFS
crawler
files, scratch
20 big servers: 16 core, 48-128GB ram, kraken 40-…
Hadoop
30 core MapReduce jobs
70 contrib jobs
Spark
8 streaming
2 batch
15 active SpotRight codebases
4 engineers
Amazon
S3, Glacier
Kafka
15 queues
Going Forward....
1. Migrate Hadoop -> Spark
2. Giraph -> Spark GraphX
3. More Spark streaming, automation
4. Graph collection in Solr
5. Visualization
Nathan Halko @nhalko
nathan@spotright.com
www.spotright.com
