Near Real Time Analysis of Web Scale Social Data

Nathan Halko, Ph.D, @nhalko
Data Scientist @SpotRight
New Algorithms for Complex Data
Friday, March 20, 2015
Santa Fe, New Mexico

Near Real Time
Analysis
Web Scale
Social Data
• publicly available user generated content
• connections between users, actions, events (graph)
• data: meta, profile, demographic, etc
• Twitter, Pinterest, blogs, articles, etc
• everybody about everything
• US consumer population
• “actionable” profiles
• streaming computation
• searchable indexed results
• If a tree falls in the woods…
• collect, process, deliver
• web crawling, graph algorithms, Solr queries
• creation of a SpotRight ‘social profile’

Goals
1. Put in perspective some algorithms and how they are used
2. Discuss architecture
3. Qua[nt,l]ify measures of performance

cosmic microwave Background
Me (then)
• Randomized methods in Linear Algebra
• Prof. Gunnar Martinsson, CU Boulder
• Large scale implementation -> Hadoop, Mahout
SpotRight (then): SpotInfluence
• Experts in systems, architecture, coding
• Cutting edge technology, no legacy
• Big ideas: all pairs shortest path on graph of everyone about everything
• Influence scoring, Google for people
Me (now)
• Data Scientist - good title to have on LinkedIn
• In reality - sys admin, code monkey, analyst, fix bugs, rev code, monitoring
• Design complex algorithms that run at scale on distributed systems based on what's cool.

SpotRight 10 second elevator pitch
*** from an engineer’s perspective ***
• Collect and organize information about people on the web:
1. what they say
2. who they interact with
3. how they behave
• Provide a means to link online data to traditional offline data. key=emailMd5
• Deliver timely aggregate and record level data to our clients
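
The key=emailMd5 link above can be sketched as follows. This is a minimal illustration, not SpotRight's actual code: the normalization step (trim and lowercase), the function name `email_md5`, and the sample records are all assumptions.

```python
import hashlib


def email_md5(email: str) -> str:
    """Return the hex MD5 digest of a normalized email address."""
    # Assumption: trim whitespace and lowercase before hashing so that
    # the same address always maps to the same key.
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical offline record and online profile, linked only by the hash.
offline_records = {email_md5("Jane.Doe@example.com"): {"zip": "80302"}}
online_profile = {
    "key": email_md5(" jane.doe@example.com "),
    "urls": ["https://twitter.com/janedoe"],
}

# The join never touches the raw email address, only its digest.
linked = offline_records.get(online_profile["key"])
```

Hashing lets the online and offline sides be joined without exchanging the address itself, which is presumably why the hash, and not the email, serves as the key.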

SpotRight work flow
Collect
• Crawl the web
• Twitter Graph Builder - 500M accounts
• Polite (crawl delay), Respectful (robots.txt) and Onymous (Hi! I’m SpotBot/1.0)
Organize and Link
• Identity resolution - Profile creation
• Many-to-many relationships: emailMd5 -> Urls -> SocialProfiles
• Distributed and local graph algorithms
Deliver
• Spark and Solr
• $$$
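
The "polite" and "respectful" crawling rules can be sketched with Python's standard `urllib.robotparser`. The robots.txt body, the example URLs, the `plan_fetch` helper and the default delay are assumptions made for illustration; only the agent string comes from the slide.

```python
import urllib.robotparser

USER_AGENT = "SpotBot/1.0"  # agent string quoted on the slide

# Hypothetical robots.txt body, inlined so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def make_policy(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse a robots.txt body into a reusable policy object."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for one URL."""
    allowed = parser.can_fetch(USER_AGENT, url)        # respectful
    delay = parser.crawl_delay(USER_AGENT)             # polite
    return allowed, float(delay) if delay is not None else default_delay


policy = make_policy(ROBOTS_TXT)
open_page = plan_fetch(policy, "http://example.com/index.html")
private_page = plan_fetch(policy, "http://example.com/private/data")
```

A real crawler would sleep for the returned delay between requests to the same host and send the agent string in its User-Agent header, which is the "onymous" part.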

Graph Builder
Hadoop MapReduce
[Diagram: Graph Builder architecture. Labeled components: Map, Reduce, HDFS, Fetcher (async http), Parser (html, json, regex), Cassandra, Kafka, Driver, Crawler, HighWaterQueue.]

Crawler
Tweet Frequency
• Window of 20 tweets
• Predict the Nth tweet and revisit
• Classify accounts N=5,10,15
• Auto refresh
• Corner cases, errors, exceptions
Links
• Edge types: Author, Follower, Following, @Mention, Retweet, Me Links, #Hashtag
• Directed, weighted
• Heterogeneous
• Chronological
• 500M -> 250M -> 100M users
• 400M edges, 200M tweets per month
• Always running
• Extensible - networks with graph structure
• Scalable - Kraken
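
The tweet-frequency revisit idea (window of 20 tweets, predict the Nth tweet, classify accounts into N=5,10,15) might look like the sketch below. The slide does not say how the prediction or the classification is done, so the mean-gap estimate and the thresholds in `classify` are purely hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = 20  # window size from the slide


def mean_gap(timestamps):
    """Average time between consecutive tweets in the most recent window."""
    recent = sorted(timestamps)[-WINDOW:]
    if len(recent) < 2:
        return None  # corner case: not enough history to estimate a rate
    return (recent[-1] - recent[0]) / (len(recent) - 1)


def classify(gap):
    """Hypothetical rule: faster accounts wait for more new tweets per visit."""
    if gap is None or gap > timedelta(days=1):
        return 5
    if gap > timedelta(hours=1):
        return 10
    return 15


def next_visit(timestamps):
    """Return (N, time at which the Nth new tweet is expected)."""
    gap = mean_gap(timestamps)
    n = classify(gap)
    if gap is None:
        return n, None  # caller falls back to a default refresh schedule
    return n, max(timestamps) + n * gap


# Hypothetical account that tweets every two hours.
start = datetime(2015, 3, 1)
history = [start + timedelta(hours=2 * i) for i in range(20)]
n, revisit_at = next_visit(history)
```

Revisiting when roughly N new tweets are expected keeps each visit inside the 20-tweet window, so fast accounts are crawled often and quiet ones rarely.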

Profile Creation
Hadoop, Giraph, ScalaGraph
[Diagram: graph of url nodes labeled Me, You and Shared, captioned "Every One We Know".]
• Shared nodes can be massive: KimKardashian, Orkut
• Direction/Presence of edge untrustworthy
• Whitelist networks to form seeds
• Heuristics based on business logic
• Over aggregation, black holes

Examples
[Figure: two example graphs of linked urls, with nodes labeled by url. One groups Shoutlet-related urls, including https://twitter.com/shoutlet, http://www.shoutlet.com/ and http://pinterest.com/shoutlet, together with Twitter accounts and t.co links of related people. The other groups SpotRight-related urls, including https://twitter.com/nhalko, https://twitter.com/spotright, http://www.spotright.com/, http://www.spotinfluence.com/ and http://www.strava.com/athletes/2642700.]

Nearest Neighbor Index
Cassandra
Giraph - BSP - Connected Components
• For each superstep: send min id to neighbors or decide to halt
• Unique min id for CC
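
The superstep rule above (send min id to neighbors or halt) can be simulated on a single machine. This sketch is plain Python, not the Giraph API; it treats edges as undirected and uses small integer ids, both of which are assumptions.

```python
def connected_components(edges):
    """Label every node with the smallest node id in its component."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    label = {node: node for node in neighbors}
    active = set(neighbors)  # nodes whose label changed in the last superstep
    supersteps = 0
    while active:
        # Message phase: each active node sends its current min id.
        inbox = {}
        for node in active:
            for nb in neighbors[node]:
                inbox.setdefault(nb, []).append(label[node])
        # Compute phase: adopt a smaller id, otherwise vote to halt.
        active = set()
        for node, msgs in inbox.items():
            smallest = min(msgs)
            if smallest < label[node]:
                label[node] = smallest
                active.add(node)
        supersteps += 1
    return label, supersteps


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
labels, steps = connected_components([(1, 2), (2, 3), (4, 5)])
```

When every node has halted, all nodes in a component share the component's minimum id, which is the "unique min id for CC" that later stages can key on.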
Peoplify
Map
1. Assemble local graph
2. key by ccID
Reduce
1. Assemble connected component
2. Lift into ScalaGraph (in memory)
3. Edge cut to form seeds
4. Assign people ids to seeds
5. Grow seeds to share urls
New, existing, merge, delete, destroy!?
https://twitter.com/nhalko
• followers/following
• me-links
• ccId (singular)
• peopleIds (plural)
Cassandra

Work Avoidance: the extreme preference of leisure as opposed to work
[Diagram: Url_a, Url_b and Url_c, drawn three times.]
1. Mark urls of interest with hot ccID ~5M wtd < zzz
2. CC alg pushes hot ccID to other urls
3. Peoplify can lift and assemble only a subset: 10M instead of 750M
[Chart: run time against Datasize. Labels: 3 hrs (orig MR recursive), 3 days (orig MR recursive), 40 mins, 3 hrs.]
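
One way to read the three steps above is sketched below. It is an interpretation, not the talk's implementation: a "hot" label is modeled as a tuple that sorts below every ordinary label, so min-label propagation carries it through a whole component, and only urls that end up hot are passed on. The names `propagate_min` and `hot_subset` and the sample urls are hypothetical.

```python
def propagate_min(neighbors, label):
    """Min-label propagation until no label changes."""
    changed = True
    while changed:
        changed = False
        for node, nbs in neighbors.items():
            best = min([label[node]] + [label[nb] for nb in nbs])
            if best < label[node]:
                label[node] = best
                changed = True
    return label


def hot_subset(edges, urls_of_interest):
    """Return the urls in components that contain a url of interest."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Step 1: labels are (is_cold, url), so hot labels sort below cold ones.
    label = {u: (u not in urls_of_interest, u) for u in neighbors}
    # Step 2: the CC pass pushes hot labels to every connected url.
    propagate_min(neighbors, label)
    # Step 3: only hot urls need to be lifted and assembled downstream.
    return {u for u, (is_cold, _) in label.items() if not is_cold}


edges = [("Url_a", "Url_b"), ("Url_b", "Url_c"), ("Url_d", "Url_e")]
subset = hot_subset(edges, {"Url_c"})
```

In this toy graph, marking Url_c pulls in Url_a and Url_b and leaves the untouched component out, mirroring the 10M instead of 750M reduction claimed on the slide.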

Delivery
Spark, Solr
Given a set of identifiers (emailMd5s), create aggregate level insights and record level appends.
Example: Following behavior, Twitter as an Interest Graph
Example: Tracking mentions and sentiment
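
The difference between an aggregate level insight and a record level append can be shown on the following-behavior example. The data, the key names and both helper functions are hypothetical; in production this would presumably run on Spark and Solr, not on in-memory dictionaries.

```python
from collections import Counter

# Hypothetical record-level data: emailMd5 -> accounts that profile follows.
following = {
    "md5_a": ["@nytimes", "@strava", "@espn"],
    "md5_b": ["@strava", "@espn"],
    "md5_c": ["@nytimes"],
}


def aggregate_interests(email_md5s, following):
    """Aggregate level: share of matched profiles following each account."""
    matched = [m for m in email_md5s if m in following]
    if not matched:
        return {}
    counts = Counter(acct for m in matched for acct in set(following[m]))
    return {acct: n / len(matched) for acct, n in counts.items()}


def record_appends(email_md5s, following):
    """Record level: one row per identifier, None when nothing matched."""
    return {m: following.get(m) for m in email_md5s}


client_list = ["md5_a", "md5_b", "md5_x"]
insights = aggregate_interests(client_list, following)
appends = record_appends(client_list, following)
```

The unmatched identifier `md5_x` lowers the fill rate of the append but does not distort the aggregate, which is computed over matched profiles only in this sketch.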

1. Set number of cores and heap per job
2. Internal map tasks reuse JVM
3. RDDs feel like Scala collections
4. I can read the source!!
5. Persist RDDs at will
6. Joins
7. Streaming (listening), GraphX (Giraph rep.)
Kafka queue (diagram labels: head, offset)
stream.map { item =>
  // do work
}
Next generation Hadoop
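
The head and offset labels on the Kafka queue can be illustrated with an in-memory stand-in. This is not the Kafka or Spark Streaming API; the `Log` and `Consumer` classes are hypothetical and only show how a consumer maps a function over the items between its offset and the head.

```python
class Log:
    """Append-only list standing in for one queue partition."""

    def __init__(self):
        self.items = []

    @property
    def head(self):
        return len(self.items)  # position the next append will get

    def append(self, item):
        self.items.append(item)


class Consumer:
    """Tracks how far into the log it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # next position to read

    def poll(self, work):
        """Apply `work` to every item between offset and head, then advance."""
        head = self.log.head
        results = [work(item) for item in self.log.items[self.offset:head]]
        self.offset = head
        return results


log = Log()
log.append("a")
log.append("b")
consumer = Consumer(log)
first_batch = consumer.poll(str.upper)   # processes "a" and "b"
log.append("c")
second_batch = consumer.poll(str.upper)  # processes only "c"
```

Because the consumer owns its offset, it can stop and resume without losing or reprocessing items, as long as the offset is stored somewhere durable.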

In the 2012 United States presidential election
between Barack Obama and Mitt Romney, he correctly
predicted the winner of all 50 states and the District of
Columbia.
Nate Silver

Algorithmic considerations
1. library release velocity - keeping up with new code versions
2. simplicity so others can maintain
3. make it work, make it right, make it fast
4. corner cases, outliers, the 0.01% - if it can possibly happen it will in abundance
5. on average, nothing interesting happens - results plagued by majority
6. waterfall, abysmal fill rates
7. Record level view (CC vs MLR)
8. Most time spent waiting on database

Algorithm Complexity vs Unique Input Data
• Student discount verification vs Consumer data modeling
• Rdio vs Netflix

Solr: 3.5 B docs, 2.7 TB
Cassandra: 500M nodes / 30B edges
1.2 PB
HDFS: crawler files, scratch
20 big servers: 16 core, 48-128GB ram, kraken 40-…
Hadoop: 30 core MapReduce jobs, 70 contrib jobs
Spark: 8 streaming, 2 batch
15 active SpotRight codebases
4 engineers
Amazon: S3, Glacier
Kafka: 15 queues

Going Forward....
1. Migrate Hadoop -> Spark
2. Giraph -> Spark GraphX
3. More Spark streaming, automation
4. Graph collection in Solr
5. Visualization

Nathan Halko @nhalko
nathan@spotright.com
www.spotright.com
