This document provides an overview of Near Real Time Analysis of Web Scale Social Data. It discusses how SpotRight collects and organizes publicly available user-generated social data at web scale, including connections between users, actions, events, profiles, demographics, and more from sources like Twitter, Pinterest, blogs and articles. It describes SpotRight's goals, architecture, algorithms, and tools used to perform real-time analysis and deliver timely insights to clients, including graph building, profile creation, and delivery of results. Key aspects involve collecting petabytes of social data, performing distributed graph algorithms at scale, and querying and delivering insights from the organized data.
1. Near Real Time Analysis of Web Scale Social Data
Nathan Halko, Ph.D, @nhalko
Data Scientist @SpotRight
New Algorithms for Complex Data
Friday, March 20, 2015
Santa Fe, New Mexico
2. Near Real Time
Analysis
Web Scale
Social Data
• publicly available user generated content
• connections between users, actions, events (graph)
• data: meta, profile, demographic, etc
• Twitter, Pinterest, blogs, articles, etc
• everybody about everything
• US consumer population
• “actionable” profiles
• streaming computation
• searchable indexed results
• If a tree falls in the woods…
• collect, process, deliver
• web crawling, graph algorithms, Solr queries
• creation of a SpotRight ‘social profile’
3. Goals
1. Put in perspective some algorithms and how they are used
2. Discuss architecture
3. Qua[nt,l]ify measures of performance
4. cosmic microwave Background
Me (then)
• Randomized methods in Linear Algebra
• Prof. Gunnar Martinsson, CU Boulder
• Large scale implementation -> Hadoop, Mahout
SpotRight (then): SpotInfluence
• Experts in systems, architecture, coding
• Cutting edge technology, no legacy
• Big ideas: all pairs shortest path on graph of everyone about everything
• Influence scoring, Google for people
Me (now)
• Data Scientist - good title to have on LinkedIn
• In reality - sys admin, code monkey, analyst, fix bugs, rev code, monitoring
• Design complex algorithms that run at scale on distributed systems, based on what's cool.
5. SpotRight 10 second elevator pitch
*** from an engineer’s perspective ***
• Collect and organize information about people on the web:
1. what they say
2. who they interact with
3. how they behave
• Provide a means to link online data to traditional offline data. key=emailMd5
• Deliver timely aggregate and record level data to our clients
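The key=emailMd5 link can be illustrated with a minimal sketch. The normalization step (trim and lowercase before hashing) is an assumption, not something the slides state, and the function name `email_md5` and the sample records are hypothetical.

```python
import hashlib


def email_md5(email: str) -> str:
    """Return the hex MD5 digest of a normalized email address.

    Assumption: whitespace is trimmed and the address lowercased
    before hashing; the talk does not specify the normalization.
    """
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical offline record and online profile keyed by the same hash.
offline = {email_md5("Jane.Doe@example.com"): {"zip": "87501"}}
online = {email_md5(" jane.doe@example.com "): {"twitter": "@janedoe"}}

# Join the two sources on the shared key.
linked = {k: {**offline[k], **online[k]} for k in offline.keys() & online.keys()}
```

Because both sides hold only the hash, the join never needs the raw address.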
6. SpotRight work flow
Collect
• Crawl the web
• Twitter Graph Builder - 500M accounts
• Polite (crawl delay), Respectful (robots.txt) and Onymous (Hi! I’m SpotBot/1.0)
Organize and Link
• Identity resolution - Profile creation
• Many-to-many relationships: emailMd5 -> Urls -> SocialProfiles
• Distributed and local graph algorithms
Deliver
• Spark and Solr
• $$$
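The "polite" and "respectful" crawl rules in the Collect step can be sketched with Python's standard `urllib.robotparser`. This is an illustration only, not SpotRight's crawler: the robots.txt content, the default delay, and the helper names `make_policy` and `allowed_urls` are hypothetical. Only the agent string comes from the slide.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "SpotBot/1.0"  # agent string from the slide

# Hypothetical robots.txt content; a real crawler would fetch it per host.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def make_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, default_delay=1.0, sleep=time.sleep):
    """Yield URLs the policy allows, pausing between allowed fetches."""
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # respectful: skip disallowed paths
        if not first:
            sleep(delay)  # polite: wait between requests
        first = False
        yield url
```

Passing `sleep` as a parameter keeps the delay logic testable without actually waiting.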
9. Crawler
Links
Tweet Frequency
• Window of 20 tweets
• Predict the Nth tweet and revisit
• Classify accounts N=5,10,15
• Auto refresh
• Corner cases, errors, exceptions
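The revisit idea in the Tweet Frequency bullets can be sketched as follows. The slides do not say how the Nth tweet is predicted or how accounts are classified, so this sketch assumes a constant mean rate over the window and takes N as a parameter. The function name `predict_revisit` is hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = 20  # window size from the slide


def predict_revisit(tweet_times, n):
    """Estimate when the account will have posted n more tweets.

    Assumption: tweets keep arriving at the mean rate observed over
    the last WINDOW tweets; the talk does not state the predictor.
    """
    recent = sorted(tweet_times)[-WINDOW:]
    if len(recent) < 2:
        raise ValueError("need at least two tweets to estimate a rate")
    mean_gap = (recent[-1] - recent[0]) / (len(recent) - 1)
    return recent[-1] + n * mean_gap


# Usage: an account tweeting once an hour, revisited after about 10 more tweets.
start = datetime(2015, 3, 20, 8, 0)
times = [start + timedelta(hours=i) for i in range(WINDOW)]
revisit_at = predict_revisit(times, n=10)
```

A busier account gets a shorter gap and is revisited sooner, which is the point of choosing N per account class.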
[Graph diagram labels: Author, Follower, Following, @Mention, Retweet, Me Links, #Hashtag]
• Directed weighted
• Heterogeneous
• Chronological
• 500M -> 250M -> 100M users
• 400M edges, 200M tweets per month
• Always running
• Extensible - networks with graph structure
• Scalable - Kraken
10. Profile Creation
Hadoop, Giraph, ScalaGraph
[Diagram: "Me" and "You" nodes connected through "Shared" nodes, plus a node labeled "Every One We Know"]
• Shared nodes can be massive: KimKardashian, Orkut
• Direction/Presence of edge untrustworthy
• Whitelist networks to form seeds
• Heuristics based on business logic
• Over aggregation, black holes
12. Nearest Neighbor Index
Cassandra
Giraph - BSP - Connected Components
For each superstep:
send min id to neighbors
or decide to halt
Unique min id for CC
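The superstep loop above is the usual min-label propagation for connected components. A single-process sketch, assuming an in-memory adjacency dict with comparable vertex ids (Giraph itself distributes these steps across workers, and the function name here is hypothetical):

```python
def connected_components(adjacency):
    """Label each vertex with the minimum vertex id in its component.

    adjacency: dict mapping vertex id -> iterable of neighbor ids
    (treated as undirected). Each loop iteration plays the role of one
    BSP superstep: vertices read their inbox, keep the smallest id
    seen, and message neighbors only if their label changed.
    """
    # Make edges symmetric so labels flow in both directions.
    neighbors = {v: set() for v in adjacency}
    for v, outs in adjacency.items():
        for u in outs:
            neighbors.setdefault(u, set()).add(v)
            neighbors[v].add(u)

    label = {v: v for v in neighbors}
    # Superstep 0: every vertex sends its own id to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for u in neighbors[v]:
            inbox[u].append(label[v])

    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for u in neighbors[v]:
                    outbox[u].append(label[v])
            # Otherwise the vertex votes to halt and sends nothing.
        inbox = outbox
    return label
```

The loop stops once no messages are in flight, at which point every vertex in a component holds the same minimum id.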
Map
1. Assemble local graph
2. key by ccID
Peoplify
Reduce
1. Assemble connected component
2. Lift into ScalaGraph (in memory)
3. Edge cut to form seeds
4. Assign people ids to seeds
5. Grow seeds to share urls
New, existing, merge, delete, destroy!?
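The first Map and Reduce steps (assemble local graphs, key by ccID, assemble the component) can be sketched in a few lines. The record shape and function names are hypothetical, and the later steps (edge cut, seeds, people ids) are left out because the slides do not describe their heuristics.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (ccId, local graph) pairs, one per url record."""
    for rec in records:
        local_graph = {rec["url"]: set(rec["links"])}
        yield rec["ccId"], local_graph


def reduce_phase(pairs):
    """Merge all local graphs that share a ccId into one component graph."""
    components = defaultdict(lambda: defaultdict(set))
    for cc_id, local_graph in pairs:
        for url, links in local_graph.items():
            components[cc_id][url] |= links
    return {cc: dict(graph) for cc, graph in components.items()}


# Hypothetical records: two urls in component 1, one url in component 7.
records = [
    {"url": "Url_a", "ccId": 1, "links": ["Url_b"]},
    {"url": "Url_b", "ccId": 1, "links": ["Url_a"]},
    {"url": "Url_x", "ccId": 7, "links": []},
]
components = reduce_phase(map_phase(records))
```

Each reducer then sees one whole component, which is what makes lifting it into an in-memory graph possible.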
https://twitter.com/nhalko
• followers/following
• me-links
• ccId (singular)
• peopleIds (plural)
Cassandra
13. Work Avoidance
the extreme preference of leisure as opposed to work
Url_a
Url_b
Url_c
1. Mark urls of interest with hot ccID ~5M
wtd < zzz
2. CC alg pushes hot ccID to other urls
3. Peoplify can lift and assemble only a subset: 10M instead of 750M
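The subset selection in these steps can be sketched as a flood from the urls of interest. This is an assumed mechanism: the slides say the connected components algorithm pushes the hot ccID, while this sketch uses a plain breadth-first search to reach the same set. The function name `hot_subset` is hypothetical.

```python
from collections import deque


def hot_subset(neighbors, hot_urls):
    """Return every url in a component that contains at least one hot url.

    neighbors: dict url -> set of linked urls (treated as undirected).
    Assumption: flooding a hot flag outward stands in for the
    connected components run that pushes the hot ccID.
    """
    seen = {u for u in hot_urls if u in neighbors}
    queue = deque(seen)
    while queue:
        url = queue.popleft()
        for nxt in neighbors.get(url, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Usage: only the component touching Url_a needs to be lifted and assembled.
graph = {
    "Url_a": {"Url_b"},
    "Url_b": {"Url_a", "Url_c"},
    "Url_c": {"Url_b"},
    "Url_x": {"Url_y"},
    "Url_y": {"Url_x"},
}
to_process = hot_subset(graph, {"Url_a"})
```

Components with no hot url are never touched, which is where the saving comes from.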
[Chart: run time against Datasize, with labels "3 hrs (orig MR recursive)", "3 days (orig MR recursive)", "40 mins", "3 hrs"]
14. Delivery
Spark, Solr
Given a set of identifiers (emailMd5s), create aggregate-level insights and record-level appends.
Example: Following behavior, Twitter as an Interest Graph
Example: Tracking mentions and sentiment
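The following-behavior example can be sketched as a lookup plus a count. This is a plain-Python illustration rather than the Spark and Solr pipeline the talk describes. The profile shape, the `fill_rate` and `top_follows` fields, and the function name `deliver` are hypothetical.

```python
from collections import Counter


def deliver(email_md5s, profiles):
    """Return (aggregate, appends) for the identifiers found in profiles.

    profiles: dict emailMd5 -> {"follows": [account, ...]}
    (hypothetical shape). appends is the record-level view; aggregate
    summarizes following behavior across all matched records.
    """
    appends = {k: profiles[k] for k in email_md5s if k in profiles}
    counts = Counter(
        acct for p in appends.values() for acct in set(p["follows"])
    )
    matched = len(appends)
    aggregate = {
        # Share of requested identifiers that matched a profile.
        "fill_rate": matched / len(email_md5s) if email_md5s else 0.0,
        # Most-followed accounts with the share of matched records following each.
        "top_follows": [(a, n / matched) for a, n in counts.most_common(3)],
    }
    return aggregate, appends


# Usage with hypothetical hashes: two of three identifiers match.
profiles = {
    "h1": {"follows": ["@nasa", "@espn"]},
    "h2": {"follows": ["@nasa"]},
}
aggregate, appends = deliver(["h1", "h2", "h3"], profiles)
```

The same matched set feeds both outputs, so the aggregate and the record-level appends always agree.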
15. Next generation Hadoop
1. Set number of cores and heap per job
2. Internal map tasks reuse JVM
3. RDDs feel like Scala collections
4. I can read the source!!
5. Persist RDDs at will
6. Joins
7. Streaming (listening), GraphX (Giraph rep.)
Kafka queue (head, offset)
stream.map { item =>
  // do work
}
16. Nate Silver
In the 2012 United States presidential election between Barack Obama and Mitt Romney, he correctly predicted the winner of all 50 states and the District of Columbia.
17. Algorithmic considerations
1. library release velocity - keeping up with new code versions
2. simplicity so others can maintain
3. make it work, make it right, make it fast
4. corner cases, outliers, the 0.01% - if it can possibly happen it will in abundance
5. on average, nothing interesting happens - results plagued by majority
6. waterfall, abysmal fill rates
7. Record level view (CC vs MLR)
8. Most time spent waiting on database