Hendrickson data2 2012-gnip

Talk from Data2.0, San Francisco, 2012 April 3

Published in: Technology, Business
Transcript of "Hendrickson data2 2012-gnip"

  1. Taming the Social Media Firehose. Scott Hendrickson, Data Scientist, Gnip
  2. Social media firehoses; Connect, move and store lots of data; Filter and analyze; E.g. How a social media story evolves; Dig deeper
  3. Obtain: pointing and clicking does not scale. Scrub: the world is a messy place. Explore: you can see a lot by looking. Models: always bad, sometimes ugly. iNterpret: insight, not numbers. (Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)
  4. Obtain, Parse, Store, Filter, Analyze, Structure, Aggregate, iNterpret
  5. Continuous streams of flexibly structured social media activities in near-real time.
  6. Continuous. Twitter Full Firehose: 300M+ activities/day; 3,500 activities/second; or 1 activity every 290 μsec. Wordpress and Disqus Comments: 400K+ activities/day; 4.6 activities/second; or 1 activity every 0.22 s.
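The per-second rates and inter-arrival gaps on this slide follow directly from the daily volumes. A minimal sketch of that arithmetic, assuming activity is spread evenly over the day (real traffic is bursty); the function names are illustrative, not from the talk:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(activities_per_day: float) -> float:
    """Average activities per second for a given daily volume."""
    return activities_per_day / SECONDS_PER_DAY


def mean_gap_seconds(activities_per_day: float) -> float:
    """Average time between consecutive activities, in seconds."""
    return SECONDS_PER_DAY / activities_per_day


# Daily volumes quoted on the slide (lower bounds, "300M+" and "400K+").
twitter_rate = per_second(300e6)
twitter_gap_us = mean_gap_seconds(300e6) * 1e6
comments_rate = per_second(400e3)
comments_gap_s = mean_gap_seconds(400e3)

print(f"Twitter:  {twitter_rate:,.0f}/s, one every {twitter_gap_us:.0f} microseconds")
print(f"Comments: {comments_rate:.1f}/s, one every {comments_gap_s:.2f} s")
```

The results land close to the slide's rounded figures (3,500 per second and 290 μsec for Twitter; 4.6 per second and 0.22 s for comments).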
  7. Streams. E.g. Streaming HTTP: Not your familiar 1-shot web APIs; A step from stateless sessions: Connection monitoring, “Keep alive” records, Caching-on-disconnect. (Ping → gniP)
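A consumer of a long-lived HTTP stream has to tell real activities apart from "keep alive" records. The sketch below assumes one JSON activity per line and a blank line as the keep-alive record; both are assumptions about the wire format, not details stated on the slide, and `iter_activities` is a hypothetical helper name:

```python
import json
from typing import Iterable, Iterator


def iter_activities(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield parsed activities from a line-delimited JSON stream.

    Assumes blank lines are keep-alive records. A line that fails to
    parse (for example, one cut short by a disconnect) is skipped.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # keep-alive record: the connection is still up
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed record


# Stand-in for bytes read from an open streaming connection.
sample_stream = [
    b'{"id": "a"}\r\n',
    b"\r\n",
    b'{"id": "b"}\r\n',
    b'{"id": "c',  # record cut off by a disconnect
]
activities = list(iter_activities(sample_stream))
print(activities)
```

Because the function takes any iterable of byte lines, the same code can be exercised against a saved file or against a live response object that iterates line by line.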
  8. Flexibly structured. Vis-à-vis firehoses: Emphasis on time-ordered events; Combination of data and meta-data (e.g. Tweet and number of Retweets); Activity encapsulation; Hierarchical structures within activity. Flexibly Structured = “Unstructured” in the normalized set-based database sense
  9. Social media activities: Tweets, micro-blogs; Blog/rich-media posts; Comments/threaded discussions; Rich media-sharing (urls, reposts); Location data (place, long/lat); Friend/follower relationships; Engagement (e.g. Likes, up- and down-votes, reputation); Tagging
  10. Near-real time. Twitter (Tweet-through-firehose-spigot): ~1.6 s (as low as 300 msec). Wordpress Posts (Post-through-firehose-spigot): ~2.5 s (as low as 1 sec). Is anything realtime?
  11. 1. Compare time-evolution of social media reactions across firehoses. 2. Compare richness of content across firehoses.
  12. Firehoses: Twitter; Wordpress Posts and Comments; Newsgator. Filter content on key terms: “quake”, “terremoto”. Extract date time posted, group in 1 min buckets and plot.
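The key-term filter on this slide can be sketched as a case-insensitive substring match. The field name `body` and the helper `matches_key_terms` are assumptions for illustration; the slide does not say which field was searched or how matching was done:

```python
KEY_TERMS = ("quake", "terremoto")  # terms quoted on the slide


def matches_key_terms(activity: dict, terms=KEY_TERMS) -> bool:
    """True if any key term occurs in the activity text.

    Assumes the text lives in a "body" field (hypothetical) and that a
    plain substring match is enough, so "earthquake" matches "quake".
    """
    text = activity.get("body", "").casefold()
    return any(term in text for term in terms)


sample = [
    {"body": "Earthquake in Mexico City"},
    {"body": "TERREMOTO!"},
    {"body": "Nice weather today"},
]
kept = [a for a in sample if matches_key_terms(a)]
print(len(kept), "of", len(sample), "activities kept")
```

Substring matching is deliberately loose: it catches "earthquake" and "quakes" at the cost of occasional false positives, which is usually acceptable for a preliminary filter.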
  13. Surprise events fit a “double-exponential” pulse in activity rate that enables consistent comparison between events and sources
  14. R0 = 1288.150591; alpha = 0.001470; beta = 0.000195; # t0 = 1332266953; # TPeak = 1332268410; Time-to-peak = 24.3 min; Peak rate = 855; Mass = 5816206.183899; # T 1/2life = 1332272593; 1/2Life = 69.7 min
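The transcript does not give the functional form of the pulse. One form that is consistent with the reported numbers is r(t) = R0 · (1 − e^(−alpha·t)) · e^(−beta·t), with t measured in seconds since t0. This form is an inference from the fit output, not something the slides state; the sketch below uses it to recompute the derived quantities, and the units of the rate itself are left unspecified.

```python
import math

# Fitted parameters as printed on the slide.
R0, ALPHA, BETA = 1288.150591, 0.001470, 0.000195


def rate(t: float) -> float:
    """Assumed pulse shape: a fast rise (alpha) times a slow decay (beta)."""
    if t <= 0:
        return 0.0
    return R0 * (1.0 - math.exp(-ALPHA * t)) * math.exp(-BETA * t)


def time_to_peak() -> float:
    """Seconds from t0 to the maximum, found by setting dr/dt = 0."""
    return math.log((ALPHA + BETA) / BETA) / ALPHA


def mass() -> float:
    """Integral of rate(t) from 0 to infinity."""
    return R0 * (1.0 / BETA - 1.0 / (ALPHA + BETA))


def half_life_after_peak() -> float:
    """Seconds after the peak until the rate falls to half its peak value."""
    t_peak = time_to_peak()
    target = rate(t_peak) / 2.0
    lo, hi = t_peak, t_peak + 20.0 / BETA  # rate is negligible at hi
    for _ in range(100):  # bisection on the decaying side of the pulse
        mid = (lo + hi) / 2.0
        if rate(mid) > target:
            lo = mid
        else:
            hi = mid
    return lo - t_peak


print(f"time to peak: {time_to_peak() / 60:.1f} min")
print(f"peak rate:    {rate(time_to_peak()):.0f}")
print(f"mass:         {mass():.0f}")
print(f"half-life:    {half_life_after_peak() / 60:.1f} min")
```

Under this assumed form, the time to peak and peak rate agree with the slide to within rounding, and the mass and half-life agree to within about one percent, which is plausible given that the printed parameters are themselves rounded.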
  15. 1. Connect and stream data from firehoses. 2. Preliminary filter. 3. Store to file. 4. Extract post times. 5. Count activities in 1-minute buckets. 6. Proxy of “richness”: count number of characters in content. 7. Visualize.
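Steps 4 to 6 can be sketched in a few lines. The field names `postedTime` and `body`, and the ISO 8601 timestamp format, are assumptions for illustration; the slides do not show the record layout.

```python
from collections import Counter
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed timestamp layout (UTC)


def minute_bucket(posted_time: str) -> datetime:
    """Parse a timestamp and truncate it to the start of its minute."""
    ts = datetime.strptime(posted_time, TIME_FORMAT).replace(tzinfo=timezone.utc)
    return ts.replace(second=0, microsecond=0)


def bucket_counts(activities):
    """Return (activity count, total characters) per 1-minute bucket."""
    counts, chars = Counter(), Counter()
    for activity in activities:
        bucket = minute_bucket(activity["postedTime"])
        counts[bucket] += 1
        chars[bucket] += len(activity.get("body", ""))  # "richness" proxy
    return counts, chars


sample = [
    {"postedTime": "2012-03-20T18:09:13.000Z", "body": "quake!"},
    {"postedTime": "2012-03-20T18:09:48.000Z", "body": "terremoto en Mexico"},
    {"postedTime": "2012-03-20T18:10:02.000Z", "body": "Felt that quake"},
]
counts, chars = bucket_counts(sample)
for bucket in sorted(counts):
    mean_chars = chars[bucket] / counts[bucket]
    print(bucket.isoformat(), counts[bucket], f"{mean_chars:.1f}")
```

Dividing total characters by the count gives a per-bucket average length, which makes the richness proxy comparable between a high-volume source and a low-volume one.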
  16. Connecting. Simple: HTTP streaming with cURL: curl --compressed -v -ushendrickson@gnip.com "https://stream.gnip.com:443/accounts/shendrickson/publishers/twitter/streams/sample10/decahose.json". Build based on libraries. OTS solutions.
  17. Connecting. Considerations: Disconnects; Redundancy; Latency; Bandwidth; Data bursts; Costs; Publisher TOS – Deletes; De-dups, missing activities
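Two of these considerations, disconnects and de-duplication, lend themselves to small helpers. The sketch below is one common approach, not the speaker's implementation: exponential backoff between reconnect attempts, and a bounded memory of recently seen activity ids. The names `backoff_delays` and `Deduplicator`, and the default limits, are hypothetical.

```python
from collections import OrderedDict
from itertools import islice


def backoff_delays(base: float = 1.0, cap: float = 60.0):
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... up to cap."""
    delay = base
    while True:
        yield delay
        delay = min(cap, delay * 2)


class Deduplicator:
    """Remember the most recent activity ids so replayed records can be dropped."""

    def __init__(self, max_ids: int = 100_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, activity_id: str) -> bool:
        if activity_id in self._seen:
            return False
        self._seen[activity_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # forget the oldest id
        return True


delays = list(islice(backoff_delays(), 8))
dedup = Deduplicator(max_ids=2)
flags = [dedup.is_new(i) for i in ["a", "b", "a", "c", "a"]]
print(delays)
print(flags)
```

Bounding the id memory trades completeness for predictable memory use: a duplicate that arrives after its id has been evicted will pass through, so the limit should cover the longest replay window expected after a reconnect.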
  18. Moving and Storing. Volumes (JSON, gzip’d): 100M Tweets = 25 GB; < 2 min @300 MB/s (SATA II); < 6 hrs @10 Mb/s (Ethernet); 1 day Wordpress.com posts = 350 MB. File system. NoSQL/Key-Value Stores – Flexible structure. Relational DB Stores – Indexes rock. Message Queues.
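The transfer-time bounds on this slide can be checked with simple arithmetic. A minimal sketch, assuming decimal units (1 GB = 10^9 bytes) and that the link runs at its full nominal rate; note that the Ethernet figure is in megabits, so it is divided by 8 to get bytes:

```python
def transfer_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time to move a payload at a constant rate, ignoring overhead."""
    return size_bytes / bytes_per_second


ARCHIVE_BYTES = 25e9   # 100M Tweets, gzip'd JSON, per the slide
TWEET_COUNT = 100e6

sata_seconds = transfer_seconds(ARCHIVE_BYTES, 300e6)         # 300 MB/s
ethernet_seconds = transfer_seconds(ARCHIVE_BYTES, 10e6 / 8)  # 10 Mb/s
bytes_per_tweet = ARCHIVE_BYTES / TWEET_COUNT

print(f"SATA II:  {sata_seconds / 60:.1f} min")
print(f"Ethernet: {ethernet_seconds / 3600:.1f} hrs")
print(f"Average compressed size: {bytes_per_tweet:.0f} bytes per Tweet")
```

Both results fall under the slide's bounds of 2 minutes and 6 hours, and the gap of more than two orders of magnitude between them is the practical point: the network, not the disk, limits how fast a firehose archive can be moved.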
  19. Filter. Model – guess at structure and process; Parse – sort out the pieces; Filter – reduce to what matters; Aggregate – cluster, sum, average…; Analyse – tell the story with data
  20. Speed vs. Depth; Evolution
  21. Network dynamics: Influencers, path analysis, viral spread… Time dynamics: Time to peak, story half-life… Natural language processing: “Aboutness” is hard, but gets easier as domain narrows. Explore and deploy: Master skills, shorten cycles of exploration; Move learning to production
  22. www.gnip.com; Twitter: @drskippy27