Hendrickson data2 2012-gnip


Talk from Data2.0, San Francisco, 2012 April 3

Published in: Technology, Business

  1. Taming the Social Media Firehose. Scott Hendrickson, Data Scientist, Gnip
  2. Social media firehoses
     •  Connect, move and store lots of data
     •  Filter and analyze (e.g. how a social media story evolves)
     •  Dig deeper
  3. Obtain: pointing and clicking does not scale.
     Scrub: the world is a messy place.
     Explore: you can see a lot by looking.
     Models: always bad, sometimes ugly.
     iNterpret: insight, not numbers.
     (Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)
  4. Obtain, Parse, Store, Filter, Analyze, Structure, Aggregate, iNterpret
  5. Continuous streams of flexibly structured social media activities in near-real time.
  6. Continuous
     •  Twitter full firehose: 300M+ activities/day, about 3,500 activities/second, or 1 activity every 290 μs
     •  WordPress and Disqus comments: 400K+ activities/day, about 4.6 activities/second, or 1 activity every 0.22 s
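The per-second and per-activity figures on slide 6 follow from the daily volumes by simple division. A minimal Python check, assuming the round figures of 300 million and 400 thousand activities per day (the slide says "300M+" and "400K+", so these are lower bounds) and a hypothetical helper name `stream_rate`:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def stream_rate(activities_per_day):
    """Return (activities per second, seconds between activities)."""
    per_second = activities_per_day / SECONDS_PER_DAY
    return per_second, 1.0 / per_second


# Assumed round daily volumes; the slide only gives lower bounds.
twitter_rate, twitter_gap = stream_rate(300e6)
comments_rate, comments_gap = stream_rate(400e3)

print(f"Twitter:  {twitter_rate:,.0f}/s, one every {twitter_gap * 1e6:.0f} microseconds")
print(f"Comments: {comments_rate:.1f}/s, one every {comments_gap:.2f} s")
```

The results land slightly below the slide's rounded 3,500 activities/second and 290 μs, which is what one would expect if the true daily volume is a little above 300M.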
  7. Streams
     E.g. streaming HTTP
     •  Not your familiar one-shot web APIs
     •  A step away from stateless sessions:
        •  Connection monitoring
        •  "Keep alive" records
        •  Caching-on-disconnect
     (Ping → gniP)
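To make the "keep alive" idea on slide 7 concrete, here is a minimal Python sketch of a stream reader. It assumes, for illustration only, that the stream delivers one JSON activity per line and that keep-alive records arrive as blank lines; the function name `iter_activities` is hypothetical, and an in-memory buffer stands in for a live HTTP response body.

```python
import io
import json


def iter_activities(stream):
    """Yield parsed activities from a line-delimited JSON byte stream.

    Assumption: blank lines are keep-alive records, sent only to show the
    connection is still up, so they are skipped rather than parsed.
    """
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            continue  # keep-alive record, no payload
        yield json.loads(line)


# Stand-in for a live response body: two activities and two keep-alives.
fake_stream = io.BytesIO(b'{"id": 1}\r\n\r\n{"id": 2}\r\n\r\n')
activities = list(iter_activities(fake_stream))
print(activities)
```

In a real client, connection monitoring would watch the time since the last line of any kind (activity or keep-alive) and reconnect when that gap grows too long.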
  8. Flexibly structured
     Vis-à-vis firehoses:
     •  Emphasis on time-ordered events
     •  Combination of data and metadata, e.g. a Tweet and its number of Retweets
     •  Activity encapsulation
     •  Hierarchical structures within an activity
     Flexibly structured = "unstructured" in the normalized, set-based database sense
  9. Social media activities
     •  Tweets, micro-blogs
     •  Blog/rich-media posts
     •  Comments/threaded discussions
     •  Rich media-sharing (URLs, reposts)
     •  Location data (place, long/lat)
     •  Friend/follower relationships
     •  Engagement (e.g. Likes, up- and down-votes, reputation)
     •  Tagging
  10. Near-real time
      •  Twitter (Tweet-through-firehose-spigot): ~1.6 s (as low as 300 ms)
      •  WordPress posts (post-through-firehose-spigot): ~2.5 s (as low as 1 s)
      Is anything real time?
  11. (1) Compare time-evolution of social media reactions across firehoses
      (2) Compare richness of content across firehoses
  12. Firehoses: Twitter, WordPress posts and comments, Newsgator
      •  Filter content on key terms: "quake", "terremoto"
      •  Extract the date and time posted, group into 1-minute buckets, and plot
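The filter-and-bucket step on slide 12 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the talk's actual code: each activity is taken to be a dict with a `body` text field and a `postedTime` field in ISO 8601 UTC form, the function names are hypothetical, and the sample records are made up.

```python
from collections import Counter
from datetime import datetime, timezone

KEY_TERMS = ("quake", "terremoto")  # key terms from the slide


def matches(activity, terms=KEY_TERMS):
    """True if any key term appears in the activity text (case-insensitive)."""
    text = activity.get("body", "").lower()
    return any(term in text for term in terms)


def minute_bucket(activity):
    """Truncate the assumed 'postedTime' field to the start of its minute."""
    posted = datetime.strptime(activity["postedTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return posted.replace(second=0, microsecond=0, tzinfo=timezone.utc)


def counts_per_minute(activities):
    """Count matching activities in 1-minute buckets."""
    return Counter(minute_bucket(a) for a in activities if matches(a))


# Made-up sample records, for illustration only.
sample = [
    {"body": "Earthquake! Did anyone feel that?", "postedTime": "2012-03-20T18:02:41.000Z"},
    {"body": "Terremoto!", "postedTime": "2012-03-20T18:02:59.000Z"},
    {"body": "Lunch time", "postedTime": "2012-03-20T18:03:05.000Z"},
    {"body": "Big quake just now", "postedTime": "2012-03-20T18:03:30.000Z"},
]
counts = counts_per_minute(sample)
for minute, n in sorted(counts.items()):
    print(minute.isoformat(), n)
```

The resulting per-minute counts are the series one would plot to compare how quickly each firehose reacts.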
  13. Surprise events fit a "double-exponential" pulse in activity rate that enables consistent comparison between events and sources.
  14. R0 = 1288.150591
      alpha = 0.001470
      beta = 0.000195
      # t0 = 1332266953
      # TPeak = 1332268410
      Time-to-peak = 24.3 min
      Peak rate = 855
      Mass = 5816206.183899
      # T 1/2life = 1332272593
      1/2Life = 69.7 min
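The slides do not state the functional form of the "double-exponential" pulse. One form that reproduces the reported time-to-peak and peak rate from the fitted R0, alpha and beta on slide 14 is r(t) = R0 · (1 − e^(−alpha·t)) · e^(−beta·t), with t in seconds since the onset t0 (as the epoch timestamps suggest). The Python sketch below treats that form as an assumption and derives the summary numbers from it; the function names are hypothetical.

```python
import math

R0, ALPHA, BETA = 1288.150591, 0.001470, 0.000195  # fitted values from the slide


def rate(t):
    """Assumed pulse shape (not stated in the slides); t = seconds since t0."""
    return R0 * (1.0 - math.exp(-ALPHA * t)) * math.exp(-BETA * t)


# Setting dr/dt = 0 for the assumed form gives the peak time in closed form.
t_peak = math.log((ALPHA + BETA) / BETA) / ALPHA
peak_rate = rate(t_peak)


def half_life_after_peak(tolerance=1e-6):
    """Bisect for the time after the peak at which the rate falls to half."""
    lo, hi = t_peak, t_peak + 20.0 / BETA  # the rate is far below half at hi
    target = peak_rate / 2.0
    while hi - lo > tolerance:
        mid = (lo + hi) / 2.0
        if rate(mid) > target:
            lo = mid  # still above half the peak: move right
        else:
            hi = mid
    return lo - t_peak


half_life = half_life_after_peak()
print(f"time to peak: {t_peak / 60:.1f} min")
print(f"peak rate:    {peak_rate:.0f}")
print(f"half-life:    {half_life / 60:.1f} min")
```

Under this assumed form the computed values come out close to the slide's 24.3 min, 855 and 69.7 min; the small differences are consistent with the parameters being printed to limited precision. The difference TPeak − t0 on the slide (1,457 s) also matches the reported time-to-peak.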
  15. (1) Connect and stream data from firehoses
      (2) Preliminary filter
      (3) Store to file
      (4) Extract post times
      (5) Count activities in 1-minute buckets
      (6) Proxy of "richness": count the number of characters in content
      (7) Visualize
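Step 6 on slide 15, the character-count proxy for "richness", is simple enough to sketch directly. The Python below assumes, for illustration, that each stored activity is a dict with a `body` text field and a `source` label naming its firehose; both field names, the function names and the sample records are hypothetical.

```python
from statistics import mean


def richness(activity):
    """Proxy for richness: number of characters in the content."""
    return len(activity.get("body", ""))


def mean_richness_by_source(activities):
    """Average content length per firehose."""
    lengths = {}
    for activity in activities:
        lengths.setdefault(activity["source"], []).append(richness(activity))
    return {source: mean(values) for source, values in lengths.items()}


# Made-up sample records, for illustration only.
sample = [
    {"source": "twitter", "body": "quake!"},
    {"source": "twitter", "body": "felt it too"},
    {"source": "wordpress", "body": "A longer post describing what happened during the quake."},
]
result = mean_richness_by_source(sample)
print(result)
```

Character count is a crude measure, which is why the slide calls it a proxy; it is cheap enough to compute on every activity in a stream.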
  16. Connecting
      •  Simple HTTP streaming with cURL:
         curl --compressed -v -ushendrickson@gnip.com "https://stream.gnip.com:443/accounts/shendrickson/publishers/twitter/streams/sample10/decahose.json"
      •  Build based on libraries
      •  Off-the-shelf (OTS) solutions
  17. Connecting: considerations
      •  Disconnects
      •  Redundancy
      •  Latency
      •  Bandwidth
      •  Data bursts
      •  Costs
      •  Publisher TOS (deletes)
      •  De-dups, missing activities
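Redundant connections, or a replay after a disconnect, can deliver the same activity twice, which is where the de-duplication item on slide 17 comes in. A minimal Python sketch, assuming every activity carries a unique id and that duplicates arrive reasonably close together in the stream; the class name `RecentIdFilter` is hypothetical.

```python
from collections import OrderedDict


class RecentIdFilter:
    """Remember the most recent `capacity` activity ids and flag repeats."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._seen = OrderedDict()  # insertion order doubles as age order

    def is_new(self, activity_id):
        """Return True the first time an id is seen within the window."""
        if activity_id in self._seen:
            return False
        self._seen[activity_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # forget the oldest id
        return True


dedup = RecentIdFilter(capacity=3)
incoming = ["a", "b", "a", "c", "d", "b"]
unique = [i for i in incoming if dedup.is_new(i)]
print(unique)
```

Bounding the window keeps memory flat on a continuous stream, at the cost of missing duplicates that arrive after the original id has been forgotten.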
  18. Moving and storing
      Volumes (JSON, gzip'd):
      •  100M Tweets = 25 GB
         < 2 min @ 300 MB/s (SATA II)
         < 6 hrs @ 10 Mb/s (Ethernet)
      •  1 day of WordPress.com posts = 350 MB
      •  File system
      •  NoSQL/key-value stores: flexible structure
      •  Relational DB stores: indexes rock
      •  Message queues
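The transfer times on slide 18 can be checked with back-of-the-envelope arithmetic. The Python sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mb = 10^6 bits) and ignores protocol overhead; the helper name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(size_bytes, rate_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead."""
    return size_bytes * 8 / rate_bits_per_second


ARCHIVE_BYTES = 25e9   # 100M Tweets as gzip'd JSON (from the slide)
SATA_II = 300e6 * 8    # 300 MB/s expressed in bits per second
ETHERNET = 10e6        # 10 Mb/s

bytes_per_tweet = ARCHIVE_BYTES / 100e6
sata_minutes = transfer_seconds(ARCHIVE_BYTES, SATA_II) / 60
ethernet_hours = transfer_seconds(ARCHIVE_BYTES, ETHERNET) / 3600

print(f"{bytes_per_tweet:.0f} bytes per compressed Tweet")
print(f"SATA II:  {sata_minutes:.1f} min")
print(f"Ethernet: {ethernet_hours:.1f} hrs")
```

Both results fall inside the slide's bounds of under 2 minutes and under 6 hours, and the archive works out to roughly 250 compressed bytes per Tweet.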
  19. Filter
      •  Model: guess at structure and process
      •  Parse: sort out the pieces
      •  Filter: reduce to what matters
      •  Aggregate: cluster, sum, average…
      •  Analyze: tell the story with data
  20. Speed vs. Depth
      Evolution
  21. Network dynamics: influencers, path analysis, viral spread…
      Time dynamics: time to peak, story half-life…
      Natural language processing: "aboutness" is hard, but gets easier as the domain narrows
      Explore and deploy: master skills, shorten cycles of exploration, move learning to production
  22. www.gnip.com
      Twitter: @drskippy27