
We are losing our tweets!


Lessons learned from TwapperKeeper prototype.

Published in: Technology

  1. We are losing our tweets!<br />An analysis, a prototype, lessons learned, and a proposed third-party solution to the problem<br />John O’Brien III<br />@jobrieniii<br /><br />
  2. Twitter “Primer”<br />Social network / micro-blogging site<br />Send / read 140-character messages<br />You can follow anyone, and they can follow you<br />Sent messages are delivered to all your followers<br />Sent messages are also publicly indexed and searchable<br />Permissions can be established to restrict delivery, but this is not the norm<br />
  3. Problem<br />As the usage of Twitter has exploded, Twitter’s ability to provide long term access to tweets that mention key events (typically #hashtag’ed) has eroded<br />
  4. First, who cares?<br />Individuals<br />Bloggers<br />Conference Attendees / Leaders<br />Academia / “Web” Ecologists<br />Media Outlets<br />Companies<br />Government<br />
  5. So let’s dive into the problem...<br />Followers<br />Search<br />
  6. Search UI / API Constraints<br />Limited to keywords, #hashtags, or @mentions within the 140-char body of a tweet <br />100 tweets x 15 pages = 1,500 per search term<br />For a given keyword, tweets exist in search for “around 1.5 weeks but is dynamic and subject to shrink as the number of tweets per day continues to grow.” – Twitter website<br />
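The 100-tweets-per-page, 15-page ceiling above can be sketched as a paging loop. This is a hypothetical Python sketch for illustration (the prototype itself was PHP); `fetch_page` is an assumed stand-in for a call to the Twitter /search API, not an actual client.

```python
def fetch_all_pages(fetch_page, max_pages=15, per_page=100):
    """Page through search results until the API cap or a short page.

    fetch_page(page, per_page) -> list of tweets for that page; the
    Twitter /search API stops at 15 pages of 100, i.e. 1,500 per term.
    """
    tweets = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        tweets.extend(batch)
        if len(batch) < per_page:  # short page: no more results
            break
    return tweets
```

However large the hashtag's real volume, at most `max_pages * per_page` = 1,500 tweets ever come back, which is the heart of the data-loss problem.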
  7. Hmmmm….<br />No other ‘in the cloud’ sites were found back in June, only client-side applications and ‘hacked’ custom scripts<br />RSS feeds were considered but initially dismissed because they typically require an end-user client<br />The decision was to “build our own” and see if we could solve the problem<br />
  8. A little bit about my thoughts on the SDLC process…<br />**FOCUS**<br />ON<br />LEARNING<br />“Minimally Viable”<br />PROTOTYPE<br />
  9. “Minimally Viable” Micro App<br />What if we could get ahead of the problem and store the data before Twitter “loses” it?<br />Functional Requirements<br />Ability for user to define #hashtags of importance<br />Create a background script that leverages the Twitter /search REST API to keep an eye on each hashtag and store data in a local database<br />**Sweep, grab, and record…**<br />Must be running at all times and publicly available<br />Technical Specs<br />Build on a LAMP stack, put into the cloud, running 24/7/365<br />
  10. “Minimally Viable” Micro App<br />internet<br />php script to <br />query each<br />#hashtag<br />Twitter<br />/search<br />API<br />Our Database<br />
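The sweep-grab-record loop on the two slides above can be sketched as follows. This is an illustrative Python sketch (the prototype was a PHP script on LAMP); `search` and `store` are assumed stand-ins for the Twitter /search call and the database insert, and the `since_id` bookkeeping is an assumption about how re-fetching old tweets is avoided.

```python
def sweep(hashtags, search, store, since_ids):
    """One pass of the archiver: query each tracked #hashtag and
    record anything newer than what has already been stored.

    search(tag, since_id) -> list of {"id": int, "text": str} dicts
    store(tag, tweet)     -> persist one tweet (e.g. a DB insert)
    since_ids             -> per-tag high-water mark, updated in place
    """
    for tag in hashtags:
        for tweet in search(tag, since_ids.get(tag, 0)):
            store(tag, tweet)
            # remember the newest id so the next sweep skips old tweets
            since_ids[tag] = max(since_ids.get(tag, 0), tweet["id"])
```

Run on a timer, 24/7/365, this is the whole "minimally viable" app: sweep, grab, and record.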
  11. “BETA” was born on Saturday and released to the public on Sunday…<br />
  12. And we started to grow and get customer feedback…<br />
  13. And we lived through a key world event…<br /><br />
  14. So what did we learn?<br />We need to be whitelisted<br />People often don’t start archiving until after they start using #hashtags<br />Thus, a point-forward solution is not enough; we need to reach back as well<br />While hashtags are the norm, some people would just like to track keywords<br />Velocity of tweets can be a major issue<br />What if a hashtag’s results exceed 1,500 tweets per minute? <br />Hashtags of archive interest typically spike in velocity and then die off in traffic.<br />However, some archives get VERY, VERY big!<br />
  15. And more learning…<br />URL shortening services are a long-standing concern to users and the archiving community<br />The Twitter /search REST API is periodically unresponsive <br />The Twitter /search REST API sometimes glitches and returns duplicate data<br />People want not only output in HTML, but raw exports for publication, analysis, and real-time consumption (txt, csv, xml, json, etc.)<br />Twitter engineers contacted us and recommended also incorporating the newly released real-time streams <br />/track, /sample, /firehose<br />
  16. Recommended “out-of-beta” V2.0<br />Anticipate #hashtags to archive based upon Twitter trending stats and auto-create archives<br />Hybrid approach of using the /search and /track (real-time stream) APIs to handle velocity issues<br />Check for duplicates “before” inserts<br />Implement monitoring and “self-healing” services<br />Shortened URLs should be resolved into fully qualified URLs and stored separately for reference (at time of capture)<br />Create a TwapperKeeper API by modularizing the archiving engine into an SOA architecture (/create, /info, /get) for internal and external consumption<br />Include additional output formats to be provided for download<br />“Extracts” of large archives should be automatically generated on a daily basis and made available for download<br />VERSION <br />2.0<br />
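The "check for duplicates before inserts" recommendation can be enforced at the database level by keying the archive on the tweet id. A hypothetical Python/SQLite sketch follows (the actual V2.0 design targeted MySQL on the LAMP stack; the table and column names here are assumptions for illustration):

```python
import sqlite3

def open_archive():
    """Create an archive table keyed on the tweet id, so a /search
    glitch that returns the same tweet twice cannot duplicate rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweets ("
        "  id INTEGER PRIMARY KEY,"   # tweet id: duplicates rejected
        "  hashtag TEXT,"
        "  body TEXT)"
    )
    return conn

def archive_tweet(conn, tweet_id, hashtag, body):
    """Insert a tweet; return True if it was new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO tweets (id, hashtag, body) VALUES (?, ?, ?)",
        (tweet_id, hashtag, body),
    )
    return cur.rowcount == 1
```

Letting the primary key reject duplicates is simpler and safer than a read-then-write check, which can race when the /search and /track feeds deliver the same tweet concurrently.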
  17. Recommended “out-of-beta” V2.0<br />Twitter<br />/track<br />API<br />hybrid php / curl script<br />to archive per #hashtag<br />Monitor Health and Self Heal<br />Twitter<br />/search<br />API<br />auto<br />create trends<br />Twitter<br />/trends<br />API<br />Our Database<br />File extractor<br />api<br />/create<br />/info<br />/get<br />external sites<br />short url lookup<br />
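The "short url lookup" box in the diagram above, which resolves shortened URLs into fully qualified URLs at capture time, amounts to following HTTP redirects to their end. A Python sketch with the redirect lookup injected; `get_location` is an assumed helper that would issue a HEAD request against the shortener (bit.ly, etc.) and return its Location header:

```python
def resolve_short_url(url, get_location, max_hops=5):
    """Follow a chain of redirects to the fully qualified URL.

    get_location(url) -> the Location header of a redirect response,
    or None if the URL does not redirect (i.e. it is the final URL).
    max_hops guards against chains of stacked shorteners; the seen
    set guards against redirect loops.
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:        # loop detected: give up, keep what we have
            break
        seen.add(url)
        target = get_location(url)
        if target is None:     # no redirect: fully resolved
            break
        url = target
    return url
```

Storing the resolved URL separately at capture time matters because shortener services can disappear, taking the mapping with them.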
  18. Questions?<br />