We are losing our tweets!

Lessons learned from TwapperKeeper prototype.

Published in: Technology

    1. We are losing our tweets!<br />An analysis, a prototype, lessons learned, and a proposed third-party solution to the problem<br />John O’Brien III<br />@jobrieniii<br />http://www.linkedin.com/in/jobrieniii<br />
    2. Twitter “Primer”<br />Social network / microblogging site<br />Send / read 140-character messages<br />You can follow anyone, and they can follow you<br />Sent messages are delivered to all your followers<br />Sent messages are also publicly indexed and searchable<br />Permissions can be established to restrict delivery, but this is not the norm<br />
    3. Problem<br />As the usage of Twitter has exploded, Twitter’s ability to provide long-term access to tweets that mention key events (typically #hashtag’ed) has eroded<br />
    4. First, who cares?<br />Individuals<br />Bloggers<br />Conference Attendees / Leaders<br />Academia / “Web” Ecologists<br />Media Outlets<br />Companies<br />Government<br />
    5. So let’s dive into the problem...<br />Followers<br />Search<br />
    6. Search UI / API Constraints<br />Searches are limited to keywords, #hashtags, or @mentions within the 140-character body of a tweet<br />100 tweets x 15 pages = 1500 tweets per search term<br />A tweet matching a given keyword remains in search for “around 1.5 weeks but is dynamic and subject to shrink as the number of tweets per day continues to grow.” – Twitter website<br />
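The 100 x 15 ceiling on slide 6 can be illustrated with a minimal pagination sketch. Python is used purely for illustration (the prototype itself was PHP), and `fetch_page` is a hypothetical callable standing in for a single search request; it is not a real Twitter client function.

```python
def collect_search_results(fetch_page, per_page=100, max_pages=15):
    """Gather results page by page until the search ceiling is reached.

    fetch_page(page, per_page) is a hypothetical stand-in for one search
    request; it should return a list with at most per_page tweets.
    """
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        results.extend(batch)
        if len(batch) < per_page:
            # A short page means there is nothing further to fetch.
            break
    return results


def make_fake_search(total_matching):
    """Build a fake fetch_page over total_matching numbered tweets."""
    tweets = list(range(total_matching))

    def fetch_page(page, per_page):
        start = (page - 1) * per_page
        return tweets[start:start + per_page]

    return fetch_page
```

With 2000 matching tweets, only the 1500 that fit in 15 pages come back; the remaining 500 are unreachable through search alone.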
    7. Hmmmm….<br />No other ‘in the cloud’ sites were found back in June, only client side applications and ‘hacked’ custom scripts<br />RSS feeds were considered but initially dismissed because they typically require an end user client<br />Decision was to “build our own” and see if we can solve the problem<br />
    8. A little bit about my thoughts on the SDLC process…<br />**FOCUS ON LEARNING**<br />“Minimally Viable” PROTOTYPE<br />
    9. “Minimally Viable” Micro App<br />What if we could get ahead of the problem and store the data before Twitter “loses” it?<br />Functional Requirements<br />Ability for user to define #hashtags of importance<br />Create a background script that leverages the Twitter /search REST API to keep an eye on each hashtag and store data in a local database<br />**Sweep, grab, and record…**<br />Must be running at all times and publicly available<br />Technical Specs<br />Build on LAMP stack, put into the cloud, running 24/7/365<br />
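The "sweep, grab, and record" loop on slide 9 might look roughly like the sketch below. It is not the TwapperKeeper code: the original was a PHP script on a LAMP stack, whereas this uses Python with SQLite for a self-contained illustration. The table layout, the `search` callable, and the tweet dictionary keys (`id`, `author`, `body`) are all assumptions.

```python
import sqlite3
import time


def open_archive(path=":memory:"):
    """Open (or create) the local archive database. Schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "hashtag TEXT NOT NULL, tweet_id INTEGER, author TEXT, body TEXT)"
    )
    return db


def sweep_once(db, hashtags, search):
    """One sweep: query every tracked hashtag and record what comes back.

    search(hashtag) is a hypothetical stand-in for a call to the /search
    REST API; it should return a list of dicts with id, author, and body.
    """
    recorded = 0
    for tag in hashtags:
        for tweet in search(tag):
            db.execute(
                "INSERT INTO tweets (hashtag, tweet_id, author, body) "
                "VALUES (?, ?, ?, ?)",
                (tag, tweet["id"], tweet["author"], tweet["body"]),
            )
            recorded += 1
    db.commit()
    return recorded


def run(db, hashtags, search, interval_seconds=60, max_sweeps=None):
    """Repeat sweeps; max_sweeps=None keeps the loop running indefinitely."""
    done = 0
    while max_sweeps is None or done < max_sweeps:
        sweep_once(db, hashtags, search)
        done += 1
        time.sleep(interval_seconds)
```

This naive version records whatever each sweep returns, so overlapping sweeps store the same tweet more than once.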
    10. “Minimally Viable” Micro App<br />Diagram components: internet; PHP script to query each #hashtag; Twitter /search API; our database<br />
    11. TwapperKeeper.com “BETA” was born on Saturday and released to the public on Sunday…<br />
    12. And we started to grow and get customer feedback…<br />
    13. And we lived through a key world event…<br />http://mashable.com/2009/09/16/white-house-records/<br />
    14. So what did we learn?<br />We need to be whitelisted<br />People often don’t start archiving until after they have started using a #hashtag<br />Thus, a point-forward solution is not enough; we need to reach back as well<br />While hashtags are the norm, some people would just like to track keywords<br />Velocity of tweets can be a major issue<br />What if a hashtag’s results exceed 1500 tweets per minute?<br />Hashtags of archive interest typically spike in velocity and then die off in traffic.<br />However, some archives get VERY, VERY big!<br />
    15. And more learning…<br />URL shortening services are of long-term concern to users and the archiving community<br />Twitter /search REST API is periodically unresponsive<br />Twitter /search REST API sometimes glitches and returns duplicate data<br />People want not only output in HTML, but also raw exports for publication, analysis, and real-time consumption (txt, csv, xml, json, etc.)<br />Twitter engineers contacted us and recommended also incorporating the newly released real-time streams<br />/track, /sample, /firehose<br />
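The velocity question on slide 14 reduces to simple arithmetic: if search exposes at most 1500 tweets per term, a poller has to return before 1500 new tweets have arrived. Here is a small sketch of that calculation, assuming a steady arrival rate (real hashtags spike, as the slide notes); the function name is illustrative only.

```python
def longest_safe_poll_interval(tweets_per_minute, window=1500):
    """Longest gap between polls, in seconds, that still sees every tweet.

    Assumes a steady arrival rate and that a single poll can retrieve at
    most `window` tweets (the 100 x 15 search ceiling).
    """
    if tweets_per_minute <= 0:
        raise ValueError("tweets_per_minute must be positive")
    return 60.0 * window / tweets_per_minute
```

At exactly 1500 tweets per minute the poller must complete a full sweep every 60 seconds. At 3000 per minute it has only 30 seconds, which is where a search-only approach starts to struggle.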
    16. Recommended “out-of-beta” V2.0<br />Anticipate #hashtags to archive based upon Twitter trending stats and auto-create archives<br />Hybrid approach of using /search and /track (real-time stream) APIs to handle velocity issues<br />Check for duplicates “before” inserts<br />Implement monitoring and “self-healing” services<br />Shortened URLs should be resolved into fully qualified URLs and stored separately for reference (at time of capture)<br />Create TwapperKeeper API by modularizing the archiving engine into a service-oriented architecture (/create, /info, /get) for internal and external consumption<br />Include additional output formats to be provided for download<br />“Extracts” of large archives should be automatically generated on a daily basis and made available for download<br />
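The "check for duplicates before inserts" recommendation on slide 16 could be sketched as follows. This is an illustration in Python with SQLite, not the V2.0 design itself. The schema and function names are assumptions, including the choice to key on the (tweet ID, hashtag) pair so that one tweet can still be archived under several hashtags.

```python
import sqlite3


def open_dedup_archive(path=":memory:"):
    """Open an archive keyed on (tweet_id, hashtag). Schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "tweet_id INTEGER NOT NULL, hashtag TEXT NOT NULL, body TEXT, "
        "PRIMARY KEY (tweet_id, hashtag))"
    )
    return db


def record_if_new(db, tweet_id, hashtag, body):
    """Insert the tweet only if this archive does not already hold it.

    Returns True when a row was written and False when it was a duplicate,
    for example the same tweet arriving from both /search and /track.
    """
    already_stored = db.execute(
        "SELECT 1 FROM tweets WHERE tweet_id = ? AND hashtag = ?",
        (tweet_id, hashtag),
    ).fetchone()
    if already_stored:
        return False
    db.execute(
        "INSERT INTO tweets (tweet_id, hashtag, body) VALUES (?, ?, ?)",
        (tweet_id, hashtag, body),
    )
    db.commit()
    return True
```

The same check covers both duplicate sources named in the deck: glitchy /search responses and the overlap created by running /search and /track side by side.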
    17. Recommended “out-of-beta” V2.0<br />Diagram components: Twitter /track API; Twitter /search API; hybrid PHP / curl script to archive per #hashtag; monitor health and self-heal; Twitter /trends API; auto-create trends; our database; file extractor; API (/create, /info, /get); external sites; short URL lookup<br />
    18. Questions?<br />
