2. High Level Server Configuration – as of 4/25/2010
[Architecture diagram: Twitter feeds the Primary Twapper Keeper server via the /track Streaming API and the /search REST API. The primary server runs mysql, apache, and jobs, and runs an export to the Backup Twapper Keeper server, which extracts the backup.]
3. Primary Archiving Process for Hashtags and Keywords – as of 4/25/2010
[Diagram: Twitter /track Streaming API → Primary Twapper Keeper server (apache, mysql, jobs; tables: twappers, rawstream, and archive tables z_{notebook_id})]
phire_track.php – creates a persistent connection to the Twitter Streaming API for each archive predicate listed in twappers, and stores tweets in the rawstream table [1 instance]
sweeper_stream.php – pulls chunks of tweets, compares them to the archives on file, inserts them into the proper archive table, and deletes them from processing when each chunk is complete [11 instances]
4. New / Missed Archiving Process for Hashtags and Keywords – as of 4/25/2010
[Diagram: Twitter /search REST API → Primary Twapper Keeper server (apache, mysql, jobs; tables: twappers and archive tables z_{notebook_id})]
sweeper_new.php – when an archive is first created, it is flagged as new. This process picks up new archives and uses the Twitter /search REST API to find all available tweets and store them in the appropriate archive. Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: calls are sometimes rate limited, so the process must ensure some data is received before unflagging.)
sweeper_missed.php – for each archive in the system, this constantly crawls the Twitter /search REST API for all tweets in the Twitter cache, checks that they have been archived, and archives them if not. This is important during Streaming API reconnects or prolonged stream disconnects.
5. New / Primary Archiving Process for Persons – as of 4/25/2010
[Diagram: Twitter /timeline REST API → Primary Twapper Keeper server (apache, mysql, jobs; tables: twappers and archive tables z_{notebook_id})]
sweeper_personal_new.php – when an archive is first created, it is flagged as new. This process picks up new archives and uses the Twitter /timeline REST API to find all available tweets and store them in the appropriate archive. Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: calls are sometimes rate limited, so the process must ensure some data is received before unflagging.)
sweeper_personal.php – for each person archive in the system, this constantly crawls the Twitter /timeline REST API for all public tweets in the user's timeline, checks that they have been archived, and archives them if not.
7. Users do not have automatic visibility into system issues and have to ask questions about their archives too often, especially when tweets are queued in the RAWSTREAM table awaiting processing
8. There is contention for access to the RAWSTREAM table, which slows the inbound /track Streaming API HTTP connection and produces 30–40 minutes of latency; this can cause lost tweets during reconnects (which happen whenever new archives are created)
10. Upgrade the Primary Twapper Keeper server's VPS to a configuration with more CPU and RAM
12. Incorporate an automated export / import routine for the backup server, and rsync the apache and jobs files, so the secondary server can be brought online more rapidly
13. Implement a layer of monitoring for the administrator, and provide a "system health" page to give users transparency
14. Provide a user-facing feature to "reset" archives, kicking off the requery process for users in a hurry to see tweets (this forces a re-call of /search instead of waiting on the Streaming API)
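The "reset" feature in item 14 could be as simple as re-flagging an existing archive as new, so the new-archive sweeper re-queries /search on its next pass. A minimal Python sketch of that idea (the real implementation would be PHP against the TWAPPERS table; the dict and the is_new field are assumptions):

```python
# Hypothetical sketch of the "reset" feature: re-flag an archive as new so
# the backfill job re-calls /search immediately instead of waiting on the
# Streaming API. The dict stands in for the TWAPPERS table; the "is_new"
# field name is an assumption.

def reset_archive(twappers: dict, archive_name: str) -> bool:
    """Flag an existing archive as new again; return False if unknown."""
    row = twappers.get(archive_name)
    if row is None:
        return False
    row["is_new"] = True  # the new/missed sweepers pick this up next pass
    return True

twappers = {"#foo": {"is_new": False}}
print(reset_archive(twappers, "#foo"), twappers["#foo"]["is_new"])
```

Because the flag is the only state touched, the reset is cheap and idempotent: repeated resets simply leave the archive queued for the next backfill pass.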
This presentation outlines the operational and infrastructure upgrades planned to improve the accuracy and stability of the archiving platform.
The following document outlines the high-level configuration of the Twapper Keeper archiving platform. For keyword and hashtag archives: the /track Streaming API is the primary input. The Twitter RESTful /search API is used when archives are first created to pull back older tweets and quickly fill the archive with data. The /search API is also continually called for all archives in round-robin fashion to try to ensure no missed tweets. For person archives: the RESTful /timeline API is called in round-robin fashion to get all public tweets from a person's timeline. (When a person archive is first created, it is flagged as new and given priority to quickly fill the archive.)
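The split of archive types across Twitter APIs described above can be summarized in a small sketch. This is purely illustrative (the real jobs are PHP scripts); the function name and return strings are assumptions:

```python
# Illustrative sketch of which Twitter API each archive type relies on,
# per the configuration described in the notes. The actual system is a
# set of PHP jobs; this function is a hypothetical summary.

def data_sources(archive_type: str, is_new: bool) -> list[str]:
    """Return the Twitter APIs consulted for an archive type."""
    if archive_type in ("keyword", "hashtag"):
        # /track is the primary input; /search backfills new archives
        # and is round-robined over all archives to catch missed tweets.
        return ["/track (streaming, primary)", "/search (backfill + missed)"]
    if archive_type == "person":
        # /timeline is polled round-robin; new archives get priority.
        return ["/timeline (round-robin%s)" % (", priority" if is_new else "")]
    raise ValueError("unknown archive type: %r" % archive_type)

print(data_sources("hashtag", is_new=False))
print(data_sources("person", is_new=True))
```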
This slide provides more detail on the primary hashtag and keyword processing. When a new archive is created by the user, it is stored in the TWAPPERS table and a separate z_ table is created for the archive. The phire_track.php job (a) constantly looks for new additions (roughly every 60 seconds) and reconnects the stream to the Twitter /track endpoint with the updated set of predicates. All data is stored in RAWSTREAM as it is received from Twitter over the HTTP connection. Eleven instances of sweeper_stream.php run to offload batches of data from RAWSTREAM, process them, and bin them into the appropriate archive z_ table. (a) This is based upon the phirehose library – http://code.google.com/p/phirehose/
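The producer/consumer split above — one cheap write into RAWSTREAM, with matching and binning deferred to the sweepers — can be sketched as follows. This is a minimal Python illustration, assuming simple substring matching and in-memory lists in place of the PHP jobs and MySQL tables:

```python
# Minimal sketch of the producer/consumer split described above. The real
# system uses phire_track.php plus 11 sweeper_stream.php instances over
# MySQL; here lists/dicts stand in for the tables, and predicate matching
# is a simple (assumed) substring test.

rawstream = []                      # stands in for the RAWSTREAM table
archives = {"#foo": [], "bar": []}  # stands in for z_{notebook_id} tables

def phire_track(tweet: dict) -> None:
    """Producer: dump every tweet from the /track stream into RAWSTREAM.

    Keeping this write dirt-cheap is the point -- matching tweets to
    archives is deferred so the stream connection never blocks."""
    rawstream.append(tweet)

def sweeper_stream(chunk_size: int = 2) -> None:
    """Consumer: take a chunk off RAWSTREAM, bin each tweet into every
    archive whose predicate it matches, then delete the chunk."""
    chunk = rawstream[:chunk_size]
    for tweet in chunk:
        for predicate, table in archives.items():
            if predicate.lower() in tweet["text"].lower():
                table.append(tweet)
    del rawstream[:chunk_size]  # chunk fully processed; remove it

phire_track({"text": "loving #foo today"})
phire_track({"text": "bar none"})
sweeper_stream()
print(len(archives["#foo"]), len(archives["bar"]), len(rawstream))
```

Running eleven sweeper instances, as the slide describes, parallelizes the expensive matching step while the single stream writer stays fast.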
This slide provides more detail on the new / missed hashtag and keyword processing. This uses the RESTful /search API to quickly backfill archives with data already stored in Twitter's search index, as well as to find any tweets missed by the Streaming API (for any reason). sweeper_new.php looks for archives that are flagged as new and runs a search against the Twitter API (results are set to a max of 100 per page, paginated from page 1 to 15). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_missed.php crawls each archive that is no longer flagged as new and continually calls the search API (max 100, pages 1–15) to try to find any missed tweets. All search queries leverage a mildly modified php-twitter library – http://code.google.com/p/php-twitter/
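The backfill loop and the "don't unflag until data is received" rule can be sketched as below. Function and parameter names are assumptions (the real job is sweeper_new.php over a modified php-twitter library); `search_page` stands in for a /search call returning up to 100 tweets per page:

```python
# Hedged sketch of the /search backfill described above. `search_page`
# is a stand-in for the Twitter /search call; names are assumptions.

def backfill_new_archive(predicate, search_page, archive):
    """Pull pages 1-15 (max 100 results each) and store everything.

    Returns True if the archive received data and can be unflagged as
    new; False if nothing came back (e.g. a rate-limited call), in which
    case it stays flagged and is retried -- matching the slide's note
    that some data must be received before unflagging."""
    got_data = False
    for page in range(1, 16):            # pages 1-15, per the notes
        results = search_page(predicate, page=page, rpp=100)
        if not results:
            break                        # exhausted (or rate limited)
        archive.extend(results)
        got_data = True
    return got_data

# Usage with a fake search that returns two pages of three tweets each:
fake = lambda q, page, rpp: (
    ["%s-tweet-%d-%d" % (q, page, i) for i in range(3)] if page <= 2 else []
)
archive = []
print(backfill_new_archive("#foo", fake, archive), len(archive))
```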
This slide provides more detail on the new / primary archiving process for person archives. This uses the RESTful /timeline API to quickly backfill archives with data already stored in the public user timeline, and then continually crawls the timeline to find more data. sweeper_personal_new.php looks for person archives that are flagged as new and runs a Twitter /timeline API call (results are set to a max of 200 per page, paginated from page 1 to 16). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_personal.php crawls each archive that is no longer flagged as new and continually calls the Twitter /timeline API (max 200, pages 1–16) to try to find any missed tweets. All timeline queries leverage a mildly modified php-twitter library – http://code.google.com/p/php-twitter/
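The continual round-robin crawl, with its "check whether the tweet has already been archived" step, can be sketched as follows. `fetch_timeline_page` is a stand-in for the /timeline call (max 200 per page, pages 1–16, per the slide); deduplication by tweet id is an assumed implementation of the archived-check:

```python
# Hedged sketch of the round-robin /timeline crawl described above. The
# real job is sweeper_personal.php; names here are assumptions. Dedup by
# tweet id mirrors "checks to make sure they have been archived, and if
# not archives them."
from itertools import cycle

def crawl_person_archives(archives, fetch_timeline_page, rounds):
    """Visit each person archive in turn, storing only unseen tweets."""
    for _, user in zip(range(rounds), cycle(archives)):
        seen = {t["id"] for t in archives[user]}
        for page in range(1, 17):                  # pages 1-16
            tweets = fetch_timeline_page(user, page=page, count=200)
            if not tweets:
                break
            archives[user].extend(t for t in tweets if t["id"] not in seen)
            seen.update(t["id"] for t in tweets)

# Usage: a second pass over the same timeline adds nothing new.
data = {"alice": [{"id": 1, "text": "hi"}, {"id": 2, "text": "yo"}]}
fetch = lambda user, page, count: data[user] if page == 1 else []
archives = {"alice": []}
crawl_person_archives(archives, fetch, rounds=2)
print(len(archives["alice"]))
```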