Twapper Keeper Operations / Infrastructure Enhancements, 04/25/2010
Issues / Concerns: Server is running at too high a load rate and is running out of disk space
Users do not have automatic visibility into system issues and have to ask questions about archives too often, especially when items are queued in the RAWSTREAM table awaiting processing

Twapper Keeper Ops / Infrastructure Enhancements

  • 1. Twapper Keeper Operations / Infrastructure Enhancements, 04/25/2010
  • 2. High Level Server Configuration – as of 4/25/2010. Diagram labels: Twitter (/track Streaming API, /search REST API); Primary Twapper Keeper Server (mysql, apache, jobs); export / extract; Backup Twapper Keeper Server.
  • 3. Primary Archiving Process for Hashtags and Keywords – as of 4/25/2010. phire_track.php [1 instance]: creates a persistent connection to the Twitter /track Streaming API for the archive predicates listed in twappers and stores tweets in the rawstream table. sweeper_stream.php [11 instances]: pulls chunks of tweets from rawstream, compares them to the archives on file, inserts them into the proper archive table (z_{notebook_id}), and deletes them from processing when the chunk is complete.
  • 4. New / Missed Archiving Process for Hashtags and Keywords – as of 4/25/2010. sweeper_new.php: when an archive is first created, it is flagged as new. This process picks up new archives, uses the Twitter /search REST API to find all available tweets, and stores them in the appropriate archive table (z_{notebook_id}). Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: calls are sometimes rate limited, so the process needs to ensure some data is received before unflagging.) sweeper_missed.php: for each archive in the system, this process constantly crawls the Twitter /search REST API for all tweets in the Twitter cache, checks that they have been archived, and archives any that have not. This is important during Streaming API reconnects or prolonged stream disconnects.
  • 5. New / Primary Archiving Process for Persons – as of 4/25/2010. sweeper_personal_new.php: when a person archive is first created, it is flagged as new. This process picks up new archives, uses the Twitter /timeline REST API to find all available tweets, and stores them in the appropriate archive table (z_{notebook_id}). Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: calls are sometimes rate limited, so the process needs to ensure some data is received before unflagging.) sweeper_personal.php: for each person archive that is no longer flagged as new, this process constantly crawls the Twitter /timeline REST API, checks that the tweets it finds have been archived, and archives any that have not.
  • 6.
  • 7. Users do not have automatic visibility into system issues and have to ask questions about archives too often, especially when items are queued in the RAWSTREAM table awaiting processing
  • 8. Contention for access to the RAWSTREAM table slows the inbound /STREAM HTTP connection, resulting in 30-40 min of latency, which can cause lost tweets during reconnects (which happen when new archives are created)
  • 9.
  • 10. Upgrade the Primary Twapper Keeper server VPS configuration to include more CPU and RAM
  • 11. Procure additional disk space for the Primary and Backup servers
  • 12. Incorporate an automated export / import routine for the backup server, and rsync the apache and jobs files, to ensure the secondary server can be brought online more rapidly
  • 13. Implement a layer of monitoring for the administrator and provide a “system health” page for users for transparency
  • 14. Provide a user feature to “reset” archives to kick off the requery process if they are in a hurry to see tweets (will force a new call to /search instead of waiting for the Streaming API)
  • 15. Implement OAuth for all REST API calls
  • 16. Implement an approach to reduce contention on RAWSTREAM table to reduce latency on inbound Streaming API
  • 18. Target completion by May 14, 2010
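
The chunked processing that slide 3 describes for sweeper_stream.php can be illustrated with a small in-memory sketch. This is not the original PHP job: it is a Python stand-in, and the function name `sweep_chunk`, the list standing in for the RAWSTREAM table, the dict standing in for the twappers table, and the case-insensitive substring match are all assumptions made for illustration.

```python
from collections import defaultdict


def sweep_chunk(rawstream, archives, chunk_size=100):
    """Bin one chunk of raw tweets into archive tables (hypothetical sketch).

    rawstream: list of dicts with "id" and "text"; stands in for RAWSTREAM.
    archives:  dict of archive id -> lowercase predicate; stands in for twappers.
    Returns a dict mapping table names such as "z_7" to lists of tweets.
    """
    chunk = rawstream[:chunk_size]
    binned = defaultdict(list)
    for tweet in chunk:
        text = tweet["text"].lower()
        for archive_id, predicate in archives.items():
            # Assumption: plain substring matching; the real matching rules
            # are not given in the slides.
            if predicate in text:
                binned[f"z_{archive_id}"].append(tweet)
    # Remove the chunk from the queue only after it has been fully binned.
    del rawstream[:len(chunk)]
    return dict(binned)
```

Keeping each chunk small limits how long a worker holds rows in the queue, which is one way to think about the contention concern on the RAWSTREAM table; the slides do not say how the eleven instances actually divide the work.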

Editor's Notes

  1. This presentation outlines the operational / infrastructure upgrades planned to improve the accuracy and stability of the archiving platform.
  2. This slide outlines the high-level configuration of the Twapper Keeper archiving platform. For Keyword and Hashtag archives: The /track Streaming API is the primary input. The Twitter RESTful /search API is used when archives are first created to pull back older tweets and quickly fill the archive with data. The /search API is also called continually for all archives in a round-robin fashion to try to ensure no tweets are missed. For Person archives: The RESTful /timeline API is called in a round-robin fashion to get all public tweets from a person's timeline. (When a Person archive is first created, it is flagged as new and given priority to quickly fill the archive.)
  3. This slide provides more detail on the primary hashtag and keyword processing. When new archives are created by the user, they are stored in the TWAPPERS table and a separate z_table is created for the archive. The phire_track.php job (a) constantly looks for new additions (roughly every 60 seconds) and reconnects to the Twitter /track stream with the various predicates. All data is stored in RAWSTREAM as it is received from Twitter via the HTTP connection. Eleven instances of sweeper_stream.php run to offload stacks of data from RAWSTREAM, process them, and bin them in the appropriate archive z_table. (a) This is based upon the phirehose library - http://code.google.com/p/phirehose/
  4. This slide provides more detail on the new / missed hashtag and keyword processing. This uses the RESTful /search API to quickly backfill archives with data already stored in Twitter’s search, as well as to find any tweets that may have been missed by the Streaming API (for any reason). sweeper_new.php looks for archives that are flagged as new and runs a search against the Twitter API (results are set to max 100, pagination from pg 1-15). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_missed.php crawls each archive flagged as “not” new and continually calls the search API (max 100, pg 1-15) to try to find any missed tweets. All search queries leverage a mildly modified php-twitter library - http://code.google.com/p/php-twitter/
  5. This slide provides more detail on the new / primary archiving process for Person archives. This uses the RESTful /timeline API to quickly backfill archives with data already stored in the public user timeline and to continually crawl the user timeline to find more data. sweeper_personal_new.php looks for person archives that are flagged as new and runs a Twitter /timeline API call (results are set to max 200, pagination from pg 1-16). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_personal.php crawls each archive flagged as “not” new and continually calls the Twitter /timeline API (max 200, pg 1-16) to try to find any missed tweets. All timeline queries leverage a mildly modified php-twitter library - http://code.google.com/p/php-twitter/
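
The paged backfill and the "unflag only after data arrives" rule described in notes 4 and 5 can be sketched as follows. This is a Python illustration under stated assumptions, not the original PHP: `backfill_archive`, the `fetch_page` callable, and the archive dict layout are hypothetical, and stopping at the first empty page is an assumption the notes do not confirm. The defaults of 15 pages and 100 results per page mirror the /search limits quoted above.

```python
def backfill_archive(archive, fetch_page, max_pages=15, per_page=100):
    """Page through search results and unflag the archive once it has data.

    archive:    dict with "predicate", "is_new", and "tweets" (id -> tweet).
    fetch_page: hypothetical callable (predicate, page, per_page) -> list of
                tweet dicts; it returns an empty list when rate limited.
    """
    for page in range(1, max_pages + 1):
        results = fetch_page(archive["predicate"], page, per_page)
        if not results:
            # Assumption: an empty page means no more data (or a rate limit).
            break
        for tweet in results:
            # Only store tweets that are not already archived.
            archive["tweets"].setdefault(tweet["id"], tweet)
    # A rate-limited run stores nothing, so the archive stays flagged as new
    # and is picked up again on the next pass.
    if len(archive["tweets"]) > 1:
        archive["is_new"] = False
    return archive
```

The same duplicate check is what lets a "missed tweets" crawl run repeatedly over an existing archive: tweets already stored are skipped, and only new ids are added.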