Twapper Keeper Operations / Infrastructure Enhancements, 04/25/2010
Issues / Concerns: Server is running at too high a load rate and is running out of disk space.


Twapper Keeper Ops / Infrastructure Enhancements

  • 1. Twapper Keeper Operations / Infrastructure Enhancements, 04/25/2010
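  • 2. High Level Server Configuration – as of 4/25/2010 (diagram: Twitter, via the /track Streaming API and the /search REST API, feeds the Primary Twapper Keeper Server, which runs mysql, apache, and jobs; an export from the primary feeds the Backup Twapper Keeper Server, which holds the backup extract)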
  • 3. Primary Archiving Process for Hashtags and Keywords – as of 4/25/2010 (diagram: Twitter /track Streaming API; Primary Twapper Keeper Server with jobs, mysql, apache; tables twappers, rawstream, and archive tables z_{notebook_id})
      • phire_track.php [1 instance]: creates a persistent connection to the Twitter Streaming API for the archive predicates listed in twappers, and stores tweets in the rawstream table.
      • sweeper_stream.php [11 instances]: pulls chunks of tweets, compares them to the archives on file, inserts them into the proper archive table, and deletes each chunk from processing when it is complete.
  • 4. New / Missed Archiving Process for Hashtags and Keywords – as of 4/25/2010 (diagram: Twitter; Primary Twapper Keeper Server with jobs, mysql, apache; tables twappers and archive tables z_{notebook_id})
      • sweeper_new.php: When an archive is first created, it is flagged as new. This process picks up new archives, uses the Twitter /search REST API to find all available tweets, and stores them in the appropriate archive. Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: Calls are sometimes rate limited, so the process needs to ensure some data has been received before unflagging.)
      • sweeper_missed.php: For each archive in the system, this process constantly crawls the Twitter /search REST API for all tweets in the Twitter cache, checks that they have been archived, and archives any that have not. This is important during Streaming API reconnects or prolonged stream disconnects.
  • 5. New / Primary Archiving Process for Persons – as of 4/25/2010 (diagram: Twitter; Primary Twapper Keeper Server with jobs, mysql, apache; tables twappers and archive tables z_{notebook_id})
      • sweeper_personal_new.php: When an archive is first created, it is flagged as new. This process picks up new archives, uses the Twitter /timeline REST API to find all available tweets, and stores them in the appropriate archive. Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: Calls are sometimes rate limited, so the process needs to ensure some data has been received before unflagging.)
      • sweeper_personal.php: For each person archive in the system, this process constantly crawls the Twitter /timeline REST API for the tweets on the public user timeline, checks that they have been archived, and archives any that have not.
  • 6.
  • 7. Users do not have automatic visibility into system issues and have to ask questions about archives too often, especially when tweets are queued in the RAWSTREAM table awaiting processing
  • 8. Contention for access to the RAWSTREAM table slows the inbound /STREAM HTTP connection, resulting in 30-40 min of latency, which can cause lost tweets during reconnects (which happen when new archives are created)
  • 9.
  • 10. Upgrade the Primary Twapper Keeper server VPS configuration to a larger CPU and RAM allocation
  • 11. Procure additional disk space for the Primary and Backup servers
  • 12. Incorporate an automated export / import routine for the backup server, and rsync the apache and jobs files, so that the secondary server can be brought online more rapidly
  • 13. Implement a layer of monitoring for the administrator and provide a “system health” page for users for transparency
  • 14. Provide a user feature to “reset” archives, kicking off the requery process for users in a hurry to see tweets (this forces a re-call of /search instead of waiting for the Streaming API)
  • 15. Implement OAuth for all REST API calls
  • 16. Implement an approach to reduce contention on RAWSTREAM table to reduce latency on inbound Streaming API
  • 18. Target completion by May 14, 2010

Editor's Notes

  1. This presentation outlines the operational / infrastructure upgrades planned to improve the accuracy and stability of the archiving platform.
  2. The following slide outlines the high-level configuration of the Twapper Keeper archiving platform. For Keyword and Hashtag archives: the /track Streaming API is the primary input. The Twitter RESTful /search API is used when archives are first created, to pull back older tweets and quickly fill the archive with data. The /search API is also called continually for all archives, in round-robin fashion, to try to ensure no tweets are missed. For Person archives: the RESTful /timeline API is called in round-robin fashion to get all public tweets from a person's timeline. (When a Person archive is first created, it is flagged as new and given priority so that the archive fills quickly.)
  3. This slide provides more detail on the primary hashtag and keyword processing. When new archives are created by the user, they are stored in the TWAPPERS table and a separate z_table is created for each archive. The phire_track.php job (a) constantly looks for new additions (roughly every 60 seconds) and reconnects to the Twitter /track stream with the various predicates. All data is stored in RAWSTREAM as it is received from Twitter via the HTTP connection. Eleven instances of sweeper_stream.php run to offload stacks of data from RAWSTREAM, process them, and bin them in the appropriate archive z_table. (a) This is based upon the phirehose library - http://code.google.com/p/phirehose/
  4. This slide provides more detail on the new / missed hashtag and keyword processing. This uses the RESTful /search API to quickly backfill archives with data already stored in Twitter's search, and to try to find any tweets missed by the Streaming API (for any reason). sweeper_new.php looks for archives that are flagged as new and runs a search against the Twitter API (results are set to max 100, pagination from pg 1-15). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_missed.php crawls each archive flagged as “not” new and continually calls the search API (max 100, pg 1-15) to try to find any missed tweets. All search queries leverage a mildly modified php-twitter library - http://code.google.com/p/php-twitter/
  5. This slide provides more detail on the new / primary archiving process for Person archives. This uses the RESTful /timeline API to quickly backfill archives with data already stored in the public user timeline, and to continually crawl the user timeline for more data. sweeper_personal_new.php looks for person archives that are flagged as new and runs a Twitter /timeline API call (results are set to max 200, pagination from pg 1-16). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_personal.php crawls each archive flagged as “not” new and continually calls the Twitter /timeline API (max 200, pg 1-16) to try to find any missed tweets. All timeline queries leverage a mildly modified php-twitter library - http://code.google.com/p/php-twitter/
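
The paginated backfill described in notes 4 and 5 (up to 100 results over pages 1-15 for /search, up to 200 over pages 1-16 for /timeline, unflagging only once data has actually arrived) might look roughly like the sketch below. It is an illustration in Python rather than the PHP the jobs are written in. The `fetch_page` callable is a hypothetical stand-in for the real API call, the dictionary-based archive is invented for the example, and stopping early on a short page is an assumption rather than something the notes state.

```python
from typing import Callable, Dict, List

Tweet = Dict[str, object]
FetchPage = Callable[[int, int], List[Tweet]]


def backfill(fetch_page: FetchPage, per_page: int = 100, max_pages: int = 15) -> List[Tweet]:
    """Walk pages 1..max_pages and collect tweets, de-duplicated by id."""
    collected: Dict[object, Tweet] = {}
    for page in range(1, max_pages + 1):
        results = fetch_page(page, per_page)
        if not results:
            # Empty page: either no more data or a rate-limited call.
            break
        for tweet in results:
            collected[tweet["id"]] = tweet
        if len(results) < per_page:
            # Short page: assume this was the last page available.
            break
    return list(collected.values())


def sweep_new_archive(archive: dict, fetch_page: FetchPage,
                      per_page: int = 100, max_pages: int = 15) -> int:
    """Backfill a new archive; unflag it only if some data actually arrived."""
    tweets = backfill(fetch_page, per_page, max_pages)
    for tweet in tweets:
        archive["tweets"][tweet["id"]] = tweet
    if tweets:
        # Leave the archive flagged as new when a rate-limited call returned nothing.
        archive["is_new"] = False
    return len(tweets)
```

For a Person archive the same helper would be called with `per_page=200` and `max_pages=16`, which caps a single pass at 3,200 tweets. Because an empty response leaves the archive flagged as new, a rate-limited pass is retried on the next sweep instead of being mistaken for an archive with no tweets.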