2. High Level Server Configuration – as of 4/25/2010
[Architecture diagram: Twitter feeds the Primary Twapper Keeper server via the /track Streaming API and the /search REST API. The primary server runs mysql, apache, and jobs, and runs an export to the Backup Twapper Keeper server, which extracts the backup.]
3. Primary Archiving Process for Hashtags and Keywords – as of 4/25/2010
[Diagram: Twitter /track Streaming API → Primary Twapper Keeper server (apache, mysql, jobs; tables: twappers, rawstream, and archive tables z_{notebook_id})]
phire_track.php – creates a persistent connection to the Twitter Streaming API for each archive predicate listed in twappers, and stores tweets in the rawstream table [1 instance]
sweeper_stream.php – pulls chunks of tweets, compares them to the archives on file, inserts them into the proper archive table, and deletes them from processing when each chunk is complete [11 instances]
4. New / Missed Archiving Process for Hashtags and Keywords – as of 4/25/2010
[Diagram: Twitter /search REST API → Primary Twapper Keeper server (apache, mysql, jobs; tables: twappers and archive tables z_{notebook_id})]
sweeper_new.php – when an archive is first created, it is flagged as new. This process picks up new archives and uses the Twitter /search REST API to find all available tweets and store them in the appropriate archive. Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: calls are sometimes rate limited, so the process must ensure some data is received before unflagging.)
sweeper_missed.php – for each archive in the system, this constantly crawls the Twitter /search REST API for all tweets in the Twitter cache, checks that they have been archived, and archives them if not. This is important during Streaming API reconnects or prolonged stream disconnects.
5. New / Primary Archiving Process for Persons – as of 4/25/2010
[Diagram: Twitter /timeline REST API → Primary Twapper Keeper server (apache, mysql, jobs; tables: twappers and archive tables z_{notebook_id})]
sweeper_personal_new.php – when an archive is first created, it is flagged as new. This process picks up new archives and uses the Twitter /timeline REST API to find all available tweets and store them in the appropriate archive. Once more than 1 tweet is found, it unflags the archive as no longer new. (NOTE: calls are sometimes rate limited, so the process must ensure some data is received before unflagging.)
sweeper_personal.php – for each person archive in the system, this constantly crawls the Twitter /timeline REST API for all public tweets in the user's timeline, checks that they have been archived, and archives them if not.
7. Users do not have automatic visibility into system issues and have to ask questions about their archives too often, especially when tweets are queued in the RAWSTREAM table awaiting processing
8. There is contention for access to the RAWSTREAM table, which slows the inbound /track Streaming API HTTP connection and produces 30–40 minutes of latency; this can cause lost tweets during reconnects (which happen whenever new archives are created)
10. Upgrade the Primary Twapper Keeper server's VPS to a configuration with more CPU and RAM
12. Incorporate an automated export / import routine for the backup server, and rsync the apache and jobs files, so the secondary server can be brought online more rapidly
13. Implement a layer of monitoring for the administrator, and provide a "system health" page to give users transparency
14. Provide a user-facing feature to "reset" archives, kicking off the requery process for users in a hurry to see tweets (this forces a re-call of /search instead of waiting on the Streaming API)
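The "reset" feature in item 14 could be as simple as re-flagging an existing archive as new, so the new-archive sweeper re-queries /search on its next pass. A minimal Python sketch of that idea (the real implementation would be PHP against the TWAPPERS table; the dict and the is_new field are assumptions):

```python
# Hypothetical sketch of the "reset" feature: re-flag an archive as new so
# the backfill job re-calls /search immediately instead of waiting on the
# Streaming API. The dict stands in for the TWAPPERS table; the "is_new"
# field name is an assumption.

def reset_archive(twappers: dict, archive_name: str) -> bool:
    """Flag an existing archive as new again; return False if unknown."""
    row = twappers.get(archive_name)
    if row is None:
        return False
    row["is_new"] = True  # the new/missed sweepers pick this up next pass
    return True

twappers = {"#foo": {"is_new": False}}
print(reset_archive(twappers, "#foo"), twappers["#foo"]["is_new"])
```

Because the flag is the only state touched, the reset is cheap and idempotent: repeated resets simply leave the archive queued for the next backfill pass.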
This presentation outlines the operational and infrastructure upgrades planned to improve the accuracy and stability of the archiving platform.
The following document outlines the high-level configuration of the Twapper Keeper archiving platform. For keyword and hashtag archives: the /track Streaming API is the primary input. The Twitter RESTful /search API is used when archives are first created to pull back older tweets and quickly fill the archive with data. The /search API is also continually called for all archives in round-robin fashion to try to ensure no missed tweets. For person archives: the RESTful /timeline API is called in round-robin fashion to get all public tweets from a person's timeline. (When a person archive is first created, it is flagged as new and given priority to quickly fill the archive.)
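The split of archive types across Twitter APIs described above can be summarized in a small sketch. This is purely illustrative (the real jobs are PHP scripts); the function name and return strings are assumptions:

```python
# Illustrative sketch of which Twitter API each archive type relies on,
# per the configuration described in the notes. The actual system is a
# set of PHP jobs; this function is a hypothetical summary.

def data_sources(archive_type: str, is_new: bool) -> list[str]:
    """Return the Twitter APIs consulted for an archive type."""
    if archive_type in ("keyword", "hashtag"):
        # /track is the primary input; /search backfills new archives
        # and is round-robined over all archives to catch missed tweets.
        return ["/track (streaming, primary)", "/search (backfill + missed)"]
    if archive_type == "person":
        # /timeline is polled round-robin; new archives get priority.
        return ["/timeline (round-robin%s)" % (", priority" if is_new else "")]
    raise ValueError("unknown archive type: %r" % archive_type)

print(data_sources("hashtag", is_new=False))
print(data_sources("person", is_new=True))
```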
This slide provides more detail on the primary hashtag and keyword processing. When a new archive is created by the user, it is stored in the TWAPPERS table and a separate z_ table is created for the archive. The phire_track.php job (a) constantly looks for new additions (roughly every 60 seconds) and reconnects the stream to the Twitter /track endpoint with the updated set of predicates. All data is stored in RAWSTREAM as it is received from Twitter over the HTTP connection. Eleven instances of sweeper_stream.php run to offload batches of data from RAWSTREAM, process them, and bin them into the appropriate archive z_ table. (a) This is based upon the phirehose library – http://code.google.com/p/phirehose/
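The producer/consumer split above — one cheap write into RAWSTREAM, with matching and binning deferred to the sweepers — can be sketched as follows. This is a minimal Python illustration, assuming simple substring matching and in-memory lists in place of the PHP jobs and MySQL tables:

```python
# Minimal sketch of the producer/consumer split described above. The real
# system uses phire_track.php plus 11 sweeper_stream.php instances over
# MySQL; here lists/dicts stand in for the tables, and predicate matching
# is a simple (assumed) substring test.

rawstream = []                      # stands in for the RAWSTREAM table
archives = {"#foo": [], "bar": []}  # stands in for z_{notebook_id} tables

def phire_track(tweet: dict) -> None:
    """Producer: dump every tweet from the /track stream into RAWSTREAM.

    Keeping this write dirt-cheap is the point -- matching tweets to
    archives is deferred so the stream connection never blocks."""
    rawstream.append(tweet)

def sweeper_stream(chunk_size: int = 2) -> None:
    """Consumer: take a chunk off RAWSTREAM, bin each tweet into every
    archive whose predicate it matches, then delete the chunk."""
    chunk = rawstream[:chunk_size]
    for tweet in chunk:
        for predicate, table in archives.items():
            if predicate.lower() in tweet["text"].lower():
                table.append(tweet)
    del rawstream[:chunk_size]  # chunk fully processed; remove it

phire_track({"text": "loving #foo today"})
phire_track({"text": "bar none"})
sweeper_stream()
print(len(archives["#foo"]), len(archives["bar"]), len(rawstream))
```

Running eleven sweeper instances, as the slide describes, parallelizes the expensive matching step while the single stream writer stays fast.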
This slide provides more detail on the new / missed hashtag and keyword processing. This uses the RESTful /search API to quickly backfill archives with data already stored in Twitter's search index, as well as to find any tweets missed by the Streaming API (for any reason). sweeper_new.php looks for archives that are flagged as new and runs a search against the Twitter API (results are set to a max of 100 per page, paginated from page 1 to 15). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_missed.php crawls each archive that is no longer flagged as new and continually calls the search API (max 100, pages 1–15) to try to find any missed tweets. All search queries leverage a mildly modified php-twitter library – http://code.google.com/p/php-twitter/
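The backfill loop and the "don't unflag until data is received" rule can be sketched as below. Function and parameter names are assumptions (the real job is sweeper_new.php over a modified php-twitter library); `search_page` stands in for a /search call returning up to 100 tweets per page:

```python
# Hedged sketch of the /search backfill described above. `search_page`
# is a stand-in for the Twitter /search call; names are assumptions.

def backfill_new_archive(predicate, search_page, archive):
    """Pull pages 1-15 (max 100 results each) and store everything.

    Returns True if the archive received data and can be unflagged as
    new; False if nothing came back (e.g. a rate-limited call), in which
    case it stays flagged and is retried -- matching the slide's note
    that some data must be received before unflagging."""
    got_data = False
    for page in range(1, 16):            # pages 1-15, per the notes
        results = search_page(predicate, page=page, rpp=100)
        if not results:
            break                        # exhausted (or rate limited)
        archive.extend(results)
        got_data = True
    return got_data

# Usage with a fake search that returns two pages of three tweets each:
fake = lambda q, page, rpp: (
    ["%s-tweet-%d-%d" % (q, page, i) for i in range(3)] if page <= 2 else []
)
archive = []
print(backfill_new_archive("#foo", fake, archive), len(archive))
```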
This slide provides more detail on the new / primary archiving process for person archives. This uses the RESTful /timeline API to quickly backfill archives with data already stored in the public user timeline, and then continually crawls the timeline to find more data. sweeper_personal_new.php looks for person archives that are flagged as new and runs a Twitter /timeline API call (results are set to a max of 200 per page, paginated from page 1 to 16). All results are stored. Once an archive has some data in it, it is unflagged as new. sweeper_personal.php crawls each archive that is no longer flagged as new and continually calls the Twitter /timeline API (max 200, pages 1–16) to try to find any missed tweets. All timeline queries leverage a mildly modified php-twitter library – http://code.google.com/p/php-twitter/
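The continual round-robin crawl, with its "check whether the tweet has already been archived" step, can be sketched as follows. `fetch_timeline_page` is a stand-in for the /timeline call (max 200 per page, pages 1–16, per the slide); deduplication by tweet id is an assumed implementation of the archived-check:

```python
# Hedged sketch of the round-robin /timeline crawl described above. The
# real job is sweeper_personal.php; names here are assumptions. Dedup by
# tweet id mirrors "checks to make sure they have been archived, and if
# not archives them."
from itertools import cycle

def crawl_person_archives(archives, fetch_timeline_page, rounds):
    """Visit each person archive in turn, storing only unseen tweets."""
    for _, user in zip(range(rounds), cycle(archives)):
        seen = {t["id"] for t in archives[user]}
        for page in range(1, 17):                  # pages 1-16
            tweets = fetch_timeline_page(user, page=page, count=200)
            if not tweets:
                break
            archives[user].extend(t for t in tweets if t["id"] not in seen)
            seen.update(t["id"] for t in tweets)

# Usage: a second pass over the same timeline adds nothing new.
data = {"alice": [{"id": 1, "text": "hi"}, {"id": 2, "text": "yo"}]}
fetch = lambda user, page, count: data[user] if page == 1 else []
archives = {"alice": []}
crawl_person_archives(archives, fetch, rounds=2)
print(len(archives["alice"]))
```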