SpiderDuck: Twitter’s Real-time URL Fetcher
Murzabayev Askhat aka @murzabayev
Presentation may contain adult language or discuss topics that are above a PG-13 level
Introduction
• Tweets often contain URLs or links to a variety of content on the web.
• SpiderDuck is a service at Twitter that fetches all URLs shared in Tweets in real time, parses the downloaded content to extract metadata of interest, and makes that metadata available for other Twitter services to consume within seconds.
So, what does it mean? Yes, it means that we know everything about you and your Tweets.
Introduction (cont.)
• Several teams at Twitter need to access the linked content, typically in real time, to improve Twitter products. For example:
  – Search
  – Clients
  – Tweet Button
  – Trust & Safety
  – Analytics
Background (Before Hijri)
• Prior to SpiderDuck, Twitter had a service that resolved all URLs shared in Tweets by issuing HEAD requests and following redirects.
  – It resolved the URLs but did not actually download the content.
  – It did not implement the politeness rules typical of modern bots (for example, rate limiting and following robots.txt directives).
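The redirect-following behaviour described above can be sketched roughly as follows. This is only an illustration, not the old service's code: `resolve`, `HeadFn`, `fake_head`, and the `FAKE_WEB` table are hypothetical names, and the HEAD request is simulated with a lookup table so the example runs without network access.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical interface: given a URL, return (status code, Location header or None).
HeadFn = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve(url: str, head: HeadFn, max_hops: int = 10) -> List[str]:
    """Follow redirects starting at url and return every hop, final URL last."""
    chain = [url]
    for _ in range(max_hops):
        status, location = head(chain[-1])
        if status not in REDIRECT_CODES or not location:
            return chain
        next_url = urljoin(chain[-1], location)  # Location may be relative
        if next_url in chain:  # redirect loop: stop where we are
            return chain
        chain.append(next_url)
    return chain  # hop limit reached


# Simulated responses standing in for real HEAD requests (assumed data).
FAKE_WEB = {
    "http://short.example/abc": (301, "http://mid.example/xyz"),
    "http://mid.example/xyz": (302, "/article"),
    "http://mid.example/article": (200, None),
}


def fake_head(url: str) -> Tuple[int, Optional[str]]:
    return FAKE_WEB.get(url, (404, None))


if __name__ == "__main__":
    print(resolve("http://short.example/abc", fake_head))
```

Note that nothing here downloads a body or consults robots.txt, which is exactly the gap the slide points out.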
Background (The Great Migration)
• We considered open-source URL crawlers, but realized that almost all of the available ones have two properties that we didn’t need:
  – They are recursive crawlers.
  – They are optimized for large batch crawls.
• What we needed was a fast, real-time URL fetcher.
System Overview
• Kestrel: A message queuing system widely used at Twitter for queuing incoming Tweets.
• Schedulers: These jobs determine whether to fetch a URL, schedule the fetch, and follow redirect hops, if any. Each scheduler performs its work independently of the others; that is, any number of schedulers can be added to horizontally scale the system as Tweet and URL volume grows.
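A minimal sketch of what one scheduler pass might look like, under stated assumptions: Python's `queue.Queue` stands in for Kestrel, a plain set stands in for the recent fetch status, and `fetch` is any callable standing in for a request to a Fetcher. The `Scheduler` class and the URL regex are hypothetical and are not SpiderDuck's actual code.

```python
import queue
import re

# Deliberately simple URL pattern, good enough for illustration only.
URL_RE = re.compile(r"https?://\S+")


class Scheduler:
    def __init__(self, incoming, fetch, recently_fetched):
        self.incoming = incoming                  # stand-in for a Kestrel queue
        self.fetch = fetch                        # stand-in for a call to a Fetcher
        self.recently_fetched = recently_fetched  # stand-in for recent fetch status

    def run_once(self):
        """Drain the queue once and return the URLs scheduled for fetching."""
        scheduled = []
        while True:
            try:
                tweet = self.incoming.get_nowait()
            except queue.Empty:
                return scheduled
            for url in URL_RE.findall(tweet):
                if url in self.recently_fetched:
                    continue  # already handled recently: do not fetch again
                self.recently_fetched.add(url)
                self.fetch(url)
                scheduled.append(url)


if __name__ == "__main__":
    q = queue.Queue()
    q.put("look at http://example.com/a")
    q.put("again http://example.com/a and http://example.com/b")
    print(Scheduler(q, print, set()).run_once())
```

Because each pass only touches the queue and the shared status, several such schedulers could run side by side, which is the horizontal-scaling property the slide describes.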
System Overview (cont.)
• Fetchers: These are Thrift servers that maintain short-term fetch queues of URLs, issue the actual HTTP fetch requests, and implement rate limiting and robots.txt processing. Like the Schedulers, Fetchers scale horizontally with fetch rate.
• Memcached: This is a distributed cache used by the Fetchers to temporarily store robots.txt files.
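The two politeness checks a Fetcher applies, robots.txt and per-host rate limiting, can be sketched as below. This is an assumption-laden illustration: `PolitenessGate`, `USER_AGENT`, and the one-second interval are hypothetical, and an in-process dictionary stands in for the Memcached robots.txt cache. The robots.txt parsing uses the standard library's `urllib.robotparser`.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"  # hypothetical user agent name


class PolitenessGate:
    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s  # assumed per-host spacing
        self.clock = clock
        self.robots = {}        # host -> parsed robots.txt (stand-in for Memcached)
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def cache_robots(self, host, robots_txt):
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def check(self, url):
        """Return 'blocked', 'wait', or 'fetch' for the given URL."""
        host = urlsplit(url).netloc
        parser = self.robots.get(host)
        if parser is not None and not parser.can_fetch(USER_AGENT, url):
            return "blocked"  # disallowed by robots.txt
        now = self.clock()
        if now < self.next_allowed.get(host, 0.0):
            return "wait"     # rate limit: leave the URL in the queue
        self.next_allowed[host] = now + self.min_interval_s
        return "fetch"


if __name__ == "__main__":
    gate = PolitenessGate()
    gate.cache_robots("example.com", "User-agent: *\nDisallow: /private/")
    for u in ("http://example.com/private/x", "http://example.com/a", "http://example.com/b"):
        print(u, gate.check(u))
```

The "wait" outcome corresponds to time spent in the short-term fetch queues when a host is being rate limited.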
System Overview (cont.)
• Metadata Store: This is a Cassandra-based distributed hash table that stores page metadata and resolution information keyed by URL, as well as fetch status for every URL recently encountered by the system. This store serves clients across Twitter that need real-time access to URL metadata.
• Content Store: This is an HDFS (Hadoop) cluster for archiving downloaded content and all fetch information.
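To make "keyed by URL" concrete, here is a toy in-memory stand-in for the Metadata Store. It is a sketch only: the real store is Cassandra-based, and the `UrlMetadata` fields, the `MetadataStore` class, and the choice of the page title as the only extracted metadata are assumptions made for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Dict, Optional


class TitleParser(HTMLParser):
    """Collects the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.title_parts).split())  # collapse whitespace


@dataclass
class UrlMetadata:
    canonical_url: str  # resolution information
    fetch_status: str   # assumed to be a short status string
    title: str          # page metadata


class MetadataStore:
    def __init__(self):
        self._rows: Dict[str, UrlMetadata] = {}

    def put(self, url: str, canonical_url: str, fetch_status: str, html: str = "") -> None:
        self._rows[url] = UrlMetadata(canonical_url, fetch_status, extract_title(html))

    def get(self, url: str) -> Optional[UrlMetadata]:
        return self._rows.get(url)


if __name__ == "__main__":
    store = MetadataStore()
    page = "<html><head><title>Example Page</title></head><body>hi</body></html>"
    store.put("http://short.example/abc", "http://example.com/article", "ok", page)
    print(store.get("http://short.example/abc"))
```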
How Twitter uses SpiderDuck
• Twitter services query the Metadata Store to retrieve URL metadata (for example, the page title) and resolution information (that is, the canonical URL after redirects).
• Other services periodically process SpiderDuck logs in HDFS to generate aggregate stats for Twitter’s internal metrics dashboards or to conduct other types of batch analyses, answering questions such as “How many images are shared on Twitter each day?”, “What news sites do Twitter users most often link to?” and “How many URLs did we fetch yesterday from this specific website?”
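A batch question like "which sites are linked to most often?" reduces to counting hosts over the fetch logs. The sketch below assumes a hypothetical tab-separated log line of `timestamp`, `status`, `url`; the real SpiderDuck log format is not given in these slides, and a production job would run over HDFS rather than an in-memory list.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_hosts(log_lines, n=3):
    """Count successfully fetched URLs per host and return the n most common."""
    counts = Counter()
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed lines
        _timestamp, status, url = fields
        if status != "200":
            continue  # only count successful fetches
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    return counts.most_common(n)


# Made-up sample lines in the assumed format.
SAMPLE_LOG = [
    "1321000000\t200\thttp://news.example.com/a",
    "1321000001\t200\thttp://news.example.com/b",
    "1321000002\t200\thttp://blog.example.org/post",
    "1321000003\t404\thttp://blog.example.org/missing",
    "1321000004\t200\thttp://news.example.com/c",
    "1321000005\t200\thttp://blog.example.org/other",
]

if __name__ == "__main__":
    print(top_hosts(SAMPLE_LOG, n=2))
```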
Performance numbers
• For URLs that get fetched, SpiderDuck’s median processing latency is under 2 seconds, and the 99th percentile is under 5 seconds.
• That span covers everything after a user clicks “Tweet”: the URL in that Tweet is extracted and prepared for fetch, all redirect hops are retrieved, the content is downloaded and parsed, and the metadata is extracted and made available to clients via the Metadata Store.
Performance numbers (cont.)
• Most of that time is spent either in the Fetcher request queues (due to rate limiting) or in actually fetching from the external web server. SpiderDuck itself adds no more than a few hundred milliseconds of processing overhead, most of which is spent in HTML parsing.
• The Cassandra-based Metadata Store handles about 10,000 requests per second. The store’s median read latency is 4-5 ms, and its 99th percentile is 50-60 ms.
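For readers unsure what "median" and "99th percentile" latency mean in practice, here is a small sketch using the nearest-rank definition. The slides do not say which percentile definition Twitter uses, so nearest-rank is an assumption, and the sample latencies are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    read_latencies_ms = [4, 5, 4, 60, 5]  # made-up read latencies
    print("median:", percentile(read_latencies_ms, 50))
    print("p99:", percentile(read_latencies_ms, 99))
```

The gap between the two numbers is the point: a few slow requests barely move the median but dominate the 99th percentile.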
That’s All
• Read the acknowledgements section; thanks go to the people listed there.
• Links to resources (open-source libraries) will be given to Mr. Saparkhojayev; ask him to share them.