Spider duck
Presentation made by Murzabayev Askhat at Suleyman Demirel University, Kazakhstan, about Twitter's SpiderDuck real-time URL fetcher.

    Presentation Transcript

    • SpiderDuck: Twitter's Real-time URL Fetcher
      Murzabayev Askhat aka @murzabayev
      Presentation may contain adult language or discuss topics that are above a PG-13 level.
    • Introduction
      • Tweets often contain URLs or links to a variety of content on the web.
      • SpiderDuck is a service at Twitter that fetches all URLs shared in Tweets in real time, parses the downloaded content to extract metadata of interest, and makes that metadata available for other Twitter services to consume within seconds. (A miniature fetch-and-parse sketch follows.)
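To make the "fetch, parse, extract" step concrete, here is a minimal Python sketch that downloads a page and extracts one piece of metadata, the page title. SpiderDuck itself is a JVM service running at far higher volume, so this is an illustration, not its code; `fetch_title`, the 5-second timeout, and the UTF-8 fallback are choices made for this example.

```python
# Illustrative only: a miniature "fetch, parse, extract metadata" step.
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch_title(url: str) -> str:
    """Download a page and extract one piece of metadata: its title."""
    with urlopen(url, timeout=5) as resp:  # assumed timeout, illustrative
        html = resp.read().decode("utf-8", errors="replace")
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```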
    • So, what does it mean? Yes, it means that we know everything about you and your Tweets.
    • Introduction (cont.)
      • Several teams at Twitter need to access the linked content, typically in real time, to improve Twitter products. For example:
        – Search
        – Clients
        – Tweet Button
        – Trust & Safety
        – Analytics
    • Background (Before Hijri)
      • Prior to SpiderDuck, Twitter had a service that resolved all URLs shared in Tweets by issuing HEAD requests and following redirects (see the sketch after this slide).
        – It resolved the URLs but did not actually download the content.
        – It did not implement politeness rules typical of modern bots (e.g., rate limiting and honoring robots.txt directives).
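As a rough picture of that pre-SpiderDuck resolver, the sketch below issues HEAD requests and follows redirects by hand, never downloading any content. The 5-hop limit and 5-second timeout are assumed values for illustration, not Twitter's actual settings.

```python
# Resolve a URL the way the old service did: HEAD requests, follow redirects,
# never fetch the body.
import http.client
from urllib.parse import urljoin, urlsplit


def resolve_url(url: str, max_hops: int = 5) -> str:
    """Follow HTTP redirects via HEAD requests and return the final URL."""
    for _ in range(max_hops):  # hop limit is an assumption
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection
                    if parts.scheme == "https" else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=5)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        location = resp.getheader("Location")
        conn.close()
        if resp.status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)  # next hop; Location may be relative
        else:
            return url  # no more redirects: this is the resolved URL
    return url
```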
    • Background (The Great Migration)
      • We looked at open-source URL crawlers. We realized, though, that almost all of the available crawlers have two properties that we didn't need:
        – They are recursive crawlers.
        – They are optimized for large batch crawls.
      • What we needed was a fast, real-time URL fetcher.
    • Background (After Hijri)
    • System Overview
      • Kestrel: a message queuing system widely used at Twitter, here used for queuing incoming Tweets.
      • Schedulers: these jobs determine whether to fetch a URL, schedule the fetch, and follow redirect hops, if any. Each scheduler performs its work independently of the others; that is, any number of schedulers can be added to horizontally scale the system as Tweet and URL volume grows. (A toy scheduler loop is sketched below.)
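Here is a toy version of the Scheduler loop, with stand-ins for every real interface: `tweet_queue` plays the role of Kestrel, `recently_fetched` stands in for the fetch-status check against the Metadata Store, and `assign_fetcher` stands in for the Thrift call to a Fetcher. All three names are hypothetical, not SpiderDuck's real interfaces; the loop reuses `resolve_url` from the HEAD sketch above.

```python
# A toy scheduler: pull Tweets off a queue, extract URLs, decide whether to
# fetch, resolve redirect hops, and hand the URL to a fetcher.
import queue
import re

URL_RE = re.compile(r"https?://\S+")

tweet_queue: "queue.Queue[str]" = queue.Queue()  # stand-in for Kestrel
recently_fetched: set[str] = set()               # stand-in for Metadata Store status


def assign_fetcher(url: str) -> None:
    """Stand-in for the Thrift RPC that enqueues a URL on a Fetcher."""
    print(f"scheduling fetch of {url}")


def scheduler_loop() -> None:
    """Each scheduler runs this loop independently of the others, which is
    what lets the system scale horizontally: just add more schedulers."""
    while True:
        tweet_text = tweet_queue.get()       # blocks until a Tweet arrives
        for url in URL_RE.findall(tweet_text):
            url = resolve_url(url)           # follow redirect hops (sketch above)
            if url in recently_fetched:
                continue                     # fetched recently: skip it
            recently_fetched.add(url)
            assign_fetcher(url)
```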
    • System Overview (cont.)
      • Fetchers: Thrift servers that maintain short-term fetch queues of URLs, issue the actual HTTP fetch requests, and implement rate limiting and robots.txt processing. Like the Schedulers, Fetchers scale horizontally with fetch rate.
      • Memcached: a distributed cache used by the Fetchers to temporarily store robots.txt files. (The politeness checks are sketched below.)
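The Fetchers' politeness rules might look roughly like the sketch below, where a plain dict with a TTL stands in for Memcached. The one-hour robots.txt TTL, the one-second per-host interval, and the "SpiderDuck" user-agent string are all assumptions made for illustration.

```python
# Politeness checks: cache robots.txt per host, and rate-limit requests so no
# single external web server is hammered.
import time
import urllib.robotparser
from urllib.parse import urlsplit

ROBOTS_TTL = 3600   # assumed: cache robots.txt for an hour
MIN_INTERVAL = 1.0  # assumed: at most one request per second per host

robots_cache: dict[str, tuple[float, urllib.robotparser.RobotFileParser]] = {}
last_hit: dict[str, float] = {}  # last request time per host


def allowed(url: str, agent: str = "SpiderDuck") -> bool:
    """Check robots.txt, re-fetching it only when the cached copy expires."""
    host = urlsplit(url).netloc
    cached = robots_cache.get(host)
    if cached is None or time.time() - cached[0] > ROBOTS_TTL:
        rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
        rp.read()
        robots_cache[host] = (time.time(), rp)
        cached = robots_cache[host]
    return cached[1].can_fetch(agent, url)


def rate_limited_fetch(url: str) -> None:
    """Respect a minimum interval between requests to the same host."""
    host = urlsplit(url).netloc
    wait = last_hit.get(host, 0) + MIN_INTERVAL - time.time()
    if wait > 0:
        time.sleep(wait)
    last_hit[host] = time.time()
    if allowed(url):
        print(f"fetching {url}")  # a real Fetcher issues the HTTP GET here
```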
    • System Overview (cont.)
      • Metadata Store: a Cassandra-based distributed hash table that stores page metadata and resolution information keyed by URL, as well as fetch status for every URL recently encountered by the system. This store serves clients across Twitter that need real-time access to URL metadata. (A rough picture of a per-URL record follows.)
      • Content Store: an HDFS (Hadoop) cluster for archiving downloaded content and all fetch information.
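The per-URL record kept in the Metadata Store could be pictured like this, with a dataclass and a dict standing in for Cassandra. The field names are inferred from the slide (page metadata, resolution information, fetch status); they are not the real schema.

```python
# A stand-in for the Metadata Store: records keyed by URL.
from dataclasses import dataclass


@dataclass
class UrlMetadata:
    url: str                 # original URL as seen in the Tweet (the key)
    resolved_url: str        # canonical URL after following redirects
    fetch_status: str        # e.g. "fetched", "robots_denied", "error"
    title: str = ""          # one example of extracted page metadata
    fetched_at: float = 0.0  # epoch timestamp of the fetch


metadata_store: dict[str, UrlMetadata] = {}  # stand-in for the Cassandra DHT


def put(record: UrlMetadata) -> None:
    metadata_store[record.url] = record


def get(url: str) -> UrlMetadata | None:
    """The real-time lookup path used by client services across Twitter."""
    return metadata_store.get(url)
```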
    • URL Scheduler
    • URL Fetcher
    • How Twitter uses SpiderDuck
      • To retrieve URL metadata (for example, the page title) and resolution information (that is, the canonical URL after redirects).
      • Other services periodically process SpiderDuck logs in HDFS to generate aggregate stats for Twitter's internal metrics dashboards or to conduct other types of batch analyses: "How many images are shared on Twitter each day?", "What news sites do Twitter users most often link to?", and "How many URLs did we fetch yesterday from this specific website?" (A miniature batch analysis is sketched below.)
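A miniature version of that kind of batch analysis, answering the last question above: count fetches per site. The one-URL-per-line log format is assumed purely for illustration; the real jobs run as Hadoop batch jobs over HDFS, not as a single Python process.

```python
# Count how many URLs were fetched from each site, given fetch-log lines.
from collections import Counter
from urllib.parse import urlsplit


def urls_per_site(log_lines: list[str]) -> Counter:
    """Aggregate fetched URLs by host; log format is assumed (one URL/line)."""
    return Counter(urlsplit(line.strip()).netloc
                   for line in log_lines if line.strip())


print(urls_per_site([
    "https://example.com/a",
    "https://example.com/b",
    "https://news.example.org/story",
]))
# Counter({'example.com': 2, 'news.example.org': 1})
```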
    • Performance numbers
      • For URLs that get fetched, SpiderDuck's median processing latency is under 2 seconds, and the 99th percentile is under 5 seconds. (Latency is measured from the moment "Tweet" is clicked: the URL in that Tweet is extracted, prepared for fetch, all redirect hops are retrieved, the content is downloaded and parsed, and the metadata is extracted and made available to clients via the Metadata Store.)
    • Performance numbers (cont.)
      • Most of that time is spent either in the Fetcher request queues (due to rate limiting) or in actually fetching from the external web server. SpiderDuck itself adds no more than a few hundred milliseconds of processing overhead, most of which is spent in HTML parsing.
      • The Cassandra-based Metadata Store handles around 10,000 requests per second. The store's median latency for reads is 4-5 ms, and its 99th percentile is 50-60 ms.
    • That's All
      • Read the acknowledgements part; thanks to them.
      • Links to resources (open-source libs) will be given to Mr. Saparkhojayev; ask him to share them.
    • Thank you!
    • @murzabayev