Spider duck

Presentation by Murzabayev Askhat at Suleyman Demirel University, Kazakhstan, about SpiderDuck, Twitter's real-time URL fetcher.

Transcript

  • 1. SpiderDuck: Twitter's Real-time URL Fetcher. Murzabayev Askhat, aka @murzabayev. Presentation may contain adult language or discuss topics that are above a PG-13 level.
  • 2. Introduction
    – Tweets often contain URLs or links to a variety of content on the web.
    – SpiderDuck is a service at Twitter that fetches all URLs shared in Tweets in real time, parses the downloaded content to extract metadata of interest, and makes that metadata available for other Twitter services to consume within seconds.
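To make the pipeline's first step concrete, here is a minimal, purely illustrative sketch of pulling URLs out of Tweet text. The regex, function name, and example Tweet are assumptions; Twitter's production extraction (t.co unwrapping, Unicode domains, trailing punctuation) is far more robust.

    import re

    # Toy pattern: anything that starts with http(s):// up to whitespace.
    URL_PATTERN = re.compile(r"https?://\S+")

    def extract_urls(tweet_text: str) -> list[str]:
        """Return every http(s) URL found in a Tweet's text."""
        return URL_PATTERN.findall(tweet_text)

    print(extract_urls("Reading about SpiderDuck: http://example.com/post"))
    # -> ['http://example.com/post']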
  • 3. So, what does it mean? Yes, it means that we know everything about you and your Tweets.
  • 4. Introduction (cont.)
    – Several teams at Twitter need to access the linked content, typically in real time, to improve Twitter products. For example:
      – Search
      – Clients
      – Tweet Button
      – Trust & Safety
      – Analytics
  • 5. Background (Before Hijri)
    – Prior to SpiderDuck, Twitter had a service that resolved all URLs shared in Tweets by issuing HEAD requests and following redirects.
      – It resolved the URLs but did not actually download the content.
      – It did not implement the politeness rules typical of modern bots (e.g., rate limiting and honoring robots.txt directives).
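A minimal sketch of the approach this slide describes: follow redirect hops with HEAD requests so that no response body is ever downloaded. This is not Twitter's actual code; the hop limit and timeout are illustrative values.

    import http.client
    from urllib.parse import urljoin, urlsplit

    def resolve_url(url: str, max_hops: int = 10) -> str:
        """Follow redirects via HEAD requests; return the final (canonical) URL."""
        for _ in range(max_hops):
            parts = urlsplit(url)
            conn_cls = (http.client.HTTPSConnection
                        if parts.scheme == "https" else http.client.HTTPConnection)
            conn = conn_cls(parts.netloc, timeout=5)
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("HEAD", path)          # headers only, no body
            resp = conn.getresponse()
            location = resp.getheader("Location")
            conn.close()
            if resp.status in (301, 302, 303, 307, 308) and location:
                url = urljoin(url, location)    # next redirect hop
                continue
            return url                          # terminal URL
        raise RuntimeError(f"too many redirect hops for {url}")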
  • 6. Background (The Great Migration)
    – We looked at open-source URL crawlers, but realized that almost all of the available crawlers have two properties that we didn't need:
      – They are recursive crawlers.
      – They are optimized for large batch crawls.
    – What we needed was a fast, real-time URL fetcher.
  • 7. Background (After Hijri)
  • 8. System Overview
    – Kestrel: a message queuing system widely used at Twitter for queuing incoming Tweets.
    – Schedulers: these jobs determine whether to fetch a URL, schedule the fetch, and follow redirect hops, if any. Each Scheduler performs its work independently of the others; that is, any number of Schedulers can be added to horizontally scale the system as Tweet and URL volume grows.
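A minimal sketch of the Scheduler loop, under stated assumptions: a queue.Queue stands in for Kestrel, fetcher.enqueue stands in for the Thrift call to a Fetcher, and the one-hour re-fetch window is an assumed policy, not Twitter's actual setting.

    import queue
    import time

    REFETCH_WINDOW = 3600.0                 # assumed: skip URLs seen in the last hour
    last_scheduled: dict[str, float] = {}   # url -> time it was last scheduled

    def should_fetch(url: str) -> bool:
        """Simple de-duplication: skip URLs scheduled too recently."""
        seen = last_scheduled.get(url)
        return seen is None or time.time() - seen > REFETCH_WINDOW

    def scheduler_loop(url_queue: "queue.Queue[str]", fetcher) -> None:
        """Drain URLs from the Tweet queue and hand eligible ones to a Fetcher.

        Schedulers share no state with each other, which is why adding more
        of them scales the system horizontally.
        """
        while True:
            url = url_queue.get()           # URL extracted from an incoming Tweet
            if should_fetch(url):
                last_scheduled[url] = time.time()
                fetcher.enqueue(url)        # hand off for the actual HTTP fetch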
  • 9. System Overview
    – Fetchers: Thrift servers that maintain short-term fetch queues of URLs, issue the actual HTTP fetch requests, and implement rate limiting and robots.txt processing. Like the Schedulers, Fetchers scale horizontally with fetch rate.
    – Memcached: a distributed cache used by the Fetchers to temporarily store robots.txt files.
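The Fetcher's politeness logic might look roughly like this single-process sketch. A plain dict stands in for the Memcached cluster, and the one-request-per-host-per-second limit and TTL are assumed values for illustration.

    import time
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlsplit

    ROBOTS_TTL = 3600.0                    # assumed: cache robots.txt for an hour
    robots_cache: dict[str, tuple[float, urllib.robotparser.RobotFileParser]] = {}
    last_hit: dict[str, float] = {}        # host -> time of last request
    MIN_INTERVAL = 1.0                     # assumed: seconds between hits per host

    def allowed_by_robots(url: str, agent: str = "SpiderDuck") -> bool:
        """Check robots.txt, fetching it at most once per host per TTL."""
        parts = urlsplit(url)
        host = parts.netloc
        entry = robots_cache.get(host)
        if entry is None or time.time() - entry[0] > ROBOTS_TTL:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{host}/robots.txt")
            rp.read()                      # fetch and parse robots.txt
            entry = (time.time(), rp)
            robots_cache[host] = entry
        return entry[1].can_fetch(agent, url)

    def polite_fetch(url: str) -> bytes | None:
        """Fetch a URL, honoring robots.txt and a per-host rate limit."""
        host = urlsplit(url).netloc
        if not allowed_by_robots(url):
            return None                    # robots.txt forbids this URL
        wait = MIN_INTERVAL - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)               # per-host rate limiting
        last_hit[host] = time.time()
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()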
  • 10. System Overview
    – Metadata Store: a Cassandra-based distributed hash table that stores page metadata and resolution information keyed by URL, as well as the fetch status of every URL recently encountered by the system. This store serves clients across Twitter that need real-time access to URL metadata.
    – Content Store: an HDFS (Hadoop) cluster for archiving downloaded content and all fetch information.
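For illustration only, here is one plausible shape for a Metadata Store record keyed by URL. The field names are assumptions based on this slide, not Twitter's actual Cassandra schema, and a plain dict stands in for the distributed hash table.

    from dataclasses import dataclass

    @dataclass
    class UrlMetadata:
        url: str                       # URL as shared in the Tweet
        resolved_url: str              # canonical URL after all redirect hops
        fetch_status: str              # e.g. "ok", "robots_denied", "error"
        page_title: str | None = None
        content_type: str | None = None

    # A dict keyed by URL stands in for the Cassandra-backed store.
    metadata_store: dict[str, UrlMetadata] = {}

    metadata_store["http://t.co/abc"] = UrlMetadata(
        url="http://t.co/abc",
        resolved_url="http://example.com/article",
        fetch_status="ok",
        page_title="Example article",
        content_type="text/html",
    )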
  • 11. URL Scheduler
  • 12. URL Fetcher
  • 13. How Twitter uses SpiderDuck
    – To retrieve URL metadata (for example, the page title) and resolution information (that is, the canonical URL after redirects).
    – Other services periodically process SpiderDuck logs in HDFS to generate aggregate stats for Twitter's internal metrics dashboards or to conduct other types of batch analyses: "How many images are shared on Twitter each day?", "What news sites do Twitter users most often link to?", "How many URLs did we fetch yesterday from this specific website?"
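As a toy version of the batch analyses mentioned above ("What news sites do Twitter users most often link to?"), one could count hosts over a day of resolved-URL log lines. The one-URL-per-line format is an assumption; the real logs live in HDFS and are far richer.

    from collections import Counter
    from urllib.parse import urlsplit

    def top_linked_sites(log_lines: list[str], n: int = 10) -> list[tuple[str, int]]:
        """Count the hosts of resolved URLs and return the n most common."""
        hosts = (urlsplit(line.strip()).netloc for line in log_lines if line.strip())
        return Counter(hosts).most_common(n)

    print(top_linked_sites([
        "http://news.example.com/a",
        "http://news.example.com/b",
        "http://blog.example.org/c",
    ]))
    # -> [('news.example.com', 2), ('blog.example.org', 1)]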
  • 14. Performance numbers
    – For URLs that get fetched, SpiderDuck's median processing latency is under 2 seconds, and the 99th percentile is under 5 seconds. (Latency is measured from the moment "Tweet" is clicked: the URL in that Tweet is extracted, prepared for fetch, all redirect hops are retrieved, the content is downloaded and parsed, and the metadata is extracted and made available to clients via the Metadata Store.)
  • 15. Performance numbers (cont.)
    – Most of that time is spent either in the Fetcher request queues (due to rate limiting) or in actually fetching from the external web server. SpiderDuck itself adds no more than a few hundred milliseconds of processing overhead, most of which is spent in HTML parsing.
    – The Cassandra-based Metadata Store handles about 10,000 requests per second. The store's median read latency is 4-5 ms, and its 99th percentile is 50-60 ms.
  • 16. That's All
    – Read the acknowledgements part; thanks to everyone listed there.
    – Links to resources (open-source libraries) will be given to Mr. Saparkhojayev; ask him to share them.
  • 17. Thank you!
  • 18. @murzabayev