Link extraction and classification


A lecture I gave at IST (Tagus Park) about the challenge of extracting information from a real-time feed of links.

Published in: Technology
  1. Link extraction and classification. Bruno Pedro, December 2010
  2. Bruno Pedro: An experienced Web developer and entrepreneur. Has extensive background in large scale projects and technical writing.
  3. What is tarpipe? (diagram label: User)
  4. What is tarpipe?
  5. twitter source:
     • Average ~1.1M tweets/hour
     • ~300 new tweets every second
  6. Challenge
     • ~300 reads/second
     • 160 × 300 = 48 KB/second = 4 GB/day (approximate calculation)
     • How to process all this information?
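
The estimate on the Challenge slide can be checked with a few lines of arithmetic. This is only a sketch: the figures come from the slides, while the constant names and the assumption that each tweet occupies 160 bytes (one byte per character) are illustrative.

```python
# Figures from the slides; one byte per character is an assumption.
TWEETS_PER_HOUR = 1_100_000
READS_PER_SECOND = 300      # rounded rate used on the slide
BYTES_PER_TWEET = 160       # per-tweet size used on the slide

SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

tweets_per_second = TWEETS_PER_HOUR / SECONDS_PER_HOUR
bytes_per_second = READS_PER_SECOND * BYTES_PER_TWEET
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{tweets_per_second:.0f} tweets/second")
print(f"{bytes_per_second / 1_000:.0f} KB/second")
print(f"{bytes_per_day / 1_000_000_000:.2f} GB/day")
```

With decimal units this gives about 306 tweets/second, 48 KB/second and roughly 4.15 GB/day, in line with the rounded numbers on the slide.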
  7. Strategy
     • Read and store immediately:
        • High performance write storage
        • No locks allowed
        • Prepare for lots of reading errors
  8. Strategy
     • Process later:
        • Regular expressions
        • Term extraction
        • Machine learning
  9. Link extraction
  10. Link extraction (flow diagram): new tweet → has URL? → yes: extract URL, extract user, save, launch process, show; no: dump
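
The "has URL?" decision in the link extraction slide could be sketched with a regular expression, as suggested by the Strategy slide. This is a minimal illustration, not the tarpipe implementation: the tweet shape (a dict with `user` and `text` keys), the function name `extract_links`, and the deliberately loose URL pattern are all assumptions.

```python
import re

# Loose pattern: anything that starts with http:// or https:// up to whitespace.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_links(tweet):
    """Return the user and URLs of a tweet, or None if it has no URL.

    `tweet` is assumed to be a dict with "user" and "text" keys
    (a hypothetical shape chosen for this sketch).
    """
    # Strip punctuation that usually belongs to the sentence, not the URL.
    urls = [u.rstrip(".,;:!?)") for u in URL_RE.findall(tweet["text"])]
    if not urls:
        return None  # the "dump" branch: nothing to extract
    return {"user": tweet["user"], "urls": urls}


if __name__ == "__main__":
    print(extract_links({"user": "alice", "text": "Reading http://example.com/a, nice"}))
    print(extract_links({"user": "bob", "text": "No links here"}))
```

Returning `None` for tweets without links keeps the caller simple: anything that is not `None` goes on to be saved and processed.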
  11. Content extraction
     • Short URLs
     • HTTP redirects
     • HTTP errors: retry?
     • Content analysis
  12. Content extraction (flow diagram): URL → fetch contents → redirect? (yes: fetch again) → error? (yes: retry? yes: fetch again; no: stop) → no: save
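
The fetch, redirect and retry loop from the content extraction slides might look like the sketch below. It is written under stated assumptions: `fetch` is a hypothetical callable supplied by the caller that returns `(status, headers, body)` without following redirects, and the limits, the name `fetch_content`, and the choice to retry only network failures and 5xx responses are illustrative rather than taken from the lecture.

```python
from urllib.parse import urljoin


def fetch_content(url, fetch, max_redirects=5, max_retries=2):
    """Resolve `url` to its final location and body, or return None.

    `fetch(url)` is a hypothetical callable returning (status, headers, body);
    `headers` is assumed to be a dict with a "Location" key on redirects.
    """
    redirects = 0
    retries = 0
    while True:
        try:
            status, headers, body = fetch(url)
        except OSError:
            # Network-level failure: treat it like a retryable error.
            status, headers, body = None, {}, b""

        if status is not None and 300 <= status < 400 and "Location" in headers:
            redirects += 1
            if redirects > max_redirects:
                return None  # probably a redirect loop
            # Short URLs usually land here: follow the redirect target.
            url = urljoin(url, headers["Location"])
            continue

        if status == 200:
            return url, body  # the "save" branch

        if status is None or status >= 500:
            retries += 1
            if retries > max_retries:
                return None
            continue  # the "retry? yes" branch

        return None  # 4xx and anything else: give up
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.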
  13. Content analysis
     • Assume malformed (X)HTML
     • Regular expressions or
     • Convert to XHTML
     • DOM traversing
  14. Content classification
     • Tag cloud:
        • HTML title, description, keywords
        • H1, H2, H3, ...
        • Paragraphs
     • Graph placement:
        • Who shared the link?
        • Internal and external links
  15. Content classification (flow diagram): (X)HTML → extract head (head found? yes: save) → extract text elements (H1, H2, ... found? yes: save) → extract text (paragraphs found? yes: save)
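
The extraction steps from the content analysis and classification slides (title, description, keywords, headings, paragraphs) can be sketched with Python's standard `html.parser`, which tolerates malformed markup. This swaps in an event-based parser for the "convert to XHTML, then traverse the DOM" route named on the slide; the class name `PageSummary` and its attributes are hypothetical.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect title, meta description/keywords, headings and paragraphs."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self.headings = []    # list of (tag, text)
        self.paragraphs = []  # list of text
        self._current = None  # element currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""
        elif tag in ("title", "p") or tag in self.HEADINGS:
            # Malformed pages often omit closing tags, so a new block
            # element implicitly ends the previous one.
            self._flush()
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        if self._current and text:
            if self._current == "title":
                self.title = text
            elif self._current == "p":
                self.paragraphs.append(text)
            else:
                self.headings.append((self._current, text))
        self._current = None
        self._buffer = []

    def close(self):
        super().close()
        self._flush()  # keep text from an element left open at end of input
```

A caller would create a `PageSummary`, call `feed(html)` followed by `close()`, and then read `title`, `meta`, `headings` and `paragraphs` as raw material for the tag cloud.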
  16. The big picture: new tweet → extract URL → extract content → classify content → save
  17. Food for thought
     • PubSubHubbub
     • Activity Streams
     • twitter streaming API
  18. • "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop." Adam Pash, lifehacker
     • "tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload." Rafe Needleman, CNET news
     • "Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation." Jeff Barr
     • thank you
     • automated publishing