Link extraction and classification

  1. Link extraction and classification • Bruno Pedro • December 2010 • Image: http://www.flickr.com/photos/pio1976/3330670980/
  2. Bruno Pedro • An experienced Web developer and entrepreneur. Has an extensive background in large-scale projects and technical writing. • http://tarpipe.com/user/bpedro
  3. What is tarpipe? (diagram; label: User)
  4. What is tarpipe?
  5. Twitter (source: mashable.com) • Average of ~1.1M tweets/hour • ~300 new tweets every second
  6. Challenge • ~300 reads/second • 160 bytes × 300 = 48 KB/second ≈ 4 GB/day (approximate calculation) • How to process all this information?
  7. Strategy • Read and store immediately: • High-performance write storage • No locks allowed • Prepare for lots of reading errors
  8. Strategy • Process later: • Regular expressions • Term extraction • Machine learning
  9. Link extraction
  10. Link extraction (flow diagram) • Nodes: new tweet, has URL?, extract URL, extract user, save, launch process, show, dump • Branch labels: yes / no
  11. Content extraction • Short URLs • HTTP redirects • HTTP errors: retry? • Content analysis
  12. Content extraction (flow diagram) • Nodes: URL, fetch contents, redirect?, error?, retry?, save • Branch labels: yes / no
  13. Content analysis • Assume malformed (X)HTML • Regular expressions, or • Convert to XHTML • DOM traversing
  14. Content classification • Tag cloud: • HTML title, description, keywords • H1, H2, H3, ... • Paragraphs • Graph placement: • Who shared the link? • Internal and external links
  15. Content classification (flow diagram) • Nodes: (X)HTML, extract head, head found?, extract text elements, H1, H2, ... found?, extract text, paragraphs found?, save • Branch labels: yes / no
  16. The big picture (flow diagram) • Nodes: new tweet, extract URL, extract content, classify content, save
  17. Food for thought • PubSubHubbub: http://code.google.com/p/pubsubhubbub/ • Activity Streams: http://activitystrea.ms/ • Twitter streaming API: http://dev.twitter.com/pages/streaming_api
  18. "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop." (Adam Pash, lifehacker) • "tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload." (Rafe Needleman, CNET news) • "Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation." (Jeff Barr, Amazon.com) • thank you • automated publishing
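
The estimate on the Challenge slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the slide's "160" means roughly 160 bytes per tweet and that 1 KB = 1,000 bytes; the function names are illustrative and do not come from the talk.

```python
# Assumptions (not stated explicitly on the slide): ~160 bytes per tweet,
# ~300 tweets per second, decimal units (1 KB = 1,000 bytes).
TWEETS_PER_SECOND = 300
BYTES_PER_TWEET = 160
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_second(rate=TWEETS_PER_SECOND, size=BYTES_PER_TWEET):
    """Raw ingest volume per second, in bytes."""
    return rate * size


def bytes_per_day(rate=TWEETS_PER_SECOND, size=BYTES_PER_TWEET):
    """Raw ingest volume per day, in bytes."""
    return bytes_per_second(rate, size) * SECONDS_PER_DAY


if __name__ == "__main__":
    print(bytes_per_second() / 1_000, "KB/second")      # 48.0 KB/second
    print(round(bytes_per_day() / 1e9, 2), "GB/day")    # 4.15 GB/day
```

Under these assumptions the daily volume comes out at about 4.15 GB, which matches the slide's rounded figure of 4 GB/day.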
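
The link extraction step (new tweet, "has URL?", extract URL, extract user) could look roughly like the sketch below. It uses a regular expression, as the Strategy slide suggests, but the pattern, the `user` and `text` dictionary keys, and the function names are assumptions made for illustration, not tarpipe's actual code.

```python
import re

# Deliberately simple pattern: an http(s) scheme followed by anything that is
# not whitespace, an angle bracket, or a quote. Real tweets need more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text):
    """Return every URL-looking substring, without trailing punctuation."""
    return [match.rstrip(TRAILING_PUNCTUATION) for match in URL_PATTERN.findall(text)]


def process_tweet(tweet):
    """Return the user and URLs to save, or None when the tweet has no URL.

    `tweet` is a dict with hypothetical 'user' and 'text' keys.
    """
    urls = extract_urls(tweet["text"])
    if not urls:
        return None  # the "no" branch: nothing to extract
    return {"user": tweet["user"], "urls": urls}
```

For example, `process_tweet({"user": "bpedro", "text": "Slides: http://tarpipe.com/user/bpedro, enjoy!"})` returns the user together with the single URL, with the trailing comma stripped. Stripping closing brackets is a simplification that would truncate URLs that legitimately end in `)`.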
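
The content extraction decisions (redirect? error? retry?) can be sketched as a small loop. To keep the example runnable without network access, the HTTP call is passed in as a `fetch` function returning `(status, location, body)`; that signature, the retry policy (retry only on network failures and 5xx responses), and the limits are assumptions for illustration and are not taken from the slides.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def resolve(url, fetch, max_redirects=5, max_retries=2):
    """Follow redirects (e.g. short URLs) and retry transient errors.

    `fetch(url)` is a caller-supplied function returning a tuple
    (status, location, body); it may raise OSError on network failure.
    Returns (final_url, body) on success, or None when giving up.
    """
    redirects = 0
    retries = 0
    while True:
        try:
            status, location, body = fetch(url)
        except OSError:
            status, location, body = None, None, None

        if status in REDIRECT_STATUSES and location:
            if redirects >= max_redirects:
                return None  # probable redirect loop
            redirects += 1
            url = urljoin(url, location)  # Location may be relative
            continue

        if status == 200:
            return url, body

        # Error branch: retry only failures that might be temporary.
        transient = status is None or status >= 500
        if transient and retries < max_retries:
            retries += 1
            continue
        return None
```

In production `fetch` would wrap a real HTTP client with redirect following disabled, so that each hop of a short URL can be recorded before the final page is saved.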
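
The content analysis and classification slides (assume malformed (X)HTML; collect title, description, keywords, headings and paragraphs into a tag cloud) can be illustrated with Python's tolerant `html.parser`. This is a minimal sketch under stated assumptions: the field weights are hypothetical, no stop words are removed, and it swaps the slides' "convert to XHTML, then traverse the DOM" approach for an event-based parser.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TEXT_TAGS = {"title", "h1", "h2", "h3", "p"}
WORD = re.compile(r"[a-z0-9]+")
# Hypothetical weights: terms in the head and headings count for more.
WEIGHTS = {"title": 3, "meta": 3, "headings": 2, "paragraphs": 1}


class PageExtractor(HTMLParser):
    """Collect text per field; unclosed tags do not raise errors."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {"title": [], "meta": [], "headings": [], "paragraphs": []}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() in ("description", "keywords"):
                self.fields["meta"].append(attributes.get("content") or "")
        elif tag in TEXT_TAGS:
            self._current = tag  # a new text element implicitly closes the old one

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return
        if self._current == "title":
            key = "title"
        elif self._current == "p":
            key = "paragraphs"
        else:
            key = "headings"
        self.fields[key].append(text)


def tag_cloud(html):
    """Return weighted term counts for one page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for field, chunks in parser.fields.items():
        for chunk in chunks:
            for word in WORD.findall(chunk.lower()):
                counts[word] += WEIGHTS[field]
    return counts
```

Calling `tag_cloud(page).most_common(10)` gives the ten heaviest terms for a page. The "graph placement" half of the classification (who shared the link, internal and external links) is not covered by this sketch.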
