Link extraction and classification

A lecture I gave at IST (Tagus Park) about the challenge of extracting information from a real-time feed of links.

Published in: Technology
  • Link extraction and classification

    1. Link extraction and classification, Bruno Pedro, December 2010 (http://www.flickr.com/photos/pio1976/3330670980/)
    2. Bruno Pedro: An experienced Web developer and entrepreneur. Has extensive background in large scale projects and technical writing. http://tarpipe.com/user/bpedro
    3. What is tarpipe? (diagram label: User)
    4. What is tarpipe?
    5. twitter (source: mashable.com)
        • Average ~1.1M tweets/hour
        • ~300 new tweets every second
    6. Challenge
        • ~300 reads/second
        • 160 × 300 = 48 KB/second ≈ 4 GB/day (approximate calculation)
        • How to process all this information?
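
The approximate calculation on the Challenge slide can be checked with a few lines of arithmetic. This is only a sketch: the slide gives the figure 160 without a unit, so treating it as bytes per tweet is an assumption, as are the decimal KB and GB units.

```python
# Figures taken from the slides; the byte unit for 160 is an assumption.
BYTES_PER_TWEET = 160
TWEETS_PER_SECOND = 300
TWEETS_PER_HOUR = 1_100_000
SECONDS_PER_DAY = 24 * 60 * 60

# ~1.1M tweets/hour is a little over 300 tweets/second.
tweets_per_second_from_hourly = TWEETS_PER_HOUR / 3600

bytes_per_second = BYTES_PER_TWEET * TWEETS_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

kb_per_second = bytes_per_second / 1_000          # decimal kilobytes
gb_per_day = bytes_per_day / 1_000_000_000        # decimal gigabytes

print(f"{tweets_per_second_from_hourly:.1f} tweets/second")
print(f"{kb_per_second:.0f} KB/second, {gb_per_day:.2f} GB/day")
```

Under these assumptions the daily volume comes out slightly above 4 GB, which matches the rounded figure on the slide.
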
    7. Strategy
        • Read and store immediately:
            • High performance write storage
            • No locks allowed
            • Prepare for lots of reading errors
    8. Strategy
        • Process later:
            • Regular expressions
            • Term extraction
            • Machine learning
    9. Link extraction
    10. Link extraction (flowchart nodes: new tweet; has URL?; extract URL; extract user; save; show; launch process; dump. Branch labels: yes, no)
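
The "has URL?", "extract URL" and "extract user" steps could be sketched with a regular expression, one of the techniques the Strategy slide lists. This is a minimal illustration, not the deck's implementation: the tweet is assumed to be a dict with hypothetical `user` and `text` keys, and the URL pattern is deliberately simple.

```python
import re

# Simple pattern: scheme followed by anything that is not whitespace,
# an angle bracket, or a quote. Real-world URL matching is messier.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that usually belongs to the sentence, not to the URL.
TRAILING_PUNCTUATION = ".,;:!?)]}'\""


def extract_links(tweet):
    """Return one {'user', 'url'} record per URL found in the tweet text.

    `tweet` is a hypothetical dict with 'user' and 'text' keys.
    An empty list means the tweet has no URL.
    """
    records = []
    for match in URL_RE.finditer(tweet["text"]):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        records.append({"user": tweet["user"], "url": url})
    return records


if __name__ == "__main__":
    sample = {"user": "bpedro", "text": "Reading http://tarpipe.com/user/bpedro, nice!"}
    print(extract_links(sample))
```

A tweet for which `extract_links` returns an empty list would take the "no" branch of the flowchart.
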
    11. Content extraction
        • Short URLs
        • HTTP redirects
        • HTTP errors: retry?
        • Content analysis
    12. Content extraction (flowchart nodes: URL; fetch contents; redirect?; error?; retry?; save. Branch labels: yes, no)
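
One way to read the fetch, redirect, error and retry loop is sketched below. The limits, the choice to retry only connection failures and 5xx responses, and the `fetch_once` callable are all assumptions made for illustration; `fetch_once(url)` is a hypothetical function returning `(status, location, body)` so that the control flow can be shown without tying it to a particular HTTP client.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_contents(url, fetch_once, max_redirects=5, max_retries=2):
    """Resolve short URLs and redirects, retrying transient errors.

    Returns (final_url, body) on success, or None when the URL is given up on.
    `fetch_once` is a hypothetical callable: url -> (status, location, body).
    """
    redirects = 0
    retries = 0
    while True:
        try:
            status, location, body = fetch_once(url)
        except OSError:
            # Connection-level failure: treat it like a retryable error.
            status, location, body = None, None, None

        if status in REDIRECT_CODES and location:
            redirects += 1
            if redirects > max_redirects:
                return None
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            retries = 0
            continue

        if status == 200:
            return url, body

        # Assumption: only connection failures and 5xx responses are retried.
        if status is None or 500 <= status < 600:
            retries += 1
            if retries > max_retries:
                return None
            continue

        return None
```

Passing the fetcher in as a parameter also makes the loop easy to exercise with canned responses instead of live requests.
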
    13. Content analysis
        • Assume malformed (X)HTML
        • Regular expressions, or
        • Convert to XHTML
        • DOM traversing
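
The regular-expression option can be illustrated with a small title extractor that does not depend on the markup being well formed. The function name and pattern are illustrative assumptions, and this sketch covers only the `<title>` element.

```python
import html
import re

# Case-insensitive, and DOTALL so a title spanning several lines still matches.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(markup):
    """Return the page title with entities decoded and whitespace collapsed.

    Returns None when no <title> element can be found.
    """
    match = TITLE_RE.search(markup)
    if match is None:
        return None
    text = html.unescape(match.group(1))
    return " ".join(text.split()) or None


if __name__ == "__main__":
    broken = "<html><head><TITLE>\n Hello &amp; welcome </TITLE><body><p>unclosed"
    print(extract_title(broken))
```

The regular expression works here even though the document never closes `<head>`, `<body>` or `<p>`, which is the situation the slide asks to assume.
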
    14. Content classification
        • Tag cloud:
            • HTML title, description, keywords
            • H1, H2, H3, ...
            • Paragraphs
        • Graph placement:
            • Who shared the link?
            • Internal and external links
    15. Content classification (flowchart nodes: (X)HTML; extract head; head found?; extract text elements; H1, H2, ... found?; paragraphs found?; extract text; save. Branch labels: yes, no)
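
The head, heading and paragraph extraction that feeds the tag cloud could look roughly like the sketch below. It uses Python's standard `html.parser`, which tolerates malformed markup, in place of the XHTML conversion and DOM traversal named on the slides. The class and function names, the three-character minimum term length, and the flat unweighted count are all assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect title, meta description/keywords, headings and paragraphs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {"title": [], "meta": [], "headings": [], "paragraphs": []}
        self._current = None  # which field incoming text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag in HEADINGS:
            self._current = "headings"
        elif tag == "p":
            self._current = "paragraphs"
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.fields["meta"].append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title" or tag == "p" or tag in HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current].append(data.strip())


def extract_fields(markup):
    """Parse possibly malformed HTML and return the collected text fields."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.fields


def term_counts(fields, min_length=3):
    """Count lowercase terms across all fields as raw tag-cloud input."""
    counts = Counter()
    for texts in fields.values():
        for text in texts:
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                if len(term) >= min_length:
                    counts[term] += 1
    return counts
```

A real tag cloud would probably weight title and heading terms above paragraph terms and drop stop words; both are left out to keep the sketch short.
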
    16. The big picture (flowchart nodes: new tweet; extract URL; extract content; classify content; save)
    17. Food for thought
        • PubsubHubbub: http://code.google.com/p/pubsubhubbub/
        • Activity Streams: http://activitystrea.ms/
        • twitter streaming API: http://dev.twitter.com/pages/streaming_api
    18. thank you
        • "tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop." (Adam Pash, lifehacker)
        • "tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload." (Rafe Needleman, CNET news)
        • "Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation." (Jeff Barr, Amazon.com)
        • tarpipe: automated publishing
