http://www.flickr.com/photos/pio1976/3330670980/




Link extraction and classification
Bruno Pedro
December 2010
Bruno Pedro
An experienced Web developer and
entrepreneur. Has extensive background in
large scale projects and technical writing.

http://tarpipe.com/user/bpedro
What is tarpipe?

[Diagram: only the label "User" is recoverable]
twitter


                      source: mashable.com




• Average ~1.1M tweets/hour
• ~300 new tweets every second
Challenge
• ~300 reads/second
• 160 bytes × 300 = 48 KB/second ≈ 4 GB/day
  (approximate calculation)




• How to process all this information?
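
The estimate on this slide can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 160 on the slide means bytes per tweet and uses decimal units (1 KB = 1000 bytes).

```python
# Figures taken from the slide; the unit "bytes" is an assumption.
BYTES_PER_TWEET = 160
TWEETS_PER_SECOND = 300
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = BYTES_PER_TWEET * TWEETS_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second / 1000:.0f} KB/second")  # 48 KB/second
print(f"{bytes_per_day / 1e9:.2f} GB/day")         # 4.15 GB/day
```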
Strategy
• Read and store immediately:
 • High performance write storage
 • No locks allowed
 • Prepare for lots of reading errors
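
One possible reading of "read and store immediately" is an append-only log: every readable record is written as one line, and unreadable records are counted and skipped instead of stopping the reader. The sketch below assumes tweets arrive as JSON text lines; the function name `store_raw` and the single-writer, append-only design are illustrative assumptions, not the storage tarpipe actually uses.

```python
import io
import json
from typing import IO, Iterable, Tuple


def store_raw(lines: Iterable[str], sink: IO[str]) -> Tuple[int, int]:
    """Append each readable record to `sink`; return (stored, failed)."""
    stored = failed = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Reading errors are expected: count them and keep going.
            failed += 1
            continue
        # One record per line, appended only. A single writer that never
        # updates in place needs no locking.
        sink.write(json.dumps(record) + "\n")
        stored += 1
    return stored, failed


if __name__ == "__main__":
    log = io.StringIO()  # stands in for a file opened in append mode
    print(store_raw(['{"user": "a", "text": "hi"}', "garbage{"], log))
```

All analysis is deferred: the writer does nothing beyond checking that a record can be read at all.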
Strategy
• Process later:
 • Regular expressions
 • Term extraction
 • Machine learning
Link extraction
Link extraction

[Flowchart: launch process → new tweet → has URL?
 yes: extract user → extract URL → save → show
 no: dump]
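
The link extraction flow could be sketched with a regular expression, as below. The tweet field names (`user`, `text`), the URL pattern, and the function name `extract_links` are assumptions made for illustration; a production pattern would need to be more careful.

```python
import re
from typing import Dict, List, Optional, Union

# Deliberately simple pattern: anything from http(s):// up to whitespace.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_links(tweet: Dict[str, str]) -> Optional[Dict[str, Union[str, List[str], None]]]:
    """Return the user and URLs of a tweet, or None when it has no URL."""
    matches = URL_PATTERN.findall(tweet.get("text", ""))
    # Sentence punctuation glued to the end of a URL is rarely part of it.
    urls = [match.rstrip(TRAILING_PUNCTUATION) for match in matches]
    if not urls:
        return None  # the "dump" branch: nothing to save
    return {"user": tweet.get("user"), "urls": urls}


if __name__ == "__main__":
    print(extract_links({"user": "bpedro",
                         "text": "Slides at http://tarpipe.com/user/bpedro."}))
```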
Content extraction

  • Short URLs
  • HTTP redirects
  • HTTP errors — retry?
  • Content analysis
Content extraction

[Flowchart: URL → fetch contents → redirect?
 yes: fetch contents again for the redirect target
 no: error?
   yes: retry? yes: fetch contents again
   no: save]
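
The fetch, redirect, and retry loop might look like the sketch below. To keep it independent of any HTTP library, the actual request is passed in as a `fetch` callable returning `(status, location, body)`; that callable, the limits, and the choice to retry only network failures and 5xx responses are all assumptions, since the slides leave the retry policy open.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# (HTTP status, Location header or None, response body)
Response = Tuple[int, Optional[str], str]
REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_contents(url: str,
                   fetch: Callable[[str], Response],
                   max_redirects: int = 5,
                   max_retries: int = 2) -> Optional[Tuple[str, str]]:
    """Follow redirects and retry errors; return (final_url, body) or None."""
    redirects = retries = 0
    while True:
        try:
            status, location, body = fetch(url)
        except OSError:
            status, location, body = 0, None, ""  # network failure
        if status in REDIRECT_CODES and location:
            if redirects >= max_redirects:
                return None  # probably a redirect loop
            redirects += 1
            url = urljoin(url, location)  # short URLs resolve here
            continue
        if status == 0 or status >= 400:
            retryable = status == 0 or status >= 500
            if not retryable or retries >= max_retries:
                return None
            retries += 1
            continue
        return url, body  # ready to save
```

Any HTTP client can be wrapped to fit the `fetch` signature, as long as it is configured not to follow redirects on its own.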
Content analysis

• Assume malformed (X)HTML
• Regular expressions
  or
• Convert to XHTML
• DOM traversing
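
As a third option alongside the two on this slide, the sketch below uses Python's standard-library `html.parser`, an event-based parser that tolerates malformed markup. Note that it does not build a DOM or convert to XHTML, so it is a substitute for those techniques, not an implementation of them. The class and function names are illustrative.

```python
from html.parser import HTMLParser
from typing import Dict, List


class PageText(HTMLParser):
    """Collect title, meta description/keywords, headings and paragraphs."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.fields: Dict[str, List[str]] = {
            "title": [], "headings": [], "paragraphs": []}
        self.meta: Dict[str, str] = {}
        self._current = None  # field currently receiving text

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attributes.get("content") or ""
            return
        if tag == "title":
            self._current = "title"
        elif tag in self.HEADINGS:
            self._current = "headings"
        elif tag == "p":
            self._current = "paragraphs"
        else:
            return  # inline tags such as <b> keep the current field
        # A new start tag opens a new entry even if the last was never closed.
        self.fields[self._current].append("")

    def handle_endtag(self, tag):
        if tag == "title" or tag == "p" or tag in self.HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current][-1] += data


def extract_page(html: str) -> Dict[str, object]:
    parser = PageText()
    parser.feed(html)
    parser.close()

    def clean(items: List[str]) -> List[str]:
        return [" ".join(text.split()) for text in items if text.strip()]

    return {"title": " ".join(clean(parser.fields["title"])),
            "meta": parser.meta,
            "headings": clean(parser.fields["headings"]),
            "paragraphs": clean(parser.fields["paragraphs"])}
```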
Content classification
• Tag cloud:
 • HTML title, description, keywords
 • H1, H2, H3, ...
 • Paragraphs
• Graph placement:
 • Who shared the link?
 • Internal and external links
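
Separating internal from external links can be sketched with the standard library alone. The rule used here, that a link is internal when its host matches the host of the page it appears on, is an assumption; the slides do not define the distinction. The function name `classify_links` is illustrative.

```python
from typing import Dict, Iterable, List
from urllib.parse import urljoin, urlparse


def classify_links(page_url: str, hrefs: Iterable[str]) -> Dict[str, List[str]]:
    """Split the links found on a page into internal and external ones."""
    page_host = urlparse(page_url).netloc.lower()
    links: Dict[str, List[str]] = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        kind = "internal" if parsed.netloc.lower() == page_host else "external"
        links[kind].append(absolute)
    return links


if __name__ == "__main__":
    print(classify_links("http://tarpipe.com/user/bpedro",
                         ["/about", "http://activitystrea.ms/"]))
```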
Content classification

[Flowchart: (X)HTML → head found?
 yes: extract head elements
 then H1, H2, ... found? yes: extract text
 then paragraphs found? yes: extract text
 then save
 A "no" at the head or heading check skips to the next check.]
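
Once the head, heading, and paragraph text has been extracted, a tag cloud can be approximated by weighted term counting. The weights, the stop-word list, and the input layout (a dict of text lists per field) are all hypothetical choices made for this sketch.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical weights: a term in the title counts more than one in a paragraph.
WEIGHTS = {"title": 3, "headings": 2, "paragraphs": 1}
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}
WORD = re.compile(r"[a-z0-9]+")


def tag_cloud(page: Dict[str, List[str]], top: int = 10) -> List[Tuple[str, int]]:
    """Return the `top` highest-weighted terms of a page."""
    counts: Counter = Counter()
    for field, weight in WEIGHTS.items():
        for text in page.get(field, []):
            for term in WORD.findall(text.lower()):
                if term not in STOP_WORDS and len(term) > 2:
                    counts[term] += weight
    return counts.most_common(top)


if __name__ == "__main__":
    page = {"title": ["Link extraction"],
            "headings": ["Extraction of links"],
            "paragraphs": ["The link is saved"]}
    print(tag_cloud(page, top=2))  # [('extraction', 5), ('link', 4)]
```

No stemming is applied, so "link" and "links" are counted as separate terms.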
The big picture

[Flowchart: new tweet → extract URL → extract content → classify content → save]
Food for thought
• PubsubHubbub
  http://code.google.com/p/pubsubhubbub/

• Activity Streams
  http://activitystrea.ms/
• twitter streaming API
  http://dev.twitter.com/pages/streaming_api
tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop.
Adam Pash, lifehacker

tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload.
Rafe Needleman, CNET news

Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation.
Jeff Barr, Amazon.com




thank you                                            automated publishing
