SlideShare a Scribd company logo
http://www.flickr.com/photos/pio1976/3330670980/




                                                  Link extraction and classification
                                                                         Bruno Pedro
                                                                          December 2010
Bruno Pedro
A n e x p e r i e n c e d We b d e v e l o p e r a n d
entrepreneur. Has extensive background in
large scale projects and technical writing.

http://tarpipe.com/user/bpedro
What is tarpipe?




       User
What is tarpipe?
twitter


                      source: mashable.com




• Average ~1.1M tweets/hour
• ~300 new tweets every second
Challenge
• ~300 reads/second
• 160 X 300 = 48 KB/second = 4 GB/day
  (approximate calculation)




• How to process all this information?
Strategy
• Read and store immediately:
 • High performance write storage
 • No locks allowed
 • Prepare for lots of reading errors
Strategy
• Process later:
 • Regular expressions
 • Term extraction
 • Machine learning
Link extraction
Link extraction
                        extract user   extract URL




         new tweet             yes




launch process           has URL?         save       show




                                no




                           dump
Content extraction

  • Short URLs
  • HTTP redirects
  • HTTP errors — retry?
  • Content analysis
Content extraction
    URL                                   retry?
                               yes



                                              yes



                                     no
fetch contents         redirect?          error?

                 yes

                                              no




                                          save
Content analysis

• Assume malformed (X)HTML
• Regular expressions
  or
• Convert to XHTML
• DOM traversing
Content classification
                • HTML title, description, keywords
   tag
  cloud
            {   • H1, H2, H3, ...
                • Paragraphs

  graph
placement   {   • Who shared the link?
                • Internal and external links
Content classification
 (X)HTML                 extract head
                                                   extract text         extract text
                          elements




                   yes                       yes                  yes




                          H1,H2,...                paragraphs
head found?                                                                save
                           found?                    found?




              no                        no
The big picture

new tweet                     classify content   save
             extract URL




            extract content
Food for thought
• PubsubHubbub
  http://code.google.com/p/pubsubhubbub/

• Activity Streams
  http://activitystrea.ms/
• twitter streaming API
  http://dev.twitter.com/pages/streaming_api
tarpipe streamlines your     tarpipe is one of the most     Today I had a chance to
updates to various social    curious experiments in         spend time experimenting
web sites, creating simple   social media that I've         with tarpipe and I have to
or complex workflows to       seen lately. The service       say that I am intrigued by
update several buckets in    has the potential to be        the concept and impressed
one fell swoop.              the answer to the lament       by the implementation.
                             I first talked about in The
Adam Pash                    looming crisis: Personal       Jeff Barr
lifehacker                   syndication overload.          Amazon.com

                             Rafe Needleman
                             CNET news




thank you                                            automated publishing

More Related Content

Viewers also liked

Schnelle Geschäfte
Schnelle GeschäfteSchnelle Geschäfte
Schnelle Geschäfte
Mayflower GmbH
 
Introd to CS. Presentation
Introd to CS. PresentationIntrod to CS. Presentation
Introd to CS. Presentation
Miguel Rebollo
 
Test-Driven JavaScript Development IPC
Test-Driven JavaScript Development IPCTest-Driven JavaScript Development IPC
Test-Driven JavaScript Development IPC
Mayflower GmbH
 
The Executable Web
The Executable WebThe Executable Web
The Executable Web
Bruno Pedro
 
Native Cross-Platform-Apps mit Titanium Mobile und Alloy
Native Cross-Platform-Apps mit Titanium Mobile und AlloyNative Cross-Platform-Apps mit Titanium Mobile und Alloy
Native Cross-Platform-Apps mit Titanium Mobile und Alloy
Mayflower GmbH
 
tarpipe WordPress plugin demo
tarpipe WordPress plugin demotarpipe WordPress plugin demo
tarpipe WordPress plugin demo
Bruno Pedro
 
Computer concepts presentation 2
Computer concepts presentation 2Computer concepts presentation 2
Computer concepts presentation 2
Arunodya Silva
 
Maintainable consumers
Maintainable consumersMaintainable consumers
Maintainable consumers
Bruno Pedro
 
Shoeism - Frau im Glück
Shoeism - Frau im GlückShoeism - Frau im Glück
Shoeism - Frau im Glück
Mayflower GmbH
 
Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)Bruno Pedro
 
Plugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debuggingPlugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debuggingMayflower GmbH
 
Who's using your API?
Who's using your API?Who's using your API?
Who's using your API?Bruno Pedro
 
Salt and pepper — native code in the browser Browser using Google native Client
Salt and pepper — native code in the browser Browser using Google native ClientSalt and pepper — native code in the browser Browser using Google native Client
Salt and pepper — native code in the browser Browser using Google native Client
Mayflower GmbH
 
APIs Love to Chat
APIs Love to ChatAPIs Love to Chat
APIs Love to Chat
Bruno Pedro
 
How to Automate API Discovery
How to Automate API DiscoveryHow to Automate API Discovery
How to Automate API Discovery
Bruno Pedro
 
Is OAuth Really Secure?
Is OAuth Really Secure?Is OAuth Really Secure?
Is OAuth Really Secure?
Bruno Pedro
 
The importance of /me
The importance of /meThe importance of /me
The importance of /me
Bruno Pedro
 
Information Retrieval Challenges
Information Retrieval ChallengesInformation Retrieval Challenges
Information Retrieval Challenges
Bruno Pedro
 
API Code Generation
API Code GenerationAPI Code Generation
API Code Generation
Bruno Pedro
 

Viewers also liked (20)

Schnelle Geschäfte
Schnelle GeschäfteSchnelle Geschäfte
Schnelle Geschäfte
 
Introd to CS. Presentation
Introd to CS. PresentationIntrod to CS. Presentation
Introd to CS. Presentation
 
Test-Driven JavaScript Development IPC
Test-Driven JavaScript Development IPCTest-Driven JavaScript Development IPC
Test-Driven JavaScript Development IPC
 
The Executable Web
The Executable WebThe Executable Web
The Executable Web
 
Native Cross-Platform-Apps mit Titanium Mobile und Alloy
Native Cross-Platform-Apps mit Titanium Mobile und AlloyNative Cross-Platform-Apps mit Titanium Mobile und Alloy
Native Cross-Platform-Apps mit Titanium Mobile und Alloy
 
tarpipe WordPress plugin demo
tarpipe WordPress plugin demotarpipe WordPress plugin demo
tarpipe WordPress plugin demo
 
node-fs
node-fsnode-fs
node-fs
 
Computer concepts presentation 2
Computer concepts presentation 2Computer concepts presentation 2
Computer concepts presentation 2
 
Maintainable consumers
Maintainable consumersMaintainable consumers
Maintainable consumers
 
Shoeism - Frau im Glück
Shoeism - Frau im GlückShoeism - Frau im Glück
Shoeism - Frau im Glück
 
Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)Autenticação e Autorização (in portuguese)
Autenticação e Autorização (in portuguese)
 
Plugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debuggingPlugging holes — javascript memory leak debugging
Plugging holes — javascript memory leak debugging
 
Who's using your API?
Who's using your API?Who's using your API?
Who's using your API?
 
Salt and pepper — native code in the browser Browser using Google native Client
Salt and pepper — native code in the browser Browser using Google native ClientSalt and pepper — native code in the browser Browser using Google native Client
Salt and pepper — native code in the browser Browser using Google native Client
 
APIs Love to Chat
APIs Love to ChatAPIs Love to Chat
APIs Love to Chat
 
How to Automate API Discovery
How to Automate API DiscoveryHow to Automate API Discovery
How to Automate API Discovery
 
Is OAuth Really Secure?
Is OAuth Really Secure?Is OAuth Really Secure?
Is OAuth Really Secure?
 
The importance of /me
The importance of /meThe importance of /me
The importance of /me
 
Information Retrieval Challenges
Information Retrieval ChallengesInformation Retrieval Challenges
Information Retrieval Challenges
 
API Code Generation
API Code GenerationAPI Code Generation
API Code Generation
 

Similar to Link extraction and classification

WordPress Intermediate Workshop
WordPress Intermediate WorkshopWordPress Intermediate Workshop
WordPress Intermediate Workshop
The Toolbox, Inc.
 
Wp 3hr-course
Wp 3hr-courseWp 3hr-course
Wp 3hr-course
Rich Webster
 
Write a better FM
Write a better FMWrite a better FM
Write a better FM
Rich Bowen
 
Pub355: SEO Copywriting
Pub355: SEO CopywritingPub355: SEO Copywriting
Pub355: SEO Copywriting
somisguided
 
From PHP to Hack as a Stack and Back
From PHP to Hack as a Stack and BackFrom PHP to Hack as a Stack and Back
From PHP to Hack as a Stack and Back
Timothy Chandler
 
Big Websites
Big WebsitesBig Websites
Big Websites
Four Kitchens
 
HTML Semantic Tags
HTML Semantic TagsHTML Semantic Tags
HTML Semantic Tags
Bruce Kyle
 
Data Acquisition for Sentiment Analysis
Data Acquisition for Sentiment AnalysisData Acquisition for Sentiment Analysis
Data Acquisition for Sentiment Analysis
Ali BELCAID
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
jasonhaddix
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLP
Christian Morbidoni
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePoint
Joanne Klein
 
Why Your Site is Slow: Performance Answers for Your Clients
Why Your Site is Slow: Performance Answers for Your ClientsWhy Your Site is Slow: Performance Answers for Your Clients
Why Your Site is Slow: Performance Answers for Your Clients
Pantheon
 
CommunitySherpa Field Presentation
CommunitySherpa Field PresentationCommunitySherpa Field Presentation
CommunitySherpa Field Presentation
Chris Vaughn
 
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
Laura Scott
 
Aws r
Aws rAws r
There's No Crying In Wordpress! (an intro to WP)
There's No Crying In Wordpress! (an intro to WP)There's No Crying In Wordpress! (an intro to WP)
There's No Crying In Wordpress! (an intro to WP)
Grace Solivan
 
Content Strategy: WordCamp Buffalo 2012
Content Strategy: WordCamp Buffalo 2012Content Strategy: WordCamp Buffalo 2012
Content Strategy: WordCamp Buffalo 2012
Adrian Roselli
 
We are losing our tweets!
We are losing our tweets!We are losing our tweets!
We are losing our tweets!
John O'Brien III
 
<?php + WordPress
<?php + WordPress<?php + WordPress
<?php + WordPress
Christopher Reding
 

Similar to Link extraction and classification (20)

WordPress Intermediate Workshop
WordPress Intermediate WorkshopWordPress Intermediate Workshop
WordPress Intermediate Workshop
 
Wp 3hr-course
Wp 3hr-courseWp 3hr-course
Wp 3hr-course
 
Write a better FM
Write a better FMWrite a better FM
Write a better FM
 
Pub355: SEO Copywriting
Pub355: SEO CopywritingPub355: SEO Copywriting
Pub355: SEO Copywriting
 
From PHP to Hack as a Stack and Back
From PHP to Hack as a Stack and BackFrom PHP to Hack as a Stack and Back
From PHP to Hack as a Stack and Back
 
Big Websites
Big WebsitesBig Websites
Big Websites
 
HTML Semantic Tags
HTML Semantic TagsHTML Semantic Tags
HTML Semantic Tags
 
Data Acquisition for Sentiment Analysis
Data Acquisition for Sentiment AnalysisData Acquisition for Sentiment Analysis
Data Acquisition for Sentiment Analysis
 
The Web Application Hackers Toolchain
The Web Application Hackers ToolchainThe Web Application Hackers Toolchain
The Web Application Hackers Toolchain
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLP
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePoint
 
Handout1 intro to html
Handout1 intro to htmlHandout1 intro to html
Handout1 intro to html
 
Why Your Site is Slow: Performance Answers for Your Clients
Why Your Site is Slow: Performance Answers for Your ClientsWhy Your Site is Slow: Performance Answers for Your Clients
Why Your Site is Slow: Performance Answers for Your Clients
 
CommunitySherpa Field Presentation
CommunitySherpa Field PresentationCommunitySherpa Field Presentation
CommunitySherpa Field Presentation
 
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
Grok Drupal (7) Theming (presented at DrupalCon San Francisco)
 
Aws r
Aws rAws r
Aws r
 
There's No Crying In Wordpress! (an intro to WP)
There's No Crying In Wordpress! (an intro to WP)There's No Crying In Wordpress! (an intro to WP)
There's No Crying In Wordpress! (an intro to WP)
 
Content Strategy: WordCamp Buffalo 2012
Content Strategy: WordCamp Buffalo 2012Content Strategy: WordCamp Buffalo 2012
Content Strategy: WordCamp Buffalo 2012
 
We are losing our tweets!
We are losing our tweets!We are losing our tweets!
We are losing our tweets!
 
<?php + WordPress
<?php + WordPress<?php + WordPress
<?php + WordPress
 

More from Bruno Pedro

What are Web APIs
What are Web APIsWhat are Web APIs
What are Web APIs
Bruno Pedro
 
Growing your business with an API
Growing your business with an APIGrowing your business with an API
Growing your business with an API
Bruno Pedro
 
Product growth with an API
Product growth with an APIProduct growth with an API
Product growth with an API
Bruno Pedro
 
How to grow your business with an API
How to grow your business with an APIHow to grow your business with an API
How to grow your business with an API
Bruno Pedro
 
How to Automate API Testing
How to Automate API TestingHow to Automate API Testing
How to Automate API Testing
Bruno Pedro
 
Asynchronous Microservices in nodejs
Asynchronous Microservices in nodejsAsynchronous Microservices in nodejs
Asynchronous Microservices in nodejs
Bruno Pedro
 
OOP (in portuguese)
OOP (in portuguese)OOP (in portuguese)
OOP (in portuguese)Bruno Pedro
 
Segurança (in portuguese)
Segurança (in portuguese)Segurança (in portuguese)
Segurança (in portuguese)Bruno Pedro
 
Cache e Performance (in portuguese)
Cache e Performance (in portuguese)Cache e Performance (in portuguese)
Cache e Performance (in portuguese)Bruno Pedro
 
Web Services (in portuguese)
Web Services (in portuguese)Web Services (in portuguese)
Web Services (in portuguese)Bruno Pedro
 
Sessões (in portuguese)
Sessões (in portuguese)Sessões (in portuguese)
Sessões (in portuguese)Bruno Pedro
 
User Interface (in portuguese)
User Interface (in portuguese)User Interface (in portuguese)
User Interface (in portuguese)Bruno Pedro
 
Takeoff2008
Takeoff2008Takeoff2008
Takeoff2008
Bruno Pedro
 

More from Bruno Pedro (14)

What are Web APIs
What are Web APIsWhat are Web APIs
What are Web APIs
 
Growing your business with an API
Growing your business with an APIGrowing your business with an API
Growing your business with an API
 
Product growth with an API
Product growth with an APIProduct growth with an API
Product growth with an API
 
How to grow your business with an API
How to grow your business with an APIHow to grow your business with an API
How to grow your business with an API
 
How to Automate API Testing
How to Automate API TestingHow to Automate API Testing
How to Automate API Testing
 
Asynchronous Microservices in nodejs
Asynchronous Microservices in nodejsAsynchronous Microservices in nodejs
Asynchronous Microservices in nodejs
 
OAuth checklist
OAuth checklistOAuth checklist
OAuth checklist
 
OOP (in portuguese)
OOP (in portuguese)OOP (in portuguese)
OOP (in portuguese)
 
Segurança (in portuguese)
Segurança (in portuguese)Segurança (in portuguese)
Segurança (in portuguese)
 
Cache e Performance (in portuguese)
Cache e Performance (in portuguese)Cache e Performance (in portuguese)
Cache e Performance (in portuguese)
 
Web Services (in portuguese)
Web Services (in portuguese)Web Services (in portuguese)
Web Services (in portuguese)
 
Sessões (in portuguese)
Sessões (in portuguese)Sessões (in portuguese)
Sessões (in portuguese)
 
User Interface (in portuguese)
User Interface (in portuguese)User Interface (in portuguese)
User Interface (in portuguese)
 
Takeoff2008
Takeoff2008Takeoff2008
Takeoff2008
 

Recently uploaded

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 

Recently uploaded (20)

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 

Link extraction and classification

  • 1. http://www.flickr.com/photos/pio1976/3330670980/ Link extraction and classification Bruno Pedro December 2010
  • 2. Bruno Pedro A n e x p e r i e n c e d We b d e v e l o p e r a n d entrepreneur. Has extensive background in large scale projects and technical writing. http://tarpipe.com/user/bpedro
  • 5. twitter source: mashable.com • Average ~1.1M tweets/hour • ~300 new tweets every second
  • 6. Challenge • ~300 reads/second • 160 X 300 = 48 KB/second = 4 GB/day (approximate calculation) • How to process all this information?
  • 7. Strategy • Read and store immediately: • High performance write storage • No locks allowed • Prepare for lots of reading errors
  • 8. Strategy • Process later: • Regular expressions • Term extraction • Machine learning
  • 10. Link extraction extract user extract URL new tweet yes launch process has URL? save show no dump
  • 11. Content extraction • Short URLs • HTTP redirects • HTTP errors — retry? • Content analysis
  • 12. Content extraction URL retry? yes yes no fetch contents redirect? error? yes no save
  • 13. Content analysis • Assume malformed (X)HTML • Regular expressions or • Convert to XHTML • DOM traversing
  • 14. Content classification • HTML title, description, keywords tag cloud { • H1, H2, H3, ... • Paragraphs graph placement { • Who shared the link? • Internal and external links
  • 15. Content classification (X)HTML extract head extract text extract text elements yes yes yes H1,H2,... paragraphs head found? save found? found? no no
  • 16. The big picture new tweet classify content save extract URL extract content
  • 17. Food for thought • PubsubHubbub http://code.google.com/p/pubsubhubbub/ • Activity Streams http://activitystrea.ms/ • twitter streaming API http://dev.twitter.com/pages/streaming_api
  • 18. tarpipe streamlines your tarpipe is one of the most Today I had a chance to updates to various social curious experiments in spend time experimenting web sites, creating simple social media that I've with tarpipe and I have to or complex workflows to seen lately. The service say that I am intrigued by update several buckets in has the potential to be the concept and impressed one fell swoop. the answer to the lament by the implementation. I first talked about in The Adam Pash looming crisis: Personal Jeff Barr lifehacker syndication overload. Amazon.com Rafe Needleman CNET news thank you automated publishing

Editor's Notes

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n