Data and Information Extraction on the Web

Slides about "Information and Data Extraction on the Web" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University

Usage Rights

CC Attribution License

Data and Information Extraction on the Web Presentation Transcript

  • 1. Data and Information Extraction on the Web. Information Management on the Web (Gestione delle Informazioni su Web) - 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org. Monday, 12 April 2010
  • 2. Agenda: Search; Goals; Problems; Data extraction; Information extraction; Mixing things together
  • 3. Search - Goals: find what we are looking for, quickly and easily; get suggestions on other interesting related material; turn results into useful knowledge
  • 4. What are you looking for?
  • 5. Problems when googling: where to search for what we are looking for; how to write good queries (e.g. relations between terms); how to evaluate whether a query is good
  • 6. Search sources: redundant, inhomogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable... in one word: the Web
  • 7. Focused search sources: target the interesting sources for the desired domain; where possible, filter out the unclean and fragmented ones; choose the most standard and well-structured ones
  • 8. Fragmented sources
  • 9. Structured sources
  • 10. Data extraction: automatically collect data from the Web; crawl data from domain-specific sources; aggregate homogeneous data (e.g. using equivalence classes); save (portions of) the downloaded data to a convenient separate storage (DB, file system, repository, etc.)
  • 11. Data extraction - Crawling: from scratch (good luck!); leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.); playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.)
  • 12. Data extraction - HttpClient
  • 13. Data extraction - HtmlUnit
  • 14. Data extraction - Aggregating: downloaded resources can be assigned to equivalence classes; the crawling process inherently defines the page classes to which pages are automatically assigned; relations between page classes; RoadRunner, Webpipe, etc.
  • 15. Data extraction - EC (equivalence classes)
  • 16. Data extraction - EC: “team indexes” class; “teams” class; “players” class; “coaches” class
  • 17. Data extraction - Relevance: what do we really need? It depends on the specific domain; not all pages in all classes may be relevant; we could be interested in only a subset of the page classes found
  • 18. Data extraction - Example: we may be interested in retrieving only information regarding players (Player class)
  • 19. Data extraction - Problems: server unavailability (HTTP 404, 403, 303, etc.); security and bandwidth filters (don’t get your crawler machine's IP banned!); client unavailability (memory and storage space are unlimited only in theory); encoding; legal issues; ...
  • 20. From Data to Information
  • 21. Data vs Information: rough vs clean; semi-structured vs structured; mixed content vs focused; immutable vs managed; navigation-oriented vs domain-oriented
  • 22. From Data to Information: we have crawled a lot of data; we may have some rough structure (page classes and relations); we want to pick only what we need
  • 23. Information extraction - Pruning: we want to filter out at least banners, advertisements, etc.; headers/footers; navigation bars/search boxes; everything else not related to the content. We may use XPath
  • 24. Information extraction - Pruning
  • 25. Information extraction - Pruning
  • 26. Information extraction: once we have extracted the content, we are interested in getting useful information from it -> knowledge; look for matches between the extracted data and our domain model
  • 27. Information extraction - Example: navigate XML (HTML DOM) nodes with XPath; navigate the content and find specific “parts” (nodes or sub-trees); tag such “parts” as objects or properties inside a (specific) domain model; we may need to traverse the DOM multiple times
  • 28. Information extraction - Name
  • 29. Information extraction - Date of Birth
  • 30. Information extraction - Team
  • 31. Information extraction - Example: a Player (taken from the Player page class) with a name, a date of birth and a team; we now know that “Francesco Totti” is a Player of the “Italy” team and was born on “27/09/1976”; we can apply such XPaths to all instances of the page class and get information about each player
  • 32. Information extraction - Wrapper: context navigation (RoadRunner, Webpipe); statistical analysis (ExAlg); other...
  • 33. Information extraction - Problems: poorly structured sources; frequently changing sources; false positives; corrupted extracted data
  • 34. False positives
  • 35. Information extraction - Relevance: using wrappers we can get a lot of information; we could rank what is relevant in the “page” context and in the domain model, for efficiency and “reasoning” purposes
  • 36. Information extraction - Relevance
  • 37. Information extraction - Metadata: stream extracted information into our domain model; extracted information -> metadata; populated domain objects contain interesting semantic relations
  • 38. Store Metadata: DB (with classic relational schema); filesystem (XML); key-value repository; index; triple store; ...
  • 39. Query enriched data: exploit the semantics of the acquired metadata to build SQL-like queries (with the attributes and relations of our domain model) on previously unstructured data; extract hidden knowledge by querying aggregated metadata
  • 40. Sample queries: get “young players” with SELECT * FROM giocatore g WHERE g.dob > '1993-01-01'. Aggregate queries: find the average age in each team; find the average age of World Cup players
  • 41. Information extraction on the Web
  • 42. References: overview_and_setup/overview_and_setup.html
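
The aggregation step in slides 14 to 16 assigns downloaded pages to equivalence classes. The slides name RoadRunner and Webpipe for this but do not show how they work, so the following is only a naive sketch of the idea, not either tool's algorithm: two pages are assumed to belong to the same class when they use the same set of (tag, CSS class) pairs. The URLs, the markup and the names `TagCollector`, `signature` and `group_by_class` are all made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the (tag, css class) pairs that appear in a page."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        self.tags.add((tag, css_class))


def signature(html):
    """Structural fingerprint of a page: the set of its (tag, class) pairs."""
    collector = TagCollector()
    collector.feed(html)
    return frozenset(collector.tags)


def group_by_class(pages):
    """Group page URLs whose fingerprints are identical."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[signature(html)].append(url)
    return sorted(sorted(urls) for urls in classes.values())


# Hypothetical crawled pages (URL -> HTML source).
PAGES = {
    "/players/1": '<div class="player"><h1>A</h1><span class="dob">x</span></div>',
    "/players/2": '<div class="player"><h1>B</h1><span class="dob">y</span></div>',
    "/teams/1": '<div class="team"><h1>T</h1><ul><li>A</li></ul></div>',
}

groups = group_by_class(PAGES)
print(groups)
```

Exact fingerprint equality is fragile on real sites, where pages of the same class often differ in optional elements; a similarity threshold over the fingerprints would be a natural next step.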
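
The XPath-based extraction in slides 27 to 31 (name, date of birth and team of a Player) might look roughly like the sketch below. It assumes the page has already been pruned or cleaned into well-formed XML, and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree`. The page markup, the XPath expressions and the `extract_player` helper are hypothetical; the original slides showed their own page and paths as images.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page belonging to the "players" page class.
PAGE = """
<html>
  <body>
    <div id="header">Site banner</div>
    <div id="content">
      <h1 class="name">Francesco Totti</h1>
      <span class="dob">27/09/1976</span>
      <a class="team">Italy</a>
    </div>
    <div id="footer">Copyright</div>
  </body>
</html>
"""

# One XPath per property of the domain model (assumed paths for this page).
PLAYER_XPATHS = {
    "name": ".//div[@id='content']/h1[@class='name']",
    "date_of_birth": ".//div[@id='content']/span[@class='dob']",
    "team": ".//div[@id='content']/a[@class='team']",
}


def extract_player(page_source):
    """Apply every XPath to the page and return a property -> text mapping."""
    root = ET.fromstring(page_source)
    player = {}
    for prop, xpath in PLAYER_XPATHS.items():
        node = root.find(xpath)
        has_text = node is not None and node.text
        player[prop] = node.text.strip() if has_text else None
    return player


player = extract_player(PAGE)
print(player)
```

Because the paths only look inside the `content` element, the header and footer are ignored, which is the pruning idea from slide 23. Applying `extract_player` to every page of the same class would yield one record per player.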
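
The sample queries in slide 40 can be tried against an in-memory SQLite database, as in the sketch below. The table name `giocatore` and the `dob` column come from the slide; the other columns, the ISO date format, the reference date used to compute ages and every row except Francesco Totti's are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE giocatore (name TEXT, team TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO giocatore VALUES (?, ?, ?)",
    [
        ("Francesco Totti", "Italy", "1976-09-27"),
        ("Player A", "Italy", "1993-06-01"),   # made-up row
        ("Player B", "Brazil", "1980-01-01"),  # made-up row
    ],
)

# "Young players": ISO dates stored as text compare correctly as strings.
young = conn.execute(
    "SELECT name FROM giocatore g WHERE g.dob > '1993-01-01' ORDER BY name"
).fetchall()

# Average age per team, measured at a fixed (assumed) reference date.
avg_age = conn.execute(
    """
    SELECT team,
           AVG((julianday('2010-04-12') - julianday(dob)) / 365.25) AS avg_age
    FROM giocatore
    GROUP BY team
    ORDER BY team
    """
).fetchall()

print(young)
print(avg_age)
```

Storing dates as ISO 8601 text keeps the comparison in the first query valid without any date parsing, which is why the sketch does not use the `27/09/1976` format shown in the slides.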