Data and Information Extraction on the Web

Slides on "Information and Data Extraction on the Web" for the "Information Management on the Web" course at DIA (Computer Science Department) of Roma Tre University.


  1. Data and Information Extraction on the Web. Gestione delle Informazioni su Web, 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org. Monday, April 12, 2010.
  2. Agenda: search goals; problems; data extraction; information extraction; mixing things together.
  3. Search - Goals: find what we are looking for, quickly and easily; get suggestions on other interesting, related material; turn results into useful knowledge.
  4. What are you looking for?
  5. Problems when googling: where to search for what we are looking for; how to write good queries (e.g., what relations hold between terms?); how to evaluate whether a query is good.
  6. Search sources: redundant, heterogeneous, widespread, public, noisy, free, sometimes standard, semi-structured, linked, reachable... in one word: the Web.
  7. Focused search sources: address sources relevant to the desired domain; where possible, filter out the unclean and fragmented ones; choose the most standard and well-structured ones.
  8. Fragmented sources.
  9. Structured sources.
  10. Data extraction: automatically collect data from the Web; crawl data from domain-specific sources; aggregate homogeneous data (e.g., using equivalence classes); save (portions of) the downloaded data to a convenient separate storage (DB, file system, repository, etc.).
  11. Data extraction - Crawling: from scratch (good luck!); leveraging existing facilities (wget, HtmlUnit, Selenium, Apache HttpClient, Ning’s Async HttpClient, etc.); playing with existing projects (RoadRunner, Webpipe, Apache Nutch, etc.).
  12. Data extraction - HttpClient.
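Slide 12 presumably showed a code screenshot that is lost in this transcript; below is a minimal sketch, assuming Apache HttpClient 4.x, of fetching a page during the crawl (the URL is a placeholder).

```java
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class HttpClientFetch {
    public static void main(String[] args) throws Exception {
        // create a client and issue a GET request for the page to crawl
        HttpClient client = new DefaultHttpClient();
        HttpGet get = new HttpGet("http://www.example.com/players/totti.html"); // placeholder URL
        HttpResponse response = client.execute(get);

        // read the response body: the raw HTML to be stored and parsed later
        String html = EntityUtils.toString(response.getEntity());
        System.out.println("Status: " + response.getStatusLine().getStatusCode());
        System.out.println(html);

        // release the underlying connections
        client.getConnectionManager().shutdown();
    }
}
```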
  13. Data extraction - HtmlUnit.
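Likewise for slide 13, a minimal HtmlUnit sketch (placeholder URL again); unlike a plain HTTP client, HtmlUnit parses the page into a DOM and can execute JavaScript, which helps with dynamically generated content.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        try {
            // load and parse the page (JavaScript is executed by default)
            HtmlPage page = webClient.getPage("http://www.example.com/players/totti.html"); // placeholder URL

            // the parsed DOM can be serialized or queried with XPath
            System.out.println(page.getTitleText());
            System.out.println(page.asXml());
        } finally {
            webClient.closeAllWindows();
        }
    }
}
```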
  14. Data extraction - Aggregating: downloaded resources can be assigned to equivalence classes; the crawling process inherently defines the page classes that pages automatically belong to; relations hold between page classes; see RoadRunner, Webpipe, etc. (a sketch follows slide 16).
  15. Data extraction - EC (equivalence classes).
  16. Data extraction - EC: “teams indexes” class, “teams” class, “players” class, “coaches” class.
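As a rough illustration of the aggregation step referenced above (slides 14-16), page classes such as “teams” or “players” can often be approximated by grouping crawled URLs with similar patterns; the regular expressions below are purely hypothetical.

```java
import java.util.*;
import java.util.regex.Pattern;

public class PageClassifier {
    // hypothetical URL patterns standing in for the page classes of slides 15-16
    private static final Map<String, Pattern> PAGE_CLASSES = new LinkedHashMap<String, Pattern>();
    static {
        PAGE_CLASSES.put("teams indexes", Pattern.compile(".*/teams/?$"));
        PAGE_CLASSES.put("teams",         Pattern.compile(".*/teams/[^/]+$"));
        PAGE_CLASSES.put("players",       Pattern.compile(".*/players/[^/]+$"));
        PAGE_CLASSES.put("coaches",       Pattern.compile(".*/coaches/[^/]+$"));
    }

    // assign each crawled URL to the first page class (equivalence class) it matches
    public static Map<String, List<String>> classify(Collection<String> urls) {
        Map<String, List<String>> classes = new LinkedHashMap<String, List<String>>();
        for (String url : urls) {
            for (Map.Entry<String, Pattern> entry : PAGE_CLASSES.entrySet()) {
                if (entry.getValue().matcher(url).matches()) {
                    if (!classes.containsKey(entry.getKey())) {
                        classes.put(entry.getKey(), new ArrayList<String>());
                    }
                    classes.get(entry.getKey()).add(url);
                    break;
                }
            }
        }
        return classes;
    }
}
```

Note that tools such as RoadRunner infer page classes from page structure rather than from URL shape; the pattern-based grouping above is only a crude stand-in.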
  17. Data extraction - Relevance: what do we really need? Depending on the specific domain, not all pages in all classes may be relevant; we may be interested only in a subset of the discovered page classes.
  18. Data extraction - Example: we may be interested in retrieving only information about players (the Player class).
  19. Data extraction - Problems: server unavailability (HTTP 404, 403, 303, etc.); security and bandwidth filters (don’t get your crawler machine’s IP banned!); client unavailability (memory and storage space are unlimited only in theory); encoding; legal issues; ...
  20. From Data to Information.
  21. Data vs Information. Data: rough, semi-structured, mixed content, immutable, navigation-oriented. Information: clean, structured, focused, managed, domain-oriented.
  22. From Data to Information: we have crawled a lot of data; we eventually have some rough structure (page classes and relations); we want to pick only what we need.
  23. Information extraction - Pruning: we want to filter out at least banners, advertisements, etc.; headers/footers; navigation bars and search boxes; everything else not related to the content. We may use XPath (a sketch follows slide 25).
  24. Information extraction - Pruning.
  25. Information extraction - Pruning.
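A minimal sketch of the pruning step referenced from slide 23, using the standard javax.xml.xpath API on a page that has already been parsed into a W3C DOM; the class and id names in the expressions are hypothetical and must be tuned to the target site.

```java
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Pruner {
    // remove banners, headers/footers, navigation bars and search boxes from a parsed HTML DOM
    public static void prune(Document doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // hypothetical expressions; real ones depend on the markup of the source
        String[] noise = {
            "//div[@class='banner'] | //div[@class='advertisement']",
            "//div[@id='header'] | //div[@id='footer']",
            "//div[@class='navbar'] | //form[@class='searchbox']"
        };
        for (String expr : noise) {
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                n.getParentNode().removeChild(n); // drop the noisy node from the tree
            }
        }
    }
}
```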
  26. Information extraction: once we have extracted the content, we are interested in getting useful information from it -> knowledge; look for matches between the extracted data and our domain model.
  27. Information extraction - Example: navigate XML (HTML DOM) nodes with XPath; navigate the content and find specific “parts” (nodes or sub-trees); tag such “parts” as objects or properties of a (specific) domain model; we may need to traverse the DOM multiple times (a sketch follows slide 30).
  28. Information extraction - Name.
  29. Information extraction - Date of Birth.
  30. Information extraction - Team.
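A minimal sketch of the extraction step referenced from slide 27, assuming the page has already been pruned: each XPath pins down one “part” of the page and is mapped to a property of a hypothetical Player domain object (the expressions and field names are stand-ins for those shown on slides 28-30).

```java
import javax.xml.xpath.*;
import org.w3c.dom.Document;

public class PlayerExtractor {
    // trivial domain object populated from a page of the Player page class
    public static class Player {
        public String name;
        public String dateOfBirth;
        public String team;
    }

    public static Player extract(Document playerPage) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Player p = new Player();
        // hypothetical expressions; the real ones depend on the source's markup
        p.name        = xpath.evaluate("//h1[@class='player-name']", playerPage);
        p.dateOfBirth = xpath.evaluate("//td[@class='date-of-birth']", playerPage);
        p.team        = xpath.evaluate("//a[@class='team']", playerPage);
        return p;
    }
}
```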
  31. Information extraction - Example: a Player (taken from the Player page class) with a name, a date of birth, and a team; we now know that “Francesco Totti” is a Player of the “Italy” team, born on “27/09/1976”; we can apply such XPaths to all instances of the page class and get information about each player.
  32. Information extraction - Wrappers: context navigation (RoadRunner, Webpipe); statistical analysis (ExAlg); other...
  33. Information extraction - Problems: poorly structured sources; frequently changing sources; false positives; corrupted extracted data.
  34. False positives.
  35. Information extraction - Relevance: using wrappers we can get a lot of information; we could rank what is relevant in the “page” context and in the domain model, for efficiency and “reasoning” purposes.
  36. Information extraction - Relevance.
  37. Information extraction - Metadata: stream the extracted information into our domain model; extracted information -> metadata; the populated domain objects contain interesting semantic relations.
  38. Store metadata: DB (with a classic relational schema); filesystem (XML); key-value repository; index; triple store; ...
  39. Query enriched data: exploit the acquired metadata semantics to build SQL-like queries (over the attributes and relations of our domain model) on previously unstructured data; extract hidden knowledge by querying the aggregated metadata.
  40. Sample queries: get “young players”: SELECT * FROM giocatore g WHERE g.dob > DATE '1993-01-01'; aggregate queries: find the average age in each team; find the average age of World Cup players.
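A hedged JDBC sketch of the sample queries above, assuming the extracted metadata ended up in a relational giocatore table with dob and squadra (team) columns; the JDBC URL, table, and column names are placeholders, and the YEAR()-based age computation is only approximate.

```java
import java.sql.*;

public class SampleQueries {
    public static void main(String[] args) throws Exception {
        // placeholder JDBC URL; any relational store holding the extracted metadata will do
        Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:players", "sa", "");
        Statement stmt = conn.createStatement();

        // "young players", as on the slide, written as standard SQL
        ResultSet young = stmt.executeQuery(
                "SELECT * FROM giocatore g WHERE g.dob > DATE '1993-01-01'");
        int count = 0;
        while (young.next()) {
            count++;
        }
        System.out.println("young players: " + count);

        // aggregate query: approximate average age per team
        // (squadra is an assumed column name; YEAR() works in HSQLDB/MySQL)
        ResultSet avgAge = stmt.executeQuery(
                "SELECT squadra, AVG(YEAR(CURRENT_DATE) - YEAR(dob)) AS avg_age "
                + "FROM giocatore GROUP BY squadra");
        while (avgAge.next()) {
            System.out.println(avgAge.getString("squadra") + ": " + avgAge.getDouble("avg_age"));
        }
        conn.close();
    }
}
```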
  41. Information extraction on the Web.
  42. References: http://www.w3.org/TR/xpath/ http://www.w3.org/DOM/ http://www.dia.uniroma3.it/db/roadRunner/ http://www.slideshare.net/n0on3/exalg-overview http://www.ricercaitaliana.it/prin/unita_op-2006093591_002.htm http://incubator.apache.org/uima/downloads/releaseDocs/2.3.0-incubating/docs/html/overview_and_setup/overview_and_setup.html http://en.wikipedia.org/wiki/Web_scraping http://www.alchemyapi.com/api/scrape/
