Building and Using Knowledge Bases

Keynote at the AKBC-WEKEX 2012 Knowledge Extraction Workshop at HLT-NAACL 2012, June 7-8, 2012
http://akbcwekex2012.wordpress.com/

Published in: Technology, Education

  1. 1. WeST – Web Science & Technologies, University of Koblenz-Landau, Germany. Building and Using Knowledge Bases. Steffen Staab; Saqib Mir – European Bioinformatics Institute; Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy; & WeST Team
  2. 2. Institut WeST – Web Science & Technologies: Semantic Web, Web Retrieval, Social Web, Multimedia Web, Software Web, GESIS. Steffen Staab, staab@uni-koblenz.de
  3. 3. PhD thesis trauma 17 years ago
     „Nach dem Auspacken der LPS 105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“
     "Having unwrapped the LPS 105 – reveals itself to the onlooker - a stable disk drive, which has similarly small volume as the Maxtor."
  4. 4. GENERAL MOTIVATION: The general motivation is not information extraction but solving tasks!
  5. 5. General objective: Extracting to LOD [figure labels: useAsExample, hasLivedIn]
     Crucial to know: Ontologies nowadays reflect this structure. Ontologies are
     • Modular (vs one to rule them all)
     • Distributed (vs defined in one place)
     • Connected (vs isolated templates)
     • Extensible (vs claimed to be finished)
     • Lightweight (vs computationally intractable)
     • Popular ones are used more often (vs people disagreeing)
     Ontologies – LEGO style
  6. 6. Most famous applications
     • Steve Macbeth (Microsoft), in the discussion wrt Schema.org: “about 7% of pages we crawl have mark-up” (http://www.w3.org/2012/06/06-schema-minutes.html)
     • LOD Cloud
     • Google Knowledge Graph
     • Bing gets its own knowledge graph (http://searchengineland.com/bing-britannica-partnership-123930)
  7. 7. Example ontology-based application 1: ANALYSIS OF URBAN PARAMETERS
  8. 8. General objective: Analysing LOD [figure labels: useAsExample, hasLivedIn]
  9. 9. Family's analysis of Koblenz: LOD + Open Street Map data. http://lisa.west.uni-koblenz.de/lisa-demo/
  10. 10. Entrepreneur's analysis of Koblenz: LOD + Open Street Map data. 1st Prize, German Linked Open Gov Data Competition 2012. http://lisa.west.uni-koblenz.de/lisa-demo/
  11. 11. Example ontology-based application: FACETED MULTIMEDIA EXPLORATION
  12. 12. Making Web 2.0 More Accessible [Schenk et al; JoWS 2009] [figure labels: GeoNames, Links, Location, Persons, low- to midlevel features, Knowledge, Tags]
  13. 13. Choosing between Koblenz – and Koblenz. Video at: http://vimeo.com/2057249
  14. 14. Contextual Information
  15. 15. Tag-based refinement
  16. 16. A tag view of "Koblenz" & "Castle"
  17. 17. Semantic Identity – Festung Ehrenbreitstein
  18. 18. Persons – Celebrities, FOAFers & Flickr Users. Billion Triples Challenge, 1st Prize 2008 [Schenk et al; JoWS 2009]
  19. 19. Now on to information extraction: OBSERVATIONS ON INFORMATION EXTRACTION
  20. 20. Challenges & Opportunities for IE: Not all web pages are created equal
  21. 21. Challenges & Opportunities for IE: Some challenges are the same, e.g. finding type instances
  22. 22. Challenges & Opportunities for IE: Some challenges are the same, e.g. finding relation instances
  23. 23. Challenges & Opportunities for IE: Some contain concepts and their descriptions, some don't. No types here, few relation types
  24. 24. Challenges & Opportunities for IE: Knowing that they are instances and of which type (textual indication, positional indication)
  25. 25. Challenges & Opportunities for IE: To some extent positional and layout indications work across languages and sites
  26. 26. Challenges & Opportunities for IE: We should not only think about Web pages, but about Web sites [figure label: owl:sameAs]
  27. 27. Challenges & Opportunities for IE: We should not only think about Web pages, but about Web sites [figure label: owl:sameAs]
  28. 28. Comparing related work to our objectives
     Related work objectives → Our objectives
     • IE on Web pages → IE on Web sites
     • Acquiring instances and relationship instances → Acquiring items; classifying items in instances, concepts, relation instances, relationships
     • IE based on linear text → IE also based on spatial position
     There is overlap and of course there are exceptions in related work
  29. 29. Outline: The Social Media-Case (Motivation; State-of-the-Art; Core idea of SXPath; Implementation; Evaluation) [Oro et al; VLDB 2010]; The Bio-Case
  30. 30. Presentation-oriented documents
  31. 31. Presentation-oriented documents
     • HTML DOM structure is site specific
     • Spatial arrangements are rarely explicit
     • Spatial layout is hidden in complex nesting of layout elements
     • Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)
  32. 32. Related Work
     • Web query languages: XPath 1.0 and XQuery 1.0 are established, but too difficult to use for scraping from intricate DOM structures
     • Visual languages: Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency; algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]
     • Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]: generate XPath location paths of DOM nodes; can benefit from using Spatial XPath
  33. 33. Outline: The Social Media-Case (Motivation; State-of-the-Art; Core idea of SXPath; Implementation; Evaluation); The Bio-Case
  34. 34. Representing Spatial Relations between DOM Nodes
  35. 35. Idea: Use Spatial Relations among DOM Nodes
  36. 36. Spatial DOM (SDOM)
  37. 37. SXPath System Architecture
  38. 38. Querying for Relations Among Nodes: Rectangular Cardinal Relations (RCR), e.g. r1 E:NE r2; Topological Relations. Spatial models allow for expressing disjunctive relations among regions
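A relation such as r1 E:NE r2 can be read as "r1 lies partly east and partly north-east of r2". The following is a minimal sketch of that reading, assuming a tile-based model in which the bounding box of r2 splits the plane into a 3x3 grid. The `Rect` type, the tile name `B` for the centre tile, and the choice of a y axis that grows upward are assumptions made for illustration; this is not the formal RCR definition from the SXPath paper.

```python
from typing import NamedTuple

INF = float("inf")


class Rect(NamedTuple):
    """Axis-aligned bounding box; y grows upward in this sketch (an assumption)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def cardinal_tiles(a: Rect, b: Rect) -> set:
    """Return the tiles of the 3x3 grid around b that a overlaps with positive area."""
    columns = [("W", -INF, b.x_min), ("", b.x_min, b.x_max), ("E", b.x_max, INF)]
    rows = [("S", -INF, b.y_min), ("", b.y_min, b.y_max), ("N", b.y_max, INF)]
    tiles = set()
    for row_name, y_lo, y_hi in rows:
        for col_name, x_lo, x_hi in columns:
            overlaps_x = min(a.x_max, x_hi) > max(a.x_min, x_lo)
            overlaps_y = min(a.y_max, y_hi) > max(a.y_min, y_lo)
            if overlaps_x and overlaps_y:
                # The centre tile carries no direction letters; call it "B".
                tiles.add((row_name + col_name) or "B")
    return tiles


r2 = Rect(0, 0, 10, 10)
r1 = Rect(12, 5, 20, 15)
print(sorted(cardinal_tiles(r1, r2)))  # ['E', 'NE']
```

Browser layout coordinates usually grow downward, so a real renderer-backed version would have to flip the N and S rows.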
  39. 39. XPath Example
  40. 40. SXPath Example
  41. 41. [no slide text]
  42. 42. From XPath 1.0 towards Spatial Querying with SXPath. SXPath features:
     • adopts intuitive path notation: axis::nodetest [pred]*
     • adds to XPath: spatial axes, spatial position functions
     • natural semantics for spatial querying
  43. 43. SXPath System Architecture
  44. 44. Complexity Results. Formal model defined in the paper [Oro et al; VLDB 2010]
  45. 45. Outline: The Social Media-Case (Motivation; State-of-the-Art; Core idea of SXPath; Implementation; Evaluation); The Bio-Case
  46. 46. SXPath System
  47. 47. Summative User Study
  48. 48. Summative User Study
  49. 49. Summative User Study
  50. 50. Outline
     The Social Media Case: Motivation; State-of-the-Art; Core idea of SXPath; SXPath Language (Spatial Data Model, Syntax & Semantics, Complexity); Implementation; Evaluation
     The Bio-Case: Motivation; The (Biochemical) Deep Web; Contributions (Page-level wrapper induction, Site-wide wrapper generation, Error Correction by Mutual Reinforcement); Conclusions and Future Directions
  51. 51. >1000 Life Science DBs, number growing quickly
  52. 52. Biochemical Web Sites: Observations - 1: Labeled Data. Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)
     Total   Labeled   Unlabeled   Unlabeled (Redundant)
     754     719       19          16
     Table 1: Data fields across 20 Biochemical Web sites
  53. 53. Biochemical Web Sites: Observations - 2: Dynamic Web Pages
  54. 54. Biochemical Web Sites: Observations - 3: Rich Site Structure
  55. 55. Biochemical Web Sites: Observations - 4
     • Semantics is often only in the report, not in the underlying relational database
     • Web Services: survey shows 11 of 100 databases [1] provide APIs; incomplete coverage; varying granularity; no semantics in the service description
     [1] Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey was available at http://sabiork.villa-bosch.de/index.html/survey.html
  56. 56. Biochemical Web Sites: Extraction Tasks [Mir et al; DILS 2009] [Mir et al; ESWC 2010] [figure: three "Induce Wrapper" steps]
  57. 57. Contributions
     • Unsupervised Page-Level Wrapper Induction
     • Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery) (Acquiring the Schema/Ontology)
     • Automatic Error Detection and Correction by Mutual Reinforcement
  58. 58. Page-Level Wrapper Induction – 1. Text nodes selected with //*[text()]
     D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, …}
     O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
     D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, …}
     O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …, 3.2.1.21}
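The D and O sets on this slide are consistent with a simple first pass: text entries that differ between sample pages are treated as data, and entries that recur on every page are treated as "other" (candidate labels). The sketch below assumes exactly that rule, which the slide does not spell out; the function name and the shortened sample lists are hypothetical. Note that recurring values such as R00026 and 3.2.1.21 land in the shared set, which is the kind of misclassification a later reclassification step has to repair.

```python
def split_data_and_other(pages):
    """Split text entries into per-page data sets and one shared 'other' set.

    pages: one list of text entries per sample page, e.g. the strings
    selected by //*[text()] on each page.
    Assumption: an entry that occurs on every sample page is not data.
    """
    entry_sets = [set(page) for page in pages]
    shared = set.intersection(*entry_sets)
    data = [entries - shared for entries in entry_sets]
    return data, shared


page1 = ["Entry", "C00221", "Name", "beta-D-Glucose", "Reaction",
         "R00026", "R01520", "Enzyme", "1.1.1.47", "3.2.1.21"]
page2 = ["Entry", "C00185", "Name", "Cellobiose", "Reaction",
         "R00026", "R00306", "Enzyme", "1.1.99.18", "3.2.1.21"]

(d1, d2), other = split_data_and_other([page1, page2])
print(sorted(d1))     # ['1.1.1.47', 'C00221', 'R01520', 'beta-D-Glucose']
print(sorted(other))  # ['3.2.1.21', 'Enzyme', 'Entry', 'Name', 'R00026', 'Reaction']
```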
  59. 59. Page-Level Wrapper Induction - 2: Reclassify – Growing Data Regions
  60. 60. Page-Level Wrapper Induction - 3
     D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21 …}
     O1 = {Entry, Name, …, Reaction, R00026, Enzyme, …}
     D2 = {C00185, Cellobiose, …, R00306, 1.1.99.18, 3.2.1.21 …}
     O2 = {Entry, Name, …, Reaction, R00026, Enzyme, …}
  61. 61. Page-Level Wrapper Induction - 4: Selecting Labels for Data
     html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] ("1.1.1.47")
     html/…./table[1]/tr[6]/th[1]/…/code[1]/ ("Reaction")
     html/…./table[1]/tr[8]/th[1]/…/code[1]/ ("Enzyme")
  62. 62. Page-Level Wrapper Induction - 5: Anchor the Path
     Label "Enzyme": html/table[1]/tr[8]/th[1]/code[1]/
     Data: html/table[1]/tr[8]/td[1]/code[1]/a[1]
     Data: html/table[1]/tr[8]/td[1]/code[1]/a[2]
     Result: //*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text() (parts labelled Pivot, Relative, Generalize)
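The pivot, relative, and generalize steps can be sketched with plain string manipulation on the paths shown on the slide. This is an illustrative sketch under stated assumptions, not the authors' implementation: `relative_path` and `generalize_last_step` are hypothetical helpers, paths are assumed to be simple slash-separated steps with positional indices, and the generalized lower bound is taken to be the smallest index observed (the slide itself shows a bound of 2).

```python
import re


def relative_path(label_path: str, data_path: str) -> str:
    """Express data_path relative to the node addressed by label_path."""
    label = [step for step in label_path.split("/") if step]
    data = [step for step in data_path.split("/") if step]
    common = 0
    while common < min(len(label), len(data)) and label[common] == data[common]:
        common += 1
    # Climb from the label node to the deepest common ancestor, then descend.
    ups = [".."] * (len(label) - common)
    return "/".join(ups + data[common:])


_LAST_STEP = re.compile(r"^(?P<head>(?:.*/)?)(?P<tag>[^/\[]+)\[(?P<idx>\d+)\]$")


def generalize_last_step(paths):
    """Merge paths that differ only in the positional index of their last step."""
    matches = [_LAST_STEP.match(path) for path in paths]
    if not all(matches):
        raise ValueError("every path must end in a step like tag[n]")
    prefixes = {(m["head"], m["tag"]) for m in matches}
    if len(prefixes) != 1:
        raise ValueError("paths differ in more than the last index")
    head, tag = prefixes.pop()
    lowest = min(int(m["idx"]) for m in matches)
    return f"{head}{tag}[position()>={lowest}]"


label = "html/table[1]/tr[8]/th[1]/code[1]/"
data = ["html/table[1]/tr[8]/td[1]/code[1]/a[1]",
        "html/table[1]/tr[8]/td[1]/code[1]/a[2]"]

relative = [relative_path(label, path) for path in data]
wrapper = "//*[text()='Enzyme']/" + generalize_last_step(relative) + "/text()"
print(wrapper)  # //*[text()='Enzyme']/../../td[1]/code[1]/a[position()>=1]/text()
```

Anchoring on the label text rather than on an absolute path means the expression still applies when rows are omitted and the absolute `tr[n]` index shifts.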
  63. 63. Selected Sources: KEGG, ChEBI, MSDChem
     • Basic qualitative data
     • Popular
     • Overlapping/complementary data
  64. 64. Wrapper Induction - Evaluation
     SOURCE          #L   #D    #S   TP    FN    FP   P      R
     KEGG Compound   10   762   3    411   351   46   89.9   53.9
                                15   759   3     0    100    99.6
     KEGG Reaction   10   205   3    173   32    0    100    84.4
                                15   205   0     0    100    100
     ChEBI           22   831   3    595   236   41   93.5   71.6
                                15   829   2     0    100    99.7
     MSDChem         30   600   3    600   0     20   96.7   100
                                15   600   0     20   96.7   100
     Average (based on final wrappers for each source)  99.1   99.8
     Table 2: Page-level wrapper induction results, 20 test pages (L=Labels, D=Data entries, S=Training pages)
     Sources: KEGG Compound http://www.genome.jp/kegg/compound/; KEGG Reaction http://www.genome.jp/kegg/reaction/; ChEBI http://www.ebi.ac.uk/chebi; MSDChem http://www.ebi.ac.uk/msd-srv/msdchem/
     ~9 samples – ~99% P, ~98% R
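The P and R columns follow from the TP, FN, and FP counts by the usual definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The short check below recomputes two rows of Table 2; the function name is hypothetical, and the results agree with the table to within rounding in the last digit.

```python
def precision_recall(tp: int, fn: int, fp: int):
    """Return (precision, recall) in percent from raw counts."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46
p, r = precision_recall(411, 351, 46)
print(f"{p:.1f} {r:.1f}")  # 89.9 53.9

# KEGG Compound with 15 training pages: TP=759, FN=3, FP=0
p15, r15 = precision_recall(759, 3, 0)
print(f"{p15:.1f} {r15:.1f}")  # 100.0 99.6
```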
  65. 65. Site-Wide Wrapper Induction: Observations
     • Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
     • An efficient approach should ignore these pages
     • We don't need to learn the entire site structure
  66. 66. Site-Wide Wrapper Induction: Observations - 2: Classified link collections point to data-intensive pages of the same class.
  67. 67. Site-Wide Wrapper Induction: Observations - 3: Pages belonging to the same class describe the same concepts
     • Some concepts are sometimes omitted
     • Ordering is always the same
  68. 68. Site-Wide Wrapper Induction [figure labels: C0, C1, C2, C3, L1, L2, L3]
     1. Start with C0; S = {C0}
     2. Follow all classified link-collections
     3. Generate wrappers for each set of target pages
     4. Determine if new class is formed
     5. Add navigation step
     6. Repeat 2 – 5 for each new class formed in 4
     Class set update: if C0 != Ci (i>0), S = S + Ci
     Navigation Steps: W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
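The six steps amount to a breadth-first exploration over page classes. The sketch below shows only that control flow and abstracts away the hard part: generating wrappers for the target pages and deciding whether they form a new class are replaced by a precomputed lookup table. The function name and the `link_collections` structure are assumptions for illustration.

```python
from collections import deque


def discover_site_structure(start, link_collections):
    """Explore page classes breadth-first and record navigation steps.

    link_collections maps a page class to {link-collection name: class of
    the pages it points to}. In a real system the target class would come
    from wrapper induction on the target pages (steps 3 and 4).
    """
    classes = [start]          # S = {C0}
    steps = []                 # W, the navigation steps
    queue = deque([start])
    while queue:
        source = queue.popleft()
        for link, target in link_collections.get(source, {}).items():
            steps.append((source, link, target))   # step 5
            if target not in classes:              # step 4: new class formed
                classes.append(target)
                queue.append(target)               # step 6: repeat for it
    return classes, steps


site = {"C0": {"L1": "C0", "L2": "C2", "L3": "C3"}}
classes, steps = discover_site_structure("C0", site)
print(classes)  # ['C0', 'C2', 'C3']
print(steps)    # [('C0', 'L1', 'C0'), ('C0', 'L2', 'C2'), ('C0', 'L3', 'C3')]
```

Because only classified link collections are followed, pages without data (disclaimers, contact pages) never enter the queue.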
  69. 69. Site-Wide Wrapper Induction – Evaluation
     SOURCE    #C   #C'   #D     TP     FN     FP    P      R
     MSDChem   1    1     N/A    N/A    N/A    N/A   N/A    N/A
     ChEBI     3    1     1711   1195   516    0     100    69.8
     KEGG      10   7     6223   5044   1179   188   97     81.1
     Average                                         98.5   75.5
     Table 3: Site-wide wrapper induction results, 20 test pages for each class (C=Classes, C'=Classes discovered, D=Data entries)
  70. 70. Error Detection and Correction: Mutual Reinforcement. Observation: Certain data reappear on more than one class of pages
  71. 71. Error Detection and Correction: Mutual Reinforcement
     • Reinforcement if reappearing data is correctly classified as data
     • Otherwise it points to misclassification
     • Label-Data mismatch. Correction: introduce more samples
     • Label-Label mismatch: cannot be detected
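The detection side of this idea can be sketched as a cross-check between page classes: a value that is classified as data on one class of pages but as a label on another signals a Label-Data mismatch. The data structure, function name, and sample classification below are hypothetical. A value that is wrongly classified as a label everywhere produces no conflict, which matches the slide's remark that Label-Label mismatches cannot be detected.

```python
def label_data_mismatches(classified):
    """Find values whose role differs between page classes.

    classified maps a page class to {text value: "data" or "label"}.
    Returns {value: {page class: role}} for conflicting values only.
    """
    roles_by_value = {}
    for page_class, entries in classified.items():
        for value, role in entries.items():
            roles_by_value.setdefault(value, {})[page_class] = role
    return {
        value: roles
        for value, roles in roles_by_value.items()
        if set(roles.values()) == {"data", "label"}
    }


# Hypothetical wrapper output for two page classes
classified = {
    "compound": {"Enzyme": "label", "R00026": "label", "C00221": "data"},
    "reaction": {"Enzyme": "label", "R00026": "data"},
}
print(label_data_mismatches(classified))
# {'R00026': {'compound': 'label', 'reaction': 'data'}}
```

Each reported value marks a page class whose wrapper should be re-induced with more sample pages, the correction named on the slide.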
  72. 72. Where to go next? Reverse engineering production:
     1. LOD: emitting RDF & RDFS
     2. Navigation model: what belongs to what
     3. Interaction model (not treated at all by us so far)
     4. Layout model: spatial positioning
     Capture this generative model using machine learning: relational learning (Markov logic programmes? …?)
  73. 73. Bibliography
     • Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.
     • Saqib Mir, Steffen Staab, Isabel Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.
     • Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.
  74. 74. WeST – Web Science & Technologies, University of Koblenz-Landau, Germany. Thank you for your attention!