1. DIADEM: Domain-centric, Intelligent, Automated Data Extraction. Tim Furche, Georg Gottlob, Giorgio Orsi. May 11th, 2011, Oxford University Computing Laboratory. Joint work with Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Thomas Lukasiewicz, Christian Schallhart, Andrew Sellers, Gerardo Simari, Cheng Wang
4. Section 1: Web Data Extraction. Data on the Web: there is more of it than we can use. The challenge is no longer availability, but finding, integrating, analysing, …
5. Section 1: Web Data Extraction. Surface vs. Deep Web: the deep web is estimated at 500× the size of the surface web, in an estimated 400,000 deep web databases. What is in it? Products (stores), directories (yellow pages), catalogs (libraries), public DBs (publications, census, data.gov, …), public services (weather, location, …)
12. Section 1: Web Data Extraction. The Web is more than HTML.
13. Section 1: Web Data Extraction. Overview: Introducing Web Data Extraction; Scenarios; Why now?; Supervised Web Data Extraction; Unsupervised Web Data Extraction; DIADEM: OPAL, AMBER, OXPath, IVLIA, Datalog±
15. Section 1: Web Data Extraction. The Need for Web Data Extraction. Information drives business (decision making, trend analysis, …) and is available in troves on the internet, but, as HTML, it is made for humans, not available as structured data. Companies need: product specifications, pricing information, market trends, regulatory information.
18. Section 1: Web Data Extraction. Scenario ➀: Electronics retailer. An electronics retailer wants online market intelligence: a comprehensive overview of the market, with daily information on price, shipping costs, trends, and product mix, by product, geographical region, or competitor, across thousands of products and hundreds of competitors. Nowadays this is done by specialised companies, mostly manually and with interpolation, at large cost.
19. Section 1: Web Data Extraction. Scenario ➁: Supermarket chain. A supermarket chain monitors competitors’ product prices, special offers and promotions (time sensitive), and new products, product formats & packaging.
20. Section 1: Web Data Extraction. Scenario ➂: Hotel agency. An online travel agency with a best price guarantee needs the prices of competing agencies and the average market price.
21. Section 1: Web Data Extraction. Scenario ➃: Hedge fund. The house price index is published at regular intervals by the national statistics agency and affects share values of various industries; a hedge fund uses online market intelligence to predict the house price index.
22. Section 1: Web Data Extraction. And a lot more: monitoring blogs and forums for market intelligence (e.g., complaints, common problems, customer opinions); ranking and analysing product reviews; financial analysts monitoring trends and statistics for products of a certain company or category; interest rates from financial institutions; press releases and financial reports; patent search & analysis; …
33. Section 1: Web Data Extraction. Why Web Data Extraction Now? Trends:
Trend ➊: scale (every business is online): automation at scale
Trend ➋: web applications rather than web documents: automated form filling (deep web navigation)
Trend ➌: structured, common-sense data available: allows more sophisticated automated analysis; also a tool for improved data extraction?
35. Approaches to wrapper construction:
manual (e.g., Web Harvest): the user writes the wrapper, sometimes using wrapping libraries
supervised (e.g., Lixto): the user provides examples and refines the wrapper
semi-supervised: the user provides examples (per site); the wrapper is automatically learned
unsupervised (e.g., DIADEM): entirely automated; some systems omit examples and run analysis directly on all pages, some automatically guess examples
36. Section 2: Supervised Web Data Extraction. Supervised Web Data Extraction: rather than manually writing a wrapper in a programming language, the user records interaction sequences (such as form fillings) and visually selects examples for the data. Currently the gold standard for high-accuracy extraction. Examples: Lixto, Automation Anywhere, Web Harvest, …
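A recorded interaction sequence of this kind can be represented as plain data and replayed later. The sketch below is a toy illustration: the Action type, the selectors, and the stand-in browser are all invented here; real supervised tools such as Lixto drive an embedded browser rather than a log.

```python
# Toy sketch: a recorded interaction sequence stored as data, plus a
# replay loop. The Action type and the fake browser are invented for
# illustration; real tools drive an embedded browser via its DOM API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "fill" | "click"
    selector: str  # how to find the target element
    value: str = ""

recorded = [
    Action("fill", "input[name=q]", "oxford"),
    Action("click", "button[type=submit]"),
]

def replay(actions, browser):
    """Re-execute a recorded sequence against a browser-like object."""
    for a in actions:
        if a.kind == "fill":
            browser.fill(a.selector, a.value)
        elif a.kind == "click":
            browser.click(a.selector)

class LogBrowser:
    """Stand-in browser that just logs what it would do."""
    def __init__(self):
        self.log = []
    def fill(self, sel, val):
        self.log.append(f"fill {sel} <- {val}")
    def click(self, sel):
        self.log.append(f"click {sel}")

b = LogBrowser()
replay(recorded, b)
print(b.log)  # ['fill input[name=q] <- oxford', 'click button[type=submit]']
```

Storing the sequence as data (rather than code) is what lets such tools re-run, edit, and generalise a recording.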
40. Section 2: Supervised Web Data Extraction. Lixto: Extraction & Analysis. Lixto is a sophisticated, visual, semi-automated extraction tool: the user selects examples visually, patterns are derived automatically and then verified; highly scalable extraction and processing with the Lixto server. But it is also a data integration & business analytics suite: data cleaning; data flow scenarios (merge & filter from different web sites); market intelligence & analytics.
47. Section 3: Unsupervised Web Data Extraction. … and we really need it! Search engine providers (Google, Microsoft, Yahoo!) all work on information and data extraction for “vertical”, “object”, and “semantic” search: turning search engines into knowledge bases for decision support.
48. “No one really has done this successfully at scale yet.” (Raghu Ramakrishnan, Yahoo!, March 2009) “Current technologies are not good enough yet to provide what search engines really need. [...] Any successful approach would probably need a combination of knowledge and learning.” (Alon Halevy, Google, Feb. 2009)
49. Section 3: Unsupervised Web Data Extraction. Unsupervised: The Story So Far. Key observation: “database” web sites are generated from templates, so wrapper generators need to identify these templates automatically. Two major approaches:
machine learning from a few hand-labelled examples: similar to semi-supervised, but with only one set of examples for an entire domain; high precision only for simple domains (single entity type, few attributes)
fully automatic exploitation of the repeated structure of result pages: good precision, but needs a lot of data (many records per page, many pages) and doesn’t work for forms (no repetition)
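The repeated-structure idea can be sketched in a few lines: find the element whose children all share the same tag shape and treat those children as data records. This toy version (all names ours) compares only shallow tag signatures; real template-detection systems align trees far more carefully.

```python
# Minimal sketch of the "repeated structure" idea behind unsupervised
# wrapper induction: the element with the most identically shaped
# children is taken as the record container. Toy code, names ours.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def shape(self, depth=2):
        """Tag signature of the subtree, truncated at `depth`."""
        if depth == 0 or not self.children:
            return self.tag
        return (self.tag, tuple(c.shape(depth - 1) for c in self.children))

class TreeBuilder(HTMLParser):
    """Build a bare tag tree; text and attributes are ignored."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
    def handle_endtag(self, tag):
        if self.cur.parent:
            self.cur = self.cur.parent

def record_container(root):
    """Return (node, count) with the most children sharing one shape."""
    best, best_count = None, 1
    stack = [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        shapes = [c.shape() for c in n.children]
        for s in set(shapes):
            if shapes.count(s) > best_count:
                best, best_count = n, shapes.count(s)
    return best, best_count

html = """<html><body><h1>Results</h1><ul>
<li><b>Flat A</b><span>£900</span></li>
<li><b>Flat B</b><span>£850</span></li>
<li><b>Flat C</b><span>£700</span></li>
</ul></body></html>"""
p = TreeBuilder()
p.feed(html)
container, n = record_container(p.root)
print(container.tag, n)  # the <ul> holding three identically shaped <li>s
```

This also makes the slide's caveat concrete: with one record per page, or with a form instead of a result list, there is no repetition for the heuristic to find.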
59. Section 4: DIADEM. DIADEM: Overview. DIADEM combines:
a host of domain-specific annotators, which give us a first “guess” to automatically generate examples
a high-level ontology about domain entities and their phenomenology on web sites of the domain, which allows us to verify & refine these examples
advances in existing techniques for repeated-structure analysis and page & block classification
bottom-up understanding & top-down reasoning
66. Section 4: DIADEM. Achievements in Numbers. 15k–150k facts (5–50 MB) generated per web page; time: usually 30–60 sec, at most a few minutes; 300–400 predicates. Some numbers on the prototype: Java: 293 files with 44,993 lines of code; DLV rules: over 500 rules, over 200 predicates; gazetteers: 111 gazetteers with 48,000 entries; JAPE rules: 23 rule files with 30 rules.
74. Section 4: DIADEM » OPAL. OPAL: Overview. Three-step process: (1) browser extraction and annotation, (2) labelling & segmentation, (3) classification (phenomenological mapping). Model-based, knowledge-driven: the latter two steps are model transformations; a thin layer of domain-dependent concepts: field types and labels, triggers for field & form creation.
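The annotation step can be caricatured as gazetteer matching over the text near each form field. The gazetteers, field types, and labels below are invented for illustration; OPAL's real annotators and domain ontology are far richer than a word lookup.

```python
# Toy sketch of form-field labelling: match a field's label text
# against small domain gazetteers to guess its field type. The
# gazetteers and field types here are invented for illustration.
GAZETTEERS = {
    "location": {"town", "city", "postcode", "area", "location"},
    "price":    {"price", "min price", "max price", "budget"},
    "bedrooms": {"bedrooms", "beds", "bedroom"},
}

def classify_field(label: str) -> str:
    """Map a form-field label to a domain field type, or 'unknown'."""
    normalised = label.lower().replace(":", " ")
    words = normalised.split()
    for field_type, terms in GAZETTEERS.items():
        # match either a single word or the whole (multi-word) label
        if any(w in terms for w in words) or normalised.strip() in terms:
            return field_type
    return "unknown"

for lbl in ["Town or postcode:", "Max Price", "No. of beds", "Keywords"]:
    print(lbl, "->", classify_field(lbl))
```

An "unknown" result is where the model-based part earns its keep: the domain model can still type a field from its position in the form, its options, or the fields around it.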
83. Section 4: DIADEM » AMBER. AMBER: Overview. Three-step process like OPAL: (1) browser extraction and annotation, (2) classification (phenomenological mapping), (3) record segmentation (much harder than in OPAL). Model-based, knowledge-driven: the latter two steps are model transformations; a thin layer of domain-dependent concepts: record and attribute types, triggers for record & attribute creation.
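Record segmentation can be sketched as splitting a stream of annotated nodes whenever a designated separator attribute reappears. The separator choice and the sample data below are invented; AMBER additionally has to choose the separator and verify candidate records against the domain model, which is the hard part.

```python
# Toy sketch of record segmentation: group (attribute, value)
# annotations into records, starting a new record whenever the
# separator attribute reappears. Data and separator are invented.
def segment(annotations, separator="price"):
    """Group (attribute, value) pairs into records, splitting at `separator`."""
    records, current = [], {}
    for attr, value in annotations:
        if attr == separator and current:
            records.append(current)
            current = {}
        current[attr] = value
    if current:
        records.append(current)
    return records

stream = [("price", "£900"), ("location", "Jericho"), ("bedrooms", "2"),
          ("price", "£700"), ("location", "Cowley")]
print(segment(stream))
# [{'price': '£900', 'location': 'Jericho', 'bedrooms': '2'},
#  {'price': '£700', 'location': 'Cowley'}]
```

Note that the second record lacks a bedrooms attribute; tolerating such optional attributes while still rejecting wrong splits is exactly why segmentation needs the domain model.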
94. Section 4: DIADEM » OXPath. How to find a flat with OXPath:
Start at rightmove.co.uk: doc("rightmove.co.uk")
Fill "oxford" into the first visible field: /descendant::field()[1]/{"oxford"}
Click on the next button (the second following field): /following::field()[2]/{click /}
On the refinement form, just continue by clicking on the last field: /descendant::field()[last()]/{click /}
Grab all the prices: //p.price
Putting the steps together: doc("rightmove.co.uk")/descendant::field()[1]/{"oxford"}/following::field()[2]/{click /}/descendant::field()[last()]/{click /}//p.price
95. Section 4: DIADEM » OXPath. State of Web Extraction: no interaction with rich, scripted interfaces, i.e., no actions other than form filling and submission.
➀ Imperative extraction scripts: explicit variable assignments, flow control, etc.; either a proprietary selection language or a mix of XPath & external flow control.
➁ Focus on automation and visual interfaces: no or very limited extraction language, only ad-hoc extraction; no multiway navigation, no optimization.
96. Section 4: DIADEM » OXPath. Why OXPath? There is no “XPath for data extraction”: we want scalability, familiarity, simplicity, and support for web applications.
117. Section 4: DIADEM » Datalog±. Our goal: a unifying framework for DB technology + constraints, Datalog, and DLs (DL-Lite, EL, F-logic Lite), while keeping query answering tractable in data complexity!
118. Section 4: DIADEM » Datalog±. Extend Datalog by allowing in the head:
existential (∃) variables, giving tuple-generating dependencies (TGDs), e.g. employee(X), inProject(X,Y) → ∃Z employee(Z), supervises(Z,X)
equality (=), giving equality-generating dependencies (EGDs), e.g. reports(X,Y), reports(X,Z) → Y = Z
the constant false (⊥), giving negative constraints (NCs), e.g. employee(X), customer(X) → ⊥
What we get is Datalog[∃,=,⊥], i.e., Datalog+; suitably restricted to keep reasoning decidable, Datalog±.
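One chase step for the TGD above can be sketched as follows: whenever the body matches and the head is not yet satisfied, a fresh labelled null is invented for the existential variable and the head atoms are added. This toy code (representation and names ours) only shows the mechanics, not Datalog± syntax or a full chase.

```python
# Toy chase step for the TGD
#   employee(X), inProject(X,Y) -> ∃Z employee(Z), supervises(Z,X).
# Facts are tuples ("predicate", arg, ...); z1, z2, ... are labelled nulls.
from itertools import count

fresh = (f"z{i}" for i in count(1))  # generator of labelled nulls

def chase_step(facts):
    """Apply the TGD once to every body match it is not yet satisfied for."""
    new = set()
    employees = {a[0] for p, *a in facts if p == "employee"}
    projects = {tuple(a) for p, *a in facts if p == "inProject"}
    for x in employees:
        for (x2, y) in projects:
            if x2 != x:
                continue
            # head already satisfied if some supervises(_, x) exists
            if any(p == "supervises" and a[1] == x for p, *a in facts):
                continue
            z = next(fresh)  # invent a witness for ∃Z
            new.add(("employee", z))
            new.add(("supervises", z, x))
    return facts | new

facts = {("employee", "ann"), ("inProject", "ann", "p1")}
facts = chase_step(facts)
print(sorted(facts))
```

The invented supervisor z1 has no inProject fact, so a second chase step adds nothing here; in general, termination of the chase is exactly what the Datalog± fragment restrictions are designed to control.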
119. Section 4: DIADEM » Datalog±. Datalog±: Overview. Fragments by data complexity of query answering: FO-rewritable: Linear, DL-Lite, Sticky-join; PTIME: Guarded, EL.
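FO-rewritability means a query over the ontology compiles into a plain first-order (hence SQL) query over the stored facts, evaluated with no reasoning at query time. A toy illustration for a single linear TGD, with invented predicates:

```python
# Toy illustration of FO-rewritability for one linear TGD,
#   manager(X) -> employee(X):
# the query q(X) :- employee(X) is rewritten into the union
#   q(X) :- employee(X)  ∪  q(X) :- manager(X),
# which is answered directly over the stored facts. Real rewriting
# algorithms handle arbitrary CQs and TGD sets; names here are invented.
facts = {("employee", "bo"), ("manager", "ada")}

def answer_rewritten(facts):
    """Evaluate the rewritten union of queries over the plain facts."""
    rewriting = ["employee", "manager"]  # the two disjuncts of the UCQ
    return {a for p, a in facts if p in rewriting}

print(sorted(answer_rewritten(facts)))  # ['ada', 'bo']
```

In an RDBMS the rewriting would simply be a SQL UNION, which is why FO-rewritable fragments inherit the data complexity of relational query evaluation.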
120. Section 4: DIADEM » Datalog±. Datalog±: In Practice (Experiments). Prototype implementation: Nyaya (http://mais.dia.uniroma3.it/Nyaya/Home.html): implements guarded, weakly-acyclic, linear, and sticky Datalog±; couples a Datalog± engine with an efficient storage mechanism. Comparison with existing semantic data management solutions: IBM IODT [Ma et al., SIGMOD ‘08], Ontotext BigOWLIM [Kiryakov, WWW ‘06], Requiem [Horrocks et al., ISWC ‘09].
121. Section 4: DIADEM » Datalog±. Datalog±: In Practice (Experiments). Paper: Semantic Data Markets: Store, Reason and Query by R. De Virgilio, G. Orsi, L. Tanca and R. Torlone (submitted). Findings: commercial systems do not identify FO-rewritable fragments, although they could then answer queries much faster than they do now; testing the FO-rewritability conditions is easy.
122. Section 4: DIADEM » Datalog±. Datalog±: Updates. If the language of Σ is FO-rewritable: fact updates reduce to updates in an RDBMS; predicate updates reduce to recomputing the rewriting.