
Diadem DBOnto Kick Off meeting



http://diadem.cs.ox.ac.uk/

Published in: Technology


  1. 1. WELCOME 1 DIADEM data extraction methodology domain-centric intelligent automated Web data as you want it
  2. 2. TEAM 2 Georg Gottlob Professor, FRS Project lead Scientific director Tim Furche Postdoc Technical director Giovanni Grasso Postdoc Extraction infrastructure Giorgio Orsi Postdoc Knowledge modelling Christian Schallhart Postdoc Software engineering Xiaonan Guo Postdoc Forms and interaction
  3. 3. TEAM 3 Omer Gunes D.Phil. student Jinsong Guo D.Phil. student Andrew Sellers Captain USAF former D.Phil. student Andrey Kravchenko D.Phil. student Stefano Ortona D.Phil. student Cheng Wang D.Phil. student
  4. 4. FUNDING 4 CONCLUSION: $3.4M ~$5M, equity-free investment in basic, unique technology
  5. 5. DIADEM helps you collect the right data
  6. 6. DIADEM shovel for the data science rush
  7. 7. 7 50-80% Data scientists […] spend 50 to 80 percent of their time […] collecting and preparing […] digital data […] from sensors, documents, the web and conventional databases. –STEVE LOHR New York Times, Aug. 2014
  8. 8. INTRODUCTION 8 Data … is still a pain ○ Data exists, but getting and using it is hard ◗ For example, when you are making decisions ○ Tipping point: tech leaders leverage data to striking effect ◗ Amazon, Walmart, Google ○ What about the rest of the world?
  9. 9. 9 collect & prepare data “You can’t do this manually, you’re never going to find enough data scientists and analysts.” – SHARMILA SHAHANI-MULLIGAN CEO Clearstory (New York Times, Aug 2014)
  10. 10. INTRODUCTION 10 … but there is a remedy ○ We can get you the data you need in the form you need ◗ from competitors ◗ from open sources ◗ from your intranet ○ At any scale, covering popular as well as long tail sources ○ Far more comprehensive than manual solutions ○ Far cheaper than even partial, manual solutions
  11. 11. HOW: TECHNOLOGY & TEAM 11 What? Data Extraction
      ref-code | postcode | bedrooms | bathrooms | available  | price
      33453    | OX2 6AR  | 3        | 2         | 15/10/2013 | £1280 pcm
      33433    | OX4 7DG  | 2        | 1         | 18/04/2013 | £995 pcm
  12. 12. HOW: TECHNOLOGY & TEAM 12 What? Data Extraction >10000
      ref-code | postcode | bedrooms | bathrooms | available  | price
      33453    | OX2 6AR  | 3        | 2         | 15/10/2013 | £1280 pcm
      33433    | OX4 7DG  | 2        | 1         | 18/04/2013 | £995 pcm
  13. 13. Scale — what it’s all about
  14. 14. 14 “For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database” –NILESH DALVI Yahoo!
  15. 15. 15 “No one really has done this successfully at scale yet” –RAGHU RAMAKRISHNAN Yahoo!
  16. 16. 16 “Current technologies are not good enough yet” –ALON HALEVY Google
  17. 17. HOW: TECHNOLOGY & TEAM 17 Technology: Our Strength 10,493 Sites from real-estate and used-car 92% Effective wrappers for more than 92% of sites on average 97% Precision of extracted primary attributes 20 2.1 Days on a 45 node Amazon EC2 cluster Days (one expert) to adjust system to a new domain
  18. 18. HOW: TECHNOLOGY & TEAM 18 Technology: Our Strength [Plot: time (seconds, 0 to 2000) against number of records (0 to 1000)]
  19. 19. HOW: TECHNOLOGY & TEAM 19
      Phenomenology: domain-independent appearance of objects on the web; reason for DIADEM’s high accuracy; easily adapted to new domains
      Self-organising: adjusts itself to observations on the pages; different sequence of tasks for every site; strong isolation of components
      Rule-based AI: declarative rules instead of heuristics; uniform query of pages, phenomenology, … all
  20. 20. HOW: TECHNOLOGY & TEAM 20 http://diadem.cs.ox.ac.uk/demo
  21. 21. HOW: TECHNOLOGY & TEAM 21 Data extraction isn’t new …
      Manual: scaling costly; very common
      Automatic: fully algorithmic; active research
      Supervised + magic: human + algorithm; most commercial products
  22. 22. HOW: TECHNOLOGY & TEAM 22
      Competitors (Mozenda, Lixto, Connotate, BlackLocus, import.io, scrapinghub.com, promptcloud.com): massive human effort, continuously; low scale, one or few sources; low cost efficiency
      DIADEM (data extraction methodology, domain-centric intelligent automated): small human effort, once; massive scale, thousands of sources; high cost efficiency
  23. 23. HOW: TECHNOLOGY & TEAM 23 What about Google & Co. ○ Verticals are becoming ever more relevant for search ◗ the major change to Google’s result page in the last decade ◗ crucial for intelligent personal assistants (Siri, Google Now) ○ Revived interest in large-scale extraction of structured data ◗ as part of knowledge graph ◗ currently only good for common sense facts ○ Recent AI/deep learning acquisitions by Google, Facebook
  24. 24. HOW? INCUBATION PLAN 24 Data science: a huge market. $50 billion: data science market 2017 (according to Forbes, Wikibon forecast). $25 billion: data collection & cleaning (according to the New York Times)
  25. 25. Clients HOW? INCUBATION PLAN 25
  26. 26. Strategic Partners HOW? INCUBATION PLAN 26 Price intelligence & analytics Price comparison & catalogs Recommendations & reviews
  27. 27. HOW? INCUBATION PLAN 27 DIADEM Vision Deep data for products Short term
  28. 28. HOW? INCUBATION PLAN 28 DIADEM Vision Deep data for everyone Long term
  29. 29. HOW? INCUBATION PLAN 29 DIADEM Vision “Suggest the best smart watch for my preferences!” “Suggest a great evening out!” “Suggest a cheap headphone with great bass!” “Suggest a great hotel in an area with lots of bars and close to my conference!”
  30. 30. HOW: TECHNOLOGY & TEAM 30 WWW 2014: Fallacies in DE –KEVIN C. CHANG Co-Founder Cazoodle, move.com, UIUC #1: Cannot start with ‘given a set of result pages’ #2: Must not stop at 70% accuracy #3: Must be scalable to more than thousands of sources #4: Must leverage human feedback DIADEM: ✓ ✓ ✓ ✓
  31. 31. DIADEM ANALYSIS 31 Table 3: Wrapper quality
      dataset            | wrapper effective | wrong or missing data | no data
      UK real estate     | 91%               | 7%                    | 2%
      Oxford real estate | 90%               | 6%                    | 4%
      ViNTs10            | 4%                | 5%                    | 91%
      UK used cars       | 93%               | 4%                    | 3%
      US real estate     | 90%               | 5%                    | 5%
  32. 32. DIADEM ANALYSIS 32 Competition? [Bar charts: precision and recall (0% to 100%) for records, comparing MDR, DEPTA, ViNTs and DIADEM on RE-RND and UC-RND] CONCLUSION: Do only a part of the job, and poorly
  33. 33. DIADEM ANALYSIS 33 Competition? [Bar charts: precision and recall (0% to 100%) for attributes, comparing RoadRunner, DEPTA and DIADEM on RE-RND and UC-RND] CONCLUSION: Do only a part of the job, and poorly
  34. 34. DIADEM ANALYSIS 34 Competition? CONCLUSION: Do only a part of the job, and poorly
      [Chart: Attribute quality (axis 0% to 25%) for unit, beds, make, transmission, age, engine_size, period_baths, receptions, price, location, postcode, model, colour, body_type, fuel_type, registration, door_number, mileage]
      Table 3: Form labeling accuracy
      ICQ dataset     | HA [14] | ExQ [41] | StatParser [36] | DIADEM [17]
      F1 for labeling | 92%     | 96%      | 96%             | 98%
      cars are more prominently placed on the site. There are about 3% of sites where no wrapper can be induced, typically as they contain no properties, all properties are on aggregators, or they contain no pivot attribute. For these sites, DIADEM correctly detects that there is no effective wrapper. The final case is that DIADEM fails to produce an effective wrapper, yet one exists. The most common reasons for these failures are dynamic forms (15%), result pages
  35. 35. DIADEM 35 DIADEM’s Components 1 ROSeAnn (VLDB’14) World-best entity extraction from text and structure
  36. 36. DIADEM 36 DIADEM’s Components
      1 ROSeAnn (VLDB’14) World-best entity extraction from text and structure
      2 OPAL (WWW’12, VLDBJ’13) World-most-effective form understanding & filling
      The Ontological Key: Automatically Understanding and Integrating Forms
      Range widget ⟸ two fields + connected by “to” or other range connector + some clues in the annotations or classifications
      TEMPLATE field_by_proper<C,A> { field<C>(N) ⇐ N@A{d,e,p} }
      TEMPLATE field_by_segment<C,A> { field<C>(N) ⇐ N@A{e,p} }
      TEMPLATE field_by_value<C,A> { field<C>(N) ⇐ N@A{m}, ¬(A1 ≠ A, N@A1{d,e,p} ∨ N@A1{e,p}) }
      TEMPLATE field_minmax<C,CM,A> {
        field<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), N1@A{e,d}, (field<C>(N2) ∨ N2@A{e,d})
        field<C_range>(N2) ⇐ child(N1,G), child(N2,G), next(N2,N1), field<C>(N1), N2@range_connector{e,d}, ¬(A1$ C, N2@A1{d})
        field<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
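The range-widget idea on this slide (two fields joined by “to” or another range connector) can be illustrated with a minimal Python sketch. It is not OPAL's implementation: the `Node` class, the `RANGE_CONNECTORS` vocabulary and the function name are hypothetical, and the check is reduced to three adjacent children of one group node.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed connector vocabulary; the slide only names "to" explicitly.
RANGE_CONNECTORS = {"to", "-", "until"}


@dataclass
class Node:
    kind: str                      # "field" or "text"
    label: str                     # field name or text content
    concept: Optional[str] = None  # e.g. "price" for an annotated field


def find_range_widgets(children: List[Node]) -> List[Tuple[str, str, str]]:
    """Scan the children of one group node for (field, connector, field)
    triples in which both fields carry the same concept."""
    found = []
    for a, mid, b in zip(children, children[1:], children[2:]):
        if (a.kind == "field" and b.kind == "field"
                and mid.kind == "text"
                and mid.label.strip().lower() in RANGE_CONNECTORS
                and a.concept is not None
                and a.concept == b.concept):
            found.append((a.concept, a.label, b.label))
    return found


# Hypothetical form group: "Price [min] to [max] [beds]"
group = [
    Node("text", "Price"),
    Node("field", "min_price", "price"),
    Node("text", " to "),
    Node("field", "max_price", "price"),
    Node("field", "beds", "bedrooms"),
]
widgets = find_range_widgets(group)
print(widgets)
```

Only the price pair qualifies here, since the `beds` field has a different concept and no connector before it.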
  37. 37. DIADEM 37 DIADEM’s Components 1 ROSeAnn (VLDB’14) World-best entity extraction from text and structure 2 OPAL (WWW’12, VLDBJ’13) World-most-effective form understanding & filling 3 AMBER (TWeb’14) World-most-accurate record identification for listing pages [Figure: DOM tree of a data area whose a, div, p, span, b, em, strong and i nodes carry PRICE, LOCATION and BEDS annotations]
  38. 38. DIADEM 38 DIADEM’s Components
      1 ROSeAnn (VLDB’14) World-best entity extraction from text and structure
      2 OPAL (WWW’12, VLDBJ’13) World-most-effective form understanding & filling
      3 AMBER (TWeb’14) World-most-accurate record identification for listing pages
      4 OXPath (VLDB’11, VLDBJ’13) World-most-efficient extraction language
      Bitemporal Complex Event Processing of Web Event Advertisements
      Tim Furche (1), Giovanni Grasso (1), Michael Huemer (2), Christian Schallhart (1), and Michael Schrefl (2)
      (1) Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD, firstname.lastname@cs.ox.ac.uk
      (2) Department of Business Informatics – Data & Knowledge Engineering, Johannes Kepler University, Altenberger Str. 69, Linz, Austria, lastname@dke.uni-linz.ac.at
      doc('http://www.scottfraser.co.uk/')//select[@id='search-type']/{1 /}
        //input/{click /}/(//div[1]/table//td[4]/a/{click /})*{0,500}
        //div[@class='property-wrapper']:<record>
          [? .:<ORIGIN_URL=current-url()>]
  39. 39. DIADEM 39 DIADEM’s Components 1 ROSeAnn (VLDB’14) World-best entity extraction from text and structure 2 OPAL (WWW’12, VLDBJ’13) World-most-effective form understanding & filling 3 AMBER (TWeb’14) World-most-accurate record identification for listing pages 4 OXPath (VLDB’11, VLDBJ’13) World-most-efficient extraction language 5 DIADEM (VLDB’14) World-first accurate, automatic full-site extraction system
  40. 40. FORM PHENOMENOLOGY 40 Example 1: Form ○ Task: classify and group form fields into semantic segments ◗ Problem: HTML structure is only an approximation ○ Phenomenology: Detect semantic segments, e.g., ◗ if there is a continuous list of option fields (radio buttons, checkboxes) ◗ with the same type ◗ and a parent that can’t be classified
  41. 41. FORM PHENOMENOLOGY 41 Example 1: Form
      segment<C>(∃X) :-
        html-child(N1, P), html-child(N2, P), N1 ≠ N2,
        ¬segment(P),                                     (parent cannot be classified)
        option-field(N1), option-field(N2),              (both option fields)
        concept<C>(N1), concept<C>(N2),                  (same type C)
        max-cont-list-of-fields-with-type<C>(N1, N2).    (end points of largest continuous list of type C)
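The segment rule on this slide can be read procedurally. The Python sketch below is an illustration under simplifying assumptions, not the DIADEM rule engine: the `Field` class, the flat document-order list of fields and the `classified_parents` set are hypothetical stand-ins for the form model.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Set, Tuple


@dataclass
class Field:
    name: str
    parent: str             # id of the HTML parent node
    is_option: bool         # radio button or checkbox
    concept: Optional[str]  # classified type C, None if unclassified


def find_segments(fields: List[Field],
                  classified_parents: Set[str]) -> List[Tuple[str, List[str]]]:
    """Return (concept, field names) for every maximal continuous run of
    option fields that share a parent and a concept, provided the parent
    itself has not been classified."""
    segments = []
    # groupby merges only adjacent items, which yields continuous runs.
    key = lambda f: (f.parent, f.concept, f.is_option)
    for (parent, concept, is_option), run in groupby(fields, key=key):
        names = [f.name for f in run]
        if (is_option and concept is not None
                and parent not in classified_parents
                and len(names) >= 2):  # the rule requires N1 != N2
            segments.append((concept, names))
    return segments


# Hypothetical form in document order.
fields = [
    Field("keyword", "form", False, "text"),
    Field("garage", "div1", True, "feature"),
    Field("garden", "div1", True, "feature"),
    Field("pool", "div1", True, "feature"),
    Field("submit", "form", False, None),
]
segments = find_segments(fields, classified_parents=set())
print(segments)
```

The three checkboxes under `div1` form one segment of type `feature`; if `div1` were already classified, no segment would be derived.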
  42. 42. RESULT PAGE PHENOMENOLOGY 42 Example 2: Data areas ○ Task: Finding areas on a page that contain relevant data ○ Idea: Use the regularity resulting from the DB templates ○ Problem: Distinguishing regular noise, e.g., featured properties ○ Solution: Maximisation problem over pivot elements ◗ occurrences of mandatory attributes such as price
  43. 43. RESULT PAGE PHENOMENOLOGY 43
      [Figure 3: Data area identification (node labels D1, D2, D3, E, M1,1 to M1,4)]
      consistent_cluster_members(C, N1, N2, N3) :- pivot(N1), pivot(N2), ... similar_depth(N1, N2), similar_depth(N2, N3), similar_depth(N1, N3), similar_tree_distance(N1, N2, N3).
      cluster(C, N) :- ... continuous, lca, contains at least one of all mandatories
      its of order dominance: The pivot nodes in E are organized rather regularly, whereas the pivot nodes in D1 vary quite notably. However, their variation is small enough that M1,1 to M1,4 are depth and
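A rough Python sketch of the clustering idea on this slide follows. It keeps only the `similar_depth` and continuity conditions and the lowest common ancestor (lca); the tree-distance test and the check for mandatory attributes are omitted. The path encoding, the tolerance `tol` and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Path = Tuple[int, ...]  # child indices from the root down to a pivot node


def similar_depth(a: Path, b: Path, tol: int = 1) -> bool:
    """Two pivots have similar depth if their depths differ by at most tol."""
    return abs(len(a) - len(b)) <= tol


def lca(paths: Sequence[Path]) -> Path:
    """Longest common prefix of the paths, i.e. the lowest common ancestor."""
    prefix = []
    for steps in zip(*paths):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return tuple(prefix)


def cluster_pivots(pivots: List[Path], tol: int = 1) -> List[List[Path]]:
    """Greedily group pivots (given in document order) into continuous
    clusters whose members have pairwise similar depth."""
    clusters: List[List[Path]] = []
    for p in pivots:
        if clusters and all(similar_depth(p, q, tol) for q in clusters[-1]):
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def largest_data_area(pivots: List[Path], tol: int = 1):
    """Pick the cluster with the most pivots and return its lca and members."""
    best = max(cluster_pivots(pivots, tol), key=len)
    return lca(best), best


# Hypothetical page: one shallow "featured" price, then four listing prices.
pivots = [(0, 0, 1), (1, 0, 2, 0, 1), (1, 1, 2, 0, 1), (1, 2, 2, 0), (1, 3, 2, 0, 1)]
area, members = largest_data_area(pivots)
print(area, len(members))
```

The shallow featured pivot ends up in its own cluster, so the data area is rooted at the node reached by path `(1,)`.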
  44. 44. RESULT PAGE PHENOMENOLOGY 44 Example 2: Record alignment ○ set of uniform, non-overlapping records ○ maximise regularity, minimise outliers ◗ pairwise edit distance with bias towards pivot nodes
      [Figure 4: Record Segmentation (DOM tree of a data area with a, img, div and p nodes and prices £860, £900, £500)]
      Algorithm 2: Segmentation(DOM P, Data Area d)
        L ← {n : child(f(d), n) ∈ P ∧ ∃n′ ∈ y(d) : desc-or-self(n, n′)};
        sort L in document order;
        foreach 1 ≤ k ≤ |L|−1 do Partition[k] ← {n : L[k] ( n ) L[k+1]};
        Len ← min{|Partition[i]| : |{j : |Partition[j]| = |Partition[i]|}| maximal};
        while L[1] −sibl L[2] < Len do delete L[1];
        while L[|L|−1] −sibl L[|L|] < Len do delete L[|L|];
        while 1 < k < |L| do
          if L[k] −sibl L[k+1] < Len then delete L[k+1] else k++;
        StartCandidates ← {L} ∪ {{n : ∃l ∈ L : n −sibl l = i} : i ≤ Len};
        OptimalSegmentation ← ∅; OptimalSim ← ∞;
        foreach S ∈ StartCandidates do
          sort S in document order;
          foreach 1 ≤ k ≤ |L|−1 do
            Segmentation[k] ← {n : n −sibl S[k] ≤ Len};
          if ∀P ∈ Segmentation : |P| = Len then
            if irregularity(Segmentation) < OptimalSim then
      all text nodes. With the exception of a’s tag, all HTML tags are annotated by the type of step. For the leftmost a and its i descendant in Figure 5, e.g., the tag path is a/first-child::p/first-child::span/next-sibl::i. Based on the tag path, AMBER quantifies the fraction of records that support the assumption that a node n is an attribute of type A within record r with the support supp_r(n, A).
      DEFINITION 9. Let E be an extraction instance on DOM P, containing a node n within record r belonging to data area d, and A ∈ 𝒜 an attribute type. Then supp_r(n, A) denotes the support of n as attribute of type A within r, defined as the fraction of records r′ ≠ r in d that contain a node n′ with tag-path_r(n) = tag-path_r′(n′) that is annotated with A.
      Consider a data area with 10 records, containing 1 PRICE-annotated
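The support measure in Definition 9 can be sketched in a few lines of Python. This is an illustration only, not AMBER's code: the `Record` representation (one annotation set per tag path) and the sample data are assumptions, and each record is assumed to hold at most one node per tag path.

```python
from typing import Dict, List, Set

# A record maps the tag path of each node to the attribute types annotated on it.
Record = Dict[str, Set[str]]


def support(records: List[Record], r: int, tag_path: str, attr: str) -> float:
    """Fraction of records other than records[r] that contain a node with
    the same tag path annotated with attr."""
    others = [rec for i, rec in enumerate(records) if i != r]
    if not others:
        return 0.0
    hits = sum(1 for rec in others if attr in rec.get(tag_path, set()))
    return hits / len(others)


# Hypothetical data area: 10 records, 6 of which have a PRICE annotation
# on the node reached by the same tag path.
path = "a/first-child::p/first-child::span"
records: List[Record] = [{path: {"PRICE"}} for _ in range(6)]
records += [{path: set()} for _ in range(4)]

unannotated = support(records, 9, path, "PRICE")  # 6 of the 9 other records
annotated = support(records, 0, path, "PRICE")    # 5 of the 9 other records
print(unannotated, annotated)
```

A high support for an unannotated node suggests a missed annotation, while a low support for an annotated node suggests noise; the thresholds for acting on either are not given on the slide.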
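  45. 45. BLOCK PHENOMENOLOGY 45 Example 3: Pagination links
      Table 1: Sample pages (the Screenshot column is not reproduced)
      Category    | Website       | n   | n1 | n2 | P | R
      Real estate | FindAProperty | 370 | 1  | 1  | 1 | 1
      Real estate | Zoopla        | 332 | 1  | 1  | 1 | 1
      Real estate | Savills       | 234 | 2  | 2  | 1 | 1
      Cars        | Autotrader    | 262 | 2  | 2  | 1 | 1
      Cars        | Motors        | 472 | 2  | 2  | 1 | 1
      Cars        | Autoweb       | 103 | 2  | 2  | 1 | 1
      Retail      | Amazon        | 448 | 1  | 1  | 1 | 1
      Retail      | Ikea          | 290 | 2  | 0  | 1 | 1
      Retail      | Lands’ End    | 527 | 2  | 2  | 1 | 1
      Forums      | TechCrunch    | 279 | 0  | 1  | 1 | 1
      Forums      | TMZ           | 200 | 2  | 2  | 1 | 1
      Forums      | Ars Technica  | 341 | 2  | 2  | 1 | 1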
  46. 46. BLOCK PHENOMENOLOGY 46 Example 3: Pagination links ○ Machine learning on top of derived features
      Table 3: PLM: Pagination Link Model
      No. | Description | Type | Predicate
      Content
      1 | Annotated as NEXT | bool | plm::annotated_by<NEXT>
      2 | Annotated as PAGINATION | bool | plm::annotated_by<PAGINATION>
      3 | Annotated as NUMBER | bool | plm::annotated_by<NUMBER>
      4 | Number of characters | int | plm::char_num
      Page position
      5 | Relative position on page | int2 | plm::relative_position<css::page>
      6 | Relative position in first screen | int2 | plm::relative_position<std::first_screen>
      7 | In first screen | bool | plm::contained_in<std::first_screen>
      8 | In last screen | bool | plm::contained_in<std::last_screen>
      Visual proximity
      9 | Pagination annotation close to node | bool | plm::in_proximity<plm::annotated_by<PAGINATION>>
      10 | Number of close numeric nodes | int | plm::num_in_proximity<numeric>
      11 | Closest numeric node is a link | bool | plm::closest<std::left_proximity>_with<numeric>_is<non_link>
      12 | Closest numeric node has different style | bool | <numeric>_is<different_style>
      13 | Closest link annotated with NEXT | bool | <dom::clickable>_is<plm::annotated_by<NEXT>
      14 | Ascending w. closest numeric left, right | bool | plm::ascending-numerics
      Structural
      15 | Preceding numeric node is a link | bool | plm::closest<std::preceding>_with<numeric>_is<non_link>
      16 | Preceding numeric node has different style | bool | <numeric>_is<different_style>
      17 | Preceding link annotated with NEXT | bool | <dom::clickable>_is<plm::annotated_by<NEXT>
  47. 47. BLOCK PHENOMENOLOGY 47 Example 3: Pagination links ○ Datalog± rules for deriving features ○ Lots of visual reasoning on the page ○ Rich template language to avoid duplication
      TEMPLATE annotated_by<Model,AType> {
        <Model>::annotated_by<AType>(X) ⇐ node_of_interest(X), gate::annotation(X, <AType>, _). }
      TEMPLATE in_proximity<Model,Property(Close)> {
        <Model>::in_proximity<Property>(X) ⇐ node_of_interest(X), std::proximity(Y,X), <Property(Close)>. }
      TEMPLATE num_in_proximity<Model,Property(Close)> {
        <Model>::in_proximity<Property>(X,Num) ⇐ node_of_interest(X), std::proximity(Close,X), Num = #count(N: <Property(Close)>). }
      TEMPLATE relative_position<Model,Within(Height,Width)> {
        <Model>::relative_position<Within>(X, (PosH, PosV)) ⇐ node_of_interest(X), css::box(X, LeftX, TopX, _, _), <Within(Height,Width)>, PosH = 100·LeftX / Width, PosV = 100·TopX / Height. }
      TEMPLATE contained_in<Model,Container(Left,Top,Bottom,Right)> {
        <Model>::contained_in<Container>(X) ⇐ node_of_interest(X), css::box(X,LeftX,TopX,RightX,BottomX), <Container(Left,Top,Right,Bottom)>, Left < LeftX < RightX < Right, Top < TopX < BottomX < Bottom. }
      TEMPLATE closest<Model,Relation(Closest,X),Property(Closest),Test(Closest)> {
        <Model>::closest<Relation>_with<Property>_is<Test>(X) ⇐ node_of_interest(X), <Relation(Closest,X)>, <Property(Closest)>, <Test(Closest)>, ¬(<Relation(Y,X)>, <Property(Y)>, <Relation(Y,Closest)>). }
      Fig. 4: BERyL feature templates
      In a similar way, the second template defines a boolean feature that holds for nodes
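Two of the feature templates on this slide, `relative_position` and `contained_in`, reduce to simple arithmetic on bounding boxes. The Python sketch below mirrors that arithmetic only; it is not BERyL code, and the `Box` type, the page dimensions and the sample link are hypothetical.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def relative_position(node: Box, width: float, height: float) -> Tuple[float, float]:
    """(PosH, PosV): the node's left and top edges as percentages of the
    reference width and height."""
    return 100 * node.left / width, 100 * node.top / height


def contained_in(node: Box, container: Box) -> bool:
    """True if the node's box lies strictly inside the container's box."""
    return (container.left < node.left < node.right < container.right
            and container.top < node.top < node.bottom < container.bottom)


# Hypothetical 1000 x 4000 pixel page with an 800 pixel high first screen
# and a candidate pagination link near the bottom of the page.
page = Box(0, 0, 1000, 4000)
first_screen = Box(0, 0, 1000, 800)
link = Box(400, 3600, 460, 3620)

pos = relative_position(link, width=1000, height=4000)
in_first = contained_in(link, first_screen)
in_page = contained_in(link, page)
print(pos, in_first, in_page)
```

Feature values like these would form part of the vector passed to the classifier mentioned on the previous slide; which learner is used is not stated in the transcript.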
  48. 48. Discussion QUESTIONS 48 ?
