Integrating Web Query Results:
     Holistic Schema Matching
  1. Integrating Web Query Results: Holistic Schema Matching
     CIKM’08
     Shui-Lung Chuang, Kevin Chen-Chuan Chang
     Yen-Ling Lin
     2009/04/13
     26 pages
  2. Outline
     • Introduction
     • Approach
     • Framework
     • Algorithm
     • Experiments
  3. Introduction [no text recovered]
  4. Introduction [no text recovered]
  5. [no text recovered]
  6. Introduction – Schema Matching on Query Results
     • Data fields are the basic units processed by matching.
     • A data field can be viewed as a label plus a set of values.
     • We lack explicit and complete schema information.
     • To address these challenges, we exploit two opportunities that arise when integrating query results:
         1) We often need to integrate multiple sources, and useful effects occur naturally when cross-referencing many sources.
         2) Although no schema-based constraint is available, useful regularities can be observed across many sources. Treated as observed domain constraints, these regularities are very helpful for discovering matchings.
  7. Introduction – Approach
     • The enrichment occurs at three levels:
         1. The content of a field
         2. The kinds of fields
         3. The constraints of fields
     • With all of this enrichment, we learn a more complete schema that describes the whole input data.
     • This learned schema can then help us make further matchings.
  8. Framework – Problem Statement
     • Suppose A = {a1, a2, …} for the book source. For source S1, the fields X1 = (x11, x12, …, x17) can be assigned the matching Y1 = (a1, a2, …, a7).
     • Matching means discovering the assignment of the groups in A to the fields of each source: $Y_s = (y_{s1}, \dots, y_{s l_s})$, where each $y_{si} \in A$ is the group to which source field $x_{si} \in X_s$ is assigned.
  9. Framework – Matching as Domain Schema Discovery
     • Let the domain schema be M = (A, B)
         • A: the set of domain fields
         • B: the statistical constraints
     • For each source Ss:
         1) Project M onto a source schema Ms = (Ys, Vs)
              1) Ys: a subset of A to be the fields of source Ss
              2) Vs: a set of constraints instantiated from B
         2) Construct the source instances Xs
         3) Vs → Us, Ys → Xs: Is = (Xs, Us)
         4) Output: Xs
  10. Framework – Matching as Domain Schema Discovery
     • This data-generation procedure can be sketched conceptually as follows:
     • M = (A, B), where A = {a1, …, a11} and B = {first(a1):.67, first(a2):.33, pos≻(a2, a3):1}
     • M1 = (Y1, V1), where Y1 = {a1, …, a5, a7, a8} and V1 = {first(a1):.67, first(a2):.33, pos≻(a2, a3):1}
     • We generate data using source schema M1.
         • Map Y1 as X1, e.g., a2 is mapped as x12.
         • first(a1) in V1 is rewritten as first(x11) in U1, and pos≻(a2, a3) as pos≻(x12, x13).
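The projection and rewriting steps on this slide can be sketched in Python. This is only an illustration under stated assumptions: the helper names `project_schema` and `rewrite_constraints`, the dictionary encoding of constraints, the ASCII key `"pos>"` for pos≻, and the choice to carry confidences over unchanged into U1 are all hypothetical and not taken from the slides.

```python
# Hypothetical sketch: project a domain schema onto one source and rewrite
# its constraints in terms of the source's own field names.

def project_schema(domain_constraints, source_fields):
    """Keep the constraints of B whose arguments all occur in the source (gives V_s)."""
    return {
        (pred, args): conf
        for (pred, args), conf in domain_constraints.items()
        if all(a in source_fields for a in args)
    }

def rewrite_constraints(source_constraints, field_to_column):
    """Rename domain fields to source fields (V_s -> U_s).
    Assumption: confidences are copied as-is; the slides do not specify this."""
    return {
        (pred, tuple(field_to_column[a] for a in args)): conf
        for (pred, args), conf in source_constraints.items()
    }

# Domain constraints B from the slide.
B = {
    ("first", ("a1",)): 0.67,
    ("first", ("a2",)): 0.33,
    ("pos>", ("a2", "a3")): 1.0,
}
# Source fields Y1 from the slide: a1..a5, a7, a8.
Y1 = ["a1", "a2", "a3", "a4", "a5", "a7", "a8"]

V1 = project_schema(B, Y1)
# Map Y1 as X1 by position: a1 -> x11, a2 -> x12, ...
mapping = {a: f"x1{i}" for i, a in enumerate(Y1, start=1)}
U1 = rewrite_constraints(V1, mapping)
```

With these inputs, `mapping["a2"]` is `"x12"` and `U1` contains first(x11) and pos≻(x12, x13), in line with the slide's example.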
  11. Framework – Matching as Domain Schema Discovery
     • Let the data observed from source Ss be Is = (Xs, Us).
     • Given the matching Y = {Ys : s ∊ S}, learning the best domain schema can be expressed as a probabilistic optimization:
         $M^* = \arg\max_{M} \prod_{s \in S} p(I_s \mid Y_s, M)$
     • Similarly, if the domain schema M is given, the best matching $Y^* = \{Y_s^* : s \in S\}$ can be discovered, again using statistical techniques, by finding the most likely assignment of domain fields to the fields of each source:
         $Y_s^* = \arg\max_{Y_s} p(I_s \mid Y_s, M)$ for each s ∊ S
  12. Framework – Matching as Domain Schema Discovery
     • Suppose X1 = {x11, x12, x13} and X2 = {x21, x22}, and suppose we have one predicate function to check: first. Then I1 = (X1, U1) where U1 = {first(x11):1}, and I2 = (X2, U2) where U2 = {first(x21):1}.
     • Suppose Y1 = {a1, a2, a3} and Y2 = {a2, a3}. Construct M1 = (Y1, V1) with V1 = {first(a1):1}, and M2 = (Y2, V2) with V2 = {first(a2):1}.
     • first(a1) holds for M1 but not for M2, so first(a1) has confidence 0.5. Likewise, first(a2) holds for M2 but not for M1. Combining the source schemas M1 and M2, the domain schema becomes M = (A, B), where A = {a1, a2, a3} and B = {first(a1):.5, first(a2):.5}.
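The worked example above can be reproduced with a short Python sketch. It assumes that a constraint's confidence is the fraction of all sources in which the constraint holds, which is how the 0.5 values on this slide come out. The function name `combine_source_schemas` and the tuple encoding of constraints are hypothetical.

```python
from collections import Counter

def combine_source_schemas(source_schemas):
    """Merge source schemas (Y_s, V_s) into a domain schema (A, B).
    Assumption: confidence = fraction of sources in which a constraint holds."""
    n = len(source_schemas)
    fields = sorted({a for ys, _ in source_schemas for a in ys})
    counts = Counter(c for _, vs in source_schemas for c in vs)
    constraints = {c: counts[c] / n for c in counts}
    return fields, constraints

# Source schemas from the slide: M1 = (Y1, V1), M2 = (Y2, V2).
M1 = (["a1", "a2", "a3"], {("first", ("a1",))})
M2 = (["a2", "a3"], {("first", ("a2",))})

A, B = combine_source_schemas([M1, M2])
```

Here `A` is `["a1", "a2", "a3"]` and `B` assigns confidence 0.5 to both first(a1) and first(a2), matching the slide.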
  13. Framework – Formulation and Overview
     • Field Model
         • A field model a is a statistical model specifying how to generate instances.
         • It is a function that accepts an instance z and produces p(z|a), the likelihood that z is an instance produced by the field model a.
     • Statistical Constraint
         • A statistical constraint b is written as f(e):c, where f is a predicate name, e is the vector of elements, and c is a confidence value in the range [0,1].
  14. Framework – Formulation and Overview
     • Overall, our framework translates the problem of instance-based matching into a schema-discovery problem.
     • With this strategy, we leverage not only the data instances but also the regularities observed from the data, in a principled way.
  15. Algorithm
     • To solve our matching problem, we need to discover either an optimal matching Y* or an optimal schema M*.
     • If one of them is obtained, the other can be derived.
     • The basic idea is to start from an initial guess of the matching Y and iteratively improve it using the schema M derived from the current estimate of Y.
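The alternation between matching and schema can be sketched as a loop. This is a toy illustration, not the HoliMatch algorithm itself: each field is reduced to a single number, a "field model" is just the mean of its group, and matching picks the nearest mean. The names `holistic_iterate`, `init_match`, `learn_schema`, and `schema_match` mirror the slide's terminology, but their bodies and the stopping rule (stop when the matching no longer changes) are assumptions made for this sketch.

```python
def holistic_iterate(sources, init_match, learn_schema, schema_match, max_iters=10):
    """Alternate between deriving a schema from Y and re-matching with it."""
    matching = init_match(sources)
    schema = learn_schema(sources, matching)
    for _ in range(max_iters):
        new_matching = schema_match(sources, schema)
        if new_matching == matching:  # assumed stopping rule: fixed point
            break
        matching = new_matching
        schema = learn_schema(sources, matching)
    return matching, schema

# Toy stand-ins (hypothetical): fields are numbers, models are group means.
def init_match(sources):
    # Initial guess: label fields by their position in each source.
    return {s: [f"a{i + 1}" for i in range(len(xs))] for s, xs in sources.items()}

def learn_schema(sources, matching):
    groups = {}
    for s, xs in sources.items():
        for x, label in zip(xs, matching[s]):
            groups.setdefault(label, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}

def schema_match(sources, schema):
    # Assign each field to the domain field whose mean is closest.
    return {
        s: [min(schema, key=lambda a: abs(schema[a] - x)) for x in xs]
        for s, xs in sources.items()
    }

sources = {"s1": [1.0, 10.0], "s2": [1.2, 9.5], "s3": [10.4, 0.9]}
matching, schema = holistic_iterate(sources, init_match, learn_schema, schema_match)
```

In this toy run the positional initial guess mislabels source `s3`, and one round of schema learning and re-matching swaps its two labels. The toy matcher does not enforce that fields within one source get distinct labels.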
  16. Algorithm
     • InitMatch
         • Generates an initial matching to serve as the starting point for the iterations.
     • EnumRelations
         • We need to identify the constraints occurring in the input data.
     • Predicate function $f(i_1, \dots, i_k, X)$
         • $i_1, \dots, i_k$ indicate which elements to check against the predicate f, and X is the original data.
         • True: the input satisfies the predicate
         • False: otherwise
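A minimal sketch of predicate functions and constraint enumeration follows. The data representation (each record is the list of field indices in display order) and the exact semantics of the two predicates are assumptions: `first` is taken to mean "appears first in every record" and `pos_before` (standing in for pos≻) to mean "precedes the other field whenever both occur". The function `enum_relations` is a hypothetical rendering of EnumRelations.

```python
from itertools import permutations

def first(i, X):
    """True if field i is the first field of every record in X (assumed semantics)."""
    return all(len(rec) > 0 and rec[0] == i for rec in X)

def pos_before(i, j, X):
    """True if field i precedes field j in every record containing both (assumed)."""
    both = [rec for rec in X if i in rec and j in rec]
    return bool(both) and all(rec.index(i) < rec.index(j) for rec in both)

# Predicate name -> (arity, function). "pos>" is an ASCII stand-in for pos≻.
PREDICATES = {"first": (1, first), "pos>": (2, pos_before)}

def enum_relations(fields, X):
    """Try every predicate on every ordered tuple of fields; keep those that hold."""
    found = []
    for name, (arity, f) in PREDICATES.items():
        for args in permutations(fields, arity):
            if f(*args, X):
                found.append((name, args))
    return found

# Three records from one source; the second record lacks field 2.
X = [[1, 2, 3], [1, 3], [1, 2, 3]]
relations = enum_relations([1, 2, 3], X)
```

For this input, `relations` holds first(1) together with pos>(1, 2), pos>(1, 3), and pos>(2, 3).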
  17. Algorithm
     • LearnSchema – From matching to schema
         • Aims to construct a schema based on a given matching.
         • First, group the matched source fields together.
         • Each group is trained as a field model.
         • Each field model is a 2-state HMM. Learning an HMM a from a set of instances, and computing the probability p(z|a) for a given instance z, follow the standard HMM training and inference algorithms.
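The inference half of this step, computing p(z|a), can be illustrated with the standard forward algorithm. The sketch below is an assumption-heavy toy: instances are turned into character-class symbols by a hypothetical `symbolize` helper, the two example models `PRICE` and `TITLE` have hand-picked parameters rather than trained ones, and the `floor` smoothing for unseen symbols is an added convenience. Training (e.g., Baum-Welch) is omitted.

```python
def symbolize(text):
    """Map characters to coarse classes: d = digit, a = letter, p = other."""
    return ["d" if c.isdigit() else "a" if c.isalpha() else "p" for c in text]

def hmm_likelihood(z, start, trans, emit, floor=1e-6):
    """Forward algorithm: p(z | a) for an HMM a = (start, trans, emit)."""
    if not z:
        return 1.0
    n_states = len(start)
    alpha = [start[k] * emit[k].get(z[0], floor) for k in range(n_states)]
    for sym in z[1:]:
        alpha = [
            sum(alpha[j] * trans[j][k] for j in range(n_states))
            * emit[k].get(sym, floor)
            for k in range(n_states)
        ]
    return sum(alpha)

# Hand-picked 2-state models (hypothetical parameters, not learned from data).
PRICE = (
    [0.9, 0.1],
    [[0.2, 0.8], [0.1, 0.9]],
    [{"p": 0.8, "a": 0.1, "d": 0.1}, {"d": 0.9, "p": 0.05, "a": 0.05}],
)
TITLE = (
    [0.5, 0.5],
    [[0.5, 0.5], [0.5, 0.5]],
    [{"a": 0.9, "d": 0.05, "p": 0.05}, {"a": 0.8, "p": 0.15, "d": 0.05}],
)

z = symbolize("$12")
p_price = hmm_likelihood(z, *PRICE)
p_title = hmm_likelihood(z, *TITLE)
```

With these parameters the instance "$12" is far more likely under `PRICE` than under `TITLE`, which is the kind of signal a matcher can use when deciding which domain field a source field belongs to.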
  18. Algorithm
     • SchemaMatch – From Schema to Matching
         • Given the domain schema, matching becomes labeling the elements of the sources with the appropriate domain fields.
         • For each $h_j \in V_s$ with the corresponding $b_j \in B$, let their constraint be $f_j(y_{i_1}, \dots, y_{i_k})$. We define the scores $q_{i,j}(a)$ and $q_i(a)$ [definitions garbled in extraction; the surviving terms include $p(h_j \mid b_j)$, the condition $y_i = a$, and indices $l \in \{i_1, \dots, i_k\}$ with $l \neq i$].
         • The most likely value for each $y_i$ is thus:
             $y_i^* = \arg\max_{a \in A} q_i(a)$
  19. Algorithm
     • MetaMatch
         • Adopts the F-measure to measure consistency:
             $F_{i,j} = \frac{2 R_{i,j} P_{i,j}}{R_{i,j} + P_{i,j}}$
         • For two matchings m1 and m2, using m1 as testee and m2 as tester:
             $F(m_1, m_2) = \sum_{i \in m_2} \frac{n_i}{n} \max_{j \in m_1} \{F_{i,j}\}$
         • Let the candidates generated during this process be C and the n matchings be R = {r1, …, rn}. The final matching is obtained as:
             $m^* = \arg\max_{m \in C} \sum_{r \in R} F(m, r)$
     • InitMatch aims to guess an initial matching, to be the starting point of the iterative computation.
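The consistency score and the final selection can be sketched as follows. Several details are assumptions, since the slide gives only the formulas: a matching is encoded as a list of groups (sets of source-field names), recall and precision are taken as overlap divided by the size of group i and of group j respectively, n_i is the size of group i, and n is the total size of the tester's groups. The function names are hypothetical.

```python
def f_pair(i, j):
    """F-measure between group i (from the tester) and group j (from the testee)."""
    overlap = len(i & j)
    if overlap == 0:
        return 0.0
    r, p = overlap / len(i), overlap / len(j)  # assumed definitions of R and P
    return 2 * r * p / (r + p)

def f_matching(m1, m2):
    """F(m1, m2): m1 is the testee, m2 the tester (assumes m2 is non-empty)."""
    n = sum(len(i) for i in m2)
    return sum(len(i) / n * max(f_pair(i, j) for j in m1) for i in m2)

def meta_match(candidates, references):
    """Pick the candidate most consistent with all reference matchings."""
    return max(candidates, key=lambda m: sum(f_matching(m, r) for r in references))

good = [{"x11", "x21"}, {"x12", "x22"}]
swapped = [{"x11", "x22"}, {"x12", "x21"}]

score_same = f_matching(good, good)
score_swapped = f_matching(swapped, good)
best = meta_match([swapped, good], [good, good])
```

A matching scores 1.0 against itself, while the swapped matching shares only half of each group with `good` and scores 0.5, so `meta_match` selects `good`.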
  20. Algorithm
     • HoliMatch’s algorithm
  21. Experiments
     • Data set
         • Four domains
         • For each domain, collect 10 sources
  22. Experiments
     • Comparison Methods
         • PairMatch: adopts a corpus-based approach
         • ClusMatch:
             • ChainMatch: e.g., 1-2-3-4
             • ProgMatch: e.g., (((1-2)-3)-4)
         • InitMatch: an extension of pairwise matching
         • HoliMatch
     • Performance
         • Matching accuracy is measured using the F-measure.
         • Given the result matching m and the correct matching c, the F-measure is F(m, c), indicating how close m is to c.
  23. Experiments
     • Matching on Correct Extraction Data
         • Matchers
         • Iterations
  24. Experiments
     • Matching on Correct Extraction Data
         • Sources
  25. Experiments
     • Matching on Correct Extraction Data
         • Pairwise
  26. Experiments
     • Matching on Real Extraction Data
