EDBT 2010, Belhajjame

A talk given at the EDBT/ICDT 2010 conference. For more details, visit the project website at http://img.cs.manchester.ac.uk/dataspaces/dataspaces.html

  1. Feedback-Based Annotation, Selection and Refinement of Schema Mappings for Dataspaces
     Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler
     EDBT/ICDT 2010
  2. Data Integration
     [Diagram: a scientist asks "What are the available proteins of the Fruit Fly?"; labels: Integration Schema, Mappings, PedroDB, PepSeeker, Pride, GPMDB]
  3. Towards Pay-as-you-go Data Integration
     Data Integration
       – Setting up a data integration system requires significant upfront effort.
       – The specification of schema mappings has proved to be time and resource consuming: it requires deep knowledge of the sources to be integrated as well as of the user's requirements.
     Dataspaces: pay-as-you-go data integration [Franklin et al. 2005]
       – Reduce the up-front cost required to set up a data integration system: provide some services immediately.
       – Gradually improve the services provided by the system through interaction with end users in a pay-as-you-go fashion.
     M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.
  4. Pay-as-you-go Data Integration
     [Diagram: a scientist asks "What are the available proteins of the Fruit Fly?"; labels: Integration Schema, Bootstrap, Dataspaces, Mappings, PedroDB, PepSeeker, Pride, GPMDB]
     Objective of the present work: investigate pay-as-you-go annotation, selection, and refinement of schema mappings.
  5. Pay-as-you-go Data Integration
     – We assume that the integration schema and the source schemas are relational, and that the schema mappings defining the extent of a relation r in the integration schema are global-as-view mappings of the form m = ⟨r, qs⟩, where qs is a relational query over the source schemas.
     – A relation in the integration schema can be associated with multiple candidate mappings: we consider a setting in which multiple matching mechanisms can be used, each of which could give rise to multiple candidate mappings for populating the same relation of the integration schema.
  6. Outline
     – User Feedback
     – Annotation of Schema Mappings
     – Selection of Schema Mappings Based on User Requirements
     – Refinement of Schema Mappings
  7. User Feedback
     – Query: What are the available fruit fly proteins?
     – [Table of results, each row marked with user feedback: ✔ ✖ ✖ ✔]
  8. User Feedback (cont.)
     Let m be a candidate mapping, and UF a set of feedback instances supplied by the user:
       – tp(m, UF): the tuples that are expected by the user and are retrieved by the mapping m.
       – fp(m, UF): the tuples that are not expected by the user and are retrieved by the mapping m.
       – fn(m, UF): the tuples that are expected by the user and are not retrieved by the mapping m.
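A minimal Python sketch of the three sets defined on this slide. It assumes, purely for illustration, that the result of a mapping is a set of tuples and that feedback is a dictionary from tuple to True (expected) or False (not expected); the name `partition_feedback` and the sample accession values are hypothetical and do not come from the talk.

```python
from typing import Dict, Hashable, Set, Tuple

Row = Tuple[Hashable, ...]  # one tuple of the integration relation


def partition_feedback(
    retrieved: Set[Row], feedback: Dict[Row, bool]
) -> Tuple[Set[Row], Set[Row], Set[Row]]:
    """Return (tp, fp, fn) for one mapping, given user feedback.

    retrieved: tuples returned by the mapping's source query.
    feedback:  tuple -> True if the user expects it, False otherwise.
    """
    expected = {row for row, ok in feedback.items() if ok}
    unexpected = {row for row, ok in feedback.items() if not ok}
    tp = retrieved & expected    # expected and retrieved
    fp = retrieved & unexpected  # not expected but retrieved
    fn = expected - retrieved    # expected but not retrieved
    return tp, fp, fn


# Hypothetical example: three retrieved tuples, three feedback instances.
retrieved = {("P1",), ("P2",), ("P3",)}
feedback = {("P1",): True, ("P2",): False, ("P4",): True}
tp, fp, fn = partition_feedback(retrieved, feedback)
```

Tuples without feedback, such as `("P3",)` here, fall into none of the three sets, which is why the measures derived from them are relative to the feedback collected so far.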
  9. Outline
     – User Feedback
     – Annotation of Schema Mappings
     – Selection of Schema Mappings Based on User Requirements
     – Refinement of Schema Mappings
  10. Annotating Mappings
      Using a simple annotation scheme, a schema mapping can be annotated as:
        – Correct
        – Incorrect
      The set of schema mappings is likely to be incomplete, so we may end up annotating all mappings as incorrect. Because of this, we use a less stringent annotation scheme for schema mappings.
  11. Annotating Mappings (cont.)
      Instead, we adapt the notions of precision and recall used in information retrieval to measure the quality of a mapping:
        – Precision
        – Recall
        – F-measure
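The formulas on this slide did not survive in the transcript. The block below gives the standard information-retrieval definitions written over the sets tp, fp and fn of the feedback slide. Treat it as an assumption about what the slide showed: in particular, the balanced F-measure (harmonic mean) is used here, and the talk may have used a weighted variant.

```latex
\begin{align}
\mathit{precision}(m, UF) &= \frac{|tp(m, UF)|}{|tp(m, UF)| + |fp(m, UF)|} \\
\mathit{recall}(m, UF)    &= \frac{|tp(m, UF)|}{|tp(m, UF)| + |fn(m, UF)|} \\
F(m, UF) &= \frac{2 \cdot \mathit{precision}(m, UF) \cdot \mathit{recall}(m, UF)}
                 {\mathit{precision}(m, UF) + \mathit{recall}(m, UF)}
\end{align}
```

For example, a mapping with 3 true positives, 1 false positive and 2 false negatives has precision 3/4, recall 3/5 and balanced F-measure 2/3.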
  12. Mapping Annotation: Validation
      Questions:
        – How much user feedback is required for approximating the real precision and recall, i.e., those based on complete knowledge of the expected results?
        – Does the pay-as-you-go philosophy hold?
  13. Mapping Annotation: Validation (cont.)
      Experiment:
        – Data: two datasets, the Mondial geographical database and the Amalgam data integration benchmark.
        – Candidate schema mappings: created using IBM InfoSphere Data Architect.
        – Process: we applied the following two-step process for multiple iterations.
            1. Generate a sample of feedback instances.
            2. Compute the relative precision and recall of the candidate mappings given the cumulative feedback.
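The two-step process can be sketched as a small simulation. This is an illustration under stated assumptions, not the experimental code of the talk: the ground truth is assumed to be fully known to the simulator, feedback is sampled in fixed-size batches from a shuffled pool, and the names `precision_recall` and `simulate` are hypothetical.

```python
import random
from typing import List, Set, Tuple


def precision_recall(
    retrieved: Set[int], expected: Set[int], unexpected: Set[int]
) -> Tuple[float, float]:
    """Precision and recall relative to the feedback known so far."""
    tp = len(retrieved & expected)
    fp = len(retrieved & unexpected)
    fn = len(expected - retrieved)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def simulate(
    retrieved: Set[int],
    truly_expected: Set[int],
    universe: Set[int],
    batch_size: int,
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Return (precision error, recall error) after each feedback batch."""
    rng = random.Random(seed)
    pool = sorted(universe)
    rng.shuffle(pool)
    # "Real" values: computed with complete knowledge of the expected results.
    real_p, real_r = precision_recall(
        retrieved, truly_expected, universe - truly_expected
    )
    expected: Set[int] = set()
    unexpected: Set[int] = set()
    errors = []
    for start in range(0, len(pool), batch_size):
        # Step 1: generate a sample of feedback instances.
        for item in pool[start:start + batch_size]:
            (expected if item in truly_expected else unexpected).add(item)
        # Step 2: relative precision and recall given cumulative feedback.
        p, r = precision_recall(retrieved, expected, unexpected)
        errors.append((abs(p - real_p), abs(r - real_r)))
    return errors


# Hypothetical data: tuples are integers 0..19.
errors = simulate(
    retrieved=set(range(0, 10)),
    truly_expected=set(range(5, 15)),
    universe=set(range(20)),
    batch_size=4,
)
```

Once every tuple in the universe has received feedback, the relative values coincide with the real ones, so the last error pair is zero; the intermediate errors depend on the sampling order.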
  14. Mapping Annotation: Error in Precision
      [Chart; axis label: Error]
  15. Mapping Annotation: Error in Recall
      [Chart; axis label: Error]
  16. Outline
      – User Feedback
      – Annotation of Schema Mappings
      – Selection of Schema Mappings Based on User Requirements
      – Refinement of Schema Mappings
  17. Mapping Selection
      – Mapping selection should be tailored to meet user requirements.
      – We use a selection method that aims to maximise recall while keeping the precision of the results higher than a given precision threshold.
      – We cast this selection problem as a search problem that aims to maximise a utility function.
      D. A. Menascé and V. Dubey. Utility-based QoS brokering in service oriented architectures. In ICWS, pages 422–430. IEEE CS, 2007.
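The utility function itself is not preserved in this transcript. As a stand-in, the sketch below states the selection goal directly: among all non-empty subsets of candidate mappings, pick the one whose combined result has the highest recall while its precision stays at or above the threshold. The exhaustive search, the union-of-results model and the names `evaluate` and `select_mappings` are assumptions made for illustration, not the method presented in the talk.

```python
from itertools import combinations
from typing import Dict, Sequence, Set, Tuple


def evaluate(
    selected: Sequence[str],
    results: Dict[str, Set[int]],
    expected: Set[int],
    unexpected: Set[int],
) -> Tuple[float, float]:
    """Precision and recall of the union of the selected mappings' results."""
    retrieved: Set[int] = set().union(*(results[name] for name in selected))
    tp = len(retrieved & expected)
    fp = len(retrieved & unexpected)
    fn = len(expected - retrieved)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def select_mappings(
    results: Dict[str, Set[int]],
    expected: Set[int],
    unexpected: Set[int],
    min_precision: float,
) -> Tuple[Tuple[str, ...], float]:
    """Brute-force search: maximise recall subject to a precision threshold."""
    best: Tuple[str, ...] = ()
    best_recall = -1.0
    names = sorted(results)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            precision, recall = evaluate(subset, results, expected, unexpected)
            if precision >= min_precision and recall > best_recall:
                best, best_recall = subset, recall
    return best, best_recall


# Hypothetical candidate mappings and feedback.
results = {"m1": {1, 2, 3}, "m2": {3, 4, 9}, "m3": {5, 8, 9}}
expected = {1, 2, 3, 4, 5}
unexpected = {8, 9}
chosen, recall = select_mappings(results, expected, unexpected, min_precision=0.75)
```

Exhaustive enumeration is exponential in the number of candidate mappings, so it only serves to make the objective concrete on small inputs.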
  19. Mapping Selection: Precision
      Do we meet the precision requirement, i.e., is the precision threshold set by the user respected?
  20. Mapping Selection: Precision
  21. Mapping Selection: Recall
      Do we get some benefit for recall, i.e., does the method we use maximise recall?
  22. Mapping Selection: Recall
  23. Outline
      – User Feedback
      – Annotation of Schema Mappings
      – Selection of Schema Mappings Based on User Requirements
      – Refinement of Schema Mappings
  24. Mapping Refinement
      We distinguish two kinds of refinement:
        – Refinement that seeks to reduce the number of false positives: a candidate mapping is refined by modifying its source query so that the number of false positives it returns is reduced.
        – Refinement that aims to increase the number of true positives: a candidate mapping m is refined by modifying its source query so that the number of true positives it returns is increased.
  25. Mapping Refinement: Example
      [Diagram: the user wants fruit fly proteins; the integration schema contains a relation Protein with attributes accession, name and gene; the candidate mapping m = ⟨Protein, ProteinEntry⟩ links it to the source schema.]
  26. Mapping Refinement: The Space of Solutions
      The space of solutions is composed of the mappings that can be constructed out of the candidate mappings. Specifically, by:
        i. Joining the source query of a candidate mapping.
        ii. Augmenting the source query of a candidate mapping with a selection condition.
        iii. Relaxing the selection condition of the source query of a candidate mapping.
        iv. Combining the source queries of two or more mappings using union, difference and intersection.
      15/04/2009 Khalid
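Operations ii and iv can be illustrated on in-memory relations. The sketch below assumes a source query result is a frozen set of (accession, organism) tuples; this row layout, the helper names and the sample data are hypothetical, and joins (operation i) are left out.

```python
from typing import Callable, FrozenSet, Tuple

Row = Tuple[str, str]  # (accession, organism), a hypothetical layout
Relation = FrozenSet[Row]


def select(rel: Relation, condition: Callable[[Row], bool]) -> Relation:
    """Operation ii: augment a source query with a selection condition."""
    return frozenset(row for row in rel if condition(row))


def union(a: Relation, b: Relation) -> Relation:
    """Operation iv: can add true positives found by another mapping."""
    return a | b


def difference(a: Relation, b: Relation) -> Relation:
    """Operation iv: can remove false positives returned by another mapping."""
    return a - b


def intersection(a: Relation, b: Relation) -> Relation:
    """Operation iv: keep only tuples on which both mappings agree."""
    return a & b


# Hypothetical results of two candidate source queries.
q1: Relation = frozenset({("P1", "fruit fly"), ("P2", "human"), ("P3", "fruit fly")})
q2: Relation = frozenset({("P2", "human"), ("P4", "fruit fly")})

# Union first (more candidate true positives), then a selection condition
# (fewer false positives for a user who wants fruit fly proteins).
refined = select(union(q1, q2), lambda row: row[1] == "fruit fly")
```

Relaxing a selection condition (operation iii) corresponds to calling `select` with a weaker predicate, or dropping the call altogether.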
  27. Exploring the Space of Solutions
      – The space of mappings that can be obtained by refinement is potentially large.
      – A search algorithm that explores the whole space of possible mappings may not be able to find a solution in bounded time.
      – In the context of the present work, we used an evolutionary algorithm for exploring the space of mappings that can be obtained by refinement.
  28. Mapping Refinement Algorithm
  29. Mapping Refinement: Validation
      Question: Can mapping refinement improve the quality of the initial candidate mappings and, if so, at what cost, i.e., how much user feedback is required?
      Experiment: to answer this question, we applied the following process for multiple iterations.
        1) Generate a sample of feedback instances.
        2) Annotate the set of candidate mappings.
        3) Refine candidate mappings using the RefineMappings algorithm.
  30. Mapping Refinement: Validation (cont.)
  31. Conclusions
      Pay-as-you-go Annotation of Schema Mappings
        – We showed how schema mappings can be incrementally annotated based on feedback supplied by end users.
        – We also showed through an evaluation exercise that the more feedback the user supplies, the better the quality of the mapping annotations computed.
      Application: Selection and Refinement of Schema Mappings in Dataspaces
        – Mapping annotations computed from user feedback are used as input for the selection and refinement of schema mappings.
        – The evaluation exercises also showed that mapping refinement is more cost-effective in the first feedback iterations.