Entity Matching for Semistructured Data in the Cloud
Transcript

  • 1. Entity Matching for Semistructured Data in the Cloud. Marcus Paradies. ACM SAC 2012, CC Track. March 27, 2012.
  • 2. Outline: (1) Motivation; (2) ChuQL; (3) Entity Matching; (4) MAXIM: Entity Matching in the Cloud; (5) Summary.
  • 3. Motivation: Enriching/Improving Wikipedia. References from the Wikipedia article "Hash join".
  • 4. Motivation: Enriching/Improving Wikipedia. Lookup in the CiteSeer database.
  • 5. Motivation: Enriching/Improving Wikipedia. Lookup in Google.
  • 6. Motivation: Wikipedia in a nutshell. Characteristics: 3.7 million articles (English Wikipedia database); dataset size about 30 GB of XML (without history); 3.6 million references; references are categorized into books, journals, websites, etc.
  • 7. Motivation: Wikipedia in a nutshell. Challenges (added to the characteristics above): articles in Wikipedia are incomplete; articles in Wikipedia are inaccurate; articles in Wikipedia are subjective.
  • 8. Motivation: Problem Statement. Definition: given two datasets of records, R and S, a set of attributes $a_1, \dots, a_n$, a set of similarity functions $\mathrm{sim}_{a_1}, \dots, \mathrm{sim}_{a_n}$, and a similarity threshold $\tau$, the task between R and S is defined as finding and combining all pairs of records from R and S where $\sum_{i=1}^{n} \mathrm{sim}_{a_i}(R.a_i, S.a_i) \ge \tau$.
      Wikipedia data set:
        {{Cite book
        | last = Mumford
        | first = David
        | authorlink = David Mumford
        | title = The Red Book of Varieties and Schemes
        | publisher = [[Springer]]
        | location = Berlin
        | date = 1999
        | page = 198
        | doi = 10.1007/b62130
        | isbn = 354063293X
        }}
      CiteSeer data set:
        <record id="6627383">
          <author>David Mumford</author>
          <title>The red book of Varieties and Schemes</title>
          <publisher>Springer</publisher>
          <year>1999</year>
          <doi>10.1007/b62130</doi>
        </record>
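The definition on the problem-statement slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute names, the choice of `difflib.SequenceMatcher` as a string similarity, and the threshold value are hypothetical and do not come from the slides, and the Wikipedia `last`/`first` fields are assumed to have been normalized into a single `author` attribute.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a real measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact_sim(a, b):
    """1.0 for identical values, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

# One similarity function per attribute (sim_{a_i} in the definition).
# The attribute names are assumptions made for this sketch.
SIMS = {"title": string_sim, "author": string_sim, "year": exact_sim}

def total_similarity(r, s, sims=SIMS):
    """Sum of per-attribute similarities; a missing attribute contributes 0."""
    total = 0.0
    for attr, sim in sims.items():
        if attr in r and attr in s:
            total += sim(r[attr], s[attr])
    return total

def match(R, S, tau, sims=SIMS):
    """Naive matching: compare every record of R with every record of S."""
    result = []
    for r in R:
        for s in S:
            score = total_similarity(r, s, sims)
            if score >= tau:
                result.append((r, s, score))
    return result

wikipedia = [{"title": "The Red Book of Varieties and Schemes",
              "author": "David Mumford", "year": "1999"}]
citeseer = [{"title": "The red book of Varieties and Schemes",
             "author": "David Mumford", "year": "1999"},
            {"title": "Selecting Tense, Aspect, and Connecting Words In Language Generation",
             "author": "Bonnie Dorr"}]

matches = match(wikipedia, citeseer, tau=2.5)  # tau chosen arbitrarily
```

The nested loop makes the cost of the naive approach visible: every record of R is compared with every record of S.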
  • 10. ChuQL
  • 11. ChuQL by example: Wordcount in ChuQL
      mapreduce {
        input  { fn:collection("hdfs://wiki/") }
        (: record reader: one key/value pair per revision :)
        rr     { for $rev in $hxml:in//revision
                 return {"key": fn:data($rev//username | $rev//ip),
                         "val": $rev//title} }
        map    { $hxml:in }
        (: count the values collected for each key :)
        reduce { {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
        (: record writer: one <author> element per key :)
        rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
        output { fn:put("hdfs://outputdir/") }
      }
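For readers unfamiliar with ChuQL, the same job can be imitated in plain Python. This is a single-process sketch, not ChuQL or Hadoop: the sample XML, the function names, and the choice to read the title from the enclosing `page` element are assumptions made for illustration.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Tiny, made-up stand-in for a Wikipedia dump.
SAMPLE = """<mediawiki>
  <page><title>Hash join</title>
    <revision><contributor><username>alice</username></contributor></revision>
    <revision><contributor><ip>10.0.0.1</ip></contributor></revision>
  </page>
  <page><title>Sort-merge join</title>
    <revision><contributor><username>alice</username></contributor></revision>
  </page>
</mediawiki>"""

def read_records(xml_text):
    """Record reader: yield one (contributor, title) pair per revision."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            who = rev.findtext(".//username") or rev.findtext(".//ip")
            yield who, title

def map_reduce(records):
    """Identity map, group by key (shuffle), then count values per key (reduce)."""
    groups = defaultdict(list)
    for key, val in records:
        groups[key].append(val)
    return {key: len(vals) for key, vals in groups.items()}

counts = map_reduce(read_records(SAMPLE))
for name, count in counts.items():
    # Record writer: same output shape as the <author> element on the slide.
    print(f'<author name="{name}" count="{count}"/>')
```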
  • 12. Entity Matching
  • 13. Entity Matching: What is Entity Matching?
  • 14. Entity Matching: What is Entity Matching? Challenges: entity matching has quadratic runtime behavior; entity matching has high CPU and memory demands; the definition of "what is similar" is domain-dependent.
  • 15. Entity Matching: Entity Matching Architecture. [Diagram labels: Data Source S1, Data Source S2, Blocking, blocks b1, b2, b3, ..., bn, Matching, Match Result R.]
  • 16. Entity Matching: Entity Matching Architecture (same diagram). How can we improve the runtime of an EM task?
  • 17. Entity Matching: Entity Matching Architecture (same diagram). Distributed Blocking.
  • 18. Entity Matching: Entity Matching Architecture (same diagram). Distributed Blocking; Parallel Matching.
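Why a blocking step helps with the quadratic runtime can be shown with a small Python sketch. It uses plain key-based blocking, which is a simplification and not the search-based blocking MAXIM uses; the blocking key (publication year) and the sample records are hypothetical.

```python
from collections import defaultdict

def block_by(records, key):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def candidate_pairs(R, S, key):
    """Yield only pairs whose records fall into the same block."""
    r_blocks, s_blocks = block_by(R, key), block_by(S, key)
    for k in r_blocks.keys() & s_blocks.keys():
        for r in r_blocks[k]:
            for s in s_blocks[k]:
                yield r, s

R = [{"id": "r1", "year": 1990}, {"id": "r2", "year": 1999}, {"id": "r3", "year": 2003}]
S = [{"id": "s1", "year": 1990}, {"id": "s2", "year": 1990},
     {"id": "s3", "year": 1999}, {"id": "s4", "year": 2010}]

naive = len(R) * len(S)                                               # all pairs
blocked = sum(1 for _ in candidate_pairs(R, S, lambda rec: rec["year"]))
print(naive, blocked)
```

Because each block can be processed independently, the matching step over blocks is also the natural unit for parallel execution.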
  • 19. MAXIM: Entity Matching in the Cloud
  • 20. MAXIM: Requirements and Approach. Requirements: efficient processing of semistructured data; scalability to large datasets; independence from specific similarity functions; ability to easily add new similarity functions.
  • 21. MAXIM: Requirements and Approach. Main idea (added to the requirements above): use MapReduce and ChuQL to process semistructured data; use search-based blocking to generate candidate pairs; apply similarity functions to candidate pairs within a block.
  • 22. MAXIM: Architecture. [Diagram labels: Node 1, Node 2, ..., Node N, each with a Search Engine and Full-text Index, a Data Node, a Hadoop Task Tracker, and a ChuQL Engine; HDFS.] Hadoop cluster with up to 40 nodes. Each node runs a search engine with an attached full-text index. Each node runs an in-memory XQuery processor. Semistructured data is partitioned and placed on HDFS.
  • 23. MAXIM: Processing Stages. [Diagram labels: Search Engines, HDFS.] Three stages: Preparation Stage, Blocking Stage, Matching Stage.
  • 24. MAXIM: Processing Stages. [Diagram labels, reconstructed: Extract references, Transform into XML, Store references, Build full-text index; Extract Wikipedia references; Index CiteSeerX records.] Stage 1, Preparation Stage: extracts references from Wikipedia; reads and transforms records from CiteSeerX; sends CiteSeerX data to the local full-text index.
  • 25. MAXIM: Processing Stages. [Additional diagram labels, reconstructed: Retrieve references, Generate query, Get query response, Store blocks; Generate Semantic Block.] Stage 2, Blocking Stage: reads extracted references from HDFS; probes the full-text index to retrieve candidate publications; assigns candidate publications to block(s).
  • 26. MAXIM: Processing Stages. [Additional diagram labels, reconstructed: Verify candidate pairs, Store record pairs; Record pair generation.] Stage 3, Matching Stage: reads blocks from HDFS; generates candidate pairs and applies similarity functions; stores matching pairs and their similarity.
  • 27. MAXIM: Stage 1, Preparation Stage. Extracting References; Indexing Publications.
  • 28. MAXIM: Stage 1, Preparation Stage. Extracting References, Extraction:
        {{cite journal
        | author1 = Hansjörg Zeller
        | author2 = Jim Gray
        | title = An Adaptive Hash Join Algorithm for Multi-User Environments
        | journal = Proceedings of the 16th VLDB conference
        | year = 1990
        | pages = 186–197
        }}
  • 29. MAXIM: Stage 1, Preparation Stage. Extracting References, Transformation (of the template above):
        <reference type="journal">
          <author1>Hansjörg Zeller</author1>
          <author2>Jim Gray</author2>
          <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
          <journal>Proceedings of the 16th VLDB conference</journal>
          <year>1990</year>
          <pages>186–197</pages>
        </reference>
  • 30. MAXIM: Stage 1, Preparation Stage. Same extraction and transformation example, with HDFS added to the diagram.
  • 31. MAXIM: Stage 1, Preparation Stage. Indexing Publications, Read and Transformation:
        <doc>
          <field name="id">10.1.1.49.2550</field>
          <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
          <field name="author">Bonnie Dorr</field>
          <field name="description">Generating language ...</field>
        </doc>
  • 32. MAXIM: Stage 1, Preparation Stage. Indexing Publications, Indexing: the transformed records go into a Lucene index.
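The extraction and transformation steps shown on the preparation-stage slides can be approximated in Python. This is a rough sketch and not the ChuQL code used in MAXIM: the regular expression and the function name `cite_to_xml` are assumptions, and the parser deliberately ignores nested templates and wiki links that contain a `|`.

```python
import re
import xml.etree.ElementTree as ET

# Matches a single, non-nested {{cite <type> | ... }} template.
TEMPLATE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|(.*?)\}\}", re.IGNORECASE | re.DOTALL)

def cite_to_xml(wikitext):
    """Turn the first cite template in wikitext into a <reference> element."""
    m = TEMPLATE.search(wikitext)
    if m is None:
        return None
    ref = ET.Element("reference", {"type": m.group(1).lower()})
    for field in m.group(2).split("|"):
        name, sep, value = field.partition("=")
        if sep and name.strip():
            # Each "name = value" pair becomes a child element.
            ET.SubElement(ref, name.strip()).text = value.strip()
    return ref

WIKITEXT = """{{cite journal
| author1 = Hansjörg Zeller
| author2 = Jim Gray
| title = An Adaptive Hash Join Algorithm for Multi-User Environments
| journal = Proceedings of the 16th VLDB conference
| year = 1990
| pages = 186–197
}}"""

reference = cite_to_xml(WIKITEXT)
print(ET.tostring(reference, encoding="unicode"))
```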
  • 33. MAXIM: Stage 2, Blocking Stage. Block generation: each reference generates a set of candidate publications; each candidate publication is inserted into all blocks that are listed in the reference.
  • 34. MAXIM: Stage 2, Blocking Stage. Block generation (as above). Example. [Diagram labels: send query, send result, Search Engine, Full-Text Index; blocks "Hashing" and "Join algorithms"; publication IDs 10.0.1.1.124, 10.0.1.11.23, 10.0.7.23.14.] Citations shown:
        <citation>
          <id>26334893</id>
          <cat>Search engine optimization</cat>
          <cat>Internet search algorithms</cat>
          <cat>Link analysis</cat>
          <ref>
            <type>journal</type>
            <author>Taher Haveliwala</author>
            <year>2003</year>
            <pages>56-70</pages>
            <title>The Second Eigenvalue of the Google Matrix</title>
            <journal>Stanford University Technical Report</journal>
          </ref>
        </citation>
        <citation>
          <id>26334893</id>
          <cat>Hashing</cat>
          <cat>Join algorithms</cat>
          <ref>
            <type>journal</type>
            <author>Hansjörg Zeller</author>
            <author>Jim Gray</author>
            <year>1990</year>
            <pages>186-197</pages>
            <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
            <journal>Proceedings of the 16th VLDB conference</journal>
          </ref>
        </citation>
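The block-generation rule on these slides can be sketched in Python with a toy in-memory inverted index standing in for the full-text index. Everything named here (`ToyIndex`, `build_blocks`, the `pub-*` and `ref-*` identifiers, the token-overlap ranking) is a hypothetical simplification; the real system queries a Lucene index.

```python
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class ToyIndex:
    """Minimal inverted index over publication titles."""
    def __init__(self, publications):              # {pub_id: title}
        self.postings = defaultdict(set)
        for pub_id, title in publications.items():
            for tok in tokenize(title):
                self.postings[tok].add(pub_id)

    def search(self, query, k):
        """Return up to k publication ids ranked by number of shared tokens."""
        hits = Counter()
        for tok in set(tokenize(query)):
            for pub_id in self.postings.get(tok, ()):
                hits[pub_id] += 1
        return [pub_id for pub_id, _ in hits.most_common(k)]

def build_blocks(references, index, k):
    """Each reference's candidates go into every block (category) it lists."""
    blocks = defaultdict(lambda: {"refs": [], "cands": set()})
    for ref in references:
        cands = index.search(ref["title"], k)
        for cat in ref["cats"]:
            blocks[cat]["refs"].append(ref["id"])
            blocks[cat]["cands"].update(cands)
    return blocks

index = ToyIndex({
    "pub-1": "An Adaptive Hash Join Algorithm for Multiuser Environments",
    "pub-2": "The Second Eigenvalue of the Google Matrix",
    "pub-3": "Selecting Tense, Aspect, and Connecting Words In Language Generation",
})
references = [
    {"id": "ref-1", "cats": ["Hashing", "Join algorithms"],
     "title": "An Adaptive Hash Join Algorithm for Multi-User Environments"},
    {"id": "ref-2", "cats": ["Link analysis"],
     "title": "The Second Eigenvalue of the Google Matrix"},
]
blocks = build_blocks(references, index, k=1)
```

Note that the query for `ref-1` still finds `pub-1` although "Multi-User" and "Multiuser" differ, because the remaining title tokens overlap.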
  • 35. MAXIM: Stage 2, Blocking Stage. Distributed Search in MAXIM: (a) send HTTP request (query); (b) HTTP response (partial result); (c) collect partial results. [Diagram: Node 1 sends (a) to Nodes 2 to 5, receives (b) from each, and performs (c); every node runs a Search Engine with a Full-text Index, a Data Node, a Hadoop Task Tracker, and a ChuQL Engine.]
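The scatter/gather pattern in steps (a) to (c) can be sketched in Python. No HTTP is involved here: each "node" is a local function standing in for a remote search endpoint, and the shard contents, scoring by token overlap, and all names are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def make_node(shard):
    """Return a search function that stands in for one node's HTTP endpoint.

    shard maps publication id to title.
    """
    def search(query, k):
        q = set(query.lower().split())
        scored = [(len(q & set(title.lower().split())), pub_id)
                  for pub_id, title in shard.items()]
        # Partial result: this node's best k hits with a non-zero score.
        return heapq.nlargest(k, [hit for hit in scored if hit[0] > 0])
    return search

def distributed_search(nodes, query, k):
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # (a) send the query to every node, (b) receive partial results
        partials = pool.map(lambda node: node(query, k), nodes)
        # (c) collect the partial results and keep the global top k
        return heapq.nlargest(k, (hit for part in partials for hit in part))

nodes = [
    make_node({"a1": "Adaptive hash join", "a2": "Google matrix"}),
    make_node({"b1": "Hash tables", "b2": "Sort merge join algorithms hash"}),
]
top = distributed_search(nodes, "hash join", k=2)
```

Asking each node for its own top k is enough to recover the global top k, since no hit outside a node's local top k can outrank k hits from that same node.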
  • 36. MAXIM: Stage 3, Matching Stage. Applies user-defined similarity functions to candidate pairs. Each attribute can be evaluated by a specific similarity function.
  • 37. MAXIM: Stage 3, Matching Stage (as above). Number of candidate pairs: $CP = \sum_{i=1}^{n} C_i \cdot R_i$ (1), where $n$ is the number of blocks $B_1, \dots, B_n$; $R_i$ is the number of references in block $B_i$; $C_i$ is the number of candidate publications in block $B_i$; and $CP$ is the number of candidate pairs to verify.
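Equation (1) is straightforward to evaluate once the blocks are known. The sketch below assumes a hypothetical in-memory representation in which each block holds a list of reference ids and a set of candidate publication ids; the block names and sizes are made up.

```python
def candidate_pair_count(blocks):
    """CP = sum over all blocks of C_i * R_i."""
    return sum(len(block["cands"]) * len(block["refs"]) for block in blocks.values())

blocks = {
    "Hashing":       {"refs": ["ref-1", "ref-2"], "cands": {"pub-1", "pub-2", "pub-3"}},
    "Link analysis": {"refs": ["ref-3"],          "cands": {f"pub-{i}" for i in range(50)}},
}
cp = candidate_pair_count(blocks)  # 3 * 2 + 50 * 1
```

The count shows why block sizes matter: one block with many candidates can dominate the verification work in the matching stage.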
  • 38. Summary. Wikipedia provides many opportunities for research. The need for efficient processing of semistructured data is increasing. Entity matching is critical for data integration and data cleaning. Entity matching is difficult to parallelize because of unbalanced data partitions. MAXIM parallelizes EM by building blocks of similar records in a classification fashion. MAXIM lets users define their own similarity functions and computation functions without changing the algorithm.
  • 39. “Everything that can be invented has been invented.” (Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)
  • 40. Experiments: Scaleup and Speedup. [Plots over the number of nodes (5, 10, 20, 40): (a) speedup for all stages, speedup = base time / new time; (b) scaleup for the preparation stage, scaleup = base time / new time. Legend entries: Ideal, INDEXING-2000, EXTRACTING-2000, BLOCKING, MATCHING.]
  • 41. Experiments: Query Performance. [Plot: average query response time (ms) over the number of nodes (5, 10, 20, 40); series RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, RESULTCOUNT-200.] Figure: Query performance for different result set sizes and cluster sizes.
  • 42. Experiments: Blocking Accuracy. [Plot: accuracy over variance (0 to 1.0); series Ideal, WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, MISSING.] Figure: Blocking accuracy for different typographical error classes.
  • 43. Experiments: Number of Candidate Pairs. [Plot: number of candidate pairs over variance (0.0 to 1.0); series RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, RSCOUNT-200.] Figure: Number of candidate pair verifications in the matching stage.