Session 5.2 multi-core meta-blocking for big linked data

Talk at SEMANTiCS 2017
www.semantics.cc

Published in: Technology

  1. Multi-core Meta-blocking for Big Linked Data. George Papadakis, Konstantina Bereta, Manolis Koubarakis, Themis Palpanas. Despina-Athanasia Pantazi, Amsterdam, Semantics 2017.
  2. Entities: an invaluable asset. “Entities” are what a large part of our knowledge is about: persons, organizations, projects, locations, products, events.
  6. However … How many names, descriptions or IDs (URIs) are used for the same real-world “entity”?
     • Names: London 런던 ‫ܢ‬‫ܠܘܢܕܘ‬ लंडन लंदन લંડન ለንደን ロンドン লন্ডন ลอนดอน இலண்டன் ლონდონი Llundain Londain Londe Londen Londen Londen Londinium London Londona Londonas Londoni Londono Londra Londres Londrez Londyn Lontoo Loundres Luân Đôn Lunden Lundúnir Lunnainn Lunnon ‫لوندون‬ ‫لندن‬ ‫لندن‬ ‫لندن‬ ‫לונדון‬ ‫לאנדאן‬ Λονδίνο Лёндан Лондан Лондон Лондон Лондон Լոնդոն 伦敦 …
     • Descriptions: capital of UK, host city of the IV Olympic Games, host city of the XIV Olympic Games, future host of the XXX Olympic Games, city of the Westminster Abbey, city of the London Eye, the city described by Charles Dickens in his novels, …
     • URIs: http://sws.geonames.org/2643743/ http://en.wikipedia.org/wiki/London http://dbpedia.org/resource/Category:London …
  8. How many “entities” have the same name?
     ◦ London, KY ◦ London, Laurel, KY ◦ London, OH ◦ London, Madison, OH ◦ London, AR ◦ London, Pope, AR ◦ London, TX ◦ London, Kimble, TX ◦ London, MO ◦ London, MO ◦ London, London, MI ◦ London, London, Monroe, MI ◦ London, Uninc Conecuh County, AL ◦ London, Uninc Conecuh County, Conecuh, AL ◦ London, Uninc Shelby County, IN ◦ London, Uninc Shelby County, Shelby, IN ◦ London, Deerfield, WI ◦ London, Deerfield, Dane, WI ◦ London, Uninc Freeborn County, MN ◦ ...
     … or …
     ◦ London, Jack 2612 Almes Dr Montgomery, AL (334) 272-7005 ◦ London, Jack R 2511 Winchester Rd Montgomery, AL 36106-3327 (334) 272-7005 ◦ London, Jack 1222 Whitetail Trl Van Buren, AR 72956-7368 (479) 474-4136 ◦ London, Jack 7400 Vista Del Mar Ave La Jolla, CA 92037-4954 (858) 456-1850 ◦ ...
  9. Preliminaries
     • Current situation: the LOD cloud is inadequately linked. In 2014, less than 10% of its KBs were strongly interlinked with at least one other KB.
     • Solution: Entity Resolution (ER) identifies the different entity descriptions that correspond to the same real-world object and interlinks them with owl:sameAs statements.
     • Challenge: quadratic computational cost, O(n²).
  10. Improving ER efficiency
     • Blocking groups similar entities into blocks and executes comparisons only inside blocks.
     • Challenge: Variety. There are ∼2,600 diverse vocabularies, but only 109 of them are shared by more than one KB, according to http://stats.lod2.eu .
     • Solution: schema-free blocking.
  11. Example: entity collections S and T
     • s1: ns1:id1 ns1:FullName “Jim Lloyd Mayer”; ns1:job “autoseller”; ns1:location “Fifth Ave” .
     • s2: ns1:id2 ns1:FullName “George Brown”; ns1:job “vehicle vendor”; ns1:location “Fifth Ave” .
     • s3: ns1:id3 ns1:FullName “Stephan Jordan”; ns1:job “car seller”; ns1:location “Atlantic Ave” .
     • t1: ns2:id4 ns2:hasName “Jim Mayer”; ns2:occupation “car-vendor – seller”; ns2:hasAddres “Fifth Ave” .
     • t2: ns2:id5 ns2:hasName “George Lloyd Brown”; ns2:occupation “car trader”; ns2:hasAddres “Fifth Ave” .
     • t3: ns2:id6 ns2:hasName “Nick Papas”; ns2:occupation “car dealer”; ns2:hasAddres “Fifth Ave” .
  12. Token Blocking
     • b1 (Jim): s1, t1
     • b2 (Mayer): s1, t1
     • b3 (George): s2, t2
     • b4 (vendor): s2, t1
     • b5 (seller): s3, t1
     • b6 (Lloyd): s1, t2
     • b7 (Brown): s2, t2
     • b8 (Ave): s1, s2, s3, t1, t2, t3
     • b9 (car): s3, t1, t2, t3
     • b10 (Fifth): s1, s2, t1, t2, t3
     • Performance: Recall 100%; Precision 2 / 25 = 8%
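Token Blocking on the example profiles can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's implementation: the tokenizer (lower-cased alphanumeric runs), the function names, and the ground-truth pairs s1≡t1 and s2≡t2 are assumptions made for the sketch.

```python
import re
from collections import defaultdict

# Attribute values of the six example profiles. Predicate names are dropped,
# which is what makes the blocking schema-free.
S = {
    "s1": ["Jim Lloyd Mayer", "autoseller", "Fifth Ave"],
    "s2": ["George Brown", "vehicle vendor", "Fifth Ave"],
    "s3": ["Stephan Jordan", "car seller", "Atlantic Ave"],
}
T = {
    "t1": ["Jim Mayer", "car-vendor – seller", "Fifth Ave"],
    "t2": ["George Lloyd Brown", "car trader", "Fifth Ave"],
    "t3": ["Nick Papas", "car dealer", "Fifth Ave"],
}
MATCHES = {("s1", "t1"), ("s2", "t2")}  # assumed ground truth


def tokens(values):
    """Lower-cased alphanumeric tokens of all attribute values."""
    return {tok.lower() for value in values for tok in re.findall(r"\w+", value)}


def token_blocking(left, right):
    """One block per token that occurs in at least one profile of each collection."""
    index = defaultdict(lambda: (set(), set()))
    for side, collection in enumerate((left, right)):
        for entity_id, values in collection.items():
            for tok in tokens(values):
                index[tok][side].add(entity_id)
    return {tok: sides for tok, sides in index.items() if sides[0] and sides[1]}


blocks = token_blocking(S, T)
# Every block contributes |left| x |right| comparisons, repeats included.
comparisons = sum(len(l) * len(r) for l, r in blocks.values())
candidate_pairs = {(s, t) for l, r in blocks.values() for s in l for t in r}
detected = MATCHES & candidate_pairs
recall = len(detected) / len(MATCHES)
precision = len(detected) / comparisons
print(len(blocks), comparisons, recall, precision)
```

On these six profiles the sketch produces 10 blocks and 25 comparisons, so both assumed matches are found (recall 1.0) at a precision of 2/25 = 0.08.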
  13. Meta-blocking. Every comparison between entities si and tj belongs to one of the following types:
     1. Matching, if si ≡ tj.
     2. Redundant, if si and tj co-occur and are compared in another block.
     3. Superfluous, if si ≢ tj and the comparison is not redundant.
     In this context, Meta-blocking restructures a given set of blocks into a new one that contains a substantially lower number of redundant and superfluous comparisons (much higher Precision), while maintaining the original number of matching ones (same Recall).
     Core idea: the more blocks two entities share, the more likely they are to be matching.
  14. Example of Meta-blocking
     • Steps: Graph Building → Graph Pruning → Restructured Blocks
     • [Figure: blocking graph over s1, s2, s3 and t1, t2, t3 with edge weights 2/6, 3/8, 4/8, 4/7, 2/5, 2/8, 3/8, 2/6 and 3/9.]
     • Restructured blocks: b''1 = {s1, t1}, b''2 = {s1, t2}, b''3 = {s2, t2}, b''4 = {s3, t1}, b''5 = {s3, t3}
     • Performance: Recall 100%; Precision 2 / 5 = 40%
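The graph-building step can be sketched as follows. The slides do not name the weighting scheme, so the sketch assumes a Jaccard-style weight (number of shared blocks divided by the number of blocks that contain either entity), which fits the fractional form of the weights in the figure. The block contents are transcribed by hand from the token-blocking slide, so individual values need not coincide with those in the figure; `BLOCKS` and `build_blocking_graph` are hypothetical names.

```python
from fractions import Fraction
from itertools import product

# Blocks transcribed from the token-blocking slide: token -> (S side, T side).
BLOCKS = {
    "Jim": (["s1"], ["t1"]),
    "Mayer": (["s1"], ["t1"]),
    "George": (["s2"], ["t2"]),
    "vendor": (["s2"], ["t1"]),
    "seller": (["s3"], ["t1"]),
    "Lloyd": (["s1"], ["t2"]),
    "Brown": (["s2"], ["t2"]),
    "Ave": (["s1", "s2", "s3"], ["t1", "t2", "t3"]),
    "car": (["s3"], ["t1", "t2", "t3"]),
    "Fifth": (["s1", "s2"], ["t1", "t2", "t3"]),
}


def build_blocking_graph(blocks):
    """Return {(s, t): weight} with one edge per pair of co-occurring entities."""
    blocks_of = {}  # entity -> set of block keys that contain it
    shared = {}     # (s, t) -> number of blocks the two entities share
    for key, (left, right) in blocks.items():
        for entity in left + right:
            blocks_of.setdefault(entity, set()).add(key)
        for pair in product(left, right):
            shared[pair] = shared.get(pair, 0) + 1
    # Assumed Jaccard weight: shared blocks / blocks containing either entity.
    return {
        (s, t): Fraction(n, len(blocks_of[s] | blocks_of[t]))
        for (s, t), n in shared.items()
    }


graph = build_blocking_graph(BLOCKS)
for edge, weight in sorted(graph.items()):
    print(edge, weight)
```

Because each pair of entities becomes a single edge, the 25 comparisons of the input blocks collapse to 9 edges: redundant comparisons disappear in the graph, and pruning then targets the superfluous ones.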
  15. Algorithms for Graph Pruning. There are four main algorithms for this purpose:
     1. Weighted Edge Pruning (WEP): retains all edges with weight > the overall average weight.
     2. Cardinality Edge Pruning (CEP): retains the top-K weighted edges, where K = BC∙|E|/2.
     3. Weighted Node Pruning (WNP): for every node, retains all edges with weight > the average weight of its adjacent edges.
     4. Cardinality Node Pruning (CNP): for every node, retains the top-k weighted edges, where k = BC-1.
     * Blocking Cardinality (BC): average number of blocks per entity.
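The four pruning rules can be sketched over a weighted edge dictionary. This is an illustrative sketch, not the toolkit's code: the weights in `EDGES` are made-up example values, and `K` and `k` are passed in explicitly instead of being derived from BC, because on a six-entity toy graph the BC-based thresholds would retain every edge. For the node-centric rules, an edge is kept if it qualifies for either of its two endpoints.

```python
import heapq
from collections import defaultdict
from statistics import mean

# Illustrative edge weights for a small bipartite blocking graph (assumed values).
EDGES = {
    ("s1", "t1"): 4 / 8, ("s1", "t2"): 3 / 8, ("s1", "t3"): 2 / 6,
    ("s2", "t1"): 3 / 9, ("s2", "t2"): 4 / 7, ("s2", "t3"): 2 / 6,
    ("s3", "t1"): 3 / 7, ("s3", "t2"): 2 / 7, ("s3", "t3"): 2 / 4,
}


def wep(edges):
    """Weighted Edge Pruning: keep edges above the overall average weight."""
    threshold = mean(edges.values())
    return {e for e, w in edges.items() if w > threshold}


def cep(edges, K):
    """Cardinality Edge Pruning: keep the K top-weighted edges overall."""
    return set(heapq.nlargest(K, edges, key=edges.get))


def _incident(edges):
    """Map every node to the list of (edge, weight) pairs that touch it."""
    adjacency = defaultdict(list)
    for (u, v), w in edges.items():
        adjacency[u].append(((u, v), w))
        adjacency[v].append(((u, v), w))
    return adjacency


def wnp(edges):
    """Weighted Node Pruning: keep edges above the local average of any endpoint."""
    retained = set()
    for incident in _incident(edges).values():
        threshold = mean(w for _, w in incident)
        retained.update(e for e, w in incident if w > threshold)
    return retained


def cnp(edges, k):
    """Cardinality Node Pruning: keep the k top-weighted edges of every node."""
    retained = set()
    for incident in _incident(edges).values():
        top = heapq.nlargest(k, incident, key=lambda item: item[1])
        retained.update(e for e, _ in top)
    return retained


print(sorted(wep(EDGES)))
print(sorted(cep(EDGES, K=3)))
print(sorted(wnp(EDGES)))
print(sorted(cnp(EDGES, k=1)))
```

The weight-based rules adapt the number of retained edges to the weight distribution, while the cardinality-based rules fix it in advance, which makes the output size predictable.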
  16. Meta-blocking Efficiency
     • Meta-blocking does not scale to the Volume of Linked Data. Its computational cost is proportional to:
       ◦ the number of comparisons in the input blocks, which grows superlinearly with |S|+|T|
       ◦ the average number of blocks per entity, which grows linearly with the size of entity descriptions
     • Parallel Meta-blocking is not applicable to a desktop application: being based on MapReduce, it requires a cluster to run.
     • Solution: Multi-core Meta-blocking!
  17. Multi-core Meta-blocking
     • Two types of methods: block-based and entity-based.
     • Fork-join approach:
       ◦ The computational cost is split into a set of chunks that are placed in an array, with an index indicating the next chunk to be processed.
       ◦ Every thread retrieves the current value of the index and is assigned to process the corresponding chunk.
     • chunk = an individual item or a non-overlapping set of items
     • item = an individual block or an individual entity
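The fork-join scheme with a shared chunk index can be sketched as below. This is a minimal sketch under stated assumptions: `process_chunks` and its parameters are hypothetical names, and a lock-protected counter stands in for the single atomic index retrieval. In CPython the threads do not execute CPU-bound work in parallel, so the sketch illustrates only the scheduling scheme, not the speedup.

```python
import threading


def process_chunks(chunks, worker, n_threads=4):
    """Run worker(chunk) for every chunk, handing out chunks via a shared index."""
    results = [None] * len(chunks)
    next_index = 0                 # the single piece of shared mutable state
    index_lock = threading.Lock()  # makes "read index, then increment" atomic

    def run():
        nonlocal next_index
        while True:
            with index_lock:
                i = next_index
                next_index += 1
            if i >= len(chunks):
                return             # no chunks left for this thread
            # Chunks are only read and every thread writes to its own slot,
            # so no further synchronization is needed here.
            results[i] = worker(chunks[i])

    threads = [threading.Thread(target=run) for _ in range(n_threads)]
    for thread in threads:         # fork
        thread.start()
    for thread in threads:         # join
        thread.join()
    return results


# Usage: each chunk is a list of per-item costs, and the worker sums them.
print(process_chunks([[1, 2], [3], [4, 5, 6]], sum, n_threads=3))
```

Because a thread asks for a new chunk only when it has finished its previous one, faster threads naturally take on more chunks, which is why the order and size of the chunks matter for load balancing.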
  18. Parallelization Strategies. Depending on the definition of chunks, we defined the following parallelization strategies:
     1. Random Parallelization → individual items in arbitrary order
     2. Naïve Parallelization → individual items sorted by #comparisons
     3. Partition Parallelization → an arbitrary number of non-overlapping groups of items with approximately the same total #comparisons
     4. Segment Parallelization → #cores non-overlapping groups of items with approximately the same total #comparisons
  19. Partition Parallelization
     • Input: I = { 3, 5, 10, 6, 1, 9, 2, 4, 7, 8 }, sorted into I’ = { 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 }
     • [Figure: the items are assigned one per iteration (It1 to It10), yielding Partition1 to Partition6 with total costs 10, 9, 9, 9, 9 and 9.]
     • Time complexity: O(n log n)
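One reading of the iterations in the figure is a greedy placement: items are visited in decreasing cost, each one joins the currently cheapest partition if the total stays within the cost of the largest item, and otherwise opens a new partition. The sketch below implements that reading with a min-heap, which gives the stated O(n log n) bound. The capacity rule and the function name are assumptions inferred from the example, not a confirmed description of the authors' algorithm.

```python
import heapq


def partition_items(costs):
    """Group item costs into partitions whose totals never exceed the largest cost."""
    if not costs:
        return []
    ordered = sorted(costs, reverse=True)
    capacity = ordered[0]  # assumed bound: the most expensive single item
    partitions = []
    heap = []              # entries are (total_cost, partition_index)
    for cost in ordered:
        if heap and heap[0][0] + cost <= capacity:
            # The cheapest partition still has room: add the item to it.
            total, index = heapq.heappop(heap)
            partitions[index].append(cost)
            heapq.heappush(heap, (total + cost, index))
        else:
            # Not even the cheapest partition can take the item: open a new one.
            partitions.append([cost])
            heapq.heappush(heap, (cost, len(partitions) - 1))
    return partitions


parts = partition_items([3, 5, 10, 6, 1, 9, 2, 4, 7, 8])
print(parts)
print([sum(p) for p in parts])
```

On the example input this produces six partitions with total costs 10, 9, 9, 9, 9 and 9.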
  20. Segment Parallelization
     • Input: I’ = { 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 }
     • [Figure: the items are assigned one per iteration (It1 to It10) to Segment1 to Segment4, ending with total costs 13, 13, 14 and 15.]
     • Time complexity: O(n log n)
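With a fixed number of segments, the same greedy idea reduces to always adding the next most expensive item to the currently cheapest segment. The sketch below assumes this rule and an arbitrary tie-break (lowest segment index first), so the segments that end up with the extra cost may differ from the figure; `segment_items` is a hypothetical name.

```python
import heapq


def segment_items(costs, n_segments):
    """Distribute item costs over n_segments, always filling the cheapest segment."""
    segments = [[] for _ in range(n_segments)]
    # (total_cost, segment_index); a list sorted by index is already a valid heap.
    heap = [(0, index) for index in range(n_segments)]
    for cost in sorted(costs, reverse=True):
        total, index = heapq.heappop(heap)
        segments[index].append(cost)
        heapq.heappush(heap, (total + cost, index))
    return segments


segments = segment_items([10, 9, 8, 7, 6, 5, 4, 3, 2, 1], n_segments=4)
print(segments)
print(sorted(sum(s) for s in segments))
```

For the example input and four segments, the totals are 13, 13, 14 and 15 in some order, matching the figure up to the order of the segments.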
  21. Execution Plan (phases: Initialization, Stage 1, Merge, Stage 2; threads th1, th2, th3, …, thN in each stage)
     • MWEP: initialize chunk array and N threads → each thread computes local aggregate edge weight and #comparisons → estimate average edge weight → check and keep valid comparisons above the weight threshold → output total valid comparisons.
     • MWNP: initialize chunk array and N threads → each thread stores the total weight and #comparisons per entity in two arrays → merge the 2 arrays to compute the average edge weight of each node → check and keep valid comparisons above the weight threshold of any adjacent node → output total valid comparisons.
     • MCEP: initialize chunk array and N threads, estimate K → each thread stores the K top-weighted edges in the processed chunks in a priority queue → output the overall K top-weighted comparisons.
     • MCNP: initialize chunk array and N threads, estimate k → every thread stores the k top-weighted edges for every processed entity in a priority queue → output the comparisons that are among the k top-weighted ones for any of the adjacent entities.
  22. Experimental Evaluation - Datasets
     • Original datasets (DBPedia 3.0rc / DBPedia 3.4):
       ◦ Entities: 1,190,733 / 2,164,040
       ◦ Duplicates: 892,579
       ◦ Triples: 1.69∙10⁷ / 3.50∙10⁷
       ◦ Predicates: 30,757 / 52,554
       ◦ |S|x|T|: 2.58∙10¹²
     • Blocks (Input / Output (CNP)):
       ◦ Blocks: 1,239,066 / 1,190,733
       ◦ Comparisons: 1.30∙10¹⁰ / 3.30∙10⁷
       ◦ Detected Matches: 890,817 / 859,554
       ◦ Recall: 0.998 / 0.963
       ◦ Precision: 6.86∙10⁻⁵ / 2.61∙10⁻²
     • System: server running Ubuntu 12.04, with 32GB RAM and 2 Intel Xeon E5620 processors, each having 4 physical cores and 8 logical cores at 2.40GHz.
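The reported measures follow from the counts on the slide: recall is the number of detected matches divided by the number of duplicates, and precision is the number of detected matches divided by the number of comparisons. The short check below recomputes them from the slide's figures. Since the comparison counts are given to three significant digits only, the recomputed precision agrees with the reported one only approximately; the variable names are chosen for this sketch.

```python
# Counts taken from the dataset slide.
entities_s, entities_t = 1_190_733, 2_164_040
duplicates = 892_579
stages = {
    "input blocks": {"comparisons": 1.30e10, "detected": 890_817},
    "output (CNP)": {"comparisons": 3.30e7, "detected": 859_554},
}

# Brute-force cost: every entity of S compared with every entity of T.
brute_force = entities_s * entities_t
print(f"|S|x|T| = {brute_force:.2e}")

measures = {}
for name, stage in stages.items():
    recall = stage["detected"] / duplicates
    precision = stage["detected"] / stage["comparisons"]
    measures[name] = (recall, precision)
    print(f"{name}: recall = {recall:.3f}, precision = {precision:.2e}")

# How many times fewer comparisons remain after Meta-blocking.
reduction = stages["input blocks"]["comparisons"] / stages["output (CNP)"]["comparisons"]
print(f"comparisons reduced by a factor of about {reduction:.0f}")
```

The check illustrates the trade-off the slide reports: a drop in recall from 0.998 to 0.963 buys a reduction of the comparisons by more than two orders of magnitude.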
  23. Experimental Evaluation – CNP Wall Clock Time
     • Single-threaded time = 1.3∙10⁷ msec = 3.5 hours
     • [Figure: wall clock time of the eight strategies, annotated with “15min”.]
     • Legend: RB, NB, PB, SB = Random, Naïve, Partition, Segment block-based parallelization; RE, NE, PE, SE = Random, Naïve, Partition, Segment entity-based parallelization.
  24. Experimental Evaluation – CNP Speedup
     • [Figure: speedup of the eight strategies.]
     • Legend: RB, NB, PB, SB = Random, Naïve, Partition, Segment block-based parallelization; RE, NE, PE, SE = Random, Naïve, Partition, Segment entity-based parallelization.
  25. Conclusions. We presented 8 parallelization strategies for Multi-core Meta-blocking:
     • w.r.t. type of items: block- and entity-based parallelization
     • w.r.t. type of chunks: random, naïve, partition and segment parallelization
     Performance:
     • WCT: entity-based methods require less than 30 minutes for processing 3.4 million entities with 8 cores
     • Speedup: block-based methods exhibit very high scalability, with almost linear speedup
     Advantages:
     • Single state variable, read-only chunks → simplifies thread safety
     • Single atomic synchronized operation (index retrieval) → high concurrency
  26. JedAI Toolkit. JedAI implements the demonstrated workflow. It can be used in three ways:
     1. As an open source library that implements numerous state-of-the-art methods for all steps of an established end-to-end ER workflow.
     2. As a desktop application for ER with an intuitive Graphical User Interface that is suitable for both expert and lay users.
     3. As a workbench for comparing all performance aspects of various (configurations of) end-to-end ER workflows.
  27. Thank you! Questions? Check our JedAI Toolkit for Entity Resolution.
     • Website: http://jedai.scify.org .
     • GitHub repositories:
       ◦ JedAI Library: https://github.com/scify/JedAIToolkit .
       ◦ JedAI Desktop Application and Workbench: https://github.com/scify/jedai-ui .
     • All code is implemented using Java 8.
     • All code is publicly available under Apache License V2.0.
     • Documentation (slides, videos, etc) available at https://github.com/scify/JedAIToolkit/tree/master/documentation .
  28. JedAI workflow. JedAI implements the following schema-free, end-to-end workflow:
     • Step 1, Data Reading: reads files containing the entity profiles and the golden standard.
     • Step 2, Block Building: creates overlapping blocks.
     • Step 3, Block Cleaning: optional step that cleans blocks from useless comparisons (repeated, superfluous).
     • Step 4, Comparison Cleaning: optional step that operates on the level of individual comparisons to remove the useless ones.
     • Step 5, Entity Matching: executes all retained comparisons.
     • Step 6, Entity Clustering: partitions the similarity graph into equivalence clusters.
     • Step 7, Evaluation & Storing: stores and presents performance results w.r.t. numerous measures.
  29. JedAI methods. JedAI supports several established methods for each workflow step:
     • Step 1, Data Reading: possible to read CSV, RDF, XML files & relational DBs in any combination!
     • Step 2, Block Building: choose 1 out of 8 methods.
     • Step 3, Block Cleaning: specify any combination of 3 complementary methods.
     • Step 4, Comparison Cleaning: choose 1 out of 7 methods, including all Meta-blocking methods.
     • Step 5, Entity Matching: combine 1 out of 2 methods with 12 textual representation models and 10 similarity measures.
     • Step 6, Entity Clustering: choose 1 out of 6 methods.
     • Step 7, Evaluation & Storing: store results in any of the supported formats (soon available).
