LODOP - Multi-Query Optimization for Linked Data Profiling Queries
Talk at PROFILES2014, ESWC2014

LODOP - Multi-Query Optimization for Linked Data Profiling Queries Presentation Transcript

  • 1. LODOP: Multi-Query Optimization for Linked Data Profiling Queries
      Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann
      Hasso Plattner Institute, Potsdam, Germany
      1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014
      2014/05/26
  • 2. OUTLINE
      1. Challenges of Linked Data Profiling
      2. Profiling Tasks
      3. LODOP
      4. Multi-Query Optimizations
      (Slide footer, repeated on every following slide: LODOP - Multi-Query Optimization for Linked Data Profiling Queries. A. Jentzsch. PROFILES2014, ESWC2014.)
  • 3. LINKED DATA PROFILING
      • Metadata often not available
          • e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, …
          • Data registries, VoID, and Semantic Sitemaps provide only basic information, e.g. description, author & license information, estimated triple and link count
      • Use cases requiring metadata
          • Query optimization
          • Data cleansing
          • Data integration
          • Schema induction
      • Data profiling: methods for computing metrics / metadata for datasets
  • 4. TRADITIONAL VS LINKED DATA PROFILING
      • State-of-the-art data profiling
          • Based on columns
          • Assumes well-defined semantics
          • Expects regular data
      • Heterogeneity on the Web of Data
          • Diverse sources
          • Diverse structures
          • Diverse views
  • 5. CHALLENGES OF LD PROFILING
      • Heterogeneity
          • Nested graphs: make reasoning difficult
          • Loose structure: things have different predicate sets
          • Incomplete: missing property definitions
          • Poorly formatted: property types used inconsistently
          • Inconsistent: multiple representations claim opposite things
      • Existing (relational) data profiling tools don’t work
      • Volume of data
          • Requires parallelization
  • 6. LODOP - CONTRIBUTIONS
      • Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)
      • System for executing, benchmarking and optimizing data profiling scripts with Apache Pig on Hadoop
      • Development and evaluation of 3 multi-script optimization rules
      • Apache Pig:
          • Platform for analyzing large datasets
          • High-level language: Pig Latin
          • Scripts executed on Hadoop / MapReduce
  • 7. PROFILING TASKS
      • Groupings
          • e.g. by resource, class, property type, language, vocabulary, …
      • Tasks
          • Number of triples
          • Average number of triples per resource
          • Average number of triples per object URI
          • Average number of triples per context URL
          • Number of property types
          • Average number of property values
          • Number of resources
          • Number of inlinks / outlinks
          • Number of context URLs
          • Number of context PLDs
          • Property co-occurrence
          • Inverse properties
          • URI-Literal ratio
          • Property value ranges
          • Average value length
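A minimal sketch of how a few of the tasks above could be computed, written in plain Python rather than Pig Latin so it runs without a cluster. The toy triples, the `profile` function, and the choice to count distinct subjects as "resources" are illustrative assumptions and are not taken from the LODOP scripts.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object, object_is_uri) tuples.
TRIPLES = [
    ("ex:Berlin", "rdf:type", "ex:City", True),
    ("ex:Berlin", "rdfs:label", "Berlin", False),
    ("ex:Berlin", "ex:country", "ex:Germany", True),
    ("ex:Germany", "rdf:type", "ex:Country", True),
    ("ex:Germany", "rdfs:label", "Germany", False),
    ("ex:Germany", "rdfs:label", "Deutschland", False),
]


def profile(triples):
    """Compute a handful of simple profiling metrics in one pass."""
    triples_per_resource = defaultdict(int)  # assumption: resource = distinct subject
    property_types = set()
    uri_objects = 0
    literal_objects = 0
    for subject, predicate, _obj, object_is_uri in triples:
        triples_per_resource[subject] += 1
        property_types.add(predicate)
        if object_is_uri:
            uri_objects += 1
        else:
            literal_objects += 1
    resources = len(triples_per_resource)
    return {
        "triples": len(triples),
        "resources": resources,
        "avg_triples_per_resource": len(triples) / resources,
        "property_types": len(property_types),
        # Ratio of URI objects to literal objects; infinite if there are no literals.
        "uri_literal_ratio": (
            uri_objects / literal_objects if literal_objects else float("inf")
        ),
    }


if __name__ == "__main__":
    for metric, value in profile(TRIPLES).items():
        print(metric, value)
```

Grouped variants (per class, per vocabulary, …) would key the same counters by the grouping attribute instead of keeping one global counter.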
  • 8. DATASETS STATISTICS

      Statistics for 1M triples              DBpedia*   Freebase*   WDC RDFa**   EUNIS Species***
      Number of resources                     169,035     226,834      168,736             65,843
      Avg. number of triples per resource         5.9         4.4          5.9               15.2
      Number of classes                        19,585       1,928           61                  1
      Number of property types                  7,844       2,748          477                 16
      Number of URIs                          519,692     642,183      174,317            407,418
      Number of inlinks                       207,712     192,179       35,329             78,377
      Number of literals                      480,279     357,817      825,564            592,582
      Avg. number of property values            127.5       363.9      2,096.2           62,500.0

      * source: BTC 2012 dataset
      ** WDC = Web Data Commons
      *** EUNIS = European Environment Agency
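The two average rows appear to be derivable from the count rows. The sketch below checks this under an assumption that is not stated on the slide: every triple has exactly one object, so the URI and literal counts sum to the number of triples in each sample. The counts are copied from the slide; the names `SAMPLES`, `REPORTED`, and `derived_averages` are illustrative.

```python
# Counts copied from the statistics slide.
SAMPLES = {
    "DBpedia": dict(resources=169_035, property_types=7_844,
                    uris=519_692, literals=480_279),
    "Freebase": dict(resources=226_834, property_types=2_748,
                     uris=642_183, literals=357_817),
    "WDC RDFa": dict(resources=168_736, property_types=477,
                     uris=174_317, literals=825_564),
    "EUNIS Species": dict(resources=65_843, property_types=16,
                          uris=407_418, literals=592_582),
}

# (avg. triples per resource, avg. property values) as reported on the slide.
REPORTED = {
    "DBpedia": (5.9, 127.5),
    "Freebase": (4.4, 363.9),
    "WDC RDFa": (5.9, 2096.2),
    "EUNIS Species": (15.2, 62500.0),
}


def derived_averages(sample):
    """Derive both averages from the counts (assumption: triples = URIs + literals)."""
    triples = sample["uris"] + sample["literals"]
    return (
        round(triples / sample["resources"], 1),
        round(triples / sample["property_types"], 1),
    )


if __name__ == "__main__":
    for name, sample in SAMPLES.items():
        print(name, derived_averages(sample), "reported:", REPORTED[name])
```

Under that assumption the derived values match the reported ones after rounding to one decimal place, which suggests (but does not prove) that "average number of property values" means triples per property type.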
  • 9. PERFORMANCE EVALUATION
  • 10. PERFORMANCE EVALUATION
      • 10-15 s scheduling overhead per MapReduce job (~3.4 jobs per script)
      • Earlier MapReduce jobs have longer runtimes
          • Earlier jobs handle more data, which means more HDFS activity
      • Most scripts scale linearly
          • Most scripts reduce the amount of data in the workflow
          • Exceptions: e.g. property co-occurrence scripts
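As a back-of-the-envelope illustration of the first bullet, the scheduling overhead can be estimated from the job count. The helper name `scheduling_overhead` is hypothetical; the only inputs taken from the slide are the 10-15 s per job and the average of about 3.4 jobs per script.

```python
def scheduling_overhead(jobs, per_job_seconds=(10, 15)):
    """Return the (low, high) scheduling overhead in seconds for `jobs` MapReduce jobs."""
    low, high = per_job_seconds
    return jobs * low, jobs * high


if __name__ == "__main__":
    # An average script runs about 3.4 MapReduce jobs.
    low, high = scheduling_overhead(3.4)
    print("overhead per script: %.0f-%.0f s" % (low, high))
```

For an average script this comes to roughly 34 to 51 seconds spent on scheduling alone, which is why reducing the number of MapReduce jobs is an optimization target in its own right.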
  • 11. OPTIMIZATION GOALS
      • Optimize concurrent execution of multiple scripts
      • Reduce number of operators
      • Reduce data flow between operators
  • 12. NUMBER OF INSTANCES (PIG)
  • 13. LODOP - SYSTEM OVERVIEW
  • 14. MULTI-QUERY OPTIMIZATION
      1. Merging identical operators
      2. Combining FILTER operators
      3. Combining FOREACH operators
  • 15. STEP 0: MASTER PLAN
      • Merging all logical plans into one master plan
      • Allows parallel execution
      • Reduces runtime to 25-30% of sequential execution
  • 16. 1. MERGING IDENTICAL OPERATORS
      • Number of property types per class
      • URI Literal Ratio per class
  • 17. 1. MERGING IDENTICAL OPERATORS
      • Number of property types per class
      • URI Literal Ratio per class
      1. Identify and compare sibling operators
      2. Merge matching siblings
  • 18. 1. MERGING IDENTICAL OPERATORS
      • Number of property types per class
      • URI Literal Ratio per class
      1. Identify and compare sibling operators
      2. Merge matching siblings
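The two steps above can be sketched on a toy operator tree. This is an illustrative Python model, not LODOP's implementation: the `Op` class, the idea that two operators match when their kind and argument string are equal, and the operator chains of the two example scripts are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """A logical-plan operator; children are the operators consuming its output."""
    kind: str
    args: str = ""
    children: list = field(default_factory=list)


def merge_siblings(node):
    """Merge children of `node` that have the same kind and arguments, top-down."""
    merged = {}
    for child in node.children:
        key = (child.kind, child.args)  # step 1: identify and compare siblings
        if key in merged:
            # Step 2: merge the match; its consumers move to the surviving operator.
            merged[key].children.extend(child.children)
        else:
            merged[key] = child
    node.children = list(merged.values())
    for child in node.children:
        merge_siblings(child)  # merged operators may expose new matching siblings
    return node


def count_ops(node):
    return 1 + sum(count_ops(child) for child in node.children)


def script(final_expression):
    """Hypothetical operator chain of one per-class profiling script."""
    return Op("LOAD", "input", [
        Op("FILTER", "predicate == rdf:type", [
            Op("GROUP", "by class", [
                Op("FOREACH", final_expression),
            ]),
        ]),
    ])


if __name__ == "__main__":
    # The master plan holds both scripts under one artificial root.
    plan = Op("PLAN", children=[
        script("COUNT(DISTINCT predicate)"),  # number of property types per class
        script("uris / literals"),            # URI-literal ratio per class
    ])
    print("operators before:", count_ops(plan))
    merge_siblings(plan)
    print("operators after:", count_ops(plan))
```

In this toy plan the shared LOAD, FILTER, and GROUP operators collapse into one chain, and only the two final FOREACH operators remain separate.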
  • 19. 1. MERGING IDENTICAL OPERATORS
      • Number of operators reduced from 365 to 267
      • Number of MapReduce jobs reduced from 176 to 140
      • Frees up cluster resources
      • Prerequisite step for other optimisations
      • Restricts parallelism
  • 20. 1. MERGING IDENTICAL OPERATORS
  • 21. 2. COMBINING FILTER OPERATORS
  • 22. 2. COMBINING FILTER OPERATORS
      1. Create combined FILTER operator
      2. Rearrange original FILTER operators
      3. Remove redundant operators
  • 23. 2. COMBINING FILTER OPERATORS
      1. Create combined FILTER operator
      2. Rearrange original FILTER operators
      3. Remove redundant operators
  • 24. 2. COMBINING FILTER OPERATORS
      1. Create combined FILTER operator
      2. Rearrange original FILTER operators
      3. Remove redundant operators
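A minimal sketch of the idea behind steps 1 and 2, using Python lists in place of Pig relations. It assumes the combined FILTER keeps a row when any of the original conditions holds, which the slide does not spell out; `combine_filters`, `ROWS`, and `FILTERS` are illustrative names. Step 3 (removing operators that became redundant) is not modelled.

```python
def combine_filters(rows, predicates):
    """Apply one combined filter, then the original filters on its smaller output."""
    # Step 1: combined FILTER, assumed to be the disjunction of all conditions.
    combined = [row for row in rows if any(p(row) for p in predicates.values())]
    # Step 2: the original FILTERs are rearranged to read the combined output.
    outputs = {
        name: [row for row in combined if predicate(row)]
        for name, predicate in predicates.items()
    }
    return combined, outputs


# Hypothetical (subject, predicate, object) triples.
ROWS = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "rdfs:label", "Berlin"),
    ("ex:Berlin", "ex:country", "ex:Germany"),
    ("ex:Germany", "rdf:type", "ex:Country"),
]

FILTERS = {
    "type_triples": lambda row: row[1] == "rdf:type",
    "label_triples": lambda row: row[1] == "rdfs:label",
}

if __name__ == "__main__":
    combined, outputs = combine_filters(ROWS, FILTERS)
    print("rows after combined filter:", len(combined))
    for name, result in outputs.items():
        print(name, result)
```

Each original filter still produces exactly the rows it would have produced on the full input, but the data flowing out of the shared operator is smaller.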
  • 25. 2. COMBINING FILTER OPERATORS
  • 26. 3. COMBINING FOREACH OPERATORS
  • 27. 3. COMBINING FOREACH OPERATORS
      1. Create combined FOREACH operator
      2. Replace with simple projections
      3. Remove redundant projection
  • 28. 3. COMBINING FOREACH OPERATORS
      1. Create combined FOREACH operator
      2. Replace with simple projections
      3. Remove redundant projection
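The same pattern can be sketched for FOREACH operators. In this illustrative Python model a FOREACH is a mapping from output column names to expressions over a row; it assumes that equal column names denote equal expressions. `combine_foreach`, `ROWS`, and `PROJECTIONS` are hypothetical names, and step 3 is not modelled.

```python
def combine_foreach(rows, projections):
    """Evaluate all expressions once, then derive each output by simple projection."""
    # Step 1: combined FOREACH computing the union of all output columns.
    all_expressions = {}
    for expressions in projections.values():
        all_expressions.update(expressions)  # assumption: same name, same expression
    combined = [
        {column: fn(row) for column, fn in all_expressions.items()} for row in rows
    ]
    # Step 2: each original FOREACH becomes a projection that only selects columns.
    return {
        name: [{column: row[column] for column in expressions} for row in combined]
        for name, expressions in projections.items()
    }


# Hypothetical input rows with a subject and a literal object value.
ROWS = [
    {"s": "ex:Berlin", "o": "Berlin"},
    {"s": "ex:Germany", "o": "Deutschland"},
]

PROJECTIONS = {
    "value_length": {
        "s": lambda row: row["s"],
        "length": lambda row: len(row["o"]),
    },
    "value_is_numeric": {
        "s": lambda row: row["s"],
        "is_numeric": lambda row: row["o"].isdigit(),
    },
}

if __name__ == "__main__":
    for name, result in combine_foreach(ROWS, PROJECTIONS).items():
        print(name, result)
```

The shared column `s` is computed once in the combined operator instead of once per script.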
  • 29. 3. COMBINING FOREACH OPERATORS
  • 30. ALL OPTIMIZATIONS
  • 31. SUMMARY
      • Optimizations reduce
          • Number of operations
          • Number of MapReduce jobs
          • Data flow between operators → less HDFS I/O → improved execution time
      • Reduces execution time by 70%
      • … but rules should not be applied in all cases
          • More advanced (cost-based) approach is needed
  • 32. FUTURE WORK
      • Additional logical optimization rules
          • Ignore projections if it allows further merging of operators
      • Advanced optimization strategies
          • Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go
      • Materialization of intermediate results
          • Materialize common subsets, e.g. only triples with typed object values for later scripts
  • 33. http://github.com/bforchhammer/lodop/
      @anjeve
      anja.jentzsch@hpi.uni-potsdam.de