SlideShare a Scribd company logo
Sempala
Interactive SPARQL Query Processing
on Hadoop
Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen
University of Freiburg, Germany
ISWC 2014 - Riva del Garda, Italy
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 2
Motivation
 Semantic Web has arrived in real-world applications
(not only academia & research)
• Web-scale semantic data makes single machine solutions infeasible
 Hadoop ecosystem has become de-facto standard for
Big Data applications
 Our Idea: Use it for Semantic Web purposes as well
• 2 main reasons:
1) Web-scale semantic data requires solutions that scale out
2) Industry has settled on Hadoop (or Hadoop-style) architectures
 superior cost-benefit ratio compared to specialized infrastructures
Motivation
 SPARQL-on-Hadoop
• Existing solutions are mostly based on MapReduce
• Scale very well to billions of triples (or more)
• But MapReduce is batch and thus inherently slow
• Good for unselective ETL-like queries with many results
 SPARQL queries are often explorative and ad-hoc
• Typically selective thus returning only a few results
 Runtimes in the order of hours are not satisfying!
• But interactive runtimes are virtually impossible
to achieve with MapReduce
 Need for interactive SPARQL-on-Hadoop
• Following the trend of interactive SQL-on-Hadoop
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 3
Sempala
 SPARQL-on-Hadoop query engine
• Especially designed with ad-hoc style selective queries in mind
• Interactive query runtimes for such queries
• Built on top of Impala, an MPP SQL query engine for Hadoop
• RDF data stored in HDFS using columnar storage format (Parquet)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 4
Sempala – Data Layout
 Single unified property table
• Parquet is a column-oriented storage format
 optimized for wide tables while selecting only a few columns on request
• Sparse columns cause little to no storage overhead in Parquet
• No clustering, no joins for star pattern queries
• Duplication strategy for multi-valued properties
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 5
Sempala – Data Layout
 Single unified property table
• Parquet is a column-oriented storage format
 optimized for wide tables while selecting only a few columns on request
• Sparse columns cause little to no storage overhead in Parquet
• No clustering, no joins for star pattern queries
• Duplication strategy for multi-valued properties
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 6
(Article2, 2)
run-length encoding
Sempala – Query Compiler
 BPG processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 7
Sempala – Query Compiler
 BPG processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 8
triple group tg1
triple group tg2
Sempala – Query Compiler
 BPG processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 9
triple group tg1
triple group tg2
join group jg1
Sempala – Query Compiler
 BPG processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 10
triple group tg1
triple group tg2
join group jg1
tg1
tg2
Sempala – Query Compiler
 BPG processing in Sempala
• Decompose BGP into disjoint triple groups having the same subject
• Use a subquery for every triple group and join the results (join group)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 11
triple group tg1
triple group tg2
jg1
tg1
tg2
join group jg1
Sempala – Query Compiler
 Overall workflow from SPARQL to Impala SQL
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 12
Sempala – Query Compiler
 Overall workflow from SPARQL to Impala SQL
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 13
Sempala – Query Compiler
 Overall workflow from SPARQL to Impala SQL
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 14
Evaluation
 Small Cluster with low-end configuration
• 10 machines, 32 GB RAM and 2 disks each, Gigabit network
(Cloudera recommends 256 GB RAM and 12 disks or more)
• CDH 4.5, Impala 1.2.3
 LUBM and BSBM benchmarks
• LUBM up to 3K universities
• BSBM up to 3M products
 Compared Sempala with 4 other Hadoop based systems
• Hive (same Query Compiler as Sempala but Hive as execution engine)
• PigSPARQL (built on top of Apache Pig) [1]
• MapMerge (map-side merge join for SPARQL BGPs) [2]
• MAPSIN (map-side index nested loop join based on Apache HBase) [3]
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 15
Evaluation
 LUBM 3K (sec in log scale)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 16
Evaluation
 LUBM 3K (sec in log scale)
 Geometric Mean for selective (star-shaped) queries:
• Sempala (8.3) , Hive (89) , PigSPARQL (69.7) , MAPSIN (78.1) , MapMerge (65.8)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 17
Evaluation
 LUBM 3K (sec in log scale)
 Geometric Mean for more sophisticated queries:
• Sempala (20.7) , Hive (316.6) , PigSPARQL (266) , MAPSIN (119.5) , MapMerge (175.6)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 18
Evaluation
 LUBM 3K (sec in log scale)
 Geometric Mean for unselective queries:
• Sempala (63.5) , Hive (166) , PigSPARQL (52.4) , MAPSIN (101.4) , MapMerge (24.5)
• Runtime of Sempala dominated by storing millions of records in result table
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 19
Summary
 Sempala SPARQL query engine for Hadoop
• Built on top of a state-of-the-art MPP SQL query engine (Impala)
• Uses a state-of-the-art columnar storage format (Parquet)
• SPARQL 1.0 (not only BGPs)
 Evaluation
• Outperforms existing Hadoop based systems by an order of magnitude
• Interactive runtimes for selective queries (selective ≠ simple)
(order of seconds, not minutes or even hours)
 Future Work
• Refinement of data layout (nested data structures  Impala 2.1)
• SPARQL 1.1 features (e.g. subqueries, aggregations)
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 20
References
[1] Schätzle, A. et al.: PigSPARQL: Mapping SPARQL to Pig Latin. In: SWIM 2011, pp. 4:1-4:8
[2] Przyjaciel-Zablocki, M. et al.: Map-Side Merge Joins for Scalable SPARQL BGP Processing.
In: CloudCom 2013, pp. 631-638
[3] Schätzle, A. et al.: Cascading Map-Side Joins over HBase for Scalable Join Processing.
In: SSWS+HPCSW 2012, pp. 59-74
22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 21

More Related Content

What's hot

Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
Databricks
 
Scalable Data Science with SparkR
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkR
DataWorks Summit
 
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
Avery Ching
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
Grigory Sapunov
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
Databricks
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark Summit
 
Processing edges on apache giraph
Processing edges on apache giraphProcessing edges on apache giraph
Processing edges on apache giraphDataWorks Summit
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Yu Liu
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
Cloudera, Inc.
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
Shravan (Sean) Pabba
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
Databricks
 

What's hot (20)

Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0Pandas UDF and Python Type Hint in Apache Spark 3.0
Pandas UDF and Python Type Hint in Apache Spark 3.0
 
Scalable Data Science with SparkR
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkR
 
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
Enabling Composition in Distributed Reinforcement Learning with Ray RLlib wit...
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
Processing edges on apache giraph
Processing edges on apache giraphProcessing edges on apache giraph
Processing edges on apache giraph
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 

Similar to Sempala - Interactive SPARQL Query Processing on Hadoop

High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
Saliya Ekanayake
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Geoffrey Fox
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using Impala
David Lauzon
 
Efficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out DatabasesEfficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out Databases
SnappyData
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Gyula Fóra
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
hadooparchbook
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...DataWorks Summit
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
t_ivanov
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
DB Tsai
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
Nicolas Poggi
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
huguk
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
Nicolas Poggi
 

Similar to Sempala - Interactive SPARQL Query Processing on Hadoop (20)

High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC...
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using Impala
 
Efficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out DatabasesEfficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out Databases
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
 

Recently uploaded

Output determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CCOutput determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CC
ShahulHameed54211
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
natyesu
 
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
TristanJasperRamos
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
Himani415946
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Sanjeev Rampal
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
laozhuseo02
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
JungkooksNonexistent
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 

Recently uploaded (16)

Output determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CCOutput determination SAP S4 HANA SAP SD CC
Output determination SAP S4 HANA SAP SD CC
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptxLiving-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
Living-in-IT-era-Module-7-Imaging-and-Design-for-Social-Impact.pptx
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 
guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 

Sempala - Interactive SPARQL Query Processing on Hadoop

  • 1. Sempala Interactive SPARQL Query Processing on Hadoop Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen University of Freiburg, Germany ISWC 2014 - Riva del Garda, Italy
  • 2. 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 2 Motivation  Semantic Web has arrived in real-world applications (not only academia & research) • Web-scale semantic data makes single machine solutions infeasible  Hadoop ecosystem has become de-facto standard for Big Data applications  Our Idea: Use it for Semantic Web purposes as well • 2 main reasons: 1) Web-scale semantic data requires solutions that scale out 2) Industry has settled on Hadoop (or Hadoop-style) architectures  superior cost-benefit ratio compared to specialized infrastructures
  • 3. Motivation  SPARQL-on-Hadoop • Existing solutions are mostly based on MapReduce • Scale very well to billions of triples (or more) • But MapReduce is batch and thus inherently slow • Good for unselective ETL-like queries with many results  SPARQL queries are often explorative and ad-hoc • Typically selective thus returning only a few results  Runtimes in the order of hours are not satisfying! • But interactive runtimes are virtually impossible to achieve with MapReduce  Need for interactive SPARQL-on-Hadoop • Following the trend of interactive SQL-on-Hadoop 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 3
  • 4. Sempala  SPARQL-on-Hadoop query engine • Especially designed with ad-hoc style selective queries in mind • Interactive query runtimes for such queries • Built on top of Impala, an MPP SQL query engine for Hadoop • RDF data stored in HDFS using columnar storage format (Parquet) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 4
  • 5. Sempala – Data Layout  Single unified property table • Parquet is a column-oriented storage format  optimized for wide tables while selecting only a few columns on request • Sparse columns cause little to no storage overhead in Parquet • No clustering, no joins for star pattern queries • Duplication strategy for multi-valued properties 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 5
  • 6. Sempala – Data Layout  Single unified property table • Parquet is a column-oriented storage format  optimized for wide tables while selecting only a few columns on request • Sparse columns cause little to no storage overhead in Parquet • No clustering, no joins for star pattern queries • Duplication strategy for multi-valued properties 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 6 (Article2, 2) run-length encoding
  • 7. Sempala – Query Compiler  BPG processing in Sempala • Decompose BGP into disjoint triple groups having the same subject • Use a subquery for every triple group and join the results (join group) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 7
  • 8. Sempala – Query Compiler  BPG processing in Sempala • Decompose BGP into disjoint triple groups having the same subject • Use a subquery for every triple group and join the results (join group) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 8 triple group tg1 triple group tg2
  • 9. Sempala – Query Compiler  BPG processing in Sempala • Decompose BGP into disjoint triple groups having the same subject • Use a subquery for every triple group and join the results (join group) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 9 triple group tg1 triple group tg2 join group jg1
  • 10. Sempala – Query Compiler  BPG processing in Sempala • Decompose BGP into disjoint triple groups having the same subject • Use a subquery for every triple group and join the results (join group) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 10 triple group tg1 triple group tg2 join group jg1 tg1 tg2
  • 11. Sempala – Query Compiler  BPG processing in Sempala • Decompose BGP into disjoint triple groups having the same subject • Use a subquery for every triple group and join the results (join group) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 11 triple group tg1 triple group tg2 jg1 tg1 tg2 join group jg1
  • 12. Sempala – Query Compiler  Overall workflow from SPARQL to Impala SQL 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 12
  • 13. Sempala – Query Compiler  Overall workflow from SPARQL to Impala SQL 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 13
  • 14. Sempala – Query Compiler  Overall workflow from SPARQL to Impala SQL 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 14
  • 15. Evaluation  Small Cluster with low-end configuration • 10 machines, 32 GB RAM and 2 disks each, Gigabit network (Cloudera recommends 256 GB RAM and 12 disks or more) • CDH 4.5, Impala 1.2.3  LUBM and BSBM benchmarks • LUBM up to 3K universities • BSBM up to 3M products  Compared Sempala with 4 other Hadoop based systems • Hive (same Query Compiler as Sempala but Hive as execution engine) • PigSPARQL (built on top of Apache Pig) [1] • MapMerge (map-side merge join for SPARQL BGPs) [2] • MAPSIN (map-side index nested loop join based on Apache HBase) [3] 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 15
  • 16. Evaluation  LUBM 3K (sec in log scale) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 16
  • 17. Evaluation  LUBM 3K (sec in log scale)  Geometric Mean for selective (star-shaped) queries: • Sempala (8.3) , Hive (89) , PigSPARQL (69.7) , MAPSIN (78.1) , MapMerge (65.8) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 17
  • 18. Evaluation  LUBM 3K (sec in log scale)  Geometric Mean for more sophisticated queries: • Sempala (20.7) , Hive (316.6) , PigSPARQL (266) , MAPSIN (119.5) , MapMerge (175.6) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 18
  • 19. Evaluation  LUBM 3K (sec in log scale)  Geometric Mean for unselective queries: • Sempala (63.5) , Hive (166) , PigSPARQL (52.4) , MAPSIN (101.4) , MapMerge (24.5) • Runtime of Sempala dominated by storing millions of records in result table 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 19
  • 20. Summary  Sempala SPARQL query engine for Hadoop • Built on top of a state-of-the-art MPP SQL query engine (Impala) • Uses a state-of-the-art columnar storage format (Parquet) • SPARQL 1.0 (not only BGPs)  Evaluation • Outperforms existing Hadoop based systems by an order of magnitude • Interactive runtimes for selective queries (selective ≠ simple) (order of seconds, not minutes or even hours)  Future Work • Refinement of data layout (nested data structures  Impala 2.1) • SPARQL 1.1 features (e.g. subqueries, aggregations) 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 20
  • 21. References [1] Schätzle, A. et al.: PigSPARQL: Mapping SPARQL to Pig Latin. In: SWIM 2011, pp. 4:1-4:8 [2] Przyjaciel-Zablocki, M. et al.: Map-Side Merge Joins for Scalable SPARQL BGP Processing. In: CloudCom 2013, pp. 631-638 [3] Schätzle, A. et al.: Cascading Map-Side Joins over HBase for Scalable Join Processing. In: SSWS+HPCSW 2012, pp. 59-74 22.10.2014 Sempala: Interactive SPARQL Query Processing on Hadoop 21