Spark Technology Center
Data Warehouse Features in
Apache Spark
Ioana Delaney
Spark Technology Center, IBM
About the Speaker
Ioana Delaney
Spark Technology Center, IBM
DB2 Optimizer developer working in the areas of query semantics, query rewrite, and optimization.
Worked on various releases of DB2 LUW and DB2 with BLU Acceleration
Apache Spark SQL Contributor
IBM Spark Technology Center
Founded in 2015
Location: 505 Howard St., San Francisco
Web: http://spark.tc
Twitter: @apachespark_tc
Mission:
• Contribute intellectual and technical capital
to the Apache Spark community.
• Make the core technology enterprise and
cloud-ready.
• Build data science skills to drive intelligence
into business applications.
http://bigdatauniversity.com
Enterprise Data Warehouses and Open Source Analytics
• Data Warehouses provide critical features like security, performance, backup/recovery,
etc. that make them the technology of choice for most enterprise uses
• However, modern workloads like advanced analytics or machine learning require a
more flexible and agile environment, like Apache Spark
• Enabling Spark with Data Warehouse capabilities makes it possible to get the best of both
worlds: the enterprise strength and performance of modern Data Warehouses and the
flexibility and power of open source analytics
Easy Steps toward Data Warehouse Technology Integration
into Spark
• Incorporate basic data modeling features into Spark such as Star Schema and
Informational Referential Integrity Constraints
• Enhance push down capabilities from Spark into the underlying Data Warehouse to
exploit the features of the database engine e.g. index access
What is a Data Warehouse?
• Relational database that integrates data from multiple heterogeneous sources e.g.
transactional data, files, other sources
• Designed for data modeling and analysis
• Provides information around a subject of a business e.g. product, customers, suppliers, etc.
• The most important requirements are query performance and data simplicity
• Based on a dimensional, or Star Schema model
• Consists of a fact table referencing a number of dimension tables
• Fact table contains the main data, or measurements, of a business
• Dimension tables, usually smaller tables, describe the different characteristics, or dimensions, of a
business
TPC-DS Benchmark
• Proxy for a real organization's data warehouse
• De-facto industry standard benchmark for measuring the performance of decision
support solutions such as RDBMS and Hadoop/Spark based systems
• The underlying business model is a retail product supplier e.g. retail sales, web, catalog
data, inventory, demographics, etc.
• Examines large volumes of data e.g. 1TB to 100TB
• Executes SQL queries of various operational requirements and complexities e.g. ad-hoc,
reporting, data mining
Excerpt from store_sales fact table diagram:
Star Schema Detection in Spark
• Queries against a star schema are expected to run fast based on the relationships
among the tables
• SPARK-17791 implements star schema detection based on cardinality
heuristics
• In a query, star schema detection algorithm:
• Finds the tables connected in a star-join
• Lets the Spark Optimizer plan the star-join tables in an optimal way
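The cardinality heuristic above can be sketched in a few lines. This is an illustrative toy, not the actual Catalyst implementation of SPARK-17791; the table statistics, join set, and the `fact_to_dim_ratio` threshold are all invented for the example.

```python
# Toy sketch of a cardinality-based star-join detection heuristic.
# A star join is approximated as: the largest table (the fact candidate)
# directly joined to several much smaller tables (the dimensions).

def detect_star_join(table_rows, joins, fact_to_dim_ratio=10):
    """table_rows: {table: row_count}; joins: set of frozenset join pairs.
    Returns (fact, sorted dimensions) or None if no star shape is found."""
    fact = max(table_rows, key=table_rows.get)          # largest table
    dims = [t for t in table_rows
            if t != fact
            and frozenset((fact, t)) in joins           # directly joined to fact
            and table_rows[fact] >= fact_to_dim_ratio * table_rows[t]]
    return (fact, sorted(dims)) if len(dims) >= 2 else None

# Assumed TPC-DS-like row counts for illustration only
stats = {"store_sales": 2_880_000_000, "date_dim": 73_049,
         "store": 1_002, "item": 300_000, "customer": 12_000_000}
joins = {frozenset(p) for p in [("store_sales", "date_dim"),
                                ("store_sales", "store"),
                                ("store_sales", "item")]}
print(detect_star_join(stats, joins))
# ('store_sales', ['date_dim', 'item', 'store'])
```

Here customer drops out because it is not directly joined to the fact table, matching the idea that only tables connected in the star join are grouped and handed to the planner as a unit.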
Join Reordering using Star Schema
Execution plan transformation:
select i_item_id, s_store_id, avg(ss_net_profit) as store_sales_profit,
avg(sr_net_loss) as store_returns_loss
from
store_sales, store_returns, date_dim d1, date_dim d2, store, item
where
i_item_sk = ss_item_sk and
s_store_sk = ss_store_sk and
ss_customer_sk = sr_customer_sk and
ss_item_sk = sr_item_sk and
ss_ticket_number = sr_ticket_number and
sr_returned_date_sk = d2.d_date_sk and
d1.d_moy = 4 and d1.d_year = 1998 and . . .
group by i_item_id, s_store_id
order by i_item_id, s_store_id
Simplified TPC-DS Query 25
Star schema diagram: the store_sales and store_returns fact tables are each joined N:1 to the
item, store, and date_dim dimension tables.
Query execution drops from 421 secs to 147 secs
(1TB TPC-DS setup), ~ 3x improvement
Performance Results for Star Schema Queries
• TPC-DS query speedup: 2x – 8x
• By observing relationships among the
tables, the Optimizer makes better planning
decisions
• Reduce the data early in the execution plan
• Reduce, or eliminate Sort Merge joins in
favor of more efficient Broadcast Hash
joins
TPC-DS 1TB performance results with star schema detection:
Star Schema to reduce join enumeration search space
• Cost-based Spark Optimizer uses dynamic programming for join enumeration
• One issue is the explosion of the search space when there is a large number of
joins
• Use heuristics search methods to eliminate sub-optimal plans
• SPARK-20233 applies star-schema filters to the dynamic programming
algorithm for join enumeration
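The search-space reduction is easy to see with a back-of-the-envelope count. This sketch is illustrative only (the 8-table/5-star sizes are assumed, and it counts left-deep orders rather than Catalyst's actual enumeration): treating the pre-planned star join as a single composite relation shrinks the number of candidate join orders dramatically.

```python
# Illustrative count of left-deep join orders, before and after
# star-schema filtering collapses the star tables into one unit.
from math import factorial

def left_deep_orders(n):
    # number of left-deep join orders over n relations
    return factorial(n)

tables = 8   # total relations in the query (assumed example)
star   = 5   # relations forming the detected star join (assumed)

unfiltered = left_deep_orders(tables)
# the pre-planned star join behaves as one composite relation
filtered = left_deep_orders(tables - star + 1)

print(unfiltered, filtered)   # 40320 24
```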
Support for Informational Referential Integrity Constraints
• Open up an area of query optimization techniques that rely on referential
integrity (RI) constraints semantics
• Support for informational primary key and foreign key (referential integrity)
constraints
• Not enforced by the Spark SQL engine; rather used by Catalyst to optimize the
query processing
• Targeted to applications that load and analyze data that originated from a Data
Warehouse for which the conditions for a given constraint are known to be true
• SPARK-19842 - umbrella JIRA for RI support
How do Query Optimizers use RI Constraints?
• Implement powerful optimizations based on RI semantics e.g. Join Elimination
• Example using a typical user scenario: queries against views
User view:

create view customer_purchases_2002 (id, last, first, product, store_id, month, quantity) as
select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, d_moy, ss_quantity
from store_sales, date_dim, customer, item, store
where d_date_sk = ss_sold_date_sk and
c_customer_sk = ss_customer_sk and
i_item_sk = ss_item_sk and
s_store_sk = ss_store_sk and
d_year = 2002

User query:

select id, first, last, product, quantity
from customer_purchases_2002
where product like 'bicycle%' and
month between 1 and 2
Internal optimizer query processing (plan diagram): the optimizer selects only a subset of
columns from the view, and the join between store and store_sales is removed based on RI
analysis.
Many RI based Optimizations
• Existential subquery to inner joins
• Group by push down through join
• Redundant join elimination
• Distinct elimination
• etc.
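The redundant join elimination listed above rests on a simple condition, sketched below. This is a hypothetical simplification (names invented), not Catalyst code: when a query joins fact to dimension on a declared non-null FK-to-PK relationship and uses no dimension columns, every fact row matches exactly one dimension row, so the join neither filters nor duplicates rows and can be dropped.

```python
# Toy condition for RI-based redundant join elimination.
# join_is_fk_pk: the join predicate equates a declared FK to the dimension's PK
# fk_not_null:   every fact row carries a non-null FK (so no rows are filtered)
# dim_cols_used: dimension columns referenced elsewhere in the query

def join_is_redundant(dim_cols_used, join_is_fk_pk, fk_not_null=True):
    return join_is_fk_pk and fk_not_null and not dim_cols_used

# select ss_quantity from store_sales, store where s_store_sk = ss_store_sk
print(join_is_redundant(set(), join_is_fk_pk=True))              # True: drop the join
print(join_is_redundant({"s_store_name"}, join_is_fk_pk=True))   # False: store is needed
```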
Advanced Data Source Push Down Features
• Spark applications often directly query external data sources such as relational databases
or files
• Provides Data Sources APIs for accessing structured data through Spark SQL
• Support optimizations such as Filter push down and Column pruning - subset of the
functionality that can be pushed down to some data sources
• Extend Data Sources APIs with join push down i.e. Selection and Projection push down
• Significantly improves query performance by reducing the amount of data transfer and
exploiting the capabilities of the data sources such as index access
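At its core, Selection and Projection push down amounts to rewriting the remote request so the source returns only the columns and rows the query needs. The sketch below is a deliberately minimal illustration of that rewrite (the function and its inputs are invented; the real Data Source API negotiation between Spark and the source is much richer).

```python
# Minimal sketch: build the query shipped to a JDBC-style data source,
# carrying projected columns and pushed-down predicates (and, with join
# push down, the join itself) so Spark only receives the reduced result.

def push_down_sql(table_expr, columns, predicates):
    sql = "SELECT {} FROM {}".format(", ".join(columns), table_expr)
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql

print(push_down_sql(
    "store_sales JOIN date_dim ON d_date_sk = ss_sold_date_sk",
    ["ss_store_sk", "ss_net_profit"],
    ["d_year = 2001"]))
```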
Benefits of Join Push Down
E.g. selective join with tables stored in a DBMS that has indexes
• Spark reads the entire tables, applies some predicates at the data source, and executes the
join locally
• Join execution in Spark ~ O(|L|)
• Join execution in DBMS ~ O(|S|)
• Instead, push the join execution to the data source
• Efficient join execution using index access
• Reduce the data transfer
• Query runs 100x faster!
Push down based on Cost vs. Heuristics
• Join push down is not always beneficial e.g. Cartesian product
• Query Optimizer determines the best execution of a SQL query e.g. implementation of
relational operators, order of execution, etc.
• The optimizer must also decide whether the different operations should be done by
Spark, or by the data source
• Needs knowledge of what each data source can do (e.g. file system vs. RDBMS), and how much it
costs (e.g. statistics from data source, network speed, etc.)
• Spark’s Catalyst Optimizer uses a combination of heuristics and cost model
• Cost model is an evolving feature in Spark
• Until cost model is fully implemented, use safe heuristics
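A toy cost model makes the trade-off concrete. All numbers below are assumed for illustration (they are not Spark's cost model): pushing the join wins when shipping only the joined result is much cheaper than shipping both inputs for a local Spark join, while a push down that expands the data (e.g. a Cartesian product) would lose.

```python
# Toy transfer-dominated cost comparison for join placement.

def transfer_cost(rows, row_bytes, bytes_per_sec):
    return rows * row_bytes / bytes_per_sec

def local_join_cost(left_rows, right_rows, row_bytes, net):
    # ship both tables, then join in Spark (transfer dominates here)
    return transfer_cost(left_rows + right_rows, row_bytes, net)

def pushed_join_cost(result_rows, row_bytes, net, dbms_factor=1.1):
    # DBMS executes the join (e.g. via index access), ships only the result
    return dbms_factor * transfer_cost(result_rows, row_bytes, net)

net = 100e6  # assumed 100 MB/s effective link
local  = local_join_cost(1_000_000_000, 10_000, row_bytes=100, net=net)
pushed = pushed_join_cost(50_000, row_bytes=100, net=net)
print(pushed < local)   # True: push the selective join down
```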
Star-Schema Heuristics to Push Down Joins
Filtering joins (e.g. star-joins) — table diagram: store_sales joined N:1 to date_dim and store.
Expanding joins — table diagram: store_sales joined N:1 to date_dim and store, and M:N to
store_returns.
Star-join query (filtering):

select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit
from store_sales, date_dim, store
where d_moy = 4 and d_year = 2001 and
d_date_sk = ss_sold_date_sk and
s_store_sk = ss_store_sk
group by s_store_id, s_store_name
order by s_store_id, s_store_name
limit 100

Expanding-join query:

select s_store_id, s_store_name, sum(ss_net_profit) as store_sales_profit, sum(ss_net_loss) as store_loss
from store_sales, date_dim, store, store_returns
where d_moy = 4 and d_year = 2001 and
d_date_sk = ss_sold_date_sk and
s_store_sk = ss_store_sk and
ss_ticket_number = sr_ticket_number and
ss_item_sk = sr_item_sk
group by s_store_id, s_store_name
order by s_store_id, s_store_name
limit 100
Performance Results for Star Join Push Down
• Mix of two data sources: IBM DB2/JDBC
and Parquet
TPC-DS Query | spark-2.2 (mins) | spark-2.2-jpd (mins) | Speedup
Q8  |  32 |  4 |  8x
Q13 | 121 |  5 | 25x
Q15 |   4 |  2 |  2x
Q17 |  77 |  7 | 11x
Q19 |  42 |  4 | 11x
Q25 | 153 |  7 | 21x
Q29 |  81 |  7 | 11x
Q42 |  31 |  3 | 10x
Q45 |  14 |  3 |  4x
Q46 |  61 |  5 | 12x
Q48 | 155 |  5 | 31x
Q52 |  31 |  5 |  6x
Q55 |  31 |  4 |  8x
Q68 |  69 |  4 | 17x
Q74 |  47 | 23 |  2x
Q79 |  63 |  4 | 15x
Q85 |  22 |  2 | 11x
Q89 |  55 |  4 | 14x
Cluster: 4-node cluster, each node having:
12 × 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48
Apache Hadoop 2.7.3, Apache Spark 2.2 main (August, 2017)
Database info:
Schema: TPCDS
Scale factor: 1TB total space
Mix of Parquet and DB2/JDBC data sources
DB2 DPF info: 4-node cluster, each node having:
10 × 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM
Number of cores: 48
Summary
• Data Warehouse capabilities are important to the next-generation Spark data platform
• Start with data modeling features such as Star Schema and Informational Referential
Integrity Constraints to boost query performance
• Push down processing from Spark into the underlying Data Warehouse to exploit the
features of the underlying database engine
Future work
• Transform Catalyst into a global optimizer
• Global optimizer generates an optimal execution plan across all data sources
• Determines where an operation should be evaluated based on:
1. The cost to execute the operation.
2. The cost to transfer data between Spark and the data sources
• Key factors that affect global optimization:
• Remote table statistics (e.g. number of rows, number of distinct values in each column, etc.)
• Data source characteristics (e.g. CPU speed, I/O rate, network speed, etc.)
• Extend Data Source APIs with data source characteristics
• Retrieve/compute data source table statistics
• Integrate data source cost model into Catalyst
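The site-selection decision described above can be sketched as minimizing execution cost plus the cost of moving an operation's result back to Spark. The structure and all numbers are assumed for illustration; this is not the Catalyst cost model.

```python
# Toy global-optimizer decision: run an operation where
# (execution cost + cost to ship its result to Spark) is smallest.

def choose_site(op_rows_out, exec_cost, transfer_cost_per_row):
    """exec_cost / transfer_cost_per_row: {site: cost}; result consumed in Spark."""
    def total(site):
        move = 0.0 if site == "spark" else op_rows_out * transfer_cost_per_row[site]
        return exec_cost[site] + move
    return min(exec_cost, key=total)

# Selective join: cheap in the DBMS (index access), small result to ship
print(choose_site(1_000,
                  exec_cost={"spark": 500.0, "dbms": 50.0},
                  transfer_cost_per_row={"spark": 0.0, "dbms": 0.01}))   # dbms

# Expanding join: huge result would have to cross the network
print(choose_site(1_000_000_000,
                  exec_cost={"spark": 500.0, "dbms": 50.0},
                  transfer_cost_per_row={"spark": 0.0, "dbms": 0.01}))   # spark
```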
Thank You.
Ioana Delaney, ursu@us.ibm.com
Visit http://spark.tc
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
 

Recently uploaded

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 

Recently uploaded (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 

2018 data warehouse features in spark

What is a Data Warehouse?
• A relational database that integrates data from multiple heterogeneous sources, e.g. transactional data, files, and other sources
• Designed for data modeling and analysis
• Provides information around a subject of a business, e.g. products, customers, suppliers, etc.
• The most important requirements are query performance and data simplicity
• Based on a dimensional, or Star Schema, model
• Consists of a fact table referencing a number of dimension tables
• The fact table contains the main data, or measurements, of a business
• Dimension tables, usually smaller tables, describe the different characteristics, or dimensions, of a business
TPC-DS Benchmark
• Proxy for a real organization's data warehouse
• De facto industry-standard benchmark for measuring the performance of decision support solutions such as RDBMS and Hadoop/Spark-based systems
• The underlying business model is a retail product supplier, e.g. retail sales, web and catalog data, inventory, demographics, etc.
• Examines large volumes of data, e.g. 1TB to 100TB
• Executes SQL queries of various operational requirements and complexities, e.g. ad-hoc, reporting, data mining
[Excerpt from the store_sales fact table diagram]
Star Schema Detection in Spark
• Queries against a star schema are expected to run fast based on the relationships among the tables
• SPARK-17791 implements star schema detection based on cardinality heuristics
• In a query, the star schema detection algorithm:
  • Finds the tables connected in a star-join
  • Lets the Spark Optimizer plan the star-join tables in an optimal way
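The cardinality heuristic behind star-join detection can be sketched in a few lines of Python. This is an illustration only, not the SPARK-17791 code; the table names, row counts, and the 10x fact-to-dimension ratio threshold are made-up assumptions.

```python
# Illustrative sketch of cardinality-based star-join detection (not the actual
# SPARK-17791 implementation). Table names, row counts, and the 10x
# fact-to-dimension ratio threshold are assumptions for the example.

def detect_star_join(row_counts, join_edges, fact_to_dim_ratio=10):
    """row_counts: {table: rows}; join_edges: set of (t1, t2) equi-join pairs.
    Returns (fact_table, sorted dimension tables) or None."""
    fact = max(row_counts, key=row_counts.get)    # largest table = fact candidate
    dims = []
    for t1, t2 in join_edges:
        if fact not in (t1, t2):
            continue                              # only edges touching the fact
        dim = t2 if t1 == fact else t1
        # a dimension must be much smaller than the fact table
        if row_counts[fact] >= fact_to_dim_ratio * row_counts[dim]:
            dims.append(dim)
    return (fact, sorted(dims)) if len(dims) >= 2 else None

rows = {"store_sales": 2_880_000_000, "date_dim": 73_049,
        "store": 1_002, "item": 300_000}
edges = {("store_sales", "date_dim"), ("store_sales", "store"),
         ("store_sales", "item")}
print(detect_star_join(rows, edges))
# → ('store_sales', ['date_dim', 'item', 'store'])
```

Once the fact table and its dimensions are identified this way, the optimizer can plan the star-join as a unit, as the following slides show.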
Join Reordering using Star Schema

Simplified TPC-DS Query 25:

    select i_item_id, s_store_id,
           avg(ss_net_profit) as store_sales_profit,
           avg(sr_net_loss) as store_returns_loss
    from store_sales, store_returns, date_dim d1, date_dim d2, store, item
    where i_item_sk = ss_item_sk
      and s_store_sk = ss_store_sk
      and ss_customer_sk = sr_customer_sk
      and ss_item_sk = sr_item_sk
      and ss_ticket_number = sr_ticket_number
      and sr_returned_date_sk = d2.d_date_sk
      and d1.d_moy = 4 and d1.d_year = 1998 and ...
    group by i_item_id, s_store_id
    order by i_item_id, s_store_id

[Star schema diagram: store_sales fact table joined N:1 to item, store, date_dim (d1), and store_returns, which in turn joins date_dim (d2)]

With the execution plan transformation, query execution drops from 421 secs to 147 secs on a 1TB TPC-DS setup, a ~3x improvement.
Performance Results for Star Schema Queries
• TPC-DS query speedup: 2x to 8x
• By observing the relationships among the tables, the Optimizer makes better planning decisions:
  • Reduces the data early in the execution plan
  • Reduces or eliminates Sort Merge joins in favor of more efficient Broadcast Hash joins
[Chart: TPC-DS 1TB performance results with star schema detection]
Star Schema to Reduce the Join Enumeration Search Space
• The cost-based Spark Optimizer uses dynamic programming for join enumeration
• One issue is the explosion of the search space when there is a large number of joins
• Heuristic search methods are used to eliminate sub-optimal plans
• SPARK-20233 applies star-schema filters to the dynamic programming algorithm for join enumeration
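A toy illustration of how a star-schema filter shrinks the dynamic-programming search space, in the spirit of SPARK-20233 but not Catalyst's actual enumerator: partial plans that join two dimension tables while the fact table is still missing are pruned, so far fewer states are explored.

```python
# Toy illustration (not Catalyst's enumerator) of how a star-schema filter
# prunes dynamic-programming join enumeration: a plan that joins two
# dimension tables while the fact table is still missing is never optimal
# for a star-join, so such states are skipped.

def count_dp_plans(tables, fact, star_filter):
    """Count the (partial plan, table) extensions a left-deep DP enumerator
    considers; the star-schema filter skips fact-less multi-table plans."""
    all_sets = {frozenset([t]) for t in tables}   # level 1: single tables
    considered = 0
    for size in range(2, len(tables) + 1):
        next_level = set()
        for left in all_sets:
            for t in tables:
                if t in left or len(left) + 1 != size:
                    continue
                joined = left | {t}
                if star_filter and fact not in joined:
                    continue      # prune: two dimensions joined without the fact
                considered += 1
                next_level.add(joined)
        all_sets |= next_level
    return considered

tables = ["store_sales", "date_dim", "store", "item"]   # one fact, three dims
print(count_dp_plans(tables, "store_sales", star_filter=False))  # 28 states
print(count_dp_plans(tables, "store_sales", star_filter=True))   # 15 states
```

Even with only four tables the filter roughly halves the enumerated states; the gap widens quickly as the number of joined tables grows.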
Support for Informational Referential Integrity Constraints
• Opens up an area of query optimization techniques that rely on referential integrity (RI) constraint semantics
• Adds support for informational primary key and foreign key (referential integrity) constraints
• Not enforced by the Spark SQL engine; rather, used by Catalyst to optimize query processing
• Targeted at applications that load and analyze data originating from a Data Warehouse, for which the conditions for a given constraint are known to be true
• SPARK-19842 is the umbrella JIRA for RI support
How do Query Optimizers use RI Constraints?
• Implement powerful optimizations based on RI semantics, e.g. Join Elimination
• Example using a typical user scenario: queries against views

User view:

    create view customer_purchases_2002
      (id, last, first, product, store_id, month, quantity) as
    select c_customer_id, c_last_name, c_first_name,
           i_product_name, s_store_id, d_moy, ss_quantity
    from store_sales, date_dim, customer, item, store
    where d_date_sk = ss_sold_date_sk
      and c_customer_sk = ss_customer_sk
      and i_item_sk = ss_item_sk
      and s_store_sk = ss_store_sk
      and d_year = 2002

User query:

    select id, first, last, product, quantity
    from customer_purchases_2002
    where product like 'bicycle%' and month between 1 and 2

Internal optimizer query processing:
• Selects only a subset of columns from the view
• The join between store and store_sales is removed based on RI analysis
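The join-elimination reasoning above can be sketched as follows. This is a simplified illustration, not Catalyst's implementation; it assumes the informational FK also implies the FK column is non-null, so the PK-side join neither filters nor duplicates rows, and the helper names are hypothetical.

```python
# Simplified sketch of RI-based join elimination (not Catalyst's code). It
# assumes the informational FK also implies the FK column is non-null, so the
# PK-side join neither filters nor duplicates rows; all names are hypothetical.

def eliminate_joins(referenced_cols, joins, fk_constraints):
    """joins: list of (fk_table, fk_col, pk_table, pk_col) equi-joins.
    Returns PK-side tables that can be dropped from the plan."""
    removable = []
    for fk_table, fk_col, pk_table, pk_col in joins:
        declared = (fk_table, fk_col, pk_table, pk_col) in fk_constraints
        pk_side_used = any(c.startswith(pk_table + ".") for c in referenced_cols)
        if declared and not pk_side_used:
            removable.append(pk_table)  # join adds no rows, filters none: drop it
    return removable

# The view scenario above: the user query references no store columns,
# so the store_sales-to-store join is redundant under the FK constraint.
refs = {"customer.c_customer_id", "item.i_product_name",
        "store_sales.ss_quantity", "date_dim.d_moy"}
joins = [("store_sales", "ss_store_sk", "store", "s_store_sk"),
         ("store_sales", "ss_item_sk", "item", "i_item_sk")]
print(eliminate_joins(refs, joins, set(joins)))   # → ['store']
```

The item join survives because the query projects i_product_name, while the store join is dropped: exactly the transformation shown on the slide.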
Many RI-based Optimizations
• Conversion of existential subqueries to inner joins
• Group By push down through join
• Redundant join elimination
• Distinct elimination
• etc.
Advanced Data Source Push Down Features
• Spark applications often directly query external data sources such as relational databases or files
• Spark provides Data Source APIs for accessing structured data through Spark SQL
• These support optimizations such as filter push down and column pruning, a subset of the functionality that can be pushed down to some data sources
• Extend the Data Source APIs with join push down, i.e. selection and projection push down
• Significantly improves query performance by reducing the amount of data transferred and by exploiting the capabilities of the data sources, such as index access
Benefits of Join Push Down
Example: a selective join with tables in a DBMS, where the DBMS has indexes
• Today, Spark reads the entire tables, applies some predicates at the data source, and executes the join locally
  • Join execution in Spark ~ O(|L|)
  • Join execution in the DBMS ~ O(|S|)
• Instead, push the join execution to the data source:
  • Efficient join execution using index access
  • Reduced data transfer
  • The query runs 100x faster!
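A back-of-envelope model of the data-transfer savings makes the argument concrete. All numbers here are hypothetical and unrelated to the measured 100x figure on the slide.

```python
# Back-of-envelope model of the data-transfer savings from join push down
# (all numbers are hypothetical and unrelated to the measured 100x above).

def rows_transferred(large_rows, keep_one_in, pushdown):
    result_rows = large_rows // keep_one_in       # selective join result size
    if pushdown:
        return result_rows        # DBMS joins via its indexes, ships the result
    return large_rows             # Spark must first pull the whole large table

big = 1_000_000_000               # rows in the large remote table
ratio = (rows_transferred(big, 10_000, pushdown=False)
         // rows_transferred(big, 10_000, pushdown=True))
print(ratio)                      # → 10000 (10,000x fewer rows on the wire)
```

In this sketch the network savings alone dominate; on top of that, the DBMS can use an index to compute the join without scanning the large table at all.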
Push Down Based on Cost vs. Heuristics
• Join push down is not always beneficial, e.g. a Cartesian product
• The Query Optimizer determines the best execution of a SQL query, e.g. the implementation of relational operators, the order of execution, etc.
• The optimizer must also decide whether the different operations should be done by Spark or by the data source
• This requires knowledge of what each data source can do (e.g. file system vs. RDBMS) and how much it costs (e.g. statistics from the data source, network speed, etc.)
• Spark's Catalyst Optimizer uses a combination of heuristics and a cost model
• The cost model is an evolving feature in Spark
• Until the cost model is fully implemented, use safe heuristics
Star-Schema Heuristics to Push Down Joins

Filtering joins (e.g. star-joins):
[Table diagram: store_sales joined N:1 to date_dim and store]

    select s_store_id, s_store_name,
           sum(ss_net_profit) as store_sales_profit
    from store_sales, date_dim, store
    where d_moy = 4 and d_year = 2001
      and d_date_sk = ss_sold_date_sk
      and s_store_sk = ss_store_sk
    group by s_store_id, s_store_name
    order by s_store_id, s_store_name
    limit 100

Expanding joins:
[Table diagram: store_sales joined N:1 to date_dim and store, and M:N to store_returns]

    select s_store_id, s_store_name,
           sum(ss_net_profit) as store_sales_profit,
           sum(sr_net_loss) as store_loss
    from store_sales, date_dim, store, store_returns
    where d_moy = 4 and d_year = 2001
      and d_date_sk = ss_sold_date_sk
      and s_store_sk = ss_store_sk
      and ss_ticket_number = sr_ticket_number
      and ss_item_sk = sr_item_sk
    group by s_store_id, s_store_name
    order by s_store_id, s_store_name
    limit 100
Performance Results for Star Join Push Down
• Mix of two data sources: IBM DB2/JDBC and Parquet

    TPC-DS Query   spark-2.2 (mins)   spark-2.2-jpd (mins)   Speedup
    Q8             32                 4                      8x
    Q13            121                5                      25x
    Q15            4                  2                      2x
    Q17            77                 7                      11x
    Q19            42                 4                      11x
    Q25            153                7                      21x
    Q29            81                 7                      11x
    Q42            31                 3                      10x
    Q45            14                 3                      4x
    Q46            61                 5                      12x
    Q48            155                5                      31x
    Q52            31                 5                      6x
    Q55            31                 4                      8x
    Q68            69                 4                      17x
    Q74            47                 23                     2x
    Q79            63                 4                      15x
    Q85            22                 2                      11x
    Q89            55                 4                      14x

Cluster: 4 nodes, each with 12 x 2TB disks, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM; 48 cores; Apache Hadoop 2.7.3, Apache Spark 2.2 main (August 2017)
Database: TPC-DS schema, scale factor 1TB total space, mix of Parquet and DB2/JDBC data sources
DB2 DPF: 4-node cluster, each node with 10 x 2TB disks, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM; 48 cores
Summary
• Data Warehouse capabilities are important to the next-generation Spark data platform
• Start with data modeling features such as Star Schema and Informational Referential Integrity Constraints to boost query performance
• Push down processing from Spark into the underlying Data Warehouse to exploit the features of the underlying database engine
Future Work
• Transform Catalyst into a global optimizer
  • A global optimizer generates an optimal execution plan across all data sources
  • It determines where an operation should be evaluated based on:
    1. The cost to execute the operation
    2. The cost to transfer data between Spark and the data sources
  • Key factors that affect global optimization:
    • Remote table statistics (e.g. number of rows, number of distinct values in each column, etc.)
    • Data source characteristics (e.g. CPU speed, I/O rate, network speed, etc.)
• Extend the Data Source APIs with data source characteristics
• Retrieve/compute data source table statistics
• Integrate the data source cost model into Catalyst
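The cost-based placement decision described above can be sketched as follows; all per-row cost constants are invented for illustration and do not come from any real Spark cost model.

```python
# Sketch of the global-optimization placement decision described above:
# evaluate an operation wherever execution cost plus transfer cost is lowest.
# All per-row cost constants are invented for illustration.

def best_site(rows_in, rows_out, spark_cost, source_cost, transfer_cost):
    # run at the data source: pay remote execution, ship only the result
    remote = rows_in * source_cost + rows_out * transfer_cost
    # run in Spark: ship the whole input first, then execute locally
    local = rows_in * transfer_cost + rows_in * spark_cost
    return ("data source", remote) if remote < local else ("spark", local)

# A selective filter: pushing it down avoids shipping 10M rows to Spark.
site, cost = best_site(rows_in=10_000_000, rows_out=1_000,
                       spark_cost=1.0, source_cost=2.0, transfer_cost=5.0)
print(site)   # → data source
```

With a non-selective operation (rows_out close to rows_in) or a slow remote engine, the same formula flips the decision toward Spark, which is why the remote statistics and source characteristics listed above are prerequisites.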
Spark Technology Center

Thank You.
Ioana Delaney, ursu@us.ibm.com
Visit http://spark.tc