Vertica and Spark:
Connecting Computation and Data
Rui Liu and Edward Ma
Hewlett Packard Enterprise Vertica Advanced R&D Labs
Overview
Computation and data
• The Vertica-Spark connector connects both data and
computation between Vertica and Spark
• VerticaRDD and Vertica Data Source APIs
• Data-locality optimization
• Computation pushdown
• Save data from Spark to Vertica
HPE Vertica
• An advanced SQL analytics platform
• Shared-nothing MPP architecture
• Column-oriented storage organization
• Standard SQL interface with many analytics
capabilities built-in
• Extensible via User-Defined Functions
Connected pipeline
[Diagram: Spark and Vertica execution plans joined into one pipeline. Legend: Vertica execution plan node; Spark execution plan node; data flow]
• Connects the computation pipelines of Spark
and Vertica in an optimized way
• Utilizes parallel channels (parallel queries) for
data movement
• Leverages data-locality optimization on Vertica
• Ensures computation push-down into Vertica as
appropriate (filters and projections)
Vertica data segmentation on hash ring
[Diagram: hash ring divided among node0, node1, node2, node3; table columns: Id, Name, Age, email, …]
CREATE TABLE t …
SEGMENTED BY HASH(id) …
Spark locality-aware partitions
[Diagram: Spark partitions aligned with Vertica segments seg 0 to seg 3]
Locality-aware partition:
• Uses Vertica data-distribution
information
• Generates locality-aware
queries
SELECT id, name FROM T
WHERE 0 < (0xffffffff & hash(id))
AND (0xffffffff & hash(id)) < 278985;
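The query generation above can be sketched in a few lines of Python. This is only an illustration, not the connector's code: the helper names, the equal four-way split of the ring, and the half-open range convention [lo, hi) are assumptions. The 32-bit mask and the `hash(id)` predicate come from the slide.

```python
MASK = 0xFFFFFFFF  # 32-bit hash ring, matching the mask in the slide's query


def locality_aware_query(table, columns, key, lo, hi):
    """Build a query that reads only rows whose masked hash lies in [lo, hi)."""
    masked = f"({MASK:#x} & hash({key}))"
    return (
        f"SELECT {', '.join(columns)} FROM {table} "
        f"WHERE {lo} <= {masked} AND {masked} < {hi}"
    )


def even_segments(node_count):
    """Assumption: the ring is split into equal ranges, one per node."""
    size = (MASK + 1) // node_count
    return [
        (i * size, MASK + 1 if i == node_count - 1 else (i + 1) * size)
        for i in range(node_count)
    ]


# One query per segment; each Spark partition would run exactly one of these.
queries = [
    locality_aware_query("T", ["id", "name"], "id", lo, hi)
    for lo, hi in even_segments(4)
]
for q in queries:
    print(q)
```

Half-open ranges are used here so that every hash value belongs to exactly one partition; the slide's example shows strict bounds on both sides.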
Locality-aware query
[Diagram comparing a non-locality-aware query, which involves data shuffling between nodes, with a locality-aware query. Each query runs through an initiator node.]
Vertica query pruning
• For a locality-aware
query, the execution
can be pruned
• Query only executes
on nodes that contain
the data
[Diagram: a query without pruning versus a query with pruning, each shown with its initiator and the nodes holding data]
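A minimal sketch of the pruning decision, assuming each node owns one contiguous half-open hash range. The `SEGMENTS` layout and the function name are hypothetical; the talk does not show how Vertica implements pruning internally.

```python
# Hypothetical segment layout: four nodes, equal slices of a 32-bit hash ring.
SEGMENTS = {
    "node0": (0, 1 << 30),
    "node1": (1 << 30, 2 << 30),
    "node2": (2 << 30, 3 << 30),
    "node3": (3 << 30, 4 << 30),
}


def nodes_for_range(lo, hi, segments):
    """Return the nodes whose segment [s_lo, s_hi) overlaps the range [lo, hi)."""
    return [
        node
        for node, (s_lo, s_hi) in segments.items()
        if lo < s_hi and s_lo < hi
    ]


# A hash-range query confined to one segment only needs that node.
print(nodes_for_range(0, 278985, SEGMENTS))
# A query over the whole ring cannot be pruned.
print(nodes_for_range(0, 1 << 32, SEGMENTS))
```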
Arbitrary number of partitions
[Diagram: Spark partitions mapped onto Vertica segments seg 0 to seg 3]
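One way to get an arbitrary number of partitions while keeping locality is to cut each segment's hash range into sub-ranges, so that no partition crosses a segment boundary. The sketch below assumes that approach; the function names and the sample layout are illustrative only.

```python
def split_range(lo, hi, k):
    """Split [lo, hi) into k contiguous sub-ranges of near-equal width."""
    width = hi - lo
    bounds = [lo + (width * i) // k for i in range(k + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def partition_ranges(segments, per_segment):
    """Assign per_segment sub-ranges to every node; each stays inside one segment."""
    return [
        (node, sub)
        for node, (lo, hi) in segments.items()
        for sub in split_range(lo, hi, per_segment)
    ]


# Hypothetical two-node layout over a tiny ring, for readability.
layout = {"node0": (0, 8), "node1": (8, 16)}
for node, (lo, hi) in partition_ranges(layout, 2):
    print(node, lo, hi)
```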
Spark partitions over unsegmented table replicas
[Diagram: Spark partitions mapped onto replica 0, replica 1, replica 2, replica 3]
Performance characteristics
Scalability of throughput
Computation push-down
• Implemented:
– Filters
– Projections
– Count(*)
• Work in progress:
– Joins
– Aggregations
• Future work:
– User-Defined Functions
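Filter, projection, and Count(*) push-down amount to translating what Spark asks for into the SQL sent to Vertica. The following is a rough sketch under that reading; the `Filter` type, the supported operator set, and `build_scan` are hypothetical names, not the connector's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    column: str
    op: str        # must be one of SUPPORTED_OPS
    value: object  # int, float, or str


SUPPORTED_OPS = {"=", "<", "<=", ">", ">="}


def sql_literal(value):
    """Render a Python value as a SQL literal (strings are quoted and escaped)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return repr(value)
    raise TypeError(f"unsupported literal: {value!r}")


def build_scan(table, required_columns, filters, count_only=False):
    """Push projections, filters, and optionally COUNT(*) into one SELECT."""
    select = "COUNT(*)" if count_only else ", ".join(required_columns)
    sql = f"SELECT {select} FROM {table}"
    clauses = []
    for f in filters:
        if f.op not in SUPPORTED_OPS:
            raise ValueError(f"unsupported operator: {f.op}")
        clauses.append(f"{f.column} {f.op} {sql_literal(f.value)}")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql


print(build_scan("T", ["id", "name"], [Filter("age", ">", 30)]))
```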
Computation push-down schemes
• Per-tuple computation (filtering,
projection)
– Always pushed down
• Joins & aggregations
– Locality-aware: when join and aggregation keys are co-segmented in Vertica
– Single query: run single query with
join/aggregation inside Vertica
Original query:
select c,d,f from A JOIN B on A.foo = B.foo
Locality-aware push-down query:
select c,d,f from
A JOIN B on A.foo = B.foo
where
0 < (0xffffffff & hash(A.foo)) AND
(0xffffffff & hash(A.foo)) < 278985 …
[Diagram: initiator nodes for the original query and for the locality-aware push-down queries]
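The choice between the two join push-down schemes can be sketched as a small decision function. This simplifies co-segmentation to "both tables are segmented on the join key", which is an assumption; the function names are hypothetical.

```python
def join_pushdown_scheme(join_key, seg_key_a, seg_key_b):
    """Pick locality-aware push-down only when both tables are segmented on the join key."""
    if seg_key_a == join_key and seg_key_b == join_key:
        return "locality-aware"
    return "single-query"


def locality_aware_join(columns, a, b, key, lo, hi):
    """Build one hash-range slice of the join; the range is half-open, [lo, hi)."""
    masked = f"(0xffffffff & hash({a}.{key}))"
    return (
        f"SELECT {', '.join(columns)} FROM {a} JOIN {b} "
        f"ON {a}.{key} = {b}.{key} "
        f"WHERE {lo} <= {masked} AND {masked} < {hi}"
    )


if join_pushdown_scheme("foo", "foo", "foo") == "locality-aware":
    print(locality_aware_join(["c", "d", "f"], "A", "B", "foo", 0, 278985))
```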
Performance Impact of Pushdowns
Vertica Analytics
• Geospatial analytics
• Sentiment analysis
• Sessionization of event streams
• Time series pattern matching
• ... (Many more)
The Other Direction
Spark as an ETL Engine for Vertica
[Diagram labels: Data Sources (Logs, Streams); Save; Load; BI]
Objective: Want reliable and fast bulk loading
Two Possible Approaches
• Direct: Write
• Indirect (Two-Stage): Write, then Read
Comparison to JDBC Source
What could happen?
NEED TO AVOID
DUPLICATION!
Big-Picture (Direct)
[Diagram: Spark tasks/partitions write to a Vertica staging table, which is then moved to the Vertica target table. Stages: Serialization & Parsing; Leader Election & Commit]
Exactly-once Semantics
Never results in partial or duplicate table loads
[Diagram labels: RDD / DataFrame; worker nodes, each with an executor and task partitions; STAGING_TABLE; TASK_STATUS]
[Diagram, next step: a DUPLICATED TASK appears among the task partitions]
[Diagram, next step: a task issues the following statement against TASK_STATUS]
UPDATE TASK_STATUS SET DONE=TRUE
WHERE TASK=2 AND DONE=FALSE
[Diagram, next step: the update returns RESULT = 1, so the task issues COMMIT]
[Diagram, next step: the update returns RESULT = 0, so the task issues ROLLBACK]
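The conditional update shown on these slides can be tried out locally. The sketch below uses Python's built-in sqlite3 in place of Vertica, and it assumes the staging writes and the TASK_STATUS update share one transaction; the slides do not state the transaction boundaries, and the column layouts are made up for the example.

```python
import sqlite3


def run_task(conn, task_id, rows):
    """Load rows into staging; keep them only if this attempt finishes task_id first."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    cur.executemany(
        "INSERT INTO staging_table VALUES (?, ?)",
        [(task_id, row) for row in rows],
    )
    # Conditional update from the slide: it matches a row only for the first finisher.
    cur.execute(
        "UPDATE task_status SET done = 1 WHERE task = ? AND done = 0",
        (task_id,),
    )
    if cur.rowcount == 1:
        cur.execute("COMMIT")    # RESULT = 1: keep the staged rows
        return True
    cur.execute("ROLLBACK")      # RESULT = 0: a duplicate attempt, discard its rows
    return False


# isolation_level=None lets the code issue BEGIN/COMMIT/ROLLBACK explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE staging_table (task INTEGER, value TEXT)")
conn.execute("CREATE TABLE task_status (task INTEGER PRIMARY KEY, done INTEGER)")
conn.executemany("INSERT INTO task_status VALUES (?, 0)", [(t,) for t in range(3)])

first = run_task(conn, 2, ["a", "b"])
duplicate = run_task(conn, 2, ["a", "b"])
staged = conn.execute("SELECT COUNT(*) FROM staging_table").fetchone()[0]
print(first, duplicate, staged)
```

Because the duplicate attempt rolls back, the staging table holds each partition's rows only once.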
Leader Election
Find Out Which Task Won The Race
[Diagram labels: RDD / DataFrame; worker nodes, each with an executor and task partitions; STATUS_TABLE (PartID, done?); STAGING_TABLE; LAST_COMMITTER]
UPDATE LAST_COMMITTER SET
ID={MY_ID} WHERE ID = -1
[Diagram, next step: if the update succeeds, the task becomes the committer (SUCCESS -> COMMITTER)]
[Diagram, next step: if the update fails, the task is done (FAIL -> DONE)]
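The election works like a compare-and-set: only one task can change the ID from -1. A small sketch with sqlite3 standing in for Vertica follows. The single-row table layout is an assumption, and the sketch does not model when a task decides to attempt the election.

```python
import sqlite3


def try_become_committer(conn, my_id):
    """Return True only for the first caller; later callers find ID already set."""
    cur = conn.execute(
        "UPDATE last_committer SET id = ? WHERE id = -1",
        (my_id,),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE last_committer (id INTEGER)")
conn.execute("INSERT INTO last_committer VALUES (-1)")  # -1 means no committer yet

# Tasks 7, 3, and 5 race in this order; only the first update matches a row.
results = {task: try_become_committer(conn, task) for task in (7, 3, 5)}
winner = conn.execute("SELECT id FROM last_committer").fetchone()[0]
print(results, winner)
```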
Move to Target Table
STAGING_TABLE -> TARGET_TABLE
Committer: Rename or Append
Direct: Pros & Cons
+ Requires no external system beyond Vertica and
Spark
+ Less I/O: Does not involve an additional write stage
- Exactly-once semantics are non-trivial to ensure
Big-Picture (Two-Stage)
[Diagram: WRITE to ORC on HDFS, then PARSE & STORE into VERTICA]
COPY target_table FROM 'hdfs://…' ON ANY NODE ORC;
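The second stage comes down to issuing a single COPY statement once the ORC files are in place. A small helper for building it is sketched below. The statement shape is taken from the slide; the function name and the HDFS path in the usage line are hypothetical.

```python
def copy_from_orc(target_table, hdfs_path):
    """Build the COPY statement from the slide for ORC files at hdfs_path."""
    escaped = hdfs_path.replace("'", "''")  # escape quotes inside the SQL string literal
    return f"COPY {target_table} FROM '{escaped}' ON ANY NODE ORC;"


# Hypothetical staging location written by the Spark job in the first stage.
statement = copy_from_orc("target_table", "hdfs://namenode/staging/*.orc")
print(statement)
```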
Two-Stage: Pros & Cons
+ Utilizes Vertica’s built-in parallel COPY command for bulk loads, and
may allow for better scalability
+ Uses the optimized ORC reader for parsing (developed in
collaboration with Hortonworks).
+ Simpler code logic, since only one query is involved
- More I/O: Requires an additional writing stage.
- Requires an additional dependency on intermediate storage system.
Conclusions
• Vertica-to-Spark
– Utilizes hash-range queries to minimize intra-Vertica data movement
– Node-pruning eliminates unneeded distributed query processing in Vertica
– Query pushdown reduces data transfer and takes advantage of Vertica native analytics
• Spark-to-Vertica
– Direct connector approach efficiently and reliably coordinates transfer with
parallel Spark tasks
– A two-stage approach with intermediate staging storage (HDFS) uses a single Vertica query to load and may consume fewer resources on the Vertica side
Learn More About (and Try!) HPE Vertica
• Community Edition
– Free download: 1 TB, 3 nodes
– my.vertica.com/
• Spark Connector – Beta Program
– https://saas.hpe.com/marketplace/big-data/hpe-vertica-connector-apache-spark
THANK YOU.
Rui Liu (r.liu@hpe.com)
Edward Ma (ema@hpe.com)


  • 1. Vertica and Spark: Connecting Computation and Data. Rui Liu and Edward Ma, Hewlett Packard Enterprise, Vertica Advanced R&D Labs
  • 2. Overview: Computation and data • The Vertica-Spark connector connects both data and computation between Vertica and Spark • VerticaRDD and Vertica Data Source APIs • Data-locality optimization • Computation pushdown • Save data from Spark to Vertica
  • 3. HPE Vertica • An advanced SQL analytics platform • Shared-nothing MPP architecture • Column-oriented storage organization • Standard SQL interface with many analytics capabilities built-in • Extensible via User-Defined Functions
  • 5. Connected pipeline • Connects the computation pipelines of Spark and Vertica in an optimized way • Utilizes parallel channels (parallel queries) for data movement • Leverages data-locality optimization on Vertica • Ensures computation push-down into Vertica as appropriate (filters and projections)
  • 6. Vertica data segmentation on hash ring [Figure labels: node0, node1, node2, node3; table columns Id, Name, Age, email, …] CREATE TABLE t … SEGMENTED BY HASH(id) …
  • 7. Spark locality-aware partitions [Figure labels: seg 0, seg 1, seg 2, seg 3, Spark partition] Locality-aware partition: • Uses Vertica data-distribution information • Generates locality-aware queries, for example: SELECT id, name FROM T WHERE 0 < 0xffffffff & hash(id) AND 0xffffffff & hash(id) < 278985;
  • 8. Locality-aware query [Figure: non-locality-aware query vs. locality-aware query; labels: Initiator, Data shuffling, Data]
  • 9. Vertica query pruning • For a locality-aware query, the execution can be pruned • Query only executes on nodes that contain the data [Figure: query without pruning vs. query with pruning; labels: Initiator, Data]
  • 10. Arbitrary number of partitions seg 0 seg 1 seg 2 seg 3 Spark partition
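The hash-range queries on slides 7 to 10 can be sketched as follows. This is a minimal illustration, not the connector's code: it assumes a 32-bit hash ring split into equal, half-open ranges, whereas the connector is described as using Vertica's actual data-distribution information to align ranges with segments. The function names `hash_ranges` and `partition_queries` are hypothetical, and the predicate uses `lo <= … < hi` (with added parentheses) rather than the slide's strict `0 <` lower bound.

```python
HASH_SPACE = 1 << 32  # assumed size of the ring, matching the 0xffffffff mask on the slide


def hash_ranges(num_partitions):
    """Split [0, 2**32) into contiguous half-open ranges, one per Spark partition."""
    bounds = [i * HASH_SPACE // num_partitions for i in range(num_partitions + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def partition_queries(table, columns, key, num_partitions):
    """Build one range-restricted SELECT per partition (SQL text only, nothing is executed)."""
    cols = ", ".join(columns)
    masked = f"(0xffffffff & hash({key}))"
    return [
        f"SELECT {cols} FROM {table} WHERE {lo} <= {masked} AND {masked} < {hi}"
        for lo, hi in hash_ranges(num_partitions)
    ]


if __name__ == "__main__":
    for query in partition_queries("T", ["id", "name"], "id", 4):
        print(query)
```

Because the ranges are contiguous and cover the whole ring, every row falls into exactly one partition regardless of how many partitions are requested.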
  • 11. Spark partitions over unsegmented table replicas replica 0 replica 1 replica 2 replica 3 Spark partition
  • 14. Computation push-down • Implemented: – Filters – Projections – Count(*) • Works in progress: – Joins – Aggregations • Future work: – User-Defined Functions
  • 15. Computation push-down schemes • Per-tuple computation (filtering, projection) – Always pushed down • Joins & aggregations – Locality-aware: when join and aggregation keys are co- segmented in Vertica – Single query: run single query with join/aggregation inside Vertica
  • 16. Original query: select c,d,f from A JOIN B on A.foo = B.foo. Locality-aware push-down query: select c,d,f from A JOIN B on A.foo = B.foo where 0 < 0xffffffff & hash(foo) AND 0xffffffff & hash(foo) < 278985 … [Figure labels: Initiator]
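As a rough sketch of how projections, filters, and a locality-aware hash range might be combined into a single pushed-down query (slides 14 to 16): the helper `pushdown_query` and its parameters are hypothetical and only assemble SQL text. It assumes filters arrive as ready-made SQL predicates and uses a half-open range with added parentheses.

```python
def pushdown_query(source, projections, filters=(), hash_range=None, key=None):
    """Assemble a SELECT that carries projections, filters and an optional hash range.

    `source` may be a table name or a join expression such as
    "A JOIN B ON A.foo = B.foo" (the co-segmented case on slide 15).
    """
    predicates = list(filters)
    if hash_range is not None:
        lo, hi = hash_range
        masked = f"(0xffffffff & hash({key}))"
        predicates.append(f"{lo} <= {masked}")
        predicates.append(f"{masked} < {hi}")
    sql = f"SELECT {', '.join(projections)} FROM {source}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql


if __name__ == "__main__":
    print(pushdown_query("A JOIN B ON A.foo = B.foo", ["c", "d", "f"],
                         hash_range=(0, 278985), key="foo"))
```

Running one such query per hash range keeps the join inside Vertica while each Spark partition still reads only its own slice of the result.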
  • 18. Vertica Analytics • Geospatial analytics • Sentiment analysis • Sessionization of event streams • Time series pattern matching • ... (Many more)
  • 19. The Other Direction: Spark as an ETL Engine for Vertica [Figure labels: Data Sources, Logs, Streams, Save, Load, BI]
  • 20. The Other Direction: Spark as an ETL Engine for Vertica. Objective: Want reliable and fast bulk loading
  • 21. Two Possible Approaches Write Direct Write Read Indirect (Two-Stage)
  • 27. What could happen? NEED TO AVOID DUPLICATION!
  • 28. Big-Picture (Direct) [Figure labels: Spark Task/Partition (×3), VERTICA STAGING TABLE, VERTICA TARGET TABLE, Serialization & Parsing, Leader Election & Commit]
  • 29. Exactly-once Semantics: Never results in partial or duplicate table loads [Figure labels: RDD, Dataframe, Executor, Worker Node, Task, Partition, STAGING_TABLE, TASK_STATUS]
  • 30. Exactly-once Semantics: DUPLICATED TASK
  • 31. Exactly-once Semantics: UPDATE TASK_STATUS SET DONE=TRUE WHERE TASK=2 AND DONE=FALSE
  • 32. Exactly-once Semantics: RESULT = 1; COMMIT
  • 33. Exactly-once Semantics: RESULT = 0; ROLLBACK
  • 34. Leader Election: Find Out Which Task Won The Race [Figure labels: RDD, Dataframe, Executor, Worker Node, Task, Partition, STATUS_TABLE (PartID, done?), STAGING_TABLE, LAST_COMMITTER] UPDATE LAST_COMMITTER SET ID={MY_ID} WHERE ID = -1
  • 35. Leader Election: UPDATE LAST_COMMITTER SET ID={MY_ID} WHERE ID = -1; SUCCESS -> COMMITTER
  • 36. Leader Election: UPDATE LAST_COMMITTER SET ID={MY_ID} WHERE ID = -1; FAIL -> DONE
  • 37. Move to Target Table [Figure labels: STAGING_TABLE, TARGET_TABLE, Committer, Rename or Append]
  • 38. Direct: Pros & Cons + Requires no external system beyond Vertica and Spark + Less I/O: Does not involve an additional write stage - Exactly-once semantics are non-trivial to ensure
  • 42. Big-Picture (Two-Stage) [Figure labels: HDFS, VERTICA] COPY target_table FROM 'hdfs://…' ON ANY NODE ORC;
  • 43. Big-Picture (Two-Stage) [Figure labels: PARSE & STORE, HDFS, VERTICA] COPY target_table FROM 'hdfs://…' ON ANY NODE ORC;
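In the two-stage approach, the second stage reduces to a single COPY statement of the form shown on slides 42 and 43. A minimal sketch of building that statement follows; the helper `copy_statement` is hypothetical, and the HDFS path in the usage line is a made-up placeholder, since the slides elide the real one.

```python
def copy_statement(target_table, source_uri):
    """Build the slide's COPY ... ON ANY NODE ORC statement for files written by Spark."""
    # Doubling single quotes is the standard SQL way to escape them inside a literal.
    escaped = source_uri.replace("'", "''")
    return f"COPY {target_table} FROM '{escaped}' ON ANY NODE ORC;"


if __name__ == "__main__":
    # Placeholder path for illustration only.
    print(copy_statement("target_table", "hdfs://example-namenode/tmp/staging/*.orc"))
```

Spark would first write the DataFrame to HDFS as ORC files; one Vertica session then issues this statement, which is why the slides describe the code logic as simpler than the direct path.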
  • 44. Two-Stage: Pros & Cons + Utilizes Vertica’s built-in parallel COPY command for bulk loads, and may allow for better scalability + Uses the optimized ORC reader for parsing (developed in collaboration with Hortonworks). + Simpler code logic, since only one query is involved - More I/O: Requires an additional writing stage. - Requires an additional dependency on intermediate storage system.
  • 45. Conclusions • Vertica-to-Spark – Uses hash-range queries to minimize intra-Vertica data movement – Node pruning eliminates unneeded distributed query processing in Vertica – Query pushdown reduces data transfer and takes advantage of Vertica's native analytics • Spark-to-Vertica – The direct connector approach efficiently and reliably coordinates the transfer across parallel Spark tasks – The two-stage approach, with intermediate staging storage (HDFS), loads with a single Vertica query and may consume fewer resources on the Vertica side
  • 46. Learn More About, and Try, HPE Vertica • Community Edition: Free Download, 1TB, 3 nodes • my.vertica.com/ • Spark Connector: Beta Program • https://saas.hpe.com/marketplace/big-data/hpe-vertica-connector-apache-spark
  • 47. THANK YOU. Rui Liu (r.liu@hpe.com) Edward Ma (ema@hpe.com)