THE PUSHDOWN
OF EVERYTHING
Stephan Kessler
Santiago Mola
Who are we?
Stephan Kessler
Developer @ SAP, Walldorf
o SAP HANA Vora team
o Integration of Vora query engine with
Apache Spark.
o Bringing new features and performance
improvements to Apache Spark.
o Before joining SAP: PhD and M.Sc. at the
Karlsruhe Institute of Technology.
o Research on privacy in databases and
sensor networks.
Santiago Mola
Developer @ Stratio, Madrid
o Working with the SAP HANA Vora team
o Focus on Apache Spark SQL extensions and data
sources implementation.
o Bootstrapped Stratio Sparkta, worked on Stratio
Ingestion and helped customers to build stream
processing solutions.
o Previously: CTO at Bitsnbrains, M.Sc. at Polytechnic
University of Valencia.
SAP HANA Vora
• SAP HANA Vora is a SQL-on-Hadoop solution based on:
– In-Memory columnar query execution engine with built-in query
compilation
– Spark SQL extensions (will be Open Source soon!):
• OLAP extensions
• Hierarchy queries
• Extended Data Sources API (‘Push Down Everything’)
[Figure: Spark stack. Labels: Spark Core Engine; Spark SQL, MLlib, Streaming, …; Data Sources API; Data Sources: CSV, HANA, HANA Vora]
Motivation
• “The fastest way of processing data is not processing it at all!”
• The Data Sources API lets Spark defer the computation of filters and projections to
the ‘source’
– Less I/O spent reading
– Less memory spent
• But: Data Sources can also be full-blown databases
– Deferring parts of the logical plan leads to
additional benefits
→ The Pushdown of Everything
Pushed down:
Project: Column1
Filter: Column2 > 20
Average: Column2
Implementing a Data Source
1. Creating a ‘DefaultSource’ class that implements the trait
(Schema)RelationProvider
trait SchemaRelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation
}
2. The returned “BaseRelation” can implement the following traits
– TableScan
– PrunedScan
– PrunedFilteredScan
Full Scan
• The most basic form of reading data: read it all, sequentially.
• Implement trait TableScan
trait TableScan {
  def buildScan(): RDD[Row]
}
• SQL: SELECT * FROM table
Pruned Scan
• Read all rows, only a few columns
• Implementing trait PrunedScan
trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}
• SQL: SELECT <column list> FROM table
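The pruning contract can be illustrated without Spark. The following is a minimal plain-Scala sketch, not the Spark implementation: `PrunedScanSketch` is a hypothetical name, rows are modelled as `Seq[Any]` instead of `RDD[Row]`, and the sample rows are the ones from the attendees table used in this deck. It shows that every row is returned, but each row carries only the requested columns, in the requested order.

```scala
// Plain-Scala model of PrunedScan semantics (no Spark dependency).
// Assumption: rows are Seq[Any]; Spark itself would return RDD[Row].
object PrunedScanSketch {
  // Column order of the underlying table.
  val schema: Seq[String] = Seq("name", "age", "hometown")

  val table: Seq[Seq[Any]] = Seq(
    Seq("Peter", 23, "London"),
    Seq("John", 30, "New York"),
    Seq("Stephan", 72, "Karlsruhe")
  )

  // Mirrors buildScan(requiredColumns): all rows are returned, but each row
  // contains only the requested columns, in the requested order.
  def buildScan(requiredColumns: Array[String]): Seq[Seq[Any]] = {
    val indices = requiredColumns.map(c => schema.indexOf(c))
    require(!indices.contains(-1), "unknown column requested")
    table.map(row => indices.toSeq.map(i => row(i)))
  }
}
```

Calling `PrunedScanSketch.buildScan(Array("hometown", "age"))` yields three rows of two values each, with hometown first.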
Pruned Filtered Scan
• Can filter which rows are fetched (predicate push down).
• Implement trait PrunedFilteredScan
trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
• SQL: SELECT <column list> FROM table WHERE <predicate>
• Spark SQL allows basic predicates here (e.g. EqualTo, GreaterThan).
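Predicate push down can be sketched the same way. This is a plain-Scala model under stated assumptions, not Spark code: the `Filter`, `EqualTo` and `GreaterThan` types below are simplified local stand-ins for the classes of the same name in `org.apache.spark.sql.sources` (the stand-in `GreaterThan` only handles `Int` values), rows are modelled as maps, and the "Anna" row is made-up sample data. All filters in the array must hold, which matches how Spark passes them (as a conjunction).

```scala
// Plain-Scala model of PrunedFilteredScan semantics (no Spark dependency).
object PrunedFilteredScanSketch {
  // Simplified stand-ins for Spark's source filter classes.
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter

  type Row = Map[String, Any]

  // Sample data; the Amsterdam row is invented for illustration.
  val table: Seq[Row] = Seq(
    Map("name" -> "Peter", "age" -> 23, "hometown" -> "London"),
    Map("name" -> "Anna", "age" -> 41, "hometown" -> "Amsterdam"),
    Map("name" -> "Stephan", "age" -> 72, "hometown" -> "Karlsruhe")
  )

  // Evaluates one pushed-down filter against one row.
  private def matches(row: Row, filter: Filter): Boolean = filter match {
    case EqualTo(attr, value) => row.get(attr).contains(value)
    case GreaterThan(attr, bound) =>
      row.get(attr) match {
        case Some(v: Int) => v > bound
        case _            => false
      }
  }

  // Mirrors buildScan(requiredColumns, filters): rows are filtered first,
  // then reduced to the requested columns in the requested order.
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Seq[Seq[Any]] =
    table
      .filter(row => filters.forall(f => matches(row, f)))
      .map(row => requiredColumns.toSeq.map(c => row(c)))
}
```

With `Array("age", "hometown")` and `Array(EqualTo("hometown", "Amsterdam"))`, only the single Amsterdam row leaves the source.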
How does it work?
Assume the following table attendees:
Name     Age  Hometown
Peter    23   London
John     30   New York
Stephan  72   Karlsruhe
…        …    …
Query:
SELECT hometown, AVG(age) FROM attendees
WHERE hometown = 'Amsterdam'
GROUP BY hometown
How does it work?
Query:
SELECT hometown, AVG(age) FROM attendees
WHERE hometown = 'Amsterdam'
GROUP BY hometown
The query is parsed into this logical plan (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    Relation (datasource) Attendees
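To make the tree shape of this plan concrete, here is a small plain-Scala model. The case classes are hypothetical and heavily simplified; Catalyst's real plan classes carry expressions rather than strings and have different constructors. The sketch only shows how the aggregate sits on top of the filter, which sits on top of the relation.

```scala
// Hypothetical, simplified model of the logical plan for the example query.
object LogicalPlanSketch {
  sealed trait Plan
  final case class Relation(table: String) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Aggregate(groupBy: Seq[String], outputs: Seq[String], child: Plan)
      extends Plan

  // SELECT hometown, AVG(age) FROM attendees
  // WHERE hometown = 'Amsterdam' GROUP BY hometown
  val plan: Plan =
    Aggregate(
      groupBy = Seq("hometown"),
      outputs = Seq("hometown", "AVG(age)"),
      child = Filter("hometown = 'Amsterdam'", Relation("attendees"))
    )

  // Lists the operators from the root down to the leaf.
  def operators(p: Plan): List[String] = p match {
    case Relation(t)        => List(s"Relation($t)")
    case Filter(c, child)   => s"Filter($c)" :: operators(child)
    case Aggregate(_, o, c) => s"Aggregate(${o.mkString(", ")})" :: operators(c)
  }
}
```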
Example with TableScan
Logical plan (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    Relation (datasource) Attendees
Physical plan after planning (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    PhysicalRDD (full scan)
SQL representation:
– Executed by the source:
SELECT name, age, hometown
FROM attendees
– Executed by Spark on the result:
SELECT hometown, AVG(age)
FROM source
WHERE hometown = 'Amsterdam'
GROUP BY hometown
Example with PrunedScan
Logical plan (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    Relation (datasource) Attendees
Physical plan after planning (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    PhysicalRDD (pruned: age, hometown)
SQL representation:
– Executed by the source:
SELECT age, hometown
FROM attendees
– Executed by Spark on the result:
SELECT hometown, AVG(age)
FROM source
WHERE hometown = 'Amsterdam'
GROUP BY hometown
Example with PrunedFilteredScan
Logical plan (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    Relation (datasource) Attendees
Physical plan after planning (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    PhysicalRDD (pruned: age, hometown; filtered: hometown = 'Amsterdam')
SQL representation:
– Executed by the source:
SELECT age, hometown
FROM attendees
WHERE hometown = 'Amsterdam'
– Executed by Spark on the result:
SELECT hometown, AVG(age)
FROM source
WHERE hometown = 'Amsterdam'
GROUP BY hometown
How can we improve this?
• There are sources doing more than filtering and pruning
– aggregation, joins, ...
• Some sources can execute more complex filters and functions
– Example: SELECT col1 + 1 FROM table WHERE col2 + col3 < col4
• Default Data Sources API cannot push down these things
– They might be trivial for the data source to execute.
• This leads to unnecessary work
– Fetching more data than needed
– Not using optimizations of the source
Enter the Catalyst Source API
• We implemented a new interface that data sources can implement to
signal that they can push down complex queries.
• Complexity of pushed down queries is arbitrary
– functions, set operators, joins, deeply nested subqueries, …
– even data source UDFs that are not supported in Spark
trait CatalystSource {
  def isMultiplePartitionExecution(relations: Seq[CatalystSource]): Boolean
  def supportsLogicalPlan(plan: LogicalPlan): Boolean
  def supportsExpression(expr: Expression): Boolean
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
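How a planner might use a check like supportsLogicalPlan can be sketched in plain Scala. This is an illustrative model, not the authors' planner, whose logic is not shown in these slides: the `Plan` case classes, the `Source` trait and both example sources are hypothetical stand-ins, and the real trait works on Catalyst's `LogicalPlan`. The sketch asks the source about the whole plan first and, if it declines, retries with the child, so the largest supported subtree is pushed down and the rest stays in Spark.

```scala
// Hypothetical model of deciding how much of a plan to push down.
object PushdownDecisionSketch {
  sealed trait Plan
  final case class Relation(table: String) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Aggregate(outputs: Seq[String], child: Plan) extends Plan

  // Stand-in for the supportsLogicalPlan idea: the source inspects a subtree
  // and answers whether it can execute all of it.
  trait Source {
    def supportsLogicalPlan(plan: Plan): Boolean
  }

  // A source that can scan and filter, but not aggregate.
  object FilterOnlySource extends Source {
    def supportsLogicalPlan(plan: Plan): Boolean = plan match {
      case Relation(_)      => true
      case Filter(_, child) => supportsLogicalPlan(child)
      case Aggregate(_, _)  => false
    }
  }

  // A source that accepts any plan, e.g. a full SQL database.
  object FullSqlSource extends Source {
    def supportsLogicalPlan(plan: Plan): Boolean = true
  }

  // Returns the largest subtree (closest to the root) that the source accepts.
  def pushedDown(plan: Plan, source: Source): Option[Plan] =
    if (source.supportsLogicalPlan(plan)) Some(plan)
    else plan match {
      case Relation(_)         => None
      case Filter(_, child)    => pushedDown(child, source)
      case Aggregate(_, child) => pushedDown(child, source)
    }
}
```

A source can also use such a check to decline plans it could execute but expects to be slow, by returning false for them.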
Partitioned and Holistic sources
• Data sources that can compute queries that operate on a holistic data set
– HANA, Cassandra, PostgreSQL, MongoDB
• Data sources that can compute queries that operate only over each
partition
– Vora, Parquet, ORC, PostgreSQL instances in Postgres XL
• Some can do both (to some degree)
• Our planner extensions optimize push down for both cases if the
data source implements the Catalyst Source API.
Partitioned vs. Holistic Sources
[Figure: cluster diagram. Labels: HDFS; three physical nodes, each with a Data Node, a Vora Engine and a Spark Worker; a further Spark Worker; SAP HANA; PostgreSQL; …]
Example with CatalystSource (partitioned execution)
Logical plan (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    Relation (datasource) Attendees
Physical plan after planning (root first):
Aggregate (hometown, SUM(PartialSum) / SUM(PartialCount))
  PhysicalRDD (CatalystSource)
SQL representation:
– Executed by the source, once per partition:
SELECT hometown,
SUM(age) AS PartialSum,
COUNT(age) AS PartialCount
FROM attendees
WHERE hometown = 'Amsterdam'
GROUP BY hometown
– Executed by Spark on the partial results:
SELECT hometown,
SUM(PartialSum) / SUM(PartialCount)
FROM source
GROUP BY hometown
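The partial aggregation above can be checked with a small worked example in plain Scala. This is a sketch under stated assumptions: `PartialAverageSketch` and its members are hypothetical names, the three partitions and all attendee rows are made-up data, and sequences stand in for partitions of a distributed source. Each partition computes SUM and COUNT for the filtered rows, and the final step divides the summed sums by the summed counts.

```scala
// Plain-Scala model of AVG computed from per-partition SUM and COUNT.
object PartialAverageSketch {
  final case class Attendee(name: String, age: Int, hometown: String)
  final case class Partial(hometown: String, partialSum: Long, partialCount: Long)

  // Made-up data, split into three partitions.
  val partitions: Seq[Seq[Attendee]] = Seq(
    Seq(Attendee("Anna", 30, "Amsterdam"), Attendee("Peter", 23, "London")),
    Seq(Attendee("Bram", 40, "Amsterdam"), Attendee("Cees", 50, "Amsterdam")),
    Seq(Attendee("John", 30, "New York"))
  )

  // What the source computes on one partition:
  // SELECT hometown, SUM(age), COUNT(age) ... WHERE hometown = town GROUP BY hometown
  def partialAggregate(partition: Seq[Attendee], town: String): Seq[Partial] =
    partition
      .filter(_.hometown == town)
      .groupBy(_.hometown)
      .map { case (h, rows) => Partial(h, rows.map(_.age.toLong).sum, rows.size.toLong) }
      .toSeq

  // What runs on top of the partial results:
  // SELECT hometown, SUM(PartialSum) / SUM(PartialCount) ... GROUP BY hometown
  def finalAverage(partials: Seq[Partial]): Map[String, Double] =
    partials.groupBy(_.hometown).map { case (h, ps) =>
      h -> (ps.map(_.partialSum).sum.toDouble / ps.map(_.partialCount).sum)
    }
}
```

For this data the partial results are (Amsterdam, 30, 1) and (Amsterdam, 90, 2), so the final average is 120 / 3 = 40.0. Averaging the per-partition averages instead (30 and 45) would give 37.5, which is why sums and counts are pushed down rather than AVG itself.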
Example with CatalystSource (holistic source)
Logical plan (root first):
Aggregate (hometown, AVG(age))
  Filter (hometown = 'Amsterdam')
    Relation (datasource) Attendees
Physical plan after planning:
PhysicalRDD (CatalystSource)
SQL representation (executed entirely by the source):
SELECT hometown, AVG(age)
FROM attendees
WHERE hometown = 'Amsterdam'
GROUP BY hometown
Returned Rows
Assumption: table size is n rows
• TableScan / PrunedScan: returns n rows
SELECT name, age, hometown
FROM attendees
• PrunedFilteredScan: returns < n rows
SELECT age, hometown
FROM attendees
WHERE hometown = 'Amsterdam'
• CatalystSource: returns << n rows (#distinct ‘hometowns’)
SELECT hometown,
SUM(age) AS PartialSum,
COUNT(age) AS PartialCount
FROM attendees
WHERE hometown = 'Amsterdam'
GROUP BY hometown
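The difference in returned rows can be made tangible with a small count in plain Scala. The data set is invented for illustration (1000 attendees spread evenly over four hometowns), `ReturnedRowsSketch` is a hypothetical name, and a single unpartitioned source is assumed; with partitioned execution the last figure would be at most one row per partition for the filtered hometown.

```scala
// Counts the rows each interface would hand back to Spark for the example query.
object ReturnedRowsSketch {
  final case class Attendee(name: String, age: Int, hometown: String)

  // Made-up data: 1000 attendees, evenly spread over four hometowns.
  val towns: Vector[String] = Vector("Amsterdam", "London", "New York", "Karlsruhe")
  val attendees: Seq[Attendee] =
    (0 until 1000).map(i => Attendee(s"person$i", 20 + i % 50, towns(i % towns.size)))

  // TableScan / PrunedScan: every row leaves the source.
  val fullOrPrunedScan: Int = attendees.size

  // PrunedFilteredScan: only rows matching the pushed-down filter.
  val prunedFilteredScan: Int = attendees.count(_.hometown == "Amsterdam")

  // CatalystSource: one aggregated row per distinct hometown that passes the filter.
  val catalystSource: Int =
    attendees.filter(_.hometown == "Amsterdam").map(_.hometown).distinct.size
}
```

For this data the three values are 1000, 250 and 1.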
Advantages
• A single interface covers all queries.
• CatalystSource subsumes TableScan, PrunedScan, PrunedFilteredScan.
• Fine-grained control of features supported by the data source
• Incremental implementation of a data source possible
– Start with supporting projects and filters and continue with more
• Opens the door to tighter integration with all kinds of databases.
– Dramatic performance improvements possible.
Current disadvantages and limitations
• Implementing CatalystSource for a rich data source (e.g., supporting SQL)
is a considerably complex task.
• Current implementation relies on (some) Spark APIs that are unstable.
– Backwards compatibility is not guaranteed.
• Pushing down a complex query could be slower than not pushing it down
– Examples:
• it overloads the data source
• it generates a result larger than its input tables
– CatalystSource implementors can work around this by marking such
queries as unsupported
What are the next steps?
• Improve the API to make it simpler for implementors
– add utilities to generate SQL,
– matchers to simplify working with logical plans
• Provide a stable API
– CatalystSource implementations should work with different Spark
versions without modification.
• Provide a common trait to reduce boilerplate code
– Example: A data source implementing CatalystSource should not
need to implement TableScan, PrunedScan or PrunedFilteredScan.
Summary
• Extension of the Data Sources API to push down arbitrary logical plans
• Leverages functionality of the source to process less data
• Part of SAP HANA Vora
• We will release it as open source
Thank you!
stephan.kessler@sap.com smola@stratio.com

More Related Content

What's hot

Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationDatabricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellDatabricks
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsDatabricks
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 

What's hot (20)

Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 

Similar to The Pushdown of Everything by Stephan Kessler and Santiago Mola

Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkDatabricks
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Sydney Spark Meetup - September 2015
Sydney Spark Meetup - September 2015Sydney Spark Meetup - September 2015
Sydney Spark Meetup - September 2015Andy Huang
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Lucas Jellema
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Spring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffSpring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffJAX London
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesDataWorks Summit
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 

Similar to The Pushdown of Everything by Stephan Kessler and Santiago Mola (20)

Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Sydney Spark Meetup - September 2015
Sydney Spark Meetup - September 2015Sydney Spark Meetup - September 2015
Sydney Spark Meetup - September 2015
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Scala and Spring
Scala and SpringScala and Spring
Scala and Spring
 
Spring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffSpring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard Wolff
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL Releases
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola

  • 1. THE PUSHDOWN OF EVERYTHING Stephan Kessler Santiago Mola
  • 2. Who are we? Stephan Kessler: Developer @ SAP, Walldorf o SAP HANA Vora team o Integration of the Vora query engine with Apache Spark. o Bringing new features and performance improvements to Apache Spark. o Before joining SAP: PhD and M.Sc. at the Karlsruhe Institute of Technology. o Research on privacy in databases and sensor networks. Santiago Mola: Developer @ Stratio, Madrid o Working with the SAP HANA Vora team o Focus on Apache Spark SQL extensions and data source implementations. o Bootstrapped Stratio Sparkta, worked on Stratio Ingestion and helped customers build stream processing solutions. o Previously: CTO at Bitsnbrains, M.Sc. at Polytechnic University of Valencia.
  • 3. SAP HANA Vora • SAP HANA Vora is a SQL-on-Hadoop solution based on: – In-Memory columnar query execution engine with built-in query compilation – Spark SQL extensions (will be Open Source soon!): • OLAP extensions • Hierarchy queries • Extended Data Sources API (‘Push Down Everything’)
  • 4. [Architecture diagram. Labels: Spark Core Engine; Spark SQL, MLlib, Streaming, …; Data Sources API; Data Sources: CSV, HANA, HANA Vora]
  • 5. Motivation • “The fastest way of processing data is not processing it at all!” • The Data Sources API allows deferring the computation of filters and projections to the ‘source’ – Less I/O spent reading – Less memory spent • But: Data Sources can also be full-blown databases – Deferring parts of the logical plan leads to additional benefits → The Pushdown of Everything. Pushed down: Project: Column1; Filter: Column2 > 20; Average: Column2
  • 6. Implementing a Data Source 1. Create a ‘DefaultSource’ class that implements the trait (Schema)RelationProvider trait SchemaRelationProvider { def createRelation( sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation } 2. The returned “BaseRelation” can implement the following traits – TableScan – PrunedScan – PrunedFilteredScan
  • 7. Full Scan • The most basic form of reading data: read it all, sequentially. • Implement trait TableScan trait TableScan { def buildScan(): RDD[Row] } • SQL: SELECT * FROM table
  • 8. Pruned Scan • Read all rows, but only a few columns • Implement trait PrunedScan trait PrunedScan { def buildScan(requiredColumns: Array[String]): RDD[Row] } • SQL: SELECT <column list> FROM table
  • 9. Pruned Filtered Scan • Can filter which rows are fetched (predicate push down). • Implement trait PrunedFilteredScan trait PrunedFilteredScan { def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] } • SQL: SELECT <column list> FROM table WHERE <predicate> • Spark SQL allows basic predicates here (e.g. EqualTo, GreaterThan).
  • 10. How does it work? Assume the following table attendees. Query: SELECT hometown, AVG(age) FROM attendees WHERE hometown = ‘Amsterdam’ GROUP BY hometown
       Name    | Age | Hometown
       Peter   | 23  | London
       John    | 30  | New York
       Stephan | 72  | Karlsruhe
       …       | …   | …
  • 11. How does it work? Query: SELECT hometown, AVG(age) FROM attendees WHERE hometown = ‘Amsterdam’ GROUP BY hometown. The query is parsed into this logical plan: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over Relation (datasource) Attendees.
  • 12. Example with TableScan. Logical plan: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over Relation (datasource) Attendees. Physical plan after planning: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over PhysicalRDD (full scan). SQL representation: the data source executes SELECT name, age, hometown FROM attendees; Spark SQL executes SELECT hometown, AVG(age) FROM source WHERE hometown = ‘Amsterdam’ GROUP BY hometown.
  • 13. Example with TableScan. Logical plan: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over Relation (datasource) Attendees. Physical plan after planning: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over PhysicalRDD (full scan). SQL representation: the data source executes SELECT name, age, hometown FROM attendees; Spark SQL executes SELECT hometown, AVG(age) FROM source WHERE hometown = ‘Amsterdam’ GROUP BY hometown.
  • 14. Example with PrunedScan. Logical plan: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over Relation (datasource) Attendees. Physical plan after planning: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over PhysicalRDD (pruned: age, hometown). SQL representation: the data source executes SELECT age, hometown FROM attendees; Spark SQL executes SELECT hometown, AVG(age) FROM source WHERE hometown = ‘Amsterdam’ GROUP BY hometown.
  • 15. Example with PrunedFilteredScan. Logical plan: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over Relation (datasource) Attendees. Physical plan after planning: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over PhysicalRDD (pruned: age, hometown; filtered: hometown = ‘Amsterdam’). SQL representation: the data source executes SELECT age, hometown FROM attendees WHERE hometown = ‘Amsterdam’; Spark SQL executes SELECT hometown, AVG(age) FROM source WHERE hometown = ‘Amsterdam’ GROUP BY hometown.
  • 16. How can we improve this? • There are sources doing more than filtering and pruning – aggregation, joins, ... • Some sources can execute more complex filters and functions – Example: SELECT col1 + 1 WHERE col2 + col3 < col4. • Default Data Sources API cannot push down these things – They might be trivial for the data source to execute. • This leads to unnecessary work – fetching more data – Not using optimizations of the source
  • 17. Enter the Catalyst Source API • We implemented a new interface that data sources can implement to signal that they can push down complex queries. • Complexity of pushed down queries is arbitrary – functions, set operators, joins, deeply nested subqueries, … – (even data source UDFs that are not supported in Spark). trait CatalystSource { def isMultiplePartitionExecution(relations: Seq[CatalystSource]): Boolean def supportsLogicalPlan(plan: LogicalPlan): Boolean def supportsExpression(expr: Expression): Boolean def logicalPlanToRDD(plan: LogicalPlan): RDD[Row] }
  • 18. Partitioned and Holistic sources • Data sources that can compute queries that operate on a holistic data set – HANA, Cassandra, PostgreSQL, MongoDB • Data sources that can compute queries that operate only over each partition – Vora, Parquet, ORC, PostgreSQL instances in Postgres XL • Some can do both (to some degree) • Our planner extensions allow optimizing push down for both cases if the data source implements the Catalyst Source API.
  • 19. Partitioned vs. Holistic Sources [Diagram. Labels: HDFS; three Physical Nodes, each with a Data Node, a Vora Engine and a Spark Worker; a further Spark Worker; SAP HANA, PostgreSQL, …]
  • 20. Example with CatalystSource (partitioned execution). Logical plan: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over Relation (datasource) Attendees. Physical plan after planning: Aggregate (hometown, SUM(PartialSum) / SUM(PartialCount)) over PhysicalRDD (CatalystSource). SQL representation: the data source executes SELECT hometown, SUM(age) AS PartialSum, COUNT(age) AS PartialCount FROM attendees WHERE hometown = ‘Amsterdam’ GROUP BY hometown; Spark SQL executes SELECT hometown, SUM(PartialSum) / SUM(PartialCount) FROM source GROUP BY hometown.
  • 21. Example with CatalystSource (holistic source). Logical plan: Aggregate (hometown, AVG(age)) over Filter hometown = ‘Amsterdam’ over Relation (datasource) Attendees. Physical plan after planning: PhysicalRDD (CatalystSource). SQL representation: the data source executes SELECT hometown, AVG(age) FROM attendees WHERE hometown = ‘Amsterdam’ GROUP BY hometown.
  • 22. Returned Rows. Assumption: table size is n rows.
       Scan type             | Query executed in the data source                                                                                                  | Rows returned
       TableScan/PrunedScan  | SELECT name, age, hometown FROM attendees                                                                                          | n
       PrunedFilteredScan    | SELECT age, hometown FROM attendees WHERE hometown = ‘Amsterdam’                                                                   | < n
       CatalystSource        | SELECT hometown, SUM(age) AS PartialSum, COUNT(age) AS PartialCount FROM attendees WHERE hometown = ‘Amsterdam’ GROUP BY hometown  | << n (#distinct ‘hometowns’)
  • 23. Advantages • A single interface covers all queries. • CatalystSource subsumes TableScan, PrunedScan, PrunedFilteredScan. • Fine-grained control of features supported by the data source • Incremental implementation of a data source possible – Start with supporting projects and filters and continue with more • Opens the door to tighter integration with all kinds of databases. – Dramatic performance improvements possible.
  • 24. Current disadvantages and limitations • Implementing CatalystSource for a rich data source (e.g., supporting SQL) is a considerably complex task. • Current implementation relies on (some) Spark APIs that are unstable. – Backwards compatibility is not guaranteed. • Pushing down a complex query could be slower than not pushing it down – Examples: • it overloads the data source • it generates a result larger than its input tables – CatalystSource implementors can work around this by marking such queries as unsupported
  • 25. What are the next steps? • Improve the API to make it simpler for implementors – add utilities to generate SQL, – matchers to simplify working with logical plans • Provide a stable API – CatalystSource implementations should work with different Spark versions without modification. • Provide a common trait to reduce boilerplate code – Example: A data source implementing CatalystSource should not need to implement TableScan, PrunedScan or PrunedFilteredScan.
  • 26. Summary • Extension of the Data Sources API to push down arbitrary logical plans • Leveraging functionality of the source to process less data • Part of SAP HANA Vora • We will release it as open source
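
The partitioned execution on slide 20 can be illustrated with a minimal sketch in plain Scala. It does not use Spark or the CatalystSource trait: the names `Attendee`, `Partial`, `partialAggregate` and `mergePartials` are hypothetical, and the sample rows beyond the table on slide 10 are made up for illustration. Each partition returns only SUM(age) and COUNT(age) per hometown, and the merge step computes SUM(PartialSum) / SUM(PartialCount), as in the SQL representation on the slide.

```scala
// Plain-Scala model of partitioned pushdown; none of these names are Spark APIs.
object PushdownSketch {
  final case class Attendee(name: String, age: Int, hometown: String)

  // What one partition returns per hometown: SUM(age) and COUNT(age).
  final case class Partial(hometown: String, partialSum: Long, partialCount: Long)

  // Work done inside the data source, once per partition:
  // SELECT hometown, SUM(age), COUNT(age) FROM attendees
  // WHERE hometown = town GROUP BY hometown
  def partialAggregate(partition: Seq[Attendee], town: String): Seq[Partial] =
    partition
      .filter(_.hometown == town)
      .groupBy(_.hometown)
      .map { case (h, rows) =>
        Partial(h, rows.map(_.age.toLong).sum, rows.size.toLong)
      }
      .toSeq

  // Work left to Spark SQL:
  // SELECT hometown, SUM(PartialSum) / SUM(PartialCount) FROM source GROUP BY hometown
  def mergePartials(partials: Seq[Partial]): Map[String, Double] =
    partials.groupBy(_.hometown).map { case (h, ps) =>
      val totalSum = ps.map(_.partialSum).sum
      val totalCount = ps.map(_.partialCount).sum
      h -> (totalSum.toDouble / totalCount)
    }

  // Made-up sample data, split into three partitions.
  val partitions: Seq[Seq[Attendee]] = Seq(
    Seq(Attendee("Peter", 23, "London"), Attendee("Anna", 30, "Amsterdam")),
    Seq(
      Attendee("John", 30, "New York"),
      Attendee("Bram", 40, "Amsterdam"),
      Attendee("Eva", 50, "Amsterdam")
    ),
    Seq(Attendee("Stephan", 72, "Karlsruhe"))
  )

  def main(args: Array[String]): Unit = {
    val partials = partitions.flatMap(partialAggregate(_, "Amsterdam"))
    println(s"rows returned by the source: ${partials.size}")
    println(s"AVG(age) per hometown: ${mergePartials(partials)}")
  }
}
```

With this sample data the source returns two partial rows instead of six attendee rows, and the merged average for Amsterdam is 40.0. Averaging the per-partition averages (30 and 45) would give 37.5, which is why the sketch ships sums and counts rather than averages.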

Editor's Notes

  1. Notes: Quick slide: about 1 minute
  2. Notes: 1 or 2 minutes about SAP HANA Vora.
  3. Notes: 30 seconds about Data Sources API intro: Data Sources API defines how Spark SQL can interact with an external source of data. The Data Source can represent a file format on HDFS, a relational database, a web service…
  4. With TableScan, everything is pulled from the data source: every row with every column. Then all further steps are performed in Spark. Clarification: Here are three columns: Logical plan. Physical plan. A SQL representation with the query that is executed in the data source and the query that is executed in Spark SQL. This is just an idealization, it does not mean that the data source actually uses SQL or that Spark SQL uses it internally.
  5. With TableScan, everything is pulled from the data source: every row with every column. Then all further steps are performed in Spark. Clarification: Here are three columns: Logical plan. Physical plan. A SQL representation with the query that is executed in the data source and the query that is executed in Spark SQL. This is just an idealization, it does not mean that the data source actually uses SQL or that Spark SQL uses it internally.
  6. With PrunedScan, we fetch all rows with a subset of columns. This can reduce I/O considerably.
  7. PrunedFilteredScan works like PrunedScan, but adds a filter on rows according to a condition. This is equivalent to adding a WHERE clause.