Radical Speed for SQL
Queries on Databricks:
Photon Under the Hood
Alex Behm
Tech Lead, Databricks
Greg Rahn
Staff Product Manager, Databricks
Agenda
▪ Intro to Photon
▪ Recent Developments
▪ Up Next
▪ Summary
Introduction to Photon
Observed Workload Trends
Businesses are moving faster, and as a result
organizations spend less time in data modeling, leading
to worse performance.
▪ Most columns don’t have "NOT NULL" constraints defined
▪ Strings are convenient but slower than specific types
▪ Data lifecycle: Raw → Bronze → Silver → Gold
Can we get both agility and performance?
-- Data [Analysts | Engineers | Scientists] everywhere
Just one more ask:
SQL as a first-class citizen on Databricks
What is Photon?
Photon is a new 100% Apache Spark compatible query engine
designed for speed and flexibility.
It’s built from the ground up to deliver the fastest performance
on modern cloud hardware for all data use cases across
data engineering, data science, machine learning, and data analytics.
• Re-architected for the fastest performance on real-world
applications
• Native C++ engine for faster queries
• Custom built memory management to avoid JVM bottlenecks
• Vectorized: memory, instruction, and data parallelism (SIMD)
• Works with your existing code and avoids vendor lock-in
• 100% compatible with open source Spark DataFrame APIs and Spark SQL
• Transparent operation to users - no need to invoke something new, it just works
• Optimizing for all data use cases and workloads
• Today, supporting SQL and DataFrame workloads
• Coming soon, Streaming, Data Science, and more
Building the next generation query engine
Why build a new execution engine?
[Diagram: Client submits SQL query → Spark Driver (JVM): Parsing; Catalyst: Analysis/Planning/Optimization; Scheduling → Spark Executors (Mixed JVM/Native), each running Execute Task]
Photon in the Databricks Lakehouse Platform
Delta Lake
Key Photon Characteristics
• Hybrid Photon/Spark Plans
• Use Photon when possible, fall back to Spark for unsupported operations
• Completely transparent to users
• Native code using off-heap memory
• Natural access to memory and intrinsics (no fiddling with Java Unsafe)
• No JVM GC, large heaps ok
• No JVM JIT performance cliffs / limitations
• Fully integrated with Spark’s memory manager
• Prefers hash join over sort-merge join
• Rich per-operator performance metrics
Recent Developments in Photon
Development Focus Areas
1. Production Readiness
a. Goal: Resilience comparable to DBR → spilling support
b. Testing and hardening, real customer workloads
2. Query Coverage
a. Today: Basics like joins/aggregations/shuffle, common types and functions
b. In development: Nested types, built-in functions
c. Coming soon: Sort/Window
3. Performance
a. Analyze and optimize common usage patterns
Disclaimer: Microbenchmarks
Microbenchmarks do not necessarily reflect
real-world end-to-end performance
During Photon development we analyze and optimize
performance with extensive microbenchmarks
In the following slides, we share benchmark results that
were run in controlled and narrowly scoped scenarios
Resilience with Very Large Inputs
• Spilling for very large inputs
• Write intermediate state to external storage to process
inputs exceeding available memory
✅ Hash Shuffle
✅ Hash Aggregation
✅ Hash Join
2-5x Speedup
Example: Spilling Hash Join [1 of 4]
• Hash join has two phases: build and probe
• Build phase: insert records from one join input into the hash table
• Hash table has a fixed number of partitions
[Figure: Partitioned Hash Table]
Example: Spilling Hash Join [2 of 4]
• When memory runs out, spill one partition to disk
• New records go to in-memory partitions or straight to disk
• Repeat until build is done
[Figure: Partitioned Hash Table]
Example: Spilling Hash Join [3 of 4]
• Probe phase: process rows from the other join input
• Emit results for probe rows matching in-memory build partitions
• Spill probe rows matching a spilled build partition
[Figure: Partitioned Hash Table with Build and Probe inputs]
Example: Spilling Hash Join [4 of 4]
• For each spilled partition, repeat the same build/probe process
• Might spill again! Apply the same algorithm recursively
[Figure: Build and Probe inputs of a spilled partition joined (⨝)]
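The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the general partitioned, spilling hash join technique, not Photon's implementation. The function name, the row-count memory budget, the `max_depth` guard against heavily skewed keys, and the use of plain lists in place of spill files are all assumptions made for the example.

```python
from collections import defaultdict

def spilling_hash_join(build, probe, key, num_partitions=4,
                       max_rows_in_memory=4, depth=0, max_depth=8):
    """Inner-join two lists of dict rows on `key`, spilling when over budget.

    Returns a list of (build_row, probe_row) pairs. Plain Python lists stand
    in for spill files on disk.
    """
    # Assumption: past max_depth we stop spilling, so heavily skewed keys
    # (which re-partitioning can never split) still terminate.
    budget = max_rows_in_memory if depth < max_depth else float("inf")

    def partition_of(row):
        # Salt the hash with the depth so a spilled partition is split
        # differently when it is processed again recursively.
        return hash((depth, row[key])) % num_partitions

    in_memory = {p: defaultdict(list) for p in range(num_partitions)}
    resident_rows = {p: 0 for p in range(num_partitions)}
    spilled_build, spilled_probe = {}, {}

    # Build phase: insert rows; when over budget, spill the largest partition.
    for row in build:
        p = partition_of(row)
        if p in spilled_build:
            spilled_build[p].append(row)  # partition already on "disk"
            continue
        in_memory[p][row[key]].append(row)
        resident_rows[p] += 1
        if sum(resident_rows.values()) > budget:
            victim = max(resident_rows, key=resident_rows.get)
            table = in_memory.pop(victim)
            spilled_build[victim] = [r for rows in table.values() for r in rows]
            spilled_probe[victim] = []
            del resident_rows[victim]

    # Probe phase: join against in-memory partitions, spill the rest.
    results = []
    for row in probe:
        p = partition_of(row)
        if p in spilled_build:
            spilled_probe[p].append(row)
        else:
            results.extend((b, row) for b in in_memory[p].get(row[key], ()))

    # Recursively run the same build/probe process on each spilled partition.
    for p, build_rows in spilled_build.items():
        if spilled_probe[p]:
            results.extend(spilling_hash_join(
                build_rows, spilled_probe[p], key,
                num_partitions, max_rows_in_memory, depth + 1, max_depth))
    return results
```

Calling `spilling_hash_join(build_rows, probe_rows, "id")` with a small `max_rows_in_memory` forces the spill and recursion paths even on tiny inputs, which makes it easy to compare the result against a plain nested-loop join.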
Spilling Hash Join vs. Spilling Sort-Merge Join
• Photon converts Sort-Merge Joins to Hash Joins
• Sort Merge Join
• Buffer + sort both join inputs, increasing memory pressure
• Spilling sort → write entire input to sorted runs
• Hash Join
• Only buffer build input (typically the smaller input) in a hash table
• Graceful degradation: Spill both inputs at the build-partition granularity
• Role reversal: Swap build/probe when processing spilled partitions
Up to 5x Speedup
Hardening: How we test Photon
• Random queries and data
• Using new open-source Spark random query generator
• Failure injection
• Randomly trip error paths to ensure graceful query failure
• Spill injection
• Randomly trigger spill events to simulate memory pressure
• Clang/LLVM C++ tools
• Address Sanitizer
• Undefined Behavior Sanitizer
• Combinations of the above
Query Coverage
Overview of Query Coverage
Data Types
✅ Byte/Short/Int/Long
✅ Boolean
✅ String/Binary
✅ Decimal
✅ Float/Double
✅ Date/Timestamp
✅ Struct
Coming soon: Array, Map
Operators
✅ Scan, Filter, Project
✅ Hash Aggregate/Join/Shuffle
✅ Nested-Loop Join
✅ Null-Aware Anti Join
✅ Union, Expand, ScalarSubquery
Coming soon: Sort, Window
Expressions
✅ Comparison / Logic
✅ Arithmetic / Math (most)
✅ Conditional (IF, CASE, etc.)
✅ String (common ones)
✅ Casts
✅ Aggregates (most common ones)
✅ Date/Timestamp (in progress)
Coming soon: UDFs, long tail
Expression Coverage for DATE/TIMESTAMP
• Many queries contain date/timestamp logic
• As of today: 95% coverage (100% very soon)
• Fast path for UTC timezone (default)
• Some expressions are very complicated to implement
• Individual functions run in Spark, but still run the operator/plan in Photon
Microbenchmarks do not necessarily reflect speedups on end-to-end queries. Functions are optimized for the UTC timezone; your mileage may vary.
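One plausible reading of the "fast path for UTC" bullet: when the session timezone is UTC, extracting a field from a timestamp needs no timezone-rule lookup and reduces to integer arithmetic on the epoch value. The sketch below illustrates that idea only. It assumes timestamps stored as microseconds since the Unix epoch (as Spark does), and the function names are hypothetical; the slides do not describe how Photon implements this.

```python
from datetime import datetime, timedelta, timezone

MICROS_PER_HOUR = 3_600_000_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hour_utc_fast(epoch_micros):
    """UTC only: no timezone rules to consult, just integer arithmetic."""
    return (epoch_micros // MICROS_PER_HOUR) % 24

def hour_general(epoch_micros, tz):
    """Any timezone: build a datetime and convert it using tz's rules."""
    return (EPOCH + timedelta(microseconds=epoch_micros)).astimezone(tz).hour
```

Both functions agree whenever `tz` is UTC; the general version is the one that has to pay for offset and daylight-saving lookups.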
Nested/Complex Type Support
• ✅ Struct
• Array / Map, in active development
• Reading data and basic usage/functions work
• In progress: collect_list() / collect_set()
• Long tail of array expressions
Microbenchmarks do not necessarily reflect speedups on end-to-end queries; your mileage may vary.
Writing Delta/Parquet Data
• Currently supports all scalar types and Struct
• Array/Map in active development
• Can be turned on/off independently of Photon
• spark.databricks.photon.parquetWriter.enabled = true
• Typical speedups: 2-4x
• Wider (>100 columns) tables can see even more gains
DML Support [DELETE / UPDATE / MERGE]
• Bulk of work like joins/aggregations run in Photon
• Benefits from Photon Delta/Parquet writing capability
• Typical speedups: 2-3x
ANSI SQL Support
• Development in tandem with open-source Spark
• Fail queries on overflow or similar errors
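To make "fail queries on overflow" concrete, here is a minimal Python sketch contrasting wrapping and checked 32-bit integer addition. It models the two behaviors only; the function names are hypothetical and this is not Spark or Photon code. The wrapping variant reflects the assumption that the non-ANSI mode follows two's-complement (Java-style) integer semantics.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def add_int_wrapping(a, b):
    """Non-ANSI style: silently wrap around on overflow (two's complement)."""
    return (a + b - INT_MIN) % 2**32 + INT_MIN

def add_int_ansi(a, b):
    """ANSI style: raise an error instead of returning a wrapped value."""
    result = a + b
    if not INT_MIN <= result <= INT_MAX:
        raise ArithmeticError("integer overflow")
    return result
```

With these definitions, `add_int_wrapping(INT_MAX, 1)` quietly returns `INT_MIN`, while `add_int_ansi(INT_MAX, 1)` raises, which is the behavior that makes the whole query fail.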
Photon: What's Next
Current/Up Next Efforts in Photon
• Finishing nested type support, including writes
• Outstanding ANSI SQL behaviors
• Sort and Window operators
• Support for bucketed tables
How to use Photon today
Photon: Key Use Cases for Preview
SQL Data Engineering / ELT / ETL (June)
● Enable Photon via Workspace cluster
● Notebook or JAR
● Available on: AWS
● Not supported yet
○ UDFs
○ Streaming
Interactive SQL Analytics (June)
● Photon via Databricks SQL
● Redash
● Tableau
● Microsoft Power BI
● BYO Tool via ODBC / JDBC
● Available on: AWS, Azure
● Not supported yet
○ Sort
○ Window
SELECT
vendor_id,
SUM(trip_distance) as SumTripDistance,
AVG(trip_distance) as AvgTripDistance
FROM abehm.nyc_yellow
WHERE passenger_count IN (1, 2, 4)
GROUP BY vendor_id
ORDER BY vendor_id
Sort
+- Exchange rangepartitioning
   +- HashAggregate
      +- Exchange hashpartitioning
         +- HashAggregate
            +- Project
               +- Filter
                  +- ColumnarToRow
                     +- FileScan
Sort
+- Exchange
   +- ColumnarToRow
      +- PhotonResultStage
         +- PhotonGroupingAgg
            +- PhotonShuffleExchangeSource
               +- PhotonShuffleMapStage
                  +- PhotonShuffleExchangeSink
                     +- PhotonGroupingAgg
                        +- PhotonProject
                           +- PhotonFilter
                              +- PhotonAdapter
                                 +- FileScan
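The shape of the second plan (Photon nodes at the bottom, a single ColumnarToRow transition, then Spark nodes on top) can be illustrated with a small Python sketch. It assumes a simple bottom-up conversion that stops at the first unsupported operator, which is consistent with the plan shown but is not described in the slides. The operator names, the `PHOTON_SUPPORTED` set, and the `(engine, operator)` tuples are hypothetical and do not match real plan node names.

```python
# Hypothetical set of operators with a native implementation. Sort and the
# range-partitioning exchange are left out to mirror the plan on the slide.
PHOTON_SUPPORTED = {"FileScan", "Filter", "Project", "HashAggregate", "ExchangeHash"}

def to_hybrid_plan(spark_plan_bottom_up):
    """Convert a linear plan (listed from scan to root) into a hybrid plan.

    Operators run in Photon until the first unsupported one is reached; a
    single columnar-to-row transition is inserted there and everything above
    it stays in Spark.
    """
    hybrid = []
    in_photon = True
    for op in spark_plan_bottom_up:
        if in_photon and op in PHOTON_SUPPORTED:
            hybrid.append(("photon", op))
            continue
        if in_photon:
            if hybrid:  # only transition if something actually ran in Photon
                hybrid.append(("transition", "ColumnarToRow"))
            in_photon = False
        hybrid.append(("spark", op))
    return hybrid

plan = ["FileScan", "Filter", "Project", "HashAggregate",
        "ExchangeHash", "HashAggregate", "ExchangeRange", "Sort"]
hybrid_plan = to_hybrid_plan(plan)
```

Stopping at the first unsupported operator keeps the number of columnar-to-row conversions at one per plan branch, at the cost of not using Photon for supported operators higher up.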
Spark UI
● Yellow → Photon Nodes
● Blue → Spark Nodes
Metrics
● Photon nodes have rich metrics to help
understand behavior and performance
● Easier than Spark where several nodes
are squashed together
Performance observations
Customer Feedback
Test Date               Average Query Response Time (seconds)   Reduction from Previous
June '20 (DBR v6.6)     7.8                                      -
December '20 (Photon)   6.2                                      21%
May '21 (Photon)        4.4                                      29%
Overall (June '20 to May '21): 44% reduction
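The reductions quoted above follow directly from the three response times. A short Python check makes the arithmetic explicit; the helper name `reduction_pct` is an assumption for this example, and the only inputs are the figures from the slide.

```python
def reduction_pct(before, after):
    """Percentage reduction in response time going from `before` to `after`."""
    return (before - after) / before * 100

june_20, dec_20, may_21 = 7.8, 6.2, 4.4  # average query response time, seconds

step_1 = reduction_pct(june_20, dec_20)   # DBR v6.6 -> Photon (December '20)
step_2 = reduction_pct(dec_20, may_21)    # Photon (December '20) -> Photon (May '21)
overall = reduction_pct(june_20, may_21)  # DBR v6.6 -> Photon (May '21)

print(round(step_1), round(step_2), round(overall))  # 21 29 44
```

The two steps do not simply add up to the overall figure because each percentage is relative to a different baseline: the remaining fractions multiply, so 1 - (1 - 0.205) × (1 - 0.290) ≈ 0.436.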
2.5x Avg query speedup
3.7x Power Test speedup
DEMO
"Demo" - just a walkthrough showing where users
can turn on Photon in Databricks?
Note: From getting started to executing existing
code/queries and monitoring Photon (Spark UI +
Query execution on SQLA)
Logo slide with generalized perf observations
brought down merge latency by 2-3x
Summary
Related Talks
WEDNESDAY
• 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks
• 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm, Databricks
• 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume
THURSDAY
• 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics
• 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks
FRIDAY
• 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast & Molly Nagamuthu, Databricks
How to get started
In June
databricks.com/try
SQL> SELECT questions FROM audience;
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Data Types Operators
✅ Byte/Short/Int/Long
✅ Boolean
✅ String/Binary
✅ Decimal
✅ Float/Double
✅ Date/Timestamp
✅ Struct
Coming soon: Array, Map
✅ Scan, Filter, Project
✅ Hash Aggregate/Join/Shuffle
✅ Nested-Loop Join
✅ Null-Aware Anti Join
✅ Union, Expand, ScalarSubquery
Coming soon: Sort, Window
Expressions
✅ Comparison / Logic
✅ Arithmetic / Math (most)
✅ Conditional (IF, CASE, etc.)
✅ String (common ones)
✅ Casts
✅ Aggregates (most common
ones)
✅ Date/Timestamp (in progress)
Coming soon: UDFs, long tail
● Parsing
● Catalyst: Analysis/Planning/Optimization
● Scheduling
Execute Task
Client: Submit SQL Query
Execute Task Execute Task Execute Task Spark Executors
Mixed
JVM/Native
Spark Driver
JVM
Delta Lake
1
0
1
0
1
0
1
0
1
0
1
0

More Related Content

What's hot

Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 

What's hot (20)

Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 

Similar to Radical Speed for SQL Queries on Databricks: Photon Under the Hood

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
Databricks
 
20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_db20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_db
hyeongchae lee
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi Vončina
SPC Adriatics
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
NAVER D2
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedPerformance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Tim Callaghan
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
Senturus
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward
 
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
HostedbyConfluent
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
Serg Masyutin
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial Pipelines
Databricks
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
Arnab Biswas
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Gabriele Bartolini
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
Romi Kuntsman
 

Similar to Radical Speed for SQL Queries on Databricks: Photon Under the Hood (20)

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_db20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_db
 
SharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi VončinaSharePoint 2013 Performance Analysis - Robi Vončina
SharePoint 2013 Performance Analysis - Robi Vončina
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedPerformance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons Learned
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & Tricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
Developing Kafka Streams Applications with Upgradability in Mind with Neil Bu...
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial Pipelines
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood

  • 1. Radical Speed for SQL Queries on Databricks: Photon Under the Hood Alex Behm Tech Lead, Databricks Greg Rahn Staff Product Manager, Databricks
  • 2. Agenda ▪ Intro to Photon ▪ Recent Developments ▪ Up Next ▪ Summary
  • 4. Observed Workload Trends Businesses are moving faster, and as a result organizations spend less time in data modeling, leading to worse performance. ▪ Most columns don’t have "NOT NULL" constraints defined ▪ Strings are convenient but slower than specific types ▪ Data lifecycle: Raw → Bronze → Silver → Gold Can we get both agility and performance?
  • 5. -- Data [Analysts | Engineers | Scientists] everywhere Just one more ask: SQL as a first-class citizen on Databricks
  • 6. What is Photon? Photon is a new 100% Apache Spark compatible query engine designed for speed and flexibility. It’s built from the ground up to deliver the fastest performance on modern cloud hardware for all data use cases across data engineering, data science, machine learning, and data analytics.
  • 7. Building the next generation query engine • Re-architected for the fastest performance on real-world applications • Native C++ engine for faster queries • Custom built memory management to avoid JVM bottlenecks • Vectorized: memory, instruction, and data parallelism (SIMD) • Works with your existing code and avoids vendor lock-in • 100% compatible with open source Spark DataFrame APIs and Spark SQL • Transparent operation to users - no need to invoke something new, it just works • Optimizing for all data use cases and workloads • Today, supporting SQL and DataFrame workloads • Coming soon, Streaming, Data Science, and more
  • 8. Why build a new execution engine?
  • 9. Photon in the Databricks Lakehouse Platform [Diagram: Client: Submit SQL Query → Spark Driver (JVM): ● Parsing ● Catalyst: Analysis/Planning/Optimization ● Scheduling → Spark Executors (Mixed JVM/Native), each running Execute Task → Delta Lake]
  • 10. Key Photon Characteristics • Hybrid Photon/Spark Plans • Use Photon when possible, fall back to Spark for unsupported operations • Completely transparent to users • Native code using off-heap memory • Natural access to memory and intrinsics (no fiddling with Java Unsafe) • No JVM GC, large heaps ok • No JVM JIT performance cliffs / limitations • Fully integrated with Spark’s memory manager • Prefers hash join over sort-merge join • Rich per-operator performance metrics
  • 12. Development Focus Areas 1. Production Readiness a. Goal: Resilience comparable to DBR → spilling support b. Testing and hardening, real customer workloads 2. Query Coverage a. Today: Basics like joins/aggregations/shuffle, common types and functions b. In development: Nested types, built-in functions c. Coming soon: Sort/Window 3. Performance a. Analyze and optimize common usage patterns
  • 13. Disclaimer: Microbenchmarks Microbenchmarks do not necessarily reflect real-world end-to-end performance During Photon development we analyze and optimize performance with extensive microbenchmarks In the following slides, we share benchmark results that were run in controlled and narrowly scoped scenarios
  • 14. Resilience with Very Large Inputs • Spilling for very large inputs • Write intermediate state to external storage to process inputs exceeding available memory ✅ Hash Shuffle ✅ Hash Aggregation ✅ Hash Join 2-5x Speedup
  • 15. Example: Spilling Hash Join [1 of 4] • Hash join has two phases: build and probe • Build phase: insert records from one join input into the hash table • Hash table has a fixed number of partitions [Diagram: Partitioned Hash Table]
  • 16. Example: Spilling Hash Join [2 of 4] • When memory runs out, spill one partition to disk • New records go to in-memory partitions or straight to disk • Repeat until build is done [Diagram: Partitioned Hash Table]
  • 17. Example: Spilling Hash Join [3 of 4] • Probe phase: process rows from other join input • Emit results for probe rows matching in-memory build partitions • Spill probe rows matching a spilled build partition [Diagram: Partitioned Hash Table, Build, Probe]
  • 18. Example: Spilling Hash Join [4 of 4] • For each spilled partition, repeat the same build/probe process • Might spill again! Apply the same algorithm recursively [Diagram: Build ⨝ Probe]
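The build, spill, probe, and recurse steps on slides 15 to 18 can be sketched in plain Python. This is only an illustration under stated assumptions, not Photon's implementation: "disk" is simulated with lists, the memory budget is counted in rows, and the choice of spill victim (the largest partition), the depth-salted hash, and the `MAX_DEPTH` fallback are all inventions of this sketch. Role reversal is left out.

```python
from collections import defaultdict

NUM_PARTITIONS = 4   # fixed partition count, as on the slides
MAX_DEPTH = 8        # assumption: stop spilling after this many recursion levels


def part_of(key, depth):
    # Salt the hash with the recursion depth so a spilled partition is
    # split differently when it is processed again.
    return hash((key, depth)) % NUM_PARTITIONS


def hash_join(build, probe, budget, depth=0):
    """Inner-join two lists of (key, value) rows.

    Returns (key, build_value, probe_value) tuples. `budget` is the number
    of build rows allowed in memory at once (a stand-in for bytes).
    """
    if depth >= MAX_DEPTH:
        budget = float("inf")  # give up on spilling, e.g. one very hot key
    in_mem = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
    spilled_build = {}                 # partition id -> build rows "on disk"
    spilled_probe = defaultdict(list)  # partition id -> probe rows "on disk"
    rows_in_mem = 0

    # Build phase: insert rows; when over budget, spill one whole partition.
    for key, val in build:
        p = part_of(key, depth)
        if p in spilled_build:
            spilled_build[p].append((key, val))  # straight to disk
            continue
        in_mem[p][key].append(val)
        rows_in_mem += 1
        if rows_in_mem > budget:
            resident = [q for q in range(NUM_PARTITIONS) if q not in spilled_build]
            victim = max(resident,
                         key=lambda q: sum(len(v) for v in in_mem[q].values()))
            rows = [(k, v) for k, vs in in_mem[victim].items() for v in vs]
            spilled_build[victim] = rows
            rows_in_mem -= len(rows)
            in_mem[victim].clear()

    # Probe phase: emit matches for in-memory partitions, spill the rest.
    out = []
    for key, val in probe:
        p = part_of(key, depth)
        if p in spilled_build:
            spilled_probe[p].append((key, val))
        else:
            for build_val in in_mem[p].get(key, ()):
                out.append((key, build_val, val))

    # Each spilled partition is joined by the same algorithm, recursively.
    for p, build_rows in spilled_build.items():
        out.extend(hash_join(build_rows, spilled_probe[p], budget, depth + 1))
    return out
```

Calling `hash_join(build, probe, budget=5)` on inputs larger than five rows forces the spill path; the joined rows are the same as with a large budget, only their order may differ.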
  • 19. Spilling Hash Join vs. Spilling Sort-Merge Join • Photon converts Sort-Merge Joins to Hash Joins • Sort Merge Join • Buffer + sort both join inputs, increasing memory pressure • Spilling sort → write entire input to sorted runs • Hash Join • Only buffer build input (typically the smaller input) in a hash table • Graceful degradation: Spill both inputs at the build-partition granularity • Role reversal: Swap build/probe when processing spilled partitions Up to 5x Speedup
  • 20. Hardening: How we test Photon • Random queries and data • Using new open-source Spark random query generator • Failure injection • Randomly trip error paths to ensure graceful query failure • Spill injection • Randomly trigger spill events to simulate memory pressure • Clang/LLVM C++ tools • Address Sanitizer • Undefined Behavior Sanitizer • Combinations of the above 🐛 🔨
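The failure-injection and spill-injection ideas on the slide above can be illustrated with a toy aggregation. This is a hedged sketch, not Databricks' test harness: `grouped_sum`, `QueryError`, `run_trials`, and the probabilities are hypothetical names and values chosen for the example. The point is the shape of the check: injected spills must never change the result, and injected failures must surface as a clean error.

```python
import random


class QueryError(Exception):
    """Clean, user-visible query failure (hypothetical error type)."""


def grouped_sum(rows, rng, spill_prob=0.0, fail_prob=0.0):
    """Sum values per key, with randomly injected spills and failures."""
    in_mem, spilled = {}, []
    for key, val in rows:
        if rng.random() < fail_prob:
            # Failure injection: trip an error path on purpose.
            raise QueryError("injected failure")
        in_mem[key] = in_mem.get(key, 0) + val
        if rng.random() < spill_prob:
            # Spill injection: flush partial aggregates as if memory ran out.
            spilled.append(in_mem)
            in_mem = {}
    # Merge the spilled partial aggregates with what is still in memory.
    result = {}
    for partial in spilled + [in_mem]:
        for key, val in partial.items():
            result[key] = result.get(key, 0) + val
    return result


def run_trials(trials=200, seed=0):
    """Run random queries on random data and compare with a reference."""
    rng = random.Random(seed)
    outcomes = {"ok": 0, "failed": 0}
    for _ in range(trials):
        rows = [(rng.randrange(5), rng.randrange(100))
                for _ in range(rng.randrange(50))]
        expected = {}
        for key, val in rows:
            expected[key] = expected.get(key, 0) + val
        try:
            got = grouped_sum(rows, rng, spill_prob=0.3, fail_prob=0.01)
        except QueryError:
            outcomes["failed"] += 1  # graceful failure is acceptable
            continue
        assert got == expected, "spilling changed the query result"
        outcomes["ok"] += 1
    return outcomes
```

Fixing the seed keeps a failing trial reproducible, which matters more in randomized testing than the particular probabilities used.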
  • 22. Overview of Query Coverage. Data Types: ✅ Byte/Short/Int/Long ✅ Boolean ✅ String/Binary ✅ Decimal ✅ Float/Double ✅ Date/Timestamp ✅ Struct Coming soon: Array, Map. Operators: ✅ Scan, Filter, Project ✅ Hash Aggregate/Join/Shuffle ✅ Nested-Loop Join ✅ Null-Aware Anti Join ✅ Union, Expand, ScalarSubquery Coming soon: Sort, Window. Expressions: ✅ Comparison / Logic ✅ Arithmetic / Math (most) ✅ Conditional (IF, CASE, etc.) ✅ String (common ones) ✅ Casts ✅ Aggregates (most common ones) ✅ Date/Timestamp (in progress) Coming soon: UDFs, long tail
  • 23. Expression Coverage for DATE/TIMESTAMP • Many queries contain date/timestamp logic • As of today: 95% coverage (100% very soon) • Fast path for UTC timezone (default) • Some expressions are very complicated to implement • Individual functions run in Spark, but still run the operator/plan in Photon
  • 24. Microbenchmarks do not necessarily reflect speedups on end-to-end queries, functions optimized for UTC timezone, your mileage may vary
  • 25. Nested/Complex Type Support • ✅ Struct • Array / Map, in active development • Reading data and basic usage/functions work • In progress: collect_list() / collect_set() • Long tail of array expressions
  • 26. Microbenchmarks do not necessarily reflect speedups on end-to-end queries, your mileage may vary
  • 27. Writing Delta/Parquet Data • Currently supports all scalar types and Struct • Array/Map in active development • Can be turned on/off independently of Photon • spark.databricks.photon.parquetWriter.enabled = true • Typical speedups: 2-4x • Wider (>100 columns) tables can see even more gains
  • 28. DML Support [DELETE / UPDATE / MERGE] • Bulk of work like joins/aggregations run in Photon • Benefits from Photon Delta/Parquet writing capability • Typical speedups: 2-3x ANSI SQL Support • Development in tandem with open-source Spark • Fail queries on overflow or similar errors
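"Fail queries on overflow" can be made concrete with 32-bit integer addition. The sketch below is an illustration only: `add_int32` is a hypothetical helper, and the non-ANSI branch assumes wrap-around in the style of Java `int` arithmetic, which is how the contrast is usually described for Spark with ANSI mode off.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def add_int32(a, b, ansi):
    """Add two 32-bit integers, either failing or wrapping on overflow."""
    total = a + b
    if INT_MIN <= total <= INT_MAX:
        return total
    if ansi:
        # ANSI behaviour: the query fails instead of returning a wrong value.
        raise ArithmeticError("integer overflow")
    # Assumed non-ANSI behaviour: two's-complement wrap-around.
    return (total - INT_MIN) % 2**32 + INT_MIN
```

With `ansi=False`, adding 1 to `INT_MAX` silently yields `INT_MIN`; with `ansi=True` the same call raises, which is the kind of error the slide says should fail the query.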
  • 30. Current/Up Next Efforts in Photon • Finishing nested type support, including writes • Outstanding ANSI SQL behaviors • Sort and Window operators • Support for bucketed tables
  • 31. How to use Photon today
  • 32. Photon: Key Use Cases for Preview. SQL Data Engineering / ELT / ETL (June): ● Enable Photon via Workspace cluster ● Notebook or JAR ● Available on: AWS ● Not supported yet ○ UDFs ○ Streaming. Interactive SQL Analytics (June): ● Photon via Databricks SQL ● Redash ● Tableau ● Microsoft Power BI ● BYO Tool via ODBC / JDBC ● Available on: AWS, Azure ● Not supported yet ○ Sort ○ Window
  • 34. Query: SELECT vendor_id, SUM(trip_distance) as SumTripDistance, AVG(trip_distance) as AvgTripDistance FROM abehm.nyc_yellow WHERE passenger_count IN (1, 2, 4) GROUP BY vendor_id ORDER BY vendor_id. Spark plan: Sort +- Exchange rangepartitioning +- HashAggregate +- Exchange hashpartitioning +- HashAggregate +- Project +- Filter +- ColumnarToRow +- FileScan. Photon plan: Sort +- Exchange +- ColumnarToRow +- PhotonResultStage +- PhotonGroupingAgg +- PhotonShuffleExchangeSource +- PhotonShuffleMapStage +- PhotonShuffleExchangeSink +- PhotonGroupingAgg +- PhotonProject +- PhotonFilter +- PhotonAdapter +- FileScan
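The plan on slide 34 shows the hybrid idea from slide 10: operators run in Photon from the scan upward until an unsupported one (here the range-partitioned exchange feeding Sort) is reached, where a ColumnarToRow node hands rows back to Spark. A minimal sketch of that tagging follows. It is an assumption-laden toy, not the actual planner: `assign_engines` and `PHOTON_SUPPORTED` are hypothetical, the plan is a simple chain, the adapter and stage nodes from the slide are ignored, and once the plan falls back to Spark it is assumed to stay there.

```python
# Operators this sketch treats as Photon-capable, loosely following the
# coverage slide; everything else is assumed to fall back to Spark.
PHOTON_SUPPORTED = {
    "FileScan",
    "Filter",
    "Project",
    "HashAggregate",
    "Exchange hashpartitioning",
}


def assign_engines(plan_bottom_up):
    """Tag each operator of a linear plan (listed scan-first) with an engine.

    Inserts one ColumnarToRow transition where the plan leaves Photon.
    """
    tagged = []
    in_photon = True
    for op in plan_bottom_up:
        if in_photon and op not in PHOTON_SUPPORTED:
            if tagged:
                # Photon works on column batches; Spark operators above
                # the boundary consume rows.
                tagged.append(("ColumnarToRow", "transition"))
            in_photon = False
        tagged.append((op, "photon" if in_photon else "spark"))
    return tagged


plan = [
    "FileScan",
    "Filter",
    "Project",
    "HashAggregate",
    "Exchange hashpartitioning",
    "HashAggregate",
    "Exchange rangepartitioning",
    "Sort",
]
for op, engine in assign_engines(plan):
    print(f"{engine:10} {op}")
```

For the slide's query this tags the scan through the final aggregate as Photon and leaves the range exchange and Sort in Spark, matching the overall shape of the plan shown.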
  • 35. Spark UI ● Yellow → Photon Nodes ● Blue → Spark Nodes Metrics ● Photon nodes have rich metrics to help understand behavior and performance ● Easier than Spark where several nodes are squashed together
  • 38. Customer Feedback: average query response time (seconds) by test date, with reduction from the previous test. June '20, DBR v6.6: 7.8. December '20, Photon: 6.2 (21% reduction). May '21, Photon: 4.4 (29% reduction). Overall: 44% reduction.
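The percentages on the customer feedback slide follow directly from the three response times. A short check, assuming each reduction is computed relative to the earlier measurement and rounded to the nearest whole percent (`reduction_pct` is a helper name made up for this example):

```python
def reduction_pct(before, after):
    """Percentage reduction from `before` to `after`, rounded to a whole percent."""
    return round(100 * (before - after) / before)


dbr_june_20 = 7.8       # seconds, DBR v6.6
photon_dec_20 = 6.2     # seconds, Photon
photon_may_21 = 4.4     # seconds, Photon

print(reduction_pct(dbr_june_20, photon_dec_20))    # June '20 -> December '20
print(reduction_pct(photon_dec_20, photon_may_21))  # December '20 -> May '21
print(reduction_pct(dbr_june_20, photon_may_21))    # June '20 -> May '21 overall
```

The three values come out as 21, 29, and 44, so the 44% figure is the overall reduction from the June '20 baseline rather than the sum of the two steps.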
  • 40. DEMO "Demo" - just a walkthrough showing where users can turn on Photon in Databricks? Note: From getting started to executing existing code/queries and monitoring Photon (Spark UI + Query execution on SQLA)
  • 41. Logo slide with generalized perf observations brought down merge latency by 2-3x
  • 43. Related Talks WEDNESDAY • 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks • 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm, Databricks • 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume THURSDAY • 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics • 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks FRIDAY • 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast & Molly Nagamuthu, Databricks
  • 44. How to get started In June databricks.com/try
  • 45. SQL> SELECT questions FROM audience;
  • 46. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.