Optimizing Apache Spark SQL Joins
Vida Ha
Solutions Architect
About Me
2005 Mobile Web & Voice Search
2012 Reporting & Analytics
2014 Solutions Architect
Evolution of Spark…
2014:
• Spark 1.x
• RDD-based APIs
• Everyday I’m Shufflin’
2017:
• Spark 2.x
• DataFrames & Datasets
• Advanced SQL Catalyst
• Optimizing Joins
Spark SQL Joins
SELECT …
FROM TABLE A
JOIN TABLE B
ON A.KEY1 = B.KEY2
Topics Covered Today
Basic Joins:
• Shuffle Hash Join
• Troubleshooting
• Broadcast Hash Join
• Cartesian Join
Special Cases:
• Theta Join
• One to Many Join
Shuffle Hash Join
A Shuffle Hash Join is the most basic type of join, and goes back to MapReduce fundamentals.
• Map through the two different data frames/tables.
• Use the fields in the join condition as the output key.
• Shuffle both datasets by the output key.
• In the reduce phase, join the two datasets; now any rows of both tables with the same key are on the same machine and are sorted.
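An RDD-level sketch of those steps, using toy data and assuming an active SparkContext named sc (not from the deck):

left = sc.parallelize([("CA", "Alice"), ("RI", "Bob")])
right = sc.parallelize([("CA", "California"), ("RI", "Rhode Island")])

# join() keys on the first tuple element: both RDDs are shuffled by key,
# rows sharing a key meet on the same machine, and their values are paired.
joined = left.join(right)
# [("CA", ("Alice", "California")), ("RI", ("Bob", "Rhode Island"))]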
Shuffle Hash Join
[Diagram: Table 1 and Table 2 are mapped, shuffled by key, and reduced into joined output partitions.]
Shuffle Hash Join Performance
join_rdd = sqlContext.sql("select *
FROM people_in_the_us
JOIN states
ON people_in_the_us.state = states.name")
Works best when the DFs:
• Distribute evenly by the key you are joining on.
• Have an adequate number of keys for parallelism.
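A hedged way to check both properties before joining is to inspect the key distribution, reusing the people_in_the_us table from the example:

key_counts = sqlContext.sql(
    "SELECT state, count(*) AS cnt FROM people_in_the_us GROUP BY state")
# Only ~50 distinct keys signals limited parallelism; a few states
# dominating the counts signals uneven sharding.
key_counts.orderBy("cnt", ascending=False).show()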
Uneven Sharding & Limited Parallelism
Problems:
● Uneven sharding: **all** the data for CA lands in one shard and **all** the data for RI in another, while a small state’s DF stays tiny.
● Limited parallelism: all the data for the US will be shuffled into only 50 keys, one for each of the states, no matter how many input partitions (US DF Partition 1 through Partition N) there are.
[Diagram: US DF Partitions 1..N shuffle into one output partition per state.]
A larger Spark Cluster will not solve these problems!
Uneven Sharding & Limited Parallelism (continued)
[Same diagram as the previous slide.]
Broadcast Hash Join can address this problem if
one DF is small enough to fit in memory.
More Performance Considerations
join_rdd = sqlContext.sql("select *
FROM people_in_california
LEFT JOIN all_the_people_in_the_world
ON people_in_california.id = all_the_people_in_the_world.id")
Final output keys = # of people in CA, so we don’t need a huge Spark cluster, right?
Left Join - Shuffle Step
The size of the Spark cluster needed to run this job is driven by the large table rather than the medium-sized table.
Not a problem here:
● Even sharding
● Good parallelism
The catch: the shuffle moves everything from both tables before dropping the non-matching keys.
[Diagram: All CA DF and All World DF → all the data from both tables is shuffled → final joined output.]
A Better Solution
Filter the World DF down to only the entries that match a CA ID before joining.
Benefits:
● Less data shuffled over the network, and less shuffle space needed.
● More transforms, but still faster.
[Diagram: All World DF → filter transform → Partial World DF, which is shuffled with All CA DF to produce the final joined output.]
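A minimal PySpark sketch of the pattern, assuming the table names from the earlier example; the left semi join here is just one way to express the filter step:

ca_df = sqlContext.table("people_in_california")
world_df = sqlContext.table("all_the_people_in_the_world")

# Filter transform: keep only World rows whose id appears in the CA table.
partial_world_df = world_df.join(ca_df.select("id"), on="id", how="left_semi")

# The shuffle for the real join now only moves the partial World DF.
joined = ca_df.join(partial_world_df, on="id", how="left")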
What’s the Tipping Point for Huge?
● Can’t tell you.
● There aren’t always strict rules for optimizing.
● If you were only considering two small columns from the World RDD in Parquet format, the filtering step may not be worth it.
You should understand your data and its unique properties in order to best optimize your Spark job.
In Practice: Detecting Shuffle Problems
Things to look for:
● Tasks that take much longer to run than others.
● Speculative tasks that are launching.
● Shards that have a lot more input or shuffle output.
Check the Spark UI pages for task-level detail about your Spark job.
Broadcast Hash Join
Optimization: when one of the DFs is small enough to fit in memory on a single machine, broadcast it to every partition of the large DF.
Parallelism of the large DF is maintained (n output partitions), and a shuffle is not even needed.
[Diagram: the Small DF is broadcast to Large DF Partitions 1 through N.]
Broadcast Hash Join
• Often optimal over Shuffle Hash Join.
• Use “explain” to determine whether the Spark SQL Catalyst optimizer has chosen Broadcast Hash Join (example below).
• Should be automatic for many Spark SQL tables; you may need to provide hints for other types.
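A short sketch of both checks, assuming hypothetical large_df and small_df DataFrames; the broadcast hint is the standard PySpark one:

from pyspark.sql.functions import broadcast

# Inspect the physical plan Catalyst chose:
large_df.join(small_df, on="key").explain()  # look for BroadcastHashJoin

# Or hint explicitly that small_df should be broadcast:
joined = large_df.join(broadcast(small_df), on="key")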
Cartesian Join
• A cartesian join can easily explode the number of output rows:
100,000 x 100,000 = 10 billion
• Alternative to a full-blown cartesian join (sketched below):
• Create an RDD of UID-by-UID pairs.
• Force a broadcast of the rows of the table.
• Call a UDF given the UID-by-UID pair to look up the table rows and perform your calculation.
• Time your calculation on a sample set to size your cluster.
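A hedged sketch of that alternative, using a hypothetical features table and a placeholder score() calculation (none of these names come from the deck), and a DataFrame crossJoin to build the UID pairs:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Collect the (small) table once and broadcast it to every executor.
features = sqlContext.table("features")
lookup = sc.broadcast(dict(features.rdd.map(lambda r: (r.uid, r.value)).collect()))

# Every UID-by-UID pair (still quadratic, so time a sample set first).
uids = features.select("uid")
pairs = uids.crossJoin(uids.withColumnRenamed("uid", "uid2"))

def score(a, b):
    # Placeholder calculation against the broadcast rows.
    return float(lookup.value[a] * lookup.value[b])

result = pairs.withColumn("score", udf(score, DoubleType())("uid", "uid2"))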
One To Many Join
• A single row in one table can map to many rows in the second table.
• Can explode the number of output rows.
• Not a problem if you use Parquet: the output files stay small, since the duplicated data encodes well.
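A toy illustration of the explosion (data invented for the example):

customers = sqlContext.createDataFrame([(1, "Alice")], ["cust_id", "name"])
orders = sqlContext.createDataFrame(
    [(1, "a"), (1, "b"), (1, "c")], ["cust_id", "item"])

# One customer row joins to three order rows: three output rows,
# each repeating the 'name' value.
joined = customers.join(orders, on="cust_id")
print(joined.count())  # 3

# In Parquet, the repeated 'name' values dictionary-encode compactly.
joined.write.mode("overwrite").parquet("/tmp/one_to_many_example")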
Theta Join
join_rdd = sqlContext.sql("select *
FROM tableA
JOIN tableB
ON (keyA < keyB + 10)")
• Spark SQL must consider each keyA against each keyB in the example above, looping to see if the theta condition is met.
• Better solution: create buckets that keyA and keyB can be matched on (sketched below).
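A hedged sketch of the bucketing idea, shown for the tighter band variant abs(keyA - keyB) < 10 (not the deck's exact predicate), where bucketing pays off most clearly; the explode trick keeps the join a plain equi-join:

from pyspark.sql import functions as F

width = 10  # bucket width matching the band's range
a = sqlContext.table("tableA").withColumn("bucket", F.floor(F.col("keyA") / width))
b = sqlContext.table("tableB").withColumn("bucket", F.floor(F.col("keyB") / width))

# Duplicate each A-row into its own bucket plus both neighbors so matching
# pairs always share a bucket value, then equi-join on 'bucket'.
a = a.withColumn("bucket", F.explode(F.array(
    F.col("bucket") - 1, F.col("bucket"), F.col("bucket") + 1)))

candidates = a.join(b, on="bucket")
result = candidates.where(F.abs(F.col("keyA") - F.col("keyB")) < width)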
Thank you
Questions?

Editor's Notes
1. PRESENTER: Underline text added for extra emphasis