Operational Tips for Deploying Spark by Miklos Christine

•

4 likes•3,109 views

Spark Summit

Spark Summit East Talk

Data & Analytics

Operational Tips for
Deploying Spark
Miklos Christine
Solutions Engineer
Databricks

$ whoami
• Previously @ Cloudera
• Deep Knowledge of Big Data Stack
• Apache Spark Expert
• Solutions Engineer @ Databricks!

Agenda
• Quick Apache Spark Overview
• Configuration Systems
• Pipeline Design Best Practices
• Debugging Techniques

• Command Line:
spark-defaults.conf
spark-env.sh
• Programmatically:
SparkConf()
• Hadoop Configs:
core-site.xml
hdfs-site.xml
Spark Core Configuration
// Print SparkConfig
sc.getConf.toDebugString
// Print Hadoop Config
val hdConf =
sc.hadoopConfiguration.iterator()
while (hdConf.hasNext){
println(hdConf.next().toString())
}

• Set SQL Configs Through SQL Interface
SET key=value;
sqlContext.sql(“SET spark.sql.shuffle.partitions=10;”)
• Tools to see current configurations
// View SparkSQL Config Properties
val sqlConf = sqlContext.getAllConfs
sqlConf.foreach(x => println(x._1 +" : " + x._2))
Spark SQL Configuration

• File Formats
• Compression Codecs
• Spark APIs
• Job Profiles
Spark Pipeline Design

File Formats
• Text File Formats
– CSV
– JSON
• Avro Row Format
• Parquet Columnar Format

Compression Codecs
• Choose and Analyze Compression Codecs
– Snappy, Gzip, LZO
• Configuration Parameters
– io.compression.codecs
– spark.sql.parquet.compression.codec
– spark.io.compression.codec

Small Files Problem
• Small files problem still exists
• Metadata loading
• Use coalesce()
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

• 2 Types of Partitioning
– File level and Spark
# Get Number of Spark
df.rdd.getNumPartitions()
40
Partitioning
df.write.
partitionBy(“colName”).
saveAsTable(“tableName”)

• Leverage Spark UI
– SQL
– Streaming
Spark Job Profiles

• Monitoring & Metrics
– Spark
– Servers
● Toolset
– Ganglia
– Graphite
Job Profiles: Monitoring
Ref:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

● Analyze the Driver’s stacktrace.
● Analyze the executors stacktraces
– Find the initial executor’s failure.
● Review metrics
– Memory
– Disk
– Networking
Debugging Spark

● OutOfMemoryErrors
– Driver
– Executors
● Out of Disk Space Issues
● Long GC Pauses
● API Usage
Top Support Issues

● Use builtin functions instead of custom UDFs
– import pyspark.sql.functions
– import org.apache.spark.sql.functions
● Examples:
– to_date()
– get_json_object()
– regexp_extract()
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
Top Support Issues

● SQL Joins
– df_users.join(df_orders).explain()
– set spark.sql.autoBroadcastJoinThreshold
● Exported Parquet from External Systems
– spark.sql.parquet.binaryAsString
● Tune number of Shuffle Partitions
– spark.sql.shuffle.partitions
Top Support Issues

Thank You!
mwc@databricks.com
https://www.linkedin.com/in/mrchristine

What's hot

Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks

Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks

Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks

Big Data visualization with Apache Spark and Zeppelinprajods

Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks

Spark Summit EU talk by Jakub HavaSpark Summit

Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan

Continuous Application with FAIR Scheduler with Robert XueDatabricks

Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau

Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks

Operational Tips for Deploying SparkDatabricks

Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Anya Bida

Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks

Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit

Building a unified data pipeline in Apache SparkDataWorks Summit

Continuous Processing in Structured Streaming with Jose TorresDatabricks

Parallelize R Code Using Apache Spark Databricks

Building a High-Performance Database with Scala, Akka, and SparkEvan Chan

The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit

Lambda at Weather Scale by Robbie StricklandSpark Summit

What's hot (20)

Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Presto on Apache Spark: A Tale of Two Computation Engines

Big Data visualization with Apache Spark and Zeppelin

Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...

Spark Summit EU talk by Jakub Hava

Spark Internals Training | Apache Spark | Spark | Anika Technologies

Continuous Application with FAIR Scheduler with Robert Xue

Alpine academy apache spark series #1 introduction to cluster computing wit...

Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...

Operational Tips for Deploying Spark

Spark Tuning For Enterprise System Administrators, Spark Summit East 2016

Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...

Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...

Building a unified data pipeline in Apache Spark

Continuous Processing in Structured Streaming with Jose Torres

Parallelize R Code Using Apache Spark

Building a High-Performance Database with Scala, Akka, and Spark

The Pushdown of Everything by Stephan Kessler and Santiago Mola

Lambda at Weather Scale by Robbie Strickland

Viewers also liked

Top 5 Mistakes When Writing Spark ApplicationsSpark Summit

Lessons Learned From Running Spark On DockerSpark Summit

Getting The Best Performance With PySparkSpark Summit

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks

Beyond Parallelize and Collect by Holden KarauSpark Summit

Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Spark Summit

Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaSpark Summit

Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSpark Summit

Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Spark Summit

Time Series Analysis with Spark by Sandy RyzaSpark Summit

Apache Spark Model Deployment Databricks

Lessons Learned Running Hadoop and Spark in Docker ContainersBlueData, Inc.

Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Spark Summit

Monitoring Spark ApplicationsTzach Zohar

Spark Summit EU talk by Berni SchieferSpark Summit

Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit

Production Readiness Testing At Salesforce Using Spark MLlibSpark Summit

Getting the best performance with PySpark - Spark Summit West 2016Holden Karau

Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Spark Summit

Spark Summit EU 2015: SparkUI visualization: a lens into your applicationDatabricks

Viewers also liked (20)

Top 5 Mistakes When Writing Spark Applications

Lessons Learned From Running Spark On Docker

Getting The Best Performance With PySpark

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das

Beyond Parallelize and Collect by Holden Karau

Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...

Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska

Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...

Time Series Analysis with Spark by Sandy Ryza

Apache Spark Model Deployment

Lessons Learned Running Hadoop and Spark in Docker Containers

Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...

Monitoring Spark Applications

Spark Summit EU talk by Berni Schiefer

Highlights and Challenges from Running Spark on Mesos in Production by Morri ...

Production Readiness Testing At Salesforce Using Spark MLlib

Getting the best performance with PySpark - Spark Summit West 2016

Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...

Spark Summit EU 2015: SparkUI visualization: a lens into your application

Similar to Operational Tips for Deploying Spark by Miklos Christine

20170126 big data processingVienna Data Science Group

Incorta spark integrationDylan Wan

Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson

Spark ProgrammingTaewook Eom

Spark with HDInsightKhalid Salama

Intro to SparkKyle Burke

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks

Building iot applications with Apache Spark and Apache BahirLuciano Resende

Analytics with Cassandra & SparkMatthias Niehoff

Writing Apache Spark and Apache Flink Applications Using Apache BahirLuciano Resende

Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys

From Zero to Stream ProcessingEventador

Meetup spark structured streamingJosé Carlos García Serrano

Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...DataWorks Summit

Sparkstreaming with kafka and h base at scale (1)Sigmoid

Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab

Building Out Your Kafka Developer CDC Ecosystemconfluent

Similar to Operational Tips for Deploying Spark by Miklos Christine (20)

20170126 big data processing

Incorta spark integration

Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)

Spark Programming

Spark with HDInsight

Intro to Spark

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Building iot applications with Apache Spark and Apache Bahir

Analytics with Cassandra & Spark

Writing Apache Spark and Apache Flink Applications Using Apache Bahir

Big Data Processing with .NET and Spark (SQLBits 2020)

From Zero to Stream Processing

Meetup spark structured streaming

Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...

Sparkstreaming with kafka and h base at scale (1)

Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...

Building Out Your Kafka Developer CDC Ecosystem

Recently uploaded

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach

PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408

Decoding Loan Approval: Predictive Modeling in ActionBoston Institute of Analytics

定制英国白金汉大学毕业证（UCB毕业证书）成绩单原版一比一ffjhghh

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh

04242024_CCC TUG_Joins and Relationshipsccctableauusergroup

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure

Brighton SEO | April 2024 | Data StorytellingNeil Barnes

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa

B2 Creative Industry Response Evaluation.docxStephen266013

Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863

Digi Khata Problem along complete plan.pptxTanveerAhmed817946

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo

Recently uploaded (20)

dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt

PKS-TGC-1084-630 - Stage 1 Proposal.pptx

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps

Decoding Loan Approval: Predictive Modeling in Action

定制英国白金汉大学毕业证（UCB毕业证书）成绩单原版一比一

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝

04242024_CCC TUG_Joins and Relationships

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...

Brighton SEO | April 2024 | Data Storytelling

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf

B2 Creative Industry Response Evaluation.docx

Dubai Call Girls Wifey O52&786472 Call Girls Dubai

Digi Khata Problem along complete plan.pptx

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改

Operational Tips for Deploying Spark by Miklos Christine

1. Operational Tips for Deploying Spark Miklos Christine Solutions Engineer Databricks

2. $ whoami • Previously @ Cloudera • Deep Knowledge of Big Data Stack • Apache Spark Expert • Solutions Engineer @ Databricks!

3. Agenda • Quick Apache Spark Overview • Configuration Systems • Pipeline Design Best Practices • Debugging Techniques

4. Apache Spark

5. Spark Configuration

6. • Command Line: spark-defaults.conf spark-env.sh • Programmatically: SparkConf() • Hadoop Configs: core-site.xml hdfs-site.xml Spark Core Configuration // Print SparkConfig sc.getConf.toDebugString // Print Hadoop Config val hdConf = sc.hadoopConfiguration.iterator() while (hdConf.hasNext){ println(hdConf.next().toString()) }

7. • Set SQL Configs Through SQL Interface SET key=value; sqlContext.sql(“SET spark.sql.shuffle.partitions=10;”) • Tools to see current configurations // View SparkSQL Config Properties val sqlConf = sqlContext.getAllConfs sqlConf.foreach(x => println(x._1 +" : " + x._2)) Spark SQL Configuration

8. • File Formats • Compression Codecs • Spark APIs • Job Profiles Spark Pipeline Design

9. File Formats • Text File Formats – CSV – JSON • Avro Row Format • Parquet Columnar Format

10. Compression Codecs • Choose and Analyze Compression Codecs – Snappy, Gzip, LZO • Configuration Parameters – io.compression.codecs – spark.sql.parquet.compression.codec – spark.io.compression.codec

11. Small Files Problem • Small files problem still exists • Metadata loading • Use coalesce() Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

12. • 2 Types of Partitioning – File level and Spark # Get Number of Spark df.rdd.getNumPartitions() 40 Partitioning df.write. partitionBy(“colName”). saveAsTable(“tableName”)

13. • Leverage Spark UI – SQL – Streaming Spark Job Profiles

14. Spark Job Profiles

15. Spark Job Profiles

16. • Monitoring & Metrics – Spark – Servers ● Toolset – Ganglia – Graphite Job Profiles: Monitoring Ref: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

17. ● Analyze the Driver’s stacktrace. ● Analyze the executors stacktraces – Find the initial executor’s failure. ● Review metrics – Memory – Disk – Networking Debugging Spark

18. ● OutOfMemoryErrors – Driver – Executors ● Out of Disk Space Issues ● Long GC Pauses ● API Usage Top Support Issues

19. ● Use builtin functions instead of custom UDFs – import pyspark.sql.functions – import org.apache.spark.sql.functions ● Examples: – to_date() – get_json_object() – regexp_extract() Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions Top Support Issues

20. ● SQL Joins – df_users.join(df_orders).explain() – set spark.sql.autoBroadcastJoinThreshold ● Exported Parquet from External Systems – spark.sql.parquet.binaryAsString ● Tune number of Shuffle Partitions – spark.sql.shuffle.partitions Top Support Issues

21. Thank You! mwc@databricks.com https://www.linkedin.com/in/mrchristine

Operational Tips for Deploying Spark by Miklos Christine

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Operational Tips for Deploying Spark by Miklos Christine

Similar to Operational Tips for Deploying Spark by Miklos Christine (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Operational Tips for Deploying Spark by Miklos Christine