
Operational Tips for Deploying Spark by Miklos Christine

Spark Summit East Talk

  1. Operational Tips for Deploying Spark. Miklos Christine, Solutions Engineer, Databricks
  2. $ whoami
     • Previously @ Cloudera
     • Deep Knowledge of Big Data Stack
     • Apache Spark Expert
     • Solutions Engineer @ Databricks!
  3. Agenda
     • Quick Apache Spark Overview
     • Configuration Systems
     • Pipeline Design Best Practices
     • Debugging Techniques
  4. Apache Spark
  5. Spark Configuration
  6. Spark Core Configuration
     • Command Line: spark-defaults.conf, spark-env.sh
     • Programmatically: SparkConf()
     • Hadoop Configs: core-site.xml, hdfs-site.xml
     // Print SparkConfig
     sc.getConf.toDebugString
     // Print Hadoop Config
     val hdConf = sc.hadoopConfiguration.iterator()
     while (hdConf.hasNext) {
       println(hdConf.next().toString())
     }
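The slide prints the configuration already in effect; a minimal sketch of the programmatic path for a standalone application follows (in the shell, sc already exists). The app name, serializer, and memory value here are illustrative examples, not settings from the talk:

     import org.apache.spark.{SparkConf, SparkContext}

     // Build the SparkConf before the context is created; core settings changed later are ignored
     val conf = new SparkConf()
       .setAppName("ops-tips-demo")
       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .set("spark.executor.memory", "4g")
     val sc = new SparkContext(conf)
     println(sc.getConf.toDebugString)   // confirm what the running context actually picked up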
  7. Spark SQL Configuration
     • Set SQL Configs Through SQL Interface:
       SET key=value;
       sqlContext.sql("SET spark.sql.shuffle.partitions=10;")
     • Tools to see current configurations:
       // View SparkSQL Config Properties
       val sqlConf = sqlContext.getAllConfs
       sqlConf.foreach(x => println(x._1 + " : " + x._2))
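The same properties can also be set and read back programmatically; a small sketch against the Spark 1.x SQLContext API used on the slide:

     // Set and verify a single Spark SQL property without going through SQL text
     sqlContext.setConf("spark.sql.shuffle.partitions", "10")
     println(sqlContext.getConf("spark.sql.shuffle.partitions"))   // prints 10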
  8. Spark Pipeline Design
     • File Formats
     • Compression Codecs
     • Spark APIs
     • Job Profiles
  9. File Formats
     • Text File Formats
       – CSV
       – JSON
     • Avro Row Format
     • Parquet Columnar Format
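A rough sketch of moving data between these formats with the DataFrame API (Spark 1.x-era syntax; the CSV and Avro readers come from the separate spark-csv and spark-avro packages, and the paths are placeholders):

     // Read text-based formats
     val jsonDF = sqlContext.read.json("/data/events.json")
     val csvDF = sqlContext.read.format("com.databricks.spark.csv")
       .option("header", "true")
       .load("/data/events.csv")
     // Write columnar Parquet for analytics, or row-oriented Avro for interchange
     jsonDF.write.parquet("/data/events_parquet")
     jsonDF.write.format("com.databricks.spark.avro").save("/data/events_avro")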
  10. Compression Codecs
      • Choose and Analyze Compression Codecs – Snappy, Gzip, LZO
      • Configuration Parameters
        – io.compression.codecs
        – spark.sql.parquet.compression.codec
        – spark.io.compression.codec
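One way to make the codec choice explicit in each of those three places; the Snappy values below are examples, not a recommendation from the talk:

      // Parquet output written by Spark SQL
      sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
      // Hadoop-level codec list used when reading/writing compressed files
      sc.hadoopConfiguration.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.SnappyCodec")
      // Shuffle/spill compression is usually set once in spark-defaults.conf:
      //   spark.io.compression.codec  snappy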
  11. Small Files Problem
      • Small files problem still exists
      • Metadata loading
      • Use coalesce()
      Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
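A minimal sketch of the coalesce() approach, assuming a DataFrame df and a placeholder output path:

      // Compact many small output files into a few larger ones.
      // coalesce() avoids a full shuffle; use repartition() if the data must be rebalanced evenly.
      df.coalesce(8)                  // 8 output files is just an example target
        .write
        .parquet("/data/compacted")   // placeholder path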
  12. Partitioning
      • 2 Types of Partitioning – File level and Spark
      # Get Number of Spark Partitions
      df.rdd.getNumPartitions()
      40
      df.write.partitionBy("colName").saveAsTable("tableName")
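Writing with partitionBy() lays the data out as .../colName=value/ directories, so filters on the partition column can skip whole directories. A sketch, assuming the table written above and a made-up partition value:

      val pruned = sqlContext.table("tableName").filter("colName = 'someValue'")
      pruned.explain()   // the physical plan should reflect partition pruning on colName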
  13. Spark Job Profiles
      • Leverage Spark UI
        – SQL
        – Streaming
  14. Spark Job Profiles
  15. Spark Job Profiles
  16. Job Profiles: Monitoring
      • Monitoring & Metrics
        – Spark
        – Servers
      • Toolset
        – Ganglia
        – Graphite
      Ref: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
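For the Graphite route, Spark's metrics system ships a GraphiteSink that is enabled in conf/metrics.properties; a sketch with placeholder host and port:

      # conf/metrics.properties
      *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
      *.sink.graphite.host=graphite.example.com
      *.sink.graphite.port=2003
      *.sink.graphite.period=10
      *.sink.graphite.unit=seconds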
  17. Debugging Spark
      • Analyze the driver's stack trace
      • Analyze the executors' stack traces – find the first executor failure
      • Review metrics
        – Memory
        – Disk
        – Networking
  18. Top Support Issues
      • OutOfMemoryErrors
        – Driver
        – Executors
      • Out of Disk Space Issues
      • Long GC Pauses
      • API Usage
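The usual first knobs for the memory, disk, and GC items are plain configuration. A sketch of a spark-defaults.conf starting point; every value here is an example to be tuned per workload, not a setting from the talk:

      # spark-defaults.conf – illustrative starting points only
      spark.driver.memory     8g
      spark.executor.memory   8g
      # put shuffle/spill files on a volume with room to spare
      spark.local.dir         /mnt/spark-tmp
      # surface GC behaviour in executor logs when chasing long pauses
      spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCDateStamps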
  19. Top Support Issues
      • Use built-in functions instead of custom UDFs
        – import pyspark.sql.functions
        – import org.apache.spark.sql.functions
      • Examples:
        – to_date()
        – get_json_object()
        – regexp_extract()
      Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
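A small Scala sketch of the built-in route; the column names ts, payload, and path, and the JSON path and regex, are made up for illustration:

      import org.apache.spark.sql.functions.{to_date, get_json_object, regexp_extract}

      // Each expression runs inside the SQL engine, avoiding per-row UDF serialization overhead
      val parsed = df.select(
        to_date(df("ts")).as("day"),
        get_json_object(df("payload"), "$.user.id").as("user_id"),
        regexp_extract(df("path"), "^/api/(\\w+)", 1).as("endpoint"))
      parsed.show()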
  20. Top Support Issues
      • SQL Joins
        – df_users.join(df_orders).explain()
        – set spark.sql.autoBroadcastJoinThreshold
      • Exported Parquet from External Systems
        – spark.sql.parquet.binaryAsString
      • Tune number of Shuffle Partitions
        – spark.sql.shuffle.partitions
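A sketch tying the three knobs together; the join key user_id, the 10 MB threshold, and the partition count 64 are illustrative values, not recommendations from the talk:

      // Broadcast joins: check the plan, then raise the threshold if the small side should broadcast
      sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)
      df_users.join(df_orders, "user_id").explain()   // look for BroadcastHashJoin vs. a shuffle join
      // Parquet written by some external tools stores strings as binary
      sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
      // Size shuffle parallelism to the data instead of the default of 200
      sqlContext.setConf("spark.sql.shuffle.partitions", "64")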
  21. Thank You! mwc@databricks.com https://www.linkedin.com/in/mrchristine
