Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

An Insider’s Guide to Maximizing Spark SQL Performance


Published on

Hadoop / Spark Conference Japan 2019

Published in: Software
  • Be the first to comment

An Insider’s Guide to Maximizing Spark SQL Performance

  1. 1. An Insider’s Guide to Maximizing Spark SQL Performance Xiao Li @ gatorsmile Japanese Hadoop/Spark Conf @ Tokyo | Mar 2019 1
  2. 2. About Me • Engineering Manager at Databricks • Apache Spark Committer and PMC Member • Previously, IBM Master Inventor • Spark SQL, Database Replication, Information Integration • Ph.D. in University of Florida • Github: gatorsmile
  3. 3. Databricks Customers Across Industries Financial Services Healthcare & Pharma Media & Entertainment Technology Public Sector Retail & CPG Consumer Services Energy & Industrial IoTMarketing & AdTech Data & Analytics Services
  4. 4. DATABRICKS WORKSPACE Databricks Delta ML Frameworks DATABRICKS CLOUD SERVICE DATABRICKS RUNTIME Reliable & Scalable Simple & Integrated Databricks Unified Analytics Platform APIs Jobs Models Notebooks Dashboards End to end ML lifecycle
  5. 5. Apache Spark 3.x 5 Catalyst Optimization & Tungsten Execution SparkSession / DataFrame / DataSet APIs SQL Spark ML Spark Streaming Spark Graph 3rd-party Libraries Spark CoreData Source Connectors
  6. 6. Apache Spark 3.x 6 Catalyst Optimization & Tungsten Execution SparkSession / DataFrame / DataSet APIs SQL Spark ML Spark Streaming Spark Graph 3rd-party Libraries Spark CoreData Source Connectors
  7. 7. From declarative queries to RDDs 7 SQL Dataset DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan DAGsPhysical Plans Selected Physical Plan CostModel Metadata Catalog Cache Manager Analyzer Optimizer Planner Query Execution Parser RDD Results
  8. 8. 8 Maximize Performance
  9. 9. 9 Read Plan. Interpret Plan. Tune Plan. Track Execution.
  10. 10. Read Plans from SQL Tab in either Spark UI or Spark History Server 10 Spark 3.0: Show the actual SQL statement? [SPARK-27045]
  11. 11. Page: In Details for SQL Query 11
  12. 12. 12 Parsed Plan Analyzed Plan Optimized Plan Physical Plan
  13. 13. 13 Understand and Tune Plans
  14. 14. 14 Different Results!!!
  15. 15. 15 Read the analyzed plan to check the implicit type casting. Tip: Explicitly cast the types in the queries.
  16. 16. 16 Read the analyzed plan to check the implicit type casting. Tip: Explicitly cast the types in the queries.
  17. 17. Hive-serde Tables 17 Syntax to create a Hive Serde table
  18. 18. Hive serde reader
  19. 19. 19 filter pushdown Native reader/writer performs faster than Hive serde reader/writer
  20. 20. 20 Native Data Source Tables Syntax to create a Spark native ORC table Tip: Create native data source tables for better performance and stability.
  21. 21. 21 Push Down + Implicit Type Casting Not pushed down??? Tip: Cast is needed? Update the constants?
  22. 22. Nested Schema Pruning 22Not pruned???
  23. 23. Nested Schema Pruning 23
  24. 24. Collapse Projects 24 Call UDF three times!!!
  25. 25. Collapse Projects 25
  26. 26. Cross-session SQL Cache 26 • If a query is cached in the one session, the new queries in all the sessions might be impacted. • Check your query plan!
  27. 27. 27
  28. 28. 28 Track Execution
  29. 29. From SQL query to Spark Jobs 29
  30. 30. 30 • A SQL query => multiple Spark jobs. • For example, broadcast exchange, shuffle exchange, non-correlated subquery, lazily cached queries. • A Spark job => A DAG • A chain of RDD dependencies organized in a directed acyclic graph (DAG)
  31. 31. 31 The higher level SQL physical operators. Optimized ogical Plan DAGsPhysical Plans Selected Physical Plan CostModel he ger r Planner Query ExecutionQuery Execution The low level Spark RDD primitives.
  32. 32. Job Tab in Spark UI 32 The amount of time for each job. Any stage/task failure?
  33. 33. Job Tab 33 The amount of time for each stage. • Jobs • Stages • Tasks
  34. 34. Stages Tab 34 • How the time are spent? • Any outlier in task execution? • Straggler tasks? • Skew in data size, compute time? • Too many/few tasks (partitions)? • Load balanced? Locality? Tasks specific info
  35. 35. 35 Balanced? Skew? Killed? Which executor’s log we should read?
  36. 36. Executors Tab 36 size of data transferred between stages used/available memory All the problematic executors in the same node?
  37. 37. 37 - Interacting with Hive metastore? - Slow query planning?
  38. 38. Storage Tab 38 Tip: Caching query results might be slower if your memory is not large enough.
  39. 39. Storage Tab 39 Tip: Unpersist the cache if not used.
  40. 40. Additional Resources 40 • Apache Spark document: guide.html • Blog: • Previous summit: america/sessions • Databricks document: • Books: • Databricks academy: • Databricks ebooks:
  41. 41. Apache Spark™ • Use Cases • Research • Technical Deep Dives AI • Productionizing ML • Deep Learning • Cloud Hardware Fields • Data Science • Data Engineering • Enterprise 5000+ ATTENDEES Practitioners: Data Scientists, Data Engineers, Analysts, Architects Leaders: Engineering Management, VPs, Heads of Analytics & Data, CxOs TRACKS
  42. 42. 42 Nike: Enabling Data Scientists to bring their Models to Market Facebook: Vectorized Query Execution in Apache Spark at Facebook Tencent: Large-scale Malicious Domain Detection with Spark AI IBM: In-memory storage Evolution in Apache Spark Capital One: Apache Spark and Sights at Speed: Streaming, Feature management and Execution Apple: Making Nested Columns as First Citizen in Apache Spark SQL EBay: Managing Apache Spark workload and automatic optimizing. Google: Validating Spark ML Jobs HP: Apache Spark for Cyber Security in big company Microsoft: Apache Spark Serving: Unifying Batch, Streaming and RESTful Serving ABSA Group: A Mainframe Data Source for Spark SQL and Streaming Facebook: an efficient Facebook-scale shuffle service IBM: Make your PySpark Data Fly with Arrow! Facebook : Distributed Scheduling Framework for Apache Spark Zynga: Automating Predictive Modeling at Zynga with PySpark World Bank: Using Crowdsourced Images to Create Image Recognition Models and NLP to Augment Global Trade indicator Optimizing Performance and Computing Resource. Microsoft: Azure Databricks with R: Deep Dive Airbnb: Apache Spark at Airbnb Netflix: Migrating to Apache Spark at Netflix Microsoft: Infrastructure for Deep Learning in Apache Spark Intel: Game playing using AI on Apache Spark Facebook: Scaling Apache Spark @ Facebook Lyft: Scaling Apache Spark on K8S at Lyft Uber: Using Spark Mllib Models in a Production Training and Serving Platform Apple: Bridging the gap between Datasets and DataFrames Salesforce: The Rule of 10,000 Spark Jobs Target: Lessons in Linear Algebra at Scale with Apache Spark ICL: Cooperative Task Execution for Apache Spark Workday: Lesson Learned Using Apache Spark
  43. 43. Thank you Xiao Li ( 43