Internals of Speeding up PySpark with Arrow

Back in the old days of Apache Spark, using Python with Spark was an exercise in patience. Data moved back and forth between Python and Scala, being serialised and deserialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as did the constant improvement of the optimisers (Catalyst and Tungsten). But since Spark 2.3, PySpark has sped up tremendously thanks to the addition of the Arrow serialisers. In this talk you will learn how the Spark Scala core communicates with the Python processes, how data is exchanged across both sub-systems, and the development efforts present and underway to make it as fast as possible.

  1. WHOAMI > Ruben Berenguel (@berenguel) > PhD in Mathematics > (big) data consultant > Lead data engineer using Python, Go and Scala > Right now at Affectv #UnifiedDataAnalytics #SparkAISummit
  2. What is Pandas? #UnifiedDataAnalytics #SparkAISummit
  3. What is Pandas? > Python Data Analysis library #UnifiedDataAnalytics #SparkAISummit
  4. What is Pandas? > Python Data Analysis library > Used everywhere data and Python appear in job offers #UnifiedDataAnalytics #SparkAISummit
  5. What is Pandas? > Python Data Analysis library > Used everywhere data and Python appear in job offers > Efficient (is columnar and has a C and Cython backend) #UnifiedDataAnalytics #SparkAISummit
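
The efficiency claim on slide 5 comes from the columnar layout: each column of a DataFrame is a single typed NumPy array in contiguous memory. A minimal sketch (the column names here are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"id": np.arange(5), "x": np.random.rand(5)})
    print(df.dtypes)                             # one dtype per column
    print(type(df["x"].values))                  # <class 'numpy.ndarray'>
    print(df["x"].values.flags["C_CONTIGUOUS"])  # True: one contiguous buffer
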
  6-15. (image-only slides)
  16. HOW DOES PANDAS MANAGE COLUMNAR DATA? #UnifiedDataAnalytics #SparkAISummit
  17-18. (image-only slides)
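
As slide 32 later summarises, Pandas groups columns of the same dtype into blocks owned by a BlockManager. A minimal sketch, using a private attribute whose name varies across Pandas versions (_data in older releases, _mgr in newer ones):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0], "c": ["x", "y", "z"]})
    mgr = getattr(df, "_mgr", None) or df._data  # private API, version dependent
    print(mgr)  # shows one block per dtype: int64, float64, object
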
  19. WHAT IS ARROW? #UnifiedDataAnalytics #SparkAISummit
  20. WHAT IS ARROW? > Cross-language in-memory columnar format library #UnifiedDataAnalytics #SparkAISummit
  21. WHAT IS ARROW? > Cross-language in-memory columnar format library > Optimised for efficiency across languages #UnifiedDataAnalytics #SparkAISummit
  22. WHAT IS ARROW? > Cross-language in-memory columnar format library > Optimised for efficiency across languages > Integrates seamlessly with Pandas #UnifiedDataAnalytics #SparkAISummit
  23. HOW DOES ARROW MANAGE COLUMNAR DATA? #UnifiedDataAnalytics #SparkAISummit
  24-29. (image-only slides)
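
As slide 31 later summarises, Arrow organises data into RecordBatches whose columns are plain contiguous buffers, typically a validity bitmap plus a values buffer. A minimal PyArrow sketch of that layout:

    import pyarrow as pa

    arr = pa.array([1, 2, None, 4])
    print(arr.type)        # int64
    print(arr.null_count)  # 1
    for buf in arr.buffers():
        # first buffer: validity bitmap, second buffer: the 64-bit values
        print(buf if buf is None else buf.size)
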
  30. ARROW ❤ PANDAS #UnifiedDataAnalytics #SparkAISummit
  31. ARROW ❤ PANDAS > Arrow uses RecordBatches #UnifiedDataAnalytics #SparkAISummit
  32. ARROW ❤ PANDAS > Arrow uses RecordBatches > Pandas uses blocks handled by a BlockManager #UnifiedDataAnalytics #SparkAISummit
  33. ARROW ❤ PANDAS > Arrow uses RecordBatches > Pandas uses blocks handled by a BlockManager > You can convert an Arrow Table into a Pandas DataFrame easily #UnifiedDataAnalytics #SparkAISummit
  34. (image-only slide)
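
A minimal sketch of the round trip described on slide 33, assuming PyArrow is installed; the column names are made up:

    import pandas as pd
    import pyarrow as pa

    pdf = pd.DataFrame({"id": [1, 2, 3], "spent": [10.0, 20.0, 30.0]})

    table = pa.Table.from_pandas(pdf)  # Pandas blocks -> Arrow columns
    print(table.schema)

    back = table.to_pandas()           # Arrow columns -> Pandas blocks
    print(back.equals(pdf))            # True
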
  35. WHAT IS SPARK? #UnifiedDataAnalytics #SparkAISummit
  36. WHAT IS SPARK? > Distributed Computation framework #UnifiedDataAnalytics #SparkAISummit
  37. WHAT IS SPARK? > Distributed Computation framework > Open source #UnifiedDataAnalytics #SparkAISummit
  38. WHAT IS SPARK? > Distributed Computation framework > Open source > Easy to use #UnifiedDataAnalytics #SparkAISummit
  39. WHAT IS SPARK? > Distributed Computation framework > Open source > Easy to use > Scales horizontally and vertically #UnifiedDataAnalytics #SparkAISummit
  40. HOW DOES SPARK WORK? #UnifiedDataAnalytics #SparkAISummit
  41. SPARK USUALLY RUNS ON TOP OF A CLUSTER MANAGER
  42. AND A DISTRIBUTED STORAGE
  43. A SPARK PROGRAM RUNS IN THE DRIVER
  44-47. THE DRIVER REQUESTS RESOURCES FROM THE CLUSTER MANAGER TO RUN TASKS
  48. THE MAIN BUILDING BLOCK IS THE RDD: RESILIENT DISTRIBUTED DATASET #UnifiedDataAnalytics #SparkAISummit
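
A minimal sketch of the RDD API from PySpark, assuming an already-created SparkSession named spark:

    rdd = spark.sparkContext.parallelize(range(10))
    squares = rdd.map(lambda x: x * x)         # lazy: nothing runs yet
    print(squares.reduce(lambda a, b: a + b))  # 285, computed on the executors
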
  49. PYSPARK #UnifiedDataAnalytics #SparkAISummit
  50. PYSPARK OFFERS A PYTHON API TO THE SCALA CORE OF SPARK #UnifiedDataAnalytics #SparkAISummit
  51. IT USES THE PY4J BRIDGE #UnifiedDataAnalytics #SparkAISummit
  52. # Connect to the gateway
      gateway = JavaGateway(
          gateway_parameters=GatewayParameters(
              port=gateway_port,
              auth_token=gateway_secret,
              auto_convert=True))
      # Import the classes used by PySpark
      java_import(gateway.jvm, "org.apache.spark.SparkConf")
      java_import(gateway.jvm, "org.apache.spark.api.java.*")
      java_import(gateway.jvm, "org.apache.spark.api.python.*")
      ...
      return gateway
      #UnifiedDataAnalytics #SparkAISummit
  53-59. (image-only slides)
  60. THE MAIN ENTRYPOINTS ARE RDD AND PipelinedRDD(RDD) #UnifiedDataAnalytics #SparkAISummit
  61. PipelinedRDD BUILDS A PythonRDD IN THE JVM
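
A minimal sketch of where PipelinedRDD shows up, again assuming a SparkSession named spark. Applying a Python function to an RDD returns a PipelinedRDD, which builds a PythonRDD in the JVM once an action runs:

    rdd = spark.sparkContext.parallelize(range(4))
    mapped = rdd.map(lambda x: x + 1)
    print(type(rdd).__name__)     # RDD
    print(type(mapped).__name__)  # PipelinedRDD
    print(mapped.collect())       # [1, 2, 3, 4]
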
  62-74. (image-only slides)
  75. THE MAGIC IS IN compute
  76. compute IS RUN ON EACH EXECUTOR AND STARTS A PYTHON WORKER VIA PythonRunner
  77. Workers act as standalone processors of streams of data #UnifiedDataAnalytics #SparkAISummit
  78. Workers act as standalone processors of streams of data > Connects back to the JVM that started it #UnifiedDataAnalytics #SparkAISummit
  79. Workers act as standalone processors of streams of data > Connects back to the JVM that started it > Loads included Python libraries #UnifiedDataAnalytics #SparkAISummit
  80. Workers act as standalone processors of streams of data > Connects back to the JVM that started it > Loads included Python libraries > Deserializes the pickled function coming from the stream #UnifiedDataAnalytics #SparkAISummit
  81. Workers act as standalone processors of streams of data > Connects back to the JVM that started it > Loads included Python libraries > Deserializes the pickled function coming from the stream > Applies the function to the data coming from the stream #UnifiedDataAnalytics #SparkAISummit
  82. Workers act as standalone processors of streams of data > Connects back to the JVM that started it > Loads included Python libraries > Deserializes the pickled function coming from the stream > Applies the function to the data coming from the stream > Sends the output back #UnifiedDataAnalytics #SparkAISummit
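
A conceptual sketch of the worker loop described on slides 77-82. This is not the real pyspark/worker.py: the actual protocol uses length-framed messages, control codes and cloudpickle via pyspark.serializers, and the port argument plus the None end-of-stream sentinel below are simplifications for illustration only:

    import pickle
    import socket

    def worker_loop(port):
        # connect back to the JVM that started this worker
        sock = socket.create_connection(("127.0.0.1", port))
        infile, outfile = sock.makefile("rb"), sock.makefile("wb")

        func = pickle.load(infile)              # deserialise the pickled function
        while True:
            record = pickle.load(infile)        # next record from the stream
            if record is None:                  # hypothetical end-of-stream marker
                break
            pickle.dump(func(record), outfile)  # apply and send the output back
        outfile.flush()
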
  83. … #UnifiedDataAnalytics #SparkAISummit
  84. BUT… WASN'T SPARK MAGICALLY OPTIMISING EVERYTHING? #UnifiedDataAnalytics #SparkAISummit
  85. YES, FOR SPARK DataFrame #UnifiedDataAnalytics #SparkAISummit
  86. SPARK WILL GENERATE A PLAN (A DIRECTED ACYCLIC GRAPH) TO COMPUTE THE RESULT
  87. AND THE PLAN WILL BE OPTIMISED USING CATALYST
  88. DEPENDING ON THE FUNCTION, THE OPTIMISER WILL CHOOSE PythonUDFRunner OR PythonArrowRunner (BOTH EXTEND PythonRunner) #UnifiedDataAnalytics #SparkAISummit
  89. IF WE CAN DEFINE OUR FUNCTIONS USING PANDAS Series TRANSFORMATIONS WE CAN SPEED UP PYSPARK CODE FROM 3X TO 100X! #UnifiedDataAnalytics #SparkAISummit
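
A minimal sketch of such a Series-based transformation: a scalar Pandas UDF, written against the Spark 2.x API used elsewhere in this deck. The column name and the multiplier are made up; the function receives and returns whole pandas Series, shipped through Arrow in batches:

    from pyspark.sql.functions import pandas_udf, PandasUDFType, rand

    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_vat(price):      # price arrives as a pandas Series per Arrow batch
        return price * 1.21   # vectorised Series arithmetic, no per-row Python calls

    df = spark.range(1 << 20).withColumn("price", rand() * 100)
    df.withColumn("price_with_vat", plus_vat("price")).show(5)
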
  90. QUICK EXAMPLES #UnifiedDataAnalytics #SparkAISummit
  91-92. THE BASICS: toPandas
      from pyspark.sql.functions import rand

      df = spark.range(1 << 20).toDF("id").withColumn("x", rand())
      spark.conf.set("spark.sql.execution.arrow.enabled", "false")
      pandas_df = df.toPandas()  # we'll time this
      #UnifiedDataAnalytics #SparkAISummit
  93. from pyspark.sql.functions import rand

      df = spark.range(1 << 20).toDF("id").withColumn("x", rand())
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      pandas_df = df.toPandas()  # we'll time this
      #UnifiedDataAnalytics #SparkAISummit
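
A minimal sketch of how the two runs above could be timed, reusing the df defined on slide 91:

    import time

    for arrow_enabled in ("false", "true"):
        spark.conf.set("spark.sql.execution.arrow.enabled", arrow_enabled)
        start = time.time()
        df.toPandas()
        print("arrow.enabled =", arrow_enabled, round(time.time() - start, 2), "s")
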
  94-95. THE FUN: .groupBy
      from pyspark.sql.functions import rand, randn, floor
      from pyspark.sql.functions import pandas_udf, PandasUDFType

      df = (spark.range(20000000).toDF("row").drop("row")
            .withColumn("id", floor(rand()*10000))
            .withColumn("spent", (randn()+3)*100))

      @pandas_udf("id long, spent double", PandasUDFType.GROUPED_MAP)
      def subtract_mean(pdf):
          spent = pdf.spent
          return pdf.assign(spent=spent - spent.mean())

      df_to_pandas_arrow = df.groupby("id").apply(subtract_mean).toPandas()
      #UnifiedDataAnalytics #SparkAISummit
  96-97. BEFORE, YOU MAY HAVE DONE SOMETHING LIKE…
      import numpy as np
      from pyspark.sql.functions import collect_list

      grouped = df2.groupby("id").agg(collect_list('spent').alias("spent_list"))
      as_pandas = grouped.toPandas()
      as_pandas["mean"] = as_pandas["spent_list"].apply(np.mean)
      as_pandas["subtracted"] = as_pandas["spent_list"].apply(np.array) - as_pandas["mean"]
      df_to_pandas = as_pandas.drop(columns=["spent_list", "mean"]).explode("subtracted")
      #UnifiedDataAnalytics #SparkAISummit
  98. TL;DR: USE ARROW AND PANDAS UDFS (in PySpark) #UnifiedDataAnalytics #SparkAISummit
  99. RESOURCES > Spark documentation > High Performance Spark by Holden Karau > The Internals of Apache Spark 2.4.2 by Jacek Laskowski > Spark's Github > Become a contributor #UnifiedDataAnalytics #SparkAISummit
  100. QUESTIONS? #UnifiedDataAnalytics #SparkAISummit
  101. THANKS! #UnifiedDataAnalytics #SparkAISummit
  102. Get the slides from my github: github.com/rberenguel/. The repository is pyspark-arrow-pandas
  103. FURTHER REFERENCES #UnifiedDataAnalytics #SparkAISummit
  104. ARROW > Arrow's home > Arrow's github > Arrow speed benchmarks > Arrow to Pandas conversion benchmarks > Post: Streaming columnar data with Apache Arrow > Post: Why Pandas users should be excited by Apache Arrow > Code: Arrow-Pandas compatibility layer code > Code: Arrow Table code > PyArrow in-memory data model > Ballista: a POC distributed compute platform (Rust) > PyJava: POC on Java/Scala and Python data interchange with Arrow #UnifiedDataAnalytics #SparkAISummit
  105. PANDAS > Pandas' home > Pandas' github > Guide: Idiomatic Pandas > Code: Pandas internals > Design: Pandas internals > Talk: Demystifying Pandas' internals, by Marc Garcia > Memory Layout of Multidimensional Arrays in numpy #UnifiedDataAnalytics #SparkAISummit
  106. SPARK/PYSPARK > Code: PySpark serializers > JIRA: First steps to using Arrow (only in the PySpark driver) > Post: Speeding up PySpark with Apache Arrow > Original JIRA issue: Vectorized UDFs in Spark > Initial doc draft > Post by Bryan Cutler (leader for the Vec UDFs PR) > Post: Introducing Pandas UDF for PySpark > Code: org.apache.spark.sql.vectorized > Post by Bryan Cutler: Spark toPandas() with Arrow, a Detailed Look #UnifiedDataAnalytics #SparkAISummit
  107. PY4J > Py4J's home > Py4J's github > Code: Reflection engine #UnifiedDataAnalytics #SparkAISummit
  108. TABLE FOR toPandas
      2^x (rows)   Direct (s)   With Arrow (s)   Factor
      17           1.08         0.18             5.97
      18           1.69         0.26             6.45
      19           4.16         0.30             13.87
      20           5.76         0.61             9.44
      21           9.73         0.96             10.14
      22           17.90        1.64             10.91
      23           (OOM)        3.42             n/a
      24           (OOM)        11.40            n/a
      #UnifiedDataAnalytics #SparkAISummit
  109. EOF #UnifiedDataAnalytics #SparkAISummit
