
Performant data processing with PySpark, SparkR and DataFrame API



Rakuten Technology Conference 2015@Sendai



  1. Performant data processing with PySpark, SparkR and DataFrame API. Ryuji Tamagawa, from Osaka. Many thanks to Holden Karau for the discussion we had about this talk.
  2. Agenda: Who am I? / Spark / Spark and non-JVM languages / DataFrame APIs come to the rescue / Examples
  3. Who am I? A software engineer working for Sky, from architecture design to troubleshooting in the field. A translator working with O’Reilly Japan; ‘Learning Spark’ is my 27th book. Awarded the Rakuten Tech Award Silver 2010 for translating ‘Hadoop: The Definitive Guide’. A bed for six cats.
  4. Works of 2015. Available January 2016?
  5. Works of the past
  6. Motivation for today’s talk: I want to deal with my ‘Big’ data, WITH PYTHON!!
  7. Apache Spark
  8. Apache Spark. You may already have heard a lot: a fast, distributed data processing framework with high-level APIs, written in Scala and running on the JVM. [Stack diagram: OS, HDFS and YARN at the bottom; MapReduce, HBase, Hive, Impala (an in-memory SQL engine) and Spark (Spark Streaming, MLlib, GraphX, Spark SQL) layered on top.]
  9. Why it’s fast: Spark does not need to write temporary data to storage at every step, and does not need to invoke a new JVM process every time. [Diagram contrasting MapReduce, where every map/reduce step pays a JVM invocation plus HDFS I/O, with Spark, where a single executor JVM runs functions f1 through f7 against in-memory RDDs and touches storage only to read input, persist explicitly, or shuffle.]
  10. Apache Spark and non-JVM languages
  11. Spark supports non-JVM languages. Shells: PySpark for Python users, SparkR for R users. GUI environments: Jupyter, RStudio. You can write application code in these languages.
  12. The Web UI tells us a lot: http://<address>:4040
  13. Performance problems with those languages. Data processing in those languages may be several times slower than in JVM languages; the reason lies in the architecture. See https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
  14. The choices you have had: learn Scala; write (more lines of) code in Java; or use non-JVM languages with more CPU cores to make up the performance gap.
  15. DataFrame APIs come to the rescue!
  16. DataFrame: tabular data with a schema, based on RDDs; the successor of SchemaRDD (since 1.4). It has a rich set of APIs for data operations, or you can simply use SQL!
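As a rough illustration of the two styles on that slide, here is a minimal PySpark sketch using the Spark 1.x-era SQLContext API (later versions use SparkSession); the column names and sample rows are invented for the example.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-demo")
    sqlContext = SQLContext(sc)

    # Build a small DataFrame from Python tuples, giving the columns names
    df = sqlContext.createDataFrame(
        [("abc", 1), ("def", 2), ("abc", 3)], ["name", "value"])

    # The same aggregation, once through the DataFrame API...
    df.groupBy("name").count().show()

    # ...and once through plain SQL against a temporary table
    df.registerTempTable("items")
    sqlContext.sql("SELECT name, COUNT(*) AS cnt FROM items GROUP BY name").show()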
  17. Do it within the JVM. When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime; only code goes through. As a result, performance is almost the same as with JVM languages.
  18. DataFrame APIs compared to RDD APIs, by example. [Diagram: with the RDD API, rows of the cached DataFrame are transferred from the executor JVM to a Python process so that ‘lambda items: items[0] == “abc”’ can run there, and the result is transferred back before reaching the driver.]
  19. DataFrame APIs compared to RDD APIs, by example (continued). [Diagram: with the DataFrame API, filter(df[“_1”] == “abc”) is evaluated entirely inside the executor JVM against the cached DataFrame; only the resulting DataFrame goes to the driver.]
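A minimal sketch of the two calls those diagrams describe, reusing the df from the earlier sketch (the column name is illustrative): the RDD-style filter ships every row to a Python worker, while the DataFrame-style filter is evaluated inside the JVM.

    # RDD style: each Row is serialized to a Python worker so the lambda can run there
    filtered_rdd = df.rdd.filter(lambda row: row[0] == "abc")

    # DataFrame style: the predicate is sent as an expression and evaluated in the JVM,
    # so only the (usually much smaller) result ever crosses the JVM/Python boundary
    filtered_df = df.filter(df["name"] == "abc")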
  20. Watch out for UDFs. You can write UDFs in Python, and you can use lambdas too, but once you use them, data flows between the two worlds again: slen = udf(lambda s: len(s), IntegerType()); df.select(slen(df.name)).collect()
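For completeness, here is the slide's UDF fragment as a self-contained sketch with the imports it needs; it assumes the df with a name column from the earlier sketch.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # A Python UDF: every value of df.name is serialized to a Python worker,
    # the lambda runs there, and the result is serialized back to the JVM
    slen = udf(lambda s: len(s), IntegerType())
    df.select(slen(df.name)).collect()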
  21. Make it small first, then use UDFs. Filter or sample your ‘big’ data with the DataFrame APIs, then use UDFs; the SQL optimizer does not take UDFs into account when making plans (so far). [Diagram: ‘BIG’ data in a DataFrame is filtered with ‘native’ APIs into ‘small’ data in a DataFrame, and only then processed with whatever UDF operations you need.]
  22. Make it small first, then use UDFs (continued). The same applies in SQL: slen = udf(lambda s: len(s), IntegerType()); sqlContext.sql(‘select … from df where fname like “tama%” and slen(name)’).collect(). The native like predicate is processed first, before the slen UDF is applied.
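A minimal DataFrame-API sketch of the same pattern, reusing slen from above; the fname and name columns follow the slide's example and are otherwise made up.

    # Native filter first: the like() predicate is evaluated inside the JVM
    small = df.filter(df["fname"].like("tama%"))

    # The Python UDF now only sees the rows that survived the native filter
    small.select(slen(small["name"])).collect()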
  23. Ingesting data. It’s slow to deal with files like CSVs through a non-JVM driver, so convert raw data to ‘DataFrame-native’ formats like Parquet first. Such files can then be processed directly by the JVM processes (executors), even when using non-JVM languages. [Diagram: local data passes through the Python driver and Py4J on the driver machine into the executor JVM’s DataFrame, and is written to HDFS as Parquet.]
  24. Ingesting data (continued). [Diagram: once the data is in Parquet on HDFS, only code travels from the Python driver through Py4J to the executor JVM; the executors read the Parquet files from HDFS directly.]
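A hedged sketch of the one-time conversion step: in the Spark 1.x era CSV reading went through the external spark-csv package (the com.databricks.spark.csv data source, which must be on the classpath; Spark 2.x and later have a built-in csv reader), and the paths here are purely illustrative.

    # One-time conversion: read the raw CSV and write it back out as Parquet
    raw = (sqlContext.read.format("com.databricks.spark.csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("hdfs:///raw/events.csv"))
    raw.write.parquet("hdfs:///warehouse/events.parquet")

    # From now on, executors (JVM) read the Parquet files directly;
    # only driver-side code crosses the Python/JVM boundary
    events = sqlContext.read.parquet("hdfs:///warehouse/events.parquet")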
  25. Appendix: Parquet
  26. Parquet: a general-purpose file format for analytic workloads. Columnar storage reduces I/O significantly, compresses well, and enables projection pushdown. Today’s workloads are becoming CPU-intensive: Parquet offers very fast reads and is designed with CPU internals in mind.
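As a small illustration of projection pushdown from Python, assuming the events Parquet data from the earlier sketch: selecting only the columns you need lets Parquet read just those column chunks instead of the whole file.

    events = sqlContext.read.parquet("hdfs:///warehouse/events.parquet")

    # Only the requested columns are read from disk (projection pushdown);
    # simple predicates can also be pushed down to skip whole row groups
    events.select("name").where(events["value"] > 10).show()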
