
20171012 found IT #9 PySparkの勘所 (Key Points of PySpark)


Published on: slides presented at https://foundit-project.connpass.com/event/66468/.

Published in: Technology


  1. PySpark found IT project #9 @
  2. ▸ facebook: Ryuji Tamagawa ▸ Twitter: tamagawa_ryuji ▸ FB: found IT project
  3. 11
  4. Wes McKinney blog ▸ http://qiita.com/tamagawa-ryuji
  5.
  6. ▸ Spark and Hadoop ▸ PySpark ▸ Spark/Hadoop and PyData
  7.
  8. PySpark ▸ SSD ▸ CPU ▸ Parquet / S3 / CPU
  9. https://www.slideshare.net/kumagi/ss-78765920/4
  10. ▸ groupby ▸ Spark API
  11. Spark and Hadoop
  12. (Diagram: evolution of the stack. Hadoop 0.x: OS / HDFS / MapReduce. Hadoop 1.x: OS / HDFS / MapReduce with Hive etc. and HBase on top. Hadoop 2.x + Spark: YARN or Mesos scheduling Spark (Spark Streaming, MLlib, GraphX, Spark SQL) and Impala for SQL; Spark can also run standalone.)
  13. ▸ Amazon EMR ▸ Microsoft Azure HDInsight ▸ Cloudera Altus ▸ Databricks Community Edition (Spark) ▸ PyData + Jupyter (PySpark)
  14. (Diagram: Hadoop MapReduce launches a JVM per map/reduce stage and goes through HDFS between stages; Spark applies a chain of functions f1…f7 to RDDs inside long-running Executor JVMs.)
  15. Spark and Hadoop ▸ Hadoop MapReduce ▸ Spark API vs. MapReduce API ▸ Hadoop
  16. PySpark / (Py)Spark ▸ Spark ▸ PyData ▸ Spark ▸ Spark/Hadoop + PyData = PySpark
  17. Spark 1.2 PySpark … (Py)Spark
  18. PySpark
  19. PySpark has two APIs: the RDD API and the DataFrame API ▸ an RDD (Resilient Distributed Dataset) is Spark's core distributed collection, implemented on the JVM ▸ a DataFrame is built on top of RDDs and resembles R's data.frame ▸ from Python, the RDD API and the DataFrame API compare very differently against Scala / Java
  20. Prefer the DataFrame-based stack in PySpark: RDD → DataFrame / Dataset, MLlib → ML, GraphX → GraphFrames, Spark Streaming → Structured Streaming
  21. (Diagram: two cluster layouts, each with a Driver JVM and worker nodes running Executor JVMs over storage. With the RDD API, every Executor JVM exchanges data with a Python VM on its worker node; with the DataFrame API, processing stays inside the Executor JVMs.)
  22. PySpark performance ▸ with the RDD API, data is shuttled between each Executor JVM and a Python VM ▸ with the DataFrame API, processing stays inside the JVM ▸ a Python UDF still pulls rows into the Python VM ▸ UDFs can be written in Scala / Java instead ▸ on Spark 2.x, use the DataFrame API
  23. Spark and PyData
  24. Spark and PyData ▸ Spark ▸ Python / PyData ▸ Parquet ▸ Apache Arrow
  25. Spark and PyData ▸ CSV and JSON vs. Parquet ▸ Parquet can be written from the Spark DataFrame API and, from Python, with fastparquet or pyarrow ▸ see "Performance comparison of different file formats and storage engines in the Hadoop ecosystem"
  26. Spark and PyData: Parquet is a columnar format (https://parquet.apache.org/documentation/latest/). (Diagram: compared with zip-compressed CSV, Parquet stores data in row blocks laid out column by column, COLUMN #0 rows 0…N, then COLUMN #1, and so on, so scanning a subset of columns needs far less I/O.)
  27. Spark and PyData: writing Parquet.

Spark:

```python
df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
df.write.save(filename, compression='snappy')
```

fastparquet:

```python
import pandas as pd
from fastparquet import write
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')
```

pyarrow:

```python
import pyarrow as pa
import pyarrow.parquet as pq
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')
```
  28. Spark and PyData ▸ read CSV with pandas, hand the data to Spark, then bring Spark results back to pandas … ▸ Spark → pandas conversion ▸ pandas → Spark conversion … ▸ Apache Arrow
  29. Spark and PyData: Apache Arrow ▸ a PyData / OSS project ▸ https://arrow.apache.org
  30. Spark and PyData: Wes McKinney's blog ▸ pandas and Apache Arrow ▸ blog posts translated on Qiita with Wes's OK ▸ on Apache Arrow and pandas: https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144
  31. PySpark
  32.
  33. 11
