20170927 PyData Tokyo: The Basics of the Basics of Distributed Processing for Data Science People, and the Key Points of PySpark

Presentation given at PyData.Tokyo on 2017/9/27.

Published in: Technology

  1. PySpark @
  2. ▸ facebook : Ryuji Tamagawa ▸ Twitter : tamagawa_ryuji ▸ FB pydata.tokyo ▸ Twitter
  3. 8 11
  4. Wes McKinney blog ▸ http://qiita.com/tamagawa-ryuji
  5. ▸ ▸ CPU ▸ PyData.Tokyo ▸ PySpark
  6. ▸ ▸ ▸ Spark Hadoop ▸ PySpark ▸ Spark/Hadoop PyData
  7. ▸ ▸ ▸
  8. PySpark ▸ ▸ SSD ▸ CPU ▸ Parquet S3 CPU
  9. https://www.slideshare.net/kumagi/ss-78765920/4
  10. ▸ ▸ ▸ groupby ▸
  11. ▸ ▸
  12. N ▸ N N ▸ …
  13. … ▸
  14. ▸ ▸ ▸ CPU/ ▸ CPU/ ▸ 1
  15. Hadoop Spark ▸ ▸ ▸ n /n
  16. ▸ ▸ ▸ Amazon EMR ▸ Microsoft Azure HDInsight ▸ Cloudera Altus ▸ Databricks Community Edition Spark ▸ PyData + Jupyter PySpark
  17. Spark Hadoop
  18. Spark Hadoop [Figure: software stacks for Hadoop 0.x, Hadoop 1.x, and Hadoop 2.x + Spark. Components shown: OS, Windows, HDFS, MapReduce, Hive etc., HBase, YARN, Mesos, Impala (SQL), Spark (Spark Streaming, MLlib, GraphX, Spark SQL).]
  19. Spark Hadoop [Figure: Hadoop MapReduce compared with Spark. MapReduce side: map JVM, reduce JVM, HDFS. Spark side: functions f1 to f7, RDD, Executor JVM, HDFS.]
  20. Spark Hadoop Spark ▸ Hadoop MapReduce ▸ Spark API MapReduce API ▸ Hadoop
  21. PySpark (Py)Spark ▸ / Spark ▸ PyData ▸ Spark ▸ Spark Hadoop PyData PySpark
  22. Spark 1.2 PySpark … (Py)Spark
  23. PySpark
  24. PySpark RDD API DataFrame API ▸ RDD Resilient Distributed Dataset = Spark Java ▸ DataFrame RDD / R data.frame ▸ Python RDD API DataFrame API Scala / Java
  25. PySpark DataFrame API ▸ RDD → DataFrame / Dataset ▸ MLlib → ML ▸ GraphX → GraphFrame ▸ Spark Streaming → Structured Streaming
  26. [Figure: two cluster diagrams, one labeled RDD API (PySpark) and one labeled DataFrame API (PySpark). Each shows a Driver JVM, Storage, and worker nodes containing Executor JVMs and Python VMs.]
  27. PySpark ▸ RDD API Executor JVM Python VM ▸ DataFrame API JVM ▸ UDF Python VM ▸ UDF Scala Java ▸ Spark 2.x DataFrame
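
Slides 10 and 19 mention groupby and the map/reduce pattern. As a rough illustration of that idea, here is a minimal pure-Python sketch, not Spark or Hadoop code: the records are split into n partitions, each partition is counted on its own (the map step), and the partial results are merged (the reduce step). The helper names `split_into_partitions`, `map_partition`, and `merge_counts`, and the sample records, are hypothetical and do not come from the slides.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, n):
    """Deal records round-robin into n partitions (hypothetical helper)."""
    return [records[i::n] for i in range(n)]


def map_partition(partition):
    """Map step: count keys using only the records in one partition."""
    return Counter(key for key, _ in partition)


def merge_counts(left, right):
    """Reduce step: merge two partial per-key counts."""
    return left + right


# Made-up sample data: (key, value) pairs.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5), ("b", 6)]

partitions = split_into_partitions(records, n=3)

# Each partition is processed independently, so in a real cluster these
# calls could run on different workers; here they simply run in a loop.
partial_counts = [map_partition(p) for p in partitions]

# Only the small partial results need to be combined at the end.
total = reduce(merge_counts, partial_counts, Counter())
print(dict(total))
```

The point of the sketch is that a group-wise count never needs all records in one place: each worker sees only its partition, and only the per-key partial counts are brought together.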

  28. Spark PyData
  29. Spark PyData Spark PyData ▸ Spark ▸ Python PyData ▸ ▸ Parquet ▸ Apache Arrow
  30. Spark PyData ▸ CSV JSON ▸ Parquet Spark DataFrame API Python fastparquet pyarrow ▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem ▸ =
  31. Spark PyData Parquet
      https://parquet.apache.org/documentation/latest/
      zip CSV I/O
      [Figure: Parquet data layout. The file is a sequence of ROW BLOCKs. Within each ROW BLOCK, values are stored column by column: COLUMN #0 for ROW #0 to ROW #N, then COLUMN #1 for ROW #0 to ROW #N, and so on through COLUMN #M.]
  32. Spark PyData
      # Spark: assumes an existing SparkSession `spark`; save() writes Parquet by default
      df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
      df.write.save(filename, compression='snappy')
      # fastparquet: write a pandas DataFrame without compression
      import pandas as pd
      from fastparquet import write
      pdf = pd.read_csv(csvFilename)
      write(filename, pdf, compression='UNCOMPRESSED')
      # pyarrow: convert the pandas DataFrame to an Arrow table, then write with gzip
      import pyarrow as pa
      import pyarrow.parquet as pq
      arrow_table = pa.Table.from_pandas(pdf)
      pq.write_table(arrow_table, filename, compression='GZIP')
  33. Spark PyData ▸ pandas CSV Spark Spark pandas … ▸ Spark - pandas ▸ pandas → Spark … ▸ Apache Arrow
  34. Spark PyData Apache Arrow ▸ Apache Arrow ▸ PyData / OSS ▸ / https://arrow.apache.org
  35. Spark PyData Wes blog ▸ pandas Apache Arrow ▸ Blog ▸ PyData Blog Wes OK ▸ Apache Arrow pandas 10
      https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144
  36. PySpark Python Spark
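
The Parquet figure on slide 31 shows rows grouped into row blocks, with each block storing its values column by column. The pure-Python sketch below mimics only that ordering of values, under simplifying assumptions: real Parquet files also carry encodings, compression, and metadata that are not modeled here. The functions `to_row_blocks` and `read_column` and the sample rows are hypothetical names and data chosen for illustration.

```python
def to_row_blocks(rows, block_size):
    """Group rows into blocks and store each block column by column."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        # zip(*chunk) transposes the rows of this block into columns.
        blocks.append([list(column) for column in zip(*chunk)])
    return blocks


def read_column(blocks, column_index):
    """Read one column by touching only that column's values in each block."""
    values = []
    for block in blocks:
        values.extend(block[column_index])
    return values


# Made-up sample data: five rows with three columns each.
rows = [(1, "x", 0.5), (2, "y", 1.5), (3, "z", 2.5), (4, "w", 3.5), (5, "v", 4.5)]

blocks = to_row_blocks(rows, block_size=2)
second_column = read_column(blocks, 1)
print(blocks[0])
print(second_column)
```

Because the values of one column sit next to each other inside a block, a reader that needs only that column can skip the others, which is one reason a columnar layout can reduce I/O compared with scanning a row-oriented CSV file.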
