
PySpark Essentials (20170630 sapporo db analytics showcase)

These are the slides from my PySpark talk at db analytics showcase, hosted by Insight Technology on June 30, 2017.

Published in: Software

  1. PySpark Essentials @ db analytics showcase Sapporo
  2. ▸ Facebook: Ryuji Tamagawa ▸ Twitter: tamagawa_ryuji ▸ FB ▸ Twitter
  3. 8
  4. Wes McKinney blog ▸ http://qiita.com/tamagawa-ryuji
  5. ▸ ▸ pandas PyData ▸ Spark Scala Java Spark ▸ TB
  6. ▸ Spark Hadoop ▸ PySpark ▸ PySpark ▸ Spark/Hadoop PyData PySpark
  7. Spark and Hadoop
  8. Spark and Hadoop [Figure: software stacks by generation. Hadoop 0.x: OS, HDFS, MapReduce. Hadoop 1.x: OS, HDFS, MapReduce, Hive etc., HBase. Hadoop 2.x + Spark: OS, HDFS, YARN, MapReduce, Hive etc., HBase, Impala (SQL), Spark (Spark Streaming, MLlib, GraphX, Spark SQL). Spark (Spark Streaming, MLlib, GraphX, Spark SQL) is also drawn on YARN, on Mesos, and on its own, with a Windows label.]
  9. Spark and Hadoop [Figure: Hadoop MapReduce compared with Spark RDDs. MapReduce labels: map (JVM), reduce (JVM), HDFS. Spark labels: f1 to f7, RDD, Executor (JVM), HDFS.]
  10. Spark and Hadoop: Spark ▸ Hadoop MapReduce ▸ Spark API MapReduce API ▸ Hadoop
  11. PySpark
  12. PySpark (Py)Spark ▸ / Spark ▸ PyData ▸ Spark ▸ Spark Hadoop PyData PySpark
  13. PySpark ▸ ▸ SSD ▸ CPU ▸ Parquet S3 CPU
  14. Spark 1.2 PySpark … (Py)Spark
  15. PySpark
  16. PySpark: RDD API and DataFrame API ▸ RDD (Resilient Distributed Dataset) = Spark Java ▸ DataFrame RDD / R data.frame ▸ Spark 2.x DataFrame
 Learning PySpark: ML, Structured Streaming, GraphFrames, TensorFrames ▸ Python RDD API DataFrame API Scala / Java
  17. PySpark [Figures: PySpark with the RDD API, and PySpark with the DataFrame API. Labels in each: Driver (JVM), Worker node, Executor (JVM), Python VM, Storage.]
  18. PySpark ▸ RDD API: Executor JVM, Python VM ▸ DataFrame API: JVM ▸ UDF: Python VM ▸ UDF: Scala, Java ▸ Spark 2.x DataFrame

  19. Spark and PyData
  20. Spark and PyData: Spark and PyData ▸ Spark ▸ Python PyData ▸ ▸ Parquet ▸ Apache Arrow
  21. Spark and PyData: PyData
  22. Spark and PyData: PyData [Table of PyData packages: Anaconda, Blaze ('NumPy and pandas interface to Big Data'), dask, Bokeh, Canopy, IPython, matplotlib, nose, numba (JIT), NumPy, SciPy, Statsmodels, SymPy, pandas, scikit-image, scikit-learn]
  23. Spark and PyData ▸ CSV JSON ▸ Spark Parquet ▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem ▸ Parquet Python ▸ fastparquet pyarrow ▸ Parquet
  24. Spark and PyData: Parquet https://parquet.apache.org/documentation/latest/ I/O
  25. Spark and PyData: writing Parquet
      Spark:
          df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
          df.write.save(filename, compression='snappy')
      fastparquet:
          import pandas as pd
          from fastparquet import write
          pdf = pd.read_csv(csvFilename)
          write(filename, pdf, compression='UNCOMPRESSED')
      pyarrow:
          import pyarrow as pa
          import pyarrow.parquet as pq
          arrow_table = pa.Table.from_pandas(pdf)
          pq.write_table(arrow_table, filename, compression='GZIP')
  26. Spark and PyData ▸ pandas CSV Spark Spark pandas … ▸ Spark - pandas ▸ pandas → Spark … ▸ Apache Arrow
  27. Spark and PyData: Apache Arrow ▸ Apache Arrow ▸ PyData / OSS ▸ / https://arrow.apache.org
  28. Spark and PyData: Wes's blog ▸ pandas Apache Arrow ▸ Blog ▸ PyData Blog
 Wes OK ▸ 2017 : pandas, Arrow, Feather, Parquet, Spark, Ibis
 http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81
  29. PySpark
  30. ▸ pandas PySpark ▸ PySpark DataFrame API ▸ Parquet CSV Parquet ▸ UI Jupyter Notebook [Figure labels: Parquet, PySpark DataFrame API, pandas, PyData, Jupyter Notebook, CSV]
