20170927 pydata tokyo: The Very Basics of Distributed Processing for Data Science Folks, and the Key Points of PySpark

2,534 views

Published on

Presentation given at PyData.Tokyo on 2017/9/27.

Published in: Technology


  1. 1. PySpark @
  2. 2. ▸ facebook : Ryuji Tamagawa ▸ Twitter : tamagawa_ryuji ▸ FB pydata.tokyo ▸ Twitter
  3. 3. 8 11
  4. 4. Wes McKinney blog ▸ http://qiita.com/tamagawa-ryuji
  5. 5. ▸ ▸ CPU ▸ PyData.Tokyo ▸ PySpark
  6. 6. ▸ ▸ ▸ Spark Hadoop ▸ PySpark ▸ Spark/Hadoop PyData
  7. 7. ▸ ▸ ▸
  8. 8. PySpark ▸ ▸ SSD ▸ CPU ▸ Parquet S3 CPU
  9. 9. https://www.slideshare.net/kumagi/ss-78765920/4
  10. 10. ▸ ▸ ▸ groupby ▸
  11. 11. ▸ ▸
  12. 12. N ▸ N N ▸ …
  13. 13. … ▸
  14. 14. ▸ ▸ ▸ CPU/ ▸ CPU/ ▸ 1
  15. 15. Hadoop Spark ▸ ▸ ▸ n /n
  16. 16. ▸ ▸ ▸ Amazon EMR ▸ Microsoft Azure HDInsight ▸ Cloudera Altus ▸ Databricks Community Edition Spark ▸ PyData + Jupyter PySpark
  17. 17. Spark Hadoop
  18. 18. Spark Hadoop Hadoop0.x Spark OS HDFS MapReduce OS HDFS Hive etc. HBase MapReduce OS HDFS Hive etc. HBase MapReduce YARN Spark (Spark Streaming, MLlib, GraphX, Spark SQL) Impala SQL YARN Spark (Spark Streaming, MLlib, GraphX, Spark SQL) Mesos Spark (Spark Streaming, MLlib, GraphX, Spark SQL) Spark (Spark Streaming, MLlib, GraphX, Spark SQL) Windows Hadoop 0.x Hadoop 1.x Hadoop 2.x + Spark
  19. 19. Spark Hadoop Hadoop Spark map JVM HDFS reduce JVM map JVM reduce JVM f1 RDD Executor JVM HDFS f2 f3 f4 f5 f6 f7 MapReduce Spark RDD
  20. 20. Spark Hadoop Spark ▸ Hadoop MapReduce ▸ Spark API MapReduce API ▸ Hadoop
  21. 21. PySpark (Py)Spark ▸ / Spark ▸ PyData ▸ Spark ▸ Spark Hadoop PyData PySpark
  22. 22. Spark 1.2 PySpark … (Py)Spark
  23. 23. PySpark
  24. 24. PySpark RDD API DataFrame API ▸ RDD Resilient Distributed Dataset = Spark Java ▸ DataFrame RDD / R data.frame ▸ Python RDD API DataFrame API Scala / Java
  25. 25. PySpark DataFrame API RDD DataFrame / Dataset MLlib ML GraphX GraphFrame Spark Streaming Structured Streaming
  26. 26. Worker node PySpark Executor JVM Driver JVM Executor JVM Executor JVM Storage Python VM Worker node Worker node Python VM Python VM RDD API PySpark Worker node Executor JVM Driver JVM Executor JVM Executor JVM Storage Python VM Worker node Worker node Python VM Python VM DataFrame API PySpark
  27. 27. PySpark ▸ RDD API Executor JVM Python VM ▸ DataFrame API JVM ▸ UDF Python VM ▸ UDF Scala Java ▸ Spark 2.x DataFrame
  28. 28. Spark PyData
  29. 29. Spark PyData Spark PyData ▸ Spark ▸ Python PyData ▸ ▸ Parquet ▸ Apache Arrow
  30. 30. Spark PyData ▸ CSV JSON ▸Parquet Spark DataFrame API Python fastparquet pyarrow ▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem ▸ =
  31. 31. Spark PyData Parquet https://parquet.apache.org/documentation/latest/ zip CSV I/O [Figure: Parquet file layout. The file is a sequence of ROW BLOCKs; each ROW BLOCK stores its values column by column: COLUMN #0 ROW #0 … ROW #N, COLUMN #1 ROW #0 … ROW #N, COLUMN #2 ROW #0 …, through COLUMN #M ROW #N]
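  32. 32. Spark PyData
      Spark:
          # Read the CSV with an explicit schema, then save it (the slide passes snappy compression)
          df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
          df.write.save(filename, compression='snappy')
      fastparquet:
          import pandas as pd
          from fastparquet import write
          # Load the CSV into pandas, then write it with fastparquet
          pdf = pd.read_csv(csvFilename)
          write(filename, pdf, compression='UNCOMPRESSED')
      pyarrow:
          import pyarrow as pa
          import pyarrow.parquet as pq
          # Convert the pandas DataFrame to an Arrow table, then write it with pyarrow
          arrow_table = pa.Table.from_pandas(pdf)
          pq.write_table(arrow_table, filename, compression='GZIP')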
  33. 33. Spark PyData ▸ pandas CSV Spark Spark pandas … ▸ Spark - pandas ▸ pandas → Spark … ▸ Apache Arrow
  34. 34. Spark PyData Apache Arrow ▸ Apache Arrow ▸ PyData / OSS ▸ / https://arrow.apache.org
  35. 35. Spark PyData Wes blog ▸ pandas Apache Arrow ▸ Blog ▸ PyData Blog Wes OK ▸ Apache Arrow pandas 10 https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144
  36. 36. PySpark Python Spark
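
The column-within-row-block ordering shown in the Parquet figure on slide 31 can be sketched in plain Python. This is only an illustration of the layout idea, not the actual Parquet file format, which also involves encoding, compression, and metadata. The function names `to_row_blocks` and `read_column` and the sample rows are hypothetical and do not come from the slides.

```python
from typing import Any, List, Sequence


def to_row_blocks(rows: Sequence[Sequence[Any]], block_size: int) -> List[List[List[Any]]]:
    """Split rows into row blocks and store each block column by column."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        # zip(*chunk) transposes the row-major chunk into one sequence per column.
        blocks.append([list(column) for column in zip(*chunk)])
    return blocks


def read_column(blocks: List[List[List[Any]]], column_index: int) -> List[Any]:
    """Collect a single column across all blocks without touching other columns."""
    values = []
    for block in blocks:
        values.extend(block[column_index])
    return values


# Hypothetical sample data: five rows with three columns each.
rows = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "c", 30.0),
    (4, "d", 40.0),
    (5, "e", 50.0),
]

blocks = to_row_blocks(rows, block_size=2)
third_column = read_column(blocks, 2)
```

Because the values of one column sit next to each other inside every block, a reader that needs only that column can skip the rest, which is the I/O saving the slide contrasts with zipped CSV.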
