PySpark
@
▸ facebook : Ryuji Tamagawa
▸ Twitter : tamagawa_ryuji
▸ FB
▸ Twitter
8
Wes McKinney's blog
▸ http://qiita.com/tamagawa-ryuji
▸ pandas PyData
▸ Spark Scala Java
Spark
▸ TB
▸ Spark Hadoop
▸ PySpark
▸ PySpark
▸ Spark/Hadoop PyData
PySpark
Spark Hadoop
Spark Hadoop
Hadoop 0.x Spark
[Figure: software stacks by generation.
Hadoop 0.x: OS, HDFS, MapReduce.
Hadoop 1.x: OS, HDFS, MapReduce, HBase, Hive etc.
Hadoop 2.x + Spark: OS, HDFS, YARN, MapReduce, HBase, Hive etc., Impala (SQL), Spark (Spark Streaming, MLlib, GraphX, Spark SQL).
Spark (Spark Streaming, MLlib, GraphX, Spark SQL) is also shown on YARN, on Mesos, and on its own; "Windows" appears as a further label.]
Spark Hadoop
Hadoop Spark
[Figure: MapReduce vs. Spark.
MapReduce: map (JVM), reduce (JVM), map (JVM), reduce (JVM), with HDFS.
Spark: functions f1 to f7 applied to RDDs inside an Executor JVM, with HDFS.]
Spark Hadoop
Spark
▸ Hadoop MapReduce
▸ Spark API MapReduce API
▸ Hadoop
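The contrast between MapReduce stages and Spark's chained functions over RDDs can be illustrated with a toy model. The sketch below is plain Python, not Spark code: `ToyRDD` is a hypothetical single-process class that only records each step (its lineage) and runs the whole chain in one in-memory pass when an action is called. It assumes nothing about how Spark actually schedules or stores data.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark code)."""

    def __init__(self, source: Callable[[], Iterable], lineage: List[str]):
        self._source = source    # how to (re)compute the elements
        self.lineage = lineage   # names of the steps recorded so far

    @classmethod
    def from_list(cls, data: list) -> "ToyRDD":
        return cls(lambda: iter(data), ["source"])

    def map(self, f: Callable) -> "ToyRDD":
        # Nothing runs here; a new step is recorded on top of the parent.
        return ToyRDD(lambda: (f(x) for x in self._source()),
                      self.lineage + ["map"])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)),
                      self.lineage + ["filter"])

    def collect(self) -> list:
        # The action: every element streams through all steps in one pass,
        # with no intermediate result written out between steps.
        return list(self._source())


rdd = (ToyRDD.from_list([1, 2, 3, 4, 5])
       .map(lambda x: x * 10)
       .filter(lambda x: x > 20))
print(rdd.lineage)    # ['source', 'map', 'filter']
print(rdd.collect())  # [30, 40, 50]
```

Because each `ToyRDD` keeps a recipe rather than data, calling `collect()` again simply recomputes the chain from the source, which is the idea behind the "resilient" in Resilient Distributed Dataset.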
PySpark
PySpark
(Py)Spark
▸ / Spark
▸ PyData
▸ Spark
▸ Spark Hadoop
PyData
PySpark
PySpark
▸ SSD
▸ CPU
Parquet
S3
CPU
Spark 1.2
PySpark …
(Py)Spark
PySpark
PySpark
RDD API DataFrame API
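▸ RDD (Resilient Distributed Dataset) = Spark's basic distributed collection, holding Java objects
▸ DataFrame: built on top of RDDs; a table of rows and columns, like an R data.frame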
▸ Spark 2.x DataFrame 

Learning PySpark ML Structured Streaming GraphFrames TensorFrame
▸ Python RDD API DataFrame API Scala / Java
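To make the two APIs concrete, here is a minimal sketch that computes the same per-key total three ways: in plain Python (as a reference), with the RDD API, and with the DataFrame API. The sample rows, the function names, and the `local[1]` master setting are assumptions made for illustration; the Spark calls are imported lazily inside the functions so that the reference part runs without a PySpark installation.

```python
from collections import defaultdict

SALES = [("tokyo", 3), ("sapporo", 5), ("tokyo", 4)]  # made-up sample rows


def totals_plain(rows):
    """Plain-Python reference for what both Spark versions compute."""
    acc = defaultdict(int)
    for city, amount in rows:
        acc[city] += amount
    return dict(acc)


def totals_rdd(spark, rows):
    # RDD API: the lambda is Python code, so the data is handed to Python.
    rdd = spark.sparkContext.parallelize(rows)
    return dict(rdd.reduceByKey(lambda a, b: a + b).collect())


def totals_dataframe(spark, rows):
    # DataFrame API: groupBy/sum are built-in expressions, not Python code.
    from pyspark.sql import functions as F
    df = spark.createDataFrame(rows, ["city", "amount"])
    out = df.groupBy("city").agg(F.sum("amount").alias("total")).collect()
    return {row["city"]: row["total"] for row in out}


def demo():
    # Requires a local PySpark installation; not called automatically.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder.master("local[1]")
             .appName("rdd-vs-dataframe").getOrCreate())
    try:
        print(totals_rdd(spark, SALES))
        print(totals_dataframe(spark, SALES))
    finally:
        spark.stop()


print(totals_plain(SALES))  # {'tokyo': 7, 'sapporo': 5}
```

Calling `demo()` in an environment with PySpark should print the same totals from both APIs; only the place where the work happens differs.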
[Figure: PySpark with the RDD API. A Driver (JVM), three Worker nodes each with an Executor (JVM) and a Python VM, and Storage.]
RDD API PySpark
[Figure: PySpark with the DataFrame API. A Driver (JVM), three Worker nodes each with an Executor (JVM) and a Python VM, and Storage.]
DataFrame API PySpark
PySpark
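▸ RDD API: data moves between the Executor JVM and the Python VM
▸ DataFrame API: processing stays inside the JVM
▸ UDFs run in the Python VM
▸ UDFs can be written in Scala or Java instead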
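The difference between a Python UDF and a built-in column expression can be sketched as follows. The conversion function, the column names, and the `local[1]` master setting are assumptions chosen for illustration. The function body is ordinary Python, which is why rows have to reach a Python process when it is used as a UDF, whereas the equivalent built-in expression involves no Python function at all.

```python
def fahrenheit_to_celsius(f):
    """Ordinary Python function; as a UDF it runs on the Python side."""
    return None if f is None else (f - 32.0) * 5.0 / 9.0


def demo():
    # Requires a local PySpark installation; not called automatically.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = (SparkSession.builder.master("local[1]")
             .appName("udf-vs-builtin").getOrCreate())
    try:
        df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])

        # Python UDF: each value is passed to fahrenheit_to_celsius.
        to_celsius = F.udf(fahrenheit_to_celsius, DoubleType())
        df.withColumn("temp_c", to_celsius("temp_f")).show()

        # Built-in column arithmetic: same result, no Python function.
        df.withColumn("temp_c", (F.col("temp_f") - 32.0) * 5.0 / 9.0).show()
    finally:
        spark.stop()


print(fahrenheit_to_celsius(212.0))  # 100.0
```

When a built-in expression exists for the operation, it is usually the safer default; a Python UDF is the fallback for logic that cannot be expressed that way.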
▸ Spark 2.x DataFrame 

Spark PyData
Spark PyData
Spark PyData
▸ Spark
▸ Python PyData
▸ Parquet
▸ Apache Arrow
Spark PyData
PyData
Spark PyData
PyData
Anaconda Python
Blaze 'NumPy and pandas interface to Big Data'. dask
Bokeh
Canopy Python
IPython
matplotlib PyData
nose
numba JIT
NumPy PyData
SciPy PyData
Statsmodels
SymPy
pandas NumPy SciPy
scikit-image
scikit-learn PyData
Spark PyData
▸ CSV JSON
▸ Spark Parquet
▸ "Performance comparison of different file formats and storage engines in the Hadoop ecosystem"
▸ Parquet Python
▸ fastparquet pyarrow
▸ Parquet
Spark PyData
Parquet
https://parquet.apache.org/documentation/latest/
I/O
Spark PyData
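Spark
# Read the CSV with an explicit schema, then write Parquet (Spark's default format) with snappy compression
df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
df.write.save(filename, compression='snappy')
fastparquet
import pandas as pd
from fastparquet import write
# Load the CSV into a pandas DataFrame and write it as uncompressed Parquet
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')
pyarrow
import pyarrow as pa
import pyarrow.parquet as pq
# Convert the pandas DataFrame to an Arrow table and write gzip-compressed Parquet
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')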
Spark PyData
▸ pandas CSV Spark
Spark pandas
…
▸ Spark - pandas
▸ pandas → Spark …
▸ Apache Arrow
Spark PyData
Apache Arrow
▸ Apache Arrow
▸ PyData / OSS
▸ /
https://arrow.apache.org
Spark PyData
Wes blog
▸ pandas Apache Arrow
▸ Blog
▸ PyData Blog


Wes OK
▸ 2017 : pandas, Arrow, Feather, Parquet, Spark, Ibis

http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81
PySpark
▸ pandas PySpark
▸ PySpark DataFrame API
▸ Parquet
CSV
Parquet
▸ UI
Jupyter Notebook
Parquet
PySpark
DataFrame API pandas
PyData
Jupyter Notebook
CSV
Key Points of PySpark (2017-06-30, Sapporo db analytics showcase)
