Python, Pandas, Spark 2.0
Sky
•
• Python 2000
(**)
• db tech showcase MongoDB
•
• FB: Ryuji Tamagawa
• Twitter : tamagawa_ryuji
2017
• Python Spark
• Python / Pandas
• Spark 2.0
Part 1 :
csv
Python
Pandas Python
Jupyter Notebook
Jenkins
Spark 2.0
• Spark API: RDD (~1.3), DataFrame / Dataset (1.4~)
• DataFrame API
RDD API Python Spark
DataFrame
• RDB /
• R Pandas Spark
Spark
R / Pandas
Spark
+
Part 2 :
CSV
zip
RDB
Parquet
Excel
CSV
Feather
Spark
Pandas / Spark
•
• CPU
•
• Pandas read_csv zip CSV
Pandas
2
• CSV CPU
Pandas zip CSV
CPU …
• Parquet !
•
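As a minimal sketch of the read_csv point above: Pandas can read a CSV straight out of a zip archive. The file names and sample rows below are made up for illustration, and the helper names write_zipped_csv and read_zipped_csv are hypothetical, not part of Pandas.

```python
import os
import tempfile
import zipfile

import pandas as pd


def write_zipped_csv(zip_path, csv_name, text):
    """Store one CSV file inside a zip archive (sample data for this sketch)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(csv_name, text)


def read_zipped_csv(zip_path):
    """Read the single CSV inside a zip archive without unpacking it first."""
    # compression="zip" requires the archive to contain exactly one file.
    return pd.read_csv(zip_path, compression="zip")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.zip")
        write_zipped_csv(path, "sample.csv", "key,value\nabc,1\nxyz,2\n")
        print(read_zipped_csv(path))
```

Reading the archive directly saves disk space and an unpacking step, but decompression and CSV parsing still run on the CPU for every read.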
: Parquet
I/O
•
• Spark Parquet
• Python Parquet
HDFS / S3
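One way to use Parquet from Python is sketched below: convert the CSV once, then read back only the columns needed. This assumes a current Pandas release with a Parquet engine such as pyarrow installed, which may differ from the tooling available when these slides were written. The function names csv_to_parquet and read_columns are hypothetical.

```python
import pandas as pd


def csv_to_parquet(csv_path, parquet_path):
    """Convert a CSV file to Parquet once, so later reads skip CSV parsing."""
    frame = pd.read_csv(csv_path)
    # to_parquet needs a Parquet engine such as pyarrow (assumed installed).
    frame.to_parquet(parquet_path, index=False)
    return frame


def read_columns(parquet_path, columns):
    """Load only the named columns; Parquet is columnar, so others are not read."""
    return pd.read_parquet(parquet_path, columns=columns)
```

A typical call would be csv_to_parquet("data.csv", "data.parquet") followed by read_columns("data.parquet", ["key"]), where both paths are placeholders.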
Parquet Parquet
SSD
Parquet Parquet
Parquet
No
No
Yes
HDD
•
• I/O Pandas
• Spark
• DataFrame Pandas → Spark
Spark → Pandas Pandas → Spark
• Apache Arrow
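The Pandas → Spark and Spark → Pandas conversions above can be sketched as follows. This assumes a local PySpark installation. The sample data and the names make_sample and round_trip are illustrative only. Note that toPandas() collects every row onto the driver, so it suits results that fit in memory.

```python
import pandas as pd


def make_sample():
    """Small pandas DataFrame used as illustrative input."""
    return pd.DataFrame({"key": ["abc", "xyz", "abc"], "value": [1, 2, 3]})


def round_trip(pandas_df):
    """Send a pandas DataFrame to Spark, filter there, and collect it back."""
    # Imported here so the module loads even where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-spark-round-trip")
        .getOrCreate()
    )
    try:
        spark_df = spark.createDataFrame(pandas_df)  # Pandas -> Spark
        filtered = spark_df.filter(spark_df["key"] == "abc")
        return filtered.toPandas()  # Spark -> Pandas
    finally:
        spark.stop()


if __name__ == "__main__":
    print(round_trip(make_sample()))
```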
CPU
~2010
2010~
SSD
CPU 

Apache Spark 2.0
• 1.x
• 2.0
1.x
• DataFrame API Python
• Databricks: http://go.databricks.com/mastering-apache-spark-2.0
•
Spark 2.0
• CPU
• CPU
• SQL DataFrame
• + SSD
• CSV zip
Pandas read_csv
Python + Spark
• Python serialize
• DataFrame API UDF
UDF Scala/Java
• http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api
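[Figure: inside an Executor, the JVM holds the cached DataFrame, transfers it to a Python process that evaluates lambda items: items[0] == 'abc', and receives the result back as a DataFrame via a second transfer; the Driver is shown separately.]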
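To make the transfer cost in the diagram concrete, the sketch below runs the same filter two ways: once with the Python predicate from the diagram on the RDD, and once as a DataFrame expression. It assumes a local PySpark installation. The sample rows, the column names key and value, and the function names are invented for illustration.

```python
def is_abc(items):
    """Same test as the lambda in the diagram: first field equals 'abc'."""
    return items[0] == "abc"


def count_both_ways():
    """Count matching rows via a Python predicate and via a DataFrame expression."""
    # Imported here so is_abc can be used without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("python-vs-dataframe-filter")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(
            [("abc", 1), ("xyz", 2), ("abc", 3)], ["key", "value"]
        )
        # RDD path: every row is serialized and shipped to a Python worker.
        via_python = df.rdd.filter(is_abc).count()
        # DataFrame path: the comparison is evaluated inside the JVM.
        via_jvm = df.filter(F.col("key") == "abc").count()
        return via_python, via_jvm
    finally:
        spark.stop()


if __name__ == "__main__":
    print(count_both_ways())
```

Both paths should return the same count. The difference is where the predicate runs, which is why the DataFrame form avoids the Python serialization overhead.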
20161215 Python, Pandas, and Spark: miscellaneous notes
