PyData & Apache Spark
2017/2/10
Sapporo TechBar #7
▸ facebook : Ryuji Tamagawa
▸ Twitter : tamagawa_ryuji
▸ FB techbar
▸ FB
▸ Twitter
5
Python
PyData
Apache Spark
Jupyter Notebook
2017 and the future
pandas
PyData
1 / 5 : PyData
PyData.org
PyData
Anaconda Python
Blaze 'NumPy and pandas interface to Big Data'. dask
Bokeh
Canopy Python
IPython
matplotlib PyData
nose
numba JIT
NumPy PyData
SciPy PyData
Statsmodels
SymPy
pandas NumPy SciPy
scikit-image
scikit-learn PyData


pandas
2 / 5 : pandas
pandas
▸ NumPy SciPy 

▸ DataFrame
▸
pandas 

Wes McKinney
DataFrame
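As a rough illustration of the DataFrame idea, here is a minimal sketch in Python. The sample data and column names are made up for this example and are not from the talk; it only shows a table with labeled columns, a row filter, and a group-by aggregation.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
sales = pd.DataFrame(
    {
        "city": ["Sapporo", "Tokyo", "Sapporo", "Osaka"],
        "item": ["book", "book", "pen", "pen"],
        "amount": [1200, 1500, 300, 250],
    }
)

# Boolean-mask row selection: keep only the rows where item is "pen".
pens = sales[sales["item"] == "pen"]

# SQL-like GROUP BY: total amount per city.
total_by_city = sales.groupby("city")["amount"].sum()

print(pens)
print(total_by_city)
```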
▸ 

Python
▸
▸ PyData pandas


Jupyter Notebook
3 / 5 : Jupyter Notebook
IPython Notebook
▸ Jupyter Notebook
▸ Julia Python R
▸ JupyterCon
pandas / matplotlib
Interactive Widget
▸ Learning Jupyter
Apache Spark
4 / 5 : Apache Spark
Hadoop
▸ MapReduce Spark
▸ 2010 Hadoop = MapReduce + HDFS
▸ Hadoop
OS
HDFS
Hive etc.
HBase
MapReduce
YARN
Impala etc. (in-memory SQL engines)
Spark (Spark Streaming, MLlib, GraphX, Spark SQL)
Hadoop
HDFS, S3
YARN, Mesos
Apache Spark PyData pandas
Apache Spark pandas
JVM Python
× dask
I/O
Scala Java Python R

JVM
Python
Spark
▸
▸
▸ 1 PC 

Hadoop / MapReduce
DataFrame
▸
▸ SSD
▸ Spark Parquet
▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem
▸ Parquet Python
Apache Spark
▸
▸ Parquet
▸
▸
Machine Learning
▸
▸ scikit-learn
▸ Spark MLlib / ML
▸
▸ TensorFlow
▸ Python
2017 and the future
5 / 5 : 2017 and the future
PyData
▸
▸ Spark - pandas
▸ pandas → Spark …
Wes blog
▸ pandas Apache Arrow
▸ Blog
▸ PyData Blog


Wes OK
▸ 2017 : pandas, Arrow, Feather, Parquet, Spark, Ibis

http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81
High speed Apache Parquet for Python
▸ Parquet
▸ Spark
▸ Python
▸ Fastparquet
▸ pyarrow
Apache Arrow
▸ Apache Arrow
▸ PyData / OSS