
Key Points of PySpark (PySparkの勘所) (20170630 sapporo db analytics showcase)

Slides from the PySpark talk I gave on June 30, 2017 at db analytics showcase, hosted by Insight Technology.

  1. 1. PySpark @
  2. 2. ▸ facebook : Ryuji Tamagawa ▸ Twitter : tamagawa_ryuji ▸ FB ▸ Twitter
  3. 3. 8
  4. 4. Wes McKinney blog ▸ http://qiita.com/tamagawa-ryuji
  5. 5. ▸ ▸ pandas PyData ▸ Spark Scala Java Spark ▸ TB
  6. 6. ▸ Spark Hadoop ▸ PySpark ▸ PySpark ▸ Spark/Hadoop PyData PySpark
  7. 7. Spark Hadoop
  8. 8. Spark Hadoop [Diagram: software stacks for Hadoop 0.x, Hadoop 1.x, and Hadoop 2.x + Spark. Components shown: OS, HDFS, MapReduce, Hive etc., HBase, YARN, Impala (SQL), Mesos, Windows, and Spark (Spark Streaming, MLlib, GraphX, Spark SQL).]
  9. 9. Spark Hadoop [Diagram: Hadoop MapReduce compared with Spark. MapReduce side: map (JVM), reduce (JVM), HDFS. Spark side: functions f1 to f7, RDD, Executor (JVM), HDFS.]
  10. 10. Spark Hadoop Spark ▸ Hadoop MapReduce ▸ Spark API MapReduce API ▸ Hadoop
  11. 11. PySpark
  12. 12. PySpark (Py)Spark ▸ / Spark ▸ PyData ▸ Spark ▸ Spark Hadoop PyData PySpark
  13. 13. PySpark ▸ ▸ SSD ▸ CPU ▸ Parquet S3 CPU
  14. 14. Spark 1.2 PySpark … (Py)Spark
  15. 15. PySpark
  16. 16. PySpark RDD API DataFrame API ▸ RDD (Resilient Distributed Dataset) = Spark Java ▸ DataFrame RDD / R data.frame ▸ Spark 2.x DataFrame: Learning PySpark, ML, Structured Streaming, GraphFrames, TensorFrames ▸ Python RDD API DataFrame API Scala / Java
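
As a rough illustration of the two APIs named on slide 16, the sketch below computes the same per-key sum once with the RDD API and once with the DataFrame API. The "key,value" line format, the function names, and the column names are assumptions made for this example; they do not come from the slides.

```python
from typing import Dict, List, Tuple


def parse_line(line: str) -> Tuple[str, int]:
    """Split a 'key,value' text line into a (key, int) pair (hypothetical format)."""
    key, value = line.split(",")
    return key.strip(), int(value)


def sum_with_rdd_api(spark, lines: List[str]) -> Dict[str, int]:
    """RDD API: parse_line and the lambda run as Python code on the workers."""
    rdd = spark.sparkContext.parallelize(lines)
    return dict(rdd.map(parse_line).reduceByKey(lambda a, b: a + b).collect())


def sum_with_dataframe_api(spark, lines: List[str]) -> Dict[str, int]:
    """DataFrame API: the aggregation is declared and then planned by Spark."""
    # Imported lazily so that parse_line stays usable without Spark installed.
    from pyspark.sql import functions as F

    rows = [parse_line(line) for line in lines]  # tiny demo data, parsed on the driver
    df = spark.createDataFrame(rows, ["key", "value"])
    result = df.groupBy("key").agg(F.sum("value").alias("total")).collect()
    return {row["key"]: row["total"] for row in result}
```

With a session created by `SparkSession.builder.master("local[1]").getOrCreate()`, both functions should return the same totals for the same input lines. Only the RDD version runs the per-record logic in Python worker processes.
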
  17. 17. [Diagram, two panels: "RDD API PySpark" and "DataFrame API PySpark". Labels in both panels: Worker node, Driver JVM, Executor JVM, Python VM, Storage.]
  18. 18. PySpark ▸ RDD API Executor JVM Python VM ▸ DataFrame API JVM ▸ UDF Python VM ▸ UDF Scala Java ▸ Spark 2.x DataFrame
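
Slide 18 contrasts work that stays in the JVM with work that passes through a Python VM. The sketch below derives the same column twice, once with a Python UDF and once with built-in column functions, so the two paths can be compared side by side. The function names, the `name` column, and the output column names are hypothetical.

```python
def shout(text):
    """Plain Python logic; used as a UDF it runs in a Python worker, not in the JVM."""
    return None if text is None else text.upper() + "!"


def add_columns(df, column="name"):
    """Add the same derived column via a Python UDF and via built-in functions.

    `df` is any PySpark DataFrame with a string column; the default column
    name is an assumption for this example.
    """
    # Imported lazily so that shout() stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    shout_udf = F.udf(shout, StringType())  # values are shipped to a Python process
    return (
        df.withColumn("via_udf", shout_udf(F.col(column)))
        # Built-in expressions are evaluated inside the executor JVM.
        .withColumn("via_builtin", F.concat(F.upper(F.col(column)), F.lit("!")))
    )
```

Both columns should hold the same values. Comparing the physical plans with `df.explain()` is one way to see that only the UDF column adds a Python evaluation step.
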

  19. 19. Spark PyData
  20. 20. Spark PyData Spark PyData ▸ Spark ▸ Python PyData ▸ ▸ Parquet ▸ Apache Arrow
  21. 21. Spark PyData PyData
  22. 22. Spark PyData PyData Anaconda Python Blaze ("NumPy and pandas interface to Big Data") dask Bokeh Canopy Python IPython matplotlib PyData nose numba JIT NumPy PyData SciPy PyData Statsmodels SymPy pandas NumPy SciPy scikit-image scikit-learn PyData
  23. 23. Spark PyData ▸ CSV JSON ▸ Spark Parquet ▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem ▸ Parquet Python ▸ fastparquet pyarrow ▸ Parquet
  24. 24. Spark PyData Parquet https://parquet.apache.org/documentation/latest/ I/O
  25. 25. Spark PyData

      Spark:

          df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
          df.write.save(filename, compression='snappy')

      fastparquet:

          import pandas as pd
          from fastparquet import write
          pdf = pd.read_csv(csvFilename)
          write(filename, pdf, compression='UNCOMPRESSED')

      pyarrow:

          import pyarrow as pa
          import pyarrow.parquet as pq
          arrow_table = pa.Table.from_pandas(pdf)
          pq.write_table(arrow_table, filename, compression='GZIP')
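
Slide 25 covers the write side only. As a complement, here is a minimal sketch of reading a Parquet file back into pandas while loading just the columns that are needed, which is where a columnar format helps with I/O. The function names are hypothetical. A directory of part files written by Spark may need different path handling depending on the library and its version.

```python
def read_columns_with_pyarrow(path, columns):
    """Load only the selected columns of a Parquet file into a pandas DataFrame."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow and pandas

    return pq.read_table(path, columns=columns).to_pandas()


def read_columns_with_fastparquet(path, columns):
    """Same idea with fastparquet instead of pyarrow."""
    from fastparquet import ParquetFile  # imported lazily; requires fastparquet

    return ParquetFile(path).to_pandas(columns=columns)
```

For example, `read_columns_with_pyarrow(filename, ["a"])` returns a DataFrame containing only column `a`, assuming the file has a column with that name.
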
  26. 26. Spark PyData ▸ pandas CSV Spark Spark pandas … ▸ Spark - pandas ▸ pandas → Spark … ▸ Apache Arrow
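
Slide 26 mentions moving data between pandas and Spark, with Apache Arrow as the expected improvement. The sketch below shows the conversion in both directions. Note the assumption: the Arrow configuration keys come from Spark releases after this talk (2.3 and later, renamed in 3.0), not from the slides, and the helper names are hypothetical.

```python
def arrow_conf_key(spark_version: str) -> str:
    """Return the config key that enables Arrow-based pandas conversion.

    Spark 2.3 and 2.4 use 'spark.sql.execution.arrow.enabled'; Spark 3.x
    uses 'spark.sql.execution.arrow.pyspark.enabled'.
    """
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def spark_to_pandas(spark, sdf, use_arrow=True):
    """Collect a Spark DataFrame to the driver as a pandas DataFrame."""
    spark.conf.set(arrow_conf_key(spark.version), "true" if use_arrow else "false")
    return sdf.toPandas()  # the whole result must fit in driver memory


def pandas_to_spark(spark, pdf):
    """Turn a pandas DataFrame into a Spark DataFrame."""
    return spark.createDataFrame(pdf)
```

Because `toPandas()` brings every row to the driver, it suits results that have already been filtered or aggregated down to a size pandas can handle.
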
  27. 27. Spark PyData Apache Arrow ▸ Apache Arrow ▸ PyData / OSS ▸ / https://arrow.apache.org
  28. 28. Spark PyData Wes blog ▸ pandas Apache Arrow ▸ Blog ▸ PyData Blog Wes OK ▸ 2017 : pandas, Arrow, Feather, Parquet, Spark, Ibis http://qiita.com/tamagawa-ryuji/items/deb3f63ed4c7c8065e81
  29. 29. PySpark
  30. 30. ▸ pandas PySpark ▸ PySpark DataFrame API ▸ Parquet CSV Parquet ▸ UI Jupyter Notebook [Diagram labels: Parquet, PySpark DataFrame API, pandas, PyData, Jupyter Notebook, CSV]