SlideShare a Scribd company logo
1 of 38
Download to read offline
PySpark
found IT project #9
@
▸ facebook : Ryuji Tamagawa
▸ Twitter : tamagawa_ryuji
▸ FB
found IT project
▸ Twitter
11
Wes Mckinney blog
▸ http://qiita.com/tamagawa-ryuji




▸
▸
▸ Spark Hadoop
▸ PySpark
▸ Spark/Hadoop PyData
▸
▸
▸
PySpark
▸
▸ SSD
▸ CPU
▸
Parquet
S3
CPU
https://www.slideshare.net/kumagi/ss-78765920/4
▸
▸
▸ groupby
▸ Spark API
Spark Hadoop
Spark Hadoop
Hadoop0.x Spark
OS
HDFS
MapReduce
OS
HDFS
Hive e.t.c.
HBase
MapReduce
OS
HDFS
Hive e.t.c.
HBaseMapReduce
YARN
Spark
Spark Streaming, MLlib,
GraphX, Spark SQL)
Impala
SQL
YARN
Spark
Spark Streaming, MLlib, GraphX,
Spark SQL)
Mesos
Spark
Spark Streaming, MLlib, GraphX,
Spark SQL) Spark
Spark Streaming, MLlib, GraphX,
Spark SQL)
Windows
Hadoop 0.x Hadoop 1.x Hadoop 2.x + Spark
▸ Amazon EMR
▸ Microsoft Azure HDInsight
▸ Cloudera Altus
▸ Databricks Community Edition Spark
▸ PyData + Jupyter PySpark
Spark Hadoop
Hadoop Spark
map
JVM
HDFS
reduce
JVM
map
JVM
reduce
JVM
f1
RDD
Executor JVM
HDFS
f2
f3
f4
f5
f6
f7
MapReduce Spark
RDD
Spark Hadoop
Spark
▸ Hadoop MapReduce
▸ Spark API MapReduce API
▸ Hadoop
PySpark
(Py)Spark
▸ / Spark
▸ PyData
▸ Spark
▸ Spark Hadoop
PyData
PySpark
Spark 1.2
PySpark …
(Py)Spark
PySpark
PySpark
RDD API DataFrame API
▸ RDD Resilient Distributed Dataset =
Spark Java
▸ DataFrame RDD
/ R data.frame
▸ Python RDD API DataFrame API Scala
/ Java
PySpark
DataFrame API
RDD
DataFrame /
Dataset
MLlib ML
GraphX GraphFrame
Spark
Streaming
Structured
Streaming
Worker node
PySpark
Executer
JVM
Driver
JVM
Executer
JVM
Executer
JVM
Storage
Python
VM
Worker node Worker node
Python
VM
Python
VM
RDD API PySpark
Worker node
Executer
JVM
Driver
JVM
Executer
JVM
Executer
JVM
Storage
Python
VM
Worker node Worker node
Python
VM
Python
VM
DataFrame API PySpark
PySpark
▸ RDD API Executer JVM Python VM
▸ DataFrame API JVM
▸ UDF Python VM
▸ UDF Scala Java
▸ Spark 2.x DataFrame 

Spark PyData
Spark PyData
Spark PyData
▸ Spark
▸ Python PyData
▸
▸ Parquet
▸ Apache Arrow
Spark PyData
▸ CSV JSON
▸Parquet Spark DataFrame API
Python
fastparquet pyarrow
▸ Performance comparison of different file formats and storage engines
in the Hadoop ecosystem
▸
=
Spark PyData
Parquet


https://parquet.apache.org/documentation/latest/


zip CSV
I/O
ROW BLOCK
COLUMN #0 ROW #0
COLUMN #0 ROW #1
COLUMN #0 ROW #N
COLUMN #1 ROW #0
COLUMN #1 ROW #1
…
…
COLUMN #1 ROW #N
COLUMN #2 ROW #0
COLUMN #2 ROW #1
…
COLUMN #M ROW #N
ROW BLOCK
COLUMN #0 ROW #0
COLUMN #0 ROW #1
COLUMN #0 ROW #N
COLUMN #1 ROW #0
COLUMN #1 ROW #1
…
…
COLUMN #1 ROW #N
COLUMN #2 ROW #0
COLUMN #2 ROW #1
…
COLUMN #M ROW #N
...
Spark PyData
Spark
df = spark.read.csv(csvFilename, header=True, schema = theSchema).coalesce(20)
df.write.save(filename, compression = 'snappy')
from fastparquet import write
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')
fastparquet
import pyarrow as pa
import pyarrow.parquet as pq
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression = 'GZIP')
pyarrow
Spark PyData
▸ pandas CSV Spark
Spark pandas
…
▸ Spark - pandas
▸ pandas → Spark …
▸ Apache Arrow
Spark PyData
Apache Arrow
▸ Apache Arrow
▸ PyData / OSS
▸ /
https://arrow.apache.org
Spark PyData
Wes blog
▸ pandas Apache Arrow
▸ Blog
▸ PyData Blog


Wes OK
▸ Apache Arrow pandas 10 

https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144
PySpark




11
20171012 found  IT #9 PySparkの勘所

More Related Content

What's hot

Cassandra + Hadoop @ApacheCon
Cassandra + Hadoop @ApacheCon Cassandra + Hadoop @ApacheCon
Cassandra + Hadoop @ApacheCon
Jeremy Hanna
 
Papyri.info's Linked Data Story
Papyri.info's Linked Data StoryPapyri.info's Linked Data Story
Papyri.info's Linked Data Story
Hugh Cayless
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
elliando dias
 

What's hot (20)

Cassandra + Hadoop @ApacheCon
Cassandra + Hadoop @ApacheCon Cassandra + Hadoop @ApacheCon
Cassandra + Hadoop @ApacheCon
 
Watch Your Log!
Watch Your Log!Watch Your Log!
Watch Your Log!
 
A complete hadoop stack
A complete hadoop stackA complete hadoop stack
A complete hadoop stack
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop Workshop
 
Hadoop 1 vs hadoop2
Hadoop 1 vs hadoop2Hadoop 1 vs hadoop2
Hadoop 1 vs hadoop2
 
Mapreduce Tutorial
Mapreduce TutorialMapreduce Tutorial
Mapreduce Tutorial
 
Hadoop 2 cluster architecture
Hadoop 2 cluster architectureHadoop 2 cluster architecture
Hadoop 2 cluster architecture
 
Hadoop-BigData
Hadoop-BigDataHadoop-BigData
Hadoop-BigData
 
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop 101 - Big Data Technology
Hadoop 101 - Big Data TechnologyHadoop 101 - Big Data Technology
Hadoop 101 - Big Data Technology
 
Hadoop 101 v2
Hadoop 101 v2Hadoop 101 v2
Hadoop 101 v2
 
Big Data - Fast Machine Learning at Scale + Couchbase
Big Data - Fast Machine Learning at Scale + CouchbaseBig Data - Fast Machine Learning at Scale + Couchbase
Big Data - Fast Machine Learning at Scale + Couchbase
 
Papyri.info's Linked Data Story
Papyri.info's Linked Data StoryPapyri.info's Linked Data Story
Papyri.info's Linked Data Story
 
Alluxio
AlluxioAlluxio
Alluxio
 
Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoop
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
 
Big Data - Load CSV File & Query the EZ way - HPCC Systems
Big Data - Load CSV File & Query the EZ way - HPCC SystemsBig Data - Load CSV File & Query the EZ way - HPCC Systems
Big Data - Load CSV File & Query the EZ way - HPCC Systems
 

Viewers also liked

Viewers also liked (11)

PYNQ 祭り: Pmod のプログラミング
PYNQ 祭り: Pmod のプログラミングPYNQ 祭り: Pmod のプログラミング
PYNQ 祭り: Pmod のプログラミング
 
Pynqでカメラ画像をリアルタイムfastx コーナー検出
Pynqでカメラ画像をリアルタイムfastx コーナー検出Pynqでカメラ画像をリアルタイムfastx コーナー検出
Pynqでカメラ画像をリアルタイムfastx コーナー検出
 
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkAPACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
 
Presto in my_use_case
Presto in my_use_casePresto in my_use_case
Presto in my_use_case
 
PYNQで○○してみた!
PYNQで○○してみた!PYNQで○○してみた!
PYNQで○○してみた!
 
PYNQ祭り
PYNQ祭りPYNQ祭り
PYNQ祭り
 
Pynq祭り資料
Pynq祭り資料Pynq祭り資料
Pynq祭り資料
 
[db analytics showcase Sapporo 2017] A15: Pythonでの分散処理再入門 by 株式会社HPCソリューションズ ...
[db analytics showcase Sapporo 2017] A15: Pythonでの分散処理再入門 by 株式会社HPCソリューションズ ...[db analytics showcase Sapporo 2017] A15: Pythonでの分散処理再入門 by 株式会社HPCソリューションズ ...
[db analytics showcase Sapporo 2017] A15: Pythonでの分散処理再入門 by 株式会社HPCソリューションズ ...
 
PYNQ単体でUIを表示してみる(PYNQまつり)
PYNQ単体でUIを表示してみる(PYNQまつり)PYNQ単体でUIを表示してみる(PYNQまつり)
PYNQ単体でUIを表示してみる(PYNQまつり)
 
PYNQ祭りLT todotani
PYNQ祭りLT todotaniPYNQ祭りLT todotani
PYNQ祭りLT todotani
 
コンピュータエンジニアへのFPGAのすすめ
コンピュータエンジニアへのFPGAのすすめコンピュータエンジニアへのFPGAのすすめ
コンピュータエンジニアへのFPGAのすすめ
 

Similar to 20171012 found IT #9 PySparkの勘所

Similar to 20171012 found IT #9 PySparkの勘所 (20)

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
[Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing
[Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing[Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing
[Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
PYSPARK PROGRAMMING.pdf
PYSPARK PROGRAMMING.pdfPYSPARK PROGRAMMING.pdf
PYSPARK PROGRAMMING.pdf
 
Infra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASKInfra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASK
 
H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling Water
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Analytics and Machine Learning with Spark and MongoDB
Analytics and Machine Learning with Spark and MongoDBAnalytics and Machine Learning with Spark and MongoDB
Analytics and Machine Learning with Spark and MongoDB
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Big Data Ecosystem after Spark
Big Data Ecosystem after SparkBig Data Ecosystem after Spark
Big Data Ecosystem after Spark
 

More from Ryuji Tamagawa

More from Ryuji Tamagawa (20)

hbstudy 74 Site Reliability Engineering
hbstudy 74 Site Reliability Engineeringhbstudy 74 Site Reliability Engineering
hbstudy 74 Site Reliability Engineering
 
20161004 データ処理のプラットフォームとしてのpythonとpandas 東京
20161004 データ処理のプラットフォームとしてのpythonとpandas 東京20161004 データ処理のプラットフォームとしてのpythonとpandas 東京
20161004 データ処理のプラットフォームとしてのpythonとpandas 東京
 
20160708 データ処理のプラットフォームとしてのpython 札幌
20160708 データ処理のプラットフォームとしてのpython 札幌20160708 データ処理のプラットフォームとしてのpython 札幌
20160708 データ処理のプラットフォームとしてのpython 札幌
 
20160127三木会 RDB経験者のためのspark
20160127三木会 RDB経験者のためのspark20160127三木会 RDB経験者のためのspark
20160127三木会 RDB経験者のためのspark
 
20151205 Japan.R SparkRとParquet
20151205 Japan.R SparkRとParquet20151205 Japan.R SparkRとParquet
20151205 Japan.R SparkRとParquet
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Apache Sparkの紹介
Apache Sparkの紹介Apache Sparkの紹介
Apache Sparkの紹介
 
足を地に着け落ち着いて考える
足を地に着け落ち着いて考える足を地に着け落ち着いて考える
足を地に着け落ち着いて考える
 
ヘルシープログラマ・翻訳と実践
ヘルシープログラマ・翻訳と実践ヘルシープログラマ・翻訳と実践
ヘルシープログラマ・翻訳と実践
 
Google Big Query
Google Big QueryGoogle Big Query
Google Big Query
 
BigQueryの課金、節約しませんか
BigQueryの課金、節約しませんかBigQueryの課金、節約しませんか
BigQueryの課金、節約しませんか
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQuery
 
Google BigQueryについて 紹介と推測
Google BigQueryについて 紹介と推測Google BigQueryについて 紹介と推測
Google BigQueryについて 紹介と推測
 
lessons learned from talking at rakuten technology conference
lessons learned from talking at rakuten technology conferencelessons learned from talking at rakuten technology conference
lessons learned from talking at rakuten technology conference
 
丸の内MongoDB勉強会#20LT 2.8のストレージエンジン動かしてみました
丸の内MongoDB勉強会#20LT 2.8のストレージエンジン動かしてみました丸の内MongoDB勉強会#20LT 2.8のストレージエンジン動かしてみました
丸の内MongoDB勉強会#20LT 2.8のストレージエンジン動かしてみました
 
Mongo dbを知ろう devlove関西
Mongo dbを知ろう   devlove関西Mongo dbを知ろう   devlove関西
Mongo dbを知ろう devlove関西
 
Seleniumをもっと知るための本の話
Seleniumをもっと知るための本の話Seleniumをもっと知るための本の話
Seleniumをもっと知るための本の話
 
データベース勉強会 In 広島 mongodb
データベース勉強会 In 広島  mongodbデータベース勉強会 In 広島  mongodb
データベース勉強会 In 広島 mongodb
 
Invitation to mongo db @ Rakuten TechTalk
Invitation to mongo db @ Rakuten TechTalkInvitation to mongo db @ Rakuten TechTalk
Invitation to mongo db @ Rakuten TechTalk
 
MongoDB tuning on AWS
MongoDB tuning on AWSMongoDB tuning on AWS
MongoDB tuning on AWS
 

Recently uploaded

Recently uploaded (20)

ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهله
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
How to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in PakistanHow to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in Pakistan
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 

20171012 found IT #9 PySparkの勘所