SlideShare a Scribd company logo

Apache Arrowフォーマットはなぜ速いのか

Kouhei Sutou
Kouhei Sutou
Kouhei SutouFree Software Programmer at ClearCode

2020年代、ビッグデータをどう扱えばよいか。今は各プロダクト毎に効率的な扱い方を実装していますが、2020年代はそんな時代ではありません!ビッグデータの扱いでも、共通で必要なものはプロダクトを超えて協力して開発して共有する、そんな時代です!ビッグデータのための共通基盤、それがオープンソースのApache Arrowです。AmazonもGoogleもNVIDIAも開発に参加しています。 このセッションではApache Arrow開発チームの主要メンバーがApache Arrowフォーマットがなぜ速いのかを説明します。 db tech showcase ONLINE 2020用の資料です。 https://www.db-tech-showcase.com/dbts/2020/online/

Apache Arrowフォーマットはなぜ速いのか

1 of 45
Download to read offline
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
Apache Arrowフォーマットは
なぜ速いのか
須藤功平
株式会社クリアコード
db tech showcase ONLINE 2020
2020-12-08
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
Apache Arrowと私
2016-12-21に最初のコミット✓
2017-05-10にコミッター✓
2017-09-15にPMCメンバー✓
2020-11-25現在コミット数2位(508人中)✓
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
Apache Arrow
データ分析ツールの基盤を提供✓
ツールで必要になるやつ全部入り✓
各種プログラミング言語をサポート✓
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
全部ってなに!?
データフォーマットとかそのデータ
を高速処理する機能とか他の各種
データフォーマットと変換する機能
とかローカル・リモートにあるデー
タを透過的に読み書きする機能とか
高速RPCとかなんだけど、全部を説
明すると「すごそうだけどよくわか
らないね!」と言われる!
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
今日のトピック
Apache Arrow
フォーマット
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
Apache Arrowフォーマット
データフォーマット
通信用とインメモリー用の両方✓
✓
表形式のデータ用
=データフレーム形式のデータ用✓
✓
速い!✓
Ad

Recommended

RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache Arrow
RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache ArrowRubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache Arrow
RubyKaigi Takeout 2021 - Red Arrow - Ruby and Apache ArrowKouhei Sutou
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...Amazon Web Services
 
AWS における サーバーレスの基礎からチューニングまで
AWS における サーバーレスの基礎からチューニングまでAWS における サーバーレスの基礎からチューニングまで
AWS における サーバーレスの基礎からチューニングまで崇之 清水
 
サーバーレスの現実と夢と今
サーバーレスの現実と夢と今サーバーレスの現実と夢と今
サーバーレスの現実と夢と今Hiroyuki Hara
 
Deep Dive on the Amazon Aurora PostgreSQL-compatible Edition - DAT402 - re:In...
Deep Dive on the Amazon Aurora PostgreSQL-compatible Edition - DAT402 - re:In...Deep Dive on the Amazon Aurora PostgreSQL-compatible Edition - DAT402 - re:In...
Deep Dive on the Amazon Aurora PostgreSQL-compatible Edition - DAT402 - re:In...Amazon Web Services
 
Airbnb Runs on Amazon Aurora - DAT331 - re:Invent 2017
Airbnb Runs on Amazon Aurora - DAT331 - re:Invent 2017Airbnb Runs on Amazon Aurora - DAT331 - re:Invent 2017
Airbnb Runs on Amazon Aurora - DAT331 - re:Invent 2017Amazon Web Services
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech Talks
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech TalksScaling Redis Workloads with Amazon ElastiCache - AWS Online Tech Talks
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech TalksAmazon Web Services
 

More Related Content

What's hot

(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...Amazon Web Services
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Sadayuki Furuhashi
 
Speeding up PySpark with Arrow
Speeding up PySpark with ArrowSpeeding up PySpark with Arrow
Speeding up PySpark with ArrowRubén Berenguel
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Amazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)
AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)
AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)Amazon Web Services Korea
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)Amazon Web Services
 
How does that PySpark thing work? And why Arrow makes it faster?
How does that PySpark thing work? And why Arrow makes it faster?How does that PySpark thing work? And why Arrow makes it faster?
How does that PySpark thing work? And why Arrow makes it faster?Rubén Berenguel
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAmazon Web Services
 
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014Amazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Deep Learning을 위한 AWS 기반 인공 지능(AI) 서비스 (윤석찬)
Deep Learning을 위한  AWS 기반 인공 지능(AI) 서비스 (윤석찬)Deep Learning을 위한  AWS 기반 인공 지능(AI) 서비스 (윤석찬)
Deep Learning을 위한 AWS 기반 인공 지능(AI) 서비스 (윤석찬)Amazon Web Services Korea
 
Life of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsLife of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsShankar M S
 
Analysis big data by use php with storm
Analysis big data by use php with stormAnalysis big data by use php with storm
Analysis big data by use php with storm毅 吕
 
Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...
Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...
Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...Amazon Web Services
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 

What's hot (20)

(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
 
Speeding up PySpark with Arrow
Speeding up PySpark with ArrowSpeeding up PySpark with Arrow
Speeding up PySpark with Arrow
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Oracle on AWS RDS Migration - 성기명
Oracle on AWS RDS Migration - 성기명Oracle on AWS RDS Migration - 성기명
Oracle on AWS RDS Migration - 성기명
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)
AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)
AWS re:Invent 특집 세미나 - (2) DB/분석 분야 신규 서비스 요약 :: 윤석찬 (AWS 테크에반젤리스트)
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
 
How does that PySpark thing work? And why Arrow makes it faster?
How does that PySpark thing work? And why Arrow makes it faster?How does that PySpark thing work? And why Arrow makes it faster?
How does that PySpark thing work? And why Arrow makes it faster?
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Deep Learning을 위한 AWS 기반 인공 지능(AI) 서비스 (윤석찬)
Deep Learning을 위한  AWS 기반 인공 지능(AI) 서비스 (윤석찬)Deep Learning을 위한  AWS 기반 인공 지능(AI) 서비스 (윤석찬)
Deep Learning을 위한 AWS 기반 인공 지능(AI) 서비스 (윤석찬)
 
Life of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsLife of PySpark - A tale of two environments
Life of PySpark - A tale of two environments
 
Analysis big data by use php with storm
Analysis big data by use php with stormAnalysis big data by use php with storm
Analysis big data by use php with storm
 
Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...
Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...
Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A ...
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
 

Similar to Apache Arrowフォーマットはなぜ速いのか

Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksDustin Vannoy
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Timothy Spann
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrTimothy Spann
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Amazon Web Services
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache PulsarTimothy Spann
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaAshish Thapliyal
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...
FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...
FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...Amazon Web Services
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Amazon Web Services
 
Building Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksBuilding Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksLace Lofranco
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Troposphere Python infrastructure as code for AWS Cloudformation
Troposphere Python infrastructure as code for AWS CloudformationTroposphere Python infrastructure as code for AWS Cloudformation
Troposphere Python infrastructure as code for AWS CloudformationPatrick Pierson
 
Apache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesApache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesHao Chen
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueAmazon Web Services
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueAmazon Web Services
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...Amazon Web Services
 

Similar to Apache Arrowフォーマットはなぜ速いのか (20)

Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar Deep Dive into Building Streaming Applications with Apache Pulsar
Deep Dive into Building Streaming Applications with Apache Pulsar
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and Kafka
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...
FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...
FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveilla...
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
 
HDInsight for Architects
HDInsight for ArchitectsHDInsight for Architects
HDInsight for Architects
 
Building Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksBuilding Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure Databricks
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Troposphere Python infrastructure as code for AWS Cloudformation
Troposphere Python infrastructure as code for AWS CloudformationTroposphere Python infrastructure as code for AWS Cloudformation
Troposphere Python infrastructure as code for AWS Cloudformation
 
Apache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New FeaturesApache Eagle: Architecture Evolvement and New Features
Apache Eagle: Architecture Evolvement and New Features
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS Glue
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS Glue
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
 

More from Kouhei Sutou

Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021Kouhei Sutou
 
Rubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェアRubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェアKouhei Sutou
 
Apache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory dataApache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory dataKouhei Sutou
 
Redmine検索の未来像
Redmine検索の未来像Redmine検索の未来像
Redmine検索の未来像Kouhei Sutou
 
Apache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory dataApache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory dataKouhei Sutou
 
Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Kouhei Sutou
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームKouhei Sutou
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムKouhei Sutou
 
MySQL 8.0でMroonga
MySQL 8.0でMroongaMySQL 8.0でMroonga
MySQL 8.0でMroongaKouhei Sutou
 
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Kouhei Sutou
 
MariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムMariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムKouhei Sutou
 
PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!Kouhei Sutou
 
PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版Kouhei Sutou
 
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムPostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムKouhei Sutou
 
Improve extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionImprove extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionKouhei Sutou
 

More from Kouhei Sutou (20)

Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
 
Rubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェアRubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェア
 
Apache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory dataApache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory data
 
Apache Arrow 2019
Apache Arrow 2019Apache Arrow 2019
Apache Arrow 2019
 
Redmine検索の未来像
Redmine検索の未来像Redmine検索の未来像
Redmine検索の未来像
 
Apache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory dataApache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory data
 
Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォーム
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
 
MySQL 8.0でMroonga
MySQL 8.0でMroongaMySQL 8.0でMroonga
MySQL 8.0でMroonga
 
My way with Ruby
My way with RubyMy way with Ruby
My way with Ruby
 
Red Data Tools
Red Data ToolsRed Data Tools
Red Data Tools
 
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
 
MariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムMariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システム
 
PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!
 
PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版
 
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムPostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
 
Improve extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionImprove extension API: C++ as better language for extension
Improve extension API: C++ as better language for extension
 

Recently uploaded

2024 February Patch Tuesday
2024 February Patch Tuesday2024 February Patch Tuesday
2024 February Patch TuesdayIvanti
 
Introduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxIntroduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxBrandon Minnick, MBA
 
Tete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdfTete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdfDomotica daVinci
 
Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Daniel Toomey
 
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdfQuinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdfDomotica daVinci
 
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptxEvolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptxKyle Willson
 
Manual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveManual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveDomotica daVinci
 
Navigating the Never Normal Strategies for Portfolio Leaders
Navigating the Never Normal Strategies for Portfolio LeadersNavigating the Never Normal Strategies for Portfolio Leaders
Navigating the Never Normal Strategies for Portfolio LeadersOnePlan Solutions
 
Bit N Build Poland
Bit N Build PolandBit N Build Poland
Bit N Build PolandGDSC PJATK
 
OTel Orientation_ How to Train Teams (OTel in Practice).pdf
OTel Orientation_ How to Train Teams (OTel in Practice).pdfOTel Orientation_ How to Train Teams (OTel in Practice).pdf
OTel Orientation_ How to Train Teams (OTel in Practice).pdfPaige Cruz
 
Manual sensor Zigbee 3.0 MOES ZSS-X-PIRL-C
Manual  sensor Zigbee 3.0 MOES ZSS-X-PIRL-CManual  sensor Zigbee 3.0 MOES ZSS-X-PIRL-C
Manual sensor Zigbee 3.0 MOES ZSS-X-PIRL-CDomotica daVinci
 
Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.Mateusz Kwasniewski
 
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Adrian Sanabria
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringMassimo Talia
 
Enhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersEnhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersThousandEyes
 
Bringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptxBringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptxMaarten Balliauw
 
My sample product research idea for you!
My sample product research idea for you!My sample product research idea for you!
My sample product research idea for you!KivenRaySarsaba
 
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMINGAUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMINGLiveplex
 
Breaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologyBreaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologySafe Software
 

Recently uploaded (20)

2024 February Patch Tuesday
2024 February Patch Tuesday2024 February Patch Tuesday
2024 February Patch Tuesday
 
Introduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxIntroduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptx
 
Tete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdfTete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdf
 
Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024
 
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdfQuinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
 
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptxEvolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
Evolution of Chatbots: From Custom AI Chatbots and AI Chatbots for Websites.pptx
 
Manual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveManual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-Wave
 
Navigating the Never Normal Strategies for Portfolio Leaders
Navigating the Never Normal Strategies for Portfolio LeadersNavigating the Never Normal Strategies for Portfolio Leaders
Navigating the Never Normal Strategies for Portfolio Leaders
 
Bit N Build Poland
Bit N Build PolandBit N Build Poland
Bit N Build Poland
 
OTel Orientation_ How to Train Teams (OTel in Practice).pdf
OTel Orientation_ How to Train Teams (OTel in Practice).pdfOTel Orientation_ How to Train Teams (OTel in Practice).pdf
OTel Orientation_ How to Train Teams (OTel in Practice).pdf
 
Manual sensor Zigbee 3.0 MOES ZSS-X-PIRL-C
Manual  sensor Zigbee 3.0 MOES ZSS-X-PIRL-CManual  sensor Zigbee 3.0 MOES ZSS-X-PIRL-C
Manual sensor Zigbee 3.0 MOES ZSS-X-PIRL-C
 
Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.
 
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineering
 
5 Tech Trend to Notice in ESG Landscape- 47Billion
5 Tech Trend to Notice in ESG Landscape- 47Billion5 Tech Trend to Notice in ESG Landscape- 47Billion
5 Tech Trend to Notice in ESG Landscape- 47Billion
 
Enhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersEnhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for Partners
 
Bringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptxBringing nullability into existing code - dammit is not the answer.pptx
Bringing nullability into existing code - dammit is not the answer.pptx
 
My sample product research idea for you!
My sample product research idea for you!My sample product research idea for you!
My sample product research idea for you!
 
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMINGAUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
 
Breaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI TechnologyBreaking Barriers & Leveraging the Latest Developments in AI Technology
Breaking Barriers & Leveraging the Latest Developments in AI Technology
 

Apache Arrowフォーマットはなぜ速いのか

  • 1. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 Apache Arrowフォーマットは なぜ速いのか 須藤功平 株式会社クリアコード db tech showcase ONLINE 2020 2020-12-08
  • 2. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 Apache Arrowと私 2016-12-21に最初のコミット✓ 2017-05-10にコミッター✓ 2017-09-15にPMCメンバー✓ 2020-11-25現在コミット数2位(508人中)✓
  • 3. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 Apache Arrow データ分析ツールの基盤を提供✓ ツールで必要になるやつ全部入り✓ 各種プログラミング言語をサポート✓
  • 4. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 全部ってなに!? データフォーマットとかそのデータ を高速処理する機能とか他の各種 データフォーマットと変換する機能 とかローカル・リモートにあるデー タを透過的に読み書きする機能とか 高速RPCとかなんだけど、全部を説 明すると「すごそうだけどよくわか らないね!」と言われる!
  • 5. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 今日のトピック Apache Arrow フォーマット
  • 6. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 Apache Arrowフォーマット データフォーマット 通信用とインメモリー用の両方✓ ✓ 表形式のデータ用 =データフレーム形式のデータ用✓ ✓ 速い!✓
  • 7. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 速い! データ交換が速い!✓ データ処理が速い!✓
  • 8. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 データ交換が速い! Apache Arrowフォーマットにすると高速化! データソース データ処理 Apache Arrow logo is licensed under Apache License 2.0 © 2016-2020 The Apache Software Foundation CSV など データソース データ処理 高速化
  • 9. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 利用事例:Apache Spark Spark PySpark Apache Arrow logo is licensed under Apache License 2.0 © 2016-2020 The Apache Software Foundation pickle 高速化 Spark PySpark 実行時間 Data at https://arrow.apache.org/blog/2017/07/26/spark-arrow/ 0s 20s (短いほど高速)
  • 10. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 利用事例:Amazon Athena S3 Athena Apache Arrow logo is licensed under Apache License 2.0 © 2016-2020 The Apache Software Foundation CSV Federated Query 組込クエリーより 高速 S3 Athena 1億行の処理にかかった時間 Data at https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk#performance (短いほど高速) 0s 4s 8s
  • 11. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 利用事例:RAPIDS cuDF cuIO Analytics Data Preparation VisualizationModel Training cuML Machine Learning cuGraph Graph Analytics PyTorch, TensorFlow, MxNet Deep Learning cuxfilter, pyViz, plotly Visualization Dask GPU Memory RAPIDS End-to-End GPU Accelerated Data Science https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=3
  • 12. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 利用事例:RAPIDS 25-100x Improvement Less Code Language Flexible Primarily In-Memory HDFS Read HDFS Write HDFS Read HDFS Write HDFS Read Query ETL ML Train HDFS Read Query ETL ML Train HDFS Read GPU Read Query CPU Write GPU Read ETL CPU Write GPU Read ML Train 5-10x Improvement More Code Language Rigid Substantially on GPU Traditional GPU Processing Hadoop Processing, Reading from Disk Spark In-Memory Processing Data Processing Evolution Faster Data Access, Less Data Movement RAPIDS Arrow Read ETL ML Train Query 50-100x Improvement Same Code Language Flexible Primarily on GPU https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=8
  • 13. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 どうして速いの? シリアライズコストが低い すぐに送れるようになる✓ ✓ デシリアライズコストが低い 受け取ったデータをすぐに使えるようになる✓ ✓
  • 14. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 シリアライズ処理 メタデータを用意1. メタデータ+元データそのものを送信 元データを加工しないから速い!✓ なにもしないのが最速!✓ 2.
  • 15. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 元データを加工する例:JSON 0x01 0x02(8bit数値の配列) ↓ "[1,2]"(JSON) 0x01→0x49(数値→ASCIIの文字:'1') 0x02→0x50(数値→ASCIIの文字:'2')
  • 16. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 元データそのものを使うと… 変換処理にCPUを使わなくてよい 速い✓ ✓ 変換後のデータ用のメモリー確保ゼロ 大きなメモリー確保はコストが高い✓ 一定の作業領域を使い回すとかしなくてよい✓ 速い✓ ✓
  • 17. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 デシリアライズ処理 メタデータをパース1. メタデータを基に元データを取り出す 元データをそのまま使えるから速い!✓ なにもしないのが最速!✓ 2.
  • 18. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 元データを元に戻す例:JSON "[1,2]"(JSON) ↓ 0x01 0x02(8bit数値の配列) 0x49→0x01(ASCIIの文字:'1'→数値) 0x50→0x02(ASCIIの文字:'2'→数値)
  • 19. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 元データを取り出せると… 変換処理にCPUを使わなくてよい 速い✓ ✓ 変換後のデータ用のメモリー確保ゼロ すでにあるデータをそのまま使うのでゼロコピー✓ 速い✓ ✓ メモリーマップで直接データを使える ディスク上のメモリー以上のデータを扱える✓ ✓
  • 20. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 {,デ}シリアライズコスト Apache Arrowフォーマット ほぼメタデータのパースコストだけ✓ ✓ それ以外の多くのフォーマット データ変換処理(CPU)✓ 作業用メモリー確保処理(メモリー)✓ ✓
  • 21. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 データ交換が速い! Apache Arrowフォーマットにすると高速化! データソース データ処理 Apache Arrow logo is licensed under Apache License 2.0 © 2016-2020 The Apache Software Foundation CSV など データソース データ処理 高速化
  • 22. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 利用事例:Apache Spark Spark PySpark Apache Arrow logo is licensed under Apache License 2.0 © 2016-2020 The Apache Software Foundation pickle 高速化 Spark PySpark 実行時間 Data at https://arrow.apache.org/blog/2017/07/26/spark-arrow/ 0s 20s (短いほど高速)
  • 23. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 利用事例:Amazon Athena S3 Athena Apache Arrow logo is licensed under Apache License 2.0 © 2016-2020 The Apache Software Foundation CSV Federated Query 組込クエリーより 高速 S3 Athena 1億行の処理にかかった時間 Data at https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk#performance (短いほど高速) 0s 4s 8s
  • 24. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 利用事例:RAPIDS 25-100x Improvement Less Code Language Flexible Primarily In-Memory HDFS Read HDFS Write HDFS Read HDFS Write HDFS Read Query ETL ML Train HDFS Read Query ETL ML Train HDFS Read GPU Read Query CPU Write GPU Read ETL CPU Write GPU Read ML Train 5-10x Improvement More Code Language Rigid Substantially on GPU Traditional GPU Processing Hadoop Processing, Reading from Disk Spark In-Memory Processing Data Processing Evolution Faster Data Access, Less Data Movement RAPIDS Arrow Read ETL ML Train Query 50-100x Improvement Same Code Language Flexible Primarily on GPU https://docs.rapids.ai/overview/RAPIDS%200.15%20Release%20Deck.pdf#page=8
  • 25. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 データサイズは? CPU・メモリーにやさしくて速いんだね!✓ じゃあ、データサイズはどうなの? 大きいとネットワーク・IOがボトルネック✓ 本来のデータ処理を最大限リソースを使いたい✓ データ交換をボトルネックにしたくない✓ ✓
  • 26. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 データサイズ 別に小さくない 元データそのままなのでデータ量に応じて増加✓ ✓ Zstandard・LZ4での圧縮をサポート ゼロコピーではなくなるがサイズは数分の1✓ 圧縮・展開が必要→元データそのものを使えない✓ CPU・メモリー負荷は上がるが ネットワーク・IO負荷は下がる ✓ ネットワーク・IOがボトルネックになるなら効く✓ ✓
  • 27. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 圧縮時のデータサイズと読み込み速度 https://ursalabs.org/blog/2020-feather-v2/
  • 28. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 データ交換が速い!のまとめ {,デ}シリアライズが速い 元データをそのまま使うので処理が少ない✓ CPU・メモリーにやさしい✓ ✓ 圧縮もサポート ネットワーク・IOがボトルネックならこれ✓ CPU・メモリー負荷は上がるが データ交換のボトルネックを解消できるかも ✓ ✓
  • 29. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 交換したデータの扱い データ分析はデータ交換だけじゃない データ交換だけ速くしても基盤とは言えない✓ ✓ データ処理も速くしないと! データ処理を速くするにはデータ構造が大事✓ ✓
  • 30. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 高速処理のためのデータ構造 基本方針: 関連するデータを近くに置く✓ ✓ 効果: CPUキャッシュミスを減らす✓ SIMDを活用できる✓ ✓
  • 31. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 データ分析時の関連データ 分析時はカラムごとの処理が多い 集計・ソート・絞り込み…✓ ✓ 同じカラムのデータを近くに置く カラムナーフォーマット✓ ✓
  • 32. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 カラムナーフォーマット 列行 a b c 1 2 3 値 値 値 値 値 値 値 値 値 列 行 a b c 1 2 3 値 値 値 値 値 値 値 値 値 Apache Arrow 列ごと RDBMS 列 行 値の管理単位 行ごと 高速なアクセス単位
  • 33. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 各カラムでのデータの配置 関連するデータを近くに置く✓ 定数時間でアクセスできるように置く✓ SIMDできるように置く アラインする アライン:データの境界を64の倍数とかに揃える ✓ 条件分岐をなくす✓ ✓
  • 34. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 真偽値・数値のデータの配置 固定長データなので連続して配置 32ビット整数:[1, 2, 3] 0x01 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x03 ...
  • 35. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 文字列・バイト列:データの配置 実データバイト列+長さ配列に配置 UTF-8文字列:["Hello", "", "!"] 実データバイト列:"Hello!" 長さ配列:[0, 5, 5, 6] i番目の長さ:長さ配列[i+1] - 長さ配列[i] i番目のデータ: 実データバイト列[長さ配列[i]..長さ配列[i+1]] 注:長さ→データ→長さ→データ→…で置くと 定数時間でi番目にアクセスできない
  • 36. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 nullと条件分岐 null対応のアプローチ: null値:Julia: missing, R: NA✓ 別途ビットマップを用意:Apache Arrow✓ ✓ ビットマップを使うと条件分岐をなくせる✓
  • 37. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 nullと条件分岐とSIMD null 1 0 1 data 1 X 3 0 1 1 X 2 5 +[1, null, 3] [null, 2, 5] 0 0 1 X X 8 [null, null, 8] ビット単位の& nullのところも 気にせずSIMDで+ null data
  • 38. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 nullと条件分岐とSIMD https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
  • 39. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 高速なデータ処理のまとめ 高速なデータ処理にはデータ構造が重要 データ分析にはカラムナーフォマットが適切✓ ✓ 定数時間でアクセス可能なデータの配置✓ SIMDにやさしいデータの持ち方 アライン・null用のビットマップ✓ ✓
  • 40. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 まとめ Apache Arrow なんかデータ分析に便利そうなすごいやつ!✓ 「よくわからん」と言われるからといって 全体像を説明してもらえなかった! ✓ 使い方も説明してもらえなかった!✓ ✓ Apache Arrowフォーマット これだけ説明してもらえた!✓ ✓
  • 41. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 まとめ:Apache Arrowフォーマット Apache Arrowフォーマット 通信用・インメモリー用のデータフォーマット✓ 表形式のデータ用✓ ✓ Apache Arrowフォーマットは速い データ交換が速い✓ データ処理が速い✓ データ交換してすぐに高速処理できるフォーマット✓ ✓
  • 42. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 まとめ:なぜデータ交換が速いのか 元データをそのままやりとりできるから {,デ}シリアライズコストが低い✓ CPU・メモリーにやさしい✓ ✓ ネットワーク・IOの負荷を下げたい Zstandard・LZ4による圧縮✓ CPU・メモリー負荷は上がるが ネットワーク・IO負荷は下がる ✓ ✓
  • 43. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 まとめ:なぜデータ処理が速いのか 最適化されたデータ構造 カラムナーフォーマット✓ SIMDを使えるデータの持ち方✓ ✓ 最適化された実装 今回は紹介していない!!!✓ ✓
  • 44. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 次回予告! 案1:Apache Arrowデータの高速処理✓ 案2:Apache Arrowデータの高速RPC✓ 案3:○○言語でApache Arrowを使う方法✓ ...✓
  • 45. Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 次のステップ もっと詳しく知りたくなったから イベント・社内・…で紹介して! https://www.clear-code.com/contact/✓ ✓ 使いたくなったから技術サポートして! https://www.clear-code.com/contact/✓ ✓ Apache Arrowの開発に参加したい! https://arrow.apache.org/community/✓ ✓