SlideShare a Scribd company logo
1 of 47
Download to read offline
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast data processing
with Ruby and Apache Arrow
Sutou Kouhei
ClearCode Inc.
RubyKaigi 2022
2022-09-10
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sutou Kouhei
A president Ruby committer
The president of ClearCode Inc.
クリアコードの社長
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sutou Kouhei
The Apache Arrow PMC chair
PMC: Project Management Committee
Apache Arrowのプロジェクト管理委員会のリーダー
✓
#2 commits(コミット数2位)
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sutou Kouhei
The pioneer in Ruby and Arrow
A Ruby committer
Maintain some standard libraries/default gems
標準ライブラリーとかデフォルトgemのメンテナンスをしている
✓
✓
The author of Red Arrow
✓
Red Arrow:
The official Apache Arrow library for Ruby
公式のRuby用のApache Arrowライブラリー
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Why do I work on Red Arrow?
なぜRed Arrowの開発をしているか
To use Ruby for data processing too!
データ処理でもRubyを使いたい!
At least a part of data processing
データ処理の全部と言わず一部だけでも
✓
✓
Data processing is an important task
データ処理は最近の重要なタスクの1つ
# of Rubyists will be increased by this
データ処理にRubyを使えるようになるとRubyistが増えるはず
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Current situation
Negative spiral
今は負のスパイラル
Few users
Small community Few developers
Few useful tools
How to break the negative spiral?
どうやってこの負のスパイラルを打開する?
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Expand useful tools
with few developers
少人数で便利なツールを増やせればいいんじゃない?
Negative spiral
Positive spiral
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
But how?
でもどうやって?
Apache Arrow
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Apache Arrow
Cross-language dev platform for data
複数言語対応のデータ用の開発プラットフォーム
Ruby community doesn't need to dev everything
Rubyコミュニティーがすべてを開発しなくてもよい
✓
We can share common implementations
共通の実装を言語を超えて共有できる
✓
✓
Today's highlighted features
今日注目する機能
Fast data processing(高速データ処理)
✓
Fast data interchange(高速データ交換)
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
My approach
私のアプローチ
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Goal of this talk
このトークのゴール
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
You want to use Ruby
for some data processings
いくつかのデータ処理でRubyを使いたくなる
Especially, you want to implement a BI tool
特にBIツールを作りたくなる
✓
✓
You join Red Data Tools project
Red Data Toolsプロジェクトに参加する
It provides data processing tools for Ruby
Ruby用のデータ処理ツールを提供するプロジェクト
https://red-data-tools.github.io/
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast data processing
高速データ処理
Ruby is slow to process data
Rubyでデータを処理すると遅い
✓
Resolve in external process:(別プロセスで解決)
Use case: Web app, batch process for Web app
Use fast data processing module (e.g.: DB)
DBとか速いデータ処理モジュールを使う
✓
✓
Resolve in the same process:(プロセス内で解決)
Use cases: IRB, batch process for Web app
Implement core features in other fast lang
他の速い言語でコアの機能を実装
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
External process
別プロセス
Ruby External process
Fast data processing
Request
Response
Ruby External process
Popular case
in current Ruby usage
今のRubyの使われ方だとよくあるケース
✓
Small response: No problem
レスポンスが小さい場合は問題ない
✓
Large response:
レスポンスが大きい場合:
Sending/receiving response are slow
レスポンスの送信・受信処理が遅い
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Sending/receiving response
レスポンスの送受信
Ruby External process
Serialize
Deserialize
Send
Ruby External process
Serialize/deserialize
are slow
シリアライズ・デシリアライズが遅い
✓
How to speed them up?
どうやって高速化すればよいか
Apache Arrow format
✓
Serialize/deserialize cost ≒ 0
シリアライズ・デシリアライズコストがほぼ0
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Why Apache Arrow format is fast
Apache Arrowフォーマットはなぜ速いのか
Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1
Apache Arrowフォーマットは
なぜ速いのか
須藤功平
株式会社クリアコード
db tech showcase ONLINE 2020
2020-12-08
https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Apache Arrow Flight SQL
Ruby SQL DB
Fast data processing
Request (SQL)
Response (Apache Arrow data)
Ruby SQL DB
gRPC based protocol
gRPCベースのプロトコル
NOTE: Other network libraries
such as UCX can be used
UCXなど他のネットワークライブラリーも使える
✓
✓
Specialized to Apache Arrow format
Apache Arrowフォーマットに特化
Serialize/deserialize cost ≒ 0
シリアライズ・デシリアライズコストがほぼ0
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red Arrow Flight SQL
require "arrow-flight-sql"
location = "grpc://server:2929"
client = ArrowFlight::Client.new(location)
sql_client = ArrowFlightSQL::Client.new(client)
info = sql_client.execute("SELECT * FROM logs")
info.endpoints.each do |endpoint|
reader = sql_client.do_get(endpoint.ticket)
reader.read_all
end
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Which SQL DBs support
Apache Arrow Flight SQL?
SQL DB Support?
MySQL No
PostgreSQL No
BigQuery No
Trino No
Dremio Yes
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Why don't most SQL DBs support it?
どうしてほとんどのSQL DBはサポートしていないの?
Flight SQL is a new protocol
Apache Arrow Flight SQLは新しいプロトコルだから
The first release: 2022-02(最初のリリース)
✓
Still experimental(まだ実験的扱い)
✓
✓
Tradition SQL DBs may not support
MySQL・PostgreSQLとか昔からあるSQL DBはサポートしないかも
✓
New SQL DBs will support because...
新しいSQL DBはサポートするはず。なぜなら…
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Compatibility is important
互換性が重要だから
New SQL DBs often use major protocols
新しいSQL DBは既存のメジャーなプロトコルを使うことが多い
To reuse existing client libraries
ユーザーは既存のクライアントライブラリーで新しいSQL DBを使える
✓
✓
For example:
MySQL protocol: TiDB, ...
✓
PostgreSQL protocol: Cloud Spanner, CockroachDB, ...
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Future
将来
Flight SQL client libraries will be increased
Flight SQLのクライアントライブラリーが充実するだろう
✓
New SQL DBs will support Flight SQL
新しいSQL DBはFlight SQLをサポートするだろう
To reuse existing client libraries
既存のクライアントライブラリーを再利用するため
✓
✓
BI tools will support Flight SQL by default
BIツールはデフォルトでFlight SQLをサポートするだろう
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
What should we do next?
私たちは次はなにをするべき?
Negative spiral
Positive spiral
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
Implement an Active Record
adapter for Flight SQL
Flight SQL用のActive Recordアダプターを
実装するといいんじゃないかな
For easy to use from Ruby on Rails apps
Ruby on Railsアプリから使いやすくなるはず
✓
✓
Join Red Data Tools!
Red Data Toolsで開発しようぜ!
https://red-data-tools.github.io/
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
But I'm using MySQL/PostgreSQL...
でも、MySQL/PostgreSQLを使っているし。。。
ADBC: Apache Arrow Database Connectivity
https://voltrondata.com/news/simplifying-database-connectivity-with-arrow-
flight-sql-and-adbc/
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
ADBC
Generic fast
SQL DB client API
任意のSQL DBに接続できる高速なAPI
We can use Flight SQL
through ADBC
ADBC経由でFlight SQLも使える
✓
✓
Flight SQL is the most fast driver
Flight SQLが最速のドライバー
But the same API for all SQL DBs is useful
like Active Record for Rubyists
なんだけど、すべてのSQL DBに同じAPIでアクセスできるのは便利
Active Recordも便利でしょ?
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
ADBC and Ruby
Implementing Ruby bindings
Rubyバインディングを実装中
✓
Join Red Data Tools
to implement an Active Record adapter
for ADBC!
Red Data ToolsでADBC用のActive Recordアダプターを開発しようぜ!
https://red-data-tools.github.io/
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red ADBC
require "adbc"
options = {
driver: "adbc_driver_sqlite",
filename: ":memory:",
}
ADBC::Database.open(**options) do |database|
database.connect do |connection|
puts(connection.query("SELECT 1"))
end
end
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Wrap up: External process case
まとめ:別プロセスの場合
Large response is slow
レスポンスが大きいときに遅い
Bottle neck is serialize/deserialize
ボトルネックはシリアライズ・デシリアライズ
✓
✓
Apache Arrow Flight SQL/ADBC
そこでApache Arrow Flight SQL/ADBCですよ!
✓
Let's implement AR adapters for them
Active Recordのアダプターを実装しようぜ!
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
In-process
同一プロセス Ruby Library
Fast data processing
Pass data
Return processed data
Ruby Library
Not popular case
in current Ruby usage
今のRubyの使われ方だとあまりないケース
Use case: process data
on local/remote storage
IRB, batch process for Web app
ユースケース:ローカル・リモートストレージにあるデータの処理
✓
✓
Need a fast data processing library
implemented in other fast language
高速にデータ処理できる他の速い言語で実装されたライブラリーが必要
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast language?
速い言語?
C/C++
✓
Rust
✓
Julia
✓
...
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
A C++ case: Apache Arrow
Apache Arrow has a computation module
Apache Arrowは計算モジュールも提供している
✓
Ruby bindings: Red Arrow/red-arrow
RubyバインディングはRed Arrow/red-arrow
✓
Data frame based on Red Arrow
Red Arrowベースのデータフレームもある
RedAmber/red_amber by @heronshoes
Red Data Toolsメンバーでもある鈴木さんが開発
https://mybinder.org/v2/gh/RubyData/docker-stacks/master?
filepath=red-amber.ipynb
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red Arrow
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "arrow"
sliced_codes = codes.slice do |slicer|
slicer.prefecture == "東京都"
end
puts(sliced_codes)
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red Amber
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "red_amber"
data_frame = RedAmber::DataFrame.new(codes)
prefecture = data_frame[:prefecture]
puts(data_frame[prefecture == "東京都"])
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
A C++ case: DuckDB
Similar to SQLite
but for data analytics
データ分析向けのSQLiteみたいなやつ
Fast aggregation/filter/sort
高速な集計・フィルター・ソート
✓
✓
Ruby bindings: ruby-duckdb by @suketa
Rubyコミッターでもある助田さんがRubyバインディングを開発
We can impl. fast data processing with DuckDB
DuckDBを使って高速データ処理を実現できる!
✓
If we can interchange data w/ DuckDB fast...
DuckDBと高速にデータ交換できればね…
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Is fast data interchange important?
高速データ交換は重要なの?
Ruby DuckDB
Fast data processing
Load data (!)
Read result (!)
Ruby DuckDB
If (!) are slow, total data processing is also slow
(!)が遅いと全体のデータ処理も遅くなる
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Fast data interchange
高速なデータ交換
Ruby DuckDB
Fast data processing
Pass Apache Arrow data directly
Read result with C data interface
Ruby DuckDB
Use data as-is: zero-copy
データをそのまま使う:ゼロコピー
✓
Apache Arrow C data/stream interface
C ABI for fast data interchange
高速にデータ交換するためのC ABI
✓
FYI: C ABI in Ruby: MemoryView
参考:RubyもMmeoryViewというC ABIを提供している
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
C data interface
struct ArrowArray {
// Array data description
int64_t length;
int64_t null_count;
int64_t offset;
int64_t n_buffers;
int64_t n_children;
const void** buffers;
struct ArrowArray** children;
struct ArrowArray* dictionary;
// Release callback
void (*release)(struct ArrowArray*);
// Opaque producer-specific data
void* private_data;
};
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
DuckDB with Apache Arrow
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "arrow-duckdb"
db = DuckDB::Database.open
c = db.connect
c.register("codes", codes) do # Use Apache Arrow data as-is
c.query("SELECT * FROM codes WHERE prefecture = ?",
"東京都", # Tokyo
output: :arrow) # Output as Apache Arrow data
.to_table # C data interface
end
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
C data interface on Web
Web上でもC data interface
Some unofficial WebAssembly ports exist
非公式ながらいくつかWebAssembly対応のApache Arrowライブラリーがある
Rust based, Go based, ...
✓
✓
WASM Ruby + C data I/F is useful?
WebAssembly版のRubyとC data interfaceでなんかできるかも?
FYI: DuckDB supports WebAssembly too
参考:DuckDBもWebAssemblyをサポートしている
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
A Rust case: Arrow DataFusion
SQL query engine
Internal memory layout is Apache Arrow
DuckDBと似ているが内部のメモリーレイアウトはApache Arrow
✓
✓
Direct Ruby bindings:(直接のバインディング)
arrow-datafusion by @jychen7 with Magnus
✓
✓
Ruby bindings via C API:(C API経由)
datafusion-c with cargo-c
✓
Red DataFusion with datafusion-c
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Ruby bindings via C API
C API経由のRubyバインディング
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
To develop with
devs from other langs
他言語の開発者と一緒に開発するため
C API is useful for other languages too
C APIはJavaやGoなど他の言語のバインディング開発でも有用
✓
✓
C API provides a normal C library
C APIは普通のCライブラリーを提供する
Headers: Generated by cbindgen automatically
✓
Shared libraries: Built with cargo-c
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Red DataFusion
require "datasets-arrow"
codes = Datasets::PostalCodeJapan.new.to_arrow
require "datafusion"
context = DataFusion::SessionContext.new
context.register("codes", codes) # C data interface
data_frame = context.sql(<<-SQL)
SELECT * FROM codes WHERE prefecture = '東京都'
SQL
puts(data_frame.to_table) # C data interface
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Remote data
リモートデータ
Recent modules can read remote data
最近のモジュールはリモートデータを読み込める
✓
At least these modules support it:
少なくとも次のモジュールはできる
Apache Arrow
✓
DuckDB
✓
Apache Arrow DataFusion
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Remote data example: DuckDB
SELECT COUNT(*)
FROM parquet_scan('s3://ookla-open-data/parquet/performance/*/*/*/*.parquet',
HIVE_PARTITIONING=1);
-- ┌───────┐
-- │ count_star() │
-- ├───────┤
-- │ 144567188 │
-- └───────┘
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Wrap up: In-process case
まとめ:同一プロセスの場合
Ruby Library
Fast data processing
Pass data
Return processed data
Ruby Library
To process data
on local/remote storage
ローカル・リモートストレージにあるデータの処理
✓
Low-level APIs (bindings)
are preparing
低レベルのAPIは整備できてきた
✓
Let's implement high-level API!
高レベルのAPIを実装しようぜ!
e.g.: RedAmber, Active Record adapters, ...
https://red-data-tools.github.io/
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Acknowledgments
謝辞
Voltron Data
Supports my Apache Arrow related work
私のApache Arrow関係の作業にお金を払ってくれている
✓
You can join Voltron Data or ClearCode
to work on Apache Arrow as a job
Voltron Dataさんかクリアコードに転職すると
仕事でApache Arrowを開発できるよ!
https://www.clear-code.com/recruitment/
✓
✓
Red Data Tools members
They develop data processing tools for Ruby!
Ruby用のデータ処理ツールを開発しているよ!
✓
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Wrap up
まとめ
Can use Ruby for fast data processing
with Apache Arrow in some cases
Apache Arrowを使えば高速なデータ処理にRubyを使える…こともある
External process: Fast data interchange
別プロセスの場合:高速データ交換機能を使う
✓
In-process: Fast data processing/interchange
同一プロセスの場合:高速データ処理・交換機能を使う
✓
✓
But we still have missing pieces
でも、まだ足りないところがある
✓
Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2
Goal of this talk
このトークのゴール
Negative spiral
Positive spiral
Apache Arrow
Larger community More users
More developers More useful tools
Small community Few users
Few developers Few useful tools
You want to use Ruby
for some data processings
いくつかのデータ処理でRubyを使いたくなる
Especially, you want to implement a BI tool
特にBIツールを作りたくなる
✓
✓
You join Red Data Tools project
Red Data Toolsプロジェクトに参加する
It provides data processing tools for Ruby
Ruby用のデータ処理ツールを提供するプロジェクト
https://red-data-tools.github.io/
✓
✓

More Related Content

Similar to RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow

Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreDataStax Academy
 
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...Amazon Web Services
 
DAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceDAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceAmazon Web Services
 
SQLAlchemy Primer
SQLAlchemy PrimerSQLAlchemy Primer
SQLAlchemy Primer泰 増田
 
AWS Customer Presentation - Conde Nast
AWS Customer Presentation - Conde NastAWS Customer Presentation - Conde Nast
AWS Customer Presentation - Conde NastAmazon Web Services
 
Is Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User GroupIs Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User GroupChase Douglas
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Introducing the Hub for Data Orchestration
Introducing the Hub for Data OrchestrationIntroducing the Hub for Data Orchestration
Introducing the Hub for Data OrchestrationAlluxio, Inc.
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
 
Apache Arrowフォーマットはなぜ速いのか
Apache Arrowフォーマットはなぜ速いのかApache Arrowフォーマットはなぜ速いのか
Apache Arrowフォーマットはなぜ速いのかKouhei Sutou
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyRikin Tanna
 
I can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystem
I can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystemI can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystem
I can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystemSidu Ponnappa
 
Onion Architecture with S#arp
Onion Architecture with S#arpOnion Architecture with S#arp
Onion Architecture with S#arpGary Pedretti
 

Similar to RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow (20)

SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
Report from the Field on the PostgreSQL-compatible Edition of Amazon Aurora -...
 
DAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL PerformanceDAT316_Report from the field on Aurora PostgreSQL Performance
DAT316_Report from the field on Aurora PostgreSQL Performance
 
SQLAlchemy Primer
SQLAlchemy PrimerSQLAlchemy Primer
SQLAlchemy Primer
 
AWS Customer Presentation - Conde Nast
AWS Customer Presentation - Conde NastAWS Customer Presentation - Conde Nast
AWS Customer Presentation - Conde Nast
 
Is Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User GroupIs Serverless The New Swiss Cheese? - AWS Seattle User Group
Is Serverless The New Swiss Cheese? - AWS Seattle User Group
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Introducing the Hub for Data Orchestration
Introducing the Hub for Data OrchestrationIntroducing the Hub for Data Orchestration
Introducing the Hub for Data Orchestration
 
Function as a Service
Function as a ServiceFunction as a Service
Function as a Service
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
 
Apache Arrowフォーマットはなぜ速いのか
Apache Arrowフォーマットはなぜ速いのかApache Arrowフォーマットはなぜ速いのか
Apache Arrowフォーマットはなぜ速いのか
 
Rails Concept
Rails ConceptRails Concept
Rails Concept
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/Livy
 
I can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystem
I can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystemI can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystem
I can haz HTTP - Consuming and producing HTTP APIs in the Ruby ecosystem
 
Onion Architecture with S#arp
Onion Architecture with S#arpOnion Architecture with S#arp
Onion Architecture with S#arp
 

More from Kouhei Sutou

Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021Kouhei Sutou
 
Rubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェアRubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェアKouhei Sutou
 
Apache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory dataApache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory dataKouhei Sutou
 
Redmine検索の未来像
Redmine検索の未来像Redmine検索の未来像
Redmine検索の未来像Kouhei Sutou
 
Apache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory dataApache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory dataKouhei Sutou
 
Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Kouhei Sutou
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームKouhei Sutou
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムKouhei Sutou
 
MySQL 8.0でMroonga
MySQL 8.0でMroongaMySQL 8.0でMroonga
MySQL 8.0でMroongaKouhei Sutou
 
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Kouhei Sutou
 
MariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムMariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムKouhei Sutou
 
PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!Kouhei Sutou
 
PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版Kouhei Sutou
 
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムPostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムKouhei Sutou
 
Improve extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionImprove extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionKouhei Sutou
 

More from Kouhei Sutou (20)

Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
Apache Arrow Flight – ビッグデータ用高速データ転送フレームワーク #dbts2021
 
Rubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェアRubyと仕事と自由なソフトウェア
Rubyと仕事と自由なソフトウェア
 
Apache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory dataApache Arrow 1.0 - A cross-language development platform for in-memory data
Apache Arrow 1.0 - A cross-language development platform for in-memory data
 
Apache Arrow 2019
Apache Arrow 2019Apache Arrow 2019
Apache Arrow 2019
 
Redmine検索の未来像
Redmine検索の未来像Redmine検索の未来像
Redmine検索の未来像
 
Apache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory dataApache Arrow - A cross-language development platform for in-memory data
Apache Arrow - A cross-language development platform for in-memory data
 
Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6Better CSV processing with Ruby 2.6
Better CSV processing with Ruby 2.6
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォーム
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
 
MySQL 8.0でMroonga
MySQL 8.0でMroongaMySQL 8.0でMroonga
MySQL 8.0でMroonga
 
My way with Ruby
My way with RubyMy way with Ruby
My way with Ruby
 
Red Data Tools
Red Data ToolsRed Data Tools
Red Data Tools
 
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
Mroongaの高速全文検索機能でWordPress内のコンテンツを有効活用!
 
MariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システムMariaDBとMroongaで作る全言語対応超高速全文検索システム
MariaDBとMroongaで作る全言語対応超高速全文検索システム
 
PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!PGroonga 2 – Make PostgreSQL rich full text search system backend!
PGroonga 2 – Make PostgreSQL rich full text search system backend!
 
PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版PGroonga 2 - PostgreSQLでの全文検索の決定版
PGroonga 2 - PostgreSQLでの全文検索の決定版
 
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システムPostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
PostgreSQLとPGroongaで作るPHPマニュアル高速全文検索システム
 
Improve extension API: C++ as better language for extension
Improve extension API: C++ as better language for extensionImprove extension API: C++ as better language for extension
Improve extension API: C++ as better language for extension
 

Recently uploaded

Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxStephen266013
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunksgmuir1066
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives23050636
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单aqpto5bt
 
MATERI MANAJEMEN OF PENYAKIT TETANUS.ppt
MATERI  MANAJEMEN OF PENYAKIT TETANUS.pptMATERI  MANAJEMEN OF PENYAKIT TETANUS.ppt
MATERI MANAJEMEN OF PENYAKIT TETANUS.pptRachmaGhifari
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjadimosmejiaslendon
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token PredictionNABLAS株式会社
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfgreat91
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样jk0tkvfv
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...yulianti213969
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444saurabvyas476
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"John Sobanski
 
Go paperless and transform your procurement process with the Hive Collaborati...
Go paperless and transform your procurement process with the Hive Collaborati...Go paperless and transform your procurement process with the Hive Collaborati...
Go paperless and transform your procurement process with the Hive Collaborati...LitoGarin1
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontangsiskavia95
 
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...siskavia95
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationmuqadasqasim10
 

Recently uploaded (20)

Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
 
MATERI MANAJEMEN OF PENYAKIT TETANUS.ppt
MATERI  MANAJEMEN OF PENYAKIT TETANUS.pptMATERI  MANAJEMEN OF PENYAKIT TETANUS.ppt
MATERI MANAJEMEN OF PENYAKIT TETANUS.ppt
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
Go paperless and transform your procurement process with the Hive Collaborati...
Go paperless and transform your procurement process with the Hive Collaborati...Go paperless and transform your procurement process with the Hive Collaborati...
Go paperless and transform your procurement process with the Hive Collaborati...
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
 
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 

RubyKaigi 2022 - Fast data processing with Ruby and Apache Arrow

  • 1. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast data processing with Ruby and Apache Arrow Sutou Kouhei ClearCode Inc. RubyKaigi 2022 2022-09-10
  • 2. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sutou Kouhei A president Ruby committer The president of ClearCode Inc. クリアコードの社長
  • 3. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sutou Kouhei The Apache Arrow PMC chair PMC: Project Management Committee Apache Arrowのプロジェクト管理委員会のリーダー ✓ #2 commits(コミット数2位) ✓
  • 4. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sutou Kouhei The pioneer in Ruby and Arrow A Ruby committer Maintain some standard libraries/default gems 標準ライブラリーとかデフォルトgemのメンテナンスをしている ✓ ✓ The author of Red Arrow ✓ Red Arrow: The official Apache Arrow library for Ruby 公式のRuby用のApache Arrowライブラリー ✓ ✓
  • 5. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Why do I work on Red Arrow? なぜRed Arrowの開発をしているか To use Ruby for data processing too! データ処理でもRubyを使いたい! At least a part of data processing データ処理の全部と言わず一部だけでも ✓ ✓ Data processing is an important task データ処理は最近の重要なタスクの1つ # of Rubyists will be increased by this データ処理にRubyを使えるようになるとRubyistが増えるはず ✓ ✓
  • 6. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Current situation Negative spiral 今は負のスパイラル Few users Small community Few developers Few useful tools How to break the negative spiral? どうやってこの負のスパイラルを打開する?
  • 7. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Expand useful tools with few developers 少人数で便利なツールを増やせればいいんじゃない? Negative spiral Positive spiral Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools
  • 8. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 But how? でもどうやって? Apache Arrow
  • 9. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Apache Arrow Cross-language dev platform for data 複数言語対応のデータ用の開発プラットフォーム Ruby community doesn't need to dev everything Rubyコミュニティーがすべてを開発しなくてもよい ✓ We can share common implementations 共通の実装を言語を超えて共有できる ✓ ✓ Today's highlighted features 今日注目する機能 Fast data processing(高速データ処理) ✓ Fast data interchange(高速データ交換) ✓ ✓
  • 10. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 My approach 私のアプローチ Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools
  • 11. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Goal of this talk このトークのゴール Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools You want to use Ruby for some data processings いくつかのデータ処理でRubyを使いたくなる Especially, you want to implement a BI tool 特にBIツールを作りたくなる ✓ ✓ You join Red Data Tools project Red Data Toolsプロジェクトに参加する It provides data processing tools for Ruby Ruby用のデータ処理ツールを提供するプロジェクト https://red-data-tools.github.io/ ✓ ✓
  • 12. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast data processing 高速データ処理 Ruby is slow to process data Rubyでデータを処理すると遅い ✓ Resolve in external process:(別プロセスで解決) Use case: Web app, batch process for Web app Use fast data processing module (e.g.: DB) DBとか速いデータ処理モジュールを使う ✓ ✓ Resolve in the same process:(プロセス内で解決) Use cases: IRB, batch process for Web app Implement core features in other fast lang 他の速い言語でコアの機能を実装 ✓ ✓
  • 13. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 External process 別プロセス Ruby External process Fast data processing Request Response Ruby External process Popular case in current Ruby usage 今のRubyの使われ方だとよくあるケース ✓ Small response: No problem レスポンスが小さい場合は問題ない ✓ Large response: レスポンスが大きい場合: Sending/receiving response are slow レスポンスの送信・受信処理が遅い ✓ ✓
  • 14. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Sending/receiving response レスポンスの送受信 Ruby External process Serialize Deserialize Send Ruby External process Serialize/deserialize are slow シリアライズ・デシリアライズが遅い ✓ How to speed them up? どうやって高速化すればよいか Apache Arrow format ✓ Serialize/deserialize cost ≒ 0 シリアライズ・デシリアライズコストがほぼ0 ✓ ✓
  • 15. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Why Apache Arrow format is fast Apache Arrowフォーマットはなぜ速いのか Apache Arrowフォーマットはなぜ速いのか Powered by Rabbit 3.0.1 Apache Arrowフォーマットは なぜ速いのか 須藤功平 株式会社クリアコード db tech showcase ONLINE 2020 2020-12-08 https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/
  • 16. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Apache Arrow Flight SQL Ruby SQL DB Fast data processing Request (SQL) Response (Apache Arrow data) Ruby SQL DB gRPC based protocol gRPCベースのプロトコル NOTE: Other network libraries such as UCX can be used UCXなど他のネットワークライブラリーも使える ✓ ✓ Specialized to Apache Arrow format Apache Arrowフォーマットに特化 Serialize/deserialize cost ≒ 0 シリアライズ・デシリアライズコストがほぼ0 ✓ ✓
  • 17. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red Arrow Flight SQL require "arrow-flight-sql" location = "grpc://server:2929" client = ArrowFlight::Client.new(location) sql_client = ArrowFlightSQL::Client.new(client) info = sql_client.execute("SELECT * FROM logs") info.endpoints.each do |endpoint| reader = sql_client.do_get(endpoint.ticket) reader.read_all end
  • 18. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Which SQL DBs support Apache Arrow Flight SQL? SQL DB Support? MySQL No PostgreSQL No BigQuery No Trino No Dremio Yes
  • 19. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Why don't most SQL DBs support it? どうしてほとんどのSQL DBはサポートしていないの? Flight SQL is a new protocol Apache Arrow Flight SQLは新しいプロトコルだから The first release: 2022-02(最初のリリース) ✓ Still experimental(まだ実験的扱い) ✓ ✓ Tradition SQL DBs may not support MySQL・PostgreSQLとか昔からあるSQL DBはサポートしないかも ✓ New SQL DBs will support because... 新しいSQL DBはサポートするはず。なぜなら… ✓
  • 20. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Compatibility is important 互換性が重要だから New SQL DBs often use major protocols 新しいSQL DBは既存のメジャーなプロトコルを使うことが多い To reuse existing client libraries ユーザーは既存のクライアントライブラリーで新しいSQL DBを使える ✓ ✓ For example: MySQL protocol: TiDB, ... ✓ PostgreSQL protocol: Cloud Spanner, CockroachDB, ... ✓ ✓
  • 21. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Future 将来 Flight SQL client libraries will be increased Flight SQLのクライアントライブラリーが充実するだろう ✓ New SQL DBs will support Flight SQL 新しいSQL DBはFlight SQLをサポートするだろう To reuse existing client libraries 既存のクライアントライブラリーを再利用するため ✓ ✓ BI tools will support Flight SQL by default BIツールはデフォルトでFlight SQLをサポートするだろう ✓
  • 22. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 What should we do next? 私たちは次はなにをするべき? Negative spiral Positive spiral Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools Implement an Active Record adapter for Flight SQL Flight SQL用のActive Recordアダプターを 実装するといいんじゃないかな For easy to use from Ruby on Rails apps Ruby on Railsアプリから使いやすくなるはず ✓ ✓ Join Red Data Tools! Red Data Toolsで開発しようぜ! https://red-data-tools.github.io/ ✓
  • 23. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 But I'm using MySQL/PostgreSQL... でも、MySQL/PostgreSQLを使っているし。。。 ADBC: Apache Arrow Database Connectivity https://voltrondata.com/news/simplifying-database-connectivity-with-arrow- flight-sql-and-adbc/
  • 24. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 ADBC Generic fast SQL DB client API 任意のSQL DBに接続できる高速なAPI We can use Flight SQL through ADBC ADBC経由でFlight SQLも使える ✓ ✓ Flight SQL is the most fast driver Flight SQLが最速のドライバー But the same API for all SQL DBs is useful like Active Record for Rubyists なんだけど、すべてのSQL DBに同じAPIでアクセスできるのは便利 Active Recordも便利でしょ? ✓ ✓
  • 25. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 ADBC and Ruby Implementing Ruby bindings Rubyバインディングを実装中 ✓ Join Red Data Tools to implement an Active Record adapter for ADBC! Red Data ToolsでADBC用のActive Recordアダプターを開発しようぜ! https://red-data-tools.github.io/ ✓
  • 26. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red ADBC require "adbc" options = { driver: "adbc_driver_sqlite", filename: ":memory:", } ADBC::Database.open(**options) do |database| database.connect do |connection| puts(connection.query("SELECT 1")) end end
  • 27. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Wrap up: External process case まとめ:別プロセスの場合 Large response is slow レスポンスが大きいときに遅い Bottle neck is serialize/deserialize ボトルネックはシリアライズ・デシリアライズ ✓ ✓ Apache Arrow Flight SQL/ADBC そこでApache Arrow Flight SQL/ADBCですよ! ✓ Let's implement AR adapters for them Active Recordのアダプターを実装しようぜ! ✓
  • 28. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 In-process 同一プロセス Ruby Library Fast data processing Pass data Return processed data Ruby Library Not popular case in current Ruby usage 今のRubyの使われ方だとあまりないケース Use case: process data on local/remote storage IRB, batch process for Web app ユースケース:ローカル・リモートストレージにあるデータの処理 ✓ ✓ Need a fast data processing library implemented in other fast language 高速にデータ処理できる他の速い言語で実装されたライブラリーが必要 ✓
  • 29. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast language? 速い言語? C/C++ ✓ Rust ✓ Julia ✓ ... ✓
  • 30. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 A C++ case: Apache Arrow Apache Arrow has a computation module Apache Arrowは計算モジュールも提供している ✓ Ruby bindings: Red Arrow/red-arrow RubyバインディングはRed Arrow/red-arrow ✓ Data frame based on Red Arrow Red Arrowベースのデータフレームもある RedAmber/red_amber by @heronshoes Red Data Toolsメンバーでもある鈴木さんが開発 https://mybinder.org/v2/gh/RubyData/docker-stacks/master? filepath=red-amber.ipynb ✓ ✓
  • 31. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red Arrow require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "arrow" sliced_codes = codes.slice do |slicer| slicer.prefecture == "東京都" end puts(sliced_codes)
  • 32. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red Amber require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "red_amber" data_frame = RedAmber::DataFrame.new(codes) prefecture = data_frame[:prefecture] puts(data_frame[prefecture == "東京都"])
  • 33. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 A C++ case: DuckDB Similar to SQLite but for data analytics データ分析向けのSQLiteみたいなやつ Fast aggregation/filter/sort 高速な集計・フィルター・ソート ✓ ✓ Ruby bindings: ruby-duckdb by @suketa Rubyコミッターでもある助田さんがRubyバインディングを開発 We can impl. fast data processing with DuckDB DuckDBを使って高速データ処理を実現できる! ✓ If we can interchange data w/ DuckDB fast... DuckDBと高速にデータ交換できればね… ✓ ✓
  • 34. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Is fast data interchange important? 高速データ交換は重要なの? Ruby DuckDB Fast data processing Load data (!) Read result (!) Ruby DuckDB If (!) are slow, total data processing is also slow (!)が遅いと全体のデータ処理も遅くなる
  • 35. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Fast data interchange 高速なデータ交換 Ruby DuckDB Fast data processing Pass Apache Arrow data directly Read result with C data interface Ruby DuckDB Use data as-is: zero-copy データをそのまま使う:ゼロコピー ✓ Apache Arrow C data/stream interface C ABI for fast data interchange 高速にデータ交換するためのC ABI ✓ FYI: C ABI in Ruby: MemoryView 参考:RubyもMmeoryViewというC ABIを提供している ✓ ✓
  • 36. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 C data interface struct ArrowArray { // Array data description int64_t length; int64_t null_count; int64_t offset; int64_t n_buffers; int64_t n_children; const void** buffers; struct ArrowArray** children; struct ArrowArray* dictionary; // Release callback void (*release)(struct ArrowArray*); // Opaque producer-specific data void* private_data; };
  • 37. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 DuckDB with Apache Arrow require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "arrow-duckdb" db = DuckDB::Database.open c = db.connect c.register("codes", codes) do # Use Apache Arrow data as-is c.query("SELECT * FROM codes WHERE prefecture = ?", "東京都", # Tokyo output: :arrow) # Output as Apache Arrow data .to_table # C data interface end
  • 38. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 C data interface on Web Web上でもC data interface Some unofficial WebAssembly ports exist 非公式ながらいくつかWebAssembly対応のApache Arrowライブラリーがある Rust based, Go based, ... ✓ ✓ WASM Ruby + C data I/F is useful? WebAssembly版のRubyとC data interfaceでなんかできるかも? FYI: DuckDB supports WebAssembly too 参考:DuckDBもWebAssemblyをサポートしている ✓ ✓
  • 39. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 A Rust case: Arrow DataFusion SQL query engine Internal memory layout is Apache Arrow DuckDBと似ているが内部のメモリーレイアウトはApache Arrow ✓ ✓ Direct Ruby bindings:(直接のバインディング) arrow-datafusion by @jychen7 with Magnus ✓ ✓ Ruby bindings via C API:(C API経由) datafusion-c with cargo-c ✓ Red DataFusion with datafusion-c ✓ ✓
  • 40. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Ruby bindings via C API C API経由のRubyバインディング Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools To develop with devs from other langs 他言語の開発者と一緒に開発するため C API is useful for other languages too C APIはJavaやGoなど他の言語のバインディング開発でも有用 ✓ ✓ C API provides a normal C library C APIは普通のCライブラリーを提供する Headers: Generated by cbindgen automatically ✓ Shared libraries: Built with cargo-c ✓ ✓
  • 41. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Red DataFusion require "datasets-arrow" codes = Datasets::PostalCodeJapan.new.to_arrow require "datafusion" context = DataFusion::SessionContext.new context.register("codes", codes) # C data interface data_frame = context.sql(<<-SQL) SELECT * FROM codes WHERE prefecture = '東京都' SQL puts(data_frame.to_table) # C data interface
  • 42. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Remote data リモートデータ Recent modules can read remote data 最近のモジュールはリモートデータを読み込める ✓ At least these modules support it: 少なくとも次のモジュールはできる Apache Arrow ✓ DuckDB ✓ Apache Arrow DataFusion ✓ ✓
  • 43. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Remote data example: DuckDB SELECT COUNT(*) FROM parquet_scan('s3://ookla-open-data/parquet/performance/*/*/*/*.parquet', HIVE_PARTITIONING=1); -- ┌───────┐ -- │ count_star() │ -- ├───────┤ -- │ 144567188 │ -- └───────┘
  • 44. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Wrap up: In-process case まとめ:同一プロセスの場合 Ruby Library Fast data processing Pass data Return processed data Ruby Library To process data on local/remote storage ローカル・リモートストレージにあるデータの処理 ✓ Low-level APIs (bindings) are preparing 低レベルのAPIは整備できてきた ✓ Let's implement high-level API! 高レベルのAPIを実装しようぜ! e.g.: RedAmber, Active Record adapters, ... https://red-data-tools.github.io/ ✓ ✓
  • 45. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Acknowledgments 謝辞 Voltron Data Supports my Apache Arrow related work 私のApache Arrow関係の作業にお金を払ってくれている ✓ You can join Voltron Data or ClearCode to work on Apache Arrow as a job Voltron Dataさんかクリアコードに転職すると 仕事でApache Arrowを開発できるよ! https://www.clear-code.com/recruitment/ ✓ ✓ Red Data Tools members They develop data processing tools for Ruby! Ruby用のデータ処理ツールを開発しているよ! ✓ ✓
  • 46. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Wrap up まとめ Can use Ruby for fast data processing with Apache Arrow in some cases Apache Arrowを使えば高速なデータ処理にRubyを使える…こともある External process: Fast data interchange 別プロセスの場合:高速データ交換機能を使う ✓ In-process: Fast data processing/interchange 同一プロセスの場合:高速データ処理・交換機能を使う ✓ ✓ But we still have missing pieces でも、まだ足りないところがある ✓
  • 47. Fast data processing with Ruby and Apache Arrow Powered by Rabbit 3.0.2 Goal of this talk このトークのゴール Negative spiral Positive spiral Apache Arrow Larger community More users More developers More useful tools Small community Few users Few developers Few useful tools You want to use Ruby for some data processings いくつかのデータ処理でRubyを使いたくなる Especially, you want to implement a BI tool 特にBIツールを作りたくなる ✓ ✓ You join Red Data Tools project Red Data Toolsプロジェクトに参加する It provides data processing tools for Ruby Ruby用のデータ処理ツールを提供するプロジェクト https://red-data-tools.github.io/ ✓ ✓