SlideShare a Scribd company logo
Embulk - 進化するバルク

データローダ
Sadayuki Furuhashi

Founder & Software Architect
Embulk Meetup Tokyo #2
A little about me…
Sadayuki Furuhashi
github: @frsyuki
Fluentd - Unifid log collection infrastracture
Embulk - Plugin-based parallel ETL Founder & Software Architect
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration easy.
> which was very painful…
Storage, RDBMS,
NoSQL, Cloud Service,
etc.
broken records,

transactions (idempotency),

performance, …
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to make the records cleaned
• Convert ”2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
• Convert “N" → “”
• many cleanings…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most of scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files takes 1 month (!?)
A lot of integration efforts for each storages:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicated loading?
> Performance optimization
> How to optimize the code or parallelize?
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming
Plugins Plugins
bulk load
Input Output
Embulk’s Plugin Architecture
Embulk Core
Executor Plugin
Filter Filter
Guess
Output
Embulk’s Plugin Architecture
Embulk Core
Executor Plugin
Filter Filter
GuessFileInput
Parser
Decoder
Guess
Embulk’s Plugin Architecture
Embulk Core
FileInput
Executor Plugin
Parser
Decoder
FileOutput
Formatter
Encoder
Filter Filter
Execution overview
Task
Transaction Task
Task
taskCount
{
taskIndex: 0,
task: {…}
}
{
taskIndex: 2,
task: {…}
}
runs on a single thread runs on multiple threads

(or machines)
Parallel execution
Task
Task
Task
Task
Threads
Task queue
run tasks in parallel
(embulk-executor-local-thread)
Distributed execution
Task
Task
Task
Task
Map tasks
Task queue
run tasks on Hadoop
(embulk-executor-mapreduce)
Distributed execution (w/ partitioning)
Task
Task
Task
Task
Map - Shuffle - Reduce
Task queue
run tasks on Hadoop
(embulk-executor-mapreduce)
Transaction control
fileInput.transaction {
parser.transaction {
filters.transaction {
formatter.transaction {
fileOutput.transaction {
executor.transaction {
…
}
}
}
}
}
}
file input plugin
parser plugin
filter plugins
formatter plugin
file output plugin
executor plugin
Task Task
Task configuration
fileInput.transaction { fileInputTask, taskCount →
parser.transaction { parserTask, schema →
filters.transaction { filterTasks, schema →
formatter.transaction { formatterTask →
fileOutput.transaction { fileOutputTask →
executor.transaction { →
task = {
fileInputTask,
parserTask,
filterTasks,
formatterTask,
fileOutputTask,
}
taskCount.times.inParallel { taskIndex → run(taskIndex, task)
taskCount is
decided by input
schema is decided
by input, and may be
modified by filters
Task execution
parser.run(fileInput, pageOutput)
fileInput.open() formatter.open(fileOutput)
fileOutput.open()
parser plugin
file input plugin filter plugins
file output plugin
formatter plugin …Task Task …
Type conversion
Embulk type systemInput type system Output type system
boolean
long
double
string
timestamp
boolean
integer
bigint
double precision
text
varchar
date
timestamp
timestamp with zone
…
(e.g. PostgreSQL)
boolean
integer
long
float
double
string
array
geo point
geo shape
… (e.g. Elasticsearch)
Input plugin

(parser plugin if input is file-based)
Output plugin

(formatter plugin if output is file-based)
What’s added since the first release?
• v0.3
• Resuming
• Filter plugin type
• v0.4
• Plugin template generator
• Incremental execution (ConfigDiff)
• Isolated ClassLoaders for Java plugins
• Polyglot command launcher
What’s added since the first release?
• v0.6
• Executor plugin type
• Liquid template engine
• v0.7
• EmbulkEmbed & Embulk::Runner
• Plugin bundle (embulk-mkbundle)
• JRuby 9000
• Gradle v2.6
Resuming
• Retries a failed transaction without retrying
everything.
• Skips successful tasks by using information stored in
a file by the previous transaction.
• embulk run config.yml -r resume-state.yml
Filter plugin type
• Filtering rows out, filtering columns out, or enrich
the data. 18 plugins released.
Plugin template generator
• Generates template of a plugin.
• Generated code is already ready to compile.
> You modify & compile it to do your work.
• embulk new <category> <new>
Incremental execution
• Store last file name or row in a file, and next
execution starts from there.
• Usecase:

sync new files on S3 to Elasticsearch every day.
• embulk run config.yml -o next-config.yml
Isolated ClassLoaders for Java plugins
• Embulk can load multiple versions of java plugins.
Plugin Version Conflicts
Embulk Core
Java Runtime
aws-sdk.jar v1.9
embulk-input-s3.jar
Version conflicts!
aws-sdk.jar v1.10
embulk-output-redshift.jar
Multiple Classloaders in JVM
Embulk Core
Java Runtime
aws-sdk.jar v1.9
embulk-input-s3.jar
Isolated
environments
aws-sdk.jar v1.10
embulk-output-redshift.jar
Class Loader 1
Class Loader 2
Polyglot launcher script
• embulk .jar is a jar file.
• embulk.jar is a shell script.
• embulk.jar is a bat script.
• It sets JVM options to improve performance.
• ./embulk run abc
Executor plugin type
• embulk-executor-mapreduce executes tasks on
distributed environment.
Liquid template engine
• A config file can include variables.
EmbulkEmbed & Embulk::Runner
• Embed embulk in an application.
Plugin bundle
• Uses fixed version of plugins.
• embulk mkbundle my-project
• embulk run -b my-project config.yml
Gradle v2.6
• Continous compiling.
• “embulk migrate .” upgrades gradle versio of your
plugin project.
• ./gradlew -t build
Future plan
• v0.8
• JSON type (issue #306)
• Error plugin type (#27, #124)
• More (or less) concurrency for output (#231)
• v0.9
• More Guess (#242, #235)
• Multiple jobs using a single config file (#167)

More Related Content

What's hot

AWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティスAWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティス
Akihiro Kuwano
 
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
NTT DATA Technology & Innovation
 
Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018
Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018
Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018
Yahoo!デベロッパーネットワーク
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
NTT DATA OSS Professional Services
 
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデートAmazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Web Services Japan
 
NetApp XCP データ移行ツールインストールと設定
NetApp XCP データ移行ツールインストールと設定NetApp XCP データ移行ツールインストールと設定
NetApp XCP データ移行ツールインストールと設定
Kan Itani
 
大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック
大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック
大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック
infinite_loop
 
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
Amazon Web Services Japan
 
アーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーションアーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーション
Masahiko Sawada
 
君はyarn.lockをコミットしているか?
君はyarn.lockをコミットしているか?君はyarn.lockをコミットしているか?
君はyarn.lockをコミットしているか?
Teppei Sato
 
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
NTT DATA Technology & Innovation
 
MySQL SYSスキーマのご紹介
MySQL SYSスキーマのご紹介MySQL SYSスキーマのご紹介
MySQL SYSスキーマのご紹介
Shinya Sugiyama
 
初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!
Tetsutaro Watanabe
 
Oracleからamazon auroraへの移行にむけて
Oracleからamazon auroraへの移行にむけてOracleからamazon auroraへの移行にむけて
Oracleからamazon auroraへの移行にむけて
Yoichi Sai
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
NTT DATA Technology & Innovation
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
Yoshiyasu SAEKI
 
AWS Black Belt Techシリーズ AWS IAM
AWS Black Belt Techシリーズ  AWS IAMAWS Black Belt Techシリーズ  AWS IAM
AWS Black Belt Techシリーズ AWS IAM
Amazon Web Services Japan
 
SQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかSQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するか
Shogo Wakayama
 
OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)
OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)
OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)
Ryota Watabe
 
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
Trainocate Japan, Ltd.
 

What's hot (20)

AWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティスAWSのログ管理ベストプラクティス
AWSのログ管理ベストプラクティス
 
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
PostgreSQL 12は ここがスゴイ! ~性能改善やpluggable storage engineなどの新機能を徹底解説~ (NTTデータ テクノ...
 
Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018
Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018
Apache Hadoop HDFSの最新機能の紹介(2018)#dbts2018
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
 
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデートAmazon Redshift パフォーマンスチューニングテクニックと最新アップデート
Amazon Redshift パフォーマンスチューニングテクニックと最新アップデート
 
NetApp XCP データ移行ツールインストールと設定
NetApp XCP データ移行ツールインストールと設定NetApp XCP データ移行ツールインストールと設定
NetApp XCP データ移行ツールインストールと設定
 
大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック
大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック
大規模ソーシャルゲーム開発から学んだPHP&MySQL実践テクニック
 
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
 
アーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーションアーキテクチャから理解するPostgreSQLのレプリケーション
アーキテクチャから理解するPostgreSQLのレプリケーション
 
君はyarn.lockをコミットしているか?
君はyarn.lockをコミットしているか?君はyarn.lockをコミットしているか?
君はyarn.lockをコミットしているか?
 
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
kubernetes初心者がKnative Lambda Runtime触ってみた(Kubernetes Novice Tokyo #13 発表資料)
 
MySQL SYSスキーマのご紹介
MySQL SYSスキーマのご紹介MySQL SYSスキーマのご紹介
MySQL SYSスキーマのご紹介
 
初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!
 
Oracleからamazon auroraへの移行にむけて
Oracleからamazon auroraへの移行にむけてOracleからamazon auroraへの移行にむけて
Oracleからamazon auroraへの移行にむけて
 
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
 
ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方ストリーム処理を支えるキューイングシステムの選び方
ストリーム処理を支えるキューイングシステムの選び方
 
AWS Black Belt Techシリーズ AWS IAM
AWS Black Belt Techシリーズ  AWS IAMAWS Black Belt Techシリーズ  AWS IAM
AWS Black Belt Techシリーズ AWS IAM
 
SQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかSQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するか
 
OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)
OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)
OSS-DB Silver ポイント解説セミナー ~SQL編~ (PostgreSQL9.0)
 
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
AWSとオンプレミスを繋ぐときに知っておきたいルーティングの基礎知識(CCSI監修!)
 

Viewers also liked

Embulkを活用したログ管理システム
Embulkを活用したログ管理システムEmbulkを活用したログ管理システム
Embulkを活用したログ管理システム
Akihiro Ikezoe
 
Embulk at Treasure Data
Embulk at Treasure DataEmbulk at Treasure Data
Embulk at Treasure Data
Satoshi Akama
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
Toru Takahashi
 
20150219 初めての「embulk」
20150219 初めての「embulk」20150219 初めての「embulk」
20150219 初めての「embulk」
Hideto Masuoka
 
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』
Yuya Niimi
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
Sadayuki Furuhashi
 
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -
N Masahiro
 
Embulk 20150411
Embulk 20150411Embulk 20150411
Embulk 20150411
Hiroshi Nakamura
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
Sadayuki Furuhashi
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
Sadayuki Furuhashi
 
Fluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2studyFluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2study
SATOSHI TAGOMORI
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
Sadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
Sadayuki Furuhashi
 
Open Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud ServiceOpen Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud Service
SATOSHI TAGOMORI
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
Sadayuki Furuhashi
 
Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門
Keisuke Takahashi
 
How To Write Middleware In Ruby
How To Write Middleware In RubyHow To Write Middleware In Ruby
How To Write Middleware In Ruby
SATOSHI TAGOMORI
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみた
Ryoji Kurosawa
 
Lucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポートLucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポート
Shinpei Nakata
 
Flumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システムFlumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システム
Satoshi Iijima
 

Viewers also liked (20)

Embulkを活用したログ管理システム
Embulkを活用したログ管理システムEmbulkを活用したログ管理システム
Embulkを活用したログ管理システム
 
Embulk at Treasure Data
Embulk at Treasure DataEmbulk at Treasure Data
Embulk at Treasure Data
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
 
20150219 初めての「embulk」
20150219 初めての「embulk」20150219 初めての「embulk」
20150219 初めての「embulk」
 
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
 
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -
 
Embulk 20150411
Embulk 20150411Embulk 20150411
Embulk 20150411
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
 
Fluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2studyFluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2study
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Open Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud ServiceOpen Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud Service
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門
 
How To Write Middleware In Ruby
How To Write Middleware In RubyHow To Write Middleware In Ruby
How To Write Middleware In Ruby
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみた
 
Lucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポートLucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポート
 
Flumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システムFlumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システム
 

Similar to Embulk - 進化するバルクデータローダ

Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.com
Renzo Tomà
 
Retrofitting Continuous Delivery
Retrofitting Continuous Delivery Retrofitting Continuous Delivery
Retrofitting Continuous Delivery
Alan Norton
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
Amazon Web Services Korea
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customers
Speck&Tech
 
Evolutionary Database Design
Evolutionary Database DesignEvolutionary Database Design
Evolutionary Database Design
Andrei Solntsev
 
SQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesSQL Server 2008 Integration Services
SQL Server 2008 Integration Services
Eduardo Castro
 
Rust kafka-5-2019-unskip
Rust kafka-5-2019-unskipRust kafka-5-2019-unskip
Rust kafka-5-2019-unskip
Gerard Klijs
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
Amazon Web Services Korea
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit
 
End to end testing a web application with Clojure
End to end testing a web application with ClojureEnd to end testing a web application with Clojure
End to end testing a web application with Clojure
Gerard Klijs
 
An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS
Erik Osterman
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
Sadayuki Furuhashi
 
CI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the TimeCI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the Time
Amazon Web Services
 
Sql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ramSql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ramChris Adkin
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
Treasure Data, Inc.
 
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Claus Ibsen
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon Redshift
Amazon Web Services LATAM
 
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
Amazon Web Services
 
Novices guide to docker
Novices guide to dockerNovices guide to docker
Novices guide to docker
Alec Clews
 
Celery with python
Celery with pythonCelery with python
Celery with python
Alexandre González Rodríguez
 

Similar to Embulk - 進化するバルクデータローダ (20)

Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.com
 
Retrofitting Continuous Delivery
Retrofitting Continuous Delivery Retrofitting Continuous Delivery
Retrofitting Continuous Delivery
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customers
 
Evolutionary Database Design
Evolutionary Database DesignEvolutionary Database Design
Evolutionary Database Design
 
SQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesSQL Server 2008 Integration Services
SQL Server 2008 Integration Services
 
Rust kafka-5-2019-unskip
Rust kafka-5-2019-unskipRust kafka-5-2019-unskip
Rust kafka-5-2019-unskip
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
 
End to end testing a web application with Clojure
End to end testing a web application with ClojureEnd to end testing a web application with Clojure
End to end testing a web application with Clojure
 
An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
 
CI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the TimeCI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the Time
 
Sql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ramSql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ram
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
 
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon Redshift
 
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
 
Novices guide to docker
Novices guide to dockerNovices guide to docker
Novices guide to docker
 
Celery with python
Celery with pythonCelery with python
Celery with python
 

More from Sadayuki Furuhashi

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
Sadayuki Furuhashi
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Sadayuki Furuhashi
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
Sadayuki Furuhashi
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
Sadayuki Furuhashi
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
Sadayuki Furuhashi
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
Sadayuki Furuhashi
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect More
Sadayuki Furuhashi
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
Sadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualSadayuki Furuhashi
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
Sadayuki Furuhashi
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
Sadayuki Furuhashi
 
Programming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectProgramming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectSadayuki Furuhashi
 
gumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectgumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectSadayuki Furuhashi
 

More from Sadayuki Furuhashi (20)

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect More
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
 
Fluentd meetup at Slideshare
Fluentd meetup at SlideshareFluentd meetup at Slideshare
Fluentd meetup at Slideshare
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
 
Fluentd meetup
Fluentd meetupFluentd meetup
Fluentd meetup
 
upload test 1
upload test 1upload test 1
upload test 1
 
Programming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectProgramming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack Project
 
Gumi study7 messagepack
Gumi study7 messagepackGumi study7 messagepack
Gumi study7 messagepack
 
gumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectgumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack Project
 

Recently uploaded

WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
ongomchris
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 

Recently uploaded (20)

WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 

Embulk - 進化するバルクデータローダ

  • 1. Embulk - 進化するバルク
 データローダ Sadayuki Furuhashi
 Founder & Software Architect Embulk Meetup Tokyo #2
  • 2. A little about me… Sadayuki Furuhashi github: @frsyuki Fluentd - Unifid log collection infrastracture Embulk - Plugin-based parallel ETL Founder & Software Architect
  • 3. What’s Embulk? > An open-source parallel bulk data loader > loads records from “A” to “B” > using plugins > for various kinds of “A” and “B” > to make data integration easy. > which was very painful… Storage, RDBMS, NoSQL, Cloud Service, etc. broken records,
 transactions (idempotency),
 performance, …
  • 4. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 1. First attempt → fails > 2. Write a script to make the records cleaned • Convert ”2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC” • Convert “N" → “” • many cleanings… > 3. Second attempt → another error • Convert “Inf” → “Infinity” > 4. Fix the script, retry, retry, retry… > 5. Oh, some data got loaded twice!?
  • 5. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 6. Ok, the script worked. > 7. Register it to cron to sync data every day. > 8. One day… it fails with another error • Convert invalid UTF-8 byte sequence to U+FFFD
  • 6. The pains of bulk data loading Example: load 10GB CSV × 720 files > Most of scripts are slow. • People have little time to optimize bulk load scripts > One file takes 1 hour → 720 files takes 1 month (!?) A lot of integration efforts for each storages: > XML, JSON, Apache log format (+some custom), … > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile… > MongoDB, Elasticsearch, Redshift, Salesforce, …
  • 7. The problems: > Data cleaning (normalization) > How to normalize broken records? > Error handling > How to remove broken records? > Idempotent retrying > How to retry without duplicated loading? > Performance optimization > How to optimize the code or parallelize?
  • 8. HDFS MySQL Amazon S3 Embulk CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Resuming Plugins Plugins bulk load
  • 9. Input Output Embulk’s Plugin Architecture Embulk Core Executor Plugin Filter Filter Guess
  • 10. Output Embulk’s Plugin Architecture Embulk Core Executor Plugin Filter Filter GuessFileInput Parser Decoder
  • 11. Guess Embulk’s Plugin Architecture Embulk Core FileInput Executor Plugin Parser Decoder FileOutput Formatter Encoder Filter Filter
  • 12. Execution overview Task Transaction Task Task taskCount { taskIndex: 0, task: {…} } { taskIndex: 2, task: {…} } runs on a single thread runs on multiple threads
 (or machines)
  • 13. Parallel execution Task Task Task Task Threads Task queue run tasks in parallel (embulk-executor-local-thread)
  • 14. Distributed execution Task Task Task Task Map tasks Task queue run tasks on Hadoop (embulk-executor-mapreduce)
  • 15. Distributed execution (w/ partitioning) Task Task Task Task Map - Shuffle - Reduce Task queue run tasks on Hadoop (embulk-executor-mapreduce)
  • 16. Transaction control fileInput.transaction { parser.transaction { filters.transaction { formatter.transaction { fileOutput.transaction { executor.transaction { … } } } } } } file input plugin parser plugin filter plugins formatter plugin file output plugin executor plugin Task Task
  • 17. Task configuration fileInput.transaction { fileInputTask, taskCount → parser.transaction { parserTask, schema → filters.transaction { filterTasks, schema → formatter.transaction { formatterTask → fileOutput.transaction { fileOutputTask → executor.transaction { → task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask, } taskCount.times.inParallel { taskIndex → run(taskIndex, task) taskCount is decided by input schema is decided by input, and may be modified by filters
  • 18. Task execution parser.run(fileInput, pageOutput) fileInput.open() formatter.open(fileOutput) fileOutput.open() parser plugin file input plugin filter plugins file output plugin formatter plugin …Task Task …
  • 19. Type conversion Embulk type systemInput type system Output type system boolean long double string timestamp boolean integer bigint double precision text varchar date timestamp timestamp with zone … (e.g. PostgreSQL) boolean integer long float double string array geo point geo shape … (e.g. Elasticsearch) Input plugin
 (parser plugin if input is file-based) Output plugin
 (formatter plugin if output is file-based)
  • 20. What’s added since the first release? • v0.3 • Resuming • Filter plugin type • v0.4 • Plugin template generator • Incremental execution (ConfigDiff) • Isolated ClassLoaders for Java plugins • Polyglot command launcher
  • 21. What’s added since the first release? • v0.6 • Executor plugin type • Liquid template engine • v0.7 • EmbulkEmbed & Embulk::Runner • Plugin bundle (embulk-mkbundle) • JRuby 9000 • Gradle v2.6
  • 22. Resuming • Retries a failed transaction without retrying everything. • Skips successful tasks by using information stored in a file by the previous transaction. • embulk run config.yml -r resume-state.yml
  • 23. Filter plugin type • Filtering rows out, filtering columns out, or enrich the data. 18 plugins released.
  • 24. Plugin template generator • Generates template of a plugin. • Generated code is already ready to compile. > You modify & compile it to do your work. • embulk new <category> <new>
  • 25. Incremental execution • Store last file name or row in a file, and next execution starts from there. • Usecase:
 sync new files on S3 to Elasticsearch every day. • embulk run config.yml -o next-config.yml
  • 26. Isolated ClassLoaders for Java plugins • Embulk can load multiple versions of java plugins.
  • 27. Plugin Version Conflicts Embulk Core Java Runtime aws-sdk.jar v1.9 embulk-input-s3.jar Version conflicts! aws-sdk.jar v1.10 embulk-output-redshift.jar
  • 28. Multiple Classloaders in JVM Embulk Core Java Runtime aws-sdk.jar v1.9 embulk-input-s3.jar Isolated environments aws-sdk.jar v1.10 embulk-output-redshift.jar Class Loader 1 Class Loader 2
  • 29. Polyglot launcher script • embulk .jar is a jar file. • embulk.jar is a shell script. • embulk.jar is a bat script. • It sets JVM options to improve performance. • ./embulk run abc
  • 30. Executor plugin type • embulk-executor-mapreduce executes tasks on distributed environment.
  • 31. Liquid template engine • A config file can include variables.
  • 32. EmbulkEmbed & Embulk::Runner • Embed embulk in an application.
  • 33. Plugin bundle • Uses fixed version of plugins. • embulk mkbundle my-project • embulk run -b my-project config.yml
  • 34. Gradle v2.6 • Continous compiling. • “embulk migrate .” upgrades gradle versio of your plugin project. • ./gradlew -t build
  • 35. Future plan • v0.8 • JSON type (issue #306) • Error plugin type (#27, #124) • More (or less) concurrency for output (#231) • v0.9 • More Guess (#242, #235) • Multiple jobs using a single config file (#167)