Embulk, an open-source plugin-based parallel bulk data loader

Sadayuki Furuhashi
Sadayuki FuruhashiFounder & Architect at Treasure Data, Inc.
Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
EmbulkAn open-source plugin-based parallel bulk data loader
that makes painful data integration work relaxed.
Sharing our knowledge on RubyGems to manage arbitrary files.
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - An unified data collection tool
> Prestogres - PostgreSQL protocol gateway for Presto
> Embulk - A plugin-based parallel bulk data loader
> ServerEngine - A Ruby framework to build multiprocess servers
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed strong-consistent key-value data store
Today’s talk
> What’s Embulk?
> How Embulk works?
> The architecture
> Writing Embulk plugins
> Roadmap & Development
> Q&A + Discussion
What’s Embulk?
> An open-source parallel bulk data loader
> using plugins
> to make data integration relaxed.
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration relaxed.
> which was very painful…
Storage, RDBMS,
NoSQL, Cloud Service,
etc.
broken records,

transactions (idempotency),

performance, …
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to make the records cleaned
• Convert ”20150127T190500Z” → “2015-01-27 19:05:00 UTC”
• Convert “N" → “”
• many cleanings…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most of scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files takes 1 month (!?)
A lot of integration efforts for each storages:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicated loading?
> Performance optimization
> How to optimize the code or parallelize?
The problems at Treasure Data
Treasure Data Service?
> “Fast, powerful SQL access to big data from connected
applications and products, with no new infrastructure or
special skills required.”
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> Hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.
A solution:
> Package the efforts as a plugin.
> data cleaning, error handling, retrying
> Share & reuse the plugin.
> don’t repeat the pains!
> Keep improving the plugin code.
> rather than throwing away the efforts every time
> using OSS-style pull-reqs & frequent releases.
Embulk
Embulk is an open-source, plugin-based
parallel bulk data loader

that makes data integration works relaxed.
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotet retrying
bulk load
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotet retrying
Plugins Plugins
bulk load
How Embulk works?
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar
Installing embulk
Bintray

releases
Embulk is released on Bintray
wget embulk.jar
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml
Guess format & schema in:
type: file
paths: [data/examples/]
out:

type: example
in:
type: file
paths: [data/examples/]
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
header_line: true
columns:
- name: time

type: timestamp

format: '%Y-%m-%d %H:%M:%S'
- name: account

type: long
- name: purchase

type: timestamp

format: '%Y%m%d'
- name: comment

type: string
out:

type: example
guess
by guess plugins
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml # if necessary
+--------------------------------------+---------------+--------------------+
| time:timestamp | uid:long | word:string |
+--------------------------------------+---------------+--------------------+
| 2015-01-27 19:23:49 UTC | 32,864 | embulk |
| 2015-01-27 19:01:23 UTC | 14,824 | jruby |
| 2015-01-28 02:20:02 UTC | 27,559 | plugin |
| 2015-01-29 11:54:36 UTC | 11,270 | fluentd |
+--------------------------------------+---------------+--------------------+
Preview & fix config
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml # if necessary
# run
$ ./embulk run config.yml -o config.yml
in:
type: file
paths: [data/examples/]
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
header_line: true
columns:
- name: time

type: timestamp

format: '%Y-%m-%d %H:%M:%S'
- name: account

type: long
- name: purchase

type: timestamp

format: '%Y%m%d'
- name: comment

type: string
last_paths: [data/examples/sample_001.csv.gz]
out:

type: example
Deterministic run
in:
type: file
paths: [data/examples/]
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: CRLF
type: csv
delimiter: ','
quote: '"'
header_line: true
columns:
- name: time

type: timestamp

format: '%Y-%m-%d %H:%M:%S'
- name: account

type: long
- name: purchase

type: timestamp

format: '%Y%m%d'
- name: comment

type: string
last_paths: [data/examples/sample_002.csv.gz]
out:

type: example
Repeat
# install
$ wget https://bintray.com/artifact/download/
embulk/maven/embulk-0.2.0.jar -o embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi partial-config.yml
$ ./embulk guess partial-config.yml

-o config.yml

# preview
$ ./embulk preview config.yml
$ vi config.yml # if necessary
# run
$ ./embulk run config.yml -o config.yml
# repeat
$ ./embulk run config.yml -o config.yml
$ ./embulk run config.yml -o config.yml
The architecture
InputPlugin OutputPlugin
Embulk
executor plugin
read records write records
InputPlugin OutputPlugin
Embulk
executor plugin
MySQL, Cassandra,
HBase, Elasticsearch,

Treasure Data, …
record
record
InputPlugin
FileInputPlugin
OutputPlugin
FileOutputPlugin
EncoderPlugin
FormatterPlugin
DecoderPlugin
ParserPlugin
Embulk
executor plugin
read files
decompress
parse files
into records
write files
compress
format records
into files
InputPlugin
FileInputPlugin
OutputPlugin
FileOutputPlugin
EncoderPlugin
FormatterPlugin
DecoderPlugin
ParserPlugin
Embulk
executor plugin
HDFS, S3,

Riak CS, …
gzip, bzip2,

3des, …
CSV, JSON,

RCFile, …
buffer
buffer
record
record
buffer
buffer
Writing Embulk plugins
InputPlugin
module Embulk
class InputExample < InputPlugin
Plugin.register_input('example', self)
def self.transaction(config, &control)
# read config
task = {
'message' =>
config.param('message', :string, default: nil)
}
threads = config.param('threads', :int, default:
2)
columns = [
Column.new(0, 'col0', :long),
Column.new(1, 'col1', :double),
Column.new(2, 'col2', :string),
]
# BEGIN here
commit_reports = yield(task, columns, threads)
# COMMIT here
puts "Example input finished"
return {}
end
def run(task, schema, index, page_builder)
puts "Example input thread #{@index}…"
10.times do |i|
@page_builder.add([i, 10.0, "example"])
end
@page_builder.finish
commit_report = { }
return commit_report
end
end
end
OutputPlugin
module Embulk
class OutputExample < OutputPlugin
Plugin.register_output('example', self)
def self.transaction(
config, schema,
processor_count, &control)
# read config
task = {
'message' =>
config.param('message', :string, default: "record")
}
puts "Example output started."
commit_reports = yield(task)
puts "Example output finished. Commit
reports = #{commit_reports.to_json}"
return {}
end
def initialize(task, schema, index)
puts "Example output thread #{index}..."
super
@message = task.prop('message', :string)
@records = 0
end
def add(page)
page.each do |record|
hash = Hash[schema.names.zip(record)]
puts "#{@message}: #{hash.to_json}"
@records += 1
end
end
def finish
end
def abort
end
def commit
commit_report = {
"records" => @records
}
return commit_report
end
end
end
GuessPlugin
# guess_gzip.rb
module Embulk
class GzipGuess < GuessPlugin
Plugin.register_guess('gzip', self)
GZIP_HEADER = "x1f
x8b".force_encoding('ASCII-8BIT').freeze
def guess(config, sample_buffer)
if sample_buffer[0,2] == GZIP_HEADER
return {"decoders" => [{"type" => "gzip"}]}
end
return {}
end
end
end
# guess_
module Embulk
class GuessNewline < TextGuessPlugin
Plugin.register_guess('newline', self)
def guess_text(config, sample_text)
cr_count = sample_text.count("r")
lf_count = sample_text.count("n")
crlf_count = sample_text.scan(/rn/).length
if crlf_count > cr_count / 2 && crlf_count >
lf_count / 2
return {"parser" => {"newline" => "CRLF"}}
elsif cr_count > lf_count / 2
return {"parser" => {"newline" => "CR"}}
else
return {"parser" => {"newline" => "LF"}}
end
end
end
end
Releasing to RubyGems
Examples
> embulk-plugin-postgres-json.gem
> https://github.com/frsyuki/embulk-plugin-postgres-json
> embulk-plugin-redis.gem
> https://github.com/komamitsu/embulk-plugin-redis
> embulk-plugin-input-sfdc-event-log-files.gem
> https://github.com/nahi/embulk-plugin-input-sfdc-event-
log-files
Roadmap & Development
Roadmap
> Add missing JRuby Plugin APIs
> ParserPlugin, FormatterPlugin
> DecoderPlugin, EncoderPlugin
> Add Executor plugin SPI
> Add ssh distributed executor
> embulk run —command ssh %host embulk run %task
> Add MapReduce executor
> Add support for nested records (?)
Contributing to the Embulk project
> Pull-requests & issues on Github
> Posting blogs
> “I tried Embulk. Here is how it worked”
> “I read Embulk code. Here is how it’s written”
> “Embulk is good because…but bad because…”
> Talking on Twitter with a word “embulk"
> Writing & releasing plugins
> Windows support
> Integration to other software
> ETL tools, Fluentd, Hadoop, Presto, …
Q&A + Discussion?
Hiroshi Nakamura
@nahi
Muga Nishizawa
@muga_nishizawa
Sadayuki Furuhashi
@frsyuki
Embulk committers:
https://jobs.lever.co/treasure-data
Cloud service for the entire data pipeline.
We’re hiring!
1 of 36

Recommended

Embulk - 進化するバルクデータローダ by
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダSadayuki Furuhashi
4.9K views35 slides
Fighting Against Chaotically Separated Values with Embulk by
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkSadayuki Furuhashi
2.1K views45 slides
Presto ベースのマネージドサービス Amazon Athena by
Presto ベースのマネージドサービス Amazon AthenaPresto ベースのマネージドサービス Amazon Athena
Presto ベースのマネージドサービス Amazon AthenaAmazon Web Services Japan
5.3K views42 slides
Plazma - Treasure Data’s distributed analytical database - by
Plazma - Treasure Data’s distributed analytical database -Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Treasure Data, Inc.
12.6K views31 slides
統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料) by
統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料)統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料)
統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
2.8K views14 slides
Data Factoryの勘所・大事なところ by
Data Factoryの勘所・大事なところData Factoryの勘所・大事なところ
Data Factoryの勘所・大事なところTsubasa Yoshino
4.5K views44 slides

More Related Content

What's hot

[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa... by
[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa...[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa...
[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa...Insight Technology, Inc.
7.7K views44 slides
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料) by
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)
Apache Spark on Kubernetes入門(Open Source Conference 2021 Online Hiroshima 発表資料)NTT DATA Technology & Innovation
1.4K views55 slides
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019) by
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Noritaka Sekiyama
20.5K views61 slides
Hive on Tezのベストプラクティス by
Hive on TezのベストプラクティスHive on Tezのベストプラクティス
Hive on TezのベストプラクティスYahoo!デベロッパーネットワーク
3.5K views40 slides
AWSで作る分析基盤 by
AWSで作る分析基盤AWSで作る分析基盤
AWSで作る分析基盤Yu Otsubo
7.3K views118 slides
PostgreSQL のイケてるテクニック7選 by
PostgreSQL のイケてるテクニック7選PostgreSQL のイケてるテクニック7選
PostgreSQL のイケてるテクニック7選Tomoya Kawanishi
6.4K views14 slides

What's hot(20)

[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa... by Insight Technology, Inc.
[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa...[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa...
[C33] 24時間365日「本当に」止まらないデータベースシステムの導入 ~AlwaysOn+Qシステムで完全無停止運用~ by Nobuyuki Sa...
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019) by Noritaka Sekiyama
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Noritaka Sekiyama20.5K views
AWSで作る分析基盤 by Yu Otsubo
AWSで作る分析基盤AWSで作る分析基盤
AWSで作る分析基盤
Yu Otsubo7.3K views
PostgreSQL のイケてるテクニック7選 by Tomoya Kawanishi
PostgreSQL のイケてるテクニック7選PostgreSQL のイケてるテクニック7選
PostgreSQL のイケてるテクニック7選
Tomoya Kawanishi6.4K views
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み) by Hironobu Suzuki
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
Hironobu Suzuki20.9K views
PostgreSQLクエリ実行の基礎知識 ~Explainを読み解こう~ by Miki Shimogai
PostgreSQLクエリ実行の基礎知識 ~Explainを読み解こう~PostgreSQLクエリ実行の基礎知識 ~Explainを読み解こう~
PostgreSQLクエリ実行の基礎知識 ~Explainを読み解こう~
Miki Shimogai118.6K views
Redisの特徴と活用方法について by Yuji Otani
Redisの特徴と活用方法についてRedisの特徴と活用方法について
Redisの特徴と活用方法について
Yuji Otani101.5K views
最近のストリーム処理事情振り返り by Sotaro Kimura
最近のストリーム処理事情振り返り最近のストリーム処理事情振り返り
最近のストリーム処理事情振り返り
Sotaro Kimura17.3K views
db tech showcase 2019 SQL Database Hyperscale 徹底分析 - 最新アーキテクチャの特徴を理解する by Masayuki Ozawa
db tech showcase 2019 SQL Database Hyperscale 徹底分析 - 最新アーキテクチャの特徴を理解するdb tech showcase 2019 SQL Database Hyperscale 徹底分析 - 最新アーキテクチャの特徴を理解する
db tech showcase 2019 SQL Database Hyperscale 徹底分析 - 最新アーキテクチャの特徴を理解する
Masayuki Ozawa2.1K views
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送 by Google Cloud Platform - Japan
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送
Securing Your Apache Spark Applications by Cloudera, Inc.
Securing Your Apache Spark ApplicationsSecuring Your Apache Spark Applications
Securing Your Apache Spark Applications
Cloudera, Inc.7.8K views
監査要件を有するシステムに対する PostgreSQL 導入の課題と可能性 by Ohyama Masanori
監査要件を有するシステムに対する PostgreSQL 導入の課題と可能性監査要件を有するシステムに対する PostgreSQL 導入の課題と可能性
監査要件を有するシステムに対する PostgreSQL 導入の課題と可能性
Ohyama Masanori22.1K views
超実践 Cloud Spanner 設計講座 by Samir Hammoudi
超実践 Cloud Spanner 設計講座超実践 Cloud Spanner 設計講座
超実践 Cloud Spanner 設計講座
Samir Hammoudi21.3K views
Data platformdesign by Ryoma Nagata
Data platformdesignData platformdesign
Data platformdesign
Ryoma Nagata1.5K views

Viewers also liked

【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術 by
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術Unity Technologies Japan K.K.
186.1K views89 slides
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法 by
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法Yoshifumi Kawai
54.1K views27 slides
RuntimeUnitTestToolkit for Unity by
RuntimeUnitTestToolkit for UnityRuntimeUnitTestToolkit for Unity
RuntimeUnitTestToolkit for UnityYoshifumi Kawai
70.2K views17 slides
NextGen Server/Client Architecture - gRPC + Unity + C# by
NextGen Server/Client Architecture - gRPC + Unity + C#NextGen Server/Client Architecture - gRPC + Unity + C#
NextGen Server/Client Architecture - gRPC + Unity + C#Yoshifumi Kawai
78.8K views38 slides
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践 by
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践Yoshifumi Kawai
253.6K views53 slides
Fluentd - road to v1 - by
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -N Masahiro
17.7K views29 slides

Viewers also liked(19)

【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術 by Unity Technologies Japan K.K.
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
【Unite 2017 Tokyo】「黒騎士と白の魔王」にみるC#で統一したサーバー/クライアント開発と現実的なUniRx使いこなし術
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法 by Yoshifumi Kawai
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
ZeroFormatterに見るC#で最速のシリアライザを作成する100億の方法
Yoshifumi Kawai54.1K views
RuntimeUnitTestToolkit for Unity by Yoshifumi Kawai
RuntimeUnitTestToolkit for UnityRuntimeUnitTestToolkit for Unity
RuntimeUnitTestToolkit for Unity
Yoshifumi Kawai70.2K views
NextGen Server/Client Architecture - gRPC + Unity + C# by Yoshifumi Kawai
NextGen Server/Client Architecture - gRPC + Unity + C#NextGen Server/Client Architecture - gRPC + Unity + C#
NextGen Server/Client Architecture - gRPC + Unity + C#
Yoshifumi Kawai78.8K views
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践 by Yoshifumi Kawai
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
「黒騎士と白の魔王」gRPCによるHTTP/2 - API, Streamingの実践
Yoshifumi Kawai253.6K views
Fluentd - road to v1 - by N Masahiro
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -
N Masahiro17.7K views
H2O - making HTTP better by Kazuho Oku
H2O - making HTTP betterH2O - making HTTP better
H2O - making HTTP better
Kazuho Oku54.5K views
ガチ(?)対決!OSSのジョブ管理ツール by 賢 秋穂
ガチ(?)対決!OSSのジョブ管理ツールガチ(?)対決!OSSのジョブ管理ツール
ガチ(?)対決!OSSのジョブ管理ツール
賢 秋穂27.5K views
A Framework for LightUp Applications of Grani by Yoshifumi Kawai
A Framework for LightUp Applications of GraniA Framework for LightUp Applications of Grani
A Framework for LightUp Applications of Grani
Yoshifumi Kawai51.8K views
Fluentd and Embulk Game Server 4 by N Masahiro
Fluentd and Embulk Game Server 4Fluentd and Embulk Game Server 4
Fluentd and Embulk Game Server 4
N Masahiro8K views
EmbulkとDigdagとデータ分析基盤と by Toru Takahashi
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
Toru Takahashi22.5K views
How to Make Own Framework built on OWIN by Yoshifumi Kawai
How to Make Own Framework built on OWINHow to Make Own Framework built on OWIN
How to Make Own Framework built on OWIN
Yoshifumi Kawai38.1K views
async/await不要論 by bleis tift
async/await不要論async/await不要論
async/await不要論
bleis tift29K views
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践 by Yoshifumi Kawai
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践
AWS + Windows(C#)で構築する.NET最先端技術によるハイパフォーマンスウェブアプリケーション開発実践
Yoshifumi Kawai191.5K views
The History of Reactive Extensions by Yoshifumi Kawai
The History of Reactive ExtensionsThe History of Reactive Extensions
The History of Reactive Extensions
Yoshifumi Kawai60.8K views
Reactive Programming by UniRx for Asynchronous & Event Processing by Yoshifumi Kawai
Reactive Programming by UniRx for Asynchronous & Event ProcessingReactive Programming by UniRx for Asynchronous & Event Processing
Reactive Programming by UniRx for Asynchronous & Event Processing
Yoshifumi Kawai65.5K views
UniRx - Reactive Extensions for Unity by Yoshifumi Kawai
UniRx - Reactive Extensions for UnityUniRx - Reactive Extensions for Unity
UniRx - Reactive Extensions for Unity
Yoshifumi Kawai68.7K views
HttpClient詳解、或いは非同期の落とし穴について by Yoshifumi Kawai
HttpClient詳解、或いは非同期の落とし穴についてHttpClient詳解、或いは非同期の落とし穴について
HttpClient詳解、或いは非同期の落とし穴について
Yoshifumi Kawai90.5K views
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例 by Yoshifumi Kawai
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Metaprogramming Universe in C# - 実例に見るILからRoslynまでの活用例
Yoshifumi Kawai125.1K views

Similar to Embulk, an open-source plugin-based parallel bulk data loader

Treasure Data and OSS by
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
6K views39 slides
AWS (Hadoop) Meetup 30.04.09 by
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
531 views59 slides
Plugin-based software design with Ruby and RubyGems by
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsSadayuki Furuhashi
7.9K views78 slides
Play framework productivity formula by
Play framework   productivity formula Play framework   productivity formula
Play framework productivity formula Sorin Chiprian
2.7K views13 slides
Site Performance - From Pinto to Ferrari by
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to FerrariJoseph Scott
3.5K views112 slides
PuppetDB: Sneaking Clojure into Operations by
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operationsgrim_radical
1.5K views109 slides

Similar to Embulk, an open-source plugin-based parallel bulk data loader(20)

Treasure Data and OSS by N Masahiro
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
N Masahiro6K views
Plugin-based software design with Ruby and RubyGems by Sadayuki Furuhashi
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
Sadayuki Furuhashi7.9K views
Play framework productivity formula by Sorin Chiprian
Play framework   productivity formula Play framework   productivity formula
Play framework productivity formula
Sorin Chiprian2.7K views
Site Performance - From Pinto to Ferrari by Joseph Scott
Site Performance - From Pinto to FerrariSite Performance - From Pinto to Ferrari
Site Performance - From Pinto to Ferrari
Joseph Scott3.5K views
PuppetDB: Sneaking Clojure into Operations by grim_radical
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
grim_radical1.5K views
Digdagによる大規模データ処理の自動化とエラー処理 by Sadayuki Furuhashi
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
Sadayuki Furuhashi23.7K views
12 core technologies you should learn, love, and hate to be a 'real' technocrat by Jonathan Linowes
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat
Jonathan Linowes911 views
Clug 2011 March web server optimisation by grooverdan
Clug 2011 March  web server optimisationClug 2011 March  web server optimisation
Clug 2011 March web server optimisation
grooverdan265 views
12 core technologies you should learn, love, and hate to be a 'real' technocrat by linoj
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat
linoj902 views
Caching and tuning fun for high scalability by Wim Godden
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalability
Wim Godden1.5K views
Drupal Performance : DrupalCamp North by Philip Norton
Drupal Performance : DrupalCamp NorthDrupal Performance : DrupalCamp North
Drupal Performance : DrupalCamp North
Philip Norton1.7K views
High concurrency,
Low latency analytics
using Spark/Kudu by Chris George
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
Chris George799 views
Profiling PHP with Xdebug / Webgrind by Sam Keen
Profiling PHP with Xdebug / WebgrindProfiling PHP with Xdebug / Webgrind
Profiling PHP with Xdebug / Webgrind
Sam Keen43.8K views
Caching and tuning fun for high scalability by Wim Godden
Caching and tuning fun for high scalabilityCaching and tuning fun for high scalability
Caching and tuning fun for high scalability
Wim Godden2.4K views
The Future is Now: Leveraging the Cloud with Ruby by Robert Dempsey
The Future is Now: Leveraging the Cloud with RubyThe Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with Ruby
Robert Dempsey1K views
Complex Made Simple: Sleep Better with TorqueBox by bobmcwhirter
Complex Made Simple: Sleep Better with TorqueBoxComplex Made Simple: Sleep Better with TorqueBox
Complex Made Simple: Sleep Better with TorqueBox
bobmcwhirter1.7K views

More from Sadayuki Furuhashi

Scripting Embulk Plugins by
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk PluginsSadayuki Furuhashi
2.6K views14 slides
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019 by
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Sadayuki Furuhashi
4.7K views63 slides
Making KVS 10x Scalable by
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x ScalableSadayuki Furuhashi
3.4K views39 slides
Automating Workflows for Analytics Pipelines by
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
1.9K views28 slides
Fluentd at Bay Area Kubernetes Meetup by
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupSadayuki Furuhashi
1.2K views31 slides
DigdagはなぜYAMLなのか? by
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?Sadayuki Furuhashi
5.5K views35 slides

More from Sadayuki Furuhashi(20)

Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019 by Sadayuki Furuhashi
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Sadayuki Furuhashi4.7K views
Automating Workflows for Analytics Pipelines by Sadayuki Furuhashi
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi1.9K views
Logging for Production Systems in The Container Era by Sadayuki Furuhashi
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
Sadayuki Furuhashi1.4K views
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11 by Sadayuki Furuhashi
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
Sadayuki Furuhashi71.1K views
Understanding Presto - Presto meetup @ Tokyo #1 by Sadayuki Furuhashi
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
Sadayuki Furuhashi14.2K views
Presto - Hadoop Conference Japan 2014 by Sadayuki Furuhashi
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
Sadayuki Furuhashi107.3K views
Prestogres, ODBC & JDBC connectivity for Presto by Sadayuki Furuhashi
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
Sadayuki Furuhashi6.3K views
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual by Sadayuki Furuhashi
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
Sadayuki Furuhashi9.2K views

Recently uploaded

DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko... by
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...Deltares
14 views23 slides
AI and Ml presentation .pptx by
AI and Ml presentation .pptxAI and Ml presentation .pptx
AI and Ml presentation .pptxFayazAli87
12 views15 slides
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium... by
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Lisi Hocke
35 views124 slides
SUGCON ANZ Presentation V2.1 Final.pptx by
SUGCON ANZ Presentation V2.1 Final.pptxSUGCON ANZ Presentation V2.1 Final.pptx
SUGCON ANZ Presentation V2.1 Final.pptxJack Spektor
23 views34 slides
Fleet Management Software in India by
Fleet Management Software in India Fleet Management Software in India
Fleet Management Software in India Fleetable
11 views1 slide
nintendo_64.pptx by
nintendo_64.pptxnintendo_64.pptx
nintendo_64.pptxpaiga02016
5 views7 slides

Recently uploaded(20)

DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko... by Deltares
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...
DSD-INT 2023 Simulation of Coastal Hydrodynamics and Water Quality in Hong Ko...
Deltares14 views
AI and Ml presentation .pptx by FayazAli87
AI and Ml presentation .pptxAI and Ml presentation .pptx
AI and Ml presentation .pptx
FayazAli8712 views
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium... by Lisi Hocke
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Lisi Hocke35 views
SUGCON ANZ Presentation V2.1 Final.pptx by Jack Spektor
SUGCON ANZ Presentation V2.1 Final.pptxSUGCON ANZ Presentation V2.1 Final.pptx
SUGCON ANZ Presentation V2.1 Final.pptx
Jack Spektor23 views
Fleet Management Software in India by Fleetable
Fleet Management Software in India Fleet Management Software in India
Fleet Management Software in India
Fleetable11 views
Advanced API Mocking Techniques by Dimpy Adhikary
Advanced API Mocking TechniquesAdvanced API Mocking Techniques
Advanced API Mocking Techniques
Dimpy Adhikary19 views
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with... by sparkfabrik
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
sparkfabrik7 views
DSD-INT 2023 Process-based modelling of salt marsh development coupling Delft... by Deltares
DSD-INT 2023 Process-based modelling of salt marsh development coupling Delft...DSD-INT 2023 Process-based modelling of salt marsh development coupling Delft...
DSD-INT 2023 Process-based modelling of salt marsh development coupling Delft...
Deltares7 views
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J... by Deltares
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...
DSD-INT 2023 3D hydrodynamic modelling of microplastic transport in lakes - J...
Deltares12 views
FIMA 2023 Neo4j & FS - Entity Resolution.pptx by Neo4j
FIMA 2023 Neo4j & FS - Entity Resolution.pptxFIMA 2023 Neo4j & FS - Entity Resolution.pptx
FIMA 2023 Neo4j & FS - Entity Resolution.pptx
Neo4j8 views
360 graden fabriek by info33492
360 graden fabriek360 graden fabriek
360 graden fabriek
info33492122 views
Navigating container technology for enhanced security by Niklas Saari by Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy14 views
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated... by TomHalpin9
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
TomHalpin96 views
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action by Márton Kodok
Gen Apps on Google Cloud PaLM2 and Codey APIs in ActionGen Apps on Google Cloud PaLM2 and Codey APIs in Action
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action
Márton Kodok6 views
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by animuscrm
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
animuscrm15 views
Myths and Facts About Hospice Care: Busting Common Misconceptions by Care Coordinations
Myths and Facts About Hospice Care: Busting Common MisconceptionsMyths and Facts About Hospice Care: Busting Common Misconceptions
Myths and Facts About Hospice Care: Busting Common Misconceptions

Embulk, an open-source plugin-based parallel bulk data loader

  • 1. Sadayuki Furuhashi Founder & Software Architect Treasure Data, inc. EmbulkAn open-source plugin-based parallel bulk data loader that makes painful data integration work relaxed. Sharing our knowledge on RubyGems to manage arbitrary files.
  • 2. A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - Efficient object serializer > Fluentd - An unified data collection tool > Prestogres - PostgreSQL protocol gateway for Presto > Embulk - A plugin-based parallel bulk data loader > ServerEngine - A Ruby framework to build multiprocess servers > LS4 - A distributed object storage with cross-region replication > kumofs - A distributed strong-consistent key-value data store
  • 3. Today’s talk > What’s Embulk? > How Embulk works? > The architecture > Writing Embulk plugins > Roadmap & Development > Q&A + Discussion
  • 4. What’s Embulk? > An open-source parallel bulk data loader > using plugins > to make data integration relaxed.
  • 5. What’s Embulk? > An open-source parallel bulk data loader > loads records from “A” to “B” > using plugins > for various kinds of “A” and “B” > to make data integration relaxed. > which was very painful… Storage, RDBMS, NoSQL, Cloud Service, etc. broken records,
 transactions (idempotency),
 performance, …
  • 6. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 1. First attempt → fails > 2. Write a script to make the records cleaned • Convert ”20150127T190500Z” → “2015-01-27 19:05:00 UTC” • Convert “N" → “” • many cleanings… > 3. Second attempt → another error • Convert “Inf” → “Infinity” > 4. Fix the script, retry, retry, retry… > 5. Oh, some data got loaded twice!?
  • 7. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 6. Ok, the script worked. > 7. Register it to cron to sync data every day. > 8. One day… it fails with another error • Convert invalid UTF-8 byte sequence to U+FFFD
  • 8. The pains of bulk data loading Example: load 10GB CSV × 720 files > Most of scripts are slow. • People have little time to optimize bulk load scripts > One file takes 1 hour → 720 files takes 1 month (!?) A lot of integration efforts for each storages: > XML, JSON, Apache log format (+some custom), … > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile… > MongoDB, Elasticsearch, Redshift, Salesforce, …
  • 9. The problems: > Data cleaning (normalization) > How to normalize broken records? > Error handling > How to remove broken records? > Idempotent retrying > How to retry without duplicated loading? > Performance optimization > How to optimize the code or parallelize?
  • 10. The problems at Treasure Data Treasure Data Service? > “Fast, powerful SQL access to big data from connected applications and products, with no new infrastructure or special skills required.” > Customers want to try Treasure Data, but > SEs write scripts to bulk load their data. Hard work :( > Customers want to migrate their big data, but > Hard work :( > Fluentd solved streaming data collection, but > bulk data loading is another problem.
  • 11. A solution: > Package the efforts as a plugin. > data cleaning, error handling, retrying > Share & reuse the plugin. > don’t repeat the pains! > Keep improving the plugin code. > rather than throwing away the efforts every time > using OSS-style pull-reqs & frequent releases.
  • 12. Embulk Embulk is an open-source, plugin-based parallel bulk data loader
 that makes data integration works relaxed.
  • 14. HDFS MySQL Amazon S3 Embulk CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Idempotet retrying bulk load
  • 15. HDFS MySQL Amazon S3 Embulk CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Idempotet retrying Plugins Plugins bulk load
  • 17. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar Installing embulk Bintray
 releases Embulk is released on Bintray wget embulk.jar
  • 18. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml Guess format & schema in: type: file paths: [data/examples/] out:
 type: example in: type: file paths: [data/examples/] decoders: - {type: gzip} parser: charset: UTF-8 newline: CRLF type: csv delimiter: ',' quote: '"' header_line: true columns: - name: time
 type: timestamp
 format: '%Y-%m-%d %H:%M:%S' - name: account
 type: long - name: purchase
 type: timestamp
 format: '%Y%m%d' - name: comment
 type: string out:
 type: example guess by guess plugins
  • 19. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml
 # preview $ ./embulk preview config.yml $ vi config.yml # if necessary +--------------------------------------+---------------+--------------------+ | time:timestamp | uid:long | word:string | +--------------------------------------+---------------+--------------------+ | 2015-01-27 19:23:49 UTC | 32,864 | embulk | | 2015-01-27 19:01:23 UTC | 14,824 | jruby | | 2015-01-28 02:20:02 UTC | 27,559 | plugin | | 2015-01-29 11:54:36 UTC | 11,270 | fluentd | +--------------------------------------+---------------+--------------------+ Preview & fix config
  • 20. # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml
 # preview $ ./embulk preview config.yml $ vi config.yml # if necessary # run $ ./embulk run config.yml -o config.yml in: type: file paths: [data/examples/] decoders: - {type: gzip} parser: charset: UTF-8 newline: CRLF type: csv delimiter: ',' quote: '"' header_line: true columns: - name: time
 type: timestamp
 format: '%Y-%m-%d %H:%M:%S' - name: account
 type: long - name: purchase
 type: timestamp
 format: '%Y%m%d' - name: comment
 type: string last_paths: [data/examples/sample_001.csv.gz] out:
 type: example Deterministic run
  • 21. in: type: file paths: [data/examples/] decoders: - {type: gzip} parser: charset: UTF-8 newline: CRLF type: csv delimiter: ',' quote: '"' header_line: true columns: - name: time
 type: timestamp
 format: '%Y-%m-%d %H:%M:%S' - name: account
 type: long - name: purchase
 type: timestamp
 format: '%Y%m%d' - name: comment
 type: string last_paths: [data/examples/sample_002.csv.gz] out:
 type: example Repeat # install $ wget https://bintray.com/artifact/download/ embulk/maven/embulk-0.2.0.jar -o embulk.jar $ chmod 755 embulk.jar
 # guess $ vi partial-config.yml $ ./embulk guess partial-config.yml
 -o config.yml
 # preview $ ./embulk preview config.yml $ vi config.yml # if necessary # run $ ./embulk run config.yml -o config.yml # repeat $ ./embulk run config.yml -o config.yml $ ./embulk run config.yml -o config.yml
  • 24. InputPlugin OutputPlugin Embulk executor plugin MySQL, Cassandra, HBase, Elasticsearch,
 Treasure Data, … record record
  • 26. InputPlugin FileInputPlugin OutputPlugin FileOutputPlugin EncoderPlugin FormatterPlugin DecoderPlugin ParserPlugin Embulk executor plugin HDFS, S3,
 Riak CS, … gzip, bzip2,
 3des, … CSV, JSON,
 RCFile, … buffer buffer record record buffer buffer
  • 28. InputPlugin module Embulk class InputExample < InputPlugin Plugin.register_input('example', self) def self.transaction(config, &control) # read config task = { 'message' => config.param('message', :string, default: nil) } threads = config.param('threads', :int, default: 2) columns = [ Column.new(0, 'col0', :long), Column.new(1, 'col1', :double), Column.new(2, 'col2', :string), ] # BEGIN here commit_reports = yield(task, columns, threads) # COMMIT here puts "Example input finished" return {} end def run(task, schema, index, page_builder) puts "Example input thread #{@index}…" 10.times do |i| @page_builder.add([i, 10.0, "example"]) end @page_builder.finish commit_report = { } return commit_report end end end
  • 29. OutputPlugin module Embulk class OutputExample < OutputPlugin Plugin.register_output('example', self) def self.transaction( config, schema, processor_count, &control) # read config task = { 'message' => config.param('message', :string, default: "record") } puts "Example output started." commit_reports = yield(task) puts "Example output finished. Commit reports = #{commit_reports.to_json}" return {} end def initialize(task, schema, index) puts "Example output thread #{index}..." super @message = task.prop('message', :string) @records = 0 end def add(page) page.each do |record| hash = Hash[schema.names.zip(record)] puts "#{@message}: #{hash.to_json}" @records += 1 end end def finish end def abort end def commit commit_report = { "records" => @records } return commit_report end end end
  • 30. GuessPlugin # guess_gzip.rb module Embulk class GzipGuess < GuessPlugin Plugin.register_guess('gzip', self) GZIP_HEADER = "x1f x8b".force_encoding('ASCII-8BIT').freeze def guess(config, sample_buffer) if sample_buffer[0,2] == GZIP_HEADER return {"decoders" => [{"type" => "gzip"}]} end return {} end end end # guess_ module Embulk class GuessNewline < TextGuessPlugin Plugin.register_guess('newline', self) def guess_text(config, sample_text) cr_count = sample_text.count("r") lf_count = sample_text.count("n") crlf_count = sample_text.scan(/rn/).length if crlf_count > cr_count / 2 && crlf_count > lf_count / 2 return {"parser" => {"newline" => "CRLF"}} elsif cr_count > lf_count / 2 return {"parser" => {"newline" => "CR"}} else return {"parser" => {"newline" => "LF"}} end end end end
  • 31. Releasing to RubyGems Examples > embulk-plugin-postgres-json.gem > https://github.com/frsyuki/embulk-plugin-postgres-json > embulk-plugin-redis.gem > https://github.com/komamitsu/embulk-plugin-redis > embulk-plugin-input-sfdc-event-log-files.gem > https://github.com/nahi/embulk-plugin-input-sfdc-event- log-files
  • 33. Roadmap > Add missing JRuby Plugin APIs > ParserPlugin, FormatterPlugin > DecoderPlugin, EncoderPlugin > Add Executor plugin SPI > Add ssh distributed executor > embulk run —command ssh %host embulk run %task > Add MapReduce executor > Add support for nested records (?)
  • 34. Contributing to the Embulk project > Pull-requests & issues on Github > Posting blogs > “I tried Embulk. Here is how it worked” > “I read Embulk code. Here is how it’s written” > “Embulk is good because…but bad because…” > Talking on Twitter with a word “embulk" > Writing & releasing plugins > Windows support > Integration to other software > ETL tools, Fluentd, Hadoop, Presto, …
  • 35. Q&A + Discussion? Hiroshi Nakamura @nahi Muga Nishizawa @muga_nishizawa Sadayuki Furuhashi @frsyuki Embulk committers:
  • 36. https://jobs.lever.co/treasure-data Cloud service for the entire data pipeline. We’re hiring!