Embulk - The Evolving Bulk Data Loader
Sadayuki Furuhashi 
Founder & Software Architect
Embulk Meetup Tokyo #2

A little about me…
Sadayuki Furuhashi
github: @frsyuki
Fluentd - Unified log collection infrastructure
Embulk - Plugin-based parallel ETL
Founder & Software Architect

What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B” (storage, RDBMS, NoSQL, cloud services, etc.)
> to make data integration easy,
> which was very painful (broken records, transactions (idempotency), performance, …).

The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
• Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
• Convert “\N” → “”
• many more cleanups…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?

Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
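The cleanups listed in this example could be sketched as a small Python function. This is only an illustration of the kind of hand-written script the slides complain about, not Embulk code. The function name `clean_field` is hypothetical, and the sketch assumes the null marker in the CSV is `\N`.

```python
from datetime import datetime


def clean_field(raw: bytes) -> str:
    """Normalize one raw CSV field before loading it into PostgreSQL."""
    # Invalid UTF-8 byte sequences are replaced with U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Assumption: "\N" is the null marker and should become an empty string.
    if text == r"\N":
        return ""
    # PostgreSQL spells infinity as "Infinity", not "Inf".
    if text == "Inf":
        return "Infinity"
    # Rewrite ISO 8601 UTC timestamps; leave everything else untouched.
    try:
        ts = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return text
    return ts.strftime("%Y-%m-%d %H:%M:%S UTC")
```

Every new kind of broken record means another branch in a function like this, which is the maintenance cost the following slides describe.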

Example: load 10GB CSV × 720 files
> Most scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
A lot of integration effort for each format and storage system:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …

The problems:
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicated loading?
> Performance optimization
> How to optimize the code or parallelize?

[Diagram: Embulk bulk-loads data between systems through input and output plugins. Systems shown: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis.]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming

Embulk’s Plugin Architecture
[Diagram 1: Embulk Core connects Input → Filter → Filter → Output plugins, alongside the Executor plugin and Guess plugins.]
[Diagram 2: for file-based input, the Input plugin is split into FileInput, Decoder, and Parser plugins.]
[Diagram 3: for file-based output, the Output plugin is split into Formatter, Encoder, and FileOutput plugins.]

Execution overview
[Diagram: a Transaction, which runs on a single thread, creates taskCount Tasks, which run on multiple threads (or machines). Each Task receives a payload such as { taskIndex: 0, task: {…} } or { taskIndex: 2, task: {…} }.]

Parallel execution
[Diagram: Tasks wait in a task queue and are picked up by threads, which run the tasks in parallel (embulk-executor-local-thread).]
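The task-queue-plus-threads idea can be sketched in Python as follows. This is a minimal illustration under assumptions, not how embulk-executor-local-thread is implemented; `run_tasks_in_parallel` and its parameters are hypothetical names.

```python
import queue
import threading


def run_tasks_in_parallel(task_count, run, thread_count=4):
    """Run run(task_index) for every index, using a shared task queue."""
    tasks = queue.Queue()
    for task_index in range(task_count):
        tasks.put(task_index)
    results = [None] * task_count

    def worker():
        # Each thread keeps pulling task indexes until the queue is empty.
        while True:
            try:
                task_index = tasks.get_nowait()
            except queue.Empty:
                return
            results[task_index] = run(task_index)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each task writes only to its own slot in `results`, the threads need no extra locking in this sketch.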

Distributed execution
[Diagram: Tasks wait in a task queue and run as Map tasks on Hadoop (embulk-executor-mapreduce).]

Distributed execution (w/ partitioning)
[Diagram: Tasks wait in a task queue and run as Map - Shuffle - Reduce jobs on Hadoop (embulk-executor-mapreduce).]

Transaction control
fileInput.transaction {              // file input plugin
  parser.transaction {               // parser plugin
    filters.transaction {            // filter plugins
      formatter.transaction {        // formatter plugin
        fileOutput.transaction {     // file output plugin
          executor.transaction {     // executor plugin
            …                        // Tasks run here
          }
        }
      }
    }
  }
}

Task configuration
fileInput.transaction { fileInputTask, taskCount →      // taskCount is decided by input
  parser.transaction { parserTask, schema →             // schema is decided by input,
    filters.transaction { filterTasks, schema →         // and may be modified by filters
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = {
              fileInputTask,
              parserTask,
              filterTasks,
              formatterTask,
              fileOutputTask,
            }
            taskCount.times.inParallel { taskIndex →
              run(taskIndex, task)
            }
          }
        }
      }
    }
  }
}
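The nested-transaction pattern above can be sketched in Python with callbacks. This is a simplified illustration with only an input and an output stage. All function names and the shape of the config are assumptions, not Embulk's plugin API.

```python
from concurrent.futures import ThreadPoolExecutor


def input_transaction(config, control):
    # The input side decides taskCount (here: one task per file).
    input_task = {"files": list(config["files"])}
    return control(input_task, len(input_task["files"]))


def output_transaction(config, control):
    output_task = {"table": config["table"]}
    return control(output_task)


def run_task(task_index, task):
    # Stand-in for the real work of loading one file.
    path = task["input"]["files"][task_index]
    return f"{path} -> {task['output']['table']}"


def run_transaction(config):
    def with_input(input_task, task_count):
        def with_output(output_task):
            # The innermost level sees every plugin's task and runs them all.
            task = {"input": input_task, "output": output_task}
            with ThreadPoolExecutor(max_workers=4) as pool:
                return list(pool.map(lambda i: run_task(i, task), range(task_count)))
        return output_transaction(config, with_output)
    return input_transaction(config, with_input)
```

Each outer transaction stays open while the inner ones run, so a plugin can commit or clean up once all tasks have finished.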

Task execution
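[Diagram: inside each Task, the file input plugin is opened with fileInput.open(), the parser plugin runs parser.run(fileInput, pageOutput), the parsed pages pass through the filter plugins, the formatter plugin is opened with formatter.open(fileOutput), and the file output plugin is opened with fileOutput.open().]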
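The flow inside one task can be sketched as a chain of Python generators. This is a rough analogy under assumptions: the function names are hypothetical, rows stand in for Embulk's pages, and CSV-to-TSV is chosen only to keep the example small.

```python
import io


def parse_csv_lines(file_input):
    """Parser: turn each input line into a row (list of strings)."""
    for line in file_input:
        yield line.rstrip("\n").split(",")


def drop_empty_rows(rows):
    """Filter: skip rows whose columns are all empty."""
    for row in rows:
        if any(row):
            yield row


def format_tsv(rows, file_output):
    """Formatter: write each row as a tab-separated line."""
    for row in rows:
        file_output.write("\t".join(row) + "\n")


def run_pipeline(file_input, file_output, filters):
    rows = parse_csv_lines(file_input)
    for apply_filter in filters:
        rows = apply_filter(rows)
    format_tsv(rows, file_output)
```

Usage: `run_pipeline(io.StringIO("a,b\n,\nc,d\n"), out, [drop_empty_rows])` writes two tab-separated lines to `out` and drops the empty row.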

Type conversion
Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
  ↓ converted by the input plugin (parser plugin if input is file-based)
Embulk type system: boolean, long, double, string, timestamp
  ↓ converted by the output plugin (formatter plugin if output is file-based)
Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
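The input-side conversion can be pictured as a lookup table from source types to the five Embulk types. The mapping below is an illustrative guess, not the mapping any real input plugin uses; `POSTGRES_TO_EMBULK` and `to_embulk_schema` are hypothetical names.

```python
# Assumed mapping for illustration only; a real plugin may choose differently.
POSTGRES_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}


def to_embulk_schema(columns):
    """Map (name, postgres_type) pairs to (name, embulk_type) pairs."""
    schema = []
    for name, pg_type in columns:
        # Types missing from the table fall back to string in this sketch.
        schema.append((name, POSTGRES_TO_EMBULK.get(pg_type, "string")))
    return schema
```

Funnelling every source type into a small common type system means an output plugin only has to handle five types, whatever the input was.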

What’s added since the first release?
• v0.3
• Resuming
• Filter plugin type
• v0.4
• Plugin template generator
• Incremental execution (ConfigDiff)
• Isolated ClassLoaders for Java plugins
• Polyglot command launcher

What’s added since the first release?
• v0.6
• Executor plugin type
• Liquid template engine
• v0.7
• EmbulkEmbed & Embulk::Runner
• Plugin bundle (embulk-mkbundle)
• JRuby 9000
• Gradle v2.6

Resuming
• Retries a failed transaction without retrying everything.
• Skips successful tasks by using information stored in a file by the previous transaction.
• embulk run config.yml -r resume-state.yml
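The idea of skipping already-successful tasks can be sketched in Python. The JSON state file used here is an assumption made for the sketch and is not the format of Embulk's resume-state.yml; `run_with_resume` is a hypothetical name.

```python
import json
import os


def run_with_resume(task_count, run, state_path):
    """Run tasks in order, skipping those recorded as done in state_path."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))
    for task_index in range(task_count):
        if task_index in done:
            continue  # Finished in a previous attempt.
        try:
            run(task_index)
        except Exception:
            # Remember what succeeded so the next attempt can skip it.
            with open(state_path, "w") as f:
                json.dump(sorted(done), f)
            raise
        done.add(task_index)
    # Everything succeeded, so the state file is no longer needed.
    if os.path.exists(state_path):
        os.remove(state_path)
    return sorted(done)
```

Calling the function again with the same `state_path` after a failure reruns only the tasks that had not completed.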

Filter plugin type
• Filters rows out, filters columns out, or enriches the data. 18 plugins released.

Plugin template generator
• Generates a template for a plugin.
• The generated code is ready to compile.
> You modify & compile it to do your work.
• embulk new <category> <new>

Incremental execution
• Stores the last file name or row in a file, and the next execution starts from there.
• Use case: sync new files on S3 to Elasticsearch every day.
• embulk run config.yml -o next-config.yml
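The "start from where the last run stopped" idea can be sketched as a function that selects only files sorted after the last loaded one. This is a minimal illustration, assuming file names sort in arrival order; `select_new_files` and `last_path` are hypothetical names, not Embulk's ConfigDiff API.

```python
def select_new_files(all_files, last_path=None):
    """Return the files to load now and the last_path to store for next time."""
    new_files = sorted(
        path for path in all_files if last_path is None or path > last_path
    )
    # If nothing new arrived, keep the previous marker unchanged.
    next_last_path = new_files[-1] if new_files else last_path
    return new_files, next_last_path
```

A daily job would load `new_files`, then persist `next_last_path` so that the next run picks up only later files.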

Isolated ClassLoaders for Java plugins
• Embulk can load Java plugins that depend on different versions of the same library.

Plugin Version Conflicts
[Diagram: on one Java runtime, Embulk Core loads embulk-input-s3.jar with aws-sdk.jar v1.9 and embulk-output-redshift.jar with aws-sdk.jar v1.10. Version conflicts!]

Multiple Classloaders in JVM
[Diagram: on one Java runtime, Embulk Core loads embulk-input-s3.jar with aws-sdk.jar v1.9 in Class Loader 1 and embulk-output-redshift.jar with aws-sdk.jar v1.10 in Class Loader 2. The two are isolated environments.]

Polyglot launcher script
• embulk.jar is a jar file.
• embulk.jar is a shell script.
• embulk.jar is a bat script.
• It sets JVM options to improve performance.
• ./embulk run abc

Executor plugin type
• embulk-executor-mapreduce executes tasks in a distributed environment.

Liquid template engine
• A config file can include variables.
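Variable substitution in a config file can be pictured with the small Python sketch below. It does not use Liquid; it only mimics replacing `{{ env.NAME }}` placeholders, and both the placeholder syntax shown and the `render_config` name are assumptions made for illustration.

```python
import re

# Matches placeholders such as "{{ env.DATA_DIR }}" (assumed syntax).
_VARIABLE = re.compile(r"\{\{\s*env\.(\w+)\s*\}\}")


def render_config(template, env):
    """Replace every {{ env.NAME }} placeholder with env["NAME"]."""
    return _VARIABLE.sub(lambda match: env[match.group(1)], template)
```

Usage: `render_config("path_prefix: {{ env.DATA_DIR }}/sample_", {"DATA_DIR": "/tmp"})` returns the same line with `/tmp` filled in, so one config file can be reused across environments.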

EmbulkEmbed & Embulk::Runner
• Embeds Embulk in an application.

Plugin bundle
• Uses fixed versions of plugins.
• embulk mkbundle my-project
• embulk run -b my-project config.yml

Gradle v2.6
• Continuous compiling.
• “embulk migrate .” upgrades the Gradle version of your plugin project.
• ./gradlew -t build

Future plan
• v0.8
• JSON type (issue #306)
• Error plugin type (#27, #124)
• More (or less) concurrency for output (#231)
• v0.9
• More Guess (#242, #235)
• Multiple jobs using a single config file (#167)
