Using Embulk at Treasure Data

Muga Nishizawa (西澤無我)

Today’s talk
> What’s Embulk?
> Why do our customers use Embulk?
  > Embulk
  > Data Connector
> Data Connector
  > The architecture
  > The use case
  > with MapReduce Executor
    > How do we configure MapReduce Executor?

What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B” (storage, RDBMS, NoSQL, cloud services, etc.)
> to make data integration easy,
> which was very painful (broken records, transactions and idempotency, performance, …)

[Figure: Embulk bulk-loads data between systems through input and output plugins. Systems shown: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis.]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming

Why do our customers use Embulk?
> Upload various types of their data to TD with Embulk
> Various file formats
> CSV, TSV, JSON, XML, …
> Various data sources
> Local disk, RDBMS, SFTP, …
> Various network environments
> embulk-output-td
> https://github.com/treasure-data/embulk-output-td

Out of scope for Embulk
> Customers develop their own scripts for
> generating Embulk configs
> changing the schema on a regular basis
> logic to select some files but not others
> managing cron settings
> e.g. some users want to upload yesterday’s data as a daily batch
> Embulk is just a “bulk loader”
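
The kind of wrapper script described above might look like the following minimal sketch. It is not an Embulk feature. The bucket name, the one-directory-per-day path layout, the database and table names, and the function `build_daily_config` are all hypothetical; only the general `in`/`out` shape of an Embulk config is taken from the slides.

```python
import datetime


def build_daily_config(today, bucket, base_prefix):
    """Return an Embulk-style config dict that selects only yesterday's files."""
    yesterday = today - datetime.timedelta(days=1)
    # Hypothetical layout: one directory per day, e.g. logs/2016/01/31/
    path_prefix = f"{base_prefix}/{yesterday:%Y/%m/%d}/"
    return {
        "in": {"type": "s3", "bucket": bucket, "path_prefix": path_prefix},
        # Placeholder destination; real values depend on the account.
        "out": {"type": "td", "database": "my_db", "table": "my_table"},
    }


config = build_daily_config(datetime.date(2016, 2, 1), "my-bucket", "logs")
print(config["in"]["path_prefix"])  # logs/2016/01/31/
```

A cron entry would then run this script once a day and pass the generated config to Embulk.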

Best practice to manage Embulk!!
http://www.slideshare.net/GONNakaTaka/embulk5

Data Connector
[Figure: Data Connector architecture. Components: Users/Customers, Guess/Preview API, Connector Worker, PlazmaDB. Users submit connector jobs and see the loaded data on Console.]

2 types of hosted Embulk service
> Import (Data Connector)
  > MySQL, PostgreSQL, Redshift, AWS S3, Google Cloud Storage, Salesforce, Marketo, etc.
> Export (Result Output)
  > MySQL, PostgreSQL, Redshift, BigQuery, etc.

Guess/Preview API
> Guesses the Embulk config based on sample data
> Creates the parser config
> Adds schema, escape char, quote char, etc.
> Creates the rename filter config
> TD requires lowercase column names
> Previews data before uploading
> Ensures quick response
> Embulk performs this functionality running on our web application servers
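
The rename step can be sketched as below. This is an illustration of the idea and not TD's actual Guess implementation. The function `build_rename_filter` is hypothetical, and it assumes the `columns` mapping (old name to new name) of Embulk's rename filter.

```python
def build_rename_filter(column_names):
    """Build a rename-filter-style config mapping each column to its lowercase name."""
    columns = {}
    for name in column_names:
        lowered = name.lower()
        if lowered != name:  # only columns that actually change need an entry
            columns[name] = lowered
    return {"type": "rename", "columns": columns}


print(build_rename_filter(["ID", "UserName", "time"]))
# {'type': 'rename', 'columns': {'ID': 'id', 'UserName': 'username'}}
```

The sketch does not handle collisions, such as a source that has both `ID` and `id`.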

Connector Worker
> Generates the Embulk config and executes Embulk
> Uses a private output plugin instead of embulk-output-td to upload users’ data to PlazmaDB directly
> Appropriate retry mechanism
> Embulk runs on our Job Queue clients

Timestamp parsing
> Implemented strptime in Java
> Ported from the CRuby implementation
> Can precompile the format
> Faster than JRuby’s strptime
> Has been maintained quietly in the Embulk repo
> It will be merged into JRuby
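
The benefit of precompiling a format can be illustrated with a small sketch: the format string is translated into a matcher once and then reused for every record. This is not the Embulk or CRuby algorithm. The helpers `compile_format` and `parse` are hypothetical and support only six directives.

```python
import re
from datetime import datetime

# Regex fragment for each supported directive (a small subset of strptime).
_DIRECTIVES = {
    "Y": r"(?P<year>\d{4})",
    "m": r"(?P<month>\d{2})",
    "d": r"(?P<day>\d{2})",
    "H": r"(?P<hour>\d{2})",
    "M": r"(?P<minute>\d{2})",
    "S": r"(?P<second>\d{2})",
}


def compile_format(fmt):
    """Translate a strptime-style format into a compiled regex, once."""
    parts = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            parts.append(_DIRECTIVES[fmt[i + 1]])  # KeyError if unsupported
            i += 2
        else:
            parts.append(re.escape(fmt[i]))
            i += 1
    return re.compile("".join(parts))


def parse(compiled, text):
    """Parse text with a precompiled format; missing fields use defaults."""
    m = compiled.fullmatch(text)
    if m is None:
        raise ValueError(f"cannot parse {text!r}")
    f = {k: int(v) for k, v in m.groupdict().items()}
    return datetime(f.get("year", 1970), f.get("month", 1), f.get("day", 1),
                    f.get("hour", 0), f.get("minute", 0), f.get("second", 0))


fmt = compile_format("%Y-%m-%d %H:%M:%S")  # compiled once
for line in ["2016-01-31 23:59:59", "2016-02-01 00:00:00"]:
    print(parse(fmt, line))  # reused for every record
```

In a bulk loader the same format is applied to millions of values, so the per-format translation cost is paid only once.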

How we use Data Connector at TD
> a. Monitoring access to our S3 buckets
> e.g. “Which IAM users accessed our S3 buckets?”, “Access frequency”
> {in: {type: s3}} and {parser: {type: csv}}
> b. Measuring KPIs for our development process
> e.g. “Which phases of the process took us a long time?”
> {in: {type: jira}}
> c. Measuring business and support performance
> {in: {type: Salesforce, Marketo, ZenDesk, …}}

Scaling Embulk
> Requests for massive data loading from users
> e.g. “Upload 150GB of data in an hourly batch”, “Start a PoC and upload 500GB of data today”
> Local Executor cannot handle this scale
> MapReduce Executor enables us to scale

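With MapReduce
[Figure: the Data Connector architecture with Hadoop clusters added. Components: Users/Customers, Guess/Preview API, Connector Worker, Hadoop clusters, PlazmaDB. Users submit connector jobs and see the loaded data on Console.]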

What’s MapReduce Executor?
[Figure: tasks from the task queue run as map tasks on Hadoop.]

MapReduce Executor with TimestampPartitioning
[Figure: tasks from the task queue run as map tasks on Hadoop, followed by a shuffle and reduce tasks.]
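
Partitioning by timestamp can be sketched as below: each record's time is truncated to the hour, and records with the same hour end up at the same reducer. This is not the executor's actual code. The modulo routing stands in for Hadoop's partitioner, and `partition_key` and `shuffle` are hypothetical names.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def partition_key(epoch_seconds):
    """Truncate a Unix timestamp (seconds) to the start of its hour."""
    return epoch_seconds - epoch_seconds % SECONDS_PER_HOUR


def shuffle(records, reducers):
    """Group records by hour, then route each hour to one reducer."""
    by_reducer = defaultdict(lambda: defaultdict(list))
    for record in records:
        key = partition_key(record["time"])
        # Assumption: simple modulo routing instead of a real hash partitioner.
        reducer = (key // SECONDS_PER_HOUR) % reducers
        by_reducer[reducer][key].append(record)
    return by_reducer


records = [{"time": 3600, "v": "a"}, {"time": 3700, "v": "b"}, {"time": 7300, "v": "c"}]
result = shuffle(records, reducers=3)
for reducer, hours in sorted(result.items()):
    print(reducer, {hour: [r["v"] for r in rs] for hour, rs in hours.items()})
```

Records "a" and "b" fall in the same hour and therefore reach the same reducer, so that reducer can write them out together.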

Built Embulk configs
exec:
  type: mapreduce
  job_name: embulk.100000
  config_files:
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/hdfs-site.xml
    - /etc/hadoop/conf/mapred-site.xml
  config:
    fs.defaultFS: "hdfs://my-hdfs.example.net:8020"
    yarn.resourcemanager.hostname: "my-yarn.example.net"
    dfs.replication: 1
    mapreduce.client.submit.file.replication: 1
  state_path: /mnt/xxx/embulk/
  partitioning:
    type: timestamp
    unit: hour
    column: time
    unix_timestamp_unit: hour
    map_side_partition_split: 3
  reducers: 3
in:
  ...
Connector Workers (single-machine workers) are still able to generate the config

Different-sized files
[Figure: map tasks, shuffle, and reduce tasks.]

Same time range data

Grouping input files
- {in: {min_task_size}}
[Figure: input files grouped into three tasks.]
It can also reduce the mappers’ launch cost.
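
The grouping idea can be sketched as a simple greedy packing: files are added to a task until its total size reaches the minimum. This is an assumption about the general approach and not the plugin's actual algorithm. `group_files` is a hypothetical name.

```python
def group_files(files, min_task_size):
    """Pack (name, size) pairs into tasks of at least min_task_size bytes each."""
    tasks, current, current_size = [], [], 0
    for name, size in files:
        current.append(name)
        current_size += size
        if current_size >= min_task_size:
            tasks.append(current)
            current, current_size = [], 0
    if current:  # leftover files form a final, smaller task
        tasks.append(current)
    return tasks


files = [("a.csv", 10), ("b.csv", 20), ("c.csv", 100), ("d.csv", 5)]
print(group_files(files, min_task_size=30))
# [['a.csv', 'b.csv'], ['c.csv'], ['d.csv']]
```

Fewer, larger tasks mean fewer mappers to launch, which is where the saving in launch cost comes from.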

One partition into multiple reducers
- {exec: {partitioning: {map_side_partition_split}}}
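
Splitting one partition across reducers can be sketched as below: each time partition is divided into a fixed number of sub-partitions, and each sub-partition is routed separately. How the executor really picks the sub-partition is not stated in the slides. The round-robin by record index and the name `reducer_for` are hypothetical.

```python
def reducer_for(hour_key, record_index, split, reducers):
    """Spread one time partition over `split` sub-partitions, then pick a reducer."""
    sub_partition = record_index % split
    return (hour_key * split + sub_partition) % reducers


# With split=1 every record of hour 5 goes to one reducer;
# with split=3 the same hour is spread over three reducers.
one = {reducer_for(5, i, split=1, reducers=3) for i in range(9)}
three = {reducer_for(5, i, split=3, reducers=3) for i in range(9)}
print(len(one), len(three))  # 1 3
```

The trade-off is that data for a single hour is then written by several reducers instead of one.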

Conclusion
> What’s Embulk?
> Why do we use Embulk?
  > Embulk
  > Data Connector
> Data Connector
  > The architecture of Data Connector
  > The use case
  > with MapReduce Executor
