Muga Nishizawa
Recent Updates
Embulk Meetup #3 Tokyo
I am..
> Muga Nishizawa
> @muga_nishizawa on Twitter
> @muga on GitHub
> Data Integration with Embulk
Today’s talk
> What’s Embulk?
> Recent Updates
> Major updates between Embulk meetup #2 and #3
> Our thoughts to address issues in plugins
> Future Plan
> Presented by @dmikurube
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration easy.
> which was very painful…
[Diagram: Embulk bulk-loads records, via input and output plugins, between sources and destinations such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis (storage, RDBMS, NoSQL, cloud services, etc.), handling broken records, transactions (idempotency), and performance.]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming
Recent Updates between #2 and #3
> Released 0.8.0 to 0.8.21
> https://github.com/embulk/embulk/releases
> Merged 157 PRs, closed 29 PRs
> https://github.com/embulk/embulk/pulls
Number of Embulk Plugins
> 49 inputs, 45 outputs, 54 filters, 28 parsers,
  2 decoders, 5 encoders, 8 formatters, 1 executor
> excluding built-in plugins
> http://www.embulk.org/plugins/
Support Page Scattering
since 0.8.0
> Page scattering on LocalExecutorPlugin by default
> Executes output in parallel even if there is only 1 input task
> Selects the output task by mod (page count % scatter count)
> Improves performance when the input is a single huge file
> e.g. a single 5GB CSV file
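The mod-based selection above can be sketched as follows. This is an illustrative sketch in Python, not Embulk's actual Java implementation, and `scatter_pages` is a hypothetical name.

```python
# Illustrative sketch (not Embulk's actual code): distribute pages from a
# single input task across several output tasks by modulo, as the page
# scattering on LocalExecutorPlugin is described above.
def scatter_pages(pages, scatter_count):
    """Assign the n-th page to output task (n % scatter_count)."""
    outputs = [[] for _ in range(scatter_count)]
    for page_count, page in enumerate(pages):
        outputs[page_count % scatter_count].append(page)
    return outputs
```

With 10 pages and a scatter count of 3, the tasks receive pages [0, 3, 6, 9], [1, 4, 7], and [2, 5, 8], so even a single huge input file keeps several output tasks busy.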
Support JSON Type
since 0.8.0
> Provides a JSON type as an Embulk type
> Stores JSON objects in pages
> Uses MessagePack’s Value objects as the intermediate representation
> https://github.com/msgpack/msgpack-java
> No JSON parser plugin provided at that point
Support JSON Parser Plugin
since 0.8.5
> JSON parser plugin as a built-in plugin
> Reads and parses JSON data on a line-by-line basis
  and generates MessagePack’s Value objects
> Requires only a single column in the input schema
> Various types of JSON parser plugins
> e.g. ‘jsonl’, ‘jsonpath’, etc.
> Needed to cover schema-less JSON data
> Avoids unexpected data loss
> Java implementation added to reduce object passing
  between Java and JRuby
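The line-by-line behavior above can be sketched as follows. The real plugin produces MessagePack Value objects; here Python's `json` module stands in for that intermediate representation, and `parse_jsonl` and the "record" column name are assumptions for illustration.

```python
import json

# Illustrative sketch of line-based JSON parsing into single-column records,
# mirroring the built-in parser's described behavior: one column per record
# holding the parsed value. Not Embulk's actual API.
def parse_jsonl(text, column="record"):
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing the whole file
        rows.append({column: json.loads(line)})  # a single column per record
    return rows
```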
Specific Plugins vs. Generic Plugins
> Specific plugins
> + Simple impl. and easy to understand
> - OK for now, but it will not be easy for users to find them
> 49 inputs, 45 outputs, 54 filters, 28 parsers
> Generic plugins
> + Users can reach them easily
> - Users might need to install unwanted features as well
> This is a minor issue
Built-in vs. 3rd-party
> Built-in plugins
> + Can be easily found on embulk.org
> - Provide standard features only
  e.g. JSON parser, remove_column filter, etc.
> 3rd-party plugins
> + Can provide advanced features
  e.g. the column filter plugin
  https://github.com/sonots/embulk-filter-column
Struggling with Nested JSON Retrieval
> Need to hold the whole JSON file in memory in order to parse it
> e.g. AWS CloudTrail Log File Format
> Apache Drill or Presto based filter plugins?
> ‘FLATTEN’ is very powerful and useful
> https://drill.apache.org/docs/flatten/
{
  "Records": [{
    "eventVersion": "1.0",
    "userIdentity": {
      "type": "IAMUser",
      "principalId": "EX_PRINCIPAL_ID",
      "arn": "arn:aws:iam::123456789012:user/Alice",
      "accessKeyId": "EXAMPLE_KEY_ID",
      "accountId": "123456789012",
      "userName": "Alice"
    }, … … ]}
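What a FLATTEN-style filter would do to the CloudTrail-like document above can be sketched as follows: expand each element of the "Records" array into its own row, lifting one level of nested objects into dotted column names. This is a simplification for illustration, not Drill's actual FLATTEN, and `flatten_records` is a hypothetical name.

```python
# Illustrative sketch: turn {"Records": [...]} into one flat row per record,
# flattening one level of nested objects into "parent.child" column names.
def flatten_records(doc, array_key="Records"):
    rows = []
    for record in doc.get(array_key, []):
        row = {}
        for key, value in record.items():
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    row[key + "." + sub_key] = sub_value  # e.g. userIdentity.userName
            else:
                row[key] = value
        rows.append(row)
    return rows
```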
New Embulk Logo
since 0.8.13
[Image: logo before and after]
Improved Rename Filter Plugin
since 0.8.14
> Provides a ‘rules’ option
> Users only have to specify ‘rule’s provided by the plugin
> e.g. rules that convert upper-case alphabets to
  lower-case and then truncate column names
> No need to specify column names
> column names will sometimes change
filters:
  - type: rename
    rules:
      - {rule: upper_to_lower}
      - {rule: truncate, max_length: 20}
… …
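The two rules in the config above, applied in order, can be sketched as follows: lower-case the name, then truncate it. `apply_rules` is a hypothetical helper for illustration, not the rename filter's actual implementation.

```python
# Illustrative sketch of applying the rules chain from the config above.
def apply_rules(column_names, max_length=20):
    renamed = []
    for name in column_names:
        name = name.lower()       # rule: upper_to_lower
        name = name[:max_length]  # rule: truncate, max_length: 20
        renamed.append(name)
    return renamed
```

Because the rules operate on whatever columns arrive, the config keeps working even when column names change upstream.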
Tolerance to Input Source Change
> An Embulk config should keep working even if the input changes
> e.g. the input schema changes, or a column is renamed
> For that tolerance
> plugins’ config options should be well-designed
> e.g. is the ‘columns’ option really necessary?
embulk-test Framework
since 0.8.15
> Framework for developing integration tests for plugins
> Easy to compare actual and expected data
> Not a unit test framework
> It is sometimes difficult to mock input/output
> Better to develop both unit tests and integration tests
Improved CSV Parser Guess Plugin
since 0.8.16
> Improved guess behavior for corner cases
> To be exact, cases that we thought were corner cases
  but were NOT corner cases for users
> Added tests for several changes
> To avoid regressions and improve correctly
> The embulk-test framework was necessary for that
CSV Guess Failure Examples in TD

Failed job count | Reason
           1,711 | “Attribute type is required but not set”
             210 | “Multiple entries with same key”
              35 | “No input files to read sample data”

> Sampled about 20,000 existing jobs and checked
  why their guesses failed
> 1,711 jobs failed with “Attribute type is required..”
Why did “Attribute type is required but not set” happen?

Failed job count | Column count
           1,032 | 1
             136 | 3
              79 | 5
              77 | 2
              39 | 4
              38 | 7
              35 | 10

> Checked the number of columns in the CSV files of those 1,711 jobs
> 1,032 jobs failed with a single column
Single-Column CSV Files

id
be535773-fd27-4133-b626-8cba82f03b4f
2b9d5b80-de29-4eed-bcf8-5e41bbbbcfaf
e4ae8fcb-0462-49c8-adec-97799e64170b
457fa021-9d67-4e53-956b-2842b7b2982f
55c73c6e-c3da-475c-b323-802b62889093
… …

count
10
2
69
845
91
… …

> These files don’t have a delimiter, but they are still CSV
Why did “Attribute type is required but not set” happen?

Failed job count | Column count
           1,032 | 1
             136 | 3
              79 | 5
              77 | 2
              39 | 4
              38 | 7
              35 | 10

> Checked failure reasons excluding the single-column CSV jobs
Other failures..

id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby

> Enabled guess with only a few sample lines
> Added ‘;’ to the delimiter candidates

id;account;time;purchase;comment
1;32864;2015-01-27 19:23:49;20150127;embulk
2;14824;2015-01-27 19:01:23;20150127;embulk jruby
3;27559;2015-01-28 02:20:02;20150128;"Embulk ""csv"" parser plugin"
4;11270;2015-01-29 11:54:36;20150129;NULL
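The delimiter-candidate idea above can be sketched as follows: pick the first candidate that splits every non-empty sample line into the same number of fields (more than one). This is a deliberate simplification of the real guess logic, shown with ';' among the candidates as the slide describes; `guess_delimiter` is a hypothetical name.

```python
# Illustrative sketch of delimiter guessing from a few sample lines.
def guess_delimiter(sample_lines, candidates=(",", "\t", ";", "|")):
    for delimiter in candidates:
        # A good delimiter appears the same number of times (> 0) on every line.
        counts = {line.count(delimiter) for line in sample_lines if line}
        if len(counts) == 1 and counts.pop() > 0:
            return delimiter
    return None  # no consistent delimiter found (e.g. a single-column file)
```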
embulk-filter-calcite
> Transforms column values through SQL
> Queries Page objects via a Page storage adapter
> Based on Apache Calcite
> “The foundation for your next high-performance database”
> https://calcite.apache.org/
filters:
  - type: calcite
    query: SELECT * FROM $PAGES WHERE message LIKE '%EMBULK%'
… …
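The observable effect of the query in the config above can be sketched as follows: keep only rows whose 'message' column matches LIKE '%EMBULK%' (a plain substring match here). This is not Calcite itself, and `filter_rows` is a hypothetical name.

```python
# Illustrative sketch of what the SELECT ... WHERE message LIKE '%EMBULK%'
# filter does to a set of rows: pass through only the matching ones.
def filter_rows(rows, column="message", pattern="EMBULK"):
    return [row for row in rows if pattern in str(row.get(column, ""))]
```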
Features
> We can use
> operators and functions from Apache Calcite
> https://calcite.apache.org/docs/reference.html
> e.g. CEIL, FLOOR, SUBSTRING, etc.
> We might use in the future
> aggregation functions
  e.g. COUNT, AVG, SUM, etc.
> JOIN expressions with an external source
Future Plan..
‘csv_all_strings’ Filter Plugin
since 0.8.14
> Users develop scripts for
> generating Embulk configs
> changing schemas on a regular basis
> logic to select some files but not others
> managing cron settings
> e.g. some users want to upload yesterday’s data as a daily batch
> Embulk is just a “bulk loader”
SkipTransactionException
since 0.8.15
> Plugin API that lets Embulk skip the transaction
  if the exception is thrown by the (input) plugin
> Useful for gracefully stopping an Embulk transaction
  based on the input plugin’s condition
> e.g. files not found on the input data source
> Avoids exporting data to the output destination
  unexpectedly (depends on the output plugin’s design)
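The control flow this API enables can be sketched as follows: if the input side raises the exception (e.g. no files found), the transaction is skipped and the output side never runs. Class and function names here are hypothetical Python stand-ins for Embulk's Java API.

```python
# Illustrative sketch of transaction skipping via an exception from the
# input plugin, as described above. Not Embulk's actual implementation.
class SkipTransactionException(Exception):
    pass

def run_transaction(run_input, run_output):
    try:
        records = run_input()  # may raise SkipTransactionException
    except SkipTransactionException:
        return "skipped"       # the output side is never executed
    run_output(records)
    return "committed"
```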
