Muga Nishizawa
Recent Updates
Embulk Meetup #3 Tokyo
I am..
> Muga Nishizawa
> @muga_nishizawa on Twitter
> @muga on GitHub
> Data Integration with Embulk
Today’s talk
> What’s Embulk?
> Recent Updates
> Major updates between Embulk meetup #2 and #3
> Our thoughts to address issues in plugins
> Future Plan
> Presented by @dmikurube
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration easy.
> which was very painful…
[Diagram: Embulk bulk-loads records, via input and output plugins, between sources and destinations such as HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, and Redis (storage, RDBMS, NoSQL, cloud services, etc.), handling broken records, transactions (idempotency), and performance.]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming
Recent Updates between #2 and #3
> Released 0.8.0 to 0.8.21
> https://github.com/embulk/embulk/releases
> Merged 157 PRs, closed 29 PRs
> https://github.com/embulk/embulk/pulls
Number of Embulk Plugins
> 49 inputs, 45 outputs, 54 filters, 28 parsers,
  2 decoders, 5 encoders, 8 formatters, 1 executor
> excluding built-in plugins
> http://www.embulk.org/plugins/
Support Page Scattering
since 0.8.0
> Page scattering on LocalExecutorPlugin by default
> Executes output in parallel even if there is only 1 input task
> Selects the output task by mod (page count % scatter count)
> Improves performance when the input is a single huge file
> e.g. a single 5GB CSV file
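The mod-based selection above can be sketched as follows. This is an illustrative sketch in Python, not Embulk's actual Java implementation, and `scatter_pages` is a hypothetical name.

```python
# Illustrative sketch (not Embulk's actual code): distribute pages from a
# single input task across several output tasks by modulo, as the page
# scattering on LocalExecutorPlugin is described above.
def scatter_pages(pages, scatter_count):
    """Assign the n-th page to output task (n % scatter_count)."""
    outputs = [[] for _ in range(scatter_count)]
    for page_count, page in enumerate(pages):
        outputs[page_count % scatter_count].append(page)
    return outputs
```

With 10 pages and a scatter count of 3, the tasks receive pages [0, 3, 6, 9], [1, 4, 7], and [2, 5, 8], so even a single huge input file keeps several output tasks busy.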
Support JSON Type
since 0.8.0
> Provides a JSON type as an Embulk type
> Stores JSON objects in pages
> Uses MessagePack’s Value objects as the intermediate representation
> https://github.com/msgpack/msgpack-java
> No JSON parser plugin provided at that point
Support JSON Parser Plugin
since 0.8.5
> JSON parser plugin as a built-in plugin
> Reads and parses JSON data on a line-by-line basis
  and generates MessagePack’s Value objects
> Requires only a single column in the input schema
> Various types of JSON parser plugins
> e.g. ‘jsonl’, ‘jsonpath’, etc.
> Needed to cover schema-less JSON data
> Avoids unexpected data loss
> Java implementation added to reduce object passing
  between Java and JRuby
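The line-by-line behavior above can be sketched as follows. The real plugin produces MessagePack Value objects; here Python's `json` module stands in for that intermediate representation, and `parse_jsonl` and the "record" column name are assumptions for illustration.

```python
import json

# Illustrative sketch of line-based JSON parsing into single-column records,
# mirroring the built-in parser's described behavior: one column per record
# holding the parsed value. Not Embulk's actual API.
def parse_jsonl(text, column="record"):
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing the whole file
        rows.append({column: json.loads(line)})  # a single column per record
    return rows
```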
Specific Plugins vs. Generic Plugins
> Specific plugins
> + Simple impl. and easy to understand
> - OK for now, but it will not be easy for users to find them
> 49 inputs, 45 outputs, 54 filters, 28 parsers
> Generic plugins
> + Users can reach them easily
> - Users might need to install unwanted features as well
> This is a minor issue
Built-in vs. 3rd-party
> Built-in plugins
> + Can be easily found on embulk.org
> - Provide standard features only
  e.g. JSON parser, remove_column filter, etc.
> 3rd-party plugins
> + Can provide advanced features
  e.g. the column filter plugin
  https://github.com/sonots/embulk-filter-column
Struggling with Nested JSON Retrieval
> Need to hold the whole JSON file in memory in order to parse it
> e.g. AWS CloudTrail Log File Format
> Apache Drill or Presto based filter plugins?
> ‘FLATTEN’ is very powerful and useful
> https://drill.apache.org/docs/flatten/
{
  "Records": [{
    "eventVersion": "1.0",
    "userIdentity": {
      "type": "IAMUser",
      "principalId": "EX_PRINCIPAL_ID",
      "arn": "arn:aws:iam::123456789012:user/Alice",
      "accessKeyId": "EXAMPLE_KEY_ID",
      "accountId": "123456789012",
      "userName": "Alice"
    }, … … ]}
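What a FLATTEN-style filter would do to the CloudTrail-like document above can be sketched as follows: expand each element of the "Records" array into its own row, lifting one level of nested objects into dotted column names. This is a simplification for illustration, not Drill's actual FLATTEN, and `flatten_records` is a hypothetical name.

```python
# Illustrative sketch: turn {"Records": [...]} into one flat row per record,
# flattening one level of nested objects into "parent.child" column names.
def flatten_records(doc, array_key="Records"):
    rows = []
    for record in doc.get(array_key, []):
        row = {}
        for key, value in record.items():
            if isinstance(value, dict):
                for sub_key, sub_value in value.items():
                    row[key + "." + sub_key] = sub_value  # e.g. userIdentity.userName
            else:
                row[key] = value
        rows.append(row)
    return rows
```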
New Embulk Logo
since 0.8.13
[Image: logo before and after]
Improved Rename Filter Plugin
since 0.8.14
> Provides a ‘rules’ option
> Users only have to specify ‘rule’s provided by the plugin
> e.g. rules that convert upper-case alphabets to
  lower-case and then truncate column names
> No need to specify column names
> column names will sometimes change
filters:
  - type: rename
    rules:
      - {rule: upper_to_lower}
      - {rule: truncate, max_length: 20}
… …
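The two rules in the config above, applied in order, can be sketched as follows: lower-case the name, then truncate it. `apply_rules` is a hypothetical helper for illustration, not the rename filter's actual implementation.

```python
# Illustrative sketch of applying the rules chain from the config above.
def apply_rules(column_names, max_length=20):
    renamed = []
    for name in column_names:
        name = name.lower()       # rule: upper_to_lower
        name = name[:max_length]  # rule: truncate, max_length: 20
        renamed.append(name)
    return renamed
```

Because the rules operate on whatever columns arrive, the config keeps working even when column names change upstream.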
Tolerance to Input Source Change
> An Embulk config should keep working even if the input changes
> e.g. the input schema changes, or a column is renamed
> For that tolerance
> plugins’ config options should be well-designed
> e.g. is the ‘columns’ option really necessary?
embulk-test Framework
since 0.8.15
> Framework for developing integration tests for plugins
> Easy to compare actual and expected data
> Not a unit test framework
> It is sometimes difficult to mock input/output
> Better to develop both unit tests and integration tests
Improved CSV Parser Guess Plugin
since 0.8.16
> Improved guess behavior for corner cases
> To be exact, cases that we thought were corner cases
  but were NOT corner cases for users
> Added tests for several changes
> To avoid regressions and improve correctly
> The embulk-test framework was necessary for that
CSV Guess Failure Examples in TD

Failed job count | Reason
           1,711 | “Attribute type is required but not set”
             210 | “Multiple entries with same key”
              35 | “No input files to read sample data”

> Sampled about 20,000 existing jobs and checked
  why their guesses failed
> 1,711 jobs failed with “Attribute type is required..”
Why did “Attribute type is required but not set” happen?

Failed job count | Column count
           1,032 | 1
             136 | 3
              79 | 5
              77 | 2
              39 | 4
              38 | 7
              35 | 10

> Checked the number of columns in the CSV files of those 1,711 jobs
> 1,032 jobs failed with a single column
Single-Column CSV Files

id
be535773-fd27-4133-b626-8cba82f03b4f
2b9d5b80-de29-4eed-bcf8-5e41bbbbcfaf
e4ae8fcb-0462-49c8-adec-97799e64170b
457fa021-9d67-4e53-956b-2842b7b2982f
55c73c6e-c3da-475c-b323-802b62889093
… …

count
10
2
69
845
91
… …

> These files don’t have a delimiter, but they are still CSV
Why did “Attribute type is required but not set” happen?

Failed job count | Column count
           1,032 | 1
             136 | 3
              79 | 5
              77 | 2
              39 | 4
              38 | 7
              35 | 10

> Checked failure reasons excluding the single-column CSV jobs
Other failures..

id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby

> Enabled guess with only a few sample lines
> Added ‘;’ to the delimiter candidates

id;account;time;purchase;comment
1;32864;2015-01-27 19:23:49;20150127;embulk
2;14824;2015-01-27 19:01:23;20150127;embulk jruby
3;27559;2015-01-28 02:20:02;20150128;"Embulk ""csv"" parser plugin"
4;11270;2015-01-29 11:54:36;20150129;NULL
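The delimiter-candidate idea above can be sketched as follows: pick the first candidate that splits every non-empty sample line into the same number of fields (more than one). This is a deliberate simplification of the real guess logic, shown with ';' among the candidates as the slide describes; `guess_delimiter` is a hypothetical name.

```python
# Illustrative sketch of delimiter guessing from a few sample lines.
def guess_delimiter(sample_lines, candidates=(",", "\t", ";", "|")):
    for delimiter in candidates:
        # A good delimiter appears the same number of times (> 0) on every line.
        counts = {line.count(delimiter) for line in sample_lines if line}
        if len(counts) == 1 and counts.pop() > 0:
            return delimiter
    return None  # no consistent delimiter found (e.g. a single-column file)
```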
embulk-filter-calcite
> Transforms column values through SQL
> Queries Page objects via a Page storage adapter
> Based on Apache Calcite
> “The foundation for your next high-performance database”
> https://calcite.apache.org/
filters:
  - type: calcite
    query: SELECT * FROM $PAGES WHERE message LIKE '%EMBULK%'
… …
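The observable effect of the query in the config above can be sketched as follows: keep only rows whose 'message' column matches LIKE '%EMBULK%' (a plain substring match here). This is not Calcite itself, and `filter_rows` is a hypothetical name.

```python
# Illustrative sketch of what the SELECT ... WHERE message LIKE '%EMBULK%'
# filter does to a set of rows: pass through only the matching ones.
def filter_rows(rows, column="message", pattern="EMBULK"):
    return [row for row in rows if pattern in str(row.get(column, ""))]
```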
Features
> We can use
> operators and functions from Apache Calcite
> https://calcite.apache.org/docs/reference.html
> e.g. CEIL, FLOOR, SUBSTRING, etc.
> We might use in the future
> aggregation functions
  e.g. COUNT, AVG, SUM, etc.
> JOIN expressions with an external source
Future Plan..
‘csv_all_strings’ Filter Plugin
since 0.8.14
> Users develop scripts for
> generating Embulk configs
> changing schemas on a regular basis
> logic to select some files but not others
> managing cron settings
> e.g. some users want to upload yesterday’s data as a daily batch
> Embulk is just a “bulk loader”
SkipTransactionException
since 0.8.15
> Plugin API that lets Embulk skip the transaction
  if the exception is thrown by the (input) plugin
> Useful for gracefully stopping an Embulk transaction
  based on the input plugin’s condition
> e.g. files not found on the input data source
> Avoids exporting data to the output destination
  unexpectedly (depends on the output plugin’s design)
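The control flow this API enables can be sketched as follows: if the input side raises the exception (e.g. no files found), the transaction is skipped and the output side never runs. Class and function names here are hypothetical Python stand-ins for Embulk's Java API.

```python
# Illustrative sketch of transaction skipping via an exception from the
# input plugin, as described above. Not Embulk's actual implementation.
class SkipTransactionException(Exception):
    pass

def run_transaction(run_input, run_output):
    try:
        records = run_input()  # may raise SkipTransactionException
    except SkipTransactionException:
        return "skipped"       # the output side is never executed
    run_output(records)
    return "committed"
```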
