Treasure Data provides Embulk (an open source bulk loading tool) as a hosted bulk load service.
These slides cover our use cases, our relationship with the community, and our architecture.
2. About me…
Satoshi Akama
Embulk plugins
・embulk-output-bigquery
・embulk-input-gcs
・embulk-input-azure_blob_storage
・embulk-output-azure_blob_storage
Treasure Data Inc.
Software Engineer (Java/Scala/Ruby)
github.com/sakama/
@oreradio
3. We provide Hosted Embulk
Data Connector
(Import)
Result Output
(Export)
+
“Data Loading” should not be the customer’s work
unless they’re developing ETL tools.
Streaming Import
MySQL
PostgreSQL
Redshift
AWS S3
Google Cloud Storage
Salesforce
Marketo
…etc
MySQL
PostgreSQL
Redshift
BigQuery
…etc
4. Treasure Data as a Datahub
Schemaless
(Treasure Data)
Some Data Store
(Schema-full)
You can create Data Pipeline easily
Various formatted data
・log
・Sensor data (IoT)
・Visualize
・Digital Marketing
5. Data Connector (Import) - CLI
guess/preview/import
$ td connector:guess seed.yml -o load.yml
$ td connector:preview load.yml
$ td connector:issue load.yml --database td_sample_db --table td_sample_table
Scheduled execution
$ td connector:create daily_import "10 5 * * *" td_sample_db td_sample_table load.yml --time-column created_at
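The schedule argument in the command above is a five-field cron expression. As a minimal sketch of how those fields are laid out (the helper name `parse_cron` is hypothetical, and it only splits the fields without validating ranges, lists, or step values):

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expr):
    """Split a five-field cron expression into named fields.

    This only illustrates the field order; it does not validate values.
    """
    fields = expr.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected 5 fields, got %d" % len(fields))
    return dict(zip(FIELD_NAMES, fields))


schedule = parse_cron("10 5 * * *")
print(schedule["hour"], schedule["minute"])  # hour 5, minute 10
```

Read this way, `10 5 * * *` means minute 10 of hour 5 on every day, so the job runs once a day. The slides do not say which time zone the schedule is evaluated in.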
A GUI will come in the near future
7. Unchanged OSS Embulk/Embulk plugins
Send pull requests to OSS Embulk
We are using…
We will use at our service after
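“I feel that the so-called open source model, in which the basic features are released for free and left to the community while software with added features is sold commercially, is in practice not working all that well.”
- “The mechanism by which Fluentd sets a business in motion feels really good.” | Think IT
https://thinkit.co.jp/story/2015/07/17/6232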
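“Open source software comes in many development styles, but in the case of fluentd, Treasure Data, where I work, backs it fully. For now I think this style, ‘backed by a company, but developed in the open,’ fits best.”
- “I want to master databases, not OSes or languages: a GREE engineer asks about fluentd’s new features and the ambitions of Treasure Data’s Furuhashi (2/3)” - @IT
http://www.atmarkit.co.jp/ait/articles/1310/07/news010_2.html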
8. Process to use Embulk plugins at TD
Fix for MapReduce Executor
Write Unit test
Write Integration test
Add Features
Fix for Local Executor
Send Pull-Request to
OSS Embulk or Embulk Plugins
Sorry, this part is closed source code
Release as “Data Connector” or “Result Output”
9. Process to use Embulk plugins at TD (1)
・Add some features
e.g. add various authentication methods
・Add some fixes
e.g. add retry logic, fix error handling
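As an illustration of the kind of retry logic mentioned above, here is a minimal sketch of retrying with exponential backoff. The names `TransientError` and `with_retry` are hypothetical and are not taken from Embulk or any plugin; a real plugin would decide for itself which errors are worth retrying.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. timeouts)."""


def with_retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Any other exception type propagates immediately, which keeps permanent failures such as bad credentials from being retried.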
10. Process to use Embulk plugins at TD (2)
Handling of file paths
The MR executor cannot read local file paths (such as a private key file)
Fix authorization logic if needed
The transaction() and open() methods run on different instances
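The two points above can be sketched together. The class below is a hypothetical simulation, not the real Embulk plugin interface: it only shows that anything `open()` needs has to travel in the serialized task, and that a local file's content rather than its path should be embedded there.

```python
import json


class SketchPlugin:
    """Hypothetical plugin shape; not the real Embulk plugin interface."""

    def transaction(self, config, read_file):
        # Runs once on the submitting side, where local files are readable.
        # Embed the key *content* in the task instead of its path.
        task = {
            "bucket": config["bucket"],
            "private_key": read_file(config["private_key_path"]),
        }
        return json.dumps(task)  # only serializable data crosses over

    def open(self, serialized_task):
        # Runs on another instance, possibly on another machine: nothing set
        # on `self` during transaction() is available here, only the task.
        task = json.loads(serialized_task)
        return "connect to %s with key of length %d" % (
            task["bucket"], len(task["private_key"]))


# A dict stands in for the local file system in this sketch.
files = {"/home/me/key.pem": "SECRET"}
config = {"bucket": "b", "private_key_path": "/home/me/key.pem"}
task = SketchPlugin().transaction(config, files.__getitem__)
print(SketchPlugin().open(task))  # a second, separate instance
```

Because `open()` is called on a fresh instance, authorization state such as tokens cached on `self` also has to be rebuilt there or passed through the task.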
11. Process to use Embulk plugins at TD (3)
Need 80% coverage
By internal rules,
we can’t deploy without unit tests covering 80% of the code.
Write Unit test
Writing unit tests for an Embulk plugin is difficult,
e.g. when it connects to a cloud service…
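One common way around the cloud dependency is to replace the client with a test double. The sketch below uses Python's `unittest.mock`; `upload_rows` and the `put_object(bucket, key, body)` method are hypothetical stand-ins for plugin logic and a cloud SDK client, not real APIs.

```python
from unittest import mock


def upload_rows(client, bucket, rows):
    """Hypothetical plugin logic: join rows as CSV lines and upload once."""
    body = "\n".join(",".join(str(v) for v in row) for row in rows)
    client.put_object(bucket, "out.csv", body)
    return len(rows)


def test_upload_rows_sends_one_csv_object():
    client = mock.Mock()  # stands in for the real cloud client
    count = upload_rows(client, "my-bucket", [(1, "a"), (2, "b")])
    assert count == 2
    client.put_object.assert_called_once_with(
        "my-bucket", "out.csv", "1,a\n2,b")


test_upload_rows_sends_one_csv_object()
```

This covers the plugin's own formatting and call logic without network access; behaviour of the real service still has to be checked by integration tests.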
12. Process to use Embulk plugins at TD (4)
Write Integration Test for Treasure Data Service
(1) Import data into TD
(2) Send a query to Presto or Hive
(3) Check the result against a local file.
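The three steps above could be sketched as follows. `check_import` and its injected callables are hypothetical: `import_data` stands for the import job, `run_query` for a Presto or Hive query returning rows, and `expected_csv_text` for the content of the local file.

```python
import csv
import io


def check_import(import_data, run_query, expected_csv_text):
    """Run steps (1)-(3) with injected callables; True if the rows match."""
    import_data()  # (1) load the data into the service
    # (2) query it back, normalising every value to a string
    actual = [tuple(str(v) for v in row) for row in run_query()]
    expected = [tuple(row)
                for row in csv.reader(io.StringIO(expected_csv_text))]
    # (3) compare ignoring row order, since a query may return rows reordered
    return sorted(actual) == sorted(expected)


# Usage with an in-memory list standing in for the remote table.
table = []
ok = check_import(
    lambda: table.extend([(2, "b"), (1, "a")]),
    lambda: list(table),
    "1,a\n2,b\n",
)
print(ok)  # True
```

Passing the import and query steps in as callables keeps the comparison logic testable on its own, without a live service.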
13. Process to use Embulk plugins at TD (5)
Release as “Data Connector” or “Result Output”
14. We hope for a win-win relationship
Embulk Community
Use at TD
Core development
Plugin development
Use in your own environment
Contribute
15. Embulk Execution Platform at Treasure Data
Load Balancer
TD API (API Servers)
Web Console
td commands
td connector:issue
td guess config.yml…
Response
Response
Request
Request
Bulkload API
(API Servers)
PerfectQueue
TD worker
(worker process)
enqueue
dequeue
Submit Job
(Retry if needed)
Execute with MR / Local Executor
guess/preview
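The enqueue, dequeue, and retry flow in this diagram can be sketched with an in-process queue. This is an assumption-laden illustration: Python's `queue.Queue` stands in for the real queue, `run_worker` and the job dictionary layout are hypothetical, and the limit of three attempts is made up.

```python
import queue


def run_worker(jobs, submit, max_attempts=3):
    """Dequeue jobs until the queue is empty; re-enqueue failed ones."""
    finished, failed = [], []
    while True:
        try:
            job = jobs.get_nowait()  # dequeue
        except queue.Empty:
            return finished, failed
        try:
            submit(job["config"])  # e.g. run with the MR or Local executor
            finished.append(job["id"])
        except RuntimeError:
            job["attempts"] += 1
            if job["attempts"] < max_attempts:
                jobs.put(job)  # retry if needed
            else:
                failed.append(job["id"])


jobs = queue.Queue()
jobs.put({"id": 1, "config": "load.yml", "attempts": 0})  # enqueue (API side)
print(run_worker(jobs, lambda config: None))  # ([1], [])
```

Keeping the attempt counter on the job itself means the retry state survives the round trip through the queue.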
16. TD API / Bulkload API
TD API (API Servers)
Bulkload API (API Servers)
guess/preview is processed on different API servers.
Response
Request
guess/preview
data import
PerfectQueue
Load Balancer
Queuing
HTTP Request/Response
guess/preview needs a quick response
enqueue
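The split described on this slide, where guess/preview is answered inside the HTTP request and data import goes through the queue, can be sketched as a small dispatcher. `handle_request`, the response dictionaries, and the use of `queue.Queue` are all hypothetical stand-ins for the real API servers and queue.

```python
import queue

# Request kinds that need a quick answer within the HTTP request itself.
SYNCHRONOUS = {"guess", "preview"}


def handle_request(kind, config, job_queue, run_now):
    """Answer guess/preview immediately; enqueue everything else."""
    if kind in SYNCHRONOUS:
        return {"status": "done", "result": run_now(kind, config)}
    job_queue.put((kind, config))  # a worker picks this up later
    return {"status": "queued"}


job_queue = queue.Queue()
run_now = lambda kind, config: "%s of %s" % (kind, config)
print(handle_request("guess", "seed.yml", job_queue, run_now))
print(handle_request("import", "load.yml", job_queue, run_now))
```

Routing by request kind keeps long-running imports from holding HTTP connections open while short interactive calls still return directly.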
17. Problems
Stability of Integration Tests
Execution time of Integration Tests
・Many plugins × many test cases × frequent execution
sometimes causes failures.
・Many plugins × many test cases causes long execution time :)