
Embulk at Treasure Data

Treasure Data provides Embulk (an open source bulk load tool) as a hosted bulk loading service.
These slides cover our use cases, our relationship with the community, and our architecture.

Published in: Data & Analytics


  1. Embulk at Treasure Data. Satoshi Akama, Dec. 15, 2015, Embulk meetup #2
  2. About me: Satoshi Akama, Software Engineer (Java/Scala/Ruby) at Treasure Data Inc. Embulk plugins: embulk-output-bigquery, embulk-input-gcs, embulk-input-azure_blob_storage, embulk-output-azure_blob_storage. github.com/sakama/, @oreradio
  3. We are providing hosted Embulk: Data Connector (import) and Result Output (export), next to Streaming Import. “Data loading” should not be the customer’s work unless they are developing ETL tools. Data Connector (import): MySQL, PostgreSQL, Redshift, AWS S3, Google Cloud Storage, Salesforce, Marketo, …etc. Result Output (export): MySQL, PostgreSQL, Redshift, BigQuery, …etc.
  4. Treasure Data as a data hub: variously formatted data (logs, sensor data (IoT)) goes into Treasure Data (schemaless) and from there to some data store (with a schema), for uses such as visualization and digital marketing. You can create a data pipeline easily.
  5. Data Connector (import) - CLI. guess/preview/import:
       $ td connector:guess seed.yml -o load.yml
       $ td connector:preview load.yml
       $ td connector:issue load.yml --database td_sample_db --table td_sample_table
     Scheduled execution:
       $ td connector:create daily_import "10 5 * * *" td_sample_db td_sample_table load.yml --time-column created_at
     A GUI will come in the near future.
  6. Result Output (output) - GUI/CLI
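  7. We are using unchanged OSS Embulk and Embulk plugins. We will use changes at our service after sending a pull request to OSS Embulk.
     "I feel that, among so-called open source software, the model of releasing the basic features for free and leaving them to the community, while selling software with added features for a fee, is not actually working that well." - "The way business gets going with Fluentd as the trigger feels really good." | Think IT https://thinkit.co.jp/story/2015/07/17/6232
     "There are many development styles even among open source software, but in the case of fluentd, Treasure Data, where I work, backs it fully. For now, I think this development style, 'a company is behind it, but development is done in the open', fits best." - "I want to master databases, not OSes or languages: fluentd's new features and the ambitions of Treasure Data's Furuhashi, as heard by a GREE engineer (2/3)" - @IT http://www.atmarkit.co.jp/ait/articles/1310/07/news010_2.html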
  8. Process to use Embulk plugins at TD. Steps: add features; fix for Local Executor; fix for MapReduce Executor; write unit tests; write integration tests. Send pull requests to OSS Embulk or Embulk plugins. (Sorry, this is closed source code.) Release as “Data Connector” or “Result Output”.
  9. Process to use Embulk plugins at TD (1), add features: add some features (e.g. various authentication methods) and some fixes (e.g. add retry logic, fix error handling).
  10. Process to use Embulk plugins at TD (2), fix for MapReduce Executor. Handling of file paths: the MR executor cannot read a local file path (such as a private key). Fix authorization logic if needed: the transaction() and open() methods run on different instances.
  11. Process to use Embulk plugins at TD (3), write unit tests. We need 80% coverage: by internal rules, we can’t deploy without unit tests covering 80% of the code. Writing unit tests for an Embulk plugin is difficult, e.g. when it connects to a cloud service…
  12. Process to use Embulk plugins at TD (4), write integration tests for the Treasure Data service, e.g. (1) import data into TD, (2) send a query to Presto or Hive, (3) check the result against a local file.
  13. Process to use Embulk plugins at TD (5): release as “Data Connector” or “Result Output”.
  14. We hope for a win-win relationship between the Embulk community and TD: core development, plugin development, use at TD, use in your own environment, contribute.
  15. Embulk execution platform at Treasure Data. The Web Console and td commands (td connector:issue, td guess config.yml…) send requests through a load balancer to the TD API (API servers) and receive responses. guess/preview go to the Bulkload API (API servers). Import jobs are enqueued to Perfect Queue; a TD worker (worker process) dequeues them and submits the job (retrying if needed), which is executed with the MR or Local Executor.
  16. TD API / Bulkload API. guess/preview are processed on different API servers. Behind the load balancer, the TD API (API servers) enqueues data imports to Perfect Queue (queuing), while guess/preview go to the Bulkload API (API servers) as plain HTTP request/response, because guess/preview need a quick response.
  17. Problems. Stability of integration tests: many plugins × many test cases × frequent execution sometimes causes failures. Execution time of integration tests: many plugins × many test cases causes long execution times :)
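
Slide 9 mentions adding retry logic and fixing error handling in plugins. The slides do not show how this is done, so the following is only a minimal Python sketch of the general idea, assuming exponential backoff. The names `with_retry` and `TransientError` are hypothetical and are not part of Embulk or any Treasure Data API (Embulk plugins themselves are written in Java or Ruby).

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a timeout)."""


def with_retry(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); retry on TransientError with exponential backoff.

    Any other exception propagates immediately, so permanent failures
    (for example bad credentials) are not retried.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter is a design choice for testability: a unit test can record the requested delays instead of actually waiting.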
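
Slide 11 notes that unit-testing a plugin is difficult when it connects to a cloud service. One common way around this, not necessarily what TD does, is to pass the client in as a parameter and replace it with a test double. The sketch below assumes a hypothetical client interface with a `list_objects(bucket, prefix)` method; `list_object_keys` is likewise an illustrative name, not a real plugin function.

```python
from unittest import mock


def list_object_keys(client, bucket, prefix):
    """Return the sorted object keys under prefix.

    `client` is any object with a list_objects(bucket, prefix) method
    (a hypothetical interface, assumed for this sketch).
    """
    return sorted(obj["key"] for obj in client.list_objects(bucket, prefix))


def test_list_object_keys():
    # The mock stands in for the real cloud client, so no network is needed.
    client = mock.Mock()
    client.list_objects.return_value = [
        {"key": "logs/b.csv"},
        {"key": "logs/a.csv"},
    ]

    keys = list_object_keys(client, "my-bucket", "logs/")

    assert keys == ["logs/a.csv", "logs/b.csv"]
    client.list_objects.assert_called_once_with("my-bucket", "logs/")
```

Code structured this way can count toward a coverage target without any cloud credentials in the test environment.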
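
Step (3) of the integration test on slide 12, checking the query result against a local file, can be sketched as below. This assumes the query result has already been fetched as a list of rows; how it is fetched from Presto or Hive is out of scope here, and `rows_from_csv` and `same_rows` are hypothetical helper names.

```python
import csv
import io
from collections import Counter


def rows_from_csv(text):
    """Parse CSV text into a list of row tuples (all values as strings)."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


def same_rows(expected, actual):
    """Compare two row collections as multisets, ignoring row order.

    Query engines do not guarantee row order without ORDER BY, so a
    plain list comparison could fail spuriously. Values are compared
    as strings because the expected side comes from a CSV file.
    """
    def normalize(rows):
        return Counter(tuple(str(value) for value in row) for row in rows)

    return normalize(expected) == normalize(actual)
```

For example, `same_rows(rows_from_csv(expected_text), query_rows)` would serve as the final assertion of a test, where `expected_text` is the content of the local file.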
