
Using Embulk at Treasure Data

Published in: Engineering

  1. Muga Nishizawa (西澤 無我): Using Embulk at Treasure Data
  2. Today’s talk > What’s Embulk? > Why do our customers use Embulk? > Embulk > Data Connector > Data Connector > The architecture > The use case > with MapReduce Executor > How we configure the MapReduce Executor?
  3. What’s Embulk? > An open-source parallel bulk data loader > loads records from “A” to “B” (storage, RDBMS, NoSQL, cloud services, etc.) > using plugins > for various kinds of “A” and “B” > to make data integration easy > which used to be very painful… (broken records, transactions (idempotency), performance, …)
  4. (diagram) Embulk bulk-loads between data stores via plugins: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis. ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Resuming
  5. Why do our customers use Embulk? > They upload various types of their data to TD with Embulk > Various file formats > CSV, TSV, JSON, XML, .. > Various data sources > Local disk, RDBMS, SFTP, .. > Various network environments > embulk-output-td > https://github.com/treasure-data/embulk-output-td
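An upload to TD via embulk-output-td might be configured roughly as follows (option names follow the plugin’s README; the apikey, database, and table values are placeholders):

```yaml
# Hypothetical embulk-output-td config (values are placeholders).
out:
  type: td
  apikey: YOUR_TD_API_KEY
  endpoint: api.treasuredata.com
  database: my_db
  table: my_table
  time_column: created_at   # column used as the TD time column
```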
  6. Out of scope for Embulk > Users develop their own scripts for > generating Embulk configs > changing schemas on a regular basis > logic to select some files but not others > managing cron settings > e.g. some users want to upload yesterday’s data as a daily batch > Embulk is just a “bulk loader”
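For example, the “upload yesterday’s data as a daily batch” pattern is typically handled outside Embulk with a templated config; a sketch using Embulk’s Liquid config support (the path layout and environment variable are assumptions) could be:

```yaml
# Hypothetical config.yml.liquid: a cron job exports TARGET_DATE=<yesterday>
# and then runs `embulk run config.yml.liquid`.
in:
  type: file
  path_prefix: /data/logs/{{ env.TARGET_DATE }}/access_
  parser:
    type: csv
out:
  type: td
  apikey: YOUR_TD_API_KEY
  endpoint: api.treasuredata.com
  database: my_db
  table: access_logs
```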
  7. Best practice to manage Embulk!! http://www.slideshare.net/GONNakaTaka/embulk5
  8. Yes, yes,..
  9. Data Connector (architecture diagram): Users/Customers submit connector jobs and see loaded data on the Console; components: Guess/Preview API, Connector Worker, PlazmaDB
  10. Data Connector (architecture diagram): Users/Customers submit connector jobs and see loaded data on the Console; components: Guess/Preview API, Connector Worker, PlazmaDB
  11. 2 types of hosted Embulk service > Import (Data Connector): MySQL, PostgreSQL, Redshift, AWS S3, Google Cloud Storage, SalesForce, Marketo, …etc > Export (Result Output): MySQL, PostgreSQL, Redshift, BigQuery, …etc
  12. Guess/Preview API (architecture diagram: Users/Customers, Guess/Preview API, Connector Worker, PlazmaDB, Console)
  13. Guess/Preview API > Guesses an Embulk config based on sample data > Creates the parser config > Adds schema, escape char, quote char, etc. > Creates a rename filter config > TD requires uncapitalized column names > Previews data before uploading > Ensures a quick response > Embulk performs this functionality running on our web application servers
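The open-source CLI exposes the same idea as `embulk guess` and `embulk preview`: you supply a seed config with little more than the input location, and guess fills in the parser details. A hypothetical seed:

```yaml
# Hypothetical seed.yml: `embulk guess seed.yml -o config.yml`
# fills in the parser type, schema, escape/quote chars, etc.;
# `embulk preview config.yml` then shows the data before loading.
in:
  type: file
  path_prefix: ./uploads/sample_
out:
  type: stdout
```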
  14. Connector Worker (architecture diagram: Users/Customers, Guess/Preview API, Connector Worker, PlazmaDB, Console)
  15. Connector Worker > Generates an Embulk config and executes Embulk > Uses a private output plugin instead of embulk-output-td to upload users’ data to PlazmaDB directly > Appropriate retry mechanism > Embulk runs on our Job Queue clients
  16. Timestamp parsing (architecture diagram: Users/Customers, Guess/Preview API, Connector Worker, PlazmaDB, Console)
  17. Timestamp parsing > Implemented strptime in Java > Ported from the CRuby implementation > Can precompile the format > Faster than JRuby’s strptime > Has been quietly maintained in the Embulk repo.. > It will be merged into JRuby
  18. How we use Data Connector at TD > a. Monitoring access to our S3 buckets > e.g. “Which IAM users accessed our S3 buckets?”, “Access frequency” > {in: {type: s3}} and {parser: {type: csv}} > b. Measuring KPIs for our development process > e.g. “phases of the process that took a long time” > {in: {type: jira}} > c. Measuring business & support performance > {in: {type: Salesforce, Marketo, ZenDesk, …}}
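Use case (a) might correspond to a config like the following (the embulk-input-s3 option names follow that plugin’s README; the bucket, prefix, and credentials are placeholders):

```yaml
# Hypothetical config for loading S3 access logs (use case a).
in:
  type: s3
  bucket: my-audit-logs
  path_prefix: logs/s3-access-
  access_key_id: YOUR_ACCESS_KEY
  secret_access_key: YOUR_SECRET_KEY
  parser:
    type: csv
out:
  type: stdout   # in practice the hosted service loads into PlazmaDB
```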
  19. Scaling Embulk > Requests for massive data loading from users > e.g. “Upload 150GB of data in an hourly batch”, “Start a PoC and upload 500GB of data today” > The Local Executor cannot handle this scale > The MapReduce Executor enables us to scale
  20. With MapReduce (architecture diagram: Users/Customers, Guess/Preview API, Connector Worker, PlazmaDB, Console, plus Hadoop Clusters)
  21. What’s MapReduce Executor? (diagram: tasks from the task queue run as map tasks on Hadoop)
  22. MapReduce Executor with TimestampPartitioning (diagram: tasks from the task queue run as map tasks on Hadoop, then shuffle into reduce tasks)
  23. Built Embulk config (Connector Workers, i.e. single-machine workers, are still able to generate this config):
      exec:
        type: mapreduce
        job_name: embulk.100000
        config_files:
          - /etc/hadoop/conf/core-site.xml
          - /etc/hadoop/conf/hdfs-site.xml
          - /etc/hadoop/conf/mapred-site.xml
        config:
          fs.defaultFS: "hdfs://my-hdfs.example.net:8020"
          yarn.resourcemanager.hostname: "my-yarn.example.net"
          dfs.replication: 1
          mapreduce.client.submit.file.replication: 1
        state_path: /mnt/xxx/embulk/
        partitioning:
          type: timestamp
          unit: hour
          column: time
          unix_timestamp_unit: hour
          map_side_partition_split: 3
        reducers: 3
      in: ...
  24. Different-sized files (diagram: map tasks, shuffle, reduce tasks)
  25. Same time-range data (diagram: map tasks, shuffle, reduce tasks)
  26. Grouping input files with {in: {min_task_size}} (diagram: input files grouped into tasks before the map tasks, then shuffle, reduce tasks). Grouping can also reduce the mappers’ launch cost.
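Based on the slide, file grouping would be configured on the input side; a sketch (the option name comes from the slide, and the byte value is an assumption):

```yaml
# Hypothetical: group small input files so each task reads at least ~128 MB,
# reducing the number of map tasks and their launch cost.
in:
  type: s3
  bucket: my-bucket
  path_prefix: data/
  min_task_size: 134217728   # 128 MB
```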
  27. One partition into multiple reducers with {exec: {partitioning: {map_side_split}}} (diagram: map tasks, shuffle, reduce tasks)
  28. Conclusion > What’s Embulk? > Why we use Embulk > Embulk > Data Connector > Data Connector > The architecture of Data Connector > The use case > with MapReduce Executor
