Improve data engineering work with Digdag and Presto UDF

Introduces techniques for building efficient workflows with the new UDFs added to Presto and the new features of Digdag.

These slides were presented on 2018-10-17 at PLAZMA TD Tech Talk 2018 at Shibuya: Part 2.
https://techplay.jp/event/694468

Published in: Engineering

  1. © 2018 Arm Limited • Kentaro Yoshida • Improve data engineering work with Digdag and Presto UDF • 2018/10/17 at Plazma TD TechTalk 2018 Fall
  2. About me • @yoshi_ken • Leading the DATA Team • Supporting data-driven work at TD • Published books on DWH platforms
  3. What is the DATA Team? • Manages the internal data ETL and analysis platform on TreasureData • For historical reasons, uses Luigi, Airflow (with Embulk), and Digdag • Manages data visualization and reporting workflows for the business • Not only for engineers but also for sales, marketing, and operations • Distills simple insights from a complex ocean of data • A kind of data science (analysis) solution • A rare team that uses TreasureData internally on a daily basis • We can give feedback from a user's perspective for new improvements
  4. Technical challenges of the DATA Team • Build a scalable and robust data pipeline • e.g., one query generates numerous metrics logs from each component • Improve fact data to support data-driven business and engineering • e.g., make data easier to use by enriching and pre-processing it beforehand • Seek performance-tuning insights for Presto/Hive on the platform side • e.g., the root cause of table fragmentation • Move from daily jobs to semi-realtime data processing • e.g., fresh, quickly available stats give engineers and support good insight
  5. Introducing nice improvements for Presto UDF and Digdag
  6. Nice improvements introduced in Digdag and Presto • New features of Digdag: 1. Added ${td.last_job.num_records}, which holds the number of records in the job result 2. Added “_else_do” after the if> operator since digdag v0.9.31 3. Added param_set> and param_get> for sharing parameters between workflows (not available in TD workflow) • New features of Presto: 1. Added the TD_TIME_STRING() UDF, which makes it easier to format date strings in the SELECT clause 2. Added the TD_INTERVAL() UDF, which makes it easier to specify a time range in the WHERE clause
  7. New Feature of Digdag
  8. Situation: zero-result error in a workflow • For some reason, the final result unexpectedly contains zero rows. • We then need to investigate the number of result rows at each step, one by one. • I wish Digdag could check the number of result rows at each step… • I wish Digdag could output results along with the job_id… Oops!
  9. Situation: zero-result error in a workflow • The newly introduced ${td.last_job.num_records} holds the number of records in the job result
      $ cat num_records.dig
      +query:
        td>:
          data: SELECT DISTINCT symbol FROM nasdaq
        database: sample_datasets

      +fail_if_zero:
        if>: ${td.last_job.num_records < 1}
        _do:
          fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
  10. Situation: zero-result error in a workflow • “_else_do” can follow the if> operator since digdag v0.9.31
      $ cat num_records.dig
      +query:
        td>:
          data: SELECT DISTINCT symbol FROM nasdaq
        database: sample_datasets

      +fail_if_zero:
        if>: ${td.last_job.num_records < 1}
        _do:
          fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
        _else_do:
          sh>: td export:result ${td.last_job_id} ${result_path} # enqueue job
          _export:
            result_path: td://@/workflow_logs/jobid_${td.last_job_id}
  11. New Feature of Presto: TD_TIME_STRING() UDF
  12. Efficient way to format date strings in SELECT • Writing date format conversions by hand used to be a burden. • This type of query is generally used together with GROUP BY. • So I used to register these snippets under “td” in my IME's custom dictionary.
  13. Efficient way to format date strings in SELECT • TD_TIME_STRING() is an awesome UDF • An easier way to get truncated timestamp strings
      format  format string            example
      y       yyyy-MM-dd HH:mm:ssZ     2018-01-01 00:00:00+0700
      q       yyyy-MM-dd HH:mm:ssZ     2018-04-01 00:00:00+0700
      M       yyyy-MM-dd HH:mm:ssZ     2018-09-01 00:00:00+0700
      w       yyyy-MM-dd HH:mm:ssZ     2018-09-09 00:00:00+0700
      d       yyyy-MM-dd HH:mm:ssZ     2018-09-13 00:00:00+0700
      h       yyyy-MM-dd HH:mm:ssZ     2018-09-13 16:00:00+0700
      m       yyyy-MM-dd HH:mm:ssZ     2018-09-13 16:45:00+0700
      s       yyyy-MM-dd HH:mm:ssZ     2018-09-13 16:45:34+0700
      y!      yyyy                     2018
      q!      yyyy-MM                  2018-04
      M!      yyyy-MM                  2018-09
      w!      yyyy-MM-dd               2018-09-09
      d!      yyyy-MM-dd               2018-09-13
      h!      yyyy-MM-dd HH            2018-09-13 16
      m!      yyyy-MM-dd HH:mm         2018-09-13 16:45
      s!      yyyy-MM-dd HH:mm:ss      2018-09-13 16:45:34

      -- Before
      TD_TIME_FORMAT(TD_DATE_TRUNC('day', time), 'yyyy-MM-dd')
      -- After
      TD_TIME_STRING(time, 'd!') day,
  14. New Feature of Presto: TD_INTERVAL() UDF
  15. Efficient way to specify a date range in WHERE • There are many complicated techniques for selecting a specific range
      -- cover 6 months of data until today. 156 = 31*5 + 1
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
      )
      -- cover from the beginning of the day until now
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()),
        TD_SCHEDULED_TIME()
      )
  16. Efficient way to specify a date range in WHERE • The TD_INTERVAL() UDF makes it easier
      -- BEFORE
      -- cover 6 months of data until today. 156 = 31*5 + 1
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
      )
      -- AFTER
      -- it can be specified with a short UDF
      TD_INTERVAL(time, '-6M/0d')
  17. Efficient way to specify a date range in WHERE • The TD_INTERVAL() UDF makes it easier
      -- BEFORE
      -- cover from the beginning of the day until now
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()),
        TD_SCHEDULED_TIME()
      )
      -- AFTER
      -- it can be specified with a short UDF
      TD_INTERVAL(time, '-1d')
  18. Efficient way to specify a date range in WHERE
  19. Efficient way to specify a date range in WHERE
      -- Here is an example where the query start time is 2018-08-14 01:23:45 (Tue, UTC)
      -- The last hour [2018-08-14 00:00:00, 2018-08-14 01:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-1h')
      -- From the last hour to now [2018-08-14 00:00:00, 2018-08-14 01:23:45)
      SELECT ... WHERE TD_INTERVAL(time, '-1h/now')
      -- The last hour before the beginning of today [2018-08-13 23:00:00, 2018-08-14 00:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-1h/0d')
      • After the slash, you can specify the boundary from which the range is counted back.
  20. Efficient way to specify a date range in WHERE
      -- Here is an example where the query start time is 2018-08-14 01:23:45 (Tue, UTC)
      -- The last 7 days up to 2015-12-25 [2015-12-18 00:00:00, 2015-12-25 00:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-7d/2015-12-25')
      -- The last 10 days before the beginning of the last month [2018-06-21 00:00:00, 2018-07-01 00:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-10d/-1M')
      • After the slash, you can specify the boundary from which the range is counted back.
      • A useful trick: ${session_date} also works there when using Digdag.
  21. Tips for handling time ranges
      -- recommended: test with a time_series table like this one
      CREATE TABLE time_series AS
      SELECT time, TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ssZ', 'UTC') AS date
      FROM (
        SELECT times FROM (
          VALUES SEQUENCE(TD_TIME_PARSE('2018-01-01', 'UTC'), TD_TIME_PARSE('2018-12-31', 'UTC'), 60*60)
        ) AS x (times)
      ) t1
      CROSS JOIN UNNEST(times) AS t (time)
      ORDER BY time
      https://qiita.com/reflet/items/151a10e9a0914e0ec3ee
  22. Let’s enjoy data engineering work with Digdag! Also, feel free to talk to me.
  23. Thank You Danke Merci 谢谢 ありがとう Gracias Kiitos 감사합니다 धन्यवाद תודה
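
As a rough offline illustration of the truncated formats in the slide 13 table, the following Python sketch reproduces a subset of the `!` formats with the standard library. The function name `time_string`, its `utc_offset_hours` parameter, and the format mapping are assumptions made for illustration only; this is not how Treasure Data implements TD_TIME_STRING(), and the `q!` and `w!` formats are deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from a subset of the slide's "!" formats to strftime
# patterns. This models the slide 13 table, not the real UDF.
_SHORT_FORMATS = {
    "y!": "%Y",
    "M!": "%Y-%m",
    "d!": "%Y-%m-%d",
    "h!": "%Y-%m-%d %H",
    "m!": "%Y-%m-%d %H:%M",
    "s!": "%Y-%m-%d %H:%M:%S",
}


def time_string(epoch_seconds: int, fmt: str, utc_offset_hours: int = 0) -> str:
    """Format a Unix timestamp as a truncated date string (illustrative only)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz).strftime(_SHORT_FORMATS[fmt])


# The example instant used in the slide table: 2018-09-13 16:45:34+0700.
example = int(
    datetime(2018, 9, 13, 16, 45, 34, tzinfo=timezone(timedelta(hours=7))).timestamp()
)
print(time_string(example, "d!", utc_offset_hours=7))
print(time_string(example, "h!", utc_offset_hours=7))
```

Because the output string is already truncated to the chosen unit, it can serve directly as a grouping key, which is the use the slides describe for GROUP BY queries.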

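
The ranges listed on slides 19 and 20 can be checked offline with a small model. The Python sketch below is an assumption-laden simplification: the function `interval`, the helper `_truncate`, and the handful of supported units (`h`, `d`) and offsets (none, `now`, `0d`, `-1M`, or a `yyyy-MM-dd` date) are hypothetical and cover only the cases shown on those slides. It is not the TD_INTERVAL() implementation.

```python
from datetime import datetime, timedelta

# Hypothetical model of the half-open ranges on slides 19 and 20.
_UNIT = {"h": timedelta(hours=1), "d": timedelta(days=1)}


def _truncate(t: datetime, unit: str) -> datetime:
    """Round a datetime down to the start of its hour or day."""
    if unit == "h":
        return t.replace(minute=0, second=0, microsecond=0)
    return t.replace(hour=0, minute=0, second=0, microsecond=0)


def interval(spec: str, now: datetime) -> tuple:
    """Return (start, end) of the range [start, end) for specs like '-1h/0d'."""
    duration, _, offset = spec.partition("/")
    count, unit = int(duration[1:-1]), duration[-1]  # '-10d' -> 10, 'd'
    if offset == "now":
        # The range ends at the current instant instead of a unit boundary.
        return _truncate(now, unit) - count * _UNIT[unit], now
    if offset == "":
        anchor = now
    elif offset == "0d":
        anchor = _truncate(now, "d")  # beginning of today
    elif offset == "-1M":
        first_of_month = _truncate(now, "d").replace(day=1)
        anchor = (first_of_month - timedelta(days=1)).replace(day=1)  # start of last month
    else:
        anchor = datetime.strptime(offset, "%Y-%m-%d")  # explicit date
    end = _truncate(anchor, unit)
    return end - count * _UNIT[unit], end


# Query start time used on the slides: 2018-08-14 01:23:45 (UTC).
now = datetime(2018, 8, 14, 1, 23, 45)
for spec in ("-1h", "-1h/now", "-1h/0d", "-7d/2015-12-25", "-10d/-1M"):
    start, end = interval(spec, now)
    print(f"{spec:>16}: [{start}, {end})")
```

Pinning `now` to a fixed value, as here, mirrors the slides' advice to test time-range logic against known data before running it on a schedule.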