SlideShare a Scribd company logo
© 2018 Arm Limited
• Kentaro Yoshida
Improve data engineering work
with Digdag and Presto UDF
• 2018/10/17
at Plazma TD TechTalk 2018 Fall
© 2018 Arm Limited2
About me
• @yoshi_ken
• Leading DATA Team
• Support data driven work at TD
• Published DWH Platform books
Familiar Products
© 2018 Arm Limited3
What is DATA Team?
• Management for internal data ETL & Analysis Platform on TreasureData
• As historical reason, using Luigi, Airflow(with embulk) and Digdag
• Management data visualizing and reporting workflow for business
• Not only for engineers but also sales, marketing and operation
• Make simple solution insight from complexed data ocean
• Kind of data science(analysis) solution
• A rare team who use TreasureData internally as daily basis
• We can tell feedback as user mind for new improvements
© 2018 Arm Limited4
Technical Challenge of DATA Team
• Make scalable&robust data pipeline
• ex) 1 query generates numerous metrics logs from each components
• Improve fact data for supporting data-driven business/engineering
• ex) make easier to use data beforehand enrich/pre-processing
• Seek performance tuning insights for presto/hive at the platform side
• ex) root cause of making table fragmentation
• Change semi-realtime data processing from daily jobs
• ex) fresh/quick stat data make good insight for engineer/support
© 2018 Arm Limited
Introduce nice improvements
For Presto UDF and digdag
© 2018 Arm Limited6
Introduced nice improvements in Digdag and Presto
• New feature of Digdag
1. Added ${td.last_job.num_records}
• Which has number of records for job results
2. Added “_else_do” after if> operator since digdag v0.9.31
3. Added param_set> and param_get>
• For parameter sharing between workflow (not available in TD workflow)
• New feature of Presto
1. Added TD_TIME_STRING() UDF
• In SELECT clause, Make easier to format date string
2. Added TD_INTERVAL() UDF
• In WHERE clause, Make easier to specify time range extraction
© 2018 Arm Limited
New Feature of Digdag
© 2018 Arm Limited8
Situation of zero result error in workflow
• Due to some reason, in the case of final results got zero result unexpectedly.
• It need to investigate result number of rows for each step-by-step.
• I wish if digdag check the result number of rows at each step…
• I wish if digdag has function of result output with job_id…
Oops!
© 2018 Arm Limited9
Situation of zero result error in workflow
• Introduced ${td.last_job.num_records} has number of records for job
results
$ cat num_records.dig
+query:
td>:
data: SELECT DISTINCT symbol FROM nasdaq
database: sample_datasets
+fail_if_zero:
if>: ${td.last_job.num_records < 1}
_do:
fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
© 2018 Arm Limited10
Situation of zero result error in workflow
• Introduced “_else_do” after if> operator since digdag v0.9.31
$ cat num_records.dig
+query:
td>:
data: SELECT DISTINCT symbol FROM nasdaq
database: sample_datasets
+fail_if_zero:
if>: ${td.last_job.num_records < 1}
_do:
fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
_else_do:
sh>: td export:result ${td.last_job_id} ${result_path} # enqueue job
_export:
result_path: td://@/workflow_logs/jobid_${td.last_job_id}
© 2018 Arm Limited
New Feature of Presto
TD_TIME_STRING() UDF
© 2018 Arm Limited12
Efficient way to format date string in SELECT
• It was required to use burden of writing date format conversion.
• This type of query has used GROUP BY statement in generally.
• So, I have used to be add preset custom dictionary with “td” for my IME.
© 2018 Arm Limited13
Efficient way to format date string in SELECT
• TD_TIME_STRING() is awesome UDF
• Easier way to truncate timestamp
format
string
format example
y yyyy-MM-dd HH:mm:ssZ 2018-01-01 00:00:00+0700
q yyyy-MM-dd HH:mm:ssZ 2018-04-01 00:00:00+0700
M yyyy-MM-dd HH:mm:ssZ 2018-09-01 00:00:00+0700
w yyyy-MM-dd HH:mm:ssZ 2018-09-09 00:00:00+0700
d yyyy-MM-dd HH:mm:ssZ 2018-09-13 00:00:00+0700
h yyyy-MM-dd HH:mm:ssZ 2018-09-13 16:00:00+0700
m yyyy-MM-dd HH:mm:ssZ 2018-09-13 16:45:00+0700
s yyyy-MM-dd HH:mm:ssZ 2018-09-13 16:45:34+0700
y! yyyy 2018
q! yyyy-MM 2018-04
M! yyyy-MM 2018-09
w! yyyy-MM-dd 2018-09-09
d! yyyy-MM-dd 2018-09-13
h! yyyy-MM-dd HH 2018-09-13 16
m! yyyy-MM-dd HH:mm 2018-09-13 16:45
s! yyyy-MM-dd HH:mm:ss 2018-09-13 16:45:34
—- Before
TD_TIME_FORMAT(
TD_DATE_TRUNC('day', time),
'yyyy-MM-dd')
—- After
TD_TIME_STRING(time, 'd!') day,
© 2018 Arm Limited
New Feature of Presto
TD_INTERVAL() UDF
© 2018 Arm Limited15
Efficient way to specify range of date in WHERE
• There are many complicated technique to gather specific range
—- cover 6 months of the data until today. 156=31*5+1
TD_TIME_RANGE(time,
TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
)
-— cover the beginning of day until now
TD_TIME_RANGE(time,
TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME()
)
© 2018 Arm Limited16
Efficient way to specify range of date in WHERE
• TD_INTERVAL() UDF make easier
—- BEFORE
—- cover 6 months of the data until today. 156=31*5+1
TD_TIME_RANGE(time,
TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
)
—- AFTER
—- it can be specify with short UDF
TD_INTERVAL(time, '-6M/0d')
© 2018 Arm Limited17
Efficient way to specify range of date in WHERE
• TD_INTERVAL() UDF make easier
—- BEFORE
-— cover the beginning of day until now
TD_TIME_RANGE(time,
TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME()
)
—- AFTER
—- it can be specify with short UDF
TD_INTERVAL(time, '-1d')
© 2018 Arm Limited18
Efficient way to specify range of date in WHERE
© 2018 Arm Limited19
Efficient way to specify range of date in WHERE
-— Here is a example of query start time is 2018-08-14 01:23:45 (Tue, UTC)
# The last hour [2018-08-14 00:00:00, 2018-08-14 01:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-1h')
# From the last hour to now [2018-08-14 00:00:00, 2018-08-14 01:23:45)
SELECT ... WHERE TD_INTERVAL(time, '-1h/now')
# The last hour since the beginning of today [2018-08-13 23:00:00,
2018-08-14 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-1h/0d')
• After slash, it can specify the borderline of the day.
© 2018 Arm Limited20
Efficient way to specify range of date in WHERE
-— Here is a example of query start time is 2018-08-14 01:23:45 (Tue, UTC)
# The last 7 days since 2015-12-25 [2015-12-18 00:00:00, 2015-12-25
00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-7d/2015-12-25')
# The last 10 days since the beginning of the last month [2018-06-21
00:00:00, 2018-07-01 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-10d/-1M')
• After slash, it can specify the borderline of the day.
• Effective way, It also work ${session_date} if using digdag.
© 2018 Arm Limited21
Tips about handling time range
-- recommend to test with such a time_series table
CREATE TABLE time_series AS
SELECT
time,
TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ssZ', 'UTC') AS date
FROM (
SELECT times
FROM (
VALUES
SEQUENCE(TD_TIME_PARSE('2018-01-01', 'UTC'), TD_TIME_PARSE('2018-12-31', 'UTC'), 60*60)
) AS x (times)
) t1
CROSS JOIN UNNEST(times) AS t (time)
ORDER BY time
https://qiita.com/reflet/items/151a10e9a0914e0ec3ee
© 2018 Arm Limited22
Let’s enjoy data engineering work with digdag!
And also feel free to talk to me
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
‫תודה‬© 2018 Arm Limited23

More Related Content

What's hot

Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKVPresentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Kevin Xu
 
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
Luke Han
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
David Groozman
 
ElastiCache and Redis
ElastiCache and RedisElastiCache and Redis
ElastiCache and Redis
Amazon Web Services
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
DataWorks Summit/Hadoop Summit
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
Amazon Web Services
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
Taro L. Saito
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
InfluxData
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analytics
DataWorks Summit
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
Taro L. Saito
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
Carol McDonald
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1
Sid Anand
 
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
DataWorks Summit/Hadoop Summit
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginFinding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
InfluxData
 
Distributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache PulsarDistributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache Pulsar
Streamlio
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
Cloudera, Inc.
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
Carol McDonald
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
Carol McDonald
 

What's hot (20)

Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKVPresentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
 
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
ElastiCache and Redis
ElastiCache and RedisElastiCache and Redis
ElastiCache and Redis
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analytics
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1
 
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginFinding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
 
Distributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache PulsarDistributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache Pulsar
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 

Similar to Improve data engineering work with Digdag and Presto UDF

Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...
InfluxData
 
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxData
 
201809 DB tech showcase
201809 DB tech showcase201809 DB tech showcase
201809 DB tech showcase
Keisuke Suzuki
 
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxDataOptimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
InfluxData
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
InfluxData
 
Hash join use memory optimization
Hash join use memory optimizationHash join use memory optimization
Hash join use memory optimization
ICTeam S.p.A.
 
Optimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real WorldOptimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real World
DevOps.com
 
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Toru Takahashi
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
Altinity Ltd
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Taro L. Saito
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
ScyllaDB
 
Performance tuning ColumnStore
Performance tuning ColumnStorePerformance tuning ColumnStore
Performance tuning ColumnStore
MariaDB plc
 
Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0
oysteing
 
Using TICK Stack For System and App Metrics
Using TICK Stack For System and App MetricsUsing TICK Stack For System and App Metrics
Using TICK Stack For System and App Metrics
Aayush Tuladhar
 
Lecture 2a
Lecture 2aLecture 2a
Lecture 2a
Max Lyons
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
InfluxData
 
Workload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud MarketplacesWorkload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud Marketplaces
Gravitant, Inc.
 
Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014Dieter Plaetinck
 
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchMultidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchRim Moussa
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
Altinity Ltd
 

Similar to Improve data engineering work with Digdag and Presto UDF (20)

Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...
 
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
 
201809 DB tech showcase
201809 DB tech showcase201809 DB tech showcase
201809 DB tech showcase
 
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxDataOptimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
Hash join use memory optimization
Hash join use memory optimizationHash join use memory optimization
Hash join use memory optimization
 
Optimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real WorldOptimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real World
 
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
 
Performance tuning ColumnStore
Performance tuning ColumnStorePerformance tuning ColumnStore
Performance tuning ColumnStore
 
Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0
 
Using TICK Stack For System and App Metrics
Using TICK Stack For System and App MetricsUsing TICK Stack For System and App Metrics
Using TICK Stack For System and App Metrics
 
Lecture 2a
Lecture 2aLecture 2a
Lecture 2a
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
Workload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud MarketplacesWorkload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud Marketplaces
 
Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014
 
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchMultidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 

More from Kentaro Yoshida

TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
Kentaro Yoshida
 
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Kentaro Yoshida
 
トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編
Kentaro Yoshida
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービス
Kentaro Yoshida
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
Kentaro Yoshida
 
Fluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターンFluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターン
Kentaro Yoshida
 
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
Kentaro Yoshida
 
MySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearchMySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearch
Kentaro Yoshida
 
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasualFluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Kentaro Yoshida
 
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
Kentaro Yoshida
 
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
Kentaro Yoshida
 
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Kentaro Yoshida
 
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
Kentaro Yoshida
 

More from Kentaro Yoshida (13)

TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
 
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
 
トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービス
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
 
Fluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターンFluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターン
 
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
 
MySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearchMySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearch
 
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasualFluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
 
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
 
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
 
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
 
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
 

Recently uploaded

在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
veerababupersonal22
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
aqil azizi
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
Kamal Acharya
 

Recently uploaded (20)

在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERSCW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 

Improve data engineering work with Digdag and Presto UDF

  • 1. © 2018 Arm Limited • Kentaro Yoshida Improve data engineering work with Digdag and Presto UDF • 2018/10/17 at Plazma TD TechTalk 2018 Fall
  • 2. © 2018 Arm Limited2 About me • @yoshi_ken • Leading DATA Team • Support data driven work at TD • Published DWH Platform books Familiar Products
  • 3. © 2018 Arm Limited3 What is DATA Team? • Management for internal data ETL & Analysis Platform on TreasureData • As historical reason, using Luigi, Airflow(with embulk) and Digdag • Management data visualizing and reporting workflow for business • Not only for engineers but also sales, marketing and operation • Make simple solution insight from complexed data ocean • Kind of data science(analysis) solution • A rare team who use TreasureData internally as daily basis • We can tell feedback as user mind for new improvements
  • 4. © 2018 Arm Limited4 Technical Challenge of DATA Team • Make scalable&robust data pipeline • ex) 1 query generates numerous metrics logs from each components • Improve fact data for supporting data-driven business/engineering • ex) make easier to use data beforehand enrich/pre-processing • Seek performance tuning insights for presto/hive at the platform side • ex) root cause of making table fragmentation • Change semi-realtime data processing from daily jobs • ex) fresh/quick stat data make good insight for engineer/support
  • 5. © 2018 Arm Limited Introduce nice improvements For Presto UDF and digdag
  • 6. © 2018 Arm Limited6 Introduced nice improvements in Digdag and Presto • New feature of Digdag 1. Added ${td.last_job.num_records} • Which has number of records for job results 2. Added “_else_do” after if> operator since digdag v0.9.31 3. Added param_set> and param_get> • For parameter sharing between workflow (not available in TD workflow) • New feature of Presto 1. Added TD_TIME_STRING() UDF • In SELECT clause, Make easier to format date string 2. Added TD_INTERVAL() UDF • In WHERE clause, Make easier to specify time range extraction
  • 7. © 2018 Arm Limited New Feature of Digdag
  • 8. © 2018 Arm Limited8 Situation of zero result error in workflow • Due to some reason, in the case of final results got zero result unexpectedly. • It need to investigate result number of rows for each step-by-step. • I wish if digdag check the result number of rows at each step… • I wish if digdag has function of result output with job_id… Oops!
  • 9. © 2018 Arm Limited9 Situation of zero result error in workflow • Introduced ${td.last_job.num_records} has number of records for job results $ cat num_records.dig +query: td>: data: SELECT DISTINCT symbol FROM nasdaq database: sample_datasets +fail_if_zero: if>: ${td.last_job.num_records < 1} _do: fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
  • 10. © 2018 Arm Limited10 Situation of zero result error in workflow • Introduced “_else_do” after if> operator since digdag v0.9.31 $ cat num_records.dig +query: td>: data: SELECT DISTINCT symbol FROM nasdaq database: sample_datasets +fail_if_zero: if>: ${td.last_job.num_records < 1} _do: fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows. _else_do: sh>: td export:result ${td.last_job_id} ${result_path} # enqueue job _export: result_path: td://@/workflow_logs/jobid_${td.last_job_id}
  • 11. © 2018 Arm Limited New Feature of Presto TD_TIME_STRING() UDF
  • 12. © 2018 Arm Limited12 Efficient way to format date string in SELECT • It was required to use burden of writing date format conversion. • This type of query has used GROUP BY statement in generally. • So, I have used to be add preset custom dictionary with “td” for my IME.
  • 13. © 2018 Arm Limited13 Efficient way to format date string in SELECT • TD_TIME_STRING() is awesome UDF • Easier way to truncate timestamp format string format example y yyyy-MM-dd HH:mm:ssZ 2018-01-01 00:00:00+0700 q yyyy-MM-dd HH:mm:ssZ 2018-04-01 00:00:00+0700 M yyyy-MM-dd HH:mm:ssZ 2018-09-01 00:00:00+0700 w yyyy-MM-dd HH:mm:ssZ 2018-09-09 00:00:00+0700 d yyyy-MM-dd HH:mm:ssZ 2018-09-13 00:00:00+0700 h yyyy-MM-dd HH:mm:ssZ 2018-09-13 16:00:00+0700 m yyyy-MM-dd HH:mm:ssZ 2018-09-13 16:45:00+0700 s yyyy-MM-dd HH:mm:ssZ 2018-09-13 16:45:34+0700 y! yyyy 2018 q! yyyy-MM 2018-04 M! yyyy-MM 2018-09 w! yyyy-MM-dd 2018-09-09 d! yyyy-MM-dd 2018-09-13 h! yyyy-MM-dd HH 2018-09-13 16 m! yyyy-MM-dd HH:mm 2018-09-13 16:45 s! yyyy-MM-dd HH:mm:ss 2018-09-13 16:45:34 —- Before TD_TIME_FORMAT( TD_DATE_TRUNC('day', time), 'yyyy-MM-dd') —- After TD_TIME_STRING(time, 'd!') day,
  • 14. © 2018 Arm Limited New Feature of Presto TD_INTERVAL() UDF
  • 15. © 2018 Arm Limited15 Efficient way to specify range of date in WHERE • There are many complicated technique to gather specific range —- cover 6 months of the data until today. 156=31*5+1 TD_TIME_RANGE(time, TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')), TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()) ) -— cover the beginning of day until now TD_TIME_RANGE(time, TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME() )
  • 16. © 2018 Arm Limited16 Efficient way to specify range of date in WHERE • TD_INTERVAL() UDF make easier —- BEFORE —- cover 6 months of the data until today. 156=31*5+1 TD_TIME_RANGE(time, TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')), TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()) ) —- AFTER —- it can be specify with short UDF TD_INTERVAL(time, '-6M/0d')
  • 17. © 2018 Arm Limited17 Efficient way to specify range of date in WHERE • TD_INTERVAL() UDF make easier —- BEFORE -— cover the beginning of day until now TD_TIME_RANGE(time, TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME() ) —- AFTER —- it can be specify with short UDF TD_INTERVAL(time, '-1d')
  • 18. © 2018 Arm Limited18 Efficient way to specify range of date in WHERE
  • 19. © 2018 Arm Limited19 Efficient way to specify range of date in WHERE -— Here is a example of query start time is 2018-08-14 01:23:45 (Tue, UTC) # The last hour [2018-08-14 00:00:00, 2018-08-14 01:00:00) SELECT ... WHERE TD_INTERVAL(time, '-1h') # From the last hour to now [2018-08-14 00:00:00, 2018-08-14 01:23:45) SELECT ... WHERE TD_INTERVAL(time, '-1h/now') # The last hour since the beginning of today [2018-08-13 23:00:00, 2018-08-14 00:00:00) SELECT ... WHERE TD_INTERVAL(time, '-1h/0d') • After slash, it can specify the borderline of the day.
  • 20. © 2018 Arm Limited20 Efficient way to specify range of date in WHERE -— Here is a example of query start time is 2018-08-14 01:23:45 (Tue, UTC) # The last 7 days since 2015-12-25 [2015-12-18 00:00:00, 2015-12-25 00:00:00) SELECT ... WHERE TD_INTERVAL(time, '-7d/2015-12-25') # The last 10 days since the beginning of the last month [2018-06-21 00:00:00, 2018-07-01 00:00:00) SELECT ... WHERE TD_INTERVAL(time, '-10d/-1M') • After slash, it can specify the borderline of the day. • Effective way, It also work ${session_date} if using digdag.
  • 21. © 2018 Arm Limited21 Tips about handling time range -- recommend to test with such a time_series table CREATE TABLE time_series AS SELECT time, TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ssZ', 'UTC') AS date FROM ( SELECT times FROM ( VALUES SEQUENCE(TD_TIME_PARSE('2018-01-01', 'UTC'), TD_TIME_PARSE('2018-12-31', 'UTC'), 60*60) ) AS x (times) ) t1 CROSS JOIN UNNEST(times) AS t (time) ORDER BY time https://qiita.com/reflet/items/151a10e9a0914e0ec3ee
  • 22. © 2018 Arm Limited22 Let’s enjoy data engineering work with digdag! And also feel free to talk to me