Debugging data pipelines @OLA by Karan Kumar

Debugging data pipelines
Karan Kumar
SDE 3
Dataplatform

Overview
● Our Journey
● Analytics@Ola
● CDC Overview
● Application Events
● Majority Sources
● Hello Presto
● The Presto Kafka Problem
● Solution
● Results
● We like Ambari!
● How to expose?
● Hue drawbacks
● Presto as a first-class citizen of Hue
● Roadmap

Overview of analytics@Ola
● ~25k queries run daily by business analysts.
● ~400 business analysts.
● 2.5 TB of daily data ingest.
● ~3k tables maintained by dataplatform.
● Auth managed via Ranger

Majority sources
● MySQL
● PostgreSQL
● Kafka
● MongoDB
● HBase
● ScyllaDB
● Hive

Hello Presto
● Single unified view across data sources
● Profiling and automated alerting
● Drastic reduction in turnaround time (TAT).
● Integration with Jira Hooks

The Presto-Kafka Problem
● Fetches all the partitions, starts scanning from the earliest offset, and only then applies filters
● Adding a topic requires a config change

The Presto-Kafka Solution
● Query the broker for the topic list every time, so a new topic needs no config change.
● Use message_timestamp (Kafka versions > 0.10.1xx) to narrow the scan.
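
The timestamp idea can be sketched in a few lines. This is only an illustration over an in-memory list, not the connector code from the talk: `Record`, `first_offset_at_or_after`, and both scan functions are hypothetical names, and the sketch assumes timestamps never decrease within a partition. It mirrors the kind of lookup the Kafka Java consumer offers through `offsetsForTimes`, which returns the earliest offset whose timestamp is at or after a given time.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    offset: int
    timestamp_ms: int
    value: str


def first_offset_at_or_after(log: List[Record], ts_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= ts_ms, or None if there is none.

    Assumption: timestamps are non-decreasing within the partition.
    """
    timestamps = [r.timestamp_ms for r in log]
    i = bisect_left(timestamps, ts_ms)
    return log[i].offset if i < len(log) else None


def scan_without_pushdown(
    log: List[Record], start_ms: int, end_ms: int
) -> Tuple[List[Record], int]:
    """Read every record from the earliest offset, then filter."""
    matches = [r for r in log if start_ms <= r.timestamp_ms < end_ms]
    return matches, len(log)


def scan_with_pushdown(
    log: List[Record], start_ms: int, end_ms: int
) -> Tuple[List[Record], int]:
    """Seek to the first matching offset and stop once past the range."""
    start = first_offset_at_or_after(log, start_ms)
    if start is None:
        return [], 0
    base = log[0].offset  # offsets in the sketch are contiguous
    matches: List[Record] = []
    records_read = 0
    for rec in log[start - base:]:
        records_read += 1
        if rec.timestamp_ms >= end_ms:
            break
        matches.append(rec)
    return matches, records_read
```

Both functions return the matching records plus the number of records read, which makes the saving from seeking by timestamp easy to compare on a sample partition.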

Results
● Earlier
● With message timestamp
● With predicate pushdown

We like Ambari!!
● Exposing Presto on Ambari.
● Patching open-source Ambari to pull tars from S3, as our setup requires.
● Out-of-the-box alerting and monitoring.
● Releasing plugins via S3 polling.
● Autoscaling via AWS Auto Scaling groups.

That's okay, but how do we expose it?
● We had three choices:
○ MSTR
○ Hue
○ A new interface such as Superset

Why will Hue not work?
● No results download
● No query progress
● No query kill functionality
● Result caching
● Download limit is based on rows fetched, not on size.
● Launches a JVM for each user
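
To make the row-versus-size point concrete, here is a minimal sketch of a download capped by bytes rather than by row count. It is not Hue's code: `capped_csv` and its parameters are hypothetical names chosen for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def capped_csv(rows: Iterable[Sequence[object]], max_bytes: int) -> Iterator[bytes]:
    """Yield CSV-encoded rows until the next row would exceed max_bytes.

    The budget counts encoded bytes, so a few very wide rows and many
    narrow rows are limited by the same amount of data.
    """
    sent = 0
    for row in rows:
        buf = io.StringIO()
        csv.writer(buf).writerow(row)
        chunk = buf.getvalue().encode("utf-8")
        if sent + len(chunk) > max_bytes:
            break  # stop before crossing the size budget
        sent += len(chunk)
        yield chunk
```

A row-count limit would let wide rows produce a very large file, while a byte budget bounds the download regardless of row width.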

Why did MSTR not work?
● Downloading was tedious.
● Per-user memory issues.
● Unfamiliar UI.

Presto as a first-class citizen of Hue
● Results download up to 100 MB.
● Query progress.
● Query kill supported.
● Query expiry after 7 days, so there is no need to rerun historical queries.
● Coordinator query URL
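
One way a UI could derive a query progress figure is from the split counts reported while a query runs. The sketch below is an assumption-laden illustration, not the Hue integration from the talk: `progress_percent` is a hypothetical helper, and the keys `state`, `totalSplits`, and `completedSplits` are assumed to match the `stats` object returned with Presto statement responses.

```python
from typing import Any, Mapping


def progress_percent(stats: Mapping[str, Any]) -> float:
    """Rough completion percentage from a stats mapping.

    Assumed keys: "state", "totalSplits", "completedSplits".
    """
    if stats.get("state") == "FINISHED":
        return 100.0
    total = stats.get("totalSplits", 0)
    done = stats.get("completedSplits", 0)
    if total <= 0:
        return 0.0  # nothing scheduled yet, e.g. the query is still queued
    return round(100.0 * done / total, 1)
```

Split counts can grow as more of the plan is scheduled, so a figure computed this way is an estimate rather than an exact measure.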

Roadmap
1. Contributing the Presto Kafka connector back
2. Presto Oozie support
3. Getting the Presto Ranger PR merged
4. Deprecating Hive for analysts
