Debugging data pipelines
Karan Kumar
SDE 3
Dataplatform
Overview
● Our Journey
● Analytics@Ola
● CDC Overview
● Application Events
● Majority Sources
● Hello Presto
● The Presto Kafka Problem
● Solution
● Results
● We like Ambari!
● How to expose?
● Hue drawbacks
● Presto as a first-class citizen of Hue
● Roadmap
Overview of analytics@Ola
● 25k queries run daily by business analysts.
● ~400 business analysts.
● 2.5 TB of daily data ingest.
● ~3k tables maintained by dataplatform.
● Auth managed via Ranger
CDC Overview
Application Events
Majority sources
● MySQL
● PostgreSQL
● Kafka
● MongoDB
● HBase
● ScyllaDB
● Hive
Hello Presto
● Single unified view across data sources
● Profiling and automated alerting
● Drastic reduction in TAT.
● Integration with Jira Hooks
The Presto-Kafka Problem
● Fetches all partitions, scans each from the earliest offset, and only then applies filters.
● Adding a topic requires a config change.
The Presto-Kafka Solution
● Hit the broker for the topic list every time.
● Make use of message_timestamp in Kafka versions > 0.10.1xx
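The timestamp idea above can be sketched in a few lines. This is an illustrative in-memory simulation, not the connector code from the talk: the partition model, `offset_for_time`, and `scan` are hypothetical names, and it assumes contiguous offsets and non-decreasing timestamps within a partition. The Kafka Java consumer offers a comparable lookup through `offsetsForTimes`; whether the connector described here uses that call is not stated in the slides.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# A partition is modelled as (offset, timestamp_ms) pairs in append order.
# Assumption: offsets are contiguous and timestamps never decrease.
Partition = List[Tuple[int, int]]


def offset_for_time(partition: Partition, ts_ms: int) -> Optional[int]:
    """Return the first offset whose timestamp is >= ts_ms, or None."""
    timestamps = [ts for _, ts in partition]
    i = bisect_left(timestamps, ts_ms)
    return partition[i][0] if i < len(partition) else None


def scan(partition: Partition, start_ts: int, end_ts: int) -> List[int]:
    """Read only offsets with start_ts <= timestamp < end_ts.

    Instead of starting at the earliest offset and filtering afterwards,
    jump straight to the first qualifying offset and stop at end_ts.
    """
    start = offset_for_time(partition, start_ts)
    if start is None:
        return []
    first_offset = partition[0][0]
    matched = []
    for offset, ts in partition[start - first_offset:]:
        if ts >= end_ts:
            break
        matched.append(offset)
    return matched


# Ten messages at offsets 100..109, one every 10 ms starting at t=1000.
partition = [(100 + i, 1000 + 10 * i) for i in range(10)]
window = scan(partition, 1035, 1060)
```

The point of the sketch is that the number of messages read depends on the width of the time window, not on how much history the topic retains.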
Results
● Earlier
● With message timestamp
● With predicate pushdown
We like Ambari!!
● Exposing Presto on Ambari.
● Patching open source Ambari to fit our need of pulling tars from S3.
● Out of the box alerting and monitoring.
● Releasing plugins via S3 poll.
● Autoscaling via AWS autoscaling groups.
That's okay, but how do we expose it?
● We had 3 choices.
○ MSTR
○ Hue
○ New interface like Superset
Why will Hue not work?
● No results download
● No query progress
● No query kill functionality
● Result caching
● Download limit is on rows fetched, not on size.
● Launches a JVM for each user
Why did MSTR not work?
● Downloading was tedious.
● Per user memory issue.
● UI unfamiliarity.
Presto as a first-class citizen for Hue
● Results download up to 100 MB.
● Query progress.
● Query kill supported.
● Query expiry after 7 days. No need to rerun historical queries.
● Coordinator query URL
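A size-based download limit, as opposed to the row-based limit listed among the Hue drawbacks, might look roughly like the sketch below. It is a minimal illustration, not the Hue patch from the talk: `rows_within_budget` is a hypothetical helper, and CSV with UTF-8 encoding is an assumed output format. Only the 100 MB figure comes from the slide.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence

MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024  # 100 MB, the limit quoted on the slide


def rows_within_budget(
    rows: Iterable[Sequence[object]],
    max_bytes: int = MAX_DOWNLOAD_BYTES,
) -> Iterator[bytes]:
    """Yield CSV-encoded rows until the next row would exceed max_bytes.

    The cap is on encoded size, so a few very wide rows and many narrow
    rows are treated the same way.
    """
    used = 0
    for row in rows:
        buf = io.StringIO()
        csv.writer(buf).writerow(row)
        line = buf.getvalue().encode("utf-8")
        if used + len(line) > max_bytes:
            break
        used += len(line)
        yield line


# Each encoded row is 5 bytes ("a,1\r\n"), so a 12-byte budget fits two rows.
sample = list(rows_within_budget([("a", 1), ("b", 2), ("c", 3)], max_bytes=12))
```

Because the helper is a generator, it can sit in front of a streaming HTTP response without holding the full result set in memory.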
Roadmap
1. Contributing the Presto Kafka connector back
2. Presto Oozie support
3. Getting Presto Ranger PR merged
4. Deprecating Hive for analysts
Thanks!! Questions?

Debugging data pipelines @OLA by Karan Kumar
