Speaker: Sunny Shah (https://www.linkedin.com/in/sunny-shah-8924577/)
Video: https://www.youtube.com/channel/UC2698J-retd2cw1VZZUnLHw
Talk presented during Bangalore Kafka group's meetup at Near
(https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/267874122/)
7. ● Data Platform for Analysts
● Data Platform for Data Scientists
● Data Platform for Backend Applications
● Data Platform for Streaming Applications
10. Monthly Infra Cost: $2,250-$2,750
● Redshift Cost
○ 6 DS2.xlarge => 12 TB compressed storage
○ ~25 TB of raw data
○ $1,380 per month
● 2 i3.4xlarge for the Spark cluster
○ 32 cores
○ 244 GB RAM
○ 8 TB of SSD storage for log storage
○ $995 per month
● S3 + Other cost
○ $100-$700 per month
11. Initial 6 months Business Impact:
● >5000 Redash queries in 6 months time
● 100+ hourly/daily emailers
● 50k queries per day on Redshift
● Finance team used it for vendor payouts and reconciliation
● Marketing team started using it for user targeting
12. Issue: Inconsistent data
Employees table at 2020-01-01 12:00
Id  Name     updated_at
1   Sunny    2020-01-01 11:30:00
2   Sam      2020-01-01 11:45:00
Employees table at 2020-01-01 13:00
Id  Name     updated_at
1   Sunny    2020-01-01 11:30:00
2   Nitin    2020-01-01 11:45:00
3   Saurabh  2020-01-01 12:10:00
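In code, the naive incremental pull that produces this inconsistency looks roughly like the sketch below. This is a hypothetical PySpark/JDBC example, not the actual SqlShift code; the connection details and table names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("naive-incremental-sync").getOrCreate()

# Warehouse is synced up to the 12th hour; pull only "new" rows after that.
last_sync = "2020-01-01 12:00:00"

changed_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/hr")   # hypothetical source
    .option("user", "sync_user").option("password", "...")
    .option(
        "dbtable",
        f"(SELECT * FROM employees WHERE updated_at > '{last_sync}') AS incr",
    )
    .load()
)

# Returns only id=3 (Saurabh). The id=2 rename (Sam -> Nitin) is missed because
# its updated_at never changed, and deleted rows are never seen at all, so the
# warehouse silently drifts away from the source table.
```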
14. Best Practices to make SDP a success!
● Rely on Airflow for retries and for scheduling jobs.
○ Crontabs are highly unreliable! Don’t use them for anything!
● Alerting and Monitoring
○ Alert calls/emails for job failures; a single job failure is a complete warehouse failure.
○ An SLA miss needs an alert as well (see the Airflow sketch below).
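A minimal Airflow sketch of these practices, assuming Airflow 2.x; the DAG name, task and paging callback are hypothetical:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def page_on_failure(context):
    # Hook this into your paging/email system; a single job failure is
    # treated like a complete warehouse failure.
    print(f"ALERT: {context['task_instance'].task_id} failed")

def sync_table():
    ...  # pull from Kafka/MySQL and load into the warehouse (hypothetical)

with DAG(
    dag_id="warehouse_table_sync",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
    default_args={
        "retries": 3,                          # retries instead of crontab
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": page_on_failure,
        "sla": timedelta(hours=1),             # an SLA miss raises an alert too
    },
    catchup=False,
) as dag:
    PythonOperator(task_id="sync_employees", python_callable=sync_table)
```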
15. Best Practices to make SDP a success!
● Set up WLM / query termination rules
● Unit test cases and integration test cases to ensure the tools are accurate
● Distributed locks
○ Only one application at a time can sync a table.
● Offset management
○ Commit offsets and resume from them, instead of pulling the last hour’s/day’s data (see the sketch below).
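A minimal sketch of the last two points, assuming ZooKeeper through the kazoo client (the same lock could be built on DynamoDB or Redis); the paths, table name and sync helper are hypothetical:

```python
from kazoo.client import KazooClient

def sync_from_kafka(table: str, start_offset: int) -> int:
    """Hypothetical sync step: read the table's Kafka topic from start_offset,
    load into the warehouse, and return the new offset."""
    ...
    return start_offset

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()

table = "employees"
lock = zk.Lock(f"/sdp/locks/{table}", identifier="sync-worker-1")

with lock:  # only one application at a time can sync this table
    offsets_path = f"/sdp/offsets/{table}"
    zk.ensure_path(offsets_path)
    raw, _ = zk.get(offsets_path)
    last_offset = int(raw) if raw else 0

    new_offset = sync_from_kafka(table, start_offset=last_offset)

    # Commit the offset only after the sync succeeds, so a retried run resumes
    # from the committed offset instead of re-pulling the last hour's data.
    zk.set(offsets_path, str(new_offset).encode())
```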
16. ● This particular setup worked for us from 2016-2018.
● We scaled up our Redshift cluster from 3 nodes / 6 TB of storage to 18 nodes / 36 TB of storage.
● ~4,000+ tables.
18. Redshift Spectrum & Immutable data
● ~75% of the data in Redshift was events data.
● Events are immutable.
● Queries on Parquet were 5-10% faster than queries on ORC.
19. S3 + Spectrum to the rescue
(Diagram: backend data synced to Redshift via kShift; immutable events data kept on S3 and queried through Redshift Spectrum.)
20. S3 + Spectrum to the rescue, but it’s broken
(Same diagram: kShift syncing to Redshift; immutable events data on S3 queried through Redshift Spectrum.)
Issues:
1. Small files
2. Failure causes data duplication
21. How does Spark write to S3 + Spectrum?
1. Write Parquet files to S3
2. Add the partitions to Redshift
3. Commit offsets
22. How does Spark write to S3 + Spectrum?
1. Write Parquet files to S3 (failure after this step: duplicate data on retry)
2. Add the partitions to Redshift (failure after this step: duplicate data as well)
3. Commit offsets
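Sketched in code (hypothetical bucket, table and helper names; batch_df is the Spark DataFrame for one Kafka batch), the three steps and the two failure windows look roughly like this:

```python
def run_redshift_sql(sql: str) -> None:
    """Hypothetical helper: execute a statement on Redshift, e.g. via psycopg2."""
    ...

def commit_kafka_offsets() -> None:
    """Hypothetical helper: persist the Kafka offsets consumed for this batch."""
    ...

def load_events_batch(batch_df, hour_partition: str) -> None:
    # 1. Write Parquet files to S3.
    batch_df.write.mode("append").parquet(
        f"s3://events-bucket/clicks/dt={hour_partition}"
    )
    # A crash here leaves files on S3 with offsets uncommitted: the next run
    # re-reads the same Kafka range and writes the same rows again.

    # 2. Register the partition with Redshift Spectrum.
    run_redshift_sql(
        f"ALTER TABLE spectrum.clicks ADD IF NOT EXISTS "
        f"PARTITION (dt='{hour_partition}') "
        f"LOCATION 's3://events-bucket/clicks/dt={hour_partition}/'"
    )
    # A crash here has the same effect: data is queryable but unacknowledged.

    # 3. Commit Kafka offsets only once both previous steps have succeeded.
    commit_kafka_offsets()
```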
23. ● Delta is a storage format which internally uses Parquet as a file format.
● Provides ACID guarantees
● Provides a way to merge small files
● Provides insert, update, delete support
● Works well with Spark
● Provides schema evolution support
● For more info on the Delta to Spectrum converter, please check out our blog
Delta Lake = Parquet + Two-Phase-Commit
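A minimal PySpark sketch of two of the properties above, ACID appends and small-file compaction, assuming the open-source Delta Lake package; the S3 paths are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the open-source delta-core package is on the classpath
# (e.g. spark-submit --packages io.delta:delta-core_2.12:<version>).
spark = (
    SparkSession.builder.appName("delta-events")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate()
)

path = "s3://events-bucket/clicks_delta"                 # hypothetical location
new_events = spark.read.parquet("s3://staging/clicks")   # hypothetical staging data

# Appends are atomic: either the whole batch shows up in the Delta log or none
# of it does, so a failed job can simply be retried without duplicating data.
new_events.write.format("delta").mode("append").save(path)

# Compact the small files produced by frequent job runs. dataChange=false marks
# the rewrite as a layout-only change, so readers know no rows were added.
(
    spark.read.format("delta").load(path)
    .repartition(16)                                     # fewer, larger files
    .write.format("delta").mode("overwrite")
    .option("dataChange", "false")
    .save(path)
)
```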
24. Delta on S3 + Redshift Spectrum
(Diagram: backend data synced to Redshift via kShift; immutable events data written as Delta on S3 and exposed to Redshift Spectrum through the Delta To Spectrum connector.)
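Our Delta To Spectrum connector is covered in the blog mentioned above; as a rough illustration of the same metadata-only idea, here is a sketch using the open-source Delta manifest mechanism (an assumption, not the actual connector; paths are hypothetical):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Spark session with the open-source delta-core package configured.
spark = SparkSession.builder.appName("delta-to-spectrum").getOrCreate()
path = "s3://events-bucket/clicks_delta"   # hypothetical table location

# Metadata-only step: (re)generate the symlink manifest listing the Parquet
# files of the current Delta snapshot. No data is copied, which is why this
# kind of handoff stays at seconds even for multi-TB tables.
DeltaTable.forPath(spark, path).generate("symlink_format_manifest")

# Spectrum then reads through the manifest, e.g. with an external table like:
#   CREATE EXTERNAL TABLE spectrum.clicks (...)
#   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
#   STORED AS
#     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
#     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
#   LOCATION 's3://events-bucket/clicks_delta/_symlink_format_manifest/';
```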
25. Data Platform for Data Scientists (2019)
● Data analysts need an aggregation engine.
● Data scientists need raw data.
● Redshift isn’t great with frequent >50 GB unloads.
● Solution: keep both mutable and immutable data in S3 in the Delta format (see the sketch below).
(Diagram: Delta tables on S3 read by on-demand Spark clusters.)
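A minimal sketch of that workflow: spin up an on-demand Spark cluster and read the raw Delta data straight from S3 instead of UNLOADing tens of GB from Redshift. Paths and the join key are hypothetical.

```python
from pyspark.sql import SparkSession

# Run on a short-lived cluster spun up just for this job or notebook.
spark = SparkSession.builder.appName("ds-raw-data").getOrCreate()

clicks = spark.read.format("delta").load("s3://events-bucket/clicks_delta")
bookings = spark.read.format("delta").load("s3://warehouse-bucket/bookings_delta")

# Heavy raw-data work happens on the on-demand cluster, not on Redshift.
features = (
    clicks.join(bookings, "user_id")   # hypothetical join key
    .groupBy("user_id")
    .count()
)
features.write.format("delta").mode("overwrite").save("s3://ml-bucket/user_features")
```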
27. Data Platform for the Backend folks
● The pipe dream of backend engineers -> Redshift / the data lake as an OLTP DB.
● All the data in one place.
● Only the data lake has
○ Click-stream data
○ Events data
○ Flat table data
● Problem: How can we provide them all the events data in <10 ms?
● Problem: How can we serve multiple types of events in <10ms?
29. Next stuff
● Highly resource-efficient and user-friendly real-time user segmentation.
● We are excited about streaming democratization
○ Resource-efficient and fully self-service alternatives to Flink SQL or KSQL
○ Excited about the Materialize project ( https://github.com/MaterializeInc/materialize )
■ A streaming SQL engine built in Rust.
● We are also super excited about edge analytics
○ https://github.com/cwida/duckdb
● We are interested in making Delta Lake updates and deletes faster.
Editor's Notes
Come on guys, please appreciate the platform as well. The Taj Mahal is built on the banks of the Yamuna; it isn’t possible to construct such a long-lasting building on sand. The foundation is hundreds of meters deep and is built with wood, rubble and iron.
OK, now we all appreciate the platform. So let’s understand what a data platform is, why we need it, and how to build it from scratch.
Thanks to 50+ backend microservices, we have our data stored in 20+ MySQL clusters, 18+ MongoDB clusters, 3 Cassandra clusters, more than 2 TB of data in 50+ DynamoDB tables, and an Aerospike cluster. The backend pushes data to Kafka with 800+ topics and >1 GBPS throughput. From the front end we get data in GA and Segment.io.
The data platform pulls data from all these diverse sources and gives a unified view of the data to data analysts, data scientists, backend applications and streaming applications.
My team members and I got lucky and got the opportunity to build the data platform from scratch for Goibibo, InGoMMT ( the Goibibo-MakeMyTrip supply platform ), HotelSimply ( a hotel management system ) and the Goibibo-MakeMyTrip common services.
The agenda of the talk is to tell you how we built the data platform for analysts, data scientists, backend applications and streaming applications from scratch.
This is our first data platform. We were on AWS, so we chose Redshift as our data warehouse. We built Spark tools to pull data in parallel from MySQL, MongoDB and Cassandra into Redshift. For MySQL to Redshift we built SqlShift, which can pull data incrementally or do a full dump. These Spark jobs ensured that Redshift had the data of our backend databases.
The next thing we wanted in our data warehouse was page visits and events data. We asked our backend teams to write each event to a separate log file, and every hour we would upload the log files to S3. An Airflow job was scheduled to load these logs into Redshift.
A single hotel transaction would write data to several databases, and our analysts were joining these tables repeatedly, so we wrote a Spark ETL job to create flat tables. These flat tables have ~250 columns and almost all the required information for a hotel transaction. We run these ETL jobs every 5 minutes to give the business a near-real-time view.
We use Redash as the visualization tool.
It’s possible to build this data platform in 3 months’ time. The cost of this data platform for ~25 TB of data would be about $2,250 a month.
We learnt one thing from this experience: a lot of people genuinely love doing analysis, provided it’s easy to do.
Our flat tables made it easy and fast for people to do analysis, as they don’t have to join tables.
In late 2016, Redshift came out with the Python UDF feature, and our finance team learnt Python to move their tax computations and other formulas into functions.
The marketing team built user targeting and user engagement on top of Redshift.
In this particular case, we have two versions of the Employees table: the top one is the table at the 12th hour and the bottom one is the table at the 13th hour.
Let’s imagine that we have our data synced till the 12th hour.
At the 13th hour, our job would ask for the data from the 12th hour to the 13th hour. We would receive only the record with id=3; we wouldn’t receive the changed id=2 record because its updated_at didn’t change. And deleted records never get removed from the warehouse, because they are simply no longer present in the source database.
So sometimes we would miss updates, and we would always lose deletes.
The solution was to read the changelog ( the binlog in the case of MySQL and the oplog for MongoDB ).
Debezium was in a nascent stage around early 2017; we had to rewrite a significant part of the MongoDB Debezium connector and fix a few issues in the MySQL Debezium connector to make it production ready.
Debezium reads the binlog/oplog and creates one Kafka topic per MySQL table or MongoDB collection.
We keep these topics log compacted. The advantage of this setup is that our jobs can use Kafka to read the complete data and never hit the MySQL/MongoDB clusters, which saves the infra cost of keeping additional MySQL or MongoDB slaves for the data platform.
We built kShift, a tool written in Scala and Spark, to pull data from Kafka and sync it to Redshift.
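As a rough illustration of the log-compacted, one-topic-per-table setup (using the kafka-python client; the topic name, partition count and configs are hypothetical):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="kafka1:9092")

topic = NewTopic(
    name="cdc.hoteldb.employees",           # one topic per table/collection
    num_partitions=6,
    replication_factor=3,
    topic_configs={
        "cleanup.policy": "compact",         # keep the latest record per key
        "min.compaction.lag.ms": "3600000",  # give consumers time to see updates
    },
)
admin.create_topics([topic])

# Debezium keys each message by the row's primary key, so after compaction the
# topic still contains the latest state of every row: a full copy of the table
# that jobs can read without touching the MySQL/MongoDB clusters.
```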
Use Airflow instead of crontabs: it gives us retries in case of failures, SLA-miss alerts and job-failure alerts.
Treat one job failure alert like an entire data platform failure. Have an SLA of <1 hour for fixing data-sync issues.
Bad queries can take the entire warehouse down; don’t handle this manually. Redshift has automatic query termination rules; if your warehouse doesn’t support them, write an Airflow job to kill long-running queries.
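A minimal sketch of such a query-killing job, assuming psycopg2 and hypothetical connection details and thresholds:

```python
import psycopg2

MAX_RUNTIME_SECONDS = 900  # hypothetical threshold

conn = psycopg2.connect(host="redshift-host", port=5439,
                        dbname="warehouse", user="admin", password="...")
conn.autocommit = True

with conn.cursor() as cur:
    # stv_recents lists queries currently running on the Redshift cluster.
    cur.execute("""
        SELECT pid
        FROM stv_recents
        WHERE status = 'Running'
          AND duration > %s * 1000000   -- duration is in microseconds
    """, (MAX_RUNTIME_SECONDS,))
    for (pid,) in cur.fetchall():
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```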
Take tool accuracy seriously; write integration test cases.
Data consistency goes for a toss when multiple instances of the same job sync the same table.
To solve this problem, we use ZooKeeper-based distributed locks to ensure that this doesn’t happen.
It’s possible to build a distributed lock using DynamoDB or Redis as well.
Commit the offset, and the next time the job runs, start from the committed offset. This is a lot more fault tolerant than pulling the last hour’s data.
Interestingly, one issue with Redshift is that scaling up without significant downtime requires doubling the number of nodes. This means that after an 18-node cluster, the next scale-up is 36 nodes.
So one day we get the Redshift-full alert, and we don’t have approval for a 36-node scale-up!
Around this time, Redshift added the capability of querying data on S3 in Parquet and ORC formats.
We figured out that ~75% of the data in Redshift was events data, and events are immutable.
We decided to move our events from Redshift to S3 and access them as Parquet.
Every job run would produce small files, and Spectrum performs really badly on small files.
It isn’t possible to merge these files without having duplicate data or inconsistency for some time.
Every job failure would cause data duplication.
Let’s understand the data duplication a bit more.
At an architectural level, the core reason behind this issue is that Parquet + S3 doesn’t provide transactional properties across files and partitions.
In other words, we can’t say: either write these 20 files across 2 partitions and add them to Redshift, or don’t write anything at all.
We found the solution to this problem in the Delta Lake format.
An important thing to note is that our Delta to Redshift connector just performs a metadata operation; it doesn’t copy the data. Even for tables of 1 TB in size, the Delta To Spectrum connector doesn’t take more than 1-2 seconds.