Speaker: Sunny Shah (https://www.linkedin.com/in/sunny-shah-8924577/)
Video: https://www.youtube.com/channel/UC2698J-retd2cw1VZZUnLHw
Talk presented during Bangalore Kafka group's meetup at Near
(https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/267874122/)
7. ● Data Platform for Analysts
● Data Platform for Data Scientists
● Data Platform for Backend Applications
● Data Platform for Streaming Applications
10. Monthly Infra Cost: $2,250-$2,750
● Redshift Cost
○ 6 DS2.xlarge => 12 TB compressed storage
○ ~25 TB of raw data
○ $1,380 per month
● 2 i3.4xlarge for the Spark cluster
○ 32 cores
○ 244 GB RAM
○ 8 TB of SSD storage for log storage
○ $995 per month
● S3 + Other cost
○ $100-$700 per month
11. Initial 6 months Business Impact:
● >5000 Redash queries in 6 months time
● 100+ hourly/daily emailers
● 50k queries per day on Redshift
● Finance team used it for vendor payouts and reconciliation
● Marketing team started using it for user targeting
12. Issue: Inconsistent data
Employees table at 2020-01-01 12:00
Id  Name     updated_at
1   Sunny    2020-01-01 11:30:00
2   Sam      2020-01-01 11:45:00
Employees table at 2020-01-01 13:00
Id  Name     updated_at
1   Sunny    2020-01-01 11:30:00
2   Nitin    2020-01-01 11:45:00
3   Saurabh  2020-01-01 12:10:00
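In code, the naive incremental pull that produces this inconsistency looks roughly like the sketch below. This is a hypothetical PySpark/JDBC example, not the actual SqlShift code; the connection details and table names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("naive-incremental-sync").getOrCreate()

# Warehouse is synced up to the 12th hour; pull only "new" rows after that.
last_sync = "2020-01-01 12:00:00"

changed_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/hr")   # hypothetical source
    .option("user", "sync_user").option("password", "...")
    .option(
        "dbtable",
        f"(SELECT * FROM employees WHERE updated_at > '{last_sync}') AS incr",
    )
    .load()
)

# Returns only id=3 (Saurabh). The id=2 rename (Sam -> Nitin) is missed because
# its updated_at never changed, and deleted rows are never seen at all, so the
# warehouse silently drifts away from the source table.
```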
14. Best Practices to make SDP a success!
● Rely on Airflow for retries and for scheduling jobs.
○ Crontabs are highly unreliable! Don’t use them for anything!
● Alerting and Monitoring
○ Alert calls/emails for job failures; a single job failure is a complete warehouse failure.
○ An SLA miss needs an alert as well (see the Airflow sketch below).
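A minimal Airflow sketch of these practices, assuming Airflow 2.x; the DAG name, task and paging callback are hypothetical:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def page_on_failure(context):
    # Hook this into your paging/email system; a single job failure is
    # treated like a complete warehouse failure.
    print(f"ALERT: {context['task_instance'].task_id} failed")

def sync_table():
    ...  # pull from Kafka/MySQL and load into the warehouse (hypothetical)

with DAG(
    dag_id="warehouse_table_sync",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
    default_args={
        "retries": 3,                          # retries instead of crontab
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": page_on_failure,
        "sla": timedelta(hours=1),             # an SLA miss raises an alert too
    },
    catchup=False,
) as dag:
    PythonOperator(task_id="sync_employees", python_callable=sync_table)
```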
15. Best Practices to make SDP a success!
● Set up WLM / query termination rules
● Unit test cases and integration test cases to ensure the tools are accurate
● Distributed locks
○ Only one application at a time can sync a table.
● Offset management
○ Commit offsets and resume from them, instead of pulling the last hour’s/day’s data (see the sketch below).
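A minimal sketch of the last two points, assuming ZooKeeper through the kazoo client (the same lock could be built on DynamoDB or Redis); the paths, table name and sync helper are hypothetical:

```python
from kazoo.client import KazooClient

def sync_from_kafka(table: str, start_offset: int) -> int:
    """Hypothetical sync step: read the table's Kafka topic from start_offset,
    load into the warehouse, and return the new offset."""
    ...
    return start_offset

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()

table = "employees"
lock = zk.Lock(f"/sdp/locks/{table}", identifier="sync-worker-1")

with lock:  # only one application at a time can sync this table
    offsets_path = f"/sdp/offsets/{table}"
    zk.ensure_path(offsets_path)
    raw, _ = zk.get(offsets_path)
    last_offset = int(raw) if raw else 0

    new_offset = sync_from_kafka(table, start_offset=last_offset)

    # Commit the offset only after the sync succeeds, so a retried run resumes
    # from the committed offset instead of re-pulling the last hour's data.
    zk.set(offsets_path, str(new_offset).encode())
```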
16. ● This particular setup worked for us from 2016-2018.
● We scaled up our Redshift cluster from 3 nodes / 6 TB of storage to 18 nodes / 36 TB of storage.
● ~4,000+ tables.
18. Redshift Spectrum & Immutable data
● ~75% of the data in Redshift was events data.
● Events are immutable.
● Queries on Parquet were 5-10% faster than queries on ORC.
19. S3 + Spectrum to the rescue
(Diagram: backend data synced to Redshift via kShift; immutable events data kept on S3 and queried through Redshift Spectrum.)
20. S3 + Spectrum to the rescue, but it’s broken
(Same diagram: kShift syncing to Redshift; immutable events data on S3 queried through Redshift Spectrum.)
Issues:
1. Small files
2. Failure causes data duplication
21. How does Spark write to S3 + Spectrum?
1. Write Parquet files to S3
2. Add the partitions to Redshift
3. Commit offsets
22. How does Spark write to S3 + Spectrum?
1. Write Parquet files to S3 (failure after this step: duplicate data on retry)
2. Add the partitions to Redshift (failure after this step: duplicate data as well)
3. Commit offsets
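Sketched in code (hypothetical bucket, table and helper names; batch_df is the Spark DataFrame for one Kafka batch), the three steps and the two failure windows look roughly like this:

```python
def run_redshift_sql(sql: str) -> None:
    """Hypothetical helper: execute a statement on Redshift, e.g. via psycopg2."""
    ...

def commit_kafka_offsets() -> None:
    """Hypothetical helper: persist the Kafka offsets consumed for this batch."""
    ...

def load_events_batch(batch_df, hour_partition: str) -> None:
    # 1. Write Parquet files to S3.
    batch_df.write.mode("append").parquet(
        f"s3://events-bucket/clicks/dt={hour_partition}"
    )
    # A crash here leaves files on S3 with offsets uncommitted: the next run
    # re-reads the same Kafka range and writes the same rows again.

    # 2. Register the partition with Redshift Spectrum.
    run_redshift_sql(
        f"ALTER TABLE spectrum.clicks ADD IF NOT EXISTS "
        f"PARTITION (dt='{hour_partition}') "
        f"LOCATION 's3://events-bucket/clicks/dt={hour_partition}/'"
    )
    # A crash here has the same effect: data is queryable but unacknowledged.

    # 3. Commit Kafka offsets only once both previous steps have succeeded.
    commit_kafka_offsets()
```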
23. ● Delta is a storage format which internally uses Parquet as a file format.
● Provides ACID guarantees
● Provides a way to merge small files
● Provides insert, update, delete support
● Works well with Spark
● Provides schema evolution support
● For more info on the Delta to Spectrum converter, please check out our blog
Delta Lake = Parquet + Two-Phase-Commit
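A minimal PySpark sketch of two of the properties above, ACID appends and small-file compaction, assuming the open-source Delta Lake package; the S3 paths are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the open-source delta-core package is on the classpath
# (e.g. spark-submit --packages io.delta:delta-core_2.12:<version>).
spark = (
    SparkSession.builder.appName("delta-events")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .getOrCreate()
)

path = "s3://events-bucket/clicks_delta"                 # hypothetical location
new_events = spark.read.parquet("s3://staging/clicks")   # hypothetical staging data

# Appends are atomic: either the whole batch shows up in the Delta log or none
# of it does, so a failed job can simply be retried without duplicating data.
new_events.write.format("delta").mode("append").save(path)

# Compact the small files produced by frequent job runs. dataChange=false marks
# the rewrite as a layout-only change, so readers know no rows were added.
(
    spark.read.format("delta").load(path)
    .repartition(16)                                     # fewer, larger files
    .write.format("delta").mode("overwrite")
    .option("dataChange", "false")
    .save(path)
)
```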
24. Delta on S3 + Redshift Spectrum
(Diagram: backend data synced to Redshift via kShift; immutable events data written as Delta on S3 and exposed to Redshift Spectrum through the Delta To Spectrum connector.)
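Our Delta To Spectrum connector is covered in the blog mentioned above; as a rough illustration of the same metadata-only idea, here is a sketch using the open-source Delta manifest mechanism (an assumption, not the actual connector; paths are hypothetical):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Spark session with the open-source delta-core package configured.
spark = SparkSession.builder.appName("delta-to-spectrum").getOrCreate()
path = "s3://events-bucket/clicks_delta"   # hypothetical table location

# Metadata-only step: (re)generate the symlink manifest listing the Parquet
# files of the current Delta snapshot. No data is copied, which is why this
# kind of handoff stays at seconds even for multi-TB tables.
DeltaTable.forPath(spark, path).generate("symlink_format_manifest")

# Spectrum then reads through the manifest, e.g. with an external table like:
#   CREATE EXTERNAL TABLE spectrum.clicks (...)
#   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
#   STORED AS
#     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
#     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
#   LOCATION 's3://events-bucket/clicks_delta/_symlink_format_manifest/';
```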
25. Data Platform for Data Scientists (2019)
● Data analysts need an aggregation engine.
● Data scientists need raw data.
● Redshift isn’t great with frequent >50 GB unloads.
● Solution: keep both mutable and immutable data in S3 in the Delta format (see the sketch below).
(Diagram: Delta tables on S3 read by on-demand Spark clusters.)
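A minimal sketch of that workflow: spin up an on-demand Spark cluster and read the raw Delta data straight from S3 instead of UNLOADing tens of GB from Redshift. Paths and the join key are hypothetical.

```python
from pyspark.sql import SparkSession

# Run on a short-lived cluster spun up just for this job or notebook.
spark = SparkSession.builder.appName("ds-raw-data").getOrCreate()

clicks = spark.read.format("delta").load("s3://events-bucket/clicks_delta")
bookings = spark.read.format("delta").load("s3://warehouse-bucket/bookings_delta")

# Heavy raw-data work happens on the on-demand cluster, not on Redshift.
features = (
    clicks.join(bookings, "user_id")   # hypothetical join key
    .groupBy("user_id")
    .count()
)
features.write.format("delta").mode("overwrite").save("s3://ml-bucket/user_features")
```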
27. Data Platform for the Backend folks
● The pipe dream of backend engineers -> Redshift / the data lake as an OLTP DB.
● All the data in one place.
● Only the data lake has
○ Click-stream data
○ Events data
○ Flat table data
● Problem: How can we provide them all the events data in <10 ms?
● Problem: How can we serve multiple types of events in <10ms?
29. Next stuff
● Highly resource-efficient and user-friendly real-time user segmentation.
● We are excited about streaming democratization
○ Resource-efficient and fully self-service alternatives to Flink SQL or KSQL
○ Excited about the Materialize project ( https://github.com/MaterializeInc/materialize )
■ A streaming SQL engine built in Rust.
● We are also super excited about edge analytics
○ https://github.com/cwida/duckdb
● We are interested in making Delta Lake updates and deletes faster.
Editor's Notes
Come on guys, please appreciate the platform as well. The Taj Mahal is built on the banks of the Yamuna; it isn’t possible to construct such a long-lasting building on sand. The foundation is hundreds of meters deep and is built with wood, rubble and iron.
OK, now we all appreciate the platform. So let’s understand what a data platform is, why we need it, and how to build it from scratch.
Thanks to 50+ backend microservices, we have our data stored in 20+ MySQL clusters, 18+ MongoDB clusters, 3 Cassandra clusters, more than 2 TB of data in 50+ DynamoDB tables, and an Aerospike cluster. The backend pushes data to Kafka with 800+ topics and >1 GBPS throughput. From the front end we get data in GA and Segment.io.
The data platform pulls data from all these diverse sources and gives a unified view of the data to data analysts, data scientists, backend applications and streaming applications.
My team members and I got lucky and got the opportunity to build the data platform from scratch for Goibibo, InGoMMT ( the Goibibo-MakeMyTrip supply platform ), HotelSimply ( a hotel management system ) and the Goibibo-MakeMyTrip common services.
The agenda of the talk is to tell you how we built the data platform for analysts, data scientists, backend applications and streaming applications from scratch.
This is our first data platform. We were on AWS, so we chose Redshift as our data warehouse. We built Spark tools to pull data in parallel from MySQL, MongoDB and Cassandra into Redshift. For MySQL to Redshift we built SqlShift, which can pull data incrementally or do a full dump. These Spark jobs ensured that Redshift had the data of our backend databases.
The next thing we wanted in our data warehouse was page visits and events data. We asked our backend teams to write each event to a separate log file, and every hour we would upload the log files to S3. An Airflow job was scheduled to load these logs into Redshift.
A single hotel transaction would write data to several databases, and our analysts were joining these tables repeatedly, so we wrote a Spark ETL job to create flat tables. These flat tables have ~250 columns and almost all the required information for a hotel transaction. We run these ETL jobs every 5 minutes to give the business a near-real-time view.
We use Redash as the visualization tool.
It’s possible to build this data platform in 3 months’ time. The cost of this data platform for ~25 TB of data would be about $2,250 a month.
We learnt one thing from this experience: a lot of people genuinely love doing analysis, provided it’s easy to do.
Our flat tables made it easy and fast for people to do analysis, as they don’t have to join tables.
In late 2016, Redshift came out with the Python UDF feature, and our finance team learnt Python to move their tax computations and other formulas into functions.
The marketing team built user targeting and user engagement on top of Redshift.
In this particular case, we have two versions of the Employees table: the top one is the table at the 12th hour and the bottom one is the table at the 13th hour.
Let’s imagine that we have our data synced till the 12th hour.
At the 13th hour, our job would ask for the data from the 12th hour to the 13th hour. We would receive only the record with id=3; we wouldn’t receive the changed id=2 record because its updated_at didn’t change. And deleted records never get removed from the warehouse, because they are simply no longer present in the source database.
So sometimes we would miss updates, and we would always lose deletes.
The solution was to read the changelog ( the binlog in the case of MySQL and the oplog for MongoDB ).
Debezium was in a nascent stage around early 2017; we had to rewrite a significant part of the MongoDB Debezium connector and fix a few issues in the MySQL Debezium connector to make it production ready.
Debezium reads the binlog/oplog and creates one Kafka topic per MySQL table or MongoDB collection.
We keep these topics log compacted. The advantage of this setup is that our jobs can use Kafka to read the complete data and never hit the MySQL/MongoDB clusters, which saves the infra cost of keeping additional MySQL or MongoDB slaves for the data platform.
We built kShift, a tool written in Scala and Spark, to pull data from Kafka and sync it to Redshift.
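As a rough illustration of the log-compacted, one-topic-per-table setup (using the kafka-python client; the topic name, partition count and configs are hypothetical):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="kafka1:9092")

topic = NewTopic(
    name="cdc.hoteldb.employees",           # one topic per table/collection
    num_partitions=6,
    replication_factor=3,
    topic_configs={
        "cleanup.policy": "compact",         # keep the latest record per key
        "min.compaction.lag.ms": "3600000",  # give consumers time to see updates
    },
)
admin.create_topics([topic])

# Debezium keys each message by the row's primary key, so after compaction the
# topic still contains the latest state of every row: a full copy of the table
# that jobs can read without touching the MySQL/MongoDB clusters.
```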
Use Airflow instead of crontabs: it gives us retries in case of failures, SLA-miss alerts and job-failure alerts.
Treat one job failure alert like an entire data platform failure. Have an SLA of <1 hour for fixing data-sync issues.
Bad queries can take the entire warehouse down; don’t handle this manually. Redshift has automatic query termination rules; if your warehouse doesn’t support them, write an Airflow job to kill long-running queries.
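A minimal sketch of such a query-killing job, assuming psycopg2 and hypothetical connection details and thresholds:

```python
import psycopg2

MAX_RUNTIME_SECONDS = 900  # hypothetical threshold

conn = psycopg2.connect(host="redshift-host", port=5439,
                        dbname="warehouse", user="admin", password="...")
conn.autocommit = True

with conn.cursor() as cur:
    # stv_recents lists queries currently running on the Redshift cluster.
    cur.execute("""
        SELECT pid
        FROM stv_recents
        WHERE status = 'Running'
          AND duration > %s * 1000000   -- duration is in microseconds
    """, (MAX_RUNTIME_SECONDS,))
    for (pid,) in cur.fetchall():
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```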
Take tool accuracy seriously; write integration test cases.
Data consistency goes for a toss when multiple instances of the same job sync the same table.
To solve this problem, we use ZooKeeper-based distributed locks to ensure that this doesn’t happen.
It’s possible to build a distributed lock using DynamoDB or Redis as well.
Commit the offset, and the next time the job runs, start from the committed offset. This is a lot more fault tolerant than pulling the last hour’s data.
Interestingly, one issue with Redshift is that scaling up without significant downtime requires doubling the number of nodes. This means that after an 18-node cluster, the next scale-up is 36 nodes.
So one day we get the Redshift-full alert, and we don’t have approval for a 36-node scale-up!
Around this time, Redshift added the capability of querying data on S3 in Parquet and ORC formats.
We figured out that ~75% of the data in Redshift was events data, and events are immutable.
We decided to move our events from Redshift to S3 and access them as Parquet.
Every job run would produce small files, and Spectrum performs really badly on small files.
It isn’t possible to merge these files without having duplicate data or inconsistency for some time.
Every job failure would cause data duplication.
Let’s understand the data duplication a bit more.
At an architectural level, the core reason behind this issue is that Parquet + S3 doesn’t provide transactional properties across files and partitions.
In other words, we can’t say: either write these 20 files across 2 partitions and add them to Redshift, or don’t write anything at all.
We found the solution to this problem in the Delta Lake format.
An important thing to note is that our Delta to Redshift connector just performs a metadata operation; it doesn’t copy the data. Even for tables of 1 TB in size, the Delta To Spectrum connector doesn’t take more than 1-2 seconds.