Data Platform from Scratch
By Goibibo-MakeMyTrip Data Team
What do you like about the Taj Mahal?
Please appreciate the platform. :-)
What and Why of the Data Platform?
● 20+ MySQL clusters
● 18+ MongoDB clusters
● 3 Cassandra clusters
All of these sources feed the Data Platform.
Data Platform For
● Data Platform for Analysts
● Data Platform for Data Scientists
● Data Platform for Backend Applications
● Data Platform for Streaming Applications
Data Platform for Analysts
SDP: Simple Data Platform (2016)
Diagram components: Backend, Events, SqlShift, mShift, cShift, ETL for Flat Tables
Monthly Infra Cost: $2,250-$2,750
● Redshift cost
○ 6 × ds2.xlarge => 12 TB compressed storage
○ ~25 TB of raw data
○ $1,380 per month
● 2 × i3.4xlarge for the Spark cluster
○ 32 cores
○ 244 GB RAM
○ 8 TB of SSD storage for logs
○ $995 per month
● S3 + Other cost
○ $100-$700 per month
Business impact in the first 6 months:
● >5000 Redash queries in 6 months
● 100+ hourly/daily emailers
● 50k queries per day on Redshift
● Finance team used it for vendor payouts and reconciliation
● Marketing team started using it for user targeting
Issue: Inconsistent data
Employees table at 2020-01-01 12:00
Id   Name      updated_at
1    Sunny     2020-01-01 11:30:00
2    Sam       2020-01-01 11:45:00

Employees table at 2020-01-01 13:00
Id   Name      updated_at
1    Sunny     2020-01-01 11:30:00
2    Nitin     2020-01-01 11:45:00
3    Saurabh   2020-01-01 12:10:00
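To make the issue concrete, here is a minimal sketch of an incremental sync that pulls rows by updated_at, using the two table snapshots from the slide. The function names (incremental_pull, sync) are hypothetical and only illustrate the idea; this is not the actual SqlShift code. The last part uses a hypothetical variant of the 13:00 snapshot in which row 1 has been deleted from the source, to show that a delete never reaches the warehouse.

```python
from datetime import datetime


def ts(text):
    """Parse a timestamp in the format used on the slide."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Source table snapshots from the slide: id -> (name, updated_at)
source_at_12 = {
    1: ("Sunny", ts("2020-01-01 11:30:00")),
    2: ("Sam", ts("2020-01-01 11:45:00")),
}
source_at_13 = {
    1: ("Sunny", ts("2020-01-01 11:30:00")),
    2: ("Nitin", ts("2020-01-01 11:45:00")),  # name changed, updated_at did not
    3: ("Saurabh", ts("2020-01-01 12:10:00")),
}


def incremental_pull(source, since):
    """Return only the rows whose updated_at is later than the last sync time."""
    return {key: row for key, row in source.items() if row[1] > since}


def sync(warehouse, source, since):
    """Upsert the pulled rows into a copy of the warehouse table."""
    result = dict(warehouse)
    result.update(incremental_pull(source, since))
    return result


last_sync = ts("2020-01-01 12:00:00")
warehouse_at_12 = dict(source_at_12)  # warehouse is fully synced at 12:00

# Missed update: only id 3 is pulled, so id 2 stays "Sam" in the warehouse.
warehouse_at_13 = sync(warehouse_at_12, source_at_13, last_sync)

# Hypothetical variant: row 1 was deleted from the source before 13:00.
source_with_delete = {k: v for k, v in source_at_13.items() if k != 1}
warehouse_after_delete = sync(warehouse_at_12, source_with_delete, last_sync)  # still has id 1
```

A timestamp filter can only see rows that exist and whose timestamp moved, which is why reading the database changelog is the more reliable source.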
The simple data-platform + CDC + Kafka ( 2017 )
Diagram components: Backend, Log Compacted Topics, kShift
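A rough sketch of why a log-compacted changelog fixes the missed updates and lost deletes: replaying the change events per primary key leaves the latest value for each key, and a delete arrives as an event with an empty value (a tombstone). The event shape and the replay function below are simplified assumptions for illustration; they are not Debezium's event format or kShift's code.

```python
def replay(changelog):
    """Rebuild the table state from (key, value) change events.

    A value of None is treated as a tombstone, i.e. the row was deleted.
    """
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)  # delete: drop the key if it is present
        else:
            state[key] = value    # insert or update: latest value wins
    return state


# Hypothetical change events for the Employees table, in commit order.
changelog = [
    (1, {"name": "Sunny"}),
    (2, {"name": "Sam"}),
    (2, {"name": "Nitin"}),    # update captured even though updated_at is unchanged
    (3, {"name": "Saurabh"}),
    (1, None),                 # delete captured as a tombstone
]

table = replay(changelog)
```

Because every change is an event in the log, the sync job no longer depends on an updated_at column being maintained correctly by the application.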
Best Practices to make SDP a success!
● Rely on Airflow for job scheduling and retries.
○ Crontabs are highly unreliable! Don’t use them for anything!
● Alerting and monitoring
○ Alert calls/emails for job failures. A single job failure is a complete warehouse failure.
○ An SLA miss needs an alert as well.
● Set up WLM / query termination rules.
● Write unit and integration test cases to ensure the tools are accurate.
● Distributed locks
○ Only one application at a time can sync a table.
● Offset management
○ Commit offsets and resume from the last committed offset, instead of pulling the last hour's/day's data.
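The offset-management practice can be sketched as follows. This is a minimal in-memory model, assuming a hypothetical OffsetStore and run_sync helper; a real job would keep offsets in a durable store and read from Kafka rather than from a Python list.

```python
class OffsetStore:
    """Hypothetical store for the last committed offset of each job."""

    def __init__(self):
        self._offsets = {}

    def get(self, job):
        return self._offsets.get(job, 0)

    def commit(self, job, offset):
        self._offsets[job] = offset


def run_sync(job, log, store, sink, batch_size=2):
    """Process one batch starting at the committed offset; return the batch size."""
    start = store.get(job)
    batch = log[start:start + batch_size]
    if not batch:
        return 0
    sink.extend(batch)                     # write the batch to the warehouse first
    store.commit(job, start + len(batch))  # commit only after the write succeeded
    return len(batch)


log = ["e0", "e1", "e2", "e3", "e4"]
store = OffsetStore()
sink = []

# Each run resumes exactly where the previous one stopped.
while run_sync("employees_sync", log, store, sink) > 0:
    pass

# A late run picks up only the new events, however long the job was down.
log.append("e5")
run_sync("employees_sync", log, store, sink)
```

Unlike "pull the last hour", a job that was down for three hours simply resumes from its offset. Note that writing before committing gives at-least-once behaviour, so a crash between the two steps can replay a batch.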
● This setup worked for us from 2016 to 2018.
● We scaled our Redshift cluster from 3 nodes (6 TB storage) to 18 nodes (36 TB storage).
● 4000+ tables.
And we don’t have the approval for Scale-up!
Redshift Spectrum & Immutable data
● ~75% of the data in Redshift was events data.
● Events are immutable.
● Queries on Parquet were 5-10% faster than queries on ORC.
S3 + Spectrum to the rescue
Diagram components: Backend, kShift, immutable events data, Redshift Spectrum
S3 + Spectrum to the rescue, but it’s broken.
Diagram components: Backend, kShift, immutable events data, Redshift Spectrum
Issues:
1. Small files
2. Failure causes data duplication
How does Spark write to S3 + Spectrum?
1. Write Parquet files to S3
2. Add the partitions to Redshift
3. Commit offsets
A failure after step 1 or after step 2 causes duplicate data, because the offsets were never committed and the retry writes the same batch again.
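The duplication can be reproduced with a small simulation of the three steps. Everything here (write_batch, the in-memory lists standing in for S3, the partition catalog and the offset store) is a hypothetical stand-in for illustration, not the actual kShift code.

```python
class SimulatedFailure(Exception):
    """Raised to imitate a job crash between two steps."""


def write_batch(batch, s3_files, partitions, offsets, fail_after=None):
    # Step 1: write the Parquet file to S3 (modelled as appending a list of rows).
    s3_files.append(list(batch["rows"]))
    if fail_after == 1:
        raise SimulatedFailure("crashed after writing files")
    # Step 2: add the partition to Redshift (modelled as a set of partition names).
    partitions.add(batch["partition"])
    if fail_after == 2:
        raise SimulatedFailure("crashed after adding partitions")
    # Step 3: commit the offsets.
    offsets["committed"] = batch["end_offset"]


batch = {"rows": ["e1", "e2"], "partition": "dt=2020-01-01", "end_offset": 2}
s3_files, partitions, offsets = [], set(), {"committed": 0}

# First attempt crashes after step 2, so the offset is still 0.
try:
    write_batch(batch, s3_files, partitions, offsets, fail_after=2)
except SimulatedFailure:
    pass

# The retry starts from the uncommitted offset and writes the same rows again.
write_batch(batch, s3_files, partitions, offsets)

# Spectrum reads every file under the partition, so both copies are visible.
visible_rows = [row for file_rows in s3_files for row in file_rows]
```

The three steps touch three separate systems, so there is no single point at which the whole batch either becomes visible or does not.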
Delta Lake = Parquet + Two-Phase-Commit
● Delta is a storage format which internally uses Parquet as a file format.
● Provides ACID guarantees
● Provides a way to merge small files
● Provides insert, update, delete support
● Works well with Spark
● Provides schema evolution support
● For more info on the Delta to Spectrum converter, please check out our blog
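One way to picture why a transaction log removes the duplicates is the toy model below: data files may be written at any time, but readers only see files listed in a committed log entry, and the offset is recorded in that same entry. This is a simplified sketch under those assumptions; ToyTable and its methods are hypothetical and do not reflect Delta Lake's actual protocol or API.

```python
class SimulatedFailure(Exception):
    """Raised to imitate a job crash before the commit."""


class ToyTable:
    def __init__(self):
        self.files = {}  # file name -> rows; written, but not necessarily visible
        self.log = []    # committed entries: {"files": [...], "end_offset": n}

    def committed_offset(self):
        return self.log[-1]["end_offset"] if self.log else 0

    def write(self, name, rows, end_offset, fail_before_commit=False):
        self.files[name] = list(rows)  # stage the data file
        if fail_before_commit:
            raise SimulatedFailure("crashed before the commit")
        # Single atomic step: the files and the offset become visible together.
        self.log.append({"files": [name], "end_offset": end_offset})

    def read(self):
        """Readers only see rows from files referenced by the log."""
        return [
            row
            for entry in self.log
            for name in entry["files"]
            for row in self.files[name]
        ]


table = ToyTable()

# First attempt crashes after staging the file; nothing is committed.
try:
    table.write("part-0001", ["e1", "e2"], end_offset=2, fail_before_commit=True)
except SimulatedFailure:
    pass

# The retry resumes from the committed offset (still 0) and commits a new file.
assert table.committed_offset() == 0
table.write("part-0002", ["e1", "e2"], end_offset=2)

rows = table.read()  # the orphaned part-0001 file is never visible to readers
```

The orphaned file from the failed attempt still occupies storage, but it is invisible to queries and can be cleaned up later.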
Delta on S3 + Redshift Spectrum
Diagram components: Backend, kShift, Delta, Delta To Spectrum, Redshift Spectrum, immutable events data
Data Platform for Data Scientists (2019)
● Data analysts need an aggregation engine.
● Data scientists need raw data.
● Redshift isn’t great with frequent >50 GB unloads.
● Solution: Keep both mutable and immutable data in S3 in the Delta format.
Diagram components: Delta, On-Demand Spark Clusters
The new Ingestion Architecture (2019)
Diagram components: Backend, Log Compacted Topics, Delta, Delta To Spectrum
Data Platform for the Backend folks
● The pipe dream of backend engineers -> Redshift/data lake as an OLTP DB.
● All the data in one place.
● Only the data lake has:
○ Click-stream data
○ Events data
○ Flat table data
● Problem: How can we provide them all the events data in <10 ms?
● Problem: How can we serve multiple types of events in <10 ms?
EventStore architecture
Diagram components: Backend, Log Compacted Topics, Avro, Schema Registry
Next stuff
● Highly resource-efficient and user-friendly real-time user segmentation.
● We are excited about streaming democratization
○ Resource-efficient and fully self-service alternatives to Flink SQL or KSQL
○ Excited about the Materialize project ( https://github.com/MaterializeInc/materialize )
■ Streaming SQL engine built with Rust.
● We are also super excited about edge analytics
○ https://github.com/cwida/duckdb
● We are interested in making Delta Lake updates and deletes faster.

Data platform from scratch (Sunny Shah - MakeMyTrip/GoIbibo)


Editor's Notes

  1. Come on guys, please appreciate the platform as well. The Taj Mahal is built on the banks of the Yamuna. It isn’t possible to construct such a long-lasting building on sand. The foundation is hundreds of meters deep, and it’s built with wood, rubble and iron. OK, now we all appreciate the platform. So let’s understand what a data platform is, why we need one and how to build it from scratch.
  2. Thanks to 50+ backend microservices, we have our data stored in 20+ MySQL clusters, 18+ MongoDB clusters, 3 Cassandra clusters, more than 2 TB of data in 50+ DynamoDB tables, and an Aerospike cluster. The backend pushes data to Kafka, with 800+ topics and >1 GBPS throughput. From the front end we get data in GA and Segment.io.
  3. The data platform pulls data from all these diverse sources and gives a unified view of the data to data analysts, data scientists, backend applications and streaming applications.
  4. My team members and I were lucky to get the opportunity to build the data platform from scratch for Goibibo, InGoMMT ( Goibibo-MakeMyTrip Supply Platform ), HotelSimply ( Hotel Management System ) and the Goibibo-MakeMyTrip common services.
  5. The agenda of the talk is to tell you how we built the data platform for analysts, the data platform for data scientists from
  6. This is our first data platform. We were on AWS, so we chose Redshift as our data warehouse. We built Spark tools to pull data in parallel from MySQL, MongoDB and Cassandra into Redshift. For MySQL to Redshift, we built SqlShift; it can pull data incrementally or do a full dump. These Spark jobs ensured that Redshift had the data from our backend databases. The next thing we wanted in our data warehouse was page-visit and events data. We asked our backend teams to write events to separate log files, and every hour we would upload the log files to S3. An Airflow job was scheduled to load these logs into Redshift. A single hotel transaction would write data to several databases, and our analysts were joining these tables repeatedly, so we wrote a Spark ETL job to create flat tables. These flat tables have ~250 columns and almost all the required information for a hotel transaction. We would run these ETL jobs every 5 minutes to give the business a near-real-time view. We use Redash as the visualization tool. It’s possible to build this data platform in 3 months, and its cost for ~25 TB of data would be about $2,250 per month.
  7. We learnt one thing from this experience: a lot of people genuinely love doing analysis, provided it’s easy to do. Our flat tables made analysis easy and fast, because people don’t have to join tables. In late 2016, Redshift came out with the Python UDF feature, and our finance team learnt Python to move their tax computation and other formulas into functions. The marketing team built user targeting and user engagement on top of Redshift.
  8. In this particular case, we have two versions of the Employees table: the top one is the table at the 12th hour and the bottom one is the table at the 13th hour. Let's imagine that our data is synced up to the 12th hour. At the 13th hour, our job asks for the data from the 12th hour to the 13th hour. We will receive only the record with id=3. We won’t receive the changed id=2 record, because updated_at didn’t change for it. And the id=1 record won’t get deleted from the warehouse, because it’s just not there in the source database anymore. So sometimes we would miss updates, and we would always lose deletes. The solution was to read the changelog ( the binlog in the case of MySQL and the oplog for MongoDB ).
  9. Debezium was at a nascent stage in early 2017. We had to rewrite a significant part of the MongoDB Debezium connector and fix a few issues in the MySQL Debezium connector to make them production ready. Debezium reads the binlog/oplog and creates one Kafka topic per MySQL table or MongoDB collection. We keep these topics log compacted. The advantage of this setup is that our jobs can read the complete data from Kafka and never hit the MySQL/MongoDB clusters. This saves the infra cost of keeping additional MySQL or MongoDB replicas for the data platform. We had to build kShift, a tool written in Scala and Spark that pulls data from Kafka and syncs it to Redshift.
  10. Use Airflow instead of crontabs. It gives us retries in case of failures, SLA-miss alerts and job-failure alerts. Treat one job-failure alert like an entire data platform failure. Have an SLA of <1 hour for fixing data-sync issues.
  11. Bad queries can take the entire warehouse down; don’t handle this manually. Redshift has automatic query termination rules; if your warehouse doesn’t support them, write an Airflow job to kill long-running queries. Take tool accuracy seriously and write integration test cases. Data consistency goes for a toss when multiple instances of the same job sync the same table. To prevent this, we use ZooKeeper-based distributed locks. It’s possible to build a distributed lock using DynamoDB or Redis as well. Commit the offset, and the next time the job runs, start from the committed offset. This is a lot more fault tolerant than pulling the last hour’s data.
  12. Interestingly, one issue with Redshift is that scaling it up without significant downtime requires doubling the number of nodes. This means that after an 18-node cluster, the next scale-up would be to 36 nodes. So one day we get the Redshift-full alert, and we don’t have the approval for a 36-node scale-up!
  13. Around this time, Redshift added the capability to query data stored in S3 in the Parquet and ORC formats. We figured out that ~75% of the data in Redshift was events data, and events are immutable. We decided to move our events from Redshift to S3, store them as Parquet and access them through Spectrum.
  14. Every job run would produce small files, and Spectrum performs really badly on small files. It isn’t possible to merge these files without having duplicate or inconsistent data for some time. Every job failure would cause data duplication. Let’s understand the data duplication a bit more.
  15. At an architectural level, the core reason behind this issue is that Parquet + S3 doesn’t have transactional properties across files and partitions. In other words, we can’t say: either write these 20 files in 2 partitions and add them to Redshift, or write nothing. We found the solution to this problem in the Delta Lake file format.
  16. The important thing to note is that our Delta To Spectrum connector performs only a metadata operation; it doesn’t copy the data. Even for tables of size 1 TB, the connector doesn’t take more than 1-2 seconds.