Mydbops Opensource Database Meetup -13.
Flipkart built and scaled one of the largest managed MySQL platforms, with 700+ MySQL clusters, 2500+ VMs, and a 1000TB+ storage footprint. The presentation walks through the journey of problems solved by the platform, along with its features and architecture.
Scaling managed MySQL Platform in Flipkart - (Sachin Japate - Flipkart) - Mydbops 13th Opensource Database Meetup
1. Scaling Managed MySQL Platform in
Flipkart
The story of how flipkart.com manages its massive MySQL fleets
Sachin Japate
LEAD SRE
India's largest e-commerce player
2. 400 Million Registered Users
10 Million Daily Page Visits
8 Million Shipments per month
100,000 Sellers
22 state-of-the-art warehouses
3 On-Prem Data Centers
3. Sachin Japate
Lead SRE / MySQL SME @ Flipkart
9+ Years at Flipkart
Managed MySQL and D-SQL Platform Teams
Flipkart Group
4. Tech Landscape
At the heart of every e-commerce business is an incredibly complex transactional network of microservices, such as Order Management, Supply Chain, Logistics, and Seller Management, that have strong consistency requirements.
A wide variety of tech stacks power the different microservices that keep the e-commerce systems running seamlessly.
● MySQL is the most common data store, used by over 70% of our systems.
● Other datastores include Redis, Elasticsearch, HBase, MongoDB, ZooKeeper, TiDB, Cassandra, etc.
● The hot-store transactional footprint is over 2 Petabytes.
Overview of Databases @ Flipkart
5. Flipkart Cloud Platform
3 state-of-the-art on-prem Data Centers in India
Two in Chennai and one in Hyderabad (powered by renewable energy).
Customized Hardware
Customized hardware for mission-critical computing, storage, artificial intelligence, and machine learning capabilities, backed by an ultra-low-latency network.
VMs and their choices
● Compute / Memory / Storage optimized instance types
● Various generations of hardware (cores, disk, memory)
● Storage flavours: local HDDs / SSDs / JBODs / network-attached storage
● Custom cuts for very specific use cases
Robust Design
All Data Centers are built for security, scale, elasticity, and multi-zone resilience, with custom-designed racks and intelligent power and cooling.
Hybrid Cloud
A hybrid setup with Google Cloud Platform for bursting into the public cloud. Why?
6. The BIG Problem
Developer Productivity
Developer productivity took a major hit. Every team using MySQL needed to invest heavily in:
● Developer bandwidth
● Best-practice adoption
● DB tuning
● Time spent on ops (solutioning / setup / maintenance, backup, migration)
● Overdependence on MySQL specialists
● Tribal-knowledge risks
Policy Enforcement Challenges
Enforcing security and auditing policies in a decentralised model meant heavy program management and a far longer time to reach the desired state.
Core vs Context
As a result, teams were finding it increasingly difficult to focus on their core business products, as a lot of time was instead being spent on managing these underlying technology stacks.
7. Enter Altair
★ DBaaS for MySQL built on top of Flipkart Cloud Platform (in-house)
★ Offered a seamless MySQL provisioning / maintenance / cluster
management experience
★ Abstracted infrastructure provisioning with complete platform service
integration
★ Systematically solved Flipkart's MySQL challenges
Let's see how this was achieved and what challenges came along!
Flipkart's in-house DBaaS
8. Challenge #1
The Time Challenge
“How do we reduce the overall time to create a MySQL cluster?”
Engineers first had to get hardware funded, then create a VM via the CLI, figure out all the permissions, install MySQL and its dependent libraries, and work out the data-import process, relying mainly on documentation that could be out of date. This process typically took almost a day.
Here's what we did:
Removed the need to provision infrastructure and to install and maintain MySQL software; everything was now under the hood.
Built a self-serve user interface and pre-provisioned all accounts so that no manual operations were needed.
Altair took teams from project conception to deployment with a target of under 2 minutes to provision production-grade MySQL on :3306. Behind the scenes, all the integrations with Cloud Services happened in a jiffy.
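The provisioning flow described above can be thought of as a fixed, repeatable pipeline. The following is a minimal illustrative sketch, not Altair's actual code; the step names are hypothetical:

```python
# Illustrative sketch: an idempotent provisioning pipeline, so the platform
# (not the engineer) owns VM creation, installs, and wiring. Safe to re-run
# after a partial failure because completed steps are skipped.
from dataclasses import dataclass, field

STEPS = [
    "allocate_vms",           # pick an instance type, place the VMs
    "install_mysql",          # MySQL + dependent libraries
    "configure_replication",  # wire up Source and replicas
    "register_dns",           # cluster reachable on a stable name, port 3306
    "create_accounts",        # pre-provisioned service/user accounts
]

@dataclass
class ProvisionRequest:
    cluster: str
    completed: list = field(default_factory=list)

def provision(req: ProvisionRequest) -> list:
    """Run every pending step exactly once, in order."""
    for step in STEPS:
        if step not in req.completed:   # idempotency: skip finished steps
            req.completed.append(step)  # real code would execute the step here
    return req.completed

req = ProvisionRequest("orders-db", completed=["allocate_vms"])
print(provision(req))
```

Re-running `provision` on the same request is a no-op, which is what makes a "target of under 2 minutes" realistic: failures retry from where they stopped instead of starting over.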
9. Challenge #2
The High Adoption Challenge
“How do we ensure adoption is high?”
Most of Flipkart was on MySQL 5.6 and 5.7. Teams feared the move (losing control of their MySQL databases to a different team) and came up with various reasons not to onboard.
Here's what we did:
Handheld some of the largest teams and moved them to Altair.
Built a seamless cluster-migration flow.
Drove an internal program encouraging teams to move their Stage/Dev/NFR clusters to Altair.
Eventually teams started moving their production clusters to Altair, and they haven't moved out since!
10. Challenge #3
The High Security Challenge
“How do we ensure tight security controls?”
Teams were using insecure versions of MySQL, installing scripts on the DB box, sharing root credentials openly, and not paying much attention to security controls.
Here's what we did:
Completely blocked all SSH access for everyone, including the owners of the MySQL clusters; only the central team had access.
Completely stopped handing out SUPER/admin privileges to MySQL users.
Differentiated between human and machine access: service accounts for apps, while humans went through an approval-based system for controlled, time-bound access to MySQL.
No more spurious scripts and nondescript crons running on MySQL boxes. Only certain limited privileges were now available to MySQL users, and the internal databases were accessible only by root.
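The access model above can be sketched in a few lines. This is an illustrative sketch under assumed policy values, not Altair's implementation; the account names, privilege lists, and TTL are made up:

```python
# Sketch of the access model: service accounts get a fixed, limited grant set
# (never SUPER/admin), while human access is approval-based and time-bound.
from datetime import datetime, timedelta

APP_PRIVILEGES = ["SELECT", "INSERT", "UPDATE", "DELETE", "EXECUTE"]
BLOCKED = {"SUPER", "FILE", "SHUTDOWN", "GRANT OPTION"}  # never handed out

def app_grant(user: str, schema: str) -> str:
    """Build the GRANT statement for a machine (service) account."""
    privs = ", ".join(p for p in APP_PRIVILEGES if p not in BLOCKED)
    return f"GRANT {privs} ON `{schema}`.* TO '{user}'@'%';"

def human_access(approved_at: datetime, ttl_hours: int, now: datetime) -> bool:
    """Human credentials are valid only inside the approved window."""
    return approved_at <= now < approved_at + timedelta(hours=ttl_hours)

print(app_grant("orders_svc", "orders"))
t0 = datetime(2022, 1, 1, 10, 0)
print(human_access(t0, 2, t0 + timedelta(hours=1)))   # inside the window
print(human_access(t0, 2, t0 + timedelta(hours=3)))   # expired
```

The key design point is the `BLOCKED` set: no code path can ever emit an elevated grant, so "no SUPER for anyone" is enforced structurally rather than by review.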
11. Challenge #4
The Disaster Recovery and Business Continuity Challenge
“How do we ensure disaster recovery and business continuity planning?”
BCP/DR followed a decentralised model at Flipkart, which meant more program management, and not all teams paid close attention to it. In addition, the tooling had to be set up manually via the CLI.
Here's what we did:
Integrated with internal tooling that let teams define first-class RPOs and RTOs for their databases.
The tool ensured backups were taken regularly at predefined times, and it supported both INCR and FULL backups.
Backups were pushed to both near-site and far-site storage, from a dedicated backup node instead of an HS or RR node, so we could recover from DC-wide failures.
Built a self-serve way to restore the latest backup in either region, in addition to supporting multi-region MySQL clusters.
Eliminated Schrödinger backups ("the state of a backup is unknown unless a restore is performed on it") by regularly tracking backups that kept failing for various reasons and systematically fixing them under the hood.
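The backup policy above reduces to two small rules. A minimal sketch, assuming an illustrative weekly-FULL/daily-INCR cadence (the actual schedule is not stated in the talk):

```python
# Sketch of the backup policy: FULL once a week, INCR on the other days,
# and a backup only counts as "good" once a restore has verified it,
# which is exactly what eliminates Schrödinger backups.
def backup_kind(day_of_week: int, full_day: int = 6) -> str:
    """Pick FULL vs INCR for a given weekday (policy values illustrative)."""
    return "FULL" if day_of_week == full_day else "INCR"

def is_trusted(backup: dict) -> bool:
    """The state of a backup is unknown unless a restore is performed on it."""
    return backup["upload_ok"] and backup["restore_verified"]

print([backup_kind(d) for d in range(7)])
print(is_trusted({"upload_ok": True, "restore_verified": False}))
```

Note that `is_trusted` refuses to trust a backup that merely uploaded successfully; only a completed restore flips it to trusted.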
12. Challenge #5
The High Availability Challenge
“How do we ensure High Availability?”
High Availability was one of the most important challenges to solve at Flipkart. MySQL could go down late at night, and failover was manual, with config changes in apps (and restarts).
Here's what we did:
Built a ZooKeeper-based, highly available monitoring system that detected failures in seconds.
Developed an Auto-promote feature using well-tested recovery workflows that immediately kick-started the recovery process after thorough, deep checks for false positives.
Integrated with internal DNS and Floating IPs so the newly promoted Source remained accessible on the same DNS name.
This meant no more stopping apps, changing IP addresses, and restarting. Failover became just a blip in traffic, and the regular connection retries handled the DB failure just fine.
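The "deep checks for false positives" idea can be illustrated with a quorum-plus-streak rule. This is a simplified sketch of the detection logic only, not the real ZooKeeper-based workflow; the quorum and streak values are assumptions:

```python
# Sketch: declare the Source dead only when enough independent probers agree
# for several consecutive rounds, so a single network blip never triggers
# an auto-promote.
def source_is_down(probe_rounds, quorum=2, consecutive=3):
    """probe_rounds: list of rounds; each round is a list of bools,
    True meaning that prober failed to reach the Source."""
    streak = 0
    for round_ in probe_rounds:
        if sum(round_) >= quorum:   # enough probers saw a failure this round
            streak += 1
            if streak >= consecutive:
                return True         # sustained, agreed-upon failure
        else:
            streak = 0              # any healthy round resets the streak
    return False

blip = [[True, False, False], [False, False, False]]
outage = [[True, True, False]] * 3
print(source_is_down(blip))    # a single prober hiccup is ignored
print(source_is_down(outage))  # quorum failure, three rounds in a row
```

Only after this returns True would a recovery workflow promote a replica and repoint DNS/Floating IP, which is why clients see just a blip rather than an outage.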
13. Challenge #6
The DB Tuning Challenge
“How do we ensure databases are well tuned?”
DB tuning was not a well-understood problem: tuning MySQL's memory configuration needed specialised knowledge (an SME or DBA), which wouldn't scale.
Here's what we did:
Built an in-house variable-validation system covering combinations of about 50 variables, and a recommendation system that suggested values for the tunables based on the hardware and the MySQL version.
Set up an auto-restart for variables that required a MySQL restart, and differentiated tuning for Source and RR nodes.
Posted clear error messages for users who wanted to max out every parameter for the "best performance".
Formed a team of DBAs for tuning very specific and corner cases.
Teams became far more confident in their tuning; the settings were also saved in Altair, so they never had to worry about losing them.
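A recommendation system of this shape is essentially hardware-aware rules. The sketch below uses well-known MySQL rules of thumb (e.g. sizing the InnoDB buffer pool to roughly 70% of RAM on a dedicated host); the real system validated ~50 variables, and these three are only illustrative:

```python
# Sketch of a rule-based tuning recommender: take the VM's hardware profile
# and emit suggested values for a few key tunables. Rules of thumb only,
# not Altair's actual recommendation logic.
def recommend(ram_gb: int, cores: int) -> dict:
    return {
        # Dedicated DB host: give most of RAM to the buffer pool.
        "innodb_buffer_pool_size_gb": round(ram_gb * 0.7),
        # Multiple pool instances reduce mutex contention, capped at 8.
        "innodb_buffer_pool_instances": min(cores, 8),
        # A fixed ceiling so apps cannot exhaust server memory with sessions.
        "max_connections": 1000,
    }

print(recommend(64, 16))
```

The validation side would then reject user overrides that violate the combined constraints (for example, a buffer pool larger than physical RAM), which is where the "clear error messages" come in.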
14. Challenge #7
The Observability Challenge
“How do we ensure good observability despite the lack of ROOT access?”
There were no standard, deep dashboards for MySQL observability across teams; such dashboards are typically powered by metrics that need ROOT access, which we didn't intend to provide.
Here's what we did:
Standardised dashboards across the organisation and integrated with the OpenTSDB-based internal metric-monitoring system.
Pre-built deep Grafana dashboards at the MySQL-cluster level: overall cluster health, member health, MySQL-specific, InnoDB-specific, and System & Network dashboards. PMM was the benchmark here, and we have started work on supporting PMM.
Pre-created cluster-level alerts with recommended thresholds and frequencies, integrated event-based alerting tied directly to each team's on-call calendar, and separated customer alerts from Altair admin alerts.
Built auditing and event-logging on the cluster; users could download slow-query logs, error logs, etc. directly from the UI.
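Pre-created cluster-level alerts are just declarative rules evaluated against metric samples. A minimal sketch, in which the metric names, thresholds, and severities are invented for illustration:

```python
# Sketch of declarative, cluster-level alert rules with recommended
# thresholds; evaluating a metrics sample yields the alerts that fire.
RULES = [
    {"metric": "replica_lag_seconds", "threshold": 30,  "severity": "page"},
    {"metric": "disk_used_pct",       "threshold": 85,  "severity": "warn"},
    {"metric": "threads_connected",   "threshold": 900, "severity": "warn"},
]

def evaluate(sample: dict) -> list:
    """Return (metric, severity) pairs that fire for one metrics sample."""
    fired = []
    for rule in RULES:
        value = sample.get(rule["metric"], 0)   # missing metric -> no alert
        if value > rule["threshold"]:
            fired.append((rule["metric"], rule["severity"]))
    return fired

print(evaluate({"replica_lag_seconds": 120, "disk_used_pct": 40}))
```

Because the rules are data rather than code, pre-creating them for every new cluster (and routing `page` vs `warn` to different calendars) is a copy of a template, not bespoke work per team.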
15. Challenge #8
The Hardware Abstraction Challenge
“How do we abstract hardware problems away from the user?”
Hardware failures are common in any large fleet. Earlier, we tracked the hardware-maintenance schedule over email, which was cumbersome to remember, regulate, and reschedule.
Here's what we did:
Integrated with the hardware-maintenance schedule API (low-level APIs).
Created an internal Scheduled Maintenance mapped to the underlying hardware Scheduled Maintenance, which the client could reschedule to low-traffic hours.
Scheduled Maintenance moved VMs away from the affected mother ships well before the actual hardware-maintenance activity, and ensured a good FD (Failure Domain) distribution at the cluster level.
Built deep health checks to track various hardware problems and replace VMs for unplanned maintenance.
Teams benefited greatly from this feature and gained back significant time on their hands. Adoption also increased.
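"Good FD distribution" means no single failure domain holds enough of a cluster to take it out. A toy sketch of the placement idea, using a simple round-robin policy and made-up node/FD names:

```python
# Sketch: spread a cluster's VMs across failure domains (FDs) so one
# mother ship going down for maintenance never takes out most of a cluster.
def place(nodes, fds):
    """Assign each node to a failure domain, round-robin."""
    return {node: fds[i % len(fds)] for i, node in enumerate(nodes)}

def max_loss_on_fd_failure(placement):
    """Worst-case number of nodes lost if any single FD fails."""
    counts = {}
    for fd in placement.values():
        counts[fd] = counts.get(fd, 0) + 1
    return max(counts.values())

p = place(["source", "replica1", "replica2", "backup"], ["fd1", "fd2", "fd3"])
print(p)
print(max_loss_on_fd_failure(p))
```

Scheduled Maintenance then becomes mechanical: for each affected FD, migrate only the nodes placed there, knowing the rest of the cluster stays up.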
16. Challenge #9
The Feature Compatibility without ROOT Challenge
“How do we achieve feature compatibility without ROOT access?”
Teams were using common features that needed ROOT/elevated access. Altair had to bridge that gap for successful onboarding without increasing our on-call load.
Here's what we did:
Built interfaces to CRUD databases and users.
Pre-created stored procedures that allowed viewing debugging information without ROOT access.
Added automatic binlog trimming, plus binlog and GTID streaming.
Supported custom topologies by scaling out read replicas and adding/removing HS/Backup nodes.
Handled disk divergence between Primary and Replicas automatically, and added auto durability settings (to reduce replica lag).
Supported upgrading and downgrading MySQL clusters before sale events, Migration & Cutover, and user and DB creation from the UI. So far, nobody has complained about losing ROOT privileges!
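Automatic binlog trimming has one safety rule worth spelling out: never purge a file that a replica or the binlog-streaming backup still needs, even if it is older than the retention window. A sketch with invented file names and retention values:

```python
# Sketch of automatic binlog trimming: purge binlogs beyond the retention
# window, but never past the oldest file still needed by a replica or by
# the binlog-streaming backup.
def purge_upto(binlogs, retention_keep, oldest_needed):
    """binlogs: ordered oldest -> newest. Keep the newest `retention_keep`
    files AND everything from `oldest_needed` onward; return what is safe
    to purge."""
    cutoff = max(0, len(binlogs) - retention_keep)  # retention boundary
    needed = binlogs.index(oldest_needed)           # consumer boundary
    keep_from = min(cutoff, needed)                 # the stricter of the two
    return binlogs[:keep_from]                      # safe to purge

logs = [f"binlog.{n:06d}" for n in range(1, 8)]
print(purge_upto(logs, retention_keep=3, oldest_needed="binlog.000003"))
```

Taking the minimum of the two boundaries is what makes trimming safe to run unattended: a lagging replica automatically pauses purging instead of being broken by it.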
17. Stats
● 700+ Clusters (across 128 teams)
● 2500+ VMs (across CH and HYD)
● 1.5 Petabyte footprint
● 3500+ Failure Recoveries (planned and unplanned failures of all nodes)
● 400+ Auto Failovers (planned and unplanned failures of the Source node)
● 500+ Live Migrations (existing clusters to Altair, and Altair-to-Altair)
● 8000+ Dashboards
● An 8-member team
19. What Next?
Building for the future
● K8s StatefulSet support
● GCP support
● Compute Storage Segregation
MySQL upgrades
● MySQL v8.0 support
● Semi-Sync Replication
● Bidirectional Replication
Open Sourcing Track
We have started work on open-sourcing Altair as a Kubernetes operator, starting with MySQL.