As more businesses move from enterprise IT solutions to web-scale cloud solutions to meet growing customer needs, they must find innovative ways for their applications and infrastructure to scale rapidly and remain highly available.
High availability is an important requirement for any online business, and the key to success is to architect around failures: expect infrastructure to fail, and remain highly available even then. One such effort here at Netflix was the Active-Active implementation, which provided region resiliency. This presentation gives a brief overview of the Active-Active implementation and how it leveraged Cassandra’s architecture in the backend to achieve its goal. It covers our journey through Active-Active (A-A) from Cassandra’s perspective, including the data validation we did to prove the backend would work without impacting customer experience, and the various problems we ran into, such as long repair times and gc grace settings. It closes with our lessons learnt and what we would do differently next time around.