Hi everyone, welcome to Netflix! And welcome to our second Billing & Payments Engineering Meetup. I’m happy to see that so many of you are attending tonight. If you missed the first meetup, a recap is available on our tech blog.
There will be 5 presentations this evening, given by different teams within Netflix. We have a demo booth set up for you, and there are some food and drinks in the back of the theater; feel free to go there and grab something. The only thing I’ll ask of you is to be respectful of the presentations and not make too much noise. There will be time for networking when the presentations are over.
Alright, let’s start. Hi everyone, my name is Mat, and I run the Payments Engineering team here at Netflix. Tonight, I’ll be talking about how we re-engineered our payment system to run in the cloud.
First of all, a few numbers. Netflix has more than 57M subscribers worldwide, across around 50 countries.
In terms of payments, Netflix supports 12 currencies and 9 different payment types. We integrate with more than 15 payment processors and verification services. On average, if we do the math, we process around 2 million transactions per day.
And counting! As Netflix continues to expand internationally, we will support more processors, more payment types, and so on.
The very first responsibility of the payments application is to store, securely and in encrypted form, your method of payment. We call it your MOP. It is also responsible for connecting to all the processors and payment verification services. We also have a lot of batch processing running in the background. And finally, and this is really important for us, it has the responsibility of offering an agnostic interface to our internal clients inside Netflix. This way, it abstracts away from them all the complexity of the various payment types and payment flows.
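To make the idea of an agnostic interface concrete, here is a minimal, purely illustrative sketch. The real Netflix API is not public, so every type and method name below is a hypothetical stand-in; the point is only that internal clients code against one contract and never see which processor sits behind it.

```java
import java.util.UUID;

// Hypothetical processor contract: each integration (Worldpay, PayPal, ...)
// would implement this same interface.
interface PaymentProcessor {
    String charge(String mopToken, long amountCents, String currency);
}

// A fake processor for the sketch; a real one would call the third party.
class FakeProcessor implements PaymentProcessor {
    public String charge(String mopToken, long amountCents, String currency) {
        return UUID.randomUUID().toString();  // pretend transaction id
    }
}

class PaymentsService {
    private final PaymentProcessor processor;

    PaymentsService(PaymentProcessor processor) { this.processor = processor; }

    // Internal clients only see this method; the processor behind it is hidden.
    String charge(String mopToken, long amountCents, String currency) {
        return processor.charge(mopToken, amountCents, currency);
    }

    public static void main(String[] args) {
        PaymentsService svc = new PaymentsService(new FakeProcessor());
        System.out.println("transaction id: " + svc.charge("tok_123", 799, "USD"));
    }
}
```

Swapping in a different processor only means constructing the service with another implementation; client code is untouched.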
Our historical payments application has been running in our data center.
Integrating new payment types, and especially new payment flows, turned out to be painful in some cases, particularly when they deviated too much from the original design.
Finally, developers come and go, and they all brought their own contribution to the project. Over the years, the code came to look like sedimentary layers of legacy code. It was becoming harder and harder to maintain.
From a high-level architecture perspective, our payments app runs in the data center and connects to an Oracle database for storing the MOPs and the transactions. It also connects to our 3rd-party processors. We also have other batch applications doing a similar job. We still have some client applications running in the DC that are in the process of moving to the cloud, just like us; our payments app offers an API for them to talk to. For all the applications that are already running in the cloud, we expose a cloud proxy, connected to the DC through a tunnel. This proxy exposes the same API to our clients running in the cloud.
If you follow what Netflix does, you probably know that Netflix loves the cloud. It hasn’t always been like this. When Netflix was founded, back in 1997, the scale was completely different. By the way, I’ll let you appreciate our original logo, which I dug up for you. It’s a good example of old-fashioned web 1.0 artwork! Over the years, we moved from a monolithic DC application to a microservice architecture running in AWS. In 2013, once the cloud had matured, and Netflix’s expertise in the cloud had matured as well, we decided it was time to move critical applications, like the payments application, into the cloud.
In 2007, we started offering streaming. At the time, our application was monolithic. Over the next few years, we decided that in order to scale, and to offer the best quality of service to our growing number of subscribers, we had to rethink our architecture. This is when we started moving to a microservices architecture, and when we chose AWS as our infrastructure platform.
Before writing our new application, we had to do our due diligence. First of all in terms of compliance. We couldn’t possibly move to the cloud if it wasn’t PCI compliant.
We also wanted to rethink our division of labor. A token service was created, whose only responsibility is storing the MOP securely and returning a token associated with it. We’re also using CloudHSM as a secure storage solution for our encryption keys.
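A token service like the one just described can be sketched in a few lines. This is illustrative only: the real encryption keys live in CloudHSM, so a trivial placeholder "cipher" stands in here, and all class and method names are hypothetical.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative token service: store an encrypted MOP, hand back an opaque token.
class TokenService {
    private final Map<String, byte[]> vault = new HashMap<>();

    // Placeholder for real encryption with an HSM-managed key.
    private byte[] encrypt(String mop) {
        return Base64.getEncoder().encode(mop.getBytes());
    }

    String tokenize(String mop) {
        String token = UUID.randomUUID().toString();  // opaque, carries no card data
        vault.put(token, encrypt(mop));
        return token;
    }

    boolean exists(String token) { return vault.containsKey(token); }

    public static void main(String[] args) {
        TokenService ts = new TokenService();
        String token = ts.tokenize("4111111111111111");
        System.out.println("token: " + token + " stored: " + ts.exists(token));
    }
}
```

The key property is that only the token circulates between services; the encrypted MOP never leaves the vault.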
And finally, we also had to go through a lot of technical evaluation, especially in terms of database. We wanted to evaluate whether to keep a relational database in the cloud, or to choose a NoSQL solution.
So we decided to go with Cassandra, and to benefit from the expertise that Netflix has acquired using it over the years. Cassandra offers tunable consistency, meaning that you can adjust your desired level of consistency according to your needs. Another great advantage of Cassandra is its multi-regional support. It’s a great win when you want to keep your data synchronized. Are you familiar with the CAP theorem? Please raise your hand if you know what it is. Ok, so the CAP theorem states that it’s impossible for a distributed system to simultaneously provide all three guarantees: consistency, availability, and partition tolerance. With Cassandra, we decided to privilege consistency. After all, we’re a payments application; it’s more important to know that our data is consistent, even if we have to pay a cost in terms of latency. We also had to work on our data model: we had to rethink and denormalize it to make it fit Cassandra.
In terms of technologies, when we re-engineered our payments application, we took a step back and thought about what could be the proper foundation stones for a strong, extensible, and sustainable system. When we looked at our application, we realized it’s a workflow that fits perfectly with enterprise integration patterns. We evaluated several integration frameworks, and Apache Camel got our preference.
We decided to keep Spring Batch for our batch applications.
For the actual task of uplifting our data into the cloud, we chose Apache Storm.
Also, in the cloud, we’re now able to take full advantage of the Netflix open source frameworks as well as some other AWS services.
So the new architecture design looks like this. The cloud payments app runs natively in the cloud. It exposes an API to other client applications, which also run in the cloud. The app talks to the tokenizer to manage MOPs securely, and sends transactions to the processors. The transaction info is now stored in Cassandra. And no more DC: everything is running in the cloud!
As I said earlier, availability is really important to us. AWS is structured so that it offers several availability zones within a given region. At Netflix, all of our microservices are deployed in every availability zone. Our Cassandra ring is also deployed across all these zones, so we can maintain resiliency and keep our data synchronized. We even pushed it further and replicated the same configuration in several AWS regions, with the data kept synchronized by Cassandra’s replication mechanism. Now we can safely deploy our payments application in all the availability zones and be fully resilient.
In order to reach that final goal, we had to go through transition stages. We had to decouple our new implementation from the operational flow. We also wanted to keep a safety net in the DC: everything that is written in the cloud is also written in the DC, so in case anything goes south, we’re always able to roll back. Finally, we had to decide on our migration strategy. We decided to go country by country; this way we maintained a certain level of partitioning that helped us move to the cloud chunk by chunk.
For the decoupling: on the left is our historical architecture; on the right, the new one. The only thing we asked our clients to implement was to add a country code to every request they sent us. On our side, we developed routing logic that could route the traffic to the new architecture based on the country code we received.
Let’s take an example. Say US transactions are processed from the DC. If a US transaction comes to us from the cloud, it is sent to the DC for processing and stored in our Oracle database. If the same transaction is received from the DC, no problem either. Now, say French transactions are processed from the cloud. If we receive a French transaction from the cloud, it is still received by the same endpoint, then routed to the new architecture. If the same transaction is received from the DC, then it is the DC app that routes it to the new cloud architecture.
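The country-based routing described above boils down to a lookup table. Here is a hedged sketch of that decision, with the country assignments taken from the examples in the talk; the class, enum, and defaulting behavior are illustrative assumptions, not the actual implementation.

```java
import java.util.Map;

// Sketch: a request tagged with a country code is routed either to the
// legacy DC app or to the new cloud app.
class CountryRouter {
    enum Target { DC, CLOUD }

    // Countries already migrated go to CLOUD; everything else stays in the DC.
    private static final Map<String, Target> ROUTES = Map.of(
            "PY", Target.CLOUD,   // Paraguay: migrated first
            "FR", Target.CLOUD,   // France: migrated once its processors were ready
            "US", Target.DC);     // US: still processed from the DC

    static Target route(String countryCode) {
        // Defaulting to the DC keeps unmigrated countries on the safe path.
        return ROUTES.getOrDefault(countryCode, Target.DC);
    }

    public static void main(String[] args) {
        System.out.println("FR -> " + route("FR"));  // CLOUD
        System.out.println("US -> " + route("US"));  // DC
    }
}
```

Because both the DC app and the cloud endpoint apply the same table, a transaction ends up in the right place no matter which side first receives it.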
The way we implemented shadow writes was to introduce an SQS queue. Everything we write to Cassandra is also sent to the queue, then pulled from the queue and sent back to the DC. If we take the previous example of the French transaction, it is sent to both Cassandra and SQS, then pulled from the queue and stored in the Oracle database. This way, we always have up-to-date data in the DC.
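The shadow-write flow can be sketched as a dual write plus a queue consumer. In this illustrative stand-in, an in-memory queue plays the role of SQS and two maps play the roles of Cassandra and Oracle; all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of shadow writes: every record written to the cloud store is also
// published to a queue and later drained into the DC database.
class ShadowWriter {
    final Map<String, String> cassandra = new HashMap<>();  // cloud store stand-in
    final Map<String, String> oracle = new HashMap<>();     // DC safety net stand-in
    final Queue<String[]> queue = new ArrayDeque<>();       // SQS stand-in

    void write(String txId, String payload) {
        cassandra.put(txId, payload);             // primary write in the cloud
        queue.add(new String[] {txId, payload});  // shadow copy onto the queue
    }

    // In production a consumer polls the queue and writes into the DC database.
    void drainToDc() {
        String[] msg;
        while ((msg = queue.poll()) != null) oracle.put(msg[0], msg[1]);
    }

    public static void main(String[] args) {
        ShadowWriter sw = new ShadowWriter();
        sw.write("tx-42", "FR,9.99,EUR");
        sw.drainToDc();
        System.out.println("DC copy: " + sw.oracle.get("tx-42"));
    }
}
```

The queue decouples the two writes: the cloud path never waits on the DC, and the DC copy catches up asynchronously, which is what makes the rollback safety net cheap to keep.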
For the migration strategy, we decided to go country by country. The only requirement was that all the payment processors used in a country had to be implemented in the cloud before we could start migrating it.
It looks a bit like this. This is a very simplistic matrix, but it should give you an idea of the reasoning behind it. We started by implementing Worldpay. When it was ready, we identified a candidate country, Paraguay, which we configured to run with Worldpay only. Because the country was small enough, a few hundred transactions per day, it was the perfect way to start. Then we worked on BNP and PayPal; when they were done, we were able to migrate France to the cloud. The only remaining processor on the list was Paymentech, and as soon as it was completed, we could start migrating the US.
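That matrix reasoning reduces to a simple readiness check: a country can be migrated once every processor it depends on is implemented in the cloud. The processor/country pairs below mirror the examples in the talk; the data structure itself is an illustrative assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the migration-matrix reasoning.
class MigrationPlanner {
    // A country is migratable when all of its processors run in the cloud.
    static boolean canMigrate(List<String> countryProcessors, Set<String> inCloud) {
        return inCloud.containsAll(countryProcessors);
    }

    public static void main(String[] args) {
        Map<String, List<String>> needs = Map.of(
                "PY", List.of("Worldpay"),
                "FR", List.of("Worldpay", "BNP", "PayPal"),
                "US", List.of("Worldpay", "PayPal", "Paymentech"));
        Set<String> inCloud = Set.of("Worldpay", "BNP", "PayPal");

        System.out.println("PY ready: " + canMigrate(needs.get("PY"), inCloud)); // true
        System.out.println("FR ready: " + canMigrate(needs.get("FR"), inCloud)); // true
        System.out.println("US ready: " + canMigrate(needs.get("US"), inCloud)); // false: Paymentech pending
    }
}
```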
During this effort, there were a few things we had to pay attention to. First of all, we don’t troubleshoot the same way in the cloud: it’s not as easy as logging into a relational database and running complex queries. In the cloud, we had to learn how to troubleshoot differently and how to use the right tools for it. Also, there was a lot of custom business logic in the historical application; we had to make sure we captured all of it properly and re-implemented it in the cloud. Finally, we partnered with our processors to ensure their platforms would be able to process our requests coming from the cloud.
So how far along are we now? We’re basically almost done. We just need to complete some migrations, especially some countries with local processors, and some batch applications. It’s been a long and enriching process, and we learned a lot from it. If you want to know more, the team and I will be happy to talk further with you about it afterwards.
3/18/15 Billing & Payments Eng Meetup II - Payments Processing in the Cloud
Billing & Payments Meetup II
March 18th, 2015
• Mathieu Chauvin - Payment Processing in the Cloud
• Sangeeta Handa & John Brandy - Billing Workflows in the Cloud
• Shankar Vedaraman - Payment Analytics at Netflix
• Poorna Udupi & Rudra Peram - Security for Billing & Payments
• Rahul Dani - Escape from PCI Land