T-Mobile is using Cassandra to back many of our national platforms, including device activations, eSIM management, bill payments, and T-Mobile Digits. In this talk we will discuss our journey as we migrated away from legacy relational databases and the challenges of overcoming skill gaps, developing automation, and creating tooling to manage our new distributed data system.
Video available at https://www.youtube.com/watch?v=r7UqWeHCcz8
We are also Magenta, and you will see that color throughout the slide presentation.
Apache Cassandra is a distributed, wide-column, eventually consistent NoSQL database.
- Masterless architecture
- Tunable consistency / AP design
- Query language: CQL
- Heavily redundant / resilient
- Near-real-time replication
- Linear hardware scalability and very high uptime
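Tunable consistency means each session or query can pick its own trade-off between consistency and latency. A quick sketch of what that looks like in cqlsh (the keyspace and table names here are hypothetical):

```sql
-- Set the consistency level for this cqlsh session
CONSISTENCY LOCAL_QUORUM;

-- This read now requires a quorum of replicas in the local data center
SELECT * FROM example_ks.example_table WHERE id = 1;

-- Switch to LOCAL_ONE for a faster, less consistent read
CONSISTENCY LOCAL_ONE;
SELECT * FROM example_ks.example_table WHERE id = 1;
```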
DataStax is the software company that delivers DataStax Enterprise on top of Apache Cassandra to provide enterprise support and some extra magic.
I’m a Principal Engineer at T-Mobile on the Distributed Data Systems team within T-Mobile Engineering.
T-Mobile Engineering builds many of the widgets on and behind the phones, including things like activations and eSIM functionality.
Now that the stage is set, let’s begin our story of how we got to Cassandra. Later we will get down to some more technical aspects of our environment.
Running single-instance Oracle servers, using GoldenGate and standby servers for some services
Discuss a bit about our layout without feedback
The teams got together. Studies were done. Decisions were made.
Oracle DBs were slow to build, hard to maintain, and lacked redundancy; failovers were manual.
We needed an always-on solution, distributed over multiple data centers, with near-real-time replication.
This is where Cassandra came in.
On paper this was great and wonderful. Always on, real time replication, globally distributed. Scalable. Fast. Everything we wanted.
Selected DataStax for enterprise support. We will end up using advanced features as well, but will come back to that.
Here is where we seem to differ from most customers. Everyone is doing cloud these days: easy to spin up, easy to deploy.
No cloud. Sorry, not an option. We have our own data centers and bare metal.
We need a huge pile of metal
Note: Picture is from the internet and not even the same type. This is just an illustration.
Make that multiple huge piles of metal as we have multiple physical data centers
Capacity planning was done. Hardware specs were defined. Servers were acquired. This step took some time… Eventually we had servers with operating systems and could start installing the databases.
Note: Picture is from the internet and not even the same type. This is just an illustration
Let’s take a moment to talk about networking. This was our first big roadblock.
In Cassandra terms this ended up taking a lot of fiddling with listen_address, broadcast_address, native_transport_address, rpc_address.
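A sketch of how those settings can end up looking in cassandra.yaml for a node with separate private and routable interfaces. All addresses below are made up, and the native_transport_* names are the DSE spellings of the settings that Apache Cassandra calls rpc_address / broadcast_rpc_address:

```yaml
# cassandra.yaml -- illustrative addresses only
listen_address: 10.0.1.15             # private interface for inter-node traffic
broadcast_address: 203.0.113.15       # address advertised to nodes in other DCs
native_transport_address: 0.0.0.0     # accept CQL clients on all interfaces
native_transport_broadcast_address: 203.0.113.15  # address advertised to drivers
```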
OpsCenter was a little more painful: it surfaced a bug, which we reported and DataStax fixed, because we do not allow shell access via the broadcast address.
I have a CQL Prompt!
We have a database across multiple data centers and it is ready for our data.
Submitted configuration for review by DataStax. …
They had a few suggestions. One major change was adjusting the num_tokens parameter, which controls the number of vnodes each node owns; unfortunately it can only be changed by rebuilding.
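The setting itself is one line in cassandra.yaml; the pain is that it must be right before a node first joins the ring. The value below is illustrative only; appropriate values depend on cluster size, and commonly recommended values are much lower than the old 256 default:

```yaml
# cassandra.yaml -- must be set before the node first joins the ring;
# changing it afterwards means rebuilding the node (or the cluster)
num_tokens: 8
```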
So we burned it down and started over.
I have a CQL Prompt Again!!
Everything was running again, but this wouldn’t be the last rebuild.
- Now that we have a DB, our developers can get to work.
- For this we selected Spring, specifically Spring Data. Spring is a Java framework which simplifies much of the development flow. Spring Boot is a layer on top of Spring which follows a convention-over-configuration methodology: you only code the unique bits and leave the defaults where they apply. This leads to code that is heavily abstracted but significantly simpler and faster to deploy.
Cassandra + Spring = Happy developers
- Not so happy. Out of the box, Spring Data Cassandra has many issues, as it is unconfigured and based entirely on defaults. Problems:
- Excessive tombstones
- Overuse of batches
- Wide partition scans
These led to the development of Casquatch, which I’ll come back to later.
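As an illustration of the tombstone problem (the table and columns here are hypothetical): an unconfigured object mapper tends to bind every column on insert, so fields that were simply never set on the Java side get written as explicit nulls, and in Cassandra every explicit null is a tombstone:

```sql
-- Only id and name are populated on the Java side, but the mapper
-- binds every column, so email is written as a tombstone:
INSERT INTO example_ks.customer (id, name, email) VALUES (123, 'Jane', null);

-- Preferable: omit unset columns entirely so no tombstone is written
INSERT INTO example_ks.customer (id, name) VALUES (123, 'Jane');
```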
Moving from relational to non-relational has a big learning curve, especially when you have to abandon an old mindset for a new one. It’s easy for new developers but harder for experienced ones.
Training. Lots of training. We were able to partner with DataStax to get some fantastic resources for workshops and office hours. More on these later.
Our biggest success came from doing broad trainings while also cultivating an expert in each group. Train the trainer, if you will.
Now we will go into more detail about our environment and the tools we use.
There are 3 main components / workloads:
- Cassandra – core traffic
- Solr™ – search traffic
- Spark™ – analytics
We split these out into different logical datacenters to segregate the workloads
Each physical data center has a complete set of the logical DCs to support failover.
Let’s talk about replication factor. Best practice is 3 replicas with LOCAL_QUORUM.
True for Cassandra
There’s a catch when it comes to Solr and Spark: both of these are going to be running at LOCAL_ONE.
Solr works by querying the Lucene index(es) to get a list of document IDs, aka primary keys, for the individual rows. It then queries for each row (one at a time) using LOCAL_ONE. Originally we set this to an RF of 1 to align with that logic, but then we had failures and learned that the driver did not have workload-aware failover.
At this point you have to strike a balance between consistency and availability. A higher RF increases availability while allowing for inconsistency due to LOCAL_ONE.
In the end we decided to make this a per application / keyspace decision.
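Because replication is defined per keyspace, each application can pick its own per-DC factors. A hypothetical keyspace spanning the Cassandra and Solr logical DCs might look like this (DC names and factors are illustrative, not our actual topology):

```sql
CREATE KEYSPACE app_keyspace WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC1_CASSANDRA': 3,   -- core traffic, read/written at LOCAL_QUORUM
  'DC1_SOLR': 2         -- search traffic, read at LOCAL_ONE
};
```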
We use a multi-tenant model (not multi-instance). Each application is placed in a single keyspace, and many keyspaces are stored in a single cluster. This reduces resource costs for small clients while allowing increased redundancy.
Let’s talk a bit more about training. I told you we’d come back to this. Training is the central part of succeeding: when under a deadline, no one is going to use an unknown product that they are uncomfortable with. So train, train some more, then train again. When people stop coming, keep holding the sessions, just maybe not so often.

We started with frequent scheduled sessions covering all the fundamentals and then the advanced topics. We held many one-on-ones to answer questions and deep dive into topics. Then we started a weekly Office Hours session to discuss any and all questions. This became popular even with teams outside our organization and became our Cassandra Center of Excellence. Today it is down to monthly, but we are always available for anyone who wants to talk.
Ansible® is a simple, open source, agentless automation framework written in Python, based on repeatable playbooks backed by variables.
- Deploy Cassandra to brand-new systems without ever manually logging in to the node
- Manage configuration through templates
- Monitor by running local commands and storing metrics in a database
- Alarm when metrics cross a threshold
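A minimal sketch of what the configuration-management piece of such a playbook can look like. The host group, file paths, and handler names below are made up for illustration, not our actual playbooks:

```yaml
# deploy-cassandra.yml -- illustrative only
- hosts: cassandra_nodes
  become: true
  vars_files:
    - vars/cluster.yml                 # per-cluster variables (hypothetical path)
  tasks:
    - name: Render cassandra.yaml from a template
      template:
        src: templates/cassandra.yaml.j2
        dest: /etc/cassandra/cassandra.yaml
      notify: restart cassandra
  handlers:
    - name: restart cassandra
      service:
        name: cassandra
        state: restarted
```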
Through Jenkins® we have developed self-service jobs to deploy changes and fixes to Cassandra. Jobs are submitted and trigger an approval. Results are stored and logged. Schema is versioned. Health checks run before and after each job.
The architecture of the apps follows a general pattern like the one on the slide. Every layer has multiple instances for redundancy and capacity, of course.
A load balancer sits out front; in some cases this is software, such as an API gateway, while other apps use hardware.
Behind this are a number of microservices, these may each call services from other microservices to create layers as required.
When writing to the database, they target a DAO microservice. There may of course be multiple DAOs for different entity flows. These DAOs are the microservices which interact with the Cassandra database.
Casquatch is a Java Abstraction Framework for Cassandra.
This is generally the software we have running on the DAOs in the earlier architecture slide. This results in the DB being fronted by a REST API, with all database logic, including failovers, exception handling, retries, etc., masked behind the DAO microservice.
I recently released the 2.0 version of the software, which takes advantage of the 4.x driver from DataStax, has annotation processing, a new failover policy, and will even generate the REST API logic for you.
Check out the GitHub repo.
Casquatch contains POJOs which you can write and annotate or generate from CQL.
It then uses annotation processing to generate helper classes at compile time that bidirectionally map between the POJOs and the relevant CQL queries.
The queries then utilize the DSE Driver to interact with the underlying database.
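To make the mapping idea concrete, here is a heavily simplified, dependency-free sketch of what a generated helper does: take an annotated POJO and derive the matching CQL statements from it. Casquatch's real generated classes bind prepared statements through the DSE driver, and its actual annotation and class names differ; everything below (annotation names, class names, the customer schema) is invented for illustration, and reflection stands in for what the annotation processor does at compile time.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PojoMappingSketch {

    // Stand-ins for the mapping annotations (names invented here)
    @Retention(RetentionPolicy.RUNTIME) @interface Table { String value(); }
    @Retention(RetentionPolicy.RUNTIME) @interface PartitionKey { }

    // A POJO describing one Cassandra table (hypothetical schema)
    @Table("customer")
    public static class Customer {
        @PartitionKey public int id;
        public String name;
    }

    // The real framework emits code like this at compile time via
    // annotation processing; here we derive the CQL with reflection.
    public static String insertCql(Class<?> pojo) {
        String table = pojo.getAnnotation(Table.class).value();
        List<String> cols = new ArrayList<>();
        for (Field f : pojo.getDeclaredFields()) cols.add(f.getName());
        String placeholders = String.join(", ", Collections.nCopies(cols.size(), "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", cols)
                + ") VALUES (" + placeholders + ")";
    }

    public static String selectCql(Class<?> pojo) {
        String table = pojo.getAnnotation(Table.class).value();
        List<String> cols = new ArrayList<>();
        String key = null;
        for (Field f : pojo.getDeclaredFields()) {
            cols.add(f.getName());
            if (f.isAnnotationPresent(PartitionKey.class)) key = f.getName();
        }
        return "SELECT " + String.join(", ", cols) + " FROM " + table
                + " WHERE " + key + " = ?";
    }

    public static void main(String[] args) {
        System.out.println(insertCql(Customer.class));
        System.out.println(selectCql(Customer.class));
    }
}
```

In the real thing these strings become prepared statements executed by the DSE driver, so application code only ever touches the POJO.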
Run through a demo of Casquatch REST API tutorial.
We are hiring! There is currently a mid-level engineer position for a large Cassandra environment on a peer team in T-Mobile Engineering and our group is hiring Java Devs of all levels. Please reach out if you want to discuss further or know anyone who might be interested.