(speaker notes here : https://docs.google.com/document/d/12mXLYEFkEEd0pwOwD8bC1JQ8CPpx_PiRPXikHZ6MMYQ/pub )
t.co is the URL shortening service created by Twitter. As part of scaling up, t.co moved to using Mesos. We saw significant gains in deployment speed and scalability, and a reduction in operational headaches.
This talk will provide an introduction to Mesos + Aurora and cover how the t.co service migrated from running on physical hardware to Mesos. It will also cover the challenges t.co faced during the migration, the "gotchas", and debugging techniques for uncovering performance issues.
Agenda:
- Introduction to Mesos + Aurora
- Benefits of moving to Mesos
- Migration steps for moving t.co to Mesos
- Challenges faced and how t.co overcame them
4. Static partitioning has problems
Unequal load distribution on machines
Slower to add capacity
Not fault tolerant
5. Is there a better way?
Do we want machines or do we want resources?
6. Mesos
Resource manager - the datacenter is one big pool
Can run multi-tenant workloads
Failure detection
Services are isolated from one another
7. Why Mesos - Better resource utilization
Run multi-tenant workload on machines
Dynamic partitioning - no dedicated machines for tasks
Less resource hungry than virtual machines
8. Why Mesos - all the other good things
Fault tolerant - automatically restart failed jobs
Elasticity - grow and shrink on demand
Faster deploys
Hello friends, today I am going to talk about lessons I learned when moving a service from physical hosts to Mesos.
I will give a brief overview of Mesos, what its benefits are and how we migrated the service. After that, I will dive into the issues I saw after migration and what lessons I learned.
When it comes to managing a cluster in the data center today, most operations teams use a static partitioning scheme. They start out with a set of servers and provision them into separate roles. The machines in a role usually run a specific service, for example Apache, Rails, memcache, or Hadoop.
With this scheme you will often have periods where machines in one partition are resource starved while another partition is under-utilized. However, there is no easy way to reassign resources across partitioned clusters. For example, you cannot move CPU from your cache servers to your web servers.
It is also slower to increase the capacity of a partition. If you are expecting a spike in traffic for an event, you have to order more servers and provision them before you can add capacity.
Using static partitioning also increases your mean time to recovery from faults. For example, if you lose two of your web servers, someone usually needs to come online and set up replacement machines.
Can we improve this situation? Instead of having silos of servers, can we treat the machines available as a pool and request whatever resources we need to run our services?
This is where Mesos comes in. It provides a way to treat the machines in your datacenter as one big computer. Services request quotas of CPU, memory, and disk; Mesos allocates these resources and runs the services. It can run different types of workloads: cron jobs, batch jobs, and long-running services. It can detect failed services and restart them without any human intervention. And even though multiple services can run on the same server, they are isolated from one another.
So, what are the benefits of Mesos? One of the biggest is better resource utilization. Services can elastically grow and shrink based on the resources they need, and Mesos handles scheduling them.
For isolation, Mesos uses containers, which are less expensive than virtual machines.
Mesos also provides automatic failure detection and recovery. Failed jobs get restarted without any human intervention, which helps reduce mean time to recovery.
Because services can increase or reduce their resource footprint easily, the cluster as a whole is better utilized.
We also found our deploys to be faster. Mesos allows us to run multiple versions of the same service in the same environment, which makes it easier to roll out services.
That was a brief overview of Mesos. The next part is about the migration of a service from physical hosts to Mesos. The service is t.co, the service that handles URL shortening for Twitter. When you tweet a URL, this service converts it into a short URL. We migrated from tens of physical hosts across multiple datacenters to tens of Mesos jobs, running on a shared pool of servers that also ran other services.
The first step was to create a standalone package for the service. The service could not assume the availability of any third-party libraries, nor that it would have access to system-level directories. We also packaged any configuration files that would be required. Some services have pushed their configuration options to key-value stores and pull them from there on startup. A sketch of what such a self-contained job definition can look like follows.
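As a rough illustration only: an Aurora job config (.aurora files are a Python DSL) that bundles everything the service needs into its sandbox. The package name, paths, resource numbers, and cluster/role names here are made up for illustration, not the actual t.co config.

```python
# Hypothetical .aurora config sketch: fetch a self-contained bundle, then run it.
fetch = Process(
    name = 'fetch_package',
    # copy the standalone bundle into the sandbox and unpack it
    cmdline = 'cp /packages/tco-service.tar.gz . && tar xzf tco-service.tar.gz')

run = Process(
    name = 'run_service',
    # config ships inside the bundle; no system-level directories are assumed,
    # and the HTTP port is assigned dynamically by the scheduler
    cmdline = 'java -jar tco-service.jar --config=config/production.yml '
              '--port={{thermos.ports[http]}}')

task = Task(
    name = 'tco_task',
    processes = [fetch, run],
    constraints = order(fetch, run),                    # unpack before starting
    resources = Resources(cpu = 2.0, ram = 4*GB, disk = 8*GB))

jobs = [Service(
    cluster = 'us-east',
    environment = 'prod',
    role = 'tco',
    name = 'tco',
    task = task,
    instances = 20)]
```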
Next, we deployed this package both to physical servers and to the Mesos cluster.
After the service was set up on the Mesos cluster, we ran load tests against it. We collected production logs and replayed them against the service running on Mesos. We compared the two deployments on metrics like latency, total queries per second, and garbage collection behaviour, and we also monitored for core dumps and service restarts.
After the service passed sanity testing, the Mesos deployment started getting a portion of production traffic. Initially this was 1% of traffic. We kept an eye on the metrics, using monitoring alerts to catch any breakage, and then increased traffic in gradual steps: 10%, 20%, 50%, and finally 100%.
Did we get the benefits we were hoping for? Operational cost went down, since we moved to machines that were shared with other teams.
Routine maintenance tasks were easier, or were handled by a dedicated Mesos SRE team.
Deployment and rollback were faster.
Now I will talk about the issues we saw and the lessons we learned after the migration.
The first symptom we saw was that clients of the t.co service reported sudden spikes in latency. When we investigated, we found that this was caused by how Mesos does resource isolation. Mesos uses Linux control groups (cgroups) to provide resource isolation. When a process starts, its cgroup gives it a quota of CPU time for each scheduling period. If the process consumes its entire quota in the first few milliseconds of a period, it is frozen until the next period begins. Our calculation had not allocated enough CPU to account for garbage collection in the JVM. We added more CPU quota and increased the number of instances, and the problem was fixed. A sketch of how to spot this kind of throttling is shown below.
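As a rough illustration of the mechanism (not our actual tooling), the CFS quota and throttling counters can be read straight out of the container's cgroup. The cgroup path below is hypothetical; on a Mesos agent the container's cgroup lives under the agent-managed hierarchy.

```python
# Minimal sketch: check whether a container is being CFS-throttled (cgroup v1).
CGROUP = '/sys/fs/cgroup/cpu/mesos/<container-id>'   # hypothetical path, fill in

def read_kv(path):
    # cpu.stat contains lines like "nr_periods 1234"
    with open(path) as f:
        return dict(line.split() for line in f)

stat   = read_kv(CGROUP + '/cpu.stat')               # nr_periods, nr_throttled, throttled_time
quota  = int(open(CGROUP + '/cpu.cfs_quota_us').read())
period = int(open(CGROUP + '/cpu.cfs_period_us').read())

# Allocated CPU in "cores" is quota divided by the scheduling period.
print('CPU quota: %.2f cores' % (quota / period))

# Fraction of periods in which the task hit its quota and was frozen.
throttled_pct = 100.0 * int(stat['nr_throttled']) / max(int(stat['nr_periods']), 1)
print('periods throttled: %.1f%%' % throttled_pct)
```

If a latency-sensitive JVM service shows a high throttled percentage, GC bursts are a common culprit: the collector briefly uses every core it can get and burns through the period's quota.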
To calculate the total capacity of the cluster, we used a simple approach: run a load test on a single job and multiply that number by the total number of instances. However, the cluster could not handle the traffic we had projected. This was caused by the heterogeneous mix of servers the service was scheduled onto; some CPU variants gave higher throughput than others. For better capacity planning, we now run the load test on all CPU variants and use the lowest number to decide how many instances we need, as in the example below.
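A minimal worked example of that calculation; the per-variant throughput, projected peak, and headroom factor are made-up numbers for illustration.

```python
import math

# Hypothetical per-instance load-test results on each CPU variant in the pool.
qps_per_instance = {
    'cpu_variant_a': 1200,
    'cpu_variant_b': 950,
    'cpu_variant_c': 800,
}

peak_traffic_qps = 150_000   # illustrative projected peak
headroom = 1.3               # keep ~30% spare capacity

# Plan against the slowest variant, since the scheduler may place any
# instance on any machine type.
worst_case = min(qps_per_instance.values())
instances_needed = math.ceil(peak_traffic_qps * headroom / worst_case)
print(instances_needed)      # -> 244 with these numbers
```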
Suppose we have a PHP application that needs to connect to a cache server. How does the application know which machine and which port to connect to? In the world where servers were statically allocated, the application could connect to a server on a static port and assume that the cache server would always be available on that connection. Distributed systems like Mesos require service discovery as an essential building block to connect applications and services.
With Mesos, the host and port get assigned dynamically when the service starts up. We used ZooKeeper to keep track of which service was running on which machine and port. Anyone that needed to use the t.co service would query ZooKeeper to get the list of machines and ports and connect to them; a rough sketch of such a lookup follows.
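This sketch uses the kazoo ZooKeeper client. The znode path and the payload layout (JSON with a serviceEndpoint host/port, i.e. the "serverset" convention) are assumptions about this particular setup, not the actual t.co paths.

```python
# Rough sketch: resolve live instances of a service from ZooKeeper.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zookeeper.example.com:2181')   # hypothetical ensemble
zk.start()

SERVERSET_PATH = '/aurora/tco/prod/tco'                # hypothetical path

def resolve_endpoints(zk, path):
    endpoints = []
    for member in zk.get_children(path):               # one ephemeral node per instance
        data, _stat = zk.get('%s/%s' % (path, member))
        info = json.loads(data)
        ep = info['serviceEndpoint']                    # assumed serverset payload layout
        endpoints.append((ep['host'], ep['port']))
    return endpoints

print(resolve_endpoints(zk, SERVERSET_PATH))
zk.stop()
```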
This was a problem for some services that could not do the lookup dynamically. We ended up setting up a few static proxy servers that the legacy applications would connect to. These proxies would query ZooKeeper and forward connections to the right hosts.
Airbnb and TellApart have open-sourced their software for this.
Sudden spikes in latencies even after eliminating job throttling
What we learned:
Co-located processes doing a lot of disk and network reads/writes affect their neighbours
Async disk I/O helps alleviate pain
Network is harder to isolate (ingress)
Questions:
Tweet to @ilunatech https://twitter.com/ilunatech
Or email