(speaker notes here : https://docs.google.com/document/d/12mXLYEFkEEd0pwOwD8bC1JQ8CPpx_PiRPXikHZ6MMYQ/pub )
t.co is the URL shortening service created by Twitter. As part of scaling up, t.co moved to using Mesos. We saw significant gains in deployment speed and scalability, and a reduction in operational headaches.
This talk will provide an introduction to Mesos + Aurora and cover how the t.co service migrated from running on physical hardware to Mesos. It will also cover the challenges t.co faced during the migration, the "gotchas", and debugging techniques for uncovering performance issues.
Agenda:
- Introduction to Mesos + Aurora
- Benefits of moving to Mesos
- Migration steps for moving t.co to Mesos
- Challenges faced and how t.co overcame them
4. Static partitioning has problems
Unequal load distribution on machines
Slower to add capacity
Not fault tolerant
5. Is there a better way?
Do we want machines or do we want resources?
6. Mesos
Resource manager - the datacenter is one big pool
Can run multi-tenant workloads
Failure detection
Services are isolated from one another
7. Why Mesos - Better resource utilization
Run multi-tenant workload on machines
Dynamic partitioning - no dedicated machines for tasks
Less resource hungry than virtual machines
8. Why Mesos - all the other good things
Fault tolerant - automatically restart failed jobs
Elasticity - grow and shrink on demand
Faster deploys
Hello friends, today I am going to talk about lessons I learned when moving a service from physical hosts to Mesos.
I will give a brief overview of Mesos, what its benefits are and how we migrated the service. After that, I will dive into the issues I saw after migration and what lessons I learned.
When it comes to managing a cluster in the data center today, most operations teams use a static partitioning scheme. They start out with a set of servers and provision them into separate roles. The machines in a role usually run a specific service, for example Apache, Rails, memcache, or Hadoop.
With this scheme you will often have periods where machines in one partition are resource starved while another partition is under-utilized. However, there is no easy way to reassign resources across partitioned clusters. For example, you cannot move CPU from your cache servers to your web servers.
It is also slower to increase the capacity of a partition. If you are expecting a spike in traffic for an event, you have to order more servers and provision them before you can add capacity.
Using static partitioning also increases your mean time to recovery from faults. For example, if you lose two of your web servers, someone usually needs to come online and set up replacement machines.
Can we improve this situation? Instead of having silos of servers, can we treat the machines available as a pool and request whatever resources we need to run our services?
This is where Mesos comes in. It provides a way to treat the machines in your datacenter as one big computer. Services request quotas of CPU, memory, and disk; Mesos allocates these resources and runs the services. It can run different types of workloads: cron jobs, batch jobs, and long-running services. It can detect failed services and restart them without any human intervention. And even though multiple services can run on the same server, they are isolated from one another.
So, what are the benefits of Mesos? One of the biggest is better resource utilization. Services can elastically grow and shrink based on the resources they need, and Mesos handles scheduling them.
For isolation, Mesos uses containers, which are less expensive than virtual machines.
Mesos also provides automatic failure detection and recovery. Failed jobs get restarted without any human intervention, which helps reduce mean time to recovery.
Because services can increase or reduce their resource footprint easily, the cluster as a whole is better utilized.
We also found our deploys to be faster. Mesos allows us to run multiple versions of the same service in the same environment, which makes it easier to roll out services.
That was a brief overview of Mesos. The next part is about the migration of a service from physical hosts to Mesos. The service is t.co, the service that handles URL shortening for Twitter. When you tweet a URL, this service converts it into a short URL. We migrated from tens of physical hosts across multiple datacenters to tens of Mesos jobs, running on a shared pool of servers that also ran other services.
The first step was to create a standalone package for the service. The service could not assume the availability of any third-party libraries, nor that it would have access to system-level directories. We also packaged any configuration files that would be required. Some services have pushed their configuration options to key-value stores and pull them from there on startup. A sketch of what such a self-contained job definition can look like follows.
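As a rough illustration only: an Aurora job config (.aurora files are a Python DSL) that bundles everything the service needs into its sandbox. The package name, paths, resource numbers, and cluster/role names here are made up for illustration, not the actual t.co config.

```python
# Hypothetical .aurora config sketch: fetch a self-contained bundle, then run it.
fetch = Process(
    name = 'fetch_package',
    # copy the standalone bundle into the sandbox and unpack it
    cmdline = 'cp /packages/tco-service.tar.gz . && tar xzf tco-service.tar.gz')

run = Process(
    name = 'run_service',
    # config ships inside the bundle; no system-level directories are assumed,
    # and the HTTP port is assigned dynamically by the scheduler
    cmdline = 'java -jar tco-service.jar --config=config/production.yml '
              '--port={{thermos.ports[http]}}')

task = Task(
    name = 'tco_task',
    processes = [fetch, run],
    constraints = order(fetch, run),                    # unpack before starting
    resources = Resources(cpu = 2.0, ram = 4*GB, disk = 8*GB))

jobs = [Service(
    cluster = 'us-east',
    environment = 'prod',
    role = 'tco',
    name = 'tco',
    task = task,
    instances = 20)]
```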
Next, we deployed this package both to physical servers and to the Mesos cluster.
After the service was set up on the Mesos cluster, we ran load tests against it. We collected production logs and replayed them against the service running on Mesos. We compared the two deployments on metrics like latency, total queries per second, and garbage collection behaviour, and we also monitored for core dumps and service restarts.
After the service passed sanity testing, the Mesos deployment started getting a portion of production traffic. Initially this was 1% of traffic. We kept an eye on the metrics, using monitoring alerts to catch any breakage, and then increased traffic in gradual steps: 10%, 20%, 50%, and finally 100%.
Did we get the benefits we were hoping for? Operational cost went down, since we moved to machines that were shared with other teams.
Routine maintenance tasks were easier, or were handled by a dedicated Mesos SRE team.
Deployment and rollback were faster.
Now I will talk about the issues we saw and the lessons we learned after the migration.
The first symptom we saw was that clients of the t.co service reported sudden spikes in latency. When we investigated, we found that this was caused by how Mesos does resource isolation. Mesos uses Linux control groups (cgroups) to provide resource isolation. When a process starts, its cgroup gives it a quota of CPU time for each scheduling period. If the process consumes its entire quota in the first few milliseconds of a period, it is frozen until the next period begins. Our calculation had not allocated enough CPU to account for garbage collection in the JVM. We added more CPU quota and increased the number of instances, and the problem was fixed. A sketch of how to spot this kind of throttling is shown below.
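As a rough illustration of the mechanism (not our actual tooling), the CFS quota and throttling counters can be read straight out of the container's cgroup. The cgroup path below is hypothetical; on a Mesos agent the container's cgroup lives under the agent-managed hierarchy.

```python
# Minimal sketch: check whether a container is being CFS-throttled (cgroup v1).
CGROUP = '/sys/fs/cgroup/cpu/mesos/<container-id>'   # hypothetical path, fill in

def read_kv(path):
    # cpu.stat contains lines like "nr_periods 1234"
    with open(path) as f:
        return dict(line.split() for line in f)

stat   = read_kv(CGROUP + '/cpu.stat')               # nr_periods, nr_throttled, throttled_time
quota  = int(open(CGROUP + '/cpu.cfs_quota_us').read())
period = int(open(CGROUP + '/cpu.cfs_period_us').read())

# Allocated CPU in "cores" is quota divided by the scheduling period.
print('CPU quota: %.2f cores' % (quota / period))

# Fraction of periods in which the task hit its quota and was frozen.
throttled_pct = 100.0 * int(stat['nr_throttled']) / max(int(stat['nr_periods']), 1)
print('periods throttled: %.1f%%' % throttled_pct)
```

If a latency-sensitive JVM service shows a high throttled percentage, GC bursts are a common culprit: the collector briefly uses every core it can get and burns through the period's quota.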
To calculate the total capacity of the cluster, we used a simple approach: run a load test on a single job and multiply that number by the total number of instances. However, the cluster could not handle the traffic we had projected. This was caused by the heterogeneous mix of servers the service was scheduled onto; some CPU variants gave higher throughput than others. For better capacity planning, we now run the load test on all CPU variants and use the lowest number to decide how many instances we need, as in the example below.
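A minimal worked example of that calculation; the per-variant throughput, projected peak, and headroom factor are made-up numbers for illustration.

```python
import math

# Hypothetical per-instance load-test results on each CPU variant in the pool.
qps_per_instance = {
    'cpu_variant_a': 1200,
    'cpu_variant_b': 950,
    'cpu_variant_c': 800,
}

peak_traffic_qps = 150_000   # illustrative projected peak
headroom = 1.3               # keep ~30% spare capacity

# Plan against the slowest variant, since the scheduler may place any
# instance on any machine type.
worst_case = min(qps_per_instance.values())
instances_needed = math.ceil(peak_traffic_qps * headroom / worst_case)
print(instances_needed)      # -> 244 with these numbers
```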
Suppose we have a PHP application that needs to connect to a cache server. How does the application know which machine and which port to connect to? In the world where servers were statically allocated, the application could connect to a server on a static port and assume that the cache server would always be available on that connection. Distributed systems like Mesos require service discovery as an essential building block to connect applications and services.
With Mesos, the host and port get assigned dynamically when the service starts up. We used ZooKeeper to keep track of which service was running on which machine and port. Anyone that needed to use the t.co service would query ZooKeeper to get the list of machines and ports and connect to them; a rough sketch of such a lookup follows.
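This sketch uses the kazoo ZooKeeper client. The znode path and the payload layout (JSON with a serviceEndpoint host/port, i.e. the "serverset" convention) are assumptions about this particular setup, not the actual t.co paths.

```python
# Rough sketch: resolve live instances of a service from ZooKeeper.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zookeeper.example.com:2181')   # hypothetical ensemble
zk.start()

SERVERSET_PATH = '/aurora/tco/prod/tco'                # hypothetical path

def resolve_endpoints(zk, path):
    endpoints = []
    for member in zk.get_children(path):               # one ephemeral node per instance
        data, _stat = zk.get('%s/%s' % (path, member))
        info = json.loads(data)
        ep = info['serviceEndpoint']                    # assumed serverset payload layout
        endpoints.append((ep['host'], ep['port']))
    return endpoints

print(resolve_endpoints(zk, SERVERSET_PATH))
zk.stop()
```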
This was a problem for some services that could not do the lookup dynamically. We ended up setting up a few static proxy servers that the legacy applications would connect to. These proxies would query ZooKeeper and forward connections to the right hosts.
Airbnb and TellApart have open-sourced their software for this.
Sudden spikes in latencies even after eliminating job throttling
What we learned:
Co-located processes doing a lot of disk and network reads/writes affect their neighbours
Async disk I/O helps alleviate pain
Network is harder to isolate (ingress)
Questions:
Tweet to @ilunatech https://twitter.com/ilunatech
Or email