Lessons in moving from
physical hosts to Mesos
Raj Shekhar, Senior Site Reliability Engineer
@ilunatech
Mesos
WHAT
WHY
HOW
NOW WHAT
How most Ops teams run clusters today
Static partitioning has problems
Unequal load distribution on machines
Slower to add capacity
Not fault tolerant
Is there a better way?
Do we want machines or do we want resources?
Mesos
Resource manager - the datacenter is one big pool
Can run multi-tenant workloads
Failure detection
Services are isolated from one another
Why Mesos - Better resource utilization
Run multi-tenant workload on machines
Dynamic partitioning - no dedicated machines for tasks
Less resource hungry than virtual machines
Why Mesos - all the other good things
Fault tolerant - automatically restart failed jobs
Elasticity - grow and shrink on demand
Faster deploys
T.co - URL shortening
http://example.com/example http://t.co/examp
How
Package Deploy Test Go Live!
Life after Go Live
Lowered operating expense
Fewer routine operational tasks
Faster deploys
Job throttling
Sudden spikes in latencies
What we learned
cgroups and cpu quotas
Capacity planning
Max traffic of the cluster was lower than our expectation
What we learned
Different CPU variants have different throughput
Rethink service discovery
Services get hosts and ports assigned dynamically
What we learned
Use static proxies to forward connections
No perfect isolation
Sudden spike in latency
What we learned
Async ops where possible, noisy neighbours still affect us
Questions?
rajlist@rajshekhar.net
@ilunatech

Editor's Notes

  1. Hello friends, today I am going to talk about lessons I learned when moving a service from physical hosts to Mesos.
  2. I will give a brief overview of Mesos, what its benefits are and how we migrated the service. After that, I will dive into the issues I saw after migration and what lessons I learned.
  3. When it comes to managing a cluster in the data center today, most operations teams use a static partitioning scheme. They start out with a set of servers and then provision the servers into separate roles. The machines in a role usually run a specific service, for example Apache, Rails, memcache or Hadoop.
  4. With this scheme you will often have periods where machines in one partition are resource starved while another partition is under-utilized. However, there is no easy way to reassign resources across partitions. For example, you cannot assign CPU from your cache servers to your web servers. It is also slower to increase the capacity of a partition. If you are expecting a spike in traffic for an event, you have to order more servers and provision them to increase capacity. Static partitioning also increases your mean time to recovery from faults. For example, if you lose two of your web servers, usually someone needs to come online and set up additional machines to replace the lost servers.
  5. Can we improve this situation? Instead of having silos of servers, can we treat the machines available as a pool and request whatever resources we need to run our services?
  6. This is where Mesos comes in. It provides a way to treat the machines in your datacenter as one big computer. Services can request quotas for CPU, memory and disk. Mesos allocates these resources and runs the services. It can run different types of services: cron jobs, batch jobs and long-running services. It can detect services that are down and restart them without any human intervention. Even though multiple services can run on the same server, they are isolated from one another.
  7. So, what are the benefits of Mesos? One of the biggest benefits of Mesos is better resource utilization. The services can elastically grow and shrink based on the amount of resources they need, and Mesos will handle scheduling the services. For isolation, Mesos uses containers, which are less expensive than virtual machines.
  8. Mesos also provides automatic failure detection and recovery. Failed jobs get restarted without any human intervention, and this helps in reducing mean time to recovery. The services can increase or reduce their resource usage easily, and this helps in better resource utilization. We also saw our deploys become faster. Mesos also allows us to run multiple versions of the same service in the same environment, and that makes it easier to roll out services.
  9. That was a brief overview of Mesos. The next part is about the migration of a service from physical hosts to Mesos. The service is called t.co. This is the service that handles URL shortening for Twitter. When you tweet a URL, this service converts it into a short URL. We migrated from tens of physical hosts across multiple datacenters to running tens of jobs. These jobs ran on a shared pool of servers that would run other services as well.
  10. The first step was to create a standalone package for the service. The service could not assume the availability of any third-party libraries. It could not assume that it would have access to system-level directories. We also packaged any configuration files that would be required. Some services have pushed their configuration options to key-value databases and pull them from there on startup. Next, we deployed this to physical servers and to the Mesos cluster. After the service was set up on the Mesos cluster, we ran some load tests on it. We collected production logs and ran the load test on the service running on Mesos. We compared the performance of the cluster on metrics like latency, total queries per second and garbage collection behaviour. We would also monitor for core dumps or service restarts. After the service passed sanity testing, the Mesos service started getting a portion of production traffic. Initially, this was 1% of traffic. We would keep an eye on the metrics using monitoring alerts to catch any breakages. We then migrated more traffic in gradual steps: 10%, 20%, 50% and 100%.
  11. Did we get the benefits we were hoping for? There was less operational cost, since we moved to using machines that were shared with other teams. Routine maintenance tasks were easier, or were handled by a dedicated Mesos SRE team. Deployment and rollback were faster.
  12. Now I will talk about the issues we saw and what we learned after the migration. The first symptom we saw was that the clients using the t.co service would report sudden spikes in latencies from it. When we investigated, we found that this was caused by how Mesos does resource isolation. Mesos uses Linux control groups, called cgroups, to provide resource isolation. When a process starts, the cgroup provides it a quota of CPU cycles for a certain timeslice. If a process consumes all of its CPU cycles in the first few milliseconds, it is frozen until the next timeslice. Our calculation had not allocated enough cycles to account for garbage collection of the JVM. We added more CPU quota and increased the number of instances, and this problem got fixed.
  13. To calculate the complete capacity of the cluster, we used a simple approach. We ran a load test on a single job and then multiplied that number by the total number of instances running. However, we saw that the cluster could not handle the traffic we had projected. This was caused by the heterogeneous environment of servers on which the service was being scheduled. Some CPU variants could give higher throughput than others. To do better capacity planning, we run load tests on all CPU variants and then use the lowest number to plan how many instances we need.
  14. Suppose we have a PHP application that needs to connect to a cache server. How does the application know which machine and which port to connect to? In the world where the servers were statically allocated, the application could connect to a server on a static port and assume that the cache server would always be available on that connection. Distributed systems like Mesos require service discovery as an essential building block to connect applications and services. With Mesos, the server and port get assigned dynamically, when the service starts up. Mesos would use the ZooKeeper service to keep track of what service was running on what machine and port. Anyone that needed to use the t.co service would query ZooKeeper to get the list of machines and ports and connect to them. This was a problem for some services that could not do this dynamically. We ended up setting up a few static proxy servers, and the legacy applications would connect to them. These proxies would query ZooKeeper and forward the connection to the right hosts. Airbnb and TellApart have open-sourced their software for this.
  15. Sudden spikes in latencies even after eliminating job throttling. What we learned: co-running processes doing a lot of disk and network reads/writes affect neighbours; async disk I/O helps alleviate the pain; the network is harder to isolate (ingress).
  16. Questions: Tweet to @ilunatech https://twitter.com/ilunatech Or email
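
The throttling described in note 12 can be illustrated with a rough sketch. This is a simplified model of a per-period CPU quota, not Mesos or kernel code. The function name, the 100 ms period and the garbage-collection numbers are all assumptions chosen for illustration. It shows why a short, multi-threaded burst (such as a JVM garbage collection) can take much longer in wall-clock time once it exhausts its quota early in each period.

```python
def wall_clock_ms(cpu_ms_needed, quota_ms, period_ms=100.0, threads=1):
    """Wall-clock time to finish a CPU burst under a per-period quota.

    Simplified model: in each period the job may burn at most quota_ms of
    CPU time. Once the quota is spent, the job is frozen until the next
    period starts. All values are hypothetical, for illustration only.
    """
    if min(cpu_ms_needed, quota_ms, period_ms, threads) <= 0:
        raise ValueError("all arguments must be positive")
    remaining = float(cpu_ms_needed)
    elapsed = 0.0
    while True:
        # CPU time usable in this period: capped by the quota and by how
        # much CPU the threads could burn in one period anyway.
        burst = min(remaining, quota_ms, threads * period_ms)
        remaining -= burst
        if remaining <= 0:
            # The last burst runs in parallel across the threads.
            return elapsed + burst / threads
        # Quota (or period) exhausted: wait for the next period.
        elapsed += period_ms


# Hypothetical GC burst: 400 ms of CPU work spread over 4 threads.
throttled = wall_clock_ms(400, quota_ms=100, threads=4)
unthrottled = wall_clock_ms(400, quota_ms=400, threads=4)
print(throttled, unthrottled)
```

In this model the same burst finishes in 100 ms with a quota of 400 ms per period, but takes 325 ms with a quota of 100 ms, because the four threads spend the quota in the first 25 ms of each period and are then frozen for the remaining 75 ms.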
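
The capacity-planning lesson in note 13 can be sketched as a small calculation. The CPU variant names, the throughput figures and the `headroom` parameter below are hypothetical and are not taken from the talk. The only idea carried over from the notes is to load test each CPU variant and plan against the lowest result.

```python
import math


def instances_needed(target_qps, qps_per_instance_by_cpu, headroom=0.0):
    """Number of instances to run, planned against the slowest CPU variant.

    qps_per_instance_by_cpu maps a CPU variant name to the throughput a
    single instance sustained on it in a load test. headroom is an
    optional safety margin (0.2 means 20% extra); it is an assumption
    added for illustration.
    """
    if not qps_per_instance_by_cpu:
        raise ValueError("need at least one load-test result")
    worst = min(qps_per_instance_by_cpu.values())
    if worst <= 0:
        raise ValueError("throughput must be positive")
    required = target_qps * (1 + headroom)
    return math.ceil(required / worst)


# Hypothetical load-test results, in queries per second per instance.
measured = {"cpu_variant_a": 1200, "cpu_variant_b": 900, "cpu_variant_c": 1500}

# Naive plan: load test one instance (here on variant a) and multiply.
naive = math.ceil(50000 / measured["cpu_variant_a"])
# Safer plan: assume every instance may land on the slowest variant.
safe = instances_needed(50000, measured)
print(naive, safe)
```

With these made-up numbers the naive plan asks for 42 instances and the worst-case plan for 56. The gap is the shortfall that shows up when the scheduler places instances on slower machines.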