Lessons in moving from
physical hosts to Mesos
Raj Shekhar, Senior Site Reliability Engineer
@ilunatech
Mesos
WHAT
WHY
HOW
NOW WHAT
How most Ops teams run clusters today
Static partitioning has problems
Unequal load distribution on machines
Slower to add capacity
Not fault tolerant
Is there a better way?
Do we want machines or do we want resources?
Mesos
Resource manager - the datacenter is one big pool
Can run multi-tenant workloads
Failure detection
Services are isolated from one another
Why Mesos - Better resource utilization
Run multi-tenant workload on machines
Dynamic partitioning - no dedicated machines for tasks
Less resource hungry than virtual machines
Why Mesos - all the other good things
Fault tolerant - automatically restart failed jobs
Elasticity - grow and shrink on demand
Faster deploys
T.co - URL shortening
http://example.com/example http://t.co/examp
How
Package Deploy Test Go Live!
Life after Go Live
Lowered operating expense
Fewer routine operational tasks
Faster deploys
Job throttling
Sudden spikes in latencies
What we learned
cgroups and cpu quotas
Capacity planning
Max traffic of the cluster was lower than our expectation
What we learned
Different CPU variants have different throughput
Rethink service discovery
Services get hosts and ports assigned dynamically
What we learned
Use static proxies to forward connections
No perfect isolation
Sudden spike in latency
What we learned
Async ops where possible, noisy neighbours still affect us
Questions?
rajlist@rajshekhar.net
@ilunatech

Editor's Notes

  1. Hello friends, today I am going to talk about lessons I learned when moving a service from physical hosts to Mesos.
  2. I will give a brief overview of Mesos, what its benefits are and how we migrated the service. After that, I will dive into the issues I saw after migration and what lessons I learned.
  3. When it comes to managing a cluster in the data center today, most operations teams use a static partitioning scheme. They start out with a set of servers and then provision those servers into separate roles. The machines in a role usually run a specific service, for example Apache, Rails, memcache or Hadoop.
  4. With this scheme you will often have periods where machines in one partition are resource starved while another partition is under-utilized. However, there is no easy way to reassign resources across partitioned clusters. For example, you cannot assign CPU from your cache to your web servers. It is also slower to increase the capacity of a partition. If you are expecting a spike in traffic for an event, you have to order more servers and provision them to increase capacity. Static partitioning also increases your mean time to recovery from faults. For example, if you lose two of your web servers, usually someone needs to come online and set up additional machines to replace the lost servers.
  5. Can we improve this situation? Instead of having silos of servers, can we treat the machines available as a pool and request whatever resources we need to run our services?
  6. This is where Mesos comes in. It provides a way to treat the machines in your datacenter as one big computer. Services can request quotas for CPU, memory and disk. Mesos allocates these resources and runs the services. It can run different types of services: cron jobs, batch jobs and long-running services. It can detect services that are down and restart them without any human intervention. Even though multiple services can run on the same server, they are isolated from one another.
  7. So, what are the benefits of Mesos? One of the biggest benefits of Mesos is better resource utilization. Services can elastically grow and shrink based on the amount of resources they need, and Mesos handles scheduling them. For isolation, Mesos uses containers, which are less expensive than virtual machines.
  8. Mesos also provides automatic failure detection and recovery. Failed jobs get restarted without any human intervention, which helps reduce mean time to recovery. Services can increase or reduce their resource usage easily, which improves overall resource utilization. We also saw our deploys get faster. Mesos also allows us to run multiple versions of the same service in the same environment, which makes it easier to roll out services.
  9. That was a brief overview of Mesos. The next part is about the migration of a service from physical hosts to Mesos. The service is called t.co. This is the service that handles URL shortening for Twitter. When you tweet a URL, this service converts it into a short URL. We migrated from tens of physical hosts across multiple datacenters to running around tens of jobs. These jobs ran on a shared pool of servers that ran other services as well.
  10. The first step was to create a standalone package for the service. The service could not assume the availability of any third-party libraries. It could not assume that it would have access to system-level directories. We also packaged any configuration files that would be required. Some services have pushed their configuration options to key-value databases and pull them from there on startup. Next, we deployed the package to physical servers and to the Mesos cluster. After the service was set up on the Mesos cluster, we ran load tests against it by collecting production logs and replaying them against the service running on Mesos. We compared the performance of the cluster on metrics like latency, total queries per second and garbage collection behaviour. We also monitored for core dumps and service restarts. After the service passed sanity testing, the Mesos service started getting a portion of production traffic. Initially, this was 1% of traffic. We kept an eye on the metrics using monitoring alerts to catch any breakages. We then migrated more traffic in gradual steps: 10%, 20%, 50% and 100%.
  11. Did we get the benefits we were hoping for? Operational cost was lower, since we moved to machines that were shared with other teams. Routine maintenance tasks were easier or were handled by a dedicated Mesos SRE team. Deployment and rollback were faster.
  12. Now I will talk about the issues we saw and what we learned after the migration. The first symptom we saw was that clients using the t.co service would report sudden spikes in latency from it. When we investigated, we found that this was caused by how Mesos does resource isolation. Mesos uses Linux control groups, called cgroups, to provide resource isolation. When a process starts, the cgroup gives it a quota of CPU cycles for a certain timeslice. If a process uses up its whole quota in the first few milliseconds of a timeslice, it is throttled (frozen) until the next timeslice begins. Our calculation had not allocated enough cycles to account for JVM garbage collection. We added more CPU quota and increased the number of instances, and this fixed the problem.
  13. To calculate the complete capacity of the cluster, we used a simple approach. We ran a load test on a single job and then multiplied that number by the total number of instances running. However, we saw that the cluster could not handle the traffic we had projected. This was caused by the heterogeneous set of servers on which the service was being scheduled. Some CPU variants could give higher throughput than others. To do better capacity planning, we now run load tests on all CPU variants and then use the lowest number to plan how many instances we need.
  14. Suppose we have a PHP application that needs to connect to a cache server. How does the application know which machine and which port to connect to? In the world where servers were statically allocated, the application could connect to a server on a static port and assume that the cache server would always be available on that connection. Distributed systems like Mesos require service discovery as an essential building block to connect applications and services. With Mesos, the server and port get assigned dynamically when the service starts up. Mesos would use the ZooKeeper service to keep track of which service was running on which machine and port. Anyone who needed to use the t.co service would query ZooKeeper to get the list of machines and ports and connect to them. This was a problem for some services that could not do this dynamically. We ended up setting up a few static proxy servers, and the legacy applications would connect to them. These proxies would query ZooKeeper and forward the connection to the right hosts. Airbnb and TellApart have open-sourced their software for this.
  15. We saw sudden spikes in latency even after eliminating job throttling. What we learned: co-running processes doing a lot of disk and network reads and writes affect their neighbours. Async disk I/O helps alleviate the pain. Network (ingress) is harder to isolate.
  16. Questions: Tweet to @ilunatech https://twitter.com/ilunatech Or email
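The throttling described in note 12 can be illustrated with a little arithmetic. The sketch below is not Mesos or kernel code. It assumes a single-threaded burst that starts at the beginning of an enforcement period, a 100 ms period (the usual Linux CFS default), and no other CPU use in the cgroup. The names `finish_time_ms`, `work_ms`, `quota_ms` and `period_ms` are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def finish_time_ms(work_ms: float, quota_ms: float, period_ms: float = 100.0) -> float:
    """Wall-clock time for a burst that needs work_ms of CPU time when the
    cgroup may use at most quota_ms of CPU per period_ms window.

    Assumption: the burst is single-threaded and starts exactly at the
    beginning of a period.
    """
    if work_ms <= 0:
        return 0.0
    if not 0 < quota_ms <= period_ms:
        raise ValueError("quota_ms must be in (0, period_ms]")

    # Each period in which the quota is exhausted costs a full period of
    # wall-clock time, because the process is throttled until the next one.
    exhausted_periods, remainder = divmod(work_ms, quota_ms)
    if remainder == 0:
        # The burst ends exactly when the quota of its last period runs out.
        return (exhausted_periods - 1) * period_ms + quota_ms
    return exhausted_periods * period_ms + remainder


if __name__ == "__main__":
    # A hypothetical 120 ms garbage-collection burst under a 50 ms quota.
    print(finish_time_ms(120, 50))
    # The same burst with the quota raised to the full period.
    print(finish_time_ms(120, 100))
```

Under these assumptions, a burst that fits inside the quota is not delayed at all, while one that exceeds it pays for the idle remainder of every period it exhausts. That is one way a garbage-collection pause can show up to clients as a sudden latency spike.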
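The capacity-planning rule from note 13 (load test every CPU variant, then plan with the lowest number) can be written down in a few lines. This is a minimal sketch: the function name `instances_needed`, the variant labels and the throughput figures are all hypothetical and are not measurements from the talk.

```python
import math
from typing import Mapping


def instances_needed(target_qps: float, qps_per_variant: Mapping[str, float]) -> int:
    """Number of instances to run so the cluster can serve target_qps even if
    every instance lands on the slowest CPU variant.

    qps_per_variant maps a CPU variant label to the throughput that a single
    instance sustained on it in a load test.
    """
    if not qps_per_variant:
        raise ValueError("need at least one load-test result")
    worst_case_qps = min(qps_per_variant.values())
    if worst_case_qps <= 0:
        raise ValueError("throughput figures must be positive")
    return math.ceil(target_qps / worst_case_qps)


if __name__ == "__main__":
    # Made-up load-test results for two CPU variants.
    results = {"cpu_variant_a": 1200.0, "cpu_variant_b": 900.0}
    print(instances_needed(10_000, results))
```

Planning from a load test on the faster variant alone would underestimate the instance count, which matches the shortfall described in the note.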
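The static-proxy arrangement from note 14 can be sketched without any network code. In the sketch below, an in-memory `Registry` class stands in for the ZooKeeper-backed registry. It is not the ZooKeeper API, and it is not the open-source software mentioned in the note. The class names, the round-robin choice, the hostnames and the ports are all assumptions made for illustration.

```python
import itertools
from typing import Dict, List, Tuple

Endpoint = Tuple[str, int]  # (host, port)


class Registry:
    """In-memory stand-in for the registry of dynamically assigned endpoints."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, List[Endpoint]] = {}

    def register(self, service: str, host: str, port: int) -> None:
        # Called when an instance starts and learns its host and port.
        self._endpoints.setdefault(service, []).append((host, port))

    def deregister(self, service: str, host: str, port: int) -> None:
        # Called when an instance stops; raises ValueError if it was never registered.
        self._endpoints.get(service, []).remove((host, port))

    def lookup(self, service: str) -> List[Endpoint]:
        # Return a copy so callers cannot mutate the registry's state.
        return list(self._endpoints.get(service, []))


class StaticProxy:
    """Legacy clients connect to this proxy at a fixed address. The proxy
    resolves a live backend for each new connection."""

    def __init__(self, registry: Registry, service: str) -> None:
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def pick_backend(self) -> Endpoint:
        endpoints = self._registry.lookup(self._service)
        if not endpoints:
            raise LookupError(f"no live instances of {self._service!r}")
        # Simple round-robin over whatever is registered right now.
        return endpoints[next(self._counter) % len(endpoints)]


if __name__ == "__main__":
    registry = Registry()
    registry.register("shortener", "host-a.example", 31001)
    registry.register("shortener", "host-b.example", 31742)
    proxy = StaticProxy(registry, "shortener")
    print(proxy.pick_backend())
    print(proxy.pick_backend())
```

Because the proxy looks up the endpoint list on every connection, a legacy client only ever needs the proxy's fixed address, while instances behind it can move between hosts and ports.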