Lessons in moving from
physical hosts to Mesos
Raj Shekhar, Senior Site Reliability Engineer
@ilunatech
Mesos
WHAT
WHY
HOW
NOW WHAT
How most Ops teams run clusters today
Static partitioning has problems
Unequal load distribution on machines
Slower to add capacity
Not fault tolerant
Is there a better way?
Do we want machines or do we want resources?
Mesos
Resource manager - the datacenter is one big pool
Can run multi-tenant workloads
Failure detection
Services are isolated from one another
Why Mesos - Better resource utilization
Run multi-tenant workload on machines
Dynamic partitioning - no dedicated machines for tasks
Less resource hungry than virtual machines
Why Mesos - all the other good things
Fault tolerant - automatically restart failed jobs
Elasticity - grow and shrink on demand
Faster deploys
T.co - URL shortening
http://example.com/example http://t.co/examp
How
Package -> Deploy -> Test -> Go Live!
Life after Go Live
Lowered operating expense
Fewer routine operational tasks
Faster deploys
Job throttling
Sudden spikes in latencies
What we learned
cgroups and cpu quotas
Capacity planning
Max traffic of the cluster was lower than our expectation
What we learned
Different CPU variants have different throughput
Rethink service discovery
Services get hosts and ports assigned dynamically
What we learned
Use static proxies to forward connections
No perfect isolation
Sudden spike in latency
What we learned
Async ops where possible, noisy neighbours still affect us
Questions?
rajlist@rajshekhar.net
@ilunatech

Lessons in moving from physical hosts to mesos

Editor's Notes

  1. Hello friends, today I am going to talk about lessons I learned when moving a service from physical hosts to Mesos.
  2. I will give a brief overview of Mesos, what its benefits are and how we migrated the service. After that, I will dive into the issues I saw after migration and what lessons I learned.
  3. When it comes to managing a cluster in the data center today, most operations teams use a static partitioning scheme. They start out with a set of servers and then provision the servers into separate roles. The machines in a role usually run a specific service, for example Apache, Rails, memcache or Hadoop.
  4. With this scheme you will often have periods where machines in one partition are resource starved while another partition is under-utilized. However, there is no easy way to reassign resources across partitioned clusters. For example, you cannot assign CPU from your cache servers to your web servers. It is also slower to increase the capacity of a partition. If you are expecting a spike in traffic for an event, you have to order more servers and provision them before the capacity is available. Static partitioning also increases your mean time to recovery from faults. For example, if you lose two of your web servers, usually someone needs to come online and set up additional machines to replace the lost servers.
  5. Can we improve this situation? Instead of having silos of servers, can we treat the machines available as a pool and request whatever resources we need to run our services?
  6. This is where Mesos comes in. It provides a way to treat the machines in your datacenter as one big computer. Services can request quotas for CPU, memory and disk. Mesos allocates these resources and runs the services. It can run different types of services: cron jobs, batch jobs and long-running services. It can detect services that are down and restart them without any human intervention. Even though multiple services can run on the same server, they are isolated from one another.
  7. So, what are the benefits of Mesos? One of the biggest benefits of Mesos is better resource utilization. Services can elastically grow and shrink based on the amount of resources they need, and Mesos handles scheduling the services. For isolation, Mesos uses containers, which are less expensive than virtual machines.
  8. Mesos also provides automatic failure detection and recovery. Failed jobs get restarted without any human intervention, and this helps reduce mean time to recovery. Services can increase or reduce their resource usage easily, which again improves utilization. We also saw our deploys become faster. Mesos also allows us to run multiple versions of the same service in the same environment, and that makes it easier to roll out services.
  9. That was a brief overview of Mesos. The next part is about the migration of a service from physical hosts to Mesos. The service is called t.co. This is the service that handles URL shortening for Twitter. When you tweet a URL, this service converts it into a short URL. We migrated from tens of physical hosts across multiple datacenters to running around tens of jobs. These jobs ran on a shared pool of servers that also ran other services.
  10. The first step was to create a standalone package for the service. The service could not assume the availability of any third-party libraries. It could not assume that it would have access to system-level directories. We also packaged any configuration files that would be required. Some services have pushed their configuration options to key-value databases and pull them from there on startup. Next, we deployed this package to physical servers and to the Mesos cluster. After the service was set up on the Mesos cluster, we ran load tests against it: we collected production logs and replayed them against the service running on Mesos. We compared the two deployments on metrics like latency, total queries per second and garbage collection behaviour. We also monitored for core dumps and service restarts. After the service passed sanity testing, the Mesos service started getting a portion of production traffic. Initially, this was 1% of traffic. We kept an eye on the metrics using monitoring alerts to catch any breakages. We then migrated more traffic in gradual steps: 10%, 20%, 50% and 100%.
  11. Did we get the benefits we were hoping for? Operational cost was lower, because we moved to machines that were shared with other teams. Routine maintenance tasks were easier or were handled by a dedicated Mesos SRE team. Deployment and rollback were faster.
  12. Now I will talk about the issues we saw and what we learned after the migration. The first symptom we saw was that clients of the t.co service would report sudden spikes in latency. When we investigated, we found that this was caused by how Mesos does resource isolation. Mesos uses Linux control groups, called cgroups, to provide resource isolation. When a process starts, the cgroup gives it a quota of CPU cycles for each timeslice. If a process consumes its complete quota in the first few milliseconds of a timeslice, it is frozen until the next timeslice. Our calculation had not allocated enough cycles to account for garbage collection in the JVM. We added more CPU quota and increased the number of instances, and this problem was fixed.
  13. To calculate the total capacity of the cluster, we used a simple approach. We ran a load test on a single job and then multiplied that number by the total number of instances running. However, we saw that the cluster could not handle the traffic we had projected. This was caused by the heterogeneous set of servers on which the service was being scheduled. Some CPU variants could give higher throughput than others. To do better capacity planning, we now run the load test on all CPU variants and use the lowest number to plan how many instances we need.
  14. Suppose we have a PHP application that needs to connect to a cache server. How does the application know which machine and which port to connect to? In the world where servers were statically allocated, the application could connect to a server on a static port and assume that the cache server would always be available on that connection. Distributed systems like Mesos require service discovery as an essential building block to connect applications and services. With Mesos, the server and port get assigned dynamically when the service starts up. Mesos would use the ZooKeeper service to keep track of which service was running on which machine and port. Anyone that needed to use the t.co service would query ZooKeeper to get the list of machines and ports and connect to them. This was a problem for some services that could not do this dynamically. We ended up setting up a few static proxy servers, and the legacy applications would connect to them. These proxies would query ZooKeeper and forward the connection to the right hosts. Airbnb and TellApart have open-sourced their software for this.
  15. Sudden spikes in latencies even after eliminating job throttling. What we learned: co-running processes doing a lot of disk and network reads/writes affect neighbours; async disk I/O helps alleviate the pain; the network is harder to isolate (ingress).
  16. Questions: Tweet to @ilunatech https://twitter.com/ilunatech Or email
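
Note 10 describes sending 1% of production traffic to Mesos and then ramping up in steps. The notes do not say how the split was implemented, so the following is only a minimal sketch of one common approach: hash each request key into one of 100 buckets and send the lowest buckets to the new deployment. The function name `routes_to_mesos` and the request keys are hypothetical.

```python
import zlib

# Ramp steps taken from note 10 (percent of production traffic on Mesos).
RAMP_STEPS = [1, 10, 20, 50, 100]


def routes_to_mesos(request_key: str, percent: int) -> bool:
    """Return True if this request should go to the Mesos deployment.

    Assumption: a stable hash of the key picks a bucket in [0, 100);
    buckets below `percent` are routed to Mesos. A key that is routed
    to Mesos at one step therefore stays on Mesos at every later step.
    """
    bucket = zlib.crc32(request_key.encode("utf-8")) % 100
    return bucket < percent


if __name__ == "__main__":
    keys = [f"request-{i}" for i in range(10_000)]  # hypothetical keys
    for step in RAMP_STEPS:
        share = sum(routes_to_mesos(k, step) for k in keys) / len(keys)
        print(f"step {step:3d}% -> observed share {share:.1%}")
```

Because the bucket depends only on the key, raising the percentage only ever adds traffic to the new deployment, which keeps before/after metric comparisons easier to read.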
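
Note 12 explains that a process which uses up its CPU quota early in a timeslice is frozen until the next one. The arithmetic behind the resulting latency spike can be sketched as below. This is a simplified model, not Mesos or kernel code: it assumes a single thread, a burst that starts at the beginning of a timeslice, and a 100 ms timeslice (the usual Linux default for CPU bandwidth control). The burst sizes and quotas are made-up numbers.

```python
def wall_clock_ms(cpu_needed_ms: float, quota_ms: float,
                  period_ms: float = 100.0) -> float:
    """Wall-clock time to finish a CPU burst under a per-period quota.

    Simplified model (assumptions): one thread, the burst starts at the
    start of a period, and the task is frozen from the moment its quota
    runs out until the next period begins.
    """
    if quota_ms <= 0 or period_ms <= 0:
        raise ValueError("quota and period must be positive")
    if cpu_needed_ms <= 0:
        return 0.0
    if quota_ms >= period_ms:
        # A single thread can never use more than one full period of
        # CPU per period, so it is never throttled.
        return float(cpu_needed_ms)
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # The last period is only occupied until the quota is used up.
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder


if __name__ == "__main__":
    # Hypothetical 120 ms garbage-collection burst.
    for quota in (50, 100):
        print(f"quota {quota} ms -> {wall_clock_ms(120, quota)} ms wall clock")
```

In this model a 120 ms burst takes 220 ms of wall-clock time with a 50 ms quota but only 120 ms once the quota covers the whole timeslice, which is the shape of the fix described in the note (more quota, more instances).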
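
Note 13 contrasts two ways of sizing the cluster: multiplying a single load-test result by the instance count, and planning against the slowest CPU variant. A minimal sketch of the second calculation follows; the variant names and throughput figures are invented for illustration and do not come from the talk.

```python
import math


def instances_needed(target_qps: int, qps_by_cpu_variant: dict) -> int:
    """Instances required if every instance might land on the slowest CPU.

    `qps_by_cpu_variant` maps a CPU variant name to the queries per
    second one instance sustained on it in a load test.
    """
    if not qps_by_cpu_variant:
        raise ValueError("need at least one load-test result")
    worst_case_qps = min(qps_by_cpu_variant.values())
    return math.ceil(target_qps / worst_case_qps)


if __name__ == "__main__":
    # Hypothetical load-test results per CPU variant.
    results = {"variant-a": 500, "variant-b": 400}
    naive = math.ceil(10_000 / results["variant-a"])  # one test, one variant
    safe = instances_needed(10_000, results)
    print(f"naive plan: {naive} instances, worst-case plan: {safe} instances")
```

With these example numbers the single-test plan comes out five instances short of the worst-case plan, which is the kind of gap the note describes.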
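
Note 14 describes static proxies that look up the current hosts and ports of a service and forward legacy clients to them. The sketch below shows only the lookup-and-pick step. It is not the software mentioned in the talk: `Registry` is a hypothetical in-memory stand-in for the ZooKeeper-backed registry, `StaticProxy` is an illustrative name, the hosts and ports are made up, and round-robin selection is an assumption.

```python
class Registry:
    """Hypothetical in-memory stand-in for a ZooKeeper-backed registry."""

    def __init__(self):
        self._endpoints = {}

    def register(self, service, host, port):
        # Called when an instance starts and learns its dynamic port.
        self._endpoints.setdefault(service, []).append((host, port))

    def deregister(self, service, host, port):
        # Called when an instance stops or is rescheduled elsewhere.
        self._endpoints[service].remove((host, port))

    def lookup(self, service):
        return list(self._endpoints.get(service, []))


class StaticProxy:
    """Runs at a fixed address; picks a live backend per connection."""

    def __init__(self, registry, service):
        self.registry = registry
        self.service = service
        self._counter = 0

    def pick_backend(self):
        endpoints = self.registry.lookup(self.service)
        if not endpoints:
            raise LookupError(f"no instances registered for {self.service}")
        # Round robin over whatever is registered right now.
        backend = endpoints[self._counter % len(endpoints)]
        self._counter += 1
        return backend


if __name__ == "__main__":
    registry = Registry()
    registry.register("t.co", "host-1", 31001)  # made-up hosts and ports
    registry.register("t.co", "host-2", 31742)
    proxy = StaticProxy(registry, "t.co")
    print([proxy.pick_backend() for _ in range(3)])
```

The legacy client only needs the proxy's fixed address; the proxy absorbs the fact that backends move between hosts and ports.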