In the old world, both our datacenter operator Dan and our application developer Alice had to be concerned with both the physical infrastructure running an application and the actual configuration and deployment of that application.
- Dan would have to know how to run Alice's application in case a node went away.
- We get a painful situation where Dan has to treat each of his machines like a carefully crafted snowflake.
- Alice would have to know how much hardware to ask Dan for, and would usually overestimate to be safe.
- This means that resource utilisation is often pretty bad.
Now:
- In the Mesos world, you have a homogeneous cluster running your applications. It doesn't matter where your software gets deployed since it brings its dependencies with it.
- Dan doesn't have to worry about specific machines going away and disproportionately impacting specific applications.
- Alice doesn't have to worry too much about requesting enough resources ahead of time, because it's quick and easy to scale her application.
If an application or server went down, your infrastructure might have enough slack to tolerate it, but often that's not possible. Your operator and/or developer gets paged and has to restart the application manually or try to revive the faulty hardware.
Now:
- Mesos and Marathon provide a health checking mechanism that allows them to tell if an application is up and responsive. If it's not, they'll reap the application and start it again somewhere else.
- Similarly, if a machine goes down, Mesos communicates the task failure to Marathon. Marathon will try to maintain the desired state (e.g. 100 instances) and re-deploy the application somewhere else.
- Dan doesn't get woken up because DCOS takes care of these failures.
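To make the health checking and desired-state points concrete, here is a minimal sketch of what a Marathon app definition could look like, built in Python. The app id, command, resource sizes and check timings are hypothetical values chosen for illustration; only the field names (`instances`, `healthChecks` and its settings) come from Marathon's app definition format.

```python
import json

# Hypothetical app definition: the id, cmd and resource sizes are
# illustrative assumptions, not values from this talk.
app_definition = {
    "id": "/alice/web",
    "cmd": "python3 -m http.server $PORT0",
    "cpus": 0.25,
    "mem": 128,
    # Desired state: Marathon tries to keep this many instances running.
    "instances": 100,
    "healthChecks": [
        {
            "protocol": "HTTP",
            "path": "/",
            "portIndex": 0,
            # Time allowed for a new task to start before failed checks count.
            "gracePeriodSeconds": 30,
            "intervalSeconds": 10,
            "timeoutSeconds": 5,
            # After this many failed checks in a row the task is killed
            # and a replacement is started.
            "maxConsecutiveFailures": 3,
        }
    ],
}

# Serialise to the JSON document that would be saved as marathon.json.
marathon_json = json.dumps(app_definition, indent=2)
print(marathon_json)
```

The point for Dan is that the failure policy lives in the app definition rather than in a runbook, so nobody has to be paged to apply it.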
- Provisioning machines was a privileged action. Alice would have to petition Dan for an upgrade. Changes were slow.
Now:
- Dan can provide Alice with her own instance of Marathon. Its tasks are inherently isolated from other tasks running on the same cluster. Alice is able to run what she likes using this instance (with some reasonable constraints).
- Alice can now programmatically deploy applications to Marathon using its APIs.
Coordinator -> scheduler
Slaves -> agents
Spring 2009: 262B project between BenH, Andy and Matei
2009: Nexus project started at UC Berkeley's RAD Lab (later renamed to Mesos)
March 2010: Grad students give talk at Twitter, Twitter starts exploring deployment
September 2010: Mesos technical report published (http://mesos.berkeley.edu/mesos_tech_report.pdf)
December 2010: Mesos enters Apache Incubator (http://incubator.apache.org/projects/mesos.html)
Slave -> agent
Sunil starts here
Let's talk a little about why Mesos was so useful for Twitter.
When Mesos was presented at Twitter, engineers in the audience immediately saw potential. A couple of engineers, who had been at Google, saw this as a possible replacement for Borg.
Mesos actually works pretty well for long running services.
This is a little more involved.
- Marathon takes care of deploying your application: just post marathon.json to http://dcos-host/service/my-marathon
- It has different strategies; the default is to start as many healthy new instances of the app as are currently running before killing the old app.
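The two points above can be sketched in Python as follows. This is only an illustration under a few assumptions: the base URL is the one from these notes, `/v2/apps` is Marathon's REST endpoint for creating apps, and the app id, command and sizes are hypothetical. The `upgradeStrategy` values are written out explicitly to show the default behaviour described above. The request is built but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Base URL taken from the notes; "/v2/apps" is appended for the apps endpoint.
MARATHON_URL = "http://dcos-host/service/my-marathon"


def build_deploy_request(app: dict, base_url: str = MARATHON_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates an app in Marathon."""
    body = json.dumps(app).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v2/apps",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical app definition, i.e. the contents of marathon.json.
app = {
    "id": "/alice/web",
    "cmd": "python3 -m http.server $PORT0",
    "cpus": 0.25,
    "mem": 128,
    "instances": 4,
    "upgradeStrategy": {
        # Keep the full complement of healthy instances during an upgrade...
        "minimumHealthCapacity": 1.0,
        # ...by allowing up to as many extra instances again while it runs.
        "maximumOverCapacity": 1.0,
    },
}

request = build_deploy_request(app)
print(request.get_method(), request.full_url)

# To actually deploy against a live cluster, send the request:
# urllib.request.urlopen(request)
```

Lowering `minimumHealthCapacity` trades safety for speed: Marathon may then kill some old instances before their replacements are healthy, which helps when the cluster has no spare capacity for a full second copy of the app.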