Scaling Container Architectures
with OSS & Mesos
David Greenberg
QCon NY
6/13/2016
Who am I?
Engineer, Architect, and Operator
Formerly with Two Sigma
Open Source Fan
Today
• Why does a hedge fund need a cluster?
• What did we have before?
• What did we replace it with?
• What challenges did we solve?
Today
• Why does a hedge fund need a cluster?
• What did we have before?
• What did we replace it with?
• What challenges did we solve?
What’s a quant fund do?
Math
Trading system
Data
Feed
Machine
Learning
Trade
How do we choose data
feeds?
Experimentation
How do we design models?
We need a platform
Today
• Why does a hedge fund need a cluster?
• What did we have before?
• What did we replace it with?
• What challenges did we solve?
A long, long time ago
In a galaxy far, far away
Send work to job system
Abstracts scheduling & resources
Scaled beautifully
Limitations
It’s not a feature, it’s a bug
Limitation 1: Quotas
Limitation 1: Quotas
Limitation 2: API
Limitation 3:
Isolation, in theory
Limitation 3:
Isolation, in reality
Today
• Why does a hedge fund need a cluster?
• What did we have before?
• What did we replace it with?
• What challenges did we solve?
Requirements
• Scalability to 1000+
machines
• Throughput guarantee
• Strong isolation
• High utilization
What even scales?
Alternatives
OpenStack
Alternative Schedulers
Service goals differ between
organizations
Torque, Moab, Slurm
Introducing: Cook
Our DIY, not NIH scheduler
How to replace a critical
system?
Slay a few dragons
SLA v1
What SLA?
SLA v2
All that work for no improvement
SLA v3
The easiest thing
SLA v4
Enhanced Mesos’s DRF
with CRS
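For readers unfamiliar with DRF (Dominant Resource Fairness), the idea behind it can be sketched in a few lines. This is a minimal illustration of the textbook allocation loop, not Cook's or Mesos's actual code, and it does not model the CRS enhancement named on the slide. The function names, the resource names, and the example cluster are all assumptions made for illustration.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(allocated: Resources, capacity: Resources) -> float:
    """A user's dominant share is their largest fractional use of any one resource."""
    return max(allocated.get(r, 0.0) / capacity[r] for r in capacity)


def drf_allocate(capacity: Resources, demands: Dict[str, Resources]) -> Dict[str, int]:
    """Hand out tasks one at a time to the user with the lowest dominant share.

    `demands` maps a (hypothetical) user name to the resources one of that
    user's tasks needs. Returns how many tasks each user was granted.
    Simplification: a user whose next task no longer fits is dropped from
    consideration, and the loop ends when nobody's task fits.
    """
    allocated = {user: {r: 0.0 for r in capacity} for user in demands}
    used = {r: 0.0 for r in capacity}
    tasks = {user: 0 for user in demands}
    active = set(demands)

    while active:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        user = min(sorted(active),
                   key=lambda u: dominant_share(allocated[u], capacity))
        demand = demands[user]
        fits = all(used[r] + demand.get(r, 0.0) <= capacity[r] for r in capacity)
        if not fits:
            active.discard(user)
            continue
        for r in capacity:
            used[r] += demand.get(r, 0.0)
            allocated[user][r] += demand.get(r, 0.0)
        tasks[user] += 1

    return tasks
```

With an assumed cluster of 9 CPUs and 18 GB of memory, a user A whose tasks need 1 CPU and 4 GB, and a user B whose tasks need 3 CPUs and 1 GB, `drf_allocate` grants A three tasks and B two tasks, so both end with a dominant share of two thirds.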
Today
• Why does a hedge fund need a cluster?
• What did we have before?
• What did we replace it with?
• What challenges did we solve?
Scaling is painful
How do you debug?
SSH and grep logs
=
How do you automate manual
processes?
Sometimes 100% isn’t good
enough
Retrospective
That went well
Scaled to thousands of nodes
Supports Spark & dozens of
internal tools
Automatically recovers from
common issues
Takeaways
Does your workload fit the tool?
Architectural Refactoring
Questions?
Want to ask a question,
but don’t know what to ask?
• Understanding your workload
• Dragonslaying/architectural refactoring
• Mesos
• Operations at scale
QCon Scaling Container Architectures with OSS & Mesos

David Greenberg QCon 16

Published in: Technology

  1. Scaling Container Architectures with OSS & Mesos David Greenberg QCon NY 6/13/2016
  2. Who am I? Engineer, Architect, and Operator
  3. Formerly with Two Sigma
  4. Open Source Fan
  5. Today • Why does a hedge fund need a cluster? • What did we have before? • What did we replace it with? • What challenges did we solve?
  6. Today • Why does a hedge fund need a cluster? • What did we have before? • What did we replace it with? • What challenges did we solve?
  7. What’s a quant fund do? Math
  8. Trading system Data Feed Machine Learning Trade
  9. How do we choose data feeds?
  10. Experimentation
  11. How do we design models?
  12. We need a platform
  13. Today • Why does a hedge fund need a cluster? • What did we have before? • What did we replace it with? • What challenges did we solve?
  14. A long, long time ago In a galaxy far, far away
  15. Send work to job system Abstracts scheduling & resources
  16. Scaled beautifully
  17. Limitations It’s not a feature, it’s a bug
  18. Limitation 1: Quotas
  19. Limitation 1: Quotas
  20. Limitation 2: API
  21. Limitation 3: Isolation, in theory
  22. Limitation 3: Isolation, in reality
  23. Today • Why does a hedge fund need a cluster? • What did we have before? • What did we replace it with? • What challenges did we solve?
  24. Requirements • Scalability to 1000+ machines • Throughput guarantee • Strong isolation • High utilization
  25. What even scales?
  26. Alternatives
  27. OpenStack
  28. Alternative Schedulers Service goals differ between organizations
  29. Torque, Moab, Slurm
  30. Introducing: Cook Our DIY, not NIH scheduler
  31. How to replace a critical system? Slay a few dragons
  32. SLA v1 What SLA?
  33. SLA v2 All that work for no improvement
  34. SLA v3 The easiest thing
  35. SLA v4 Enhanced Mesos’s DRF with CRS
  36. Today • Why does a hedge fund need a cluster? • What did we have before? • What did we replace it with? • What challenges did we solve?
  37. Scaling is painful
  38. How do you debug? SSH and grep logs
  39. =
  40. How do you automate manual processes? Sometimes 100% isn’t good enough
  41. Retrospective That went well
  42. Scaled to thousands of nodes
  43. Supports Spark & dozens of internal tools
  44. Automatically recovers from common issues
  45. Takeaways
  46. Does your workload fit the tool?
  47. Architectural Refactoring
  48. Questions? Want to ask a question, but don’t know what to ask? • Understanding your workload • Dragonslaying/architectural refactoring • Mesos • Operations at scale
