Zoe: Swarming Spark applications
Daniele Venzano
Research Engineer, EURECOM

Slides from the talk given at DockerCon EU 2015 by Daniele Venzano on using Docker Swarm for data intensive applications.

Published in: Data & Analytics

Zoe - Swarming Spark applications

  1. Zoe: Swarming Spark applications
     Daniele Venzano
     Research Engineer, EURECOM
  2. My background
     Software engineering (2010)
     • Linux embedded systems, kernel drivers, graphical interfaces
     Research (2012)
     • Code analysis, OpenFlow, automatic bug detection
     More research (now)
     • Virtualization, networking, distributed systems performance
  3. DSG and Eurecom
     Research center on the French Riviera
     Like this?
  4. DSG and Eurecom
     Research center on the French Riviera
     Or more like this?
  5. DSG and Eurecom
     Engineering research center
     • Academic research in telecommunication, multimedia, networks and security
     • Close ties with local and international companies
     Distributed Systems Group
     • Focusing on data-intensive applications (so called “big data”) at all levels
     • Performance impact of virtualization, storage and network technologies (that’s me!)
     • Data processing frameworks (Hadoop, Spark)
     • Machine learning algorithms
  6. Docker at the Distributed Systems Group
     Started investigating Docker in 2012
     • Virtualization platform for Big Data research
     Summer 2015
     • Built Swarm cluster
     • Planning to shift from VMs to Containers for most use cases
     Bigfoot project
  7. Use cases
     Internally at Eurecom:
     • Laboratory sessions for Data Science course
     • ~100 students, fixed configuration, throw-away environments
     • Academic research
     • very dynamic loads, all kinds of software combinations, higher priorities near deadlines
     Companies have similar use cases
     • Production jobs
     • Fixed configuration, periodic executions
     • Research teams
     Smart airports
     Power load forecasting
     Customer location forecasting
  8. The last 3 years: OpenStack + Sahara
     Public/private cloud with VM-based virtualization
     We contributed Spark support to Sahara
     Users can create clusters on-demand
     Assumes infinite resources
     Slow
     • Create an HDFS+Spark cluster: 5 to 10 minutes
     • Swarm takes a few seconds for the same task
     Supporting new services/versions requires code changes
     Users make static allocations
  9. Why build on top of Docker and Swarm?
     Swarm has a simple, documented API
     Start solving our problem immediately
     Packaging software is very easy
     Freedom to experiment
     Fast deployments
     No static allocation, automatic resizing
     Swarm does only one thing and does it well
  10. Zoe
      Application scheduler on top of Swarm
      Queues requests when resources are scarce
      Users can submit their own applications
      And create their own container images!
      Dynamically resizes active applications
      Frees unused resources to speed up other apps
      Can coexist with other Swarm users
      MSC Zoe: Launch: August 2015, Tonnage: 197,362t, Capacity: 19,224 TEU, Length: 395.4 m, Engine: 83,800 HP, Crew: 22
  11. What is a Zoe application?
  12. Zoe architecture
      Diagram labels: Zoe scheduler; Swarm; Images from private registry or Docker Hub; Monitoring data
      Users submit application descriptions
      Zoe schedules requests
  13. Automatic resize of running applications
      Layers: Volumes, Data layer, Applications
      Example: a data layer is not needed if there are no users
      Data is kept in volumes
      The data layer can be restarted when needed
  14. Examples of scheduling policies
      FIFO – First In First Out
      Priority based
      • Researchers near deadlines have more priority
      • Fits the Swarm priority model nicely
      Deadline
      • Finish this work by 3 p.m.
      • Streaming analysis latency must be less than 200ms
      Size-based
      • Run the smallest applications first
      • Need to know the runtime in advance
  15. Zoe implementation
      Two client implementations
      • Web interface
      • Command line for scripting
      Simple FIFO scheduler
      Docker images for Spark, HDFS, iPython and Spark notebooks
      Open source on GitHub, images available on the Docker Hub
  16. Zoe - future
      Set date: March 2016 version 1.0
      Big plans for Zoe
      • One full-time programmer
      • All the companies we spoke to are very interested
      Features for 1.0 and after:
      • Create Zoe applications with more and more services
      • Automatic resizing of applications
      • Use the new volume management
      • Monitoring
      • Advanced scheduling
  17. Using Docker Swarm for data-intensive apps
      L2 networking for Docker containers
      Service discovery via DNS
      Diagram labels: Docker bridge, eth0, eth1 (shown twice); containers c1, c2, c3, c4
      What about Swarm 1.0 multi-host networking?
      - We need hostnames to be visible from outside
      - Will run measurements on overlay network performance
  18. Key takeaways
      1. Zoe is a data-intensive application scheduler that targets data scientists and private clouds
      2. It is very easy to build cloud applications on top of Swarm
      3. Data-intensive frameworks like Spark can run easily and efficiently on top of Swarm
      4. Network between Docker containers on different hosts can be made transparent
  19. Thank you!
      Daniele Venzano
      http://zoe-analytics.eu
      venza@brownhat.org
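
As a rough illustration of slides 10 and 14, the Python sketch below queues application requests while the cluster is full and starts them in an order chosen by a pluggable policy. It is not Zoe's actual code: the names `Request`, `Scheduler`, `fifo`, `priority_based` and `size_based`, and the single `cores` resource, are all assumptions made for this example, and nothing here talks to Swarm.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical application request (not Zoe's real data model)."""
    name: str
    cores: int               # resources the application asks for
    priority: int = 0        # larger means more urgent
    est_runtime: float = 0.0 # seconds; size-based policies need this in advance
    seq: int = 0             # submission order, assigned by the scheduler


# Each policy maps a request to a sort key; the smallest key starts first.
def fifo(r):
    return r.seq


def priority_based(r):
    return (-r.priority, r.seq)


def size_based(r):
    return (r.est_runtime, r.seq)


class Scheduler:
    """Queues requests when resources are scarce and starts them by policy."""

    def __init__(self, total_cores, policy=fifo):
        self.free = total_cores
        self.policy = policy
        self.queue = []
        self.running = []
        self._next_seq = 0

    def submit(self, req):
        req.seq = self._next_seq
        self._next_seq += 1
        self.queue.append(req)
        self._dispatch()

    def finish(self, name):
        # Release the resources of a finished application, then retry the queue.
        for r in self.running:
            if r.name == name:
                self.running.remove(r)
                self.free += r.cores
                break
        self._dispatch()

    def _dispatch(self):
        # Start requests in policy order; stop at the first one that does
        # not fit (no backfilling), so the head of the queue is never skipped.
        self.queue.sort(key=self.policy)
        while self.queue and self.queue[0].cores <= self.free:
            r = self.queue.pop(0)
            self.free -= r.cores
            self.running.append(r)
```

Swapping `policy=fifo` for `priority_based` or `size_based` changes only the order in which waiting requests are started. A deadline policy would need a similar key function plus a runtime estimate, which the slides note must be known in advance.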
