Clusterous is a new open source tool to make cluster computing on AWS easier for scientists, data scientists, and anyone who isn't a cloud computing expert.
https://github.com/sirca/clusterous
Clusterous: Easy cluster computing with Docker and AWS
1. Clusterous - Easy Cluster Computing with Docker and AWS
SIRCA
Balram Ramanathan
Tuesday 22nd March 2016
2. Who we are
● SIRCA was founded in 1997 by a group of Australian and New Zealand
universities as a not-for-profit company
● Our mission is to enable data intensive research
● We also provide academics access to a number of key large-scale data sets
primarily in the finance space
3. Project background
Clusterous is part of SIRCA’s contribution to the Big Data Knowledge Discovery project, a
collaborative project funded by the Science and Industry Endowment Fund (SIEF). The project was
created to bring scientists in data-centric disciplines together with leaders in information
technology, exploring how big data and machine learning can enable a new paradigm in research
and unlock new insights.
4. Problem we are trying to solve
● Scientists want access to compute power, but often end up stuck with
physical machines - hard to scale
● AWS provides an answer, but getting started can be daunting, and setting
up and using a compute cluster is tedious
○ Any productivity gained from faster compute threatens to be offset by setup/admin overhead
● Getting your code to run on remote machines can be a headache of its own
○ Different OS versions, dependencies, etc.
○ How to deploy across multiple machines?
● Clearly a need for a tool to make cluster computing in the cloud easy for those
who write code but aren’t cloud experts
5. Clusterous makes cluster computing easier
● Open source command line tool written in Python
● Use the simple config “wizard” to enter your AWS credentials and configure
your account
● Put a few cluster parameters in a YAML file - such as instance types and
number of instances
● Start the cluster
● All clusters have a shared volume for your data, config files, etc.
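The cluster parameters file described above might look something like the following. This is an illustrative sketch only: the field names (`master_instance_type`, `worker_count`, etc.) are assumptions for the purpose of the example, not the verified Clusterous schema — consult the project README for the exact format.

```yaml
# Hypothetical cluster parameters file (e.g. mycluster.yml).
# Field names below are illustrative assumptions, not the exact
# Clusterous schema.
cluster_name: demo-cluster
parameters:
  master_instance_type: t2.medium   # assumed key: EC2 type for the master
  worker_instance_type: c4.xlarge   # assumed key: EC2 type for workers
  worker_count: 4                   # assumed key: number of worker instances
```

A single Clusterous command then launches the cluster from a file like this; the exact subcommand name may vary between releases, so check the tool's built-in help.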
7. BYO Code
● Clusterous doesn’t impose any parallel compute framework or language
● Put your code plus supporting libraries in Docker containers, and deploy to
the cluster with the help of “Environments”
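Packaging code for the cluster is ordinary Docker practice. A minimal sketch, assuming a Python project: `analysis.py` and `requirements.txt` are hypothetical placeholders for your own code and dependency list.

```dockerfile
# Minimal sketch of a container image for a Clusterous cluster.
# "analysis.py" and "requirements.txt" are placeholders for your own code.
FROM python:2.7

# Install third-party libraries first so this layer is cached
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt

# Add the application code itself
COPY analysis.py /app/
WORKDIR /app

# Default command run when the container starts on a cluster node
CMD ["python", "analysis.py"]
```

Because the container carries its own OS layer and dependencies, the "different OS versions, dependencies" headache from slide 4 disappears: every node runs the identical image.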
8. Environments
● An “environment” is a complete, running runtime setup for your code on the cluster
● An environment file is a simple YAML-based script for deploying your
containers to the cluster
● Also copies files, builds Docker images (if needed), creates a tunnel
● Get your application deployed and running in a single step
● Environment files are redistributable - write once, run many times
● We have created environments for IPython Parallel and PySpark -
many users may just use those
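An environment file covering the capabilities listed above (copying files, building an image, running containers, opening a tunnel) might be sketched as follows. The key names here are assumptions chosen to illustrate those capabilities, not the verified Clusterous schema — refer to the Clusterous documentation and its bundled IPython Parallel and PySpark environments for real examples.

```yaml
# Hypothetical environment file -- key names are illustrative
# assumptions, not the verified Clusterous schema.
name: my-analysis-env
environment:
  copy:                        # files copied to the cluster's shared volume
    - input_data/
  image:                       # Docker image built if not already available
    - dockerfile: image/
      name: my-analysis
  components:
    worker:                    # one container component per cluster role
      machine: worker          # which machines run this container
      image: my-analysis
      cmd: python analysis.py  # command executed inside the container
  expose_tunnel:               # SSH tunnel to reach a web UI on the cluster
    service: 8888
```

Because everything needed to deploy the application lives in this one file, handing it to a colleague is enough for them to reproduce the same running cluster environment.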
9. Our users so far
● Our partners have run their own parallel compute software on
Clusterous clusters
● One project partner uses R for ecology simulations - they created
rrqueue, an open source distributed task queue for R
● A team at Data61 ran Stateline, a framework for distributed Markov
Chain Monte Carlo sampling, on a Clusterous cluster
11. Credits
● Our team consists of Balram Ramanathan, Lolo Fernandez and Ben King
● Big thank you to our project partners at Data61, University of Sydney and
Macquarie University for their input
● Thanks to SIEF for the funding
● We are aiming to release version 1.0 in the next few weeks