Clusterous - Easy Cluster Computing with Docker and AWS
SIRCA
Balram Ramanathan
Tuesday 22nd March 2016
Who we are
● SIRCA was founded in 1997 by a group of Australian and New Zealand
universities as a not-for-profit company
● Our mission is to enable data intensive research
● We also provide academics with access to a number of key large-scale data sets,
primarily in the finance space
Project background
Clusterous is part of SIRCA’s contribution to the Big Data Knowledge Discovery project, a
collaborative project funded by the Science and Industry Endowment Fund (SIEF). The project was
created to bring scientists in data-centric disciplines together with leaders in information
technology to explore how they can utilise big data and machine learning to create a
new paradigm in research and unlock new learnings.
Problem we are trying to solve
● Scientists want access to compute power, but often end up stuck with
physical machines - hard to scale
● AWS provides an answer, but getting started can be daunting, and setting up
and using a compute cluster is tedious
○ Any productivity gained from faster compute threatens to be offset by setup/admin overhead
● Getting your code to run on remote machines can be a headache of its own
○ Different OS versions, dependencies, etc.
○ How to deploy across multiple machines?
● There is a clear need for a tool that makes cluster computing in the cloud easy
for those who write code but aren’t cloud experts
Clusterous makes cluster computing easier
● Open source command line tool written in Python
● Use the simple config “wizard” to enter your AWS credentials and configure
your account
● Put a few cluster parameters in a YAML file - such as instance types and
number of instances
● Start the cluster
● All clusters have a shared volume for your data, config files, etc.
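To make the "few cluster parameters in a YAML file" idea concrete, here is a minimal Python sketch that validates a small parameter set and renders it as flat YAML. The field names (`cluster_name`, `instance_type`, `instance_count`) are hypothetical illustrations and are not taken from Clusterous's actual cluster file format; consult the project repository for the real schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class ClusterParams:
    """Hypothetical cluster parameters; NOT Clusterous's real schema."""
    cluster_name: str
    instance_type: str   # an AWS instance type, e.g. "t2.micro"
    instance_count: int  # number of instances in the cluster

    def validate(self) -> None:
        # Reject obviously unusable values before anything is launched.
        if not self.cluster_name:
            raise ValueError("cluster_name must not be empty")
        if self.instance_count < 1:
            raise ValueError("instance_count must be at least 1")


def to_yaml(params: ClusterParams) -> str:
    """Render the flat parameter set as simple 'key: value' YAML lines."""
    params.validate()
    lines = [f"{key}: {value}" for key, value in asdict(params).items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_yaml(ClusterParams("demo", "t2.micro", 3)), end="")
```

The point of the sketch is only that the user-facing configuration can stay small: a name, an instance type, and a count.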
BYO Code
● Clusterous doesn’t impose any parallel compute framework or language
● Put your code plus supporting libraries in Docker containers, and deploy to
the cluster with the help of “Environments”
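As a rough illustration of packaging code plus supporting libraries in a Docker container, the sketch below generates a Dockerfile for a single Python script. The helper `render_dockerfile`, the script name, the package list, and the `python:3` base image are assumptions chosen for the example; Clusterous itself does not prescribe any of them.

```python
from typing import Sequence


def render_dockerfile(script: str, packages: Sequence[str],
                      base_image: str = "python:3") -> str:
    """Build Dockerfile text for one script and its pip dependencies.

    Hypothetical helper for illustration; not part of Clusterous.
    """
    lines = [f"FROM {base_image}", "WORKDIR /app"]
    if packages:
        # Install the supporting libraries inside the image.
        lines.append("RUN pip install --no-cache-dir " + " ".join(packages))
    # Copy the user's code into the image and run it by default.
    lines.append(f"COPY {script} /app/{script}")
    lines.append(f'CMD ["python", "/app/{script}"]')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("simulate.py", ["numpy", "scipy"]), end="")
```

Because the dependencies are baked into the image, the same container can run on every node regardless of the host's OS version or installed libraries.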
Environments
● An “environment” is a complete running environment for your code
● An environment file is a simple YAML-based script for deploying your
containers to the cluster
● An environment file can also copy files, build Docker images (if needed) and create a tunnel
● Get your application deployed and running in a single step
● Environment files are redistributable - write once, run many
● We have created environments for IPython Parallel and PySpark -
many users may just use those
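The actions an environment file describes can be modelled as an ordered plan. This is a sketch only: the class `EnvironmentSketch`, its field names, and the order of the steps are assumptions made for illustration and do not reflect the real Clusterous environment file syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnvironmentSketch:
    """Hypothetical stand-in for what an environment file describes."""
    name: str
    copy_files: List[str] = field(default_factory=list)    # files to copy to the cluster
    build_images: List[str] = field(default_factory=list)  # Docker images to build if needed
    containers: List[str] = field(default_factory=list)    # containers to deploy
    tunnel_port: Optional[int] = None                       # port to tunnel to, if any

    def plan(self) -> List[str]:
        """Return the steps in an assumed order: copy, build, deploy, tunnel."""
        steps = [f"copy {path}" for path in self.copy_files]
        steps += [f"build {image}" for image in self.build_images]
        steps += [f"deploy {container}" for container in self.containers]
        if self.tunnel_port is not None:
            steps.append(f"tunnel localhost:{self.tunnel_port}")
        return steps


if __name__ == "__main__":
    env = EnvironmentSketch(
        name="ipython-demo",
        copy_files=["notebooks/"],
        build_images=["demo/worker"],
        containers=["controller", "engine"],
        tunnel_port=8888,
    )
    for step in env.plan():
        print(step)
```

Keeping the whole description in one declarative object is what makes an environment redistributable: anyone with the file can reproduce the same deployment in a single step.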
Our users so far
● Our partners have run their own parallel compute software on
Clusterous clusters
● One project partner uses R for ecology simulations - they created
rrqueue, an open source distributed task queue for R
● A team at Data61 ran Stateline, a framework for distributed Markov
Chain Monte Carlo sampling, on Clusterous
Demo time
Credits
● Our team consists of Balram Ramanathan, Lolo Fernandez and Ben King
● Big thank you to our project partners at Data61, University of Sydney and
Macquarie University for their input
● Thanks to SIEF for the funding
● We are aiming to release version 1.0 in the next few weeks
https://github.com/sirca/clusterous
balram.ramanathan@sirca.org.au
