Clusterous is a new open source tool to make cluster computing on AWS easier for scientists, data scientists, and anyone who isn't a cloud computing expert.
https://github.com/sirca/clusterous
Clusterous: Easy cluster computing with Docker and AWS
1. Clusterous - Easy Cluster Computing with Docker and AWS
SIRCA
Balram Ramanathan
Tuesday 22nd March 2016
2. Who we are
● SIRCA was founded in 1997 by a group of Australian and New Zealand
universities as a not-for-profit company
● Our mission is to enable data intensive research
● We also provide academics access to a number of key large-scale data sets
primarily in the finance space
3. Project background
Clusterous is part of SIRCA’s contribution to the Big Data Knowledge Discovery project, a
collaborative project funded by the Science and Industry Endowment Fund (SIEF). The project was
created to bring scientists in data-centric disciplines together with leaders in information
technology, exploring how big data and machine learning can enable a new paradigm in research
and unlock new insights.
4. Problem we are trying to solve
● Scientists want access to compute power, but often end up stuck with
physical machines - hard to scale
● AWS provides an answer, but getting started can be daunting, and setting
up and using a compute cluster is tedious
○ Any productivity gained from faster compute threatens to be offset by setup/admin overhead
● Getting your code to run on remote machines can be a headache of its own
○ Different OS versions, dependencies, etc.
○ How to deploy across multiple machines?
● Clearly a need for a tool to make cluster computing in the cloud easy for those
who write code but aren’t cloud experts
5. Clusterous makes cluster computing easier
● Open source command line tool written in Python
● Use the simple config “wizard” to enter your AWS credentials and configure
your account
● Put a few cluster parameters in a YAML file - such as instance types and
number of instances
● Start the cluster
● All clusters have a shared volume for your data, config files, etc.
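The cluster parameters file described above might look something like the following. This is an illustrative sketch only: the field names (`master_instance_type`, `worker_count`, etc.) are assumptions for the purpose of the example, not the verified Clusterous schema — consult the project README for the exact format.

```yaml
# Hypothetical cluster parameters file (e.g. mycluster.yml).
# Field names below are illustrative assumptions, not the exact
# Clusterous schema.
cluster_name: demo-cluster
parameters:
  master_instance_type: t2.medium   # assumed key: EC2 type for the master
  worker_instance_type: c4.xlarge   # assumed key: EC2 type for workers
  worker_count: 4                   # assumed key: number of worker instances
```

A single Clusterous command then launches the cluster from a file like this; the exact subcommand name may vary between releases, so check the tool's built-in help.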
7. BYO Code
● Clusterous doesn’t impose any parallel compute framework or language
● Put your code plus supporting libraries in Docker containers, and deploy to
the cluster with the help of “Environments”
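Packaging code for the cluster is ordinary Docker practice. A minimal sketch, assuming a Python project: `analysis.py` and `requirements.txt` are hypothetical placeholders for your own code and dependency list.

```dockerfile
# Minimal sketch of a container image for a Clusterous cluster.
# "analysis.py" and "requirements.txt" are placeholders for your own code.
FROM python:2.7

# Install third-party libraries first so this layer is cached
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt

# Add the application code itself
COPY analysis.py /app/
WORKDIR /app

# Default command run when the container starts on a cluster node
CMD ["python", "analysis.py"]
```

Because the container carries its own OS layer and dependencies, the "different OS versions, dependencies" headache from slide 4 disappears: every node runs the identical image.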
8. Environments
● An “environment” is a complete, running runtime setup for your code on the cluster
● An environment file is a simple YAML-based script for deploying your
containers to the cluster
● Also copies files, builds Docker images (if needed), creates a tunnel
● Get your application deployed and running in a single step
● Environment files are redistributable - write once, run many times
● We have created environments for IPython Parallel and PySpark -
many users may just use those
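An environment file covering the capabilities listed above (copying files, building an image, running containers, opening a tunnel) might be sketched as follows. The key names here are assumptions chosen to illustrate those capabilities, not the verified Clusterous schema — refer to the Clusterous documentation and its bundled IPython Parallel and PySpark environments for real examples.

```yaml
# Hypothetical environment file -- key names are illustrative
# assumptions, not the verified Clusterous schema.
name: my-analysis-env
environment:
  copy:                        # files copied to the cluster's shared volume
    - input_data/
  image:                       # Docker image built if not already available
    - dockerfile: image/
      name: my-analysis
  components:
    worker:                    # one container component per cluster role
      machine: worker          # which machines run this container
      image: my-analysis
      cmd: python analysis.py  # command executed inside the container
  expose_tunnel:               # SSH tunnel to reach a web UI on the cluster
    service: 8888
```

Because everything needed to deploy the application lives in this one file, handing it to a colleague is enough for them to reproduce the same running cluster environment.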
9. Our users so far
● Our partners have run their own parallel compute software on
Clusterous clusters
● One project partner uses R for ecology simulations - they created
rrqueue, an open source distributed task queue for R
● A team at Data61 ran Stateline, a framework for distributed Markov
Chain Monte Carlo sampling, on a Clusterous cluster
11. Credits
● Our team consists of Balram Ramanathan, Lolo Fernandez and Ben King
● Big thank you to our project partners at Data61, University of Sydney and
Macquarie University for their input
● Thanks to SIEF for the funding
● We are aiming to release version 1.0 in the next few weeks