BDM29: AdamCloud Project - Part I

AdamCloud: a cloud infrastructure for a genomics project
David Lauzon & Sébastien Bonami
Presented at Big Data Montreal #29 on October 7th 2014

Plan
● Project
● Use Cases
● Requirements
● Technologies
● Environments
● Planning
● Challenges
● Conclusion

Project
Final project / Projet de fin d’études (PFE)
Sébastien Bonami
Student in IT Engineering at École de technologie supérieure (ÉTS)
Goal: build a proof of concept of a new genomics platform and optimize the infrastructure for portability

Use Cases
● Genome ETL
● Genome Data Mining

Use Cases: UC1 Genome ETL
● About 98% of one human's DNA is similar to that of every other human
● ETL
o Workload: CPU & RAM & HDD intensive
o Data size (per patient sample)
▪ Input: 100-200 GB
▪ Output: 10 MB
o Process Duration: currently takes 2-3 weeks
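
To put the UC1 data sizes in perspective, the reduction between input and output can be worked out directly. The following is a minimal sketch, assuming decimal units (1 GB = 1000 MB); the function name `reduction_factor` is illustrative and not part of the project.

```python
# Rough size-reduction factor for the UC1 ETL, per patient sample.
# Assumption: decimal units (1 GB = 1000 MB); the sizes come from the slide.

MB_PER_GB = 1000  # assumed decimal convention


def reduction_factor(input_gb: float, output_mb: float) -> float:
    """Return how many times smaller the output is than the input."""
    return (input_gb * MB_PER_GB) / output_mb


low = reduction_factor(100, 10)   # smallest input size on the slide
high = reduction_factor(200, 10)  # largest input size on the slide
print(f"Reduction factor: {low:,.0f}x to {high:,.0f}x")
```

Under these assumptions the ETL output is roughly 10,000 to 20,000 times smaller than its input.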

Use Cases: UC2 Genome Data Mining
● Not really big data
● Researchers use output of UC1 for data mining

Requirements
● Infrastructure portability
o from local workstations to the Cloud
● RAD (rapid application development)
o Ease of development
o IT students focus on infrastructure
o Developer students focus on development

Requirements
● Demo to hospitals and conferences
o Avoid firewall / bureaucratic issues
● Knowledge/Project Transfer
o For next student who picks up the project
o Quick startup

Technologies
● UC1 Genome ETL
o Apache Spark
o Berkeley Genomics Stack (based on Spark)
▪ Snap
▪ Adam
▪ Avocado

Technologies
● UC2 Genome Data Mining
o Backend
▪ Adam / Spark
▪ HDFS
▪ Play! Framework (for REST API)
▪ ...
o Frontend
▪ HTML5 / Bootstrap / Backbone / jQuery
▪ ...

Environments
● Local
o For UC1 & UC2 developer
● Cluster of Mini PCs
o For testing
● ÉTS servers
o For private data
● Amazon AWS
o For public data
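
The mapping between environments and their purpose can be written down as a small lookup. This is only a sketch: the `Environment` enum, the `pick_environment` function, and the activity labels ("development", "testing", "processing") are hypothetical names introduced for illustration, not part of the project.

```python
# Sketch of the environment choices listed on the slide.
# All names here (Environment, pick_environment, activity labels) are
# hypothetical and chosen for illustration only.
from enum import Enum


class Environment(Enum):
    LOCAL = "Local"
    MINI_PC_CLUSTER = "Cluster of Mini PCs"
    ETS_SERVERS = "ÉTS servers"
    AWS = "Amazon AWS"


def pick_environment(activity: str, private_data: bool = False) -> Environment:
    """Map an activity and data sensitivity to one of the four environments."""
    if activity == "development":
        return Environment.LOCAL            # UC1 & UC2 developers
    if activity == "testing":
        return Environment.MINI_PC_CLUSTER  # test cluster
    if activity == "processing":
        # Private data stays on ÉTS servers; public data may go to AWS.
        return Environment.ETS_SERVERS if private_data else Environment.AWS
    raise ValueError(f"unknown activity: {activity!r}")


print(pick_environment("processing", private_data=True).value)
```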

Infrastructure centralization? (e.g. management layer)
[Diagram: stack layers, from top to bottom]
● Berkeley Genomics (Snap / Adam / Avocado)
● Spark
● Docker (1 container per service)
● VMware / Boot2Docker
● Amazon AMI / Linux bare metal / Mac Mini (w/ Linux) / Mac OS X / Windows

Planning
1. Run the Stack in 1 container on 1 node
2. Then, in multiple containers on 1 node
3. Then, on multiple nodes
4. Then, on a different hardware infrastructure
5. Then, in the Cloud
6. Monitor the environment
7. Document everything
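
Step 2 of the plan (multiple containers on one node, one container per service) could be scripted along these lines. This is a sketch under stated assumptions: the service names and the `example/...` image names are hypothetical placeholders, and the script only prints the `docker run` commands rather than executing them.

```python
# Sketch of planning step 2: one container per service on a single node.
# Assumption: the service and image names below are hypothetical placeholders.
import shlex
from typing import Dict, List

SERVICES: Dict[str, str] = {
    "spark-master": "example/spark-master",
    "spark-worker": "example/spark-worker",
    "hdfs-namenode": "example/hdfs-namenode",
}


def docker_run_command(name: str, image: str) -> List[str]:
    """Build a `docker run` argv that starts `image` detached under `name`."""
    return ["docker", "run", "-d", "--name", name, image]


# One command per service; printed for review, not executed.
commands = [docker_run_command(name, image) for name, image in SERVICES.items()]
for argv in commands:
    print(shlex.join(argv))
```

Keeping the commands as data makes it easier to reuse the same service list when moving on to multiple nodes or a different infrastructure.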

Challenges
● Understanding genomics concepts
o What's a genome?
● Understanding the Berkeley Genomics Stack
o What does each tool do?
o How can they integrate with each other?
● Managing a distributed system and a cluster
