AdamCloud: a cloud
infrastructure for a
genomics project
David Lauzon & Sébastien Bonami
Presented at Big Data Montreal #29 on October 7th 2014
Plan
● Project
● Use Cases
● Requirements
● Technologies
● Environments
● Planning
● Challenges
● Conclusion
Project
Final project / Projet de fin d’études (PFE)
Sébastien Bonami
Student in IT Engineering at École de technologie
supérieure (ÉTS)
Goal: Build a proof of concept of a new genomics platform
and optimize the infrastructure for portability
Use Cases
● Genome ETL
● Genome Data Mining
Use Cases: UC1 Genome ETL
● About 98% of a human's DNA is similar to
that of every other human
● ETL
o Workload: CPU & RAM & HDD intensive
o Data size (per patient sample)
 Input: 100 - 200 GB
 Output: 10 MB
o Process Duration: currently takes 2-3 weeks
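To put the UC1 data sizes in perspective, the short calculation below derives how much the ETL step shrinks each patient sample. It is only a sketch: it assumes decimal units (1 GB = 1000 MB), which the slides do not specify, and the function name is illustrative.

```python
# Data sizes per patient sample, taken from the slide above.
# Assumption: decimal units (1 GB = 1000 MB); the slides do not say which.
MB_PER_GB = 1000

input_gb_range = (100, 200)  # ETL input per sample, in GB
output_mb = 10               # ETL output per sample, in MB


def reduction_ratio(input_gb: float, output_mb: float) -> float:
    """Return how many times smaller the output is than the input."""
    return (input_gb * MB_PER_GB) / output_mb


# Compute the ratio for the low and high ends of the input range.
ratios = [reduction_ratio(gb, output_mb) for gb in input_gb_range]
print(f"Reduction ratio: {ratios[0]:,.0f}:1 to {ratios[1]:,.0f}:1")
```

Under that assumption the output is about four orders of magnitude smaller than the input.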
Use Cases: UC2 Genome Data Mining
● Not really big data
● Researchers use output of UC1 for data
mining
Requirements
● Infrastructure portability
o from local workstations to the Cloud
● RAD (rapid application development)
o Ease of development
o IT students focus on infrastructure
o Developer students focus on development
Requirements
● Demo to hospitals and conferences
o Avoid firewall / bureaucratic issues
● Knowledge/Project Transfer
o For next student who picks up the project
o Quick startup
Technologies
● UC1 Genome ETL
o Apache Spark
o Berkeley Genomics Stack (based on Spark)
 Snap
 Adam
 Avocado
Technologies
● UC2 Genome Data Mining
o Backend
 Adam / Spark
 HDFS
 Play! Framework (for REST API)
 ...
o Frontend
 HTML5 / Bootstrap / Backbone / jQuery
 ...
Environments
● Local
o For UC1 & UC2 developers
● Cluster of Mini PCs
o For testing
● ÉTS servers
o For private data
● Amazon AWS
o For public data
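The mapping from task to environment listed above can be sketched as a small lookup. This is a hypothetical helper, not part of the project: the `purpose` labels ("development", "testing", "processing") and the function name are assumptions introduced for illustration.

```python
from enum import Enum


class Environment(Enum):
    """The four environments named on the slide."""
    LOCAL = "Local"
    MINI_PC_CLUSTER = "Cluster of Mini PCs"
    ETS_SERVERS = "ÉTS servers"
    AWS = "Amazon AWS"


def pick_environment(purpose: str, data_is_private: bool = False) -> Environment:
    """Hypothetical helper: choose an environment for a task.

    The purpose labels are assumptions; the slide only says which
    environment serves developers, testing, private data and public data.
    """
    if purpose == "development":
        return Environment.LOCAL
    if purpose == "testing":
        return Environment.MINI_PC_CLUSTER
    if purpose == "processing":
        # Private data stays on the ÉTS servers; public data can go to AWS.
        return Environment.ETS_SERVERS if data_is_private else Environment.AWS
    raise ValueError(f"unknown purpose: {purpose!r}")


print(pick_environment("processing", data_is_private=True).value)
```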
Infrastructure centralization?
e.g. management layer
[Diagram: layered stack, top to bottom]
● Berkeley Genomics (Snap / Adam / Avocado)
● Spark
● Docker (1 container per service)
● VMware, Boot2Docker
● Amazon AMI, Linux bare metal, Mac Mini (w/ Linux), Mac OS X, Windows
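The "1 container per service" layer can be illustrated with a minimal sketch that builds one `docker run` command per service. The service names and image names below are placeholders chosen for illustration, not the project's actual images, and the split into these three services is an assumption.

```python
from __future__ import annotations

import shlex

# Hypothetical service-to-image mapping; the image names are placeholders,
# not published images, and the service split is an assumption.
SERVICES = {
    "spark-master": "example/spark-master",
    "spark-worker": "example/spark-worker",
    "hdfs-namenode": "example/hdfs-namenode",
}


def docker_run_command(service: str, image: str) -> list[str]:
    """Build the argv for one detached container running one service."""
    return ["docker", "run", "-d", "--name", service, image]


# One command per service, so each service gets its own container.
commands = [docker_run_command(name, image) for name, image in SERVICES.items()]
for argv in commands:
    print(shlex.join(argv))
```

The sketch only prints the commands rather than executing them, so it can be run on a machine without Docker installed.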
Planning
1. Run the Stack in 1 container on 1 node
2. Then, in multiple containers on 1 node
3. Then, on multiple nodes
4. Then, on a different hardware infrastructure
5. Then, in the Cloud
6. Monitor the environment
7. Document everything
Challenges
● Understanding genomics concepts
o What’s a genome?
● Understanding the Berkeley Genomics Stack
o What does each tool do?
o How can they integrate with each other?
● Managing a distributed system and a cluster
Conclusion
Any comments?

BDM29: AdamCloud Project - Part I
