BDM29: AdamCloud Project - Part I

AdamCloud: a cloud infrastructure for a genomics project. The AdamCloud project aims to simplify the installation of the AMPLab genomics pipeline (Snap, Adam, Avocado).

The results of the first iteration (part II) were presented here:
http://www.slideshare.net/davidonlaptop/bdm32-adam-cloud-part-2-43514904

Published in: Healthcare

  1. AdamCloud: a cloud infrastructure for a genomics project
     David Lauzon & Sébastien Bonami
     Presented at Big Data Montreal #29 on October 7th, 2014
  2. Plan
     ● Project
     ● Use Cases
     ● Requirements
     ● Technologies
     ● Environments
     ● Planning
     ● Challenges
     ● Conclusion
  3. Project
     Final project / Projet de fin d’études (PFE)
     Sébastien Bonami, student in IT Engineering at École de technologie supérieure (ÉTS)
     Goal: build a proof of concept of a new genomics platform and optimize the infrastructure for portability
  4. Use Cases
     ● Genome ETL
     ● Genome Data Mining
  5. Use Cases: UC1 Genome ETL
     ● About 98% of a human's DNA is similar to that of every other human
     ● ETL
       o Workload: CPU, RAM and HDD intensive
       o Data size (per patient sample)
         ▪ Input: 100 - 200 GB
         ▪ Output: 10 MB
       o Process duration: currently 2-3 weeks
  6. Use Cases: UC2 Genome Data Mining
     ● Not really big data
     ● Researchers use the output of UC1 for data mining
  7. Requirements
     ● Infrastructure portability
       o From local workstations to the Cloud
     ● RAD (rapid application development)
       o Ease of development
       o IT students focus on infrastructure
       o Developer students focus on development
  8. Requirements
     ● Demo to hospitals and conferences
       o Avoid firewall / bureaucratic issues
     ● Knowledge / project transfer
       o For the next student who picks up the project
       o Quick startup
  9. Technologies
     ● UC1 Genome ETL
       o Apache Spark
       o Berkeley Genomics Stack (based on Spark)
         ▪ Snap
         ▪ Adam
         ▪ Avocado
  10. Technologies
     ● UC2 Genome Data Mining
       o Backend
         ▪ Adam / Spark
         ▪ HDFS
         ▪ Play! Framework (for REST API)
         ▪ ...
       o Frontend
         ▪ HTML5 / Bootstrap / Backbone / jQuery
         ▪ ...
  11. Environments
     ● Local
       o For UC1 & UC2 developers
     ● Cluster of Mini PCs
       o For testing
     ● ÉTS servers
       o For private data
     ● Amazon AWS
       o For public data
  12. Infrastructure centralization? (e.g. management layer)
     Diagram labels: Berkeley Genomics (Snap / Adam / Avocado), Spark, Docker (1 container per service), VMWare, Boot2Docker, Amazon AMI Linux, Bare Metal, Mac Mini (w/ Linux), Mac OS X, Windows
  13. Planning
     1. Run the stack in 1 container on 1 node
     2. Then, in multiple containers on 1 node
     3. Then, on multiple nodes
     4. Then, on a different hardware infrastructure
     5. Then, in the Cloud
     6. Monitor the environment
     7. Document everything
  14. Challenges
     ● Understanding genomics concepts
       o What is a genome?
     ● Understanding the Berkeley Genomics Stack
       o What does each tool do?
       o How can they integrate with each other?
     ● Managing a distributed system and a cluster
  15. Conclusion
     Any comments?
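
For a sense of scale, the UC1 figures on slide 5 (100 - 200 GB of input and 10 MB of output per patient sample) imply a size reduction of roughly four orders of magnitude. Below is a minimal sketch of that arithmetic. It assumes decimal units (1 GB = 1000 MB), which the slides do not specify, and the function name is purely illustrative.

```python
# Figures come from slide 5 (UC1 Genome ETL).
# Assumption: decimal units, 1 GB = 1000 MB (not stated in the slides).
MB_PER_GB = 1000


def reduction_factor(input_gb: float, output_mb: float) -> float:
    """Return how many times smaller the ETL output is than its input."""
    return (input_gb * MB_PER_GB) / output_mb


low = reduction_factor(100, 10)   # smallest input quoted on the slide
high = reduction_factor(200, 10)  # largest input quoted on the slide
print(f"UC1 shrinks each sample by a factor of {low:,.0f} to {high:,.0f}")
```

This is consistent with slide 6 describing UC2 as "not really big data": the data mining step only sees the small ETL output.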