Bio IT World Europe 2010



  • 1. Cloud Computing and Genomics Richard Holland BioITWorld Europe 2010 Cloud Computing Workshop
  • 2. Cloud Computing and Genomics ● Company background ● Who we support ● Sequence analysis in the cloud ● Genomic data in the cloud ● Cloud summary
  • 3. Raison d'être ● Founded by 3 ex-Ensembl staff. ● Commercial services for open-source bioinformatics. ● Support open-source through collaborations. ● RedHat of bioinformatics.
  • 4. What we do ● Help you choose the right solution for the problem. ● Right tools for the right jobs. ● For any chosen open-source system: ✔ Support and train. ✔ Modify, extend, and improve. ✔ Integrate private data/systems. ✔ Manage private mirrors.
  • 5. What we do ● Genomic data analysis pipelines ✔ SNP calling ✔ miRNA detection ✔ Probe mapping ✔ Assembly and annotation ✔ Etc. ● Results integrated into other systems.
  • 6. Open-source ● Used wherever possible. ● Produced wherever possible. ● Contribute all changes. ● Cutting-edge. ● Quality of concept (not necessarily code). ● Rapidly adjustable.
  • 7. Cloud ● Low-cost for ad-hoc work. ● Scalable. ● Low-maintenance. ● More secure. ● Accessible. ● No licences. ● No installation.
  • 8. Cloud Computing and Genomics ● Company background ● Who we support ● Sequence analysis in the cloud ● Genomic data in the cloud ● Cloud summary
  • 9. Ensembl ® ● Not only a genome browser, but also: ● Genome databases. ● Perl API and BioMart data warehouse. ● Collaboration agreement with EMBLEM. ● Exchange knowledge. ● Share revenue. ● EBI are working with Amazon. Ensembl is a registered trademark of Genome Research Ltd., and is developed by WTSI/EBI.
  • 10. Taverna ● Workflow design and execution. ● Collaboration agreement with the University of Manchester (in progress). ● Exchange knowledge. ● Share revenue. ● Taverna are actively researching cloud deployment.
  • 11. TraitTag ● Trait detection in seedlings. ● Avoid waiting for adult plants. ● Collaboration agreement with Plant Bioscience Ltd. (John Innes Centre) ● Exchange knowledge. ● Share revenue. ● Already cloud-capable.
  • 12. Others ● Other knowledge areas: ● O|B|F Bio* projects – e.g. BioJava, BioPerl, BioSQL, etc. ● BioMart ● DebianMed ● No formal collaborations with these.
  • 13. Big Pharma ● Drug development. ● Pharmacogenomics. ● Software-as-a-service. ● Operational bioinformatics. ● Traditionally suspicious of both open-source and the cloud. This is changing.
  • 14. Agri-biotech ● Chemical development. ● Seed development. ● Software-as-a-service. ● Operational bioinformatics. ● Into open-source (DuPont has BioPerl people), ready to explore the cloud.
  • 15. Biotech SMEs ● Anything and everything. ● Highly specialised. ● Poorly funded. ● Lack in-house bioinformatics. ● Many don't even know what it is. ● One-off analyses and consulting projects. ● Love anything that saves them money.
  • 16. Others ● CROs. ● Universities. ● Government institutes. ● Animal health an emerging area. ● Public bodies are hardest to convince about open-source and the cloud. ● Funding agencies often the cause.
  • 17. Cloud Computing and Genomics ● Company background ● Who we support ● Sequence analysis in the cloud ● Genomic data in the cloud ● Cloud summary
  • 18. eHive ● Used for all Eagle's data analysis pipelines. ● Developed by the Ensembl Compara project. ● Generic framework (yet another one?). ● Perl. ● Ensembl-aware but independent. ● Scalable and robust. ● Open-source.
  • 19. eHive Severin et al. eHive: An Artificial Intelligence workflow system for genomic analysis. BMC Bioinformatics 2010, 11:240
  • 20. eHive ● Works out-of-the-box on LSF. ● Can run standalone in a single machine. ● Modified to work on Condor and SGE. ● Modified to work on Amazon EC2 without needing Condor/SGE/LSF etc. (crafty trick)
  • 21. eHive ● Standalone pipelines. ● Packaged into VMs (black-box appliances). ● Cumbersome to distribute. ● Doesn't scale well. ● What about the cloud?
  • 22. eHive ● Easy to set up the same pipeline and interface on EC2. ● But doesn't solve scaling, and clusters are hard to implement in the cloud. ● Crafty trick – self-replicating, self-terminating instances. ● Scales to as many parallel instances as required (up to a preset limit). ● Keeps idle instances alive until just short of the hour, just in case.
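The self-replicating, self-terminating behaviour on slide 22 can be sketched as two decisions each instance makes for itself. This is a minimal illustration, not the eHive implementation: the `WorkerState` fields, the `jobs_per_worker` threshold and the five-minute shutdown margin are all assumptions, and hourly billing is inferred from the slide's "just short of the hour".

```python
from dataclasses import dataclass

BILLING_PERIOD_S = 3600  # assumed: instances are paid for in whole hours
SHUTDOWN_MARGIN_S = 300  # hypothetical safety margin before the hour ends


@dataclass
class WorkerState:
    uptime_s: int         # seconds since this instance booted
    pending_jobs: int     # jobs still waiting to be claimed
    running_workers: int  # instances currently alive, including this one
    max_workers: int      # the preset limit on parallel instances


def should_replicate(state: WorkerState, jobs_per_worker: int = 10) -> bool:
    """Start one more instance if the backlog exceeds current capacity."""
    if state.running_workers >= state.max_workers:
        return False  # never exceed the preset limit
    return state.pending_jobs > state.running_workers * jobs_per_worker


def should_terminate(state: WorkerState) -> bool:
    """An idle instance stays alive until just short of its paid hour."""
    if state.pending_jobs > 0:
        return False  # there is still work to claim
    seconds_into_hour = state.uptime_s % BILLING_PERIOD_S
    return seconds_into_hour >= BILLING_PERIOD_S - SHUTDOWN_MARGIN_S
```

Keeping an idle instance until the end of an hour that has already been paid for costs nothing extra, and it means newly arriving jobs do not have to wait for a fresh instance to boot.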
  • 23. eHive on EC2 – an example ● A big pharma customer had some microarrays. ● Probe sequence data provided by vendors (Affymetrix, Agilent, etc.). ● Need high-quality mapping with quality scores and other metadata. ● Some chips have public mappings but not to required standard/format. ● Many chips have no public mappings at all.
  • 24. eHive on EC2 – an example [Diagram components: eHive+EC2; Ens funcgen db; Reports; Probe sequences; Extended Ens API; Library of mapping pipelines]
  • 25. eHive on EC2 – an example ● Getting data in/out no problem – very small. ● Mix-up with keys led to key-per-project. ● Blackboard and results MySQL performance. ● Needed to prove that the job management is working. ● Not firing up too many machines. ● Not forgetting to shut them down.
  • 26. eHive on EC2 – an example ● Scales nicely to about 50 machines. ● Beyond that MySQL is the bottleneck. ● Performance-tuning MySQL raises limit. ● Deals better with lots of small input files and/or input as a database table. ● S3 slow but still plenty fast enough for projects like this (1 or 2 extra hours in the context of 100s is not much). ● Really need EBS volumes that can be mounted read-only by multiple instances.
  • 27. eHive on EC2 – an example ● Customer data was sent to us encrypted, transferred to Amazon encrypted, only decrypted once inside Amazon. ● Results transferred out in the same way. ● All processing internal to Amazon. ● Usual steps to prevent access – firewall, stop unused daemons, disable logins, etc.
  • 28. Cloud Computing and Genomics ● Company background ● Who we support ● Sequence analysis in the cloud ● Genomic data in the cloud ● Cloud summary
  • 29. Hosted data services ● Genomic data can be big, but can also be small. ● Analysed data vs. raw data. ● Generic repositories vs. specialised resources. ● Organisations still sensitive about query intercept and log mining. ● Leads to in-house or secured third-party hosting.
  • 30. Hosted data services ● Simplest resources can run in a VM. ● Ensembl web browser (but not database). ● Functional, effective, if a bit slow. ● VM files slow to download, expensive to ship. ● Can achieve the same effect using the cloud. ● Distribute AMIs, or host instances?
  • 31. Hosted data services – an example ● A big pharma needs Ensembl in-house. ● Fed up with in-house maintenance. ● Wanted to try it in the cloud. ● Must run entirely within their own Amazon account.
  • 32. Hosted data services – an example [Diagram components: Customer a/c; Eagle a/c; Public Ens DB; Private Ens DB; Database AMI; Database instance; Browser AMI; Browser instance]
  • 33. Hosted data services – an example ● Ensembl public data in us-east region. ● Migration to other regions can be slow and expensive. ● No guaranteed update schedule. ● Updating it ourselves slow and expensive. ● How to avoid having to maintain multiple instances for multiple customers?
  • 34. Pistoia Sequence Services ● Pistoia Alliance is an industry alliance of big pharma and related companies. ● Sharing pre-competitive resources. ● Pistoia Sequence Services proof-of-concept ● To share Ensembl and related services. ● Eagle in partnership with Cognizant Technology Solutions were successful participants.
  • 35. Pistoia Sequence Services ● Already had the solution – an Ensembl AMI. ● Need to make it capable of running more than just Ensembl – easy. ● Some of the extra services need small amounts of private data, secured and partitioned – fairly easy. ● Need to secure it, and scale it – pretty hard.
  • 36. Pistoia Sequence Services [Diagram components: Users; Zeus load balancer with SSL; Usage and Billing; Mapping service; Ensembl; PlasMapper etc.; Public mirror; Private data]
  • 37. Pistoia Sequence Services ● All user connections via SSL. ● Authentication using SSO against customer auth servers. ● OpenAM TrafficScript plugin for Zeus. ● SAML2 to LDAP, ActiveDirectory, etc. ● App servers firewalled. ● Load balancer connections only. ● No inter-connection except to RDBMS.
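The network rules on slide 37 amount to a small allowlist of permitted connections. The sketch below is illustrative only: the role names are hypothetical labels rather than anything from the deployment, and a real setup would express these rules in firewall or security-group configuration rather than application code.

```python
# Hypothetical role names; each pair is (source, destination).
ALLOWED_FLOWS = {
    ("user", "load_balancer"),        # all user connections arrive via SSL
    ("load_balancer", "app_server"),  # app servers accept the load balancer only
    ("app_server", "rdbms"),          # the only onward connection app servers make
}


def is_allowed(source_role: str, dest_role: str) -> bool:
    """Return True only for flows that are explicitly allowed."""
    return (source_role, dest_role) in ALLOWED_FLOWS
```

Anything not listed is denied by default, so a user cannot reach an app server directly and app servers cannot talk to each other.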
  • 38. Pistoia Sequence Services ● In future could add almost any sequence-related service that is/can be web-enabled. ● PoC runs until Christmas, with free access for all Pistoia members. ● Full commercial service launches Q1 2011 subject to demand and PoC success.
  • 39. Cloud Computing and Genomics ● Company background ● Who we support ● Sequence analysis in the cloud ● Genomic data in the cloud ● Cloud summary
  • 40. Cloud is bad ● Expensive (always-on solutions). ● Slow data transfer in/out (it's the internet). ● Not big (S3 5GB limit). ● Not fast (HTC not HPC). ● Myths and misconceptions. ● Over-hyped. ● Misused.
  • 41. Cloud is good ● Scalable. ● Upgradeable. ● Secure. ● Low-cost (for ad-hoc work). ● Accessible. ● Standardised. ● Fast enough. ● Big enough.
  • 42. Choose your tools to suit the job ● Don't assume your paradigm will translate. ● Rephrase or optimise. ● Do it right, reap the rewards. ● Do it wrong, better off not bothering.
  • 43. Thanks ● Jason Stowe at CycleComputing. ● Matt Wood at Amazon. ● Cambridge Healthtech Institute. ● Our partners at Cognizant. ● Our collaborators at the EBI, WTSI, JIC, and University of Manchester. ● All producers of open-source bioinformatics software and data.