Sequence Services Phase 2--Eagle Genomics and Cycle Computing

William Spooner (Eagle) and Carl Chesal (Cycle) introduce the proof of concept provided by this consortium for Phase 2 of the Pistoia Alliance Sequence Services project. The presentation was delivered at the Pistoia Alliance Conference in Boston, MA, on April 24, 2012.

  • Hello and welcome. I’m Will Spooner, CTO at Eagle Genomics. I’m proud to be here today to present the Eagle/Cycle partnership’s proof-of-concept solution to the Pistoia Sequence Services Phase 2 RFP. If you want to see the system in action and see how it looks, come to our demo later.
  • Firstly, I would like to thank the Pistoia Alliance, not just for backing our project, but for being a driving force behind open innovation in the life sciences. I firmly believe that Pistoia’s approach to open innovation is itself truly innovative. I suggest that it is the focus on exploitation [translating ideas into deliverables] that really sets Pistoia apart. Initiatives like Sequence Services are a great example of this. We have been conceptually designing a platform for bioinformatics analysis for the past 4 years, but it would probably still be on the drawing board without the support and backing of the Pistoia Alliance.
  • So, what is Sequence Services? “It is a vision for a platform where users can solve scientific problems based on DNA/RNA sequence information, tailored to the needs of the pharmaceutical industry.” The exact requirements for Phase 2 were set out in a 27-page “Request For Proposals” document. This is a fantastic synthesis of the various priorities of several different pharma companies, focused not just on managing NGS data and applications, but also on collaborating with research partners in a secure environment.
  • Eagle participated in Pistoia Sequence Services Phase 1, partnered with Cognizant. We built a cloud architecture to host Ensembl and other sequence analysis applications from scratch. However, Phase 2 was sufficiently broader in scope that we needed to build from an existing cloud architecture platform. Eagle had seen Cycle present at a conference and were impressed by their offering, which was perfect for this particular use case. The partnership between Eagle and Cycle was born in a meeting between Richard Holland (Eagle CBO) and Jason Stowe (Cycle CEO) in a Starbucks outside New York Penn Station; humble beginnings often lead to great things! It brings together two companies with shared values: young, dynamic and ambitious; dedicated to providing exceptional services; acknowledged domain experts; virtualisation/cloud evangelists; implementers of progressive business models, covering both software as a service and professional open source; supporters of and contributors to open source; with highly satisfied international, multi-sector client bases.
  • So, over the past 3 months, with backing from the Pistoia Alliance, we have built a platform capable of delivering pretty much all of the requirements set out in the RFP. I’m pleased to be able to introduce the Elastic Analysis Platform [ElasticAP]: the software as a service for storage, analysis and, importantly, sharing of life sciences data and applications on the cloud. The logo represents a ball of rubber bands. Why? Many parts make up the whole, but the whole is more than the sum of its parts. It is flexible and all parts are interchangeable. The whole thing is robust, secure and bounces back if you drop it.
  • From a bird’s-eye view, what we provide is pretty simple. The platform enables users to upload, store, analyse, and share not just data, but applications as well. For example, a sequencing vendor that the biologist has commissioned uploads data to the platform. This data is shared with the customer, who grants access to the bioinformatician. The bioinformatician analyses the data using either manual or automated workflows. The results of these analyses are shared with the biologist and subsequently, if appropriate, with collaborators, such as academic partners the biologist is working with.
  • Our approach to the implementation has been to adopt a loosely coupled architecture, select best-of-breed components, many of which are open source, and connect them using industry standards, protocols, and web services. Users interact with ElasticAP through a single point of entry embedded within a security gateway. The gateway supports customer-integrated single sign-on. Behind the gateway there is the Workspace for management of data and applications, from where data analysis tools can be launched and monitored. Other than the Workspace, the system is single tenancy, meaning each customer’s data and applications are contained in isolated sandboxes, increasing security.
  • We have spent a great deal of time and effort cementing the core infrastructure of ElasticAP, without which none of the rest of the platform can function correctly. Now that this work is complete, we are racing ahead with populating the system with data and applications. For public data sets, we have populated the system with sample Ensembl and UCSC reference genomes. For bioinformatics applications, we are packaging those developed for Sequence Services Phase 1, and have the Ensembl Genome Browser ready. For pipelines, we have BioLinux + Galaxy, which we expect to have available very shortly. Please come to the demo in this room after the session, where I will show some of the functionality.
  • On 29th March, Amazon announced the availability of the 1000 Genomes public dataset, the largest catalog of human genetics, on the cloud. This is one of the datasets targeted for inclusion by P-SS2. This is fortunate timing for ElasticAP. Rather than spending time importing all of the data into the system, we can focus on adding value by adding metadata to the AWS data files. This is facilitated by existing mappings from the ENA metadata to the industry-standard ‘Investigation Study Assay’ format supported by our underlying software. We are starting with a single 1000 Genomes study, 1000 Genomes Project Pilot 2: 20X whole genome sequence coverage of 6 individuals (comprising 2 trios: father, mother, offspring) from the CEU and YRI populations. Trios are interesting: in addition to variant detection, they allow investigation of haplotype phasing and de novo mutation events. I will show you some of these data at the demo.
  • This is a good real-world example: a current collaboration between a pharma company and a university. It is a complex pipeline that involves both alignment and de novo assembly steps. We are using this as a case study for ElasticAP, using a similar, public dataset from GEO (another of the SS2-requested datasets) as a proxy for the real in-house data set.
  • The Pistoia RFP was huge; it was not possible to implement every single part of it from the word go. Security is the first and most important part, and we’ve definitely finished and tested that part rigorously. Data is safe. Work is underway to implement the partial and to-do items above.
  • It’s worth noting that we’re improving on existing technology here, not building something brand new. We’re taking components we’ve used before and combining them with cutting-edge third-party open-source components. We add to this Cycle’s unique insight into cloud scalability and security. Cycle made a splash recently with their 50,000 node cluster; see yesterday’s NYT article. Now we can use the platform internally to develop customer projects, freeing us from building new infrastructural support for each new project and increasing our efficiency. The same goes for end users: they no longer have to build their own infrastructure or wait for us to build ours for them; they can just log in and use it. See our brochures or websites for details of the kinds of things we’ve been involved in and will use this platform for in future.
  • Pricing follows an AWS-style model: sign-up is free, data import is free, and you only pay for the storage/compute that you use. We support either up-front payment or monthly billing. SLA-backed support contracts are available, and customisations are, of course, available at standard consulting rates.
  • Early access for preferred partners: early access to test and try out features, stress-test the system and give feedback. In the longer term, we can expand outside sequence data into other R&D fields. Security will always be at the heart of it.
  • Whether you subscribe to ElasticAP or look elsewhere, cloud SaaS is the future for data analysis, especially where collaboration is concerned. We are not intrinsically tied to any flavor of cloud, or even tied to the cloud at all. We strongly believe we have the nucleus of a truly secure and scalable R&D collaboration environment that can be extended beyond genomics to general R&D applications.
  • If you have any other questions, look out for myself or David Flanders [TBC] from Eagle, or Carl Chesal from Cycle. I will also be attending Bio-IT World this week if you’d like to catch up with me there. In case you don’t get the chance to talk to me, we’ve got product brochures in the demo room for you to take away. Thanks for listening, see you at the demo.
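
The 1000 Genomes notes above mention that trios allow investigation of de novo mutation events. As a rough illustration only, the idea can be sketched as a Mendelian-consistency check: a child genotype that cannot be formed from one paternal and one maternal allele is a candidate de novo event. The function names and genotypes below are hypothetical and made up for this sketch; they are not ElasticAP code or 1000 Genomes data, and a real analysis would also have to account for genotyping error and coverage.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed by taking
    one allele from the father and one allele from the mother."""
    child_sorted = sorted(child)
    return any(
        sorted((paternal, maternal)) == child_sorted
        for paternal, maternal in product(father, mother)
    )


def candidate_de_novo_sites(trio_calls):
    """Return the site IDs where the child's genotype violates
    Mendelian inheritance (candidate de novo events or call errors)."""
    return [
        site
        for site, (father, mother, child) in trio_calls.items()
        if not is_mendelian_consistent(father, mother, child)
    ]


# Made-up genotypes for illustration: site -> (father, mother, child)
example_calls = {
    "site1": (("A", "A"), ("A", "G"), ("A", "G")),  # consistent
    "site2": (("C", "C"), ("C", "C"), ("C", "T")),  # T absent from both parents
    "site3": (("G", "T"), ("G", "T"), ("T", "T")),  # consistent
}

print(candidate_de_novo_sites(example_calls))  # prints ['site2']
```

Flagged sites are only candidates; in practice most Mendelian inconsistencies in raw calls turn out to be genotyping errors and need further filtering.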


  • 1. Sequence Services Phase 2. Pistoia Alliance AGM, Boston MA, April 24th 2012
  • 2. Open Innovation. Collaborate: work together to find a common purpose. Explore: enterprise, academia, government, foundations. Nurture: build trust, shared language. Exploit: turn ideas into tangible benefits. ElasticAP, Pistoia Alliance Conference, Boston MA, 24th April 2012
  • 3. The Requirements
      • Functional: login and workspace; manage users; manage data; upload private data; access public data; export; delete/archive; manage applications; share; upload scripts/pipelines; analyse data; monitor use/performance
      • Non-functional: charging model; service support; operational requirements; security requirements
  • 4. The Partnership
      • Established: 2005 (Cycle); 2008 (Eagle)
      • Domain: high performance computing (Cycle); operational bioinformatics (Eagle)
      • Employees: 18, 16 engineers (Cycle); 12, 9 engineers, pool of external consultants (Eagle)
      • Location: across USA/Canada (Cycle); Cambridge, UK (Eagle)
      • Sectors: pharmaceutical, biotechnology, financial, computer gaming, engineering, academia (Cycle); pharmaceutical, biotechnology, agri-biotechnology, consumer goods, food, other life sciences (Eagle)
      • Customers: North America, Europe (Cycle); North America, Europe, Asia (Eagle)
      • Partnerships: Schrodinger, VMWare, Amazon Web Services, Canonical, Cognizant, European Bioinformatics Institute,
  • 5. The Platform: the platform for storage, analysis and sharing of life sciences data in the cloud
  • 6. The Proposal [workflow diagram; roles: CIO, Depositor, Bioinformatician, Biologist, Collaborator; steps: Start, Upload, Stored data, Analyses (manual process, pipeline process), Stored data, Share, Stop]
  • 7. The Architecture [architecture diagram; users: Customer, Depositor, Bioinformatician, Collaborator; single sign-on: authenticate, SAML token exchange, OpenAM, Shibboleth IdP; Amazon EC2 cloud: web gateway (HTTPS), SEEK web server with assets and MySQL DB, CycleCloud, Condor, Ensembl, BioLinux on EC2/AMIs; storage: encrypt/decrypt, data files in S3 storage; one sandbox per customer]
  • 8. The Present [diagram; roles: Collaborator, Depositor, Bioinformatician]
  • 9. The 1000 Genomes
      • A Deep Catalogue of Human Variation
          – Freely available on AWS
          – 1,700 individuals
          – 200 TB data
          – 10,000s of data files
          – Almost no metadata!
      • ElasticAP evaluating 1000 Genomes Project Pilot 2
          – 20X resequencing
          – 2 trios (6 individuals)
  • 10. TRUP: Tumor RNA-seq Unified Pipeline
      • Collaboration between
          – Max Planck Institute for Molecular Genetics
          – Bayer Pharma AG
      • Identifies gene fusion events in tumor samples
      • Involves both alignment and de novo sequencing steps
      • Pipeline is being implemented on ElasticAP
          – Using public GEO datasets for
  • 11. The PoC
      • Functional: login and workspace; load data; manage public data; load scripts and pipelines; analyse data; export data; archive data; manage applications; manage users; monitor use/performance
      • Non-functional: charging model; service support; operational requirements; security requirements
      • Key: fully implemented; partially implemented; to-do list
  • 12. The Prior Art
      • Eagle have been building analysis pipelines and hosting secure cloud apps for years.
      • Cycle have been developing HPC solutions and deploying them on the cloud for years.
      • We built this as a platform we could use ourselves in order to carry on delivering what we already do.
      • But now the results are interactive, and everyone can share and participate.
      • The most common tasks won’t need to involve us at all.
  • 13. The Price
      • AWS-style pay as you go business model
          – Free sign-up and account creation.
          – Tiered applications by the hour.
          – Discounts for up-front reservation fee.
          – Offline data import/export also available.
          – Flat-rate data by the gigabyte-month.
          – Backup data by the gigabyte-month.
          – Monthly billing.
          – Support contracts available.
      • Customisation and new pipelines at Eagle/Cycle standard consulting rates.
  • 14. The Plan
      • Early access to preferred partner customers in July; talk to us now if you’d like to be part of that.
      • Full production in September with all partial/to-do items implemented.
      • Increased number of public datasets.
      • Increased range of applications and pipelines.
      • User interface improvements based on feedback from early access period.
  • 15. The Potential
      • Available as customisation projects:
          – Conversions to other clouds.
          – Conversions to run on in-house infrastructure.
      • Truly secure and scalable R&D collaboration environment.
          – Applicable to all sciences, not just genomics.
      • Change the way you do science.
  • 16. Will
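
The pay-as-you-go model on slide 13 (applications billed by the hour, data and backup billed by the gigabyte-month, a discount for up-front reservation) can be illustrated with a small arithmetic sketch. Every rate, name and discount below is a hypothetical placeholder chosen for the example; the presentation does not state actual ElasticAP prices, and the reservation fee itself is left out of the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical price list; none of these figures come from the slides."""
    compute_per_hour: float       # hourly rate for one application tier
    storage_per_gb_month: float   # flat-rate data storage
    backup_per_gb_month: float    # backup storage
    reserved_discount: float      # fraction off compute after an up-front reservation


def monthly_bill(rates, compute_hours, stored_gb, backed_up_gb, reserved=False):
    """Sum usage-based charges for one month; sign-up and data import are free."""
    compute = compute_hours * rates.compute_per_hour
    if reserved:
        compute *= 1.0 - rates.reserved_discount
    storage = stored_gb * rates.storage_per_gb_month
    backup = backed_up_gb * rates.backup_per_gb_month
    return round(compute + storage + backup, 2)


example_rates = Rates(
    compute_per_hour=0.5,
    storage_per_gb_month=0.125,
    backup_per_gb_month=0.0625,
    reserved_discount=0.25,
)

# 100 compute hours, 400 GB stored and backed up for the month
print(monthly_bill(example_rates, 100, 400, 400))                 # prints 125.0
print(monthly_bill(example_rates, 100, 400, 400, reserved=True))  # prints 112.5
```

With no usage in a month the bill is zero, which matches the "you only pay for the storage/compute that you use" description in the speaker notes.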