Xu Xing: EasyGenomics – Next Generation Bioinformatics on the Cloud

Xu Xing's talk at ISCB-Asia EasyGenomics – Next Generation Bioinformatics on the Cloud, December 17th 2012

Published in: Technology

  • This morning I was reading Monday’s USA Today. One of the cover stories was about an 18-year-old girl whose family history includes Huntington’s disease and who has decided to take a genetic test to see whether she carries the fatal gene. What impressed me is how close genetic testing and disease have come to our daily lives. Imagine that by 2030 the UN presidential candidates all publish their complete genomes: who would you vote for? A few years ago this was science fiction, but look at the trend today. The cost of sequencing 1 Mb of DNA has gone down dramatically and, thanks to these great instruments, the total number of human genomes sequenced has gone from 1 in 2003, when the Human Genome Project released its data, to a few thousand today. The number may vary but the trend won’t change. If the red dotted Moore’s law line continues as it has, we may well see $1000 a genome in 2012 or 2013, and the price will continue to drop toward $0. Conversely, we will be able to sequence many more genomes than today, and I’d like to quote Martin Leach’s “Humanity Genome” or “Hunome”.
  • Over the past few years, we have been thinking about the $1000 genome and of course have done tons of great work to achieve that. Go big: getting just 0.1% of the world’s population sequenced would cost $7 billion and generate around 700 petabytes of raw ATGC, the equivalent of 85 billion copies of The Complete Harry Potter Collection eBook. And that’s not the end of the story. The Omicsmap team created this nice map to illustrate sequencing capacity around the world. As the price of sequencing drops, there is reason to believe the map will look like this in a few years! Key takeaway 1: sequencing is a commodity and it happens everywhere.
  • Often when I chat with collaborators and partners, everyone talks about the opportunities and possibilities introduced by NGS. Unfortunately, not all of them have the knowledge and skills needed to handle the tremendous amount of data NGS generates, which has become one of the biggest obstacles to fully using this technology. On top of that, scientists often have to deal with numerous difficulties to get reliable and meaningful results, such as data deliveries on hard drives, management of computing and storage resources, installation and integration of multiple algorithms, and optimization of many parameters. If you wonder how BGI solved this, you are in the right session. If you want to access BGI’s bioinformatics solution, the next 20 slides are just for you.
  • Building the cloud platform: its core technologies include bioinformatics pipelines, data management, and high-speed data exchange.
  • At the heart of EasyGenomics is our Bioinformatics Core: 5 workflows with carefully chosen algorithms, tested and optimized, plus filtering, QC report, and alignment along with other supporting features.
  • When a user starts a new analysis project, there are three atomic objects he or she needs to look at: a Sample, which is created by aggregating raw data; an Analysis, which takes Samples as input; and a Project, which encloses multiple analyses. Filtering, QC report, and alignment are built in so that users don’t have to worry about them. Different pipelines may have different handles, but the basics remain the same. In this way, EasyGenomics provides a unified underlying data structure that mimics your real research procedures.
  • At EasyGenomics, we are serious about information security and have designed a secure multitenancy architecture from the ground up. Critical user data is protected with 256-bit encryption to make sure everyone is in stealth mode. Sample and project data are stored in the user’s designated virtual partition so that no one, not even the EasyGenomics operations team, can see them. As with many online applications, a secure login mechanism is provided and every interaction you make with the system is encrypted over HTTPS. When it comes to data transfer security, EasyGenomics has partnered with Aspera to send and receive your data quickly and securely. Last but not least, we will never store your password in plain text!
  • That 700 PB does freak a lot of people out, but if anyone in this room asks me what matters most in today’s Big Genomics Data era, I would say information. Raw ATGC does not make any sense on its own. When you look into the so-called sex chromosome, 200 million bp decide our gender and more… To this day, we know only a very limited part of the knowledge hidden in our genes. While sequencing continues to be a thrilling race, discovering the information behind Big Genomics Data presents a huge challenge to the community. And turning those scientific discoveries into consumable applications is the silver bullet. Key takeaway 2: analysis and interpretation of genome data is the key, and applying sequencing information to applications is the silver bullet.
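
The scale figures quoted in the notes above (0.1% of the world population, $7 billion, 700 PB, 85 billion eBooks) can be checked with a few lines of arithmetic. The sketch below assumes a world population of roughly 7 billion and decimal units (1 PB = 10^15 bytes); neither is stated in the talk. The per-genome and per-eBook sizes are derived from the quoted totals, not taken from the slides.

```python
# Back-of-the-envelope check of the scale figures quoted in the talk.
WORLD_POPULATION = 7_000_000_000   # assumption: roughly 7 billion people
COST_PER_GENOME_USD = 1_000        # the "$1000 genome"
TOTAL_RAW_DATA_PB = 700            # raw data volume quoted in the talk
EBOOK_COPIES = 85_000_000_000      # "85 billion" eBook copies quoted in the talk
BYTES_PER_PB = 10**15              # assumption: decimal units


def sequencing_estimate():
    """Return the figures implied by sequencing 0.1% of the population."""
    genomes = WORLD_POPULATION // 1000            # 0.1% of the population
    total_bytes = TOTAL_RAW_DATA_PB * BYTES_PER_PB
    return {
        "genomes": genomes,
        "total_cost_usd": genomes * COST_PER_GENOME_USD,
        "raw_gb_per_genome": total_bytes / genomes / 10**9,
        "mb_per_ebook": total_bytes / EBOOK_COPIES / 10**6,
    }


if __name__ == "__main__":
    for key, value in sequencing_estimate().items():
        print(f"{key}: {value:,.1f}")
```

Under these assumptions the quoted totals imply about 100 GB of raw data per genome and an eBook collection of roughly 8 MB.
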
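
The Sample, Analysis, and Project objects described in the notes above can be sketched as a small data model. This is only an illustration of the relationships the talk describes (raw data aggregated into samples, samples fed into analyses, analyses grouped in a project). All class, field, and file names here are hypothetical and are not taken from EasyGenomics itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """A sample created by aggregating raw data files."""
    name: str
    raw_files: List[str]


@dataclass
class Analysis:
    """An analysis takes one or more samples as input."""
    name: str
    workflow: str            # hypothetical label, e.g. "exome_resequencing"
    samples: List[Sample]

    @property
    def is_batch(self) -> bool:
        # One selected sample is a single analysis; several form a batch.
        return len(self.samples) > 1


@dataclass
class Project:
    """A project encloses multiple analyses."""
    name: str
    analyses: List[Analysis] = field(default_factory=list)

    def sample_names(self) -> List[str]:
        # Unique sample names across all analyses, in first-seen order.
        seen = {}
        for analysis in self.analyses:
            for sample in analysis.samples:
                seen.setdefault(sample.name, None)
        return list(seen)


def build_example() -> Project:
    """Build a tiny project: sample A is reused by two analyses."""
    sample_a = Sample("sample_A", ["lane1.fq.gz", "lane2.fq.gz"])
    sample_b = Sample("sample_B", ["lane3.fq.gz"])
    project = Project("project_I")
    project.analyses.append(
        Analysis("analysis_I", "exome_resequencing", [sample_a]))
    project.analyses.append(
        Analysis("analysis_II", "exome_resequencing", [sample_a, sample_b]))
    return project
```

Keeping samples as separate objects lets the same sample be reused across analyses with different parameters, which is the reusability the talk emphasizes.
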
  • Transcript

    • 1. Next Generation Bioinformatics on the Cloud Xing Xu, Ph.D Director of Cloud Computing Product Contact Us
    • 2. Topics for Today: Behind the cloud product; BGI; The team; The product: EasyGenomics; Why are we building this product?; What can this product do?; Future direction and open questions
    • 3. BGI: The world’s largest genome sequencing center. Started with the Human Genome Project in 1999 with only a few sequencers. Now more than 150 sequencers, 6 TB/day sequencing throughput. Installations by model: ABI 3730XL: 16; Roche 454: 1; ABI SOLiD 4: 27; Solexa GA IIx: 6; Illumina HiSeq 2000: 135.
    • 4. BGI: The world’s largest genome sequencing center. The largest computing and storage center for genomics in China: 20,000+ CPU cores; 19 NVIDIA GPUs; 220+ Tflops peak performance; 17 PB data storage. Storage and computation capability have increased 10,000-fold and are still increasing.
    • 5. BGI: The world’s largest genome sequencing center. The largest computing and storage center for genomics in China. One of the world’s leading research institutes in genomics. Since 2007: 253 papers in high-impact journals, including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors; 369 patent applications; 254 software authorships.
    • 6. BGI: The world’s largest genome sequencing center. The largest computing and storage center for genomics in China. One of the world’s leading research institutes in genomics. BGI has the sequencing capacity, hardware resources, and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis, and data interpretation.
    • 7. Team for the Cloud Platform: Run like a software product company. Managers are from leading software companies, such as HP, Microsoft, and Lenovo. Team members are young, energetic, and ambitious. Fully supported by BGI in-house algorithm development teams. (Diagram labels: Operation, Development, Testing, BGI Support.)
    • 8. Team for the Cloud Platform. Development Team: Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.; Flex Lab: Yan Li, Shengchang Gu etc.; GPU Lab: Bingqiang Wang etc.; Pipeline: Liang Wang etc. Test & QA Team: Xin Guan, Jingjuan Liu, etc. PMO & IT Operation: Wenjun Zeng, Litong Lai, Jing Tian, etc. Product Team: Xing Xu, Jing Guo, Fang Fang etc. Other BGI Teams.
    • 9. Topics for Today: Behind the cloud product; BGI; The team; The product: EasyGenomics; Why are we building this product?; What can this product do?; Future direction and open questions
    • 10. Trend of Volume and Cost
    • 11. Geographical side of the problem: Sequencing happens EVERYWHERE. + BGI. Images from
    • 12. Difficulties of Analysis. Stages: Primary Analysis, Secondary Analysis, Tertiary Analysis, Post-Tertiary Analysis. Steps shown: base calling, mapping, variant calling, in-depth annotation. Difficulties: data throughput; computation intensive; data storage intensive; complicated algorithms; lack of knowledge.
    • 13. Problems and Solutions. Solutions: Cloud; High Speed Data Exchange; Pipelines; Distributed Workloads. Problems: big genomic data; geographical distribution; algorithm integration; computational demand.
    • 14. EasyGenomics™: Computational resources; algorithms, workflows, database, reports; data management; web portal, simple UI; high speed connection. EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.
    • 15. Key Features: High Speed Connection; Data Management; Bioinformatics Workflows
    • 16. Bioinformatics Workflow. Four steps: Upload, Create a Sample, Perform Analyses, Download Results. Algorithms: carefully chosen, tested and optimized. Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly.
    • 17. Homepage: Navigation tabs; four task portals; status of recent works; warning and logging
    • 18. Bioinformatics Workflow: Pipelines. Exome Resequencing; RNA-Seq Transcriptome
    • 19. Bioinformatics Workflow: Comprehensive Reports
    • 20. Bioinformatics Workflow: Comprehensive Reports
    • 21. Data Management: “Sample”, “Analysis”, “Project”. Mimicking real research procedure. Automatic management of underlying data structure. (Diagram: Raw Data; Sample A; Sample B; Analysis I; Analysis II; Analysis X; Project I.)
    • 22. Create a Sample: Add read groups
    • 23. Sample Page: Individual report for each lane; summarized report for all lanes
    • 24. Data Management: Security. Access: username/password, biometric access. HTTPS, Aspera fasp™, trusted database connection. Multi-tenancy: ACL, data encryption. Isolation: physical isolation, virtual isolation. Compliance: ISO27000.
    • 25. High Speed Data Exchange: Aspera’s patented fasp™ high-speed file transfer technology; 10~100X faster than FTP
    • 26. Transfer 24GB in 30 Seconds: Demonstrated 10Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.
    • 27. Transfer 24GB in 30 Seconds: Demonstrated 10Gbps ultra-high-speed data exchange with UC Davis and NCBI in June. A 24GB file was transferred from China to the US in 30 seconds (~8Gbits/s).
    • 28. Amount of Data that can be transferred in 24 hr: the amount of data transferred in 24 hours at different data transfer bandwidths. (Assuming the input read size is 10 GB, the total results are about 50 GB, the clean reads are about 10 GB, and the aligned reads (BAM) are about 20 GB.)
    • 29. Easy-to-Use UI. Reusability: reuse the same sample for different analyses (different parameters); reuse all parameter settings for different analyses. Simple UI and interactive features: as easy as online shopping; shortcuts for predefined settings, while remaining fully customizable for advanced users; handle batch analyses in one setting.
    • 30. Create an Analysis. Selected sample(s): one selected sample => Single Analysis; multiple selected samples => Batch Analyses
    • 31. Create an Analysis: Selectable modules; predefined shortcut settings
    • 32. Create an Analysis
    • 33. Create an Analysis: Customizable
    • 34. Create an Analysis
    • 35. Project Table: Add/Remove Project; filter and search box; operation shortcuts; project list table
    • 36. Analysis Table
    • 37. Sample Table
    • 38. A typical use case: Customers’ Local Resources. A normal use case of EasyGenomics and customers’ local computational resources. The double-line items are customers’ data or resources. The single-line items are results and data within BGI and the EasyGenomics platform. The widths of the arrows represent the sizes of data flows (not to scale).
    • 39. Topics for Today: Behind the cloud product; BGI; The team; The product: EasyGenomics; Why are we building this product?; What can this product do?; Future direction and open questions
    • 40. Future directions: What is the market? Which direction to go? Cloud on public infrastructure vs cloud on private infrastructure. SaaS vs PaaS. Data analysis is only one step of the whole process. What will be the sustainable model for the cloud service?
    • 41. Market Position: Instrument Manufacturers; Sequencing Service Providers; Software Providers; Cloud Service Providers; Annotation Providers; Personal Genetic Testing Providers. (Labels on the figure: illumina, NOW.)
    • 42. Challenge and Solution. Platforms compared: DNANexus; Basespace (Illumina); GenomeSpace; EasyGenomics; Ingenuity/NextBio. Cloud: public; public; public; private; private. Reasoning: great demand on space and computation resources (public); security and privacy issues (private). Positioning: infrastructure (PaaS); app store; platform for accessing available tools; SaaS solution; information (they are playing the results from NGS, not the raw reads). Advantages listed: funding; sequencing service; community of partners; strong connection to academia; sequencing service; development capability; experience; advance in the field.
    • 43. Public vs Private Cloud. Public cloud pros: “limitless” resource; share data with a wide range of people; offering a nice platform. Public cloud cons: security and reliability; short term cost saving vs long term cost nightmare. Private cloud pros: flexibility; security and privacy control; long-term cost saving. Private cloud cons: big initial investment; maintaining the infrastructure and software on the cloud. But the line between public and private cloud is blurring.
    • 44. A sustainable model for cloud service? Key components of cost: storage; computational resource; data transfer; software usage. App store or cell phone plan? Long term cost vs short term cost.
    • 45. Data analysis is NOT ALL! (Diagram: a web-based management system with management interfacing, query, and statistics, covering sales, billing, EPM, project management, sample center, wet lab operation, and bioinformatics data analysis; workflow steps shown include budgeting, tasking, sample receipt/storage, sample QC, sample prep, handover, sequencing, data QC, and data analysis.)
    • 46. Roadmap of EasyGenomics. EG1.1 (Jun 2012): new result reports; fully integrated data exchange interface. EG1.2 (Aug 2012): new read filtering step, speed up 20x; comparison and filtering tools. EG1.3 (Sep 2012): new sample report; transcriptome workflows; reference management. EG1.5 (est. Dec 2012): QC indicator, QC module; data import from BGI sequencing service. EG2.0 (est. Apr 2013): iRODS data management; data sharing, collaboration; user own applications; visualization.
    • 47. Free Beta Trial is ongoing!!
    • 48. Interpretation is the KEY. Analysis and Interpretation is the KEY.
    • 49. Enabling Technology: Hadoop-based Flexible Computing. Best Practice Award for IT Infrastructure. Human genome assembly, SOAPdenovo vs EasyGenomics™ (192 cores): genome coverage 86% vs 86%; assembly time 70 h vs 55 h; number of servers 1 vs 15; memory size 500 GB x 1 vs 24 GB x 15; mode centralized vs distributed.
    • 50. Enabling Technology: SOAP; Hadoop (Gaea); GPU
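
Slide 28 compares how much data can be moved in 24 hours at different bandwidths, but the transcript does not preserve the chart values. The sketch below recomputes the idea under stated assumptions. It uses an ideal link with no protocol overhead and decimal units. It treats 60 GB per sample (10 GB of input reads uploaded plus about 50 GB of results downloaded) as a hypothetical reading of the slide's assumptions. The bandwidth tiers are illustrative and are not the ones on the slide.

```python
SECONDS_PER_DAY = 86_400
INPUT_READS_GB = 10      # from slide 28: input read size
TOTAL_RESULTS_GB = 50    # from slide 28: total results size


def gb_per_day(bandwidth_mbps: int) -> float:
    """Ideal data volume (GB, decimal) moved in 24 h at the given bandwidth."""
    bytes_per_day = bandwidth_mbps * 1_000_000 * SECONDS_PER_DAY // 8
    return bytes_per_day / 10**9


def samples_per_day(bandwidth_mbps: int,
                    gb_per_sample: int = INPUT_READS_GB + TOTAL_RESULTS_GB) -> int:
    """Whole samples whose upload plus download fits in 24 h (assumption:
    both directions share one link and each sample moves 60 GB in total)."""
    return int(gb_per_day(bandwidth_mbps) // gb_per_sample)


if __name__ == "__main__":
    # Illustrative bandwidth tiers in Mbit/s; not taken from the slide.
    for mbps in (10, 100, 1_000, 10_000):
        print(f"{mbps:>6} Mbit/s: {gb_per_day(mbps):>9,.0f} GB/day, "
              f"{samples_per_day(mbps):>5} samples/day")
```

Real transfers will fall below these ideal figures, which is why the achievable fraction of link capacity matters as much as the nominal bandwidth.
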