This document provides an overview of a presentation on population-scale high-throughput sequencing data analysis. It discusses:
1) The background and goals of the CSIRO/Omics Project which aims to investigate colorectal cancer susceptibility using sequencing data from 500 individuals.
2) Methods for processing large-scale NGS data on high-performance computing clusters and cloud infrastructure using the NGSANE framework, which allows processing modules to be run in parallel.
3) Preliminary research outcomes identifying cancer-associated genomic and microbiome changes from analysis of colorectal cancer and control samples.
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. Here we introduce VariantSpark, which utilizes Hadoop/Spark along with its machine learning library, MLlib, to provide parallelisation for population-scale bioinformatics tasks. VariantSpark interfaces with the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than ADAM, the Spark-based genome clustering approach; than the comparable implementation using Hadoop/Mahout; and than Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits in speed, resource consumption and scalability enable VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
The package is written in Scala and available at https://github.com/BauerLab/VariantSpark.
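For readers unfamiliar with the workflow, the following is a minimal, hypothetical PySpark sketch of the kind of population-structure clustering described above. It is not VariantSpark's own code: it assumes a toy genotype matrix (one row per individual, variants coded 0/1/2) instead of VariantSpark's VCF interface, and applies Spark MLlib's k-means directly.

```python
# Minimal sketch of population clustering with Spark MLlib k-means.
# Assumes a toy genotype matrix (individuals x variants, coded 0/1/2);
# VariantSpark itself reads VCF and scales to millions of variants.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("genotype-kmeans").getOrCreate()

# Hypothetical tiny dataset: (sample_id, variant_1, variant_2, variant_3)
rows = [
    ("s1", 0, 2, 1),
    ("s2", 0, 2, 2),
    ("s3", 2, 0, 0),
    ("s4", 2, 0, 1),
]
df = spark.createDataFrame(rows, ["sample_id", "v1", "v2", "v3"])

# Pack the variant columns into a single feature vector per individual.
assembler = VectorAssembler(inputCols=["v1", "v2", "v3"], outputCol="features")
features = assembler.transform(df)

# Cluster individuals; k corresponds to the expected number of populations.
model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
model.transform(features).select("sample_id", "prediction").show()
```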
How novel compute technology transforms life science researchDenis C. Bauer
Unprecedented data volumes and pressure on turnaround time driven by commercial applications require bioinformatics solutions to evolve to meet these new demands. New compute paradigms and cloud-based IT solutions enable this transition. Here I present two solutions capable of meeting these demands: VariantSpark for genomic variant analysis, and GT-Scan2 for genome engineering applications.
VariantSpark classifies 3,000 individuals with 80 million genomic variants each in under 30 minutes. This Hadoop/Spark solution for machine learning applications on genomic data is hence capable of scaling up to population-size cohorts.
GT-Scan2 identifies CRISPR target sites by minimizing off-target effects and maximizing on-target efficiency. This optimization is powered by AWS Lambda functions, which offer an “always-on” web service that can instantaneously recruit enough compute resources to keep runtime stable even for queries with several thousand potential target sites.
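The serverless pattern behind this can be illustrated with a hypothetical AWS Lambda handler (not GT-Scan2's actual implementation): each invocation scores one small batch of candidate target sites, so many batches can run in parallel and total runtime stays roughly constant as the number of candidates grows. The scoring function below is a placeholder.

```python
# Hypothetical AWS Lambda handler illustrating the serverless fan-out pattern
# described above (not GT-Scan2's actual code): each invocation scores one
# batch of candidate CRISPR target sites so batches can run in parallel.
import json


def score_site(sequence: str) -> float:
    """Toy on-target score: GC content of the candidate site (placeholder
    for a real efficiency/off-target model)."""
    gc = sum(base in "GC" for base in sequence)
    return gc / len(sequence) if sequence else 0.0


def handler(event, context):
    # 'event' is assumed to carry a payload like
    # {"sites": [{"id": "chr1:1000", "seq": "ACGTACGT..."}, ...]}
    sites = event.get("sites", [])
    results = [
        {"id": site["id"], "score": score_site(site["seq"])}
        for site in sites
    ]
    return {"statusCode": 200, "body": json.dumps(results)}
```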
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes Project, with genomes of 2,504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables; they either fail or are inefficient in the WGAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for high-dimensional datasets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and its challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
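To make the "wide data" setting concrete, here is a hedged sketch using Spark ML's stock RandomForestClassifier on a tiny simulated sample-by-variant matrix. This is illustration only, not RandomForestHD itself; the abstract's point is precisely that stock implementations are tuned for tall, narrow data.

```python
# Sketch of the "wide data" setting: far more features (variants) than samples.
# Uses Spark ML's stock RandomForestClassifier for illustration only; the
# abstract argues that such implementations struggle at true WGAS scale,
# which is what motivated RandomForestHD.
import random
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("wide-rf-sketch").getOrCreate()

n_samples, n_variants = 20, 1000   # toy scale; WGAS is ~thousands x ~85M
random.seed(1)
rows = [
    (float(random.randint(0, 1)),                              # case/control label
     Vectors.dense([random.randint(0, 2) for _ in range(n_variants)]))
    for _ in range(n_samples)
]
df = spark.createDataFrame(rows, ["label", "features"])

rf = RandomForestClassifier(numTrees=50, maxDepth=5)
model = rf.fit(df)

# Feature importances give a rough ranking of variants associated with the label.
top = sorted(enumerate(model.featureImportances.toArray()),
             key=lambda kv: -kv[1])[:10]
print(top)
```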
VariantSpark a library for genomics by Lynn LangitData Con LA
VariantSpark is a library for scalable genomic analysis that can process large genomic datasets containing millions of variants and thousands of samples. It uses machine learning techniques like k-means clustering and random forests for unsupervised and supervised analysis. VariantSpark can analyze whole-genome datasets faster than other methods and scale to process 100% of the genomic data. It also integrates with cloud platforms like AWS and Databricks for easy access, with its capabilities demonstrated through Jupyter notebooks.
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
The document discusses KeyGene's use of Apache Spark for high-throughput genomics data analysis. KeyGene is a crop innovation company that analyzes genomic data from thousands of plants to improve crop traits like yield and quality. They previously used conventional HPC clusters for genomics pipelines but found Spark enabled more interactive analysis. KeyGene developed a "Sparkified" genomics pipeline using tools like BWA, GATK and their own Guacamole variant caller. This allowed interactive variant selection and GWAS using Spark SQL queries, demonstrating Spark is well-suited for interactive plant genomics analysis.
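A hypothetical sketch of such interactive variant selection with Spark SQL (the schema and values are assumptions, not KeyGene's actual pipeline): register a variants DataFrame as a temporary view and filter it with an ad-hoc query.

```python
# Hypothetical sketch of interactive variant selection with Spark SQL.
# Table layout and values are assumptions, not KeyGene's real pipeline.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-sql").getOrCreate()

variants = spark.createDataFrame(
    [
        ("chr1", 1200, "A", "G", "high", 0.41),
        ("chr1", 5600, "C", "T", "low", 0.03),
        ("chr2", 8900, "G", "T", "moderate", 0.22),
    ],
    ["chrom", "pos", "ref", "alt", "impact", "allele_freq"],
)
variants.createOrReplaceTempView("variants")

# Interactive selection: rare, high/moderate-impact variants on chromosome 1.
spark.sql("""
    SELECT chrom, pos, ref, alt, impact, allele_freq
    FROM variants
    WHERE chrom = 'chr1'
      AND allele_freq < 0.05
      AND impact IN ('high', 'moderate')
    ORDER BY pos
""").show()
```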
Presentation from the "Demystifying Big Data" Technical Conference (Universidad de La Laguna, Spain, June 2014).
Biomedical sciences rely on massive data sets. By using machines capable of generating large amounts of data with low cost, science has entered the 'Big Data' era, making computational infrastructures essential to maintain, transfer and analyze all this information.
Challenges and Opportunities of Big Data GenomicsYasin Memari
The document discusses the challenges and opportunities of big data genomics. It notes that the bottleneck in genomics has shifted from data generation to data handling as sequencing capacity doubles every year. While compression can help address the data deluge, throughput from techniques like metagenomics and single-cell sequencing will continue to outpace storage gains. The document then explores solutions for analyzing and storing large genomic datasets through techniques like cloud computing, distributed file systems, and MapReduce frameworks.
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWSAWS Chicago
This document summarizes a presentation about building a cloud-based platform as a service (PaaS) for forensic DNA analysis using Amazon Web Services (AWS). It provides background on human genetics and DNA forensics using short tandem repeats (STRs). It describes initial cloud prototyping efforts from 2014-2018, including developing open source tools like STRait Razor and STRGazer. It outlines the production-level cloud system implemented from 2015-2018 using AWS services like S3, EC2, RDS, and Lambda. Performance and security policies of the system are discussed. Examples of technology transfer and validation of next-generation sequencing kits are also provided.
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...Amazon Web Services
Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
Systems biology is becoming a data-intensive science due to the exponential growth of genomic and biological data. Large projects now produce petabytes of data that require new computational infrastructure to store, manage, and analyze. Cloud computing provides elastic resources that can scale to support the increasing data needs of systems biology. Case studies show how clouds are used for large-scale data integration and analysis, running combinatorial analysis over genomic marks, and enabling reanalysis of biological data through elastic virtual machines. The Open Cloud Consortium is working to provide open cloud resources for biological and biomedical research through testbeds and proposed bioclouds.
This document summarizes a study that benchmarked different metagenomic assembly approaches using a mock microbial community. The study found that while assembly generally improves functional annotation over analyzing unassembled reads, current assembly methods still have room for improvement, especially regarding misassemblies. The document also describes efforts to establish standardized assembly protocols and benchmarks in order to evaluate progress and better understand the challenges. Computational requirements for assembly remain high but are decreasing as methods improve.
The document discusses using Genome in a Bottle (GIAB) data on DNAnexus cloud platform. It describes two examples: 1) Comparing different mapper and variant caller combinations using GIAB pilot genome data. Benchmarking shows BWA and GATK Haplotype Caller performed best. 2) Assessing structural variation detection in the Ashkenazi Jewish Trio, combining data from Illumina and PacBio sequencing. DNAnexus is working with GIAB to develop benchmark datasets for structural variants.
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
This document discusses using Apache Spark to assemble metagenomes from short-read sequencing data. Metagenomes are genomes from microbial communities containing many species. Spark provides an efficient and scalable approach compared to previous methods. The document demonstrates clustering reads from small test datasets in Spark and evaluates performance on real datasets from 20 GB up to 100 GB, where jobs failed. While Spark is easy to develop for and efficient, challenges remain in robustness at large scales and in optimizing for different problem complexities.
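One building block of such Spark-based metagenome processing can be sketched as a classic map/reduce over reads, for example k-mer counting (an assumed illustration, not the presenters' code); read clustering and assembly build on statistics of this kind.

```python
# Minimal sketch (assumed illustration) of a Spark map/reduce over reads:
# count k-mers across a toy read set. Read clustering and assembly use
# k-mer statistics like these as their starting point.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmer-count").getOrCreate()
sc = spark.sparkContext

reads = sc.parallelize([
    "ACGTACGTGG",
    "CGTACGTGGA",
    "TTGACGTACG",
])

K = 4
kmer_counts = (
    reads.flatMap(lambda r: [r[i:i + K] for i in range(len(r) - K + 1)])
         .map(lambda kmer: (kmer, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Most frequent k-mers across the (toy) read set.
print(kmer_counts.takeOrdered(5, key=lambda kv: -kv[1]))
```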
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident. For example, drugs with supporting genetic evidence are twice as likely to succeed in clinical trials. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
As a result, the Broad Institute began the open-source Hail project (https://hail.is), a scalable platform built on Apache Spark, to enable the worldwide genetics community to build, share and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, as well as annotations, on variants and samples; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes, enabling dozens of major research projects.
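A minimal sketch of a typical Hail workflow, based on its documented Python API (exact method names can differ between Hail versions, so treat this as an assumption rather than a verified snippet): import a VCF, apply basic variant QC, and run a common-variant association test.

```python
# Sketch of a typical Hail workflow (based on Hail's documented API; exact
# names may differ between versions): import variants, filter by call rate,
# and run a linear-regression association test against a sample phenotype.
import hail as hl

hl.init()

mt = hl.import_vcf("gs://my-bucket/cohort.vcf.bgz", reference_genome="GRCh38")

# Annotate per-variant QC metrics and keep well-called variants.
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)

# Assume a phenotype table keyed by sample ID with a numeric field 'pheno'.
pheno = hl.import_table("gs://my-bucket/phenotypes.tsv",
                        impute=True, key="sample_id")
mt = mt.annotate_cols(pheno=pheno[mt.s].pheno)

results = hl.linear_regression_rows(
    y=mt.pheno,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0],
)
results.order_by(results.p_value).show(10)
```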
VariantSpark - a Spark library for genomicsLynn Langit
VariantSpark is a custom Apache Spark library for genomic data. It implements a custom wide random forest machine learning algorithm designed for workloads with millions of features.
This document discusses genomic-scale data pipelines. It introduces Dr. Denis Bauer and his transformational bioinformatics team. It describes how genomic data and research will grow exponentially to exabytes by 2025. It outlines genomic research workflows and challenges like processing, analyzing, and visualizing large variant call format (VCF) data. It presents two cloud data pipeline patterns used by the team: 1) A Spark server cluster pipeline for machine learning on large genomic datasets. 2) A serverless pipeline using AWS Lambda and Step Functions for scalable genomic searches.
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
This document discusses the use of cloud computing technologies for genomic big data analysis. It begins by defining big data and describing the exponential growth of genomic data. It then discusses how cloud computing provides flexibility, scalability, and accessibility for genomic data processing through virtualization and large computing clusters. Specific technologies enabled for the cloud that help with genomic analysis are described, such as Hadoop, MapReduce, and genomic analysis tools adapted for these frameworks. The document concludes by discussing challenges remaining around data transfer speeds and the need for cloud application expertise, but also describes how platforms like Galaxy Cloudman and Cloudgene allow genomic analysis in the cloud without programming expertise.
Tin-Lap Lee (CUHK) presentation "GDSAP- A Galaxy-based platform for large-scale genomics analysis" from the Galaxy Community Conference 2012, Chicago, July 26th 2012
A topic I presented at the Cloud and DevOps stream at TechXLR8 London 2017. It covers using Google Cloud, Kubernetes, Docker and cloud functions to create a managed distributed compute infrastructure for generating synthetic genomic data for simulation and infrastructure-testing needs.
This document summarizes a presentation given by Luke Hickey of Pacific Biosciences on human genome sequencing using PacBio systems. It discusses PacBio sequencing technology developments, sequencing and assembly of the NA12878 genome, and the role of the NIST Genome in a Bottle (GIAB) reference materials. Specifically, it notes that PacBio sequenced the GIAB Ashkenazim trio genomes to high coverage and made the data publicly available. The sequencing and assembly of these genomes helps validate and improve PacBio sequencing technologies and supports the development and release of the trio as new NIST reference materials.
Opportunities for X-Ray science in future computing architecturesIan Foster
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no signs of slowing down: the next 10 years will surely see exascale, new cloud offerings, and terabit networks. In this talk I review several of these developments and discuss their potential implications for X-ray science and X-ray facilities.
Scientists have successfully stored 700 terabytes of data in a single gram of DNA, vastly exceeding previous DNA data density records. DNA is an ideal storage medium as it is incredibly dense, with each DNA base representing a binary digit. Additionally, DNA is very stable and can preserve data for hundreds of thousands of years without needing to be kept in controlled environments like other storage methods. Researchers are also exploring using DNA to build biological computers and memory devices, taking advantage of DNA's ability to store and process genetic information.
1. The document describes a method called Anchored Assembly for detecting structural variants from short-read sequencing data using read overlap assembly and reference removal.
2. The method was validated against other SV detection tools using validated SVs from fosmid/PacBio sequencing, detecting 15 previously undetected SVs with high sensitivity and specificity.
3. Examples are given of validated deletions and insertions detected in an Ashkenazi Jewish trio that were identical in the offspring and followed expected inheritance patterns from parents.
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
This slidedeck discusses the most biologically efficient, cost-effective method for successful NGS. The GeneRead DNA QuantiMIZE Kits enable determination of the optimum conditions for targeted enrichment of DNA isolated from biological samples, while the GeneRead DNAseq Panels V2 allow you to quickly and reliably deep sequence your genes of interest. Applications in translational and clinical research are highlighted.
DNA has potential as a long-term data storage medium due to its stability, density, and redundancy. It can store 700 terabytes of data in 1 gram, which is equivalent to 3 million CDs and weighs 151 kilos if stored as hard drives. While DNA sequencing and synthesis speeds are currently slow, the cost per megabase of sequencing has dropped tremendously from $10,000 in 2001 to 10 cents in 2012. Researchers have successfully encoded digital files like images, documents and audio clips in DNA, demonstrating its viability for archiving large volumes of data in a small, stable format.
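As a worked example of the density claim, the following toy Python encoder maps binary data onto DNA bases at two bits per base and decodes it back. Real DNA-storage codes add error-correcting redundancy and avoid homopolymer runs; this is purely illustrative.

```python
# Illustrative (toy) mapping of binary data onto DNA bases at 2 bits per base,
# plus the reverse decode. Real DNA-storage codes add redundancy and avoid
# long runs of the same base.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {v: k for k, v in BASE_FOR_BITS.items()}


def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    bits = "".join(BITS_FOR_BASE[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA storage"
dna = encode(message)
print(dna)                    # 4 bases per byte
assert decode(dna) == message
```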
- PacBio HiFi reads are long (>10 kb) and accurate (>99%). HiFi reads are available now for HG002 and soon for HG001 and HG005.
- HiFi reads will be useful for comprehensive variant detection and phasing. Plans are outlined to apply HiFi reads to structural variant benchmarking and expand small variant calling to difficult regions.
The first steps of analysing sequencing data (2GS, NGS) have entered a transitional period: on one hand, most analysis steps can be automated and standardized (pipelines); on the other, constantly evolving protocols and software updates make maintaining these analysis pipelines labour intensive.
I propose a centralized system within CSIRO that is flexible enough to cater for different analyses while also being generic enough to efficiently distribute labour-intensive maintenance and extension amongst the user community.
The presentation was given at CIBCB 2005 in San Diego about our approach to predicting recombination sites in protein sequences. Recombination is the method of choice for designing new proteins with desired new or enhanced properties.
The publication is:
Bauer, D.C., Bodén, M., Thier, R. and Gillam, E. M. “STAR: Predicting recombination sites from amino acid sequence.” BMC Bioinformatics, 2006 Oct 8; 7:437. PMID: 17026775
The primary goal of my trip to Seattle was to establish a collaboration with a world-leading group on data integration. But by having chosen Seattle, a hub for technology companies, I also learned about synergies between business and research: Ilya Shmulevich from the Institute for Systems Biology makes use of Amazon's "Random Forest" implementation and Google's 600,000-CPU cluster for cancer genomic association discovery. I also met with experts from the University of Washington and Microsoft Research to learn about technological advances for tackling Big Data and commoditizing parallelization. Finally, I observed a government-funded research agency invest in solutions geared towards their enterprise structure rather than adopt solutions designed for research institutes without an active computational community. In conclusion: CSIRO has unique properties and skill-sets that many collaborators would be interested in benefiting from; in return, such collaborations would propel CSIRO instantly to the forefront of technology, which could be very rewarding, particularly for the analysis of big, unstructured datasets.
Qbi Centre for Brain genomics (Informatics side)Denis C. Bauer
An overview of QBI’s production informatics framework with an emphasis on what services will be provided and how the resulting data are made available: from interactive quality control to integration with external data on the genome browser.
This session follows up from transcript quantification of RNAseq data and discusses statistical means of identifying differentially regulated transcripts and isoforms, contrasting these against microarray analysis approaches.
Allelic Imbalance for Pre-capture Whole Exome SequencingDenis C. Bauer
Exome sequencing has emerged as an economical way of focusing DNA sequencing efforts on the most functionally understood regions of the genome. Pre-capture pooling, where one bait library is used to pull down the exonic regions of several pooled samples simultaneously is a further financial improvement.
However, rare alleles in the pool might not attract baits at the same rate as reference-conforming sequences and may hence be underrepresented. We investigated this potential issue by sequencing a HapMap family (4 individuals) using the pre-capture protocols from Illumina and Nimblegen. We did not observe clear evidence that heterozygous variants are missed, but noted a trend for indels to be imbalanced.
Our findings do not provide clear evidence to rule out allelic imbalance or bias having an impact on research findings; this may be especially critical for low-cellularity cancer tissue, where rare alleles are more prevalent.
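A simple way to screen for the imbalance discussed above (an assumed illustration, not the study's actual analysis) is a binomial test on ref/alt read counts at heterozygous sites: without capture bias, the alt-read fraction should be close to 0.5.

```python
# Illustrative (assumed) screen for allelic imbalance at heterozygous sites:
# under no capture bias, ref and alt reads should be roughly 50/50, so a
# binomial test on the alt-read count flags strongly imbalanced sites.
from scipy.stats import binomtest

sites = [
    # (site, ref_reads, alt_reads) -- toy counts
    ("chr1:10177", 48, 52),
    ("chr2:20301", 71, 29),   # candidate imbalance
    ("chr3:55512", 12, 10),
]

for name, ref, alt in sites:
    n = ref + alt
    result = binomtest(alt, n, p=0.5)
    flag = "imbalanced?" if result.pvalue < 0.01 else "ok"
    print(f"{name}: alt fraction {alt / n:.2f}, p={result.pvalue:.3g} ({flag})")
```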
Cell differentiation and differential gene expressionStephanie Beck
Cells from the same individual can have different appearances despite having identical DNA because cells can selectively express different genes. Early in development, stem cells differentiate into specialized cell types by activating different sets of genes through gene regulation. Gene expression involves using DNA as a template to produce mRNA and then translate it into specific proteins. Different cell types produce different proteins by expressing only the genes required for their function.
This document provides an overview of protein structure, including levels of structure and classification. It discusses the importance of protein structure in determining function. The primary levels of structure are defined as primary (amino acid sequence), secondary (local folding patterns like alpha helices and beta sheets), tertiary (packing of secondary structures), and quaternary (assembly of protein chains). Protein structures can be classified based on their secondary structure composition as all-alpha, all-beta, alpha/beta, or alpha+beta. Domains are compact folding units associated with function.
How novel compute technology transforms life science researchDenis C. Bauer
AgileIndia 2018 Keynote. This talk covers how ‘Datafication’ will make data ‘wider’ (more features describing a data point), which represents a paradigm shift for Machine Learning applications. It also covers serverless architecture, which can cater for even compute-intensive tasks. It concludes by stating that business and life-science research are not that different: so let’s build a community together!
The ProteomeXchange consortium allows researchers to easily deposit and retrieve proteomics data. It includes repositories like PRIDE, PeptideAtlas, and recently MassIVE. The goal is to standardize submission and access across repositories through common identifiers and supported workflows. Over 1,300 datasets have been submitted, with many tools now supporting standard formats like mzIdentML for complete submissions. The most accessed datasets include large reference maps of the human proteome. Open source tools are improving submission and analysis of ProteomeXchange data.
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
This document discusses the development of an open-source application called OpenStudyBuilder that was built using the Neo4j graph database. OpenStudyBuilder has three main components: a clinical metadata repository, a web application interface, and an API layer. It applies domain-driven design principles to model complex clinical study data. Challenges discussed include performance issues with the Neo4j ORM library, how to present graph data in tables, changing data models over time (which requires data migrations), and potential limitations for non-profit or smaller users due to reliance on Neo4j Enterprise features. In summary, the document outlines how a Neo4j database was used as the data store for an enterprise clinical-study specification application to effectively model complex study data.
Blue Waters and Resource Management - Now and in the Futureinside-BigData.com
In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future.
Watch the video of this presentation: http://insidehpc.com/?p=36343
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Amazon Web Services
This session outlines how to deal with “big” (many samples) and “wide” (many features per sample) data on Apache Spark, how to keep runtime constant by using instantaneously scalable micro services (AWS Lambda), and how AWS technology has enabled inspirational real-world research use cases at CSIRO.
Speaker: Denis Bauer, Transformational Bioinformatics Team Leader, CSIRO
Level: 200
The document describes two projects, one to improve interprocess communication and one to improve slew models. The first tested different I/O protocols to optimize how a program passed data between processes, finding ZeroMQ fastest but ultimately not needed. The second fitted antenna slew models by correlating scheduled and logged slew times, updating the models and reducing errors.
The webinar covered new features and updates to the Nephele 2.0 bioinformatics analysis platform. Key updates included a new website interface, improved performance through a new infrastructure framework, the ability to resubmit jobs by ID, and interactive mapping file submission. New pipelines for 16S analysis using DADA2 and quality control preprocessing were introduced, and the existing 16S mothur pipeline was updated. The quality control pipeline provides tools to assess data quality before running microbiome analyses through FastQC, primer/adapter trimming with cutadapt, and additional quality filtering options. The webinar emphasized the importance of data quality checks and highlighted troubleshooting tips such as examining the log file for error messages when jobs fail.
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the 2017 MVAPICH User Group, Adam Moody from Lawrence Livermore National Laboratory presents: MVAPICH: How a Bunch of Buckeyes Crack Tough Nuts.
"High-performance computing is being applied to solve the world's most daunting problems, including researching climate change, studying fusion physics, and curing cancer. MPI is a key component in this work, and as such, the MVAPICH team plays a critical role in these efforts. In this talk, I will discuss recent science that MVAPICH has enabled and describe future research that is planned. I will detail how the MVAPICH team has responded to address past problems and list the requirements that future work will demand."
Watch the video: https://wp.me/p3RLHQ-hp6
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...Rothamsted Research, UK
Graph-based modelling is becoming more popular, in the sciences and elsewhere, as a flexible and powerful way to exploit data to power world-changing digital applications. Compared to the initial vision of the Semantic Web, knowledge graphs and graph databases are becoming a practical and computationally less formal way to manage graph data. On the other hand, linked data based on Semantic Web standards are a complementary, rather than alternative, approach to deal with these data, since they still provide a common way to represent and exchange information. In this paper we introduce rdf2neo, a tool to populate Neo4j databases starting from RDF data sets, based on a configurable mapping between the two. By employing agrigenomics-related real use cases, we show how such mapping can allow for a hybrid approach to the management of networked knowledge, based on taking advantage of the best of both RDF and property graphs.
IGUANA: A Generic Framework for Benchmarking the Read-Write Performance of Tr...Lixi Conrads
Iguana is a framework for benchmarking the read-write performance of triple stores. It provides a realistic scenario by simulating multiple concurrent users querying and updating a triple store. Iguana executes benchmarks on different datasets and triple stores, measuring key performance indicators like queries per second. Results are stored in files and triple stores for analysis. The framework is extensible and can benchmark any dataset, SPARQL/update queries, and triple store configuration.
In this deck from the GPU Technology Conference, Thorsten Kurth from Lawrence Berkeley National Laboratory and Josh Romero from NVIDIA present: Exascale Deep Learning for Climate Analytics.
"We'll discuss how we scaled the training of a single deep learning model to 27,360 V100 GPUs (4,560 nodes) on the OLCF Summit HPC System using the high-productivity TensorFlow framework. We discuss how the neural network was tweaked to achieve good performance on the NVIDIA Volta GPUs with Tensor Cores and what further optimizations were necessary to provide excellent scalability, including data input pipeline and communication optimizations, as well as gradient boosting for SGD-type solvers. Scalable deep learning becomes more and more important as datasets and deep learning models grow and become more complicated. This talk is targeted at deep learning practitioners who are interested in learning what optimizations are necessary for training their models efficiently at massive scale."
Watch the video: https://wp.me/p3RLHQ-kgT
Learn more: https://ml4sci.lbl.gov/home
and
https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
We are living in the world of “Big Data”. “Big Data” is mainly expressed with three Vs: Volume, Velocity and Variety. The presentation will discuss how Big Data impacts us and how SAS programmers can use SAS skills in a Big Data environment.
The presentation will introduce Big Data Storage solution – Hadoop and NoSQL. In Hadoop, the presentation will discuss two major Hadoop capabilities - Hadoop Distributed File System (HDFS) and Map/Reduce (parallel computing in Hadoop). The presentation will show how SAS can work with Hadoop using HDFS LIBNAME, FILENAME, SAS/ACCESS to Hadoop HIVE and SAS GRID Managers to Hadoop YARN. The presentation will also introduce the concepts of NoSQL database for a big data solution.
The presentation will also introduce how SAS can work with a variety of data formats, especially XML and JSON. It will show the use case of converting XML documents to SAS datasets using the LIBNAME XMLV2 XMLMAP statement. It will also introduce REST APIs for extracting data over the internet and demonstrate how SAS PROC HTTP can move data through a REST API.
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Practical Chaos Engineering will show how to start running chaos experiments in your infrastructure and will guide you through the principles of chaos.
Exquiron migrated their cheminformatics platform from PipelinePilot to KNIME and ChemAxon technologies due to rising PipelinePilot costs. They worked with ChemAxon consultants to port complex workflows like hit expansion and dose-response reporting to KNIME. While the migration was successful, KNIME has higher memory requirements and slower loops than PipelinePilot. However, KNIME provides a modern reporting interface and faster fingerprint searches. With help from ChemAxon, Exquiron was able to optimize workflows and maintain project timelines during the platform migration.
This workshop is a hands-on session using Perftools from Cray with the NAS Benchmarks, step by step, on Shaheen II. I present the CrayPat, Apprentice2, and Reveal tools. There is a short introduction to Extrae/Paraver from the Barcelona Supercomputing Centre.
Similar to Population-scale high-throughput sequencing data analysis (20)
Cloud-native machine learning - Transforming bioinformatics research Denis C. Bauer
Cloud computing and artificial intelligence transform bioinformatics research
Denis Bauer, Transformational Bioinformatics Team
Genomic data is outpacing traditional Big Data disciplines, producing more information than astronomy, Twitter, and YouTube combined. As such, genomic research has leapfrogged to the forefront of Big Data and cloud solutions. We developed software platforms using the latest in cloud architecture, artificial intelligence and machine learning to support every aspect of genome medicine, from disease gene detection through to validation and personalized medicine.
This talk outlines how we find disease genes for complex genetic diseases, such as ALS, using VariantSpark, a custom machine learning implementation capable of dealing with whole-genome sequencing data of 80 million common and rare variants. To support disease gene validation, we created GT-Scan, an innovative web application that we think of as the “search engine for the genome”. It enables researchers to identify the optimal editing spot to create animal models efficiently. The talk concludes by demonstrating how cloud-based software distribution channels (digital marketplaces) can be harnessed to share bioinformatics tools internationally and make research more reproducible.
Translating genomics into clinical practice - 2018 AWS summit keynoteDenis C. Bauer
CSIRO's part of the co-presented Keynote at the AWS Public Sector Summit in Canberra on genomics health care. Three key messages: 1) We need a shift from treatment towards prevention 2) Once you go serverless you never go back 3) DevOps 2.0: Hypothesis-driven architecture evolution
Going Server-less for Web-Services that need to Crunch Large Volumes of DataDenis C. Bauer
AgileIndia breakout session on serverless applications. This talk covers how AWS serverless infrastructure can be used for a wide range of applications, such as compute-intensive tasks (GT-Scan), tasks requiring continuous learning (CryptoBreeder), and data-intensive tasks (PhenGen Database).
Abstract: The focus in this session will be put on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. The issue of transcript quantification and genomic variants that can be identified from RNAseq data will be discussed.
The document discusses challenges in identifying causal variants for complex diseases from sequencing data. It notes that while ideal situations may involve finding a variant common in all affected individuals and absent in unaffected, reality involves sifting through around 3.5 million SNPs. Methods like genome-wide association studies and focusing on exonic variants can help prioritize, but functional variants may also reside outside of protein coding regions. Considering combinations of variants through statistical genetics approaches may be needed to explain disease heritability. Quality control, annotation, and filtering are important but finding causal variants remains difficult.
Variant (SNPs/Indels) calling in DNA sequences, Part 2Denis C. Bauer
Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced.
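As a rough, hedged illustration of such a post-mapping pipeline (using samtools/bcftools as one possible toolchain rather than the GATK workflow covered in the session; all file names are placeholders):

#!/usr/bin/env bash
# Sketch only: post-mapping SNP/indel calling with samtools/bcftools.
# ref.fa and sample.bam are placeholder file names, not from the talk.
set -euo pipefail
samtools sort -o sample.sorted.bam sample.bam      # coordinate-sort the alignments
samtools index sample.sorted.bam                   # index for random access
bcftools mpileup -f ref.fa sample.sorted.bam \
  | bcftools call -mv -Ov -o sample.raw.vcf        # call SNPs and indels
bcftools filter -e 'QUAL<20' sample.raw.vcf -o sample.filtered.vcf   # simple quality filter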
Variant (SNPs/Indels) calling in DNA sequences, Part 1 Denis C. Bauer
This document discusses various topics related to mapping short sequencing reads to a reference genome, including:
- File formats like FASTQ that store sequencing reads and BAM/SAM formats for aligned reads.
- Alignment algorithms like hash table-based (MAQ, BWA) and suffix tree-based (BWA, Bowtie) mappers.
- Visualizing alignments using the Integrative Genomics Viewer (IGV).
- Performing quality control on BAM files by checking the percentage of mapped reads and coverage uniformity (see the sketch after this list).
- The next session will focus on identifying genomic variants from mapped reads through SNP/indel calling and filtering.
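A minimal sketch of the BAM quality-control step mentioned above (assuming samtools is available; Run1.bam is a placeholder file name):

#!/usr/bin/env bash
# Sketch only: fraction of mapped reads and a rough mean-coverage estimate for one BAM.
set -euo pipefail
samtools flagstat Run1.bam        # reports, among others, the percentage of mapped reads
samtools depth -a Run1.bam \
  | awk '{sum += $3; n++} END {if (n > 0) print "mean coverage:", sum / n}'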
Introduction to second generation sequencingDenis C. Bauer
An introduction to second generation sequencing will be given with focus on the basic production informatics: The approach of raw data conversion and quality control will be discussed.
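A minimal sketch of that raw-data quality-control step, assuming FastQC is installed (the paths are placeholders):

#!/usr/bin/env bash
# Sketch only: read-level QC reports for raw fastq files with FastQC.
set -euo pipefail
mkdir -p qc_reports
fastqc fastq/Exp1/Run1_read1.fastq fastq/Exp1/Run2_read1.fastq -o qc_reports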
Bioinformatics is an interdisciplinary field that merges biology, computer science, and information technology. It is applied in areas like genomics, proteomics, and systems biology. While some basic analysis can be done through user-friendly tools, truly customized work requires programming skills and an understanding of underlying algorithms. Bioinformatics is not just a service field but rather involves scientific experimentation throughout the entire analysis process from experimental design to evaluation. It is a dedicated field of research in its own right, not a quick or interchangeable task.
Critical Run files can be missing/corrupt after the Run folder was transferred from the HiSeq storage to the cluster storage. This presentation discusses the issue and suggests four workarounds.
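One generic safeguard, not necessarily among the four workarounds in the presentation, is to verify checksums after the transfer; a sketch with placeholder paths:

#!/usr/bin/env bash
# Sketch only: verify a transferred Run folder against a checksum manifest.
set -euo pipefail
SRC=/hiseq/runs/Run_0001          # Run folder on the instrument storage (placeholder)
DST=/cluster/runs/Run_0001        # copy on the cluster storage (placeholder)
MANIFEST="$PWD/run.md5"
(cd "$SRC" && find . -type f -exec md5sum {} + | sort) > "$MANIFEST"
(cd "$DST" && md5sum -c --quiet "$MANIFEST")   # prints any file that is missing or corrupt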
Deciphering the regulatory code in the genomeDenis C. Bauer
There are messages hidden within our genome, regulating when and how long a gene is switched on. The presentation describes a method, STREAM, targeted at deciphering this regulatory code.
This was our presentation for our imaginary product for the commercialization workshop. Note, all "research results" and illustrations are totally made up and therefore not necessarily reflective of reality (i.e. actual biological processes). This presentation was created as part of the learning experience of how to pitch biological research to venture capitalists.
Population-scale high-throughput sequencing data analysis
1. Denis C. Bauer | Bioinformatics | @allPowerde
08 July 2014
CSIRO COMPUTATIONAL INFORMATICS
Population-scale high-throughput sequencing data analysis
By Melody
2. Talk Overview
• Background: CSIRO/Omics Project
• Methods: NGS Data Processing on HPC/Cloud
• Research Outcome: Cancer and Microbes in Colorectal Cancer
3. CSIRO: Who we are
The Commonwealth Scientific and Industrial Research Organisation
• 62% of our people hold university degrees (2000 doctorates, 500 masters)
• With our university partners, we develop 650 postgraduate research students
• Top 1% of global research institutions in 14 of 22 research fields; top 0.1% in 4 research fields
• People: 6500 | Divisions: 13 | Locations: 58 | Flagships: 11 | Budget: $1B+
[Map of CSIRO sites across Australia, from Darwin and Perth to Hobart, including multiple sites in Sydney, Canberra, Melbourne and Brisbane]
4. Our business units
• 11 National Research Flagships
• 12 Research Divisions
• National Research Facilities and Collections
• Transformational Capability Platforms
Research groups: Food, Health & Life Science Industries; Environment; Manufacturing, Materials & Minerals; Energy; Information & Communications
5. Our track record: top inventions
1. Fast WLAN (Wireless Local Area Network)
2. Polymer banknotes
3. Relenza flu vaccine
4. Extended wear contacts
5. Aerogard
6. Total Wellbeing Diet
7. RAFT polymerisation
8. BarleyMax
9. Self-twisting yarn
10. Softly washing liquid
6. Part 1: The ‘omics project
The goal of the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome.
7. Data from Pilot Study
Full cohort: 500 (178 to date) individuals from colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith), organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle).
8. Considerations before sequencing: Undersampling
• Objective: capture genomic variants reliably in tumour, normal and adipose tissue.
• Sequencing effort:
• 12 tumour -> 6 lanes (2-plex)
• 12 normal -> 3 lanes (4-plex)
• 12 adipose -> 3 lanes (4-plex)
More depth is needed for the tumour samples due to potentially low cellularity: at 2-plex (two samples per lane) the 12 tumour samples occupy 6 lanes, twice the 3 lanes used for the 4-plexed normal and adipose samples.
[Figure: tumour vs. normal sample, with additional depth allocated to the tumour]
9. Considerations before sequencing: Flowcell design
• Objective: process samples avoiding confounding factors.
Rather than sequencing the normal, adipose and tumour samples on one lane each, subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode): each tissue is 4-plexed and the samples are sequenced over 3 lanes.
[Figure: lane layout contrasting one lane per tissue with 4-plex multiplexing of samples L1, L2, O1, O2 across three lanes]
11. Blue Monster says
Design your experiment with project-specific pitfalls in mind.
Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
12. Part 2: NGS Data Processing
Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance compute clusters/cloud.
13. Resource consumption for Variant Calling
High-performance compute: a submission script hands the tasks to the scheduler, e.g.
qsub -t 1-36 task.qsub
#PBS -l nodes=2:ppn=8
36 samples (2.7 TB of data) on average require 128 hours CPU time (s.e. = 15) and 77 GB RAM (s.e. = 0.34).
[Figure: CPU time (hours), real time (hours) and memory (GB) per task (mapping, recalibration, transcripts, annotation, variant calling) for DNAseq and RNAseq]
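A minimal sketch of what such a task.qsub array job could look like; the resource request matches the slide, but the script body, walltime and samples.txt bookkeeping are illustrative assumptions rather than the project's actual script:

#!/bin/bash
# Sketch only: PBS array job, submitted as  qsub -t 1-36 task.qsub
#PBS -l nodes=2:ppn=8
#PBS -l walltime=12:00:00
#PBS -N variant_calling
cd "$PBS_O_WORKDIR"                              # run from the submission directory
SAMPLE=$(sed -n "${PBS_ARRAYID}p" samples.txt)   # one sample name per line, picked by array index
echo "Processing ${SAMPLE} (array task ${PBS_ARRAYID})"
# mapping, recalibration and variant calling for ${SAMPLE} would go here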
14. Tailored processing for different sequencing applications (doi:10.1038/nbt.2421)
Variant calling, methylation sites and gene expression each need their own wet-lab protocols and production informatics. Despite the different approaches we want to use the same processing framework!
15. Wish list for a framework
Reusability, reproducibility, robustness, adaptability, efficiency, cutting edge, data security, HPC environment, knowledge transfer (publication).
19. DEMO - files
Project X/
  fastq/
    Exp1/
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2/
      Run3_read1.fastq
We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2).
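A hedged sketch of setting up such a project layout before running NGSANE; the directory names mirror the demo, while the commands and source paths are illustrative and not part of NGSANE itself:

#!/usr/bin/env bash
# Sketch only: group raw fastq files by experiment/condition as in the demo layout.
set -euo pipefail
mkdir -p ProjectX/fastq/Exp1 ProjectX/fastq/Exp2
cp /path/to/Run1_read1.fastq /path/to/Run2_read1.fastq ProjectX/fastq/Exp1/
cp /path/to/Run3_read1.fastq ProjectX/fastq/Exp2/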
20. DEMO – setting up config file
#********************
# Data
#********************
declare -a DIR; DIR=( Exp1 Exp2 )
#********************
# Tasks
#********************
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2
#********************
# Paths
#********************
# reference genome
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project-specific settings (here: use iGenomes).
21. DEMO – dry run
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
[NGSANE] Trigger mode: [empty] (dry run)
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup enviroment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
We run NGSANE in dry-run mode to test what jobs it would submit.
22. DEMO – submit
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
[NGSANE] Trigger mode: armed
Double check! Then type safetyoff and hit enter to launch the job: safetyoff
... take cover!
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424899
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424900
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
Jobnumber 2424901
We submit the HPC jobs. Check out the returned qsub identifiers.
23. DEMO – scheduler
bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c
burnet-srv.idpx.hpsc.csiro.au:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
2424899.burnet-s bau04c normal NGs_bowtie2_RunM 9085 1 2 -- 00:05 R 00:00
2424900.burnet-s bau04c normal NGs_bowtie2_RunM 9178 1 2 -- 00:05 R 00:00
2424901.burnet-s bau04c normal NGs_bowtie2_RunM 9353 1 2 -- 00:05 R 00:00
Three HPC jobs run in parallel because there were three fastq files. But there is no limit to the number of files processed in parallel: easy scale-up to populations.
24. DEMO – report
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
[NGSANE] Trigger mode: html
>>>>> Generate HTML report
>>>>> startdate Fri Jan 24 08:02:37 EST 2014
>>>>> hostname burnet-login
>>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
--R --R version 3.0.0 (2013-04-03) -- "Masked Marvel”
--Python--Python 2.7.2
QC - bowtie2
>>>>> Generate HTML report - FINISHED
>>>>> enddate Fri Jan 24 08:02:39 EST 2014
More report examples
Now create the HTML overview page to check whether the jobs finished successfully and what the results are (bowtie2: mapping statistics).
25. DEMO - files
Project X/
  Summary HTML
  fastq/
    Exp1/
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2/
      Run3_read1.fastq
  Exp1/
    Bowtie/
      Run1.bam
      Run2.bam
  Exp2/
    Bowtie/
      Run3.bam
The resulting file structure: every experiment has a folder with the tasks as subfolders and, in them, the results (here: BAM files).
26. NGSANE currently supports
• Transfer data (smbclient)
• Quality control (GATK, FastQC, RNA-SeQC, custom summaries, user code)
• Trimming (Cutadapt, Trimgalore, Trimmomatic)
• Mapping (BWA, Bowtie1, Bowtie2, Tophat)
• Transcript quantification (cufflinks, htseq, bedtools)
• Variant calling (GATK, samtools)
• Variant annotation (annovar)
• 3D genome structure (Hicup, fit-hi-c, Hiclib, Homer)
A sketch of a config enabling several of these modules follows below.
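By way of illustration, further modules are switched on through config flags in the same way as the bowtie2 mapping flag shown in the demo; apart from RUNMAPPINGBOWTIE2, the flag names below are assumptions and should be checked against the NGSANE documentation:

#********************
# Tasks (sketch; only RUNMAPPINGBOWTIE2 is taken from the demo above,
# the other flag names are assumptions)
#********************
RUNFASTQC="1"            # read-level quality control (flag name assumed)
RUNMAPPINGBOWTIE2="1"    # mapping with bowtie2
RUNSAMVAR="1"            # variant calling with samtools (flag name assumed)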
27. For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine
28. Blue Monster says
Analyze your data to be reproducible and well documented, with tools that scale well to larger datasets.
Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576
29. Part 3: Combining Omics Data
Seeing the full picture requires taking all information into account.
30. Result overview: traditional differential analysis
1. 722 genes differentially expressed (DE) between tumour and normal
• QC: good concordance with genes known to be up/down regulated in CRC
2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated
• QC: good concordance with the previously reported gut methylation profile (Fernandez et al. Genome Res. 2012)
[Scatter plots of tumour FPKM vs. normal FPKM highlighting known DE genes (CSIRO in-house) and known DM locations]
33. DNA methylation: Blood signatures in adipose and gut samples
Tim Peters
Some gut/adipose samples have blood-like signatures.
35. Medical history: Blood potentially resulting from medication
• CARTIA: individuals 14, 50, 57
• WARFARIN: 40
• ASPIRIN: 59, 7
• COPLAVIX: 12
• No anti-clotting drug: 2, 62, 4
• No medication: 19, 20
Anti-thrombosis drugs are significantly enriched in individuals with human material in digesta (Wilcoxon rank sum test p-value = 0.02).
36. Microbial data: Blood-“liking” opportunistic bacteria are enriched in contaminated samples
E. coli, Salmonella and similar opportunistic pathogens respond to inflammation and bleeding.
A bacterial marker for low-level chronic gut bleeding?
38. Three things to remember
• Good experimental design is necessary (even) in sequencing experiments
• Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high-performance systems and Amazon's elastic cloud)
• Promising research opportunities lie in the integration of multiple high-throughput data sources
39. COMPUTATIONAL INFORMATICS
Thank you
Computational Informatics
Denis C. Bauer
t +61 2 9123 4567
e Denis.Bauer@csiro.au
w www.csiro.au/bioinformatics
Buske et al., Bioinformatics, Jan 2014
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Fabian A. Buske, Susan Clark, Hugh French, Martin Smith (Garvan Institute of Medical Research, Sydney, Australia)
Robert Dunne, Tim Peters, Paul Greenfield, Piotr Szul, Tomasz Bednarz (Computational Informatics, CSIRO, Australia)
Garry Hannan (Animal Food and Health Science, CSIRO, Australia)
Rodney Scott (University of Newcastle, Australia)
Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO's IM&T; Science and Industry Endowment Fund
http://www.genome-engineering.com.au/
Editor's Notes
Staff # as at 30 June 2012 = 6492 (FTE = 5720)
2011-12 budget = $1.2 billion
--------------------
Some specifics about us:
CSIRO is Australia’s national science agency. We are a mission-directed, large-scale, multidisciplinary research and development organisation.
Since 1926, we have been in the business of applying scientific knowledge to the big issues facing Australia and increasingly the world. Globally we are recognised as one of the top 10 applied research organisations.
We bring together the best scientists in the world and teams of professionals to work together to help create industries, national wealth, a healthy environment and improved living standards.
We have delivered many innovations that have positively impacted on the daily lives of Australians and billions of others around the world.
In terms of our vital statistics we generate annual revenues of over A$1 billion . We have around 6,500 people in more than 50 locations across Australia. We lead 11 National Research Flagships addressing major challenges like water, climate, health, manufacturing, mineral resources. This includes our two new flagships: Biosecurity, Digital Productivity and Services (June 2012). We are a leading Australian patenting organisation with over 3,500 patents (granted and pending) and manage an IP portfolio of over 150 revenue bearing licenses.
While our research enjoys a high global ranking in terms of publication and citation rates it’s our focus on creating positive impact from science at scale that sets us apart from others in the Australian innovation system.
We do science with purpose. We do it well. We make a difference.
CSIRO operates in a matrix. This is to ensure we have the flexibility we need to be able to provide the right mix of skills and talent for major projects; pulling the right people from all across the organisation to form multidisciplinary teams. We understand that often the most successful science comes from crossing boundaries, and working in a matrix structure gives us the ability to do this and therefore help us to deliver impact for Australia.
We organise ourselves into 5 research groups. Within these we have 11 National Research Flagships (the two most recent being approved in June 2012 – Biosecurity, Digital Productivity and Services), plus core research portfolios, and 12 research Divisions as at 1 July 2012. We also have Transformational Capability Platforms and several national research facilities and collections.
There are equivalent organisations to CSIRO in a number of countries like India and South Africa albeit with slightly narrower missions. CSIRO’s research spans agriculture, food, manufacturing, materials, energy, minerals, health, ICT and the climate, water and environmental domains.
This is important to understand because increasingly the solutions to the major challenges we face are being found across sectors and at the interface of different scientific disciplines. So one of the hidden benefits of being large and multidisciplinary that we have discovered, is that you can more readily assemble the teams and partnerships necessary to deliver the scale of impact required to address the big questions facing humanity.
While we are a research organization, we were successful at commercializing a couple of our products. Most famously the wifi protocol which is now in every device using wireless technology like your laptop or phone. Closer to my area of research is Barlymax, a cereal which is high in fibre specifically developed to reduce the risk of bowel cancer.
We have a strong track record of commercial success. Our work has impacted the daily lives of Australians and those around the world. These are some of our top inventions.