Spark meetup London: share and analyse genomic data at scale with Spark, ADAM... (Andy Petrella)
Genomics and health data are among today's hot topics, requiring heavy computation and especially machine learning. Better tooling here helps a science with very relevant societal impact achieve even better outcomes. That is why Apache Spark and the ADAM library built on it are a must-have.
This talk will be twofold.
First, we'll show how Apache Spark, MLlib and ADAM can be plugged together to extract information from even huge and wide genomics datasets. Everything will be packed into examples in the Spark Notebook, showing how bio-scientists can work interactively with such a system.
Second, we'll explain how these methodologies, and even the datasets themselves, can be shared at very large scale between remote entities such as hospitals or laboratories, using microservices that leverage Apache Spark, ADAM, Play Framework 2, Avro and Tachyon.
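As a taste of the first half, here is a minimal notebook-style sketch of the Spark + MLlib + ADAM pairing. It assumes the 2015-era ADAM API (loadGenotypes via ADAMContext, the Genotype/GenotypeAllele Avro types) and a SparkContext named sc; the input path and k = 3 are illustrative only.

import scala.collection.JavaConverters._
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.GenotypeAllele

// Load genotypes from an ADAM/Parquet file into an RDD[Genotype].
val genotypes = sc.loadGenotypes("/data/1000genomes/chr22.adam")

// One feature vector per sample: alternate-allele counts per variant site,
// sorted by position so every sample gets the same vector layout.
val features = genotypes
  .map { g =>
    val altCount = g.getAlleles.asScala.count(_ == GenotypeAllele.Alt)
    (g.getSampleId.toString, (g.getVariant.getStart.longValue, altCount.toDouble))
  }
  .groupByKey()
  .mapValues(calls => Vectors.dense(calls.toArray.sortBy(_._1).map(_._2)))

// Cluster the samples; on 1000 Genomes data the clusters tend to line up
// with ancestral populations.
val model = KMeans.train(features.values.cache(), k = 3, maxIterations = 20)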
ADAM is an open source platform for scalable genomic analysis that defines a data schema, Scala API, and command line interface. It uses Apache Spark for efficient parallel and distributed processing of large genomic datasets stored in Parquet format. Key features of ADAM include its ability to perform iterative analysis on whole genome datasets while minimizing data movement through Spark. The document also describes using ADAM and PacMin for long read assembly through techniques like minhashing for fast read overlapping and building consensus sequences on read graphs.
ADAM is a scalable genome analysis platform that uses a column-oriented file format called Parquet to efficiently store and access large genomic datasets across distributed systems. It provides APIs and tools for transforming, analyzing, and querying genomic data in a scalable way using Apache Spark. Some key goals of ADAM include enabling efficient processing of genomes using clusters/clouds, providing a data format for parallel data access, and enhancing data semantics to allow more flexible access patterns.
ADAM is an open source, high performance, distributed platform for genomic analysis built on Apache Spark. It defines a Scala API and data schema using Avro and Parquet to store data in a columnar format, addressing the I/O bottleneck in genomics pipelines. ADAM implements common genomics algorithms as data or graph parallel computations and minimizes data movement by sending code to the data using Spark. It is designed to scale to processing whole human genomes across distributed file systems and cloud infrastructure.
This document provides a summary of the Scalable Genome Analysis with ADAM project. ADAM is an open-source, high-performance, distributed platform for genomic analysis that defines a data schema, data layout on disk, and programming interface for distributed processing of genomic data using Spark and Scala. The goal of ADAM is to integrate across terabyte and petabyte-scale datasets to enable the discovery of low frequency genetic variants linked to traits and diseases.
Spark Summit Europe: Share and analyse genomic data at scale (Andy Petrella)
Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook
Sharp intro to Genomics data
What are the Challenges
Distributed Machine Learning to the rescue
Projects: Distributed teams
Research: Long process
Towards Maximum Share for efficiency
This document discusses scalable genome analysis using ADAM (Apache Spark-based framework). It begins by describing genomes and the goal of analyzing genetic variations. The document then discusses challenges like the large size of genomes and complexity of linking variations to traits. It proposes using ADAM's schema, optimized storage and algorithms to accelerate common access patterns like overlap joins. The document also emphasizes applying biological knowledge like protein grammars to make sense of non-coding variations. Finally, it acknowledges contributions from various institutions that have helped develop ADAM and its ability to enable genome analysis at scale.
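To make the overlap-join access pattern concrete, here is a small self-contained Scala sketch. ADAM's own region join is more sophisticated (it partitions by genomic region); every name below is invented for illustration, and the broadcast variant assumes the feature set is small.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Two regions overlap when they sit on the same contig and their
// half-open [start, end) intervals intersect.
case class Region(contig: String, start: Long, end: Long) {
  def overlaps(that: Region): Boolean =
    contig == that.contig && start < that.end && that.start < end
}

// Broadcast overlap join: pair each read with every feature it overlaps.
def overlapJoin[R, F](sc: SparkContext,
                      reads: RDD[(Region, R)],
                      features: Seq[(Region, F)]): RDD[(R, F)] = {
  val small = sc.broadcast(features)
  reads.flatMap { case (region, read) =>
    small.value.collect { case (fRegion, f) if region.overlaps(fRegion) => (read, f) }
  }
}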
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale (Andy Petrella)
A talk given at the BioBankCloud conference in Feb 2015 about distributed computing in the contexts of genomics and health.
In this one, we presented the results we obtained exploring the 1000 Genomes data using ADAM, followed by an introduction to our scalable GA4GH server implementation built using ADAM, Apache Spark and Play Framework 2.
Lightning fast genomics with Spark, Adam and Scala (Andy Petrella)
This document discusses using Apache Spark and ADAM to perform scalable genomic analysis. It provides an overview of genomics and challenges with existing approaches. ADAM uses Apache Spark and Parquet to efficiently store and query large genomic datasets. The document demonstrates clustering genomic data from the 1000 Genomes Project to predict populations, showing ADAM and Spark can handle large genomic workloads. It concludes these tools provide scalable genomic data processing but future work is needed to implement more advanced algorithms.
Managing Genomes At Scale: What We Learned - StampedeCon 2014 (StampedeCon)
This document summarizes lessons learned from managing large genomic datasets at Monsanto. It discusses how Monsanto uses big data technologies like Hadoop, HBase, and Solr to store and query genomic data at scale. Key lessons include using HBase more like a hashmap than a relational database, denormalizing HBase schemas, and using distributed search technologies like SolrCloud rather than rebuilding Solr indexes. The document provides examples of genomic data formats and architectures used to store, index, and retrieve genomic feature data from petabytes of sequence data.
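A hedged sketch of the "HBase as a hashmap" idea from the summary above, using the standard HBase client API; the table name, column family, and row-key layout are invented for illustration.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Pack the lookup dimensions into the row key (zero-padded positions keep
// lexicographic order aligned with genomic order) and denormalize the row
// so a single Get answers the whole question, hashmap-style.
val pos = 1234567L
val rowKey = f"chr01:$pos%010d:snp" // contig : position : feature type

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("genome_features"))
val result = table.get(new Get(Bytes.toBytes(rowKey)))
val feature = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("feature"))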
Data Enthusiasts London: Scalable and Interoperable data services. Applied to... (Andy Petrella)
Data science requires so many skills, so many people, and so much time before the results can be accessed. Moreover, those results can no longer be static. And finally, Big Data arrives on the plate, so the whole tool chain needs to change.
In this talk Data Fellas introduces Shar3, a toolkit aiming to bridge those gaps and build an interactive, distributed data processing pipeline, or loop!
The talk then covers today's genomics problems, including data types, processing and discovery, by introducing the GA4GH initiative and its implementation using Shar3.
DNA sequencing is producing a wave of data which will change the way that drugs are developed, patients diagnosed, and our understanding of human biology. To fulfill this promise, however, the tools for interpretation and analysis must scale to match the quantity and diversity of "big data genomics."
ADAM is an open-source genomics processing engine, built using Spark, Apache Avro, and Parquet. This talk will discuss some of the advantages that the Spark platform brings to genomics, the benefits of using technologies like Parquet in conjunction with Spark, and the challenges of adapting new technologies for existing tools in bioinformatics.
These are slides for a talk given at the Apache Spark Meetup in Boston on October 20, 2014.
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
"Spark, DeepLearning and Life Sciences, Systems Biology in the Big Data age" Dev Lakhani, Founder of Batch Insights
YouTube Link: https://www.youtube.com/watch?v=z6aTv0ZKndQ
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the author:
Dev Lakhani has a background in Software Engineering and Computational Statistics and is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.
Challenges and Opportunities of Big Data Genomics (Yasin Memari)
The document discusses the challenges and opportunities of big data genomics. It notes that the bottleneck in genomics has shifted from data generation to data handling as sequencing capacity doubles every year. While compression can help address the data deluge, throughput from techniques like metagenomics and single-cell sequencing will continue to outpace storage gains. The document then explores solutions for analyzing and storing large genomic datasets through techniques like cloud computing, distributed file systems, and MapReduce frameworks.
Slides presented at the Spark Summit East 2015 (http://spark-summit.org/east). Video should be available through their site, at some point in the future.
(Some of these slides were adapted from an earlier talk "Why is Bioinformatics a Good Fit for Spark?", given to a Spark meetup audience.)
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...) (Amazon Web Services)
Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ... (Spark Summit)
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the 1000 Genomes Project dataset, with the genomes of 2,504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods that has been found useful in this context, both for its potential for parallelization and for its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables, and they either fail or are inefficient in the WGAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for highly dimensional datasets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets we found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and the related challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
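For context, this is roughly what the stock MLlib call looks like. The toy matrix stands in for the preprocessed genotype data (real WGAS inputs have millions of features per sample, which is exactly where this off-the-shelf path becomes inefficient and RandomForestHD steps in); the 0/1/2 encoding and parameter values are our assumptions, not the talk's.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Toy stand-in for the WGAS matrix: label = phenotype, features = 0/1/2
// alternate-allele counts per variant.
val data = sc.parallelize(Seq(
  (0.0, Array(0.0, 1.0, 2.0, 0.0)),
  (1.0, Array(2.0, 2.0, 0.0, 1.0)),
  (0.0, Array(0.0, 0.0, 1.0, 0.0)),
  (1.0, Array(2.0, 1.0, 0.0, 2.0))
)).map { case (label, g) => LabeledPoint(label, Vectors.dense(g)) }

// The stock MLlib forest; when n_samples << n_features this is the call
// that becomes inefficient and that a wide-data implementation replaces.
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 500,
  featureSubsetStrategy = "sqrt",
  impurity = "gini",
  maxDepth = 10,
  maxBins = 32)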
From Genomics to Medicine: Advancing Healthcare at Scale (Databricks)
With the exponential growth of genomic data sets, healthcare practitioners now have the opportunity to improve human outcomes at an unprecedented pace. These outcomes are difficult to realize in the existing ecosystem of genomic tools, where biostatisticians regularly chain together command-line interfaces based on a single-node setup on premise. The Databricks Unified Analytics Platform for Genomics empowers users to perform end-to-end analysis on our massively scalable platform in the cloud: in only minutes, a data scientist can visualize an individual’s disease risk based on their raw genomic data. Built on Apache Spark, we provide click-button implementations of accepted best practice workflows, as well as low-level Spark SQL optimizations for common genomics operations.
This document discusses the challenges of analyzing large datasets from metagenomic shotgun sequencing experiments. It notes that while sequencing costs have decreased significantly, the computational analysis of the massive amounts of data generated still poses major challenges. It introduces the concept of "digital normalization" as an approach to reduce dataset sizes while retaining most of the biological information by removing redundant reads. The document advocates for making analysis tools and datasets openly accessible to help advance understanding of microbial communities from metagenomics studies.
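A single-machine sketch of the digital normalization idea described above: keep a read only while the median abundance of its k-mers stays below a coverage cutoff, then count its k-mers. The k and cutoff values are typical defaults, not taken from the talk; real implementations such as khmer use a count-min sketch rather than an exact table.

import scala.collection.mutable

def digitalNormalization(reads: Iterator[String], k: Int = 20, cutoff: Int = 20): Iterator[String] = {
  // Exact k-mer counts for clarity; a count-min sketch keeps this bounded.
  val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
  reads.filter { read =>
    val kmers = read.sliding(k).filter(_.length == k).toSeq
    val keep = kmers.nonEmpty && {
      // Median abundance of this read's k-mers seen so far.
      val median = kmers.map(counts).sorted.apply(kmers.size / 2)
      median < cutoff
    }
    // Only reads we keep contribute to future abundance estimates.
    if (keep) kmers.foreach(km => counts(km) += 1)
    keep
  }
}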
Science has evolved from the isolated individual tinkering in the lab, through the era of the “gentleman scientist” with his or her assistant(s), to group-based then expansive collaboration and now to an opportunity to collaborate with the world. With the advent of the internet the opportunity for crowd-sourced contribution and large-scale collaboration has exploded and, as a result, scientific discovery has been further enabled. The contributions of enormous open data sets, liberal licensing policies and innovative technologies for mining and linking these data has given rise to platforms that are beginning to deliver on the promise of semantic technologies and nanopublications, facilitated by the unprecedented computational resources available today, especially the increasing capabilities of handheld devices. The speaker will provide an overview of his experiences in developing a crowdsourced platform for chemists allowing for data deposition, annotation and validation. The challenges of mapping chemical and pharmacological data, especially in regards to data quality, will be discussed. The promise of distributed participation in data analysis is already in place.
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by... (Spark Summit)
This document discusses using Apache Spark to assemble metagenomes from short read sequencing data. Metagenomes are genomes from microbial communities containing many species. Spark provides an efficient and scalable approach compared to previous methods. The document demonstrates clustering reads from small test datasets in Spark and evaluates performance on real datasets ranging from 20GB to failures at 100GB. While Spark is easy to develop for and efficient, challenges remain in robustness at large scales and optimizing for different problem complexities.
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust... (Sri Ambati)
Spark is a distributed computing framework that can handle large scale data processing. The Spark notebook provides an interactive environment for working with Spark. ADAM is a data format and API for genomic data on Spark that optimizes for large datasets. Sparkling Water integrates H2O machine learning with Spark to enable techniques like deep learning on genomic data in a distributed manner using the Spark notebook. Data scientists and developers can collaborate using these tools to access, manipulate, and analyze massive genomic datasets.
This document provides information about bioinformatics resources including databases of nucleotide and protein sequences. It discusses flat file databases like GenBank that store sequence data in plain text files and relational databases that improve data organization. Examples of popular biological databases are described, such as GenBank, EMBL, and DDBJ for nucleotide sequences and Swiss-Prot and TrEMBL for protein sequences. The document also covers sequence file formats, web tools for querying databases, and trace files used in sequence assembly.
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67... (GigaScience, BGI Hong Kong)
Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
This document discusses biological databases and bioinformatics. It begins by listing various related fields including biology, computer science, bioinformatics, statistics, and machine learning. It then describes different types of searches that can be performed in biological databases, including annotation searches, homology searches, pattern searches, and predictions. Finally, it mentions that databases can be used for comparisons, such as gene families and phylogenetic trees.
This document provides an overview of flat file databases and biological relational databases. It discusses flat file databases like RefSeq that store sequence data in plain text files. It describes common file formats like Genbank and EMBL. It also discusses the Trace Archive and how trace files are processed into consensus sequences using Phred and Phrap. Finally, it briefly introduces biological relational databases and references resources like Swiss-Prot and TrEMBL.
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk... (Spark Summit)
- ADAM is an open source, high performance, distributed library for genomic analysis that uses Spark and Scala. It defines schemas and interfaces for processing and analyzing large genomic datasets in a distributed manner.
- ADAM provides schemas for common genomic file formats like SAM/BAM, VCF, FASTA. It enables both batch and interactive analysis of genomic data while optimizing for performance and avoiding data lock-in issues.
- Benchmarking shows ADAM can process a 65x human genome dataset faster and cheaper than existing tools, enabling end-to-end analysis in under 2 hours on 1,024 cores.
Rethinking Data-Intensive Science Using Scalable Analytics Systems (fnothaft)
Presentation from SIGMOD 2015. With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson. Paper at http://dl.acm.org/citation.cfm?id=2742787.
Sharing massive data analysis: from provenance to linked experiment reports (Gaignard Alban)
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Making NumPy-style and Pandas-style code faster and able to run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for four years. This talk describes how Numba and Dask provide scaled Python today.
This document discusses dot plots and their use in bioinformatics. It begins by defining dot plots as a graphical representation that uses two sequences on orthogonal axes and plots dots where regions of similarity meet a given threshold within a window. Dot plots allow visualization of all structures in common between sequences or repeated/inverted structures within a sequence. The document provides an example dot plot creation script in Perl and discusses how to reduce noise in dot plots by increasing the window size or stringency. It notes common uses of dot plots like comparing genomic and cDNA sequences to predict exons. Finally, it provides some rules of thumb for effective dot plot analysis and lists available dot plot programs.
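The definition above translates almost directly into code. Here is a small Scala sketch (rather than the Perl of the original) of the window-and-threshold rule, where raising window or threshold is the stringency knob that reduces noise.

// A dot plot per the definition above: sequence a on the x-axis, sequence b
// on the y-axis; mark (i, j) when the windows starting at i and j match in
// at least `threshold` of `window` positions.
def dotPlot(a: String, b: String, window: Int, threshold: Int): Seq[(Int, Int)] =
  for {
    i <- 0 to a.length - window
    j <- 0 to b.length - window
    matches = (0 until window).count(k => a(i + k) == b(j + k))
    if matches >= threshold
  } yield (i, j)

// Example: dotPlot("ACGTACGT", "TACGTTAC", window = 4, threshold = 4)
// marks the diagonal run where "TACG"/"ACGT" windows align.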
The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.
SnappyData is open sourced here: https://github.com/SnappyDataInc/snappydata
We also have a deep technical paper here: http://www.snappydata.io/snappy-industrial
We can be easily contacted on Slack, Gitter and more: http://www.snappydata.io/about#contactus
The document discusses dot plots and their use in bioinformatics. It explains that dot plots are a graphical representation that uses two sequences as axes and plots dots where regions of similarity are found based on a given threshold and window size. Dot plots can be used to visualize all similarities and repeats within and between sequences. Reducing window size and increasing stringency can reduce noise in dot plots. Available programs for generating dot plots are also mentioned.
New Developments in H2O: April 2017 Edition (Sri Ambati)
H2O presentation at Trevor Hastie and Rob Tibshirani's Short Course on Statistical Learning & Data Mining IV: http://web.stanford.edu/~hastie/sldm.html
PDF and Keynote version of the presentation available here: https://github.com/h2oai/h2o-meetups/tree/master/2017_04_06_SLDM4_H2O_New_Developments
Johnny Miller – Cassandra + Spark = Awesome - NoSQL matters Barcelona 2014 (NoSQLmatters)
Johnny Miller – Cassandra + Spark = Awesome
This talk will discuss how Cassandra and Spark can work together to deliver real-time analytics. This is a technical discussion that will introduce attendees to the basic principles of Cassandra and Spark, why they work well together, and example use cases.
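A minimal sketch of that pairing using the DataStax spark-cassandra-connector; the keyspace, tables and columns are invented, and the target table is assumed to exist.

import com.datastax.spark.connector._

// Read a Cassandra table as an RDD, aggregate with Spark, write back.
val pageViews = sc.cassandraTable("analytics", "page_views")
val viewsPerPage = pageViews
  .map(row => (row.getString("url"), 1L))
  .reduceByKey(_ + _)
viewsPerPage.saveToCassandra("analytics", "page_view_counts", SomeColumns("url", "views"))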
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud (Databricks)
Efficient recommender systems are critical for the success of many industries, such as job recommendation, news recommendation, ecommerce, etc. This talk will illustrate how to build an efficient document recommender system by leveraging Natural Language Processing (NLP) and Deep Neural Networks (DNNs). The end-to-end flow of the document recommender system is built on AWS at scale, using Analytics Zoo for Spark and BigDL. The system first processes text-rich documents into embeddings by incorporating Global Vectors (GloVe), then trains a K-means model using native Spark APIs to cluster users into several groups. The system further trains a recommender model for each group and gives an ensemble prediction for each test record. By adopting the end-to-end pipeline of the Analytics Zoo solution, we saw about a 10% improvement in mean reciprocal rank and 6% in precision, respectively, compared to the search recommendations for a job recommendation study.
Speaker: Guoqiong Song
Strata San Jose 2016: Scalable Ensemble Learning with H2O (Sri Ambati)
This document discusses scalable ensemble learning using the H2O platform. It provides an overview of ensemble methods like bagging, boosting, and stacking. The stacking or Super Learner algorithm trains a "metalearner" to optimally combine the predictions from multiple "base learners". The H2O platform and its Ensemble package implement Super Learner and other ensemble methods for tasks like regression and classification. An R code demo is presented on training ensembles with H2O.
Graphs and Machine Learning have long been a focus for Franz Inc. and currently we are collaborating with a number of companies to deliver the ability to understand possible future events based on a company's internal as well as externally available data. By combining machine learning, semantic technologies, big data, graph databases and dynamic visualizations we will discuss the core components of a Cognitive Computing platform.
We discuss example Cognitive Computing platforms from Ecommerce, fraud detection and healthcare that combine structured/unstructured data, knowledge, linked open data, predictive analytics, and machine learning to enhance corporate decision making.
Use of Spark for proteomic scoring, Seattle presentation (lordjoe)
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
Azure Machine Learning: deep learning with Python, R, Spark and CNTK (Herman Wu)
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Similar to Fast Variant Calling with ADAM and avocado (20)
ADAM is an open source, high performance, distributed platform for genomic analysis that defines a data schema and layout on disk using Parquet and Avro, integrates with Spark's Scala and Java APIs, and provides a command line interface. ADAM achieves linear scalability out to 128 nodes for most tasks and provides a 2-4x performance improvement over other tools like GATK and samtools. The platform includes various tools like avocado for efficient local variant calling via de Bruijn graph reassembly of sequencing reads.
The document discusses ADAM, a new framework for scalable genomic data analysis. It aims to make genomic pipelines horizontally scalable by using a columnar data format and in-memory computing. This avoids disk I/O bottlenecks. The framework represents genomic data as schemas and stores data in Parquet for efficient column-based access. It has been shown to reduce genome analysis pipeline times from 100 hours to 1 hour by enabling analysis on large datasets in parallel across many nodes.
Reproducible Emulation of Analog Behavioral Models (fnothaft)
1) Analog behavioral models are abstracted using SystemVerilog real numbers to allow simulation in digital emulation environments with higher throughput.
2) Key challenges to emulating analog models include converting floating-point implementations to fixed-point and handling high sampling rates in filters.
3) The document describes techniques used by Broadcom to synthesize analog behavioral models for emulation, including pragmas for sensitivity analysis and parallelizing filters.
The document discusses genome assembly from sequencing reads. It describes how reads can be aligned to a reference genome if one is available, but for a new genome the reads must be assembled without a reference. Two main assembly approaches are described: overlap-layout-consensus, which builds an overlap graph, and de Bruijn graph assembly, which constructs a de Bruijn graph from k-mers. Both approaches aim to find contiguous sequences (contigs) from the reads but face challenges from computational complexity and sequencing errors in the reads.
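A toy version of the de Bruijn construction mentioned above: nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix; contigs then correspond to unbranched paths through the graph, and sequencing errors show up as low-multiplicity tips and bubbles.

// Build a toy de Bruijn graph as an edge multiset: each k-mer in a read
// adds one edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
def deBruijn(reads: Seq[String], k: Int): Map[(String, String), Int] =
  reads
    .flatMap(_.sliding(k).filter(_.length == k))
    .map(kmer => (kmer.take(k - 1), kmer.drop(1)))
    .groupBy(identity)
    .mapValues(_.size)
    .toMap

// Example: deBruijn(Seq("ATGGCG", "TGGCGT"), k = 3)
// yields edges such as ("AT","TG") -> 1 and ("TG","GG") -> 2.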
ADAM is an open source, scalable genome analysis platform developed by researchers at UC Berkeley and other institutions. It includes tools for processing, analyzing and accessing large genomic datasets using Apache Spark. ADAM provides efficient data formats, rich APIs, and scalable algorithms to allow genome analysis to be performed on clusters and clouds. The goal is to enable fast, distributed analysis of genomic data across platforms while enhancing data access and flexibility.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to
precisely delineate tumor boundaries from magnetic resonance imaging (MRI)
scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating
the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The
model is rigorously trained and evaluated, exhibiting remarkable performance
metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted
IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of
our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical
image analysis and enhance healthcare outcomes. This research paves the way
for future exploration and optimization of advanced CNN models in medical
imaging, emphasizing addressing false positives and resource efficiency.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMHODECEDSIET
Time Division Multiplexing (TDM) is a method of transmitting multiple signals over a single communication channel by dividing the signal into many segments, each having a very short duration of time. These time slots are then allocated to different data streams, allowing multiple signals to share the same transmission medium efficiently. TDM is widely used in telecommunications and data communication systems.
### How TDM Works
1. **Time Slots Allocation**: The core principle of TDM is to assign distinct time slots to each signal. During each time slot, the respective signal is transmitted, and then the process repeats cyclically. For example, if there are four signals to be transmitted, the TDM cycle will divide time into four slots, each assigned to one signal.
2. **Synchronization**: Synchronization is crucial in TDM systems to ensure that the signals are correctly aligned with their respective time slots. Both the transmitter and receiver must be synchronized to avoid any overlap or loss of data. This synchronization is typically maintained by a clock signal that ensures time slots are accurately aligned.
3. **Frame Structure**: TDM data is organized into frames, where each frame consists of a set of time slots. Each frame is repeated at regular intervals, ensuring continuous transmission of data streams. The frame structure helps in managing the data streams and maintaining the synchronization between the transmitter and receiver.
4. **Multiplexer and Demultiplexer**: At the transmitting end, a multiplexer combines multiple input signals into a single composite signal by assigning each signal to a specific time slot. At the receiving end, a demultiplexer separates the composite signal back into individual signals based on their respective time slots.
### Types of TDM
1. **Synchronous TDM**: In synchronous TDM, time slots are pre-assigned to each signal, regardless of whether the signal has data to transmit or not. This can lead to inefficiencies if some time slots remain empty due to the absence of data.
2. **Asynchronous TDM (or Statistical TDM)**: Asynchronous TDM addresses the inefficiencies of synchronous TDM by allocating time slots dynamically based on the presence of data. Time slots are assigned only when there is data to transmit, which optimizes the use of the communication channel.
### Applications of TDM
- **Telecommunications**: TDM is extensively used in telecommunication systems, such as in T1 and E1 lines, where multiple telephone calls are transmitted over a single line by assigning each call to a specific time slot.
- **Digital Audio and Video Broadcasting**: TDM is used in broadcasting systems to transmit multiple audio or video streams over a single channel, ensuring efficient use of bandwidth.
- **Computer Networks**: TDM is used in network protocols and systems to manage the transmission of data from multiple sources over a single network medium.
### Advantages of TDM
- **Efficient Use of Bandwidth**: TDM allows multiple signals to share a single communication channel, making efficient use of the available bandwidth.
1. Fast Variant Calling with ADAM and avocado
Frank Austin Nothaft, UC Berkeley AMPLab
fnothaft@berkeley.edu, @fnothaft
2/19/2015
2. Data Intensive Genomics
• New population-scale experiments will sequence 10-100k samples
• 100k samples @ 60x WGS will generate ~20PB of read data and ~300TB of genotype data
• End-to-end pipeline latency is important to clinical work
• We want to jointly analyze samples to uncover low frequency variations
3. How can we improve analysis productivity?
• Flat file formats sacrifice interoperability but do not improve performance
• Common sort order invariants imposed by tools compromise correctness
• Genomics APIs tend to be at a lower level of abstraction, which compromises productivity
4. Our building block: ADAM
• ADAM is an open source, high performance, distributed platform for genomic analysis
• ADAM defines a:
1. Data schema and layout on disk*
2. Programming interface for distributed processing of genomic data**
3. Command line interface
* Via Parquet and Avro
** Work on Python integration is underway
5. ADAM’s guiding principle: use a schema as a narrow waist

[Stack diagram: the layers of the ADAM stack, from application down to physical storage.]

| Layer | Implemented with | Role |
|---|---|---|
| Application (Transformations) | Variant calling & analysis, RNA-seq analysis, etc. | Users define analyses via transformations |
| Presentation (Enriched Models) | Enriched Read/Variant | Enriched models provide convenient methods on common models |
| Schema (Data Models) | Avro schema for reads, variants, and genotypes | Schemas define the logical structure of basic genomic objects |
| Evidence Access (MapReduce/DBMS) | Spark, Spark-SQL, Hadoop | The evidence access layer efficiently executes transformations |
| Materialized Data (Columnar Storage) | Load data from Parquet and legacy formats | Common interfaces map the logical schema to bytes on disk |
| Data Distribution (Parallel FS) | HDFS, Tachyon, HPC file systems, S3 | The parallel file system layer coordinates distribution of data |
| Physical Storage (Attached Storage) | Disk, SSD, block store, memory cache | Decoupling storage enables a performance/cost tradeoff |
6. Data Format
• Schema can be updated without breaking backwards compatibility
• Normalize metadata fields into the schema for O(1) metadata access
• Models are “dumb”; enhance as necessary with rich objects
record AlignmentRecord {
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, int } mapq = null;
union { null, string } readName = null;
union { null, string } sequence = null;
union { null, string } mateReference = null;
union { null, long } mateAlignmentStart = null;
union { null, string } cigar = null;
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { int, null } basesTrimmedFromStart = 0;
union { int, null } basesTrimmedFromEnd = 0;
union { boolean, null } readPaired = false;
union { boolean, null } properPair = false;
union { boolean, null } readMapped = false;
union { boolean, null } mateMapped = false;
union { boolean, null } firstOfPair = false;
union { boolean, null } secondOfPair = false;
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { boolean, null } readNegativeStrand = false;
union { boolean, null } mateNegativeStrand = false;
union { boolean, null } primaryAlignment = false;
union { boolean, null } secondaryAlignment = false;
union { boolean, null } supplementaryAlignment = false;
union { null, string } mismatchingPositions = null;
union { null, string } origQual = null;
union { null, string } attributes = null;
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;
union { null, Contig } mateContig = null;
}
Schemas at https://www.github.com/bigdatagenomics/bdg-formats
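For a feel of how these Avro schemas are used downstream, here is a minimal, hypothetical Scala sketch using the builder that Avro generates for this record in bdg-formats; the field values are illustrative, and any field left unset keeps the schema defaults shown above.

```scala
import org.bdgenomics.formats.avro.AlignmentRecord

// Build a record via the Avro-generated builder; unset fields keep
// their schema defaults (null / false / 0).
val read: AlignmentRecord = AlignmentRecord.newBuilder()
  .setReadName("read_001")     // illustrative values throughout
  .setSequence("ACACTGCACT")
  .setQual("IIIIIIIIII")
  .setStart(10000L)
  .setEnd(10010L)
  .setMapq(60)
  .setCigar("10M")
  .setReadMapped(true)
  .build()
```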
7. Parquet
• ASF Incubator project, based on Google Dremel
• High performance columnar store with support for projections and push-down predicates
• Short read data stored in Parquet achieves a 25% improvement in size over compressed BAM
Image from Parquet format definition: https://www.github.com/apache/incubator-parquet-format
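To see what projections and push-down predicates look like in practice, here is a sketch using current Spark SQL DataFrame APIs (not the 2015-era ADAM calls); the path, column names, and threshold are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()

// Projection: only the selected columns are read from disk.
// Predicate push-down: the mapq filter is evaluated inside the Parquet reader,
// so rows failing it are never materialized.
val highQuality = spark.read
  .parquet("hdfs:///data/reads.adam")
  .select("readName", "start", "sequence", "mapq")
  .where("mapq >= 30")

highQuality.show(5)
```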
8. Backwards Compatibility
• Short reads: compatible with SAM, BAM, FASTQ
• Convert on read and write
• Working on CRAM support
• Variant, genotype, and variant annotation schemas can convert to/from VCF
• Support a wide variety of genomic annotation formats (e.g., GTF, BED, narrowPeak)
9. ADAM’s API Design
• ADAM is built on top of Apache Spark, which provides the RDD abstraction -> distributed arrays
• Common primitives include:
• Aggregates: BQSR, Indel Realignment
• Bucketing: Duplicate Marking, Concordance
• Region Joins: Variant Calling and Filtration
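As a concrete (and deliberately naive) illustration of the region join primitive, the sketch below overlap-joins reads against a broadcast list of target regions. The record types are hypothetical simplifications; production region joins in ADAM are partitioned to avoid examining every pair, so treat this purely as a sketch of the semantics.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical, simplified records; ADAM's real types carry far more fields.
case class Region(contig: String, start: Long, end: Long)
case class Read(contig: String, start: Long, end: Long, name: String)

object RegionJoinSketch {
  // Half-open interval overlap on the same contig.
  def overlaps(r: Region, read: Read): Boolean =
    r.contig == read.contig && r.start < read.end && read.start < r.end

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("region-join").setMaster("local[*]"))

    val targets = sc.parallelize(Seq(Region("chr1", 100L, 200L)))
    val reads = sc.parallelize(Seq(
      Read("chr1", 150L, 160L, "overlapping"),
      Read("chr1", 500L, 510L, "disjoint")))

    // Naive overlap join: broadcast the (small) target list, filter reads against it.
    val targetList = sc.broadcast(targets.collect())
    val joined = reads.flatMap { read =>
      targetList.value.filter(t => overlaps(t, read)).map(t => (t, read))
    }

    joined.collect().foreach(println) // keeps only the overlapping read
    sc.stop()
  }
}
```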
10. ADAM’s Performance
• Achieve linear scalability out to 128 nodes for most tasks
• 2-4x improvement over {GATK, samtools, Picard} on a single node
Analysis run using Amazon EC2; single node was hs1.8xlarge, cluster was m2.4xlarge
Scripts available at https://www.github.com/fnothaft/bdg-recipes.git, “sigmod” branch
11. ADAM: Implementation
• 27k LOC (94% Scala)
• Apache 2 licensed OSS
• 33 contributors across 12 institutions
12. BDG: ADAM’s Ecosystem
• ADAM: Core API + CLIs
• bdg-formats: Data schemas
• RNAdam: RNA analysis on ADAM
• avocado: Distributed local assembler
• PacMin: Long read assembly
• eggo: Datasets
13. Downstream focus: Genome Resequencing
• We’re working on two approaches:
• avocado: find variants via local reassembly
• PacMin: use long reads to find variants via de novo assembly
• We’ll focus on avocado today
14. What are the challenges?
• For accurate INDEL discovery, we want to reassemble variants, but reassembly is expensive
• We need to statistically integrate over a large collection of samples to discover low frequency variants
• The reference genome is not always representative
15. avocado performs efficient de Bruijn reassembly
[Diagram: the read ACACTGCACT is decomposed into its overlapping 3-mers (ACA, CAC, ACT, CTG, TGC, GCA, CAC, ACT); repeated k-mers collapse into single nodes, yielding a graph with the cycle ACT -> CTG -> TGC -> GCA -> CAC.]
• Several high accuracy variant callers (GATK, Platypus, Scalpel) reassemble reads aligned at genomic regions
• Typically use a de Bruijn graph: nodes are k-mers, and edges represent observed transitions between k-mers
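A minimal sketch of the graph construction (not avocado's actual implementation): decompose each read into k-mers and count the observed transitions between adjacent k-mers, with k = 3 matching the toy example above.

```scala
object DeBruijnSketch {
  // Count observed transitions between adjacent k-mers across all reads.
  def buildGraph(reads: Seq[String], k: Int): Map[(String, String), Int] =
    reads
      .flatMap { read =>
        val kmers = read.sliding(k).toSeq  // all k-mers, in read order
        kmers.zip(kmers.tail)              // adjacent pairs = graph edges
      }
      .groupBy(identity)
      .map { case (edge, occurrences) => edge -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val graph = buildGraph(Seq("ACACTGCACT"), k = 3)
    graph.foreach { case ((from, to), count) => println(s"$from -> $to ($count)") }
    // The repeated CAC -> ACT transition gets count 2; repeated k-mers
    // collapse into a single node, producing the cycle shown above.
  }
}
```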
16. Efficient Local Reassembly
• Current methods elaborate all paths through the graph, perform O(hn) realignments at O(l_r·l_h) cost each, and score O(h²) haplotype pairs
• Instead, identify “bubbles” and emit statistics directly from the graph:
• Eliminate expensive realignment!
• Variant alleles are provably canonical.
[Diagram: the graph ACA -> CAC -> ACT splits into a reference path CTG -> TGC -> GCA and an alternate path CTT -> TTC -> TCA; reference allele CTGA, bubble allele CTTA.]
h: number of haplotypes (paths), n: number of reads, l_r: read length, l_h: haplotype length
Proofs that alleles are canonical are too long for slides; will gladly share offline.
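A toy sketch of bubble detection, under the simplifying assumption that every branch point has exactly two outgoing linear paths that reconverge (no claim about avocado's internals):

```scala
object BubbleSketch {
  type Graph = Map[String, List[String]] // node -> successors

  // Walk a linear path from `start` until a branch, merge, or dead end.
  private def walk(graph: Graph, start: String): List[String] = {
    val path = scala.collection.mutable.ListBuffer(start)
    var node = start
    while (graph.getOrElse(node, Nil).size == 1) {
      node = graph(node).head
      path += node
    }
    path.toList
  }

  // A "bubble" is a branch point whose two outgoing walks reconverge.
  def findBubbles(graph: Graph): Seq[(List[String], List[String])] =
    graph.toSeq
      .collect { case (_, List(a, b)) => (walk(graph, a), walk(graph, b)) }
      .filter { case (pa, pb) => pa.last == pb.last }

  def main(args: Array[String]): Unit = {
    // Toy graph mirroring the slide: ACT branches into reference and alternate paths.
    val graph: Graph = Map(
      "ACA" -> List("CAC"), "CAC" -> List("ACT"),
      "ACT" -> List("CTG", "CTT"),
      "CTG" -> List("TGC"), "TGC" -> List("GCA"), "GCA" -> List("END"),
      "CTT" -> List("TTC"), "TTC" -> List("TCA"), "TCA" -> List("END"))
    findBubbles(graph).foreach { case (ref, alt) =>
      println(s"bubble: ${ref.mkString("->")} vs ${alt.mkString("->")}")
    }
  }
}
```

Reading alleles straight off the two arms of the bubble is what lets statistics be emitted without realigning reads to each candidate haplotype.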
20. Genotyping
• Use a sliding “window” traversal of the genome to bucket sites
• Currently use a model based on the samtools mpileup genotype likelihood and EM algorithms
• Moving to a monoallelic “allele graph” model
[Diagram: a pileup of five reads across a ~10 bp window, differing at individual bases (e.g., C vs T, a single-base insertion), illustrating candidate sites within one window.]
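A minimal sketch of the window bucketing step with plain Spark, assuming fixed, non-overlapping windows and a hypothetical Site record (the real traversal is a sliding window over richer data):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WindowBucketing {
  case class Site(contig: String, pos: Long, allele: String) // hypothetical record

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("windows").setMaster("local[*]"))
    val windowSize = 1000L

    val sites = sc.parallelize(Seq(
      Site("chr1", 10L, "A"), Site("chr1", 980L, "C"), Site("chr1", 1500L, "T")))

    // Bucket each site by (contig, window index); genotyping then runs per bucket.
    val buckets = sites
      .keyBy(s => (s.contig, s.pos / windowSize))
      .groupByKey()

    buckets.collect().foreach { case (window, ss) => println(s"$window -> ${ss.toList}") }
    sc.stop()
  }
}
```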
21. Allele Graphs
• Edges of the graph define conditional probabilities
• Can efficiently marginalize probabilities over the graph via belief propagation, and exactly solve for the argmax
[Diagram: an allele graph in which the backbone ACACTCG -> TCTCA -> TCCACACT carries single-base alternate alleles (C/A, G/C) as parallel branches between segments.]
Notes:
X = copy number of this allele
Y = copy number of the preceding allele
k = number of reads observed
j = number of reads supporting the Y -> X transition
P_i = probability that read i supports the Y -> X transition
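The equation itself did not survive the transcript, so the following is a hypothetical reconstruction consistent with the listed symbols, not the slide's verbatim formula: if the k reads support the Y -> X transition independently with probabilities P_i, the likelihood that exactly j of them do so is a Poisson-binomial sum.

```latex
% Hypothetical reconstruction (not the slide's verbatim formula):
% probability that exactly j of k independent reads support Y -> X.
P(j \mid k, X, Y) = \sum_{\substack{S \subseteq \{1,\dots,k\} \\ |S| = j}}
  \prod_{i \in S} P_i \prod_{i \notin S} \left( 1 - P_i \right)
```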
22. Future Work
• When integrating over samples, we should cluster samples by similarity
• Working on “multi-region” assembly; will integrate alt references, “similar regions”
• Performance and accuracy evaluation on the Illumina Platinum pedigree and 1000 Genomes
23. Acknowledgements
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Ravi Pandya
• UC Santa Cruz: Benedict Paten, David Haussler
• And many other open source contributors, especially Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordoir
• Total of 27 contributors to ADAM/BDG from >12 institutions