This document describes implementing the Smith-Waterman algorithm for genomic sequence alignment in Spark. It motivates the problem of scaling gene alignment to large genomic datasets and describes testing the algorithm on reference datasets ranging from 49MB to 234MB, each containing from 1 to over 30,000 reference sequences. Key challenges included understanding the dynamic programming algorithm, adapting it from object-oriented Java to Spark's functional programming model, and distributing either the algorithm or the large reference datasets across a cluster.
Enabling Biobank-Scale Genomic Processing with Spark SQL (Databricks)
With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while being flexible to ad-hoc analysis, Databricks and Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
Lightning fast genomics with Spark, Adam and Scala (Andy Petrella)
We are at a time when biotechnology allows us to get a personal genome for $1,000. Tremendous progress has been made in DNA sequencing since the 70s, e.g. more samples per experiment and greater genomic coverage at higher speeds. The genomic analysis standards developed over the years weren't designed with scalability and adaptability in mind. In this talk, we'll present a game-changing technology in this area: ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and Parquet storage. We'll see how it can speed up a sequence reconstruction by a factor of 150.
How to use Impala query plan and profile to fix performance issues (Cloudera, Inc.)
Apache Impala is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, explains how Impala optimizes queries, and shows how to identify performance bottlenecks through the query plan and profile and how to drive Impala to its full potential.
Under the Hood of Alignment Algorithms for NGS Researchers (Golden Helix Inc)
Most NGS analysis is founded on a very simple and powerful principle: look only at the differences between your data and a reference genome of your species. Alignment algorithms are the workhorse of this approach and account for the vast majority of the compute time necessary in a secondary analysis workflow. In this webcast, Gabe Rudy covers the history of alignment algorithms for short-read, high-throughput sequencing data and the set of tools that represent the state of the art.
We will use the newly launched GenomeBrowse 2.0 visualization engine to review examples of different alignment artifacts, false-positive variant calls, and other alignment and variant meta-data.
What you can expect to learn:
- How all alignment algorithms are a trade-off of speed versus accuracy, and what those trade-offs can mean with your data.
- How the human reference sequence causes alignment artifacts, and how you can spot them.
- How BWA, BWA-MEM and BWA-SW differ.
- How local re-alignment works to improve variant calling, and when you will see it and won't see it in action in your data.
- How to read a CIGAR string and other per-alignment data to investigate alignments at a particular locus.
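As background for that last point: the CIGAR field of the SAM/BAM format encodes how a read aligns against the reference as run-length (length, operation) pairs, per the SAM specification. A minimal decoder (plain Python; a sketch, not part of the webcast) might look like this:

```python
import re

# CIGAR operations per the SAM specification:
# M/=/X consume read and reference, I/S consume read only,
# D/N consume reference only, H/P consume neither.
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def aligned_lengths(cigar):
    """Return (read bases used, reference bases spanned) for a CIGAR string."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len

# A 100bp read with 5 soft-clipped bases and a 2-base deletion:
print(parse_cigar("5S70M2D25M"))     # [(5, 'S'), (70, 'M'), (2, 'D'), (25, 'M')]
print(aligned_lengths("5S70M2D25M"))  # (100, 97)
```

Reading the pairs against the reference coordinates is what lets you investigate an alignment at a particular locus.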
This talk covers Kafka cluster sizing, instance type selections, scaling operations, replication throttling and more. Don’t forget to check out the Kafka-Kit repository.
https://www.youtube.com/watch?time_continue=2613&v=7uN-Vlf7W5E
Lucas Waye of TiVo talks about how the company uses Presto for SQL analytics. Meetup co-sponsored by Starburst (www.starburstdata.com) and Qubole (www.qubole.com).
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016 (MLconf)
Comparing TensorFlow NLP Options: word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank: Through code samples and demos, we'll compare the architectures and algorithms of the various TensorFlow NLP options. We'll explore both feed-forward and recurrent neural networks such as word2vec, GloVe, RNN/LSTM, SyntaxNet, and Penn Treebank using the latest TensorFlow libraries.
Liu R, Hu J. DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16.
2. Motivation
- What is your genomic sequence?
- Gene Therapy: personalised treatment based on your genomic sequence
- Many gene alignment algorithms available
- Not many distributed
The Problem
Where does this align?

Read:      ACTCA
Reference: ATGCAGAC

(The slide repeats the pair while sliding the read along the reference, then shows a gapped placement:)

Read:      ACTC_A
Reference: A_TGCAGAC
4. About the Data
Dataset
- Downloaded from the NCBI FTP site
- Total size = 26.5GB (out of approx. 200TB available)
- 518 files containing 12,321,160 mRNA sequences
- mean number of base pairs per sequence = 2,160
- median number of base pairs per sequence = 1,609
Input Files
- Randomly selected sequences from the dataset
16. Scaling up: Parallelization
Parallelize the algorithm, the reads, or the reference set
- Algorithm
- Dynamic programming algorithm
- Fills a matrix where each cell depends on 3 of its neighbouring cells (left, above, and upper-left diagonal)
- Reads
- reads are much shorter
- but all scores for each read must be stored at the reduce stage
- Reference Set
- much longer than the reads
- # references per file ranges from 1 to 73,030
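The matrix-fill structure described above can be sketched as follows (a single-machine Python sketch using the Match = 5 / Mismatch = -3 / Gap = -4 scores from the later slides; not the talk's actual Spark implementation):

```python
# Minimal Smith-Waterman fill: each cell H[i][j] depends on its
# upper-left (alignment), left (insertion), and upper (deletion) neighbours.
MATCH, MISMATCH, GAP = 5, -3, -4

def smith_waterman_matrix(read, ref):
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # first row/column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (MATCH if read[i-1] == ref[j-1] else MISMATCH)
            left = H[i][j-1] + GAP   # insertion (gap in the read)
            up   = H[i-1][j] + GAP   # deletion (gap in the reference)
            H[i][j] = max(diag, left, up, 0)  # local alignment floors at 0
    return H

# The read/reference pair from the "Problem" slide:
H = smith_waterman_matrix("ACTCA", "ATGCAGAC")
# The best local alignment score is the maximum cell in the matrix:
print(max(max(row) for row in H))  # → 12
```

The diagonal dependency chain is what makes the fill hard to parallelize naively: cell (i, j) cannot be computed before (i-1, j-1), (i-1, j), and (i, j-1), which is why the talk instead considers distributing the reads or the reference set.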
17. Challenges Encountered
- Understanding the algorithm
- Object-Oriented Java → Functional Spark
- OutOfMemory: Java heap space errors
- Distributing the algorithm
- Distributing either dataset (reads or references)
- ftp → S3
21. Let the following scores be:
Match = 5
Mismatch = -3
Gap = -4

Each cell takes the maximum of three moves, floored at 0 for local alignment:

score = max { alignment: diagonal + (match or mismatch),
              insertion: left + gap,
              deletion:  up + gap,
              0 }

Partially filled matrix (reference CGTGAATTCAT across the top; the first row and column are initialised to 0):

    -  C  G  T  G  A  A  T  T  C  A  T
-
G
A
C
T
T             13  9
A             14
C

The cells shown (13 and 9 in row T, 14 in row A) are the neighbours used on the next slide.
22. Computing the new cell, whose diagonal neighbour is 13:

    -  C  G  T  G  A  A  T  T  C  A  T
-
G
A
C
T
T             13  9
A             14

Alignment (diagonal):
Score = 13 + (match or mismatch)
      = 13 + 5
      = 18
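The worked example above can be checked with a small helper (plain Python; a sketch, not the talk's code — the cell positions are as shown on the slide):

```python
# Scores from slide 21.
MATCH, MISMATCH, GAP = 5, -3, -4

def cell_score(diag, up, left, read_base, ref_base):
    """Score one Smith-Waterman cell from its three neighbours."""
    alignment = diag + (MATCH if read_base == ref_base else MISMATCH)
    deletion = up + GAP     # gap in the read
    insertion = left + GAP  # gap in the reference
    return max(alignment, insertion, deletion, 0)

# Slide 22: diagonal = 13, up = 9, left = 14, and the bases match,
# so the alignment move (13 + 5) beats insertion (14 - 4) and deletion (9 - 4):
print(cell_score(diag=13, up=9, left=14, read_base="A", ref_base="A"))  # → 18
```

With a mismatch instead, the alignment move would score 13 - 3 = 10, tying the insertion move, so the cell would be 10 rather than 18.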