The document describes the Genome Analysis Toolkit (GATK), a MapReduce framework for analyzing large DNA sequencing datasets. The GATK aims to simplify the development of analysis tools for next-generation sequencing data by providing structured access to sequencing reads and reference context, and a plug-in model for writing analysis tools. It uses a MapReduce approach to divide data into independent chunks that can be processed in parallel. The document outlines the GATK workflow and concepts, and provides an example of a simple Bayesian genotyper implementation within the GATK framework.
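The Bayesian genotyper example lends itself to a compact sketch. The toy model below (all names hypothetical, and far simpler than GATK's actual walker API) scores the three diploid genotypes at one reference position from the pileup of read bases, which is essentially what the map step of such a tool computes:

```python
from math import log

def genotype_log_likelihoods(bases, ref, alt, err=0.01):
    """Log-likelihood of each diploid genotype given the bases piled
    up at one site (toy model, not GATK's actual implementation)."""
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    out = {}
    for name, (a1, a2) in genotypes.items():
        ll = 0.0
        for b in bases:
            # A read samples either chromosome with probability 1/2 and
            # reports the true allele with probability (1 - err).
            p1 = (1 - err) if b == a1 else err
            p2 = (1 - err) if b == a2 else err
            ll += log(0.5 * p1 + 0.5 * p2)
        out[name] = ll
    return out

def call_genotype(bases, ref, alt):
    """Return the maximum-likelihood genotype for one pileup."""
    ll = genotype_log_likelihoods(bases, ref, alt)
    return max(ll, key=ll.get)
```

In a MapReduce framing, `call_genotype` would run independently at each reference position, which is what makes the chunked, parallel traversal possible.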
This document discusses scalable data analysis using R. It introduces Revolution Analytics' RevoScaleR package, which provides scalability from small to huge data sets using chunking and parallel external memory algorithms. This allows R code and analysis to remain the same regardless of data size or compute environment. The package scales to multiple cores, computers, and clouds to enable analysis of large data sets.
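The chunking idea is easy to illustrate outside of R. The sketch below (plain Python, hypothetical names; RevoScaleR itself works very differently under the hood) shows the external-memory pattern: reduce each chunk to small sufficient statistics, then combine only the partials, so the same code works whether chunks come from memory, disk, or other nodes:

```python
def chunked_mean(chunks):
    """Global mean from per-chunk sufficient statistics.

    Each chunk is reduced independently to a tiny (count, sum)
    partial; only the partials are combined, so no chunk ever needs
    to be co-resident with another.
    """
    total_n, total_sum = 0, 0.0
    for chunk in chunks:           # chunks could be streamed from disk
        total_n += len(chunk)
        total_sum += sum(chunk)
    return total_sum / total_n
```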
This document discusses distributed database systems and distributed query processing. It begins with an introduction that notes the differences between distributed and centralized query processing, including considering the physical data distribution and communication costs during query optimization in distributed systems. The document then provides an overview of its contents, which include discussions of centralized query processing, the basics of distributed query processing, global query optimization, and a summary. It also gives examples of motivations for distributed query processing like low response times, high throughput, and efficient hardware usage.
This document summarizes and compares three high-level parallel processing models: Pig Latin, SCOPE, and Hive. It discusses how each aims to address the limitations of traditional approaches to large-scale data analysis by providing a high-level scripting language that is compiled into optimized parallel tasks. While the ideas are similar, there are differences in programming style, extensibility, data models, and optimization strategies. Overall, the models evaluate tradeoffs between flexibility, performance, and usability for large-scale data analysis.
Query Processing: Query Processing Problem, Layers of Query Processing. Query Processing in Centralized Systems: Parsing & Translation, Optimization, Code Generation, Example. Query Processing in Distributed Systems: Mapping Global Query to Local, Optimization.
RAMSES: Robust Analytic Models for Science at Extreme Scales (Ian Foster)
This document discusses the RAMSES project, which aims to develop a new science of end-to-end analytical performance modeling of science workflows in extreme-scale science environments. The RAMSES research agenda involves developing component and end-to-end models, tools to provide performance advice, data-driven estimation methods, automated experiments, and a performance database. The models will be evaluated using five challenge workflows: high-performance file transfer, diffuse scattering experimental data analysis, data-intensive distributed analytics, exascale application kernels, and in-situ analysis placement.
The document discusses Oracle system catalogs which contain metadata about database objects like tables and indexes. System catalogs allow accessing information through views with prefixes like USER, ALL, and DBA. Examples show how to query system catalog views to get information on tables, columns, indexes and views. Query optimization and evaluation are also covered, explaining how queries are parsed, an execution plan is generated, and the least cost plan is chosen.
Mahout is a machine learning library that provides implementations of common machine learning algorithms like recommender systems, clustering, and classification. It started as a subproject of Apache Lucene in 2008 and became a top-level Apache project in 2010. Mahout algorithms can run locally or on Hadoop for distributed processing of large datasets. Key areas Mahout supports include recommender systems, clustering algorithms like k-means, and classification using algorithms like naive Bayes.
Multi-data-types Interval Decision Diagrams for XACML Evaluation Engine (Canh Ngo)
The document describes research on improving the performance of XACML policy evaluation engines using multi-data-type interval decision diagrams (MIDDs). The researchers propose transforming XACML policies into MIDD representations called X-MIDDs to enable more efficient evaluation. X-MIDDs decompose attribute logic expressions and combine decision paths. Policies are then evaluated by traversing the X-MIDD graph and applying combining algorithms at nodes. Experiments show the approach handles complex XACML logic and achieves high performance evaluation.
This document presents a comparative evaluation of Galaxy and Ruffus-based scripting workflows for DNA sequencing analysis pipelines. The research aims to identify the optimal workflow system by implementing DNA-seq analysis pipelines in both Galaxy and Ruffus and benchmarking their performance. Literature on existing workflow systems is reviewed. The document outlines the research objectives, design, methodology, and requirements for the DNA-seq analysis pipeline use case. Preliminary results indicate pros and cons of each approach, with further analysis of performance metrics still needed.
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ... (Kyong-Ha Lee)
This document proposes HadoopXML, a system for efficiently processing massive XML data and multiple twig pattern queries in parallel using Hadoop. Key features of HadoopXML include:
1) It partitions large XML files and processes them in parallel across nodes while preserving structural information.
2) It simultaneously processes multiple twig pattern queries with a shared input scan, without needing separate MapReduce jobs for each query.
3) It enables query processing tasks to share input scans and intermediate results like path solutions, reducing redundant processing and I/O.
4) It provides load balancing to fairly distribute twig join operations across nodes.
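The shared-input-scan idea in point 2 can be sketched in a few lines. The toy below (hypothetical names; real HadoopXML matches twig patterns over partitioned XML, not Python predicates) answers several queries while reading the input exactly once, instead of launching one job per query:

```python
def shared_scan(records, queries):
    """Evaluate many queries in a single pass over the input.

    'queries' maps a query name to a predicate; every record is read
    once and routed to all queries it satisfies, mirroring
    HadoopXML's shared input scan for multiple twig queries.
    """
    results = {name: [] for name in queries}
    for rec in records:                      # the input is scanned exactly once
        for name, pred in queries.items():
            if pred(rec):
                results[name].append(rec)
    return results
```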
This document summarizes Spark, a fast and general engine for large-scale data processing. Spark addresses limitations of MapReduce by supporting efficient sharing of data across parallel operations in memory. Resilient distributed datasets (RDDs) allow data to persist across jobs for faster iterative algorithms and interactive queries. Spark provides APIs in Scala and Java for programming RDDs and a scheduler to optimize jobs. It integrates with existing Hadoop clusters and scales to petabytes of data.
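The caching behavior that distinguishes RDDs from plain MapReduce can be mimicked in miniature. The class below is a single-process toy, not Spark's API: transformations are lazy closures over a lineage, and `cache()` memoizes the materialized result so later operations reuse it instead of recomputing:

```python
class ToyRDD:
    """A toy, in-memory stand-in for Spark's RDD (illustrative only)."""

    def __init__(self, compute):
        self._compute = compute     # lineage: how to (re)build the data
        self._cached = None

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cached = self._compute()      # materialize once, keep in memory
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

# Usage: the squared values are computed once and shared by later jobs.
base = ToyRDD(lambda: list(range(10))).map(lambda x: x * x).cache()
evens = base.filter(lambda x: x % 2 == 0).collect()
```

Without `cache()`, every `collect()` would re-run the whole lineage, which is the recomputation cost that makes iterative algorithms slow on plain MapReduce.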
This document provides an overview of next generation analytics with YARN, Spark and GraphLab. It discusses how YARN addressed limitations of Hadoop 1.0 like scalability, locality awareness and shared cluster utilization. It also describes the Berkeley Data Analytics Stack (BDAS) which includes Spark, and how companies like Ooyala and Conviva use it for tasks like iterative machine learning. GraphLab is presented as ideal for processing natural graphs and the PowerGraph framework partitions such graphs for better parallelism. PMML is introduced as a standard for defining predictive models, and how a Naive Bayes model can be defined and scored using PMML with Spark and Storm.
Beyond Map/Reduce: Getting Creative With Parallel Processing (Ed Kohlwey)
While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN and NextGen Map/Reduce have been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, and Accumulo.
We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
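The Bulk-Synchronous-Parallel model mentioned above reduces to a simple loop of per-vertex compute, message exchange, and a global barrier. A single-process sketch (hypothetical names, no real parallelism) follows:

```python
def bsp_run(initial_states, compute, supersteps):
    """A minimal Bulk-Synchronous-Parallel loop (toy, single-process).

    Each superstep: every vertex computes from its own state and its
    inbox, emits messages, then a barrier delivers all messages
    before the next superstep begins.
    """
    states = dict(initial_states)
    inboxes = {v: [] for v in states}
    for _ in range(supersteps):
        outboxes = {v: [] for v in states}
        for v in states:
            states[v], msgs = compute(v, states[v], inboxes[v])
            for dest, m in msgs:
                outboxes[dest].append(m)
        inboxes = outboxes          # barrier: messages visible next superstep
    return states

# Usage: propagate the maximum value around a 3-node ring.
def ring_max(v, state, inbox):
    new = max([state] + inbox)
    return new, [((v + 1) % 3, new)]

final = bsp_run({0: 1, 1: 5, 2: 3}, ring_max, supersteps=3)
```

This compute/barrier structure is the model implemented, at scale, by systems such as Giraph and Golden Orb.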
IRJET: Review of Existing Methods in K-Means Clustering Algorithm (IRJET Journal)
The document reviews existing methods for the k-means clustering algorithm. It discusses how k-means clustering works and some of its limitations when dealing with large datasets, such as being dependent on the initial choice of centroids. It then proposes using Hadoop to overcome big data challenges and calculate preliminary centroids for k-means clustering in a distributed manner. Finally, it reviews different techniques that have been proposed in other research to improve k-means clustering, such as methods for selecting better initial centroids or determining the optimal number of clusters.
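One family of "better initial centroids" methods can be shown concretely. The deterministic farthest-first seeding below (a toy on 1-D points, not any specific paper's method) spreads the starting centroids across the data instead of picking them at random, reducing k-means' sensitivity to initialization:

```python
def farthest_first_init(points, k):
    """Deterministic 'farthest-first' seeding for k-means.

    Start from the first point, then repeatedly add the point whose
    squared distance to its nearest already-chosen centroid is
    largest, so the seeds cover the data's spread.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min((p - c) ** 2 for c in centroids))
        centroids.append(farthest)
    return centroids
```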
Implementation of p-PIC Algorithm in MapReduce to Handle Big Data (eSAT Publishing House)
This document presents an implementation of the p-PIC clustering algorithm using the MapReduce framework to handle big data. P-PIC is a parallel version of the Power Iteration Clustering (PIC) algorithm that is able to cluster large datasets in a distributed environment. The document first provides background on PIC and challenges with scaling to big data. It then describes how p-PIC addresses these challenges using MPI for parallelization. The design of implementing p-PIC within MapReduce is presented, including the map and reduce functions. Experimental results on synthetic datasets up to 100,000 records show that p-PIC using MapReduce has increased performance and scalability compared to the original p-PIC implementation using MPI.
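The core of PIC, which p-PIC parallelizes, fits in a few lines. The toy below (pure Python on tiny dense matrices; real implementations distribute the matrix-vector products via MPI or MapReduce) runs truncated power iteration on the row-normalized affinity matrix and thresholds the resulting 1-D embedding:

```python
def pic_embedding(affinity, iters=20):
    """Toy Power Iteration Clustering embedding.

    Seed a vector with node degrees, repeatedly apply the
    row-normalized affinity matrix, and renormalize; truncated
    iteration leaves well-separated clusters at distinct values.
    """
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    norm = [[a / d for a in row] for row, d in zip(affinity, degrees)]
    v = [d / sum(degrees) for d in degrees]          # degree-based seed
    for _ in range(iters):
        v = [sum(norm[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(abs(x) for x in v)
        v = [x / s for x in v]
    return v

def pic_cluster(affinity, iters=20):
    """Split nodes into two clusters by thresholding the embedding at its mean."""
    v = pic_embedding(affinity, iters)
    mean = sum(v) / len(v)
    return [0 if x < mean else 1 for x in v]
```

The parallel versions the paper compares distribute exactly the matrix-vector product inside the loop, which dominates the cost on large affinity matrices.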
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help protect against developing mental illness and improve symptoms for those who already suffer from conditions like anxiety and depression.
This excerpt tells of a soothsayer named Pak Belalang who is summoned by the king to determine the sex of a duckling. Pak Belalang explains that the way to tell a duckling's sex is to release it into the water: the one that dives first is female, while the one that dives later is male.
Cytoscape Web is an interactive, web-based network browser: a pared-down version of Cytoscape, the open source software platform for visualizing and analyzing molecular interaction networks. It allows users to visualize networks, perform basic operations such as filtering nodes and edges, and export images of the network. Performance depends on factors such as the number of elements in the network; networks with more than 2,000 elements are usually sluggish.
The document provides troubleshooting tips for when things go wrong with cameras: first reset the camera, try a different or new tape, and never remove a damaged tape by hand. It advises taking a camera to a repair shop if it suffers liquid, shock, or sand damage, and being honest about any self-induced faults when requesting help, listing all symptoms and intermittent faults.
The Intellectual Property Quagmire, or, The Perils of Libertarian Creationism (Stephan Kinsella)
"The Intellectual Property Quagmire, or, The Perils of Libertarian Creationism,” by Stephan Kinsella. Rothbard Memorial Lecture, Austrian Scholars Conference, Mar. 13 2008. Accompanying audio/video available at
http://www.stephankinsella.com/media/
The key points are:
- The BSE Sensex plunged 769 points or almost 4% to close at 18,598 while the NSE Nifty lost 234 points or 4.08% to close at 5,507 on Friday, August 16, 2013.
- Stocks fell sharply due to the steep decline in the rupee which hit a new low of 62.03 against the US dollar and fears that the US Federal Reserve will begin tapering its bond buying program.
- Measures by the RBI to curb capital outflows and speculation have had little effect in stemming the rupee's slide and dampened investor sentiment, leading to the sharp falls in the Indian stock markets.
Social media presents both opportunities and risks for companies. It allows new ways to interact with stakeholders through marketing and recruitment. However, it also risks sensitive information leaks and legal/IP issues. In-house counsel should understand new technologies and provide early legal advice to address reputational, security and compliance risks. Companies need social media policies and employee training to mitigate risks while leveraging opportunities.
Educational Model to Illustrate HIV Infection Cycle (kcmurphy3)
This is the final paper for my Fall 2009 design project. We worked with our client, Marge Sutinen, who teaches the social science course: Contemporary Issues in HIV/AIDS at the University of Wisconsin - Madison. She asked us to create an educational model that would illustrate to students the process of HIV attachment, replication, and lysis; as well as portraying how HIV is different from other viruses and what makes it more dangerous than a virus that your body can fight off itself.
The document outlines teacher development programs offered by the Educational Technologies and Teacher Training Institute (INTEF) in Madrid, Spain. It describes INTEF's face-to-face and online training courses, tools and resources for teachers, and research initiatives. Key programs include summer courses, MOOCs, online courses, the E-Twinning platform, and resources like eXeLearning and Procomún. INTEF also works on interoperability solutions, connectivity plans, and developing a digital competence framework for teachers.
The document reports on issues from the Canadian Water Summit discussing water management. It summarizes recent regulations and guidance from the US and Canada on companies disclosing environmental impacts and water usage. It also outlines voluntary reporting standards from the Carbon Disclosure Project and Global Reporting Initiative for companies to disclose their water usage, impacts, and conservation efforts. Finally, it mentions a water management plan by Suncor and the development of standardized integrated reporting guidelines.
Os netbooks são pequenos computadores portáteis que são mais baratos e leves que os notebooks tradicionais. Eles possuem menos memória e processamento do que os notebooks, mas são ideais para navegar na internet e usar aplicativos básicos. Esta aula irá ensinar sobre as características e usos dos netbooks.
How digital can deliver your business goals (Chris Woods)
Marketers know new technology encourages greater customer engagement – but can digital also help brands realise their core objectives?
Putting digital technology at the heart of your business can encourage harmonisation and break down silo thinking – but how is that achieved?
What methods can be employed to turn customers into communities and integrate businesses departments around your brand’s core principles?
This discussion will help you understand:
How to encourage internal change through new technology
How business aims can be met through focusing on digital
Why building and managing communities matters
Speakers:
- Matt Ballantine, Principal Evangelist, Microsoft
- Chris Woods, Head of Digital, Hanover
- Louis Georgiou, Owner and Director, Code Computerlove
>> View the webinar recording
http://www.themarketer.co.uk/knowledge-centre/marketing-transformation-how-digital-can-deliver-your-business-goals/
Azure Machine Learning: Deep Learning with Python, R, Spark, and CNTK (Herman Wu)
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
This deck was presented at the Spark meetup in Bangalore. The key idea behind the presentation was to focus on the limitations of Hadoop MapReduce and to introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Greg Hogan – To Petascale and Beyond: Apache Flink in the Clouds (Flink Forward)
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
The Performance of MapReduce: An In-depth Study (Kevin Tong)
This document summarizes a study that evaluated techniques for improving the performance of MapReduce-based systems. The study identified several bottlenecks including I/O mode, record parsing, and sorting. It implemented optimizations like direct I/O, mutable decoding, and fingerprint-based sorting. Benchmarking showed these optimizations improved performance by 2.5-3.5 times, bringing MapReduce performance closer to parallel database systems. The authors conclude the techniques are effective but note further work is needed to develop a complete queryable MapReduce framework.
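Fingerprint-based sorting is simple to demonstrate. The sketch below is illustrative, not the paper's implementation: it sorts by a cheap fixed-size CRC32 fingerprint and breaks ties with the full key. For MapReduce this is sufficient, since the grouping that feeds the reduce phase only needs equal keys to end up adjacent, not in lexicographic order:

```python
import zlib

def fingerprint_sort(keys):
    """Sort/group records by a cheap fingerprint of the key.

    Comparing fixed-size integers is much cheaper than comparing
    long string keys; the full key is consulted only to break
    fingerprint collisions.
    """
    return sorted(keys, key=lambda k: (zlib.crc32(k.encode()), k))
```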
PyCVF is a Python framework for computer vision that aims to improve upon traditional frameworks. It avoids archaic file formats, uses better software design inspired by other frameworks, and provides essential concepts like datatypes, databases, nodes, models, and applications to build computer vision systems in a more unified and performant way. The preliminary version provides wrappers for common computer vision libraries and tools to browse and analyze image databases.
This document discusses large scale computing with MapReduce. It provides background on the growth of digital data, noting that by 2020 there will be over 5,200 GB of data for every person on Earth. It introduces MapReduce as a programming model for processing large datasets in a distributed manner, describing the key aspects of Map and Reduce functions. Examples of MapReduce jobs are also provided, such as counting URL access frequencies and generating a reverse web link graph.
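The URL-access-frequency example can be written as the classic two-phase job. The single-process sketch below assumes a hypothetical log format in which the URL is the first whitespace-separated field; in a real deployment the framework would shuffle the map output across nodes between the two phases:

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map: emit a (url, 1) pair for every request in the access log."""
    for line in log_lines:
        url = line.split()[0]          # assumed log format: URL first
        yield url, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct URL."""
    counts = defaultdict(int)
    for url, n in pairs:
        counts[url] += n
    return dict(counts)
```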
Large Scale Machine Learning with Apache Spark – Cloudera, Inc.
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLLib, a library of machine learning algorithms for large data. The presentation will cover the state of MLLib and the details of some of the scalable algorithms it includes.
The document summarizes two use cases for Hadoop in biotech companies. The first case discusses a large biotech firm "N" that implemented Hadoop to improve their drug development workflow using next generation DNA sequencing. Hadoop reduced the workflow from 6 weeks to 2 days. The second case discusses challenges at another biotech firm "M" around scaling genomic data analysis and Hadoop's role in addressing those challenges through improved data ingestion, storage, querying and analysis capabilities.
Dynamically Optimizing Queries over Large Scale Data Platforms – INRIA-OAK
Enterprises are adopting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this talk, I will present new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.
Fedbench - A Benchmark Suite for Federated Semantic Data Processing – Peter Haase
(1) FedBench is a benchmark suite for evaluating federated semantic data processing systems.
(2) It includes parameterized benchmark drivers, a variety of RDF datasets and SPARQL queries, and an evaluation framework to measure system performance.
(3) An initial evaluation was conducted to demonstrate FedBench's flexibility in comparing centralized and federated query processing using different systems and scenarios.
MapReduce is a programming model for processing large datasets in a distributed system. It allows parallel processing of data across clusters of computers. A MapReduce program defines a map function that processes key-value pairs to generate intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The MapReduce framework handles parallelization of tasks, scheduling, input/output handling, and fault tolerance.
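As a hedged illustration of the model just described, here is a minimal single-machine word-count sketch in Python (the real framework distributes the map and reduce phases across a cluster; `map_fn`, `reduce_fn`, and `map_reduce` are illustrative names, not part of any MapReduce API):

```python
from itertools import groupby

# Map: emit intermediate (key, value) pairs for each input record.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# Reduce: merge all intermediate values associated with one key.
def reduce_fn(word, counts):
    return (word, sum(counts))

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase over every input record.
    intermediate = [kv for doc in inputs for kv in map_fn(doc)]
    # Shuffle/sort phase: group intermediate pairs by key.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one call per distinct key.
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=lambda kv: kv[0])]

print(map_reduce(["a b a", "b a"], map_fn, reduce_fn))  # [('a', 3), ('b', 2)]
```

In a distributed run, the framework would also partition the input, schedule map and reduce tasks on different machines, and re-execute failed tasks, exactly the responsibilities listed above.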
1. The document describes Glacier, a component library and compiler for implementing continuous queries on FPGAs.
2. Glacier includes common streaming operators as well as specialized building blocks for the FPGA context. It can implement a variety of streaming queries by composing these components.
3. The paper evaluates the performance of queries implemented on an FPGA using Glacier, finding they can process over 1 million tuples per second directly from the network interface.
Spark on Hadoop is highly scalable. Cloud computing is highly scalable. R, the extensible open-source data science software, is not, at least not on its own. But what happens when we combine Spark on Hadoop, cloud computing, and Microsoft R Server into a scalable data science platform? Imagine being able to explore, transform, and model data of any size from your favorite R environment. Now imagine deploying the resulting models, with just a few clicks, as a scalable, cloud-based web service API. In this session, Sascha Dittmann shows how you can use your R code, thousands of open-source R packages, and distributed implementations of the most popular machine learning algorithms to do exactly that. He demonstrates how to create an HDInsight Spark cluster including a Microsoft R Server cluster, and how to deploy the resulting model in SQL Server or as a Swagger-based API for application developers.
LDBC 8th TUC Meeting: Introduction and status update – LDBC council
The document summarizes an 8th Technical User Community meeting on the LDBC benchmark. It discusses:
1) The LDBC Organization which sponsors benchmarks and task forces to develop them.
2) The key elements of a benchmark - data/schema, workloads, performance metrics, and execution rules.
3) The Semantic Publishing Benchmark and Social Network Benchmark being developed to evaluate graph and RDF databases on industry workloads.
4) The workloads include interactive, business intelligence, and graph analytics to test different database capabilities.
5) Various database systems that can be evaluated using the benchmarks.
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget – Cloudera, Inc.
This document discusses YapMap, a visual search platform built on Hadoop and HBase. It summarizes how YapMap interfaces with HBase data, uses HBase as a data processing pipeline with checkpoints, and had to adjust schemas and migrate data as the system evolved. It also covers how YapMap constructs search indexes in shards based on HBase regions and stored indexes on HDFS. The document concludes with some lessons learned around optimizing HBase operations.
Revolution Analytics provides an advanced analytics platform called Revolution R Enterprise that allows users to leverage the open source R language for big data analytics. The presentation discusses how R can be used to extract value from large, complex datasets through data exploration, visualization, and predictive modeling. It also outlines best practices for implementing an advanced analytics stack and how Revolution R Enterprise optimizes R for distributed computing across multiple data platforms like Hadoop and databases. The key benefits of the Revolution R platform are that it makes R scalable for big data, provides an enterprise-ready environment, and allows organizations to leverage R's flexibility for analytics innovation.
This document provides an overview of the Apache Spark framework. It discusses how Spark allows distributed processing of large datasets across computer clusters using simple programming models. It also describes how Spark can scale from single servers to thousands of machines. Spark is designed to provide high availability by detecting and handling failures at the application layer. The document also summarizes Resilient Distributed Datasets (RDDs), which are Spark's fundamental data abstraction, and transformations and actions that can be performed on RDDs.
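A toy sketch of the RDD idea in plain Python (illustrative only; `ToyRDD` is a made-up class, not Spark's API): transformations are recorded lazily, and only an action such as `collect()` triggers evaluation of the whole pipeline.

```python
# Minimal sketch of Spark's RDD model: transformations build a lazy
# pipeline; an action forces evaluation. In real Spark the data would be
# partitioned across a cluster and recomputed from lineage on failure.

class ToyRDD:
    def __init__(self, data):
        self._data = data       # source data (one partition here)
        self._pipeline = []     # deferred transformations

    def map(self, fn):          # transformation: lazy, returns a new RDD
        rdd = ToyRDD(self._data)
        rdd._pipeline = self._pipeline + [("map", fn)]
        return rdd

    def filter(self, pred):     # transformation: lazy
        rdd = ToyRDD(self._data)
        rdd._pipeline = self._pipeline + [("filter", pred)]
        return rdd

    def collect(self):          # action: evaluates the recorded pipeline
        out = list(self._data)
        for kind, fn in self._pipeline:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

squares = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x > 1)
print(squares.collect())  # [4, 9, 16]
```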
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum... – Spark Summit
In this talk we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark ML and GraphFrames.
The Histogrammar package, a cross-platform suite of data aggregation primitives for making histograms, calculating descriptive statistics, and plotting in Scala, is introduced to enable interactive data analysis in the Spark REPL.
We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
In this session you will learn:
1. Meet MapReduce
2. Word Count Algorithm – Traditional approach
3. Traditional approach on a Distributed System
4. Traditional approach – Drawbacks
5. MapReduce Approach
6. Input & Output Forms of a MR program
7. Map, Shuffle & Sort, Reduce Phase
8. WordCount Code walkthrough
9. Workflow & Transformation of Data
10. Input Split & HDFS Block
11. Relation between Split & Block
12. Data locality Optimization
13. Speculative Execution
14. MR Flow with Single Reduce Task
15. MR flow with multiple Reducers
16. Input Format & Hierarchy
17. Output Format & Hierarchy
Scaling Application on High Performance Computing Clusters and Analysis of th... – Rusif Eyvazli
The document discusses techniques for scaling applications across computing nodes in high performance computing (HPC) clusters. It analyzes the performance of different computing nodes on various applications like BLASTX, HPL, and JAGS. Array job facilities are used to parallelize applications by dividing iterations into independent tasks assigned across nodes. Python programs are created to analyze system performance based on log files and produce plots showing differences in node performance on different applications. The plots help with preventative maintenance and capacity management of the HPC system.
The Microsoft Biology Foundation (MBF) is an open-source library of bioinformatics algorithms and services built on .NET. MBF provides modular and reusable code for tasks like genomics, sequencing, and analysis. It leverages existing Microsoft technologies and allows distribution of computations across platforms from local to cloud. The first version was released in June 2010. MBF is developed openly on CodePlex and aims to benefit both commercial and non-commercial users.
The document discusses using cloud-scale computing for genomic analysis. It provides timing and cost estimates for running a genomic analysis pipeline called Myrna on Amazon EC2 using different numbers of compute nodes. The analysis of 1.1 billion reads would take 4 hours and 20 minutes on 1 master and 10 worker nodes at a cost of $44, or 1 hour and 38 minutes on 1 master and 40 workers at a cost of $66. It also discusses strategies for running genomic tools on cloud infrastructure or single computers.
This document summarizes a study on the persistence and availability of bioinformatics web services. The study analyzed over 900 web services listed in the Nucleic Acids Research journal between 2003-2009. It found that 17% of the original web addresses were no longer reachable. More recent services had higher quality standards but 24% of authors said their services would not be maintained long-term. The document provides recommendations for web service authors to improve long-term availability, such as using persistent URLs, releasing source code, and planning for the future maintenance of the service.
The document describes MOLGENIS, an open-source software system that allows users to define data models and generate full-featured web applications and databases from those models. Key features include a graphical user interface, database integration, support for common data formats, and the ability to rapidly develop applications by editing simple domain-specific models. The system has been applied to build several genomic and biomedical databases.
The document provides an update on the EMBOSS European Molecular Biology Open Software Suite project. It discusses new features added in the latest release including support for next-generation sequencing formats, additional data sources, and integration of ontologies. The EMBOSS team continues to work on improving interfaces and providing support to other projects.
The document discusses Evoker, a visualization tool for genotype intensity data from genome-wide association studies (GWAS). It provides background on GWAS and highlights the importance of rigorous quality control procedures for GWAS to eliminate sources of false positives like poor quality DNA, population structure, and genotyping artifacts. The document then discusses Evoker's implementation and software features for visualizing quality control metrics and genotype intensity data to assist with quality control checks.
This document contains 6 repeated links to the website http://www.g-language.org/PathwayProjector. The links all point to a pathway projection tool on the G-language website that can be used to visualize biological pathways.
This document discusses establishing a national repository for microarray gene expression data using MOLGENIS and MAGE-TAB. The objectives are to populate the repository with well-annotated microarray experiments from over 6,500 biobank samples, share the software as a microarray database solution for all biobanks, and combine gene expression data with GWAS studies to create novel eQTL datasets for complex diseases. The repository was created using MOLGENIS and populated with over 12,000 curated experiments from GEO and ArrayExpress for testing purposes. Future work includes populating with local data, integrating analysis tools, and enabling data and tool sharing between local installations while maintaining privacy.
This document discusses using Python to access libraries implemented in R through Bioconductor. It provides background on both Bioconductor and popular Python libraries for bioinformatics. As an example, it shows how to run an edgeR analysis from Python to identify differentially expressed genes from microarray data, accessing the R code and edgeR package from Python. This allows leveraging powerful statistical methods from R while taking advantage of Python's scripting abilities.
The document discusses the history and operations of the Apache Software Foundation. It began in 1995 with 8 developers working on the Apache HTTP Server. It is now a large organization with over 2,500 committers across 70+ projects. The ASF operates under an open governance model called "The Apache Way" which emphasizes merit-based consensus decision making. It also discusses how the ASF scales its operations through project oversight, incubating new projects, and community education programs like mentoring.
This document describes IPRStats, a visualization tool for InterProScan results. IPRStats allows users to view summaries and charts of protein domain annotations from InterProScan. It imports InterProScan XML files, generates statistics and taxonomy summaries, and exports results as HTML or Excel files. IPRStats uses a wxPython GUI, SQLite or PyTables for data storage, and generates pie charts, bar graphs and other visualizations of the annotation data.
The document summarizes updates to BioPerl, an open source Perl package for biological research. It discusses addressing new bioinformatics problems through collaborations, using modern Perl features to lower the barrier for new users, and potential approaches for BioPerl 2.0, including using Moose and preparing for Perl 6. The core of BioPerl provides classes for biological sequences, sequence I/O and features.
This document discusses the challenges of open source biological software projects including community engagement, integration with other tools, and increasing accessibility (democratization). It provides examples of how the Biopython project addresses these challenges such as through the Google Summer of Code program, improving documentation, and leveraging cloud computing resources to more easily distribute and access data and tools.
BioRuby is a bioinformatics library for the Ruby programming language. It provides object-oriented tools for tasks like sequence analysis, format conversion, running bioinformatics tools, and working with biological data. The latest version added features like improved support for phylogenetic XML (PhyloXML), next-generation sequencing FASTQ format reading/writing, and a REST API wrapper for the NCBI database. BioRuby development follows agile principles and its large developer community contributes new code frequently on GitHub. The project aims to improve integration with R and data visualization while maintaining a stable core.
This document discusses BioPython modules for handling RNA sequences containing modified nucleosides. There are 115 known post-transcriptionally modified nucleosides in RNA and several nomenclature schemes exist. The solution involves cloning a branch of the BioPython repository containing an RNA alphabet with modified nucleotides and using it to represent sequences containing modifications like 2-O-methyloadenosine. Example applications presented are ModeRNA for RNA structure modeling and CompaRNA for benchmarking RNA structure prediction methods, both of which use open source tools including BioPython.
Bio.Phylo is a new phylogenetics library in Biopython for exploring, modifying, annotating, reading, writing, and visualizing trees and for connecting computational pipelines. It supports common file formats like Newick and Nexus and can read/write the XML-based PhyloXML format which allows for annotations. The demo shows how to read a Newick tree, inspect it, draw it, promote it to PhyloXML to add branch colors, and write it out.
Archaeopteryx is a tool for visualizing and analyzing evolutionary trees. It is based on ATV and built using the open source Forester framework. Archaeopteryx allows users to visualize large trees with over 20,000 nodes. It supports various file formats and can access online databases. Key features include zooming, duplication inference tools, and editing trees. An example biological study analyzed functional profiles of genomes using Forester, phyloXML, and Archaeopteryx.
The document discusses the transition from BioMoby to SADI as a framework for semantic web services. It provides statistics on BioMoby usage and describes demonstrations of complex queries being answered through SADI and SHARE without a centralized database. The demonstrations include finding pathways for a protein and lab results for transplant patients. It advocates for SADI to support the scientific method and personal hypotheses through distributed ontologies rather than centralized ones.
ONTO-Toolkit is a collection of tools within the Galaxy framework that enables bio-ontology engineering using OBO file format ontologies. It includes wrappers for functions from the ONTO-PERL API to retrieve ontology terms and substructures. Two use cases are demonstrated: 1) identifying common ancestor terms between two molecular functions, and 2) finding the intersection between sub-ontologies for two biological processes to investigate overlap. The toolkit provides rich ontology-driven solutions for biologists within Galaxy.
This document provides an overview of the Hadoop/MapReduce/HBase framework and its applications in bioinformatics. It discusses Hadoop and its components, how MapReduce programs work, HBase which enables random access to Hadoop data, related projects like Pig and Hive, and examples of applications in bioinformatics and benchmarking of these systems.
1. The Genome Analysis Toolkit
A MapReduce framework for analyzing next-generation DNA sequencing data
Matt Hanna and Mark DePristo
Genome Sequencing and Analysis Group
Medical and Population Genetics Program
Broad Institute of Harvard and MIT
2. The Genome Analysis Toolkit
Agenda
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
3. GATK: Overview and Concepts
Motivation
[Figure: coverage in the xMHC region of JPT individuals]
• Dataset size greatly increases analysis complexity.
• Implementation issues can prematurely terminate long-running jobs or introduce subtle bugs.
4. GATK: Overview
Simplifying the process of writing analysis tools for resequencing data
• The framework is designed to support most common paradigms of analysis algorithms
  – Provides structured access to reads in BAM format, reference context, as well as reference-associated metadata
• General-purpose
  – Optimized for ease of use and completeness of functionality within scope
• Efficient
  – Engineering investment on performance of critical data structures and manipulation routines
• Convenient
  – Structured plug-in model makes developing in Java against the framework relatively pain-free
5. GATK: Overview
The MapReduce design philosophy
[Diagram]
• Data elements: a, b, c, d, e
• X = f(x) is applied to each element, producing A, B, C, D, E; the operations are independent of each other.
• r(x, y, …, z) combines the results; R = r(A, r(B, …, E)) depends on all sites.
The result is:
• Map: function f applied to each element of the list
• Reduce: function r recursively reduced over each f(…)
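The map/reduce scheme on this slide can be sketched in a few lines of Python (illustrative only; `f` and `r` are stand-in functions, not part of the GATK, and `functools.reduce` folds from the left rather than with the slide's exact nesting):

```python
from functools import reduce

# Map: a per-element operation, independent across elements.
f = lambda x: x * 2
# Reduce: a combining operation whose result depends on all elements.
r = lambda acc, y: acc + y

data = [1, 2, 3, 4, 5]            # a, b, c, d, e
mapped = [f(x) for x in data]     # A, B, C, D, E
R = reduce(r, mapped)             # R = r(r(r(r(A, B), C), D), E)
print(R)  # 30
```

Because each f(x) is independent, the map phase can run in parallel over data chunks, which is exactly what the GATK exploits.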
6. GATK: Overview
Rapid development of efficient and robust analysis tools
[Diagram] The Genome Analysis Toolkit (GATK) provides the boilerplate infrastructure code required to perform any NGS analysis: the traversal engine is provided by the framework, while the analysis tool is implemented by the user.
7. GATK: Workflow
Introduction
• GATK Overview and Concepts
• GATK Workflow
  – An example of one of the GATK's most common workflows
  – Data access pattern: by locus
  – Inputs: reads, reference, dbSNP
• Example: A Simple Bayesian Genotyper
8. GATK: Workflow
The sharding system: dividing data into processor-sized pieces
Inputs: reads, reference, dbSNP
• Divides data into small chunks that can be processed independently
• Handles extraction of subsets of data
• Groups small intervals together to avoid repetitive decompression
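The chunking idea can be pictured as cutting a genomic interval into fixed-size pieces. The sketch below uses a hypothetical `Interval` record and `shard` method, not the GATK's actual sharding classes:

```java
import java.util.ArrayList;
import java.util.List;

public class ShardSketch {
    // A half-open genomic interval [start, stop) on one contig.
    record Interval(String contig, int start, int stop) {}

    // Cut an interval into shards of at most shardSize bases each;
    // every shard can then be processed independently.
    static List<Interval> shard(Interval interval, int shardSize) {
        List<Interval> shards = new ArrayList<>();
        for (int pos = interval.start(); pos < interval.stop(); pos += shardSize) {
            shards.add(new Interval(interval.contig(), pos,
                                    Math.min(pos + shardSize, interval.stop())));
        }
        return shards;
    }

    public static void main(String[] args) {
        // 1000 bases in 300-base shards: 4 shards, the last one shorter.
        System.out.println(shard(new Interval("chr6", 0, 1000), 300).size()); // prints 4
    }
}
```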
9. GATK: Workflow
Traversal engines: preparing data for processing
Builds data structures easily consumed by the analysis tool.
10. GATK: Workflow
Interaction between sharding system and traversal engines
• Datasets are split into shards, which can be processed sequentially or in parallel.
• When processing sequentially, the reduce value of each shard is used to bootstrap the next shard.
• When processing in parallel, the result of each shard is computed independently and then "tree-reduced" together.
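For an associative reduce operation, combining shard results pairwise ("tree-reducing") gives the same answer as the sequential bootstrap. A small sketch with invented names, using integers to stand in for per-shard reduce values:

```java
import java.util.List;

public class TreeReduceSketch {
    // Sequential mode: each shard's reduce value bootstraps the next shard.
    static int sequentialReduce(List<Integer> shardResults) {
        int sum = 0;
        for (int r : shardResults) sum = sum + r;
        return sum;
    }

    // Parallel mode: shard results are computed independently and combined
    // pairwise; for an associative operation (like +) this matches the
    // sequential answer. Assumes a non-empty list.
    static int treeReduce(List<Integer> shardResults) {
        if (shardResults.size() == 1) return shardResults.get(0);
        int mid = shardResults.size() / 2;
        return treeReduce(shardResults.subList(0, mid))
             + treeReduce(shardResults.subList(mid, shardResults.size()));
    }

    public static void main(String[] args) {
        List<Integer> results = List.of(3, 1, 4, 1, 5);
        System.out.println(sequentialReduce(results)); // prints 14
        System.out.println(treeReduce(results));       // prints 14
    }
}
```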
11. GATK: Workflow
Walkers: Analyses written by end-users
[Figure: pileup of reads and the reference base at a single locus, with reference metadata tracks (dbSNP, exons), feeding into the analysis tool]
• Walkers (analyses) can easily be written by end users. The GATK is distributed with a significant library of walkers.
• Only the reads, reference, and reference metadata applicable to a single-base location are presented to the analysis tool.
• The GATK provides tools to filter the pileup automatically or on demand.
12. GATK: Workflow
Other data access patterns
Traversal Type | Description
Reads | Call map per read, along with the reference and reference-ordered metadata spanning that read.
Duplicates | Call map for each set of duplicate reads.
Read pair (naïve) | Call map for each read and its mate (naïve; requires the input BAM to be sorted in query-name order).

Straightforward (but not necessarily easy) to add any new access pattern involving streaming data.
13. GATK: Additional features
Additional inputs and outputs
Reference metadata
• Support for additional input data that is sorted in reference order can easily be added to the GATK.
• Input types can be added by creating two new classes: a feature (data access object) and a codec (parser).
• New file formats are indexed automatically.
• New data types are autodiscovered via a classpath search.
• Joint initiative with IGV.
Additional I/O
• Analysis parameters can be added to a walker by annotating a field in the walker with an @Argument annotation.
• Command-line argument types can become very sophisticated.
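The @Argument mechanism can be pictured as annotation-driven reflection over walker fields. The sketch below is a simplified stand-in, not the GATK's actual argument engine; the nested `Argument` annotation and `parse` helper here are hypothetical:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;

public class ArgumentSketch {
    // Hypothetical stand-in for the GATK's @Argument annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Argument { String shortName(); }

    static class Walker {
        @Argument(shortName = "LOD")
        double lodScore = 3.0; // default used when the argument is absent
    }

    // Scan the walker's fields for @Argument and fill the matching one
    // from a command-line name/value pair (doubles only, for brevity).
    static void parse(Object walker, String name, String value) {
        for (Field field : walker.getClass().getDeclaredFields()) {
            Argument arg = field.getAnnotation(Argument.class);
            if (arg != null && arg.shortName().equals(name)) {
                try {
                    field.setAccessible(true);
                    field.setDouble(walker, Double.parseDouble(value));
                } catch (IllegalAccessException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        Walker walker = new Walker();
        parse(walker, "LOD", "5.0");
        System.out.println(walker.lodScore); // prints 5.0
    }
}
```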
14. Walkers: Example
A simple Bayesian genotyper
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
– A functional genotyper in under 150 lines of code
– A minimal example: calls are much lower in quality than the UnifiedGenotyper
15. Walkers: Example
A simple Bayesian genotyper: the model
Bayesian model, with the independent base model for the data likelihood:
L(G | D) = P(G) · P(D | G),  P(D | G) = ∏_{b ∈ good bases} P(b | G)
Here P(G) is the prior for the genotype and P(D | G) is the likelihood of the data given the genotype.
• Likelihood of data computed using pileup of bases and associated quality scores at given locus
• Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
• L(G|D) computed for all 10 genotypes
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for a more complete approach
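Under the independent base model, the log10 likelihood of a diploid genotype given a pileup can be sketched as follows. This is a stand-alone simplification with invented names (and without the prior or the "good bases" filters), not the UnifiedGenotyper:

```java
public class GenotypeLikelihoodSketch {
    // log10 P(D | G) = sum over pileup bases of log10 P(b | G). For a diploid
    // genotype, each allele contributes (1 - eps) if it matches the observed
    // base and eps/3 otherwise, averaged over the two alleles.
    static double log10Likelihood(String genotype, byte[] bases, byte[] quals) {
        double log10 = 0.0;
        for (int i = 0; i < bases.length; i++) {
            double epsilon = Math.pow(10, quals[i] / -10.0); // de-Phred error prob
            double p = 0.0;
            for (char allele : genotype.toCharArray())
                p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
            log10 += Math.log10(p / genotype.length());
        }
        return log10;
    }

    public static void main(String[] args) {
        byte[] bases = {'A', 'A', 'A', 'C'}; // pileup: three A reads, one C read
        byte[] quals = {30, 30, 30, 30};     // all Q30 (error prob 0.001)
        for (String g : new String[] {"AA", "AC", "CC"})
            System.out.printf("%s: %.3f%n", g, log10Likelihood(g, bases, quals));
        // AC scores highest: the het genotype explains the mixed pileup best.
    }
}
```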
16. Walkers: Example
A simple Bayesian genotyper
• Walker specifies the data access pattern and declares command-line arguments.
• Inheritance defines traversal type.
• Annotation defines command-line argument.
public class GATKPaperGenotyper extends LocusWalker<Integer, Long> {
    @Argument(fullName = "log_odds_score",
              shortName = "LOD",
              doc = "The LOD threshold",
              required = false)
    private double LODScore = 3.0;
17. Walkers: Example
A simple Bayesian genotyper
• Walker prepares the input dataset.
• ReadBackedPileup utility can be used to filter the pileup on demand.
public Integer map(RefMetaDataTracker tracker,
                   ReferenceContext ref,
                   AlignmentContext context) {
    double[] likelihoods =
        DiploidGenotypePriors.getReferencePolarizedPrior(
            ref.getBase(),
            DiploidGenotypePriors.HUMAN_HETEROZYGOSITY,
            0.01);
    // get the bases and qualities from the pileup
    ReadBackedPileup pileup = context.getBasePileup()
        .getPileupWithoutMappingQualityZeroReads();
    byte[] bases = pileup.getBases();
    byte[] quals = pileup.getQuals();
    …
18. Walkers: Example
A simple Bayesian genotyper
• Calculate the likelihood for each possible genotype.
• Determine the best of the calculated genotypes.
for (GENOTYPE genotype : GENOTYPE.values()) {
    for (int index = 0; index < bases.length; index++) {
        // our epsilon is the de-Phred scored base quality
        double epsilon = Math.pow(10, quals[index] / -10.0);
        byte pileupBase = bases[index];
        double p = 0;
        for (char r : genotype.toString().toCharArray())
            p += r == pileupBase ? 1 - epsilon : epsilon / 3;
        likelihoods[genotype.ordinal()] += Math.log10(p / genotype.length());
    }
}
Integer[] sortedList = MathUtils.sortPermutation(likelihoods);
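The de-Phred conversion used in the loop above maps a quality score Q to the error probability 10^(-Q/10); a tiny stand-alone illustration:

```java
public class PhredSketch {
    // Convert a Phred-scaled quality score to the probability that
    // the base call is wrong: eps = 10^(-Q/10).
    static double phredToErrorProb(int q) {
        return Math.pow(10, q / -10.0);
    }

    public static void main(String[] args) {
        System.out.println(phredToErrorProb(10)); // ~0.1: a 1-in-10 error rate
        System.out.println(phredToErrorProb(20)); // ~0.01: 1-in-100
        System.out.println(phredToErrorProb(30)); // ~0.001: 1-in-1000
    }
}
```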
19. Walkers: Example
A simple Bayesian genotyper
• Conditionally output the results.
• Use reduce to calculate number of genotypes called.
• Writing to provided output stream is guaranteed to be
thread-safe.
    …
    if (lod > LODScore)
        out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(),
                   selectedGenotype, lod, (char) ref.getBase());
    return 1;
} // end of map() function

public Long reduce(Integer value, Long sum) {
    return value + sum;
}

public void onTraversalDone(Long result) {
    out.printf("Simple Genotyper genotyped %d loci.", result);
}
20. Walkers: Threading performance
A simple Bayesian genotyper
[Figure: GATK performance improves nearly linearly as processors are added]
21. Genome Analysis Toolkit
1000 Genomes Project
Pipeline: Initial alignment → MSA realignment → Q-score recalibration → Base error cluster modeling → Genotyping → SNP filtering
• Supports any BAM-compatible aligner
• All of these tools have been developed in the GATK
• They are memory and CPU efficient, cluster friendly, and easily parallelized
• They are now publicly available and are being used at many sites around the world
More info: http://www.broadinstitute.org/gsa/wiki/
Support: http://www.getsatisfaction.com/gsa/
22. Acknowledgments
Genome sequencing and analysis group (MPG): Kiran Garimella (Analysis Lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko
Broad postdocs, staff, and faculty: Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker
1000 Genomes Project (in general but notably): Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li
Copy number group: Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll
Cancer genome analysis: Kristian Cibulskis, Andrey Sivachenko, Gad Getz
Genome Sequencing Platform (in general but notably): Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom
Integrative Genomics Viewer (IGV): Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir
MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly