Finding Allelic Frequencies Using MapReduce/Hadoop

Mahmoud Parsian
Ph.D. in Computer Science
Senior Architect @ Illumina (www.illumina.com)
2014 Hadoop Summit
Amsterdam, Netherlands
April 3, 2014

Table of Contents
1 Biography
2 Overview
3 Basic Definitions
4 Source of Data for Allelic Frequency
5 Allelic Frequency Analysis
6 Allelic Frequency Using MapReduce/Hadoop
7 Running Allelic Frequency Analysis

Biography
Who am I?
Name: Mahmoud Parsian
Education: Ph.D. in Computer Science
Work: Senior Architect @Illumina, Inc.
  Lead Big Data Team @Illumina
  Develop scalable regression algorithms
  Develop DNA-Seq and RNA-Seq workflows
  Use Java/MapReduce/Hadoop/HBase
Author of two books:
  JDBC Recipes (Apress)
  JDBC MetaData Recipes (Apress)

Overview
Genetic variants in a patient's germline DNA are identified through next-generation sequencing technology.
Patient Sample −→ ... −→ VCF
The magnitude of this data makes it challenging to store and analyze:
  several million variants per patient
  several billion for groups of patients
The group comparison estimates allelic or genotypic frequency differences between groups for all variants present in any individual in the analysis cohort.
Fisher's Exact Test is used to determine whether the difference in frequency is statistically significant.
Goals:
  Find allelic frequencies (use MapReduce/Hadoop)
  Find top-100 p-values for two groups of variants (use MapReduce/Hadoop)

Basic Definitions
Some Basic Definitions
  Chromosome
  Bioset
  Bioset Record
  Allele
  Allelic Frequency

Chromosome
The term chromosome comes from the Greek words for color (chroma) and body (soma).
A chromosome is an organized structure of DNA, protein, and RNA found in cells.
Human cells have 23 pairs of chromosomes: 22 numbered pairs {1, 2, ..., 22} plus one pair of sex chromosomes (X, Y).
Humans have a total of 46 chromosomes.
How are chromosomes inherited? In humans, one copy of each chromosome is inherited from the female parent and the other from the male parent.

[Figure: Chromosome in Picture]

[Figure: Cells to DNA]

What is a Bioset?
Individually analyzed data signatures are referred to as "biosets".
Biosets encompass data in the form of experimental sample comparisons as well as genotype signatures.
A bioset is most commonly referred to as a "gene signature". A sample record of a bioset contains a chromosome, its start and stop positions, two alleles, and other related information.
A germline bioset can have 4.3 million records.
A patient may have any number of biosets.
Each bioset has a set of genes.

VCF to Bioset
Sample −→ FASTQ Data (1)
FASTQ Data −→ DNA-Seq
DNA-Seq −→ VCF (2)
VCF −→ Bioset
Bioset −→ Ready for Analysis
(1) FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
(2) VCF = Variant Call Format: the format of a text file used in bioinformatics for storing gene sequence variations.

Sample Record of a Bioset
A bioset can have 4.3 million records.
A sample record of a bioset contains:
  a chromosome (chromosomeID: 1, 2, 3, ...)
  a start position
  a stop position
  two alleles: Allele1, Allele2
  a genome reference
  other related information such as mutation class, ...

What is an Allele?
An allele is a viable DNA coding that occupies a given locus (position) on a chromosome. There are two alleles per chromosome position, called allele1 and allele2.
Allelic frequency is defined as "the percentage of a population of a species that carries a particular allele on a given chromosome locus."
Alternatively, "allele frequency" can be defined as the frequency of an allele relative to that of other alleles of the same gene in a population.
Fisher's Exact Test is used to calculate the p-value for allelic frequency.

Two Alleles: allele1, allele2
An allele is one of two or more versions of a gene. An individual inherits two alleles for each gene, one from each parent.

Source of Data for Allelic Frequency

VCF to Bioset
Sample → FASTQ Data → DNA-Seq → VCF → Bioset
Bioset Record Elements:
1. chromosomeID
2. startPosition
3. stopPosition
4. allele1
5. allele2
6. referenceGenome
7. mutationClass
...

Size of Data for Analysis
One Bioset = 4.3 million records
For Allelic frequency: form two groups: Group-A, Group-B
Keep two sets of the same data:
one set for Group-A
one set for Group-B
Group-A = 6,000 Biosets
Group-B = 9,000 Biosets
6,000 + 9,000 = 15,000
15,000 Total Biosets to analyze
15,000 x 4.3M = 64.5 Billion records

Allelic Frequency Analysis

Given
Group-A = set of biosets = {A1, A2, ..., An}
Group-B = set of biosets = {B1, B2, ..., Bm}
Find
Allelic Frequency for every chromosomeID, start, stop, allele
Find p-value for every chromosomeID, start, stop, allele
Find top-100 p-values

Allelic Frequency by Example
Group-A: 6 biosets
Bioset-ID Allele-1 Allele-2
1 A C
2 A A
3 A C
4 G G
5 A A
6 AC T

Allelic Frequency by Example...
Group-B: 5 biosets
Bioset-ID Allele-1 Allele-2
7 A A
8 C C
9 A C
10 A A
11 A A

Create Frequency Table for Group-A and Group-B:
Allele  Group-A Known  Group-A Others  Group-B Known  Group-B Others
A       6              6               7              3
C       2              10              3              7
G       2              10              0              10
AC      1              11              0              10
T       1              11              0              10

Create a Contingency Table for each Allele. For Allele A:
         Known  Others
Group-A  6      6
Group-B  7      3
Now we can apply Fisher's Exact Test or other tests for analysis...
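
The counting behind the frequency table can be sketched in a few lines of Java. This is only an illustration on the 11 example biosets, not the code used in the MapReduce job; the class name AlleleFrequencyTable and its two helper methods are hypothetical names introduced here. "Known" is the number of times an allele occurs in a group and "Others" is the remaining allele count for that group (each bioset contributes two alleles).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class (not from the original slides).
final class AlleleFrequencyTable {

    /** Counts how often each allele occurs in a group; every bioset contributes two alleles. */
    static Map<String, Integer> countAlleles(String[][] genotypes) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] pair : genotypes) {
            for (String allele : pair) {
                counts.merge(allele, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns {known, others} for one allele in one group. */
    static int[] knownAndOthers(Map<String, Integer> counts, String allele, int totalAlleles) {
        int known = counts.getOrDefault(allele, 0);
        return new int[] {known, totalAlleles - known};
    }

    public static void main(String[] args) {
        // Example data copied from the Group-A and Group-B tables.
        String[][] groupA = {{"A", "C"}, {"A", "A"}, {"A", "C"}, {"G", "G"}, {"A", "A"}, {"AC", "T"}};
        String[][] groupB = {{"A", "A"}, {"C", "C"}, {"A", "C"}, {"A", "A"}, {"A", "A"}};
        Map<String, Integer> countsA = countAlleles(groupA);
        Map<String, Integer> countsB = countAlleles(groupB);
        int totalA = 2 * groupA.length; // 12 alleles in Group-A
        int totalB = 2 * groupB.length; // 10 alleles in Group-B

        // One row per allele seen in either group.
        Set<String> alleles = new TreeSet<>(countsA.keySet());
        alleles.addAll(countsB.keySet());
        for (String allele : alleles) {
            int[] a = knownAndOthers(countsA, allele, totalA);
            int[] b = knownAndOthers(countsB, allele, totalB);
            System.out.printf("%-3s A: %d/%d  B: %d/%d%n", allele, a[0], a[1], b[0], b[1]);
        }
    }
}
```

For allele A the two {known, others} pairs are exactly the rows of the contingency table: (6, 6) for Group-A and (7, 3) for Group-B.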

Fisher’s Exact Test Using R
# R (version 2.15.1)
> mytable = rbind( c(6, 6), c(7, 3) );
> mytable
[,1] [,2]
[1,] 6 6
[2,] 7 3
> fisher.test(mytable)
Fisher’s Exact Test for Count Data
data: mytable
p-value = 0.4149

Fisher's Exact Test Definition
Note that a, b, c, d refer to the values of the 2 × 2 contingency table shown below:

               Known   Others   Row Totals
Group-A        a       b        a + b
Group-B        c       d        c + d
Column Totals  a + c   b + d    n = a + b + c + d

$$
p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}
$$
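
A minimal Java sketch of the test, assuming the usual two-sided convention: the formula above gives the probability of a single table, and the p-value sums that probability over every table with the same row and column totals that is no more likely than the observed one. The class name FisherExactSketch and its methods are hypothetical; the FishersExactTest routine used in the MapReduce job is not shown in these slides and may differ.

```java
// Hypothetical sketch (not the implementation used in the actual job).
final class FisherExactSketch {

    /** log(k!) for k = 0..n, built as cumulative sums of log(i). */
    static double[] logFactorials(int n) {
        double[] lf = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            lf[i] = lf[i - 1] + Math.log(i);
        }
        return lf;
    }

    /** Probability of one 2x2 table under fixed margins (the factorial formula). */
    static double tableProbability(int a, int b, int c, int d, double[] lf) {
        int n = a + b + c + d;
        return Math.exp(lf[a + b] + lf[c + d] + lf[a + c] + lf[b + d]
                - lf[a] - lf[b] - lf[c] - lf[d] - lf[n]);
    }

    /** Two-sided p-value: sum over all tables that are no more likely than the observed one. */
    static double twoSidedPValue(int a, int b, int c, int d) {
        int row1 = a + b;
        int row2 = c + d;
        int col1 = a + c;
        double[] lf = logFactorials(row1 + row2);
        double observed = tableProbability(a, b, c, d, lf);
        int lo = Math.max(0, col1 - row2); // smallest feasible top-left cell
        int hi = Math.min(row1, col1);     // largest feasible top-left cell
        double p = 0.0;
        for (int x = lo; x <= hi; x++) {
            double px = tableProbability(x, row1 - x, col1 - x, row2 - (col1 - x), lf);
            // small relative tolerance so that ties with the observed table are included
            if (px <= observed * (1 + 1e-7)) {
                p += px;
            }
        }
        return Math.min(1.0, p);
    }

    public static void main(String[] args) {
        // Contingency table for allele A: Group-A (6, 6), Group-B (7, 3)
        System.out.printf("p-value = %.4f%n", twoSidedPValue(6, 6, 7, 3));
    }
}
```

For the allele A table this gives 206360 / 497420, which rounds to the 0.4149 reported by R. Working with log-factorials keeps the computation stable for the large counts that appear with thousands of biosets.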

Allelic Frequency Using MapReduce/Hadoop

MapReduce PHASE-1: Eliminate Duplicate Bioset Records
MapReduce PHASE-2: Allelic Frequency using Fisher's Exact Test
MapReduce PHASE-3: Find Top-100

MapReduce PHASE-1: Eliminate Duplicate Records
Mapper:
  // key = chrID:start:stop:group:allele1:allele2:reference
  // group = {a, b}
  // value = mutationClass
  map(key, value) {
    emit(key, value);
  }
Reducer:
  // key = chrID:start:stop:group:allele1:allele2:reference
  // values = List<mutationClass>
  reduce(key, values) {
    maxMC = max(values); // max. mutationClass
    // chrID, start, stop, group, allele1, allele2, reference are parsed from key
    outputKey = chrID:start:stop
    outputValue = group:allele1:allele2:reference:maxMC
    emit(outputKey, outputValue);
  }
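
The effect of PHASE-1 can be simulated in memory without Hadoop. The sketch below is an illustration only: the class Phase1DedupSketch is a hypothetical name, mutationClass is assumed to be an integer (the slides only say that max() is taken over it), and the sample keys use a placeholder reference value "ref".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of PHASE-1 (shuffle plus reducer), not Hadoop code.
final class Phase1DedupSketch {

    /**
     * Each input row is {key, mutationClass} with
     * key = chrID:start:stop:group:allele1:allele2:reference.
     * Returns lines of the form "chrID:start:stop<TAB>group:allele1:allele2:reference:maxMC".
     */
    static List<String> dedup(List<String[]> rows) {
        // Shuffle step: group by key and keep the maximum mutationClass per key.
        // ASSUMPTION: mutationClass is an integer.
        Map<String, Integer> maxByKey = new TreeMap<>();
        for (String[] row : rows) {
            maxByKey.merge(row[0], Integer.parseInt(row[1]), Integer::max);
        }
        // Reduce step: re-key each surviving record by chrID:start:stop.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : maxByKey.entrySet()) {
            String[] f = e.getKey().split(":", 4); // chrID, start, stop, remainder of the key
            out.add(f[0] + ":" + f[1] + ":" + f[2] + "\t" + f[3] + ":" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1:100:100:a:A:C:ref", "3"});
        rows.add(new String[] {"1:100:100:a:A:C:ref", "8"}); // duplicate key
        rows.add(new String[] {"1:100:100:b:A:A:ref", "5"});
        for (String line : dedup(rows)) {
            System.out.println(line);
        }
    }
}
```

Because the output key drops group and alleles, all records for one chromosome position from both groups arrive at the same PHASE-2 reducer.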

MapReduce PHASE-2: Allelic Frequency using Fisher's Exact Test: Mapper
Mapper:
  // key = chrID:start:stop
  // group = {a, b}
  // value = group:allele1:allele2:reference:mutationClass
  map(key, value) {
    emit(key, value);
  }

MapReduce PHASE-2: Allelic Frequency using Fisher's Exact Test: Reducer
Reducer:
  // key = chrID:start:stop
  // values = List<group:allele1:allele2:reference:mutationClass>
  // group = {a, b}
  reduce(key, values) {
    setOfAlleles = all alleles in group A and group B;
    freqTableA = (allele, known, others);
    freqTableB = (allele, known, others);
    for (String allele : setOfAlleles) {
      // N11, N12 = known, others in group A; N21, N22 = known, others in group B
      contingencyTable = (allele, N11, N12, N21, N22);
      pvalue = FishersExactTest(contingencyTable);
      emit(pvalue, entireRecord);
    }
  }

MapReduce PHASE-3: Find Top-100
Now that we have:
  p-value:chrID:start:stop:allele
How can we find the top-100 p-values (those closest to 0.00)?
SQL solution:
  SELECT *
  FROM allele_frequency_table
  ORDER BY pvalue LIMIT 100;

top100() defined as:
  Let P = {p1, p2, ..., pn}
  Then top100(P) = {s1, s2, ..., s100}
  where si ∈ P, s1 ≤ s2 ≤ ... ≤ s100, and s100 ≤ p for every other p in P
NOTE: top100 for Allelic Frequency means: find the smallest p-values, which are closest to 0.00

Find Top-100 p-values
1 Mapper:
Each mapper finds its local top-100 p-values
and sends that top-100 list to the reducer.
We will use many mappers.
2 Reducer:
The reducer finds the final top-100 p-values
from the top-100 lists sent from the mappers.
We will use a single reducer for final top-100.

Top-100 p-values Form a Monoid
Associativity:
top100(x, top100(y, z)) = top100( top100(x, y), z)
Identity:
top100(x, {}) = top100({}, x) = top100(x)
Therefore, we can have a combiner as well:
The combiner finds the top-100 p-values
from the top-100 lists sent from the mappers.
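
The two monoid laws can be checked on small lists. The sketch below is a hypothetical illustration: the class TopNMonoidSketch and its topN method are names introduced here, and n is a parameter so that the example can use n = 3 instead of 100.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the top-N merge as a monoid operation.
final class TopNMonoidSketch {

    /** Merges two lists of p-values and keeps the n smallest, in ascending order. */
    static List<Double> topN(List<Double> x, List<Double> y, int n) {
        List<Double> merged = new ArrayList<>(x);
        merged.addAll(y);
        Collections.sort(merged);
        return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
    }

    public static void main(String[] args) {
        List<Double> x = Arrays.asList(0.5, 0.01, 0.2);
        List<Double> y = Arrays.asList(0.03, 0.9);
        List<Double> z = Arrays.asList(0.001, 0.4);
        List<Double> empty = new ArrayList<>();

        // Associativity: the grouping of partial results does not matter.
        List<Double> left = topN(x, topN(y, z, 3), 3);
        List<Double> right = topN(topN(x, y, 3), z, 3);
        System.out.println(left.equals(right));

        // Identity: merging with the empty list changes nothing beyond sorting.
        System.out.println(topN(x, empty, 3).equals(topN(empty, x, 3)));
    }
}
```

This is why partial top-N lists can be merged at any stage (mapper, combiner, reducer) without changing the final answer.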

MapReduce for Top-100 p-values: Mapper
public class Top100Mapper ... {
  // local top-100, sorted by pvalue in ascending order
  private SortedMap<Double, String> top100 =
      new TreeMap<Double, String>();

  // key is the pvalue of double type and range is 0.00 to 1.00
  // value is the entire record of allelic frequency
  // output (includes pvalue)
  map(Double key, String entireRecord) {
    top100.put(key, entireRecord); // sort by pvalue
    if (top100.size() > 100) {
      // remove the greatest pvalue
      top100.remove(top100.lastKey());
    }
  }

  // called once at the end of the mapper task.
  cleanup() {
    for (Map.Entry<Double, String> entry : top100.entrySet()) {
      Double pvalue = entry.getKey();
      String entireRecord = entry.getValue();
      String outputValue = pair(pvalue, entireRecord);
      // NULL key will send all key-value
      // pairs to a single reducer only
      emit(NULL, outputValue);
    }
  }
}

MapReduce for Top-100 p-values: Reducer
reduce(NullWritable key, Iterable<pair<Double, String>> values) {
  SortedMap<Double, String> finalTop100 =
      new TreeMap<Double, String>();
  for (pair<Double, String> value : values) {
    Double pvalue = value.pvalue;
    String entireRecord = value.entireRecord;
    finalTop100.put(pvalue, entireRecord);
    if (finalTop100.size() > 100) {
      // remove the greatest pvalue
      finalTop100.remove(finalTop100.lastKey());
    }
  }
  // now, we have the final top 100 list
  for (Map.Entry<Double, String> entry : finalTop100.entrySet()) {
    Double pvalue = entry.getKey();
    String entireRecord = entry.getValue();
    emit(pvalue, entireRecord);
  }
}
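
One design point worth noting: a SortedMap keyed by p-value holds a single record per distinct p-value, so two records with exactly the same p-value overwrite each other. If ties should be kept, a bounded max-heap is one alternative. The sketch below is an assumption-laden illustration, not the deck's implementation; the class BoundedTopNSketch and its nested Scored holder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical alternative to the TreeMap-based top-N: keeps records with equal p-values.
final class BoundedTopNSketch {

    /** Hypothetical holder for one scored record. */
    static final class Scored {
        final double pvalue;
        final String record;

        Scored(double pvalue, String record) {
            this.pvalue = pvalue;
            this.record = record;
        }
    }

    private final int capacity;
    // Max-heap on p-value: the largest p-value sits on top, ready for eviction.
    private final PriorityQueue<Scored> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.pvalue).reversed());

    BoundedTopNSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Adds a record and evicts the largest p-value once the capacity is exceeded. */
    void offer(double pvalue, String record) {
        heap.add(new Scored(pvalue, record));
        if (heap.size() > capacity) {
            heap.poll();
        }
    }

    /** Returns the retained records ordered by ascending p-value. */
    List<Scored> ascending() {
        List<Scored> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble((Scored s) -> s.pvalue));
        return out;
    }

    public static void main(String[] args) {
        BoundedTopNSketch top2 = new BoundedTopNSketch(2);
        top2.offer(0.5, "r1");
        top2.offer(0.1, "r2");
        top2.offer(0.1, "r3"); // same p-value as r2; both are kept
        top2.offer(0.9, "r4");
        for (Scored s : top2.ascending()) {
            System.out.println(s.pvalue + "\t" + s.record);
        }
    }
}
```

With capacity 2, the run above retains r2 and r3 (both 0.1); a map keyed by p-value would have retained only one of them together with r1.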

Running Allelic Frequency Analysis

Sample Run
$ cat allelic_freq_test_100_by_100.sh
#!/bin/bash
client=AllelicFrequencyClient
groupA=bioset_ids.txt.100.a
groupB=bioset_ids.txt.100.b
$client interactive 0 $groupA $groupB
$ wc -l bioset_ids.txt.100.a bioset_ids.txt.100.b
100 bioset_ids.txt.100.a
100 bioset_ids.txt.100.b
$ cat bioset_ids.txt.100.a
427033
427039
...
$ cat bioset_ids.txt.100.b
656714
656720
...

Sample Run
$ ./allelic_freq_test_100_by_100.sh
Wed Feb 12 15:27:10 PST 2014
Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - executionType: interactive
Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - requestID: 0
Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - GroupA: bioset_ids.txt.100.a
Feb 12 2014 15:27:10 [INFO ] [AllelicFrequencyClient] - GroupB: bioset_ids.txt.100.b
...
Feb 12 2014 15:27:12 [main] [INFO ] [JobClient] - Running job: job_201401170112_0644
Feb 12 2014 15:27:13 [main] [INFO ] [JobClient] - map 0% reduce 0%
...
Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Job complete: job_201401170112_0644
...
Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Map-Reduce Framework
Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Map output materialized bytes=134376521
Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Map input records=9,352,649
Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Reduce input groups=134,894
Feb 12 2014 15:28:45 [main] [INFO ] [JobClient] - Reduce output records=53,557
Feb 12 2014 15:28:45 [main] [INFO ] [AllelicFrequencyDriver] - run(): jobSucceeded=true
Feb 12 2014 15:28:45 [main] [INFO ] [AllelicFrequencyDriver] - run(): Job Finished in 94.423 seconds
Feb 12 2014 15:28:45 [main] [INFO ] [AllelicFrequencyDriver] - submitJob(): runStatus=0

Sample Run
$ hadoop fs -cat /biomarker/output/germline/0/part* | sort -g | head
3.9437604668735787E-115:1:2483112:2483112:32872,20773539,20774078,:8:C:C:198:2:0:200:147567
3.9437604668735787E-115:12:51372604:51373968:100191,:15:1365BP:1365BP:198:2:0:200:null
7.770768062434234E-115:13:113588869:113588869:10323,:8:G:G:1:199:199:1:40972240
2.668611249251343E-113:13:113587440:113587440:10323,:8:G:G:197:3:0:200:10286004
5.206158192319811E-111:13:113587440:113587440:10323,:8:C:C:1:199:197:3:10286004
7.693839401259585E-111:1:16682451:16684181:79290,:15:1,731BP:1,731BP:2:198:198:2:null
5.580066122186588E-110:13:113588869:113588869:10323,:8:C:C:195:5:0:200:null
2.6288489416374975E-109:17:36760271:36779253:15243,15247,:15:18,983BP:18,983BP:1:199:196:4:null
1.915822701950223E-108:17:36760271:36779253:15243,15247,:15:18983BP:18983BP:194:6:0:200:null
5.665361418625481E-107:1:2483112:2483112:32872,20773539,20774078,:8:G:G:0:200:193:7:147567

References
Wikipedia
Allele Frequency
http://en.wikipedia.org/wiki/Allele_frequency
Max Kuhn and Kjell Johnson
Applied Predictive Modeling
Springer, 2013
