Cancer Outlier Profile Analysis
using
Apache Spark
Mahmoud Parsian
Ph.D in Computer Science
Senior Architect @ illumina
Apache Big Data
Vancouver, BC
May 10, 2016
Mahmoud Parsian Ph.D in Computer Science Senior Architect @ illuminaCancer Outlier Profile Analysis using Apache Spark 1 / 74
Table of Contents
1 Biography
2 What is COPA using Spark?
3 COPA Algorithm
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
7 References
Biography
Who am I?
Name: Mahmoud Parsian
Education: Ph.D in Computer Science
Work: Senior Architect @Illumina, Inc
Lead Big Data Team @Illumina
Develop scalable regression algorithms
Develop DNA-Seq and RNA-Seq workflows
Use Java/MapReduce/Hadoop/Spark/HBase
Author of 3 books:
Data Algorithms (O'Reilly: http://shop.oreilly.com/product/0636920033950.do/)
JDBC Recipes (Apress: http://apress.com/)
JDBC MetaData Recipes (Apress: http://apress.com/)
My book: Data Algorithms
Book: http://shop.oreilly.com/product/0636920033950.do
GIT: https://github.com/mahmoudparsian/data-algorithms-book
What is COPA using Spark?
Background: COPA, Spark
What is COPA? Cancer Outlier Profile Analysis
Is COPA important?
What is an outlier?
Common methods to detect outliers
Outliers in cancer gene study
T-statistic in outlier study
What are COPA algorithms?
What is COPA?
The COPA algorithm:
analyzes DNA microarray data for genes markedly over-expressed (outliers) in a subset of cases
COPA was the first algorithm that led to the discovery of the ERG rearrangement in prostate cancer.
ERG (ETS-related gene) is an oncogene: it encodes a protein that is frequently mutated in cancer.
COPA identified the ETS family members ERG and ETV1 as high-ranking outliers in multiple prostate cancer profiling studies.
Informal COPA Algorithm
Input = A cancer gene expression dataset, consisting of thousands of
genes measured across hundreds of samples
STEPS: (A) → (B) → (C) → (D) → (E)
(A) genes are normalized and sorted via the COPA method.
(B) COPA normalizes the median expression per gene to zero
The MAD (median absolute deviation) to one
A percentile cutoff is selected (e.g., the 75th, 90th, or 95th percentile) and
genes are sorted by their COPA value at that percentile
(C) The top 1% of genes is deemed to have outlier expression profiles
(D) ...
(E) ...
Informal COPA Algorithm
STEPS: (A) → (B) → (C) → (D) → (E)
A, B, C
(D) A collection of outliers from independent datasets is compiled
and submitted for meta-analysis
(E) Multiple datasets of a given cancer type are meta-analyzed
Identify genes that are consistently called outliers across independent
datasets
Significance is assessed via the binomial distribution
What is an Outlier?
Definition:
An outlier is an unusual value in a dataset:
one that does not fit the typical pattern of the data
Sources of outliers
Recording of measurement errors
Natural variation of the data (valid data)
What is COPA Statistics?
COPA = Cancer Outlier Profile Analysis
Statistics designed to identify outliers in cancer gene expression
profile
Outlier Detection, How?
Visual inspection of data
not applicable for large complex datasets
Automated methods
Normal distribution-based method (NORMAL)
Median Absolute Deviation (MAD)
Distance Based Method
COPA
...
Normal distribution-based method
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$
where
x: sample data
µ: mean of distribution
σ: standard deviation
σ2: variance
Normal distribution in picture...
Normal distribution-based method
Assume a normal distribution for each variable
Works on one variable at a time: $X_k$, $k = 1, \ldots, p$
Algorithm
Let $x_{ik}$ be the $i$-th sample value of variable $X_k$, $i = 1, \ldots, n$
Sample mean for $X_k$: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_{ik}$
Sample standard deviation for $X_k$: $S_k = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{ik} - \mu)^2}$
Calculate the z-statistic $Z_{ik} = \frac{x_{ik} - \mu}{S_k}$ for $i = 1, \ldots, n$
Label $x_{ik}$ an outlier if $|Z_{ik}| > 3$
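This rule can be sketched in a few lines of Python (a sketch, not the talk's Java/Spark code; the data set is an invented example):

```python
import statistics

def normal_outliers(xs):
    """Flag values whose z-score exceeds 3 in absolute value."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)  # population standard deviation (1/n form)
    return [x for x in xs if abs((x - mu) / sd) > 3]

data = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 25.0]
print(normal_outliers(data))  # the extreme value 25.0 is flagged
```

Note that the outlier itself inflates both the mean and the standard deviation, so its z-score here only just crosses the threshold; this is exactly the robustness problem discussed next.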
Problems with the normal distribution-based method
Very dependent on assumption of normality
Does not work well on big data
Works only for numeric-based variables
µ (mean) and Sk (standard deviation) are not robust to outliers
Median Absolute Deviation (MAD)
Another method for dealing with the robustness problem
Use the median as a robust estimate of the mean
Use the MAD as a robust estimate of the standard deviation
Algorithm
Calculate $D_{ik} = |x_{ik} - \mathrm{median}(X_k)|$ for $i = 1, \ldots, n$
Calculate $\mathrm{MAD} = \mathrm{median}(D_{1k}, \ldots, D_{nk})$
Calculate $Z_{ik} = \frac{x_{ik} - \mathrm{median}(X_k)}{1.4826 \times \mathrm{MAD}}$
Label $x_{ik}$ an outlier if $|Z_{ik}| > 3.5$
Note: for normally distributed data, $\sigma \approx 1.4826 \times \mathrm{MAD}$
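The MAD rule can be sketched in Python (a sketch with invented data, not the talk's Spark code):

```python
import statistics

def mad_outliers(xs, cutoff=3.5):
    """Flag values using the robust MAD-based z-score."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    return [x for x in xs if abs((x - med) / (1.4826 * mad)) > cutoff]

data = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 25.0]
print(mad_outliers(data))  # the extreme value 25.0 is flagged
```

Because the median and MAD are barely moved by the outlier itself, the robust z-score of the extreme value is enormous here, whereas the classical z-score on the same data barely crosses its threshold.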
Outliers in Cancer Studies
In cancer studies:
mutations can amplify or turn off gene expression in only a minority of samples, i.e., they produce outliers
the t-statistic may yield a high false discovery rate (FDR) when trying to detect changes that occur in only a small number of samples
COPA is designed for exactly this setting
Is there a better method than COPA? Maybe.
COPA Algorithm
COPA Algorithm
gene expression values are median centered
setting each gene’s median expression value to zero
the median absolute deviation (MAD) is calculated
scaled to 1 by dividing each gene expression value by its MAD
the 75th, 90th, and 95th percentiles of the transformed expression
values are tabulated for each gene
genes are rank-ordered by their percentile scores, providing a
prioritized list of outlier profiles.
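The median/MAD transform above can be sketched in plain Python (illustrative only; the data, the simplified nearest-rank percentile, and the 75th-percentile default are assumptions, not the authors' exact implementation):

```python
import statistics

def copa_score(values, percentile=0.75):
    """COPA transform for one gene: median-center the expression values,
    scale by the MAD, and report the value at the chosen percentile."""
    med = statistics.median(values)
    centered = [v - med for v in values]
    mad = statistics.median(abs(c) for c in centered)
    scaled = sorted(c / mad for c in centered)
    # simplified nearest-rank percentile
    idx = min(int(percentile * len(scaled)), len(scaled) - 1)
    return scaled[idx]

# a gene over-expressed in only a few samples outranks a flat gene
outlier_gene = [1.0, 1.1, 0.9, 1.0, 9.0, 8.5]
flat_gene    = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8]
print(copa_score(outlier_gene) > copa_score(flat_gene))  # True
```

A t-statistic on the same data would be diluted by the many unchanged samples; the percentile of the MAD-scaled values is what lets the small outlier subset dominate the score.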
COPA Transformation
Median and MAD were used for transformation
as opposed to mean and standard deviation
so that outlier expression values do not unduly influence the
distribution estimates,
and are thus preserved post-normalization
COPA: ONE-Sided and TWO-Sided
ONE-Sided
Put all Studies together
Calculate COPA-Score per GENE-ID
TWO-Sided
Create K Study groups
Rank All Groups
Calculate COPA-Score per GENE-ID (per Study Group)
COPA: TWO-Sided
Handles K studies, where each study is a set of samples.
For each study, compute the COPA scores; the result is the median of the COPA scores across all studies.
For K studies, report:
GENE-IDi
mean(COPA scores)
geometric mean of the ranks (rank product), for j = 1, ..., K
Rank-Converted-Scoresij, for j = 1, ..., K
Is COPA Big Data?
General case:
1000’s of Studies
Each Study = 100’s of Samples
Each Sample = 100,000+ (K, V) pairs
Example:
4000 studies, assume each study has 300 samples
Total Samples: 4000 × 300 = 1, 200, 000
Analyze: 1, 200, 000 × 100, 000 = 120, 000, 000, 000 (K, V) pairs
Input Format
Each sample is a set of (K, V) pairs
General Format: K=GENE-ID, V=GENE-VALUE
Example-1: g127,1.20
Example-2: g4567,0.09
Input Data
Input Data Format
Set of k studies {S1, S2, ..., Sk}
Each study has billions of (Key, Value) pairs
Sample Record:
<key-as-string><,><value-as-double-data-type>
Input Data Persistence
Data persists in HDFS
Directory structure:
/input/study-0001/file-0001-1.txt
/input/study-0001/file-0001-2.txt
/input/study-0001/file-0001-3.txt
...
/input/study-3265/file-3265-1.txt
/input/study-3265/file-3265-2.txt
/input/study-3265/file-3265-3.txt
...
What is a Ranking?
Let S = {(K1, 40), (K2, 70), (K3, 90), (K4, 80)}
Then Rank(S) = {(K1, 4), (K2, 3), (K3, 1), (K4, 2)}
Since 90 > 80 > 70 > 40
Ranks are assigned as: 1, 2, 3, 4, ..., N
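This ranking can be sketched in Python (illustrative; ties are ignored for simplicity):

```python
def rank(pairs):
    """Assign rank 1 to the largest value, 2 to the next, and so on."""
    ordered = sorted(pairs, key=pairs.get, reverse=True)
    return {k: i + 1 for i, k in enumerate(ordered)}

S = {"K1": 40, "K2": 70, "K3": 90, "K4": 80}
print(rank(S))  # {'K3': 1, 'K4': 2, 'K2': 3, 'K1': 4}
```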
What is Rank Product?
Let {A1, ..., Ak} be a set of (key-value) pairs where keys are unique
per dataset.
Example of (key-value) pairs:
(K,V) = (item, number of items sold)
(K,V) = (user, number of followers for the user)
(K,V) = (gene, test expression)
Then the ranked product of {A1, ..., Ak} is computed based on the
ranks ri for key i across all k datasets. Typically ranks are assigned
based on the sorted values of datasets.
Let A1 = {(K1, 30), (K2, 60), (K3, 10), (K4, 80)},
then Rank(A1) = {(K1, 3), (K2, 2), (K3, 4), (K4, 1)}
since 80 > 60 > 30 > 10
Note that 1 is the highest rank (assigned to the largest value).
Let A2 = {(K1, 90), (K2, 70), (K3, 40), (K4, 50)},
Rank(A2) = {(K1, 1), (K2, 2), (K3, 4), (K4, 3)}
since 90 > 70 > 50 > 40
Let A3 = {(K1, 4), (K2, 8)}
Rank(A3) = {(K1, 2), (K2, 1)}
since 8 > 4
The rank product of {A1, A2, A3} is expressed as:
$\{(K_1, \sqrt[3]{3 \times 1 \times 2}),\ (K_2, \sqrt[3]{2 \times 2 \times 1}),\ (K_3, \sqrt{4 \times 4}),\ (K_4, \sqrt{1 \times 3})\}$
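This worked example can be checked with a short Python sketch (pure Python, not the Spark solution; a key missing from a dataset simply contributes no rank):

```python
import math

def rank(pairs):
    """Rank 1 goes to the largest value."""
    ordered = sorted(pairs, key=pairs.get, reverse=True)
    return {k: i + 1 for i, k in enumerate(ordered)}

def rank_product(datasets):
    """Geometric mean of each key's ranks across the datasets holding it."""
    ranks = [rank(d) for d in datasets]
    keys = set().union(*datasets)
    result = {}
    for k in keys:
        rs = [r[k] for r in ranks if k in r]
        result[k] = math.prod(rs) ** (1.0 / len(rs))
    return result

A1 = {"K1": 30, "K2": 60, "K3": 10, "K4": 80}
A2 = {"K1": 90, "K2": 70, "K3": 40, "K4": 50}
A3 = {"K1": 4, "K2": 8}
rp = rank_product([A1, A2, A3])
print(rp["K3"])  # (4 * 4) ** 0.5 = 4.0
```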
Calculation of the Rank Product
Given $n$ genes and $k$ replicates,
let $e_{g,i}$ be the fold change and $r_{g,i}$ the rank of gene $g$ in the $i$-th replicate.
Compute the rank product (RP) via the geometric mean:
$$RP(g) = \left(\prod_{i=1}^{k} r_{g,i}\right)^{1/k} = \sqrt[k]{\prod_{i=1}^{k} r_{g,i}}$$
Formalizing Rank Product
Let S = {S1, S2, ..., Sk} be a set of k studies, where k > 0 and each
study represents a microarray experiment
Let Si (i = 1, 2, ..., k) be a study with an arbitrary number of
assays identified by {Ai1, Ai2, ...}
Let each assay (which can be represented as a text file) be a set of an
arbitrary number of records in the following format:
<gene_id><,><gene_value_as_double_data_type>
Let gene_id be in {g1, g2, ..., gn} (we have n genes).
Rank Product: in 2 Steps
Let S = {S1, S2, ..., Sk} be a set of k studies:
STEP-1: find the mean of values per study per gene
(you may replace the "mean" function with your desired function)
finding the mean involves groupByKey() or combineByKey()
STEP-2: compute the "Rank Product" per gene across all studies
finding the "Rank Product" involves groupByKey() or combineByKey()
Formalizing Rank Product
The last step is to find the rank product for each gene across all studies:
S1 = {(g1, r11), (g2, r12), ...}
S2 = {(g1, r21), (g2, r22), ...}
...
Sk = {(g1, rk1), (g2, rk2), ...}
Then the rank product of gene $g_j$ is
$$RP(g_j) = \left(\prod_{i=1}^{k} r_{i,j}\right)^{1/k} = \sqrt[k]{\prod_{i=1}^{k} r_{i,j}}$$
Spark Solution for Rank Product
1 Read k input paths (path = study)
2 Find the mean per gene per study
3 Sort the genes by value per study and then assign ranks; to sort the
dataset by value, we swap the key with the value and then sort.
4 Assign ranks 1, 2, ..., N (1 is assigned to the highest value)
use JavaPairRDD.zipWithIndex(), which zips the RDD with its
element indices (these indices will be the ranks)
Spark indices start from 0, so we add 1
5 Finally compute the Rank Product per gene for all studies:
JavaPairRDD.groupByKey() (not efficient)
JavaPairRDD.combineByKey() (efficient)
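The five steps can be sketched end-to-end without Spark (plain Python; the two tiny "studies" are invented data):

```python
import math
from collections import defaultdict

def study_ranks(records):
    """records: (gene, value) pairs for one study.
    Mean per gene, then rank by mean (rank 1 = highest mean)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for gene, value in records:
        sums[gene] += value
        counts[gene] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    ordered = sorted(means, key=means.get, reverse=True)
    return {g: i + 1 for i, g in enumerate(ordered)}

def rank_product(studies):
    """Geometric mean of each gene's ranks across all studies."""
    ranks = [study_ranks(s) for s in studies]
    genes = set().union(*ranks)
    rp = {}
    for g in genes:
        rs = [r[g] for r in ranks if g in r]
        rp[g] = math.prod(rs) ** (1.0 / len(rs))
    return rp

study1 = [("g1", 1.0), ("g1", 3.0), ("g2", 5.0)]  # means: g1=2.0, g2=5.0
study2 = [("g1", 4.0), ("g2", 2.0)]               # means: g1=4.0, g2=2.0
print(rank_product([study1, study2]))
```

The Spark version distributes exactly these two aggregations (mean per gene, product of ranks per gene) across partitions; the per-key grouping done here with dictionaries is what groupByKey()/combineByKey() do at scale.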
Two Spark Solutions: groupByKey() and combineByKey()
Two solutions are provided using Spark-1.6.0:
SparkRankProductUsingGroupByKey
uses groupByKey()
SparkRankProductUsingCombineByKey
uses combineByKey()
How does groupByKey() work
How does reduceByKey() work
Rank Product Algorithm
Rank Product Algorithm in Spark
Algorithm: High-Level Steps
Step Description
STEP-1 import required interfaces and classes
STEP-2 handle input parameters
STEP-3 create a Spark context object
STEP-4 create list of studies (1, 2, ..., K)
STEP-5 compute mean per gene per study
STEP-6 sort by values
STEP-7 assign rank
STEP-8 compute rank product
STEP-9 save the result in HDFS
Main Driver
Listing 1: main() driver
public static void main(String[] args) throws Exception {
  // args[0] = output path
  // args[1] = number of studies (K)
  // args[2] = input path for study 1
  // args[3] = input path for study 2
  // ...
  // args[K+1] = input path for study K
  final String outputPath = args[0];
  final String numOfStudiesAsString = args[1];
  final int K = Integer.parseInt(numOfStudiesAsString);
  List<String> inputPathMultipleStudies = new ArrayList<String>();
  for (int i=1; i <= K; i++) {
    String singleStudyInputPath = args[1+i];
    inputPathMultipleStudies.add(singleStudyInputPath);
  }
  performRrankProduct(inputPathMultipleStudies, outputPath);
  System.exit(0);
}
groupByKey() vs. combineByKey()
Which one should we use? combineByKey() or groupByKey()?
According to their semantics, both will give you the same answer,
but combineByKey() is more efficient.
In some situations, groupByKey() can even cause out-of-disk
problems. In general, reduceByKey() and combineByKey() are
preferred over groupByKey().
Spark shuffling is more efficient for reduceByKey() than for
groupByKey(), and the reason is this: in the shuffle step for
reduceByKey(), data is combined so that each partition outputs at most
one value for each key to send over the network, while in the shuffle step
for groupByKey(), all the data is wastefully sent over the network
and collected on the reduce workers.
To understand the difference, the following figures show how the
shuffle is done for reduceByKey() and groupByKey().
Understanding groupByKey()
Understanding reduceByKey() or combineByKey()
Listing 2: performRrankProduct()
public static void performRrankProduct(
    final List<String> inputPathMultipleStudies,
    final String outputPath) throws Exception {
  // create a context object, which is used
  // as a factory for creating new RDDs
  JavaSparkContext context = Util.createJavaSparkContext(useYARN);

  // Spark 1.6.0 requires an array for creating union of many RDDs
  int index = 0;
  JavaPairRDD<String, Double>[] means =
      new JavaPairRDD[inputPathMultipleStudies.size()];
  for (String inputPathSingleStudy : inputPathMultipleStudies) {
    means[index] = computeMeanByGroupByKey(context, inputPathSingleStudy);
    index++;
  }

  // next compute rank
  ...
}
Listing 3: performRrankProduct()
public static void performRrankProduct(
  ...
  // next compute rank
  // 1. sort values based on absolute value of mean value
  // 2. assign rank from 1 to N
  // 3. calculate rank product for each gene
  JavaPairRDD<String,Long>[] ranks = new JavaPairRDD[means.length];
  for (int i=0; i < means.length; i++) {
    ranks[i] = assignRank(means[i]);
  }
  // calculate ranked products
  // <gene, T2<rankedProduct, N>>
  JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts =
      computeRankedProducts(context, ranks);

  // save the result, shuffle=true
  rankedProducts.coalesce(1,true).saveAsTextFile(outputPath);

  // close the context and we are done
  context.close();
}
STEP-3: create a Spark context object
public static JavaSparkContext createJavaSparkContext(boolean useYARN) {
  JavaSparkContext context;
  if (useYARN) {
    context = new JavaSparkContext("yarn-cluster", "MyAnalysis"); // YARN
  }
  else {
    context = new JavaSparkContext(); // Spark cluster
  }
  // inject efficiency
  SparkConf sparkConf = context.getConf();
  sparkConf.set("spark.kryoserializer.buffer.mb","32");
  sparkConf.set("spark.shuffle.file.buffer.kb","64");
  // set a fast serializer
  sparkConf.set("spark.serializer",
      "org.apache.spark.serializer.KryoSerializer");
  sparkConf.set("spark.kryo.registrator",
      "org.apache.spark.serializer.KryoRegistrator");
  return context;
}
Listing 4: STEP-5: compute mean
static JavaPairRDD<String, Double> computeMeanByGroupByKey(
    JavaSparkContext context,
    final String inputPath) throws Exception {
  JavaPairRDD<String, Double> genes =
      getGenesUsingTextFile(context, inputPath, 30);

  // group values by gene
  JavaPairRDD<String, Iterable<Double>> groupedByGene = genes.groupByKey();

  // calculate mean per gene
  JavaPairRDD<String, Double> meanRDD = groupedByGene.mapValues(
    new Function<
          Iterable<Double>, // input
          Double            // output: mean
         >() {
      @Override
      public Double call(Iterable<Double> values) {
        double sum = 0.0;
        int count = 0;
        for (Double v : values) {
          sum += v;
          count++;
        }
        // calculate mean of samples
        double mean = sum / ((double) count);
        return mean;
      }
  });
  return meanRDD;
}
Listing 5: getGenesUsingTextFile()
static JavaPairRDD<String, Double> getGenesUsingTextFile(
    JavaSparkContext context,
    final String inputPath,
    final int numberOfPartitions) throws Exception {

  // read input and create the first RDD
  // JavaRDD<String>: where String = "gene,test_expression"
  JavaRDD<String> records = context.textFile(inputPath, numberOfPartitions);

  // for each record, we emit (K=gene, V=test_expression)
  JavaPairRDD<String, Double> genes
    = records.mapToPair(new PairFunction<String, String, Double>() {
      @Override
      public Tuple2<String, Double> call(String rec) {
        // rec = "gene,test_expression"
        String[] tokens = StringUtils.split(rec, ",");
        // tokens[0] = gene
        // tokens[1] = test_expression
        return new Tuple2<String, Double>(
            tokens[0], Double.parseDouble(tokens[1]));
      }
  });
  return genes;
}
Listing 6: Assign Rank
// result is JavaPairRDD<String, Long> = (gene, rank)
static JavaPairRDD<String,Long> assignRank(JavaPairRDD<String,Double> rdd) {
  // swap key and value (used for sorting by key); convert value to abs(value)
  JavaPairRDD<Double,String> swappedRDD = rdd.mapToPair(
    new PairFunction<Tuple2<String, Double>, // T: input
                     Double,                 // K
                     String>() {             // V
      public Tuple2<Double,String> call(Tuple2<String, Double> s) {
        return new Tuple2<Double,String>(Math.abs(s._2), s._1);
      }
  });
  // we need 1 partition so that we can zip numbers into this RDD by zipWithIndex()
  JavaPairRDD<Double,String> sorted = swappedRDD.sortByKey(false, 1); // false means descending
  // JavaPairRDD<T,Long> zipWithIndex()
  // Long values will be 0, 1, 2, ...; for ranking we need 1, 2, 3, ..., so we add 1
  JavaPairRDD<Tuple2<Double,String>,Long> indexed = sorted.zipWithIndex();
  // convert JavaPairRDD<Tuple2<value,gene>,rank> into JavaPairRDD<gene,rank>
  JavaPairRDD<String, Long> ranked = indexed.mapToPair(
    new PairFunction<Tuple2<Tuple2<Double,String>,Long>, // T: input
                     String,                             // K: gene
                     Long>() {                           // V: rank
      public Tuple2<String, Long> call(Tuple2<Tuple2<Double,String>,Long> s) {
        return new Tuple2<String,Long>(s._1._2, s._2 + 1); // ranks are 1, 2, ..., n
      }
  });
  return ranked;
}
Listing 7: Compute Rank Product using groupByKey()
static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProducts(
    JavaSparkContext context,
    JavaPairRDD<String, Long>[] ranks) {
  JavaPairRDD<String, Long> unionRDD = context.union(ranks);

  // next find unique keys, with their associated values
  JavaPairRDD<String, Iterable<Long>> groupedByGeneRDD = unionRDD.groupByKey();

  // next calculate ranked products and the number of elements
  JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts = groupedByGeneRDD.mapValues(
    new Function<
          Iterable<Long>,         // input: ranks for all studies
          Tuple2<Double, Integer> // output: (rankedProduct, N)
         >() {
      @Override
      public Tuple2<Double, Integer> call(Iterable<Long> values) {
        int N = 0;
        long products = 1;
        for (Long v : values) {
          products *= v;
          N++;
        }
        double rankedProduct = Math.pow((double) products, 1.0/((double) N));
        return new Tuple2<Double, Integer>(rankedProduct, N);
      }
  });
  return rankedProducts;
}
Next FOCUS on combineByKey()
We need to develop 2 functions:
computeMeanByCombineByKey()
computeRankedProductsUsingCombineByKey()
combineByKey(): how does it work?
combineByKey() is the most general of the per-key aggregation
functions. Most of the other per-key combiners are implemented
using it.
Like aggregate(), combineByKey() allows the user to return values
that are not the same type as our input data.
To understand combineByKey(), it is useful to think of how it
handles each element it processes.
As combineByKey() goes through the elements in a partition, each
element either has a key it has not seen before or has the same key as
a previous element.
combineByKey() by Combiner (C)
public <C> JavaPairRDD<K,C> combineByKey(
    Function<V,C> createCombiner,    // V -> C
    Function2<C,V,C> mergeValue,     // C+V -> C
    Function2<C,C,C> mergeCombiners) // C+C -> C

Provide 3 functions:
1. createCombiner, which turns a V into a C
(e.g., creates a one-element list)
2. mergeValue, to merge a V into a C
(e.g., adds it to the end of a list)
3. mergeCombiners, to combine two Cs into a single one.
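The three-function contract can be mimicked in plain Python to show how a per-key average is built (illustrative only; this is not Spark's implementation, and the two hand-built "partitions" stand in for what the shuffle would deliver):

```python
# C = (total, count): the combiner type for computing a mean
def create_combiner(v):            # V -> C
    return (v, 1)

def merge_value(c, v):             # C + V -> C
    return (c[0] + v, c[1] + 1)

def merge_combiners(a, b):         # C + C -> C
    return (a[0] + b[0], a[1] + b[1])

def combine_partition(pairs):
    """Apply create_combiner/merge_value within one 'partition'."""
    combined = {}
    for k, v in pairs:
        combined[k] = merge_value(combined[k], v) if k in combined else create_combiner(v)
    return combined

p1 = combine_partition([("g1", 2.0), ("g2", 4.0), ("g1", 4.0)])
p2 = combine_partition([("g1", 6.0)])
# merge the per-partition combiners, as Spark would after the shuffle
merged = dict(p1)
for k, c in p2.items():
    merged[k] = merge_combiners(merged[k], c) if k in merged else c
means = {k: total / count for k, (total, count) in merged.items()}
print(means)  # {'g1': 4.0, 'g2': 4.0}
```

Because each partition emits one small (total, count) pair per key rather than every raw value, far less data crosses the network than with groupByKey().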
computeMeanByCombineByKey():
Define C data structure
//
// AverageCount is used by combineByKey()
// to hold the total values and their count.
//
static class AverageCount implements Serializable {
  double total;
  int count;

  public AverageCount(double total, int count) {
    this.total = total;
    this.count = count;
  }

  public double average() {
    return total / (double) count;
  }
}
Rank Product Algorithm
computeMeanByCombineByKey()
1 static JavaPairRDD<String, Double> computeMeanByCombineByKey(
2 JavaSparkContext context,
3 final String inputPath) throws Exception {
4
5 JavaPairRDD<String, Double> genes = getGenesUsingTextFile(context, inputPath, 30);
6
7 // we need 3 functions to be able to use combineByKey()
8 Function<Double, AverageCount> createCombiner = ...
9 Function2<AverageCount, Double, AverageCount> addAndCount = ...
10 Function2<AverageCount, AverageCount, AverageCount> mergeCombiners = ...
11
12 JavaPairRDD<String, AverageCount> averageCounts =
13 genes.combineByKey(createCombiner, addAndCount, mergeCombiners);
14
15 // now compute the mean/average per gene
16 JavaPairRDD<String,Double> meanRDD = averageCounts.mapToPair(
17 new PairFunction<
18 Tuple2<String, AverageCount>, // T: input
19 String, // K
20 Double // V
21 >() {
22 @Override
23 public Tuple2<String,Double> call(Tuple2<String, AverageCount> s) {
24 return new Tuple2<String,Double>(s._1, s._2.average());
25 }
26 });
27 return meanRDD;
28 }
Rank Product Algorithm
3 functions for combineByKey()
1 Function<Double, AverageCount> createCombiner = new Function<Double, AverageCount>() {
2 @Override
3 public AverageCount call(Double x) {
4 return new AverageCount(x, 1);
5 }
6 };
7
8 Function2<AverageCount, Double, AverageCount> addAndCount =
9 new Function2<AverageCount, Double, AverageCount>() {
10 @Override
11 public AverageCount call(AverageCount a, Double x) {
12 a.total += x;
13 a.count += 1;
14 return a;
15 }
16 };
17
18 Function2<AverageCount, AverageCount, AverageCount> mergeCombiners =
19 new Function2<AverageCount, AverageCount, AverageCount>() {
20 @Override
21 public AverageCount call(AverageCount a, AverageCount b) {
22 a.total += b.total;
23 a.count += b.count;
24 return a;
25 }
26 };
Rank Product Algorithm
Per-key average using combineByKey() in Python
1 # C = (sum, count)
2 sumCount = nums.combineByKey(
3 (lambda v: (v,1)),
4 (lambda C, v: (C[0] + v, C[1] + 1)),
5 (lambda C1, C2: (C1[0] + C2[0], C1[1] + C2[1]))
6 )
7
8 mean = sumCount.mapValues(lambda C: C[0] / C[1]).collectAsMap()
9 print(mean)
Rank Product Algorithm
Sample Run using groupByKey(): script
1 # define input/output for Hadoop/HDFS
2 #// args[0] = output path
3 #// args[1] = number of studies (K)
4 #// args[2] = input path for study 1
5 #// args[3] = input path for study 2
6 #// ...
7 #// args[K+1] = input path for study K
8 OUTPUT=/rankproduct/output
9 NUM_OF_INPUT=3
10 INPUT1=/rankproduct/input1
11 INPUT2=/rankproduct/input2
12 INPUT3=/rankproduct/input3
13
14 # remove all files under output
15 $HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT
16 #
17 # define the Spark driver class
18 driver=org.dataalgorithms.bonus.rankproduct.spark.SparkRankProductUsingGroupByKey
19 $SPARK_HOME/bin/spark-submit --class $driver \
20 --master yarn-cluster \
21 --jars $OTHER_JARS \
22 --conf "spark.yarn.jar=$SPARK_JAR" \
23 $APP_JAR $OUTPUT $NUM_OF_INPUT $INPUT1 $INPUT2 $INPUT3
Rank Product Algorithm
Sample Run using groupByKey(): input
1 # hadoop fs -cat /rankproduct/input1/rp1.txt
2 K_1,30.0
3 K_2,60.0
4 K_3,10.0
5 K_4,80.0
6
7 # hadoop fs -cat /rankproduct/input2/rp2.txt
8 K_1,90.0
9 K_2,70.0
10 K_3,40.0
11 K_4,50.0
12
13 # hadoop fs -cat /rankproduct/input3/rp3.txt
14 K_1,4.0
15 K_2,8.0
Rank Product Algorithm
Sample Run using groupByKey(): output
1 # hadoop fs -cat /rankproduct/output/part*
2 (K_2,(1.5874010519681994,3))
3 (K_3,(4.0,2))
4 (K_1,(1.8171205928321397,3))
5 (K_4,(1.7320508075688772,2))
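These numbers can be reproduced off-Spark: rank each study's values in descending order (rank 1 for the largest value), multiply a gene's ranks across the studies it appears in, and take the N-th root (a geometric mean). A quick Python check against the sample inputs above, where K_3 and K_4 have N = 2 because study 3 lacks them:

```python
import math

# sample input: three studies of (gene, value) pairs
studies = [
    {"K_1": 30.0, "K_2": 60.0, "K_3": 10.0, "K_4": 80.0},
    {"K_1": 90.0, "K_2": 70.0, "K_3": 40.0, "K_4": 50.0},
    {"K_1": 4.0,  "K_2": 8.0},
]

# ranks per study: rank 1 for the largest value
ranks = {}
for study in studies:
    ordered = sorted(study, key=study.get, reverse=True)
    for r, gene in enumerate(ordered, start=1):
        ranks.setdefault(gene, []).append(r)

# rank product = N-th root of the product of the N ranks
rank_product = {g: (math.prod(rs) ** (1.0 / len(rs)), len(rs))
                for g, rs in ranks.items()}
print(rank_product["K_2"])  # ~ (1.5874, 3), matching K_2 in the output above
```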
Rank Product Algorithm
computeRankedProductsUsingCombineByKey():
Define C data structure
1 /**
2 * RankProduct is used by combineByKey() to hold
3 * the total product values and their count.
4 */
5 static class RankProduct implements Serializable {
6 long product;
7 int count;
8
9 public RankProduct(long product, int count) {
10 this.product = product;
11 this.count = count;
12 }
13
14 public double rank() {
15 return Math.pow((double) product, 1.0/ (double) count);
16 }
17 }
Rank Product Algorithm
computeRankedProductsUsingCombineByKey()
1 // JavaPairRDD<String, Tuple2<Double, Integer>> = <gene, T2(rankedProduct, N)>
2 // where N is the number of elements for computing the rankedProduct
3 static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProductsUsingCombineByKey(
4 JavaSparkContext context,
5 JavaPairRDD<String, Long>[] ranks) {
6 JavaPairRDD<String, Long> unionRDD = context.union(ranks);
7
8 // we need 3 functions to be able to use combineByKey()
9 Function<Long, RankProduct> createCombiner = ...
10 Function2<RankProduct, Long, RankProduct> addAndCount = ...
11 Function2<RankProduct, RankProduct, RankProduct> mergeCombiners = ...
12
13 // next find unique keys, with their associated ranks
14 JavaPairRDD<String, RankProduct> combinedByGeneRDD =
15 unionRDD.combineByKey(createCombiner, addAndCount, mergeCombiners);
16
17 // next calculate ranked products and the number of elements
18 JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts = combinedByGeneRDD.mapValues(
19 new Function<
20 RankProduct, // input: RankProduct
21 Tuple2<Double, Integer> // output: (rankedProduct, count)
22 >() {
23 @Override
24 public Tuple2<Double, Integer> call(RankProduct value) {
25 double theRankedProduct = value.rank();
26 return new Tuple2<Double, Integer>(theRankedProduct, value.count);
27 }
28 });
29 return rankedProducts;
30 }
Rank Product Algorithm
3 functions for
computeRankedProductsUsingCombineByKey()
1 Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
2 @Override
3 public RankProduct call(Long x) {
4 return new RankProduct(x, 1);
5 }
6 };
7
8 Function2<RankProduct, Long, RankProduct> addAndCount =
9 new Function2<RankProduct, Long, RankProduct>() {
10 @Override
11 public RankProduct call(RankProduct a, Long x) {
12 a.product *= x;
13 a.count += 1;
14 return a;
15 }
16 };
17
18 Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
19 new Function2<RankProduct, RankProduct, RankProduct>() {
20 @Override
21 public RankProduct call(RankProduct a, RankProduct b) {
22 a.product *= b.product;
23 a.count += b.count;
24 return a;
25 }
26 };
Rank Product Algorithm
REVISED computeRankedProductsUsingCombineByKey():
Define C data structure
1 static class RankProduct implements Serializable {
2 long product;
3 int count;
4
5 public RankProduct(long product, int count) {
6 this.product = product;
7 this.count = count;
8 }
9
10 public void product(long value) {
11 this.product *= value;
12 this.count++;
13 }
14
15 public void product(RankProduct pr) {
16 this.product *= pr.product;
17 this.count += pr.count;
18 }
19
20 public double rank() {
21 return Math.pow((double) product, 1.0/ (double) count);
22 }
23 }
Rank Product Algorithm
REVISED 3 functions for
computeRankedProductsUsingCombineByKey()
1 Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
2 public RankProduct call(Long x) {
3 return new RankProduct(x, 1);
4 }
5 };
6
7 Function2<RankProduct, Long, RankProduct> addAndCount =
8 new Function2<RankProduct, Long, RankProduct>() {
9 public RankProduct call(RankProduct a, Long x) {
10 a.product(x);
11 return a;
12 }
13 };
14
15 Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
16 new Function2<RankProduct, RankProduct, RankProduct>() {
17 public RankProduct call(RankProduct a, RankProduct b) {
18 a.product(b);
19 return a;
20 }
21 };
Moral of Story
Outline
1 Biography
2 What is COPA using Spark?
3 COPA Algorithm
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
7 References
Moral of Story
Moral of Story...
Understand data requirements
Select proper platform for implementation: Spark
Partition your RDDs properly
number of cluster nodes
number of cores per node
amount of RAM
Avoid duplicate computations: (g1, g2) and (g2, g1) are the same pair
Use filter() early and often to remove unneeded RDD elements
Avoid unnecessary RDD.saveAsTextFile(path) calls
Run on both YARN and a standalone Spark cluster and compare performance
Use groupByKey() cautiously
Prefer combineByKey() over groupByKey()
Verify your results against the R language
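The groupByKey() versus combineByKey() advice can be quantified with a toy model (plain Python with hypothetical sizes, not actual Spark metrics): groupByKey() ships every record across the shuffle, while combineByKey() first collapses each partition to at most one combiner per key.

```python
# Toy shuffle-volume comparison (illustrative numbers, not Spark metrics):
# 4 partitions, each holding 1000 values for each of 10 genes.
num_partitions, genes, values_per_gene = 4, 10, 1000

# groupByKey(): every (key, value) record crosses the network
shuffled_group_by_key = num_partitions * genes * values_per_gene

# combineByKey(): each partition emits one combiner per key it holds
shuffled_combine_by_key = num_partitions * genes

print(shuffled_group_by_key, shuffled_combine_by_key)  # 40000 40
```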
References
Outline
1 Biography
2 What is COPA using Spark?
3 COPA Algorithm
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
7 References
References
References
COPA: Cancer Outlier Profile Analysis. James W. MacDonald and Debashis Ghosh. http://bioinformatics.oxfordjournals.org/content/22/23/2950.full.pdf
mCOPA: analysis of heterogeneous features in cancer expression data. Chenwei Wang, Alperen Taciroglu, Stefan R Maetschke, Colleen C Nelson, Mark A Ragan, and Melissa J Davis. http://jclinbioinformatics.biomedcentral.com/articles/10.1186/2043-9113-2-22
Rank Product. Mahmoud Parsian. http://mapreduce4hackers.com/RankProduct_intro.html
Rank Product (source code). Mahmoud Parsian. https://github.com/mahmoudparsian/data-algorithms-book/tree/master/src/main/java/org/dataalgorithms/bonus/rankproduct
References
Thank you!
Questions?

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Cancer Outlier Profile Analysis using Apache Spark

  • 1. Cancer Outlier Profile Analysis using Apache Spark. Mahmoud Parsian, Ph.D in Computer Science, Senior Architect @ Illumina. Apache Big Data, Vancouver, BC, May 10, 2016.
  • 2. Table of Contents: 1 Biography; 2 What is COPA using Spark?; 3 COPA Algorithm; 4 Input Data; 5 Rank Product Algorithm; 6 Moral of Story; 7 References
  • 3. Outline: 1 Biography; 2 What is COPA using Spark?; 3 COPA Algorithm; 4 Input Data; 5 Rank Product Algorithm; 6 Moral of Story; 7 References
  • 4. Who am I? Name: Mahmoud Parsian. Education: Ph.D in Computer Science. Work: Senior Architect @ Illumina, Inc; lead of the Big Data Team @ Illumina; develop scalable regression algorithms; develop DNA-Seq and RNA-Seq workflows; use Java/MapReduce/Hadoop/Spark/HBase. Author of 3 books: Data Algorithms (O'Reilly: http://shop.oreilly.com/product/0636920033950.do), JDBC Recipes (Apress: http://apress.com/), JDBC MetaData Recipes (Apress: http://apress.com/)
  • 5. My book: Data Algorithms. Book: http://shop.oreilly.com/product/0636920033950.do GIT: https://github.com/mahmoudparsian/data-algorithms-book
  • 6. Outline: 1 Biography; 2 What is COPA using Spark?; 3 COPA Algorithm; 4 Input Data; 5 Rank Product Algorithm; 6 Moral of Story; 7 References
  • 7. Background: COPA and Spark. What is COPA? Cancer Outlier Profile Analysis. Is COPA important? What is an outlier? Common methods to detect outliers. Outliers in cancer gene studies. The t-statistic in outlier studies. What are the COPA algorithms?
  • 8. What is COPA? The COPA algorithm analyzes DNA microarray data for genes markedly over-expressed (outliers) in a subset of cases. COPA was the first algorithm that led to the discovery of the ERG rearrangement in prostate cancer. ERG (ETS-related gene) is an oncogene: it encodes a protein that is typically mutated in cancer. COPA identified the ETS family members ERG and ETV1 as high-ranking outliers in multiple prostate cancer profiling studies.
  • 9. Informal COPA Algorithm. Input = a cancer gene expression dataset, consisting of thousands of genes measured across hundreds of samples. STEPS: (A) → (B) → (C) → (D) → (E). (A) Genes are normalized and sorted via the COPA method. (B) COPA normalizes the median expression per gene to zero and the MAD (median absolute deviation) to one; a percentile cutoff is selected (e.g., the 75th, 90th, or 95th percentile) and genes are sorted by their COPA value at the selected percentile. (C) The top 1% of genes is deemed to have outlier expression profiles. (D) ... (E) ...
  • 10. Informal COPA Algorithm (continued). STEPS: (A) → (B) → (C) → (D) → (E). (D) A collection of outliers from independent datasets is compiled and submitted for meta-analysis. (E) Multiple datasets of a given cancer type are meta-analyzed to identify genes that are consistently called outliers across independent datasets; significance is assessed via the binomial distribution.
  • 11. What is an Outlier? Definition: an outlier is an unusual value in a dataset, one that does not fit the typical pattern of the data. Sources of outliers: measurement or recording errors; natural variation of the data (valid data).
  • 12. What is an Outlier? (figure)
  • 13. What is the COPA Statistic? COPA = Cancer Outlier Profile Analysis: a statistic designed to identify outliers in cancer gene expression profiles.
  • 14. Outlier Detection, How? Visual inspection of the data is not applicable for large, complex datasets. Automated methods: normal distribution-based method (NORMAL); Median Absolute Deviation (MAD); distance-based methods; COPA; ...
  • 15. Normal distribution-based method. f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), with ∫ f(x) dx = 1 over (−∞, +∞), where x: sample data; µ: mean of the distribution; σ: standard deviation; σ²: variance.
  • 16. Normal distribution in a picture... (figure)
  • 17. Normal distribution-based method. Assume a normal distribution for each variable. Works on one variable at a time: X_k, k = 1, ..., p. Algorithm: let x_ik be the i'th sample of variable X_k (i = 1, ..., n). Sample mean of X_k: µ_k = (1/n) Σ_{i=1..n} x_ik. Sample standard deviation of X_k: S_k = √((1/n) Σ_{i=1..n} (x_ik − µ_k)²). Calculate the statistic Z_ik = (x_ik − µ_k) / S_k for i = 1, ..., n. Label x_ik an outlier if |Z_ik| > 3.
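The NORMAL method above can be sketched in plain Java. This is an illustrative helper of ours, not code from the deck; it uses the (1/n) formulas from the slide and flags values with |Z| > 3:

```java
import java.util.Arrays;

// Hypothetical helper (ours): z-score outlier labeling per the NORMAL method.
public class ZScoreOutliers {
    public static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Population standard deviation, matching the slide's (1/n) formula
    public static double stddev(double[] x) {
        double mu = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / x.length);
    }

    // Returns true for entries with |Z| > 3
    public static boolean[] outliers(double[] x) {
        double mu = mean(x);
        double sd = stddev(x);
        boolean[] out = new boolean[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = Math.abs((x[i] - mu) / sd) > 3.0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] data = new double[21];
        Arrays.fill(data, 1.0);
        data[20] = 9.0; // one extreme value among 20 typical ones
        System.out.println(Arrays.toString(outliers(data)));
    }
}
```

Note that a single extreme value inflates both µ and S_k, which is exactly the robustness problem the next slides raise.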
  • 18. Problems with the Normal distribution-based method. Very dependent on the assumption of normality. Does not work well on big data. Works only for numeric variables. µ (mean) and S_k (standard deviation) are not robust to outliers.
  • 19. Median Absolute Deviation (MAD). Another method for dealing with the robustness problem: use the median as a robust estimate of the mean, and the MAD as a robust estimate of the standard deviation. Algorithm: calculate D_ik = |x_ik − median(X_k)| for i = 1, ..., n; calculate MAD = median(D_1k, ..., D_nk); calculate the statistic Z_ik = (x_ik − median(X_k)) / (1.4826 × MAD); label x_ik an outlier if |Z_ik| > 3.5. Note: for normally distributed data, the standard deviation σ ≈ 1.4826 × MAD.
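A plain-Java sketch of the MAD method (our illustration, not from the deck); the 1.4826 scale factor and the 3.5 cutoff follow the slide:

```java
import java.util.Arrays;

// Hypothetical helper (ours): robust outlier labeling via median and MAD.
public class MadOutliers {
    public static double median(double[] x) {
        double[] s = x.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // MAD = median of absolute deviations from the median
    public static double mad(double[] x) {
        double med = median(x);
        double[] d = new double[x.length];
        for (int i = 0; i < x.length; i++) d[i] = Math.abs(x[i] - med);
        return median(d);
    }

    // Returns true for entries with |Z| > 3.5
    public static boolean[] outliers(double[] x) {
        double med = median(x);
        double scale = 1.4826 * mad(x); // robust estimate of sigma
        boolean[] out = new boolean[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = Math.abs((x[i] - med) / scale) > 3.5;
        }
        return out;
    }
}
```

Unlike the z-score version, a single extreme value barely moves the median or the MAD, so it is flagged cleanly.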
  • 20. Outliers in Cancer Studies. In cancer studies, mutations can often amplify or turn off gene expression in only a minority of samples, i.e., produce outliers. The t-statistic may yield a high false discovery rate (FDR) when trying to detect changes that occur in a small number of samples. COPA addresses this. Is there a better method than COPA? Maybe.
  • 21. Outline: 1 Biography; 2 What is COPA using Spark?; 3 COPA Algorithm; 4 Input Data; 5 Rank Product Algorithm; 6 Moral of Story; 7 References
  • 22. COPA Algorithm. Gene expression values are median centered: each gene's median expression value is set to zero. The median absolute deviation (MAD) is calculated and scaled to 1 by dividing each gene expression value by its MAD. The 75th, 90th, and 95th percentiles of the transformed expression values are tabulated for each gene. Genes are rank-ordered by their percentile scores, providing a prioritized list of outlier profiles.
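For a single gene, the COPA transformation and percentile score can be sketched as follows. This is an illustrative helper of ours; the nearest-rank percentile rule is an assumption, since the deck does not specify a percentile method:

```java
import java.util.Arrays;

// Hypothetical helper (ours): median-center a gene's expression profile,
// scale by its MAD, then report the value at a chosen percentile.
public class CopaScore {
    static double median(double[] x) {
        double[] s = x.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // COPA score of one gene at the given percentile (e.g., 0.75, 0.90, 0.95)
    public static double score(double[] expr, double percentile) {
        double med = median(expr);
        double[] dev = new double[expr.length];
        for (int i = 0; i < expr.length; i++) dev[i] = Math.abs(expr[i] - med);
        double mad = median(dev);
        // transformed values: (x - median) / MAD
        double[] t = new double[expr.length];
        for (int i = 0; i < expr.length; i++) t[i] = (expr[i] - med) / mad;
        Arrays.sort(t);
        // nearest-rank percentile of the transformed values (assumption)
        int idx = (int) Math.ceil(percentile * t.length) - 1;
        return t[Math.max(0, idx)];
    }
}
```

A gene over-expressed in only a few samples keeps a small median and MAD, so its high percentile score stands out, which is exactly what the ranking step exploits.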
  • 23. COPA Transformation. Median and MAD are used for the transformation, as opposed to mean and standard deviation, so that outlier expression values do not unduly influence the distribution estimates and are thus preserved post-normalization.
  • 24. COPA: ONE-Sided and TWO-Sided. ONE-Sided: put all studies together; calculate a COPA score per GENE-ID. TWO-Sided: create K study groups; rank all groups; calculate a COPA score per GENE-ID (per study group).
  • 25. COPA: TWO-Sided. Handles K studies, where each study is a set of samples. For each study, find the COPA scores; the result is the median of the COPA scores over all studies. For K studies, report per GENE-ID_i: the mean of the COPA scores, and the geometric mean (rank product) of the Rank-Converted-Scores_ij for j = 1, ..., K.
  • 26. Is COPA Big Data? General case: 1000's of studies; each study = 100's of samples; each sample = 100,000+ (K, V) pairs. Example: 4000 studies, each with 300 samples. Total samples: 4000 × 300 = 1,200,000. Analyze: 1,200,000 × 100,000 = 120,000,000,000 (K, V) pairs.
  • 27. Input Format. Each sample is a set of (K, V) pairs. General format: K=GENE-ID, V=GENE-VALUE. Example-1: g127,1.20. Example-2: g4567,0.09.
  • 29. Outline: 1 Biography; 2 What is COPA using Spark?; 3 COPA Algorithm; 4 Input Data; 5 Rank Product Algorithm; 6 Moral of Story; 7 References
  • 30. Input Data Format. A set of k studies {S1, S2, ..., Sk}. Each study has billions of (Key, Value) pairs. Sample record: <key-as-string><,><value-as-double-data-type>
  • 31. Input Data Persistence. Data persists in HDFS. Directory structure:
      /input/study-0001/file-0001-1.txt
      /input/study-0001/file-0001-2.txt
      /input/study-0001/file-0001-3.txt
      ...
      /input/study-3265/file-3265-1.txt
      /input/study-3265/file-3265-2.txt
      /input/study-3265/file-3265-3.txt
      ...
  • 32. What is a Ranking? Let S = {(K1, 40), (K2, 70), (K3, 90), (K4, 80)}. Then Rank(S) = {(K1, 4), (K2, 3), (K3, 1), (K4, 2)}, since 90 > 80 > 70 > 40. Ranks are assigned as 1, 2, 3, 4, ..., N.
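The ranking rule above (largest value gets rank 1) can be sketched in plain Java (our helper, not from the deck):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (ours): assign ranks 1..N by descending value.
public class Ranking {
    public static Map<String, Long> rank(Map<String, Double> data) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(data.entrySet());
        // sort descending by value, so the largest value comes first
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Long> ranks = new LinkedHashMap<>();
        long r = 1;
        for (Map.Entry<String, Double> e : entries) {
            ranks.put(e.getKey(), r++);
        }
        return ranks;
    }
}
```

For S = {(K1, 40), (K2, 70), (K3, 90), (K4, 80)} this yields the ranks from the slide: K3 → 1, K4 → 2, K2 → 3, K1 → 4.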
  • 33. What is Rank Product? Let {A1, ..., Ak} be a set of (key, value) datasets where keys are unique within each dataset. Examples of (key, value) pairs: (item, number of items sold); (user, number of followers of the user); (gene, test expression). The rank product of {A1, ..., Ak} is computed from the ranks r_i of key i across all k datasets. Typically, ranks are assigned based on the sorted values of the datasets.
  • 34. What is Rank Product? Let A1 = {(K1, 30), (K2, 60), (K3, 10), (K4, 80)}; then Rank(A1) = {(K1, 3), (K2, 2), (K3, 4), (K4, 1)}, since 80 > 60 > 30 > 10. Note that 1 is the highest rank (assigned to the largest value). Let A2 = {(K1, 90), (K2, 70), (K3, 40), (K4, 50)}; then Rank(A2) = {(K1, 1), (K2, 2), (K3, 4), (K4, 3)}, since 90 > 70 > 50 > 40. Let A3 = {(K1, 4), (K2, 8)}; then Rank(A3) = {(K1, 2), (K2, 1)}, since 8 > 4. The rank product of {A1, A2, A3} is: {(K1, (3 × 1 × 2)^(1/3)), (K2, (2 × 2 × 1)^(1/3)), (K3, (4 × 4)^(1/2)), (K4, (1 × 3)^(1/2))}.
  • 35. Calculation of the Rank Product. Given n genes and k replicates, let e_g,i be the fold change and r_g,i the rank of gene g in the i'th replicate. Compute the rank product (RP) via the geometric mean: RP(g) = (∏_{i=1..k} r_g,i)^(1/k), i.e., the k-th root of the product of the k ranks.
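The geometric-mean formula, and the worked example on the previous slide, can be checked with a tiny helper (ours, not from the deck):

```java
// Hypothetical helper (ours): rank product as the geometric mean of a
// gene's ranks across the studies in which it appears.
public class RankProduct {
    public static double rankProduct(long[] ranks) {
        double product = 1.0;
        for (long r : ranks) product *= r;
        // k-th root of the product of the k ranks
        return Math.pow(product, 1.0 / ranks.length);
    }
}
```

Using the example's ranks for K1 across A1, A2, A3, rankProduct(new long[]{3, 1, 2}) is the cube root of 6; a key that appears in only two datasets, like K3 with ranks {4, 4}, simply takes the square root of its product.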
  • 36. Formalizing Rank Product. Let S = {S1, S2, ..., Sk} be a set of k studies, where k > 0 and each study represents a micro-array experiment. Let each study Si (i = 1, 2, ..., k) have an arbitrary number of assays identified by {Ai1, Ai2, ...}. Let each assay (which can be represented as a text file) be a set of an arbitrary number of records in the format: <gene_id><,><gene_value_as_double_data_type>. Let gene_id be in {g1, g2, ..., gn} (we have n genes).
  • 37. Rank Product in 2 Steps. Let S = {S1, S2, ..., Sk} be a set of k studies. STEP-1: find the mean of the values per study per gene (you may replace the "mean" function by your desired function); finding the mean involves groupByKey() or combineByKey(). STEP-2: perform the "Rank Product" per gene across all studies; finding the "Rank Product" also involves groupByKey() or combineByKey().
  • 38. Formalizing Rank Product. The last step is to find the rank product for each gene across the studies: S1 = {(g1, r_11), (g2, r_12), ...}; S2 = {(g1, r_21), (g2, r_22), ...}; ...; Sk = {(g1, r_k1), (g2, r_k2), ...}. Then the rank product of g_j is RP(g_j) = (∏_{i=1..k} r_i,j)^(1/k), i.e., the k-th root of the product of the k ranks.
  • 39. Spark Solution for Rank Product. 1 Read the k input paths (one path per study). 2 Find the mean per gene per study. 3 Sort the genes by value per study and then assign rank values; to sort a dataset by value, swap the key with the value and then sort by key. 4 Assign ranks 1, 2, ..., N (1 is assigned to the highest value); use JavaPairRDD.zipWithIndex(), which zips the RDD with its element indices (these indices become the ranks); Spark indices start from 0, so we add 1. 5 Finally, compute the rank product per gene over all studies: JavaPairRDD.groupByKey() (not efficient) or JavaPairRDD.combineByKey() (efficient).
  • 44. Two Spark Solutions: groupByKey() and combineByKey(). Two solutions are provided using Spark 1.6.0: SparkRankProductUsingGroupByKey uses groupByKey(); SparkRankProductUsingCombineByKey uses combineByKey().
  • 45. How does groupByKey() work? (figure)
  • 46. How does reduceByKey() work? (figure)
  • 47. Outline: 1 Biography; 2 What is COPA using Spark?; 3 COPA Algorithm; 4 Input Data; 5 Rank Product Algorithm; 6 Moral of Story; 7 References
  • 48. Rank Product Algorithm in Spark: High-Level Steps.
      STEP-1: import required interfaces and classes
      STEP-2: handle input parameters
      STEP-3: create a Spark context object
      STEP-4: create the list of studies (1, 2, ..., K)
      STEP-5: compute the mean per gene per study
      STEP-6: sort by values
      STEP-7: assign ranks
      STEP-8: compute the rank product
      STEP-9: save the result in HDFS
  • 49. Main Driver. Listing 1: main()
      public static void main(String[] args) throws Exception {
        // args[0] = output path
        // args[1] = number of studies (K)
        // args[2] = input path for study 1
        // args[3] = input path for study 2
        // ...
        // args[K+1] = input path for study K
        final String outputPath = args[0];
        final String numOfStudiesAsString = args[1];
        final int K = Integer.parseInt(numOfStudiesAsString);
        List<String> inputPathMultipleStudies = new ArrayList<String>();
        for (int i=1; i <= K; i++) {
          String singleStudyInputPath = args[1+i];
          inputPathMultipleStudies.add(singleStudyInputPath);
        }
        performRrankProduct(inputPathMultipleStudies, outputPath);
        System.exit(0);
      }
  • 50. groupByKey() vs. combineByKey(). Which one should we use? According to their semantics, both will give you the same answer, but combineByKey() is more efficient. In some situations, groupByKey() can even cause out-of-disk problems. In general, reduceByKey() and combineByKey() are preferred over groupByKey(). Spark shuffling is more efficient for reduceByKey() than for groupByKey(): in the shuffle step for reduceByKey(), data is combined so that each partition outputs at most one value for each key to send over the network, while in the shuffle step for groupByKey(), all the data is wastefully sent over the network and collected on the reduce workers. To understand the difference, the following figures show how the shuffle is done for reduceByKey() and groupByKey().
  • 51. Understanding groupByKey() (figure)
  • 52. Understanding reduceByKey() or combineByKey() (figure)
  • 53. Listing 2: performRrankProduct()
      public static void performRrankProduct(
          final List<String> inputPathMultipleStudies,
          final String outputPath) throws Exception {
        // create a context object, which is used
        // as a factory for creating new RDDs
        JavaSparkContext context = Util.createJavaSparkContext(useYARN);

        // Spark 1.6.0 requires an array for creating a union of many RDDs
        int index = 0;
        JavaPairRDD<String, Double>[] means =
            new JavaPairRDD[inputPathMultipleStudies.size()];
        for (String inputPathSingleStudy : inputPathMultipleStudies) {
          means[index] = computeMeanByGroupByKey(context, inputPathSingleStudy);
          index++;
        }

        // next compute rank
        ...
      }
  • 54. Listing 3: performRrankProduct() (continued)
      public static void performRrankProduct(
        ...
        // next compute rank:
        // 1. sort values based on the absolute value of the mean
        // 2. assign ranks from 1 to N
        // 3. calculate the rank product for each gene
        JavaPairRDD<String,Long>[] ranks = new JavaPairRDD[means.length];
        for (int i=0; i < means.length; i++) {
          ranks[i] = assignRank(means[i]);
        }
        // calculate rank products
        // <gene, Tuple2<rankedProduct, N>>
        JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts =
            computeRankedProducts(context, ranks);

        // save the result; shuffle=true
        rankedProducts.coalesce(1,true).saveAsTextFile(outputPath);

        // close the context and we are done
        context.close();
      }
  • 55. STEP-3: create a Spark context object
      public static JavaSparkContext createJavaSparkContext(boolean useYARN) {
        JavaSparkContext context;
        if (useYARN) {
          context = new JavaSparkContext("yarn-cluster", "MyAnalysis"); // YARN
        }
        else {
          context = new JavaSparkContext(); // Spark cluster
        }
        // inject efficiency
        SparkConf sparkConf = context.getConf();
        sparkConf.set("spark.kryoserializer.buffer.mb","32");
        sparkConf.set("spark.shuffle.file.buffer.kb","64");
        // set a fast serializer
        sparkConf.set("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer");
        sparkConf.set("spark.kryo.registrator",
            "org.apache.spark.serializer.KryoRegistrator");
        return context;
      }
  • 56. Listing 4: computeMeanByGroupByKey()
      static JavaPairRDD<String, Double> computeMeanByGroupByKey(
          JavaSparkContext context,
          final String inputPath) throws Exception {
        JavaPairRDD<String, Double> genes =
            getGenesUsingTextFile(context, inputPath, 30);

        // group values by gene
        JavaPairRDD<String, Iterable<Double>> groupedByGene = genes.groupByKey();

        // calculate the mean per gene
        JavaPairRDD<String, Double> meanRDD = groupedByGene.mapValues(
          new Function<
                       Iterable<Double>, // input
                       Double            // output: mean
                      >() {
            @Override
            public Double call(Iterable<Double> values) {
              double sum = 0.0;
              int count = 0;
              for (Double v : values) {
                sum += v;
                count++;
              }
              // calculate the mean of the samples
              double mean = sum / ((double) count);
              return mean;
            }
        });
        return meanRDD;
      }
  • 57. Listing 5: getGenesUsingTextFile()
      static JavaPairRDD<String, Double> getGenesUsingTextFile(
          JavaSparkContext context,
          final String inputPath,
          final int numberOfPartitions) throws Exception {

        // read input and create the first RDD
        // JavaRDD<String>: where String = "gene,test_expression"
        JavaRDD<String> records = context.textFile(inputPath, numberOfPartitions);

        // for each record, emit (K=gene, V=test_expression)
        JavaPairRDD<String, Double> genes
          = records.mapToPair(new PairFunction<String, String, Double>() {
            @Override
            public Tuple2<String, Double> call(String rec) {
              // rec = "gene,test_expression"
              String[] tokens = StringUtils.split(rec, ",");
              // tokens[0] = gene
              // tokens[1] = test_expression
              return new Tuple2<String, Double>(
                  tokens[0], Double.parseDouble(tokens[1]));
            }
        });
        return genes;
      }
  • 58. Listing 6: Assign Rank
      // result is JavaPairRDD<String, Long> = (gene, rank)
      static JavaPairRDD<String,Long> assignRank(JavaPairRDD<String,Double> rdd) {
        // swap key and value (to sort by key); convert the value to abs(value)
        JavaPairRDD<Double,String> swappedRDD = rdd.mapToPair(
          new PairFunction<Tuple2<String, Double>, // T: input
                           Double,                 // K
                           String>() {             // V
            public Tuple2<Double,String> call(Tuple2<String, Double> s) {
              return new Tuple2<Double,String>(Math.abs(s._2), s._1);
            }
        });
        // we need 1 partition so that we can zip numbers into this RDD by zipWithIndex()
        JavaPairRDD<Double,String> sorted = swappedRDD.sortByKey(false, 1); // false means descending
        // zipWithIndex() produces Long values 0, 1, 2, ...;
        // for ranking we need 1, 2, 3, ..., so we will add 1
        JavaPairRDD<Tuple2<Double,String>,Long> indexed = sorted.zipWithIndex();
        // next convert JavaPairRDD<Tuple2<value,gene>,rank> into JavaPairRDD<gene,rank>
        JavaPairRDD<String, Long> ranked = indexed.mapToPair(
          new PairFunction<Tuple2<Tuple2<Double,String>,Long>, // T: input
                           String,                             // K: gene
                           Long>() {                           // V: rank
            public Tuple2<String, Long> call(Tuple2<Tuple2<Double,String>,Long> s) {
              return new Tuple2<String,Long>(s._1._2, s._2 + 1); // ranks are 1, 2, ..., n
            }
        });
        return ranked;
      }
  • 59. Rank Product Algorithm
Listing 7: Compute Rank Product using groupByKey()

static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProducts(
        JavaSparkContext context,
        JavaPairRDD<String, Long>[] ranks) {
    JavaPairRDD<String, Long> unionRDD = context.union(ranks);

    // next find unique keys, with their associated values
    JavaPairRDD<String, Iterable<Long>> groupedByGeneRDD = unionRDD.groupByKey();

    // next calculate ranked products and the number of elements
    JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts =
        groupedByGeneRDD.mapValues(
            new Function<
                    Iterable<Long>,         // input: ranks for all studies
                    Tuple2<Double, Integer> // output: (rankedProduct, N)
                   >() {
                @Override
                public Tuple2<Double, Integer> call(Iterable<Long> values) {
                    int N = 0;
                    long products = 1;
                    for (Long v : values) {
                        products *= v;
                        N++;
                    }
                    double rankedProduct = Math.pow((double) products, 1.0 / ((double) N));
                    return new Tuple2<Double, Integer>(rankedProduct, N);
                }
            });
    return rankedProducts;
}
  • 60. Rank Product Algorithm
Next, focus on combineByKey(). We need to develop two functions:
  computeMeanByCombineByKey()
  computeRankedProductsUsingCombineByKey()
  • 61. Rank Product Algorithm
combineByKey(): how does it work?
combineByKey() is the most general of the per-key aggregation functions; most of the other per-key combiners are implemented with it. Like aggregate(), combineByKey() allows the user to return values that are not the same type as the input data. To understand combineByKey(), it is useful to think of how it handles each element it processes: as combineByKey() goes through the elements in a partition, each element either has a key it has not seen before or has the same key as a previous element.
  • 62. Rank Product Algorithm
combineByKey() by Combiner (C)

public <C> JavaPairRDD<K, C> combineByKey(
    Function<V, C> createCombiner,     // V -> C
    Function2<C, V, C> mergeValue,     // C + V -> C
    Function2<C, C, C> mergeCombiners) // C + C -> C

Provide 3 functions:
1. createCombiner, which turns a V into a C // creates a one-element combiner
2. mergeValue, to merge a V into a C // folds the value into the combiner
3. mergeCombiners, to combine two Cs into a single C
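How the three functions cooperate can be sketched without Spark; a minimal pure-Python simulation (the function and variable names are my own, and the per-key average combiner matches the one on the following slides):

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Simulate combineByKey(): partitions is a list of lists of (key, value)."""
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key not in combiners:
                # first time this key is seen in this partition
                combiners[key] = create_combiner(value)
            else:
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)
    # "shuffle" step: merge the per-partition combiners across partitions
    result = {}
    for combiners in per_partition:
        for key, c in combiners.items():
            result[key] = merge_combiners(result[key], c) if key in result else c
    return result

# per-key average, C = (sum, count)
avg = combine_by_key(
    [[("K_1", 30.0), ("K_2", 60.0)], [("K_1", 90.0)]],
    lambda v: (v, 1),
    lambda C, v: (C[0] + v, C[1] + 1),
    lambda C1, C2: (C1[0] + C2[0], C1[1] + C2[1]))
# avg["K_1"] == (120.0, 2), i.e. a mean of 60.0
```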
  • 63. Rank Product Algorithm
computeMeanByCombineByKey(): Define C data structure

//
// AverageCount is used by combineByKey()
// to hold the total values and their count.
//
static class AverageCount implements Serializable {
    double total;
    int count;

    public AverageCount(double total, int count) {
        this.total = total;
        this.count = count;
    }

    public double average() {
        return total / (double) count;
    }
}
  • 64. Rank Product Algorithm
computeMeanByCombineByKey()

static JavaPairRDD<String, Double> computeMeanByCombineByKey(
        JavaSparkContext context,
        final String inputPath) throws Exception {

    JavaPairRDD<String, Double> genes = getGenesUsingTextFile(context, inputPath, 30);

    // we need 3 functions to be able to use combineByKey()
    Function<Double, AverageCount> createCombiner = ...
    Function2<AverageCount, Double, AverageCount> addAndCount = ...
    Function2<AverageCount, AverageCount, AverageCount> mergeCombiners = ...

    JavaPairRDD<String, AverageCount> averageCounts =
        genes.combineByKey(createCombiner, addAndCount, mergeCombiners);

    // now compute the mean/average per gene
    JavaPairRDD<String, Double> meanRDD = averageCounts.mapToPair(
        new PairFunction<
                Tuple2<String, AverageCount>, // T: input
                String,                       // K
                Double                        // V
               >() {
            @Override
            public Tuple2<String, Double> call(Tuple2<String, AverageCount> s) {
                return new Tuple2<String, Double>(s._1, s._2.average());
            }
        });
    return meanRDD;
}
  • 65. Rank Product Algorithm
3 functions for combineByKey()

Function<Double, AverageCount> createCombiner = new Function<Double, AverageCount>() {
    @Override
    public AverageCount call(Double x) {
        return new AverageCount(x, 1);
    }
};

Function2<AverageCount, Double, AverageCount> addAndCount =
    new Function2<AverageCount, Double, AverageCount>() {
        @Override
        public AverageCount call(AverageCount a, Double x) {
            a.total += x;
            a.count += 1;
            return a;
        }
    };

Function2<AverageCount, AverageCount, AverageCount> mergeCombiners =
    new Function2<AverageCount, AverageCount, AverageCount>() {
        @Override
        public AverageCount call(AverageCount a, AverageCount b) {
            a.total += b.total;
            a.count += b.count;
            return a;
        }
    };
  • 66. Rank Product Algorithm
Per-key average using combineByKey() in Python

# C = (sum, count)
sumCount = nums.combineByKey(
    (lambda v: (v, 1)),
    (lambda C, v: (C[0] + v, C[1] + 1)),
    (lambda C1, C2: (C1[0] + C2[0], C1[1] + C2[1]))
)

# each element is a single (key, C) tuple, so unpack it inside the lambda
mean = sumCount.map(lambda kv: (kv[0], kv[1][0] / kv[1][1])).collectAsMap()
print(mean)
  • 67. Rank Product Algorithm
Sample Run using groupByKey(): script

# define input/output for Hadoop/HDFS
# args[0] = output path
# args[1] = number of studies (K)
# args[2] = input path for study 1
# args[3] = input path for study 2
# ...
# args[K+1] = input path for study K
OUTPUT=/rankproduct/output
NUM_OF_INPUT=3
INPUT1=/rankproduct/input1
INPUT2=/rankproduct/input2
INPUT3=/rankproduct/input3

# remove all files under the output path
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT

# submit the Spark job
driver=org.dataalgorithms.bonus.rankproduct.spark.SparkRankProductUsingGroupByKey
$SPARK_HOME/bin/spark-submit --class $driver \
  --master yarn-cluster \
  --jars $OTHER_JARS \
  --conf "spark.yarn.jar=$SPARK_JAR" \
  $APP_JAR $OUTPUT $NUM_OF_INPUT $INPUT1 $INPUT2 $INPUT3
  • 68. Rank Product Algorithm
Sample Run using groupByKey(): input

# hadoop fs -cat /rankproduct/input1/rp1.txt
K_1,30.0
K_2,60.0
K_3,10.0
K_4,80.0

# hadoop fs -cat /rankproduct/input2/rp2.txt
K_1,90.0
K_2,70.0
K_3,40.0
K_4,50.0

# hadoop fs -cat /rankproduct/input3/rp3.txt
K_1,4.0
K_2,8.0
  • 69. Rank Product Algorithm
Sample Run using groupByKey(): output

# hadoop fs -cat /rankproduct/output/part*
(K_2,(1.5874010519681994,3))
(K_3,(4.0,2))
(K_1,(1.8171205928321397,3))
(K_4,(1.7320508075688772,2))
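As a sanity check, the output above can be reproduced end to end with a small pure-Python sketch of the rank product (helper names are my own; no Spark involved):

```python
# the three study inputs from the previous slide
studies = [
    {"K_1": 30.0, "K_2": 60.0, "K_3": 10.0, "K_4": 80.0},  # input1
    {"K_1": 90.0, "K_2": 70.0, "K_3": 40.0, "K_4": 50.0},  # input2
    {"K_1": 4.0, "K_2": 8.0},                              # input3
]

def ranks(study):
    # rank 1 = largest |expression|, matching assignRank()
    ordered = sorted(study, key=lambda g: abs(study[g]), reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}

# multiply each gene's ranks across studies and count how many studies it appears in
products = {}
for study in studies:
    for gene, rank in ranks(study).items():
        p, n = products.get(gene, (1, 0))
        products[gene] = (p * rank, n + 1)

# rank product = N-th root of the product of the N ranks
rank_product = {g: (p ** (1.0 / n), n) for g, (p, n) in products.items()}
# K_2 has ranks 2, 2, 1 -> (2*2*1)^(1/3) = 1.5874..., matching the output above
```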
  • 70. Rank Product Algorithm
computeRankedProductsUsingCombineByKey(): Define C data structure

/**
 * RankProduct is used by combineByKey() to hold
 * the total product values and their count.
 */
static class RankProduct implements Serializable {
    long product;
    int count;

    public RankProduct(long product, int count) {
        this.product = product;
        this.count = count;
    }

    public double rank() {
        return Math.pow((double) product, 1.0 / (double) count);
    }
}
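Note that rank() computes the geometric mean of the accumulated ranks; a one-line check in plain Python, using K_2's ranks (2, 2, 1) from the sample run:

```python
import math

product, count = 2 * 2 * 1, 3
rank = product ** (1.0 / count)  # what RankProduct.rank() computes
# equivalently, the geometric mean of the individual ranks:
geometric_mean = math.exp(sum(math.log(r) for r in (2, 2, 1)) / 3)
```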
  • 71. Rank Product Algorithm
computeRankedProductsUsingCombineByKey()

// JavaPairRDD<String, Tuple2<Double, Integer>> = <gene, Tuple2(rankedProduct, N)>
// where N is the number of elements used to compute the rankedProduct
static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProductsUsingCombineByKey(
        JavaSparkContext context,
        JavaPairRDD<String, Long>[] ranks) {
    JavaPairRDD<String, Long> unionRDD = context.union(ranks);

    // we need 3 functions to be able to use combineByKey()
    Function<Long, RankProduct> createCombiner = ...
    Function2<RankProduct, Long, RankProduct> addAndCount = ...
    Function2<RankProduct, RankProduct, RankProduct> mergeCombiners = ...

    // next find unique keys, with their associated ranks
    JavaPairRDD<String, RankProduct> combinedByGeneRDD =
        unionRDD.combineByKey(createCombiner, addAndCount, mergeCombiners);

    // next calculate ranked products and the number of elements
    JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts = combinedByGeneRDD.mapValues(
        new Function<
                RankProduct,            // input: RankProduct
                Tuple2<Double, Integer> // output: (rankedProduct, count)
               >() {
            @Override
            public Tuple2<Double, Integer> call(RankProduct value) {
                double theRankedProduct = value.rank();
                return new Tuple2<Double, Integer>(theRankedProduct, value.count);
            }
        });
    return rankedProducts;
}
  • 72. Rank Product Algorithm
3 functions for computeRankedProductsUsingCombineByKey()

Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
    @Override
    public RankProduct call(Long x) {
        return new RankProduct(x, 1);
    }
};

Function2<RankProduct, Long, RankProduct> addAndCount =
    new Function2<RankProduct, Long, RankProduct>() {
        @Override
        public RankProduct call(RankProduct a, Long x) {
            a.product *= x;
            a.count += 1;
            return a;
        }
    };

Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
    new Function2<RankProduct, RankProduct, RankProduct>() {
        @Override
        public RankProduct call(RankProduct a, RankProduct b) {
            a.product *= b.product;
            a.count += b.count;
            return a;
        }
    };
  • 73. Rank Product Algorithm
REVISED computeRankedProductsUsingCombineByKey(): Define C data structure

static class RankProduct implements Serializable {
    long product;
    int count;

    public RankProduct(long product, int count) {
        this.product = product;
        this.count = count;
    }

    public void product(long value) {
        this.product *= value;
        this.count++;
    }

    public void product(RankProduct pr) {
        this.product *= pr.product;
        this.count += pr.count;
    }

    public double rank() {
        return Math.pow((double) product, 1.0 / (double) count);
    }
}
  • 74. Rank Product Algorithm
REVISED 3 functions for computeRankedProductsUsingCombineByKey()

Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
    public RankProduct call(Long x) {
        return new RankProduct(x, 1);
    }
};

Function2<RankProduct, Long, RankProduct> addAndCount =
    new Function2<RankProduct, Long, RankProduct>() {
        public RankProduct call(RankProduct a, Long x) {
            a.product(x);
            return a;
        }
    };

Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
    new Function2<RankProduct, RankProduct, RankProduct>() {
        public RankProduct call(RankProduct a, RankProduct b) {
            a.product(b);
            return a;
        }
    };
  • 75. Moral of Story Outline 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  • 76. Moral of Story
Moral of Story...
  Understand the data requirements
  Select the proper platform for implementation: Spark
  Partition your RDDs properly:
    number of cluster nodes
    number of cores per node
    amount of RAM
  Avoid unnecessary computations: (g1, g2), (g2, g1)
  Use filter() often to remove unneeded RDD elements
  Avoid unnecessary RDD.saveAsTextFile(path)
  Run on both YARN and a Spark cluster and compare performance
  Use groupByKey() cautiously
  Prefer combineByKey() over groupByKey()
  Verify your test results against the R language
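The "prefer combineByKey() over groupByKey()" point can be made concrete with a back-of-the-envelope sketch (pure Python, illustrative record counts only, not a Spark benchmark): groupByKey() ships every raw (key, value) record across the shuffle, while combineByKey() pre-aggregates within each partition and ships one combiner per key per partition.

```python
# one hot key whose values are spread across 4 partitions, 1000 values each
partitions = [[("gene_1", float(i)) for i in range(1000)] for _ in range(4)]

# groupByKey(): every single (key, value) record crosses the shuffle
shuffled_by_group = sum(len(part) for part in partitions)

# combineByKey(): each partition ships only one combiner per distinct key
shuffled_by_combine = sum(len({k for k, _ in part}) for part in partitions)

# 4000 records shuffled vs 4 combiners -- the gap grows with values per key
```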
  • 77. References Outline 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  • 78. References
References
COPA: Cancer Outlier Profile Analysis, James W. MacDonald and Debashis Ghosh, http://bioinformatics.oxfordjournals.org/content/22/23/2950.full.pdf
mCOPA: analysis of heterogeneous features in cancer expression data, Chenwei Wang, Alperen Taciroglu, Stefan R Maetschke, Colleen C Nelson, Mark A Ragan, and Melissa J Davis, http://jclinbioinformatics.biomedcentral.com/articles/10.1186/2043-9113-2-22
Rank Product, Mahmoud Parsian, http://mapreduce4hackers.com/RankProduct_intro.html
Rank Product, Mahmoud Parsian, https://github.com/mahmoudparsian/data-algorithms-book/tree/master/src/main/java/org/dataalgorithms/bonus/rankproduct
  • 79. References Thank you! Questions?