
Cancer Outlier Profile Analysis (COPA) using Apache Spark

  1. Cancer Outlier Profile Analysis using Apache Spark. Mahmoud Parsian, Ph.D. in Computer Science, Senior Architect @ illumina. Apache Big Data, Vancouver, BC, May 10, 2016. (1 / 74)
  2. Table of Contents 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  3. Biography Outline 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  4. Biography Who am I? Name: Mahmoud Parsian. Education: Ph.D. in Computer Science. Work: Senior Architect @ Illumina, Inc; lead of the Big Data team @ Illumina; develop scalable regression algorithms; develop DNA-Seq and RNA-Seq workflows; use Java/MapReduce/Hadoop/Spark/HBase. Author of 3 books: Data Algorithms (O'Reilly: http://shop.oreilly.com/product/0636920033950.do), JDBC Recipes (Apress: http://apress.com/), JDBC Metadata Recipes (Apress: http://apress.com/)
  5. Biography My book: Data Algorithms. Book: http://shop.oreilly.com/product/0636920033950.do GIT: https://github.com/mahmoudparsian/data-algorithms-book
  6. What is COPA using Spark? Outline 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  7. What is COPA using Spark? Background: COPA, Spark. What is COPA? Cancer Outlier Profile Analysis. Is COPA important? What is an outlier? Common methods to detect outliers. Outliers in cancer gene studies. T-statistic in outlier studies. What are COPA algorithms?
  8. What is COPA using Spark? What is COPA? The COPA algorithm analyzes DNA microarray data for genes markedly over-expressed (outliers) in a subset of cases. COPA was the first algorithm that led to the discovery of the ERG rearrangement in prostate cancer. ERG (ETS-related gene) is an oncogene, meaning that it encodes a protein that is typically mutated in cancer. COPA identified the ETS family members ERG and ETV1 as high-ranking outliers in multiple prostate cancer profiling studies
  9. What is COPA using Spark? Informal COPA Algorithm. Input = a cancer gene expression dataset, consisting of thousands of genes measured across hundreds of samples. STEPS: (A) → (B) → (C) → (D) → (E). (A) Genes are normalized and sorted via the COPA method. (B) COPA normalizes the median expression per gene to zero and the MAD (median absolute deviation) to one; a percentile cutoff is selected (e.g., 75th, 90th, or 95th) and genes are sorted by their COPA value at the selected percentile. (C) The top 1% of genes is deemed to have outlier expression profiles. (D) ... (E) ...
  10. What is COPA using Spark? Informal COPA Algorithm. STEPS: (A) → (B) → (C) → (D) → (E). (D) A collection of outliers from independent datasets is compiled and submitted for meta-analysis. (E) Multiple datasets of a given cancer type are meta-analyzed to identify genes that are consistently called outliers across independent datasets; significance is assessed via the binomial distribution
  11. What is COPA using Spark? What is an Outlier? Definition: an outlier is an unusual value in a dataset; one that does not fit the typical pattern of data. Sources of outliers: recording or measurement errors; natural variation of the data (valid data)
  12. What is COPA using Spark? What is an Outlier?
  13. What is COPA using Spark? What is COPA Statistics? COPA = Cancer Outlier Profile Analysis: a statistic designed to identify outliers in cancer gene expression profiles
  14. What is COPA using Spark? Outlier Detection, How? Visual inspection of the data is not applicable for large, complex datasets. Automated methods: normal distribution-based method (NORMAL), Median Absolute Deviation (MAD), distance-based methods, COPA, ...
  15. What is COPA using Spark? Normal distribution-based method: f(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²)), with ∫_{−∞}^{+∞} f(x) dx = 1, where x: sample data, µ: mean of the distribution, σ: standard deviation, σ²: variance
  16. What is COPA using Spark? Normal distribution in picture...
  17. What is COPA using Spark? Normal distribution-based method. Assume a normal distribution for each variable. Works on one variable at a time: X_k, k = 1, ..., p. Algorithm: Let x_ik be the i'th sample value for variable X_k (i = 1, ..., n). Sample mean for X_k: µ_k = (1/n) Σ_{i=1}^{n} x_ik. Sample standard deviation for X_k: S_k = sqrt((1/n) Σ_{i=1}^{n} (x_ik − µ_k)²). Calculate the standardized statistic Z_ik = (x_ik − µ_k) / S_k for i = 1, ..., n. Label x_ik an outlier if |Z_ik| > 3
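The normal distribution-based method above can be sketched in plain Java (a hedged illustration, not the deck's code; class and method names are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;

// Standardize each value with the sample mean and standard deviation,
// and flag values with |Z| > 3 as outliers.
public class ZScoreOutliers {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    static double stdDev(double[] x) {
        double mu = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / x.length); // 1/n, as in the slide's formula
    }

    // indices i with |Z_i| > 3
    static List<Integer> outliers(double[] x) {
        double mu = mean(x);
        double s = stdDev(x);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            if (Math.abs((x[i] - mu) / s) > 3.0) result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // 20 values near 10 plus one extreme value at index 20
        double[] x = new double[21];
        for (int i = 0; i < 20; i++) x[i] = 10.0 + (i % 5) * 0.1;
        x[20] = 100.0;
        System.out.println(outliers(x)); // only the extreme value is flagged
    }
}
```

Note how the extreme value inflates both the mean and the standard deviation before it is tested; this is exactly the robustness problem the next slides raise.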
  18. What is COPA using Spark? Problems with the normal distribution-based method: very dependent on the assumption of normality; does not work well on big data; works only for numeric variables; µ (mean) and S_k (standard deviation) are not robust to outliers
  19. What is COPA using Spark? Median Absolute Deviation (MAD). Another method for dealing with the robustness problem: use the median as a robust estimate of the mean, and MAD as a robust estimate of the standard deviation. Algorithm: Calculate D_ik = |x_ik − median(X_k)| for i = 1, ..., n. Calculate MAD = median(D_1k, ..., D_nk). Calculate the standardized statistic Z_ik = (x_ik − median(X_k)) / (1.4826 × MAD). Label x_ik an outlier if |Z_ik| > 3.5. Note: for normally distributed data, the standard deviation σ ≈ 1.4826 × MAD
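The MAD-based algorithm can be sketched the same way (again a hedged plain-Java illustration, not the deck's code):

```java
import java.util.Arrays;

// Robust outlier test: standardize with median and MAD instead of mean
// and standard deviation, and flag |Z| > 3.5.
public class MadOutliers {

    static double median(double[] x) {
        double[] s = x.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // median of absolute deviations from the median
    static double mad(double[] x) {
        double med = median(x);
        double[] d = new double[x.length];
        for (int i = 0; i < x.length; i++) d[i] = Math.abs(x[i] - med);
        return median(d);
    }

    // Z = (v - median) / (1.4826 * MAD); outlier if |Z| > 3.5
    static boolean isOutlier(double v, double[] x) {
        double z = (v - median(x)) / (1.4826 * mad(x));
        return Math.abs(z) > 3.5;
    }

    public static void main(String[] args) {
        double[] x = {9.8, 9.9, 10.0, 10.1, 10.2, 50.0};
        System.out.println(mad(x));              // small: the outlier barely moves it
        System.out.println(isOutlier(50.0, x));  // true
        System.out.println(isOutlier(10.1, x));  // false
    }
}
```

Unlike the z-score version, the extreme value 50.0 leaves the median and MAD almost untouched, so it is cleanly detected.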
  20. What is COPA using Spark? Outliers in Cancer Studies. In cancer studies, mutations can often amplify or turn off gene expression in only a minority of samples, i.e., produce outliers. The t-statistic may yield a high false discovery rate (FDR) when trying to detect changes that occur in a small number of samples; COPA does better here. Is there a better method than COPA? Maybe
  21. COPA Algorithm Outline 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  22. COPA Algorithm COPA Algorithm. Gene expression values are median-centered, setting each gene's median expression value to zero. The median absolute deviation (MAD) is calculated and scaled to 1 by dividing each gene expression value by its MAD. The 75th, 90th, and 95th percentiles of the transformed expression values are tabulated for each gene. Genes are rank-ordered by their percentile scores, providing a prioritized list of outlier profiles
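The per-gene COPA transformation above can be sketched in plain Java. This is a hedged illustration, not the author's implementation: the nearest-rank percentile convention and the names (`CopaScore`, `copaScore`) are assumptions made for this example.

```java
import java.util.Arrays;

// COPA transform for one gene's expression vector: subtract the median,
// divide by the MAD, then read off the value at a chosen percentile.
public class CopaScore {

    static double median(double[] x) {
        double[] s = x.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // median absolute deviation from the median
    static double mad(double[] x) {
        double med = median(x);
        double[] d = new double[x.length];
        for (int i = 0; i < x.length; i++) d[i] = Math.abs(x[i] - med);
        return median(d);
    }

    // COPA score = percentile of (x - median(x)) / MAD(x)
    static double copaScore(double[] expr, double percentile) {
        double med = median(expr);
        double scale = mad(expr);
        double[] t = new double[expr.length];
        for (int i = 0; i < expr.length; i++) t[i] = (expr[i] - med) / scale;
        Arrays.sort(t);
        int idx = (int) Math.ceil(percentile / 100.0 * t.length) - 1; // nearest-rank convention
        return t[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        // one gene, five samples; the last sample is over-expressed
        double[] expr = {1.0, 2.0, 3.0, 4.0, 100.0};
        System.out.println(copaScore(expr, 90.0)); // large: the outlier survives the transform
        System.out.println(copaScore(expr, 75.0)); // modest
    }
}
```

Because median and MAD ignore the outlier, the transformed outlier stays extreme, which is exactly why COPA prefers them over mean and standard deviation (next slide).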
  23. COPA Algorithm COPA Transformation. The median and MAD are used for the transformation (as opposed to the mean and standard deviation) so that outlier expression values do not unduly influence the distribution estimates, and are thus preserved post-normalization
  24. COPA Algorithm COPA: ONE-Sided and TWO-Sided. ONE-Sided: put all studies together; calculate the COPA score per GENE-ID. TWO-Sided: create K study groups; rank all groups; calculate the COPA score per GENE-ID (per study group)
  25. COPA Algorithm COPA: TWO-Sided. Handles K studies, where each study is a set of samples. For each study, find the COPA scores; the result is the median of the COPA scores over all studies. For K studies, report per GENE-ID_i: mean(COPA scores) and the geometric mean of the product of ranks, over Rank-Converted-Scores_ij for j = 1, ..., K
  26. COPA Algorithm Is COPA Big Data? General case: 1000's of studies; each study = 100's of samples; each sample = 100,000+ (K, V) pairs. Example: 4000 studies, assume each study has 300 samples. Total samples: 4000 × 300 = 1,200,000. Analyze: 1,200,000 × 100,000 = 120,000,000,000 (K, V) pairs
  27. COPA Algorithm Input Format. Each sample is a set of (K, V) pairs. General Format: K=GENE-ID, V=GENE-VALUE. Example-1: g127,1.20 Example-2: g4567,0.09
  29. Input Data Outline 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  30. Input Data Input Data Format. Set of k studies {S1, S2, ..., Sk}. Each study has billions of (Key, Value) pairs. Sample Record: <key-as-string><,><value-as-double-data-type>
  31. Input Data Input Data Persistence. Data persists in HDFS. Directory structure: /input/study-0001/file-0001-1.txt /input/study-0001/file-0001-2.txt /input/study-0001/file-0001-3.txt ... /input/study-3265/file-3265-1.txt /input/study-3265/file-3265-2.txt /input/study-3265/file-3265-3.txt ...
  32. Input Data What is a Ranking? Let S = {(K1, 40), (K2, 70), (K3, 90), (K4, 80)}. Then Rank(S) = {(K1, 4), (K2, 3), (K3, 1), (K4, 2)}, since 90 > 80 > 70 > 40. Ranks are assigned as 1, 2, 3, 4, ..., N
  33. Input Data What is Rank Product? Let {A1, ..., Ak} be a set of datasets of (key, value) pairs, where keys are unique per dataset. Examples of (key, value) pairs: (K,V) = (item, number of items sold); (K,V) = (user, number of followers for the user); (K,V) = (gene, test expression). The rank product of {A1, ..., Ak} is computed from the ranks r_i of key i across all k datasets. Typically ranks are assigned based on the sorted values of each dataset
  34. Input Data What is Rank Product? Let A1 = {(K1, 30), (K2, 60), (K3, 10), (K4, 80)}; then Rank(A1) = {(K1, 3), (K2, 2), (K3, 4), (K4, 1)} since 80 > 60 > 30 > 10. Note that 1 is the highest rank (assigned to the largest value). Let A2 = {(K1, 90), (K2, 70), (K3, 40), (K4, 50)}; then Rank(A2) = {(K1, 1), (K2, 2), (K3, 4), (K4, 3)} since 90 > 70 > 50 > 40. Let A3 = {(K1, 4), (K2, 8)}; then Rank(A3) = {(K1, 2), (K2, 1)} since 8 > 4. The rank product of {A1, A2, A3} is: {(K1, (3 × 1 × 2)^(1/3)), (K2, (2 × 2 × 1)^(1/3)), (K3, (4 × 4)^(1/2)), (K4, (1 × 3)^(1/2))}
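The worked example above can be reproduced in a few lines of plain Java (a hedged sketch; the class and method names are made up for this example, and ties are left unhandled):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rank each dataset's values in descending order (rank 1 = largest), then
// take the geometric mean of a key's ranks across the datasets containing it.
public class RankProductExample {

    // descending ranks: 1 for the largest value
    static Map<String, Integer> rank(Map<String, Double> data) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(data.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < entries.size(); i++) ranks.put(entries.get(i).getKey(), i + 1);
        return ranks;
    }

    // geometric mean of each key's ranks across all datasets
    static Map<String, Double> rankProduct(List<Map<String, Double>> datasets) {
        Map<String, Double> products = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, Double> ds : datasets) {
            for (Map.Entry<String, Integer> e : rank(ds).entrySet()) {
                products.merge(e.getKey(), (double) e.getValue(), (a, b) -> a * b);
                counts.merge(e.getKey(), 1, Integer::sum);
            }
        }
        Map<String, Double> rp = new HashMap<>();
        for (String k : products.keySet())
            rp.put(k, Math.pow(products.get(k), 1.0 / counts.get(k)));
        return rp;
    }

    public static void main(String[] args) {
        Map<String, Double> a1 = Map.of("K1", 30.0, "K2", 60.0, "K3", 10.0, "K4", 80.0);
        Map<String, Double> a2 = Map.of("K1", 90.0, "K2", 70.0, "K3", 40.0, "K4", 50.0);
        Map<String, Double> a3 = Map.of("K1", 4.0, "K2", 8.0);
        // K1 = (3*1*2)^(1/3), K2 = (2*2*1)^(1/3), K3 = (4*4)^(1/2), K4 = (1*3)^(1/2)
        System.out.println(rankProduct(List.of(a1, a2, a3)));
    }
}
```

Note that K3 and K4 appear in only two of the three datasets, so their geometric mean uses an exponent of 1/2, matching the slide.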
  35. Input Data Calculation of the Rank Product. Given n genes and k replicates, let e_g,i be the fold change and r_g,i the rank of gene g in the i'th replicate. Compute the rank product (RP) via the geometric mean: RP(g) = (Π_{i=1}^{k} r_g,i)^(1/k)
  36. Input Data Formalizing Rank Product. Let S = {S1, S2, ..., Sk} be a set of k studies, where k > 0 and each study represents a microarray experiment. Let study Si (i = 1, 2, ..., k) have an arbitrary number of assays identified by {Ai1, Ai2, ...}. Let each assay (representable as a text file) be a set of an arbitrary number of records of the form <gene_id><,><gene_value_as_double_data-type>. Let gene_id be in {g1, g2, ..., gn} (we have n genes)
  37. Input Data Rank Product: in 2 Steps. Let S = {S1, S2, ..., Sk} be a set of k studies. STEP-1: find the mean of values per study per gene (you may replace the "mean" function by your desired function); finding the mean involves groupByKey() or combineByKey(). STEP-2: perform the "Rank Product" per gene across all studies; finding the "Rank Product" involves groupByKey() or combineByKey()
  38. Input Data Formalizing Rank Product. The last step is to find the rank product for each gene across all studies: S1 = {(g1, r11), (g2, r12), ...}, S2 = {(g1, r21), (g2, r22), ...}, ..., Sk = {(g1, rk1), (g2, rk2), ...}; then the rank product of gj is RP(gj) = (Π_{i=1}^{k} r_ij)^(1/k)
  39. Input Data Spark Solution for Rank Product. 1 Read k input paths (path = study). 2 Find the mean per gene per study. 3 Sort the genes by value per study and then assign rank values; to sort the dataset by value, we swap the key with the value and then perform the sort. 4 Assign ranks 1, 2, ..., N (1 is assigned to the highest value); use JavaPairRDD.zipWithIndex(), which zips the RDD with its element indices (these indices become the ranks); Spark indices start from 0, so we add 1. 5 Finally, compute the Rank Product per gene for all studies: JavaPairRDD.groupByKey() (not efficient) or JavaPairRDD.combineByKey() (efficient)
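Steps 3 and 4 above can be sketched without Spark: swap (gene, mean) to (mean, gene), sort descending, and number the sorted positions, mirroring what sortByKey() plus zipWithIndex() do on an RDD. This is a hedged plain-Java illustration; `AssignRankSketch` is a made-up name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogue of the Spark ranking step: sort by |value| descending
// and assign ranks 1..N from the sorted position.
public class AssignRankSketch {

    static Map<String, Long> assignRank(Map<String, Double> meanByGene) {
        // "swap" key and value: sort entries by |value| descending,
        // as the Spark listing does with Math.abs before sortByKey(false, 1)
        List<Map.Entry<String, Double>> swapped = new ArrayList<>(meanByGene.entrySet());
        swapped.sort((a, b) -> Double.compare(Math.abs(b.getValue()), Math.abs(a.getValue())));

        Map<String, Long> ranks = new LinkedHashMap<>();
        for (int i = 0; i < swapped.size(); i++) {
            // zipWithIndex() would yield 0, 1, 2, ..., so add 1 for ranks 1..N
            ranks.put(swapped.get(i).getKey(), (long) i + 1);
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, Double> means = Map.of("g1", 0.5, "g2", -2.0, "g3", 1.2);
        System.out.println(assignRank(means)); // g2 -> 1, g3 -> 2, g1 -> 3
    }
}
```

The absolute value matters: a strongly under-expressed gene (large negative mean) still receives a top rank, as in the deck's assignRank() listing.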
  44. Input Data Two Spark Solutions: groupByKey() and combineByKey(). Two solutions are provided using Spark 1.6.0: SparkRankProductUsingGroupByKey uses groupByKey(); SparkRankProductUsingCombineByKey uses combineByKey()
  45. Input Data How does groupByKey() work?
  46. Input Data How does reduceByKey() work?
  47. Rank Product Algorithm Outline 1 Biography 2 What is COPA using Spark? 3 COPA Algorithm 4 Input Data 5 Rank Product Algorithm 6 Moral of Story 7 References
  48. Rank Product Algorithm Rank Product Algorithm in Spark. Algorithm high-level steps: STEP-1 import required interfaces and classes; STEP-2 handle input parameters; STEP-3 create a Spark context object; STEP-4 create the list of studies (1, 2, ..., K); STEP-5 compute the mean per gene per study; STEP-6 sort by values; STEP-7 assign ranks; STEP-8 compute the rank product; STEP-9 save the result in HDFS
  49. Rank Product Algorithm Main Driver. Listing 1: main() driver for performRrankProduct()
      public static void main(String[] args) throws Exception {
         // args[0] = output path
         // args[1] = number of studies (K)
         // args[2] = input path for study 1
         // args[3] = input path for study 2
         // ...
         // args[K+1] = input path for study K
         final String outputPath = args[0];
         final String numOfStudiesAsString = args[1];
         final int K = Integer.parseInt(numOfStudiesAsString);
         List<String> inputPathMultipleStudies = new ArrayList<String>();
         for (int i=1; i <= K; i++) {
            String singleStudyInputPath = args[1+i];
            inputPathMultipleStudies.add(singleStudyInputPath);
         }
         performRrankProduct(inputPathMultipleStudies, outputPath);
         System.exit(0);
      }
  50. Rank Product Algorithm groupByKey() vs. combineByKey(). Which one should we use? According to their semantics, both give the same answer, but combineByKey() is more efficient. In some situations, groupByKey() can even cause out-of-disk problems. In general, reduceByKey() and combineByKey() are preferred over groupByKey(). Spark's shuffle is more efficient for reduceByKey() than for groupByKey(): in the shuffle step for reduceByKey(), data is combined so that each partition outputs at most one value per key to send over the network, while in the shuffle step for groupByKey(), all the data is wastefully sent over the network and collected on the reduce workers. To understand the difference, the following figures show how the shuffle is done for reduceByKey() and groupByKey()
  51. Rank Product Algorithm Understanding groupByKey()
  52. Rank Product Algorithm Understanding reduceByKey() or combineByKey()
  53. Rank Product Algorithm Listing 2: performRrankProduct()
      public static void performRrankProduct(
            final List<String> inputPathMultipleStudies,
            final String outputPath) throws Exception {
         // create a context object, which is used
         // as a factory for creating new RDDs
         JavaSparkContext context = Util.createJavaSparkContext(useYARN);

         // Spark 1.6.0 requires an array for creating union of many RDDs
         int index = 0;
         JavaPairRDD<String, Double>[] means =
            new JavaPairRDD[inputPathMultipleStudies.size()];
         for (String inputPathSingleStudy : inputPathMultipleStudies) {
            means[index] = computeMeanByGroupByKey(context, inputPathSingleStudy);
            index++;
         }

         // next compute rank
         ...
      }
  54. Rank Product Algorithm Listing 3: performRrankProduct() (continued)
      public static void performRrankProduct(
         ...
         // next compute rank
         // 1. sort values based on absolute value of mean value
         // 2. assign rank from 1 to N
         // 3. calculate rank product for each gene
         JavaPairRDD<String,Long>[] ranks = new JavaPairRDD[means.length];
         for (int i=0; i < means.length; i++) {
            ranks[i] = assignRank(means[i]);
         }
         // calculate ranked products
         // <gene, T2<rankedProduct, N>>
         JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts =
            computeRankedProducts(context, ranks);

         // save the result, shuffle=true
         rankedProducts.coalesce(1,true).saveAsTextFile(outputPath);

         // close the context and we are done
         context.close();
      }
  55. Rank Product Algorithm STEP-3: create a Spark context object
      public static JavaSparkContext createJavaSparkContext(boolean useYARN) {
         JavaSparkContext context;
         if (useYARN) {
            context = new JavaSparkContext("yarn-cluster", "MyAnalysis"); // YARN
         }
         else {
            context = new JavaSparkContext(); // Spark cluster
         }
         // inject efficiency
         SparkConf sparkConf = context.getConf();
         sparkConf.set("spark.kryoserializer.buffer.mb","32");
         sparkConf.set("spark.shuffle.file.buffer.kb","64");
         // set a fast serializer
         sparkConf.set("spark.serializer",
                       "org.apache.spark.serializer.KryoSerializer");
         sparkConf.set("spark.kryo.registrator",
                       "org.apache.spark.serializer.KryoRegistrator");
         return context;
      }
  56. Rank Product Algorithm Listing 4: STEP-5: compute mean
      static JavaPairRDD<String, Double> computeMeanByGroupByKey(
            JavaSparkContext context,
            final String inputPath) throws Exception {
         JavaPairRDD<String, Double> genes =
            getGenesUsingTextFile(context, inputPath, 30);

         // group values by gene
         JavaPairRDD<String, Iterable<Double>> groupedByGene = genes.groupByKey();

         // calculate mean per gene
         JavaPairRDD<String, Double> meanRDD = groupedByGene.mapValues(
            new Function<
                         Iterable<Double>, // input
                         Double            // output: mean
                        >() {
               @Override
               public Double call(Iterable<Double> values) {
                  double sum = 0.0;
                  int count = 0;
                  for (Double v : values) {
                     sum += v;
                     count++;
                  }
                  // calculate mean of samples
                  double mean = sum / ((double) count);
                  return mean;
               }
         });
         return meanRDD;
      }
  57. Rank Product Algorithm Listing 5: getGenesUsingTextFile()
      static JavaPairRDD<String, Double> getGenesUsingTextFile(
            JavaSparkContext context,
            final String inputPath,
            final int numberOfPartitions) throws Exception {

         // read input and create the first RDD
         // JavaRDD<String>: where String = "gene,test_expression"
         JavaRDD<String> records = context.textFile(inputPath, numberOfPartitions);

         // for each record, we emit (K=gene, V=test_expression)
         JavaPairRDD<String, Double> genes
            = records.mapToPair(new PairFunction<String, String, Double>() {
               @Override
               public Tuple2<String, Double> call(String rec) {
                  // rec = "gene,test_expression"
                  String[] tokens = StringUtils.split(rec, ",");
                  // tokens[0] = gene
                  // tokens[1] = test_expression
                  return new Tuple2<String, Double>(
                     tokens[0], Double.parseDouble(tokens[1]));
               }
         });
         return genes;
      }
  58. Rank Product Algorithm Listing 6: Assign Rank
      // result is JavaPairRDD<String, Long> = (gene, rank)
      static JavaPairRDD<String,Long> assignRank(JavaPairRDD<String,Double> rdd) {
         // swap key and value (will be used for sorting by key);
         // convert value to abs(value)
         JavaPairRDD<Double,String> swappedRDD = rdd.mapToPair(
            new PairFunction<Tuple2<String, Double>, // T: input
                             Double,                 // K
                             String>() {             // V
               public Tuple2<Double,String> call(Tuple2<String, Double> s) {
                  return new Tuple2<Double,String>(Math.abs(s._2), s._1);
               }
         });
         // we need 1 partition so that we can zip numbers
         // into this RDD by zipWithIndex()
         JavaPairRDD<Double,String> sorted = swappedRDD.sortByKey(false, 1); // false means descending
         // JavaPairRDD<T,Long> zipWithIndex()
         // Long values will be 0, 1, 2, ...; for ranking,
         // we need 1, 2, 3, ..., therefore, we will add 1
         JavaPairRDD<Tuple2<Double,String>,Long> indexed = sorted.zipWithIndex();
         // next convert JavaPairRDD<Tuple2<Double,String>,Long>
         // into JavaPairRDD<String,Long>, that is
         // JavaPairRDD<Tuple2<value,gene>,rank> into JavaPairRDD<gene,rank>
         JavaPairRDD<String, Long> ranked = indexed.mapToPair(
            new PairFunction<Tuple2<Tuple2<Double,String>,Long>, // T: input
                             String,    // K: mapped_id
                             Long>() {  // V: rank
               public Tuple2<String, Long> call(Tuple2<Tuple2<Double,String>,Long> s) {
                  return new Tuple2<String,Long>(s._1._2, s._2 + 1); // ranks are 1, 2, ..., n
               }
         });
         return ranked;
      }
  59. Rank Product Algorithm Listing 7: Compute Rank Product using groupByKey()
      static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProducts(
            JavaSparkContext context,
            JavaPairRDD<String, Long>[] ranks) {
         JavaPairRDD<String, Long> unionRDD = context.union(ranks);

         // next find unique keys, with their associated values
         JavaPairRDD<String, Iterable<Long>> groupedByGeneRDD = unionRDD.groupByKey();

         // next calculate ranked products and the number of elements
         JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts =
            groupedByGeneRDD.mapValues(
               new Function<
                            Iterable<Long>,         // input: ranks for all studies
                            Tuple2<Double, Integer> // output: (rankedProduct, N)
                           >() {
                  @Override
                  public Tuple2<Double, Integer> call(Iterable<Long> values) {
                     int N = 0;
                     long products = 1;
                     for (Long v : values) {
                        products *= v;
                        N++;
                     }
                     double rankedProduct = Math.pow((double) products, 1.0/((double) N));
                     return new Tuple2<Double, Integer>(rankedProduct, N);
                  }
            });
         return rankedProducts;
      }
  60. Rank Product Algorithm Next FOCUS on combineByKey(). We need to develop two functions: computeMeanByCombineByKey() and computeRankedProductsUsingCombineByKey()
  61. Rank Product Algorithm combineByKey(): how does it work? combineByKey() is the most general of the per-key aggregation functions; most of the other per-key combiners are implemented using it. Like aggregate(), combineByKey() allows the user to return values that are not the same type as the input data. To understand combineByKey(), it is useful to think of how it handles each element it processes: as combineByKey() goes through the elements in a partition, each element either has a key it has not seen before or has the same key as a previous element
62. Rank Product Algorithm
combineByKey() by Combiner (C)

public <C> JavaPairRDD<K,C> combineByKey(
    Function<V,C> createCombiner,     // V -> C
    Function2<C,V,C> mergeValue,      // C + V -> C
    Function2<C,C,C> mergeCombiners)  // C + C -> C

Provide 3 functions:
1. createCombiner, which turns a V into a C (e.g., creates a one-element list)
2. mergeValue, to merge a V into a C (e.g., adds it to the end of the list)
3. mergeCombiners, to combine two Cs into a single one
63. Rank Product Algorithm
computeMeanByCombineByKey(): Define C data structure

// AverageCount is used by combineByKey()
// to hold the total values and their count.
static class AverageCount implements Serializable {
    double total;
    int count;

    public AverageCount(double total, int count) {
        this.total = total;
        this.count = count;
    }

    public double average() {
        return total / (double) count;
    }
}
64. Rank Product Algorithm
computeMeanByCombineByKey()

static JavaPairRDD<String, Double> computeMeanByCombineByKey(
        JavaSparkContext context,
        final String inputPath) throws Exception {

    JavaPairRDD<String, Double> genes = getGenesUsingTextFile(context, inputPath, 30);

    // we need 3 functions to be able to use combineByKey()
    Function<Double, AverageCount> createCombiner = ...
    Function2<AverageCount, Double, AverageCount> addAndCount = ...
    Function2<AverageCount, AverageCount, AverageCount> mergeCombiners = ...

    JavaPairRDD<String, AverageCount> averageCounts =
        genes.combineByKey(createCombiner, addAndCount, mergeCombiners);

    // now compute the mean/average per gene
    JavaPairRDD<String, Double> meanRDD = averageCounts.mapToPair(
        new PairFunction<
            Tuple2<String, AverageCount>, // T: input
            String,                       // K
            Double                        // V
        >() {
            @Override
            public Tuple2<String, Double> call(Tuple2<String, AverageCount> s) {
                return new Tuple2<String, Double>(s._1, s._2.average());
            }
        });
    return meanRDD;
}
65. Rank Product Algorithm
3 functions for combineByKey()

Function<Double, AverageCount> createCombiner = new Function<Double, AverageCount>() {
    @Override
    public AverageCount call(Double x) {
        return new AverageCount(x, 1);
    }
};

Function2<AverageCount, Double, AverageCount> addAndCount =
    new Function2<AverageCount, Double, AverageCount>() {
        @Override
        public AverageCount call(AverageCount a, Double x) {
            a.total += x;
            a.count += 1;
            return a;
        }
    };

Function2<AverageCount, AverageCount, AverageCount> mergeCombiners =
    new Function2<AverageCount, AverageCount, AverageCount>() {
        @Override
        public AverageCount call(AverageCount a, AverageCount b) {
            a.total += b.total;
            a.count += b.count;
            return a;
        }
    };
66. Rank Product Algorithm
Per-key average using combineByKey() in Python

# C = (sum, count)
sumCount = nums.combineByKey(
    (lambda v: (v, 1)),
    (lambda C, v: (C[0] + v, C[1] + 1)),
    (lambda C1, C2: (C1[0] + C2[0], C1[1] + C2[1]))
)

mean = sumCount.mapValues(lambda C: C[0] / C[1]).collectAsMap()
print(mean)
67. Rank Product Algorithm
Sample Run using groupByKey(): script

# define input/output for Hadoop/HDFS
# args[0] = output path
# args[1] = number of studies (K)
# args[2] = input path for study 1
# args[3] = input path for study 2
# ...
# args[K+1] = input path for study K
OUTPUT=/rankproduct/output
NUM_OF_INPUT=3
INPUT1=/rankproduct/input1
INPUT2=/rankproduct/input2
INPUT3=/rankproduct/input3

# remove all files under output
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT

# submit the Spark job
driver=org.dataalgorithms.bonus.rankproduct.spark.SparkRankProductUsingGroupByKey
$SPARK_HOME/bin/spark-submit --class $driver \
    --master yarn-cluster \
    --jars $OTHER_JARS \
    --conf "spark.yarn.jar=$SPARK_JAR" \
    $APP_JAR $OUTPUT $NUM_OF_INPUT $INPUT1 $INPUT2 $INPUT3
68. Rank Product Algorithm
Sample Run using groupByKey(): input

# hadoop fs -cat /rankproduct/input1/rp1.txt
K_1,30.0
K_2,60.0
K_3,10.0
K_4,80.0

# hadoop fs -cat /rankproduct/input2/rp2.txt
K_1,90.0
K_2,70.0
K_3,40.0
K_4,50.0

# hadoop fs -cat /rankproduct/input3/rp3.txt
K_1,4.0
K_2,8.0
69. Rank Product Algorithm
Sample Run using groupByKey(): output

# hadoop fs -cat /rankproduct/output/part*
(K_2,(1.5874010519681994,3))
(K_3,(4.0,2))
(K_1,(1.8171205928321397,3))
(K_4,(1.7320508075688772,2))
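The output above can be reproduced end-to-end in a few lines of plain Python. The sketch below assumes ranks are assigned per study by descending expression value (rank 1 = highest), which is consistent with the numbers shown; it is a verification aid, not the Spark driver itself:

```python
# per-study gene values from rp1.txt, rp2.txt, rp3.txt
study1 = {"K_1": 30.0, "K_2": 60.0, "K_3": 10.0, "K_4": 80.0}
study2 = {"K_1": 90.0, "K_2": 70.0, "K_3": 40.0, "K_4": 50.0}
study3 = {"K_1": 4.0,  "K_2": 8.0}

def rank_descending(study):
    """Rank 1 goes to the largest value within the study."""
    ordered = sorted(study, key=study.get, reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}

products, counts = {}, {}
for study in (study1, study2, study3):
    for gene, r in rank_descending(study).items():
        products[gene] = products.get(gene, 1) * r
        counts[gene] = counts.get(gene, 0) + 1

ranked = {g: (products[g] ** (1.0 / counts[g]), counts[g]) for g in products}
# K_2 -> (4 ** (1/3), 3) and K_3 -> (16 ** (1/2), 2), matching the output above
```

Running this reproduces (K_1, 1.8171..., 3), (K_2, 1.5874..., 3), (K_3, 4.0, 2), (K_4, 1.7320..., 2), which is a cheap sanity check on the cluster run.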
70. Rank Product Algorithm
computeRankedProductsUsingCombineByKey(): Define C data structure

/**
 * RankProduct is used by combineByKey() to hold
 * the total product values and their count.
 */
static class RankProduct implements Serializable {
    long product;
    int count;

    public RankProduct(long product, int count) {
        this.product = product;
        this.count = count;
    }

    public double rank() {
        return Math.pow((double) product, 1.0 / (double) count);
    }
}
71. Rank Product Algorithm
computeRankedProductsUsingCombineByKey()

// JavaPairRDD<String, Tuple2<Double, Integer>> = <gene, Tuple2(rankedProduct, N)>
// where N is the number of elements used to compute the rankedProduct
static JavaPairRDD<String, Tuple2<Double, Integer>> computeRankedProductsUsingCombineByKey(
        JavaSparkContext context,
        JavaPairRDD<String, Long>[] ranks) {
    JavaPairRDD<String, Long> unionRDD = context.union(ranks);

    // we need 3 functions to be able to use combineByKey()
    Function<Long, RankProduct> createCombiner = ...
    Function2<RankProduct, Long, RankProduct> addAndCount = ...
    Function2<RankProduct, RankProduct, RankProduct> mergeCombiners = ...

    // next find unique keys, with their associated ranks
    JavaPairRDD<String, RankProduct> combinedByGeneRDD =
        unionRDD.combineByKey(createCombiner, addAndCount, mergeCombiners);

    // next calculate ranked products and the number of elements
    JavaPairRDD<String, Tuple2<Double, Integer>> rankedProducts = combinedByGeneRDD.mapValues(
        new Function<
            RankProduct,             // input: RankProduct
            Tuple2<Double, Integer>  // output: (rankedProduct, count)
        >() {
            @Override
            public Tuple2<Double, Integer> call(RankProduct value) {
                double theRankedProduct = value.rank();
                return new Tuple2<Double, Integer>(theRankedProduct, value.count);
            }
        });
    return rankedProducts;
}
72. Rank Product Algorithm
3 functions for computeRankedProductsUsingCombineByKey()

Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
    @Override
    public RankProduct call(Long x) {
        return new RankProduct(x, 1);
    }
};

Function2<RankProduct, Long, RankProduct> addAndCount =
    new Function2<RankProduct, Long, RankProduct>() {
        @Override
        public RankProduct call(RankProduct a, Long x) {
            a.product *= x;
            a.count += 1;
            return a;
        }
    };

Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
    new Function2<RankProduct, RankProduct, RankProduct>() {
        @Override
        public RankProduct call(RankProduct a, RankProduct b) {
            a.product *= b.product;
            a.count += b.count;
            return a;
        }
    };
73. Rank Product Algorithm
REVISED computeRankedProductsUsingCombineByKey(): Define C data structure

static class RankProduct implements Serializable {
    long product;
    int count;

    public RankProduct(long product, int count) {
        this.product = product;
        this.count = count;
    }

    public void product(long value) {
        this.product *= value;
        this.count++;
    }

    public void product(RankProduct pr) {
        this.product *= pr.product;
        this.count += pr.count;
    }

    public double rank() {
        return Math.pow((double) product, 1.0 / (double) count);
    }
}
74. Rank Product Algorithm
REVISED 3 functions for computeRankedProductsUsingCombineByKey()

Function<Long, RankProduct> createCombiner = new Function<Long, RankProduct>() {
    public RankProduct call(Long x) {
        return new RankProduct(x, 1);
    }
};

Function2<RankProduct, Long, RankProduct> addAndCount =
    new Function2<RankProduct, Long, RankProduct>() {
        public RankProduct call(RankProduct a, Long x) {
            a.product(x);
            return a;
        }
    };

Function2<RankProduct, RankProduct, RankProduct> mergeCombiners =
    new Function2<RankProduct, RankProduct, RankProduct>() {
        public RankProduct call(RankProduct a, RankProduct b) {
            a.product(b);
            return a;
        }
    };
75. Moral of Story: Outline
1 Biography
2 What is COPA using Spark?
3 COPA Algorithm
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
7 References
76. Moral of Story
Moral of Story...
Understand data requirements
Select the proper platform for implementation: Spark
Partition your RDDs properly, based on:
    number of cluster nodes
    number of cores per node
    amount of RAM
Avoid unnecessary computations: (g1, g2), (g2, g1)
Use filter() often to remove non-needed RDD elements
Avoid unnecessary RDD.saveAsTextFile(path)
Run on both YARN and a Spark cluster and compare performance
Use groupByKey() cautiously
Prefer combineByKey() over groupByKey()
Verify your test results against the R language
77. References: Outline
1 Biography
2 What is COPA using Spark?
3 COPA Algorithm
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
7 References
78. References
COPA – Cancer Outlier Profile Analysis: James W. MacDonald and Debashis Ghosh, http://bioinformatics.oxfordjournals.org/content/22/23/2950.full.pdf
mCOPA: analysis of heterogeneous features in cancer expression data: Chenwei Wang, Alperen Taciroglu, Stefan R Maetschke, Colleen C Nelson, Mark A Ragan, and Melissa J Davis, http://jclinbioinformatics.biomedcentral.com/articles/10.1186/2043-9113-2-22
Rank Product: Mahmoud Parsian, http://mapreduce4hackers.com/RankProduct_intro.html
Rank Product: Mahmoud Parsian, https://github.com/mahmoudparsian/data-algorithms-book/tree/master/src/main/java/org/dataalgorithms/bonus/rankproduct
79. Thank you! Questions?
