Large-scale data parsing with Hadoop in Bioinformatics

Transcript

  • 1. Large-scale data parsing and algorithm development with Hadoop / MapReduce
       Ntino Krampis
       Cloud Computing Workshop, 28 October 2010
       J. Craig Venter Institute
  • 2. Canonical example: finding the members of Uniref100 clusters

       Uniref_ID    Uniref_Cluster
       B1JTL4       A0K3H0
       A0Q8P9       A0Q8P9
       A2VU91       A0K3H0
       A7ZA84       A7ZA84
       A0RAB9       A0RAB9
       A7JF80       A0Q8P9
       A7GLP0       A7GLP0
       B4ARM5       A0Q8P9
       A0K3H0       A0K3H0
       A9VGI8       A9VGI8
       A0KAJ8       Q1BTJ3
       A1BI83       A1BI83
       Q1BRP4       A0K3H0

       ● 30 GB of data, ~12 million rows
       ● remember, this is a "small" example dataset
       ● your typical server: 32 GB of memory + 16 cores
       ● what is your approach for finding the cluster members?
  • 3. Traditional approach one: hashing
       (same Uniref_ID / Uniref_Cluster table as slide 2)

       Key      Value
       A0K3H0 → ( B1JTL4, A2VU91, A0K3H0, ... )
       A0Q8P9 → ( A0Q8P9, A7JF80, B4ARM5 )
       A7ZA84 → ( A7ZA84 )

       ● Key: Uniref_Cluster ID
       ● Value: array of cluster member Uniref IDs
       ● add a new Key, or append the member Uniref ID to the Value if the Key already exists
       ● how big a hash can you fit in 32 GB of memory?
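  For reference, a minimal Ruby sketch of this hash-based approach, assuming a tab-separated input file (the file name uniref100_clusters.tsv is hypothetical); it is only feasible while the entire hash fits in memory:

       # Minimal sketch of the in-memory hash approach; only works while the
       # whole hash fits in RAM. The input file name is hypothetical.
       clusters = Hash.new { |h, k| h[k] = [] }

       File.foreach("uniref100_clusters.tsv") do |line|
         uniref_id, uniref_cluster_id = line.chomp.split("\t")
         clusters[uniref_cluster_id] << uniref_id
       end

       clusters.each { |id, members| puts "#{id}\t#{members.join(',')}" }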
  • 4. Traditional approach two: sorting

       Uniref_ID    Uniref_Cluster
       A0K3H0       A0K3H0
       A2VU91       A0K3H0
       B1JTL4       A0K3H0
       Q1BRP4       A0K3H0
       A0Q8P9       A0Q8P9
       A7JF80       A0Q8P9
       B4ARM5       A0Q8P9
       A7GLP0       A7GLP0
       A7ZA84       A7ZA84
       A0RAB9       A0RAB9
       A9VGI8       A9VGI8
       A0KAJ8       Q1BTJ3
       A1BI83       A1BI83

       ● sort to bring all Uniref Cluster IDs together
       ● stream all the lines and collect the cluster members
       ● sorting algorithms: memory-based or disk-based?
       ● can probably handle 100 GB with disk paging (slow...)
  • 5. Split the data and sort in parallel?
       (the Uniref_ID / Uniref_Cluster table from slide 2, split into two fragments)

       ● implement data distribution across compute nodes (easy)
       ● implement parallel processing of the data fragments at the nodes (easy)
       ● implement exchange of partial sorts / intermediate results between nodes (difficult)
       ● implement tracking of data-fragment failures (difficult)
       ● let's see in detail how you would implement all this...
       ● ...which is the same as explaining what MapReduce / Hadoop does automatically for you.
  • 6. A bird's-eye view of the Hadoop MapReduce framework
       ● data distribution across the compute nodes: HDFS, the Hadoop Distributed File System
       ● parallel processing of the data fragments at the nodes, part 1: a Map script written by you (e.g. parse Uniref100 cluster IDs from >FASTA headers)
       ● exchange of intermediate results between nodes: the Shuffle aggregates results sharing a Key (here the Uniref cluster ID) on the same node; if you are not looking for Uniref clusters, use a random key and simply parse in parallel in the Map
       ● parallel processing of the data fragments at the nodes, part 2: a Reduce script written by you, processing the aggregated results; not required if you don't want to aggregate by a specific Key
       ● re-scheduling of a failed task on a data fragment: automatic
  • 7. Data distribution across compute nodes: the Hadoop Distributed Filesystem (HDFS)
       ● data is split into 64 MB blocks distributed across the nodes of the cluster
       ● to you they look like regular files and directories:
            $fog-0-0-1> hadoop fs -ls , -rm , -rmr , -mkdir , -chmod etc.
            $fog-0-0-1> hadoop fs -put uniref100_clusters /user/kkrampis/
       ● one compute task per block (granularity):
            - tasks per cluster node are assigned based on the number of blocks at the node
            - small per-task data prevents a "long wall clock" caused by the longest-running task
       (diagram: the Uniref_ID / Uniref_Cluster table from slide 2, stored as blocks across nodes fog-0-1-2, fog-0-1-3, fog-0-1-4, ..., each with 32 GB of memory + 16 cores)
  • 8. Parallel processing of the data fragments, part 1: the Map phase (data pre-processing in parallel)
       ● a Map script specifies how to parse your data
       ● Hadoop handles all the parallel execution details
       (diagram: the Map code runs on each node, fog-0-1-2 / fog-0-1-3 / fog-0-1-4, ..., 32 GB + 16 cores each, against the data blocks of slide 2 stored locally)

       Map ( Key, Value ) code in Ruby (the beloved Perl equivalent: while (<STDIN>) { ... }):

            STDIN.each_line do |line|
              lineArray         = line.chomp.split(/\t/)
              uniref_id         = lineArray.at(0)
              uniref_cluster_id = lineArray.at(1)
              puts "#{uniref_cluster_id}\t#{uniref_id}"
            end

       Emitted ( Key, Value ) pairs, for example:
            ( A0K3H0, B1JTL4 )
            ( A0Q8P9, A0Q8P9 )
            ( A0K3H0, A2VU91 )
            ( A7ZA84, A7ZA84 )
  • 9. Processing of the data fragments across nodes, part 1 (continued):
       Hadoop performs parallel sorting by Key; this is an intermediate sort of the data fragments at each node.

       Map output at the nodes:
            fog-0-1-2: ( A0K3H0, B1JTL4 ) ( A0Q8P9, A0Q8P9 ) ( A0K3H0, A2VU91 ) ( A7ZA84, A7ZA84 )
            fog-0-1-3: ( A0RAB9, A0RAB9 ) ( A0Q8P9, A7JF80 ) ( A7GLP0, A7GLP0 ) ( A0Q8P9, B4ARM5 )
            fog-0-1-4: ( A0K3H0, A0K3H0 ) ( A9VGI8, A9VGI8 ) ( Q1BTJ3, A0KAJ8 ) ( A1BI83, A1BI83 ) ( A0K3H0, Q1BRP4 )

       After the per-node sort by Key:
            fog-0-1-2: ( A0K3H0, B1JTL4 ) ( A0K3H0, A2VU91 ) ( A0Q8P9, A0Q8P9 ) ( A7ZA84, A7ZA84 )
            fog-0-1-3: ( A0Q8P9, A7JF80 ) ( A0Q8P9, B4ARM5 ) ( A0RAB9, A0RAB9 ) ( A7GLP0, A7GLP0 )
            fog-0-1-4: ( A0K3H0, A0K3H0 ) ( A0K3H0, Q1BRP4 ) ( A1BI83, A1BI83 ) ( A9VGI8, A9VGI8 ) ( Q1BTJ3, A0KAJ8 )
  • 10. Exchange of intermediate results between nodes: the Shuffle phase
       The master node (fog-0-1-1) coordinates which node collects each Key, e.g.:
            node: "I have A0K3H0, A0Q8P9"  →  master: "send A0Q8P9 to fog-0-0-2"
            node: "I have A0Q8P9"          →  master: "keep it; send A0K3H0 to fog-0-0-1"
            node: "I have A0K3H0"

       Before the Shuffle: the per-node sorted Map output from slide 9 (on fog-0-1-2, fog-0-1-3, fog-0-1-4, ...).
       After the Shuffle, all pairs sharing a Key sit on the same node:
            ( A0K3H0, B1JTL4 ) ( A0K3H0, A2VU91 ) ( A0K3H0, A0K3H0 ) ( A0K3H0, Q1BRP4 ) ( A7ZA84, A7ZA84 )
            ( A0Q8P9, A7JF80 ) ( A0Q8P9, B4ARM5 ) ( A0Q8P9, A0Q8P9 ) ( A0RAB9, A0RAB9 ) ( A7GLP0, A7GLP0 )
            ( A1BI83, A1BI83 ) ( A9VGI8, A9VGI8 ) ( Q1BTJ3, A0KAJ8 )
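  As an aside not shown on the slide, the usual way a shuffle decides which reduce node owns a Key is to hash the Key modulo the number of reducers, so every pair with the same Key lands on the same node. A toy Ruby illustration of the idea (not Hadoop's actual partitioner, which uses the Key's Java hashCode):

       # Toy hash partitioning: map each Key to a reducer index.
       # String#sum (a byte sum) stands in for a real hash function here.
       NUM_REDUCERS = 3

       def target_reducer(key)
         key.sum % NUM_REDUCERS
       end

       puts target_reducer("A0K3H0")   # every A0K3H0 pair goes to the same reducer
       puts target_reducer("A0Q8P9")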
  • 11. Processing of the data fragments across nodes, part 2: the Reduce phase
       ● guaranteed: Keys arrive ordered; the Values are not ordered (secondary keys can be used if desired)
       ● the Reduce step runs in parallel as well (here on fog-0-0-1, fog-0-0-2, fog-0-0-3, ...), each node reading its share of the shuffled ( Key, Value ) pairs from slide 10

       Reduce code in Ruby:

            last_key = nil
            cluster  = ""
            STDIN.each_line do |line|
              uniref_cluster_id, uniref_id = line.chomp.split("\t")
              if last_key && last_key != uniref_cluster_id
                puts "#{last_key}\t#{cluster}"   # flush the finished cluster
                cluster = ""
              end
              last_key = uniref_cluster_id
              cluster  = cluster.empty? ? uniref_id : "#{cluster},#{uniref_id}"
            end
            puts "#{last_key}\t#{cluster}" if last_key   # flush the last cluster

       Output:
            A0K3H0    B1JTL4,A2VU91,A0K3H0,Q1BRP4
            A7ZA84    A7ZA84
            A0Q8P9    A0Q8P9,A7JF80,B4ARM5
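  A common way to sanity-check Hadoop Streaming scripts before submitting a job is to simulate the Map → sort → Reduce pipeline locally on a small sample; a sketch in Ruby, using the script names from slide 15 with simplified paths (the sample file name is hypothetical):

       # Local simulation of the streaming pipeline on a small sample
       # (sample.tsv and the bare script paths are assumptions; the real
       # scripts are the /home/cloud/training/*.rb files from slide 15).
       system("cat sample.tsv | " \
              "ruby uniref100_clusters_map.rb | sort | " \
              "ruby uniref100_clusters_reduce.rb")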
  • 12. Distributed Grep, CloudBLAST / CloudBurst, and k-mer frequency counts
       Sequence reads distributed across the nodes (fog-0-1-2, fog-0-1-3, fog-0-1-4, ..., 32 GB + 16 cores each):
            CAAGGACGTGACAA    TATTAATGCAATGAG   TAGATCACGTTTTTA   CCGGACGAACCACA
            CTATTTTAGTGGTCAG  TGAGTTGCACTTAAG   ATTAGGACCATGTAG   AGTGGTGCACATGAT
            ACGTCAACGTCATCG   TTTATCTCTCGAAACT  ATTCCATAGTGAGTG   TTATCGTTATTGCTAG  CCATAGACGTACGTC

       Map output ( Key, Value ), e.g.:
            ( ACGT, CAAGGACGTGACAA )  ( TGCA, TATTAATGCAATGAG )  ( ACGT, TAGATCACGTTTTTA )  ...

       After the Shuffle, reads sharing a Key are grouped on the same node:
            ( ACGT, CAAGGACGTGACAA )  ( ACGT, TAGATCACGTTTTTA )  ( ACGT, CCATAGACGTACGTC )
            ( TGCA, TATTAATGCAATGAG )  ( TGCA, TGAGTTGCACTTAAG )  ( TGCA, AGTGGTGCACATGAT )

       Map code (OK, this is some Perl!):

            while (<STDIN>) {
              chomp;
              my $value = $_;
              print "ACGT\t$value\n" if $value =~ /ACGT/;
              print "TGCA\t$value\n" if $value =~ /TGCA/;
            }
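  The slide title also mentions k-mer frequency counts; a hedged sketch of what such a streaming mapper could look like in Ruby (k = 4 and one sequence per input line are assumptions), with the Reduce step simply summing the emitted counts per k-mer after the Shuffle:

       # Hypothetical k-mer counting mapper: emit every k-mer of each read as a
       # Key with a count of 1; a reducer then sums the counts per k-mer.
       K = 4
       STDIN.each_line do |line|
         read = line.chomp
         next if read.empty? || read.start_with?(">")   # skip blanks / FASTA headers
         (0..read.length - K).each do |i|
           puts "#{read[i, K]}\t1"
         end
       end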
  • 13. References
       [1] Aaron McKenna et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297–1303, September 2010.
       [2] Suzanne Matthews et al. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinformatics, 11(Suppl 1):S15+, 2010.
       [3] G. Sudha Sadasivam et al. A novel approach to multiple sequence alignment using Hadoop data grids. In MDAC '10: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, pages 1–7, New York, NY, USA, 2010. ACM.
       [4] Christopher Moretti et al. Scaling up classifiers to cloud computers. In ICDM '08: Eighth IEEE International Conference on Data Mining, pages 472–481, 2008. IEEE.
       [5] Weizhong Zhao et al. Parallel k-means clustering based on MapReduce. In Martin G. Jaatun, Gansen Zhao, and Chunming Rong, editors, Cloud Computing, Lecture Notes in Computer Science 5931. Springer, Berlin, 2009.
       [6] Yang Liu et al. MapReduce-based pattern finding algorithm applied in motif detection for prescription compatibility network. Lecture Notes in Computer Science, pages 341–355.
       [7] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–1369, June 2009.
       [8] Ben Langmead et al. Searching for SNPs with cloud computing. Genome Biology, 10(11):R134+, November 2009.
  • 14. Further References
       ● this talk showed the core framework and Hadoop Streaming (using scripting languages)
       ● there is much more for experienced Java developers:
            - complex data structures in the Value field
            - combiners, custom serialization, compression
            - coding patterns for algorithms in MapReduce
       ● http://hadoop.apache.org :
            - HBase / Hive: scalable, distributed database / data warehouse for large tables
            - Mahout: a scalable machine learning and data mining library
            - Pig: data workflow language and execution framework for parallel computation
  • 15. /home/cloud/training/hadoop_cmds.sh :

       hadoop fs -mkdir /user/$USER/workshop
       hadoop fs -put /home/cloud/training/uniref100_proteins /user/$USER/workshop/uniref100_proteins
       hadoop fs -ls /user/$USER/workshop

       hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
           -input   /user/$USER/workshop/uniref100_proteins \
           -output  /user/$USER/workshop/uniref100_clusters \
           -file    /home/cloud/training/uniref100_clusters_map.rb \
           -mapper  /home/cloud/training/uniref100_clusters_map.rb \
           -file    /home/cloud/training/uniref100_clusters_reduce.rb \
           -reducer /home/cloud/training/uniref100_clusters_reduce.rb

       hadoop fs -get /user/$USER/workshop/uniref100_clusters /home/cloud/users/$USER
       gunzip /home/cloud/users/$USER/uniref100_clusters/part-00000.gz
       more /home/cloud/users/$USER/uniref100_clusters/part-00000
       hadoop fs -rmr /user/$USER/workshop