Large scale data-parsing with Hadoop in Bioinformatics: Presentation Transcript

    • Large-scale data parsing and algorithm development with Hadoop / MapReduce
      Ntino Krampis
      Cloud Computing Workshop, 28 October 2010
      J. Craig Venter Institute
    • Canonical example: finding the members of Uniref100 clusters
      ● 30 GB of data, ~12 million rows
      ● remember, this is a "small" example dataset
      ● your typical server: 32 GB of memory + 16 cores
      ● approach for finding the cluster members?

      Example data (Uniref_ID, Uniref_Cluster):
      B1JTL4   A0K3H0
      A0Q8P9   A0Q8P9
      A2VU91   A0K3H0
      A7ZA84   A7ZA84
      A0RAB9   A0RAB9
      A7JF80   A0Q8P9
      A7GLP0   A7GLP0
      B4ARM5   A0Q8P9
      A0K3H0   A0K3H0
      A9VGI8   A9VGI8
      A0KAJ8   Q1BTJ3
      A1BI83   A1BI83
      Q1BRP4   A0K3H0
    • Traditional approach one: hashing
      ● Key: Uniref_Cluster ID
      ● Value: array of cluster member Uniref IDs
      ● add new Keys, or append member Uniref IDs to the Value if the Key already exists
      ● how big a hash can you fit in 32 GB of memory?

      Key → Value
      A0K3H0 → ( B1JTL4, A2VU91, A0K3H0, ... )
      A0Q8P9 → ( A0Q8P9, A7JF80, B4ARM5 )
      A7ZA84 → ( A7ZA84 )

      (input table as in the previous slide)
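      A minimal Ruby sketch of this hash-based approach, for illustration only (the input file name and the tab-separated layout are assumptions, not from the slides):

          # hash_clusters.rb -- in-memory hashing sketch (illustrative)
          clusters = Hash.new { |h, k| h[k] = [] }      # Uniref_Cluster ID => array of member Uniref IDs

          File.foreach("uniref100_clusters") do |line|  # file name is an assumption
            uniref_id, uniref_cluster_id = line.chomp.split("\t")
            clusters[uniref_cluster_id] << uniref_id    # creates the Key on first sight, appends otherwise
          end

          clusters.each { |cluster_id, members| puts "#{cluster_id}\t#{members.join(',')}" }

      The whole hash lives in memory, which is exactly the limitation the slide points at: ~12 million rows still fit in 32 GB, but the approach does not scale much beyond that.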
    • Traditional approach two: sorting
      ● sort to bring all Uniref Cluster IDs together
      ● stream all the lines and collect the cluster members
      ● sorting algorithms: memory or disk based?
      ● can probably do 100 GB with disk paging (slow...)

      Example data sorted by Uniref_Cluster:
      A0K3H0   A0K3H0
      A2VU91   A0K3H0
      B1JTL4   A0K3H0
      Q1BRP4   A0K3H0
      A0Q8P9   A0Q8P9
      A7JF80   A0Q8P9
      B4ARM5   A0Q8P9
      A7GLP0   A7GLP0
      A7ZA84   A7ZA84
      A0RAB9   A0RAB9
      A9VGI8   A9VGI8
      A0KAJ8   Q1BTJ3
      A1BI83   A1BI83
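      A hedged sketch of the sort-then-stream idea, delegating the sort to the system sort command (again, the file name and tab-separated layout are assumptions):

          # sort_and_stream.rb -- sketch of approach two (illustrative)
          # Sort by the second column (Uniref_Cluster), then group members in one streaming pass.
          last_cluster, members = nil, []
          IO.popen(["sort", "-t", "\t", "-k2,2", "uniref100_clusters"]) do |sorted|
            sorted.each_line do |line|
              uniref_id, cluster_id = line.chomp.split("\t")
              if last_cluster && last_cluster != cluster_id
                puts "#{last_cluster}\t#{members.join(',')}"
                members = []
              end
              last_cluster = cluster_id
              members << uniref_id
            end
          end
          puts "#{last_cluster}\t#{members.join(',')}" if last_cluster

      Only one cluster's members are held in memory at a time, so memory stops being the bottleneck; the external sort over ~100 GB of data is, which is the "slow" the slide warns about.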
    • Split the data and sort in parallel?
      ● implement data distribution across compute nodes (easy)
      ● implement parallel processing of the data fragments at the nodes (easy)
      ● implement exchange of partial sorts / intermediate results between nodes (difficult)
      ● implement tracking of data fragment failures (difficult)
      ● let's see in detail how you'd implement all this...
      ● ...which is the same as explaining what MapReduce/Hadoop does automatically for you.

      (example data split across nodes, as in the previous slides)
    • A bird's eye view of the Hadoop Map/Reduce framework
      ● data distribution across the compute nodes: HDFS, the Hadoop Distributed FileSystem
      ● parallel processing of the data fragments at the nodes, part 1: Map script written by you (e.g. parse Uniref100 cluster IDs from >FASTA headers)
      ● exchange of intermediate results between nodes: Shuffle, which aggregates results sharing a Key (Uniref cluster ID) on the same node; if you are not looking for Uniref clusters, use a random key and simply parse in parallel at Map (see the sketch below)
      ● parallel processing of the data fragments at the nodes, part 2: Reduce script written by you, processing the aggregated results; not required if you don't want to aggregate by a specific Key
      ● re-scheduling of a job failure with a data fragment: automatic
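      For the map-only case mentioned above (parsing in parallel without grouping by a meaningful Key), a minimal Ruby mapper that emits a random key could look like this sketch; the parsing step is a placeholder, not from the slides:

          # random_key_map.rb -- "parse only" mapper sketch (illustrative)
          # A random key spreads records across reducers; no meaningful grouping happens downstream.
          STDIN.each_line do |line|
            record = line.chomp.upcase          # stand-in for whatever per-record parsing you need
            puts "#{rand(1000)}\t#{record}"     # random key in 0..999
          end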
    • Data distribution across compute nodes: Hadoop Distributed Filesystem (HDFS)
      ● data is split into 64 MB blocks distributed across the nodes of the cluster
      ● to you they look like regular files and directories:

          $fog-0-0-1> hadoop fs -ls , -rm , -rmr , -mkdir , -chmod etc.
          $fog-0-0-1> hadoop fs -put uniref100_clusters /user/kkrampis/

      ● one compute task per block: granularity
        - tasks per cluster node based on the number of blocks at the node
        - small data tasks prevent a "long wall clock" caused by the longest-running task

      (nodes fog-0-1-2, fog-0-1-3, fog-0-1-4, ...: 32 GB + 16 cores each, each holding a fragment of the example data)
    • Parallel processing of the data fragments, part 1: Map phase (data pre-processing in parallel)
      ● Map script specifying how to parse your data
      ● Hadoop handles all the parallel execution details

      Map ( Key, Value ) output, e.g. on fog-0-1-2:
      ( A0K3H0, B1JTL4 )
      ( A0Q8P9, A0Q8P9 )
      ( A0K3H0, A2VU91 )
      ( A7ZA84, A7ZA84 )

      Map script (Ruby):
          STDIN.each_line do |line|
            lineArray = line.chomp.split(/\t/)
            uniref_id = lineArray.at(0)
            uniref_cluster_id = lineArray.at(1)
            puts "#{uniref_cluster_id}\t#{uniref_id}"
          end

      ( beloved Perl: "while <STDIN> { }" )
    • Processing the data fragments across nodes, part 1: intermediate sorting
      ● Hadoop performs intermediate sorting on the data fragments at the nodes, sorting by Key in parallel

      Map output at each node, before → after the intermediate sort:

      fog-0-1-2:
      ( A0K3H0, B1JTL4 ) ( A0Q8P9, A0Q8P9 ) ( A0K3H0, A2VU91 ) ( A7ZA84, A7ZA84 )
      → ( A0K3H0, B1JTL4 ) ( A0K3H0, A2VU91 ) ( A0Q8P9, A0Q8P9 ) ( A7ZA84, A7ZA84 )

      fog-0-1-3:
      ( A0RAB9, A0RAB9 ) ( A0Q8P9, A7JF80 ) ( A7GLP0, A7GLP0 ) ( A0Q8P9, B4ARM5 )
      → ( A0Q8P9, A7JF80 ) ( A0Q8P9, B4ARM5 ) ( A0RAB9, A0RAB9 ) ( A7GLP0, A7GLP0 )

      fog-0-1-4:
      ( A0K3H0, A0K3H0 ) ( A9VGI8, A9VGI8 ) ( Q1BTJ3, A0KAJ8 ) ( A1BI83, A1BI83 ) ( A0K3H0, Q1BRP4 )
      → ( A0K3H0, A0K3H0 ) ( A0K3H0, Q1BRP4 ) ( A1BI83, A1BI83 ) ( A9VGI8, A9VGI8 ) ( Q1BTJ3, A0KAJ8 )
    • Exchange of intermediate results between nodes: Shuffle phase
      ● the master node (fog-0-1-1) coordinates where the pairs for each Key are sent

      fog-0-1-2: "I have A0K3H0, A0Q8P9" → sends A0Q8P9 to fog-0-0-2
      fog-0-1-3: "I have A0Q8P9" → keeps it
      fog-0-1-4: "I have A0K3H0" → sends A0K3H0 to fog-0-0-1

      After the shuffle, all pairs sharing a Key sit on the same node:
      fog-0-1-2: ( A0K3H0, B1JTL4 ) ( A0K3H0, A2VU91 ) ( A0K3H0, A0K3H0 ) ( A0K3H0, Q1BRP4 ) ( A7ZA84, A7ZA84 )
      fog-0-1-3: ( A0Q8P9, A7JF80 ) ( A0Q8P9, B4ARM5 ) ( A0Q8P9, A0Q8P9 ) ( A0RAB9, A0RAB9 ) ( A7GLP0, A7GLP0 )
      fog-0-1-4: ( A1BI83, A1BI83 ) ( A9VGI8, A9VGI8 ) ( Q1BTJ3, A0KAJ8 )
    • Processing the data fragments across nodes, part 2: Reduce phase
      ● guaranteed: Keys arrive ordered; Values are not ordered (secondary keys can be used if desired)
      ● the Reduce step also runs in parallel across nodes

      Input at the reduce nodes (grouped by Key):
      fog-0-0-1: ( A0K3H0, B1JTL4 ) ( A0K3H0, A2VU91 ) ( A0K3H0, A0K3H0 ) ( A0K3H0, Q1BRP4 ) ( A7ZA84, A7ZA84 )
      fog-0-0-2: ( A0Q8P9, A7JF80 ) ( A0Q8P9, B4ARM5 ) ( A0Q8P9, A0Q8P9 ) ( A0RAB9, A0RAB9 ) ( A7GLP0, A7GLP0 )
      fog-0-0-3: ( A1BI83, A1BI83 ) ( A9VGI8, A9VGI8 ) ( Q1BTJ3, A0KAJ8 )

      Reduce script (Ruby):
          last_key, cluster = nil, ""
          STDIN.each_line do |line|
            uniref_cluster_id, uniref_id = line.chomp.split("\t")
            if last_key && last_key != uniref_cluster_id
              puts "#{last_key}\t#{cluster}"
              cluster = uniref_id
            else
              cluster = cluster.empty? ? uniref_id : cluster + ',' + uniref_id
            end
            last_key = uniref_cluster_id
          end
          puts "#{last_key}\t#{cluster}" if last_key

      Output:
      A0K3H0   B1JTL4, A2VU91, A0K3H0, Q1BRP4
      A7ZA84   A7ZA84
      A0Q8P9   A0Q8P9, A7JF80, B4ARM5
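      Before submitting to the cluster, it can help to simulate the Map → sort → Reduce pipeline locally on a small sample. A hedged Ruby sketch follows; the sample file name is an assumption, and the script names mirror those used in the workshop commands at the end:

          # local_pipeline_test.rb -- simulate Map -> shuffle/sort -> Reduce in one process (illustrative)
          require "open3"

          sample = File.read("sample_uniref100.tsv")    # small tab-separated sample; name is an assumption
          map_out, _status = Open3.capture2("ruby uniref100_clusters_map.rb", stdin_data: sample)
          shuffled = map_out.lines.sort.join            # stand-in for Hadoop's shuffle and sort by Key
          reduce_out, _status = Open3.capture2("ruby uniref100_clusters_reduce.rb", stdin_data: shuffled)
          puts reduce_out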
    • Distributed Grep, CloudBLAST / CloudBurst, and k-mer frequency counts

      Sequence reads at the nodes, and the ( Key, Value ) pairs after Map and after Shuffle:

      fog-0-1-2: CAAGGACGTGACAA, TATTAATGCAATGAG, TAGATCACGTTTTTA, CCGGACGAACCACA
        Map:     ( ACGT, CAAGGACGTGACAA ) ( TGCA, TATTAATGCAATGAG ) ( ACGT, TAGATCACGTTTTTA )
        Shuffle: ( ACGT, CAAGGACGTGACAA ) ( ACGT, TAGATCACGTTTTTA ) ( ACGT, CCATAGACGTACGTC )

      fog-0-1-3: CTATTTTAGTGGTCAG, TGAGTTGCACTTAAG, ATTAGGACCATGTAG, AGTGGTGCACATGAT
        Map:     ( TGCA, TGAGTTGCACTTAAG ) ( TGCA, AGTGGTGCACATGAT )
        Shuffle: ( TGCA, TATTAATGCAATGAG ) ( TGCA, TGAGTTGCACTTAAG ) ( TGCA, AGTGGTGCACATGAT )

      fog-0-1-4: ACGTCAACGTCATCG, TTTATCTCTCGAAACT, ATTCCATAGTGAGTG, TTATCGTTATTGCTAG, CCATAGACGTACGTC

      Map script (OK, this is some Perl!):
          while (<STDIN>) {
              my $value = $_;
              chomp $value;
              if ($value =~ /(ACGT)/) { print "$1\t$value\n"; }
              if ($value =~ /(TGCA)/) { print "$1\t$value\n"; }
          }
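      For the k-mer frequency counting use case named in the slide title, a minimal Ruby streaming mapper might look like the sketch below (k = 4 is an arbitrary choice for illustration); a reducer would then sum the 1s per k-mer Key, in the same pattern as the cluster reducer above:

          # kmer_count_map.rb -- k-mer counting mapper sketch for Hadoop streaming (illustrative, k = 4 assumed)
          K = 4
          STDIN.each_line do |line|
            seq = line.chomp
            next if seq.start_with?(">")        # skip FASTA headers, if present
            (0..seq.length - K).each do |i|
              puts "#{seq[i, K]}\t1"            # emit each k-mer with a count of 1
            end
          end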
    • References
      [1] Aaron McKenna et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297–1303, September 2010.
      [2] Suzanne Matthews et al. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinformatics, 11(Suppl 1):S15+, 2010.
      [3] G. Sudha Sadasivam et al. A novel approach to multiple sequence alignment using Hadoop data grids. In MDAC '10: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, pages 1–7, NY, USA, 2010. ACM.
      [4] Christopher Moretti et al. Scaling up classifiers to cloud computers. In ICDM '08: Eighth IEEE International Conference on Data Mining, pages 472–481, NY, USA, 2010. ACM.
      [5] Weizhong Zhao et al. Parallel k-means clustering based on MapReduce. In Martin G. Jaatun, Gansen Zhao, and Chunming Rong, editors, Cloud Computing, LNCS 5931. Springer Berlin, 2009.
      [6] Yang Liu et al. MapReduce-based pattern finding algorithm applied in motif detection for prescription compatibility network. Lecture Notes in Computer Science, 27:341–355.
      [7] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–1369, June 2009.
      [8] Ben Langmead et al. Searching for SNPs with cloud computing. Genome Biology, 10(11):R134+, November 2009.
    • Further References
      ● we showed the core framework and Hadoop Streaming (using scripting languages)
      ● there is much more for experienced Java developers:
        - complex data structures on the Value field
        - combiners, custom serialization, compression
        - coding patterns for algorithms in MapReduce
      ● http://hadoop.apache.org :
        - HBase / Hive: scalable, distributed database / data warehouse for large tables
        - Mahout: a scalable machine learning and data mining library
        - Pig: data workflow language and execution framework for parallel computation
    • /home/cloud/training/hadoop_cmds.sh :

          hadoop fs -mkdir /user/$USER/workshop

          hadoop fs -put /home/cloud/training/uniref100_proteins \
              /user/$USER/workshop/uniref100_proteins

          hadoop fs -ls /user/$USER/workshop

          hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
              -input   /user/$USER/workshop/uniref100_proteins \
              -output  /user/$USER/workshop/uniref100_clusters \
              -file    /home/cloud/training/uniref100_clusters_map.rb \
              -mapper  /home/cloud/training/uniref100_clusters_map.rb \
              -file    /home/cloud/training/uniref100_clusters_reduce.rb \
              -reducer /home/cloud/training/uniref100_clusters_reduce.rb

          hadoop fs -get /user/$USER/workshop/uniref100_clusters /home/cloud/users/$USER

          gunzip /home/cloud/users/$USER/uniref100_clusters/part-00000.gz

          more /home/cloud/users/$USER/uniref100_clusters/part-00000

          hadoop fs -rmr /user/$USER/workshop