Cloud burst tutorial

Reference: http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=Sample_Results

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ ll
total 24332
-rw-r--r-- 1 hadoop hadoop 1995984 Jun 15 17:19 100k.3.txt
-rw-r--r-- 1 hadoop hadoop 1995984 Dec 5 2008 100k.3.txt.gold
-rw-r--r-- 1 hadoop hadoop 4493593 Jun 15 17:06 100k.br
-rw-r--r-- 1 hadoop hadoop 4388895 Dec 5 2008 100k.fa
-rw-r--r-- 1 hadoop hadoop 1177790 Jun 15 17:06 100k.fa.map
-rw-r--r-- 1 hadoop hadoop 8337 Dec 5 2008 cloudburst.err.gold
-rw-r--r-- 1 hadoop hadoop 57014 Jul 9 2010 CloudBurst.jar
-rw-r--r-- 1 hadoop hadoop 4067962 Jul 9 2010 ConvertFastaForCloud.jar
-rw-r--r-- 1 hadoop hadoop 4067959 Jul 9 2010 PrintAlignments.jar
-rw-r--r-- 1 hadoop hadoop 1452 Jul 9 2010 README.txt
drwxr-xr-x 2 hadoop hadoop 4096 Jun 15 17:19 results
-rw-r--r-- 1 hadoop hadoop 579773 Jun 15 17:06 s_suis.br
-rw-r--r-- 1 hadoop hadoop 2040970 Dec 5 2008 s_suis.fa
-rw-r--r-- 1 hadoop hadoop 21 Jun 15 17:06 s_suis.fa.map

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ cat README.txt
Sample data for CloudBurst
==========================
CloudBurst has several parameters to control the sensitivity of the
alignment algorithm. Here it finds the unambiguous best alignment for
100,000 reads allowing up to 3 mismatches when mapping to the corresponding
S. suis genome.

== Sample input data
s_suis.fa: Streptococcus suis reference genome sequence
100k.fa: 100,000 36bp Illumina reads available from http://www.sanger.ac.uk/Projects/S_suis/

== Format the input data
$ java -jar ConvertFastaForCloud.jar s_suis.fa s_suis.br
$ java -jar ConvertFastaForCloud.jar 100k.fa 100k.br
s_suis.br: reference genome in CloudBurst binary format
100k.br: Reads in CloudBurst binary format
... (omitted) ...

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head s_suis.fa
>Streptococcus_suis
atgaaccaagaacaacttttttggcaacgatttattgaattggcaaaggtaaattttaagccatctatttatgatttttatgtcgctgatgcaaaattactcggaatcaaccagcaagttgccaatattttcttaaatcgtccatttaaaaaagatttctgggaaaaaaacttcgaagagttaatgattgccgctagttttgaaagctacggagagcctcttaccatccaatatcaattt
... (omitted) ...
acagaggatgaacaggagattaggaatactacaaacacaagaagttcaatagttcaccaggtacagacacttgagccggctactcctcaagaaacttttaaaccggttcattctgatataaaatcccagtacacctttgctaattttgtacaaggagacaataatcactgggcaaaggctgcagctttagctgtatctgataacctaggtgagctctacaatccattattcatttttggtggtcctggtcttggaaaaactcatattttaaatgcgattggaaataaggttctagccgat

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ wc -l s_suis.fa
33460 s_suis.fa

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head 100k.fa
>1
GCCTGTTCTTTACATGATTTTTGGTCTAGTGTATGG
>2
AACCGCTGTAAAGGCTTCTGCCACACCGATTTCTTG
>3
GAGGTGATTGTGGTATTGT.GGTAAATCGGTGATTG
>4
GCTTTAGCCGACCTGAACT.GACTACAAGTTGACCA
>5
AAAGGCTACCCGCGGTTGAACCTTACGTGACACATT

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ tail 100k.fa
>99996
AATGCCCGTAACAACGGGCTTTTATCTTGTTCTAAA
>99997
GTCAGATAGCGCAGGAATTTCAAAGGAATTTGGACC
>99998
AGTTAACTCTTCAGCTGTAAAGTTGTAGTTTTCTAA
>99999
GCGGCATAAATTGGATAAAGAAAGAACTGAAGGACA
>100000
GTTACCATGTATTGTGACAGATAACCACGGTGGAGT

[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -mkdir /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -put ../CloudBurst-1.1.0/s_suis.br /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -put ../CloudBurst-1.1.0/100k.br /data/cloudburst

[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop jar ../CloudBurst-1.1.0/CloudBurst.jar
Usage: CloudBurst refpath qrypath outpath minreadlen maxreadlen k allowdifferences filteralignments #mappers #reduces #fmappers #freducers blocksize redundancy
1. refpath: path in hdfs to the reference file
2. qrypath: path in hdfs to the query file
3. outpath: path to a directory to store the results (old results are automatically deleted)
4. minreadlen: minimum length of the reads
5. maxreadlen: maximum read length
6. k: number of mismatches / differences to allow (higher number requires more time)
7. allowdifferences: 0: mismatches only, 1: indels as well
8. filteralignments: 0: all alignments, 1: only report unambiguous best alignment (results identical to RMAP)
9. #mappers: number of mappers to use. suggested: #processor-cores * 10
10. #reduces: number of reducers to use. suggested: #processor-cores * 2
11. #fmappers: number of mappers for filtration alg. suggested: #processor-cores
12. #freducers: number of reducers for filtration alg. suggested: #processor-cores
13. blocksize: number of qry and ref tuples to consider at a time in the reduce phase. suggested: 128
14. redundancy: number of copies of low complexity seeds to use. suggested: # processor cores

[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop jar ../CloudBurst-1.1.0/CloudBurst.jar /data/cloudburst/s_suis.br /data/cloudburst/100k.br /data/results 36 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err

[hadoop@skcc-nebdap02 hadoop]$ cat cloudburst.err
refath: /data/cloudburst/s_suis.br
qrypath: /data/cloudburst/100k.br
outpath: /data/results-alignments
MIN_READ_LEN: 36
MAX_READ_LEN: 36
K: 3
SEED_LEN: 9
FLANK_LEN: 30
ALLOW_DIFFERENCES: 0
FILTER_ALIGNMENTS: true
NUM_MAP_TASKS: 240
NUM_REDUCE_TASKS: 48
BLOCK_SIZE: 128
REDUNDANCY: 16
Removing old results
12/06/15 17:11:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/06/15 17:11:28 INFO mapred.FileInputFormat: Total input paths to process : 2
12/06/15 17:11:28 INFO mapred.JobClient: Running job: job_201206112243_0018
12/06/15 17:11:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/15 17:11:47 INFO mapred.JobClient: map 12% reduce 0%
12/06/15 17:11:48 INFO mapred.JobClient: map 14% reduce 0%
12/06/15 17:11:49 INFO mapred.JobClient: map 15% reduce 0%
12/06/15 17:11:50 INFO mapred.JobClient: map 17% reduce 0%
12/06/15 17:11:51 INFO mapred.JobClient: map 19% reduce 0%
12/06/15 17:11:52 INFO mapred.JobClient: map 21% reduce 0%
12/06/15 17:11:53 INFO mapred.JobClient: map 36% reduce 0%
12/06/15 17:11:54 INFO mapred.JobClient: map 40% reduce 0%
12/06/15 17:11:55 INFO mapred.JobClient: map 45% reduce 0%
12/06/15 17:11:56 INFO mapred.JobClient: map 49% reduce 0%
12/06/15 17:11:57 INFO mapred.JobClient: map 56% reduce 0%
12/06/15 17:11:58 INFO mapred.JobClient: map 57% reduce 0%
12/06/15 17:11:59 INFO mapred.JobClient: map 74% reduce 0%
12/06/15 17:12:00 INFO mapred.JobClient: map 80% reduce 1%
12/06/15 17:12:01 INFO mapred.JobClient: map 80% reduce 2%
12/06/15 17:12:02 INFO mapred.JobClient: map 83% reduce 3%
12/06/15 17:12:03 INFO mapred.JobClient: map 91% reduce 4%
12/06/15 17:12:05 INFO mapred.JobClient: map 95% reduce 6%
12/06/15 17:12:06 INFO mapred.JobClient: map 95% reduce 9%
12/06/15 17:12:07 INFO mapred.JobClient: map 95% reduce 10%
12/06/15 17:12:08 INFO mapred.JobClient: map 100% reduce 14%
12/06/15 17:12:09 INFO mapred.JobClient: map 100% reduce 17%
12/06/15 17:12:10 INFO mapred.JobClient: map 100% reduce 18%
12/06/15 17:12:11 INFO mapred.JobClient: map 100% reduce 22%
12/06/15 17:12:13 INFO mapred.JobClient: map 100% reduce 23%
12/06/15 17:12:14 INFO mapred.JobClient: map 100% reduce 28%
12/06/15 17:12:15 INFO mapred.JobClient: map 100% reduce 31%
12/06/15 17:12:17 INFO mapred.JobClient: map 100% reduce 51%
12/06/15 17:12:18 INFO mapred.JobClient: map 100% reduce 65%
12/06/15 17:12:19 INFO mapred.JobClient: map 100% reduce 70%
12/06/15 17:12:20 INFO mapred.JobClient: map 100% reduce 87%
12/06/15 17:12:21 INFO mapred.JobClient: map 100% reduce 92%
12/06/15 17:12:22 INFO mapred.JobClient: map 100% reduce 94%
12/06/15 17:12:23 INFO mapred.JobClient: map 100% reduce 98%
12/06/15 17:12:26 INFO mapred.JobClient: map 100% reduce 100%
12/06/15 17:12:31 INFO mapred.JobClient: Job complete: job_201206112243_0018
12/06/15 17:12:32 INFO mapred.JobClient: Counters: 31
12/06/15 17:12:32 INFO mapred.JobClient: Job Counters
12/06/15 17:12:32 INFO mapred.JobClient: Launched reduce tasks=48
12/06/15 17:12:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2980992
12/06/15 17:12:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 17:12:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 17:12:32 INFO mapred.JobClient: Rack-local map tasks=158
12/06/15 17:12:32 INFO mapred.JobClient: Launched map tasks=241
12/06/15 17:12:32 INFO mapred.JobClient: Data-local map tasks=83
12/06/15 17:12:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1106915
12/06/15 17:12:32 INFO mapred.JobClient: File Input Format Counters
12/06/15 17:12:32 INFO mapred.JobClient: Bytes Read=5587101
12/06/15 17:12:32 INFO mapred.JobClient: File Output Format Counters
12/06/15 17:12:32 INFO mapred.JobClient: Bytes Written=2707836
12/06/15 17:12:32 INFO mapred.JobClient: FileSystemCounters
12/06/15 17:12:32 INFO mapred.JobClient: FILE_BYTES_READ=140515797
12/06/15 17:12:32 INFO mapred.JobClient: HDFS_BYTES_READ=6112267
12/06/15 17:12:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=288167030
12/06/15 17:12:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2707836
12/06/15 17:12:32 INFO mapred.JobClient: Map-Reduce Framework
12/06/15 17:12:32 INFO mapred.JobClient: Map output materialized bytes=140584917
12/06/15 17:12:32 INFO mapred.JobClient: Map input records=100032
12/06/15 17:12:32 INFO mapred.JobClient: Reduce shuffle bytes=140436273
12/06/15 17:12:32 INFO mapred.JobClient: Spilled Records=5558658
12/06/15 17:12:32 INFO mapred.JobClient: Map output bytes=134956851
12/06/15 17:12:32 INFO mapred.JobClient: Total committed heap usage (bytes)=57936314368
12/06/15 17:12:32 INFO mapred.JobClient: CPU time spent (ms)=1693370
12/06/15 17:12:32 INFO mapred.JobClient: Map input bytes=5073092
12/06/15 17:12:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=24638
12/06/15 17:12:32 INFO mapred.JobClient: Combine input records=0
12/06/15 17:12:32 INFO mapred.JobClient: Reduce input records=2774585
12/06/15 17:12:32 INFO mapred.JobClient: Reduce input groups=254196
12/06/15 17:12:32 INFO mapred.JobClient: Combine output records=0
12/06/15 17:12:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=57459982336
12/06/15 17:12:32 INFO mapred.JobClient: Reduce output records=81128
12/06/15 17:12:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=754874736640
12/06/15 17:12:32 INFO mapred.JobClient: Map output records=2779329
CloudBurst Finished
Alignment time: 65.36
NUM_FMAP_TASKS: 24
NUM_FREDUCE_TASKS: 24
Removing old results
12/06/15 17:12:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/06/15 17:12:32 INFO mapred.FileInputFormat: Total input paths to process : 48
12/06/15 17:12:39 INFO mapred.JobClient: Running job: job_201206112243_0019
12/06/15 17:12:40 INFO mapred.JobClient: map 0% reduce 0%
12/06/15 17:12:54 INFO mapred.JobClient: map 62% reduce 0%
12/06/15 17:12:55 INFO mapred.JobClient: map 100% reduce 0%
12/06/15 17:13:06 INFO mapred.JobClient: map 100% reduce 16%
12/06/15 17:13:07 INFO mapred.JobClient: map 100% reduce 33%
12/06/15 17:13:09 INFO mapred.JobClient: map 100% reduce 58%
12/06/15 17:13:10 INFO mapred.JobClient: map 100% reduce 75%
12/06/15 17:13:12 INFO mapred.JobClient: map 100% reduce 87%
12/06/15 17:13:13 INFO mapred.JobClient: map 100% reduce 91%
12/06/15 17:13:15 INFO mapred.JobClient: map 100% reduce 100%
12/06/15 17:13:20 INFO mapred.JobClient: Job complete: job_201206112243_0019
12/06/15 17:13:20 INFO mapred.JobClient: Counters: 31
12/06/15 17:13:20 INFO mapred.JobClient: Job Counters
12/06/15 17:13:20 INFO mapred.JobClient: Launched reduce tasks=24
12/06/15 17:13:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=207232
12/06/15 17:13:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 17:13:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 17:13:20 INFO mapred.JobClient: Rack-local map tasks=5
12/06/15 17:13:20 INFO mapred.JobClient: Launched map tasks=48
12/06/15 17:13:20 INFO mapred.JobClient: Data-local map tasks=43
12/06/15 17:13:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=245651
12/06/15 17:13:20 INFO mapred.JobClient: File Input Format Counters
12/06/15 17:13:20 INFO mapred.JobClient: Bytes Read=2707836
12/06/15 17:13:20 INFO mapred.JobClient: File Output Format Counters
12/06/15 17:13:20 INFO mapred.JobClient: Bytes Written=2485042
12/06/15 17:13:20 INFO mapred.JobClient: FileSystemCounters
12/06/15 17:13:20 INFO mapred.JobClient: FILE_BYTES_READ=2188332
12/06/15 17:13:20 INFO mapred.JobClient: HDFS_BYTES_READ=2713260
12/06/15 17:13:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6039532
12/06/15 17:13:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2485042
12/06/15 17:13:20 INFO mapred.JobClient: Map-Reduce Framework
12/06/15 17:13:20 INFO mapred.JobClient: Map output materialized bytes=2195100
12/06/15 17:13:20 INFO mapred.JobClient: Map input records=81128
12/06/15 17:13:20 INFO mapred.JobClient: Reduce shuffle bytes=2153793
12/06/15 17:13:20 INFO mapred.JobClient: Spilled Records=162088
12/06/15 17:13:20 INFO mapred.JobClient: Map output bytes=2028200
12/06/15 17:13:20 INFO mapred.JobClient: Total committed heap usage (bytes)=14471921664
12/06/15 17:13:20 INFO mapred.JobClient: CPU time spent (ms)=95390
12/06/15 17:13:20 INFO mapred.JobClient: Map input bytes=2703324
12/06/15 17:13:20 INFO mapred.JobClient: SPLIT_RAW_BYTES=5424
12/06/15 17:13:20 INFO mapred.JobClient: Combine input records=81128
12/06/15 17:13:20 INFO mapred.JobClient: Reduce input records=81044
12/06/15 17:13:20 INFO mapred.JobClient: Reduce input groups=76511
12/06/15 17:13:20 INFO mapred.JobClient: Combine output records=81044
12/06/15 17:13:20 INFO mapred.JobClient: Physical memory (bytes) snapshot=13169172480
12/06/15 17:13:20 INFO mapred.JobClient: Reduce output records=74502
12/06/15 17:13:20 INFO mapred.JobClient: Virtual memory (bytes) snapshot=193761902592
12/06/15 17:13:20 INFO mapred.JobClient: Map output records=81128
FilterAlignments Finished
Filtering time: 48.481
Total Running time: 113.841

[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -get /data/results/ ../CloudBurst-1.1.0/results
[hadoop@skcc-nebdap02 hadoop]$ cd ../CloudBurst-1.1.0
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ java -jar PrintAlignments.jar results | sort -nk4 > 100k.3.txt

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head -n 20 100k.3.txt
1 766133 766169 1 1 +
1 297899 297935 2 0 -
1 1325118 1325154 4 1 +
1 145970 146006 7 1 -
1 553513 553549 8 0 -
1 1779842 1779878 9 0 -
1 86299 86335 10 0 -
1 1503808 1503844 11 2 +
1 397758 397794 12 0 +
1 241778 241814 13 0 -
1 626711 626747 14 0 +
1 142141 142177 15 1 +
1 1401129 1401165 16 1 -
1 306289 306325 17 1 +
1 628571 628607 18 1 -
1 815172 815208 19 0 -
1 1624600 1624636 20 0 +
1 13779 13815 21 0 +
1 129064 129100 22 1 +
1 1382938 1382974 24 2 +

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ tail -n 20 100k.3.txt
1 1796768 1796804 99976 2 -
1 1021128 1021164 99978 0 -
1 1350005 1350041 99980 1 +
1 799280 799316 99981 2 -
1 139518 139554 99983 0 +
1 57158 57194 99985 0 +
1 1663030 1663066 99986 2 +
1 549235 549271 99987 0 -
1 1400509 1400545 99988 0 +
1 880593 880629 99989 0 +
1 918064 918100 99990 0 +
1 937994 938030 99992 1 -
1 94456 94492 99993 0 +
1 1144320 1144356 99994 0 +
1 1441627 1441663 99995 0 +
1 1281557 1281593 99996 0 +
1 1323611 1323647 99997 2 -
1 800095 800131 99998 0 -
1 1956458 1956494 99999 1 +
1 134848 134884 100000 2 -

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ wc -l 100k.3.txt
74502 100k.3.txt
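
Note on the task-count arguments: the CloudBurst usage text in this session suggests sizing #mappers, #reduces, #fmappers, #freducers and redundancy from the number of processor cores. The values passed here (240 48 24 24) are what those rules give for 24 cores. That core count is inferred from the arguments and is not stated anywhere in the transcript. A minimal sketch of the arithmetic, where suggest_params is a hypothetical helper name and not part of CloudBurst:

```shell
# Print the six trailing CloudBurst arguments for a given core count,
# following the "suggested:" rules of thumb quoted in the usage text.
# Output order: #mappers #reduces #fmappers #freducers blocksize redundancy
# suggest_params is a hypothetical helper, not shipped with CloudBurst.
suggest_params() {
    cores=$1
    mappers=$((cores * 10))   # suggested: #processor-cores * 10
    reduces=$((cores * 2))    # suggested: #processor-cores * 2
    fmappers=$cores           # suggested: #processor-cores
    freducers=$cores          # suggested: #processor-cores
    blocksize=128             # suggested: 128
    redundancy=$cores         # suggested: # processor cores
    echo "$mappers $reduces $fmappers $freducers $blocksize $redundancy"
}

suggest_params 24   # prints: 240 48 24 24 128 24
```

The session itself passed 16 for redundancy rather than the suggested core count, so treat these numbers as starting points only.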
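
Note on reading 100k.3.txt: 74502 of the 100,000 reads ended up with a reported alignment, and the sorted output skips some read numbers (3, 5 and 6 are absent from the first rows). The sketch below tallies the file and lists the absent reads. The column meanings are an assumption inferred from the sample rows, not taken from CloudBurst documentation: column 4 is treated as the read number (the file is sorted on it with sort -nk4), column 5 as the mismatch count, and column 6 as the strand. Both function names are hypothetical.

```shell
# Tally alignments by mismatch count (column 5) and strand (column 6).
# Column meanings are assumptions inferred from the sample rows.
summarize_alignments() {
    awk '
        { mism[$5]++; strand[$6]++; if ($5 + 0 > max) max = $5 + 0 }
        END {
            for (k = 0; k <= max; k++) printf "mismatches=%d %d\n", k, mism[k]
            printf "forward=%d reverse=%d\n", strand["+"], strand["-"]
        }
    ' "$1"
}

# List read numbers in 1..TOTAL that never appear in column 4,
# i.e. reads for which no alignment was reported.
missing_reads() {
    awk -v total="$2" '
        { seen[$4] = 1 }
        END { for (i = 1; i <= total; i++) if (!(i in seen)) print i }
    ' "$1"
}

# Usage, run in the CloudBurst-1.1.0 directory:
#   summarize_alignments 100k.3.txt
#   missing_reads 100k.3.txt 100000 | wc -l
```

If column 4 really is the read number, the second command should report 100000 minus the line count of 100k.3.txt.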