Reference: http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=Sample_Results



[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ ll
total 24332
-rw-r--r-- 1 hadoop hadoop 1995984 Jun 15 17:19 100k.3.txt
-rw-r--r-- 1 hadoop hadoop 1995984 Dec  5  2008 100k.3.txt.gold
-rw-r--r-- 1 hadoop hadoop 4493593 Jun 15 17:06 100k.br
-rw-r--r-- 1 hadoop hadoop 4388895 Dec  5  2008 100k.fa
-rw-r--r-- 1 hadoop hadoop 1177790 Jun 15 17:06 100k.fa.map
-rw-r--r-- 1 hadoop hadoop    8337 Dec  5  2008 cloudburst.err.gold
-rw-r--r-- 1 hadoop hadoop   57014 Jul  9  2010 CloudBurst.jar
-rw-r--r-- 1 hadoop hadoop 4067962 Jul  9  2010 ConvertFastaForCloud.jar
-rw-r--r-- 1 hadoop hadoop 4067959 Jul  9  2010 PrintAlignments.jar
-rw-r--r-- 1 hadoop hadoop    1452 Jul  9  2010 README.txt
drwxr-xr-x 2 hadoop hadoop    4096 Jun 15 17:19 results
-rw-r--r-- 1 hadoop hadoop  579773 Jun 15 17:06 s_suis.br
-rw-r--r-- 1 hadoop hadoop 2040970 Dec  5  2008 s_suis.fa
-rw-r--r-- 1 hadoop hadoop      21 Jun 15 17:06 s_suis.fa.map
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ cat README.txt
Sample data for CloudBurst
==========================


CloudBurst has several parameters to control the sensitivity of the
alignment algorithm. Here it finds the unambiguous best alignment for
100,000 reads allowing up to 3 mismatches when mapping to the corresponding
S. suis genome.




== Sample input data


s_suis.fa: Streptococcus suis reference genome sequence
100k.fa:   100,000 36bp Illumina reads available from
       http://www.sanger.ac.uk/Projects/S_suis/


== Format the input data
$ java -jar ConvertFastaForCloud.jar s_suis.fa s_suis.br
$ java -jar ConvertFastaForCloud.jar 100k.fa 100k.br
s_suis.br: reference genome in CloudBurst binary format
100k.br:   Reads in CloudBurst binary format


... (omitted) ...


[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head s_suis.fa
>Streptococcus_suis
atgaaccaagaacaacttttttggcaacgatttattgaattggcaaaggtaaattttaag
ccatctatttatgatttttatgtcgctgatgcaaaattactcggaatcaaccagcaagtt
gccaatattttcttaaatcgtccatttaaaaaagatttctgggaaaaaaacttcgaagag
ttaatgattgccgctagttttgaaagctacggagagcctcttaccatccaatatcaattt
... (omitted) ...
acagaggatgaacaggagattaggaatactacaaacacaagaagttcaatagttcaccag
gtacagacacttgagccggctactcctcaagaaacttttaaaccggttcattctgatata
aaatcccagtacacctttgctaattttgtacaaggagacaataatcactgggcaaaggct
gcagctttagctgtatctgataacctaggtgagctctacaatccattattcatttttggt
ggtcctggtcttggaaaaactcatattttaaatgcgattggaaataaggttctagccgat
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ wc -l s_suis.fa
33460 s_suis.fa
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head 100k.fa
>1
GCCTGTTCTTTACATGATTTTTGGTCTAGTGTATGG
>2
AACCGCTGTAAAGGCTTCTGCCACACCGATTTCTTG
>3
GAGGTGATTGTGGTATTGT.GGTAAATCGGTGATTG
>4
GCTTTAGCCGACCTGAACT.GACTACAAGTTGACCA
>5
AAAGGCTACCCGCGGTTGAACCTTACGTGACACATT
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ tail 100k.fa
>99996
AATGCCCGTAACAACGGGCTTTTATCTTGTTCTAAA
>99997
GTCAGATAGCGCAGGAATTTCAAAGGAATTTGGACC
>99998
AGTTAACTCTTCAGCTGTAAAGTTGTAGTTTTCTAA
>99999
GCGGCATAAATTGGATAAAGAAAGAACTGAAGGACA
>100000
GTTACCATGTATTGTGACAGATAACCACGGTGGAGT
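
The `head` output above shows that some reads contain a `.` where the base call is ambiguous, and the README states that every read is 36bp. Before choosing `minreadlen` and `maxreadlen`, both can be checked quickly. The helper below is a minimal sketch: `fasta_stats` is a hypothetical name, not part of CloudBurst, and it assumes one sequence line per record, which holds for `100k.fa` but not for the line-wrapped reference `s_suis.fa`.

```shell
# Hypothetical helper: summarise a read file in FASTA format.
# Assumes each record is a ">" header followed by a single sequence line.
fasta_stats() {
  awk '
    /^>/ { reads++; next }                       # header line: count one read
    NF {
      n = length($0)
      if (min == 0 || n < min) min = n           # shortest sequence seen so far
      if (n > max) max = n                       # longest sequence seen so far
      if ($0 ~ /[^ACGTacgt]/) ambiguous++        # any non-ACGT character, e.g. "."
    }
    END {
      printf "reads=%d min_len=%d max_len=%d ambiguous=%d\n", reads, min, max, ambiguous
    }
  ' "$@"
}
```

Running `fasta_stats 100k.fa` in the CloudBurst-1.1.0 directory should report `min_len` and `max_len` of 36 if the README description holds; those are the values passed as `minreadlen` and `maxreadlen` further down.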
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -mkdir /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -put ../CloudBurst-1.1.0/s_suis.br /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -put ../CloudBurst-1.1.0/100k.br /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop jar ../CloudBurst-1.1.0/CloudBurst.jar
Usage: CloudBurst refpath qrypath outpath minreadlen maxreadlen k allowdifferences filteralignments #mappers #reduces #fmappers #freducers blocksize redundancy


1. refpath:           path in hdfs to the reference file
2. qrypath:           path in hdfs to the query file
3. outpath:           path to a directory to store the results (old results are automatically deleted)
4. minreadlen:        minimum length of the reads
5. maxreadlen:        maximum read length
6. k:                 number of mismatches / differences to allow (higher number requires more time)
7. allowdifferences:  0: mismatches only, 1: indels as well
8. filteralignments:  0: all alignments, 1: only report unambiguous best alignment (results identical to RMAP)
9. #mappers:          number of mappers to use.              suggested: #processor-cores * 10
10. #reduces:         number of reducers to use.             suggested: #processor-cores * 2
11. #fmappers:        number of mappers for filtration alg.  suggested: #processor-cores
12. #freducers:       number of reducers for filtration alg. suggested: #processor-cores
13. blocksize:        number of qry and ref tuples to consider at a time in the reduce phase. suggested: 128
14. redundancy:       number of copies of low complexity seeds to use. suggested: # processor cores
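
The last six arguments in the usage text are all suggested as simple functions of the processor-core count. As a rough sketch, the suggestions can be computed like this. `cloudburst_suggest` is a hypothetical helper and not part of CloudBurst; it only restates the rules printed above.

```shell
# Hypothetical helper: print the suggested values for
#   #mappers #reduces #fmappers #freducers blocksize redundancy
# given the number of processor cores, following the usage text.
cloudburst_suggest() {
  cores=$1
  # mappers = cores*10, reducers = cores*2, filter mappers/reducers = cores,
  # blocksize = 128 (fixed suggestion), redundancy = cores
  echo "$(( cores * 10 )) $(( cores * 2 )) $cores $cores 128 $cores"
}
```

For example, `cloudburst_suggest 24` prints `240 48 24 24 128 24`. The run in this transcript passes `240 48 24 24 128 16`, which is consistent with 24 cores for the first five values but uses a smaller redundancy than the suggestion.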
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop jar ../CloudBurst-1.1.0/CloudBurst.jar /data/cloudburst/s_suis.br /data/cloudburst/100k.br /data/results 36 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err
[hadoop@skcc-nebdap02 hadoop]$ cat cloudburst.err
refath: /data/cloudburst/s_suis.br
qrypath: /data/cloudburst/100k.br
outpath: /data/results-alignments
MIN_READ_LEN: 36
MAX_READ_LEN: 36
K: 3
SEED_LEN: 9
FLANK_LEN: 30
ALLOW_DIFFERENCES: 0
FILTER_ALIGNMENTS: true
NUM_MAP_TASKS: 240
NUM_REDUCE_TASKS: 48
BLOCK_SIZE: 128
REDUNDANCY: 16
 Removing old results
12/06/15 17:11:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/06/15 17:11:28 INFO mapred.FileInputFormat: Total input paths to process : 2
12/06/15 17:11:28 INFO mapred.JobClient: Running job: job_201206112243_0018
12/06/15 17:11:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/15 17:11:47 INFO mapred.JobClient: map 12% reduce 0%
12/06/15 17:11:48 INFO mapred.JobClient: map 14% reduce 0%
12/06/15 17:11:49 INFO mapred.JobClient: map 15% reduce 0%
12/06/15 17:11:50 INFO mapred.JobClient: map 17% reduce 0%
12/06/15 17:11:51 INFO mapred.JobClient: map 19% reduce 0%
12/06/15 17:11:52 INFO mapred.JobClient: map 21% reduce 0%
12/06/15 17:11:53 INFO mapred.JobClient: map 36% reduce 0%
12/06/15 17:11:54 INFO mapred.JobClient: map 40% reduce 0%
12/06/15 17:11:55 INFO mapred.JobClient: map 45% reduce 0%
12/06/15 17:11:56 INFO mapred.JobClient: map 49% reduce 0%
12/06/15 17:11:57 INFO mapred.JobClient: map 56% reduce 0%
12/06/15 17:11:58 INFO mapred.JobClient: map 57% reduce 0%
12/06/15 17:11:59 INFO mapred.JobClient: map 74% reduce 0%
12/06/15 17:12:00 INFO mapred.JobClient: map 80% reduce 1%
12/06/15 17:12:01 INFO mapred.JobClient: map 80% reduce 2%
12/06/15 17:12:02 INFO mapred.JobClient: map 83% reduce 3%
12/06/15 17:12:03 INFO mapred.JobClient: map 91% reduce 4%
12/06/15 17:12:05 INFO mapred.JobClient: map 95% reduce 6%
12/06/15 17:12:06 INFO mapred.JobClient: map 95% reduce 9%
12/06/15 17:12:07 INFO mapred.JobClient: map 95% reduce 10%
12/06/15 17:12:08 INFO mapred.JobClient: map 100% reduce 14%
12/06/15 17:12:09 INFO mapred.JobClient: map 100% reduce 17%
12/06/15 17:12:10 INFO mapred.JobClient: map 100% reduce 18%
12/06/15 17:12:11 INFO mapred.JobClient: map 100% reduce 22%
12/06/15 17:12:13 INFO mapred.JobClient: map 100% reduce 23%
12/06/15 17:12:14 INFO mapred.JobClient: map 100% reduce 28%
12/06/15 17:12:15 INFO mapred.JobClient: map 100% reduce 31%
12/06/15 17:12:17 INFO mapred.JobClient: map 100% reduce 51%
12/06/15 17:12:18 INFO mapred.JobClient: map 100% reduce 65%
12/06/15 17:12:19 INFO mapred.JobClient: map 100% reduce 70%
12/06/15 17:12:20 INFO mapred.JobClient: map 100% reduce 87%
12/06/15 17:12:21 INFO mapred.JobClient: map 100% reduce 92%
12/06/15 17:12:22 INFO mapred.JobClient: map 100% reduce 94%
12/06/15 17:12:23 INFO mapred.JobClient: map 100% reduce 98%
12/06/15 17:12:26 INFO mapred.JobClient: map 100% reduce 100%
12/06/15 17:12:31 INFO mapred.JobClient: Job complete: job_201206112243_0018
12/06/15 17:12:32 INFO mapred.JobClient: Counters: 31
12/06/15 17:12:32 INFO mapred.JobClient:   Job Counters
12/06/15 17:12:32 INFO mapred.JobClient:    Launched reduce tasks=48
12/06/15 17:12:32 INFO mapred.JobClient:    SLOTS_MILLIS_MAPS=2980992
12/06/15 17:12:32 INFO mapred.JobClient:    Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 17:12:32 INFO mapred.JobClient:    Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 17:12:32 INFO mapred.JobClient:    Rack-local map tasks=158
12/06/15 17:12:32 INFO mapred.JobClient:    Launched map tasks=241
12/06/15 17:12:32 INFO mapred.JobClient:    Data-local map tasks=83
12/06/15 17:12:32 INFO mapred.JobClient:    SLOTS_MILLIS_REDUCES=1106915
12/06/15 17:12:32 INFO mapred.JobClient:   File Input Format Counters
12/06/15 17:12:32 INFO mapred.JobClient:    Bytes Read=5587101
12/06/15 17:12:32 INFO mapred.JobClient:   File Output Format Counters
12/06/15 17:12:32 INFO mapred.JobClient:    Bytes Written=2707836
12/06/15 17:12:32 INFO mapred.JobClient:   FileSystemCounters
12/06/15 17:12:32 INFO mapred.JobClient:    FILE_BYTES_READ=140515797
12/06/15 17:12:32 INFO mapred.JobClient:    HDFS_BYTES_READ=6112267
12/06/15 17:12:32 INFO mapred.JobClient:    FILE_BYTES_WRITTEN=288167030
12/06/15 17:12:32 INFO mapred.JobClient:    HDFS_BYTES_WRITTEN=2707836
12/06/15 17:12:32 INFO mapred.JobClient:   Map-Reduce Framework
12/06/15 17:12:32 INFO mapred.JobClient:    Map output materialized bytes=140584917
12/06/15 17:12:32 INFO mapred.JobClient:    Map input records=100032
12/06/15 17:12:32 INFO mapred.JobClient:    Reduce shuffle bytes=140436273
12/06/15 17:12:32 INFO mapred.JobClient:    Spilled Records=5558658
12/06/15 17:12:32 INFO mapred.JobClient:    Map output bytes=134956851
12/06/15 17:12:32 INFO mapred.JobClient:    Total committed heap usage (bytes)=57936314368
12/06/15 17:12:32 INFO mapred.JobClient:    CPU time spent (ms)=1693370
12/06/15 17:12:32 INFO mapred.JobClient:    Map input bytes=5073092
12/06/15 17:12:32 INFO mapred.JobClient:    SPLIT_RAW_BYTES=24638
12/06/15 17:12:32 INFO mapred.JobClient:    Combine input records=0
12/06/15 17:12:32 INFO mapred.JobClient:    Reduce input records=2774585
12/06/15 17:12:32 INFO mapred.JobClient:     Reduce input groups=254196
12/06/15 17:12:32 INFO mapred.JobClient:     Combine output records=0
12/06/15 17:12:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=57459982336
12/06/15 17:12:32 INFO mapred.JobClient:     Reduce output records=81128
12/06/15 17:12:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=754874736640
12/06/15 17:12:32 INFO mapred.JobClient:     Map output records=2779329
CloudBurst Finished
Alignment time: 65.36
NUM_FMAP_TASKS: 24
NUM_FREDUCE_TASKS: 24
 Removing old results
12/06/15 17:12:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/06/15 17:12:32 INFO mapred.FileInputFormat: Total input paths to process : 48
12/06/15 17:12:39 INFO mapred.JobClient: Running job: job_201206112243_0019
12/06/15 17:12:40 INFO mapred.JobClient: map 0% reduce 0%
12/06/15 17:12:54 INFO mapred.JobClient: map 62% reduce 0%
12/06/15 17:12:55 INFO mapred.JobClient: map 100% reduce 0%
12/06/15 17:13:06 INFO mapred.JobClient: map 100% reduce 16%
12/06/15 17:13:07 INFO mapred.JobClient: map 100% reduce 33%
12/06/15 17:13:09 INFO mapred.JobClient: map 100% reduce 58%
12/06/15 17:13:10 INFO mapred.JobClient: map 100% reduce 75%
12/06/15 17:13:12 INFO mapred.JobClient: map 100% reduce 87%
12/06/15 17:13:13 INFO mapred.JobClient: map 100% reduce 91%
12/06/15 17:13:15 INFO mapred.JobClient: map 100% reduce 100%
12/06/15 17:13:20 INFO mapred.JobClient: Job complete: job_201206112243_0019
12/06/15 17:13:20 INFO mapred.JobClient: Counters: 31
12/06/15 17:13:20 INFO mapred.JobClient:    Job Counters
12/06/15 17:13:20 INFO mapred.JobClient:     Launched reduce tasks=24
12/06/15 17:13:20 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=207232
12/06/15 17:13:20 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 17:13:20 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 17:13:20 INFO mapred.JobClient:     Rack-local map tasks=5
12/06/15 17:13:20 INFO mapred.JobClient:     Launched map tasks=48
12/06/15 17:13:20 INFO mapred.JobClient:     Data-local map tasks=43
12/06/15 17:13:20 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=245651
12/06/15 17:13:20 INFO mapred.JobClient:    File Input Format Counters
12/06/15 17:13:20 INFO mapred.JobClient:     Bytes Read=2707836
12/06/15 17:13:20 INFO mapred.JobClient:     File Output Format Counters
12/06/15 17:13:20 INFO mapred.JobClient:      Bytes Written=2485042
12/06/15 17:13:20 INFO mapred.JobClient:     FileSystemCounters
12/06/15 17:13:20 INFO mapred.JobClient:      FILE_BYTES_READ=2188332
12/06/15 17:13:20 INFO mapred.JobClient:      HDFS_BYTES_READ=2713260
12/06/15 17:13:20 INFO mapred.JobClient:      FILE_BYTES_WRITTEN=6039532
12/06/15 17:13:20 INFO mapred.JobClient:      HDFS_BYTES_WRITTEN=2485042
12/06/15 17:13:20 INFO mapred.JobClient:     Map-Reduce Framework
12/06/15 17:13:20 INFO mapred.JobClient:      Map output materialized bytes=2195100
12/06/15 17:13:20 INFO mapred.JobClient:      Map input records=81128
12/06/15 17:13:20 INFO mapred.JobClient:      Reduce shuffle bytes=2153793
12/06/15 17:13:20 INFO mapred.JobClient:      Spilled Records=162088
12/06/15 17:13:20 INFO mapred.JobClient:      Map output bytes=2028200
12/06/15 17:13:20 INFO mapred.JobClient:      Total committed heap usage (bytes)=14471921664
12/06/15 17:13:20 INFO mapred.JobClient:      CPU time spent (ms)=95390
12/06/15 17:13:20 INFO mapred.JobClient:      Map input bytes=2703324
12/06/15 17:13:20 INFO mapred.JobClient:      SPLIT_RAW_BYTES=5424
12/06/15 17:13:20 INFO mapred.JobClient:      Combine input records=81128
12/06/15 17:13:20 INFO mapred.JobClient:      Reduce input records=81044
12/06/15 17:13:20 INFO mapred.JobClient:      Reduce input groups=76511
12/06/15 17:13:20 INFO mapred.JobClient:      Combine output records=81044
12/06/15 17:13:20 INFO mapred.JobClient:      Physical memory (bytes) snapshot=13169172480
12/06/15 17:13:20 INFO mapred.JobClient:      Reduce output records=74502
12/06/15 17:13:20 INFO mapred.JobClient:      Virtual memory (bytes) snapshot=193761902592
12/06/15 17:13:20 INFO mapred.JobClient:      Map output records=81128
FilterAlignments Finished
Filtering time: 48.481
Total Running time: 113.841
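
The log reports three timings: the alignment job, the filtering job, and the total. A small sketch for pulling them out of `cloudburst.err` and confirming that the total is the sum of the two phases follows. `phase_times` is a hypothetical helper name; the line prefixes it matches are the ones shown in the log above.

```shell
# Hypothetical helper: extract the per-phase timings (seconds) from a
# CloudBurst log and print them next to their sum.
phase_times() {
  awk '
    /^Alignment time:/     { align  = $3 }   # third field is the number
    /^Filtering time:/     { filter = $3 }
    /^Total Running time:/ { total  = $4 }   # fourth field is the number
    END {
      printf "alignment=%.3f filtering=%.3f sum=%.3f reported=%.3f\n",
             align, filter, align + filter, total
    }
  ' "$@"
}
```

For this run, `phase_times cloudburst.err` gives a sum of 113.841 (65.36 + 48.481), which matches the reported total.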
[hadoop@skcc-nebdap02 hadoop]$
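
The header of `cloudburst.err` shows `SEED_LEN: 9` and `FLANK_LEN: 30` for 36bp reads with `K: 3`. These values are consistent with the usual pigeonhole argument for seed-and-extend alignment: if a read has at most k mismatches and is cut into k+1 pieces, at least one piece must match exactly. The formulas below are inferred from the logged numbers, not taken from the CloudBurst source, so treat them as an assumption. `seed_and_flank` is a hypothetical helper.

```shell
# Hypothetical helper: reproduce SEED_LEN and FLANK_LEN from the log header.
# Assumed formulas (inferred from the log, not confirmed against the source):
#   SEED_LEN  = minreadlen / (k + 1)       (integer division)
#   FLANK_LEN = maxreadlen - SEED_LEN + k
# Arguments: minreadlen maxreadlen k
seed_and_flank() {
  seed=$(( $1 / ($3 + 1) ))
  flank=$(( $2 - seed + $3 ))
  echo "SEED_LEN=$seed FLANK_LEN=$flank"
}
```

With the arguments used in this run, `seed_and_flank 36 36 3` prints `SEED_LEN=9 FLANK_LEN=30`, matching the log.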
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -get /data/results/ ../CloudBurst-1.1.0/results
[hadoop@skcc-nebdap02 hadoop]$ cd ../CloudBurst-1.1.0
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ java -jar PrintAlignments.jar results | sort -nk4 > 100k.3.txt
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head -n 20 100k.3.txt
1    766133   766169   1       1  +
1    297899   297935   2       0  -
1    1325118  1325154  4       1  +
1    145970   146006   7       1  -
1    553513   553549   8       0  -
1    1779842  1779878  9       0  -
1    86299    86335    10      0  -
1    1503808  1503844  11      2  +
1    397758   397794   12      0  +
1    241778   241814   13      0  -
1    626711   626747   14      0  +
1    142141   142177   15      1  +
1    1401129  1401165  16      1  -
1    306289   306325   17      1  +
1    628571   628607   18      1  -
1    815172   815208   19      0  -
1    1624600  1624636  20      0  +
1    13779    13815    21      0  +
1    129064   129100   22      1  +
1    1382938  1382974  24      2  +
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ tail -n 20 100k.3.txt
1    1796768  1796804  99976   2  -
1    1021128  1021164  99978   0  -
1    1350005  1350041  99980   1  +
1    799280   799316   99981   2  -
1    139518   139554   99983   0  +
1    57158    57194    99985   0  +
1    1663030  1663066  99986   2  +
1    549235   549271   99987   0  -
1    1400509  1400545  99988   0  +
1    880593   880629   99989   0  +
1    918064   918100   99990   0  +
1    937994   938030   99992   1  -
1    94456    94492    99993   0  +
1    1144320  1144356  99994   0  +
1    1441627  1441663  99995   0  +
1    1281557  1281593  99996   0  +
1    1323611  1323647  99997   2  -
1    800095   800131   99998   0  -
1    1956458  1956494  99999   1  +
1    134848   134884   100000  2  -
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ wc -l 100k.3.txt
74502 100k.3.txt
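
The 74502 lines match the `Reduce output records=74502` counter of the filtering job, so roughly 74.5% of the 100,000 reads received an unambiguous best alignment. To break the result down further, the sketch below tallies alignments by strand and by mismatch count. It assumes the columns are reference id, start, end, read id, mismatches, and strand; that reading is inferred from the sample lines above (end minus start is always 36, and `sort -nk4` orders by read id), not from documentation. `alignment_summary` is a hypothetical helper.

```shell
# Hypothetical helper: summarise PrintAlignments output.
# Assumed columns: ref_id start end read_id mismatches strand
alignment_summary() {
  awk '
    {
      total++
      mm[$5]++                       # tally by mismatch count (column 5)
      if ($5 > max) max = $5
      if ($6 == "+") fwd++           # forward-strand alignments
      else if ($6 == "-") rev++      # reverse-strand alignments
    }
    END {
      printf "total=%d forward=%d reverse=%d\n", total, fwd, rev
      for (k = 0; k <= max; k++) printf "mismatches=%d count=%d\n", k, mm[k]
    }
  ' "$@"
}
```

Usage would be `alignment_summary 100k.3.txt`. Since `100k.3.txt` and `100k.3.txt.gold` have the same size in the directory listing, `cmp 100k.3.txt 100k.3.txt.gold` is a quick way to check whether the run reproduced the reference result exactly.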
CloudBurst tutorial
