CloudBurst tutorial

Reference: http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=Sample_Results

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ ll
total 24332
-rw-r--r-- 1 hadoop hadoop 1995984 Jun 15 17:19 100k.3.txt
-rw-r--r-- 1 hadoop hadoop 1995984 Dec 5 2008 100k.3.txt.gold
-rw-r--r-- 1 hadoop hadoop 4493593 Jun 15 17:06 100k.br
-rw-r--r-- 1 hadoop hadoop 4388895 Dec 5 2008 100k.fa
-rw-r--r-- 1 hadoop hadoop 1177790 Jun 15 17:06 100k.fa.map
-rw-r--r-- 1 hadoop hadoop 8337 Dec 5 2008 cloudburst.err.gold
-rw-r--r-- 1 hadoop hadoop 57014 Jul 9 2010 CloudBurst.jar
-rw-r--r-- 1 hadoop hadoop 4067962 Jul 9 2010 ConvertFastaForCloud.jar
-rw-r--r-- 1 hadoop hadoop 4067959 Jul 9 2010 PrintAlignments.jar
-rw-r--r-- 1 hadoop hadoop 1452 Jul 9 2010 README.txt
drwxr-xr-x 2 hadoop hadoop 4096 Jun 15 17:19 results
-rw-r--r-- 1 hadoop hadoop 579773 Jun 15 17:06 s_suis.br
-rw-r--r-- 1 hadoop hadoop 2040970 Dec 5 2008 s_suis.fa
-rw-r--r-- 1 hadoop hadoop 21 Jun 15 17:06 s_suis.fa.map
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ cat README.txt
Sample data for CloudBurst
==========================

CloudBurst has several parameters to control the sensitivity of the
alignment algorithm. Here it finds the unambiguous best alignment for
100,000 reads allowing up to 3 mismatches when mapping to the corresponding
S. suis genome.

== Sample input data

s_suis.fa: Streptococcus suis reference genome sequence
100k.fa: 100,000 36bp Illumina reads available from
http://www.sanger.ac.uk/Projects/S_suis/

== Format the input data
$ java -jar ConvertFastaForCloud.jar s_suis.fa s_suis.br
$ java -jar ConvertFastaForCloud.jar 100k.fa 100k.br

s_suis.br: reference genome in CloudBurst binary format
100k.br: Reads in CloudBurst binary format

... (omitted) ...

[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head s_suis.fa
>Streptococcus_suis
atgaaccaagaacaacttttttggcaacgatttattgaattggcaaaggtaaattttaag
ccatctatttatgatttttatgtcgctgatgcaaaattactcggaatcaaccagcaagtt
gccaatattttcttaaatcgtccatttaaaaaagatttctgggaaaaaaacttcgaagag
ttaatgattgccgctagttttgaaagctacggagagcctcttaccatccaatatcaattt
... (omitted) ...
acagaggatgaacaggagattaggaatactacaaacacaagaagttcaatagttcaccag
gtacagacacttgagccggctactcctcaagaaacttttaaaccggttcattctgatata
aaatcccagtacacctttgctaattttgtacaaggagacaataatcactgggcaaaggct
gcagctttagctgtatctgataacctaggtgagctctacaatccattattcatttttggt
ggtcctggtcttggaaaaactcatattttaaatgcgattggaaataaggttctagccgat
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ wc -l s_suis.fa
33460 s_suis.fa
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head 100k.fa
>1
GCCTGTTCTTTACATGATTTTTGGTCTAGTGTATGG
>2
AACCGCTGTAAAGGCTTCTGCCACACCGATTTCTTG
>3
GAGGTGATTGTGGTATTGT.GGTAAATCGGTGATTG
>4
GCTTTAGCCGACCTGAACT.GACTACAAGTTGACCA
>5
AAAGGCTACCCGCGGTTGAACCTTACGTGACACATT
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ tail 100k.fa
>99996
AATGCCCGTAACAACGGGCTTTTATCTTGTTCTAAA
>99997
GTCAGATAGCGCAGGAATTTCAAAGGAATTTGGACC
>99998
AGTTAACTCTTCAGCTGTAAAGTTGTAGTTTTCTAA
>99999
GCGGCATAAATTGGATAAAGAAAGAACTGAAGGACA
>100000
GTTACCATGTATTGTGACAGATAACCACGGTGGAGT
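The reads shown above are all 36 bp, which is why the CloudBurst run later in this session passes 36 for both minreadlen and maxreadlen. Below is a minimal Python sketch for checking that on a reads file before choosing those values. The function names are illustrative and not part of CloudBurst. The sketch skips blank lines and counts reads containing anything other than A, C, G or T, such as the `.` characters in reads 3 and 4, which appear to mark uncalled bases.

```python
from collections import Counter
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA stream, skipping blank lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_length_summary(handle):
    """Return (shortest read, longest read, number of reads with non-ACGT characters)."""
    lengths = Counter()
    ambiguous = 0
    for _, seq in read_fasta(handle):
        lengths[len(seq)] += 1
        if any(base not in "ACGTacgt" for base in seq):
            ambiguous += 1
    return min(lengths), max(lengths), ambiguous


# Three reads copied from the listing above; a real run would pass open("100k.fa").
sample = io.StringIO(
    ">1\nGCCTGTTCTTTACATGATTTTTGGTCTAGTGTATGG\n"
    ">3\nGAGGTGATTGTGGTATTGT.GGTAAATCGGTGATTG\n"
    ">99999\n\nGCGGCATAAATTGGATAAAGAAAGAACTGAAGGACA\n"
)
summary = read_length_summary(sample)
print(summary)  # (36, 36, 1)
```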
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -mkdir /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -put ../CloudBurst-1.1.0/s_suis.br /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -put ../CloudBurst-1.1.0/100k.br /data/cloudburst
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop jar ../CloudBurst-1.1.0/CloudBurst.jar
Usage: CloudBurst refpath qrypath outpath minreadlen maxreadlen k allowdifferences filteralignments #mappers #reduces #fmappers #freducers blocksize redundancy

1. refpath: path in hdfs to the reference file
2. qrypath: path in hdfs to the query file
3. outpath: path to a directory to store the results (old results are automatically deleted)
4. minreadlen: minimum length of the reads
5. maxreadlen: maximum read length
6. k: number of mismatches / differences to allow (higher number requires more time)
7. allowdifferences: 0: mismatches only, 1: indels as well
8. filteralignments: 0: all alignments, 1: only report unambiguous best alignment (results identical to RMAP)
9. #mappers: number of mappers to use. suggested: #processor-cores * 10
10. #reduces: number of reducers to use. suggested: #processor-cores * 2
11. #fmappers: number of mappers for filtration alg. suggested: #processor-cores
12. #freducers: number of reducers for filtration alg. suggested: #processor-cores
13. blocksize: number of qry and ref tuples to consider at a time in the reduce phase. suggested: 128
14. redundancy: number of copies of low complexity seeds to use. suggested: # processor cores
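The "suggested" values in the usage text are all simple multiples of the processor-core count, so the 14 positional arguments can be assembled mechanically. The following is a rough Python sketch of that arithmetic. The helper names are hypothetical, and the core count of 24 is an assumption (the source does not state the cluster size); it was picked because it reproduces the 240/48/24/24 task counts used in the run that follows. That run passes 16 for redundancy rather than the suggested core count.

```python
def suggested_task_counts(cores):
    """Apply the 'suggested' multipliers from the CloudBurst usage text."""
    return {
        "mappers": cores * 10,    # 9.  #mappers: cores * 10
        "reduces": cores * 2,     # 10. #reduces: cores * 2
        "fmappers": cores,        # 11. #fmappers: cores
        "freducers": cores,       # 12. #freducers: cores
        "blocksize": 128,         # 13. blocksize: 128
        "redundancy": cores,      # 14. redundancy: cores
    }


def cloudburst_args(refpath, qrypath, outpath, minreadlen, maxreadlen, k,
                    allowdifferences, filteralignments, cores):
    """Return the 14 positional arguments in the order given by the usage text."""
    s = suggested_task_counts(cores)
    values = [
        refpath, qrypath, outpath, minreadlen, maxreadlen, k,
        allowdifferences, filteralignments,
        s["mappers"], s["reduces"], s["fmappers"], s["freducers"],
        s["blocksize"], s["redundancy"],
    ]
    return [str(v) for v in values]


# cores=24 is an assumed value, not taken from the source.
args = cloudburst_args("/data/cloudburst/s_suis.br", "/data/cloudburst/100k.br",
                       "/data/results", 36, 36, 3, 0, 1, cores=24)
print(" ".join(args))
```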
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop jar ../CloudBurst-1.1.0/CloudBurst.jar /data/cloudburst/s_suis.br \
    /data/cloudburst/100k.br /data/results 36 36 3 0 1 240 48 24 24 128 16 >& cloudburst.err
[hadoop@skcc-nebdap02 hadoop]$ cat cloudburst.err
refath: /data/cloudburst/s_suis.br
qrypath: /data/cloudburst/100k.br
outpath: /data/results-alignments
MIN_READ_LEN: 36
MAX_READ_LEN: 36
K: 3
SEED_LEN: 9
FLANK_LEN: 30
ALLOW_DIFFERENCES: 0
FILTER_ALIGNMENTS: true
NUM_MAP_TASKS: 240
NUM_REDUCE_TASKS: 48
BLOCK_SIZE: 128
REDUNDANCY: 16
Removing old results
12/06/15 17:11:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/06/15 17:11:28 INFO mapred.FileInputFormat: Total input paths to process : 2
12/06/15 17:11:28 INFO mapred.JobClient: Running job: job_201206112243_0018
12/06/15 17:11:29 INFO mapred.JobClient: map 0% reduce 0%

12/06/15 17:12:31 INFO mapred.JobClient: Job complete: job_201206112243_0018
12/06/15 17:12:32 INFO mapred.JobClient: Counters: 31
12/06/15 17:12:32 INFO mapred.JobClient: Job Counters
12/06/15 17:12:32 INFO mapred.JobClient: Launched reduce tasks=48
12/06/15 17:12:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2980992
12/06/15 17:12:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 17:12:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 17:12:32 INFO mapred.JobClient: Rack-local map tasks=158
12/06/15 17:12:32 INFO mapred.JobClient: Launched map tasks=241
12/06/15 17:12:32 INFO mapred.JobClient: Data-local map tasks=83
12/06/15 17:12:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1106915
12/06/15 17:12:32 INFO mapred.JobClient: File Input Format Counters
12/06/15 17:12:32 INFO mapred.JobClient: Bytes Read=5587101
12/06/15 17:12:32 INFO mapred.JobClient: File Output Format Counters
12/06/15 17:12:32 INFO mapred.JobClient: Bytes Written=2707836
12/06/15 17:12:32 INFO mapred.JobClient: FileSystemCounters
12/06/15 17:12:32 INFO mapred.JobClient: FILE_BYTES_READ=140515797
12/06/15 17:12:32 INFO mapred.JobClient: HDFS_BYTES_READ=6112267
12/06/15 17:12:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=288167030
12/06/15 17:12:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2707836
12/06/15 17:12:32 INFO mapred.JobClient: Map-Reduce Framework
12/06/15 17:12:32 INFO mapred.JobClient: Map output materialized bytes=140584917
12/06/15 17:12:32 INFO mapred.JobClient: Map input records=100032
12/06/15 17:12:32 INFO mapred.JobClient: Reduce shuffle bytes=140436273
12/06/15 17:12:32 INFO mapred.JobClient: Spilled Records=5558658
12/06/15 17:12:32 INFO mapred.JobClient: Map output bytes=134956851
12/06/15 17:12:32 INFO mapred.JobClient: Total committed heap usage (bytes)=57936314368
12/06/15 17:12:32 INFO mapred.JobClient: CPU time spent (ms)=1693370
12/06/15 17:12:32 INFO mapred.JobClient: Map input bytes=5073092
12/06/15 17:12:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=24638
12/06/15 17:12:32 INFO mapred.JobClient: Combine input records=0
12/06/15 17:12:32 INFO mapred.JobClient: Reduce input records=2774585
12/06/15 17:12:32 INFO mapred.JobClient: Reduce input groups=254196
12/06/15 17:12:32 INFO mapred.JobClient: Combine output records=0
12/06/15 17:12:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=57459982336
12/06/15 17:12:32 INFO mapred.JobClient: Reduce output records=81128
12/06/15 17:12:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=754874736640
12/06/15 17:12:32 INFO mapred.JobClient: Map output records=2779329
CloudBurst Finished
Alignment time: 65.36
NUM_FMAP_TASKS: 24
NUM_FREDUCE_TASKS: 24
Removing old results
12/06/15 17:12:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/06/15 17:12:32 INFO mapred.FileInputFormat: Total input paths to process : 48
12/06/15 17:12:39 INFO mapred.JobClient: Running job: job_201206112243_0019
12/06/15 17:13:20 INFO mapred.JobClient: Job complete: job_201206112243_0019
12/06/15 17:13:20 INFO mapred.JobClient: Counters: 31
12/06/15 17:13:20 INFO mapred.JobClient: Job Counters
12/06/15 17:13:20 INFO mapred.JobClient: Launched reduce tasks=24
12/06/15 17:13:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=207232
12/06/15 17:13:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 17:13:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 17:13:20 INFO mapred.JobClient: Rack-local map tasks=5
12/06/15 17:13:20 INFO mapred.JobClient: Launched map tasks=48
12/06/15 17:13:20 INFO mapred.JobClient: Data-local map tasks=43
12/06/15 17:13:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=245651
12/06/15 17:13:20 INFO mapred.JobClient: File Input Format Counters
12/06/15 17:13:20 INFO mapred.JobClient: Bytes Read=2707836
12/06/15 17:13:20 INFO mapred.JobClient: File Output Format Counters
12/06/15 17:13:20 INFO mapred.JobClient: Bytes Written=2485042
12/06/15 17:13:20 INFO mapred.JobClient: FileSystemCounters
12/06/15 17:13:20 INFO mapred.JobClient: FILE_BYTES_READ=2188332
12/06/15 17:13:20 INFO mapred.JobClient: HDFS_BYTES_READ=2713260
12/06/15 17:13:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6039532
12/06/15 17:13:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2485042
12/06/15 17:13:20 INFO mapred.JobClient: Map-Reduce Framework
12/06/15 17:13:20 INFO mapred.JobClient: Map output materialized bytes=2195100
12/06/15 17:13:20 INFO mapred.JobClient: Map input records=81128
12/06/15 17:13:20 INFO mapred.JobClient: Reduce shuffle bytes=2153793
12/06/15 17:13:20 INFO mapred.JobClient: Spilled Records=162088
12/06/15 17:13:20 INFO mapred.JobClient: Map output bytes=2028200
12/06/15 17:13:20 INFO mapred.JobClient: Total committed heap usage (bytes)=14471921664
12/06/15 17:13:20 INFO mapred.JobClient: CPU time spent (ms)=95390
12/06/15 17:13:20 INFO mapred.JobClient: Map input bytes=2703324
12/06/15 17:13:20 INFO mapred.JobClient: SPLIT_RAW_BYTES=5424
12/06/15 17:13:20 INFO mapred.JobClient: Combine input records=81128
12/06/15 17:13:20 INFO mapred.JobClient: Reduce input records=81044
12/06/15 17:13:20 INFO mapred.JobClient: Reduce input groups=76511
12/06/15 17:13:20 INFO mapred.JobClient: Combine output records=81044
12/06/15 17:13:20 INFO mapred.JobClient: Physical memory (bytes) snapshot=13169172480
12/06/15 17:13:20 INFO mapred.JobClient: Reduce output records=74502
12/06/15 17:13:20 INFO mapred.JobClient: Virtual memory (bytes) snapshot=193761902592
12/06/15 17:13:20 INFO mapred.JobClient: Map output records=81128
FilterAlignments Finished
Filtering time: 48.481
Total Running time: 113.841
[hadoop@skcc-nebdap02 hadoop]$
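The log above reports SEED_LEN: 9 and FLANK_LEN: 30 for 36 bp reads with K: 3. Those numbers are consistent with the usual seed-and-extend reasoning: if a read of length m aligns with at most k mismatches and is cut into k+1 non-overlapping pieces, at least one piece must match exactly, so an exact seed of length m // (k+1) is enough to find every candidate. The flank value matches the rest of the read plus k extra bases. The Python sketch below encodes those two formulas; they are inferred from the logged values, not copied from the CloudBurst source, and the function names are illustrative.

```python
def seed_length(min_read_len, k):
    """Longest exact seed guaranteed by the pigeonhole argument (assumed formula)."""
    return min_read_len // (k + 1)


def flank_length(max_read_len, seed_len, k):
    """Bases needed beside a seed to extend a full read with up to k differences
    (assumed formula, chosen because it matches the logged FLANK_LEN)."""
    return max_read_len - seed_len + k


seed = seed_length(36, 3)
flank = flank_length(36, seed, 3)
print(seed, flank)  # 9 30
```

Under the same assumption, a larger k gives a shorter seed (for example 36 // 5 = 7 at k = 4), which produces more candidate seed hits and fits the usage note that a higher k requires more time.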
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -get /data/results/ ../CloudBurst-1.1.0/results
[hadoop@skcc-nebdap02 hadoop]$ cd ../CloudBurst-1.1.0
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ java -jar PrintAlignments.jar results | sort -nk4 > 100k.3.txt
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ head -n 20 100k.3.txt
1 766133 766169 1 1 +
1 297899 297935 2 0 -
1 1325118 1325154 4 1 +
1 145970 146006 7 1 -
1 553513 553549 8 0 -
1 1779842 1779878 9 0 -
1 86299 86335 10 0 -
1 1503808 1503844 11 2 +
1 397758 397794 12 0 +
1 241778 241814 13 0 -
1 626711 626747 14 0 +
1 142141 142177 15 1 +
1 1401129 1401165 16 1 -
1 306289 306325 17 1 +
1 628571 628607 18 1 -
1 815172 815208 19 0 -
1 1624600 1624636 20 0 +
1 13779 13815 21 0 +
1 129064 129100 22 1 +
1 1382938 1382974 24 2 +
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ tail -n 20 100k.3.txt
1 1796768 1796804 99976 2 -
1 1021128 1021164 99978 0 -
1 1350005 1350041 99980 1 +
1 799280 799316 99981 2 -
1 139518 139554 99983 0 +
1 57158 57194 99985 0 +
1 1663030 1663066 99986 2 +
1 549235 549271 99987 0 -
1 1400509 1400545 99988 0 +
1 880593 880629 99989 0 +
1 918064 918100 99990 0 +
1 937994 938030 99992 1 -
1 94456 94492 99993 0 +
1 1144320 1144356 99994 0 +
1 1441627 1441663 99995 0 +
1 1281557 1281593 99996 0 +
1 1323611 1323647 99997 2 -
1 800095 800131 99998 0 -
1 1956458 1956494 99999 1 +
1 134848 134884 100000 2 -
[hadoop@skcc-nebdap02 CloudBurst-1.1.0]$ wc -l 100k.3.txt
74502 100k.3.txt
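The alignment file has six whitespace-separated columns. Their meaning is not documented in this session, so the Python sketch below makes an assumption based on the output above: reference id, start, end, read number (the `sort -nk4` key), number of differences (never above k = 3 in the rows shown), and strand. With that reading, a short summary of how many reads aligned and with how many mismatches looks like this; the function name is hypothetical.

```python
from collections import Counter
import io


def summarize_alignments(handle, total_reads):
    """Count alignments, mismatch distribution and strand usage.

    Column meanings are assumed: ref, start, end, read number, differences, strand.
    """
    mismatches = Counter()
    strands = Counter()
    aligned = 0
    for line in handle:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        _ref, _start, _end, _read, diffs, strand = fields
        aligned += 1
        mismatches[int(diffs)] += 1
        strands[strand] += 1
    return {
        "aligned": aligned,
        "fraction": aligned / total_reads,
        "mismatches": dict(sorted(mismatches.items())),
        "strands": dict(strands),
    }


# First five rows of 100k.3.txt; they cover reads 1 through 8.
sample = io.StringIO(
    "1 766133 766169 1 1 +\n"
    "1 297899 297935 2 0 -\n"
    "1 1325118 1325154 4 1 +\n"
    "1 145970 146006 7 1 -\n"
    "1 553513 553549 8 0 -\n"
)
summary = summarize_alignments(sample, total_reads=8)
print(summary)
```

For the full file, `wc -l` gives 74502 alignments out of 100,000 reads, so about 74.5% of the reads have an unambiguous best alignment with up to 3 mismatches. The directory listing also shows 100k.3.txt and 100k.3.txt.gold with the same size (1995984 bytes); `filecmp.cmp("100k.3.txt", "100k.3.txt.gold", shallow=False)` would confirm whether they are byte-identical.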
