Bioawk extends Brian Kernighan's awk to support biological file formats such as FASTA, FASTQ, SAM, BED, GFF, and VCF. It reads gzipped files directly and treats a sequence that spans multiple lines as a single record. Added functions include GC content calculation, sequence reversal and reverse complementing, and operations on quality values. Together these make it convenient to parse, manipulate, and summarize genomic data.
2. (GNU) awk
• A complete programming language
• Operates on a per-line (row) basis
• Designed to operate upon columnar data
• By default, whitespace-delimited columns
• Outputs on a per-row basis
3. General syntax
awk '
BEGIN { print "START"; }
{ print }
END { print "END"; }' <file 1> … <file N>
Whole row: $0; Columns: $1, $2, …, $NF
Adapted from Bruce Barnett’s “Intro. To AWK”: https://www.grymoire.com/Unix/Awk.html.
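As a minimal sketch of the BEGIN / main / END structure (the three-row input is made up for illustration), the following counts rows and sums the second column:

```shell
# Hypothetical two-column input, generated inline so the example is self-contained.
summary=$(printf 'a 1\nb 2\nc 3\n' | awk '
BEGIN { print "START" }   # runs once, before any input is read
{ sum += $2 }             # runs for every row; $2 is the second column
END {                     # runs once, after the last row
    print NR " rows, sum=" sum
    print "END"
}')
printf '%s\n' "$summary"
```

The main block has no pattern, so it applies to every row; adding a pattern in front of it (for example `$2 > 1`) would restrict it to matching rows.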
4. Key special variables
• NF – Number of fields (columns)
• NR – Number of records (rows; all files)
• FNR – Number of records (rows; per file)
• FS – (input) field separator (default: " ")
• OFS – Output field separator (default: " ")
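A small sketch of these variables in use, assuming two throwaway colon-delimited files (the file names and contents are hypothetical). NR keeps counting across files, while FNR restarts for each file:

```shell
# Create two small hypothetical input files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'chr1:100:200\nchr2:300:400\n' > "$tmpdir/a.txt"
printf 'chr3:500:600\n' > "$tmpdir/b.txt"

# FS is set with -F, OFS with -v; print NR, FNR, NF, and the first column.
vars_out=$(awk -F':' -v OFS='\t' '{ print NR, FNR, NF, $1 }' \
    "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$vars_out"

rm -r "$tmpdir"
```

The commas in `print` are what insert OFS between the output values.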
7. Bioawk, by Heng Li
• Behaves like standard awk on non-bio. data (it is built on Brian Kernighan's awk, not GNU awk)
• Install from the GitHub repo. or use a third-party Docker image
• Supports: BED, GFF, FASTA, FASTQ, SAM, VCF
List formats with: bioawk -c help
• Directly reads gzipped files (usually)
• -t is short for bioawk -F'\t' -v OFS="\t"
• Treats spanning seqs as a single record
9. Bioawk - examples - GFF*/GTFs
Find all exons less than 100 bp, which are
annotated as the main functional isoform
(i.e., APPRIS principal 1):
bioawk -c gff '$feature == "exon" &&
($end - $start) < 100 &&
$attribute ~ /appris_principal_1/' \
gencode.vXX.annotation.gtf.gz
Example adapted from https://hpc.nih.gov/apps/bioawk.html
10. Bioawk - examples - FASTAs
Reverse complement:
bioawk -c fastx \
'{print ">"$name;
print revcomp($seq)}' \
seq.fa.gz
Example taken from the README.
11. Bioawk - examples - FASTAs
List of sequence names and lengths:
bioawk -c fastx \
'{print $name,
length($seq)}' \
seq.fa.gz
Adapted from a DNA.today blog post, by Jean-Yves Sgro (January 25, 2020).
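For comparison, a rough plain-awk sketch of the same listing, assuming a standard FASTA file in which a sequence may wrap over several lines. This is the bookkeeping that the fastx mode takes care of; the function name `fasta_lengths` and the sample records are hypothetical.

```shell
fasta_lengths() {
    awk '
    /^>/ {                                   # a header line starts a new record
        if (name != "") print name "\t" len  # flush the previous record
        name = substr($1, 2)                 # first word of the header, without ">"
        len = 0
        next
    }
    { len += length($0) }                    # sequence line: accumulate its length
    END { if (name != "") print name "\t" len }
    ' "$@"
}

# seq1 spans two lines (4 + 2 bases) but is reported as one record.
printf '>seq1 demo\nACGT\nAC\n>seq2\nGGG\n' | fasta_lengths
```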
12. Bioawk - examples - FASTQs
%GC and mean Phred quality score:
bioawk -c fastx \
'{ print ">"$name;
print gc($seq);
print meanqual($qual);
}' seq.fq.gz
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
13. Bioawk - examples - SAM files
Extract mapped reads:
sambamba view x.bam |
bioawk -c sam '!and($flag,4)'
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
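Flag bit 0x4 marks a read as unmapped, so `!and($flag,4)` keeps the mapped reads. Where bioawk is unavailable, the same test can be sketched in portable awk with integer arithmetic, assuming the flag is in column 2 as in SAM. The function name `mapped_only` is hypothetical, and the records below are made up and truncated to three columns.

```shell
mapped_only() {
    # Skip header lines, then keep rows whose flag has bit 0x4 clear:
    # int(flag / 4) % 2 is 1 exactly when that bit is set.
    awk -F'\t' '/^@/ { next } int($2 / 4) % 2 == 0' "$@"
}

# r2 (flag 4) and r4 (flag 77 = 64 + 8 + 4 + 1) are unmapped and dropped.
printf '@HD\tVN:1.6\nr1\t0\tchr1\nr2\t4\t*\nr3\t16\tchr2\nr4\t77\t*\n' | mapped_only
```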
14. Bioawk - examples - VCF files
bioawk -c vcf '{
freq[$filter]++
total++
}
END {
for(val in freq)
printf "%s\t%d\t%d\n",
val, freq[val], freq[val]*100/total
}'
From sahilseth's flowr Bioawk tips, itself adapted from Stephen Turner's "Bioinformatics one-liners".
Assess a pipeline using sequence filter statistics. Output columns:
• filter (e.g. LowQual)
• number of filter occurrences
• percentage of total filters
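The same tally can be sketched in plain awk, assuming FILTER is column 7 as in a standard VCF. The function name `filter_stats` and the four sample variants are made up for illustration; lines starting with "#" are skipped explicitly here.

```shell
filter_stats() {
    awk -F'\t' '
    /^#/ { next }                 # skip meta-information and header lines
    { freq[$7]++; total++ }       # column 7 of a VCF is FILTER
    END {
        for (val in freq)
            printf "%s\t%d\t%d\n", val, freq[val], freq[val] * 100 / total
    }' "$@" | sort                # for-in order is unspecified, so sort for stable output
}

printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\n1\t10\t.\tA\tG\t50\tPASS\n1\t20\t.\tC\tT\t5\tLowQual\n1\t30\t.\tG\tA\t60\tPASS\n1\t40\t.\tT\tC\t70\tPASS\n' | filter_stats
```

Because the percentage is printed with `%d`, it is truncated to an integer, matching the original one-liner.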
16. Bioawk: list of added functions
• gc($seq)
• meanqual($qual)
• reverse($seq) / revcomp($seq)
• qualcount($qual, threshold)
• Number of quality values above the threshold parameter.
• trimq(qual, beg, end, param=0.05)
• Trims using Richard Mott's algorithm (used in Phred).
• Bitwise AND/OR/XOR: and(x, y), or(x, y), xor(x, y)
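To make two of these functions concrete, here is a rough plain-awk sketch of a reverse complement and a GC fraction for one sequence per input line. It is an illustration only, not bioawk's implementation: the function name `seq_stats` is hypothetical, only A/C/G/T (either case) are complemented, and other symbols pass through unchanged.

```shell
seq_stats() {
    awk '
    BEGIN {
        # Build the complement lookup table.
        split("A T C G a t c g", from, " ")
        split("T A G C t a g c", to, " ")
        for (i = 1; i <= 8; i++) comp[from[i]] = to[i]
    }
    {
        seq = $0; n = length(seq); rc = ""
        for (i = n; i >= 1; i--) {           # walk the sequence backwards
            b = substr(seq, i, 1)
            c = (b in comp) ? comp[b] : b    # complement known bases only
            rc = rc c
        }
        tmp = seq
        gc_count = gsub(/[GCgc]/, "", tmp)   # gsub returns the number of matches
        printf "%s\t%.2f\n", rc, (n ? gc_count / n : 0)
    }' "$@"
}

printf 'AACGT\n' | seq_stats
```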
Editor's Notes
AWK: named for the initials of its original developers, Alfred Aho, Peter Weinberger, and Brian Kernighan.
https://github.com/lh3/bioawk
Example files selected randomly, from ~/new_proj/experiments/2019-07-15-ChromaClique-initial_viz_work/templating_initial_attempt/Flt1_GABPA_site
Shown with bioSyntax (Vim) highlighting.
Example files selected randomly, from /mnt/work1/users/home2/cviner/workDir/cytomod/experiments_data/linked-2017-06-11-K562_POU5F1_reprocessing_and_comparison/
Shown with bioSyntax (Vim) highlighting.
VCF used: https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf
https://gist.github.com/sahilseth/587edf0aed095be49121fd7f05904e57
Illumina VCF filter annotations:
If all filters are passed, PASS is written in the filter column.
• LowDP—Applied to sites with depth of coverage below a cutoff.
• LowGQ—The genotyping quality (GQ) is below a cutoff.
• LowQual—The variant quality (QUAL) is below a cutoff.
• LowVariantFreq—The variant frequency is less than the given threshold.
• R8—For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8.
• SB—The strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.
"The modified Mott trimming algorithm, which is used to calculate the trimming information for the '-trim_alt' option and the phd files, uses base error probabilities calculated from the phred quality values. For each base it subtracts the base error probability from an error probability cutoff value (0.05 by default, and changed using the '-trim_cutoff' option) to form the base score. Then it finds the highest scoring segment of the sequence where the segment score is the sum of the segment base scores (the score can have non-negative values only). The algorithm requires a minimum segment length, which is set to 20 bases."
http://bozeman.mbt.washington.edu/phrap.docs/phred.html
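The scoring described in the quote can be sketched in plain awk as a maximum-sum segment search. This is a simplified illustration and not the code of phred or of bioawk's trimq: it assumes Phred+33 encoded quality strings, one per line, omits the 20-base minimum segment length, and uses the hypothetical function name `mott_trim`.

```shell
mott_trim() {
    # usage: mott_trim [cutoff]   (reads quality strings on stdin; cutoff defaults to 0.05)
    awk -v cutoff="${1:-0.05}" '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
    {
        best = 0; sum = 0; start = 1; first = 0; last = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            q = ord[substr($0, i, 1)] - 33       # Phred quality value
            sum += cutoff - 10 ^ (-q / 10)       # base score = cutoff - error probability
            if (sum > best) { best = sum; first = start; last = i }
            if (sum < 0)    { sum = 0; start = i + 1 }   # scores may not go negative
        }
        print first, last                        # 1-based, inclusive; "0 0" if nothing is kept
    }'
}

# Two low-quality bases (Q2) flank eight high-quality bases (Q40).
printf '##IIIIIIII##\n' | mott_trim
```

For this input the retained segment is positions 3 to 10: each Q40 base adds about 0.05 to the running score, while a Q2 base subtracts about 0.58.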