Awk primer and Bioawk
Coby Viner
Lab meeting: tech – Wednesday, Dec. 2, 2020
(GNU) awk
• A complete programming language
• Operates on a per-line (row) basis
• Designed to operate upon columnar data
• By default, whitespace-delimited columns
• Outputs on a per-row basis
General syntax
awk '
BEGIN { print "START"; }
{ print }
END { print "END"; }' <file 1> … <file N>
Whole line: $0; columns: $1, $2, …, $NF
Adapted from Bruce Barnett’s “Intro. To AWK”: https://www.grymoire.com/Unix/Awk.html.
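A minimal runnable sketch of the BEGIN / per-line / END structure. The two-line input file is made up for illustration.

```shell
# Made-up two-line input for illustration.
tmp=$(mktemp -d)
printf 'a 1\nb 2\n' > "$tmp/demo.txt"

out=$(awk '
BEGIN { print "START" }        # runs once, before any input is read
      { print $NF, $0 }        # runs per line: last column, then the whole line
END   { print "END" }          # runs once, after the last line
' "$tmp/demo.txt")
printf '%s\n' "$out"
rm -r "$tmp"
```

A block with no pattern in front of it runs for every input line.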
Key special variables
• NF – Number of fields (columns)
• NR – Number of records (rows; all files)
• FNR – Number of records (rows; per file)
• FS – (input) field separator (default: " ")
• OFS – Output field separator (default: " ")
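To see how these variables behave, the sketch below prints NR, FNR and NF for every line of two small files. The file names and contents are hypothetical, and OFS is set to ":" only to make the separator visible.

```shell
# Two made-up input files.
tmp=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmp/one.txt"
printf 'f\n' > "$tmp/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
out=$(awk 'BEGIN { OFS = ":" } { print NR, FNR, NF }' "$tmp/one.txt" "$tmp/two.txt")
printf '%s\n' "$out"
rm -r "$tmp"
```

The commas in print are what insert OFS between the values; writing them side by side without commas would concatenate them instead.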
Examples
BEGIN { FS=OFS="\t"; }
{ print $1,$2,$3; }
FNR > 1 { }
grep -v 'chr[mM]' <f> | awk '{print $1,$2,$3}' | sed 's/chr//;'
awk '$1 !~ /chr[mM]/ {sub(/chr/, ""); print $1,$2,$3}' <f>
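The fragments on this slide can be combined into one command, sketched here on a made-up tab-delimited file with a header row. As a deliberate variation (an assumption, not the slide's exact command), the substitution is anchored with ^ and applied to $1 only, so a "chr" elsewhere on the line is left alone.

```shell
# Made-up tab-delimited input with a header line.
tmp=$(mktemp -d)
printf 'chrom\tstart\tend\tname\nchr1\t10\t20\tx\nchrM\t5\t9\ty\nchrX\t1\t2\tz\n' > "$tmp/regions.tsv"

out=$(awk '
BEGIN { FS = OFS = "\t" }            # tab-delimited input and output
FNR > 1 && $1 !~ /chr[mM]/ {         # skip the header and mitochondrial rows
    sub(/^chr/, "", $1)              # strip the "chr" prefix from column 1
    print $1, $2, $3
}' "$tmp/regions.tsv")
printf '%s\n' "$out"
rm -r "$tmp"
```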
Examples: two files
awk 'FNR==NR{a[$1]=$2; next}
{print $1,$2,a[$2]; }' <file 1> <file 2>
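A worked sketch of the two-file idiom: FNR==NR is true only while the first file is being read, so the first block fills the lookup array and the second block runs on the second file. The file names, the contents and the "NA" fallback are illustrative assumptions.

```shell
# Made-up lookup table (id -> gene name) and a file that refers to those ids.
tmp=$(mktemp -d)
printf 'g1 TP53\ng2 MYC\n' > "$tmp/genes.txt"
printf 's1 g2\ns2 g1\ns3 g9\n' > "$tmp/samples.txt"

out=$(awk '
FNR == NR { a[$1] = $2; next }                      # first file: build the lookup
{ print $1, $2, (($2 in a) ? a[$2] : "NA") }        # second file: annotate each row
' "$tmp/genes.txt" "$tmp/samples.txt")
printf '%s\n' "$out"
rm -r "$tmp"
```

One caveat: if the first file is empty, FNR==NR also holds for the second file, so the lookup block would consume it instead.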
Bioawk, by Heng Li
• Behaves like GNU awk, on non-bio. data
• Install from GitHub repo. or others' Dockers
• Supports: BED, GFF, FASTA, FASTQ, SAM, VCF
  • List formats with: bioawk -c help
• Directly reads gzipped files (usually)
• -t short for bioawk -F'\t' -v OFS="\t"
• Treats spanning seqs as a single record
Bioawk - generic/BED files
Parse column names:
bioawk -c header '{ print $chr }' <file.gz>
chr1
chr3
chrX
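For intuition, the effect of -c header can be approximated in plain awk by mapping the names on the first line to column numbers. This is only a sketch of the idea on a made-up uncompressed file, not Bioawk's implementation, and the col array is a hypothetical name.

```shell
# Made-up tab-delimited file whose first line names the columns.
tmp=$(mktemp -d)
printf 'chr\tstart\tend\nchr1\t100\t200\nchr3\t5\t50\nchrX\t7\t9\n' > "$tmp/sites.tsv"

out=$(awk '
BEGIN { FS = "\t" }
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header name -> column number
{ print $(col["chr"]) }                                   # look the column up by name
' "$tmp/sites.tsv")
printf '%s\n' "$out"
rm -r "$tmp"
```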
Bioawk - examples - GFF*/GTFs
Find all exons less than 100 bp, which are
annotated as the main functional isoform
(i.e., APPRIS principal 1):
bioawk -c gff '$feature == "exon" &&
($end - $start) < 100 &&
$attribute ~ /appris_principal_1/' gencode.vXX.annotation.gtf.gz
Example adapted from https://hpc.nih.gov/apps/bioawk.html
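The same filter can be sketched in plain awk using column numbers, assuming the standard nine-column GFF/GTF layout (feature in column 3, start in 4, end in 5, attributes in 9). The input rows are made up. GFF/GTF coordinates are 1-based and inclusive, so an exon's length is end - start + 1; the sketch keeps the slide's end - start test unchanged.

```shell
# Made-up GTF-like rows (nine tab-delimited columns).
tmp=$(mktemp -d)
{
  printf 'chr1\tHAVANA\texon\t100\t150\t.\t+\t.\tgene_id "G1"; tag "appris_principal_1";\n'
  printf 'chr1\tHAVANA\texon\t100\t400\t.\t+\t.\tgene_id "G2"; tag "appris_principal_1";\n'
  printf 'chr1\tHAVANA\tCDS\t100\t150\t.\t+\t.\tgene_id "G1"; tag "appris_principal_1";\n'
  printf 'chr1\tHAVANA\texon\t500\t520\t.\t+\t.\tgene_id "G3"; tag "basic";\n'
} > "$tmp/demo.gtf"

out=$(awk '
BEGIN { FS = OFS = "\t" }
# exon rows, shorter than the cutoff, tagged as APPRIS principal 1
$3 == "exon" && ($5 - $4) < 100 && $9 ~ /appris_principal_1/ { print $1, $4, $5 }
' "$tmp/demo.gtf")
printf '%s\n' "$out"
rm -r "$tmp"
```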
Bioawk - examples - FASTAs
Reverse complement:
bioawk -c fastx '{print ">"$name;
print revcomp($seq)}' seq.fa.gz
Example taken from the README.
Bioawk - examples - FASTAs
List of sequence names and lengths:
bioawk -c fastx '{print $name,
length($seq)}' seq.fa.gz
Adapted from a DNA.today blog post, by Jean-Yves Sgro (January 25, 2020).
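For comparison, a plain-awk sketch of the same names-and-lengths report on a made-up uncompressed FASTA. It shows the bookkeeping needed when a sequence spans several lines, which Bioawk handles by treating each sequence as one record.

```shell
# Made-up FASTA; seq1 is wrapped over two lines.
tmp=$(mktemp -d)
printf '>seq1 desc\nACGT\nAC\n>seq2\nGGG\n' > "$tmp/demo.fa"

out=$(awk '
/^>/ {                                   # header line starts a new record
    if (name != "") print name, len      # report the previous record first
    name = substr($1, 2); len = 0        # name = first word, without ">"
    next
}
{ len += length($0) }                    # sequence line: accumulate its length
END { if (name != "") print name, len }  # report the final record
' "$tmp/demo.fa")
printf '%s\n' "$out"
rm -r "$tmp"
```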
Bioawk - examples - FASTQs
%GC and mean Phred quality score:
bioawk -c fastx '{ print ">"$name;
print gc($seq);
print meanqual($qual);
}' seq.fq.gz
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
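A rough plain-awk equivalent, to show what these two statistics involve. It assumes strict four-line FASTQ records and Phred+33 quality encoding, reports GC as a fraction (multiply by 100 for a percentage), and uses made-up reads. It is a sketch, not Bioawk's implementation.

```shell
# Made-up FASTQ: "I" encodes Phred 40 and "!" encodes Phred 0 under Phred+33.
tmp=$(mktemp -d)
printf '@r1\nGCAT\n+\nIIII\n@r2\nAATT\n+\n!!II\n' > "$tmp/demo.fq"

out=$(awk '
BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # character -> ASCII code
NR % 4 == 1 { name = substr($1, 2) }      # line 1 of a record: read name
NR % 4 == 2 { seq = $0 }                  # line 2: bases
NR % 4 == 0 {                             # line 4: qualities
    s = seq
    gc = gsub(/[GCgc]/, "", s)            # gsub returns the number of G/C bases
    sum = 0
    for (i = 1; i <= length($0); i++) sum += ord[substr($0, i, 1)] - 33
    print name, gc / length(seq), sum / length($0)
}' "$tmp/demo.fq")
printf '%s\n' "$out"
rm -r "$tmp"
```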
Bioawk - examples - SAM files
Extract mapped reads:
sambamba view x.bam | 
bioawk -c sam '!and($flag,4)'
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
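Where a bitwise and() is not available, the same flag test can be sketched with integer arithmetic: bit 4 (read unmapped) is set exactly when int(flag / 4) is odd. The SAM-like rows below are made up and truncated to four columns for brevity.

```shell
# Made-up, truncated SAM-like rows: name, flag, reference, position.
tmp=$(mktemp -d)
printf '@HD\tVN:1.6\nr1\t0\tchr1\t100\nr2\t4\t*\t0\nr3\t16\tchr1\t200\nr4\t77\t*\t0\nr5\t99\tchr2\t300\n' > "$tmp/demo.sam"

out=$(awk '
BEGIN { FS = "\t" }
/^@/ { next }                          # skip header lines
int($2 / 4) % 2 == 0 { print $1 }      # keep reads whose "unmapped" bit (4) is clear
' "$tmp/demo.sam")
printf '%s\n' "$out"
rm -r "$tmp"
```

Flags 4 and 77 have the unmapped bit set, so r2 and r4 are dropped.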
Bioawk - examples - VCF files
bioawk -c vcf '{
freq[$filter]++
total++
}
END {
for(val in freq)
printf "%s\t%d\t%d\n",
val, freq[val], freq[val]*100/total
}'
From sahilseth's flowr Bioawk tips, itself adapted from Stephen Turner's "Bioinformatics one-liners".
Assess pipeline sequence-filter statistics; output columns:
• filter (e.g. LowQual)
• number of filter occurrences
• percentage of total filters
Bioawk - examples - VCF files
VCF data: Erik Garrison's vcflib, sample.vcf.
PASS 5 55
q10 1 11
. 3 33
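The same tally can be sketched in plain awk, assuming FILTER is column 7 of a tab-delimited VCF and that header lines start with "#". The four records are made up, not taken from sample.vcf. The output is piped through sort because awk does not guarantee the order of a for (val in freq) loop.

```shell
# Made-up minimal VCF: two header lines and four records.
tmp=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '20\t100\t.\tA\tG\t50\tPASS\t.\n'
  printf '20\t200\t.\tC\tT\t9\tq10\t.\n'
  printf '20\t300\t.\tG\tA\t60\tPASS\t.\n'
  printf '20\t400\t.\tT\tC\t70\tPASS\t.\n'
} > "$tmp/demo.vcf"

out=$(awk '
BEGIN { FS = "\t" }
/^#/ { next }                          # skip meta and header lines
{ freq[$7]++; total++ }                # count each FILTER value
END {
    for (val in freq)
        printf "%s\t%d\t%d\n", val, freq[val], freq[val] * 100 / total
}' "$tmp/demo.vcf" | LC_ALL=C sort)
printf '%s\n' "$out"
rm -r "$tmp"
```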
Bioawk: list of added functions
• gc($seq)
• meanqual($qual)
• reverse($seq) / revcomp($seq)
• qualcount($qual, threshold)
  • Number of quality values above the threshold parameter.
• trimq(qual, beg, end, param=0.05)
  • Trims using Richard Mott's algorithm (used in Phred).
• Bitwise AND/OR/XOR

Editor's Notes

  • #3 AWK: initials of its original developers: A. Aho, P. Weinberger and B. W. Kernighan.
  • #8 https://github.com/lh3/bioawk
  • #9, #11, #12 Example files selected randomly, from ~/new_proj/experiments/2019-07-15-ChromaClique-initial_viz_work/templating_initial_attempt/Flt1_GABPA_site. Shown with bioSyntax (Vim) highlighting.
  • #13 Example files selected randomly, from /mnt/work1/users/home2/cviner/workDir/cytomod/experiments_data/linked-2017-06-11-K562_POU5F1_reprocessing_and_comparison/ Shown with bioSyntax (Vim) highlighting.
  • #15, #16 VCF used: https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf https://gist.github.com/sahilseth/587edf0aed095be49121fd7f05904e57 Illumina VCF filter annotations: if all filters are passed, PASS is written in the filter column.
    • LowDP: applied to sites with depth of coverage below a cutoff.
    • LowGQ: the genotyping quality (GQ) is below a cutoff.
    • LowQual: the variant quality (QUAL) is below a cutoff.
    • LowVariantFreq: the variant frequency is less than the given threshold.
    • R8: for an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8.
    • SB: the strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.
  • #17 "The modified Mott trimming algorithm, which is used to calculate the trimming information for the '-trim_alt' option and the phd files, uses base error probabilities calculated from the phred quality values. For each base it subtracts the base error probability from an error probability cutoff value (0.05 by default, and changed using the '-trim_cutoff' option) to form the base score. Then it finds the highest scoring segment of the sequence where the segment score is the sum of the segment base scores (the score can have non-negative values only). The algorithm requires a minimum segment length, which is set to 20 bases." http://bozeman.mbt.washington.edu/phrap.docs/phred.html
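The scoring step in the quotation above can be sketched as a maximum-scoring-segment scan. This is an illustrative sketch only, not Phred's or Bioawk's trimq implementation: it assumes the qualities arrive as space-separated integer Phred scores (a hypothetical input format), converts each to an error probability with 10^(-Q/10), and ignores the 20-base minimum segment length.

```shell
# Made-up read: two poor bases, four good ones, two poor ones.
out=$(printf '2 2 30 30 30 30 2 2\n' | awk -v cutoff=0.05 '
{
    sum = 0; best = 0; start = 1; bs = 0; be = 0
    for (i = 1; i <= NF; i++) {
        sum += cutoff - 10 ^ (-$i / 10)        # base score = cutoff - error probability
        if (sum < 0) { sum = 0; start = i + 1 }                 # segment scores stay non-negative
        else if (sum > best) { best = sum; bs = start; be = i } # best segment so far
    }
    print bs, be                               # 1-based first and last base to keep
}')
printf '%s\n' "$out"
```

With the default cutoff of 0.05, bases with a Phred score above about 13 get a positive score, so the scan keeps positions 3 to 6 here. If no base scores positively, it prints 0 0.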