Awk primer and Bioawk

Bioawk is a tool that extends awk to work with biological file formats such as FASTA, FASTQ, SAM, BED, GFF, and VCF. It reads gzipped files directly and treats a sequence that spans several lines as a single record. Its added functions include computing GC content, reversing and reverse-complementing sequences, and summarizing quality values. Together these make it convenient to parse, manipulate, and summarize genomic data.

Awk primer and Bioawk
Coby Viner
Lab meeting: tech – Wednesday, Dec. 2, 2020

(GNU) awk
• A complete programming language
• Operates on a per-line (row) basis
• Designed to operate upon columnar data
• By default, whitespace-delimited columns
• Outputs on a per-row basis

General syntax
awk '
BEGIN { print "START"; }
{ print }
END { print "END"; }'
<file 1> … <file N>
Whole record: $0; Columns: $1, $2, …, $NF
Adapted from Bruce Barnett’s “Intro. To AWK”: https://www.grymoire.com/Unix/Awk.html.

Key special variables
• NF – Number of fields (columns)
• NR – Number of records (rows; all files)
• FNR – Number of records (rows; per file)
• FS – (input) field separator (default: " ")
• OFS – Output field separator (default: " ")
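
To make the counters concrete, here is a minimal sketch that runs awk over two small made-up files and prints FILENAME, NR, FNR, NF and $NF for every row. The files one.txt and two.txt are hypothetical and are created in a temporary directory only for this illustration.

```shell
# Two tiny hypothetical inputs, created only for this illustration.
tmpdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmpdir/one.txt"
printf 'f g h i\n' > "$tmpdir/two.txt"

# For every row: file name, NR (rows seen so far, all files),
# FNR (rows seen so far, current file), NF (columns) and $NF (last column).
counts=$(cd "$tmpdir" && awk '{ print FILENAME ":", NR, FNR, NF, $NF }' one.txt two.txt)
printf '%s\n' "$counts"
# one.txt: 1 1 3 c
# one.txt: 2 2 2 e
# two.txt: 3 1 4 i

rm -r "$tmpdir"
```

NR keeps counting across files, while FNR restarts at 1 for each file; that difference is what the two-file idiom later in these slides relies on.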

Examples
BEGIN { FS=OFS="\t"; }
{ print $1,$2,$3; }
FNR > 1 { }
grep -v 'chr[mM]' <f> | awk '{print $1,$2,$3}' | sed 's/chr//;'
Roughly equivalent, in a single awk call:
awk '$1 !~ /chr[mM]/ {sub(/chr/, ""); print $1,$2,$3}' <f>
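
The fragments on this slide can be combined into one runnable command. The sketch below uses a made-up tab-delimited table with a header row (the column names and values are hypothetical). Unlike the slide, it anchors the substitution to the start of column 1, which is a slightly stricter variant.

```shell
# Hypothetical tab-delimited input with a header row (not from the slides).
regions=$(printf 'chrom\tstart\tend\tname\nchr1\t100\t200\tpeakA\nchrM\t5\t50\tpeakB\nchrX\t300\t450\tpeakC\n')

# FS = OFS = "\t": read and write tab-delimited columns.
# FNR > 1:         skip the header row.
# $1 !~ /chr[mM]/: drop mitochondrial rows.
# sub(...):        strip the leading "chr" from column 1 before printing.
filtered=$(printf '%s\n' "$regions" | awk '
BEGIN { FS = OFS = "\t" }
FNR > 1 && $1 !~ /chr[mM]/ { sub(/^chr/, "", $1); print $1, $2, $3 }')
printf '%s\n' "$filtered"
```

For this input, two rows survive (chromosomes 1 and X), each printed as three tab-separated columns.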

Examples: two files
awk 'FNR==NR{a[$1]=$2; next}
{print $1,$2,a[$2]; }'
<file 1> <file 2>
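
A worked instance of the two-file idiom, as a sketch: map.txt and hits.txt are hypothetical files invented for this example. While awk reads the first file, FNR equals NR, so the first block fills the lookup array; rows of the second file fall through to the second block. The "NA" fallback for missing keys is an addition, not part of the slide.

```shell
tmpdir=$(mktemp -d)
# File 1 maps a gene ID to a symbol; file 2 lists sample and gene ID (both hypothetical).
printf 'g1 TP53\ng2 MYC\n' > "$tmpdir/map.txt"
printf 's1 g2\ns2 g1\ns3 g9\n' > "$tmpdir/hits.txt"

# First file: FNR == NR holds, so store column 2 under the key in column 1.
# Second file: print both columns plus the looked-up value, or NA if the key is absent.
joined=$(awk 'FNR == NR { a[$1] = $2; next }
{ print $1, $2, ($2 in a ? a[$2] : "NA") }' "$tmpdir/map.txt" "$tmpdir/hits.txt")
printf '%s\n' "$joined"
# s1 g2 MYC
# s2 g1 TP53
# s3 g9 NA

rm -r "$tmpdir"
```

One caveat: if the first file is empty, FNR == NR is also true for the second file, so the idiom silently treats the second file as the lookup table.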

Bioawk, by Heng Li
• Behaves like standard awk on non-biological data
• Install from the GitHub repo. or from third-party Docker images
• Supports: BED, GFF, FASTA, FASTQ, SAM, VCF
• List formats with: bioawk -c help
• Directly reads gzipped files (usually)
• -t is short for bioawk -F'\t' -v OFS="\t"
• Treats a sequence spanning multiple lines as a single record

Bioawk - generic/BED files
Parse column names:
bioawk -c header '{ print $chr }' <file.gz>
chr1
chr3
chrX

Bioawk - examples - GFF*/GTFs
Find all exons less than 100 bp, which are
annotated as the main functional isoform
(i.e., APPRIS principal 1):
bioawk -c gff '$feature == "exon" &&
($end - $start + 1) < 100 &&
$attribute ~ /appris_principal_1/' gencode.vXX.annotation.gtf.gz
Example adapted from https://hpc.nih.gov/apps/bioawk.html

Bioawk - examples - FASTAs
Reverse complement:
bioawk -c fastx '{print ">"$name;
print revcomp($seq)}' seq.fa.gz
Example taken from the README.

Bioawk - examples - FASTAs
List of sequence names and lengths:
bioawk -c fastx '{ print $name, length($seq) }' seq.fa.gz
Adapted from a DNA.today blog post, by Jean-Yves Sgro (January 25, 2020).
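
For comparison, here is a sketch of the same name-and-length listing in plain awk, which has to stitch multi-line sequences together itself. This is the bookkeeping that Bioawk's single-record handling of spanning sequences removes. The FASTA text is made up for the example.

```shell
# Hypothetical FASTA input; seq1 spans two sequence lines.
fasta=$(printf '>seq1 first record\nACGT\nAC\n>seq2\nGGGCC\n')

lengths=$(printf '%s\n' "$fasta" | awk '
/^>/ {                                   # header line: a new record starts
    if (name != "") print name "\t" len  # flush the previous record, if any
    name = substr($1, 2); len = 0        # name = first word, without ">"
    next
}
{ len += length($0) }                    # sequence line: add its length
END { if (name != "") print name "\t" len }')
printf '%s\n' "$lengths"
```

For this input it reports seq1 with length 6 and seq2 with length 5, tab-separated.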

Bioawk - examples - FASTQs
%GC and mean Phred quality score:
bioawk -c fastx '{ print ">"$name;
print gc($seq);
print meanqual($qual);
}' seq.fq.gz
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
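
Where Bioawk is not installed, comparable numbers can be approximated in plain awk. The sketch below rests on several assumptions that should be checked against real data: strict four-line FASTQ records, Phred+33 quality encoding, and GC expressed as a fraction of sequence length. It is not Bioawk's implementation, and its results may differ from gc() and meanqual() in edge cases such as ambiguity codes. The reads are invented.

```shell
# Two hypothetical four-line FASTQ records.
fastq=$(printf '@r1\nGGCA\n+\nII55\n@r2\nATAT\n+\n5555\n')

stats=$(printf '%s\n' "$fastq" | awk '
BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # character -> ASCII code
NR % 4 == 1 { name = substr($1, 2) }     # "@name" line
NR % 4 == 2 { seq = $0 }                 # sequence line
NR % 4 == 0 {                            # quality line
    s = seq; ngc = gsub(/[GCgc]/, "", s) # gsub returns the number of G/C bases removed
    total = 0
    for (i = 1; i <= length($0); i++) total += ord[substr($0, i, 1)] - 33
    printf "%s\t%.2f\t%.1f\n", name, ngc / length(seq), total / length($0)
}')
printf '%s\n' "$stats"
```

Each output row holds the read name, the GC fraction and the mean Phred score. Here "I" encodes Q40 and "5" encodes Q20, so r1 averages 30.0.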

Bioawk - examples - SAM files
Extract mapped reads:
sambamba view x.bam |
bioawk -c sam '!and($flag,4)'
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
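
The flag test can also be written without a bitwise and(), which POSIX awk lacks. In the sketch below, integer division by 4 followed by a modulo 2 isolates the 0x4 ("segment unmapped") bit. The SAM lines are hypothetical and truncated to four columns for brevity, so they are not complete SAM records.

```shell
# Hypothetical, truncated SAM lines: name, flag, reference, position.
sam=$(printf '@HD\tVN:1.6\nr1\t0\tchr1\t100\nr2\t4\t*\t0\nr3\t99\tchr2\t50\nr4\t77\t*\t0\n')

mapped=$(printf '%s\n' "$sam" | awk -F '\t' '
/^@/ { next }                       # skip header lines
int($2 / 4) % 2 == 0 { print $1 }   # bit 0x4 clear: the read is mapped
')
printf '%s\n' "$mapped"
# r1
# r3
```

Flags 0 and 99 have the 0x4 bit clear, while 4 and 77 have it set, so only r1 and r3 are printed.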

Bioawk - examples - VCF files
bioawk -c vcf '{
freq[$filter]++
total++
}
END {
for(val in freq)
printf "%s\t%d\t%d\n",
val, freq[val], freq[val]*100/total
}'
Assess pipeline: sequence filter statistics. Columns printed:
• filter (e.g. LowQual)
• number of filter occurrences
• percentage of total filters
From sahilseth's flowr Bioawk tips, itself adapted from Stephen Turner's "Bioinformatics one-liners".

Bioawk - examples - VCF files
Output of the one-liner above on Erik Garrison's vcflib sample.vcf (filter, count, percentage):
PASS 5 55
q10 1 11
. 3 33
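
The same tally can be sketched in plain awk by skipping header lines and reading the FILTER value from column 7. The VCF lines below are invented and cut off after the FILTER column, so they are not a complete VCF. Because awk's for-in order is unspecified, the output is sorted to make it reproducible.

```shell
# Hypothetical VCF fragment, truncated after the FILTER column.
vcf=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\n1\t10\t.\tA\tG\t50\tPASS\n1\t20\t.\tC\tT\t5\tq10\n1\t30\t.\tG\tA\t60\tPASS\n1\t40\t.\tT\tC\t70\tPASS\n')

tally=$(printf '%s\n' "$vcf" | awk -F '\t' '
/^#/ { next }                   # skip meta-information and the column header
{ freq[$7]++; total++ }         # count each FILTER value
END {
    for (val in freq)
        printf "%s\t%d\t%d\n", val, freq[val], freq[val] * 100 / total
}' | LC_ALL=C sort)
printf '%s\n' "$tally"
```

As in the slide's version, "%d" truncates the percentage rather than rounding it, which is why 5 of 9 records show as 55 in the output above and the column may not sum to 100.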

Bioawk: list of added functions
• gc($seq)
• meanqual($qual)
• reverse($seq) / revcomp($seq)
• qualcount($qual, threshold): number of quality values above the threshold parameter
• trimq(qual, beg, end, param=0.05): trims using Richard Mott's algorithm (used in Phred)
• Bitwise AND/OR/XOR
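
The Mott trimming idea can be illustrated in plain awk. The sketch below follows the Phred documentation quoted in the editor's notes: each base scores the cutoff minus its error probability, and the highest-scoring contiguous segment is kept. It is an illustration under stated assumptions (Phred+33 encoding, a hypothetical helper named mott_trim, a minimum length of 20 taken from that quote) and is not Bioawk's trimq implementation.

```shell
# mott_trim QUAL [CUTOFF] [MINLEN]
# Prints the 1-based start and end of the kept segment, or "0 0" if none qualifies.
# Hypothetical helper for illustration only.
mott_trim() {
    printf '%s\n' "$1" | awk -v cutoff="${2:-0.05}" -v minlen="${3:-20}" '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
        sum = 0; best = 0; start = 1; bstart = 0; bend = 0
        for (i = 1; i <= length($0); i++) {
            q = ord[substr($0, i, 1)] - 33            # Phred+33 quality value
            sum += cutoff - 10 ^ (-q / 10)            # base score = cutoff - error probability
            if (sum <= 0) { sum = 0; start = i + 1 }  # segment scores cannot go negative
            else if (sum > best) { best = sum; bstart = start; bend = i }
        }
        if (bend - bstart + 1 < minlen) { bstart = 0; bend = 0 }
        print bstart, bend
    }'
}

# Two poor bases (Q2) on each side of six good ones (Q40); minimum length lowered to 3.
mott_trim '##IIIIII##' 0.05 3
# 3 8
```

With the default minimum length of 20, the same ten-base read yields "0 0", since no segment is long enough to keep.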

Editor's Notes

AWK: named for the initials of its original developers, A. Aho, P. Weinberger and B. W. Kernighan.
https://github.com/lh3/bioawk
Example files selected randomly, from ~/new_proj/experiments/2019-07-15-ChromaClique-initial_viz_work/templating_initial_attempt/Flt1_GABPA_site. Shown with bioSyntax (Vim) highlighting.
Example files selected randomly, from /mnt/work1/users/home2/cviner/workDir/cytomod/experiments_data/linked-2017-06-11-K562_POU5F1_reprocessing_and_comparison/. Shown with bioSyntax (Vim) highlighting.
VCF used: https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf
https://gist.github.com/sahilseth/587edf0aed095be49121fd7f05904e57
Illumina VCF filter annotations: If all filters are passed, PASS is written in the filter column.
• LowDP: applied to sites with depth of coverage below a cutoff.
• LowGQ: the genotyping quality (GQ) is below a cutoff.
• LowQual: the variant quality (QUAL) is below a cutoff.
• LowVariantFreq: the variant frequency is less than the given threshold.
• R8: for an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8.
• SB: the strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.
"The modified Mott trimming algorithm, which is used to calculate the trimming information for the '-trim_alt' option and the phd files, uses base error probabilities calculated from the phred quality values. For each base it subtracts the base error probability from an error probability cutoff value (0.05 by default, and changed using the '-trim_cutoff' option) to form the base score. Then it finds the highest scoring segment of the sequence where the segment score is the sum of the segment base scores (the score can have non-negative values only). The algorithm requires a minimum segment length, which is set to 20 bases." http://bozeman.mbt.washington.edu/phrap.docs/phred.html
