Awk primer and Bioawk
Coby Viner
Lab meeting: tech – Wednesday, Dec. 2, 2020
(GNU) awk
• A complete programming language
• Operates on a per-line (row) basis
• Designed to operate upon columnar data
• By default, whitespace-delimited columns
• Outputs on a per-row basis
General syntax
awk '
BEGIN { print "START"; }
{ print }
END { print "END"; }' \
<file 1> … <file N>
Whole line (row): $0; columns: $1, $2, …, $NF
Adapted from Bruce Barnett’s “Intro. To AWK”: https://www.grymoire.com/Unix/Awk.html.
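A minimal runnable sketch of the BEGIN / main / END structure above. The two-line input is made up for illustration; any whitespace-delimited text would behave the same way.

```shell
# Hypothetical two-line input piped in place of <file 1> … <file N>.
syntax_demo=$(printf 'chr1 100 200\nchr2 300 400\n' | awk '
BEGIN { print "START" }      # runs once, before any input is read
{ print $1, $NF }            # runs for every line: first and last column
END { print "END" }          # runs once, after the last line
')
printf '%s\n' "$syntax_demo"
```

The middle block has no pattern, so it applies to every row; replacing `$1, $NF` with nothing (a bare `print`) prints `$0`, the whole line.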
Key special variables
• NF – Number of fields (columns)
• NR – Number of records (rows; all files)
• FNR – Number of records (rows; per file)
• FS – (input) field separator (default: " ")
• OFS – Output field separator (default: " ")
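As a rough illustration of how these variables differ, the sketch below prints NR, FNR and NF for every row of two small files. The file names and contents are hypothetical, and OFS is set to a comma only to make the output separator visible.

```shell
# Two throwaway input files (names and contents are illustrative only).
vars_tmp=$(mktemp -d)
printf 'a b c\nd e\n' > "$vars_tmp/one.txt"
printf 'f\n' > "$vars_tmp/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file;
# NF is the number of columns on the current row.
vars_demo=$(awk 'BEGIN { OFS = "," } { print NR, FNR, NF }' \
    "$vars_tmp/one.txt" "$vars_tmp/two.txt")
printf '%s\n' "$vars_demo"

rm -r "$vars_tmp"
```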
Examples
BEGIN { FS=OFS="\t"; }
{ print $1,$2,$3; }
FNR > 1 { }
grep -v 'chr[mM]' <f> | awk '{print
$1,$2,$3}' | sed 's/chr//;'
awk '$1 !~ /chr[mM]/ {sub(/chr/, "");
print $1,$2,$3}' <f>
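The pieces on this slide can be combined into one call. The sketch below assumes a tab-separated input with a single header row; the column names and values are invented for the example.

```shell
# Hypothetical tab-separated input with a header row.
tsv_demo=$(printf 'chrom\tstart\tend\tname\nchr1\t10\t20\tx\nchrM\t5\t9\ty\nchr2\t30\t40\tz\n' | awk '
BEGIN { FS = OFS = "\t" }             # tab-delimited input and output
FNR > 1 && $1 !~ /chr[mM]/ {          # skip the header row and chrM rows
    sub(/chr/, "", $1)                # strip the "chr" prefix from column 1
    print $1, $2, $3
}')
printf '%s\n' "$tsv_demo"
```

Restricting `sub()` to `$1` is a small change from the slide's version, which substitutes within the whole line; it avoids touching a "chr" that happens to occur in a later column.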
Examples: two files
awk 'FNR==NR{a[$1]=$2; next}
{print $1,$2,a[$2]; }' \
<file 1> <file 2>
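A worked sketch of the two-file idiom, using made-up lookup and query files. While FNR equals NR, awk is still reading the first file, so that block only fills the array; the second block then runs on the second file. The "NA" fallback for missing keys is an addition for the example, not part of the slide.

```shell
# Hypothetical inputs: a lookup table (file 1) and rows to annotate (file 2).
join_tmp=$(mktemp -d)
printf 'g1 kinase\ng2 phosphatase\n' > "$join_tmp/lookup.txt"
printf 's1 g1\ns2 g3\ns3 g2\n' > "$join_tmp/samples.txt"

join_demo=$(awk '
FNR == NR { a[$1] = $2; next }                       # first file: key -> value
{ print $1, $2, (($2 in a) ? a[$2] : "NA") }         # second file: look up column 2
' "$join_tmp/lookup.txt" "$join_tmp/samples.txt")
printf '%s\n' "$join_demo"

rm -r "$join_tmp"
```

One caveat with this idiom: if the first file is empty, FNR == NR also holds for the second file, so its rows would be treated as lookup entries.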
Bioawk, by Heng Li
• Behaves like GNU awk, on non-bio. data
• Install from GitHub repo. or others' Dockers
• Supports: BED, GFF, FASTA, FASTQ, SAM, VCF
  • List formats with: bioawk -c help
• Directly reads gzipped files (usually)
• -t short for bioawk -F'\t' -v OFS="\t"
• Treats a sequence spanning multiple lines as a single record
Bioawk - generic/BED files
Parse column names:
bioawk -c header '{ print $chr }' \
<file.gz>
chr1
chr3
chrX
Bioawk - examples - GFF*/GTFs
Find all exons less than 100 bp, which are
annotated as the main functional isoform
(i.e., APPRIS principal 1):
bioawk -c gff '$feature == "exon" &&
($end - $start) < 100 &&
$attribute ~ /appris_principal_1/' \
gencode.vXX.annotation.gtf.gz
Example adapted from https://hpc.nih.gov/apps/bioawk.html
Bioawk - examples - FASTAs
Reverse complement:
bioawk -c fastx \
'{print ">"$name;
print revcomp($seq)}' \
seq.fa.gz
Example taken from the README.
Bioawk - examples - FASTAs
List of sequence names and lengths:
bioawk -c fastx \
'{print $name,
length($seq)}' \
seq.fa.gz
Adapted from a DNA.today blog post, by Jean-Yves Sgro (January 25, 2020).
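For comparison, a plain-awk sketch of the same name and length listing, which has to stitch multi-line sequences together by hand. The FASTA records are invented, and the input is assumed to be uncompressed.

```shell
# Hypothetical FASTA input; seq1 spans two lines.
fasta_demo=$(printf '>seq1 some description\nACGT\nAC\n>seq2\nGGG\n' | awk '
/^>/ {                                   # header line starts a new record
    if (name != "") print name, len      # flush the previous record
    name = substr($1, 2)                 # name: text after ">" up to first space
    len = 0
    next
}
{ len += length($0) }                    # sequence line: add to running length
END { if (name != "") print name, len }  # flush the final record
')
printf '%s\n' "$fasta_demo"
```

The bookkeeping here (tracking the current name, flushing in END) is what the fastx mode does on the user's behalf.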
Bioawk - examples - FASTQs
%GC and mean Phred quality score:
bioawk -c fastx \
'{ print ">"$name;
print gc($seq);
print meanqual($qual);
}' seq.fq.gz
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
Bioawk - examples - SAM files
Extract mapped reads:
sambamba view x.bam | 
bioawk -c sam '!and($flag,4)'
Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
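Where neither bioawk nor gawk's and() is available, the same flag test can be approximated with integer arithmetic. This is a sketch assuming tab-separated SAM text on standard input, with the flag in column 2; the three-column records below are simplified stand-ins, not complete SAM lines.

```shell
# Hypothetical, truncated SAM-like records: read name, flag, reference name.
sam_demo=$(printf 'r1\t0\tchr1\nr2\t4\t*\nr3\t16\tchr2\nr4\t77\t*\n' | awk -F'\t' '
# Bit 0x4 (value 4) marks an unmapped read: int(flag / 4) % 2 extracts that bit.
!/^@/ && int($2 / 4) % 2 == 0 { print $1 }   # skip header lines, keep mapped reads
')
printf '%s\n' "$sam_demo"
```

Flag 77 is 64 + 8 + 4 + 1, so its 4-bit is set and r4 is dropped along with r2.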
Bioawk - examples - VCF files
bioawk -c vcf '{
freq[$filter]++
total++
}
END {
for(val in freq)
printf "%s\t%d\t%d\n",
val, freq[val], freq[val]*100/total
}'
From sahilseth's flowr Bioawk tips, itself adapted from Stephen Turner's "Bioinformatics one-liners".
Assess pipeline (sequence filter statistics):
• filter (e.g. LowQual)
• number of filter occurrences
• percentage of total filters
Bioawk - examples - VCF files
VCF data: Erik Garrison's vcflib, sample.vcf.
PASS 5 55
q10 1 11
. 3 33
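The tally can also be sketched in plain awk, assuming uncompressed VCF text where FILTER is the seventh tab-separated column and header lines start with "#". The four records below are invented, not taken from sample.vcf, and the sort is added only because awk's `for (val in freq)` order is unspecified.

```shell
# Hypothetical minimal VCF: two header lines and four records.
vcf_demo=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t50\tPASS\t.\n1\t200\t.\tC\tT\t5\tq10\t.\n1\t300\t.\tG\tA\t60\tPASS\t.\n1\t400\t.\tT\tC\t70\tPASS\t.\n' | awk -F'\t' '
!/^#/ { freq[$7]++; total++ }            # count each FILTER value, skipping headers
END {
    for (val in freq)                    # filter, occurrences, percentage of total
        printf "%s\t%d\t%d\n", val, freq[val], freq[val] * 100 / total
}' | LC_ALL=C sort)
printf '%s\n' "$vcf_demo"
```

As on the slide, `%d` truncates the percentage, so the last column need not sum to 100.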
Bioawk: list of added functions
• gc($seq)
• meanqual($qual)
• reverse($seq) / revcomp($seq)
• qualcount($qual, threshold)
• Number of quality values above the threshold parameter.
• trimq(qual, beg, end, param=0.05)
• Trims using Richard Mott's algorithm (used in Phred).
• Bitwise AND/OR/XOR

More Related Content

What's hot

김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019
김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019
김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019min woog kim
 
[D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint
[D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint [D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint
[D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint NAVER D2
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsemBO_Conference
 
java memory management & gc
java memory management & gcjava memory management & gc
java memory management & gcexsuns
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at NetflixBrendan Gregg
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VRISC-V International
 
리눅스 커널 디버거 KGDB/KDB
리눅스 커널 디버거 KGDB/KDB리눅스 커널 디버거 KGDB/KDB
리눅스 커널 디버거 KGDB/KDBManjong Han
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
Java App On Digital Ocean: Deploying With Gitlab CI/CD
Java App On Digital Ocean: Deploying With Gitlab CI/CDJava App On Digital Ocean: Deploying With Gitlab CI/CD
Java App On Digital Ocean: Deploying With Gitlab CI/CDSeun Matt
 
Ss systemdのwslディストロを作る kernelvm探検隊online part 3
Ss systemdのwslディストロを作る kernelvm探検隊online part 3Ss systemdのwslディストロを作る kernelvm探検隊online part 3
Ss systemdのwslディストロを作る kernelvm探検隊online part 3Takaya Saeki
 
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다KWON JUNHYEOK
 
Linux for embedded_systems
Linux for embedded_systemsLinux for embedded_systems
Linux for embedded_systemsVandana Salve
 
Advanced Namespaces and cgroups
Advanced Namespaces and cgroupsAdvanced Namespaces and cgroups
Advanced Namespaces and cgroupsKernel TLV
 
UniRx - Reactive Extensions for Unity(EN)
UniRx - Reactive Extensions for Unity(EN)UniRx - Reactive Extensions for Unity(EN)
UniRx - Reactive Extensions for Unity(EN)Yoshifumi Kawai
 
Microservices - it's déjà vu all over again
Microservices  - it's déjà vu all over againMicroservices  - it's déjà vu all over again
Microservices - it's déjà vu all over againArnon Rotem-Gal-Oz
 
ProxySQL on Kubernetes
ProxySQL on KubernetesProxySQL on Kubernetes
ProxySQL on KubernetesRené Cannaò
 
Asynchronous Programming in .NET
Asynchronous Programming in .NETAsynchronous Programming in .NET
Asynchronous Programming in .NETPierre-Luc Maheu
 

What's hot (20)

김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019
김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019
김민욱, (달빛조각사) 엘릭서를 이용한 mmorpg 서버 개발, NDC2019
 
[D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint
[D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint [D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint
[D2] java 애플리케이션 트러블 슈팅 사례 & pinpoint
 
FreeRTOS Course - Semaphore/Mutex Management
FreeRTOS Course - Semaphore/Mutex ManagementFreeRTOS Course - Semaphore/Mutex Management
FreeRTOS Course - Semaphore/Mutex Management
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
java memory management & gc
java memory management & gcjava memory management & gc
java memory management & gc
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-V
 
리눅스 커널 디버거 KGDB/KDB
리눅스 커널 디버거 KGDB/KDB리눅스 커널 디버거 KGDB/KDB
리눅스 커널 디버거 KGDB/KDB
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
Review of QNX
Review of QNXReview of QNX
Review of QNX
 
Java App On Digital Ocean: Deploying With Gitlab CI/CD
Java App On Digital Ocean: Deploying With Gitlab CI/CDJava App On Digital Ocean: Deploying With Gitlab CI/CD
Java App On Digital Ocean: Deploying With Gitlab CI/CD
 
Ss systemdのwslディストロを作る kernelvm探検隊online part 3
Ss systemdのwslディストロを作る kernelvm探検隊online part 3Ss systemdのwslディストロを作る kernelvm探検隊online part 3
Ss systemdのwslディストロを作る kernelvm探検隊online part 3
 
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
[Devil's camp 2019] 혹시 Elixir 아십니까? 정.말.갓.언.어.입.니.다
 
Linux for embedded_systems
Linux for embedded_systemsLinux for embedded_systems
Linux for embedded_systems
 
Advanced Namespaces and cgroups
Advanced Namespaces and cgroupsAdvanced Namespaces and cgroups
Advanced Namespaces and cgroups
 
UniRx - Reactive Extensions for Unity(EN)
UniRx - Reactive Extensions for Unity(EN)UniRx - Reactive Extensions for Unity(EN)
UniRx - Reactive Extensions for Unity(EN)
 
Microservices - it's déjà vu all over again
Microservices  - it's déjà vu all over againMicroservices  - it's déjà vu all over again
Microservices - it's déjà vu all over again
 
Hacking QNX
Hacking QNXHacking QNX
Hacking QNX
 
ProxySQL on Kubernetes
ProxySQL on KubernetesProxySQL on Kubernetes
ProxySQL on Kubernetes
 
Asynchronous Programming in .NET
Asynchronous Programming in .NETAsynchronous Programming in .NET
Asynchronous Programming in .NET
 

Similar to Awk primer and Bioawk

Shell Scripts
Shell ScriptsShell Scripts
Shell ScriptsDr.Ravi
 
Airlover 20030324 1
Airlover 20030324 1Airlover 20030324 1
Airlover 20030324 1Dr.Ravi
 
ShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaa
ShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaaShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaa
ShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaaewout2
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filtersAcácio Oliveira
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filtersAcácio Oliveira
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Maté Ongenaert
 
2005_Structures and functions of Makefile
2005_Structures and functions of Makefile2005_Structures and functions of Makefile
2005_Structures and functions of MakefileNakCheon Jung
 
3.2 process text streams using filters
3.2 process text streams using filters3.2 process text streams using filters
3.2 process text streams using filtersAcácio Oliveira
 
JIP Pipeline System Introduction
JIP Pipeline System IntroductionJIP Pipeline System Introduction
JIP Pipeline System Introductionthasso23
 
DevChatt 2010 - *nix Cmd Line Kung Foo
DevChatt 2010 - *nix Cmd Line Kung FooDevChatt 2010 - *nix Cmd Line Kung Foo
DevChatt 2010 - *nix Cmd Line Kung Foobrian_dailey
 
101 3.4 use streams, pipes and redirects
101 3.4 use streams, pipes and redirects101 3.4 use streams, pipes and redirects
101 3.4 use streams, pipes and redirectsAcácio Oliveira
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filtersAcácio Oliveira
 
Course 102: Lecture 8: Composite Commands
Course 102: Lecture 8: Composite Commands Course 102: Lecture 8: Composite Commands
Course 102: Lecture 8: Composite Commands Ahmed El-Arabawy
 
One-Liners to Rule Them All
One-Liners to Rule Them AllOne-Liners to Rule Them All
One-Liners to Rule Them Allegypt
 
Terraform in deployment pipeline
Terraform in deployment pipelineTerraform in deployment pipeline
Terraform in deployment pipelineAnton Babenko
 

Similar to Awk primer and Bioawk (20)

awk_intro.ppt
awk_intro.pptawk_intro.ppt
awk_intro.ppt
 
Awk programming
Awk programming Awk programming
Awk programming
 
Shell Scripts
Shell ScriptsShell Scripts
Shell Scripts
 
Unix - Class7 - awk
Unix - Class7 - awkUnix - Class7 - awk
Unix - Class7 - awk
 
BioMake BOSC 2004
BioMake BOSC 2004BioMake BOSC 2004
BioMake BOSC 2004
 
Airlover 20030324 1
Airlover 20030324 1Airlover 20030324 1
Airlover 20030324 1
 
ShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaa
ShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaaShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaa
ShellAdvanced aaäaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filters
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filters
 
Unix Tutorial
Unix TutorialUnix Tutorial
Unix Tutorial
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
2005_Structures and functions of Makefile
2005_Structures and functions of Makefile2005_Structures and functions of Makefile
2005_Structures and functions of Makefile
 
3.2 process text streams using filters
3.2 process text streams using filters3.2 process text streams using filters
3.2 process text streams using filters
 
JIP Pipeline System Introduction
JIP Pipeline System IntroductionJIP Pipeline System Introduction
JIP Pipeline System Introduction
 
DevChatt 2010 - *nix Cmd Line Kung Foo
DevChatt 2010 - *nix Cmd Line Kung FooDevChatt 2010 - *nix Cmd Line Kung Foo
DevChatt 2010 - *nix Cmd Line Kung Foo
 
101 3.4 use streams, pipes and redirects
101 3.4 use streams, pipes and redirects101 3.4 use streams, pipes and redirects
101 3.4 use streams, pipes and redirects
 
101 3.2 process text streams using filters
101 3.2 process text streams using filters101 3.2 process text streams using filters
101 3.2 process text streams using filters
 
Course 102: Lecture 8: Composite Commands
Course 102: Lecture 8: Composite Commands Course 102: Lecture 8: Composite Commands
Course 102: Lecture 8: Composite Commands
 
One-Liners to Rule Them All
One-Liners to Rule Them AllOne-Liners to Rule Them All
One-Liners to Rule Them All
 
Terraform in deployment pipeline
Terraform in deployment pipelineTerraform in deployment pipeline
Terraform in deployment pipeline
 

More from Hoffman Lab

GNU Parallel: Lab meeting—technical talk
GNU Parallel: Lab meeting—technical talkGNU Parallel: Lab meeting—technical talk
GNU Parallel: Lab meeting—technical talkHoffman Lab
 
Efficient querying of genomic reference databases with gget
Efficient querying of genomic reference databases with ggetEfficient querying of genomic reference databases with gget
Efficient querying of genomic reference databases with ggetHoffman Lab
 
WashU Epigenome Browser
WashU Epigenome BrowserWashU Epigenome Browser
WashU Epigenome BrowserHoffman Lab
 
Wireguard: A Virtual Private Network Tunnel
Wireguard: A Virtual Private Network TunnelWireguard: A Virtual Private Network Tunnel
Wireguard: A Virtual Private Network TunnelHoffman Lab
 
Plotting heatmap with matplotlib/seaborn
Plotting heatmap with matplotlib/seabornPlotting heatmap with matplotlib/seaborn
Plotting heatmap with matplotlib/seabornHoffman Lab
 
Go Get Data (GGD)
Go Get Data (GGD)Go Get Data (GGD)
Go Get Data (GGD)Hoffman Lab
 
fastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorfastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorHoffman Lab
 
R markdown and Rmdformats
R markdown and RmdformatsR markdown and Rmdformats
R markdown and RmdformatsHoffman Lab
 
File searching tools
File searching toolsFile searching tools
File searching toolsHoffman Lab
 
Better BibTeX (BBT) for Zotero
Better BibTeX (BBT) for ZoteroBetter BibTeX (BBT) for Zotero
Better BibTeX (BBT) for ZoteroHoffman Lab
 
Terminals and Shells
Terminals and ShellsTerminals and Shells
Terminals and ShellsHoffman Lab
 
BioRender & Glossary/Acronym
BioRender & Glossary/AcronymBioRender & Glossary/Acronym
BioRender & Glossary/AcronymHoffman Lab
 
BioSyntax: syntax highlighting for computational biology
BioSyntax: syntax highlighting for computational biologyBioSyntax: syntax highlighting for computational biology
BioSyntax: syntax highlighting for computational biologyHoffman Lab
 
Get Good With Git
Get Good With GitGet Good With Git
Get Good With GitHoffman Lab
 
Tech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome BrowserTech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome BrowserHoffman Lab
 
MultiQC: summarize analysis results for multiple tools and samples in a singl...
MultiQC: summarize analysis results for multiple tools and samples in a singl...MultiQC: summarize analysis results for multiple tools and samples in a singl...
MultiQC: summarize analysis results for multiple tools and samples in a singl...Hoffman Lab
 
dreamRs: interactive ggplot2
dreamRs: interactive ggplot2dreamRs: interactive ggplot2
dreamRs: interactive ggplot2Hoffman Lab
 
Basic Cryptography & Security
Basic Cryptography & SecurityBasic Cryptography & Security
Basic Cryptography & SecurityHoffman Lab
 

More from Hoffman Lab (20)

GNU Parallel: Lab meeting—technical talk
GNU Parallel: Lab meeting—technical talkGNU Parallel: Lab meeting—technical talk
GNU Parallel: Lab meeting—technical talk
 
TCRpower
TCRpowerTCRpower
TCRpower
 
Efficient querying of genomic reference databases with gget
Efficient querying of genomic reference databases with ggetEfficient querying of genomic reference databases with gget
Efficient querying of genomic reference databases with gget
 
WashU Epigenome Browser
WashU Epigenome BrowserWashU Epigenome Browser
WashU Epigenome Browser
 
Wireguard: A Virtual Private Network Tunnel
Wireguard: A Virtual Private Network TunnelWireguard: A Virtual Private Network Tunnel
Wireguard: A Virtual Private Network Tunnel
 
Plotting heatmap with matplotlib/seaborn
Plotting heatmap with matplotlib/seabornPlotting heatmap with matplotlib/seaborn
Plotting heatmap with matplotlib/seaborn
 
Go Get Data (GGD)
Go Get Data (GGD)Go Get Data (GGD)
Go Get Data (GGD)
 
fastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorfastp: the FASTQ pre-processor
fastp: the FASTQ pre-processor
 
R markdown and Rmdformats
R markdown and RmdformatsR markdown and Rmdformats
R markdown and Rmdformats
 
File searching tools
File searching toolsFile searching tools
File searching tools
 
Better BibTeX (BBT) for Zotero
Better BibTeX (BBT) for ZoteroBetter BibTeX (BBT) for Zotero
Better BibTeX (BBT) for Zotero
 
Terminals and Shells
Terminals and ShellsTerminals and Shells
Terminals and Shells
 
BioRender & Glossary/Acronym
BioRender & Glossary/AcronymBioRender & Glossary/Acronym
BioRender & Glossary/Acronym
 
Linters in R
Linters in RLinters in R
Linters in R
 
BioSyntax: syntax highlighting for computational biology
BioSyntax: syntax highlighting for computational biologyBioSyntax: syntax highlighting for computational biology
BioSyntax: syntax highlighting for computational biology
 
Get Good With Git
Get Good With GitGet Good With Git
Get Good With Git
 
Tech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome BrowserTech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome Browser
 
MultiQC: summarize analysis results for multiple tools and samples in a singl...
MultiQC: summarize analysis results for multiple tools and samples in a singl...MultiQC: summarize analysis results for multiple tools and samples in a singl...
MultiQC: summarize analysis results for multiple tools and samples in a singl...
 
dreamRs: interactive ggplot2
dreamRs: interactive ggplot2dreamRs: interactive ggplot2
dreamRs: interactive ggplot2
 
Basic Cryptography & Security
Basic Cryptography & SecurityBasic Cryptography & Security
Basic Cryptography & Security
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...Product School
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform EngineeringJemma Hussein Allen
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
 
НАДІЯ ФЕДЮШКО БАЦ «Професійне зростання QA спеціаліста»
НАДІЯ ФЕДЮШКО БАЦ  «Професійне зростання QA спеціаліста»НАДІЯ ФЕДЮШКО БАЦ  «Професійне зростання QA спеціаліста»
НАДІЯ ФЕДЮШКО БАЦ «Професійне зростання QA спеціаліста»QADay
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
НАДІЯ ФЕДЮШКО БАЦ «Професійне зростання QA спеціаліста»
НАДІЯ ФЕДЮШКО БАЦ  «Професійне зростання QA спеціаліста»НАДІЯ ФЕДЮШКО БАЦ  «Професійне зростання QA спеціаліста»
НАДІЯ ФЕДЮШКО БАЦ «Професійне зростання QA спеціаліста»
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 

Awk primer and Bioawk

  • 1. Awk primer and Bioawk Coby Viner Lab meeting: tech – Wednesday, Dec. 2, 2020
  • 2. (GNU) awk • A complete programming language • Operates on a per-line (row) basis • Designed to operate upon columnar data • By default, whitespace-delimited columns • Outputs on a per-row basis
  • 3. General syntax BEGIN { print "START"; } { print } END { print "END"; }' <file 1> … <file N> Adapted from Bruce Barnett’s “Intro. To AWK”: https://www.grymoire.com/Unix/Awk.html. $0; Columns: $1, $2, …, $NF awk '
  • 4. Key special variables • NF – Number of fields (columns) • NR – Number of records (rows; all files) • FNR – Number of records (rows; per file) • FS – (input) field separator (default: " ") • OFS – Output field separator (default: " ")
  • 5. Examples BEGIN { FS=OFS="t"; } { print $1,$2,$3; } FNR > 1 { } grep -v 'chr[mM]' <f> | awk '{print $1,$2,$3}' | sed 's/chr//;' awk '$1 !~ /chr[mM]/ {sub(/chr/, ""); print $1,$2,$3}' <f>
  • 6. Examples: two files awk 'FNR==NR{a[$1]=$2; next} {print $1,$2,a[$2]; }' <file 1> <file 2>
  • 7. Bioawk, by Heng Li • Behaves like GNU awk, on non-bio. data • Install from GitHub repo. or others' Dockers • Supports: BED, GFF, FASTA, FASTQ, SAM, VCF  List formats with: bioawk -c help • Directly reads gzipped files (usually) • -t short for bioawk -F't' -v OFS="t" • Treats spanning seqs as a single record
  • 8. Bioawk - generic/BED files Parse column names: bioawk -c header '{ print $chr }' <file.gz> chr1 chr3 chrX
  • 9. Bioawk - examples - GFF*/GTFs Find all exons less than 100 bp, which are annotated as the main functional isoform (i.e., APPRIS principal 1): bioawk -c gff '$feature == "exon" && ($end - $start) < 100 && $attribute ~ /appris_principal_1/' gencode.vXX.annotation.gtf.gz Example adapted from https://hpc.nih.gov/apps/bioawk.html
  • 10. Bioawk - examples - FASTAs Reverse complement: bioawk -c fastx '{print ">"$name; print revcomp($seq)}' seq.fa.gz Example taken from the README.
  • 11. Bioawk - examples - FASTAs List of sequence names and lengths: bioawk -c fastx '{print $name, length($seq)}' seq.fa.gz Adapted from a DNA.today blog post, by Jean-Yves Sgro (January 25, 2020).
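For comparison, a rough plain-awk counterpart to slide 11 has to join wrapped sequence lines itself. This is only a sketch on made-up input, not how Bioawk is implemented. The helper name emit is hypothetical.

```shell
# Made-up FASTA with one sequence wrapped over two lines.
input=$(printf '>seqA desc\nACGT\nAC\n>seqB\nGGG\n')

result=$(printf '%s\n' "$input" | awk '
function emit() { if (name != "") print name, len }
/^>/ {
    emit()                 # report the previous record, if any
    name = substr($1, 2)   # header text up to the first whitespace, minus ">"
    len = 0
    next
}
{ len += length($0) }      # wrapped sequence lines add up to one record
END { emit() }
')
printf '%s\n' "$result"
```

With -c fastx, Bioawk handles this bookkeeping and exposes the joined sequence as $seq.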
  • 12. Bioawk - examples - FASTQs %GC and mean Phred quality score: bioawk -c fastx '{ print ">"$name; print gc($seq); print meanqual($qual); }' seq.fq.gz Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
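GC content can also be sketched in plain awk, relying on gsub returning the number of substitutions it made. The sequences are made up, one per line, and the value is reported as a fraction between 0 and 1. This illustrates the calculation only and makes no claim about the scale that Bioawk's gc() uses.

```shell
# Made-up sequences, one per line (no empty lines, so length is never 0).
result=$(printf 'ACGTGGCC\nATAT\n' | awk '{
    s = $0
    n = gsub(/[GCgc]/, "", s)      # gsub returns the number of replacements made
    printf "%s\t%.2f\n", $0, n / length($0)
}')
printf '%s\n' "$result"
```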
  • 13. Bioawk - examples - SAM files Extract mapped reads: sambamba view x.bam | bioawk -c sam '!and($flag,4)' Adapted from Istvan Albert's Bioawk tutorial (on GitHub).
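The test !and($flag,4) on slide 13 checks that bit 0x4 (read unmapped) is not set. POSIX awk has no and() function, so this sketch does the same check with integer arithmetic. The input is made-up and truncated to the first three SAM columns.

```shell
# Made-up, truncated SAM lines: read name, flag, reference name.
input=$(printf '@HD\tVN:1.6\nr1\t0\tchr1\nr2\t4\t*\nr3\t16\tchr2\nr4\t77\t*\n')

result=$(printf '%s\n' "$input" | awk -F'\t' '
/^@/ { next }                       # skip SAM header lines
int($2 / 4) % 2 == 0 { print $1 }   # bit 0x4 unset: the read is mapped
')
printf '%s\n' "$result"
```

Flag 77 is 64 + 8 + 4 + 1, so the unmapped bit is set and r4 is dropped along with r2.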
  • 14. Bioawk - examples - VCF files Assess pipeline sequence filter statistics: bioawk -c vcf '{ freq[$filter]++; total++ } END { for(val in freq) printf "%s\t%d\t%d\n", val, freq[val], freq[val]*100/total }' Output columns: • filter (e.g. LowQual) • number of filter occurrences • percentage of total filters. From sahilseth's flowr Bioawk tips, itself adapted from Stephen Turner's "Bioinformatics one-liners".
  • 15. Bioawk - examples - VCF files VCF data: Erik Garrison's vcflib, sample.vcf. Output (filter, count, percentage): PASS 5 55 | q10 1 11 | . 3 33
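The same tally can be sketched in plain awk by reading the FILTER value from column 7. The records below are made up and are not taken from sample.vcf. The helper name make_vcf is hypothetical.

```shell
# Hypothetical helper that emits a tiny made-up VCF (four records).
make_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\n'
    printf '1\t10\t.\tA\tG\t50\tPASS\n'
    printf '1\t20\t.\tC\tT\t5\tq10\n'
    printf '2\t30\t.\tG\tA\t60\tPASS\n'
    printf '2\t40\t.\tT\tC\t70\tPASS\n'
}

result=$(make_vcf | awk -F'\t' '
!/^#/ { freq[$7]++; total++ }       # count each FILTER value, skipping headers
END {
    for (val in freq)
        printf "%s\t%d\t%d\n", val, freq[val], freq[val] * 100 / total
}' | sort)
printf '%s\n' "$result"
```

The order of for (val in freq) is unspecified in awk, so the output is piped through sort to make it reproducible. The %d format truncates the percentage, which is why 5 of 9 records on the slide shows as 55.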
  • 16. Bioawk: list of added functions • gc($seq) • meanqual($qual) • reverse($seq) / revcomp($seq) • qualcount($qual, threshold): number of quality values above the threshold parameter • trimq(qual, beg, end, param=0.05): trims using Richard Mott's algorithm (used in Phred) • Bitwise AND/OR/XOR

Editor's Notes

  1. AWK: initials of the original developers' surnames: A. Aho, P. Weinberger and B. W. Kernighan.
  2. https://github.com/lh3/bioawk
  3. Example files selected randomly, from ~/new_proj/experiments/2019-07-15-ChromaClique-initial_viz_work/templating_initial_attempt/Flt1_GABPA_site Shown with bioSyntax (Vim) highlighting.
  6. Example files selected randomly, from /mnt/work1/users/home2/cviner/workDir/cytomod/experiments_data/linked-2017-06-11-K562_POU5F1_reprocessing_and_comparison/ Shown with bioSyntax (Vim) highlighting.
  7. VCF used: https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf https://gist.github.com/sahilseth/587edf0aed095be49121fd7f05904e57 Illumina VCF filter annotations: If all filters are passed, PASS is written in the filter column. • LowDP—Applied to sites with depth of coverage below a cutoff. • LowGQ—The genotyping quality (GQ) is below a cutoff. • LowQual—The variant quality (QUAL) is below a cutoff. • LowVariantFreq—The variant frequency is less than the given threshold. • R8—For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8. • SB—The strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.
  9. "The modified Mott trimming algorithm, which is used to calculate the trimming information for the '-trim_alt' option and the phd files, uses base error probabilities calculated from the phred quality values. For each base it subtracts the base error probability from an error probability cutoff value (0.05 by default, and changed using the '-trim_cutoff' option) to form the base score. Then it finds the highest scoring segment of the sequence where the segment score is the sum of the segment base scores (the score can have non-negative values only). The algorithm requires a minimum segment length, which is set to 20 bases." http://bozeman.mbt.washington.edu/phrap.docs/phred.html