Presentation to cover the data and file formats commonly used in next generation sequencing (high throughput sequencing) analyses. From nucleotide ambiguity codes, FASTA and FASTQ, quality scores to SAM and BAM, CIGAR strings and variant calling format. This was given as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels, Belgium in April 2016.
Module 2 Sequence similarity.
Part of bioinformatics training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
It includes the information related to a bioinformatics tool BLAST (Basic Local Alignment Search Tool), BLAST is in-silico hybridisation to find regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. This presentation too contains the input - output format, Blast process and its types .
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
Module 2 Sequence similarity.
Part of bioinformatics training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
It includes the information related to a bioinformatics tool BLAST (Basic Local Alignment Search Tool), BLAST is in-silico hybridisation to find regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. This presentation too contains the input - output format, Blast process and its types .
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
Next Generation Sequencing (NGS) Is A Modern And Cost Effective Sequencing Technology Which Enables Scientists To Sequence Nucleic Acids At Much Faster Rate. In This Presentation, You Will Learn About What is NGS, Idea Behind NGS, Methodology And Protocol, Widely Adapted NGS Protocols, Applications And References For Further Study.
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
Sequence Alignment Pairwise alignment:- Global Alignment and Local AlignmentTwo types of alignment Progressive Programs for multiple sequence alignment BLOSUM Point accepted mutation (PAM)PAM VS BLOSUM
INTRODUCTION.
NCBI.
EMBL.
DDBJ.
CONCLUSION.
REFERENSE.
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health.
The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature.
All these databases are available online through the Entrez search engine.
Genome annotation, NGS sequence data, decoding sequence information, The genome contains all the biological information required to build and maintain any given living organism.
Next Generation Sequencing (NGS) Is A Modern And Cost Effective Sequencing Technology Which Enables Scientists To Sequence Nucleic Acids At Much Faster Rate. In This Presentation, You Will Learn About What is NGS, Idea Behind NGS, Methodology And Protocol, Widely Adapted NGS Protocols, Applications And References For Further Study.
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
Sequence Alignment Pairwise alignment:- Global Alignment and Local AlignmentTwo types of alignment Progressive Programs for multiple sequence alignment BLOSUM Point accepted mutation (PAM)PAM VS BLOSUM
INTRODUCTION.
NCBI.
EMBL.
DDBJ.
CONCLUSION.
REFERENSE.
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health.
The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature.
All these databases are available online through the Entrez search engine.
Genome annotation, NGS sequence data, decoding sequence information, The genome contains all the biological information required to build and maintain any given living organism.
Neuroscience core lecture given at the Icahn school of medicine at Mount Sinai. This is the version 2 of the same topic. I have made some modifications to give a more gentle introduction and add a new example for ngs.plot.
Next-generation sequencing format and visualization with ngs.plotLi Shen
Lecture given at the department of neuroscience, Icahn school of medicine at Mount Sinai. ngs.plot has been published in BMC genomics. Link: http://www.biomedcentral.com/1471-2164/15/284
[2017-05-29] DNASmartTagger : Development of DNA sequence tagging tools based on machine learning using public sequence annotation data, NIG International Symposium 2017.
Key recovery attacks against commercial white-box cryptography implementation...CODE BLUE
White-box cryptography aims to protect cryptographic primitives and keys in software implementations even when the adversary has a full control to the execution environment and an access to the implementation of the cryptographic algorithm. It combines mathematical transformation with obfuscation techniques so it’s not just obfuscation on a data and a code level but actually algorithmic obfuscation.
In the white-box implementation, cryptographic keys are mathematically transformed so that never revealed in a plain form, even during execution of cryptographic algorithms. With such security in the place, it becomes extremely difficult for attackers to locate, modify, and extract the cryptographic keys. Although all current academic white-box implementations have been practically broken by various attacks including table-decomposition, power analysis attack, and fault injection attacks, There are no published reports of successful attacks against commercial white-box implementations to date. When I have assessed Commercial white box implementations to check if they were vulnerable to previous attacks, I found out that previous attacks failed to retrieve a secret key protected with the commercial white-box implementation. Consequently, I modified side channel attacks to be available in academic literature and succeeded in retrieving a secret key protected with the commercial white-box cryptography implementation. This is the first report that succeeded to recover secret key protected with commercial white-box implementation to the best of my knowledge in this industry. In this talk, I would like to share how to recover the key protected with commercial white-box implementation and present security guides on applying white-box cryptography to services more securely.
The WesTest 3000 is a Bench Top Test Set that is adaptable and lower cost for those production, factory or engineering unit tests that only need a limited about of horsepower.
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...Kate Barlow
Mikael Kubista, Department of Biotechnology, CAS and TATAA Biocenter
We present a highly specific, sensitive and cost-effective system to quantify miRNA expression based on novel chemistry called Two-tailed RT-qPCR. It takes advantage of target-specific primers for reverse transcription composed of two hemiprobes complementary to two different parts of the targeted miRNA, connected by a hairpin structure. The introduction of a second probe ensures high sensitivity and enables discrimination of highly homologous miRNAs irrespectively of the position of the mismatched nucleotide. Two-tailed RT-qPCR has a dynamic range of 7 logs and a sensitivity sufficient to detect less than ten target miRNA molecules. The reverse transcription step can be multiplexed and it allows for rapid testing with a total analysis time of less than 2.5 hours.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is highly conserved process of posttranscriptional gene silencing by which double stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) is reported in a wide range of eukaryotes ranging from worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non- coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the
development of the worm C. elegans.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi ( non coding RNA)
MiRNA
Length (23-25 nt)
Trans acting
Binds with target MRNA in mismatch
Translation inhibition
Si RNA
Length 21 nt.
Cis acting
Bind with target Mrna in perfect complementary sequence
Piwi-RNA
Length ; 25 to 36 nt.
Expressed in Germ Cells
Regulates trnasposomes activity
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is large(>500kD) RNA multi- protein Binding complex which triggers MRNA degradation in response to MRNA
Unwinding of double stranded Si RNA by ATP independent Helicase
Active component of RISC is Ago proteins( ENDONUCLEASE) which cleave target MRNA.
DICER: endonuclease (RNase Family III)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN :
1.PAZ(PIWI/Argonaute/ Zwille)- Recognition of target MRNA
2.PIWI (p-element induced wimpy Testis)- breaks Phosphodiester bond of mRNA.)RNAse H activity.
MiRNA:
The Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression .
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
This pdf is about the Schizophrenia.
For more details visit on YouTube; @SELF-EXPLANATORY;
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks...!
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing the study of interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflect spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light received by the analyte.
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for the ultra-fast high-resolution imaging of cellular processes over time and space and were studied in its natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provide insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
NGS data formats and analyses
1. NGS data formats and analyses
Richard Orton
Viral Genomics and Bioinformatics
Centre for Virus Research
University of Glasgow
1
2. NGS – HTS – 1st – 2nd – 3rd Gen
• Next Generation Sequencing is now the
Current Generation Sequencing
• NGS = High-Throughput Sequencing (HTS)
• 1st Generation: The automated Sanger
sequencing method
• 2nd Generation: NGS (Illumina, Roche 454,
Ion Torrent etc)
• 3rd Generation: PacBio & Oxford Nanopore:
The Next Next Generation Sequencing.
Single molecule sequencing.
2
3. FASTA Format
• Text based format for representing nucleotide or protein sequences.
• Typical file extensions
– .fasta, .fas, .fa, .fna, .fsa
• Two components of a FASTA sequence
– Header line – begins with a `>` character – followed by a sequence name and an optional description
– Sequence line(s) – the sequence data either all on a single line or spread across multiple lines
• Typically, sequence data is spread across multiple lines of length 80, 70, or 60 bases
>gi|661348725|gb|KM034562.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTT
CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTGATCC
AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTAAAAC
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG
Name Description
Sequence lines
Header line `>`
3
4. FASTA Format
• Can have multiple sequences in a single FASTA file
• The ‘>’ character signifies the end of one sequence and the beginning of the next
one
• Sequences do not need to be the same length
>gi|123456780|gb|KM012345.1| Epizone test sequence 1
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAAT
>gi|123456781|gb|KM012346.1| Epizone test sequence 2
CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAAC
>gi|123456782|gb|KM012347.1| Epizone test sequence 3
AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAAC
>gi|123456783|gb|KM012348.1| Epizone test sequence 4
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGG
>gi|123456784|gb|KM012349.1| Epizone test sequence 5
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGAAAAAAAAAAAAAAAAAAAAGCAGGTCTGTCCG
4
5. Sequence Characters – IUPAC Codes
• The nucleic acid notation currently in use was first formalized by the
International Union of Pure and Applied Chemistry (IUPAC) in 1970.
Symbol Description Bases Represented Num
A Adenine A 1
C Cytosine C 1
G Guanine G 1
T Thymine T 1
U Uracil U 1
W Weak A T 2
S Strong C G 2
M aMino A C 2
K Keto G T 2
R puRine A G 2
Y pYrimidine C T 2
B not A (B comes after A) C G T 3
D not C (D comes after C) A G T 4
H not G (H comes after G) A C T 4
V not T (V comes after T & U) A C G 4
N Any Nucleotide A C G T 4
- Gap 0
5
6. FASTA Names
• GenBank, at the National Center for Biotechnology Information (NCBI), is the NIH genetic sequence
database, an annotated collection of all publicly available DNA sequences.
– Part of the International Nucleotide Sequence Database Collaboration, which also comprises the DNA
DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL).
• GI (genInfo identifier) number
– A unique integer number which identifies a particular sequence (DNA or prot)
– As of September 2016, the integer sequence identifiers known as "GIs" will no longer be included in the
GenBank
• Accession Number
– A unique alphanumeric unique identifier given to a sequence (DNA or prot)
– Accession.Version
>gi|661348725|gb|KM034562.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTT
CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTGATCC
AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTAAAAC
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG
GI number
Description
Accession Number
Version Number
6
7. FASTQ Format
• Typically you will get your NGS reads in FASTQ format
– Oxford Nanopore – FAST5 format
– PacBio – HDF5 format
• FASTQ is a text-based format for storing both a nucleotide sequence and its corresponding quality
score.
• Four lines per sequence (read)
– @ - The UNIQUE sequence name
– The nucleotide sequence {A, C, G, T, N} – on 1 line
– + - The quality line break – sometimes see +SequenceName
– The quality scores {ASCII characters} – on 1 line
• The quality line is always the same length as the sequence line
– The 1st quality score corresponds to the quality of the 1st sequence
– The 14th quality score corresponds to the quality of the 14th sequence
• Multiple FASTQ sequence per file – millions
• Sequences can be different length
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ Quality scores
Sequence bases
Sequence name
Quality line break
7
8. FASTQ Names
• Specific to Illumina Sequences
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - FASTQ Sequence Start
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Instrument Name
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Run ID
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Flowcell ID
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Lane Number
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Tile Number
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - X-coordinate of cluster
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Y-coordinate of cluster
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - read Number (Paired 1/2)
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Filtered
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG – Control Number
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG – Index Sequence
• Filtered by Illumina Real Time Analysis (RTA) software during the run
– Y = is filtered (did not pass)
– N = is not filtered (did pass)
• Control number
– 0 if sequence NOT identified as control (PhiX)
– > 0 is a control sequence
• However, can be just: @SRR1553467.1 1 8
9. Illumina Flowcell
Surface of flow cell coated with a dense lawn of oligos
8 Lanes/Channels
2 Columns Per Lane
50 tiles per column, 100 tiles per lane (GAII)
Each column contains 50 tiles
Each tile is imaged 4 times per cycle – 1 image
per base {ACGT}
For 70bp run: 28,000 images per lane, 224,000
images per flowcell
10. FASTQ Quality Scores
• Each nucleotide in a sequence has an associated quality score
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
ACGTGTACACAGTCAGATGATAGCAGATAGGAAAT
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@
• FASTQ quality scores are ASCII characters
– ASCII - American Standard Code for Information Interchange
– letters {A-Z, a-z}, numbers {0-9}, special characters {@ £ # $ % ^ & * ( ) ,
. ? “ / | ~ ` { } }
– Each character has a numeric whole number associated with it
• When people speak of quality scores, it is typically in the region
of 0 to 40
– For example, trimming the ends of reads below Q25
• To translate from ASCII characters to Phred scale quality scores
– YOU SUBTRACT 33 (-33)
• But … what does a quality score actually mean???
10
Q-Symbol Q-ASCII Q-Score
I 73 40
H 72 39
G 71 38
F 70 37
E 69 36
D 68 35
C 67 34
B 66 33
A 65 32
@ 64 31
? 63 30
> 62 29
= 61 28
< 60 27
; 59 26
: 58 25
9 57 24
8 56 23
7 55 22
6 54 21
5 53 20
4 52 19
3 51 18
2 50 17
1 49 16
0 48 15
/ 47 14
. 46 13
- 45 12
, 44 11
+ 43 10
* 42 9
) 41 8
( 40 7
' 39 6
& 38 5
% 37 4
$ 36 3
# 35 2
" 34 1
! 33 0
11. Q = Probability of Error
Q-Symbol Q-ASCII Q-Score P-Error
I 73 40 0.00010
H 72 39 0.00013
G 71 38 0.00016
F 70 37 0.00020
E 69 36 0.00025
D 68 35 0.00032
C 67 34 0.00040
B 66 33 0.00050
A 65 32 0.00063
@ 64 31 0.00079
? 63 30 0.00100
> 62 29 0.00126
= 61 28 0.00158
< 60 27 0.00200
; 59 26 0.00251
: 58 25 0.00316
9 57 24 0.00398
8 56 23 0.00501
7 55 22 0.00631
6 54 21 0.00794
5 53 20 0.01000
4 52 19 0.01259
3 51 18 0.01585
2 50 17 0.01995
1 49 16 0.02512
0 48 15 0.03162
/ 47 14 0.03981
. 46 13 0.05012
- 45 12 0.06310
, 44 11 0.07943
+ 43 10 0.10000
* 42 9 0.12589
) 41 8 0.15849
( 40 7 0.19953
' 39 6 0.25119
& 38 5 0.31623
% 37 4 0.39811
$ 36 3 0.50119
# 35 2 0.63096
" 34 1 0.79433
! 33 0 1.00000
Q40 is not the maximum quality score
• A Phred quality score is a measure of the quality of the
identification of the nucleotide generated by automated DNA
sequencing.
• The quality score is associated to a probability of error – the
probability of an incorrect base call
• The calculation takes into account the ambiguity of the signal
for the respective base as well as the quality of neighboring
bases and the quality of the entire read.
• P = 10(-Q/10)
• Q40 = 1 in 10,000 error rate
• Q = - 10 x log10(P)
• If you have high coverage (~30,000) you expect errors even if
everything was the best quality i.e you would expect 3 errors on
average.
12. Read Quality Scores
• Quality scores tend to decrease along the read
• The read represents the sequence of a Cluster
– Cluster is comprised of ~1,000 DNA molecules
• As the sequencing progresses more and more of the DNA
molecules in the cluster get out of sync
– Phasing: falling behind: missing and incorporation cycle,
incomplete removal of the 3’ terminators/fluorophores
– Pre-phasing: jumping ahead: incorporation of multiple bases in a
cycle due to NTs without effective 3’ blocking
• Proportion of sequences in each cluster which are affected
by Phasing & Pre-Phasing increases with cycle number –
hampering correct base identification.
Q-Score P-error
40 0.0001
39 0.000125893
38 0.000158489
37 0.000199526
36 0.000251189
35 0.000316228
34 0.000398107
33 0.000501187
32 0.000630957
31 0.000794328
30 0.001
Trim low quality
ends off reads
Delete poor quality
reads overall
FASTQC
FASTX
ConDeTri
13. Illumina Quality Scores
• Quality scores tend to decrease along the read
• The read represents the sequence of a Cluster
– Cluster is comprised of ~1,000 DNA molecules
• As the sequencing progresses more and more of the DNA
molecules in the cluster get out of sync
– Phasing: falling behind: missing and incorporation cycle,
incomplete removal of the 3’ terminators/fluorophores
– Pre-phasing: jumping ahead: incorporation of multiple bases in a
cycle due to NTs without effective 3’ blocking
• Proportion of sequences in each cluster which are affected
by Phasing & Pre-Phasing increases with cycle number –
hampering correct base identification.
– A/C and G/ T fluorophore spectra overlap
– T accumulation – in certain chemistries – due to a lower removal
rate of T fluorophores
15. Short Read Alignment
• Find which bit of the genome the short read has come from
– Given a reference genome and a short read, find the part of the
reference genome that is identical (or most similar) to the read
• It is basically a String matching problem
– BLAST – use hash tables
– MUMmer – use suffix trees
– Bowtie – use Burrows-Wheeler Transform
• There are now many alignment tools for short reads:
– BWA, Bowtie, MAQ, Novoalign, Stampy … … … … …
– Lots of options such as maximum no of mismatches between read/ref,
gap penalties etc
• Aligning millions of reads can be time consuming process.
• So you want to save the alignments for each read to a file:
– So the process doesn’t need to be repeated
– So you can view the results
– In a generic format that other tools can utilise (variant calling etc)
16. SAM Header
• The first section of the SAM file is called the header.
• All header lines start with an @
• HD – The Header Line – 1st line
– SO – Sorting order of alignments (unknown (default), unsorted, queryname and coordinate)
• SQ - Reference sequence dictionary.
– SN - Reference sequence name.
– LN - Reference sequence length.
• PG – Program
– ID – The program ID
– PN – The program name
– VN – Program version number
– CL – The Command actually used to create the SAM file
• RG – Read Group
– "a set of reads that were all the product of a single sequencing run on one lane"
16
@HDVN:1.0 SO:coordinate
@SQ SN:KM034562.G3686.1 LN:18957
@PG ID:bowtie2PN:bowtie2 VN:2.2.4 CL:"/usr/bin/../lib/bowtie2/bin/bowtie2-align-s --wrapper basic-
0 --local -x ../ref/ebov -S ebov1.sam -1 ebov1_1.fastq -2 ebov1_2.fastq"
17. SAM File Format
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUALITY
Read1 10 FMDVgenome 3537 57 70M CCAGTACGTA >AAA=>?AA>
• SAM (Sequence Alignment/Map format) data files are outputted from aligners that read FASTQ files and
assign sequences/reads to a position with respect to a known reference genome.
– Readable Text format – tab delimited
– Each line contains alignment information for a read to the reference
• Each line contains:
– QNAME: Read Name
– FLAG: Info on if the read is mapped, part of a pair, strand etc
– RNAME: Reference Sequence Name that the read aligns to
– POS: Leftmost position of where this alignment maps to the reference
– MAPQ: Mapping quality of read to reference (phred scale P that mapping is wrong)
– CIGAR: Compact Idiosyncratic Gapped Alignment Report: 50M, 30M1I29M
– RNEXT: Paired Mate Read Name
– PNEXT: Paired Mate Position
– TLEN: Template length/Insert Size (difference in outer co-ordinates of paired reads)
– SEQ: The actual read DNA sequence
– QUAL: ASCII Phred quality scores (+33)
– TAGS: Optional data – Lots of options e.g. MD=String for mismatches
18. SAM FLAGS
• Example Flag = 99
• 99 < 128, 256, 512, 1024, 2048 … so all these flags don’t apply
• 99 > 64 … {99 – 64 = 35} … This read is the first in a pair
• 35 > 32 … {35 – 32 = 3} … The paired mate read is on the reverse strand
• 3 < 16 … {3} … This read is not on the reverse strand
• 3 < 8 … {3} … The paired mate read is not unmapped
• 3 > 2 … {3 – 2 = 1} … The read is mapped in proper pair
• 1 = 1 … {1 – 1 = 0} … The read is paired 18
# Flag Description
1 1 Read paired
2 2 Read mapped in proper pair
3 4 Read unmapped
4 8 Mate unmapped
5 16 Read reverse strand
6 32 Mate reverse strand
7 64 First in pair
8 128 Second in pair
9 256 Not primary alignment
10 512 Read fails platform/vendor quality checks
11 1024 Read is PCR or optical duplicate
12 2048 Supplementary alignment
8 > (4+2+1) = 7
16 > (8+4+2+1) = 15
32 > (16+8+4+2+1) = 31
etc
19. 19
• H : Hard Clipping: the clipped nucleotides have been removed from the read
• S : Soft Clipping: the clipped nucleotides are still present in the read
• M : Match: the nucleotides align to the reference – BUT the nucleotide can be an alignment match or mismatch.
• X : Mismatch: the nucleotide aligns to the reference but is a mismatch
• = : Match: the nucleotide aligns to the reference and is a match to it
• I : Insertion: the nucleotide(s) is present in the read but not in the reference
• D : Deletion; the nucleotide(s) is present in the reference but not in the read
• N : Skipped region in reference (for mRNA-to-genome alignment, an N operation represents an intron)
• P: Padding (silent deletion from the padded reference sequence) – not a true deletion
CIGAR String
• 8M2I4M1D3M
• 8Matches
• 2 insertions
• 4 Matches
• 1 Deletion
• 3 Matches
20. BAM File
• SAM (Sequence Alignment/Map) files are typically converted to
Binary Alignment/Map (BAM) format
• They are binary
– Can NOT be opened like a text file
– Compressed = Smaller (great for server storage)
– Sorted and indexed = Daster
• Once you have a file in a BAM format you can delete your aligned
read files
– You can recover the FASTQ reads from the BAM format
– Delete the aligned reads – trimmed/filtered reads
– Do not delete the RAW reads – zip them and save them
20
21. Variant Call Format
• Once reads are aligned to a reference, you may want to call
mutations and indels to the reference
• Many tools output this data in the VCF (Variant Call Format)
• Text file – tab delimited
21
22. Variant Call Format
22
CHROM: Chromosome Name (i.e. viral genome name)
POS: Position the viral genome
ID: Name of the known variant from DB
REF: The reference allele (i.e. the reference base)
ALT: The alternate allele (i.e. the variant/mutation observed)
QUAL: The quality of the variant on a Phred scale
FILTER: Did the variant pass the tools filters
INFO: Information
DP = Depth
AF = Alternate Allele Frequency
SB = Strand Bias P-value
DP4 = Extra Depth Information
(forward ref; reverse ref; forward non-ref; reverse non-ref)