Bioinformatics:
Guide to bio-computing and the Internet
Copyright© Kerstin Wagner
Introduction: What is bioinformatics?
Bioinformatics can be defined as the body of tools and algorithms needed to
handle large and complex biological information.
It is a scientific discipline created from the interaction of biology and
computer science.
The NCBI defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer
science, and information technology merge into a single discipline."
Genomics era: High-throughput DNA sequencing
The first high-throughput genomics
technology was automated DNA sequencing
in the early 1990s.
In 1995, Venter and Hamilton Smith used a
whole-genome shotgun sequencing strategy to
sequence the genomes of Mycoplasma and
Haemophilus.
In September 1999, Celera Genomics
completed the sequencing of the
Drosophila genome.
The 3-billion-bp human genome sequence
was generated in a competition between
the publicly funded Human Genome
Project and Celera
Top image: confocal detection
by the MegaBACE sequencer
of fluorescently labeled DNA
High-throughput DNA sequencing
That was then. How about
now?
Next Generation Sequencing
(2010) vol11:31
Genomics: Completed genomes as of 2010
Genomes sequenced so far:
1598 bacterial / 85 archaeal / 294 eukaryotic genomes
This generates large amounts of information for computers to handle.
The trend of data growth
[Figure: nucleotides in the database (billions, axis 0 to 8) plotted against year, 1980 to 2000]
The 21st century is a century of biotechnology:
Microarray: Global expression analysis: RNA levels of every
gene in the genome analyzed in parallel. (Now largely replaced by RNA-seq.)
Proteomics: Global protein analysis generates large mass
spectra libraries.
Metabolomics: Global metabolite analysis: 25,000 secondary
metabolites characterized.
Genomics: New sequence information is being produced at increasing
rates. (The contents of GenBank double every year.)
Metagenomics
- “Who is there and what are they doing?”
- Cultivation-independent approaches to study the big impact of microbes
How to handle the large amount of information?
Drew Sheneman, New Jersey--The Newark Star Ledger
Answer: bioinformatics and the Internet
Bioinformatics history
IBM 7090 computer
In the 1960s: the birth of bioinformatics
Margaret Oakley Dayhoff created:
The first protein database
The first program for sequence assembly
There is a need for computers and algorithms that allow:
Access, processing, storing, sharing, retrieving, visualizing, annotating…
Why do we need the Internet?
“Omics” projects and the information associated with them involve a huge amount
of data that is stored on computers all over the world.
It is impossible to maintain up-to-date copies of all relevant
databases within the lab, so the data is accessed via the Internet.
[Figure: network diagram with the labels "You are here" and "Database storage"]
The Commercial Market
Current bioinformatics market is worth $300 million / year
(Half software)
Prediction: $2 billion / year in 5-6 years
~50 Bioinformatics companies:
Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode
Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic,
GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools,
Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist,
eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic,
GeneFormatics, Molecular Simulations, Bioinformatics Solutions….
Scope of this lab
The lab will touch on the following computational tasks:
Similarity search
Sequence comparison: Alignment, multiple alignment, retrieval
Sequence analysis: Signal peptide, transmembrane domain,…
Protein folding: secondary structure from sequence
Sequence evolution: phylogenetic trees
The lab will also familiarize you with the bioinformatics resources
available on the web for these tasks.
You have just cloned a gene
Is it already in databases?
-Accession #?
-Annotation?
Are there similar sequences?
-% identity?
-Family member?
Are there conserved regions?
-Alignments?
-Domains?
Protein characteristics?
-Sub-localization
-Soluble?
-3D fold
Evolutionary relationship?
-Phylogenetic tree
Other information?
-Expression profile?
-Mutants?
A critical failure of current bioinformatics is the lack of a single software
package that can perform all of these functions.
Applying algorithms to analyze genomics data
DNA (nucleotide sequences) databases
The primary databases are large, and searching any one of them should produce
similar results because they exchange information routinely.
-GenBank (NCBI): http://www.ncbi.nlm.nih.gov
-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp
-TIGR: http://tigr.org/tdb/tgi
-Yeast: http://yeastgenome.org
-Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi
Specialized databases: Tissues, species…
-ESTs (Expressed Sequence Tags)
~at NCBI http://www.ncbi.nlm.nih.gov/dbEST
~at TIGR http://tigr.org/tdb/tgi
- ...many more!
Protein (amino acid) databases
They are big databases too:
-Swiss-Prot (very high level of annotation)
http://au.expasy.org/
-PIR (Protein Information Resource) the world's most
comprehensive catalog of information on proteins
http://www.pir.uniprot.org/
Translated databases:
-TREMBL (translated EMBL): includes entries that have
not yet been annotated into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html
-GenPept (translation of coding regions in GenBank)
-pdb (sequences derived from the 3D structures in the
Brookhaven PDB) http://www.rcsb.org/pdb/
Database homology searching
Uses algorithms that give searches a mathematical basis, so that a match
can be assigned a statistical significance.
Assumes that sequence, structure, and function are inter-related.
All similarity searching methods rely on the concepts of alignment
and distance between sequences.
A similarity score is calculated from a distance: the number of DNA
bases or amino acids that differ between two aligned sequences.
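The distance described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two sequences are already aligned and of equal length; the function names and example sequences are made up for this sketch and do not come from any search tool.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count positions at which two pre-aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Turn the distance into a simple similarity measure (percent identity)."""
    matches = len(seq_a) - hamming_distance(seq_a, seq_b)
    return 100.0 * matches / len(seq_a)


# Hypothetical example sequences, chosen only for illustration.
print(hamming_distance("GATTACA", "GACTATA"))
print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

Real search tools weight each difference with a scoring matrix instead of counting all mismatches equally.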
Calculating alignment scores
Scoring system: Uses scoring matrices that allow biologists to quantify the
quality of sequence alignments.
The raw score S is calculated by summing the scores for each aligned
position and the scores for gaps. Gap creation/extension penalties are
set by the program in use (BLAST, FASTA…)
The score for an identity or a mismatch is given by the specified substitution
matrix (e.g., BLOSUM62).
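As a rough sketch of how a raw score S is summed, the Python below walks along a gapped alignment and adds a pair score for each aligned position and a penalty for each gap position. The scoring function and the gap penalties are toy values assumed for illustration; they are not BLOSUM62 entries or the defaults of BLAST or FASTA.

```python
def raw_score(aligned_a, aligned_b, score_pair, gap_open=-5, gap_extend=-1):
    """Sum pair scores and gap penalties over a gapped alignment.

    '-' marks a gap. The first position of a gap run costs gap_open and each
    further position costs gap_extend (a simplified affine scheme that does
    not distinguish which sequence carries the gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap = False
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += score_pair(x, y)
            in_gap = False
    return total


def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2 if x == y else -1


print(raw_score("AC--GT", "ACTTGT", toy_score))
```

Swapping `toy_score` for a lookup into a real substitution matrix would give the kind of raw score a search program reports.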
Devising a scoring system
How the matrices were created:
Very similar sequences were aligned.
From these alignments, the frequency of substitution between
each pair of amino acids was calculated and the PAM1 mutation probability matrix was built.
The full series of PAM matrices is obtained by multiplying the PAM1 probability
matrix by itself (for example, PAM250 is PAM1 raised to the 250th power);
each result is then converted to log-odds format for scoring.
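The extrapolation from PAM1 to higher PAM numbers is repeated matrix multiplication of a substitution probability matrix. The sketch below shows the idea on a made-up three-letter alphabet; `TOY_PAM1` and its values are assumptions for illustration, not real PAM data.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1-like matrix: each residue stays unchanged with
# probability 0.99, so about 1 change is accepted per 100 positions.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after 250 steps, but the chance of a residue remaining unchanged falls well below its PAM1 value, which is why high-numbered PAM matrices suit distantly related sequences.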
Some popular scoring matrices are:
PAM (Percent Accepted Mutation): for evolutionary studies.
For example, PAM1 corresponds to 1 accepted point mutation per 100
amino acids.
BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding
common motifs. For example, BLOSUM62 is derived from alignments
of sequences sharing no more than 62% identity.
Devising a scoring system
Importance:
Scoring matrices appear in all analyses
involving sequence comparison.
The choice of matrix can strongly influence
the outcome of the analysis.
Understanding the theory underlying a given
scoring matrix helps in making the proper
choice:
-Some matrices reflect similarity: good for
database searching
-Some reflect distance: good for phylogenies
 Log-odds matrices, a normalisation method for matrix values:
S_ij = log( q_ij / (p_i * p_j) )
S_ij is the log of the ratio between the probability that two residues, i and j,
are aligned by evolutionary descent and the probability that they are aligned by chance.
q_ij is the frequency with which i and j are observed to align in sequences known to
be related. p_i and p_j are their frequencies of occurrence in the set of sequences.
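A log-odds score can be computed directly from the three frequencies defined above. The sketch below assumes a base-2 logarithm and uses made-up frequencies; published matrices also scale and round the values, which is omitted here.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, base: float = 2.0) -> float:
    """Log of observed pair frequency over the frequency expected by chance."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies, for illustration only.
print(log_odds(0.02, 0.1, 0.1))   # pair seen more often than chance: positive
print(log_odds(0.005, 0.1, 0.1))  # pair seen less often than chance: negative
```

A positive score marks a pair that aligns more often in related sequences than chance predicts, and a negative score marks one that aligns less often.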
Database search methods: Sequence Alignment
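Two broad classes of sequence alignments exist:
Global alignment: aligns the sequences over their whole length; not sensitive to short shared regions in divergent sequences
Local alignment: finds only the best-matching regions; the basis of the fast database search tools
Example: the two sequences below share the local match ESG.
QKESGPSSSYC
VQQESGLVRTTC
ESG
ESG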
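The ESG match between the two example sequences can be recovered with a small Smith-Waterman implementation. This is a minimal sketch with a linear gap penalty and toy match/mismatch scores chosen for illustration; real tools use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=1, mismatch=-3, gap=-3):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # local scores, never below zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))
```

The table has one cell per pair of positions, so the work grows with the product of the two sequence lengths. That cost is why the method is slow when run against a whole database.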
The most widely used local similarity algorithms are:
Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/;
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
Which algorithm to use for database similarity search?
Speed:
BLAST > FASTA > Smith-Waterman (It is VERY SLOW and uses a
LOT OF COMPUTER POWER)
Sensitivity/statistics:
FASTA is more sensitive, misses fewer homologues
Smith-Waterman is even more sensitive.
BLAST calculates probabilities
FASTA is more accurate for DNA-DNA searches than BLAST
k-tuple methods provide near-optimal alignments
These methods are heuristics: they are faster and still excellent for comparing sequences.
BLAST and FASTA programs are based on k-tuple algorithms:
1-Using query sequence, derive a list of
words of length w (e.g., 3)
2-Keep high-scoring words using a
scoring matrix(e.g. BLOSUM 62)
3-High-scoring words are compared
with database sequences
4-Sequences with many matches to
high-scoring words are used for final
alignments
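The word-based steps above can be sketched as follows. This simplified version uses exact word matches only and skips step 2 (expanding the word list with a scoring matrix); the function names and the toy database are assumptions for illustration.

```python
def words(seq, w=3):
    """Step 1: list every overlapping word of length w in a sequence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def word_hits(query, database, w=3):
    """Steps 3-4: count query words found in each database sequence.

    Returns (name, hit count) pairs sorted by descending hit count; the
    top entries are the candidates a real tool would pass to full alignment.
    """
    query_words = set(words(query, w))
    hits = {}
    for name, seq in database.items():
        count = sum(1 for word in words(seq, w) if word in query_words)
        if count:
            hits[name] = count
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-sequence database.
toy_db = {"s1": "VQQESGLVRTTC", "s2": "QKESGAAAA", "s3": "WWWWWW"}
print(word_hits("QKESGPSSSYC", toy_db))
```

Only sequences that share at least one word with the query are examined further, which is the source of the speed advantage over a full dynamic-programming search.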
The dilemma: DNA or protein?
Is the comparison of two nucleotide sequences accurate?
By translating into amino acid sequence, are we losing information?
The genetic code is degenerate (Two or more codons can represent
the same amino acid)
Very different DNA sequences may code for similar protein sequences
We certainly do not want to miss those cases!
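A small example makes the effect of degeneracy concrete. The two DNA sequences below were constructed for illustration and use different codons for the same three amino acids; the codon dictionary contains only the six standard-code entries this example needs.

```python
# Subset of the standard genetic code (only the codons used below).
CODONS = {
    "CTG": "L", "TTA": "L",   # leucine
    "AGC": "S", "TCA": "S",   # serine
    "CGT": "R", "AGA": "R",   # arginine
}


def translate(dna):
    """Translate a DNA string codon by codon, reading frame 1."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Count positions where two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


dna_1 = "CTGAGCCGT"
dna_2 = "TTATCAAGA"
print(translate(dna_1), translate(dna_2))
print(identical_positions(dna_1, dna_2), "of", len(dna_1), "bases identical")
```

The two sequences agree at only 2 of 9 bases yet encode the same peptide, so a DNA-level search could miss a relationship that a protein-level search finds easily.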
Search by similarity
Using nucleotide seq. Using amino acid seq.
Tools to search databases
Reasons for translating
Comparing DNA sequences gives more random matches:
[Figure: a good alignment with end-gaps next to a very poor alignment that still shows almost 50% identity]
Conservation of protein in evolution (DNA similarity decays faster!)
Conclusion:
It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
For very highly similar sequences, comparing nucleotide sequences may give better results.
BLAST and FASTA variants
FASTA: Compares a DNA query to DNA database, or a protein query
to protein database
FASTX: Compares a translated DNA query to a protein database
TFASTA: Compares a protein query to a translated DNA database
BLASTN: Compares a DNA query to DNA database.
BLASTP: Compares a protein query to protein database.
BLASTX: Compares the 6-frame translations of DNA query to protein
database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA
database.
TBLASTX: Compares the 6-frame translations of DNA query to the 6-frame
translations of a DNA database (each frame-against-frame comparison is
comparable to a BLASTP search, so it is computationally demanding!)
PSI-BLAST: Performs iterative database searches. The results from each round
are incorporated into a 'position specific' score matrix, which is
used for further searching
A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov
BLAST results
Detailed BLAST results
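E value: the expectation value, i.e. the number of hits scoring at least as well as this one
that you would expect to find by chance in a database of this size. It is not strictly a
probability, although the two are nearly equal when E is small. The lower the E, the more
significant the score.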
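The E value can be related to the raw score S through the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length and n the database length. The sketch below uses placeholder values for λ and K chosen only for illustration; in practice they depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math


def bit_score(raw, lam, k):
    """Normalised score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value(raw, m, n, lam, k):
    """Expected number of chance hits with raw score >= raw."""
    return k * m * n * math.exp(-lam * raw)


# Placeholder parameters, assumed for illustration only.
LAMBDA, K = 0.318, 0.134
query_len, db_len = 250, 1_000_000_000

for raw in (50, 100, 150):
    print(raw, round(bit_score(raw, LAMBDA, K), 1),
          e_value(raw, query_len, db_len, LAMBDA, K))
```

Because E scales with m·n, the same alignment score is less significant in a larger database, which is one reason results change as databases grow.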
Database searching tips
Use latest database version.
Use BLAST first, then a finer tool (FASTA,…)
Search both strands when using FASTA.
Translate sequences where relevant
Search 6-frame translation of DNA database
E < 0.05 is statistically significant, usually biologically
interesting.
If the query has repeated segments, delete them and
repeat search
Most widely used sites for sequence analysis
Sites for alignment of two or more sequences:
T-COFFEE (http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi): more
accurate than ClustalW for sequences with less than 30% identity.
ClustalW (http://www.ch.embnet.org/software/ClustalW.html;
http://align.genome.jp)
bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
MultiALIGN (http://prodes.toulouse.inra.fr/multalin/multalin.html)
Sites for DNA to protein translation:
These tools can translate DNA sequences in any of the three forward or three
reverse reading frames.
Translate (http://au.expasy.org/tools/dna.html)
Translate a DNA sequence: (http://www.vivo.colostate.edu/molkit/translate/index.html)
Transeq (http://www.ebi.ac.uk/emboss/transeq)
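What these translation tools compute can be sketched in Python. The code below builds the standard genetic code table and translates all three forward and three reverse frames; it is a minimal illustration that assumes an unambiguous A/C/G/T sequence, and it is not the implementation used by any of the sites listed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


print(six_frames("ATGGCC"))
```

Ambiguity codes such as N would raise a `KeyError` in this sketch; a production tool would translate them to X or a similar placeholder.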
http://www.mbio.ncsu.edu/bioedit/bioedit.html
BioEdit — a sequence editing software package
Oligo Design and Analysis Tools
http://www.idtdna.com/scitools/scitools.aspx

Bioinformatics_1_ChenS.pptx

  • 1. Bioinformatics: Guide to bio-computing and the Internet Copyright© Kerstin Wagner
  • 2. Introduction: What is bioinformatics? Can be defined as the body of tools, algorithms needed to handle large and complex biological information. Bioinformatics is a new scientific discipline created from the interaction of biology and computer. The NCBI defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline”
  • 3. Genomics era: High-throughput DNA sequencing The first high-throughput genomics technology was automated DNA sequencing in the early 1990. In September 1999, Celera Genomics completed the sequencing of the Drosophila genome. In 1995, Venter and Hamilton used whole- genome shotgun sequencing strategy to sequence the genomes of Mycoplasma and Haemophilus . The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera
  • 4. Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA High-throughput DNA sequencing That was then. How about now?
  • 6.
  • 7. Genomics: Completed genomes as of 2010 Currently the genome of the organisms are sequenced: This generates large amounts of information to be handled by individual computers. 1598 bacterial/85 archaeal/294 eukaryotic genomes
  • 8. The trend of data growth 0 1 2 3 4 5 6 7 8 1980 1985 1990 1995 2000 Years Nucleotides(billion) 21st century is a century of biotechnology: Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel. (OUT!) Replaced by RNA-seq Proteomics:Global protein analysis generates by large mass spectra libraries. Metabolomics:Global metabolite analysis: 25,000 secondary metabolites characterized Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year)
  • 9. Metagenomics - “Who is there and what are they doing?” - Cultivation-independent approaches to study the big impact of microbes
  • 10. How to handle the large amount of information? Drew Sheneman, New Jersey--The Newark Star Ledger Answer: bioinformatics and Internet
  • 11. Bioinformatics history IBM 7090 computer In1960s: the birth of bioinformatics Margaret Oakley Dayhoff created: The first protein database The first program for sequence assembly There is a need for computers and algorithms that allow: Access, processing, storing, sharing, retrieving, visualizing, annotating…
  • 12. Why do we need the Internet? “omics” projects and the information associated with involve a huge amount of data that is stored on computers all over the world. Because it is impossible to maintain up-to-date copies of all relevant databases within the lab. Access to the data is via the internet.
  • 14. The Commercial Market Current bioinformatics market is worth 300 million / year (Half software) Prediction: $2 billion / year in 5-6 years ~50 Bioinformatics companies: Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Bioinformatics Solutions….
  • 15. Scope of this lab The lab will touch on the following computational tasks: Similarity search. Sequence comparison: alignment, multiple alignment, retrieval. Sequence analysis: signal peptide, transmembrane domain,… Protein folding: secondary structure from sequence. Sequence evolution: phylogenetic trees. It will also make you familiar with the bioinformatics resources available on the web to do these tasks.
  • 16. You have just cloned a gene Applying algorithms to analyze genomics data: Is it already in databases? -Accession #? -Annotation? Are there similar sequences? -% identity? -Family member? Are there conserved regions? -Alignments? -Domains? Protein characteristics? -Sub-localization -Soluble? -3D fold Evolutionary relationship? -Phylogenetic tree Other information? -Expression profile? -Mutants? A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions.
  • 17. DNA (nucleotide sequences) databases They are big databases and searching either one should produce similar results because they exchange information routinely. -GenBank (NCBI): http://www.ncbi.nlm.nih.gov -DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp -TIGR: http://tigr.org/tdb/tgi -Yeast: http://yeastgenome.org -Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi Specialized databases:Tissues, species… -ESTs (Expressed Sequence Tags) ~at NCBI http://www.ncbi.nlm.nih.gov/dbEST ~at TIGR http://tigr.org/tdb/tgi - ...many more!
  • 18. Protein (amino acid) databases They are big databases too: -Swiss-Prot (very high level of annotation) http://au.expasy.org/ -PIR (Protein Information Resource), the world's most comprehensive catalog of information on proteins http://www.pir.uniprot.org/ Translated databases: -TREMBL (translated EMBL): includes entries that have not yet been annotated into Swiss-Prot. http://www.ebi.ac.uk/trembl/access.html -GenPept (translation of coding regions in GenBank) -pdb (sequences derived from the 3D structures in the Brookhaven PDB) http://www.rcsb.org/pdb/
  • 19. Database homology searching Homology searches use algorithms that give the search an efficient mathematical basis, so that results can be translated into statistical significance. They assume that sequence, structure, and function are inter-related. All similarity searching methods rely on the concepts of alignment and distance between sequences. A similarity score is calculated from a distance: the number of DNA bases or amino acids that differ between two sequences.
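To make the idea of a distance between two sequences concrete, here is a minimal sketch. It assumes the two sequences are already aligned and of equal length; the function names and the example sequences are illustrative only and do not come from any particular tool.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Express the same comparison as a similarity: percent of identical positions."""
    if not a:
        raise ValueError("sequences must not be empty")
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


# Illustrative sequences (not from the slides)
seq1 = "GATTACA"
seq2 = "GACTATA"
print(hamming_distance(seq1, seq2))            # 2 differing positions
print(round(percent_identity(seq1, seq2), 1))  # 5 of 7 positions identical
```

Real search tools refine this simple count with substitution matrices and gap penalties, since not all differences are equally likely.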
  • 20. Calculating alignment scores Scoring system: Uses scoring matrices that allow biologists to quantify the quality of sequence alignments. The raw score S is calculated by summing the scores for each aligned position and the scores for gaps. Gap creation/extension scores are inherent to the scoring system in use (BLAST, FASTA…) The score for an identity or a mismatch is given by the specified substitution matrix (e.g., BLOSUM62).
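The raw score calculation can be sketched as follows. This is only an illustration under stated assumptions: the match, mismatch, and gap values are toy numbers chosen for the example (not BLOSUM62 and not the defaults of BLAST or FASTA), and a gap of length k is assumed to cost one opening penalty plus k extension penalties.

```python
def raw_score(aln_a, aln_b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Sum per-column scores of a pairwise alignment written with '-' for gaps."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in_a = gap_in_b = False  # are we currently inside a gap in row a / row b?
    for x, y in zip(aln_a, aln_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gap characters")
        if x == "-":
            # opening a new gap costs gap_open once, every gap column costs gap_extend
            score += gap_extend if gap_in_a else gap_open + gap_extend
            gap_in_a, gap_in_b = True, False
        elif y == "-":
            score += gap_extend if gap_in_b else gap_open + gap_extend
            gap_in_a, gap_in_b = False, True
        else:
            score += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return score


# 5 matches (+10), one gap of length 2 (-5 - 1 - 1 = -7): total 3
print(raw_score("GATTACA", "GA--ACA"))
```

With a real substitution matrix, the `match`/`mismatch` lookup would be replaced by the matrix entry for the two aligned residues.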
  • 21. Devising a scoring system How the matrices were created: Very similar sequences were aligned. From these alignments, the frequency of substitution between each pair of amino acids was calculated and then PAM1 was built. After normalizing to log-odds format, the full series of PAM matrices can be calculated by multiplying the PAM1 matrix by itself. Some popular scoring matrices are: PAM (Percent Accepted Mutation): for evolutionary studies. For example in PAM1, 1 accepted point mutation per 100 amino acids is required. BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding common motifs. For example in BLOSUM62, the alignment is created using sequences sharing no more than 62% identity.
  • 22. Devising a scoring system Importance: Scoring matrices appear in all analysis involving sequence comparison. The choice of matrix can strongly influence the outcome of the analysis. Understanding theories underlying a given scoring matrix can aid in making proper choice: -Some matrices reflect similarity: good for database searching -Some reflect distance: good for phylogenies  Log-odds matrices, a normalisation method for matrix values: S is the probability that two residues, i and j, are aligned by evolutionary descent and by chance. qij are the frequencies that i and j are observed to align in sequences known to be related. pi and pj are their frequencies of occurrence in the set of sequences.
  • 23. Database search methods: Sequence Alignment Two broad classes of sequence alignments exist: Global alignment: aligns the sequences over their entire length; not sensitive when only short regions are shared. Local alignment: finds the best-matching subregions; faster. Example: QKESGPSSSYC and VQQESGLVRTTC share the local match ESG / ESG. The most widely used local similarity algorithms are: Smith-Waterman (http://www.ebi.ac.uk/MPsrch/) Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov) Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
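A local alignment of the two example sequences on this slide can be sketched with a small Smith-Waterman implementation. This is a simplified version under stated assumptions: a toy scoring scheme (match +2, mismatch -3, linear gap penalty -3) instead of a substitution matrix with affine gaps, and no optimisation for speed or memory.

```python
def smith_waterman(a, b, match=2, mismatch=-3, gap=-3):
    """Return (aligned_a, aligned_b, score) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0 is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best


print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))  # ('ESG', 'ESG', 6)
```

The algorithm fills the whole matrix, so its cost grows with the product of the two sequence lengths; this is why it is slow for database-scale searches.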
  • 24. Which algorithm to use for database similarity search? Speed: BLAST > FASTA > Smith-Waterman (Smith-Waterman is VERY SLOW and uses a LOT OF COMPUTER POWER). Sensitivity/statistics: FASTA is more sensitive than BLAST and misses fewer homologues; Smith-Waterman is even more sensitive. BLAST calculates probabilities. FASTA is more accurate than BLAST for DNA-DNA searches.
  • 25. k-tuple methods provide fast heuristic alignments These methods are faster and excellent in comparing sequences, although as heuristics they do not guarantee the optimal alignment. BLAST and FASTA programs are based on k-tuple algorithms: 1-Using the query sequence, derive a list of words of length w (e.g., 3) 2-Keep high-scoring words using a scoring matrix (e.g., BLOSUM62) 3-High-scoring words are compared with database sequences 4-Sequences with many matches to high-scoring words are used for final alignments
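The word-based (k-tuple) search steps can be sketched as below. This is a deliberately simplified illustration, not the BLAST or FASTA implementation: it keeps only exact word matches (skipping the step that scores neighbouring words with a substitution matrix), and the tiny database with its sequence names is hypothetical.

```python
def words(seq, w):
    """List all overlapping words of length w in a sequence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def rank_by_word_hits(query, database, w=3):
    """Rank database sequences by how many of their words also occur in the query."""
    query_words = set(words(query, w))
    hits = {}
    for name, seq in database.items():
        count = sum(1 for word in words(seq, w) if word in query_words)
        if count:
            hits[name] = count
    # Sequences with the most word hits are the candidates for a full alignment
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database
DATABASE = {
    "seqA": "VQQESGLVRTTC",
    "seqB": "MKESGPSAAYC",
    "seqC": "AAAAAAAA",
}

print(rank_by_word_hits("QKESGPSSSYC", DATABASE))  # [('seqB', 4), ('seqA', 1)]
```

Only the top-ranked sequences would then be passed to a slower, more careful alignment, which is where the speed advantage over a full Smith-Waterman scan comes from.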
  • 26. The dilemma: DNA or protein? Tools that search databases by similarity can use either the nucleotide sequence or the amino acid sequence. Is the comparison of two nucleotide sequences accurate? By translating into amino acid sequence, are we losing information? The genetic code is degenerate (two or more codons can represent the same amino acid), so very different DNA sequences may code for similar protein sequences. We certainly do not want to miss those cases!
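The effect of the degenerate genetic code can be shown with a short sketch. The two DNA sequences below are invented for the example; the codon table is the standard genetic code, built from the usual compact 64-letter string with bases in the order T, C, A, G.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA sequence in frame 1, ignoring an incomplete final codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Count positions with the same letter in two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b))


# Two invented DNA sequences that use different codons for Leu, Ser and Arg
dna1 = "TTATCTCGT"
dna2 = "CTGAGCAGA"

print(translate(dna1), translate(dna2))       # LSR LSR
print(identical_positions(dna1, dna2), "of", len(dna1), "bases identical")  # 2 of 9
```

At the DNA level the two sequences look almost unrelated, while the encoded peptides are identical, which is the case a nucleotide-only search would miss.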
  • 27. Reasons for translating Comparing DNA sequences gives more random matches: a very poor DNA alignment can still show almost 50% identity (figure: a good alignment with end-gaps versus a very poor alignment). Proteins are conserved in evolution (DNA similarity decays faster!). Conclusion: It is almost always better to compare coding sequences in their amino acid form, especially if they are very divergent. For very highly similar sequences, comparing nucleotide sequences may give better results.
  • 28. BLAST and FASTA variants FASTA: Compares a DNA query to a DNA database, or a protein query to a protein database. FASTX: Compares a translated DNA query to a protein database. TFASTA: Compares a protein query to a translated DNA database. BLASTN: Compares a DNA query to a DNA database. BLASTP: Compares a protein query to a protein database. BLASTX: Compares the 6-frame translations of a DNA query to a protein database. TBLASTN: Compares a protein query to the 6-frame translations of a DNA database. TBLASTX: Compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each translated comparison is comparable to a BLASTP search!). PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching.
  • 29. A practical example of sequence alignment http://www.ncbi.nlm.nih.gov BLAST results
  • 30. Detailed BLAST results E value: the expectation value, i.e. the number of hits with a score at least as good as this one that you would expect to find by chance in a database of this size. The lower the E, the more significant the score.
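How an E value relates to a score and to a probability can be sketched with the standard bit-score relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is a simplification: BLAST itself uses corrected "effective" lengths, and the numbers in the example are hypothetical.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least bit_score (E = m * n * 2**-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of at least one such chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: 250-residue query against a database of 1e9 residues
for score in (30, 40, 50):
    e = expect_value(score, 250, 10**9)
    print(f"bit score {score}: E = {e:.3g}, P = {p_value(e):.3g}")
```

For small values E and P are nearly equal, which is why the two are often confused, but E can exceed 1 while a probability cannot.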
  • 31. Database searching tips Use latest database version. Use BLAST first, then a finer tool (FASTA,…) Search both strands when using FASTA. Translate sequences where relevant Search 6-frame translation of DNA database E < 0.05 is statistically significant, usually biologically interesting. If the query has repeated segments, delete them and repeat search
  • 32. Most widely used sites for sequence analysis Sites for alignment of 2 sequences: T-COFFEE (http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi): more accurate than ClustalW for sequences with less than 30% identity. ClustalW (http://www.ch.embnet.org/software/ClustalW.html; http://align.genome.jp) bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi) LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html) MultiALIGN (http://prodes.toulouse.inra.fr/multalin/multalin.html) Sites for DNA to protein translation: These tools can translate DNA sequences in any of the 3 forward or 3 reverse reading frames. Translate (http://au.expasy.org/tools/dna.html) Translate a DNA sequence: (http://www.vivo.colostate.edu/molkit/translate/index.html) Transeq (http://www.ebi.ac.uk/emboss/transeq)
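What the translation sites do can be sketched in a few lines: translate the sequence in the three forward frames and in the three frames of the reverse complement. This is a minimal illustration assuming the standard genetic code and an input containing only A, C, G and T; the frame labels (+1 to -3) and the example sequence are choices made for this sketch, not the output format of any of the listed tools.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate from the first base, ignoring an incomplete final codon ('*' = stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Complement every base and reverse the sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return the translations of the 3 forward and 3 reverse reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Tools such as BLASTX and TBLASTN apply the same six-frame idea to a query or to every database entry before comparing at the protein level.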
  • 34. Oligo Design and Analysis Tools http://www.idtdna.com/scitools/scitools.aspx