An Introduction to Bioinformatics Tools
Part 1: Golden Rules of Bioinformatics
Leighton Pritchard and Peter Cock
On Confidence
“Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert...”
- Charles Darwin
Table of Contents
Rule 0
Rule 1
Rule 2
Rule 3
Conclusions
Zeroth Golden Rule of Bioinformatics
• No-one knows everything about everything - talk to people!
  • local bioinformaticians, mailing lists, forums, Twitter, etc.
• Keep learning - there are lots of resources
• There is no free lunch - no method works best on all data
• The worst errors are silent - share worries, problems, etc.
• Share expertise (see first item)
Subgroups
• You are in group A, B, C or D - this decides your dataset: expnA.tab, expnB.tab, expnC.tab, expnD.tab
• You will use R at the command-line to analyse your data
The biological question
• Your dataset expn?.tab describes (log) expression data for two genes: gene1 and gene2
• Expression measured at eleven time points (including control)
• Q: Are gene1 and gene2 coregulated?
• How do we answer this question?
Reformulating the biological question
• Q: Are gene1 and gene2 coregulated?
• A: We cannot determine this from expression data alone
• Reformulate the question:
  • NewQ: Is there evidence that gene1 and gene2 expression profiles are correlated? (Is gene1 expression ∝ gene2 expression?)
• How do we answer this new question?
Starting the analysis
• Change directory to where Exercise 1 data is located, and start R.
$ cd ../../data/ex1_expression/
$ R
Load and inspect data in R
> data = read.table("expnA.tab", sep="\t", header=TRUE)
> head(data)
  gene1 gene2
1    10  8.04
2     8  6.95
3    13  7.58
4     9  8.81
5    11  8.33
6    14  9.96
Load and inspect data in R
> mean(data$gene1)
[1] 9
> mean(data$gene2)
[1] 7.500909
> sd(data$gene1)
[1] 3.316625
> sd(data$gene2)
[1] 2.031568
> cor(data)
          gene1     gene2
gene1 1.0000000 0.8164205
gene2 0.8164205 1.0000000
Results
measure       expnA   expnB   expnC   expnD
mean(gene1)   9       9       9       9
mean(gene2)   7.5     7.5     7.5     7.5
sd(gene1)     3.3     3.3     3.3     3.3
sd(gene2)     2.0     2.0     2.0     2.0
cor(data)     0.816   0.816   0.816   0.816
• r = 0.816 (P < 0.005) in every experiment
• Can we conclude that gene1 and gene2 are coexpressed in each experiment?
Plot the data in R
> plot(data)
Always plot the data
Which gene pairs are coexpressed?
Always plot the data
Is the matrix of (Pearson) correlation values potentially misleading?
> data = anscombe
> cor(data)[1:4, 5:8]
           y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
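The comparison across all four datasets can be sketched in one pass. This is a minimal illustration, assuming that expnA to expnD correspond to the four x/y pairs of R's built-in anscombe data set (the slides load anscombe directly, but the pairing with the file names is an assumption). The names idx and summary_table are introduced here for illustration only.

```r
# Summarise each of the four x/y pairs in R's built-in anscombe data set.
idx <- 1:4
summary_table <- sapply(idx, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    r = cor(x, y))
})
colnames(summary_table) <- paste0("pair", idx)
print(round(summary_table, 3))

# The summary statistics are nearly identical; the scatter plots are not.
op <- par(mfrow = c(2, 2))
for (i in idx) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Pair", i))
}
par(op)
```

The table alone gives no hint that the four pairs differ; only the four-panel plot shows which pairs follow a roughly linear trend.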
Sometimes real correlation doesn’t mean anything
First Golden Rule of Bioinformatics
• Always inspect the raw data (trends, outliers, clustering)
• What is the question? Can the data answer it?
• Communicate with data collectors! (don’t be afraid of pedantry)
  • Who? When? How?
• You need to understand the experiment to analyse it (easier if you helped design it).
• Be wary of block effects (experimenter, time, batch, etc.)
Exercise 2
• You are in group A, B, C or D - this decides your database: dbA, dbB, dbC, dbD
• You will use BLAST at the command-line to analyse your data
• You will use script at the command-line to record your work
Exercise 2
• Start recording your actions by entering script at the command line
$ script
Script started, output file is typescript
Exercise 2
• Change directory to the ex2_blast directory
• Run BLAST with the appropriate database
• Exit script
$ cd ../ex2_blast
$ blastp -num_alignments 1 -num_descriptions 1 -query query.fasta -db dbA
$ exit
exit
Script done, output file is typescript
Exercise 2
• You can view the typescript file with cat
$ cat typescript
Script started on Fri May 9 10:45:12 2014
lpritc@lpmacpro:$ cd ../ex2_blast
[...]
Exercise 2
Query= query protein sequence
Length=400
                                                                      Score
Sequences producing significant alignments:                          (Bits)
PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3

> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486

 Score = 34.3 bits (77), Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104
Exercise 2
• What is a reasonable E-value threshold to call a ‘match’?
  • 1e-05, 0.001, 0.1, 10?
          dbA    dbB     dbC     dbD
E-value   0.45   0.002   4e-06   0.019
• Five orders of magnitude difference in E-value, depending on database choice - Why?
Exercise 2
• E-values depend on database size
• Bit score and alignment do not depend on database size
            dbA          dbB       dbC     dbD
E-value     0.45         0.002     4e-06   0.019
Bit score   34.3         34.3      34.3    34.3
Sequences   100,001      501       1       5,001
Letters     48,650,486   210,866   486     2,066,510
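A rough sketch of why the E-value tracks database size, assuming the standard Karlin-Altschul relation E ≈ m · n · 2^(-S'), where m is the query length, n the database size in letters and S' the bit score. BLAST itself uses effective (edge-corrected) lengths, so using the raw lengths from the table is a stated simplification and the estimates will not reproduce the reported E-values exactly. The variable names are illustrative.

```r
# Approximate E ~ m * n * 2^(-S') with raw (not effective) lengths.
query_length <- 400     # "Length=400" in the BLAST report
bit_score    <- 34.3    # identical for every database
db_letters   <- c(dbA = 48650486, dbB = 210866, dbC = 486, dbD = 2066510)

approx_evalue <- query_length * db_letters * 2^(-bit_score)

# E-values reported on the slide, for comparison.
reported_evalue <- c(dbA = 0.45, dbB = 0.002, dbC = 4e-06, dbD = 0.019)

print(signif(approx_evalue, 2))
print(round(approx_evalue / reported_evalue, 1))
```

Under these assumptions the estimates land within a small constant factor of the reported values, and the ratio between any two databases equals the ratio of their letter counts. The same alignment and bit score therefore look "significant" or not purely as a function of what else is in the database.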
Exercise 2
• E-values differ, but the query matches a choline transporter-like protein quite well...
  • Doesn’t it?
• After all, a biological match is a biological match...
  • Isn’t it?
Exercise 2
Query= query protein sequence
Length=400
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value
PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3    4e-06

> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486

 Score = 34.3 bits (77), Expect = 4e-06, Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104
Exercise 2
• Sequence accessions (PITG_?????T0) are correct in the databases
• Sequence functional descriptions are randomly shuffled: lengths do not match in BLAST output
• dbA contains only three different sequences: two are repeated 50,000 times
• query.fasta is random sequence, not a real protein
  • Shuffled from all P. infestans proteins
  • No nr or Pfam matches
Second Golden Rule of Bioinformatics
• Do not trust the software: it is not an authority
  • Software does not distinguish meaningful from meaningless data
  • Software has bugs
  • Algorithms have assumptions, conditions, and applicable domains
  • Some problems are inherently hard, or even insoluble
• You must understand the analysis/algorithm
• Always sanity test
  • Test output for robustness to parameter (including data) choice
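One way to sanity test an analysis is to rerun it on input that is known to be meaningless, in the same spirit as the shuffled query in Exercise 2. The sketch below is an illustration rather than part of the original exercises: it applies that idea to the correlation from Exercise 1, using the x1/y1 columns of R's built-in anscombe data set. The number of shuffles (2000) and the seed are arbitrary choices.

```r
# Negative control for a correlation: compare the observed value with
# correlations obtained after shuffling one variable.
set.seed(42)                      # arbitrary seed, for repeatability only
x <- anscombe$x1
y <- anscombe$y1
observed_r <- cor(x, y)

# Shuffling y destroys any real x-y relationship but keeps its values.
shuffled_r <- replicate(2000, cor(x, sample(y)))

# Fraction of shuffles at least as extreme as the real data.
perm_p <- mean(abs(shuffled_r) >= abs(observed_r))

print(observed_r)
print(summary(shuffled_r))
print(perm_p)
```

If shuffled input routinely scored as well as the real data, the analysis would not be distinguishing meaningful from meaningless input. Note that passing this control still says nothing about the shape of the relationship, which is why plotting remains necessary.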
Exercise 3
• Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
• The four cards show: E, K, 4, 7
• Which cards must be turned over to determine whether this rule holds true?
Exercise 3
This is the Wason Selection Task
• If you chose E and 4
  • You are in the typical majority group
  • You are not correct
  • You have been a victim of confirmation bias (System 1 thinking)
• If you chose E and 7
  • Congratulations!
  • Your choice was capable of falsifying the rule.
Exercise 3
Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
Card   Outcome     Rule
E      Even        Can be true even if rule false
E      Odd         Violated
K      Even        n/a
K      Odd         n/a
4      Vowel       Can be true even if rule false
4      Consonant   n/a
7      Vowel       Violated
7      Consonant   n/a
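The table can be reproduced by brute force: for each visible card, try the possible hidden faces and ask whether any of them would violate the rule. This is a small sketch, assuming every card has a letter on one side and a number on the other; the function names violates_rule and can_falsify are made up for this illustration.

```r
visible_cards <- c("E", "K", "4", "7")
vowels <- c("A", "E", "I", "O", "U")

# A card violates the rule when it pairs a vowel with an odd number.
violates_rule <- function(letter, number) {
  letter %in% vowels && number %% 2 == 1
}

# Could turning this card over ever reveal a violation?
can_falsify <- function(card) {
  if (card %in% LETTERS) {
    # Hidden face is a number: try one even and one odd value.
    any(sapply(c(2, 3), function(n) violates_rule(card, n)))
  } else {
    # Hidden face is a letter: try one vowel and one consonant.
    any(sapply(c("A", "B"), function(l) violates_rule(l, as.integer(card))))
  }
}

must_turn <- visible_cards[sapply(visible_cards, can_falsify)]
print(must_turn)
```

Only cards whose hidden face could break the rule are worth turning. The 4 card is not among them: whatever is on its other side, the rule survives.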
Exercise 3
• This is equivalent to functional classification, e.g.:
  • Rule: If there is a CRN/RxLR/T3SS domain, the protein must be an effector.
Exercise 3
• Confirmation Bias (Wason Selection Task)
  • An uninformative experiment is performed
  • http://en.wikipedia.org/wiki/Wason_selection_task
• Affirming the Consequent (a related formal fallacy)
  1. If P, then Q
  2. Q
  3. Therefore, P
  • Experimental results are misinterpreted
  • http://en.wikipedia.org/wiki/Affirming_the_consequent
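Why affirming the consequent is invalid can be checked with a truth table. The sketch below enumerates every combination of P and Q, keeps the rows where both premises ("if P then Q" and "Q") hold, and asks whether P is forced to be true. Reading P as "has a CRN/RxLR/T3SS domain" and Q as "is an effector" follows the rule on the earlier slide; the variable names are illustrative.

```r
# Enumerate all truth assignments for P and Q.
truth <- expand.grid(P = c(TRUE, FALSE), Q = c(TRUE, FALSE))

# Material implication: "if P then Q" fails only when P is TRUE and Q is FALSE.
truth$P_implies_Q <- !truth$P | truth$Q

# Rows where both premises hold.
premises_hold <- truth[truth$P_implies_Q & truth$Q, ]
print(premises_hold)

# The argument is valid only if P is TRUE in every such row.
conclusion_valid <- all(premises_hold$P)
print(conclusion_valid)
```

Two rows satisfy both premises, and P is false in one of them, so observing Q does not establish P.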
Third Golden Rule of Bioinformatics
• Everyone has expectations of their data/experiment
  • Beware cognitive errors, such as confirmation bias!
  • System 1 vs. System 2 ≈ intuition vs. reason
• Think statistically!
  • Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses
  • Always account for multiple tests.
  • Avoid “data dredging”: intensive computation is not an adequate substitute for expertise
• Use test-driven development of analyses and code
  • Use examples that pass and fail
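The need to account for multiple tests can be shown with simulated data. This is a sketch under stated assumptions, not an analysis from the slides: it draws 10,000 p-values for which the null hypothesis is true by construction (uniform on [0, 1]), counts how many fall below 0.05, and then applies two standard corrections available through R's p.adjust. The test count, threshold and seed are arbitrary.

```r
set.seed(1)                # arbitrary seed, for repeatability only
n_tests <- 10000           # hypothetical number of tests (e.g. genes)
alpha   <- 0.05

# Simulated p-values where no real effect exists anywhere.
p_null <- runif(n_tests)

# Uncorrected: roughly alpha * n_tests "hits" are expected by chance.
n_raw <- sum(p_null < alpha)

# Corrected for multiple testing.
n_bonferroni <- sum(p.adjust(p_null, method = "bonferroni") < alpha)
n_bh         <- sum(p.adjust(p_null, method = "BH") < alpha)

print(c(raw = n_raw, bonferroni = n_bonferroni, BH = n_bh))
```

Every uncorrected "hit" here is a false positive, and there are enough of them to build a plausible-sounding biological story. The corrected counts are never larger than the uncorrected one.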
In Conclusion
• Always communicate!
  • worst errors are silent
• Don’t trust the data
  • formatting/validation/category errors - check!
  • suitability for scientific question
• Don’t trust the software
  • software is not an authority
  • always benchmark, always validate
• Don’t trust yourself
  • beware cognitive errors
  • think statistically
  • biological “stories” can be constructed from nonsense

Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Ā 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
Ā 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
Ā 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
Ā 

Golden Rules of Bioinformatics

  • 1. An Introduction to Bioinformatics Tools Part 1: Golden Rules of Bioinformatics Leighton Pritchard and Peter Cock
  • 2. On Confidence “Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert...” - Charles Darwin
  • 3. Table of Contents Rule 0 Rule 1 Rule 2 Rule 3 Conclusions
  • 4. Zeroth Golden Rule of Bioinformatics
      • No-one knows everything about everything - talk to people!
      • local bioinformaticians, mailing lists, forums, Twitter, etc.
      • Keep learning - there are lots of resources
      • There is no free lunch - no method works best on all data
      • The worst errors are silent - share worries, problems, etc.
      • Share expertise (see first item)
  • 5. Table of Contents Rule 0 Rule 1 Rule 2 Rule 3 Conclusions
  • 6. Subgroups
      • You are in group A, B, C or D - this decides your dataset: expnA.tab, expnB.tab, expnC.tab, expnD.tab
      • You will use R at the command-line to analyse your data
  • 7. The biological question
      • Your dataset expn?.tab describes (log) expression data for two genes: gene1 and gene2
      • Expression measured at eleven time points (including control)
      • Q: Are gene1 and gene2 coregulated?
      • How do we answer this question?
  • 8. Reformulating the biological question
      • Q: Are gene1 and gene2 coregulated?
      • A: We cannot determine this from expression data alone
  • 9. Reformulating the biological question
      • Q: Are gene1 and gene2 coregulated?
      • A: We cannot determine this from expression data alone
      • Reformulate the question:
      • NewQ: Is there evidence that gene1 and gene2 expression profiles are correlated? (is expression of gene1 ∝ gene2?)
      • How do we answer this new question?
  • 10. Starting the analysis
      • Change directory to where Exercise 1 data is located, and start R.
        $ cd ../../data/ex1_expression/
        $ R
  • 11. Load and inspect data in R
        > data = read.table("expnA.tab", sep="\t", header=TRUE)
        > head(data)
          gene1 gene2
        1    10  8.04
        2     8  6.95
        3    13  7.58
        4     9  8.81
        5    11  8.33
        6    14  9.96
  • 12. Load and inspect data in R
        > mean(data$gene1)
        [1] 9
        > mean(data$gene2)
        [1] 7.500909
        > sd(data$gene1)
        [1] 3.316625
        > sd(data$gene2)
        [1] 2.031568
        > cor(data)
                  gene1     gene2
        gene1 1.0000000 0.8164205
        gene2 0.8164205 1.0000000
  • 13. Results
        measure      expnA  expnB  expnC  expnD
        mean(gene1)  9
        mean(gene2)  7.5
        sd(gene1)    3.3
        sd(gene2)    2.0
        cor(data)    0.816
  • 14. Results
        measure      expnA  expnB  expnC  expnD
        mean(gene1)  9      9      9      9
        mean(gene2)  7.5    7.5    7.5    7.5
        sd(gene1)    3.3    3.3    3.3    3.3
        sd(gene2)    2.0    2.0    2.0    2.0
        cor(data)    0.816  0.816  0.816  0.816
  • 15. Results
        measure      expnA  expnB  expnC  expnD
        mean(gene1)  9      9      9      9
        mean(gene2)  7.5    7.5    7.5    7.5
        sd(gene1)    3.3    3.3    3.3    3.3
        sd(gene2)    2.0    2.0    2.0    2.0
        cor(data)    0.816  0.816  0.816  0.816
      • r = 0.816 (P < 0.005) in every experiment
      • Can we conclude that gene1 and gene2 are coexpressed in each experiment?
  • 16. Plot the data in R
        > plot(data)
  • 17. Always plot the data
      • Which gene pairs are coexpressed?
  • 18. Always plot the data
      • Is the matrix of (Pearson) correlation values potentially misleading?
        > data = anscombe
        > cor(data)[1:4, 5:8]
                   y1         y2         y3         y4
        x1  0.8164205  0.8162365  0.8162867 -0.3140467
        x2  0.8164205  0.8162365  0.8162867 -0.3140467
        x3  0.8164205  0.8162365  0.8162867 -0.3140467
        x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
  • 19. Sometimes real correlation doesn’t mean anything
  • 20. First Golden Rule of Bioinformatics
      • Always inspect the raw data (trends, outliers, clustering)
      • What is the question? Can the data answer it?
      • Communicate with data collectors! (don’t be afraid of pedantry)
      • Who? When? How?
      • You need to understand the experiment to analyse it (easier if you helped design it).
      • Be wary of block effects (experimenter, time, batch, etc.)
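The point of Exercise 1 can be reproduced without the course files. The sketch below assumes that expnA to expnD correspond to the four x/y pairs of Anscombe's quartet, which ships with base R as `anscombe`; the values on slides 11, 12 and 18 are consistent with that, but the mapping is an assumption. The helper names `summarise_pair` and `plot_quartet` are illustrative only.

```r
# Anscombe's quartet is a built-in data frame with columns x1..x4 and y1..y4.
data(anscombe)

# Summarise one x/y pair with the same measures used in the exercise.
summarise_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    r = cor(x, y))
}

# One column per dataset; the columns agree when rounded to two decimals.
quartet_summary <- sapply(1:4, summarise_pair)
colnames(quartet_summary) <- paste0("set", 1:4)
print(round(quartet_summary, 2))

# Scatter plots of the four pairs, each with its least-squares line.
plot_quartet <- function() {
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))
  for (i in 1:4) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    plot(x, y, main = paste("Set", i), xlab = "gene1", ylab = "gene2")
    abline(lm(y ~ x), lty = 2)
  }
}

# Only draw when running in an interactive session.
if (interactive()) plot_quartet()
```

The summary table cannot tell the four datasets apart; the plots can, which is why inspecting the raw data comes before any statistic.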
  • 21. Table of Contents Rule 0 Rule 1 Rule 2 Rule 3 Conclusions
  • 22. Exercise 2
      • You are in group A, B, C or D - this decides your database: dbA, dbB, dbC, dbD
      • You will use BLAST at the command-line to analyse your data
      • You will use script at the command-line to record your work
  • 23. Exercise 2
      • Start recording your actions by entering script at the command line
        $ script
        Script started, output file is typescript
  • 24. Exercise 2
      • Change directory to the ex2_blast directory
      • Run BLAST with the appropriate database
      • Exit script
        $ cd ../ex2_blast
        $ blastp -num_alignments 1 -num_descriptions 1 -query query.fasta -db dbA
        $ exit
        exit
        Script done, output file is typescript
  • 25. Exercise 2
      • You can view the typescript file with cat
        $ cat typescript
        Script started on Fri May 9 10:45:12 2014
        lpritc@lpmacpro:$ cd ../ex2_blast
        [...]
  • 26. Exercise 2
        Query= query protein sequence
        Length=400
                                                                              Score
        Sequences producing significant alignments:                          (Bits)
        PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3

        > PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like protein (441 aa)
        Length=486

         Score = 34.3 bits (77), Method: Compositional matrix adjust.
         Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

        Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
                    E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
        Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

        Query  166  IKTKSNSSE  174
                      T SN S+
        Sbjct  96   CHTSSNISQ  104
  • 27. Exercise 2
      • What is a reasonable E-value threshold to call a ’match’?
      • 1e-05, 0.001, 0.1, 10?
                   dbA  dbB  dbC  dbD
        E-value
  • 28. Exercise 2
      • What is a reasonable E-value threshold to call a ’match’?
      • 1e-05, 0.001, 0.1, 10?
                   dbA   dbB    dbC    dbD
        E-value    0.45  0.002  4e-06  0.019
      • Five orders of magnitude difference in E-value, depending on database choice - Why?
  • 29. Exercise 2
      • E-values depend on database size
      • Bit score and alignment do not depend on database size
                   dbA         dbB      dbC    dbD
        E-value    0.45        0.002    4e-06  0.019
        Bit score  34.3        34.3     34.3   34.3
        Sequences  100,001     501      1      5,001
        Letters    48,650,486  210,866  486    2,066,510
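The dependence on database size can be checked with a back-of-envelope calculation. The sketch below uses the standard uncorrected BLAST approximation E ≈ m · n · 2^(−S′), where m is the query length (400, from slide 26), n is the number of letters in the database (from slide 29) and S′ is the bit score. BLAST itself applies effective-length corrections, so this simplified estimate is not expected to match the reported values exactly; the function name `expected_hits` is illustrative.

```r
# Uncorrected estimate of the E-value: E ~ m * n * 2^(-bit_score),
# with m = query length and n = total letters in the database.
expected_hits <- function(bit_score, query_length, db_letters) {
  query_length * db_letters * 2^(-bit_score)
}

# Database sizes and reported E-values, copied from slide 29.
db_letters <- c(dbA = 48650486, dbB = 210866, dbC = 486, dbD = 2066510)
reported   <- c(dbA = 0.45, dbB = 0.002, dbC = 4e-06, dbD = 0.019)

# Same bit score and query in every search; only the database size changes.
estimated <- expected_hits(bit_score = 34.3, query_length = 400,
                           db_letters = db_letters)

print(signif(cbind(estimated, reported, ratio = estimated / reported), 2))
```

The estimates come out roughly a factor of two above the reported E-values for all four databases, which is what one would expect if the only systematic difference is BLAST's length correction. The ranking and the spread of about five orders of magnitude follow from the letter counts alone.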
  • 30. Exercise 2
      • E-values differ, but the query matches a choline transporter-like protein quite well...
      • After all, a biological match is a biological match...
  • 31. Exercise 2
      • E-values differ, but the query matches a choline transporter-like protein quite well...
      • Doesn’t it?
      • After all, a biological match is a biological match...
      • Isn’t it?
  • 32. Exercise 2
        Query= query protein sequence
        Length=400
                                                                              Score     E
        Sequences producing significant alignments:                          (Bits)  Value
        PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3    4e-06

        > PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like protein (441 aa)
        Length=486

         Score = 34.3 bits (77), Expect = 4e-06, Method: Compositional matrix adjust.
         Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

        Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
                    E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
        Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

        Query  166  IKTKSNSSE  174
                      T SN S+
        Sbjct  96   CHTSSNISQ  104
  • 33. Exercise 2
      • Sequence accessions (PITG_?????T0) are correct in the databases
  • 34. Exercise 2
      • Sequence accessions (PITG_?????T0) are correct in the databases
      • Sequence functional descriptions are randomly shuffled: lengths do not match in BLAST output
  • 35. Exercise 2
      • Sequence accessions (PITG_?????T0) are correct in the databases
      • Sequence functional descriptions are randomly shuffled: lengths do not match in BLAST output
      • dbA contains only three different sequences: two are repeated 50,000 times
  • 36. Exercise 2
      • Sequence accessions (PITG_?????T0) are correct in the databases
      • Sequence functional descriptions are randomly shuffled: lengths do not match in BLAST output
      • dbA contains only three different sequences: two are repeated 50,000 times
      • query.fasta is random sequence, not a real protein
      • Shuffled from all P. infestans proteins
      • No nr or PFam matches
  • 37. Second Golden Rule of Bioinformatics
      • Do not trust the software: it is not an authority
      • Software does not distinguish meaningful from meaningless data
      • Software has bugs
      • Algorithms have assumptions, conditions, and applicable domains
      • Some problems are inherently hard, or even insoluble
      • You must understand the analysis/algorithm
      • Always sanity test
      • Test output for robustness to parameter (including data) choice
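One simple sanity test in the spirit of Exercise 2 is to run the same analysis on a shuffled sequence, which keeps length and amino-acid composition but should carry no biological signal. The following is a minimal sketch only: `shuffle_sequence` is an illustrative helper, the peptide is a made-up example rather than exercise data, and it does not reproduce how the authors generated query.fasta.

```r
# Shuffle the residues of a sequence: composition and length are preserved,
# but the residue order (and any signal it carries) is randomised.
shuffle_sequence <- function(sequence) {
  residues <- strsplit(sequence, "")[[1]]
  paste(sample(residues), collapse = "")
}

set.seed(1)  # fixed seed so the shuffle is reproducible

# Hypothetical example peptide, not taken from the exercise files.
real_like <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
control   <- shuffle_sequence(real_like)

# Sanity checks on the control itself: same length, same residue counts.
stopifnot(nchar(control) == nchar(real_like))
stopifnot(identical(sort(strsplit(control, "")[[1]]),
                    sort(strsplit(real_like, "")[[1]])))
```

If a shuffled control scores about as well as the real query, the "match" says more about the search settings than about biology.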
  • 38. Table of Contents Rule 0 Rule 1 Rule 2 Rule 3 Conclusions
  • 39. Exercise 3
      • Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
      • Which cards must be turned over to determine if this rule (if a card shows a vowel on one face, the opposite face is even) holds true?
  • 40. Exercise 3: This is the Wason Selection Task
      • If you chose E and 4
  • 41. Exercise 3: This is the Wason Selection Task
      • If you chose E and 4
      • You are in the typical majority group
      • You are not correct
      • You have been a victim of confirmation bias (System 1 thinking)
  • 42. Exercise 3: This is the Wason Selection Task
      • If you chose E and 4
      • You are in the typical majority group
      • You are not correct
      • You have been a victim of confirmation bias (System 1 thinking)
      • If you chose E and 7
  • 43. Exercise 3: This is the Wason Selection Task
      • If you chose E and 4
      • You are in the typical majority group
      • You are not correct
      • You have been a victim of confirmation bias (System 1 thinking)
      • If you chose E and 7
      • Congratulations!
      • Your choice was capable of falsifying the rule.
  • 44. Exercise 3
      • Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
        Card  Outcome    Rule
        E     Even       Can be true even if rule false
              Odd        violated
        K     Even       na
              Odd        na
        4     Vowel      Can be true even if rule false
              Consonant  na
        7     Vowel      violated
              Consonant  na
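The table above can be derived mechanically: a card is worth turning over only if some possible hidden face would violate the rule. A small sketch of that reasoning follows, assuming the four visible faces are E, K, 4 and 7 as in the table; the function names are illustrative.

```r
# Rule: if one face shows a vowel, the other face must be an even number.
# The only violating card pairs a vowel with an odd number.
is_vowel <- function(face) face %in% c("A", "E", "I", "O", "U")

is_odd_number <- function(face) {
  grepl("^[0-9]+$", face) && as.integer(face) %% 2 == 1
}

# A visible face is informative if a hidden face could complete a violation.
can_falsify <- function(visible) {
  if (is_vowel(visible)) return(TRUE)       # hidden face might be odd
  if (is_odd_number(visible)) return(TRUE)  # hidden face might be a vowel
  FALSE  # consonant or even number: no hidden face can break the rule
}

cards <- c("E", "K", "4", "7")
informative <- cards[sapply(cards, can_falsify)]
print(informative)  # "E" "7"
```

Turning over 4 can only ever agree with the rule, which is why choosing it is a confirmation rather than a test.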
  • 45. Exercise 3
      • This is equivalent to functional classification, e.g.:
      • Rule: If there is a CRN/RxLR/T3SS domain, the protein must be an effector.
  • 46. Exercise 3
      • Confirmation Bias (Wason Selection Task)
      • An uninformative experiment is performed
      • http://en.wikipedia.org/wiki/Wason_selection_task
      • Affirming the Consequent (a related formal fallacy)
        1. If P, then Q
        2. Q
        3. Therefore, P
      • Experimental results are misinterpreted
      • http://en.wikipedia.org/wiki/Affirming_the_consequent
  • 47. Third Golden Rule of Bioinformatics
      • Everyone has expectations of their data/experiment
      • Beware cognitive errors, such as confirmation bias!
      • System 1 vs. System 2 ≈ intuition vs. reason
      • Think statistically!
      • Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses
      • Always account for multiple tests.
      • Avoid “data dredging”: intensive computation is not an adequate substitute for expertise
      • Use test-driven development of analyses and code
      • Use examples that pass and fail
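Why multiple tests must be accounted for is easy to see on simulated data where nothing is going on. The sketch below is an illustration under stated assumptions, not part of the original exercises: it runs 1,000 t-tests on pure noise and compares the number of "significant" results before and after correction with base R's `p.adjust`. The sample sizes and the 0.05 threshold are arbitrary choices.

```r
set.seed(42)  # fixed seed so the simulation is reproducible
n_tests <- 1000

# Every "gene" is pure noise: both groups come from the same distribution,
# so any significant result is a false positive.
p_values <- replicate(n_tests, t.test(rnorm(10), rnorm(10))$p.value)

# Uncorrected: about 5% of null tests are expected to fall below 0.05.
raw_hits <- sum(p_values < 0.05)

# Benjamini-Hochberg controls the false discovery rate;
# Bonferroni controls the family-wise error rate and is stricter.
bh_hits   <- sum(p.adjust(p_values, method = "BH") < 0.05)
bonf_hits <- sum(p.adjust(p_values, method = "bonferroni") < 0.05)

print(c(raw = raw_hits, BH = bh_hits, bonferroni = bonf_hits))
```

A simulation with a known answer like this also serves as an example that should fail: an analysis pipeline that reports many hits here is not to be trusted on real data.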
  • 48. Table of Contents Rule 0 Rule 1 Rule 2 Rule 3 Conclusions
  • 49. In Conclusion
      • Always communicate!
      • worst errors are silent
      • Don’t trust the data
      • formatting/validation/category errors - check!
      • suitability for scientific question
      • Don’t trust the software
      • software is not an authority
      • always benchmark, always validate
      • Don’t trust yourself
      • beware cognitive errors
      • think statistically
      • biological “stories” can be constructed from nonsense