Golden Rules of Bioinformatics.
Presented as part of a full-day introductory bioinformatics course - the example data and source for the slides can be found at https://github.com/widdowquinn/Teaching-Intro-to-Bioinf
1. An Introduction to Bioinformatics Tools
Part 1: Golden Rules of Bioinformatics
Leighton Pritchard and Peter Cock
2. On Confidence
“Ignorance more frequently begets confidence than does
knowledge: it is those who know little, not those who know much,
who so positively assert...”
- Charles Darwin
4. Zeroeth Golden Rule of Bioinformatics
• No-one knows everything about everything - talk to people!
  • local bioinformaticians, mailing lists, forums, Twitter, etc.
• Keep learning - there are lots of resources
• There is no free lunch - no method works best on all data
• The worst errors are silent - share worries, problems, etc.
• Share expertise (see first item)
6. Subgroups
• You are in group A, B, C or D - this decides your dataset:
  expnA.tab, expnB.tab, expnC.tab, expnD.tab
• You will use R at the command-line to analyse your data
7. The biological question
• Your dataset expn?.tab describes (log) expression data for two genes: gene1 and gene2
• Expression was measured at eleven time points (including control)
• Q: Are gene1 and gene2 coregulated?
• How do we answer this question?
9. Reformulating the biological question
• Q: Are gene1 and gene2 coregulated?
• A: We cannot determine this from expression data alone
• Reformulate the question:
  • NewQ: Is there evidence that gene1 and gene2 expression profiles are correlated?
    (is expression of gene1 ∝ gene2?)
• How do we answer this new question?
10. Starting the analysis
• Change directory to where the Exercise 1 data is located, and start R.
$ cd ../../data/ex1_expression/
$ R
11. Load and inspect data in R
> data = read.table("expnA.tab", sep="\t", header=TRUE)
> head(data)
  gene1 gene2
1    10  8.04
2     8  6.95
3    13  7.58
4     9  8.81
5    11  8.33
6    14  9.96
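One way to approach the reformulated question is to compute a correlation coefficient between the two expression profiles. The following is a minimal Python sketch, not part of the course material: it assumes only the six rows printed by head(data) above (the full file has eleven time points, so the value is illustrative), and the function name pearson_r is our own.

```python
import math

# First six rows printed by head(data) above; the full file has eleven rows.
gene1 = [10, 8, 13, 9, 11, 14]
gene2 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(gene1, gene2)
print(f"Pearson r for the first six time points: {r:.2f}")
```

A single coefficient only summarises the data. It says nothing about outliers, clusters or non-linear trends, so it complements a plot of the raw values rather than replacing one.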
20. First Golden Rule of Bioinformatics
• Always inspect the raw data (trends, outliers, clustering)
• What is the question? Can the data answer it?
• Communicate with data collectors! (don’t be afraid of pedantry)
  • Who? When? How?
• You need to understand the experiment to analyse it (easier if you helped design it).
• Be wary of block effects (experimenter, time, batch, etc.)
22. Exercise 2
• You are in group A, B, C or D - this decides your database:
  dbA, dbB, dbC, dbD
• You will use BLAST at the command-line to analyse your data
• You will use script at the command-line to record your work
23. Exercise 2
• Start recording your actions by entering script at the command line
$ script
Script started, output file is typescript
24. Exercise 2
• Change directory to the ex2_blast directory
• Run BLAST with the appropriate database
• Exit script
$ cd ../ex2_blast
$ blastp -num_alignments 1 -num_descriptions 1 -query query.fasta -db dbA
$ exit
exit
Script done, output file is typescript
25. Exercise 2
• You can view the typescript file with cat
$ cat typescript
Script started on Fri May 9 10:45:12 2014
lpritc@lpmacpro:$ cd ../ex2_blast
[...]
26. Exercise 2
Query= query protein sequence
Length=400
                                                                      Score
Sequences producing significant alignments:                          (Bits)
PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3
> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486
 Score = 34.3 bits (77), Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)
Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95
Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104
28. Exercise 2
• What is a reasonable E-value threshold to call a “match”?
  • 1e-05, 0.001, 0.1, 10?
          dbA    dbB    dbC    dbD
E-value   0.45   0.002  4e-06  0.019
• Five orders of magnitude difference in E-value, depending on database choice - Why?
29. Exercise 2
• E-values depend on database size
• Bit score and alignment do not depend on database size
            dbA         dbB      dbC    dbD
E-value     0.45        0.002    4e-06  0.019
Bit score   34.3        34.3     34.3   34.3
Sequences   100,001     501      1      5,001
Letters     48,650,486  210,866  486    2,066,510
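The dependence on database size can be illustrated with the approximate Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total number of letters in the database. The Python sketch below is an illustration under stated assumptions, not a reimplementation of BLAST: it uses raw lengths, whereas BLAST applies effective-length corrections, so the estimates are not expected to equal the reported E-values. The function name approx_evalue is our own.

```python
# Approximate Karlin-Altschul relation: E = m * n * 2**(-S'),
# where S' is the bit score, m the query length and n the database length.
# Assumption: raw lengths are used; BLAST applies effective-length
# corrections, so reported E-values will differ from these estimates.

QUERY_LENGTH = 400  # "Length=400" in the BLAST report
BIT_SCORE = 34.3    # identical for all four databases

# Total letters in each database, from the table above.
DB_LETTERS = {"dbA": 48_650_486, "dbB": 210_866, "dbC": 486, "dbD": 2_066_510}


def approx_evalue(bit_score, query_length, db_letters):
    """Estimate the number of chance hits expected at this bit score or better."""
    return query_length * db_letters * 2.0 ** (-bit_score)


estimates = {
    name: approx_evalue(BIT_SCORE, QUERY_LENGTH, letters)
    for name, letters in DB_LETTERS.items()
}
for name, evalue in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name}: E ~ {evalue:.1e}")
```

Because the bit score and query length are fixed, the estimate scales linearly with the number of letters searched. That alone reproduces the ordering of the four databases and a spread of about five orders of magnitude between dbC and dbA.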
31. Exercise 2
• E-values differ, but the query matches a choline transporter-like protein quite well...
  • Doesn’t it?
• After all, a biological match is a biological match...
  • Isn’t it?
32. Exercise 2
Query= query protein sequence
Length=400
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value
PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3   4e-06
> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486
 Score = 34.3 bits (77), Expect = 4e-06, Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)
Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95
Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104
36. Exercise 2
• Sequence accessions (PITG_?????T0) are correct in the databases
• Sequence functional descriptions are randomly shuffled: lengths do not match in BLAST output
• dbA contains only three different sequences: two are repeated 50,000 times
• query.fasta is random sequence, not a real protein
  • Shuffled from all P. infestans proteins
  • No nr or Pfam matches
37. Second Golden Rule of Bioinformatics
• Do not trust the software: it is not an authority
  • Software does not distinguish meaningful from meaningless data
  • Software has bugs
  • Algorithms have assumptions, conditions, and applicable domains
  • Some problems are inherently hard, or even insoluble
• You must understand the analysis/algorithm
• Always sanity test
• Test output for robustness to parameter (including data) choice
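One common sanity test is a negative control: a sequence with the same composition as a real query but no biological meaning should not produce convincing hits. The Python sketch below shows one way such a control might be generated. It is an illustration only and is not the procedure used to build query.fasta for the exercise. The helper name shuffled_control and the seed value are our own choices, and the input is the query fragment shown in the BLAST alignment.

```python
import random

# Query fragment copied from the BLAST alignment above (positions 106-165).
FRAGMENT = "EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ"


def shuffled_control(sequence, seed):
    """Return a random permutation of `sequence` with identical composition."""
    rng = random.Random(seed)  # seeded so the control is reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


control = shuffled_control(FRAGMENT, seed=1)
print(control)
```

Running the same search with the control and with the real query, and comparing the scores, gives a rough baseline for what the software reports on meaningless input.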
39. Exercise 3
• Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
• The four cards show E, K, 4 and 7.
• Which cards must be turned over to determine whether this rule holds true?
43. Exercise 3
This is the Wason Selection Task
• If you chose E and 4
  • You are in the typical majority group
  • You are not correct
  • You have been a victim of confirmation bias (System 1 thinking)
• If you chose E and 7
  • Congratulations!
  • Your choice was capable of falsifying the rule.
44. Exercise 3
Rule: If there is a vowel on one side of the card, there must be an even number on the other side.
Card  Outcome    Rule
E     Even       Can be true even if rule false
E     Odd        violated
K     Even       na
K     Odd        na
4     Vowel      Can be true even if rule false
4     Consonant  na
7     Vowel      violated
7     Consonant  na
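The table can be reproduced by brute force: for each visible face, try the possible hidden faces and ask whether any of them would break the rule. The Python sketch below assumes every card has a letter on one side and a number on the other. The function names and the representative hidden faces ("A", "K", "2", "3") are our own choices.

```python
VOWELS = set("AEIOU")


def is_vowel(face):
    return face in VOWELS


def is_even_number(face):
    return face.isdigit() and int(face) % 2 == 0


def violates_rule(letter_face, number_face):
    """The rule is broken only by a vowel paired with an odd number."""
    return is_vowel(letter_face) and not is_even_number(number_face)


def can_falsify(visible):
    """Return True if some hidden face would make this card break the rule."""
    if visible.isdigit():
        # Hidden side is a letter: try one vowel and one consonant.
        return any(violates_rule(hidden, visible) for hidden in ("A", "K"))
    # Hidden side is a number: try one even and one odd value.
    return any(violates_rule(visible, hidden) for hidden in ("2", "3"))


cards = ["E", "K", "4", "7"]
must_turn = [card for card in cards if can_falsify(card)]
print(must_turn)  # ['E', '7']
```

Only cards whose hidden face could violate the rule are informative. Turning over 4 cannot falsify anything, which is why choosing it reflects confirmation bias.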
45. Exercise 3
• This is equivalent to functional classification, e.g.:
  • Rule: If there is a CRN/RxLR/T3SS domain, the protein must be an effector.
46. Exercise 3
• Confirmation Bias (Wason Selection Task)
  • An uninformative experiment is performed
  • http://en.wikipedia.org/wiki/Wason_selection_task
• Affirming the Consequent (a related formal fallacy)
  1. If P, then Q
  2. Q
  3. Therefore, P
  • Experimental results are misinterpreted
  • http://en.wikipedia.org/wiki/Affirming_the_consequent
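That affirming the consequent is invalid can be checked by enumerating every truth assignment for P and Q and looking for a case where both premises hold but the conclusion fails. The Python sketch below does this, and for contrast applies the same check to modus tollens (if P then Q; not Q; therefore not P), which is valid. The helper name implies is our own.

```python
from itertools import product


def implies(p, q):
    """Material implication: 'if p then q' is false only when p is true and q is false."""
    return (not p) or q


# Affirming the consequent: premises (P -> Q) and Q, conclusion P.
counterexamples = [
    (p, q)
    for p, q in product([True, False], repeat=2)
    if implies(p, q) and q and not p
]

# Modus tollens: premises (P -> Q) and not Q, conclusion not P.
modus_tollens_failures = [
    (p, q)
    for p, q in product([True, False], repeat=2)
    if implies(p, q) and (not q) and p
]

print(counterexamples)         # [(False, True)]
print(modus_tollens_failures)  # []
```

In the classification example, P is "the protein has the domain" and Q is "the protein is an effector". The counterexample (P false, Q true) is an effector without the domain: finding effectors does not show that they carry the domain, and it does not test the rule either.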
47. Third Golden Rule of Bioinformatics
• Everyone has expectations of their data/experiment
  • Beware cognitive errors, such as confirmation bias!
  • System 1 vs. System 2 - intuition vs. reason
• Think statistically!
  • Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses
  • Always account for multiple tests.
  • Avoid “data dredging”: intensive computation is not an adequate substitute for expertise
• Use test-driven development of analyses and code
  • Use examples that pass and fail
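As a small illustration of accounting for multiple tests, the Python sketch below uses the Bonferroni correction. This is one simple and conservative choice, not a method the slides prescribe. The scenario of 20,000 tests, the p-values and the function names are all hypothetical. The final assertions follow the test-driven advice: one example that should pass and one that should fail.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def expected_false_positives(alpha, n_true_nulls):
    """Expected number of true null hypotheses rejected at an uncorrected threshold."""
    return alpha * n_true_nulls


def is_significant(p_value, alpha, n_tests):
    """Apply the Bonferroni-corrected threshold to a single p-value."""
    return p_value <= bonferroni_threshold(alpha, n_tests)


# Hypothetical scenario: 20,000 tests, none with a real effect.
# At an uncorrected 0.05 threshold about 1000 would be "hits" by chance alone.
print(expected_false_positives(0.05, 20_000))

# Test-driven checks: one example that should pass and one that should fail.
assert is_significant(1e-7, alpha=0.05, n_tests=20_000)
assert not is_significant(1e-3, alpha=0.05, n_tests=20_000)
```

A p-value of 0.001 looks convincing in isolation but does not survive correction across 20,000 tests, where the per-test threshold is 2.5e-06.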