Keynote talk given at the FAIRDOM user meeting: http://fair-dom.org/communities/users/barcelona-2016-first-user-meeting/
I begin by summarising how we apply molecular approaches to understand social behaviour in ants. Subsequently, I give an overview of the data-handling challenges the genomic bioinformatics community faces. Finally, I give an overview of some of the tools and approaches my lab have developed to help us get things done better, faster, more reliably and more reproducibly.
28. Allozyme screen: social form associated with the Gp-9 locus
[Figure: frequency of the most common allele (axis from 0.3 to 1.0) at each allozyme locus, in single queen and multiple queen colonies. Loci: Est-6, Est-4, G3pdh-1, Ca-4, Pgm-4, Ddh-1, Pro-5, Pgm-3, Acoh-5, acoh-1, Acy-1, Pgm-1, Aat-2, Gp-9.]
Ken Ross and colleagues
Laurent Keller and colleagues
29. Social form completely associated with the Gp-9 locus
Single queen form Multiple queen form
30. Social form completely associated with the Gp-9 locus
Single queen form Multiple queen form
(>15%) (<5%)
bbbbBB BB Bb bb
31. Social form completely associated with the Gp-9 locus
Single queen form Multiple queen form
(>15%) (<5%)
bbBB BB Bb
x
Gp-9 bb females rare
32. Social form completely associated with the Gp-9 locus
Single queen form Multiple queen form
(>15%) (<5%)
BB BB Bb
33. Social form completely associated with the Gp-9 locus
Single queen form Multiple queen form
(>15%) (<5%)
BB BB Bb
x
34. Social form completely associated with the Gp-9 locus
Single queen form Multiple queen form
(>15%) (<5%)
BB BB Bb
x x
35. Social form completely associated with the Gp-9 locus
Single queen form Multiple queen form
(>15%) (<5%)
BB BB Bb
x x x
36. Social form completely associated with the Gp-9 locus
• Is this gene the single überregulator?
37. Social form completely associated with the Gp-9 locus
• Is this gene the single überregulator?
• Only 14 allozyme markers
maybe 1/14th of the genome?
[Figure: allele frequencies at the 14 allozyme loci in single queen and multiple queen colonies, as on slide 28.]
41. Identify polymorphism
RAD genotyping: sequencing the same 0.01% of the genome in many individuals
Individual x locus genotype table (2419 loci; 38 B & 38 b):
     A  B  C  D  E  F
L1   A  C  A  A  C  C
L2   G  G  T  -  T  G
L3   -  A  G  A  -  G
L4   C  -  -  G  G  C
L5   T  T  C  T  C  -
L6   G  A  A  -  -  G
PCA: Principal Component Analysis
Amount of variance explained per principal component (% variance explained):
PC 1: 12.7%
PC 2 to 19: 6.1%, 5.4%, 4.8%, 4.7%, 3.9%, 3.5%, 3.2%, 3.1%, 2.9%, 2.8%, 2.6%, 2.4%, 2.3%, 2.2%, 2.0%, 1.9%, 1.7%, 1.6%
PC 20+: 30.2%
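The variance-per-component summary above can be sketched in R. This is only an illustration on simulated data, not the analysis behind the slide: the matrix size, the 0/1 allele coding, the number of associated loci and all variable names are assumptions made for the example.

```r
# Hypothetical data: 76 individuals x 200 loci, alleles coded 0/1.
set.seed(1)
n_individuals <- 76
n_loci <- 200
group <- rep(c("B", "b"), each = n_individuals / 2)

# Background loci: allele frequency 0.5 regardless of group.
genotypes <- matrix(rbinom(n_individuals * n_loci, size = 1, prob = 0.5),
                    nrow = n_individuals, ncol = n_loci)

# Make the first 30 loci perfectly associated with the group label,
# mimicking markers that sit inside a non-recombining region.
genotypes[, 1:30] <- as.integer(group == "b")

# PCA on centred genotypes; loci without any variation are dropped first.
variable_loci <- apply(genotypes, 2, var) > 0
pca <- prcomp(genotypes[, variable_loci], center = TRUE, scale. = FALSE)

# Percentage of variance explained by each principal component.
variance_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(head(variance_explained, 5), 1))

# Mean position of each group along the first principal component.
pc1_by_group <- tapply(pca$x[, 1], group, mean)
print(pc1_by_group)
```

In this toy setup the first component captures the block of group-associated loci, which is the kind of signal a large first component hints at in real data.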
45. Social form completely associated with the Gp-9 locus
• Is this gene the single überregulator?
• Only 14 allozyme markers
maybe 1/14th of the genome?
✖
✔
Single queen form Multiple queen form
(>15%) (<5%)
BB BB Bb
x xx
[Figure: allele frequencies at the 14 allozyme loci in single queen and multiple queen colonies, as on slide 28.]
46. Sex chromosomes: X, Y
“Social chromosomes” (= supergene): SB (Gp-9 B), Sb (Gp-9 b)
?
1. Why non-recombining?
2. Are SB and Sb differentiated?
3. What are the differences?
47.
48. SBSB
SBSb
Single queen form
Multiple queen form
SBSB SB Sb
Single queen colony Multiple queen colony
Summary: Fire ants have two colony types
Summary: this is determined by a pair of
social chromosomes
49.
50. Research themes
• Biomedical approaches
• International population genomics surveys
• Monitoring via sequencing
• Major social transitions
» social chromosomes
» convergence
» eusociality, queen number, parasitism...
• 100-fold intra-specific variation in lifespan
• Strengths of selection
• Candidate genes/pathway
Pollinator health
Genome evolution Social evolution
Modern bioinformatics tools & approaches
(some at https://wurmlab.github.io )
53. “Can you BLAST this for me?”
BLAST is the most commonly used tool: >100,000 citations
But:
• convoluted interface
• challenging on custom data
Antgenomes.org
SequenceServer: BLAST made easy
54. http://www.sequenceserver.com/
1. Install
gem install sequenceserver
2. Launch
sequenceserver
### Launched SequenceServer at: http://0.0.0.0:4567
• If no config file: asks interactive setup questions.
• If needed: downloads BLAST binaries.
• If needed: formats FASTA into BLAST database.
Demo
Anurag Priyam - @yeban
58. Timewasters
• Client vs server-side code.
• Workflows stalling (data download, cluster queues…)
• Fragmented efforts - having to learn additional languages for specific tools
+ project-specific needs
Bionode
Bruno Vieira @bmpvieira
59. Philosophy for flexibility
Modules should:
• (also) work in the web browser (when possible)
• (also) work in the command-line
• support streaming input/output
http://bionode.io
Bruno Vieira @bmpvieira
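60. Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js everywhere; Streams
Node/Bionode for complex pipelines
// Search NCBI for fire ant sequencing records and fan the results out.
var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
// `dat` is not defined on the slide; it is assumed to be a data store
// set up elsewhere that exposes writable streams `reads` and `samples`.
var fork1 = through.obj()
var fork2 = through.obj() // declared on the slide but not used
// Branch 1: store the SRA search results.
ncbi
  .search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)
// Branch 2: look up the BioSample record of each result.
fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)
// Branch 3: find PubMed entries linked to each SRA record.
fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
@bmpvieira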
63. Philosophy for flexibility
Modules should:
• (also) work in the web browser (when possible)
• (also) work in the command-line
• support streaming input/output
Modules:
• decentralised management.
• small - just do one thing well.
• few strict rules, but some strong recommendations (style, interfaces etc).
Bruno Vieira @bmpvieira
71. Geoffrey Chang: Crystallographer
• Beckman Foundation Young Investigator Award
• Presidential Early Career Award
Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.
Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.
PNAS (2004) Ma & Chang. Structure of the multidrug resistance efflux transporter EmrE from Escherichia coli.
Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.
Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate.
72. Comparison with 3D structure of ortholog
Science (2001) Chang & Roth.
Sav1866: Dawson & Locher (2006) Nature
[Figure from the news article: “Flipping fiasco. The structures of MsbA (purple) and Sav1866 (green) overlap little (left) until MsbA is inverted (right).”]
73. http://wurmlab.github.io
LETTERS, edited by Etta Kavanagh
Retraction
We wish to retract our Research Article “Structure of MsbA from E. coli: A homolog of the multidrug resistance ATP binding cassette (ABC) transporters” and both of our Reports “Structure of the ABC transporter MsbA in complex with ADP•vanadate and lipopolysaccharide” and “X-ray structure of the EmrE multidrug transporter in complex with a substrate” (1–3).
The recently reported structure of Sav1866 (4) indicated that our MsbA structures (1, 2, 5) were incorrect in both the hand of the structure and the topology. Thus, our biological interpretations based on these inverted models for MsbA are invalid.
An in-house data reduction program introduced a change in sign for anomalous differences. This program, which was not part of a conventional data processing package, converted the anomalous pairs (I+ and I-) to (F- and F+), thereby introducing a sign change. As the diffraction data collected for each set of MsbA crystals and for the EmrE crystals were processed with the same program, the structures reported in (1–3, 5, 6) had the wrong hand.
The error in the topology of the original MsbA structure was a consequence of the low resolution of the data as well as breaks in the electron density for the connecting loop regions. Unfortunately, the use of the multicopy refinement procedure still allowed us to obtain reasonable refinement values for the wrong structures.
The Protein Data Bank (PDB) files 1JSQ, 1PF4, and 1Z2R for MsbA and 1S7B and 2F2M for EmrE have been moved to the archive of obsolete PDB entries. The MsbA and EmrE structures will be recalculated from the original data using the proper sign for the anomalous differences, and the new Cα coordinates and structure factors will be deposited.
We very sincerely regret the confusion that these papers have caused and, in particular, subsequent research efforts that were unproductive as a result of our original findings.
GEOFFREY CHANG, CHRISTOPHER B. ROTH, CHRISTOPHER L. REYES, OWEN PORNILLOS, YEN-JU CHEN, ANDY P. CHEN
Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.
References
1. G. Chang, C. B. Roth, Science 293, 1793 (2001).
2. C. L. Reyes, G. Chang, Science 308, 1028 (2005).
3. O. Pornillos, Y.-J. Chen, A. P. Chen, G. Chang, Science 310, 1950 (2005).
4. R. J. Dawson, K. P. Locher, Nature 443, 180 (2006).
5. G. Chang, J. Mol. Biol. 330, 419 (2003).
6. C. Ma, G. Chang, Proc. Natl. Acad. Sci. U.S.A. 101, 2852 (2004).
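The sign change described in the retraction can be illustrated with a toy sketch in R. This is not crystallographic processing and not the program in question: the intensity values, column names and the helper function are hypothetical, chosen only to show how a silent change in column order flips the sign of every difference.

```r
# Toy anomalous pairs (hypothetical values): one row per reflection.
pairs <- data.frame(
  i_plus  = c(105.2, 98.7, 110.4, 87.3),
  i_minus = c(101.0, 99.9, 104.1, 90.8)
)

# Difference between the first and second column.
# Note that it relies on column ORDER, not on column names.
anomalous_difference <- function(x) {
  x[[1]] - x[[2]]
}

correct <- anomalous_difference(pairs)

# A conversion step that silently writes the columns in the other order.
swapped <- pairs[, c("i_minus", "i_plus")]
flipped <- anomalous_difference(swapped)

print(correct)
print(flipped)

# A cheap sanity check on the input layout would reject the swapped table.
has_expected_layout <- function(x) {
  identical(names(x), c("i_plus", "i_minus"))
}
print(has_expected_layout(pairs))
print(has_expected_layout(swapped))
```

Every downstream number stays plausible in magnitude, which is why such an error can survive until an independent comparison exposes it.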
SCIENTIFIC PUBLISHING
A Scientist’s Nightmare: Software Problem Leads to Five Retractions
Until recently, Geoffrey Chang’s career was on a trajectory most young scientists only dream about. In 1999, at the age of 28, the protein crystallographer landed a faculty position at the prestigious Scripps Research Institute in San Diego, California. The next year, in a ceremony at the White House, Chang received a Presidential Early Career Award for Scientists and Engineers, the country’s highest honor for young researchers. His lab generated a stream of high-profile papers detailing the molecular structures of important proteins embedded in cell membranes.
Then the dream turned into a nightmare. In September, Swiss researchers published a paper in Nature that cast serious doubt on a protein structure Chang’s group had described in a 2001 Science paper. When he investigated, Chang was horrified to discover that a homemade data-analysis program […]
[…] 2001 Science paper, which described the structure of a protein called MsbA, isolated from the bacterium Escherichia coli. MsbA belongs to a huge and ancient family of molecules that use energy from adenosine triphosphate to transport molecules across cell membranes. These so-called ABC transporters perform many […]
74. Geoffrey Chang
• Beckman Foundation Young Investigator Award
• Presidential Early Career Award
Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.
Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.
PNAS (2004) Ma & Chang. Structure of the multidrug resistance efflux transporter EmrE from Escherichia coli.
Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.
Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate.
[Clipping of the news article from slide 73: “A Scientist’s Nightmare: Software Problem Leads to Five Retractions” (SCIENTIFIC PUBLISHING).]
77. Genome bioinformatics is hard
Biology is harder than (many) other data sciences
• Understanding/visualising/analysing/massaging big data is hard.
• Biology/life is complex.
• Biologists lack computational training.
• Field is young.
• Analysis tools (generally) suck:
  • badly written
  • badly tested
  • hard to install
  • output quality… often questionable.
• Data sizes keep growing!
• Data formats keep changing :(
80. Community Page
Best Practices for Scientific Computing
Greg Wilson (1), D. A. Aruliah (2), C. Titus Brown (3), Neil P. Chue Hong (4), Matt Davis (5), Richard T. Guy (6), Steven H. D. Haddock (7), Kathryn D. Huff (8), Ian M. Mitchell (9), Mark D. Plumbley (10), Ben Waugh (11), Ethan P. White (12), Paul Wilson (13)
1 Mozilla Foundation, Toronto, Ontario, Canada, 2 University of Ontario Institute of Technology, Oshawa, Ontario, Canada, 3 Michigan State University, East Lansing, Michigan, United States of America, 4 Software Sustainability Institute, Edinburgh, United Kingdom, 5 Space Telescope Science Institute, Baltimore, Maryland, United States of America, 6 University of Toronto, Toronto, Ontario, Canada, 7 Monterey Bay Aquarium Research Institute, Moss Landing, California, United States of America, 8 University of California Berkeley, Berkeley, California, United States of America, 9 University of British Columbia, Vancouver, British Columbia, Canada, 10 Queen Mary University of London, London, United Kingdom, 11 University College London, London, United Kingdom, 12 Utah State University, Logan, Utah, United States of America, 13 University of Wisconsin, Madison, Wisconsin, United States of America
Introduction
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.
Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around developing new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, combining disparate datasets to assess synthetic problems, and other computational tasks.
Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge. As a result, recent studies have found that scientists typically spend 30% or more of their time developing software [1,2]. However, 90% or more of them are primarily self-taught [1,2], and therefore lack exposure to basic software development practices such as writing maintainable code, using version control and issue […]
[…] error from another group’s code was not discovered until after publication [6]. As with bench experiments, not everything must be done to the most exacting standards; however, scientists need to be aware of best practices both to improve their own approaches and for reviewing computational work by others.
This paper describes a set of practices that are easy to adopt and have proven effective in many research settings. Our recommendations are based on several decades of collective experience both building scientific software and teaching computing to scientists [17,18], reports from many other groups [19–25], guidelines for commercial and open source software development [26,27], and on empirical studies of scientific computing [28–31] and software development in general (summarized in [32]). None of these practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can used for focusing on the underlying scientific questions.
Our practices are summarized in Box 1; labels in the main text such as “(1a)” refer to items in that summary. For reasons of space, we do not discuss the equally important (but independent) issues of reproducible research, publication and citation of code and data, and open science. We do believe, however, that all of these will be much easier to implement if scientists have the skills we describe.
Education
A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble (1,2)
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
Introduction
Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.
The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting […]
[…] understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.
This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new […]
[…] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.
Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
Within the data and results directories, it is often tempting to apply a similar logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky because the logical structure of your final […]
83. Write code for humans (not computers!)
• For
  • yourself
  • colleagues / collaborators
  • reviewers
  • other random people who may reuse/improve your code
• Respect conventions (e.g., a style guide)
84. Use whitespace/indentation!
Same information
Damian Conway
85. Line length
Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.
# Too long: the line runs off the edge of the slide (left truncated, as shown).
ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, se
# One argument per line.
ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt',
                               header = TRUE,
                               sep = '\t',
                               col.names = c('colony', 'individual', 'headwidth', 'mass')
                               )
# Alternative: several arguments per line, wrapped.
ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE,
                               sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))
R style guide extract
http://r-pkgs.had.co.nz/style.html
86. R style guide extract
http://r-pkgs.had.co.nz/style.html
88. Write code for humans (not computers!)
• For
  • yourself
  • colleagues / collaborators
  • reviewers
  • other random people who may want to reuse your code
• Respect conventions (e.g., a style guide)
• Don't optimise (generally…)
95. Automatically check consistency with style guide
install.packages("lint") # once
library(lint) # every time
lint("file_to_check.R")
96. Create code tests that are easy to run
• Unit tests == checking edge cases to see if the function works
# do your stuff
# e.g. define speed() function
library(testthat)
expect_that(speed(km = 0, minutes = 60), equals(0))
expect_that(speed(km = 60, minutes = 60), equals(1))
expect_that(speed(km = -4, minutes = 60), throws_error())
expect_that(nrow(significant_SNPs), equals(42))
expect_that(my_model, is_a("lm"))
• Integration tests == "full analysis" but on small data with known results
  • e.g. on fake VCF genotype file of 2 loci (one true positive, one true negative)
• Add sanity checks. E.g. the following should fail rather than return something incorrect.
speed(km = "twenty", minutes = 20)
speed(km = -4, minutes = 60)
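The slide leaves speed() undefined. A minimal sketch of a version that would satisfy those checks is shown here. The implementation and the fails() helper are assumptions made for illustration, and the unit (kilometres per minute) is inferred from the expectation that 60 km in 60 minutes equals 1. Base R is used for the checks so that the example runs without testthat.

```r
# Hypothetical implementation of speed(): kilometres per minute.
speed <- function(km, minutes) {
  # Sanity checks: fail loudly rather than return something incorrect.
  if (!is.numeric(km) || !is.numeric(minutes)) {
    stop("km and minutes must be numeric")
  }
  if (any(km < 0)) {
    stop("km must not be negative")
  }
  if (any(minutes <= 0)) {
    stop("minutes must be greater than zero")
  }
  km / minutes
}

# Helper (hypothetical): TRUE if evaluating the expression raises an error.
fails <- function(expr) {
  inherits(try(expr, silent = TRUE), "try-error")
}

# Unit tests on edge cases.
stopifnot(speed(km = 0, minutes = 60) == 0)
stopifnot(speed(km = 60, minutes = 60) == 1)

# Invalid input must raise an error.
stopifnot(fails(speed(km = "twenty", minutes = 20)))
stopifnot(fails(speed(km = -4, minutes = 60)))
```

Keeping the checks inside the function means every caller benefits from them, including code written long after the tests were last read.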
99. Use tools that reduce risks
• Ensure computers are set up for productivity. E.g.:
  • use GNU parallel on an 80-core machine when more appropriate than submitting to queue
• If you need to make a "pipeline", use software designed for this. E.g.:
  • Snakemake
  • Nextflow
  • (etc)
• too many examples to discuss here
100. knitr/rmarkdown/jupyter
Analysis & report in one.
analysis.Rmd
A minimal R Markdown example
I know the value of pi is 3.1416, and 2 times pi is 6.2832. To c
library(knitr); knit("minimal.Rmd")
A paragraph here. A code chunk below:
1+1
## [1] 2
.4-.7+.3 # what? it is not zero!
## [1] 5.551e-17
Graphics work too
library(ggplot2)
qplot(speed, dist, data = cars) + geom_smooth()
[Plot: scatterplot of dist (axis from 0 to 120) against speed (axis from 5 to 20), with a smoothed trend line.]
Figure 1: A scatterplot of cars
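The `.4-.7+.3` chunk in the report above is a reminder that floating-point results should not be compared with `==`. A short sketch of the usual alternatives follows. The variable names and the chosen tolerance are illustrative assumptions, not part of the original example.

```r
x <- 0.4 - 0.7 + 0.3

# Exact comparison is unreliable because of binary floating-point rounding.
exactly_zero <- (x == 0)
print(exactly_zero)

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a character description when the values differ.
nearly_zero <- isTRUE(all.equal(x, 0))
print(nearly_zero)

# An explicit tolerance makes the intent visible to readers of the analysis.
tolerance <- 1e-9
within_tolerance <- abs(x) < tolerance
print(within_tolerance)
```

The same idea applies inside tests: compare numeric results with a tolerance rather than expecting bit-for-bit equality.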
101.
102. How to get users to adopt good practices?
• Carrot (dual-benefit):
  • Use their motivation to have an easier life.
    "their motivation is the database"
    "they see it, they understand it" -Thomasz? on SEEK
  • Piggyback off that so they do things better ("by stealth" -Carol)
• Stick:
  • When you're reviewing publications/grants
• Politics:
  • Encourage funders / journals to require good practices.
103.
104. Summary
• Ants are cool
• Biology is hard
• We need to handle data better
105. y.wurm@qmul.ac.uk
@yannick__
https://wurmlab.github.io
@ Queen Mary U London
Rodrigo Pracana
Anurag Priyam @yeban
Eckart Stolle
Bruno Vieira @bmpvieira
R Nichols & sbcsEvolve
R Christie & T King / ITSR Apocrita
Laurent Keller lab @ Lausanne
J Wang, D Shoemaker, O Riba-Grognuz, M Nipitwattanaphon
Ioannis Xenarios @ SIB
DeWayne Shoemaker @ USDA
Thanks!