Talk for first-year PhD students at the CRG. The goal of the talk was to present scenarios that students will likely face and that can compromise reproducibility and efficiency in the analysis of data in the life sciences. Importantly, asking the questions is probably more important than the answers given.
Reproducibility and automation of the machine learning process - Denis Dus
A talk about organizing the machine learning process in practice. Conceptual and technical aspects are discussed, along with an introduction to the Luigi framework and a short story about fitting neural networks at Flo, a top mobile tracker of women's health.
Presented at OECD Workshop on Systematic Reviews in the Scope of the Endocrine Disrupter Testing and Assessment (EDTA) Conceptual Framework Level 1 in Paris, France
Scaling Security Threat Detection with Apache Spark and Databricks - Databricks
Apple must detect a wide variety of security threats, and rises to the challenge using Apache Spark across a diverse pool of telemetry. This talk covers some of the home-grown solutions we’ve built to address complications of scale.
Tom DeMarco states that “You can’t control what you can’t measure”, but how much can we change and control (with) what we measure? This talk investigates the opportunities and limits of data-driven software engineering: it shows which opportunities lie ahead of us when we mine and analyze software engineering process data, but it also highlights important factors that influence the success and adaptability of data-based improvement approaches.
Applying soft computing techniques to corporate mobile security systems - Paloma De Las Cuevas
Corporate workers increasingly use their own devices for work purposes, in a trend that has come to be called the "Bring Your Own Device" (BYOD) philosophy, and companies are starting to include it in their policies. For this reason, corporate security systems need to be redefined and adapted to these emerging behaviours by the corporate Information Technology (IT) department. This work proposes applying soft-computing techniques in order to help the Chief Security Officer (CSO) of a company (in charge of the IT department) improve the security policies.
The actions performed by company workers in a BYOD situation will be treated as events: an action or set of actions yielding a response. Some of those events might cause non-compliance with some corporate policies, making it necessary to define a set of security rules (action, consequence). Furthermore, the processing of the extracted knowledge will allow the rules to be adapted.
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian-based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
Performance Issue? Machine Learning to the rescue! - Maarten Smeets
It can be difficult to determine how to improve the performance of microservices. There are many factors you can vary, but which factor will have the most impact? During this presentation, a method using the random forest machine learning algorithm will be applied to help improve the performance of a microservice running inside a JVM. Several measures are taken, such as throughput and response times. Java version, JVM supplier, heap size, garbage collection algorithm and microservice framework are all varied. Which factor is most important in determining the response time and throughput of the services? The random forest algorithm will be introduced to solve this challenge. Not only will this presentation give some useful suggestions for improving the performance of microservices, but it will also introduce a novel way to take on the challenge of performance tuning which can be applied to other use cases. This presentation is especially interesting to developers and architects.
Determining the root cause of performance issues is a critical task for Operations. In this webinar, we'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
Provenance for Data Munging Environments - Paul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
A talk I gave on what Hadoop does for the data scientist. I talk about data exploration, NLP, classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Code Review Checklist: How far does a code review go? "Metrics measure the design of code after it has been written, a review proves it, and refactoring improves it."
This paper shows a document structure and gives tips for a code review.
Some checks fit with your existing tools and simply raise a hand when the quality or security of your codebase is impaired.
ChatGPT
Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It involves applying various techniques and methods to extract insights from data sets, often with the goal of uncovering patterns, trends, relationships, or making predictions.
Here's an overview of the key steps and techniques involved in data analysis:
Data Collection: The first step in data analysis is gathering relevant data from various sources. This can include structured data from databases, spreadsheets, or surveys, as well as unstructured data such as text documents, social media posts, or sensor readings.
Data Cleaning and Preprocessing: Once the data is collected, it often needs to be cleaned and preprocessed to ensure its quality and suitability for analysis. This involves handling missing values, removing duplicates, addressing inconsistencies, and transforming data into a suitable format for analysis.
Exploratory Data Analysis (EDA): EDA involves examining and understanding the data through summary statistics, visualizations, and statistical techniques. It helps identify patterns, distributions, outliers, and potential relationships between variables. EDA also helps in formulating hypotheses and guiding further analysis.
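As a minimal illustration of the cleaning and exploratory steps just described, here is a sketch using pandas; the file name and column names are hypothetical:

```python
import pandas as pd

# Load a (hypothetical) data set, clean it, and explore it.
df = pd.read_csv("sales.csv")
df = df.drop_duplicates()                    # remove duplicates
df = df.dropna(subset=["revenue"])           # handle missing values
df["date"] = pd.to_datetime(df["date"])      # fix inconsistent types

print(df.describe())                                       # summary statistics
print(df.groupby(df["date"].dt.month)["revenue"].mean())   # monthly trend
```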
Data Modeling and Statistical Analysis: In this step, various statistical techniques and models are applied to the data to gain deeper insights. This can include descriptive statistics, inferential statistics, hypothesis testing, regression analysis, time series analysis, clustering, classification, and more. The choice of techniques depends on the nature of the data and the research questions being addressed.
Data Visualization: Data visualization plays a crucial role in data analysis. It involves creating meaningful and visually appealing representations of data through charts, graphs, plots, and interactive dashboards. Visualizations help in communicating insights effectively and spotting trends or patterns that may be difficult to identify in raw data.
Interpretation and Conclusion: Once the analysis is performed, the findings need to be interpreted in the context of the problem or research objectives. Conclusions are drawn based on the results, and recommendations or insights are provided to stakeholders or decision-makers.
Reporting and Communication: The final step is to present the results and findings of the data analysis in a clear and concise manner. This can be in the form of reports, presentations, or interactive visualizations. Effective communication of the analysis results is crucial for stakeholders to understand and make informed decisions based on the insights gained.
Data analysis is widely used in various fields, including business, finance, marketing, healthcare, social sciences, and more. It plays a crucial role in extracting value from data, supporting evidence-based decision-making, and driving actionable insights.
Good practices (and challenges) for reproducibility
1. Good practices (and challenges) for reproducibility
“Give your samples a decent life”
Javier Quilez
2. Outline
● Make groups of 3 (ideally 2 wet-lab + 1 dry-lab)
● I will present several scenarios/challenges sequentially
● You will have a few minutes to think about how you would tackle them
● I will propose approaches that worked for me
4. The life of your sample
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
5. What is your sample?
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
6. What is your sample?
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
This is NOT enough
7. ● Initial processing of the data
● Quality control
● Downstream analysis
● Reproducibility
● Data sharing and publication
Is all the information needed available?
8. ● What information (aka. metadata) will describe your experiment?
● How will you collect metadata?
● Who will have access to metadata?
● Will metadata be future-proof?
Think
9. Collect the metadata of the experiments systematically
● Do it before processing the data
● Short and easy to complete
● Instantly accessible by authorized members of the team
● Easy to parse for humans and computers (a sketch follows below)
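As one way to implement this, a minimal sketch of a tab-separated sample sheet maintained from Python; the field names and the example values are hypothetical:

```python
import csv
import os

# Hypothetical fields describing one experiment; short and easy to complete.
FIELDS = ["sample_id", "date", "user", "treatment", "time_min", "sequencing_run"]

def append_sample(sheet_path, record):
    """Append one sample's metadata; write the header if the sheet is new.

    A plain TSV is instantly readable by team members, spreadsheets and
    scripts alike, i.e. easy to parse for humans and computers.
    """
    new_file = not os.path.exists(sheet_path)
    with open(sheet_path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        if new_file:
            writer.writeheader()
        writer.writerow(record)

append_sample("samples.tsv", {
    "sample_id": "sample001", "date": "2017-03-01", "user": "user1",
    "treatment": "none", "time_min": "0", "sequencing_run": "run_2017_03_01",
})
```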
11. Experiments will happen over time
[Timeline: Exp. 1 produces Untreated (ctrl.txt) and Treated (t60.txt); later, Exp. 2 produces Treated (T60.txt)]
12. Which is your sample (and other issues)?
[Diagram: Untreated (ctrl.txt), Treated (t60.txt), Treated (T60.txt), with unclear correspondences]
● Which "*60.txt" file corresponds to each treated experiment?
● What "*60" and "ctrl" mean may not be obvious, and interpreting them requires human judgement
● Are both treated samples to be used with the same untreated sample?
● The variable use of lower/upper case complicates computer searches (see the snippet below)
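A quick illustration of the last point, assuming a case-sensitive filesystem and the file names from the example above:

```python
import fnmatch

files = ["ctrl.txt", "t60.txt", "T60.txt"]

# A naive pattern silently misses the inconsistently cased file...
print(fnmatch.filter(files, "t60*"))     # ['t60.txt'] -- T60.txt is missed
# ...and the workaround is fragile and easy to forget.
print(fnmatch.filter(files, "[tT]60*"))  # ['t60.txt', 'T60.txt']
```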
13. ● How will you name your samples?
● Will the name really be unique?
● Will it provide any information about the sample and/or group similar samples?
● Is it future-proof (i.e. consider that more samples will come)?
● What will you label with the sample name (e.g. tubes, files)?
Think
14. Establish a system: each sample a unique identifier
● Simplest way: an auto-incremental identifier (ID) (e.g. sample001, sample002, …)
● More complex options (sample ID based on metadata)
● Whichever you choose…
○ Unique
○ Computer-friendly (fixed length and pattern, all upper or lower case)
○ Anticipate the number of samples that can be reached
● Trace your sample with its ID through its life: from the tube to the files (a sketch follows below)
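A minimal sketch of such an auto-incremental ID scheme; the prefix and width are arbitrary choices:

```python
def next_sample_id(existing_ids, prefix="sample", width=3):
    """Return the next auto-incremental ID, e.g. sample001 -> sample002.

    A fixed prefix, fixed zero-padded width and consistent lower case keep
    the IDs computer-friendly; pick a width that anticipates how many
    samples you may reach (3 digits caps the scheme at 999 samples).
    """
    numbers = [int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)]
    n = max(numbers, default=0) + 1
    if n >= 10 ** width:
        raise ValueError("ID space exhausted; widen the scheme")
    return f"{prefix}{n:0{width}d}"

print(next_sample_id(["sample001", "sample002"]))  # sample003
```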
19. ● How will you organize your raw data?
● How will you organize your processed data?
● How will you organize your analysis results?
● Will human and computer searches be easy?
Think
20. The life of your sample
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
21. The life of your sample
[Diagram: Experiment (wet-lab domain) → (1) Raw data → (2) Processed data → (3) Analysis results]
22. (1) Raw data - 1 directory per instrument run
● Files exactly as produced by the instrument
● Do not store modified, subsetted or merged files
● Quality control of raw files
23. (2) Processed data - 1 directory per sample
● Several subdirectories
○ Steps of the analysis pipeline
○ Logs of the programs used
○ File integrity verifications
● Subdirectories accommodate variations in the analysis pipelines (a sketch follows below)
○ sample1/step1/program_a/sample1.txt
○ sample1/step1/program_b/sample1.txt
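A sketch that creates this layout with Python's standard library; the step and program names are hypothetical:

```python
from pathlib import Path

def make_sample_dirs(root, sample_id, steps):
    """Create one directory per sample, with subdirectories per pipeline
    step and per program, so that variant pipelines can coexist, e.g.
    sample001/step1/program_a/ alongside sample001/step1/program_b/."""
    for step, programs in steps.items():
        for program in programs:
            d = Path(root) / sample_id / step / program
            d.mkdir(parents=True, exist_ok=True)
            (d / "logs").mkdir(exist_ok=True)  # logs of the programs used

make_sample_dirs("processed", "sample001",
                 {"step1": ["program_a", "program_b"], "step2": ["program_c"]})
```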
26. Data analysis is hardly ever a one-time task
[Diagram: Experiment (wet-lab domain) → Data (digital domain, file(s)) → Results (digital domain, +file(s))]
27. Can you process multiple samples seamlessly?
[Diagram: over time, more and more datasets arrive, and each must be turned into results]
28. ● Imagine you write code to process/analyze 1 sample:
○ How will it handle 100 samples?
○ Will 100 samples be processed in a reasonable time?
○ Will you have to manually configure sample-specific parameters?
○ Will you be able to run specific parts of your code? (one approach is sketched below)
Think
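One possible approach, sketched under the assumption that per-sample metadata lives in a tab-separated sample sheet and that run_pipeline stands in for the real pipeline: parameters come from the sheet rather than from manual edits, pipeline steps are selectable, and samples run in parallel:

```python
import csv
from concurrent.futures import ProcessPoolExecutor

def run_pipeline(sample, steps):
    """Placeholder for the real per-sample pipeline: every parameter it
    needs comes from the sample's metadata, not from manual editing."""
    for step in steps:
        print(f"{sample['sample_id']}: running {step}")

def process_all(sheet="samples.tsv", steps=("trim", "align", "qc"), workers=4):
    """Process 1 or 100 samples with the same call, in parallel."""
    with open(sheet) as fh:
        samples = list(csv.DictReader(fh, delimiter="\t"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_pipeline, s, steps) for s in samples]
        for f in futures:
            f.result()  # surface any per-sample failure

if __name__ == "__main__":
    process_all(steps=("align",))  # run only a specific part of the code
```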
42. Data go through many procedures to generate results
[Diagram: over time, each dataset goes through many procedures to produce its results]
43. Can you or anybody else reproduce your results?
[Diagram: results whose generating procedures are unknown, marked with question marks]
Little understanding, irreproducibility, and harder identification of errors
44. ● How will you document your procedures?
● How will you store your code?
● How will others get access to your documentation?
Think
45. Document, document and document
● Write in README files how and when software and accessory files were obtained (e.g. genome reference sequence, annotation)
● Allocate a directory for any task (even one as simple as sharing files)
● Code the core analysis pipeline to log the output of the programs and verify file integrity (a sketch follows below)
● Document procedures using Markdown, Jupyter Notebooks, RStudio or the like
● Specify non-default variable values
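A minimal sketch of the logging and file-integrity points; the command being run is a placeholder:

```python
import hashlib
import subprocess

def run_and_log(cmd, log_path):
    """Run one pipeline command, keeping its full output and the exact
    command line so the step can be rerun and audited later."""
    with open(log_path, "w") as log:
        log.write("command: " + " ".join(cmd) + "\n")
        log.flush()
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)

def md5sum(path, chunk_size=1 << 20):
    """Checksum used to verify that a file is intact after copies/transfers."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Example with a placeholder command and file:
# run_and_log(["echo", "hello"], "step1.log")
# print(md5sum("step1.log"))
```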
47-52. Take home message
● What is your sample? → Collect the metadata of the experiments systematically
● Which is your sample? → Establish a system: each sample a unique identifier
● Where are data and results? → Structured and hierarchical organization of the data
● Can you process multiple samples seamlessly? → Scalability, parallelization, automatic configuration and modularity of the code
● Can you or anybody else reproduce your results? → Document, document and document!
53. In case you forget the take home message…
The human factor is the greatest hurdle for reproducibility
Limit or control human intervention by automating every step of the data analysis as much as possible
It’s not you, it’s the lab culture
55-56. Your involvement in the data analysis is a choice
The data analysis itself is not
[Chart: your autonomy and your dependence on bioinformaticians, each plotted against your involvement in the data analysis]