Managing the analysis of high-throughput data
It’s not so much about the tools, it’s the attitude
Javier Quilez (1,2), Enrique Vidal (1,2), François Le Dily (1,2), François Serra (1,2,3), Yasmina Cuartero (1,2,3), Ralph Stadhouders (1,2), Thomas Graf (1,2), Marc A. Marti-Renom (1,2,3,4), Miguel Beato (1,2) and Guillaume Filion (1,2)
(1) Gene Regulation, Stem Cells and Cancer Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology (BIST), Dr. Aiguader 88, 08003 Barcelona, Spain
(2) Universitat Pompeu Fabra (UPF), Barcelona, Spain
(3) CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology (BIST), Dr. Aiguader 88, 08003 Barcelona, Spain
(4) ICREA, Pg. Lluis Companys 23, 08010 Barcelona, Spain
• High-throughput sequencing (HTS) experiments are pervasive in the life sciences; from small research groups to large-scale projects, HTS data accumulates at a rapid pace
• The human factor is the greatest hurdle to (i) analysing HTS data efficiently and (ii) meeting the FAIR (Findable, Accessible, Interoperable and Reusable) Principles
• To overcome these limitations we propose that: (i) crucial questions be addressed at an early stage of the project; (ii) scientific groups develop habits and tools for sharing data and analyses; and (iii) data-producing teams focus on Documentation, Automation, Traceability and Autonomy
• Interested but don’t have the time or energy to keep reading? Check out our parable “Parallel sequencing lives, or what makes large sequencing projects successful”
What, when, how and who will have access to the sample metadata?
Systematically collect the metadata of the experiments
• Sequencing reads are not the only information derived from an HTS experiment

• Metadata provide information about HTS experiments that is required for analysing and sharing the data and for reproducing the results

• Very often, however, metadata are scattered, inaccurate, insufficient or even missing (especially for older samples)

• Collect the metadata systematically, before data processing starts (Fig. 1a and Box 1)
Can samples be identified unambiguously?
Establish a system: give each sample a unique identifier (ID)
• Samples are often given names that are easy to remember for the person who performed the experiment

• This leads to sample swaps, because similar names can refer to different experiments; unsystematic naming also prevents accessing samples programmatically, which invites errors and undermines the ability to automate the analysis

• Establish a scheme to uniquely identify samples and the associated (meta)data (Box 2)
Where are data and results?
Structured and hierarchical organisation of the data
• Data and results derived from HTS experiments are typically stored in an untidy manner

• Organise data in a structured and hierarchical manner reflecting the way data are generated and analysed: (1)
raw data, (2) processed data and (3) analysis results (Fig. 1b)
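
A structured layout pays off when paths can be derived from a sample ID instead of being typed by hand. The sketch below is a minimal illustration in Python, assuming the three directory patterns shown in Fig. 1b; the function names are hypothetical, and placing all three stages under a single root directory is an assumption made here for brevity.

```python
from pathlib import Path


def raw_fastq_path(root: Path, run_date: str, sample_id: str, read: int) -> Path:
    """(1) Raw data: runs/<run date>/<sample ID>_read<N>.fastq.gz"""
    return root / "runs" / run_date / f"{sample_id}_read{read}.fastq.gz"


def alignments_dir(root: Path, sample_id: str, assembly: str) -> Path:
    """(2) Processed data: <sample ID>/alignments/<genome assembly>/"""
    return root / sample_id / "alignments" / assembly


def analysis_dirs(root: Path, project: str, date: str, name: str) -> list[Path]:
    """(3) Analysis results: projects/<project>/<date>_<name>/ with its subdirectories."""
    base = root / "projects" / project / f"{date}_{name}"
    return [base / sub for sub in ("data", "figures", "tables", "scripts")]


if __name__ == "__main__":
    root = Path("hts")  # hypothetical root directory
    print(raw_fastq_path(root, "2017-10-09", "sample001", 1))
    print(alignments_dir(root, "sample001", "hg38"))
    for directory in analysis_dirs(root, "project1", "2017-10-09", "diff_expression"):
        print(directory)
```

Because every path is a function of the sample ID and a few metadata values, any script can locate the files of any sample without a lookup table.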
Can multiple samples be seamlessly processed?
Scalability, parallelisation, automatic configuration and modularity of the code
• Data analysis is rarely a one-time task: (i) samples are sequenced at different time points (Fig. 1b), so core analysis pipelines have to be executed for every new sequencing batch; (ii) samples need to be re-processed when analysis pipelines are modified substantially; and (iii) downstream analyses are often repeated with different datasets or variables

• Automate the data processing as much as possible (Box 3 and Fig. 2a)
Does anybody have the information to reproduce the results?
Documentation, documentation and documentation
• Undocumented results lead to a poor understanding of the analysis and to irreproducibility, and make errors harder to identify

• Document all the parts involved in the analysis (from the raw data to the results) (Box 4)
Can anybody make use of the data generated?
Empower experimenters to perform basic analysis via web applications
• Analysis workflows generate many files, which may not be accessible to users (too big to open or too difficult to manipulate) (Box 5)

• Implement interactive web applications to display the processed data and to perform specific analyses in a
user-friendly manner (Fig. 2b)

• Building interfaces for standard analyses frees bioinformaticians to focus on the most technical parts of the
project, while allowing all the members to contribute to the analyses (Box 5)

• The features of such web applications must be discussed with their potential users, because implementing
them requires effort and time
References
1. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18

2. GigaScience. 2017;gix100. doi:10.1093/gigascience/gix100

3. https://daringfireball.net/projects/markdown/

4. http://jupyter.org/

5. https://www.rstudio.com/

6. https://shiny.rstudio.com/
Funding
We received funding from the European Research Council under the European Union's Seventh Framework
Programme (FP7/2007-2013)/ERC Synergy grant agreement 609989 (4DGenome). The content of this poster
reflects only the author’s views and the Union is not liable for any use that may be made of the information
contained therein. We acknowledge support of the Spanish Ministry of Economy and Competitiveness,
‘Centro de Excelencia Severo Ochoa 2013-2017’ and Plan Nacional (SAF2016-75006-P), as well as support
of the CERCA Programme / Generalitat de Catalunya. Ralph Stadhouders was supported by an EMBO
Long-term Fellowship (ALTF 1201-2014) and a Marie Curie Individual Fellowship (H2020-MSCA-IF-2014).
@jaquol
@4DGenome
javier.quilez@crg.eu
Box 1. Features of a good metadata collection system
• Easy to parse for humans & computers
• Responsible for maintenance & metadata validation
• Future-aware, flexible
• Agreed & understood by people using it
Box 2. Unique IDs: connecting tubes, metadata & data
Sample ID
Metadata categories: biological, technical, logistics
Metadata fields: application, user, experiment, cell type, treatment, target protein, facility, run date, read length, species, SE/PE
@HWI-D00733:72:C8E09ANXX:5:1101:1211:2429 1:N:0:ACAGTG
CTACCACCAAACTTAGAACGGTCATTATGTTACTCTAAGATAATAGAATA
+
AABB=FDGGGGGCGGEC1CCGEC/C1=<CFFGEFF1=CFG1>F>1FG1<1
@HWI-D00733:72:C8E09ANXX:5:1101:1284:2358 1:N:0:ACAGTG
AGGATATATTTGTTAAAAATACAACAAAAACCCCTAGTATTTGTGAGCAA
+
ABBB0EFFGGFGFGEFGGGCFGGGGGGGGGGGGG<FCCFF1<BCB11=EF
@HWI-D00733:72:C8E09ANXX:5:1101:1413:2386 1:N:0:ACAGTG
GGCTCCTCTCGGTTCTTCCGAGCCAGCTCGTCATATTGGGCCCGGATGTC
+
BCCBBEGDFGFGCBGGEGGFBCB/B0:DDF>FGGE1@CG@DFAEGGBE:=
@HWI-D00733:72:C8E09ANXX:5:1101:1319:2485 1:N:0:ACAGTG
GCTTAGTCTTATTGCTCAGGAGACCGGAGGCCTGGGTTGCTACAGTGCAG
+
A3<AA1EE@1;C1>>>>C=1;EF=G/<E/>BCFGG0FDGB1BFG1EEFF1
@HWI-D00733:72:C8E09ANXX:5:1101:1565:2381 1:N:0:ACAGTG
GGCCAACCACAAGACGATAAAGGGAAACAGGGCGTGGGGATTTCCAGTTT
Data
(Sequencing reads, FASTQ)
Metadata
Computer-friendly sample IDs:
• fixed length & pattern
• all lower or upper case
• anticipate max. # of samples
Examples
• #1: simple auto-incremental (sample001, sample002, …)
• #2: hash function applied to metadata (b1913e6c1_51720e9cf)
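
Both ID schemes in Box 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scheme used in the project: the function names are hypothetical, SHA-256 is an arbitrary choice of hash function, and treating the two halves of the hashed ID as digests of two separate groups of metadata fields is only a guess based on the shape of the example ID.

```python
import hashlib


def incremental_id(n: int, width: int = 3, prefix: str = "sample") -> str:
    """Scheme #1: fixed-length auto-incremental ID, e.g. sample001."""
    # The width fixes the maximum number of samples, so choose it up front
    if n < 1 or n >= 10 ** width:
        raise ValueError(f"sample number {n} does not fit in {width} digits")
    return f"{prefix}{n:0{width}d}"


def hashed_id(group_a: dict[str, str], group_b: dict[str, str], length: int = 9) -> str:
    """Scheme #2: ID built from truncated hashes of two groups of metadata fields."""

    def digest(fields: dict[str, str]) -> str:
        # Sort the keys so the order in which fields were entered does not change the ID
        text = ";".join(f"{key}={fields[key]}" for key in sorted(fields))
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]

    return f"{digest(group_a)}_{digest(group_b)}"


if __name__ == "__main__":
    print(incremental_id(1))
    print(hashed_id({"cell_type": "T47D", "treatment": "Progesterone"},
                    {"species": "Homo sapiens", "read_length": "50"}))
```

Auto-incremental IDs are easier to read, while hashed IDs can be recomputed from the metadata alone; truncating the digest keeps IDs short at the cost of a small collision risk that grows with the number of samples.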
Fig. 2. Automating data analysis and visualisation
(a) Scalability is achieved by having a submission script (‘*.submit.sh’) that generates as many
pipeline scripts as samples listed in the configuration file (‘*.config’), so that a pipeline is executed
simultaneously for multiple samples with a single command (gray rectangle). The configuration file
also contains the hard-coded parameters shared by all samples (e.g. number of processors or
genome assembly version). Parallelisation is obtained by (i) submitting each sample pipeline script as
an independent job in the computing cluster, if there is one, where it will be queued (orange) and
eventually executed (green), and (ii) adapting the pipeline code in ‘*seq.sh’ to be suitable for running
in multiple processors. Each pipeline script is automatically configured by retrieving the pipeline
variable values (e.g. species, read length) from the metadata SQL database; in addition, selected
metadata generated by the pipeline (e.g. running time, number of aligned reads) are recorded into the
database. For further flexibility, the pipeline code is grouped into modules that can be executed all
sequentially or individually by specifying it in the configuration file. (b) We take advantage of our
structured and hierarchical data organisation as well as the available metadata to deploy a web
application to visualise processed data using Shiny [6].
Fig. 2 diagram labels:
(a) ‘*.config’ (samples & parameters); ‘*submit.sh’; ‘*seq.sh’ (pipeline code: [full], [module 1], [module 2], [module 3]); SQL database; one ‘*.sh’ pipeline script per sample (sample A1, sample A2, …, sample N); command: > *submit.sh *.config
(b) ‘app.R’; Shiny server; processed data (sample001, sample002, sample003, …; sample004, sample005, sample006, …)
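
The submission step described in Fig. 2a can be sketched as follows. The poster implements it as a shell script; Python is used here only for illustration. Everything concrete is an assumption: the ‘key = value’ configuration format, the key names (‘samples’, ‘processors’, ‘assembly’, ‘modules’), the module names, the script name ‘pipeline_seq.sh’ and the use of environment variables to pass values to it. Submitting the generated scripts to a cluster is left out because it depends on the local scheduler.

```python
import shlex
from pathlib import Path

MODULES = ["module1", "module2", "module3"]  # placeholders for [module 1..3]


def parse_config(text):
    """Parse 'key = value1 value2 ...' lines, skipping blanks and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            key, _, value = line.partition("=")
            config[key.strip()] = value.split()
    return config


def selected_modules(config):
    """'full' (the default) runs every module in order; otherwise only those listed."""
    requested = config.get("modules", ["full"])
    if requested == ["full"]:
        return list(MODULES)
    unknown = [m for m in requested if m not in MODULES]
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    return requested


def write_pipeline_scripts(config, metadata, outdir, pipeline="pipeline_seq.sh"):
    """Write one shell script per sample listed in the configuration."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # Parameters shared by all samples come from the configuration file
    shared = {"N_PROCESSORS": config["processors"][0], "ASSEMBLY": config["assembly"][0]}
    scripts = []
    for sample in config["samples"]:
        # Per-sample values come from the metadata, not from the user
        env = {"SAMPLE_ID": sample, **shared, **metadata[sample]}
        exports = [f"export {k}={shlex.quote(str(v))}" for k, v in env.items()]
        calls = [f"bash {shlex.quote(pipeline)} {m}" for m in selected_modules(config)]
        script = outdir / f"{sample}.sh"
        script.write_text("\n".join(["#!/bin/bash", "set -euo pipefail", *exports, *calls]) + "\n")
        scripts.append(script)
    return scripts
```

Here `metadata` is a plain dictionary keyed by sample ID, standing in for a query to the metadata database. A sample that is listed in the configuration but has no metadata raises a `KeyError`, which stops the run before any job is started.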
Fig. 1. Framework for the management of HTS data
(a) Metadata collection. In our projects, metadata are collected via an online Google Form and stored
both online (Google Sheet) and in a local SQL database. We design forms to be short and easy to
complete, and Google Sheets provide instant access to the metadata by authorised users. The SQL
database works both as a backup and as the source for retrieving metadata programmatically. (b)
The stages of HTS data. In general, experiments are sequenced in different multi-sample runs
separated in time. HTS data are usually analysed in two steps. First, raw data are processed sample-
wise with standard but tunable core analysis pipelines which generate a variety of files. Second,
processed data from one or more samples are combined to perform downstream analyses.
Fig. 1 diagram labels:
(a) Google Form (fields: PROJECT, APPLICATION, SAMPLE_ID, SAMPLE_NAME, …); Google Spreadsheet (online); SQL database (cluster)

Example metadata records:
Timestamp          SAMPLE_ID   CELL_TYPE   TREATMENT      TREATMENT_TIME
08/10/15 14:13     sample001   T47D        Untreated      0
08/10/15 14:35     sample002   T47D        Progesterone   60
08/10/15 14:38     sample003   T47D        Untreated      0
2/22/16 12:35:00   sample004   B-cell      Untreated      0

(b) Sequencing run A (sample001, sample002, sample003, …); sequencing run B (sample004, sample005, sample006, …); (1) raw data; core analysis pipeline; (2) processed data; downstream analysis; (3) analysis results (Analysis 1)

Example directory layouts:
runs/
|-2017-10-09/
|--sample001_read1.fastq.gz
|--fastqc/
|---sample001_read1_fastqc.txt

sample001/
|-alignments/
|--hg19/
|--hg38/
|-profiles/
|-logs/
|--program1.out

projects/
|-project1/
|--2017-10-09_diff_expression/
|---data/
|---figures/
|---tables/
|---scripts/
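
Retrieving metadata programmatically from an SQL database, as described for Fig. 1a, might look like the sketch below. It uses SQLite through Python's standard `sqlite3` module purely for illustration; the poster does not say which database engine is used, and the table name and function names are assumptions. The column names and records are those of the example metadata shown in Fig. 1.

```python
import sqlite3

RECORDS = [
    ("sample001", "T47D", "Untreated", 0),
    ("sample002", "T47D", "Progesterone", 60),
    ("sample003", "T47D", "Untreated", 0),
    ("sample004", "B-cell", "Untreated", 0),
]


def build_database(records, path=":memory:"):
    """Create the metadata table and load the records (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows can then be converted to dictionaries
    conn.execute(
        """CREATE TABLE metadata (
               SAMPLE_ID      TEXT PRIMARY KEY,  -- rejects duplicated sample IDs
               CELL_TYPE      TEXT NOT NULL,
               TREATMENT      TEXT NOT NULL,
               TREATMENT_TIME INTEGER NOT NULL
           )"""
    )
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def get_metadata(conn, sample_id):
    """Return the metadata of one sample as a dictionary."""
    row = conn.execute(
        "SELECT * FROM metadata WHERE SAMPLE_ID = ?", (sample_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no metadata for {sample_id}")
    return dict(row)


if __name__ == "__main__":
    conn = build_database(RECORDS)
    print(get_metadata(conn, "sample002"))
```

Declaring the sample ID as the primary key lets the database itself enforce uniqueness, so a duplicated ID is rejected at insertion time rather than discovered during the analysis.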
Box 3. Analysis code’s wish-list
• Scalability: 1 sample or 100s
• Parallelisation: run all samples simultaneously; speed up individual tasks
• Automatic configuration: no need to set variables for each sample
• Pipeline modularity: execute it all or individually
Box 4. The multiple pieces of reproducibility
• Every task is a directory
• Analysis pipelines: keep a log that monitors the progress of the pipeline; keep the logs of the programs used; check the integrity of important files (e.g. raw reads)
• Use Markdown [3], Jupyter Notebook [4], RStudio [5] or the like to document procedures
• Specify the non-default variable values used
• Version control
• Code repositories (e.g. GitHub)
• Virtual machines or containers (e.g. Docker)
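
One common way to check the integrity of important files such as raw reads is to compare a checksum with a value recorded when the file was first received. The sketch below assumes that approach; the poster does not specify a method, and the function names and the choice of SHA-256 are illustrative.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute the checksum of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected, algorithm="sha256"):
    """Return True if the file still matches the checksum recorded earlier."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

Recording the checksum alongside the sample metadata when the raw data arrive, and calling `verify` at the start of each pipeline run, would catch truncated transfers and accidental edits before they propagate into the results.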
Box 5. Web applications: a win-win situation
“I ran the pipeline on your 10 samples.”
“Can you send me the interaction matrix of chr2 for all of them? Excel crashes and, yet, I’d have to do it many times…”
“I wish I could focus on the more technical aspects…”
“I wish I could be more autonomous with the data analysis…”

Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 