Cassava Genome Hub
Updates on cassava genomics big data management and analysis.
Anestis Gkanogiannis
CIRAD
June 24, 2016
Introduction
Big Data
The Cassava Genome Hub
Usecase
Table of Contents
1 Introduction
2 Big Data
3 The Cassava Genome Hub
Architecture
Technologies
Data
Tools
JBrowse
SNiPlay
GIGWA
DiffExDB
Genetic Map
Querying Tools
Galaxy
4 Usecase
Who am I?
Born and raised on the Greek island of Evia.
Physicist, 1998 - 2003, BSc, UOC, Crete, Greece
Computer scientist
2003 - 2005, MSc in Information Retrieval, AUEB, Athens
2005 - 2011, PhD in Machine Learning, AUEB, Athens
2011 - 2013, Text Analysis, UNB, Fredericton, Canada
2013 - 2015, Bacterial Genomics, Genoscope, Paris, France
2015 - 2016, Plant Genomics, CIRAD, Montpellier, France
2016 - ??
Definition
Everyone is talking about it.
Any combination of volume, velocity and variety.
A very hot subject in omics.
Data
Volume
Tens of TB of raw sequence data.
Hundreds of GB of processed and analyzed data.
Velocity
New and improved assemblies and annotation.
New sequencing technologies and lower cost.
Variety
Genomic sequences, RNASeq, RADSeq, etc.
Annotation
Variants
Metabolomics
Data
Public/Private  Type            Technology  Description                 Publication                 Samples
Public          Genomic         WGS         Assembly and annotation V6  Prochnik et al., 2012
Public          Genomic         WGS         Genetic variants            Bredeson et al., 2016       61
Private         Genomic         RADSeq      Genetic variants            in progress                 1100
Private         Genomic         WGS         Genetic variants            in progress                 34
Public          Transcriptomic  RNASeq      Response to Xanthomonas     Munoz-Bodnar et al., 2014   12 (2*6)
Public          Transcriptomic  RNASeq      Response to Xanthomonas     Cohn et al., 2014           18 (3*6)
Private         Transcriptomic  RNASeq      Response to whitefly        in progress                 16 (2*8)
Table: Available data resources
JBrowse
A fast, embeddable genome browser built completely with JavaScript and HTML5.
SNiPlay
SNiPlay3: a web-based application for exploration and large-scale analyses of genomic variations.
GIGWA
A web-based tool that provides an easy and intuitive way to explore large amounts of genotyping data by filtering it.
Data storage relies on MongoDB, which offers good scalability properties.
Can handle multiple databases, may be deployed in either single- or multi-user mode, and provides a wide range of popular export formats.
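As a rough illustration of the kind of filtering described above, the following Python sketch keeps only the variants whose missing-data rate and minor allele frequency pass user-chosen thresholds. It is not GIGWA's code or API: the genotype layout, the function names (`missing_rate`, `minor_allele_frequency`, `filter_variants`) and the threshold values are all assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical layout: one entry per diploid sample, holding the count of
# alternate alleles (0, 1 or 2), or None when the genotype call is missing.
Genotypes = List[Optional[int]]


def missing_rate(genotypes: Genotypes) -> float:
    """Fraction of samples without a genotype call."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes: Genotypes) -> float:
    """Frequency of the rarer allele among the called diploid genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_variants(variants: Dict[str, Genotypes],
                    max_missing: float = 0.2,
                    min_maf: float = 0.05) -> List[str]:
    """Return the IDs of the variants that pass both thresholds."""
    return [
        variant_id for variant_id, gts in variants.items()
        if missing_rate(gts) <= max_missing
        and minor_allele_frequency(gts) >= min_maf
    ]


# Made-up example data, not taken from the cassava datasets.
variants = {
    "snp1": [0, 1, 2, 1, 0],        # fully called, both alleles common
    "snp2": [0, 0, 0, 0, 0],        # monomorphic, so the MAF is 0
    "snp3": [1, None, None, 2, 0],  # 40% of the calls are missing
}
kept = filter_variants(variants)
```

With these made-up inputs only `snp1` passes both filters; a real deployment would apply such criteria inside the database query rather than in memory.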
DiffExDB
Explore differential expression analyses.
Visualize heatmaps of RPKM expression values.
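For readers unfamiliar with RPKM (reads per kilobase of transcript per million mapped reads), the sketch below applies the standard definition. How DiffExDB itself computes or stores these values is not described here, so the function name `rpkm` and the example numbers are assumptions for illustration only.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up numbers: 500 reads on a 2 kb gene in a library of 10 million reads.
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
```

Normalizing by both gene length and library size is what makes values comparable across genes and samples in a heatmap.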
CMap
A browser-based tool for the visual comparison of various maps (sequence, genetic, etc.).
BLAST
blastn: Search nucleotide databases using a nucleotide query.
blastp: Search protein databases using a protein query.
blastx, tblastn, tblastx
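The five programs differ only in the type of the query, the type of the database, and which side is translated before comparison. The Python sketch below encodes that choice as a lookup; the function name `choose_blast_program` and its `translate_both` flag are hypothetical helpers for this example, not part of BLAST or of the hub.

```python
# (query type, database type) -> BLAST program.
# blastx translates the nucleotide query; tblastn translates the nucleotide database.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type: str, db_type: str,
                         translate_both: bool = False) -> str:
    """Pick the BLAST program matching the query and database types.

    With translate_both=True, a nucleotide query is compared to a nucleotide
    database at the protein level (tblastx) instead of directly (blastn).
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"unsupported types: {query_type!r}, {db_type!r}") from None


program = choose_blast_program("nucleotide", "protein")
```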
Advanced Search
Search for genomic features, genomic locations, enzyme codes, gene ontology terms, etc.
Output as nucleotide or translated amino acid sequences.
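To make the "translated" output option concrete, here is a minimal Python sketch that translates a coding nucleotide sequence with the standard genetic code. It only illustrates the idea; the hub's own implementation is not shown in these slides, and the function name `translate` and the choice to emit `X` for ambiguous codons are assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence in reading frame 1.

    Codons containing anything other than A, C, G or T become "X";
    trailing bases that do not fill a whole codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


peptide = translate("ATGGCCTAA")
```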
Pathway Tools
Creates a new Pathway/Genome Database (PGDB) containing the predicted metabolic pathways.
Supports query, visualization, and analysis of PGDBs.
Galaxy
A scientific workflow, data integration and data analysis platform that aims to make computational biology accessible to research scientists who do not have computer programming experience.
Provides means to build multi-step computational analyses, with a graphical user interface for specifying what data to operate on, what steps to take, and what order to do them in.
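The idea of a multi-step analysis with a defined order can be sketched as a small dependency graph. The Python example below is not Galaxy's API: the step names are hypothetical, and it simply uses the standard library's `graphlib` module (Python 3.9 or later) to derive a valid execution order from the declared dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "quality_control": set(),
    "map_reads": {"quality_control"},
    "call_variants": {"map_reads"},
    "annotate_variants": {"call_variants"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
```

Galaxy lets users assemble this kind of graph visually, which is why no programming experience is needed to define the order of the steps.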