“Design of a community
microarray database”
Andrea Splendiani, ca 2004-2007
Design of a community microarray database
• About (concept)

• Introduction by examples

• Design

• Information modeling

• Annotation process

• Implementation

• Data access
About
• Development of a microarray database for the Genopolis consortium
(Milan, Italy), within the University of Milano-Bicocca.

• The Genopolis Consortium acts as a service provider (Affymetrix
GeneChip)

• Supports a scientific community studying the behavior of immune
cells in host response interaction at the gene expression level

• Supports several research networks

• Integrated with ArrayExpress (EBI)
About:: Desiderata & peculiarities
• Data storage

• Data query/analysis

• “Integration” with other databases
• Support for a heterogeneous
community of users
• Limited to Affymetrix GeneChip
expression data

• Users tend to have a homogeneous
scientific focus

• Different roles of users: service
provider, ‘customers’,...

• Neither public nor private data
(depending on agreements and
publication status)
About:: community database concept
This is reflected in:

• Information modeling

• Annotation process

• Implementation

• Data access
Introduction by example (user describes experiment)
Introduction by example (checking experiment annotation)
Introduction by example (data input by service provider)
Introduction by example (Administrator/Service provider)
Introduction by example (Administrator/Service provider)
Introduction by example (Supervisor manages CVs)
Design:: information modeling
[Figure: gene expression data structure, relating genes and experiment conditions (stimuli) through gene expression values.]

The importance of characterizing experimental conditions (especially in
public repositories) is well understood, with results such as
MIAME, MAGE, the MGED Ontology and ArrayExpress.

Annotation of genes concerns both the characterization of the
measurement technology and the ‘properties’ of genes (such as Gene
Ontology codes). The latter is not strictly part of a microarray
database domain.

Gene expression data can be thought of as a “matrix” representing
a relation between the dimension of “stimuli”, or experimental
conditions, and the dimension of genes.

The Genopolis Microarray data model is related to MAGE, with two main
differences: the array description is ‘not relevant’ (standard technology, can be
‘imported’ from the provider), and the experiment description is simplified.
(The relation between stimuli and samples is also redesigned.)
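The “matrix” view of gene expression data can be illustrated with a minimal Python sketch. It assumes measurements arrive as (gene, condition, value) triples; the gene names, condition labels and values are made up for illustration and do not come from the Genopolis database.

```python
# Hypothetical long-form measurements: (gene id, condition label, expression value).
measurements = [
    ("gene_a", "LPS_2h", 7.1),
    ("gene_a", "control", 5.0),
    ("gene_b", "LPS_2h", 3.2),
    ("gene_b", "control", 3.3),
]


def to_matrix(rows):
    """Pivot (gene, condition, value) triples into a genes x conditions matrix."""
    genes = sorted({gene for gene, _, _ in rows})
    conditions = sorted({cond for _, cond, _ in rows})
    lookup = {(gene, cond): value for gene, cond, value in rows}
    # Missing gene/condition pairs become None rather than a guessed value.
    matrix = [[lookup.get((g, c)) for c in conditions] for g in genes]
    return genes, conditions, matrix


genes, conditions, matrix = to_matrix(measurements)
for gene, row in zip(genes, matrix):
    print(gene, dict(zip(conditions, row)))
```

Each row is a gene and each column an experimental condition, so annotation of genes and annotation of conditions attach to the two axes independently.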
Design:: information modeling (experiment)
• Objects represent entities relevant for experiment annotation

• Description organized as a tree

• ‘Sample centric’

[Diagram: description tree with nodes Experiment, Source, Sample, Hybridization and Measure.]

• An experiment is a ‘container’

• Each sample has an associated list of all the stimuli affecting it

• Supports different measures
Design:: information modeling (experiment)
• Objects represent entities relevant for experiment annotation

• Description organized as a tree

• ‘Sample centric’ (object centric)

[Diagram: description tree with nodes Experiment, Source, Sample, Hybridization and Measure, with the regions ‘Experiment description’ and ‘Measurements’ marked.]

• An experiment is a ‘container’

• Each sample has an associated list of all the stimuli affecting it

• Supports different measures
Design:: information modeling (experiment)
• Objects represent entities relevant for experiment annotation

• Description organized as a tree

• ‘Sample centric’ (object centric)

[Diagram: description tree with nodes Experiment, Source, Sample, Hybridization and Measure, with replicates marked.]

• An experiment is a ‘container’

• Each sample has an associated list of all the stimuli affecting it

• Supports different measures

• Replicates

• No stimuli -> controls
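The sample-centric tree can be sketched in Python as follows. The `Node` class and the helper names are hypothetical, and the rule used here for replicates (samples sharing exactly the same set of stimuli) is an assumption made for illustration; the slides only state that samples with no stimuli are controls.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                            # "Experiment", "Source", "Sample", ...
    name: str
    stimuli: frozenset = frozenset()     # only meaningful on Sample nodes
    children: list = field(default_factory=list)

    def walk(self):
        """Depth-first traversal of this node and everything below it."""
        yield self
        for child in self.children:
            yield from child.walk()


def group_replicates(root):
    """Group sample names by their stimuli set (assumed definition of replicates)."""
    groups = defaultdict(list)
    for node in root.walk():
        if node.kind == "Sample":
            groups[node.stimuli].append(node.name)
    return dict(groups)


def controls(root):
    """Samples with no stimuli are treated as controls."""
    return [n.name for n in root.walk() if n.kind == "Sample" and not n.stimuli]


experiment = Node("Experiment", "E1", children=[
    Node("Source", "S1", children=[
        Node("Sample", "s1", frozenset({"LPS"})),
        Node("Sample", "s2", frozenset({"LPS"})),
        Node("Sample", "s3"),
    ]),
])

print(group_replicates(experiment))
print(controls(experiment))
```

Because stimuli are attached to each sample, replicate and control detection needs only one traversal of the tree and no separate experimental-design table.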
Design:: information modeling (data)
• Which data should be stored in the database?

• The principle is to store the basic information needed by any
‘interpretation technology’ (like raw scanned images) and actual
expression values that can be used ‘live’ (like Signal, evidence code...).
Some other useful intermediate data is stored as well.
Design:: annotation process
• Annotation process by database users

• Users with different views of the experiment can input different
types of information (experiment description, measurement, array
features...)

• In the description of terms, users make use of controlled
vocabularies generated by the community within this process
(ontologies)

• Checking the coherence of the database content (data and
annotation) is both automatic and carried out by supervisors: ‘draft’
and ‘certified’ information.

• Annotation process at large

• Information, once public, can be sent to a public repository (via
MAGE-ML).
Design:: implementation
• Web application (php/mysql)

• Object based. Objects represent entities of the domain, and are containers
of objects representing fields. (Display/Set/Store/Check methods)

• Two key concepts:

• Approximate relations among objects as a tree (stimuli are leaves). Use tree
traversal for: completeness/correctness checking, computation
(replicates), administration (more later...)

• Use two distinct databases: for draft and for complete information. This
can be used to improve efficiency (indexing, deployment on cluster).
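The object-based design can be sketched in Python (the original application is PHP/MySQL, so this is only an analogy). The class names `Field` and `Entity` and the example field names are hypothetical; the `Store` method is left out because it would need a database connection.

```python
class Field:
    """A single annotated value with Set/Check/Display behaviour."""

    def __init__(self, name, required=False):
        self.name = name
        self.required = required
        self.value = None

    def set(self, value):
        self.value = value

    def check(self):
        """Return a list of problems (empty when the field is acceptable)."""
        if self.required and self.value in (None, ""):
            return [f"{self.name}: required field is empty"]
        return []

    def display(self):
        return f"{self.name}: {self.value}"


class Entity:
    """A domain object: a container of fields, plus child entities (the tree)."""

    def __init__(self, label, fields, children=()):
        self.label = label
        self.fields = {f.name: f for f in fields}
        self.children = list(children)

    def check(self):
        """Completeness check by tree traversal: own fields first, then children."""
        problems = [f"{self.label}/{p}"
                    for f in self.fields.values() for p in f.check()]
        for child in self.children:
            problems += [f"{self.label}/{p}" for p in child.check()]
        return problems


sample = Entity("Sample", [Field("organism", required=True), Field("notes")])
experiment = Entity("Experiment", [Field("title", required=True)], [sample])
experiment.fields["title"].set("LPS time course")

problems = experiment.check()
print(problems)
```

Each problem carries the path from the root to the offending field, which is the kind of information a report to the person responsible for an experiment would need.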
Design:: implementation (object types)
• BaseObject: objects that just know about the system (e.g. MailManager)

• DBObject: objects that know of underlying databases and can make queries (e.g. DBQuery)

• DAOedDBObject: objects that can handle a web representation (e.g. Protocol)

• TreeDAOedObject: objects that are organized in a tree and allow iteration over the tree

• Specific Objects: objects that represent entities, with specific properties
Design:: implementation (annotation process)
• Two databases

• TDB (temporary)

• SDB (‘standard’)

• read only

• can be duplicated on
nodes of a cluster
[Diagram: the two databases, TDB and SDB]
Design:: implementation (annotation process)
• Terms for controlled
vocabularies come
from SDB

• New terms proposed
are stored in SDB
[Diagram: Users; Description + data (files); TDB; SDB]
Design:: implementation (annotation process)
• Supervisor accepts
new terms proposed
by users
[Diagram: Users; Description + data (files); Supervisor; TDB; SDB]
Design:: implementation (annotation process)
• The system checks for:

• completeness of data
(required fields)

• common errors

• accepted terms

• Generates and sends
reports to the people responsible

[Diagram: Users; Description + data (files); Supervisor; system check (completeness, errors); TDB; SDB]
Design:: implementation (annotation process)
• The system publishes data

• off-line operation

• possible performance
optimization

• data files are parsed
in this phase

[Diagram: Users; Description + data (files); Supervisor; system check (completeness, errors); system “publishes” data as a batch (file -> SQL tables); TDB; SDB]
Design:: implementation (annotation process)
• Un-publishing for
revisions

[Diagram: Users; Description + data (files); Supervisor; system check (completeness, errors); system “un-publishes” data; TDB; SDB]
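The draft/publish/un-publish cycle between TDB and SDB can be sketched as follows. This is a minimal illustration under several assumptions: in-memory SQLite stands in for MySQL, the table and column names are hypothetical, the data file is taken to be tab-separated gene/signal pairs, and publishing is assumed to move the description out of TDB.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical minimal schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE experiment (name TEXT PRIMARY KEY, description TEXT)")
    db.execute("CREATE TABLE expression (experiment TEXT, gene TEXT, signal REAL)")
    return db


def publish(tdb, sdb, name, data_text):
    """Check the draft, parse its data file into SQL rows, and move it to SDB."""
    row = tdb.execute(
        "SELECT name, description FROM experiment WHERE name = ?", (name,)
    ).fetchone()
    if row is None or not row[1]:
        raise ValueError(f"{name}: missing or incomplete draft")
    values = []
    for line in data_text.strip().splitlines():   # file -> SQL tables
        gene, signal = line.split("\t")
        values.append((name, gene, float(signal)))
    with sdb:                                      # one transaction per batch
        sdb.execute("INSERT INTO experiment VALUES (?, ?)", row)
        sdb.executemany("INSERT INTO expression VALUES (?, ?, ?)", values)
    with tdb:
        tdb.execute("DELETE FROM experiment WHERE name = ?", (name,))


def unpublish(tdb, sdb, name):
    """Move a published description back to TDB so it can be revised."""
    row = sdb.execute(
        "SELECT name, description FROM experiment WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"{name}: not published")
    with tdb:
        tdb.execute("INSERT INTO experiment VALUES (?, ?)", row)
    with sdb:
        sdb.execute("DELETE FROM expression WHERE experiment = ?", (name,))
        sdb.execute("DELETE FROM experiment WHERE name = ?", (name,))


def count(db, table):
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


tdb, sdb = make_db(), make_db()
with tdb:
    tdb.execute("INSERT INTO experiment VALUES (?, ?)", ("E1", "LPS time course"))
publish(tdb, sdb, "E1", "gene_a\t7.1\ngene_b\t3.2")
published_rows = count(sdb, "expression")
unpublish(tdb, sdb, "E1")
```

Keeping parsing inside the publish step means SDB only ever receives checked, fully loaded data, which is what allows it to be treated as read-only and replicated.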
Design:: data access
• Who can access

• Users belong to groups with a role. Experiments (data +
description) belong to groups. Depending on their role in a group,
users can edit, query, view... the information of its experiments.

• How to access data

• Several interfaces. Some related to data inspection (related to the
structure), some oriented to data analysis.

• It is always possible to export a subset of data as a table (for
analysis tools...)

• MAGE-ML

• Examples shown: 

• Tree view

• Interactive “context based” browsing
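The group-and-role access rule described above can be sketched in a few lines of Python. The role names, the permissions attached to each role, and the user, group and experiment names are all hypothetical; the slides only state that the allowed actions depend on a user's role in the groups an experiment belongs to.

```python
# Hypothetical mapping from role to permitted actions.
ROLE_ACTIONS = {
    "supervisor": {"view", "query", "edit"},
    "member": {"view", "query"},
    "guest": {"view"},
}

# (user, group) -> role held by that user in that group.
memberships = {
    ("alice", "lab_a"): "supervisor",
    ("bob", "lab_a"): "guest",
    ("carol", "lab_b"): "member",
}

# Experiment -> groups it belongs to.
experiment_groups = {"E1": {"lab_a"}}


def can(user, action, experiment):
    """True if any group owning the experiment gives the user this action."""
    for group in experiment_groups.get(experiment, ()):
        role = memberships.get((user, group))
        if action in ROLE_ACTIONS.get(role, set()):
            return True
    return False


print(can("alice", "edit", "E1"), can("bob", "edit", "E1"), can("carol", "view", "E1"))
```

Because permissions hang on group membership rather than on a public/private flag, the same experiment can be editable for one user, viewable for another and invisible to a third.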
Design:: data access (tree view)
Design:: data access (interactive browsing)
• Gene expression data as a matrix (Genes x Samples).

• For each sub-matrix, “data” is the connection between a selected subset of
samples and genes

• The idea is to provide a way to navigate between sub-matrices, based on
genes’ annotation, samples’ features or data.

• An example follows...

• Extensions to this interface include pluggable search/view modules and
sharing of gene lists among groups.
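Navigation between sub-matrices can be sketched as successive selections on the gene axis, on the sample axis, or on the values themselves. All annotations, sample features and values below are made up for illustration, and the function names are hypothetical.

```python
# Hypothetical annotations on the two axes of the matrix.
gene_annotation = {
    "gene_a": {"immune response"},
    "gene_b": {"metabolism"},
    "gene_c": {"immune response"},
}
sample_stimuli = {"s1": {"LPS"}, "s2": {"LPS"}, "s3": set()}

# Expression values keyed by (gene, sample).
values = {
    ("gene_a", "s1"): 7.1, ("gene_a", "s2"): 6.9, ("gene_a", "s3"): 5.0,
    ("gene_b", "s1"): 3.2, ("gene_b", "s2"): 3.3, ("gene_b", "s3"): 3.1,
    ("gene_c", "s1"): 8.0, ("gene_c", "s2"): 4.0, ("gene_c", "s3"): 4.1,
}


def genes_with(annotation, within):
    """Narrow the gene axis by annotation."""
    return [g for g in within if annotation in gene_annotation[g]]


def samples_with(stimulus, within):
    """Narrow the sample axis by a sample feature (here, a stimulus)."""
    return [s for s in within if stimulus in sample_stimuli[s]]


def genes_above(threshold, gene_ids, sample_ids):
    """Narrow the gene axis by data: keep genes above threshold in all samples."""
    return [g for g in gene_ids
            if min(values[(g, s)] for s in sample_ids) > threshold]


def submatrix(gene_ids, sample_ids):
    return [[values[(g, s)] for s in sample_ids] for g in gene_ids]


immune = genes_with("immune response", list(gene_annotation))
treated = samples_with("LPS", list(sample_stimuli))
strong = genes_above(6.0, immune, treated)
print(submatrix(immune, treated))
print(strong)
```

Each step returns a plain list of gene or sample identifiers, so a selection can be saved and reused as a starting point for another one.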
The End

The Genopolis Microarray database
