Data management for large
collaborative projects: challenges
and solutions.
Arek Kasprzyk
Head of Data Management
Center f...
 Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-ha...
 2000 – 1 Genome ---- Human Genome Project
 2008 – 1000 Genomes ----- 1000 Genome Project
 2008 – 25, 000 Genomes -----...
 libraries of life sciences information, collected from
scientific experiments, published literature, high-
throughput ex...
www.biomart.org
Egyptian Hieroglyphs
Phoenician Alphabet
Biological abstractions
Query abstractions
Dataset
Filter
Attribute
Examples
human genes
located on chromosome 1, expressed in lungs
name, chromosome, description
rat genes
up-regulated in b...
Graphical User Interface
Dataset
Filter
Attribute
SOAP/REST
<Query>
<Dataset name="hsapiens_gene_ensembl" >
<Filter name="chromosome_name" value="1"/>
<Attribute name="ense...
 What percentage of patients with primary breast cancer who
relapsed within 5 years of surgery?
 What is the average sur...
?
BioMart schema
– “reversed star”
Filter
Attribute
Dataset
BioMart Architecture
Mart Configurator
Add new data sources
Federate data sources
Edit metadata
Convert relational databases into
mart schemas ...
BioMart
“out of the box”
website
Multiple Graphical User Interfaces
Multiple Aplication Programing Interfaces
What is BioMart?
BioMart
BioMart
URL SOAP
RESTJAVA
What is BioMart?
SPARQL
BioMart
Bioclipse Taverna
Galaxy
Cytoscape
BioConductor
WebLab
What is BioMart?
BioMart Central Portal: An Open Database Network for Biological Community
Guberman et al Database Vol. 2011, doi:10.1093/d...
BioMart Community
BioMart Community
BioMart Central Portal
BioMart Central Portal
 The G-protein coupled receptor domain (GPCR) has the InterPro ID of
IPR000276. Find the human protein-coding genes in En...
 Find all IKMC resources for genes encoding transcription
factors on chromosome 1 between 180-190 Mbp
 Find all IKMC res...
 Find all gene fusion mutations involving the FUS gene with a
primary site of bone, and display mutation and sample
infor...
 Find genes commonly deregulated in pancreatic cancer precursor lesions,
pancreatic intraepithelial neoplasia (PanIN) sam...
 Scales better
 No central funding
 No admin overhead
 Very green footprint
 Maintained by experts
“Virtual Bioinform...
ICGC Data Portal
International Cancer Genome Consortium Data Portal: A One Stop Shop for
Cancer Genomic Data Zhang et al D...
International Cancer Genome Consortium
Goals
 Catalogue genomic abnormalities in tumors in 50 different cancer types
and/...
ICGC members
Data models
 Genes
 Samples
 Simple mutations
 Copy number mutations
 Structural rearrangements
 Gene expression
 D...
Architecture
“Parallel” Query Engine
Gene Report
Ensembl
KEGG
ICGC
Pancreatic Expression
Database
Quick Search
Ensembl
KEGG
ICGC
Ensembl
 Retrieve clinical staging data for colorectal cancer
patients with non-synonymous simple mutations in
genes that are inv...
“Digital Medicine” University Health
Network (UHN)
“Digital Medicine” Pilot
“Digital Medicine” Architecture
KEGG ReactomeCOSMIC Ensembl
BioMart Central Portal
HICT
COSMIC
Ensembl
HICT
KEGG
Reactome
...
 Collaboration between UHN, OICR and Pfizer on
collorectal cancer
 Sequencing
 Stem cells
 Clinical data
Pop-cure proj...
PopCure data management
architecture
KEGG ReactomeCOSMIC Ensembl
BioMart Central Portal
COSMIC
Ensembl
HICT
KEGG
Reactome
...
 Collaboration between BioMart and Pfizer on their
internal data management infrastructure
 BioMart technology provided ...
Institut National de la Recherche
Agronomique (INRA)
San Raffaele Scientific Institute (SRSI)
 SRSI is one of the principal research Institutes in Italy
as per volume and pro...
Center for Translational Genomics and
Bioinformatics
NGS e Malattie
Supportare a livello nazionale ed internazionale l’uti...
 “Consumer driven market”
 Plethora of web tools
 The usual problems
 Incompatibility of the websites
 Incompatibilit...
BioMart v 0.9
Data analysis and visualization
framework
Enrichment Tool
All ensembl species
Plethora of identifiers
Homolo...
 All data and analytics under one roof
 All publically available disease data
 Cancer
 Mendelian
 Complex
 Analysis
...
 BioMart
 Services: Single access point to biomedical data
 Software: Support for collaborative efforts
 Data federati...
 Host institutions
 European Bioinformatics Institute (EBI)
 Ontario Institute for Cancer Research (OICR)
 San Raffale...
Center for Translational
Genomics and
Bioinformatics
SRSI
Milan Italy
Director: Professor Giorgio Casari
Co-Director: Dr E...
www.biomart.org
Eagle Bioinformatics Symposium: 5. Arek Kasprzyk: Data Management for Large Collaborative Projects: Challenges and Solutions
Upcoming SlideShare
Loading in...5
×

Eagle Bioinformatics Symposium: 5. Arek Kasprzyk: Data Management for Large Collaborative Projects: Challenges and Solutions

233

Published on

Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, and therefore the data models that are used to represent them are constantly changing as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because in order to correctly interpret the experimental data located in one database, additional information from other databases is frequently needed, requiring the user to learn multiple systems. The BioMart project (www.biomart.org) was initiated to address these challenges.

BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart offers different types of access tailored to different groups of users. For biologists, BioMart offers a number of interactive and customisable web-based graphical user interfaces. For bioinformaticians, BioMart provides data access through a range of application programing interfaces. For service providers, BioMart offers a highly customizable system that can be installed locally and tailored to support different types of data management needs.

In this talk I will share my experiences in managing data for large international collaborations involving academic and industry partners. I will also outline the current status of BioMart’s software and services, and describe its new features – such as tools for analysing next generation sequencing data.

Published in: Healthcare, Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
233
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Eagle Bioinformatics Symposium: 5. Arek Kasprzyk: Data Management for Large Collaborative Projects: Challenges and Solutions

  1. 1. Data management for large collaborative projects: challenges and solutions. Arek Kasprzyk Head of Data Management Center for Translational Genomics and Bioinformatics San Raffaele Scientific Institute, March 27, 2014
  2. 2.  Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. Big Data
  3. 3.  2000 – 1 Genome ---- Human Genome Project  2008 – 1000 Genomes ----- 1000 Genome Project  2008 – 25, 000 Genomes ----- ICGC  2012 – 100,000 ----- UK Genomes Big Data?
  4. 4.  libraries of life sciences information, collected from scientific experiments, published literature, high- throughput experiment technology, and computational analyses They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics, Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures. Biomedical databases
  5. 5. www.biomart.org
  6. 6. Egyptian Hieroglyphs
  7. 7. Phoenician Alphabet
  8. 8. Biological abstractions
  9. 9. Query abstractions Dataset Filter Attribute
  10. 10. Examples human genes located on chromosome 1, expressed in lungs name, chromosome, description rat genes up-regulated in brain and associated with a QTL for a neurological disorder Upstream sequences Rihanna songs released before 2012 UK top 10
  11. 11. Graphical User Interface Dataset Filter Attribute
  12. 12. SOAP/REST <Query> <Dataset name="hsapiens_gene_ensembl" > <Filter name="chromosome_name" value="1"/> <Attribute name="ensembl_gene_id"/> <Attribute name="ensembl_transcript_id"/> <Attribute name="biotype"/> </Dataset> </Query>
  13. 13.  What percentage of patients with primary breast cancer who relapsed within 5 years of surgery?  What is the average survival of patients with Chronic Myeloid Leukaemia (CML) and both with and without splenomegaly at diagnosis?  Find the age and gender of patients who have been diagnosed with Hodgkin's disease, where the initial diagnosis occurred between the ages 50 and 70 inclusive  What is the percentage of patients diagnosed with primary breast cancer in the age range 30 to 70 who were surgically treated and had post operative haematoma/seroma? Examples
  14. 14. ?
  15. 15. BioMart schema – “reversed star” Filter Attribute Dataset
  16. 16. BioMart Architecture
  17. 17. Mart Configurator Add new data sources Federate data sources Edit metadata Convert relational databases into mart schemas ‘virtual marts’ Website with a click of a button
  18. 18. BioMart “out of the box” website Multiple Graphical User Interfaces Multiple Aplication Programing Interfaces
  19. 19. What is BioMart? BioMart
  20. 20. BioMart URL SOAP RESTJAVA What is BioMart? SPARQL
  21. 21. BioMart Bioclipse Taverna Galaxy Cytoscape BioConductor WebLab What is BioMart?
  22. 22. BioMart Central Portal: An Open Database Network for Biological Community Guberman et al Database Vol. 2011, doi:10.1093/database/bar041 BioMart Central Portal
  23. 23. BioMart Community
  24. 24. BioMart Community
  25. 25. BioMart Central Portal
  26. 26. BioMart Central Portal
  27. 27.  The G-protein coupled receptor domain (GPCR) has the InterPro ID of IPR000276. Find the human protein-coding genes in Ensembl that code for this domain, and investigate whether any of them are detectable with the Affy HuGene 1_0 st v1 array  esv263 is the DGVa accession number of a structural variation from Redon et al. (20). What genomic region does this copy number variation span?  Find the genes from Escherichia coli strain K12 that are found within the region ‘360473–365601’ and discover whether there are any orthologs in the related strains E. coli O157:H7 EC4115 and E. coli DH10B  The three-gene APL1 locus encodes essential components of the mosquito immune defense against malaria parasites. Find the variations within the APL1A, APL1B and APL1C genes as well as the strain name, strain genotype, allele and biotype Ensembl
  28. 28.  Find all IKMC resources for genes encoding transcription factors on chromosome 1 between 180-190 Mbp  Find all IKMC resources for genes expressed in heart  Find all IKMC mice available from the EMMA Repository with information on the vector used to make the mutation  Show me all the distributed EMMA lines have passed Southern blot quality control at a distribution center  s there any existing phenotype data for other mouse knockouts of the same gene for mouse lines produced from EUCOMM ES resources International Knockout Mouse Consortium (IKMC)
  29. 29.  Find all gene fusion mutations involving the FUS gene with a primary site of bone, and display mutation and sample information  Find variation information for all genes from mutated samples with a primary site of breast, and display COSMIC gene, mutation and sample information along with Ensembl variation information  Check the transcriptomic alteration status of the genes gained in lung cancer  Find all missense substitution mutations for BRAF in cell lines, and display sample, mutation, site, and histology information COSMIC
  30. 30.  Find genes commonly deregulated in pancreatic cancer precursor lesions, pancreatic intraepithelial neoplasia (PanIN) samples and display gene information, comparison and direction of regulation  Find genes differentially expressed in the serum of pancreatic cancer patients when compared to the serum of patients with benign pancreatic diseases (chronic pancreatitis and pancreatic pseudocyst). Find associated pathways via query integration with Reactome. Display gene and protein information, experimental details and pathway information  Find DNA copy number high-level amplifications in PDAC samples that also contain genes differentially expressed in PDAC versus chronic pancreatitis (CP) and display copy number information, gene information and differential expression experimental details  Find miRNAs differentially expressed in PDAC versus CP whose expression has been confirmed by RT–PCR techniques and display miRNA attributes and study information Pancreatic Expression Database
  31. 31.  Scales better  No central funding  No admin overhead  Very green footprint  Maintained by experts “Virtual Bioinformatics Institute”
  32. 32. ICGC Data Portal International Cancer Genome Consortium Data Portal: A One Stop Shop for Cancer Genomic Data Zhang et al Database Vol. 2011, doi:10.1093/database/bar038
  33. 33. International Cancer Genome Consortium Goals  Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe  Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors  Make the data available to the entire research community as rapidly as possible and with minimal restrictions to accelerate research into the causes and control of cancer 50 different tumor types and/or subtypes 500 samples per tumor 25,000 Human Genome Projects!
  34. 34. ICGC members
  35. 35. Data models  Genes  Samples  Simple mutations  Copy number mutations  Structural rearrangements  Gene expression  DNA methylation  miRNA  Exon junction
  36. 36. Architecture
  37. 37. “Parallel” Query Engine
  38. 38. Gene Report Ensembl KEGG ICGC Pancreatic Expression Database
  39. 39. Quick Search Ensembl KEGG ICGC Ensembl
  40. 40.  Retrieve clinical staging data for colorectal cancer patients with non-synonymous simple mutations in genes that are involved in WNT signaling pathway  Search for genes affected by copy number loss and also detected as deletion from structural rearrangement analysis  In pancreatic cancer data set, retrieve all RNA-seq expression data for genes that are affected by copy number gains ICGC Query Examples
  41. 41. “Digital Medicine” University Health Network (UHN)
  42. 42. “Digital Medicine” Pilot
  43. 43. “Digital Medicine” Architecture KEGG ReactomeCOSMIC Ensembl BioMart Central Portal HICT COSMIC Ensembl HICT KEGG Reactome TCGA - Ovarian ICGC ICGC - Pancreas TCGA Cancer Portal (ICGC) Clinical Trials
  44. 44.  Collaboration between UHN, OICR and Pfizer on collorectal cancer  Sequencing  Stem cells  Clinical data Pop-cure project
  45. 45. PopCure data management architecture KEGG ReactomeCOSMIC Ensembl BioMart Central Portal COSMIC Ensembl HICT KEGG Reactome TCGA - Colorectal ICGC ICGC - Colorectal TCGA Cancer Portal (ICGC) Pfizer internal dataUHN internal data
  46. 46.  Collaboration between BioMart and Pfizer on their internal data management infrastructure  BioMart technology provided a single access point to  BioMart community portal (Ensembl, Reactome etc)  ICGC Portal  Internal Pfizer resources The “La Jolla” Project
  47. 47. Institut National de la Recherche Agronomique (INRA)
  48. 48. San Raffaele Scientific Institute (SRSI)  SRSI is one of the principal research Institutes in Italy as per volume and profile of scientific output  Italy’s leading center for translational medicine
  49. 49. Center for Translational Genomics and Bioinformatics NGS e Malattie Supportare a livello nazionale ed internazionale l’utilizzo di tecnologie genomiche e di metodologie bioinformatiche per migliorare la nostra conoscenza delle malattie umane, permettendo di migliorarne prevenzione, diagnosi e cura. Interdisciplinarità Offrire all’Istituto e all’Ospedale, a collaboratori e clienti una piattaforma integrata e interdisciplinare che spazi dalla biologia molecolare alla medicina, dalla statistica all’informatica, dalla matematica all’etica Divulgazione Contribuire con pubblicazioni di livello internazionale alle scienze genomiche e bioinformatiche Servizio Offrire un supporto professionale e qualificato nell’erogazione di servizi in ambito genomico e bioinformatico, che garantisca puntualità e precisione nell’erogazione dei risultati pur mantenendo la naturale curiosità scientifica e attitudine collaborativa caratterizzante la ricerca
  50. 50.  “Consumer driven market”  Plethora of web tools  The usual problems  Incompatibility of the websites  Incompatibility of technologies  Etc The reality of translational bioinformatics
  51. 51. BioMart v 0.9 Data analysis and visualization framework Enrichment Tool All ensembl species Plethora of identifiers Homology BED files (CNVs, DMRs) Full programmatic access
  52. 52.  All data and analytics under one roof  All publically available disease data  Cancer  Mendelian  Complex  Analysis  Enrichment  Prioritizer  “Disease report” for your experimental data BioMart Disease Portal
  53. 53.  BioMart  Services: Single access point to biomedical data  Software: Support for collaborative efforts  Data federation  Scalability  Data agnostic Summary
  54. 54.  Host institutions  European Bioinformatics Institute (EBI)  Ontario Institute for Cancer Research (OICR)  San Raffale Scientific Institute (SRSI)  BioMart community  28 organizations  50 database projects  BioMart developers Acknowledgments
  55. 55. Center for Translational Genomics and Bioinformatics SRSI Milan Italy Director: Professor Giorgio Casari Co-Director: Dr Elia Stupka
  56. 56. www.biomart.org
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×