Eagle Bioinformatics Symposium: 5. Arek Kasprzyk: Data Management for Large Collaborative Projects: Challenges and Solutions

Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, and therefore the data models that are used to represent them are constantly changing as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because in order to correctly interpret the experimental data located in one database, additional information from other databases is frequently needed, requiring the user to learn multiple systems. The BioMart project was initiated to address these challenges.

BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart offers different types of access tailored to different groups of users. For biologists, BioMart offers a number of interactive and customisable web-based graphical user interfaces. For bioinformaticians, BioMart provides data access through a range of application programming interfaces. For service providers, BioMart offers a highly customisable system that can be installed locally and tailored to support different types of data management needs.

In this talk I will share my experiences in managing data for large international collaborations involving academic and industry partners. I will also outline the current status of BioMart’s software and services, and describe its new features – such as tools for analysing next generation sequencing data.
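
To make the idea of federated, unified access more concrete, here is a minimal sketch in Python. It is not BioMart code: the two in-memory "sources", the `Query` class and the `federate` helper are hypothetical names invented for illustration. It only shows the general pattern of selecting rows with filters, choosing which attributes to return, and linking results from a second source through a shared identifier.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory "data sources"; in a real federation these would be
# separate databases hosted at different sites.
GENES = [
    {"gene_id": "G1", "name": "geneA", "chromosome": "1"},
    {"gene_id": "G2", "name": "geneB", "chromosome": "2"},
    {"gene_id": "G3", "name": "geneC", "chromosome": "1"},
]
PATHWAYS = [
    {"gene_id": "G1", "pathway": "pathway X"},
    {"gene_id": "G3", "pathway": "pathway Y"},
]


@dataclass
class Query:
    """A query expressed as dataset + filters + attributes (illustrative only)."""
    dataset: list                                    # rows to search
    filters: dict = field(default_factory=dict)      # column -> required value
    attributes: list = field(default_factory=list)   # columns to return

    def run(self):
        # Keep rows that satisfy every filter, then project the attributes.
        rows = [r for r in self.dataset
                if all(r.get(k) == v for k, v in self.filters.items())]
        return [{a: r.get(a) for a in self.attributes} for r in rows]


def federate(left_rows, right_rows, key):
    """Link rows from two sources on a shared identifier (a simple inner join)."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


genes_on_chr1 = Query(GENES, {"chromosome": "1"}, ["gene_id", "name"]).run()
result = federate(genes_on_chr1, PATHWAYS, key="gene_id")
```

In this sketch the user never needs to know that gene names and pathways live in different places; a real federated system would have to handle remote access, schema differences and scale, none of which is modelled here.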

Published in: Healthcare, Technology

  • 1. Data management for large collaborative projects: challenges and solutions. Arek Kasprzyk, Head of Data Management, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute, March 27, 2014
  • 2. Big Data: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.
  • 3. Big Data? 2000: 1 genome (Human Genome Project); 2008: 1,000 genomes (1000 Genomes Project); 2008: 25,000 genomes (ICGC); 2012: 100,000 genomes (UK Genomes)
  • 4. Biomedical databases: Libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.
  • 5.
  • 6. Egyptian Hieroglyphs
  • 7. Phoenician Alphabet
  • 8. Biological abstractions
  • 9. Query abstractions: Dataset, Filter, Attribute
  • 10. Examples: human genes located on chromosome 1, expressed in lungs; name, chromosome, description; rat genes up-regulated in brain and associated with a QTL for a neurological disorder; upstream sequences; Rihanna songs released before 2012; UK top 10
  • 11. Graphical User Interface: Dataset, Filter, Attribute
  • 12. SOAP/REST <Query> <Dataset name="hsapiens_gene_ensembl" > <Filter name="chromosome_name" value="1"/> <Attribute name="ensembl_gene_id"/> <Attribute name="ensembl_transcript_id"/> <Attribute name="biotype"/> </Dataset> </Query>
  • 13. Examples: What percentage of patients with primary breast cancer relapsed within 5 years of surgery? What is the average survival of patients with Chronic Myeloid Leukaemia (CML), both with and without splenomegaly at diagnosis? Find the age and gender of patients who have been diagnosed with Hodgkin's disease, where the initial diagnosis occurred between the ages of 50 and 70 inclusive. What is the percentage of patients diagnosed with primary breast cancer in the age range 30 to 70 who were surgically treated and had post-operative haematoma/seroma?
  • 14. ?
  • 15. BioMart schema – “reversed star” Filter Attribute Dataset
  • 16. BioMart Architecture
  • 17. Mart Configurator: add new data sources; federate data sources; edit metadata; convert relational databases into mart schemas; ‘virtual marts’; website with a click of a button
  • 18. BioMart “out of the box” website: Multiple Graphical User Interfaces, Multiple Application Programming Interfaces
  • 19. What is BioMart? BioMart
  • 20. What is BioMart? BioMart: URL, SOAP, REST, Java, SPARQL
  • 21. What is BioMart? BioMart: Bioclipse, Taverna, Galaxy, Cytoscape, BioConductor, WebLab
  • 22. BioMart Central Portal. “BioMart Central Portal: An Open Database Network for the Biological Community”, Guberman et al., Database Vol. 2011, doi:10.1093/database/bar041
  • 23. BioMart Community
  • 24. BioMart Community
  • 25. BioMart Central Portal
  • 26. BioMart Central Portal
  • 27. Ensembl: The G-protein coupled receptor domain (GPCR) has the InterPro ID of IPR000276. Find the human protein-coding genes in Ensembl that code for this domain, and investigate whether any of them are detectable with the Affy HuGene 1_0 st v1 array. esv263 is the DGVa accession number of a structural variation from Redon et al. (20). What genomic region does this copy number variation span? Find the genes from Escherichia coli strain K12 that are found within the region ‘360473–365601’ and discover whether there are any orthologs in the related strains E. coli O157:H7 EC4115 and E. coli DH10B. The three-gene APL1 locus encodes essential components of the mosquito immune defense against malaria parasites. Find the variations within the APL1A, APL1B and APL1C genes as well as the strain name, strain genotype, allele and biotype.
  • 28. International Knockout Mouse Consortium (IKMC): Find all IKMC resources for genes encoding transcription factors on chromosome 1 between 180-190 Mbp. Find all IKMC resources for genes expressed in heart. Find all IKMC mice available from the EMMA Repository with information on the vector used to make the mutation. Show me all the distributed EMMA lines that have passed Southern blot quality control at a distribution center. Is there any existing phenotype data for other mouse knockouts of the same gene for mouse lines produced from EUCOMM ES resources?
  • 29. COSMIC: Find all gene fusion mutations involving the FUS gene with a primary site of bone, and display mutation and sample information. Find variation information for all genes from mutated samples with a primary site of breast, and display COSMIC gene, mutation and sample information along with Ensembl variation information. Check the transcriptomic alteration status of the genes gained in lung cancer. Find all missense substitution mutations for BRAF in cell lines, and display sample, mutation, site, and histology information.
  • 30. Pancreatic Expression Database: Find genes commonly deregulated in pancreatic cancer precursor lesions, pancreatic intraepithelial neoplasia (PanIN) samples and display gene information, comparison and direction of regulation. Find genes differentially expressed in the serum of pancreatic cancer patients when compared to the serum of patients with benign pancreatic diseases (chronic pancreatitis and pancreatic pseudocyst). Find associated pathways via query integration with Reactome. Display gene and protein information, experimental details and pathway information. Find DNA copy number high-level amplifications in PDAC samples that also contain genes differentially expressed in PDAC versus chronic pancreatitis (CP) and display copy number information, gene information and differential expression experimental details. Find miRNAs differentially expressed in PDAC versus CP whose expression has been confirmed by RT–PCR techniques and display miRNA attributes and study information.
  • 31. “Virtual Bioinformatics Institute”: scales better; no central funding; no admin overhead; very green footprint; maintained by experts
  • 32. ICGC Data Portal. “International Cancer Genome Consortium Data Portal: A One Stop Shop for Cancer Genomic Data”, Zhang et al., Database Vol. 2011, doi:10.1093/database/bar038
  • 33. International Cancer Genome Consortium Goals: Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe. Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors. Make the data available to the entire research community as rapidly as possible and with minimal restrictions to accelerate research into the causes and control of cancer. 50 different tumor types and/or subtypes, 500 samples per tumor: 25,000 Human Genome Projects!
  • 34. ICGC members
  • 35. Data models: genes, samples, simple mutations, copy number mutations, structural rearrangements, gene expression, DNA methylation, miRNA, exon junction
  • 36. Architecture
  • 37. “Parallel” Query Engine
  • 38. Gene Report: Ensembl, KEGG, ICGC, Pancreatic Expression Database
  • 39. Quick Search Ensembl KEGG ICGC Ensembl
  • 40. ICGC Query Examples: Retrieve clinical staging data for colorectal cancer patients with non-synonymous simple mutations in genes that are involved in the WNT signaling pathway. Search for genes affected by copy number loss and also detected as deletions in structural rearrangement analysis. In the pancreatic cancer data set, retrieve all RNA-seq expression data for genes that are affected by copy number gains.
  • 41. “Digital Medicine” University Health Network (UHN)
  • 42. “Digital Medicine” Pilot
  • 43. “Digital Medicine” Architecture (diagram labels: BioMart Central Portal, Ensembl, KEGG, Reactome, COSMIC, HICT, TCGA, TCGA - Ovarian, ICGC, ICGC - Pancreas, Cancer Portal (ICGC), Clinical Trials)
  • 44. Pop-cure project: collaboration between UHN, OICR and Pfizer on colorectal cancer; sequencing; stem cells; clinical data
  • 45. PopCure data management architecture (diagram labels: BioMart Central Portal, Ensembl, KEGG, Reactome, COSMIC, HICT, TCGA, TCGA - Colorectal, ICGC, ICGC - Colorectal, Cancer Portal (ICGC), Pfizer internal data, UHN internal data)
  • 46. The “La Jolla” Project: Collaboration between BioMart and Pfizer on their internal data management infrastructure. BioMart technology provided a single access point to the BioMart community portal (Ensembl, Reactome etc), the ICGC Portal, and internal Pfizer resources.
  • 47. Institut National de la Recherche Agronomique (INRA)
  • 48. San Raffaele Scientific Institute (SRSI): SRSI is one of the principal research institutes in Italy by volume and profile of scientific output. Italy’s leading center for translational medicine.
  • 49. Center for Translational Genomics and Bioinformatics. NGS and Disease: Support, at national and international level, the use of genomic technologies and bioinformatics methods to improve our knowledge of human diseases, making it possible to improve their prevention, diagnosis and treatment. Interdisciplinarity: Offer the Institute and the Hospital, collaborators and clients an integrated, interdisciplinary platform that ranges from molecular biology to medicine, from statistics to computer science, from mathematics to ethics. Dissemination: Contribute publications of international standing to the genomic and bioinformatic sciences. Service: Offer professional, qualified support in delivering genomics and bioinformatics services, ensuring timely and accurate results while retaining the natural scientific curiosity and collaborative attitude that characterize research.
  • 50. The reality of translational bioinformatics: “Consumer driven market”; plethora of web tools; the usual problems (incompatibility of the websites, incompatibility of technologies, etc.)
  • 51. BioMart v 0.9: data analysis and visualization framework; enrichment tool; all Ensembl species; plethora of identifiers; homology; BED files (CNVs, DMRs); full programmatic access
  • 52. BioMart Disease Portal: All data and analytics under one roof. All publicly available disease data (cancer, Mendelian, complex). Analysis (enrichment, prioritizer). “Disease report” for your experimental data.
  • 53. Summary: BioMart; services: single access point to biomedical data; software: support for collaborative efforts; data federation; scalability; data agnostic
  • 54. Acknowledgments: Host institutions: European Bioinformatics Institute (EBI), Ontario Institute for Cancer Research (OICR), San Raffaele Scientific Institute (SRSI). BioMart community: 28 organizations, 50 database projects. BioMart developers.
  • 55. Center for Translational Genomics and Bioinformatics, SRSI, Milan, Italy. Director: Professor Giorgio Casari. Co-Director: Dr Elia Stupka
  • 56.
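
As an illustration of the Dataset / Filter / Attribute query shown on slide 12, the sketch below builds the same XML document with Python's standard library. The helper name `build_query` is hypothetical, and the XML mirrors only what appears on the slide; a live service may require additional elements or attributes that the slide does not show. Passing the XML as a URL-encoded `query` parameter is an assumption about how a REST client might send it, and no endpoint is contacted here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query(dataset, filters, attributes):
    """Build a <Query> document shaped like the one on slide 12 (hypothetical helper)."""
    query = ET.Element("Query")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    # Filters restrict which records are returned.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=value)
    # Attributes choose which fields of each record are returned.
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")


# Values taken from slide 12.
xml_query = build_query(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "1"},
    ["ensembl_gene_id", "ensembl_transcript_id", "biotype"],
)

# Assumption: the XML travels as a URL-encoded parameter named "query".
query_string = urlencode({"query": xml_query})
```

Keeping the query as structured data until the last step makes it straightforward to add filters or attributes programmatically instead of editing XML by hand.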