BioMart 0.8 offers new tools, more
interfaces, and increased flexibility
through plugins


             Junjun Zhang
       BOSC 2011, Vienna, Austria
             July 15, 2011
BioMart: an open source federated
data management system
•  Widely used by public/private biological databases

•  Quickly make in-house data accessible online

•  User-friendly and flexible querying interfaces: web
   GUI and programmatic access APIs (REST, Perl,
   biomaRt, etc.)

•  Automated data conversion tool

•  Effortlessly federate in-house datasets with existing
   public BioMart datasets

                www.biomart.org

BioMart 0.8 new features
 •  An integrated Java application (MartConfigurator) makes it possible
    to build a BioMart data source, configure querying and presentation
    interfaces, and deploy a BioMart server from a single tool

 •  Supports more RDBMSs (MS SQL Server and DB2, in addition to
    MySQL, PostgreSQL, and Oracle)

 •  Creates a ‘virtual mart’ from a 3NF normalized source database
    without materialization

 •  New, diverse web GUIs and APIs provide added flexibility and
    ease of use

 •  Link indexing and parallel querying optimizations

 •  Supports several security features (HTTPS, OpenID, and OAuth
    protocols) for managing sensitive data

 •  Extensible plugin framework for analysis and visualization

Basic BioMart Concepts – the
Power of Simplicity
Building or querying a BioMart data source only requires an
understanding of a few basic concepts:
•  DataSource
•  DataMart
•  DataSet
•  Attribute
•  Filter
•  AccessPoint (new)
•  Analysis (new)
•  Parameter (new)

BioMart hides the complexity of the underlying database schema and
federation mechanism.

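
One way to picture how the concepts listed on this slide nest is the toy model below. It is only a sketch under stated assumptions: the class fields, the equality-only filter, and the sample rows are all hypothetical and are not BioMart's actual Java classes. AccessPoint, Analysis, and Parameter are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Attribute:
    """A named column that a query asks to have returned."""
    name: str


@dataclass(frozen=True)
class Filter:
    """A restriction on returned rows (equality only in this sketch)."""
    name: str
    value: str


@dataclass
class DataSet:
    """A queryable unit; rows are plain dicts in this toy model."""
    name: str
    rows: List[Dict[str, str]] = field(default_factory=list)

    def query(self, attributes: Sequence[Attribute],
              filters: Sequence[Filter] = ()) -> List[Tuple[str, ...]]:
        # Keep rows that satisfy every filter, then project the attributes.
        matching = [row for row in self.rows
                    if all(row.get(f.name) == f.value for f in filters)]
        return [tuple(row[a.name] for a in attributes) for row in matching]


@dataclass
class DataMart:
    """A named group of datasets."""
    name: str
    datasets: Dict[str, DataSet] = field(default_factory=dict)


@dataclass
class DataSource:
    """Where marts live; 'location' is a hypothetical placeholder field."""
    location: str
    marts: Dict[str, DataMart] = field(default_factory=dict)


# Hypothetical sample data, for illustration only.
genes = DataSet("toy_genes", rows=[
    {"gene_id": "G1", "biotype": "protein_coding"},
    {"gene_id": "G2", "biotype": "pseudogene"},
    {"gene_id": "G3", "biotype": "protein_coding"},
])
mart = DataMart("toy_mart", {"toy_genes": genes})
source = DataSource("placeholder-location", {"toy_mart": mart})

result = genes.query([Attribute("gene_id")],
                     [Filter("biotype", "protein_coding")])
```

The point of the sketch is that a user only names attributes and filters; how the rows are stored stays hidden behind the DataSet.
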
A BioMart dataset is organized in a reverse
star schema

A 3NF normalized database can be converted to a
reverse star schema

[Figure: source schema and the resulting reverse star schema]

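
The conversion on this slide can be sketched with an in-memory SQLite database. This is a rough illustration, not MartConfigurator's actual transformation: it assumes a reverse star schema means one main table for the central entity (here, genes) with dimension tables that point back to the main table's key. All table names, column names, and rows are hypothetical, and the `__main` / `__dm` suffixes are an illustrative naming choice.

```python
import sqlite3

# Hypothetical 3NF source schema with toy rows.
SOURCE_3NF = """
CREATE TABLE biotype (biotype_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT,
                   biotype_id INTEGER REFERENCES biotype(biotype_id));
CREATE TABLE pathway (pathway_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene_pathway (gene_id INTEGER REFERENCES gene(gene_id),
                           pathway_id INTEGER REFERENCES pathway(pathway_id));
INSERT INTO biotype VALUES (1, 'protein_coding'), (2, 'pseudogene');
INSERT INTO gene VALUES (1, 'GENE_A', 1), (2, 'GENE_B', 1), (3, 'GENE_C', 2);
INSERT INTO pathway VALUES (1, 'pathway_x'), (2, 'pathway_y');
INSERT INTO gene_pathway VALUES (1, 1), (2, 1), (2, 2);
"""

# Materialize a main table (one row per gene, lookups folded in) and a
# dimension table (one row per gene/pathway pair) keyed back to the main table.
TO_REVERSE_STAR = """
CREATE TABLE gene__main AS
    SELECT g.gene_id AS gene_id_key, g.symbol AS symbol, b.name AS biotype
    FROM gene AS g JOIN biotype AS b ON b.biotype_id = g.biotype_id;
CREATE TABLE gene__pathway__dm AS
    SELECT gp.gene_id AS gene_id_key, p.name AS pathway
    FROM gene_pathway AS gp JOIN pathway AS p ON p.pathway_id = gp.pathway_id;
"""


def build_demo() -> sqlite3.Connection:
    """Create the toy source schema and derive the reverse star tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SOURCE_3NF)
    conn.executescript(TO_REVERSE_STAR)
    return conn


def pathways_for_biotype(conn: sqlite3.Connection, biotype: str):
    """Filter on the main table, fetch attributes from the dimension table."""
    sql = """
        SELECT m.symbol, d.pathway
        FROM gene__main AS m
        LEFT JOIN gene__pathway__dm AS d ON d.gene_id_key = m.gene_id_key
        WHERE m.biotype = ?
        ORDER BY m.symbol, d.pathway
    """
    return conn.execute(sql, (biotype,)).fetchall()
```

After the conversion, a query needs a single join from the main table to a dimension table, where the 3NF source needed joins through `biotype`, `gene_pathway`, and `pathway`.
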
BioMart system components

[Figure: system components diagram; surviving labels: Client-side Plugin, Query Engine / Plugin]

MartConfigurator – an integrated tool
for setting up, configuring and
managing a BioMart server




BioMart 0.8 provides several data querying GUIs

•  MartForm
•  MartWizard
•  MartExplorer

Programmatic access API query syntax at the click
of a button




Special GUI – MartReport

Mutation frequencies from cancer projects with data
distributed around the globe

Sources labeled in the figure:
•  Ensembl
•  KEGG
•  Reactome
•  COSMIC
•  Pancreatic Expression Database (PED)
•  Breast Cancer Campaign Tissue Bank (BCCTB)

Special GUI – MartAnalysis
                 Most affected pathways

Special GUI – MartAnalysis
      Genomic sequence retrieval tool

The sequence retrieval tool is implemented as a server-side
analysis plugin.

New query type - Analysis
Query against ‘affected_pathways’ analysis:
<Query>
    <Analysis name="affected_pathways" dataset="gene_oicrPanc">
        <Parameter name="biotype" value="protein_coding"/>
        <Parameter name="file_type" value="png"/>
        <Parameter name="img_height" value="8000"/>
        <Parameter name="img_width" value="12000"/>
    </Analysis>
</Query>

Query against ‘gene_sequence’ sequence retrieval tool:
<Query>
    <Analysis name="gene_sequence">
        <Parameter name="seq_type" value="gene_flank"/>
        <Parameter name="upstream_flank" value="500"/>
    </Analysis>
</Query>


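
Analysis queries like the two above can also be assembled programmatically rather than typed by hand, which avoids quoting mistakes. The helper below is a minimal sketch using Python's standard library; `build_analysis_query` is a hypothetical name, not part of any BioMart client. It only produces the XML text; how the query is submitted to a server is outside the sketch.

```python
import xml.etree.ElementTree as ET
from typing import Mapping, Optional


def build_analysis_query(analysis: str,
                         parameters: Mapping[str, object],
                         dataset: Optional[str] = None) -> str:
    """Return an Analysis query as XML text (hypothetical helper)."""
    query = ET.Element("Query")
    attrs = {"name": analysis}
    if dataset is not None:
        # The slide's second example has no dataset attribute, so it is optional.
        attrs["dataset"] = dataset
    node = ET.SubElement(query, "Analysis", attrs)
    for name, value in parameters.items():
        ET.SubElement(node, "Parameter", {"name": name, "value": str(value)})
    return ET.tostring(query, encoding="unicode")


# Rebuild the two queries shown on the slide.
pathways_xml = build_analysis_query(
    "affected_pathways",
    {"biotype": "protein_coding", "file_type": "png",
     "img_height": 8000, "img_width": 12000},
    dataset="gene_oicrPanc",
)
sequence_xml = build_analysis_query(
    "gene_sequence",
    {"seq_type": "gene_flank", "upstream_flank": 500},
)
```
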
Several large collaborative projects are
using BioMart for data management


•  BioMart Central Portal (http://central.biomart.org)

•  International Cancer Genome Consortium (http://dcc.icgc.org)

•  POPCURE (collaboration with Pfizer, controlled access)




BioMart Central Portal    (central.biomart.org)

First-of-its-kind, community-driven effort
to provide unified access to dozens of
biological databases spanning genomics,
proteomics, model organisms, cancer
data, and more

BioMart Portal provides access to a collection
of data sources




                                       “Master/Slave” like




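
One reading of the “Master/Slave”-like arrangement is that the portal forwards a query to each registered data source and merges what comes back on a shared linking attribute. The sketch below illustrates that idea only. It is not BioMart's query engine: the sources are in-memory lists, `gene_id` is an assumed linking attribute, and every function name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set

Row = Dict[str, str]


def query_source(rows: List[Row], wanted_ids: Set[str]) -> Dict[str, Row]:
    """Stand-in for a remote source: return matching rows keyed by gene_id."""
    return {row["gene_id"]: row for row in rows if row["gene_id"] in wanted_ids}


def federated_query(sources: Dict[str, List[Row]],
                    gene_ids: Set[str]) -> Dict[str, Row]:
    """Query every source in parallel, then merge results per gene_id."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(query_source, rows, gene_ids)
                   for name, rows in sources.items()}
        partials = {name: future.result() for name, future in futures.items()}

    merged: Dict[str, Row] = {}
    for gene_id in sorted(gene_ids):
        record: Row = {"gene_id": gene_id}
        for name, partial in partials.items():
            for key, value in partial.get(gene_id, {}).items():
                if key != "gene_id":
                    # Prefix with the source name so columns cannot collide.
                    record[f"{name}.{key}"] = value
        merged[gene_id] = record
    return merged


# Hypothetical sources and values, for illustration only.
toy_sources = {
    "expression": [{"gene_id": "G1", "level": "high"},
                   {"gene_id": "G2", "level": "low"}],
    "mutation": [{"gene_id": "G1", "frequency": "0.4"}],
}
combined = federated_query(toy_sources, {"G1", "G2"})
```

A gene missing from one source simply lacks that source's columns in the merged record, which mirrors an outer join across sources.
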
International Cancer Genome Consortium Data Portal

CANADA
•  Pancreatic cancer (Ductal adenocarcinoma)
•  Prostate cancer (Adenocarcinoma)

UNITED STATES
•  Bladder cancer
•  Blood cancer (Acute myeloid leukemia)
•  Brain cancer (Glioblastoma multiforme/lower grade glioma)
•  Breast cancer (Ductal & lobular)
•  Cervical cancer (Squamous)
•  Colon cancer (Adenocarcinoma)
•  Endometrial cancer (Uterine corpus endometrial carcinoma)
•  Gastric cancer (Adenocarcinoma)
•  Head and neck cancer (Squamous cell carcinoma/Thyroid carcinoma)
•  Renal cancer (Renal clear cell carcinoma/Renal papillary carcinoma)
•  Liver cancer (Hepatocellular carcinoma)
•  Lung cancer (Adenocarcinoma/squamous cell carcinoma)
•  Ovarian cancer (Serous cystadenocarcinoma)
•  Prostate cancer (Adenocarcinoma)
•  Rectal cancer (Adenocarcinoma)
•  Skin cancer (Cutaneous melanoma)

MEXICO
•  Multiple sub-types

EU / UNITED KINGDOM
•  Breast cancer (ER positive, HER2 negative)

UNITED KINGDOM
•  Bone cancer (Osteosarcoma/chondrosarcoma/rare subtypes)
•  Breast cancer (Triple negative/lobular/other)
•  Chronic Myeloid Disorders (Myelodysplastic syndromes,
   myeloproliferative neoplasms and other chronic myeloid malignancies)
•  Esophageal cancer
•  Prostate cancer

EU / FRANCE
•  Renal cancer (Renal cell carcinoma) (Focus on but not limited
   to clear cell subtype)

FRANCE
•  Breast cancer (Subtype defined by an amplification of the HER2 gene)
•  Liver cancer (Hepatocellular carcinoma) (Secondary to alcohol and adiposity)
•  Prostate cancer (Adenocarcinoma)

GERMANY
•  Malignant lymphoma (Germinal center B-cell derived lymphomas)
•  Pediatric brain tumors (Medulloblastoma and Pediatric pilocytic astrocytoma)
•  Prostate cancer (Early onset)

ITALY
•  Rare pancreatic tumors (Enteropancreatic endocrine tumors and
   rare pancreatic exocrine tumors)

SPAIN
•  Chronic lymphocytic leukemia (CLL with mutated and unmutated IgVH)

CHINA
•  Gastric cancer (Intestinal- and diffuse-type)

INDIA
•  Oral cancer (Gingivobuccal)

JAPAN
•  Liver cancer (Hepatocellular carcinoma) (Virus-associated)

AUSTRALIA
•  Ovarian cancer (Serous cystadenocarcinoma)
•  Pancreatic cancer (Ductal adenocarcinoma)
•  Prostate cancer




   GOALS: To obtain a comprehensive description of genomic, transcriptomic, and
   epigenomic changes in 50 different tumor types and/or subtypes that are of clinical
   and societal importance across the globe. 500 tumor and matched control samples will
   be analyzed per tumor type. At present, 12 countries have joined the ICGC. Data will be
   generated by institutions all over the world.

   To make the data available rapidly and with minimal restrictions, to accelerate
   research into the causes and control of cancer.

ICGC Data Portal Architecture




          “Peer-to-Peer” like




(dcc.icgc.org)




Future Directions

•  Creation of a BioMart Central Registry to improve
   coordination between BioMart servers. It will be a
   permanent resource where BioMart data providers can
   register their data models, data sources, and services.

•  Enhancing the data transformation module for building
   BioMart databases from non-RDBMS data sources (e.g.
   flat data files, XML data files, etc.) with high scalability
   and flexibility.

•  Enhancing the plugin system to allow various forms of
   data analysis and visualization. Third parties are
   encouraged to develop plugins to extend the capabilities
   of the system.

The BioMart team
    Joachim Baran
    Anthony Cros
    Jonathan Guberman
    Jack Hsu
    Yong Liang
    Elena Rivkin
    Brett Whitty
    Marie Wong-Erasmus
    Long Yao
    Syed Haider
    Junjun Zhang
    Arek Kasprzyk

For support: users@biomart.org
