The document is a presentation slide deck for the International Cancer Genome Consortium (ICGC) Data Coordinating Center (DCC) given on November 14th 2013. It provides an overview of the ICGC, including its goals to catalog genomic abnormalities in 50 different cancer types using comprehensive genome, transcriptome, methylome, and clinical data analysis. It describes the activities of the ICGC DCC, which provides tools and infrastructure for data uploading, tracking, quality control, and distribution. The DCC aims to make ICGC data accessible and useful to researchers through search and analysis capabilities on its data portal.
1. The International Cancer Genome
Consortium (ICGC) Data Coordinating
Center (DCC)
November 14th 2013
B.F. Francis Ouellette
francis@oicr.on.ca
Senior Scientist & Associate Director,
Informatics and Biocomputing, Ontario Institute for
Cancer Research, Toronto, ON
Associate Professor, Department of Cell and Systems Biology,
University of Toronto, Toronto, ON.
3. You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
4. Slides are on slideshare.net
• http://www.slideshare.net/bffo/ebi-oncogenomics-nov2013ouellettever03
• http://goo.gl/HP613K
10. The revolution in cancer
research can be summed up
in a single sentence:
cancer is, in essence,
a genetic disease.
- Bert Vogelstein
11. Cancer
A Disease of the Genome
Challenge in Treating Cancer:
• Every tumor is different
• Every cancer patient is different
12. Large-Scale Studies of Cancer Genomes
• Johns Hopkins
> 18,000 genes analyzed for mutations
11 breast and 11 colon tumors
L.D. Wood et al., Science, Oct. 2007
• Wellcome Trust Sanger Institute
518 genes analyzed for mutations
210 tumors of various types
C. Greenman et al., Nature, Mar. 2007
• TCGA (NIH)
Multiple technologies
Brain (glioblastoma multiforme), lung (squamous
carcinoma), and ovarian (serous cystadenocarcinoma)
F.S. Collins & A.D. Barker, Sci. Am., Mar. 2007
13. Lessons learned
• Heterogeneity within and across tumor types
• High rate of abnormalities (driver vs. passenger)
• Sample quality matters
15. International Cancer Genome Consortium
• Collect ~500 tumour/normal pairs from each of 50 different major
cancer types;
• Comprehensive genome analysis of each T/N pair:
- Genome
- Transcriptome
- Methylome
- Clinical data
• Make the data available to the research community & public.
[Figure: identifying genome changes by comparing aligned sequences, e.g. …GATTATTCCAGGTAT… vs. …GATTATTGCAGGTAT…]
16. Rationale for the ICGC
• The scope is huge, such that no country can do it all.
• Coordinated cancer genome initiatives will reduce
duplication of effort for common and easy-to-acquire
tumor samples and ensure complete studies for many
less frequent forms of cancer.
• Standardization and uniform quality measures across
studies will enable the merging of datasets, increasing
power to detect additional targets.
• The spectrum of many cancers varies across the
world because of environmental, genetic and other causes.
• The ICGC will accelerate the dissemination of genomic
and analytical methods across participating sites and
the user community.
17. International Cancer Genome Consortium
(ICGC)
Goals
• Catalogue genomic abnormalities in tumors in 50
different cancer types and/or subtypes of clinical and
societal importance across the globe
• Generate complementary catalogues of transcriptomic
and epigenomic datasets from the same tumors
• Make the data available to the research community rapidly
with minimal restrictions to accelerate research into the
causes and control of cancer
50 tumor types and/or subtypes
500 tumors + 500 controls per subtype
50,000 Human Genome Projects!
Nature (2010) 464:993
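The "50,000 Human Genome Projects" figure follows directly from the numbers on this slide. A minimal sketch of the arithmetic, assuming one tumor genome and one matched control genome per donor (the variable names are illustrative only):

```python
# Back-of-the-envelope check of the ICGC scale quoted on the slide.
CANCER_TYPES = 50          # tumor types and/or subtypes
TUMORS_PER_TYPE = 500      # tumor genomes per subtype
CONTROLS_PER_TYPE = 500    # matched control genomes per subtype

genomes_per_type = TUMORS_PER_TYPE + CONTROLS_PER_TYPE
total_genomes = CANCER_TYPES * genomes_per_type

print(f"{genomes_per_type} genomes per type, {total_genomes:,} genomes in total")
```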
18. Analysis Data Types
• Simple Somatic Mutations
• Copy Number Alterations
• Structural Somatic Mutations
• Gene Expression (microarrays and RNA-Seq)
• miRNA Expression (RNA-Seq)
• Epigenomics (Arrays and Methylation)
• Splicing Variation
• Protein Expression
20. OICR's mission
To build innovative research
programs that will have an impact
on the prevention, early detection,
diagnosis and treatment of
cancer.
21. OICR Informatics & Biocomputing Senior Staff
Lincoln Stein
Director, I&B
Sr. PI
Vincent Ferretti
Assoc. Director,
Bioinf. Software Dev
Sr. PI
Francis Ouellette
Assoc. Director, I&B
Paul Boutros
Jr. PI
Lakshmi Muthuswamy
Jr. PI
David Sutton
Director, IT
Brian Shoichet
Sr. PI
May 2013
Tatiana Lomasko
Program Manager
Jared Simpson
OICR Fellow
May 2013
23. ICGC Map - November 2013
67 projects launched
24. ICGC Committees & Working Groups
http://icgc.org/icgc/committees-and-working-groups
25. ICGC Project Teams @ OICR
• ICGC Secretariat
- Executive Chair: Thomas Hudson
- Senior Project Manager: Jennifer Jennings
- Administrative Coordinator: Jaypee Banlawi
- (with the support of the Web Development team)
• ICGC Data Coordination Center (DCC)
- DCC Leader: Lincoln Stein
- DCC Co-Leader: Francis Ouellette
- DCC Software Development Team Leader: Vincent Ferretti (+6 FTE)
- DCC Data Curation: Hardeep Nahal (+1 FTE)
26. DCC Activities
DCC activities are split between two groups:
• Software Development
- DCC portal
- Submission tool
• Curation (and Content Management)
- Data level management
- Submitter "handling"
- Coordination with Secretariat
- User support
http://dcc.icgc.org/team
27. ICGC Data Coordination Centre
A "comprehensive management system" providing:
• Secure mechanism for uploading data
• Track uploads and perform integrity checks
• Regular progress reporting (data audit)
• Quality checks (coverage, correctness, etc.)
• Enable distribution of raw data to public repositories
• Provide essential metadata to public repositories
• Integrate with other public repositories via standard data
formats, ontologies, etc.
28. ICGC Data Coordination Centre (2)
Provides the following support to experimental
biologists, computational biologists, and other
researchers:
• Download of complete dataset, or subsets
• Restrict protected data to authorized users (controlled access)
• Search data by gene or specimen, or lists thereof
• Interactive system for identifying specimens of interest, finding what
data sets are available for those specimens, selecting data slices
across those specimens (e.g., counts of the number of somatic
mutations observed in a region within the UTR of a gene of interest), and
running basic analytic tests on those data slices
29. ICGC Data Types
• Clinical Data
- Hosted by DCC via data portal
- Was 100% open access, but currently 9 data elements have been flagged by DACO
as controlled access and are under review by IDAC
• Experimental Analysis Data
- Hosted by DCC via data portal
- Somatic is open access, germline is controlled
• "Raw" Sequencing Data (+ array data, etc.)
- Hosted at other public repositories
- Primary repository for ICGC sequence data is EBI EGA
- TCGA raw data hosted at CGHub
30. ICGC datasets to date
Hardeep Nahal
[Figure: ICGC Data Portal cumulative donor count for member projects, Dec 2011 to Sept 2013. Y-axis: number of donors (1,000 to 10,000); points labelled Release 7 through Release 14.]
31. ICGC dataset version 14
September 2013
Hardeep Nahal
• Cancer types: 41
• Donors: 8,532 (18,056 specimens)
• Simple somatic mutations: 1,995,134
• Copy number mutations: 18,526,593
• Structural rearrangements: 18,614
• Genes affected* by simple somatic mutations: 22,074
• Genes affected* by non-synonymous coding mutations: 19,150
• Genes affected* by copy number mutations: 20,341
• Genes affected* by structural rearrangements: 1,884
• Open tier and controlled data currently available
*out of 22,259 protein-coding genes annotated in Ensembl Human release 69
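To put the "genes affected" counts in proportion, they can be expressed as a share of the 22,259 annotated protein-coding genes. The sketch below only redoes that division with the counts quoted on this slide; the dictionary keys and function name are illustrative, not part of any ICGC tool.

```python
# Counts copied from the release 14 summary on this slide.
PROTEIN_CODING_GENES = 22_259  # annotated in Ensembl Human release 69

genes_affected = {
    "simple somatic mutations": 22_074,
    "non-synonymous coding mutations": 19_150,
    "copy number mutations": 20_341,
    "structural rearrangements": 1_884,
}


def percent_affected(count, total=PROTEIN_CODING_GENES):
    """Return the share of annotated protein-coding genes, in percent."""
    return 100.0 * count / total


percentages = {kind: percent_affected(n) for kind, n in genes_affected.items()}

for kind, pct in percentages.items():
    print(f"{kind}: {pct:.1f}% of protein-coding genes")
```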
32. Key DCC Activities for 2013
• Improved data & metadata curation at EGA; better
linking of data held at DCC to ICGC data in other
repositories (currently not perfect)
• Improved data quality/integrity checking through
new submission/validation system; review of
submission file specifications
• Integration of new data submission system and
portal infrastructure with project and user
information managed at ICGC.org
34. Where do you find that information?
• We actually make it hard to find, but we are
working on that! (this is an example of where ICGC
would like to do what TCGA does!)
• http://cancergenome.nih.gov/publications/publicationguidelines
35. Where do you find that information?
For ICGC data:
• Need to find the policy!
• http://icgc.org/icgc/goals-structure-policiesguidelines/e3-publication-policy
• Find text:
- Published: no embargo
- < 100 tumors: 2 years
- > 100 tumors: 1 year
• Find date: in the README file on the FTP site
- (exception in README)
• This is bad, we know it, and we are fixing it!
• In doubt? Contact us! info@icgc.org
36. Time limits for publication moratoriums:
All data shall become free of a publication
moratorium when either:
1) the data is published by the ICGC member project
2) one year after a specified quantity of data (e.g.
genome dataset from 100 tumours per project)
has been released via the ICGC database or
other public databases.
3) In all cases data shall be free of a publication
moratorium two years after its initial release.
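The three rules above amount to taking the earliest of up to three dates. A minimal sketch, assuming the relevant dates are known; the function and parameter names are hypothetical, and the README for each dataset remains the authoritative source (including any exceptions):

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def moratorium_end(initial_release, threshold_release=None, published=None):
    """Earliest date on which the data is free of the publication moratorium.

    initial_release:   date the data was first released (rule 3: + 2 years)
    threshold_release: date the specified quantity of data, e.g. 100 tumours,
                       was released, if reached (rule 2: + 1 year)
    published:         date the member project published the data, if any (rule 1)
    """
    candidates = [add_years(initial_release, 2)]
    if threshold_release is not None:
        candidates.append(add_years(threshold_release, 1))
    if published is not None:
        candidates.append(published)
    return min(candidates)


# Example with made-up dates: the 100-tumour threshold was reached two months
# after the first release, and the project has not published yet.
print(moratorium_end(date(2013, 9, 1), threshold_release=date(2013, 11, 1)))
```

Taking the minimum reflects that any one of the conditions is enough to lift the moratorium.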
39. Raw Data Availability at EGA by Project and Data Type
• https://www.ebi.ac.uk/ega/organisations/EGAO00000000024
40. Cooperation with EBI EGA Repository for
Controlled Access Raw Data
• Concerted efforts with EGA staff to support
coordinated data submissions to both ICGC DCC
& EGA
• Infrastructure to grant controlled data access
automatically on approval of ICGC DACO web
application forms
41. What do the users see?
• Important to have a data portal that represents the
richness of the data that we generate, but to also
make sure biologists and clinicians can actually
use the data & make discoveries!
• Important to have a scalable technology that will
support 50,000 human genomes, and thousands of
concurrent users (we don't have that many yet)
42. Uniform Annotations
• Annotating Simple Somatic Mutations (SSM) and Simple
Germline Variations (SGV)
• DCC is currently implementing the snpEff software
- Recommended by the ICGC Bioinformatics Analysis
Working Group
- Returns Sequence Ontology's controlled vocabulary
regarding mutation-induced changes
(www.sequenceontology.org)
• ICGC members will not be required to annotate
SSM and SGV for the ICGC data releases
50. Highlights of the new portal: dcc.icgc.org
• Faceted search capabilities for variants, genes and
donors
- Interactive data exploration, fast and easy
• Mutation aggregation & counts across donors and cancers
- e.g., # of pancreatic cancer donors with the KRAS G12D mutation
• Standardized gene consequence across all projects
• Genome browser
• Data download
• Protein domains
• Links to repositories
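The "mutation aggregation & counts" idea can be illustrated with a small grouping example. This is only a sketch over made-up records, not the portal's implementation: the tuple layout, donor IDs and function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: (donor_id, cancer_type, gene, protein_change)
observations = [
    ("DO1", "pancreatic", "KRAS", "G12D"),
    ("DO1", "pancreatic", "KRAS", "G12D"),  # same donor, second specimen
    ("DO2", "pancreatic", "KRAS", "G12D"),
    ("DO3", "pancreatic", "KRAS", "G12V"),
    ("DO4", "colorectal", "KRAS", "G12D"),
]


def donors_per_mutation(rows):
    """Count distinct donors for each (cancer type, gene, protein change)."""
    donors = defaultdict(set)
    for donor_id, cancer_type, gene, change in rows:
        donors[(cancer_type, gene, change)].add(donor_id)
    return {key: len(ids) for key, ids in donors.items()}


counts = donors_per_mutation(observations)
print(counts[("pancreatic", "KRAS", "G12D")])
```

Collecting donor IDs in a set means a donor with several specimens carrying the same mutation is counted once.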
72. ICGC Data Categories
ICGC Open Access Datasets
• Cancer Pathology
- Histologic type or subtype
- Histologic nuclear grade
• Donor
- Gender
- Age range
• RNA expression (normalized)
• DNA methylation
• Genotype frequencies
• Somatic mutations (SNV, CNV and Structural Rearrangement)
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome Data
- Patient demography
- Risk factors
- Examination
- Surgery/Drugs/Radiation
- Sample/Slide
- Specific histological features
- Protocol
- Analyte/Aliquot
• Gene Expression (probe-level data)
• Raw genotype calls (germline)
• Gene-sample identifier links
• Genome sequence files
Most of the data in the portal is publicly available without restriction. However,
access to some data, like the germline mutations, requires authorization by the Data
Access Compliance Office (DACO).
75. ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data
- Region of residence
- Risk factors
- Examination
- Surgery
- Radiation
- Sample
- Slide
- Specific histological features
- Analyte
- Aliquot
- Donor notes
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
ICGC Open Access (OA) Datasets
• Cancer Pathology
- Histologic type or subtype
- Histologic nuclear grade
• Patient/Person
- Gender, Age range,
Vital status, Survival time,
Relapse type, Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and
Loss of Heterozygosity
• Newly discovered somatic variants
http://goo.gl/w4mrV
76. Identify yourself
Fill out a detailed form, which includes:
• Contact and Project Information
• Information Technology details and procedures
for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you
print and get your institution to sign off on your behalf.
Module 1: Cancer Genomic Databases
bioinformatics.ca
83. DACO approved projects:
59 groups - 75% academic
(~400 people)
84. DACO/DCC User Data Access Process
• Users approved through DACO are now automatically granted access to
ICGC controlled access datasets available through the ICGC Data Portal
and the EBI's EGA repository
[Figure: access workflow linking the DACO Web Application, the DCC User Registry, the DCC Data Portal and EBI EGA; step labels: "application approved by DACO", "user accounts activated"]
85. Future Work for the DCC
• Work with projects to improve in a number of areas:
- Clinical data content
- Increasing frequency of data release
• Better metadata collection from the EGA
- Working with EGA to better match metadata requirements for ICGC member
submissions; will enable reliable linking by Sample ID, Donor ID, etc. between data
portal and EGA. Will allow direct link to DACO approved users
- Projects will be required to provide this metadata at submission time;
existing EGA datasets will be updated.
• Improve access to projects' analysis methods
- Suggested publishing analysis SOPs in Standards in Genomic Sciences at most
recent ICGC workshop; haven't seen any interest in doing this from member projects.
- DCC to host centralized web page(s) for each project's analysis methods; use
permalink in submission files.
• Better documentation … always need more!
• Better transparency of processes
• Better links to publications
86. Future Work for the DCC
• New releases:
- Release 15: finished before Christmas
  • All data submissions sent in again, plus new data
  • (no methylation data)
- Release 16: incremental submission + methylation data,
released before May
- Release 17: adopt incremental submission for all data types, and
increase frequency of releases.
87. New Project: ICGC PANCANCER analysis
• 2,000 whole genome sequences
- 6 cloud infrastructures across the world
- Appropriate policy and tool availability
- Agreed upon shared pipelines, and others
- Shared datasets
- Petabytes of files, 10,000s of cores
- Mutation analysis, as well as CNV, Structural, others
when feasible (RNA and methylome).
88. Challenges and Opportunity
• Targeted sequencing for Patient
Selection
• Consent
• Combinations
• Corrected features and #features >>
#samples
• Noisy and incomplete data
• Speed and cost
We are also hiring!
Adapted from Paul Rejto, Pfizer
89. FGED's mission:
To be a positive agent of
change in the effective
sharing and reproducibility
of functional genomic data
fged.org
90. Acknowledgments
http://oicr.on.ca
http://icgc.org
ICGC Project leaders at the OICR:
Tom Hudson, John McPherson, Lincoln Stein, Paul Boutros, Lakshmi Muthuswamy, Vincent Ferretti, Francis Ouellette, Jennifer Jennings
Ouellette Lab:
Michelle Brazas, Emilie Chautard, Nina Palikuca, Matthew Ziembicki
FGED:
Alvis Brazma, Roger Bumgarner, Cesare Furlanello, Michael Miller, Francis Ouellette, John Quackenbush (Dana-Farber), Michael Reich, Gabriella Rustici, Chris Stoeckert, Ronald Taylor, Steve Trutane, Jennifer Weller, Brian Wilhelm, Neil Winegarden
DCC Software Developers:
Vincent Ferretti, Brian O'Connor, Junjun Zhang, Anthony Cros, Jonathan Guberman, Bob Tiernay, Shane Wilson, Long Yao, Daniel Chang, Jerry Lam, Stuart Watt
Web Dev:
Miyuki Fukuma, Kamen Wu, Joseph Yamada, Salman Badr
Pipeline Development & Evaluation:
Morgan Taschuk, Rob Denroche, Peter Ruzanov, Zhibin Lu
DCC Data Coordinator:
Hardeep Nahal
… and all the patients and their families that are putting their
hopes into our work!