Presentation to the Department of Biology at the University of Windsor, Windsor, Ontario, describing and updating activities related to the International Cancer Genome Consortium (ICGC).
Biocuration activities for the International Cancer Genome Consortium (ICGC) - Neuro, McGill University
The document discusses biocuration activities for the International Cancer Genome Consortium (ICGC). It provides information on the goals of ICGC including comprehensively analyzing 50 different cancer types/subtypes and making the genomic and clinical data publicly available. It describes the types of data being collected, standards being developed for data access and sharing, and current status of datasets released.
International Cancer Genome Consortium (ICGC) Data Coordinating Center - Neuro, McGill University
The document is a presentation slide deck for the International Cancer Genome Consortium (ICGC) Data Coordinating Center (DCC) given on November 14, 2013. It provides an overview of the ICGC, including its goals to catalog genomic abnormalities in 50 different cancer types using comprehensive genome, transcriptome, methylome, and clinical data analysis. It describes the activities of the ICGC DCC, which provides tools and infrastructure for data uploading, tracking, quality control, and distribution. The DCC aims to make ICGC data accessible and useful to researchers through search and analysis capabilities on its data portal.
Presentation at the Canadian Cancer Research Conference satellite bioinformatics.ca workshop. This one is an introduction to the TCGA, ICGC, and COSMIC databases.
This document provides a status update and overview of the International Cancer Genome Consortium (ICGC). The ICGC aims to sequence 500 tumor/normal pairs from each of 50 different cancer types to identify genome changes and make the data available for research. It coordinates cancer genome projects internationally to maximize data collection while minimizing duplication of effort. The ICGC has established policies for data access, publication, and intellectual property. To date it has sequenced over 12,000 cancer genomes through 55 projects across 18 jurisdictions. The ICGC Data Coordination Center manages data submission and access and provides portals and tools for searching and accessing datasets.
The document provides information about a workshop on cancer genomic databases, including The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and the Catalogue of Somatic Mutations in Cancer (COSMIC). It summarizes the goals, data access, and analysis tools available for each database. It also discusses controlled access vs open data and the process for applying for access to controlled TCGA and ICGC genomic and clinical data.
Cancer genome databases & Ecological databases - Waliullah Wali
Introduction
Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis.
Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.
Cancer genome databases
COSMIC cancer database
COSMIC cancer database
COSMIC is an online database of somatically acquired mutations found in human cancer.
The database is freely available.
COSMIC cancer database
Types of data
Expert curation data
Genome-wide screen data
COSMIC cancer database
Expert curation data
Manually input by COSMIC expert curators.
Consists of comprehensive literature curation followed by subsequent updates.
Includes additional data points relevant to each disease and publication.
Provides accurate frequency data, as mutation-negative samples are specified.
COSMIC cancer database
Genome-wide screen data
Uploaded from publications reporting large-scale genome screening data or imported from other databases such as TCGA and ICGC.
Provides unbiased molecular profiling of diseases across the whole genome.
Provides objective frequency data, since non-mutated genes across each genome are also recorded.
Facilitates the discovery of novel cancer driver genes.
To access the COSMIC cancer database, type http://cancer.sanger.ac.uk/cosmic into the browser's address bar.
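COSMIC pages can also be reached programmatically by building the search URL rather than typing it by hand. A minimal sketch, assuming the site's search form takes a `search` path with a `q` query parameter (an illustrative assumption, not a documented API):

```python
from urllib.parse import urlencode, urljoin

COSMIC_BASE = "http://cancer.sanger.ac.uk/cosmic/"

def cosmic_search_url(term: str) -> str:
    """Build a COSMIC search URL for a gene symbol or keyword.

    The 'search' path and 'q' parameter mirror the site's search box
    and are assumptions for illustration, not a documented API.
    """
    return urljoin(COSMIC_BASE, "search") + "?" + urlencode({"q": term})

print(cosmic_search_url("TP53"))
```

Opening the printed URL in a browser leads to the same gene pages as the manual search described above.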
Searching Process
Examples
Ecological databases
Ecological databases
Ecological databases are sources for finding ecological datasets and quickly determining the best ways to use them.
BioOne
DataONE
GEOBASE
BioOne
BioOne is a nonprofit publisher that aims to make scientific research more accessible.
BioOne was established in 1999 in Washington, DC.
Its full-text collection, BioOne Complete, includes open-access content.
It serves a community of over 140 society and institutional publishers, 4,000 accessing institutions, and millions of researchers worldwide.
To access the BioOne ecological database, type http://www.bioone.org/ into the browser's address bar.
This document provides permissions for sharing and reusing the content of a presentation. It states that the presentation can be:
1) Copied, shared, adapted, or remixed.
2) Photographed, filmed, or broadcast.
3) Blogged about, live-blogged, or have videos posted.
As long as the work is attributed to its author and respects any rights and licenses associated with its components. One slide was created by Cameron Neylon and is available under a CC0 license. Social media icons were adapted from another source with permission.
Steve Rozen's keynote talk at IEEE CIBCB 2016
Big Genome Data Sheds Light on Cancer Causes
Steven G. Rozen, PhD
Professor, Cancer & Stem Cell Programme, Duke-NUS Medical School, Singapore
Director, Duke-NUS Centre for Computational Biology
The last eight years have seen a revolution in the availability of DNA sequencing data. This revolution has been driven by costs that have plummeted from US$10 million per human genome in 2008 to US$1,200 today. Abundant sequencing data brings with it a previously unimaginable range of research possibilities in all areas of biomedical research. Naturally, these research possibilities make heavy demands on computation and data storage, because the cost of sequencing is falling much faster than Moore's law. In this talk I will present a high-level overview of these computational demands. I will then go into detail on a few of the cancer-related big-data projects my lab is working on. One of these is "mutation signature analysis", which has important applications in cancer prevention and epidemiology, and in research into the fundamental processes by which cancers arise. One example of the importance of this approach is the recent finding that a highly mutagenic herbal remedy is implicated in many more geographical regions and types of cancer than suspected a few years ago.
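Mutation signature analysis begins by classifying each single-base substitution by its trinucleotide context (the mutated base plus its flanking bases), conventionally normalized so the reference base is a pyrimidine (C or T), yielding 96 classes. A minimal sketch of that classification step (the example contexts are made-up inputs):

```python
# Map each base to its Watson-Crick complement for strand normalization.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def substitution_class(ref_context: str, alt: str) -> str:
    """Classify a single-base substitution by trinucleotide context.

    ref_context: 3-base reference sequence centred on the mutated base.
    alt: the alternate (mutant) base.
    Returns e.g. 'A[C>T]G', normalized so the central reference
    base is a pyrimidine (C or T), as in the 96-class convention.
    """
    assert len(ref_context) == 3
    if ref_context[1] in "AG":  # purine reference: flip to the other strand
        ref_context = ref_context.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
    return f"{ref_context[0]}[{ref_context[1]}>{alt}]{ref_context[2]}"

print(substitution_class("ACG", "T"))  # A[C>T]G
print(substitution_class("TGA", "C"))  # T[C>G]A (reverse-complemented)
```

Counting these classes across a tumor's mutations produces the 96-element profile that signature-extraction methods then decompose.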
Presentation for teaching faculty about resources, data, issues, and strategies for including personal genomics in the classroom, within the context of precision medicine as an overarching theme.
Learning, Training, Classification, Common Sense and Exascale Computing - Joel Saltz
In this talk, I will describe work my group has carried out in development of deep learning methods that target semantic segmentation and object identification tasks in terapixel Pathology datasets and for satellite data. I will describe what we have been able to achieve, how this work can generalize to additional types of problems and will outline how exascale computing could be used to transform and integrate our methods and pipelines. I will then go on to outline broad research program in exascale computing and deep learning that promises to identify common deep learning methods for previously disparate large and extreme scale data tasks.
The Application of Next Generation Sequencing (NGS) in cancer treatment - Premadarshini Sai
Next-generation sequencing (NGS) has several advantages for cancer treatment including high throughput sequencing, screening of multiple genes simultaneously, and decreased costs. NGS faces challenges from complex data analysis and validation of new technologies. Key clinical applications of NGS include whole genome sequencing, transcriptome analysis via RNA-seq, and sequencing of cell-free DNA. Future areas of development include immunotherapy, epigenetics research, and using circulating tumor cells to detect early relapse. More research is still needed to fully realize the potential of NGS in personalized cancer treatment.
- The document discusses the Total Cancer Care (TCC) approach at Moffitt Cancer Center, which aims to provide personalized cancer care through comprehensive data collection and analysis.
- TCC collects extensive clinical, genomic, treatment and outcomes data from over 78,000 consented patients to power research studies and clinical trials matching. Molecular profiling has been conducted on over 14,000 tumor samples.
- The TCC data is housed in a large integrated database and used by researchers for studies in areas like radiochemotherapy response, exome sequencing, immunology biomarkers, and cancer epidemiology.
- The database also helps clinicians identify eligible patients for clinical trials and develop evidence-based treatment pathways. The goal is to transform cancer
This document discusses Moffitt Cancer Center's Total Cancer Care program which aims to transform cancer care through a personalized approach. It involves collecting extensive clinical, molecular, and biospecimen data from patients over their lifetime to power research. The goals are to improve outcomes through early detection, personalized treatment, and clinical trials matching. Moffitt has established an extensive biorepository and informatics platform to integrate data from over 78,000 consented patients to enable precision oncology research.
Integrative Everything, Deep Learning and Streaming Data - Joel Saltz
Workshop on Clusters, Clouds, and Data for Scientific Computing, September 6, 2018
The need to label information and segment regions in individual sensor data sources, and to create syntheses from multiple disparate data sources, spans many areas of science, biomedicine, and technology. The rapid evolution of sensor technologies, from digital microscopes to UAVs, drives requirements in this area. I will describe a variety of use cases and technical challenges, as well as tools, algorithms, and techniques developed by our group and collaborators.
Big Data and Genomic Medicine by Corey Nislow - Knome_Inc
View the webinar at: http://www.knome.com/webinar-big-data-genomic-medicine. This presentation covers an overview of genomic medicine, requirements and challenges of next-generation sequencing, bottlenecks to broader healthcare adoption, and why “we want to sequence everyone.”
The Global Microbial Identifier (GMI) initiative - and its working groups - ExternalEvents
http://www.fao.org/about/meetings/wgs-on-food-safety-management/en/
The GMI initiative - and its working groups. Presentation from the Technical Meeting on the impact of Whole Genome Sequencing (WGS) on food safety management, 23-25 May 2016, Rome, Italy.
The document outlines plans to transition the cBioPortal cancer genomics platform to an open source model with coordinated development between Memorial Sloan Kettering Cancer Center, Dana-Farber Cancer Institute, and Princess Margaret Cancer Centre. It discusses expanding usage, new features, funding options, and establishing an advisory committee. The goal is to build a sustainable open source community through collaborative development, additional funding, and engagement with users and potential contributors.
Next generation sequencing in cancer treatment - MarliaGan
Next-generation sequencing (NGS) provides several advantages over first generation sequencing methods. NGS allows large amounts of genomic information to be sequenced in parallel at a lower cost. NGS has various applications in cancer treatment including predicting cancer progression, identifying drug targets and resistance mutations, detecting minimal residual disease, and improving cancer classification. While powerful, NGS also faces limitations such as tissue heterogeneity, complexity of data analysis, and difficulties identifying driver mutations as cancers evolve with treatment.
Additional value of prenatal genomic array testing in fetuses with isolated structural ultrasound abnormalities and a normal karyotype: a systematic review of the literature
M.C. de Wit, M.I. Srebniak, L.C.P. Govaerts, D. Van Opstal, R.J.H. Galjaard and A.T.J.I. Go
Link to free access article: http://onlinelibrary.wiley.com/doi/10.1002/uog.12575/abstract
HDx™ Reference Standards and Reference Materials for Next Generation Sequenci... - Candy Smellie
This document summarizes a presentation about reference standards for next generation sequencing (NGS). Horizon Diagnostics has developed genomic DNA and formalin-fixed, paraffin-embedded (FFPE) reference standards containing defined mutations at known allelic frequencies to validate NGS workflows and monitor assay performance. Multiplex reference standards contain up to 40 mutations at low allelic frequencies down to 1.3% that can be quantified using digital PCR. Several laboratories demonstrated they could accurately detect the mutations in Horizon's reference standards using different NGS platforms. The standards help evaluate sensitivity, specificity, and limits of detection on NGS assays.
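Reference standards like these are specified in terms of variant allele frequency (VAF): the fraction of reads at a locus that carry the mutant allele. A minimal sketch of the calculation (the read counts are made-up numbers, chosen to land on the 1.3% low end quoted above):

```python
def variant_allele_frequency(alt_reads: int, ref_reads: int) -> float:
    """Fraction of reads supporting the alternate allele at a locus."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads cover this locus")
    return alt_reads / total

# 13 mutant reads out of 1,000 total reads covering the locus
print(f"{variant_allele_frequency(13, 987):.1%}")  # 1.3%
```

Detecting alleles at such low frequencies is what makes deep coverage and well-characterized reference materials necessary for validating an NGS assay's limit of detection.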
Stephen Friend Institute of Development, Aging and Cancer 2011-11-29 - Sage Base
The document proposes a new approach called Arch2POCM for drug development that moves from disease targets to clinical validation. It discusses issues with the current drug discovery process, noting $200 billion is spent annually but only a handful of new medicines are approved each year while productivity is declining. Arch2POCM would require a more data-driven and collaborative approach involving scientists, clinicians, and citizens to better link knowledge and accelerate eliminating human disease. It presents the mission of Sage Bionetworks to create a commons for evolving integrative networks to map diseases and enable discovery.
Hao Liu has over 15 years of experience in drug discovery research. He has developed biochemical and cell-based assays to screen for oncology and metabolic disease targets. He is skilled in developing and optimizing high throughput screening assays, cell signaling studies, and molecular biology techniques. Liu has authored several publications and presented at conferences.
This document discusses next-generation sequencing and its applications in genomics and pathology. It begins with an overview of common NGS terms and technologies. It then covers the typical NGS analysis workflow including quality control, mapping reads to a reference genome, variant calling and annotation. Challenges such as data storage, sharing and reporting are also addressed. The document concludes that clinical sequencing is becoming established but requires ongoing collaboration between pathologists, geneticists and bioinformaticians to realize its potential.
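The quality-control step in the NGS workflow described above typically inspects per-base Phred quality scores, which FASTQ files encode as ASCII characters offset by 33. A minimal sketch of decoding them for one read (the four-line record shown is a made-up example):

```python
def mean_phred_quality(quality_string: str, offset: int = 33) -> float:
    """Mean Phred quality of a read from its FASTQ quality string.

    Each character encodes quality as ord(char) - offset; 33 is the
    standard Sanger/Illumina 1.8+ offset.
    """
    scores = [ord(c) - offset for c in quality_string]
    return sum(scores) / len(scores)

# Four-line FASTQ record: header, bases, separator, quality string
record = "@read1\nACGTACGT\n+\nIIIIIIII"
qual = record.split("\n")[3]
print(mean_phred_quality(qual))  # 'I' encodes Phred 40 -> 40.0
```

Reads or bases falling below a quality threshold are trimmed or discarded before the mapping and variant-calling steps.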
Next generation sequencing (NGS) has various applications in cancer treatment and research. It can be used to identify novel cancer mutations, detect hereditary cancer syndromes, enable personalized cancer treatment based on a patient's genetic profile, and detect circulating tumor DNA (ctDNA). NGS allows comprehensive analysis of cancer genomes and biomarkers for molecular diagnosis, prognosis, and monitoring treatment response. Challenges include analyzing large amounts of NGS data and accurately interpreting genetic variations, but its clinical utility continues to advance personalized cancer care.
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat... - Nathan Olson
"Next Generation Sequencing for Identification and Subtyping of Foodborne Pathogens" presentation at the Standards for Pathogen Identification via NGS (SPIN) workshop hosted by the National Institute for Standards and Technology October 2014 by Rebecca Lindsey, PhD from Enteric Diseases Laboratory Branch of the CDC.
FDA NGS and Big Data Conference September 2014 - Warren Kibbe
The document discusses the National Cancer Institute's efforts to address challenges in cancer data access and analysis through the development of the NCI Genomics Data Commons and NCI Cloud Pilots. The NCI Genomics Data Commons will provide integrated genomic and clinical cancer data from projects like TCGA to researchers. The NCI Cloud Pilots aim to explore cloud-based models for analyzing large cancer genomics datasets without having to download the full datasets locally, helping to enable more widespread data access and analysis. The goal is to build a national learning health system for cancer clinical genomics through open data sharing and cloud-based approaches.
Federal Research & Development for the Florida system Sept 2014 Warren Kibbe
This document discusses challenges in cancer data integration and analysis. It proposes the development of open science models, standardized data elements, and sustainable informatics infrastructure. Emerging technologies like mobile devices, social media, and cloud computing create opportunities to build a national "learning health system" for cancer. The National Cancer Institute is pursuing initiatives like the Cancer Genomics Data Commons and cloud pilots to leverage large genomic and clinical datasets using these technologies and develop predictive models to improve outcomes. The ultimate goal is a system that facilitates data sharing, continuous learning from all cancer patients, and personalized, predictive oncology.
EBI Industry programme TCGA Warren KIbbe November 2013Warren Kibbe
This document discusses strategic objectives and activities of the National Cancer Institute's Center for Biomedical Informatics and Information Technology (NCI CBIIT). The key objectives are to reduce cancer risk, improve cancer outcomes, provide cancer information to the public, and enable precision oncology through data access and modeling. Specific activities mentioned include the Genomic Data Commons, cloud computing initiatives, clinical trials repositories, and The Cancer Genome Atlas (TCGA) project. TCGA has collected over 700 terabytes of genomic and clinical data on 20+ cancer types to date. The data provides a platform for understanding cancer drivers, molecular subtypes of cancers, and the implications of data sharing policies.
- National challenges in cancer research include lowering barriers to data access and analysis, and integrating clinical and basic research data to enable improved outcomes.
- Disruptive technologies like high-throughput biology and ubiquitous computing are generating large amounts of molecular and clinical cancer data.
- The NCI is working to build infrastructure like the Genomics Data Commons and Cloud Pilots to make these data widely accessible and support data analysis.
- The goal is to develop a national "learning health system" that applies insights from real-world cancer data to research and clinical practice to continuously improve patient care and outcomes.
NCI Cancer Genomics, Open Science and PMI: FAIR Warren Kibbe
Talk given to the NLM Fellows on July 8, 2016. Touches on Cancer Genomics, Open Science and PMI: FAIR in NCI genomics thinking and projects. Includes discussion of the Genomic Data Commons (GDC), Cancer Data Ecosystem, Data sharing, and the NCI cancer clinical trials open API.
Proteomics Modules designed to bring clinically relevant data, at any point, into the Drug Discovery Process. 1000s of proteins are plated from primary cells and are used to trap autoantibodies from diseased patients' blood sera. Results put a spotlight on highest probability targets.
2016 Data Commons and Data Science Workshop June 7th and June 8th 2016. Genomic Data Commons, FAIR, NCI and making data more findable, publicly accessible, interoperable (machine readable), reusable and support recognition and attribution
Workshop finding and accessing data - fiona - lunteren april 18 2016Fiona Nielsen
Workshop presentation on finding and accessing human genomics data for research.
Including statistics of publicly available data sources and tips on how to save time in your workflow of data access.
Presented at BioSB2016, pre-conference PhD retreat for young researchers in bioinformatics and systems biology at Congrescentrum De Werelt in Lunteren. #BioSB2016 #BioSB16
Link to event:
http://www.youngcb.nl/events/biosb-phd-retreat-2016/
Read more about my work:
http://DNAdigest.org
http://repositive.io
https://uk.linkedin.com/in/fionanielsen
Genomic epidemiology uses whole genome sequencing data from pathogens combined with epidemiological investigations to track the spread of infectious diseases. The document discusses making genomic epidemiology a widespread reality in public health. It outlines key requirements including building a user-friendly analysis platform, developing portable analysis pipelines, providing training to public health personnel, and improving information sharing between organizations.
How Can We Make Genomic Epidemiology a Widespread Reality? - William HsiaoWilliam Hsiao
The document discusses genomic epidemiology and the requirements to bring genomic sequencing into routine public health practice. It outlines two parts: (1) what genomic epidemiology is and why it is important; and (2) the requirements for genomic sequencing to be used routinely in public health. Whole genome sequencing is seen as a way to generate high quality pathogen genomes quickly and allow for more detailed tracking of disease spread compared to traditional methods. However, bringing genomic sequencing into public health practice requires overcoming barriers such as the need for user-friendly analysis platforms, training public health personnel in genomics, and improving information sharing between organizations.
Will Biomedical Research Fundamentally Change in the Era of Big Data?Philip Bourne
This document discusses how biomedical research may fundamentally change in the era of big data. It notes that biomedical research has always been data-driven, but the scope, variety, complexity and volume of data is now much greater. It also discusses the need for more open data sharing and new tools and methods for large-scale analysis. The document suggests biomedical research may move towards a more collaborative "platform" model, as seen with companies like Airbnb, with the goal of improving data access, reuse and reproducibility of research. However, overcoming challenges like incentives, trust and work practices will be important for any new platform to succeed.
Data-integration platform for cancer research:cBioPortal demoCORBEL
Participants will be introduced to the data-integration platform cBioPortal. Here, different sources of research data (clinical, imaging, biosample and experimental) of a study are integrated, enabling viewing, querying and analysis.
This webinar is aimed at data managers, researchers, PhD students and postdocs involved in clinical, translational and biomedical research.
Improvements in sequencing technologies have led to a deluge of genomics data in many fields of research. Specifically, the increasing size of cancer-related genomics datasets require comprehensive software solutions that remain accessible to clinical researchers. Clearly, there is an obvious need for tools that integrate genomics and other molecular biology results with the phenotypic and clinical outcome data. During this webinar, the cBio Cancer Genomics Portal (cBioPortal) will be introduced through a practical use case.
The cBioPortal is an open source data integration platform that enables researchers to view, query, analyse and share complex genomic cancer datasets in a user-friendly manner. The platform was originally developed by Memorial Sloan Kettering Cancer Center (New York, USA) and is actively maintained and further developed by an international community. The original instance of cBioPortal (http://cbioportal.org) currently provides access to data from almost 83,000 tumor samples from 273 public studies.
The demo will include:
· short introduction on the FAIR principles (Findable, Accessible, Interoperable, Reusable)
· navigation through a public study on the data-integration platform cBioPortal
· recreation of select plots from publications of interest using cBioPortal functionalities
The CORBEL webinar series aims to address challenges and share best practice between biological and medical research infrastructures. The series is aimed at technical operators of RIs and is aligned with the CORBEL competency framework.
Clinical trial data wants to be free: Lessons from the ImmPort Immunology Dat...Barry Smith
Presentation to the Clinical and Research Ethics Seminar, Clinical and Translational Science Center, Buffalo, January 21, 2014
https://immport.niaid.nih.gov/
http://youtu.be/booqxkpvJMg
Cancer Moonshot, Data sharing and the Genomic Data CommonsWarren Kibbe
Gave the inaugural Informatics Grand Rounds at City of Hope on September 8th. NIH Commons, Genomic Data Commons, NCI Cloud Pilots, Cancer Moonshot and rationale for changing incentives around data sharing all discussed.
A Vision for a Cancer Research Knowledge SystemWarren Kibbe
The document discusses a vision for a cancer research knowledge system that utilizes data commons and cloud platforms. It describes how data commons co-locate data, storage, computing and tools to create interoperable resources for researchers. The Genomic Data Commons aims to make over 30,000 cancer cases FAIR (Findable, Accessible, Interoperable, Reusable) and provide attribution. This will help identify rare cancer drivers and factors influencing therapy response. The system incorporates multiple data types from studies and clinical trials to enable precision medicine approaches.
Advancing Innovation and Convergence in Cancer Research: US Federal Cancer Mo...Jerry Lee
Special Seminar at the 8th Taiwan Biosignatures Workshop to share overall work of NCI's Center for Strategic Scientific Initiatives since 2003 as well as CSSI's influence on select projects initiated by the 2016 WH Cancer Moonshot Task Force that include Applied Proteogenomics Organizational Learning and Outcomes (APOLLO) network, International Cancer Proteogenome Consortium, and the Blood Profiling Atlas in Cancer (BloodPAC) commons.
International perspective for sharing publicly funded medical research dataARDC
Presentation by Olivier Salvado, CSIRO, to the 'Unlocking value from publicly funded Clinical Research Data' workshop, cohosted by ARDC and CSIRO at ANU on 6 March 2019.
Nov 2014 ouellette_windsor_icgc_final
1. A project status for the International Cancer Genome Consortium (ICGC).
November 21st, 2014
B.F. Francis Ouellette francis@oicr.on.ca
• Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON
• Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON.
@bffo
2. 2
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
3. 3
But first, a little about me …
… an unfinished story!
39. 39
Cancer
A Disease of the Genome
Challenge in Treating Cancer:
Every tumor is different
Every cancer patient is different
40. 40
Large-Scale Studies of Cancer Genomes
Johns Hopkins
> 18,000 genes analyzed for mutations
11 breast and 11 colon tumors
L.D. Wood et al, Science, Oct. 2007
Wellcome Trust Sanger Institute
518 genes analyzed for mutations
210 tumors of various types
C. Greenman et al, Nature, Mar. 2007
TCGA (NIH)
Multiple technologies
brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma).
F.S. Collins & A.D. Barker, Sci. Am, Mar. 2007
41. 41
Lessons learned
Heterogeneity within and across tumor types
High rate of abnormalities (driver vs passenger)
Sample quality matters
Consent and controlled data access is complicated
42. 42
International Cancer Genome Consortium
• Collect ~500 tumour/normal pairs from each of 50 different major cancer types;
• Comprehensive genome analysis of each T/N pair:
– Genome
– Transcriptome
– Methylome
– Clinical data
• Make the data available to the research community & public.
Identify genome changes:
…GATTATTCCAGGTAT… …GATTATTGCAGGTAT… …GATTATTGCAGGTAT…
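The sequence fragments on the slide illustrate the core comparison: line up the tumour sequence against the matched normal and flag positions where they disagree. A minimal sketch in Python, using the fragments shown above (the function name is mine; real pipelines align millions of reads and apply quality filters before calling a variant):

```python
# Compare a matched tumour/normal sequence pair position-by-position and
# report candidate somatic single-nucleotide variants. Toy illustration only.

def somatic_snvs(normal: str, tumour: str):
    """Yield (position, normal_base, tumour_base) for mismatching positions."""
    for i, (n, t) in enumerate(zip(normal, tumour)):
        if n != t:
            yield (i + 1, n, t)  # 1-based position within the fragment

normal = "GATTATTCCAGGTAT"   # fragment shown on the slide
tumour = "GATTATTGCAGGTAT"   # same fragment from the tumour sample

print(list(somatic_snvs(normal, tumour)))  # [(8, 'C', 'G')]
```

Here the single C→G change at position 8 is the "genome change" the consortium sets out to catalogue at scale.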
43. 43
Rationale for the ICGC
• The scope is huge, such that no country can do it all.
• Coordinated cancer genome initiatives will reduce duplication of effort for common and easy-to-acquire tumor samples, and ensure complete studies for many less frequent forms of cancer.
• Standardization and uniform quality measures across studies will enable the merging of datasets, increasing power to detect additional targets.
• The spectrum of many cancers varies across the world because of environmental, genetic and other causes.
• The ICGC will accelerate the dissemination of genomic and analytical methods across participating sites and the user community.
44. 44
International Cancer Genome Consortium (ICGC)
Goals
• Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe
• Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors
• Make the data available to the research community rapidly with minimal restrictions to accelerate research into the causes and control of cancer
50 tumor types and/or subtypes
500 tumors + 500 controls per subtype
50,000 Human Genome Projects!
Nature (2010) 464:993
54. 54
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data: Region of residence; Risk factors; Examination; Surgery; Radiation; Sample; Slide; Specific histological features; Analyte; Aliquot; Donor notes
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
ICGC OA Datasets
• Cancer Pathology: Histologic type or subtype; Histologic nuclear grade
• Patient/Person: Gender; Age range; Vital status; Survival time; Relapse type; Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and Loss of Heterozygosity
• Newly discovered somatic variants
http://goo.gl/w4mrV
55. 55
Secondary Goal: coordinate work to benefit productivity
http://goo.gl/K5mHC3
59. 59
Policy
ICGC membership implies compliance with Core Bioethical Elements for samples used in ICGC Cancer Projects:
http://goo.gl/TFrCmK
http://goo.gl/nYx6YG
60. 60
POLICY:
The members of the International Cancer Genome Consortium (ICGC) are committed to the principle of rapid data release to the scientific community.
http://goo.gl/TFrCmK
61. 61
Publication Policy
• The individual research groups in the ICGC are free to publish the results of their own efforts in independent publications at any time (subject, of course, to any policies of any collaborations in which they may be participating).
64. 64
Where do you find that information?
• We actually make it hard to find, but we are working on that! (This is an example of where ICGC would like to do what TCGA does!)
• http://cancergenome.nih.gov/publications/publicationguidelines
65. 65
Where do you find that information?
For ICGC data:
• Need to find the policy!
• http://icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy
• Find text:
• Find date: in README on FTP file
• This is bad, we know it, and we are fixing it!
• If in doubt, contact us: info@icgc.org
66. 66
Policy on Intellectual Property
• All ICGC members agree not to make claims to possible IP derived from primary data (including somatic mutations) and to not pursue IP protections that would prevent or block access to or use of any element of ICGC data or conclusions drawn directly from those data.
http://goo.gl/TCMXCl
68. 68
DCC Activities
DCC activities are split between two groups:
• Software Development
– DCC portal
– Submission tool
• Biocuration (which also includes Content
Management)
– Data level management
– Submitter “handling”
– Coordination with secretariat
– User support
http://dcc.icgc.org/team
72. 72
ICGC Biocuration
• Helping submitters get their data to ICGC
• Progress reporting (data audit)
• Quality checks (coverage, correctness, etc.)
• Helping users get to the data
• Validate and check (and recheck) metadata on public repositories
• Test and integrate with other public repositories via standard data formats, ontologies.
• Documentation, documentation, and more documentation
• Training
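The quality checks listed above (coverage, correctness, metadata validation) come down to checking submitted records against a data dictionary of required fields and controlled vocabularies. A hypothetical sketch — the field names and vocabularies below are illustrative, not the actual ICGC submission dictionary:

```python
# Validate submitted clinical records against a small data dictionary.
# Field names and vocabularies are invented for illustration; the real
# ICGC submission system validates against its published data dictionary.

REQUIRED = {"donor_id", "donor_sex", "donor_vital_status"}
VOCAB = {
    "donor_sex": {"male", "female", "unknown"},
    "donor_vital_status": {"alive", "deceased", "unknown"},
}

def validate(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - record.keys())]
    for field, allowed in VOCAB.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            errors.append(f"{field}: '{value}' not in controlled vocabulary")
    return errors

print(validate({"donor_id": "DO1", "donor_sex": "F"}))
# flags the missing vital status and the non-vocabulary sex code
```

In practice this kind of check runs at submission time, so errors are reported back to the submitting project before a dataset enters a release.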
73. 73
ICGC datasets to date
ICGC Data Portal Cumulative Donor Count for Member Projects
[Chart: number of donors, 0 to 14,000, by data release, Release 7 through Release 17.]
75. 75
Clinical Data Completeness
Overall Donor Clinical Data Completeness
[Bar chart: average percentage completeness of donor fields: Donor ID; Donor sex; Donor age at diagnosis; Donor age at last followup; Donor survival time; Donor diagnosis ICD-10; Donor interval of last followup; Disease status at last followup; Donor region of residence; Donor vital status; Donor tumour staging system at diagnosis; Donor tumour stage at diagnosis; Donor tumour stage at diagnosis supplemental; Donor relapse interval; Donor relapse type.]
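The "average percentage completeness" plotted here can be reproduced per field as the share of donors with a non-empty value. A sketch with invented records (the real computation runs over the submitted clinical files):

```python
# Compute per-field completeness (% of records with a non-empty value),
# as in the clinical-data-completeness charts. All records are invented.

def completeness(records: list, fields: list) -> dict:
    """Map each field name to the percentage of records that populate it."""
    return {
        f: 100.0 * sum(1 for r in records if r.get(f) not in (None, "")) / len(records)
        for f in fields
    }

donors = [
    {"donor_id": "DO1", "donor_sex": "female", "donor_survival_time": 420},
    {"donor_id": "DO2", "donor_sex": "male",   "donor_survival_time": None},
    {"donor_id": "DO3", "donor_sex": "male"},
]
print(completeness(donors, ["donor_id", "donor_sex", "donor_survival_time"]))
# donor_id and donor_sex are 100% complete; survival time only ~33%
```

Reporting this per project is one of the "progress reporting (data audit)" activities listed earlier.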
77. 77
Clinical Data Completeness
Overall Specimen Clinical Data Completeness
[Bar chart: average percentage completeness (0 to 100) of specimen fields: Donor ID; Specimen ID; Specimen type; Specimen type other; Specimen interval; Specimen processing; Specimen processing other; Specimen storage; Specimen storage other; Specimen available; Specimen donor treatment type; Specimen donor treatment type other; Specimen Biobank; Specimen Biobank ID; Tumour confirmed; Tumour histological type; Tumour grade; Tumour grade supplemental; Tumour grading system; Tumour stage; Tumour stage supplemental; Tumour stage system; Level of cellularity; Percentage cellularity; Digital image of stained section.]
79. 79
[Diagram: data flow among DACO, ICGC, cgHUB, EGA, ERA and TCGA; open and controlled BAM files; germline data + EGA id.]
80. [Diagram: ICGC BAM/FASTQ, TCGA BAM/FASTQ, ICGC Open Data (includes TCGA Open Data), and COSMIC Open Data.]
81. 81
Raw Data Availability at EGA by Project and Data Type
• https://www.ebi.ac.uk/ega/organisations/EGAO00000000024
92. 92
Highlights of the new portal: dcc.icgc.org
• Faceted search capabilities for variants, genes and donors
– Makes interactive data exploration fast and easy
• Mutation aggregation & counts across donors and cancers
– e.g. the number of pancreatic cancer donors with the KRAS G12D mutation
• Standardized gene consequences across all projects
• Genome browser
• Data download
• Protein domains
• Links to repositories
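The kind of faceted query described above (e.g. counting pancreatic cancer donors carrying a KRAS mutation) can also be built programmatically. The sketch below only assembles a filter object and query string in the general shape a faceted REST search accepts; the field names (`primarySite`, `symbol`) and the filter schema are illustrative assumptions, not the portal's documented API.

```python
import json
from urllib.parse import urlencode

def build_portal_query(primary_site: str, gene_symbol: str) -> str:
    """Build a hypothetical faceted-search query string for donors with
    a mutation in a given gene at a given primary site."""
    # Nested facet filter: restrict donors by primary site and genes by symbol.
    filters = {
        "donor": {"primarySite": {"is": [primary_site]}},
        "gene": {"symbol": {"is": [gene_symbol]}},
    }
    # Serialize the filter object as a JSON-encoded query parameter.
    return urlencode({"filters": json.dumps(filters)})

query = build_portal_query("Pancreas", "KRAS")
print(query)
```

The point of the sketch is the shape of the request: one JSON filter object combining facets, rather than separate ad hoc parameters, which is what makes faceted exploration composable.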
94. 94
• Summary
• Cancer type distribution
• Other links (COSMIC, Entrez, etc.)
• Mutation profile in protein
• Domains
• Genomic Context
• Mutation profile
• Most common mutations
99. 99
Donor
• Donor ID
• Primary site
• Cancer Project
• Gender
• Tumor Stage
• Vital Status
• Disease Status
• Release type
• Age at diagnosis
• Available data types
• Analysis types
107. 107
[Diagram: BIG DATA. Raw data, metadata, and interpreted data each pass validation ✔]
108. 108
[Diagram: raw data flow. ICGC BAM files and germline data (+ EGA id) deposited at EGA/ERA; TCGA BAM files at dbGaP; open data available directly; controlled access mediated by DACO]
109. 109
ICGC Data Categories
ICGC Open Access Datasets
• Cancer Pathology: histologic type or subtype; histologic nuclear grade
• Donor: gender; age range
• RNA expression (normalized)
• DNA methylation
• Genotype frequencies
• Somatic mutations (SNV, CNV and structural rearrangement)
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome Data: patient demography; risk factors; examination; surgery/drugs/radiation; sample/slide; specific histological features; protocol; analyte/aliquot
• Gene expression (probe-level data)
• Raw genotype calls (germline)
• Gene-sample identifier links
• Genome sequence files
Most of the data in the portal is publicly available without restriction. However, access to some data, like the germline mutations, requires authorization by the Data Access Compliance Office (DACO).
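The open vs. controlled split above can be encoded as a simple lookup, which is handy when scripting bulk downloads and deciding which categories need DACO authorization first. This is a minimal sketch: the `requires_daco` helper and the lowercase category keys are illustrative, not part of any ICGC tooling.

```python
# Hypothetical helper mapping ICGC data categories to their access tier,
# following the open/controlled split shown on the slide.
OPEN_ACCESS = {
    "cancer pathology",
    "rna expression (normalized)",
    "dna methylation",
    "genotype frequencies",
    "somatic mutations",
}
CONTROLLED_ACCESS = {
    "detailed phenotype and outcome data",
    "gene expression (probe-level data)",
    "raw genotype calls (germline)",
    "gene-sample identifier links",
    "genome sequence files",
}

def requires_daco(category: str) -> bool:
    """Return True if the data category needs DACO authorization."""
    key = category.strip().lower()
    if key in CONTROLLED_ACCESS:
        return True
    if key in OPEN_ACCESS:
        return False
    raise ValueError(f"Unknown ICGC data category: {category}")

print(requires_daco("Genome sequence files"))  # -> True
print(requires_daco("DNA methylation"))        # -> False
```

Raising on unknown categories (rather than defaulting to open) errs on the safe side for controlled-access data.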
112. 112
ICGC Open Access Datasets
• Cancer Pathology
– Histologic type or subtype
– Histologic nuclear grade
• Patient/Person
– Gender, age range
– Vital status, survival time
– Relapse type, status at follow-up
• Gene expression (normalized)
• DNA methylation
• Computed copy number and loss of heterozygosity
• Newly discovered somatic variants
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome data
– Region of residence
– Risk factors
– Examination
– Surgery
– Radiation
– Sample
– Slide
– Specific histological features
– Analyte
– Aliquot
– Donor notes
• Gene expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
http://goo.gl/w4mrV
113. Identify yourself
Fill out a detailed form, which includes:
• Contact and project information
• Information technology details and procedures for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you print and have your institution sign off on your behalf.
121. 121
Bioinformatics Citizenship: What it means, and what does it cost?
Nature 409:452
122. 122
Important messages:
• The ICGC portal is evolving and getting better all
the time
• Lots of data provided by the ICGC
• Important to be good citizens of the scientific world
• The idea behind all of this is to provide tools to
help cure cancer
• Need to respect policies and guidelines
• There is help out there, and user feedback is
*always* welcome.
123. 123
Acknowledgments
DCC Software Developers
Vincent Ferretti
Daniel Chang
Anthony Cros
Jerry Lam
Brian O'Connor
Bob Tiernay
Stuart Watt
Shane Wilson
Junjun Zhang
ICGC Project leaders
at the OICR:
Tom Hudson
John McPherson
Lincoln Stein
Jared Simpson
Paul Boutros
Vincent Ferretti
Francis Ouellette
Jennifer Jennings
http://oicr.on.ca http://icgc.org
Ouellette Lab
Michelle Brazas
Emilie Chautard
Nina Palikuca
Zhibin Lu
Web Dev
Joseph Yamada
Angela Chao
Daniel Gross
Kamen Wu
Kim Cullion
Miyuki Fukuma
Wen Xu
Pipeline Development
& Evaluation
Morgan Taschuk
Michael Laszloffy
Peter Ruzanov
ICGC DCC Biocuration
Hardeep Nahal
Marc Perry
Research IT/Systems
David Sutton,
Bob Gibson
Sam Maclennan
David Magda
Rob Naccarato
Brian Ott
Gino Yearwood
EGA
Justin Paschall
Jeff Almeida-King
Ilkka Lappalainen
Jordi Rambla De Argila
Marc Sitges Puy
… and all the patients and their families who are putting their hopes into our work!