The document provides information about a workshop on cancer genomic databases, including The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and the Catalogue of Somatic Mutations in Cancer (COSMIC). It summarizes the goals, data access, and analysis tools available for each database. It also discusses controlled access vs open data and the process for applying for access to controlled TCGA and ICGC genomic and clinical data.
This document provides permissions for sharing and reusing the content of a presentation. It states that the presentation can be:
1) Copied, shared, adapted, or remixed.
2) Photographed, filmed, or broadcast.
3) Blogged about, live-blogged, or have videos posted.
As long as the work is attributed to its author and respects any rights and licenses associated with its components. One slide was created by Cameron Neylon and is available under a CC0 license. Social media icons were adapted from another source with permission.
International Cancer Genome Consortium (ICGC) Data Coordinating Center - Neuro, McGill University
The document is a presentation slide deck for the International Cancer Genome Consortium (ICGC) Data Coordinating Center (DCC) given on November 14th 2013. It provides an overview of the ICGC, including its goals to catalog genomic abnormalities in 50 different cancer types using comprehensive genome, transcriptome, methylome, and clinical data analysis. It describes the activities of the ICGC DCC, which provides tools and infrastructure for data uploading, tracking, quality control, and distribution. The DCC aims to make ICGC data accessible and useful to researchers through search and analysis capabilities on its data portal.
Presentation at the Canadian Cancer Research Conference satellite bioinformatics.ca workshop. This one is an introduction to the TCGA, ICGC, and COSMIC databases.
This document provides a status update and overview of the International Cancer Genome Consortium (ICGC). The ICGC aims to sequence 500 tumor/normal pairs from each of 50 different cancer types to identify genomic changes and make the data available for research. It coordinates cancer genome projects internationally to maximize data collection while minimizing duplication of effort. The ICGC has established policies for data access, publication, and intellectual property. To date it has sequenced over 12,000 cancer genomes through 55 projects across 18 jurisdictions. The ICGC Data Coordination Center manages data submission and access and provides portals and tools for searching and accessing datasets.
Presentation to the Department of Biology at the University of Windsor, Windsor, Ontario. The description and update of activities related to the International Cancer Genome Consortium (ICGC)
Cancer genome databases & Ecological databases - Waliullah Wali
Introduction
Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis.
Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.
Cancer genome databases
COSMIC cancer database
COSMIC is an online database of somatically acquired mutations found in human cancer.
The database is freely available.
Types of data
Expert curation data
Genome-wide screen data
Expert curation data
Manually input by COSMIC expert curators.
Consists of comprehensive literature curation with subsequent updates.
Includes additional data points relevant to each disease and publication.
Provides accurate frequency data because mutation-negative samples are specified.
Genome-wide screen data
Uploaded from publications reporting large-scale genome screening data or imported from other databases such as TCGA and ICGC.
Provides unbiased molecular profiling of diseases while covering the whole genome.
Provides objective frequency data by interpreting non-mutant genes across each genome.
Facilitates finding novel driver genes in cancer.
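Because genome-wide screens record mutation-negative samples as well as positive ones, a mutation frequency can be computed directly. A minimal sketch of that arithmetic (the counts are invented for illustration, not COSMIC data):

```python
def mutation_frequency(mutated, wild_type):
    """Fraction of tested samples carrying the mutation.

    Only meaningful when mutation-negative (wild-type) samples are
    recorded, as they are in genome-wide screen data; without them
    the denominator is unknown and frequencies are biased.
    """
    tested = mutated + wild_type
    if tested == 0:
        raise ValueError("no tested samples")
    return mutated / tested

# Hypothetical counts: 40 mutated samples out of 40 + 160 tested
print(mutation_frequency(40, 160))  # → 0.2
```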
Access the COSMIC cancer database by typing http://cancer.sanger.ac.uk/cosmic into the browser's address bar.
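The same search can be reached programmatically by constructing a query URL. A small sketch (the `/search?q=` path is an assumption based on the site's public search box, not a documented API; open the resulting URL in a browser):

```python
from urllib.parse import urlencode

COSMIC_BASE = "https://cancer.sanger.ac.uk/cosmic"

def cosmic_search_url(term):
    """Build a COSMIC search URL for a gene or mutation term.

    urlencode handles spaces and special characters, so terms like
    "TP53 R175H" become a valid query string.
    """
    return f"{COSMIC_BASE}/search?{urlencode({'q': term})}"

print(cosmic_search_url("BRAF"))
# https://cancer.sanger.ac.uk/cosmic/search?q=BRAF
```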
Searching Process
Examples
Ecological databases
Ecological databases are sources for finding ecological datasets and quickly determining the best ways to use them.
BioOne
DataONE
GEOBASE
BioOne
BioOne is a nonprofit publisher that aims to make scientific research more accessible.
BioOne was established in 1999 in Washington, DC.
BioOne is comprehensive and open-access.
It serves a community of over 140 society and institutional publishers, 4,000 accessing institutions, and millions of researchers worldwide.
Access the BioOne ecological database by typing http://www.bioone.org/ into the browser's address bar.
Biocuration activities for the International Cancer Genome Consortium (ICGC) - Neuro, McGill University
The document discusses biocuration activities for the International Cancer Genome Consortium (ICGC). It provides information on the goals of ICGC including comprehensively analyzing 50 different cancer types/subtypes and making the genomic and clinical data publicly available. It describes the types of data being collected, standards being developed for data access and sharing, and current status of datasets released.
Presentation for teaching faculty about resources, data, issues, and strategies for including personal genomics in the classroom, within the context of precision medicine as an overarching theme.
GenomeTrakr: Perspectives on linking internationally - Canada and IRIDA.ca - Fiona Brinkman
Talk at GenomeTrakr network meeting Sept 23 2015 in Washington DC. On Canada's open source Integrated Rapid Infectious Disease Analysis (IRIDA) bioinformatics platform - aiding genomic epidemiology analysis for public health agencies with planned open data release and linkage to GenomeTrakr. Discussed perspectives, challenges, solutions for getting more GenomeTrakr participation internationally.
Free webinar: Introduction to Bioinformatics for Biologists - Elia Brodsky
The Omics Logic Introduction to Bioinformatics program is a one-month online training program that provides an introduction to the field of bioinformatics for beginners. The program consists of six sessions taught by an international team of experts, covering topics like genomics, transcriptomics, statistical analysis, machine learning, and a final bioinformatics project. Participants will learn data analysis skills in Python and R and how to extract insights from multi-omics datasets with applications in biomedicine. The goal is to prepare students for data-driven research in life sciences through interactive lessons, coding exercises, and independent projects.
This document provides an overview of the November 2000 issue of JALA (Journal of the Association for Laboratory Automation). It describes the development of a novel robotic system for the New York Cancer Project biorepository in collaboration with the Medical Automation Research Center. The biorepository receives 50-100 blood samples per day which are processed robotically to extract, quantify, aliquot and store DNA, plasma and RNA to be accessible to investigators. The robotic system aims to provide rapid random access to the hundreds of thousands of DNA samples stored for high-throughput analysis in studies of gene-environment interactions and cancer risk.
A machine learning and bioinformatics approach was used to identify non-invasive miRNA biomarkers for early detection of non-small cell lung cancer (NSCLC). 13 miRNAs were found to be consistently underexpressed in NSCLC tissue, blood and serum across 4 datasets. Kaplan-Meier analysis showed 6 miRNAs had prognostic power. A random forest model identified a 3-miRNA panel (miR-320e, miR-103a, miR-526b) that detected NSCLC with 91.5% accuracy. These miRNAs were also prognostic for lung adenocarcinoma survival. An online tool called BiomarkerGenie was created to automate biomarker selection from omics data.
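The panel idea above can be illustrated with a toy majority-vote classifier. Everything here is invented for illustration (the study used a random forest, and these thresholds and expression values are not from its data); each panel miRNA "votes" tumor when its expression falls below a cutoff, since the candidate miRNAs were underexpressed in NSCLC:

```python
# Hypothetical per-miRNA underexpression cutoffs; a stand-in for the
# study's trained random-forest model, not the published parameters.
PANEL_THRESHOLDS = {"miR-320e": 2.0, "miR-103a": 3.5, "miR-526b": 1.2}

def classify(sample):
    """Return 'tumor' if a majority of panel miRNAs are underexpressed.

    `sample` maps miRNA name to an expression value on the same
    (assumed) scale as the thresholds.
    """
    votes = sum(
        1 for mirna, cutoff in PANEL_THRESHOLDS.items()
        if sample[mirna] < cutoff
    )
    return "tumor" if votes * 2 > len(PANEL_THRESHOLDS) else "normal"

print(classify({"miR-320e": 0.8, "miR-103a": 1.1, "miR-526b": 0.4}))  # tumor
print(classify({"miR-320e": 4.0, "miR-103a": 5.2, "miR-526b": 2.9}))  # normal
```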
The document outlines plans to transition the cBioPortal cancer genomics platform to an open source model with coordinated development between Memorial Sloan Kettering Cancer Center, Dana-Farber Cancer Institute, and Princess Margaret Cancer Centre. It discusses expanding usage, new features, funding options, and establishing an advisory committee. The goal is to build a sustainable open source community through collaborative development, additional funding, and engagement with users and potential contributors.
The Global Microbial Identifier (GMI) initiative - and its working groups - ExternalEvents
http://www.fao.org/about/meetings/wgs-on-food-safety-management/en/
The GMI initiative - and its working groups. Presentation from the Technical Meeting on the impact of Whole Genome Sequencing (WGS) on food safety management -23-25 May 2016, Rome, Italy.
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic... - Elia Brodsky
This workshop will address critical issues related to Transcriptomics data:
Processing raw Next Generation Sequencing (NGS) data:
1. Next Generation Sequencing data preprocessing:
Trimming technical sequences
Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
Conventional pipelines (looking at known transcripts)
Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
Principal Component Analysis
Clustering
4. Supervised analysis:
Differential expression analysis
Classification, gene signature construction
5. Gene set enrichment analysis
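The unsupervised step above, PCA on a samples-by-genes expression matrix, can be sketched with plain NumPy via SVD of the mean-centred matrix. The matrix here is synthetic (two groups of samples shifted apart), not one of the workshop datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log-expression matrix: 6 samples x 50 genes,
# two groups of 3 samples shifted apart by a constant offset.
genes = 50
group_a = rng.normal(0.0, 1.0, size=(3, genes))
group_b = rng.normal(0.0, 1.0, size=(3, genes)) + 3.0
X = np.vstack([group_a, group_b])

# PCA by SVD of the mean-centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_scores = U * S                     # samples projected onto the PCs
explained = S**2 / np.sum(S**2)       # variance fraction per component

# PC1 should separate the two groups and dominate the variance.
print(f"PC1 explains {explained[0]:.0%} of the variance")
```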
The workshop will include hands-on exercises utilizing public domain datasets:
breast cancer cell lines transcriptomic profiles (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
patient-derived xenograft (PDX) mouse model of tumor and stroma transcriptomic profiles (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Team: The workshops are designed by the researchers at the Tauber Bioinformatics Research Center at University of Haifa, Israel in collaboration with academic centers across the US. Technical support for the workshops is provided by the Pine Biotech team. https://edu.t-bio.info/a-critical-approach-to-transcriptomic-data-analysis/
Personal Genomes: what can I do with my data? - Melanie Swan
Biology evolved to be just good enough to survive, and genomics provides the critical next-generation toolkit for its greater exploitation. Genomics is already starting to be medically actionable and is likely to become increasingly useful over time. This presentation discusses how your genetic information is already useful today.
This document provides an introduction to bioinformatics. It defines bioinformatics as the analysis of large amounts of biological data, such as DNA sequences, using computer programs. It discusses how next-generation sequencing technologies are generating terabytes of nucleotide sequence data that is analyzed by automated computer programs. The document then provides examples of the types of biological data that is analyzed in bioinformatics, including DNA, RNA, protein sequences and their interactions. It also discusses some common programming languages and analysis techniques used in bioinformatics.
Introduction
Definition
History
Principle
Components of bioinformatics
Bioinformatics databases
Tools of bioinformatics
Applications of bioinformatics
Molecular medicine
Microbial genomics
Plant genomics
Animal genomics
Human genomics
Drug and vaccine designing
Proteomics
For studying biomolecular structures
In silico testing
Conclusion
References
Slides for the afternoon session on "Introduction to Bioinformatics", delivered at the James Hutton Institute on 20th and 29th May and 5th June 2014, by Leighton Pritchard and Peter Cock.
Slides cover introductory guidance and links to resources, theory and use of BLAST tools, and a workshop featuring some common tools and tasks.
DNA Testing: Living Longer Via Personal Genomics - Melanie Swan
This document summarizes a presentation on direct-to-consumer DNA testing and personal genomics. It discusses numerous applications of genomics including ancestry, health, athletic performance and aging. It also summarizes several direct-to-consumer genetic testing services and compares their costs, conditions analyzed, and data access. The presentation concludes by discussing future improvements in DNA sequencing technologies that could enable more affordable personal genome sequencing.
This document introduces bioinformatics and discusses some of its key concepts and applications. It defines bioinformatics as an interdisciplinary field that combines computer science, statistics and engineering to study and process biological data. It describes some basic cell components like DNA, RNA and proteins, and how genetics and the genetic code work. It also provides a brief history of bioinformatics, highlighting projects like the Human Genome Project. Finally, it outlines several applications of bioinformatics like phylogenetic analysis, drug design, microarray analysis and protein-protein interaction networks.
EG-CompBio presentation about Artificial Intelligence in Bioinformatics covering:
-AI (Types, Development)
-Deep Learning (Architecture)
-Bioinformatics Fields
-Input formats for AI
-AI Challenges in Biology
-Example: (Proteomics, Transcriptomics)
-Metagenomics: @ NU
-Taxonomic Classification
-Phenotype Classification
-How to begin in AI in Bioinformatics
Introducing Bioinformatics
Bioinformatics in the Big Data Era
How to get into Bioinformatics?
How to learn and practice Bioinformatics?
Bioinformatics Careers and Salaries Worldwide
Applications of Bioinformatics
Take-Home Messages
Bioinformatics can be applied to climate-smart horticulture in several ways:
1) It allows for crop improvement through comparative genomics between crop plants and model species to identify important genes.
2) It facilitates plant breeding by providing tools for genome analysis, marker identification, and rational gene annotation.
3) Stress-tolerant varieties can be developed by using bioinformatics databases like KEGG to identify pathways and genes involved in drought resistance.
Increasing demand and the use of high-quality samples, data and services place biobanks at the center of basic and applied research. The BBMRI-ERIC Quality Management Service (BBMRI.QM) is designed to help biobanks and researchers meet the highest quality standards for their research and meet the needs of their clients. This webinar will give you an insight into the service portfolio of BBMRI.QM and an overview of relevant European and international standards useful for research on human specimens.
CORBEL (http://www.corbel-project.eu) is an initiative of eleven new biological and medical research infrastructures (BMS RIs), which together will create a platform for harmonised user access to biological and medical technologies, biological samples and data services required by cutting-edge biomedical research. CORBEL will boost the efficiency, productivity and impact of European biomedical research.
This webinar took place on 6th December 2018 and is part of the CORBEL webinar series. A recording of the webinar is available through the CORBEL website:
https://www.corbel-project.eu/webinars/bbmri-eric-quality-management-services.html
For previous and upcoming CORBEL webinars see:
http://www.corbel-project.eu/webinars
This presentation summarizes advances toward completing the work described in the GBIF Work Programme Update 2016.
It was composed by different members of the GBIF Secretariat. This particular version was shared during the European Nodes Meeting in Lisbon on 19 April 2016.
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro... - CINECAProject
We live in an era of cloud computing. Many of the services in the life sciences are keenly planning cloud transformations, seeking to create globally distributed ecosystems of harmonised data based on standards from organisations like GA4GH. CINECA faces similar challenges, gathering cohort datasets from all over the globe, many of which are pinned in place, due to their size, legal restrictions, or other considerations. But is “bringing compute to the data” always the right choice? In this webinar, based on experiences from the Human Cell Atlas Data Coordination Platform and other projects from EMBL-EBI, we will explore the concept of “data gravity”: The idea that whilst there are forces that may hold data in one place, there are others that require it to be mobile. We’ll consider how effectively planning a cloud strategy requires consideration of the gravity of datasets, and the impact it may have on team skills required, incentives for good practice, and storage and compute costs.
The CINECA webinar series aims to discuss ways to address common challenges and share best practices in the field of cohort data analysis, as well as distribute CINECA project results. All CINECA webinars include an audience Q&A session during which attendees can ask questions and make suggestions. Please note that all webinars are recorded and available for posterior viewing. CINECA webinars include an audience Q&A session during which attendees can ask questions and make suggestions.
This webinar took place on 12th November 2020 and is part of the CINECA webinar series.
For previous and upcoming CINECA webinars see:
https://www.cineca-project.eu/webinars
This document discusses opportunities for using the open source cBioPortal platform in a commercial setting. It summarizes The Hyve's experiences supporting cBioPortal for the Center for Translational Molecular Medicine's TraIT project. The Hyve provides professional support for open source bioinformatics software like cBioPortal through software development, data services, consultancy, and hosting. For translation projects, The Hyve employs a phased approach including definition, pilot, implementation, and evaluation phases to implement cBioPortal and demonstrate its capabilities for data integration and analysis.
IRIDA: Canada’s federated platform for genomic epidemiology, ABPHM 2015 WHsiaoIRIDA_community
This document summarizes the IRIDA platform, a federated genomic epidemiology platform for Canada. IRIDA aims to bridge gaps between advances in genomic epidemiology and real-time application in public health. It is developing solutions such as building a user-friendly analysis platform, implementing security and role-based sharing of genomic data, and using ontologies to standardize inconsistent information representation and address the complexity of genomic data interpretation. The IRIDA platform is in beta testing and plans continued development and training workshops.
GCAT Update June 2013 @ The Clinical Genome ConferenceDavid Mittelman
GCAT (Genome Comparison and Analytic Testing) is a free online platform for benchmarking NGS methods and developing standards and metrics. It allows users to process sequencing data through different analysis tools and compare results. Since launching in April 2013, GCAT has been viewed over 20,000 times and has processed large amounts of sequencing data. New features continue to be added, including comparing variant calls to validation datasets and support for additional sequencing applications like RNAseq and de novo assembly. The goal is to accelerate adoption of NGS technologies by providing a common system for experimentation and validation of analysis methods.
ELIXIR Pilot Actions launched in 2014: Integration of BILS-ProteomeXchange us...Juan Antonio Vizcaino
This is a report of the ELIXIR pilot project performed by the EMBL-EBI (PRIDE and System teams), BILS and EUDAT. The title of the pilot project was: "Integration of BILS-ProteomeXchange using EUDAT resources".
This document summarizes Yves Sucaet's presentation on whole slide imaging and digital pathology. It discusses the history of digital pathology, how digital pathology can improve biobanks by allowing remote querying and analysis of virtual slides, and the future of intelligent querying of biobanks using digital pathology and bioinformatics tools. The presentation concludes by encouraging attendees to implement digital pathology workflows and continue the conversation around computational pathology.
IRIDA: Canada’s federated platform for genomic epidemiology William Hsiao
This document summarizes the IRIDA platform, a federated genomic epidemiology platform for Canada. IRIDA aims to (1) build a user-friendly analysis platform to process genomic data, (2) enable more efficient information sharing between public health agencies, and (3) standardize inconsistent information representation through the use of ontologies. The platform is a partnership between various public health and academic institutions to bridge gaps between genomic research and applications in public health outbreak investigations.
Digital pathology and its importance as an omics data layerYves Sucaet
Bioinformatics and pathology are obvious scientific partners. Bioinformatics often takes places at the most basic (almost chemical, or even physical) level of life, but much of its procedures to obtain data are destructive. Pathology on the other hand takes place at a much more coarse level of data acquisition (usually where the physical properties of visible light end), but has the advantage of being rooted in the tradition of medicine. The traditional paradigm of pathology is "tissue is the issue". Morphology (exactly the component that often gets overlooked in bioinformatics) plays a large role and helps millions of patients each year around the world. Pathology is proven technology, bioinformatics is limited to niche applications.
With the development of whole slide imaging technology some twenty years ago, digital pathology became possible. Observations that used to be for the eyes of the pathologist only, could now be captured and translated into high-resolution pixels, and studied by and communicated to many. Many began to dream of automated tissue evaluation systems and AI-pathology, some even going as far as to suggest the replacement of the pathologist by intelligent computer systems.
Meanwhile in several areas of bioinformatics, new limits are being hit. Yes, we can do high-throughput experiments, but noisy datasets are often the results, (inter- and even intra-observer) replicability is difficult, and statistics only offer limited relief.
The goal of this introductory lecture is to highlight the problems as well as opportunities for both fields of study, and how exchange of experiences, and (in a later stadium) integration of techniques close the scientific gap that still exists in a great many areas.
There is no lack of pathology-centric workshops that offer insights into the world of algorithms. With the CPW event however, we take another approach. We want to bring together the most advanced groups in digital pathology, with the bioinformatics community, to explore the opportunities that exist on both sides of the fence.
We start by explaining the basic data types that are introduced by digital pathology. We also explain where they come from, and why this presents unique challenges when it comes to data mining and image analysis. Finally, we introduce PMA.start, a free software environment that can be used to universally gain access to digital pathology (imaging) data.
Bioinformatics groups can help quantify, model, and reduce morphological whole tissue data. Pathologists can help interpret and explain heterogeneous high-throughput datasets. And the first seeds of such collaboration can be planted right here, in Athens.
PRIDE is a proteomics database at EMBL-EBI that stores mass spectrometry-based proteomics data, including peptide and protein identifications and quantifications. It is part of the ProteomeXchange consortium, which aims to facilitate standardized data submission and dissemination between proteomics repositories. The document outlines the types of data stored in PRIDE, how to access and submit data, and tools for data conversion and visualization like PRIDE Converter 2 and PRIDE Inspector.
CIP has implemented GRIN-Global to manage its large potato and sweetpotato genebank. It has migrated passport and characterization data for over 50,000 accessions. The new system provides public search access and supports internal distribution management. CIP developers created custom wizards and tools to improve usability and enable tasks like batch updates. Remaining challenges include improving documentation, supporting mobile access, and continuing data curation during the migration process.
This document provides updates on several collaborative health IT projects in New Zealand, including:
1) The e-Labs project which aims to enable electronic ordering and sharing of lab test results between primary and secondary care.
2) The Consumer Health Portal proof of concept which created a personal health record for patients that integrated various online health tools and services.
3) The announcement of a new "Collaborate 2 Innovate" initiative seeking proposals for innovative collaborative health IT projects in New Zealand.
This presentation was provided by Violeta Ilik of Northwestern University during the NISO Virtual Conference held on Feb 15, 2017, entitled Institutional Repositories: Ensuring Yours is Populated, Useful and Thriving. The DOI for this presentation is http://dx.doi.org/10.18131/G3VP6R
This document discusses the use of linked data in industry. It provides examples of how the BBC, Volkswagen, and various government agencies are publishing open data using linked data approaches. It also discusses the potential for linked data in life sciences and healthcare, including a translational medicine platform for Alzheimer's disease. Semantic web projects in these domains aim to integrate data from distributed sources to answer complex queries. The challenges of big data in genomics are also mentioned, as well as the role of "data marketplaces" and platforms that enable access and integration of diverse biomedical datasets.
This document provides an overview of next generation sequencing (NGS) analysis. It discusses various NGS platforms such as Illumina, Roche 454, PacBio, and Ion Torrent. It also covers common file formats for sequencing data like FASTQ, quality control measures to assess data quality, and applications of NGS such as RNA-seq and ChIP-seq. The document aims to introduce researchers to basic concepts in NGS analysis and highlights available resources for storing and analyzing large sequencing datasets.
In this talk I'll discuss work in biomedical image and volume segmentation and classification, as well as outcome prediction modeling from insurance claims data that I've pursued at LifeOmic here in the Triangle. In the former case datasets include radiological image volumes, retinal fundus images, and cell images created with fluorescent microscopy. The latter includes MIMIC-III data represented as FHIR objects. I'll discuss the relative challenges and advantages of doing ML locally vs. on a cloud-based platform.
A global integrative ecosystem for digital pathology: how can we get there?Yves Sucaet
Digital pathology has many faces. Its stakeholders can roughly be classified into four categories: education, research, clinical, and clinical research. We come together at events like Pathology Informatics or Pathology Visions, and discuss the evolution of the field.
While progression is being made, it sometimes appears that around every corner are more challenges and forks in the road. New applications and scenarios emerge at a rapid pace, and it is clear that a single one-size-fits-all type of software is unlikely to satisfy most participants in this space, if any.
At the institutional level, ecosystems of digital pathology have already been established. At a national level, attempts are being made. At a global level, this is still a wide open question, but one very much worth exploring.
Digital pathology comes with some unique properties, like the data it generates and the pace at which this happens. This guest lecture then will examine the solutions that already exist, and what an inclusive global scalable digital pathology ecosystem may look like in the future.
Similar to Cancer uk 2015_module1_ouellette_ver02 (20)
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Immersive Learning That Works: Research Grounding and Paths ForwardLeonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Anti-Universe And Emergent Gravity and the Dark UniverseSérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Cancer uk 2015_module1_ouellette_ver02
1. Provided to you by the Canadian Bioinformatics Workshop series
www.bioinformatics.ca
NCRI Cancer Conference: Cancer data and its analysis practical workshop
November 1, 2015
3. bioinformatics.ca
NCRI Workshop 2015
NCRI Workshop 2015 – Module 1
You are free to:
• Copy, share, adapt, or re-mix;
• Photograph, film, or broadcast;
• Blog, live-blog, or post video of;
this presentation, provided that you attribute the work to its author and respect the rights and licenses associated with its components.
Slide concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights; this slide only is CC0.
Social media icons adapted with permission from originals by Christopher Ross; the original images are available under the GPL at:
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
7. Schedule for Module 1: Cancer Genomic Databases
• Introduction to the Canadian Bioinformatics Workshop series
• The Databases:
  – The Cancer Genome Atlas (TCGA)
  – The International Cancer Genome Consortium (ICGC)
• Data Access: human genomes and security and privacy issues:
  Open Data vs. Controlled Access data
• Another Database:
  – The Catalogue of Somatic Mutations in Cancer (COSMIC)
10. Workshops planned for 2016:
http://bioinformatics.ca/workshops
1. Bioinformatics for Cancer Genomics
2. High-throughput Biology: From Sequence to Networks (2017 - CSHL)
3. Introduction to R
4. Exploratory Analysis of Biological Data using R
5. Informatics for RNA-sequence Analysis
6. Informatics on High Throughput Sequencing Data
7. Pathway and Network Analysis of -omics Data
8. Informatics and Statistics for Metabolomics
9. Analysis of Metagenomic Data
10. How to Work in the Cloud: Computing on Human Genome Data
11. Epigenomic Data Analysis
12. Big Data in Precision Genomics
13. Soap-box time!
• Open Access, Open Data, and Open Source are essential for good science.
• Openness is a responsibility, an obligation, and something that comes with the privilege of doing publicly funded work.
Open Access · Open Source · Open Data · OpenCourseWare
15. "Cancer therapy is like beating the dog with a stick to get rid of his fleas."
– Anna Deavere Smith, Let Me Down Easy
17. "The revolution in cancer research can be summed up in a single sentence: cancer is, in essence, a genetic disease."
– Bert Vogelstein
18. Cancer: a Disease of the Genome
Challenge in Treating Cancer:
Every tumour is different
Every cancer patient is different
22. TCGA
The Cancer Genome Atlas is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.
23. About the TCGA
• A joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI)
• Phased structure:
  – Three-year pilot launched in 2006 with an investment of $50 million from each institute
  – TCGA will collect and characterize more than 20 additional tumour types
25. Division of Labour
• Biospecimen Core Resource (BCR)
  – centre where samples are carefully catalogued, processed, quality-checked, and stored along with participant clinical information
• Genome Sequencing Centre (GSC)
  – uses high-throughput methods to identify changes to DNA sequences that are associated with specific cancer types
• Genome Characterization Centre (GCC)
  – uses high-throughput technologies to analyze genomic changes involved in cancer
• Genome Data Analysis Centre (GDAC)
  – provides novel informatics tools to the research community
  – provides analysis results using TCGA data
• Data Coordinating Centre (DCC)
  – central provider of TCGA data
  – standardizes data formats and validates submitted data
26. TCGA Data
• Sequence reads from newer sequencing technologies are available at the Cancer Genomics Hub: https://cghub.ucsc.edu/
• Higher-level sequence data (variant calls and abundance measures) are available at the TCGA Portal: http://cancergenome.nih.gov/
• Also integrated with ICGC data (more on this later)
28. Data Coordinating Centre
• Plays a central role:
  – receiving data from the BCR, GSC, and GCC sites
  – providing access to users
  – performing analysis of data
• Responsibilities:
  – protecting participant privacy and confidentiality
  – developing data standards and controlled vocabularies
  – establishing informatics pipelines for data flow
  – developing new analytical and visualization technologies to facilitate data analysis, for all audiences
29. TCGA DCC Data Portal
• Provides a platform to search, download, and analyze TCGA data sets
• Two data access tiers: Open and Controlled
• Analytic tools include: Cancer Molecular Analysis and Cancer Genome Workbench (NCI), Integrative Genomics Viewer (Broad), and Cancer Genomics Analysis (MSKCC)
30. TCGA Data Browser
https://tcga-data.nci.nih.gov/tcga/
Query TCGA data online using the TCGA Data Browser.
31. The International Cancer Genome Consortium (ICGC)
• http://www.icgc.org/
• "ICGC was launched to coordinate large-scale cancer genome studies in tumours from 50 different cancer types and/or subtypes that are of clinical and societal importance across the globe"
43. Differences between ICGC & TCGA
• Different tumour types
• Different geographic rules (many countries vs. one jurisdiction)
• Different definitions of what is controlled
• Different data access rules
44. ICGC Controlled Access Datasets:
• Detailed phenotype and outcome data
• Gene expression (probe-level data)
• Raw genotype calls
• Gene–sample identifier links
• Genome sequence files
• Germline variants

ICGC Open Access Datasets:
• Cancer pathology: histologic type or subtype; histologic nuclear grade
• Patient/person: gender, age range, vital status, survival time, relapse type, status at follow-up
• Gene expression (normalized)
• DNA methylation
• Computed copy number and loss of heterozygosity
• Somatic variants from exome or WGS
http://goo.gl/w4mrV
45. TCGA Controlled Access Datasets:
• Primary sequence data (BAM and FASTQ files)
• SNP6 array level 1 and level 2 data
• Exon array level 1 and level 2 data
• Somatic variants from whole genome sequencing
• Certain information in MAFs
• A full list of controlled-access data types can be found at: http://goo.gl/K1h7zu

TCGA Open Access Datasets:
• De-identified clinical and demographic data
• Gene expression data
• Copy number alterations in regions of the genome
• Epigenetic data
• Summaries of data compiled across individuals
• Anonymized single-amplicon DNA sequence data
• Somatic variants from scrubbed exome sequencing
http://goo.gl/A1rMRB
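The open/controlled split above can be sketched as a simple lookup table. This is purely illustrative: the tier assignments follow the slide, but `TCGA_ACCESS_TIERS` and `access_tier` are hypothetical names, not part of any TCGA tooling.

```python
# Illustrative only: tier assignments follow the slide above;
# the dict and function names are our own, not a TCGA API.
TCGA_ACCESS_TIERS = {
    "primary sequence data (BAM/FASTQ)": "controlled",
    "SNP6 array level 1/2": "controlled",
    "somatic variants from whole genome sequencing": "controlled",
    "de-identified clinical data": "open",
    "gene expression data": "open",
    "copy number alterations": "open",
}

def access_tier(data_type: str) -> str:
    """Return 'open' or 'controlled' for a known data type, else 'unknown'."""
    return TCGA_ACCESS_TIERS.get(data_type, "unknown")

print(access_tier("gene expression data"))               # open
print(access_tier("primary sequence data (BAM/FASTQ)"))  # controlled
```

The point of the sketch is simply that the tier, not the data type alone, determines which download path and agreements apply.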
46. TCGA/ICGC users agreed:
• … to keep all computer systems on which controlled access data reside, or which provide access to such data, up to date with respect to software and security patches.
• … to protect Controlled Access Data against disclosure to unauthorized individuals.
• … to monitor and control which individuals have access to Controlled Access Data.
47. TCGA/ICGC users agreed:
• … to destroy all copies of controlled access data after controlled access privileges expire.
• … to only use secure transfer protocols, e.g. https and sftp.
• … to encrypt Controlled Access data in transfer and storage.
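The "secure transfer protocols only" rule above can be enforced mechanically, for instance by checking a URL's scheme before any transfer. A minimal standard-library sketch; `ALLOWED_SCHEMES` and `check_transfer_url` are illustrative names, not part of any TCGA/ICGC tool:

```python
from urllib.parse import urlparse

# Schemes named on the slide; illustrative policy, not an official list beyond these two.
ALLOWED_SCHEMES = {"https", "sftp"}

def check_transfer_url(url: str) -> bool:
    """True only if the URL uses an approved secure transfer scheme."""
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES

print(check_transfer_url("https://dcc.icgc.org/repository"))  # True
print(check_transfer_url("ftp://example.org/data"))           # False
```

A guard like this belongs at the start of any download script touching controlled-access endpoints, so a plain-text URL fails fast rather than silently violating the agreement.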
48. What does it mean for this file?
simple_somatic_mutation.aggregated.vcf.gz
https://dcc.icgc.org/repository/icgc/release_19/Summary
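The aggregated simple somatic mutation file is a gzipped VCF in the open-access tier, so it can be read directly with standard tooling. A self-contained sketch, assuming only the standard VCF layout; the two records below are made up for illustration:

```python
import gzip
import io

# A tiny made-up VCF fragment so the example is self-contained;
# a real simple_somatic_mutation.aggregated.vcf.gz is read the same way.
vcf_text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\tMU1\tA\tT\t.\t.\t.\n"
    "2\t67890\tMU2\tG\tC\t.\t.\t.\n"
)
buf = io.BytesIO()
with gzip.open(buf, "wt") as fh:   # stand-in for the downloaded .vcf.gz
    fh.write(vcf_text)
buf.seek(0)

records = []
with gzip.open(buf, "rt") as fh:   # for a real file: gzip.open(path, "rt")
    for line in fh:
        if line.startswith("#"):   # skip meta-lines and the column header
            continue
        chrom, pos, mut_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        records.append((chrom, int(pos), mut_id, ref, alt))

print(records[0])  # ('1', 12345, 'MU1', 'A', 'T')
```

Because this file sits in the open tier, no DACO approval is needed to run exactly this kind of script against it.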
51. Identify yourself
Fill out a detailed form, which includes:
• Contact and project information
• Information technology details and procedures for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you print and have your institution sign on your behalf.
63. DACO/DCC User Data Access Process
• Users approved through DACO are now automatically granted access to ICGC controlled access datasets available through the ICGC Data Portal and the EBI's EGA repository.
Flow: DACO Web Application → (application approved by DACO) → DCC User Registry → (user accounts activated) → DCC Data Portal and EBI EGA
64. Catalogue of Somatic Mutations in Cancer (COSMIC)
• http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/
• COSMIC is designed to store and display somatic mutation information and related details, and contains information relating to human cancers.
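Tabular exports of COSMIC-style mutation data can be summarized with the standard library, e.g. by tallying mutations per gene. The column names and rows below are assumptions for illustration, not the real COSMIC export schema:

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the spirit of a COSMIC mutation export;
# real exports carry many more columns and different header names.
tsv = """gene\tmutation\tprimary_site
TP53\tc.524G>A\tlung
KRAS\tc.35G>T\tpancreas
TP53\tc.743G>A\tbreast
"""

counts = Counter(
    row["gene"] for row in csv.DictReader(io.StringIO(tsv), delimiter="\t")
)
print(counts.most_common(1))  # [('TP53', 2)]
```

The same `csv.DictReader` pattern applies to any tab-separated download from the portals above, swapping in the file's actual header names.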
71. In closing
• Remember that all these sites have great amounts of documentation.
• The field is changing quickly, and so are the portals.
• New features are planned as we speak, so you need to use the sites and keep coming back.
• Don't be afraid to explore.
• Interested in learning more after today? Consider one of the bioinformatics.ca workshops!
72. Acknowledgements: the CBW gang
Michelle Brazas
Michael Stromberg
Marc Fiume
Michael Brudno