1. Joaquín Dopazo
Clinical Bioinformatics Area,
Fundación Progreso y Salud,
Functional Genomics Node, (INB-ELIXIR-es),
Bioinformatics in Rare Diseases (BiER-CIBERER),
Sevilla, Spain.
Workshop: Genomics and Cancer
Introduction
http://www.clinbioinfosspa.es
http://www.babelomics.org
@xdopazo, @ClinicalBioinfo
XXV Jornadas Nacionales de Innovación y Salud en Andalucía.
SEIS, Torremolinos, 14 June 2018
2. “Progress in science depends on new techniques, new discoveries and new ideas, probably in that order.” [1]
Sydney Brenner, Nobel Prize in Physiology or Medicine, 2002
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC139404/
Introduction
The technological revolution in sequencing has completely changed the rules of the game in biomedicine
3. What do we sequence?
Our DNA
• Each cell contains about 2 m of DNA made up of 3,000 million “letters” (the genome)
• All the information in the world would fit in a teaspoon of DNA
• Our DNA encodes about 20,000 genes
• Genes occupy only 4% of the DNA (the exome)
• Genes are our instruction manual
• When the instructions contain errors (mutations), the message is translated incorrectly
[Figure: transcription of the message; translation of the message]
4. How do we find disease-associated mutations?
A genetic disease is an error in the sequence of “letters” of the genome. It can be inherited or acquired (e.g. cancer).
Written out, a genome would fill about 2,000 books of 1.5 million letters each (roughly 200 pages per book). Reading one book a week, it would take about 38 years (2,000 weeks) to read it all.
All that information is present in each and every one of the 50 trillion cells of the body.
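The arithmetic behind the library analogy can be checked in a few lines. This is only a sketch using the slide's own figures (2,000 books, 1.5 million letters per book, one book per week); the 52 weeks per year constant and the function names are assumptions added for illustration.

```python
# Figures from the slide; WEEKS_PER_YEAR is the only added constant.
BOOKS = 2_000
LETTERS_PER_BOOK = 1_500_000
WEEKS_PER_YEAR = 52


def library_size(books: int = BOOKS, letters_per_book: int = LETTERS_PER_BOOK) -> int:
    """Total number of letters in the genomic 'library'."""
    return books * letters_per_book


def years_to_read(books: int = BOOKS, books_per_week: float = 1.0) -> float:
    """Years needed to read every book at the given pace."""
    return books / books_per_week / WEEKS_PER_YEAR


print(library_size())              # 3000000000
print(round(years_to_read(), 1))   # 38.5
```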
5. Just ONE or a FEW mutations cause many genetic diseases
Example: book 1129, page 163, 3rd paragraph, 5th line, 27th letter should be A instead of T
The challenge is to find the wrong “letter” among the 3,000 million letters in the 2,000 books of our genomic library
Solution: read it all
Problem: too much to read
6. Exome sequencing is being used systematically to identify genes for inherited diseases
7. The challenge: finding the mutation that causes the disease
Current massively parallel sequencers cannot read the genomic sequence directly. They read fragments of about 200 letters.
We have to infer the patient's sequence by comparing the 200-letter fragments with the whole library (alignment).
ATCCACTGG
CCCCTCGTA
GCGAAAAGC
We check whether the fragment is identical to the reference or carries a change (mutation).
9. Mutations change the meaning of the “words” of the genetic message
Genomes of the cells:
En un lugar de la Mancha, de cuyo hombre no quiero acordarme…
En un lugar de la Mancha, de cuyo hombre no quiero acordarme…
En un lugar de la Mancha, de cuyo hombre no quiero acordarme…
En un lugar de la Mancha, de cuyo hombre no quiero acordarme…
Sequencer reads:
En u | n lugar d | e la Manc | ha, de c | uyo ho | mbre no qu | iero acor | darme
En un lu | gar de la M | ancha, de c | uyo hom | bre no q | uiero aco | rdarme
En | un luga | r de la Ma | ncha, de cu | yo hombr | e no quie | ro acordar | me
En un lu | gar de la Man | cha, d | e cuyo h | ombre n | o quier | o acorda | rme
10. Locating the reads on the reference genome reveals what has changed (mutations)
Reads: [figure: the sequencer fragments stacked over their matching positions on the reference]
Reference genome: En un lugar de la Mancha, de cuyo nombre no quiero acordarme
“nombre” (name) changes to “hombre” (man)
The meaning of the message has changed, and that can have consequences (usually not good ones)
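The idea of placing reads on the reference and reporting what differs can be sketched as follows. This is a naive illustration, not how production aligners work: it slides each read along the reference, accepts the placement with the fewest mismatches (at most one, an assumed threshold), and reports differing letters. Positions are 0-based character offsets that include spaces and punctuation; the function names are hypothetical.

```python
REFERENCE = "En un lugar de la Mancha, de cuyo nombre no quiero acordarme"
# A few of the sequencer reads shown on the slides.
READS = ["gar de la Man", "e cuyo h", "yo hombr", "ombre n", "o quier"]


def best_position(read, reference, max_mismatches=1):
    """Return (offset, mismatches) of the best placement, or None if none fits."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


def call_variants(reads, reference):
    """Collect {position: (reference letter, observed letter)} over all placed reads."""
    variants = {}
    for read in reads:
        hit = best_position(read, reference)
        if hit is None:
            continue  # read could not be placed
        offset, _ = hit
        for i, letter in enumerate(read):
            if letter != reference[offset + i]:
                variants[offset + i] = (reference[offset + i], letter)
    return variants


print(call_variants(READS, REFERENCE))  # {34: ('n', 'h')}
```

Two independent reads support the same change, which is the kind of agreement that makes a call more credible than a single read would.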
11. Representing a genome: the VCF format
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Reference: En un lugar de la Mancha, de cuyo nombre no quiero acordarme
Genome: En un lugar de la Mancha, de cuyo hombre no quiero acordarme
Representation:
#CHROM POS ID REF ALT
1 26 . n h
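The toy record above can be reproduced with a short sketch. It assumes, as the slide's position 26 implies, that only letters are counted (spaces and the comma are skipped), and it emits only the five columns shown on the slide; a valid VCF file needs the full header and the QUAL, FILTER and INFO columns as well. The function names are hypothetical.

```python
def letters_only(text):
    """Keep alphabetic characters only, mirroring how the slide counts positions."""
    return [c for c in text if c.isalpha()]


def to_vcf_lines(reference, sample, chrom="1"):
    """Emit simplified five-column records for single-letter substitutions."""
    ref_letters = letters_only(reference)
    sample_letters = letters_only(sample)
    if len(ref_letters) != len(sample_letters):
        raise ValueError("this sketch only handles substitutions, not indels")
    lines = ["#CHROM\tPOS\tID\tREF\tALT"]
    # VCF positions are 1-based.
    for pos, (ref, alt) in enumerate(zip(ref_letters, sample_letters), start=1):
        if ref != alt:
            lines.append(f"{chrom}\t{pos}\t.\t{ref}\t{alt}")
    return lines


reference = "En un lugar de la Mancha, de cuyo nombre no quiero acordarme"
genome = "En un lugar de la Mancha, de cuyo hombre no quiero acordarme"
print("\n".join(to_vcf_lines(reference, genome)))
```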
12. Impact of sequencing on medicine: it drives the transition to precision medicine
Precision medicine is based on a better knowledge of phenotype-genotype relationships.
It requires a better way of defining diseases, by introducing genomic technologies into diagnostic procedures and treatment decisions.
[Figure: degree of personalization, from today to tomorrow]
• Intuitive Medicine: based on trial and error
• Empirical Medicine: identification of probabilistic patterns
• Precision Medicine: decisions and actions based on knowledge
13. Empirical medicine
Phase I: generation of knowledge
[Figure labels: patient, sequencing, variants, database query, therapy outcome, system feedback, knowledge database]
Genomic variants (biomarkers) can be quickly associated with precise diagnoses or therapy outcomes.
Initially the system will need a lot of feedback: this is the knowledge generation phase.
Genome studies enable knowledge generation.
14. Precision medicine.
Phase II: using the knowledge database
Patient
1) Genomic sequencing
2) Database of biomarkers
3) Therapy prediction
Genomic core facility, phase II
Clinician receives hints on possible prescriptions and therapeutic interventions
+ Other factors (risk, cost, etc.)
Diagnosis / Prescription
Pre-symptomatic:
• Genetic predisposition to acquired diseases
• Early diagnosis of genetic diseases
Symptomatic analysis:
• Diagnosis of acquired diseases
• Early cancer detection
• Therapeutic recommendations
15. The challenge of managing and interpreting genomic data
Automatic: biomarker found?
• Yes: report
• No: variant prioritization
16. Primary analysis and diagnosis
Initial QC (FASTQ file)
• Sequence cleansing
• Base quality
• Remove adapters
• Remove duplicates
Mapping + QC (BAM file)
• Mapping
• Remove multiple mapping reads
• Remove low quality mapping reads
• Realigning
• Base quality recalibrating
Variant calling + QC (VCF file)
• Calling and labeling of missing values
• Calling SNVs and indels (GATK) using 6 statistics based on QC, strand bias and consistency (poor-QC calls are converted to missing values as well)
• Create multiple VCF with SNVs, indels and missing values
Diagnosis
• Find the known diagnostic / therapeutic variant
• Variant annotation (function, putative effect, conservation, etc.)
• If a known variant is found: report
Depending on the disease, we can find a diagnostic variant in 20–70% of the cases.
17. Why annotate variants?
https://github.com/opencb/cellbase
CellBase (Bleda, 2012, NAR), a
comprehensive integrative database and
RESTful Web Services API, more than
250GB of data:
● Core features: genes, transcripts, exons, cytobands, proteins (UniProt), ...
● Variation: dbSNP and Ensembl SNPs, HapMap, 1000 Genomes, EVS, ExAC, etc.
● Pathogenicity indexes and conservation: SIFT, PolyPhen, CADD, PhastCons, phyloP, GERP, etc.
● Disease: ClinVar, OMIM, HGMV, COSMIC, etc.
● Functional: 40 OBO ontologies (Gene Ontology, HPO, etc.), InterPro, etc.
● Regulatory: TFBS, miRNA targets, conserved regions, etc.
● Systems biology: interactome (IntAct), Reactome database, co-expressed genes.
● Compared in testing against VEP: more than 99.999% similarity in consequence types
● Annotation tool of GEL
● More than 50000 genomes annotated so far
The information used for filtering is scattered across more than 40 sources, in different (and changing) formats. Each annotation involves tens of thousands of queries.
18. The challenge of managing and interpreting genomic data
Automatic: biomarker found?
• Yes: report
• No: variant prioritization
In a very high percentage of cases (more than 60%) no known mutation is found
19. Today we are still in Phase I, with little knowledge about the meaning of genomic variants
The challenge is to find the disease-causing mutation among all the variants
ATCCACTGG
CCCCTCGTA
GCGAAAAGC
ATCCACTGG
CCCCTCGTA
GCGAAAAGC
GCTATGGCG
ATTATCGGTA
CGACGTATC
GCTATGGCG
ATTATAGGTA
CGACGTATC
controls cases
20. Casablanca: “Round up the usual suspects.”
The prioritization process
An exome (the part of the genome that codes for genes) typically contains between 60,000 and 80,000 variants (and a genome between 1 and 2 million). Only one (or a few) of them are possible disease mutations.
The prioritization process is like a police investigation in which suspects with an alibi are ruled out.
A series of sequential filters reduces the list to a manageable number of candidates.
Agatha Christie
Do we have 40-60K suspects?
We need…
21. Heuristic filtering approach
An example with 3-Methylglutaconic aciduria syndrome
3-Methylglutaconic aciduria (3-MGA-uria) is a heterogeneous group of syndromes characterized by an increased excretion of 3-methylglutaconic and 3-methylglutaric acids.
WES with a consecutive filter approach is enough to detect the new mutation in this case.
22. Behind the scenes: the whole data analysis process
Primary analysis
Initial QC (FASTQ file)
• Sequence cleansing
• Base quality
• Remove adapters
• Remove duplicates
Mapping + QC (BAM file)
• Mapping
• Remove multiple mapping reads
• Remove low quality mapping reads
• Realigning
• Base quality recalibrating
Variant calling + QC (VCF file)
• Calling and labeling of missing values
• Calling SNVs and indels (GATK) using 6 statistics based on QC, strand bias and consistency (poor-QC calls are converted to missing values as well)
• Create multiple VCF with SNVs, indels and missing values
Prioritization
Diagnosis: automatic or based on prioritization. If no known mutations are found, then prioritization:
• Variant annotation (function, putative effect, conservation, etc.)
• Inheritance analysis (including compound heterozygotes in recessive inheritance)
• Filtering by frequency with external controls (Spanish controls, dbSNP, 1000g, 5500g) and annotation
• Multi-family intersection of genes and variants
• Network or pathway-based prioritization
• Report
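The sequential filtering behind prioritization can be sketched as a chain of predicates, each one discarding "suspects with an alibi". Everything here is illustrative: the `Variant` fields, the 1% frequency threshold, the `DAMAGING` set and the gene names are assumptions, not the filters or values used in the actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str          # e.g. "missense", "synonymous"
    control_frequency: float  # allele frequency in control populations
    in_all_affected: bool     # carried by every affected family member


# Hypothetical set of consequence types treated as potentially damaging.
DAMAGING = {"missense", "stop_gained", "frameshift", "splice_site"}


def prioritize(variants, max_frequency=0.01):
    """Apply the filters one after another, recording how many candidates survive each."""
    filters = [
        ("rare in controls", lambda v: v.control_frequency <= max_frequency),
        ("potentially damaging", lambda v: v.consequence in DAMAGING),
        ("segregates with disease", lambda v: v.in_all_affected),
    ]
    candidates = list(variants)
    trace = [("all variants", len(candidates))]
    for name, keep in filters:
        candidates = [v for v in candidates if keep(v)]
        trace.append((name, len(candidates)))
    return candidates, trace


variants = [
    Variant("GENE_A", "missense", 0.20, True),     # too common in controls
    Variant("GENE_B", "synonymous", 0.001, True),  # not damaging
    Variant("GENE_C", "stop_gained", 0.0, False),  # does not segregate
    Variant("GENE_D", "frameshift", 0.0, True),    # survives every filter
]
candidates, trace = prioritize(variants)
for step, remaining in trace:
    print(f"{step}: {remaining}")
```

Keeping the per-step counts makes it easy to see which filter does most of the work, and to relax one filter if the final list comes out empty.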
23. Phase I lessons learned: the importance of local variability in the prioritization process
We discovered some 12,000 “Spanish” polymorphisms not present in other databases. Filtering efficiency increases enormously when local population data are used.
24. The CSVS is a crowdsourcing project
Scenario: sequencing projects of healthy populations are expensive and funding bodies are reluctant to fund them
CSVS aim: to offer increasingly accurate information on the variant frequencies characteristic of the Spanish population
CSVS main use: frequency-based filtering of candidate variants
Main data source: sequencing projects of individual researchers (CIBERER and others)
Problem: most of the contributions correspond to patient exomes
Idea: patients of disease A can be considered healthy pseudo-controls for disease B (provided no common genetic background exists between A and B)
Beacon: CSVS has a Beacon server
http://csvs.babelomics.org/
Allelic population frequencies obtained from 1,600 exomes are currently available in CSVS
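The pseudo-control idea can be illustrated with a small frequency calculation that leaves out patients of the disease under study. This is a sketch under simplifying assumptions (diploid genotypes given as alternate-allele counts, one disease label per sample); the data structures and names are hypothetical and do not describe how CSVS is implemented. It also does not check the caveat that diseases A and B share no genetic background.

```python
def pseudo_control_frequency(alt_counts, sample_disease, query_disease):
    """Allele frequency among samples whose disease differs from query_disease.

    alt_counts: {sample_id: number of alternate alleles (0, 1 or 2)}
    sample_disease: {sample_id: disease label}
    Returns None when no pseudo-control is available.
    """
    alt_alleles = 0
    total_alleles = 0
    for sample, alt_count in alt_counts.items():
        if sample_disease[sample] == query_disease:
            continue  # patients of the queried disease cannot act as controls
        alt_alleles += alt_count
        total_alleles += 2  # diploid: two alleles per sample
    return alt_alleles / total_alleles if total_alleles else None


alt_counts = {"s1": 1, "s2": 0, "s3": 2, "s4": 0}
sample_disease = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}
# For a patient with disease B, samples s1, s2 and s4 act as pseudo-controls.
print(pseudo_control_frequency(alt_counts, sample_disease, "B"))
```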
25. Challenge: how to share genomic information without releasing patient data: Beacon (GA4GH, Global Alliance for Genomics and Health)
How to avoid patient re-identification in beacons
Sharing anonymized, aggregated genomic data is important so that different research projects can generate knowledge. But it carries risks.
Simple solutions: release only mixtures of healthy and affected individuals.
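Conceptually, a beacon answers a single yes/no question: has this allele been observed in any genome held here? The sketch below is an in-memory illustration of that idea only. It is not the GA4GH Beacon API, and the class, method names and example variants are hypothetical. Pooling healthy and affected genomes before answering follows the "simple solution" above.

```python
class Beacon:
    """Minimal in-memory beacon: reveals presence of an allele, never who carries it."""

    def __init__(self, genomes):
        # genomes: iterable of sets of (chrom, pos, ref, alt) tuples,
        # pooled from healthy and affected individuals alike.
        self._alleles = set()
        for variants in genomes:
            self._alleles.update(variants)

    def query(self, chrom, pos, ref, alt):
        """Return True if at least one pooled genome carries the allele."""
        return (chrom, pos, ref, alt) in self._alleles


healthy = [{("20", 14370, "G", "A")}]
patients = [{("20", 14370, "G", "A"), ("20", 17330, "T", "A")}]
beacon = Beacon(healthy + patients)

print(beacon.query("20", 17330, "T", "A"))    # True
print(beacon.query("20", 1110696, "A", "G"))  # False
```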
26. Challenge: using genomic data in clinical practice: hiding the complexity
[Figure: circuit in eight numbered steps connecting the eHR, the corporate analysis request system, the Sequencing Unit, the Bioinformatics Area, clinical research, and diagnosis / therapy]
Legend:
I: patient's information
C: informed consent
G: patient's genome
D: high-precision diagnosis
K: knowledge
27. General diagnosis protocol for rare diseases
[Figure: decision flowchart with yes/no branches. Recoverable elements:]
• Sample QC
• Suspected diagnosis
• Disease panel (diagnostic variants, genes): known variants, known genes
• Decision points: variants found; VUS in known genes found; VUS everywhere found; VUS prioritization successful; expert validation
• Outcomes: report positive diagnosis; report negative diagnosis
• Unexpected findings: pharmacogenomics, actionable diseases, reproductive risk
• Annotations: 50-70%; < 1 min.
28. Beyond rare diseases diagnosis: Personalized Medicine in cancer
[Figure, current use of biomarkers: Biomarker 1 / Therapy 1 (1st line), Biomarker 2 / Therapy 2 (2nd line), Biomarker 3 / Therapy 3 (3rd line), …]
[Figure, enhanced use of biomarkers (prospective healthcare): genomic biomarkers lead to Therapy 1, Therapy 2, Therapy 3, biomarker drugs, new drugs or a clinical trial, and then to a result]
Patient genomic data analysis allows one-step association of biomarkers with therapies and enables the detection of new actionable biomarkers, or of clinical trials compatible with patients, saving time and cost and increasing treatment success.
29. The concept of virtual panel:
Sequence it all and observe what is pertinent today.
Keep genomic data for future pertinent new observations
Old panels in the archive (never deleted for traceability)
Gene(s) in the selected panel
Diagnostic variants in the gene(s) of the selected panel
Disease(s) in the selected panel
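A virtual panel can be thought of as a filter applied at query time over variants that were sequenced once and stored. The sketch below assumes a hypothetical in-memory archive of panel versions (never deleted, as the slide notes) and made-up gene names; it only illustrates why re-running a newer panel over stored data can surface variants that were not pertinent before.

```python
# Hypothetical archive of panel versions; older versions are kept for traceability.
PANEL_VERSIONS = {
    1: {"GENE_A", "GENE_B"},
    2: {"GENE_A", "GENE_B", "GENE_C"},  # GENE_C added after new knowledge
}

# Variants from a patient who was sequenced once and whose data were kept.
stored_variants = [
    {"gene": "GENE_A", "change": "c.100A>T"},
    {"gene": "GENE_C", "change": "c.250G>A"},
    {"gene": "GENE_Z", "change": "c.7del"},
]


def apply_panel(variants, panel_genes):
    """Return only the variants falling in genes of the selected panel."""
    return [v for v in variants if v["gene"] in panel_genes]


def newly_visible(variants, old_version, new_version):
    """Variants that a newer panel version reveals without re-sequencing."""
    old = apply_panel(variants, PANEL_VERSIONS[old_version])
    new = apply_panel(variants, PANEL_VERSIONS[new_version])
    return [v for v in new if v not in old]


print(apply_panel(stored_variants, PANEL_VERSIONS[1]))
print(newly_visible(stored_variants, 1, 2))
```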
30. Analysis workflow
Based on an intuitive sequence of steps that leads from loading the genomic data to generating the report
Once the samples have been loaded, they are ready for analysis
32. Analysis workflow
From the interpretation of the analysis (prioritized list), a report can be generated that includes information on filters, database versions, etc., for traceability.
Challenges: traceability and reproducibility. Knowledge databases change. Tomorrow we will have knowledge that we lack today, and it may change our conclusions.
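One way to picture traceability is a report that stores, next to the prioritized variants, the filters applied and the versions of the knowledge databases consulted, so that a later check can tell whether a reanalysis is due. The structure below is a hypothetical sketch, not the format of the actual reports; the sample identifier and version strings are invented for illustration.

```python
def build_report(sample_id, prioritized_variants, filters, database_versions):
    """Bundle results with the provenance needed to reproduce them."""
    return {
        "sample": sample_id,
        "variants": list(prioritized_variants),
        "filters": dict(filters),
        "database_versions": dict(database_versions),
    }


def outdated_databases(report, current_versions):
    """Names of databases whose version changed since the report was generated."""
    used = report["database_versions"]
    return sorted(
        name for name, version in current_versions.items()
        if used.get(name) != version
    )


report = build_report(
    sample_id="sample-001",                                   # invented identifier
    prioritized_variants=["1:26:n>h"],
    filters={"max_control_frequency": 0.01},
    database_versions={"ClinVar": "2018-05", "dbSNP": "151"},  # illustrative versions
)
current = {"ClinVar": "2018-06", "dbSNP": "151"}
print(outdated_databases(report, current))  # ['ClinVar']
```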
33. Front end: Personalized Medicine Module (MMP)
Sample selection
Variant prioritization
Selection of variants for the report
Report generation (sent to the eHR)
34. Backend: OpenCGA, a scalable storage and genomic data management platform
https://github.com/opencb/opencga
In collaboration with Genomics England (GEL)
Currently the fastest and most powerful genomic database engine in the world.
Used in the GEL for genomic data management
Extensive capabilities to query across genotype and phenotype relationships
Unique feature: population-level indexing (as opposed to the sample- or family-level indexing of most applications)
35. Reuse of genomic data (population-level indexing)
[Figure: VCFs feeding recurrences and population frequencies, controls and pseudo-controls: full exploitation of genomic data]
Sample-level indexing: easy to do and useful for diagnosis or precision treatment. Requires re-indexing for reanalysis.
Population-level indexing: more complex, but useful for precision medicine beyond simple diagnosis.
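The difference between the two indexing strategies can be shown with plain dictionaries. This is a conceptual sketch only and makes no claim about how OpenCGA stores data; the function names and toy variants are assumptions. A sample-level index answers "which variants does this patient carry?", while a population-level index answers "who carries this variant, and how recurrent is it?" without scanning every sample.

```python
from collections import defaultdict


def sample_index(vcfs):
    """Sample-level index: {sample: set of variants}."""
    return {sample: set(variants) for sample, variants in vcfs.items()}


def population_index(vcfs):
    """Population-level index: {variant: set of samples carrying it}."""
    index = defaultdict(set)
    for sample, variants in vcfs.items():
        for variant in variants:
            index[variant].add(sample)
    return dict(index)


def recurrence(index, variant):
    """Number of samples carrying the variant, read directly from the population index."""
    return len(index.get(variant, ()))


vcfs = {
    "patient1": ["1:26:n>h", "2:10:A>G"],
    "patient2": ["1:26:n>h"],
    "patient3": ["3:5:C>T"],
}
by_sample = sample_index(vcfs)
by_variant = population_index(vcfs)
print(recurrence(by_variant, "1:26:n>h"))  # 2
```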
36. GDPR compliance
The system has been designed to comply with EU and Spanish data protection regulation (GDPR)
• Clinicians requesting a genomic diagnosis have access to the eHR and get the result of the test.
• Geneticists have access to the eHR and can query the genomic data (but never extract them).
• IT staff have access to de-identified genomic data but not to the eHR.
37. Future vision involves big data integration: genomic data are especially relevant, but they are not the only useful big data
[Figure: clinic data linked through the MMP to the genome, digital pathology, medical image and other big data, feeding Study 1 … Study n]
• Other big data are being collected (medical images, digital pathology, wearable devices, etc.)
• Clinical data dynamically associated with different big data
• The whole health system becomes an enormous potential prospective clinical study
• Immense possibilities for data reuse
• Growing genomic DB with increasing study possibilities
38. Genomic and clinical data within the health
system enable Personalized Medicine
• Database of patients with prospective clinical information. Sequenced patients:
  • Will have different responses to treatments in the future
  • Can have other diseases in the future
  • Dynamic diagnosis of undiagnosed patients as knowledge databases update
  • Dynamic assignment of treatments for patients without therapeutic options as knowledge databases update
• Preventive medicine:
  • Dynamic discovery of pharmacogenomically relevant variants in sequenced individuals
  • Dynamic discovery of new risk variants in sequenced individuals
  • Dynamic discovery of reproductive risk variants
• Database of knowledge:
  • Prospective discovery of new biomarkers of response to drugs, therapies, prognosis, etc.
  • The pool of disease or risk variants is limited and could be surveyed soon
39. The real implementation of Personalized Medicine requires a model that integrates genomic data and universal eHR
[Figure: genome and clinic data linked through the MMP, feeding Study 1 … Study n]
• The whole health system becomes an enormous potential prospective clinical study
• Clinical data dynamically associated with genomic data
• Possibility of many clinical studies by reanalyzing genomic data from diverse perspectives (with no extra investment)
• Growing genomic DB with increasing study possibilities
40. Genomic initiatives are clinical studies but
not Personalized Medicine yet
[Figure: over time, each clinical study collects its own genome and clinic data]
• Each study requires a specific collection of genomic and clinical data into an external database
• Serious security concerns (genomic + clinical data outside the hospital)
• Static clinical data (e.g. if a control becomes a case, the external DB will not be updated)
• Limited reuse of genomic data for purposes other than the original study
• Model of GEL (100,000 Genomes), PERIS, RAREgenomics, etc.
41. External repository
[Figure: genomic and clinic data transferred, with risk, to an external repository that feeds Study 1 … Study n. Model used by 100,000 Genomes, genomic clouds, etc.]
• Risks associated with sensitive data transfer (data encryption, private lines, etc.)
• Clinical data must be homogenized across hospitals
• Clinical data must be kept updated to allow proper prospective clinical studies
• GDPR legal coverage must be implemented outside hospitals (consent management, ethics committees, etc.)
42. Federated data management
[Figure: Study 1 querying hospital DMSs. Risk: data encryption]
• The data management system (DMS) queries other hospital DMSs (advanced clinical Beacons), which return limited, specific genomic information
• Relative risk associated with data queries / transactions
• Clinical data homogenization at DMS level
• GDPR controlled at DMS level
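The federated model can be sketched as a coordinator that asks each hospital system a narrow question and receives only an aggregate answer. The classes and names below are hypothetical and ignore transport, encryption and consent handling; the point is simply that raw genomes stay inside each hospital and only counts cross the boundary.

```python
class HospitalDMS:
    """Hypothetical hospital data management system; raw genomes never leave it."""

    def __init__(self, name, carriers_by_variant):
        self.name = name
        # {variant: set of local patient identifiers}, kept private to the hospital.
        self._carriers_by_variant = carriers_by_variant

    def count_carriers(self, variant):
        # Only an aggregate count is returned, never patient identifiers.
        return len(self._carriers_by_variant.get(variant, ()))


def federated_count(nodes, variant):
    """Query every hospital and combine the limited answers."""
    per_hospital = {node.name: node.count_carriers(variant) for node in nodes}
    return sum(per_hospital.values()), per_hospital


nodes = [
    HospitalDMS("hospital_1", {"1:26:n>h": {"p1", "p2"}}),
    HospitalDMS("hospital_2", {"1:26:n>h": {"p9"}, "3:5:C>T": {"p7"}}),
    HospitalDMS("hospital_3", {}),
]
total, per_hospital = federated_count(nodes, "1:26:n>h")
print(total, per_hospital)
```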
43. Corollary
• Clinical studies are useful for the first phase of personalized medicine
• Personalized medicine is more than using genomic data for diagnosis/treatment: it is the full prospective exploitation of patient genomic data linked to clinical data, enabling not only precision diagnosis/treatment but also preventive medicine (dynamic discovery of susceptibility or pharmacogenomic biomarkers) and enhanced clinical discovery.
• Personalized medicine requires a common genomic data repository
• Fully connected health systems with universal EHR have a competitive advantage in implementing proper personalized medicine practices.
• Unconnected health systems face serious challenges in fully exploiting genomic data beyond precision diagnosis/treatment
44. Clinical Bioinformatics Area
Fundación Progreso y Salud, Sevilla, Spain, and…
...the INB-ELIXIR-ES, National Institute of Bioinformatics
and the BiER (CIBERER Network of Centers for Research in Rare Diseases)
Follow us on Twitter: @xdopazo, @ClinicalBioinfo
https://www.slideshare.net/xdopazo/