How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao

How Can We Make Genomic
Epidemiology a Widespread Reality?
William Hsiao, Ph.D.
William.hsiao@bccdc.ca
@wlhsiao
BC Public Health Microbiology and Reference Laboratory
BCCDC Grand Round May 26 2015

Outline
• Part 1: What is genomic epidemiology and
why is it important for public health
microbiology?
• Part 2: What are the requirements to bring
genomic epidemiology to routine public
health practice?
– Introducing our project IRIDA as part of the
solution

Source: Peter Gleick, Scienceblogs.com

People
Place
Time
Source: Melanie Courtot

Molecular Epidemiology
• Laboratory generated biomarker results can
be correlated to epidemiological investigations
(People, Place, Time)
• Provides linkage based on common exposure
to the same pathogen at the molecular level
• Most tests detect one or a few of specific
biomarkers, representing a fraction of the
pathogens’ genetic information

Current Methods of Characterizing Foodborne
Pathogens in a Public Health Laboratory
• Growth characteristics
• Phenotypic panels
• Agglutination reactions
• Enzyme immunoassays (EIAs)
• PCR
• DNA arrays (hybridization)
• Sanger sequencing of marker genes
• DNA restriction
• Electrophoresis (PFGE, capillary)
Each pathogen is characterized by pathogen-specific methods, so every pathogen needs
its own workflow. Turnaround time (TAT): 5 min to weeks (sometimes months)
Source: Rebecca Lindsey

Genomic Epidemiology
Definition: Using whole genome sequencing data from
pathogens, together with epidemiological investigations,
to track the spread of an infectious disease

Why Genomic Epidemiology
• One technology (DNA sequencing) compatible with
many types of pathogens
• Capable of generating tens to thousands of high-quality
pathogen genomes within 1-7 days

Sequencing = lots of HQ Data
• Capture the pathogen’s entire genetic makeup
• Unbiased (~97-99+% of the genome captured using
common sequencing approaches)
• Significantly more data than traditional methods
• Allows higher resolution and higher sensitivity methods to
be applied
• Allows value-added evolutionary and functional
studies of the pathogens
– Virulence factors
– AMR genes

Sequencing cost continues to drop
$100M per human genome
$10K per human genome or $10 per bacterial genome

Variations in genomes = Basis of
Comparison
• Mutations
– Point mutations
– Small insertions and deletions (indels)
– Can change functions of a gene
• Recombination, deletion, and duplication
– Rearrange genes, can change expression
– Increase gene copy number
– Delete genes
• Horizontal gene transfer
– Acquiring genetic material from non-parental organism
• E.g. Antibiotic resistance / new toxins

SNP Analysis
• What is a SNP?
– A SNP (single nucleotide polymorphism) is DNA
sequence variation occurring when a single nucleotide
differs between two or more genomes
ATCGCGATATCATACGG
ATCGCAATATCATACGG
ATCGCGATATCATACGG
ATCGCGATATCATACGG
ATCGCAATATCATACGG
• SNPs can be created by point mutations but can
also be created by the insertion or deletion of
a single nucleotide
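
As a minimal sketch of the comparison above, the following Python finds the alignment columns where the five example sequences differ. The function name `snp_positions` is illustrative only, and the sketch assumes the sequences are already aligned and of equal length; production pipelines call variants from mapped reads rather than comparing strings.

```python
# Aligned example sequences copied from the slide above.
ALIGNED = [
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
]


def snp_positions(seqs):
    """Return 0-based alignment columns where at least two sequences differ."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({s[i] for s in seqs}) > 1]


if __name__ == "__main__":
    for pos in snp_positions(ALIGNED):
        alleles = sorted({s[pos] for s in ALIGNED})
        print(f"column {pos + 1}: alleles {'/'.join(alleles)}")
```

For these five sequences the only variable column is the sixth, where the base is either G or A.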

Why are SNPs useful?
• Silent mutations that do not change protein
sequences happen quite frequently due to
DNA replication errors => High Resolution
• SNPs occur across the whole genome and can
be detected from whole genome sequencing
=> Unbiased markers
• SNPs can also be used to infer phylogeny of
organisms
– More shared SNPs = more closely related
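
One way to put a number on "closely related" is a pairwise SNP distance: the count of aligned positions at which two isolates differ, so a smaller distance means more shared alleles. The sketch below assumes pre-aligned, equal-length sequences; the isolate names and the helper names `snp_distance` and `pairwise_distances` are hypothetical.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count aligned positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(isolates):
    """Map each unordered pair of isolate names to its SNP distance."""
    return {
        (n1, n2): snp_distance(isolates[n1], isolates[n2])
        for n1, n2 in combinations(sorted(isolates), 2)
    }


# Hypothetical isolates reusing the example sequences from the SNP slide.
ISOLATES = {
    "iso1": "ATCGCGATATCATACGG",
    "iso2": "ATCGCAATATCATACGG",
    "iso3": "ATCGCGATATCATACGG",
}

if __name__ == "__main__":
    for pair, dist in pairwise_distances(ISOLATES).items():
        print(pair, dist)
```

A distance matrix of this kind is a common starting point for distance-based trees, although model-based phylogenetic methods work from the alignment itself.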

SNP Minimum Spanning Tree – colored by Phage Type
PT8
PT4
PT13a
PT52
The most similar isolates are connected first => clustering them together
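
The "most similar isolates are connected first" rule matches Kruskal's algorithm, sketched below under the assumption that pairwise SNP distances have already been computed. The isolate names and distances are made up for illustration, and the tool used to draw the slide's tree may work differently.

```python
def minimum_spanning_tree(distances):
    """Kruskal's algorithm over a dict {(a, b): snp_distance}.

    Edges are visited from smallest to largest distance, and an edge is
    kept only if it joins two isolates that are not yet connected.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = []
    for (a, b), d in sorted(distances.items(), key=lambda kv: (kv[1], kv[0])):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            edges.append((a, b, d))
    return edges


# Hypothetical pairwise SNP distances between four isolates.
DISTANCES = {
    ("A", "B"): 2, ("A", "C"): 10, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 11, ("C", "D"): 1,
}

if __name__ == "__main__":
    for a, b, d in minimum_spanning_tree(DISTANCES):
        print(f"{a} - {b}: {d} SNPs")
```

In this toy data, C-D (1 SNP) and A-B (2 SNPs) are joined first, and the two pairs are then bridged by the shortest remaining edge, B-C (9 SNPs).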

SNP Minimum Spanning Tree – colored by outbreaks

Many phylogenetic trees based on SNPs
published to show clustering of outbreak cases
den Bakker et al. Emerg Infect Dis. 2014 Aug;20(8)
Non-related
cases
Outbreak
cases
Allard, M. et al. PLoS ONE 8 (1) 2013

Forces Driving Pathogen Genome Evolution
Specialization: “lean and mean”
New function can be derived through:
Gene expression can be turned on and off

Intra-cluster distances overlap with inter-cluster
distances
Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.

Different species have different clustering
distances
Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.

Genomics + Epidemiology
• Having genetic distance information alone
may not be enough to fully characterize
outbreaks
• Need to combine with epidemiological
investigations
• Using known clusters to establish
(sub-)species-specific genetic distance criteria
• Genomics can help connect previously
unlinked cases and uncover new cases
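
A rough sketch of how the two kinds of evidence might be combined: derive a distance cut-off from known outbreak clusters, then flag only those case pairs that fall under the cut-off and also have onset dates close together. Taking the largest within-cluster distance as the cut-off is an assumption made for illustration, not a validated criterion, and all case IDs, dates, and distances below are invented.

```python
from datetime import date


def distance_threshold(known_cluster_distances):
    """Assumed rule: largest pairwise distance seen inside known clusters."""
    return max(known_cluster_distances)


def candidate_links(cases, distances, threshold, max_days_apart):
    """Return case pairs that are close both genetically and in time."""
    links = []
    for (a, b), d in sorted(distances.items()):
        days = abs((cases[a]["onset"] - cases[b]["onset"]).days)
        if d <= threshold and days <= max_days_apart:
            links.append((a, b))
    return links


# Hypothetical cases and pairwise SNP distances.
CASES = {
    "c1": {"onset": date(2015, 3, 1)},
    "c2": {"onset": date(2015, 3, 5)},
    "c3": {"onset": date(2014, 6, 1)},
}
DISTANCES = {("c1", "c2"): 2, ("c1", "c3"): 3, ("c2", "c3"): 40}

if __name__ == "__main__":
    cutoff = distance_threshold([0, 1, 4, 3])  # distances within known clusters
    print(candidate_links(CASES, DISTANCES, cutoff, max_days_apart=30))
```

Here c1 and c3 are genetically close but nine months apart, so only c1-c2 is flagged; a pair like c1-c3 is the kind of case where an epidemiological follow-up decides whether the link is real.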

Each year, one in eight Canadians (or
four million people)
get sick with a domestically acquired
food-borne illness.
http://www.phac-aspc.gc.ca/efwd-emoha/efbi-emoa-eng.php

Whole Genome Sequencing of Foodborne
Pathogens Around the World
• UK: Public Health England has committed to sequencing all
Salmonella isolates submitted to its public health laboratory
• US FDA and CDC (supported by National Center for
Biotechnology Information) created a distributed network
of labs to utilize WGS for pathogen identification
https://publichealthmatters.blog.gov.uk/2014/01/20/innovations-in-genomic-sequencing/
http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm

Genome Canada Bioinformatics Competition: Large-Scale Project
“A Federated Bioinformatics Platform for
Public Health Microbial Genomics”
Our Goal
The IRIDA platform
(Integrated Rapid Infectious Disease Analysis)
An open source, standards compliant, high quality genomic epidemiology
analysis platform based on web technology to support real-time
(food-borne) disease outbreak investigations
www.IRIDA.ca

Partnership among public health agencies and academic institutes to bridge the gaps
between advancements in genomic epidemiology and application to real-life and
real-time use cases in public health agencies
- Project Team has direct access to state of the art research in academia
- Project Team is directly embedded in user organization
[Diagram labels: National Public Health Agency; Provincial Public Health Agency; Academic/Public]

IRIDA Project Phases
• Phase 1: genomics process and analysis pipeline to
produce categorical data (MLST and SNPs) suitable for
current epidemiological analysis – almost completed
• Phase 2: combine the categorical data with
epidemiological data (line list approach to replace
current Excel based approach) – in progress
• Phase 3: Develop IRIDA as an exploratory platform for
new ways of interpreting genomics data in light of
epidemiological and clinical data – in progress;
continuous process beyond current project

Interviews with key personnel to identify
barriers to implement genomic epidemiology in
public health agencies

GAP 1: PUBLIC HEALTH PERSONNEL
LACK TRAINING IN GENOMICS

Microbial genomics has been a valuable
research tool
• Helps us:
– understand microbial evolution
– understand pathogenesis
– create novel industrial processes
– create new laboratory tests
• Use historical isolates – not real time
• Use of laboratory strains – no associated rich
clinical and epidemiological metadata

Cultural and Practical Differences
Genomics Research Laboratory | Genomics Diagnostic Laboratory
Curiosity driven | Production / case driven
Exploratory analysis tolerated | Exploratory analysis discouraged
Reproducibility = other labs’ problem | Reproducibility critical
Tweaking protocols desirable | Stability in protocols desirable
Protocols don’t need to be validated | Protocols need to be validated
Novelty justifies the high cost of experiment | Conscious of cost per unit test; tests need to be scalable
How do we bridge the cultural and the practical differences?

Solution 1a: Build a User Friendly, high quality
analysis platform to process genomics data
• Carefully designed and engineered software platform is
just the starting point…
[Diagram components: User Interface; Security; File system; Metadata; Storage; Application logic; REST API; Workflow Execution Manager; Continuous Integration; Documentation]

• Easy to use interface hiding the technical details

Solution 1b: Build Portable and Transparent
Pipelines
• Use Galaxy as workflow engine – large
community support
• Retooled to address usability, security, and
other limitations
• Version Controlled Pipeline Templates
• Input files, parameters, and workflow are
sent to IRIDA-specific Galaxy for execution
• Results and provenance information are
copied from Galaxy
[Diagram components: IRIDA UI/DB; REST API; Shared File System; Galaxy (Assembly Tools, Variant Calling Tools, …); Workers]
1. Input files sent to Galaxy
2. Tools executed on Galaxy workers
3. Results downloaded from Galaxy
Source: Franklin Bristow
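
To illustrate what "results and provenance information" might contain, here is a small sketch that bundles a version-controlled pipeline template with its inputs and parameters, then attaches the result file names. It does not call Galaxy or IRIDA; every class, function, field, and file name below is hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineTemplate:
    """A version-controlled description of a workflow (hypothetical)."""
    name: str
    version: str
    tools: tuple


def build_submission(template, input_files, parameters):
    """Bundle what would be sent for execution, with input checksums."""
    return {
        "workflow": template.name,
        "workflow_version": template.version,
        "tools": list(template.tools),
        "parameters": dict(sorted(parameters.items())),
        "input_sha256": {
            name: hashlib.sha256(content).hexdigest()
            for name, content in sorted(input_files.items())
        },
    }


def attach_results(submission, result_files):
    """Return a provenance record linking results to their submission."""
    record = dict(submission)
    record["results"] = sorted(result_files)
    return record


if __name__ == "__main__":
    template = PipelineTemplate("snp-phylogeny", "0.1.0", ("assembly", "variant-calling"))
    submission = build_submission(
        template,
        {"sample1.fastq": b"@r1\nACGT\n+\nIIII\n"},
        {"min_coverage": 10},
    )
    print(attach_results(submission, ["tree.newick", "snps.tsv"]))
```

Recording the template version and input checksums alongside the results is what lets a later reader confirm which pipeline produced which output.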

Solution 1c: Start the training NOW!
• Canada’s National Microbiology Laboratory has hosted
genomic workshops for partners and collaborators
• At PHMRL, we have been conducting workshops to train
technologists and researchers on some common genomic
analysis tools
• IRIDA Project has dedicated funding for hosting workshops in
4Q of 2015 and 2016
• We would like to engage epidemiologists in future
training as well

GAP 2: INFORMATION SHARING IS
INEFFICIENT AND AD-HOC

Many Players in surveillance and outbreak –
ineffective information sharing
Source: M. Taylor, BCCDC
[Diagram nodes: Cases; Physicians; Frontline lab; Local public health dept.; Provincial laboratory; Provincial public health dept.; National laboratory]
[Diagram labels: Information; Bioinformatics and Analytical Capacities]

Many Systems used in Reporting Diseases –
require data re-entry and re-coding
[Diagram nodes: Cases; Physicians; Local laboratory; Local public health dept.; Provincial laboratory; Provincial public health dept.; National laboratory; National Ministry of Health]
[Reporting channels shown between nodes: Fax/Electronic; Fax; Phone/Fax; Electronic/Paper; Electronic/Fax/Phone; Mailing of Samples/Fax/Electronic]
Source: M. Taylor, BCCDC

Semantic Web
Credit: http://www.cs.rpi.edu/~hendler/
• Semantic web is a suitable technology framework to
organize and share arbitrary datasets

What’s the web?
• World-Wide-Web (WWW) is a platform where
– Information is distributed (CBC for news, Netflix
for Movies, etc.)
– Information is heterogeneous (text, video,
pictures)
– (relevant) Information is linked by hyperlinks
– Often, information is only human readable
– Often, information is incorrect
– Often, information is not attributed

What’s Semantic web?
• Semantic web inherits many of the (good) attributes of
WWW (distributed, open, heterogeneous, and linked)
• It’s designed to be:
– machine readable based on a common language of logic
– Linking information can be automated making data sharing
easier
– Easier to describe granular data
– Errors can be detected based on logical reasoning
– Information can be attributed and can be made to persist
– “Smart Web”

IRIDA uses semantic web technologies to
address information management issues
• Solutions:
– 2a: Localized Instance of federated databases
– 2b: Permission Control – authentication/authorization for
information sharing
– 2c: User role-based display of information

Solution 2a: Local/Cloud Instances and Data
Federation
• Data processing capacity pushed to data generating
labs
• Allow data sharing securely for enhanced analysis
• Eventually cultivating a culture of openness of data
sharing and collaborative development of tools

Authorization
Solution 2b: Security
• Local authorization per instance.
• Method-level authorization.
• Object-level authorization.
• Allow secure, fine grained and
flexible information sharing
controlled by data producer
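
The difference between method-level and object-level authorization can be sketched as follows: a decorator checks whether the caller's role may invoke the method at all, and the method body then checks whether this particular sample has been shared with the caller. The role names, dictionary fields, and function names are hypothetical and are not taken from the IRIDA code base.

```python
from functools import wraps


class PermissionDenied(Exception):
    """Raised when a user may not perform the requested action."""


def requires_role(*allowed_roles):
    """Method-level check: the caller must hold one of the allowed roles."""
    def decorator(func):
        @wraps(func)
        def wrapper(user, *args, **kwargs):
            if user["role"] not in allowed_roles:
                raise PermissionDenied(f"role {user['role']!r} may not call {func.__name__}")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@requires_role("technologist", "epidemiologist")
def read_sample(user, sample):
    """Object-level check: the data producer must have shared this sample."""
    if user["name"] not in sample["shared_with"]:
        raise PermissionDenied(f"{user['name']} has no access to {sample['id']}")
    return sample["data"]


# Hypothetical sample whose producer has shared it with one user only.
SAMPLE = {"id": "S-001", "shared_with": {"alice"}, "data": "ACGT"}

if __name__ == "__main__":
    print(read_sample({"name": "alice", "role": "technologist"}, SAMPLE))
```

Keeping the sharing list on the sample itself mirrors the slide's point that the data producer controls who sees what.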

Solution 2c: Role-based Dynamic Display driven
by Ontology
• Ontologies often lack a content management system (CMS)
• An Interface Model Ontology (IFM) can define a CMS for an
ontology
Source: Damion Dooley

IFM Interface View Permissions
Detailed View Restricted View
E.g. User role permissions control visibility and editing of content
Source: Damion Dooley

GAP 3: INFORMATION
REPRESENTATION IS INCONSISTENT

There are at least 74 different ways to
say “female” in the ENA database
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383942/

Solution 3a: Use Ontology
• Ontology: a way to describe types of entities
and relations between them
• Why use ontology
– Ontology is flexible and expandable
– Lower levels of expressivity (e.g. controlled vocabulary,
data dictionary) are heavy handed and show low level of
compliance and adoption
– Free text is used as an alternative, but it is not
computing friendly
– Ontology and semantic web technologies may be a
solution
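
To show why free text is not computing friendly, the sketch below maps a handful of free-text spellings onto one canonical term each. The variant lists are invented for illustration and are not the 74 forms reported for the ENA database; an ontology would add stable identifiers and relations on top of a mapping like this.

```python
# Hypothetical spelling variants for each canonical term.
SYNONYMS = {
    "female": {"female", "f", "femal", "woman"},
    "male": {"male", "m", "man"},
}


def normalize_sex(value):
    """Return the canonical term for a free-text value, or None if unknown."""
    key = value.strip().lower().rstrip(".")
    for term, variants in SYNONYMS.items():
        if key in variants:
            return term
    return None


if __name__ == "__main__":
    for raw in [" Female ", "F", "FEMALE.", "not recorded"]:
        print(repr(raw), "->", normalize_sex(raw))
```

Values that return `None` are the ones a curator would need to review, which is the kind of error detection a shared vocabulary makes possible.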

The Utility of Ontologies in Food-borne Investigations
Example:
Correlate PFGE type SSOXAI.0042 cases between 01 Mar 2015- 16 Mar 2015 with
Spinach → Leafy Greens → Produce → High-Risk Food Sources and Symptoms of Nausea
and Fever
Ontologist organizes how terms are related in a tree so one can search for terms at different
levels
Provides great information-resolving power!!
High-Risk Food
  Produce: Leafy Greens, Sprouts
  Poultry: Deli Meat, Nuggets
  Seafood: Fish, Shellfish
Source: Emma Griffiths
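
Searching "at different levels" can be sketched with a simple child-to-parent map: a case tagged with a specific food matches a query for any of its ancestors. The grouping of leaf terms under Produce, Poultry, and Seafood follows one reading of the slide's tree, and the case records are invented.

```python
# Child -> parent relations read from the slide's food hierarchy.
PARENT = {
    "Spinach": "Leafy Greens",
    "Leafy Greens": "Produce",
    "Sprouts": "Produce",
    "Deli Meat": "Poultry",
    "Nuggets": "Poultry",
    "Fish": "Seafood",
    "Shellfish": "Seafood",
    "Produce": "High-Risk Food",
    "Poultry": "High-Risk Food",
    "Seafood": "High-Risk Food",
}


def ancestors(term):
    """Return the chain of broader terms above `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def cases_exposed_to(query, cases):
    """Return IDs of cases whose food is the query term or a narrower term."""
    return [
        c["id"] for c in cases
        if c["food"] == query or query in ancestors(c["food"])
    ]


# Hypothetical case records.
CASES = [
    {"id": "case1", "food": "Spinach"},
    {"id": "case2", "food": "Shellfish"},
    {"id": "case3", "food": "Sprouts"},
]

if __name__ == "__main__":
    for query in ["Leafy Greens", "Produce", "High-Risk Food"]:
        print(query, "->", cases_exposed_to(query, CASES))
```

A query for "Produce" finds both the spinach and the sprouts case even though neither record contains the word "Produce".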

Many Domains of Knowledge are needed to describe
an outbreak investigation
Build On, Work With:
OBI
TypON
NGSOnto
NIAID-GSC-BRC core metadata
MIxS Ontology
NCBI Biosample etc
TRANS – Pathogen Transmission
EPO
Exposure Ontology
Infectious Disease Ontology
CARD, ARO for AMR
USDA Nutrient DB
EFSA Comp. Food Consump. DB
Example gaps to be filled:
Expand food ontology; expand CARD
AMR data with others.

Lab Checklist/Ontology
• Currently finishing a lab/genomics checklist
• Metadata Domains:
– Sample Collection
– Sample Source
– Environmental
– Lab Analytics
– Sequencing Process /QC
– Sequencing Run /QC
– Assembly Process / QC
– Others overlapping with Epi: Demographic / Geographic / etc.
• Starting an epidemiology checklist to be completed this
year

GAP 4: GENOMIC DATA
INTERPRETATION IS COMPLEX AND
TECHNOLOGY IS EVOLVING

Solution 4a: Use of QA/QC in IRIDA
• Software Engineering
– High quality software that meets regulatory guidelines
– Open Source product to ensure “white box” testing
– Ontology driven software development
– Follow proper software development cycle
• Data Quality
– Built-in modules to check for input data quality
– Warnings and feedback to laboratory technologists during pipeline execution
– Use of Ontology to check metadata (non-genomic) data quality
• Analytic Tool Quality
– Utilize validation datasets
– Use of abstract pipeline description – with version control
– Periodic analysis of exceptions and boundary cases to assess tool accuracy
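
As one example of an input data quality check, the sketch below reads FASTQ text, converts each quality string to Phred scores (assuming the common Phred+33 encoding), and returns a warning for every read whose mean quality falls below a cut-off. The cut-off of 20 and the function names are assumptions for illustration, not settings taken from IRIDA.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    if not quality_line:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def check_reads(fastq_text, min_mean_quality=20.0):
    """Return one warning message per read below the quality cut-off."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have four lines each")
    warnings = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        score = mean_phred(qual)
        if score < min_mean_quality:
            warnings.append(
                f"{header[1:]}: mean quality {score:.1f} below {min_mean_quality}"
            )
    return warnings


# Two toy reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n"

if __name__ == "__main__":
    for message in check_reads(EXAMPLE):
        print("WARNING:", message)
```

Surfacing such messages while the pipeline runs gives technologists a chance to re-sequence a sample before poor data reaches the analysis stage.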

Solution 4b: Generation of validation datasets
To Participate, Contact
Rene Hendriksen
rshe@food.dtu.dk
Or
Errol Strain
Errol.Strain@fda.hhs.gov
http://www.globalmicrobialidentifier.org/Workgroups#work-group-4
NML and BCPHMRL will be
participating in the GMI proficiency
test to compare our genomic
sequencing and analysis protocols
with other labs around the world

Solution 4c: Exploratory tools can access certain
data via REST API securely
http://pathogenomics.sfu.ca/islandviewer
IslandViewer
Dhillon and Laird et al. 2015, Nucleic Acids
Research
http://kiwi.cs.dal.ca/GenGIS
Parks et al. 2013, PLoS One

Availability
• Jun 1 2015: IRIDA 1.0 beta Internal Release
– Release to collaborators for installation and full test
• Jul 1 2015: IRIDA 1.0 beta1
– Announce Beta release, download, documentation
available on website – www.irida.ca
• Aug 1 2015: IRIDA 1.0 beta2
– Cloud installer, with documentation
– Additional pipelines as available
– Visualization as available

Acknowledgements
Project Leaders
Fiona Brinkman – SFU
Will Hsiao – PHMRL
Gary Van Domselaar – NML
University of Lisbon
João Carriço
National Microbiology Laboratory (NML)
Franklin Bristow
Aaron Petkau
Thomas Matthews
Josh Adam
Adam Olson
Tarah Lynch
Shaun Tyler
Philip Mabon
Philip Au
Celine Nadon
Matthew Stuart-Edwards
Morag Graham
Chrystal Berry
Lorelee Tschetter
Aleisha Reimer
Laboratory for Foodborne Zoonoses (LFZ)
Eduardo Taboada
Peter Kruczkiewicz
Chad Laing
Vic Gannon
Matthew Whiteside
Ross Duncan
Steven Mutschall
Simon Fraser University (SFU)
Melanie Courtot
Emma Griffiths
Geoff Winsor
Julie Shay
Matthew Laird
Bhav Dhillon
Raymond Lo
BC Public Health Microbiology &
Reference Laboratory (PHMRL) and BC
Centre for Disease Control (BCCDC)
Judy Isaac-Renton
Patrick Tang
Natalie Prystajecky
Jennifer Gardy
Damion Dooley
Linda Hoang
Kim MacDonald
Yin Chang
Eleni Galanis
Marsha Taylor
Cletus D’Souza
Ana Paccagnella
University of Maryland
Lynn Schriml
Canadian Food Inspection Agency (CFIA)
Burton Blais
Catherine Carrillo
Dominic Lambert
Dalhousie University
Rob Beiko
Alex Keddy
McMaster University
Andrew McArthur
Daim Sardar
European Nucleotide Archive
Guy Cochrane
Petra ten Hoopen
Clara Amid
European Food Safety Authority
Leibana Criado Ernesto
Vernazza Francesco
Rizzi Valentina

IRIDA Annual General Meeting
Winnipeg, April 8-9, 2015
