How Can We Make Genomic Epidemiology a Widespread Reality? - William Hsiao
1. How Can We Make Genomic
Epidemiology a Widespread Reality?
William Hsiao, Ph.D.
William.hsiao@bccdc.ca
@wlhsiao
BC Public Health Microbiology and Reference Laboratory
BCCDC Grand Round May 26 2015
2. Outline
• Part 1: What is genomic epidemiology and
Why is it important for public health
microbiology
• Part 2: What are the requirements to bring
genomic epidemiology to routine public
health practice
– Introducing our project IRIDA as part of the
solution
7. Molecular Epidemiology
• Laboratory generated biomarker results can
be correlated to epidemiological investigations
(People, Place, Time)
• Provides linkage based on common exposure
to the same pathogen at the molecular level
• Most tests detect one or a few of specific
biomarkers, representing a fraction of the
pathogens’ genetic information
8. Current Methods of Characterizing Foodborne
Pathogens in a Public Health Laboratory
• Growth characteristics
• Phenotypic panels
• Agglutination reactions
• Enzyme immuno assays (EIAs)
• PCR
• DNA arrays (hybridization)
• Sanger sequencing of marker genes
• DNA restriction
• Electrophoresis (PFGE, capillary)
Each pathogen is characterized by methods that are specific to that pathogen in
multiple workflows (separate workflows for each pathogen) TAT: 5 min – weeks
(months)
Source: Rebecca Lindsey
9. Genomic Epidemiology
Def: Using whole genome sequencing data from
pathogens and epidemiological investigations
to track spread of an infectious disease
10. Why Genomic Epidemiology
• One technology (DNA sequencing) compatible with
many types of pathogens
• Capable of generating 10-1000s of high quality
pathogen genomes within 1-7 days
11. Sequencing = lots of HQ Data
• Capture the pathogen’s entire genetic makeup
• Unbiased (~97-99+% of the genome captured using
common sequencing approaches)
• Significantly more data than traditional methods
• Allow higher resolution and higher sensitivity methods to
be applied
• Allow value-added
evolutionary & Functional
study of the pathogens
– Virulence factors
– AMR genes
12. $10K per human genome or $10
per bacterial genome
$100M per human genome
Sequencing cost continues to drop
13. Variations in genomes = Basis of
Comparison
• Mutations
– Point mutations
– Small insertions and deletions (indels)
– Can change functions of a gene
• Recombination, deletion, and duplication
– Rearrange genes, can change expression
– Increase gene copy number
– Delete genes
• Horizontal gene transfer
– Acquiring genetic material from non-parental organism
• E.g. Antibiotic resistance / new toxins
14. SNP Analysis
• What is a SNP?
– A SNP (single nucleotide polymorphism) is DNA
sequence variation occurring when a single nucleotide
differs between two or more genomes
ATCGCGATATCATACGG
ATCGCAATATCATACGG
ATCGCGATATCATACGG
ATCGCGATATCATACGG
ATCGCAATATCATACGG
• A SNP can be created by a point mutation but can
also be created by the insertion or deletion of
one nucleotide
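A minimal sketch of how a SNP site could be located in the five aligned sequences shown on this slide. The function name `find_snp_sites` is hypothetical, and the sketch assumes the sequences are already aligned to the same length; real SNP pipelines work from read mappings and apply quality filters.

```python
from typing import Dict, List


def find_snp_sites(aligned: List[str]) -> Dict[int, List[str]]:
    """Return {1-based position: base in each genome} for columns that vary."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos, column in enumerate(zip(*aligned), start=1):
        if len(set(column)) > 1:  # more than one distinct base: variable site
            sites[pos] = list(column)
    return sites


# The five sequences from the slide.
genomes = [
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCGATATCATACGG",
    "ATCGCAATATCATACGG",
]
snp_sites = find_snp_sites(genomes)
print(snp_sites)  # {6: ['G', 'A', 'G', 'G', 'A']}
```

Only position 6 varies (G in three genomes, A in two), which is the single SNP illustrated above.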
15. Why are SNPs useful
• Silent mutations that do not change protein
sequences happen quite frequently due to
DNA replication errors => High Resolution
• SNPs occur across the whole genome and can
be detected from whole genome sequencing
=> Unbiased markers
• SNPs can also be used to infer phylogeny of
organisms
– More shared SNPs = more closely related
16. SNP Minimal Spanning Tree – colored by Phage Type
PT8
PT4
PT13a
PT52
The most similar isolates are connected first => clustering them together
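A minimal sketch of the "most similar isolates are connected first" idea, using Kruskal's algorithm on pairwise SNP distances. The isolate names and sequences are made up for illustration, and the tool that produced the tree on this slide is not stated, so this is not a description of that tool.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def snp_distance(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Kruskal's algorithm: add the shortest edges first, skipping cycles."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links to the representative of this component.
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    edges = sorted(
        (snp_distance(seqs[a], seqs[b]), a, b)
        for a, b in combinations(sorted(seqs), 2)
    )
    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree


# Hypothetical aligned isolates.
isolates = {
    "iso1": "AAAAAAAAAA",
    "iso2": "AAAAAAAAAT",
    "iso3": "AAAAAAAATT",
    "iso4": "TTTTTAAAAA",
}
mst = minimum_spanning_tree(isolates)
print(mst)  # [('iso1', 'iso2', 1), ('iso2', 'iso3', 1), ('iso1', 'iso4', 5)]
```

The three near-identical isolates are joined by short edges, while the distant isolate attaches last through a long edge.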
18. Many phylogenetic trees based on SNPs
published to show clustering of outbreak cases
den Bakker et al. Emerg Infect Dis. 2014 Aug;20(8)
Non-related
cases
Outbreak
cases
Allard, M et al. PLoS ONE 8 (1) 2013
19. Forces Driving Pathogen Genome Evolution
Specialization
“lean and mean”
New
function can
be derived
through:
Gene expression
and be turned on
and off
20. Intra-cluster distances overlap with inter-cluster
distances
Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.
21. Different species have different clustering
distances
Leekitcharoenphon, et al. 2014. PLoS ONE 9 (2). doi:10.1371/journal.pone.0087991.
22. Genomics + Epidemiology
• Having genetic distance information alone
may not be enough to fully characterize
outbreaks
• Need to combine with epidemiological
investigations
• Using known clusters to establish
(sub-)species-specific genetic distance criteria
• Genomics can help connect previously
unlinked cases and uncover new cases
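A minimal sketch of why a distance cut-off derived from known clusters may still need epidemiological support. All isolate names, cluster assignments, and distances below are hypothetical; taking the largest intra-cluster distance as a candidate threshold is an assumption made for illustration, not a criterion stated in the talk.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def split_distances(
    distances: Dict[FrozenSet[str], int], cluster_of: Dict[str, str]
) -> Tuple[List[int], List[int]]:
    """Split pairwise SNP distances into intra- and inter-cluster lists."""
    intra, inter = [], []
    for a, b in combinations(sorted(cluster_of), 2):
        d = distances[frozenset((a, b))]
        if cluster_of[a] == cluster_of[b]:
            intra.append(d)
        else:
            inter.append(d)
    return intra, inter


# Hypothetical isolates assigned to two known outbreaks by epidemiology.
cluster_of = {"a": "outbreak_X", "b": "outbreak_X", "c": "outbreak_Y", "d": "outbreak_Y"}
# Hypothetical pairwise SNP distances.
distances = {
    frozenset(("a", "b")): 3,
    frozenset(("c", "d")): 12,
    frozenset(("a", "c")): 40,
    frozenset(("a", "d")): 45,
    frozenset(("b", "c")): 10,
    frozenset(("b", "d")): 42,
}
intra, inter = split_distances(distances, cluster_of)
candidate_threshold = max(intra)  # largest distance seen inside a known cluster
overlap = candidate_threshold >= min(inter)  # True: b and c would be linked by distance alone
print(candidate_threshold, overlap)  # 12 True
```

When `overlap` is true, intra-cluster and inter-cluster distances are not separable by a single cut-off, so exposure information is needed to decide which cases belong together.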
23. Each year, one in eight Canadians (or
four million people)
get sick with a domestically acquired
food-borne illness.
http://www.phac-aspc.gc.ca/efwd-emoha/efbi-emoa-eng.php
24. Whole Genome Sequencing of Foodborne
Pathogens Around the World
• UK Public Health England committed to sequence all the
Salmonella isolates submitted to PH Lab
• US FDA and CDC (supported by National Center for
Biotechnology Information) created a distributed network
of labs to utilize WGS for pathogen identification
https://publichealthmatters.blog.gov.uk/2014/01/20/innovations-in-genomic-sequencing/
http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm
25. Genome Canada Bioinformatics Competition: Large-Scale Project
“A Federated Bioinformatics Platform for
Public Health Microbial Genomics”
Our Goal
The IRIDA platform
(Integrated Rapid Infectious Disease Analysis)
An open source, standards compliant, high quality genomic epidemiology
analysis platform based on web-technology to support real-time
(food-borne) disease outbreak investigations
www.IRIDA.ca
26. Partnership among public health agencies and academic institutes to bridge the gaps
between advancements in genomic epidemiology and application to real-life and
real-time use cases in public health agencies
- Project Team has direct access to state of the art research in academia
- Project Team is directly embedded in user organization
National Public Health Agency
Provincial Public Health Agency
Academic/Public
27. IRIDA Project Phases
• Phase 1: genomics process and analysis pipeline to
produce categorical data (MLST and SNPs) suitable for
current epidemiological analysis – almost completed
• Phase 2: combine the categorical data with
epidemiological data (line list approach to replace
current Excel based approach) – in progress
• Phase 3: Develop IRIDA as an exploratory platform for
new ways of interpreting genomics data in light of
epidemiological and clinical data – in progress;
continuous process beyond current project
28. Interviews with key personnel to identify
barriers to implementing genomic epidemiology in
public health agencies
29. GAP 1: PUBLIC HEALTH PERSONNEL
LACK TRAINING IN GENOMICS
30. Microbial genomics has been a valuable
research tool
• Help us understand:
– microbial evolution
– pathogenesis
– create novel industrial processes
– create new laboratory tests
• Use historical isolates – not real time
• Use of laboratory strains – no associated rich
clinical and epidemiological metadata
31. Cultural and Practical Differences
Genomics Research Laboratory | Genomics Diagnostic Laboratory
Curiosity driven | Production / Case driven
Exploratory analysis tolerated | Exploratory analysis discouraged
Reproducibility = other labs’ problem | Reproducibility critical
Tweaking protocols desirable | Stability in protocols desirable
Protocols don’t need to be validated | Protocols need to be validated
Novelty justifies the high cost of experiment | Conscious of cost per unit test; tests need to be scalable
How do we bridge the cultural and the practical differences?
32. Solution 1a: Build a User Friendly, high quality
analysis platform to process genomics data
• Carefully designed and engineered software platform is
just the starting point…
Diagram labels: User Interface, Security, File system, Metadata, Storage, Application logic, REST API, Workflow Execution Manager, Continuous Integration, Documentation
33. Solution 1a: Build a User Friendly, high quality
analysis platform to process genomics data
• Easy to use interface hiding the technical details
34. Solution 1a: Build a User Friendly, high quality
analysis platform to process genomics data
35. Solution 1b: Build Portable and Transparent
Pipelines
• Use Galaxy as workflow engine – large
community support
• Retooled to address usability, security, and
other limitations
• Version Controlled Pipeline Templates
• Input files, parameters, and workflow are
sent to IRIDA-specific Galaxy for execution
• Results and provenance information are
copied from Galaxy
Diagram labels: IRIDA UI/DB; Galaxy (Assembly Tools, Variant Calling Tools, …); REST API; Shared File System; Workers
1. Input files sent to Galaxy
2. Tools executed on Galaxy workers
3. Results downloaded from Galaxy
Source: Franklin Bristow
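A minimal sketch of the three-step flow on this slide (send inputs, execute, copy back results and provenance). It does not call Galaxy or IRIDA; `PipelineTemplate`, `run_on_engine`, `submit`, the pipeline name, and the parameter are all hypothetical stand-ins, and the choice of SHA-256 checksums as provenance is an assumption.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class PipelineTemplate:
    """Hypothetical version-controlled pipeline description."""
    name: str
    version: str
    parameters: Dict[str, str] = field(default_factory=dict)


def checksum(data: bytes) -> str:
    """SHA-256 hex digest, used here to tie provenance to exact file content."""
    return hashlib.sha256(data).hexdigest()


def run_on_engine(template: PipelineTemplate, inputs: Dict[str, bytes]) -> Dict[str, bytes]:
    """Stand-in for step 2 (execution on workflow workers); returns a fake result."""
    report = f"{template.name} {template.version} processed {len(inputs)} file(s)"
    return {"report.txt": report.encode()}


def submit(template: PipelineTemplate, inputs: Dict[str, bytes]) -> dict:
    """Steps 1 to 3: send inputs, execute, copy results and provenance back."""
    results = run_on_engine(template, inputs)
    provenance = {
        "pipeline": template.name,
        "version": template.version,
        "parameters": dict(template.parameters),
        "input_checksums": {n: checksum(d) for n, d in inputs.items()},
        "output_checksums": {n: checksum(d) for n, d in results.items()},
    }
    return {"results": results, "provenance": provenance}


template = PipelineTemplate("snp-phylogeny", "0.1.0", {"min_coverage": "10"})
run = submit(template, {"isolate1.fastq": b"@read1\nACGT\n+\nIIII\n"})
print(run["provenance"]["pipeline"], run["provenance"]["version"])
```

Recording the template version, parameters, and file checksums together is one way to make a stored result traceable to the exact analysis that produced it.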
36. Solution 1c: Start the training NOW!
• Canada’s National Microbiology Laboratory has hosted
genomic workshops for partners and collaborators
• At PHMRL, we have been conducting workshops to train
technologists and researchers on some common genomic
analysis tools
• IRIDA Project has dedicated funding for hosting workshops in
4Q of 2015 and 2016
• We would like to engage the epidemiologists in the future for
training purposes as well
38. Many Players in surveillance and outbreak –
ineffective information sharing
Source: M. Taylor, BCCDC
Provincial public health dept.
National laboratory
Local public health dept.
Provincial laboratory
Cases
Physicians Frontline lab
Information
Bioinformatics and Analytical Capacities
39. Many Systems used in Reporting Diseases –
require data re-entry and re-coding
National Ministry of Health
Provincial public health dept.
National laboratory
Local public health dept.
Provincial laboratory
Cases
Physicians Local laboratory
Fax/Electronic
Fax
Phone/Fax
Electronic/Paper
Electronic/Fax/Phone
Mailing of Samples/Fax/Electronic
Source: M. Taylor, BCCDC
41. What’s the web?
• World-Wide-Web (WWW) is a platform where
– Information is distributed (CBC for news, Netflix
for Movies, etc.)
– Information is heterogeneous (text, video,
pictures)
– (relevant) Information is linked by hyperlinks
– Often, information is only human readable
– Often, information is incorrect
– Often, information is not attributed
42. What’s Semantic web?
• Semantic web inherits many of the (good) attributes of
WWW (distributed, open, heterogeneous, and linked)
• It’s designed to be:
– machine readable based on a common language of logic
– Linking information can be automated making data sharing
easier
– Easier to describe granular data
– Errors can be detected based on logical reasoning
– Information can be attributed and can be made to persist
– “Smart Web”
43. IRIDA uses semantic web technologies to
address information management issues
• Solutions:
– 2a: Localized Instance of federated databases
– 2b: Permission Control – authentication /authorization for
information sharing
– 2c: User role-based display of information
44. Solution 2a: Local/Cloud Instances and Data
Federation
• Data processing capacity pushed to data generating
labs
• Allow data sharing securely for enhanced analysis
• Eventually cultivating a culture of openness of data
sharing and collaborative development of tools
45. Solution 2b: Security
Authorization
• Local authorization per instance.
• Method-level authorization.
• Object-level authorization.
• Allow secure, fine grained and
flexible information sharing
controlled by data producer
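A minimal sketch of how method-level and object-level authorization differ. The `User` and `Sample` classes, the role name `analyst`, and the functions below are hypothetical and are not taken from the IRIDA code base.

```python
from dataclasses import dataclass, field
from functools import wraps
from typing import Set


@dataclass
class User:
    name: str
    roles: Set[str]


@dataclass
class Sample:
    sample_id: str
    owner: str
    shared_with: Set[str] = field(default_factory=set)


def requires_role(role: str):
    """Method-level check: may this user call the operation at all?"""
    def decorator(func):
        @wraps(func)
        def wrapper(user: User, *args, **kwargs):
            if role not in user.roles:
                raise PermissionError(f"{user.name} lacks role {role!r}")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


def can_read(user: User, sample: Sample) -> bool:
    """Object-level check: has the data producer shared this sample with the user?"""
    return user.name == sample.owner or user.name in sample.shared_with


@requires_role("analyst")
def read_sample(user: User, sample: Sample) -> str:
    if not can_read(user, sample):
        raise PermissionError(f"{user.name} may not read {sample.sample_id}")
    return sample.sample_id


alice = User("alice", {"analyst"})   # owner of the sample
bob = User("bob", {"analyst"})       # right role, but sample not shared with him
carol = User("carol", {"viewer"})    # sample shared, but wrong role
sample = Sample("S1", owner="alice", shared_with={"carol"})
print(read_sample(alice, sample))  # S1
```

Here `bob` passes the method-level check but fails the object-level one, and `carol` fails at the method level, so both calls raise `PermissionError`.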
46. Solution 2c: Role-based Dynamic Display driven
by Ontology
• Ontologies often lack a content management system (CMS)
• An Interface Model Ontology (IFM) can define a CMS for an
ontology
Source: Damion Dooley
48. IFM Interface View Permissions
Detailed View Restricted View
E.g. User role permissions control visibility and editing of content
Source: Damion Dooley
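A minimal sketch of a detailed versus restricted view. In the talk the views are driven by an interface model ontology; this sketch swaps in a plain dictionary as a stand-in, and the view names, field names, and record values are hypothetical.

```python
from typing import Dict, Set

# Stand-in for what an interface model would declare: which fields each view shows.
VISIBLE_FIELDS: Dict[str, Set[str]] = {
    "detailed": {"sample_id", "pfge_type", "patient_age", "postal_code"},
    "restricted": {"sample_id", "pfge_type"},
}


def render_view(record: dict, view: str) -> dict:
    """Return only the fields that the given view is allowed to display."""
    allowed = VISIBLE_FIELDS[view]
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical record.
record = {
    "sample_id": "S1",
    "pfge_type": "SSOXAI.0042",
    "patient_age": 34,
    "postal_code": "V5Z",
}
print(render_view(record, "restricted"))  # {'sample_id': 'S1', 'pfge_type': 'SSOXAI.0042'}
```

The same record yields different displays depending on the view assigned to the user's role, without storing separate copies of the data.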
50. There are at least 74 different ways to
say “female” in ENA database
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383942/
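A minimal sketch of mapping free-text entries onto a single controlled term. The handful of variants below are illustrative only and are not the actual list found in ENA; `normalize_sex` is a hypothetical helper.

```python
from typing import Dict, Optional

# Illustrative variants only, mapped to one controlled term each.
SEX_SYNONYMS: Dict[str, str] = {
    "female": "female",
    "f": "female",
    "woman": "female",
    "male": "male",
    "m": "male",
}


def normalize_sex(raw: str) -> Optional[str]:
    """Map a free-text value to a controlled term, or None if unrecognized."""
    key = raw.strip().lower().rstrip(".")
    return SEX_SYNONYMS.get(key)


print([normalize_sex(v) for v in (" Female ", "F.", "unknown")])
# ['female', 'female', None]
```

Returning `None` for unrecognized values keeps them visible for curation instead of silently guessing a term.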
51. Solution 3a: Use Ontology
• Ontology: a way to describe types of entities
and relations between them
• Why use ontology
– Ontology is flexible and expandable
– Lower levels of expressivity (e.g. controlled vocabulary,
data dictionary) are heavy handed and show low level of
compliance and adoption
– Free text, used as an alternative, is not computing
friendly
– Ontology and semantic web technologies may be a
solution
52. The Utility of Ontologies in Food-borne Investigations
Example:
Correlate PFGE type SSOXAI.0042 cases between 01 Mar 2015- 16 Mar 2015 with
Spinach Leafy Greens Produce High-Risk Food Sources and Symptoms of Nausea
and Fever
Ontologist organizes how terms are related in a tree so one can search for terms at different
levels
Provides great information-resolving power!!
High-Risk Food
  Produce: Leafy Greens, Sprouts
  Poultry: Deli Meat, Nuggets
  Seafood: Fish, Shellfish
Source: Emma Griffiths
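A minimal sketch of searching at different levels of the food tree on this slide. The parent links follow the tree shown; placing Spinach under Leafy Greens is an assumption based on the example query, and the case exposures are hypothetical.

```python
from typing import Dict, List

# Child -> parent links for the food tree on the slide.
PARENT: Dict[str, str] = {
    "Produce": "High-Risk Food",
    "Poultry": "High-Risk Food",
    "Seafood": "High-Risk Food",
    "Leafy Greens": "Produce",
    "Sprouts": "Produce",
    "Deli Meat": "Poultry",
    "Nuggets": "Poultry",
    "Fish": "Seafood",
    "Shellfish": "Seafood",
    "Spinach": "Leafy Greens",  # assumed placement
}


def ancestors(term: str) -> List[str]:
    """Walk up the tree and return every broader term, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def matches(term: str, query: str) -> bool:
    """True if the term is the query itself or falls under it in the tree."""
    return term == query or query in ancestors(term)


# Hypothetical exposures reported by cases.
exposures = {"case1": "Spinach", "case2": "Sprouts", "case3": "Fish"}
produce_cases = sorted(c for c, food in exposures.items() if matches(food, "Produce"))
print(produce_cases)  # ['case1', 'case2']
```

A query for "Produce" finds both the spinach and the sprouts case even though neither record contains the word, which is the resolving power the slide refers to.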
53. Many Domains of Knowledge are needed to describe
an outbreak investigation
Build On, Work With:
OBI
TypON
NGSOnto
NIAID-GSC-BRC core metadata
MIxS Ontology
NCBI Biosample etc
TRANS – Pathogen Transmission
EPO
Exposure Ontology
Infectious Disease Ontology
CARD, ARO for AMR
USDA Nutrient DB
EFSA Comp. Food Consump. DB
Example gaps to be filled:
Expand food ontology; expand CARD
AMR data with others.
54. Lab Checklist/Ontology
• Currently finishing a lab/genomics checklist
• Metadata Domains:
– Sample Collection
– Sample Source
– Environmental
– Lab Analytics
– Sequencing Process /QC
– Sequencing Run /QC
– Assembly Process / QC
– Others overlapping with Epi: Demographic / Geographic / etc.
• Starting an epidemiology checklist to be completed this
year
55. GAP 4: GENOMIC DATA
INTERPRETATION IS COMPLEX AND
TECHNOLOGY IS EVOLVING
56. Solution 4a: Use of QA/QC in IRIDA
• Software Engineering
– High quality software that meets regulatory guidelines
– Open Source product to ensure “white box” testing
– Ontology driven software development
– Follow proper software development cycle
• Data Quality
– Built-in modules to check for input data quality
– Warnings and feedback during pipeline execution to laboratory technologists
– Use of Ontology to check metadata (non-genomic) data quality
• Analytic Tool Quality
– Utilize validation datasets
– Use of abstract pipeline description – with version control
– Periodic analysis of exceptions and boundary cases to assess tool accuracy
57. Solution 4b: Generation of validation datasets
To Participate, Contact
Rene Hendriksen
rshe@food.dtu.dk
Or
Errol Strain
Errol.Strain@fda.hhs.gov
http://www.globalmicrobialidentifier.org/Workgroups#work-group-4
NML and BCPHMRL will be
participating in the GMI proficiency
test to compare our genomic
sequencing and analysis protocols
with other labs around the world
58. Solution 4c: Exploratory tools can access certain
data via REST API securely
http://pathogenomics.sfu.ca/islandviewer
IslandViewer
Dhillon and Laird et al. 2015, Nucleic Acids
Research
http://kiwi.cs.dal.ca/GenGIS
Parks et al. 2013, PLoS One
59. Availability
• Jun 1 2015: IRIDA 1.0 beta Internal Release
– Release to collaborators for installation and full test
• Jul 1 2015: IRIDA 1.0 beta1
– Announce Beta release, download, documentation
available on website – www.irida.ca
• Aug 1 2015: IRIDA 1.0 beta2
– Cloud installer, with documentation
– Additional pipelines as available
– Visualization as available
60. Acknowledgements
Project Leaders
Fiona Brinkman – SFU
Will Hsiao – PHMRL
Gary Van Domselaar – NML
University of Lisbon
João Carriço
National Microbiology Laboratory (NML)
Franklin Bristow
Aaron Petkau
Thomas Matthews
Josh Adam
Adam Olson
Tarah Lynch
Shaun Tyler
Philip Mabon
Philip Au
Celine Nadon
Matthew Stuart-Edwards
Morag Graham
Chrystal Berry
Lorelee Tschetter
Aleisha Reimer
Laboratory for Foodborne Zoonoses (LFZ)
Eduardo Taboada
Peter Kruczkiewicz
Chad Laing
Vic Gannon
Matthew Whiteside
Ross Duncan
Steven Mutschall
Simon Fraser University (SFU)
Melanie Courtot
Emma Griffiths
Geoff Winsor
Julie Shay
Matthew Laird
Bhav Dhillon
Raymond Lo
BC Public Health Microbiology &
Reference Laboratory (PHMRL) and BC
Centre for Disease Control (BCCDC)
Judy Isaac-Renton
Patrick Tang
Natalie Prystajecky
Jennifer Gardy
Damion Dooley
Linda Hoang
Kim MacDonald
Yin Chang
Eleni Galanis
Marsha Taylor
Cletus D’Souza
Ana Paccagnella
University of Maryland
Lynn Schriml
Canadian Food Inspection Agency (CFIA)
Burton Blais
Catherine Carrillo
Dominic Lambert
Dalhousie University
Rob Beiko
Alex Keddy
McMaster University
Andrew McArthur
Daim Sardar
European Nucleotide Archive
Guy Cochrane
Petra ten Hoopen
Clara Amid
European Food Safety Authority
Leibana Criado Ernesto
Vernazza Francesco
Rizzi Valentina
Today, I’d like to tell you a bit about some of Canada’s efforts to build a genomic epidemiology analysis platform.
This is John Snow’s famous map. On it, I’ve colored in red his column of bars, each of which represents a cholera death. I’ve also circled in blue the local water pumps, including the Broad Street pump — servicing the well that was the source of cholera.
“In a now legendary experiment in 1854, Dr. John Snow, a London physician, conducted a simple yet brilliant test that helped to settle the debate about the transmission of cholera. Snow drew a map [see Figure 2 below] of a virulent cholera outbreak in one of the poorest neighborhoods of London – served by central wells and no sewage collection. He plotted the homes and numbers of people affected, and in a flash of insight, mapped the location of the wells that provided water for the hardest hit neighborhoods. The maps he generated and the interviews he conducted with the families of victims convinced him that the source of contamination was the water from the Broad Street well. He received permission from local authorities to remove the pump, which forced residents to go to other, uncontaminated wells for water. Within days, the outbreak subsided.” [from “Bottled and Sold: The Story Behind Our Obsession with Bottled Water” Island Press, Washington DC.]
Having the ability to identify clusters of cases in the population is critical to understanding what is happening and to tracking the cause of the outbreak.
Being able to differentiate strains of organisms in the population tells us who carries the same pathogen. This is achievable via lab techniques that subtype pathogens.
If we can overlay additional information, such as exposure (did they eat in the same restaurant?), we can then track the source of the outbreak.
Add source
In terms of cost
Despite our high standards in food safety, each year one in eight Canadians gets food poisoning, costing the economy $4 billion. It is important to track the source and spread of the disease to prevent further illness.
IRIDA was conceived about 2 years ago through a Genome Canada Bioinformatics Grant. It is an effort to build an open source, standards compliant, high quality genomic epidemiology analysis platform to support real-time disease outbreak investigations, initially focused on food-borne illnesses
IRIDA is a partnership among provincial public health agencies, national public health agencies and academic institutes to bridge the gaps between advancements in genomic epidemiology and real-life and real-time use cases in public health agencies
Project Team has direct access to state of the art research in academia
Project Team is directly embedded in user organization
Since we have access to the end users, we conducted interviews with these subject experts to identify the barriers to the uptake of genomic epidemiology in public health agencies. We interviewed epidemiologists, lab scientists and technologists, medical microbiologists and lab administrators. So for the rest of the presentation, I’ll talk about some of the gaps we identified and how IRIDA can meet the requirements.
The first gap, which should not be a surprise to this audience, is that public health workers are mostly unfamiliar with genomics and the bioinformatics analysis needed to process and interpret genomic data.
While we do believe that, in the long run, adequate training in genomics is needed to bridge this gap, in the short term a high quality analysis platform that automates data processing and provides consistent analysis protocols will help to ease the transition.
However, a carefully designed and engineered software platform is just the starting point, and there will no doubt be many similar platforms to choose from. So I will touch on some of the more interesting design philosophies we have for IRIDA.
We found that in the diagnostic testing world, complex procedures with lots of options lead to more human errors and more non-compliance. So, one design solution that we stress is to have a simple user interface that hides the technical details. This solution of course can’t stand on its own, and I’ll describe measures to ensure that flexibility and scientific rigor can be maintained.
We think a user interface should be like a joke… If you have to explain it, then it’s not good. That said, we do have extensive documentation for the administrators and accreditation auditors who don’t like jokes :P
The next solution is to leverage Galaxy, which has large community support and a large user base, as our pipeline engine. We had to retool Galaxy extensively to address usability, security and other limitations. To achieve this we built the IRIDA platform on top of the Galaxy engine, where input files, parameters and workflows are sent to Galaxy for execution, and results and pipeline provenance information are copied back into the IRIDA database for storage and archiving.
To address the knowledge gaps in genomics, we have started training our public health lab workers on genomic analysis. We would like to hear about other training initiatives and will be happy to share our experience and training material
The second gap that we identified is that sharing of information within and between organizations is highly inefficient and often involves sharing Excel files with deleted columns to hide sensitive information.
There are many players involved in infectious disease surveillance and outbreak investigation. However, concerns with privacy and confidentiality (both founded and unfounded) mean that information tends to be aggregated and lost as we move from the frontline labs to public health and reference labs. Yet the bioinformatics and analytical capacities are most abundant in central labs and academia.
Moreover, different institutions have different software, and often data is exported and printed, faxed, then re-imported into a new system by re-typing! This is a huge waste of time and a source of errors.
IRIDA has a few designs to deal with these issues, and I’ll highlight 3 here.
First, we propose pushing the data processing capacity to the periphery, where data is the richest, by encouraging local or private cloud instances of the IRIDA platform. This way our partners would not be obligated to give up their data. The different instances are connected via a federated database schema. Data can then be shared securely and easily to allow enhanced analysis to be done by genomic experts located centrally. The more we share successfully, the more likely people will realize the benefit of sharing, and this can lead to a new culture of openness.
Second, we have built-in mechanisms for authentication and authorization at different levels to allow secure and fine-grained information sharing. This would allow parties to customize the data they share according to material and data transfer agreements.
Third, we realized we need to have a flexible user interface to present the data. Therefore, we are in the process of developing an interface model ontology which defines a content management system.
As an example, users will see the content of the database displayed differently depending on their role.
The third gap we identified is that information representation is inconsistent across organizations
Given the richness and complexity of genomic epidemiological data, we opt to use and develop ontologies compliant with OBO Foundry to describe the data. Currently, lower levels of expressivity such as controlled vocabularies and data dictionaries are used, but they tend to be heavy handed and show a low level of compliance and adoption. We believe ontology and semantic web technologies can make data sharing across heterogeneous systems and platforms more tractable.
There are many domains of knowledge needed to describe an outbreak investigation and we strive to re-use existing standards as much as possible
Currently we are finishing a lab/genomic checklist and will be starting an epidemiology checklist soon
Lastly, as Jon and others mentioned yesterday, genomic data interpretation is complex and the technology is still evolving, yet in the world of diagnostic labs, accreditation means standardized protocols need to be developed.
So we focus quite a bit of our energy on developing high quality software with built-in QA and QC components to assess data quality and analytic tool performance.
I also want to highlight GMI WG4’s effort in developing proficiency tests for wet lab and analysis pipelines. To participate you can contact Rene or Errol.
To facilitate tool improvement and to allow exploratory analysis that is not part of the IRIDA pipelines, we would also allow pre-authorized tools to connect to IRIDA securely via a REST API. Currently we have two external tools for genomic island detection and phylogeography analysis.
The software will be released to a few international collaborators for full testing by Jun 1. Then on Jul 1, we plan to release the beta version publicly so people can try it out. Of course the software will be free and we would love to collaborate with people on both the software and the ontology development.
A large group of people contributed to this work.