U.S. Food and Drug Administration 
Institute for Genome Sciences 
Development of FDA MicroDB: 
A Regulatory-Grade 
Microbial Reference Database 
Heike Sichtig, Ph.D. 
Division of Microbiology Devices 
OIR/CDRH/FDA/HHS 
Heike.Sichtig@fda.hhs.gov 
Luke Tallon 
Genomics Resource Center 
Institute for Genome Sciences, UMSOM 
ljtallon@som.umaryland.edu 
October 21-22, 2014 
NIST Workshop to Identify Standards Needed to Support Pathogen Identification 
via Next-Generation Sequencing, NIST, MD
Microbial NGS-Based Diagnostic Devices 
• OIR/DMD working on a fast-tracked Draft Guidance 
• On April 1st 2014 held Public Workshop 
“Advancing Regulatory Science for High Throughput Sequencing 
Devices for Microbial Identification and Detection of Antimicrobial 
Resistance Markers” [FR Doc No: 2014-04940] 
• Workshop agenda, discussion paper and webcast online: 
http://www.fda.gov/MedicalDevices/NewsEvents/WorkshopsConferences/ucm386967.htm 
Objectives: 
1. Streamline/shorten clinical trials for microbial diagnosis/identification 
2. Establish a new comparator algorithm for assays developed using this 
new technology 
3. Develop regulatory science standards for microbial genome sequencing 
4. Investigate the regulatory science required for antimicrobial resistance 
determination through microbial genome sequence information.
Inter-Agency Working Group on Feasibility 
Approach: 
• Formed a diverse working group: FDA, NIH-NCBI, NIAID, DTRA, 
LLNL, and CDC 
• Conducted small pilot study to generate information to evaluate 
quality of existing sequences in the public domain (In Progress) 
• Identify the pre-existing high-quality deposits, and build from 
there 
• Will use information to set quality bar for sequence outputs for 
our ongoing sequencing efforts 
• Utilized existing standards (if available) for technical and isolate 
metadata; no need to re-invent 
• Attention given to connecting antimicrobial resistance 
phenotype to genomic deposits – clinical collection site
Looking ahead: Predictions for Reference Databases 
– Multiple levels of Reference DBs likely 
• “High quality” genomes only 
– For validation and clinical use 
• “High quality” + other available genomes 
– For testing and development 
• Requires definition of “high quality” that must include 
some draft genomes 
– Extensive screening required 
• Human and other hosts; chimeras 
• Artificial constructs 
– Separate bacterial, viral, fungal reference DBs 
– Publicly available (NCBI/EMBL/DDBJ) 
Courtesy of Tom Slezak
Current Need 
Robust, Standardized, and High Quality Microbial 
Sequence Database in the Public Sector 
Cover illustration 
(Copyright © 2009, American Society 
for Microbiology. All Rights Reserved.) 
• Representative Samples 
• Metadata 
• High quality raw sequences 
• Assemblies 
• Annotation 
• Public Domain
Latest NCBI GenBank Report on Bacterial Genome Growth 
[Figure: count of bacterial genomes over time; x-axis Date (Jul-98 to Dec-14), y-axis Count (0 to 25,000); series: #Genomes, #Real Species] 
Courtesy of NCBI
Microbial Reference Database (MicroDB) ($1.67M) 
(Funding awarded by FDA/OCET) 
• Identify “gaps” and target sequencing efforts 
• All raw reads, assemblies, annotations, metadata sent to NCBI and 
accessible to the PUBLIC 
• Traceable results that could be reevaluated as necessary 
• >600 clinically relevant and MCM microorganisms 
• Highly controlled and documented approach 
Collaborations with Clinical Labs and Repositories 
• Children’s National Hospital 
• DoD Critical Reagents Program (CRP, USAMRIID) 
• FDA-CFSAN, FDA-CBER, FDA-CDER 
• DHS National Biodefense Analysis and 
Countermeasures Center (NBACC) 
• The Rockefeller University 
• Culture Collections: ATCC, DSMZ 
Sequencing Center (UMD IGS) 
• Hybrid Approach (PacBio and Illumina) 
• Deposit of Raw Reads at NCBI (SRA) 
• Deposit of Assemblies at NCBI 
• Deposit of Annotations at NCBI 
• FDA Interface to Access Data
MicroDB Requirements 
A. Extracted Genomic DNA (gDNA) 
– Extracted gDNA should be of high quality and purity, and at sufficient concentration to 
achieve a suitable yield to assure adequate depth and breadth of genomic coverage for 
the type of sequencing method employed. 
B. BioSample Metadata 
– A minimal description of the isolate source material is necessary for traceability. We are 
using 14 descriptors as outlined below. (Note: Minimal metadata is modeled in part after 
NCBI’s minimal pathogen template) 
– Unique ID, organism, strain/isolate, sample site, specimen type, host disease, collection 
date, collected by, patient age, gender, geographic location, AST method*, AST method 
manufacturer*, Antimicrobial Susceptibilities* 
C. Sequencing Data 
– The minimum requirement for sequencing data is that the generated raw reads should be 
deposited in NCBI’s Sequence Read Archive (SRA) and assemblies should be deposited 
at NCBI’s Assembly division. The availability of raw reads and assemblies will provide a 
pathway to re-analyze the data as newer technologies emerge. Furthermore, annotation 
data should be deposited when available. 
– Raw reads, assemblies, annotations* 
*not used as a criterion for exclusion
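The BioSample rule in section B can be sketched as a small completeness check. This is only an illustration: the snake_case key names and the `check_biosample` helper are assumptions made for this sketch, not an NCBI or FDA schema. It treats the 11 unstarred descriptors as required and the 3 starred ones as optional, so a missing AST field never causes rejection.

```python
# Descriptor names follow the slide's list of 14; the snake_case keys are an
# assumption for this sketch, not an official NCBI BioSample attribute set.
REQUIRED = [
    "unique_id", "organism", "strain_isolate", "sample_site",
    "specimen_type", "host_disease", "collection_date", "collected_by",
    "patient_age", "gender", "geographic_location",
]
# Starred descriptors: recorded when available, never grounds for exclusion.
OPTIONAL = [
    "ast_method", "ast_method_manufacturer", "antimicrobial_susceptibilities",
]


def check_biosample(record: dict) -> dict:
    """Report which required and optional descriptors are missing or empty."""
    def absent(key: str) -> bool:
        value = record.get(key)
        return value is None or str(value).strip() == ""

    missing_required = [k for k in REQUIRED if absent(k)]
    missing_optional = [k for k in OPTIONAL if absent(k)]
    return {
        "accept": not missing_required,  # only required fields gate acceptance
        "missing_required": missing_required,
        "missing_optional": missing_optional,
    }


# Hypothetical record with every required field filled and no AST data.
example = {key: "placeholder" for key in REQUIRED}
print(check_biosample(example))
```

A record like `example` is accepted while still reporting the three absent AST descriptors, which mirrors the footnote on starred fields.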
MicroDB Requirements (continued) 
D. Sequencing Metadata 
– A minimal description of the sequencing process is necessary for traceability. We are 
using 7 descriptors as outlined below including bioinformatics tool information for assembly 
and annotation, and genomic coverage information. 
– Library, platform, submitted by, fold coverage, pipeline, assembler, annotation tool* 
E. Suggested phenotypic metadata* 
– A description of the phenotypic information is suggested to create a link between the 
phenotypic traits of particular organisms and their genomic sequence. We are 
recommending 5 descriptors as outlined below (1-4 are also included in sections B and C). 
– Annotation, AST method, AST method manufacturer, antimicrobial susceptibilities, 
additional phenotypic data 
*not used as a criterion for exclusion
NCBI Submission Cases 
1. Children’s National Medical Center 
– Submit all data when available 
– Register sample metadata via BioSample 
– Submit raw reads and assemblies generated by IGS when available 
2. FDA/CFSAN 
– Collaborative agreement: Wait for genome announcements 
– Follow same procedures as for 1 and put a ‘6 month hold’ to 
release data, lift hold when genome announcements are out 
3. Rockefeller University 
– Collaborative agreement: Wait for publication 
– Follow same procedures as for 1 and put a ‘6 month hold’ to 
release data, lift hold when publication is out 
Similar agreements in place with other collaborators depending 
on their needs 
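The three submission cases differ only in when data become public. A minimal sketch of that logic follows, under two assumptions that the slide does not state: that the hold expires on its own after six months, and that an earlier genome announcement or publication lifts it sooner. The function names are hypothetical.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(submitted: date, hold: bool,
                 published: Optional[date] = None) -> date:
    """Date on which a submission becomes public under the assumed policy."""
    if not hold:
        # Case 1: data are released as soon as they are submitted.
        return submitted
    expiry = add_months(submitted, 6)  # assumed automatic end of the hold
    if published is not None and published < expiry:
        # Cases 2 and 3: announcement or publication lifts the hold early.
        return max(published, submitted)
    return expiry


print(release_date(date(2014, 10, 1), hold=False))
print(release_date(date(2014, 10, 1), hold=True, published=date(2014, 12, 15)))
print(release_date(date(2014, 8, 31), hold=True))
```

Clamping in `add_months` keeps a hold that starts on the 31st from landing on a nonexistent date.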
Project Approach 
• Sequencing in large batches 
– Illumina HiSeq paired-end sequencing: >200x 
– PacBio long-insert SMRT P4-C2 sequencing: >80-100x 
• Assembly 
– PacBio only (HGAP, PBcR CA) 
– Illumina only (CA, MaSuRCA) 
– PacBio/Illumina hybrid (CA) 
– Minimal manual QA/QC & curation 
• Automated annotation 
• Base modification detection 
• Raw reads -> NCBI SRA 
• Assembled & annotated genomes -> GenBank 
– NCBI BioProject ID: PRJNA231221 
• FDA Web interface to aggregate data
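The coverage targets on this slide are fold-coverage figures, that is, total sequenced bases divided by genome size. The sketch below works through that arithmetic using the 2.8 Mbp Staphylococcus aureus genome size quoted elsewhere in this deck. The function names are hypothetical, and the calculation ignores read filtering and uneven coverage.

```python
import math


def fold_coverage(total_bases: int, genome_size_bp: int) -> float:
    """Average fold coverage: sequenced bases divided by genome length."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


def bases_needed(target_fold: float, genome_size_bp: int) -> int:
    """Minimum number of sequenced bases required to reach a target coverage."""
    return math.ceil(target_fold * genome_size_bp)


GENOME_BP = 2_800_000  # S. aureus genome size as given in the deck

# Targets from the slide: >200x Illumina, >80x PacBio (lower bound of 80-100x).
for platform, target in [("Illumina", 200), ("PacBio", 80)]:
    print(platform, bases_needed(target, GENOME_BP), "bases")
```

For a 2.8 Mbp genome this gives 560 Mbp of Illumina data and 224 Mbp of PacBio data as the floor for each target.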
Progress - Batch 1 
Rockefeller (50) 
• Uniform sample set 
– Staphylococcus aureus 
– 2.8 Mbp genome size 
– 32.8 %GC 
– Significant metadata 
CNH/CFSAN (41) 
• Diverse sample set 
– 18 genera represented 
– 2-8 Mbp genome size range 
– 38-67 %GC range 
(Images: Wikimedia Commons) 
NCBI BioProject: PRJNA231221
Rockefeller Samples 
• Sequencing 
– Avg Illumina cvg: 578x 
– Avg PacBio cvg: 185x 
– 1 or 2 SMRT cells each 
• Assembly: 
– 32 of 50 in single contig chromosome 
– Average contig count = 5 
– “Best” assembly: 
  • HGAP = 29 
  • CA hybrid = 21 
  • Most differences subtle 
• Annotation complete 
• Final QC & data submissions underway
CNH/CFSAN Samples 
• Sequencing 
– Avg Illumina cvg: 315x 
– Avg PacBio cvg: 167x 
  • 2 SMRT cells each 
• Assembly 
– 12 of 41 in single contig chromosome 
  • 29 in <= 5 contigs 
– Avg contig count = 4.5 
– Median contig count = 3 
– “Best” assembly (of 41): 
  • HGAP = 24 
  • PBcR CA = 14 
  • CA hybrid = 3 
• Annotation underway
Assembly QC & Curation 
[Figure: two plots of ROCK_290 assemblies vs. reference gi|374362062|gb|CP003033.1|; axes 0 to 2,500,000 bp; %similarity scale 0 to 100] 
• ROCK_290 Celera8 ctg vs. ref (CA8 – Ill/PB hybrid; query ctg7180000000002) 
– Largest Ctg Len: 2,759,091 bp 
– Total asm Ctg Len: 2,770,822 bp 
• ROCK_290 HGAP2 ctg vs. ref (query: several |quiver contigs; labels overprinted in the source) 
– Largest Ctg Len: 2,128,476 bp 
– Total asm Ctg Len: 2,802,621 bp
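One quick way to compare the two ROCK_290 assemblies is the share of the total assembly length held by the largest contig. The sketch below computes it from the four lengths reported on the slide. The metric and the function name are illustrative choices for this example, not something the presenters state they used.

```python
def largest_contig_fraction(largest_bp: int, total_bp: int) -> float:
    """Fraction of the assembled bases contained in the largest contig."""
    if total_bp <= 0 or largest_bp > total_bp:
        raise ValueError("expected 0 < largest_bp <= total_bp")
    return largest_bp / total_bp


# (largest contig length, total assembly length) in bp, from the slide.
assemblies = {
    "CA8 Ill/PB hybrid": (2_759_091, 2_770_822),
    "HGAP2": (2_128_476, 2_802_621),
}

for name, (largest, total) in assemblies.items():
    share = 100 * largest_contig_fraction(largest, total)
    print(f"{name}: largest contig holds {share:.1f}% of the assembly")
```

For this sample the hybrid assembly places nearly all bases in one contig, while the HGAP2 assembly places roughly three quarters there.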
Assembly QC & Curation 
[Figure: plot of HGAP2 contig scf7180000000002|quiver vs. reference gi|595636499|gb|CP007454.1|; axes 0 to 2,500,000 bp; %similarity scale 0 to 100] 
• HGAP2 
– Largest Ctg Len: 2,764,709 bp 
– Total asm Ctg Len: 2,764,709 bp 
• 4 bp overlap? 
– 1X coverage TAAC 
– 1X coverage TAGC
Challenges & Opportunities 
• Sample acquisition & quality 
• Efficiency/throughput vs. accuracy/quality 
– Sequencing strategy 
– Assembly QA/QC & curation 
• Ever longer reads! 
– Reduced coverage -> higher efficiency sequencing 
– More “closed” genomes! 
• Small plasmids 
– SageELF & Illumina
Thank You 
FDA Micro Team 
Peyton Hobson, Brittany Goldberg, Kevin Snyder, Tamara Feldblyum, Uwe Scherf, Sally Hojvat 
Collaborators 
LLNL 
Tom Slezak 
NIH-NCBI 
Bill Klimke, Martin Shumway, David Lipman 
NIH-NIAID 
Vivien Dugan, Maria Giovani 
DTRA 
Matt Tobelmann, Chris Detter, Eric 
VanGieson, Nels Olsen 
CDC 
Duncan MacCannell 
FDA-CFSAN 
Maria Hoffmann, Cary Pirone, Andrea 
Ottessen, Marc Allard, Eric Brown 
NMRC 
Kim Bishop-Lilly, Ken Frey 
IGS@UMD 
Lisa Sadzewicz, Luke Tallon, Naomi 
Sengamalay, Al Godinez, Sandy 
Ott, Sushma Nagaraj, Claire Fraser 
Rockefeller University 
Bryan Utter, Douglas Deutsch 
Children’s National Medical Center 
Brittany Goldberg, Joseph Campos 
DOD-CRP 
Shanmuga Sozhamannan, Mike Smith 
DOD-USAMRIID 
Tim Minogue 
NBACC 
Adam Phillippy, Nick Bergman 
ATCC 
Liz Kerrigan 
DSMZ 
Cathrin Sproer
