An integrated genomic surveillance platform reveals multiple introductions and accelerating localization of SARS-CoV-2 into California, USA and worldwide countries
Data Con LA 2020
Description
The Children's Hospital Los Angeles (CHLA) COVID-19 Analysis Research Database (CARD) (https://covid19.cpmbiodev.net/) is a comprehensive genomic resource of SARS-CoV-2 viral genomes and associated metadata for over 80,000 isolates (as of August 13, 2020) collected from global sequencing laboratories and the Center for Personalized Medicine (CPM) at CHLA. Its Virus Genome Tracker accepts a virus genome sequence and places the new viral isolate within the global or USA phylogenetic context, based on variant and haplotype comparisons, to trace transmission for genomic surveillance.
By haplotype analysis of 4,200 California isolates, 6,356 USA isolates, and over 80,000 global isolates, we identified a pattern of strongly localized outbreaks at the city, state, and country levels, and of temporal transmissions. Phylogenetic analyses revealed the cryptic introduction of multiple SARS-CoV-2 lineages into California and Los Angeles, deriving from state-to-state transmission and from international travel by air and ship. The majority of Orange County isolates formed distinct outbreak clusters whose haplotypes differed from those of isolates from neighboring Los Angeles. Of the 50,000 global isolates, 22,171 (45.8%) carried country-private haplotypes. The percentage was 28.2-29.6% from January to March and rapidly increased to 46.4% and 59.6% in April and May, co-occurring with global travel restrictions.
Speaker
Lishuang Shen, Children's Hospital Los Angeles, Sr. Bioinformatics Scientist
1. An integrated genomic surveillance platform reveals
multiple introductions and accelerating localization of
SARS-CoV-2 into California, USA and worldwide countries
Lishuang Shen
Xiaowu Gai
Dien Bard J, Biegel JA, Judkins AR, and CPM team
DATA Con LA 2020
October 23, 2020
4. 4
CARD - A CHLA platform of SARS-CoV-2 data for genomic surveillance
Website: SARS-CoV-2 & COVID-19 Resource https://covid19.cpmbiodev.net/
5. CARD - A CHLA platform of SARS-CoV-2 data for genomic surveillance
Website: https://covid19.cpmbiodev.net/
[Workflow diagram]
Patient samples: Patients with SARS-CoV-2 infection (collaborating hospitals and institutes: CHLA, CHOP, TCH, etc.) -> Extraction and library preparation -> Next-generation sequencing -> Bioinformatics (consensus, variants) -> Clinical data management, merging, and standardization -> Merged and standardized metadata
Public data: Public data repositories with COVID-19 patient and sequencing data (GISAID, GenBank, CNCB) -> Retrieve consensus genome assembly -> Bioinformatics (consensus, variants) -> Clinical data management, merging, and standardization -> Merged and standardized metadata
Database: CARD - Children's Hospital Los Angeles COVID-19 Analysis Research Database (MySQL) and website: genomic data, virus browser search
Other boxes: Phenotype-guided variant prioritization, classification, and interpretation (e.g., Quick-Mitome, Exomiser, ANNOVAR); following ACMG guidelines for variant classification and interpretation (e.g., ClinGen Variant Curation Interface, Cartagenia, Golden Helix)
Sharing: Children's Pathology Consortium (CPC) Cloud, public and controlled access; CARD Web, CPC Cloud group, CARD API, offline share
6. 6
▪ Data content in CARD
1. Global: About 145,000 strains (10/12/2020)
2. 25,000 variants
3. 145,000+ genome sequences -- 28,000 publicly accessible (non-GISAID)
a. 24,000+ USA sequences
b. 5,000 California source
c. 1,100 Los Angeles strains from patients
d. 750 CHLA internal strains
▪ Tools in CARD:
1. Virus Genome Tracker – temporal-spatial virus transmission inference and tracing tool: finds
and visualizes the most similar strains from the global collection, and quickly places a new
external strain in the context of the national and global virus phylogenetic trees
2. Web-portal to global and local virus strain & genome data
3. Fully cross-linking patient – virus strain – mutation – phylogenetic clade information
4. Phylogenetic Analysis & Visualization: Global, USA, time series
5. SARS-CoV-2 Genome Browser
Reference: Shen L, Maglinte DT, Ostrow D, Pandey U, Bootwalla M et al. (2020). Children's Hospital Los Angeles COVID-19 Analysis Research
Database (CARD) - A Resource for Rapid SARS-CoV-2 Genome Identification Using Interactive Online Phylogenetic Tools. bioRxiv
https://doi.org/10.1101/2020.05.11.089763
CARD Resource Status
7. 7
CARD - CHLA Resources of COVID-19/SARS-CoV-2
Virus strains, Demographic Data
[Screenshot: composite precise filter; CHLA samples]
8. 8
CARD- CHLA Resources of COVID-19/SARS-CoV-2
Virus strains, Demographic Data
9. 9
Virus Genome Tracker – Genomic Surveillance by Genome and
Haplotype Comparison
FASTA Align
SNP calling
Genome Comp. Report
SNP Function Annot
Population Frequency
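The SNP calling step in the pipeline above can be illustrated with a minimal Python sketch. It assumes the query genome has already been aligned to the reference, so both rows have equal length with '-' for gaps. The function names `call_snps` and `format_snp` are hypothetical and are not part of CARD. The slides state that CARD uses MUMmer for variant calling, so this only illustrates the idea.

```python
from typing import List, Tuple

Snp = Tuple[int, str, str]  # (1-based reference position, ref base, query base)


def call_snps(ref_aln: str, query_aln: str) -> List[Snp]:
    """List substitutions between two rows of the same pairwise alignment.

    Positions are counted on the ungapped reference. Gaps and ambiguous
    bases (for example N) are skipped, so only clean substitutions remain.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    snps: List[Snp] = []
    ref_pos = 0
    for ref_base, query_base in zip(ref_aln.upper(), query_aln.upper()):
        if ref_base != "-":
            ref_pos += 1  # advance only on reference bases, not on insertions
        if ref_base in "ACGT" and query_base in "ACGT" and ref_base != query_base:
            snps.append((ref_pos, ref_base, query_base))
    return snps


def format_snp(snp: Snp) -> str:
    """Render a SNP in the ref-position-alt style used on the slides, e.g. A23403G."""
    pos, ref_base, query_base = snp
    return f"{ref_base}{pos}{query_base}"


# Toy alignment (not real SARS-CoV-2 data): one insertion and one substitution.
toy_snps = call_snps("ACG-TAC", "ACGTTGC")
print([format_snp(s) for s in toy_snps])
```

The resulting variant names can then be looked up for functional annotation and population frequency, as the remaining boxes of the pipeline suggest.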
10. 10
Genome Gene, &
SNP Browser
Global virus strains of
top similarity
Link to their Phylo.
Tree Visualization
Virus Genome Tracker – Genomic Surveillance by Genome and
Haplotype Comparison
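Finding the "global virus strains of top similarity" can be sketched as ranking stored strains by how much their variant sets overlap with the query. The slides do not state which similarity measure CARD uses. The Jaccard index below is an assumption chosen for illustration, and `jaccard` and `top_matches` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared variants divided by all variants seen in either strain."""
    if not a and not b:
        return 1.0  # two reference-identical genomes are treated as identical
    return len(a & b) / len(a | b)


def top_matches(
    query: Set[str], collection: Dict[str, Set[str]], n: int = 3
) -> List[Tuple[str, float]]:
    """Return the n most similar strains, best first (ties broken by name)."""
    scored = [(name, jaccard(query, variants)) for name, variants in collection.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:n]


# Toy collection with made-up strain names; variant labels follow the
# ref-position-alt style used on the slides.
collection = {
    "strainA": {"C241T", "A23403G"},
    "strainB": {"C241T", "A23403G", "G28881A"},
    "strainC": {"G11083T"},
}
query = {"C241T", "A23403G"}
for name, score in top_matches(query, collection):
    print(name, round(score, 3))
```

The top-ranked strains are the ones a tool like this would highlight on the phylogenetic tree.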
11. 11
Virus Genome Tracker – Highlight Matches on Global/USA
Phylogenetic Tree
Highlight virus by levels of
similarity in zoom-in subtree
12. Comprehensive genomic and epidemiological analysis
of SARS-CoV-2 isolates from Los Angeles, California,
USA and the World
13. 13
Genome Analysis of 6,000 USA Isolates Reveals Haplotype Signatures
and Localized Transmission Patterns by State and by Country
▪ Global and USA SARS-CoV-2 Sequence Data
1. The CHLA internal SARS-CoV-2 sequencing data were generated using the SARS-CoV-2
whole genome sequencing research assay, established by the CHLA Center for
Personalized Medicine and the Virology Laboratory.
2. The major external resources of SARS-CoV-2 strains, genome sequences, and variants
were GISAID, GenBank, CNCB, and NextStrain.
3. 6,356 USA isolates (February through early May 2020)
▪ Sequence Alignment, Variant Calling, Haplotype Analysis, Evolutionary analysis
1. Viral genome comparison and variant calling with MUMmer version 4.0.12 (Marçais et al)
2. Data management: MySQL database at CHLA COVID-19 Analysis Research Database
(CARD)
3. Haplotype analysis with SQL queries and custom scripts. Country and State-private
(within USA) haplotypes and variants were identified
4. Multiple sequence alignment (MSA) with MAFFT (version 7.460)
5. MSA was analyzed with IQ-TREE (v1.6.12) and MEGA-X for evolutionary history inference.
6. A maximum likelihood tree was generated using the GTR substitution model.
7. The evolutionary rate was estimated and the phylogeny was time-resolved using TreeTime
8. Visualization online using auspice (Nextstrain) or Archaeopteryx.js.
9. Workflow control: snakemake, implemented in the Nextstrain command line version
Reference:
Shen L, Dien Bard J, Biegel JA, Judkins AR, Gai X. (2020). Comprehensive genome analysis of 6,000 USA SARS-CoV-2 isolates reveals haplotype signatures and localized transmission patterns by state
and by country. Frontiers in Microbiology https://doi.org/10.3389/fmicb.2020.573430
Shen L, Dien Bard J, Biegel JA, Judkins AR, Gai X. (2020). Comprehensive variant and haplotype landscape of 50,000 global SARS-CoV-2 isolates. bioRxiv https://doi.org/10.1101/2020.07.09.193722
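The country-private haplotype step can be sketched as follows: a haplotype is treated as private when every isolate carrying it comes from a single country. The slides say this analysis was done with SQL queries and custom scripts that are not shown, so the Python below is an illustrative stand-in rather than the authors' code. The function name, the tuple layout, and the comma-joined haplotype strings are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (isolate_id, country, haplotype); the haplotype is written
# here as a comma-joined string of variant names (an assumed encoding).
Record = Tuple[str, str, str]


def country_private_haplotypes(
    isolates: Iterable[Record], min_isolates: int = 1
) -> Dict[str, Tuple[str, int]]:
    """Map each private haplotype to (its only country, number of isolates)."""
    countries = defaultdict(set)
    counts = defaultdict(int)
    for _isolate_id, country, haplotype in isolates:
        countries[haplotype].add(country)
        counts[haplotype] += 1
    private = {}
    for haplotype, seen in countries.items():
        # Private means seen in exactly one country, with enough supporting isolates.
        if len(seen) == 1 and counts[haplotype] >= min_isolates:
            private[haplotype] = (next(iter(seen)), counts[haplotype])
    return private


# Toy records, not real isolates.
toy_isolates = [
    ("i1", "UK", "C241T,A23403G"),
    ("i2", "UK", "C241T,A23403G"),
    ("i3", "USA", "C8782T"),
    ("i4", "Australia", "C8782T"),
    ("i5", "USA", "G11083T"),
]
print(country_private_haplotypes(toy_isolates))
print(country_private_haplotypes(toy_isolates, min_isolates=2))
```

Raising `min_isolates` drops singleton haplotypes. Replacing the country field with a state would give the state-private (within USA) variant of the same analysis.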
14. 14
▪ Results:
1. Globally, ~50,500 isolate genomes from GISAID, GenBank, CHLA, and other
sources (as of June 18, 2020)
2. 6,070 variants and 2,513 haplotypes were detected in at least three isolates
3. 1,583 country-private variants from 10,238 isolates (20.6%)
4. 22,171 (45.8%) isolates carried country-private haplotypes, mostly singletons. 807
country-private haplotypes (each present in 5 or more isolates) were found in 8,656 isolates from 39 countries
5. The localization of variant haplotype profiles was accelerating: the share of isolates carrying
country-private haplotypes was 28.2-29.6% in January to March, then 46.4% and 59.6% in April and May,
co-occurring with global travel restrictions
6. Evidence supporting positive (orf3a, orf8, S genes) and purifying (M gene)
selection
Reference: Shen L, Dien Bard J, Biegel JA, Judkins AR, Gai X. (2020). Comprehensive variant and haplotype landscape of 50,000 global SARS-CoV-2
isolates. bioRxiv https://doi.org/10.1101/2020.07.09.193722
Comprehensive haplotype landscapes of 50,500 global SARS-CoV-2
isolates - accelerating accumulation of country-private variant profiles
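The month-by-month localization trend can be computed by grouping isolates by collection month and taking the share that carry a country-private haplotype. This is a minimal sketch under assumed inputs: each record is a (collection date as 'YYYY-MM-DD', is-private flag) pair, and `monthly_private_percent` is a hypothetical name. The toy records are made up and do not reproduce the percentages reported above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def monthly_private_percent(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percent of isolates per month that carry a country-private haplotype."""
    total = defaultdict(int)
    private = defaultdict(int)
    for collection_date, is_private in records:
        month = collection_date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        total[month] += 1
        if is_private:
            private[month] += 1
    return {month: 100.0 * private[month] / total[month] for month in sorted(total)}


# Toy records only: 1 of 4 private in March, 3 of 5 private in May.
toy_records = [
    ("2020-03-02", False), ("2020-03-10", True),
    ("2020-03-15", False), ("2020-03-28", False),
    ("2020-05-01", True), ("2020-05-04", True), ("2020-05-09", False),
    ("2020-05-17", True), ("2020-05-30", False),
]
print(monthly_private_percent(toy_records))
```

A rising series across months would correspond to the accelerating accumulation of country-private profiles described in the results.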
17. 17
Maximum likelihood phylogenetic tree of representative isolates
carrying Country-private or non-private haplotypes from the global
isolates.
UK-private isolates in red,
USA in blue, other countries in green,
non-country-private in black
D614G
19. 19
Country-private recurrent variants present in 3 or more isolates
Country (exposure)  Variants  Isolates  % of country isolates  Total isolates in country
UK                  896       6142      27.8322                22068
USA                 329       2166      20.8229                10402
Australia           52        428       19.7326                2169
India               43        232       25                     928
Netherlands         34        206       12.8349                1605
China               39        197       24.5636                802
Spain               24        108       7.0959                 1522
Iceland             12        82        13.7584                596
Canada              14        57        6.5068                 876
Portugal            8         54        8.3981                 643
Denmark             9         50        6.7476                 741
Luxembourg          13        46        16.9742                271
France              9         39        10.1299                385
Congo               5         36        27.0677                133
Singapore           10        35        9.1146                 384
20. 20
Country-private haplotypes present in 5 or more isolates
Country      Haplotypes  Strains  Total strains in country  % of country strains
UK           464         4942     22068                     22.3944
USA          166         1728     10397                     16.6202
Australia    32          356      2169                      16.4131
Netherlands  26          234      1592                      14.6985
Iceland      11          187      596                       31.3758
Spain        10          107      1528                      7.0026
Portugal     12          99       643                       15.3966
Canada       9           72       876                       8.2192
Thailand     5           69       203                       33.9901
India        9           59       928                       6.3578
Belgium      6           44       788                       5.5838
Denmark      3           43       741                       5.803
Singapore    3           43       384                       11.1979
21. California in the first 9 months of the COVID-19
pandemic – the comprehensive transmission and
haplotype landscapes of 4,400 California SARS-CoV-2
isolates
22. 22
California SARS-CoV-2 genome data analysis
▪ Results:
1. 4,416 California isolate genomes from CHLA (700+), GISAID, GenBank, and other
sources (as of October 10, 2020)
2. Los Angeles (LA) isolates were distributed across most major clades and were largely absent
from some major UK/Europe clades
3. Los Angeles isolates from CHLA and non-CHLA sources were generally intermixed in the tree.
4. Los Angeles and neighboring Orange County had distinct isolate haplotypes.
5. San Diego and neighboring Imperial County shared relatively compact clusters.
Their clade proportions differed from those of other California areas.
6. The earliest isolates clustered around the S/19A clade. They came
mainly from the Northern California Bay Area (San Jose, San Francisco, and San
Joaquin County).
7. A high proportion (82%) of California isolates carried state-private (within USA) haplotypes,
compared with 56% (866 of 1,543 isolates) in the New York/New Jersey area
24. Questions
California isolates and closely-related non-California isolates
Branch Length by TIME
Spike Protein
S:D614G A23403G
Inferred divergence date:
2020-01-16
CI: 2019-12-09, 2020-01-19
26. 26
CHLA in blue, other LA in light blue, other USA in dark blue,
UK light green, San Francisco and Santa Clara in purple,
San Diego in yellow-green, Orange County in yellow highlight, other California in red
UK and other countries in green, all others in black
Global backbone and California isolate phylogenetic tree
27. Clade proportion by location changed over time:
initial stage until March 31 (left), lockdown stage until June 15
(middle), post-lockdown stage since June 16 (right)
28. Los Angeles 1058 isolates: CHLA (773, blue, yellow), CSMC (Cedars-
Sinai Medical Center, 144, light blue), and UCLA (141, grey)
Spike Protein
S:D614G A23403G
Inferred divergence date:
2020-01-16
CI: 2019-12-09, 2020-01-19
29. Los Angeles 1058 isolates: CHLA (773, blue, yellow), CSMC (Cedars-
Sinai Medical Center, 144, light blue), and UCLA (141, grey)
34. Questions
Orange County (orange), Los Angeles (purple), San
Francisco (red), and San Diego (green); x-axis by collection date
35. Questions
Orange County (orange), Los Angeles (purple), San
Francisco (red), and San Diego (green); x-axis by mutation
36. Questions
Diamond Princess Cruise, California, transmission to India and multiple
countries, or from a cryptic inferred ancestor haplotype
Diamond Princess
Cruise
37. Questions
Diamond Princess Cruise, California, transmission to India and multiple
countries, or from a cryptic inferred ancestor haplotype
43. 43
Acknowledgement
▪ Software and Tools:
▪ Nextstrain team
▪ MUMmer
▪ MAFFT
▪ MEGA-X
▪ The phylogenetic tree is rendered
with Archaeopteryx.js
▪ Genome visualization: JBrowse
▪ Data Sources:
▪ The SARS-CoV-2 genomes and metadata were
generously shared via GISAID, GenBank, and China
National Center for Bioinformation (CNCB).
▪ We gratefully acknowledge the Authors, Originating
and Submitting laboratories of the genetic sequence
and metadata made available through GISAID,
GenBank and CNCB on which this research is based.
People:
•Alexander Judkins
•Timothy Triche
•Jaclyn Biegel
•Jennifer Dien Bard
•Dejerianne (Gigi) Ostrow
•Utsav Pandey
•Dennis Maglinte
•Moiz Bootwalla
•Alex Ryutov
•David Ruble
•Jennifer Han
•Ananthanarayanan Govindarajan
•James Done
•Ryan Schmidt