This document discusses how big data and AI can help fight Covid-19. It describes supercomputers being used for scientific research on Covid-19. An open data lake has been created containing various Covid-19 datasets for analysis. Natural language processing and BERT are being used to answer scientific questions from the Covid-19 literature by generating summaries and highlighting relevant text passages. Challenges are being conducted on the Covid-19 Open Research Dataset to further advance research.
Big Data and AI in Fighting Against COVID-19
1. Big Data and AI in Fighting Against Covid-19
--
Andrew Zhang
zhangan@amazon.com
7/8/2020
3. Agenda
1. Introduction
2. Supercomputers for Scientific Research
3. Covid-19 Open Data Lake
4. NLP and BERT to answer scientific questions
4. Speaker: Andrew Zhang
Andrew is a Senior Technical Account Manager at AWS; his specialties are
big data, machine learning, and HPC. Before joining Amazon, Andrew was a
data science engineer with IBM. His interest is scaling machine learning
in a hybrid multi-cloud enterprise environment. Previously, Andrew was an
enterprise architect with Novartis Pharmaceuticals.
7. Covid-19 High Performance Computing Consortium
Extensive research in bioinformatics, epidemiology, and molecular modeling to understand the treatment and develop strategies
Bringing together leaders to provide access to the world’s most powerful high-performance computing resources.
9. Covid-19 Active Research Projects
“We have identified two target proteins that generate novel molecules to inhibit
the relevant proteins. The compute capacity will enable us to run and optimize our
neural networks to generate better molecules and estimate their binding affinity to
the target proteins, drug-likeness and ADMET properties. Our work will evolve to
use 3D SMILES (currently at 2D) and other improvement.”
“We are working with a team who have developed a device to allow safe ventilator
splitting between 2 or more patients. We made the software to guide device
selection based on the patient's respiratory states, but we want the app to allow for
just lookup into pre-computed values from the …”
“We have designed a mobile app and a technological platform, compliant to the
European legislation, which enables unidentified contact/exposure information of
users to be efficiently collected in a fully anonymous way.”
https://covid19-hpc-consortium.org/
11. Covid-19 Tracking and Prediction
• COVID-19 confirmed cases and deaths: a visual representation of the number of confirmed cases (counties) and deaths (circles). Data source: the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE.
• Genomic epidemiological tracking: genomic epidemiology of the novel coronavirus, which provides real-time tracking of pathogen evolution (transmissions and phylogeny).
• Hospital resource utilization modeling. Data source: University of Washington’s Institute for Health Metrics and Evaluation (IHME) COVID-19 projections.
Source: Databricks
12. Covid-19 Research and Diagnosis
Answer Key Questions from Scientific Literature
• What is known about transmission, incubation, and environmental stability?
• What do we know about COVID-19 risk factors?
• What do we know about virus genetics, origin, and evolution?
• What has been published about medical care?
Source: Kaggle
Read COVID-19 X-ray or CT image
While PCR tests offer many advantages, they are physical things that require shipping the test or the sample. X-ray machines can be plugged in to screen patients as long as they have electricity.
AI tools can help general practitioners to triage and treat patients. Companies are developing AI tools and deploying them at hospitals (Wired, 2020).
Source: IEEE
13. Open Data Lake: Query and Visualization (Amazon)
• Global Coronavirus (COVID-19) Data – Tracks confirmed COVID-19 cases in
provinces, states, and countries across the world with a breakdown to the
county level in the US.
• Coronavirus (COVID-19) Data in the United States – Tracks confirmed
cases and deaths in the US by state and county.
• Coronavirus Disease (COVID-19) Testing Data – Tracks the number of
people tested, pending tests, and positive and negative tests for COVID-19.
• USA Hospital Beds – COVID-19 – Data on hospital beds and their
utilization in the US.
• COVID-19 Open Research Dataset (CORD-19) – A collection of over 45,000
research articles (over 33,000 with full text) about COVID-19, SARS-CoV-2,
and related coronaviruses. AWS has preprocessed and enriched these
with annotations extracted from Amazon Comprehend Medical.
• Amazon S3 Explorer: https://dj2taa9i652rf.cloudfront.net/
• AWS Glue, simple and cost-effective ETL: https://aws.amazon.com/glue/
• Amazon Athena, a serverless SQL query engine: https://aws.amazon.com/athena/
• Amazon QuickSight: https://aws.amazon.com/quicksight/
A public data lake for analysis of COVID-19 data | AWS Big ... QuickSight Dashboard
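The datasets above are queried with standard SQL through Athena. As a rough illustration of the kind of aggregation involved, the sketch below rolls county-level counts up to state level. It uses Python's built-in sqlite3 module as a local stand-in so that it runs without an AWS account; the table name `us_county_cases`, its columns, and the sample rows are hypothetical and do not reflect the data lake's actual schema or real figures.

```python
import sqlite3

# Hypothetical rows (state, county, confirmed, deaths).
# The numbers are made up for illustration; they are not real data.
SAMPLE_ROWS = [
    ("WA", "King", 100, 5),
    ("WA", "Pierce", 40, 2),
    ("NY", "Queens", 300, 20),
    ("NY", "Kings", 250, 15),
]


def totals_by_state(rows):
    """Aggregate county-level counts to state level with a SQL GROUP BY."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed table layout, chosen only for this sketch.
        conn.execute(
            "CREATE TABLE us_county_cases "
            "(state TEXT, county TEXT, confirmed INTEGER, deaths INTEGER)"
        )
        conn.executemany("INSERT INTO us_county_cases VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT state, SUM(confirmed), SUM(deaths) "
            "FROM us_county_cases "
            "GROUP BY state "
            "ORDER BY SUM(confirmed) DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for state, confirmed, deaths in totals_by_state(SAMPLE_ROWS):
        print(state, confirmed, deaths)
```

The same SELECT statement shape would be submitted to a serverless engine instead of a local database; only the connection and the real table names change.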
14. Open Data Lake: Query and Visualization (Google)
• COVID-19 data from Johns Hopkins Center for Systems Science and Engineering
• OpenStreetMap Public Dataset: world map including healthcare provider locations
• Global Health Dataset from The World Bank: global health and population trends
• New York Times COVID-19 database: The New York Times' COVID-19 database based on US health agency reports
• ECDC COVID-19 Cases by Country: COVID-19 cases by country as reported by the European Centre for Disease Prevention and Control
• USAFacts COVID-19 Cases by US County: COVID-19 cases by county aggregated by USAFacts from US health agencies
BigQuery
16. NLP and BERT
• BERT, as a contextual model, captures the relationships between a word
and the words around it in a bidirectional way.
• In the sentence "I made a bank deposit", the unidirectional
representation of "bank" is based only on "I made a", not on "deposit".
• Because the model is pre-trained on massive datasets, anyone building
natural language processing applications can use this free powerhouse.
• BERT theoretically allows us to smash multiple benchmarks with minimal
task-specific fine-tuning.
• Corporate data can be used to create different applications.
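The "bank deposit" example can be made concrete with a few lines of Python. This is only an illustration of which words are visible to each kind of model when it represents the target word; it is not BERT itself, and the helper names are made up for this sketch.

```python
def left_context(tokens, index):
    """Words a left-to-right (unidirectional) model can see for tokens[index]."""
    return tokens[:index]


def bidirectional_context(tokens, index):
    """Words a bidirectional model can see: everything except the target word."""
    return tokens[:index] + tokens[index + 1:]


sentence = "I made a bank deposit".split()
target = sentence.index("bank")

print(left_context(sentence, target))           # ['I', 'made', 'a']
print(bidirectional_context(sentence, target))  # ['I', 'made', 'a', 'deposit']
```

Only the bidirectional view includes "deposit", the word that disambiguates "bank" as a financial institution.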
17. COVID-19 Open Research Dataset Challenge
https://www.whitehouse.gov/briefings-statements/call-action-tech-community-new-machine-readable-covid-19-dataset
Every scientist working on a cure or vaccine
must understand this prior research.
158,000 Coronavirus scholarly articles
including 75,000 with full text.
• What is known about transmission, incubation, and environmental stability?
• What do we know about COVID-19 risk factors?
• What do we know about vaccines and therapeutics?
• What do we know about virus genetics, origin, and evolution?
• What has been published about medical care?
• What has been published about ethical and social science considerations?
• What do we know about non-pharmaceutical interventions?
• What do we know about diagnostics and surveillance?
18. Explore Covid-19 Scientific Literature (1)
• Generate summaries from abstracts by training a summarizer model
• Generate a word cloud from all the titles
Source: Databricks
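A word cloud is driven by word frequencies across the titles. The sketch below computes those frequencies with the standard library only; the stop-word list and the sample titles are made up for illustration, and a plotting library would still be needed to draw the cloud itself.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real word-cloud tools ship a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "with"}


def title_word_counts(titles):
    """Count lowercase words across all titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9][a-z0-9-]*", title.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


# Made-up titles, for illustration only.
titles = [
    "Transmission of the novel coronavirus",
    "Coronavirus risk factors in adults",
    "Modeling coronavirus transmission",
]

print(title_word_counts(titles).most_common(2))
# [('coronavirus', 3), ('transmission', 2)]
```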
19. Explore Covid-19 Scientific Literature (2)
Answering all tasks/challenges using NLP:
We will use different libraries to get answers from these
papers.
• BERT QA model (pretrained on the SQuAD dataset)
• BERT summary model
• Python Google Translate package
• HTML to visualize the results
Overall flow:
• Using the QA model, read every paper's abstract and
find answers for all tasks
• Concatenate the top 50 most confident answers into one
article, then use the summary model to write a
summary of the answers
• Translate the summary into multiple languages with Google Translate
• Write HTML to show the summary of all papers'
answers for all tasks.
Source: Kaggle
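The first two steps of the flow above can be sketched as a small function. This is a minimal outline under stated assumptions, not the notebook's actual code: `qa_fn` and `summarize_fn` are hypothetical callables standing in for the BERT QA and summary models, and the translation and HTML steps are left out.

```python
from typing import Callable, List, Tuple


def answer_task(
    question: str,
    abstracts: List[str],
    qa_fn: Callable[[str, str], Tuple[str, float]],
    summarize_fn: Callable[[str], str],
    top_n: int = 50,
) -> str:
    """Run QA over every abstract, keep the top_n most confident answers,
    join them into one article, and summarize that article."""
    # qa_fn(question, abstract) is assumed to return (answer_text, confidence).
    scored = [qa_fn(question, abstract) for abstract in abstracts]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    article = " ".join(answer for answer, _ in scored[:top_n])
    return summarize_fn(article)


# Toy stand-ins so the sketch runs without any model download.
def toy_qa(question: str, abstract: str) -> Tuple[str, float]:
    """Score an abstract by the fraction of question words it contains."""
    q_words = set(question.lower().split())
    if not q_words:
        return abstract, 0.0
    overlap = q_words & set(abstract.lower().split())
    return abstract, len(overlap) / len(q_words)


def toy_summarize(text: str) -> str:
    """Truncate instead of summarizing; a real model would go here."""
    return text[:80]


if __name__ == "__main__":
    abstracts = [
        "incubation period estimates vary",
        "hospital beds by county",
        "the incubation period and transmission",
    ]
    print(answer_task("incubation period", abstracts, toy_qa, toy_summarize, top_n=2))
```

Passing the models in as callables keeps the ranking logic testable on its own and lets either model be swapped without touching the flow.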
20. Explore Covid-19 Scientific Literature (3)
Google
1. When the user asks an initial
question, the tool not only returns a
set of papers (like in a traditional
search) but also highlights snippets
from the paper that are possible
answers to the question.
2. The user can review the snippets
and quickly make a decision on
whether or not that paper is worth
further reading.
3. If the user is satisfied with the initial
set of papers and snippets, we have
added functionality to pose follow-up
questions, which act as new
queries for the original set of
retrieved articles.
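The three steps above amount to search with highlighted snippets, followed by re-querying only the papers already retrieved. The sketch below imitates that behavior with plain word overlap; it is not the tool's actual ranking method, and the function names, paper records, and texts are hypothetical.

```python
import re


def _words(text):
    """Lowercase word set; a crude stand-in for a real language model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(question, papers):
    """Return (paper, best_snippet) pairs for papers that share at least
    one word with the question, best match first."""
    q_words = _words(question)
    hits = []
    for paper in papers:
        sentences = [s.strip() for s in paper["text"].split(".") if s.strip()]
        scored = [(len(q_words & _words(s)), s) for s in sentences]
        if not scored:
            continue
        score, snippet = max(scored, key=lambda pair: pair[0])
        if score > 0:
            hits.append((score, paper, snippet))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [(paper, snippet) for _, paper, snippet in hits]


def follow_up(question, previous_results):
    """Treat a follow-up question as a new query over the earlier result set."""
    return search(question, [paper for paper, _ in previous_results])


# Made-up papers, for illustration only.
papers = [
    {"title": "Paper A", "text": "The incubation period is about five days. Masks reduce transmission."},
    {"title": "Paper B", "text": "Hospital bed capacity varies by county."},
    {"title": "Paper C", "text": "Incubation differs by age group. Children show mild symptoms."},
]

first = search("incubation period", papers)
second = follow_up("children symptoms", first)
print([(p["title"], s) for p, s in first])
print([(p["title"], s) for p, s in second])
```

Because `follow_up` only sees the earlier results, Paper B can never reappear in a follow-up, which mirrors the narrowing behavior described in step 3.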
21. Explore Covid-19 Scientific Literature (4)
Amazon
Build the AWS COVID-19 knowledge graph (CKG)
using AWS CloudFormation and Amazon
Neptune, and query the graph using
Jupyter notebooks hosted on Amazon
SageMaker in your AWS account.
The CKG aids in the exploration and
analysis of the COVID-19 Open Research
Dataset (CORD-19), hosted in the AWS
COVID-19 data lake.
The strength of the graph comes from
the connections between scholarly
articles, authors, scientific concepts, and
institutions. The CKG also helps power
the CORD-19 search page.
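To show why the connections between articles, authors, concepts, and institutions are useful, the sketch below builds a tiny in-memory graph and finds papers related through a shared author or concept. It is a plain-Python illustration only: it does not use Amazon Neptune or its query languages, and the class, node names, and edges are hypothetical.

```python
from collections import defaultdict


class TinyKnowledgeGraph:
    """Undirected graph whose nodes are (kind, name) tuples,
    e.g. ("paper", "P1") or ("author", "Author X")."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_edge(self, node_a, node_b):
        """Connect two nodes in both directions."""
        self._edges[node_a].add(node_b)
        self._edges[node_b].add(node_a)

    def neighbors(self, node, kind=None):
        """Directly connected nodes, optionally filtered by kind."""
        found = self._edges.get(node, set())
        return sorted(n for n in found if kind is None or n[0] == kind)

    def related_papers(self, paper):
        """Papers that share at least one author, concept, or other
        neighbor with the given paper (two hops away)."""
        related = set()
        for neighbor in self._edges.get(paper, set()):
            for candidate in self._edges[neighbor]:
                if candidate[0] == "paper" and candidate != paper:
                    related.add(candidate)
        return sorted(related)


# Made-up nodes and edges, for illustration only.
graph = TinyKnowledgeGraph()
graph.add_edge(("paper", "P1"), ("author", "Author X"))
graph.add_edge(("paper", "P2"), ("author", "Author X"))
graph.add_edge(("paper", "P2"), ("concept", "incubation"))
graph.add_edge(("paper", "P3"), ("concept", "incubation"))
graph.add_edge(("author", "Author X"), ("institution", "Institute Y"))

print(graph.related_papers(("paper", "P2")))
# [('paper', 'P1'), ('paper', 'P3')]
```

P2 reaches P1 through a shared author and P3 through a shared concept, which is the kind of link a keyword search over titles alone would miss.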