Understanding the Pandemic Through Mining Covid News Using Natural Language Processing
- IEEE CCWC, 2021
Presented by: Nishat Anjum
Authors:
Nafiz Sadman (1), Nishat Anjum (2), Kishor Datta Gupta (3), M. A. Parvez Mahmud (4)
1. Silicon Orchard Research and Analytics Lab (SORAL, research.siliconorchard.com), Dhaka, Bangladesh
2. Independent University, Bangladesh
3. University of Memphis, Memphis, TN, USA
4. Deakin University, Geelong, Australia
ROAD MAP
INTRODUCTION
OUR RESEARCH AIM & CONTRIBUTION
NNK DATASET
EXPERIMENTAL FINDINGS
LIMITATIONS AND FUTURE WORK
INTRODUCTION
88 million reported cases
1.9 million deaths
As of 12 January 2021 (Weekly Epidemiological Update, worldwide, World Health Organization)
The first cluster of COVID-19 cases was initially reported on 31 December 2019, when the WHO China Country Office was informed.
Information exchange media
Social Media
Newspaper
Television/Digital news
3.6 bil
2.5 bil
600 mil
http://www.ifabc.org/news/More-People-Read-Newspapers-Worldwide-Than-Use-Web.
https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
Two types of fight against COVID-19:
a. Tangible
- Frontline doctors, nurses, military personnel, NGOs, volunteers, etc.
b. Intangible
- Researchers, scientists, academics, etc.
Relatively little COVID-19 research is based on Natural Language Processing compared to:
- Computer Vision applications
- Chest X-ray classification [1]
- CT-scan classification [1]
- Genome sequencing [2]
[1] M. M. Ahsan, K. D. Gupta, M. M. Islam, S. Sen, M. Rahman, M. Shakhawat Hossain et al., “Covid-19 symptoms detection based on nasnetmobile with explainable ai using various imaging modalities,” Machine Learning and Knowledge Extraction, vol. 2, no. 4, pp. 490–504, 2020.
[2] G. S. Randhawa, M. P. Soltysiak, H. El Roz, C. P. de Souza, K. A. Hill, and L. Kari, “Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: Covid-19 case study,” PLoS ONE, vol. 15, no. 4, p. e0232391, 2020.
- A. Alimadadi, S. Aryal, I. Manandhar, P. B. Munroe, B. Joe, and X. Cheng, “Artificial intelligence and machine learning to fight covid-19,” 2020.
- S. Tuli, S. Tuli, R. Tuli, and S. S. Gill, “Predicting the growth and trend of covid-19 pandemic using machine learning and cloud computing,” Internet of Things, p. 100222, 2020.
OUR RESEARCH AIM & CONTRIBUTION
1. Assert the importance of newspapers (print/digital) in battling COVID-19 by raising public awareness.
2. Use newspapers as the primary source for information extraction using Natural Language Processing (NLP) techniques.
3. Understand how newspapers portray the pandemic in a developed country and in a developing country.
Contribution:
• Analysis and findings of the information extracted from newspapers. (1)
• The code used to perform data analysis on the newspapers. (1)
• The dataset (NNK-Dataset) used in this paper. (1, 2)
1. https://github.com/NNK-Dataset
2. https://doi.org/10.34740/kaggle/dsv/1511505
NNK DATASET
1. Data Collection
Annotators: 10 human annotators (age: 23-25; occupation: undergraduate students)
Collection tool: Google Forms
Selection criteria:
- The headline must have one or more words directly or indirectly related to COVID-19.
- The content of each news report must have 5 or more keywords directly or indirectly related to COVID-19.
- Avoid taking duplicate reports.
- Maintain a time frame for the newspapers.
Covid-News-USA-NNK (1):
- 500 news reports from The Washington Post
- 500 news reports from Star Tribune
Covid-News-BD-NNK (2):
- 25 news reports from The Daily Star
- 25 news reports from Prothom Alo
1. https://github.com/NNK-Dataset/USA-NNK/blob/master/usaformlink.md
2. https://github.com/NNK-Dataset/BD-NNK/blob/master/bdformlink.md
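The selection criteria above were applied by human annotators. As a rough illustration only, the same rules could be checked mechanically as sketched below. The keyword set `COVID_KEYWORDS` and the helpers `count_keyword_hits` and `meets_criteria` are hypothetical names introduced here; the slides do not give the annotators' actual keyword list, and the time-frame rule is left out because its bounds are not stated.

```python
import re

# Hypothetical keyword list; the annotators' real list is not given in the slides.
COVID_KEYWORDS = {"covid", "coronavirus", "pandemic", "quarantine",
                  "lockdown", "virus", "outbreak", "infection"}

def count_keyword_hits(text):
    """Count tokens in `text` that appear in COVID_KEYWORDS (repeats count)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for tok in tokens if tok in COVID_KEYWORDS)

def meets_criteria(headline, body, seen_headlines):
    """Check the headline rule, the 5-keyword body rule, and the duplicate rule."""
    key = headline.strip().lower()
    if key in seen_headlines:              # avoid duplicate reports
        return False
    if count_keyword_hits(headline) < 1:   # headline needs at least one hit
        return False
    if count_keyword_hits(body) < 5:       # body needs five or more hits
        return False
    seen_headlines.add(key)
    return True
```

Passing the same `seen_headlines` set across calls is what makes the duplicate rule work; a stricter check could compare links instead of headlines.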
2. Data Pre-processing
• Remove hyperlinks.
• Remove non-English, alphanumeric characters.
• Remove stop words
• Lemmatization
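A minimal sketch of the four pre-processing steps above, using only the Python standard library. It makes several assumptions: the stop-word list is a tiny illustrative one, `naive_lemma` is a crude suffix-stripping stand-in for a real lemmatizer, and the character filter keeps only English letters, which is one possible reading of the second bullet. It is not the code from the authors' repository.

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to",
              "is", "are", "was", "were", "at", "for"}

def naive_lemma(token):
    """Crude plural stripping, standing in for real lemmatization."""
    if token.endswith(("ss", "us", "is")):      # e.g. "virus", "crisis" stay as-is
        return token
    for suffix, repl in (("ies", "y"), ("s", "")):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + repl
    return token

def preprocess(text):
    """Return cleaned, lower-cased, lemma-like tokens for one news report."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # keep English letters only
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_lemma(t) for t in tokens]
```

Because this filter also drops digits, any numeric extraction would have to run on the raw text before this step.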
3. Data Description
Table 1: Covid-News-USA-NNK
No. of words per headline: 7 - 20
No. of words per body content: 150 - 2100
Attribute           Description
Date                Date when news was posted
Link                Hyperlink
Newspaper Name      Name of newspaper
Headline Keywords   Keywords extracted from headline
Report Keywords     Keywords extracted from body

Table 2: Covid-News-BD-NNK
No. of words per headline: 10 - 20
No. of words per body content: 100 - 1500
Attribute           Description
Date                Date when news was posted
Link                Hyperlink
Newspaper Name      Name of newspaper
Headline            Keywords extracted from headline
Report              Keywords extracted from body
4. Dataset Repository, Policy and License
• Project stored on GitHub: https://github.com/NNK-Dataset
• Covid-News-USA-NNK: https://github.com/NNK-Dataset/USA-NNK
• Covid-News-BD-NNK: https://github.com/NNK-Dataset/BD-NNK
• Kaggle: https://doi.org/10.34740/kaggle/dsv/1511505
• License: CC0 (Creative Commons)
EXPERIMENTAL FINDINGS
Word Clouds: Washington Post News (USA)
February, 2020 March, 2020 April, 2020 May, 2020
Word Clouds: Star Tribune News (USA)
February, 2020 March, 2020 April, 2020 May, 2020
Word Clouds: Daily Star News (BD)
March, 2020 April, 2020
Word Clouds: Prothom Alo News (BD)
March, 2020 April, 2020
Covid cases through number extraction:
Cases (based on keywords in news reports) related to COVID-19 from February till March. The X axis represents the month and the Y axis represents cases in units of 10,000.
Numeric extraction keywords: Infected, Died, Infections, Quarantined, Lockdown, Diagnosed.
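The numeric extraction described above can be approximated with a regular expression that looks for a number followed closely by one of the listed keywords. This is a sketch under assumptions: the pattern, the three-word gap, and the function name `extract_counts` are illustrative choices, not the authors' implementation.

```python
import re

# Keywords from the slide, lower-cased.
KEYWORDS = ("infected", "died", "infections", "quarantined", "lockdown", "diagnosed")

# A number (with optional thousands separators), then up to three
# alphabetic filler words, then one of the keywords.
PATTERN = re.compile(
    r"(\d{1,3}(?:,\d{3})+|\d+)\s+(?:[A-Za-z]+\s+){0,3}?("
    + "|".join(KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def extract_counts(text):
    """Return (keyword, number) pairs found in a news report."""
    results = []
    for number, keyword in PATTERN.findall(text):
        results.append((keyword.lower(), int(number.replace(",", ""))))
    return results

print(extract_counts("Officials said 1,200 people were infected and 35 died."))
# prints [('infected', 1200), ('died', 35)]
```

Summing such pairs per month would give a series comparable in spirit to the plotted cases, although spelled-out numbers ("two thousand") and keyword-first phrasing are missed by this pattern.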
VADER Sentiment Analysis:
- Average: -0.5 to -0.9 (scale: -1 (highly negative) to +1 (highly positive))
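To make the -1 to +1 scale concrete: VADER sums per-word valence scores and squashes the sum into a bounded compound score. The sketch below uses the normalization x / sqrt(x^2 + 15), which is our understanding of the reference implementation and should be treated as an assumption. The valence sums fed in are made-up illustrative values, not output from the real lexicon, and `average_compound` is a hypothetical helper.

```python
import math

def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded summed valence into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

def average_compound(valence_sums):
    """Average the compound scores of several reports."""
    scores = [normalize_compound(v) for v in valence_sums]
    return sum(scores) / len(scores)

# Illustrative (made-up) summed valences for three strongly negative reports.
example = [-3.0, -4.0, -6.5]
print(round(average_compound(example), 2))
```

An average between -0.5 and -0.9, as reported on the slide, corresponds to reports whose negative words clearly outweigh the positive ones.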
Keyword extraction using PageRank:
- 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'.
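PageRank-based keyword extraction is commonly done in the TextRank style: build a graph whose nodes are words and whose edges link words that occur near each other, then rank the nodes. The sketch below assumes that approach; the window size, damping factor, and function names are illustrative choices, and the slides do not say which variant the authors used.

```python
from collections import defaultdict

def build_cooccurrence_graph(tokens, window=2):
    """Link each token to the distinct tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph stored as adjacency sets."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank

def top_keywords(tokens, k=5, window=2):
    """Return the k highest-ranked words from a pre-processed token list."""
    ranks = pagerank(build_cooccurrence_graph(tokens, window))
    return sorted(ranks, key=ranks.get, reverse=True)[:k]
```

The input is expected to be already pre-processed (stop words removed, lemmatized), so that function words do not dominate the ranking.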
LIMITATIONS AND FUTURE WORK
- Starting point for an important dataset.
- Assert importance of NLP in newspaper report analysis.
- Dataset open for research and enhancement
THANK YOU
