COVID-19 Database for Answering Queries Using Natural
Language Text Processing
This technical document describes the functionality of database software that uses AI text-processing techniques to find
answers to questions about COVID-19. In this phase of development, clusters of research-paper abstracts are added to a relational database. In
future development of this software, we plan to use another AI/NLP technique, Node2Vec, to include the body text of the papers in the database. The use of Node2Vec for lower-dimensional representation of scientific papers is discussed in our recent paper at the following link: https://www.sciencedirect.com/science/article/abs/pii/S0169023X1830185X
The relational database is the beta version of DE9, which presents query results to users in a very simple
way (without the need to write SQL code). Python was used for NLP and clustering. Thanks to the COVID-19 Open Research Dataset Challenge
(CORD-19), which provides an open dataset, we imported 43,233 metadata records, including
the abstracts of the related research papers, into the database. The COVID-19 dataset is available at the following link:
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
To create the clusters, a BoW (Bag of Words) matrix
containing the TF-IDF vector for each abstract was created. This was done after preprocessing and text cleaning
(including removing stop words, stemming, etc.) of the abstracts of the 43,233 papers. The BoW matrix was used by
K-means to cluster similar papers. To find the optimum
value of K, the Elbow method was used with values of K from
1 to 1000. This task took approximately 42 hours of CPU time on a regular
computer. After a K value of 800, the decrease in the SSE (sum of squared errors) curve slowed down, which indicates that the
optimal value of K starts from 800. A value of K between 5000 and 6000 was targeted for the number of clusters. Considering the random nature of
the K-means algorithm, the clustering was done three times on all the data. The best
clustering result (which took around 8 hours using cloud computing) was identified and added to the database. The final number of clusters is
approximately 6000. To measure the distance between vectors, cosine similarity was used to find similar
abstracts. The Excel sheet of the best results (which is integrated as a relation in the database software) is linked here for public use.
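The pipeline described above (text cleaning, TF-IDF vectors, K-means, the Elbow method, and repeated runs) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used for this project: the stop-word list, the six-sentence corpus, the smoothed IDF formula, and all function names are assumptions made for the example, and stemming is omitted.

```python
import math
import random
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a full list plus stemming.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "to", "for", "on"}

def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        # Term frequency times a smoothed inverse document frequency.
        vec = {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
               for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def sq_dist(u, v):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in set(u) | set(v))

def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors]

def kmeans(vectors, k, seed, iterations=20):
    """Plain Lloyd's K-means; returns (labels, SSE) for one random initialisation."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                total = Counter()
                for m in members:
                    for t, w in m.items():
                        total[t] += w
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    labels = assign(vectors, centroids)
    sse = sum(sq_dist(v, centroids[lab]) for v, lab in zip(vectors, labels))
    return labels, sse

def best_of(vectors, k, restarts=3):
    """Run K-means several times and keep the run with the lowest SSE."""
    return min((kmeans(vectors, k, seed) for seed in range(restarts)),
               key=lambda run: run[1])

# Hypothetical mini-corpus standing in for the 43,233 abstracts.
docs = [
    "Incubation period of the novel coronavirus in travellers",
    "Estimating the incubation period and quarantine length",
    "Seasonality of respiratory virus transmission in children",
    "Seasonality and transmission of influenza in temperate regions",
    "Clinical outcomes of pneumonia in elderly patients",
    "Mortality of elderly patients with viral pneumonia",
]
vectors = tfidf_vectors(docs)
# Elbow method: inspect how SSE falls as K grows and look for the bend.
sse_by_k = {k: best_of(vectors, k)[1] for k in range(1, len(docs) + 1)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

Because the vectors are normalised to unit length, squared Euclidean distance equals 2 minus twice the cosine similarity, so the nearest centroid or neighbour by Euclidean distance is also the most similar by cosine. At the scale of the real dataset, an optimised library implementation would be the practical choice.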
The rest of this report shows screenshots of the system and
the user interaction used to answer three questions about
COVID-19. To test the system,
the following three queries with different difficulty levels are used:
Q1- What papers show the effects of the virus on the elderly? (required an easy search)
Q2- What is known about the incubation period of COVID-19? (required an average level of search)
Q3- What is known about the seasonality of transmission of COVID-19? (required a difficult level of search)
The following screenshots with red-colored comments show the
interaction with the COVID-19 database system:
Using the keyword *elderly* to answer query 1:
Below are the results of the title search:
We can either look at the papers with related abstracts
(there are four papers in the same cluster, 599; two of them are shown above) or
look at the rest of the papers whose titles include “elderly”.
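One plausible way to picture the "related abstracts" lookup is as a group-by on the cluster label stored with each paper. The sketch below assumes this; the record layout, field names, and titles are hypothetical, and only the cluster number 599 comes from the example above.

```python
from collections import defaultdict

# Hypothetical records; field names and titles are illustrative only.
papers = [
    {"id": 1, "title": "Pneumonia in the elderly", "cluster": 599},
    {"id": 2, "title": "Respiratory infection in older adults", "cluster": 599},
    {"id": 3, "title": "Bat coronaviruses", "cluster": 12},
]

def build_cluster_index(records):
    """Map each cluster label to the list of papers assigned to it."""
    index = defaultdict(list)
    for rec in records:
        index[rec["cluster"]].append(rec)
    return index

def related_papers(paper, index):
    """Return the other papers that share the given paper's cluster."""
    return [p for p in index[paper["cluster"]] if p["id"] != paper["id"]]

index = build_cluster_index(papers)
print([p["title"] for p in related_papers(papers[0], index)])
```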
As shown above, the query returns 62 papers (with
a similar keyword in the title), which can be shown as a list for better comparison
(shown below):
Answering query 2: What are the papers discussing the
incubation period of COVID-19?
We can search for the keyword *incubation period* by typing it in either
the title or the abstract field. Below is a screenshot of searching for the keyword *incubation period* in
the titles:
Reading the abstract of the resulting paper above, it does
not appear to be close to the answer. This time we want to see all the results of the query
as a list, which is shown below:
Below are the details of the second paper shown above:
According to this paper, the incubation period is
between 2 and 14 days, and the length of quarantine should be at least 14 days. Below
is the first page of this paper, which confirms the paper abstract in the system.
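The title and abstract searches used for this query can be pictured as a case-insensitive phrase match on a single field. The sketch below is an assumption about the behaviour, not the DE9 implementation; the function name `search` and the sample records are hypothetical. Records without an abstract are treated as non-matching for abstract searches.

```python
def search(records, keyword, field):
    """Case-insensitive substring search on one field; missing values never match."""
    needle = keyword.lower()
    return [r for r in records if needle in (r.get(field) or "").lower()]

# Hypothetical records; one has no abstract, as can happen in the metadata.
records = [
    {"title": "Incubation period of a novel coronavirus", "abstract": None},
    {"title": "Quarantine policy review",
     "abstract": "We estimate the incubation period from travel histories."},
    {"title": "Seasonality of viral transmission",
     "abstract": "Seasonal patterns are compared across regions."},
]

title_hits = search(records, "incubation period", "title")
abstract_hits = search(records, "incubation period", "abstract")
print(len(title_hits), len(abstract_hits))
```

Because the match is on the exact phrase, small wording changes in the keyword can return different sets of papers.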
Query 3: What is known about the seasonality of transmission
of COVID-19?
The above paper resulted from the keyword search in the
title, but this paper has no abstract. We use the
same keyword to search in the abstract field. Here is the result:
In total, there are 3 papers with abstracts containing
"seasonality in transmission". From their abstracts, it is clear that they do not
focus on seasonality. They also discuss other viruses similar to
COVID-19. Looking at their related
papers, seasonality again does not appear to be their main focus. Below is the list of
the three papers from this keyword search.
Now we change the keyword to *seasonality of transmission*:
Here is the result:
Although this paper does not discuss COVID-19, its
abstract suggests that the focus is on seasonality. If we look at the first
related paper (Factors influencing..) and its abstract, we can also see that the focus is
on seasonality. So we narrow the search to these two papers, although they do not discuss
COVID-19. They discuss viruses similar to COVID-19 and a disease caused by them. An interesting
observation is that seasons and dates are discussed in these papers in the form of tables and graphs.
Below are these two papers: