COVID-19 Database for Answering Queries Using Natural Language Text Processing

 

This technical document reports the functionality of database software that uses an AI text processing technique to find answers to questions about COVID-19. In this phase of development, clusters of research paper abstracts are added to a relational database. In future development of this software, we plan to use another AI/NLP technique, Node2Vec, to include the body text of the papers in the database. Using Node2Vec for lower-dimensional representation of scientific papers is discussed in our recent paper at the following link: https://www.sciencedirect.com/science/article/abs/pii/S0169023X1830185X

 

The relational database is a beta version of DE9, which can show query results to users in a very simple way (without the need to write SQL code). Python was used for NLP and clustering. Thanks to the COVID-19 Open Research Dataset Challenge (CORD-19), which provided an open dataset, we imported 43,233 metadata records, including the abstracts of the related research papers, into the database. Here is the link to the COVID-19 dataset:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
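The import step described above can be sketched roughly as follows. This is only an illustration: the internals of DE9 are not described in this report, so Python's built-in sqlite3 module is used as a stand-in, and the table name, the column names (paper_id, title, abstract) and the two sample rows are assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Two made-up rows standing in for the metadata file (illustrative only).
SAMPLE_CSV = """paper_id,title,abstract
p1,Example title one,Example abstract text one
p2,Example title two,
"""

def load_metadata(conn, csv_file):
    """Create a papers table and insert one row per CSV record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "paper_id TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    rows = [
        # An empty abstract is stored as NULL so it can be detected later.
        (r["paper_id"], r["title"], r["abstract"] or None)
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
imported = load_metadata(conn, io.StringIO(SAMPLE_CSV))
missing = conn.execute(
    "SELECT COUNT(*) FROM papers WHERE abstract IS NULL"
).fetchone()[0]
print(imported, "records imported;", missing, "without an abstract")
```

Storing missing abstracts as NULL makes it easy to see which records can only be reached through a title search.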

 

 

To create the clusters, a bag-of-words (BoW) matrix containing the TF-IDF vector of each abstract was built. This was done after preprocessing and text cleaning (removing stop words, stemming, etc.) of the abstracts of the 43,233 papers. K-means used this BoW matrix to cluster similar papers. To find the optimum value of K, the Elbow method was applied with values of K from 1 to 1000. This task took approximately 42 hours of CPU time on a regular computer. After K = 800, the decrease in the SSE graph slowed down, which shows that the optimal value of K starts from 800. A value of K between 5000 and 6000 was targeted for the number of clusters. Because of the random nature of the K-means algorithm, the clustering was run three times on all the data. The best clustering result (which took around 8 hours using cloud computing) was identified and added to the database. The final number of clusters is approximately 6000. Cosine similarity was used as the distance measure between vectors to find similar abstracts. The Excel sheet of the best results (which is integrated as a relation in the database software) is linked here for public use.
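The pipeline above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the code used for the reported results: the token lists are made up and assumed to be already cleaned and stemmed, the TF-IDF weighting (raw count times ln(N/df) + 1) is one common variant, and the clustering is spherical K-means (unit-length vectors assigned by cosine similarity), chosen here as one way to combine K-means with cosine similarity. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one unit-length sparse TF-IDF vector (dict) per document."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        vec = {t: c * (math.log(n_docs / doc_freq[t]) + 1.0)
               for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def assign(vectors, centroids):
    """Label each vector with the index of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors]

def kmeans(vectors, k, seed, n_iter=20):
    """Spherical K-means; returns (labels, SSE) for one random start."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]
    for _ in range(n_iter):
        labels = assign(vectors, centroids)
        for c in range(k):
            total = {}
            for v, lab in zip(vectors, labels):
                if lab == c:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # an empty cluster keeps its previous centroid
                norm = math.sqrt(sum(w * w for w in total.values()))
                centroids[c] = {t: w / norm for t, w in total.items()}
    labels = assign(vectors, centroids)
    # For unit vectors, squared Euclidean distance equals 2 - 2 * cosine.
    sse = sum(2.0 - 2.0 * cosine(v, centroids[lab])
              for v, lab in zip(vectors, labels))
    return labels, sse

def best_of_restarts(vectors, k, seeds=(0, 1, 2)):
    """Run K-means once per seed and keep the run with the lowest SSE."""
    return min((kmeans(vectors, k, s) for s in seeds), key=lambda run: run[1])

# Made-up token lists, assumed already cleaned and stemmed (illustrative only).
docs = [
    ["incub", "period", "day", "quarantin"],
    ["incub", "period", "estim", "day"],
    ["season", "transmiss", "winter", "temperatur"],
    ["season", "transmiss", "humid", "winter"],
    ["elderli", "patient", "mortal", "age"],
    ["elderli", "patient", "age", "risk"],
]
vectors = tfidf_vectors(docs)
sse_by_k = {k: best_of_restarts(vectors, k)[1] for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(f"K={k}  SSE={sse:.3f}")  # the elbow is where the drop flattens
labels, _ = best_of_restarts(vectors, 3)
```

Keeping the lowest-SSE run out of three seeds mirrors the three repeated clusterings described above. On the full set of abstracts, a sparse-matrix library would be a more practical choice than plain dictionaries.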

 

The rest of this report shows screenshots of the system and the user interaction needed to answer three questions about COVID-19. To test the system, the following three queries with different difficulty levels are used:

 

Q1- What papers show the effects of the virus on the elderly? (required an easy search)

Q2- What is known about the incubation period of COVID-19? (required an average level of search)

Q3- What is known about the seasonality of transmission of COVID-19? (required a difficult level of search)

 

The following screenshots with red-colored comments show the interaction with the COVID-19 database system:


Using the keyword *elderly* to answer query 1:


Below are the results of the title search:

 

 

 

We can either look at the papers with related abstracts (there are four papers in the same cluster, 599; two of them are shown above) or look at the rest of the papers whose titles include “elderly”.
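Finding the related papers of a given paper amounts to looking up the other members of its cluster. The sketch below illustrates the idea with made-up paper ids and cluster numbers. The function names are hypothetical and do not describe how DE9 stores the relation.

```python
from collections import defaultdict

# Made-up (paper_id, cluster_id) pairs; the ids are illustrative only.
assignments = [
    ("p1", 599), ("p2", 12), ("p3", 599),
    ("p4", 599), ("p5", 12), ("p6", 599),
]

def build_cluster_index(pairs):
    """Map each cluster id to the list of paper ids it contains."""
    index = defaultdict(list)
    for paper_id, cluster_id in pairs:
        index[cluster_id].append(paper_id)
    return index

def related_papers(paper_id, pairs, index):
    """Return the other papers that share a cluster with paper_id."""
    cluster_of = dict(pairs)
    return [p for p in index[cluster_of[paper_id]] if p != paper_id]

index = build_cluster_index(assignments)
print(related_papers("p1", assignments, index))  # ['p3', 'p4', 'p6']
```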


As shown above, the query returns 62 papers (with a similar keyword in the title), which can be displayed as a list for easier comparison (shown below):

 

 

 

 

Answering query 2: Which papers discuss the incubation period of COVID-19?

 

We can search for the keyword *incubation period* by typing it in either the title or the abstract field. Below is a screenshot of searching for the keyword *incubation period* in the titles:
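Behind the interface, a keyword search of this kind corresponds to a substring match on one field. The sketch below shows one way to express it, assuming an SQLite table with hypothetical columns paper_id, title and abstract. The sample rows are made up, and this is not a description of how DE9 runs the search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (paper_id TEXT, title TEXT, abstract TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [  # made-up sample rows (illustrative only)
        ("p1", "Estimating the incubation period", "We estimate a delay distribution."),
        ("p2", "A modelling study", "The incubation period is estimated from case data."),
        ("p3", "Seasonal patterns", None),  # a paper without an abstract
    ],
)

ALLOWED_FIELDS = {"title", "abstract"}

def keyword_search(conn, field, phrase):
    """Return ids of papers whose chosen field contains the phrase."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"unsupported field: {field}")
    # The field name comes from a fixed whitelist; the phrase is bound as a
    # parameter. SQLite's LIKE is case-insensitive for ASCII by default.
    sql = f"SELECT paper_id FROM papers WHERE {field} LIKE ? ORDER BY paper_id"
    return [row[0] for row in conn.execute(sql, (f"%{phrase}%",))]

print(keyword_search(conn, "title", "incubation period"))     # ['p1']
print(keyword_search(conn, "abstract", "incubation period"))  # ['p2']
```

A paper whose abstract is missing can only be found through its title, which is why searching both fields can return different results.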

 

 

 

 

 

Reading the abstract of the resulting paper above, it does not appear to be close to the answer. This time we want to see all the results of the query as a list, which is shown below:

 

 

 

 

Below are the details of the second paper shown above:


According to this paper, the incubation period is between 2 and 14 days, and the quarantine should last at least 14 days. Below is the first page of this paper, which confirms the abstract stored in the system.

 

 

 

 

Answering query 3: What is known about the seasonality of transmission of COVID-19?

 

 

 

The above paper resulted from the keyword search in the title, but it has no abstract. We use the same keyword to search in the abstract field. Here is the result:

 

 

In total, there are 3 papers whose abstracts contain "seasonality in transmission". From their abstracts, it is clear that they do not focus on seasonality. They also discuss other viruses similar to COVID-19. Looking at their related papers, seasonality again does not appear to be their main focus. Below is the list of the three papers from this keyword search.

 

 

 

 

Now we change the keyword to *seasonality of transmission*:

 

 

Here is the result:


Although this paper is not about COVID-19, its abstract suggests that the focus is on seasonality. If we look at the first related paper (Factors influencing..) and its abstract, we can see that its focus is also seasonality. So we narrow the search to these two papers, although they do not discuss COVID-19. They discuss viruses similar to COVID-19 and a disease caused by them. An interesting observation is that seasons and dates are discussed in the papers in the form of tables and graphs. Below are these two papers: