Topic Modeling on Research Paper Abstracts

Iqra Munawar (M.S. Analytics, NCSU 2020) & Sameen Salam (M.S. Analytics, NCSU 2020)

COVID-19 is one of the biggest challenges the world has had to face in recent memory. As scientists continue to work tirelessly towards a resolution of the crisis, one thing is abundantly clear: there is an enormous amount of COVID-19-related research. In this analysis, we implement two methods for grouping research paper abstracts by topical similarity, with the goal of making a large corpus like this easy to parse for individual labs and accessible to the general public. Specifically, we use K-means clustering (based on TF-IDF features) and LDA topic modeling.

The original data source for this analysis is the CORD-19 Research Data Challenge hosted on Kaggle. We created our own text pre-processing pipeline (paper_abstract_cleaner.ipynb) and fed the original Kaggle dataset through it, producing an output with all of the original rows and columns plus an extra "abstract2" column. This additional column contains the cleaned, model-ready abstracts and is the column primarily used in this notebook. A .zip file of the data is included in this repository if you would like to run the notebook locally. Alternatively, if you have a Kaggle account, you can download it from the Kaggle kernel output here: https://www.kaggle.com/sameensalam/paper-abstract-cleaner/output.

Pending Items

  • Summarize key points that define each cluster/topic for both K-Means and LDA
  • Create a search engine based on this corpus

Setup and Data

Import the necessary libraries, define the necessary functions, and read in the data for this analysis.

Libraries

In [1]:
#Load in the necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
import random
import matplotlib.pyplot as plt
import string
import re
import pickle
import gensim
import pyLDAvis.gensim
import time
import os
import matplotlib.patheffects as PathEffects
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans 
from nltk import FreqDist
from gensim.models import LdaModel
from gensim import corpora
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models import Phrases
from gensim.models import CoherenceModel
from collections import Counter
from wordcloud import WordCloud
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from zipfile import ZipFile

Functions

In [1]:
def wordcloud(counter):
    """A small wordloud wrapper"""
    wc = WordCloud(width=1200, height=800, 
                   background_color="white", 
                   max_words=200) 
    wc.generate_from_frequencies(counter)

    # Plot
    fig=plt.figure(figsize=(6, 4))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
In [13]:
def compute_coherence_values(dictionary, corpus, texts, stop, start=2, step=3, lda= True):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    for num_topics in range(start, stop, step):
        print('Calculating {}-topic model'.format(num_topics))
        
        if lda is True: 
            model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary,\
                                                     random_seed=0, workers=3)
        elif lda is False: 
            model = gensim.models.LdaMulticore(corpus = corpus, num_topics=num_topics, id2word=dictionary,
                                               random_state=0, workers=3)
        
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return coherence_values
In [42]:
#This snippet is necessary to convert the Mallet model object into something that the pyLDAvis library can use
#Function to bypass gensim.models.wrappers.ldamallet.malletmodel2ldamodel, which has known bugs that reduce model performance
#Credit to Stackoverflow user: norpa
def ldaMalletConvertToldaGen(mallet_model):
    model_gensim = LdaModel(id2word=mallet_model.id2word, num_topics=mallet_model.num_topics, alpha=mallet_model.alpha, eta=0,\
                            iterations=1000, gamma_threshold=0.001, dtype=np.float32)
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

Data Read-In

In [6]:
#Located in the repository with this notebook
file_name = "abstract_cleaned.zip"
with ZipFile(file_name, 'r') as zip: 
    zip.extractall()

#Read in the data
metadata = pd.read_csv("abstract_cleaned.csv")
In [2]:
#If downloaded directly as a CSV from Kaggle kernel output, then edit path and run 
#metadata = pd.read_csv(r'C:\Users\USER\Documents\Misc\covid19_research\abstract_cleaned.csv')
In [3]:
metadata.shape
Out[3]:
(75000, 20)
In [4]:
metadata.head()
Out[4]:
cord_uid sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal mag_id who_covidence_id arxiv_id pdf_json_files pmc_json_files url s2_id abstract2
0 lt17jimb NaN Medline The role of family variables in fruit and vege... 10.4081/jphr.2012.e22 NaN 25170457.0 unk ABSTRACT Most Americans, including children, c... 2012 Goldman, Rachel L; Radnitz, Cynthia L; McGrath... Journal of public health research NaN NaN NaN NaN NaN https://doi.org/10.4081/jphr.2012.e22; https:/... 10065731.0 ['americans', 'include', 'child', 'continue', ...
1 a943sj83 NaN WHO The ethical challenges of the SARS-CoV-2 pande... NaN NaN NaN unk The global COVID-19 death toll stands at the t... 2020 Schuklenk, Udo NaN NaN #154926 NaN NaN NaN NaN 216647886.0 ['global', 'COVID-19', 'death', 'toll', 'stand...
2 mpacgk6o 5222649230a68712e13dbfa99f7b2fd8f8190bc6 Elsevier; Medline; PMC Pathology of the Exotic Companion Mammal Gastr... 10.1016/j.cvex.2014.01.002 PMC7172800 24767738.0 no-cc A variety of disease agents can affect the gas... 2014-04-23 Reavill, Drury Vet Clin North Am Exot Anim Pract NaN NaN NaN document_parses/pdf_json/5222649230a68712e13db... document_parses/pmc_json/PMC7172800.xml.json https://doi.org/10.1016/j.cvex.2014.01.002; ht... 25425517.0 ['variety', 'disease', 'agent', 'affect', 'gas...
3 874kthen NaN Medline Risk factors for bleeding evaluated using the ... 10.1097/meg.0000000000000419 NaN 26075810.0 unk BACKGROUND/AIMS Bleeding remains a serious com... 2015 Noda, Hisatsugu; Ogasawara, Naotaka; Izawa, Sh... European journal of gastroenterology & hepatology NaN NaN NaN NaN NaN https://doi.org/10.1097/meg.0000000000000419; ... 10988506.0 ['backgroundaim', 'bleed', 'remain', 'serious'...
4 j6wylhzw 40cba5e191966ed333cfa3346d485d5af0791ec0 Elsevier; PMC Clinical perspectives of emerging pathogens in... 10.1016/s0140-6736(06)68036-7 PMC7138062 NaN els-covid Summary As a result of immunological and nucle... 2006-01-27 Ludlam, Christopher A; Powderly, William G; Bo... The Lancet NaN NaN NaN document_parses/pdf_json/40cba5e191966ed333cfa... document_parses/pmc_json/PMC7138062.xml.json https://www.sciencedirect.com/science/article/... 37595885.0 ['summary', 'result', 'immunological', 'nuclei...
In [13]:
metadata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   cord_uid          75000 non-null  object 
 1   sha               37992 non-null  object 
 2   source_x          75000 non-null  object 
 3   title             75000 non-null  object 
 4   doi               63041 non-null  object 
 5   pmcid             36708 non-null  object 
 6   pubmed_id         58815 non-null  float64
 7   license           75000 non-null  object 
 8   abstract          75000 non-null  object 
 9   publish_time      74991 non-null  object 
 10  authors           74159 non-null  object 
 11  journal           70465 non-null  object 
 12  mag_id            0 non-null      float64
 13  who_covidence_id  10002 non-null  object 
 14  arxiv_id          1202 non-null   object 
 15  pdf_json_files    37992 non-null  object 
 16  pmc_json_files    29466 non-null  object 
 17  url               70520 non-null  object 
 18  s2_id             64050 non-null  float64
 19  abstract2         75000 non-null  object 
dtypes: float64(3), object(17)
memory usage: 11.4+ MB

Data Exploration

In this section, we perform a touch of extra cleaning and take a look at the most common words in our corpus via wordcloud.

In [10]:
#Changing each entry in abstract2 column to a list 
cleaned_abstracts = metadata['abstract2']
cleaned_abstracts = cleaned_abstracts.apply(eval)

We eliminate every abstract with 60 or fewer tokens, since short abstracts carry little semantic content and could add noise to our language models.

In [11]:
cleaned_abstracts = cleaned_abstracts[cleaned_abstracts.str.len() > 60]
In [12]:
cleaned_abstracts.shape
Out[12]:
(65198,)
In [9]:
tokens = cleaned_abstracts.to_list()
In [10]:
#Additional words that are not standard stopwords but were overlooked in the cleaning process
N= {'using', 'may', 'also', 'used', 'use', 'however', 'including', 'among', 'could', 'based', 'within','OBJECTIVES',\
    'FINDINGS'}

# Removing those additional words from list of lists 
tokens = [[ele for ele in sub if ele not in N] for sub in tokens] 
In [22]:
tokens_all = [item for items in tokens for item in items]
counter = Counter(tokens_all)
#counter.most_common(1000)
In [51]:
%matplotlib inline
# create wordcloud
wordcloud(counter)

K-Means Approach

Here we use K-means clustering on the TF-IDF matrix of the entire corpus and visualize the clusters in a lower dimensional space using PCA in combination with TSNE.

In [10]:
#Creating a new cleaned_abstracts_tf object since these next lines of code apply only to K-means
cleaned_abstracts_tf = cleaned_abstracts.apply(lambda x: ' '.join(x))
cleaned_abstracts_tf = list(cleaned_abstracts_tf)
In [11]:
#Direct way to get a TF-IDF model without having to go through an initial Bag of Words model
tv = TfidfVectorizer(min_df= 0.008, max_df = 0.5, norm = 'l2', use_idf = True, smooth_idf= True, lowercase=False,\
                     analyzer="word", token_pattern=r"(?u)\S\S+")

#Transforming the cleaned_abstracts_tf object
tv_matrix = tv.fit_transform(cleaned_abstracts_tf)
tv_matrix = tv_matrix.toarray()
terms = tv.get_feature_names()

#Creating and showing a dataframe to illustrate the transformation. Each row is a document, each column is a token 
tv_dataframe = pd.DataFrame(np.round(tv_matrix, 2), columns=terms)
tv_dataframe   
Out[11]:
2019-ncov ACE2 AIM ARDS BMI CD4 CD8 CI COVID-19 CT ... worldwide would wound wuhan x-ray year yet yield young zoonotic
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.09 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.11
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
65193 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
65194 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
65195 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
65196 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00
65197 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00

65198 rows × 1848 columns
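
As a quick sanity check on the vectorization (a minimal sketch using the tv_dataframe just built, not an executed cell), the highest-weighted terms for any single abstract can be pulled out like this:

#Top 10 TF-IDF terms for the first abstract in the corpus
top_terms_doc0 = tv_dataframe.iloc[0].sort_values(ascending=False).head(10)
print(top_terms_doc0)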

In [15]:
#Fitting the K-means model for each number of clusters k from 2 to 29
inertia_vals = []
cluster_number = range(2,30)

for i in cluster_number:
    true_k = i
    kmeans_model = KMeans(n_clusters = true_k, random_state = 0)
    kmeans_model.fit(tv_matrix)
    inertia_vals.append(kmeans_model.inertia_)

Below is a plot of the total within-cluster sum of squares (inertia) for each value of k tested in the K-means approach. Since there is no clear elbow in this plot, we go with 17 clusters for our visuals, which also matches the number of topics we selected for our final LDA model (discussed later).

In [16]:
#Plotting the Total Sums of Squares for each value of k in Kmeans
plt.plot(cluster_number, inertia_vals, 'bx-')
plt.xlabel('k')
plt.ylabel('Total Sum of Squares')
plt.title('Elbow Method For Optimal K')
plt.show()
In [13]:
#Defining our K-Means model with the optimal number of clusters (17)
true_k = 17
kmeans_model = KMeans(n_clusters= true_k, random_state= 0)
In [15]:
#Fitting the kmeans model
kmeans_model.fit(tv_matrix)
Out[15]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=17, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
In [16]:
#Checking to see if the model properly labeled a few of the documents
kmeans_model.labels_
Out[16]:
array([14,  7, 13, ...,  4,  4,  4])
In [17]:
#Getting the contribution of each word in descending order per centroid
order_centroids = kmeans_model.cluster_centers_.argsort()[:, ::-1]
order_centroids
Out[17]:
array([[1335,  215,  258, ..., 1027,  868,  136],
       [ 920, 1202, 1642, ...,  501, 1026, 1466],
       [ 749, 1343,  250, ..., 1622, 1083,  136],
       ...,
       [ 277, 1452, 1803, ..., 1027,  920,   44],
       [  28, 1803, 1799, ...,  136, 1836, 1262],
       [1613,  493,  822, ..., 1181,  777, 1031]], dtype=int64)

Below, you will see the top words that define each cluster. Using these top words, we can infer the general subject matter for abstracts within any given cluster. For example, a research paper abstract that falls into cluster 0 most likely pertains to the cellular biological characteristics of the novel coronavirus.

In [61]:
for i in range(true_k):
    print("Cluster %d:" % i),
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind]),
    print("-----------------------------")
Cluster 0:
 protein
 bind
 cell
 domain
 virus
 fusion
 SARS-CoV
 membrane
 viral
 structure
 peptide
 interaction
 spike
 antibody
 residue
-----------------------------
Cluster 1:
 laparoscopic
 patient
 surgery
 complication
 postoperative
 procedure
 surgical
 technique
 perform
 resection
 repair
 mean
 hernia
 underwent
 time
-----------------------------
Cluster 2:
 health
 public
 care
 pandemic
 disease
 COVID-19
 service
 system
 global
 healthcare
 social
 risk
 medical
 country
 emergency
-----------------------------
Cluster 3:
 group
 patient
 compare
 difference
 significantly
 postoperative
 control
 study
 day
 surgery
 time
 rate
 significant
 score
 complication
-----------------------------
Cluster 4:
 cell
 expression
 infection
 virus
 mice
 response
 immune
 receptor
 viral
 protein
 gene
 induce
 human
 signal
 activation
-----------------------------
Cluster 5:
 aneurysm
 stroke
 patient
 occlusion
 endovascular
 stent
 artery
 outcome
 ischemic
 treatment
 intracranial
 device
 cerebral
 treat
 flow
-----------------------------
Cluster 6:
 virus
 infection
 human
 bat
 disease
 viral
 animal
 host
 respiratory
 specie
 cause
 MERS-CoV
 coronavirus
 transmission
 infectious
-----------------------------
Cluster 7:
 model
 case
 epidemic
 COVID-19
 number
 spread
 transmission
 estimate
 datum
 country
 disease
 outbreak
 rate
 china
 measure
-----------------------------
Cluster 8:
 influenza
 virus
 h1n1
 pandemic
 infection
 respiratory
 avian
 human
 vaccine
 viral
 patient
 seasonal
 case
 surveillance
 season
-----------------------------
Cluster 9:
 sequence
 gene
 strain
 genome
 isolate
 virus
 IBV
 analysis
 nucleotide
 protein
 region
 phylogenetic
 amino
 genetic
 acid
-----------------------------
Cluster 10:
 assay
 detection
 sample
 test
 virus
 PCR
 detect
 sensitivity
 antibody
 method
 positive
 diagnostic
 RT-PCR
 respiratory
 specimen
-----------------------------
Cluster 11:
 vaccine
 antibody
 response
 vaccination
 immune
 virus
 development
 antigen
 epitope
 mice
 protection
 protein
 induce
 neutralize
 challenge
-----------------------------
Cluster 12:
 COVID-19
 patient
 SARS-cov-2
 disease
 coronavirus
 case
 pandemic
 severe
 infection
 respiratory
 clinical
 risk
 treatment
 symptom
 report
-----------------------------
Cluster 13:
 patient
 treatment
 clinical
 study
 CI
 outcome
 risk
 day
 mortality
 case
 hospital
 disease
 infection
 associated
 therapy
-----------------------------
Cluster 14:
 child
 respiratory
 virus
 infection
 RSV
 viral
 patient
 tract
 detect
 rhinovirus
 year
 acute
 infant
 asthma
 syncytial
-----------------------------
Cluster 15:
 RNA
 virus
 viral
 replication
 antiviral
 activity
 protein
 cell
 compound
 inhibitor
 drug
 protease
 infection
 inhibit
 target
-----------------------------
Cluster 16:
 study
 disease
 increase
 effect
 review
 treatment
 result
 model
 system
 clinical
 include
 datum
 high
 may
 method
-----------------------------
In [30]:
#Defining and fitting a 3 component PCA model for visualization
pca = PCA(n_components=3)
scatter_plot_points = pca.fit_transform(tv_matrix)
In [115]:
#Creating a dataframe with the 3 principal components and each abstract's K-Means cluster label
pca_points = pd.DataFrame(columns=['pca1','pca2','pca3'])

pca_points["pca1"] = scatter_plot_points[:,0]
pca_points["pca2"] = scatter_plot_points[:,1]
pca_points["pca3"] = scatter_plot_points[:,2]
pca_points["cluster_label"] = kmeans_model.labels_

pca_points.head()
Out[115]:
pca1 pca2 pca3 cluster_label
0 -0.037151 0.051144 -0.061883 14
1 -0.046053 0.124345 -0.106154 7
2 -0.127841 -0.122545 0.004256 13
3 0.141154 0.052286 -0.002346 6
4 0.191412 -0.098651 -0.080220 4
In [116]:
#Plot for Principal components 1 and 2

#Predefined 17 colors (chosen for maximal contrast within 17 distinct groups)
colors = ["#ff0000", "#00ff00", "#eeeeee", "#c79dd7", "black","#666547", "#fb2e01", "#6fcb9f", "#ffe28a", "#fffeb3","#363b74", "#673888",\
          "#ef4f91", "#0000ff", "#4d1b7b","#8f9779", "#4d5d53"]

#Creating the plot
x_axis = [o for o in pca_points["pca1"]]
y_axis = [o for o in pca_points["pca2"]]
fig, ax = plt.subplots(figsize=(20,10))

_= ax.scatter(x_axis, y_axis, c=[colors[d] for d in pca_points["cluster_label"]], alpha = 0.7)
_= ax.axis('off')

#Adding easily visible cluster numbers. Single quotes are added around each number so adjacent labels are not misread (e.g. a '1' near a '3' being read as 13)
for i in range(true_k):
    xtext = np.median(pca_points.loc[pca_points["cluster_label"] == i,"pca1"])+(np.random.rand()/100)
    ytext = np.median(pca_points.loc[pca_points["cluster_label"] == i,"pca2"])+(np.random.rand()/100)
    txt = ax.text(xtext, ytext, "'"+str(i)+"'", fontsize=24)
    txt.set_path_effects([
        PathEffects.Stroke(linewidth=5, foreground="w"),
        PathEffects.Normal()])
In [113]:
#Plot of principal components 2 and 3
colors = ["#ff0000", "#00ff00", "#eeeeee", "#c79dd7", "black","#666547", "#fb2e01", "#6fcb9f", "#ffe28a", "#fffeb3","#363b74", "#673888",\
          "#ef4f91", "#0000ff", "#4d1b7b","#8f9779", "#4d5d53"]

x_axis = [o for o in pca_points["pca2"]]
y_axis = [o for o in pca_points["pca3"]]
fig, ax = plt.subplots(figsize=(20,10))

_= ax.scatter(x_axis, y_axis, c=[colors[d] for d in pca_points["cluster_label"]], alpha = 0.7)
_= ax.axis('off')

for i in range(true_k):
    xtext = np.median(pca_points.loc[pca_points["cluster_label"] == i,"pca2"])+(np.random.rand()/100)
    ytext = np.median(pca_points.loc[pca_points["cluster_label"] == i,"pca3"])+(np.random.rand()/100)
    txt = ax.text(xtext, ytext, "'"+str(i)+"'", fontsize=24)
    txt.set_path_effects([
        PathEffects.Stroke(linewidth=5, foreground="w"),
        PathEffects.Normal()])

The above PCA plots are helpful, but there is a lot of noise and it is somewhat difficult to see cohesive clusters. Using PCA in combination with the TSNE algorithm yields much cleaner visualizations because additional noise is stripped from the signal. We reduced the number of components from the TF-IDF input by 90% (~1850 features to 185 features) before running TSNE. Based on the resulting plot, most clusters overlap slightly with other clusters far from their respective centroids. This makes sense, since much of this research shares common terminology or cites the same previously observed phenomena. Clusters that are close together have similar subject matter: for example, clusters 2 and 7 in the top right of this visual sit very close to one another in the TSNE plot and share top words such as COVID-19 and disease. Clusters that are splattered, for lack of a better term, away from their centroid tend to have more terms that overlap with other clusters. Cluster 16 (in the center), for example, has a lot of overlap in top words with other topics.

In [18]:
#Reducing the tv_matrix into ~10% of its original dimensionality
pca_185 = PCA(n_components=185)
reduced_dim = pca_185.fit_transform(tv_matrix)
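
As a rough check on how much signal those 185 components retain (a minimal sketch using the pca_185 object just fit; the exact value is not reported here):

#Fraction of the total TF-IDF variance captured by the 185 retained components
print(pca_185.explained_variance_ratio_.sum())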
In [19]:
#Creating and fitting the TSNE model with the PCA output
tsne = TSNE(random_state=0)
tsne_results = tsne.fit_transform(reduced_dim)
In [20]:
#Creating the tsne_points dataframe as input for graphing
tsne_points = pd.DataFrame(columns=['tsne1','tsne2','cluster_label'])

tsne_points["tsne1"] = tsne_results[:,0]
tsne_points["tsne2"] = tsne_results[:,1]
tsne_points["cluster_label"] = kmeans_model.labels_
In [21]:
#Plotting the TSNE 2-dimensional model to see how the clusters looked after this transformation
colors = ["#ff0000", "#00ff00", "#eeeeee", "#c79dd7", "black","#666547", "#fb2e01", "#6fcb9f", "#ffe28a", "#fffeb3","#363b74", "#673888",\
          "#ef4f91", "#0000ff", "#4d1b7b","#8f9779", "#4d5d53"]

x_axis = [o for o in tsne_points["tsne1"]]
y_axis = [o for o in tsne_points["tsne2"]]
fig, ax = plt.subplots(figsize=(20,10))

_= ax.scatter(x_axis, y_axis, c=[colors[d] for d in tsne_points["cluster_label"]], alpha = 0.7)
_= ax.axis('off')

for i in range(17):
    xtext = np.median(tsne_points.loc[tsne_points["cluster_label"] == i,"tsne1"])+(np.random.rand()/100)
    ytext = np.median(tsne_points.loc[tsne_points["cluster_label"] == i,"tsne2"])+(np.random.rand()/100)
    txt = ax.text(xtext, ytext, "'"+str(i)+"'", fontsize=24)
    txt.set_path_effects([
        PathEffects.Stroke(linewidth=5, foreground="w"),
        PathEffects.Normal()])

We clearly have some interesting preliminary results here, but we also wanted to apply a topic modeling algorithm purpose-built for this sort of analysis: Latent Dirichlet Allocation, or LDA.

LDA Model Approach

Here we use the Latent Dirichlet Allocation (LDA) model to model the topics in this corpus of cleaned abstracts. We included bigrams (pairs of tokens that co-occur at least 20 times in the corpus) and filtered the vocabulary to drop words that appeared in fewer than 521 documents or in more than half of the documents. Using the resulting token ids for the entire corpus and the individual bag-of-words representation of each abstract, we conducted a grid search for the optimal values of the following parameters:

  • Number of topics- the number of topics the model assumes while looking at each document.
  • Alpha- the prior on the per-document topic distribution. A higher value of alpha tells the model that each document is made up of more topics.
  • Beta/Eta- the prior on the per-topic word distribution. A higher value of beta/eta tells the model that each topic is made up of more words.
  • Learning decay- the rate at which the model "forgets" old weights as it passes through the corpus.

We tried these parameters with the regular Gensim LDA algorithm, which uses online variational Bayes inference. We also used the Mallet implementation via Gensim, which uses Gibbs sampling. The ideal model was selected based on coherence score, which measures how semantically similar the top-scoring words within each topic are. Using the final candidate model, we plotted the topic distribution using the pyLDAvis library, a staple for this particular algorithm.
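
As a minimal sketch of how these parameters map onto Gensim's LdaMulticore arguments (illustrative only, not a cell we ran; it assumes the corpus and dictionary objects built in the cells below, and plugs in the values our grid-search notes pointed to):

#Illustrative only: the gridsearch parameters expressed as LdaMulticore arguments
#alpha = per-document topic prior, eta = per-topic word prior (beta), decay = learning decay
example_lda = gensim.models.LdaMulticore(corpus, num_topics=17, id2word=dictionary,
                                         alpha=1/17, eta=0.5, decay=0.5,
                                         passes=7, workers=3, random_state=0)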

In [8]:
# Compute bigrams from the cleaned_abstracts object defined earlier
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(cleaned_abstracts, min_count=20)
for idx in range(len(cleaned_abstracts)):
    for token in bigram[cleaned_abstracts.iloc[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            cleaned_abstracts.iloc[idx].append(token)
In [9]:
#Create token ids for entire corpus
dictionary = corpora.Dictionary(cleaned_abstracts)

# Filter out words that occur in fewer than 521 documents or in more than 50% of the documents.
dictionary.filter_extremes(no_below= 521, no_above=0.5)
In [11]:
#Creating individual bag of words for each abstract mapped according to the dictionary
corpus = [dictionary.doc2bow(text) for text in cleaned_abstracts]

#Saving the dictionary and corpus items for later use
pickle.dump(corpus, open('corpus.pkl', 'wb'))

dictionary.save('dictionary.gensim')
In [190]:
#Gridsearch parameters for LDA model. This code takes about 6hrs to run, so it's commented out for convenience here. 

#test_alphas = [0.05,0.17,0.5]-----default (1/num_topics 1/6~0.17) was best
#test_betas = [0.05,0.17,0.5]----0.5 was best, but only very very slightly
#test_decay = [0.5,0.6,0.7]-----0.5 test decay was the best value
#coherence_vals = []
#counter = 1

#for topic_val in num_topics:
    
#    for alpha_val in test_alphas:
        
#        for beta_val in test_betas:
            
#            for decay_val in test_decay:
                
#                start_time = time.time()

#                ldamodel_gensim = gensim.models.LdaMulticore(corpus, num_topics = topic_val, id2word=dictionary, passes=7, workers=3, random_state=0,\
#                                                             alpha=alpha_val, eta=beta_val, decay=decay_val)

#                end_time = time.time()

                #print("Perplexity Score:", ldamodel_gensim.log_perplexity(corpus))

#                coherence_model_lda = CoherenceModel(model=ldamodel_gensim,texts= corrected_subset, dictionary=dictionary, coherence='c_v')
#                coherence_vals.append(coherence_model_lda.get_coherence())
                
                
#               print("Finished model:",counter, "of 72")
#                counter+=1

#coherence_vals
In [14]:
#Iterating through all possible topic values from 2 to 30
start=2; stop=30; step=1;
stop += 1
coherence_values = compute_coherence_values(dictionary=dictionary,corpus=corpus,texts=cleaned_abstracts,start=start,stop=stop,\
                                                        step=step, lda=False)

#Because the above code takes a long time to run, I am saving the output so that I don't have to rerun it every time I want to graph coherence
f = open('coherence_reg.pckl', 'wb')
pickle.dump(coherence_values, f)
f.close()
Calculating 2-topic model
Calculating 3-topic model
Calculating 4-topic model
Calculating 5-topic model
Calculating 6-topic model
Calculating 7-topic model
Calculating 8-topic model
Calculating 9-topic model
Calculating 10-topic model
Calculating 11-topic model
Calculating 12-topic model
Calculating 13-topic model
Calculating 14-topic model
Calculating 15-topic model
Calculating 16-topic model
Calculating 17-topic model
Calculating 18-topic model
Calculating 19-topic model
Calculating 20-topic model
Calculating 21-topic model
Calculating 22-topic model
Calculating 23-topic model
Calculating 24-topic model
Calculating 25-topic model
Calculating 26-topic model
Calculating 27-topic model
Calculating 28-topic model
Calculating 29-topic model
Calculating 30-topic model
In [39]:
#Reading in the pickle file containing graphable coherence values
f = open('coherence_reg.pckl', 'rb')
coherence_values = pickle.load(f)
f.close()

#Graph of regular LDA coherence values as the number of topics increases
x = range(start, stop, step)
plt.figure(figsize=(10, 8))
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.title("Regular LDA Coherence Score Over Number of Topics");
plt.show()
#plt.legend(("coherence_values"), loc='best')
In [79]:
#Regular Gensim LDA model with gridsearch ideal parameters

ldamodel_gensim = gensim.models.LdaMulticore(corpus, num_topics = 17, id2word=dictionary, passes=7, workers=3,\
                                             random_state=0)


coherence_model_lda = CoherenceModel(model=ldamodel_gensim,texts= cleaned_abstracts, dictionary=dictionary, coherence='c_v')
coherence_model_lda.get_coherence()
Out[79]:
0.5838257450601767

To run Mallet LDA via Gensim, you need to have installed the mallet software package. This link will take you to where you can download it: http://mallet.cs.umass.edu/download.php. This link will take you to a tutorial that helps with proper usage of Mallet: https://www.tutorialspoint.com/gensim/gensim_creating_lda_mallet_model.htm.

In [21]:
#Mallet implementation of gensim LDA with ideal number of topics
os.environ.update({'MALLET_HOME':r'C:/Users/USER/Documents/Misc/covid19_research/mallet-2.0.8/'})

mallet_path = 'C:/Users/USER/Documents/Misc/covid19_research/mallet-2.0.8/bin/mallet'
In [22]:
#Iterating through all possible topic values from 2 to 30
start=2; stop=30; step=1;
stop += 1
coherence_values_lda = compute_coherence_values(dictionary=dictionary,corpus=corpus,texts=cleaned_abstracts,start=start,stop=stop,\
                                                        step=step, lda= True)

#Because the above code takes a long time to run, I am saving the output so that I don't have to rerun it every time I want to graph coherence
f = open('coherence_lda.pckl', 'wb')
pickle.dump(coherence_values_lda, f)
f.close()
Calculating 2-topic model
Calculating 3-topic model
Calculating 4-topic model
Calculating 5-topic model
Calculating 6-topic model
Calculating 7-topic model
Calculating 8-topic model
Calculating 9-topic model
Calculating 10-topic model
Calculating 11-topic model
Calculating 12-topic model
Calculating 13-topic model
Calculating 14-topic model
Calculating 15-topic model
Calculating 16-topic model
Calculating 17-topic model
Calculating 18-topic model
Calculating 19-topic model
Calculating 20-topic model
Calculating 21-topic model
Calculating 22-topic model
Calculating 23-topic model
Calculating 24-topic model
Calculating 25-topic model
Calculating 26-topic model
Calculating 27-topic model
Calculating 28-topic model
Calculating 29-topic model
Calculating 30-topic model

Below, you will see a graph of the coherence values of the Mallet LDA model as the number of topics increases, followed by a combined plot comparing both implementations. Looking at the combined chart, it is clear that the Mallet LDA achieves the higher coherence scores.

In [40]:
#Reading in the pickle file containing graphable coherence values
f = open('coherence_lda.pckl', 'rb')
coherence_values_lda = pickle.load(f)
f.close()

#Graph of Mallet LDA coherence values as number of topics increases
x = range(start, stop, step)
plt.figure(figsize=(10, 8))
plt.plot(x, coherence_values_lda)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.title("Mallet LDA Coherence Score Over Number of Topics")
#plt.legend(("coherence_values"), loc='best')
plt.show()
In [36]:
#Creating combined plot for easier comparison
x = range(start, stop, step)
plt.figure(figsize=(10, 8))
plt.plot(x, coherence_values_lda)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("mallet","regular"), loc='best')
plt.show()
In [24]:
# Rerun the 17 topic Mallet model
ldamodel_gensim_mallet = gensim.models.wrappers.LdaMallet(mallet_path= mallet_path, corpus= corpus, num_topics= 17,\
                                                          id2word=dictionary,workers=3,random_seed=0)
In [26]:
coherence_model_lda = CoherenceModel(model=ldamodel_gensim_mallet,texts= cleaned_abstracts, dictionary=dictionary, coherence='c_v')
coherence_model_lda.get_coherence()
Out[26]:
0.6101971685041588

After finding that the Mallet LDA model (0.610) performed better than the regular Gensim implementation (0.583), we decided to move forward with the Mallet model for visualization.

In [43]:
#Converting the Mallet model to an object that pyLDAvis understands
converted_model = ldaMalletConvertToldaGen(ldamodel_gensim_mallet)
In [44]:
#Saving the converted model for use in PyLDAvis
converted_model.save('model.gensim')

#Printing the top 10 words in each of the 17 topics
topics = converted_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
(0, '0.032*"development" + 0.027*"drug" + 0.025*"potential" + 0.019*"provide" + 0.018*"strategy" + 0.017*"target" + 0.016*"therapeutic" + 0.015*"develop" + 0.015*"effective" + 0.014*"approach"')
(1, '0.051*"increase" + 0.042*"effect" + 0.039*"level" + 0.036*"result" + 0.032*"change" + 0.029*"study" + 0.029*"show" + 0.024*"high" + 0.023*"reduce" + 0.020*"find"')
(2, '0.109*"patient" + 0.071*"treatment" + 0.029*"clinical" + 0.025*"treat" + 0.020*"therapy" + 0.018*"cancer" + 0.017*"outcome" + 0.015*"lesion" + 0.013*"stroke" + 0.012*"imaging"')
(3, '0.164*"virus" + 0.044*"influenza" + 0.039*"vaccine" + 0.035*"antibody" + 0.033*"strain" + 0.022*"isolate" + 0.019*"human" + 0.016*"infect" + 0.016*"infection" + 0.012*"influenza_virus"')
(4, '0.065*"health" + 0.029*"care" + 0.024*"public" + 0.020*"pandemic" + 0.016*"medical" + 0.015*"public_health" + 0.012*"management" + 0.012*"practice" + 0.011*"healthcare" + 0.011*"emergency"')
(5, '0.034*"lung" + 0.025*"blood" + 0.021*"tissue" + 0.017*"pressure" + 0.016*"chronic" + 0.016*"injury" + 0.015*"increase" + 0.015*"pulmonary" + 0.014*"failure" + 0.014*"disease"')
(6, '0.112*"patient" + 0.042*"risk" + 0.036*"year" + 0.031*"age" + 0.029*"mortality" + 0.026*"CI" + 0.023*"hospital" + 0.023*"factor" + 0.020*"high" + 0.013*"outcome"')
(7, '0.094*"COVID-19" + 0.066*"case" + 0.032*"SARS-cov-2" + 0.024*"coronavirus" + 0.023*"spread" + 0.022*"outbreak" + 0.022*"china" + 0.022*"number" + 0.021*"epidemic" + 0.021*"report"')
(8, '0.075*"protein" + 0.031*"RNA" + 0.027*"sequence" + 0.025*"gene" + 0.022*"viral" + 0.018*"bind" + 0.017*"genome" + 0.015*"virus" + 0.014*"structure" + 0.014*"site"')
(9, '0.066*"model" + 0.041*"datum" + 0.023*"method" + 0.023*"base" + 0.022*"system" + 0.018*"analysis" + 0.014*"propose" + 0.013*"network" + 0.013*"provide" + 0.012*"predict"')
(10, '0.094*"group" + 0.058*"compare" + 0.042*"study" + 0.039*"significantly" + 0.037*"control" + 0.035*"day" + 0.034*"low" + 0.034*"significant" + 0.030*"difference" + 0.028*"high"')
(11, '0.045*"patient" + 0.032*"surgery" + 0.027*"perform" + 0.026*"complication" + 0.025*"procedure" + 0.023*"technique" + 0.020*"laparoscopic" + 0.019*"surgical" + 0.019*"time" + 0.016*"postoperative"')
(12, '0.101*"cell" + 0.040*"response" + 0.031*"infection" + 0.026*"expression" + 0.024*"immune" + 0.019*"mice" + 0.017*"role" + 0.016*"mechanism" + 0.016*"induce" + 0.015*"viral"')
(13, '0.052*"test" + 0.044*"sample" + 0.035*"detect" + 0.032*"positive" + 0.031*"assay" + 0.029*"result" + 0.028*"detection" + 0.026*"method" + 0.020*"diagnostic" + 0.018*"testing"')
(14, '0.100*"disease" + 0.039*"human" + 0.031*"infectious" + 0.022*"animal" + 0.020*"specie" + 0.020*"population" + 0.017*"emerge" + 0.015*"infectious_disease" + 0.015*"important" + 0.014*"pathogens"')
(15, '0.094*"infection" + 0.091*"respiratory" + 0.048*"severe" + 0.041*"acute" + 0.034*"child" + 0.033*"syndrome" + 0.028*"SARS" + 0.023*"viral" + 0.022*"symptom" + 0.021*"pneumonia"')
(16, '0.084*"study" + 0.037*"review" + 0.036*"include" + 0.027*"report" + 0.026*"evidence" + 0.021*"trial" + 0.019*"intervention" + 0.019*"datum" + 0.017*"identify" + 0.016*"quality"')
In [45]:
#Loading the necessary objects for pyLDAvis
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model.gensim')

#Plotting the LDA model results
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
C:\Users\USER\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py:223: RuntimeWarning: divide by zero encountered in log
  kernel = (topic_given_term * np.log((topic_given_term.T / topic_proportion).T))
C:\Users\USER\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py:240: RuntimeWarning: divide by zero encountered in log
  log_lift = np.log(topic_term_dists / term_proportion)
C:\Users\USER\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py:241: RuntimeWarning: divide by zero encountered in log
  log_ttd = np.log(topic_term_dists)
Out[45]:

Based on the graphic above, you can see that the 17 topics are fairly well spaced out, with tokens distributed evenly among them. There is some overlap, but that is to be expected when certain sub-fields of COVID-19 research share similar terminology (e.g. cellular mechanisms and disease testing).

Unlike K-means clustering, LDA assumes that each document is made up of a mixture of all the topics the model was trained with. K-means performs hard clustering, meaning each document gets exactly one label and no trace of any other.
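
To make the contrast concrete, here is a minimal sketch (assuming the converted_model, corpus, and kmeans_model objects from above) comparing the single K-means label for the first abstract with the full LDA topic distribution for the same document:

#Hard label from K-means vs. soft topic mixture from LDA for the same abstract
print(kmeans_model.labels_[0])
print(converted_model.get_document_topics(corpus[0], minimum_probability=0.0))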

In [48]:
top_topics = converted_model.top_topics(corpus)
avg_topic_coherence = sum([t[1] for t in top_topics]) / true_k
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)
Average topic coherence: -2.0363.
[([(0.094175905, 'COVID-19'),
   (0.06571329, 'case'),
   (0.031871248, 'SARS-cov-2'),
   (0.024096353, 'coronavirus'),
   (0.022986049, 'spread'),
   (0.022322644, 'outbreak'),
   (0.022214388, 'china'),
   (0.021603722, 'number'),
   (0.021365007, 'epidemic'),
   (0.020698825, 'report'),
   (0.019238776, 'disease'),
   (0.018103492, 'transmission'),
   (0.017959151, 'country'),
   (0.015286097, 'pandemic'),
   (0.013912098, 'contact'),
   (0.013462424, 'death'),
   (0.0133569455, 'measure'),
   (0.013340292, 'confirm'),
   (0.011264024, 'control'),
   (0.010517346, 'infect')],
  -1.63786370696118),
 ([(0.11208527, 'patient'),
   (0.041550543, 'risk'),
   (0.035843838, 'year'),
   (0.030963836, 'age'),
   (0.029246848, 'mortality'),
   (0.02642944, 'CI'),
   (0.022843398, 'hospital'),
   (0.02257797, 'factor'),
   (0.020012166, 'high'),
   (0.0132161025, 'outcome'),
   (0.012641009, 'rate'),
   (0.012300929, 'analysis'),
   (0.011391285, 'datum'),
   (0.010268746, 'study'),
   (0.010141562, 'ratio'),
   (0.009901018, 'cohort'),
   (0.009773834, 'incidence'),
   (0.009370161, 'admission'),
   (0.009209799, 'care'),
   (0.009162796, 'unit')],
  -1.8110549078777107),
 ([(0.09413205, 'infection'),
   (0.09111527, 'respiratory'),
   (0.047724683, 'severe'),
   (0.04109772, 'acute'),
   (0.033610154, 'child'),
   (0.032587994, 'syndrome'),
   (0.02840925, 'SARS'),
   (0.023155527, 'viral'),
   (0.021772968, 'symptom'),
   (0.021039747, 'pneumonia'),
   (0.020691777, 'clinical'),
   (0.019772142, 'coronavirus'),
   (0.019483203, 'respiratory_syndrome'),
   (0.017299071, 'common'),
   (0.016040787, 'severe_acute'),
   (0.014624053, 'illness'),
   (0.013173143, 'tract'),
   (0.012517593, 'MERS-CoV'),
   (0.012486524, 'fever'),
   (0.010634827, 'disease')],
  -1.8379553379052407),
 ([(0.044821236, 'patient'),
   (0.03156017, 'surgery'),
   (0.026585827, 'perform'),
   (0.025801165, 'complication'),
   (0.024805138, 'procedure'),
   (0.023169221, 'technique'),
   (0.020490948, 'laparoscopic'),
   (0.019289346, 'surgical'),
   (0.01910983, 'time'),
   (0.016124642, 'postoperative'),
   (0.013243691, 'underwent'),
   (0.011978389, 'range'),
   (0.010776785, 'loss'),
   (0.010521987, 'pain'),
   (0.00965915, 'case'),
   (0.009583869, 'repair'),
   (0.009334862, 'approach'),
   (0.009297222, 'min'),
   (0.00904532, 'require'),
   (0.008975829, 'open')],
  -1.8526078109316377),
 ([(0.051743746, 'test'),
   (0.0438193, 'sample'),
   (0.0350125, 'detect'),
   (0.032029927, 'positive'),
   (0.030533608, 'assay'),
   (0.029134585, 'result'),
   (0.028118027, 'detection'),
   (0.025819872, 'method'),
   (0.020297587, 'diagnostic'),
   (0.017835036, 'testing'),
   (0.016365558, 'diagnosis'),
   (0.01543623, 'PCR'),
   (0.014919564, 'negative'),
   (0.014000302, 'laboratory'),
   (0.013949977, 'reaction'),
   (0.013533961, 'collect'),
   (0.013450086, 'sensitivity'),
   (0.012691863, 'specimen'),
   (0.012530824, 'clinical'),
   (0.011903444, 'rapid')],
  -1.8735887657300365),
 ([(0.10110144, 'cell'),
   (0.040418852, 'response'),
   (0.030531386, 'infection'),
   (0.026447138, 'expression'),
   (0.024421308, 'immune'),
   (0.018582787, 'mice'),
   (0.017026754, 'role'),
   (0.016453763, 'mechanism'),
   (0.016070867, 'induce'),
   (0.014772814, 'viral'),
   (0.013116303, 'host'),
   (0.012700818, 'signal'),
   (0.012160416, 'pathway'),
   (0.011470655, 'receptor'),
   (0.01130772, 'type'),
   (0.011066033, 'gene'),
   (0.0101210065, 'replication'),
   (0.010115576, 'antiviral'),
   (0.009789704, 'production'),
   (0.009697375, 'cytokine')],
  -1.8825397934701695),
 ([(0.07463499, 'protein'),
   (0.031103754, 'RNA'),
   (0.026617998, 'sequence'),
   (0.024952015, 'gene'),
   (0.022148417, 'viral'),
   (0.018029287, 'bind'),
   (0.017115422, 'genome'),
   (0.014907589, 'virus'),
   (0.014379219, 'structure'),
   (0.0137268435, 'site'),
   (0.013492311, 'acid'),
   (0.013160733, 'region'),
   (0.0121282535, 'domain'),
   (0.011176649, 'analysis'),
   (0.010842374, 'interaction'),
   (0.010578188, 'SARS-CoV'),
   (0.009931204, 'membrane'),
   (0.009807199, 'peptide'),
   (0.00933544, 'reveal'),
   (0.009268045, 'show')],
  -1.9104886457662833),
 ([(0.06524621, 'health'),
   (0.028692294, 'care'),
   (0.024010906, 'public'),
   (0.019537525, 'pandemic'),
   (0.015522759, 'medical'),
   (0.015186942, 'public_health'),
   (0.011605731, 'management'),
   (0.0115706455, 'practice'),
   (0.011487944, 'healthcare'),
   (0.0113275545, 'emergency'),
   (0.009400367, 'risk'),
   (0.009182336, 'social'),
   (0.0091673, 'impact'),
   (0.008906665, 'system'),
   (0.008818952, 'global'),
   (0.0087187085, 'service'),
   (0.008295178, 'information'),
   (0.008292672, 'provide'),
   (0.008292672, 'community'),
   (0.0082575865, 'work')],
  -1.9218596664123775),
 ([(0.08389636, 'study'),
   (0.036756597, 'review'),
   (0.03633027, 'include'),
   (0.027321847, 'report'),
   (0.025774103, 'evidence'),
   (0.021078354, 'trial'),
   (0.019230947, 'intervention'),
   (0.019067215, 'datum'),
   (0.017108593, 'identify'),
   (0.01599953, 'quality'),
   (0.014491947, 'assess'),
   (0.01407798, 'search'),
   (0.013287118, 'literature'),
   (0.0130029, 'conduct'),
   (0.011452068, 'article'),
   (0.011096797, 'research'),
   (0.010707544, 'clinical'),
   (0.010475846, 'database'),
   (0.010015539, 'systematic'),
   (0.009796198, 'assessment')],
  -1.9640083730532345),
 ([(0.093885995, 'group'),
   (0.05837329, 'compare'),
   (0.041661795, 'study'),
   (0.038768828, 'significantly'),
   (0.03696534, 'control'),
   (0.03472175, 'day'),
   (0.03443861, 'low'),
   (0.03433397, 'significant'),
   (0.030059151, 'difference'),
   (0.028437244, 'high'),
   (0.024753328, 'rate'),
   (0.022762107, 'time'),
   (0.015655873, 'score'),
   (0.014541772, 'total'),
   (0.014138604, 'evaluate'),
   (0.013796988, 'week'),
   (0.012642879, 'similar'),
   (0.01032235, 'period'),
   (0.010282341, 'dose'),
   (0.010033053, 'subject')],
  -2.00426938983439),
 ([(0.10004691, 'disease'),
   (0.03896734, 'human'),
   (0.03057537, 'infectious'),
   (0.021701364, 'animal'),
   (0.019711748, 'specie'),
   (0.019511169, 'population'),
   (0.016816292, 'emerge'),
   (0.015127545, 'infectious_disease'),
   (0.014648744, 'important'),
   (0.013626438, 'pathogens'),
   (0.012439139, 'identify'),
   (0.011494476, 'bat'),
   (0.0110189095, 'include'),
   (0.009986898, 'potential'),
   (0.009880138, 'host'),
   (0.009721615, 'major'),
   (0.009307517, 'outbreaks'),
   (0.009077822, 'role'),
   (0.008854596, 'transmission'),
   (0.008637842, 'understanding')],
  -2.025850713582223),
 ([(0.032052416, 'development'),
   (0.026760524, 'drug'),
   (0.024730673, 'potential'),
   (0.019050812, 'provide'),
   (0.017678954, 'strategy'),
   (0.017387202, 'target'),
   (0.015739111, 'therapeutic'),
   (0.015136984, 'develop'),
   (0.01510905, 'effective'),
   (0.014295868, 'approach'),
   (0.013582006, 'recent'),
   (0.013544761, 'system'),
   (0.013451648, 'research'),
   (0.013060576, 'current'),
   (0.0126664, 'discuss'),
   (0.01212014, 'application'),
   (0.012014613, 'technology'),
   (0.01122626, 'agent'),
   (0.01082898, 'review'),
   (0.010667586, 'antiviral')],
  -2.047166594302885),
 ([(0.050782956, 'increase'),
   (0.0417871, 'effect'),
   (0.03858507, 'level'),
   (0.035891563, 'result'),
   (0.03209572, 'change'),
   (0.028959308, 'study'),
   (0.02862467, 'show'),
   (0.023969267, 'high'),
   (0.023286866, 'reduce'),
   (0.019766606, 'find'),
   (0.019218719, 'suggest'),
   (0.017932659, 'activity'),
   (0.01712231, 'investigate'),
   (0.015399909, 'concentration'),
   (0.014402556, 'decrease'),
   (0.01205681, 'examine'),
   (0.01132848, 'influence'),
   (0.011033211, 'condition'),
   (0.010905261, 'exposure'),
   (0.010800277, 'observe')],
  -2.1222980067739945),
 ([(0.06561948, 'model'),
   (0.041134555, 'datum'),
   (0.023437742, 'method'),
   (0.023279734, 'base'),
   (0.021656288, 'system'),
   (0.018269975, 'analysis'),
   (0.014273321, 'propose'),
   (0.013105307, 'network'),
   (0.012683954, 'provide'),
   (0.012349351, 'predict'),
   (0.011745205, 'dynamic'),
   (0.011354835, 'result'),
   (0.0111968275, 'information'),
   (0.010899402, 'time'),
   (0.010846733, 'approach'),
   (0.010781671, 'set'),
   (0.008839112, 'paper'),
   (0.00850141, 'number'),
   (0.008392973, 'parameter'),
   (0.008067665, 'performance')],
  -2.144183627391466),
 ([(0.16379371, 'virus'),
   (0.044312708, 'influenza'),
   (0.038911857, 'vaccine'),
   (0.03455581, 'antibody'),
   (0.03307058, 'strain'),
   (0.021622699, 'isolate'),
   (0.018623296, 'human'),
   (0.015710695, 'infect'),
   (0.015501733, 'infection'),
   (0.012029756, 'influenza_virus'),
   (0.01142216, 'pig'),
   (0.011254991, 'vaccination'),
   (0.011190695, 'calf'),
   (0.010750269, 'viral'),
   (0.009875844, 'antigen'),
   (0.009519002, 'porcine'),
   (0.009445063, 'type'),
   (0.009133227, 'diarrhea'),
   (0.008847111, 'show'),
   (0.008676726, 'IBV')],
  -2.4667981099998917),
 ([(0.10894127, 'patient'),
   (0.07104648, 'treatment'),
   (0.028627314, 'clinical'),
   (0.024824712, 'treat'),
   (0.020037869, 'therapy'),
   (0.018369349, 'cancer'),
   (0.016541475, 'outcome'),
   (0.014597992, 'lesion'),
   (0.013035707, 'stroke'),
   (0.012392046, 'imaging'),
   (0.012260814, 'tumor'),
   (0.012254565, 'CT'),
   (0.011535914, 'month'),
   (0.010848508, 'aneurysm'),
   (0.009764282, 'follow-up'),
   (0.009164365, 'survival'),
   (0.008967517, 'primary'),
   (0.008611316, 'device'),
   (0.00843634, 'artery'),
   (0.008089513, 'efficacy')],
  -2.5362340183046466),
 ([(0.034263525, 'lung'),
   (0.024955839, 'blood'),
   (0.021096725, 'tissue'),
   (0.016529642, 'pressure'),
   (0.016442882, 'chronic'),
   (0.016071547, 'injury'),
   (0.01471808, 'increase'),
   (0.014700728, 'pulmonary'),
   (0.014305099, 'failure'),
   (0.014003172, 'disease'),
   (0.013919882, 'liver'),
   (0.013718597, 'function'),
   (0.012885694, 'ventilation'),
   (0.012333897, 'cat'),
   (0.010852025, 'normal'),
   (0.010123235, 'condition'),
   (0.010102413, 'plasma'),
   (0.009734547, 'rat'),
   (0.009491617, 'present'),
   (0.009234806, 'follow')],
  -2.5786480859020893)]

Search Engine

Using this search engine framework, we created an application that can parse through the repository of 75,000 abstracts provided in the data source for this notebook. Check it out at:

In [100]:
metadata.abstract2[1]
Out[100]:
"['global', 'COVID-19', 'death', 'toll', 'stands', 'time', 'writing', 'time', 'read', 'number', 'increase', 'significantly', 'likely', 'see', 'end', 'time', 'policy', 'maker', 'global', 'north', 'well', 'global', 'south', 'rise', 'challenge', 'decidedly', 'mixed', 'response', 'well', 'decidedly', 'mixed', 'result', 'comparison', 'report', 'case', 'loads', 'say', 'UK', 'germany', 'brazil', 'PR', 'china', 'show', 'discipline', 'specific', 'response', 'translate', 'many', 'global', 'collaborative', 'effort', 'aim', 'develop', 'treatment', 'model', 'impact', 'varying', 'policy', 'option', 'continue', 'pandemic', 'preventive', 'vaccine', 'trial', 'forth']"
In [110]:
SE_test = metadata.copy()#.sample(n=10000,random_state=0)
In [11]:
SE_test = SE_test#.reset_index()
In [64]:
SE_test.head()
Out[64]:
cord_uid sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal mag_id who_covidence_id arxiv_id pdf_json_files pmc_json_files url s2_id abstract2
0 lt17jimb NaN Medline The role of family variables in fruit and vege... 10.4081/jphr.2012.e22 NaN 25170457.0 unk ABSTRACT Most Americans, including children, c... 2012 Goldman, Rachel L; Radnitz, Cynthia L; McGrath... Journal of public health research NaN NaN NaN NaN NaN https://doi.org/10.4081/jphr.2012.e22; https:/... 10065731.0 ['americans', 'include', 'child', 'continue', ...
1 a943sj83 NaN WHO The ethical challenges of the SARS-CoV-2 pande... NaN NaN NaN unk The global COVID-19 death toll stands at the t... 2020 Schuklenk, Udo NaN NaN #154926 NaN NaN NaN NaN 216647886.0 ['global', 'COVID-19', 'death', 'toll', 'stand...
2 mpacgk6o 5222649230a68712e13dbfa99f7b2fd8f8190bc6 Elsevier; Medline; PMC Pathology of the Exotic Companion Mammal Gastr... 10.1016/j.cvex.2014.01.002 PMC7172800 24767738.0 no-cc A variety of disease agents can affect the gas... 2014-04-23 Reavill, Drury Vet Clin North Am Exot Anim Pract NaN NaN NaN document_parses/pdf_json/5222649230a68712e13db... document_parses/pmc_json/PMC7172800.xml.json https://doi.org/10.1016/j.cvex.2014.01.002; ht... 25425517.0 ['variety', 'disease', 'agent', 'affect', 'gas...
3 874kthen NaN Medline Risk factors for bleeding evaluated using the ... 10.1097/meg.0000000000000419 NaN 26075810.0 unk BACKGROUND/AIMS Bleeding remains a serious com... 2015 Noda, Hisatsugu; Ogasawara, Naotaka; Izawa, Sh... European journal of gastroenterology & hepatology NaN NaN NaN NaN NaN https://doi.org/10.1097/meg.0000000000000419; ... 10988506.0 ['backgroundaim', 'bleed', 'remain', 'serious'...
4 j6wylhzw 40cba5e191966ed333cfa3346d485d5af0791ec0 Elsevier; PMC Clinical perspectives of emerging pathogens in... 10.1016/s0140-6736(06)68036-7 PMC7138062 NaN els-covid Summary As a result of immunological and nucle... 2006-01-27 Ludlam, Christopher A; Powderly, William G; Bo... The Lancet NaN NaN NaN document_parses/pdf_json/40cba5e191966ed333cfa... document_parses/pmc_json/PMC7138062.xml.json https://www.sciencedirect.com/science/article/... 37595885.0 ['summary', 'result', 'immunological', 'nuclei...
In [13]:
#Changing each entry in abstract2 column to a list 
SE_test.abstract2 = SE_test['abstract2'].apply(eval)
In [14]:
#Joining the token lists back into space-separated strings
SE_test.abstract2 = SE_test['abstract2'].apply(lambda x: ' '.join(x))
#search_engine_cleaned_tf = list(search_engine_cleaned_tf)
In [143]:
data = SE_test.copy()
data.abstract2 = data["abstract2"].apply(eval)
n_results = 10
In [255]:
def covid_search(data, n_results=5):
    #Get user input
    query = input()

    #Tokenize query into words
    query = word_tokenize(query)

    #Lowercase query the same way the data was lowercased
    lower_case_clean_query = []
    for term in query:
        if len(term) > 2 and sum(1 for c in term if c.isupper()) <= 1:
            lower_case_clean_query.append(term.lower())
        else:
            lower_case_clean_query.append(term)

    #Lemmatize query
    lemmatizer = WordNetLemmatizer()
    clean_query = []
    for term in lower_case_clean_query:
        clean_query.append(lemmatizer.lemmatize(term))

    #Insert filtration of abstracts without query words here
    query_as_set = set(clean_query)

    #Filtering for rows that contain at least one of the query tokens
    data_sub = data.loc[data.abstract2.apply(lambda x: len(query_as_set.intersection(x)) >= 1)]
    data_sub.reset_index(inplace = True, drop = True)
    #data_sub.abstract2 = data_sub['abstract2'].apply(lambda x: ' '.join(x))

    #Rejoin query tokens
    clean_query = ' '.join(clean_query)

    #Figure out this statement later, refers to subset of data outside of function
    abstracts = data_sub.abstract2.tolist()
    abstracts = [' '.join(list_val) for list_val in abstracts] #= ' '.join(abstracts)#.apply(lambda x:' '.join(x))


    #Add query to list object (n+1 total documents)
    abstracts.append(clean_query)

    #Create TF-IDF vectorizer object 
    tv = TfidfVectorizer(norm = 'l2', use_idf = True, smooth_idf= True, lowercase=False,\
                         analyzer="word", token_pattern=r"(?u)\S\S+")

    #Run TF-IDF fit transform and convert to array on all documents + query
    tv_matrix = tv.fit_transform(abstracts)
    tv_matrix = tv_matrix.toarray()


    #Calculate cosine similarity between each of n documents and the query (last entry of list)
    cos_sim_matrix = cosine_similarity(tv_matrix[:-1,],tv_matrix[-1,].reshape(1,-1))

    #Compiling results into dataframe 
    results = pd.DataFrame(cos_sim_matrix)

    #Sort based on cosine similarity column (labeled 0 by default)
    results = results.sort_values(0, ascending = False)

    #Merge in original abstract column
    results = results.merge(data_sub.abstract,left_index=True, right_index=True, how='left')

    #Merge in cord_uid column 
    results = results.merge(data_sub.cord_uid,left_index=True, right_index=True, how= 'left')

    #Merge in title column
    results = results.merge(data_sub.title,left_index=True, right_index=True, how= 'left')

    #Renaming columns for better readability
    results.rename(columns={0:"cosine_similarity"}, inplace=True)

    #Filtering out results with cosine similarity of 0
    #results = results[results.cosine_similarity > 0]

    final_results = results.iloc[0:n_results,]

    return final_results
In [258]:
covid_search(data = data,n_results=10)
MERS-CoV Transmission Symptoms
Out[258]:
cosine_similarity abstract cord_uid title
1612 0.506918 Middle East respiratory syndrome coronavirus (... tktk2hi9 Middle East respiratory syndrome coronavirus: ...
9767 0.501036 The newly emerged Middle East respiratory synd... i2zmfwa6 Crystal structure of the receptor-binding doma...
8503 0.433815 Summary Background A case control study to bet... ynh84b98 Predictors of MERS-CoV infection: A large case...
5352 0.423941 Middle East respiratory syndrome coronavirus (... 404zraam Growth and Quantification of MERS-CoV Infection
11163 0.422500 Middle East respiratory syndrome coronavirus (... ynlj4cge Detection of the Middle East Respiratory Syndr...
1212 0.420533 BACKGROUND: The new Middle East respiratory sy... n2abv3ub Interhuman transmissibility of Middle East res...
4594 0.412288 Middle East respiratory syndrome coronavirus (... im1fs69k Development of Small-Molecule MERS-CoV Inhibitors
10767 0.409794 The emergence of the Middle East respiratory s... 7hw23xae Deciphering MERS-CoV Evolution in Dromedary Ca...
5613 0.400443 BACKGROUND: Middle East respiratory syndrome c... w90eitw9 Current epidemiological status of Middle East ...
6820 0.397828 The Middle East Respiratory Coronavirus (MERS-... 28znb322 Middle East respiratory syndrome coronavirus (...

Conclusions

We have been able to boil the research presented in the CORD-19 Research Data Challenge down to 17 main research areas. These topics reflect the different facets of research that scientists are conducting on COVID-19, and many of them overlap where sub-fields share common terminology and findings.

From here, we will assign these topic labels back to the individual abstracts using the LDA soft-clustering results. We will create topical density plots to show the abstract distribution and how "soft" the boundaries between topics are in a more granular way. We will also implement a summarization algorithm that takes all abstracts attributed to a specific topic label and distills them into the most important sentences and phrases. Additionally, we will refine the search engine and figure out how to make it dynamic enough for general use.
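
As a minimal sketch of that first step (illustrative only, assuming the converted_model and corpus objects from the LDA section), the dominant topic for each abstract could be pulled out like this:

#Take the highest-probability topic for each abstract as its provisional label
dominant_topics = [max(converted_model.get_document_topics(bow), key=lambda x: x[1])[0]
                   for bow in corpus]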

Sources

We found the following articles helpful for code and/or concepts:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd
https://radimrehurek.com/gensim/models/ldamulticore.html
https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html
https://towardsdatascience.com/using-mallet-lda-to-learn-why-players-hate-pok%C3%A9mon-sword-shield-23b12e4fc395
https://pythonprogramminglanguage.com/kmeans-text-clustering/
https://www.youtube.com/channel/UCgBncpylJ1kiVaPyP-PZauQ
https://www.datacamp.com/community/tutorials/introduction-t-sne