Iqra Munawar (M.S. Analytics, NCSU 2020) & Sameen Salam (M.S. Analytics, NCSU 2020)
COVID-19 is one of the biggest challenges the world has had to face in recent memory. As scientists continue to work tirelessly towards a resolution of the crisis, one thing is abundantly clear: there is an enormous amount of COVID-19 related research. In this analysis, we implement two methods to categorize abstracts according to topical similarity: a K-means approach (based on TF-IDF) and an LDA topic modeling approach. By doing this, we hope to make a large corpus of research like this easily parseable for individual labs and accessible to the general public.
The original data source used in this analysis comes from the CORD-19 Research Data Challenge hosted on Kaggle. We created our own text pre-processing pipeline (paper_abstract_cleaner.ipynb) and fed the original Kaggle dataset into it to get an output with all original rows and columns plus the extra "abstract2" column. This additional column contains the cleaned and modelable abstracts, and is what is primarily used in this analysis notebook. There is a .zip file for the data included in this repository if you would like to run it locally. Alternatively, if you have a Kaggle account, you can download it locally off of the Kaggle commit output here: https://www.kaggle.com/sameensalam/paper-abstract-cleaner/output.
Import the necessary libraries, define the necessary functions, and read in the data for this analysis.
#Load in the necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
import random
import matplotlib.pyplot as plt
import string
import re
import pickle
import gensim
import pyLDAvis.gensim
import time
import os
import matplotlib.patheffects as PathEffects
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from nltk import FreqDist
from gensim.models import LdaModel
from gensim import corpora
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models import Phrases
from gensim.models import CoherenceModel
from collections import Counter
from wordcloud import WordCloud
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from zipfile import ZipFile
def wordcloud(counter):
    """A small wordcloud wrapper"""
    wc = WordCloud(width=1200, height=800,
                   background_color="white",
                   max_words=200)
    wc.generate_from_frequencies(counter)
    # Plot
    fig = plt.figure(figsize=(6, 4))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
def compute_coherence_values(dictionary, corpus, texts, stop, start=2, step=3, lda=True):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    stop : Max number of topics (exclusive)
    lda : If True, fit Mallet LDA models; if False, fit regular Gensim LDA models

    Returns:
    -------
    coherence_values : Coherence values corresponding to the model with the respective number of topics
    """
    coherence_values = []
    for num_topics in range(start, stop, step):
        print('Calculating {}-topic model'.format(num_topics))
        if lda is True:
            model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary,
                                                     random_seed=0, workers=3)
        elif lda is False:
            model = gensim.models.LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                                               random_state=0, workers=3)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return coherence_values
#This snippet is necessary to convert the Mallet model object into something that the pyLDAvis library can use
#Function to bypass gensim.models.wrappers.ldamallet.malletmodel2ldamodel, which has known bugs that reduce model performance
#Credit to Stackoverflow user: norpa
def ldaMalletConvertToldaGen(mallet_model):
    model_gensim = LdaModel(id2word=mallet_model.id2word, num_topics=mallet_model.num_topics, alpha=mallet_model.alpha, eta=0,
                            iterations=1000, gamma_threshold=0.001, dtype=np.float32)
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim
#Located in the repository with this notebook
file_name = "abstract_cleaned.zip"
with ZipFile(file_name, 'r') as zf:
    zf.extractall()
#Read in the data
metadata = pd.read_csv("abstract_cleaned.csv")
#If downloaded directly as a CSV from Kaggle kernel output, then edit path and run
#metadata = pd.read_csv(r'C:\Users\USER\Documents\Misc\covid19_research\abstract_cleaned.csv')
metadata.shape
metadata.head()
metadata.info()
In this section, we perform a touch of extra cleaning and take a look at the most common words in our corpus via wordcloud.
#Converting each entry in the abstract2 column from its string representation to a list of tokens
cleaned_abstracts = metadata['abstract2']
cleaned_abstracts = cleaned_abstracts.apply(eval)
We eliminate every abstract with 60 or fewer tokens, because short abstracts with little semantic content could add noise to our language models.
cleaned_abstracts = cleaned_abstracts[cleaned_abstracts.str.len() > 60]
cleaned_abstracts.shape
tokens = cleaned_abstracts.to_list()
#Set of additional words that act as stopwords but were overlooked in the cleaning process
extra_stopwords = {'using', 'may', 'also', 'used', 'use', 'however', 'including', 'among', 'could', 'based', 'within', 'OBJECTIVES',
                   'FINDINGS'}
# Removing those additional words from the list of lists
tokens = [[ele for ele in sub if ele not in extra_stopwords] for sub in tokens]
tokens_all = [item for items in tokens for item in items]
counter = Counter(tokens_all)
#counter.most_common(1000)
%matplotlib inline
# create wordcloud
wordcloud(counter)
Here we use K-means clustering on the TF-IDF matrix of the entire corpus and visualize the clusters in a lower-dimensional space using PCA in combination with t-SNE.
#Creating a new cleaned_abstracts_tf object since these next lines of code apply only to K-means
cleaned_abstracts_tf = cleaned_abstracts.apply(lambda x: ' '.join(x))
cleaned_abstracts_tf = list(cleaned_abstracts_tf)
#Direct way to get a TF-IDF model without having to go through an initial bag-of-words model
tv = TfidfVectorizer(min_df= 0.008, max_df = 0.5, norm = 'l2', use_idf = True, smooth_idf= True, lowercase=False,\
analyzer="word", token_pattern=r"(?u)\S\S+")
#Transforming the cleaned_abstracts_tf object
tv_matrix = tv.fit_transform(cleaned_abstracts_tf)
tv_matrix = tv_matrix.toarray()
terms = tv.get_feature_names()
#Creating and showing a dataframe to help show how the transformation happened. Each row is a document, each feature is a token
tv_dataframe = pd.DataFrame(np.round(tv_matrix, 2), columns=terms)
tv_dataframe
#Running the K-means model for k from 2 to 29
inertia_vals = []
cluster_number = range(2,30)
for i in cluster_number:
    true_k = i
    kmeans_model = KMeans(n_clusters=true_k, random_state=0)
    kmeans_model.fit(tv_matrix)
    inertia_vals.append(kmeans_model.inertia_)
Below is a plot of the total within-cluster sum of squares (inertia) for each value of k tested in the K-means approach. Since there is no clear elbow in this plot, we chose 17 clusters for our visuals, which also matches the number of topics we selected for our final LDA model (discussed later).
#Plotting the total sum of squares for each value of k in K-means
plt.plot(cluster_number, inertia_vals, 'bx-')
plt.xlabel('k')
plt.ylabel('Total Sum of Squares')
plt.title('Elbow Method For Optimal K')
plt.show()
#Defining our K-Means model with the optimal number of clusters (17)
true_k = 17
kmeans_model = KMeans(n_clusters= true_k, random_state= 0)
#Fitting the kmeans model
kmeans_model.fit(tv_matrix)
#Checking to see if the model properly labeled a few of the documents
kmeans_model.labels_
#Getting the contribution of each word in descending order per centroid
order_centroids = kmeans_model.cluster_centers_.argsort()[:, ::-1]
order_centroids
Below, you will see the top words that define each cluster. Using these top words, we can infer the general subject matter for abstracts within any given cluster. For example, a research paper abstract that falls into cluster 0 most likely pertains to the cellular biological characteristics of the novel coronavirus.
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind])
    print("-----------------------------")
#Defining and fitting a 3 component PCA model for visualization
pca = PCA(n_components=3)
scatter_plot_points = pca.fit_transform(tv_matrix)
#Creating a dataframe with the 3 principal components and each abstract's K-Means cluster label
pca_points = pd.DataFrame(columns=['pca1','pca2','pca3'])
pca_points["pca1"] = scatter_plot_points[:,0]
pca_points["pca2"] = scatter_plot_points[:,1]
pca_points["pca3"] = scatter_plot_points[:,2]
pca_points["cluster_label"] = kmeans_model.labels_
pca_points.head()
#Plot for Principal components 1 and 2
#Predefined 17 colors (chosen for maximal contrast within 17 distinct groups)
colors = ["#ff0000", "#00ff00", "#eeeeee", "#c79dd7", "black","#666547", "#fb2e01", "#6fcb9f", "#ffe28a", "#fffeb3","#363b74", "#673888",\
"#ef4f91", "#0000ff", "#4d1b7b","#8f9779", "#4d5d53"]
#Creating the plot
x_axis = [o for o in pca_points["pca1"]]
y_axis = [o for o in pca_points["pca2"]]
fig, ax = plt.subplots(figsize=(20,10))
_= ax.scatter(x_axis, y_axis, c=[colors[d] for d in pca_points["cluster_label"]], alpha = 0.7)
_= ax.axis('off')
#Adding easily visible cluster numbers. Single quotes around each number prevent confusion (e.g. a cluster 1 label next to a cluster 3 label reading as cluster 13)
for i in range(true_k):
    xtext = np.median(pca_points.loc[pca_points["cluster_label"] == i, "pca1"]) + (np.random.rand()/100)
    ytext = np.median(pca_points.loc[pca_points["cluster_label"] == i, "pca2"]) + (np.random.rand()/100)
    txt = ax.text(xtext, ytext, "'" + str(i) + "'", fontsize=24)
    txt.set_path_effects([
        PathEffects.Stroke(linewidth=5, foreground="w"),
        PathEffects.Normal()])
#Plot of principal components 2 and 3 (reusing the 17-color palette defined above)
x_axis = [o for o in pca_points["pca2"]]
y_axis = [o for o in pca_points["pca3"]]
fig, ax = plt.subplots(figsize=(20,10))
_ = ax.scatter(x_axis, y_axis, c=[colors[d] for d in pca_points["cluster_label"]], alpha=0.7)
_ = ax.axis('off')
for i in range(true_k):
    xtext = np.median(pca_points.loc[pca_points["cluster_label"] == i, "pca2"]) + (np.random.rand()/100)
    ytext = np.median(pca_points.loc[pca_points["cluster_label"] == i, "pca3"]) + (np.random.rand()/100)
    txt = ax.text(xtext, ytext, "'" + str(i) + "'", fontsize=24)
    txt.set_path_effects([
        PathEffects.Stroke(linewidth=5, foreground="w"),
        PathEffects.Normal()])
The above PCA plots are helpful, but there is a lot of noise, and it is somewhat difficult to see cohesive clusters. Using PCA in combination with the t-SNE algorithm yields much cleaner visualizations because additional noise is removed from the signal. We reduced the dimensionality of the TF-IDF input by 90% (~1850 features to 185). In the resulting plot, most clusters overlap slightly with other clusters far from their respective centroids. This makes sense, since much of this research shares common terminology or cites the same previously observed phenomena. Clusters that are close together have similar subject matter: for example, clusters 2 and 7 in the top right of the t-SNE plot sit very close to one another and share similar top words, such as COVID-19 and disease. Clusters that are splattered, for lack of a better term, away from their centroid tend to have more terms that overlap with other clusters; cluster 16 (in the center), for example, shares many top words with other topics.
#Reducing the tv_matrix into ~10% of its original dimensionality
pca_185 = PCA(n_components=185)
reduced_dim = pca_185.fit_transform(tv_matrix)
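As a quick check on how much signal the reduced representation keeps (a small addition, not part of the original analysis; pca_185 is the PCA object fit above), we can print the proportion of TF-IDF variance retained by the 185 components:
#Proportion of total variance retained by the 185 principal components
print("Variance retained by 185 components: %.1f%%" % (100 * pca_185.explained_variance_ratio_.sum()))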
#Creating and fitting the TSNE model with the PCA output
tsne = TSNE(random_state=0)
tsne_results = tsne.fit_transform(reduced_dim)
#Creating the tsne_points dataframe as input for graphing
tsne_points = pd.DataFrame(columns=['tsne1','tsne2','cluster_label'])
tsne_points["tsne1"] = tsne_results[:,0]
tsne_points["tsne2"] = tsne_results[:,1]
tsne_points["cluster_label"] = kmeans_model.labels_
#Plotting the 2-dimensional t-SNE output to see how the clusters look after this transformation (reusing the same palette)
x_axis = [o for o in tsne_points["tsne1"]]
y_axis = [o for o in tsne_points["tsne2"]]
fig, ax = plt.subplots(figsize=(20,10))
_ = ax.scatter(x_axis, y_axis, c=[colors[d] for d in tsne_points["cluster_label"]], alpha=0.7)
_ = ax.axis('off')
for i in range(true_k):
    xtext = np.median(tsne_points.loc[tsne_points["cluster_label"] == i, "tsne1"]) + (np.random.rand()/100)
    ytext = np.median(tsne_points.loc[tsne_points["cluster_label"] == i, "tsne2"]) + (np.random.rand()/100)
    txt = ax.text(xtext, ytext, "'" + str(i) + "'", fontsize=24)
    txt.set_path_effects([
        PathEffects.Stroke(linewidth=5, foreground="w"),
        PathEffects.Normal()])
We clearly have some interesting preliminary results here, but we also wanted to apply a topic modeling algorithm suited for this sort of analysis: Latent Dirichlet Allocation, or LDA.
Here we use the Latent Dirichlet Allocation (LDA) model to model the topics in this corpus of 75,000 abstracts. We included bigrams (pairs of tokens that occur together in more than 20 documents, in this case) and eliminated words that occurred in fewer than 521 documents. Using the resulting token ids for the entire corpus and the individual bag-of-words representations for each abstract, we conducted a grid search over the number of topics and the alpha, eta (beta), and decay parameters (see the commented-out grid search code below).
We tried these parameters on the regular Gensim LDA algorithm, which uses variational Bayes inference. We also used the Mallet implementation via Gensim, which uses Gibbs sampling. The ideal model was selected based on coherence score, which measures how similar the high-scoring words in each topic are. Using the final candidate model, we plotted the topic distribution using the pyLDAvis library, a staple for this particular algorithm.
# Compute bigrams from the cleaned_abstracts object defined earlier
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(cleaned_abstracts, min_count=20)
for idx in range(len(cleaned_abstracts)):
    for token in bigram[cleaned_abstracts.iloc[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            cleaned_abstracts.iloc[idx].append(token)
#Create token ids for entire corpus
dictionary = corpora.Dictionary(cleaned_abstracts)
# Filter out words that occur in fewer than 521 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below= 521, no_above=0.5)
#Creating individual bag of words for each abstract mapped according to the dictionary
corpus = [dictionary.doc2bow(text) for text in cleaned_abstracts]
#Saving the dictionary and corpus items for later use
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')
#Gridsearch parameters for the LDA model. This code takes about 6 hours to run, so it's commented out for convenience here.
#test_alphas = [0.05,0.17,0.5] ----- default (1/num_topics, 1/6 ~ 0.17) was best
#test_betas = [0.05,0.17,0.5] ----- 0.5 was best, but only very slightly
#test_decay = [0.5,0.6,0.7] ----- 0.5 was the best value
#coherence_vals = []
#counter = 1
#for topic_val in num_topics:
# for alpha_val in test_alphas:
# for beta_val in test_betas:
# for decay_val in test_decay:
# start_time = time.time()
# ldamodel_gensim = gensim.models.LdaMulticore(corpus, num_topics = topic_val, id2word=dictionary, passes=7, workers=3, random_state=0,\
# alpha=alpha_val, eta=beta_val, decay=decay_val)
# end_time = time.time()
#print("Perplexity Score:", ldamodel_gensim.log_perplexity(corpus))
# coherence_model_lda = CoherenceModel(model=ldamodel_gensim,texts= corrected_subset, dictionary=dictionary, coherence='c_v')
# coherence_vals.append(coherence_model_lda.get_coherence())
# print("Finished model:",counter, "of 72")
# counter+=1
#coherence_vals
#Iterating through all possible topic values from 2 to 30
start = 2; stop = 30; step = 1
stop += 1  #make the upper end of range() inclusive
coherence_values = compute_coherence_values(dictionary=dictionary,corpus=corpus,texts=cleaned_abstracts,start=start,stop=stop,\
step=step, lda=False)
#Because the above code takes a long time to run, we save the output so that we don't have to rerun it every time we want to graph coherence
f = open('coherence_reg.pckl', 'wb')
pickle.dump(coherence_values, f)
f.close()
#Reading in the pickle file containing graphable coherence values
f = open('coherence_reg.pckl', 'rb')
coherence_values = pickle.load(f)
f.close()
#Graph of regular LDA coherence values as the number of topics increases
x = range(start, stop, step)
plt.figure(figsize=(10, 8))
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.title("Regular LDA Coherence Score Over Number of Topics");
plt.show()
#plt.legend(("coherence_values"), loc='best')
#Regular Gensim LDA model with the ideal number of topics (17); the grid search favored values at or near the library defaults
ldamodel_gensim = gensim.models.LdaMulticore(corpus, num_topics = 17, id2word=dictionary, passes=7, workers=3,\
random_state=0)
coherence_model_lda = CoherenceModel(model=ldamodel_gensim,texts= cleaned_abstracts, dictionary=dictionary, coherence='c_v')
coherence_model_lda.get_coherence()
To run Mallet LDA via Gensim, you need to have the Mallet software package installed. This link will take you to the download page: http://mallet.cs.umass.edu/download.php. This tutorial helps with proper usage of Mallet: https://www.tutorialspoint.com/gensim/gensim_creating_lda_mallet_model.htm.
#Mallet implementation of gensim LDA with ideal number of topics
os.environ.update({'MALLET_HOME':r'C:/Users/USER/Documents/Misc/covid19_research/mallet-2.0.8/'})
mallet_path = 'C:/Users/USER/Documents/Misc/covid19_research/mallet-2.0.8/bin/mallet'
#Iterating through all possible topic values from 2 to 30
start = 2; stop = 30; step = 1
stop += 1  #make the upper end of range() inclusive
coherence_values_lda = compute_coherence_values(dictionary=dictionary,corpus=corpus,texts=cleaned_abstracts,start=start,stop=stop,\
step=step, lda= True)
#Because the above code takes a long time to run, we save the output so that we don't have to rerun it every time we want to graph coherence
f = open('coherence_lda.pckl', 'wb')
pickle.dump(coherence_values_lda, f)
f.close()
Below is a graph of the coherence values of the Mallet LDA model as the number of topics increases. Compared with the regular LDA curve above (see also the combined plot below), the Mallet implementation clearly achieves higher coherence scores.
#Reading in the pickle file containing graphable coherence values
f = open('coherence_lda.pckl', 'rb')
coherence_values_lda = pickle.load(f)
f.close()
#Graph of Mallet LDA coherence values as number of topics increases
x = range(start, stop, step)
plt.figure(figsize=(10, 8))
plt.plot(x, coherence_values_lda)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.title("Mallet LDA Coherence Score Over Number of Topics")
#plt.legend(("coherence_values"), loc='best')
plt.show()
#Creating combined plot for easier comparison
x = range(start, stop, step)
plt.figure(figsize=(10, 8))
plt.plot(x, coherence_values_lda)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("mallet","regular"), loc='best')
plt.show()
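As a numeric complement to the plots, we can also print the best coherence score each implementation achieved across the tested topic counts (a small sanity check added here; coherence_values and coherence_values_lda are the lists loaded above):
#Best coherence score and corresponding topic count for each implementation
topic_counts = list(range(start, stop, step))
best_mallet = int(np.argmax(coherence_values_lda))
best_regular = int(np.argmax(coherence_values))
print("Best Mallet LDA coherence: %.3f at %d topics" % (coherence_values_lda[best_mallet], topic_counts[best_mallet]))
print("Best regular LDA coherence: %.3f at %d topics" % (coherence_values[best_regular], topic_counts[best_regular]))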
# Rerun the 17 topic Mallet model
ldamodel_gensim_mallet = gensim.models.wrappers.LdaMallet(mallet_path= mallet_path, corpus= corpus, num_topics= 17,\
id2word=dictionary,workers=3,random_seed=0)
coherence_model_lda = CoherenceModel(model=ldamodel_gensim_mallet,texts= cleaned_abstracts, dictionary=dictionary, coherence='c_v')
coherence_model_lda.get_coherence()
After finding that the Mallet LDA model (0.610) performed better than the regular Gensim implementation (0.583), we decided to move forward with the Mallet model for visualization.
#Converting the Mallet model to an object that pyLDAvis understands
converted_model = ldaMalletConvertToldaGen(ldamodel_gensim_mallet)
#Saving the converted model for use in PyLDAvis
converted_model.save('model.gensim')
#Printing the top 10 words in each of the 17 topics
topics = converted_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
#Loading the necessary objects for pyLDAvis
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('model.gensim')
#Plotting the LDA model results
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
Based on the graphic above, you can see that the 17 topics are fairly well spaced out with evenly distributed tokens. There is some overlap, but that is to be expected when certain sub-fields of COVID-19 research share similar terminology (e.g. cellular mechanisms and disease testing).
Unlike K-means clustering, LDA assumes that each document is made up of a distribution over all of the topics the model was trained with. K-means performs hard clustering, meaning each document gets exactly one label and no trace of any other.
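To make the contrast concrete, here is a minimal sketch comparing the two kinds of assignment for a single abstract (assuming the kmeans_model, converted_model, and corpus objects from above are in scope and row-aligned):
#Hard assignment: K-means gives the first abstract exactly one cluster label
doc_id = 0
print("K-means (hard) cluster label:", kmeans_model.labels_[doc_id])
#Soft assignment: LDA gives the same abstract a probability distribution over topics
print("LDA (soft) topic distribution:", converted_model.get_document_topics(corpus[doc_id], minimum_probability=0.05))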
top_topics = converted_model.top_topics(corpus)
avg_topic_coherence = sum([t[1] for t in top_topics]) / len(top_topics)
print('Average topic coherence: %.4f.' % avg_topic_coherence)
from pprint import pprint
pprint(top_topics)
Using the TF-IDF and cosine similarity search framework implemented below, we created an application that can parse through the repository of 75,000 abstracts provided in the data source for this notebook. Check it out at:
metadata.abstract2[1]
SE_test = metadata.copy()  #.sample(n=10000, random_state=0)
#SE_test = SE_test.reset_index()  #needed if sampling a subset above
SE_test.head()
#Converting each entry in the abstract2 column from its string representation to a list of tokens
SE_test.abstract2 = SE_test['abstract2'].apply(eval)
#Working copy for the search engine (abstract2 kept as token lists, which is what covid_search expects)
data = SE_test.copy()
n_results = 10
def covid_search(data, n_results=5):
    #Get user input
    query = input()
    #Tokenize query into words
    query = word_tokenize(query)
    #Lowercase the query the same way the data was lowercased
    lower_case_clean_query = []
    for term in query:
        if len(term) > 2 and sum(1 for c in term if c.isupper()) <= 1:
            lower_case_clean_query.append(term.lower())
        else:
            lower_case_clean_query.append(term)
    #Lemmatize the query
    lemmatizer = WordNetLemmatizer()
    clean_query = []
    for term in lower_case_clean_query:
        clean_query.append(lemmatizer.lemmatize(term))
    #Filter for abstracts that contain at least one of the query tokens
    query_as_set = set(clean_query)
    data_sub = data.loc[data.abstract2.apply(lambda x: len(query_as_set.intersection(x)) >= 1)]
    data_sub.reset_index(inplace=True, drop=True)
    #Rejoin the query tokens into a single string
    clean_query = ' '.join(clean_query)
    #Rejoin each abstract's tokens into a single string
    abstracts = data_sub.abstract2.tolist()
    abstracts = [' '.join(list_val) for list_val in abstracts]
    #Add the query to the list of documents (n+1 total documents)
    abstracts.append(clean_query)
    #Create the TF-IDF vectorizer object
    tv = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True, lowercase=False,
                         analyzer="word", token_pattern=r"(?u)\S\S+")
    #Run the TF-IDF fit-transform on all documents + query and convert to an array
    tv_matrix = tv.fit_transform(abstracts)
    tv_matrix = tv_matrix.toarray()
    #Calculate cosine similarity between each of the n documents and the query (last entry of the list)
    cos_sim_matrix = cosine_similarity(tv_matrix[:-1, ], tv_matrix[-1, ].reshape(1, -1))
    #Compile the results into a dataframe
    results = pd.DataFrame(cos_sim_matrix)
    #Sort by the cosine similarity column (labeled 0 by default)
    results = results.sort_values(0, ascending=False)
    #Merge in the original abstract, cord_uid, and title columns
    results = results.merge(data_sub.abstract, left_index=True, right_index=True, how='left')
    results = results.merge(data_sub.cord_uid, left_index=True, right_index=True, how='left')
    results = results.merge(data_sub.title, left_index=True, right_index=True, how='left')
    #Rename columns for better readability
    results.rename(columns={0: "cosine_similarity"}, inplace=True)
    #Optionally filter out results with cosine similarity of 0
    #results = results[results.cosine_similarity > 0]
    final_results = results.iloc[0:n_results]
    return final_results
#Run the search engine: when prompted, type a plain-English query (e.g. "coronavirus vaccine efficacy")
covid_search(data=data, n_results=10)
We have been able to boil the research presented in the CORD-19 Research Data Challenge down to 17 main research areas. These topics reflect the various facets of research that scientists are conducting on COVID-19, and many topics contain overlapping fields with commonalities.
From here, we will assign these topic labels back to the individual abstracts using the LDA soft-clustering results, as sketched below. We will create topical density plots to show the abstract distribution and how "soft" the boundaries are between topics in a more granular way. We will also implement some type of summarizing algorithm that takes in all abstracts attributed to a specific topic label and distills them into the most important sentences and phrases. Additionally, we will refine the search engine and figure out how to make it dynamic enough for general use.
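As a preview of that label-assignment step, a minimal sketch (using the converted_model and corpus objects from above; the final approach may differ) could take each abstract's highest-probability topic as its dominant label:
#Assign each abstract the topic with the highest probability in its LDA distribution
dominant_topics = []
for bow in corpus:
    topic_probs = converted_model.get_document_topics(bow)
    dominant_topics.append(max(topic_probs, key=lambda tp: tp[1])[0])
print(dominant_topics[:10])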
We found the following articles helpful for code and/or concepts:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd
https://radimrehurek.com/gensim/models/ldamulticore.html
https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html
https://towardsdatascience.com/using-mallet-lda-to-learn-why-players-hate-pok%C3%A9mon-sword-shield-23b12e4fc395
https://pythonprogramminglanguage.com/kmeans-text-clustering/
https://www.youtube.com/channel/UCgBncpylJ1kiVaPyP-PZauQ
https://www.datacamp.com/community/tutorials/introduction-t-sne