Covid-19 Research Paper Data Cleaning Pipeline

Iqra Munawar and Sameen Salam

Purpose

To take abstracts from scientific literature related to Covid-19 (and potentially any other academic field) and prepare them for topic modeling and clustering. This particular text cleaner is adapted to the realm of biology in that it seeks to preserve scientific terminology (e.g. the capitalization of "RNA" and the hyphenation of "SP-D", while eliminating standalone numbers, among other things) to make the resulting models as accurate, interpretable, and convenient as possible.

Solved Items

  • Drops observations with no abstract or title
  • Word tokenizes each abstract and eliminates boilerplate terms common in abstracts (BACKGROUND, CONCLUSIONS, etc.)
  • Filters out regular stopwords based on NLTK corpus of stopwords
  • Filters out all standard punctuation except for hyphens ("-")
  • Eliminates standalone numbers ("100" or "14"), tokens that contain no letters ("%3" or "35!"), and tokens that are length one ("C" or "f")
  • Filters out tokens that are two letters long and not fully capitalized ("sg")
  • Lowercases tokens that contain at most one capital letter and are 3 or more characters long ("Research" becomes "research", but "RNA" remains intact); these length and casing rules are illustrated in the toy sketch after this list
  • Removes specialty punctuation or characters that remain outside of the standard punctuation set (e.g. characters from non-Latin scripts)
  • Removes tokens with a hyphen at the end ("corona-" or "noro-")
  • Lemmatizes and finalizes every token in each abstract
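
Below is a minimal toy sketch (not part of the pipeline itself; the example tokens are made up) of how the length, number, and casing rules above behave:

In [ ]:
#Toy illustration of the token-level filtering and casing rules described above
tokens = ["RNA", "sg", "Research", "SP-D", "100", "%3", "C", "corona-"]

kept = []
for t in tokens:
    if len(t) <= 1:                          #drop single characters ("C")
        continue
    if t.isdecimal():                        #drop standalone numbers ("100")
        continue
    if not any(c.isalpha() for c in t):      #drop tokens with no letters ("%3")
        continue
    if len(t) == 2 and not t.isupper():      #drop two-letter lowercase fragments ("sg")
        continue
    if t.endswith("-"):                      #drop tokens ending in a hyphen ("corona-")
        continue
    if len(t) > 2 and sum(1 for c in t if c.isupper()) <= 1:
        t = t.lower()                        #"Research" -> "research", but "RNA" stays "RNA"
    kept.append(t)

print(kept)                                  #['RNA', 'research', 'SP-D']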

Pending Items

  • Cannot eliminate abstracts written in other languages that share the Latin alphabet with English (e.g. German or Spanish); see the sketch after this list for one possible workaround
  • Code cannot be run on the full dataset (~104K abstracts as of this version) due to computational constraints (Kaggle's session limit of ~9 hrs)
  • The spaCy lemmatizer does not always work in cases that should be unambiguous (e.g. reducing "detected" to "detect")
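
One possible way to address the first pending item would be a language-detection pass before cleaning. Below is a minimal sketch, assuming the langdetect package is available in the environment; it is not used anywhere else in this notebook.

In [ ]:
#Hypothetical language filter (not part of the current pipeline)
#langdetect.detect returns an ISO 639-1 code such as "en" or "de"
from langdetect import detect

def is_english(text):
    try:
        return detect(text) == "en"
    except Exception:   #very short or symbol-only abstracts can raise an error
        return False

#Example usage: keep only abstracts detected as English
#metadata = metadata[metadata["abstract"].apply(is_english)]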
In [2]:
#Loading in the necessary libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
import spacy
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#       print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Our data source is the CORD-19 Research Challenge dataset on Kaggle.

In [3]:
#Loading in the metadata (for the abstracts) and creating a copy so I don't have to constantly reload this data.
all_source_metadata = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
metadata = all_source_metadata.copy()
/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (13,14) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
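
The DtypeWarning above is harmless for our purposes. If desired, it can be avoided by disabling pandas' chunked type inference when reading the file, for example:

In [ ]:
#Optional alternative load that avoids the mixed-type warning (same file path as above)
#all_source_metadata = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv", low_memory=False)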
In [4]:
#Subsetting for observations with actual text in both the abstract and title columns
#Then resetting the index (dropping the old one) so row positions stay contiguous after filtering
metadata = metadata[metadata["abstract"].notnull()]
metadata = metadata.reset_index(drop = True)

metadata = metadata[metadata["title"].notnull()]
metadata = metadata.reset_index(drop = True)
In [5]:
#Get shape of metadata 
metadata.shape
Out[5]:
(104393, 19)
In [ ]:
#Getting a subset of the data for processing because of computational limitations
reasonable = metadata.sample(n=75000,random_state=0)

Below we define the variables that need to be specified only once for the function to work properly. We use the standard English stopwords plus some additional stopwords that appear very frequently in abstracts but contribute very little scientific value.

In [6]:
#Defining English Stopwords and other additional stopwords
stop_words=set(stopwords.words("english"))
special_stop_words = ["BACKGROUND", "METHODS", "CONCLUSION", "RESULTS", ":", "Abstract", "abstract", "ABSTRACT", "CONCLUSIONS",\
                          "SUPPLEMENTARY", "MATERIAL", "OBJECTIVE", "IMPORTANCE", "METHODOLOGY", "METHODOLOGYPRINCIPAL", "DESIGN",\
                          "Background", "PURPOSE", "MATERIALS", "INTRODUCTION", "ELECTRONIC"]
    
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])

#Defining the punctuation set for this project: all standard punctuation except the hyphen
special_punc = [x for x in string.punctuation if x != "-"]

#Defining a porter stemmer object
#ps = PorterStemmer()

Below is the function doc_proc that does the cleaning.

In [7]:
#Creating a function to clean all abstracts: doc_proc
def doc_proc(text):
    
    '''
    doc_proc is a function that takes in a text item (in this case, a scientific journal abstract) and conducts all of the
    individual pre-processing steps described above. It returns a list of cleaned, lemmatized tokens and can be applied
    over a dataframe column with pandas' apply.
    
    Arguments:
    text: the input text to be processed. Must be a continuous string that can be tokenized into words.
    
    '''
    
    ###Starting the text manipulation process###
    
    #Tokenizing into words
    tokenized_word=word_tokenize(text)

    #Filtering out special stopwords first and rejoining the remaining tokens into a single string
    special_filtered_word = " ".join([w for w in tokenized_word if w not in special_stop_words])

    #Word tokenizing the rejoined text
    text = word_tokenize(special_filtered_word)

    #Filtering out all standard English stopwords (case-insensitive)
    stopword_filtered = [w for w in text if w.lower() not in stop_words]
    
    
    #Stripping punctuation characters (except hyphens) from each token, then dropping any tokens left empty
    punc_filtered = [''.join(c for c in s if c not in special_punc) for s in stopword_filtered]
    punc_filtered = [s for s in punc_filtered if s]
    
    #Filtering out stand-alone numbers
    numeric_filtered = [term for term in punc_filtered if not term.isdecimal()]
    
    #Filtering out any tokens that contain no letters
    numeric_filtered2 = [term for term in numeric_filtered if re.search('[a-zA-Z]', term) is not None]
    
    #Filtering out any tokens left over that are length one (i.e. "C" or "f")
    single_stripped = [term for term in numeric_filtered2 if len(term) > 1]
    
    #Filtering out any leftover tokens that are two characters long and not fully upper case (i.e. "sg")
    lower_double_stripped = [term for term in single_stripped if len(term) != 2 or term.isupper()]
    
    #Lower casing all tokens that have at most one capital letter and are longer than 2 characters
    lower_cased_nonsci = []
    for term in lower_double_stripped:
        if len(term) > 2 and sum(1 for c in term if c.isupper()) <= 1:
            lower_cased_nonsci.append(term.lower())
        else:
            lower_cased_nonsci.append(term)
    
    #Removing any remaining single-character specialty tokens that are not ASCII letters
    lower_cased_nonsci = [term for term in lower_cased_nonsci
                          if not (len(term) == 1 and term.lower() not in string.ascii_lowercase)]
    
    #Removing all tokens that end in a hyphen
    suffix_hyphen_stripped = [term for term in lower_cased_nonsci if not term.endswith("-")]
    
    #STEMMING METHOD (alternative; requires a PorterStemmer object `ps`)################
    #final_words = []
    #for term in suffix_hyphen_stripped:
    #    final_words.append(ps.stem(term))
    ####################################################################################
    #LEMMATIZATION METHOD###############################################################
    #Parse each remaining token with the loaded 'en' model object `nlp` and keep its lemma(s)
    final_words = []
    for term in suffix_hyphen_stripped:
        doc = nlp(term)
        final_words.append([token.lemma_ for token in doc])
    
    #Recombining sublists containing more than one token
    recombined1 = []
    for term_list in final_words:
        if len(term_list) > 1:
            recombined1.append(["".join(term_list)])
        else:
            recombined1.append(term_list)
    
    #Recombining all sublists into a main list
    recombined2 = [item for sublist in recombined1 for item in sublist]
    #####################################################################################        
    return recombined2
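
The per-token nlp(term) calls in the lemmatization step dominate the runtime. One possible optimization, sketched below under the assumption that the same nlp object and token lists are used, is to batch the tokens through spaCy's nlp.pipe instead of building one Doc per token; this is not what the notebook currently runs.

In [ ]:
#Sketch: batched lemmatization with nlp.pipe (hypothetical alternative to the per-token loop in doc_proc)
def lemmatize_batched(terms):
    lemmatized = []
    #nlp.pipe streams many small texts through the pipeline in batches,
    #avoiding the overhead of a separate nlp() call for every token
    for doc in nlp.pipe(terms, batch_size=1000):
        #joining the lemmas reproduces the recombination step for hyphenated tokens
        lemmatized.append("".join(token.lemma_ for token in doc))
    return lemmatized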
In [ ]:
# Code to test run times for different versions of doc_proc

#import time
#start_time = time.time()
#test = doc_proc(metadata["abstract"][18])
#end_time = time.time()

#end_time-start_time
In [ ]:
#Applying doc_proc to the sampled dataset, storing the cleaned tokens in a new column "abstract2"
reasonable["abstract2"] = reasonable['abstract'].apply(doc_proc)

The following output shows a comparison between the original and cleaned forms of a fairly messy abstract in the data.

In [12]:
#Checking the results
print(metadata["abstract"][0])
print('----------------------------------------------')
print(doc_proc(metadata["abstract"][0]))
OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and malaise (58.8%) were the most common symptoms, and crepitations (60%), and wheezes (40%) were the most common signs. Most patients with pneumonia had crepitations (79.2%) but only 25% had bronchial breathing. Immunocompromised patients were more likely than non-immunocompromised patients to present with pneumonia (8/9 versus 16/31, P = 0.05). Of the 24 patients with pneumonia, 14 (58.3%) had uneventful recovery, 4 (16.7%) recovered following some complications, 3 (12.5%) died because of M pneumoniae infection, and 3 (12.5%) died due to underlying comorbidities. The 3 patients who died of M pneumoniae pneumonia had other comorbidities. CONCLUSION: our results were similar to published data except for the finding that infections were more common in infants and preschool children and that the mortality rate of pneumonia in patients with comorbidities was high.
----------------------------------------------
['retrospective', 'chart', 'review', 'describes', 'epidemiology', 'clinical', 'features', 'patient', 'culture-prove', 'mycoplasma', 'pneumoniae', 'infection', 'king', 'abdulaziz', 'university', 'hospital', 'jeddah', 'saudi', 'arabia', 'patient', 'positive', 'pneumoniae', 'culture', 'respiratory', 'specimen', 'january', 'december', 'identify', 'microbiology', 'record', 'charts', 'patient', 'review', 'patient', 'identify', 'require', 'admission', 'infection', 'community-acquire', 'infection', 'affected', 'age', 'group', 'common', 'infant', 'pre-school', 'child', 'occur', 'year-round', 'common', 'fall', 'spring', 'three-quarter', 'patient', 'comorbiditie', 'twenty-four', 'isolate', 'associated', 'pneumonia', 'upper', 'respiratory', 'tract', 'infection', 'bronchiolitis', 'cough', 'fever', 'malaise', 'common', 'symptom', 'crepitation', 'wheeze', 'common', 'sign', 'patient', 'pneumonia', 'crepitation', 'bronchial', 'breathe', 'immunocompromise', 'patient', 'likely', 'non-immunocompromised', 'patient', 'present', 'pneumonia', 'versus', 'patient', 'pneumonia', 'uneventful', 'recovery', 'recover', 'follow', 'complication', 'die', 'pneumoniae', 'infection', 'die', 'due', 'underlie', 'comorbiditie', 'patient', 'die', 'pneumoniae', 'pneumonia', 'comorbiditie', 'result', 'similar', 'publish', 'datum', 'except', 'finding', 'infection', 'common', 'infant', 'preschool', 'child', 'mortality', 'rate', 'pneumonia', 'patient', 'comorbiditie', 'high']
In [ ]:
#Exporting the data for use in other kernels for modeling
reasonable.to_csv("abstract_cleaned.csv", index = False)