Data Science Portfolio

StockX 2019 Data Challenge

I combined my love for sneakers with my passion for data by analyzing StockX consumer purchases to uncover the underlying patterns. Using the dataset StockX provided for its 2019 Data Challenge, I conducted an end-to-end analysis comparing two popular sneaker brands: Yeezys and Off-Whites. I found that Off-Whites carry more hype than Yeezys, and that it was possible to predict a "hype index": how much over retail price consumers would pay for a pair of sneakers. Feel free to take a look at the full project Jupyter notebook, the exploratory data analysis notebook, or the modeling notebook. I also created an accompanying dashboard and blog post for the project.
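A minimal sketch of one way to express such a hype index, assuming it is the resale premium as a fraction of retail price. The function name, the exact formula, and the sample prices are illustrative assumptions, not taken from the project notebooks.

```python
def hype_index(sale_price, retail_price):
    """Return the resale premium as a fraction of retail price.

    Assumed definition for illustration: (sale - retail) / retail.
    """
    if retail_price <= 0:
        raise ValueError("retail_price must be positive")
    return (sale_price - retail_price) / retail_price


# Illustrative prices only: a pair retailing at $200 that resells for $500.
print(hype_index(500, 200))  # 1.5
```

Under this definition, a value of 0 means the pair sold at retail, and a value of 1.5 means buyers paid 150% over retail.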

Predicting Client Subscription

I took on an age-old business problem: predicting client subscriptions. I used the publicly available telemarketing subscription dataset from the UCI Machine Learning Repository, which is based on real data from a Portuguese banking institution. This project was a great way to try out a variety of models while making informed business assumptions. Although all of the models ended up overfitting the training data, the project was a useful exercise in stepping back and evaluating the many ways the models could be improved in future iterations. Take a look at the notebook.

Topic Modeling of COVID-19 Research Abstracts

COVID-19 has altered life as we know it. As scientists all across the globe continue to study the novel coronavirus, fellow MSA '20 graduate Iqra Munawar and I decided to do our part to help resolve this crisis. Using the CORD-19 Research Data Challenge dataset hosted on Kaggle, we applied NLP techniques and popular language models to distill the immense volume of research publications down to several major themes. Taking things a step further, we built a search engine and random insights generator application. This tool aims to help labs locate relevant research without having to read every single publication. Through this project, I strengthened my skills in NLP pre-processing, topic modeling, unsupervised learning, and software engineering. It was also an excellent opportunity to leverage my biology background. Check out the text pre-processing notebook, the modeling notebook, and the application source code. Unfortunately, the application is no longer live because of the cost of running a sufficiently powerful instance on AWS.
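The write-up above does not say how the search engine ranks abstracts. As a rough illustration of the general idea only, here is a minimal sketch that assumes TF-IDF weighting with cosine similarity, using just the Python standard library. All function names and the sample abstracts are made-up placeholders, not the project's code or data.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit (crude on purpose).
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(docs):
    # Compute smoothed IDF weights and one TF-IDF vector per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, top_k=3):
    # Return (document index, score) pairs for the best matches, highest score first.
    idf, vectors = build_index(docs)
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_counts.items()}
    scored = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(i, score) for score, i in scored[:top_k] if score > 0]


# Placeholder abstracts for illustration only.
abstracts = [
    "Transmission dynamics of the virus in households",
    "Vaccine candidates and early clinical trial results",
    "Mental health effects of lockdown",
]
results = search("vaccine trial", abstracts)
```

A production system would more likely rely on an established library for tokenization and vectorization; the hand-rolled version here only keeps the example self-contained.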