Sameen Salam- M.S. Candidate, Institute for Advanced Analytics at North Carolina State University
Mail: ssalam@ncsu.edu
In this notebook, we will explore sneaker transaction data from the online marketplace StockX. StockX put out this dataset as part of its 2019 Data Challenge. The data consists of a random sample of all Off-White and Yeezy 350 sales made between 9/1/2017 (the month that Off-White debuted “The Ten” collection) and 2/15/2019.
Through the modeling process, I found that Off-Whites are more hype than Yeezys. Keeping in mind StockX's business model, I recommend that they advertise Off-White shoe listings more heavily than those for Yeezys, especially for buyers with high purchase volume at high price points.
Here is an overview of how this notebook is structured:
The Sneaker Game - brief explanation on some basic aspects of the sneaker market
Exploratory Data Analysis - basic familiarity with the data
Feature Engineering - conceptualization and creation of additional variables with modeling in mind
Modeling - mapping out the relationships between features and the selected target
Conclusions - summarizing modeling results and paths the project could take moving forward
Sneaker culture has become ubiquitous in recent years. All around the world, millions of sneakerheads (myself included) go online to cop the newest releases and rarest classics. But it's not always as simple as clicking the "Add to Cart" button and making the purchase. Some sneakers carry incredibly high demand and launch in very limited supply. Only a few dedicated and/or lucky people will succeed in purchasing these shoes from a store or website. Some may choose to wear their shoes and keep them in their collection for years to come. But many will choose to resell deadstock (brand new) shoes at a profit in order to purchase even rarer ones.
This is where StockX, GOAT, FlightClub, or any other online sneaker marketplace comes in. Resellers need to connect with individuals who want the shoes they have up for sale, and these marketplaces offer a platform that puts resellers in direct contact with potential buyers. StockX in particular prides itself on being the stock market analog of the sneaker world. Resellers can list sneakers for sale at whatever price they see fit, and buyers can make whatever bids or offers they would like on any sneaker. StockX's role in the transaction is to verify that resellers are selling authentic sneakers, protecting buyers from receiving fake or damaged pairs.
Yeezys and Off-Whites are popular examples of coveted shoes that sneakerheads buy off of one another. The Yeezy line is a collaboration between Adidas and musical artist Kanye West. There are several other sneakers that fall under the Yeezy brand, but this dataset only covers Yeezy Boost 350 models. The Off-White line is a collaboration between Nike and luxury designer Virgil Abloh. As with the Yeezys, this dataset focuses on a subset of Off-White sneaker models known as "The Ten", a set of ten different shoes released by Nike over a period of several months. The sneakers that carry these brand labels represent some of the most sought after kicks in the world, selling out in stores and online almost instantly upon release.
After conducting some research on StockX's website, I found that StockX's revenue stream comes primarily from a 3% payment processing fee and a regressive transaction fee (i.e. the more a reseller sells on StockX, the lower their fee per item is). From a revenue standpoint, it is in StockX's best interest to foster sales of shoes with higher sale prices. If reasonably accurate predictions can be made on resale prices as they relate to retail prices, then StockX can make informed decisions about promoting certain sneaker listings.
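To make the revenue logic concrete, here is a minimal sketch of StockX's take on a single sale. The 3% processing fee comes from the research described above; the 9.5% transaction fee below is a placeholder for illustration, not StockX's actual published schedule (which decreases with a reseller's volume).
#Illustrative only: rough StockX revenue on one sale under an assumed fee schedule.
#The 9.5% transaction fee is a placeholder, not StockX's actual rate.
def estimated_stockx_revenue(sale_price, transaction_fee_rate=0.095, processing_fee_rate=0.03):
    return sale_price * (transaction_fee_rate + processing_fee_rate)

estimated_stockx_revenue(446.63)   #roughly 55.8 dollars on an average-priced sale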
Importing needed libraries and reading in the spreadsheet directly from StockX's 2019 Data Challenge page. Note that this code will download the dataset in the form of an .xlsx spreadsheet in the directory from which this notebook is being run.
#Importing the necessary libraries
import pandas as pd
from statistics import mean, median, mode, stdev
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import matplotlib.patches as mpatch
import matplotlib.dates as mdate
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from regressors import stats as stats_reg
from sklearn.ensemble import RandomForestRegressor
import requests
#Reading in the data directly from the StockX Data Challenge 2019 webpage
url = "https://s3.amazonaws.com/stockx-sneaker-analysis/wp-content/uploads/2019/02/StockX-Data-Contest-2019-3.xlsx"
my_file = requests.get(url)
with open('StockX-Data-Contest-2019-3.xlsx', 'wb') as output:
    output.write(my_file.content)
#Save the contents of the excel file into variable: stockx_data
stockx_data = pd.read_excel("StockX-Data-Contest-2019-3.xlsx", sheet_name = 1)
Here we are looking at some basic characteristics and quality of the dataset. Data cleaning was not a significant hurdle in this project because the dataset provided by StockX was very clean.
#Taking a quick look at how the data is structured
stockx_data.head(n=5)
Below, we confirm that the dataset has the correct dimensions: 99,956 rows and 8 columns
#Getting the dimensions of the data
stockx_data.shape
From the output below, we can see that this data contains no missing values.
#Checking for missing values
stockx_data.isnull().sum()
We can see that the datatypes of the columns are congruent with the information they provide.
#Checking the data types of each column
stockx_data.dtypes
In the Brand variable, Yeezys were indicated by " Yeezy", making analysis more difficult. In this code, I strip all leading and trailing white space in Brand to make the data just a bit cleaner.
#Stripping all leading and trailing white space in the Brand column
stockx_data["Brand"] = stockx_data["Brand"].apply(lambda x: x.strip())
99,956 rows and 8 columns
No missing values
Order Date: datetime, Date the transaction occurred
Brand: string, Brand of the sneaker
Sneaker Name: string, Name of the sneaker
Sale Price: numpy float, Price paid for sneaker in transaction (resale price)
Retail Price: numpy int, Retail price of sneaker in stores (release price)
Release Date: datetime, Date the sneaker dropped
Shoe Size: numpy float, Shoe size (most likely US men's sizing)
Buyer Region: string, State where the transaction originated (most likely the buyer's state)
Here we explore the variables and get a better feel for the dataset. Additionally, looking at the current variables will help us conceptualize additional features and a potential target for modeling.
First, let's take a look at Sale Price. The mean is 446.63 dollars, the median is 370.00 dollars, and the standard deviation is 255.98 dollars.
#Looking at mean and median of Sale Price: basicstats_saleprice
basicstats_saleprice = {}
basicstats_saleprice["Mean"] = mean(stockx_data["Sale Price"])
basicstats_saleprice["Median"] = median(stockx_data["Sale Price"])
basicstats_saleprice["Standard Deviation"] = stdev(stockx_data["Sale Price"])
basicstats_saleprice
Looking at the histogram below, it is clear that Sale Price is quite right skewed.
#Plotting the distribution of Sale Price
plt.hist(x=stockx_data["Sale Price"], bins=100);
plt.xlabel("Sale Price")
plt.ylabel("Count")
plt.title("Histogram of Sale Price");
This box and whisker plot demonstrates that Sale Price has quite a few high outliers. Outliers in this context are defined as points more than 1.5 times the inter-quartile range above the third quartile or below the first quartile.
#Box and whisker plot of Sale Price
sns.boxplot(x = "Sale Price", data= stockx_data);
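To put a number on those outliers, here is a quick check (an addition for reference, not part of the original workflow) that counts the transactions above the upper fence.
#Counting Sale Price observations beyond the upper fence (Q3 + 1.5*IQR)
q1, q3 = stockx_data["Sale Price"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
(stockx_data["Sale Price"] > upper_fence).sum()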
Next, let's take a look at Retail Price. The mean is 208.61 dollars, the median is 220.00 dollars, and the standard deviation is 25.20 dollars. The lack of variability (exhibited by the small standard deviation) in Retail Price is due to manufacturers making all sneakers of a certain type cost the same regardless of colorway. For example, all Yeezys in the data retail for 220 dollars, regardless of how much they resell for on StockX.
#Looking at mean and median of Retail Price: basicstats_retailprice
basicstats_retailprice = {}
basicstats_retailprice["Mean"] = mean(stockx_data["Retail Price"])
basicstats_retailprice["Median"] = median(stockx_data["Retail Price"])
basicstats_retailprice["Standard Deviation"] = stdev(stockx_data["Retail Price"])
basicstats_retailprice
Now let's take a look at Shoe Size. Though not explicitly stated, these are most likely mens shoe sizes. Unlike Sale Price and Retail Price, Shoe Size is discrete. The mean is 9.34, the median is 9.5, and the mode is 10.
#Looking at mean and median of Shoe Size: basicstats_shoesize
basicstats_shoesize = {}
basicstats_shoesize["Mean"] = mean(stockx_data["Shoe Size"])
basicstats_shoesize["Median"] = median(stockx_data["Shoe Size"])
basicstats_shoesize["Mode"] = mode(stockx_data["Shoe Size"])
basicstats_shoesize
Below is a histogram of Shoe Size. The distribution looks sensible, as the middle sizes (8-12) have the highest frequencies.
#Plotting the distribution of shoe size.
plt.hist(x=stockx_data["Shoe Size"], bins=20);
plt.xlabel("Shoe Size")
plt.ylabel("Count")
plt.title("Histogram of Shoe Size");
This box and whisker plot shows that there are not a lot of outliers in Shoe Size.
#Box and whisker plot of Shoe Size.
sns.boxplot(x = "Shoe Size", data= stockx_data);
Now let's take a closer look at the levels and frequencies for the categorical variables. Brand contains two levels: "Yeezy" and "Off-White". 72,162 of the values for Brand are "Yeezy", while the remaining 27,794 are "Off-White", confirming that StockX's data is aligned with their website's description of it.
#Looking at Brand levels and value counts
stockx_data["Brand"].value_counts().to_frame()
Next up is Sneaker Name. There are many levels to this variable, but the five most frequently ordered sneakers are all Yeezy Boost 350 V2s in the Butter, Beluga 2.0, Zebra, Blue-Tint and Cream-White colorways, in that order. The remaining shoes and frequencies can be found from the output of the code below.
#looking at Sneaker Name levels and counts
stockx_data["Sneaker Name"].value_counts().to_frame().head(n=5)
Last but not least, here is Buyer Region. There are 50 levels, one per state. California and New York are overwhelmingly the most common values in Buyer Region. The remaining states and frequencies can be found in the output below.
#Looking at Buyer Region levels and counts: California and New York make up a significant portion of total purchases,
#with New York more than double Oregon sitting in 3rd place
stockx_data["Buyer Region"].value_counts().to_frame().head(n=5)
Now that we have taken a brief look at all non-time variables present in the dataset, we can now examine relationships between variables.
The graph below plots mean Retail Price and mean Sale Price by Shoe Size. Here, it's easy to see that Sale Price and Retail Price maintain a consistent relationship between sizes 4 and 14.5, but Sale Price climbs at sizes 15 and above. This is most likely occurring due to the relatively low number of observations at these extreme sizes.
#Graphical representation of Mean Sale Price and Mean Retail Price by Shoe Size
fig = plt.plot(stockx_data.groupby('Shoe Size')['Sale Price'].mean())
plt.plot(stockx_data.groupby('Shoe Size')['Retail Price'].mean())
plt.xlabel("Shoe Size")
plt.ylabel("Mean Price")
plt.legend(["Mean Sale Price", "Mean Retail Price"])
plt.title("Mean Sale and Retail Price by Shoe Size")
plt.xticks(np.arange(min(stockx_data["Shoe Size"]), max(stockx_data["Shoe Size"])+1,0.5),rotation = 60);
The graph below plots Mean and Median Sale Price over time. In general, Sale Price seems to be decreasing over the time span of the data. There are some steep increases, usually marked by the releases of individual sneakers with high demand. For example, there is a massive spike in Sale Price around June 2018. This corresponds with the pre-release period of the Off-White Air Jordan 1 in the UNC colorway. Some individual pairs sold for nearly 4,000 dollars, driving up the Sale Price at that time. The steep drops are most likely due to a very hype sneaker selling out soon after release on StockX. If no one has a pair of a very sought after shoe, then no transactions can take place involving said shoe. This results in a drop in mean Sale Price for the dataset as a whole.
#Overlaid plot of Mean vs Median Sale Price over Order Date
fig = plt.plot(stockx_data.groupby('Order Date')['Sale Price'].mean(), color = '#ABCDEF', alpha = 1)
plt.plot(stockx_data.groupby('Order Date')['Sale Price'].median(), color = '#FEDCBA', alpha = 1)
plt.xlabel("Order Date")
plt.ylabel("Sale Price")
plt.title("Mean and Median Sale Price over Time (Order Date)")
blue_patch, orange_patch = mpatch.Patch(color = '#ABCDEF', label = "Mean Sale Price"), \
mpatch.Patch(color = '#FEDCBA', label = "Median Sale Price")
plt.legend(handles = [blue_patch, orange_patch])
plt.tick_params(axis='x',labelbottom = True)
plt.xticks(rotation = 45)
plt.rcParams["grid.alpha"] = 0.1
plt.grid(True)
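As a quick check on that spike, the snippet below (added for verification) locates the order date with the highest mean Sale Price; it should fall near the June 2018 period discussed above.
#Finding the order date with the highest mean Sale Price
daily_mean = stockx_data.groupby('Order Date')['Sale Price'].mean()
daily_mean.idxmax(), daily_mean.max()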
Looking at the boxplot below, it is clear that Off-Whites have a higher median and greater spread with regards to Sale Price. Both brands of shoes have a lot of high outliers, but Off-White outliers extend to higher dollar amounts.
#Boxplot of Brand vs Sale Price
sns.boxplot(x = "Brand", y = "Sale Price", data=stockx_data);
The overlaid histogram below shows that a greater volume of Yeezys sell at lower prices, while Off-Whites sell in lower volume at higher prices.
#Overlaid histogram of Sale Price broken down by Brand
bins = 100
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Yeezy","Sale Price"], bins, alpha=0.5, label='Yeezy',)
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Off-White","Sale Price"], bins, alpha=0.5, label='Off-White')
plt.legend(loc='upper right')
plt.xlabel("Sale Price")
plt.ylabel("Count")
plt.title("Histogram of Sale Price By Brand")
plt.show()
The table below shows average Sale Price by state (first few rows shown). Across all states, Delaware has the highest average Sale Price, while Wyoming has the lowest.
stockx_data.groupby("Buyer Region", as_index=False)["Sale Price"].mean().head(n=5)
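The head() call above only shows the first few states alphabetically; sorting the same table (a small addition for verification) surfaces the highest and lowest averages.
#Sorting mean Sale Price by state to see the extremes
state_means = stockx_data.groupby("Buyer Region")["Sale Price"].mean().sort_values(ascending=False)
state_means.head(n=3), state_means.tail(n=3)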
The code output below shows the proportion of transactions attributed to each level of Buyer Region, or state.
#Getting each state's proportion of transactions
stockx_data["Buyer Region"].value_counts().to_frame().head(n=5)/stockx_data.shape[0]
Sometimes the variables in your dataset do not capture the complete story. Feature engineering, or creating new variables based on ones you do have, can allow for clearer or more interesting insights and pave the way for useful models.
With potential modeling applications in mind, here is a breakdown of some additional features I want to create:
date_diff: Elapsed time between Order Date and Release Date. This allows cross comparisons of different sneakers over the same point in time after release.
price_ratio: Ratio of Sale Price to Retail Price. This standardizes each shoe relative to its retail price and acts as a better indicator of the hype of any given sneaker.
jordan: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Jordan or not.
V2: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a V2 model or not.
blackcol: A binary indicator variable derived from Sneaker Name that indicates if the shoe has black in the colorway or not.
airmax90: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an Airmax 90 or not.
airmax97: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an Airmax 97 or not.
zoom: A binary indicator variable derived from Sneaker Name that indicates if the shoe has Zoom in the name or not.
presto: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Presto or not.
airforce: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an AirForce or not.
blazer: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Blazer or not.
vapormax: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a VaporMax or not.
california: A binary indicator variable derived from Buyer Region that indicates if the buyer is from California or not.
new_york: A binary indicator variable derived from Buyer Region that indicates if the buyer is from New York or not.
oregon: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Oregon or not.
florida: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Florida or not.
texas: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Texas or not.
other_state: A binary indicator variable derived from Buyer Region that indicates if the buyer is from any other state. The states in this category had <5% of transactions in the data.
brand2: A binary indicator variable derived from Brand for modeling. Yeezy = 1, Off-White = 0.
#Taking the difference between Order Date and Release Date to create a new column: date_diff
stockx_data["date_diff"] = stockx_data['Order Date'].sub(stockx_data['Release Date'], axis=0)/np.timedelta64('1','D')
#Creating a new column containing the ratio of Sale Price and Retail Price: price_ratio
stockx_data["price_ratio"] = stockx_data["Sale Price"]/stockx_data["Retail Price"]
#Creating the jordan variable
stockx_data["jordan"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Jordan' in x.split("-") else 0)
#Creating the V2 variable
stockx_data["V2"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'V2' in x.split("-") else 0)
#Creating the blackcol variable
stockx_data["blackcol"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Black' in x.split("-") else 0)
#Creating the airmax90 variable
stockx_data["airmax90"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if '90' in x.split("-") else 0)
#Creating the airmax97 variable
stockx_data["airmax97"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if '97' in x.split("-") else 0)
#Creating the zoom variable
stockx_data["zoom"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Zoom' in x.split("-") else 0)
#Creating the presto variable
stockx_data["presto"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Presto' in x.split("-") else 0)
#Creating the airforce variable
stockx_data["airforce"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Force' in x.split("-") else 0)
#Creating the blazer variable
stockx_data["blazer"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Blazer' in x.split("-") else 0)
#Creating the vapormax variable
stockx_data["vapormax"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'VaporMax' in x.split("-") else 0)
#Creating the california variable
stockx_data["california"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'California' in x else 0)
#Creating the new_york variable
stockx_data["new_york"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'New York' in x else 0)
#Creating the oregon variable
stockx_data["oregon"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Oregon' in x else 0)
#Creating the florida variable
stockx_data["florida"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Florida' in x else 0)
#Creating the texas variable
stockx_data["texas"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Texas' in x else 0)
#Creating the other_state variable
above5pct_states = ["California", "New York", "Oregon", "Florida", "Texas"]
stockx_data["other_state"] = (~stockx_data["Buyer Region"].isin(above5pct_states)).astype(int)
#Creating the brand2 variable
stockx_data["brand2"] = stockx_data["Brand"].apply(lambda x : 1 if 'Yeezy' in x else 0)
Let's take a look at the new version of stockx_data with the additional features that we have created.
stockx_data.head(n = 3)
#Run this code if you want to save the altered stockx_data as full_features.csv. I did this to generate the accompanying
#Tableau dashboard.
#pd.DataFrame.to_csv(stockx_data, "C:/.........../full_features.csv")
The first new feature (and most interesting) is price_ratio. The mean is 2.25, the median is 1.70, and the standard deviation is 1.5. For clarity, the transactions in this data have a Sale Price that is 2.25 times the Retail Price on average. I found that price_ratio was the best way to adjust for differing Retail Price between different sneaker models. Note that price_ratio cannot be less than or equal to 0 because no one gets free or negatively priced shoes on StockX.
#Looking at mean and median of price_ratio
basicstats_priceratio = {}
basicstats_priceratio["Mean"] = mean(stockx_data["price_ratio"])
basicstats_priceratio["Median"] = median(stockx_data["price_ratio"])
basicstats_priceratio["Standard Deviation"] = stdev(stockx_data["price_ratio"])
basicstats_priceratio
The following is a box and whisker plot of price_ratio. Not surprisingly, it looks quite similar to the same plot of Sale Price, but the outliers have become compressed into a smaller range due to the transformation.
#Boxplot of price_ratio.
sns.boxplot(x=stockx_data["price_ratio"]);
Outliers sometimes pose problems for modeling. After examining the outliers in price_ratio, I found that almost all of them were Off-White Air Jordan 1s. Since the outliers follow a common pattern, they should not be tampered with.
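Here is a sketch of that check, using the same 1.5-times-IQR rule as before to flag the price_ratio outliers and tally the sneaker models among them.
#Tallying the sneaker models among the price_ratio outliers (upper fence = Q3 + 1.5*IQR)
q1, q3 = stockx_data["price_ratio"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = stockx_data.loc[stockx_data["price_ratio"] > upper_fence]
outliers["Sneaker Name"].value_counts().head(n=5)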
The histogram below plots the distribution of price_ratio separated by Brand. Here it is very easy to see how the distribution of price_ratio differs between Yeezys and Off-Whites.
#Overlaid histogram of price_ratio broken down by Brand.
bins = 100
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Yeezy","price_ratio"], bins, alpha=0.5, label='Yeezy')
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Off-White","price_ratio"], bins, alpha=0.5, label='Off-White')
plt.legend(loc='upper right')
plt.xlabel("price_ratio")
plt.ylabel("Count")
plt.title("Histogram of Price Ratio By Brand")
plt.show()
The output below is a table of average price_ratio by Buyer Region. They generally range from 1.5 to 2.5, meaning that per state, the average sneaker Sale Price is 1.5 to 2.5 times its Retail Price.
stockx_data.groupby("Buyer Region", as_index=False)["price_ratio"].mean().head(n=5)
Now that we have thoroughly explored price_ratio, it is time to take a look at date_diff, the number of days between a sneaker's Release Date and the transaction's Order Date. The mean is 183.71 days, the median is 56 days, the standard deviation is 232.35 days, and the range is -69 days to 1321 days. A negative date_diff value indicates the shoe was purchased prior to its Release Date in a pre-order. It is also important to note that several sneakers have release dates that predate the start of data collection (9/1/2017), leaving a gap in the data for those sneakers between their release and that date.
#Basic statistics of the difference between Order Date and Release Date?
basicstats_datediff = {}
basicstats_datediff["Mean"] = mean(stockx_data["date_diff"])
basicstats_datediff["Median"] = median(stockx_data["date_diff"])
basicstats_datediff["Standard Deviation"] = stdev(stockx_data["date_diff"])
basicstats_datediff["Range"] = [min(stockx_data["date_diff"]), max(stockx_data["date_diff"])]
basicstats_datediff
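A quick count (added here for reference) shows how many of those transactions were pre-orders, i.e. had a negative date_diff.
#Counting transactions placed before the sneaker's Release Date (pre-orders)
(stockx_data["date_diff"] < 0).sum()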
This overlaid histogram of date_diff separated by Brand shows that Yeezys have the larger variability in time between Release Date and Order Date, especially at the higher extremes. This makes sense, because data collection started at roughly the same time that "The Ten" (Off-White shoes) were being released. Many Yeezys, however, were released before data collection began, especially the original Yeezy Boost 350s. This leads to the long tail in the date_diff distribution for Yeezys.
#Overlaid histograms of the elapsed time between Order and Release dates broken down by brand.
bins = 100
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Yeezy","date_diff"], bins, alpha=0.5, label='Yeezy',)
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Off-White","date_diff"], bins, alpha=0.5, label='Off-White')
plt.legend(loc='upper right')
plt.xlabel("Elapsed Time (Days)")
plt.ylabel("Count")
plt.title("Histogram of Elapsed Time By Brand")
plt.show()
The following density-scatter plot shows price_ratio as a function of date_diff. Here it is evident that Off-Whites exhibit higher price_ratio relative to the Yeezys at the same point in time after Release Date. The reference line is drawn at price_ratio = 1 for scale.
#Creating a density-esque scatter plot of price_ratio vs date_diff separated by Brand
fig = sns.scatterplot(x= "date_diff", y= "price_ratio", data=stockx_data, hue= "Brand",s=8, alpha=0.03);
fig.hlines(1, -70,1300);
Based on our exploration of the existing and engineered features, further understanding and modeling price_ratio may be in the best interest of StockX. Using price_ratio balances the needs of the company and consumers better than using Sale Price alone. For StockX, price_ratio predictions could easily be converted into revenue by multiplying the predicted ratios by each shoe's Retail Price and then calculating fees. For buyers on the platform, using price_ratio allows their voice to be heard in the selective promoting process. Buyers have to accept the resale prices, and price_ratio better captures how much consumers want a specific shoe than Sale Price alone. Because it balances StockX's goal of maximizing revenue with sneakerhead demand, I believe that price_ratio is the best target variable for modeling.
In this section, we are going to do some predictive modeling on price_ratio. After doing some data processing, we will attempt some multiple linear regression models to predict price_ratio. After evaluating these linear models for performance and validity, we will move on to random forest regression models and compare their performances. All regression models in this notebook will be compared using mean absolute percent error, or MAPE. This metric can be interpreted as follows: "On average, the model outputs predictions that are XX.X% off from the true value."
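For concreteness, the helper below spells out the MAPE formula as it is applied to the linear models later in this notebook; the cells below compute it inline rather than calling this function, and note that the random forest cells divide by the predictions rather than the true values.
#Reference implementation of MAPE as used for the linear models (not called directly below)
def mape(y_true, y_pred):
    return np.mean(100 * np.abs((y_true - y_pred) / y_true))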
Since date_diff is a direct linear combination of Order Date and Release Date, we will drop these variables before splitting the dataset.
#Get rid of the date columns
stockx_data = stockx_data.drop(['Order Date','Release Date'], axis=1)
Next, we get rid of Brand, since brand2 is a numerical version of the same information. We also get rid of Sneaker Name because we have extracted most of the information from that column in the form of several binary indicator variables.
#Get rid of the original Brand and Sneaker Name columns
stockx_data = stockx_data.drop(['Brand','Sneaker Name'], axis=1)
Just as we dropped the date columns that produce date_diff, we drop Sale Price and Retail Price because our target variable, price_ratio, is derived directly from them; keeping them would leak the target.
#Getting rid of the Sale Price and Retail Price
stockx_data = stockx_data.drop(['Sale Price', 'Retail Price'], axis = 1)
Now we get rid of Buyer Region because we have several binary indicator variables in its place.
#Getting rid of Buyer Region
stockx_data = stockx_data.drop(['Buyer Region'], axis = 1)
Let's take a look at our model-ready dataset!
stockx_data.head(n=5)
Now we're ready to split the dataset into training and testing data.
#Create objects x and y to separate predictors from the target.
x = stockx_data.drop("price_ratio", axis=1)
y = stockx_data["price_ratio"]
#Run train_test_split to create 4 different data objects with train/test preds ana train/test targets
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.20)
The first model we will look at is multiple linear regression. One of the assumptions of this model is that the predictors are not strongly correlated with one another (multicollinearity). To check this, we look at the Pearson correlation matrix for the dataset. Here we remove brand2 because it is highly correlated with V2. Additionally, we remove other_state because it is directly explained by the california, new_york, oregon, florida, and texas variables.
#Looking at the correlation values for each predictor relative to each other and the target
stockx_data.corr(method='pearson')
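As an optional complement to the correlation matrix (an addition, not part of the original workflow), variance inflation factors from statsmodels give a per-predictor view of multicollinearity; other_state in particular should show an extreme VIF since it is fully determined by the five state indicators.
#Optional check: variance inflation factors for each predictor (values well above ~5-10
#are a common rule of thumb for problematic multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = x.assign(const=1)   #add an intercept column so the VIFs are interpretable
pd.Series([variance_inflation_factor(vif_data.values, i) for i in range(vif_data.shape[1])],
          index=vif_data.columns)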
#Creating copies of the train/test datasets without brand2 or other_state
x_train_m1 = x_train.copy()
x_train_m1 = x_train_m1.drop(["brand2","other_state"], axis=1)
x_test_m1 = x_test.copy()
x_test_m1 = x_test_m1.drop(["brand2","other_state"], axis=1)
Now we create and fit our linear model object. This first model is predicting raw price_ratio values from both the Yeezy and Off-White data in the same model.
#Creating a linear regression object: lm1
lm1 = linear_model.LinearRegression()
#Fitting the model with the first model's predictors (x_train_m1) and the train response
m1 = lm1.fit(x_train_m1, y_train)
Now we get the parameter estimates, p-values, R-squared, and adjusted R-squared for this model.
#Get the summary table with p-values for each predictor.
print("\n=========== SUMMARY ===========")
xlabels = x_train_m1.columns
stats_reg.summary(lm1, x_train_m1, y_train, xlabels)
Now we predict using the testing predictor data and calculate the residuals. Based on the following QQ-plot for these residuals, this model breaks the assumption of normally distributed residuals. So while the parameter estimates are still valid, this model should not be used to predict on new data.
#Getting test predictions
test_preds_m1 = lm1.predict(x_test_m1)
#Calculating residuals
residuals_m1 = test_preds_m1 - y_test
#Making QQ-plot of residuals
sm.qqplot(residuals_m1,line="s");
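To accompany the QQ-plot, a quick numerical check with scipy (already imported above) tests the residuals for normality; a near-zero p-value agrees with the visual evidence.
#D'Agostino-Pearson normality test on the residuals; a tiny p-value indicates non-normality
stats.normaltest(residuals_m1)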
Now we calculate the mean absolute percent error of the model (MAPE) for comparison to other models.
#Calculate the MAPE
lm1_mape = np.mean(100 * abs((residuals_m1/y_test)))
lm1_mape
Based on the residual QQ-plot from the previous model, the assumptions of linear regression were not met. This issue could be happening because price_ratio is a bounded quantity. As I mentioned earlier, it is impossible to have a price_ratio less than or equal to 0. Additionally, it is extremely rare for a sneaker to have a price ratio less than 1 (557 of the 99,956 transactions). But price_ratio has no upper bound. Because of this asymmetrical bound, the model predicting raw price_ratio values gets wonky around 1. By taking the natural log of price_ratio we convert the target to an unbounded logarithmic scale, which improves the symmetry and spread of the target distribution. Looking at the overlaid histogram below, it is clear that the frequencies for individual values go down quite significantly after the log transformation and that the spread improves.
#Plotting the overlaid histogram of the log-transformed target and regular target
plt.hist(np.log(stockx_data["price_ratio"]),bins=100, alpha = 0.5);
plt.hist(stockx_data["price_ratio"], bins = 100, alpha = 0.2);
plt.legend(["log(price_ratio)", "price_ratio"]);
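As a quick sanity check on the 557 figure mentioned above:
#Counting the transactions that resold below retail (price_ratio < 1)
(stockx_data["price_ratio"] < 1).sum()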
This time when making our copies of the data, we must log transform the training labels (y values). We do not touch the testing data.
#Creating copies of the train/test datasets without brand2 or other_state AND log transforming the training labels
x_train_m2 = x_train.copy()
x_train_m2 = x_train_m2.drop(["brand2","other_state"], axis=1)
x_test_m2 = x_test.copy()
x_test_m2 = x_test_m2.drop(["brand2", "other_state"], axis=1)
y_train_m2 = y_train.copy()
y_train_m2 = np.log(y_train_m2)
y_test_m2 = y_test.copy()
Now we create and fit our linear model object. This first model is predicting log transformed price_ratio values from both the Yeezy and Off-White data in the same model.
#Creating the linear model object: lm2
lm2 = linear_model.LinearRegression()
#Fitting the linear model: m2
m2 = lm2.fit(x_train_m2, y_train_m2)
Now we get the parameter estimates, p-values, R-squared, and adjusted R-squared for the log transformed model.
#Printing summary
print("\n=========== SUMMARY ===========")
xlabels = x_train_m2.columns
stats_reg.summary(lm2, x_train_m2, y_train_m2, xlabels)
Now we predict using the testing predictor data and calculate the residuals. Based on the following QQ-plot for these residuals, this model still breaks the assumption of normally distributed residuals.
#Getting test predictions
test_preds_m2 = lm2.predict(x_test_m2)
#Calculating residuals
residuals_m2 = np.exp(test_preds_m2) - y_test_m2
#Making QQ-plot of the residuals
sm.qqplot(residuals_m2,line="s");
Now we get the MAPE of the log transformed model. This is a slight improvement (lower MAPE) compared to the previous model.
#Calculate the MAPE: On average, the log transformed linear model is 22.03% off in predicting price
#ratio.
lmlog_mape = np.mean(100 * abs(residuals_m2/y_test_m2))
lmlog_mape
Based on the results and looking back on the distributions of price_ratio for Yeezys and Off-Whites, it makes sense to pursue separate linear regression models for each brand.
Creating the dataset this time around is not quite as simple as dropping unwanted columns. First we drop other_state for collinearity concerns. Then we subset the training and testing data to rows in which brand2 = 1 (Yeezy). Next, we drop jordan, airmax90, airmax97, zoom, presto, airforce, blazer, and vapormax because those columns apply exclusively to Off-Whites. Finally, we drop brand2.
#Creating a 3rd copy of the x_train and x_test
x_train_m3 = x_train.copy()
x_train_m3 = x_train_m3.drop(["other_state"], axis=1)
x_test_m3 = x_test.copy()
x_test_m3 = x_test_m3.drop(["other_state"], axis=1)
#Subset only rows with brand2 == 1 (Yeezy). Note that y_train_yeezy is log transformed
x_train_yeezy = x_train_m3.loc[x_train_m3["brand2"] == 1,]
x_test_yeezy = x_test_m3.loc[x_test_m3["brand2"] == 1,]
y_train_yeezy = np.log(y_train.loc[x_train_m3["brand2"] == 1,])
y_test_yeezy = y_test.loc[x_test_m3["brand2"] == 1,]
#Drop: jordan, airmax90, airmax97, zoom, presto, airforce, blazer, vapormax, brand2.
x_train_yeezy = x_train_yeezy.drop(["jordan", "airmax90", "airmax97", "zoom", \
"presto", "airforce","blazer", "vapormax","brand2"],axis = 1)
x_test_yeezy = x_test_yeezy.drop(["jordan", "airmax90", "airmax97", "zoom",\
"presto", "airforce","blazer", "vapormax","brand2"], axis = 1)
Now we create and fit our linear model object. This third model is predicting log transformed price_ratio values from only the Yeezy shoes in the data.
#Creating a linear model object: lm_yeezy
lm_yeezy = linear_model.LinearRegression()
#Training the Yeezy model using the Yeezy train/target
m_yeezy = lm_yeezy.fit(x_train_yeezy, y_train_yeezy)
Now we get the parameter estimates, p-values, R-squared, and adjusted R-squared for the log transformed Yeezy model.
#Printing summary
print("\n=========== SUMMARY ===========")
xlabels = x_train_yeezy.columns
stats_reg.summary(lm_yeezy, x_train_yeezy, y_train_yeezy, xlabels)
Now we predict and view the QQ-plot of the residuals. This model also fails to meet the normality of residuals requirement for linear regression.
#Getting the test predictions
test_preds_yeezy = lm_yeezy.predict(x_test_yeezy)
#Calculating the residuals. The test predictions have to be exponentiated when compared to the test set.
residuals_yeezy = np.exp(test_preds_yeezy) - y_test_yeezy
#Making the QQ-plot of the residuals
sm.qqplot(residuals_yeezy,line="s");
Here we calculate the MAPE of the Yeezy linear model. Separating by brand seems to have slightly reduced the MAPE in the case of the Yeezys.
#Calculate the MAPE
lmyeezy_mape = np.mean(100 * abs(residuals_yeezy/y_test_yeezy))
lmyeezy_mape
Similar to how the Yeezy data and labels were created, we subset the data to only rows with brand2 = 0 (Off-White) and then get rid of V2 (a strictly Yeezy term) and brand2, since it is no longer needed.
#Creating yeezy and off-white training and testing sets for brand separated models
x_train_offwhite = x_train_m3.loc[x_train_m3["brand2"] == 0,]
x_test_offwhite = x_test_m3.loc[x_test_m3["brand2"] == 0,]
#Taking the natural log of the training target
y_train_offwhite = np.log(y_train.loc[x_train_m3["brand2"] == 0,])
y_test_offwhite = y_test.loc[x_test_m3["brand2"] == 0,]
#Drop: V2, brand2.
x_train_offwhite = x_train_offwhite.drop(["V2","brand2"],axis=1)
x_test_offwhite = x_test_offwhite.drop(["V2","brand2"],axis=1)
Now we create and fit our linear model object. This fourth linear model is predicting log transformed price_ratio values from only the Off-White shoes in the data.
#Creating linear regression object for the Off-Whites
lm_offwhite = linear_model.LinearRegression()
#Training the Off-White model using the Off-White train/target
m_offwhite = lm_offwhite.fit(x_train_offwhite, y_train_offwhite)
#Printing summary
print("\n=========== SUMMARY ===========")
xlabels = x_train_offwhite.columns
stats_reg.summary(lm_offwhite, x_train_offwhite, y_train_offwhite, xlabels)
Now we predict and view the QQ-plot of the residuals. Like all previous linear models, this one fails to meet the normality of residuals requirement for linear regression.
#Getting test predictions
test_preds_offwhite = lm_offwhite.predict(x_test_offwhite)
#Calculating residuals
residuals_offwhite = np.exp(test_preds_offwhite) - y_test_offwhite
#Plotting QQ-plot
sm.qqplot(residuals_offwhite,line="s");
Now we calculate the MAPE of the Off-White model. This one is the best performing model, and this is most likely due to the greater number of effective predictors in the Off-White dataset as compared to the Yeezy dataset.
#Calculate the MAPE
lmoffwhite_mape = np.mean(100 * abs(residuals_offwhite/y_test_offwhite))
lmoffwhite_mape
After these four failed modeling attempts with linear regression, the best course of action is to pursue a model that does not have strict assumptions like those of linear regression. Random forest model it is!
The first random forest model runs on both Yeezy and Off-White data to predict price_ratio.
I settled on two hyperparameters to tune: the number of regression trees and the maximum number of features considered at each split. My process for optimizing the number of trees was quite simple: I ran four random forests with 10, 100, 500, and 1000 trees, compared their MAPEs to find the point of diminishing returns, and applied that number of trees to all subsequent models. For the maximum number of features per split, I created a function that adapts to any dataset and outputs a MAPE for each possible value of max_features in sklearn's RandomForestRegressor. I then look for the lowest MAPE and report that model as the final random forest regression model.
Here we make some fresh copies of the originally split data. The training set has to be further broken into training and validation in anticipation of hyperparameter tuning. Ultimately the split here is 60% training, 20% validation, and 20% test data.
#Creating the copies of the training, testing, and validation data for the general random forest model
x_train_m4 = x_train.copy()
x_test_m4 = x_test.copy()
y_train_m4 = y_train.copy()
y_test_m4 = y_test.copy()
x_train_m4_small, x_valid, y_train_m4_small, y_valid = train_test_split(x_train_m4,y_train_m4,test_size = 0.25)
The code below shows how I selected the number of trees. Knowing that more trees equates to more run time, I picked the tree count that seemed to mark the beginning of diminishing returns, which in this case was 100 trees. All random forest models use 100 trees.
#Testing 10, 100, 500, 1000 trees for their resulting MAPE values.
trees_to_test = [10, 100, 500, 1000]
tree_mapes = {}
for i in trees_to_test:
    rf_treevalid = RandomForestRegressor(random_state=np.random.seed(42), n_estimators=i)
    rf_treevalid.fit(x_train_m4_small, y_train_m4_small)
    rf_treevalid_preds = rf_treevalid.predict(x_valid)
    rf_treevalid_errors = rf_treevalid_preds - y_valid
    mape = mean(100 * abs((rf_treevalid_errors/rf_treevalid_preds)))
    tree_mapes.update({i:mape})
tree_mapes
The next hyperparameter I selected for tuning is the maximum number of features considered at every split. To do this effectively, I created a function that takes training data, training labels, validation data, validation labels, and a number of trees. For every value from 1 to the total number of features in the data, the function fits a random forest on the training data with the specified number of trees, scores it on the validation set, and records the MAPE. I then pick the number of features with the lowest MAPE, and that defines the final random forest model.
#Creating a new function max_features_tuning to find the best value for max_features (meaning number
#of features considered at every node) for each random forest. Trees will default to 100 because
#that's what I determined to be the best tradeoff of accuracy vs runtime, but it can be altered.
def max_features_tuning(xtrainset, ytrainset, xvalidset, yvalidset, trees = 100):
    MAPE_values = {}
    for i in range(0, xtrainset.shape[1]):
        rf = RandomForestRegressor(random_state=np.random.seed(42), n_estimators=trees,\
                                   max_features=i+1)
        rf.fit(xtrainset, ytrainset)
        rf_preds = rf.predict(xvalidset)
        rf_errors = rf_preds - yvalidset
        mape = mean(100 * abs((rf_errors/rf_preds)))
        MAPE_values.update({i+1:mape})
    return MAPE_values
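For reference, a roughly equivalent search could be written with sklearn's GridSearchCV. The sketch below is not what was run in this notebook: it cross-validates on the training data instead of using the single validation split above, and the neg_mape scorer is a helper named here purely for illustration.
#Sketch of an alternative tuning approach using GridSearchCV (not run in this notebook)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

def neg_mape(y_true, y_pred):
    #negated so that "greater is better", as sklearn scorers expect
    return -np.mean(100 * np.abs((y_true - y_pred) / y_true))

param_grid = {"max_features": list(range(1, x_train_m4_small.shape[1] + 1))}
search = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=42),
                      param_grid, scoring=make_scorer(neg_mape), cv=3)
#search.fit(x_train_m4_small, y_train_m4_small); search.best_params_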
The output below shows each possible value of the max features parameter and the corresponding MAPE. The best value for max features for this model is 11.
#Looking for the best max_features value using training and validation set
max_features_tuning(xtrainset=x_train_m4_small, ytrainset=y_train_m4_small,\
xvalidset=x_valid, yvalidset = y_valid, trees = 100)
Now that trees and max features at any split have been optimized, I rerun the random forest to get the feature importances for the model.
#Rerun the general Random forest with the optimal parameters
rf = RandomForestRegressor(random_state=np.random.seed(42), max_features=11, n_estimators=100)
rf.fit(x_train_m4, y_train_m4)
The output below shows the feature importances for this random forest model. The most important variable for this random forest is V2, or whether or not a sneaker is a Yeezy 350 Boost V2.
#Print list of feature importances
features = x_train_m4.columns
importances = list(rf.feature_importances_)
for i in range(0,len(importances)):
    print(features[i], "----", "importance: ", importances[i])
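The same information is easier to scan when sorted; this optional one-liner does that with pandas.
#Optional: the same importances, sorted from most to least important
pd.Series(importances, index=features).sort_values(ascending=False)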
Below we run the tuned random forest on the test dataset to get a final performance metric on unseen data.
test_preds = rf.predict(x_test_m4)
test_errors = test_preds - y_test_m4
test_mape = mean(100 * abs((test_errors/test_preds)))
test_mape
Just like with linear regression, we now create brand separated random forest models.
The Yeezy train/test data and labels were already created earlier in the notebook when we built the Yeezy linear regression. We exponentiate the training labels to undo the earlier log transform, since there is no longer a reason to model log price_ratio with a random forest.
#Exponentiating the y_train_yeezy to undo the log-transform for the Yeezy linear model
y_train_yeezy = np.exp(y_train_yeezy)
#Separating the training set into training, validation, and test datasets
x_train_yeezy_small, x_valid_yeezy, y_train_yeezy_small, y_valid_yeezy = train_test_split(x_train_yeezy,y_train_yeezy,\
test_size = 0.25)
The output below shows each possible value of the max features parameter and the corresponding MAPE. The best value of max features for the Yeezy random forest is 7.
#Looking for the best max_features value using the training and validation sets
max_features_tuning(xtrainset=x_train_yeezy_small, ytrainset=y_train_yeezy_small,\
xvalidset=x_valid_yeezy, yvalidset = y_valid_yeezy, trees = 100)
Now we rerun the Yeezy random forest with the optimized parameters.
#Rerun the Yeezy random forest with the optimal parameters
rf_yeezy = RandomForestRegressor(random_state=np.random.seed(42), max_features=7, n_estimators=100)
rf_yeezy.fit(x_train_yeezy, y_train_yeezy)
The output below contains the feature importances for the Yeezy model. The most important variable is blackcol, or whether or not a Yeezy has black in the colorway.
#Print list of feature importance
features = x_train_yeezy.columns
importances = list(rf_yeezy.feature_importances_)
for i in range(0,len(importances)):
    print(features[i], "----", "importance: ", importances[i])
Below we run the tuned Yeezy random forest on the test dataset to get a final performance metric on unseen data.
test_preds = rf_yeezy.predict(x_test_yeezy)
test_errors = test_preds - y_test_yeezy
test_mape = mean(100 * abs((test_errors/test_preds)))
test_mape
We repeat the same data transformation for the Off-White training labels as we did with the Yeezy training labels.
#Exponentiating the y_train_offwhite to undo the log-transform for the Off-White linear model
y_train_offwhite = np.exp(y_train_offwhite)
Here we separate the Off-White training data into training and validation for hyperparameter tuning.
#Separating the training set into training, validation, and test datasets
x_train_offwhite_small, x_valid_offwhite, y_train_offwhite_small, y_valid_offwhite = train_test_split(x_train_offwhite,\
y_train_offwhite,\
test_size = 0.25)
The output below shows each possible value of the max features parameter and the corresponding MAPE. The best value for the max features in the Off-White random forest is 12.
#Looking for the best max_features value
max_features_tuning(xtrainset=x_train_offwhite_small, ytrainset=y_train_offwhite_small,\
xvalidset=x_valid_offwhite, yvalidset = y_valid_offwhite, trees = 100)
Now we rerun the optimized Off-White random forest to get the feature importance.
#Rerun the optimized random forest
rf_offwhite = RandomForestRegressor(random_state=42,max_features=12, n_estimators=100)
rf_offwhite.fit(x_train_offwhite, y_train_offwhite)
The output below shows the feature importance for the Off-White random forest. The most important variable is date_diff.
#Print the feature importance
features = x_train_offwhite.columns
importances = list(rf_offwhite.feature_importances_)
#importances
for i in range(0,len(importances)):
    print(features[i], "----", "importance: ", importances[i])
Below we run the tuned Off-White random forest on the test dataset to get a final performance metric on unseen data.
test_preds = rf_offwhite.predict(x_test_offwhite)
test_errors = test_preds - y_test_offwhite
test_mape = mean(100 * abs((test_errors/test_preds)))
test_mape
In this analysis, we used sneaker transaction data from StockX to understand what makes certain sneakers hype. We found that Off-Whites are, in fact, more hype than Yeezys. We found that it is possible to predict the hype of a sneaker, represented by the ratio of sale price to retail price. We were also able to determine the most important factors in predicting that hype, such as the number of days after release and whether a shoe is a Yeezy Boost 350 V2. StockX can use these results to decide which sneakers to promote to buyers based on how they want to balance maximizing revenue with buyer interests.