Sameen Salam- M.S. Candidate, Institute for Advanced Analytics at North Carolina State University
Email: ssalam@ncsu.edu
In this notebook, we will explore sneaker transaction data from the online marketplace StockX. StockX put out this dataset as part of its 2019 Data Challenge. The data consists of a random sample of all Off-White and Yeezy 350 sales between 9/1/2017 (the month that Off-White first debuted “The Ten” collection) and 2/15/2019.
This is the modeling notebook. Here is an overview of how this notebook is structured:
The Sneaker Game - brief explanation on some basic aspects of the sneaker market
Feature Engineering - conceptualization and creation of additional variables with modeling in mind
Modeling - mapping out the relationships between features and the selected target
Conclusions - summarizing modeling results and paths the project could take moving forward
Sneaker culture has become ubiquitous in recent years. All around the world, millions of sneakerheads (myself included) go online to cop the newest releases and rarest classics. But it's not always as simple as clicking the "Add to Cart" button and making the purchase. Some sneakers are in incredibly high demand yet have a very limited supply upon release. Only a few dedicated and/or lucky people will succeed in purchasing these shoes from a shoe store or website. Some may choose to wear their shoes and keep them in their collection for years to come. But many will choose to resell deadstock (brand new) shoes at a profit in order to purchase even rarer ones.
This is where StockX, GOAT, FlightClub, or any other online sneaker marketplace comes in. Resellers need to connect with individuals who want the shoes they have up for sale. These marketplaces offer a platform that puts resellers in direct contact with potential buyers. StockX in particular prides itself on being the stock market analog of the sneaker world. Resellers can list sneakers for sale at whatever price they see fit, and buyers can make whatever bids or offers they would like on any sneaker. StockX's role in the transaction is to verify that resellers are selling authentic sneakers, protecting buyers from receiving fake or damaged pairs.
Yeezys and Off-Whites are popular examples of coveted shoes that sneakerheads buy off of one another. The Yeezy line is a collaboration between Adidas and musical artist Kanye West. There are several other sneakers that fall under the Yeezy brand, but this dataset only covers Yeezy Boost 350 models. The Off-White line is a collaboration between Nike and luxury designer Virgil Abloh. Like the Yeezys, this dataset focuses on a subset of Off-White sneaker models known as "The Ten". This is a set of ten different shoes released by Nike over a period of several months. The sneakers that carry these brand labels represent some of the most sought after kicks in the world, selling out in stores and online almost instantly upon release.
After conducting some research on StockX's website, I found that StockX's revenue stream comes primarily from a 3% payment processing fee and a regressive transaction fee (i.e., the more a reseller sells on StockX, the lower their fee per item). From a revenue standpoint, it is in StockX's best interest to foster sales of shoes with higher sale prices. If reasonably accurate predictions can be made on resale prices as they relate to retail prices, then StockX can make informed decisions about promoting certain sneaker listings.
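To make the revenue logic concrete, here is a minimal sketch of StockX's take on a single sale. The 3% processing fee comes from the research above; the 9.5% transaction fee is an illustrative placeholder, since the actual rate varies by seller level, and the function name is mine:

```python
#Illustrative only: the transaction fee tier (9.5%) is an assumed placeholder,
#not StockX's actual published rate for any specific seller level.
def stockx_revenue(sale_price, transaction_fee_rate=0.095, processing_fee_rate=0.03):
    """Estimate StockX's revenue on a single sale."""
    return sale_price * (transaction_fee_rate + processing_fee_rate)

print(stockx_revenue(400.0))  #a $400 sale earns more fee revenue than a $100 sale
```

Under any fixed fee rates, revenue scales linearly with sale price, which is why higher-priced transactions matter more to StockX.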
Importing needed libraries and reading in the spreadsheet directly from StockX's 2019 Data Challenge page. Note that this code will download the dataset in the form of an .xlsx spreadsheet in the directory from which this notebook is being run.
#Importing the necessary libraries
import pandas as pd
from statistics import mean
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import matplotlib.patches as mpatch
import matplotlib.dates as mdate
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from regressors import stats as stats_reg
from sklearn.ensemble import RandomForestRegressor
import requests
#Reading in the data directly from the StockX Data Challenge 2019 webpage
url = "https://s3.amazonaws.com/stockx-sneaker-analysis/wp-content/uploads/2019/02/StockX-Data-Contest-2019-3.xlsx"
my_file = requests.get(url)
with open('StockX-Data-Contest-2019-3.xlsx', 'wb') as output:
    output.write(my_file.content)
#Save the contents of the excel file into variable: stockx_data
stockx_data = pd.read_excel("StockX-Data-Contest-2019-3.xlsx", sheet_name = 1)
Here we are looking at some basic characteristics and quality of the dataset. Data cleaning was not a significant hurdle in this project because the dataset provided by StockX was very clean.
#Taking a quick look at how the data is structured
stockx_data.head(n=5)
Below, we confirm that the dataset has the correct dimensions: 99,956 rows and 8 columns
#Getting the dimensions of the data
stockx_data.shape
From the output below, we can see that this data contains no missing values.
#Checking for missing values
stockx_data.isnull().sum()
We can see that the datatypes of the columns are congruent with the information they provide.
#Checking the data types of each column
stockx_data.dtypes
In the Brand variable, Yeezys were recorded as " Yeezy" (with a leading space), making analysis more difficult. In this code, I strip all leading and trailing white space in Brand to make the data just a bit cleaner.
#Stripping all leading and trailing white space in the Brand column
stockx_data["Brand"] = stockx_data["Brand"].apply(lambda x: x.strip())
99,956 rows and 8 columns
No missing values
Order Date: datetime, Date the transaction occurred
Brand: string, Brand of the sneaker
Sneaker Name: string, Name of the sneaker
Sale Price: numpy float, Price paid for sneaker in transaction (resale price)
Retail Price: numpy int, Retail price of sneaker in stores (release price)
Release Date: datetime, Date the sneaker dropped
Shoe Size: numpy float, Shoe size (most likely men's US sizing)
Buyer Region: string, U.S. state associated with the transaction (most likely the buyer's)
Sometimes the variables in your dataset do not capture the complete story. Feature engineering, or creating new variables based on ones you do have, can allow for clearer or more interesting insights and pave the way for useful models.
With potential modeling applications in mind, here is a breakdown of some additional features I want to create:
date_diff: Elapsed time between Order Date and Release Date. This allows cross comparisons of different sneakers over the same point in time after release.
price_ratio: Ratio of Sale Price to Retail Price. This standardizes each shoe relative to its retail price and acts as a better indicator of the hype of any given sneaker.
jordan: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Jordan or not.
V2: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a V2 model or not.
blackcol: A binary indicator variable derived from Sneaker Name that indicates if the shoe has black in the colorway or not.
airmax90: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an Airmax 90 or not.
airmax97: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an Airmax 97 or not.
zoom: A binary indicator variable derived from Sneaker Name that indicates if the shoe has Zoom in the name or not.
presto: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Presto or not.
airforce: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an AirForce or not.
blazer: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Blazer or not.
vapormax: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a VaporMax or not.
california: A binary indicator variable derived from Buyer Region that indicates if the buyer is from California or not.
new_york: A binary indicator variable derived from Buyer Region that indicates if the buyer is from New York or not.
oregon: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Oregon or not.
florida: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Florida or not.
texas: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Texas or not.
other_state: A binary indicator variable derived from Buyer Region that indicates if the buyer is from any other state. The states in this category had <5% of transactions in the data.
brand2: A binary indicator variable derived from Brand for modeling. Yeezy = 1, Off-White = 0.
#Taking the difference between Order Date and Release Date to create a new column: date_diff
stockx_data["date_diff"] = stockx_data['Order Date'].sub(stockx_data['Release Date'], axis=0)/np.timedelta64(1, 'D')
#Creating a new column containing the ratio of Sale Price and Retail Price: price_ratio
stockx_data["price_ratio"] = stockx_data["Sale Price"]/stockx_data["Retail Price"]
#Creating the jordan variable
stockx_data["jordan"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Jordan' in x.split("-") else 0)
#Creating the V2 variable
stockx_data["V2"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'V2' in x.split("-") else 0)
#Creating the blackcol variable
stockx_data["blackcol"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Black' in x.split("-") else 0)
#Creating the airmax90 variable
stockx_data["airmax90"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if '90' in x.split("-") else 0)
#Creating the airmax97 variable
stockx_data["airmax97"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if '97' in x.split("-") else 0)
#Creating the zoom variable
stockx_data["zoom"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Zoom' in x.split("-") else 0)
#Creating the presto variable
stockx_data["presto"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Presto' in x.split("-") else 0)
#Creating the airforce variable
stockx_data["airforce"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Force' in x.split("-") else 0)
#Creating the blazer variable
stockx_data["blazer"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Blazer' in x.split("-") else 0)
#Creating the vapormax variable
stockx_data["vapormax"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'VaporMax' in x.split("-") else 0)
#Creating the california variable
stockx_data["california"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'California' in x else 0)
#Creating the new_york variable
stockx_data["new_york"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'New York' in x else 0)
#Creating the oregon variable
stockx_data["oregon"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Oregon' in x else 0)
#Creating the florida variable
stockx_data["florida"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Florida' in x else 0)
#Creating the texas variable
stockx_data["texas"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Texas' in x else 0)
#Creating the other_state variable
above5pct_states = ["California", "New York", "Oregon", "Florida", "Texas"]
stockx_data["other_state"] = (~stockx_data["Buyer Region"].isin(above5pct_states)).astype(int)
#Creating the brand2 variable
stockx_data["brand2"] = stockx_data["Brand"].apply(lambda x : 1 if 'Yeezy' in x else 0)
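The repeated `apply` calls above could equivalently be generated in one loop over a keyword map. A sketch of that pattern on toy data; the helper name `add_token_flags` is mine, not part of the notebook's pipeline:

```python
import pandas as pd

def add_token_flags(df, source_col, keywords, sep="-"):
    """For each (new_col, token) pair, add a 0/1 column flagging rows whose
    source_col contains the token when split on sep."""
    for new_col, token in keywords.items():
        df[new_col] = df[source_col].apply(lambda x: 1 if token in x.split(sep) else 0)
    return df

#Toy demonstration on two made-up sneaker names
demo = pd.DataFrame({"Sneaker Name": ["Adidas-Yeezy-Boost-350-V2-Black",
                                      "Nike-Air-Jordan-1-Off-White"]})
add_token_flags(demo, "Sneaker Name", {"V2": "V2", "jordan": "Jordan", "blackcol": "Black"})
print(demo[["V2", "jordan", "blackcol"]].values.tolist())  # → [[1, 0, 1], [0, 1, 0]]
```

Splitting on "-" rather than substring matching avoids accidental hits (e.g. "90" inside another token), which is why the notebook's indicators use `x.split("-")` as well.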
In this section, we are going to do some predictive modeling on price_ratio. After doing some data processing, we will attempt some multiple linear regression models to predict price_ratio. After evaluating these linear models for performance and validity, we will move on to random forest regression models and compare their performances. All regression models in this notebook will be compared using mean absolute percent error, or MAPE. This metric can be interpreted as follows: "On average, the model outputs predictions that are XX.X% off from the true value."
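Since every model below is scored the same way, the metric can be written once as a small helper (the function name is mine; the notebook computes it inline):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percent error: average % deviation of predictions
    from the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(100 * np.abs((y_pred - y_true) / y_true))

print(mape([2.0, 4.0], [1.0, 5.0]))  # → 37.5 (mean of 50% and 25% errors)
```

Note the denominator is the true value, matching the interpretation "XX.X% off from the true value."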
Since date_diff is a direct linear combination of Order Date and Release Date, we will drop these variables before splitting the dataset.
#Get rid of the date columns
stockx_data = stockx_data.drop(['Order Date','Release Date'], axis=1)
Next, we get rid of Brand, since brand2 is a numerical version of the same information. We also get rid of Sneaker Name because we have extracted most of the information from that column in the form of several binary indicator variables.
#Get rid of the original Brand and Sneaker Name columns
stockx_data = stockx_data.drop(['Brand','Sneaker Name'], axis=1)
Similar to date_diff, we drop Sale Price and Retail Price because our target variable, price_ratio, is computed directly from them.
#Getting rid of the Sale Price and Retail Price
stockx_data = stockx_data.drop(['Sale Price', 'Retail Price'], axis = 1)
Now we get rid of Buyer Region because we have several binary indicator variables in its place.
#Getting rid of Buyer Region
stockx_data = stockx_data.drop(['Buyer Region'], axis = 1)
Let's take a look at our model-ready dataset!
stockx_data.head(n=5)
Now we're ready to split the dataset into training and testing data.
#Create objects x and y to separate predictors from the target.
x = stockx_data.drop("price_ratio", axis=1)
y = stockx_data["price_ratio"]
#Run train_test_split to create 4 different data objects with train/test predictors and train/test targets
#(random_state fixed for reproducibility)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
The first model we will look at is multiple linear regression. One of the assumptions of this model is that predictors are not highly correlated with one another (no multicollinearity). To address this, we look at the Pearson correlation matrix for the dataset. Here we remove brand2 because it is highly correlated with V2. Additionally, we remove other_state because it is directly determined by the california, new_york, oregon, florida, and texas variables.
#Looking at the correlation values for each predictor relative to each other and the target
stockx_data.corr(method='pearson')
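Scanning a large correlation matrix by eye is error-prone, so a small helper can list only the pairs above a chosen threshold. This is a sketch on toy data (the function name and the 0.7 cutoff are my choices, not the notebook's):

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for each column pair with |Pearson r| above threshold."""
    corr = df.corr(method="pearson")
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs

#Toy demonstration: x and y are perfectly correlated, z is unrelated noise
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [5, -1, 3, 0]})
print(high_corr_pairs(demo))  #only the (x, y) pair exceeds the threshold
```

Running the same idea on `stockx_data` would surface the brand2/V2 relationship noted above without reading the full matrix.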
#Creating copies of the train/test datasets without brand2 or other_state
x_train_m1 = x_train.copy()
x_train_m1 = x_train_m1.drop(["brand2","other_state"], axis=1)
x_test_m1 = x_test.copy()
x_test_m1 = x_test_m1.drop(["brand2","other_state"], axis=1)
Now we create and fit our linear model object. This first model is predicting raw price_ratio values from both the Yeezy and Off-White data in the same model.
#Creating a linear regression object: lm1
lm1 = linear_model.LinearRegression()
#Fitting the model with the first model's predictors (x_train_m1) and the train response
m1 = lm1.fit(x_train_m1, y_train)
Now we get the parameter estimates, p-values, R-squared, and adjusted R-squared for this model.
#Get the summary table with p-values for each predictor.
print("\n=========== SUMMARY ===========")
xlabels = x_train_m1.columns
stats_reg.summary(lm1, x_train_m1, y_train, xlabels)
Now we predict using the testing predictor data and calculate the residuals. Based on the following QQ-plot for these residuals, this model breaks the assumption of normally distributed residuals. So while the parameter estimates are still valid, this model should not be used to predict on new data.
#Getting test predictions
test_preds_m1 = lm1.predict(x_test_m1)
#Calculating residuals
residuals_m1 = test_preds_m1 - y_test
#Making QQ-plot of residuals
sm.qqplot(residuals_m1,line="s");
Now we calculate the mean absolute percent error of the model (MAPE) for comparison to other models.
#Calculate the MAPE
lm1_mape = np.mean(100 * abs((residuals_m1/y_test)))
lm1_mape
Based on the residual QQ-plot from the previous model, the assumptions of linear regression were not met. This issue could be happening because price_ratio is a bounded quantity. As I mentioned earlier, it is impossible to have a price_ratio less than or equal to 0. Additionally, it is extremely rare for a sneaker to have a price ratio below 1 (557 of 99,956 transactions). But price_ratio has no upper bound. Because of this asymmetrical bound, the model predicting raw price_ratio values gets wonky around 1. By taking the natural log of price_ratio, we convert the target to an unbounded logarithmic scale, which improves the symmetry and spread of the target distribution. Looking at the overlaid histogram below, it is clear that the frequencies of individual values go down significantly after the log transformation and that the spread improves.
#Plotting the overlaid histogram of the log-transformed target and regular target
plt.hist(np.log(stockx_data["price_ratio"]),bins=100, alpha = 0.5);
plt.hist(stockx_data["price_ratio"], bins = 100, alpha = 0.2);
plt.legend(["log(price_ratio)", "price_ratio"]);
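The symmetry gain from the log transform can also be quantified with a skewness statistic. A sketch on simulated ratio-like data (lognormal: bounded below by 0 with a long right tail, loosely mimicking price_ratio; the numbers are illustrative, not from the StockX data):

```python
import numpy as np
from scipy import stats

#Simulated ratio-like data: positive, right-skewed, no upper bound
rng = np.random.default_rng(42)
ratios = rng.lognormal(mean=0.6, sigma=0.5, size=10_000)

#A right-skewed sample has large positive skew; its log is roughly symmetric
print("skew(raw):", stats.skew(ratios))
print("skew(log):", stats.skew(np.log(ratios)))
```

The raw sample shows strong positive skew while the logged sample sits near zero, which is the same effect the overlaid histogram shows for price_ratio.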
This time when making our copies of the data, we must log transform the training labels (y values). We do not touch the testing data.
#Creating copies of the train/test datasets without brand2 or other_state AND log transforming the training labels
x_train_m2 = x_train.copy()
x_train_m2 = x_train_m2.drop(["brand2","other_state"], axis=1)
x_test_m2 = x_test.copy()
x_test_m2 = x_test_m2.drop(["brand2", "other_state"], axis=1)
y_train_m2 = y_train.copy()
y_train_m2 = np.log(y_train_m2)
y_test_m2 = y_test.copy()
Now we create and fit our linear model object. This second model is predicting log transformed price_ratio values from both the Yeezy and Off-White data in the same model.
#Creating the linear model object: lm2
lm2 = linear_model.LinearRegression()
#Fitting the linear model: m2
m2 = lm2.fit(x_train_m2, y_train_m2)
Now we get the parameter estimates, p-values, R-squared, and adjusted R-squared for the log transformed model.
#Printing summary
print("\n=========== SUMMARY ===========")
xlabels = x_train_m2.columns
stats_reg.summary(lm2, x_train_m2, y_train_m2, xlabels)
Now we predict using the testing predictor data and calculate the residuals. Based on the following QQ-plot for these residuals, this model still breaks the assumption of normally distributed residuals.
#Getting test predictions
test_preds_m2 = lm2.predict(x_test_m2)
#Calculating residuals
residuals_m2 = np.exp(test_preds_m2) - y_test_m2
#Making QQ-plot of the residuals
sm.qqplot(residuals_m2,line="s");
Now we get the MAPE of the log transformed model. This is a slight improvement (lower MAPE) compared to the previous model.
#Calculate the MAPE: On average, the log transformed linear model is 22.03% off in predicting price
#ratio.
lmlog_mape = np.mean(100 * abs(residuals_m2/y_test_m2))
lmlog_mape
Based on the results and looking back on the distributions of price_ratio for Yeezys and Off-Whites, it makes sense to pursue separate linear regression models for each brand.
Creating the dataset this time around is not quite as simple as dropping unwanted columns. First we drop other_state for collinearity concerns. Then we subset the training and testing data for rows in which brand2 = 1 (Yeezy). After that, we drop jordan, airmax90, airmax97, zoom, presto, airforce, blazer, and vapormax because those columns exclusively apply to Off-Whites. After that, we drop brand2.
#Creating a 3rd copy of the x_train and x_test
x_train_m3 = x_train.copy()
x_train_m3 = x_train_m3.drop(["other_state"], axis=1)
x_test_m3 = x_test.copy()
x_test_m3 = x_test_m3.drop(["other_state"], axis=1)
#Subset only rows with brand2 == 1 (Yeezy). Note that y_train_yeezy is log transformed
x_train_yeezy = x_train_m3.loc[x_train_m3["brand2"] == 1,]
x_test_yeezy = x_test_m3.loc[x_test_m3["brand2"] == 1,]
y_train_yeezy = np.log(y_train.loc[x_train_m3["brand2"] == 1,])
y_test_yeezy = y_test.loc[x_test_m3["brand2"] == 1,]
#Drop: jordan, airmax90, airmax97, zoom, presto, airforce, blazer, vapormax, brand2.
x_train_yeezy = x_train_yeezy.drop(["jordan", "airmax90", "airmax97", "zoom", \
"presto", "airforce","blazer", "vapormax","brand2"],axis = 1)
x_test_yeezy = x_test_yeezy.drop(["jordan", "airmax90", "airmax97", "zoom",\
"presto", "airforce","blazer", "vapormax","brand2"], axis = 1)
Now we create and fit our linear model object. This third model is predicting log transformed price_ratio values from only the Yeezy shoes in the data.
#Creating a linear model object: lm_yeezy
lm_yeezy = linear_model.LinearRegression()
#Training the Yeezy model using the Yeezy train/target
m_yeezy = lm_yeezy.fit(x_train_yeezy, y_train_yeezy)
Now we get the parameter estimates, p-values, R-squared, and adjusted R-squared for the log transformed Yeezy model.
#Printing summary
print("\n=========== SUMMARY ===========")
xlabels = x_train_yeezy.columns
stats_reg.summary(lm_yeezy, x_train_yeezy, y_train_yeezy, xlabels)
Now we predict and view the QQ-plot of the residuals. This model also fails to meet the normality of residuals requirement for linear regression.
#Getting the test predictions
test_preds_yeezy = lm_yeezy.predict(x_test_yeezy)
#Calculating the residuals. The test predictions have to be exponentiated when compared to the test set.
residuals_yeezy = np.exp(test_preds_yeezy) - y_test_yeezy
#Making the QQ-plot of the residuals
sm.qqplot(residuals_yeezy,line="s");
Here we calculate the MAPE of the Yeezy linear model. Separating by brand seems to have slightly reduced the MAPE in the case of the Yeezys.
#Calculate the MAPE
lmyeezy_mape = np.mean(100 * abs(residuals_yeezy/y_test_yeezy))
lmyeezy_mape
Similar to how the Yeezy data and labels were created, we subset the data to only rows with brand2 = 0 (Off-White) and then get rid of V2 (a strictly Yeezy term) and brand2, since it is no longer needed.
#Creating yeezy and off-white training and testing sets for brand separated models
x_train_offwhite = x_train_m3.loc[x_train_m3["brand2"] == 0,]
x_test_offwhite = x_test_m3.loc[x_test_m3["brand2"] == 0,]
#Taking the natural log of the training target
y_train_offwhite = np.log(y_train.loc[x_train_m3["brand2"] == 0,])
y_test_offwhite = y_test.loc[x_test_m3["brand2"] == 0,]
#Drop: V2, brand2.
x_train_offwhite = x_train_offwhite.drop(["V2","brand2"],axis=1)
x_test_offwhite = x_test_offwhite.drop(["V2","brand2"],axis=1)
Now we create and fit our linear model object. This fourth linear model is predicting log transformed price_ratio values from only the Off-White shoes in the data.
#Creating linear regression object for the Off-Whites
lm_offwhite = linear_model.LinearRegression()
#Training the Off-White model using the Off-White train/target
m_offwhite = lm_offwhite.fit(x_train_offwhite, y_train_offwhite)
#Printing summary
print("\n=========== SUMMARY ===========")
xlabels = x_train_offwhite.columns
stats_reg.summary(lm_offwhite, x_train_offwhite, y_train_offwhite, xlabels)
Now we predict and view the QQ-plot of the residuals. Like all previous linear models, this one fails to meet the normality of residuals requirement for linear regression.
#Getting test predictions
test_preds_offwhite = lm_offwhite.predict(x_test_offwhite)
#Calculating residuals
residuals_offwhite = np.exp(test_preds_offwhite) - y_test_offwhite
#Plotting QQ-plot
sm.qqplot(residuals_offwhite,line="s");
Now we calculate the MAPE of the Off-White model. This one is the best performing model, and this is most likely due to the greater number of effective predictors in the Off-White dataset as compared to the Yeezy dataset.
#Calculate the MAPE
lmoffwhite_mape = np.mean(100 * abs(residuals_offwhite/y_test_offwhite))
lmoffwhite_mape
After four linear models that each failed the residual normality check, the best course of action is to pursue a model that does not carry linear regression's strict assumptions. Random forest it is!
The first random forest model runs on both Yeezy and Off-White data to predict price_ratio.
I settled on two hyperparameters to tune: the number of regression trees and the maximum number of features considered at each split. To optimize the number of trees, I ran four random forests with 10, 100, 500, and 1000 trees, compared their MAPEs to find the point of diminishing returns, and applied that tree count to all subsequent models. For the maximum features per split, I created a function that adapts to any dataset and outputs the MAPE for each possible value of max_features in sklearn's RandomForestRegressor. I then pick the value with the lowest MAPE for the final random forest regression model.
Here we make some fresh copies of the originally split data. The training set has to be further broken into training and validation in anticipation of hyperparameter tuning. Ultimately the split here is 60% training, 20% validation, and 20% test data.
#Creating the copies of the training, testing, and validation data for the general random forest model
x_train_m4 = x_train.copy()
x_test_m4 = x_test.copy()
y_train_m4 = y_train.copy()
y_test_m4 = y_test.copy()
x_train_m4_small, x_valid, y_train_m4_small, y_valid = train_test_split(x_train_m4, y_train_m4, test_size=0.25, random_state=42)
The code below shows how I selected the number of trees. Knowing that more trees means more run time, I picked the value that marked the beginning of diminishing returns, which in this case was 100 trees. All subsequent random forest models use 100 trees.
#Testing 10, 100, 500, 1000 trees for their resulting MAPE values.
trees_to_test = [10, 100, 500, 1000]
tree_mapes = {}
for i in trees_to_test:
    #Fixed seed (np.random.seed returns None, so the original call left random_state unset)
    rf_treevalid = RandomForestRegressor(random_state=42, n_estimators=i)
    rf_treevalid.fit(x_train_m4_small, y_train_m4_small)
    rf_treevalid_preds = rf_treevalid.predict(x_valid)
    rf_treevalid_errors = rf_treevalid_preds - y_valid
    #MAPE is computed relative to the true validation values
    mape = mean(100 * abs(rf_treevalid_errors/y_valid))
    tree_mapes.update({i: mape})
tree_mapes
The next hyperparameter I tuned is the maximum number of features considered at every split. To do this effectively, I created a function that takes training data, training labels, validation data, validation labels, and a number of trees. For every value from 1 up to the total number of features in the data, the function fits a random forest with 100 trees on the training data and records the MAPE on the validation set. I then pick the number of features with the lowest MAPE for the final random forest model.
#Creating a new function max_features_tuning to find the best value for max_features (meaning number
#of features considered at every node) for each random forest. Trees will default to 100 because
#that's what I determined to be the best tradeoff of accuracy vs runtime, but it can be altered.
def max_features_tuning(xtrainset, ytrainset, xvalidset, yvalidset, trees=100):
    MAPE_values = {}
    for i in range(0, xtrainset.shape[1]):
        rf = RandomForestRegressor(random_state=42, n_estimators=trees,
                                   max_features=i+1)
        rf.fit(xtrainset, ytrainset)
        rf_preds = rf.predict(xvalidset)
        rf_errors = rf_preds - yvalidset
        #MAPE is computed relative to the true validation values
        mape = mean(100 * abs(rf_errors/yvalidset))
        MAPE_values.update({i+1: mape})
    return MAPE_values
The output below shows each possible value of the max features parameter and the corresponding MAPE. The best value for max features for this model is 12.
#Looking for the best max_features value using training and validation set
max_features_tuning(xtrainset=x_train_m4_small, ytrainset=y_train_m4_small,\
xvalidset=x_valid, yvalidset = y_valid, trees = 100)
Now that trees and max features at any split have been optimized, I rerun the random forest to get the feature importances for the model.
#Rerun the general Random forest with the optimal parameters
rf = RandomForestRegressor(random_state=42, max_features=12, n_estimators=100)
rf.fit(x_train_m4, y_train_m4)
The output below shows the feature importances for this random forest model. The most important variable for this random forest is V2, or whether or not a sneaker is a Yeezy 350 Boost V2.
#Print list of feature importances
features = x_train_m4.columns
importances = list(rf.feature_importances_)
for i in range(0, len(importances)):
    print(features[i], "----", "importance: ", importances[i])
Below we run the tuned random forest on the test dataset to get a final performance metric on unseen data.
test_preds = rf.predict(x_test_m4)
test_errors = test_preds - y_test_m4
test_mape = mean(100 * abs(test_errors/y_test_m4))
test_mape
Just like with linear regression, we now create brand separated random forest models.
The Yeezy train/test data and labels were already created earlier in the notebook when we made the Yeezy linear regression. We have to exponentiate the y-training labels because there is no longer a reason to log-transform price_ratio.
#Exponentiating the y_train_yeezy to undo the log-transform for the Yeezy linear model
y_train_yeezy = np.exp(y_train_yeezy)
#Separating the training set into training, validation, and test datasets
x_train_yeezy_small, x_valid_yeezy, y_train_yeezy_small, y_valid_yeezy = train_test_split(
    x_train_yeezy, y_train_yeezy, test_size=0.25, random_state=42)
The output below shows each possible value of the max features parameter and the corresponding MAPE. The best value of max features for the Yeezy random forest is 6.
#Looking for the best max_features value
max_features_tuning(xtrainset=x_train_yeezy_small, ytrainset=y_train_yeezy_small,\
xvalidset=x_valid_yeezy, yvalidset = y_valid_yeezy, trees = 100)
Now we rerun the Yeezy random forest with the optimized parameters.
#Rerun the Yeezy Random forest with the optimal parameters for
rf_yeezy = RandomForestRegressor(random_state=42, max_features=6, n_estimators=100)
rf_yeezy.fit(x_train_yeezy, y_train_yeezy)
The output below contains the feature importances for the Yeezy model. The most important variable is blackcol, or whether or not a Yeezy has black in the colorway.
#Print list of feature importance
features = x_train_yeezy.columns
importances = list(rf_yeezy.feature_importances_)
for i in range(0, len(importances)):
    print(features[i], "----", "importance: ", importances[i])
Below we run the tuned Yeezy random forest on the test dataset to get a final performance metric on unseen data.
test_preds = rf_yeezy.predict(x_test_yeezy)
test_errors = test_preds - y_test_yeezy
test_mape = mean(100 * abs(test_errors/y_test_yeezy))
test_mape
We repeat the same data transformation for the Off-White training labels as we did with the Yeezy training labels.
#Exponentiating the y_train_offwhite to undo the log-transform for the Off-White linear model
y_train_offwhite = np.exp(y_train_offwhite)
Here we separate the Off-White training data into training and validation for hyperparameter tuning.
#Separating the training set into training, validation, and test datasets
x_train_offwhite_small, x_valid_offwhite, y_train_offwhite_small, y_valid_offwhite = train_test_split(
    x_train_offwhite, y_train_offwhite, test_size=0.25, random_state=42)
The output below shows each possible value of the max features parameter and the corresponding MAPE. The best value for the max features in the Off-White random forest is 11.
#Looking for the best max_features value
max_features_tuning(xtrainset=x_train_offwhite_small, ytrainset=y_train_offwhite_small,\
xvalidset=x_valid_offwhite, yvalidset = y_valid_offwhite, trees = 100)
Now we rerun the optimized Off-White random forest to get the feature importance.
#Rerun the optimized random forest
rf_offwhite = RandomForestRegressor(random_state=42,max_features=11, n_estimators=100)
rf_offwhite.fit(x_train_offwhite, y_train_offwhite)
The output below shows the feature importance for the Off-White random forest. The most important variable is date_diff.
#Print the feature importance
features = x_train_offwhite.columns
importances = list(rf_offwhite.feature_importances_)
for i in range(0, len(importances)):
    print(features[i], "----", "importance: ", importances[i])
Below we run the tuned Off-White random forest on the test dataset to get a final performance metric on unseen data.
test_preds = rf_offwhite.predict(x_test_offwhite)
test_errors = test_preds - y_test_offwhite
test_mape = mean(100 * abs(test_errors/y_test_offwhite))
test_mape
In this analysis, we used sneaker transaction data from StockX to understand what makes certain sneakers hype. We found that it is possible to predict the hype of a sneaker, represented by the ratio of sale price to retail price. We were able to determine the most important factors in predicting the hype of a sneaker, such as number of days after release or whether a shoe was a Yeezy Boost 350 V2 or not. StockX can use these results to make decisions on which sneakers to promote to buyers based on how they want to balance maximal revenue with buyer interests.