Sameen Salam- M.S. Candidate, Institute for Advanced Analytics at North Carolina State University
Mail: ssalam@ncsu.edu
In this notebook, we will explore sneaker transaction data from the online marketplace StockX. StockX put out this dataset as part of its 2019 Data Challenge. The data consists of a random sample of all Off-White and Yeezy 350 sales between 9/1/2017 (the month that Off-White first debuted “The Ten” collection) and 2/15/2019.
This is the exploratory data analysis notebook. Here is an overview of how this notebook is structured:
The Sneaker Game - brief explanation on some basic aspects of the sneaker market
Exploratory Data Analysis - basic familiarity with the data
Feature Engineering - conceptualization and creation of additional variables with modeling in mind
Conclusions - summarizing modeling results and paths the project could take moving forward
Sneaker culture has become ubiquitous in recent years. All around the world, millions of sneakerheads (myself included) go online to cop the newest releases and rarest classics. But it's not always quite as simple as clicking the "Add to Cart" button and making the purchase. Some sneakers have incredibly high demand leading up to a very limited supply upon release. Only a few dedicated and/or lucky people will be successful in purchasing these shoes from a shoe store or website. Some may choose to wear their shoes and keep them in their collection for years to come. But many will choose to resell deadstock (brand new) shoes at a profit in order to purchase even rarer ones.
This is where StockX, GOAT, FlightClub, or any other online sneaker marketplace comes in. Resellers need to connect with individuals who want the shoes they have up for sale. These entities offer a platform that puts resellers in direct contact with potential buyers. StockX in particular prides itself on being the stock market analog in the sneaker world. Resellers can list sneakers for sale at whatever price they see fit, and buyers can make whatever bids or offers they would like on any sneaker. StockX's role in the transaction is to make sure that the resellers are selling authentic sneakers to protect buyers from receiving fake or damaged sneakers.
Yeezys and Off-Whites are popular examples of coveted shoes that sneakerheads buy off of one another. The Yeezy line is a collaboration between Adidas and musical artist Kanye West. There are several other sneakers that fall under the Yeezy brand, but this dataset only covers Yeezy Boost 350 models. The Off-White line is a collaboration between Nike and luxury designer Virgil Abloh. Like the Yeezys, this dataset focuses in on a subset of Off-White sneaker models known as "The Ten". This is a set of ten different shoes released by Nike over a period of several months. The sneakers that carry these brand labels represent some of the most sought after kicks in the world, selling out in stores and online almost instantly upon release.
After conducting some research on StockX's website, I found that StockX's revenue stream comes primarily from a 3% payment processing fee and a regressive transaction fee (i.e. the more a reseller sells on StockX, the lower their fee per item is). From a revenue standpoint, it is in StockX's best interest to foster sales of shoes with higher sale prices. If reasonably accurate predictions can be made on resale prices as they relate to retail prices, then StockX can make decisions promoting certain sneaker listings.
Importing needed libraries and reading in the spreadsheet directly from StockX's 2019 Data Challenge page. Note that this code will download the dataset in the form of an .xlsx spreadsheet in the directory from which this notebook is being run.
#Importing the necessary libraries
import pandas as pd
from statistics import mean, median, mode, stdev
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import matplotlib.patches as mpatch
import matplotlib.dates as mdate
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from regressors import stats as stats_reg
from sklearn.ensemble import RandomForestRegressor
import requests
#Reading in the data directly from the StockX Data Challenge 2019 webpage
url = "https://s3.amazonaws.com/stockx-sneaker-analysis/wp-content/uploads/2019/02/StockX-Data-Contest-2019-3.xlsx"
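my_file = requests.get(url)
my_file.raise_for_status() #Stop early if the download failed
#Writing the downloaded bytes to disk; the with block closes the file automatically
with open('StockX-Data-Contest-2019-3.xlsx', 'wb') as output:
    output.write(my_file.content)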
#Save the contents of the excel file into variable: stockx_data
stockx_data = pd.read_excel("StockX-Data-Contest-2019-3.xlsx", sheet_name = 1)
Here we are looking at some basic characteristics and quality of the dataset. Data cleaning was not a significant hurdle in this project because the dataset provided by StockX was very clean.
#Taking a quick look at how the data is structured
stockx_data.head(n=5)
Below, we confirm that the dataset has the correct dimensions: 99,956 rows and 8 columns
#Getting the dimensions of the data
stockx_data.shape
From the output below, we can see that this data contains no missing values.
#Checking for missing values
stockx_data.isnull().sum()
We can see that the datatypes of the columns are congruent with the information they provide.
#Checking the data types of each column
stockx_data.dtypes
In the Brand variable, Yeezys were indicated by " Yeezy", making analysis more difficult. In this code, I strip all leading and trailing white space in Brand to make the data just a bit cleaner.
#Stripping all leading and trailing white space in the Brand column
stockx_data["Brand"] = stockx_data["Brand"].apply(lambda x: x.strip())
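The same cleanup can also be written with pandas' vectorized string accessor instead of a lambda. Here is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from the StockX file.

```python
import pandas as pd

# Toy stand-in for the Brand column (illustrative values only)
toy = pd.DataFrame({"Brand": [" Yeezy", "Off-White", " Yeezy "]})

# .str.strip() removes leading and trailing white space from every value at once
toy["Brand"] = toy["Brand"].str.strip()

# After stripping, only two distinct levels remain
brand_levels = sorted(toy["Brand"].unique())
print(brand_levels)
```

On the real data, the equivalent call would be applied to stockx_data["Brand"].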
99,956 rows and 8 columns
No missing values
Order Date: datetime, Date the transaction occurred
Brand: string, Brand of the sneaker
Sneaker Name: string, Name of the sneaker
Sale Price: numpy float, Price paid for sneaker in transaction (resale price)
Retail Price: numpy int, Retail price of sneaker in stores (release price)
Release Date: datetime, Date the sneaker dropped
Shoe Size: numpy float, Shoe size (most likely in men's)
Buyer Region: string, transaction state (most likely for purchaser)
Here we explore the variables and get a better feel for the dataset. Additionally, looking at the current variables will help us conceptualize additional features and a potential target for modeling.
First, let's take a look at Sale Price. The mean is 446.63 dollars, the median is 370.00 dollars, and the standard deviation is 255.98 dollars.
#Looking at mean and median of Sale Price: basicstats_saleprice
basicstats_saleprice = {}
basicstats_saleprice["Mean"] = mean(stockx_data["Sale Price"])
basicstats_saleprice["Median"] = median(stockx_data["Sale Price"])
basicstats_saleprice["Standard Deviation"] = stdev(stockx_data["Sale Price"])
basicstats_saleprice
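As an aside, pandas can produce the same three summary statistics in a single call. The sketch below uses made-up prices rather than the StockX data, and checks that the pandas results agree with the statistics module used above.

```python
import pandas as pd
from statistics import mean, median, stdev

# Made-up prices for illustration only
prices = pd.Series([220.0, 300.0, 370.0, 450.0, 1100.0], name="Sale Price")

# .agg() computes all three statistics at once; "std" uses the sample
# (n - 1) formula, the same one statistics.stdev uses
summary = prices.agg(["mean", "median", "std"])
print(summary)

# Reference values from the statistics module for comparison
reference = {
    "mean": mean(prices.tolist()),
    "median": median(prices.tolist()),
    "std": stdev(prices.tolist()),
}
```

On a dataset of roughly 100,000 rows, the pandas version is also likely to run noticeably faster, since the statistics functions loop over the values in pure Python.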
Looking at the histogram below, it is clear that Sale Price is quite right skewed.
#Plotting the distribution of Sale Price
plt.hist(x=stockx_data["Sale Price"], bins=100);
plt.xlabel("Sale Price")
plt.ylabel("Count")
plt.title("Histogram of Sale Price");
This box and whisker plot demonstrates that Sale Price has quite a few high outliers. Outliers in this context are defined as points that lie more than 1.5 times the inter-quartile range below the first quartile or above the third quartile.
#Box and whisker plot of Sale Price
sns.boxplot(x = "Sale Price", data= stockx_data);
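To make the 1.5 times inter-quartile range rule concrete, here is a small worked example. It is only a sketch on made-up numbers, not the Sale Price column, and it uses the standard library so the arithmetic is easy to follow.

```python
from statistics import quantiles

# Made-up values for illustration; 100 is an obvious high outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Quartiles with linear interpolation between neighboring data points
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Limits under the 1.5 * IQR rule; points outside them are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)
```

Here the quartiles are 3 and 7, so the inter-quartile range is 4 and anything below -3 or above 13 counts as an outlier.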
Next, let's take a look at Retail Price. The mean is 208.61 dollars, the median is 220.00 dollars, and the standard deviation is 25.20 dollars. The lack of variability (exhibited by the small standard deviation) in Retail Price is due to manufacturers making all sneakers of a certain type cost the same regardless of colorway. For example, all Yeezys in the data retail for 220 dollars, regardless of how much they resell for on StockX.
#Looking at mean and median of Retail Price: basicstats_retailprice
basicstats_retailprice = {}
basicstats_retailprice["Mean"] = mean(stockx_data["Retail Price"])
basicstats_retailprice["Median"] = median(stockx_data["Retail Price"])
basicstats_retailprice["Standard Deviation"] = stdev(stockx_data["Retail Price"])
basicstats_retailprice
Now let's take a look at Shoe Size. Though not explicitly stated, these are most likely men's shoe sizes. Unlike Sale Price and Retail Price, Shoe Size is discrete. The mean is 9.34, the median is 9.5, and the mode is 10.
#Looking at mean and median of Shoe Size: basicstats_shoesize
basicstats_shoesize = {}
basicstats_shoesize["Mean"] = mean(stockx_data["Shoe Size"])
basicstats_shoesize["Median"] = median(stockx_data["Shoe Size"])
basicstats_shoesize["Mode"] = mode(stockx_data["Shoe Size"])
basicstats_shoesize
Below is a histogram of Shoe Size. The distribution looks sensible, as the middle sizes (8-12) have the highest frequencies.
#Plotting the distribution of shoe size.
plt.hist(x=stockx_data["Shoe Size"], bins=20);
plt.xlabel("Shoe Size")
plt.ylabel("Count")
plt.title("Histogram of Shoe Size");
This box and whisker plot shows that there are not a lot of outliers in Shoe Size.
#Box and whisker plot of Shoe Size.
sns.boxplot(x = "Shoe Size", data= stockx_data);
Now let's take a closer look at the levels and frequencies for the categorical variables. Brand contains two levels: "Yeezy" and "Off-White". 72,162 of the values for Brand are "Yeezy", while the remaining 27,794 are "Off-White", confirming that StockX's data is aligned with their website's description of it.
#Looking at Brand levels and value counts
stockx_data["Brand"].value_counts().to_frame()
Next up is Sneaker Name. There are many levels to this variable, but the five most frequently ordered sneakers are all Yeezy Boost 350 V2s in the Butter, Beluga 2.0, Zebra, Blue-Tint and Cream-White colorways, in that order. The remaining shoes and frequencies can be found from the output of the code below.
#looking at Sneaker Name levels and counts
stockx_data["Sneaker Name"].value_counts().to_frame().head(n=5)
Last but not least, here is Buyer Region. There are 50 levels, one per state. California and New York are overwhelmingly the most common values in Buyer Region. The remaining states and frequencies can be found in the output below.
#Looking at Buyer Region levels and counts: California and New York make up a significant portion of total purchases,
#with New York more than double Oregon sitting in 3rd place
stockx_data["Buyer Region"].value_counts().to_frame().head(n=5)
Now that we have taken a brief look at all non-time variables present in the dataset, we can now examine relationships between variables.
The graph below is a visual representation of Retail Price and Sale Price over Shoe Size. Here, it's easy to see that Sale Price and Retail Price maintain a consistent relationship between sizes 4 and 14.5, but Sale Price goes up at sizes 15 and above. This is most likely occurring due to the relatively low number of observations at these extreme sizes.
#Graphical representation of Mean Sale Price and Mean Retail Price by Shoe Size
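plt.plot(stockx_data.groupby('Shoe Size')['Sale Price'].mean(), label="Mean Sale Price")
plt.plot(stockx_data.groupby('Shoe Size')['Retail Price'].mean(), label="Mean Retail Price")
plt.xlabel("Shoe Size")
plt.ylabel("Mean Price")
plt.title("Shoe Size vs Mean Sale Price and Mean Retail Price")
plt.legend() #Labeling the two lines so they can be told apart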
plt.xticks(np.arange(min(stockx_data["Shoe Size"]), max(stockx_data["Shoe Size"])+1,0.5),rotation = 60);
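One way to check the explanation about sparse data at extreme sizes is to put the number of sales next to each mean. The sketch below does this on a made-up frame; the sizes and prices are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; size 16 has a single, expensive sale
toy = pd.DataFrame({
    "Shoe Size": [9.0, 9.0, 9.0, 10.0, 10.0, 16.0],
    "Sale Price": [300.0, 320.0, 340.0, 310.0, 330.0, 900.0],
})

# Mean Sale Price alongside the number of sales behind each mean
by_size = toy.groupby("Shoe Size")["Sale Price"].agg(["mean", "count"])
print(by_size)
```

A mean backed by only a handful of sales can swing widely, so a small count at a given size is a sign to read that part of the line with caution.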
The graph below plots Mean and Median Sale Price over time. In general, Sale Price seems to be decreasing over the time span of the data. There are some steep increases, usually marked by the releases of individual sneakers with high demand. For example, there is a massive spike in Sale Price around June 2018. This corresponds with the pre-release period of the Off-White Air Jordan 1 in the UNC colorway. Some individual pairs sold for nearly 4,000 dollars, driving up the Sale Price at that time. The steep drops are most likely due to a very hype sneaker selling out soon after release on StockX. If no one has a pair of a very sought after shoe, then no transactions can take place involving said shoe. This results in a drop in mean Sale Price for the dataset as a whole.
#Overlaid plot of Mean vs Median Sale Price over Order Date
fig = plt.plot(stockx_data.groupby('Order Date')['Sale Price'].mean(), color = '#ABCDEF', alpha = 1)
plt.plot(stockx_data.groupby('Order Date')['Sale Price'].median(), color = '#FEDCBA', alpha = 1)
plt.xlabel("Order Date")
plt.ylabel("Sale Price")
plt.title("Mean and Median Sale Price over Time (Order Date)")
blue_patch, orange_patch = mpatch.Patch(color = '#ABCDEF', label = "Mean Sale Price"), \
mpatch.Patch(color = '#FEDCBA', label = "Median Sale Price")
plt.legend(handles = [blue_patch, orange_patch])
plt.tick_params(axis='x',labelbottom = True)
plt.xticks(rotation = 45)
plt.rcParams["grid.alpha"] = 0.1
plt.grid(True)
Looking at the boxplot below, it is clear that Off-Whites have a higher median and greater spread with regards to Sale Price. Both brands of shoes have a lot of high outliers, but Off-White outliers extend to higher dollar amounts.
#Boxplot of Brand vs Sale Price
sns.boxplot(x = "Brand", y = "Sale Price", data=stockx_data);
This overlaid histogram below shows that a greater volume of Yeezys are sold at lower prices, while Off-Whites sell at a lower volume for higher prices.
#Overlaid histogram of Sale Price broken down by Brand
bins = 100
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Yeezy","Sale Price"], bins, alpha=0.5, label='Yeezy')
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Off-White","Sale Price"], bins, alpha=0.5, label='Off-White')
plt.legend(loc='upper right')
plt.xlabel("Sale Price")
plt.ylabel("Count")
plt.title("Histogram of Sale Price By Brand")
plt.show()
The table below shows the average Sale Price for the first five states in alphabetical order. Across all states, Delaware has the highest average Sale Price, while Wyoming has the lowest.
stockx_data.groupby("Buyer Region", as_index=False)["Sale Price"].mean().head(n=5)
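To see the highest and lowest states directly, the grouped means can be sorted before viewing them. This is a minimal sketch on made-up rows; the prices are not the real state averages.

```python
import pandas as pd

# Made-up rows; the prices are illustrative only
toy = pd.DataFrame({
    "Buyer Region": ["Alabama", "Alabama", "Delaware", "Wyoming", "Wyoming"],
    "Sale Price": [400.0, 420.0, 600.0, 300.0, 320.0],
})

# Average Sale Price per state, sorted from highest to lowest
state_means = (
    toy.groupby("Buyer Region")["Sale Price"]
    .mean()
    .sort_values(ascending=False)
)

highest_state = state_means.index[0]
lowest_state = state_means.index[-1]
print(state_means)
```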
The code output below shows the proportion of transactions attributed to each level of Buyer Region, or state.
#Getting each state's proportion of transactions
stockx_data["Buyer Region"].value_counts().to_frame().head(n=5)/stockx_data.shape[0]
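The division by the row count can also be handled by value_counts itself through its normalize argument. A small sketch on made-up regions:

```python
import pandas as pd

# Made-up buyer regions for illustration only
regions = pd.Series(
    ["California"] * 4 + ["New York"] * 3 + ["Oregon"] * 2 + ["Texas"],
    name="Buyer Region",
)

# normalize=True returns each level's share of all rows instead of a raw count
shares = regions.value_counts(normalize=True)
print(shares)
```

The shares sum to 1, which makes it easy to read off which states fall above or below a cutoff such as 5%.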
Sometimes the variables in your dataset do not capture the complete story. Feature engineering, or creating new variables based on ones you do have, can allow for clearer or more interesting insights and pave the way for useful models.
With potential modeling applications in mind, here is a breakdown of some additional features I want to create:
date_diff: Elapsed time between Order Date and Release Date. This allows cross comparisons of different sneakers over the same point in time after release.
price_ratio: Ratio of Sale Price to Retail Price. This standardizes each shoe relative to its retail price and acts as a better indicator of the hype of any given sneaker.
jordan: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Jordan or not.
V2: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a V2 model or not.
blackcol: A binary indicator variable derived from Sneaker Name that indicates if the shoe has black in the colorway or not.
airmax90: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an Airmax 90 or not.
airmax97: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an Airmax 97 or not.
zoom: A binary indicator variable derived from Sneaker Name that indicates if the shoe has Zoom in the name or not.
presto: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Presto or not.
airforce: A binary indicator variable derived from Sneaker Name that indicates if the shoe is an AirForce or not.
blazer: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a Blazer or not.
vapormax: A binary indicator variable derived from Sneaker Name that indicates if the shoe is a VaporMax or not.
california: A binary indicator variable derived from Buyer Region that indicates if the buyer is from California or not.
new_york: A binary indicator variable derived from Buyer Region that indicates if the buyer is from New York or not.
oregon: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Oregon or not.
florida: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Florida or not.
texas: A binary indicator variable derived from Buyer Region that indicates if the buyer is from Texas or not.
other_state: A binary indicator variable derived from Buyer Region that indicates if the buyer is from any other state. The states in this category had <5% of transactions in the data.
brand2: A binary indicator variable derived from Brand for modeling. Yeezy = 1, Off-White = 0.
#Taking the difference between Order Date and Release Date to create a new column: date_diff
stockx_data["date_diff"] = (stockx_data['Order Date'] - stockx_data['Release Date'])/np.timedelta64(1,'D')
#Creating a new column containing the ratio of Sale Price and Retail Price: price_ratio
stockx_data["price_ratio"] = stockx_data["Sale Price"]/stockx_data["Retail Price"]
#Creating the jordan variable
stockx_data["jordan"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Jordan' in x.split("-") else 0)
#Creating the V2 variable
stockx_data["V2"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'V2' in x.split("-") else 0)
#Creating the blackcol variable
stockx_data["blackcol"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Black' in x.split("-") else 0)
#Creating the airmax90 variable
stockx_data["airmax90"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if '90' in x.split("-") else 0)
#Creating the airmax97 variable
stockx_data["airmax97"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if '97' in x.split("-") else 0)
#Creating the zoom variable
stockx_data["zoom"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Zoom' in x.split("-") else 0)
#Creating the presto variable
stockx_data["presto"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Presto' in x.split("-") else 0)
#Creating the airforce variable
stockx_data["airforce"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Force' in x.split("-") else 0)
#Creating the blazer variable
stockx_data["blazer"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'Blazer' in x.split("-") else 0)
#Creating the vapormax variable
stockx_data["vapormax"] = stockx_data['Sneaker Name'].apply(lambda x : 1 if 'VaporMax' in x.split("-") else 0)
#Creating the california variable
stockx_data["california"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'California' in x else 0)
#Creating the new_york variable
stockx_data["new_york"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'New York' in x else 0)
#Creating the oregon variable
stockx_data["oregon"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Oregon' in x else 0)
#Creating the florida variable
stockx_data["florida"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Florida' in x else 0)
#Creating the texas variable
stockx_data["texas"] = stockx_data["Buyer Region"].apply(lambda x : 1 if 'Texas' in x else 0)
#Creating the other_state variable
above5pct_states = ["California", "New York", "Oregon", "Florida", "Texas"]
stockx_data["other_state"] = (~stockx_data["Buyer Region"].isin(above5pct_states)).astype(int)
#Creating the brand2 variable
stockx_data["brand2"] = stockx_data["Brand"].apply(lambda x : 1 if 'Yeezy' in x else 0)
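Since the Sneaker Name indicators above all follow the same pattern, they could also be generated from a single lookup table. The sketch below shows the idea on a toy frame; the sneaker names are illustrative examples in the same hyphen-separated style, and the token_flags dictionary is a hypothetical helper that does not appear elsewhere in this notebook.

```python
import pandas as pd

# Illustrative sneaker names in the hyphen-separated style of the dataset
toy = pd.DataFrame({
    "Sneaker Name": [
        "Adidas-Yeezy-Boost-350-V2-Zebra",
        "Air-Jordan-1-Retro-High-Off-White-Chicago",
        "Nike-Air-Presto-Off-White-Black-2018",
    ]
})

# Hypothetical lookup: new column name -> token that must appear as a whole
# hyphen-separated piece of the name
token_flags = {"jordan": "Jordan", "V2": "V2", "blackcol": "Black", "presto": "Presto"}

# One pass per indicator; t=token pins the current token inside the lambda
for column, token in token_flags.items():
    toy[column] = toy["Sneaker Name"].apply(
        lambda name, t=token: 1 if t in name.split("-") else 0
    )

print(toy[list(token_flags)])
```

Keeping the tokens in one place makes a mismatch between a column name and its token easier to spot.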
Let's take a look at the new version of stockx_data with the additional features that we have created.
stockx_data.head(n = 3)
#Run this code if you want to save the altered stockx_data as full_features.csv. I did this to generate the accompanying
#Tableau dashboard.
#pd.DataFrame.to_csv(stockx_data, "C:/.........../full_features.csv")
The first new feature (and most interesting) is price_ratio. The mean is 2.25, the median is 1.70, and the standard deviation is 1.5. For clarity, the transactions in this data have a Sale Price that is 2.25 times the Retail Price on average. I found that price_ratio was the best way to adjust for differing Retail Price between different sneaker models. Note that price_ratio cannot be less than or equal to 0 because no one gets free or negatively priced shoes on StockX.
#Looking at mean and median of price_ratio
basicstats_priceratio = {}
basicstats_priceratio["Mean"] = mean(stockx_data["price_ratio"])
basicstats_priceratio["Median"] = median(stockx_data["price_ratio"])
basicstats_priceratio["Standard Deviation"] = stdev(stockx_data["price_ratio"])
basicstats_priceratio
The following is a box and whisker plot of price_ratio. Not surprisingly, it looks quite similar to the same plot of Sale Price, but the outliers have become compressed into a smaller range due to the transformation.
#Boxplot of price_ratio.
sns.boxplot(x=stockx_data["price_ratio"]);
Outliers sometimes pose problems for modeling. After examining the outliers in price_ratio, I found that almost all of them were Off-White Air Jordan 1. Since there is a common pattern in the outliers, they should not be tampered with.
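A check like the one described above can be done by filtering for rows above the upper boxplot limit and counting sneaker names. The following is only a sketch on made-up data with placeholder names ("Shoe-A", "Shoe-B"), not the actual inspection I ran.

```python
import pandas as pd

# Made-up data; "Shoe-A" and "Shoe-B" are placeholder names
toy = pd.DataFrame({
    "Sneaker Name": ["Shoe-A"] * 6 + ["Shoe-B"] * 2,
    "price_ratio": [1.2, 1.4, 1.5, 1.6, 1.8, 2.0, 9.0, 10.0],
})

# Upper limit under the 1.5 * IQR rule
q1 = toy["price_ratio"].quantile(0.25)
q3 = toy["price_ratio"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Which sneakers account for the high outliers, and how often
outlier_names = toy.loc[toy["price_ratio"] > upper_fence, "Sneaker Name"].value_counts()
print(outlier_names)
```

If a single model dominates the counts, as "Shoe-B" does here, the outliers reflect a real pattern rather than noise.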
The histogram below plots the distribution of price_ratio separated by Brand. Here it is very easy to see how the distribution of price_ratio differs between Yeezys and Off-Whites.
#Overlaid histogram of price_ratio broken down by Brand.
bins = 100
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Yeezy","price_ratio"], bins, alpha=0.5, label='Yeezy')
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Off-White","price_ratio"], bins, alpha=0.5, label='Off-White')
plt.legend(loc='upper right')
plt.xlabel("price_ratio")
plt.ylabel("Count")
plt.title("Histogram of Price Ratio By Brand")
plt.show()
The output below is a table of average price_ratio by Buyer Region. They generally range from 1.5 to 2.5, meaning that per state, the average sneaker Sale Price is 1.5 to 2.5 times its Retail Price.
stockx_data.groupby("Buyer Region", as_index=False)["price_ratio"].mean().head(n=5)
Now that we have thoroughly explored price_ratio, it is time to take a look at date_diff, or the number of days the transaction occurred after the Release Date of the sneaker. The mean is 183.71 days, the median is 56 days, the standard deviation is 232.35 days, and the range is -69 days to 1321 days. A negative date_diff value indicates the shoe was purchased prior to Release Date in a pre-order. It is also important to note that several sneakers have release dates that predate data collection, which leaves a gap in the data for those sneakers between their release and 9/1/2017 (the start of data collection).
#Basic statistics of the difference between Order Date and Release Date
basicstats_datediff = {}
basicstats_datediff["Mean"] = mean(stockx_data["date_diff"])
basicstats_datediff["Median"] = median(stockx_data["date_diff"])
basicstats_datediff["Standard Deviation"] = stdev(stockx_data["date_diff"])
basicstats_datediff["Range"] = [min(stockx_data["date_diff"]), max(stockx_data["date_diff"])]
basicstats_datediff
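To illustrate how the sign of date_diff should be read, here is a small sketch with made-up dates. The pre_release column is a hypothetical helper used only in this example.

```python
import pandas as pd

# Made-up order and release dates for illustration only
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2018-06-01", "2018-06-30", "2018-09-01"]),
    "Release Date": pd.to_datetime(["2018-06-23", "2018-06-23", "2017-09-09"]),
})

# Whole days elapsed between release and order
toy["date_diff"] = (toy["Order Date"] - toy["Release Date"]).dt.days

# Hypothetical flag: a negative date_diff means the order came before release
toy["pre_release"] = toy["date_diff"] < 0
print(toy)
```

The first row is a pre-order placed 22 days before release, the second was ordered a week after release, and the third almost a year after.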
This overlaid histogram of date_diff separated by Brand shows that Yeezys have the largest variability in time between Release Date and Order Date, especially looking at the higher extremes. This makes sense, because data collection started at roughly the same time that "The Ten" (Off-White shoes) were being released. Yeezys, however, were released earlier than the data was collected, especially the original Yeezy Boost 350s. This leads to the long tail in the date_diff distribution for Yeezys.
#Overlaid histograms of the elapsed time between Order and Release dates broken down by brand.
bins = 100
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Yeezy","date_diff"], bins, alpha=0.5, label='Yeezy')
plt.hist(stockx_data.loc[stockx_data["Brand"] == "Off-White","date_diff"], bins, alpha=0.5, label='Off-White')
plt.legend(loc='upper right')
plt.xlabel("Elapsed Time (Days)")
plt.ylabel("Count")
plt.title("Histogram of Elapsed Time By Brand")
plt.show()
The following density-scatter plot shows price_ratio as a function of date_diff. Here it is evident that Off-Whites exhibit higher price_ratio relative to the Yeezys at the same point in time after Release Date. The reference line is drawn at price_ratio = 1 for scale.
#Creating a density-esque scatter plot of price_ratio vs date_diff separated by Brand
fig = sns.scatterplot(x= "date_diff", y= "price_ratio", data=stockx_data, hue= "Brand",s=8, alpha=0.03);
fig.hlines(1, -70,1300);
Based on our exploration of the existing and engineered features, further understanding and modeling price_ratio may be in the best interest of StockX. Using price_ratio balances the needs of the company and consumers as opposed to using Sale Price alone. For StockX, price_ratio predictions could easily be converted into revenue by multiplying each predicted ratio by the shoe's Retail Price and then calculating fees on the resulting Sale Price. For buyers on the platform, using price_ratio allows their voice to be heard in the selective promoting process. Buyers have to accept the resale prices, and price_ratio better captures how much consumers want a specific shoe than Sale Price alone. Because it balances StockX's goal of maximizing revenue and matching sneakerhead demands, I believe that price_ratio is the best target variable for modeling. Check out the modeling notebook if you are interested in seeing what models I try and their results!
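As a rough sketch of how a price_ratio prediction could be turned into fee revenue: the 3% payment processing fee comes from the research described earlier, but the transaction fee rate below is a hypothetical placeholder, since the actual rate depends on the seller. The function name is also just an illustration.

```python
# The 3% payment processing fee is the figure cited earlier in this notebook
PAYMENT_PROCESSING_RATE = 0.03
# Hypothetical placeholder: the real transaction fee varies by seller
ASSUMED_TRANSACTION_FEE_RATE = 0.10


def estimated_fee_revenue(predicted_price_ratio, retail_price,
                          transaction_fee_rate=ASSUMED_TRANSACTION_FEE_RATE):
    """Convert a predicted price_ratio into an estimated fee amount in dollars."""
    # price_ratio = Sale Price / Retail Price, so multiplying recovers Sale Price
    predicted_sale_price = predicted_price_ratio * retail_price
    total_rate = PAYMENT_PROCESSING_RATE + transaction_fee_rate
    return predicted_sale_price * total_rate


# A shoe retailing at 220 dollars with a predicted price_ratio of 2.0
fee = estimated_fee_revenue(2.0, 220)
print(round(fee, 2))
```

Under these assumed rates, a predicted Sale Price of 440 dollars would yield about 57 dollars in fees.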
In this analysis, we used sneaker transaction data from StockX to understand what makes certain sneakers hype. We found that Off-Whites are, in fact, more hype than Yeezys. StockX can use these results to make decisions on which sneakers to promote to buyers based on how they want to balance maximal revenue with buyer interests.