In this project, we will analyze a dataset containing various customers' annual spending amounts (reported in monetary units) across diverse product categories, looking for internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.
The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis, with focus instead on the six product categories recorded for customers.
Exploring the Data
As always, we will start by importing the libraries we will use, reading in the data, and taking a look at our dataset.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualizations code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")
Wholesale customers dataset has 440 samples with 6 features each.
Data Exploration
In this section, we will begin exploring the data through visualizations and code to understand how each feature is related to the others. Let's first observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which we will track throughout the project.
Note that the dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'.
# Display a description of the dataset
display(data.describe())
|       | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| ----- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| count | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 |
| mean  | 12000.297727 | 5796.265909 | 7951.277273 | 3071.931818 | 2881.493182 | 1524.870455 |
| std   | 12647.328865 | 7380.377175 | 9503.162829 | 4854.673333 | 4767.854448 | 2820.105937 |
| min   | 3.000000 | 55.000000 | 3.000000 | 25.000000 | 3.000000 | 3.000000 |
| 25%   | 3127.750000 | 1533.000000 | 2153.000000 | 742.250000 | 256.750000 | 408.250000 |
| 50%   | 8504.000000 | 3627.000000 | 4755.500000 | 1526.000000 | 816.500000 | 965.500000 |
| 75%   | 16933.750000 | 7190.250000 | 10655.750000 | 3554.250000 | 3922.000000 | 1820.250000 |
| max   | 112151.000000 | 73498.000000 | 92780.000000 | 60869.000000 | 40827.000000 | 47943.000000 |
Implementation: Selecting Samples
To get a better understanding of the customers and how their data will transform through the analysis, we will select a few sample data points and explore them in more detail throughout the project.
# Select three indices of your choice you wish to sample from the dataset
indices = [22,154,398]
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
# look at percentile ranks
pcts = 100. * data.rank(axis=0, pct=True).iloc[indices].round(decimals=3)
# visualize percentiles with heatmap
sns.heatmap(pcts.reset_index(drop=True), annot=True, vmin=1, vmax=99, fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' category spending')
plt.xticks(rotation=45, ha='center');
Chosen samples of wholesale customers dataset:
|   | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| - | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 0 | 31276 | 1917 | 4469 | 9408 | 2381 | 4334 |
| 1 | 622 | 55 | 137 | 75 | 7 | 8 |
| 2 | 11442 | 1032 | 582 | 5390 | 74 | 247 |
Samples:

- 0: This customer ranks above the 90th percentile for annual spending in the 'Fresh', 'Frozen', and 'Delicatessen' categories. These features, along with above-average spending on 'Detergents_Paper', could lead us to believe this customer is a market. Markets generally put an emphasis on having a large variety of fresh foods available and often contain a delicatessen or deli.
- 1: On the opposite end of the spectrum, this customer ranks in the bottom 10th percentile across all product categories. Its highest-ranking category is 'Fresh', which might suggest it is a small cafe or similar.
- 2: Our last customer spends a lot in the 'Fresh' and 'Frozen' categories, but more so in the latter relative to other customers. I would suspect this is a wholesale retailer because of the focus on 'Fresh' and 'Frozen' foods.
Implementation: Feature Relevance
One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
# Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop('Milk', axis=1)
# Split the data into training and testing sets (0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Milk'], test_size=0.25, random_state=1)
# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
# Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
print(score)
0.515849943807
As you can see, we attempted to predict 'Milk' using the other features in the dataset, and the score came out to roughly 0.516. At this initial stage we might say that this feature is somewhat difficult to predict from the others: the model explains only about half of the variance in 'Milk'. Remember that R^2 tops out at 1 (a perfect fit) and can even be negative for a model that fits worse than simply predicting the mean. Since 'Milk' is not well predicted by the remaining features, it carries information of its own, which might indicate that it is an important feature to keep.
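To put that score in context, we can run the same experiment for each of the six categories in turn. This is a quick sketch of our own (not part of the original analysis), reusing the same estimator and split from above:

# Try to predict each feature from the remaining five (illustrative sketch)
for feature in data.columns:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(feature, axis=1), data[feature], test_size=0.25, random_state=1)
    reg = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
    print("{:<17} R^2 = {:.3f}".format(feature + ':', reg.score(X_test, y_test)))

Features with low (or negative) scores are poorly explained by the rest and are therefore worth keeping; features with high scores are largely redundant.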
Visualize Feature Distributions
To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If we found that the feature we attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if we believed that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data. Let's run the code block below to produce a scatter matrix and take a look.
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
'Milk' showed some signs of correlation with about half of the features it was compared to, which aligns with our earlier prediction. The pair of features with the highest correlation is 'Detergents_Paper' and 'Grocery', which intuitively makes sense, as many people shop for both when they go "grocery shopping." One other visible point to note is how many of the points sit near 0 for features compared against 'Delicatessen'. The data for all of these features are right-skewed, with many points at or near the origin and long tails.
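Those visual impressions can be checked numerically with the correlation matrix; a one-off sketch using the imports from earlier:

# Pairwise Pearson correlations between the six product categories
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='YlGnBu')
plt.title('Feature correlations');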
Data Preprocessing
Now we will start to preprocess the data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in assuring that the results we obtain from our analysis are significant and meaningful.
Implementation: Feature Scaling
If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach, which works well in most cases, is applying the natural logarithm, which is what we will do below.
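For reference, here is a minimal sketch of what the Box-Cox route could look like, assuming scipy is available (the analysis below sticks with the simpler natural log):

from scipy import stats

# Box-Cox estimates, per feature, the power transform that best reduces skew
# (sketch only -- we proceed with the natural logarithm instead)
boxcox_data = pd.DataFrame(
    {col: stats.boxcox(data[col])[0] for col in data.columns},
    index=data.index)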
# Scale the data using the natural logarithm
log_data = np.log(data.copy())
# Scale the sample data using the natural logarithm
log_samples = np.log(samples)
# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
Observation
After applying a natural logarithm scaling to the data, the distribution of each feature appears much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
# Let's compare the original sample data to the log-transformed sample data
print("Original chosen samples of wholesale customers dataset:")
display(samples)
# Display the log-transformed sample data
print("Log-transformed samples of wholesale customers dataset:")
display(log_samples)
Original chosen samples of wholesale customers dataset:

|   | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| - | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 0 | 31276 | 1917 | 4469 | 9408 | 2381 | 4334 |
| 1 | 622 | 55 | 137 | 75 | 7 | 8 |
| 2 | 11442 | 1032 | 582 | 5390 | 74 | 247 |

Log-transformed samples of wholesale customers dataset:

|   | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| - | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 0 | 10.350606 | 7.558517 | 8.404920 | 9.149316 | 7.775276 | 8.374246 |
| 1 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
| 2 | 9.345046 | 6.939254 | 6.366470 | 8.592301 | 4.304065 | 5.509388 |
Implementation: Outlier Detection
Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take these data points into consideration. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR), and a data point with a feature value beyond an outlier step outside of the IQR for that feature is considered abnormal.
# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    # Use the interquartile range to calculate an outlier step (1.5 times the IQR)
    step = (Q3 - Q1) * 1.5
    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])
# Select the indices for data points you wish to remove
outliers = [66, 75, 338, 142, 154, 289]
# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
Data points considered outliers for the feature 'Fresh':

|     | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| --- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 66  | 2.197225 | 7.335634 | 8.911530 | 5.164786 | 8.151333 | 3.295837 |
| 95  | 1.098612 | 7.979339 | 8.740657 | 6.086775 | 5.407172 | 6.563856 |
| 96  | 3.135494 | 7.869402 | 9.001839 | 4.976734 | 8.262043 | 5.379897 |
| 218 | 2.890372 | 8.923191 | 9.629380 | 7.158514 | 8.475746 | 8.759669 |
| 338 | 1.098612 | 5.808142 | 8.856661 | 9.655090 | 2.708050 | 6.309918 |

Data points considered outliers for the feature 'Milk':

(no outliers were flagged for this feature)

Data points considered outliers for the feature 'Grocery':

|     | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| --- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 75  | 9.923192 | 7.036148 | 1.098612 | 8.390949 | 1.098612 | 6.882437 |

Data points considered outliers for the feature 'Frozen':

|     | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| --- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 38  | 8.431853 | 9.663261 | 9.723703 | 3.496508 | 8.847360 | 6.070738 |
| 65  | 4.442651 | 9.950323 | 10.732651 | 3.583519 | 10.095388 | 7.260523 |
| 420 | 8.402007 | 8.569026 | 9.490015 | 3.218876 | 8.827321 | 7.239215 |

Data points considered outliers for the feature 'Detergents_Paper':

|     | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| --- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 75  | 9.923192 | 7.036148 | 1.098612 | 8.390949 | 1.098612 | 6.882437 |
| 122 | 9.410174 | 5.303305 | 5.501258 | 7.596392 | 3.218876 | 6.756932 |
| 142 | 10.519646 | 8.875147 | 9.018332 | 8.004700 | 2.995732 | 1.098612 |
| 154 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
| 161 | 9.428190 | 6.291569 | 5.645447 | 6.995766 | 1.098612 | 7.711101 |
| 177 | 9.453992 | 8.899731 | 8.419139 | 7.468513 | 2.995732 | 7.875119 |
| 204 | 7.578657 | 6.792344 | 8.561401 | 7.232010 | 1.609438 | 7.191429 |
| 237 | 9.835851 | 8.252707 | 6.385194 | 8.441176 | 3.332205 | 7.102499 |
| 289 | 10.663966 | 5.655992 | 6.154858 | 7.235619 | 3.465736 | 3.091042 |
| 338 | 1.098612 | 5.808142 | 8.856661 | 9.655090 | 2.708050 | 6.309918 |
| 356 | 10.029503 | 4.897840 | 5.384495 | 8.057377 | 2.197225 | 6.306275 |
| 402 | 10.186371 | 8.466531 | 8.535230 | 5.393628 | 2.302585 | 5.828946 |

Data points considered outliers for the feature 'Delicatessen':

|     | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| --- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 66  | 2.197225 | 7.335634 | 8.911530 | 5.164786 | 8.151333 | 3.295837 |
| 109 | 7.248504 | 9.724899 | 10.274568 | 6.511745 | 6.728629 | 1.098612 |
| 128 | 4.941642 | 9.087834 | 8.248791 | 4.955827 | 6.967909 | 1.098612 |
| 137 | 8.034955 | 8.997147 | 9.021840 | 6.493754 | 6.580639 | 3.583519 |
| 142 | 10.519646 | 8.875147 | 9.018332 | 8.004700 | 2.995732 | 1.098612 |
| 154 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
| 184 | 5.789960 | 6.822197 | 8.457443 | 4.304065 | 5.811141 | 2.397895 |
| 187 | 7.798933 | 8.987447 | 9.192075 | 8.743372 | 8.148735 | 1.098612 |
| 203 | 6.368187 | 6.529419 | 7.703459 | 6.150603 | 6.860664 | 2.890372 |
| 233 | 6.871091 | 8.513988 | 8.106515 | 6.842683 | 6.013715 | 1.945910 |
| 285 | 10.602965 | 6.461468 | 8.188689 | 6.948897 | 6.077642 | 2.890372 |
| 289 | 10.663966 | 5.655992 | 6.154858 | 7.235619 | 3.465736 | 3.091042 |
There were a handful of rows flagged as outliers in more than one feature (indices 66, 75, 142, 154, 289, and 338). I chose to remove these rows because a row showing up as an outlier for multiple features adds to our confidence that it is truly an outlier.
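Those duplicated indices can also be collected programmatically rather than read off the tables; a small sketch that re-runs the Tukey test per feature and counts the flags:

from collections import Counter

# Count how many features flag each data point as an outlier
counts = Counter()
for feature in log_data.keys():
    Q1, Q3 = np.percentile(log_data[feature], [25, 75])
    step = (Q3 - Q1) * 1.5
    counts.update(log_data[(log_data[feature] < Q1 - step) |
                           (log_data[feature] > Q3 + step)].index)

print("Flagged for more than one feature:",
      sorted(i for i, c in counts.items() if c > 1))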
Feature Transformation
In this section we will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.
Implementation: PCA
Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can go ahead and apply PCA to the good_data to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension: how much variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space; however, it is a composition of the original features present in the data.
from sklearn.decomposition import PCA
# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA(n_components=6)
pca.fit(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
Note: A positive increase in a specific dimension corresponds with an increase of the positive-weighted features and a decrease of the negative-weighted features. The rate of increase or decrease is based on the individual feature weights.
From our visualizations above we can see that the first and second principal components alone explain 72.16% of the variance, and the first four principal components together explain 93.43% of the variance.
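These figures can be read straight off the fitted PCA object; for example:

# Cumulative explained variance for the six principal components
print(pca.explained_variance_ratio_.cumsum())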
- In the first dimension, the features 'Milk', 'Grocery', and 'Detergents_Paper' are primarily grouped together, which might indicate that this dimension represents purchases by retailers.
- The second dimension pairs the 'Fresh', 'Frozen', and 'Delicatessen' features together, which could indicate it represents a restaurant or similar establishment.
- In the third dimension we observe 'Frozen' and 'Delicatessen' weighted positively, with 'Fresh' and 'Detergents_Paper' weighted negatively. This dimension might represent something like a convenience store.
- Finally, the fourth dimension pairs 'Fresh' and 'Delicatessen' negatively and 'Frozen' and 'Detergents_Paper' positively. This component could represent a gas station, for example.
Observation
Let's run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions. Observe the numerical values for the first four dimensions of the sample points.
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
|   | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 |
| - | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0 | 0.1039 | 2.8019 | 0.0746 | 0.4443 | 1.3751 | 0.1083 |
| 1 | 6.2001 | 7.0196 | 0.9409 | 0.7406 | 0.3790 | 0.0659 |
| 2 | 3.7970 | 0.2673 | 0.4271 | 0.8460 | 0.1136 | 0.5462 |
Implementation: Dimensionality Reduction
When using principal component analysis, one of the main goals is to reduce the dimensionality of the data, in effect reducing the complexity of the problem. Dimensionality reduction comes at a cost: fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2)
pca.fit(good_data)
# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
Observation
Let's run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions. Observe how the values for the first two dimensions remain unchanged when compared to a PCA transformation in six dimensions.
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
|   | Dimension 1 | Dimension 2 |
| - | ----------- | ----------- |
| 0 | 0.1039 | 2.8019 |
| 1 | 6.2001 | 7.0196 |
| 2 | 3.7970 | 0.2673 |
Visualizing a Biplot
A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case 'Dimension 1' and 'Dimension 2'). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data and discover relationships between the principal components and original features.
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
Observation
Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery', and 'Detergents_Paper', but not so much on the other product categories.
Clustering
In this section, we can choose to use either a KMeans clustering algorithm or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data. Then, we will recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.
The advantages of the K-Means clustering algorithm are that it scales well to large sample sizes and always converges. However, it may converge to a local minimum rather than the global minimum, and therefore may need to be run multiple times. The advantages of the Gaussian Mixture Model, on the other hand, are that it is a fast algorithm and that, as a soft clustering method, it gives each point a probability of belonging to each cluster rather than a hard assignment. Similarly to K-Means, it can suffer from converging to a local minimum. Because the first two principal components explain 72.16% of the variance, we have a rough idea of how many clusters to expect, so we will use the K-Means clustering algorithm with 2 clusters, and we will also see how it performs with different numbers of clusters.
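For comparison, swapping in a Gaussian Mixture Model takes only a few lines; a sketch of the alternative we are not pursuing:

from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Fit a two-component GMM on the PCA-reduced data and score it, for comparison
gmm = GaussianMixture(n_components=2, random_state=1).fit(reduced_data)
print("GMM, 2 components: {:.4f}".format(
    silhouette_score(reduced_data, gmm.predict(reduced_data))))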
Implementation: Creating Clusters
Depending on the problem, the number of clusters that you expect to be in the data may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data, if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster, from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides a simple scoring method for a given clustering.
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
n_clusters = 2
# Apply your clustering algorithm of choice to the reduced data
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
clusterer.fit(reduced_data)
# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)
# Find the cluster centers
centers = clusterer.cluster_centers_
# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)
# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data, preds, random_state=1)
print("{0} clusters: {1:.4f}".format(n_clusters, score))
2 clusters: 0.4262
- 2 clusters: 0.4262
- 3 clusters: 0.4017
- 4 clusters: 0.3355
- 5 clusters: 0.3512
- 6 clusters: 0.3633
- 7 clusters: 0.3632
- 8 clusters: 0.3656
- 9 clusters: 0.3584
Our best silhouette score came from the model trained with two clusters, resulting in a score of 0.4262.
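The scores above were produced by repeating the fit over a range of cluster counts, roughly like this sketch (reusing the imports above):

# Score K-Means for several cluster counts (sketch of the loop behind the list above)
for n in range(2, 10):
    km = KMeans(n_clusters=n, random_state=1).fit(reduced_data)
    print(" {} clusters: {:.4f}".format(
        n, silhouette_score(reduced_data, km.labels_, random_state=1)))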
Cluster Visualization
Now that we've chosen the optimal number of clusters for our clustering algorithm using the scoring metric above, we can visualize the results by executing the code block below.
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)
Implementation: Data Recovery
Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.
# Inverse transform the centers
log_centers = pca.inverse_transform(centers)
# Exponentiate the centers
true_centers = np.exp(log_centers)
# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
|           | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| --------- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| Segment 0 | 3716.0 | 7983.0 | 12283.0 | 895.0 | 4638.0 | 974.0 |
| Segment 1 | 9167.0 | 1940.0 | 2494.0 | 2100.0 | 310.0 | 724.0 |
# Let's compare it with our original table
display(data.describe())
|       | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| ----- | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| count | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 |
| mean  | 12000.297727 | 5796.265909 | 7951.277273 | 3071.931818 | 2881.493182 | 1524.870455 |
| std   | 12647.328865 | 7380.377175 | 9503.162829 | 4854.673333 | 4767.854448 | 2820.105937 |
| min   | 3.000000 | 55.000000 | 3.000000 | 25.000000 | 3.000000 | 3.000000 |
| 25%   | 3127.750000 | 1533.000000 | 2153.000000 | 742.250000 | 256.750000 | 408.250000 |
| 50%   | 8504.000000 | 3627.000000 | 4755.500000 | 1526.000000 | 816.500000 | 965.500000 |
| 75%   | 16933.750000 | 7190.250000 | 10655.750000 | 3554.250000 | 3922.000000 | 1820.250000 |
| max   | 112151.000000 | 73498.000000 | 92780.000000 | 60869.000000 | 40827.000000 | 47943.000000 |

Our cluster 0 came in at the following percentile bands for each category: 'Fresh': 25-50%, 'Milk': 75-99%, 'Grocery': 75-99%, 'Frozen': 25-50%, 'Detergents_Paper': 75-99%, 'Delicatessen': 50-75%. This segment spent the most in the 'Grocery', 'Milk', and 'Detergents_Paper' categories, which makes it likely to represent establishments like supermarkets or big retailers.

Our cluster 1 had a very different spending profile: 'Fresh': 50-75%, 'Milk': 25-50%, 'Grocery': 25-50%, 'Frozen': 50-75%, 'Detergents_Paper': 25-50%, 'Delicatessen': 25-50%. This segment purchases an above-median amount in the 'Fresh' category but spends below the mean in every other category. This might mean it represents restaurants or large chain restaurants that require fresh ingredients multiple times a week.
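Those percentile bands can be computed rather than estimated by eye; a hypothetical helper (not part of the original notebook) that compares each recovered center against the full dataset:

# Percentage of customers spending less than each segment center, per category
center_pcts = true_centers.apply(lambda row: (data < row).mean() * 100., axis=1)
display(center_pcts.round(1))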
print("Chosen samples of wholesale customers dataset:")
display(samples)
# visualize percentiles with heatmap
sns.heatmap(pcts.reset_index(drop=True), annot=True, vmin=1, vmax=99, fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' category spending')
plt.xticks(rotation=45, ha='center');
Chosen samples of wholesale customers dataset:
|   | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
| - | ----- | ---- | ------- | ------ | ---------------- | ------------ |
| 0 | 31276 | 1917 | 4469 | 9408 | 2381 | 4334 |
| 1 | 622 | 55 | 137 | 75 | 7 | 8 |
| 2 | 11442 | 1032 | 582 | 5390 | 74 | 247 |
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
Sample point 0 predicted to be in Cluster 1
Sample point 1 predicted to be in Cluster 1
Sample point 2 predicted to be in Cluster 1
Sample point 0 is a heavy spender, particularly in the 'Fresh', 'Frozen', and 'Delicatessen' categories, which would definitely place it in our cluster 1.

Sample point 1 is a much smaller buyer, but we can see that what they do buy is mostly 'Fresh' and 'Frozen' foods, which would again put them in cluster 1, though farther out from the cluster center.

Sample point 2 is another medium-to-heavy spender, but they focus more on the 'Frozen' category than on 'Fresh' (in percentile terms) and spend very little in the other categories, unlike sample point 0. It makes sense, then, that sample point 2 was also classified into cluster 1.
Conclusion
Companies will often run A/B tests when making small changes to their products or services to determine whether that change will affect customers positively or negatively. Let's imagine that the wholesale distributor is considering changing its delivery service from the current 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively.
- How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?
- Can we assume the change affects all customers equally? How can we determine which group of customers it affects the most?
Because we have some domain knowledge and understand that some of the goods sold are perishable, we may hypothesize that customers who purchase perishable goods would react negatively to a less frequent delivery service. However, the customer segment that purchases goods that aren't perishable, or that have a longer shelf life, wouldn't mind the decreased delivery frequency and may even react positively, since we could pass some of the cost savings on to them.
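One way to act on this hypothesis is a segment-aware A/B test: randomly assign customers to the old and new schedules, then compare reactions within each segment. A sketch of the assignment step only, since the reaction metric here is hypothetical:

# Randomly assign each clustered customer to a delivery schedule (sketch)
rng = np.random.RandomState(1)
ab = pd.DataFrame({'segment': preds})
ab['schedule'] = rng.choice(['5 days/week', '3 days/week'], size=len(ab))

# A reaction metric (e.g. change in spending) would then be compared per segment
display(ab.groupby(['segment', 'schedule']).size())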
Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), we can consider 'customer segment' as an engineered feature for the data. Assume the wholesale distributor recently acquired ten new customers and each provided estimates for anticipated annual spending of each product category. Knowing these estimates, the wholesale distributor wants to classify each new customer to a customer segment to determine the most appropriate delivery service.
* How can the wholesale distributor label the new customers using only their estimated product spending and the customer segment data?
The wholesale distributor could let our clustering algorithm classify each of the new customers into a cluster, just as we did earlier with our 3 sample points. Another option is to use the existing data, with the predicted customer segment as the target variable, to train a supervised learning algorithm. We could then predict in advance what type of delivery service would best suit each customer's needs and also save on costs by delivering less frequently to customers who don't require it.
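Here is a sketch of the first option, pushing hypothetical new-customer estimates through the same log / PCA / cluster pipeline (the spending figures below are invented for illustration):

# Hypothetical annual spending estimates for two new customers
new_customers = pd.DataFrame(
    [[12000, 3500, 4500, 1500, 800, 900],
     [4000, 8000, 12000, 900, 4500, 1000]],
    columns=data.columns)

# Apply the same transformations used for the existing customers
new_log = np.log(new_customers)
new_segments = clusterer.predict(pca.transform(new_log))
print("Predicted segments:", new_segments)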
Visualizing Underlying Distributions
At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.
Let's run the code block below to see how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' in the reduced space.
# Display the clustering results based on 'Channel' data
vs.channel_results(reduced_data, outliers, pca_samples)
Overall, the clustering algorithm and the number of clusters we chose do a good job of classifying the points relative to the underlying distribution of Retail versus Hotel/Restaurant/Cafe customers. There are clearly a decent number of Hotel/Restaurant/Cafe points incorrectly classified into cluster 0, and far fewer incorrectly classified Retail points in cluster 1. There are no distinct groupings within these two segments, which makes sense because there is some overlap between the two segments, as well as variation within them, creating the spectrum we see. Still, I believe we can consider these results consistent with our previous definition of the customer segments, though the Hotel/Restaurant/Cafe segment appears to be a broad category and thus more diverse than we would originally expect.