In this project, we will analyze a dataset of customers' annual spending (reported in *monetary units*) across diverse product categories, looking for internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis, with the focus instead on the six product categories recorded for customers.

As always, we will start by importing the libraries we will use, reading in the data, and taking a look at our dataset.

```
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualizations code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")
```

`Wholesale customers dataset has 440 samples with 6 features each.`

In this section, we will begin exploring the data through visualizations and code to understand how each feature is related to the others. Let's first observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which we will track throughout the project.

Note that the dataset is composed of six important product categories: **'Fresh'**, **'Milk'**, **'Grocery'**, **'Frozen'**, **'Detergents_Paper'**, and **'Delicatessen'**.

```
# Display a description of the dataset
display(data.describe())
```

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-------|---------------|--------------|--------------|--------------|------------------|--------------|
| count | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 |
| mean | 12000.297727 | 5796.265909 | 7951.277273 | 3071.931818 | 2881.493182 | 1524.870455 |
| std | 12647.328865 | 7380.377175 | 9503.162829 | 4854.673333 | 4767.854448 | 2820.105937 |
| min | 3.000000 | 55.000000 | 3.000000 | 25.000000 | 3.000000 | 3.000000 |
| 25% | 3127.750000 | 1533.000000 | 2153.000000 | 742.250000 | 256.750000 | 408.250000 |
| 50% | 8504.000000 | 3627.000000 | 4755.500000 | 1526.000000 | 816.500000 | 965.500000 |
| 75% | 16933.750000 | 7190.250000 | 10655.750000 | 3554.250000 | 3922.000000 | 1820.250000 |
| max | 112151.000000 | 73498.000000 | 92780.000000 | 60869.000000 | 40827.000000 | 47943.000000 |
```

To get a better understanding of the customers and how their data will transform through the analysis, we will select a few sample data points and explore them in more detail throughout the project.

```
# Select three indices of your choice you wish to sample from the dataset
indices = [22,154,398]
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
# look at percentile ranks
pcts = 100. * data.rank(axis=0, pct=True).iloc[indices].round(decimals=3)
# visualize percentiles with heatmap
sns.heatmap(pcts.reset_index(drop=True), annot=True, vmin=1, vmax=99, fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' category spending')
plt.xticks(rotation=45, ha='center');
```

Chosen samples of wholesale customers dataset:

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|-------|------|---------|--------|------------------|--------------|
| 0 | 31276 | 1917 | 4469 | 9408 | 2381 | 4334 |
| 1 | 622 | 55 | 137 | 75 | 7 | 8 |
| 2 | 11442 | 1032 | 582 | 5390 | 74 | 247 |
```

**Samples:**

- 0: This customer ranks above the 90th percentile for annual spending in the Fresh, Frozen, and Delicatessen categories. Together with above-median spending on Detergents_Paper, this could lead us to believe the customer is a market. Markets generally put an emphasis on having a large variety of fresh foods available and often contain a delicatessen or deli.
- 1: On the opposite side of the spectrum, this customer ranks in the bottom 10th percentile across all product categories. Its highest-ranking category is 'Fresh', which might suggest it is a small cafe or similar.
- 2: Our last customer spends a lot in the Fresh and Frozen categories, but more so in the latter. I would suspect this is a wholesale retailer because of the focus on Fresh and Frozen foods.
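The percentile ranks behind these descriptions come from `DataFrame.rank(pct=True)`. A minimal sketch on a small made-up table shows what that call returns (the `toy` values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical spending table, for illustration only
toy = pd.DataFrame({"Fresh": [100, 400, 200, 300],
                    "Milk": [50, 10, 40, 20]})

# rank(pct=True) returns each value's rank divided by the number of rows,
# so the largest value in a column gets 1.0 (the 100th percentile)
toy_pcts = 100.0 * toy.rank(axis=0, pct=True)
print(toy_pcts)
```

Each column is ranked independently, which is why a customer can sit near the top in one category and near the bottom in another.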

One interesting question is whether one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then scoring how well that model can predict the removed feature.

```
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
# Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop('Milk', axis=1)
# Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Milk'], test_size=0.25, random_state=1)
# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
# Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
print(score)
```

`0.515849943807`

As you can see, we attempted to predict Milk from the other features in the dataset and the score came out at about 0.516. At this initial stage we might say that this feature is only partly predictable from the others. Remember that R^2 is at most 1 (a perfect fit), that 0 corresponds to always predicting the mean, and that it can be negative when the model does worse than that. A middling score like this suggests that Milk carries information the other features do not, so it could be an important feature to consider.
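For reference, the value returned by `regressor.score` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers, compares it with `sklearn.metrics.r2_score`, and includes a deliberately bad prediction to show that the score can drop below zero. The helper name `r2_by_hand` and all values are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 7.0, 9.5]   # close to the targets
bad_pred = [9.0, 7.0, 5.0, 3.0]    # reversed order, worse than predicting the mean

r2_good = r2_by_hand(y_true, good_pred)
r2_bad = r2_by_hand(y_true, bad_pred)
print(r2_good, r2_score(y_true, good_pred))
print(r2_bad, r2_score(y_true, bad_pred))
```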

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If we found that the feature we attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if we believed that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data. Let's run the code block below to produce a scatter matrix and take a look.

```
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
```

Milk showed some signs of correlation with about half of the features it was compared to, which aligns with our earlier prediction. The pair of features with the highest correlation is Detergents_Paper and Grocery, which intuitively makes sense as many people shop for both when they go "grocery shopping." One other visible point to note is how many of the points are around 0 for features compared to Delicatessen. The data for all of these features are right-skewed, with many points at or near the origin and long tails.
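The visual impression from a scatter matrix can be backed up with numbers using `DataFrame.corr()`. Here is a small sketch on synthetic data; the column names mirror the dataset, but the values are randomly generated and one column is constructed to track another, so the numbers say nothing about the real customers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
grocery = rng.lognormal(mean=8.0, sigma=1.0, size=500)
toy = pd.DataFrame({
    "Grocery": grocery,
    # Built to track Grocery, with multiplicative noise
    "Detergents_Paper": 0.4 * grocery * rng.lognormal(0.0, 0.3, size=500),
    # Generated independently of the other two columns
    "Fresh": rng.lognormal(mean=8.5, sigma=1.0, size=500),
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```

On the real data, the same call on `data` would give one coefficient per pair of product categories.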

Now we will start to preprocess the data to create a better representation of customers by scaling the data and detecting (and optionally removing) outliers. Preprocessing is often a critical step in ensuring that the results we obtain from our analysis are significant and meaningful.

If data is not normally distributed, especially if the mean and median differ significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is a Box-Cox transformation, which finds the power transformation of the data that best reduces skewness. A simpler approach, which works in most cases, is to apply the natural logarithm, which is what we will do below.

```
# Scale the data using the natural logarithm
log_data = np.log(data.copy())
# Scale the sample data using the natural logarithm
log_samples = np.log(samples)
# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
```

After applying a natural logarithm scaling to the data, the distribution of each feature appears much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
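For comparison, a Box-Cox transformation can be fitted with `scipy.stats.boxcox`, which returns the transformed values together with the fitted exponent (a value of 0 corresponds to the natural logarithm). The sketch below uses synthetic right-skewed data rather than the customer dataset, so the printed numbers are only illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
spending = rng.lognormal(mean=8.0, sigma=1.2, size=1000)  # synthetic, right-skewed

log_spending = np.log(spending)
# boxcox requires strictly positive input and returns (transformed data, fitted lambda)
boxcox_spending, best_lambda = stats.boxcox(spending)

print("skew raw     :", stats.skew(spending))
print("skew log     :", stats.skew(log_spending))
print("skew Box-Cox :", stats.skew(boxcox_spending))
print("fitted lambda:", best_lambda)
```

When the fitted exponent is close to 0, the two transformations behave almost identically, which is one reason the plain logarithm is a common shortcut.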

```
# Let's compare the original sample data to the log-transformed sample data
print("Original chosen samples of wholesale customers dataset:")
display(samples)
# Display the log-transformed sample data
print("Log-transformed samples of wholesale customers dataset:")
display(log_samples)
```

Original chosen samples of wholesale customers dataset:

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|-------|------|---------|--------|------------------|--------------|
| 0 | 31276 | 1917 | 4469 | 9408 | 2381 | 4334 |
| 1 | 622 | 55 | 137 | 75 | 7 | 8 |
| 2 | 11442 | 1032 | 582 | 5390 | 74 | 247 |
```

Log-transformed samples of wholesale customers dataset:

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|-----------|----------|----------|----------|------------------|--------------|
| 0 | 10.350606 | 7.558517 | 8.404920 | 9.149316 | 7.775276 | 8.374246 |
| 1 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
| 2 | 9.345046 | 6.939254 | 6.366470 | 8.592301 | 4.304065 | 5.509388 |
```

Detecting outliers is an important part of the data preprocessing step of any analysis, because outliers can skew results that take these data points into account. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's method for identifying outliers: an *outlier step* is calculated as 1.5 times the interquartile range (IQR), and a data point with a feature value more than one outlier step below the first quartile (Q1) or above the third quartile (Q3) is considered abnormal.

```
# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    # Calculate Q1 (25th percentile). Note: np.percentile(log_data, 25) pools the
    # values of all six features; pass log_data[feature] for a per-feature quartile
    Q1 = np.percentile(log_data, 25)
    # Calculate Q3 (75th percentile), pooled over all features in the same way
    Q3 = np.percentile(log_data, 75)
    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = (Q3 - Q1) * 1.5
    # Display the rows whose value for this feature falls outside [Q1 - step, Q3 + step]
    print("Data points considered outliers for the feature '{}':".format(feature))
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])
# Select the indices for data points you wish to remove
outliers = [66, 75, 338, 142, 154, 289]
# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
```

Data points considered outliers for the feature 'Fresh':

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-----|----------|----------|----------|----------|------------------|--------------|
| 66 | 2.197225 | 7.335634 | 8.911530 | 5.164786 | 8.151333 | 3.295837 |
| 95 | 1.098612 | 7.979339 | 8.740657 | 6.086775 | 5.407172 | 6.563856 |
| 96 | 3.135494 | 7.869402 | 9.001839 | 4.976734 | 8.262043 | 5.379897 |
| 218 | 2.890372 | 8.923191 | 9.629380 | 7.158514 | 8.475746 | 8.759669 |
| 338 | 1.098612 | 5.808142 | 8.856661 | 9.655090 | 2.708050 | 6.309918 |
```

Data points considered outliers for the feature 'Milk':

```
| Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-------|------|---------|--------|------------------|--------------|
```

Data points considered outliers for the feature 'Grocery':

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|----|----------|----------|----------|----------|------------------|--------------|
| 75 | 9.923192 | 7.036148 | 1.098612 | 8.390949 | 1.098612 | 6.882437 |
```

Data points considered outliers for the feature 'Frozen':

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-------|----------|----------|-----------|------------------|------------------|--------------|
| 38 | 8.431853 | 9.663261 | 9.723703 | 3.496508 | 8.847360 | 6.070738 |
| 65 | 4.442651 | 9.950323 | 10.732651 | 3.583519 | 10.095388 | 7.260523 |
| 420 | 8.402007 | 8.569026 | 9.490015 | 3.218876 | 8.827321 | 7.239215 |
```

Data points considered outliers for the feature 'Detergents_Paper':

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-----|-----------|----------|----------|----------|------------------|--------------|
| 75 | 9.923192 | 7.036148 | 1.098612 | 8.390949 | 1.098612 | 6.882437 |
| 122 | 9.410174 | 5.303305 | 5.501258 | 7.596392 | 3.218876 | 6.756932 |
| 142 | 10.519646 | 8.875147 | 9.018332 | 8.004700 | 2.995732 | 1.098612 |
| 154 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
| 161 | 9.428190 | 6.291569 | 5.645447 | 6.995766 | 1.098612 | 7.711101 |
| 177 | 9.453992 | 8.899731 | 8.419139 | 7.468513 | 2.995732 | 7.875119 |
| 204 | 7.578657 | 6.792344 | 8.561401 | 7.232010 | 1.609438 | 7.191429 |
| 237 | 9.835851 | 8.252707 | 6.385194 | 8.441176 | 3.332205 | 7.102499 |
| 289 | 10.663966 | 5.655992 | 6.154858 | 7.235619 | 3.465736 | 3.091042 |
| 338 | 1.098612 | 5.808142 | 8.856661 | 9.655090 | 2.708050 | 6.309918 |
| 356 | 10.029503 | 4.897840 | 5.384495 | 8.057377 | 2.197225 | 6.306275 |
| 402 | 10.186371 | 8.466531 | 8.535230 | 5.393628 | 2.302585 | 5.828946 |
```

Data points considered outliers for the feature 'Delicatessen':

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-----|-----------|----------|-----------|----------|------------------|--------------|
| 66 | 2.197225 | 7.335634 | 8.911530 | 5.164786 | 8.151333 | 3.295837 |
| 109 | 7.248504 | 9.724899 | 10.274568 | 6.511745 | 6.728629 | 1.098612 |
| 128 | 4.941642 | 9.087834 | 8.248791 | 4.955827 | 6.967909 | 1.098612 |
| 137 | 8.034955 | 8.997147 | 9.021840 | 6.493754 | 6.580639 | 3.583519 |
| 142 | 10.519646 | 8.875147 | 9.018332 | 8.004700 | 2.995732 | 1.098612 |
| 154 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
| 184 | 5.789960 | 6.822197 | 8.457443 | 4.304065 | 5.811141 | 2.397895 |
| 187 | 7.798933 | 8.987447 | 9.192075 | 8.743372 | 8.148735 | 1.098612 |
| 203 | 6.368187 | 6.529419 | 7.703459 | 6.150603 | 6.860664 | 2.890372 |
| 233 | 6.871091 | 8.513988 | 8.106515 | 6.842683 | 6.013715 | 1.945910 |
| 285 | 10.602965 | 6.461468 | 8.188689 | 6.948897 | 6.077642 | 2.890372 |
| 289 | 10.663966 | 5.655992 | 6.154858 | 7.235619 | 3.465736 | 3.091042 |
```

A handful of rows (66, 75, 142, 154, 289, and 338) were flagged as outliers in more than one feature under our definition. I chose to remove these rows because a row that is flagged for several features gives us more confidence that it is truly an outlier.
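The idea of flagging rows that are extreme in more than one feature can be sketched with Tukey's fences computed separately for each column. The example below is an illustration on synthetic data with two planted extreme rows; the helper `tukey_outlier_mask` is a hypothetical name, and this per-column variant is not the code that produced the tables above.

```python
import numpy as np
import pandas as pd

def tukey_outlier_mask(frame, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR] of its own column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    step = k * (q3 - q1)
    return (frame < q1 - step) | (frame > q3 + step)

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(8.0, 1.0, size=(200, 3)), columns=["A", "B", "C"])
toy.loc[5, ["A", "B"]] = [0.5, 15.0]   # planted: extreme in two features
toy.loc[9, "C"] = 16.0                 # planted: extreme in one feature

mask = tukey_outlier_mask(toy)
flags_per_row = mask.sum(axis=1)       # number of features in which each row is flagged
multi_feature_outliers = flags_per_row[flags_per_row > 1].index.tolist()
print(multi_feature_outliers)
```

Counting flags per row makes the removal rule explicit: only rows flagged in at least two features are candidates for dropping.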

In this section we will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can go ahead and apply PCA to `good_data` to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA will also report the *explained variance ratio* of each dimension, that is, how much of the variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space; however, it is a composition of the original features present in the data.

```
from sklearn.decomposition import PCA
# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA(n_components=6)
pca.fit(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
```

**Note:** A positive increase in a specific dimension corresponds with an *increase* of the *positive-weighted* features and a *decrease* of the *negative-weighted* features. The rate of increase or decrease is based on the individual feature weights.

From the visualization above we can see that the first and second principal components together explain 0.7216 of the variance. The first four principal components together explain 0.9343 of the variance.

In the first dimension, the features Milk, Grocery, Detergents_Paper are primarily grouped together which might indicate that this dimension represents purchases by retailers.

The second dimension paired the Fresh, Frozen, and Delicatessen features together which could indicate it represents a restaurant or similar establishment.

In the third dimension we observe the grouping of Frozen and Delicatessen positively weighted and then Fresh and Detergents_Paper negatively weighted. This dimension might represent something like a convenience store.

Finally, the fourth dimension pairs the features Fresh and Delicatessen negatively and Frozen and Detergents_Paper positively. This component could represent a gas station for example.

Let's run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions. Observe the numerical value for the first four dimensions of the sample points.

```
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
```

```
| | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 |
|---|-------------|-------------|-------------|-------------|-------------|-------------|
| 0 | 0.1039 | -2.8019 | 0.0746 | 0.4443 | -1.3751 | 0.1083 |
| 1 | 6.2001 | 7.0196 | -0.9409 | -0.7406 | -0.3790 | 0.0659 |
| 2 | 3.7970 | 0.2673 | -0.4271 | 0.8460 | 0.1136 | -0.5462 |
```

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the *cumulative explained variance ratio* is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.

```
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2)
pca.fit(good_data)
# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
```
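To judge how much variance a given number of components retains, the cumulative sum of `explained_variance_ratio_` can be inspected directly. The following sketch uses synthetic data built from two latent factors; the 0.90 threshold and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 6 features driven mostly by 2 latent factors plus small noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 6))

pca_full = PCA(n_components=6).fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.90  # illustrative threshold, not a rule
# Index of the first cumulative value that reaches the target, converted to a count
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(cumulative.round(4), n_needed)
```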

Let's run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions. Observe how the values for the first two dimensions remain unchanged when compared to a PCA transformation in six dimensions.

```
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
```

```
| | Dimension 1 | Dimension 2 |
|---|-------------|-------------|
| 0 | 0.1039 | -2.8019 |
| 1 | 6.2001 | 7.0196 |
| 2 | 3.7970 | 0.2673 |
```

A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

```
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
```

Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.

In this section, we can choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data. Then, we will recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.

The advantages of the K-Means clustering algorithm are that it scales well to large sample sizes and always converges. However, it may converge to a local minimum instead of the global minimum, so it is usually run several times from different starting points. A Gaussian Mixture Model, on the other hand, gives soft (probabilistic) cluster assignments and can fit elongated clusters of different sizes, although it does assume that each cluster is roughly Gaussian. Like K-Means, it can converge to a local optimum. Because the first two principal components explain 72.16% of the variance, we will cluster in that two-dimensional space. We will use the K-Means clustering algorithm with 2 clusters and also see how it performs with different numbers of clusters.

Depending on the problem, the number of clusters that you expect to be in the data may already be known. When the number of clusters is not known *a priori*, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data, if any. However, we can quantify the "goodness" of a clustering by calculating each data point's *silhouette coefficient*. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the *mean* silhouette coefficient provides a simple scoring method for a given clustering.
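To make the difference between the two algorithms concrete, the sketch below fits both to the same synthetic two-blob data. K-Means returns one label per point, while `GaussianMixture.predict_proba` returns a membership probability for each component. The data, the 0.9 cut-off, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping synthetic blobs in 2-D
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([4, 0], 1.0, size=(150, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
gmm = GaussianMixture(n_components=2, random_state=1).fit(X)

hard_labels = kmeans.predict(X)      # one cluster id per point
soft_probs = gmm.predict_proba(X)    # one probability per point and component

# Points whose most likely component has probability below 0.9
uncertain = int((soft_probs.max(axis=1) < 0.9).sum())
print(hard_labels[:5], soft_probs[:5].round(3), uncertain)
```

Points near the boundary between the blobs are the ones where the soft assignment carries extra information that a hard label hides.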

```
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
n_clusters = 2
# Apply your clustering algorithm of choice to the reduced data
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
clusterer.fit(reduced_data)
# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)
# Find the cluster centers
centers = clusterer.cluster_centers_
# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)
# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data, preds, random_state=1)
print("{0} clusters: {1:.4f}".format(n_clusters, score))
```

`2 clusters: 0.4262`

Running the same code with `n_clusters` set to each value from 2 to 9 gives the following scores:

- 2 clusters: 0.4262
- 3 clusters: 0.4017
- 4 clusters: 0.3355
- 5 clusters: 0.3512
- 6 clusters: 0.3633
- 7 clusters: 0.3632
- 8 clusters: 0.3656
- 9 clusters: 0.3584

Our best silhouette score came from the model trained with two clusters, resulting in a score of 0.4262.
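A comparison like this can be produced in a single loop over candidate cluster counts. Here is a minimal sketch on synthetic data with three well-separated blobs (generated with `make_blobs`; the centers are arbitrary assumptions), so the scores it prints are unrelated to the ones listed above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=1)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette coefficient
    print("{0} clusters: {1:.4f}".format(k, scores[k]))

# Cluster count with the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)
print("best:", best_k)
```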

Now that we've chosen the optimal number of clusters for our clustering algorithm using the scoring metric above, we can visualize the results by executing the code block below.

```
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)
```

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the *averages* of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to *the average customer of that segment*. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

```
# Inverse transform the centers
log_centers = pca.inverse_transform(centers)
# Exponentiate the centers
true_centers = np.exp(log_centers)
# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
```

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-----------|--------|--------|---------|--------|------------------|--------------|
| Segment 0 | 3716.0 | 7983.0 | 12283.0 | 895.0 | 4638.0 | 974.0 |
| Segment 1 | 9167.0 | 1940.0 | 2494.0 | 2100.0 | 310.0 | 724.0 |
```
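The inverse steps can be sanity-checked with a round trip on synthetic data. With all six components the log and PCA transformations invert exactly, while with two components `inverse_transform` only gives an approximation, which is why the recovered centers are representative values and not exact customers. The data and names below are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
spend = rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6))  # synthetic positive "spending"
log_spend = np.log(spend)

# With all 6 components, PCA is centering plus a rotation and inverts exactly
pca6 = PCA(n_components=6).fit(log_spend)
recovered6 = np.exp(pca6.inverse_transform(pca6.transform(log_spend)))

# With 2 components, information in the dropped dimensions is lost
pca2 = PCA(n_components=2).fit(log_spend)
recovered2 = np.exp(pca2.inverse_transform(pca2.transform(log_spend)))

print(np.allclose(recovered6, spend), np.allclose(recovered2, spend))
```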

```
# Let's compare it with our original table
display(data.describe())
```

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|-------|---------------|--------------|--------------|--------------|------------------|--------------|
| count | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 |
| mean | 12000.297727 | 5796.265909 | 7951.277273 | 3071.931818 | 2881.493182 | 1524.870455 |
| std | 12647.328865 | 7380.377175 | 9503.162829 | 4854.673333 | 4767.854448 | 2820.105937 |
| min | 3.000000 | 55.000000 | 3.000000 | 25.000000 | 3.000000 | 3.000000 |
| 25% | 3127.750000 | 1533.000000 | 2153.000000 | 742.250000 | 256.750000 | 408.250000 |
| 50% | 8504.000000 | 3627.000000 | 4755.500000 | 1526.000000 | 816.500000 | 965.500000 |
| 75% | 16933.750000 | 7190.250000 | 10655.750000 | 3554.250000 | 3922.000000 | 1820.250000 |
| max | 112151.000000 | 73498.000000 | 92780.000000 | 60869.000000 | 40827.000000 | 47943.000000 |
```

- Our cluster 0 fell in the following percentile ranges for each category: Fresh: 25-50%, Milk: 75-99%, Grocery: 75-99%, Frozen: 25-50%, Detergents_Paper: 75-99%, Delicatessen: 50-75%. This segment spent the most in the Grocery, Milk, and Detergents_Paper categories, which makes it likely to represent establishments like supermarkets or big retailers.
- Our cluster 1 had a very different spending profile: Fresh: 50-75%, Milk: 25-50%, Grocery: 25-50%, Frozen: 50-75%, Detergents_Paper: 25-50%, Delicatessen: 25-50%. This segment's spending is above the median in the Fresh and Frozen categories but below it in all other areas. This might mean it represents restaurants or large chain restaurants that require fresh ingredients multiple times a week.

```
print("Chosen samples of wholesale customers dataset:")
display(samples)
# visualize percentiles with heatmap
sns.heatmap(pcts.reset_index(drop=True), annot=True, vmin=1, vmax=99, fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' category spending')
plt.xticks(rotation=45, ha='center');
```

Chosen samples of wholesale customers dataset:

```
| | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen |
|---|-------|------|---------|--------|------------------|--------------|
| 0 | 31276 | 1917 | 4469 | 9408 | 2381 | 4334 |
| 1 | 622 | 55 | 137 | 75 | 7 | 8 |
| 2 | 11442 | 1032 | 582 | 5390 | 74 | 247 |
```

```
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
```

```
Sample point 0 predicted to be in Cluster 1
Sample point 1 predicted to be in Cluster 1
Sample point 2 predicted to be in Cluster 1
```

Sample point 0 is a heavy spender, particularly in the Fresh, Frozen, and Delicatessen categories, which would definitely fall into our cluster 1.

Sample point 1 is a much smaller buyer, but we can see that what they do buy is mostly Fresh and Frozen foods, which would again put them in cluster 1 but farther out from the cluster center.

Sample point 2 is another medium to heavy spender but they focus more on the Frozen category than the Fresh and very little in the other categories unlike our sample point 0. It makes sense then that the sample point 2 was classified into cluster 1.

Companies will often run A/B tests when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. Let's imagine that the wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively.

- How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?
- Can we assume the change affects all customers equally? How can we determine which group of customers it affects the most?

Because we have some domain knowledge and we understand that of the different types of goods we sell some are perishable, we may hypothesize that the customers who purchase perishable goods would react negatively to a less frequent delivery service. However, our customer segment that purchases goods that aren't perishable or that have a longer shelf life wouldn't mind the decreased delivery service and may react positively since we could pass some of the benefits of the cost-savings onto them.
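One way to test this hypothesis is to run the A/B test separately within each segment, so that the effect of the delivery change can be compared between segments. The sketch below only shows the assignment step, splitting each segment at random into control and treatment groups. The customer table, the column names, and the even split are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical customer table; 'segment' stands in for the cluster labels
customers = pd.DataFrame({"customer_id": np.arange(100),
                          "segment": rng.integers(0, 2, size=100)})

# Within each segment, shuffle customers and put the first half in the treatment group
customers["group"] = "control"
for idx in customers.groupby("segment").groups.values():
    shuffled = rng.permutation(np.asarray(idx))
    treated = shuffled[: len(shuffled) // 2]
    customers.loc[treated, "group"] = "treatment"

# Rows: segments, columns: control / treatment counts
counts = customers.groupby(["segment", "group"]).size().unstack()
print(counts)
```

Comparing the measured reaction of treatment and control customers inside each segment then indicates which segment, if any, responds positively to the change.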

Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a *customer segment* it best identifies with (depending on the clustering algorithm applied), we can consider the customer segment as an additional feature of the data.

- How can the wholesale distributor label new customers using only their estimated product spending and the customer segment data?

The wholesale distributor could let our clustering algorithm automatically assign each of the new customers to a cluster, just as we did earlier with our 3 sample points. Another option is to use the existing data and predicted customer segments with a supervised learning algorithm where the target variable is the customer segment. We could then predict in advance what type of delivery service would be best suited to our customers' needs and also save on costs by delivering less frequently to customers who don't require it.
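The supervised option can be sketched as follows: cluster the existing customers, use the cluster ids as labels, train a classifier on them, and predict the segment of new customers. The example uses synthetic two-dimensional points in place of the log-transformed, PCA-reduced spending data, and a decision tree as an arbitrary choice of classifier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the reduced data of existing customers
existing = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
                      rng.normal([6, 6], 1.0, size=(100, 2))])

# Step 1: derive segment labels from clustering
segments = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(existing)

# Step 2: train a classifier with the segment as the target variable
clf = DecisionTreeClassifier(random_state=1).fit(existing, segments)

# Step 3: label hypothetical new customers (already in the same reduced space)
new_customers = np.array([[0.2, -0.5], [5.8, 6.3]])
new_segments = clf.predict(new_customers)
print(new_segments)
```

New customers would need the same preprocessing (log transform and PCA projection) as the training data before prediction.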

At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.

Let's run the code block below to see how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' in the reduced space.

```
# Display the clustering results based on 'Channel' data
vs.channel_results(reduced_data, outliers, pca_samples)
```

Overall, the clustering algorithm and number of clusters we've chosen do a good job of classifying the points compared to the underlying distribution of Retailers to Hotel/Restaurant/Cafe customers. The plot clearly shows a decent number of Hotel/Restaurant/Cafe points that are incorrectly placed in cluster 0, and far fewer incorrectly placed Retailer points in cluster 1. There are no distinct groupings within these two segments, which makes sense because there is some overlap between the segments as well as within them, creating the spectrum we see. Still, I believe we can consider these results to be consistent with our previous definition of customer segments, but it seems that the Hotel/Restaurant/Cafe segment is a broad definition and is thus more diverse than we would originally expect.
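The agreement between cluster assignments and the 'Channel' labels can also be put into numbers. The sketch below uses ten made-up labels (not the real data) and reports the adjusted Rand index along with a simple agreement rate that allows for the fact that cluster ids are arbitrary.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Hypothetical labels for ten customers (illustrative only)
channel = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # e.g. 0 = HoReCa, 1 = Retail
cluster = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # cluster ids are arbitrary

ari = adjusted_rand_score(channel, cluster)
table = confusion_matrix(channel, cluster)  # rows: channel, columns: cluster

# Cluster ids carry no meaning, so take the better of the two possible matchings
agreement = max(np.trace(table), np.trace(table[:, ::-1])) / table.sum()
print(table, ari, agreement)
```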
