Assumption: The clustering technique assumes that the data points are similar enough to one another that the data can initially be treated as a single cluster.
Step 1: Import required libraries
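The later steps use pandas, scikit-learn, SciPy and Matplotlib; an import block covering everything the code below calls (names inferred from those steps) looks like:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc
```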

Step 2: Load and clean the data

# Change the working directory to the file location
cd C:\Users\Dev\Desktop\Kaggle\Credit_Card
X = pd.read_csv('CC_GENERAL.csv')

# Drop the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)

# Handle missing values
X.fillna(method='ffill', inplace=True)
Step 3: Data preprocessing

# Scale the data so that all features are comparable
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalize the data so that it roughly
# follows a Gaussian distribution
X_normalized = normalize(X_scaled)

# Convert the numpy array back to a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)
Step 4: Reduce the dimensionality of the data

pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
Dendrograms are used to decide how to divide the data into separate clusters.
Step 5: Visualize the dendrogram

To determine the optimal number of clusters from the dendrogram, find the largest vertical gap between two successive horizontal merge lines and draw a horizontal cut through that gap; the number of vertical lines the cut crosses is the number of clusters.
For this data, the dendrogram suggests that the optimal number of clusters is 2.
Step 6: Build and visualize different clustering models for different k values
a) k = 2
ac2 = AgglomerativeClustering(n_clusters=2)

# Visualize the clustering
plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac2.fit_predict(X_principal), cmap='rainbow')
plt.show()
b) k = 3

c) k = 4
ac4 = AgglomerativeClustering(n_clusters=4)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac4.fit_predict(X_principal), cmap='rainbow')
plt.show()
d) k = 5

e) k = 6
ac6 = AgglomerativeClustering(n_clusters=6)

plt.figure(figsize=(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c=ac6.fit_predict(X_principal), cmap='rainbow')
plt.show()
Now let's determine the optimal number of clusters using a mathematical technique. Here we will use silhouette scores for this purpose.
Step 7: Evaluate different models and visualize the results.
