K-Means Clustering Algorithm

Clustering is a technique for finding similarity groups in data, called clusters. It groups individuals in a population by similarity, without being driven by a specific purpose. K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided, so data points are clustered by feature similarity. The results of the K-means clustering algorithm are:

- The centroids of the K clusters, which can be used to label new data
- A label for each data point, assigning it to one of the K clusters

Imagine a retail website that needs to decide who gets certain offers, based on the number of products each customer has bought and the number of products they have reviewed on the portal. For this purpose, we need to cluster people into logical groups, and the simplest approach is k-means clustering.
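To make this concrete, here is a minimal sketch using scikit-learn's KMeans. The data is randomly generated for illustration, and the two feature columns (products bought and products reviewed) are assumptions taken from the example above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=42)
# Hypothetical customer data: one row per customer,
# columns = [products_bought, products_reviewed].
customers = rng.integers(0, 50, size=(25, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(customers)

print(kmeans.labels_)           # result 1: a cluster label for each customer
print(kmeans.cluster_centers_)  # result 2: the centroids of the 3 clusters
```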

K-means Clustering Algorithm
To see how the k-means clustering algorithm works, scroll down and follow each step.
Initial Graph
Let's consider 25 random data points in a vector space (for the example above, these would be customers).
Step 1 - Plot the Centroids on a Scatterplot
We then choose any 3 random points in our vector space as our initial centroids and color them to distinguish them from each other.
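A sketch of this initialization step, assuming we pick existing data points as the starting centroids (the Forgy method); random points anywhere in the space work as well:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
points = rng.random((25, 2))  # 25 random 2-D data points, as in the walkthrough

k = 3
# Pick k distinct data points at random to serve as the initial centroids.
centroids = points[rng.choice(len(points), size=k, replace=False)]
```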
Step 2 - Cluster Assignment
For each data point (xᵢ), find the nearest centroid (cⱼ) by calculating either the Euclidean distance or the Manhattan distance. Assign each data point to its nearest centroid and change it to the corresponding centroid color.

Manhattan distance is typically used when the dimensions are not comparable and the data noise (variance) is not high; otherwise, Euclidean distance is used. In general, Euclidean distance gives a lower error percentage on datasets with comparable dimensions.
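A sketch of the assignment step supporting both metrics; the assign_clusters helper and the array shapes are illustrative assumptions, not part of the visualization:

```python
import numpy as np

def assign_clusters(points, centroids, metric="euclidean"):
    # Pairwise differences between every point and every centroid,
    # shape (n_points, k, n_dims).
    diffs = points[:, None, :] - centroids[None, :, :]
    if metric == "euclidean":
        dists = np.sqrt((diffs ** 2).sum(axis=2))
    else:  # metric == "manhattan"
        dists = np.abs(diffs).sum(axis=2)
    # Index of the nearest centroid for each point.
    return dists.argmin(axis=1)

labels = assign_clusters(points, centroids)  # points/centroids from the Step 1 sketch
```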
Step 3 - Euclidean Distance
For this graph, we have used Euclidean distance to calculate and assign each data point to its nearest centroid.
This distance is given by the formula:

d(x, c) = √((x₁ − c₁)² + (x₂ − c₂)² + … + (xₙ − cₙ)²)

For example, the distance between the points (1, 2) and (4, 6) is √(3² + 4²) = 5.
Hover over the lines or points to view the distance of the point from its corresponding centroid.
Step 4 - Recalculating Centroids until a Local Optimum is Reached
Recalculate the new centroid for every cluster by taking the mean of all the points assigned to that cluster in the previous step. Move each centroid to this mean and reassign every point to its nearest centroid. Repeat this until no point changes its cluster.

Observe the graph to see the centroids and clusters change until the local optimum (0 points changing clusters) is reached.
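A sketch of this update loop, reusing the assign_clusters helper from the Step 2 sketch and assuming no cluster ever becomes empty (a real implementation would handle that case):

```python
import numpy as np

def kmeans_iterate(points, centroids):
    labels = assign_clusters(points, centroids)
    while True:
        # New centroid j = mean of all points currently assigned to cluster j.
        centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
        new_labels = assign_clusters(points, centroids)
        if np.array_equal(new_labels, labels):  # 0 points changed cluster
            return centroids, labels
        labels = new_labels
```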
Is the optimum really optimal?
Unfortunately, k-means depends on the number of clusters we choose and on the initial centroids we pick, which means the local-optimum clusters might not be the best solution. To find a good number of clusters, the concept of an Elbow Point is used: run the algorithm for several values of K, plot the within-cluster sum of squared distances against K, and pick the K at which the curve bends and further increases yield only small improvements.
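A sketch of that elbow procedure with scikit-learn, on purely illustrative random data (scikit-learn exposes the within-cluster sum of squares as inertia_):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=1)
data = rng.random((100, 2))  # illustrative random data

# Within-cluster sum of squares for K = 1..10.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
            for k in range(1, 11)]
# Plotting inertias against K, the "bend" in the curve marks the elbow point.
```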

So how will our retail website use this algorithm? Scroll on to find out!

When the local optimum is reached, all the points are assigned to clusters based on their similarity to each other (their distance from each other). Tying this back to our initial example, customers who buy a high number of products and have given lots of reviews would get one type of deal, while customers who buy few products and have given fewer reviews would get another.
Similarly, people who have bought many products while giving few reviews would get a different deal, and those who have bought few products but have given a lot of reviews would get yet another.
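As a purely hypothetical illustration of that last step, one could inspect the fitted centroids from the first sketch and map each segment to an offer (the 25-unit threshold and the segment names below are invented for illustration):

```python
# kmeans is the fitted model from the first sketch above.
for j, (bought, reviewed) in enumerate(kmeans.cluster_centers_):
    buyer = "frequent buyer" if bought > 25 else "occasional buyer"
    reviewer = "active reviewer" if reviewed > 25 else "quiet reviewer"
    print(f"cluster {j}: {buyer}, {reviewer} -> offer a matching deal")
```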


Try It Yourself!

Here you get the chance to choose your own settings and see the clusters come to life!
You can choose as many points (N) as you want and up to 15 clusters (K) to cluster your data points.
Use the 'Step' button to update the centroids and intermediate clusters until the local optimum is reached.
Use the 'Reset' button to restart and randomize the initial data points.

(Chart: Number of Points Changing Clusters at each iteration; this count drops to 0 when the local optimum is reached.)