A Simple Explanation of K-Means Clustering

Overview

K-means clustering is one of the best-known unsupervised machine learning algorithms. It is widely used to group unlabeled data into clusters of similar items. Before we start, let’s take a look at the topics we are going to cover.

Table Of Contents

  • Introduction
  • How does the K-means algorithm work?
  • How to choose the value of K?
  • Elbow Method
  • Silhouette Method
  • Advantages of K-means
  • Disadvantages of K-means

Introduction

Let us understand the K-means clustering algorithm, starting with a simple definition.

How does the K-means algorithm work?

K-means clustering tries to group similar items into clusters. It measures the similarity between items by their distance from each other and puts items that are close together into the same cluster. The K-means clustering algorithm works in three steps. Let’s see what these three steps are.

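  1. Select the value of k, the number of clusters.
  2. Initialize the k centroids.
  3. Assign each point to the group of its nearest centroid, move each centroid to the average of its group, and repeat until the groups stop changing.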
  • Figure 1 shows data for two different items. The first item is shown in blue and the second item is shown in red. Here I am arbitrarily choosing the value of K as 2. There are different methods by which we can choose the right value of k.
  • In figure 2, we join the two selected points (the initial centroids) with a line and draw the perpendicular bisector of that line. Every data point lies on the side of the centroid it is closer to, so the bisector splits the data into two groups. Each selected point then moves to the centroid (the average) of its group. If you look closely, you will see that some of the red points have now fallen on the blue side. These points now belong to the group of blue items.
  • The same process continues in figure 3. We join the two centroids, draw the perpendicular bisector, and find the new centroid of each group. The two points move to their new centroids, and again some of the red points get converted to blue points.
  • The same process happens in figure 4. It continues until the groups stop changing and we get two clearly separated clusters.
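The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. The function name `kmeans`, the toy `points`, and the choice to pick the starting centroids at random from the data are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 3a: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # the groups stopped changing, so we are done
        labels = new_labels
        # Step 3b: move each centroid to the average of its group.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a group is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Toy data: three points near (1, 1.5) and three points near (8.5, 8.5).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)  # Step 1: we choose k = 2
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets label 0 depends on the random start.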

How to choose the value of K?

One of the most challenging tasks in this clustering algorithm is choosing the right value of k. What should the right value be? How do we choose it? Let us find the answers to these questions. If you choose the value of k randomly, it may be right or it may be wrong. If you choose the wrong value, it will directly affect your model's performance. There are two common methods for selecting the right value of k.

Elbow Method

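The elbow method is one of the most famous methods for selecting the right value of k and improving your model's performance. Choosing k in this way is a form of hyperparameter tuning. Let us see how the elbow method works. We run k-means for a range of k values and, for each k, compute the within-cluster sum of squares (WCSS), which is the sum of the squared distances between every point and the centroid of its cluster. When we plot WCSS against k, the curve falls sharply at first and then flattens out. The value of k at the bend, the "elbow", is the one we choose.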
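Here is a small sketch of the elbow idea in plain Python: compute the within-cluster sum of squared distances for several values of k and look at where the improvement levels off. The helper names `farthest_first` and `kmeans_wcss` and the toy `points` are hypothetical. To keep the result repeatable, the sketch starts the centroids with a deterministic "farthest-first" rule, which is an assumption of this example and not something the elbow method requires.

```python
import math

def farthest_first(points, k):
    """Deterministic start (an assumption): take the first point, then keep
    adding the point that is farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_wcss(points, k, max_iters=100):
    """Run k-means and return the within-cluster sum of squared distances."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iters):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the average of its group.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Toy data with three well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
curve = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 2))
```

On this toy data the value drops steeply up to k = 3 and only slightly after that, so the elbow is at k = 3. On real data the bend is often less sharp and picking it takes some judgment.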

Silhouette Method

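The silhouette method is somewhat different. Like the elbow method, it takes a range of k values, but it draws a silhouette graph instead. It calculates the silhouette coefficient of every point. For a point i, it calculates a(i), the average distance from i to the other points in its own cluster, and b(i), the average distance from i to the points in its next closest cluster. The silhouette coefficient is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies between -1 and 1. A value close to 1 means the point sits well inside its own cluster, and the k with the highest average coefficient is usually the one to choose.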
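To make a(i) and b(i) concrete, here is a minimal plain-Python sketch that computes the silhouette coefficient of every point for a given set of cluster labels. The function name `silhouette_scores`, the toy `points`, and the two hand-written labelings are hypothetical and only serve the illustration. Giving a score of 0 to a point that is alone in its cluster is a common convention that the sketch assumes.

```python
import math

def silhouette_scores(points, labels):
    """Silhouette coefficient s(i) for every point (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        # a(i): average distance to the other points in the same cluster
        # (the point's distance to itself is 0, so it adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): average distance to the points of the closest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        # s(i) = (b(i) - a(i)) / max(a(i), b(i))
        scores.append((b - a) / max(a, b))
    return scores

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # each group of three is its own cluster
labels_k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # the last two groups merged into one

scores_k3 = silhouette_scores(points, labels_k3)
scores_k2 = silhouette_scores(points, labels_k2)
mean_k3 = sum(scores_k3) / len(scores_k3)
mean_k2 = sum(scores_k2) / len(scores_k2)
print(round(mean_k2, 3), round(mean_k3, 3))
```

The labeling with three clusters gets the higher average coefficient here, which matches the way the data was laid out. In practice you would get the labels from k-means for each candidate k and compare the averages in the same way.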

Advantages of K-means

  1. It is very simple to implement.
  2. It scales to huge data sets and runs fast on them.
  3. It adapts easily to new examples.
  4. It can be generalized to clusters of different shapes and sizes.

Disadvantages of K-means

  1. It is sensitive to outliers.
  2. Choosing the value of k manually is difficult.
  3. Its scalability decreases as the number of dimensions increases.
