What is K-Means Clustering?
Clustering means grouping items that are similar or share features, and that is exactly what k-means clustering does. K-means is an unsupervised machine learning algorithm that partitions n observations into k clusters, where k is a constant chosen in advance by the user. The main idea is to define k centroids, one for each cluster, and to assign every observation to the cluster whose centroid is nearest.
The k-means algorithm involves:
- Choose the number of clusters, k.
- Randomly assign each point to a cluster.
- Repeat the following until the cluster assignments stop changing:
  - For each cluster, compute the centroid as the mean vector of the points in that cluster.
  - Assign each data point to the cluster whose centroid is closest.
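The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `k_means`, the `max_iter` and `seed` parameters, the use of Euclidean distance, and the handling of empty clusters (re-seeding at a random data point) are all assumptions added here, since the steps themselves do not specify them.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Cluster an (n, d) array into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Randomly assign each point to one of the k clusters.
    labels = rng.integers(0, k, size=n)
    centroids = np.zeros((k, points.shape[1]))

    for _ in range(max_iter):
        # Compute each centroid as the mean vector of its cluster's points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
            else:
                # Assumption: an empty cluster is re-seeded at a random point.
                centroids[j] = points[rng.integers(0, n)]

        # Assign each point to the cluster with the closest centroid
        # (Euclidean distance, computed for all point/centroid pairs).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        new_labels = distances.argmin(axis=1)

        # Stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, centroids
```

Calling `k_means(data, k=2)` on a two-column array returns one label per row plus the two centroids. Because the starting assignment is random, different seeds can lead to different results on data that is not cleanly separated, so it is common to run the algorithm several times and keep the best outcome.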
Two things are very important in k-means. First, scale the variables before clustering the data: the algorithm relies on distances, so a variable with a large numeric range would otherwise dominate the result. Second, look at a scatter plot or a data table to estimate how many cluster centers to set for the k parameter in the model.
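To illustrate the scaling step, here is a small sketch that standardizes each column to zero mean and unit variance (z-scores) before clustering. The helper name `standardize` and the income/age numbers are made up for illustration, and standardization is only one possible choice of scaling.

```python
import numpy as np

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at 0 instead of dividing by 0
    return (X - mean) / std

# Hypothetical toy data: column 0 is yearly income, column 1 is age.
raw = np.array([
    [20000.0, 25.0],
    [50000.0, 35.0],
    [80000.0, 45.0],
])
scaled = standardize(raw)

# Before scaling, income differences swamp age differences in any
# Euclidean distance; after scaling, both columns span the same range.
print(raw.max(axis=0) - raw.min(axis=0))
print(scaled.max(axis=0) - scaled.min(axis=0))
```

The scaled array, not the raw one, is what would then be passed to the clustering step.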
Choosing the optimal K value:
One way of choosing the k value is to use the elbow method. First, you compute the sum of squared errors (SSE) for a range of k values. SSE is the sum of the squared distances between each member of a cluster and that cluster's centroid, added up over all clusters. If you plot SSE against k, you will see that the error decreases as k increases. This is because more clusters means smaller clusters, so each point lies closer to its centroid and the distortion shrinks; in the extreme case where k equals the number of points, SSE is zero. The idea of the elbow method is to choose the k value at which the decrease in SSE slows down sharply (the "elbow" of the curve), because adding more clusters beyond that point brings only small improvements.
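The elbow method can be sketched as follows, assuming scikit-learn is available. Its `KMeans` estimator exposes the SSE of a fitted model as the `inertia_` attribute. The helper names `sse_for_each_k` and `elbow_k`, the synthetic three-blob data, and the "largest second difference" rule for locating the bend are illustrative assumptions. In practice the elbow is often judged by eye from a plot.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse_for_each_k(X, k_values, seed=0):
    """Fit k-means once per k and collect the SSE (scikit-learn's inertia_)."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]

def elbow_k(k_values, sse):
    """Assumed heuristic: pick the k where the SSE curve bends the most,
    measured by the largest second difference. Expects consecutive k values."""
    sse = np.asarray(sse, dtype=float)
    bend = sse[:-2] - 2 * sse[1:-1] + sse[2:]
    return k_values[int(bend.argmax()) + 1]

# Synthetic toy data (made up for this sketch): three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

k_values = list(range(1, 7))
sse = sse_for_each_k(X, k_values)
for k, err in zip(k_values, sse):
    print(f"k={k}: SSE={err:.1f}")
print("elbow at k =", elbow_k(k_values, sse))
```

On data like this, the SSE should drop steeply up to the true number of blobs and flatten afterwards. On real data the bend is often less pronounced, so the heuristic is best treated as a starting point rather than a definitive answer.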
Applications of K-Means Clustering:
K-means works best on data that is numeric and continuous and has a relatively small number of dimensions. Typical applications include document clustering, identifying crime-prone areas, customer segmentation, insurance fraud detection, public transport data analysis, and clustering of IT alerts, among others.