Data Science 101: The Power and Pitfalls of Clustering

Clustering is the best-known method of unsupervised machine learning, but it is also among the most misunderstood, because it is a rather subjective modelling technique. A common use of clustering is segmenting a customer portfolio based on demographics, transaction behavior or other behavioral attributes. Many clustering algorithms, such as K-means with random initialization, are also non-deterministic, meaning different initial conditions can give very different results. This means you must be a domain expert, or have one at your disposal, to get the most out of clustering.
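To make the point about initial conditions concrete, here is a minimal sketch of K-means in plain Python. It assumes the standard assign-then-update loop (Lloyd's algorithm) and takes the starting centroids as an argument so the effect is repeatable. The names `kmeans` and `sq_dist` and the four-point dataset are illustrative assumptions, not taken from the articles discussed here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iter=100):
    """Plain K-means from the given starting centroids.

    Returns (labels, centroids, inertia), where inertia is the sum of
    squared distances from each point to its assigned centroid.
    """
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    inertia = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia


# Four corners of a wide rectangle: the natural split is left vs. right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one centroid near each side of the rectangle.
labels_a, _, inertia_a = kmeans(points, [(0, 0.5), (4, 0.5)])

# Start B: both centroids in the middle, one low and one high.
labels_b, _, inertia_b = kmeans(points, [(2, 0), (2, 1)])

print(labels_a, inertia_a)  # left/right split, low inertia
print(labels_b, inertia_b)  # bottom/top split, much higher inertia
```

Both runs converge, but to different groupings of the same data: start B settles on a stable yet clearly worse solution. In practice a common mitigation is to run the algorithm from several starting points and keep the result with the lowest inertia, then have someone who knows the domain check whether the clusters make sense.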

I found two excellent survey articles on the India-based Analytics Vidhya blog that review the important aspects of this powerful learning technique. Both hierarchical and K-means clustering are covered. Part I covers the basics and defines the types of problems that can be solved with clustering. Part II goes a step further by examining what can go wrong with clustering and how to get the most out of the effort.

Getting your clustering right (Part I)

Getting your clustering right (Part II)

The Analytics Vidhya blog has many other well-written articles of interest to data scientists. You might want to keep it handy in your favorites list.
