We frequently get questions about whether we have chosen the right attributes for building a machine learning model. There are two common scenarios: either we have plenty of attributes (or variables) and need to select the best ones, or we have only a handful of attributes and need to know whether they are impactful. Both are classic feature engineering challenges.
Most of the time, feature selection questions pop up as a prelude to model building. Recently, however, one of the trainees in our data science course asked, based on his experience working with some real data: "can we tell which attributes were most important in determining why a particular example (or data point) ended up in a particular cluster?"
Two things were unique about this question: first, it is feature selection in reverse; second, feature selection typically does not get much attention in unsupervised techniques (such as clustering).
In this article, we will show how quickly the question can be answered, especially if you are comfortable with a tool like RapidMiner. All you need to do is pull in any standard example set (we use the Iris dataset here), build a k-means clustering model, and add an attribute weighting operator after the model is built. There is, of course, one detail to work out: how do we verify that the attribute ranking actually worked? The steps are described below, along with the XML for implementing this simple but instructive use case.
Step 1: No feature selection
Pull in the Iris example set, Normalize the data using Z-transformation, and Rename the variables. Put together the process as shown below, noting that the Select Attributes operator in the middle is disabled for Step 1. After we build a k-means Clustering model (with k = 3), we change the roles of two attributes: the flower name is made an id variable (meaning it will not count as a feature) and cluster is made the label variable. This allows us to rank the attributes with an operator such as Weight by Information Gain.
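If you prefer to follow along outside of RapidMiner, the sketch below reproduces the same idea for Step 1 in Python with scikit-learn. It is only an illustration, not the original XML process: the renamed column names are assumptions, and mutual_info_classif stands in for the Weight by Information Gain operator.

```python
# Minimal Step 1 sketch (an analog of the RapidMiner process, not the
# original XML). Column renames and the use of mutual information as a
# stand-in for "Weight by Information Gain" are assumptions.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_classif

iris = load_iris(as_frame=True)
X = iris.data.rename(columns={
    "sepal length (cm)": "sepal L", "sepal width (cm)": "sepal W",
    "petal length (cm)": "petal L", "petal width (cm)": "petal W"})
flower = iris.target_names[iris.target]  # id role: kept aside, not a feature

# Normalize (Z-transformation) and cluster with k = 3
Xz = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(Xz)

# Treat the cluster assignment as the label and rank the attributes
weights = mutual_info_classif(Xz, clusters, random_state=42)
print(pd.Series(weights, index=X.columns).sort_values(ascending=False))
```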
When we run the above process, we obtain 3 clusters roughly corresponding to the 3 flower types. However, as seen from the bar chart below, there is considerable error, particularly in the versicolor and virginica groups. For a perfect separation, we should see the 3 colors (red, green and blue) as 3 separate bars. See the figure below.
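To quantify the mixing the chart shows, you can continue the Python sketch above and cross-tabulate the cluster assignments against the true flower names (again, an illustrative addition rather than part of the RapidMiner process):

```python
# Cross-tabulate clusters against flower names; with all four attributes,
# versicolor and virginica tend to spill into each other's clusters.
print(pd.crosstab(pd.Series(clusters, name="cluster"),
                  pd.Series(flower, name="flower")))
```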
When we check the results of the attribute ranking, we see that two of the four attributes, petal W and petal L, have the highest weights, while sepal W and sepal L rank much lower (relatively). Of course, flower name, which is the id, has a weight of 0 (as it should).
Step 2: Include feature selection
Based on the feature ranking table above, we now deselect the Sepal * variables (by enabling the Select Attributes operator that was disabled in Step 1). When we run this process, we see that the clusters now separate the 3 flower types much better: each cluster contains (mostly) a single flower type. This demonstrates that feature selection can benefit unsupervised learning as well. It also answers the question raised earlier: we now know that the petal * attributes carry more weight in determining which cluster an example from this dataset belongs to. See the figure below, which shows significantly less contamination from the wrong class in each cluster than before.
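In the Python sketch, the equivalent of deselecting the Sepal * variables is simply dropping those columns before re-clustering; the column names below follow the renaming assumed in the Step 1 sketch.

```python
# Step 2 sketch: keep only the petal attributes, re-cluster, and check purity.
X_petal = Xz[["petal L", "petal W"]]
clusters2 = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_petal)
print(pd.crosstab(pd.Series(clusters2, name="cluster"),
                  pd.Series(flower, name="flower")))
```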
For more on feature selection, read some of the earlier articles on this blog. The final XML of the above process is available here.