Previously, we saw how unsupervised learning actually has built-in supervision, albeit hidden from the user.
In this post we will see that supervised and unsupervised learning algorithms have more in common than the textbooks would suggest. As a matter of fact, both classes can use identical equations for creating mathematical models of the data, and both can use identical learning algorithms to find optimal parameter values for those models.
The consequence of this relation is that one can easily transform a supervised learning method into an unsupervised one, and vice versa. The only change you need to make is to determine how Y will be computed; that is, you have to decide how the error used for learning (training) will be defined.
You may not have noticed so far, but the general linear model (GLM) has served as a versatile model, paired with a versatile set of learning methods, for creating various supervised and unsupervised learning methods.
When one thinks of GLM, probably the first methods that come to mind are regression and inferential statistics (e.g., ANOVA), both of which fall into the category of supervised learning. However, GLM has been used just as extensively in unsupervised setups. This relates to dimensionality reduction techniques in which the algorithm is not told which underlying dimensions the data points load on. Rather, the algorithm is left to “discover” those dimensions on its own. Principal component analysis (PCA) and various forms of factor analysis are all examples of unsupervised applications of GLM.
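To make the point concrete, here is a minimal sketch, assuming NumPy and synthetic data. The function name fit_linear, the variable names, and the data are hypothetical and chosen only for illustration. The same least-squares routine is called twice: once with a Y that comes from outside (regression), and once with Y set to the inputs themselves under a rank limit, a setup often called reduced-rank regression, which in this case recovers the first principal component.

```python
import numpy as np

def fit_linear(X, Y, rank=None):
    """Least-squares weights W for the linear model X @ W ~ Y.

    If `rank` is given, W is restricted to that rank (reduced-rank regression).
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    if rank is not None:
        # Keep only the top `rank` directions of the fitted values.
        _, _, Vt = np.linalg.svd(X @ W, full_matrices=False)
        V = Vt[:rank].T
        W = W @ V @ V.T
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)  # center the data, as PCA does

# Supervised use: Y is provided from outside the input data.
true_w = np.array([[2.0], [-1.0], [0.5]])  # assumed "ground truth" for the demo
Y_external = X @ true_w + 0.01 * rng.normal(size=(200, 1))
W_supervised = fit_linear(X, Y_external)

# Unsupervised use: Y is the input itself, squeezed through one dimension.
W_unsupervised = fit_linear(X, X, rank=1)
X_reconstructed = X @ W_unsupervised
```

The only thing that differs between the two calls is what is passed as Y (plus the rank limit that keeps the second problem from collapsing into the identity mapping). The model equation and the fitting routine are the same.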
This easy jump from supervised to unsupervised is not just a property of simple models such as GLM. Exactly the same applies to computationally elaborate methods such as deep neural networks. A neural network can easily be set up to operate with or without supervision. The best-known examples are supervised applications, such as image recognition, in which humans initially provide labels indicating the category to which each image belongs. The network then learns that assignment and, if everything is done right, is capable of correctly classifying new images from the trained categories (e.g., distinguishing human faces from houses, tools, etc.).
Neural networks can be used just as efficiently in an unsupervised learning setup. Perhaps the most common examples are auto-encoders, which are capable of detecting anomalies in data. Here, the network is trained to produce an output that has exactly the same values as the inputs it receives. The difference between what it has generated and what it should have generated, i.e., the error, is used for adjusting its synaptic weights. The training continues until the network can do the job satisfactorily on data that have not been used for training (i.e., a test data set).
What makes this learning non-trivial is that the topology of the neural network is chosen such that at least one of the hidden layers has fewer units than the input and output layers. This forces the network to find a representation of the data with reduced dimensionality, similar to what PCA and factor analysis do.
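The training loop described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy, a purely linear auto-encoder, plain gradient descent, and synthetic data; the layer sizes, learning rate, variable names, and the hand-made anomaly are all assumptions made for the example, not a recipe for real applications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 4 input features that really live on a 2-D plane.
mixing = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]])

def sample_normal(n):
    return rng.normal(size=(n, 2)) @ mixing + 0.05 * rng.normal(size=(n, 4))

X_train, X_test = sample_normal(300), sample_normal(100)

n_inputs, n_hidden = 4, 2  # bottleneck: fewer hidden units than inputs
W_enc = 0.1 * rng.normal(size=(n_inputs, n_hidden))
W_dec = 0.1 * rng.normal(size=(n_hidden, n_inputs))

learning_rate = 0.02
losses = []
for epoch in range(3000):
    hidden = X_train @ W_enc
    output = hidden @ W_dec
    error = output - X_train  # the target is the input itself
    losses.append(np.mean(np.sum(error ** 2, axis=1)))
    # Gradients of the mean squared reconstruction error.
    grad_dec = (2.0 / len(X_train)) * (hidden.T @ error)
    grad_enc = (2.0 / len(X_train)) * (X_train.T @ error @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

def reconstruction_error(samples):
    """Squared reconstruction error per sample."""
    return np.sum((samples @ W_enc @ W_dec - samples) ** 2, axis=1)

# Evaluate on data not used for training, and on a point off the plane.
normal_errors = reconstruction_error(X_test)
anomaly = np.array([[3.0, 0.0, -3.0, 0.0]])
anomaly_error = reconstruction_error(anomaly)[0]
```

Because the network only learns to reproduce inputs that resemble the training data, a point that does not fit the learned low-dimensional structure is reconstructed poorly. Comparing anomaly_error against the spread of normal_errors is one simple way to flag it.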
Such networks are useful for applications in which labels do not exist or would be impractically difficult to obtain. They can also be very useful for applications in which collecting labels may take years, such as fraud detection and predictive maintenance.
A piece of advice to data scientists: don’t be afraid to turn your supervised learning method into an unsupervised one or vice versa, if you see that this fits your problem. You will need some creative thinking and more coding than usual but as a result, you may end up with exactly the solution that the task you are solving requires.
Here is one general rule to keep in mind: supervised learning methods will always be capable of solving a wider range of real-life problems than unsupervised ones. This is because unsupervised methods are much more specialized: their error computation is already determined by the algorithm, and it is limited to whatever can be extracted from the input data. In contrast, supervised methods, being open to error data coming from the outside world, can take advantage of the errors “computed” by the entire external universe, including the physical events underlying the actual phenomenon that these methods are trying to model. For example, the real physical event of a machine breaking down provides the training information for a predictive model of whether a machine will soon break down.
All other things being equal, supervised methods will require less data and computational power to achieve a similar result. Unsupervised algorithms can learn to classify objects such as cats, but this comes at the expense of far more resources than a supervised equivalent would need. In the case of Google’s algorithm that discovered cats in images, it took 10 million images, 1 billion connections, 16,000 computer cores, three days of computation and a team of eight scientists from Google and Stanford. That’s a lot of resources.
In conclusion, we now know that the terms ‘supervised’ and ‘unsupervised’ may be misleading, as there is quite a bit of supervision in unsupervised learning. Maybe a better analogy would be to refer to supervised learning as ‘micro-managed learning’ and to unsupervised learning as ‘macro-managed learning’. These two terms would probably better describe what is actually happening in the background of the respective algorithms.
Knowing that supervised and unsupervised methods can be seen as two different applications of the same general set of tools can be quite useful for creative problem solving in data science. By assuming a bit of an inventive attitude, one can relatively effortlessly convert an existing method from one form to another, as circumstances require.