Labels are how humans define and categorise concepts. There’s a wealth of evolutionary psychology, neuroscience and linguistics behind this, but without going into the detail, the point is simple: without labels, human (and other animal) intelligence would not be possible. Labels are the algebra of everyday life.
But what’s that got to do with Data Science? As it happens, quite a lot. When we want to understand what people believe or perceive, we do it by analysing their communication, whether written or spoken. Let’s say we want to analyse voice-of-customer text data.
The classical approach is text mining: keywords and rules drive topic analysis, e.g. using TF-IDF or some other kind of ‘vectorization’, combined with sentiment analysis of the opinion terms.
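As a concrete illustration of that classical route, here is a minimal, hand-rolled TF-IDF sketch. Library implementations (such as scikit-learn’s) add smoothing and normalisation; the corpus and weighting here are purely illustrative:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by its frequency in a document, discounted by
    how many documents it appears in (basic TF-IDF, no smoothing)."""
    n = len(docs)
    # document frequency: number of docs containing each term
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: (count / len(doc)) * math.log(n / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    "the delivery was late".split(),
    "late delivery again very late".split(),
    "great delivery friendly staff".split(),
]
weights = tfidf(docs)
```

Note that ‘delivery’, which occurs in every document, gets a weight of zero, while a distinctive term like ‘late’ scores highly; this is exactly the behaviour that makes frequent-but-uninformative words drop out of the topic analysis.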
There are issues here. Firstly, what are we supposed to do with all the topics? If we build a word cloud, how useful is that really? If customers use synonyms that aren’t in our dictionary, do we have to group them together in advance? We are essentially second-guessing and grouping terms, which might not match the customers’ intentions, or might vary from one situation to another. Sentiment is even more problematic, and that’s before we get to the technical challenges of sarcasm, context, comparators and double negatives, all of which such analyses handle very poorly.
So how else are we meant to analyse text data, short of painfully compiling dictionaries and constantly checking by hand? Say hello to the wonderful world of labels. The labels referred to here are generated by machine learning, i.e. by replicating human judgment from a training sample of manually labelled data. The machine doesn’t need to be told keywords; it works out common patterns, which might involve far more than single keywords, such as where terms sit in the sentence and whether they are nouns or verbs, just as a human might.
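To make that concrete, here is a toy sketch of the idea: a tiny Naive Bayes classifier, a deliberately simple stand-in for the richer models such systems actually use, which learns label patterns from a handful of manually labelled examples rather than from a keyword list. The example texts and labels are invented:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """A tiny multinomial Naive Bayes text classifier: learns which
    words co-occur with which labels from a labelled training sample."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        v = len(self.vocab)
        scores = {}
        for label in self.label_counts:
            # log prior + log likelihood with add-one smoothing
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + v
            for word in text.split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

train_texts = [
    "parcel arrived late missing items",
    "order delayed no update for days",
    "friendly helpful support resolved quickly",
    "great experience staff were helpful",
]
train_labels = ["complaint", "complaint", "praise", "praise"]

model = NaiveBayes().fit(train_texts, train_labels)
```

An unseen phrase such as "my order is late again" is then classified as a complaint, even though no rule or keyword list was ever written down; the signal was learned from the labelled sample.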
So, if labels are so great, why isn’t everyone using them? The short answer is that they’re expensive. Expensive in time, because someone with the requisite domain knowledge needs to generate the labels; and even more expensive because a data scientist then needs to use those labels to extract a signal using various techniques without resorting to ‘data torture’ (the phenomenon of eventually getting out of a dataset what you wanted, even when it isn’t scientifically justifiable). The problem and approach need to be carefully defined; the data cleansed, parsed and filtered to suit the approach; and, frankly, a great deal of trial and error is involved. Even when a predictive model is produced, it needs to be tuned, tested for stability, and then checked and curated carefully over time in case the data and performance change (and they always do in anything interesting!). This is why labelling, from a machine learning point of view, is precious and used sparingly, only for the highest-value use cases.
Thankfully, with the latest technologies, this no longer needs to be the case. Imagine a world where AI-based labelling is cheap and plentiful, and where data scientists are not required to tune and drive models.
The basic premise is that the labelling machine judges its own uncertainty and invites the user to manually label the items it most needs, maximising its performance for the minimum of human intervention. That human just needs domain knowledge, not data science skills, and the labelling required is ‘just enough’ to achieve the requisite business performance. No data artistry needed. Because it asks for human input when it is uncertain, it can also spot new topics, providing ‘early warning’ of new signals, and keep the models maintained to their requisite performance. If there are differences in labelling, labels can be merged or moved around in hierarchies; and if performance at the granular level isn’t high enough, it will choose the coarser level, just as a human might.
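A crude sketch of that uncertainty-driven loop, with hypothetical texts and confidence scores standing in for real model output:

```python
def route_for_labelling(predictions, threshold=0.8):
    """Split model predictions into auto-accepted labels and uncertain
    items queued for human review (i.e. uncertainty sampling)."""
    auto, queue = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            auto.append((item, label))
        else:
            # a domain expert labels these; the model then retrains,
            # which is also how novel topics surface as 'early warnings'
            queue.append(item)
    return auto, queue

# hypothetical model output: (text, predicted label, confidence)
predictions = [
    ("refund still not received", "billing", 0.95),
    ("app crashes when I upload", "technical", 0.91),
    ("is this thing on?", "other", 0.42),  # novel/ambiguous
]
auto, queue = route_for_labelling(predictions)
```

Only the one ambiguous item reaches the human; the confident predictions flow straight through, which is where the saving in labelling effort comes from.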
To spell out the potential savings, suppose a business wants to automate its complaint handling by building a predictive model that assigns categories (i.e. labels) to queries. There might be hundreds of categories and, as a data scientist, you might estimate that the initial labelling set will take many person-weeks, with no guarantee of actually finding a signal. Then there’s feature engineering, itself an iterative activity with no guarantees. If all this takes 6 weeks of labelling, the latest technology might typically need around 2% of that, i.e. just over half a day, to achieve the same performance. Any time spent on feature engineering is also massively reduced, as you can rapidly test and tune with more certainty and a much quicker feedback loop. Furthermore, the ongoing curation of models disappears from the data science team, replaced by a minimal amount of labelling whenever new or ambiguous signals appear: perhaps a day or so per year rather than a heavy overhead. You can quickly see that a model that might cost many hundreds of thousands per year in human input might literally cost only a few thousand instead, while being more flexible and powerful in terms of early warning and adaptability.
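The arithmetic behind those figures, purely as a sanity check (assuming a five-day working week):

```python
weeks_of_labelling = 6
working_days = weeks_of_labelling * 5   # 30 days of manual labelling
fraction_needed = 0.02                  # the ~2% figure above
assisted_days = working_days * fraction_needed
print(assisted_days)                    # just over half a working day
```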
So labels really are very powerful in a machine learning context. They move text analytics to the next level, and there are now technologies that lower the time, cost and technical skill needed to deploy them. What will you label?