With text analytics, many pressing questions about the ‘why’ and ‘what’ of a piece or body of content can be answered. Social media chatter around a brand, for example, can have a rapidly spiraling impact (remember the post showing a Kentucky man being violently removed from his seat on an overbooked United Airlines flight, and how it led to a social media disaster for the airline?). Hence there need to be ways to make sense of unstructured data from diverse sources.
This is where text analytics steps in.
Text analytics takes in such unstructured data, extracts relevant information, and structures it for further actions or decision making. In addition to social media data, other examples include e-mail messages, call center notes, and customer records. It helps extract different types of information like:
- Terms: These are extracted based on keywords (on your own site or a competitor's site).
- Named entities: These are extracted to answer the ‘who’, ‘what’, or ‘where’. Some instances include name, location, timestamp, or product.
- Concepts: These are extracted to answer the ‘about’ of a piece of content. They describe the idea behind the content.
- Sentiment: This is extracted to gauge the overall feeling around a brand at a given moment. The United Airlines example above would (evidently) register as negative sentiment, denoting unhappy customers and potential business losses.
Which tools/algorithms are most popular for text analytics?
Text analytics has gone mainstream because more than a handful of tools and algorithms are available today to derive immense value from text. Let’s have a look at a few popular ones:
Decision Trees
This type of classifier repeatedly splits data into groups or classes. It comes in handy for both classification and regression tasks. Popular decision tree algorithms include:
- ID3: Iterative Dichotomiser 3 builds a decision tree by splitting the data on the feature with the highest information gain (the largest reduction in entropy) until every group contains homogeneous data.
- C4.5: This algorithm also builds on information gain and entropy (in the form of the gain ratio), just like ID3. Unlike ID3, it accepts both continuous and discrete features and handles missing values too.
- CART: Classification and Regression Tree works much like C4.5. One notable difference is that CART uses Gini impurity (to assess the ‘purity’ or homogeneity of a node) instead of the information gain/entropy used by C4.5.
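The split criteria named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy labels and the idea of splitting on the word "goal" are hypothetical, and real ID3, C4.5, and CART implementations add much more (feature selection over many candidates, pruning, handling of continuous values).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels (the criterion CART uses)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy data: six documents, split on whether they contain the word "goal".
parent = ["sport", "sport", "sport", "politics", "politics", "politics"]
with_goal = ["sport", "sport", "sport"]
without_goal = ["politics", "politics", "politics"]

print("entropy of parent:", entropy(parent))
print("gini of parent:", gini(parent))
print("information gain of split:", information_gain(parent, [with_goal, without_goal]))
```

A split that produces perfectly homogeneous groups, as in this toy case, recovers all of the parent's entropy as information gain; a useless split would score close to zero.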
Naive-Bayes
This is a popular technique to classify text and documents into categories (for example, whether to classify a document as Sport or as Politics based on the occurrence of certain words). It is a simple way to assign class or category labels to instances or cases.
Rather than being a single distinct algorithm, it is a family of algorithms that share one underlying principle: given the class, the value of a given feature is assumed to be independent of the value of any other feature.
Practical applications include marking email as spam or not spam and assessing a piece of content as positive or negative. It is also used in facial recognition software.
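To make the independence principle concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier written from scratch. The four training sentences, the function names, and the use of add-one (Laplace) smoothing are assumptions made for illustration; a production system would more likely rely on an established library.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (text, label) pairs; returns the counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Pick the label with the highest log prior plus summed log word likelihoods."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Each word contributes independently (the "naive" assumption);
            # add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data.
training = [
    ("the team won the match", "Sport"),
    ("the striker scored a goal", "Sport"),
    ("the minister won the election", "Politics"),
    ("parliament passed the bill", "Politics"),
]
model = train(training)
print(predict("the team scored a goal", *model))
print(predict("the minister passed the bill", *model))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents grow beyond a few words.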
Support Vector Machines
This is a supervised machine learning algorithm that can be applied to classification and regression problems. Its essential component is the kernel trick: a kernel function stands in for the dot product between feature vectors, which implicitly maps the data into a higher-dimensional space where classes that are not linearly separable in the original space can be separated by a linear boundary. It is used in hypertext categorization, image classification, and facial recognition applications.
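The kernel trick can be illustrated with a small check rather than a full SVM. The sketch below assumes a degree-2 polynomial kernel on 2-D inputs (chosen only because its explicit feature map is short to write down) and shows that evaluating the kernel gives the same number as a dot product in the higher-dimensional space, without ever constructing that space.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: k(x, y) = (x . y) ** 2."""
    return dot(x, y) ** 2

def feature_map(x):
    """Explicit 3-D feature map for 2-D input whose dot product equals poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# Two arbitrary example points (hypothetical values).
x, y = (1.0, 2.0), (3.0, 0.5)

implicit = poly_kernel(x, y)                      # computed in the original 2-D space
explicit = dot(feature_map(x), feature_map(y))    # computed in the mapped 3-D space
print(implicit, explicit)
```

An SVM only ever needs dot products between points, so swapping in a kernel lets it fit a linear boundary in the mapped space at the cost of a computation in the original space.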
K Nearest Neighbors
k-NN is used in search tasks where you are looking for items similar to a given one. You determine similarity by creating a vector representation of the items and then comparing how similar or dissimilar they are using a distance metric such as Euclidean distance.
The best example of k-NN’s prowess is an e-commerce site’s product recommendation feature. You can also utilize k-NN to do Concept Search (finding semantically similar documents).
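A minimal sketch of that similarity search is shown below. The product names and their two-dimensional vectors are made up for illustration; in practice the vectors would come from product attributes, purchase histories, or text embeddings.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, items, k):
    """Return the k (name, vector) pairs closest to the query vector."""
    return sorted(items, key=lambda item: euclidean(query, item[1]))[:k]

# Hypothetical catalogue: each product is represented by a 2-D feature vector.
catalogue = [
    ("running shoes", (0.9, 0.1)),
    ("trail shoes", (0.8, 0.2)),
    ("office chair", (0.1, 0.9)),
    ("desk lamp", (0.2, 0.8)),
]

# Vector for the product the shopper is currently viewing (also hypothetical).
query = (0.88, 0.1)
recommended = [name for name, _ in nearest_neighbors(query, catalogue, k=2)]
print(recommended)
```

Sorting the whole catalogue is fine for a handful of items; larger catalogues typically call for an index structure or approximate nearest-neighbor search.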
Artificial Neural Networks
ANNs are primarily used for classification problems with non-linear decision boundaries. Loosely modeled on the workings of the human brain, an ANN operates on hidden units (which correspond to the neurons in the brain). The following three kinds of algorithms can be used to train an ANN:
- Gradient Descent
- Evolutionary Algorithms
- Genetic Algorithms
Image compression, handwriting analysis, and stock market movement prediction are some areas where ANNs come in useful. They can examine a huge volume of information and help make quick decisions.
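Of the training approaches listed, gradient descent is the easiest to sketch. The example below is deliberately reduced to a single sigmoid neuron learning the logical OR function, so it only finds a linear boundary; a real ANN stacks hidden layers of such units to capture non-linear boundaries, and the learning rate and epoch count here are arbitrary assumptions.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    """Output of one sigmoid neuron for input vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_neuron(samples, lr=1.0, epochs=5000):
    """Batch gradient descent on cross-entropy loss for one sigmoid neuron."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in samples:
            error = predict(x, weights, bias) - target  # gradient of the loss w.r.t. z
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Step against the mean gradient.
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

# Toy data: the logical OR function.
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(or_samples)
outputs = [round(predict(x, weights, bias)) for x, _ in or_samples]
print(outputs)
```

The same loop, applied layer by layer with the chain rule, is what backpropagation does in a multi-layer network.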
Fuzzy C-Means
This is a useful form of clustering that adds value when items can belong to more than one cluster. It works on the principle that, once clustering is complete, the items in a cluster are as similar as possible to each other and as dissimilar as possible to items in other clusters.
It comprises the steps below (similar to k-means clustering):
- Pick the number of clusters into which the items can be categorized
- Assign each data point a coefficient for its degree of membership in each cluster
- Repeat the following until no coefficient changes between two iterations by more than a predefined sensitivity threshold: compute the centroid of each cluster, then recompute each data point's coefficients
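These steps can be sketched for one-dimensional data as follows. The function name, the fuzzifier `m=2.0`, the threshold `tol`, the random seed, and the sample points are all assumptions chosen for illustration; the update formulas are the standard fuzzy c-means ones, but this is not a tuned implementation.

```python
import random

def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-5, max_iter=200, seed=0):
    """Cluster 1-D points; returns (centers, memberships) with one membership row per point."""
    rng = random.Random(seed)
    # Give every point a random coefficient per cluster, normalised so each row sums to 1.
    memberships = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        memberships.append([u / total for u in row])

    centers = []
    for _ in range(max_iter):
        # Centroid of each cluster: mean of the points weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            weights = [row[j] ** m for row in memberships]
            centers.append(sum(w * x for w, x in zip(weights, points)) / sum(weights))

        # Recompute every coefficient from the distances to all centroids.
        updated = []
        for x in points:
            dists = [max(abs(x - c), 1e-12) for c in centers]  # guard against zero distance
            exponent = 2.0 / (m - 1.0)
            updated.append([1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists])

        # Stop once no coefficient moved by more than the sensitivity threshold.
        change = max(abs(a - b) for old, new in zip(memberships, updated) for a, b in zip(old, new))
        memberships = updated
        if change <= tol:
            break
    return centers, memberships

# Hypothetical data: two tight groups plus one point (3.0) sitting between them.
points = [0.8, 1.0, 1.2, 3.0, 4.8, 5.0, 5.2]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
print("centers:", sorted(centers))
print("memberships of 3.0:", memberships[points.index(3.0)])
```

The in-between point ends up with a sizeable coefficient for both clusters, which is exactly the case where fuzzy c-means differs from hard k-means assignment.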
Disciplines like bioinformatics, healthcare, and economics make use of fuzzy c-means with great success. In image analysis, too, it overcomes obstacles that trouble traditional k-means clustering (heavy noise, shadowing, camera variations, etc.) to deliver better image processing.
LDA
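Latent Dirichlet Allocation (LDA) is a topic model that treats each document as a mixture of topics and each topic as a distribution over words. (It should not be confused with Linear Discriminant Analysis, which shares the acronym and instead finds a linear combination of features that separates classes.) A small example of how LDA helps in topic clustering follows.
Suppose there are three separate sentences: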
- I eat chicken and vegetables
- Chicken are pets
- My dog loves to eat chicken
With LDA, topic clustering for these three sentences might come out as follows:
Sentence 1 = 100% Topic B
Sentence 2 = 100% Topic A
Sentence 3 = 33% Topic A and 67% Topic B
Looking at the words that make up each topic, we can then label the two clusters: Pets (Topic A) and Food (Topic B).
This example boils down to the steps below:
- Provide an estimate of the potential number of topics
- The algorithm assigns each word to a topic
- The algorithm checks and revises the topic assignments in a loop
This helps in ensuring coherent topic clustering.
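The loop described above can be sketched with a small collapsed Gibbs sampler, one common way of fitting LDA. Everything here is an assumption for illustration: the function name, the hyperparameters `alpha` and `beta`, the iteration count, and the crude lowercase tokenisation. On a corpus this tiny the resulting percentages depend on the random seed and will not necessarily match the figures quoted in the example.

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Return, for each document, its estimated proportion of each topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                       # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics
    assignments = []

    # Start by assigning every word to a random topic.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly revisit each word and resample its topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + len(vocab) * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]

sentences = ["I eat chicken and vegetables", "Chicken are pets", "My dog loves to eat chicken"]
docs = [s.lower().split() for s in sentences]
proportions = lda_gibbs(docs, n_topics=2)
for sentence, mix in zip(sentences, proportions):
    print(sentence, [round(p, 2) for p in mix])
```

The topics come back as unlabeled indices; deciding that one means "Pets" and the other "Food" is an interpretation step left to the analyst.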
With these tools, your text analytics objectives can be met with favorable outcomes. Do let us know which one is your favorite text analytics tool in the comments box below.
Article contributed by PromptCloud, a pioneer in large-scale and custom web data extraction services.