(This post originally appeared on recurrentnull.wordpress.com, as first part in a series on sentiment analysis of movie reviews.)
Imagine I show you a book review, on amazon.com, say. Imagine I hide the number of stars; all you get to see is the text of the review. And now I’m asking you: that review, is it good or bad? Just two categories, good or bad. That’s easy, right?
Well, it should be easy, for humans (although depending on the input there can be lots of disagreement between humans, too.) But if you want to do it automatically, it turns out to be surprisingly difficult.
This is the start of a short series on sentiment analysis, based on my TechEvent presentation. My focus will be more on data exploration than on achieving the best possible accuracy; more on getting a feeling for the difficulties than on juggling with parameters. More on the Natural Language Processing (NLP) aspect than on generic classification. And even though the series will be short (for now – given time constraints ;-)), it’s definitely a topic to be pursued (looking at the rapidity of developments in the field).
Let’s jump right in. So why would one do sentiment analysis? Because out there, not every relevant text comes labeled as “good” or “bad”. Take emails, blog posts, support tickets. Is that guy actually angry at us (our service desk/team/company)? Is she disappointed by our product? We’d like to know.
So while sentiment analysis is undoubtedly useful, the quality of the results will rely on having a big enough training set – and someone will have to sit down and categorize all those tweets/reviews/whatever. (Datasets are available where labeling of the training set was done automatically, see e.g. the Stanford Sentiment140 dataset, but this approach must induce biases, to say the least.) In our case, we care more about how things work than about the actual accuracies; still, keep in mind, when looking at the accuracy numbers, that especially for the models discussed in later posts, a bigger dataset might achieve much better performance.
The data
Our dataset consists of 25,000 labeled training reviews, plus 25,000 test reviews (also labeled), available from http://ai.stanford.edu/~amaas/data/sentiment/. In both the training and the test set, 12,500 reviews have been rated positive and 12,500 negative by human annotators.
The dataset was originally used in Maas et al. (2011), Learning Word Vectors for Sentiment Analysis.
Preprocessing was done after the example of the gensim doc2vec notebook (we will describe doc2vec in a later post).
Good or bad?
Let’s load the preprocessed data and have a look at the very first training review. (For better readability, I’ll try not to clutter this text with too much code, so there’ll only be code when there’s a special point in showing it. For more code, see the notebook for the original talk.)
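(Just for orientation, here is a minimal loading sketch, assuming the Stanford archive has been extracted into ./aclImdb with its train/pos, train/neg and corresponding test folders; the preprocessed file actually used for the talk was built following the gensim doc2vec notebook, so the details differ.)

```python
# Minimal loading sketch; assumes the raw Stanford download unpacked to ./aclImdb.
from pathlib import Path

def load_reviews(split):
    """Return (texts, labels) for 'train' or 'test'; label 1 means positive."""
    texts, labels = [], []
    for folder, label in [("pos", 1), ("neg", 0)]:
        for path in sorted(Path("aclImdb", split, folder).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

train_texts, train_labels = load_reviews("train")
test_texts, test_labels = load_reviews("test")
print(train_texts[0][:300])  # peek at the beginning of the first training review
```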
a reasonable effort is summary for this film . a good sixties film but lacking any sense of achievement . maggie smith gave a decent performance which was believable enough but not as good as she could have given , other actors were just dreadful ! a terrible portrayal . it wasn't very funny and so it didn't really achieve its genres as it wasn't particularly funny and it wasn't dramatic . the only genre achieved to a satisfactory level was romance . target audiences were not hit and the movie sent out confusing messages . a very basic plot and a very basic storyline were not pulled off or performed at all well and people were left confused as to why the film wasn't as good and who the target audiences were etc . however maggie was quite good and the storyline was alright with moments of capability . 4 . n
Looking at this text, we already see complexity emerging. As a human reader, I’m sure you’ll say this is a negative review, and undoubtedly there are some clearly negative words (“dreadful”, “confusing”, “terrible”). But to a high degree, negativity comes from negated positive words: “lacking achievement”, “wasn’t very funny”, “not as good as she could have given”. So clearly we cannot just look at single words in isolation; we have to look at sequences of words – n-grams (bigrams, trigrams, …), as they say in natural language processing.
n-grams
The question is though, at how many consecutive words should we look? Let’s step through an example. “Funny” (unigram) is positive, “very funny” (bigram) even more so. “Not very funny” (trigram) is negative. If it were “not so very funny” we’d need 4-grams … How about “I didn’t think it was so very funny”? And this could probably go on like that… So how many adjacent words do we need to consider? There evidently is no clear border… how can we decide? Fortunately, we won’t have to decide upfront. We’ll do that as part of our search for the optimal classification algorithm.
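To make this concrete, here is a small illustration (not from the original talk code) of how n-grams can be extracted with scikit-learn’s CountVectorizer, whose ngram_range parameter is exactly the knob we will later include in the search:

```python
# Purely illustrative: the n-grams CountVectorizer extracts for increasing
# ngram_range settings.
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["it was not so very funny"]
for n in (1, 2, 3):
    vectorizer = CountVectorizer(ngram_range=(1, n))
    vectorizer.fit(sentence)
    print(f"1-{n}-grams:", list(vectorizer.get_feature_names_out()))
```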
So in general, how can automatic sentiment analysis work? The simplest approach is via word counts. Basically, we count the positive words, giving them different weights according to how positive they are. Same with the negative words. And then, the side with the highest score “wins”.
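In toy form, with completely made-up weights, that idea might look like this:

```python
# Toy version of the weighted word-count idea: sum up the (made-up) weights
# of the words we recognize; a positive total means "good", a negative "bad".
weights = {"excellent": 3.4, "decent": -0.2, "awful": -5.6, "funny": 1.5}

def score(text):
    return sum(weights.get(word, 0.0) for word in text.lower().split())

print(score("a decent cast but an awful script"))  # negative total -> "bad"
```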
But no one’s going to sit there and categorize all those words! The algorithm has to figure that out itself. How can it do that? Via the labeled training samples. For them, we have the sentiment as well as the information about how often each word occurred, e.g., like this:
|  | sentiment | beautiful | bad | awful | decent | horrible | ok | awesome |
|---|---|---|---|---|---|---|---|---|
| review 1 | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 0 |
| review 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| review 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
From this, the algorithm can determine the words’ polarities and weights, and arrive at something like:
| word | beautiful | bad | awful | decent | horrible | ok | awesome |
|---|---|---|---|---|---|---|---|
| weight | 3.4 | -2.9 | -5.6 | -0.2 | -4.9 | -0.1 | 5.2 |
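In practice, a classifier learns these weights from the counts. As a sketch (not the original talk code), fitting a logistic regression on the toy document-term matrix above yields one coefficient per word, playing exactly the role of the weight row (the exact values will of course differ):

```python
# Sketch: learning word weights from the toy table above with a linear model.
import numpy as np
from sklearn.linear_model import LogisticRegression

vocab = ["beautiful", "bad", "awful", "decent", "horrible", "ok", "awesome"]
X = np.array([[0, 1, 2, 1, 1, 0, 0],   # review 1
              [1, 0, 0, 0, 0, 0, 1],   # review 2
              [0, 0, 0, 1, 1, 0, 0]])  # review 3
y = np.array([0, 1, 0])                # the sentiment column

clf = LogisticRegression().fit(X, y)
for word, coef in zip(vocab, clf.coef_[0]):
    print(f"{word:10s} {coef: .3f}")   # negative coefficient = negative word
```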
Now, what we can do is run a grid search over combinations of
- classification algorithms,
- parameters for those algorithms (algorithm-dependent),
- different n-gram ranges,
and record the combinations that work best on the test set.
Algorithms included in the search were logistic regression (with different settings for regularization), support vector machines, and random forests (with different settings for the maximum tree depth). In that way, both linear and non-linear procedures were present. All aforementioned combinations of algorithms and parameters were tried with unigrams, unigrams + bigrams, and unigrams + bigrams + trigrams as features.
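As a sketch of what such a search can look like in scikit-learn (condensed, and not the exact grids from the talk; also, the talk evaluated on the test set rather than via cross-validation):

```python
# Condensed grid-search sketch: a bag-of-n-grams vectorizer plus a classifier,
# with the n-gram range searched together with the model's own parameters.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__C": [0.01, 0.1, 1.0, 10.0],   # regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
# search.fit(train_texts, train_labels)   # using the data loaded earlier
# analogous pipelines can be set up for the SVM and random forest runs
```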
And so after long computations, the winner is … wait! I didn’t yet say anything about stopword removal. Without filtering, the most frequent words in this dataset are, to a large extent, stopwords, so we definitely want to remove that noise. The Python nltk library provides a stopword list, but this contains words like ‘not’, ‘nor’, ‘no’, ‘wasn’, ‘ain’, etc., words that we definitely do NOT want to remove when doing sentiment analysis. So I’ve used a subset of the nltk list from which I removed all negations / negated forms.
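One way to build such a reduced list (the exact subset used for the talk may differ slightly):

```python
# Start from nltk's English stopword list and keep all negation-like entries;
# the resulting list can be passed to CountVectorizer(stop_words=...).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

negations = {
    "no", "nor", "not", "ain", "aren", "aren't", "couldn", "couldn't",
    "didn", "didn't", "doesn", "doesn't", "don", "don't", "hadn", "hadn't",
    "hasn", "hasn't", "haven", "haven't", "isn", "isn't", "mightn", "mightn't",
    "mustn", "mustn't", "needn", "needn't", "shan", "shan't", "shouldn",
    "shouldn't", "wasn", "wasn't", "weren", "weren't", "won", "won't",
    "wouldn", "wouldn't",
}
custom_stopwords = [w for w in stopwords.words("english") if w not in negations]
```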
Here, then, are the accuracies obtained on the test set. For each classifier, I’m displaying the one with the most successful parameter settings (without detailing them here, in order not to distract from the main topic of the post) and the most successful n-gram configuration.
| classifier | test set accuracy |
|---|---|
| Logistic Regression | 0.89 |
| Support Vector Machine | 0.84 |
| Random Forest | 0.84 |
Overall, these accuracies look astonishingly good, given that in general, for sentiment analysis, something around 80% is often cited as a to-be-expected accuracy. However, I find it difficult to talk about a to-be-expected value here: the accuracy achieved will very much depend on the dataset in question! So we really would need to know, for the exact dataset used, what accuracies have been achieved by other algorithms, and most importantly: what is the agreement between human annotators here? If humans agree 100% on whether items of a dataset are positive or negative, then 80% accuracy for a classifier sounds rather bad! But if agreement between humans is only 85%, the picture is totally different. (And then there’s a totally different angle, extremely important but not the focus of this post: say we achieve 90% accuracy where others achieve 80% and humans agree 90% of the time. Technically we’re doing great! But we’re still misclassifying one in ten texts! Depending on why we’re doing this at all, and what automated action we’re planning to take based on the results, getting one in ten wrong might turn out to be catastrophic!)
Having said that, I find the results interesting for two reasons. For one, logistic regression, a linear classifier, does best here. This just confirms something that is often seen in machine learning: logistic regression is a simple but very powerful algorithm. Secondly, the best logistic regression result was reached when including bigrams as features, whereas trigrams did not bring any further improvement. A great thing about logistic regression is that you can peek into the classifier’s brain and see which features it decided are important, by looking at the coefficients.
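Here is a sketch of how that can be done, assuming a fitted pipeline like the one from the grid-search sketch above (the step names "vec" and "clf" are assumptions from that sketch):

```python
# Pair each feature (word or n-gram) with its logistic regression coefficient.
import pandas as pd

best = search.best_estimator_                          # fitted pipeline from the search
words = best.named_steps["vec"].get_feature_names_out()
coefs = best.named_steps["clf"].coef_[0]

features = pd.DataFrame({"coef": coefs, "word": words})
print(features.sort_values("coef", ascending=False).head(10))  # most positive
print(features.sort_values("coef").head(10))                   # most negative
```

Let’s inspect what words make a review positive. The most positive features, in order: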
|  | coef | word |
|---|---|---|
| 2969 | 0.672635 | excellent |
| 6681 | 0.563958 | perfect |
| 9816 | 0.521026 | wonderful |
| 8646 | 0.520818 | superb |
| 3165 | 0.505146 | favorite |
| 431 | 0.502118 | amazing |
| 5923 | 0.481505 | must see |
| 5214 | 0.461807 | loved |
| 3632 | 0.458645 | funniest |
| 2798 | 0.453481 | enjoyable |
Pretty much makes sense, doesn’t it? And we do see a bigram among these: “must see”. How about other bigrams contributing to the plus side?
|  | coef | word |
|---|---|---|
| 5923 | 0.481505 | must see |
| 3 | 0.450675 | 10 10 |
| 6350 | 0.421314 | one best |
| 9701 | 0.389081 | well worth |
| 5452 | 0.371277 | may not |
| 6139 | 0.329485 | not bad |
| 6970 | 0.323805 | pretty good |
| 2259 | 0.307238 | definitely worth |
| 5208 | 0.303380 | love movie |
| 9432 | 0.301404 | very good |
These mostly make a lot of sense, too. How about words / n-grams that make a review negative? First, the “overall ranking” – the last one is the worst:
|  | coef | word |
|---|---|---|
| 6864 | -0.564446 | poor |
| 2625 | -0.565503 | dull |
| 9855 | -0.575060 | worse |
| 4267 | -0.588133 | horrible |
| 2439 | -0.596302 | disappointing |
| 6866 | -0.675187 | poorly |
| 1045 | -0.681608 | boring |
| 2440 | -0.688024 | disappointment |
| 702 | -0.811184 | awful |
| 9607 | -0.838195 | waste |
So we see worst of all is when it’s a waste of time. Could agree to that!
Now, this time, there are no bigrams among the 10 worst ranked features. Let’s look at them in isolation:
|  | coef | word |
|---|---|---|
| 6431 | -0.247169 | only good |
| 3151 | -0.250090 | fast forward |
| 9861 | -0.264564 | worst movie |
| 6201 | -0.324169 | not recommend |
| 6153 | -0.332796 | not even |
| 6164 | -0.333147 | not funny |
| 6217 | -0.357056 | not very |
| 6169 | -0.368976 | not good |
| 6421 | -0.437750 | one worst |
| 9609 | -0.451138 | waste time |
Evidently, it was worth keeping the negations! So, the way this classifier works pretty much makes sense, and we seem to have reached acceptable accuracy (I hesitate to write this because … what counts as acceptable depends on … see above). If we take a less simple approach – move away from basically just counting (weighted) words, where every word is a one-hot encoded vector – can we do any better?
With that cliffhanger, I end for today … stay tuned for the continuation, where we dive into the fascinating world of word vectors … See you there!