This paper describes how companies can design an intelligent system to better understand their customers’ sentiments and improve their experience, which can help businesses strengthen their market position.
Sentiment analysis is widely used in web and social media monitoring. It allows businesses to gain a comprehensive view of public opinion on the organization and its services. Deriving insights from the text and emoticons posted on social media is now a practice widely adopted by organizations worldwide. Digital media represents an extensive opportunity for businesses in any industry to capture the needs, opinions and intent that users share on social media and the web. Listening to the consumer’s voice requires an in-depth understanding of what customers express in natural language. This paper describes the design of an intelligent system that understands human language and identifies the sentiment behind it.
1. Introduction:
In recent years, the amount of data generated on the internet has increased rapidly, and it is expected to continue growing exponentially in the near future. Every day, large amounts of data are generated by social media, financial transactions, the behaviour of internet users, and consumers’ browsing and purchasing histories. This data is continuously explored by industry and academia for insights that can enhance industry revenue and the user experience on the internet. The data also includes a huge volume of raw text in the form of product reviews, news or research articles, blogs, song lyrics, poems, etc. Labeling or categorizing this text helps in efficiently searching for relevant information about a product or query within the huge amount of data on the internet.
Financial organizations are particularly concerned about their products and their reputation. Hence, they rely on customer reviews to improve their services and products. Recently, various text mining and machine learning techniques have been explored to draw insights about the sentiment polarity of reviews.
The proposed work is a comparative study of the performance of deep learning techniques and traditional classification techniques in finding the polarity of customer reviews in the banking and insurance domain. The aim is to automate the task of manually rating each piece of feedback. This approach gives a good estimate of a company’s reputation in the market in very little time, so that better decisions can be made in real time. The methodology employs deep learning techniques, namely a convolutional neural network (CNN), a bidirectional recurrent neural network (RNN) and a bidirectional long short-term memory (LSTM) network, and two traditional text classification algorithms, i.e., support vector machine (SVM) and Naive Bayes (NB).
2. Methodology:
The methodology consists of two main steps. The first step consists of crawling data from web resources, followed by manual rating of the data for classifier training. The second step involves training the classification algorithms.
2.1 Data construction: Pages from different online sites that contain a large number of customer reviews of banks and insurance agencies were fetched using Python libraries such as Scrapy and Beautiful Soup. The data was then manually rated into three categories, i.e., negative, neutral and positive.
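The extraction step can be sketched as follows. This is a minimal illustration only: it uses Python's standard-library html.parser as a stand-in for Scrapy or Beautiful Soup, and the page markup (a div with class "review"), the sample page and the names ReviewExtractor and extract_reviews are assumptions made for the example, not details of the sites used in the study.

```python
from html.parser import HTMLParser


class ReviewExtractor(HTMLParser):
    """Collect the text of every <div class="review"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.reviews = []   # finished review texts
        self._depth = 0     # > 0 while the parser is inside a review <div>
        self._chunks = []   # text pieces of the review currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # nested <div> inside a review
        elif "review" in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and normalise whitespace.
                self.reviews.append(" ".join("".join(self._chunks).split()))
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_reviews(html):
    """Return the review texts found in one HTML page."""
    parser = ReviewExtractor()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Made-up page fragment; a real site would need its own selectors.
SAMPLE_PAGE = """
<html><body>
  <div class="review">The claim was settled <b>quickly</b>.</div>
  <div class="ad">Open an account today!</div>
  <div class="review card">Hidden charges on my savings account.</div>
</body></html>
"""

reviews = extract_reviews(SAMPLE_PAGE)
```

Each extracted string would then be passed to a human annotator for the negative, neutral or positive rating.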
The constructed dataset consisted of 5000 labeled reviews. Punctuation is irrelevant to the task and was therefore removed from the reviews. The unique words of the dataset were ordered according to their frequency. Stop words such as the, is, an, about, etc., were also removed from the dictionary because they do not affect sentiment polarity and occur with high frequency in all documents. The final dictionary, of size D, contains the unique words of the dataset. Each review is represented as a binary vector of size D that has a 1 at the index of each dictionary word present in the review and 0 elsewhere. Hence, the whole dataset can be represented as a binary matrix with Tr + Ts rows and D columns, where Tr and Ts are the numbers of training and test review samples, respectively. Each row of the matrix is the binary vector of one review, of dimension D.
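The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word list, the three sample reviews, the alphabetical tie-breaking and the function names are all made up for the example and are not taken from the study.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list; the list used in the study is not given.
STOP_WORDS = {"the", "is", "an", "about", "a", "was", "and", "my"}


def tokenize(review):
    """Lower-case, strip punctuation, split on whitespace, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = review.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]


def build_dictionary(reviews):
    """Unique words ordered by descending frequency (ties broken alphabetically)."""
    counts = Counter(w for r in reviews for w in tokenize(r))
    return sorted(counts, key=lambda w: (-counts[w], w))


def to_binary_vector(review, dictionary):
    """Vector of length D with 1 where the dictionary word occurs in the review."""
    present = set(tokenize(review))
    return [1 if w in present else 0 for w in dictionary]


reviews = [
    "The service was slow, and the staff was rude.",
    "Quick claim settlement. Helpful staff!",
    "The premium is about average.",
]
dictionary = build_dictionary(reviews)                        # D words
matrix = [to_binary_vector(r, dictionary) for r in reviews]   # one row per review
```

Here the matrix has one row per review and D = len(dictionary) columns; splitting its rows into training and test parts gives the Tr and Ts samples.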
The training samples were used to train the following supervised learning methods for the comparative study:
3. Support vector machine (SVM): The support vector machine is an efficient discriminative supervised classification model. It has been widely used for classification problems in industry due to its high prediction accuracy and its ability to handle high-dimensional data. The model separates two classes on the basis of two key concepts. First, a kernel function maps input data that is not linearly separable into a high-dimensional feature space in which it becomes linearly separable. Second, the separating hyperplane with the maximum margin is selected, and it acts as the decision boundary for classification.
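The two concepts can be illustrated on a toy problem. The sketch below is not a trained SVM: it assumes an explicit feature map (x1, x2, x1*x2) and a hand-picked hyperplane, both chosen only for illustration, to show that XOR-style data becomes linearly separable after the mapping and how the margin of a separating hyperplane is measured.

```python
import math

# XOR-style data: not linearly separable in the original 2-D input space.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, +1, +1, -1]


def feature_map(x):
    """Hypothetical explicit map to 3-D: (x1, x2, x1 * x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def decision(w, b, z):
    """Signed score w . z + b of a mapped point z."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


# Hand-picked hyperplane in feature space (an assumption, not a trained model).
w, b = (2.0, 2.0, -4.0), -1.0

# A positive value y * (w . z + b) means the point lies on the correct side.
functional_margins = [y * decision(w, b, feature_map(x))
                      for x, y in zip(points, labels)]
separable = all(m > 0 for m in functional_margins)

# Geometric margin: smallest distance from a mapped point to the hyperplane.
geometric_margin = min(functional_margins) / math.sqrt(sum(wi * wi for wi in w))
```

In practice a kernel function computes inner products in the feature space without building the mapped vectors explicitly, and the training procedure searches for the hyperplane whose geometric margin is largest.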
4. Naive Bayes classifier: Naive Bayes (NB) is a generative (probabilistic) classification model based on the assumption that the features are independent of each other given the class. It is applied to business intelligence problems such as text mining and computer vision when training examples are few but the features are independent of each other. As a generative classifier, it learns a model of the joint distribution P(X, yj) of input and output, where X is the input data and yj is the output (class label). The posterior, i.e., the probability of class yj given the input data X, is obtained from the joint distribution using Bayes' rule:
$$P(y_j \mid X) = \frac{P(X, y_j)}{P(X)} = \frac{P(X \mid y_j)\, P(y_j)}{P(X)}$$
The parameters of the distributions P(X | yj) and P(yj) are estimated by parameter estimation methods such as maximum likelihood or expectation maximization.
Finally, NB assigns the label of the most probable target class in Y to any given data instance xi, i.e.,
$$L(x_i) = \arg\max_{y_j \in Y} P(y_j \mid x_i)$$
where L(xi) is the label assigned to the given data instance xi.
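A minimal sketch of this decision rule for binary review vectors is given below. It assumes the Bernoulli form of Naive Bayes and maximum-likelihood estimates with Laplace smoothing (the smoothing constant alpha is an assumption); the four-word dictionary and the training vectors are made up for illustration.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log P(y_j) and P(word present | y_j) with Laplace smoothing alpha."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, p_present = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        p_present[c] = [
            (sum(r[k] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for k in range(n_features)
        ]
    return classes, log_prior, p_present


def predict(model, x):
    """Return the class maximising log P(y_j) + sum_k log P(x_k | y_j)."""
    classes, log_prior, p_present = model
    scores = {}
    for c in classes:
        score = log_prior[c]
        for xk, p in zip(x, p_present[c]):
            score += math.log(p if xk else 1.0 - p)
        scores[c] = score
    return max(classes, key=lambda c: scores[c])


# Toy binary vectors over a made-up 4-word dictionary: [good, helpful, slow, rude]
X_train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y_train = ["positive", "positive", "negative", "negative"]

model = train_bernoulli_nb(X_train, y_train)
label_a = predict(model, [1, 0, 0, 0])   # contains only "good"
label_b = predict(model, [0, 0, 0, 1])   # contains only "rude"
```

The denominator P(X) is the same for every class, so it can be dropped when taking the arg max; working with log probabilities avoids numerical underflow when D is large.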
5. Artificial Neural Networks (ANN): Recently, artificial neural networks and their variants have been widely exploited for classification tasks to build intelligent systems for business decision making, such as predicting financial fraud, handwriting recognition, computer vision, text mining, self-driving cars, etc. These models mimic the behaviour of brain neurons to learn from given situations. The simplest form of ANN consists of only two layers of neurons, i.e., an input layer and an output layer, and can be applied to linear regression and linear classification. Non-linear classification problems such as XOR need to be addressed by introducing hidden layers, which add complexity to the model. The number of neurons required per hidden layer can also be reduced significantly by adding more hidden layers.
However, adding hidden layers may cause the model to overfit. Therefore, the trade-off between complexity and overfitting should be considered while building a model. Various ANN architectures have been proposed for different problems.
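The role of a hidden layer in the XOR problem can be shown with a very small network. The sketch below uses hand-set weights and a threshold activation purely for illustration; a real network would learn its weights from data.

```python
def step(z):
    """Threshold activation: 1 if z > 0 else 0."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    """2-2-1 network with hand-set weights (illustrative, not learned)."""
    h_or = step(x1 + x2 - 0.5)        # hidden unit 1 fires for OR
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2 fires for AND
    return step(h_or - h_and - 0.5)   # OR but not AND = XOR


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

No single threshold unit acting directly on x1 and x2 can produce this truth table, because the two classes cannot be separated by one straight line in the input plane.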
5.1 Feed forward neural network: In a feed forward neural network, each neuron or node in one layer is connected to every neuron in the next layer. Hence, information is constantly “fed forward” from one layer to the next. Pairs of input and output values are fed into the network for many cycles, and the back propagation algorithm updates the weights to minimize the error, so that the network learns the relationship between input and output. Networks that have many hidden layers are called deep neural networks (DNN), and each successive hidden layer learns more complex patterns than the previous one.
However, adding successive hidden layers may make the model too specific to the training examples, which causes poor performance on test or unseen instances. Another problem faced in deep neural networks is the “vanishing gradient problem”: different layers of a DNN learn at vastly different speeds. For example, the later layers of the network may learn well while the early layers get stuck during training and learn almost nothing.
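The vanishing gradient effect can be made concrete with a chain of scalar sigmoid layers. The sketch below is a simplified illustration, assuming one neuron per layer, a weight of 1.0 in every layer and an input of 0.5; it computes how strongly the output reacts to a change in the input as the depth grows.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, weights):
    """d(output)/d(input) for a chain of scalar layers a = sigmoid(w * a_prev)."""
    activation, grad = x, 1.0
    for w in weights:
        activation = sigmoid(w * activation)
        # Chain rule: multiply by w * sigmoid'(z), where sigmoid'(z) = a * (1 - a) <= 0.25.
        grad *= w * activation * (1.0 - activation)
    return grad


# Assumed setting: every layer has weight 1.0 and the input is 0.5.
gradients = {depth: input_gradient(0.5, [1.0] * depth) for depth in (1, 5, 10, 20)}
```

Because each sigmoid layer contributes a factor of at most 0.25 here, the gradient reaching the earliest layer shrinks geometrically with depth, which is why the early layers receive almost no learning signal.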
5.1.1 Convolution neural network: The convolution neural network (CNN) is a particular case of DNN that mitigates the “vanishing gradient problem” by using weight initialization, feature preparation (through batch normalization, which centers input feature values around zero), and rectified linear units (ReLU). This approach has been used successfully to extract deep features for classification tasks and is widely used in computer vision. A convolution network combines three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights (weight replication), and spatial or temporal subsampling.
Basically, a CNN consists of two primary types of layers. In the case of computer vision, the first is the convolution layer, which convolves local image regions independently with multiple filters and combines the responses according to the coordinates of the image regions. The second is the pooling layer, which summarises the feature responses using a fixed stride and pooling kernel size. CNNs do not consider contextual dependencies between different image regions because both the convolution and pooling operations are applied locally to separate image areas. Contextual information, however, is crucial for obtaining the real meaning of raw sequential text data. Hence, other DNN architectures have been developed to capture contextual information, such as the recurrent neural network (RNN) and its variant, long short-term memory (LSTM).
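The convolution and pooling operations can be sketched in one dimension, which is the form usually applied to text sequences. The signal, the two-tap filter and the pooling settings below are made-up values for illustration only.

```python
def conv1d(sequence, kernel):
    """Valid 1-D convolution (cross-correlation, as in most CNN libraries)."""
    k = len(kernel)
    return [
        sum(kernel[j] * sequence[i + j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size, stride):
    """Maximum over windows of length `size`, moved by `stride`."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, stride)
    ]


# Made-up scalar signal and a two-tap filter that responds to rising edges.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, 4.0, 4.0]
edge_filter = [-1.0, 1.0]   # the same weights are reused at every position

feature_map = relu(conv1d(signal, edge_filter))
pooled = max_pool1d(feature_map, size=2, stride=2)
```

Each output value depends only on a small local window of the input, which illustrates why these operations alone do not capture dependencies between distant positions.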
5.1.2 Recurrent neural network (RNN): Various learning tasks require information from sequential data. In tasks such as time series prediction, speech recognition, language modelling, translation, musical information retrieval, text mining, and video analysis, a model must learn from sequential input. The recurrent neural network (RNN) is a class of DNN designed to learn contextual dependencies in sequential data by using recurrent (feedback) connections.
These are connectionist models that capture the dynamics of sequences via interconnected networks of simple units. In simple words, an RNN can be considered as multiple copies of the same network, each passing a message to its successor. Unlike standard feed forward neural networks, this architecture enables RNNs to retain information from an arbitrarily long context window. In the past, recurrent neural networks were difficult to train because of their millions of parameters. However, recent advances in optimization techniques, network architectures, and parallel computation have enabled successful large-scale learning with them.
Learning with RNNs is challenging because long-range dependencies are difficult to learn. The problems of vanishing and exploding gradients occur when errors are back propagated across many successive time steps. The long short-term memory (LSTM) architecture described in the next subsection uses carefully designed nodes with recurrent edges of fixed unit weight as a solution to the vanishing gradient problem.
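The recurrence and its gradient problem can be sketched with a scalar RNN. The example below is a simplified illustration: the weights W_X and W_H and the input sequences are hypothetical, and a single tanh unit stands in for a full recurrent layer.

```python
import math


def rnn_forward(inputs, w_x, w_h, h0=0.0):
    """Scalar Elman-style RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, states = h0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)   # the same weights are used at every step
        states.append(h)
    return states


def state_gradient(states, w_h):
    """d h_T / d h_1: product over t = 2..T of w_h * (1 - h_t ** 2)."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad


W_X, W_H = 1.0, 0.5   # hypothetical weights

# The same input value seen early or late leaves a different final state.
early = rnn_forward([1.0, 0.0, 0.0], W_X, W_H)
late = rnn_forward([0.0, 0.0, 1.0], W_X, W_H)

# Influence of the first hidden state on the last one over 30 time steps.
long_states = rnn_forward([1.0] + [0.0] * 29, W_X, W_H)
long_gradient = state_gradient(long_states, W_H)
```

With a recurrent weight whose magnitude is below 1 the product shrinks towards zero (vanishing gradient); with a sufficiently large recurrent weight the same product can instead grow without bound (exploding gradient).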
5.1.2.1 Long short-term memory (LSTM): LSTM is an RNN architecture designed to handle long time dependencies in sequential data such as sentences, speech, etc. It was motivated by an analysis of error flow in existing RNNs, which found that long time lags were inaccessible to existing architectures because the backpropagated error either blows up or decays exponentially. By truncating the gradient where this does no harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps, even in the case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is done by enforcing constant error flow through “constant error carrousels (CEC)” within special self-connected units. These units act as memory cells, and multiplicative gate units learn to open and close access to the constant error flow. Hence, LSTM is designed to overcome the vanishing error problem.
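The memory-cell behaviour can be sketched with a single scalar LSTM cell. The code below is an illustration under stated assumptions: it uses the now-common variant with a forget gate (which variant the study used is not stated), and the parameter values are hypothetical, chosen so that the forget gate stays almost fully open and the input gate almost fully closed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))          # how much new content to write
    f = sigmoid(pre("forget"))         # how much of the old cell state to keep
    o = sigmoid(pre("output"))         # how much of the cell to expose
    g = math.tanh(pre("candidate"))    # candidate content
    c = f * c_prev + i * g             # additive update: the error carousel path
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: gates ignore the input; forget gate ~1, input gate ~0.
closed_gates = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0   # store the value 1.0 in the cell
for t in range(1000):
    # Noisy alternating input that the closed input gate keeps out of the cell.
    h, c = lstm_step(x=(-1.0) ** t, h_prev=h, c_prev=c, p=closed_gates)
```

After 1000 steps the cell state is still very close to the stored value, because the update is additive and the self-connection has a weight close to 1; a trained LSTM learns when to open and close the gates instead of having them fixed.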
5.1.2.2 Bidirectional Recurrent Neural Network with multiple LSTM layers: The main idea of the bidirectional LSTM (BLSTM) recurrent neural network is to capture the context on both sides of the current word s(t), i.e., s(t-n) to s(t) and s(t) to s(t+n), in order to encode the text and make a decision. A BLSTM processes the input sequence in both directions with two sub-layers. Due to this context-capturing behaviour, these models have many applications in image captioning, speech recognition, language modeling, and text mining.
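The bidirectional idea can be sketched as two passes over the same sequence. For brevity the example below uses a scalar tanh RNN as a stand-in for each LSTM sub-layer; the weights and the numeric word encodings are hypothetical.

```python
import math


def run_rnn(inputs, w_x, w_h):
    """Scalar tanh RNN used here as a simple stand-in for an LSTM sub-layer."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(inputs, fwd=(1.0, 0.5), bwd=(1.0, 0.5)):
    """Pair, for each position t, the forward state and the backward state."""
    forward = run_rnn(inputs, *fwd)
    backward = run_rnn(inputs[::-1], *bwd)[::-1]   # re-align to the original order
    return list(zip(forward, backward))


# Made-up numeric encodings of a five-word sentence.
sentence_a = [0.2, -0.4, 0.9, 0.1, -0.7]
sentence_b = [0.2, -0.4, 0.9, 0.1, 0.7]   # differs only in the last word

states_a = bidirectional(sentence_a)
states_b = bidirectional(sentence_b)
```

At the first word, the forward states of the two sentences are identical, while the backward states differ because the backward pass has already seen the changed last word. This is the sense in which each position receives context from both sides.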
6. Experimental setup and Results: The performance of the above classification models was compared on the review data. Bernoulli Naïve Bayes has been widely used for text classification when little data is available. In the present scenario, however, data is available in sufficient amounts, which is well suited to deep learning. Our study also indicates that deep learning techniques (BLSTM and CNN) classify sentiment better than the conventional methods, owing to their ability to capture more complex features and context on a large data set (Table 1).
7. Conclusion: This study showed that the bidirectional long short-term memory RNN is the most suitable of the compared classifiers for finding the polarity of review sentiments. This study can prove useful for organizations to quantify their reputation or product quality in real time so that necessary steps can be taken. Other potential applications of this work include social media monitoring, such as gauging public opinion on certain topics and tracking sentiment towards products, movies, politicians, etc.; improving customer relation models; detecting happiness and well-being; and improving automatic dialogue systems.