Sentiment Analysis using Deep Learning techniques with India Elections 2019 — A Case study

Written by sharmi1206 | Published 2020/08/30
Tech Story Tags: machine-learning | artificial-intelligence | politic-data | technology | latest-tech-stories | data-science | deep-learning | data-science-top-story

TLDR With the challenge of tracking public mood in real time in view, I applied a few deep learning techniques to predict moods from Twitter data around the 2019 Indian general election. This article assumes a basic knowledge of data science and NLP (Natural Language Processing). We will see how RNN-based models (LSTM, GRU, Bi-directional LSTM) perform with an external embedding that has been trained and distilled on a very large corpus of data as well as with an internal embedding, and how an auto-encoder, which attempts to copy its input to its output through an encoder-decoder architecture, compares.

Predict BJP Congress Sentiment using Deep Learning
The phenomenal growth in real-time data tracking and analyzing techniques has inspired data scientists to visualize and predict sentiments, build real-time models to predict the winners, etc.
Trust me, the most exciting part is capturing the information online from all sources and predicting in real time with the highest accuracy. The great challenge in this scenario is accuracy, given the ever-increasing volume of data flooding in from all sources every second. With these challenges in view, I decided to use a few deep learning techniques to predict moods using Twitter data.
Note that this article assumes a basic knowledge of data science and NLP (Natural Language Processing). But if you are a newcomer to this world, I have provided links throughout the article to help you out. This blog is structured like this:
  • Describe the deep learning algorithms: LSTM, Bi-directional LSTM, Bi-directional GRU, and CNN.
  • Train these algorithms with the contextual election corpus as well as pre-trained word embeddings to predict the sentiments of the contesting parties.
  • Compare the accuracy and log loss of the different models.

Glove Pre-trained Word Embeddings

Source, License — Apache Version 2.0
We started our sentiment classification with pre-trained word embeddings that represent words as vectors. GloVe builds these vectors from aggregated global word-word co-occurrence statistics over a corpus, while Google's pre-trained Word2Vec model learns them with a neural network that predicts words close to the target word; both capture linear substructures of the word vector space.
As we represent each word with a vector and a sentence (tweet) as the average of its word vectors to illustrate its sentiment, it becomes natural to train on these word vectors with different labeled moods to aid the classification and prediction process. Accordingly, the pre-trained embeddings are fed into different RNN models.
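Before moving on, here is a minimal, illustrative sketch of how a pre-trained GloVe file can be turned into the embedding_matrix consumed by the Keras Embedding layers later in this post. The file name glove.6B.100d.txt and the EMBEDDING_DIM value are assumptions; word_index is the vocabulary from the Keras Tokenizer built on the election tweets.
CODE
# Hypothetical sketch: build an embedding matrix from pre-trained GloVe vectors.
# 'glove.6B.100d.txt' and EMBEDDING_DIM = 100 are assumed values for illustration.
import numpy as np

EMBEDDING_DIM = 100
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

# word_index is the Keras Tokenizer vocabulary built from the election tweets
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:       # words missing from GloVe stay all-zeros
        embedding_matrix[i] = vector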

Recurrent Neural Networks

A recurrent neural network (RNN) is an artificial neural network in which connections between nodes form a directed graph along a sequence. RNNs are particularly known for processing sequential data: text, time series, videos, etc., where the output at any given instant t is affected by the output at the previous instant t-1 along with the input at t.
We will see how RNN based models (LSTM, GRU, Bi-directional LSTM) perform with an external embedding which has been trained and distilled on a very large corpus of data as well as with an internal embedding, where a part of the contextual corpus has been considered for training.
Basic RNNs suffer from vanishing and exploding gradient problems; LSTM-based networks evolved to handle this.
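To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step; the weight names W, U, b are illustrative assumptions, not code from the election repository.
CODE
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    # The new hidden state depends on the current input and the previous hidden state
    return np.tanh(W @ x_t + U @ h_prev + b)

# Toy dimensions: 4-dimensional input, 3-dimensional hidden state
W, U, b = np.random.randn(3, 4), np.random.randn(3, 3), np.zeros(3)
h = np.zeros(3)
for x_t in np.random.randn(5, 4):   # a "sequence" of 5 inputs
    h = rnn_step(x_t, h, W, U, b)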

Auto-encoder

Sequence auto-encoders are a special RNN-based architecture known for compressing a relatively long sequence into a limited, fixed-size, dense vector. They are well suited to classifying textual sentiments and hence are used here for training and predicting mood categories for election tweets.
An auto-encoder attempts to copy its input to its output through an encoder and decoder architecture. The dimension of the middle-hidden layer is lower than that of the input data. Thus, the neural network is designed to represent the input in a smart and compact way in order to reconstruct it successfully.
The auto-encoder used here follows a simple sequence-to-sequence architecture built from an input layer, an embedding layer, an encoding LSTM layer, a decoding LSTM layer, and a softmax layer. Both the input and the output of the entire architecture are vectorized representations of the tweets and their labeled sentiments. Finally, the output of the LSTM is passed through a softmax activation to represent the sentiment category.
Auto-Encoder Training with Pre-trained Glove
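As an illustrative approximation (not the exact model from the repository), a minimal Keras sketch of such a sequence-to-sequence style classifier, reusing the embedding_layer, MAX_SEQUENCE_LENGTH, and NO_CLASSES variables defined elsewhere in this post, could look like this:
CODE
# Hypothetical sketch of the encoder-decoder style sentiment classifier
from keras.layers import Input, LSTM, RepeatVector, Dense
from keras.models import Model

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)             # pre-trained GloVe weights
encoded = LSTM(128)(embedded_sequences)                          # compress tweet to a fixed-size vector
decoded = LSTM(128)(RepeatVector(MAX_SEQUENCE_LENGTH)(encoded))  # decode back over the sequence length
preds = Dense(NO_CLASSES, activation='softmax')(decoded)         # sentiment category
autoencoder_clf = Model(sequence_input, preds)
autoencoder_clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])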

LSTM

LSTMs, a kind of recurrent neural network, possess internal contextual state cells that act as long-term or short-term memory cells. LSTMs solve many problems of vanilla recurrent neural networks by:
  • Helping to preserve a constant error through continuous learning and backpropagation through time and layers.
  • Containing gated cells that control the flow of information. Gated cells are responsible for reading, writing, and storing information. They are the primary decision-makers for retaining cell state information (input gate), determining how much of the cell state to pass on to the next layers (output gate), and deciding how much existing information in memory can be forgotten (forget gate).
  • Using gates that carry analog information ranging from 0 to 1 through sigmoid activation functions. This analog flow allows backpropagation to happen through multiple bounded nonlinearities.
  • Keeping the gradients steep enough to avoid the vanishing gradient problem, therefore training relatively short batches with high accuracy (see the sketch after this list).
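For intuition, here is a minimal NumPy sketch of the gate computations described above; the weight dictionaries W, U, b and the sigmoid helper are assumptions for illustration, not code from the election repository.
CODE
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b each hold one weight set per gate: input (i), forget (f), output (o), candidate (g)
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate: what to write
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate: what to drop
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate: what to expose
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell state
    c_t = f * c_prev + i * g                               # updated cell state
    h_t = o * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t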
The figure below shows how a word embedding feeds an input sentence to the LSTM. The LSTM layers take the previous hidden state into consideration to extract the key feature vectors that determine the sentiment of the sentence.
The source code below shows how to build a word embedding followed by a single hidden LSTM layer of 128 neurons and classify tweets into predefined classes using the "softmax" classifier and "Adam" optimizer.
#fileName classifyw2veclstm.py
from keras.layers import Input, LSTM, Dense
from keras.models import Model

NO_CLASSES = 8

# embedding_layer and sequence_input are built as in the Embedding/Input setup shown later in this post
embedded_sequences = embedding_layer(sequence_input)
l_lstm = LSTM(128)(embedded_sequences)
preds = Dense(NO_CLASSES, activation='softmax')(l_lstm)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
model.fit(x_train, y_train,
          epochs=15, batch_size=64)
output_test = model.evaluate(x_test, y_test, verbose=0)
Model Summary with single Layer LSTM

GRU

The GRU is a simplified variant of the LSTM and bears a close resemblance to it except for minor modifications. It adaptively captures dependencies between time instances.
The absence of a separate memory cell means it cannot control the flow of information to the same degree as an LSTM unit.
A GRU functions with a "reset" and an "update" gate:
  • The reset gate sits between the previous activation and the next candidate activation, allowing the unit to forget the previous state.
  • The update gate decides how much information to propagate and accordingly how much of the candidate activation to use when updating the cell state.
  • A GRU possesses fewer parameters and thus may train a bit faster or need less data to generalize.
  • It falls short of the LSTM on larger datasets, where LSTMs have been shown to perform better (an illustrative gate sketch follows).
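Analogously to the LSTM sketch above, an illustrative NumPy step for the two GRU gates (again an assumption for intuition, not repository code):
CODE
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])              # reset gate
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])              # update gate
    h_cand = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])   # candidate activation
    return (1 - z) * h_prev + z * h_cand                              # interpolated new state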
The source code below shows how to build a GRU with a single hidden layer and classify tweets using the “softmax” classifier and “Adam” optimizer.
#fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019
from keras.layers import Input, GRU, Dense
from keras.models import Model

NO_CLASSES = 8

embedded_sequences = embedding_layer(sequence_input)
l_gru = GRU(128)(embedded_sequences)
preds = Dense(NO_CLASSES, activation='softmax')(l_gru)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
model.fit(x_train, y_train,
          epochs=15, batch_size=64)
output_test = model.evaluate(x_test, y_test, verbose=0)
Model Summary with single Layer GRU

Bi-directional LSTM

Bidirectional recurrent neural networks (BRNNs) connect two hidden layers of opposite directions to the same output. Because the information flow of both directions is captured, the amount of input information available to the network increases. This architecture lets the output layer get information from the past (backward) and future (forward) states simultaneously.
BRNNs have been used to analyze public sentiment towards the election: the election context is fed as input, and performance increases when knowledge of the words preceding and following the most polarized word is taken into consideration from either direction. A BRNN aims to:
Divide the neurons of a regular RNN into two directions, one for the positive time direction (forward states) and another for the negative time direction (backward states). This facilitates including information from both the past and the future of the current time frame. The outputs of the two directions are not connected to the inputs of the opposite-direction states.
BRNNs can be trained with algorithms similar to those for RNNs, because the training process does not involve any interaction between the two sets of directional neurons. Training involves three steps: a forward pass, a backward pass, and weight updates:
  • For forward pass, forward states and backward states are passed first to the next hidden layer. Next, the states from the output neurons are passed.
  • For the backward pass, states from output neurons are passed first. Afterward forward and backward states are passed.
  • After the forward and backward passes are completed, the hidden layers' weights are updated.
Bi-directional LSTM model summary
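A minimal Keras sketch of the bi-directional LSTM classifier, assuming the same embedding_layer, sequence_input, NO_CLASSES, and training arrays used in the other snippets (an illustrative sketch, not the exact repository model):
CODE
# Hypothetical sketch of the Bi-directional LSTM sentiment classifier
from keras.layers import LSTM, Bidirectional, Dense
from keras.models import Model

embedded_sequences = embedding_layer(sequence_input)
l_bilstm = Bidirectional(LSTM(128))(embedded_sequences)   # concatenates forward and backward states
preds = Dense(NO_CLASSES, activation='softmax')(l_bilstm)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, epochs=15, batch_size=64)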

Convolutional Neural Networks (CNN)

The CNN used for sentiment prediction with pre-trained word embeddings is composed of a 1D convolution layer with 128 filters and a 1D global max pooling layer. The 1D convolution layer performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 5, sliding over 5 words at a time.
Single-layer CNN with 128 filters
CODE
#fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019

# imports assumed for this snippet
from keras import layers
from keras.layers import Dense
from keras.models import Sequential

model = Sequential()
model.add(layers.Embedding(len(word_index) + 1,
                           EMBEDDING_DIM,
                           weights=[embedding_matrix],
                           input_length=MAX_SEQUENCE_LENGTH,
                           trainable=True))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(Dense(8, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train,
                    epochs=15, batch_size=64,
                    validation_data=(x_test, y_test))
loss, accuracy = model.evaluate(x_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(x_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

LSTM, Bi-directional LSTM, Bi-directional GRU with Attention Mechanism

Attention mechanisms allow neural networks to decide which vectors (or words) from the past are important for future decisions by considering them in context with the word in question. In this process, the network filters out the important and relevant chunks of information and skips over parts of the sequence that are not relevant to the final goal or task. Such relationships among words and connections to neighboring words can be represented by the directed arcs of a semantic dependency graph.
Further, an attention mechanism takes the input from several time steps into account and distributes attention over the hidden states by assigning different weights, or degrees of importance, to those inputs. For a fixed target word, the first task is to loop over all encoder states, comparing the target and source states to generate a score for each encoder state. A softmax is then applied to normalize all scores, which yields a probability distribution conditioned on the target state. Finally, trainable weights are introduced to build the context vector, which informs the predicted output.
The principal advantage of the attention mechanism lies in the context vector's ability to take all cells' outputs as input to compute the probability distribution over the source, giving the decoder the ability to represent global information rather than a single hidden state.
Bi-directional GRU and LSTM networks with Attention mechanism (Source: wiki)
Model Summary: Bi-directional LSTM/GRU with Attention layer (Source: own)
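The AttLayer used in the snippet below is a custom Keras layer (see the reference to github.com/richliao/textClassifier inside the code). As an illustration only, not the repository's implementation, a minimal attention layer of this kind could be sketched as follows; the class name SimpleAttention and all weight names are assumptions.
CODE
# Illustrative sketch of a word-level attention layer (not the exact AttLayer implementation)
from keras import backend as K
from keras.layers import Layer

class SimpleAttention(Layer):
    def __init__(self, attention_dim=64, **kwargs):
        self.attention_dim = attention_dim
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape: (batch, timesteps, hidden); learn a projection and a context vector
        self.W = self.add_weight(name='W', shape=(input_shape[-1], self.attention_dim),
                                 initializer='glorot_uniform', trainable=True)
        self.b = self.add_weight(name='b', shape=(self.attention_dim,),
                                 initializer='zeros', trainable=True)
        self.u = self.add_weight(name='u', shape=(self.attention_dim,),
                                 initializer='glorot_uniform', trainable=True)
        super(SimpleAttention, self).build(input_shape)

    def call(self, x):
        # Score every timestep, normalize with softmax, and return the weighted sum of hidden states
        uit = K.tanh(K.dot(x, self.W) + self.b)          # (batch, timesteps, attention_dim)
        ait = K.softmax(K.sum(uit * self.u, axis=-1))    # (batch, timesteps)
        return K.sum(x * K.expand_dims(ait), axis=1)     # (batch, hidden)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])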
The source code below shows how to build a single Bi-directional GRU layer, with the Attention layer of 64 neurons, and classify tweets based on predefined classes using the “softmax” classifier and “Adam” optimizer. Source code available at https://github.com/sharmi1206/elections-2019
CODE
#fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019

import numpy as np
from keras.layers import Dense, Input
from keras.layers import GRU, Bidirectional, Embedding
from keras.models import Model
from sklearn.metrics import log_loss, accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix

NO_CLASSES = 8

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_gru = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
# Refr: https://github.com/richliao/textClassifier/issues/28
l_att = AttLayer(64)(l_gru)
preds = Dense(NO_CLASSES, activation='softmax')(l_att)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
model.fit(x_train, y_train,
          epochs=15, batch_size=64)

# Evaluate model accuracy
output_test = model.predict(x_test)
final_pred = np.argmax(output_test, axis=1)
org_y_label = [np.where(r == 1)[0][0] for r in y_test]
results = confusion_matrix(org_y_label, final_pred)
precisions, recall, f1_score, true_sum = metrics.precision_recall_fscore_support(org_y_label, final_pred)
pred_indices = np.argmax(output_test, axis=1)
classes = np.array(range(0, NO_CLASSES))
preds = classes[pred_indices]
print('Log loss: {}'.format(log_loss(classes[np.argmax(y_test, axis=1)], output_test)))
print('Accuracy: {}'.format(accuracy_score(classes[np.argmax(y_test, axis=1)], preds)))

Accuracy with Pre-trained Word Embeddings

Accuracy and Log Loss for sentiment prediction BJP vs Congress

Word Embeddings with Convolutional Neural Networks (CNN) on Election Tweets

Convolutional Neural Networks with Word2Vec models trained with Gensim on the election corpus (Source: Wiki)
The word2vec tool takes a text corpus (a list of tweets) as input and produces word vectors as output. It first constructs a vocabulary from the training data (the list of tokenized tweets) and then learns vector representations of words, capturing n-gram features that aid the sentiment classification process. This is the same word embedding process used with the pre-trained embeddings above; the only difference is that training takes place on the election tweets instead of a pre-trained corpus. We use Keras to convert the positive-integer representations of words into a word embedding via an Embedding layer.
CODE
#fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019
import multiprocessing
import pandas as pd
from gensim.models import Word2Vec
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

num_words = 20000
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(combined_df['tweet'].values)
word_index = tokenizer.word_index

# Pad the tweet data
X = tokenizer.texts_to_sequences(combined_df['tweet'].values)
X = pad_sequences(X, maxlen=2000)
Y = pd.get_dummies(combined_df['mood']).values

word2vec = Word2Vec(sentences=tokenized_corpus,
                    size=vector_size,
                    window=window_size,
                    iter=500,
                    seed=300,
                    workers=multiprocessing.cpu_count())

# Copy word vectors
X_vecs = word2vec.wv
The CNN used for sentiment prediction here is composed of 1D convolution layers and 1D pooling layers over a series of 4 layers, with 32, 64, 128, and 256 filters respectively.
Each 1D convolution layer performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 3, sliding over 3 words at a time. This allows the model to consider 3-grams and understand how words contribute to sentiment in the context of those around them.
After each convolution, we add a max-pooling layer to extract the most significant elements and turn them into a feature vector. We also add a dropout of 20% to ensure the model does not overfit. The resultant tensor is flattened into one long, single-column feature vector. This feature vector is then passed to a dense layer with softmax activation to yield the classified output.
CODE
#fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019

from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.optimizers import Adam
from keras.models import Sequential

batch_size = 64
nb_epochs = 20
vector_size = 512
max_tweet_length = 100

model = Sequential()
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same', input_shape=(max_tweet_length, vector_size)))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(64, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(128, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(256, kernel_size=3, activation='elu', padding='same'))
model.add(Dropout(0.2))
model.add(MaxPooling1D(pool_size=2))

model.add(Flatten())
model.add(Dense(8, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.001, decay=1e-6),
              metrics=['accuracy'])

# Fit the model
model.fit(X_train, Y_train,
          batch_size=batch_size,
          shuffle=True,
          epochs=nb_epochs)
Model Summary Convolution Neural Networks

Word Embeddings with Recurrent Neural Networks (LSTM/GRU/Bi-directional LSTMs) on Election Tweets

The neural network architecture (each of LSTM, GRU, Bi-directional LSTM/GRU) is modeled on the 20,000 most frequent words, with each tweet padded to a maximum length of 2000. The first layer is the Embedding layer, which uses 128-length vectors (each word is tokenized with Keras's Tokenizer) to represent each word. The next layer is the LSTM layer with 256 memory units. Finally, the results are fed to a Dense output layer with 8 neurons and a softmax activation function to predict the associated mood.
CODE
#fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019

from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
from sklearn.model_selection import train_test_split

NO_CLASSES = 8
embed_dim = 128
lstm_out = 256

model = Sequential()
model.add(Embedding(num_words, embed_dim, input_length=X.shape[1]))
model.add(LSTM(lstm_out, recurrent_dropout=0.2, dropout=0.2))
model.add(Dense(NO_CLASSES, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_crossentropy'])
print(model.summary())

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

# Fit the model
model.fit(X_train, Y_train,
          batch_size=batch_size,
          shuffle=True,
          epochs=nb_epochs)
output_test = model.predict(X_test)
The model yields 99.58% accuracy over 5 epochs with a batch size of 128.
RESULTS
Epoch 5/5
  64/7344 [..............................] - ETA: 58:45 - loss: 0.0218 - acc: 1.0000
 128/7344 [..............................] - ETA: 54:28 - loss: 0.0259 - acc: 1.0000
 192/7344 [..............................] - ETA: 57:35 - loss:
........
........
7232/7344 [============================>.] - ETA: 58s - loss: 0.0328 - acc: 0.9960
7296/7344 [============================>.] - ETA: 24s - loss: 0.0330 - acc: 0.9959
7344/7344 [==============================] - 3811s 519ms/step - loss: 0.0331 - acc: 0.9958

Conclusion

In this post, we reviewed deep learning methods for creating vector representations of sentences with RNNs and CNNs, and presented their effectiveness on a supervised sentiment prediction task.
With GloVe pre-trained word embeddings, Bi-directional LSTM and Bi-directional GRU with an attention layer perform the best, while the auto-encoder model performs the worst for both BJP and Congress. With the word embedding matrix trained solely on election-context tweets, the accuracy of the models (LSTM, GRU, Bi-directional LSTM/GRU) rises to almost 99.5%, but the CNN model performs the worst, with 50% accuracy.
However, each of these models can be further improved through extensive tuning of hyper-parameters, different epochs, learning rates, and the addition of more labeled data for minority classes. Further, altering the neural network architecture by increasing or decreasing the number of neurons and hidden layers might give added improvements.
References
  1. https://www.researchgate.net/figure/The-architecture-of-sentence-representation-learning-network_fig2_325642880
  2. https://blog.myyellowroad.com/unsupervised-sentence-representation-with-deep-learning-104b90079a93
  3. https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/
  4. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  5. https://code.google.com/archive/p/word2vec/
Please let me know if there were any mistakes; suggestions and feedback are welcome. The election repository is available at https://github.com/sharmi1206/elections-2019. Please feel free to follow me on Linkedin.

Written by sharmi1206 | https://www.linkedin.com/in/sharmistha-chatterjee-7a186310/
Published by HackerNoon on 2020/08/30