Introduction

The General Elections in India were scheduled to be held between April and May 2019 to constitute the 17th Lok Sabha. The nation was eagerly following news channels and newspapers to understand who would win the majority.

The volume of data being generated around the elections is through the roof. What an intriguing project to take up as a data scientist!

You can analyze and predict public sentiments towards prominent parties using a variety of sources of public data, like Twitter, YouTube, etc.

We can look at various sources of information to identify sentiment towards:

Party, party leaders, their actions since the last election was held
Eagerness/willingness to cast votes while waiting in the queue
Sentiment of one party towards another as well as sentiment between contending leaders, coalition parties
Election campaigns, their popularity and how they affect the general public
Drift or sudden change in moods due to announcement of poll results for a single state or multiple states
Elected leaders, their promises and actions
Introduction of completely new constitutional amendments
Economic, social and political changes that a nation undergoes between successive terms and how they affect the country’s GDP and foreign relations

With this context in mind, let’s try to understand how we can predict the sentiment of the tweets with supervised text classification techniques.

In the previous post (Twitter Sentiment Analysis for the 2019 Election), I gave an introduction to the process of labelling and analyzing tweets from Elections 2019. In this post, I will illustrate the different Text Based Classifiers used to train and predict the mood of the tweets of India’s two most prominent parties.

Why Text Based Classifiers?

Tweets are textual data that have different variations of n-grams, re-tweet counts, polarity and sentiment context variations. The textual information needs to be broken down at word and character-level for better feature engineering and feature-extraction.

Hence the need for text based classifiers, which analyse text at the character and word level and also let users define their own feature sets from combinations of characters or words. Common Text Based Classifiers are:

FastText
Classifiers from the Natural Language Toolkit (NLTK): Naive Bayes, MaxEnt (Maximum Entropy Classifier), Decision Tree.

The Decision Tree classifier is covered in the next post.

FastText Classification

FastText, developed by Facebook AI Research, works on the principle of representing each word as a bag of character n-grams. A vector representation is learned for each character n-gram, and a word is represented as the sum of these n-gram vectors. This facilitates:

Preserving the local order of words by generating all possible n-grams.
Classifying words that are absent from the training data set, which happens often in imbalanced data-sets.
Building suitable representations for languages that are morphologically rich.
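To make the bag-of-character-n-grams idea concrete, here is a minimal sketch. The helper name char_ngrams is hypothetical, and the use of < and > as word boundary markers with a fixed n of 3 is an assumption for illustration, not necessarily the settings used for the results in this post.

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Two related words share some n-grams, so a word unseen in training
# can still borrow part of its representation from known words.
shared = set(char_ngrams("voting")) & set(char_ngrams("voted"))
print(sorted(shared))
```

Because "voting" and "voted" share the n-grams for their common prefix, a vector built as the sum of n-gram vectors places the two words close to each other even if one of them never appeared in the training tweets.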

FastText efficiently classifies huge data sets with a large number of categories by organising the categories in a hierarchical tree structure, e.g. a binary tree. This reduces the time complexity of training and testing text classifiers from linear, O(kh), to logarithmic, O(h log2(k)), with respect to the number of classes, where k is the number of classes and h is the dimension of the text representation.

The tree is built with a hierarchical softmax function based on the Huffman coding algorithm. Frequent classes sit closer to the root than rare ones, which often helps with imbalanced classes, i.e. classes that appear less frequently than others.
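The effect of a Huffman tree on class depth can be sketched in a few lines. This is only an illustration of Huffman coding over class frequencies, not FastText's internal implementation; the function name huffman_depths and the class counts are made up for the example.

```python
import heapq
from itertools import count


def huffman_depths(frequencies):
    """Return the depth of each class in a Huffman tree built from class frequencies."""
    tie_breaker = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(freq, next(tie_breaker), [label]) for label, freq in frequencies.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in frequencies}
    while len(heap) > 1:
        # merge the two least frequent subtrees under a new parent node
        freq_a, _, labels_a = heapq.heappop(heap)
        freq_b, _, labels_b = heapq.heappop(heap)
        for label in labels_a + labels_b:
            depths[label] += 1  # every leaf below the new parent moves one level down
        heapq.heappush(heap, (freq_a + freq_b, next(tie_breaker), labels_a + labels_b))
    return depths


# Hypothetical class counts, chosen only to show the imbalance effect.
toy_counts = {"joy": 50, "dominance": 30, "neutral": 10, "fear": 5, "anger": 5}
depths = huffman_depths(toy_counts)
print(depths)
```

With these toy counts the most frequent mood ends up one step from the root while the rarest moods sit four levels down, so the common classes need fewer binary decisions at prediction time.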

FastText takes into account the words appearing in the text, sums them up and represents them through low dimensional vectors. The low dimensional hidden representation is shared across the classes, so information learned about words for one category can be reused for another.

We used FastText to predict the mood of the tweets. The target mood in FastText is prefixed with ‘__label__’. The training dataset is divided into 10 cross-validation splits, and models are trained on them with different learning rates, numbers of epochs and n-gram levels (1, 2 or 3).

The mean accuracy computed for each parameter combination is stored in a hash map. The maximum mean accuracy and its corresponding learning rate, number of epochs and n-gram level are retrieved, and that model is finalised as the best model.
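The selection step described above can be sketched independently of FastText. In this sketch grid_search and toy_evaluate are hypothetical names, the parameter grids are illustrative, and toy_evaluate is only a stand-in for "train with these parameters and return the mean cross-validated accuracy"; its numbers are not measurements.

```python
from itertools import product


def grid_search(evaluate, learning_rates, epochs, ngram_levels):
    """Score every parameter combination and return the best one with its score."""
    results = {}
    for lr, epoch, ngram in product(learning_rates, epochs, ngram_levels):
        results[(lr, epoch, ngram)] = evaluate(lr, epoch, ngram)
    best = max(results, key=results.get)
    return best, results[best], results


def toy_evaluate(lr, epoch, ngram):
    # Stand-in scoring function; replace with real k-fold training and evaluation.
    return 0.5 + 0.1 * ngram - 0.2 * abs(lr - 0.5) + 0.001 * epoch


best_params, best_score, all_scores = grid_search(
    toy_evaluate, learning_rates=[0.1, 0.5, 1.0], epochs=[10, 25], ngram_levels=[1, 2, 3]
)
print(best_params, round(best_score, 3))
```

Keying the dictionary by the parameter tuple rather than by the accuracy is a deliberate choice in this sketch: two combinations that reach the same mean accuracy then do not overwrite each other.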

FastText for Election Mood Classification

import re
import numpy as np
import pandas as pd
import fasttext as ft  # older source builds expose the same API under the name fastText
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn.metrics import accuracy_score

# fastText expects every training line to start with __label__<class>
labelled_mood = '__label__' + df_train.mood
df_train['labels_text'] = labelled_mood
df_test['labels_text'] = '__label__' + df_test.mood
df_train.labels_text = df_train.labels_text.str.cat(df_train.tweet, sep=' ')

# One "__label__<mood> <tweet>" string per row
YX = df_train[['labels_text']].dropna().reset_index(drop=True)

# k, lr_val, wordNgrams_val and epoch_val come from the enclosing parameter loop;
# cross_valid_params is a dict created before that loop
accuracy = []
kf = KFold(n_splits=k, shuffle=True)
for train_index, test_index in kf.split(YX):
    YX.iloc[train_index].to_csv('train.csv', index=False, header=False)
    # Fit model for this set of parameter values
    model = ft.train_supervised('train.csv',
                                lr=lr_val,
                                wordNgrams=wordNgrams_val,
                                epoch=epoch_val)
    df_valid = YX.iloc[test_index]
    pred = model.predict(df_valid['labels_text'].tolist())
    # keep the top predicted label of each tweet and strip the prefix
    pred = pd.Series(pred[0]).apply(lambda x: re.sub('__label__', '', x[0]))
    # the true label sits between the prefix and the first space
    org = df_valid['labels_text'].apply(lambda x: x[len('__label__'):x.find(' ')])
    # Accuracy for each cross-validation fold
    kf_accuracy = accuracy_score(org.values, pred.values)
    accuracy.append(kf_accuracy)
mean_acc = np.mean(accuracy)
cross_valid_params[mean_acc] = [lr_val, wordNgrams_val, epoch_val]
#Retrieving the max accuracy from dict
max_acc = max(cross_valid_params.keys())
lr_val, wordNgrams_val, epoch_val = cross_valid_params[max_acc]
precisions, recall, f1_score, true_sum = metrics.precision_recall_fscore_support(org.values, pred_filter.values)
print("Fast Text Precision =", precisions)
print("Fast Text Recall=", recall)
print("Fast Text F1 Score =", f1_score)
accuracy_score = metrics.accuracy_score(org, pred_fil

Precision, Recall, F1, and Accuracy for BJP and Congress

Results of sentiment prediction for BJP
BJP Fast Text Precision = [0.45833333 0.56976744 0.73802395 0.78723404 0.70491803 0.7260274 0.59166667 0.2 ]
BJP Fast Text Recall= [0.36065574 0.37692308 0.84129693 0.71153846 0.61870504 0.79613734 0.73958333 0.01298701]
BJP Fast Text F1 Score = [0.40366972 0.4537037 0.78628389 0.74747475 0.65900383 0.75946776 0.65740741 0.02439024]
BJP Fast Text Accuracy = 70.3%

Results of sentiment prediction for Congress
Classify Fast Text Precision = [0.42857143 0.5483871  0.74174174 0.7826087  0.71311475 0.73465347
 0.59504132 0.2       ]
Classify Fast Text Recall= [0.3442623  0.39230769 0.84300341 0.69230769 0.62589928 0.79613734
 0.75       0.01298701]
Classify Fast Text F1 Score = [0.38181818 0.4573991  0.78913738 0.73469388 0.66666667 0.76416066
 0.66359447 0.02439024]
Accuracy= 70.5%

Since the dataset involves imbalanced classes of predicted moods, per-class precision and recall (the Precision-Recall curve) are more appropriate here than accuracy alone. The figures for both parties are very close: for some mood classes BJP precision and recall are slightly higher (1–3%) than for Congress, while the overall accuracy is almost identical (70.3% vs 70.5%). A possible reason for the differences is that more textual data, and more tweets containing BJP names and BJP party leader names, were available.
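As a quick check on how the three per-class metrics relate, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the BJP precision and recall arrays printed above; the helper name f1_from is hypothetical, and the macro average at the end is an extra summary that is not part of the original output.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class values copied from the BJP FastText output above
bjp_precision = [0.45833333, 0.56976744, 0.73802395, 0.78723404,
                 0.70491803, 0.7260274, 0.59166667, 0.2]
bjp_recall = [0.36065574, 0.37692308, 0.84129693, 0.71153846,
              0.61870504, 0.79613734, 0.73958333, 0.01298701]

per_class_f1 = [f1_from(p, r) for p, r in zip(bjp_precision, bjp_recall)]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

print([round(score, 4) for score in per_class_f1])
print(round(macro_f1, 4))
```

The last class illustrates why accuracy alone is misleading on imbalanced data: its precision is 0.2 but its recall is about 0.013, so its F1 is close to zero even though overall accuracy is around 70%.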

Naive Bayes Classifier

In Naive Bayes classifiers, every feature contributes independently to the decision of which label is assigned to a given input. The classifier assumes that the label is chosen first and that each feature is then generated independently, given that label. It starts by calculating the prior probability of each label from the frequency of that label in the training set.

The classifier thus starts from a label chosen on relative frequency alone, but as it receives the contribution of each feature towards the likelihood of each label, it recomputes which label is the most likely. Each contribution scales the likelihood of a label by the probability that an input with that label has the feature, so labels that rarely occur with the observed features lose ground.

Thus the likelihood estimate for each label is obtained by combining the features through a joint probability distribution. The joint probability is the product of the prior probability of the label and the probability of each feature given that label.

Mathematically, it can be expressed as

P(features, label) = P(label) × Π_{f ∈ features} P(f|label)
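A small sketch of this scoring rule, prior times the product of per-feature likelihoods, may help. The function name naive_bayes_scores is hypothetical and all probabilities below are invented toy numbers, not values learned from the election tweets.

```python
def naive_bayes_scores(priors, likelihoods, features):
    """Return P(label) * product of P(feature value | label) for every label."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for name, value in features.items():
            # a feature value never seen with this label contributes a factor of zero
            score *= likelihoods[label].get((name, value), 0.0)
        scores[label] = score
    return scores


# Toy numbers for illustration only
priors = {"joy": 0.6, "fear": 0.4}
likelihoods = {
    "joy": {("TBM", 2): 0.05, ("pos", 0.0): 0.2},
    "fear": {("TBM", 2): 0.5, ("pos", 0.0): 0.4},
}
tweet_features = {"TBM": 2, "pos": 0.0}

scores = naive_bayes_scores(priors, likelihoods, tweet_features)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Although "joy" has the larger prior in this toy setup, the feature likelihoods pull the decision towards "fear", which is the recomputation described in the text.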

Naive Bayes computes the likelihood of a feature f for a specific label as P(f|label). The computation simply counts the percentage of training instances with the given label that also have the given feature. Mathematically:

P(f|label) = count(f, label) / count(label)

This estimate has a limitation: when a feature never occurs with a given label in the training set, P(f|label) is zero. As a result, the whole product becomes zero and an input with that feature can never be assigned that label. The estimate is more accurate when count(f, label) is relatively high.

Naive Bayes implementations therefore apply a smoothing technique, such as Expected Likelihood Estimation. It handles small values of count(f, label) by adding 0.5 to each count(f, label) value.
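The zero-count problem and its fix can be sketched as follows. The function name likelihood is hypothetical, the observations are toy data, and the denominator (label count plus 0.5 per distinct feature value) follows the usual Lidstone-style normalisation, which is an assumption about how the added 0.5 is balanced.

```python
def likelihood(pairs, value, label, gamma=0.0):
    """Estimate P(value | label) from (value, label) pairs; gamma=0.5 gives ELE smoothing."""
    distinct_values = {v for v, _ in pairs}
    label_count = sum(1 for _, l in pairs if l == label)
    joint_count = sum(1 for v, l in pairs if v == value and l == label)
    # gamma is added once per possible value so the estimates still sum to 1
    return (joint_count + gamma) / (label_count + gamma * len(distinct_values))


# Toy observations of one feature (e.g. a bigram mood count) with its label
observations = [(2, "fear"), (2, "fear"), (0, "fear"),
                (0, "joy"), (0, "joy"), (0, "joy"), (0, "joy")]

unsmoothed = likelihood(observations, 2, "joy")
smoothed = likelihood(observations, 2, "joy", gamma=0.5)
print(unsmoothed, smoothed)
```

Without smoothing, the value 2 never co-occurs with "joy", so the estimate is zero and would wipe out the whole product; with gamma set to 0.5 the estimate becomes small but non-zero.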

Training

The different word frequencies of n-grams (1/2/3 gram words), positive, negative, compound scores from SentimentAnalyzer along with the tweet polarity and retweet count are taken into account while training the models with different nltk based classifiers. The model identifies the most important features and measures them in terms of the labelled moods.

All the available moods, ‘fear’, ‘neutral’, ‘dominance’, ‘joy’, ‘faith’, ‘anger’, ‘sadness’ and ‘arousal’, are counted through the n-gram frequency distributions and labelled as ‘JUM’, ‘FUM’, ‘SUM’, ‘AUM’, ‘DUM’, ‘EUM’, ‘NUM’, ‘TUM’, ‘JBM’, ‘FBM’, ‘SBM’, ‘ABM’, ‘DBM’, ‘EBM’, ‘NBM’, ‘TBM’, ‘JTM’, ‘FTM’, ‘STM’, ‘ATM’, ‘DTM’, ‘ETM’, ‘NTM’, ‘TTM’. The first letter identifies the mood (J for joy, F for faith, S for sadness, A for anger, D for dominance, E for arousal, N for neutral, T for fear), while the last two letters, UM, BM or TM, stand for unigram, bigram and trigram respectively.

Thus, ‘SUM’ stands for “Sadness” from Unigram frequency distribution and ‘JTM’ stands for “Joy” for Trigram frequency distribution.
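The naming scheme can be written down as two small lookup tables. The letter assigned to each mood follows the feature names used in this post; the constant and function names in the sketch are hypothetical.

```python
# First letter of the feature name for each mood
MOOD_PREFIX = {
    "joy": "J", "faith": "F", "sadness": "S", "anger": "A",
    "dominance": "D", "arousal": "E", "neutral": "N", "fear": "T",
}
# Suffix for unigram, bigram and trigram frequency distributions
NGRAM_SUFFIX = {1: "UM", 2: "BM", 3: "TM"}


def feature_name(mood, n):
    """Build the feature label for a mood counted over n-grams of size n."""
    return MOOD_PREFIX[mood] + NGRAM_SUFFIX[n]


all_features = [feature_name(mood, n) for n in NGRAM_SUFFIX for mood in MOOD_PREFIX]
print(all_features)
print(feature_name("sadness", 1), feature_name("joy", 3))  # SUM JTM
```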

# Bigram feature column for each mood returned by md.get_n_gram_mood
BIGRAM_COLUMNS = {'sadness': 'SBM', 'joy': 'JBM', 'faith': 'FBM', 'neutral': 'NBM',
                  'dominance': 'DBM', 'arousal': 'EBM', 'fear': 'TBM', 'anger': 'ABM'}

for index, row in data.iterrows():
    # one counter per bigram mood feature, reset for every tweet
    counts = dict.fromkeys(BIGRAM_COLUMNS.values(), 0)
    word_mood = []
    for pair in ngrams(row.iloc[0], 2):
        # pair[0][0] and pair[1][0] pick the first element of each bigram member
        word_mood.append(pair[0][0])
        word_mood.append(pair[1][0])
        processed_mood = md.get_n_gram_mood(word_mood)
        if processed_mood in BIGRAM_COLUMNS:
            counts[BIGRAM_COLUMNS[processed_mood]] += 1
    # DataFrame.set_value was removed in pandas 1.0; .at is its replacement
    for column, value in counts.items():
        df.at[index, column] = value

The above code snippet considers each word from the bi-gram and creates a feature based on all available moods in the bi-gram. The process by which an n-gram mood is predicted is covered in the previous post, Twitter Sentiment Analysis for the 2019 Election.

Text Based Classifiers (Naive Bayes and MaxEnt) list the most informative features as a ratio between different labels/moods. For example, in the BJP output below, the feature “TBM” (the bi-gram frequency distribution for “Fear”) with a frequency count of 2 shows a ratio of 13.6 : 1.0 between the moods “Fear” (label 4) and “Joy” (label 5). The ratio compares the likelihood of the feature value under the two labels: TBM = 2 is 13.6 times more likely to be seen in a “Fear” tweet than in a “Joy” tweet.
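One way to read the ratio column is as the largest per-label likelihood of a feature value divided by the smallest. The sketch below assumes that convention; the function name likelihood_ratio is hypothetical and the probabilities are toy values, not the ones behind the output in this post.

```python
def likelihood_ratio(likelihood_by_label):
    """Return (most likely label, least likely label, ratio of their likelihoods)."""
    top = max(likelihood_by_label, key=likelihood_by_label.get)
    bottom = min(likelihood_by_label, key=likelihood_by_label.get)
    return top, bottom, likelihood_by_label[top] / likelihood_by_label[bottom]


# Toy values of P(feature value | label) for one feature value
toy_likelihoods = {"fear": 0.30, "joy": 0.02, "neutral": 0.10}

top, bottom, ratio = likelihood_ratio(toy_likelihoods)
print(f"{top} : {bottom} = {ratio:.1f} : 1.0")
```

A large ratio means the feature value separates the two labels well, which is why such features appear at the top of the list.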

Label Encoding of Moods: {'fear': 4, 'neutral': 6, 'dominance': 2, 'joy': 5, 'faith': 3, 'anger': 0, 'sadness': 7, 'arousal': 1}
BJP 
-------
Most Informative Features
                     EUM = 1                   1 : 2      =     21.7 : 1.0
                     SBM = 3                   7 : 5      =     17.6 : 1.0
                     TBM = 2                   4 : 5      =     13.6 : 1.0
                     TUM = 1                   4 : 5      =     13.4 : 1.0
                     pos = 0.0                 6 : 3      =     12.8 : 1.0
                     SUM = 3                   7 : 5      =     12.8 : 1.0
                     STM = 3                   7 : 5      =     12.8 : 1.0
                     SBM = 2                   7 : 5      =     12.6 : 1.0
                compound = 0.0                 6 : 3      =     12.4 : 1.0
                     ATM = 3                   0 : 2      =     11.1 : 1.0
                     SBM = 1                   7 : 5      =     10.5 : 1.0
           retweet_count = 5                   3 : 5      =     10.5 : 1.0
                     SUM = 1                   7 : 5      =      9.9 : 1.0
                     STM = 1                   7 : 2      =      9.5 : 1.0
                     DBM = 1                   3 : 5      =      8.5 : 1.0
                     DUM = 6                   3 : 5      =      8.3 : 1.0
                     EBM = 1                   1 : 2      =      8.1 : 1.0
                     ETM = 2                   1 : 5      =      7.7 : 1.0
                     ETM = 3                   1 : 5      =      7.7 : 1.0
                     SUM = 2                   7 : 5      =      7.7 : 1.0
accuracy by using Naive Bayes: 0.59

Confusion Matrix from Naive Bayes Classifier:
  |  0  1  2  3  4  5  6  7 |
--+-------------------------+
0 | <.> .  1  .  .  1  1  . |
1 |  . <2> 2  .  .  1  5  . |
2 |  .  2<15> 2  .  9  .  2 |
3 |  .  .  . <1> .  4  .  . |
4 |  .  .  .  . <1> 2  2  . |
5 |  .  2  8  1  1<22> 2  . |
6 |  .  .  1  .  .  . <7> . |
7 |  .  .  .  .  2  .  1 <.>|
--+-------------------------+
(row = reference; col = test)
Naive Bayes Precision 0.69
Naive Bayes Recall 0.59
Naive Bayes F_Score 0.57

Congress : 
-----------
Most Informative Features
                     STM = 2                   7 : 2      =     35.9 : 1.0
                     SBM = 3                   7 : 2      =     31.4 : 1.0
                     EUM = 1                   1 : 2      =     29.0 : 1.0
                     SUM = 3                   7 : 2      =     23.2 : 1.0
                     STM = 3                   7 : 2      =     23.2 : 1.0
                     ATM = 3                   0 : 2      =     16.5 : 1.0
                     EUM = 5                   1 : 2      =     15.3 : 1.0
                     SBM = 2                   7 : 5      =     15.1 : 1.0
                     pos = 0.0                 6 : 3      =     15.0 : 1.0
                compound = 0.0                 6 : 3      =     14.3 : 1.0
                     NBM = 16                  1 : 5      =     13.4 : 1.0
                     SUM = 4                   0 : 2      =     12.9 : 1.0
                polarity = 0.5                 5 : 2      =     12.6 : 1.0
                     ETM = 3                   1 : 5      =     12.3 : 1.0
                     DBM = 10                  7 : 5      =     12.3 : 1.0
                     SBM = 1                   7 : 5      =     12.1 : 1.0
                     DBM = 1                   2 : 5      =     11.9 : 1.0
                     SUM = 2                   7 : 5      =     11.8 : 1.0
           retweet_count = 5                   3 : 5      =     11.8 : 1.0
                     TBM = 2                   4 : 5      =     11.4 : 1.0
accuracy by using Naive Bayes: 0.79
(row = reference; col = test)

Naive Bayes Classifier Precision 0.80
Naive Bayes Classifier Recall 0.79
Naive Bayes Classifier F_Score 0.78

Naive Bayes and MaxEnt classifiers can be trained in a similar manner.

import nltk
from nltk.tokenize import *
from nltk.util import ngrams
from nltk.classify import *
import preprocessor as p
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

Naive Bayes Training and Testing

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(20)
print('accuracy by using Naive Bayes:', nltk.classify.util.accuracy(classifier, test_set))

print('Naive Bayes Confusion Matrix' , nltk.ConfusionMatrix(labels, tests))
print('Naive Bayes Classification Report', classification_report(labels, tests))
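NLTK classifiers are trained on a list of (feature dictionary, label) pairs. The sketch below shows one way such train and test sets could be assembled; the column selection, the helper names and the toy rows are assumptions for illustration, and the real feature set in this post is larger.

```python
import random

# A subset of the feature names used in this post, chosen for illustration
FEATURE_COLUMNS = ["SUM", "SBM", "STM", "pos", "neg", "compound", "polarity", "retweet_count"]


def to_featuresets(rows, label_key="mood"):
    """Turn row dictionaries into the (feature_dict, label) pairs NLTK expects."""
    return [({name: row[name] for name in FEATURE_COLUMNS if name in row}, row[label_key])
            for row in rows]


def split_featuresets(featuresets, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and split into train and test lists."""
    shuffled = featuresets[:]
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Toy rows; real rows would come from the labelled tweet DataFrame
rows = [
    {"SUM": 1, "SBM": 0, "pos": 0.0, "neg": 0.4, "retweet_count": 5, "mood": "sadness"},
    {"SUM": 0, "SBM": 0, "pos": 0.6, "neg": 0.0, "retweet_count": 2, "mood": "joy"},
    {"SUM": 0, "SBM": 0, "pos": 0.3, "neg": 0.1, "retweet_count": 0, "mood": "joy"},
    {"SUM": 2, "SBM": 1, "pos": 0.0, "neg": 0.5, "retweet_count": 9, "mood": "sadness"},
    {"SUM": 0, "SBM": 0, "pos": 0.1, "neg": 0.1, "retweet_count": 1, "mood": "neutral"},
]

train_set, test_set = split_featuresets(to_featuresets(rows))
print(len(train_set), len(test_set))
```

The resulting train_set and test_set have the shape expected by the train and accuracy calls shown above.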

MaxEnt Training and Testing

classifier = MaxentClassifier.train(train_set)
classifier.show_most_informative_features(20)
print('accuracy by using Max Entropy:', nltk.classify.util.accuracy(classifier, test_set))

print('Max Entropy Confusion Matrix' , nltk.ConfusionMatrix(labels, tests))
print('Max Entropy Classification Report', classification_report(labels, tests))

Comparative study of Precision-Recall-F-Score

Maximum Entropy Model

The Maximum Entropy classifier is a conditional classifier built to predict P(label|input) — the probability of a label given the input value. The classifier works on search techniques to find a set of parameters that will maximize the performance of the classifier. It works on iterative optimization techniques to bring the model parameters close to optimal value.

MaxEnt training can use algorithms such as Generalized Iterative Scaling (GIS), Improved Iterative Scaling (IIS), Conjugate Gradient (CG) and BFGS optimization. The optimization takes time that increases with the size of the training data set, the number of features and the number of labels. While GIS and IIS are slow, CG and BFGS are faster. In-depth analysis of each of the algorithms is out of scope for this post.

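The objective of the parameter search is to maximize the total likelihood of the training corpus, which can be defined as:

P(features) = Σ_{x ∈ corpus} P(label(x) | features(x))

where P(label|features), the probability that an input whose features are features will have class label label, is defined as:

P(label|features) = P(label, features) / Σ_{label} P(label, features)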

The MaxEnt model works on the principle of joint-features, where each combination of labels and features is assigned a parameter. This allows a user to define the combinations of labels and features while building the model. A single parameter can be used to associate a feature with more than one label. Likewise, two or more features can be associated with a given label.

The association of feature and label to create joint-feature can be summarized as:

a. Define a joint-feature for each label, corresponding to w[label], and for each combination of feature and label, corresponding to w[f, label]

b. Compute frequency of each joint-feature — i.e., the frequency with which it occurs in the training set.

c. Search and find the distribution with Maximum Entropy.

d. Associate a score with each label of an input by computing the product of the parameters associated with the joint-features that apply. Mathematically,

P(input, label) = Π_{joint-features(input, label)} w[joint-feature]
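The scoring in step d, followed by normalisation into P(label|features), can be sketched as below. The function name maxent_posterior is hypothetical, the weights are toy values rather than parameters learned by NLTK, and a missing joint-feature is treated as a neutral weight of 1.0, which is an assumption of this sketch.

```python
def maxent_posterior(weights, features, labels):
    """Score each label as a product of joint-feature weights, then normalise."""
    scores = {}
    for label in labels:
        score = weights.get(label, 1.0)  # w[label]
        for feature in features:
            score *= weights.get((feature, label), 1.0)  # w[f, label]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Toy weights for illustration only
weights = {
    ("TBM=2", "fear"): 3.0, ("TBM=2", "joy"): 0.5,
    ("pos=0.0", "fear"): 1.0, ("pos=0.0", "joy"): 2.0,
}

posterior = maxent_posterior(weights, ["TBM=2", "pos=0.0"], ["fear", "joy"])
print(posterior)
```

Dividing each score by the sum over all labels is what turns the joint scores into the conditional distribution P(label|features).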

Since the MaxEnt classifier works on a conditional probability distribution, there may be many distributions consistent with each input and label for the joint-features. However, MaxEnt always selects the distribution whose label probabilities are most evenly distributed, i.e. the distribution with Maximum Entropy.
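Entropy measures how evenly a distribution spreads its probability. A minimal sketch with two toy distributions over four labels shows that the even one has the higher entropy; the function name entropy_bits is hypothetical. The last line relates this to the training logs further down, where the first iteration reports a log likelihood of -2.07944, the natural log of 1/8, which is what a uniform guess over eight moods would give.

```python
import math


def entropy_bits(distribution):
    """Shannon entropy in bits of a {label: probability} distribution."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Toy distributions for illustration
uniform = {"joy": 0.25, "fear": 0.25, "anger": 0.25, "neutral": 0.25}
skewed = {"joy": 0.7, "fear": 0.1, "anger": 0.1, "neutral": 0.1}

print(entropy_bits(uniform))           # 2.0
print(round(entropy_bits(skewed), 3))

# Log likelihood of a uniform guess over eight labels
print(round(math.log(1 / 8), 5))       # -2.07944
```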

Results using the Maximum Entropy Classifier with the following input features: word frequencies of n-grams (1/2/3 gram words), positive, negative and compound scores obtained from SentimentAnalyzer, tweet polarity and retweet count.

BJP: 
-----

Label Encoding of Moods
{'fear': 4, 'neutral': 6, 'dominance': 2, 'joy': 5, 'faith': 3, 'anger': 0, 'sadness': 7, 'arousal': 1}  

Training and Testing by using Max Entropy.....
  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.07944        0.043
             2          -1.30307        0.483
             3          -1.20139        0.563
             4          -1.11100        0.601
             5          -1.03157        0.631
             6          -0.96236        0.674
             7          -0.90215        0.702
             8          -0.84962        0.736
             9          -0.80355        0.756
         Final          -0.76289        0.763

   1.134 pos==0.315 and label is 0
   1.046 pos==0.277 and label is 7
   1.035 neg==0.446 and label is 0
   0.995 pos==0.18 and label is 4
   0.980 pos==0.302 and label is 1
   0.975 pos==0.6509999999999999 and label is 7
   0.964 neg==0.231 and label is 7
   0.960 pos==0.394 and label is 0
   0.955 retweet_count==39 and label is 4
   0.954 compound==0.6774 and label is 4

Accuracy from Max Entropy Classifier: 0.51
Confusion Matrix from MaxEnt Classifier:
  |  0  1  2  3  4  5  6  7 |
--+-------------------------+
0 | <.> .  1  .  .  2  .  . |
1 |  . <1> 5  .  .  4  .  . |
2 |  .  1<14> .  . 15  .  . |
3 |  .  .  1 <.> .  4  .  . |
4 |  .  .  .  . <.> 5  .  . |
5 |  .  .  7  .  .<29> .  . |
6 |  .  .  8  .  .  . <.> . |
7 |  .  .  1  .  .  2  . <.>|
--+-------------------------+
(row = reference; col = test)
Max Entropy Precision 0.47
Max Entropy Recall 0.51
Max Entropy F_Score 0.43

Congress:
---------

Training and Testing by using Max Entropy.....
  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.07944        0.042
             2          -1.30335        0.482
             3          -1.20265        0.565
             4          -1.11320        0.598
             5          -1.03467        0.628
             6          -0.96627        0.671
             7          -0.90679        0.694
             8          -0.85489        0.724
             9          -0.80937        0.741
         Final          -0.76919        0.754

   1.161 pos==0.217 and label is 1
   1.136 pos==0.315 and label is 0
   1.102 pos==0.17600000000000002 and label is 1
   1.081 pos==0.277 and label is 7
   1.058 pos==0.6970000000000001 and label is 1
   1.052 pos==0.35200000000000004 and label is 4
   1.052 neg==0.319 and label is 4
   1.037 pos==0.38 and label is 4
   1.037 neg==0.157 and label is 4
   1.032 neg==0.446 and label is 0

Accuracy from MaxEnt Classifier: 0.72
Confusion Matrix from MaxEnt Classifier:


  |  0  1  2  3  4  5  6  7 |
--+-------------------------+
0 | <1> .  1  .  .  1  .  . |
1 |  . <4> 2  .  .  4  .  . |
2 |  .  .<25> .  .  5  .  . |
3 |  .  .  . <1> .  4  .  . |
4 |  .  .  1  . <3> 1  .  . |
5 |  .  .  1  .  .<35> .  . |
6 |  .  .  3  .  .  4 <1> . |
7 |  .  .  1  .  .  .  . <2>|
--+-------------------------+
(row = reference; col = test)

Max Entropy Precision 0.79
Max Entropy Recall 0.72
Max Entropy F_Score 0.68

Comparative study of Precision-Recall-F-Score

Generative (Naive bayes) vs Conditional Classifiers (MaxEnt)

Conclusion:

The analytics of the previous post and the predicted results from the text based classifiers show that some moods, like “Anger”, “Sadness” and “Fear”, occur much less frequently than others, like “Dominance”, “Joy” and “Arousal”. Sampling techniques for imbalanced data (under-sampling/over-sampling) can help to increase the accuracy for the minority mood classes.

The different classification and prediction techniques employed for generic tweets show that FastText is the most efficient way of classifying and predicting tweet moods for both the parties, followed by the Naive Bayes and MaxEnt classifiers.

Further, the results show that the predicted mood for Congress has greater prediction accuracy than that for BJP. The models can be further enhanced by choosing the right features from the n-gram frequency distributions, retweet count and the different polarity metrics.

In the next post I will go into the details of classifying highly polarized tweets, bring forward other standard ML techniques that can be used for Text Classification and explain how to interpret the accuracies of the various models.

References:

Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, 2016
https://pdfs.semanticscholar.org/4667/88e0ba1f608981ca5422ddfb5bfedeef75d0.pdf
https://www.nltk.org/book/ch06.html
https://medium.com/analytics-vidhya/twitter-sentiment-analysis-for-the-2019-election-8f7d52af1887

Disclaimer Statement

This work analyses tweets about two prominent parties for the upcoming election. The author has no intention to create controversy in people’s minds, hurt anybody’s feelings or incite feelings of anger or hatred. It is done purely for academic, research and information purposes, and somebody else might get different results by applying other techniques of analysis. It is an unbiased and impartial summary and does not discriminate against or differentiate between any individuals or groups.

Sentiment Classification for 2019 Lok Sabha Elections Using Text Based Classifiers