Automated Identification of Inclusiveness User Feedback: Testing the Effectiveness of 5 LLMs

Authors:

(1) Nowshin Nawar Arony;

(2) Ze Shi Li;

(3) Bowen Xu;

(4) Daniela Damian.

8 AUTOMATED IDENTIFICATION OF INCLUSIVENESS USER FEEDBACK

To answer our fourth research question, "How effective are pre-trained large language models in automatically identifying inclusiveness-related user feedback?", we assessed the five LLMs detailed in Section 4.3 by evaluating them on the same dataset. We fine-tuned the pre-trained models on each of the feedback sources and measured performance in terms of precision, recall, F1-score, and accuracy. Table 3 reports the results of the five models for each source.
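The four metrics named above can be illustrated with a minimal sketch. It assumes a binary labeling (1 = inclusiveness, 0 = non-inclusiveness) and scores only the positive class; the function name `binary_metrics`, the `Metrics` container, and the toy labels are hypothetical and not taken from the paper, which may have used a different averaging scheme.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float
    accuracy: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1-score, and accuracy for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return Metrics(precision, recall, f1, accuracy)


# Toy labels invented for illustration only (not data from the study).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

In this toy example there are three true positives, one false positive, one false negative, and three true negatives, so all four metrics equal 0.75.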

We found that the overall evaluation results are best for Twitter user feedback: all evaluation metrics are above 0.85 for all five classifiers. Among the five, BART achieves the best F1-score, 0.930. One possible reason for the better performance on Twitter is that Twitter data contains many posts that are obviously unrelated to inclusiveness (e.g., ads or completely random dialogue), which may make it relatively easier for the classifiers to separate inclusiveness from non-inclusiveness feedback.

We observed that the classification results for app reviews are not as good as those for Twitter (i.e., roughly 8-12% lower). For Play Store, the best-performing classifier is BERT, with an F1-score of 0.849. Many app reviews report bugs, but not all of these bug reports concern inclusiveness. This may make it harder for the classifiers to identify inclusiveness in app reviews.
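As a rough check on the size of this gap, the two best F1-scores quoted in this section can be compared directly. The sketch below is an assumption-laden illustration: the helper `score_gap` is hypothetical, pairing the best Twitter classifier with the best Play Store classifier is our choice, and the section does not say whether its percentages are relative differences or percentage points, so both are computed.

```python
def score_gap(reference, other):
    """Return (gap in percentage points, relative gap in percent) of other vs. reference."""
    if reference <= 0:
        raise ValueError("reference score must be positive")
    absolute_points = (reference - other) * 100
    relative_percent = (reference - other) / reference * 100
    return absolute_points, relative_percent


twitter_best_f1 = 0.930     # BART on Twitter, as quoted in this section
play_store_best_f1 = 0.849  # BERT on Play Store, as quoted in this section

points, percent = score_gap(twitter_best_f1, play_store_best_f1)
print(f"{points:.1f} percentage points, {percent:.1f}% relative")
```

Under either reading, the difference between these two best scores is a little over 8, which sits at the lower end of the range reported for app reviews.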

Finally, we observed that the classifiers perform worse on Reddit than on the other feedback sources. The best-performing classifier for Reddit data is GPT-2, and overall classifier performance on Reddit is roughly 8-14% lower than on Twitter. The most likely reason is that user feedback from Reddit is often complex and detailed, and a single post may cover an assortment of topics. Identifying the inclusiveness aspect of a Reddit post is therefore less clear-cut than in a Twitter post or app review.

This paper is available on arXiv under a CC 4.0 license.