The Unreasonable Ineffectiveness of Deep Learning in NLU

Written by _roysd | Published 2017/05/17
Tech Story Tags: machine-learning | ai | nlp | deep-learning | topics


On real world data, Deep Learning performance can be shallow

I often get pitched a superior deep learning solution for Natural Language Understanding (NLU). The plan appears prudent. After all, deep learning is the disruptive new force in AI. Better NLU would enable many useful advancements, ranging from smarter chat bots and virtual assistants to news categorization, with an ultimate promise of better language comprehension.

State of the Practice

Let's assume this superior deep learning (DL) "product" is called "(dot)AI". Their pitch deck will invariably have a bar chart that looks something like this, the claim being that the new DL topic classifier/tagger of (dot)AI is better than state-of-the-art methods.

In many industries, it is expected that production-grade ML classifiers have more than 90% accuracy for quality assurance and a decent user experience. This is the expected tolerance level for news categorization or conversational bots.

The chart presents an interesting proposition, even though performance is only slightly superior to the state of the art. In any product, what constitutes "good enough" depends on the tolerance for error specific to that industry. For example, a model's best accuracy score might be reasonable for video recommenders or image transcription, but fall outside the tolerance limits for news categorization.

You don't have to be a sceptic to ask the question: in the realm of natural language text classification, do DL techniques significantly outperform shallow methods, e.g. TF-IDF or bag-of-words (BoW) based approaches?

The assumption often is a confident Yes — that DL obliterates shallow methods in NLU. But does it? Three recent trends underpin this illusion:

  1. At industry AI conferences, deep learning talks overwhelmingly relate to image/audio/video data, with almost zero talks on production-level natural language tasks. Why?
  2. The media and others continuously hype deep learning as a silver bullet, without examining the actual results in papers. This can lead to confusion for practitioners trying to evaluate DL's utility in their domain.
  3. A lot of results just squeeze out a few extra percent of performance on some artificial benchmark, whereas robustness and applicability matter more.

While DL has taken the computing world by storm, its impact on certain fundamental NLU tasks remains uncertain, and its performance is not always superior. To understand why, let me first describe the NLU task, then the state-of-the-art models trying to solve it, and how DL underperforms.

A Fundamental NLU Task

A critical task in natural language understanding is to comprehend the topic of a sentence. The topic could be a tag (such as politics, music, gaming, immigration, or adventure-sports), but it usually isn't merely a named-entity task, such as extracting a person's name or a location.

A topic tagger will attempt to tag the first WFTV article as "sports" and the second WFTV article as "animals", although both mention `Tiger`. This can get complicated quickly due to things like word sense disambiguation, as is shown in the example on the right.

This type of software is called a topic tagger. Its utility cannot be overstated. Topics are key to extracting intent and formulating automated responses. Consider chat bots: the most common problem bot companies face is the lack of any automated way to capture what their users are messaging about. The only way to estimate user intent from bot messages is either human eye-balling or whatever matches pre-built regex scripts. Both methods are suboptimal and cannot cover a larger topic space.
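To make that concrete, here is a minimal, hypothetical sketch of the kind of regex-based intent script bot teams often start with; the intent names and patterns are made up for illustration:

```python
# Sketch of regex-based intent capture, the brittle approach a topic tagger
# is meant to replace. Intent names and patterns are hypothetical.
import re

INTENT_PATTERNS = {
    "order_status": re.compile(r"\b(where is|track|status of)\b.*\border\b", re.I),
    "refund":       re.compile(r"\b(refund|money back|return)\b", re.I),
}

def match_intent(message: str) -> str:
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(message):
            return intent
    return "unknown"  # everything outside the hand-written patterns is lost

print(match_intent("Can I get my money back for this?"))  # refund
print(match_intent("My package never arrived"))           # unknown
```

Every new phrasing of an intent needs a new hand-written pattern, which is exactly why the topic space such scripts cover stays small.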

In fact, topic tagging at various semantic resolutions is a gateway approach to NLU for two reasons: (1) text classification into topics is a precursor to most higher-level NLU tasks, such as sentiment detection, discourse analysis, episodic memory, and even question answering (the quintessential NLU task); (2) NLP pipelines are considerably prone to error propagation, i.e. an error in topic classification can jeopardize later analysis, such as episodic memory modeling, discourse, or even sentiment analysis. Thus, finding the right topic is crucial for NLU.

What good is the sentiment of a piece of news unless we know what exactly that sentiment is about? Incorrect topic tagging can adversely affect sentiment utility.

Comprehending the topic is the first step in taking meaningful action. In reality, topic classification is a hard problem, one which has at times been underestimated and overlooked by the AI community.

State of the Art

Over the years, several technologies have tried to tackle the topic classification problem: LSA, LDA, and others such as PLSI and Explicit Semantic Analysis. Half of these are either not production grade or don't scale well with messy real-world data. The other half have poor interpretability or need considerable post-processing of whatever they output.

New world models: Today, two main solutions appear overwhelmingly in topic classification performance comparisons. (1) First is the very deep Convolutional Neural Net [DCNN] model from 2016, which proposes the use of a very deep neural network architecture, the "state of the art in computer vision". (2) Second is the [FastText](https://arxiv.org/abs/1607.01759) approach (also 2016). Its performance is almost as good as DCNN's, but it is orders of magnitude faster in training and evaluation. Some call FastText the Tesla of NLP, whatever that means.

Both methods are elegant in their own way. The big difference is that FastText is a shallow network, whereas DCNN is a 29-layer deep neural net. FastText does not fit the "stereotype" of fancy deep neural nets. Instead, it uses word embeddings to solve the tag prediction task.

FastText extends the basic word embedding idea to predict a topic label, instead of predicting the middle/missing word (which, recall, is the original Word2Vec task).
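As a rough illustration, here is a minimal sketch of training such a classifier with the `fasttext` Python package; the file name, labels, and hyperparameters are assumptions for illustration, not the paper's exact setup:

```python
# Minimal sketch of a FastText-style topic classifier. Assumes a file
# "topics.train" with one example per line in the library's format, e.g.:
#   __label__sports Tiger Woods wins his fifth Masters title
#   __label__animals Zoo welcomes a rare Sumatran tiger cub
import fasttext

# Word (and word n-gram) embeddings are averaged and fed to a linear
# classifier; the whole model stays shallow, unlike the 29-layer DCNN.
model = fasttext.train_supervised(
    input="topics.train",
    epoch=25,       # several passes over a small corpus
    wordNgrams=2,   # add word bigrams as extra features
)

# Predict a topic label (and its probability) for a new sentence.
labels, probs = model.predict("Tiger sinks a long putt on the 18th hole")
print(labels, probs)
```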

Old world models: Holding the ground for the older/naive models are n-gram/bag-of-words based models and TF-IDF, which still find value in large-scale implementations.
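For comparison, a shallow baseline of this kind can be put together in a few lines; the following is a hedged sketch using scikit-learn, with a toy training set and topic labels invented purely for illustration:

```python
# A "shallow" old-world baseline: TF-IDF bag-of-words features feeding a
# linear classifier. The tiny training set and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Tiger Woods wins his fifth Masters title at Augusta",
    "Zoo welcomes a rare Sumatran tiger cub",
    "Senate passes the new immigration bill",
]
train_labels = ["sports", "animals", "politics"]

# Uni- and bigram counts weighted by TF-IDF, then plain logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

print(model.predict(["Senate debates the immigration reform"]))
```

On the benchmark datasets below, variations of exactly this kind of model are what the DL papers compare themselves against.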

Benchmark data: A final component in examining the state of the art is the datasets on which these models are tested. Benchmark datasets are key for reproducibility and comparative analysis. For topic classification tasks, three popular datasets are [AG news](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), Sogou news, and Yahoo! answers. They differ in corpus size and the number of topics (classes) present in the data.

The three datasets marked with red rectangles are specifically used for topic classification. Shown with an arrow is one instance in the dataset; the task is to predict the label by analyzing the sample.

Deep (Learning) Impact?

First, let's look at the results from the DCNN DL paper and compare them with the naive models. The numbers below indicate error rates when running a particular configuration of a model on the topic classification datasets.

This is Table 4 from the [DCNN] paper. The topic classification datasets (mentioned above) are marked with rectangles. The corresponding comparable error values are marked with red ellipses.

Four main observations here:

  1. In 2/3 of the topic classification datasets (AG + Sogou), the naive/shallow methods perform better than deep learning.
  2. In the 3rd dataset (Yah. Ans), DL reduces the error by just **~1.63** points.
  3. The accuracy of the best model on the **Yah. Ans** dataset is still only **~73%**, which is significantly below the tolerance level for most quality production systems.

An important thing to note: all 3 datasets have topic spaces of fewer than 11 topics, which is still somewhat synthetic. In real-world natural language data (news streams or conversational messaging), topic spaces can easily exceed 20 or 25 different topics (or intents). This is key, because the next point hints that topic space cardinality can have a huge impact on accuracy.

4. Accuracy Degradation: Notice that when the topic space grows from 4 to 10 (AG vs. Yah. Ans), the error skyrockets from _7.64_ to _28.26_ with the same model. While it's possible this is caused by spurious factors, such as imbalanced datasets, there is a good chance that a roughly four-fold increase in error is due to the complexities involved in generalizing over larger topic spaces.

Finally, let's look at FastText's performance on these datasets and compare it to the DCNN DL model and the naive approaches:

This is Table 1 from the FastText paper, showing accuracy values on the three topic classification datasets, comparing FastText with the naive methods.

Three further observations from FastText's comparative results:

5. Once again, in 2/3 of the datasets FastText performs better than the deep learning model. On the Yah. Ans dataset, FastText is inferior by only ~1.1 points.

6. The DCNN deep learning method actually performs worse than naive models in the first two datasets (AG and Sogou).

7. And again, in 2/3 of the datasets, the naive models' performance is comparable to or better than FastText's.

In addition to these (stunning) results, recall that non-DL models are usually orders of magnitude faster to train and much more interpretable.

Why is this _Unreasonable_?

Well, it looks like when it comes to topic classifiers, the old world models (naive/shallower) aren't ready to give up their throne just yet! This ineffectiveness of deep learning is somewhat unexpected. It is counterintuitive: given that the new world DL models were produced at a company with tons of data, performance should be significantly better.

However, we observe little difference in accuracy. Naive/older models are better than or comparable to DL models when classifying text into topics.

From "What Data Scientists Should Know about Deep Learning": with sufficient data, deep learning performance beats older algorithms.

Deep learning might have deep problems in classifying language, but the objective here isn't to disparage it or hint at some deep learning conspiracy. I think its impact is clear and promising. In computer vision, speech recognition, and game playing, DNNs have taken us where we have never been before.

But the reality is that your mileage may vary when using deep learning for a basic natural language task like text classification. Why this gap in performance between image/video/audio and language data? Perhaps it has to do with the patterns of biological signal processing required to "perceive" the former versus the patterns of cultural context required to "comprehend" the latter. In any case, there is still much we have to learn about the intricacies of learning itself, especially with different forms of multimedia.

