A Beginner's Guide to Incorporating Tabular Data via HuggingFace Transformers

Written by codekgu | Published 2020/11/11
Tech Story Tags: machine-learning | natural-language-processing | transformers | pytorch | python-machine-learning | huggingface | data-science | artificial-intelligence

TLDR Transformer-based models are a game-changer when it comes to using unstructured text data. The top-performing models in the General Language Understanding Evaluation (GLUE) benchmark are all BERT-based models. Transformer models can learn long-range dependencies in text and can be trained in parallel (as opposed to sequence-to-sequence models), meaning they can be pre-trained on large amounts of data. We set out to explore how text and tabular data could be used together to provide stronger signals in our projects.

Transformer-based models are a game-changer when it comes to using unstructured text data. As of September 2020, the top-performing models in the General Language Understanding Evaluation (GLUE) benchmark are all BERT transformer-based models. At Georgian, we often encounter scenarios where we have supporting tabular feature information and unstructured text data. We found that by using the tabular data in these models, we could further improve performance, so we set out to build a toolkit that makes it easier for others to do the same.

Building on Top of Transformers

The main benefits of using transformers are that they can learn long-range dependencies in text and can be trained in parallel (as opposed to sequence-to-sequence models), meaning they can be pre-trained on large amounts of data.
Given these advantages, BERT is now a staple model in many real-world applications. Likewise, with libraries such as HuggingFace Transformers, it’s easy to build high-performance transformer models on common NLP problems.
Currently, transformer models using unstructured text data are well understood. However, in the real world, text data is often supported by rich structured data or other unstructured data like audio or visual information. Each of these might provide signals that any one alone would not. We call these different ways of experiencing data—audio, visual or text—modalities.
Think about E-commerce reviews as an example. Beyond the review text itself, information about the seller, buyer and product is available as numerical and categorical features. 
We set out to explore how text and tabular data could be used together to provide stronger signals in our projects. We started by exploring the field known as multimodal learning, which focuses on how to process different modalities in machine learning.

Multimodal Literature Review

The current models for multimodal learning mainly focus on learning from the sensory modalities such as audio, visual, and text.
Within multimodal learning, there are several branches of research. The MultiComp Lab at Carnegie Mellon University provides an excellent taxonomy. Our problem falls under Multimodal Fusion—joining information from two or more modalities to make a prediction.
As text data is our primary modality, we focus on the literature that treats text as the main modality and introduces models that leverage the transformer architecture. 

Trivial Solution to Structured Data

Before we dive into the literature, there is a simple solution in which the structured data is treated as regular text and appended to the standard text inputs. Taking the E-commerce reviews example, the input can be structured as follows: Review. Buyer Info. Seller Info. Numbers/Labels. Etc. One caveat with this approach, however, is that it is limited by the maximum token length that a transformer can handle, as the sketch below illustrates with truncation.
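To make this baseline concrete, here is a minimal sketch, assuming a BERT tokenizer and the e-commerce review columns used later in this article; the row_to_text helper and the sample row are purely illustrative.

```python
# Minimal sketch of the append-as-text baseline.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def row_to_text(row):
    # Flatten the structured fields into plain text after the review itself.
    return (
        f"{row['Review Text']} Division: {row['Division Name']}. "
        f"Department: {row['Department Name']}. Rating: {row['Rating']}. "
        f"Positive feedback count: {row['Positive Feedback Count']}."
    )

sample = {
    "Review Text": "Love this dress! Fits perfectly.",
    "Division Name": "General",
    "Department Name": "Dresses",
    "Rating": 5,
    "Positive Feedback Count": 3,
}

# Truncation to 512 tokens illustrates the max-token-length caveat mentioned above.
encoded = tokenizer(row_to_text(sample), truncation=True, max_length=512, return_tensors="pt")
```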

Transformer on Images and Text

In the last couple of years, transformer extensions for image and text have advanced significantly. Supervised Multimodal Bitransformers for Classifying Images and Text by Kiela et al. (2019) uses pre-trained ResNet and pre-trained BERT features on unimodal images and text respectively and feeds these into a bidirectional transformer. The key innovation is adapting the image features as additional tokens to the transformer model, as the simplified sketch below illustrates.
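The following is a simplified sketch of that idea, not the authors' implementation; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ImageTokenAdapter(nn.Module):
    """Project pooled image features (e.g., 2048-d ResNet output) into the
    transformer's token-embedding space and prepend them as extra tokens."""

    def __init__(self, image_dim=2048, hidden_dim=768, num_image_tokens=3):
        super().__init__()
        self.num_image_tokens = num_image_tokens
        self.hidden_dim = hidden_dim
        self.proj = nn.Linear(image_dim, num_image_tokens * hidden_dim)

    def forward(self, image_feats, word_embeddings):
        # image_feats: (batch, image_dim); word_embeddings: (batch, seq_len, hidden_dim)
        image_tokens = self.proj(image_feats).view(-1, self.num_image_tokens, self.hidden_dim)
        # The image tokens are simply treated as additional input tokens.
        return torch.cat([image_tokens, word_embeddings], dim=1)
```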
Additionally, there are models, ViLBERT (Lu et al. 2019) and VL-BERT (Su et al. 2020), which define pre-training tasks for images and text. Both models pre-train on the Conceptual Captions dataset, which contains roughly 3.3 million image-caption pairs (web images with captions from alt text). In both cases, for a given image, a pre-trained object detection model like Faster R-CNN obtains vector representations for regions of the image, which serve as input token embeddings to the transformer model.
As an example, ViLBert pre-trains on the following training objectives:
  1. Masked multimodal modelling: Mask input image and word tokens. For the image, the model tries to predict a vector capturing image features for the corresponding image region, while for text, it predicts the masked text based on the textual and visual clues. 
  2. Multimodal alignment: predicting whether the image and text actually come from the same image-caption pair.
An image of the pretraining tasks and an example of masked multimodal learning are shown below. Given the image and text, if we mask out the word "dog", the model should be able to use the unmasked visual information to correctly predict the masked word to be "dog".
All these models use the bidirectional transformer model that is the backbone of BERT. The differences lie in the pretraining tasks the models are trained on and slight additions to the transformer. In the case of ViLBERT, the authors also introduce a co-attention transformer layer (shown below) to define the attention mechanism between the modalities explicitly.
Finally, there’s also LXMERT (Tan and Bansal 2019), another pre-trained transformer model that, as of Transformers version 3.1.0, is implemented as part of the library. The input to LXMERT is the same as for ViLBERT and VL-BERT. However, LXMERT pre-trains on aggregated datasets, which also include visual question answering datasets. In total, LXMERT pre-trains on 9.18 million image-text pairs.
Transformers on Aligning Audio, Visual, and Text
Beyond transformers for combining image and text, there are multimodal models for audio, video, and text modalities in which there is a natural ground truth temporal alignment. Papers along this direction include MulT, Multimodal Transformer for Unaligned Multimodal Language Sequences (Tsai et al. 2019), and the Multimodal Adaptation Gate (MAG) from Integrating Multimodal Information in Large Pretrained Transformers (Rahman et al. 2020). 
MulT is similar to ViLBERT in that co-attention is used between pairs of modalities. Meanwhile, MAG injects information from the other modalities at certain transformer layers via a gating mechanism; a simplified sketch of such a gate is shown below.
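As a rough illustration of the gating idea, here is a simplified sketch in the spirit of MAG, not the paper's exact formulation; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Shift the text hidden states by features from another modality,
    with a learned sigmoid gate controlling how much is injected."""

    def __init__(self, text_dim, other_dim):
        super().__init__()
        self.gate = nn.Linear(text_dim + other_dim, text_dim)
        self.shift = nn.Linear(other_dim, text_dim)

    def forward(self, text_hidden, other_feats):
        # text_hidden: (batch, seq_len, text_dim); other_feats: (batch, other_dim)
        other = other_feats.unsqueeze(1).expand(-1, text_hidden.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, other], dim=-1)))
        return text_hidden + g * self.shift(other)
```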
Transformers with Text and Knowledge Graph Embeddings
Some works have also identified knowledge graphs as a vital piece of information in addition to text data. Enriching BERT with Knowledge Graph Embeddings for Document Classification (Ostendorff et al. 2019) uses features from the author entities in the Wikidata knowledge graph in addition to metadata features for book category classification. In this case, the model is a simple concatenation of these features and BERT output text features of the book title and description before some final classification layers.
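A minimal sketch of that concatenation approach follows; the dimensions and class name are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConcatClassifier(nn.Module):
    """Concatenate BERT text features with knowledge-graph and metadata
    features, then classify with a small feed-forward head."""

    def __init__(self, text_dim=768, kg_dim=200, meta_dim=10, num_labels=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + kg_dim + meta_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, text_feats, kg_feats, meta_feats):
        # All inputs are per-example feature vectors; simply concatenate and classify.
        return self.head(torch.cat([text_feats, kg_feats, meta_feats], dim=-1))
```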
Key Takeaway
The main takeaway for adapting transformers to multimodal data is to ensure that there's an attention or weighting mechanism between the multiple modalities. This mechanism can occur at different points in the transformer architecture: in the encoded input embeddings, injected into the middle layers, or applied after the transformer encodes the text data.

Multimodal Transformers Toolkit

Using what we’ve learned from the literature review and the comprehensive HuggingFace library of state-of-the-art transformers, we’ve developed a toolkit. The multimodal-transformers package extends any HuggingFace transformer for tabular data. To see the code, documentation, and working examples, check out the project repo.
At a high level, the outputs of a transformer model on text data and the tabular features containing categorical and numerical data are combined in a combining module. Since there is no alignment in our data, we choose to combine the text features after the transformer’s output. The combining module implements several methods for integrating the modalities, including attention and gating methods inspired by the literature survey, as well as simpler options such as a weighted sum; a simplified sketch of a weighted-sum combiner is shown below. More details of these methods are available here.
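The sketch below gives a rough picture of what a weighted-sum combining module can look like. It is a simplified illustration, not the toolkit's internal code; the class name and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class WeightedSumCombiner(nn.Module):
    """Project categorical and numerical features into the text-feature space
    and combine all three with learned weights before a classification layer."""

    def __init__(self, text_dim, cat_dim, num_dim, num_labels):
        super().__init__()
        self.cat_proj = nn.Linear(cat_dim, text_dim)
        self.num_proj = nn.Linear(num_dim, text_dim)
        self.weights = nn.Parameter(torch.ones(3))  # one weight per modality
        self.classifier = nn.Linear(text_dim, num_labels)

    def forward(self, text_feats, cat_feats, num_feats):
        w = torch.softmax(self.weights, dim=0)
        combined = (w[0] * text_feats
                    + w[1] * self.cat_proj(cat_feats)
                    + w[2] * self.num_proj(num_feats))
        return self.classifier(combined)
```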

Walkthrough

Let's work through an example where we classify clothing review recommendations. We'll use a simplified version of the example included in the Colab notebook. We will use the Women's E-Commerce Clothing Reviews dataset from Kaggle, which contains roughly 23,000 customer reviews.
In this dataset, we have text data in the Title and Review Text columns. We also have categorical features from the Clothing ID, Division Name, Department Name, and Class Name columns, and numerical features from the Rating and Positive Feedback Count columns.
Loading The Dataset
We first load our data into a TorchTabularTextDataset, which works with PyTorch's data loaders and includes the text inputs for HuggingFace Transformers along with our specified categorical and numerical feature columns. For this, we also need to load our HuggingFace tokenizer.
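The snippet below sketches this step, assuming the toolkit's load_data helper and the column names from the Kaggle CSV; check the toolkit documentation for the exact argument names.

```python
import pandas as pd
from transformers import AutoTokenizer
from multimodal_transformers.data import load_data

# Column names follow the Women's E-Commerce Clothing Reviews dataset from Kaggle.
data_df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

torch_dataset = load_data(
    data_df,
    text_cols=["Title", "Review Text"],
    tokenizer=tokenizer,
    label_col="Recommended IND",  # 0/1 recommendation label
    label_list=["Not Recommended", "Recommended"],
    categorical_cols=["Clothing ID", "Division Name", "Department Name", "Class Name"],
    numerical_cols=["Rating", "Positive Feedback Count"],
)
```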
Loading Transformer with Tabular Model
Now we load our transformer with a tabular model. We first specify our tabular configurations in a TabularConfig object. This config is then set as the tabular_config member variable of a HuggingFace transformer config object. Here, we also specify how we want to combine the tabular features with the text features. In this example, we will use a weighted sum method.
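A sketch of the configuration step is shown below. The feature dimensions are placeholders that should match the processed categorical and numerical features produced by the dataset, and the combine method name follows the toolkit's documentation at the time of writing.

```python
from transformers import AutoConfig
from multimodal_transformers.model import TabularConfig

tabular_config = TabularConfig(
    num_labels=2,          # recommended vs. not recommended
    cat_feat_dim=4,        # placeholder: dimension of the processed categorical features
    numerical_feat_dim=2,  # placeholder: dimension of the numerical features
    combine_feat_method="weighted_feature_sum_on_transformer_cat_and_numerical_feats",
)

hf_config = AutoConfig.from_pretrained("bert-base-uncased")
hf_config.tabular_config = tabular_config  # attach the tabular config to the HF config
```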
Once we have the tabular_config set, we can load the model using the same API as HuggingFace. See the documentation for the list of currently supported transformer models that include the tabular combination module.
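For example, loading a BERT backbone with the tabular combining module attached, assuming the toolkit's AutoModelWithTabular class:

```python
from multimodal_transformers.model import AutoModelWithTabular

# Same from_pretrained API as HuggingFace, with the tabular_config set above.
model = AutoModelWithTabular.from_pretrained("bert-base-uncased", config=hf_config)
```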
Training
For training, we can use HuggingFace’s trainer class. We also need to specify the training arguments, and in this case, we will use the defaults.
Let's take a look at our models in training!
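Here is a sketch of the training step with HuggingFace's Trainer; the output directory and hyperparameters below are illustrative placeholders rather than the notebook's exact settings.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./clothing_review_model",  # illustrative output path
    logging_dir="./logs",
    overwrite_output_dir=True,
    do_train=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    logging_steps=25,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torch_dataset,  # the TorchTabularTextDataset built earlier
)
trainer.train()
```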

Results

Using this toolkit, we also ran our experiments on the Women’s E-Commerce Clothing Reviews dataset for recommendation prediction and the Melbourne Airbnb Open Data dataset for price prediction. The former is a classification task, while the latter is a regression task. Our results are in the table below. The text_only combine method is a baseline that uses only the transformer and is essentially equivalent to a HuggingFace ForSequenceClassification model.
We can see that incorporating tabular features improves performance over the text_only method. The performance gains also depend on how strong the training signals from the tabular data are. For example, in the review recommendation case, the text_only model is already a strong baseline. 

Next Steps

We've already used the toolkit successfully in our projects. Feel free to try it out on your next machine learning project! 
Check out the documentation and the included main script for how to do evaluation and inference. Also, if you want support for your favourite transformer, feel free to add transformer support here.

Appendix

Readers should check out The Illustrated Transformer and The Illustrated BERT for a well-summarized overview of transformers and BERT.
Here’s a quick taxonomy of papers we reviewed.
Transformer on Image and Text
  • Supervised Multimodal Bitransformers for Classifying Images and Text  (Kiela et al. 2019)
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (Lu et al. 2019)
  • VL-BERT: Pre-training of Generic Visual-Linguistic Representations (Su et al. ICLR 2020)
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Tan and Bansal, EMNLP 2019)
Transformers on Aligning Audio, Visual, and Text
  • Multimodal Transformer for Unaligned Multimodal Language Sequences (Tsai et al. ACL 2019)
  • Integrating Multimodal Information in Large Pretrained Transformers (Rahman et al. ACL 2020)
Transformers with Knowledge Graph Embeddings
  • Enriching BERT with Knowledge Graph Embeddings for Document Classification (Ostendorff et al. 2019)
  • ERNIE: Enhanced Language Representation with Informative Entities (Zhang et al. 2019)

Written by codekgu | Passionate about NLP and Graph Deep Learning Research. Georgian. UCLA
Published by HackerNoon on 2020/11/11