NLP Tutorial: Creating Question Answering System using BERT + SQuAD on Colab TPU

Written by pragnakalp-techlabs | Published 2020/02/18

TLDR: NLP Tutorial: Creating Question Answering System using BERT + SQuAD on Colab TPU. BERT pre-trained models can be used for language classification, question answering, next-word prediction, tokenization, etc. It is one of the best NLP models, with superior NLP capabilities. Fine-tuning is inexpensive and can be done in at most 1 hour on a single Cloud TPU, or a few hours on a GPU. We are not going to cover how to create a web-based interface using Python + Flask; you can create your own interface using Flask or Django.

Open-sourced by the Google Research team, the pre-trained models of BERT achieved wide popularity amongst NLP enthusiasts for all the right reasons! It is one of the best Natural Language Processing pre-trained models, with superior NLP capabilities. It can be used for language classification, question answering, next-word prediction, tokenization, etc.
Our case study Question Answering System in Python using BERT NLP [1] and our BERT-based Question and Answering system demo [2], developed in Python + Flask, became hugely popular, garnering hundreds of visitors per day. We received many appreciative emails praising the QnA demo, along with a number of questions about how we created it. To this day, we keep getting requests on how to develop such a QnA system using the BERT pre-trained model open-sourced by Google.
To start with, the readme file in the official GitHub repository of BERT provides a good amount of information about how to fine-tune the model on SQuAD 2.0, but we could see that developers were still facing issues. So we decided to publish a step-by-step tutorial to fine-tune the BERT pre-trained model and run inference to answer questions from a given paragraph, on Colab using a TPU.
In this tutorial, we are not going to cover how to create a web-based interface using Python + Flask. We'll just cover the fine-tuning and inference on Colab using a TPU. You can create your own interface using Flask or Django.
Overview
In this tutorial, we will see how to fine-tune BERT on SQuAD using Google Colab. For that we will use the official BERT GitHub repository, which includes:
  1. TensorFlow code for the BERT model architecture.
  2. Pre-trained models for both the uncased and cased versions of BERT-Base and BERT-Large.
You can also refer to or copy our Colab file to follow the steps.

Steps to perform BERT Fine-tuning on Google Colab

1) Change Runtime to TPU
On the main menu, click on Runtime and select Change runtime type, then select TPU as the hardware accelerator from the dropdown.
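Optionally, you can confirm the switch worked with a one-line check of our own (not part of the official BERT steps): Colab sets the COLAB_TPU_ADDR environment variable only when a TPU runtime is attached.
import os
# Prints the TPU's internal address if a TPU runtime is attached.
print(os.environ.get('COLAB_TPU_ADDR', 'Not connected to a TPU'))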

2) Clone the BERT github repository

BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. You can find the academic paper here: https://arxiv.org/abs/1810.04805.
BERT has two stages: Pre-training and fine-tuning.
Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but it is a one-time procedure. The BERT team has released a number of pre-trained models, so most NLP researchers will never need to pre-train their own model from scratch.
Fine-tuning is inexpensive: all the results given in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU. For example, SQuAD can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%.
So our first step is to clone the BERT GitHub repository, then move into it using the cd command:
!git clone https://github.com/google-research/bert.git
cd bert
3) Download the BERT pre-trained model
Available BERT pre-trained models:
  • BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Uncased (Orig, not recommended): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
The BERT team has released BERT-Base and BERT-Large models, each in an uncased and a cased version. Uncased means that the text is converted to lowercase before performing WordPiece tokenization, e.g., John Smith becomes john smith; cased means that the true case and accent markers are preserved.
When using a cased model, make sure to pass --do_lower_case=False at the time of training.
You can download any model of your choice. We have used the BERT-Large, Uncased model.
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
# Unzip the pretrained model
!unzip uncased_L-24_H-1024_A-16.zip
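If you are curious what the uncased model actually does to text, here is a small sketch of our own (it assumes you are still inside the cloned bert directory from step 2, so its tokenization module is importable, and that the model has been unzipped as above):
import tokenization

# The uncased WordPiece tokenizer lowercases the input and splits
# out-of-vocabulary words into subword pieces.
tokenizer = tokenization.FullTokenizer(
    vocab_file='uncased_L-24_H-1024_A-16/vocab.txt',
    do_lower_case=True)
print(tokenizer.tokenize('John Smith founded a company'))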
4) Download the SQuAD 2.0 dataset
For the question answering task, we will be using the SQuAD 2.0 dataset.
SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD 2.0 combines the 100,000+ questions in SQuAD 1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. You can download the dataset from the SQuAD site: https://rajpurkar.github.io/SQuAD-explorer/
#Download the SQUAD train and dev dataset
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
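As a quick sanity check of our own (optional), you can load the downloaded dev set and count how many questions are answerable versus unanswerable:
import json

# Walk the SQuAD 2.0 structure: data -> paragraphs -> qas.
with open('dev-v2.0.json') as f:
    squad = json.load(f)

total = impossible = 0
for article in squad['data']:
    for paragraph in article['paragraphs']:
        for qa in paragraph['qas']:
            total += 1
            if qa.get('is_impossible', False):
                impossible += 1

print('questions:', total, '| unanswerable:', impossible)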
5) Set up your TPU environment
  • Verify that you are connected to a TPU device
  • Find your TPU address, which is used at the time of fine-tuning
  • Perform Google authentication to access your bucket
  • Upload your credentials to the TPU so it can access your GCS bucket
The code below performs the four steps mentioned above:
import json
import os
import pprint
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.
6) Create an output directory
Prerequisite: You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket to run the Colab file. Please follow the Google Cloud documentation on how to create a GCP account and a GCS bucket. You get $300 of free credit to start with any GCP product; learn more at https://cloud.google.com/tpu/docs/setup-gcp-account.
You can create your GCS bucket from here: http://console.cloud.google.com/storage.
As we are using a Cloud TPU, we need to store the pre-trained model and the output directory in Google Cloud Storage. If you do not store them in a bucket, you may face the following error:
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on uncased_L-24_H-1024_A-16/bert_model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: 'uncased_L-24_H-1024_A-16/bert_model.ckpt')
[[node checkpoint_initializer_14 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
You will get your fine-tuned model in the Google Cloud Storage bucket after training completes. For that, you need to provide your BUCKET name and OUTPUT DIRECTORY name.
BUCKET = 'bertnlpdemo' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'bert_output' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))
7) Move the pre-trained model to the GCS bucket
We need to move the pre-trained model to the GCS (Google Cloud Storage) bucket, as the local file system is not supported on TPU. If you don't move your pre-trained model to the bucket, you may face the error shown above.
The gsutil mv command allows you to move data between your local file system and the cloud, move data within the cloud, and move data between cloud storage providers.
!gsutil mv /content/bert/uncased_L-24_H-1024_A-16 $BUCKET_NAME
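To verify the move succeeded, you can list the bucket contents (an optional check of our own):
!gsutil ls $BUCKET_NAME/uncased_L-24_H-1024_A-16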
8) Training
Below is the command to run the training. To run the training on TPU, you need to make sure the following hyperparameters are set: use_tpu must be True, and tpu_name must be the TPU address we found above (in Colab, $TPU_ADDRESS in the shell command below expands to the Python variable set in step 5).
--use_tpu=True
--tpu_name=YOUR_TPU_ADDRESS
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --do_train=True \
  --train_file=train-v2.0.json \
  --do_predict=True \
  --predict_file=dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --use_tpu=True \
  --tpu_name=$TPU_ADDRESS \
  --max_seq_length=384 \
  --doc_stride=128 \
  --version_2_with_negative=True \
  --output_dir=$OUTPUT_DIR
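Training writes numbered checkpoints to OUTPUT_DIR as it progresses; you will need the final step number for the prediction command below. Here is a small sketch of our own to list them, assuming OUTPUT_DIR is still defined from step 6 (TF 1.x API):
import tensorflow as tf

# Each checkpoint has an .index file; the number after 'model.ckpt-'
# is the global training step.
for path in tf.gfile.Glob(OUTPUT_DIR + '/model.ckpt-*.index'):
    print(path)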
Create a testing file
We create input_file.json as a blank JSON file and then write data into it in the SQuAD dataset format.
- touch is used to create a file
- %%writefile is used to write a file in Colab (it must be the first line of its cell, so run it in a separate cell from the touch command)
You can pass your own questions and context in the file below.
!touch input_file.json
%%writefile input_file.json
{
    "version": "v2.0",
    "data": [
        {
            "title": "your_title",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "Who is current CEO?",
                            "id": "56ddde6b9a695914005b9628",
                            "is_impossible": ""
                        },
                        {
                            "question": "Who founded google?",
                            "id": "56ddde6b9a695914005b9629",
                            "is_impossible": ""
                        },
                        {
                            "question": "when did IPO take place?",
                            "id": "56ddde6b9a695914005b962a",
                            "is_impossible": ""
                        }
                    ],
                    "context": "Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet."                
                 }
            ]
        }
    ]
}
Prediction
Below is the command to run your own custom prediction: update input_file.json with your paragraph and questions, then execute the command. Note that the checkpoint number in init_checkpoint (model.ckpt-10859 below) depends on your training run; use the latest checkpoint written to your output directory.
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$OUTPUT_DIR/model.ckpt-10859 \
  --do_train=False \
  --max_query_length=30  \
  --do_predict=True \
  --predict_file=input_file.json \
  --predict_batch_size=8 \
  --n_best_size=3 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=output/
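When prediction finishes, run_squad.py writes predictions.json to the output directory, mapping each question id to its best answer (along with nbest_predictions.json containing the top candidates). A minimal way to print the answers:
import json

# Each key is a question id from input_file.json;
# each value is the predicted answer span.
with open('output/predictions.json') as f:
    predictions = json.load(f)

for question_id, answer in predictions.items():
    print(question_id, '=>', answer)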
To make it easier for you, we have already created a Colab file which you can copy to your Google Drive and execute the commands. You can access the Colab file at: Question Answering System using BERT + SQuAD on Colab TPU.
If you have any further questions or doubts, please feel free to post them in the comments. We'll get back to you.
Previously published at https://www.pragnakalp.com/nlp-tutorial-setup-question-answering-system-bert-squad-colab-tpu/
References
  1. Question Answering System in Python using BERT NLP (our case study)
  2. BERT based Question and Answering system demo
