LSTM Based Word Detectors

This article covers the basics of LSTMs (Long Short-Term Memory networks) and implements a word detector using the architecture.

The detector implemented in this article is a cuss word detector that detects a custom set of cuss words.

What are LSTMs?

LSTMs, or Long Short-Term Memory cells, are memory units designed to solve the vanishing gradient problem of RNNs. The memory of a plain RNN is normally short-lived: in practice it struggles to retain information from more than roughly 8 to 9 time steps back. To retain information over much longer spans, such as 1000 time steps, we use an LSTM.
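To get a feel for why gradients vanish, here is a minimal sketch. It assumes, purely for illustration, that the gradient is scaled by the same constant factor at every time step as it flows backwards; the function name gradient_scale and the factor values are made up for this example.

```python
def gradient_scale(per_step_factor, n_steps):
    """Overall scaling of a gradient after flowing back through n_steps steps."""
    scale = 1.0
    for _ in range(n_steps):
        scale *= per_step_factor
    return scale

# A factor slightly below 1 shrinks the gradient towards zero very quickly.
for steps in (1, 10, 100, 1000):
    print(steps, gradient_scale(0.9, steps))

# A factor of exactly 1 leaves the gradient unchanged, however long the sequence.
print(gradient_scale(1.0, 1000))  # 1.0
```

With a factor of 0.9 the signal from 1000 steps back is effectively zero, while a factor of exactly 1 preserves it. Keeping that factor close to 1 along the cell state is, loosely speaking, what lets an LSTM remember for longer.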

LSTM History

1997: LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber.[1] By introducing Constant Error Carousel (CEC) units, LSTM deals with the vanishing gradient problem. The initial version of LSTM block included cells, input and output gates.[5]

1999: Felix Gers and his advisor Jürgen Schmidhuber and Fred Cummins introduced the forget gate (also called “keep gate”) into LSTM architecture,[6] enabling the LSTM to reset its own state.[5]

2000: Gers & Schmidhuber & Cummins added peephole connections (connections from the cell to the gates) into the architecture.[7] Additionally, the output activation function was omitted.[5]

2009: An LSTM-based model won the ICDAR connected handwriting recognition competition. Three such models were submitted by a team led by Alex Graves.[8] One was the most accurate model in the competition and another was the fastest.[9]

2013: LSTM networks were a major component of a network that achieved a record 17.7% phoneme error rate on the classic TIMIT natural speech dataset.[10]

2014: Kyunghyun Cho et al. put forward a simplified variant called Gated recurrent unit (GRU).[11]

2015: Google started using an LSTM for speech recognition on Google Voice.[12][13] According to the official blog post, the new model cut transcription errors by 49%. [14]

2016: Google started using an LSTM to suggest messages in the Allo conversation app.[15] In the same year, Google released the Google Neural Machine Translation system for Google Translate which used LSTMs to reduce translation errors by 60%.[16][17][18]

Apple announced at its Worldwide Developers Conference that it would start using LSTMs for QuickType[19][20][21] in the iPhone and for Siri.[22][23]

Amazon released Polly, which generates the voices behind Alexa, using a bidirectional LSTM for the text-to-speech technology.[24]

2017: Facebook performed some 4.5 billion automatic translations every day using long short-term memory networks.[25]

Researchers from Michigan State University, IBM Research, and Cornell University published a study at the Knowledge Discovery and Data Mining (KDD) conference.[26][27][28] Their study describes a novel neural network that performs better on certain data sets than the widely used long short-term memory neural network.

Microsoft reported reaching 94.9% recognition accuracy on the Switchboard corpus, incorporating a vocabulary of 165,000 words. The approach used "dialog session-based long-short-term memory".[29]

2019: Researchers from the University of Waterloo proposed a related RNN architecture which represents continuous windows of time. It was derived using the Legendre polynomials and outperforms the LSTM on some memory-related benchmarks.[30]

An LSTM model climbed to third place on the Large Text Compression Benchmark.[31][32]

LSTM Architecture

The diagram above involves some complex math. To simplify it, we can look at what each part of that math does, that is, the function of each gate. Simplified, the cell can be represented as

The rest of the article uses the following abbreviations.

LTM : Long term memory

STM : Short term memory

NLTM : New long term memory

NSTM : New short term memory

Working

1. The data from the LTM is passed into the forget gate, which keeps only certain features and discards the rest.

2. This filtered data is then passed to the remember gate and the use gate.

3. The data from the STM and the current event is passed into the learn gate.

4. The output of the learn gate is likewise passed to the remember gate and the use gate.

5. The remember gate combines the data from the forget gate and the learn gate to produce the NLTM.

6. The use gate combines the data from the forget gate and the learn gate to produce the NSTM.
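The six steps above can be sketched for a single time step using scalar values. This is a minimal illustration that follows the standard LSTM equations and maps them onto this article's gate names. The weight dictionary w, its key names and values, and the function lstm_step are assumptions made for the example and do not come from any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(event, stm, ltm, w):
    """One LSTM time step on scalars. Returns (NSTM, NLTM)."""
    # Forget gate: decide what fraction of the LTM to keep.
    keep = sigmoid(w["f_x"] * event + w["f_h"] * stm + w["f_b"])
    forgotten = keep * ltm

    # Learn gate: build candidate information from the STM and the event,
    # then decide how much of that candidate to admit.
    candidate = math.tanh(w["c_x"] * event + w["c_h"] * stm + w["c_b"])
    admit = sigmoid(w["i_x"] * event + w["i_h"] * stm + w["i_b"])
    learned = admit * candidate

    # Remember gate: combine the forget and learn outputs into the NLTM.
    nltm = forgotten + learned

    # Use gate: expose a filtered view of the new memory as the NSTM.
    expose = sigmoid(w["o_x"] * event + w["o_h"] * stm + w["o_b"])
    nstm = expose * math.tanh(nltm)
    return nstm, nltm

# Hypothetical weights: 0.5 for every input/state weight, 0.0 for every bias.
w = {f"{gate}_{part}": (0.0 if part == "b" else 0.5)
     for gate in "fcio" for part in ("x", "h", "b")}

stm, ltm = 0.0, 0.0
for event in [1.0, -0.5, 0.25]:
    stm, ltm = lstm_step(event, stm, ltm, w)
    print("event={:+.2f}  NSTM={:+.4f}  NLTM={:+.4f}".format(event, stm, ltm))
```

In a real LSTM every quantity here is a vector and the weights are learned matrices, but the flow of data between the gates is the same.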

If you wish to dig into the core mathematics behind the LSTM, check out this beautiful article.

Link : https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Our Model Architecture

LSTM Requirements

In the case of an LSTM, for each piece of data in a sequence (say, for a word in a given sentence), there is a corresponding hidden state h_t. This hidden state is a function of the pieces of data that the LSTM has seen over time; it is computed using the network's learned weights and represents both the short-term and long-term memory components for the data that the LSTM has already seen.

So, for an LSTM that is looking at words in a sentence, the hidden state of the LSTM will change based on each new word it sees. And, we can use the hidden state to predict the next word in a sequence or help identify the type of word in a language model, and lots of other things!

To create an LSTM in PyTorch we use

nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, num_layers=n_layers)

input_dim = the number of features in each element of the input sequence (a dimension of 20 means each element is a vector of 20 values)

hidden_dim = the size of the hidden state; this will be the number of outputs that each LSTM cell produces at each time step.

n_layers = the number of stacked LSTM layers to use; this is typically a value between 1 and 3, and a value of 1 means a single LSTM layer. The default value is 1.

Hidden State

Once an LSTM has been defined with input and hidden dimensions, we can call it and retrieve the output and hidden state at every time step:

out, hidden = lstm(input.view(1, 1, -1), (h0, c0))

The inputs to an LSTM are (input, (h0, c0)).

input = a Tensor containing the values in an input sequence; this has shape: (seq_len, batch, input_size)

h0 = a Tensor containing the initial hidden state for each element in a batch

c0 = a Tensor containing the initial cell memory for each element in the batch

h0 and c0 will default to 0 if they are not specified. Their dimensions are: (n_layers, batch, hidden_dim).
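The shapes described above can be checked with a short sketch. The sizes chosen here (4 input features, 3 hidden units, a sequence of 5 steps) are arbitrary values picked for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim, n_layers = 4, 3, 1
seq_len, batch = 5, 1

lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, num_layers=n_layers)

# input: (seq_len, batch, input_size)
inputs = torch.randn(seq_len, batch, input_dim)

# h0 and c0: (n_layers, batch, hidden_dim)
h0 = torch.zeros(n_layers, batch, hidden_dim)
c0 = torch.zeros(n_layers, batch, hidden_dim)

out, (hn, cn) = lstm(inputs, (h0, c0))

print(out.shape)  # torch.Size([5, 1, 3]) : one hidden state per time step
print(hn.shape)   # torch.Size([1, 1, 3]) : final hidden state
print(cn.shape)   # torch.Size([1, 1, 3]) : final cell memory

# Leaving out (h0, c0) is equivalent to passing zeros.
out_default, _ = lstm(inputs)
print(torch.allclose(out, out_default))  # True
```

For a single-layer LSTM like this one, the last row of out is the same tensor of values as hn.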

We know that an LSTM takes in an expected input size and hidden_dim, but sentences are rarely of a consistent size, so how can we define the input of our LSTM?

Well, at the very start of this net, we'll create an Embedding layer that takes in the size of our vocabulary and returns a vector of a specified size, embedding_dim, for each word in an input sequence of words. It's important that this be the first layer in this net. You can read more about this embedding layer in the PyTorch documentation.
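As a rough sketch of that idea, the snippet below maps sentences of different lengths to embedding vectors of a fixed size. The toy vocabulary vocab and the helper embed are made up for this example; only nn.Embedding itself comes from PyTorch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary, for illustration only.
vocab = {"what": 0, "a": 1, "nice": 2, "day": 3}
embedding_dim = 6

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)

def embed(sentence):
    """Turn a sentence into a (seq_len, embedding_dim) tensor."""
    idxs = torch.tensor([vocab[w] for w in sentence.lower().split()],
                        dtype=torch.long)
    return embedding(idxs)

short = embed("nice day")
long_ = embed("what a nice day")

print(short.shape)  # torch.Size([2, 6])
print(long_.shape)  # torch.Size([4, 6])

# Add a batch dimension of 1 to get the (seq_len, batch, input_size) layout.
lstm_input = long_.view(len(long_), 1, -1)
print(lstm_input.shape)  # torch.Size([4, 1, 6])
```

The sequence length varies from sentence to sentence, but every word is always represented by embedding_dim values, and that fixed size is what the LSTM receives as its input_size.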

Pictured below is the expected architecture for this tagger model.

Code

import torch
import torch.nn as nn
import torch.nn.functional as F

import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

data = [("What the fuck".lower().split() , ["O","O","CS"]),
        ("The boy asked him to fuckoff".lower().split() ,["O","O","O","O","O","CS"]),
        ("I hate that bastard".lower().split() , ["O","O","O","CS"]),
        ("He is a dicked".lower().split(),["O","O","O","CS"]),
        ("Hey prick".lower().split(),["O","CS"]),
        ("What a pussy you are".lower().split() , ["O","O","CS","O","O"]),
        ("Dont be a cock".lower().split(),["O","O","O","CS"])]

word2idx = {}

for sent , tag in data:
  for word in sent:
    if word not in word2idx:
      word2idx[word] = len(word2idx)

tag2idx = {"O" : 0 , "CS" : 1}
tag2rev = {0 : "O" , 1 : "CS"}

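def prepare_sequence(seq , to_idx):
  # Map each token (word or tag) to its integer index
  idxs = [to_idx[word] for word in seq]
  # nn.Embedding and nn.NLLLoss both expect int64 (long) indices
  return torch.tensor(idxs , dtype = torch.long)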

testsent = "fuckoff boy".lower().split()
inp = prepare_sequence(testsent , word2idx)
print("The test sentence {} is translated to {}\n".format(testsent , inp))

class LSTMTagger(nn.Module):

  def __init__(self,embedding_dim,hidden_dim,vocab_size,tagset_size):

    super(LSTMTagger , self).__init__()

    self.hidden_dim = hidden_dim

    self.word_embedding = nn.Embedding(vocab_size , embedding_dim= embedding_dim)

    self.lstm = nn.LSTM(input_size= embedding_dim , hidden_size = hidden_dim)

    self.hidden2tag = nn.Linear(hidden_dim , tagset_size)

    self.hidden = self.init_hidden()

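  def init_hidden(self):

    # Initial (h0 , c0) , each of shape (n_layers , batch , hidden_dim) = (1 , 1 , hidden_dim)
    return (torch.randn(1 , 1 , self.hidden_dim),
           torch.randn(1 , 1 , self.hidden_dim))

  def forward(self , sentence):

    # Look up an embedding vector for every word index : (seq_len , embedding_dim)
    embeds = self.word_embedding(sentence)

    # The LSTM expects (seq_len , batch , input_size) ; the batch size here is 1
    lstm_out , hidden_out = self.lstm(embeds.view(len(sentence) , 1 , -1) , self.hidden)

    # Project each hidden state to tag scores , then normalise to log-probabilities
    tag_outputs = self.hidden2tag(lstm_out.view(len(sentence) , -1))
    tag_scores = F.log_softmax(tag_outputs , dim = 1)

    return tag_scores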

EMBEDDING_DIM = 6
HIDDEN_DIM = 6
model = LSTMTagger(EMBEDDING_DIM , HIDDEN_DIM , len(word2idx) , len(tag2idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters() , lr = 0.1)

n_epochs = 300

for epoch in range(n_epochs):

  epoch_loss = 0.0

  for sent , tags in data:

    model.zero_grad()

    input_sent = prepare_sequence(sent , word2idx)
    tag = prepare_sequence(tags , tag2idx)

    model.hidden = model.init_hidden()

    output = model(input_sent)

    loss = loss_function(output , tag)

    epoch_loss += loss.item()

    loss.backward()

    optimizer.step()

  if epoch % 20 == 19:
    print("Epoch : {} , loss : {}".format(epoch , epoch_loss / len(data)))

testsent = "You ".lower().split()
inp = prepare_sequence(testsent , word2idx)

print("Input sent : {}".format(testsent))

# Inference only , so gradients are not needed
with torch.no_grad():
  tags = model(inp)

# Index of the highest-scoring tag for each word
_,pred_tags = torch.max(tags , 1)
print("Pred tag : {}".format(pred_tags))
pred = pred_tags.tolist()

for i in range(len(testsent)):
  print("Word : {} , Predicted tag : {}".format(testsent[i] , tag2rev[pred[i]]))
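One limitation of the lookup used above is that a word outside the training vocabulary raises a KeyError in prepare_sequence. A common workaround, sketched below, is to reserve an index for unknown words. The helpers build_vocab and encode and the token name "<unk>" are assumptions made for this example and are not part of the original code.

```python
def build_vocab(sentences, unk_token="<unk>"):
    """Build a word-to-index map with index 0 reserved for unknown words."""
    vocab = {unk_token: 0}
    for sent in sentences:
        for word in sent:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, unk_token="<unk>"):
    """Convert a sentence to indices, falling back to the unknown index."""
    return [vocab.get(word, vocab[unk_token])
            for word in sentence.lower().split()]

sentences = [["hey", "prick"], ["what", "a", "day"]]
vocab = build_vocab(sentences)

print(encode("Hey stranger", vocab))  # [1, 0]
```

If this approach were used with the tagger, the embedding layer would need len(vocab) entries including the unknown token, and the model would only handle unknown words sensibly if some training examples contained that token.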

For better-documented code, check this GitHub repository, which contains detailed instructions.

Link : https://github.com/srimanthtenneti/Cuss-Word-Detector---LSTM

Conclusion

This is how we use LSTMs to make a word detector.

Contact

Feel free to connect.

Link : https://www.linkedin.com/in/srimanth-tenneti-662b7117b/