Text Summarization Using Keras Models

Written by Packt_Pub | Published 2018/12/19


Learn how to summarize text in this article by Rajdeep Dua, who currently leads the developer relations team at Salesforce India, and Manpreet Singh Ghotra, who is currently working at Salesforce developing a machine learning platform and APIs.

Text summarization is a method in natural language processing (NLP) for generating a short and precise summary of a reference document. Producing a summary of a large document manually is a very difficult task, and summarization of text using machine learning techniques is still an active research topic. Before discussing text summarization and how it is done, here is a definition of a summary.

A summary is text generated from one or more source documents that conveys the relevant information of the original in a shorter form. The goal of automatic text summarization is to transform the source text into a shorter version while preserving its meaning.

Lately, various approaches have been developed for automated text summarization using NLP techniques, and they have been implemented widely in various domains. Some examples include search engines creating summaries for use in previews of documents and news websites producing consolidated descriptions of news topics, usually as headlines, to help users browse.

To summarize text effectively, deep learning models need to be able to understand documents and discern and distill the important information. This is a challenging and complex task, particularly as the length of the document increases.

Text summarization for reviews

This article will show you how to work on the problem of text summarization to create relevant summaries for product reviews about fine foods sold on the world’s largest e-commerce platform, Amazon. Reviews include product and user information, ratings, and a plain-text review; the dataset also includes reviews from all other Amazon categories. You will develop a basic character-level sequence-to-sequence (seq2seq) model by defining an encoder-decoder recurrent neural network (RNN) architecture.

The dataset used in this article can be found at https://www.kaggle.com/snap/amazon-fine-food-reviews/. Your dataset will include the following:

· 568,454 reviews

· 256,059 users

· 74,258 products

How to do it…

You’ll develop a modeling pipeline and an encoder-decoder architecture that try to create relevant summaries for a given set of reviews. The pipeline uses RNN models written with the Keras functional API, along with various data manipulation libraries.

The encoder-decoder architecture is a way of building RNNs for sequence prediction. It involves two major components: an encoder and a decoder. The encoder reads the complete input sequence and encodes it into an internal representation, usually a fixed-length vector referred to as the context vector. The decoder, in turn, reads the encoded representation produced by the encoder and generates the output sequence. Various types of encoders can be used; most commonly, RNNs such as LSTMs, often bidirectional, are used.

Data processing

It is crucial that you serve the right data as input to the neural architecture for training and validation. Make sure that data is on a useful scale and format, and that meaningful features are included. This will lead to better and more consistent results.

Employ the following workflow for data preprocessing:

1. Load the dataset using pandas

2. Split the dataset into input and output variables for machine learning

3. Apply a preprocessing transform to the input variables

4. Summarize the data to show the change

Now get started step by step:

1. Get started by importing the important packages and your dataset. Use the pandas library to load the data and review the shape of your dataset—it includes 10 features and 568,454 data points:

import pandas as pd

import re

from nltk.corpus import stopwords

from pickle import dump, load

reviews = pd.read_csv("/deeplearning-keras/ch09/summarization/Reviews.csv")

print(reviews.shape)

print(reviews.head())

print(reviews.isnull().sum())

The output will be as follows:

(568454, 10)

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0

2. Remove null values and unneeded features, as shown in the following snippet:

reviews = reviews.dropna()

reviews = reviews.drop(['Id', 'ProductId', 'UserId', 'ProfileName',
                        'HelpfulnessNumerator', 'HelpfulnessDenominator',
                        'Score', 'Time'], axis=1)

reviews = reviews.reset_index(drop=True)
print(reviews.head())

# Inspect the first few summary/text pairs
for i in range(5):
    print("Review #", i + 1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()

The output will be as follows:

Summary Text

0 Good Quality Dog Food I have bought several of the Vitality canned d...

1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...

2 "Delight," says it all This is a confection that has been around a fe...

3 Cough Medicine If you are looking for the secret ingredient i...

Review # 1

Not as Advertised - Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".

Review # 2

"Delight" says it all - This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case, Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.

Review # 3

Cough Medicine - If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.

By definition, a contraction is the combination of two words into a reduced form, with the omission of some internal letters and the use of an apostrophe. You can get the list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python.

3. Replace contractions with their longer forms, as shown here:

contractions = {

"ain't": "am not",

"aren't": "are not",

"can't": "cannot",

"can't've": "cannot have",

"'cause": "because",

"could've": "could have",

"couldn't": "could not",

"couldn't've": "could not have",

"didn't": "did not",

"doesn't": "does not",

"don't": "do not",

"hadn't": "had not",

"hadn't've": "had not have",

"hasn't": "has not",

"haven't": "have not",

"he'd": "he would",

"he'd've": "he would have",

4. Clean the text documents by replacing contractions and removing stop words:

def clean_text(text, remove_stopwords=True):

    # Convert words to lower case
    text = text.lower()

    # Replace contractions with their longer forms
    text = text.split()
    new_text = []
    for word in text:
        if word in contractions:
            new_text.append(contractions[word])
        else:
            new_text.append(word)
    text = " ".join(new_text)

    # Remove unwanted characters and HTML remnants
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)

    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)

    return text

5. Remove unwanted characters and, optionally, stop words, and make sure to replace the contractions, as shown previously. You can get the list of stop words from the Natural Language Toolkit (NLTK), which also helps with splitting paragraphs into sentences, tokenizing words, and recognizing parts of speech. Import the toolkit and download the stop-word list using the following commands; a short example of the filtering follows the download:

import nltk

nltk.download('stopwords')
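To get a feel for what the stop-word list contains and how the filtering behaves, here is a minimal sketch (not part of the original pipeline; the sample sentence and printed results are purely illustrative):

from nltk.corpus import stopwords

# Build the English stop-word set once; membership checks are then O(1)
stops = set(stopwords.words("english"))
print(len(stops))  # roughly 180 words such as "the", "is", and "that"

sample = "this is the worst cheese that i have ever bought"
filtered = " ".join(w for w in sample.split() if w not in stops)
print(filtered)  # -> "worst cheese ever bought"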

6. Clean the summaries as shown in the following snippet:

# Clean the summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")

7. Finally, save all the reviews into a pickle file. pickle serializes objects so they can be saved to a file and loaded in a program again later on:

stories = list()
for i, text in enumerate(clean_texts):
    stories.append({'story': text, 'highlights': clean_summaries[i]})

# save to file
dump(stories, open('/deeplearning-keras/ch09/summarization/review_dataset.pkl', 'wb'))

Encoder-decoder architecture

Develop a basic character-level seq2seq model for text summarization. Word-level models are more common in the text processing domain, but this article uses a character-level model. As mentioned earlier, the encoder-decoder architecture is a way of creating RNNs for sequence prediction: the encoder reads the entire input sequence and encodes it into an internal representation, usually a fixed-length vector named the context vector, and the decoder reads that encoded representation and produces the output sequence.

The encoder-decoder architecture consists of two primary models: one reads the input sequence and encodes it to a fixed-length vector, and the second decodes the fixed-length vector and outputs the predicted sequence. This architecture is designed for seq2seq problems.

1. Firstly, define the hyperparameters, such as the batch size, the number of training epochs, the latent dimensionality of the encoding space, and the number of samples to train on:

batch_size = 64

epochs = 110

latent_dim = 256

num_samples = 10000

2. Next, load the review dataset from the pickle file:

stories = load(open('/deeplearning-keras/ch09/summarization/review_dataset.pkl', 'rb'))

print('Loaded Stories %d' % len(stories))

print(type(stories))

The output will be as follows:

Loaded Stories 568411
<class 'list'>

3. Then, vectorize the data:

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

for story in stories:
    input_text = story['story']
    # Each story carries a single summary string under 'highlights'
    target_text = story['highlights']
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))

print('Number of unique input tokens:', num_encoder_tokens)

print('Number of unique output tokens:', num_decoder_tokens)

print('Max sequence length for inputs:', max_encoder_seq_length)

print('Max sequence length for outputs:', max_decoder_seq_length)

The output will be as follows:

Number of samples: 568411

Number of unique input tokens: 84

Number of unique output tokens: 48

Max sequence length for inputs: 15074

Max sequence length for outputs: 5
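The training step later fits the model on encoder_input_data, decoder_input_data, and decoder_target_data, which are not built in the snippets above. Here is a minimal sketch of how these one-hot tensors can be constructed, following the standard Keras character-level seq2seq recipe; the variable names are assumptions chosen to match the training call, and in practice you would cap the number of reviews (for example, using num_samples), because the full tensors are extremely large:

import numpy as np

# Map every character to an integer index
input_token_index = dict((char, i) for i, char in enumerate(input_characters))
target_token_index = dict((char, i) for i, char in enumerate(target_characters))

# Allocate the one-hot tensors: (samples, timesteps, characters)
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.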

4. Now, create a generic function to define an encoder-decoder RNN:

from keras.layers import Input, LSTM, Dense
from keras.models import Model

def define_models(n_input, n_output, n_units):
    # define training encoder
    encoder_inputs = Input(shape=(None, n_input))
    encoder = LSTM(n_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]

    # define training decoder
    decoder_inputs = Input(shape=(None, n_output))
    decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(n_output, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    # define inference encoder
    encoder_model = Model(encoder_inputs, encoder_states)

    # define inference decoder
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

    # return all models
    return model, encoder_model, decoder_model
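The training step below assumes a compiled training model named model already exists. One way to create it (a sketch; the exact call is not shown in the article) is to instantiate the three models from the generic function above, using the token counts and latent_dim computed earlier:

# Build the training model plus the separate inference encoder and decoder
model, encoder_model, decoder_model = define_models(
    num_encoder_tokens, num_decoder_tokens, latent_dim)
model.summary()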

Training

1. For training, use the rmsprop optimizer and categorical_crossentropy as the loss function:

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)

# Save model
model.save('/deeplearning-keras/ch09/summarization/model2.h5')

The output will be as follows:

64/800 [=>............................] - ETA: 22:05 - loss: 2.1460

128/800 [===>..........................] - ETA: 18:51 - loss: 2.1234

192/800 [======>.......................] - ETA: 16:36 - loss: 2.0878

256/800 [========>.....................] - ETA: 14:38 - loss: 2.1215

320/800 [===========>..................] - ETA: 12:47 - loss: 1.9832

384/800 [=============>................] - ETA: 11:01 - loss: 1.8665

448/800 [===============>..............] - ETA: 9:17 - loss: 1.7547

512/800 [==================>...........] - ETA: 7:35 - loss: 1.6619

576/800 [====================>.........] - ETA: 5:53 - loss: 1.5820

512/800 [==================>...........] - ETA: 7:19 - loss: 0.7519

576/800 [====================>.........] - ETA: 5:42 - loss: 0.7493

640/800 [=======================>......] - ETA: 4:06 - loss: 0.7528

704/800 [=========================>....] - ETA: 2:28 - loss: 0.7553

768/800 [===========================>..] - ETA: 50s - loss: 0.7554

2. For inference, use the following method:

from numpy import array

# generate target given source sequence
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
    # encode the source sequence into its state vectors
    state = infenc.predict(source)
    # start-of-sequence input (all zeros)
    target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
    # collect predictions
    output = list()
    for t in range(n_steps):
        # predict the next char
        yhat, h, c = infdec.predict([target_seq] + state)
        # store the prediction
        output.append(yhat[0, 0, :])
        # update state
        state = [h, c]
        # update the target sequence
        target_seq = yhat
    return array(output)

The output will be as follows:

Review(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!

Summary(1): great coffee

Review(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won't either!

Summary(2): omg gross gross

Review(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon to know quaker flavor packets

Summary(3): love it
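The article does not show how predict_sequence is wired back to readable text. Here is a minimal sketch, assuming the one-hot encoding and the encoder_model/decoder_model instantiation sketched earlier; reverse_target_char_index is a hypothetical helper name introduced only for this illustration:

import numpy as np

# Map character indices back to characters so predictions can be rendered as text
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

# Take one encoded review: shape (1, max_encoder_seq_length, num_encoder_tokens)
source = encoder_input_data[0:1]

# Predict up to max_decoder_seq_length characters for the summary
prediction = predict_sequence(encoder_model, decoder_model, source,
                              max_decoder_seq_length, num_decoder_tokens)

# Convert each one-hot prediction back to its most likely character
decoded = ''.join(reverse_target_char_index[int(np.argmax(vec))] for vec in prediction)
print('Summary:', decoded.strip())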

If you found this article interesting, you can explore Keras Deep Learning Cookbook to leverage the power of deep learning and Keras to develop smarter and more efficient data models. Keras Deep Learning Cookbook shows you how to tackle different problems encountered while training efficient deep learning models, with the help of the popular Keras library.

