Since its introduction in 2017 in the popular paper Attention Is All You Need (https://arxiv.org/abs/1706.03762), the Transformer quickly became the most popular model in NLP. The ability to process text non-sequentially (as opposed to RNNs) made training large models feasible. The attention mechanism it introduced proved extremely useful in generalizing text.
Following the paper, several popular transformers surfaced, the most popular of which is GPT. GPT models are developed and trained by OpenAI, one of the leaders in AI research. The latest release of GPT is GPT-3, which has 175 billion parameters. The model is so advanced that OpenAI chose not to open-source it; people can access it through an API after a signup process and a long queue.
However, GPT-2, its previous release, is open-source and available on many deep learning frameworks.
In this exercise, we use Hugging Face Transformers and PyTorch to fine-tune a GPT-2 model for review summarization.
Overview:
- Imports and Data Loading
- Data Preprocessing
- Setup and Training
- Summary Writing
!pip install transformers
import re
import random
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch.optim as optim
We set the device to enable GPU processing.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device
from google.colab import drive
drive.mount("/content/drive")
The data we will use for training summarization is the Amazon review dataset, which can be found at https://www.kaggle.com/currie32/summarizing-text-with-amazon-reviews.
When writing a review on Amazon, customers provide both the review text and a title for it. The dataset treats the title as the summary of the review.
reviews_path = "/content/drive/My Drive/Colab Notebooks/reviews.txt"
We use the standard Python way of opening text files:
with open(reviews_path, "r") as reviews_raw:
    reviews = reviews_raw.readlines()
Showing 5 instances:
reviews[:5]
As shown, each sample consists of the review followed by its summary, separated by the equals (=) sign.
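For illustration, a raw line therefore follows this shape (the review text below is invented, not taken from the dataset):
I bought this for my dog and he absolutely loves it. Will buy again. = Great dog food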
len(reviews)
There are ~71K instances in the dataset, which is sufficient to train a GPT-2 model.
The beauty of GPT-2 is its ability to multi-task: the same model can be trained on more than one task at a time. However, we should adhere to the correct task designators, as specified by the original paper.
For summarization, the appropriate task designator is the TL;DR symbol, which stands for "too long; didn't read".
The "TL;DR" token should be between the input text and the summary.
Thus, we will replace the equals symbol in the data with the correct task designator:
reviews = [review.replace(" = ", " TL;DR ") for review in reviews]
reviews[10]
So far, so good.
The final preprocessing step is fixing the input length. We use the average review length (in words) as an estimator:
avg_length = sum([len(review.split()) for review in reviews])/len(reviews)
avg_length
Since the average instance length in words is 53.3, we can assume that a max length of 100 will cover most of the instances.
max_length = 100
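To verify this assumption, an optional quick check computes the fraction of reviews that fit within max_length words:
coverage = sum(len(review.split()) <= max_length for review in reviews) / len(reviews)
coverage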
Before creating the Dataset object, we download the model and the tokenizer. We need the tokenizer to convert the text into token ids.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
If you have fine-tuned weights saved from a previous session, you can load them here; skip this step when training from scratch:
model_pth = "/content/drive/My Drive/Colab Notebooks/gpt2_weights_reviews"
model.load_state_dict(torch.load(model_pth))
We send the model to the device and initialize the optimizer:
model = model.to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4)
To correctly pad and truncate the instances, we find the number of tokens used by the designator " TL;DR ":
tokenizer.encode(" TL;DR ")
extra_length = len(tokenizer.encode(" TL;DR "))
We create a simple dataset that extends the PyTorch Dataset class:
class ReviewDataset(Dataset):
    def __init__(self, tokenizer, reviews, max_len):
        self.max_len = max_len
        self.tokenizer = tokenizer
        self.eos = self.tokenizer.eos_token
        self.eos_id = self.tokenizer.eos_token_id
        self.reviews = reviews
        self.result = []

        for review in self.reviews:
            # Encode the text using tokenizer.encode(), adding EOS at the end
            tokenized = self.tokenizer.encode(review + self.eos)
            # Pad/truncate the encoded sequence to max_len
            padded = self.pad_truncate(tokenized)
            # Create a tensor and add it to the result
            self.result.append(torch.tensor(padded))

    def __len__(self):
        return len(self.result)

    def __getitem__(self, item):
        return self.result[item]

    def pad_truncate(self, encoded):
        # Length of the review text, excluding the task-designator tokens
        text_length = len(encoded) - extra_length
        if text_length < self.max_len:
            difference = self.max_len - text_length
            result = encoded + [self.eos_id] * difference
        elif text_length > self.max_len:
            # Keep max_len + extra_length - 1 tokens and close with EOS,
            # so that every instance ends up with the same total length
            result = encoded[:self.max_len + extra_length - 1] + [self.eos_id]
        else:
            result = encoded
        return result
Then, we create the dataset:
dataset = ReviewDataset(tokenizer, reviews, max_length)
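As an optional sanity check, every instance should now have the same total length (max_length plus the designator tokens), and decoding one should reproduce the padded text:
sample = dataset[0]
len(sample)  # max_length + extra_length
tokenizer.decode(sample)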
Using a batch_size of 32, we create the dataloader (since the reviews are long, increasing the batch size can result in out-of-memory errors):
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
GPT-2 is capable of several tasks, including summarization, generation, and translation. To train for summarization, we use the same sequence as both input and labels (the model shifts the labels internally to compute the next-token prediction loss):
def train(model, optimizer, dl, epochs):
    for epoch in range(epochs):
        for idx, batch in enumerate(dl):
            with torch.set_grad_enabled(True):
                optimizer.zero_grad()
                batch = batch.to(device)
                output = model(batch, labels=batch)
                loss = output[0]
                loss.backward()
                optimizer.step()
                if idx % 50 == 0:
                    print("loss: %f, %d" % (loss.item(), idx))
train(model=model, optimizer=optimizer, dl=dataloader, epochs=1)
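After (or periodically during) training, the fine-tuned weights can be saved back to the Drive path defined above, which is also where the earlier load_state_dict call reads them from:
torch.save(model.state_dict(), model_pth)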
The online server I used was going to go offline, so I had to stop training a few batches early. The resulting KeyboardInterrupt should not be an issue, since the model's weights were saved.
The loss decreased consistently, which means that the model was learning.
Review Summarization
The summarization methodology is as follows:
1. A review is initially fed to the model.
2. A choice from the top-k choices is selected.
3. The choice is added to the summary and the current sequence is fed to the model.
4. Repeat steps 2 and 3 until either max_len is achieved or the EOS token is generated.
def topk(probs, n=9):
    # The scores are initially softmaxed to convert to probabilities
    probs = torch.softmax(probs, dim=-1)
    # PyTorch has its own topk method, which we use here
    tokensProb, topIx = torch.topk(probs, k=n)
    # The new selection pool (9 choices) is normalized
    tokensProb = tokensProb / torch.sum(tokensProb)
    # Send to CPU for numpy handling
    tokensProb = tokensProb.cpu().detach().numpy()
    # Make a random choice from the pool based on the new prob distribution
    choice = np.random.choice(n, 1, p=tokensProb)
    tokenId = topIx[choice][0]
    return int(tokenId)
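To see the sampling in isolation (purely illustrative), topk can be fed a random logits vector over the vocabulary; it returns the id of one of the 9 highest-scoring entries:
topk(torch.randn(len(tokenizer)))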
def model_infer(model, tokenizer, review, max_length=15):
    # Encode the initial prompt (review text plus task designator)
    review_encoded = tokenizer.encode(review)
    result = review_encoded
    initial_input = torch.tensor(review_encoded).unsqueeze(0).to(device)

    with torch.set_grad_enabled(False):
        # Feed the initial prompt to the model
        output = model(initial_input)

        # Flatten the logits at the final time step
        logits = output.logits[0, -1]

        # Make a top-k choice and append it to the result
        result.append(topk(logits))

        # For up to max_length steps:
        for _ in range(max_length):
            # Feed the current sequence to the model and make a choice
            input_tensor = torch.tensor(result).unsqueeze(0).to(device)
            output = model(input_tensor)
            logits = output.logits[0, -1]
            res_id = topk(logits)

            # If the chosen token is EOS, return the result
            if res_id == tokenizer.eos_token_id:
                return tokenizer.decode(result)
            else:  # Append the token to the sequence
                result.append(res_id)

    # If no EOS is generated, return after max_length steps
    return tokenizer.decode(result)
Generating 3 unique summaries for each of 5 sample reviews:
sample_reviews = [review.split(" TL;DR ")[0] for review in random.sample(reviews, 5)]
sample_reviews
for review in sample_reviews:
    summaries = set()
    print(review)
    while len(summaries) < 3:
        summary = model_infer(model, tokenizer, review + " TL;DR ").split(" TL;DR ")[1].strip()
        if summary not in summaries:
            summaries.add(summary)
    print("Summaries: " + str(summaries) + "\n")
The summaries reflect the content of the review. Feel free to try other reviews to test the capabilities of the model.
In this tutorial, we learned how to fine-tune the Hugging Face GPT-2 model to perform Amazon review summarization. The same methodology can be applied to any language model available on https://huggingface.co/models.