Tweet Sentiment Analysis Using LSTM With PyTorch

We will go through a common case study (sentiment analysis) to explore many techniques and patterns in Natural Language Processing.


  • Imports and Data Loading
  • Data Preprocessing
    • Null Value Removal
    • Class Imbalance
  • Tokenization
  • Embeddings
  • LSTM Model Building
  • Setup and Training
  • Evaluation

Imports and Data Loading

In [81]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

import numpy as np
import pandas as pd

import re

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import nltk
from nltk.tokenize import word_tokenize

import matplotlib.pyplot as plt
In [4]:
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/

This dataset can be found on GitHub in this repo:

It is a sentiment analysis dataset comprised of 2 files:

  • train.csv, 5971 tweets
  • test.csv, 4000 tweets

The tweets are labeled as:

  • Positive
  • Neutral
  • Negative

Other datasets have different or more labels, but the same concepts apply to preprocessing and training. Download the files and store them locally.

In [7]:
train_path = "train.csv"
test_path = "test.csv"

Before working with PyTorch, make sure to set the device. This line of code selects a GPU if available.

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Since the data is stored in csv files, we can use the pandas function .read_csv() to parse both train and test files:

In [9]:
train_df = pd.read_csv(train_path)
In [10]:
test_df = pd.read_csv(test_path)

Data Preprocessing

Null Value Removal

After parsing the files, it is important to analyze the text to understand the preprocessing steps you will take.

In [11]:
train_df
Id Category Tweet
0 635769805279248384 negative Not Available
1 635930169241374720 neutral IOS 9 App Transport Security. Mm need to check...
2 635950258682523648 neutral Mar if you have an iOS device, you should down...
3 636030803433009153 negative @jimmie_vanagon my phone does not run on lates...
4 636100906224848896 positive Not sure how to start your publication on iOS?...
... ... ... ...
5965 639016598477651968 neutral @YouAreMyArsenal Wouldn't surprise me if we en...
5966 640276909633486849 neutral Rib injury for Zlatan against Russia is a big ...
5967 640296841725235200 neutral Noooooo! I was hoping to see Zlatan being Zlat...
5968 641017384908779520 neutral Not Available
5969 641395811474128896 neutral Not Available

5970 rows × 3 columns

Preprocessing is about removing inconsistent, useless, or noisy information from the data. So, we first look for things to remove.

  • We can see a few tweets that are "Not Available", and they will not help train our model.
  • Also, the column "Id" is not useful in machine learning, since the ID of a tweet does not affect its sentiment.
  • We may not see any in the sample displayed, but there may be null values (NaN) in the columns. Pandas has a function .dropna() that drops null values.
In [12]:
train_df = train_df.drop(columns=["Id"])
train_df = train_df.dropna()
train_df = train_df[train_df['Tweet'] != "Not Available"]
train_df
Category Tweet
1 neutral IOS 9 App Transport Security. Mm need to check...
2 neutral Mar if you have an iOS device, you should down...
3 negative @jimmie_vanagon my phone does not run on lates...
4 positive Not sure how to start your publication on iOS?...
5 neutral Two Dollar Tuesday is here with Forklift 2, Qu...
... ... ...
5963 positive Ok ed let's do this, Zlatan, greizmann and Lap...
5964 neutral Goal level: Zlatan 90k by Friday? = Posting e...
5965 neutral @YouAreMyArsenal Wouldn't surprise me if we en...
5966 neutral Rib injury for Zlatan against Russia is a big ...
5967 neutral Noooooo! I was hoping to see Zlatan being Zlat...

5422 rows × 2 columns

So far so good, let us take a look at the test set:

In [13]:
test_df
Id Category
0 6.289494e+17 dear @Microsoft the newOoffice for Mac is grea...
1 6.289766e+17 @Microsoft how about you make a system that do...
2 6.290232e+17 Not Available
3 6.291792e+17 Not Available
4 6.291863e+17 If I make a game as a #windows10 Universal App...
... ... ...
9963 NaN NaN
9964 NaN NaN
9965 NaN NaN
9966 NaN NaN
9967 NaN NaN

9968 rows × 2 columns

It turns out that the test set unfortunately has no sentiment labels, so it will not be very useful for us. However, we can still preprocess it for practice:

  • The tweets column is wrongly named "Category", so we rename it:
In [14]:
test_df = test_df.rename(columns={"Category": "Tweet"})

Then, we apply the same steps as we did on the train set.

In [15]:
test_df = test_df.drop(columns=["Id"])
test_df = test_df.dropna()
test_df = test_df[test_df['Tweet'] != "Not Available"]
test_df
0 dear @Microsoft the newOoffice for Mac is grea...
1 @Microsoft how about you make a system that do...
4 If I make a game as a #windows10 Universal App...
5 Microsoft, I may not prefer your gaming branch...
6 @MikeWolf1980 @Microsoft I will be downgrading...
... ...
3994 Anybody with a Steak & Shake or IHOP move ...
3995 I am assembling an epic Pancake Posse for an I...
3996 do you work at Ihop tomorrow @carlysunshine_
3997 23 Aug 00;30 #771NAS Rescue193 returned from T...
3999 IOS 9 App Transport Security. Mm need to check...

3640 rows × 1 columns

Class Imbalance

Next, since this is a classification task, we must make sure that the classes are balanced in terms of number of instances. Otherwise, any model we train will be biased toward the majority classes and less accurate on the minority ones.

First, we find the counts of each class:

In [16]:
train_df["Category"].value_counts()
positive    2599
neutral     1953
negative     869
Tweet          1
Name: Category, dtype: int64

Supervised datasets typically have balanced classes. However, as seen in this dataset, there are far more positive and neutral tweets than negative tweets. There are several ways to fix the imbalance problem:

  • Oversampling
  • Undersampling
  • Hybrid approaches
  • Augmentation


Oversampling

In oversampling, you duplicate some tweets in the minority classes until every class has a similar number of tweets. For example, we would duplicate the negative set ~3 times to reach about 2600 negative tweets, and do the same for the neutral tweets. By doing so, all classes end up with about 2600 tweets.
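As a rough illustration of random oversampling, the sketch below duplicates randomly chosen examples of each minority class until every class matches the largest one. The helper name `oversample` and the toy data are hypothetical; the notebook itself does not balance the data this way.

```python
import random
from collections import Counter, defaultdict

def oversample(rows, seed=0):
    """Duplicate random minority-class rows until all classes match the largest class.

    `rows` is a list of (label, text) pairs. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Sample with replacement to fill the gap up to the target size
        extra = rng.choices(group, k=target - len(group))
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced

# Toy data: 4 positive, 2 neutral, 1 negative
toy = (
    [("positive", f"p{i}") for i in range(4)]
    + [("neutral", f"n{i}") for i in range(2)]
    + [("negative", "x0")]
)
balanced = oversample(toy)
counts = Counter(label for label, _ in balanced)
print(counts)
```

Every original row is kept, so the only change is that minority rows appear more than once. Duplicates should be created after the train/test split, otherwise copies of the same tweet can land on both sides.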


Undersampling

In undersampling, instead of increasing the number of tweets in the minority classes, you decrease the number of tweets in the majority classes. You do so by randomly deleting tweets from the majority classes until all classes have 869 tweets.

Hybrid Approaches

Both oversampling and undersampling can be a bit extreme. One can do a mixture of both by determining a final number of tweets that is between the minimum and the maximum. For instance, we can select 2000 as the final tweet count. Then, we delete ~600 positive tweets, keep neutral tweets the same, and duplicate the negative tweets by a factor of ~2.3. This way we end up with ~2000 tweets in each class.
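The hybrid idea can be sketched with a single helper that shrinks large classes and grows small ones toward a chosen target. The function name `resample_to` and the placeholder strings standing in for tweets are assumptions made for this example; only the class sizes come from the counts shown earlier.

```python
import random

def resample_to(group, target, rng):
    """Return exactly `target` items from `group`.

    Larger groups are randomly undersampled without replacement; smaller groups
    keep every original item and are topped up with random duplicates.
    """
    if len(group) >= target:
        return rng.sample(group, target)
    return group + rng.choices(group, k=target - len(group))

rng = random.Random(0)

# Placeholder items sized like the class counts in the train set
positive = [f"pos_{i}" for i in range(2599)]
neutral = [f"neu_{i}" for i in range(1953)]
negative = [f"neg_{i}" for i in range(869)]

target = 2000
resampled = {
    "positive": resample_to(positive, target, rng),
    "neutral": resample_to(neutral, target, rng),
    "negative": resample_to(negative, target, rng),
}
sizes = {label: len(items) for label, items in resampled.items()}
print(sizes)
```

With a target of exactly 2000, the neutral class (1953 tweets) gains 47 duplicates rather than staying untouched; leaving it as-is, as described above, is an equally reasonable choice.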


Augmentation

Augmentation is more complex than the other approaches. In augmentation, you use the existing negative tweets to create new negative tweets. By doing so, you can increase the number of negative and neutral tweets until every class has about 2600 tweets.
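The text does not name a specific augmentation method, so the following is only one possible sketch: it creates variants of a tweet by swapping two random tokens and randomly dropping a few. The function names and the probability `p` are hypothetical choices for illustration.

```python
import random

def random_swap(tokens, rng):
    """Swap two randomly chosen positions; returns a new list."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]

def augment(text, n_copies=2, seed=0):
    """Create n_copies perturbed variants of a whitespace-tokenized text."""
    rng = random.Random(seed)
    tokens = text.split()
    variants = []
    for _ in range(n_copies):
        new_tokens = random_delete(random_swap(tokens, rng), rng)
        variants.append(" ".join(new_tokens))
    return variants

original = "my phone does not run on the latest ios version"
variants = augment(original, n_copies=3)
for v in variants:
    print(v)
```

Perturbations like these can change the meaning of a tweet (for example by dropping "not"), so augmented samples are worth inspecting before they are added to the training set.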

Augmentation is a relatively new concept, but you can find more about it in the papers listed here:

For our purpose, we undersample the positive and neutral classes until we have 869 tweets in each class. We are doing the undersampling manually in this exercise, but there is a Python library called imblearn that can perform under/oversampling.

In [17]:
remove_pos = 2599 - 869
remove_neut = 1953 - 869

neg_df = train_df[train_df["Category"] == "negative"] 

pos_df = train_df[train_df["Category"] == "positive"]
neut_df = train_df[train_df["Category"] == "neutral"]

pos_drop_indices = np.random.choice(pos_df.index, remove_pos, replace=False)
neut_drop_indices = np.random.choice(neut_df.index, remove_neut, replace=False)

pos_undersampled = pos_df.drop(pos_drop_indices)
neut_undersampled = neut_df.drop(neut_drop_indices)
In [18]:
pos_undersampled
Category Tweet
10 positive Parkrun app for iOS downloaded Where have you ...
16 positive Five Great Free Apps and Games for iOS - Augus...
18 positive See news through the eyes of real people &...
19 positive Siri knows all about #Apple's iOS event on the...
22 positive @Yurt try beat mp3 it may be on android i have...
... ... ...
5924 positive Zlatan Ibrahimovich. @zlatan_ibra9 Gracious Le...
5932 positive Scenes when Benzema walks out of tunnel tomorr...
5939 positive 7 more days till we start the campaign that wi...
5940 positive The VP of France's refereeing union Laurent Ug...
5947 positive @DaveEllis11 @klavierstuk but if Zlatan is ava...

869 rows × 2 columns

After undersampling both neutral and positive classes, we join them all together again:

In [19]:
balanced_train_df = pd.concat([neg_df, pos_undersampled, neut_undersampled])
In [20]:
balanced_train_df["Category"].value_counts()
neutral     869
negative    869
positive    869
Name: Category, dtype: int64

As shown, the value counts have been adjusted.

Moving forward, since we do not have a labeled test set, we split the train set into train and test sets with a ratio of 85:15.

In [21]:
train_clean_df, test_clean_df = train_test_split(balanced_train_df, test_size=0.15)
In [22]:
train_clean_df
Category Tweet
2818 positive Early release bc Obama will be at the College ...
1505 neutral April 17, 1986 Madonna at the At Close Range p...
620 negative "Joe Biden may join Bernie Sanders in the Demo...
3367 positive @LaurenceWHolmes What do you mean, Laurence? T...
19 positive Siri knows all about #Apple's iOS event on the...
... ... ...
1738 positive Warm up those vocals, Castro! @KAMELLE is lead...
990 positive Best Jerseys this season (not in order, can't ...
4391 neutral "I've never been shy or secretive about the fa...
4753 neutral Not for nothing is their motto TGIF - 'Thank G...
1838 positive MAGICAL MARCH - With 48 goals in 42 official m...

2215 rows × 2 columns

In [23]:
test_clean_df
Category Tweet
705 positive Khakis and Jurassic Park shirt for tomorrow. ...
1482 neutral May our old mini van and Lexus rest in peace. ...
5307 negative There's a simple solution, just deport all the...
3377 negative Rick Perry was going to go on Are You Smarter ...
3932 positive Snoop Dogg was one of the stars to support Ma...
... ... ...
4972 neutral Tristram 'more Tory than the Tories' Hunt seem...
2859 negative Mark Levin Market Crash: It's Not China-It's B...
3536 negative Someone may want to let Sarah Palin know that ...
2367 negative The LAST thing we need is more corn and more M...
5099 neutral Hahaha dead. Trump talks about the real issues...

392 rows × 2 columns

Since the data is small, we can convert it into Python lists for further manipulation. If the data is large, it's preferable to keep using pandas until you create the batch iterator (DataLoader in PyTorch).

In [24]:
train_set = list(train_clean_df.to_records(index=False))
test_set = list(test_clean_df.to_records(index=False))
In [25]:
train_set[:10]
[('positive', 'Early release bc Obama will be at the College across the street from my high school tomorrow. Nice.'),
 ('neutral', 'April 17, 1986 Madonna at the At Close Range premiere'),
 ('negative', '"Joe Biden may join Bernie Sanders in the Democrat primary... I thought the Democrats were opposed to fossil fools!" ~ Emily Zanotti,'),
 ('positive', '@LaurenceWHolmes What do you mean, Laurence? The Dudleys, Ric Flair, and Sting were on Raw Monday. Taker wrestled Sunday. It IS the 90s.'),
 ('positive', "Siri knows all about #Apple's iOS event on the 9th. #GiveUsAHint"),
 ('negative', ".@SenTedCruz @realDonaldTrump @SenTomCotton   We don't want Obama dumping them in the USA!"),
 ('neutral', 'YouTube Gaming Launches Tomorrow with iOS and Android Apps to Go Head-to-Head with Twitch #ios #game'),
 ('neutral', "@Omsondafivenine @Footy_Jokes this is the truth my friend while messi might win the 5th ballon d or people would say Ronaldo didn't win it"),
 ('neutral', "Michelle Obama's waiting in the Master Bedroom Chelsea Clinton's waiting in the Lincoln Bedroom WHICH ROOM 1st @Sadieisonfire @REALFITFINLAY"),
 ('positive', 'The very best thing about Monday Night Raw was the Nintendo #MarioMaker commericial. We still want the games @WWE @2K @WWENetwork. #WiiU')]

We can observe that some tweets end with links. Moreover, many tweets contain Twitter mentions (@someone). These are not useful in determining the sentiment of the tweet, and it is better to remove them before proceeding:

In [26]:
def remove_links_mentions(tweet):
    # Raw strings avoid invalid-escape warnings; \S+ consumes the whole link up to the next whitespace
    link_re_pattern = r"https?://\S+"
    mention_re_pattern = r"@\w+"
    tweet = re.sub(link_re_pattern, "", tweet)
    tweet = re.sub(mention_re_pattern, "", tweet)
    return tweet.lower()
In [27]:
remove_links_mentions('...and Jeb Bush is third in the polls and losing donors. Be fair and balance...@karlrove @FoxNews.')
'...and jeb bush is third in the polls and losing donors. be fair and balance... . '

As shown, regex can remove such strings easily. Finally, notice that we lowercased all tweets in the function. The simple reason is that, for a computer, case differences matter. For example, "word" and "Word" are as different to it as any other pair of words, although for us they are the same. To improve training, it is better to lowercase all words.
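To see why the link pattern matters, here is a small self-contained sketch. It assumes a link runs until the next whitespace (`\S+`), and it also collapses the double spaces left behind by removals; the names `clean_tweet`, `LINK_RE` and `MENTION_RE`, as well as the example tweet, are made up for illustration.

```python
import re

LINK_RE = re.compile(r"https?://\S+")  # assumption: a link ends at the next whitespace
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(tweet):
    """Remove links and @mentions, collapse repeated whitespace, and lowercase."""
    tweet = LINK_RE.sub("", tweet)
    tweet = MENTION_RE.sub("", tweet)
    tweet = re.sub(r"\s+", " ", tweet).strip()
    return tweet.lower()

example = "Loving the new iOS update @Apple! Details: https://example.com/ios?ref=tw #iOS"
cleaned = clean_tweet(example)
print(cleaned)

# A pattern that only matches word characters after "://" stops at the first dot
leftover = re.sub(r"https?://[\w]+", "", "see https://example.com/ios")
print(leftover)
```

The second print shows that part of the URL survives when the pattern stops at the first non-word character, which would add meaningless tokens to the vocabulary.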


Finally, using word_tokenize() from the NLTK library, we can split each sentence into tokens: words, punctuation marks, and other language blocks that are "divisible".

In [28]:
train_set = [(label, word_tokenize(remove_links_mentions(tweet))) for label, tweet in train_set]
In [29]:
test_set = [(label, word_tokenize(remove_links_mentions(tweet))) for label, tweet in test_set]

Next, we create the "vocabulary" of the corpus. In NLP projects, the vocabulary is just a mapping of each word to a unique ID. Since models cannot process text as we do, we must convert it into numerical form.

By creating this mapping, one can write a sentence with numbers. For instance, if the vocab is as follows:

{"i": 0,
 "the": 1,
 "ate": 2,
 "pizza": 3}

We can say "I ate the pizza" by saying [0, 2, 1, 3].

This is an oversimplified explanation of encoding, but the general idea is the same.
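The pizza example can be run directly. This is a minimal sketch with hypothetical helper names (`encode`, `decode`); it lowercases and splits on whitespace, which is simpler than the tokenizer used in the notebook.

```python
vocab = {"i": 0, "the": 1, "ate": 2, "pizza": 3}
inverse_vocab = {idx: word for word, idx in vocab.items()}

def encode(sentence):
    """Map each lowercased, whitespace-separated word to its ID."""
    return [vocab[word] for word in sentence.lower().split()]

def decode(ids):
    """Map IDs back to words."""
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode("I ate the pizza")
text = decode(ids)
print(ids, text)
```

Decoding returns the lowercased sentence, since the original casing is discarded during encoding.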

In this exercise, we create a list of unique words (set-like) and use that list and its indices to create a dictionary of mapping.

As shown, the list starts with the 3 tokens "<PAD>", "<SOS>", "<EOS>".

Since we will input fixed-size text to the model, we will have to pad some tweets to increase their length. The token for padding is <PAD>.

<SOS> and <EOS> are short for "start of sentence" and "end of sentence" respectively. They are tokens used to identify the beginning and ending of each sentence in order to train the model. As will be shown, they will be inserted at the beginning and end of every tweet.

In [30]:
index2word = ["<PAD>", "<SOS>", "<EOS>"]

for ds in [train_set, test_set]:
    for label, tweet in ds:
        for token in tweet:
            if token not in index2word:
                index2word.append(token)
In [32]:
word2index = {token: idx for idx, token in enumerate(index2word)}

As shown, index2word and word2index act as our vocabulary which can be used to encode all tweets.
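Checking membership in a list gets slow as the vocabulary grows. One possible variant, sketched here with a hypothetical `build_vocab` helper and toy data, fills the list and the dictionary together so that each lookup goes through the dictionary; the resulting IDs follow the same first-seen order.

```python
def build_vocab(datasets, specials=("<PAD>", "<SOS>", "<EOS>")):
    """Build index2word (list) and word2index (dict) in first-seen order."""
    index2word = list(specials)
    word2index = {token: idx for idx, token in enumerate(index2word)}
    for ds in datasets:
        for _label, tokens in ds:
            for token in tokens:
                if token not in word2index:  # dict lookup, constant time on average
                    word2index[token] = len(index2word)
                    index2word.append(token)
    return index2word, word2index

# Toy (label, tokens) pairs in the same shape as train_set and test_set
toy_train = [("positive", ["great", "game", "!"]), ("negative", ["bad", "game"])]
toy_test = [("neutral", ["a", "game", "tomorrow"])]

toy_index2word, toy_word2index = build_vocab([toy_train, toy_test])
print(toy_index2word)
print(toy_word2index["game"])
```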

In [34]:
def label_map(label):
    if label == "negative":
        return 0
    elif label == "neutral":
        return 1
    else: #positive
        return 2

Also, we cannot leave the labels in text form. So, we encode them using 0, 1, and 2 for negative, neutral, and positive respectively.

To pad, we must select a sequence length. This length should cover the majority of tweets. Typically, length measurements are performed to find the ideal sequence length, but since our data is tweet data from 2012, we know that the tweets cannot be too long, and therefore we can set the length to 32 tokens.
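The length measurement mentioned above could look like the following sketch: it finds the token count that covers a chosen percentage of the texts. The helper `length_percentile`, the 90% threshold, and the toy lengths are assumptions for illustration, not measurements from this dataset.

```python
def length_percentile(tokenized, pct):
    """Smallest length L such that at least pct percent of texts have <= L tokens."""
    lengths = sorted(len(tokens) for tokens in tokenized)
    # Nearest-rank percentile, using integer ceiling division
    rank = (pct * len(lengths) + 99) // 100
    return lengths[max(rank, 1) - 1]

# Toy tokenized texts with known lengths
toy_lengths = [5, 8, 12, 12, 15, 18, 20, 22, 25, 40]
toy_tokenized = [["tok"] * n for n in toy_lengths]

p90 = length_percentile(toy_tokenized, 90)
seq_length_guess = p90 + 2  # leave room for <SOS> and <EOS>
print(p90, seq_length_guess)
```

On real data, the same function would be applied to the tokenized tweets from `train_set`, and the result compared against the 32 tokens chosen here.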

In [35]:
seq_length = 32

Then, we perform padding and truncating. Padding is performed when a tweet is shorter than 32 tokens, and truncating is used when a tweet is longer than 32 tokens. In the same encoding method, we also insert the PAD, SOS, and EOS tokens.

In [36]:
def encode_and_pad(tweet, length):
    sos = [word2index["<SOS>"]]
    eos = [word2index["<EOS>"]]
    pad = [word2index["<PAD>"]]

    if len(tweet) < length - 2: # -2 for SOS and EOS
        n_pads = length - 2 - len(tweet)
        encoded = [word2index[w] for w in tweet]
        return sos + encoded + eos + pad * n_pads 
    else: # tweet is longer than possible; truncating
        encoded = [word2index[w] for w in tweet]
        truncated = encoded[:length - 2]
        return sos + truncated + eos

Encoding both train and test sets:

In [37]:
train_encoded = [(encode_and_pad(tweet, seq_length), label_map(label)) for label, tweet in train_set]
In [38]:
test_encoded = [(encode_and_pad(tweet, seq_length), label_map(label)) for label, tweet in test_set]

This is what 3 tweets look like after encoding:

In [39]:
for i in train_encoded[:3]:
    print(i)
([1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 10, 13, 14, 15, 16, 17, 18, 19, 20, 19, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2)
([1, 21, 22, 23, 24, 25, 9, 10, 9, 26, 27, 28, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1)
([1, 29, 30, 31, 32, 33, 34, 35, 36, 10, 37, 38, 39, 40, 41, 10, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 23, 2, 0, 0, 0], 0)

Notice that they always begin with 1, which stands for SOS, and that the tweet content ends with 2, which is EOS. If the tweet is shorter than 32 tokens, it is then padded with 0's, the ID of the PAD token. Also, notice that the labels are numerical as well.

Now, the data is preprocessed and encoded. It is time to create our PyTorch Datasets and DataLoaders:

In [40]:
batch_size = 50

train_x = np.array([tweet for tweet, label in train_encoded])
train_y = np.array([label for tweet, label in train_encoded])
test_x = np.array([tweet for tweet, label in test_encoded])
test_y = np.array([label for tweet, label in test_encoded])

train_ds = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
test_ds = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

train_dl = DataLoader(train_ds, shuffle=True, batch_size=batch_size, drop_last=True)
test_dl = DataLoader(test_ds, shuffle=True, batch_size=batch_size, drop_last=True)

Notice the parameter drop_last=True. It handles the case where the final batch has fewer than 50 elements. Such an incomplete batch would cause dimension errors if we fed it into the model, because the initial hidden states are created for exactly batch_size sequences. By setting this parameter to True, we skip this final batch.
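As a quick worked check of what drop_last=True costs, the arithmetic below uses the split sizes shown earlier (2215 train and 392 test rows) and the batch size of 50. The helper name `batch_stats` is hypothetical.

```python
def batch_stats(n_samples, batch_size):
    """Return (number of full batches, samples left over and dropped)."""
    return divmod(n_samples, batch_size)

train_batches, train_dropped = batch_stats(2215, 50)
test_batches, test_dropped = batch_stats(392, 50)

print(train_batches, train_dropped)
print(test_batches, test_dropped)
```

On the test side this means 42 of the 392 tweets (roughly 11%) are left out of each evaluation pass, and because the loader shuffles, a different subset is skipped each time.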

PyTorch LSTM Model Building

Building LSTMs is very simple in PyTorch. Similar to how you create simple feed-forward neural networks, we extend nn.Module, create the layers in the initialization, and create a forward() method.

In the initialization, we create an embeddings layer first.

Embeddings are used for improving the representation of the text. This Wikipedia article explains embeddings well:

In short, instead of feeding sentences as simple encoded sequences (for example [0, 1, 2], etc. as seen in the pizza example), we can improve the representation of every token.

Word embeddings are vectors that represent each word, instead of a single number in the pizza example.

Why does a vector help? Vectors allow you to highlight the similarities between words. For instance, we can give the words "food" and "pizza" similar vectors since the 2 words are related. This makes it easier for the model to "understand" the text.
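A common way to quantify "similar vectors" is cosine similarity. The sketch below uses hand-picked three-dimensional vectors, which are purely illustrative and not learned embeddings, to show related words scoring higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked toy vectors (assumption: real embeddings are learned, not written by hand)
embeddings = {
    "food": [0.9, 0.8, 0.1],
    "pizza": [0.8, 0.9, 0.2],
    "phone": [0.1, 0.2, 0.9],
}

sim_food_pizza = cosine_similarity(embeddings["food"], embeddings["pizza"])
sim_food_phone = cosine_similarity(embeddings["food"], embeddings["phone"])
print(round(sim_food_pizza, 3), round(sim_food_phone, 3))
```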

As seen, in PyTorch it is a simple layer, and we only need to feed the data into it. Vectors are initialized randomly for every word, and then adjusted during training. That means that the embeddings are trainable parameters in this network.

An alternative to random initialization is to use pre-trained vectors. Big AI labs at Google, Facebook, and Stanford have created pre-trained embeddings that you can just download and use. They are called word2vec, fastText, and GloVe respectively.

This is a good example of how to use pre-trained embeddings such as word2vec in the Embedding layer of PyTorch:

In [41]:
class BiLSTM_SentimentAnalysis(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim, dropout) :
        # nn.Module must be initialized before any layers are assigned to self
        super().__init__()

        # The embedding layer takes the vocab size and the embeddings size as input
        # The embeddings size is up to you to decide, but common sizes are between 50 and 100.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # The LSTM layer takes in the embedding size and the hidden vector size.
        # The hidden dimension is up to you to decide, but common values are 32, 64, 128
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # We use dropout before the final layer to improve with regularization
        self.dropout = nn.Dropout(dropout)

        # The fully-connected layer takes in the hidden dim of the LSTM and
        #  outputs a 3x1 vector of the class scores.
        self.fc = nn.Linear(hidden_dim, 3)

    def forward(self, x, hidden):
        """The forward method takes in the input and the previous hidden state."""

        # The input is transformed to embeddings by passing it to the embedding layer
        embs = self.embedding(x)

        # The embedded inputs are fed to the LSTM alongside the previous hidden state
        out, hidden = self.lstm(embs, hidden)

        # Dropout is applied to the output and fed to the FC layer
        out = self.dropout(out)
        out = self.fc(out)

        # We extract the scores for the final hidden state since it is the one that matters.
        out = out[:, -1]
        return out, hidden
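    def init_hidden(self):
        # Zero hidden and cell states, shaped (num_layers, batch_size, hidden_dim).
        # The hard-coded 32 must match the hidden_dim passed to the constructor.
        return (torch.zeros(1, batch_size, 32), torch.zeros(1, batch_size, 32))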

Finally, as seen, we have an init_hidden() method. The reason we need this method is that at the beginning of the sequence, there are no hidden states.

The LSTM takes in initial hidden states of zeros at the first time-step. So, we initialize them using this method.
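The shapes involved are easy to get wrong, so here is a small standalone sketch that uses random numbers in place of embedded tweets. The sizes are arbitrary example values; the point is that the initial states are shaped (num_layers, batch, hidden_dim) even when batch_first=True.

```python
import torch
import torch.nn as nn

batch_size, seq_length, embedding_dim, hidden_dim = 4, 32, 64, 32

lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
embs = torch.randn(batch_size, seq_length, embedding_dim)  # stand-in for embedded tweets

# One layer, one direction, so the first dimension of the states is 1
h0 = torch.zeros(1, batch_size, hidden_dim)
c0 = torch.zeros(1, batch_size, hidden_dim)

out, (hn, cn) = lstm(embs, (h0, c0))
print(out.shape)  # one output vector per time-step
print(hn.shape)   # final hidden state per sequence

# For a single-layer, unidirectional LSTM the last output equals the final hidden state
same_last = torch.allclose(out[:, -1], hn[0])

# Leaving the states out makes PyTorch use zeros, giving the same result
out_default, _ = lstm(embs)
same_default = torch.allclose(out, out_default)
print(same_last, same_default)
```

This is why the model can take `out[:, -1]` as the summary of the whole tweet before the fully-connected layer.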

Now, we initialize the model and move it to device as follows:

Setup and Training

In [113]:
model = BiLSTM_SentimentAnalysis(len(word2index), 64, 32, 0.2)
model = model.to(device)

Next, we create the criterion and optimizer used for training:

In [114]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = 3e-4)

Then we train the model for 50 epochs:

In [115]:
epochs = 50
losses = []
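for e in range(epochs):

    h0, c0 =  model.init_hidden()

    h0 = h0.to(device)
    c0 = c0.to(device)

    for batch_idx, batch in enumerate(train_dl):

        input = batch[0].to(device)
        target = batch[1].to(device)

        # Clear the gradients left over from the previous batch
        optimizer.zero_grad()
        with torch.set_grad_enabled(True):
            out, hidden = model(input, (h0, c0))
            loss = criterion(out, target)
            # Backpropagate the loss and update the weights
            loss.backward()
            optimizer.step()
        # Keep the batch loss so it can be plotted afterwards
        losses.append(loss.item())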

We plot the loss at each batch to make sure that the model is learning:

In [116]:
plt.plot(losses)
[<matplotlib.lines.Line2D at 0x7f03a2c1bbd0>]

As shown, the losses decrease steadily and then level off, which suggests that the model has learned what it can from the training data.

To test the model, we run the same loop for the test set and extract the accuracy:


In [117]:
model.eval()  # switch off dropout while evaluating
batch_acc = []
for batch_idx, batch in enumerate(test_dl):

    input = batch[0].to(device)
    target = batch[1].to(device)

    with torch.set_grad_enabled(False):
        out, hidden = model(input, (h0, c0))
        _, preds = torch.max(out, 1)
        preds = preds.to("cpu").tolist()
        batch_acc.append(accuracy_score(preds, target.tolist()))


While this is generally a low accuracy, it is not insignificant. If the model did not learn, we would expect an accuracy of ~33%, which is random selection.

However, since the dataset is small and noisy, this is likely close to the best performance a simple LSTM can achieve on it.

According to the GitHub repo, the author was able to achieve an accuracy of ~50% using XGBoost.
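The ~33% chance level mentioned above can be checked with a small simulation. This sketch assumes perfectly balanced labels and uniform random guessing; the function name and the toy labels are hypothetical.

```python
import random

def random_baseline_accuracy(labels, n_classes=3, seed=0):
    """Accuracy obtained by guessing each label uniformly at random."""
    rng = random.Random(seed)
    guesses = [rng.randrange(n_classes) for _ in labels]
    correct = sum(guess == label for guess, label in zip(guesses, labels))
    return correct / len(labels)

# Balanced toy labels: 0 = negative, 1 = neutral, 2 = positive
labels = [0, 1, 2] * 1000
baseline = random_baseline_accuracy(labels)
print(round(baseline, 3))
```

Any accuracy clearly above this level indicates that a model has picked up some signal from the tweets.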


In this tutorial, we created a simple LSTM classifier for sentiment analysis. Along the way, we learned many NLP techniques used in real NLP projects. While the accuracy was not as high as accuracies for other datasets, we can conclude that the model learned what it could from the data, as shown by the loss.