Stock Tweets Text Analysis Using Pandas, NLTK and WordCloud

In this notebook, we will go over a text analysis of stock tweets. The data was scraped from Stocktwits. I will use the Python libraries Pandas, WordCloud and NLTK for this analysis. If you want to know more about Pandas, check my other notebooks on Pandas.

Let us import the necessary packages.

In [1]:
import re
import random
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import json
from collections import Counter

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import nltk
from nltk.corpus import stopwords

import os

import warnings

Checking the Data

Let us check the data using the Unix head command.

In [2]:
!head -2 stocktwits.csv
atvi,$ATVI brutal selloff here today... really dumb given the sectors performance. still bulish midterm.,Bullish,14,2020-10-02T22:19:36.000Z
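
The same peek can be sketched in Python with the standard csv module, which also shows how a row splits into fields. The snippet below parses the sample row shown above from an in-memory string. The column names are taken from the DataFrame output later in this notebook, and using io.StringIO in place of the real file is an assumption made only to keep the example self-contained.

```python
import csv
import io

# The sample row from the output above, held in memory instead of read from stocktwits.csv
raw = ("atvi,$ATVI brutal selloff here today... really dumb given the sectors "
       "performance. still bulish midterm.,Bullish,14,2020-10-02T22:19:36.000Z\n")

# Column names as they appear in the DataFrame shown later in this notebook
columns = ["ticker", "message", "sentiment", "followers", "created_at"]

# csv.reader handles the comma splitting (and quoting, if a message contains commas)
row = next(csv.reader(io.StringIO(raw)))
record = dict(zip(columns, row))

for name, value in record.items():
    print(f"{name}: {value}")
```

Note that every field comes back as a string here; pd.read_csv infers numeric types such as followers on its own.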

Reading the data

Let us take a peek into our data.

In [3]:
df = pd.read_csv('stocktwits.csv')
In [4]:
df.head()
ticker message sentiment followers created_at
0 atvi $ATVI brutal selloff here today... really dumb... Bullish 14 2020-10-02T22:19:36.000Z
1 atvi $ATVI $80 around next week! Bullish 31 2020-10-02T21:50:19.000Z
2 atvi $ATVI Jefferies says that the delay is a &quot... Bullish 83 2020-10-02T21:19:06.000Z
3 atvi $ATVI I’ve seen this twice before, and both ti... Bullish 5 2020-10-02T20:48:42.000Z
4 atvi $ATVI acting like a game has never been pushed... Bullish 1 2020-10-02T19:14:56.000Z

As we see above, for each stock we have a tweet, its sentiment, the number of followers and the date of the tweet.
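
The created_at values look like ISO 8601 timestamps ending in Z (UTC). Here is a minimal sketch, using only the standard library, of turning one such string into a timezone-aware datetime. The format string is my reading of the sample values above, not something documented by the data source.

```python
from datetime import datetime, timezone

raw = "2020-10-02T22:19:36.000Z"  # first created_at value from the table above

# The trailing Z is matched literally, then the UTC timezone is attached explicitly
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

print(parsed.isoformat())
```

For the whole column, pd.to_datetime(df['created_at'], utc=True) should give the equivalent result in Pandas.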

In [5]:
df.shape
(31372, 5)

Check whether there are any missing ('na') values in the data with df.isna(). As we see below, no column contains 'na' values.

In [6]:
df.isna().any()
ticker        False
message       False
sentiment     False
followers     False
created_at    False
dtype: bool

Check whether there are any 'null' values in the data with the df.isnull() command. In Pandas, isnull() is an alias of isna(), so the result is the same: as we see below, there are no null values in the data.

In [7]:
df.isnull().any()
ticker        False
message       False
sentiment     False
followers     False
created_at    False
dtype: bool

There are no null values in the dataset.
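
To see what these checks report when something is actually missing, here is a small sketch on a hypothetical three-row frame (invented for illustration, not taken from stocktwits.csv). It shows the per-column flag used above and a per-column count of missing values.

```python
import pandas as pd

# Tiny hypothetical frame with one missing message (not from stocktwits.csv)
sample = pd.DataFrame({
    "ticker": ["atvi", "zm", "tsla"],
    "message": ["$ATVI up", None, "$TSLA down"],
})

per_column = sample.isna().any()      # True for any column holding a missing value
missing_counts = sample.isna().sum()  # number of missing values per column

print(per_column)
print(missing_counts)

# isnull() gives the same mask as isna()
print(sample.isna().equals(sample.isnull()))
```

The count from isna().sum() is often more useful than the boolean flag, because it tells you how many rows would be lost if you dropped the missing values.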

Stock Tweet Analysis

Let us look at the distribution of tweets by stocks.

In [8]:
stock_gp = df.groupby('ticker').count()['message'].reset_index().sort_values(by='message',ascending=False)
stock_gp.head()
ticker message
607 spce 353
629 zm 294
614 tsla 283
591 ostk 275
171 F 267
In [9]:
g = sns.distplot(stock_gp['message'],kde=False)

The x-axis in the above plot shows the number of messages per ticker. The height of each bar is the number of tickers whose message count falls in that range.
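
One way to read such a histogram is to do the binning by hand. The sketch below uses made-up message counts and an arbitrary bin width of 100 (both are assumptions for illustration only) and counts how many tickers land in each bin, which is what the bar heights represent.

```python
from collections import Counter

# Hypothetical message counts per ticker (not taken from the dataset)
message_counts = {"aaa": 353, "bbb": 294, "ccc": 283, "ddd": 40, "eee": 35, "fff": 12}

bin_width = 100  # assumed bin width, chosen only for illustration

# Map each ticker's count to the lower edge of its bin, then count tickers per bin
tickers_per_bin = Counter((n // bin_width) * bin_width for n in message_counts.values())

for lower_edge in sorted(tickers_per_bin):
    upper_edge = lower_edge + bin_width - 1
    print(f"{lower_edge}-{upper_edge} messages: {tickers_per_bin[lower_edge]} ticker(s)")
```

Seaborn picks its own bin edges, so the real plot will not use these exact ranges.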

Another way to plot this is a bar plot (shown below), which gives us more information about the individual stocks and their tweets. Note that only a few labels are drawn in the plot below; if we plotted all of them, the y-axis would be cluttered.

In [10]:
import matplotlib.ticker as ticker
ax = sns.barplot(y='ticker', x='message', data=stock_gp)

Let's look at the distribution of tweets by sentiment in the data set.

In [11]:
temp = df.groupby('sentiment').count()['message'].reset_index().sort_values(by='message',ascending=False)
temp.style.background_gradient(cmap='Greens')
sentiment message
1 Bullish 26485
0 Bearish 4887

As we can see, the data is skewed towards Bullish sentiment, which is not surprising given that the market has been in an uptrend since mid-2020.
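
To put a number on the skew, the counts from the table above can be turned into shares. This is plain arithmetic on the two reported counts; nothing else is assumed.

```python
# Sentiment counts as reported in the table above
counts = {"Bullish": 26485, "Bearish": 4887}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The total matches the 31372 rows reported by the shape check, and roughly 84% of the tweets are Bullish. A classifier trained on this data would need to account for that imbalance.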

Most Common 20 words in Text/Tweets

In [12]:
df['words'] = df['message'].apply(lambda x:str(x.lower()).split())
top = Counter([item for sublist in df['words'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')
Common_words count
0 the 16867
1 to 12515
2 and 9252
3 a 9179
4 is 7643
5 this 7354
6 of 6321
7 in 6105
8 for 6062
9 on 5390
10 i 4598
11 will 3755
12 it 3695
13 be 3589
14 at 3545
15 with 3389
16 you 3203
17 are 3134
18 up 2539
19 that 2472

Most of these words shown above are stop words. Let us remove these stop words first.

In [13]:
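# Build the stop word set once; looking it up per word inside the function is very slow
stop_words = set(stopwords.words('english'))

def remove_stopword(x):
    return [y for y in x if y not in stop_words]
df['words'] = df['words'].apply(remove_stopword)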
In [14]:
top = Counter([item for sublist in df['words'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')
Common_words count
0 buy 1868
1 - 1606
2 stock 1588
3 like 1542
4 going 1483
5 good 1461
6 go 1445
7 get 1410
8 see 1409
9 next 1377
10 short 1317
11 trade 1253
12 back 1233
13 $spy 1197
14 market 1159
15 long 1116
16 calls 1075
17 price 1038
18 $aapl 1013
19 day 984
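
The list above still contains a bare '-' token, because splitting on whitespace keeps punctuation as words. A possible refinement, sketched here as an assumption rather than as part of the original analysis, is a small regex tokenizer that keeps cashtags such as $spy but drops stand-alone punctuation and numbers. The pattern and the function name tokenize are hypothetical choices.

```python
import re
from collections import Counter

# Optional leading $, then a letter, then letters, digits or apostrophes
TOKEN_RE = re.compile(r"\$?[a-z][a-z0-9']*")

def tokenize(message):
    """Lowercase a message and return word and cashtag tokens only."""
    return TOKEN_RE.findall(str(message).lower())

tokens = tokenize("$ATVI brutal selloff - still bullish!")
print(tokens)

# Counting works the same way as with the whitespace split
print(Counter(tokens).most_common(3))
```

With this in place, df['message'].apply(tokenize) could replace the lower().split() step, at the cost of also dropping price tokens such as $80.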

Stock Tweets WordClouds

Let us now plot the word clouds using the Python WordCloud library.

In [15]:
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=50, figure_size=(16.0,9.0), color = 'white',
                   title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'u', "im"}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color=color,
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    mask = mask)
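    # Join all messages into one string; generate() expects text, not a Series
    wordcloud.generate(" ".join(map(str, text)))

    plt.figure(figsize=figure_size)
    if image_color:
        # Recolor the cloud using the colors of the mask image
        image_colors = ImageColorGenerator(mask)
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
        plt.title(title, fontdict={'size': title_size,
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.title(title, fontdict={'size': title_size, 'color': 'black',
                                  'verticalalignment': 'bottom'})
    plt.axis('off')
    plt.tight_layout()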

Let us first plot the word clouds of Bullish tweets only.

In [16]:
plot_wordcloud(df[df['sentiment']=="Bullish"]['message'],mask=None,color='white',max_font_size=50,title_size=30,title="WordCloud of Bullish Tweets")

Now let us plot the word cloud for Bearish tweets.

In [17]:
plot_wordcloud(df[df['sentiment']=="Bearish"]['message'],mask=None,color='white',max_font_size=50,title_size=30,title="WordCloud of Bearish Tweets")
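
A word cloud is essentially a picture of word frequencies, so the two clouds can also be compared numerically. The sketch below counts words per sentiment with Counter on a handful of made-up tokenized tweets (invented for illustration, not taken from the dataset).

```python
from collections import Counter

# Hypothetical tokenized tweets, invented for illustration only
tweets = [
    ("Bullish", ["buy", "calls", "long"]),
    ("Bullish", ["buy", "dip", "long"]),
    ("Bearish", ["short", "puts", "sell"]),
    ("Bearish", ["short", "sell", "buy"]),
]

# One Counter per sentiment label
freq = {}
for sentiment, words in tweets:
    freq.setdefault(sentiment, Counter()).update(words)

for sentiment, counter in freq.items():
    print(sentiment, counter.most_common(2))
```

On the real data, the same per-sentiment counters could be built from df['words'], and each one could be passed to WordCloud's generate_from_frequencies method instead of generating the cloud from raw text.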