Stock Tweets Text Analysis Using Pandas, NLTK and WordCloud

In this notebook, we will go over a text analysis of stock tweets. The data was scraped from StockTwits. I will use the Python libraries Pandas, WordCloud and NLTK for this analysis. If you want to know more about Pandas, check my other notebooks on Pandas: https://www.nbshare.io/notebooks/pandas/

Let us import the necessary packages.

import re
import random
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import json
from collections import Counter

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


import nltk
from nltk.corpus import stopwords
# Run nltk.download('stopwords') once if the stop-word corpus is not installed yet.

import os

import warnings
warnings.filterwarnings("ignore")

Checking the Data

Let us check the data using the Unix head command.

!head -2 stocktwits.csv

ticker,message,sentiment,followers,created_at
atvi,$ATVI brutal selloff here today... really dumb given the sectors performance. still bulish midterm.,Bullish,14,2020-10-02T22:19:36.000Z

Reading the data

Let us take a peek at our data.

df = pd.read_csv('stocktwits.csv')

df.head()

As the output of df.head() shows, for each stock we have a tweet, its sentiment, the number of followers and the date of the tweet.

df.shape

(31372, 5)

Check whether the data contains any missing ('na') values with df.isna(). As we see below, there are none.

df.isna().any()

ticker        False
message       False
sentiment     False
followers     False
created_at    False
dtype: bool

We can run the same check with df.isnull(), which is an alias of df.isna() and therefore gives the same result. As we see below, there are no null values in the data.

df.isnull().any()

ticker        False
message       False
sentiment     False
followers     False
created_at    False
dtype: bool

There are no null values in the data set.
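
The any() check above only flags whether a column has a missing value at all. A per-column count is often more informative. Here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and not taken from stocktwits.csv), which also confirms that isna() and isnull() agree:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing message and one missing follower count.
toy = pd.DataFrame({
    "ticker": ["atvi", "zm", "tsla"],
    "message": ["$ATVI up", None, "$TSLA down"],
    "followers": [14, 31, np.nan],
})

# isnull() is an alias of isna(), so both return identical boolean frames.
same_result = toy.isna().equals(toy.isnull())

# Count missing values per column instead of only flagging their presence.
missing_counts = toy.isna().sum()
print(same_result)
print(missing_counts)
```

On the real data, df.isna().sum() should report 0 for every column, matching the output above.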

Stock Tweet Analysis

Let us look at the distribution of tweets by stocks.

stock_gp = df.groupby('ticker').count()['message'].reset_index().sort_values(by='message',ascending=False)
stock_gp.head(5)

plt.figure(figsize=(12,6))
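# sns.distplot is deprecated in recent seaborn releases; histplot is its replacement for histograms.
g = sns.histplot(stock_gp['message'], kde=False)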

The x-axis in the above plot shows the number of messages per ticker. The height of each bar is the number of tickers whose message count falls in that range.
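
To make the reading of the histogram concrete: the plot bins the per-ticker message counts and draws one bar per bin, so a bar's height is a count of tickers. A small sketch with numpy.histogram, using made-up counts (the `toy_counts` values and the choice of two bins are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical message counts per ticker (not the real stocktwits numbers).
toy_counts = pd.Series([5, 7, 8, 12, 40], index=["a", "b", "c", "d", "e"])

# Two equal-width bins over the range 0-50: [0, 25) and [25, 50].
tickers_per_bin, bin_edges = np.histogram(toy_counts, bins=2, range=(0, 50))
print(tickers_per_bin)  # number of tickers falling into each bin
print(bin_edges)        # boundaries of the bins on the x-axis
```

Four of the five toy tickers land in the first bin and one in the second, which is what the two bars of the corresponding histogram would show.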

Another way to plot this is a bar plot (shown below), which gives us more information about the individual stocks and their tweets. Note that only a few labels are plotted below; the y-axis would be cluttered if we plotted all of them.

import matplotlib.ticker as ticker
plt.figure(figsize=(12,6))
ax = sns.barplot(y='ticker', x='message', data=stock_gp)
ax.yaxis.set_major_locator(ticker.MultipleLocator(base=20))

Let's look at the distribution of tweets by sentiment in the data set.

temp = df.groupby('sentiment').count()['message'].reset_index().sort_values(by='message',ascending=False)
temp.style.background_gradient(cmap='Greens')

As we can see, the data is skewed towards Bullish sentiment, which is not surprising given that the market has been in an uptrend since mid-2020.
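
Raw counts show the skew, but shares make it easier to judge how strong it is. A minimal sketch using value_counts(normalize=True) on a made-up sentiment column (the `toy` frame is hypothetical; on the real data the same call would be applied to df['sentiment']):

```python
import pandas as pd

# Hypothetical sentiment labels, three Bullish and one Bearish.
toy = pd.DataFrame({"sentiment": ["Bullish", "Bullish", "Bearish", "Bullish"]})

# normalize=True returns the fraction of rows per label instead of the raw count.
shares = toy["sentiment"].value_counts(normalize=True)
print(shares)
```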

Most Common 20 words in Text/Tweets

df['words'] = df['message'].apply(lambda x: str(x).lower().split())
top = Counter([item for sublist in df['words'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

Most of these words shown above are stop words. Let us remove these stop words first.

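# Build the stop-word set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words('english'))
def remove_stopword(x):
    return [y for y in x if y not in stop_words]
df['words'] = df['words'].apply(remove_stopword)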

top = Counter([item for sublist in df['words'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')
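
Splitting on whitespace keeps punctuation as tokens, which is why a bare "-" appears among the most common words. One possible refinement, sketched below, is a regular-expression tokenizer that keeps words and cashtags such as $spy and drops everything else. The `TOKEN_PATTERN` and `tokenize` names are hypothetical helpers, not part of the original notebook, and dropping purely numeric tokens such as $80 is a design choice of this sketch.

```python
import re

# Keep cashtags such as "$spy" and plain words (with optional apostrophes);
# bare punctuation and purely numeric tokens are not matched and are dropped.
TOKEN_PATTERN = re.compile(r"\$?[a-z][a-z0-9']*")

def tokenize(message):
    """Lower-case a message and return its word and cashtag tokens."""
    return TOKEN_PATTERN.findall(message.lower())

tokens = tokenize("$SPY - going long, calls next week!")
print(tokens)
```

Under these assumptions, df['message'].apply(tokenize) could stand in for the lower-case-and-split step before the stop words are removed.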

Stock Tweets WordClouds

Let us now plot the word clouds using Python WordCloud library.

def plot_wordcloud(text, mask=None, max_words=200, max_font_size=50, figure_size=(16.0,9.0), color = 'white',
                   title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'u', "im"}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color=color,
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=400, 
                    height=400,
                    mask = mask)
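    # Join the messages into one string; str() of a pandas Series only gives its truncated repr.
    text = text if isinstance(text, str) else " ".join(map(str, text))
    wordcloud.generate(text)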
    
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()

Let us first plot the word clouds of Bullish tweets only.

plot_wordcloud(df[df['sentiment']=="Bullish"]['message'],mask=None,color='white',max_font_size=50,title_size=30,title="WordCloud of Bullish Tweets")

Now let us plot the word cloud for Bearish tweets.

plot_wordcloud(df[df['sentiment']=="Bearish"]['message'],mask=None,color='white',max_font_size=50,title_size=30,title="WordCloud of Bearish Tweets")
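
Word clouds are hard to compare precisely, so it can help to put numbers next to them. The sketch below counts the most frequent words per sentiment with collections.Counter on a made-up frame (the `toy` data and the `top_words` helper are hypothetical and only illustrate the idea):

```python
from collections import Counter

import pandas as pd

# Hypothetical messages, two per sentiment.
toy = pd.DataFrame({
    "message": ["buy calls now", "buy more calls", "short this", "short it now"],
    "sentiment": ["Bullish", "Bullish", "Bearish", "Bearish"],
})

def top_words(frame, sentiment, n=2):
    """Return the n most frequent whitespace-separated words for one sentiment."""
    messages = frame.loc[frame["sentiment"] == sentiment, "message"]
    counts = Counter(word for msg in messages for word in msg.lower().split())
    return counts.most_common(n)

bullish_top = top_words(toy, "Bullish")
bearish_top = top_words(toy, "Bearish")
print(bullish_top)
print(bearish_top)
```

Applied to df, the same helper would give a ranked word list for each of the two word clouds above.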

Output of stock_gp.head(5), the five tickers with the most tweets:

	ticker	message
607	spce	353
629	zm	294
614	tsla	283
591	ostk	275
171	F	267

Output of the 20 most common words after removing stop words:

	Common_words	count
0	buy	1868
1	-	1606
2	stock	1588
3	like	1542
4	going	1483
5	good	1461
6	go	1445
7	get	1410
8	see	1409
9	next	1377
10	short	1317
11	trade	1253
12	back	1233
13	$spy	1197
14	market	1159
15	long	1116
16	calls	1075
17	price	1038
18	$aapl	1013
19	day	984

Output of df.head():

	ticker	message	sentiment	followers	created_at
0	atvi	$ATVI brutal selloff here today... really dumb...	Bullish	14	2020-10-02T22:19:36.000Z
1	atvi	$ATVI $80 around next week!	Bullish	31	2020-10-02T21:50:19.000Z
2	atvi	$ATVI Jefferies says that the delay is a &quot...	Bullish	83	2020-10-02T21:19:06.000Z
3	atvi	$ATVI I’ve seen this twice before, and both ti...	Bullish	5	2020-10-02T20:48:42.000Z
4	atvi	$ATVI acting like a game has never been pushed...	Bullish	1	2020-10-02T19:14:56.000Z