Stock Sentiment Analysis Using Autoencoders

In this notebook, we will use autoencoders to do stock sentiment analysis. Autoencoder consists of encoder and decoder models. Encoders compress the data and decoders decompress it. Once you train an autoencoder neural network, the encoder can be used to train a different machine learning model.

For stock sentiment analysis, we will first use encoder for the feature extraction and then use these features to train a machine learning model to classify the stock tweets. To learn more about Autoencoders check out the following link...

https://www.nbshare.io/notebook/86916405/Understanding-Autoencoders-With-Examples/

Stock Tweets Data

Let us import the necessary packages.

In [1]:
# importing necessary lib 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# reading tweets data
df=pd.read_csv('/content/stocktwits (2).csv')
In [3]:
df.head()
Out[3]:
ticker message sentiment followers created_at
0 atvi $ATVI brutal selloff here today... really dumb... Bullish 14 2020-10-02T22:19:36.000Z
1 atvi $ATVI $80 around next week! Bullish 31 2020-10-02T21:50:19.000Z
2 atvi $ATVI Jefferies says that the delay is a &quot... Bullish 83 2020-10-02T21:19:06.000Z
3 atvi $ATVI I’ve seen this twice before, and both ti... Bullish 5 2020-10-02T20:48:42.000Z
4 atvi $ATVI acting like a game has never been pushed... Bullish 1 2020-10-02T19:14:56.000Z

Let us remove the unnecessary features - ticker, followers and created_at from our dataset.

In [4]:
df=df.drop(['ticker','followers','created_at'],axis=1)
In [5]:
df.head()
Out[5]:
message sentiment
0 $ATVI brutal selloff here today... really dumb... Bullish
1 $ATVI $80 around next week! Bullish
2 $ATVI Jefferies says that the delay is a &quot... Bullish
3 $ATVI I’ve seen this twice before, and both ti... Bullish
4 $ATVI acting like a game has never been pushed... Bullish
In [6]:
# class counts
df['sentiment'].value_counts()
Out[6]:
Bullish    26485
Bearish     4887
Name: sentiment, dtype: int64

If you observe the above results.Our data set is imabalanced. The number of Bullish tweets are way more than the Bearish tweets. We need to balance the data.

In [7]:
# Sentiment encoding 
# Encoding Bullish with 0 and Bearish with 1 
dict={'Bullish':0,'Bearish':1}

# Mapping dictionary to Is_Response feature
df['Class']=df['sentiment'].map(dict)
df.head()
Out[7]:
message sentiment Class
0 $ATVI brutal selloff here today... really dumb... Bullish 0
1 $ATVI $80 around next week! Bullish 0
2 $ATVI Jefferies says that the delay is a &quot... Bullish 0
3 $ATVI I’ve seen this twice before, and both ti... Bullish 0
4 $ATVI acting like a game has never been pushed... Bullish 0

Let us remove the 'sentiment' feature since we have already encoded it in the 'class' column.

In [8]:
df=df.drop(['sentiment'],axis=1)

To make our dataset balanced, in the next few lines of code, I am taking same number of samples from Bullish class as we have in Bearish class.

In [9]:
Bearish = df[df['Class']== 1]
Bullish = df[df['Class']== 0].sample(4887)
In [10]:
# appending sample records of majority class to minority class
df = Bullish.append(Bearish).reset_index(drop = True)

Let us check how our dataframe looks now.

In [11]:
df.head()
Out[11]:
message Class
0 Options Live Trading with a small Ass account... 0
1 $UPS your crazy if you sold at open 0
2 If $EQIX is at $680, this stock with the bigge... 0
3 $WMT just getting hit on the no stimulus deal.... 0
4 $AMZN I'm playing the catalyst stocks with... 0

Let us do count of both the classes to make sure count of each class is same.

In [12]:
# balanced class 
df['Class'].value_counts()
Out[12]:
1    4887
0    4887
Name: Class, dtype: int64
In [13]:
df.message
Out[13]:
0       Options  Live Trading with a small Ass account...
1                     $UPS your crazy if you sold at open
2       If $EQIX is at $680, this stock with the bigge...
3       $WMT just getting hit on the no stimulus deal....
4       $AMZN I'm playing the catalyst stocks with...
                              ...                        
9769    SmartOptions® Unusual Activity Alert\n(Delayed...
9770                                            $VNO ouch
9771                                             $VNO dog
9772    $ZION I wanted to buy into this but I had an u...
9773    $ZOM Point of Care, rapid tests from $IDXX and...
Name: message, Length: 9774, dtype: object

Stock Tweets Text to Vector Form

Now we need to convert the tweets(text) into vector form.

To convert text into vector form, first we need to clean the text, Cleaning means removing special characters, lowercasing , remvoing numericals,stemming etc

For text preprocessing I am using NLTK lib.

In [14]:
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[14]:
True
In [15]:
import re
In [16]:
# I am using porterstemmer for stemming 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
corpus = []
for i in range(0, len(df)):

  review = re.sub('[^a-zA-Z]', ' ', df['message'][i])
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
  review = ' '.join(review)
  corpus.append(review)

To convert words into vector I am using TF-IDF.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
In [19]:
# I am using 1 to 3 ngram combinations
tfidf=TfidfVectorizer(max_features=10000,ngram_range=(1,3))
tfidf_word=tfidf.fit_transform(corpus).toarray()
tfidf_class=df['Class']
In [20]:
tfidf_word
Out[20]:
array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.20443663,
        0.        ]])
In [21]:
# importing necessary lib 
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler 
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
In [22]:
tfidf_class
Out[22]:
0       0
1       0
2       0
3       0
4       0
       ..
9769    1
9770    1
9771    1
9772    1
9773    1
Name: Class, Length: 9774, dtype: int64

Scaling the data

To make the data suitable for the auto-encoder, I am using MinMaxScaler.

In [23]:
X_scaled = MinMaxScaler().fit_transform(tfidf_word)
X_bulli_scaled = X_scaled[tfidf_class == 0]
X_bearish_scaled = X_scaled[tfidf_class == 1]
In [25]:
tfidf_word.shape
Out[25]:
(9774, 10000)

Building the Autoencoder neural network

I am using standard auto-encoder network.

For encoder and decoder I am using 'tanh' activation function.

For bottle neck and output layers I am using 'relu' activation.

I am using L1 regularizer in Encoder. To learn more about regularlization check here.

In [26]:
# Building the Input Layer
input_layer = Input(shape =(tfidf_word.shape[1], ))
  
# Building the Encoder network
encoded = Dense(100, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(encoded)
encoded = Dense(25, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(encoded)
encoded = Dense(12, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(encoded)
encoded = Dense(6, activation ='relu')(encoded)

# Building the Decoder network
decoded = Dense(12, activation ='tanh')(encoded)
decoded = Dense(25, activation ='tanh')(decoded)
decoded = Dense(50, activation ='tanh')(decoded)
decoded = Dense(100, activation ='tanh')(decoded)
  
# Building the Output Layer
output_layer = Dense(tfidf_word.shape[1], activation ='relu')(decoded)

Training Autoencoder

In [27]:
import tensorflow as tf

For training I am using 'Adam' Optimizer and 'BinaryCrossentropy' Loss.

In [ ]:
# Defining the parameters of the Auto-encoder network
autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer ="Adam", loss =tf.keras.losses.BinaryCrossentropy())
  
# Training the Auto-encoder network
autoencoder.fit(X_bulli_scaled, X_bearish_scaled, 
                batch_size = 16, epochs = 100
                , 
                shuffle = True, validation_split = 0.20)

After training the neural network, we discard the decoder since we are only interested in Encoder and bottle neck layers.

In the below code, autoencoder.layers[0] means first layer which is encoder layer. Similarly autoencoder.layers[4] means bottle neck layer. Now we will create our model with encoder and bottle neck layers.

In [29]:
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
hidden_representation.add(autoencoder.layers[3])
hidden_representation.add(autoencoder.layers[4])

Encoding Data

In [30]:
# Separating the points encoded by the Auto-encoder as bulli_hidden_scaled and bearish_hidden_scaled

bulli_hidden_scaled = hidden_representation.predict(X_bulli_scaled)
bearish_hidden_scaled = hidden_representation.predict(X_bearish_scaled)

Let us combine the encoded data in to a single table.

In [31]:
encoded_X = np.append(bulli_hidden_scaled, bearish_hidden_scaled, axis = 0)
y_bulli = np.zeros(bulli_hidden_scaled.shape[0]) # class 0
y_bearish= np.ones(bearish_hidden_scaled.shape[0])# class 1
encoded_y = np.append(y_bulli, y_bearish)

Now we have encoded data from auto encoder. This is nothing but feature extraction from input data using auto encoder.

Train Machine Learning Model

We can use these extracted features to train machine learning models.

In [32]:
# splitting the encoded data into train and test 

X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(encoded_X, encoded_y, test_size = 0.2)

Logistic Regreession

In [33]:
lrclf = LogisticRegression()
lrclf.fit(X_train_encoded, y_train_encoded)
  
# Storing the predictions of the linear model
y_pred_lrclf = lrclf.predict(X_test_encoded)
  
# Evaluating the performance of the linear model
print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_lrclf)))
Accuracy : 0.620460358056266

SVM

In [34]:
# Building the SVM model
svmclf = SVC()
svmclf.fit(X_train_encoded, y_train_encoded)
  
# Storing the predictions of the non-linear model
y_pred_svmclf = svmclf.predict(X_test_encoded)
  
# Evaluating the performance of the non-linear model
print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_svmclf)))
Accuracy : 0.6649616368286445

RandomForest

In [35]:
from sklearn.ensemble import RandomForestClassifier
In [36]:
# Building the rf model
rfclf = RandomForestClassifier()
rfclf.fit(X_train_encoded, y_train_encoded)
  
# Storing the predictions of the non-linear model
y_pred_rfclf = rfclf.predict(X_test_encoded)
  
# Evaluating the performance of the non-linear model
print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_rfclf)))
Accuracy : 0.7631713554987213

Xgbosst Classifier

In [37]:
import xgboost as xgb
In [38]:
#xgbosst classifier 
xgb_clf=xgb.XGBClassifier()
xgb_clf.fit(X_train_encoded, y_train_encoded)

y_pred_xgclf = xgb_clf.predict(X_test_encoded)

print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_xgclf)))


  
Accuracy : 0.7089514066496164

If you observe the above accuracy's by model. Randomforest is giving good accuracy on test data. So we can tune the RFclassifier to get better accuracy.

Hyperparamter Optimization

In [39]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
In [ ]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 25, cv = 3, verbose=2, random_state=42)
# Fit the random search model
rf_random.fit(X_train_encoded, y_train_encoded)
In [46]:
rf_random.best_params_
Out[46]:
{'bootstrap': True,
 'max_depth': 30,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'n_estimators': 1000}

But these are probably not the best hyperparameters, I used only 25 iterations. We can increase the iterations further to find the best hyperparameters.