In this notebook, we will use autoencoders to do stock sentiment analysis. An autoencoder consists of an encoder and a decoder model: the encoder compresses the data and the decoder reconstructs it. Once the autoencoder neural network is trained, the encoder on its own can be used as a feature extractor for a different machine learning model.
For stock sentiment analysis, we will first use the encoder for feature extraction and then use these features to train a machine learning model to classify the stock tweets. To learn more about autoencoders, check out the following link...
https://www.nbshare.io/notebook/86916405/Understanding-Autoencoders-With-Examples/
Let us import the necessary packages.
# importing necessary lib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# reading tweets data
df=pd.read_csv('/content/stocktwits (2).csv')
df.head()
Let us remove the unnecessary features - ticker, followers and created_at from our dataset.
df=df.drop(['ticker','followers','created_at'],axis=1)
df.head()
# class counts
df['sentiment'].value_counts()
If you observe the above results, our dataset is imbalanced: there are far more Bullish tweets than Bearish tweets, so we need to balance the data.
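A quick way to see this imbalance (an optional sketch that reuses the seaborn import above) is a count plot:
# visualizing the class imbalance
sns.countplot(x='sentiment', data=df)
plt.title('Bullish vs Bearish tweet counts')
plt.show()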
# Sentiment encoding
# Encoding Bullish with 0 and Bearish with 1
sentiment_map = {'Bullish': 0, 'Bearish': 1}
# Mapping the dictionary onto the 'sentiment' feature
df['Class'] = df['sentiment'].map(sentiment_map)
df.head()
Let us remove the 'sentiment' feature since we have already encoded it in the 'Class' column.
df=df.drop(['sentiment'],axis=1)
To balance the dataset, in the next few lines of code I take the same number of samples from the Bullish class as we have in the Bearish class.
Bearish = df[df['Class']== 1]
Bullish = df[df['Class']== 0].sample(4887)
# combining the down-sampled majority (Bullish) class with the minority (Bearish) class
df = pd.concat([Bullish, Bearish]).reset_index(drop = True)
Let us check how our dataframe looks now.
df.head()
Let us count both classes to make sure each class now has the same number of samples.
# balanced class
df['Class'].value_counts()
df.message
Now we need to convert the tweets (text) into vector form.
To convert the text into vectors, we first need to clean it. Cleaning means removing special characters, lowercasing, removing numbers, stemming, and so on.
For text preprocessing I am using the NLTK library.
import nltk
nltk.download('stopwords')
import re
# I am using porterstemmer for stemming
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# building the English stop-word set once instead of recomputing it inside the loop
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(0, len(df)):
    # keep letters only, then lowercase and split into words
    review = re.sub('[^a-zA-Z]', ' ', df['message'][i])
    review = review.lower()
    review = review.split()
    # drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
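To sanity-check the cleaning, it helps to print one raw message next to its cleaned, stemmed version (illustrative only; the exact output depends on your data):
# comparing a raw tweet with its cleaned version
print('raw    :', df['message'][0])
print('cleaned:', corpus[0])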
To convert the words into vectors, I am using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
# I am using 1 to 3 ngram combinations
tfidf=TfidfVectorizer(max_features=10000,ngram_range=(1,3))
tfidf_word=tfidf.fit_transform(corpus).toarray()
tfidf_class=df['Class']
tfidf_word
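To peek at which uni-, bi-, and tri-grams TF-IDF kept, you can inspect the fitted vocabulary (a small optional check; get_feature_names_out requires scikit-learn 1.0 or newer):
# shape of the TF-IDF matrix and a few of the learned n-grams
print(tfidf_word.shape)
print(tfidf.get_feature_names_out()[:10])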
# importing necessary lib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
tfidf_class
To make the data suitable for the auto-encoder, I am using MinMaxScaler, which scales every feature into the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(tfidf_word)
X_bulli_scaled = X_scaled[tfidf_class == 0]
X_bearish_scaled = X_scaled[tfidf_class == 1]
tfidf_word.shape
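A quick check that the scaling behaved as expected (all values should lie in [0, 1]):
# verifying the MinMax-scaled range and the per-class shapes
print(X_scaled.min(), X_scaled.max())
print(X_bulli_scaled.shape, X_bearish_scaled.shape)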
I am using a standard auto-encoder network.
For the encoder and decoder layers I am using the 'tanh' activation function.
For the bottleneck and output layers I am using the 'relu' activation.
I am using an L1 activity regularizer in the encoder. To learn more about regularization check here.
# Building the Input Layer
input_layer = Input(shape =(tfidf_word.shape[1], ))
# Building the Encoder network
encoded = Dense(100, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(encoded)
encoded = Dense(25, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(encoded)
encoded = Dense(12, activation ='tanh',
                activity_regularizer = regularizers.l1(10e-5))(encoded)
# Bottleneck layer
encoded = Dense(6, activation ='relu')(encoded)
# Building the Decoder network
decoded = Dense(12, activation ='tanh')(encoded)
decoded = Dense(25, activation ='tanh')(decoded)
decoded = Dense(50, activation ='tanh')(decoded)
decoded = Dense(100, activation ='tanh')(decoded)
# Building the Output Layer
output_layer = Dense(tfidf_word.shape[1], activation ='relu')(decoded)
import tensorflow as tf
For training I am using the 'Adam' optimizer and 'BinaryCrossentropy' loss.
# Defining the parameters of the Auto-encoder network
autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer ="Adam", loss =tf.keras.losses.BinaryCrossentropy())
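Before training, it can help to check the layer shapes:
# printing the layer-by-layer structure of the auto-encoder
autoencoder.summary()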
# Training the Auto-encoder network to reconstruct its own input
# (keeping the returned History object so we can look at the loss curves later)
history = autoencoder.fit(X_bulli_scaled, X_bulli_scaled,
                          batch_size = 16, epochs = 100,
                          shuffle = True, validation_split = 0.20)
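Plotting the reconstruction loss stored in the History object returned by fit gives a quick check for over- or under-fitting:
# training vs. validation reconstruction loss
plt.plot(history.history['loss'], label = 'train loss')
plt.plot(history.history['val_loss'], label = 'validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()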
After training the neural network, we discard the decoder since we are only interested in the encoder and bottleneck layers.
In the code below, autoencoder.layers[0] is the input layer, autoencoder.layers[1] through autoencoder.layers[4] are the encoder layers, and autoencoder.layers[5] is the 6-unit bottleneck layer. We now build a new model from the input, encoder, and bottleneck layers.
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
hidden_representation.add(autoencoder.layers[3])
hidden_representation.add(autoencoder.layers[4])
hidden_representation.add(autoencoder.layers[5])
# Separating the points encoded by the Auto-encoder as bulli_hidden_scaled and bearish_hidden_scaled
bulli_hidden_scaled = hidden_representation.predict(X_bulli_scaled)
bearish_hidden_scaled = hidden_representation.predict(X_bearish_scaled)
Let us combine the encoded data into a single table.
encoded_X = np.append(bulli_hidden_scaled, bearish_hidden_scaled, axis = 0)
y_bulli = np.zeros(bulli_hidden_scaled.shape[0]) # class 0
y_bearish= np.ones(bearish_hidden_scaled.shape[0])# class 1
encoded_y = np.append(y_bulli, y_bearish)
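The TSNE import from earlier fits naturally here: projecting the low-dimensional encoded features down to 2-D gives a quick visual check of how well the two classes separate (an optional sketch; t-SNE on a few thousand points can take a minute or two):
# 2-D t-SNE projection of the auto-encoder features, coloured by class
tsne = TSNE(n_components = 2, random_state = 42)
encoded_2d = tsne.fit_transform(encoded_X)
plt.scatter(encoded_2d[encoded_y == 0, 0], encoded_2d[encoded_y == 0, 1], s = 5, label = 'Bullish')
plt.scatter(encoded_2d[encoded_y == 1, 0], encoded_2d[encoded_y == 1, 1], s = 5, label = 'Bearish')
plt.legend()
plt.show()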
Now we have the encoded data from the auto-encoder. This is simply feature extraction from the input data using the auto-encoder.
We can use these extracted features to train machine learning models.
# splitting the encoded data into train and test
X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(encoded_X, encoded_y, test_size = 0.2)
lrclf = LogisticRegression()
lrclf.fit(X_train_encoded, y_train_encoded)
# Storing the predictions of the linear model
y_pred_lrclf = lrclf.predict(X_test_encoded)
# Evaluating the performance of the linear model
print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_lrclf)))
# Building the SVM model
svmclf = SVC()
svmclf.fit(X_train_encoded, y_train_encoded)
# Storing the predictions of the non-linear model
y_pred_svmclf = svmclf.predict(X_test_encoded)
# Evaluating the performance of the non-linear model
print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_svmclf)))
from sklearn.ensemble import RandomForestClassifier
# Building the rf model
rfclf = RandomForestClassifier()
rfclf.fit(X_train_encoded, y_train_encoded)
# Storing the predictions of the non-linear model
y_pred_rfclf = rfclf.predict(X_test_encoded)
# Evaluating the performance of the non-linear model
print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_rfclf)))
import xgboost as xgb
# XGBoost classifier
xgb_clf=xgb.XGBClassifier()
xgb_clf.fit(X_train_encoded, y_train_encoded)
y_pred_xgclf = xgb_clf.predict(X_test_encoded)
print('Accuracy : '+str(accuracy_score(y_test_encoded, y_pred_xgclf)))
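Accuracy alone can hide per-class differences, so it is also worth looking at precision and recall, for example for the random forest predictions (optional):
from sklearn.metrics import classification_report
# per-class precision/recall/F1 for the random forest model
print(classification_report(y_test_encoded, y_pred_rfclf, target_names = ['Bullish', 'Bearish']))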
If you observe the accuracies above, the random forest gives good accuracy on the test data, so we can tune the RandomForestClassifier to get better accuracy.
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split ('auto' is no longer accepted by recent scikit-learn versions)
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3-fold cross validation,
# trying 25 different combinations, and using all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 25, cv = 3, verbose = 2, random_state = 42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train_encoded, y_train_encoded)
rf_random.best_params_
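RandomizedSearchCV refits the best estimator on the full training data by default, so the tuned model can be evaluated directly on the held-out encoded test set:
# evaluating the best random forest found by the random search
best_rf = rf_random.best_estimator_
y_pred_best = best_rf.predict(X_test_encoded)
print('Tuned RF accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_best)))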
These are probably not the best possible hyperparameters, since I ran only 25 iterations of the random search. Increasing n_iter lets the search explore more combinations and may find better ones.