Predict Stock Market with Daily Top News

Written by jameschen_78678 | Published 2017/02/01
Tech Story Tags: machine-learning | analytics | stock-market | naturallanguageprocessing | stock-market-performance


A classification approach with natural language processing in R

Audience

This short blog post is intended for readers who have a basic understanding of the natural language processing algorithm word2vec and are familiar with machine learning fundamentals, particularly classification techniques, and who want to see how the two are actually integrated in R. We will go over detailed data preparation as well as model optimization. It is also recommended to review this post for a brief introduction to word2vec.

Objective

The objective is to learn how the Dow Jones Industrial Average is affected by the daily top news on Reddit.

Dataset

The datasets can be found here. The simplified dataset consists of daily Dow Jones data and the top daily headlines from the Reddit WorldNews channel over the past eight years. The target variable is Label (0 means the index went down, 1 means it went up).

Screenshot of the stock news data set
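As a quick sanity check before modeling, it helps to load the simplified file and look at the class balance. A minimal sketch, assuming the file is named stock.csv as in the R code at the end of this post:

# Load the simplified data set and inspect the target variable
data <- read.csv("stock.csv")
str(data[, 1:3])               # first three columns (assumed: Date, Label, Top1)
prop.table(table(data$Label))  # share of down (0) vs. up (1) days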

Approach

We will use the word2vec algorithm to first create 100-dimensional word vectors from all the headlines. We will then transform the headlines into sentence vectors by averaging the word vectors of the words each sentence consists of. Finally, we will use a variety of classification algorithms to predict whether the Dow Jones Industrial Average will go up or down.
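The averaging step is the only nonstandard part, so here is a toy illustration of turning word vectors into a sentence vector (hypothetical five-word vocabulary with random vectors; the real pipeline appears in the R code below):

# Toy example: a matrix of word vectors, one 100-dimensional row per word
word_vecs <- matrix(rnorm(5 * 100), nrow = 5,
                    dimnames = list(c("stocks", "rise", "on", "trade", "news")))

# A "sentence" is averaged into a single 100-dimensional vector
sentence <- c("stocks", "rise", "on", "trade", "news")
sentence_vec <- colMeans(word_vecs[sentence, ])
length(sentence_vec)  # 100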

Data Preparation

1…Remove undesired characters, including punctuation, and convert all text to lower case

2…Remove English stop words such as “i,” “my,” “myself”

Screenshot of the data set with unwanted features removed

3…Feed the cleaned headlines into the word2vec model to obtain word vectors

Screenshot of the word vectors generated from cleaned headlines

4…Apply the word vectors to all headlines and average them into sentence vectors

5…Reduce the sentence vectors to 2-dimensional representations using t-SNE (a minimal t-SNE sketch follows below)

6…Create the final data set with the target variable and the 2-dimensional vectors

Screenshot of the finished data set
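Since step 5 leans on t-SNE, here is a minimal standalone example of reducing 100-dimensional vectors to two dimensions with Rtsne (random data, purely illustrative; the real call is in the R code at the end of this post):

library(Rtsne)

set.seed(42)
vecs <- matrix(rnorm(200 * 100), nrow = 200)  # 200 fake sentence vectors
tsne <- Rtsne(vecs, dims = 2, perplexity = 30, max_iter = 500)
plot(tsne$Y, xlab = "tsne-x", ylab = "tsne-y")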

Model Evaluation

Here we will use the caret package in R to evaluate the prediction accuracy of a variety of algorithms, including CART (rpart), logistic regression (glm), linear discriminant analysis (lda), k-nearest neighbors (knn), support vector machine with a radial basis function kernel (svmRadial), and extreme gradient boosting (xgbLinear).

All of the algorithms above turn out to have unsatisfying performance, with accuracy close to 50% and poor Kappa values. However, SVM with the radial basis function kernel tends to perform better than the others, with a 95% confidence interval on accuracy between 51% and 55%.

Performance on various classification algorithms
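For reference, a 95% confidence interval on accuracy like the one quoted above is essentially a binomial interval on the share of correctly classified observations; with hypothetical counts it can be reproduced like this:

# Hypothetical: 1,050 correct predictions out of 1,989 observations
binom.test(1050, 1989)$conf.int  # roughly 0.51 to 0.55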

As a result, we will explore a few optimization options.

Performance Optimization

1…Stacking Multiple Models

We will first try to stack all the models to see if performance improves. Stacking tends to help most when the base models' predictions are not highly correlated, which is checked with modelCor in the R code below.

Unfortunately, not much improvement is observed.

2…Averaging All Headlines

Instead of treating the top headlines as individual variables, we will average across all of them to create a single combined headline representation.

Performance on the algorithms with averaged headlines

A slight increase of 0.1% in mean accuracy is observed for the SVM algorithm.

3…Shifting Target Variable

We will try shifting the target variable back by 1–7 days, as daily news may have a lagged effect on stock performance. Alternatively, we could shift the target variable forward a couple of days; however, that approach would contradict the objective of predicting stock market performance from daily top news.

Performance of the algorithms with a 3-day lag on the target variable

We can observe that CART (rpart) now has the best performance in terms of mean accuracy, at 53.66%. Another slight increase of 0.1% in mean accuracy is also observed for the SVM algorithm.
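The shifting itself boils down to dropping the first k labels and the last k feature rows so that each day's headlines line up with the label from k days later. A compact version of the block in the R code section, assuming rows are in ascending date order:

# Pair features from day t with the label from day t + k (here k = 3)
k <- 3
n <- nrow(mydata)
mydata3 <- data.frame(label3day = mydata$Label[(k + 1):n],
                      mydata[1:(n - k), -1])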

4…Tuning Model Parameters

Here we will demonstrate how to tune the parameters of the CART and SVM algorithms to see if we can increase prediction accuracy.

CART

In rpart, the tuning parameter is cp, the complexity parameter: a split is kept only if it improves the overall fit by at least cp, so larger values produce smaller trees. We will examine the accuracy at different values of cp. The optimal value is found to be 0.15, as shown below.

SVM

In SVM with a radial basis function kernel, the tuning parameters are sigma (the kernel width) and C (the misclassification cost). We will examine the accuracy at different combinations of sigma and C. The optimal values are found to be sigma = 0.05 and C = 3, as shown below.
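Once the grids below have been run through caret's train function, the selected parameters can also be read directly from each fit (bestTune is a standard caret field):

rpartFit$bestTune  # e.g., cp = 0.15
svmFit$bestTune    # e.g., sigma = 0.05, C = 3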

R Code

library(devtools)
library(tm)
library(ggplot2)
library(text2vec)
library(rword2vec)
library(wordVectors)
library(tidyverse)
library(Rtsne)
library(caret)
library(mlbench)
library(caretEnsemble)

setwd("~/Desktop/stock")

# Simple visualization of the DJIA data
stock <- read.csv("DJIA_table.csv")
stock$Date <- as.Date(stock$Date)

ggplot(data = stock, aes(x = Date, y = Open)) +
  geom_line() +
  labs(title = "Dow Jones Industrial Average")

# Read the simplified data set
data <- read.csv("stock.csv")

# Remove columns not needed for natural language processing
text <- data

text$Date <- NULL
text$Label <- NULL

# Remove unwanted characters, punctuation, and upper cases
for (i in 1:25) {
  text[, i] <- gsub('b"', "", text[, i])
  text[, i] <- gsub("b'", "", text[, i])
  text[, i] <- gsub("[[:punct:]]", "", text[, i])
  text[, i] <- tolower(text[, i])
}

write.table(text,"text.txt")

# Stack all 25 headline columns into a single column named "x"
new_text <- subset(text, select = paste("Top", 1, sep = ""))
new_text <- setNames(new_text, "x")

for (i in 2:25) {
  new_text2 <- subset(text, select = paste("Top", i, sep = ""))
  new_text2 <- setNames(new_text2, "x")
  new_text <- rbind(new_text, new_text2)
}

write.table(new_text,"text.txt")

final_text <- new_text

# Remove stop words
stopWords <- stopwords("en")
'%nin%' <- Negate('%in%')

train <- lapply(final_text$x, function(x) {
  t <- unlist(strsplit(x, " "))
  t[t %nin% stopWords]
})

# Combine the tokenized headlines into a data frame (shorter headlines recycle
# words), then rebuild one string per headline from the remaining words
train.df <- as.data.frame(do.call(rbind, train))
train.df$x <- do.call(paste, c(train.df[, 1:45], sep = " "))

finaltext <- subset(train.df,select="x")

write.table(finaltext,"text_data.txt")

# Input words to word2vec
model <- train_word2vec("text_data.txt", output = "vec.bin", threads = 4,
                        vectors = 100, window = 12, min_count = 5,
                        iter = 10, force = TRUE)

# Plot the word vectors
plot(model)

# Convert binary to text format
bin_to_txt("vec.bin", "model1text.txt")

# Remove the first rows (the word2vec header) manually in the .txt file

# Convert the .txt to .csv
m1 <- read.table("model1text.txt", header = FALSE, quote = "",
                 row.names = NULL, stringsAsFactors = FALSE)

colnames(m1)[1] <- "word"

# Apply word vectors to headlines and return 2-dimensional representations with t-SNE
label <- subset(data, select = "Label")
labeltext <- cbind(label, text)
labeltext$id <- 1:nrow(labeltext)
labeltext$id <- as.factor(labeltext$id)

for (i in 2:24) {
  # Split headline column i into words, one row per (headline id, word) pair
  s <- strsplit(labeltext[, i], split = " ")
  new <- data.frame(V1 = rep(labeltext$id, sapply(s, length)), V2 = unlist(s))
  colnames(new) <- c("id", "word")

  # Attach the 100-dimensional word vectors and average them per headline
  new <- merge(new, m1, by = "word")
  new2 <- aggregate(V2 ~ id, new, mean)
  for (j in 4:102) {
    new3 <- aggregate(new[, j] ~ id, new, mean)
    new2 <- merge(new2, new3, by = "id")
  }
  colnames(new2)[2:101] <- paste("V", 1:100, sep = "")

  # Reduce each headline's sentence vector to 2 dimensions with t-SNE
  tsne <- Rtsne(new2, dims = 2, perplexity = 100, verbose = TRUE, max_iter = 500)
  t <- as.data.frame(tsne$Y)
  colnames(t) <- c(paste("top", i, "x", sep = ""), paste("top", i, "y", sep = ""))
  labeltext <- cbind(labeltext, t)
}

write.csv(labeltext,"labeltext.csv")

# Add back the Label information
mydata <- labeltext
mydata[, 2:27] <- NULL

mydata$Label[mydata$Label == 1] <- "Yes"
mydata$Label[mydata$Label == 0] <- "No"

# Classification models
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = TRUE, classProbs = TRUE)
algorithmList <- c('rpart', 'glm', 'lda', 'knn', 'svmRadial', 'xgbLinear')
set.seed(1)
models <- caretList(Label ~ ., data = mydata, trControl = control,
                    methodList = algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

# -----Performance Optimization #1: Stacking Multiple Models-----

# Model correlations
modelCor(results)
splom(results)

# Model stacking with GLM
stackControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                             savePredictions = TRUE, classProbs = TRUE)
set.seed(2)
stack.glm <- caretStack(models, method = "glm", metric = "Accuracy",
                        trControl = stackControl)
print(stack.glm)

# -----Performance Optimization #2: Averaging All Headlines-----
mydata2 <- mydata
mydata2$headline <- rowMeans(mydata2[, 2:45], na.rm = TRUE)
mydata2[, 2:45] <- NULL

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = TRUE, classProbs = TRUE)
algorithmList <- c('rpart', 'glm', 'lda', 'knn', 'svmRadial', 'xgbLinear')
set.seed(1)
models <- caretList(Label ~ ., data = mydata2, trControl = control,
                    methodList = algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

# -----Performance Optimization #3: Shifting Target Variable-----
shift <- subset(mydata, select = "Label")
shift1 <- as.data.frame(shift[-1, ])
shift2 <- as.data.frame(shift[-c(1, 2), ])
shift3 <- as.data.frame(shift[-c(1, 2, 3), ])
shift4 <- as.data.frame(shift[-c(1, 2, 3, 4), ])
shift5 <- as.data.frame(shift[-c(1, 2, 3, 4, 5), ])
shift6 <- as.data.frame(shift[-c(1, 2, 3, 4, 5, 6), ])
shift7 <- as.data.frame(shift[-c(1, 2, 3, 4, 5, 6, 7), ])

mydata1 <- mydata[-1989, ]
mydata2 <- mydata[-c(1988, 1989), ]
mydata3 <- mydata[-c(1987, 1988, 1989), ]
mydata4 <- mydata[-c(1986, 1987, 1988, 1989), ]
mydata5 <- mydata[-c(1985, 1986, 1987, 1988, 1989), ]
mydata6 <- mydata[-c(1984, 1985, 1986, 1987, 1988, 1989), ]
mydata7 <- mydata[-c(1983, 1984, 1985, 1986, 1987, 1988, 1989), ]

mydata1$Label <- NULL
mydata2$Label <- NULL
mydata3$Label <- NULL
mydata4$Label <- NULL
mydata5$Label <- NULL
mydata6$Label <- NULL
mydata7$Label <- NULL

mydata1 <- cbind(shift1, mydata1)
colnames(mydata1)[1] <- "label1day"
mydata2 <- cbind(shift2, mydata2)
colnames(mydata2)[1] <- "label2day"
mydata3 <- cbind(shift3, mydata3)
colnames(mydata3)[1] <- "label3day"
mydata4 <- cbind(shift4, mydata4)
colnames(mydata4)[1] <- "label4day"
mydata5 <- cbind(shift5, mydata5)
colnames(mydata5)[1] <- "label5day"
mydata6 <- cbind(shift6, mydata6)
colnames(mydata6)[1] <- "label6day"
mydata7 <- cbind(shift7, mydata7)
colnames(mydata7)[1] <- "label7day"

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = TRUE, classProbs = TRUE)
algorithmList <- c('rpart', 'glm', 'lda', 'knn', 'svmRadial', 'xgbLinear')
set.seed(1)
models <- caretList(label3day ~ ., data = mydata3, trControl = control,
                    methodList = algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

# -----Performance Optimization #4: Tuning Model Parameters-----
set.seed(3)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        savePredictions = TRUE, classProbs = TRUE)

# Tuning cp in CART
rpartGrid <- expand.grid(cp = c(0.01, 0.05, 0.1, 0.15))

rpartFit <- train(label3day ~ ., data = mydata3, method = "rpart",
                  trControl = control, tuneGrid = rpartGrid)

rpartFit
ggplot(rpartFit)

# Tuning sigma and C in SVM
svmGrid <- expand.grid(sigma = c(0.01, 0.05, 0.1, 0.15),
                       C = c(1, 3, 5, 7, 9))

svmFit <- train(label3day ~ ., data = mydata3, method = "svmRadial",
                trControl = control, tuneGrid = svmGrid)

svmFit
ggplot(svmFit)

Questions, comments, or concerns? jchen6912@gmail.com

