
High Frequency trading strategy using LSTM model

Teammates:

Suwen Wang: suwenw2@illinois.edu

LinkedIn

Suwen is pursuing a Bachelor's at the University of Illinois at Urbana-Champaign, majoring in Statistics & Computer Science. He will graduate in May 2023 and start his Master's in Computer Science in Fall 2023.

He has experience in programming, version control, and data analysis.

Pallavi Prakash: pallavi8@illinois.edu

LinkedIn

Pallavi is pursuing a Master's at the University of Illinois at Urbana-Champaign in the Industrial & Enterprise System Engineering Department majoring in Financial Engineering. She will graduate in December 2022.

She has experience in quantitative research, risk management, and data analysis.

Project Description: Presentation Video

This report is for the semester-long project for the course "FIN556 - Algorithmic Market Microstructure", taught by Prof. David Lariviere.

The scope of the project was to develop a trading strategy for high-frequency trading. We developed a Long Short-Term Memory (LSTM) model to predict stock prices and built a buy/sell trading strategy on top of the price predictions. We have divided the project into 5 sections: strategy initialization, data gathering, model development and implementation, strategy implementation in Strategy Studio (a proprietary software from RCM-X), and performance analysis.

1. Strategy Initialization :

As the initial step, we explored time series modeling methods. We reviewed SMA, EWMA, and ARIMA models for stock price prediction. As the strategy evolved, we also looked into deep learning models and chose LSTM models for predicting stock prices.

2. Data Gathering :

We used data from Yahoo Finance when developing the model in Python, because of its accessibility. We also retrieved over five years of data from IEX Exchange for later backtesting in the Strategy Studio.

3. Model Development and Implementation :

We first developed an LSTM model using PyTorch and TensorFlow in Google Colab. Our model was backtested with 1, 3 and 5 days of data and adjusted according to the result.

4. Strategy Implementation in Strategy Studio :

We implemented the SMA trading strategy and loaded the LSTM model in C++. We further incorporated the model and strategy in Strategy Studio together for later backtesting.

5. Performance Analysis :

We backtested our strategy in both Python and Strategy Studio. Backtesting in Python is easier, but it cannot handle a large amount of data. The process in Strategy Studio is more complicated, but years of data can be backtested at once.


Git Repo Layout :

├── Model
│   ├── LSTM_Python
│   │   ├── LSTM_PyTorch.ipynb
│   │   └── LSTM_Keras.py
│   └── TorchScript_Conversion
│        ├── conversion.py
│        ├── 20221128_trades.csv
│        └── my_module_model.pt
├── Strategy_Studio_Implementation
│   ├── Group5Strategy
│   │   ├── SMA_Crossover
│   │   │   ├── Group5Strategy.h
│   │   │   ├── Group5Strategy.cpp
│   │   │   └── Makefile
│   │   └── LSTM_SMA_Crossover
│   │       ├── Group5Strategy.h
│   │       ├── Group5Strategy.cpp
│   │       ├── libtorch
│   │       │   └── ... 
│   │       ├── Makefile
│   │       └── lstm_model.pt
│   ├── performance_analysis
│   │   ├── LSTM.ipynb
│   │   └── SMA.ipynb
│   └── ss_backtest_output
│       ├── SMA_Crossover
│       │   └── SPY 
│       │       ├── 20220103_20220103
│       │       ├── 20220101_20221202
│       │       └── 20210101_20211231
│       └── LSTM_SMA_Crossover
│           ├── SPY
│           │   ├── 20220101_20221202
│           │   └── 20210101_20211231
│           └── DIA
│               ├── 20220101_20221202
│               ├── 20210101_20211231
│               └── 20170515_20201231
├── Python_Backtesting
│   ├── LSTM_Backtesting.ipynb
│   └── results
│       ├── PNL.csv
│       ├── Output_with_Strategy.csv.zip
│       └── output_graphs
│           └── ...
├── Images
│   └── ...
└── Documentation
    └── Final_Report.md

Background :

Time-series modeling:

Time Series models are used to predict future values based on past observed values. These models are used extensively in the industry to predict future values like macroeconomic variables, stock prices over time, house prices, etc. There are many methods to compute time series forecasting:

1) Moving Average : This model simply predicts the next observation based on the mean of past values.

2) Exponential Smoothing : It is similar to the Moving Average, but assigns decreasing weights to past observations. The smoothing factor alpha ranges from 0 to 1 and determines how quickly the weights decrease as we go further into the past.

y_t = \alpha x_t + (1-\alpha) y_{t-1}, t > 0

3) Autoregressive Integrated Moving Average Model (ARIMA): It combines simple models into a more complex model that can represent time series exhibiting non-stationarity and seasonality. It comprises 3 parts:

Autoregressive model (AR(p)): It is a regression of the time series on itself. Its current value depends on its previous values with some lag. The parameter p represents the maximum lag.

Moving average model (MA(q)): It is the weighted sum of lagged forecast errors of the time series.

The Integrated (I) part is the differencing of the time series.
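The first two methods, and the differencing step behind the "I" in ARIMA, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's code: the function names are hypothetical, and seeding the smoothing recursion with the first observation is an assumption, since the formula above only defines the update for t > 0.

```python
def moving_average_forecast(values, window):
    """Predict the next observation as the mean of the last `window` values."""
    if window <= 0 or len(values) < window:
        raise ValueError("need at least `window` observations")
    recent = values[-window:]
    return sum(recent) / window


def exponential_smoothing(values, alpha):
    """Return the smoothed series y_t = alpha * x_t + (1 - alpha) * y_{t-1}."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be in [0, 1]")
    smoothed = [values[0]]  # assumption: seed the recursion with y_0 = x_0
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def difference(values):
    """First-order differencing, the 'Integrated' step of ARIMA."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


prices = [10.0, 12.0, 14.0, 16.0]
sma_next = moving_average_forecast(prices, window=2)  # mean of 14 and 16
smoothed = exponential_smoothing(prices, alpha=0.5)
diffs = difference(prices)  # a steady trend becomes a constant series
```

A larger alpha makes the smoothed series track recent prices more closely, while a smaller alpha gives more weight to history.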

Neural networks and Deep learning models:

Neural networks, also known as artificial neural networks (ANNs), are a subset of machine learning and the basis of deep learning models. ANNs are usually composed of node layers: one input layer, one or more hidden layers, and one output layer. Each node, an artificial neuron, is connected to other nodes and has weights and a threshold assigned to it. Recurrent Neural Networks (RNNs) save the output of a particular layer and feed it back into the system as input to predict the output. These neural networks use datasets to train the model, and accuracy generally improves with more training iterations.

Long Short-Term Memory (LSTM) models are a special kind of RNN that can remember information for a long period. An LSTM cell works as follows:

Step 1 - Decide how much past information to remember: The LSTM has a forget gate (f_t) that decides which information from the previous time step should be discarded. A sigmoid function determines it.

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

Step 2 - Decide how much this unit adds to the current state: The second layer is an input gate that comprises two parts, a sigmoid function and a tanh function. The sigmoid function decides which values to let through by assigning each a weight between 0 and 1, and the tanh function produces candidate values ranging from -1 to 1 that express how important each value is.

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)

The cell state is then updated by combining the forget gate with the input gate and candidate values:

C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t

Step 3 - Decide which part of the current cell state makes it to the output: First, a sigmoid layer decides which part of the cell state goes to the output. Then the cell state is passed through tanh to push its values between -1 and 1, and the result is multiplied by the output of the sigmoid gate.

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \ast \tanh(C_t)
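The three steps can be traced numerically with a minimal sketch of a single LSTM time step, written for a scalar input and a scalar hidden state so that each weight matrix reduces to two numbers. The `params` layout and the all-zero parameter values are illustrative assumptions, not the trained model. The cell-state update `c_t = f_t * c_prev + i_t * c_tilde` is the standard LSTM rule that links step 2 to step 3.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a scalar input and scalar hidden state.

    `params` maps each gate name ("f", "i", "c", "o") to a tuple
    (w_h, w_x, b): the weights applied to h_{t-1} and x_t, plus the bias.
    """
    def pre_activation(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))        # step 1: forget gate
    i_t = sigmoid(pre_activation("i"))        # step 2: input gate
    c_tilde = math.tanh(pre_activation("c"))  # step 2: candidate values
    c_t = f_t * c_prev + i_t * c_tilde        # updated cell state
    o_t = sigmoid(pre_activation("o"))        # step 3: output gate
    h_t = o_t * math.tanh(c_t)                # step 3: new hidden state
    return h_t, c_t


# Hypothetical parameters: all weights and biases are zero, so every sigmoid
# gate equals 0.5 and the candidate value is tanh(0) = 0.
zero_params = {gate: (0.0, 0.0, 0.0) for gate in ("f", "i", "c", "o")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

With these parameters the forget gate keeps half of the previous cell state and nothing new is added, which makes the role of each gate easy to check by hand.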


Technologies used :

Programming Languages:
  1. C++ : We used C++ to load the saved Python model and run it on a large amount of data. Since Strategy Studio provides a C++ interface, we implemented our strategies on the platform in C++.

  2. Python 3.7.11 : We used Python to code and train the LSTM model and for analysis and visualization. We used the following libraries:

    2.1) PyTorch : https://pytorch.org/

    2.2) TensorFlow : https://www.tensorflow.org/

    2.3) Matplotlib : https://matplotlib.org/

    2.4) Pandas : https://pandas.pydata.org/

    2.5) Numpy : https://numpy.org/

    2.6) Keras : https://keras.io/

    2.7) Sklearn : https://scikit-learn.org/stable/

Software:
  1. Strategy Studio from RCM-X for implementing and backtesting our strategies with the market data.

  2. LibTorch: The C++ SDK for PyTorch. We trained models in Python and exported them as a model file. We then read the model file in the C++ strategy code and ran predictions.

  3. Jupyter Notebook/Google Colab: development tools for Python that are commonly used in data analytics. We used them to train the LSTM model and to write our Python code.

Pipeline Frameworks :
  1. GitLab is used for version control and for managing and tracking project progress.

  2. VirtualBox/Vagrant : VirtualBox is used for creating virtual machines. Vagrant uses VirtualBox to launch VMs so that the environment stays consistent across runs of the project. Both are used to set up the virtual environment. The Vagrant box contains Strategy Studio and the environment files needed to run the project.

Data Sources
  1. IEX Exchange : We imported data directly from IEX using the professor's IEX parser module.
  • The deep and trade data downloaded from the IEX Exchange was fed to the Strategy Studio for later backtesting.

  • Use command cd IexDownloaderParser to change to the correct directory. Then run ./vm_go.sh to download and parse market data from IEX Exchange automatically.

  • Update the range of deep data to download by running the command vim download.sh

    • Modify the start-date and end-date argument in python3 src/download_iex_pcaps.py --start-date 2022-08-01 --end-date 2022-08-01 --download-dir data/iex_downloads.

    • The downloaded raw IEX DEEP data should be stored at iexdownloaderparsers/data/iex_downloads/DEEP in tick_SYMBOL_YYYYMMDD.txt.gz format.

    • Run ./download.sh to download source data only, without parsing.

  • Update the range of deep data to parse by running the command vim parse_all.sh

    • Update the company symbols by editing --symbols argument in line gunzip -d -c $pcap | tcpdump -r - -w - -s 0 | $PYTHON_INTERP src/parse_iex_pcap.py /dev/stdin --symbols SPY --trade-date $pcap_date --output-deep-books-too.

    • Update the deep data to parse by specifying the data in line for pcap in $(ls data/iex_downloads/DEEP/*gz). For example, for pcap in $(ls data/iex_downloads/DEEP/*202105gz) parses all source data for the chosen company in May 2021.

    • The parsed trade and order update data should be stored at iexdownloaderparsers/data/text_tick_data in tick_SYMBOL_YYYYMMDD.txt.gz format.

    • Run ./parse_all.sh to parse the selected source data without downloading new data.

  • Command ./vm_go.sh will run ./download.sh and ./parse_all.sh automatically.

  2. Yahoo Finance : We used SPY stock data for initial coding in Python.

Model Development and Implementation

LSTM model development in python :

Data Preparation and cleaning:

Fetching price data: As an initial data step, we imported data from the trade book and order book update files. Running the Python code on the order book updates was difficult because of the large amount of data, and most of the orders in the book updates were never executed as trades. We therefore decided to use prices from the trade book data for our analysis.

Stock_prices

Normalizing raw data: The LSTM is trained with gradient descent, which works best when the price data is normalized. The scale of a feature affects the step size of gradient descent, which could skew the results of the LSTM. Normalization therefore improves the accuracy of the model and helps gradient descent converge more quickly to the minimum. Putting the whole dataset on the same scale also helps reduce variance and improves the efficiency of the LSTM.
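As a minimal sketch of this step, the snippet below scales prices into the range 0 to 1 and maps them back. Min-max scaling is an assumption made for illustration; the report does not state which normalization was used, and the function names are hypothetical.

```python
def fit_min_max(values):
    """Return the (minimum, maximum) pair used to scale a series."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return lo, hi


def scale(values, lo, hi):
    """Map raw prices into the range [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(values, lo, hi):
    """Map scaled values back to the original price range."""
    return [v * (hi - lo) + lo for v in values]


prices = [400.0, 410.0, 405.0, 420.0]
lo, hi = fit_min_max(prices)
scaled = scale(prices, lo, hi)
restored = unscale(scaled, lo, hi)
```

The same `lo` and `hi` have to be kept and reused at prediction time, because the model's outputs are on the scaled range and only become prices again after `unscale`.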

Splitting data into training and test datasets: The LSTM learns the mapping function from the input variables (X) to the output variable (Y). The training dataset is used by the model for learning, and the test dataset is used for validation. We tested many look-back periods to see which would be best for predicting prices with the LSTM model. Based on our analysis, a look-back period of 60 is optimal.

Predicted price based on lookback periods

After transforming the dataset into input features and output labels, the shape of X (the input variable) is (x, 60, 1), where each of the x rows represents a sequence of 60 past prices. The corresponding Y data has shape (x, 1), with the same number of rows. We split the dataset 80:20: 80% of the data is used for training and the remaining 20% is used to verify the model's performance in predicting future prices.
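The windowing and the 80:20 split can be sketched with plain Python lists. The function names are hypothetical, and splitting in time order (earliest 80% for training) is an assumption, since the report does not say how the rows were assigned.

```python
def make_windows(series, look_back):
    """Return (X, y): each X row holds `look_back` consecutive values and
    y holds the single value that immediately follows that row."""
    X, y = [], []
    for start in range(len(series) - look_back):
        window = series[start:start + look_back]
        X.append([[v] for v in window])  # trailing feature dimension of size 1
        y.append([series[start + look_back]])
    return X, y


def chronological_split(X, y, train_fraction=0.8):
    """Assumption: keep time order, with the earliest rows used for training."""
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]


series = [float(i) for i in range(160)]  # stand-in for scaled prices
X, y = make_windows(series, look_back=60)
X_train, y_train, X_test, y_test = chronological_split(X, y)
```

With 160 prices and a look-back of 60 this yields 100 rows, so X has shape (100, 60, 1) and y has shape (100, 1), of which 80 rows go to training and 20 to testing.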

LSTM model :

The LSTM model is a specialized recurrent neural network that can memorize patterns from historical sequences of data points and predict such patterns for future events. An LSTM can learn long sequences of data by enforcing constant error flow through self-connected hidden layers containing memory cells and corresponding gate units. We created models using the PyTorch and TensorFlow/Keras libraries.

In our model, we have used 2 main layers :

  1. LSTM : learn the data sequence
  2. Linear layer to produce the predicted value based on LSTM's output

Model Training:

We make the LSTM model learn by making predictions iteratively on the training dataset X. We used MSE (mean squared error) as the loss function, which measures the difference between predicted and actual values. If the model is making bad predictions, the MSE value will be high. The model is fine-tuned by changing its weights through backpropagation, which improves the quality of predictions. We also used the Adam optimizer, which updates the model parameters based on the learning rate through its step function. We ran the model for a large number of epochs until the MSE converged to a negligible value.
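The training loop described above follows the usual pattern: predict, measure the MSE, compute the gradient, update the weights. The sketch below shows that pattern on a deliberately tiny model, y = w * x, using plain gradient descent in place of Adam so that it runs without any libraries. The model, data, and hyperparameters are illustrative assumptions and not the project's settings.

```python
def mse(predicted, actual):
    """Mean squared error between two equally long sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def train_scale_factor(inputs, targets, learning_rate=0.05, epochs=200):
    """Fit y = w * x by plain gradient descent on the MSE loss."""
    w = 0.0
    losses = []
    n = len(inputs)
    for _ in range(epochs):
        predictions = [w * x for x in inputs]
        losses.append(mse(predictions, targets))
        # d(MSE)/dw = (2 / n) * sum((w * x - y) * x)
        grad = 2.0 / n * sum(
            (p - t) * x for p, t, x in zip(predictions, targets, inputs)
        )
        w -= learning_rate * grad  # step against the gradient
    return w, losses


inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]  # generated with w = 2
w, losses = train_scale_factor(inputs, targets)
```

The recorded `losses` shrink toward zero as `w` approaches 2, which is the same convergence behaviour one watches for when plotting MSE per epoch.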

MSE losses

Model evaluation :

To evaluate model performance, we used the trained model to predict prices on the test dataset. We then plotted the predicted values for the training and test sets against the actual prices and found them very close. We increased the number of epochs to 300 to improve the accuracy of the predicted stock prices. Model training and evaluation is an iterative process: we had to fine-tune the model and re-evaluate it repeatedly to improve performance.

Model Evaluation

Predicting stock prices :

With the trained model, we can predict the next values reasonably well from historical data points. We noticed that the average time gap between trades in the trade files is 2 seconds, so 1 minute corresponds to 30 timesteps. We used 2400 historical prices (about 80 minutes at that spacing) to forecast stock prices for the next 10 minutes.

Forecasted Stock Price

Loading the model into C++ for strategy studio integration :

We implemented our model in C++ by loading a TorchScript model, using the following steps:

  1. We converted our model to TorchScript via annotation. Since the forward method uses control flow that depends on the input, it is not suitable for tracing. We therefore converted the module to a ScriptModule and compiled it via torch.jit.script.
  2. We serialized the script module to a file with a .pt extension. The file can then be loaded in C++ and executed without any dependency on Python.
  3. We loaded the script module in C++ using the LibTorch library. The LibTorch distribution consists of a collection of shared libraries, header files, and CMake build configuration files.
  4. We executed the script module in C++ by passing the input prices as tensors to model.forward.

Strategy Studio Implementation:

Simple Moving Average Crossover Strategy

The first trading strategy we implemented was the traditional Simple Moving Average Crossover Strategy. We created a small window of size 40 and a big window of size 100. For each book update, our strategy updates the averages of the small and big windows. Since "books updates are dramatically more common than actual trades" (Prof. David Lariviere), we decided to use TradeDataEventMsg::instrument::topquote::ask and TradeDataEventMsg::instrument::topquote::bid to get the input price by taking the average of the two. If the average of the small window is greater than that of the big window, our strategy sends an order to buy 100 shares of the selected security. If the average of the small window is smaller than that of the big window, it sends an order to sell 100 shares. However, the trade action is only triggered if we hold at least 1 share. In all other situations, our strategy holds and does not trade.
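The decision rule can be sketched in Python as a small streaming class. This is an illustration of the logic only, not the Strategy Studio C++ code: the class and method names are hypothetical, waiting for the big window to fill before trading is an assumption, and the "at least 1 share" condition is read here as applying to sell orders.

```python
from collections import deque


class SmaCrossover:
    """Streaming SMA crossover signal on the bid/ask midpoint."""

    def __init__(self, short_window=40, long_window=100, lot_size=100):
        self.short = deque(maxlen=short_window)  # small window
        self.long = deque(maxlen=long_window)    # big window
        self.lot_size = lot_size
        self.position = 0  # shares currently held

    def on_quote(self, bid, ask):
        mid = (bid + ask) / 2.0  # input price: average of best bid and ask
        self.short.append(mid)
        self.long.append(mid)
        if len(self.long) < self.long.maxlen:
            return "HOLD"  # assumption: wait until the big window is full
        short_avg = sum(self.short) / len(self.short)
        long_avg = sum(self.long) / len(self.long)
        if short_avg > long_avg:
            self.position += self.lot_size
            return "BUY"
        if short_avg < long_avg and self.position > 0:
            self.position -= self.lot_size  # assumption: sell only if holding
            return "SELL"
        return "HOLD"


# Tiny windows so the behaviour is visible on six quotes.
strategy = SmaCrossover(short_window=2, long_window=4)
mids = [1.0, 2.0, 3.0, 4.0, 3.0, 1.0]
signals = [strategy.on_quote(m - 0.5, m + 0.5) for m in mids]
```

On this toy series the first three quotes only fill the windows, the rising prices trigger buys, and the final drop triggers a sell once the small-window average falls below the big-window average.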

Simple Moving average Crossover Strategy with LSTM

After implementing the SMA Crossover strategy described above, we improved it by incorporating the LSTM model. The main change in the new strategy is the input price for the small and big windows. Previously, the input was the average of the highest bid price and the lowest ask price. In this strategy, we feed that average price to our LSTM model and get the predicted price as the output. We then update the small and big windows with the predicted price instead of the average price on each tick. However, since our LSTM model normalizes its input data, the predicted price coming out of the model was not scaled back to the original price range. As a result, the predicted price is used only to decide the trade action (buy, sell, hold).

Backtesting and Analysis :

For the analysis, we used the following metrics to review the performances of backtesting results.

Performance metrics
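Two of the metrics reported below, the Sharpe ratio and the max drawdown, are commonly computed as in the following sketch. The conventions here are assumptions for illustration (sample standard deviation, an annualization factor of 252 trading days, drawdown measured as a fraction of the running peak) and may differ from those used by Strategy Studio or by our notebooks.

```python
import math


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a list of per-period returns."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("returns have zero variance")
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a negative fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = min(worst, (value - peak) / peak)
    return worst


equity = [100.0, 120.0, 90.0, 110.0]  # hypothetical account values
mdd = max_drawdown(equity)            # drop from 120 to 90
daily_returns = [0.01, -0.01, 0.03, 0.01]  # hypothetical daily returns
sharpe = sharpe_ratio(daily_returns)
```

A negative Sharpe ratio, as in some of the results below, simply means the average excess return over the period was negative.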

Backtesting in python for Trade updates

Because of some technical difficulties in implementing the LSTM in Strategy Studio, we decided to backtest the LSTM Python code with 1 year of trade updates for AAPL. We manually backtested the model 2-3 days at a time and generated predicted prices for the whole year. The generated prices were fairly accurate, and the MSE between actual and predicted prices was around 1.4 dollars.

Price Comparison

Based on the predicted prices, we implemented a moving average crossover strategy that buys or sells 100 shares at a time. The small_moving_average price is calculated as the 40-day moving average of the predicted price, and the long_moving_average price as the 100-day moving average of the predicted price. The trade indicator becomes +1 (buy signal) when SMA_40 > SMA_100, a bullish position, and -1 (sell signal) when SMA_40 < SMA_100, a bearish position.

Strategy sample

We calculated the profit and loss for the year based on this strategy; our total net loss was $23,084.19. The average trade net profit is -0.75. The maximum daily profit is $44,200.40 and the maximum daily loss is -$29,818.51. The Sharpe ratio is -0.073 and the max drawdown is -1.67.

Backtesting analysis of Simple Moving Average cross over strategy :

We developed a simple moving average crossover strategy that buys or sells 100 shares at a time. We calculated the small-window SMA over 40 days and the large-window SMA over 100 days. We buy 100 shares when SMA_40 > SMA_100, a bullish position, and sell 100 shares when SMA_40 < SMA_100, a bearish position. We backtested this strategy on 2 years of data, 2021 and 2022.

For 2021, our strategy was profitable. The cumulative profit was USD 192,411,296.1059 with an initial investment of USD 1,000,000. The Sharpe ratio was 3.2802, the Sortino ratio 35.2323, and the max drawdown -1.93%.

SMA 2021 Results

For 2022, our strategy was not profitable. The cumulative loss was USD 131,377,007.3707 with an initial investment of USD 1,000,000.

SMA 2022 Results

Backtesting analysis of LSTM and Simple Moving Average crossover strategy :

We combined our LSTM model with the Simple Moving Average crossover strategy above. We backtested this model with SPY data for 2021 and 2022 and found that it performed better than the plain simple moving average crossover strategy: both profits and losses were smaller in magnitude. For 2021, our strategy was profitable. The profit was USD 38,380,543.520, which is less than the USD 192 million profit generated by the strategy above. The results are as follows:

lstm 2021 Results

We also backtested our model with 2022 data and found a much smaller loss than the simple moving average strategy produced for 2022. The cumulative loss was USD 4,494,300.84 with an initial investment of USD 1,000,000, compared to the USD 131 million loss from the strategy above.

lstm 2022 Results

All the other ratios are comparable for both strategies on SPY data.

Additionally, we backtested our LSTM strategy on DJIA for five years. We first backtested 2022 and found that the strategy lost 137 million USD, as expected, but the strategy was profitable in 2021, with a cumulative profit of around 1.4 million USD. The total net loss was 173.66 million USD over the five years from 2017 to 2022.

If we compare the LSTM model's performance on SPY and DJIA, the results for SPY were better than for DJIA in both 2021 and 2022.

Conclusion:

We can conclude that both strategies follow the price trend in the market. Both strategies earned a profit in 2021, but losses were high for both on SPY in 2022, a year in which the market itself did not perform well. We would have to change the strategy to overcome these losses. We could change the time windows of the SMA crossover strategy, because it does not produce a profit during a downtrend, and add a stop-loss condition to limit losses. We should also explore other strategies to improve model performance.

Reflections

Pallavi Prakash :

  • What did you specifically do individually for this project?

I researched the trading strategy for FIN556. We used time series models for the strategy. Initially, I analyzed these models, then implemented the LSTM model for predicting stock prices and the SMA crossover strategy for buying and selling. I first coded the LSTM model with the Keras and TensorFlow libraries. After review and further discussion of the problems with integrating it into C++, I coded the LSTM model again in Python using PyTorch. I also manually backtested the LSTM model in Python on data for the year 2022 and did the analysis in Python. For documentation, I wrote the whole final report, including outlining the parts to be written by my teammate.

  • What did you learn as a result of doing your project?

This project was a great learning experience for me. I learned about time series modeling and machine learning. I had never worked with GitLab or Strategy Studio before. Initially, working with GitLab was confusing, but it helped us as a team to do version control on our project. I also learned a lot about working with the Keras, TensorFlow, and PyTorch libraries in Python. It was difficult for me to code using the PyTorch library in C++, but in the end I was able to do it successfully. Last but not least, I also learned to work with Strategy Studio.

  • If you had a time machine and could go back to the beginning, what would you have done differently?

I would have liked to work in a bigger group. Since there were just 2 of us, it was difficult to find solutions to the various roadblocks we faced in the project. I felt that we should have chosen an easier trading strategy rather than implementing a machine learning algorithm. We had to speed up the project in the last month and faced issues converting the Python model to C++ for the Strategy Studio implementation. This was one of the most difficult problems we faced during the project.

  • If you were to continue working on this project, what would you continue to do to improve it, how, and why?

I would like to work on implementing the LSTM strategy in Strategy Studio. We have already implemented the Python code in C++, but our integration in Strategy Studio faced a lot of issues because of this mechanism. So I would like to work on the implementation and then backtest the model on 2-3 years of data. I would also like to improve and further tune the LSTM model. Better strategies and meaningful analysis are things I would like to continue working on. I believe my strategy could be further improved using machine learning algorithms like GRU, and I would like to explore other machine learning algorithms for improving model performance. Besides, I would like to improve my coding skills in analysis and visualization to improve the performance analysis of results.

  • What advice do you offer to future students taking this course and working on their semester long project. Providing detailed thoughtful advice to future students will be weighed heavily in evaluating your responses.

The first piece of advice I would give is to plan the project well in advance and develop a strong DevOps and project pipeline for a semester-long project. It will help the work progress more smoothly and improve project efficiency. Another piece of advice is to look for strategies whose code can be integrated into Strategy Studio without too many problems. Look out for logistical and execution problems that you could face near the end of the project. Lastly, I would recommend never hesitating to ask questions of both your team members and your classmates. They would be happy to give you insights into your problems, and you could learn something from their projects.

Suwen Wang :

  • What did you specifically do individually for this project?

I retrieved all 5 years of market data using the IEX parser. I also worked on everything related to C++ and Strategy Studio. Originally, our LSTM model was coded in Python using PyTorch, which was not compatible with Strategy Studio. As a result, I converted our PyTorch model to TorchScript, which could then be compiled and executed in C++ through a series of serializing, loading, and deserializing steps. I also implemented the SMA crossover strategy in Strategy Studio and incorporated the LSTM model into the strategy. Additionally, I backtested our strategy in Strategy Studio with 1 year of SPY data and converted the resulting .cra file to human-readable .csv files.

  • What did you learn as a result of doing your project?

This project taught me numerous things, especially about machine learning. I had no prior experience in ML before doing this project, so I struggled badly at first. However, my partner helped me greatly, and in the end I became confident in understanding and using machine learning and its libraries such as PyTorch. Being the team leader, I also became more familiar with project management and version control. By loading the PyTorch model and implementing trading strategies in C++, I also became better at C++ programming than before. Additionally, after working with Strategy Studio a lot, I now have a good understanding of how it works.

  • If you had a time machine and could go back to the beginning, what would you have done differently?

I would start with a simpler strategy. Even though LSTM is a good option for the project, it is hard, especially for a group of 2 whose members aren't that familiar with machine learning. We spent a lot of time creating the LSTM model in Python and even more time on how to use the model in C++. Even though we finally came up with a solution and backtested our strategy in Strategy Studio, we didn't have time to backtest all 5 years of data across different companies. If we had started with a simpler strategy in Python and then made it work in Strategy Studio, it might have saved us time and let us backtest our strategy with more data.

  • If you were to continue working on this project, what would you continue to do to improve it, how, and why?

I will definitely backtest our strategy with more data, specifically more years and more companies. I have already downloaded all data from May 2017 to December 2022 and parsed both trade and book update data for 5 different securities. However, due to the lack of time, I didn't backtest the strategy with all the data I collected, which may lead to some bias in the backtest result. In addition to that, I would like to improve our trading strategy. SMA crossover is a good strategy to use, but I would like to search for other strategies and combine them, which should yield even better performance. Last but not least, I want to keep working on the machine learning part to see whether any improvement can be made to our model.

  • What advice do you offer to future students taking this course and working on their semester long project. Providing detailed thoughtful advice to future students will be weighed heavily in evaluating your responses.

A detailed plan will be a good start. In the proposal at the beginning of the semester, they should come up with a firm deadline for each part of the project and a backup plan for any possible failure. If they cannot finish a certain task in a given period, they should find ways to move forward instead of being stuck there for a long time. If working on Strategy Studio or something similar, I highly suggest starting on it as early as possible so that there will be plenty of time to look for solutions and even reboot the machine. It will also give them enough time to do backtesting, since that process is time-consuming.