Exploring stock data with Jupyter

Kyle O Shea


As a former credit analyst, I have had to deal with numerous large data sets - historical stock prices, transactional data, etc. - and have typically analyzed such data with a general-purpose statistical software package, or even something as routine as Excel.

I’ve recently started learning data science with Python, and I wanted to try using Jupyter for some of my typical tasks.

I decided to stick with something familiar: an ARIMA(p,d,q) model, which can be used to test the predictability of stock prices based on past price movements - essentially a quick-fire way to determine whether the market is weak-form efficient.

I could run Jupyter on my local machine, or on a public cloud like AWS, but for this post I used the Kyso quickstart.

Once I opened the notebook, I ran the following code to import the libraries I’d need.


import numpy as np
import pandas as pd
from matplotlib import pyplot
from statsmodels.tsa.arima.model import ARIMA  # used to fit the model below
    

Then I load the dataset, a simple monthly time series of the SPY ETF, which tracks the S&P 500.


data = pd.read_csv('./spy_prices.csv', parse_dates=True, index_col=0)
    

Next we want to check whether the data is stationary (it won’t be - stock price data will usually have some trend), but let’s plot it for illustration purposes:


pyplot.rcParams['figure.figsize'] = (15, 9)
data.plot()
    
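
For a more formal check than eyeballing the plot, we could run an augmented Dickey-Fuller test - a minimal sketch using statsmodels’ adfuller, which wasn’t part of my original run (it assumes the price column is named ‘SPY’, as it is below):


from statsmodels.tsa.stattools import adfuller

# null hypothesis: the series has a unit root (i.e. is non-stationary)
stat, pvalue = adfuller(data['SPY'])[:2]
print('ADF statistic: %.3f, p-value: %.3f' % (stat, pvalue))
# a large p-value means we cannot reject non-stationarity
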

Then we look at the data’s autocorrelation (the correlation between the series and lagged copies of itself at different monthly intervals).


from pandas.plotting import autocorrelation_plot
autocorrelation_plot(data['SPY'])
    

  • We can see that there may be some positive correlation between any given data point and those at a number of preceding lags, so the series is not just white noise.
  • Next, we define a model by calling ARIMA() and passing in the p, d, and q parameters. The model is fitted to the training data by calling the fit() function, and predictions can be made by calling the predict() function with the index of the time or times to be predicted.
  • We run several models to determine which one fits the data best - a sketch of one way to do this follows - and the winner turns out to be an ARIMA(1,1,1).
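
One simple way to compare candidates is a small grid search over (p, d, q) orders scored by AIC - a sketch with arbitrary grid bounds, not necessarily the exact procedure I used:


import itertools

# try a small grid of (p, d, q) orders and keep the lowest-AIC fit
best_order, best_aic = None, float('inf')
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(data, order=(p, d, q)).fit()
    except Exception:
        continue  # skip orders that fail to fit
    if fit.aic < best_aic:
        best_order, best_aic = (p, d, q), fit.aic

print(best_order, best_aic)
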


data = pd.read_csv('./spy_prices.csv', parse_dates=True, index_col='date')
model = ARIMA(data, order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.summary())
    
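
As a quick illustration of the predict() call mentioned above (not part of my original run), in-sample one-step-ahead predictions can be pulled out by index:


# predictions for the last 12 observations of the sample
preds = model_fit.predict(start=len(data) - 12, end=len(data) - 1)
print(preds)
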

We also look at the model’s residuals to see how well it fits the data.


residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
residuals.plot(kind='kde')
pyplot.show()
print(residuals.describe())
    

There does appear to be trend information not captured by the model.

Next, we perform a rolling forecast, separating the data into train and test sets and effectively re-creating the model each time a new observation is received. We manually keep track of all observations in a list called history, which is seeded with the training data and to which new observations are appended at each iteration. We also import scikit-learn’s mean_squared_error function to measure the average of the squared errors. The model’s performance is demonstrated below, showing the expected values (blue) compared to the rolling forecast predictions (red).
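
A sketch of that loop, assuming a 66/34 train-test split (I’m picking the fraction for illustration) and the same ARIMA(1,1,1) order as above:


from sklearn.metrics import mean_squared_error

X = data['SPY'].values
size = int(len(X) * 0.66)  # assumed split fraction
train, test = X[:size], X[size:]

history = list(train)  # seeded with the training data
predictions = []
for t in range(len(test)):
    # re-create the model on all observations seen so far
    model_fit = ARIMA(history, order=(1, 1, 1)).fit()
    yhat = model_fit.forecast()[0]  # one-step-ahead forecast
    predictions.append(yhat)
    history.append(test[t])  # append the actual observation

print('Test MSE: %.3f' % mean_squared_error(test, predictions))

# expected values (blue) vs rolling forecast predictions (red)
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
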

This is obviously a very modest example, and there is so much more that can be done with the data - I could run more complex versions of the model (such as a seasonal ARIMA) or validate the existing model by producing and visualizing forecasts of future values of the time series outside of the above timeline.
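
For instance, a minimal sketch of that last idea - producing out-of-sample forecasts with statsmodels’ get_forecast, which wasn’t part of the study above:


# re-fit on the full series and forecast 12 months ahead
full_fit = ARIMA(data, order=(1, 1, 1)).fit()
forecast = full_fit.get_forecast(steps=12)
print(forecast.predicted_mean)
print(forecast.conf_int())  # 95% confidence intervals by default
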

But the study accomplished my goal - to see how easy it can be to run a (simple) predictive model in the Jupyter environment, in this case on Kyso’s cloud platform.

I was able to open a Jupyter notebook with a single click, and to formulate and share my model with others, who can now clone and run their own versions of the model with other time-series data sets. I can also tweak the code a bit to allow for an interactive graph and deploy the study as a Jupyter application. Others can then pass different ARIMA(p,d,q) parameters through the model to determine whether other combinations better fit and predict the data.