Time Series (State Space Models)

Time series from scratch

Posted by SV on Tuesday, November 10, 2020


Time Series & Forecasting (state space models)

There has always been an interest in looking at things over time: how crops grow, how a marketing campaign went, or how the stock market developed. In hindsight we can look back at events to try to understand why they happened and then make an estimate of what will happen in the future. Or, if there are several events that might unfold, we can use probabilities to find the most likely outcome.

The same goes for animals: they also need to look at what happened previously and make assumptions about what is going to happen next. In many cases it is key to survival.

In this post I will go through methods we can use to build models over a time span, and then use that knowledge to make assumptions about the future. This is different from the focus of the previous blog post, Regression, where we took a cross-sectional approach, looking at different groups or things at a single moment in time.

If we combine the two, time series and cross-sectional, we get what are called panel data methods. Panel data is hard to model, as we need to take into account the validity checks of both the time-series and the cross-sectional approach. But that will be for another post.

Let us have a quick look at the math as this usually makes things clearer.

A time-series model is one that postulates a relationship amongst a number of temporal sequences. This means that when we make predictions, we try to estimate the dependent variable based on its own previous values. It makes sense that there is some information to be gathered from the previous values. Take the stock market: the price of a stock tomorrow is quite dependent on its value today, since on average it tends to land within a given range of today's value. It is not very often that a stock drops to 0 or goes to infinity the next day.

In many cases one estimates the price today based on previous prices: $$y_{t} = \beta_{1} y_{t-1} + \dots + \beta_{n} y_{t-n} + \epsilon_{t}$$

#Lets import the different libraries needed
import pandas as pd
import urllib.request
import numpy as np
from scipy import stats

import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from jupyter_dash import JupyterDash
import dash_bootstrap_components as dbc
import wbdata as wbd

from statsmodels.tsa.tsatools import lagmat, add_trend
from statsmodels.tsa.adfvalues import mackinnonp, mackinnoncrit
from statsmodels.tsa.stattools import pacf, acf
import statsmodels.api as sm

Decomposition

To understand time series, it is a good idea to try to decompose the sequence into different parts.

There are several components to a time series, but we usually tend to decompose it into (at this point we do not make any strong assumptions):

  • Trend
  • Seasonal
  • Cyclical
  • Irregular

We could make a model based on these components.

A popular approach is to use the multiplicative method:

$$y_{t} = TR_{t} \cdot SN_{t} \cdot CL_{t} \cdot IR_{t}$$

We can also take the log to get an additive method:

$$\log(y_{t}) = \log(TR_{t}) + \log(SN_{t}) + \log(CL_{t}) + \log(IR_{t})$$

Why would we use the multiplicative model instead of an additive one, like the one we used in the regression post? The best answer is that, empirically, if something grows, its seasonal swings will most likely also get larger, and multiplying the seasonal component with the trend accomplishes this. Taking the logarithm then gives us an additive model with, in practice, the same effect.
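As a quick sanity check of that claim (the numbers below are made up), the log of a product is the sum of the logs:

#Made-up components: taking the log turns the product into a sum
import numpy as np

tr, sn, ir = 100.0, 1.2, 0.98
y = tr * sn * ir
print(np.isclose(np.log(y), np.log(tr) + np.log(sn) + np.log(ir)))  #True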

Let's download some data and look at some charts.

#Download yearly population in the arab world from the world bank
popArab = wbd.get_data("SP.POP.TOTL",country="1A")
popuA = dict([(int(d['date']), float(d['value'])) for d in popArab if d['value'] is not None]) 
popArab = pd.DataFrame(list(popuA.items()), columns=['Year', 'Population'])

#Non farm payrolls US employees from Bank of St. Louis
payrollUS = pd.read_csv("PAYNSA.csv").rename({'DATE': 'Month', 'PAYNSA': 'Payrolls'}, axis=1)
payrollUS["Month"] = pd.to_datetime(payrollUS["Month"], format='%Y-%m-%d', errors='ignore')

#Houses sold on a yearly basis by United States Census Bureau
house_sold = pd.read_excel("sold_cust.xls")
house_sold = house_sold.iloc[8:65,0:2]
house_sold.columns = ['Year', 'Houses_sold']
house_sold["Year"] = pd.to_datetime(house_sold["Year"], format='%Y', errors='ignore').dt.year

#Prime rate from Federal Bank of St. Louis
prime_rate = pd.read_csv("MPRIME.csv")
prime_rate.columns = ['Date', 'Prime_rate']
prime_rate["Date"] = pd.to_datetime(prime_rate["Date"], format='%Y-%m-%d', errors='ignore')
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=("Non farm total employees", "Houses sold in US on a yearly basis (in millions)",
                    "Prime Rate", "Population Development"))

fig.add_trace(go.Scatter(x=payrollUS["Month"], y=payrollUS["Payrolls"],name="Employees non-farm US")
              ,row=1, col=1)

fig.add_trace(go.Scatter(x=house_sold["Year"], y=house_sold["Houses_sold"],name="House Sold")
              ,row=1, col=2)

fig.add_trace(go.Scatter(x=prime_rate["Date"], y=prime_rate["Prime_rate"],name="Prime Rate")
              ,row=2, col=1)

fig.add_trace(go.Scatter(x=popArab["Year"], y=popArab["Population"],name="Population")
              ,row=2, col=2)

fig.update_layout(height=600, width=900,
                  title_text="Time series examples")

fig.write_json("first")
fig.show()

To interpret the plots:

  • In the population chart, as well as in the non-farm total employees chart, we can clearly see a trend in the data, as it is constantly increasing.
  • In both the prime rate and the houses sold charts, there seems to be some sort of cycle going on.
  • As the fluctuations in these charts are spread out and not over a fixed period, we can rule out seasonality, which would also not make sense as the frequency is yearly.
  • If you zoom in on the payroll chart, you can see some seasonality there, roughly every 12 months.
  • In the first three charts there are clearly some irregularities.

Let's try to decompose the payroll chart:

We can see from the plot that the seasonal spikes come about every six months, a smaller one in June and a larger one around December. So let's try a centered moving average to decompose the series.

Mathematically: $$CMA = \frac{y_{1} + \dots + y_{n}}{n}$$

where $n$ is the size of the window; for a centered moving average the window is centered on the current observation, so for the 6-month example it would run from $y_{t-3}$ to $y_{t+3}$.

#I decided on a season of 12 months, since 6 months did not look too appealing.
payrollUS["CMA"] = payrollUS["Payrolls"].rolling(window=12, center=True).mean()
decomfig = px.line(payrollUS, x="Month", y="CMA", title='Trend of payrolls')
decomfig.show()

Great, it looks good, and it seems like we have captured the trend of the data. We can see in the first payrolls plot that the seasonal swings seem to increase over time, so we will try the multiplicative method. But feel free to go for the additive one.

For simplicity, let us remove the cyclical component:

$$y_{t} = TR_{t} \cdot SN_{t} \cdot IR_{t}$$

So far we have the original time series and the trend, so let's extract the combined seasonal and irregular component from the series:

$$\frac{y_{t}}{TR_{t}} = SN_{t} \cdot IR_{t}$$

payrollUS['SNIR'] = payrollUS['Payrolls']/payrollUS['CMA']
snir = px.line(payrollUS, x="Month", y="SNIR", title='Seasonality and irregular component')
snir.show()

It seems like we have done a decent job.

Let's go on and extract the seasonality from the series:

#Get the dates of payrollUS
m = payrollUS['Month'].dt.month
#Group by month and get the mean
regseason = payrollUS["SNIR"].groupby(m).mean()
#Create a dataframe
regseason = pd.DataFrame({'Month':regseason.index, 'SN':regseason.values})

#Repeat the regseason frame so it matches the length of the payroll series
new_len = round(len(payrollUS)/len(regseason))
repeated = (pd.concat([regseason["SN"]] * new_len)
              .reset_index()
              .drop(columns=["index"])
              .iloc[:len(payrollUS)])
#Merge the two dataframes on the index
payrollUS = payrollUS.merge(repeated, how="outer", left_index=True, right_index=True)
sn = px.line(payrollUS, x="Month", y="SN", title='Seasonality')
sn.show()

Finally we divide the combined seasonal and irregular component by the seasonal component to get the irregular component.

$$IR_{t} = \frac{SN_{t} \cdot IR_{t}}{SN_{t}}$$

#Divide the SNIR column by the SN column, skipping the NaN edges left by the centered moving average
payrollUS["IR"] = payrollUS.iloc[6:976,3] / payrollUS.iloc[6:976,4]
ir = px.line(payrollUS, x="Month", y="IR", title='Irregularity')
ir.show()

We have finally managed to decompose the time series into its different components. However, it is quite hard to forecast with this method, so let us move on and try to include some coefficients, to see where that leads us.

Exponential Smoothing

Level or Simple Exponential Smoothing

To start with, consider the case where there is no trend or seasonal component, and the time series fluctuates around a mean, which we call the level.

If the level did not change over time, we could just take the mean and add an error term:

$$y_{t} = \mu + \epsilon_{t}$$

We would then have: $$\hat{\mu} = \frac{1}{n}\sum\limits_{t=1}^{n}y_{t}$$

If however the level changed over time, this would probably not be such a good idea, as the most recent lags of $y_{t}$ would probably contain more information than the values many lags ago.

Therefore a good approach would be to include some weights based on the lags:

$$\hat{\mu} = \sum\limits_{s=1}^{t}w_{s}y_{s}$$

The most common way to approach this is to let the weights decrease exponentially over time, where $w_{s}$ is proportional to $(1-\alpha)^{t-s}$.

And from a probabilistic standpoint we need all the weights to sum to 1:

$$\sum\limits_{s=1}^{t}\alpha(1-\alpha)^{t-s} = 1$$

Therefore we get:

$$\hat{\mu}_{t} = \sum\limits_{s=1}^{t}\alpha(1-\alpha)^{t-s}y_{s}$$

We can express the level as: $$\ell_{t} = \sum\limits_{s=1}^{t}\alpha(1-\alpha)^{t-s}y_{s} = \sum\limits_{s=1}^{t-1}\alpha(1-\alpha)^{t-s}y_{s}+\alpha y_{t} = (1-\alpha)\sum\limits_{s=1}^{t-1}\alpha(1-\alpha)^{t-s-1}y_{s}+\alpha y_{t}$$ where: $$\sum\limits_{s=1}^{t-1}\alpha(1-\alpha)^{t-s-1}y_{s} = \ell_{t-1}$$

so we get: $$\ell_{t} = (1-\alpha)\ell_{t-1} + \alpha y_{t}$$

We can see from the above equation that the new level at time t is dependent on the previous level + the current value of $y$ at time t.

  • If the value of our weight $\alpha = 0$, then we ignore the value of $y_{t}$
  • if $\alpha = 1$, we disregard the previous level when we have $y_{t}$

Common practice for setting alpha is $ \alpha = \frac{2}{N + 1} $ where $N$ is the size of the rolling window.

If you are interested in finance, the level equation looks a lot like this: $$EMA_{t} = (1-\alpha)EMA_{t-1} + \alpha p_{t}$$

And indeed, this is an exponentially weighted moving average (EMA or EWMA), which is often used for prediction in finance.

#I have set up two functions to solve the equation. One without a rolling window, and one with.

def emwa(data, alpha):
    #We set the first value to the same as the series.
    li_res = [data[0]]
    #Loop over the length of the array
    for n in range(1, len(data)):
        #Append the values of a lag of 1.
        li_res.append(alpha * data[n] + (1 - alpha) * li_res[n-1])
    return li_res


#Function to calculate the EMA over a fixed window
def ema(a, alpha, windowSize):
    #Exponentially decaying weights (1-alpha)^0, (1-alpha)^1, ..., (1-alpha)^(windowSize-1)
    wghts = (1-alpha)**np.arange(windowSize)
    #Normalize the weights so they sum to 1
    wghts /= wghts.sum()
    #Slide the weights over the data to compute the weighted moving average
    out = np.convolve(a,wghts)
    #Set everything before the first full window to nan
    out[:windowSize-1] = np.nan
    #Return up to the size of the data given
    return out[:a.size]


#Define window
windowSize = 2

#Define Alpha
alpha = 2/(windowSize + 1)
alpha5 = 0.8
payrollUS["Payrolls"] = payrollUS["Payrolls"].astype(float)
payrollUS["EMA"] = ema(payrollUS["Payrolls"].values,alpha, windowSize)
payrollUS["EMA5"] = ema(payrollUS["Payrolls"].values,alpha5, windowSize)
payrollUS["EMWA"] = emwa(payrollUS["Payrolls"].values,alpha5)
level = px.line(payrollUS, x="Month", y=["Payrolls","EMA","EMA5","EMWA"], title='Level')
level.show()

We can play around with the different windows and alphas: a longer window smooths more, while pushing alpha towards its extremes shifts the weight from ignoring the most recent observation entirely (alpha = 0) to putting almost all the emphasis on it (alpha close to 1).

To see how well the prediction works, we can look at $y_{t} - \ell_{t-1}$ as the error.

If we rearrange: $$\ell_{t} = (1-\alpha)\ell_{t-1} + \alpha y_{t}$$

We get: $$\ell_{t}-\ell_{t-1} = \alpha(y_{t}-\ell_{t-1})$$

If we want to perform point forecasts, we could use $\hat{y}_{n+h} = \ell_{n}$, as we do not have information on how the level changes out of sample.

So let us assume that the errors are i.i.d and normal:

$$y_{t} = L_{t-1} + \epsilon_{t}$$

$$L_{t} = L_{t-1} + \alpha \epsilon_{t}$$

$$\epsilon_{t} \sim N(0,\sigma^2)$$

So for a prediction one step ahead, we can write:

$$y_{n+1} = L_{n} + \epsilon_{n+1}$$ $$\sim N(L_{n},\sigma^2)$$

Two steps ahead: $$y_{n+2} = L_{n+1} + \epsilon_{n+2} = L_{n} + \alpha \epsilon_{n+1} + \epsilon_{n+2}$$ $$\sim N(L_{n},(\alpha^2 + 1)\sigma^2)$$

In general we have: $$y_{n+h} \sim N(L_{n},((h-1)\alpha^2 + 1)\sigma^2)$$ where $h$ is the horizon ahead.

To get the 99% prediction interval, we use: $$L_{n} - 2.576\,\sigma \sqrt{(h-1)\alpha^2 + 1} \leq y_{n+h} \leq L_{n} + 2.576\,\sigma \sqrt{(h-1)\alpha^2 + 1}$$ where in practice we replace $L_{n}$ with $\ell_{n}$ and $\sigma^2$ with $s^2$, where $s^2 = \frac{SSE}{n-1}$, since we lose one degree of freedom by estimating alpha.
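Below is a minimal sketch of how such an interval could be computed, on a synthetic level-only series of my own; in practice $\ell_{n}$, $s$ and alpha would come from the smoother fitted on the payroll data below.

#A synthetic flat-level series with noise, smoothed with alpha = 0.2
import numpy as np

rng = np.random.default_rng(5)
y = 100 + rng.normal(scale=2.0, size=200)
alpha = 0.2
ell = [y[0]]
for t in range(1, len(y)):
    ell.append((1 - alpha) * ell[-1] + alpha * y[t])
#One-step-ahead errors and the estimate s^2 = SSE/(n-1)
errors = y[1:] - np.array(ell[:-1])
s = np.sqrt((errors ** 2).sum() / (len(y) - 1))
z = 2.576  #99% two-sided normal quantile
for h in (1, 6, 12):
    half = z * s * np.sqrt((h - 1) * alpha ** 2 + 1)
    print(h, round(ell[-1] - half, 1), round(ell[-1] + half, 1))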

def simpleEMA_forecast(data,forecast_p=1,alpha=0.8):  
    
    #Make array so easier to work with numerically.
    d = np.array(data)
    
    # Length of data
    l_rows = len(data)  
    
    # Append nan to the payrollUS array
    d = np.append(d,[np.nan]*forecast_p)    
    
    # Fill the forecast array with nan 
    f = np.full(l_rows+forecast_p,np.nan)  
    
    # initialization of first forecast
    f[1] = d[0]      
    
    #Loop over the range of data and get the EMW forecasts
    for t in range(2,l_rows+1):  
        f[t] = alpha*d[t-1]+(1-alpha)*f[t-1]  
    
    # Forecast for all periods added
    f[l_rows+1:] = f[t]    
    
    #append first value to be same as payrolls
    f[0] = data[0]
    
    #Create df with all the values
    forecast_df = pd.DataFrame.from_dict({"Payrolls":d,"Forecast":f,"Error":d-f})
    forecast_df["Month"] = pd.date_range(start='01-01-1939', periods=len(f), freq="M")
    return forecast_df 
sEMA = simpleEMA_forecast(payrollUS["Payrolls"], forecast_p = 12)
s_mape = abs(sEMA["Error"]).sum()/(sEMA["Payrolls"].sum())
s_mape
0.0076371762707314154
exp_sm = px.line(sEMA, x="Month", y=["Payrolls","Forecast"], title='Level')
exp_sm.show()

If you zoom in on the end of the graph, you will see the forecast, which is a straight line, or a level, as we talked about earlier. This is maybe not too informative, so let us move on and try to predict trend.

Holt’s Trend Corrected Exponential Smoothing

Let us now assume there is a trend that can change over time. The one-step-ahead prediction is then no longer just the current level, but the current level plus the current slope.

We then change the level equation: $$\ell_{t} = (1-\alpha)\ell_{t-1} + \alpha y_{t}$$

To: $$\ell_{t} = (1-\alpha)(\ell_{t-1} + b_{t-1}) + \alpha y_{t}$$

Where $b$ is the slope estimate.

A change in the level should be a good indication of the slope, and hence we have: $$b_{t} = \gamma(\ell_{t} - \ell_{t-1}) + (1-\gamma)b_{t-1}$$

We now also have to decide on a new coefficient, gamma, which is the smoothing factor for the trend component.

The one-step forecast error is now: $$y_{t}-\ell_{t-1}-b_{t-1}$$

And the error correction form: $$\ell_{t} = \ell_{t-1} + b_{t-1} + \alpha(y_{t} -\ell_{t-1}-b_{t-1})$$ $$b_{t} = \alpha \gamma(y_{t} - \ell_{t-1} - b_{t-1}) + b_{t-1}$$

We still assume i.i.d., normal errors and thus: $$y_{t} = L_{t-1} + B_{t-1} + \epsilon_{t}$$

$$L_{t} = L_{t-1} + B_{t-1} + \alpha \epsilon_{t}$$ $$B_{t} = B_{t-1} + \alpha\gamma \epsilon_{t}$$

$$\epsilon_{t} \sim N(0,\sigma^2)$$

Forecasting one step ahead: $$y_{n+1} = L_{n} + B_{n} + \epsilon_{n+1}$$ $$\sim N(L_{n}+B_{n},\sigma^2)$$

Two steps ahead: $$y_{n+2} = L_{n+1} +B_{n+1}+ \epsilon_{n+2}$$ $$= L_{n} +B_{n}+ \alpha \epsilon_{n+1} +B_{n} +\gamma \alpha \epsilon_{n+1}+ \epsilon_{n+2}$$ $$= L_{n} + 2B_{n} + \alpha(1+\gamma) \epsilon_{n+1} + \epsilon_{n+2}$$ $$\sim N(L_{n} + 2B_{n},(\alpha^2(1 + \gamma)^2 + 1)\sigma^2)$$

In general we have: $$y_{n+h} \sim N\Big(L_{n} + hB_{n},\big(\alpha^2 \sum\limits_{j=1}^{h-1}(1+j\gamma)^2 + 1\big)\sigma^2\Big)$$ where $h$ is the horizon ahead.

The prediction standard error is: $$s\sqrt{\alpha^2 \sum\limits_{j=1}^{h-1}(1+j\gamma)^2 + 1}$$ and: $$s^2 = \frac{SSE}{n-2}$$ where we deduct 2 degrees of freedom for $\gamma$ and $\alpha$.

def holts_exponential_forecast(data, alpha, gamma, forecast_p=2):

    d = np.array(data)
    l_rows = len(d)
    d = np.append(d,[np.nan]*forecast_p)
    
    # Fill the forecast array with nan 
    f = np.full(l_rows+forecast_p,np.nan)  
    
    # initialization of first forecast
    f[1] = d[0]

    level = d[0]
    trend = d[1] - d[0]
    for t in range(2, l_rows + 1):
        if t >= l_rows:
            # forecasting new points
            value = f[t - 1]
        else:
            value = d[t]

        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = gamma * (level - previous_level) + (1 - gamma) * trend 
        f[t] = level + trend

    # for forecasting beyond the first new point,
    # the level and trend is all fixed
    if forecast_p > 1:
        f[l_rows + 1:] = level + np.arange(2, forecast_p + 1) * trend
        
    #Set first value of forecast to first value of payrolls
    f[0] = data[0]
    
    #Create df with all the values
    h_forecast_df = pd.DataFrame.from_dict({"Payrolls":d,"Forecast":f,"Error":d-f})
    h_forecast_df["Month"] = pd.date_range(start='01-01-1939', periods=len(f), freq="M")

    return h_forecast_df
#For simplicity we have just set alpha and gamma to 0.8, but feel free to change them to more appropriate values.
holt = holts_exponential_forecast(payrollUS["Payrolls"],alpha=0.8,gamma=0.8, forecast_p = 12)
h_mape = abs(holt["Error"]).sum()/(holt["Payrolls"].sum())
h_mape
0.005412664907921347
holt_trend = px.line(holt, x="Month", y=["Payrolls","Forecast"], title='Holt')
holt_trend.show()

As we can see from above, the forecast is no longer a flat level as before, but rather follows the latest trend.

Great, we have forecasted level, and then level and trend. Let us try to include seasons as well.

Holt-Winters Exponential Smoothing

One of the major points of these decompositions is that we want to separate the trend and seasonal effects, so we need to separate them in the smoothing equations as well:

$$\ell_{t} = \alpha(y_{t} - sn_{t-L}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})$$

Here we can see that the trend component $(1 - \alpha)(\ell_{t-1} + b_{t-1})$ is trying to predict the movement in the deseasonalized series $\alpha (y_{t} - sn_{t-L})$.

We also have the other components: $$b_{t} = \gamma(\ell_{t} - \ell_{t-1}) + (1-\gamma)b_{t-1}$$

$$sn_{t} = \delta(y_{t}-\ell_{t}) + (1-\delta)sn_{t-L}$$

In addition to $\alpha$ and $\gamma$ we also have to decide for the seasonal component $\delta$.

Our main objective would be to minimize: $$\sum^n_{t=1}(y_{t}-\ell_{t-1}-b_{t-1}-sn_{t-L})^2$$

As in the Holt trend section we still assume i.i.d., normal errors and thus: $$y_{t} = L_{t-1} + B_{t-1} + SN_{t-L} + \epsilon_{t}$$

$$L_{t} = L_{t-1} + B_{t-1} + \alpha \epsilon_{t}$$

$$SN_{t} = SN_{t-L} + (1-\alpha)\delta \epsilon_{t}$$

$$\epsilon_{t} \sim N(0,\sigma^2)$$

The point forecast will then be: $$\hat{y}_{n+h} = \ell_{n} + hb_{n} + sn_{n+h-L}$$

We can estimate $\sigma^2$ using the sample variance $s^2 = \frac{SSE}{n-3}$.

The prediction standard error is: $$s\sqrt{\sum\limits_{j=1}^{h-1}[\alpha(1+j\gamma)+ d_{j,L}(1-\alpha)\delta]^2 + 1}$$

where the dummy variable:

$$d_{j,L} = \begin{cases} 1 & \text{if } \frac{j}{L} \text{ is an integer}; \\ 0 & \text{otherwise}. \end{cases}$$

Calculating the Holt-Winters smoother is a bit trickier than calculating the previous level and trend smoothers.

First we need to compute the average of trends across seasons:

$$b_{0} = \frac{1}{L}\bigg(\frac{y_{L+1}-y_{1}}{L} + \frac{y_{L+2}-y_{2}}{L} + \cdots + \frac{y_{L+L}-y_{L}}{L}\bigg)$$

def initial_trend(data, n_seasons):
    sum = 0.0
    for i in range(n_seasons):
        sum += float(data[i+n_seasons] - data[i]) / n_seasons
    return sum / n_seasons

The steps we need to take to compute the initial seasonal components are:

  • Compute the average of each of the years:

$$A_{p} = \frac{\sum_{i=1}^{12}y_{i,p}}{12}, \quad p = 1,2,\dots,n$$ where $y_{i,p}$ is the $i$-th month of year $p$.

  • Divide the values in the series by the appropriate yearly mean.

For instance: $\frac{y_{1}}{A_{1}}$, $\frac{y_{12}}{A_{1}}$ and $\frac{y_{13}}{A_{2}}$

Have a look at this link for a better explanation.

def initial_seasonal_components(data, n_seasons):
    seasonals = {}
    season_averages = []
    k_seasons = int(len(data)/n_seasons)
    # compute season averages
    for j in range(k_seasons):
        season_averages.append(sum(data[n_seasons*j:n_seasons*j+n_seasons])/float(n_seasons))
    # compute initial values
    for i in range(n_seasons):
        sum_of_vals_over_avg = 0.0
        for j in range(k_seasons):
            sum_of_vals_over_avg += data[n_seasons*j+i]-season_averages[j]
        seasonals[i] = sum_of_vals_over_avg/k_seasons
    return seasonals
def holtwinters_exponential_forecast(data, n_seasons, alpha, gamma, delta, forecast_p):
    #Get values
    d = np.array(data)
    l_rows = len(d)
    
    f = []
    seasonals = initial_seasonal_components(d, n_seasons)
    for i in range(l_rows+forecast_p):
        if i == 0: # initial values
            level = d[0]
            trend = initial_trend(d, n_seasons)
            f.append(d[0])
            continue
        if i >= l_rows: # we are forecasting
            m = i - l_rows + 1
            f.append((level + m*trend) + seasonals[i%n_seasons])
        else:
            value = d[i]
            last_level = level
            level = alpha*(value-seasonals[i%n_seasons]) + (1-alpha)*(level+trend)
            trend = gamma * (level-last_level) + (1-gamma)*trend
            seasonals[i%n_seasons] = delta*(value-level) + (1-delta)*seasonals[i%n_seasons]
            f.append(level+trend+seasonals[i%n_seasons])
    
    
    d = np.append(d,[np.nan]*forecast_p)
    
        #Create df with all the values
    hw_forecast_df = pd.DataFrame.from_dict({"Payrolls":d,"Forecast":f,"Error":d-f})
    hw_forecast_df["Month"] = pd.date_range(start='01-01-1939', periods=len(f), freq="M")
    
    return hw_forecast_df
#For simplicity we also only set the alpha, gamma to 0.8 and delta to 0.2
holtwinters = holtwinters_exponential_forecast(payrollUS["Payrolls"],n_seasons=12,alpha=0.8,gamma=0.8, delta=0.2, forecast_p = 12)
hw_mape = abs(holtwinters["Error"]).sum()/(holtwinters["Payrolls"].sum())
hw_mape
0.006959972779861919
holtw = px.line(holtwinters, y=["Payrolls","Forecast"], title='Holt-Winters')
holtw.show()

Finally we have gone through the level, trend and seasonal components of the different smoothing processes.

As you saw, we have focused purely on the additive approach. The reason is that the multiplicative technique tends to be mathematically harder to handle. Furthermore, as we move on to ARIMA (Box-Jenkins) models in the next step, the multiplicative form is hard to generalize into that framework, and empirically it often is not worth the extra effort.

Gardner-McKenzie damped smoothing

Over longer periods of time, a linear trend may not be suitable. A last idea that is worth looking into is Gardner-McKenzie damped trend exponential smoothing. The key idea of this technique is that the growth between time periods tends not to be a constant $b_{n}$, but rather a fraction of it: $\phi^h b_{n}$, with $0<\phi<1$.

This means that we expect $b_{n}$ to shrink over time.
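A tiny illustration with made-up numbers: with $\phi = 0.8$ the trend contribution $\phi^h b_{n}$ fades as the horizon $h$ grows.

#Made-up numbers: the damped trend contribution shrinks with the horizon
phi, b_n = 0.8, 100.0
print([round(phi ** h * b_n, 1) for h in range(1, 11)])
#[80.0, 64.0, 51.2, 41.0, 32.8, 26.2, 21.0, 16.8, 13.4, 10.7]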

Applying the damping parameter $\phi$ to the Holt-Winters equations we then have:

$$\ell_{t} = \alpha(y_{t} - sn_{t-L}) + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})$$

$$b_{t} = \gamma(\ell_{t} - \ell_{t-1}) + (1-\gamma)\phi b_{t-1}$$

$$sn_{t} = \delta(y_{t}-\ell_{t}) + (1-\delta)sn_{t-L}$$

As you can see, there are no major changes that we have to perform in order to add the $\phi$ parameter.

Below, I have added it to the Holt-Winters exponential forecasting function we used above.

def gardner_mcKenzie_exponential_forecast(data, n_seasons, alpha, gamma, delta, phi, forecast_p):
    #Get values
    d = np.array(data)
    l_rows = len(d)
    
    f = []
    seasonals = initial_seasonal_components(d, n_seasons)
    for i in range(l_rows+forecast_p):
        if i == 0: # initial values
            level = d[0]
            trend = initial_trend(d, n_seasons)
            f.append(d[0])
            continue
        if i >= l_rows: # we are forecasting
            m = i - l_rows + 1
            f.append((level + m*trend) + seasonals[i%n_seasons])
        else:
            value = d[i]
            last_level = level
            #Phi added as we showed above.
            level = alpha*(value-seasonals[i%n_seasons]) + (1-alpha)*(level+(phi*trend))
            trend = gamma * (level-last_level) + (1-gamma)*(phi*trend)
            seasonals[i%n_seasons] = delta*(value-level) + (1-delta)*seasonals[i%n_seasons]
            f.append(level+trend+seasonals[i%n_seasons])
    
    
    d = np.append(d,[np.nan]*forecast_p)
    
        #Create df with all the values
    hw_forecast_df = pd.DataFrame.from_dict({"Payrolls":d,"Forecast":f,"Error":d-f})
    hw_forecast_df["Month"] = pd.date_range(start='01-01-1939', periods=len(f), freq="M")
    
    return hw_forecast_df
holtwinters_damp = gardner_mcKenzie_exponential_forecast(payrollUS["Payrolls"],n_seasons=12,alpha=0.8,gamma=0.8, delta=0.2,phi=0.8, forecast_p = 12)
hw_d_mape = abs(holtwinters_damp["Error"]).sum()/(holtwinters_damp["Payrolls"].sum())
hw_d_mape
0.005052857255299971
holtwinters_dampened = px.line(holtwinters_damp, x="Month", y=["Payrolls","Forecast"], title='Holt-Winters damped')
holtwinters_dampened.show()

As with the Holt-Winters above, I have also added the damping to the pure Holt exponential forecasting function.

Applying the damping parameter $\phi$ to Holt's smoothing we then have:

$$\ell_{t} = \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})$$

$$b_{t} = \gamma(\ell_{t} - \ell_{t-1}) + (1-\gamma)\phi b_{t-1}$$

def gardner_mcKenzie_trend_exponential_forecast(data, alpha, gamma, phi,forecast_p=2):

    d = np.array(data)
    l_rows = len(d)
    d = np.append(d,[np.nan]*forecast_p)
    
    # Fill the forecast array with nan 
    f = np.full(l_rows+forecast_p,np.nan)  
    
    # initialization of first forecast
    f[1] = d[0]

    level = d[0]
    trend = d[1] - d[0]
    for t in range(2, l_rows + 1):
        if t >= l_rows:
            # forecasting new points
            value = f[t - 1]
        else:
            value = d[t]

        previous_level = level
        level = alpha * value + (1 - alpha) * (level + phi*trend)
        trend = gamma * (level - previous_level) + (1 - gamma) * (phi*trend) 
        f[t] = level + trend

    # for forecasting beyond the first new point,
    # the level and trend are fixed
    if forecast_p > 1:
        f[l_rows + 1:] = level + np.arange(2, forecast_p + 1) * trend
        
    #Set first value of forecast to first value of payrolls
    f[0] = data[0]
    
    #Create df with all the values
    hs_forecast_df = pd.DataFrame.from_dict({"Payrolls":d,"Forecast":f,"Error":d-f})
    hs_forecast_df["Month"] = pd.date_range(start='01-01-1939', periods=len(f), freq="M")

    return hs_forecast_df
holt_damp = gardner_mcKenzie_trend_exponential_forecast(payrollUS["Payrolls"],alpha=0.8,gamma=0.8, phi=0.8,forecast_p = 12)
hd_mape = abs(holt_damp["Error"]).sum()/(holt_damp["Payrolls"].sum())
hd_mape
0.00493208310268728
holt_dampened = px.line(holt_damp, x="Month", y=["Payrolls","Forecast"], title='Holt damped')
holt_dampened.show()

As a final graph, I have added confidence intervals using Brutlag's algorithm to the Holt-Winters-Gardner-McKenzie approach, where prob (the $h$ in the formulas below) is the z-score for the chosen significance level, e.g. 2.33 for 99% and 1.96 for 95%.

$$\hat{y}_{upper_{t}} = \ell_{t-1} + b_{t-1} + sn_{t-L} + h\, d_{t-L}$$

$$\hat{y}_{lower_{t}} = \ell_{t-1} + b_{t-1} + sn_{t-L} - h\, d_{t-L}$$

$$d_{t} = \delta|y_{t}-\hat{y}_{t}| +(1-\delta)d_{t-L}$$

def gardner_mcKenzie_confidence_interval(data, n_seasons, alpha, gamma, delta, phi, forecast_p, prob):
    #Get values
    d = np.array(data)
    l_rows = len(d)
    
    p_dev = []
    upper_c = []
    lower_c = []
    
    f = []
    seasonals = initial_seasonal_components(d, n_seasons)
    for i in range(l_rows+forecast_p):
        #Get starting values
        if i == 0: 
            level = d[0]
            trend = initial_trend(d, n_seasons)
            f.append(d[0])
            #append the predicted deviation
            p_dev.append(0)
            #Confidence intervals
            upper_c.append(d[0]+prob*p_dev[0])
            lower_c.append(d[0]-prob*p_dev[0])
            continue
        if i >= l_rows: # we are forecasting
            m = i - l_rows + 1
            f.append((level + m*trend) + seasonals[i%n_seasons])
            #append the predicted deviation
            p_dev.append(p_dev[-1]*1.01)
        else:
            value = d[i]
            last_level = level
            #Phi added.
            level = alpha*(value-seasonals[i%n_seasons]) + (1-alpha)*(level+(phi*trend))
            trend = gamma * (level-last_level) + (1-gamma)*(phi*trend)
            seasonals[i%n_seasons] = delta*(value-level) + (1-delta)*seasonals[i%n_seasons]
            f.append(level+trend+seasonals[i%n_seasons])
            #append the predicted deviation
            p_dev.append(delta * np.abs(d[i] - f[i]) + (1-delta)*p_dev[-1])
    
        #Get the confidence intervals
        upper_c.append(f[-1] + prob * p_dev[-1])
        lower_c.append(f[-1] - prob * p_dev[-1])

            
    
    d = np.append(d,[np.nan]*forecast_p)
    
        #Create df with all the values
    hwd_forecast_df = pd.DataFrame.from_dict({"Payrolls":d,"Forecast":f,"Error":d-f, "upper_c":upper_c,"lower_c":lower_c,"p_dev":p_dev})
    hwd_forecast_df["Month"] = pd.date_range(start='01-01-1939', periods=len(f), freq="M")
    
    return hwd_forecast_df
holtwinters_conf = gardner_mcKenzie_confidence_interval(payrollUS["Payrolls"],n_seasons=12,alpha=0.8,gamma=0.8, delta=0.2,phi=0.8, forecast_p = 12, prob = 2.33)
holt_dampenedfr = px.line(holtwinters_conf, x="Month", y=["Payrolls","Forecast","upper_c","lower_c"], title='Holt-Winters damped with confidence intervals')
holt_dampenedfr.show()

A final word on the different techniques we have used above: there are no strictly right or wrong choices for the hyperparameters alpha, gamma, delta and phi. We could of course run a grid search to find the optimal values, but then we might be overfitting. So I would rather think carefully about what you are modeling, what its characteristics are, and what your expectations are.

Judging by the mean absolute percentage error, which I picked as the loss function, the general trend seems to be that adding more components is not such a bad idea. But I would encourage you to try out different values and look up best practices.
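As a cross-check on the hand-picked values, statsmodels can estimate the smoothing parameters itself by maximum likelihood. This is just a sketch, assuming the payrollUS frame loaded earlier; it is not the approach used above, and the exact keys in .params differ a bit between statsmodels versions.

#Let statsmodels pick the smoothing parameters for an additive Holt-Winters model
from statsmodels.tsa.holtwinters import ExponentialSmoothing

hw_model = ExponentialSmoothing(payrollUS["Payrolls"].astype(float).values,
                                trend="add", seasonal="add",
                                seasonal_periods=12).fit()
print(hw_model.params)        #the estimated smoothing parameters (naming varies by version)
print(hw_model.forecast(12))  #12 steps ahead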

ARIMA (Box-Jenkins)

One thing to note about the techniques used above is that they implicitly assume the analyzed time series is non-stationary, with a changing level, trend or seasonality. The Box-Jenkins approach, by contrast, is built around stationary series.

A stationary process is a stochastic process whose unconditional joint probability distribution $(y_{t},y_{t-1},\dots,y_{t-h})$ does not change when shifted in time, which means that the mean and variance also do not change over time.

A common cause of violation of the stationarity assumption is a trend in the mean. This can be a deterministic trend, or it can be a more serious problem like a unit root. If the trend is deterministic, the variable will tend to move towards the evolving mean, even if it can diverge from it for longer periods. A unit root, on the other hand, is not mean reverting, and stochastic shocks have permanent effects.

If the time series has a deterministic trend, we can easily deal with it by removing the underlying trend. If we have a unit root, taking the first difference of the time series usually helps.

However, in many cases strict stationarity as described above is not feasible, as it puts too many restrictions on us. Therefore we mostly go with covariance stationarity (weak-sense stationarity), where the first moment and the autocovariance do not vary over time, and we only ask that the second moment is finite for any time t.

An intuitive example of a non-stationary process is the random walk: $y_{t} = y_{t-1} + \epsilon_{t}$, where $\epsilon_{t} \sim N(0,\sigma^2)$

Expected value: $E(y_{t}) = \sum^t_{s=1} E(\epsilon_{s}) = t \cdot 0 = 0$

while the variance: $Var(y_{t}) = \sum^t_{s=1} Var(\epsilon_{s}) = t\sigma^2$, which grows with $t$

So we can see that this is not a stationary process.
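A quick simulation with synthetic data of my own makes the point: the sample variance of a random walk grows roughly linearly with t.

#Simulate 10,000 random walks with sigma^2 = 1 and check the variance across them
import numpy as np

rng = np.random.default_rng(4)
walks = rng.normal(size=(10000, 500)).cumsum(axis=1)
for t in (50, 200, 500):
    print(t, round(walks[:, t - 1].var(), 1))  #close to t * sigma^2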

A very nice trait of a stationary process is that nothing explodes over time (shocks die out).

Taking the difference of a time series often gives you a stationary time series: $y_{t} - y_{t-1} = \epsilon_{t}$. We call such a series I(1), which stands for “integrated of order one”. If you need to difference the series more times, you get I(2), I(3) and so on.

We can also take the difference with respect to the seasons we discovered earlier in the decomposition and plot it.

#Create a new dataframe and get the columns needed
df_payrollUS = pd.DataFrame.from_dict({"Month":payrollUS["Month"],"Payrolls":payrollUS["Payrolls"]})
#Take the difference over 1 and 12 month, and then combining them
df_payrollUS["diff1"] = df_payrollUS["Payrolls"].diff()
df_payrollUS["diff12"] = df_payrollUS["Payrolls"].diff(12)
#We take the 12-month difference first, and then the 1-month difference, to make sure we get rid of
#any remaining non-stationarity after the 12-month difference.
df_payrollUS["diff1_12"] = df_payrollUS["Payrolls"].diff(12).diff(1)

diff_payroll = px.line(df_payrollUS.dropna(), x="Month", y="diff1_12", title='First Difference')
diff_payroll.show()

Looking at the plot, the series seems roughly stationary, but the last spike is of some concern. Let us check for a unit root.

Augmented Dickey-Fuller

A common test for stationarity, or more precisely for a unit root, is the Dickey-Fuller test.

We estimate a regression model: $$y_{t} = \delta + \varphi y_{t-1} + \epsilon_{t}$$

if $\varphi = 1$ and $\delta = 0$, then we get a random walk model.

if $\varphi = 1$ and $\delta \neq 0$, we get a random walk with drift.

So what we need to test is whether $\varphi = 1$. In practice we subtract $y_{t-1}$ from both sides and regress the differenced series on the lagged level, $\Delta y_{t} = \delta + \rho y_{t-1} + \epsilon_{t}$ with $\rho = \varphi - 1$, which is also what the code below does.

$H_{0}: \rho = 0$, a unit root is present. $H_{1}: \rho < 0$, no unit root is present.

One of the issues is that we cannot use standard t- or F-critical values, since the Dickey-Fuller distribution is neither normal nor symmetrical. So we use the critical values tabulated by MacKinnon.

def adfull(data, maxlags):

    # make sure we are working with an array, convert if necessary
    data = np.asarray(data)
    
    # Calculate the discrete difference
    d_diff = np.diff(data)
    
    # Create lags, and trim off any lag values.
    d_lag = lagmat(d_diff[:, None], maxlags, trim='both', original='in')
    # Get length of the array
    l_obs = len(d_lag) 

    d_lag[:, 0] = data[-l_obs - 1:-1]  
    tsdshort = d_diff[-l_obs:]
    
    trend = add_trend(d_lag[:, :maxlags + 1])
    
    # Use Ordinary least square from my article on "Regressions"
    #Find coefficients
    b_ts = np.linalg.inv(trend.T.dot(trend)).dot(trend.T).dot(tsdshort)
    #Fit model
    yhat = trend.dot(b_ts)
    #SSR is the sum of squared errors the model does not explain
    SSR = np.dot((tsdshort-yhat).T,tsdshort-yhat)
    #Get Variance
    vs = (SSR)/(len(tsdshort)-1)
    #Standard error
    s = np.sqrt(vs)
    #Get variance beta
    var_b = np.diag(np.dot(s**2,np.linalg.inv(np.dot(trend.T,trend))))
    #Standard error beta
    s_b = np.sqrt(var_b)
    #And finally t_values
    t_values = b_ts/ s_b
    
    # p-values
    pvalue = mackinnonp(t_values[0], N=1)
    # c-values
    c_values = mackinnoncrit(1, nobs=np.inf)
    c_values = {"1%": c_values[0],
                "5%": c_values[1],
                "10%": c_values[2]}
    return  t_values[0], pvalue, c_values
unit_r = adfull(df_payrollUS["Payrolls"].dropna(), maxlags=1)
unit_r
(-0.8226552066131838,
 0.812376529451031,
 {'1%': -3.43035, '5%': -2.86154, '10%': -2.56677})

From the results above we can clearly see that we cannot reject the null hypothesis for the level series, so we cannot rule out a unit root.

As we have seen from previous plots, there seems to be a trend. Let us try to difference it once, as we did in the plot above.

unit_ro = adfull(df_payrollUS["diff1"].dropna(), maxlags=1)
unit_ro
(-26.46729310536294, 0.0, {'1%': -3.43035, '5%': -2.86154, '10%': -2.56677})

If we look at both the test statistic and the p-value, we can reject H0 at well beyond the 99% level. We have now de-trended the series, but we remain cautious about seasonality as well.

Autocorrelation

Further, we also have to look for autocorrelation, expressed as $\rho_{k} = Corr(y_{t},y_{t-k})$. As we talked about earlier, for a non-stationary time series checking for autocorrelation does not make much sense, as the correlation may depend on t.

To find the autocorrelation, we use the autocorrelation function (ACF), or SAC ($r_{k}$), which is the sample autocorrelation function.

We can define it as: $$r_{k} = \frac{\sum^{n-k}_{t=1} (y_{t}-\bar{y})(y_{t+k}-\bar{y})}{\sum^{n}_{t=1} (y_{t}-\bar{y})^2}$$

From this we can easily see that $r_{0} = 1$, as a variable is perfectly correlated with itself.
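A minimal sketch of this formula on a synthetic series of my own; for a random walk the autocorrelations die down very slowly, which is exactly the non-stationarity warning sign discussed below.

#Direct implementation of the sample autocorrelation r_k
import numpy as np

def sample_acf(y, k):
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    num = np.sum((y[:len(y) - k] - ybar) * (y[k:] - ybar))
    den = np.sum((y - ybar) ** 2)
    return num / den

rng = np.random.default_rng(0)
y = rng.normal(size=500).cumsum()  #a random walk
print([round(sample_acf(y, k), 3) for k in range(4)])  #r_0 is exactly 1, the rest stay high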

We can also see that $r_{k}$ will be very close to the OLS coefficient $\hat{\rho}_{k}$ in the model: $$y_{t} = \alpha + \rho_{k} y_{t-k} + \epsilon_{t}$$

The OLS coefficient is estimated as $\hat{\beta} = \frac{Cov[x,y]}{Var[x]}$, which is essentially what we have in the equation above. Be aware that the normalization by the variance is common in statistics/econometrics, while in engineering it is usually dropped and autocorrelation and autocovariance are used interchangeably.

As $r_{k}$ is an estimator, it has a distribution and we can test the hypothesis that $\rho_{k} = 0$ with a t-test.

The usual pattern for autocorrelation is an exponential decay, like a dampened sine wave. When plotting the autocorrelation, if the function dies off immediately, it is a sign that we are not too dependent on the last observations, but if it dies down extremely slowly, it might be that we have problems with non-stationarity.

Let's also look at the partial autocorrelation (PACF) or SPAC. The partial autocorrelation measures how $y_{t}$ and $y_{t-k}$ are related without taking the periods in between into account.

In other words, what is the correlation between $y_{t-k}$ and the part of $y_{t}$ that is not explained by any previous lag:

$$y_{t} = \alpha + \rho_{k1}y_{t-1} + \rho_{k2}y_{t-2} + \dots + \rho_{kk}y_{t-k} + \epsilon_{t}$$

As with the SAC, we can also estimate SPAC with the OLS regression, using a t-test to find the critical values.

Below, I have created some functions to plot the SAC and SPAC, where we continue to use the US payrolls dataset. However, this time we make it stationary by differencing it, as you can remember from above.

def plot_sac(x,lags,alpha):

    sac = acf(x, unbiased=True, fft=True, nlags=lags, alpha=alpha)
    
    df_auto = pd.DataFrame()
    
    df_auto["sac_l"] = np.array([t for t in sac[0]])
    df_auto["upper_conf_s"] = np.array([t[1] for t in sac[1]]) - df_auto["sac_l"].values
    df_auto["lower_conf_s"] = np.array([t[0] for t in sac[1]]) - df_auto["sac_l"].values
    df_auto["lags"] = np.arange(len(df_auto))
    
    auto_fig = go.Figure([
    go.Scatter(
        name='Upper Bound',x=df_auto['lags'],y=df_auto["upper_conf_s"],
        mode='lines',marker=dict(color="#444"),line=dict(width=0),
        showlegend=False),
        
    go.Scatter(
        name='Lower Bound',x=df_auto['lags'],y=df_auto["lower_conf_s"],
        marker=dict(color="#444"),line=dict(width=0),mode='lines',
        fillcolor='rgba(68, 68, 68, 0.3)',fill='tonexty',
        showlegend=False)])
    
    auto_fig.add_trace(go.Bar(
        name='Autocorrelation',x=df_auto['lags'],y=df_auto["sac_l"]))

    auto_fig.update_layout(
    yaxis_title='Correlation',
    title='Auto Correlation Function',
    hovermode="x")
    return auto_fig
def plot_psac(x,lags,alpha):

    psac = pacf(x, nlags=lags,alpha=alpha)
    
    df_auto = pd.DataFrame()
    df_auto["psac_l"] = np.array([t for t in psac[0]])
    df_auto["upper_conf"] = np.array([t[1] for t in psac[1]]) - df_auto["psac_l"].values
    df_auto["lower_conf"] = np.array([t[0] for t in psac[1]]) - df_auto["psac_l"].values
    df_auto["lags"] = np.arange(len(df_auto))
    
    auto_fig = go.Figure([
    go.Scatter(
        name='Upper Bound',x=df_auto['lags'],y=df_auto["upper_conf"],
        mode='lines',marker=dict(color="#444"),line=dict(width=0),
        showlegend=False),
        
    go.Scatter(
        name='Lower Bound',x=df_auto['lags'],y=df_auto["lower_conf"],
        marker=dict(color="#444"),line=dict(width=0),mode='lines',
        fillcolor='rgba(68, 68, 68, 0.3)',fill='tonexty',
        showlegend=False)])
    
    auto_fig.add_trace(go.Bar(
        name='Autocorrelation',x=df_auto['lags'],y=df_auto["psac_l"]))

    auto_fig.update_layout(
    yaxis_title='Correlation',
    title='Partial Auto Correlation Function',
    hovermode="x")
    return auto_fig
plot_sac(df_payrollUS["diff1"].dropna(),lags=40,alpha=.01)
plot_psac(df_payrollUS["diff1"].dropna(),lags=40,alpha=.01)

From both plots above, we can see that there is definitely some seasonality going on. We should probably consider differencing the series by 12 to get rid of the seasons.

plot_sac(df_payrollUS["diff1_12"].dropna(),lags=40,alpha=.01)
plot_psac(df_payrollUS["diff1_12"].dropna(),lags=40,alpha=.01)

By differencing by 12 and then by 1, we seem to have a stationary series, both in terms of trend and seasonality.

ARMA

The ARMA model is made up of the AR (auto-regressive) and MA (moving average) processes, ARMA(p,q), where p is the number of lags of the auto-regressive process and q is the number of lags of the moving average process.

Mathematically: $$y_{t} = \varphi_{1} y_{t-1} + \dots + \varphi_{p} y_{t-p} + \epsilon_{t} - \theta_{1}\epsilon_{t-1}-\dots-\theta_{q}\epsilon_{t-q}$$ $$\text{where } \epsilon_{t} \text{ is i.i.d: } N(0,\sigma^2)$$

Despite the indications above of which lags we should choose for the ARMA model, let us start with some naive approaches and come back later to why we chose the model we did.

We can start with an ARMA(0,0) model, where we have no auto-regressive or moving average process, which means that $y_{t} = \epsilon_{t}$, in other words white noise.

From this model it is easy to see that both the SAC and SPAC are zero, as there is no correlation going on.
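A quick check on synthetic white noise of my own, using the acf function imported earlier, confirms this:

#For white noise the sample autocorrelations hover around zero
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(1)
noise = rng.normal(size=1000)            #ARMA(0,0): y_t = eps_t
print(np.round(acf(noise, nlags=5), 3))  #lag 0 is 1, the rest are close to 0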

AR (Auto-Regressive)

Let us further try an AR(1) model, ARMA(1,0): $y_{t} = \varphi_{1}y_{t-1} + \epsilon_{t} \text{ where } |\varphi| < 1$

We have $\rho_{k} = Corr(y_{t},y_{t-k}) = \frac{Cov[y_{t},y_{t-k}]}{Var[y_{t}]} = \frac{\gamma_{k}}{\gamma_{0}}$

Deriving the variance: $$\gamma_{0} = Var(y_{t}) = Var(\varphi_{1} y_{t-1} + \epsilon_{t})$$ $$= \varphi^2_{1}Var(y_{t-1}) + Var(\epsilon_{t}) + 2\varphi_{1}Cov(y_{t-1}, \epsilon_{t})$$ $$= \varphi^2_{1}\gamma_{0} + \sigma^2 + 2\varphi_{1} \cdot 0$$

$$\text{We get: } \gamma_{0} = \frac{\sigma^2}{1 - \varphi^2_{1}}$$

As discussed above, if $\varphi_{1} = 1$, i.e. the process is non-stationary, this expression breaks down.

To get the covariance at lag 1: $$\gamma_{1} = Cov(y_{t},y_{t-1})$$ $$= Cov(\varphi_{1} y_{t-1} + \epsilon_{t},\,y_{t-1})$$ $$= \varphi_{1}Cov(y_{t-1},y_{t-1}) + Cov(\epsilon_{t},y_{t-1})$$ $$= \varphi_{1}\gamma_{0}$$

We then get: $$\rho_{1} = \frac{\gamma_{1}}{\gamma_{0}} = \varphi_{1}$$

If we continue, we find that $\rho_{k}=\varphi^k_{1}$, which tells us that the autocorrelation dies down exponentially.

This becomes quite cumbersome as the order of the auto-regressive process increases, so instead we use the Yule-Walker equations. For a thorough explanation, please go to this link. We end up with: $$\rho_{k} = \sum^p_{j=1}\varphi_{j}\rho_{|j-k|}\quad\text{for } k = 1,2,\dots,p$$
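A minimal sketch of putting the Yule-Walker equations to work on a synthetic AR(2) series of my own: plug the sample autocorrelations into the equations and solve the linear system for the coefficients.

#Recover AR(2) coefficients from the sample autocorrelations via Yule-Walker
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(3)
eps = rng.normal(size=5000)
y = np.zeros(5000)
for t in range(2, 5000):                 #simulate y_t = 0.6*y_{t-1} + 0.2*y_{t-2} + eps_t
    y[t] = 0.6 * y[t - 1] + 0.2 * y[t - 2] + eps[t]

r = acf(y, nlags=2)                      #r_0, r_1, r_2
R = np.array([[1.0, r[1]],               #the rho_{|j-k|} matrix for k = 1, 2
              [r[1], 1.0]])
phi_hat = np.linalg.solve(R, r[1:])      #solve R @ phi = [rho_1, rho_2]
print(np.round(phi_hat, 2))              #close to [0.6, 0.2]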

The AR process is a very powerful tool and can be extended to many other techniques. We can model non-linear auto-regressive processes, such as the variance through ARCH processes, or auto-regressive processes with thresholds such as TAR (Threshold AR) and STAR (Smooth Transition AR). In this post we have focused on univariate time series, but the AR process can also be extended to several series modeled jointly, as in vector autoregression (VAR), where we have two or more auto-regressive processes whose error terms may be correlated or even cointegrated. I will get back to several of these models in a later post.

Let's move on to create the AR function. In the function I have also included the possibility of splitting into a train and test set, so we can check how well the model does on “unseen” data.

def AR(data, p, trainSize):
    

    data = np.asarray(data)
    
    # Create lags, and trim off any lag values.
    d_lag = lagmat(data[:, None], p, trim='both', original='in')
    df_lag = pd.DataFrame(d_lag)

    train_size = (int)(trainSize * df_lag.shape[0])

    #Breaking data set into test and training
    df_train = pd.DataFrame(df_lag[0:train_size])
    df_test = pd.DataFrame(df_lag[train_size:data.shape[0]])

    #Get the lags, or independent variables
    X_train = df_train.iloc[:,1:].values.reshape(-1,p)
    #Get the dependent variable Y
    y_train = df_train.iloc[:,0].values.reshape(-1,1)

 
    #Add the ones to the training set to get the intercept
    X_train2 = np.append(np.ones([X_train.shape[0],  1], dtype=np.int32),X_train, axis=1)
    
    #Get the phi coefficents for the lags with standard OLS
    b_ts = np.linalg.inv(X_train2.T.dot(X_train2)).dot(X_train2.T).dot(y_train)
    
    #Fit the regression
    df_train['Predicted'] = X_train2.dot(b_ts)
    
    #Get the X test set
    X_test = df_test.iloc[:,1:].values.reshape(-1,p)
    
    #Insert ones to get the intercept.
    X_test2 = np.append(np.ones([X_test.shape[0],  1], dtype=np.int32),X_test, axis=1)
    #Fit on test set.
    df_test['Predicted'] = X_test2.dot(b_ts)
    
    RMSE = np.sqrt(np.square(np.subtract(df_test.iloc[:,0], df_test['Predicted'])).mean())

    return [df_train,df_test,b_ts,RMSE]

MA (Moving Average)

Now that we have created the AR function, we also have to create a function for the other half of the ARMA technique, the MA part.

Mathematically, an MA(1) or ARMA(0,1) process is derived as follows: $$Y_{t} = \varphi_{0} + \epsilon_{t} + \theta \epsilon_{t-1}, \text{ where } \epsilon_{t} \sim WN(0,\sigma^2)$$ $$E[Y_{t}] = E[\varphi_{0} + \epsilon_{t} + \theta \epsilon_{t-1}]$$ $$E[Y_{t}] = \varphi_{0}$$

As above: $$\gamma_{0} = V[Y_{t}] = V[\varphi_{0} + \epsilon_{t} + \theta \epsilon_{t-1}]$$ $$= V[\epsilon_{t}] + V[\theta \epsilon_{t-1}] + 2Cov[\epsilon_{t},\theta \epsilon_{t-1}]$$ $$= \sigma^2 + \theta^2 \sigma^2 + 0$$ $$= \sigma^2(1+\theta^2)$$

$$\gamma_{1} = Cov[Y_{t},Y_{t-1}]$$ $$= E[(Y_{t} - \varphi_{0})(Y_{t-1} - \varphi_{0})]$$ $$= E[(\epsilon_{t} + \theta \epsilon_{t-1})(\epsilon_{t-1} + \theta \epsilon_{t-2})]$$ $$= E[\epsilon_{t} \epsilon_{t-1}+ \theta \epsilon^2_{t-1} + \theta \epsilon_{t} \epsilon_{t-2} + \theta^2\epsilon_{t-1} \epsilon_{t-2}]$$ $$= \theta \sigma^2$$

A general moving average process: $$Y_{t} = \varphi_{0} + \epsilon_{t} + \theta_{1} \epsilon_{t-1}+\dots+\theta_{q} \epsilon_{t-q}, \quad \epsilon_{t} \sim WN(0,\sigma^2)$$

A great thing about MA processes is that they are all stationary: by definition only the last q shocks matter, and as shocks are transitory, we get stationarity.
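A small simulation with synthetic data of my own shows the flip side: for an MA(1) the sample autocorrelation is visible at lag 1 and roughly zero beyond it, since only the last shock matters.

#MA(1): y_t = eps_t + theta * eps_{t-1}; the ACF cuts off after lag 1
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(2)
eps = rng.normal(size=5001)
theta = 0.7
y = eps[1:] + theta * eps[:-1]
print(np.round(acf(y, nlags=4), 3))  #roughly [1, theta/(1+theta^2) ~ 0.47, 0, 0, 0]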

The main problem with estimating a moving average process is that the errors are unobserved, and therefore we cannot estimate them directly with OLS. To avoid the cumbersome calculations needed to estimate the MA process properly, we can use a neat trick. It will not be as accurate, but it will give you an idea.

What we will do is take the quasi white noise, which in this case are the residuals from the AR process, and use those as the input to our MA calculation. Then we can use OLS to fit the coefficients of the MA process.

def MA(data_res, q, trainSize):
    

    data = np.asarray(data_res)
    
    # Create lags, and trim off any lag values.
    res_lag = lagmat(data[:, None], q, trim='both', original='in')
    res_lag = pd.DataFrame(res_lag)

    train_size = (int)(trainSize * res_lag.shape[0])

    #Breaking data set into test and training
    r_train = pd.DataFrame(res_lag[0:train_size])
    r_test = pd.DataFrame(res_lag[train_size:data.shape[0]])

    #Get the lags, or independent variables
    X_train = r_train.iloc[:,1:].values.reshape(-1,q)
    #Get the dependent variable Y
    y_train = r_train.iloc[:,0].values.reshape(-1,1)

 
    #Add the ones to the training set to get the intercept
    X_train2 = np.append(np.ones([X_train.shape[0],  1], dtype=np.int32),X_train, axis=1)
    
    #Get the phi coefficents for the lags with standard OLS
    b_ts = np.linalg.inv(X_train2.T.dot(X_train2)).dot(X_train2.T).dot(y_train)
    
    #Fit the regression
    r_train['Predicted'] = X_train2.dot(b_ts)
    
    #Get the X test set
    X_test = r_test.iloc[:,1:].values.reshape(-1,q)
    
    #Insert ones to get the intercept.
    X_test2 = np.append(np.ones([X_test.shape[0],  1], dtype=np.int32),X_test, axis=1)
    #Fit on test set.
    r_test['Predicted'] = X_test2.dot(b_ts)
    
    RMSE = np.sqrt(np.square(np.subtract(r_test.iloc[:,0], r_test['Predicted'])).mean())
    return [r_train,r_test,b_ts,RMSE]

From the SAC and SPAC above, we saw that we needed to difference the series once with regular differencing, while the output looked better when we also differenced by 12 to handle the seasonality.

For now we will fit the model using both the seasonal and the monthly difference, but later on we will want a model that takes the seasonality into account explicitly.

#Get the values from the AR process
[df_train,df_test,b_ts,RMSE] = AR(df_payrollUS["diff1_12"].dropna(),p = 2,trainSize = 0.8)
#Create a new df to concatenate the train and test set.
df_concat = pd.concat([df_train,df_test])
df_concat = df_concat.rename({0:"Real"}, axis=1)
df_concat.columns
Index(['Real', 1, 2, 'Predicted'], dtype='object')
tr_pl = px.line(df_concat,x=df_concat.index, y=["Real","Predicted"], title='AR fit')
tr_pl.show()

The fit does not look too bad; let's get the errors.

res = pd.DataFrame()
res['Residuals'] = df_concat["Real"] - df_concat["Predicted"]

group_labels = ['Payrolls']
hist_data = [res["Residuals"].dropna()]
histfig = ff.create_distplot(hist_data, group_labels)
histfig.show()
qq = stats.probplot(res["Residuals"].dropna(), dist='norm', sparams=(1))
x = np.array([qq[0][0][0], qq[0][0][-1]])

fign = go.Figure()
fign.add_scatter(x=qq[0][0], y=qq[0][1], mode='markers')
fign.add_scatter(x=x, y=qq[1][1] + qq[1][0]*x, mode='lines')
fign.layout.update(showlegend=False)
fign.show()

Looking at the errors, we can see some huge outliers, which probably come from the last observations in 2020, when unemployment increased rapidly. This tells us that the errors are not normally distributed.

[r_train,r_test,b_tsam,RMSE] = MA(res["Residuals"].dropna(),q = 2,trainSize = 0.8)
r_concat = pd.concat([r_train,r_test])
r_concat = r_concat.rename({0:"Real"}, axis=1)

ma_pl = px.line(r_concat,x=r_concat.index, y=["Real","Predicted"], title='MA fit')
ma_pl.show()
#Add the AR and MA process together and plot
df_concat["Predicted"] += r_concat["Predicted"]
df_concat = df_concat.dropna()
arima_pl = px.line(df_concat,x=df_concat.index, y=["Real","Predicted"], title='ARIMA fit')
arima_pl.show()
df_concat.Real += payrollUS["Payrolls"].shift(1)
df_concat.Real += payrollUS["Payrolls"].diff().shift(12)
df_concat.Predicted += payrollUS["Payrolls"].shift(1)
df_concat.Predicted += payrollUS["Payrolls"].diff().shift(12)
df_concat.shape
(964, 4)
arima_plot = px.line(df_concat,x=df_concat.index, y=["Real","Predicted"], title='ARIMA fit')
arima_plot.show()

As you can see, the fit is not too great. Let us go further and look at seasonal ARIMA; maybe we will get a better fit.

SARIMA

Our ARIMA model already looked like a seasonal ARIMA (SARIMA), and all the evidence has pushed in that direction, so let's try to model one.

A SARIMA model is characterized by additional seasonal lags: in addition to ARIMA(p,d,q) we get ARIMA(p,d,q)(P,D,Q), where P is the order of the seasonal auto-regressive process and Q the order of the seasonal moving average process. The modeling of the seasonal AR and MA processes is still the same, only we use lags of 12 in the case of monthly seasonality instead of lags of 1. The seasonal lag could just as easily be 4 for quarterly data, and so on.

An advantage of a SARIMA model is that it handles long seasonal dependencies better than a standard ARIMA model.

We have thus far inferred that the model could look like this: SARIMA(0,1,0)(0,1,0).

As it is quite cumbersome to write out an entire SARIMA from scratch, since we need to estimate many parameters and evaluate the MA coefficients by maximum likelihood, we will use the statsmodels library for the final results.

Mathematically we multiply the seasonal polynomial with the non-seasonal one. So if we have a SARIMA(1,0,0)(1,0,0) process, it looks like this:

Non-seasonal AR: $$y_{t} = \varphi_{1}y_{t-1} + \epsilon_{t}$$

Seasonal AR: $$y_{t} = \phi_{1}y_{t-12} + \epsilon_{t}$$

We multiply the two AR components and get:

$$y_{t} = \varphi_{1}y_{t-1} + \phi_{1}y_{t-12} - \varphi_{1}\phi_{1}y_{t-13} + \epsilon_{t}$$

To find the right model there are several strategies you can use, but the most common is to figure out what differencing is needed first, and then start with an AR model before moving on. The models are usually compared using an information criterion, such as the AIC (Akaike IC) or BIC (Bayesian IC). Statisticians as well as econometricians usually prefer parsimonious models, which in general means simple models with great explanatory power. If you overfit the model, you might have a harder time predicting the future, as your variance will increase; it is therefore better to accept a bit more bias and make the fit good, but not overfit. So I tend to go with the BIC, as it is the most restrictive of the information criteria, and I try to keep the number of parameters in the model low.
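As a sketch of how such a comparison could look, here is a small loop over a few candidate orders, printing the BIC of each; the candidate list is my own, it assumes the df_payrollUS frame from above, and each fit should of course also be checked against the Ljung-Box output.

#Compare a few candidate SARIMA orders by BIC (lower is better)
import statsmodels.api as sm

candidates = [((0, 1, 0), (0, 1, 0, 12)),
              ((1, 1, 0), (0, 1, 1, 12)),
              ((2, 1, 0), (1, 1, 1, 12))]
for order, seasonal_order in candidates:
    fit = sm.tsa.statespace.SARIMAX(df_payrollUS["Payrolls"],
                                    order=order,
                                    seasonal_order=seasonal_order).fit(disp=False)
    print(order, seasonal_order, round(fit.bic, 1))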

For the model SARIMA(2,1,0)(1,1,1,12) (a 12-month seasonal lag), I got one of the lowest BIC scores, while at the same time the Ljung-Box test statistic does not reject the H0 that what is left after fitting the model is essentially white noise.

mod = sm.tsa.statespace.SARIMAX(df_payrollUS["Payrolls"], order=(2,1,0), seasonal_order=(1,1,1,12))
residu = mod.fit(disp=False)
print(residu.summary())
                                      SARIMAX Results                                       
============================================================================================
Dep. Variable:                             Payrolls   No. Observations:                  981
Model:             SARIMAX(2, 1, 0)x(1, 1, [1], 12)   Log Likelihood               -7775.568
Date:                              Thu, 19 Nov 2020   AIC                          15561.137
Time:                                      14:31:51   BIC                          15585.513
Sample:                                           0   HQIC                         15570.416
                                              - 981                                         
Covariance Type:                                opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.0681      0.007      9.285      0.000       0.054       0.082
ar.L2         -0.0942      0.011     -8.328      0.000      -0.116      -0.072
ar.S.L12       0.3958      0.148      2.667      0.008       0.105       0.687
ma.S.L12      -0.6800      0.164     -4.141      0.000      -1.002      -0.358
sigma2      5.672e+05   1675.718    338.488      0.000    5.64e+05     5.7e+05
===================================================================================
Ljung-Box (Q):                        6.73   Jarque-Bera (JB):          15236541.73
Prob(Q):                              1.00   Prob(JB):                         0.00
Heteroskedasticity (H):              10.73   Skew:                           -21.99
Prob(H) (two-sided):                  0.00   Kurtosis:                       616.05
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

From the results above we can see:

  • As we suspected, there is both skew and excess kurtosis in the residuals.
  • Jarque-Bera rejects the H0 of normality.
  • Ljung-Box does not reject the H0 of no autocorrelation.
  • All coefficients are significant at more than the 99% level.

res_iduals = residu.resid.values
lab = ['Residuals']
histo_data = [res_iduals]
res_fig = ff.create_distplot(histo_data, lab)
res_fig.show()

Compared to the residuals from our hand-rolled ARIMA(2,1,2)(0,1,0) with quasi white noise, the residuals from the SARIMA(2,1,0)(1,1,1,12) look pretty similar. They are still not normal, as we have large outliers, but this mostly affects the confidence intervals, so we could bootstrap from the empirical residual distribution to get more realistic intervals.

Let us try to make a forecast for the last 25 values.

forecast = residu.get_prediction(start=-25)
prd_df = payrollUS[["Month","Payrolls"]].copy()
prd_df["mean_forecast"] = forecast.predicted_mean
confidence_intervals = forecast.conf_int()
prd_df = prd_df.merge(confidence_intervals, left_index=True, right_index=True, how='outer')
pred_fig = go.Figure([
    go.Scatter(
        name='Upper Bound',x=prd_df["Month"],y=prd_df["upper Payrolls"],
        mode='lines',marker=dict(color="#444"),line=dict(width=0),
        showlegend=False),
        
    go.Scatter(
        name='Lower Bound',x=prd_df["Month"],y=prd_df["lower Payrolls"],
        marker=dict(color="#444"),line=dict(width=0),mode='lines',
        fillcolor='rgba(68, 68, 68, 0.3)',fill='tonexty',
        showlegend=False)])

pred_fig.add_trace(go.Scatter(
    name='Real',x=prd_df["Month"],y=prd_df["Payrolls"]))
    
pred_fig.add_trace(go.Scatter(
    name='Forecast',x=prd_df["Month"],y=prd_df["mean_forecast"]))

pred_fig.update_layout(
    yaxis_title='Employed',
    title='SARIMA Forecast',
    hovermode="x")

pred_fig.show()

Conclusion

Finally, we have gone through several methods for fitting time-series models, making forecasts, creating confidence intervals and looking at the respective error distributions.

As we talked about before, there are several extensions to these models, like vector AR for multivariate time series, threshold AR and smooth transition AR, and ARCH (autoregressive conditional heteroskedasticity) for variance forecasting. The SARIMA model is incredibly flexible and can be extended to handle exogenous variables, just like in a regression, but over time (called SARIMAX), or without seasonality (ARIMAX). We can also make a multivariate SARIMA, called VARIMAX, but then several additional assumptions have to hold.

Another thing that I did not pay too much attention to is that many time series grow exponentially, in the sense that the trend or the seasonal fluctuations increase over time. This can be dealt with by taking the log. And if you look at the irregular component in the first decomposition, you can spot a bit of leftover seasonality there, so it could have been an idea to take the log here as well.

In future posts we will probably look into stochastic and Bayesian forecasting.