
Time Series & Forecasting

Time series forecasting is one of the oldest quantitative disciplines — demand planning, weather prediction, and financial modeling all depend on it. Modern practice blends classical statistical models (ARIMA, exponential smoothing) with deep sequence models that capture complex seasonality and covariates.

Theory

Time Series Forecast (ARIMA-style)
[Interactive chart: history (m1-m24) followed by forecast (m25-m34), y-axis roughly 100-180]
ARIMA captures trend and seasonality. Toggle "actual" to see forecast error. Shaded band = 95% prediction interval.

Time series forecasting rests on one assumption before any classical model can be fit: that the statistical properties of the series are constant over time. A stock price trending upward is non-stationary because its mean changes. ARIMA and most classical methods require stationarity, so the first step is always testing for it and differencing until the test passes. The diagram above shows a non-stationary (trending) series vs. its stationary differenced form.

Stationarity and Differencing

A time series (y_t) is weakly stationary if its mean and autocovariance are time-invariant:

\mathbb{E}[y_t] = \mu \quad \forall t, \qquad \text{Cov}(y_t, y_{t-k}) = \gamma_k \quad \forall t

Stationarity is required because non-stationary series have time-varying parameters — a model fit on one time period may have completely different coefficients than one fit on another. The autocovariance γk\gamma_k being time-invariant is specifically the condition that makes the AR and MA coefficients in ARIMA constant and therefore estimable from a single observed series. Without stationarity, maximum likelihood estimation of ARIMA parameters is ill-defined: the model's parameters are themselves changing over time, and you cannot consistently estimate a moving target from historical data alone.

Most real series are non-stationary (trending, seasonal). Differencing achieves stationarity:

\Delta y_t = y_t - y_{t-1} \qquad (\text{first difference})

\Delta^d y_t = \Delta(\Delta^{d-1} y_t) \qquad (\text{order-}d\text{ differencing})

The Augmented Dickey-Fuller (ADF) test takes a unit root (non-stationarity) as its null hypothesis. Rejecting H_0 → series is stationary.
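Differencing is a one-liner in numpy; a quick sketch on a synthetic trending series (the data here is made up for illustration) shows the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
y = 0.5 * t + rng.normal(0.0, 1.0, size=200)  # linear trend + noise: non-stationary

dy = np.diff(y)  # first difference: dy[t] = y[t+1] - y[t]

# The trend appears as a drifting mean; differencing removes it.
first_half_mean, second_half_mean = y[:100].mean(), y[100:].mean()
print(f"raw series: mean drifts from {first_half_mean:.1f} to {second_half_mean:.1f}")
print(f"differenced: mean ~ {dy.mean():.2f} (the trend slope), stable over time")
```

In practice you would follow this with an ADF test on the differenced series (statsmodels' adfuller, as in the walkthrough code) and difference again if the test still fails to reject.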

ARIMA(p, d, q)

Box-Jenkins ARIMA combines three components:

  • AR(p): autoregression on p lags — y_t depends on its own past
  • I(d): differencing d times to achieve stationarity
  • MA(q): moving average on q past error terms

The full ARIMA(p, d, q) model on the differenced series w_t = Δ^d y_t:

w_t = c + \sum_{i=1}^{p} \phi_i w_{t-i} + \varepsilon_t - \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)

Seasonal ARIMA, SARIMA(p, d, q)(P, D, Q)_m, adds seasonal AR, differencing, and MA terms at lag m (e.g., m = 12 for monthly data).

Fit via maximum likelihood estimation. Select (p, d, q) by minimizing AIC:

\text{AIC} = 2k - 2\ln\hat{\mathcal{L}}

where k is the number of parameters and L̂ is the maximized likelihood.
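To build intuition for the ARMA recursion on the differenced series, it helps to simulate it; the parameters below (φ = 0.6, θ = 0.3) are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, phi, theta = 500, 0.6, 0.3
eps = rng.normal(0.0, 1.0, size=n)

# w_t = phi * w_{t-1} + eps_t - theta * eps_{t-1}   (c = 0)
w = np.zeros(n)
for t in range(1, n):
    w[t] = phi * w[t - 1] + eps[t] - theta * eps[t - 1]

# Undo d=1 differencing by integrating: y_t = y_{t-1} + w_t
y = np.cumsum(w)
```

Differencing y recovers w exactly (np.diff(y) equals w[1:]), which is the sense in which the "I" in ARIMA is bookkeeping between the observed scale and the stationary scale the AR and MA terms operate on.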

Exponential Smoothing (ETS)

Holt-Winters triple exponential smoothing decomposes the series into level (ℓ), trend (b), and seasonal (s) components, updated with smoothing parameters α, β, γ ∈ (0, 1):

\ell_t = \alpha(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})

b_t = \beta(\ell_t - \ell_{t-1}) + (1-\beta)b_{t-1}

s_t = \gamma(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-m}

h-step forecast: \hat{y}_{t+h} = \ell_t + h \cdot b_t + s_{t+h-m}
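The three recursions are short enough to implement directly. Here is a minimal numpy sketch of the additive form; the initialization (level = first-season mean, zero trend, first-season deviations as seasonal indices) is a simplistic assumption of ours, not something prescribed above:

```python
import numpy as np

def holt_winters_additive(y, m, alpha=0.3, beta=0.1, gamma=0.2, h=6):
    """Additive Holt-Winters via the level/trend/seasonal recursions.
    Initialization is a deliberately simple (assumed) scheme: level = mean of
    the first season, trend = 0, seasonal indices = first-season deviations."""
    y = np.asarray(y, dtype=float)
    level, trend = y[:m].mean(), 0.0
    season = list(y[:m] - level)            # s_0 .. s_{m-1}

    for t in range(m, len(y)):
        l_prev, b_prev, s_prev = level, trend, season[t - m]
        level = alpha * (y[t] - s_prev) + (1 - alpha) * (l_prev + b_prev)
        trend = beta * (level - l_prev) + (1 - beta) * b_prev
        season.append(gamma * (y[t] - l_prev - b_prev) + (1 - gamma) * s_prev)

    # h-step forecast: level + h*trend + matching seasonal index (valid for h <= m)
    n = len(y)
    return np.array([level + k * trend + season[n - 1 + k - m]
                     for k in range(1, h + 1)])
```

statsmodels' ExponentialSmoothing provides proper initialization and parameter optimization; this sketch only mirrors the update equations.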

Temporal Fusion Transformer (TFT)

Lim et al. (Google, 2021) introduce TFT, a transformer-based architecture designed specifically for multi-horizon forecasting with interpretability:

Key components:

  • Variable Selection Networks (VSN): learned sparse selection over static and temporal covariates using gated residual networks
  • LSTM encoder-decoder: captures local temporal patterns; encoder processes historical inputs, decoder processes known future inputs (e.g., day-of-week)
  • Interpretable multi-head attention: attention heads specialize to different temporal patterns (seasonality, trend)
  • Quantile outputs: predict distribution quantiles (10th, 50th, 90th percentile) for calibrated prediction intervals

The quantile loss (pinball loss) for quantile q ∈ (0, 1):

\mathcal{L}_q(y, \hat{y}) = \begin{cases} q\,(y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1-q)\,(\hat{y} - y) & \text{if } y < \hat{y} \end{cases}
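The piecewise definition collapses to a single where expression (the function name is ours):

```python
import numpy as np

def pinball_loss(y, y_hat, q):
    """Quantile (pinball) loss, averaged over observations."""
    diff = np.asarray(y, float) - np.asarray(y_hat, float)
    # diff >= 0 (under-prediction) costs q*diff; diff < 0 costs (1-q)*|diff|
    return np.mean(np.where(diff >= 0, q * diff, (q - 1) * diff))
```

At q = 0.5 this is half the absolute error, so the minimizer is the median. At q = 0.9, under-predicting by one unit costs 0.9 while over-predicting costs 0.1, pushing the optimal forecast up toward the 90th percentile of the outcome distribution.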

Walkthrough

Task: Forecast monthly airline passengers for 6 months using 24 months of history.

Step 1 — EDA and Stationarity

Plot the series: clear upward trend + annual seasonality (peaks in summer). ACF/PACF plots show: ACF decays slowly (non-stationary trend), significant spike at lag 12 (seasonality).

Step 2 — Differencing

Apply d=1 ordinary differencing to remove trend, D=1 seasonal differencing at m=12 to remove seasonality. ADF test on the doubly-differenced series: p-value 0.002, reject non-stationarity.

Step 3 — Model Identification

ACF of differenced series shows significant spike at lag 1 (q=1, Q=1). PACF shows spike at lag 1 (p=1). Candidate model: SARIMA(1,1,1)(1,1,1)_12.

Step 4 — Estimation and Diagnostics

Fit via MLE. Check residuals: Ljung-Box test (residuals should be white noise), Q-Q plot (residuals should be approximately normal). AIC = 241.3.

Step 5 — Forecasting

Generate 6-step-ahead point forecast + 95% prediction intervals. Intervals widen over the horizon because uncertainty compounds.

MASE (Mean Absolute Scaled Error):

\text{MASE} = \frac{\text{MAE}}{\frac{1}{n-m}\sum_{t=m+1}^{n} |y_t - y_{t-m}|}

MASE < 1 means the model beats the seasonal naive baseline. MASE = 0.73 here.
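As a standalone metric, MASE is a few lines; note the denominator is the seasonal-naive MAE computed in-sample on the training series (function and argument names here are illustrative):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=12):
    """MAE of the forecast, scaled by the in-sample seasonal-naive MAE."""
    mae = np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))
    y_train = np.asarray(y_train, float)
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))   # |y_t - y_{t-m}|
    return mae / scale
```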

SARIMA fitting and forecasting with statsmodels:

python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
 
# Classic Box-Jenkins airline passengers dataset (monthly, 1949-1960)
# Log-transform stabilizes the multiplicative seasonality: var(y) ∝ level
log_y = np.log(passengers)     # passengers: pd.Series of monthly counts
 
# 1. Test for stationarity — null hypothesis: unit root (non-stationary)
adf_stat, p_val, *_ = adfuller(log_y)
print(f"ADF p-value: {p_val:.4f}")  # typically > 0.05 → need differencing
 
# 2. Fit SARIMA(1,1,1)(1,1,1)_12 on log-transformed series
model = SARIMAX(
    log_y,
    order=(1, 1, 1),            # (p, d, q): AR=1, one diff, MA=1
    seasonal_order=(1, 1, 1, 12), # (P, D, Q, m): seasonal AR/diff/MA at lag 12
    enforce_stationarity=False,
    enforce_invertibility=False,
)
result = model.fit(disp=False)
print(result.summary())         # AIC, BIC, Ljung-Box p-value
 
# 3. Forecast 12 months ahead with 95% prediction intervals
forecast = result.get_forecast(steps=12)
pred_mean = np.exp(forecast.predicted_mean)           # back-transform from log
conf_int  = np.exp(forecast.conf_int(alpha=0.05))
 
for month, (mean, lo, hi) in enumerate(
        zip(pred_mean, conf_int.iloc[:, 0], conf_int.iloc[:, 1]), 1):
    print(f"Month +{month:02d}: {mean:.0f}  [95% CI: {lo:.0f}{hi:.0f}]")

Walk-forward (expanding window) cross-validation:

Random train/test splits destroy temporal ordering — a test point may be "predicted" using future data from training. Walk-forward CV respects causality: the model always trains on past, predicts future.

python
from statsmodels.tsa.statespace.sarimax import SARIMAX
import numpy as np
 
def walk_forward_cv(log_series, n_test: int = 12,
                    order=(1, 1, 1), seasonal=(1, 1, 1, 12)) -> dict:
    """
    Expanding-window CV: train on all data up to time t, predict t+1.
    Returns MAE and MASE vs. the seasonal naive baseline.
    """
    y = np.asarray(log_series)           # accept a pd.Series or ndarray
    n = len(y)
    errors, naive_errors = [], []

    for t in range(n - n_test, n):
        train = y[:t]
        actual = np.exp(y[t])
        seasonal_naive = np.exp(y[t - seasonal[3]])  # same month last year

        try:
            fit = SARIMAX(train, order=order,
                          seasonal_order=seasonal).fit(disp=False)
            pred = np.exp(fit.forecast(steps=1)[0])
        except Exception:
            pred = seasonal_naive        # fall back on convergence failure

        errors.append(abs(pred - actual))
        naive_errors.append(abs(seasonal_naive - actual))

    mae = np.mean(errors)
    mase = mae / np.mean(naive_errors)   # < 1.0 means we beat seasonal naive
    return {"mae": mae, "mase": mase}
 
metrics = walk_forward_cv(log_y)
print(f"MAE: {metrics['mae']:.1f} passengers  |  MASE: {metrics['mase']:.3f}")

Analysis & Evaluation

Where Your Intuition Breaks

Myth: "Neural networks always outperform classical methods like ARIMA for time series forecasting." In reality, for short series, strong seasonal patterns, and limited training data, classical methods (ARIMA, ETS, SARIMA) routinely match or outperform neural approaches. The M4 and M5 forecasting competitions showed that statistical ensembles outperformed many deep learning submissions. Neural methods (N-BEATS, PatchTST, TimesFM) win on long-horizon forecasting with abundant data and complex cross-series patterns, but these conditions are not universal. The right model depends on series length, data volume, and whether the task benefits from cross-series learning (where neural methods have a clear advantage) or individual series modeling (where ARIMA-class models remain competitive).

Model Comparison

| Model | Multi-step | Covariates | Uncertainty | Scalability | Interpretability |
|---|---|---|---|---|---|
| ARIMA/SARIMA | Recursive | Limited | Yes (analytic) | Single series | High |
| ETS | Direct | None | Yes (analytic) | Single series | High |
| Prophet | Direct | Holidays | Yes (MCMC) | Single series | High |
| LSTM/Seq2Seq | Direct | Many | Via MC dropout | Multi-series | Low |
| TFT | Direct | Many | Quantile | Multi-series | Medium |
| N-BEATS | Direct | None | Via ensembles | Multi-series | Medium |

Evaluation Metrics

MASE: Scale-free, compares to seasonal naive. Preferred for intermittent or near-zero series.

SMAPE (Symmetric MAPE):

\text{sMAPE} = \frac{200}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|}

Bounded in [0, 200%], handles near-zero values better than standard MAPE.
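A direct transcription (hypothetical helper; note that a term where actual and forecast are both exactly zero would still divide by zero and needs a guard in production code):

```python
import numpy as np

def smape(y_true, y_pred):
    """sMAPE in percent, bounded in [0, 200]."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    denom = np.abs(y_true) + np.abs(y_pred)   # zero only if both terms are zero
    return 200.0 / len(y_true) * np.sum(np.abs(y_true - y_pred) / denom)
```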

Winkler score: evaluates the calibration of prediction intervals — penalizes both width (being too conservative) and coverage failures.
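One common form of the Winkler score for a central (1 - α) interval [lower, upper] charges the interval width unconditionally, plus 2/α times any amount by which the interval misses the observation. A sketch under that convention (scaling conventions vary across references):

```python
import numpy as np

def winkler_score(y, lower, upper, alpha=0.05):
    """Mean Winkler score for central (1-alpha) prediction intervals; lower is better."""
    y, lower, upper = (np.asarray(a, float) for a in (y, lower, upper))
    width = upper - lower                                    # always paid
    below = np.where(y < lower, (2.0 / alpha) * (lower - y), 0.0)
    above = np.where(y > upper, (2.0 / alpha) * (y - upper), 0.0)
    return np.mean(width + below + above)
```

Wide intervals pay through the width term, overconfident intervals through the miss penalty, so minimizing the score requires both tightness and correct coverage.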

Common Pitfalls

Data leakage: Including future information in features during training. Strictly use only data that would have been available at prediction time t.

Wrong cross-validation: Random splitting destroys temporal structure. Always use walk-forward (expanding window) or rolling window CV.

Ignoring seasonality: Fitting ARIMA without seasonal differencing when seasonality is present leads to systematic bias at seasonal lags.

Global vs local models: A single global model (shared across all series) can outperform per-series ARIMA when series share patterns and training data is sufficient, but underperforms for highly idiosyncratic series.
