Time Series & Forecasting
Time series forecasting is one of the oldest quantitative disciplines — demand planning, weather prediction, and financial modeling all depend on it. Modern practice blends classical statistical models (ARIMA, exponential smoothing) with deep sequence models that capture complex seasonality and covariates.
Theory
Classical forecasting rests on one assumption before any model can be fit: that the statistical properties of the series are constant over time. A stock price trending upward is non-stationary — its mean changes. ARIMA and most classical methods require stationarity, so the first step is testing for it and differencing until the test passes. The diagram above shows a non-stationary series (trending) vs. its stationary differenced form.
Stationarity and Differencing
A time series is weakly stationary if its mean and autocovariance are time-invariant:

E[y_t] = μ for all t,  Cov(y_t, y_{t+h}) = γ(h) for all t and lags h
Stationarity is required because non-stationary series have time-varying parameters — a model fit on one time period may have completely different coefficients than one fit on another. The autocovariance being time-invariant is specifically the condition that makes the AR and MA coefficients in ARIMA constant and therefore estimable from a single observed series. Without stationarity, maximum likelihood estimation of ARIMA parameters is ill-defined: the model's parameters are themselves changing over time, and you cannot consistently estimate a moving target from historical data alone.
Most real series are non-stationary (trending, seasonal). Differencing achieves stationarity:

Δy_t = y_t − y_{t−1} (ordinary),  Δ_m y_t = y_t − y_{t−m} (seasonal, period m)
The Augmented Dickey-Fuller (ADF) test takes a unit root (non-stationarity) as its null hypothesis. Reject H_0 → series is stationary.
ARIMA(p, d, q)
Box-Jenkins ARIMA combines three components:
- AR(p): autoregression on p lags — y_t depends on its own past
- I(d): differencing d times to achieve stationarity
- MA(q): moving average on q past error terms
The full ARIMA(p, d, q) model on the differenced series w_t = Δ^d y_t:

w_t = c + φ_1 w_{t−1} + … + φ_p w_{t−p} + θ_1 ε_{t−1} + … + θ_q ε_{t−q} + ε_t,  ε_t ~ WN(0, σ²)
Seasonal ARIMA (SARIMA)(p,d,q)(P,D,Q)_m adds seasonal AR, differencing, and MA terms at lag m (e.g., m=12 for monthly data).
Fit via maximum likelihood estimation. Select (p, d, q) by minimizing AIC:

AIC = 2k − 2 ln L̂
where k is the number of parameters and L̂ is the maximized likelihood.
Exponential Smoothing (ETS)
Holt-Winters triple exponential smoothing decomposes the series into level (ℓ), trend (b), and seasonal (s) components, updated with smoothing parameters α, β, γ ∈ (0, 1):

ℓ_t = α(y_t − s_{t−m}) + (1 − α)(ℓ_{t−1} + b_{t−1})
b_t = β(ℓ_t − ℓ_{t−1}) + (1 − β)b_{t−1}
s_t = γ(y_t − ℓ_t) + (1 − γ)s_{t−m}
h-step forecast:

ŷ_{t+h} = ℓ_t + h·b_t + s_{t+h−m(k+1)},  where k = ⌊(h−1)/m⌋
Temporal Fusion Transformer (TFT)
Lim et al. (Google, 2021) introduce TFT, a transformer-based architecture designed specifically for multi-horizon forecasting with interpretability:
Key components:
- Variable Selection Networks (VSN): learned sparse selection over static and temporal covariates using gated residual networks
- LSTM encoder-decoder: captures local temporal patterns; encoder processes historical inputs, decoder processes known future inputs (e.g., day-of-week)
- Interpretable multi-head attention: attention heads specialize to different temporal patterns (seasonality, trend)
- Quantile outputs: predict distribution quantiles (10th, 50th, 90th percentile) for calibrated prediction intervals
The quantile loss (pinball loss) for quantile q ∈ (0, 1):

L_q(y, ŷ) = max(q(y − ŷ), (q − 1)(y − ŷ))
Walkthrough
Task: Forecast monthly airline passengers for 6 months using 24 months of history.
Step 1 — EDA and Stationarity
Plot the series: clear upward trend + annual seasonality (peaks in summer). ACF/PACF plots show: ACF decays slowly (non-stationary trend), significant spike at lag 12 (seasonality).
Step 2 — Differencing
Apply d=1 ordinary differencing to remove trend, D=1 seasonal differencing at m=12 to remove seasonality. ADF test on the doubly-differenced series: p-value 0.002, reject non-stationarity.
Step 3 — Model Identification
ACF of the differenced series shows significant spikes at lag 1 and lag 12 (suggesting q=1, Q=1). PACF shows a spike at lag 1 (p=1). Candidate model: SARIMA(1,1,1)(1,1,1)_12.
Step 4 — Estimation and Diagnostics
Fit via MLE. Check residuals: Ljung-Box test (residuals should be white noise), Q-Q plot (residuals should be approximately normal). AIC = 241.3.
Step 5 — Forecasting
Generate 6-step-ahead point forecast + 95% prediction intervals. Intervals widen over the horizon because uncertainty compounds.
MASE (Mean Absolute Scaled Error):

MASE = mean(|y_t − ŷ_t|) / mean(|y_t − y_{t−m}|)

where the numerator is the forecast MAE over the evaluation horizon and the denominator is the MAE of the seasonal naive forecast (ŷ_t = y_{t−m}).
MASE < 1 means the model beats the seasonal naive baseline. MASE = 0.73 here.
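A sketch of MASE following the standard Hyndman & Koehler definition, with the denominator computed in-sample on the training data (the toy numbers are invented to make the arithmetic easy to check):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=12):
    """MASE: forecast MAE scaled by the in-sample seasonal-naive MAE."""
    mae = np.mean(np.abs(y_true - y_pred))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae / naive_mae

y_train = np.arange(24.0)           # in-sample history: each month is +12 vs. last year
y_true = np.array([24.0, 25.0])     # held-out actuals
y_pred = np.array([24.0, 24.0])     # forecast MAE = 0.5, naive MAE = 12
print(round(mase(y_true, y_pred, y_train), 3))  # → 0.042
```

Note the walk-forward code below scales by the seasonal-naive errors on the test window instead; both conventions appear in practice, so report which one you use.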
SARIMA fitting and forecasting with statsmodels:
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
# Classic Box-Jenkins airline passengers dataset (monthly, 1949-1960)
# Log-transform stabilizes the multiplicative seasonality: var(y) ∝ level
log_y = np.log(passengers) # passengers: pd.Series of monthly counts
# 1. Test for stationarity — null hypothesis: unit root (non-stationary)
adf_stat, p_val, *_ = adfuller(log_y)
print(f"ADF p-value: {p_val:.4f}") # typically > 0.05 → need differencing
# 2. Fit SARIMA(1,1,1)(1,1,1)_12 on log-transformed series
model = SARIMAX(
log_y,
order=(1, 1, 1), # (p, d, q): AR=1, one diff, MA=1
seasonal_order=(1, 1, 1, 12), # (P, D, Q, m): seasonal AR/diff/MA at lag 12
enforce_stationarity=False,
enforce_invertibility=False,
)
result = model.fit(disp=False)
print(result.summary()) # AIC, BIC, Ljung-Box p-value
# 3. Forecast 12 months ahead with 95% prediction intervals
forecast = result.get_forecast(steps=12)
pred_mean = np.exp(forecast.predicted_mean) # back-transform from log
conf_int = np.exp(forecast.conf_int(alpha=0.05))
for month, (mean, lo, hi) in enumerate(
zip(pred_mean, conf_int.iloc[:, 0], conf_int.iloc[:, 1]), 1):
    print(f"Month +{month:02d}: {mean:.0f} [95% CI: {lo:.0f}–{hi:.0f}]")

Walk-forward (expanding window) cross-validation:
Random train/test splits destroy temporal ordering — a test point may be "predicted" using future data from training. Walk-forward CV respects causality: the model always trains on past, predicts future.
from statsmodels.tsa.statespace.sarimax import SARIMAX
import numpy as np
def walk_forward_cv(log_series, n_test: int = 12,
                    order=(1, 1, 1), seasonal=(1, 1, 1, 12)) -> dict:
    """
    Expanding-window CV: train on all data up to time t, predict t+1.
    Returns MAE and MASE vs. the seasonal naive baseline.
    """
    log_series = np.asarray(log_series)  # accept pd.Series or ndarray
    n = len(log_series)
    errors, naive_errors = [], []
    for t in range(n - n_test, n):
        train = log_series[:t]
        actual = np.exp(log_series[t])
        seasonal_naive = np.exp(log_series[t - seasonal[3]])  # same month last year
        try:
            fit = SARIMAX(train, order=order,
                          seasonal_order=seasonal).fit(disp=False)
            pred = np.exp(fit.forecast(steps=1)[0])
        except Exception:
            pred = seasonal_naive  # fall back on convergence failure
        errors.append(abs(pred - actual))
        naive_errors.append(abs(seasonal_naive - actual))
    mae = np.mean(errors)
    mase = mae / np.mean(naive_errors)  # < 1.0 means we beat the seasonal naive
    return {"mae": mae, "mase": mase}
metrics = walk_forward_cv(log_y)
print(f"MAE: {metrics['mae']:.1f} passengers | MASE: {metrics['mase']:.3f}")

Analysis & Evaluation
Where Your Intuition Breaks
A common belief is that neural networks always outperform classical methods like ARIMA for time series forecasting. In practice, for short series, strong seasonal patterns, and limited training data, classical methods (ARIMA, ETS, SARIMA) routinely match or beat neural approaches. The M4 and M5 forecasting competitions showed that statistical ensembles outperformed many deep learning submissions. Neural methods (N-BEATS, PatchTST, TimesFM) win on long-horizon forecasting with abundant data and complex cross-series patterns — but these conditions are not universal. The right model depends on series length, data volume, and whether the task benefits from cross-series learning (where neural methods have a clear advantage) or individual series modeling (where ARIMA-class models remain competitive).
Model Comparison
| Model | Multi-step | Covariates | Uncertainty | Scalability | Interpretability |
|---|---|---|---|---|---|
| ARIMA/SARIMA | Recursive | Limited | Yes (analytic) | Single series | High |
| ETS | Direct | None | Yes (analytic) | Single series | High |
| Prophet | Direct | Holidays | Yes (MCMC) | Single series | High |
| LSTM/Seq2Seq | Direct | Many | Via MC dropout | Multi-series | Low |
| TFT | Direct | Many | Quantile | Multi-series | Medium |
| N-BEATS | Direct | None | Via ensembles | Multi-series | Medium |
Evaluation Metrics
MASE: Scale-free, compares to seasonal naive. Preferred for intermittent or near-zero series.
SMAPE (Symmetric MAPE):

SMAPE = (100/n) Σ_t |y_t − ŷ_t| / ((|y_t| + |ŷ_t|) / 2)
Bounded in [0, 200%], handles near-zero values better than standard MAPE.
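A one-function sketch of SMAPE (the example values are made up; note the formula is undefined when actual and forecast are both exactly zero):

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent, bounded in [0, 200]."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return 100 * np.mean(np.abs(y_true - y_pred) / denom)

# A 10-unit miss on actuals of 100 and 200:
print(round(smape(np.array([100.0, 200.0]),
                  np.array([110.0, 190.0])), 2))
```

Unlike plain MAPE, over- and under-forecasts of the same size give similar penalties, which is why SMAPE was the headline metric in the M3/M4 competitions.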
Winkler score: evaluates the calibration of prediction intervals — penalizes both width (being too conservative) and coverage failures.
Common Pitfalls
Data leakage: including future information in features during training. Strictly use only data available at prediction time t.
Wrong cross-validation: Random splitting destroys temporal structure. Always use walk-forward (expanding window) or rolling window CV.
Ignoring seasonality: Fitting ARIMA without seasonal differencing when seasonality is present leads to systematic bias at seasonal lags.
Global vs local models: A single global model (shared across all series) can outperform per-series ARIMA when series share patterns and training data is sufficient, but underperforms for highly idiosyncratic series.