Time-Series Foundation Models Require Explicit Domain-Level Benchmarks

Abstract

Why this matters

Time-series foundation models (TSFMs) have shown strong benchmark performance, but common benchmark suites pool datasets from unevenly represented domains. Pooling can hide whether a model is actually reliable in healthcare, finance, energy, retail, transport, or environmental forecasting. The paper evaluates seven TSFMs across 72 datasets from six domains and finds substantial cross-domain variability, showing that global scores alone are not enough for trustworthy model selection.

Core argument

Time series domains are not interchangeable

Different generative laws

Weather and energy often contain physical structure and seasonal regularities. Finance is noisy, stochastic, and regime-dependent. A single universal prior risks averaging over these incompatible assumptions.

Different sampling regimes

Clinical data may be irregular, multi-rate, and informative when missing. High-frequency finance is event-driven. Fixed patching and uniform-grid assumptions can break under these settings.
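To make the patching concern concrete, here is a minimal sketch, not taken from the paper, that splits observations into fixed-size patches by index and measures how much wall-clock time each patch covers. The function name `patch_spans`, the patch length of 4, and both timestamp sequences are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def patch_spans(timestamps, patch_len):
    """Split observations into non-overlapping patches of `patch_len`
    consecutive points and return the time span covered by each patch."""
    spans = []
    for start in range(0, len(timestamps) - patch_len + 1, patch_len):
        patch = timestamps[start:start + patch_len]
        spans.append(patch[-1] - patch[0])
    return spans


t0 = datetime(2024, 1, 1)

# Regular hourly grid, e.g. an energy meter (hypothetical data).
regular = [t0 + timedelta(hours=h) for h in range(8)]

# Irregular measurements, e.g. clinical vitals that are dense at first
# and sparse later (hypothetical offsets, in minutes).
offsets_min = [0, 5, 10, 15, 240, 720, 1440, 2880]
irregular = [t0 + timedelta(minutes=m) for m in offsets_min]

regular_spans = patch_spans(regular, patch_len=4)      # 3 hours, 3 hours
irregular_spans = patch_spans(irregular, patch_len=4)  # 15 minutes, 44 hours

for name, spans in [("regular", regular_spans), ("irregular", irregular_spans)]:
    print(name, [str(s) for s in spans])
```

On the regular grid every patch covers the same duration, so a patch is a consistent unit of time. On the irregular series the same four-point patch covers 15 minutes in one place and 44 hours in another, which is the kind of mismatch a uniform-grid assumption does not account for.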

Different drift behavior

Covariate shift, concept drift, volatility bursts, patient-specific trajectories, and seasonal recalibration create domain-specific failure modes that pooled metrics can miss.

7 TSFMs evaluated

72 datasets

6 application domains

2 accuracy metrics

Application domains

Six real-world forecasting settings

Health, Finance, Energy, Nature, Transport, Retail

Foundation models

Seven TSFMs evaluated

Chronos, Sundial, TimeGPT, TimesFM, TimesMOE, ToTo, Moirai

Empirical validation

Rankings fragment across domains

The same model can be strong in one domain and weak in another. These figures expose rank reversals and domain-specific win patterns that a single leaderboard average would flatten.

Figure: MAE model rank distribution across six domains. Rank 1 is best; white circles denote mean rank.

Figure: MSE model rank distribution across six domains, showing similar cross-domain instability.

Figure: MAE domain-wise win counts (radar chart). First-place counts show concentrated strengths rather than universal dominance.

Figure: MSE domain-wise win counts (radar chart). Win counts emphasize different model footprints across health, finance, nature, transport, retail, and energy.
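The quantities behind these figures, per-domain mean rank and first-place counts, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's evaluation code: the function `domain_summary`, the model names `model_a` and `model_b`, and all error values are hypothetical, and ties are assumed to share the best rank.

```python
from collections import defaultdict


def domain_summary(results):
    """results: list of (domain, {model: error}) pairs, one per dataset.

    Returns (mean_rank, wins), both keyed by domain and then by model.
    Rank 1 is the lowest error on a dataset; tied models share the best rank.
    """
    rank_lists = defaultdict(lambda: defaultdict(list))
    wins = defaultdict(lambda: defaultdict(int))
    for domain, errors in results:
        for model, err in errors.items():
            # Rank = 1 + number of models with strictly lower error.
            rank = 1 + sum(other < err for other in errors.values())
            rank_lists[domain][model].append(rank)
            if rank == 1:
                wins[domain][model] += 1
    mean_rank = {
        d: {m: sum(r) / len(r) for m, r in models.items()}
        for d, models in rank_lists.items()
    }
    return mean_rank, {d: dict(w) for d, w in wins.items()}


# Hypothetical MAE values for two models on four datasets.
results = [
    ("energy", {"model_a": 0.20, "model_b": 0.30}),
    ("energy", {"model_a": 0.25, "model_b": 0.35}),
    ("health", {"model_a": 0.90, "model_b": 0.60}),
    ("health", {"model_a": 0.80, "model_b": 0.70}),
]

mean_rank, wins = domain_summary(results)
print(mean_rank)
print(wins)
```

In this toy input `model_a` ranks first on every energy dataset and last on every health dataset, so a rank averaged over all four datasets would be 1.5 for both models and would hide the reversal.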

Benchmark coverage

Benchmark imbalance

Existing benchmarks are domain-imbalanced, so pooled scores can hide weak performance in underrepresented areas.

Domain      GIFT-Eval   TSFM-Bench
Finance     69.3%       4.9%
Energy      1.4%        10.3%
Nature      22.6%       1.1%
Health      0.7%        20.0%
Retail      2.6%        0.0%
Transport   0.9%        21.6%
Web         2.4%        41.9%
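A small worked example shows how this imbalance can mask a weak domain. The sketch below uses the GIFT-Eval shares listed above and assumes the pooled score behaves like a share-weighted average of per-domain errors; the source does not state what unit the shares count, so that is an assumption. The per-domain error values are hypothetical and describe an imagined model that is strong on finance and weak on health.

```python
# Domain shares for GIFT-Eval, in percent, as listed above.
shares = {
    "Finance": 69.3, "Energy": 1.4, "Nature": 22.6, "Health": 0.7,
    "Retail": 2.6, "Transport": 0.9, "Web": 2.4,
}

# Hypothetical per-domain errors for one imagined model (not real results).
errors = {
    "Finance": 0.10, "Energy": 0.30, "Nature": 0.20, "Health": 0.90,
    "Retail": 0.30, "Transport": 0.30, "Web": 0.30,
}

# Pooled score: each domain contributes in proportion to its share.
total_share = sum(shares.values())
pooled = sum(shares[d] * errors[d] for d in shares) / total_share

# Macro average: every domain counts equally.
macro = sum(errors.values()) / len(errors)

print(f"pooled error: {pooled:.3f}")
print(f"macro error:  {macro:.3f}")
print(f"health error: {errors['Health']:.3f}")
```

Under these assumptions the pooled error is about 0.14, the equal-weight average is about 0.34, and the health error is 0.90. Because health contributes only 0.7% of the weight, the pooled number barely moves even though the model is far worse there.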

Energy time series forecasting results

Detailed energy-domain results across datasets, horizons, models, MSE, MAE, average scores, and first-place counts.

Call to action

What the TSFM community should do next

Our results argue against reporting only pooled global scores. Domain-stratified benchmarking can reveal where TSFMs succeed, where they fail, and when domain-aware modeling is necessary.

A1

Lowest-cost, highest-impact

Report domain-stratified evaluation

Benchmarks such as Monash, GIFT-Eval, and TSFM-Bench already contain multiple domains. The key change is to report results by domain instead of hiding failures behind pooled averages.
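As a rough illustration of how little this reporting change requires, the sketch below computes MAE and MSE per domain alongside the pooled figures from the same forecasts. The function `stratified_report`, the record layout, and the sample values are hypothetical; no real benchmark loader or API is assumed.

```python
from collections import defaultdict


def stratified_report(records):
    """records: iterable of (domain, y_true, y_pred) triples.

    Returns {name: {"mae": ..., "mse": ..., "n": ...}} with one entry per
    domain plus a "pooled" entry over all records. A domain that is itself
    named "pooled" would collide with that entry.
    """
    buckets = defaultdict(list)
    for domain, y_true, y_pred in records:
        err = y_pred - y_true
        buckets[domain].append(err)
        buckets["pooled"].append(err)
    return {
        name: {
            "mae": sum(abs(e) for e in errs) / len(errs),
            "mse": sum(e * e for e in errs) / len(errs),
            "n": len(errs),
        }
        for name, errs in buckets.items()
    }


# Hypothetical forecasts: eight finance points and two health points.
records = [("finance", 10.0, 10.5)] * 8 + [("health", 10.0, 13.0)] * 2

report = stratified_report(records)
for name, row in report.items():
    print(f"{name:8s} n={row['n']:2d} MAE={row['mae']:.2f} MSE={row['mse']:.2f}")
```

The pooled MAE of 1.00 sits between the finance MAE of 0.50 and the health MAE of 3.00, and only the per-domain rows show that the health forecasts are six times worse.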

A2

Fair comparison

Evaluate domain-aware models

Specialized models for medical irregularity, missingness, financial non-stationarity, or other domain-specific challenges should be compared directly against universal TSFMs under standardized domain-level benchmarks.

A3

Long-term direction

Build cross-domain transfer frameworks

Practitioners need a way to predict which source domains help or harm target domains. Useful transfer frameworks should consider sampling frequency, stationarity, causal structure, and drift behavior.

Bottom line: a model that performs best globally may still fail in a target domain. Domain-stratified reporting makes TSFM selection more reliable for real-world deployment.