Position Paper · ICML 2026

Time-Series Foundation Models Require Explicit Domain-Level Benchmarks

Universal time-series foundation models can look strong on pooled leaderboards while failing inside specific deployment domains. This project argues for explicit domain-stratified evaluation to reveal those failures across health, finance, energy, nature, transport, and retail, where differences in sampling, noise, seasonality, and distribution shift can change which model performs best.

[Figure: overview diagram showing domain-specific challenges, limitations of universal foundation models, and explicit domain-level benchmarks]

Abstract

Why this matters

Time-series foundation models (TSFMs) have shown strong benchmark performance, but common benchmark suites pool datasets from unevenly represented domains. Pooling can hide whether a model is actually reliable in healthcare, finance, energy, retail, transport, or environmental forecasting. The paper evaluates seven TSFMs across 72 datasets from six domains and finds substantial cross-domain variability, showing that global scores alone are not enough for trustworthy model selection.

Core argument

Time series domains are not interchangeable

Different generative laws

Weather and energy often contain physical structure and seasonal regularities. Finance is noisy, stochastic, and regime-dependent. A single universal prior can average incompatible assumptions.

Different sampling regimes

Clinical data may be irregular, multi-rate, and informative when missing. High-frequency finance is event-driven. Fixed patching and uniform-grid assumptions can break under these settings.
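
To make the fixed-patching concern concrete, the sketch below splits a series into patches of a fixed number of observations and measures how much time each patch covers. The timestamps are synthetic and the helper name `patch_spans` is hypothetical; it is not taken from any of the evaluated models.

```python
from typing import List


def patch_spans(timestamps: List[float], patch_len: int) -> List[float]:
    """Return the time covered by each non-overlapping patch of `patch_len` points."""
    spans = []
    # A trailing remainder that does not fill a complete patch is dropped.
    for start in range(0, len(timestamps) - patch_len + 1, patch_len):
        patch = timestamps[start:start + patch_len]
        spans.append(patch[-1] - patch[0])
    return spans


# Synthetic example: hours since admission for an irregularly measured lab value.
irregular = [0.0, 0.5, 1.0, 1.5, 6.0, 14.0, 30.0, 54.0]
# Regular grid with the same number of points, one reading per hour.
regular = [float(h) for h in range(8)]

print(patch_spans(regular, 4))    # every patch covers the same duration
print(patch_spans(irregular, 4))  # patches cover very different durations
```

On the regular grid both patches span 3 hours. On the irregular series the first patch spans 1.5 hours and the second 48 hours, so a patch of equal length no longer corresponds to a comparable stretch of time.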

Different drift behavior

Covariate shift, concept drift, volatility bursts, patient-specific trajectories, and seasonal recalibration create domain-specific failure modes that pooled metrics can miss.

7 TSFMs evaluated
72 datasets
6 application domains
2 accuracy metrics (MSE and MAE)

Application domains

Six real-world forecasting settings

Health, Finance, Energy, Nature, Transport, Retail

Foundation models

Seven TSFMs evaluated

Chronos, Sundial, TimeGPT, TimesFM, TimesMOE, ToTo, Moirai

Empirical validation

Rankings fragment across domains

The same model can be strong in one domain and weak in another. These figures expose rank reversals and domain-specific win patterns that a single leaderboard average would flatten.
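
A rank reversal can be illustrated with a small toy table. The sketch below ranks three placeholder models per domain and on the pooled mean. All numbers and model names are invented for illustration and are not results from the paper.

```python
from typing import Dict, List

# Illustrative MSE values only; these are NOT results from the paper.
# Layout: errors[model][domain] -> mean error in that domain (lower is better).
errors: Dict[str, Dict[str, float]] = {
    "model_a": {"energy": 0.20, "finance": 0.90, "health": 0.60},
    "model_b": {"energy": 0.35, "finance": 0.40, "health": 0.70},
    "model_c": {"energy": 0.50, "finance": 0.60, "health": 0.30},
}


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Order model names from lowest to highest error."""
    return sorted(scores, key=lambda name: scores[name])


def per_domain_rankings(errors: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """Rank the models separately inside each domain."""
    domains = sorted({d for per_model in errors.values() for d in per_model})
    return {d: rank_models({m: errors[m][d] for m in errors}) for d in domains}


def pooled_ranking(errors: Dict[str, Dict[str, float]]) -> List[str]:
    """Rank the models on the unweighted mean over all domains."""
    pooled = {m: sum(v.values()) / len(v) for m, v in errors.items()}
    return rank_models(pooled)


print("pooled:", pooled_ranking(errors))
for domain, ranking in per_domain_rankings(errors).items():
    print(domain, ranking)
```

In this toy table `model_c` ranks first on the pooled mean yet last in energy, which is the kind of reversal a single leaderboard number cannot show.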

Benchmark coverage

Benchmark imbalance

Existing benchmarks are domain-imbalanced, so pooled scores can hide weak performance in underrepresented areas.

GIFT-Eval

Finance: 69.3%
Energy: 1.4%
Nature: 22.6%
Health: 0.7%
Retail: 2.6%
Transport: 0.9%
Web: 2.4%

TSFM-Bench

Finance: 4.9%
Energy: 10.3%
Nature: 1.1%
Health: 20.0%
Retail: 0.0%
Transport: 21.6%
Web: 41.9%
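
The effect of this imbalance on a pooled score can be sketched numerically. The example below assumes the pooled score weights each domain in proportion to its share of the benchmark, which is a simplification. The shares are the GIFT-Eval figures listed above, while the per-domain errors are hypothetical placeholders.

```python
from typing import Dict

# Domain shares for GIFT-Eval as listed above (percent of the benchmark).
gift_eval_share: Dict[str, float] = {
    "Finance": 69.3, "Energy": 1.4, "Nature": 22.6, "Health": 0.7,
    "Retail": 2.6, "Transport": 0.9, "Web": 2.4,
}


def pooled_score(domain_error: Dict[str, float], share: Dict[str, float]) -> float:
    """Share-weighted mean of per-domain errors (shares are renormalized to sum to 1)."""
    total = sum(share.values())
    return sum(domain_error[d] * share[d] for d in share) / total


# Hypothetical per-domain errors, for illustration only.
baseline = {d: 1.0 for d in gift_eval_share}
health_failure = dict(baseline, Health=3.0)  # error triples in Health only

base = pooled_score(baseline, gift_eval_share)
bad = pooled_score(health_failure, gift_eval_share)
print(f"relative change in pooled score: {(bad - base) / base:.1%}")
```

Under these assumptions, tripling the error in Health moves the pooled score by only about 1.4%, so a severe failure in an underrepresented domain is nearly invisible in the aggregate.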

Energy time series forecasting results

Detailed energy-domain results across datasets, horizons, models, MSE, MAE, average scores, and first-place counts.

[Table: energy time series forecasting results from the paper]
Energy-domain forecasting results. Bold values indicate the best performance in each row.
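
For readers who want to reproduce this style of table on their own forecasts, the following is a minimal sketch of the two accuracy metrics and of a first-place count. The function names, the tie-handling rule (every tied model is credited), and the example values are assumptions made here, not the paper's evaluation code.

```python
from collections import Counter
from typing import Dict, List, Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error over paired observations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error over paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def first_place_counts(rows: List[Dict[str, float]]) -> Counter:
    """Count, per model, the rows (dataset, horizon, metric) where its error is lowest."""
    wins: Counter = Counter()
    for row in rows:
        best = min(row.values())
        # Assumption: ties credit every model that reaches the best value.
        for model, value in row.items():
            if value == best:
                wins[model] += 1
    return wins


# Placeholder values, not taken from the paper's table.
rows = [
    {"model_a": 0.30, "model_b": 0.25, "model_c": 0.40},
    {"model_a": 0.41, "model_b": 0.45, "model_c": 0.50},
    {"model_a": 0.22, "model_b": 0.20, "model_c": 0.21},
]
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]), mae([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
print(first_place_counts(rows))
```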

Call to action

What the TSFM community should do next

Our results suggest a shift away from reporting only pooled global scores. Domain-stratified benchmarking can reveal where TSFMs succeed, where they fail, and when domain-aware modeling is necessary.

A1
Lowest-cost, highest-impact

Report domain-stratified evaluation

Benchmarks such as Monash, GIFT-Eval, and TSFM-Bench already contain multiple domains. The key change is to report results by domain instead of hiding failures behind pooled averages.
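
One possible shape for such a report is sketched below. It assumes per-dataset errors are already available together with a domain label. The record layout, the function name `stratified_report`, and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical per-dataset records: (dataset, domain, error). Values are placeholders.
records: List[Tuple[str, str, float]] = [
    ("ds_fin_1", "finance", 0.40), ("ds_fin_2", "finance", 0.50),
    ("ds_fin_3", "finance", 0.45), ("ds_fin_4", "finance", 0.45),
    ("ds_health_1", "health", 1.20),
    ("ds_energy_1", "energy", 0.60),
]


def stratified_report(records: List[Tuple[str, str, float]]) -> Dict[str, object]:
    """Summarize errors per domain alongside pooled and macro averages."""
    by_domain: Dict[str, List[float]] = defaultdict(list)
    for _name, domain, error in records:
        by_domain[domain].append(error)
    per_domain = {d: mean(v) for d, v in by_domain.items()}
    return {
        "per_domain": per_domain,
        # Pooled: every dataset counts equally, so large domains dominate.
        "pooled": mean([e for _, _, e in records]),
        # Macro: every domain counts equally, regardless of its dataset count.
        "macro": mean(list(per_domain.values())),
        "worst_domain": max(per_domain, key=lambda d: per_domain[d]),
    }


report = stratified_report(records)
for key, value in report.items():
    print(key, value)
```

With these placeholder values the pooled mean is about 0.60 while the health domain sits at 1.20 and the macro average at about 0.75. Reporting the per-domain rows and the worst domain next to the pooled number keeps that gap visible.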

A2
Fair comparison

Evaluate domain-aware models

Specialized models for medical irregularity, missingness, financial non-stationarity, or other domain-specific challenges should be compared directly against universal TSFMs under standardized domain-level benchmarks.

A3
Long-term direction

Build cross-domain transfer frameworks

Practitioners need a way to predict which source domains help or harm target domains. Useful transfer frameworks should consider sampling frequency, stationarity, causal structure, and drift behavior.

Bottom line: a model that performs best globally may still fail in a target domain. Domain-stratified reporting makes TSFM selection more reliable for real-world deployment.