Extended Benchmarks

The benchmark setting is borrowed from Nixtla's original benchmarking of time-series foundation models against a strong statistical ensemble. More datasets were later added by the Chronos team in this pull request. We compare on all the datasets in this extended benchmark.

Running TimesFM on the benchmark

Running these benchmarks requires a few additional packages. Follow the installation instructions up to (but not including) the poetry lock step, then run:

poetry add git+https://github.com/awslabs/gluon-ts.git
poetry lock
poetry install --only <pax or pytorch>

To run TimesFM on the benchmark (append the -pytorch suffix to the checkpoint name when using the PyTorch version), do:

poetry run python3 -m experiments.extended_benchmarks.run_timesfm --model_path=google/timesfm-1.0-200m(-pytorch) --backend="gpu"

Note: The current version of TimesFM focuses on point forecasts, so MASE and sMAPE are calculated using the quantile head corresponding to the median, i.e., the 0.5 quantile. We do offer 10 quantile heads, but they have not been calibrated after pretraining. We recommend using them with caution, or calibrating/conformalizing them on a hold-out set for your applications (see the sketch below). More to follow in later versions.
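For example, one simple way to calibrate a pair of quantile heads is split conformalized quantile regression (CQR) on a hold-out set. The sketch below is illustrative only and not part of the TimesFM codebase; the arrays y_cal, lo_cal, hi_cal, lo_test, hi_test are placeholders for your own hold-out targets and the corresponding quantile-head forecasts (e.g. the 0.05 and 0.95 heads).

import numpy as np

def conformalize_interval(y_holdout, lo_holdout, hi_holdout, alpha=0.1):
    # Conformity score: how far each true value falls outside the
    # predicted [lower, upper] interval (negative if inside).
    n = len(y_holdout)
    scores = np.maximum(lo_holdout - y_holdout, y_holdout - hi_holdout)
    # Finite-sample-corrected empirical quantile of the scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level)

# Usage: widen the test-time interval by the calibrated margin so it
# covers roughly 1 - alpha of future points.
# margin = conformalize_interval(y_cal, lo_cal, hi_cal, alpha=0.1)
# lo_test_calibrated = lo_test - margin
# hi_test_calibrated = hi_test + margin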

Benchmark Results

[Image: benchmark results table]

Update: We have added TimeGPT-1 to the benchmark results. We had to remove the Dominick dataset because we were not able to run TimeGPT-1 on it. Note that the previous results, including Dominick, remain available at ./tfm_results.png. To reproduce the TimeGPT-1 results, run run_timegpt.py.
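Assuming run_timegpt.py sits in the same module path as run_timesfm.py (an assumption about the repository layout) and that your TimeGPT API credentials are configured, the invocation should look like:

poetry run python3 -m experiments.extended_benchmarks.run_timegpt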

Remark: All baselines except those involving TimeGPT were run on a g2-standard-32 machine. Since TimeGPT-1 can only be accessed through an API, the time column may not reflect the model's true speed, as it also includes the communication cost; moreover, we do not know the exact backend hardware for TimeGPT. The TimesFM latency numbers are from the PAX version.

We can see that TimesFM performs the best in terms of both MASE and sMAPE. More importantly, it is much faster than the other methods: more than 600x faster than StatisticalEnsemble and 80x faster than Chronos (Large).
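For reference, here are minimal NumPy versions of the two reported metrics. The benchmark scripts compute them with their own utilities, so these are only illustrative definitions:

import numpy as np

def smape(y, yhat):
    # Symmetric MAPE: |y - yhat| / ((|y| + |yhat|) / 2), averaged.
    return np.mean(2.0 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, y_train, season=1):
    # MAE of the forecast, scaled by the in-sample MAE of the
    # seasonal-naive forecast (lag = season) on the training series.
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y - yhat)) / naive_mae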

Note: This benchmark compares only on one small horizon window for long-horizon datasets like ETT hourly and 15 minutes. More in-depth comparisons on longer-horizon rolling-validation tasks are presented in our long horizon benchmarks.