Spatial Generalization Tests for Machine Learning-based Weather Models as a Requirement for Climate Predictions
Copernicus Publications (2026)
Abstract:
Machine learning-based weather prediction is revolutionizing weather forecasting by learning from present-day climate. However, generalization to other climates remains a major challenge. With melting sea ice, land-use change and increasing ocean temperatures, boundary conditions are changing. Therefore, generalization in time will likely only be possible if generalization in space is also given. The physics of the atmosphere is invariant in space, and as such, a model should demonstrate the same to accurately represent the real world. Here, we present three test cases to evaluate whether machine learning-based weather and climate models generalize spatially and apply them to multiple AI weather models. The tests consist of reversing the entirety of the input data and boundary conditions in latitude (Test 1), reversing them in longitude (Test 2), as well as rotating them by 180° in longitude (Test 3), while keeping all aspects of the simulation physically consistent. For a deterministic model that generalizes in space, each of these test cases yields the same predictions as the baseline case, only subject to a rounding error. With these test cases, we investigate whether data-driven models hardcode representations of spatial relationships in the training data into their latent space. We show that currently, both fully data-driven and hybrid general circulation models do not pass these tests, instead performing poorly with unphysical results. This implies that they have likely not learned underlying atmospheric physics principles, but instead local spatial relationships statistically dependent on geographical location. This calls into question the ability of such models to simulate a changing regional climate. As such, we propose that machine learning-based climate models be evaluated using our spatial tests during model development to reduce overfitting on present-day regional climate.
Forced Component Estimation Statistical Method Intercomparison Project (ForceSMIP)
Journal of Climate American Meteorological Society (2026)
Abstract:
Anthropogenic climate change is unfolding rapidly, yet its regional manifestation can be obscured by internal variability. A primary goal of climate science is to identify the externally forced climate response from amongst the noise of internal variability. Separating the forced response from internal variability can be addressed in climate models by using a large ensemble to average over different possible realizations of internal variability. However, with only one realization of the real world, it is a major challenge to isolate the forced response directly in observations. In the Forced Component Estimation Statistical Method Intercomparison Project (ForceSMIP), contributors used existing and newly developed statistical and machine learning methods to estimate the forced response over 1950–2022 within individual realizations of the climate system. Participants used neural networks, linear inverse models, fingerprinting methods, and low-frequency component analysis, among other approaches. These methods were trained using large ensembles from multiple climate models and then applied to observations. Here we evaluate method performance within large ensembles and investigate the estimates of the forced response in observations. Our results show that many different types of methods are skillful for estimating the forced response in climate models, though the relative skill of individual methods varies depending on the variable and evaluation metric. Methods with comparable skill in models can give a wide range of estimates of the forced response pattern in observations, illustrating the epistemic uncertainty in forced response estimates. ForceSMIP gives new insights into the forced response in observations, its uncertainty, and methods for its estimation.
Statistical methods for estimating the forced component of historical SST and precipitation changes: A bias-variance tradeoff
(2025)