Maren Starlene Höver: Publications

Spatial Generalization Tests for Machine Learning-based Weather Models as a Requirement for Climate Predictions

Copernicus Publications (2026)

Authors:

Maren Höver, Milan Klöwer, Christian Schroeder de Witt, Hannah M Christensen

Abstract:

Machine learning-based weather prediction is revolutionizing weather forecasting by learning from present-day climate. However, generalization to other climates remains a major challenge. With melting sea ice, land-use change and increasing ocean temperatures, boundary conditions are changing. Therefore, generalization in time will likely only be possible if generalization in space is also given. The physics of the atmosphere is invariant in space, and as such, a model should demonstrate the same to accurately represent the real world. Here, we present three test cases to evaluate whether machine learning-based weather and climate models generalize spatially and apply them to multiple AI weather models. The tests consist of reversing the entirety of the input data and boundary conditions in latitude (Test 1), reversing them in longitude (Test 2), as well as rotating them by 180° in longitude (Test 3), while keeping all aspects of the simulation physically consistent. For a deterministic model that generalizes in space, each of these test cases yields the same predictions as the baseline case, only subject to a rounding error. With these test cases, we investigate whether data-driven models hardcode representations of spatial relationships in the training data into their latent space. We show that currently, both fully data-driven and hybrid general circulation models do not pass these tests, instead performing poorly with unphysical results. This implies that they have likely not learned underlying atmospheric physics principles, but instead local spatial relationships statistically dependent on geographical location. This calls into question the ability of such models to simulate a changing regional climate. As such, we propose that machine learning-based climate models be evaluated using our spatial tests during model development to reduce overfitting on present-day regional climate.
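The three transformations described in this abstract can be illustrated on a single scalar field stored as a (latitude, longitude) array. The following is only a minimal sketch under stated assumptions, not the authors' test suite: the function names, the toy "models" and the tolerance are hypothetical, and the sketch ignores the extra steps (for example, handling of wind components and boundary conditions) that a physically consistent test would need.

```python
import numpy as np

def flip_latitude(field):
    """Test 1 (sketch): reverse the latitude axis (axis 0)."""
    return field[::-1, :]

def flip_longitude(field):
    """Test 2 (sketch): reverse the longitude axis (axis 1)."""
    return field[:, ::-1]

def rotate_longitude_180(field):
    """Test 3 (sketch): shift by half the longitude points (assumes an even count)."""
    return np.roll(field, field.shape[1] // 2, axis=1)

def passes_test(model, field, transform, inverse, atol=1e-6):
    """Compare the baseline prediction with the prediction made on transformed
    input and mapped back to the original orientation."""
    baseline = model(field)
    mapped_back = inverse(model(transform(field)))
    return np.allclose(baseline, mapped_back, atol=atol)

# Hypothetical stand-ins for a forecast model.
pointwise_model = lambda x: 2.0 * x                            # no location dependence
lat_biased_model = lambda x: x + np.arange(x.shape[0])[:, None]  # depends on latitude index

field = np.random.default_rng(0).normal(size=(4, 8))
print(passes_test(pointwise_model, field, flip_latitude, flip_latitude))
print(passes_test(lat_biased_model, field, flip_latitude, flip_latitude))
```

Each transformation here is its own inverse, so the same function is passed twice. A toy model whose output depends on the latitude index fails the latitude reversal, which mirrors the kind of location-dependent behaviour the tests are meant to expose.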

Forced Component Estimation Statistical Method Intercomparison Project (ForceSMIP)

Journal of Climate, American Meteorological Society (2026)

Authors:

Robert CJ Wills, Clara Deser, Karen A McKinnon, Adam Phillips, Stephen Po-Chedley, Sebastian Sippel, Anna L Merrifield, Constantin Bône, Céline Bonfils, Gustau Camps-Valls, Stephen Cropper, Charlotte Connolly, Shiheng Duan, Homer Durand, Alexander Feigin, MA Fernandez, Guillaume Gastineau, Andrei Gavrilov, Emily Gordon, Moritz Günther, Maren Höver, Sergey Kravtsov, Yan-Ning Kuo, Justin Lien, Gavin D Madakumbura, Nathan Mankovich, Matthew Newman, Jamin Rader, Jia-Rui Shi, Sang-Ik Shin, Gherardo Varando

Abstract:

Anthropogenic climate change is unfolding rapidly, yet its regional manifestation can be obscured by internal variability. A primary goal of climate science is to identify the externally forced climate response from amongst the noise of internal variability. Separating the forced response from internal variability can be addressed in climate models by using a large ensemble to average over different possible realizations of internal variability. However, with only one realization of the real world, it is a major challenge to isolate the forced response directly in observations. In the Forced Component Estimation Statistical Method Intercomparison Project (ForceSMIP), contributors used existing and newly developed statistical and machine learning methods to estimate the forced response over 1950–2022 within individual realizations of the climate system. Participants used neural networks, linear inverse models, fingerprinting methods, and low-frequency component analysis, among other approaches. These methods were trained using large ensembles from multiple climate models and then applied to observations. Here we evaluate method performance within large ensembles and investigate the estimates of the forced response in observations. Our results show that many different types of methods are skillful for estimating the forced response in climate models, though the relative skill of individual methods varies depending on the variable and evaluation metric. Methods with comparable skill in models can give a wide range of estimates of the forced response pattern in observations, illustrating the epistemic uncertainty in forced response estimates. ForceSMIP gives new insights into the forced response in observations, its uncertainty, and methods for its estimation.
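The large-ensemble idea mentioned in this abstract, averaging over realizations so that internal variability cancels, can be sketched in a few lines. This is an illustrative example with synthetic data, not code from ForceSMIP; the array layout (member, time), the function names and the synthetic trend are all assumptions.

```python
import numpy as np

def ensemble_forced_response(ensemble):
    """Estimate the forced response as the mean over ensemble members (axis 0)."""
    return np.asarray(ensemble).mean(axis=0)

def internal_variability(ensemble):
    """Residual of each member after removing the ensemble-mean estimate."""
    ensemble = np.asarray(ensemble)
    return ensemble - ensemble_forced_response(ensemble)

# Synthetic example: a linear trend plus member-specific noise.
rng = np.random.default_rng(0)
years = np.arange(1950, 2023)
true_forced = 0.01 * (years - years[0])            # hypothetical forced signal
noise = rng.normal(scale=0.2, size=(50, years.size))
ensemble = true_forced + noise

estimate = ensemble_forced_response(ensemble)
print(np.abs(estimate - true_forced).max())        # shrinks as the ensemble grows
```

With a single realization, as in observations, this average is not available, which is why the statistical and machine learning methods compared in the project are needed.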

Statistical methods for estimating the forced component of historical SST and precipitation changes: A bias-variance tradeoff

(2025)

Authors:

Maren Höver, Robert Jnglin Wills, Nora Fahrenbach

Abstract:

Distinguishing the influences of externally forced responses and internal variability on the observed climate is critical for attributing historical climate change and for evaluating the forced responses simulated by climate models. Statistical methods such as optimal fingerprinting, low-frequency component analysis (LFCA), and dynamical adjustment have proven useful for this application. The skill of such statistical methods can be evaluated using climate model large ensembles, where the forced response is estimated by averaging over many realizations. Our study uses large ensemble simulations from five different climate models to evaluate the performance of three statistical methods for this application: (1) low-frequency component analysis, (2) signal-to-noise maximizing pattern optimal fingerprinting (SNMP-OF), which uses the patterns from an ensemble-based signal-to-noise maximizing pattern (SNMP) analysis for optimal fingerprinting, and (3) a novel method based on SNMP analysis called fingerprint maximizing patterns (FMP), which finds patterns within observed variability that have the maximum fingerprint of the model-based forced response. We investigate how the root mean square error (RMSE) of these three methods varies across the choices of hyperparameters and show that all methods have a similar maximum skill. However, the contribution to the RMSE from the mean bias in the forced response estimate varies across the methods, with SNMP-OF and FMP showing a larger mean bias than LFCA. This demonstrates that methods that largely rely on the model forced response to obtain the observed forced response may give biased estimates and underestimate the uncertainty in these estimates due to the bias-variance tradeoff. Additionally, we apply these methods to observed Sahel precipitation, which is extensively debated in terms of its forced component, and closely related North Atlantic sea surface temperatures (SSTs). 
We show that while the methods give a robust estimate of the forced response in North Atlantic SSTs from 1950 to 2022, their estimates of the forced response in Sahel precipitation over the same period differ in sign. The fact that these estimates of the Sahel precipitation response differ substantially, despite all methods performing similarly well for large ensembles, suggests substantial epistemic uncertainty in estimates of the forced precipitation response in this region.
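The split of the error into a mean-bias part and a remaining spread can be illustrated with the standard identity MSE = bias² + variance of the errors. The sketch below uses made-up numbers and is not necessarily the decomposition used in the study; the function name and the example values are hypothetical.

```python
import numpy as np

def mse_decomposition(estimates, truth):
    """Split the mean squared error of a set of estimates into
    squared mean bias and error variance (population variance, ddof=0)."""
    errors = np.asarray(estimates, dtype=float) - truth
    mse = np.mean(errors ** 2)
    bias_sq = errors.mean() ** 2
    variance = errors.var()
    return mse, bias_sq, variance

# Hypothetical forced-trend estimates from several realizations, true value 1.0.
estimates = [1.2, 1.3, 1.1, 1.25, 1.15]
mse, bias_sq, variance = mse_decomposition(estimates, truth=1.0)
print(np.sqrt(mse), bias_sq, variance)
```

Two methods can reach a similar RMSE while differing in how much of it comes from the bias term, which is the kind of distinction the abstract draws between the methods.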
