Datasets

Training data

Our training data for Phase 1 consists of

AiBEDO is trained on Earth System Model (ESM) output. In the case of the simultaneous model used for evaluating cloud-climate variability, we train on a subset of CMIP6 Earth System Model (ESM) outputs which had sufficient data availability on AWS to calculate the requisite input variables for our analysis (shown in Table 1. For each ESM, there are three categories of data hyper-cubes: (a) input, (b) output, and (c) data for enforcing physics constraints. Based on the initial results from our alpha hybrid model, we revised and increased the list of input variables to achieve better hybrid model performance. The updated list of input, output, and constraint variables is shown in Table 2. Furthermore, testing indicates that model performance improves when the model is trained on a large data pool from a single ESM, thus we have opted to train on long preindustrial control (piControl) simulations in addition to historical simulations.

Table 1. Earth System Model datasets for training simultaneous AiBEDO
Model	Historical (N)	piControl (years)	Lat spacing (deg)	Lon spacing (deg)
ACCESS-ESM1-5	0	900	1.25	1.875
AWI-ESM-1-1-LR	0	100	1.85	1.875
AWI-CM-1-1-MR	0	500	0.9272	0.9375
CESM2	1	1200	0.942	1.2500
CESM2-FV2	1	500	1.895	2.5000
CESM2-WACCM	1	499	0.942	1.2500
CESM2-WACCM-FV2	1	500	1.895	2.5000
CMCC-CM2-SR5	1	0	0.942	1.2500
CNRM-CM6-1-HR	0	100	0.495	0.5
CanESM5	5	1000	2.789	2.8125
E3SM-1-0	1	500	1.000	1.0000
E3SM-1-0	1	500	1.000	1.0000
EC-Earth3-Veg	0	500	0.695870	0.703125
GFDL-CM4	1	500	1.000	1.2500
GFDL-ESM4	1	500	1.000	1.2500
GISS-E2-1-H	1	401	2.000	2.5000
MIROC-ES2L	3	0	2.789	2.8125
MIROC6	1	0	1.400	1.4062
MPI-ESM-1-2-HAM	1	0	1.865	1.8750
MPI-ESM1-2-HR	1	0	0.935	0.9375
MPI-ESM1-2-LR	1	1000	1.865	1.8750
MRI-ESM2-0	1	0	1.121	1.1250
NorCPM1	5	0	1.9	2.5
SAM0-UNICON	1	0	0.942	1.2500

Table 2. Variable list and descriptions
Category	Variable	Description	Equation notation
Input	cres	TOA Cloud radiative effect in shortwave (positive down)	$R_{TOA}^{cl,SW}$
Input	cresSurf	Surface Cloud radiative effect in shortwave (positive down)	$R_{SRF}^{cl,SW}$
Input	crel	TOA Cloud radiative effect in longwave (positive down)	$R_{TOA}^{cl,LW}$
Input	crelSurf	Surface Cloud radiative effect in longwave (positive down)	$R_{SRF}^{cl,LW}$
Input	netTOAcs	TOA radiation without clouds (clear-sky) (positive down)	$R_{TOA}^{cs,ALL}$
Input	netSurfcs	Net Clearsky Surface radiation and heat flux (positive down)	$R_{SRF}^{cs,ALL} - SH - LH$
Output	tas	2-metre air temperature	$T_{SRF}$
Output	ps	Surface pressure	$p_{SRF}$
Output	pr	Precipitation	$P$
Constraint	evspsbl	Evaporation	$E$
Constraint	hfss	Surface upward sensible heat flux (positive up)	$SH$
Constraint	netTOARad	TOA All-sky radiation (positive down)	$R_{TOA}^{all,ALL}$
Constraint	netSurfRad	Surface All-sky radiation (positive down)	$R_{TOA}^{all,ALL}$
Constraint	qtot	Column integrated water vapour mass	$Q_{tot} = \frac{1}{g}\sum_{lev} Q dp$
Constraint	dqdt	Time tendency of column integrated water vapour mass	$\frac{dQ_{tot}}{dt} = \frac{1}{g}\frac{d}{dt} \sum_{lev} Q dp$

The ESM data are pooled together to form the training and testing datasets for our hybrid model. However, it is important to note there are substantial differences in the climatologies and variability of some of the chosen input variables across models. In particular, global average cloud liquid water content, cloud ice water content, and net top of atmosphere radiation vary more across ESMs than other variables. The former two are the result of differences in cloud parameterizations between ESMs, while the latter is likely due to uncertainties in the overall magnitude of anthropogenic forcing over the historical period. Comparing spatial correlation scores, shows net TOA radiation fields are very similar across models while the spatial pattern of cloud ice and water content varies substantially. Such variations represent the inter-ESM uncertainty in the representation of the climate. However, many of these ESM differences are largely removed during preprocessing described below.

When training a lagged model for MCB projections, we found we require a much larger, consistent sample of internal variability to train on. That is, we found that multi-model ensemble members of no-normalized data resulted errors in AiBEDO. Thus, we instead use data from the Community Earth System Model 2 (CESM2) Large Ensemble (LE) (Rodgers et al., 2021, Fig. 1), from which we use a 50-member ensemble of historical and SSP3-7.0 simulations with smoothed biomass burning. These are processed in the same manner as the CMIP6 data.

../_images/LENS_init_graphic.png — Figure 1. Initialization of CESM2 Large Ensemble simulations.

Reanalysis

In addition to ESM data, we also employ “reanalysis” datasets as validation datasets. Namely the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA2) and the ECMWF Reanalysis version 5 (ERA5). Reanalyses are models which ingest large quantities observational data to estimate the historical evolution of the atmosphere, thus providing an estimate of a wide range of atmospheric variables over the entire globe. While these data are not exactly the same as observational data, they are the best method of obtaining physically consistent and spatially complete climate data representing the recent past of the Earth’s atmosphere. MERRA2 includes data from 1980 to 2020 at 0.5 degree resolution and ERA5 includes data from 1979 to 2021 at 0.25 degree resolution.

Preprocessing

Each of the above data hyper-cubes are preprocessed before ingestion into the hybrid model. After testing with normalized and non-normalized training data, we find that the model produces better results when the variables are not normalized (i.e. not divided by the standard deviation at each grid points). Thus, our updated preprocessing pipeline is:

Remove seasonal cycle or “Deseasonalizing”: Subtract climatological seasonal cycle.
Remove trend or Detrend: Fit a third degree polynomial for each month of the year and subtract it from the data to remove trend in data over time. This removes secular trends (for example, rising temperatures as atmospheric CO$_2$ increases) and allows the model to be trained on fluctuations due to internal variability, rather than the forced response.
Remove rolling average: The anomaly at each grid point is calculated relative to a running mean that is computed over a centered 30-year window for that grid point and month.
Convert output variable units: Convert output variable units such that the magnitudes of the variables are similar
Remap data to Sphere-Icosahedral: Use Climate Data Operators to bilinearly remap different ESM grids to uniform level-5 or level-6 Sphere-Icosahedral grid.

../_images/preprocessing.png — Figure 2. Example preprocessing for a surface air temperature data point.