3 Spatial interpolation in 3D using Ensemble ML
You are reading the workinprogress Spatial and spatiotemporal interpolation using Ensemble Machine Learning. This chapter is currently draft version, a peerreview publication is pending. You can find the polished first edition at https://opengeohub.github.io/spatialpredictioneml/.
3.1 Mapping concentrations of geochemical elements
Ensemble ML can also be used for mapping soil variables in 3D. Consider for example the Geochemical and minerological data set for USA48 (Smith, Woodruff, Solano, Ellefsen, & Karl, 2014). This is a public data set produced and maintained by the USA Geological Survey and contains laboratory measurements of chemical elements and minerals in soil at 4,857 sites (for three depths 0 to 5 cm, A horizon and C horizon; Fig. 3.1).
We have previously imported and overlaid the sampling points vs large stack of covariate layers using e.g.:
## Training data 
ngs.m < read.csv(gzfile("./input/ds_801_all.csv.gz"))
ngs.m$olc_id = olctools::encode_olc(ngs.m$latitude, ngs.m$longitude, 11)
## Spatial overlay 
sel.pnts = !duplicated(ngs.m$olc_id)
summary(sel.pnts)
#> Mode FALSE TRUE
#> logical 9714 4857
ngs.m.pnts = ngs.m[which(sel.pnts), c("olc_id", "longitude", "latitude")]
coordinates(ngs.m.pnts) < ~ longitude + latitude
proj4string(ngs.m.pnts) < "+init=epsg:4326"
prj = "+proj=aea +lat_0=23 +lon_0=96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"
ngs.xy = spTransform(ngs.m.pnts, prj)
tif.lst = list.files("./1km", glob2rx("*.tif$"), full.names = TRUE)
## 262 layers
ov.tmp = parallel::mclapply(1:length(tif.lst), function(j){ terra::extract(terra::rast(tif.lst[j]), terra::vect(ngs.xy)) }, mc.cores = 80)
ov.tmp = dplyr::bind_cols(lapply(ov.tmp, function(i){i[,2]}))
names(ov.tmp) = tools::file_path_sans_ext(basename(tif.lst))
ov.tmp$olc_id = ngs.xy$olc_id
reg.matrix = plyr::join(ngs.m, ov.tmp)
saveRDS.gz(reg.matrix, "./input/ds801_geochem1km.rds")
We can load the regression matrix which contains all target variables and covariates:
which has a total of 4818 unique locations:
str(levels(as.factor(ds801$olc_id)))
#> chr [1:4818] "75WXFVQG+WH6" "75WXJHVG+62X" "75WXJVM9+995" "75WXRG2G+5PV" ...
The individual records can be browsed directly via https://mrdata.usgs.gov/ngdb/soil/, for example a single record includes:
Covariates prepared to help interpolation of the geochemicals include:
 Distance to cities (Nelson et al., 2019);

MODIS LST (monthly daytime and nighttime);

MODIS EVI (longterm monthly values);

Lights at night images;

Snow occurrence probability;
 Soil property and class maps for USA48 (Ramcharan et al., 2018);
 Terrain / hydrological indices;
We can focus on predict concentration of lead (Pb). Pb seems to change with depth and this is seems to be consistent for different land cover classes:
openair::scatterPlot(ds801[ds801$pb_ppm<140,], x = "hzn_depth", y = "pb_ppm", method = "hexbin", col = "increment", log.x=TRUE, log.y=TRUE, xlab="Depth", ylab="Pb [ppm]", z.lim=c(0,100), type="landcover1")
#> Warning: removing 11 missing rows due to landcover1
Because we are aiming at producing predictions of geochemical elements for different depths in soil, we can also use depth of the soil sample as one of the covariates. This makes the prediction system 3D and the model can thus be used to predict at any new 3D location (\(X, Y, d\)). To fit a RF model for this data we can use:
ds801$log.pb = log1p(ds801$pb_ppm)
pr.vars = c(readRDS("./input/pb.pr.vars.rds"), "hzn_depth")
sel.pb = complete.cases(ds801[,c("log.pb", pr.vars)])
mrf = ranger::ranger(y=ds801$log.pb[sel.pb], x=ds801[sel.pb, pr.vars],
num.trees = 85, importance = 'impurity')
mrf
#> Ranger result
#>
#> Call:
#> ranger::ranger(y = ds801$log.pb[sel.pb], x = ds801[sel.pb, pr.vars], num.trees = 85, importance = "impurity")
#>
#> Type: Regression
#> Number of trees: 85
#> Sample size: 14264
#> Number of independent variables: 188
#> Mtry: 13
#> Target node size: 5
#> Variable importance mode: impurity
#> Splitrule: variance
#> OOB prediction error (MSE): 0.09636483
#> R squared (OOB): 0.670695
which results in Rsquare of about 0.67. Because many training samples have exactly the same coordinates (same site, three depths), we assume that this model is overfitting i.e. that the outofbag accuracy is probably overoptimistic. Instead we can fit an Ensemble model where we block points within 30 by 30km blocks:
if(!exists("eml.pb")){
lrn.rf = mlr::makeLearner("regr.ranger", num.trees=85, importance="impurity",
num.threads = parallel::detectCores())
lrns.pb < list(lrn.rf, mlr::makeLearner("regr.xgboost"), mlr::makeLearner("regr.cvglmnet"))
tsk0.pb < mlr::makeRegrTask(data = ds801[sel.pb, c("log.pb", pr.vars)],
target = "log.pb", blocking = as.factor(ds801$ID[sel.pb]))
init.pb < mlr::makeStackedLearner(lrns.pb, method="stack.cv", super.learner="regr.lm",
resampling=mlr::makeResampleDesc(method="CV", blocking.cv=TRUE))
parallelMap::parallelStartSocket(parallel::detectCores())
eml.pb = train(init.pb, tsk0.pb)
parallelMap::parallelStop()
}
#> [14:42:43] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
summary(eml.pb$learner.model$super.model$learner.model)
#>
#> Call:
#> stats::lm(formula = f, data = d)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> 2.3979 0.1990 0.0117 0.1699 6.2948
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>t)
#> (Intercept) 0.64745 0.04296 15.069 < 2e16 ***
#> regr.ranger 0.85552 0.02152 39.755 < 2e16 ***
#> regr.xgboost 0.26550 0.05337 4.974 6.62e07 ***
#> regr.cvglmnet 0.25631 0.02024 12.665 < 2e16 ***
#> 
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4075 on 14260 degrees of freedom
#> Multiple Rsquared: 0.4328, Adjusted Rsquared: 0.4327
#> Fstatistic: 3627 on 3 and 14260 DF, pvalue: < 2.2e16
which shows somewhat lower Rsquare of 0.44, this time that whole sites have been taken out and hence this seems to be somewhat more realistic estimate of the mapping accuracy (Roberts et al., 2017). The accuracy plot shows that the model has some problems with predicting higher values, but overall matches the observed values:
t.pb = quantile(ds801$log.pb, c(0.001, 0.01, 0.999), na.rm=TRUE)
plot_hexbin(varn="log.pb", breaks=c(t.pb[1], seq(t.pb[2], t.pb[3], length=25)),
meas=eml.pb$learner.model$super.model$learner.model$model$log.pb,
pred=eml.pb$learner.model$super.model$learner.model$fitted.values,
main="Pb [EML]")
Variables most important for explaining distribution of the target variable (based on the variable importance)
seem to be soil depth (hnz_depth
), soil type maps, annual day time temperature and travel time to cities:
library(ggplot2)
xl.pb < as.data.frame(mlr::getFeatureImportance(eml.pb[["learner.model"]][["base.models"]][[1]])$res)
xl.pb$relative_importance = 100*xl.pb$importance/sum(xl.pb$importance)
xl.pb = xl.pb[order(xl.pb$relative_importance, decreasing = T),]
xl.pb$variable = paste0(c(1:nrow(xl.pb)), ". ", xl.pb$variable)
ggplot(data = xl.pb[1:20,], aes(x = reorder(variable, relative_importance), y = relative_importance)) +
geom_bar(fill = "steelblue",
stat = "identity") +
coord_flip() +
labs(title = "Variable importance",
x = NULL,
y = NULL) +
theme_bw() + theme(text = element_text(size=15))
If we plot the travel time to cities vs Pb concentrations, we can clearly see that Pb is negatively correlated with travel time to cities (following a loglog linear relationship):
openair::scatterPlot(ds801[ds801$pb_ppm<140,], x = "travel_time_to_cities_1_usa48", y = "pb_ppm", method = "hexbin", col = "increment", log.x=TRUE, log.y=TRUE, xlab="Travel time to cities 1", ylab="Pb [ppm]", type="hzn_depth")
#> Warning: removing 11 missing rows due to hzn_depth
3.2 Predictions in 3D
To produce predictions we can focus on area around Chicago conglomeration. We can load the covariate layers by using:
g1km = readRDS("./input/chicago_grid1km.rds")
This contains all layers we used for training. We can generate predictions by adding a depth column, then write predictions to GeoTIFFs:
for(k in c(5, 30, 60)){
out.tif = paste0("./output/pb_ppm_", k, "cm_1km.tif")
if(!file.exists(out.tif)){
g1km$hzn_depth = k
sel.na = complete.cases(g1km)
newdata = g1km[sel.na, eml.pb$features]
pred = predict(eml.pb, newdata=newdata)
g1km.sp = SpatialPixelsDataFrame(as.matrix(g1km[sel.na,c("x","y")]),
data=pred$data, proj4string=CRS("EPSG:5070"))
g1km.sp$pred = expm1(g1km.sp$response)
rgdal::writeGDAL(g1km.sp["pred"], out.tif, type="Int16", mvFlag=32768, options=c("COMPRESS=DEFLATE"))
#gc()
}
}
This finally gives the following pattern:
Based on these results, it can be said that in general:
 Distribution of Pb across USA seem to be controlled by a mixture of factors
including climatic factors, soilterrain units and anthropogenic factors (travel distance to cities);
 Overall big urban areas show significantly higher concentrations for some heavy
metals and this relationship if loglog linear;
 Soil depth for some geochemical elements comes as the overall most important covariate hence mapping soil variables in 3D is fully justified;
3.3 Modeling multiple geochemicals using a single ML model
The package randomForestSRC (Ishwaran & Kogalur, 2022) provides functionality to run socalled “Multivariate Outcomes” which means that one can fit a model with multivariate outcomes (one model to predict multiple target variables). In the case of geochemical data, usually multitude of geochemicals (50+) are analyzed in lab, so that it seems more practical to fit a single model to predict p+ target variables (outcomes) all at once, than to fit p+ models instead.
In the case of ds801
dataset we can first specify number of target variables that we would like to map:
tvs = c("as_ppm", "al_pct", "cd_ppm", "cr_ppm", "cu_ppm", "fe_pct",
"mo_ppm", "ni_ppm", "se_ppm", "th_ppm", "zn_ppm", "pb_ppm")
## number of missing values per variable:
sapply(tvs, function(i){sum(is.na(ds801[,i]))})
#> as_ppm al_pct cd_ppm cr_ppm cu_ppm fe_pct mo_ppm ni_ppm se_ppm th_ppm zn_ppm
#> 198 4 4290 41 23 24 32 43 6906 19 31
#> pb_ppm
#> 11
These variables are obviously highly crosscorrelated:
Here especially zn
, cu
, fe
, al
and ni
seem to be highly crosscorrelated.
The multicolinearity of these variables could be reduced by converting them e.g.
to Principal Components (Venables & Ripley, 2002). Principal Component Analysis helps
reduce overlap in data and can help with modeling.
Before transferring the original values to Principal Components we can also filter out
all missing values by replacing them with a mean value (for scaled variable this is 0):
ds801.x[is.na(ds801.x)] < 0
ds801.xd < as.data.frame(ds801.x)
pcs.ds801 < stats::prcomp(x = ds801.xd)
This again shows that most of geochemicals are positively correlated with about 6 groups of most correlated variables.
Note that, in the previous step, we convert all lognormal or percent variables
using log1p
and/or logit transformation to: (1) produce closetonormal distribution
and (2) avoid producing predictions beyond the physical range of data (e.g. negative values or values >100%).
First 6 components explain majority of variance in data, but it appears that to be on a safeside, we can model up to PC11:
summary(pcs.ds801)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 2.4221 1.1343 0.98982 0.79734 0.74930 0.67263 0.63508
#> Proportion of Variance 0.5244 0.1150 0.08759 0.05684 0.05019 0.04045 0.03606
#> Cumulative Proportion 0.5244 0.6395 0.72706 0.78390 0.83409 0.87454 0.91060
#> PC8 PC9 PC10 PC11 PC12
#> Standard deviation 0.58353 0.50063 0.41659 0.37258 0.31070
#> Proportion of Variance 0.03044 0.02241 0.01551 0.01241 0.00863
#> Cumulative Proportion 0.94104 0.96344 0.97896 0.99137 1.00000
As a rule of thumb, we advise to focus on components that explain >95% of variance in the data; to be certain one can also use >99% as the threshold.
Note the steps we have taken so far include: (1) we first transform the target variables to achieve close to normaldistribution, (2) we scale all numbers to have 0 = mean and 1 = standard deviation, (3) we replace missing value with the mean value (0 in this case), (4) we derive Principle Components with all data so nothing is lost.
Next, we can fit a model using the randomForestSRC package and all PCs expect for the last PC (which is usually pure noise):
nComp = 11; Ntree = 85
xdta = ds801[,pr.vars]
ydta = pcs.ds801$x[,1:nComp]
tfm = get.mv.formula(colnames(ydta))
## spatial tiles 30x30km
sp.tiles = unique(ds801$ID)
## we define own custom bootstrapping principle by
## organizing training points using spatial blocks;
samp = randomForestSRC:::make.sample(ntree=Ntree, samp.size=length(sp.tiles),
boot.size=length(sp.tiles)/2)
samp.df = plyr::join(ds801["ID"], cbind(data.frame(ID=sp.tiles), data.frame(samp)), type="left")
#> Joining by: ID
samp.df = as.matrix(samp.df[,1])
## sampsize must be identical for each tree
n.col = min(colSums(samp.df))
samp.dfa = apply(X = samp.df, MARGIN = 2, FUN = function(i){fill.null(i, n.col)})
m.s1 = randomForestSRC::rfsrc(tfm, data.frame(ydta, xdta),
bootstrap = "by.user",
samp = samp.dfa,
importance=TRUE)
print(m.s1, outcome.target = "PC1")
#> Sample size: 14275
#> Number of trees: 85
#> Forest terminal node size: 5
#> Average no. of terminal nodes: 1365.482
#> No. of variables tried at each split: 63
#> Total no. of variables: 188
#> Total no. of responses: 11
#> User has requested response: PC1
#> Resampling used to grow trees: by.user
#> Resample size used to grow trees: 7109
#> Analysis: mRFR
#> Family: regr+
#> Splitting rule: mv.mse *random*
#> Number of random split points: 10
#> % variance explained: 53.49
#> Error rate: 2.73
This basically fits a multivariate RF of form rfsrc(Multivar(y1, y2, ..., yd) ~ . , my.data, ...)
.
The resulting model show that the models explain significant part of variance. As expected the significance of the model
is usually the highest for the higher level components (PC1, PC2), then it gradually drops.
We can access the variable importance of each individual model by using:
vmp < as.data.frame(get.mv.vimp(m.s1))["PC1"]
vmp$relative_importance = 100*vmp$PC1/sum(vmp$PC1)
vmp = vmp[order(vmp$relative_importance, decreasing = TRUE),]
vmp$variable = paste0(c(1:nrow(vmp)), ". ", row.names(vmp))
ggplot(data = vmp[1:20,], aes(x = reorder(variable, relative_importance), y = relative_importance)) +
geom_bar(fill = "steelblue",
stat = "identity") +
coord_flip() +
labs(title = "Variable importance",
x = NULL,
y = NULL) +
theme_bw() + theme(text = element_text(size=15))
3.4 Predicting geochemicals using decompositioncomposition method
To generate predictions for different depths and for different PCs, we can loop
the predict.rfsrc
function across different soil depths:
for(j in 1:nComp){
for(k in c(5, 30, 60)){
out.tif = paste0("./output/pc", j, "_", k, "cm_1km.tif")
if(!file.exists(out.tif)){
g1km$hzn_depth = k
sel.na = complete.cases(g1km)
newdata = g1km[sel.na, pr.vars]
pred = predict(m.s1, m.target = paste0("PC", j), newdata=newdata)
g1km.sp = SpatialPixelsDataFrame(as.matrix(g1km[sel.na,c("x","y")]),
data=data.frame(pred$regrOutput[[paste0("PC", j)]][[1]]),
proj4string=CRS("EPSG:5070"))
g1km.sp$pred = g1km.sp@data[,1]*100
rgdal::writeGDAL(g1km.sp["pred"], out.tif, type="Int16", mvFlag=32768, options=c("COMPRESS=DEFLATE"))
#gc()
}
}
}
This gives predictions (in the scaled space) for each of the original target variables and for all depths. Note that, because the randomForestSRC package uses by default all cores to run model training and prediction, there is no need to specify any further parallelization i.e. it is fine to run processing in a loop.
After we generate predictions, we can (1) first backtransform values to original scale using the PCA model eigenvectors, (2) scaleback from logtransformation. For example the predictions for topsoil, we first load all predicted PC’s:
ls.5cm = list.files('./output', pattern=glob2rx("^pc*_5cm_1km.tif$"), full.names = TRUE)
pred.tops = raster::stack(ls.5cm)
pred.tops = as(pred.tops, "SpatialGridDataFrame")
next, we backtransform values to the original scale by:
Xhat = as.matrix(pred.tops@data[,paste0("pc", 1:nComp, "_5cm_1km")]/100) %*% t(pcs.ds801$rotation[,1:nComp])
## scaleback to the original mean / stdev:
Xhat.s = sweep(Xhat %*% diag(attr(ds801.x, 'scaled:scale')), 2, attr(ds801.x, 'scaled:center'), "+")
Xhat.s = as.data.frame(expm1( Xhat.s ))
names(Xhat.s) = tvs
We can plot two predictions of Pb
(using EML produced in the previous example; and using PCA analysis) next to each other:
pred.tops$Pb_5cm_1km.PCA = log1p(Xhat.s[,"pb_ppm"])
pred.tops$pb.pca_ppm_5cm_1km = pred.tops$Pb_5cm_1km.PCA * 100
#rgdal::writeGDAL(pred.tops["pb.pca_ppm_5cm_1km"], "./output/pb.pca_ppm_5cm_1km.tif",
# type="Int16", mvFlag=32768, options=c("COMPRESS=DEFLATE"))
pred.tops$Pb_5cm_1km.EML = log1p(readGDAL("./output/pb_ppm_5cm_1km.tif")$band1)
#> ./output/pb_ppm_5cm_1km.tif has GDAL driver GTiff
#> and has 342 rows and 378 columns
r.pal = rev(rainbow(65)[1:48])
pred.tops.df = as.data.frame(pred.tops[c("Pb_5cm_1km.EML", "Pb_5cm_1km.PCA")])
all_df_long < tidyr::gather(pred.tops.df, 'layer', 'value', s1, s2)
pred.pnts = as.data.frame(ngs.xy[!is.na(over(y=pred.tops[1], x=ngs.xy)[,1]),], xy=TRUE)
ggplot() +
geom_raster(data = all_df_long, aes(x=s1, y=s2, fill = value)) +
facet_wrap(. ~ layer, ncol = 2) +
geom_point(data = pred.pnts, aes(x=longitude, y=latitude), shape="+", size=2) +
scale_fill_gradientn(colours=r.pal, name="", limits=c(2.8,3.9)) +
coord_equal() +
theme_minimal() + xlab("Westing")+ ylab("Northing")
This shows that the predictions are quite similar, except the PCAmethod somewhat smooths out some very high values (Chicago city area). The difference also happens due to the fact that, for the map on the left, we have use an Ensemble model, while the map on the right is purely based on RF.
It seem to be also possible to derive prediction errors using the quantreg
function, although
this can get at the order of magnitude more computational. In addition, for multiple
predicted PCs one would need to derive a composite prediction error (usually a
geometric mean of variances).
3.5 Predicting geochemicals assuming Nohumaninfluence
Using this integrated model for geochemicals, we could also map hypothetical distribution of the target geochemical
assuming no or marginal human influence (NoHI). For this we can use the same model,
but during prediction replace values of the covariates for those that quantify
human impact. For example layers travel_time_to_cities
, population_density
,
lights at night images, density of industrial complexes and proximity to roads and
cities, are all variables that basically quantify human / industrial footprint (Mu et al., 2022).
To predict values of geochemicals assuming no human influence i.e. to estimate the
background or natural concentrations of geochemicals we can thus replace values
for all such layers using the following simple principles (Sanderman, Hengl, & Fiske, 2017):
 for lights at night images and human population density we replace all values with 0 value;
 for travel time to cities, buffer densities to industrial sources of pollution and similar we use some very high value e.g. 99% probability quantile value for the whole of USA;
We can thus remake all predictions using NoHI settings:
## only topsoil
k = 5
for(j in 1:nComp){
out.tif = paste0("./output/pc", j, ".nohi_", k, "cm_1km.tif")
if(!file.exists(out.tif)){
g1km$hzn_depth = k
sel.na = complete.cases(g1km)
newdata = g1km[sel.na, pr.vars]
## replace values / simulate noHI:
newdata[,grep("travel", pr.vars)] = 1000
newdata[,grep("pop", pr.vars)] = 0
#newdata[,grep("light", pr.vars)] = 0
pred = predict(m.s1, m.target = paste0("PC", j), newdata=newdata)
g1km.sp = SpatialPixelsDataFrame(as.matrix(g1km[sel.na,c("x","y")]),
data=data.frame(pred$regrOutput[[paste0("PC", j)]][[1]]),
proj4string=CRS("EPSG:5070"))
g1km.sp$pred = g1km.sp@data[,1]*100
rgdal::writeGDAL(g1km.sp["pred"], out.tif, type="Int16", mvFlag=32768, options=c("COMPRESS=DEFLATE"))
#gc()
}
}
Again we need to backtransform the values to original scale:
lsN.5cm = list.files('./output', pattern=glob2rx("^pc*.nohi_5cm_1km.tif$"), full.names = TRUE)
predN.tops = raster::stack(lsN.5cm)
predN.tops = as(predN.tops, "SpatialGridDataFrame")
XhatN = as.matrix(predN.tops@data[,paste0("pc", 1:nComp, ".nohi_5cm_1km")]/100) %*% t(pcs.ds801$rotation[,1:nComp])
XhatN.s = sweep(XhatN %*% diag(attr(ds801.x, 'scaled:scale')), 2, attr(ds801.x, 'scaled:center'), "+")
XhatN.s = as.data.frame(expm1( XhatN.s ))
names(XhatN.s) = tvs
We can plot the two predictions for e.g. Pb to see if there is any difference:
pred.tops$Pb.nohi_5cm_1km.PCA = log1p(XhatN.s[,"pb_ppm"])
pred.tops$pb.pca.nohi_ppm_5cm_1km = pred.tops$Pb.nohi_5cm_1km.PCA * 100
#rgdal::writeGDAL(pred.tops["pb.pca.nohi_ppm_5cm_1km"], "./output/pb.pca.nohi_ppm_5cm_1km.tif",
# type="Int16", mvFlag=32768, options=c("COMPRESS=DEFLATE"))
predN.tops.df = as.data.frame(pred.tops[c("Pb_5cm_1km.PCA", "Pb.nohi_5cm_1km.PCA")])
allN_df_long < tidyr::gather(predN.tops.df, 'layer', 'value', s1, s2)
ggplot() +
geom_raster(data = allN_df_long, aes(x=s1, y=s2, fill = value)) +
facet_wrap(. ~ layer, ncol = 2) +
geom_point(data = pred.pnts, aes(x=longitude, y=latitude), shape="+", size=2) +
scale_fill_gradientn(colours=r.pal, name="", limits=c(2.8,3.9)) +
coord_equal() +
theme_minimal() + xlab("Westing")+ ylab("Northing")
This has significantly reduced the total values of Pb. Basically the map on the right is most likely showing the background or natural concentration of Pb assuming no human influence (notice the city of Chicago and similar are not visible on the map any more). The difference between the two maps is an estimate of humaninduced pollution by Pb. If there are historic Pb soil measurements the map on the right could also be validated.
Note that we have only modified values for travel time to cities, population density etc, also the MODIS LST night time temperatures should be probably adjusted since they also show big urban areas and similar. We have not done it here because converting MODIS LST images to nohuman influence would probably require modeling in itself, but it is most likely doable.
3.6 Advantages and limitations of running 3D predictive mapping
In summary 3D soil mapping is relatively straight forward to implement especially for mapping soil variables from soil profiles (multiple samples per soil layer). Before modeling the target variable with depth, it is a good idea to plot relationship between target variable and depth under different settings (as in Fig. 3.2). 3D soil mapping based on Machine Learning is now increasingly common (T. Hengl & MacMillan, 2019; Sothe, Gonsamo, Arabian, & Snider, 2022).
A limitation for 3D predictive mapping is the size of data i.e. data volumes increasing proportionally to number of slices we need to predict (in this case three). Also, we show that points with exactly the same coordinates might result in e.g. Random Forest overfitting emphasizing some covariate layers that are possibly less important, hence it is again important to use the blocking parameter that separates training and validation points.
Soil depth is for many soil variables most important explanatory variables, but in the cases it does not correlate with the target variable, there is probably also no need for 3D soil mapping. In the case depth is not significantly correlated, one could simply first aggregate all values to fixed block depth and convert modeling from 3D to 2D.
By combining PCA analysis and predictive mapping (we refer to this method as the Decompositioncomposition PCA method) one can integrate modeling of multiple crosscorrelated variables and multicolinearity reduction (PCA analysis). This seems to be especially applicable for cases where there are ≫10 target variables which are highly crosscorrelated, possibly have inconsistent missing values and after transformation match closetonormal distribution. As a rule of thumb one can model only PCs that explain up to 99% of variance in data, this way modeling effort can be significantly reduced (e.g. for 50+ target variables one could model all target variables only a dozen of PCs). A disadvantage of the Decompositioncomposition PCA method is that interpretation of scaled values is somewhat more complex, certainly more abstract.