Final Project for the course "Aprendizaje Automático aplicado a datos de Biodiversidad" (Machine Learning Applied to Biodiversity Data)
Redbioma. Instituto Tecnológico de Costa Rica
Ana Fandiño Carro
Species Distribution Modelling with Google Earth Engine
Note: Conservation biologist Dr. Ramiro D. Crego implemented SDM using the GEE JavaScript Code Editor and published his research findings (Crego et al., 2022). The SDM methodology introduced here has been translated into Python and modified from the JavaScript source code he shared.
The methodology described in the following tutorial has also been slightly adapted to the target species of interest and to the relevant climate data. https://developers.google.com/earth-engine/tutorials/community/species-distribution-modeling/species-distribution-modeling
import ee
# Trigger the authentication flow.
ee.Authenticate()
ee.Initialize(project = 'ee-fandian24bio')
# Import libraries
import geemap
import geemap.colormaps as cm
import pandas as pd, geopandas as gpd
import numpy as np, matplotlib.pyplot as plt
import os, requests, math, random
from ipyleaflet import TileLayer
from statsmodels.stats.outliers_influence import variance_inflation_factor
Data Required for SDM
SDM typically uses the relationship between known species occurrence records and environmental variables to identify the conditions under which a population can persist. In other words, two types of model input data are required:
- Occurrence records of known species
- Various environmental variables
These data are input into algorithms to identify environmental conditions associated with the presence of species.
Workflow of SDM using GEE
The workflow for SDM using GEE is as follows:
1.- Collection and preprocessing of species occurrence data
2.- Definition of the Area of Interest
3.- Addition of GEE environmental variables
4.- Generation of pseudo-absence data
5.- Model fitting and prediction
6.- Variable importance and accuracy assessment
1.- Collection and Preprocessing of Species Occurrence Data
GBIF.org (24 August 2024) GBIF Occurrence Download https://doi.org/10.15468/dl.pka5js
# Open data as pandas DataFrame
df = pd.read_csv("Mcabrerae.csv")
df.shape
(2400, 223)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Columns: 223 entries, gbifID to iucnRedListCategory
dtypes: bool(4), float64(105), int64(10), object(104)
memory usage: 4.0+ MB
#Selection of the columns
selection = ["species", "year", "month", "decimalLongitude", "decimalLatitude"]
df = df[selection]
df.describe()
  | year | month | decimalLongitude | decimalLatitude |
---|---|---|---|---|
count | 1972.000000 | 1528.000000 | 2370.000000 | 2370.000000 |
mean | 2010.879817 | 6.746073 | -4.727127 | 39.775464 |
std | 7.447313 | 3.149357 | 2.954253 | 1.284616 |
min | 1905.000000 | 1.000000 | -8.959004 | 37.160000 |
25% | 2007.000000 | 4.000000 | -7.305551 | 38.620983 |
50% | 2012.000000 | 6.000000 | -5.300000 | 39.860000 |
75% | 2015.000000 | 10.000000 | -1.820000 | 40.659700 |
max | 2024.000000 | 12.000000 | -0.380000 | 42.590000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   species           2400 non-null   object
 1   year              1972 non-null   float64
 2   month             1528 non-null   float64
 3   decimalLongitude  2370 non-null   float64
 4   decimalLatitude   2370 non-null   float64
dtypes: float64(4), object(1)
memory usage: 93.9+ KB
df.dropna(inplace=True)
df.shape
(1518, 5)
Next, we convert the DataFrame into a GeoDataFrame that includes a column for geographic information (geometry) and check the first row. A GeoDataFrame can be saved to a spatial file format such as a Shapefile, GeoJSON, or GeoPackage (*.gpkg) and read back in.
# Convert DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(
df,
geometry=gpd.points_from_xy(df.decimalLongitude,
df.decimalLatitude),
crs="EPSG:4326"
)[["species", "year", "month", "geometry"]]
gdf.head(1) # Display the first row of the GeoDataFrame
  | species | year | month | geometry |
---|---|---|---|---|
270 | Microtus cabrerae | 2024.0 | 3.0 | POINT (-5.65824 40.92633) |
gdf.to_file('file.shp', driver='ESRI Shapefile')
gpd.read_file("file.shp")
  | species | year | month | geometry |
---|---|---|---|---|
0 | Microtus cabrerae | 2024.0 | 3.0 | POINT (-5.65824 40.92633) |
1 | Microtus cabrerae | 2023.0 | 11.0 | POINT (-1.49000 39.47000) |
2 | Microtus cabrerae | 2023.0 | 11.0 | POINT (-1.24000 39.57000) |
3 | Microtus cabrerae | 2023.0 | 11.0 | POINT (-1.23000 39.75000) |
4 | Microtus cabrerae | 2023.0 | 10.0 | POINT (-1.23000 39.51000) |
... | ... | ... | ... | ... |
1513 | Microtus cabrerae | 1996.0 | 9.0 | POINT (-6.68000 40.50000) |
1514 | Microtus cabrerae | 1996.0 | 9.0 | POINT (-6.68000 40.50000) |
1515 | Microtus cabrerae | 1996.0 | 9.0 | POINT (-6.68000 40.50000) |
1516 | Microtus cabrerae | 1996.0 | 9.0 | POINT (-6.68000 40.50000) |
1517 | Microtus cabrerae | 1992.0 | 10.0 | POINT (-1.17000 42.43000) |
1518 rows × 4 columns
gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 1518 entries, 270 to 2397
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   species   1518 non-null   object
 1   year      1518 non-null   float64
 2   month     1518 non-null   float64
 3   geometry  1518 non-null   geometry
dtypes: float64(2), geometry(1), object(1)
memory usage: 59.3+ KB
gdf.describe()
  | year | month |
---|---|---|
count | 1518.000000 | 1518.000000 |
mean | 2011.402503 | 6.760870 |
std | 7.742863 | 3.153531 |
min | 1905.000000 | 1.000000 |
25% | 2010.000000 | 4.000000 |
50% | 2013.000000 | 6.000000 |
75% | 2015.000000 | 10.000000 |
max | 2024.000000 | 12.000000 |
#gdf.to_file("microtus_cabrerae.GeoJSON", driver="GeoJSON")
#gdf = gpd.read_file("microtus_cabrerae.GeoJSON")
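As noted above, the GeoDataFrame can also be round-tripped through a GeoPackage file instead of a Shapefile or GeoJSON; a minimal sketch (the file and layer names are illustrative):
# GeoPackage round-trip (alternative to the Shapefile/GeoJSON exports above)
gdf.to_file("microtus_cabrerae.gpkg", layer="occurrences", driver="GPKG")
gdf_gpkg = gpd.read_file("microtus_cabrerae.gpkg", layer="occurrences")
print(gdf_gpkg.shape)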
We have created a function that visualizes the distribution of records by year and month from the GeoDataFrame and displays it as a graph, which can then be saved as an image file. The heatmap makes it easy to grasp the frequency of species occurrences by year and month, providing an intuitive view of temporal patterns and seasonal variation in the data and allowing outliers or quality issues to be spotted quickly.
# Yearly and monthly data distribution heatmap
def plot_heatmap(gdf, h_size= 12):
statistics = gdf.groupby(["month", "year"]).size().unstack(fill_value=0)
# Heatmap
plt.figure(figsize=(h_size, h_size - 6))
heatmap = plt.imshow(
statistics.values, cmap="YlOrBr", origin="upper", aspect="auto"
)
# Display values above each pixel
for i in range(len(statistics.index)):
for j in range(len(statistics.columns)):
plt.text(
j, i, statistics.values[i, j], ha="center", va="center", color="black"
)
plt.colorbar(heatmap, label="Count")
plt.title("Monthly Species Count by Year")
plt.xlabel("Year")
plt.ylabel("Month")
plt.xticks(range(len(statistics.columns)), statistics.columns)
plt.yticks(range(len(statistics.index)), statistics.index)
plt.tight_layout()
plt.savefig("heatmap_plot.png")
plt.show()
plot_heatmap(gdf)
# Filter the data by year (for consistency with the climatic data and because the early years contribute very few records)
filtered_gdf = gdf[
(~gdf['year'].between(1905, 1995))
]
# Reuse the heatmap function defined above on the filtered data
plot_heatmap(filtered_gdf)
gdf = filtered_gdf
#The duplicate points will be removed later
gdf.duplicated().sum()
911
Now, the filtered GeoDataFrame is converted into a Google Earth Engine object.
# Convert GeoDataFrame to Earth Engine object
data_raw = geemap.geopandas_to_ee(gdf)
Next, we will define the raster pixel size of the SDM results as 1km resolution.
# Spatial resolution setting (meters)
grain_size = 1000
When multiple occurrence points are present within the same 1km resolution raster pixel, there is a high likelihood that they share the same environmental conditions at the same geographic location. Using such data directly in the analysis can introduce bias into the results.
In other words, we need to limit the potential impact of geographic sampling bias. To achieve this, we will retain only one location within each 1km pixel and remove all others, allowing the model to more objectively reflect the environmental conditions.
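For intuition, a rough client-side equivalent of this thinning (an illustration only; the workflow below performs it server-side in GEE) would be to snap coordinates to an approximately 1 km grid, here taken as 0.01 degrees, and keep a single record per grid cell:
# Illustrative client-side thinning: snap to ~0.01 degree (~1 km) cells and keep one record per cell
thinned = gdf.assign(
    cell_x=(gdf.geometry.x / 0.01).round().astype(int),
    cell_y=(gdf.geometry.y / 0.01).round().astype(int),
).drop_duplicates(subset=["cell_x", "cell_y"]).drop(columns=["cell_x", "cell_y"])
print(len(gdf), "->", len(thinned))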
def remove_duplicates(data, grain_size):
# Select one occurrence record per pixel at the chosen spatial resolution
random_raster = ee.Image.random(seed=42).reproject("EPSG:4326", None, grain_size)
rand_point_vals = random_raster.sampleRegions(
collection=ee.FeatureCollection(data), geometries=True
)
return rand_point_vals.distinct("random")
data = remove_duplicates(data_raw, grain_size)
# Before selection and after selection
print("Original data size:", data_raw.size().getInfo())
print("Final data size:", data.size().getInfo())
Original data size: 1485
Final data size: 300
The visualization comparing geographic sampling bias before preprocessing (in blue) and after preprocessing (in red) is shown below.
# Visualization of geographic sampling bias before (blue) and after (red) preprocessing
Map = geemap.Map(layout={"height": "400px", "width": "800px"})
# Add the random raster layer
random_raster = ee.Image.random(seed=42).reproject("EPSG:4326", None, grain_size)
Map.addLayer(
random_raster,
{"min": 0, "max": 1, "palette": ["black", "white"], "opacity": 0.5},
"Random Raster",
)
# Add the original data layer in blue
Map.addLayer(data_raw, {"color": "blue"}, "Original data")
# Add the final data layer in red
Map.addLayer(data, {"color": "red"}, "Final data")
# Set the center of the map to the coordinates
Map.set_center(-3.70256, 40.4165, 5)
Map
2.- Definition of the Area of Interest
The Area of Interest (hereafter AOI) is the term researchers use for the geographical area they want to analyze; it has much the same meaning as the term Study Area.
In this context, we obtained the bounding box of the occurrence point layer geometry and created a 300-kilometer buffer around it (with a maximum tolerance of 1,000 meters) to define the AOI.
# Define the AOI
aoi = data.geometry().bounds().buffer(distance=300000, maxError=1000)
# Add the AOI to the map
outline = ee.Image().byte().paint(
featureCollection=aoi, color=1, width=3)
Map.remove_layer("Random Raster")
Map.addLayer(outline, {'palette': 'FF0000'}, "AOI")
Map.centerObject(aoi, 5)
Map
3.- Addition of GEE environmental variables
Now, let's add environmental variables to the analysis. GEE provides a wide range of datasets for environmental variables such as temperature, precipitation, elevation, land cover, and terrain. These datasets enable us to comprehensively analyze various factors that may influence the habitat preferences of the vole.
The selection of GEE environmental variables in SDM should reflect the habitat preference characteristics of the species. To do this, prior research and literature review on the vole's habitat preferences should be conducted. This tutorial primarily focuses on the workflow of SDM using GEE, so some in-depth details are omitted.
WorldClim V1 Bioclim:
This dataset provides 19 bioclimatic variables derived from monthly temperature and precipitation data. It covers the period from 1960 to 1991 and has a resolution of 927.67 meters.
CITATION: Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very High Resolution Interpolated Climate Surfaces for Global Land Areas. International Journal of Climatology 25: 1965-1978. doi:10.1002/joc.1276.
# WorldClim V1 Bioclim
bio = ee.Image("WORLDCLIM/V1/BIO")
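To see which of the 19 bioclimatic bands the image provides (a quick check, not part of the modelling workflow itself):
# List the bioclimatic band names (bio01 ... bio19)
print(bio.bandNames().getInfo())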
NASA SRTM Digital Elevation 30m:
This dataset contains digital elevation data from the Shuttle Radar Topography Mission (SRTM). The data was primarily collected around the year 2000 and is provided at a resolution of approximately 30 meters (1 arc-second). The following code calculates elevation, slope, aspect, and hillshade layers from the SRTM data.
# NASA SRTM Digital Elevation 30m
terrain = ee.Algorithms.Terrain(ee.Image("USGS/SRTMGL1_003"))
OpenLandMap USDA Soil Taxonomy Great Groups
Predicted USDA soil great group probabilities at 250m.
Distribution of the USDA soil great groups based on machine learning predictions from global compilation of soil profiles
soil = ee.Image("OpenLandMap/SOL/SOL_GRTGROUP_USDA-SOILTAX_C/v01")
Global Forest Cover Change (GFCC) Tree Cover Multi-Year Global 30m:
The Landsat Vegetation Continuous Fields (VCF) tree cover layers contain estimates of the percentage of horizontal ground in each 30-m pixel covered by woody vegetation greater than 5 meters in height. The dataset is available for four epochs centered on the years 2000, 2005, 2010 and 2015. The dataset is derived from the GFCC Surface Reflectance product (GFCC30SR), which is based on enhanced Global Land Survey (GLS) datasets. The GLS datasets are composed of high-resolution Landsat 5 Thematic Mapper (TM) and Landsat 7 Enhanced Thematic Mapper Plus (ETM+) images at 30 meter resolution.
Tree cover, the proportional, vertically projected area of vegetation (including leaves, stems, branches, etc.) of woody plants above a given height, affects terrestrial energy and water exchanges, photosynthesis and transpiration, net primary production, and carbon and nutrient fluxes. Tree cover also affects habitat quality and movements of wildlife, residential property value for humans, and other ecosystem services. The continuous classification scheme of the VCF product enables better depiction of land cover gradients than traditional discrete classification schemes. Importantly for detection and monitoring of forest changes (e.g., deforestation and degradation), tree cover provides a measurable attribute upon which to define forest cover and its changes. Changes in tree cover over time can be used to monitor and retrieve site-specific histories of forest change.
# Global Forest Cover Change (GFCC) Tree Cover Multi-Year Global 30m
tcc = ee.ImageCollection("NASA/MEASURES/GFCC/TC/v3")
median_tcc = (
tcc.filterDate("2000-01-01", "2015-12-31")
.select(["tree_canopy_cover"], ["TCC"])
.median()
)
EUCROPMAP
European crop type map based on Sentinel-1 and LUCAS Copernicus 2018 in-situ observations for 2018; and one based on Sentinel-2 and LUCAS Copernicus 2022 for 2022.
Capitalizing on the unique LUCAS 2018 Copernicus in-situ survey, the 2018 dataset is the first continental crop type map with 10m pixel size for the EU based on S1A and S1B Synthetic Aperture Radar observations for the year 2018; the 2022 dataset is a continuation of the work and is based on optical S2 observations for the year 2022.
crop = ee.ImageCollection("JRC/D5/EUCROPMAP/V1").filterDate('2018-01-01', '2019-01-01').first()
OpenLandMap Soil pH in H2O
Soil pH in H2O at 6 standard depths (0, 10, 30, 60, 100 and 200 cm) at 250 m resolution
pH = ee.Image("OpenLandMap/SOL/SOL_PH-H2O_USDA-4C1A2A_M/v02")
bio (bioclimatic variables), terrain (topography), median_tcc (tree canopy cover), soil (soil types), pH (soil pH), and crop (crop type) are combined into a single multiband image. The elevation band is selected from terrain, and a water mask is created for locations where elevation is greater than 0. This masks out regions below sea level (e.g. the ocean) and leaves a comprehensive stack of environmental factors for the AOI ready for analysis.
# Combine bands into a multi-band image
predictors = bio.addBands(terrain).addBands(soil).addBands(pH).addBands(crop).addBands(median_tcc)
# Create a water mask
watermask = terrain.select('elevation').gt(0)
# Mask out ocean pixels and clip to the area of interest
predictors = predictors.updateMask(watermask).clip(aoi)
When highly correlated predictor variables are included together in a model, multicollinearity issues can arise. Multicollinearity is a phenomenon that occurs when there are strong linear relationships among independent variables in a model, leading to instability in the estimation of the model's coefficients (weights). This instability can reduce the model's reliability and make predictions or interpretations for new data challenging. Therefore, we will consider multicollinearity and proceed with the process of selecting predictor variables.
First, we will generate 5,000 random points and then extract the predictor variable values of the single multiband image at those points.
# Generate 5,000 random points
data_cor = predictors.sample(scale=grain_size, numPixels=5000, geometries=True)
# Extract predictor variable values
pvals = predictors.sampleRegions(collection=data_cor, scale=grain_size)
We will convert the extracted predictor values for each point into a DataFrame and then check the first row.
# Converting predictor values from Earth Engine to a DataFrame
pvals_df = geemap.ee_to_df(pvals)
pvals_df.head(1)
  | TCC | aspect | b0 | b10 | b100 | b200 | b30 | b60 | bio01 | bio02 | ... | bio15 | bio16 | bio17 | bio18 | bio19 | classification | elevation | grtgroup | hillshade | slope |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.0 | 28 | 72 | 72 | 75 | 75 | 72 | 74 | 163 | 115 | ... | 53 | 213 | 33 | 34 | 202 | 300 | 301 | 17 | 179 | 1 |
1 rows × 32 columns
pvals_df.shape
(2122, 32)
# Displaying the columns
columns = pvals_df.columns
print(columns)
Index(['TCC', 'aspect', 'b0', 'b10', 'b100', 'b200', 'b30', 'b60', 'bio01', 'bio02', 'bio03', 'bio04', 'bio05', 'bio06', 'bio07', 'bio08', 'bio09', 'bio10', 'bio11', 'bio12', 'bio13', 'bio14', 'bio15', 'bio16', 'bio17', 'bio18', 'bio19', 'classification', 'elevation', 'grtgroup', 'hillshade', 'slope'], dtype='object')
We calculate Spearman correlation coefficients between the predictor variables and visualize them in a heatmap.
def plot_correlation_heatmap(dataframe, h_size=10, show_labels=False):
# Calculate Spearman correlation coefficients
correlation_matrix = dataframe.corr(method="spearman")
# Create a heatmap
plt.figure(figsize=(h_size, h_size-2))
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')
# Optionally display values on the heatmap
if show_labels:
for i in range(correlation_matrix.shape[0]):
for j in range(correlation_matrix.shape[1]):
plt.text(j, i, f"{correlation_matrix.iloc[i, j]:.2f}",
ha='center', va='center', color='white', fontsize=8)
columns = dataframe.columns.tolist()
plt.xticks(range(len(columns)), columns, rotation=90)
plt.yticks(range(len(columns)), columns)
plt.title("Variables Correlation Matrix")
plt.colorbar(label="Spearman Correlation")
plt.savefig('correlation_heatmap_plot.png')
plt.show()
# Plot the correlation heatmap of variables
plot_correlation_heatmap(pvals_df)
The Spearman correlation coefficient is useful for understanding the general pairwise associations among predictor variables, but it does not directly assess how several variables interact jointly; in particular, it does not detect multicollinearity.
The Variance Inflation Factor (hereafter VIF) is a statistical metric used to evaluate multicollinearity and guide variable selection. It indicates the degree of linear relationship of each independent variable with the other independent variables, and high VIF values can be evidence of multicollinearity.
Typically, when VIF values exceed 5 or 10, it suggests that the variable has a strong correlation with other variables, potentially compromising the stability and interpretability of the model. In this tutorial, a criterion of VIF values less than 10 was used for variable selection. The following 8 variables were selected based on VIF.
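For reference, VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing predictor i on all the remaining predictors. A minimal hand computation for a single, illustratively chosen column of pvals_df (this simple version adds an intercept, so its value may differ slightly from statsmodels' variance_inflation_factor, which uses the raw design matrix):
# Conceptual check of what VIF measures for one predictor
target = "elevation"  # illustrative choice of predictor
y = pvals_df[target].to_numpy(dtype=float)
X = pvals_df.drop(columns=[target]).to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(f"VIF({target}) ~ {1 / (1 - r2):.2f}")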
# Filter variables based on Variance Inflation Factor (VIF)
def filter_variables_by_vif(dataframe, threshold=10):
original_columns = dataframe.columns.tolist()
remaining_columns = original_columns[:]
while True:
vif_data = dataframe[remaining_columns]
vif_values = [
variance_inflation_factor(vif_data.values, i)
for i in range(vif_data.shape[1])
]
max_vif_index = vif_values.index(max(vif_values))
max_vif = max(vif_values)
if max_vif < threshold:
break
print(f"Removing '{remaining_columns[max_vif_index]}' with VIF {max_vif:.2f}")
del remaining_columns[max_vif_index]
filtered_data = dataframe[remaining_columns]
bands = filtered_data.columns.tolist()
print("Bands:", bands)
return filtered_data, bands
filtered_pvals_df, bands = filter_variables_by_vif(pvals_df)
Removing 'bio05' with VIF inf
Removing 'bio10' with VIF 156478.74
Removing 'b10' with VIF 50583.87
Removing 'b100' with VIF 29931.65
Removing 'b30' with VIF 24176.46
Removing 'bio01' with VIF 22847.52
Removing 'b60' with VIF 21108.69
Removing 'bio07' with VIF 12967.63
Removing 'b200' with VIF 4767.68
Removing 'bio12' with VIF 3190.19
Removing 'bio16' with VIF 2083.59
Removing 'bio02' with VIF 1520.00
Removing 'bio11' with VIF 682.07
Removing 'bio17' with VIF 536.63
Removing 'hillshade' with VIF 430.01
Removing 'bio13' with VIF 294.89
Removing 'bio03' with VIF 287.12
Removing 'b0' with VIF 180.75
Removing 'bio18' with VIF 135.88
Removing 'bio04' with VIF 112.46
Removing 'bio15' with VIF 65.95
Removing 'classification' with VIF 19.97
Removing 'bio19' with VIF 13.05
Removing 'bio09' with VIF 11.55
Bands: ['TCC', 'aspect', 'bio06', 'bio08', 'bio14', 'elevation', 'grtgroup', 'slope']
# Variable Selection Based on VIF
predictors = predictors.select(bands)
# Plot the correlation heatmap of variables
plot_correlation_heatmap(filtered_pvals_df, h_size=6, show_labels=True)
Next, let's visualize the 8 selected predictor variables on the map.
You can explore the available palettes for map visualization with cm.plot_colormaps(width=8.0, height=0.2). For example, the terrain palette looks like this:
cm.plot_colormap('terrain', width=8.0, height=0.2, orientation='horizontal')
# Soil types layer (grtgroup)
Map = geemap.Map(layout={'height':'400px', 'width':'800px'})
vis_params = {'bands':['grtgroup'], 'min': 0, 'max': 433, 'palette': [
'ffffff', 'adff2d', 'adff22', 'a5ff2f', '87ff37', 'baf019',
'87ff19', '96f03d', 'a3f52f', 'aff319', '91ff37', '9cf319',
'9bff37', '91ff19', '71ff37', '86ff19', 'a9d42d', 'aff519',
'9bff19', '9af024', 'a5fd2f', '88ff37', 'afed19', '71ff19',
'aff026', '8cf537', 'b7ff19', '7177c0', '9a85ec', 'f5f5e1',
'52cf5a', 'e42777', '4ef76d', 'ff00fb', 'eb05eb', 'fa04fa',
'fc04f5', 'f50df0', 'f118f1', 'fa0cfa', 'fc05e1', 'f100d5',
'eb09e6', 'fa22fa', 'ffdab9', 'f5d2bb', 'e8c9b8', 'ffddc4',
'e7cbc0', 'ffd2c3', 'f5d6bb', 'd5d3b9', 'e8d4b8', 'e7cdc0',
'f3eac8', 'a0c4ba', 'ffd2b9', 'f5dabb', 'f5d5b9', 'e8ebb8',
'ffddc2', 'e7ffc0', 'f3e6c8', 'ffdab9', 'f5cdb9', 'a91d30',
'796578', 'd8ff6e', '177548', '43efd6', '8496a9', '296819',
'73ffd4', '6fffc8', '75fbc9', '86f5d1', '82ffd2', '88eec8',
'80ffd4', '6bffc9', '88eec8', '7fffc8', '81ffd2', '86f0d4',
'67ffc8', '88eec8', '7ffbcb', '87ffd2', '8af5ce', '6bfad2',
'78f0d4', '88eec8', '7ffbd4', '73f5cd', '88c8d2', '91f0cd',
'73cdd2', '88eec8', 'fb849b', 'dd4479', '61388b', 'a52a30',
'722328', 'd81419', 'a42828', '82f5cd', 'a54c2e', 'c11919',
'b91419', '21b199', '702028', 'b41919', 'b22328', 'a2c7eb',
'36ba79', '806797', 'cb5b5f', 'cd5c5c', 'd94335', 'd35740',
'e05a5d', 'cf5b5c', 'ca5964', 'ca5d5f', 'cd5e5a', 'ca5969',
'd95a35', 'd36240', 'e05c43', 'd64755', 'cf595c', 'ff5f5f',
'cd6058', 'd95f35', 'd35140', 'd65a55', 'e05c59', 'cf525e',
'c65978', 'f5615f', '826f9a', 'cff41a', '4a6f31', 'a96989',
'e16438', '24f640', '88c1f9', 'f5d25c', 'd74322', '7f939e',
'41a545', '8f8340', '09fe03', '0aff00', '0ff30f', '02f00a',
'0fc903', '17f000', '0cff00', '0ac814', '0cfe00', '0aff0a',
'03ff05', '1cf31c', '24f000', '00ff0c', '14c814', '00fe4c',
'14ff96', '44d205', '05f305', '62f00a', '0fcd03', '00d20f',
'1add11', '09ff0c', '03ff05', '05e700', '02f00a', '0fea03',
'00f000', '0ccb0c', '14dd14', '6a685d', 'fae6b9', '769a34',
'6ff2df', 'ca7fc6', 'd8228f', 'c01bf0', 'd2bad3', 'd8c3cb',
'd4c6d4', 'd5bed5', 'ddb9dd', 'd8d2d8', 'd4c9d4', 'd2bad5',
'd5bad5', 'd5b2d5', 'd8c8d2', 'd4cbd4', '552638', '2571eb',
'ffa514', 'f3a502', 'fb7b00', 'f0b405', 'f7a80f', 'fb9113',
'ffa519', 'f3a702', 'fbba07', 'f7970f', 'f3a702', 'fb5a00',
'f0c005', 'f7810f', 'ff9c00', 'f3b002', 'f0b005', 'f7980f',
'4d7cfc', 'ffff00', 'fafa05', 'ebeb22', 'ffff14', 'f1f10a',
'fafa05', 'ebeb1e', 'f5eb0c', 'eef506', 'f1f129', 'fafa05',
'ebeb0c', 'f5d202', 'ffd700', 'f1f12b', 'a91fac', '2da468',
'9a8b71', '76b989', '713959',
]}
Map.addLayer(predictors, vis_params, 'grtgroup')
Map.add_colorbar(vis_params, label="grtgroup", orientation="vertical", layer_name="soil type")
Map.centerObject(aoi, 5)
Map
# Elevation layer
Map = geemap.Map(layout={'height':'400px', 'width':'800px'})
vis_params = {'bands':['elevation'], 'min': 0, 'max': 1800, 'palette': cm.palettes.terrain}
Map.addLayer(predictors, vis_params, 'elevation')
Map.add_colorbar(vis_params, label="Elevation (m)", orientation="vertical", layer_name="elevation")
Map.centerObject(aoi, 5)
Map
# Slope layer
Map = geemap.Map(layout={'height':'400px', 'width':'800px'})
vis_params = {'bands':['slope'], 'min': 0, 'max': 25, 'palette': cm.palettes.RdYlGn_r}
Map.addLayer(predictors, vis_params, 'slope')
Map.add_colorbar(vis_params, label="Slope", orientation="vertical", layer_name="slope")
Map.centerObject(aoi, 5)
Map
# Aspect layer
Map = geemap.Map(layout={'height':'400px', 'width':'800px'})
vis_params = {'bands':['aspect'], 'min': 0, 'max': 360, 'palette': cm.palettes.rainbow}
Map.addLayer(predictors, vis_params, 'aspect')
Map.add_colorbar(vis_params, label="Aspect", orientation="vertical", layer_name="aspect")
Map.centerObject(aoi, 5)
Map
# Calculate the minimum and maximum values for bio06
min_max_val = (
predictors.select("bio06")
.multiply(0.1)
.reduceRegion(reducer=ee.Reducer.minMax(), scale=1000)
.getInfo()
)
# bio06 (Min temperature of coldest month) layer
Map = geemap.Map(layout={"height": "400px", "width": "800px"})
vis_params = {
"min": math.floor(min_max_val["bio06_min"]),
"max": math.ceil(min_max_val["bio06_max"]),
"palette": cm.palettes.hot,
}
Map.addLayer(predictors.select("bio06").multiply(0.1), vis_params, "bio06")
Map.add_colorbar(
vis_params,
label="Min temperature of coldest month (℃)",
orientation="vertical",
layer_name="bio06",
)
Map.centerObject(aoi, 5)
Map
# Calculate the minimum and maximum values for bio08
min_max_val = (
predictors.select("bio08")
.multiply(0.1)
.reduceRegion(reducer=ee.Reducer.minMax(), scale=1000)
.getInfo()
)
# bio08 (Mean temperature of wettest quarter) layer
Map = geemap.Map(layout={"height": "400px", "width": "800px"})
vis_params = {
"min": math.floor(min_max_val["bio08_min"]),
"max": math.ceil(min_max_val["bio08_max"]),
"palette": cm.palettes.hot,
}
Map.addLayer(predictors.select("bio08").multiply(0.1), vis_params, "bio08")
Map.add_colorbar(
vis_params,
label="Mean temperature of wettest quarter (℃)",
orientation="vertical",
layer_name="bio08",
)
Map.centerObject(aoi, 5)
Map
# Calculate the minimum and maximum values for bio14
min_max_val = (
predictors.select("bio14")
.reduceRegion(reducer=ee.Reducer.minMax(), scale=1000)
.getInfo()
)
# bio14 (Precipitation of driest month) layer
Map = geemap.Map(layout={"height": "400px", "width": "800px"})
vis_params = {
"bands": ["bio14"],
"min": math.floor(min_max_val["bio14_min"]),
"max": math.ceil(min_max_val["bio14_max"]),
"palette": cm.palettes.Blues,
}
Map.addLayer(predictors, vis_params, "bio14")
Map.add_colorbar(
vis_params,
label="Precipitation of driest month (mm)",
orientation="vertical",
layer_name="bio14",
)
Map.centerObject(aoi, 5)
Map
# TCC layer
Map = geemap.Map(layout={"height": "400px", "width": "800px"})
vis_params = {
"bands": ["TCC"],
"min": 0,
"max": 100,
"palette": ["ffffff", "afce56", "5f9c00", "0e6a00", "003800"],
}
Map.addLayer(predictors, vis_params, "TCC")
Map.add_colorbar(
vis_params, label="Tree Canopy Cover (%)", orientation="vertical", layer_name="TCC"
)
Map.centerObject(aoi, 5)
Map
4.- Generation of pseudo-absence data
In the process of SDM, the selection of input data for a species is mainly approached using two methods:
Presence-Background Method: This method compares the locations where a particular species has been observed (presence) with other locations where the species has not been observed (background). Here, the background data does not necessarily mean areas where the species does not exist but rather is set up to reflect the overall environmental conditions of the study area. It is used to distinguish suitable environments where the species could exist from less suitable ones.
Presence-Absence Method: This method compares locations where the species has been observed (presence) with locations where it has definitively not been observed (absence). Here, absence data represents specific locations where the species is known not to exist. It does not reflect the overall environmental conditions of the study area but rather points to locations where the species is estimated not to exist.
In practice, it is often difficult to collect true absence data, so pseudo-absence data generated artificially is frequently used. However, it's important to acknowledge the limitations and potential errors of this method, as artificially generated pseudo-absence points may not accurately reflect true absence areas.
The choice between these two methods depends on data availability, research objectives, model accuracy and reliability, as well as time and resources. Here, we will use occurrence data collected from GBIF and artificially generated pseudo-absence data to model using the "Presence-Absence" method.
The generation of pseudo-absence data will be done through the "environmental profiling approach", and the specific steps are as follows:
1.- Environmental classification using k-means clustering: The k-means clustering algorithm, based on Euclidean distance, is used to divide the pixels within the study area into two clusters. One cluster represents areas with environmental characteristics similar to those of 100 randomly selected presence locations, while the other cluster represents areas with different characteristics.
2.- Generation of pseudo-absence data within dissimilar clusters: Within the second cluster identified in the first step (which has environmental characteristics different from those of the presence data), pseudo-absence points are generated at random. These pseudo-absence points represent locations where the species is not expected to occur.
# Randomly select 100 locations for occurrence
pvals = predictors.sampleRegions(
collection=data.randomColumn().sort('random').limit(100),
properties=[],
scale=grain_size
)
# Perform k-means clustering
clusterer = ee.Clusterer.wekaKMeans(
nClusters=2,
distanceFunction="Euclidean"
).train(pvals)
cl_result = predictors.cluster(clusterer)
# Get cluster ID for locations similar to occurrence
cl_id = cl_result.sampleRegions(
collection=data.randomColumn().sort('random').limit(200),
properties=[],
scale=grain_size
)
# Define non-occurrence areas in dissimilar clusters
cl_id = ee.FeatureCollection(cl_id).reduceColumns(ee.Reducer.mode(),['cluster'])
cl_id = ee.Number(cl_id.get('mode')).subtract(1).abs()
cl_mask = cl_result.select(['cluster']).eq(cl_id)
# Presence location mask
presence_mask = data.reduceToImage(properties=['random'],
reducer=ee.Reducer.first()
).reproject('EPSG:4326', None,
grain_size).mask().neq(1).selfMask()
# Masking presence locations in non-occurrence areas and clipping to AOI
area_for_pa = presence_mask.updateMask(cl_mask).clip(aoi)
# Area for Pseudo-absence
Map = geemap.Map(layout={'height':'400px', 'width':'800px'})
Map.addLayer(area_for_pa, {'palette': 'black'}, 'AreaForPA')
Map.centerObject(aoi, 5)
Map
5.- Model fitting and prediction
We will now divide the data into training data and test data. The training data will be used to find the optimal parameters by training the model, while the test data will be used to evaluate the model trained beforehand. An important concept to consider in this context is spatial autocorrelation.
Spatial autocorrelation is an essential element in SDM, associated with Tobler's law. It embodies the concept that "everything is related to everything else, but near things are more related than distant things". Spatial autocorrelation represents the significant relationship between the location of species and environmental variables. However, if spatial autocorrelation exists between the training and test data, the independence between the two data sets can be compromised. This significantly impacts the evaluation of the model's generalization ability.
One method to address this issue is the spatial block cross-validation technique, which involves dividing the data into training and testing datasets. This technique involves dividing the data into multiple blocks, using each block independently as training and test datasets to reduce the impact of spatial autocorrelation. This enhances the independence between datasets, allowing for a more accurate evaluation of the model's generalization ability.
The specific procedure is as follows:
1.- Creation of spatial blocks: Divide the entire dataset into spatial blocks of equal size (e.g., 50x50 km, matching the 50,000 m scale used below).
2.- Assignment of training and testing sets: Each spatial block is randomly assigned to either the training set (70%) or the test set (30%). This prevents the model from overfitting to data from specific areas and aims to achieve more generalized results.
3.- Iterative cross-validation: The entire process is repeated n times (e.g., 10 times). In each iteration, the blocks are randomly divided into training and test sets again, which is intended to improve the model's stability and reliability.
4.- Generation of pseudo-absence data: In each iteration, pseudo-absence data are randomly generated to evaluate the model's performance.
Scale = 50000
grid = watermask.reduceRegions(
collection=aoi.coveringGrid(scale=Scale, proj='EPSG:4326'),
reducer=ee.Reducer.mean()).filter(ee.Filter.neq('mean', None))
Map = geemap.Map(layout={'height':'400px', 'width':'800px'})
Map.addLayer(grid, {}, "Grid for spatial block cross validation")
Map.addLayer(outline, {'palette': 'FF0000'}, "Study Area")
Map.centerObject(aoi, 5)
Map
Now we can fit the model. Fitting a model involves learning the patterns in the data and adjusting the model's parameters (weights and biases) accordingly, which enables the model to make more accurate predictions when presented with new data. For this purpose, we have defined a function called sdm() to fit the model.
We will use the Random Forest algorithm.
def sdm(x):
seed = ee.Number(x)
# Random block division for training and validation
rand_blk = ee.FeatureCollection(grid).randomColumn(seed=seed).sort("random")
training_grid = rand_blk.filter(ee.Filter.lt("random", split)) # Grid for training
testing_grid = rand_blk.filter(ee.Filter.gte("random", split)) # Grid for testing
# Presence points
presence_points = ee.FeatureCollection(data)
presence_points = presence_points.map(lambda feature: feature.set("PresAbs", 1))
tr_presence_points = presence_points.filter(
ee.Filter.bounds(training_grid)
) # Presence points for training
te_presence_points = presence_points.filter(
ee.Filter.bounds(testing_grid)
) # Presence points for testing
# Pseudo-absence points for training
tr_pseudo_abs_points = area_for_pa.sample(
region=training_grid,
scale=grain_size,
numPixels=tr_presence_points.size().add(300),
seed=seed,
geometries=True,
)
# Same number of pseudo-absence points as presence points for training
tr_pseudo_abs_points = (
tr_pseudo_abs_points.randomColumn()
.sort("random")
.limit(ee.Number(tr_presence_points.size()))
)
tr_pseudo_abs_points = tr_pseudo_abs_points.map(lambda feature: feature.set("PresAbs", 0))
te_pseudo_abs_points = area_for_pa.sample(
region=testing_grid,
scale=grain_size,
numPixels=te_presence_points.size().add(100),
seed=seed,
geometries=True,
)
# Same number of pseudo-absence points as presence points for testing
te_pseudo_abs_points = (
te_pseudo_abs_points.randomColumn()
.sort("random")
.limit(ee.Number(te_presence_points.size()))
)
te_pseudo_abs_points = te_pseudo_abs_points.map(lambda feature: feature.set("PresAbs", 0))
# Merge training and pseudo-absence points
training_partition = tr_presence_points.merge(tr_pseudo_abs_points)
testing_partition = te_presence_points.merge(te_pseudo_abs_points)
# Extract predictor variable values at training points
train_pvals = predictors.sampleRegions(
collection=training_partition,
properties=["PresAbs"],
scale=grain_size,
geometries=True,
)
# Random Forest classifier
classifier = ee.Classifier.smileRandomForest(
numberOfTrees=500,
variablesPerSplit=None,
minLeafPopulation=10,
bagFraction=0.5,
maxNodes=None,
seed=seed,
)
# Presence probability: Habitat suitability map
classifier_pr = classifier.setOutputMode("PROBABILITY").train(
train_pvals, "PresAbs", bands
)
classified_img_pr = predictors.select(bands).classify(classifier_pr)
# Binary presence/absence map: Potential distribution map
classifier_bin = classifier.setOutputMode("CLASSIFICATION").train(
train_pvals, "PresAbs", bands
)
classified_img_bin = predictors.select(bands).classify(classifier_bin)
return [
classified_img_pr,
classified_img_bin,
training_partition,
testing_partition,
], classifier_pr
Spatial blocks are divided into 70% for model training and 30% for model testing, respectively. Pseudo-absence data are randomly generated within each training and testing set in every iteration. As a result, each execution yields different sets of presence and pseudo-absence data for model training and testing.
split = 0.7
numiter = 10
# Random Seed
#runif = lambda length: [random.randint(1, 1000) for _ in range(length)]
#items = runif(numiter)
# Fixed seed
items = [287, 288, 553, 226, 151, 255, 902, 267, 419, 538]
results_list = [] # Initialize SDM results list
importances_list = [] # Initialize variable importance list
for item in items:
result, trained = sdm(item)
# Accumulate SDM results into the list
results_list.extend(result)
# Accumulate variable importance into the list
importance = ee.Dictionary(trained.explain()).get('importance')
importances_list.extend(importance.getInfo().items())
# Flatten the SDM results list
results = ee.List(results_list).flatten()
Now we can visualize the habitat suitability map and potential distribution map for Microtus cabrerae. The habitat suitability map is created with the mean() function, which averages each pixel location across all images, and the potential distribution map is generated with the mode() function, which takes the most frequently occurring value at each pixel location across all images.
# Habitat suitability map
images = ee.List.sequence(
0, ee.Number(numiter).multiply(4).subtract(1), 4).map(
lambda x: results.get(x))
model_average = ee.ImageCollection.fromImages(images).mean()
Map = geemap.Map(layout={'height':'400px', 'width':'800px'}, basemap='Esri.WorldImagery')
vis_params = {
'min': 0,
'max': 1,
'palette': cm.palettes.viridis_r}
Map.addLayer(model_average, vis_params, 'Habitat suitability')
Map.add_colorbar(vis_params, label="Habitat suitability",
orientation="horizontal",
layer_name="Habitat suitability")
Map.addLayer(data, {'color':'red'}, 'Presence')
Map.centerObject(aoi, 5)
Map
# Potential distribution map
images2 = ee.List.sequence(1, ee.Number(numiter).multiply(4).subtract(1), 4).map(
lambda x: results.get(x)
)
distribution_map = ee.ImageCollection.fromImages(images2).mode()
Map = geemap.Map(
layout={"height": "400px", "width": "800px"}, basemap="Esri.WorldImagery"
)
vis_params = {"min": 0, "max": 1, "palette": ["white", "green"]}
Map.addLayer(distribution_map, vis_params, "Potential distribution")
Map.addLayer(data, {"color": "red"}, "Presence")
Map.add_colorbar(
vis_params,
label="Potential distribution",
discrete=True,
orientation="horizontal",
layer_name="Potential distribution",
)
Map.centerObject(data.geometry(), 5)
Map
6.- Variable importance and accuracy assessment
Random Forest (ee.Classifier.smileRandomForest) is one of the ensemble learning methods, which operates by constructing multiple decision trees to make predictions. Each decision tree independently learns from different subsets of the data, and their results are aggregated to enable more accurate and stable predictions.
Variable importance is a measure that evaluates the impact of each variable on the predictions within the Random Forest model. We will use the previously defined importances_list to calculate and print the average variable importance.
def plot_variable_importance(importances_list):
# Extract each variable importance value into a list
variables = [item[0] for item in importances_list]
importances = [item[1] for item in importances_list]
# Calculate the average importance for each variable
average_importances = {}
for variable in set(variables):
indices = [i for i, var in enumerate(variables) if var == variable]
average_importance = np.mean([importances[i] for i in indices])
average_importances[variable] = average_importance
# Sort the importances in descending order of importance
sorted_importances = sorted(average_importances.items(),
key=lambda x: x[1], reverse=False)
variables = [item[0] for item in sorted_importances]
avg_importances = [item[1] for item in sorted_importances]
# Adjust the graph size
plt.figure(figsize=(8, 4))
# Plot the average importance as a horizontal bar chart
plt.barh(variables, avg_importances)
plt.xlabel('Importance')
plt.ylabel('Variables')
plt.title('Average Variable Importance')
# Display values above the bars
for i, v in enumerate(avg_importances):
plt.text(v + 0.02, i, f"{v:.2f}", va='center')
# Adjust the x-axis range
plt.xlim(0, max(avg_importances) + 5) # Adjust to the desired range
plt.tight_layout()
plt.savefig('variable_importance.png')
plt.show()
plot_variable_importance(importances_list)
Using the Testing Datasets, we calculate AUC-ROC and AUC-PR for each run. Then, we compute the average AUC-ROC and AUC-PR over n iterations.
AUC-ROC represents the area under the curve of the 'Sensitivity (Recall) vs. 1-Specificity' graph, illustrating the relationship between sensitivity and specificity as the threshold changes. Specificity is based on all observed non-occurrences. Therefore, AUC-ROC encompasses all quadrants of the confusion matrix.
AUC-PR represents the area under the curve of the 'Precision vs. Recall (Sensitivity)' graph, showing the relationship between precision and recall as the threshold varies. Precision is based on all predicted occurrences. Hence, AUC-PR does not include the true negatives (TN).
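In the code below, both areas are computed with the trapezoidal rule over 25 threshold values. A toy NumPy illustration of that integration (the rate values here are made up for the example):
# Toy illustration of the trapezoidal integration used for AUC below
fpr = np.array([1.0, 0.6, 0.3, 0.1, 0.0])  # hypothetical false positive rates
tpr = np.array([1.0, 0.9, 0.8, 0.5, 0.0])  # hypothetical true positive rates
# Sum of trapezoid areas, mirroring x1.multiply(y1).multiply(0.5) in the GEE code
auc_roc = np.abs(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) * 0.5))
print(f"toy AUC-ROC = {auc_roc:.3f}")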
Note: It's important to ensure that each run has a sufficient number of points for model validation. The final number of points may vary due to the random partitioning of spatial blocks, so it's crucial to verify if there are enough presence and pseudo-absence points for model validation. In the case of endangered or rare species, there might be a shortage of occurrence data, leading to an insufficient test dataset. In such cases, alternatives may include additional data collection based on expert knowledge and experience or utilizing relevant alternative data sources.
def print_pres_abs_sizes(TestingDatasets, numiter):
# Check and print the sizes of presence and pseudo-absence coordinates
def get_pres_abs_size(x):
fc = ee.FeatureCollection(TestingDatasets.get(x))
presence_size = fc.filter(ee.Filter.eq("PresAbs", 1)).size()
pseudo_absence_size = fc.filter(ee.Filter.eq("PresAbs", 0)).size()
return ee.List([presence_size, pseudo_absence_size])
sizes_info = (
ee.List.sequence(0, ee.Number(numiter).subtract(1), 1)
.map(get_pres_abs_size)
.getInfo()
)
for i, sizes in enumerate(sizes_info):
presence_size = sizes[0]
pseudo_absence_size = sizes[1]
print(
f"Iteration {i + 1}: Presence Size = {presence_size}, Pseudo-absence Size = {pseudo_absence_size}"
)
# Extracting the Testing Datasets
testing_datasets = ee.List.sequence(
3, ee.Number(numiter).multiply(4).subtract(1), 4
).map(lambda x: results.get(x))
print_pres_abs_sizes(testing_datasets, numiter)
Iteration 1: Presence Size = 39, Pseudo-absence Size = 29
Iteration 2: Presence Size = 52, Pseudo-absence Size = 40
Iteration 3: Presence Size = 124, Pseudo-absence Size = 69
Iteration 4: Presence Size = 39, Pseudo-absence Size = 39
Iteration 5: Presence Size = 115, Pseudo-absence Size = 62
Iteration 6: Presence Size = 49, Pseudo-absence Size = 35
Iteration 7: Presence Size = 74, Pseudo-absence Size = 42
Iteration 8: Presence Size = 75, Pseudo-absence Size = 36
Iteration 9: Presence Size = 165, Pseudo-absence Size = 86
Iteration 10: Presence Size = 147, Pseudo-absence Size = 71
def get_acc(hsm, t_data, grain_size):
pr_prob_vals = hsm.sampleRegions(
collection=t_data, properties=["PresAbs"], scale=grain_size
)
seq = ee.List.sequence(start=0, end=1, count=25) # Divide 0 to 1 into 25 intervals
def calculate_metrics(cutoff):
# Each element of the seq list is passed as cutoff(threshold value)
# Observed present = TP + FN
pres = pr_prob_vals.filterMetadata("PresAbs", "equals", 1)
# TP (True Positive)
tp = ee.Number(
pres.filterMetadata("classification", "greater_than", cutoff).size()
)
# TPR (True Positive Rate) = Recall = Sensitivity = TP / (TP + FN) = TP / Observed present
tpr = tp.divide(pres.size())
# Observed absent = FP + TN
abs = pr_prob_vals.filterMetadata("PresAbs", "equals", 0)
# FN (False Negative)
fn = ee.Number(
pres.filterMetadata("classification", "less_than", cutoff).size()
)
# TNR (True Negative Rate) = Specificity = TN / (FP + TN) = TN / Observed absent
tn = ee.Number(abs.filterMetadata("classification", "less_than", cutoff).size())
tnr = tn.divide(abs.size())
# FP (False Positive)
fp = ee.Number(
abs.filterMetadata("classification", "greater_than", cutoff).size()
)
# FPR (False Positive Rate) = FP / (FP + TN) = FP / Observed absent
fpr = fp.divide(abs.size())
# Precision = TP / (TP + FP) = TP / Predicted present
precision = tp.divide(tp.add(fp))
# SUMSS = SUM of Sensitivity and Specificity
sumss = tpr.add(tnr)
return ee.Feature(
None,
{
"cutoff": cutoff,
"TP": tp,
"TN": tn,
"FP": fp,
"FN": fn,
"TPR": tpr,
"TNR": tnr,
"FPR": fpr,
"Precision": precision,
"SUMSS": sumss,
},
)
return ee.FeatureCollection(seq.map(calculate_metrics))
def calculate_and_print_auc_metrics(images, testing_datasets, grain_size, numiter):
# Calculate AUC-ROC and AUC-PR
def calculate_auc_metrics(x):
hsm = ee.Image(images.get(x))
t_data = ee.FeatureCollection(testing_datasets.get(x))
acc = get_acc(hsm, t_data, grain_size)
# Calculate AUC-ROC
x = ee.Array(acc.aggregate_array("FPR"))
y = ee.Array(acc.aggregate_array("TPR"))
x1 = x.slice(0, 1).subtract(x.slice(0, 0, -1))
y1 = y.slice(0, 1).add(y.slice(0, 0, -1))
auc_roc = x1.multiply(y1).multiply(0.5).reduce("sum", [0]).abs().toList().get(0)
# Calculate AUC-PR
x = ee.Array(acc.aggregate_array("TPR"))
y = ee.Array(acc.aggregate_array("Precision"))
x1 = x.slice(0, 1).subtract(x.slice(0, 0, -1))
y1 = y.slice(0, 1).add(y.slice(0, 0, -1))
auc_pr = x1.multiply(y1).multiply(0.5).reduce("sum", [0]).abs().toList().get(0)
return (auc_roc, auc_pr)
auc_metrics = (
ee.List.sequence(0, ee.Number(numiter).subtract(1), 1)
.map(calculate_auc_metrics)
.getInfo()
)
# Print AUC-ROC and AUC-PR for each iteration
df = pd.DataFrame(auc_metrics, columns=["AUC-ROC", "AUC-PR"])
df.index = [f"Iteration {i + 1}" for i in range(len(df))]
df.to_csv("auc_metrics.csv", index_label="Iteration")
print(df)
# Calculate mean and standard deviation of AUC-ROC and AUC-PR
mean_auc_roc, std_auc_roc = df["AUC-ROC"].mean(), df["AUC-ROC"].std()
mean_auc_pr, std_auc_pr = df["AUC-PR"].mean(), df["AUC-PR"].std()
print(f"Mean AUC-ROC = {mean_auc_roc:.4f} ± {std_auc_roc:.4f}")
print(f"Mean AUC-PR = {mean_auc_pr:.4f} ± {std_auc_pr:.4f}")
%%time
# Calculate AUC-ROC
calculate_and_print_auc_metrics(images, testing_datasets, grain_size, numiter)
HttpError: <HttpError 400 when requesting https://earthengine.googleapis.com/v1/projects/ee-fandian24bio/value:compute?prettyPrint=false&alt=json returned "User memory limit exceeded.". Details: "User memory limit exceeded.">

During handling of the above exception, another exception occurred:

EEException: User memory limit exceeded.
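The request exceeds Earth Engine's per-request memory limit because all ten iterations are evaluated in a single getInfo() call. One possible workaround (a sketch under that assumption, not a tested fix) is to evaluate one iteration at a time, or to pull the sampled probability/label pairs client-side and compute the metrics locally, for example with scikit-learn:
# Sketch: compute AUC-ROC / AUC-PR for one iteration client-side.
# Assumes scikit-learn is available and the sampled FeatureCollection is
# small enough to convert with geemap.ee_to_df.
from sklearn.metrics import roc_auc_score, average_precision_score

hsm = ee.Image(images.get(0))                           # habitat suitability map, iteration 1
t_data = ee.FeatureCollection(testing_datasets.get(0))  # testing points, iteration 1
sampled = hsm.sampleRegions(collection=t_data, properties=["PresAbs"], scale=grain_size)
sampled_df = geemap.ee_to_df(sampled)
auc_roc = roc_auc_score(sampled_df["PresAbs"], sampled_df["classification"])
auc_pr = average_precision_score(sampled_df["PresAbs"], sampled_df["classification"])  # step-wise estimate of AUC-PR
print(f"Iteration 1: AUC-ROC = {auc_roc:.4f}, AUC-PR = {auc_pr:.4f}")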