Leptocybe invasa case study

This page presents the results of a case study of Leptocybe invasa using the MaxEnt species distribution modelling technique. The plots show the historic relative occurrence probability (ROP), the projected future ROP, and the projected change between the two. An ROP of 1 indicates the most favourable set of environmental conditions for the species, and an ROP of 0 the least favourable. Details of the model settings are given in the description at the bottom of the page.

Location data

Historic relative occurrence probability

Future relative occurrence probability

Change in relative occurrence probability

Model assumptions and references

A. Assumptions

  1. Data Quality and Bias Assumptions
    1.1 Spatial Sampling Bias
    • Assumption: Occurrence records are spatially biased, with oversampling in easily accessible areas (roads, urban areas, research stations) and undersampling in remote regions.
    • Justification: This is a well-documented issue in biodiversity databases (GBIF, CABI) where sampling effort is not uniform across geographic space.
    • Mitigation: The weighted MaxEnt approach uses distance-based weighting to reduce the influence of clustered occurrence points, giving higher weight to isolated records that may represent true species presence in less-sampled areas.

    1.2 Temporal Sampling Bias
    • Assumption: Recent occurrence records are more reliable and representative of current species distribution than historical records, which may reflect outdated distribution patterns or misidentifications.
    • Justification: Species distributions change over time due to climate change, human activities, and range expansions. Older records may not reflect current suitable habitats.
    • Mitigation: The workflow supports temporal weighting, which gives higher weight to more recent occurrence records, although the current implementation relies primarily on distance-based weighting.
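
The page does not say which weighting function is used for record age, so the following is only a minimal sketch of one common choice, exponential decay. The function name `temporal_weights`, the `half_life` parameter, and the example years are all hypothetical and are not taken from the workflow's code.

```python
import math


def temporal_weights(years, reference_year, half_life=10.0):
    """Exponential-decay weights: a record `half_life` years old gets weight 0.5.

    Hypothetical helper for illustration; not part of the original workflow.
    """
    weights = []
    for year in years:
        # Records dated after the reference year are treated as age 0.
        age = max(reference_year - year, 0)
        weights.append(math.exp(-math.log(2) * age / half_life))
    return weights


# Hypothetical collection years for three occurrence records.
record_years = [2005, 2015, 2023]
weights = temporal_weights(record_years, reference_year=2025, half_life=10.0)
```

Weights of this kind could be multiplied with the distance-based weights before model fitting, so that a record is down-weighted if it is either old or part of a dense cluster.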

    1.3 Data Source Quality Differences
    • Assumption: Different data sources (GBIF, CABI, research publications, field collections) have varying levels of data quality, accuracy, and reliability.
    • Justification:
    – GBIF: Citizen science and museum records may have varying accuracy
    – CABI: Curated database with expert verification, generally higher quality
    – Research publications: Peer-reviewed data, typically high quality but may have limited geographic coverage
    – Field collections: Expert-collected data, high quality but potentially biased toward specific research objectives
    • Mitigation: Source-based weighting could be implemented, though the current code primarily tracks data sources for transparency rather than explicit weighting.

  2. Ecological and Environmental Assumptions
    2.1 Climate Stationarity
    • Assumption: For historical models, the relationship between species occurrence and bioclimatic variables remains constant over the time period of occurrence records (1981-2005 baseline).
    • Justification: This is a fundamental assumption in SDM, assuming that species-climate relationships are stable and can be extrapolated to predict future distributions.
    • Limitation: Climate change may alter species-climate relationships, and this assumption may not hold for rapidly changing environments.

    2.2 Environmental Variable Completeness
    • Assumption: The 19 bioclimatic variables (bio1-bio19), along with topography (SRTM) and NDVI, capture the key environmental factors limiting species distribution.
    • Justification: WorldClim bioclimatic variables are widely used in SDM and represent annual, seasonal, and extreme temperature and precipitation patterns that are ecologically relevant.
    • Limitation: Other factors such as biotic interactions, dispersal limitations, and human activities are not explicitly included in the model.

    2.3 Pseudo-Absence Representativeness
    • Assumption: Background points (pseudo-absences) represent areas where the species is truly absent or where environmental conditions are unsuitable.
    • Justification:
    – Random pseudo-absence: Assumes uniform sampling across the study area
    – Biased pseudo-absence: Assumes sampling bias mirrors accessibility bias in presence data
    – Biased-land-cover pseudo-absence: Assumes species is restricted to specific land cover types (e.g., forests for Eucalyptus pests)
    • Mitigation: Multiple pseudo-absence strategies are tested to assess sensitivity of model results.

  3. Model Assumptions
    3.1 MaxEnt Algorithm Assumptions
    • Assumption: Maximum Entropy principle is appropriate for modeling species distributions, assuming that the most uniform distribution consistent with environmental constraints best represents the true distribution.
    • Justification: MaxEnt is one of the most widely used and validated SDM algorithms, performing well across diverse taxa and geographic regions.
    • Configuration:
    – Beta multiplier: 1.5 (controls regularization strength, preventing overfitting)
    – Feature types: Linear, Hinge, Product (allows for complex response curves)
    – Transform: Logistic (outputs probability values 0-1)
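
For reference, the regularized maximum-entropy fit described by Phillips et al. (2006) can be written as below. The notation (features f_j, coefficients λ_j, regularization parameters β_j, and m presence records x_i drawn from the set of locations X) follows that paper rather than this workflow's code. How the beta multiplier of 1.5 enters is an assumption here: it is taken to scale every β_j uniformly, so larger values penalize non-zero coefficients more strongly.

```latex
q_\lambda(x) = \frac{\exp\!\left(\sum_j \lambda_j f_j(x)\right)}{Z_\lambda},
\qquad
Z_\lambda = \sum_{x' \in X} \exp\!\left(\sum_j \lambda_j f_j(x')\right)

\hat{\lambda} = \arg\min_{\lambda}
\left[ -\frac{1}{m} \sum_{i=1}^{m} \ln q_\lambda(x_i)
       + \sum_j \beta_j \,\lvert \lambda_j \rvert \right]
```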

    3.2 Sample Weighting Assumptions
    • Assumption: Distance-based weighting (n_neighbors=-1, i.e. all neighbors, for presence points; n_neighbors=1, i.e. the nearest neighbor only, for background points) effectively corrects for spatial clustering bias.
    • Justification: Clustered points in high-density areas receive lower weights, while isolated points receive higher weights, reducing the influence of oversampled regions.
    • Limitation: This assumes that spatial clustering is primarily due to sampling bias rather than true ecological clustering (e.g., suitable habitat patches).
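
The idea behind distance-based weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper name `neighbor_distance_weights`, the use of straight-line distance in degrees, the scaling of weights to a mean of 1, and the example coordinates are all hypothetical.

```python
import math


def neighbor_distance_weights(points, n_neighbors=-1):
    """Weight each point by its mean distance to its nearest neighbours.

    n_neighbors=-1 averages over all other points; a positive value uses only
    the k closest. Weights are scaled to average 1. Requires at least two
    points that are not all at the same location.
    """
    raw = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        k = len(dists) if n_neighbors < 0 else min(n_neighbors, len(dists))
        raw.append(sum(dists[:k]) / k)
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


# Three clustered records and one isolated record (hypothetical lon/lat pairs).
records = [(100.0, 15.0), (100.1, 15.0), (100.0, 15.1), (105.0, 20.0)]
all_neighbor_w = neighbor_distance_weights(records, n_neighbors=-1)
nearest_neighbor_w = neighbor_distance_weights(records, n_neighbors=1)
```

In both settings the isolated record receives the largest weight and the three clustered records share smaller ones, which is the behaviour the justification above describes.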

    3.3 Train/Test Split Assumptions
    • Assumption: Random 70:30 train/test split with 10 iterations provides robust model validation.
    • Justification:
    – 70:30 split is a standard practice in machine learning, providing sufficient training data while maintaining a reasonable test set size
    – Multiple iterations (10) account for variability in train/test splits and provide more stable performance estimates
    • Limitation: Random splitting assumes spatial independence, which may not hold for spatially autocorrelated species data; spatial cross-validation might be more appropriate for spatially structured data.
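
A repeated random 70:30 split can be sketched as follows. The function name `repeated_splits`, the fixed seed, and the record count are illustrative assumptions; the workflow's own splitting code is not shown on this page.

```python
import random


def repeated_splits(n_records, n_iterations=10, train_fraction=0.7, seed=0):
    """Return `n_iterations` independent random (train, test) index splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = round(n_records * train_fraction)
    splits = []
    for _ in range(n_iterations):
        indices = list(range(n_records))
        rng.shuffle(indices)
        splits.append((sorted(indices[:n_train]), sorted(indices[n_train:])))
    return splits


splits = repeated_splits(n_records=100)
```

Each split would be used to fit one model and score it on the held-out 30%, and the scores would then be summarized across the 10 iterations. Because indices are drawn without regard to location, nearby records can land on both sides of a split, which is the spatial-independence caveat noted above.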

  4. Climate Projection Assumptions
    4.1 Future Climate Scenarios
    • Assumption: RCP 8.5 (Representative Concentration Pathway) represents a plausible future climate scenario for projecting species distributions.
    • Justification: RCP 8.5 represents a high-emission scenario, useful for assessing worst-case climate change impacts on species distributions.
    • Limitation: This is a single scenario; multiple scenarios (RCP 4.5, RCP 2.6) would provide a more comprehensive assessment of uncertainty.

    4.2 Climate Model Ensemble
    • Assumption: The ensemble of 12 CORDEX regional climate models provides robust climate projections for Southeast Asia.
    • Justification: Model ensembles reduce uncertainty by averaging across multiple climate models with different assumptions and parameterizations.
    • Models used:
    – CNRM-CERFACS-CNRM-CM5_SMHI-RCA4
    – CNRM-CM5_ICTP-RegCM4-3
    – IPSL-IPSL-CM5A-LR_ICTP-RegCM4-3
    – MOHC-HadGEM2-ES_GERICS-REMO2015
    – MOHC-HadGEM2-ES_ICTP-RegCM4-7
    – MOHC-HadGEM2-ES_SMHI-RCA4
    – MPI-M-MPI-ESM-LR_GERICS-REMO2015
    – MPI-M-MPI-ESM-MR_ICTP-RegCM4-3
    – MPI-M-MPI-ESM-MR_ICTP-RegCM4-7
    – NCC-NorESM1-M_GERICS-REMO2015
    – NCC-NorESM1-M_ICTP-RegCM4-7
    – NOAA-GFDL-GFDL-ESM2M_ICTP-RegCM4-3
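
As a minimal sketch of ensemble averaging, the snippet below computes a per-cell mean and spread across model grids. It assumes equal model weights and grids that share one cell ordering; the function name `ensemble_summary` and the temperature values are hypothetical, and three short grids stand in for the 12 models.

```python
from statistics import mean, pstdev


def ensemble_summary(model_grids):
    """Per-cell mean and population standard deviation across model grids."""
    n_cells = len(model_grids[0])
    if any(len(grid) != n_cells for grid in model_grids):
        raise ValueError("all model grids must have the same number of cells")
    # zip(*grids) regroups the values so each tuple holds one cell across models.
    cell_values = list(zip(*model_grids))
    return [mean(v) for v in cell_values], [pstdev(v) for v in cell_values]


# Hypothetical annual mean temperature (°C) for three cells from three models.
grids = [
    [26.0, 27.5, 24.0],
    [27.0, 28.5, 24.0],
    [28.0, 29.0, 24.0],
]
ens_mean, ens_spread = ensemble_summary(grids)
```

The spread is worth keeping alongside the mean: cells where the models disagree strongly are cells where the projected ROP is least certain.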

    4.3 Temporal Extrapolation
    • Assumption: Species-climate relationships derived from historical data (1970-2005) remain valid for future climate projections (typically 2031-2050).
    • Justification: This is a fundamental assumption in climate change impact assessments using SDM.
    • Limitation: Species may adapt, evolve, or shift their fundamental niches over time, violating this assumption.

  5. Geographic and Spatial Assumptions
    5.1 Regional Boundaries
    • Assumption: Country-level boundaries accurately represent biogeographic regions and are appropriate for defining training and testing regions.
    • Justification: Administrative boundaries are practical for data collection and management, though they may not align with ecological boundaries.
    • Regions defined:
    – East Asia, Southeast Asia, Australia, Australasia, India-Sri Lanka, and global coverage

    5.2 Spatial Resolution
    • Assumption: 0.01° resolution (~1 km) is appropriate for capturing environmental heterogeneity relevant to species distributions.
    • Justification: This resolution balances computational efficiency with ecological relevance, capturing local environmental variation while remaining manageable for large-scale analyses.
    • Limitation: Fine-scale habitat features (<1 km) may be missed, and species with very localized distributions may require higher resolution.

  6. Species-Specific Assumptions
    6.1 Host Plant Association
    • Assumption: For Eucalyptus pests (Leptocybe invasa, Thaumastocoris peregrinus), species distribution is constrained by the distribution of Eucalyptus host plants.
    • Justification: These are specialist pests that require Eucalyptus as host plants.
    • Implementation: Land-cover-based pseudo-absence prioritizes forested areas, and Eucalyptus distribution data (Abbasi et al. 2023) is used when available.

    6.2 Equilibrium Assumption
    • Assumption: Species are in equilibrium with their current environment, meaning current distributions reflect environmental suitability rather than dispersal limitations or recent introductions.
    • Justification: This is a standard SDM assumption, though it may not hold for invasive species that are actively expanding their ranges.
    • Limitation: Invasive species like Leptocybe invasa and Thaumastocoris peregrinus may not be in equilibrium, as they are actively spreading to new areas.

B. References and Data Sources
Scientific Publications
Occurrence Data Sources

  1. Otieno et al. 2019
    – Species: Leptocybe invasa
    – Usage: Occurrence records for Leptocybe invasa distribution
    – File: Otieno_2019_L-invasa.csv
    – Note: Research publication providing occurrence coordinates

  2. Peng et al. 2021
    – Species: Leptocybe invasa
    – Usage: Occurrence records with elevation data
    – File: Peng_2021_L-invasa.csv
    – Note: Includes elevation information which can be used for data quality assessment

  3. Montemayor et al. 2015
    – Species: Thaumastocoris peregrinus
    – Usage: Occurrence records for bronze bug distribution
    – File: Montemayor_2015_Thaumastocoris_peregrinus_bronze_bug_transcribed_from_SUPP2.csv
    – Note: Supplementary data from research publication

  4. Abbasi et al. 2023
    – Title: “Global planted forest data for East Asia”
    – Journal: Scientific Data (Nature Portfolio)
    – DOI/URL: https://www.nature.com/articles/s41597-023-02383-w
    – Usage: Eucalyptus forest distribution data for East Asia region
    – Purpose: Provides host plant distribution to inform pseudo-absence generation and model interpretation
    – Note: Referenced in 01_specie-distribution.ipynb for loading Eucalyptus data

Data Repositories and Databases

  1. GBIF (Global Biodiversity Information Facility)
    • Website: https://www.gbif.org
    • Usage: Primary source of occurrence records from citizen science, museum collections, and research institutions
    • Species:
    – Leptocybe invasa: Gbif_L-invasa_0120814-230530130749713.csv
    – Thaumastocoris peregrinus: GBIF_T-peregrinus_0026124-240321170329656.csv
    • Data Quality: Variable, includes both verified and unverified records
    • Coverage: Global, with varying density across regions

  2. CABI (Centre for Agriculture and Bioscience International)
    • Website: https://www.cabi.org
    • Usage: Curated database of invasive species occurrences with expert verification
    • Species:
    – Leptocybe invasa: Cabi_2017_L-invasa-108923.csv
    – Thaumastocoris peregrinus: CABI_T-peregrinus.csv
    • Data Quality: Generally high, with expert curation and verification
    • Coverage: Focus on agriculturally and economically important species

  3. EPPO (European and Mediterranean Plant Protection Organization)
    • Usage: Presence data at state/country level (no coordinates)
    • File: Eppo_2010_L-invasa.csv
    • Note: Used for presence confirmation but not for precise spatial modeling due to lack of coordinates

Climate Data Sources

  1. CORDEX (Coordinated Regional Climate Downscaling Experiment)
    • Website: https://www.cordex.org
    • Usage: Regional climate model projections for Southeast Asia
    • Scenario: Historical and RCP 8.5 (high-emission scenario)
    • Period: Future projections (typically 2070-2100)
    • Models: 12 regional climate model combinations (GCM-RCM pairs)
    • Resolution: Downscaled to regional scale for Southeast Asia (0.1°)

Topographic Data

  1. SRTM (Shuttle Radar Topography Mission)
    • Source: NASA/USGS
    • Usage: Digital Elevation Model (DEM) for topographic variables
    • Resolution: 0.01° (~1 km, resampled to match bioclimatic data)
    • Variables: Elevation (can be extended to slope, aspect, etc.)

Vegetation Data

  1. NDVI (Normalized Difference Vegetation Index)
    • Usage: Vegetation index as proxy for habitat suitability
    • Source: Various satellite products (MODIS, Landsat, etc.)
    • Note: Optional variable, can be included or excluded based on ndvi1 parameter

Land Cover Data

  1. Forest Type/Land Cover Classification
    • Usage: For biased-land-cover pseudo-absence generation
    • Purpose: Prioritizes pseudo-absence points in forested areas (suitable habitat for Eucalyptus pests)
    • Classification:
    – Evergreen Needleleaf Forest: weight 0.8
    – Evergreen Broadleaf Forest: weight 1.0 (highest suitability)
    – Deciduous Needleleaf Forest: weight 0.2
    – Deciduous Broadleaf Forest: weight 0.4
    – Mixed Forest: weight 0.5
    – Unknown/Other: weight 0 (unsuitable)
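
One way to turn the weights above into biased pseudo-absence sampling is to draw background cells with probability proportional to their land-cover weight. The sketch below assumes exactly that; the function name `sample_background`, the cell representation, and sampling with replacement are hypothetical choices, not details confirmed by the workflow.

```python
import random

# Weights copied from the classification list above; anything else counts as 0.
LAND_COVER_WEIGHTS = {
    "Evergreen Needleleaf Forest": 0.8,
    "Evergreen Broadleaf Forest": 1.0,
    "Deciduous Needleleaf Forest": 0.2,
    "Deciduous Broadleaf Forest": 0.4,
    "Mixed Forest": 0.5,
}


def sample_background(cells, n_points, seed=0):
    """Draw cell ids with probability proportional to land-cover weight.

    `cells` is a list of (cell_id, land_cover_class) pairs.
    """
    weights = [LAND_COVER_WEIGHTS.get(cover, 0.0) for _, cover in cells]
    if sum(weights) == 0:
        raise ValueError("no cell has a non-zero land-cover weight")
    rng = random.Random(seed)
    # Sampling is with replacement, so a cell can be drawn more than once.
    return rng.choices([cell_id for cell_id, _ in cells], weights=weights, k=n_points)


# Hypothetical cells: id 2 is cropland and should never be selected.
cells = [
    (0, "Evergreen Broadleaf Forest"),
    (1, "Mixed Forest"),
    (2, "Cropland"),
    (3, "Deciduous Needleleaf Forest"),
]
background = sample_background(cells, n_points=200)
```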

Software and Libraries

  1. elapid
    • Purpose: Species Distribution Modeling library implementing MaxEnt
    • Usage: Core modeling functionality for weighted MaxEnt
    • Features:
    – MaxEnt implementation with sample weighting
    – Distance-based weighting functions
    – Raster annotation and prediction
    • Note: Python implementation of MaxEnt algorithm

  2. MaxEnt Algorithm
    • Original Reference: Phillips, S.J., Anderson, R.P., Schapire, R.E., 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190, 231-259.
    • Usage: Core algorithm for species distribution modeling
    • Implementation: Weighted version with sample weights for bias correction

C. Methodological notes

Weighted MaxEnt implementation
The weighted MaxEnt approach addresses several limitations of standard SDM:
    • Spatial bias correction: Distance-based weighting reduces influence of clustered points
    • Data quality integration: Allows incorporation of expert knowledge about data reliability
    • Multi-source data integration: Combines data from sources with different quality levels

Validation approach
    • Multiple iterations: 10 train/test splits provide robust performance estimates
    • Weighted metrics: ROC-AUC and PR-AUC calculated with and without sample weights
    • Permutation importance: Assesses variable importance while accounting for correlations
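
To show what a sample-weighted ROC-AUC measures, the sketch below computes it as the weighted share of (presence, background) pairs in which the presence record scores higher, counting ties as half. The function name `weighted_roc_auc` and the example labels, scores, and weights are illustrative assumptions; the pairwise loop is quadratic in the number of records and is meant for explanation rather than large datasets.

```python
def weighted_roc_auc(labels, scores, weights=None):
    """ROC-AUC as the weighted fraction of correctly ranked presence/background pairs."""
    if weights is None:
        weights = [1.0] * len(labels)
    pos = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 1]
    neg = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one background record")
    concordant = 0.0
    total = 0.0
    for s_pos, w_pos in pos:
        for s_neg, w_neg in neg:
            pair_weight = w_pos * w_neg
            total += pair_weight
            if s_pos > s_neg:
                concordant += pair_weight
            elif s_pos == s_neg:
                concordant += 0.5 * pair_weight  # ties count as half
    return concordant / total


# Hypothetical test set: 1 = presence, 0 = background.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
unweighted = weighted_roc_auc(labels, scores)
weighted = weighted_roc_auc(labels, scores, weights=[1.0, 3.0, 1.0, 1.0])
```

Here the second presence record is ranked below one background record; giving it three times the weight lowers the AUC from 0.75 to 0.625, which is why reporting the metric both with and without sample weights is informative.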

Climate projection methodology
    • Ensemble approach: Multiple climate models reduce projection uncertainty
    • RCP 8.5 scenario: High-emission scenario for assessing worst-case impacts
    • Temporal baseline: Historical period (1981-2005) used for model calibration

Limitations and caveats

  1. Equilibrium Assumption: Species may not be in equilibrium with current climate, especially for invasive species

  2. Dispersal Limitations: Model assumes unlimited dispersal, which may not reflect biological reality

  3. Biotic Interactions: Model does not explicitly account for species interactions, competition, or predation

  4. Human Activities: Land use change, management practices, and human-mediated dispersal are not explicitly modeled

  5. Climate Model Uncertainty: Future projections depend on climate model accuracy and scenario assumptions

  6. Spatial Autocorrelation: Random train/test splits may not account for spatial structure in species data

  7. Temporal Extrapolation: Assumes species-climate relationships remain constant over time

D. Recommended citations
When using this workflow, please cite:

  1. References:
    – Phillips, S.J., Anderson, R.P., Schapire, R.E., 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190, 231-259.
    – Anderson, C.B., 2023. elapid: Species distribution modeling tools for Python. Journal of Open Source Software, 8(84), p.4930.

  2. Occurrence Data Sources (as applicable):
    – Otieno et al. 2019 (for Leptocybe invasa)
    – Peng et al. 2021 (for Leptocybe invasa)
    – Montemayor et al. 2015 (for Thaumastocoris peregrinus)
    – Abbasi et al. 2023 (for Eucalyptus distribution data)

  3. Climate Data:
    – WorldClim: Fick, S.E. and Hijmans, R.J. 2017. WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37(12): 4302-4315.
    – CORDEX: Giorgi, F. et al. 2009. Addressing climate information needs at the regional level: the CORDEX framework. WMO Bulletin 58(3): 175-183.

  4. Data Repositories:
    – GBIF: https://www.gbif.org/citation
    – CABI: As per CABI citation guidelines