Methodology
Canopi’s predictions are built on decades of field observations from the US Forest Service, combined with soil, climate, and terrain data through gradient-boosted machine learning.
Training data
The foundation is the USFS Forest Inventory and Analysis (FIA) program, a nationwide network of permanent forest plots measured by field crews on rotating cycles. Each plot records individual tree measurements: species, size, condition, and, critically, whether the tree is alive or dead at each remeasurement.
FIA is the gold standard for US forest data. Plots are systematically distributed across all forest land, remeasured every 5-10 years, and maintained with consistent methodology spanning decades. For the Pacific Northwest, this provides thousands of observed survival and mortality events across the full range of species and site conditions.
Feature engineering
Raw FIA observations are enriched with four additional inputs, three environmental data sources and one planting-method feature:
SSURGO (Soil Survey) — Detailed soil properties at each site: organic matter content, available water capacity, clay percentage, and drainage class. These capture the below-ground conditions that determine root access to water and nutrients.
PRISM (Climate) — 30-year climate normals at 4km resolution: precipitation, maximum and mean temperature, dew point temperature, and vapor pressure deficit. These capture the atmospheric conditions that drive water stress.
SRTM (Terrain) — Elevation data from the Shuttle Radar Topography Mission. Elevation interacts with climate to determine temperature regimes, frost exposure, and growing season length.
Planting method — Modeled as a categorical feature distinguishing manual planting from drone seeding, allowing the model to learn method-specific survival patterns.
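As a rough illustration of the enrichment step, the sketch below merges one plot record with soil, climate, and terrain lookups into a single feature row. Every field name, lookup table, and value here is hypothetical. The text names the kinds of variables involved, not Canopi's actual schema or join logic.

```python
# Hypothetical sketch: field names, keys, and values are illustrative only,
# not Canopi's actual schema.

PLANTING_METHODS = ("manual", "drone")  # the two methods named in the text


def build_feature_row(plot, soil, climate, terrain, method):
    """Merge one plot record with its site-level environmental lookups."""
    if method not in PLANTING_METHODS:
        raise ValueError(f"unknown planting method: {method!r}")
    site = plot["site_id"]
    row = {"site_id": site, "species": plot["species"]}
    row.update(soil[site])      # SSURGO-style soil properties
    row.update(climate[site])   # PRISM-style climate normals
    row.update(terrain[site])   # SRTM-style elevation
    row["planting_method"] = method  # categorical feature
    return row


# Made-up example values for a single site.
soil = {"s1": {"organic_matter_pct": 3.2, "awc": 0.18,
               "clay_pct": 22.0, "drainage_class": "well"}}
climate = {"s1": {"precip_mm": 1450.0, "tmax_c": 16.1, "tmean_c": 10.4,
                  "dewpoint_c": 5.9, "vpd_hpa": 6.3}}
terrain = {"s1": {"elevation_m": 640.0}}
plot = {"site_id": "s1", "species": "example_species"}

row = build_feature_row(plot, soil, climate, terrain, "drone")
```

Keeping the planting method as an explicit categorical column means the same site and species can be scored under either method.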
Model architecture
The current model (Mycel v0.2) uses XGBoost (eXtreme Gradient Boosting), a gradient-boosted decision tree algorithm. XGBoost was chosen for three reasons: it handles mixed feature types (continuous soil values alongside a categorical method variable) naturally, it performs well on tabular ecological data without requiring the data volumes of deep learning, and it supports SHAP-based explainability, so every prediction can be decomposed into contributing factors.
The model was trained on observed survival/mortality events from FIA permanent plots, with 17 input features capturing soil, climate, terrain, tree condition, and planting method.
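To make the idea of gradient boosting concrete, here is a deliberately simplified sketch in pure Python. It is not XGBoost and not the Mycel model: it boosts depth-1 trees (stumps) on log loss using first-order gradients only, whereas XGBoost also uses second-order gradient information, regularization, and deeper trees. The two features and the toy data are assumptions chosen to show a continuous soil value next to a binary-encoded planting method.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the single split that best fits the residuals (least squares)."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            l_mean = sum(left) / len(left)
            r_mean = sum(right) / len(right)
            sse = (sum((r - l_mean) ** 2 for r in left)
                   + sum((r - r_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, l_mean, r_mean)
    if best is None:  # no feature varies, so fall back to a constant stump
        mean = sum(residuals) / len(residuals)
        return 0, 0.0, mean, mean
    return best[1:]


def fit_boosted_stumps(X, y, n_rounds=30, learning_rate=0.3):
    """Gradient boosting on log loss with depth-1 trees."""
    p = min(max(sum(y) / len(y), 1e-6), 1 - 1e-6)
    base = math.log(p / (1 - p))  # initial score: log-odds of the mean outcome
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log loss with respect to the score is y - p.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, l_val, r_val = fit_stump(X, residuals)
        l_step, r_step = learning_rate * l_val, learning_rate * r_val
        stumps.append((j, thr, l_step, r_step))
        scores = [s + (l_step if row[j] <= thr else r_step)
                  for s, row in zip(scores, X)]
    return base, stumps


def predict_survival(model, row):
    """Sum the base score and all stump outputs, then map to a probability."""
    base, stumps = model
    score = base + sum(l if row[j] <= thr else r for j, thr, l, r in stumps)
    return sigmoid(score)


# Toy data: [available_water_capacity, planting_method (0 = manual, 1 = drone)]
X = [[0.05, 0], [0.08, 1], [0.10, 0], [0.20, 1], [0.22, 0], [0.25, 1]]
y = [0, 0, 0, 1, 1, 1]  # 1 = survived, 0 = died
model = fit_boosted_stumps(X, y)
high = predict_survival(model, [0.24, 0])
low = predict_survival(model, [0.06, 1])
```

Each round fits a small tree to the errors left by the previous rounds, which is the core loop that libraries such as XGBoost implement with far more machinery.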
Prediction generation
Rather than running the model in real time for each API call, Canopi pre-computes predictions across a dense grid of site-species-method-horizon combinations. The grid spans 17,974 sites × 20 species × 2 methods × 3 horizons = 2,156,880 predictions, each with a full SHAP decomposition.
Pre-computation enables instant API response times and ensures that SHAP risk factors are always available without additional computation.
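The grid arithmetic and the lookup pattern can be sketched as follows. The grid dimensions come from the text above. The key layout, the horizon labels, and `dummy_predict` are placeholders rather than Canopi's actual storage format or model call.

```python
from itertools import product

# Grid dimensions stated in the text.
N_SITES, N_SPECIES = 17_974, 20
METHODS = ("manual", "drone")
HORIZONS = (1, 2, 3)  # placeholder labels; the text gives only the count

total = N_SITES * N_SPECIES * len(METHODS) * len(HORIZONS)
print(total)  # 2156880


def precompute(sites, species, predict):
    """Build a lookup table keyed by (site, species, method, horizon)."""
    return {
        key: predict(*key)
        for key in product(sites, species, METHODS, HORIZONS)
    }


def dummy_predict(site, species, method, horizon):
    # Stand-in for the real model call; returns a fixed placeholder value.
    return {"survival": 0.5, "shap": {}}


# Tiny demo grid: 2 sites x 1 species x 2 methods x 3 horizons = 12 entries.
table = precompute(["site_a", "site_b"], ["sp_1"], dummy_predict)
result = table[("site_a", "sp_1", "drone", 2)]  # constant-time lookup
```

Serving a request then reduces to a key lookup, which is why response time does not depend on model inference or SHAP computation.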
Explainability
Every prediction includes SHAP values for all 17 input features. SHAP decomposes the prediction into the additive contribution of each feature: how much each soil property, climate variable, and site characteristic pushed the survival probability up or down relative to the baseline.
The API surfaces the top 3 actionable risk factors from this decomposition, filtered to exclude structural features (species code, state code, horizon) that aren’t decision-relevant for the API consumer.
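A minimal sketch of the additive decomposition and the top-3 filter is shown below. It rests on several assumptions: contributions are expressed in log-odds and mapped to a probability with a sigmoid, "risk factor" means a feature with a negative contribution, and ranking is by how strongly each feature lowers survival. The feature names and values are invented for illustration.

```python
import math

# Structural features excluded from risk factors, per the text.
STRUCTURAL = {"species_code", "state_code", "horizon"}


def survival_probability(baseline, contributions):
    """Additivity: baseline plus the sum of per-feature contributions.

    Assumes contributions are in log-odds; a sigmoid maps the total
    to a probability.
    """
    margin = baseline + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-margin))


def top_risk_factors(contributions, k=3):
    """Return the k non-structural features that lower survival the most."""
    risks = [
        (name, value)
        for name, value in contributions.items()
        if name not in STRUCTURAL and value < 0
    ]
    risks.sort(key=lambda item: item[1])  # most negative first
    return risks[:k]


# Illustrative values only; feature names are hypothetical.
shap_values = {
    "vpd": -0.40,
    "clay_pct": -0.10,
    "awc": 0.25,
    "elevation": -0.05,
    "precip": -0.20,
    "species_code": -0.90,
    "horizon": -0.30,
}
factors = top_risk_factors(shap_values)
prob = survival_probability(1.0, shap_values)
```

Note that `species_code` has the largest negative contribution in this example but is filtered out, since a consumer cannot act on it the way they can on a site or soil condition.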