WetherVane

Model Accuracy

Predictions are only valuable if they can be verified. This page shows how WetherVane performs on elections it has never seen — the only test that matters.


Overall Performance

The headline metric is leave-one-out (LOO) r = 0.731. This is the correlation between predicted and actual county-level Democratic vote share shifts, measured on held-out counties that were excluded from their own type average before prediction.

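LOO is a stricter test than the standard holdout (r = 0.698) because it prevents any county from inflating the score by predicting itself. Self-prediction inflates the standard holdout figure by only about 0.013, which suggests that the type generalizations are stable. The two headline numbers are not directly comparable: the 0.033 difference between 0.731 and 0.698 mostly reflects the extra features used by the ensemble, not the validation scheme.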
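To make the self-prediction effect concrete, here is a minimal sketch on synthetic data. It assumes hard type assignments and made-up effect sizes (20 types, 8 counties each), none of which come from WetherVane's pipeline; the function names are hypothetical. It compares a type mean that includes the county being predicted with one that leaves it out.

```python
import numpy as np

def type_mean_predictions(shifts, types, leave_one_out):
    """Predict each county's shift from the mean shift of its type."""
    preds = np.empty_like(shifts, dtype=float)
    for t in np.unique(types):
        idx = np.where(types == t)[0]
        total, n = shifts[idx].sum(), len(idx)
        if leave_one_out:
            # Remove each county's own value before averaging (needs n >= 2).
            preds[idx] = (total - shifts[idx]) / (n - 1)
        else:
            # Self-inclusive mean: every county helps predict itself.
            preds[idx] = total / n
    return preds

def pearson_r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-in data (assumption): a type-level effect plus county noise.
rng = np.random.default_rng(0)
n_types, per_type = 20, 8
types = np.repeat(np.arange(n_types), per_type)
type_effect = rng.normal(0.0, 3.0, size=n_types)
shifts = type_effect[types] + rng.normal(0.0, 5.0, size=types.size)

r_self = pearson_r(type_mean_predictions(shifts, types, False), shifts)
r_loo = pearson_r(type_mean_predictions(shifts, types, True), shifts)
print(f"self-inclusive r = {r_self:.3f}, LOO r = {r_loo:.3f}")
```

With small types the gap between the two scores is large; it shrinks as the number of counties per type grows, since each county then contributes less to its own type average.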

LOO r (ensemble): 0.731 (Ridge + HGB, 43 pruned features)
Standard holdout r: 0.698 (inflated ~0.013 by self-prediction)
RMSE: 7.3 pp (percentage-point error)
Covariance validation r: 0.936 (observed Ledoit-Wolf)
Type coherence: 0.783 (within-type consistency)
Counties covered: 3,154 (all 50 states + DC)

Cross-Election Validation

The more demanding test is cross-election validation: train the type structure on elections up to year N, then predict shifts in year N+4. This measures whether the discovered community types are durable structural features of American politics — or just noise in one election cycle.

Across four presidential cycles, mean LOO r = 0.476 ± 0.10. The variance is real and interpretable: not all elections test the same thing.

2008→2012 (Obama→Obama): r = 0.45
2012→2016 (Obama→Trump): r = 0.64
2016→2020 (Trump→Biden): r = 0.42
2020→2024 (Biden→Trump): r = 0.40

Why 2024 was hardest (r = 0.40). The Biden-to-Trump cycle saw unusual cross-type movement, particularly among Hispanic communities, which broke sharply from their historical type patterns. When entire demographic groups shift in ways that cut across the type structure, the model's ability to predict from prior shifts degrades. This is a known limitation: the model captures structural patterns, not realignments in progress.

Why 2012→2016 was most predictable (r = 0.64). Trump's initial surge in 2016 followed existing type fault lines closely. Rural working-class counties that were already trending Republican continued their trajectory; college-educated suburban counties that were already competitive moved further toward Democrats. The type structure discovered from 2008–2012 data captured these patterns well.
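The cross-election procedure can be sketched roughly as follows. This is an illustration under loud assumptions: the data is synthetic, types are hard clusters found by a tiny k-means written for this example (the page does not say how types are discovered, and it describes soft membership scores), and `kmeans_labels` and `loo_type_mean` are hypothetical names. Types are learned from the earlier shifts only, then the next cycle's shift is predicted from the leave-one-out type mean.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def loo_type_mean(target, labels):
    """Predict each county from its type's mean target, excluding itself."""
    preds = np.empty_like(target, dtype=float)
    for t in np.unique(labels):
        idx = np.where(labels == t)[0]
        if len(idx) < 2:
            preds[idx] = target.mean()  # fallback for singleton types
        else:
            preds[idx] = (target[idx].sum() - target[idx]) / (len(idx) - 1)
    return preds

# Synthetic shifts (assumption): 4 latent types x 4 election cycles, in pp.
effects = np.array([[-6.0, -5.0, -7.0, -6.0],
                    [-2.0, -1.0, -2.0, -3.0],
                    [ 2.0,  3.0,  1.0,  2.0],
                    [ 6.0,  5.0,  7.0,  6.0]])
rng = np.random.default_rng(0)
true_type = np.repeat(np.arange(4), 100)
shifts = effects[true_type] + rng.normal(0.0, 1.0, size=(400, 4))

# Learn types from the first three cycles, then predict the fourth.
prior, target = shifts[:, :3], shifts[:, 3]
labels = kmeans_labels(prior, k=4)
r = float(np.corrcoef(loo_type_mean(target, labels), target)[0, 1])
print(f"cross-election LOO r = {r:.3f}")
```

The synthetic types here move together by construction, so the score is far higher than the real figures above. In real data the correlation drops whenever counties shift in ways the earlier type structure did not anticipate.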

How the Model Improved

The production model is not a single algorithm — it is the result of systematic improvement from a simple baseline. Each step added information while maintaining leave-one-out honesty.

Type-mean baseline: r = 0.448
Ridge regression (type scores only): r = 0.533
Ridge regression (all features): r = 0.671
Ridge + HGB ensemble (production): r = 0.731

Type-mean baseline: Predict each county from its type's average shift, excluding the county itself (LOO). This is the structural model alone — no demographics, no external data.

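Ridge (scores only): Use all 100 type membership scores as features in a Ridge regression. Because Ridge is linear, this captures how a county's mix of type memberships combines, rather than nonlinear interactions between types.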

Ridge (all features): Add 59 features from 8 independent sources: ACS demographics, religious congregations, BLS industry composition, County Health Rankings, IRS migration flows, Facebook Social Connectedness Index, urbanicity, and broadband access.

Ridge + HGB ensemble: Combine Ridge predictions with a histogram-based gradient boosting (HGB) model trained on the same features. The ensemble captures patterns that neither model finds alone.


What This Means

An LOO r of 0.731 means the model explains roughly 53% of the variance in county-level partisan shifts (r² ≈ 0.534). That sounds modest, but consider what is being predicted: the direction and magnitude of how each of 3,154 counties shifts relative to the prior election, using only information available before the election.
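The variance-explained figure follows directly from squaring the correlation. A short check of the arithmetic, using only the r value reported on this page:

```python
r = 0.731                      # LOO r reported for the production ensemble
r_squared = r ** 2             # share of variance explained
unexplained = 1.0 - r_squared  # share left unexplained

print(f"r^2 = {r_squared:.3f}")            # r^2 = 0.534
print(f"unexplained = {unexplained:.3f}")  # unexplained = 0.466
```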

The remaining 47% of variance is not captured by structural features: it reflects candidate effects, late-breaking news, local mobilization, and noise. A structural model that claimed to capture these would likely be overfitting. The model is designed to capture the part that is predictable: the structural landscape of which communities tend to move together.

The model performs best on counties whose type membership is concentrated — places that are clearly one type tend to behave predictably. It performs worst on counties that are at the boundary between types, and on cycles where entire demographic groups cross type boundaries (like Hispanic communities in 2024). These limitations are documented, not hidden.

The LOO ensemble (0.731) outperforms the standard holdout (0.698) because it uses 43 pruned features beyond type scores. The basic LOO type-mean baseline (0.448) is the honest structural-model-only metric — the ensemble improvement from 0.448 to 0.731 comes from demographic, economic, and social features. Both LOO and standard holdout are reported for full transparency.

