
May 2026

Rebuilding my Utah Hockey Club submission, 21 months later

A public post-mortem of my 2024 Utah Hockey Club Summer Analytics Challenge submission, rebuilt with two unseen seasons of ground truth.

“The model missed by 36 points. The methodology missed by more.”

The Prediction That Was Wrong

In August 2024 I shipped a model to the Utah Hockey Club Summer Analytics Challenge predicting how schedule density and travel would shape UHC's inaugural season. The headline prediction: a 61% win rate overall, 80% at home, 43% on the road. The writeup leaned on a reported 70% accuracy and a 0.75 ROC AUC. I felt pretty good about it.

I should not have.

UHC actually went 46.3% overall, 43.9% at home, 48.8% on the road. Read that again — they were BETTER on the road than at home. The exact opposite of the central claim. The home prediction missed by 36 points. Not 3.6. Thirty-six.

The model wasn't calibrated. It wasn't reproducible. It wasn't compared to a single honest baseline. And it was sitting in my portfolio, public.

Two choices. Quietly delete it. Or audit it in public and rebuild. This is the second one.

The Seven Failures

Seven things came out of the audit. None of them are hockey problems. They show up in any model that gets shipped without somebody checking it against data it hasn't seen — which, in my experience, is most of them.

  1. Trained on the wrong subset. The XGBoost was trained on 75 back-to-back games (1.9% of the corpus), then applied to the full 82-game schedule. The training set didn't look like the test set. Stop me when this sounds familiar.
  2. No baselines. The submission reported “70% accuracy.” Cool number. Doesn't mean anything until you put it next to “always pick the home team,” which gets ~54% all by itself.
  3. Confounded home advantage with travel direction. Eastward-travel losses were chalked up to time-zone fatigue. But most eastward travelers are also visitors. The model couldn't separate the two and didn't even know it was supposed to.
  4. No causal inference. Even though the challenge brief explicitly asked for it. Just skipped the part that was assigned.
  5. No calibration. No uncertainty. Point predictions, no confidence intervals, no reliability diagram. The 80% home win number was a single point estimate I couldn't have defended if anyone asked me to.
  6. Hardcoded Google Drive paths. Not reproducible by anyone but me, on the laptop I happened to be using.
  7. Generic recommendations. “Light therapy. Strategic napping.” Those aren't recommendations a model gets to make. Get the data wrong, the recs are wrong by definition.
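
Failure 3 is the easiest one to check by hand. Here is a minimal sketch of that check, not the code from the repo: split team-games by venue first, then compare eastward and non-eastward travel within each venue. The field names (`is_home`, `traveled_east`, `won`) and the toy rows are invented for illustration.

```python
from collections import defaultdict

def win_rate_by_stratum(games):
    """Win rate per (is_home, traveled_east) cell of team-game rows."""
    totals = defaultdict(lambda: [0, 0])  # key -> [wins, games played]
    for g in games:
        key = (g["is_home"], g["traveled_east"])
        totals[key][0] += g["won"]
        totals[key][1] += 1
    return {key: wins / n for key, (wins, n) in totals.items()}

# Toy rows, made up for this example (NOT from the real corpus).
toy_games = [
    {"is_home": False, "traveled_east": True, "won": 0},
    {"is_home": False, "traveled_east": True, "won": 1},
    {"is_home": False, "traveled_east": False, "won": 1},
    {"is_home": False, "traveled_east": False, "won": 0},
    {"is_home": True, "traveled_east": False, "won": 1},
    {"is_home": True, "traveled_east": False, "won": 1},
]

rates = win_rate_by_stratum(toy_games)
for (is_home, east), rate in sorted(rates.items()):
    print(f"home={is_home} east={east} win_rate={rate:.2f}")
```

In the toy data, visitors win at the same rate whether or not they traveled east, so the whole "eastward penalty" would be a venue effect. Stratifying like this is only a first look. It doesn't replace the causal analysis the brief asked for.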

The Rebuild

So I rebuilt it. Walk-forward CV instead of a random split. Multiple baselines on the same hold-out — home-always, train-set rate, strength differential, Elo. A reliability diagram for actual calibration. A panel-data feature pass that computes travel, rest, and time-zone debt per team-game, so the same code scores any season (not just UHC's first one).
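
For readers who haven't used walk-forward validation, here is a rough sketch of the split logic and the simplest baseline. It is an illustration under assumptions, not the repo's implementation: seasons are represented by their starting year, and `home_won` is a hypothetical 0/1 field on each game row.

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season) pairs.

    Each test season is scored by a model that has only seen
    strictly earlier seasons, so no future games leak into training.
    """
    ordered = sorted(seasons)
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]

def home_always_accuracy(games):
    """Accuracy of always picking the home team (equals the home-win rate)."""
    return sum(g["home_won"] for g in games) / len(games)

# Starting years used as season labels (an assumption of this sketch).
splits = list(walk_forward_splits([2024, 2022, 2023]))
for train, test in splits:
    print(f"train on {train}, test on {test}")

# Toy hold-out, invented for illustration.
toy_holdout = [{"home_won": 1}, {"home_won": 0}, {"home_won": 1}, {"home_won": 1}]
baseline = home_always_accuracy(toy_holdout)
```

Any model's accuracy on a hold-out season only means something next to `baseline` computed on that same season.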

Which model wins changes by season. On the 2024-25 hold-out (the season the original submission was about), a ridge logistic regression on the full feature set hit 59.5% accuracy, Brier 0.237, AUC 0.616. On the 2025-26 hold-out (the Utah Mammoth's first season), an ensemble of ridge LR and margin-of-victory Elo took the top spot at 53.9%, Brier 0.249, AUC 0.554.
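
If the Brier numbers look abstract, this sketch shows how the metric is computed, along with the binning behind a reliability diagram. It uses plain Python and the standard definitions. The function names and the equal-width binning are my choices for this example, not necessarily what the repo does.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def accuracy(probs, outcomes, threshold=0.5):
    """Share of games where the thresholded pick matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)

def reliability_bins(probs, outcomes, n_bins=5):
    """Rows of (mean predicted prob, observed win rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for cell in bins:
        if cell:
            mean_p = sum(p for p, _ in cell) / len(cell)
            win_rate = sum(y for _, y in cell) / len(cell)
            rows.append((mean_p, win_rate, len(cell)))
    return rows

# A model that says 50% for every game scores exactly 0.25.
coin_flip_brier = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

That 0.25 is the reference point for the numbers above: lower is better, and a Brier of 0.249 is barely ahead of predicting 50% every night. A calibrated model's reliability rows sit close to the diagonal, where mean predicted probability equals observed win rate.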

53.9% is not 59.5%. I know. But the model didn't get worse; the league got more random. The 2025-26 season had the highest OT/SO rate in six years (24.8%, and those games are roughly coin flips) and the lowest league home-win rate in six years (52.3%). The signal-to-noise ratio moved against me.
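
A back-of-the-envelope way to see what that OT/SO rate does to the ceiling. This is a sketch under a strong simplifying assumption: every OT/SO game is an exact 50/50 for any model. The function and parameter names are made up for the example.

```python
def required_regulation_accuracy(target_overall, coinflip_share, coinflip_accuracy=0.5):
    """Accuracy needed on regulation-decided games to hit target_overall.

    Solves: target = coinflip_share * coinflip_accuracy
                     + (1 - coinflip_share) * regulation_accuracy
    """
    decided_share = 1.0 - coinflip_share
    return (target_overall - coinflip_share * coinflip_accuracy) / decided_share

# 24.8% OT/SO share taken from the 2025-26 figure above.
needed = required_regulation_accuracy(target_overall=0.60, coinflip_share=0.248)
print(f"{needed:.3f}")
```

Under that assumption, hitting 60% overall in a season like 2025-26 would take roughly 63% accuracy on the games that end in regulation. The more games go to overtime, the higher that bar gets.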

The Real Ceiling

Here's the boring part. Hockey prediction has a low ceiling and high variance. I think a well-built model probably lands somewhere between 54% and 60% accuracy depending on how chalky the season was. I don't know that for certain — maybe somebody reading this has done better — but that's what fell out across three walk-forward hold-outs for me.

I think a lot of submissions skip this part. The part where you say the number out loud, instead of dressing it up. “Our model is 70% accurate” versus “our model adds 4–6 points over the home-team-always baseline, with these specific failure modes documented.” The second version is less impressive. It's also the only one that's true.

The Mammoth Bounce-Back

Quick aside before we close out. Utah rebranded from the Hockey Club to the Mammoth for 2025-26 and finished 43-33-6 — 92 points, a real bounce-back. I went to a playoff game against the Vegas Golden Knights in May. Hockey's new to me, still learning the rules, and I found myself yelling more than the friends next to me. Not really sure why.

Anyway, the Mammoth flipped the prior year's pattern: 53.7% at home, 51.2% on the road. The thing my model was most confidently wrong about, that Utah would be a home-ice juggernaut and a road-trip nightmare, eventually came true in direction only: a 2.5-point home edge, not the 37-point gap I predicted. And it arrived one season late, after a year in which UHC was actually the better road team.

Modeling is humbling. Maybe most things are.

Not Really About Hockey

This isn't really about hockey. The seven things I listed up there show up in pretty much every model I've worked on for a client — churn, defaults, lead scoring, conversion. Just usually nobody finds out until 21 months later, when the stakeholder comes back and says “hey, this number you gave me…”

Source & Reproducibility

Everything in this post is reproducible. The fetch script pulls seven seasons (8,510 games) from api-web.nhle.com. Three commands rebuild the corpus, recompute features, and re-score the walk-forward splits. The repo includes the full post-mortem report, the dashboard, and the JSON outputs of every model variant.

Next year's model will miss too. I just want to be wrong faster.

Work with Nektar

Got a model out there with no baselines, no calibration check, no walk-forward hold-out? That's the work we audit. Free 30-minute review — no slides.
