Visualising a year of training

Can we derive meaning from the random numbers?

Author

Nick Plummer

Published

February 22, 2026

I recently came across the amazing “Git Sweaty” dashboard, which reimagines your Strava data in the style of GitHub’s contribution graph. Mine shows an interesting (to me; almost certainly to nobody else) increase in intensity over the years, a slow ramping-up of strength work and swimming, and in particular how often I ride at either 5am (before work) or 7pm (after work), as well as highlighting Zwift racing on Tuesday evenings and TTT on Thursdays.

However, this visualisation got me wondering - does Garmin see it similarly? In particular, does Garmin’s “Training Status” match up with the amount of work I’m ostensibly doing?

We can actually pull data from Garmin fairly easily using the garmin-cli package, although training status and readiness are only available for the past year. So let’s pull that down and compare it against my Intervals.icu log of training workload over the same period:

Intervals plots acute load as “fatigue” (a 7-day exponentially weighted moving average) and chronic load as “fitness” (a 42-day EWMA). From these it derives “Form”, or training stress balance: chronic minus acute load, normalised by chronic load.
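Intervals doesn’t publish its exact implementation, but the standard Coggan-style model it’s based on looks roughly like this (the `daily_load` values here are made up for illustration):

```r
# Exponentially weighted moving average with time constant `tc` days
# (assumed smoothing: k = 1 - exp(-1/tc), starting from zero)
ewma_load <- function(load, tc) {
  k <- 1 - exp(-1 / tc)
  out <- numeric(length(load))
  for (i in seq_along(load)) {
    prev <- if (i == 1) 0 else out[i - 1]
    out[i] <- prev + k * (load[i] - prev)
  }
  out
}

daily_load <- c(100, 0, 80, 120, 0, 60, 90)  # hypothetical daily load

atl <- ewma_load(daily_load, 7)     # acute load: "fatigue"
ctl <- ewma_load(daily_load, 42)    # chronic load: "fitness"
form_pc <- 100 * (ctl - atl) / ctl  # "form" as a percentage of fitness
```

The short time constant reacts quickly to each session (fatigue), while the long one drifts slowly (fitness), so form goes sharply negative after big days.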

For me, it looked like this. You can see a couple of big events causing huge spikes in acute fatigue, and also how illness, house redevelopment, and so on had an impact on my overall consistency through the year:

Show code
data2 <- data %>% 
  select(Date, atl, ctl, form_pc) %>% 
  mutate(
    form_band = case_when(
      form_pc > 20  ~ ">20",
      form_pc > 5   ~ "20–5",
      form_pc > -10 ~ "5–-10",
      form_pc > -30 ~ "-10–-30",
      TRUE          ~ "<-30"
    )
  )

training_load <- data2 %>% 
  select(Date, atl, ctl) %>%
  pivot_longer(-Date, names_to = "Load") %>% 
  ggplot() +
    aes(x = Date, y = value, color = Load) +
    geom_line(linewidth = 0.7) +
    geom_point(size = 1.2, alpha = 0.8) +
    scale_color_manual(name = NULL,
                       labels = c("Fatigue", "Fitness"),
                       values = c(atl = "purple",
                                  ctl = "blue")) +
    theme_classic() +
    labs(y = "Training load")
  
training_form <- data2 %>% 
  mutate(form_band = factor(form_band,
    levels = c(">20", "20–5", "5–-10", "-10–-30", "<-30")
  )) %>% 
  ggplot() +
    aes(x = Date, y = form_pc) +
    geom_line(linewidth = 1.0, color = "grey") +
    geom_point(aes(color = form_band), size = 1.8) +
    scale_color_manual(name = NULL,
                       labels = c("Transition", 
                                  "Fresh", 
                                  "Grey zone", 
                                  "Optimal", 
                                  "High risk"),
                       values = c(">20"     = "yellow",
                                  "20–5"    = "blue",
                                  "5–-10"   = "grey60",
                                  "-10–-30" = "green",
                                  "<-30"    = "red")) +
  theme_classic() +
  labs(y = "Form %")

full_icu_plot <- training_load / training_form 

full_icu_plot[[1]] = full_icu_plot[[1]] +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank())

full_icu_plot + plot_layout(heights = c(2,1))

Garmin has a slightly different approach to measuring load, but it follows a similar pattern:

Show code
data %>% 
  select(Date, Acute, Chronic) %>% 
  pivot_longer(!Date, names_to = "Load") %>% 
  ggplot() +
  aes(x = Date, y = value, color = Load) +
  geom_point() +
  geom_line() +
  theme_classic() +
  scale_color_manual(values = c("Acute" = "limegreen", "Chronic" = "orange"))

Garmin calculates training readiness from this acute training load and your heart rate variability, as well as sleep and a few other metrics. Interestingly, mine doesn’t correlate particularly well with HRV:

Show code
tr_plot <- data %>% 
  select(Date, Score, Level) %>% 
  mutate(Level = factor(Level, 
                        levels = c("PRIME", "HIGH", "MODERATE", "LOW", "POOR"))) %>% 
  ggplot() +
  aes(x = Date, y = Score) +
    geom_line(linewidth = 1.0, color = "grey") +
    geom_point(aes(color = Level), size = 1.8) +
    scale_color_manual(name = NULL,
                       values = c("PRIME"    = "magenta",
                                  "HIGH"     = "cyan",
                                  "MODERATE" = "green",
                                  "LOW"      = "orange",
                                  "POOR"     = "red")) +
  theme_classic() +
  labs(y = "Training readiness")

hrv_plot <- data %>% 
  select(Date, HRV, Level) %>% 
  mutate(Level = factor(Level, 
                        levels = c("PRIME", "HIGH", "MODERATE", "LOW", "POOR"))) %>% 
  ggplot() +
  aes(x = Date, y = HRV) +
    geom_line(linewidth = 1.0, color = "grey") +
    geom_point(aes(color = Level), size = 1.8) +
    scale_color_manual(name = NULL,
                       values = c("PRIME"    = "magenta",
                                  "HIGH"     = "cyan",
                                  "MODERATE" = "green",
                                  "LOW"      = "orange",
                                  "POOR"     = "red")) +
  theme_classic() +
  labs(y = "HRV (ms)")

full_tr_plot <- tr_plot / hrv_plot 

full_tr_plot[[1]] = full_tr_plot[[1]] +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank())

full_tr_plot + plot_layout(heights = c(2,1))

Finally, Garmin estimates training status based on acute load, HRV, VO2max estimates, and any heat or altitude acclimatisation. We can plot this against load (which is clearest from Intervals):

Show code
status_levels = c("peaking", "productive", "maintaining", "recovery", "strained", "unproductive")

data3 <- data %>%
  mutate(
    Status = factor(Status, levels = status_levels),
    status_y = 10
  )

ts_plot <- data3 %>% 
  ggplot() + 
    aes(x = Date, y = status_y, color = Status) +
    geom_point() +
    scale_color_manual(
    values = c("peaking"      = "purple",
               "productive"   = "darkgreen",
               "maintaining"  = "yellow",
               "recovery"     = "lightblue",
               "strained"     = "magenta",
               "unproductive" = "orange"),
    name = NULL
  ) +
  theme_classic() +
  theme(
    axis.title = element_blank(), axis.text = element_blank(), 
    axis.line = element_blank(), axis.ticks = element_blank(),
    legend.position = "bottom"
    )

training_load_2 <- training_load + theme(legend.position = c(0.9, 0.9))
  
training_load_2 / ts_plot + plot_layout(heights = c(3, 1))

And this kind of makes sense. Periods with a lot of sustained work (for example early in both years during FRR and TdZ, in the run up to Dragon Ride in the middle of the year, and Hyrox Birmingham in October) are associated with productive status, while periods of sudden spikes in acute load are associated with a “strained” status (“Your performance ability is currently limited with inadequate recovery”, according to Garmin). I’m also surprised at how infrequently Garmin had me down as unproductive!

But can we do better than that in predicting what our Garmin will think?

Show code
d <- data %>%
  select(Date, Status, ctl, atl) %>%
  filter(!is.na(Status), !is.na(ctl), !is.na(atl)) %>%
  mutate(
    Status = factor(Status),
    ctl_z  = as.numeric(scale(ctl)),
    atl_z  = as.numeric(scale(atl))
  )

m0 <- multinom(Status ~ 1, data = d, trace = FALSE)
m1 <- multinom(Status ~ ctl_z + atl_z, data = d, trace = FALSE)

anova(m0, m1, test = "Chisq")
          Model Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
1             1      1840  1037.3530           NA       NA      NA
2 ctl_z + atl_z      1830   923.3229 1 vs 2    10 114.0301       0

This simple model suggests that training status is strongly related to Intervals’ calculated chronic and acute load (compared to m0, a model of no relationship). Sure, daily data is autocorrelated to some degree, which makes the test anti-conservative, but the likelihood-ratio statistic is large enough that the association holds up even so.
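The LR statistic in the table is just the drop in deviance between the two models, so we can reproduce it (and the vanishing p-value) directly from the printed numbers:

```r
# Deviance drop between m0 and m1, from the anova() output above
lr <- 1037.3530 - 923.3229  # LR statistic ~114 on 10 df

# 10 df = 2 predictors x 5 non-baseline classes
pchisq(lr, df = 10, lower.tail = FALSE)  # effectively zero
```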

But what does the model actually mean?

Show code
summary(m1)
Call:
multinom(formula = Status ~ ctl_z + atl_z, data = d, trace = FALSE)

Coefficients:
             (Intercept)      ctl_z      atl_z
peaking       -3.4440014  0.3332530  0.2358007
productive    -0.5025475 -0.4816212  1.2932699
recovery      -1.8615347  0.8212917 -2.0513923
strained      -1.8415342  0.0218270 -0.1190769
unproductive  -2.0179150  0.4154227  0.1193923

Std. Errors:
             (Intercept)     ctl_z     atl_z
peaking        0.4682206 0.5940979 0.7638025
productive     0.1381913 0.2073060 0.2547212
recovery       0.2552693 0.2281194 0.3599720
strained       0.2243976 0.3201270 0.4138134
unproductive   0.2411518 0.3058405 0.4007338

Residual Deviance: 923.3229 
AIC: 953.3229 

This is quite interesting, and mostly fits what Garmin says publicly: relative to the baseline status of “maintaining”, a high acute load pushes towards “productive” (atl_z coefficient of +1.29), while low acute load combined with higher chronic load pushes towards “recovery” (atl_z of -2.05, ctl_z of +0.82).

However it’s not entirely clear: “strained” and “unproductive” have small coefficients with standard errors of comparable size, so the model isn’t really separating them from “maintaining” at all.

We can explain these to some degree by looking at how often each status comes up:

Show code
table(d$Status)

 maintaining      peaking   productive     recovery     strained unproductive 
         155            5          111           52           25           21 
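From those counts, the accuracy of the naive “always guess the majority class” strategy follows directly:

```r
# Status counts transcribed from the table above
counts <- c(maintaining = 155, peaking = 5, productive = 111,
            recovery = 52, strained = 25, unproductive = 21)

max(counts) / sum(counts)  # ~0.42: always guessing "maintaining"
```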

Peaking and strained are so rare that the model is dominated by productive and maintaining days. However, the clearest signal is that high ATL means “productive” and low ATL means “recovery”, with CTL tweaking those odds. What happens if we add some non-linear terms?

Show code
m1q <- multinom(Status ~ ctl_z + atl_z + I(ctl_z^2) + I(atl_z^2) + ctl_z:atl_z,
                data = d, trace = FALSE)
anova(m1, m1q, test = "Chisq")
                                                  Model Resid. df Resid. Dev
1                                         ctl_z + atl_z      1830   923.3229
2 ctl_z + atl_z + I(ctl_z^2) + I(atl_z^2) + ctl_z:atl_z      1815   813.7269
    Test    Df LR stat.      Pr(Chi)
1           NA       NA           NA
2 1 vs 2    15  109.596 2.220446e-16

Adding the nonlinear terms dramatically improves how well CTL/ATL explain Garmin status, with a ~80-point improvement in AIC. That matches reality - the labels reflect combinations of fitness and fatigue, and thresholds, rather than straight lines.
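As a sanity check, the AIC gap follows from the printed deviances and the parameter counts (5 non-baseline classes, with 3 parameters per class for m1 and 6 for m1q):

```r
# AIC = deviance + 2 * (number of parameters)
aic_m1  <- 923.3229 + 2 * (5 * 3)  # matches the AIC printed by summary(m1)
aic_m1q <- 813.7269 + 2 * (5 * 6)

aic_m1 - aic_m1q  # ~80-point improvement
```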

But can we predict our training status? As this is a time series, random train/test splits would overstate performance, because days next to each other look similar. Instead we can use a walk-forward evaluation, with glmnet’s regularisation helping to cope with the imbalanced classes:

Show code
# Helper functions
multiclass_logloss <- function(y_true, p_mat) {
  eps <- 1e-15
  p <- pmin(pmax(p_mat, eps), 1 - eps)
  idx <- cbind(seq_along(y_true), match(y_true, colnames(p)))
  -mean(log(p[idx]))
}

pad_probs <- function(p, lev) {
  # p: matrix (n x k) with some subset of columns
  # lev: full set of levels
  missing <- setdiff(lev, colnames(p))
  if (length(missing) > 0) {
    pad <- matrix(0, nrow = nrow(p), ncol = length(missing),
                  dimnames = list(NULL, missing))
    p <- cbind(p, pad)
  }
  p[, lev, drop = FALSE]
}

# Walk-forward evaluator
eval_glmnet_walkforward <- function(df, features = c("linear","nonlinear"),
                                    k0 = 120, step = 14,
                                    alpha = 0.5, nfolds_target = 10,
                                    lambda_fallback = 0.01) {
  features <- match.arg(features)
  df <- df %>% arrange(Date) %>% mutate(Status = factor(Status))
  lev <- levels(df$Status)

  if (features == "linear") {
    X <- model.matrix(~ ctl_z + atl_z, df)[, -1, drop = FALSE]
  } else {
    X <- model.matrix(~ ctl_z + atl_z + I(ctl_z^2) + I(atl_z^2) + ctl_z:atl_z, df)[, -1, drop = FALSE]
  }
  y_all <- df$Status

  acc <- c(); ll <- c()

  for (i in seq(k0, nrow(df) - 1, by = step)) {
    idx_train <- 1:i
    idx_test  <- (i + 1):min(i + step, nrow(df))

    y_train <- droplevels(y_all[idx_train])
    X_train <- X[idx_train, , drop = FALSE]
    y_test  <- y_all[idx_test]
    X_test  <- X[idx_test, , drop = FALSE]

    # choose a safe nfolds
    n_train <- length(y_train)
    nfolds <- min(nfolds_target, max(3, floor(n_train / 25)))  # >=3 always

    # try CV, else fallback lambda
    lambda_use <- lambda_fallback
    cv_ok <- TRUE

    set.seed(1)
    cv_fit <- try(
      cv.glmnet(
        x = X_train, y = y_train,
        family = "multinomial",
        type.multinomial = "grouped",
        alpha = alpha,
        nfolds = nfolds,
        standardize = TRUE
      ),
      silent = TRUE
    )

    if (!inherits(cv_fit, "try-error")) {
      lambda_use <- cv_fit$lambda.1se
    } else {
      cv_ok <- FALSE
    }

    fit <- glmnet(
      x = X_train, y = y_train,
      family = "multinomial",
      type.multinomial = "grouped",
      alpha = alpha,
      lambda = lambda_use,
      standardize = TRUE
    )

    p_arr <- predict(fit, newx = X_test, type = "response")
    p <- as.matrix(p_arr[, , 1, drop = TRUE])
    colnames(p) <- dimnames(p_arr)[[2]]

    p <- pad_probs(p, lev)
    pred <- lev[max.col(p, ties.method = "first")]

    acc <- c(acc, mean(pred == y_test))
    ll  <- c(ll, multiclass_logloss(y_test, p))
  }

  list(mean_acc = mean(acc), mean_logloss = mean(ll), acc = acc, logloss = ll, levels = lev)
}

# Linear
res_lin <- eval_glmnet_walkforward(d, features = "linear",
                                   k0 = 120, step = 14, alpha = 0.5)

# Nonlinear (quadratic + interaction)
res_nl  <- eval_glmnet_walkforward(d, features = "nonlinear", 
                                   k0 = 120, step = 14, alpha = 0.5)

Our simple linear model has an accuracy of 0.399. This is worse than if it just guessed “maintaining” every time (0.420). The non-linear model is slightly better, with an accuracy of 0.418, but its log loss is markedly worse, meaning the nonlinear model makes more overconfident wrong predictions (poorly calibrated probabilities), even if its top class is slightly more often correct.

What can we take away from this? CTL/ATL are associated with training status (see the LR tests), but on their own they are not sufficient to predict it better than a naive baseline. This isn’t particularly surprising: Garmin uses other signals (VO2max trend, load focus, HRV/sleep/recovery, etc.), and labels like “unproductive” are often driven by performance trend rather than pure load.

We’ll have one last try. As plotted above, we have HRV data, and we can also add some awareness of change over time. Additionally, with only 5 “peaking” days in the data, that class will wreck probability calibration, so we’ll fold it into “productive” for now:

Show code
# Add time-derived features
d_feat <- data %>%
  arrange(Date) %>%
  
  # Normalise HRV/CTL/ATL
  mutate(
    hrv_z = as.numeric(scale(HRV)),
    ctl_z = as.numeric(scale(ctl)),
    atl_z = as.numeric(scale(atl)),

    # Ratio / form
    acwr = atl / pmax(ctl, 1e-6),

    # Rolling HRV features
    hrv_7d = slide_dbl(HRV, mean, .before = 6, .complete = TRUE),
    dhrv_7 = HRV - hrv_7d,

    # Load trend features (chronic 2w, acute 1w)
    dctl_14 = ctl - lag(ctl, 14),
    datl_7  = atl - lag(atl, 7)
  ) %>%
  filter(!is.na(hrv_7d), !is.na(dctl_14), !is.na(datl_7)) %>%
  mutate(Status = fct_collapse(Status,
                               productive = "peaking")) %>%
  mutate(Status = factor(Status))

eval_glmnet_walkforward2 <- function(df, k0 = 120, step = 14, alpha = 0.5, lambda_fallback = 0.01) {
  df <- df %>% arrange(Date) %>% mutate(Status = factor(Status))
  lev <- levels(df$Status)

  X <- model.matrix(
    ~ ctl_z + atl_z + acwr + hrv_z + dhrv_7 + dctl_14 + datl_7,
    df
  )[,-1, drop = FALSE]
  y_all <- df$Status

  multiclass_logloss <- function(y_true, p_mat) {
    eps <- 1e-15
    p <- pmin(pmax(p_mat, eps), 1 - eps)
    idx <- cbind(seq_along(y_true), match(y_true, colnames(p)))
    -mean(log(p[idx]))
  }

  pad_probs <- function(p, lev) {
    missing <- setdiff(lev, colnames(p))
    if (length(missing) > 0) {
      pad <- matrix(0, nrow = nrow(p), ncol = length(missing), dimnames = list(NULL, missing))
      p <- cbind(p, pad)
    }
    p[, lev, drop = FALSE]
  }

  acc <- c(); ll <- c()

  for (i in seq(k0, nrow(df) - 1, by = step)) {
    idx_train <- 1:i
    idx_test  <- (i + 1):min(i + step, nrow(df))

    y_train <- droplevels(y_all[idx_train])
    X_train <- X[idx_train, , drop = FALSE]
    y_test  <- y_all[idx_test]
    X_test  <- X[idx_test, , drop = FALSE]

    # CV can still be fragile with rare classes; keep it simple + safe
    nfolds <- 5
    set.seed(1)
    cv_fit <- try(
      cv.glmnet(X_train, y_train, family="multinomial", type.multinomial="grouped",
                alpha=alpha, nfolds=nfolds, standardize=TRUE),
      silent = TRUE
    )

    lambda_use <- if (!inherits(cv_fit, "try-error")) cv_fit$lambda.1se else lambda_fallback

    fit <- glmnet(X_train, y_train, family="multinomial", type.multinomial="grouped",
                  alpha=alpha, lambda=lambda_use, standardize=TRUE)

    p_arr <- predict(fit, newx = X_test, type = "response")
    p <- as.matrix(p_arr[,,1, drop = TRUE])
    colnames(p) <- dimnames(p_arr)[[2]]
    p <- pad_probs(p, lev)

    pred <- lev[max.col(p, ties.method = "first")]

    acc <- c(acc, mean(pred == y_test))
    ll  <- c(ll, multiclass_logloss(y_test, p))
  }

  list(mean_acc = mean(acc), mean_logloss = mean(ll))
}

res_hrv <- eval_glmnet_walkforward2(d_feat, k0=120, step=14, alpha=0.5)

This is much better - accuracy has improved to 0.54 and log loss has fallen to 1.615. Training status is not simply a load classifier; it is strongly influenced by recovery physiology and recent trends.

We can compare this to just guessing the most common status in the last 120 days:

Show code
baseline_walkforward <- function(df, k0=120, step=14) {
  df <- df %>% arrange(Date) %>% mutate(Status=factor(Status))
  acc <- c()
  for (i in seq(k0, nrow(df) - 1, by=step)) {
    train <- df$Status[1:i]
    test  <- df$Status[(i+1):min(i+step, nrow(df))]
    majority <- names(which.max(table(train)))
    acc <- c(acc, mean(as.character(test) == majority))
  }
  mean(acc)
}

baseline_walkforward(d_feat, 120, 14)
[1] 0.4228419

Our model is almost 12 percentage points better!

To see why that is, we can look at what we are and aren’t predicting well:

Show code
eval_glmnet_collect <- function(df, k0 = 120, step = 14, alpha = 0.5, lambda_fallback = 0.01) {
  df <- df %>% arrange(Date) %>% mutate(Status = factor(Status))
  lev <- levels(df$Status)

  X <- model.matrix(~ ctl_z + atl_z + acwr + hrv_z + dhrv_7 + dctl_14 + datl_7, df)[,-1, drop=FALSE]
  y_all <- df$Status

  pad_probs <- function(p, lev) {
    missing <- setdiff(lev, colnames(p))
    if (length(missing) > 0) {
      pad <- matrix(0, nrow=nrow(p), ncol=length(missing), dimnames=list(NULL, missing))
      p <- cbind(p, pad)
    }
    p[, lev, drop=FALSE]
  }

  preds <- character(0)
  truths <- character(0)

  for (i in seq(k0, nrow(df) - 1, by = step)) {
    idx_train <- 1:i
    idx_test  <- (i + 1):min(i + step, nrow(df))

    y_train <- droplevels(y_all[idx_train])
    X_train <- X[idx_train, , drop = FALSE]
    y_test  <- y_all[idx_test]
    X_test  <- X[idx_test, , drop = FALSE]

    set.seed(1)
    cv_fit <- try(
      cv.glmnet(X_train, y_train, family="multinomial", type.multinomial="grouped",
                alpha=alpha, nfolds=5, standardize=TRUE),
      silent = TRUE
    )
    lambda_use <- if (!inherits(cv_fit, "try-error")) cv_fit$lambda.1se else lambda_fallback

    fit <- glmnet(X_train, y_train, family="multinomial", type.multinomial="grouped",
                  alpha=alpha, lambda=lambda_use, standardize=TRUE)

    p_arr <- predict(fit, newx=X_test, type="response")
    p <- as.matrix(p_arr[,,1, drop=TRUE]); colnames(p) <- dimnames(p_arr)[[2]]
    p <- pad_probs(p, lev)

    pred <- lev[max.col(p, ties.method="first")]

    preds  <- c(preds, as.character(pred))
    truths <- c(truths, as.character(y_test))
  }

  list(
    truth = factor(truths, levels=lev),
    pred  = factor(preds,  levels=lev)
  )
}

out <- eval_glmnet_collect(d_feat, 120, 14, alpha=0.5)
table(truth = out$truth, pred = out$pred)
              pred
truth          maintaining productive recovery strained unproductive
  maintaining           70         34        4        3            0
  productive            12         35        0        0            0
  recovery              22          0       13        2            0
  strained               5          0        5        9            0
  unproductive           9          9        3        0            0

What can we take away from this? The model is mainly distinguishing maintaining vs productive (including peaking), and it manages some recovery/strained separation. However, it never predicts unproductive! As much as I’d like to say it’s because I’m never unproductive, it’s probably more to do with this being an imbalanced multi-class problem.

Productive is by far the strongest class separation we’ve got. However, 34 “maintaining” days were misclassified into the “productive” bucket, meaning the CTL/ATL/HRV features don’t capture Garmin’s “maintaining” logic well, and recovery often looks like maintaining to our model (likely because HRV suppression isn’t always distinct enough, or load is moderate rather than clearly low). Strained was misclassified as maintaining/recovery around 50% of the time.

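We can make this concrete by computing per-class recall from the confusion matrix (numbers transcribed from the table above):

```r
# Confusion matrix from the walk-forward evaluation above
cm <- matrix(
  c(70, 34,  4, 3, 0,
    12, 35,  0, 0, 0,
    22,  0, 13, 2, 0,
     5,  0,  5, 9, 0,
     9,  9,  3, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    truth = c("maintaining", "productive", "recovery", "strained", "unproductive"),
    pred  = c("maintaining", "productive", "recovery", "strained", "unproductive")
  )
)

# Recall: of the days Garmin labelled X, how many did we predict as X?
recall <- diag(cm) / rowSums(cm)
round(recall, 2)
```

Productive (~0.74) and maintaining (~0.63) lead, recovery and strained sit in the middle, and unproductive is at exactly zero.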
Let’s look at how well these classes are calibrated. The x-axis shows the model’s mean predicted probability for a set of days, the y-axis how often that class actually occurred, and the dashed diagonal line is a perfect predictor:

Show code
get_preds_df_glmnet <- function(df,
                                formula = ~ ctl_z + atl_z + acwr + hrv_z + dhrv_7 + dctl_14 + datl_7,
                                k0 = 120, step = 14,
                                alpha = 0.5,
                                nfolds = 5,
                                lambda_fallback = 0.01) {

  df <- df %>%
    arrange(Date) %>%
    mutate(Status = factor(Status))

  # Global class levels (fixed across folds)
  lev <- levels(df$Status)

  # Build model matrix once
  X <- model.matrix(formula, df)[, -1, drop = FALSE]
  y_all <- df$Status

  all_rows <- list()

  # Helper: pad missing classes
  pad_probs <- function(p, lev) {
    missing <- setdiff(lev, colnames(p))
    if (length(missing) > 0) {
      pad <- matrix(0, nrow = nrow(p), ncol = length(missing),
                    dimnames = list(NULL, missing))
      p <- cbind(p, pad)
    }
    p[, lev, drop = FALSE]
  }

  for (i in seq(k0, nrow(df) - 1, by = step)) {
    idx_train <- 1:i
    idx_test  <- (i + 1):min(i + step, nrow(df))

    X_train <- X[idx_train, , drop = FALSE]
    y_train <- droplevels(y_all[idx_train])

    X_test <- X[idx_test, , drop = FALSE]
    y_test <- y_all[idx_test]

    # CV (may fail early with rare classes) -> fallback lambda
    set.seed(1)
    cv_fit <- try(
      cv.glmnet(
        x = X_train, y = y_train,
        family = "multinomial",
        type.multinomial = "grouped",
        alpha = alpha,
        nfolds = nfolds,
        standardize = TRUE
      ),
      silent = TRUE
    )

    lambda_use <- if (!inherits(cv_fit, "try-error")) cv_fit$lambda.1se else lambda_fallback

    fit <- glmnet(
      x = X_train, y = y_train,
      family = "multinomial",
      type.multinomial = "grouped",
      alpha = alpha,
      lambda = lambda_use,
      standardize = TRUE
    )

    # Predict probabilities -> array [n, classes, 1]
    p_arr <- predict(fit, newx = X_test, type = "response")
    p <- as.matrix(p_arr[, , 1, drop = TRUE])
    colnames(p) <- dimnames(p_arr)[[2]]

    # Pad + reorder to global levels
    p <- pad_probs(p, lev)

    # Build fold df
    fold_df <- tibble(
      Date = df$Date[idx_test],
      truth = y_test
    )

    # Add prob columns named p_<class>
    for (cls in lev) {
      fold_df[[paste0("p_", cls)]] <- p[, cls]
    }

    all_rows[[length(all_rows) + 1]] <- fold_df
  }

  bind_rows(all_rows) %>%
    mutate(truth = factor(truth, levels = lev))
}

# Extract probabilities for each class
preds_df <- get_preds_df_glmnet(d_feat, k0 = 120, step = 14, alpha = 0.5)

# Now do the plot
classes <- levels(preds_df$truth)

calib <- preds_df %>%
  pivot_longer(starts_with("p_"), names_to="cls", values_to="p") %>%
  mutate(cls = sub("^p_", "", cls),
         y = as.integer(truth == cls)) %>%
  group_by(cls, bin = cut(p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) %>%
  summarise(p_hat = mean(p),
            obs = mean(y),
            n = dplyr::n(),
            .groups="drop")

ggplot(calib, aes(x = p_hat, y = obs)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_point(aes(size = n)) +
  geom_line() +
  facet_wrap(~ cls) +
  coord_equal(xlim = c(0,1), ylim = c(0,1)) +
  theme_classic() +
  labs(x = "Mean predicted probability", y = "Observed frequency", size = "n")

This shows that the model is particularly underconfident in classifying productive, especially at higher probabilities. Recovery is probably the best calibrated, with maintaining also pretty good. Strained is very jagged, with big swings reflecting unstable calibration estimates from too few samples, and as we know the model never predicts unproductive, despite this being Garmin’s favourite way to beat you down after a big event.

To improve predictive power, what else could we do? We could improve the target by merging rarer classes, or use class weights to address the imbalance (or track a different metric, like macro-F1 or per-class recall). Obviously we could dig out other features (e.g. Garmin’s sleep score, volatility or extremes of values, or different lags), or we could even try other models like gradient boosting or hidden Markov models to handle state persistence over time. But for a simple visualisation exercise this has already got out of hand!
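Of those, class weights would be the cheapest fix: glmnet accepts per-observation weights via its `weights` argument, so inverse-frequency weighting would stop the majority classes dominating the fit. A minimal sketch (the `status` labels here are hypothetical stand-ins for the real Status factor):

```r
# Hypothetical status labels, imbalanced like the real data
status <- factor(c("maintaining", "maintaining", "maintaining",
                   "productive", "productive", "recovery"))

# Inverse-frequency weights: rare classes count for more
freq <- table(status)
w <- 1 / as.numeric(freq[as.character(status)])
w <- w * length(w) / sum(w)  # normalise so the weights average to 1

# In the walk-forward loop these would be passed as
# `weights = w` to cv.glmnet() / glmnet()
```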