What do FTP and CP really mean?

The most misunderstood statistic in cycling

health
triathlon
cycling
physiology
zwift
Author

Nick Plummer

Published

December 29, 2025

My Zwift cycling team are running a coached critical power test workout on New Year’s Eve, which means that everyone is getting confused about what critical power really is. I’ve kind of touched on this before so I thought I’d try and explain it a bit more clearly…

What are you trying to measure, and why?

Both Functional Threshold Power (FTP) and Critical Power (CP) aim to give you a measure of how much power you can put out long term while still working “aerobically”.

FTP is a popular “field definition” popularised by the Coggan and Allen training style. It’s essentially the highest power you can sustain (in a quasi-steady state) for around 1 hour without fatiguing. It’s straightforward to measure (go as hard as you can for an hour, although we’ll talk about other ways of estimating this below), very “operational” (a decent target for an hour race, scaled if doing more or less), and repeatable, so it gives you an idea of how your aerobic engine is developing with training if done every few months.

However, it also isn’t particularly “physiological”. It’s well recognised that training can be divided into “zones” based on two physiological thresholds - lactate threshold one (LT1), where lactate starts to rise above baseline but still reaches a steady state; and lactate threshold two (LT2), where lactate accumulates and never stabilises.

Lactate thresholds graph borrowed from AlpineCols

Conceptually, LT1 divides the “easy” (Z1) and “hard” (Z2) domains - you’re still in steady-state because lactate doesn’t rise progressively, but you’re working so it’s above baseline. LT2 separates this from “severe” work (Z3), where lactate keeps rising because you’re doing an increasing amount of anaerobic work. However, most of us don’t have the ability to measure lactate - or even ventilatory equivalents - so we need a working measure that allows us to use this “zones” concept for our training.

That’s where these much used but also widely misunderstood numbers come in.

CP is a superficially similar but mechanistically very different metric to FTP. It comes from modelling the power–duration relationship in the severe-intensity domain: it’s often described as the highest power that can be sustained without progressively drawing down a finite work capacity above CP, known as W′ (“W prime”) or AWC (“anaerobic work capacity”).

The key idea is that above CP you have a finite “battery” of W′, and as such time-to-exhaustion depends on how far above CP you ride and how big W′ is. Therefore if you draw a graph of power vs duration, CP is the power at the point where the curve flattens out (theoretically the power you can sustain “forever”) and W’ the area under the curve above CP:

It’s important to remember that CP is a model parameter, not something you directly measure. The model can fit severe-domain data well, but it is not meant to actually claim you can ride at CP “forever”, despite this being what the model implies.
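To make the battery arithmetic concrete, here’s a minimal sketch in R (the CP and W′ values are assumptions for illustration, not measured ones):

```r
# Illustrative values only - assumed, not measured
cp_w     <- 280    # assumed CP (W)
wprime_j <- 15000  # assumed W' (J)

# Time to exhaustion above CP: t = W' / (P - CP)
tte_s <- function(p_w) wprime_j / (p_w - cp_w)

tte_s(320)  # 40 W over CP drains 15 kJ in 375 s
tte_s(400)  # 120 W over CP lasts only 125 s
```

Triple the margin above CP and the time-to-exhaustion falls to a third, which is why surges are so expensive.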

Bringing it back then to the zones model, CP is therefore explicitly defined as the boundary between hard and severe intensity domains. On the whole, it should be power at LT2 (but as we’ll explore, not all tests are equally accurate at detecting this precisely), while FTP is an operational construct of “threshold intensity”, aiming to be “near-ish” LT2.

So although the two are similar, they’re not entirely interchangeable, and tell us different things. CP is a mathematical model of a physiological paradigm; FTP is a training anchor for workout scaling, not a single physiological truth.

How do you measure FTP?

FTP is the easier of the two measures to understand, which might explain its popularity. The big problem, however, is that there are many different FTP “tests”, and each can legitimately give different FTP values depending on cyclist phenotype (sprinter vs diesel), freshness, pacing skill, heat, etc.

Power-duration based FTP tests

The original definition is based around a maximal steady-state effort for sixty minutes: ride as hard as you can for an hour, and the average power is your FTP. It’s simple and doesn’t rely on assumptions, but it requires pacing skill, motivation, and good conditions; it’s hard to repeat (not least due to the high fatigue cost); and it suffers from environmental drift (heat, hydration, terrain variability, etc.) that can bias results.

More commonly, you’ll see FTP estimated from 20 minute best efforts:

\[FTP_{20} = 0.95 \times P_{20}\]

This comes straight from the Allen/Coggan school, and is how ZwiftPower (among others) works. Being shorter, it’s both more repeatable and more achievable for the average casual rider, but it assumes a “typical” relationship between 20-min and 60-min sustainable power, which can be biased by rider phenotype (high-W′/punchy riders can over-test due to an inflated anaerobic contribution, while very diesel riders can under-test if they’re better at long steady output).

There’s also the classic 2x8-minute test, often used by coaches as a “lower-burden alternative”, where the average 8-minute power is scaled by 90% to reduce bias from shorter-duration anaerobic contribution:

\[FTP_{8} = 0.90 \times \frac{P_{8,1}+P_{8,2}}{2}\]

However this is even more phenotype-sensitive than 20-min tests, and much more influenced by W′, tactics, and the ability to go deep twice in a row.

We can compare these by pulling my mean-maximal power (MMP) curve from ZwiftPower for my last 90 days (because let’s be honest, I haven’t been doing any big efforts outside over winter):

library(tidyverse)
library(httr2)
library(cookiemonster)
library(jsonlite)
# Get power curve as JSON by spoofing required cookies
zp_response <- request('https://zwiftpower.com/api3.php?do=critical_power_profile&zwift_id=1487408') %>%
  req_options(cookie = get_cookies("zwiftpower.com", as = "string")) %>% 
  req_headers(Accept = "application/json") %>% 
  req_headers("Content-Type" = "application/json") %>% 
  req_perform() 

# Extract the data from the response
zp_raw <- zp_response %>% resp_body_raw()
html_content <- rawToChar(zp_raw)
json_data <- fromJSON(html_content)

# Extract 90d power curve
mmp <- as_tibble(json_data$efforts$`90days`) %>% 
  rename(y_90 = y) %>% 
  select(x, y_90)

# Plot it
mmp %>% ggplot() +
  aes(x = x, y = y_90) +
  geom_line() +
  scale_x_continuous(trans='log10') +
  xlab("Duration (s)") + 
  ylab("Power (W)") + 
  theme_bw() +
  theme(legend.position = "none")

From this we can pull the power I’ve put down over the different durations, and adjust them to estimate FTP:

(ftp_60 <- mmp %>% filter(x == 60*60) %>% pull(y_90))
[1] 266
(ftp_20 <- mmp %>% filter(x == 60*20) %>% pull(y_90)*.95)
[1] 272.65
(ftp_8  <- mmp %>% filter(x == 60*8)  %>% pull(y_90)*.90)
[1] 281.7

I haven’t done an all out 60 minute effort for a long time, so the “classical” FTP value is almost certainly far below where it should be (but with FRR on the horizon we’ll find out soon). The 20 and 8 minute efforts are closer to what I’d expect, but the difference between the two shows the impact of “punch” on the result, as almost all the racing I do on Zwift at the moment is short races where 5 to 10 minute efforts are the maximum we do.

Ramp tests

Now let’s be honest: 20- or even 60-minute all-out efforts are really tedious, hence why I haven’t really done any recently! So most indoor platforms instead offer ramp tests, where the required power constantly increases until you can go no further.

These actually estimate your MAP, or maximal aerobic power, from the highest 1-minute power you’re able to sustain, and then work back from this to estimate FTP. TrainerRoad’s equation explicitly defines this as:

\[FTP_{ramp} \approx 0.75 \times P_{1}\]

… with minor adjustments for target adherence, and community documentation suggests Zwift’s ramp test does the same thing.

Although this is very time-efficient, and almost eliminates pacing skill as a confounder (you just have to hang on), the big problem is that it introduces yet more assumptions. The 75% MAP-to-FTP scaling factor is an empirical population-average shortcut based on regressions of the relationship between 1-min maximal power and threshold. As such it isn’t very individual: ramp tests tend to overestimate FTP in riders with a large anaerobic contribution (strong 1–3 min powers, or large W′) and underestimate it in very fatigue-resistant riders who are better at 40–70 min efforts than at 1-min peak.

I haven’t done a ramp test for years, because as I hope you’ve seen they’re not great estimators of anything of use, but we can attempt to estimate my FTP from my 1-min maximum power:

(ftp_map <- mmp %>% filter(x == 60) %>% pull(y_90)*0.75)
[1] 324

I’m not the strongest sprinter, but I do tend to be at the punchier end (especially when not training for triathlon), so an FTP estimate way higher than the longer-interval-based ones makes sense… and again hopefully convinces you that the ramp test is a bit pointless.

“AI” FTP

Lots of analytics platforms are now offering “AI”, “automatic”, or “modelled” FTPs, and features derived from these (such as the time-to-exhaustion metrics in WKO).

The core principle behind these is that the system builds an athlete’s MMP curve from lots of rides, then fits a power–duration model to it. This is in fact almost always a CP model (though other proprietary curves and tweaks exist), from which they extract an FTP-like anchor value near the “threshold region” of the CP-based zones (because, as we’ve said before, cyclists are more familiar with FTP than with the concept of CP). I’ve mentioned before how “zFTP” is really just CP, for example.

The strength of these systems is that they continually update based on your real life efforts, with no single brutal test required, and they should get more accurate as you ride more as they incorporate many durations.

However, the quality and utility of the estimate depends heavily on whether you’ve actually produced maximal efforts across durations (as we’ll discuss below): garbage in, garbage out. If you never truly smash 5–20 min efforts (or even 40–70 min ones), the model will give you terrible estimates. The exact “FTP” it offers also becomes method-dependent, changing with the exact model chosen, any priors/weights built in, and the rules for filtering and handling outlier values.

Zwift also offers “The Grade”, a big hill which promises to estimate your FTP if you attack it hard enough. This is based on a big regression model that Zwift has of how long it takes people to climb it versus their FTP estimated via other means.

How do you estimate CP?

As we said, CP is more conceptual. You can’t just ride hard for a set time and read the number… as a minimum you need to ride hard then plug some numbers into an equation. This makes it less popular amongst casual cyclists as it’s slightly harder to interpret, even if it is arguably the more powerful metric.

Time-to-exhaustion tests

The classic way to estimate CP is a multi-trial constant-power time-to-exhaustion (TTE) test. You’ll do several (commonly 3–5) severe-intensity (i.e. above LT2) constant-power efforts to exhaustion, and plug these into the model.

The 2-parameter (or hyperbolic) CP model, as shown in the diagram at the top of the post, is most commonly expressed as:

\[(P - CP) t = W'\]

… or equivalently as \(P(t) = CP + \frac{W'}{t}\).

You can then fit this curve:

  • Work–time: since \(W = P \times t = W' + CP \times t\), fit \(W\) against \(t\); the slope is \(CP\) and the intercept \(W'\)
  • Power vs inverse time: fit \(P\) against \(\frac{1}{t}\); the intercept is \(CP\) and the slope \(W'\)
  • Directly fitting \(P(t) = CP + \frac{W'}{t}\) using non-linear least squares (other optimisers are available).
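As a quick sketch of the first (work–time) linearisation, using synthetic trials generated from assumed values of CP = 280 W and W′ = 15 kJ rather than real test data:

```r
# Synthetic TTE trials from assumed CP = 280 W, W' = 15 kJ
trials <- data.frame(t = c(180, 360, 600, 900))  # trial durations (s)
trials$P <- 280 + 15000 / trials$t               # model-generated power (W)
trials$W <- trials$P * trials$t                  # total work per trial (J)

# Work-time linearisation: W = W' + CP * t,
# so the slope is CP and the intercept is W'
fit_wt <- lm(W ~ t, data = trials)
coef(fit_wt)  # intercept ~ 15000 (W', J), slope ~ 280 (CP, W)
```

Because the synthetic data sit exactly on the model, the fit recovers the assumed values perfectly; real trials never do.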

The strength of this approach is that it yields very interpretable parameters (CP as the aerobic “ceiling”, and W′ as the finite work capacity above CP) with good predictive utility within the Z3 domain when trials are well chosen. However, the model is sensitive to trial selection (too short and the efforts become neuromuscular; too long and cardiac drift and other fatigue mechanisms aren’t captured), and it still requires truly maximal, well-paced efforts to exhaustion (which are hard!). There’s also an assumption that W′ is a fixed amount of work above CP (at least in the simple models) rather than a “balance” that can be refilled by work below CP.

Let’s try this using my MMP curve, assuming it reflects my best 3- and 12-minute efforts, the durations the team will be using in their own TTE test on Wednesday. Let’s start with the linear fit of \(P\) against \(\frac{1}{t}\):

tte_2 <- mmp %>% 
  filter(x == 60*3 | x == 60*12) %>% 
  mutate(inv_t = 1/x)

fit_2p_lm_2 <- lm(y_90 ~ inv_t, data = tte_2)

tibble(
  CP_W = unname(coef(fit_2p_lm_2)[["(Intercept)"]]),
  Wprime_kJ = unname(coef(fit_2p_lm_2)[["inv_t"]]) / 1000
)
# A tibble: 1 × 2
   CP_W Wprime_kJ
  <dbl>     <dbl>
1   289      14.4

This is much closer to the CP I’d expect from my own riding (and both the TrainerRoad and Zwift estimates) than some of the FTP models. Let’s see if it gets better with a couple more datapoints, assuming I’ve also done 8 and 15 minute efforts:

tte_4 <- mmp %>% 
  filter(x == 60*3 | x == 60*8 | x == 60*12 | x == 60*15) %>% 
  mutate(inv_t = 1/x)

fit_2p_lm_4 <- lm(y_90 ~ inv_t, data = tte_4)

tibble(
  CP_W = unname(coef(fit_2p_lm_4)[["(Intercept)"]]),
  Wprime_kJ = unname(coef(fit_2p_lm_4)[["inv_t"]]) / 1000
)
# A tibble: 1 × 2
   CP_W Wprime_kJ
  <dbl>     <dbl>
1  283.      15.4

However, heteroscedasticity is important in this model: time-to-exhaustion variance typically grows as trials get longer, which is why many statisticians recommend weighted least squares to address the non-constant error. We can also try the non-linear solution:

# Non-linear model
fit_2p_nls <- nls(
  y_90 ~ CP + Wp / x,
  data = tte_4,
  start = list(CP = 100,   # Starting guess - CP = 100W
               Wp = 1000), #                  W' = 1000J
  weights = 1/(x^2),
  algorithm = "port",
  lower = c(CP = 0, Wp = 0)
)

# Tidy up the results
est_2p <- broom::tidy(fit_2p_nls) %>%
  select(term, estimate) %>%
  pivot_wider(names_from = term, values_from = estimate)

tibble(
  CP_W = est_2p$CP,
  Wprime_kJ = est_2p$Wp / 1000
  )
# A tibble: 1 × 2
   CP_W Wprime_kJ
  <dbl>     <dbl>
1  283.      15.6

Very similar result, which is unsurprising as the inputs I’m using aren’t true TTE tests.

3-minute all out test

Vanhatalo et al proposed that end-test power (often last ~30 s) approximates CP, and the work done above that end power approximates W′, so you could just do a single session instead of repeated TTEs to estimate \(EP \approx CP\) and \(WEP \approx W′\) (work performed above EP).

This is very time-efficient, and it tracks CP in one brutal test session, but the “all-out” pacing must truly be all-out as any pacing error contaminates EP. It also makes it very sensitive to test setup (resistance, cadence constraints, erg mode vs fixed resistance) and is still a model proxy which may not match multi-trial CP estimates in every athlete.
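As a sketch of the arithmetic (the power trace here is an invented 180-second decay, not a real test):

```r
# Invented second-by-second power for a 180 s "all-out" effort:
# starts ~650 W and decays towards ~300 W
t_s <- 1:180
p_w <- 300 + 350 * exp(-t_s / 40)

# End-test power (EP): mean of the final 30 s
ep_w <- mean(tail(p_w, 30))

# W' proxy (WEP): work done above EP across the test (1 s samples)
wep_j <- sum(pmax(p_w - ep_w, 0))

round(c(EP_W = ep_w, WEP_kJ = wep_j / 1000), 1)
```

With real data you’d also want to sanity-check that the effort really was all-out: a paced start inflates EP and shrinks WEP.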

Derive CP from the power curve

As we mentioned above, these “AI FTP” models are just using maximal mean powers from real rides/races (e.g., best 3, 5, 8, 12, 20 min efforts), then fitting these to a CP model.

The algorithm uses the same 2-parameter model fits as above, but as it assumes that all data-points are “maximal” (rather than controlled tests) these algorithms need outlier filtering and may weight by duration or expected error.

Personally I think it’s a great approach, as it doesn’t disrupt race or workout schedules, using data that you already generate. It can also be very accurate if you actually have true best-effort points (i.e. people who race), but if you don’t have true maximal efforts at key durations it can produce biased CP and W′ estimates. Similarly, changes in terrain, drafting, stops, and sprint spikes can complicate “best effort” extraction.

Let’s have a look at what this approach gives for my MMP from 1 to 15 minutes. We’ll ignore the model weighting, as the error properties tend to be more uniform (and/or weird!) than for TTE tests:

mmp_cp <- mmp %>% filter(x >= 1*60 & x <= 15*60)

mmp_2p_nls <- nls(
  y_90 ~ CP + Wp / x,
  data = mmp_cp,
  start = list(CP = 100,   # Starting guess - CP = 100W
               Wp = 1000), #                  W' = 1000J
  algorithm = "port",
  lower = c(CP = 0, Wp = 0)
)

mmp_2p <- broom::tidy(mmp_2p_nls) %>%
  select(term, estimate) %>%
  pivot_wider(names_from = term, values_from = estimate)

tibble(
  CP_W = mmp_2p$CP,
  Wprime_kJ = mmp_2p$Wp / 1000
  )
# A tibble: 1 × 2
   CP_W Wprime_kJ
  <dbl>     <dbl>
1  299.      9.69

We can even see how this works graphically by flipping the time axis, with the dashed grey line showing the fitted CP and W′ against my values in black:

mmp_cp %>% ggplot() +
  aes(x = 1/x, y = y_90) +
  geom_point() +
  geom_abline(slope = mmp_2p$Wp, intercept = mmp_2p$CP, linetype = "dotdash", linewidth = 1.5, colour = "darkgrey") +
  xlab("1 / Duration (s)") + 
  ylab("Power (W)") + 
  theme_bw() +
  theme(legend.position = "none")

3-parameter model?

The graph shows that the 2-parameter model is heavily anchored by the longer efforts towards the bottom of my Z3, and doesn’t account so well for the shorter ones. To address this, Morton suggested extending the model to a 3-parameter form, adding a time-shift parameter that lets the curve bend at short durations so the model can capture increasingly short efforts:

\[(P-CP)(t-k) = AWC\]

… where \(k\) is an added parameter and \(AWC\) is analogous to \(W′\).

This results in a model that fits the curvature better when you include very short severe trials, and may reduce bias where 2-parameter models systematically misfit certain athlete profiles. The downside is increased parameter sensitivity: the model needs a better spread of trial durations and good data to avoid unstable fits (Morton explicitly noted the need for more selective power settings and more sophisticated fitting), and the different possible 3-parameter formulations can yield different CP and W′ results which are not always interchangeable. For my MMP curve, however, the values come out about the same:

fit_3p_nls <- minpack.lm::nlsLM(
  y_90 ~ CP + AWC/(x - k),
  data = mmp_cp,
  start = list(
    CP = 100, 
    AWC = 1000,
    k = 0),
  lower = c(CP = 0, AWC = 0, k = 0),
  upper = c(CP = Inf, AWC = Inf, k = min(mmp_cp$x) - 1)
)

coef(fit_3p_nls)
       CP       AWC         k 
 297.4938 9961.3954    0.0000 

It’s more interesting when we compare the two models with my real data - the 2-parameter model (grey) fits the longer duration efforts better, while the 3-parameter model (red) fits the shorter duration efforts, but over-estimates my longer effort performance:

pred <- tibble(x = seq(60*1, 60*30, by = 10)) %>%
  mutate(
    P_2p = predict(fit_2p_nls, newdata = .),
    P_3p = predict(fit_3p_nls, newdata = .)
  )

# Plot points + model curves
ggplot() +
  geom_point(data = mmp %>% filter(x >= 60 & x <= 60*30), aes(x, y_90)) +
  geom_line(data = pred, aes(x, P_2p), colour = "darkgrey") +
  geom_line(data = pred, aes(x, P_3p), colour = "darkred") +
  scale_x_continuous("Duration (s)") +
  scale_y_continuous("Power (W)") +
  theme_bw()

CP and time-to-exhaustion calculations

One advantage of the CP model is that you can then answer questions like “if I ride full gas at 360W, how long could I last?” by simply rearranging the model. Given \((P-CP)t = W'\), then:

\[t(P)=\frac{W'}{P-CP}\]

tte_from_power <- function(P_target, fit_2p, fit_3p) {
  
  c2 <- coef(fit_2p)
  c3 <- coef(fit_3p)

  CP2 <- unname(c2[["CP"]]); Wp <- unname(c2[["Wp"]])
  CP3 <- unname(c3[["CP"]]); AWC <- unname(c3[["AWC"]]); k <- unname(c3[["k"]])

  tibble(
    P_target_W = P_target,
    t_2p_s = if_else(P_target > CP2, Wp / (P_target - CP2), NA_real_),
    t_3p_s = if_else(P_target > CP3, k + AWC / (P_target - CP3), NA_real_)
  )
}

tte_from_power(P_target = 360, fit_2p = fit_2p_nls, fit_3p = fit_3p_nls)
# A tibble: 1 × 3
  P_target_W t_2p_s t_3p_s
       <dbl>  <dbl>  <dbl>
1        360   201.   159.

This fits with the graph above - the 2-parameter model thinks I could hold 360W for longer than the 3-parameter model.

Obviously, the corollary to this is also true - we can predict from the models the maximal power I should be able to hold for 3 and 12 minutes in the test on NYE:

predict_power_targets <- function(durations_s, fit_2p, fit_3p) {
  
  c2 <- coef(fit_2p)
  c3 <- coef(fit_3p)

  CP2 <- unname(c2[["CP"]]); Wp <- unname(c2[["Wp"]])
  CP3 <- unname(c3[["CP"]]); AWC <- unname(c3[["AWC"]]); k <- unname(c3[["k"]])

  tibble(t_s = durations_s) %>%
    mutate(
      P_2p = CP2 + Wp / t_s,
      P_3p = if_else(t_s > k, CP3 + AWC / (t_s - k), NA_real_)
    )
}

predict_power_targets(c(60*3, 60*12), fit_2p = fit_2p_nls, fit_3p = fit_3p_nls)
# A tibble: 2 × 3
    t_s  P_2p  P_3p
  <dbl> <dbl> <dbl>
1   180  369.  353.
2   720  304.  311.

Again, the 3-parameter model thinks I can go harder on the longer interval, and the 2-parameter on the shorter one.

It’s worth saying that the model gives the predicted maximal mean power for the given duration — i.e. the ceiling if you’re fresh and execute well. If instead you want a workout target (repeatable interval “prescription”), a common rule of thumb is to take something like 90–95% of predicted max (depends on the number of reps and rest between these), but that’s coaching preference rather than CP maths.
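As a trivial sketch of that rule of thumb (the 0.92 default is an arbitrary example inside the 90–95% range, not a recommendation, and the helper is hypothetical):

```r
# Hypothetical helper: scale a predicted maximal power to an interval target.
# fraction = 0.92 is an arbitrary example within the 90-95% range.
interval_target <- function(p_max_w, fraction = 0.92) {
  round(p_max_w * fraction)
}

interval_target(369)  # 3-min target from the 2-parameter prediction: 339 W
interval_target(304)  # 12-min target: 280 W
```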

What does all this mean for the average cyclist?

Don’t get me wrong, knowing your CP and/or FTP is useful. It’s a single, reasonably stable “anchor” power that allows you to set training intensity, pace hard efforts and events, and track your fitness over time. Even if the number isn’t perfect, it’s good enough to see trends if you test or estimate it consistently.

The problem comes when we treat either value as a “truth”. This is especially the case for FTP, because as we’ve seen it can be estimated any number of ways which for me can differ by over 20%, but as we’ve also shown even among CP models estimated CP/W′ can differ meaningfully depending on whether you use 2-parameter linearisations, nonlinear fits, or 3-parameter variants.

FTP is probably a good measure to use when your riding goals are steady and aerobic (think long climbs, TTs, triathlon, long sportives) as it is a practical proxy for “hard steady” performance in the 40–70 min range (depending on how you test) which is easy to apply. It’s probably also good for simple training zones and “does this workout feel right?” scaling, but the Coggan approach of setting zones as a percentage of FTP is almost certainly wrong for most people.

CP is a better choice if your riding is variable-intensity (racing, punchy group rides, MTB, Zwifting) or you want a model that also gives W′ (your finite work above CP) and can support questions like “how long can I hold this number of watts?” and “how many surges can I survive?”. CP is designed around the boundary between sustainable (heavy) and unsustainable (severe) exercise, and W′ adds explanatory power for efforts above that boundary. It’s also trivial to estimate if you already have good multi-duration best-effort data (or can do a few hard tests), and is built into most analytics packages including TR, Sauce for Strava, and even Zwift itself.

Remember, in many athletes the two values will be close; when they diverge, the gap is informative (e.g. a focus on “punch” or high W′ vs a very durable diesel rider) rather than a problem. It’s also worth bearing in mind that FTP is not a physiological threshold by definition. Although it might be a useful training construct, it’s subject to day-to-day variability (modified by heat, fatigue, fuelling, sleep, motivation, etc.) and doesn’t encode durability (two riders can share the same FTP, but one can hold close to it deep into a 4-hour ride and the other can’t).

CP isn’t perfect however. The classic CP/W′ model is most valid in the severe-intensity domain, which is roughly a few minutes to a few tens of minutes, depending on the athlete. It’s not designed to perfectly describe sprints (neuromuscular) or very long steady endurance (durability/thermal/fueling constraints) efforts, and is subject to highly stochastic changes with fatigue over hours.

W′ also isn’t perfectly “constant”. In reality, W′ depletion and recovery are messy and condition-dependent. W′-balance models help you work out what’s “left in the tank”, but they rely on yet more assumptions (reconstitution kinetics, etc.) that might not hold for the individual - most of us have seen the Sauce “Wbal” end up negative on big climbs, for example.
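For a feel of how these balance models behave, here’s a minimal discrete sketch of one published differential form (often attributed to Skiba and colleagues); the CP and W′ values are assumed, and real implementations like Sauce’s differ in their reconstitution kinetics:

```r
# Minimal W'bal sketch at 1 s steps, assuming CP = 280 W and W' = 15 kJ.
# Above CP we drain the tank; below CP it refills towards full,
# faster the emptier it is and the further below CP we ride.
wbal_trace <- function(power_w, cp_w = 280, wprime_j = 15000) {
  wbal <- numeric(length(power_w))
  w <- wprime_j
  for (i in seq_along(power_w)) {
    p <- power_w[i]
    if (p > cp_w) {
      w <- w - (p - cp_w)                              # depletion (J/s)
    } else {
      w <- w + (wprime_j - w) * (cp_w - p) / wprime_j  # recovery
    }
    wbal[i] <- w
  }
  wbal
}

# 2 minutes at 380 W drains 100 J/s: 15 kJ down to 3 kJ
surge <- wbal_trace(rep(380, 120))
tail(surge, 1)  # 3000
```

Surge again before the tank refills and you start from a lower balance, which is exactly the dynamic that makes repeated attacks so decisive in racing.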

Personally I find CP to be the more useful metric for setting workouts in training and measuring my performance during races (and especially using W’balance… and W’balance of those around me when racing on Zwift, which can be derived from ZwiftRacing). But I’ll still set my goals during triathlon in terms of FTP… just cognizant that the “FTP” I’m using is in reality “CP + a fudge factor” from TrainerRoad.