In Order to Run the Fastest Marathon in My Life, I’ll Use Data

I wanted some data-driven perspective on how to run a fast marathon. MarathonGuide.com does an amazing job of compiling race results from marathons around the world (though it is mostly a database of US marathons: around 85% of results come from a US state). Some back-of-the-envelope estimates told me that if I wanted to look at all race results from 2000 to 2021, I’d be trying to tackle upwards of 12M records (maybe more). Feasible? Sure. Overkill? Almost certainly. So, instead, I analyzed the Top 100 Finishers of all races in the years 2000, 2005, 2010, 2015, and 2021. Why not 2020? Relatively few races took place in 2020 😢.

The number of marathon finishers grew dramatically from 2000 to 2015*. But from 2015 to 2021, races dropped almost 36%. Even in 2021 as races started coming back online, it’s clear that the COVID-19 pandemic was still having an impact on race participation.

Either despite that interruption, or in light of it, there’s some great wisdom to be gained from this data if you want to run a faster marathon. 

An important note: I’ve filtered the dataset to only male finishers since half the reason for this analysis is to inform myself about optimizing my own performance. But, I did put the original dataset on Kaggle so, ladies, you can recreate the results if you wish.


## Is 40 Too Old to Run Super Fast?

It should be no surprise that the older we get, the slower we run. However, that deterioration is gradual. Whether we fit a regression line or look at median finishing times by age, the rise in times through your 20s and 30s is slow. When you hit 40, the slope starts to tick up. Does that mean it’s impossible to run super fast when you hit 40? Let’s find out.
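To make the "slope ticks up at 40" idea concrete, here is a minimal Python sketch (not the original rmarkdown). It computes median finishing time by age, then fits a separate least-squares slope below and above 40. The data is made up purely for illustration, and the split at exactly 40 is an assumption.

```python
from collections import defaultdict
from statistics import median


def median_by_age(results):
    """Return {age: median finishing time in seconds} for (age, seconds) pairs."""
    by_age = defaultdict(list)
    for age, seconds in results:
        by_age[age].append(seconds)
    return {age: median(times) for age, times in sorted(by_age.items())}


def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Made-up results: (age, finishing time in seconds). Not real race data.
toy_results = [
    (25, 10200), (25, 10400), (30, 10300), (30, 10500),
    (35, 10400), (35, 10600), (40, 10700), (40, 10900),
    (45, 11200), (45, 11400), (50, 11800), (50, 12000),
]

medians = median_by_age(toy_results)
under_40 = [(age, t) for age, t in medians.items() if age < 40]
from_40 = [(age, t) for age, t in medians.items() if age >= 40]

# Seconds of slowdown per year of age, before and after 40.
slope_under_40 = ols_slope(under_40)
slope_from_40 = ols_slope(from_40)
print(f"under 40: {slope_under_40:.0f} s/year, 40 and over: {slope_from_40:.0f} s/year")
```

A steeper second slope is what "the slope starts to tick up" would look like in numbers.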

For starters, what is super fast?

In this analysis, I use 2:20 or faster (the blue line in the chart). For reference, the qualifying standard to participate in the US Olympic Trials is 2:18. At any age, the percentage of runners who achieve this finishing time is small. In fact, for the entire dataset the percentage of results with a finishing time of 2:20 or faster is 0.65%. And, looking at the chart, most of those fall between the ages of 24 and 38. What about at 38+? The percentage of results for 38+ with a finishing time of 2:20 or faster is 0.04%. Yup, basically impossible.
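Those percentages are plain shares of results at or under a cutoff. Here is a minimal Python sketch of that calculation, assuming each result is a record with hypothetical `age` and `seconds` fields; the five toy rows are invented and will not reproduce the 0.65% or 0.04% figures.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def share_at_or_under(results, cutoff_seconds, min_age=None):
    """Fraction of results at or under the cutoff, optionally only for age >= min_age."""
    pool = [r for r in results if min_age is None or r["age"] >= min_age]
    if not pool:
        return 0.0
    fast = sum(1 for r in pool if r["seconds"] <= cutoff_seconds)
    return fast / len(pool)


CUTOFF = to_seconds("2:20:00")

# Invented rows for illustration only.
toy = [
    {"age": 27, "seconds": to_seconds("2:18:30")},
    {"age": 31, "seconds": to_seconds("2:45:10")},
    {"age": 39, "seconds": to_seconds("2:19:59")},
    {"age": 42, "seconds": to_seconds("3:05:00")},
    {"age": 45, "seconds": to_seconds("2:52:00")},
]

overall = share_at_or_under(toy, CUTOFF)              # all ages
masters = share_at_or_under(toy, CUTOFF, min_age=38)  # 38 and older
print(f"overall: {overall:.2%}, 38+: {masters:.2%}")
```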

But wait! David Spiegelhalter, on this episode of Learning Bayesian Statistics, lays out a nice description for why there is no such thing as probability. Probability is based on comparing like to like. This is an apt aphorism to remember – especially in this case where each runner has a unique running history. The probability of 0.04% only holds true if every single 38+ year old has an identical running history and physiological makeup. For example, if Individual A decides to run their first marathon at age 40… they most probably won’t be running 2:20. If Individual B has been running 2:2x marathon times for the past 10 years, their likelihood of running 2:20 is significantly better. Effectively, the prior is attached to the individual… or at least a cluster of similar individuals. Can we find those clusters in the data?

We can try 🙂

Here are the steps:

  1. Find the Comparable Subset. Find the subset of runners in the dataset who have multiple race results (if they have run only one race, there is no history). It looks like 15% of the results come from individuals with multiple results. The caveat is that if people use different versions of their name, I won’t catch it.
  2. Categorize Speediness. Split that subset into “History Faster” and “History Slower”. I put individuals with average finishing times under 2:41 in the former group and everyone else in the latter.
  3. Filter for Comparable Cluster. Keep only race results that A) belong to History Faster individuals and B) were run at age 38 or older. This gives us a dataset of 597 race results across 385 individuals.
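The three steps above can be sketched in Python as follows. This is a rough illustration, not the original rmarkdown: the field names (`name`, `age`, `seconds`) are hypothetical, runners are matched on exact name (which has the same name-variant blind spot noted in step 1), and the average is assumed to be taken over all of a runner's results.

```python
from collections import defaultdict

FAST_HISTORY_CUTOFF = 2 * 3600 + 41 * 60  # 2:41:00 average, from step 2
MIN_AGE = 38                              # from step 3


def fast_history_results(results):
    """Return results run at 38+ by runners with multiple results and a fast average."""
    by_runner = defaultdict(list)
    for r in results:
        by_runner[r["name"]].append(r)

    selected = []
    for races in by_runner.values():
        if len(races) < 2:
            continue  # step 1: a single result means no history
        average = sum(r["seconds"] for r in races) / len(races)
        if average >= FAST_HISTORY_CUTOFF:
            continue  # step 2: "History Slower"
        # step 3: keep only this runner's results at age 38 or older
        selected.extend(r for r in races if r["age"] >= MIN_AGE)
    return selected


# Invented runners for illustration only.
toy = [
    {"name": "A Runner", "age": 36, "seconds": 9000},
    {"name": "A Runner", "age": 39, "seconds": 9300},
    {"name": "A Runner", "age": 41, "seconds": 9500},
    {"name": "B Runner", "age": 40, "seconds": 8300},   # only one result
    {"name": "C Runner", "age": 38, "seconds": 10800},  # slow history
    {"name": "C Runner", "age": 40, "seconds": 11000},
    {"name": "D Runner", "age": 30, "seconds": 8500},   # fast, but under 38
    {"name": "D Runner", "age": 33, "seconds": 8600},
]

cluster = fast_history_results(toy)
runners = {r["name"] for r in cluster}
print(f"{len(cluster)} results across {len(runners)} individuals")
```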

For those 597 finishing times (fast history, 38+), we can see the distribution below. The peak of the distribution is between 2:30 and 2:40. Some individuals fall off and have significantly slower times (albeit still relatively fast compared to the total population). Many individuals, however, continue to show strong finishing times at age 38 and beyond.

Finally, we can use this subset and distribution to get a sense of the probability of running 2:20 or faster at age 38 or older. Here is what I printed out in my terminal:

 “Frank, your probability of running faster than 2:20 after the age of 37 is approximately 4.5226 percent. Good luck!”
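One plausible reading of that number is a simple empirical share: the fraction of the 597 results that are 2:20 or faster. The post doesn't spell out the exact method, so treat this Python sketch as an assumption. As a sanity check, 27 of 597 works out to 4.5226%, which matches the printed figure; the 27 is a back-calculation, not a count reported in the analysis, and the times below are placeholders.

```python
def prob_at_or_under(times_seconds, cutoff_seconds):
    """Empirical share of finishing times at or under the cutoff."""
    hits = sum(1 for t in times_seconds if t <= cutoff_seconds)
    return hits / len(times_seconds)


CUTOFF = 2 * 3600 + 20 * 60  # 2:20:00

# Placeholder times: 27 results under the cutoff, 570 over it (597 in total).
# Only the counts matter here; the individual times are not real data.
placeholder_times = [8300] * 27 + [9300] * 570

pct = 100 * prob_at_or_under(placeholder_times, CUTOFF)
message = (
    "Frank, your probability of running faster than 2:20 after the age of 37 "
    f"is approximately {pct:.4f} percent. Good luck!"
)
print(message)
```

Because this is a share of results rather than of individuals, a few runners with several fast races could pull the number up.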

Ok, so, the analysis isn’t airtight. Does it matter? What if the probability is actually 2%? What if it’s 10%? The truth is that even without data, you know that running sub-2:20 is hard… very hard in fact. The only reason this is helpful is knowing that A) it’s possible because many others have done it and B) the odds are waaaaaay better than Lloyd’s chances with Mary Swanson. Finally, there are so many peripheral benefits to the marathon lifestyle that it’s a win-win… no matter what the final time ends up being.


## Where Can I Run Fastest?

Let us think of location as both space and time. First, what months during the year do we see the fastest marathon running occurring? 

October, November, and December are relatively the fastest months. January, April, and May also see some relatively better performances. Weather likely plays a large role in this. Other research shows that the ideal race temperature is between 50 and 62°F (not the 70s and 80s we see in summer months). But weather, to a degree, depends on latitude… or geographic race location.

For all races we have in the data, the east and northeast tend to run quicker (darker is faster). However, going back to the weather… your options for races are limited depending upon the time of year. A look below at average marathon times by both state and month shows that southern-state marathons are almost non-existent in summer months, and the same is true for the north in November and December.
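A state-by-month view like this is just a grouped average. Here is a minimal Python sketch, assuming hypothetical `state`, `month`, and `seconds` fields on each result; the rows are invented. A state and month with no races simply has no entry, which is how the gaps in the chart would show up.

```python
from collections import defaultdict
from statistics import mean


def mean_by_state_month(results):
    """Return {(state, month): mean finishing time in seconds}."""
    cells = defaultdict(list)
    for r in results:
        cells[(r["state"], r["month"])].append(r["seconds"])
    return {key: mean(times) for key, times in cells.items()}


# Invented rows for illustration only.
toy = [
    {"state": "CT", "month": 10, "seconds": 10000},
    {"state": "CT", "month": 10, "seconds": 10400},
    {"state": "FL", "month": 1, "seconds": 10800},
]

table = mean_by_state_month(toy)
print(table.get(("CT", 10)))  # average for Connecticut in October
print(table.get(("FL", 7)))   # None: no July races in Florida in this toy data
```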

Cool, but this doesn’t tell me anything immediately actionable. 

Here’s what I want: fast races where logistics are relatively simple. Running NYC is an amazing experience – coming off the 59th Street Bridge and hearing the roar of the crowd is incredible. However, unless you are an elite runner, the logistics of getting to the starting line on Staten Island are a bit of a nightmare. Such is the price of bringing together tens of thousands of people to run a race. The conundrum is that bigger races tend to have faster finishing times. So do you need to endure the crowds to get on a fast course? No!

Show me a sorted list of races that fit these parameters:

  • Fastest finishing time is below 2:30 (indicates potentially fast course)
  • The 10th percentile time is below 3:00 (indicates some level of speed density)
  • Number of finishers is below 2,500 (less logistical headaches)
  • It’s in the northeast (if I don’t have to get on a plane, great)
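The four criteria above translate directly into a filter. Here is a minimal Python sketch, assuming one summary record per race with hypothetical fields (`fastest`, `p10`, `finishers`, `state`). The Northeast state list (the US Census definition) and sorting by fastest time are assumptions, and the races are invented placeholders.

```python
# Assumption: "northeast" means the US Census Bureau's Northeast region.
NORTHEAST = {"CT", "MA", "ME", "NH", "NJ", "NY", "PA", "RI", "VT"}


def hms(hours, minutes, seconds=0):
    """Convert hours, minutes, seconds to total seconds."""
    return hours * 3600 + minutes * 60 + seconds


def shortlist(races):
    """Return races meeting all four criteria, fastest winning time first."""
    picks = [
        r for r in races
        if r["fastest"] < hms(2, 30)    # potentially fast course
        and r["p10"] < hms(3, 0)        # some level of speed density
        and r["finishers"] < 2500       # fewer logistical headaches
        and r["state"] in NORTHEAST     # no plane required
    ]
    return sorted(picks, key=lambda r: r["fastest"])


# Invented races for illustration only.
toy_races = [
    {"name": "Race A", "state": "CT", "fastest": hms(2, 19), "p10": hms(2, 55), "finishers": 2100},
    {"name": "Race B", "state": "NY", "fastest": hms(2, 8), "p10": hms(2, 50), "finishers": 50000},
    {"name": "Race C", "state": "MA", "fastest": hms(2, 28), "p10": hms(2, 58), "finishers": 900},
    {"name": "Race D", "state": "CA", "fastest": hms(2, 22), "p10": hms(2, 52), "finishers": 1800},
    {"name": "Race E", "state": "VT", "fastest": hms(2, 41), "p10": hms(3, 10), "finishers": 600},
]

names = [r["name"] for r in shortlist(toy_races)]
print(names)
```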

I see some gems here. The Hartford Marathon jumped off the screen for me. I’ll be living in that area and people have run sub-2:20 there. After slightly more research, I see that the start and finish are very close together, which makes parking much easier than at point-to-point races that require a shuttle bus. So, bingo. T minus 8 months until marathon race day 🙂


## Summary

If this dataset is representative of all 21 most recent years, we can estimate that ~30k individuals finished a marathon each year. That adds up to 786,000 miles… which is roughly 1.5 round trips to the moon! Marathon participation is increasing for all levels of talent… although marathoners as a percentage of the general population are less than 1%. I know that running the fastest times of your life – your personal records – requires a ton of effort and a bit of luck. Knowing that many people have run amazingly fast times and that there are dozens of races through the year that can facilitate super fast racing, there’s only one thing left to do: keep dreaming big!
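The mileage arithmetic is easy to check. Here is a small Python sketch, assuming the usual 26.2-mile marathon distance and an average Earth-Moon distance of about 238,855 miles (an outside figure, not from the dataset). With those inputs the total comes to 786,000 miles and a bit over one and a half round trips, in line with the rough figure above.

```python
MARATHON_MILES = 26.2        # assumption: standard marathon distance, rounded
FINISHERS_PER_YEAR = 30_000  # the ~30k estimate from the text
EARTH_MOON_MILES = 238_855   # assumption: average Earth-Moon distance

total_miles = FINISHERS_PER_YEAR * MARATHON_MILES
round_trips = total_miles / (2 * EARTH_MOON_MILES)
print(f"{total_miles:,.0f} miles, about {round_trips:.2f} round trips to the moon")
```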

---

* There is probably debate to be had regarding data completeness, especially in 2000 when the internet was new. However, assuming it’s close and using a crude method (which you can find in the markdown), we can estimate that the typical lifespan of a marathon event is about 5 years or less. This means that if we compare metrics over time, we are comparing not only different athletes but also different races.

1 Take a look at this rmarkdown used to create the analysis and visualizations here.