Half Marathon Exploratory Data Analysis

Peter TEMPFLI
Nov 25, 2018


This weekend I ran a half-marathon (my second this year). It was great: nice town, OK weather, great people.

But what do I actually know about my fellow runners? I’m going to use my newly acquired R knowledge to figure it out!

Get the data

Fortunately, the organisers provide a web-based results table, which is pretty easy to convert into a convenient CSV file. Then I open RStudio and load it.

An R data frame. Think of it as Excel on steroids.
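Loading such a CSV takes one call in base R. Here is a minimal sketch; the column names and the three rows are made up for illustration (the real export may look different), and the text is read inline so the snippet runs without a file.

```r
# Hypothetical stand-in for the exported results table.
# Column names and values are assumptions, not the real race data.
csv_text <- "name,city,sex,age,net_time
Runner A,Budapest,male,34,2:05:16
Runner B,Town X,female,28,1:58:40
Runner C,Budapest,female,41,2:14:03"

# With a real file this would be: read.csv("results.csv", stringsAsFactors = FALSE)
runners <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(runners)   # column names and types
head(runners)  # first rows of the data frame
```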

R is a lot of things: a data-management tool, a data-analysis tool, a visualisation tool, and a programming language.

Runner demographics

The race took place at Lake Balaton (the Hungarian sea!), which is about an hour and a half from Budapest. Almost every second runner came from the capital.

The capital is heavily over-represented

How about gender?

56% of the runners are male.
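Shares like the ones above (runners per city, male vs female) are frequency tables turned into proportions. A small sketch with invented rows; the `city` and `sex` column names are assumptions:

```r
# Toy data, invented for illustration only.
runners <- data.frame(
  city = c("Budapest", "Budapest", "Town X", "Budapest", "Town Y", "Town X"),
  sex  = c("male", "female", "male", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# Share of runners per city, largest first.
city_share <- sort(prop.table(table(runners$city)), decreasing = TRUE)
round(100 * city_share, 1)

# Share of runners per sex.
sex_share <- prop.table(table(runners$sex))
round(100 * sex_share, 1)
```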

And what about the age of the runners? Wow, I’m actually rather young (34) if we look at the age distribution.

The vertical red line represents my age
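A histogram with a marker line like this one can be drawn in base R with `hist()` and `abline()`. The ages below are simulated, not the real field, so only the mechanics carry over:

```r
set.seed(1)
# Simulated ages; the real age column would come from the results table.
ages <- round(rnorm(500, mean = 40, sd = 9))
my_age <- 34

hist(ages, breaks = 20, main = "Age distribution", xlab = "Age (years)")
abline(v = my_age, col = "red", lwd = 2)  # vertical red line at my age

# Fraction of (simulated) runners who are older than me.
share_older <- mean(ages > my_age)
share_older
```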

Results

The most important number at a half-marathon is your time. Anything under 2 hours is pretty decent (my time was 2:05:16, so I still need to work on it).

Pink horizontal line represents my net time (125 minutes)
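Result tables usually store times as h:mm:ss text, so they have to be converted to minutes before plotting. A small helper could look like this (`to_minutes` is my own hypothetical name, not a built-in function):

```r
# Convert "h:mm:ss" strings to minutes.
to_minutes <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 60 + p[2] + p[3] / 60  # hours to minutes, plus minutes, plus seconds to minutes
  }, numeric(1))
}

to_minutes("2:05:16")  # 125.2667
```

That is 2 × 60 + 5 + 16 / 60 minutes, or roughly the 125 minutes shown on the plot.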

The net time distribution is interesting. First, it looks like I’m somewhat slower than average with my time.

Second, look at the shape of the histograms. You can argue that the male times look roughly normally distributed, with a median around 115 minutes. The female times, however, are bimodal. There are 2 distinct groups: the ladies who run the same or better times than the average guy, and another group, which peaks around 135 minutes.
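To compare the two shapes, medians per group and overlaid density curves are a quick check. The sketch below uses simulated times that only imitate the shapes described above (one bump for males, two for females); the numbers are not the race results, and the `net_min` column name is an assumption.

```r
set.seed(42)
# Simulated net times in minutes, built to mimic the described shapes.
male   <- rnorm(560, mean = 115, sd = 12)
female <- c(rnorm(220, mean = 112, sd = 8), rnorm(220, mean = 135, sd = 8))

runners <- data.frame(
  sex     = rep(c("male", "female"), times = c(length(male), length(female))),
  net_min = c(male, female)
)

# Median net time per sex.
medians <- tapply(runners$net_min, runners$sex, median)
round(medians, 1)

# Overlaid density curves: a bimodal group shows up as two bumps.
plot(density(runners$net_min[runners$sex == "female"]),
     main = "Net time by sex", xlab = "Net time (minutes)")
lines(density(runners$net_min[runners$sex == "male"]), lty = 2)
legend("topright", legend = c("female", "male"), lty = c(1, 2))
```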

And how about a scatter plot of time vs age vs sex?

Every point represents one runner. The line is a regression fit, a kind of imaginary average.
  • Pink points (males) sit mostly in the lower part of the plot. On average, males run faster.
  • We can’t see any big difference between ages. However, there is a peak around 25. It’s not clear whether the 20-somethings really do worse or whether this is some other effect.
  • At very good times (below 1:40) males are over-represented.
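A scatter plot with one smoothed line per sex can be sketched in base R as follows. The data is simulated, and using `loess()` as the smoother is my assumption here, since the kind of regression behind the line is not stated.

```r
set.seed(7)
n <- 300
# Simulated runners; column names are assumptions.
runners <- data.frame(
  age = sample(18:65, n, replace = TRUE),
  sex = sample(c("male", "female"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
runners$net_min <- 120 + ifelse(runners$sex == "male", -8, 8) + rnorm(n, sd = 12)

cols <- c(male = "pink", female = "steelblue")  # pink = males, as in the plot above

plot(runners$age, runners$net_min, col = cols[runners$sex], pch = 16,
     xlab = "Age (years)", ylab = "Net time (minutes)")

# One local-regression (loess) curve per sex.
for (s in names(cols)) {
  d    <- runners[runners$sex == s, ]
  fit  <- loess(net_min ~ age, data = d)
  grid <- seq(min(d$age), max(d$age))
  lines(grid, predict(fit, newdata = data.frame(age = grid)), col = cols[[s]], lwd = 2)
}
```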

Pace

This was a special race: the athletes had to run 2 rounds around the town, and there was a timing checkpoint exactly at the halfway point. So we can see the difference between the 2 rounds.

My assumption is that runners usually run the second round slower, because they are tired after 10 kilometers.

Difference between the 2 rounds, in seconds. Positive numbers represent a slower pace in the second round, negative numbers a faster one. The red line is 0, which means 2 identical rounds.

The assumption was right: the average runner loses 1.5 minutes over the second 10 kilometers. It is worth noting that this is a beautiful, slightly right-skewed bell curve.
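The round difference is simply the second split minus the first, per runner. Below is a sketch on simulated splits (in seconds), built so that the difference is right-skewed with a mean near 90 seconds. The numbers are invented to imitate the shape, and the moment-based skewness formula is just one common way to measure it.

```r
set.seed(3)
n <- 400
# Simulated split times in seconds; not the real race data.
first_s  <- rnorm(n, mean = 3600, sd = 420)
second_s <- first_s + rgamma(n, shape = 4, scale = 45) - 90

diff_s <- second_s - first_s   # positive = slower second round

mean(diff_s)                   # average time lost in the second round
mean(diff_s > 0)               # share of runners who slowed down

# Simple moment-based skewness: positive means a longer right tail.
skew <- mean((diff_s - mean(diff_s))^3) / sd(diff_s)^3
skew

hist(diff_s, breaks = 30, main = "Second round minus first round",
     xlab = "Difference (seconds)")
abline(v = 0, col = "red", lwd = 2)  # 0 = two identical rounds
```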

Let’s scatter plot the same data! In this plot we see the pace change (y-axis) vs age, grouped by sex.

The line represents the regressed average for the given age group

Here we can see something really fun. Male and female pace changes in more or less the same way. However:

Females around 25 years old lose a lot of pace in the second round, which is not true for other female age groups. This is probably the same effect that we have already seen in the age vs net-time plot, and I’m sure that it is somehow connected with the bimodal female net time results.

It looks like male runners are more homogeneous, while females are more diverse.
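One way to double-check an age-group effect like the one above is to put the average round difference into an age band by sex table. The sketch uses simulated data with no built-in effect, and the age bands are my own arbitrary choice, so it only shows the mechanics:

```r
set.seed(11)
n <- 400
# Simulated runners; diff_s is the second round minus the first, in seconds.
runners <- data.frame(
  age    = sample(18:65, n, replace = TRUE),
  sex    = sample(c("male", "female"), n, replace = TRUE),
  diff_s = rnorm(n, mean = 90, sd = 80),
  stringsAsFactors = FALSE
)

# Hypothetical age bands; intervals are closed on the right, e.g. (24, 29].
runners$age_band <- cut(runners$age,
                        breaks = c(17, 24, 29, 39, 49, 65),
                        labels = c("18-24", "25-29", "30-39", "40-49", "50-65"))

# Mean round difference for every age band and sex.
by_group <- tapply(runners$diff_s, list(runners$age_band, runners$sex), mean)
round(by_group)
```

A group that stands out in such a table is worth a closer look, but with few runners per cell the means can jump around quite a bit.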
