Half Marathon Exploratory Data Analysis

This weekend I ran a half marathon (my second this year). It was great: nice town, OK weather, great people.

But what do I actually know about my fellow runners? I’m going to use my newly acquired R knowledge to find out!

Get the data

An R data frame. Think of it as Excel on steroids.

R is a lot of things: a data-management tool, a data-analysis tool, a visualisation tool, and a programming language.

Runner demographics

The capital is heavily over-represented

How about gender?

56% of the runners are male.
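A share like this can be computed straight from the data frame. Here is a minimal sketch in base R. The data frame `runners` and its `sex` column are my own hypothetical names, and the eight rows are invented toy values, not the race results.

```r
# Toy data, invented for illustration only (NOT the real race results).
# The names `runners` and `sex` are assumptions, not the real dataset.
runners <- data.frame(
  sex = c("M", "F", "M", "M", "F", "M", "F", "M")
)

# Count runners per sex, then turn the counts into percentages.
sex_counts <- table(runners$sex)
sex_share  <- round(100 * prop.table(sex_counts), 1)

print(sex_share)
```

`prop.table()` divides each count by the total, so the shares always add up to 100%.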

And what about the age of the runners? Wow, at 34 I’m actually rather young, judging by the age distribution.

The vertical red line represents my age


The pink horizontal line represents my net time (125 minutes)

The net time distribution is interesting. First, it looks like my pace is somewhat below average.

Second, look at the shape of the histograms. You could argue that the male times look roughly normally distributed, with a median around 115 minutes. The female times, however, are bimodal. There are two distinct groups: the ladies who run the same times as an average guy or better, and another group that peaks around 135 minutes.
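For anyone who wants to try this kind of comparison, here is a rough base R sketch of per-sex medians and side-by-side histograms. Everything in it is an assumption on my side: the column names `sex` and `net_minutes` are hypothetical, and the numbers are randomly generated to loosely mimic the shapes described above (one bell for males, two bumps for females). They are not the race data.

```r
set.seed(42)  # make the synthetic data reproducible

# Synthetic data, generated only to illustrate the technique.
# Males: one bell-shaped group. Females: a mix of two groups (bimodal).
runners <- data.frame(
  sex = rep(c("M", "F"), each = 200),
  net_minutes = c(
    rnorm(200, mean = 115, sd = 12),   # males
    rnorm(100, mean = 112, sd = 8),    # faster female group
    rnorm(100, mean = 135, sd = 8)     # slower female group
  )
)

# Median net time per sex (groups come back in alphabetical order: F, M).
medians <- tapply(runners$net_minutes, runners$sex, median)
print(round(medians, 1))

# One histogram per sex, side by side.
old_par <- par(mfrow = c(1, 2))
for (s in c("M", "F")) {
  hist(runners$net_minutes[runners$sex == s],
       breaks = 20,
       main = paste("Net time,", s),
       xlab = "Net time (minutes)")
}
par(old_par)  # restore the previous plot layout
```

A median alone would hide the two female groups, which is why it is worth looking at the histogram and not only at a summary number.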

And how about a scatter plot: time vs age vs sex?

Every point represents one runner. The line is a regression fit, a kind of imaginary average runner at each age.
  • Pink points (males) sit mostly in the lower part of the plot. On average, males run faster.
  • We can’t see any big difference between ages. However, there is a peak around age 25. It’s not clear whether the 20-somethings really do worse or whether this is some other effect.
  • At very good times (below 1:40), males are over-represented.
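A plot like the one described above can be sketched in base R as follows. I am assuming a LOESS smoother for the regression line, which is only one possible choice. The column names `age`, `sex` and `net_minutes` are hypothetical, and the data is random noise generated for the example, so do not expect it to reproduce the peak around 25.

```r
set.seed(1)  # reproducible synthetic data

# Synthetic runners, invented for illustration (NOT the race results).
n <- 300
runners <- data.frame(
  age = sample(18:65, n, replace = TRUE),
  sex = sample(c("F", "M"), n, replace = TRUE),
  net_minutes = rnorm(n, mean = 120, sd = 15)
)

# Pink for males, as in the plot described in the text; blue is my choice.
cols <- c(F = "steelblue", M = "hotpink")

plot(runners$age, runners$net_minutes,
     col = cols[as.character(runners$sex)], pch = 16,
     xlab = "Age", ylab = "Net time (minutes)")

# Fit one LOESS curve per sex and draw it over the points.
smoothed <- list()
for (s in names(cols)) {
  grp  <- runners[runners$sex == s, ]
  fit  <- loess(net_minutes ~ age, data = grp)
  ages <- seq(min(grp$age), max(grp$age))
  smoothed[[s]] <- predict(fit, newdata = data.frame(age = ages))
  lines(ages, smoothed[[s]], col = cols[[s]], lwd = 2)
}
```

The curves are only evaluated between each group's youngest and oldest runner, because `loess()` does not extrapolate outside the observed ages by default.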


My assumption is that runners usually run the second round slower, because they are tired after 10 kilometers.

Difference between the two rounds, in seconds. Positive numbers mean a slower second round, negative numbers a faster one. The red line is 0, meaning both rounds took exactly the same time.

The assumption was right: the average runner loses 1.5 minutes over the second 10 kilometers. It is worth noting that this is a beautiful, slightly right-skewed bell curve.
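The calculation behind this is just a subtraction per runner. Here is a small base R sketch with five made-up runners. The column names `first_round_sec` and `second_round_sec` are hypothetical, and the values are toy numbers chosen to make the arithmetic easy to follow.

```r
# Toy split times in seconds, invented for illustration only.
runners <- data.frame(
  first_round_sec  = c(3300, 3600, 3450, 3900, 3720),
  second_round_sec = c(3390, 3720, 3420, 4050, 3780)
)

# Positive difference = slower second round, negative = faster.
runners$diff_sec <- runners$second_round_sec - runners$first_round_sec

# Average time lost in the second round, and the share of runners who slowed down.
mean_diff    <- mean(runners$diff_sec)
share_slower <- mean(runners$diff_sec > 0)

print(runners$diff_sec)
print(mean_diff)
print(share_slower)
```

On real data, `hist(runners$diff_sec)` followed by `abline(v = 0, col = "red")` would give a histogram with a zero line like the one described above.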

Let’s make a scatter plot of the same data! In this plot we see the change in pace (y-axis) vs age, grouped by sex.

The line represents the regressed average for the given age group

Here we can see something really fun. Male and female paces change in more or less the same way. However:

Females around 25 years old lose a lot of pace in the second round, which is not true for other female age groups. This is probably the same effect we already saw in the age vs net time plot, and I’m sure it is somehow connected with the bimodal female net time results.

It looks like male runners are more homogeneous, while females are more diverse.