Half Marathon Exploratory Data Analysis
This weekend I ran a half-marathon (my second this year). It was great: nice town, OK weather, great people.
But what do I actually know about my fellow runners? I’m going to use my newly acquired R knowledge to find out!
Get the data
Fortunately, the organisers provide a web-based results table, which is pretty easy to convert into a convenient CSV file. Then I open RStudio and load it.
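As a rough sketch of the loading step in base R: the column names (name, city, sex, age, net_time) and the file name results.csv mentioned in the comment are assumptions, not the organisers' actual format, and the sample rows are made up so the snippet runs on its own.

```r
# Made-up sample rows standing in for the converted results table.
# Column names (name, city, sex, age, net_time) are assumptions.
csv_text <- "name,city,sex,age,net_time
Runner A,Budapest,M,34,2:05:16
Runner B,Elsewhere,F,27,1:58:40
Runner C,Budapest,F,41,2:14:03"

# With a real file this would be read.csv("results.csv", ...) instead
# (results.csv is a hypothetical file name).
runners <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert "h:mm:ss" strings to minutes so the times can be analysed numerically.
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)))
}
runners$net_minutes <- to_minutes(runners$net_time)

str(runners)
```

Keeping the times as minutes in a numeric column makes the later medians and histograms straightforward.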

R is a lot of things: a data-management tool, a data-analysis tool, a visualisation tool, and a programming language.
Runner demographics
The race took place at Lake Balaton (the Hungarian sea!), which is about a one-and-a-half-hour ride from Budapest. Almost every second runner came from the capital.

How about gender?

And what about the age of the runners? Wow, at 34 I’m actually rather young, judging by the age distribution.
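These demographic questions boil down to a few proportions. A minimal sketch in base R, assuming columns named city, sex and age (my naming, not necessarily the real table's) and using made-up toy rows rather than the race data:

```r
# Made-up toy data; the columns city, sex and age are assumed names.
runners <- data.frame(
  city = c("Budapest", "Elsewhere", "Budapest", "Elsewhere", "Budapest", "Elsewhere"),
  sex  = c("M", "F", "M", "M", "F", "M"),
  age  = c(34, 27, 45, 52, 38, 41)
)

# Share of runners from the capital.
share_budapest <- mean(runners$city == "Budapest")

# Share of runners by sex.
sex_share <- prop.table(table(runners$sex))

# Age summary, and the share of runners older than 34.
age_summary <- summary(runners$age)
share_older_than_34 <- mean(runners$age > 34)

print(share_budapest)
print(sex_share)
print(age_summary)
print(share_older_than_34)
```

Taking the mean of a logical vector is a handy way to get a proportion, since TRUE counts as 1 and FALSE as 0.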

Results
The most important number at a half-marathon is your time. Anything below 2 hours is pretty decent (my time was 2:05:16, so I still need to work on it).

The net time distribution is interesting. First, it looks like my time puts me somewhat below the average pace.
Second, look at the shape of the histograms. You can argue that male times look roughly normally distributed, with a median around 115 minutes. However, female times are bimodal. There are 2 distinct groups: the women who run the same or better times than an average guy, and another group, which peaks around 135 minutes.
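To illustrate how such a comparison can be computed, here is a small base R sketch. The column names sex and net_minutes are assumptions, and the numbers are made up for the example (they are not the race results); the 10-minute bin width is also just my choice.

```r
# Made-up net times in minutes; sex and net_minutes are assumed column names.
runners <- data.frame(
  sex = c("M", "M", "M", "M", "M", "F", "F", "F", "F", "F", "F"),
  net_minutes = c(98, 110, 115, 121, 133, 104, 109, 112, 131, 135, 138)
)

# Median net time per sex.
medians <- tapply(runners$net_minutes, runners$sex, median)
print(medians)

# Bin the times in 10-minute steps per sex; two separated clusters of
# non-empty bins would hint at a bimodal shape.
# Drop plot = FALSE to draw the histograms instead of just counting.
breaks <- seq(90, 140, by = 10)
counts <- lapply(split(runners$net_minutes, runners$sex), function(x) {
  hist(x, breaks = breaks, plot = FALSE)$counts
})
print(counts)
```

Using the same breaks for both groups keeps the two histograms directly comparable.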
And how about a scatter plot of time vs age vs sex?

- Pink points (males) sit mostly in the lower part of the plot. On average, males run faster.
- We can’t see any big difference between ages. However, there is a peak around 25. It’s not clear whether the 20-somethings really do worse or whether this is some other effect.
- At very good times (below 1:40) males are over-represented.
Pace
This was a special race: the athletes had to run 2 rounds around the town, and there was a timing checkpoint exactly at the halfway point. So we can see the difference between the 2 rounds.
My assumption is that runners usually run the second round slower, because they are tired after 10 kilometers.

The assumption was right: the average runner loses 1.5 minutes over the second 10 kilometers. It is worth noting that the distribution is a beautiful, slightly right-skewed bell curve.
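The slowdown itself is simple arithmetic on the two recorded times. A minimal sketch, assuming columns half_minutes (time at the midpoint check) and net_minutes (finish time); both names and all values below are made up for illustration:

```r
# Made-up times in minutes; half_minutes (time at the midpoint check) and
# net_minutes (finish time) are assumed column names.
runners <- data.frame(
  half_minutes = c(55.0, 60.5, 62.0, 48.0, 66.0),
  net_minutes  = c(111.0, 122.5, 125.3, 96.5, 135.5)
)

# Time for the second round = finish time minus midpoint time.
runners$second_minutes <- runners$net_minutes - runners$half_minutes

# Positive values mean the second round was slower than the first.
runners$slowdown <- runners$second_minutes - runners$half_minutes

mean_slowdown <- mean(runners$slowdown)
share_slower  <- mean(runners$slowdown > 0)
print(mean_slowdown)
print(share_slower)
```

A histogram of the slowdown column is then what shows the shape of the distribution.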
Let’s make a scatter plot of the same data! This plot shows the change in pace (y-axis) against age, grouped by sex.

Here we can see something really fun. Male and female pace changes follow more or less the same pattern. However:
- Females around 25 years old lose a lot of pace in the second round, which is not true for other female age groups. This is probably the same effect that we have already seen in the age vs net time plot, and I’m sure that it is somehow connected with the bimodal distribution of female net times.
- It looks like male runners are more homogeneous, while females are more diverse.
