Paul's Geek Dad Blog: Half Marathon Analysis Using Python and R

I recently did a half-marathon run and, whilst I did a personal best, I paced it really badly meaning the last 5km were hard and slow. This, coupled with some interesting results data that was published gave me a Geek idea!

Here's a snippet from the results PDF:

So for all 10,000 runners a whole plethora of data to compare and contrast. In particular I was interested in how my 5k splits compared with everyone else's. Mine were:

0-5km = 22m0s
5-10km = 21m56s
10-15km = 22m03s
15-20km = 23m37s

So really consistent across the first 15km then a bad fade over the last 5. In fact a quick calculation shows I was 7.4% slower for the last 5km than the average for the first 15km. However before I beat myself up over this I needed to know whether this was typical, better than average or worse than average.

All the results were in a PDF and what a pain that turned out to be to turn into something I could process. I tried various online services, saving as text in Adobe Acrobat, avoided the paid for Adobe service and tried a Python module called pyPdf but none would allow me to turn the PDF file into a well formed text file for processing.

In the end I opened the PDF in Adobe Acrobat, copied all the data then pasted it into Windows Notepad. The data looked like this (abridged):

GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime Chip 5Km Split Chip 10Km Split Chip 15Km Split Chip 20Km Split
5 Robert Mbithi M 01:03:57 1 01:03:57 00:15:08 00:29:39 00:44:44 01:00:24
72 Scott Overall M 01:05:13 2 01:05:13 00:15:33 00:30:46 00:46:13 01:01:59
81 Chris Thompson M ALDERSHOT FARNHAM & DISTRICT AC 01:05:14 3 01:05:14 00:15:33 00:30:46 00:46:13 01:01:59
6 Paul Martelletti M RUN FAST 01:05:15 4 01:05:15 00:15:33 00:30:46 00:46:13 01:01:59
82 Gary Murray M 01:06:12 5 01:06:12 00:15:33 00:30:46 00:46:18 01:02:37

I then had to work pretty hard to turn this into a file that could be read into my favourite analysis package, R. Looking at the data above you can see:

There's spaces between fields.
There's spaces within fields.
There's missing fields (e.g. age and club).
The PDF format means some long fields overlap with each other.

I actually had to write quite a lot of Python script to turn this into a CSV. I've put this in the bottom of this post so as to know interrupt the flow (baby). In the script I also added some fields where the hh:mm:ss format times were turned into simple seconds to help with the maths.

Eventually I had a CSV file to play with and I read it into R using this command:

> rhm1 <- read.csv(file=file.choose(),head=FALSE,sep=",")

I added some column names with:

> colnames(rhm1) <- c("GunPos","RaceNo","Gender","Name","AgeCat","Club","GunTime","GunTimeSecs","ChipPos","ChipTime","ChipTimeSecs","FiveKSplit","FiveKSplitSecs","TenKSplit","TenKSplitSecs","FifteenKSplit","FifteenKSplitSecs","TwentyKSplit","TwentyKSplitSecs")

I then computed the net 5k splits (from the elapsed time) with:
> rhm1$TenKSplitSecsNet <- rhm1$TenKSplitSecs - rhm1$FiveKSplitSecs
> rhm1$FifteenKSplitSecsNet <- rhm1$FifteenKSplitSecs - rhm1$TenKSplitSecs
> rhm1$TwentyKSplitSecsNet <- rhm1$TwentyKSplitSecs - rhm1$FifteenKSplitSecs

I then computed the mean time for the first 3 splits:
> rhm1$FirstFifteenKMean <- (rhm1$FiveKSplitSecs + rhm1$TenKSplitSecsNet + rhm1$FifteenKSplitSecsNet) / 3

...and used this to compute the percentage difference in the last 5k split from the average of the first 3:
rhm1$Last5KDelta <- (rhm1$TwentyKSplitSecsNet - rhm1$FirstFifteenKMean) / rhm1$FirstFifteenKMean

Finding me in the data:

> rhm1[grep("Geek Dad",rhm1$Name),]
GunPos RaceNo Gender Name AgeCat Club GunTime GunTimeSecs ChipPos
869 869 13759 M Geek Dad 40 01:39:12 5952 918
ChipTime ChipTimeSecs FiveKSplit FiveKSplitSecs TenKSplit TenKSplitSecs
01:34:54 5694 00:22:00 1320 00:43:56 2636
FifteenKSplit
01:05:59
FifteenKSplitSecs TwentyKSplit TwentyKSplitSecs TenKSplitSecsNet
3959 01:29:36 5376 1316

FifteenKSplitSecsNet TwentyKSplitSecsNet FirstFifteenKMean Last5KDelta
1323 1417 1319.667 0.073756

I could then plot all the data using ggplot. I chose a density plot to look at the proportions of each "Last5KDelta" value. Here's the command to create the plot (and add some formatting and labels).

> library(ggplot2)
> qplot(rhm1$Last5KDelta, geom="density", main="Density Plot of Last 5K Delta",xlab="% Delta", ylab="Density", fill=I("blue"),col=I("red"), alpha=I(.2),xlim=c(-1,1))

So nice looking chart and I can see that there's more people who were slower in the final 5k (positive value) than faster. Looking good!

However this didn't tell me whether I was better or worse than average. For this I need a cumulative frequency plot. This uses the stat_ecdf (empirical cumulative distribution function) to create the plot. The command below does this, tweaks the x axis to make it tighter and puts in an extra a axis tick at 7% so I can see where "I" sit on the graph.

> chart1 <- ggplot(rhm1,aes(Last5KDelta)) + stat_ecdf(geom = "step",colour="red") + scale_x_continuous(limits=c(-0.3,0.6),breaks=c(-0.3,-0.2,-0.1,0,0.07,0.1,0.2,0.3,0.4,0.5,0.6))
chart1 + ggtitle("Cumulative Frequency of Last 5K Delta") + labs(y="Cumulative Frequency")

So get in! 0.7% sits at less than 50% cumulative frequency! More people faded more than me over the last 5k.

However somewhere behind me was a man carrying a fridge! I decided to look at just those who completed the run in under 2 hours by doing:

> rhmSubTwo <- rhm1[rhm1$ChipTimeSecs<7200,]

Which gives this chart:

Darn it. About 68% of this cohort faded less than me. Not looking so good now...

What about those equal or better than me?
> rhmSubMe <- rhm1[rhm1$ChipTimeSecs<5694,]

Looking pretty poor now :-(. I must pace myself better next time.

Here's all the Python code to create the .csv file:

InputFile = "/home/pi/Documents/RHM/VitalityReadingHalfMarathon_v2.txt"
OutputFile = "/home/pi/Documents/RHM/RHM.csv"

#Open the file
InFile = open(InputFile,'r')
OutFile = open(OutputFile,'w')

#This takes a time in h:m:s or similar and turns to seconds.
def TimeStringToSecs(InputString):
#There are 2 cases
#1)A proper time string hh:mm:ss
#2)Something else with letters and numbers munged together
if len(InputString) == 8:
#Compute the time in seconds
SecondsCount = (float(InputString[0:2]) * 3600) + (float(InputString[3:5]) * 60) + float(InputString[6:8])
return str(SecondsCount)
else:
print "Got this weird string: " + InputString
return "-1"

for i in range(1,10981):
#Initialise variables
Outstring = ""
EndString = ""
MidString = ""
GenderFound = False

#Read a line
InString = InFile.readline().rstrip()

print InString

#Split the line based upon a space
SplitStr1 = InString.split(" ")

#We can rely on the first field which is gun position and second field which is race number. But don't put Gun position as R ignores it!
OutString = SplitStr1[1] + ","

#We can also rely on the last 7 fields of the line which respectively are GunTime,ChipPos,ChipTime,5KSplit,10KSplit,15KSplit,20KSplit
NumFields = len(SplitStr1)
#Compute the end of the output string
for z in range(7,0,-1):
#print "z=" + str(z) + ". Equates to" + SplitStr1[NumFields - z]
EndString = EndString + SplitStr1[NumFields - z] + ","
#Look up the time in seconds. Not for case 6 which is the gun position
if (z != 6):
EndString = EndString + TimeStringToSecs(SplitStr1[NumFields - z]) + ","
#Hardest bit last. Name, Gender, Age and Club. Gender is reliably there, except for long names where it gets mangled.
#Hence find it and you know everything before is the name
for a in range(0,len(SplitStr1)):
if (SplitStr1[a] == "M") or (SplitStr1[a] == "F"):
#THis is the position of the gender which is the "anchor" for everything else
GenderPos = a
#Add it to the middle part of the string. No worries it's in different order to file.
MidString = SplitStr1[a] + ","
#Say we found the gender.
GenderFound = True

#Process for the case where gender was found
if GenderFound:
#Now we know everything before (exclusing first two numbers was the name. Add the parts of the name together. The below code should handle
#complex names
for b in range(2,GenderPos):
MidString = MidString + SplitStr1[b]
#See if it's not the last part of the name. If not add a space
if (b < (GenderPos - 1)):
MidString = MidString + " "
else:
MidString = MidString + ","

#Now test the part after the gender position. If it's "U23" or a number(but not 100 as cllubs start with this!) then this is the age category
if (SplitStr1[GenderPos + 1] == "U23"):
MidString = MidString + SplitStr1[GenderPos + 1] + ","
#Log where the club might start
ClubStartPos = GenderPos + 2
elif SplitStr1[GenderPos + 1].isdigit():
if SplitStr1[GenderPos + 1] != "100":
MidString = MidString + SplitStr1[GenderPos + 1] + ","
#Log where the club might start
ClubStartPos = GenderPos + 2
else:
MidString = MidString + ","
#Log where the club might start
ClubStartPos = GenderPos + 1
else:
MidString = MidString + ","
#Log where the club might start
ClubStartPos = GenderPos + 1

#So now everything from ClubStartPos "might" be a club. We can test this by seeing if what might be the club is actually gun
#time which is 7th from the end
if (SplitStr1[ClubStartPos] == SplitStr1[NumFields - 7]):
MidString = MidString + ","
else:
#Loop adding elements of the club
for c in range(ClubStartPos,NumFields - 7):
MidString = MidString + SplitStr1[c]
#See whether to add a space
if (c < (NumFields - 8)):
MidString = MidString + " "
else:
MidString = MidString + ","
else: #Where there is no gender. Add commas to represent Name,Age and Club and somethign to say it was a long name!!!
MidString = ",Long Name,,,"

#print OutString
#print MidString
#print EndString

print OutString + MidString + EndString
OutFile.write(OutString + MidString + EndString + '\r\n')

InFile.close()
OutFile.close()

Paul's Geek Dad Blog

Friday 8 April 2016

Half Marathon Analysis Using Python and R

No comments:

Post a Comment