Friday, 11 March 2016

First Football (Soccer) Stats Analysis Using Raspberry Pi, Python, MongoDB and R

In my last post I described the setup I'd created on my Raspberry Pi to do football (soccer for some of you) statistics analysis.  Here's an "architecture" diagram:

So in simple terms I:

  • Use Python to gather data from internet sources, parse it and load it into..
  • MongoDB where I store data in document format before...
  • Analysing the data using R and...
  • Presenting the results for you lucky people in Blogger!

So it was time to gather some data, load it and analyse it!

A quick look around the internet showed me a variety of sources, the first of which was:

http://www.football-data.co.uk/

This is a betting focussed website but has a set of free CSV files showing results from multiple European leagues going back to the early '90s, (older data is more sparse than more recent data).  As I say, they offer their data for free but I urge you to say thanks to them as an ad funded website in the only way you can if you see what I mean.

Grabbing a file and looking at it in LibreOffice Calc shows there to be all sorts of interesting data available.  Here's a screenshot showing data for the Belgian Pro league.


So much nice data to play with!!

Step 1- Getting All the Data
Looking at an example URL for one of the CSV files on football-data.co.uk shows it to be:
http://www.football-data.co.uk/mmz4281/1516/E0.csv

So here we see:
  • A base URL - http://www.football-data.co.uk/mmz4281/
  • Four digits indicating the year.  Here 1516 means the season 2015-16
  • The league in question.  Here E0 means English Premier League

Hence it's pretty easy to write a Python script to iterate through the years (9394 to 1516) and leagues to grab all the CSV files.  The script is below.  The comments should explain all but in simple terms it iterates through a list of years (YearList) and for each year a list of leagues (LeagueList), forms a URL for a wget command and then renames the file so we can have all the leagues for all the years in the same folder.

#Downloading a bulk load of Football CSV files from http://www.football-data.co.uk/mmz4281/
#Example URL is http://www.football-data.co.uk/mmz4281/1516/E0.csv - This is the URL for EPL Season 1516
import os
import time

#Constants
DirForFiles = "/home/pi/Documents/FootballProject/"

#Next part is a set of digits that represent the year.  Do these as a list
YearList = ['9394','9495','9596','9697','9798','9899','9900','0001','0102','0203','0304','0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','
1415','1516']

#Then the values that are used to represent leagues.  Another List
LeagueList = ['E0','E1','E2','E3','EC','SC0','SC1','SC2','SC3','D1','D2','I1','I2','SP1','SP2','F1','F2','N1','B1','P1','T1','G1']


#Iterate through the years
for TheYear in YearList:
  #Now iterate through the leagues, forming the command to get
  for TheLeague in LeagueList:
    GetCommand = "sudo wget " + BaseURL + TheYear + "/" + TheLeague + ".csv"
    
    #Also form the name the file will take when downloaded
    FileWhenDownloaded = DirForFiles + TheLeague + ".csv"

    #And the file name to rename to
    RenameFileTo = DirForFiles + TheLeague + "_" + TheYear + ".csv"
    
    #Run the wget command#
    os.system(GetCommand)
    time.sleep(0.5)

    #Rename the file
    RenameCommand = "sudo mv " + FileWhenDownloaded + " " + RenameFileTo
    os.system(RenameCommand)
    time.sleep(0.5)
   

    #print (GetCommand + "|" + FileWhenDownloaded + "|" + RenameFileTo)

A quick ls command shows all the lovely files ready for analysing! 


Step 2 - Loading into MongoDB 
In my last post I covered the basics of loading data into MongoDB using Python. Now it was time to load data downloaded from the football-data site.  I decided to just load one league for one season to have an initial play with the data.

It seems that there is different fields for different leagues for different years.  Field names are common, it's just that they're absent or present from file to file.  The full list of field names is here.

I decided to model the data as one simple JSON document per match with each record as a key value pair.  So for example a simplified document for the first match of the Belgian league file shown above would be:

{
  "Div":"B1",
  "Date":"31/07/09",
  "HomeTeam":"Standard",
  "AwayTeam":"St Truiden"
}

There may well be more elegant ways of modelling the data.  As I explore and learn more I'll work this out!

To prepare MongoDB I created a database called "Footie" and a collection called "Results" using these commands in the Mongo utility:

> use Footie
> db.createCollection("Results")

I wrote the script below to load data for one season and one league (English League 2, 2015/16).

In the script I:

  • Create an object to access MongoDB
  • Open a CSV file to read it
  • Read the first line, remove trailing non-printing characters, split it and use this to form a list (HeaderList) that will make up the key element of the JSON document.
  • Then for each subsequent line, read it, split it into another list (LineList)
  • I then iterate through each list, forming key value pairs and writing them to a Python dictionary, (which is required to write a JSON document to MongoDB).
  • I then write the document to MongoDB!
#Create JSON from football results and write to MongoDB
from pymongo import MongoClient
import sys

#Constants
DirPath = "/home/pi/Documents/FootballProject/"

#Connect to the Footie database
client = MongoClient()
db = client.Footie
#Get a collection
collection = db.Results

#Open the file to process
MyFile = open(DirPath + "E3_1516.csv",'r')

#Read the first lines which is the header line, remove the \r\n at the end and turn it into a list
LineOne = MyFile.readline()
LineOne = LineOne[:-2]
HeaderList = LineOne.split(',')

#Now loop through the file reading lines, creating JSONs and writing them
FileLine = MyFile.readline()
while len(FileLine) > 0:
  print(FileLine)
  #Get rid of last two characters and put in a list
  FileLine = FileLine[:-2]
  LineList = FileLine.split(',')
  
  #Form a JSON from these lists; needs to be a Python Dictionary
  JSONDict = dict()

  #Loop through both lists and add to the JSON
  for i in range(len(HeaderList)):
    #Interestingly field names in MongoDB can't contain a "." so turn it into a European style ","
    HeaderList[i] = HeaderList[i].replace(".",",")
    JSONDict[HeaderList[i]] = LineList[i]
  
  #Write the document to the collection
  print JSONDict
  collection.insert_one(JSONDict)

  #Setup for next loop
  FileLine = MyFile.readline()

print ("Finished writing JSONs")

#Close the file
MyFile.close()

The net result from the Mongo tool being (abridged):

> db.Results.find()
{ "_id" : ObjectId("56ccc42d74fece04d3d5a323"), "BbAHh" : "0.25", "HY" : "1", "BbAH" : "25", "BbMx<2,5" : "1.7", "HTHG" : "0", "HR" : "0", "HS" : "12", "VCA" : "2.5", "BbMx>2,5" : "2.28", "BbMxD" : "3.4", "AwayTeam" : "Luton", "BbAvD" : "3.19", "PSD" : "3.34", "BbAvA" : "2.38", "HC" : "3", "HF" : "12", "Bb1X2" : "44", "BbAvH" : "2.96", "WHD" : "3.2", "Referee" : "G Eltringham", "WHH" : "2.9", "WHA" : "2.5", "IWA" : "2.2", "AST" : "4", "BbMxH" : "3.25", "HTAG" : "0", "BbMxAHA" : "2.14", "IWH" : "2.8", "LBA" : "2.4", "BWA" : "2.15", "BWD" : "3.2", "LBD" : "3.25", "HST" : "4", "PSA" : "2.46", "Date" : "08/08/15", "LBH" : "3.1", "BbAvAHA" : "2.06", "BbAvAHH" : "1.77", "IWD" : "3.1", "AC" : "4", "FTR" : "D", "VCD" : "3.4", "AF" : "15", "VCH" : "3", "FTHG" : "1", "BWH" : "3.1", "AS" : "10", "AR" : "0", "BbAv<2,5" : "1.65", "AY" : "0", "BbAv>2,5" : "2.16", "Div" : "E3", "PSH" : "3.08", "B365H" : "3.2", "HomeTeam" : "Accrington", "B365D" : "3.4", "B365A" : "2.4", "BbMxAHH" : "1.82", "HTR" : "D", "BbOU" : "37", "FTAG" : "1", "BbMxA" : "2.5" }
{ "_id" : ObjectId("56ccc4c674fece05830a706c"), "BbAHh" : "0.25", "HY" : "1", "BbAH" : "25", "BbMx<2,5" : "1.7", "HTHG" : "0", "HR" : "0", "HS" : "12", "VCA" : "2.5", "BbMx>2,5" : "2.28", "BbMxD" : "3.4", "AwayTeam" : "Luton", "BbAvD" : "3.19", "PSD" : "3.34", "BbAvA" : "2.38", "HC" : "3", "HF" : "12", "Bb1X2" : "44", "BbAvH" : "2.96", "WHD" : "3.2", "Referee" : "G Eltringham", "WHH" : "2.9", "WHA" : "2.5", "IWA" : "2.2", "AST" : "4", "BbMxH" : "3.25", "HTAG" : "0", "BbMxAHA" : "2.14", "IWH" : "2.8", "LBA" : "2.4", "BWA" : "2.15", "BWD" : "3.2", "LBD" : "3.25", "HST" : "4", "PSA" : "2.46", "Date" : "08/08/15", "LBH" : "3.1", "BbAvAHA" : "2.06", "BbAvAHH" : "1.77", "IWD" : "3.1", "AC" : "4", "FTR" : "D", "VCD" : "3.4", "AF" : "15", "VCH" : "3", "FTHG" : "1", "BWH" : "3.1", "AS" : "10", "AR" : "0", "BbAv<2,5" : "1.65", "AY" : "0", "BbAv>2,5" : "2.16", "Div" : "E3", "PSH" : "3.08", "B365H" : "3.2", "HomeTeam" : "Accrington", "B365D" : "3.4", "B365A" : "2.4", "BbMxAHH" : "1.82", "HTR" : "D", "BbOU" : "37", "FTAG" : "1", "BbMxA" : "2.5" }

Step 3 - Some Analysis in R
Using R I did some initial analysis of the data.  I thought it would be interesting to ask the all-important question "who is the naughtiest team in League 2?".  The data can help as the following fields are present:

HF = Home Team Fouls Committed
AF = Away Team Fouls Committed 
HY = Home Team Yellow Cards
AY = Away Team Yellow Cards
HR = Home Team Red Cards
AR = Away Team Red Cards



Before using I, I practised the query I wanted to run in the MongoDB shell.  The query is:


> use Footie 
switched to db Footie 
> db.Results.find({},{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id: 0})

Which breaks down as:

> db.Results.find({}, means run a query on the Results collection and provide no filter parameters, i.e. give everything. 

...and...

{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id: 0}) 
 means turn all the fields with a "1" on in the output and turn the _id (which is on by default) off.

...and this yields (abridged):

> db.Results.find({},{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id: 0})
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Luton", "HF" : "12", "AF" : "15", "AR" : "0", "AY" : "0", "HomeTeam" : "Accrington" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Luton", "HF" : "12", "AF" : "15", "AR" : "0", "AY" : "0", "HomeTeam" : "Accrington" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Plymouth", "HF" : "9", "AF" : "7", "AR" : "0", "AY" : "0", "HomeTeam" : "AFC Wimbledon" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Northampton", "HF" : "11", "AF" : "8", "AR" : "0", "AY" : "1", "HomeTeam" : "Bristol Rvs" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Newport County", "HF" : "9", "AF" : "12", "AR" : "0", "AY" : "2", "HomeTeam" : "Cambridge" }

To get the same data into a R data frame I did the following to set things up:








> library(RMongo)
Loading required package: rJava
> mg1 <- mongoDbConnect('Footie')
> print(dbShowCollections(mg1))
[1] "Results"        "system.indexes"

...then this to run the query.  

> query <- dbGetQueryForKeys(mg1, 'Results', "{}","{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id:0}")
> data1 <- query

(Note the use of the dbGetQueryForKeys method which splits the Mongo shell query shown above into two parts).

Which gives this output (abridged):

> data1
          HomeTeam       AwayTeam HF AF HY AY HR AR X_id
1       Accrington          Luton 12 15  1  0  0  0   NA
2    AFC Wimbledon       Plymouth  9  7  1  0  0  0   NA
3      Bristol Rvs    Northampton 11  8  1  1  0  0   NA
4        Cambridge Newport County  9 12  1  2  0  0   NA
5           Exeter         Yeovil  5 10  1  0  0  0   NA

...am not sure why I get the X_id put I'm sure I can deal with it!

I now need to get this side-by-side data (so home and away team on the same row) into a data frame where there's one row per match per team.

To do this I created one data frame for home teams, one for away teams then merged them.

First for the home team.  Get the home team data (columns 1, 3, 5 and 7) and then rename them to make them consistent when we combine home and away data frames:

> data2 <- data1[,c(1,3,5,7)
> colnames(data2) <- c("Team","Fouls","Yellows","Reds") 
> head(data2)
           Team Fouls Yellows Reds
1    Accrington    12       1    0
2 AFC Wimbledon     9       1    0
3   Bristol Rvs    11       1    0
4     Cambridge     9       1    0
5        Exeter     5       1    0
6    Hartlepool    12       4    0

Get away team data and rename:

> data3 <- data1[,c(2,4,6,8)]
> colnames(data3) <- c("Team","Fouls","Yellows","Reds")
            Team Fouls Yellows Reds
1          Luton    15       0    0
2       Plymouth     7       0    0
3    Northampton     8       1    0
4 Newport County    12       2    0
5         Yeovil    10       0    0
6      Morecambe    14       2    0

Merge the two together using the bind function:

data4 < rbind(data2, data3)

> head(data4)
           Team Fouls Yellows Reds
1    Accrington    12       1    0
2 AFC Wimbledon     9       1    0
3   Bristol Rvs    11       1    0
4     Cambridge     9       1    0
5        Exeter     5       1    0
6    Hartlepool    12       4    0

So as a quick check, in the first data frame we had this as the bottom result:

> data1[371,]
    HomeTeam   AwayTeam HF AF HY AY HR AR X_id
371   Yeovil Portsmouth  7 10  1  0  0  1   NA

Now you can see this is split over two rows of our data frame:
> data4[c(371,742),]
           Team Fouls Yellows Reds
371      Yeovil     7       1    0
3711 Portsmouth    10       0    1

Now aggregate to get the count per team across the season so far.  Here you first list the values you're aggregating, then what to group them by, the the function (sum in this case):

> data_agg <-aggregate(list(Fouls=data4$Fouls,Yellows=data4$Yellows,Reds=data4$Reds), list(Team=data4$Team), sum)

Which gives us (abridged):

> data_agg
             Team Fouls Yellows Reds
1      Accrington   302      57    3
2   AFC Wimbledon   377      40    3
3          Barnet   331      46    3
4     Bristol Rvs   299      39    2

Not much use so time to order the data using the "order" function.  There are many ways to order, I selected to order by fouls first, then yellow cards, then red cars.

> agg_sort <- data_agg[order(-data_agg$Fouls,-data_agg$Yellows,-data_agg$Reds),]
> agg_sort
             Team Fouls Yellows Reds
22        Wycombe   420      48    2
13      Mansfield   417      62    5
2   AFC Wimbledon   377      40    3
8     Dag and Red   373      42    2
21      Stevenage   351      56    3
12          Luton   351      53    3
7    Crawley Town   348      52    4
16    Northampton   347      44    5
11  Leyton Orient   338      45    3
19       Plymouth   332      53    0
3          Barnet   331      46    3
17   Notts County   329      56    3
5       Cambridge   329      38    3
23         Yeovil   317      34    3
24           York   316      50    3
18         Oxford   310      46    2
15 Newport County   306      37    2
1      Accrington   302      57    3
9          Exeter   301      38    0
4     Bristol Rvs   299      39    2
14      Morecambe   294      55    2
6        Carlisle   284      39    3
20     Portsmouth   264      28    5
10     Hartlepool   252      45    2

So there, it's official*, Wycombe Wanderers are the naughtiest team in English League 2.



(*Apologies for fans of Wycombe.  It;s just stats geekery and no reflection on your fine football team!).