This page shows how to acquire data from football-data.co.uk
%load_ext autoreload
%autoreload 2
 

Examine the source

This site provides an extensive list of match statistics, outcomes and odds. However, for some reason there is no easy way to download the data. The main page contains links to the main leagues. Each main league site then provides a list of links to each result csv, grouped by season and league.

mainpage_link = r'https://www.football-data.co.uk/data.php'
mainpage_bs = cache(mainpage_link, 'football_data_mainpage')
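`cache` is a helper defined in the accompanying module. A minimal sketch of what it might look like (the cache directory layout and file naming here are assumptions):

```python
import pathlib

import requests
from bs4 import BeautifulSoup

def cache(url, name, cache_dir=pathlib.Path('cache')):
    """Download url once, store the raw html under cache_dir/<name>.html,
    and return the (possibly cached) page parsed with BeautifulSoup."""
    cache_dir.mkdir(exist_ok=True)
    path = cache_dir / f'{name}.html'
    if not path.exists():
        # only hit the network on a cache miss
        path.write_text(requests.get(url).text, encoding='utf-8')
    return BeautifulSoup(path.read_text(encoding='utf-8'), 'html.parser')
```

Caching the raw HTML keeps repeated notebook runs from hammering the site while iterating on the parsing code.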

Since parallelization doesn't really work from within Jupyter notebooks (at least on Windows), I'll use this notebook to develop all the necessary functions and later run them from a normal Python script (with parallelization).

Understanding the features

The column names for this datasource are rather cryptic. Here are the most useful ones:

  • Date
  • HomeTeam, AwayTeam
  • FTHG, FTAG -> Full Time Home/Away Team Goals
  • B365H/D/A -> Bet365 home/draw/away odds
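As an illustration, here's how those columns can be pulled out with pandas once a csv is downloaded (the sample row below is made up, in the same format; real files contain many more columns):

```python
import io

import pandas as pd

# a tiny made-up sample in the csv format used by football-data.co.uk
sample = io.StringIO(
    'Date,HomeTeam,AwayTeam,FTHG,FTAG,B365H,B365D,B365A\n'
    '12/09/2020,Fulham,Arsenal,0,3,7.0,4.5,1.5\n'
)
# dates on the site are day-first, e.g. 12/09/2020
df = pd.read_csv(sample, parse_dates=['Date'], dayfirst=True)
print(df[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'B365H']])
```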

Parsing

The following two sections derive the method for generating all necessary download links to the individual csv files. Most of this derivation is not needed in the actual download script and is only kept here for future reference.

Parsing mainpage for country pages

The sub-pages we are looking for have URLs of roughly the shape <country_name>m.php, so let's look for links like this.

php_links = find_links_by_func(mainpage_bs)
            
print(f'Number of php links: {len(php_links)}')
print('Examples:')
print('\n'.join(php_links[65:75]))
Number of php links: 209
Examples:
spainm.php
francem.php
netherlandsm.php
belgiumm.php
portugalm.php
turkeym.php
greecem.php
Argentina.php
Austria.php
Brazil.php

So the good news is that the links we're looking for are in there. The bad news is that there's also a bunch of other stuff. Let's try using some regex magic.
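`find_links_by_pattern` also lives in the accompanying module. It might be sketched roughly like this (the fullmatch semantics and the `return_href` flag are inferred from how it is used in this notebook):

```python
import re

from bs4 import BeautifulSoup

def find_links_by_pattern(soup, pattern, return_href=True):
    """Return all links in soup whose href fully matches the regex pattern.

    Returns the href strings by default, or the <a> tags themselves
    when return_href is False.
    """
    regex = re.compile(pattern)
    links = [a for a in soup.find_all('a', href=True)
             if regex.fullmatch(a['href'])]
    return [a['href'] for a in links] if return_href else links
```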

country_links = find_links_by_pattern(mainpage_bs, r'\w+m\.php')
country_links
['englandm.php',
 'scotlandm.php',
 'germanym.php',
 'italym.php',
 'spainm.php',
 'francem.php',
 'netherlandsm.php',
 'belgiumm.php',
 'portugalm.php',
 'turkeym.php',
 'greecem.php']

Manually checking the site shows that this is exactly the list of links we're looking for. So far so good.

Parsing country pages for results

The country links we gathered in the last step are relative. To build the full links we have to join them with the base link.

example_country_link = urllib.parse.urljoin(mainpage_link, country_links[0])
country_html = cache(example_country_link, 'football_data_country_html')
example_country_link
'https://www.football-data.co.uk/englandm.php'

The result files all seem to be located under the same path: https://www.football-data.co.uk/mmz4281/

The next part of the path is the season, encoded as yyYY, where yy are the last two digits of the year in which the season begins and YY are the last two digits of the year in which it ends.
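The encoding can be sketched with a small hypothetical helper (not part of the actual download script, just to make the yyYY scheme concrete):

```python
def season_code(start_year):
    """Encode a season by its start year in the yyYY scheme,
    e.g. the 1995/96 season becomes '9596'."""
    return f'{start_year % 100:02d}{(start_year + 1) % 100:02d}'

print(season_code(1995))  # 9596
print(season_code(1999))  # 9900 (note the century rollover)
print(season_code(2020))  # 2021
```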

Getting all seasons

We'll try to parse all available years from the example country page:

csv_links = find_links_by_pattern(country_html, r'mmz4281/\d{4}/\w*\d\.csv')
csv_links[:5]
['mmz4281/2021/E0.csv',
 'mmz4281/2021/E1.csv',
 'mmz4281/2021/E2.csv',
 'mmz4281/2021/E3.csv',
 'mmz4281/1920/E0.csv']

From here we can easily extract the season data:

seasons = set(csv_link.split('/')[1] for csv_link in csv_links)
print(f'Number of seasons: {len(seasons)}')
print(f'Example: {next(iter(seasons))}')
Number of seasons: 28
Example: 0910

Let's refactor that in preparation for applying it to all the countries:

def unique_seasons(html):
    'Returns the set of all unique seasons for which result csv files are found in the given html'
    csv_links = find_links_by_pattern(html, r'mmz4281/\d{4}/\w*\d\.csv')
    return set(csv_link.split('/')[1] for csv_link in csv_links)
assert unique_seasons(country_html)==seasons
country_links = [urllib.parse.urljoin(mainpage_link, country_link) for country_link in country_links]

seasons = [unique_seasons(country_html)]  # start with the seasons from the cached England page
for country_link in country_links[1:]:
    print('Processing ', country_link)
    html = get_html(country_link)
    unique = unique_seasons(html)
    if unique:
        seasons.append(unique)
    else:
        print('Found no seasons')
Processing  https://www.football-data.co.uk/scotlandm.php
Processing  https://www.football-data.co.uk/germanym.php
Processing  https://www.football-data.co.uk/italym.php
Processing  https://www.football-data.co.uk/spainm.php
Processing  https://www.football-data.co.uk/francem.php
Processing  https://www.football-data.co.uk/netherlandsm.php
Processing  https://www.football-data.co.uk/belgiumm.php
Processing  https://www.football-data.co.uk/portugalm.php
Processing  https://www.football-data.co.uk/turkeym.php
Processing  https://www.football-data.co.uk/greecem.php

Let's see which seasons are available for all leagues:

seasons = list(functools.reduce(lambda a,b: a.intersection(b), seasons))
print('Number of seasons available in all leagues:', len(seasons))
print(', '.join(seasons))
Number of seasons available in all leagues: 26
0910, 9596, 1718, 9697, 1819, 1516, 0405, 0506, 1112, 0304, 1617, 1314, 1920, 0809, 0708, 2021, 0203, 9900, 1213, 0102, 1415, 9798, 1011, 0607, 0001, 9899

So apparently the data goes from 1995/96 all the way up to 2020/21 (the currently running season). I'll grab all the data that's available; we can sort it out later.

Getting the correct sub-league

Looking at the csv links from earlier, the next part is a sub-league code, in this case 'E0'.

csv_links[0]
'mmz4281/2021/E0.csv'

The letter(s) in this code refer to the country in which the league is played. The number seems to be a sort of ranking: for most countries, number 1 is assigned to the highest league, but not for all. In the example above for England, number 0 is assigned to the Premier League. We'll therefore try to get all the relevant subleague-number -> subleague-name mappings.

We still have the cached example HTML code of a country page.

Since we know that each country page has links to the last season (18/19), we'll parse for result links for this season:

def find_subpage_mapping(country_html):
    for link in find_links_by_pattern(country_html, r'mmz4281/1819/\w*\d\.csv', return_href=False):
        match = re.match(r'mmz4281/1819/(\w*\d)\.csv', link.get('href'))
        print(f'{match.group(1)}: {link.text}')
        
find_subpage_mapping(country_html)
E0: Premier League
E1: Championship
E2: League 1
E3: League 2
for country_link in country_links:
    country = re.search(r'co\.uk/(.+)\.php', country_link).group(1)
    html = cache(country_link, f'football_data_{country}')
    print('-'*20)
    print(country_link)
    find_subpage_mapping(html)
    print('-'*20)
--------------------
https://www.football-data.co.uk/englandm.php
E0: Premier League
E1: Championship
E2: League 1
E3: League 2
--------------------
--------------------
https://www.football-data.co.uk/scotlandm.php
SC0: Premier League
SC1: Division 1
SC2: Division 2
SC3: Division 3
--------------------
--------------------
https://www.football-data.co.uk/germanym.php
D1: Bundesliga 1
D2: Bundesliga 2
--------------------
--------------------
https://www.football-data.co.uk/italym.php
I1: Serie A
I2: Serie B
--------------------
--------------------
https://www.football-data.co.uk/spainm.php
SP1: La Liga Primera Division
SP2: La Liga Segunda Division
--------------------
--------------------
https://www.football-data.co.uk/francem.php
F1: Le Championnat
F2: Division 2
--------------------
--------------------
https://www.football-data.co.uk/netherlandsm.php
N1: Eredivisie
--------------------
--------------------
https://www.football-data.co.uk/belgiumm.php
B1: Jupiler League
--------------------
--------------------
https://www.football-data.co.uk/portugalm.php
P1: Liga I
--------------------
--------------------
https://www.football-data.co.uk/turkeym.php
T1: Futbol Ligi 1
--------------------
--------------------
https://www.football-data.co.uk/greecem.php
G1: Ethniki Katigoria
--------------------

We're interested in the highest league of each country, so we'll manually grab those codes:

class League[source]

League(name: str, code: str)
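`League` is a simple value container; a minimal sketch matching the signature above, as a NamedTuple (the actual implementation in the module may differ):

```python
from typing import NamedTuple

class League(NamedTuple):
    """A league, identified by its name and its football-data.co.uk code."""
    name: str
    code: str

# example: the English top division uses code E0
premier_league = League(name='england', code='E0')
print(premier_league)
```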

Putting it all together

We have all the seasons and league information:

FOOTBALL_DATA_LEAGUES
[League(name='england', code='E0'),
 League(name='scotland', code='SC0'),
 League(name='germany', code='D1'),
 League(name='italy', code='I1'),
 League(name='spain', code='SP1'),
 League(name='france', code='F1'),
 League(name='netherlands', code='N1'),
 League(name='belgium', code='B1'),
 League(name='portugal', code='P1'),
 League(name='turkey', code='T1'),
 League(name='greece', code='G1')]
FOOTBALL_DATA_SEASONS
['1314',
 '9900',
 '1112',
 '9697',
 '0708',
 '1920',
 '9798',
 '1819',
 '0405',
 '9596',
 '0001',
 '1213',
 '0203',
 '0102',
 '0809',
 '1617',
 '1516',
 '1718',
 '1011',
 '0506',
 '0607',
 '9899',
 '0304',
 '1415',
 '0910']
print('Number of leagues: ', len(FOOTBALL_DATA_LEAGUES))
print('Number of seasons for each league: ', len(FOOTBALL_DATA_SEASONS))
Number of leagues:  11
Number of seasons for each league:  25

For a given league and season

season = FOOTBALL_DATA_SEASONS[5]
league_code = FOOTBALL_DATA_LEAGUES[7].code

print('Season: ', season)
print('League code: ', league_code)
Season:  1920
League code:  B1

we can generate the link to the results file:

build_full_link(season, league_code, mainpage_link)
'https://www.football-data.co.uk/mmz4281/1920/B1.csv'
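Given the URL structure derived above, `build_full_link` amounts to a simple urljoin. A minimal sketch (the real implementation may differ in details):

```python
import urllib.parse

def build_full_link(season, league_code, base_link):
    """Build the csv download link from the fixed path structure
    mmz4281/<season>/<league_code>.csv, relative to the site root."""
    return urllib.parse.urljoin(base_link, f'mmz4281/{season}/{league_code}.csv')

print(build_full_link('1920', 'B1', 'https://www.football-data.co.uk/data.php'))
# https://www.football-data.co.uk/mmz4281/1920/B1.csv
```

Note that urljoin replaces the last path segment of the base link (data.php), which is exactly what we want here.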

Second source kaggle

A team of researchers derived a profitable method for sports betting by taking averages of odds from different bookmakers. They host their dataset on Kaggle:

https://www.kaggle.com/austro/beat-the-bookie-worldwide-football-dataset