Simple Soccer’s Pythagorean Formula (part II)

29 03 2013

In the last post we started discussing the application of Pythagorean expectation to soccer. The first results were not that encouraging, but, as was already pointed out, the issue was with the ‘3 points for a win, 1 point for a draw’ rule in soccer.

This rule means that the simple assumption:

Pythagorean Points = Win% * Number of Games * 3

is wrong, as the number of points distributed between the two participants in a game is on average lower than 3 (it is 3 points when one team wins, but only 2 points in case of a draw).

What is the average number of points distributed per game (PdPG)?
I’ve made the calculations on the same dataset described in the previous post: 37 main soccer leagues from all over the world, dating back (in the majority of cases) to the 1990s -> altogether 173 946 observations (games).

73% of games in this dataset ended with a win for one of the teams and 27% ended in a draw -> so the average PdPG is 0.73 * 3 + 0.27 * 2 = 2.73.
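The arithmetic above can be sketched in a few lines of Python. This is only an illustration; the function name `points_distributed_per_game` is my own and does not come from the original analysis.

```python
def points_distributed_per_game(win_share: float) -> float:
    """Average points handed out per game, given the share of games with a winner."""
    draw_share = 1.0 - win_share
    # A decided game distributes 3 points (3 + 0); a draw distributes 2 points (1 + 1).
    return 3.0 * win_share + 2.0 * draw_share


pdpg = points_distributed_per_game(0.73)
print(round(pdpg, 2))  # 2.73
```

The same function gives 2.81 for a league where 81% of games end with a win.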

Naturally this parameter varies between leagues, but only slightly: in 29 leagues it is within 0.03 of the mean.

Only the 1st Bulgarian league is a bit of an outlier with PdPG = 2.81, as 81% of games there end with a win. At the other end, PdPG in the 2nd leagues of France and Italy is 2.68.

You can check the full list of leagues and their PdPG on the following graph:

[Graph: PdPG by league]

This parameter is very stable over time: in the last 10 seasons, the last 5 seasons and also the last 3 seasons the average PdPG is the same, 2.73.

Now we can calculate the Pythagorean exponent one more time by minimizing the RMSE between Pythagorean points (using 2.73 instead of 3 in the above formula) and actual points.
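For reference, the two error measures used in this post (RMSE here, and the MAE reported alongside it in the results) can be written out as follows. The notation is my own: N is the number of observations (team-seasons), and for observation i the Pythagorean points are written with a hat and the actual points without.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{P}_i - P_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{P}_i - P_i\right|
```

RMSE penalizes large misses more heavily than MAE, which is why it is always at least as large as MAE on the same data.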

But I will do it differently – I will check which pair of parameters (Pythagorean exponent; PdPG) minimizes the error.
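A joint search over both parameters could look roughly like the sketch below. It rests on several assumptions of mine: Win% is taken in the usual Pythagorean form GF^x / (GF^x + GA^x), the optimizer is a plain grid search (the post does not say which method was actually used), all function names are hypothetical, and the team-seasons are synthetic, generated from the formula itself so that the search has a known answer to recover.

```python
import math


def pythagorean_points(gf, ga, games, exponent, pdpg):
    # Assumed Win% form: GF^x / (GF^x + GA^x)
    win_share = gf ** exponent / (gf ** exponent + ga ** exponent)
    return win_share * games * pdpg


def rmse(rows, exponent, pdpg):
    # rows hold (goals for, goals against, games played, actual points)
    squared = [(pythagorean_points(gf, ga, n, exponent, pdpg) - pts) ** 2
               for gf, ga, n, pts in rows]
    return math.sqrt(sum(squared) / len(squared))


def fit(rows):
    """Grid search: exponent 1.00..2.00 and PdPG 2.50..3.00, both in steps of 0.01."""
    best = None
    for e100 in range(100, 201):
        for p100 in range(250, 301):
            exponent, pdpg = e100 / 100, p100 / 100
            err = rmse(rows, exponent, pdpg)
            if best is None or err < best[0]:
                best = (err, exponent, pdpg)
    return best


# Synthetic team-seasons (NOT real data), built with exponent 1.30 and PdPG 2.70.
goals = [(70, 30), (55, 45), (40, 50), (30, 65), (48, 48)]
toy = [(gf, ga, 38, pythagorean_points(gf, ga, 38, 1.30, 2.70)) for gf, ga in goals]

err, exponent, pdpg = fit(toy)
print(exponent, pdpg)  # 1.3 2.7
```

On real data the minimum RMSE will not be zero, and a finer grid or a numerical optimizer may be preferable; the grid is used here only because it is easy to read.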

Results based on 9 961 observations from 37 leagues: Pythagorean exponent = 1.32, PdPG = 2.74 -> RMSE = 4.16, MAE = 3.33.

So the regression gives nearly the same PdPG as the one calculated directly as an average.

What changes if we shorten the time frame?

• last 10 seasons: Pythagorean exponent = 1.34, PdPG = 2.74 -> RMSE = 4.08, MAE = 3.26
• last 5 seasons: Pythagorean exponent = 1.34, PdPG = 2.74 -> RMSE = 4.06, MAE = 3.25
• last 3 seasons: Pythagorean exponent = 1.33, PdPG = 2.74 -> RMSE = 4.02, MAE = 3.22

Errors decrease slightly as the time frame (and with it the number of observations) shrinks, but the parameters are very stable.

What impact on RMSE do parameters calculated separately for every league have?

The graph below presents league-specific parameters (Pythagorean exponent in the lower part and PdPG in the upper part, both on the left scale) and ΔRMSE (blue shaded bars, right scale). PdPG varies in nearly the same way as already discussed above: between 2.67 and 2.81. The Pythagorean exponent varies between 1.18 and 1.45.

We can observe the biggest decrease in RMSE in the 3 leagues already mentioned, for which PdPG differs the most from the general PdPG -> altogether, for these leagues we would decrease RMSE by 0.27 and MAE by nearly 0.20 by applying league-specific parameters.

RMSE by league
It’s interesting to look at RMSE on a league level, as the differences are significant: from 3.47 in the 1st Russian League to 4.57 in the English Championship. The graph below presents RMSE by league, calculated on the dataset covering the full time frame. The upper end of each bar represents RMSE based on general parameters, whereas the lower end represents RMSE based on league-specific parameters.

It seems that some leagues have much lower ‘volatility’ -> we will see in a future post how this impacts predictions.

What’s important, compared to the first version of the formula, is that this model is ‘symmetrical’ -> 49.7% of observations have a negative error (Pythagorean points < actual points) and 50.3% of observations have a positive error (Pythagorean points > actual points).
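A symmetry check of this kind is easy to reproduce. The sketch below is only an illustration: the function name `error_sign_shares` is hypothetical and the four team-seasons are made-up numbers, not taken from the dataset.

```python
def error_sign_shares(pythagorean, actual):
    """Return (share with negative error, share with positive error)."""
    errors = [p - a for p, a in zip(pythagorean, actual)]
    n = len(errors)
    negative = sum(e < 0 for e in errors) / n  # Pythagorean points < actual points
    positive = sum(e > 0 for e in errors) / n  # Pythagorean points > actual points
    return negative, positive


# Made-up Pythagorean vs. actual points for four team-seasons.
shares = error_sign_shares([50, 40, 60, 45], [52, 38, 61, 44])
print(shares)  # (0.5, 0.5)
```

Shares close to 50/50 suggest the formula does not systematically over- or under-predict points.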

Distribution of the absolute error (MAE is its average):

[Graph: distribution of absolute error]

So for 50% of observations (teams) Pythagorean points differ from actual points by less than 2.85 points.

The definition of an outlier depends on the subjective thresholds you set. Still, I marked some of them clearly on this graph: for example, there are 5% of observations (509 teams) for which Pythagorean points differ from actual points by more than 8 points -> this means that on average we should expect about one such over- or under-achieving team in a league season.
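The two figures quoted here (the median absolute error and the share of teams beyond a threshold) can be computed along the following lines. This is a sketch with a hypothetical function name and made-up numbers; only the 8-point threshold comes from the text.

```python
from statistics import median


def absolute_error_summary(pythagorean, actual, threshold=8.0):
    """Return (median absolute error, share of observations beyond the threshold)."""
    abs_errors = [abs(p - a) for p, a in zip(pythagorean, actual)]
    share_beyond = sum(e > threshold for e in abs_errors) / len(abs_errors)
    return median(abs_errors), share_beyond


# Made-up Pythagorean vs. actual points for five team-seasons.
med, share = absolute_error_summary([50, 40, 60, 45, 70], [52, 38, 61, 54, 70])
print(med, share)  # 2 0.2
```

As a cross-check of the figures in the text, 509 out of 9 961 observations is about 5.1%.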

In future posts we will take a look at over- and under-achievers -> and how it translates into predictions.

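For now, our general Soccer’s Pythagorean formula stands at:

Pythagorean Points = Win% * Number of Games * 2.74

with Win% calculated using a Pythagorean exponent of 1.32.

In the next post I will replicate Martin Eastwood’s work with what I call the multiexponential Pythagorean formula, to see what added accuracy we can gain.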
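To make the general parameters found in this post concrete (exponent 1.32, PdPG 2.74), here is a minimal sketch of the calculation. It assumes the usual Pythagorean form for Win%, GF^x / (GF^x + GA^x); the function and argument names are my own.

```python
def pythagorean_points(goals_for, goals_against, games, exponent=1.32, pdpg=2.74):
    """Expected points from goals scored and conceded (assumed Win% form)."""
    if goals_for == 0 and goals_against == 0:
        # No goals at all: treat the team as perfectly average to avoid 0/0.
        win_share = 0.5
    else:
        gf = goals_for ** exponent
        ga = goals_against ** exponent
        win_share = gf / (gf + ga)
    return win_share * games * pdpg


# A team with equal goals for and against gets half of the distributed points.
print(round(pythagorean_points(50, 50, 38), 2))  # 52.06
```

Passing league-specific values for `exponent` and `pdpg` gives the league-level variant discussed earlier.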