In the last post we derived the simple Soccer’s Pythagorean Formula, which gave us the first satisfactory results in terms of RMSE.
We found that general parameters (parameters calculated for all the leagues together) are stable over different time-frames, and that league-specific parameters reduce RMSE significantly only in a few cases, namely where the PdPG (points distributed per game) of a given league differs the most from the general PdPG.
Today we will look at what I call the multiexponential Soccer’s Pythagorean Formula (meSPF): a formula in which each exponent is allowed to take a different value.
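To make "different values for different exponents" concrete, here is a minimal Python sketch. The functional form, the function name `pythagorean_points` and every number in it are assumptions chosen for illustration; this is not the fitted meSPF from this post. It only shows how a single-exponent formula becomes a special case once each exponent is free to vary.

```python
def pythagorean_points(goals_for, goals_against, games, pdpg, a, b, c):
    """Hypothetical multi-exponent form (an assumption, not the fitted meSPF).

    share  = GF**a / (GF**b + GA**c)
    points = games * pdpg * share
    With a == b == c this collapses to a single-exponent Pythagorean form.
    """
    share = goals_for ** a / (goals_for ** b + goals_against ** c)
    return games * pdpg * share


# Illustrative values only: 38 games, PdPG of 2.75, made-up exponents.
single = pythagorean_points(60, 40, games=38, pdpg=2.75, a=1.3, b=1.3, c=1.3)
multi = pythagorean_points(60, 40, games=38, pdpg=2.75, a=1.3, b=1.25, c=1.35)
print(round(single, 2), round(multi, 2))
```

With equal exponents and equal goals for and against, the share in this sketch is exactly one half, which is a quick sanity check on any formula of this family.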
Just as a reminder: we minimize RMSE on a dataset consisting of 37 major soccer leagues dating back (in most cases) to the 1990s -> altogether 9,961 team seasons.
Resulting formula (RMSE: 4.08, MAE: 3.27):
So we reduce RMSE by 0.08 and MAE by 0.06 compared to the simple Soccer’s Pythagorean Formula (sSPF).
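For readers who want to reproduce the two error measures, here is a minimal sketch of how RMSE and MAE can be computed from actual and Pythagorean points. The function names and the three team seasons are made up for illustration and are not data from this post.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted points."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between actual and predicted points."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up numbers for three team seasons, not data from the post.
actual_points = [80, 52, 31]
predicted_points = [76, 55, 31]
print(rmse(actual_points, predicted_points))
print(mae(actual_points, predicted_points))
```

RMSE weights large misses more heavily than MAE does, so on the same data RMSE is never smaller than MAE.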
This formula is also stable over time. Refitting the parameters over shorter time-frames (the last 10, 5 and 3 seasons) reduces RMSE by less than 0.01 in each case.
sSPF vs meSPF approach: League-by-league comparison
The graph below compares RMSE for the simple Soccer’s Pythagorean Formula and the multiexponential Soccer’s Pythagorean Formula (both based on general, not league-specific, parameters):
Red bars mean a decrease in RMSE when applying meSPF, whereas green bars mean that RMSE increases.
RMSE increases in only a few cases, and not substantially: the largest increase is in USA MLS (by 0.05).
For the majority of leagues RMSE decreases. With the exception of two leagues, it decreases by no more than 0.18. Those two outliers are the 1st Swiss League (RMSE decreases by 0.40) and the 1st Scottish League (RMSE decreases by 0.32).
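A league-by-league comparison of this kind can be sketched as follows. The row layout, the function name `rmse_change_by_league` and the numbers are assumptions for illustration; the two leagues are placeholders, not leagues from the dataset.

```python
import math
from collections import defaultdict


def rmse_change_by_league(rows):
    """Return {league: RMSE(meSPF) - RMSE(sSPF)}; negative means meSPF fits better.

    Each row is (league, actual_points, sspf_points, mespf_points).
    """
    sq_sspf = defaultdict(list)
    sq_mespf = defaultdict(list)
    for league, actual, sspf, mespf in rows:
        sq_sspf[league].append((actual - sspf) ** 2)
        sq_mespf[league].append((actual - mespf) ** 2)

    change = {}
    for league in sq_sspf:
        rmse_sspf = math.sqrt(sum(sq_sspf[league]) / len(sq_sspf[league]))
        rmse_mespf = math.sqrt(sum(sq_mespf[league]) / len(sq_mespf[league]))
        change[league] = rmse_mespf - rmse_sspf
    return change


# Made-up rows for two placeholder leagues, not data from the post.
rows = [
    ("League A", 80, 74, 77),
    ("League A", 30, 33, 31),
    ("League B", 55, 54, 53),
    ("League B", 45, 46, 47),
]
changes = rmse_change_by_league(rows)
for league, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(league, round(delta, 2))
```

Sorting by the change puts the leagues that benefit most from meSPF first, which is the ordering you would want for a bar chart like the one described above.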
It’s quite interesting that these two leagues stand out from the crowd, so let’s take a more detailed look at them.
For the 1st Swiss League I have data from the last 9 seasons, and in 7 of them RMSE decreases by at least 0.37 when applying meSPF instead of sSPF (in the other two it decreases by 0.23 and 0.02) -> so this difference is persistent over time.
The graph below compares the sSPF and meSPF approaches for the 1st Swiss League:
All data are plotted against Score Ratio (goals for / goals against), not Score Differential, because in a Pythagorean formula the ratio is more meaningful than the differential -> more on this in a future post.
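As a small illustration of how the two measures can disagree (my own toy example with made-up goal totals, not necessarily the argument of the future post): two teams can share the same Score Differential while having quite different Score Ratios. In the classic single-exponent Pythagorean form, GF^x / (GF^x + GA^x), the expected share depends on goals only through the ratio GF / GA.

```python
def score_ratio(goals_for, goals_against):
    """Score Ratio as defined in the text: goals for / goals against."""
    return goals_for / goals_against


def score_differential(goals_for, goals_against):
    """Score Differential: goals for minus goals against."""
    return goals_for - goals_against


# Made-up goal totals: same differential, different ratios.
team_a = (60, 30)
team_b = (90, 60)
print(score_differential(*team_a), score_ratio(*team_a))  # 30 2.0
print(score_differential(*team_b), score_ratio(*team_b))  # 30 1.5
```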
Actual points and Pythagorean points (sSPF approach) are presented in the upper part of the graph.
Notice in particular that for large Score Ratios (over 1.5) Pythagorean points (orange) are for the most part below actual points (blue). It seems that in this league better teams are able to gain more points than comparable teams in other soccer leagues and, as noted earlier, this phenomenon has persisted for years.
The difference in Pythagorean points between the meSPF and sSPF approaches is presented in the lower part of the graph. Green areas represent positive differences (meSPF points > sSPF points), whereas red areas represent negative differences.
Notice that for very low and very high Score Ratios the meSPF approach predicts more points than the sSPF approach (in most cases the difference is up to 2 points), which results in a better fit to actual points.
Still, the differences between actual points and Pythagorean points are larger than that (for Score Ratios > 1.5 they are in the range of 3 to 9 points), which means that for this particular league ‘league-specific’ parameters can give us a substantially better fit than general parameters.
For the 1st Scottish League I have data from the last 16 seasons, and in 13 of them (and 10 of the last 11) RMSE decreases by at least 0.21 when applying meSPF instead of sSPF, so the difference between the sSPF and meSPF approaches is also persistent over time.
The graph below compares the sSPF and meSPF approaches for the 1st Scottish League:
The same trend as in the 1st Swiss League is visible here, although the 1st Scottish League has even bigger outliers in terms of Score Ratio (naturally Rangers and Celtic) -> notice that for Score Ratios over 2.5, Pythagorean points (orange) are for the most part below actual points (blue), and these points have the biggest influence on the error.
For these outliers the meSPF approach predicts 2 to 6 points more than the sSPF approach, which results in a better fit to actual points.
As the differences between actual points and Pythagorean points are not much larger than that, it seems that for this particular league ‘league-specific’ parameters will not give us a substantially better fit than general parameters.
What impact do parameters calculated for every league separately have on RMSE?
The graph below presents the impact of league-specific parameters on RMSE:
The upper end of a bar represents RMSE based on general parameters, whereas the lower end represents RMSE based on league-specific parameters.
There are a few leagues for which RMSE decreases by more than 0.20, but the 1st Swiss League really stands out -> RMSE decreases by 0.46 when applying league-specific parameters, which only confirms the earlier finding (a better fit is needed for teams with higher Score Ratios).
As a side note: RMSE for the 1st Scottish League decreases by 0.08, and on average RMSE decreases by 0.11.
Distribution of average absolute error (MAE):
Just as a reminder: this graph shows for how many observations (teams) Pythagorean points differ from actual points by a given amount, e.g. for 50% of teams the difference is no more than 2.79 -> altogether no big changes compared to the sSPF approach.
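A distribution like this can be read off the absolute errors directly. Below is a minimal sketch; the function names and the five team seasons are made up for illustration. The "50% of teams" figure quoted above corresponds to the median absolute error.

```python
import statistics


def share_within(actual, predicted, threshold):
    """Fraction of teams whose absolute error is at most `threshold` points."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(e <= threshold for e in errors) / len(errors)


def median_abs_error(actual, predicted):
    """Absolute error that half of the teams do not exceed."""
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))


# Made-up numbers for five team seasons, not data from the post.
actual_points = [80, 65, 52, 44, 31]
predicted_points = [76, 66, 55, 44, 36]
print(share_within(actual_points, predicted_points, threshold=3))
print(median_abs_error(actual_points, predicted_points))
```

Evaluating `share_within` over a range of thresholds gives the cumulative curve that such a graph plots.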
So we have the first version of meSPF, which slightly improves the fit to actual data compared to sSPF. We also see that there are a few leagues in which teams behave significantly differently from the general pattern.
There is still one additional wrinkle that should improve the fit, but for that I invite you to the next post.