What does "should" mean here? The final outcome is the result not only of skill but also of luck. The more observations we have (e.g. the more games played), the smaller the role luck plays, but even after a full season, when we applaud the new champion, luck is still there (we will delve into this more in the future).
So to reduce the impact of luck we need more observations, but that doesn't only mean more games. We can also use events that occur more frequently and directly drive wins and losses -> like runs in baseball, points in basketball or goals in soccer -> and this is where Pythagorean expectation comes in.
Since Bill James's days, Pythagorean expectation has been applied to other sports, such as:
- basketball (Daryl Morey, Dean Oliver, Kevin Pelton)
- American football (Pro-Football, Advanced NFL Stats)
- soccer (Howard Hamilton, Martin Eastwood, Mark Taylor)
I will try to build on this work to see what predictive knowledge we can gain by applying Pythagorean expectation to soccer. Can Pythagorean expectation tell us more about future outcomes than basic indicators like number of points or score differential? We will see in future posts, but first we have to establish a Pythagorean formula for soccer.
We will start with the basic formula and add some wrinkles in later posts. Still, I want to keep things as simple as possible, so for example I will not delve into the most advanced, and probably most accurate, derivation of the Pythagorean soccer formula, which was proposed by Howard Hamilton (for more info see Pythagorean Formula 2.0).
Simple Pythagorean formula (baseball):
Win% = RS^2 / (RS^2 + RA^2), where RS = runs scored and RA = runs allowed.
So in baseball we commonly use 2 as the exponent, although 1.83 has been shown to give more accurate results.
In basketball many analysts have used exponents between 13 and 16.5, whereas for the NFL 2.37 is commonly assumed.
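To make the role of the exponent concrete, here is a minimal Python sketch of the general formula. The function name, the sample team and the handling of a 0:0 record are my own assumptions for illustration, not part of anybody's published method.

```python
def pythagorean_win_pct(scored: float, allowed: float, exponent: float = 2.0) -> float:
    """Expected win share: scored^x / (scored^x + allowed^x)."""
    s = scored ** exponent
    a = allowed ** exponent
    if s + a == 0:
        # Assumption: a team with nothing scored and nothing allowed is treated as average.
        return 0.5
    return s / (s + a)


# Hypothetical baseball team: 800 runs scored, 700 runs allowed.
for x in (2.0, 1.83):
    print(f"exponent {x}: expected Win% = {pythagorean_win_pct(800, 700, x):.3f}")
```

The higher the exponent, the more strongly a given scoring advantage translates into expected wins, which is why a high-scoring sport like basketball ends up with a much larger exponent than baseball.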
What exponent should we apply in soccer?
To calculate it I used data from 37 main soccer leagues from all over the world, for most of them dating back to the 1990s -> altogether 580 seasons and 9961 observations (team seasons).
I used only regular-season data (so no playoffs/play-outs and no subsequent splits into subgroups like championship / relegation, etc.).
My aim was to minimize RMSE (root mean squared error), i.e. the square root of the average squared error between the predicted and actual number of points.
In this first version of the soccer Pythagorean formula I assume that points per game = Win% * 3 (so season points = Win% * 3 * games played), due to the 'three points for a win' rule in soccer.
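For reference, the two error measures quoted in this post (RMSE and MAE) can be computed as in the sketch below. The function names and the tiny sample lists are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error: square root of the average squared error."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(predicted, actual):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical predicted vs. actual season points for three teams.
predicted_points = [70.0, 55.0, 40.0]
actual_points = [66, 57, 33]
print(f"RMSE = {rmse(predicted_points, actual_points):.2f}")
print(f"MAE  = {mae(predicted_points, actual_points):.2f}")
```

Because the errors are squared before averaging, RMSE punishes big misses more heavily than MAE, so RMSE is never smaller than MAE on the same data.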
The results: general exponent = 1.20, RMSE = 6.19, MAE (Mean Absolute Error) = 5.13.
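A rough sketch of how such an exponent could be fitted is below. Everything in it is an assumption made for illustration: the five team seasons are invented, the predicted season points are taken as 3 * games * Win%, and the search is a plain grid search over candidate exponents, which is not necessarily the procedure behind the numbers above.

```python
import math

# Hypothetical team seasons: (goals_for, goals_against, games, actual_points).
SEASONS = [
    (80, 30, 38, 86),
    (60, 45, 38, 62),
    (50, 50, 38, 50),
    (40, 60, 38, 38),
    (30, 75, 38, 25),
]


def predicted_points(goals_for, goals_against, games, exponent):
    """Season points under the simple assumption: points = 3 * games * Win%."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return 3 * games * gf / (gf + ga)


def rmse_for(exponent, seasons):
    """RMSE between predicted and actual points for one candidate exponent."""
    squared = [
        (predicted_points(gf, ga, games, exponent) - points) ** 2
        for gf, ga, games, points in seasons
    ]
    return math.sqrt(sum(squared) / len(squared))


def fit_exponent(seasons, lo=0.5, hi=3.0, step=0.01):
    """Grid search: return the candidate exponent with the lowest RMSE."""
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda x: rmse_for(x, seasons))


best = fit_exponent(SEASONS)
print(f"exponent = {best:.2f}, RMSE = {rmse_for(best, SEASONS):.2f}")
```

A grid search is slow but transparent; with real data a proper one-dimensional optimizer would do the same job faster.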
Will we get better results if we apply different exponents for different leagues?
It seems that it doesn't have a significant impact. League-specific exponents vary between 1.08 and 1.32 -> as presented on the graph below (the number in brackets after the league abbreviation indicates the number of seasons in the analysis).
By applying a league-specific exponent we reduce RMSE (ΔRMSE in the graph + right scale) by 0.15 at most, and in the majority of cases by only a tiny amount.
What about the consistency of the general soccer exponent over time?
Below are the results for shorter time-frames:
-> last 10 seasons: exponent = 1.22, RMSE = 6.14, MAE (Mean Absolute Error) = 5.09, number of observations: 6 149
-> last 5 seasons: exponent = 1.22, RMSE = 6.12, MAE (Mean Absolute Error) = 5.09, number of observations: 3 172
-> last 3 seasons: exponent = 1.18, RMSE = 6.09, MAE (Mean Absolute Error) = 5.13, number of observations: 1 907
These results are very consistent, which means that for a basic soccer Pythagorean formula we can use 1.20 as the general exponent.
Still, we can't be satisfied with these results, as the RMSE is quite high -> in other sports (baseball, basketball), as well as in Howard Hamilton's advanced Pythagorean formula, RMSE is at the level of 4 or lower. So what is the issue?
This simplified version of the soccer Pythagorean formula overestimates the number of points (for 85% of observations the predicted number of points is higher than the actual one). It's due to the 'three points for a win, one point for a draw' rule in soccer, which means that the number of points distributed in one particular game is either 2 (in case of a draw) or 3 (in case of a win).
But this is an issue for the next post.
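A quick back-of-the-envelope sketch shows the size of this effect for a perfectly average team (goals scored = goals conceded). The 38-game season and the 25% share of drawn games are assumed values picked for illustration, not figures from my data set.

```python
def average_points_actual(games: int, draw_share: float) -> float:
    """Average points per team: each game hands out 3 points (win) or 2 (draw),
    shared between two teams."""
    return games * (3 - draw_share) / 2


def average_points_simple_formula(games: int) -> float:
    """Simple formula for an average team: Win% = 0.5, points = Win% * 3 * games."""
    return games * 3 * 0.5


games = 38          # assumed season length
draw_share = 0.25   # assumed share of games ending in a draw

actual = average_points_actual(games, draw_share)
predicted = average_points_simple_formula(games)
print(f"actual average: {actual}, simple formula: {predicted}, gap: {predicted - actual}")
```

Under these assumptions the simple formula hands the average team 57 points, while the league only has 52.25 points per team to give away, so a good part of the error is built in before any exponent is chosen.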