Correct

How do you prove that your analysis is correct? Well, first of all, a proof might be too strong a demand. Maybe it is enough to have an analysis that is probable. Normally a proof would need a double-blind randomised test, which is not feasible within football. A good alternative is Bayesian statistics, where you calculate how probable your analysis is. The difference is that when you want to prove something, you calculate how strong the evidence is in light of your theory, whereas in Bayesian statistics you calculate how probable your theory is in light of the evidence.

Your analysis is a kind of theory. You want to make a point with your analysis. If there is no point in your analysis, then it is probably just a description of the match. Yet as soon as you draw conclusions from your analysis, you have a theory and you claim that your theory is supported by the data, i.e. the evidence. In football, unlike when you try to prove something, the data is not in question. Sometimes there is a discussion on whether someone has counted, for example, the correct number of passes. But most of the time there is no discussion about the data.

But there is a lot of discussion about what the data tells us. This is exactly what is covered by the Bayesian philosophy: you want to show how probable your theory is given the data (the evidence).
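The Bayesian step described above can be made concrete with a minimal sketch of Bayes' rule. The function name `posterior` and all three input numbers are assumptions chosen purely for illustration; they do not come from any real match data.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: probability of the theory given the evidence."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


# Hypothetical inputs, for illustration only.
prior = 0.5       # how plausible the theory was before looking at the data
p_e_true = 0.8    # chance of seeing this data if the theory holds
p_e_false = 0.4   # chance of seeing this data if the theory does not hold

p_theory = posterior(prior, p_e_true, p_e_false)
print(round(p_theory, 3))
```

Note the direction of the calculation: the inputs are statements about how likely the evidence is under each theory, and the output is a statement about how likely the theory is given the evidence.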

One of the ways in which you could do this is by showing that your analysis is a true reflection of what really happened on the pitch. This is tricky though. First of all, a reflection sounds pretty much like a description of the match rather than an analysis. But what is worse is that it would make you a realist. Now, being called a realist sounds like a compliment. And a lot of philosophers call themselves realists. But we are talking here about the philosophical theory called realism. Realism claims that our true knowledge mirrors nature. The problem with realism is that it has many faults. As it turns out, it is quite difficult to make sense of the idea of truth. Truth is absolute and we can’t find anything that absolute except within mathematics and formal logic. Everything else has a measure of uncertainty and that uncertainty makes it impossible to find absolute truth. Yet as soon as you no longer have absolute truth, realism also becomes impossible, as it is no longer possible to prove that your ideas are realistic when they carry a measure of uncertainty.

Another problem with realism is that any useful form of realism makes use of atomism, while our language and meaning are holistic. So it is impossible to express isolated atomic parts of reality in our language. That makes atomism doubtful as well.

All of this is highly relevant for football because most football analyses treat football as something that can be analysed in smaller parts. This idea, that you can understand football by understanding individual atoms of reality, is highly unlikely to be right. In football everything is connected to everything else. In that sense, like our language and meaning, football is also holistic.

Fortunately, there are alternatives to realism. These alternatives are called anti-realism. The most important anti-realistic theory is pragmatism. Again, pragmatism sounds more pragmatic than it is. We are talking here about philosophical pragmatism. Philosophical pragmatism is defined as any theory that also cares about values other than truth. Pragmatists don’t want to figure out how things really are, because they find it more interesting, for example, to figure out what works best. They are interested in finding out how things hang together, rather than how things really are. In terms of football: pragmatists are looking for useful holistic patterns of play that help them achieve their goals, rather than wanting to know the truth about the game.

The only measure of success or correctness for pragmatists is whether they achieve their goals. So any football analysis that leads to the team winning is correct. Any football analysis that leads to winning bets is correct. Any football analysis that leads to players becoming more efficient is correct.

Correlation

Most people have heard that correlation is not causation. Yet almost no-one has heard that correlation is not correlation. Technically, correlation only establishes a measure of how much two lines are similar to each other. Even this measure of similarity is not undisputed, as it relies on the least squares method for the regression of the two lines involved.

Here is issue one as described by Francis Anscombe, a famous statistician. The following four graphs all have the same regression line even though the data points are wildly different:

As you can see, only the regression of the first graph (the blue line) seems intuitively right to use. Even worse, all four data sets have practically the same correlation coefficient (about 0.82), so the number alone cannot tell them apart. That is what is meant by the statement: correlation is not correlation.
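Anscombe's point can be checked numerically. The sketch below uses the published values of Anscombe's quartet and computes the least squares line and the correlation coefficient for each of the four data sets; the helper name `summarize` is our own and not part of any library.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Sets I to III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (slope, intercept, r) of the least squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


results = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (slope, intercept, r) in results.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.2f}x, r = {r:.3f}")
```

All four lines of output should be nearly identical, even though a plot of the raw points shows four very different shapes.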

To make matters even worse: the regressions in the above figures all presume that the “real” line (again the blue line) can be calculated by using the horizontal axis. The least squares method basically calculates which line would involve the smallest squares to capture all the data points. Here “the smallest” is calculated as orientated towards the horizontal axis. Yet this is completely arbitrary. If one uses the vertical axis instead, one gets a different line, as shown below. The red line is the normal regression, but there is simply no mathematical argument why the black line is not correct.

Of course, in the example above it looks weird. The reason is that the red line follows the dots very closely. If you have two of those lines, you get a very high correlation. So, even though there is no sound argument for it, if the correlation is very high, one can still use it.
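The dependence on the chosen axis can be sketched as follows. The data points are invented for illustration and the function name `two_regressions` is an assumption of ours. The code fits one line that predicts y from x and one that predicts x from y, and expresses both as a slope in the same x-y plane so they can be compared.

```python
from math import sqrt


def two_regressions(xs, ys):
    """Slopes (as dy/dx) of the y-on-x and x-on-y least squares lines, plus r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_y_on_x = sxy / sxx   # line chosen to predict y from x
    slope_x_on_y = syy / sxy   # line chosen to predict x from y, rewritten as dy/dx
    r = sxy / sqrt(sxx * syy)
    return slope_y_on_x, slope_x_on_y, r


# Invented data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

slope_a, slope_b, r = two_regressions(xs, ys)
print(f"y on x: {slope_a:.3f}, x on y: {slope_b:.3f}, r = {r:.3f}")
print(f"ratio of slopes: {slope_a / slope_b:.3f}, r squared: {r * r:.3f}")
```

The ratio of the two slopes equals the squared correlation, so the two lines only coincide when the correlation approaches 100%. This is one way to see why a very high correlation makes the choice of axis matter less.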

So what is a very high correlation? As a rule of thumb, any correlation below 80% is suspect. And yes, we haven’t found that many correlations above 80%. So most correlations are spurious.

Underdetermination also plays a role with correlation. Even if you get a high correlation (>80%), the underdetermination of theory by data means that many other theories remain possible besides your own.

Correlations in football

Does this have any real-world application in the world of football? The answer is yes! Most correlations in football are less than 80% and should be taken with a pinch of salt. Furthermore, correlation can be gamed.

Let’s look at any correlation involving a team statistic like xG or xA. If one finds, for example, a correlation of 50% between team xG and the number of goals scored in the next season, how can that correlation be gamed? Easy! For one, the correlation between the xG of defenders (for instance 0.1) and future goals is very high, because the xG of defenders is very low and they will only score a few goals next season. But the correlation between the xG of strikers and future goals is quite low. We looked at the top scorer for each team in the Dutch Eredivisie and the Belgian Jupiler League and found only a 27% correlation between the xG of the striker before the season and the goals scored during the season.

But if we combine the high correlation of the defenders in our example with the low correlation of the strikers we found, then one gets a correlation of about 50%, which most of the time is considered a good correlation by people who are less strict than we are.
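A toy example shows how mixing positions changes the headline number. Every value below is invented for illustration and is not the Eredivisie or Jupiler League data mentioned above; the variable names are our own. The sketch computes the correlation for defenders alone, for strikers alone, and for both groups pooled.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented numbers: xG before the season, goals during the season.
defender_xg = [0.1, 0.2, 0.3, 0.4, 0.5]
defender_goals = [0, 0, 0, 1, 1]
striker_xg = [8, 10, 12, 14, 16]
striker_goals = [9, 7, 14, 8, 12]

defender_r = pearson(defender_xg, defender_goals)
striker_r = pearson(striker_xg, striker_goals)
pooled_r = pearson(defender_xg + striker_xg, defender_goals + striker_goals)

print(f"defenders: {defender_r:.2f}")
print(f"strikers:  {striker_r:.2f}")
print(f"pooled:    {pooled_r:.2f}")
```

In this toy data the pooled figure is driven mostly by the gap between the two positions, so it says very little about how well xG predicts goals for strikers. Where the pooled number ends up depends on the mix of players, which is exactly what makes it easy to game.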

Most importantly: decisions based on these kinds of correlations have a bigger risk of being wrong than decisions based on higher correlations and fewer combinations of underlying correlations. Especially when it comes to recruiting players, basing your decision on the wrong kind of correlation can end in quite a costly debacle.

No tail information

As Nassim Taleb makes clear, almost all correlations lack information about the tails of distributions. Correlations, if useful at all, only tell you something about average players. Yet football clubs and scouts are looking for exceptional players. It is highly unlikely that you will find exceptional players using correlations, because exceptional players outperform average players and are therefore located in the tail of a distribution.
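How little a moderate correlation says about the tail can be explored with a small simulation. This is a sketch under strong assumptions: two standard normal variables with a true correlation of 0.5, standing in for a predictor and an outcome, with a fixed random seed. Real football data is unlikely to be this well behaved.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


RHO = 0.5     # assumed true correlation
N = 20000     # number of simulated players
rng = random.Random(42)

xs, ys = [], []
for _ in range(N):
    x = rng.gauss(0, 1)
    noise = rng.gauss(0, 1)
    xs.append(x)
    ys.append(RHO * x + sqrt(1 - RHO ** 2) * noise)

r = pearson(xs, ys)

# Indices of the top 10% on the predictor and on the outcome.
top = N // 10
top_x = set(sorted(range(N), key=lambda i: xs[i], reverse=True)[:top])
top_y = set(sorted(range(N), key=lambda i: ys[i], reverse=True)[:top])
overlap = len(top_x & top_y) / top

print(f"sample correlation: {r:.2f}")
print(f"share of top-decile predictor values also top-decile on outcome: {overlap:.2f}")
```

The second number is the one a scout cares about: of the players who look exceptional on the predictor, how many turn out to be exceptional on the outcome. Under these assumptions it typically comes out well under half.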

Even in a 50% correlation there really is very little information, as can be seen from this graph:

[Image: graph illustrating a 50% correlation]