West Coast Stat Views (on Observational Epidemiology and more): Choice of metric: the perils of data science

Friday, December 13, 2019

Choice of metric: the perils of data science

This is Joseph

I was reading this article and decided it was a good example of how the choice of metric can really change the answer to a study question. The author concludes that the best general in history was Napoleon:

Among all generals, Napoleon had the highest WAR (16.679) by a large margin. In fact, the next highest performer, Julius Caesar (7.445 WAR), had less than half the WAR accumulated by Napoleon across his battles. Napoleon benefited from the large number of battles in which he led forces. Among his 43 listed battles, he won 38 and lost only 5.

So what was the choice of metric?

Inspired by baseball sabermetrics, I opted to use a system of Wins Above Replacement (WAR). WAR is often used as an estimate of a baseball player’s contributions to his team. It calculates the total wins added (or subtracted) by the player compared to a replacement-level player. For example, a baseball player with 5 WAR contributed 5 additional wins to his team, compared to the average contributions of a high-level minor league player. WAR is far from perfect, but provides a way to compare players based on one statistic.

The problem with this metric is that it presumes that all players have an equal opportunity to accumulate wins. This leads to some odd outcomes:

Napoleon’s large battle count allowed him more opportunities to demonstrate his tactical prowess. Alexander the Great, despite winning all 9 of his battles, accumulated fewer WAR largely because of his shorter and less prolific career.
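To make the opportunity effect concrete, here is a rough sketch that divides each general's total WAR by his number of battles, using only the figures quoted in this post (Napoleon: 16.679 WAR over 43 battles; Alexander: 4.370 WAR over 9 battles). The per-battle rate is my own illustrative alternative, not something the article computes, and the helper name war_per_battle is made up for this example.

```python
# Figures quoted in this post: (total WAR, number of listed battles).
generals = {
    "Napoleon": (16.679, 43),
    "Alexander the Great": (4.370, 9),
}


def war_per_battle(total_war, battles):
    """Average WAR per battle: a rate rather than a cumulative total.

    Hypothetical helper for illustration; not part of the article's method.
    """
    return total_war / battles


# Rate for each general, keyed by name.
rates = {name: war_per_battle(war, n) for name, (war, n) in generals.items()}

# Print from highest to lowest per-battle rate.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    total, n = generals[name]
    print(f"{name}: total WAR {total:.3f} over {n} battles, {rate:.3f} per battle")
```

On a per-battle basis the order flips: Alexander comes out at roughly 0.49 WAR per battle against roughly 0.39 for Napoleon. A rate has its own problems (nine battles is a small sample, and it still gives no credit for wars won without fighting), so this is not a better ranking, just a demonstration that the ranking depends on whether you sum or average.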

Now it is true that the historical consensus is that both were good generals. But an unbroken string of impressive victories, against the regional superpower, gives Alexander a WAR of 4.370, about a quarter of Napoleon's score. I am not sure that the historical consensus is that Alexander was mediocre beside Napoleon because of the number of battles. If anything, there is a perspective that having to fight more battles is actually the sign of a weaker general. Consider these quotes:

There is no instance of a nation benefitting from prolonged warfare.

For to win one hundred victories in one hundred battles is not the acme of skill. To subdue the enemy without fighting is the acme of skill.

Hence to fight and conquer in all your battles is not supreme excellence; supreme excellence consists in breaking the enemy's resistance without fighting.

The best victory is when the opponent surrenders of its own accord before there are any actual hostilities... It is best to win without fighting.

All of this isn't to say that I am not in awe of the work done on this project. It was amazing. But it is a good time to reflect on the issues with any one-dimensional metric when trying to evaluate something as complicated as warfare.

2 comments:

Adam, December 13, 2019 at 10:03 AM
Notably, WAR also adjusts for the playing environment. So to the extent that a historical baseball era was, for instance, higher scoring, WAR accounts for this when consuming the underlying raw statistics (on-base percentage, home runs, runs allowed, etc.) to provide apples-to-apples comparisons between time periods. If Alexander was active during a low-warmongering era, then his war WAR should reflect this. (I have not read the source analysis.)

JAWS is a metric used to assess baseball Hall of Fame candidacies. It is based on WAR but attempts to balance peak and career performance to reward 1) players with remarkable primes that lacked longevity and 2) players that were very good for very long (and 3: players with a bit of both). Your critique might also highlight that Alexander had better peak performance than Napoleon but is being penalized for having a shorter career.
Joseph, December 13, 2019 at 1:42 PM (reply to Adam)
I think that Alexander was better at post-victory subduing of opponents, which hurts his score because he doesn't end up fighting the same coalitions over and over again. I think we could work on the metric, but I am mostly thinking that it is hard to rank on a single scale.