Elo Hell-o: Rating Systems on Showdown and You

By scalarmotion. Art by asgdf.

The ladder is an integral part of the Showdown experience, a driving force for players to continue playing and seek higher achievement and a source of endless frustration for many. Yet despite the ladder's popularity, few know how it truly works, and many often throw around terms like "ELO" and "Coil" without really understanding what they mean. While skill rating systems might seem simple—win a game and you gain some points, but lose a game and you lose some points—they vary greatly in implementation, as there are many ways to try to condense player skill into a simple number.

Players who have been around for a couple of years may remember Showdown changing its primary rating system to Elo after using ACRE estimates of Glicko-2 (don't worry if these terms look like gibberish; they'll be explained later), which angered many, as their hard-earned ladder ratings were reset. What's the difference, really? Why was this necessary? And how does this rating system mumbo-jumbo affect you as a player? Read on to find out more.

Rating Systems in a Nutshell

First of all, it's important to establish just what rating systems are and why we need them. Simply put, a rating system is an attempt to quantify a player's skill level. Furthermore, an isolated number representing skill level has little meaning on its own, so it must also be possible to compare these skill ratings with each other in order to determine which of two players is better. This may seem quite complicated and unnecessary, but there are two main reasons why rating systems are used.

Rating systems facilitate skill-based matchmaking, which matches players with other players of a similar skill level. It's no fun to be stomped by a tournament-level player on one's first-ever game, nor is there any challenge for the tournament player in question to win such a game. Thus, ensuring even match-ups is crucial in making the Showdown experience enjoyable.

In addition, rating systems provide a form of progression for the player outside of a single battle: a metagame. Just as the Pokémon games have a map to explore and gym badges to earn, it's necessary to give the player a reason to keep on battling beyond the joy of battling itself. Having a skill rating to raise gives players something to aspire to and keeps them committed to the game even as the novelty of simulator battling fades. Hence, it's clear that rating systems are quite important on Showdown. With this established, we can now take a closer look at the rating systems that Showdown uses.

Elo—our current system

First, let's examine Showdown's current main rating system, one of the most popular in the world: Elo. Not to be confused with a British rock band (note the capitalization), the Elo rating system is named after its creator, Arpad Elo, and is used in countless multiplayer games worldwide such as chess, League of Legends, and Counter-Strike: Global Offensive. The system is quite simple: each player starts with a fixed initial skill rating, and winning games earns one points, while losing games causes one to lose points. Winning games against higher rated opponents will earn one more points, while losing games against them results in losing fewer, and vice versa. Furthermore, Showdown also has a rating floor of 1000 (the initial rating), a minimum rating which one cannot fall below even if they lose more points. This ensures that a bad start does not disadvantage a new player or account under the Elo system.
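To make the point exchange concrete, here is a minimal Python sketch of the textbook Elo update. It is not Showdown's actual code: the K value of 32 and the function names are assumptions chosen for illustration, while the floor of 1000 is the one described above.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Score (between 0 and 1) that the Elo model expects `rating` to earn against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def update_elo(rating: float, opponent: float, won: bool,
               k: float = 32.0, floor: float = 1000.0) -> float:
    """Return the new rating after one game.

    k=32 is an illustrative assumption; floor=1000 is the rating floor
    described in the article.
    """
    score = 1.0 if won else 0.0
    new_rating = rating + k * (score - expected_score(rating, opponent))
    # The floor stops a rating from dropping below the starting value.
    return max(new_rating, floor)


if __name__ == "__main__":
    # Points gained for beating a higher-rated versus a lower-rated opponent.
    print(update_elo(1200, 1400, won=True) - 1200)
    print(update_elo(1200, 1000, won=True) - 1200)
    # A loss at the floor leaves the rating at 1000.
    print(update_elo(1000, 1000, won=False))
```

Note how the gain depends only on the gap between the two ratings: the bigger the upset, the bigger the reward.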

However, this system has a few issues. As one gains fewer points and loses more points if they are rated higher, battling becomes very risky for those looking to preserve their ranking, and those at the top of the ladder are discouraged from battling. Showdown deals with this by using a decay system, which causes those with an Elo rating above a certain threshold to start losing points if they go too long without battling, encouraging them to continue playing and winning in order to maintain their rating. In addition, Showdown also has a K-factor—a maximum amount of points that can be gained or lost per match—which decreases as skill rating increases, making losing a game less costly for high-ranked players. These modifications to the Elo system on Showdown create a fair and transparent rating system that ensures that players can have competitive and enjoyable battles on the ladder.
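The two modifications can be sketched as follows. Every number here (the rating breakpoints, the K values, the decay threshold, the grace period, and the points lost per day) is hypothetical and only meant to show the shape of the idea, not Showdown's real settings.

```python
def k_factor(rating: float) -> float:
    """Hypothetical sliding K-factor: a smaller K at higher ratings.

    The breakpoints and values are made up for illustration.
    """
    if rating < 1300:
        return 40.0
    if rating < 1600:
        return 32.0
    return 24.0


def apply_decay(rating: float, days_inactive: int,
                threshold: float = 1400.0, grace_days: int = 2,
                points_per_day: float = 5.0) -> float:
    """Hypothetical decay rule.

    Ratings above `threshold` lose `points_per_day` for each idle day past
    the grace period, but never decay below the threshold itself.
    """
    if rating <= threshold or days_inactive <= grace_days:
        return rating
    lost = points_per_day * (days_inactive - grace_days)
    return max(rating - lost, threshold)


if __name__ == "__main__":
    print(k_factor(1100), k_factor(1700))
    print(apply_decay(1500, days_inactive=4))
```

A smaller K at the top means a single loss costs a high-ranked player fewer points, while decay makes sitting on a high rating a losing strategy.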

Glicko and its Estimates

Of course, Elo isn't the only rating system out there. Another alternative, which Showdown used before switching to Elo, is the Glicko rating system. This system improves upon the Elo system by adding a second variable: Rating Deviation, an estimate of the accuracy of a rating. This is used to calculate the variance of your rating, shown beside the ± symbol on your Glicko-1 rating. In addition, unlike Elo ratings, which are updated after every match, Glicko ratings are updated after only two days or fifteen battles on Showdown, with the rating deviation increasing based on the time elapsed since the last battle. To simplify this rating for players, Showdown displays a provisional rating, which estimates the rating that the player will have at the end of the next two-day period.
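For the curious, the role of the rating deviation can be seen in a short Python sketch of Glickman's published Glicko-1 formulas. This is the simpler Glicko-1 update rather than Showdown's own implementation, and the constant c = 35 that controls how quickly the deviation grows during inactivity is an assumption picked for illustration.

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant


def g(rd: float) -> float:
    """Weight that discounts a result by how uncertain the opponent's rating is."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def inflate_rd(rd: float, periods_idle: float, c: float = 35.0,
               max_rd: float = 350.0) -> float:
    """Grow the deviation with idle time, capped at max_rd (c is an assumed value)."""
    return min(math.sqrt(rd**2 + c**2 * periods_idle), max_rd)


def glicko1_update(r: float, rd: float, results):
    """Update (rating, deviation) from one rating period.

    `results` is a list of (opponent_rating, opponent_rd, score) tuples,
    where score is 1 for a win, 0.5 for a tie, and 0 for a loss.
    """
    if not results:
        return r, rd
    d2_inv = 0.0  # 1 / d^2, the information gained from the games
    delta = 0.0   # weighted sum of (actual - expected) scores
    for opp_r, opp_rd, score in results:
        g_j = g(opp_rd)
        e_j = 1.0 / (1.0 + 10 ** (-g_j * (r - opp_r) / 400.0))
        d2_inv += Q**2 * g_j**2 * e_j * (1.0 - e_j)
        delta += g_j * (score - e_j)
    denom = 1.0 / rd**2 + d2_inv
    return r + (Q / denom) * delta, math.sqrt(1.0 / denom)


if __name__ == "__main__":
    # The same win moves an uncertain rating far more than a settled one.
    print(glicko1_update(1500, 350, [(1500, 50, 1)]))
    print(glicko1_update(1500, 50, [(1500, 50, 1)]))
```

The deviation shrinks whenever games are played and grows while an account sits idle, which is why a new or long-inactive account sees much larger swings.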

However, as the uncertainty of a Glicko rating makes it not very meaningful to players, one more step is necessary—the creation of a rating estimate that condenses the rating and its deviation into one value. One common type of rating estimate is a Conservative Rating Estimate (CRE), which provides an estimate of the lower bound of the player's actual skill level. In short, a CRE tells you that a player's actual skill level has a good chance (for instance, 92%) of being higher than the estimated rating. Showdown used ACRE—Zarel's very own Advanced Conservative Rating Estimate—which uses a sliding scale estimate. By increasing the confidence level of the estimate for players with high deviation, ACRE provides lower estimates for inactive players, preventing them from being too high on the ladder.
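A plain conservative estimate can be sketched in Python as below. It assumes that a player's true skill is normally distributed around the rating, with the rating deviation as its standard deviation; that modeling choice and the function name are assumptions for illustration. ACRE's actual sliding scale is not reproduced here.

```python
from statistics import NormalDist


def conservative_estimate(rating: float, rd: float,
                          confidence: float = 0.92) -> float:
    """Rating that the player's true skill exceeds with probability `confidence`.

    Assumes skill ~ Normal(rating, rd). The default of 0.92 mirrors the
    92% example in the text.
    """
    z = NormalDist().inv_cdf(confidence)  # number of deviations to subtract
    return rating - z * rd


if __name__ == "__main__":
    # Same rating, different deviations: the less certain player ranks lower.
    print(conservative_estimate(1700, 50))
    print(conservative_estimate(1700, 200))
    # Demanding more confidence lowers the estimate further.
    print(conservative_estimate(1700, 200, confidence=0.99))
```

Raising the confidence level for players with a high deviation, as ACRE does, pushes their estimate down even further than a fixed confidence level would.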

Using Conservative Rating Estimates of the Glicko system has problems just like the Elo system, especially at the top of the ladder. Players who play a lot often find that their high activity decreases their rating deviation significantly, causing their ratings to change very little after many games. Furthermore, a glitch in Showdown's implementation of Glicko-2 caused some players' ratings to grow to unusually large values, despite ACRE's penalties for inactivity. When those players battled, the ratings of their opponents ended up being affected as well. This led to significant rating inflation which made the ratings a lot less meaningful. In addition, the Elo system was much more intuitive, as players would always gain points after a win or lose points after a loss (under Glicko-2, the reduction in deviation caused by playing a game might cause one's ACRE to increase despite losing), hence Showdown made the switch to Elo over Glicko-2 and ACRE.

Another rating estimate used on Showdown is GXE, the Glicko X-Act Estimate, which was developed by the Smogon mathematician X-Act. GXE provides a more concrete and meaningful estimate of one's skill by calculating their odds of defeating a random opponent on the ladder. Unfortunately, it shares ACRE's disadvantages when a player has a high rating deviation, and it is less intuitive for players more used to the numerical progression of an Elo or ACRE rating. Nevertheless, GXE is probably the best Glicko estimate available and has been used as an intermediate variable in the determination of COIL rating, which is used in suspect testing.
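The idea behind GXE can be sketched with Glickman's expected-outcome formula, which combines the uncertainty of both players. This is only an approximation of the concept: modeling the "random opponent" as rated 1500 with a deviation of 350 is an assumption, and the exact constants in X-Act's formula may differ.

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant


def gxe(rating: float, rd: float,
        pool_mean: float = 1500.0, pool_rd: float = 350.0) -> float:
    """Estimated percent chance of beating a random ladder opponent.

    pool_mean and pool_rd describe the assumed "random opponent" and are
    illustrative values, not confirmed by the article.
    """
    combined_rd = math.sqrt(rd**2 + pool_rd**2)
    g = 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * combined_rd**2 / math.pi**2)
    win_prob = 1.0 / (1.0 + 10 ** (-g * (rating - pool_mean) / 400.0))
    return round(100.0 * win_prob, 1)


if __name__ == "__main__":
    # Same rating, different deviations.
    print(gxe(1800, 30))
    print(gxe(1800, 300))
```

In this sketch a higher deviation drags the estimate towards 50, so a strong player with an unsettled rating is undersold until they have played enough games.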

Mystery Suspects

The Elo and Glicko rating systems are meant to estimate a player's skill level, but they aren't very applicable to suspect testing. For this specific purpose, Smogon statistician Antar developed two Mystery Ratings in order to recognize players' achievements on the suspect ladder—COIL and ARMS. After heavy testing, COIL (Converging Order-Invariant Ladder) won out as the better system and has been used in suspect testing ever since. COIL is based upon a player's GXE and the number of games that they played on the suspect ladder. As long as a player is able to maintain a high GXE over many games, they will successfully achieve the requirements to vote on the suspect test. The higher the GXE a player can maintain, the fewer games they need to play, and vice versa—a player with a low GXE (such as 50 or below) will never be able to achieve these requirements.
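As I understand it from Antar's write-up, COIL takes the form 40 × GXE × 2^(−B/N), where N is the number of games played and B is a constant chosen for each suspect test. The Python sketch below uses that form; the values B = 17 and a target of 2800 are hypothetical and only there to show the trade-off between GXE and games played.

```python
import math
from typing import Optional


def coil(gxe: float, games: int, b: float) -> float:
    """COIL in its commonly cited form: 40 * GXE * 2^(-B/N)."""
    if games <= 0:
        return 0.0
    return 40.0 * gxe * 2.0 ** (-b / games)


def games_needed(gxe: float, target: float, b: float) -> Optional[int]:
    """Fewest games at a steady GXE that reach `target`, or None if unreachable.

    COIL approaches 40 * GXE as the number of games grows, so a target at or
    above that ceiling can never be met.
    """
    ceiling = 40.0 * gxe
    if ceiling <= target:
        return None
    return math.ceil(b / math.log2(ceiling / target))


if __name__ == "__main__":
    # Hypothetical requirement: COIL of 2800 with B = 17.
    for steady_gxe in (90, 80, 75, 50):
        print(steady_gxe, games_needed(steady_gxe, target=2800, b=17))
```

Because COIL can never exceed 40 times one's GXE, a player with a GXE of 50 tops out at 2000 no matter how many games they play, which is why a low GXE can never meet a requirement set above that.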

What This Means for You

Now that you have learned about these rating systems, you will probably still be wondering: What does all of this have to do with me, an average player? Why should I care about understanding these complicated numbers when I could just keep on battling and trying to increase them?

First of all, how one goes about laddering (trying to increase one's skill rating) is greatly affected by the type of rating system in use. If Showdown uses ACRE, as it used to, having a low deviation caused by battling a lot would stagnate one's rating, as it would not change much after each game. In fact, this caused players to "abandon" older accounts and create new accounts solely for laddering, hoping for a better "run" during the earlier stages of an account, when the deviation is high and winning has a significant effect on one's rating. In contrast, under the Elo system, one can expect a steady progression in their skill rating as long as they continue to win games consistently, meaning that there is no need to create a new account—winning ten games has roughly the same effect regardless of whether your account is 0 battles or 100 battles old. Thus, players who want to increase their skill rating as much as possible would do well to understand the rating system they use, lest they waste their efforts.

Other ladder-related achievements, such as suspect tests, are also similarly affected. If suspect voting requirements were purely based on Elo, then one would be able to achieve reqs just by winning enough games. However, the COIL requirement in use requires a high GXE to even have any chance of qualifying, meaning that one has to play at a consistently high level in order to achieve reqs, ensuring that suspect voters are good enough at the tier in question to be able to make an informed decision.

More importantly, understanding the rationale and workings of rating systems can and should completely change one's mindset towards playing and laddering. Many people continue to view skill ratings as a "score" that symbolizes their achievement, a number which they should naturally seek to increase. However, a skill rating is actually meant to quantify one's skill level, something which should remain relatively static. When players get frustrated at their rating increasing by very little after a win or decreasing by a lot after a loss, it is important to recognize that the player is not being rewarded or punished; rather, the rating system is correcting its own estimate. The system has to modify one's rating by a large amount if its estimate was inaccurate (for example, one loses a game against someone with a lower rating, i.e., a game which one is expected to win), whereas a match that reinforces the accuracy of the estimate (one wins a game which one is expected to win) changes the rating less.

Ultimately, laddering is not an attempt to earn as many points as possible, but rather to discover an estimate of one's own skill level and refine it to be as accurate as possible. One should not seek to reach the top of the ladder just by laddering. Rather, one should first try to find out where they stand in terms of skill rating and then attempt to improve their own inherent skill level by learning. This leads to another ladder run to find the revised skill rating that matches their new skill level, after which the process repeats. Understanding these fundamental concepts of rating systems will change one's mindset towards laddering, as well as towards playing in general. Instead of being frustrated over losing a game or being unable to raise one's skill rating, one should accept the result as the ladder's current estimate and treat it as a learning experience, using it to improve oneself.

Conclusion

All in all, skill rating systems are more than a simple number representing how long or well one has played, as the different implementations used by various rating systems and rating estimates add a great deal of complexity. Ultimately, this results in a more accurate, yet nuanced estimate of a player's ability to win games. Nevertheless, the inherent randomness present in Pokémon battling itself as well as the inconsistency of human performance introduces too much uncertainty to calculate a truly accurate estimate. While these problems will plague any attempt to quantify something as intangible as player skill, you can rest assured that we here on Showdown and Smogon have done our best to give you a skill rating that, while uncertain, will not be completely meaningless.

Further Reading

Of course, the subject of skill rating systems is very complex and goes far deeper than what I have presented here. For those with the ability and interest to learn more, here's some material to chew on.

Everything You Ever Wanted to Know About Ratings - Antar's in-depth article on the topic in general

COIL Explained - Antar's explanation of COIL

Glicko ratings - Glickman's website about the Glicko rating system
