Rubrics And The Bimodality Of 1–5 Ratings
I’m working on a fun new project called Fret War these days as a way to merge my love for playing guitar with my love of writing software. The concept is simple: guitarists learn to play a difficult piece of music based on a theme, and players and fans rate the quality of their submissions. In order for Fret War to work though, I needed to create a rating system that fights the bimodal trend of most other 1–5 rating systems out there using some different statistics.
In this blog post I’d like to lay out the mathematics and theories I’m using to create a rating system that combats the “1 or 5” tendency. I’ll have code you can use in your own system and encourage comments on the method in order to improve it.
The Competitive Blogging Concept
To understand why this rating system might work (notice I said “might”) you kind of need to understand the overall concept of what I’m calling “competitive blogging”. Competitive blogging for me is where you create a blogging environment where people are doing their posts not to just post, but to compete in some niche competition. In fact, they might not even realize they are “blogging” and instead they’re simply playing a game.
It’s not a terribly original idea, since lots of other sites have done something similar, but not quite the same. If you look at CSS Zen Garden and CrossFit you can see almost-competitive blogs. They put up some sort of challenge, and people who visit the site post their renditions of it. What’s missing is an overt competitive system with ratings for submissions.
In the case of CrossFit you can see the already competitive nature in the comments:
People desperately want to compete on CrossFit, but the site doesn’t provide a direct way for them to do it. In fact, it’d be difficult because players would be uploading videos of themselves lifting weights for review. Eh, it might work but that’s a seriously narrow audience.
In the case of Fret War we have the perfect setup:
- Guitarists are highly competitive.
- Music is easily distributed and posted to the internet.
- Players and Fans love listening to guitarists try to be badass.
You could possibly find other genres with a similar mix. I’ve already started work on a DJ version, and I’m looking for others.
However, the one thing that binds this whole concept together and makes it potentially work is the rating system. There is no game without a solid rating system that is clearly open to everyone for inspection.
Why 1–5 Is Bimodal
It’s a suspected, or potentially known, fact that sites with a 1–5 rating system end up being “bimodal”. Bimodal means that you have lots of votes around 1 and lots of votes around 5. If you produce a histogram of these votes it’d look like this:
In R you can simulate something like this and get a summary with this code:
```r
> bimod <- c(rnorm(100, 2, 0.1), rnorm(100, 3, 0.1))
> hist(bimod, freq=FALSE)
> summary(bimod)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.707   2.028   2.483   2.504   3.005   3.242
```
Notice though that while in the graph above we have lots of votes near 1 and lots of votes near 5, when we do a summary we get a mean of 2.5, which is actually misleading since almost nobody voted anywhere near 2.5.
Now, lots of people have pointed this out, but what they haven’t really said is why it ends up this way. The reason comes down to the average person’s inability to evaluate complex qualitative things in an arbitrary 1–5 scale.
Non-experts Always Rate Like/Dislike
My hypothesis is that without some form of Rubric, non-experts will use a 1–5 system as if it’s a 1/0 system for “like/dislike”. The variability around 1 or 5 comes from people saying how much they liked or disliked, and isn’t any kind of useful information different from what you’d get using the standard deviation around the mean of a logistic summary.
Yes, that’s a lot of words you probably don’t know so I’ll explain.
- When people say 1 or 2, they are saying “I really hated it.” or “I kinda hated it.”
- When people say 4 or 5, they are saying “I kinda liked it.” or “I really liked it.”
- People who say 3 are undecided (more on that later).
- The logistic model of statistics uses degrees or percentages between 0 and 1 based on boolean choices.
- Logistic models show that you get almost the same information from many boolean votes as you do from complex 1–5 or open-ended voting.
- With a logistic model, you can simply ask “did you like it?” and give a check box.
- From multiple user votes, you’ll get mean and standard deviations between 0 and 1 which you can use to determine liked/disliked and whether sorta/really.
- However, without a simple “survey” or rubric to guide the non-expert, they’ll have a hard time making a good evaluation.
- To get the best results, combine boolean choices with 1–5 ratings (linear model) but influence the user’s choices with user interface changes.
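To make that concrete, here’s a minimal sketch (my own, not Fret War code) of how plain boolean like/dislike votes already summarize into a useful mean and standard deviation between 0 and 1:

```python
from statistics import mean, stdev

def summarize_likes(votes):
    """votes is a list of 1 (liked) and 0 (disliked) boolean choices."""
    m = mean(votes)  # the fraction of people who liked it
    s = stdev(votes) if len(votes) > 1 else 0.0  # how divided they were
    return m, s

# 8 out of 10 people checked "I liked it":
m, s = summarize_likes([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
# m is 0.8 ("liked"), and s tells you whether that was "sorta" or "really".
```

The mean alone answers liked/disliked, and the standard deviation answers how unanimously.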
By assuming that users will need some help guiding their evaluation, and providing them with a micro survey that features “like/dislike” as well as an “overall rating”, I can then gather up some simple statistics which make the rating very robust and meaningful. You’ll see that what I’m getting out of the Fret War ratings is actually why people liked or disliked a particular submission, while also helping them pick a better 1–5 rating.
An Indirect Rubric For Users
When you rate a Fret War submission you see this (with the Overall Rating pulled down):
These five “qualities” of Accuracy, Speed, Interpretation, Uniqueness, and Tone are actually things that guitarists care about, and that experts would use to rate a player’s abilities. Tone in particular is a very guitarist-specific quality. Notice also that there’s a 1–5 overall rating, matching the same number of qualitative ratings presented. The goal is to get people to make the same rating an expert would make by presenting them with an indirect rubric to use as the basis of their 1–5 vote.
What I’m doing here is subverting the way to do a “correct” survey by purposefully influencing the commenter’s viewpoint. In a real survey I wouldn’t present these two pieces of information together since one would influence the other. In this case, I want to influence their rating so I present the qualities they should rate in a way that then gets them to pick a 1–5 that’s similar.
In other words, my hypothesis is that their overall rating will be closer to the number of check boxes they check off, and that by doing this I’ll get a more normally distributed overall rating instead of a bimodal one.
The Math And Code
The only downside to this is you now need some slightly complex math to handle the summary statistics, and that math needs to be a rolling calculation. The last thing you want is to have to troll through a table in the database adding up votes. You want to take each vote and use information collected so far to quickly recalculate the new summary.
The first thing you need is a separate table that contains the statistics for any object in your database:
```sql
CREATE TABLE statistic (
    other_type TEXT,
    other_id INTEGER,
    name TEXT,
    sum REAL,
    sumsq REAL,
    n INTEGER,
    min REAL,
    max REAL,
    mean REAL,
    sd REAL,
    PRIMARY KEY (other_type, other_id, name)
);
```
In this table, we use “other_type” and “other_id” as a sort of polymorphic relation. The “name” is the name of the statistic, like “accuracy” or “tone”. The other numbers are used in doing the rolling calculations and later pulling up the values of “mean” and “sd” (standard deviation).
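The get, create, and update functions aren’t shown in this post, so here’s a rough sketch of what they could look like, backed by sqlite3. The Stat class and every name in it are my own guesses, not the actual Fret War code:

```python
import sqlite3

class Stat:
    """A thin record wrapper over one row of the statistic table."""
    def __init__(self, row):
        (self.other_type, self.other_id, self.name, self.sum, self.sumsq,
         self.n, self.min, self.max, self.mean, self.sd) = row

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE statistic (other_type TEXT, other_id INTEGER,
              name TEXT, sum REAL, sumsq REAL, n INTEGER, min REAL,
              max REAL, mean REAL, sd REAL,
              PRIMARY KEY (other_type, other_id, name))""")

def get(other_type, other_id, name):
    row = db.execute("""SELECT * FROM statistic WHERE other_type = ?
                        AND other_id = ? AND name = ?""",
                     (other_type, other_id, name)).fetchone()
    return Stat(row) if row else None

def create(other_type, other_id, name):
    # Start everything at zero; the rolling sample fills in real values.
    db.execute("INSERT INTO statistic VALUES (?, ?, ?, 0, 0, 0, 0, 0, 0, 0)",
               (other_type, other_id, name))

def update(stat):
    db.execute("""UPDATE statistic SET sum = ?, sumsq = ?, n = ?, min = ?,
                  max = ?, mean = ?, sd = ? WHERE other_type = ?
                  AND other_id = ? AND name = ?""",
               (stat.sum, stat.sumsq, stat.n, stat.min, stat.max, stat.mean,
                stat.sd, stat.other_type, stat.other_id, stat.name))
```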
With that table in place, and some functions to get and update them, you have only this tiny bit of Python and you’ve got a rolling “sample” method:
```python
from math import sqrt

def sample(other_type, other_id, name, value):
    stat = get(other_type, other_id, name)

    if not stat:
        create(other_type, other_id, name)
        stat = get(other_type, other_id, name)

    stat.sum += value
    stat.sumsq += value * value

    if stat.n == 0:
        stat.min = value
        stat.max = value
    else:
        if stat.min > value: stat.min = value
        if stat.max < value: stat.max = value

    stat.n += 1.0

    try:
        stat.mean = stat.sum / stat.n
    except ZeroDivisionError:
        stat.mean = 0.0

    try:
        # sd = sqrt((sumsq - sum*sum/n) / (n - 1))
        stat.sd = sqrt((stat.sumsq - (stat.sum * stat.sum / stat.n))
                       / (stat.n - 1))
    except ZeroDivisionError:
        stat.sd = 0.0

    update(stat)
```
The super magic in this calculation is in the line where we set the sd value. That math is basically the normal calculation for standard deviation, but turned on its head with some algebra so that we don’t need to loop over all the records over and over. In fact, I’ve been using this code so long that I just sort of trust it and only validate it against R periodically.
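If you’d rather not take that algebra on trust, it’s easy to sanity-check against a standard library implementation. This check is mine, not part of the Fret War code:

```python
from math import sqrt
from statistics import stdev

values = [1, 2, 3, 5, 5]
n = len(values)
total = sum(values)
sumsq = sum(v * v for v in values)

# The same rearrangement the rolling code uses: sd from running sums only,
# with no second pass over the individual values.
rolling_sd = sqrt((sumsq - total * total / n) / (n - 1))

assert abs(rolling_sd - stdev(values)) < 1e-9
```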
You would use the above code like this:
```python
>>> from app.model import ratings
>>> ratings.sample("submission", 0, "overall_rating", 1)
>>> ratings.sample("submission", 0, "overall_rating", 2)
>>> ratings.sample("submission", 0, "overall_rating", 3)
>>> ratings.sample("submission", 0, "overall_rating", 5)
>>> ratings.sample("submission", 0, "overall_rating", 5)
>>> stat = ratings.get("submission", 0, "overall_rating")
>>> stat.n
5
>>> stat.sd
1.7888543819998315
>>> stat.mean
3.2000000000000002
>>> stat.sum
16.0
>>> stat.min
1.0
>>> stat.max
5.0
```
Which if you did in R comes out to:
```r
> summary(c(1,2,3,5,5))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     2.0     3.0     3.2     5.0     5.0
> sd(c(1,2,3,5,5))
[1] 1.788854
```
That’s pretty close apart from a few rounding errors as you get further out.
The beauty of this code is that you can keep track of as many varieties of statistics with just a few database accesses, and you can also “roll up” these statistics.
Mean of Mean Theory
Here’s how we use this on Fret War: when you vote on a submission, we take a sample of each of your qualitative boolean choices and your overall rating. We then also roll this up by taking the “mean of mean” and “mean of sd” for all submissions to produce the overall round summary.
In our code we’re kind of cheating, or being “practical” by using a standard model to analyze what’s really a logistic model. We just use the same mean/sd calculations for binary data as we do for 1–5 data. This makes real statisticians cringe, but for practical purposes, it’s good enough.
One useful bit of theory though (roughly the Central Limit Theorem at work) is that if you take the mean of a summary statistic (like the mean or standard deviation) across many samples, then that summary will be approximately normally distributed no matter what form the original data takes.
It’s kind of like doing a meta-mean or meta-sd, and it says that, even if your data is totally weird and not normal, you can assume that the meta-version will be normal.
In this way I’m cheating since I get each submission’s rating mean and standard deviation, which is really logistic in shape, and then just turn them into a normal distribution by meta-summarizing all of them.
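As a sketch of that roll-up (with made-up per-submission numbers, not real Fret War data):

```python
from statistics import mean, stdev

# (mean, sd) of the overall_rating for each submission in a round:
submissions = [(3.2, 1.1), (4.0, 0.8), (2.5, 1.6), (3.8, 0.9)]

meta_mean = mean(m for m, _ in submissions)    # the "mean of mean"
meta_sd = stdev(m for m, _ in submissions)     # spread of submission means
mean_of_sd = mean(s for _, s in submissions)   # the "mean of sd"
```

However weird each submission’s individual vote distribution is, these meta-summaries tend toward a normal shape as the round accumulates submissions.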
In practice this isn’t terribly useful, but in Fret War it’s very useful because we use it to determine rankings and analyze the trend of the round. For example, we can see that a particular fan’s ratings are probably a troll’s if they are consistently 1 standard deviation away from everyone else in the round. Simply keep the meta-mean for all submissions in a round, and then if someone rates every submission at less than (meta_mean - meta_sd) then he’s probably trolling.
This is the plan to make these measurements robust. By knowing the meta-mean and meta-sd of the round, we can evaluate outliers and potentially throw them out, and possibly even do it in an automated fashion.
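A hypothetical sketch of what that automated check could look like (the data and names here are invented for illustration):

```python
meta_mean, meta_sd = 3.4, 0.6
cutoff = meta_mean - meta_sd  # anything consistently below this is suspect

# Each fan's overall ratings across every submission in the round:
fan_ratings = {
    "honest_fan": [3, 4, 3, 5],
    "likely_troll": [1, 1, 2, 1],
}

suspects = [fan for fan, votes in fan_ratings.items()
            if all(v < cutoff for v in votes)]
# Only a fan who rates below (meta_mean - meta_sd) on every single
# submission gets flagged; a few harsh votes alone won't trigger it.
```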
Standard Deviation And “Sorta” vs. “Really”
Alright, that’s a hell of a lot of math and information, and sadly guitarists and fans are not known for their math prowess. That means we needed a way to describe these statistics to people in a meaningful way.
Here’s what all the ratings displays look like on Fret War:
Which is kind of funny, but when people look at it they find it makes total sense. How do we determine these? Here’s the code:
```python
def mean_sd_as_english(mean, sd):
    level = ""

    if mean < 0.1:
        level = "Sucks"
    elif mean < 0.2:
        level = "Mediocre"
    elif mean < 0.5:
        level = "Not Bad"
    elif mean < 0.7:
        level = "Awesome"
    elif mean <= 1.0:
        level = "Kicks Ass"
    else:
        level = "ERROR: %f" % mean

    if sd < 0.2:
        level = "Really " + level
    elif sd > 0.5:
        level = "Sorta " + level

    return level
```
This function is only used on the logistic descriptors (Accuracy, Speed, etc.) which should be between 0 and 1. The levels and names are pretty much just guessed at, but seem reasonable.
What’s very fun though is the use of standard deviation (sd) to determine “Sorta” vs. “Really”. The standard deviation is basically a measure of how “wide” your distribution is around the mean. A smaller sd (narrower) means that most people rated it consistently at that level. A larger sd (wider) means people weren’t so consistent.

For example, if two players both have an Accuracy mean of 0.8, but Joe’s sd is 0.1 and Mary’s is 0.8, then you can determine the following:
- Joe was seen as more consistently accurate than Mary.
- Mary was still just as accurate, but enough people voted the other way that it spread her distribution out.
- I can use Joe’s sd to say he was “Really” awesome, as a way of denoting consistency in the voting.
- Consequently, Mary’s sd says she was “Sorta” awesome because enough people thought she wasn’t.
- Mary also may have gotten more votes than Joe, and actually people who thought she was accurate probably ranked her as more accurate than Joe.
- There’s probably something else going on with Mary’s submission that’s confounding her accuracy rating. Maybe she picked a Rhythm that some people just don’t like or can’t hear well.
With that in place, it’s very simple to present to the user what’s actually a very complex statistical model of their playing, but in a way they understand.
If you look at the Winnars page you can see we have a rating called “Cowbells” which seems really weird. Here’s a screenshot of it:
To make things fun I decided that we’d have what seems like a fairly arbitrary huge ass number to show your ranking compared to someone else. That page is showing the winnars sorted by their mean (DESC) then their standard deviation (ASC) so that higher means with lower sd are at the top.
The Cowbells is meant to be funny and keep people guessing, but it’s simply the following:
```r
> mean(c(0,0,0,0,0,1)) * 1000
[1] 166.6667
> mean(c(1,1,1,1,1,5)) * 1000
[1] 1666.667
> mean(c(0.5,0.5,0.5,0.5,0.5,3)) * 1000
[1] 916.6667
```
Yep, just the mean of all the qualitative ratings and the overall rating combined times 1000. Why 1000? Then you get to see 666 when you’re a top perfect player, and that’s so metal.
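In Python that works out to something like this sketch (the function name is my own invention):

```python
from statistics import mean

def cowbells(quality_means, overall_mean):
    """quality_means are the five 0-1 boolean-rating means;
    overall_mean is the 1-5 overall rating mean."""
    return mean(quality_means + [overall_mean]) * 1000

# A middling player: every quality split 50/50, overall rating of 3.
score = cowbells([0.5, 0.5, 0.5, 0.5, 0.5], 3)
```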
Robustness And Gaming
Obviously anything can be gamed, and this is no different. It’s trivial for a bunch of trolls to go on Fret War and consistently rate one way or another, as demonstrated by the Mountain Men’s Three Wolf Tee on Amazon.
If a bunch of people want a particular player to suck, well that’s what they’ll do. They do it to American Idol and they’ll do it on Fret War.
What this set of measurements gives us though is the ability to detect the gaming, and it also sets the bar a little higher. It’s not just a 1–5 but instead several check boxes and a required comment of 20 characters. We can also decide after a round if we want to throw out outlier votes, and in fact a simple query will show us all the possible gamers.
But, like I said, anything can be gamed, even this.
Currently there are two really obvious flaws which we’re fixing.
The first is that the method of getting and setting a new statistic has a race condition. That was fine when it was just a few people hacking on it, but pretty soon we’ll need to serialize the summary calculation code. In our case we’ll just delay all posted comments and ratings and send them through a Lamson server. Lamson will then do the calculations on the posts in order after spam filtering and other quality control.
If you were to use this code in your own site, you could do something like having a secondary table that stores the pending votes. Just make a record of the parameters to sample, and then have something run every 5 minutes or so to roll them up and clear the table. This is sort of a compromise between doing this calculation on all table rows each time, and having the race condition.
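A sketch of that compromise, with an invented pending_vote table (none of these names come from the actual Fret War code):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE pending_vote (other_type TEXT, other_id INTEGER,
              name TEXT, value REAL)""")

def queue_vote(other_type, other_id, name, value):
    # Cheap insert at request time: no summary math runs here, so there's
    # no race on the statistic row.
    db.execute("INSERT INTO pending_vote VALUES (?, ?, ?, ?)",
               (other_type, other_id, name, value))

def roll_up(sample):
    # Run this from a single periodic worker so sample() is never called
    # concurrently; returns how many queued votes were processed.
    rows = db.execute("SELECT * FROM pending_vote").fetchall()
    for row in rows:
        sample(*row)
    db.execute("DELETE FROM pending_vote")
    return len(rows)
```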
Another flaw, which might not be such a big deal, is progressive rounding errors. You can already see a small rounding error above with just a few samples. As the number of samples goes up we’ll see rounding errors increase for later samples.
We’ll be fighting that by simply running one mass calculation at the end of a round to determine the real winners.
Currently Fret War is in beta so we’ll definitely have problems with this code. I’ll hopefully be tweaking most of the displays and measurements over the next few months and working on ways to keep it sane.
Also, if you have feedback on this method then feel free to email me and discuss it.
There’s a good chance the site will crash if this blog post hits the nerd sites, so just ignore Fret War until it’s stable.