A Few Interesting Ratings System Observations

I’m currently crunching some preliminary numbers from the Fret War rating system experiment. If you haven’t read it, here’s the blog post about how I’m doing Fret War’s ratings. It mostly covers the stats and code behind the system, not the social impact.

Today I was collecting a bit of preliminary information to see how it’s working out. I’ve also been participating in a few forums and comment sites to see how they operate. I don’t have any hard numbers yet, but there are a few observations I’ve made that I might try to turn into some sort of survey or secondary analysis.

1–5 Ratings Are Actually Bimodal

I think the first thing is that many people who commented on the previous blog post were just plain wrong: 1–5 ratings are very much bimodal. However, the reason they were wrong is that they didn’t understand the statement, so I should probably clarify it.

Typically, the people who disagreed would go look at a single Amazon product page and say “see, almost all 4.5” or “see, almost all 1s”.

What they weren’t doing is taking a large random sample of, say, Netflix movies or Amazon products and looking at the distribution of ratings across that whole sample. If you do that, you start to see that the ratings pile up very heavily either close to 5 or close to 1.

So, the full statement should be, “out of a large random sample of 1–5 rated items you’ll find the ratings are very bimodal”. Not, “pick any random item and its votes will be bimodal”. The latter is just absurd, and would really only apply to the small set of items that are genuinely controversial.

I think what I may do, assuming I can stay interested, is go grab some of the Netflix data and verify this further. So far I’ve based the claim on statements from various companies and on random samples of movies and products.
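
If I do get to it, the check itself is simple. Here’s a rough sketch in R (which is what I’ll be using for the write-up anyway), with simulated numbers standing in for the real Netflix or Amazon data, just to show the shape of the analysis:

```r
# Sketch only: "item_mean" stands in for the average rating of each item
# in a large random sample of Netflix movies or Amazon products.
# The numbers below are simulated purely for illustration.
set.seed(42)
item_mean <- c(rnorm(900, mean = 4.5, sd = 0.3),   # well-liked items
               rnorm(700, mean = 1.5, sd = 0.3),   # hated items
               rnorm(400, mean = 3.0, sd = 0.5))   # the middling few
item_mean <- pmin(pmax(item_mean, 1), 5)

# The claim is about this distribution: averages across many items,
# not the votes on any single item's page.
hist(item_mean, breaks = 40,
     main = "Average rating per item",
     xlab = "Mean 1-5 rating")

# A more formal check would be Hartigan's dip test:
# install.packages("diptest"); diptest::dip.test(item_mean)
```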

Qualitative Logit Ratings Work (I Hope)

So far I’m finding that the use of a mix of 5 qualitative ratings and then a 1–5 overall rating actually does give a richer data set and helps the user “justify” their rating. The data so far is pretty supportive of the claim, but I need to study logistic regression more so that I get it right.
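
To make that concrete, here’s roughly the kind of model I have in mind, sketched in R with made-up column names for the five qualitative metrics (not the real Fret War fields); it just treats “gave a 4 or 5 overall” as the outcome and the five like/dislike answers as predictors:

```r
# Sketch with fake data: five 0/1 qualitative answers plus a 1-5 overall.
# The column names here are placeholders, not the real Fret War metrics.
set.seed(7)
n <- 500
ratings <- data.frame(
  overall    = sample(1:5, n, replace = TRUE),
  tone       = rbinom(n, 1, 0.5),
  technique  = rbinom(n, 1, 0.5),
  timing     = rbinom(n, 1, 0.5),
  creativity = rbinom(n, 1, 0.5),
  feel       = rbinom(n, 1, 0.5)
)

# Simplest version: logistic regression on "was the overall a 4 or 5?"
ratings$liked <- as.integer(ratings$overall >= 4)
fit <- glm(liked ~ tone + technique + timing + creativity + feel,
           data = ratings, family = binomial)
summary(fit)

# An ordered logit on the full 1-5 scale (MASS::polr) is the fancier
# option once I'm sure I understand it.
```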

I think that if the UI for doing these ratings could be improved, it would turn out to be only a slightly higher burden on raters to indicate the qualitative metrics as well as “stars”. My thought was that I’d put a small display of 5 icons above a row of 5 stars. The stars are disabled until you like/dislike all 5 qualitatives, and then you pick your 1–5 based on that.

There could be other ways to compress the UI and make it just as informative, but the idea is to basically develop the littlest survey I can for collecting a person’s opinion of a piece of music.

Removing Confounding From Ratings

This actually applies to ratings of comments on sites like reddit, HN, or slashdot as well. If you go look at many of the comment threads, you’ll find a prominent “dog pile” effect. Once a comment gets a rating (either positive or negative), it quickly picks up many more ratings right after it, and I think those later ratings tend to go in the same direction as the initial one.
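
I haven’t measured this anywhere, but with a vote log (which I don’t have for those sites) the dog pile would be easy to quantify. A sketch, assuming a hypothetical table with one row per vote and columns comment_id, vote_order, and vote (+1/-1):

```r
# Sketch: how much do later votes follow the first one? Assumes a
# hypothetical vote log with columns comment_id, vote_order, vote.
dogpile_correlation <- function(votelog) {
  first_vote <- tapply(votelog$vote[votelog$vote_order == 1],
                       votelog$comment_id[votelog$vote_order == 1], mean)
  later_mean <- tapply(votelog$vote[votelog$vote_order > 1],
                       votelog$comment_id[votelog$vote_order > 1], mean)
  ids <- intersect(names(first_vote), names(later_mean))
  # A strong positive correlation here would be the dog pile showing up.
  cor(first_vote[ids], later_mean[ids])
}
```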

Slashdot is interesting in that it tries to add a qualitative element with its “Insightful” or “Funny” modifiers. However, I find that more often this just acts as a context that turns off the reader’s brain. What I find is that any comment marked “Insightful” is immediately and implicitly believed by the reader on slashdot. Even if it’s entirely false, the “Insightful” marker makes it read as true. Same for “Funny” or any of the others.

I actually would love to study these two statements:

* If you randomized the qualitative markers on slashdot (Insightful, Funny) it would still heavily correlate with a reader’s rating of a comment.
* If you required people to rate before they read comments you would remove this “contextual rating confounding”.

I really can’t do the first one, since that would require some human trials permissions and probably a real sociology survey or two. However I can do the second one, or at least implement it as part of Fret War.

In order to get more (hopefully) honest opinions from voters, Fret War doesn’t let you see the comments until you log in and rate the submission as well. Once you’ve rated it, and contributed a comment, you can get in there and reply to other people’s comments and have a conversation.

Now what I’d like to find out is whether this gives a more normal distribution to the ratings, and whether the posted comments are more varied. My hypothesis is that the ratings will be “less bimodal” and that the comments will cover a wider range of opinions with less group think.
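
One crude way to check the first part once enough votes are in, sketched with a stand-in vector where the real Fret War ratings would go:

```r
# Sketch: fretwar_ratings is a placeholder for the real 1-5 overall votes.
fretwar_ratings <- sample(1:5, 300, replace = TRUE)  # stand-in data

# Bimodal ratings pile up at 1 and 5, so the share of middling votes is a
# quick-and-dirty indicator; closer to normal means a bigger share here.
mean(fretwar_ratings %in% 2:4)

# A histogram plus Hartigan's dip test (diptest package) would be the
# less hand-wavy versions of the same check.
hist(fretwar_ratings, breaks = 0:5 + 0.5, main = "Fret War overall ratings")
```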

The twist, though, is that I’d then like to look at the comment replies people can make after they vote. I’m wondering if those are more group-think oriented like on other sites, or if they stay individualized.

Spotting Stupid Trolls, And Not Caring About The Smart Ones

I’m also starting to spot some ways to track down dumb trolls, and to realize that the smarter trolls’ ratings won’t be distinguishable from normal ones, so they won’t matter.

Remember that you have to vote before you can see the comments. And log in. And getting an account is a little painful. All of this should add up to a lower ass level, but it still won’t stop some people.

Let’s say you have trolls who aren’t very smart. They’ll realize that they have to vote before seeing the comments, and that they have to give more than just a 1–5 rating. The first thing an unsophisticated troll will do is rate everything the same: click one qualitative and pick a 3 or a 1 or something easy.

These trolls will be easy to spot because they’ll rate exactly the same on everything. Since all of the music on Fret War is fairly different, it would be very rare for someone to honestly rate it all the same. Simply finding voters who have very little variation in their ratings will spot them quickly.
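
The check itself is about as simple as it gets. A sketch, with a fake vote table standing in for the real one and cutoffs picked out of thin air:

```r
# Sketch: flag raters whose overall ratings barely vary. "votes" is a
# stand-in table with one row per (user_id, overall) vote; the real data
# has more columns. The cutoffs below are arbitrary.
set.seed(3)
votes <- data.frame(user_id = rep(1:50, each = 10),
                    overall = sample(1:5, 500, replace = TRUE))

per_user_sd <- tapply(votes$overall, votes$user_id, sd)
n_votes     <- tapply(votes$overall, votes$user_id, length)

# Only judge people who have voted enough times, then flag the flatliners.
suspects <- names(per_user_sd)[n_votes >= 5 & per_user_sd < 0.5]
suspects
```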

Alright, so let’s say smarter trolls figure this out and start picking more randomly just to post nasty comments. Well then, who cares? I’ll give the player who submitted the piece the right to stuff a comment, but if the ratings aren’t uniform then they just become random noise and not really a problem. They could almost just be game elements.

What it comes down to is this: if everyone has to rate when they comment, and you eject people who rate the same way too frequently, then random ratings that look no different from normal ones shouldn’t matter.

Of course, that’d depend on which players get trashed by it, so moderation will probably still be important.

Required Random Ratings Would Work Too

I also have this idea that requiring ratings from some people at random would work just as well as requiring them from everyone. While I think the burden of rating is pretty low, and acts as a decent filter, anyone who wanted to try required ratings could start by only requiring them at random.
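
The mechanics would be trivial; the only real work is remembering which condition each reader landed in so the two groups can be compared later. A tiny sketch (the probability and the function name are made up):

```r
# Sketch: randomly decide whether this reader must rate before reading.
# p_required and gate_reader are made-up names for illustration.
p_required <- 0.3
gate_reader <- function(user_id) {
  data.frame(user_id         = user_id,
             rating_required = runif(1) < p_required)  # log the condition
}
gate_reader(42)
```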

I think there are some usability issues in this case, since the expectation for most users is already set by a site like HN with its open comment reading.

But I’d be interested in someone who’s building a new comment system trying this option out, to see if requiring a login and a rating changes the numbers.

Small Data Sets

Another observation I’ve made is that it is possible to gather useful information from small data sets if you plan ahead. I made sure that I laid out exactly what I wanted to know and then tried a few simulations before actually going with it. This meant I only had a few glitches here and there to work out.

Because the data set is a mini-survey with effectively 6 questions per person for each submission, I can use this information to analyze results now. If I had only a simple 1–5 I’d have to wait longer and get more people involved.
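
The simulations I mentioned were nothing fancy, roughly along these lines; all of the numbers here are invented, and the only point is to see how quickly the estimate settles down with a handful of raters:

```r
# Sketch: how noisy is the estimated average with only a few raters?
# Only the 1-5 overall is simulated here; the five qualitative answers
# would be simulated the same way. All numbers are made up.
set.seed(1)
true_quality <- 3.8
sim_once <- function(n_raters) {
  overall <- pmin(pmax(round(rnorm(n_raters, true_quality, sd = 1)), 1), 5)
  mean(overall)
}
# Spread of the estimated mean at different rater counts.
sapply(c(5, 10, 20, 40), function(n) sd(replicate(500, sim_once(n))))
```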

People Are Honest About Their Own Works

This one actually surprised me, but I think it’s mostly down to the current small set of friendly users. So far I’m finding that people rate themselves about where everyone else rates them, with a few exceptions here and there.

I personally thought people would rate themselves either far above or far below the ratings of their peers.
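
Checking this was just a matter of lining up each player’s rating of their own piece against everyone else’s average. A sketch, with an assumed table layout (submission_id, rater_id, owner_id, overall) rather than the real schema:

```r
# Sketch: compare self-ratings to the peer average per submission.
# "votes" is a placeholder table; the column names are assumptions, not
# the real Fret War schema.
self_vs_peers <- function(votes) {
  is_self   <- votes$rater_id == votes$owner_id
  self_rate <- tapply(votes$overall[is_self],
                      votes$submission_id[is_self], mean)
  peer_rate <- tapply(votes$overall[!is_self],
                      votes$submission_id[!is_self], mean)
  ids <- intersect(names(self_rate), names(peer_rate))
  # Differences near zero mean people rate themselves like their peers do.
  summary(self_rate[ids] - peer_rate[ids])
}
```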

Graphs And Stats To Come

I’ll be doing some stats on the Fret War data this week to see how this is working out. I’ll see if I can write it up as a nice essay with R code showing you how it was done. I’d also like to do this so that people can check my work and let me know if I’m wrong.

If you have other ideas of things you’d like to know, just let me know.