Normalize the scores from different reviewers

Started by wpzdm

13 replies 73 views

The motivation is simple: some reviewers (e.g., Joyce) tend to give high scores, so if a product happens to be reviewed by them, the averaged score ends up higher than it should be. Likewise, some reviewers tend to give lower scores (e.g., Precogvision). It gets more complex if we look further. Some reviewers, like super-review, use a different scale. Some reviewers seem to rarely give very low scores (e.g., Jays). Of course, I know some scores are AI-analysed from review videos.

Proposal. For each reviewer (call him/her r): compute the reviewer's mean and standard deviation (call them mu, sigma), then compute the mean and standard deviation (call them mu0, sigma0) of the site scores (currently averaged over all reviewers) of all the products reviewed by reviewer r. Now, for each score s given by reviewer r, normalize it as s_normalized = sigma0*((s - mu)/sigma) + mu0. Notes:

  • The core assumption is that all reviewers give scores according to a Gaussian distribution. This is generally false, but a common working assumption.
  • The core idea is to work on the products reviewed by the reviewer. This matters. For example, at first glance, Smirk tends to give higher scores, but if you look at each product, it is clear they in fact give lower scores. This is because Smirk focuses on reviewing high-end products.
  • At a high level, this requires a 3-stage procedure: 1) compute the plain site scores as now, 2) compute the normalized scores for each reviewer (proposed here), 3) compute site scores again. Maybe there are cleverer ways, idk.
  • Improving the LLM analysis of review-to-score is of course (maybe more) important, but the above idea applies after the LLM analysis.
  • As to those reviewers who have their own explicit scores, I think we should respect their scores and show them directly. But the proposal applies to the site scores.
  • For those who like math: this procedure will probably reduce the variance between reviewers (on certain products), but I guess only to a limited extent.
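To make the formula concrete, here is a minimal Python sketch of the proposed normalization; the function and variable names are mine for illustration, not part of any existing codebase:

```python
import statistics

def normalize_reviewer_scores(reviewer_scores, site_scores):
    """Normalize one reviewer's scores onto the site scale.

    reviewer_scores: dict product -> score s given by reviewer r
    site_scores:     dict product -> current site score, restricted to
                     the products reviewer r has reviewed
    """
    # Reviewer's own mean and standard deviation (mu, sigma)
    mu = statistics.mean(reviewer_scores.values())
    sigma = statistics.stdev(reviewer_scores.values())
    # Mean and standard deviation of the site scores on the same products (mu0, sigma0)
    mu0 = statistics.mean(site_scores.values())
    sigma0 = statistics.stdev(site_scores.values())
    # s_normalized = sigma0 * ((s - mu) / sigma) + mu0
    return {p: sigma0 * ((s - mu) / sigma) + mu0
            for p, s in reviewer_scores.items()}
```

Because mu0/sigma0 are computed only over the products this reviewer covered, a reviewer like Smirk who covers mostly high-end products is compared against the site's scores for those same products, which is the "product-base" point above.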
Edited by wpzdm on .

Yeah, I think this is the way! I see the problem which we currently have. Thank you very much for this proposal. I am currently a bit busy with something else. But that will be the next bigger change I am working on :)

Currently there are some "adjustments" made for reviewers who clearly review by value (e.g. giving Truthear Pure 5/5 stars because it is great value, Super* Reviews). There is nothing wrong/bad about doing that. But it does not really fit the site's purpose.

But I think your proposal leads to a more generalized and transparent solution overall. We need to make sure that the system is communicated to the community with enough transparency.

Endoki wrote: Yeah, I think this is the way! I see the problem which we currently have. Thank you very much for this proposal. I am currently a bit busy with something else. But that will be the next bigger change I am working on :)

Thank you as always! You are doing valuable work for our community!

Currently there are some "adjustments" made for reviewers who clearly review by value (e.g. giving Truthear Pure 5/5 stars because it is great value, Super* Reviews). There is nothing wrong/bad about doing that. But it does not really fit the site's purpose.

Interesting. This is another (independent) point I did not think of.

Thanks a lot again for the detailed suggestion – this is very close to what we’ve decided to implement, with a few extra layers on top to also tackle the “price-to-performance” issue and reviewer weighting.

1. Per-reviewer normalization (strict vs lenient scales)

We agree that different reviewers use the 0–10 scale very differently. So for each reviewer we will:

  • Look at their full history of scores.
  • Compute their personal “average” and “spread”.
  • Convert each of their scores into a relative position on their own scale (essentially a z-score: how far above/below their usual this IEM is).

This means:

  • A “7” from a strict reviewer and a “9” from a generous reviewer can end up in the same normalized zone if, for both, it means “this is really good for me”.
  • Reviewers who never give 10s are no longer penalised: their “top tier” scores still map to the top of our site scale.

We also use global stats to gently pull very small reviewers towards the overall average, so one person with 3 reviews cannot distort things too much.
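The steps above could be sketched roughly as follows; the shrinkage constant k and the linear blending of reviewer stats with global stats are my own illustrative choices, not the site's actual implementation:

```python
def shrunk_zscore(score, reviewer_mean, reviewer_std,
                  global_mean, global_std, n_reviews, k=5.0):
    """Z-score of one score on the reviewer's own scale, with the
    reviewer's stats pulled toward global stats when n_reviews is small.

    k is a hypothetical shrinkage strength: with n_reviews >> k the
    reviewer's own mean/std dominate; with only a few reviews the
    global stats do, so a 3-review reviewer cannot distort things much.
    """
    w = n_reviews / (n_reviews + k)
    mean = w * reviewer_mean + (1 - w) * global_mean
    std = w * reviewer_std + (1 - w) * global_std
    return (score - mean) / std
```

With this shape, a "7" from a strict reviewer (low personal mean) and a "9" from a generous one can land at the same z-value, which is the "same normalized zone" effect described above.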


2. Correcting price-to-performance bias

Your idea handles scale differences, but some reviewers also bake value into their scores (e.g. cheap IEMs getting 10/10 because they’re “insane for the price”).

To address this, we:

  • Group IEMs into price bins (e.g. <100, 100–300, 300–1000, >1000).
  • Within each price bin, compare each reviewer’s normalized scores to the consensus on the same products.
  • If a reviewer systematically scores cheap IEMs higher than consensus, or expensive ones lower, we detect that as a price-tier bias.
  • We then subtract that bias from their scores in that price bin (with a tunable factor depending on how value-focused that reviewer is).

Result: The site score becomes as price-agnostic as possible. A budget IEM that’s great “for the money” will still be rewarded, but not ranked like an absolute technical monster if it doesn’t actually perform on that level.
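A rough sketch of this bias detection and subtraction; the bin edges mirror the ones named above, while the function names and the tunable alpha factor are invented for illustration:

```python
from statistics import mean

# Price bins as listed above (currency units as in the thread)
PRICE_BINS = [(0, 100), (100, 300), (300, 1000), (1000, float("inf"))]

def price_bin(price):
    """Index of the bin a price falls into."""
    for i, (lo, hi) in enumerate(PRICE_BINS):
        if lo <= price < hi:
            return i

def bin_bias(reviewer_norm, consensus_norm, prices):
    """Per-bin mean deviation of a reviewer's normalized scores
    from the consensus on the same products."""
    by_bin = {}
    for p, s in reviewer_norm.items():
        by_bin.setdefault(price_bin(prices[p]), []).append(s - consensus_norm[p])
    return {b: mean(devs) for b, devs in by_bin.items()}

def correct(reviewer_norm, prices, bias, alpha=1.0):
    """Subtract the detected per-bin bias, scaled by a tunable alpha
    (how value-focused the reviewer is)."""
    return {p: s - alpha * bias.get(price_bin(prices[p]), 0.0)
            for p, s in reviewer_norm.items()}
```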


3. Reviewer weights (expert vs casual / consistent vs noisy)

Not all reviewers will contribute equally to the final site score. For each reviewer we define a weight that combines:

  1. Expertise (manual):

    • Critical, highly experienced listeners (e.g. Precog, Smirk) get a higher base weight.
    • More casual / “vibes” reviewers or pure LLM-derived scores get a lower base weight.
  2. Consistency (data-driven):

    • We check how often a reviewer’s normalized scores line up with the overall consensus on IEMs others have also reviewed.
    • Reviewers who are all over the place get a lower weight; those who are consistently in line with the broader picture get a higher one.
  3. Experience / volume:

    • Reviewers with many IEMs under their belt get a small bonus vs someone with only a handful of reviews.

These factors are multiplied into a single w_r, and we average normalized scores for each IEM using these weights. So expert, consistent reviewers with lots of experience move the needle more than very casual or noisy ones.
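One possible shape for that multiplication into a single w_r; the factor ranges, the volume cap, and the bonus size are assumptions for illustration, not the actual values used:

```python
def reviewer_weight(base_expertise, consistency, n_reviews, volume_cap=50):
    """Combine the three factors into a single weight w_r.

    base_expertise: manual weight, e.g. 1.5 for critical expert listeners,
                    1.0 default, 0.5 for casual / pure LLM-derived scores
    consistency:    in (0, 1], data-driven agreement with consensus
    n_reviews:      experience bonus, saturating at volume_cap reviews
    """
    volume = 1.0 + 0.2 * min(n_reviews, volume_cap) / volume_cap
    return base_expertise * consistency * volume

def weighted_site_score(norm_scores, weights):
    """Weighted average of normalized scores for one IEM.

    norm_scores: dict reviewer -> normalized score
    weights:     dict reviewer -> w_r
    """
    total_w = sum(weights[r] for r in norm_scores)
    return sum(weights[r] * s for r, s in norm_scores.items()) / total_w
```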


4. Final site score & transparency

Putting it all together:

  • Each review gets turned into a normalized, price-corrected performance score.
  • These are combined using reviewer weights to produce a final performance-only site score for each IEM.
  • The original reviewer scores will still be shown so people can see what Joyce, Precog, Super*Review etc. actually gave.

This should:

  • Reduce artificial inflation from “value” scoring.
  • Put strict and lenient reviewers onto a common scale.
  • Let the site lean more on trusted critical listeners, while still using everyone’s input.

Thanks again for the nudge – your normalization idea was the backbone, we just extended it to explicitly handle price bias and reviewer weighting.

Edited by Endoki on .

Another adjustment to the normalized overall score will be that more reviews -> better, because the aggregate is probably less biased. When an IEM has only 1-4 reviews, that is probably a warning sign that the score could be biased, so we will adjust the overall normalized score based on how many reviews an IEM has already received.

I will probably also create an infopage which explains everything.

After a lot of testing, and as a first step, I have implemented a "low-review-count" penalty. There will be no change in the aggregated score when the IEM has 6 reviews. If the IEM has fewer reviews, 0.2 is subtracted per missing review, so an IEM with only 1 review receives a subtraction of 1 in the total score. If the IEM has more than 6 reviews, it gets a +0.2 bonus thanks to its popularity (but no more steps in that direction).
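The rule as described above can be sketched in a few lines; the step size is kept as a parameter since it is a tuning knob (the function name is mine):

```python
def review_count_adjustment(n_reviews, baseline=6, step=0.2):
    """Adjustment added to the aggregated score based on review count.

    At baseline reviews: no change. Below baseline: subtract one step
    per missing review (so 1 review -> -(baseline - 1) * step).
    Above baseline: a single flat +step bonus, with no further steps.
    """
    if n_reviews > baseline:
        return step
    return -step * (baseline - n_reviews)
```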

I plan to iteratively implement changes to the new normalized score and update this forum thread.

Wow, that is a well-sorted plan!

I have only one comment, about a reviewer's "budget level" and its relation to "consistency". This is just what I mentioned about Smirk, who focuses on reviewing high-end products (high "budget level"). We can imagine there could be an "anti-Smirk", who focuses on low budget levels (and, of course, this is as helpful as Smirk!). Now, as we have seen for Smirk and can imagine for "anti-Smirk", even after the z-score normalization, the same score from Smirk and anti-Smirk would mean quite different things about the sound quality. Moreover, they both can have relatively low consistency under the current definition, because, for example, Smirk would, to some extent, consistently give lower scores than the consensus (I believe this is not the consistency you meant by "line up with the overall consensus on IEMs others have also reviewed"). My normalization based on each reviewer's product base is one way to fix this.

The penalty on low-review-count products is also a nice idea, though it seems a bit strong. For example, it seems unconvincing to me that the Dunu Gracier scores the same as the 622B and Fugaku. Maybe changing to 0.1 (both for plus and minus) would be better? Or have you tried it?

Edited by wpzdm on .

wpzdm wrote: Wow, that is a well-sorted plan!

I have only one comment, about a reviewer's "budget level" and its relation to "consistency". This is just what I mentioned about Smirk, who focuses on reviewing high-end products (high "budget level"). We can imagine there could be an "anti-Smirk", who focuses on low budget levels (and, of course, this is as helpful as Smirk!). Now, as we have seen for Smirk and can imagine for "anti-Smirk", even after the z-score normalization, the same score from Smirk and anti-Smirk would mean quite different things about the sound quality. Moreover, they both can have relatively low consistency under the current definition, because, for example, Smirk would, to some extent, consistently give lower scores than the consensus (I believe this is not the consistency you meant by "line up with the overall consensus on IEMs others have also reviewed"). My normalization based on each reviewer's product base is one way to fix this.

The penalty on low-review-count products is also a nice idea, though it seems a bit strong. For example, it seems unconvincing to me that the Dunu Gracier scores the same as the 622B and Fugaku. Maybe changing to 0.1 (both for plus and minus) would be better? Or have you tried it?

You are right, 0.1 steps are a better fit.

Edit: I have changed it

Edited by Endoki on .

Thanks again for the fast update. I have scanned the top 50, and yes, it looks more convincing now.

I started to roll out some more changes, basic normalization and adding weights based on the experience of the Reviewers. E.g. Precogvision is considered to be the pinnacle of critical listening and receives the biggest weight.

Edited by Endoki on .

I am done with rolling out the changes. The normalized score is now considering everything we discussed. On every IEM page, we show both scores on the top (IEMR Normalized Score and Reviewers Average Score) for comparison. Rankings only consider the normalized score.

Endoki wrote: I started to roll out some more changes, basic normalization and adding weights based on the experience of the Reviewers. E.g. Precogvision is considered to be the pinnacle of critical listening and receives the biggest weight.

Looks nice! In the future, it would be great to have some community-driven reviews of the reviewers.

After thinking again, I realized that your "Correcting price-to-performance bias" idea actually does more than the name suggests and kinda subsumes my normalization w.r.t. the reviewer's product base. For example, although Smirk does not favor high price-to-performance products, he does give high-priced ones lower scores than consensus anyway, so the algorithm will adjust his scores higher. So your idea, as far as I understand now, is effectively a normalization w.r.t. the reviewer's price-binned product base, which is more fine-grained and better!

Endoki wrote: I am done with rolling out the changes. The normalized score is now considering everything we discussed. On every IEM page, we show both scores on the top (IEMR Normalized Score and Reviewers Average Score) for comparison. Rankings only consider the normalized score.

Oh, wow, you did it while I was writing the previous reply! I will check it now!

Edit: this is definitely a big step forward, cheers! A question: the "Reviewers Average Score" on the product page is based on the original (unnormalized) reviewer scores, right?

Edited by wpzdm on .

wpzdm wrote:

Endoki wrote: I am done with rolling out the changes. The normalized score is now considering everything we discussed. On every IEM page, we show both scores on the top (IEMR Normalized Score and Reviewers Average Score) for comparison. Rankings only consider the normalized score.

Oh, wow, you did it while I was writing the previous reply! I will check it now!

Edit: this is definitely a big step forward, cheers! A question: the "Reviewers Average Score" on the product page is based on the original (unnormalized) reviewer scores, right?

Yes, they are displaying the original scores currently. I'm thinking about showing both, but I don't want to make it too complicated for users.
