This talk will look at how meta-review sites such as Metacritic, Gamerankings, and Rotten Tomatoes assess games, and how those scores relate to game sales. It will discuss the broad validity of various metascore generation processes, along with recent quantitative research into meta-review scores that touches on both the composition of these scores and their usefulness as metrics for assessing game quality.
1. ON THE VALIDITY AND IMPACT OF METACRITIC SCORES
Adams Greenwood-Ericksen, PhD
Special thanks to Erica Holcomb, MS, and Cameron Bolinger, MS
2. Agenda
Why does metareview happen?
How does metareview work?
Why is metareview problematic?
What have we found out so far?
What’s next?
3. In the beginning…
There was Play Meter (1974-present).
Then Weekly Shōnen Jump, Computer and Video Games, Electronic Games, Electronic Gaming Monthly (still around!), Computer Gaming World, Nintendo Power, and so on.*
* According to Wikipedia, at least. I wasn’t around until that last one.
4. Diversification and the internet
As gaming journals proliferated, so did the diversity of opinions.
With the rise of the internet, this proliferation led to information overload across all areas of media criticism.
This, in turn, led to the beginnings of metareview.
5. Metareview sites matter in the industry
In general, there’s a sense that metareview scores are a critical factor in the success or failure of game products.
Why?
THQ stock price
Bonus money for Fallout: New Vegas
Bonus money for Destiny
Warren Spector on Metacritic at DICE 2013
http://news.yahoo.com/homefront-reviews-torpedo-thq-stock-price-metacritic-broken-20110316-084500-427.html
http://www.kotaku.com.au/2014/09/destiny-review-scores-may-cost-bungie-25-million/
http://www.joystiq.com/2012/03/15/obsidian-missed-fallout-new-vegas-metacritic-bonus-by-one-point/
6. We wondered…
Why is this such a big deal?
How does it work?
Does it work at all?
How does Metareview help or hurt?
Who does it help or hurt?
Are there better ways to do this?
7. My grad students
(Because somebody’s got to do the real work)
Erica Holcomb, MS
Cameron Bolinger, MS
8. What’s good about metareview?
Clearinghouse and index for lots of different information sources
Aggregates lots of individual data points into a more coherent single answer
Metareview reduces information overload
9. Problems with metareview
Basic premise: give up nuance and diversity of opinion in exchange for clarity
Lots of issues with validity related to scores, aggregation, etc. (Greenwood-Ericksen, Poorman, & Papp, 2014)
Vulnerability to manipulation by third parties
Leads to oversimplification of a complex topic
Discards lots of relevant context
http://www.eludamos.org/index.php/eludamos
10. Metacritic vs. Gamerankings
Actually indicative of a theoretical difference in approach
Gamerankings: all currently extant publications have equal value
Metacritic: some publications are more trustworthy than others
Neither approach is without drawbacks: greater transparency vs. greater reliability of data (a minimal sketch of the two follows below)
Rotten Tomatoes has an interesting alternative approach as well (but they don’t review games).
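A minimal sketch of the two aggregation philosophies, in Python. The publications, scores, and weights here are hypothetical placeholders, not real Metacritic or Gamerankings data:

```python
# Minimal sketch of the two aggregation philosophies.
# Publications, scores, and weights are hypothetical,
# not Metacritic's or Gamerankings' actual data.

reviews = {
    "Publication A": 90,
    "Publication B": 75,
    "Publication C": 60,
}

# Gamerankings-style: every publication counts equally.
unweighted = sum(reviews.values()) / len(reviews)

# Metacritic-style: publications carry (undisclosed) trust weights.
weights = {
    "Publication A": 1.5,   # hypothetical "more trusted" outlet
    "Publication B": 1.0,
    "Publication C": 0.5,   # hypothetical "less trusted" outlet
}
weighted = (
    sum(score * weights[pub] for pub, score in reviews.items())
    / sum(weights.values())
)

print(f"Unweighted average: {unweighted:.1f}")  # 75.0
print(f"Weighted average:   {weighted:.1f}")    # 80.0
```

The same three reviews produce different aggregates depending solely on the hidden trust weights, which is exactly the transparency-vs-reliability trade-off above.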
11. Examination of validity issues
Identified lots of issues with the validity of the metareview process in general, and Metacritic specifically (Greenwood-Ericksen, Poorman, & Papp, 2014):
Loss of useful diversity
Issues with review reliability
Distortions in translation to a 100-point scale (illustrated in the sketch below)
Lack of transparency in the weighting system
Very serious problems with interpretation and application of results
Still found a very strong relationship between sales and scores (r = .72)
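As a toy illustration of the translation distortion, consider rescaling a 5-star rating onto a 100-point scale. The conversion function here is a generic linear rescale, not Metacritic's actual conversion table:

```python
# Toy illustration of scale-translation distortion. This linear
# rescale is hypothetical, not Metacritic's actual conversion table.

def stars_to_100(stars: float, max_stars: int = 5) -> float:
    """Linearly rescale an N-star rating onto a 0-100 scale."""
    return stars / max_stars * 100

# A half-star-granularity 5-star scale can only ever land on 11 points:
print(sorted({stars_to_100(s / 2) for s in range(0, 11)}))
# [0.0, 10.0, 20.0, ..., 100.0]

# ...so a "4 out of 5" and an "8.4 out of 10" from equally enthusiastic
# critics arrive at the aggregator as 80 vs. 84: false precision is
# created, and real nuance is lost.
```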
13. The transparency gap: Metacritic weights
One key issue with Metacritic: secret weighting of scores by publication
Exact weights are a moving target
Two things to look at:
Derived weights: try to figure out what numbers Metacritic is using (Greenwood-Ericksen, Poorman, & Papp, 2013; Bolinger, Holcomb, & Greenwood-Ericksen, 2015)
Observed weights: watch how much publications push or pull overall scores (Swisher, 2014)
14. Observed weights
Less about exact match, and more about rationality of influence
Scott Swisher (UW Madison, now at Cambridge)
Doctoral thesis on review weighting and reliability
Found some logical and interesting patterns over time
Gamespot/Jeff Gerstmann scandal, 2007
Really great treasure trove of data on Metacritic – check it out if you’re interested (a simple influence measure is sketched below).
https://s3.amazonaws.com/sswisher_econ/Swisher_InfoAgg_Summary_7-12-2013.pdf
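One simple way to operationalize "push or pull" is a leave-one-out influence measure: how far does the aggregate move when a single publication's review is dropped? This is a simplification for illustration, not necessarily Swisher's exact method, and the scores are made up:

```python
import numpy as np

# Leave-one-out influence: how much does the aggregate move when a
# publication's review is dropped? (A simplification for illustration,
# not necessarily Swisher's method. Scores are hypothetical.)

scores = {"Pub A": 92, "Pub B": 78, "Pub C": 70, "Pub D": 85}

overall = np.mean(list(scores.values()))
for pub in scores:
    without = np.mean([s for p, s in scores.items() if p != pub])
    print(f"{pub}: pulls the aggregate by {overall - without:+.2f}")
```

Tracked across many games and over time, outlets that consistently pull the aggregate further than an equal-weight average would allow are, observably, carrying more weight.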
15. Derived weights
“Derived” in that we’re trying to reproduce the actual weights Metacritic uses
This is really hard, it turns out
Obstacles (see the sketch after this list):
Rounding
Relatively rapid changes in weighting
Identifying starting weights
Identifying the number of tiers (or whether there are tiers at all)
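The recovery problem itself can be sketched as least squares: given a games-by-publications matrix of critic scores and the published metascores, solve for the weights. The data below are synthetic and this is only the general shape of the problem, not our actual pipeline; note how rounding alone already blurs the answer:

```python
import numpy as np

# Sketch of the derived-weights problem: given a (games x publications)
# matrix of critic scores S and published metascores m, find weights w
# (summing to 1) such that S @ w ~= m. All data here are synthetic;
# this is the general shape of the problem, not our actual pipeline.

rng = np.random.default_rng(0)
n_games, n_pubs = 50, 8
S = rng.integers(40, 100, size=(n_games, n_pubs)).astype(float)

true_w = np.array([2.0, 1.5, 1.5, 1.0, 1.0, 1.0, 0.5, 0.5])
true_w /= true_w.sum()                  # normalize weights to sum to 1

m_exact = S @ true_w
m_rounded = np.round(m_exact)           # sites publish integer scores...

# ...and that rounding is one of the obstacles above: solving the
# rounded system returns weights near, but not equal to, the truth.
w_hat, *_ = np.linalg.lstsq(S, m_rounded, rcond=None)
print("true weights:     ", np.round(true_w, 3))
print("recovered weights:", np.round(w_hat / w_hat.sum(), 3))
```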
16. Modeling derived weights
First pass: GDC 2013
Fundamental approach seemed to work
Problems with reliability of the method
Some assumptions probably wrong
Key outcomes:
Established the basic method as possible
Got us yelled at by Metacritic
In response, Metacritic published a ton of new information about how weighting works
Subsequently, found more issues with model stability and uniqueness
Lots of great new info to work with!
17. Fanmail!
What they said:
They use weighting tiers, but…
We had too many tiers
Our range of weights was too large, and…
We had at least some of the modeled weights wrong
18. Modeling derived weights: Recent work
Second pass (GDC 2015):
Computationally more robust
Incorporated new info from Metacritic
Removed most sources of human error
Focused on one narrow time frame
Focused on one single platform
Were able to identify a stable model with 0% error (that’s possible because the original inputs are man-made)
Note: that doesn’t necessarily mean it’s the actual values!
Full results to be presented at GDC San Francisco in March.
19. Another core question: Does the weighting matter at all?
Did a comparison using 2012 data of game sales (VGChartz) to:
Metacritic metascore
Gamerankings score
Unweighted average of critic scores curated by Metacritic
Found a significant correlation between sales and scores for each (sketched below), BUT…
Metacritic no better than the others
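The comparison itself amounts to computing Pearson's r between each score variant and sales. A sketch with synthetic placeholder values (the real analysis used per-game 2012 VGChartz data):

```python
import numpy as np
from scipy import stats

# Sketch of the score-vs-sales comparison. These arrays stand in for
# per-game 2012 sales (e.g., from VGChartz) and the three score
# variants; all values here are synthetic placeholders.

sales = np.array([1.20, 0.45, 2.10, 0.30, 0.95])   # millions of units
score_sets = {
    "Metacritic":   np.array([88, 72, 91, 65, 80]),
    "Gamerankings": np.array([87, 74, 90, 63, 81]),
    "Unweighted":   np.array([86, 73, 92, 64, 79]),
}

for name, scores in score_sets.items():
    r, p = stats.pearsonr(scores, sales)
    print(f"{name:>12}: r = {r:.2f} (p = {p:.3f})")
```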
20. Relative predictive value of metareview scores for sales
[Bar chart: “Correlation of Sales to Scores” – Pearson’s r (y-axis, 0 to 0.35) for 1st-week and total 2012 sales against the Metacritic, Gamerankings, and unweighted scores.]
21. What does that mean?
It means that, from the standpoint of sales prediction, the weights don’t matter.
Shouldn’t be hugely surprising – the differences were always very small
Might suggest that reviewers tend to have similar opinions – or that they’re benchmarking scores to each other
Suggests that Metacritic could drop this controversial aspect of their product without weakening its predictive value
22. What we’re thinking about next…
We’ve found some cool things, but there’s still a lot to be done on this:
Do game reviewers adjust their scores based on already-published reviews?
Do score averages tend to go up or down over time?
How much does marketing affect scores/sales?
Ultimately, more research is needed
Metareview isn’t going away – it’s too useful
Implications for virtually all media
Need to figure out how to do this better!
Presenter note (slide 3): Early on, most of these were print publications of limited circulation, which meant that gamers usually had only a single source of reviews to rely upon.
References:
Play Meter: http://www.ebay.com/itm/1977-arcade-pinball-PLAY-METER-MAGAZINE-Atarians-Wurlitzer-SeaWolf-Jukebox-ATARI-/130776351290