This talk will look at how meta-review sites such as Metacritic, Gamerankings, and Rotten Tomatoes assess games, and how those scores relate to game sales. It will discuss the broad validity of various metascore generation processes, along with recent quantitative research into meta-review scores that touches on both the composition of these scores and their usefulness as metrics for assessing game quality.
1. ON THE VALIDITY AND IMPACT OF METACRITIC SCORES
Adams Greenwood-Ericksen, PhD
Special thanks to Erica Holcomb, MS, and Cameron Bolinger, MS
2. Agenda
Why does metareview happen?
How does metareview work?
Why is metareview problematic?
What have we found out so far?
What’s next?
3. In the beginning…
There was Play Meter (1974-present).
Then Weekly Shōnen Jump, Computer and Video Games, Electronic Games, Electronic Gaming Monthly (still around!), Computer Gaming World, Nintendo Power, and so on.*
* According to Wikipedia, at least. I wasn’t around until that last one.
4. Diversification and the internet
As gaming journals proliferated, so did the diversity of opinions.
With the rise of the internet, this proliferation led to information overload across all areas of media criticism.
This, in turn, led to the beginnings of metareview.
5. Metareview sites matter in the industry
In general, there’s a sense that metareview scores are a critical factor in the success or failure of game products.
Why?
THQ stock price
Bonus money for Fallout: New Vegas
Bonus money for Destiny
Warren Spector on Metacritic at DICE 2013
http://news.yahoo.com/homefront-reviews-torpedo-thq-stock-price-metacritic-broken-20110316-084500-427.html
http://www.kotaku.com.au/2014/09/destiny-review-scores-may-cost-bungie-25-million/
http://www.joystiq.com/2012/03/15/obsidian-missed-fallout-new-vegas-metacritic-bonus-by-one-point/
6. We wondered…
Why is this such a big deal?
How does it work?
Does it work at all?
How does Metareview help or hurt?
Who does it help or hurt?
Are there better ways to do this?
7. My grad students
(Because somebody’s got to do the real work)
Erica Holcomb, MS
Cameron Bolinger, MS
8. What’s good about metareview?
Clearinghouse and index for lots of different information sources
Aggregates lots of individual data points into a more coherent single answer
Metareview reduces information overload
9. Problems with metareview
Basic premise: give up nuance and diversity of opinion in exchange for clarity
Lots of issues with validity related to scores, aggregation, etc. (Greenwood-Ericksen, Poorman, & Papp, 2014)
Vulnerability to manipulation by third parties
Leads to oversimplification of a complex topic
Discards lots of relevant context
http://www.eludamos.org/index.php/eludamos
10. Metacritic vs. Gamerankings
Actually indicative of a theoretical difference in approach
Gamerankings: all currently extant publications have equal value
Metacritic: some publications are more trustworthy than others
Neither approach is without drawbacks: greater transparency vs. greater reliability of data (a minimal sketch of the two follows below)
Rotten Tomatoes has an interesting alternative approach as well (but they don’t review games).
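A minimal sketch of the two aggregation philosophies, in Python. The publications, scores, and weights here are hypothetical placeholders, not real Metacritic or Gamerankings data:

```python
# Minimal sketch of the two aggregation philosophies.
# Publications, scores, and weights are hypothetical,
# not Metacritic's or Gamerankings' actual data.

reviews = {
    "Publication A": 90,
    "Publication B": 75,
    "Publication C": 60,
}

# Gamerankings-style: every publication counts equally.
unweighted = sum(reviews.values()) / len(reviews)

# Metacritic-style: publications carry (undisclosed) trust weights.
weights = {
    "Publication A": 1.5,   # hypothetical "more trusted" outlet
    "Publication B": 1.0,
    "Publication C": 0.5,   # hypothetical "less trusted" outlet
}
weighted = (
    sum(score * weights[pub] for pub, score in reviews.items())
    / sum(weights.values())
)

print(f"Unweighted average: {unweighted:.1f}")  # 75.0
print(f"Weighted average:   {weighted:.1f}")    # 80.0
```

The same three reviews produce different aggregates depending solely on the hidden trust weights, which is exactly the transparency-vs-reliability trade-off above.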
11. Examination of validity issues
Identified lots of issues with the validity of the metareview process in general, and Metacritic specifically (Greenwood-Ericksen, Poorman, & Papp, 2014):
Loss of useful diversity
Issues with review reliability
Distortions in translation to a 100-point scale (illustrated in the sketch below)
Lack of transparency in the weighting system
Very serious problems with interpretation and application of results
Still found a very strong relationship between sales and scores (r = .72)
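As a toy illustration of the translation distortion, consider rescaling a 5-star rating onto a 100-point scale. The conversion function here is a generic linear rescale, not Metacritic's actual conversion table:

```python
# Toy illustration of scale-translation distortion. This linear
# rescale is hypothetical, not Metacritic's actual conversion table.

def stars_to_100(stars: float, max_stars: int = 5) -> float:
    """Linearly rescale an N-star rating onto a 0-100 scale."""
    return stars / max_stars * 100

# A half-star-granularity 5-star scale can only ever land on 11 points:
print(sorted({stars_to_100(s / 2) for s in range(0, 11)}))
# [0.0, 10.0, 20.0, ..., 100.0]

# ...so a "4 out of 5" and an "8.4 out of 10" from equally enthusiastic
# critics arrive at the aggregator as 80 vs. 84: false precision is
# created, and real nuance is lost.
```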
13. The transparency gap: Metacritic weights
One key issue with Metacritic: secret weighting of scores by publication
Exact weights are a moving target
Two things to look at:
Derived weights: try to figure out what numbers Metacritic is using (Greenwood-Ericksen, Poorman, & Papp, 2013; Bolinger, Holcomb, & Greenwood-Ericksen, 2015)
Observed weights: watch how much publications push or pull overall scores (Swisher, 2014)
14. Observed weights
Less about exact match, and more about rationality of influence
Scott Swisher (UW Madison, now at Cambridge)
Doctoral thesis on review weighting and reliability
Found some logical and interesting patterns over time
Gamespot/Jeff Gerstmann scandal, 2007
Really great treasure trove of data on Metacritic – check it out if you’re interested (a simple influence measure is sketched below).
https://s3.amazonaws.com/sswisher_econ/Swisher_InfoAgg_Summary_7-12-2013.pdf
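One simple way to operationalize "push or pull" is a leave-one-out influence measure: how far does the aggregate move when a single publication's review is dropped? This is a simplification for illustration, not necessarily Swisher's exact method, and the scores are made up:

```python
import numpy as np

# Leave-one-out influence: how much does the aggregate move when a
# publication's review is dropped? (A simplification for illustration,
# not necessarily Swisher's method. Scores are hypothetical.)

scores = {"Pub A": 92, "Pub B": 78, "Pub C": 70, "Pub D": 85}

overall = np.mean(list(scores.values()))
for pub in scores:
    without = np.mean([s for p, s in scores.items() if p != pub])
    print(f"{pub}: pulls the aggregate by {overall - without:+.2f}")
```

Tracked across many games and over time, outlets that consistently pull the aggregate further than an equal-weight average would allow are, observably, carrying more weight.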
15. Derived weights
“Derived” in that we’re trying to reproduce the actual weights Metacritic uses
This is really hard, it turns out
Obstacles (see the sketch after this list):
Rounding
Relatively rapid changes in weighting
Identifying starting weights
Identifying the number of tiers (or whether there are tiers at all)
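The recovery problem itself can be sketched as least squares: given a games-by-publications matrix of critic scores and the published metascores, solve for the weights. The data below are synthetic and this is only the general shape of the problem, not our actual pipeline; note how rounding alone already blurs the answer:

```python
import numpy as np

# Sketch of the derived-weights problem: given a (games x publications)
# matrix of critic scores S and published metascores m, find weights w
# (summing to 1) such that S @ w ~= m. All data here are synthetic;
# this is the general shape of the problem, not our actual pipeline.

rng = np.random.default_rng(0)
n_games, n_pubs = 50, 8
S = rng.integers(40, 100, size=(n_games, n_pubs)).astype(float)

true_w = np.array([2.0, 1.5, 1.5, 1.0, 1.0, 1.0, 0.5, 0.5])
true_w /= true_w.sum()                  # normalize weights to sum to 1

m_exact = S @ true_w
m_rounded = np.round(m_exact)           # sites publish integer scores...

# ...and that rounding is one of the obstacles above: solving the
# rounded system returns weights near, but not equal to, the truth.
w_hat, *_ = np.linalg.lstsq(S, m_rounded, rcond=None)
print("true weights:     ", np.round(true_w, 3))
print("recovered weights:", np.round(w_hat / w_hat.sum(), 3))
```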
16. Modeling derived weights
First pass: GDC 2013
Fundamental approach seemed to work
Problems with reliability of the method
Some assumptions probably wrong
Key outcomes:
Established the basic method as possible
Got us yelled at by Metacritic
In response, Metacritic published a ton of new information about how weighting works
Subsequently, found more issues with model stability and uniqueness
Lots of great new info to work with!
17. Fanmail!
What they said:
They use weighting tiers, but…
We had too many tiers
Our range of weights was too large, and…
We had at least some of the modeled weights wrong
18. Modeling derived weights: Recent work
Second pass (GDC 2015):
Computationally more robust
Incorporated new info from Metacritic
Removed most sources of human error
Focused on one narrow time frame
Focused on one single platform
Were able to identify a stable model with 0% error (that’s possible because the original inputs are man-made)
Note: that doesn’t necessarily mean it’s the actual values!
Full results to be presented at GDC San Francisco in March.
19. Another core question: Does the weighting matter at all?
Did a comparison using 2012 data of game sales (VGChartz) to:
Metacritic metascore
Gamerankings score
Unweighted average of critic scores curated by Metacritic
Found a significant correlation between sales and scores for each (sketched below), BUT…
Metacritic no better than the others
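The comparison itself amounts to computing Pearson's r between each score variant and sales. A sketch with synthetic placeholder values (the real analysis used per-game 2012 VGChartz data):

```python
import numpy as np
from scipy import stats

# Sketch of the score-vs-sales comparison. These arrays stand in for
# per-game 2012 sales (e.g., from VGChartz) and the three score
# variants; all values here are synthetic placeholders.

sales = np.array([1.20, 0.45, 2.10, 0.30, 0.95])   # millions of units
score_sets = {
    "Metacritic":   np.array([88, 72, 91, 65, 80]),
    "Gamerankings": np.array([87, 74, 90, 63, 81]),
    "Unweighted":   np.array([86, 73, 92, 64, 79]),
}

for name, scores in score_sets.items():
    r, p = stats.pearsonr(scores, sales)
    print(f"{name:>12}: r = {r:.2f} (p = {p:.3f})")
```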
20. Relative predictive value of metareview scores for sales
[Bar chart: “Correlation of Sales to Scores” – Pearson’s r (y-axis, 0 to 0.35) for 1st-week and total 2012 sales against the Metacritic, Gamerankings, and unweighted scores.]
21. What does that mean?
It means that, from the standpoint of sales prediction, the weights don’t matter.
Shouldn’t be hugely surprising – the differences were always very small
Might suggest that reviewers tend to have similar opinions – or that they’re benchmarking scores to each other
Suggests that Metacritic could drop this controversial aspect of their product without weakening its predictive value
22. What we’re thinking about next…
We’ve found some cool things, but there’s still a lot to be done on this:
Do game reviewers adjust their scores based on already-published reviews?
Do score averages tend to go up or down over time?
How much does marketing affect scores/sales?
Ultimately, more research is needed
Metareview isn’t going away – it’s too useful
Implications for virtually all media
Need to figure out how to do this better!
Presenter note (slide 3): Early on, most of these were print publications of limited circulation, which meant that gamers usually had only a single source of reviews to rely upon.
References:
Play Meter: http://www.ebay.com/itm/1977-arcade-pinball-PLAY-METER-MAGAZINE-Atarians-Wurlitzer-SeaWolf-Jukebox-ATARI-/130776351290