*Where I (successfully?) attempt to reverse engineer
the relative weights that the website
metacritic
assigns to movie critics.*

metacritic is a popular site that computes an
aggregate **metascore** for each movie.
The metascore is a *weighted average* of individual critic scores.
The metacritic FAQ Page says:

Q: Can you tell me how each of the different critics are weighted in your formula?

A: Absolutely not.

That sounds like a challenge to me. Using standard machine learning/optimization techniques, we should be able to tell what critics are more important than others. In fact, the same techniques should also allow us to correctly predict the metascore for any new movie (given the individual critic ratings). This post describes my attempt to build such a system.

*Note: All related code is available on my github.*

### The Model

We introduce some notations and assumptions about the problem:

- metacritic uses ratings from critics.
- Our data set has movies.
- is the rating of movie by critic .
This forms an matrix.
- Not all critics rate all movies! In other words, may not be defined for all and .
- Where defined, the values are constrained: .

- is the
*relative weight*(or importance) of critic (this is what we are trying to*learn*).- There is no point in having a critic weight of (why even consider a critic whose rating does not affect the metascore at all?).
- In light of the previous point, we constrain
critic weights to be
*positive*,*i.e.*, for all , - Since these weights are relative, they must add up to one,
*i.e.*, . - Critic weights stay constant across movies (but may get updated over time).
- The -dimensional vector
represents
*a solution*, a possible assignment of weights to critics. - Due to the above constraints, the solution space of forms
a
*bounded, affine hyperplane*.

- is the
*published*metascore for movie .- These values are also constrained: for all .

- is the
*predicted*metascore for movie for a given choice of relative weights.- We will drop the when it is obvious.

- and are vectorized forms we will use for conciseness later.

An obvious definition of is simply a weighted sum:

But there is a problem with this definition. Remember:
*Not all critics rate all movies*.
In other words, the summation above may be invalid,
since not all values are necessarily defined.
How do we deal with this *incomplete* matrix ?
My best guess is that metacritic normalizes the metascore
over the available critic weights.
For example, assume that the (excellent) movie
Ex Machina
has the index in our data set.
Assume that only two critics,
with weights and
have currently rated this movie.
We denote their ratings and respectively.
The metascore for this movie is then

In fact, the metacritic FAQ page says they wait until a movie has at least 4 reviews before computing a metascore. So they want at least 4 defined values for a given . Lets define the following additional variables:

- if movie is rated by critic and otherwise.
- if movie is rated by critic , and otherwise.

Note that and are both matrices, but unlike , they are fully defined.

Using these, we modify the definition for :

How does this function vary with (once we fix the values)?
I wrote up
a little script to plot
for the example involving the movie *Ex Machina*
(I fixed the critic ratings to
and ;
I know. Stupid critics!)
The following image of the plot
hopefully makes it clear that
is *not* linear in .
But the function is still *smooth* (*i.e.*, *differentiable*).

Now consider the -vector .
This vector is a measure of how *off* the predictions
are from actual metascores for
a given .
We will try to find a
that minimizes the value of the function
,
where represents the
norm.
Formally,

This is a standard
constrained minimization problem.
Our expectation is that any solution
of the above system
*(a)* fits the training set well, and
*(b)* also predicts metascores for new movies.
Notice that is not a *linear* function of
because isn’t either.
So, we have to use a
nonlinear solver.

### The Implementation

With (most of the) annoying math out of the way, lets write code! The implementation pipeline consists of the following stages:

- Collect movie ratings data from metacritic.
- Preprocess the data:
- Remove ratings from critics who’ve rated very few movies, and
- Create the and matrices.

- Partition the data into a
*training*set and a*test*set. - Find a best fit by running the optimization routine on the training set.
- Compute accuracy against the test set.
- Output the results.

It turns out that a Makefile is really well suited to
building these kinds of pipelines,
where each stage produces a *file* that can be used as a Make target
for that stage.
Each stage can be dependent on files produced in one or more previous stages.

#### Collecting ratings data from metacritic

Unfortunately, metacritic does not, as far as I know, have any API’s to make this data available easily. So I periodically scrape metacritic’s New Movie Releases page for links to actual metacritic movie pages, which I then scrape to get the overall metascore, and the individual critic ratings.

I used a combination of xpathtool and the lxml Python library for the scraping.

The output of this stage is a Python cPickle dump file that represents a dictionary of the form:

```
{ movie_url -> (metascore, individual_ratings), ... }
```

where `individual_ratings`

is itself a dictionary of the form

```
{ critic_name -> numeric_rating, ... }.
```

For example, this structure could look like:

```
{
'http://www.metacritic.com/movie/mad-max-fury-road/critic-reviews' ->
(89,
{'Anthony Lane (The New Yorker)': 100,
'A.A. Dowd (TheWrap)': 95,
...}),
'http://www.metacritic.com/movie/ex-machina/critic-reviews' ->
(78,
{'Steven Rea (Philadelphia Inquier)': 100,
'Manohla Dargis (The New York Times)': 90,
...}),
...
}
```

I know cPickle is not exactly the most portable format, but it works well at this early stage. In the long run, I want to persist all of the ratings data in a database (sqlite? Postgres?).

#### Preprocessing

We first eliminate from our data set
the long tail of critics who’ve rated very few movies.
Not only are these critics not very influential
to the overall optimization routine,
eliminating them also
helps reduce (the matrix width).
Accordingly, there is a configurable *rating count threshold*,
currently set to .
We do one pass over the ratings data and construct
a dictionary of the form:

```
{ critic_name -> movies_rated, ... }
```

We then do another pass through the data and remove ratings
from critics whose `movies_rated`

value is lower than the
threshold.

The second preprocessing step is to construct the and matrices, which of course is a simple matter of programming. I store these values as numpy matrices.

#### Partitioning the data set

This is straightforward.
I use a configurable `training_frac`

parameter
(a value in the interval ) to probabilistically
split the cleaned up data into a test set and a training
set.

#### Optimization routine

There are numerous “solvers” available for constrained optimization problems of the type we described above, but not all of them are freely available.

I tried the following two solvers, available as part of scipy.optimize:

Solver | Differentiability requirements | Allows bounds? | Allows equality constraints? | Allows inequality constraints? |
---|---|---|---|---|

Sequential Least Squares Programming (SLSQP) | The objective function and the constraints should be twice continuously differentiable | Yes | Yes | Yes |

Constrained Optimization By Linear Approximations (COBYLA) | None | No | No | Yes |

Note that (and therefore ) satisfies the differentiability requirement of SLSQP.

Also, COBYLA does not allow you to specify bounds on
values or equality constraints.
So, we employ a common technique in optimization formulations,
which is to push the constraints *into the objective function*.
Consider the “tub” function defined as:

Our modified objective function (for use with COBYLA) becomes:

where and are configurable weights that decide
how much we should *penalize* the optimization algorithm when it
chooses:

- a that doesn’t lie on the affine hyperplane, and
- values outside the interval ,

respectively.

Setting both and to 0 reduces our objective function to its original form, so we can use the same function for both solvers by simply tweaking these weights.

### Results

Before I actually launch into details, I should note the following issues right at the outset:

- I was actually unable to get either SLSQP or COBYLA to ever successfully converge on a solution.
- The values (
*i.e.*, critic weights) learned by these solvers were often*way*outside the interval .

Most of the times, both routines finished their iterations and failed with errors of this form:

```
optimization failed [8]: Positive directional derivative for linesearch
optimization failed [2]: Maximum number of function evaluations has been exceeded
```

If you have experience with the numpy optimization library, I’d love to hear about suggestions you may have on how to deal with these errors.

Perhaps more interestingly, in spite of the above issues, the learned values were still able to successfully predict metascores for movies in the test set.

After removing ratings from insignificant critics, I constructed a training set of about 2800 ratings of 190 movies by 188 critics.

The following table lists
the **top 20** critics by weight
learned using the above training set
with each optimization routine.
Note that the weights are expressed as a percentage
of the weight for the *top* critic in each list.
Interestingly enough,
both algorithms think
Mike Scott of the New Orleans Times-Picayune
is the metacritic MVP.
So, for example, according to SLSQP,
a review by Justin Lowe carries only
95% of the importance that is given
to a review by Mike Scott.

SLSQP | COBYLA | ||
---|---|---|---|

Weight | Critic | Weight | Critic |

Mike Scott (New Orleans Times-Picayune) | Mike Scott (New Orleans Times-Picayune) | ||

0.949948 | Justin Lowe (The Hollywood Reporter) | 0.902288 | Slant Magazine |

0.929495 | Jordan Hoffman (The Guardian) | 0.900024 | Ronnie Scheib (Variety) |

0.914186 | Marjorie Baumgarten (Austin Chronicle) | 0.890113 | Wes Greene (Slant Magazine) |

0.910820 | Fionnuala Halligan (Screen International) | 0.887845 | Chris Cabin (Slant Magazine) |

0.909801 | James Mottram (Total Film) | 0.885791 | Martin Tsai (Los Angeles Times) |

0.904564 | Variety | 0.863626 | Lawrence Toppman (Charlotte Observer) |

0.903029 | Guy Lodge (Variety) | 0.858237 | Anthony Lane (The New Yorker) |

0.897749 | Inkoo Kang (TheWrap) | 0.845864 | Fionnuala Halligan (Screen International) |

0.894605 | indieWIRE | 0.834088 | Boyd van Hoeij (The Hollywood Reporter) |

0.892237 | Ben Kenigsberg (The New York Times) | 0.820908 | Variety |

0.882656 | Mike D’Angelo (The Dissolve) | 0.814152 | Nicolas Rapold (The New York Times) |

0.875550 | Simon Abrams (Village Voice) | 0.735623 | Justin Lowe (The Hollywood Reporter) |

0.875385 | Martin Tsai (Los Angeles Times) | 0.629166 | Mark Olsen (Los Angeles Times) |

0.875062 | Manohla Dargis (The New York Times) | 0.625141 | The Globe and Mail (Toronto) |

0.874889 | Kyle Smith (New York Post) | 0.567734 | James Berardinelli (ReelViews) |

0.872482 | Nicolas Rapold (The New York Times) | 0.562606 | Peter Sobczynski (RogerEbert.com) |

0.869911 | James Berardinelli (ReelViews) | 0.558980 | John Anderson (Wall Street Journal) |

0.863118 | Ronnie Scheib (Variety) | 0.520855 | Steve Macfarlane (Slant Magazine) |

0.860796 | Nikola Grozdanovic (The Playlist) | 0.510842 | Jordan Mintzer (The Hollywood Reporter) |

I should clarify that the top 20 list can change from one run to the next, since it depends on the training set chosen (which is at random).

Next, we show a few sample predicted metascores:

Movie | Actual Metascore | Predicted (SLSQP) | Predicted (COBYLA) |
---|---|---|---|

Survivor | 26 | 30 (16.18%) | 33 (28.12%) |

Entourage | 38 | 57 (50.25%) | 43 (15.65%) |

Chappie | 41 | 46 (12.89%) | 43 (6.00%) |

Dreamcatcher | 86 | 77 (-10.46%) | 83 (-2.77%) |

Avengers: Age of Ultron | 66 | 62 (-5.05%) | 67 (1.57%) |

Gemma Bovery | 57 | 0 (-100.00%) | 61 (7.15%) |

Alleluia | 84 | 90 (7.41%) | 87 (4.13%) |

With a test set of size 60, the movie predictions by the two algorithms had the following RMSE values:

SLSQP | COBYLA |
---|---|

0.053437 | 0.016942 |

## Takeaways

- Working out the math was obviously fun!
- I got to brush up on my long defunct skills with numpy, scipy, matplotlib, etc.

- Even a toy project like this one can end up demanding substantial time and attention, especially if it is something you want to share with the rest of the world. For example, I ended up setting up proper Python package management for all the code.
- Writing this blog post was also extremely useful because it clarified my own thinking on the topic. I was actually able to go back and refactor the code to better match the implementation pipeline I described above.