Reverse engineering Metacritic

Where I (successfully?) attempt to reverse engineer the relative weights that the website metacritic assigns to movie critics.

metacritic is a popular site that computes an aggregate metascore for each movie. The metascore is a weighted average of individual critic scores. The metacritic FAQ Page says:

Q: Can you tell me how each of the different critics are weighted in your formula?

A: Absolutely not.

That sounds like a challenge to me. Using standard machine learning/optimization techniques, we should be able to tell what critics are more important than others. In fact, the same techniques should also allow us to correctly predict the metascore for any new movie (given the individual critic ratings). This post describes my attempt to build such a system.

Note: All related code is available on my github.

The Model

We introduce some notations and assumptions about the problem:

metacritic uses ratings from \(n\) critics.
Our data set has \(m\) movies.
\(r_{ij}\) is the rating of movie \(i\) by critic \(j\). This forms an \(m \times n\) matrix.
- Not all critics rate all movies! In other words, \(r_{ij}\) may not be defined for all \(i\) and \(j\).
- Where defined, the values are constrained: \(0 \leq r_{ij} \leq 100\).
\(\theta_j\) is the relative weight (or importance) of critic \(j\) (this is what we are trying to learn).
- There is no point in having a critic weight of \(0\) (why even consider a critic whose rating does not affect the metascore at all?).
- In light of the previous point, we constrain critic weights to be positive, i.e., \(\theta_j > 0\) for all \(j\),
- Since these weights are relative, they must add up to one, i.e., \(\sum_{j=1}^n \theta_j = 1\).
- Critic weights stay constant across movies (but may get updated over time).
- The \(n\)-dimensional vector \(\theta = (\theta_1, \theta_2, \ldots, \theta_n)\) represents a solution, a possible assignment of weights to critics.
- Due to the above constraints, the solution space of \(\theta\) forms a bounded, affine hyperplane.
\(p_i\) is the published metascore for movie \(i\).
- These values are also constrained: \(0 \leq p_i \leq 100\) for all \(i\).
\(y_i(\theta)\) is the predicted metascore for movie \(i\) for a given choice of relative weights.
- We will drop the \(\theta\) when it is obvious.
\(p = (p_1, p_2, \ldots, p_m)\) and \(y(\theta) = (y_1(\theta), y_2(\theta), \ldots, y_m(\theta))\) are vectorized forms we will use for conciseness later.

An obvious definition of \(y_i(\theta)\) is simply a weighted sum:

\[y_i(\theta) = \sum_{j=1}^n \theta_j r_{ij}.\]

But there is a problem with this definition. Remember: Not all critics rate all movies. In other words, the summation above may be invalid, since not all \(r_{ij}\) values are necessarily defined. How do we deal with this incomplete matrix \(r_{ij}\)? My best guess is that metacritic normalizes the metascore over the available critic weights. For example, assume that the (excellent) movie Ex Machina has the index \(i = 4\) in our data set. Assume that only two critics, with weights \(\theta_1\) and \(\theta_2\) have currently rated this movie. We denote their ratings \(r_{41}\) and \(r_{42}\) respectively. The metascore for this movie is then

\[y_4(\theta_1, \theta_2) = \frac{\theta_1 r_{41} + \theta_2 r_{42}}{\theta_1 + \theta_2}.\]

In fact, the metacritic FAQ page says they wait until a movie has at least 4 reviews before computing a metascore. So they want at least 4 defined \(r_{ij}\) values for a given \(i\). Lets define the following additional variables:

\(r'_{ij} = r_{ij}\) if movie \(i\) is rated by critic \(j\) and \(0\) otherwise.
\(e_{ij} = 1\) if movie \(i\) is rated by critic \(j\), and \(0\) otherwise.

Note that \(r'_{ij}\) and \(e_{ij}\) are both \(m \times n\) matrices, but unlike \(r_{ij}\), they are fully defined.

Using these, we modify the definition for \(y_i\):

\[y_i(\theta) = \frac{ \sum_{j=1}^n \theta_j r'_{ij} }{ \sum_{j=1}^n \theta_j e_{ij} }.\]

How does this function vary with \(\theta\) (once we fix the \(r_{ij}\) values)? I wrote up a little script to plot \(y_4 (\theta_1, \theta_2)\) for the example involving the movie Ex Machina (I fixed the critic ratings to \(r_{41} = 79\) and \(r_{42} = 67\); I know. Stupid critics!) The following image of the plot hopefully makes it clear that \(y_i (\theta)\) is not linear in \(\theta\). But the function is still smooth (i.e., differentiable).

Now consider the \(m\)-vector \(d(\theta) = p - y(\theta)\). This vector is a measure of how off the predictions are from actual metascores for a given \(\theta\). We will try to find a \(\theta\) that minimizes the value of the function \(f(\theta) = \Vert d(\theta) \Vert\), where \(\Vert \cdot \Vert\) represents the \(L^2\) norm. Formally,

\[\DeclareMathOperator*{\argmin}{\arg\!\min} \begin{gather*} \argmin_{\theta} \Vert p - y(\theta) \Vert \\ \text{subject to} \\ \sum_{j=1}^n \theta_j = 1, \text{and}\\ \theta_j > 0~\text{for all}~j. \end{gather*}\]

This is a standard constrained minimization problem. Our expectation is that any solution \(\theta\) of the above system (a) fits the training set well, and (b) also predicts metascores for new movies. Notice that \(d\) is not a linear function of \(\theta\) because \(y(\theta)\) isn’t either. So, we have to use a nonlinear solver.

The Implementation

With (most of the) annoying math out of the way, lets write code! The implementation pipeline consists of the following stages:

Collect movie ratings data from metacritic.
Preprocess the data:
- Remove ratings from critics who’ve rated very few movies, and
- Create the \(r'_{ij}\) and \(e_{ij}\) matrices.
Partition the data into a training set and a test set.
Find a best fit \(\theta\) by running the optimization routine on the training set.
Compute accuracy against the test set.
Output the results.

It turns out that a Makefile is really well suited to building these kinds of pipelines, where each stage produces a file that can be used as a Make target for that stage. Each stage can be dependent on files produced in one or more previous stages.

Collecting ratings data from metacritic

Unfortunately, metacritic does not, as far as I know, have any API’s to make this data available easily. So I periodically scrape metacritic’s New Movie Releases page for links to actual metacritic movie pages, which I then scrape to get the overall metascore, and the individual critic ratings.

I used a combination of xpathtool and the lxml Python library for the scraping.

The output of this stage is a Python cPickle dump file that represents a dictionary of the form:

{ movie_url -> (metascore, individual_ratings), ... }

where individual_ratings is itself a dictionary of the form

{ critic_name -> numeric_rating, ... }.

For example, this structure could look like:

{
    'http://www.metacritic.com/movie/mad-max-fury-road/critic-reviews' ->
         (89,
          {'Anthony Lane (The New Yorker)': 100,
           'A.A. Dowd (TheWrap)': 95,
           ...}),
    'http://www.metacritic.com/movie/ex-machina/critic-reviews' ->
         (78,
          {'Steven Rea (Philadelphia Inquier)': 100,
           'Manohla Dargis (The New York Times)': 90,
           ...}),
    ...
}

I know cPickle is not exactly the most portable format, but it works well at this early stage. In the long run, I want to persist all of the ratings data in a database (sqlite? Postgres?).

Preprocessing

We first eliminate from our data set the long tail of critics who’ve rated very few movies. Not only are these critics not very influential to the overall optimization routine, eliminating them also helps reduce \(n\) (the matrix width). Accordingly, there is a configurable rating count threshold, currently set to \(5\). We do one pass over the ratings data and construct a dictionary of the form:

{ critic_name -> movies_rated, ... }

We then do another pass through the data and remove ratings from critics whose movies_rated value is lower than the threshold.

The second preprocessing step is to construct the \(r'_{ij}\) and \(e_{ij}\) matrices, which of course is a simple matter of programming. I store these values as numpy matrices.

Partitioning the data set

This is straightforward. I use a configurable training_frac parameter (a value in the interval \([0, 1]\)) to probabilistically split the cleaned up data into a test set and a training set.

Optimization routine

There are numerous “solvers” available for constrained optimization problems of the type we described above, but not all of them are freely available.

I tried the following two solvers, available as part of scipy.optimize:

Solver	Differentiability requirements	Allows bounds?	Allows equality constraints?	Allows inequality constraints?
Sequential Least Squares Programming (SLSQP)	The objective function and the constraints should be twice continuously differentiable	Yes	Yes	Yes
Constrained Optimization By Linear Approximations (COBYLA)	None	No	No	Yes

Note that \(y(\theta)\) (and therefore \(f(\theta)\)) satisfies the differentiability requirement of SLSQP.

Also, COBYLA does not allow you to specify bounds on \(\theta\) values or equality constraints. So, we employ a common technique in optimization formulations, which is to push the constraints into the objective function. Consider the “tub” function \(\tau(x, l, u)\) defined as:

\[\tau (x, l, u) = \begin{cases} 0 & \quad \text{if } l \leq x \leq u, \\ 1 & \quad \text{otherwise}. \end{cases}\]

Our modified objective function (for use with COBYLA) becomes:

\[f(\theta) = \Vert p - y(\theta) \Vert + P_h \cdot \Vert 1 - \sum_{j=1}^n \theta_j \Vert + P_b \cdot \sum_{j = 1}^n \tau(\theta_j, 0, 1),\]

where \(P_h\) and \(P_b\) are configurable weights that decide how much we should penalize the optimization algorithm when it chooses:

a \(\theta\) that doesn’t lie on the affine hyperplane, and
\(\theta_j\) values outside the interval \([0, 1]\),

respectively.

Setting both \(P_h\) and \(P_b\) to 0 reduces our objective function to its original form, so we can use the same function for both solvers by simply tweaking these weights.

Results

Before I actually launch into details, I should note the following issues right at the outset:

I was actually unable to get either SLSQP or COBYLA to ever successfully converge on a solution.
The \(\theta\) values (i.e., critic weights) learned by these solvers were often way outside the interval \([0, 1]\).

Most of the times, both routines finished their iterations and failed with errors of this form:

optimization failed [8]: Positive directional derivative for linesearch
optimization failed [2]: Maximum number of function evaluations has been exceeded

If you have experience with the numpy optimization library, I’d love to hear about suggestions you may have on how to deal with these errors.

Perhaps more interestingly, in spite of the above issues, the learned \(\theta\) values were still able to successfully predict metascores for movies in the test set.

After removing ratings from insignificant critics, I constructed a training set of about 2800 ratings of 190 movies by 188 critics.

The following table lists the top 20 critics by weight learned using the above training set with each optimization routine. Note that the weights are expressed as a percentage of the weight for the top critic in each list. Interestingly enough, both algorithms think Mike Scott of the New Orleans Times-Picayune is the metacritic MVP. So, for example, according to SLSQP, a review by Justin Lowe carries only 95% of the importance that is given to a review by Mike Scott.

SLSQP		COBYLA
Weight	Critic	Weight	Critic
\(\cdot\)	Mike Scott (New Orleans Times-Picayune)	\(\cdot\)	Mike Scott (New Orleans Times-Picayune)
0.949948	Justin Lowe (The Hollywood Reporter)	0.902288	Slant Magazine
0.929495	Jordan Hoffman (The Guardian)	0.900024	Ronnie Scheib (Variety)
0.914186	Marjorie Baumgarten (Austin Chronicle)	0.890113	Wes Greene (Slant Magazine)
0.910820	Fionnuala Halligan (Screen International)	0.887845	Chris Cabin (Slant Magazine)
0.909801	James Mottram (Total Film)	0.885791	Martin Tsai (Los Angeles Times)
0.904564	Variety	0.863626	Lawrence Toppman (Charlotte Observer)
0.903029	Guy Lodge (Variety)	0.858237	Anthony Lane (The New Yorker)
0.897749	Inkoo Kang (TheWrap)	0.845864	Fionnuala Halligan (Screen International)
0.894605	indieWIRE	0.834088	Boyd van Hoeij (The Hollywood Reporter)
0.892237	Ben Kenigsberg (The New York Times)	0.820908	Variety
0.882656	Mike D’Angelo (The Dissolve)	0.814152	Nicolas Rapold (The New York Times)
0.875550	Simon Abrams (Village Voice)	0.735623	Justin Lowe (The Hollywood Reporter)
0.875385	Martin Tsai (Los Angeles Times)	0.629166	Mark Olsen (Los Angeles Times)
0.875062	Manohla Dargis (The New York Times)	0.625141	The Globe and Mail (Toronto)
0.874889	Kyle Smith (New York Post)	0.567734	James Berardinelli (ReelViews)
0.872482	Nicolas Rapold (The New York Times)	0.562606	Peter Sobczynski (RogerEbert.com)
0.869911	James Berardinelli (ReelViews)	0.558980	John Anderson (Wall Street Journal)
0.863118	Ronnie Scheib (Variety)	0.520855	Steve Macfarlane (Slant Magazine)
0.860796	Nikola Grozdanovic (The Playlist)	0.510842	Jordan Mintzer (The Hollywood Reporter)

I should clarify that the top 20 list can change from one run to the next, since it depends on the training set chosen (which is at random).

Next, we show a few sample predicted metascores:

Movie	Actual Metascore	Predicted (SLSQP)	Predicted (COBYLA)
Survivor	26	30 (16.18%)	33 (28.12%)
Entourage	38	57 (50.25%)	43 (15.65%)
Chappie	41	46 (12.89%)	43 (6.00%)
Dreamcatcher	86	77 (-10.46%)	83 (-2.77%)
Avengers: Age of Ultron	66	62 (-5.05%)	67 (1.57%)
Gemma Bovery	57	0 (-100.00%)	61 (7.15%)
Alleluia	84	90 (7.41%)	87 (4.13%)

With a test set of size 60, the movie predictions by the two algorithms had the following RMSE values:

SLSQP	COBYLA
0.053437	0.016942

Takeaways

Working out the math was obviously fun!
- I got to brush up on my long defunct skills with numpy, scipy, matplotlib, etc.
Even a toy project like this one can end up demanding substantial time and attention, especially if it is something you want to share with the rest of the world. For example, I ended up setting up proper Python package management for all the code.
Writing this blog post was also extremely useful because it clarified my own thinking on the topic. I was actually able to go back and refactor the code to better match the implementation pipeline I described above.