A Mostly Technical Blog - Shashank Ramaprasad
http://shashank.ramaprasad.com
Sat, 24 Jun 2017 23:26:31 +0000

Reverse engineering Metacritic
<p><em>Where I (successfully?) attempt to reverse engineer
the relative weights that the website
<a href="http://www.metacritic.com/">metacritic</a>
assigns to movie critics.</em></p>
<p>metacritic is a popular site that computes an
aggregate <strong>metascore</strong> for each movie.
The metascore is a <em>weighted average</em> of individual critic scores.
The <a href="http://www.metacritic.com/faq">metacritic FAQ Page</a> says:</p>
<blockquote>
<p>Q: Can you tell me how each of the different critics are weighted in your formula?</p>
<p>A: Absolutely not.</p>
</blockquote>
<p>That sounds like a challenge to me.
Using standard machine learning/optimization techniques,
we should be able to
tell which critics are more important than others.
In fact, the same techniques
should also allow us to correctly predict the metascore
for any new movie (given the individual critic ratings).
This post describes my attempt to build such a system.</p>
<p><em>Note: All related code is available on <a href="https://github.com/shashank025/metacritic-weights">my github</a>.</em></p>
<h3 id="the-model">The Model</h3>
<p>We introduce some notation and assumptions about the problem:</p>
<ul>
<li>metacritic uses ratings from <script type="math/tex">n</script> critics.</li>
<li>Our data set has <script type="math/tex">m</script> movies.</li>
<li><script type="math/tex">r_{ij}</script> is the rating of movie <script type="math/tex">i</script> by critic <script type="math/tex">j</script>.
This forms an <script type="math/tex">m \times n</script> matrix.
<ul>
<li>Not all critics rate all movies!
In other words, <script type="math/tex">r_{ij}</script> may not be defined for all <script type="math/tex">i</script> and <script type="math/tex">j</script>.</li>
<li>Where defined, the values are constrained: <script type="math/tex">0 \leq r_{ij} \leq 100</script>.</li>
</ul>
</li>
<li><script type="math/tex">\theta_j</script> is the <em>relative weight</em> (or importance) of critic <script type="math/tex">j</script>
(this is what we are trying to <em>learn</em>).
<ul>
<li>There is no point in having a critic weight of <script type="math/tex">0</script>
(why even consider a critic whose rating does not affect the metascore at all?).</li>
<li>In light of the previous point, we constrain
critic weights to be <em>positive</em>,
<em>i.e.</em>, <script type="math/tex">\theta_j > 0</script> for all <script type="math/tex">j</script>.</li>
<li>Since these weights are relative, they must add up to one,
<em>i.e.</em>, <script type="math/tex">\sum_{j=1}^n \theta_j = 1</script>.</li>
<li>Critic weights stay constant across movies (but may get updated over time).</li>
<li>The <script type="math/tex">n</script>-dimensional vector
<script type="math/tex">\theta = (\theta_1, \theta_2, \ldots, \theta_n)</script>
represents <em>a solution</em>, a possible assignment of weights to critics.</li>
<li>Due to the above constraints, the solution space of <script type="math/tex">\theta</script> forms
a <em>bounded, <a href="https://en.wikipedia.org/wiki/Hyperplane#Affine_hyperplanes">affine hyperplane</a></em>.</li>
</ul>
</li>
<li><script type="math/tex">p_i</script> is the <em>published</em> metascore for movie <script type="math/tex">i</script>.
<ul>
<li>These values are also constrained: <script type="math/tex">0 \leq p_i \leq 100</script> for all <script type="math/tex">i</script>.</li>
</ul>
</li>
<li><script type="math/tex">y_i(\theta)</script> is the <em>predicted</em> metascore for movie <script type="math/tex">i</script>
for a given choice of relative weights.
<ul>
<li>We will drop the <script type="math/tex">\theta</script> when it is obvious.</li>
</ul>
</li>
<li><script type="math/tex">p = (p_1, p_2, \ldots, p_m)</script> and <script type="math/tex">y(\theta) = (y_1(\theta), y_2(\theta), \ldots, y_m(\theta))</script>
are vectorized forms we will use for conciseness later.</li>
</ul>
<p>An obvious definition of <script type="math/tex">y_i(\theta)</script>
is simply a weighted sum:</p>
<script type="math/tex; mode=display">y_i(\theta) = \sum_{j=1}^n \theta_j r_{ij}.</script>
<p>But there is a problem with this definition. Remember:
<em>Not all critics rate all movies</em>.
In other words, the summation above may be invalid,
since not all <script type="math/tex">r_{ij}</script> values are necessarily defined.
How do we deal with this <em>incomplete</em> matrix <script type="math/tex">r_{ij}</script>?
My best guess is that metacritic normalizes the metascore
over the available critic weights.
For example, assume that the (excellent) movie
<a href="http://www.imdb.com/title/tt0470752/">Ex Machina</a>
has the index <script type="math/tex">i = 4</script> in our data set.
Assume that only two critics,
with weights <script type="math/tex">\theta_1</script> and <script type="math/tex">\theta_2</script>
have currently rated this movie.
We denote their ratings <script type="math/tex">r_{41}</script> and <script type="math/tex">r_{42}</script> respectively.
The metascore for this movie is then</p>
<script type="math/tex; mode=display">y_4(\theta_1, \theta_2) = \frac{\theta_1 r_{41} + \theta_2 r_{42}}{\theta_1 + \theta_2}.</script>
<p>In fact, the metacritic FAQ page says they wait until
a movie has at least 4 reviews before computing a metascore.
So they want at least 4 defined <script type="math/tex">r_{ij}</script> values
for a given <script type="math/tex">i</script>.
Let's define the following additional variables:</p>
<ul>
<li><script type="math/tex">r'_{ij} = r_{ij}</script> if movie <script type="math/tex">i</script> is rated by critic <script type="math/tex">j</script>
and <script type="math/tex">0</script> otherwise.</li>
<li><script type="math/tex">e_{ij} = 1</script> if movie <script type="math/tex">i</script> is rated by critic <script type="math/tex">j</script>, and <script type="math/tex">0</script> otherwise.</li>
</ul>
<p>Note that <script type="math/tex">r'_{ij}</script> and <script type="math/tex">e_{ij}</script> are both <script type="math/tex">m \times n</script> matrices,
but unlike <script type="math/tex">r_{ij}</script>, they are fully defined.</p>
<p>Using these, we modify the definition for <script type="math/tex">y_i</script>:</p>
<script type="math/tex; mode=display">y_i(\theta) = \frac{ \sum_{j=1}^n \theta_j r'_{ij} }{ \sum_{j=1}^n \theta_j e_{ij} }.</script>
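<p>This normalized prediction is easy to express with numpy. The sketch below is mine, not metacritic's code; the names <code>r_prime</code>, <code>e</code>, and <code>theta</code> mirror the definitions above:</p>

```python
import numpy as np

def predict_metascores(theta, r_prime, e):
    """y_i(theta) = (sum_j theta_j * r'_ij) / (sum_j theta_j * e_ij)."""
    num = r_prime @ theta  # weighted sum of available ratings, per movie
    den = e @ theta        # total weight of the critics who actually rated
    return num / den

# Toy data: 2 movies, 3 critics; movie 0 is unrated by critic 2.
r_prime = np.array([[80.0, 60.0, 0.0],
                    [70.0, 90.0, 50.0]])
e = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
theta = np.array([0.5, 0.3, 0.2])
print(predict_metascores(theta, r_prime, e))  # [72.5 72. ]
```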
<p>How does this function vary with <script type="math/tex">\theta</script> (once we fix the <script type="math/tex">r_{ij}</script> values)?
I wrote up
<a href="https://github.com/shashank025/metacritic-weights/blob/master/bin/mc_ytheta">a little script to plot <script type="math/tex">y_4 (\theta_1, \theta_2)</script></a>
for the example involving the movie <em>Ex Machina</em>
(I fixed the critic ratings to
<script type="math/tex">r_{41} = 79</script> and <script type="math/tex">r_{42} = 67</script>;
I know. Stupid critics!).
The following image of the plot
hopefully makes it clear that
<script type="math/tex">y_i (\theta)</script> is <em>not</em> linear in <script type="math/tex">\theta</script>.
But the function is still <em>smooth</em> (<em>i.e.</em>, <em>differentiable</em>).</p>
<p><a href="/assets/images/plot-of-y-theta.png">
<img width="400" height="300" src="/assets/images/plot-of-y-theta.png" title="Plot of y(theta) - click to zoom" alt="Plot of y(theta) - click to zoom" />
</a></p>
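<p>There is also a quick way to see the non-linearity without a plot: <script type="math/tex">y_i</script> is unchanged when <script type="math/tex">\theta</script> is scaled (it is homogeneous of degree zero), and no non-constant linear function has that property. A small check, using the Ex Machina ratings above:</p>

```python
def y4(theta1, theta2, r41=79.0, r42=67.0):
    """Predicted metascore for the two-critic Ex Machina example."""
    return (theta1 * r41 + theta2 * r42) / (theta1 + theta2)

# Doubling theta leaves the prediction unchanged, so y4 cannot be linear:
# a (non-constant) linear function would double along with its argument.
assert abs(y4(0.2, 0.3) - y4(0.4, 0.6)) < 1e-12
print(y4(0.2, 0.3))  # roughly 71.8
```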
<p>Now consider the <script type="math/tex">m</script>-vector <script type="math/tex">d(\theta) = p - y(\theta)</script>.
This vector is a measure of how <em>off</em> the predictions
are from actual metascores for
a given <script type="math/tex">\theta</script>.
We will try to find a <script type="math/tex">\theta</script>
that minimizes the value of the function
<script type="math/tex">f(\theta) = \Vert d(\theta) \Vert</script>,
where <script type="math/tex">\Vert \cdot \Vert</script> represents the
<a href="https://en.wikipedia.org/wiki/Lp_space"><script type="math/tex">L^2</script> norm</a>.
Formally,</p>
<script type="math/tex; mode=display">\DeclareMathOperator*{\argmin}{\arg\!\min}
\begin{gather*}
\argmin_{\theta} \Vert p - y(\theta) \Vert \\
\text{subject to} \\
\sum_{j=1}^n \theta_j = 1, \text{and}\\
\theta_j > 0~\text{for all}~j.
\end{gather*}</script>
<p>This is a standard
<a href="https://en.wikipedia.org/wiki/Constrained_optimization">constrained minimization problem</a>.
Our expectation is that any solution <script type="math/tex">\theta</script>
of the above system
<em>(a)</em> fits the training set well, and
<em>(b)</em> also predicts metascores for new movies.
Notice that <script type="math/tex">d</script> is not a <em>linear</em> function of <script type="math/tex">\theta</script>
because <script type="math/tex">y(\theta)</script> isn’t either.
So, we have to use a
<a href="https://en.wikipedia.org/wiki/Nonlinear_programming">nonlinear solver</a>.</p>
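<p>With scipy, the constrained formulation above can be handed to a solver almost verbatim. This is a sketch of my own, not the exact code in the repo: the strict inequality <script type="math/tex">\theta_j > 0</script> is relaxed to <script type="math/tex">\theta_j \geq \epsilon</script> (solvers want closed bounds), and I minimize the <em>squared</em> norm here, which has the same minimizer but is smoother:</p>

```python
import numpy as np
from scipy.optimize import minimize

def fit_weights(p, r_prime, e, eps=1e-6):
    """Minimize ||p - y(theta)||^2 subject to sum(theta) = 1, theta_j >= eps."""
    m, n = r_prime.shape

    def objective(theta):
        y = (r_prime @ theta) / (e @ theta)
        return np.sum((p - y) ** 2)

    theta0 = np.full(n, 1.0 / n)  # start from uniform weights
    result = minimize(
        objective, theta0, method="SLSQP",
        bounds=[(eps, 1.0)] * n,
        constraints=[{"type": "eq", "fun": lambda t: t.sum() - 1.0}],
    )
    return result.x

# Tiny sanity check with a known answer: metascores generated with
# weights [0.5, 0.3, 0.2] should be recovered by the solver.
r_prime = np.array([[80.0, 60.0, 40.0],
                    [40.0, 100.0, 70.0],
                    [90.0, 30.0, 60.0],
                    [50.0, 50.0, 100.0]])
e = np.ones_like(r_prime)
true_theta = np.array([0.5, 0.3, 0.2])
p = (r_prime @ true_theta) / (e @ true_theta)
print(fit_weights(p, r_prime, e))  # should be close to [0.5 0.3 0.2]
```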
<h3 id="the-implementation">The Implementation</h3>
<p>With (most of the) annoying math out of the way, let's write code!
The implementation pipeline consists of the following stages:</p>
<ol>
<li>Collect movie ratings data from metacritic.</li>
<li>Preprocess the data:
<ul>
<li>Remove ratings from critics who’ve rated very few movies, and</li>
<li>Create the <script type="math/tex">r'_{ij}</script> and <script type="math/tex">e_{ij}</script> matrices.</li>
</ul>
</li>
<li>Partition the data into a <em>training</em> set and a <em>test</em> set.</li>
<li>Find a best fit <script type="math/tex">\theta</script> by running the optimization routine on the training set.</li>
<li>Compute accuracy against the test set.</li>
<li>Output the results.</li>
</ol>
<p>It turns out that a Makefile is really well suited to
building these kinds of pipelines:
each stage produces a <em>file</em> that serves as the Make target
for that stage,
and each stage can depend on files produced by one or more previous stages.</p>
<h4 id="collecting-ratings-data-from-metacritic">Collecting ratings data from metacritic</h4>
<p>Unfortunately, metacritic does not, as far as I know,
have any APIs to make this data available easily.
So I periodically scrape metacritic’s
<a href="http://www.metacritic.com/browse/movies/release-date/theaters/metascore?view=condensed">New Movie Releases page</a>
for links to actual metacritic movie pages,
which I then scrape to get the overall metascore,
and the individual critic ratings.</p>
<p>I used a combination of
<a href="http://www.semicomplete.com/projects/xpathtool/">xpathtool</a>
and
the <a href="http://lxml.de/">lxml Python library</a>
for the scraping.</p>
<p>The output of this stage is a
<a href="https://docs.python.org/2/library/pickle.html">Python cPickle</a>
dump file that represents a dictionary of the form:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>{ movie_url -&gt; (metascore, individual_ratings), ... }
</code></pre>
</div>
<p>where <code class="highlighter-rouge">individual_ratings</code> is itself a dictionary of the form</p>
<div class="highlighter-rouge"><pre class="highlight"><code>{ critic_name -&gt; numeric_rating, ... }
</code></pre>
</div>
<p>For example, this structure could look like:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>{
  'http://www.metacritic.com/movie/mad-max-fury-road/critic-reviews' -&gt;
    (89,
     {'Anthony Lane (The New Yorker)': 100,
      'A.A. Dowd (TheWrap)': 95,
      ...}),
  'http://www.metacritic.com/movie/ex-machina/critic-reviews' -&gt;
    (78,
     {'Steven Rea (Philadelphia Inquirer)': 100,
      'Manohla Dargis (The New York Times)': 90,
      ...}),
  ...
}
</code></pre>
</div>
<p>I know cPickle is not exactly the most portable format,
but it works well at this early stage.
In the long run, I want to persist all of the ratings data
in a database (sqlite? Postgres?).</p>
<h4 id="preprocessing">Preprocessing</h4>
<p>We first eliminate from our data set
the long tail of critics who’ve rated very few movies.
Not only are these critics not very influential
in the overall optimization,
but eliminating them also
helps reduce <script type="math/tex">n</script> (the matrix width).
Accordingly, there is a configurable <em>rating count threshold</em>,
currently set to <script type="math/tex">5</script>.
We do one pass over the ratings data and construct
a dictionary of the form:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>{ critic_name -&gt; movies_rated, ... }
</code></pre>
</div>
<p>We then do another pass through the data and remove ratings
from critics whose <code class="highlighter-rouge">movies_rated</code> value is lower than the
threshold.</p>
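<p>Concretely, the two passes look something like this (a sketch assuming the <code>{ movie_url -&gt; (metascore, { critic_name -&gt; rating }) }</code> shape described earlier):</p>

```python
from collections import Counter

def filter_rare_critics(ratings, threshold=5):
    """Drop critics who rated fewer than `threshold` movies.

    `ratings` maps movie_url -> (metascore, {critic_name: rating}),
    as produced by the scraping stage.
    """
    # Pass 1: count how many movies each critic rated.
    counts = Counter(critic
                     for _, individual in ratings.values()
                     for critic in individual)
    # Pass 2: rebuild the dictionary without the rare critics.
    return {
        url: (metascore, {c: r for c, r in individual.items()
                          if counts[c] >= threshold})
        for url, (metascore, individual) in ratings.items()
    }
```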
<p>The second preprocessing step is to construct the
<script type="math/tex">r'_{ij}</script> and <script type="math/tex">e_{ij}</script> matrices, which of course
is
<a href="https://en.wikipedia.org/wiki/Small_matter_of_programming">a simple matter of programming</a>.
I store these values as
<a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html">numpy matrices</a>.</p>
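<p>Building the matrices might look like the sketch below; the sorted row and column orderings are my choice here, not necessarily what the repo does:</p>

```python
import numpy as np

def build_matrices(ratings):
    """Build the m x n matrices r' (zero-filled ratings) and e (indicators).

    `ratings` maps movie_url -> (metascore, {critic_name: rating}).
    Returns (movies, critics, r_prime, e, p) with fixed orderings.
    """
    movies = sorted(ratings)
    critics = sorted({c for _, ind in ratings.values() for c in ind})
    col = {c: j for j, c in enumerate(critics)}
    r_prime = np.zeros((len(movies), len(critics)))
    e = np.zeros_like(r_prime)
    p = np.zeros(len(movies))
    for i, url in enumerate(movies):
        p[i], individual = ratings[url]
        for critic, rating in individual.items():
            r_prime[i, col[critic]] = rating
            e[i, col[critic]] = 1.0
    return movies, critics, r_prime, e, p
```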
<h4 id="partitioning-the-data-set">Partitioning the data set</h4>
<p>This is straightforward.
I use a configurable <code class="highlighter-rouge">training_frac</code> parameter
(a value in the interval <script type="math/tex">[0, 1]</script>) to probabilistically
split the cleaned up data into a test set and a training
set.</p>
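<p>A minimal version of the split (the function name and the <code>seed</code> parameter are mine, added for reproducibility):</p>

```python
import random

def split_movies(ratings, training_frac=0.8, seed=None):
    """Probabilistically assign each movie to the training or test set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for url, entry in ratings.items():
        (train if rng.random() < training_frac else test)[url] = entry
    return train, test
```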
<h4 id="optimization-routine">Optimization routine</h4>
<p>There are numerous “solvers” available for
constrained optimization problems of the type we
described above, but not all of them are
freely available.</p>
<p>I tried the following two solvers,
available as part of
<a href="http://docs.scipy.org/doc/scipy-0.13.0/reference/optimize.html">scipy.optimize</a>:</p>
<table>
<thead>
<tr>
<th>Solver</th>
<th>Differentiability requirements</th>
<th>Allows bounds?</th>
<th>Allows equality constraints?</th>
<th>Allows inequality constraints?</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.optimize.fmin_slsqp.html">Sequential Least Squares Programming (SLSQP)</a></td>
<td>The objective function and the constraints should be twice <a href="https://en.wikipedia.org/wiki/Differentiable_function#Differentiability_classes">continuously differentiable</a></td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td><a href="http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_cobyla.html">Constrained Optimization By Linear Approximations (COBYLA)</a></td>
<td>None</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<p>Note that <script type="math/tex">y(\theta)</script> (and therefore <script type="math/tex">f(\theta)</script>) satisfies
the differentiability requirement of SLSQP.</p>
<p>Also, COBYLA does not allow you to specify bounds on
<script type="math/tex">\theta</script> values or equality constraints.
So, we employ a common technique in optimization formulations,
which is to push the constraints <em>into the objective function</em>.
Consider the “tub” function <script type="math/tex">\tau(x, l, u)</script> defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tau (x, l, u) =
\begin{cases}
0 & \quad \text{if } l \leq x \leq u, \\
1 & \quad \text{otherwise}.
\end{cases} %]]></script>
<p>Our modified objective function (for use with COBYLA) becomes:</p>
<script type="math/tex; mode=display">f(\theta) = \Vert p - y(\theta) \Vert
+ P_h \cdot \Vert 1 - \sum_{j=1}^n \theta_j \Vert
+ P_b \cdot \sum_{j = 1}^n \tau(\theta_j, 0, 1),</script>
<p>where <script type="math/tex">P_h</script> and <script type="math/tex">P_b</script> are configurable weights that decide
how much we should <em>penalize</em> the optimization algorithm when it
chooses:</p>
<ul>
<li>a <script type="math/tex">\theta</script> that doesn’t lie on the affine hyperplane,
and</li>
<li><script type="math/tex">\theta_j</script> values outside the interval
<script type="math/tex">[0, 1]</script>,</li>
</ul>
<p>respectively.</p>
<p>Setting both <script type="math/tex">P_h</script> and <script type="math/tex">P_b</script> to 0 reduces our objective
function to its original form, so we can use the same
function for both solvers by simply tweaking these weights.</p>
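<p>In code, the tub function and the penalized objective are only a few lines (again a sketch of mine; <code>P_h</code> and <code>P_b</code> are the penalty weights above, and for the scalar hyperplane term the norm is just an absolute value):</p>

```python
import numpy as np

def tub(x, lo, hi):
    """The 'tub' penalty: 0 inside [lo, hi], 1 outside."""
    return 0.0 if lo <= x <= hi else 1.0

def penalized_objective(theta, p, r_prime, e, P_h=100.0, P_b=100.0):
    """||p - y(theta)|| plus hyperplane and bounds penalties.

    With P_h = P_b = 0 this reduces to the plain objective.
    """
    y = (r_prime @ theta) / (e @ theta)
    fit = np.linalg.norm(p - y)
    hyperplane = P_h * abs(1.0 - theta.sum())
    bounds = P_b * sum(tub(t, 0.0, 1.0) for t in theta)
    return fit + hyperplane + bounds

# One movie, two critics: a feasible theta incurs no penalty, while a
# theta outside [0, 1] pays P_b once per violated component.
r_prime = np.array([[80.0, 60.0]])
e = np.ones((1, 2))
p = np.array([70.0])
print(penalized_objective(np.array([0.5, 0.5]), p, r_prime, e))    # 0.0
print(penalized_objective(np.array([1.5, -0.5]), p, r_prime, e))   # 220.0
```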
<h3 id="results">Results</h3>
<p>Before I actually launch into details, I should note
the following issues right at the outset:</p>
<ul>
<li>I was actually unable to get either SLSQP or COBYLA
to ever successfully converge on a solution.</li>
<li>The <script type="math/tex">\theta</script> values (<em>i.e.</em>, critic weights)
learned by these solvers
were often <em>way</em> outside the interval <script type="math/tex">[0, 1]</script>.</li>
</ul>
<p>Most of the time, both routines finished their iterations
and failed with errors of this form:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>optimization failed [8]: Positive directional derivative for linesearch
optimization failed [2]: Maximum number of function evaluations has been exceeded
</code></pre>
</div>
<p>If you have experience with the scipy optimization
library, I’d love to hear any suggestions you may
have on how to deal with these errors.</p>
<p>Perhaps more interestingly, in spite of the above issues,
the learned <script type="math/tex">\theta</script> values were still able to successfully
predict metascores for movies in the test set.</p>
<p>After removing ratings from insignificant critics,
I constructed a training set of about 2800 ratings
of 190 movies by 188 critics.</p>
<p>The following table lists
the <strong>top 20</strong> critics by weight
learned using the above training set
with each optimization routine.
Note that each weight is expressed as a fraction
of the weight of the <em>top</em> critic in each list.
Interestingly enough,
both algorithms think
<a href="http://connect.nola.com/user/mbscott/posts.html">Mike Scott of the New Orleans Times-Picayune</a>
is the metacritic MVP.
So, for example, according to SLSQP,
a review by Justin Lowe carries only
95% of the importance that is given
to a review by Mike Scott.</p>
<table>
<thead>
<tr>
<th style="text-align: center">SLSQP</th>
<th style="text-align: left"> </th>
<th style="text-align: center">COBYLA</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: center">Weight</th>
<th style="text-align: left">Critic</th>
<th style="text-align: center">Weight</th>
<th style="text-align: left">Critic</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><script type="math/tex">\cdot</script></td>
<td style="text-align: left">Mike Scott (New Orleans Times-Picayune)</td>
<td style="text-align: center"><script type="math/tex">\cdot</script></td>
<td style="text-align: left">Mike Scott (New Orleans Times-Picayune)</td>
</tr>
<tr>
<td style="text-align: center">0.949948</td>
<td style="text-align: left">Justin Lowe (The Hollywood Reporter)</td>
<td style="text-align: center">0.902288</td>
<td style="text-align: left">Slant Magazine</td>
</tr>
<tr>
<td style="text-align: center">0.929495</td>
<td style="text-align: left">Jordan Hoffman (The Guardian)</td>
<td style="text-align: center">0.900024</td>
<td style="text-align: left">Ronnie Scheib (Variety)</td>
</tr>
<tr>
<td style="text-align: center">0.914186</td>
<td style="text-align: left">Marjorie Baumgarten (Austin Chronicle)</td>
<td style="text-align: center">0.890113</td>
<td style="text-align: left">Wes Greene (Slant Magazine)</td>
</tr>
<tr>
<td style="text-align: center">0.910820</td>
<td style="text-align: left">Fionnuala Halligan (Screen International)</td>
<td style="text-align: center">0.887845</td>
<td style="text-align: left">Chris Cabin (Slant Magazine)</td>
</tr>
<tr>
<td style="text-align: center">0.909801</td>
<td style="text-align: left">James Mottram (Total Film)</td>
<td style="text-align: center">0.885791</td>
<td style="text-align: left">Martin Tsai (Los Angeles Times)</td>
</tr>
<tr>
<td style="text-align: center">0.904564</td>
<td style="text-align: left">Variety</td>
<td style="text-align: center">0.863626</td>
<td style="text-align: left">Lawrence Toppman (Charlotte Observer)</td>
</tr>
<tr>
<td style="text-align: center">0.903029</td>
<td style="text-align: left">Guy Lodge (Variety)</td>
<td style="text-align: center">0.858237</td>
<td style="text-align: left">Anthony Lane (The New Yorker)</td>
</tr>
<tr>
<td style="text-align: center">0.897749</td>
<td style="text-align: left">Inkoo Kang (TheWrap)</td>
<td style="text-align: center">0.845864</td>
<td style="text-align: left">Fionnuala Halligan (Screen International)</td>
</tr>
<tr>
<td style="text-align: center">0.894605</td>
<td style="text-align: left">indieWIRE</td>
<td style="text-align: center">0.834088</td>
<td style="text-align: left">Boyd van Hoeij (The Hollywood Reporter)</td>
</tr>
<tr>
<td style="text-align: center">0.892237</td>
<td style="text-align: left">Ben Kenigsberg (The New York Times)</td>
<td style="text-align: center">0.820908</td>
<td style="text-align: left">Variety</td>
</tr>
<tr>
<td style="text-align: center">0.882656</td>
<td style="text-align: left">Mike D’Angelo (The Dissolve)</td>
<td style="text-align: center">0.814152</td>
<td style="text-align: left">Nicolas Rapold (The New York Times)</td>
</tr>
<tr>
<td style="text-align: center">0.875550</td>
<td style="text-align: left">Simon Abrams (Village Voice)</td>
<td style="text-align: center">0.735623</td>
<td style="text-align: left">Justin Lowe (The Hollywood Reporter)</td>
</tr>
<tr>
<td style="text-align: center">0.875385</td>
<td style="text-align: left">Martin Tsai (Los Angeles Times)</td>
<td style="text-align: center">0.629166</td>
<td style="text-align: left">Mark Olsen (Los Angeles Times)</td>
</tr>
<tr>
<td style="text-align: center">0.875062</td>
<td style="text-align: left">Manohla Dargis (The New York Times)</td>
<td style="text-align: center">0.625141</td>
<td style="text-align: left">The Globe and Mail (Toronto)</td>
</tr>
<tr>
<td style="text-align: center">0.874889</td>
<td style="text-align: left">Kyle Smith (New York Post)</td>
<td style="text-align: center">0.567734</td>
<td style="text-align: left">James Berardinelli (ReelViews)</td>
</tr>
<tr>
<td style="text-align: center">0.872482</td>
<td style="text-align: left">Nicolas Rapold (The New York Times)</td>
<td style="text-align: center">0.562606</td>
<td style="text-align: left">Peter Sobczynski (RogerEbert.com)</td>
</tr>
<tr>
<td style="text-align: center">0.869911</td>
<td style="text-align: left">James Berardinelli (ReelViews)</td>
<td style="text-align: center">0.558980</td>
<td style="text-align: left">John Anderson (Wall Street Journal)</td>
</tr>
<tr>
<td style="text-align: center">0.863118</td>
<td style="text-align: left">Ronnie Scheib (Variety)</td>
<td style="text-align: center">0.520855</td>
<td style="text-align: left">Steve Macfarlane (Slant Magazine)</td>
</tr>
<tr>
<td style="text-align: center">0.860796</td>
<td style="text-align: left">Nikola Grozdanovic (The Playlist)</td>
<td style="text-align: center">0.510842</td>
<td style="text-align: left">Jordan Mintzer (The Hollywood Reporter)</td>
</tr>
</tbody>
</table>
<p>I should clarify that the top 20 list can change from one run to the next,
since it depends on the training set, which is chosen at random.</p>
<p>Next, we show a few sample predicted metascores:</p>
<table>
<thead>
<tr>
<th>Movie</th>
<th>Actual Metascore</th>
<th>Predicted (SLSQP)</th>
<th>Predicted (COBYLA)</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://www.imdb.com/title/tt3247714/">Survivor</a></td>
<td><a href="http://www.metacritic.com/movie/survivor">26</a></td>
<td>30 (16.18%)</td>
<td>33 (28.12%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt1674771/">Entourage</a></td>
<td><a href="http://www.metacritic.com/movie/entourage">38</a></td>
<td>57 (50.25%)</td>
<td>43 (15.65%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt1823672/">Chappie</a></td>
<td><a href="http://www.metacritic.com/movie/chappie">41</a></td>
<td>46 (12.89%)</td>
<td>43 (6.00%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt3215846/">Dreamcatcher</a></td>
<td><a href="http://www.metacritic.com/movie/dreamcatcher-2015">86</a></td>
<td>77 (-10.46%)</td>
<td>83 (-2.77%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt2395427/">Avengers: Age of Ultron</a></td>
<td><a href="http://www.metacritic.com/movie/avengers-age-of-ultron">66</a></td>
<td>62 (-5.05%)</td>
<td>67 (1.57%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt2788556/">Gemma Bovery</a></td>
<td><a href="http://www.metacritic.com/movie/gemma-bovery">57</a></td>
<td>0 (-100.00%)</td>
<td>61 (7.15%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt3218580/">Alleluia</a></td>
<td><a href="http://www.metacritic.com/movie/alleluia">84</a></td>
<td>90 (7.41%)</td>
<td>87 (4.13%)</td>
</tr>
</tbody>
</table>
<p>With a test set of size 60, the movie predictions
by the two algorithms had the following
<a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">RMSE</a> values:</p>
<table>
<thead>
<tr>
<th>SLSQP</th>
<th>COBYLA</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.053437</td>
<td>0.016942</td>
</tr>
</tbody>
</table>
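<p>RMSE values this small suggest the scores were rescaled to the interval <script type="math/tex">[0, 1]</script> before computing the error; that rescaling is my assumption, and the sketch below makes it explicit:</p>

```python
import numpy as np

def rmse(actual, predicted, scale=100.0):
    """RMSE between metascores, rescaled to [0, 1].

    Dividing by 100 is an assumption on my part, made because the
    reported RMSE values are small fractions.
    """
    a = np.asarray(actual, dtype=float) / scale
    b = np.asarray(predicted, dtype=float) / scale
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Using the first three movies from the sample table (SLSQP column):
print(rmse([26, 38, 41], [30, 57, 46]))  # roughly 0.116
```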
<h2 id="takeaways">Takeaways</h2>
<ul>
<li>Working out the math was obviously fun!
<ul>
<li>I got to brush up on my long defunct skills with numpy, scipy, matplotlib, etc.</li>
</ul>
</li>
<li>Even a toy project like this one can end up demanding
substantial time and attention,
especially if it is something you want to share
with the rest of the world.
For example, I ended up setting up proper Python package management for all the code.</li>
<li>Writing this blog post was also extremely useful because
it clarified my own thinking on the topic. I was actually
able to go back and refactor the code to better match
the implementation pipeline I described above.</li>
</ul>
<hr />
http://shashank.ramaprasad.com/2015/06/14/reverse-engineering-the-metacritic-movie-ratings
Sun, 14 Jun 2015 00:00:00 +0000

Bloom filters for set intersections?
<p>For any set <script type="math/tex">S</script>, one may construct a Bloom filter,
which is a probabilistic data structure <script type="math/tex">B</script> that
approximately and efficiently answers the <em>set membership</em>
question,
such that <script type="math/tex">B(x)</script> returns <script type="math/tex">YES</script> or <script type="math/tex">NO</script>,
indicating the membership of <script type="math/tex">x</script> in <script type="math/tex">S</script>.
Bloom filters are <em>approximate</em> since they can return
false positives.
The error rate <script type="math/tex">\epsilon</script> for a Bloom filter is:</p>
<script type="math/tex; mode=display">Pr(\text{YES} \vert x \notin S) = \epsilon .</script>
<p>To find an overlap
(<em>i.e.</em>, a non-empty intersection) between a set <script type="math/tex">S</script> (represented by a bloom filter <script type="math/tex">B</script>)
and some other set <script type="math/tex">Q</script>, I can apply the following procedure:</p>
<p><em>Procedure 1:</em> If <script type="math/tex">B(q) = YES</script> for any element <script type="math/tex">q</script> from the set <script type="math/tex">Q</script>, then return <script type="math/tex">YES</script>, else return <script type="math/tex">NO</script>.</p>
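<p>Procedure 1 is a one-liner given any membership predicate; here an exact set stands in for the Bloom filter (it simply has <script type="math/tex">\epsilon = 0</script>):</p>

```python
def may_intersect(contains, Q):
    """Procedure 1: return True if the filter says YES for any q in Q.

    `contains` can be any membership predicate; with a real Bloom
    filter, its false positives become false positives here too.
    """
    return any(contains(q) for q in Q)

# An exact set as a stand-in for the filter on S = {1, 2, 3}:
S = {1, 2, 3}
print(may_intersect(S.__contains__, {3, 9}))   # True
print(may_intersect(S.__contains__, {7, 9}))   # False
```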
<p>How accurate is the above procedure? Since Bloom filters cannot have false negatives,
the procedure is inaccurate if and only if there is a false positive,
<em>i.e.</em>, the Bloom filter returns <script type="math/tex">YES</script> for at least one element in <script type="math/tex">Q</script>,
even though no element in <script type="math/tex">Q</script> belongs to <script type="math/tex">S</script>.
How likely is this event?
It is often easier to compute the probability of the complementary event.
In our case, this is the event that the bloom filter is accurate, <em>i.e.</em>,
it returns <script type="math/tex">NO</script> for every element in <script type="math/tex">Q</script>,
given that no element in <script type="math/tex">Q</script> belongs to <script type="math/tex">S</script>.
It follows directly from the definition of the error rate that:</p>
<script type="math/tex; mode=display">Pr(\text{NO} \vert q \notin S) = 1 - \epsilon .</script>
<p>Assuming independence of Bloom filter output for the different elements of <script type="math/tex">Q</script>,
and assuming that <script type="math/tex">Q</script> has <script type="math/tex">n</script> elements,</p>
<script type="math/tex; mode=display">Pr(\text{intersection is accurate}) = (1 - \epsilon)^n .</script>
<p>In other words,</p>
<script type="math/tex; mode=display">Pr(\text{intersection is inaccurate}) = 1 - (1 - \epsilon)^n .</script>
<p>Since <script type="math/tex">f = 1 - \epsilon</script> is a fraction less than one,
<script type="math/tex">f^n</script> becomes small as <script type="math/tex">n</script> grows,
which means that the probability of inaccuracy, <script type="math/tex">1 - f^n</script>, quickly becomes significant.
Note that this derivation assumed that the Bloom filter outputs
for the different elements of <script type="math/tex">Q</script> were independent events. This is not strictly true:
every query consults the same underlying bit array,
so false positives across queries are correlated.
I am not sure how to compute exactly how this dependence affects the above calculation,
but I can’t imagine it would materially alter the conclusion in most scenarios.</p>
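<p>To make <em>Procedure 1</em> concrete, here is a minimal sketch in Python. The <code>BloomFilter</code> class and its double-hashing scheme are illustrative assumptions, not an implementation referenced in this post:</p>

```python
import hashlib

class BloomFilter:
    """A toy Bloom filter: k bit positions per item over an m-bit array."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of a digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big')
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

def might_intersect(bloom, Q):
    """Procedure 1: YES if the filter answers YES for any element of Q."""
    return any(q in bloom for q in Q)
```

<p>Because there are no false negatives, a <code>False</code> answer from <code>might_intersect</code> is definitive; a <code>True</code> answer is only probably correct, with the error probability analyzed above.</p>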
http://shashank.ramaprasad.com/2015/06/07/bloom-filters-for-set-intersections
Sun, 07 Jun 2015 00:00:00 +0000

Bay Bridges Challenge (solved!)

<p>I
<a href="/2013/12/16/a-graph-theoretic-approach-to-the-bay-bridges-challenge">previously wrote</a>
about the
<a title="Bay Bridges Challenge" href="https://www.codeeval.com/open_challenges/109/">Bay Bridges Challenge</a>,
hosted at
<a title="CodeEval" href="https://www.codeeval.com/">CodeEval</a>.
In my last post, I showed that the problem could be modeled as a
<a title="Wikipedia page on Vertex Cover" href="http://en.wikipedia.org/wiki/Vertex_cover">minimum vertex cover</a> problem,
and wondered if we can do better than iterating through the
<a title="Wikipedia page on Power Sets" href="http://en.wikipedia.org/wiki/Power_set">power set</a>
of bridges, and picking the highest cardinality subset that is feasible.
I said that most likely,</p>
<blockquote><p>there is additional structure inherent in the problem that can be exploited to make the problem more tractable.</p></blockquote>
<p>Spurred by some <a title="helpful comments on previous post about Bay Bridges challenge" href="http://shashankr.wordpress.com/2013/12/16/a-graph-theoretic-approach-to-the-bay-bridges-challenge/#comment-68">helpful recent comments</a>, I spent some more time on the problem. As it turns out, we <em>can</em> do better than an exhaustive search. But first, remember that a <strong><em>feasible solution</em></strong> is any set of bridges with <strong>no</strong> intersections. Our task is to find <em>an</em> <strong><em>optimal solution</em></strong>, which is simply the <em>largest</em> such feasible solution (note that there can be more than one).</p>
<p><strong><em>Claim 1:</em></strong> If there is no feasible solution with <script type="math/tex">k</script> bridges, then there cannot be a larger feasible solution.</p>
<p><em>Proof (by contradiction):</em> Assume that there <em>is</em> a feasible solution with <script type="math/tex">l = k + 1</script> bridges. Removing any one bridge from it still leaves a set of non-intersecting bridges, <em>i.e.</em>, a feasible solution, but of size <script type="math/tex">k</script>, which is a contradiction. By induction, the same holds for all larger values of <script type="math/tex">l</script>. QED.</p>
<p>With this claim in hand, let us partition the power set of <script type="math/tex">n</script> bridges so that <script type="math/tex">p(i)</script> represents
the set of all sets of bridges of size <script type="math/tex">i</script>. We can then make the following observation about the
<script type="math/tex">n</script> partitions (we can safely ignore the empty partition <script type="math/tex">p(0)</script>):</p>
<blockquote>
<p>If any set in <script type="math/tex">p(n/2)</script> is <em>feasible</em>, the <em>optimal</em> solution is
in one of the partitions <script type="math/tex">p(n/2)</script> through <script type="math/tex">p(n)</script>.
Otherwise, it is in one of the partitions <script type="math/tex">p(1)</script> through <script type="math/tex">p(n/2 - 1)</script>.</p>
</blockquote>
<p>In other words, we can do a <em>binary search</em> on the partition index <script type="math/tex">i</script>
until we find the partition with the optimal solution. While this seems very promising,
the partition size <script type="math/tex">|p(i)| = {n \choose i}</script> unfortunately has a
<em>maximum</em> at <script type="math/tex">i = n/2</script> (note that <script type="math/tex">|p(n/2)| = {n \choose n/2}</script> grows like <script type="math/tex">2^n</script>, up to a polynomial factor).
So, in the worst case, we still need to search an exponential number of bridge sets.
But in practice, this should still be significantly better than exhaustively searching each partition.</p>
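<p>The binary search idea can be sketched as follows (illustrative only; the crossing test behind the <code>feasible</code> predicate is assumed as a parameter, not implemented here):</p>

```python
from itertools import combinations

def any_feasible(bridges, size, feasible):
    """True if some subset of the given size is feasible, i.e., p(size) has a feasible set."""
    return any(feasible(subset) for subset in combinations(bridges, size))

def optimal_size(bridges, feasible):
    """Binary search for the largest i such that p(i) contains a feasible set.
    Claim 1 guarantees that this feasibility is monotone in i."""
    if not bridges:
        return 0
    lo, hi = 1, len(bridges)  # p(1) is always feasible: a single bridge
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if any_feasible(bridges, mid, feasible):
            lo = mid        # a feasible set of size mid exists; look higher
        else:
            hi = mid - 1    # by Claim 1, nothing at size mid or above
    return lo
```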
<p>Anyway, I didn’t really feel like coding up this binary search, so I tried submitting a simpler,
more naive solution: Search each partition <script type="math/tex">p(i)</script>, starting with <script type="math/tex">p(n)</script>,
in decreasing order of <script type="math/tex">i</script>, until you find the <em><strong>first</strong></em> feasible solution <script type="math/tex">f</script>, and return <script type="math/tex">f</script>.
Because of <em>Claim 1</em>, it is easy to see that <script type="math/tex">f</script> <em>will</em> be
<em>an</em> optimal solution to our problem.
It turns out that this approach was enough to get a score of 100 (with a ranking of 86).
So, now I am even less inclined to implement binary search.</p>
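<p>The naive decreasing-size search is simple enough to sketch in full. This is a reconstruction for illustration, not the solution that was actually submitted; the crossing test uses standard orientation arithmetic and, for brevity, ignores degenerate collinear cases:</p>

```python
from itertools import combinations

def ccw(a, b, c):
    # Positive if a, b, c make a counterclockwise turn.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def crosses(s, t):
    """True if segments s and t properly intersect (collinear touching ignored)."""
    (p1, p2), (p3, p4) = s, t
    d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
    d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
    return (d1 > 0) != (d2 > 0) and (d3 > 0) != (d4 > 0)

def feasible(bridges):
    return not any(crosses(a, b) for a, b in combinations(bridges, 2))

def largest_feasible(bridges):
    """Search p(n), p(n-1), ... in decreasing order; by Claim 1,
    the first feasible set found is an optimal solution."""
    for size in range(len(bridges), 0, -1):
        for subset in combinations(bridges, size):
            if feasible(subset):
                return list(subset)
    return []
```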
<p>Also, I should really post the code for all of my solutions to github, which I hope to do soon.</p>
<p><em><strong>Update:</strong></em> I have checked in <a title="bay bridges solution code (Python)" href="https://github.com/shashank025/codeeval/blob/master/bridges.py">solution code for the Bay Bridges challenge</a> and other codeeval challenges at github. Check it out at <a title="my github repo for codeeval challenges" href="https://github.com/shashank025/codeeval">my github</a>.</p>
<hr />
http://shashank.ramaprasad.com/2014/03/10/bay-bridges-challenge-solved
Mon, 10 Mar 2014 08:44:15 +0000

A Graph Theoretic Approach to the Bay Bridges Challenge

<p><strong>Note</strong>:
<em>If you are impatient, see my newer post on
<a href="/2014/03/10/bay-bridges-challenge-solved">an actual solution to the bay bridges challenge</a>.</em></p>
<p>I recently came across the
<a title="Bay Bridges Challenge" href="http://blog.codeeval.com/bridges/">Bay Bridges Challenge</a>,
over at
<a title="CodeEval" href="https://www.codeeval.com/">CodeEval</a>.
The challenge is to pick the largest set of bridges for construction,
such that no two bridges cross or <em>intersect</em>.</p>
<p>My first instinct was to model this as a
<a title="Wikipedia page on Graph Theory" href="http://en.wikipedia.org/wiki/Graph_theory">graph</a> problem.
Suppose we construct an <a title="Wikipedia page on Intersection Graphs" href="http://en.wikipedia.org/wiki/Intersection_graph"><em>intersection graph</em></a>
<strong>G</strong>, as follows:
create a graph node (or a <em>vertex</em>) corresponding to each bridge,
and add an edge between two vertices <em>u</em> and <em>v</em>
if and only if the corresponding bridges cross.
Figure 1 below shows an example.
Then, it’s easy to see that the original problem is equivalent to
finding the <a title="Wikipedia page on Vertex Cover" href="http://en.wikipedia.org/wiki/Vertex_cover">minimum vertex cover</a> on G: the bridges we must <em>discard</em> form a minimum vertex cover, and the bridges we keep form its complement, a maximum independent set.</p>
<p><a href="https://docs.google.com/drawings/d/1bWFPkc1C-RjBSOSsaK_VUUA4FZ6EaKpABrT7iuN9BL4/pub?w=960&h=720">
<img width="480" height="360" src="/assets/images/bay_bridges.png" title="A Sample Intersection Graph" alt="A Sample Intersection Graph" />
</a>
<em>Figure 1:</em> A Sample Intersection Graph</p>
<p>It turns out that finding the minimum vertex cover is <a title="Wikipedia page on NP-hard problems" href="http://en.wikipedia.org/wiki/NP-hard">NP-hard</a>. For this problem, it means (roughly) that we can do <em>no better</em> than to iterate through the <a title="Wikipedia page on Power Sets" href="http://en.wikipedia.org/wiki/Power_set">power set</a> of bridges, and pick the highest cardinality subset that is <em>feasible</em> (i.e., no two bridges in that subset cross). That was a bit disheartening. But it is unlikely that a run-of-the-mill coding challenge requires coming up with acceptable approximation algorithms to hard problems. Most likely, there is additional structure inherent in the problem that can be exploited to make the problem more tractable.</p>
<p>Since the bridges exist in a 2-D plane (roughly), we can use the <a title="Euclidean Distance" href="http://en.wikipedia.org/wiki/Euclidean_distance">distance metric</a> to automatically rule out large portions of the search space. For example, if a bridge does not cross any other bridge, then it is always included in every optimal solution. By eliminating such bridges from consideration, we might reduce the problem size considerably. Spatial data structures like bounding rectangles, <a title="k-d trees" href="http://en.wikipedia.org/wiki/K-d_tree">k-d trees</a>, or their cousins can be used to partition the set of bridges into smaller subsets, and solve many smaller, independent sub-problems (maybe even in parallel). But none of these approaches reduce the essential <em>strength</em> of the problem, since we are still looking at exponential worst-time solutions.</p>
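<p>The idea of ruling out bridges that cross nothing can be sketched directly. A hedged sketch, assuming each bridge is given as a pair of endpoints; the <code>prune</code> helper is a name invented here for illustration:</p>

```python
from itertools import combinations

def ccw(a, b, c):
    # Positive if a, b, c make a counterclockwise turn.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def crosses(s, t):
    """True if segments s and t properly intersect (degenerate cases ignored)."""
    (p1, p2), (p3, p4) = s, t
    d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
    d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
    return (d1 > 0) != (d2 > 0) and (d3 > 0) != (d4 > 0)

def prune(bridges):
    """Split bridges into those that cross nothing (always part of an
    optimal solution) and the rest, which still require a search."""
    tangled = set()
    for i, j in combinations(range(len(bridges)), 2):
        if crosses(bridges[i], bridges[j]):
            tangled.update((i, j))
    always = [b for i, b in enumerate(bridges) if i not in tangled]
    rest = [b for i, b in enumerate(bridges) if i in tangled]
    return always, rest
```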
<p>At this point, I was feeling a bit stuck. My attempts at finding
additional structure in the original problem space hadn’t really helped.
But going over the Wikipedia page on
<a title="Wikipedia page on Vertex Cover" href="http://en.wikipedia.org/wiki/Vertex_cover">vertex covers</a>,
I read that a minimum vertex cover can be found in polynomial time for
<a title="Wikipedia page on Bipartite Graphs" href="http://en.wikipedia.org/wiki/Bipartite_graph">bipartite graphs</a>.
Finally, some progress:
If the intersection graph G is bipartite
(a graph’s bipartite-ness can be tested in <em>linear</em> time),
we have an efficient solution.
But it is not hard to come up with real world inputs
where the intersection graph is <em>not</em> bipartite.
For example, in Figure 1 above, the intersection graph
is not actually bipartite
(However, if you imagined that bridge b4 were removed from the input,
the remaining graph would indeed become bipartite).
So, we still don’t have a solution that always works.</p>
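<p>For completeness, here is a sketch of the linear-time bipartiteness test mentioned above, via BFS 2-coloring (the adjacency-dict representation is an assumption made for this sketch):</p>

```python
from collections import deque

def is_bipartite(adj):
    """2-color the graph by BFS; runs in O(V + E).
    adj maps each vertex to a list of its neighbors."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]  # opposite side from u
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # odd cycle found: not bipartite
    return True
```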
<p>At this point, I still haven’t written a line of code.
My hunch is that I need to look for a solution in the original problem space:
the roughly Euclidean plane on which the line segments corresponding to
the bridges exist,
and not in the intersection graph space.</p>
http://shashank.ramaprasad.com/2013/12/16/a-graph-theoretic-approach-to-the-bay-bridges-challenge
Mon, 16 Dec 2013 20:55:34 +0000

Understanding the big O: a note on abuse of notation

<p>A friend of mine, who is currently enrolled in the
<a href="https://www.coursera.org/course/algo">Algorithms course</a>
over at
<a href="https://www.coursera.org/">Coursera</a>
reached out to me with the following problem:</p>
<blockquote>
<p>Assume two (positive) nondecreasing functions <script type="math/tex">f</script> and <script type="math/tex">g</script>.
If <script type="math/tex">f(n) = O(g(n))</script>, is it also true that
<script type="math/tex">2^{f(n)} = O(2^{g(n)})</script>?</p>
</blockquote>
<p>My friend felt unable to apply his understanding of
<a title="Wikipedia article on Big-O" href="https://en.wikipedia.org/wiki/Big_O_notation">big-O notation</a>
to this question.
In fact, he felt tripped up by the <script type="math/tex">=</script> signs in the above expression.
I remembered being similarly at a loss a decade earlier, during my first Algorithms course at my alma mater,
<a title="BITS, Pilani" href="http://www.bits-pilani.ac.in/">BITS Pilani</a>,
and I now realize that my confusion was at least partly due to bad notation.
Computer Science folks (as opposed to mathematicians),
when teaching complexity, do not emphasize
<em>at all</em>
that big-O actually refers to a
<strong>class</strong>, or a <strong>family</strong> of functions,
rather than a single one:</p>
<blockquote>
<p><script type="math/tex">O(f(n))</script> is the set of all functions whose absolute value (asymptotically)
grows no faster than <script type="math/tex">\vert f(n) \vert</script>,
up to a constant factor.</p>
</blockquote>
<p>Armed with the above informal definition,
we (the mathematically trained) immediately see an abuse of notation
in the above problem with:</p>
<blockquote>
<script type="math/tex; mode=display">f(n) = O(g(n))</script>
</blockquote>
<p>and realize that the expression would be better written as</p>
<blockquote>
<script type="math/tex; mode=display">f(n) \in O(g(n))</script>
</blockquote>
<p>The <strong>belongs-to</strong> symbol (<script type="math/tex">\in</script>) is more appropriate
than the <strong>equality</strong> symbol (<script type="math/tex">=</script>) since the left hand side
is a single function, while the right hand side is a family of functions.
In other words, we are making <em>a statement about set membership</em>.
Unfortunately, this abuse of notation is widespread in Computer Science literature.
Newcomers, as always, bear the brunt.</p>
<p>Before answering the original problem my friend posed, consider the more general problem:</p>
<blockquote>
<p>Given two functions <script type="math/tex">f</script> and <script type="math/tex">g</script>,
if <script type="math/tex">f(n) \in O(g(n))</script>, is it also true that
<script type="math/tex">2^{f(n)} \in O(2^{g(n)})</script>?</p>
</blockquote>
<p>The answer is <strong>no</strong>, since <script type="math/tex">f(n) = -n</script> and <script type="math/tex">g(n) = -n^2</script> is a counterexample.
Since <script type="math/tex">|-n| = n</script> grows no faster than <script type="math/tex">|-n^2| = n^2</script>, we get that <script type="math/tex">f(n) \in O(g(n))</script>.
But <script type="math/tex">|2^{-n}| = (0.5)^n</script> <em>does</em> grow faster than <script type="math/tex">| 2^{-n^2} | = (0.5)^{n^2}</script>,
and so <script type="math/tex">2^{f(n)} \notin O(2^{g(n)})</script>.</p>
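<p>A quick numeric check of this counterexample (illustrative only): if <script type="math/tex">2^{f(n)}</script> were in <script type="math/tex">O(2^{g(n)})</script>, the ratio of the two would stay bounded, but here it equals <script type="math/tex">2^{n^2 - n}</script>, which explodes:</p>

```python
def f(n): return -n
def g(n): return -n * n

# Ratio 2^f(n) / 2^g(n) = 2^(n^2 - n) for n = 1..5:
ratios = [2.0 ** (f(n) - g(n)) for n in range(1, 6)]
# -> [1.0, 4.0, 64.0, 4096.0, 1048576.0]
```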
<p>The nature of the counterexample suggests that the trouble comes from
negative exponents, so one might hope that restricting the functions to be
positive and nondecreasing rescues the implication.
Surprisingly, it does not.
Take <script type="math/tex">f(n) = 2n</script> and <script type="math/tex">g(n) = n</script>:
both are positive and nondecreasing, and
<script type="math/tex">f(n) \in O(g(n))</script> (with constant <script type="math/tex">M = 2</script>). But</p>
<script type="math/tex; mode=display">\frac{2^{f(n)}}{2^{g(n)}} = \frac{4^n}{2^n} = 2^n \longrightarrow \infty ,</script>
<p>so no constant multiple of <script type="math/tex">2^{g(n)}</script> can bound
<script type="math/tex">2^{f(n)}</script>, and <script type="math/tex">2^{f(n)} \notin O(2^{g(n)})</script>.
The tempting derivation, starting from
<script type="math/tex">f(n) \leq M g(n)</script> for large <script type="math/tex">n</script>,</p>
<script type="math/tex; mode=display">2^{f(n)} \leq 2^{M g(n)} = \left( 2^{g(n)} \right)^{M} = 2^{(M-1) g(n)} \cdot 2^{g(n)} ,</script>
<p>fails precisely because the factor <script type="math/tex">2^{(M-1) g(n)}</script>
is not a constant: it grows with <script type="math/tex">n</script> whenever
<script type="math/tex">M > 1</script> and <script type="math/tex">g</script> is unbounded.
In other words, the answer to my friend’s question is also <strong>no</strong>.
The implication does hold under the stronger hypothesis that
<script type="math/tex">f(n) \leq g(n) + c</script> for some constant <script type="math/tex">c</script>
and all large <script type="math/tex">n</script>, since then
<script type="math/tex">2^{f(n)} \leq 2^{c} \cdot 2^{g(n)}</script>.</p>
http://shashank.ramaprasad.com/2013/07/16/understanding-the-big-o-a-note-on-abuse-of-notation
Tue, 16 Jul 2013 17:55:35 +0000