A Mostly Technical Blog, by Shashank Ramaprasad (https://shashank.ramaprasad.com)

<h2>Interval length via discretization (2019-08-09)</h2>
<p>I found an interesting exercise in <a href="https://www.math.ucla.edu/~tao/">Terry Tao</a>’s
<a href="https://terrytao.wordpress.com/books/an-introduction-to-measure-theory/">Measure Theory book</a>:</p>
<blockquote>
<p>Proposition: for any interval \(I = [a, b]\) over
\(\mathbb{R}\),
\(|I| = \lim_{N \to \infty} \frac{|I ~ \cap ~ \frac{\mathbb{Z}}{N} |}{N},\)
where
\(\frac{\mathbb{Z}}{N} := \{ \frac{n}{N} \mid n \in \mathbb{Z} \}\).</p>
</blockquote>
<p>The interesting thing about this proposition is that it expresses the length of
an interval \(|I| = (b - a)\) (which is a <em>continuous</em> measure) in terms of the
cardinality of a <em>discrete</em> set, \(I \cap \frac{\mathbb{Z}}{N}\).
To develop an intuition for how this proposition could work, first notice that
\(I \cap \frac{\mathbb{Z}}{N}\) is the set of points in
\(\frac{\mathbb{Z}}{N}\) which lie within the interval \(I\).
For example, if \(a = 3\), \(b = 8\), and \(N = 25\), then
\(I \cap \frac{\mathbb{Z}}{N} = \{ {75}/{25}, {76}/{25}, \ldots, {200}/{25} \}\),
which means its cardinality is
\((200 - 75) + 1 = 126\).</p>
<p>Plugging this back into the right-hand side of the proposition, we see that:</p>
\[\frac{|I ~ \cap ~ \frac{\mathbb{Z}}{N} |}{N} = \frac{126}{25} = 5 + 0.04.\]
<p>As we can see, this is pretty close to \(|I| = 5\).
So, if \(| I \cap \frac{\mathbb{Z}}{N} |\) can be expressed as
\(N(b - a) + \epsilon\), where \(\epsilon\) remains bounded as \(N\) grows,
then the proposition holds <em>in the limit</em>.</p>
<p>To formalize this intuition, consider \(\frac{l}{N}\) and \(\frac{h}{N}\),
the smallest and largest elements of \(I \cap \frac{\mathbb{Z}}{N}\),
where \(l, h \in \mathbb{Z}\).
From our prior example, it is clear that the cardinality of that set is:</p>
\[\vert I \cap \frac{\mathbb{Z}}{N} \vert = h - l + 1.\]
<p>Next, note that by definition, \(\frac{l}{N} \ge a\) and \(\frac{(l - 1)}{N} \lt a\),
so we get \(l \lt Na + 1\). Similarly, \(\frac{h}{N} \le b\) and \(\frac{(h + 1)}{N} \gt b\),
so we get \(h \gt Nb - 1\). Plugging these inequalities into the expression for the set’s cardinality,
we get:</p>
\[\vert I \cap \frac{\mathbb{Z}}{N} \vert = h - l + 1 \gt Nb - Na - 1.\]
<p>Since \(l \ge Na\) and \(h \le Nb\) (again by definition), we also have
\(h - l + 1 \le N(b - a) + 1\). Combining the two bounds, we can write:</p>
\[\vert I \cap \frac{\mathbb{Z}}{N} \vert = N(b - a) - 1 + \epsilon\]
<p>where \(0 \lt \epsilon \le 2\) (note that \(\epsilon\) may depend on \(N\), but it stays bounded).
Plugging this back into the right hand side of the original assertion, we get:</p>
\[\begin{align*}
\lim_{N \to \infty} \frac{|I \cap \frac{\mathbb{Z}}{N} |}{N}
& = \lim_{N \to \infty} \frac{N(b - a) - 1 + \epsilon}{N} \\
& = (b - a) + \lim_{N \to \infty} \frac{\epsilon - 1}{N} \\
& = b - a.
\end{align*}\]
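<p>As a sanity check, here is a quick numerical experiment (a sketch, not from the original exercise; the endpoints are the ones from the example above). It uses the fact that \(\vert I \cap \frac{\mathbb{Z}}{N} \vert = \lfloor Nb \rfloor - \lceil Na \rceil + 1\):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def grid_count(a, b, N):
    # |[a, b] intersect Z/N|: the number of integers n with N*a <= n <= N*b
    return math.floor(N * b) - math.ceil(N * a) + 1

a, b = 3.0, 8.0
for N in (25, 100, 10_000, 1_000_000):
    print(N, grid_count(a, b, N) / N)
# prints 5.04, 5.01, 5.0001, 5.000001: the ratio approaches b - a = 5
</code></pre></div></div>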
<h2>On the Countability of a Particular Equivalence (2019-07-30)</h2>
<p>Let’s define two real numbers to be <em>equivalent</em> (denoted by \(\sim\))
if their difference is a rational number:</p>
\[x \sim y ~ \textrm{for} ~ x, y \in \mathbb{R} ~ \textrm{if and only if} ~ x - y \in \mathbb{Q}.\]
<p>Now consider the resulting <em>equivalence class</em>, which, for any real number \(x\),
is the set of real numbers that are equivalent to \(x\) under \(\sim\):</p>
\[[x] = \{ y \in \mathbb{R} ~ \mid ~ x \sim y \}\]
<p>For example, all the rational numbers trivially fall into \([0]\).</p>
<p><em>How <strong>many</strong> such equivalence classes are there in \(\mathbb{R}\)?</em>
The answer to this question is a key (if small) step along the way to a proof
about the nonexistence of a <em>universal measure</em> (<em>i.e.</em>, a measure defined on
all subsets) on the real numbers, which I came across recently while watching
this <a href="https://www.youtube.com/watch?v=llnNaRzuvd4">intro video about Measure Theory</a>.
The answer, which may seem obvious to some, took me a while to figure out, so I
figured I’d share my thought process:</p>
<p>First, note that each equivalence class is <em>countable</em>. In fact, each class is
exactly as large as the set of rational numbers, which is of course countable.
This becomes obvious if we rewrite the class like so:</p>
\[[x] = \{ x + r ~ \mid ~ r \in \mathbb{Q} \}.\]
<p>Next, notice that the <em>union</em> of all these equivalence classes needs to cover
the real numbers. After all, every real number belongs to (exactly) one of these
classes.</p>
<p>It follows that there must be an <strong>uncountable</strong> number of these classes, since
otherwise, we would have the union of a countable number of countable sets, which
cannot possibly cover \(\mathbb{R}\) (which is uncountable). <em>Neat!</em></p>
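<p>For the skeptical, the same argument as a one-line cardinality computation: if the collection \(C\) of equivalence classes were countable, then, since each class is countable,</p>
\[\vert \mathbb{R} \vert = \Big\vert \bigcup_{[x] \in C} [x] \Big\vert \leq \vert C \vert \cdot \vert \mathbb{Q} \vert = \aleph_0 \cdot \aleph_0 = \aleph_0,\]
<p>contradicting the uncountability of \(\mathbb{R}\).</p>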
<h2>Reverse engineering Metacritic (2015-06-14)</h2>
<p><em>Where I (successfully?) attempt to reverse engineer
the relative weights that the website
<a href="http://www.metacritic.com/">metacritic</a>
assigns to movie critics.</em></p>
<p>metacritic is a popular site that computes an
aggregate <strong>metascore</strong> for each movie.
The metascore is a <em>weighted average</em> of individual critic scores.
The <a href="http://www.metacritic.com/faq">metacritic FAQ Page</a> says:</p>
<blockquote>
<p>Q: Can you tell me how each of the different critics are weighted in your formula?</p>
<p>A: Absolutely not.</p>
</blockquote>
<p>That sounds like a challenge to me.
Using standard machine learning/optimization techniques,
we should be able to
tell which critics are more important than others.
In fact, the same techniques
should also allow us to correctly predict the metascore
for any new movie (given the individual critic ratings).
This post describes my attempt to build such a system.</p>
<p><em>Note: All related code is available on <a href="https://github.com/shashank025/metacritic-weights">my github</a>.</em></p>
<h3 id="the-model">The Model</h3>
<p>We introduce some notation and assumptions about the problem:</p>
<ul>
<li>metacritic uses ratings from \(n\) critics.</li>
<li>Our data set has \(m\) movies.</li>
<li>\(r_{ij}\) is the rating of movie \(i\) by critic \(j\).
This forms an \(m \times n\) matrix.
<ul>
<li>Not all critics rate all movies!
In other words, \(r_{ij}\) may not be defined for all \(i\) and \(j\).</li>
<li>Where defined, the values are constrained: \(0 \leq r_{ij} \leq 100\).</li>
</ul>
</li>
<li>\(\theta_j\) is the <em>relative weight</em> (or importance) of critic \(j\)
(this is what we are trying to <em>learn</em>).
<ul>
<li>There is no point in having a critic weight of \(0\)
(why even consider a critic whose rating does not affect the metascore at all?).</li>
<li>In light of the previous point, we constrain
critic weights to be <em>positive</em>,
<em>i.e.</em>, \(\theta_j > 0\) for all \(j\).</li>
<li>Since these weights are relative, they must add up to one,
<em>i.e.</em>, \(\sum_{j=1}^n \theta_j = 1\).</li>
<li>Critic weights stay constant across movies (but may get updated over time).</li>
<li>The \(n\)-dimensional vector
\(\theta = (\theta_1, \theta_2, \ldots, \theta_n)\)
represents <em>a solution</em>, a possible assignment of weights to critics.</li>
<li>Due to the above constraints, the solution space of \(\theta\) forms
a <em>bounded, <a href="https://en.wikipedia.org/wiki/Hyperplane#Affine_hyperplanes">affine hyperplane</a></em>.</li>
</ul>
</li>
<li>\(p_i\) is the <em>published</em> metascore for movie \(i\).
<ul>
<li>These values are also constrained: \(0 \leq p_i \leq 100\) for all \(i\).</li>
</ul>
</li>
<li>\(y_i(\theta)\) is the <em>predicted</em> metascore for movie \(i\)
for a given choice of relative weights.
<ul>
<li>We will drop the \(\theta\) when it is obvious.</li>
</ul>
</li>
<li>\(p = (p_1, p_2, \ldots, p_m)\) and \(y(\theta) = (y_1(\theta), y_2(\theta), \ldots, y_m(\theta))\)
are vectorized forms we will use for conciseness later.</li>
</ul>
<p>An obvious definition of \(y_i(\theta)\)
is simply a weighted sum:</p>
\[y_i(\theta) = \sum_{j=1}^n \theta_j r_{ij}.\]
<p>But there is a problem with this definition. Remember:
<em>Not all critics rate all movies</em>.
In other words, the summation above may be invalid,
since not all \(r_{ij}\) values are necessarily defined.
How do we deal with this <em>incomplete</em> matrix \(r_{ij}\)?
My best guess is that metacritic normalizes the metascore
over the available critic weights.
For example, assume that the (excellent) movie
<a href="http://www.imdb.com/title/tt0470752/">Ex Machina</a>
has the index \(i = 4\) in our data set.
Assume that only two critics,
with weights \(\theta_1\) and \(\theta_2\)
have currently rated this movie.
We denote their ratings \(r_{41}\) and \(r_{42}\) respectively.
The metascore for this movie is then</p>
\[y_4(\theta_1, \theta_2) = \frac{\theta_1 r_{41} + \theta_2 r_{42}}{\theta_1 + \theta_2}.\]
<p>In fact, the metacritic FAQ page says they wait until
a movie has at least 4 reviews before computing a metascore.
So they want at least 4 defined \(r_{ij}\) values
for a given \(i\).
Let’s define the following additional variables:</p>
<ul>
<li>\(r'_{ij} = r_{ij}\) if movie \(i\) is rated by critic \(j\)
and \(0\) otherwise.</li>
<li>\(e_{ij} = 1\) if movie \(i\) is rated by critic \(j\), and \(0\) otherwise.</li>
</ul>
<p>Note that \(r'_{ij}\) and \(e_{ij}\) are both \(m \times n\) matrices,
but unlike \(r_{ij}\), they are fully defined.</p>
<p>Using these, we modify the definition for \(y_i\):</p>
\[y_i(\theta) = \frac{ \sum_{j=1}^n \theta_j r'_{ij} }{ \sum_{j=1}^n \theta_j e_{ij} }.\]
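<p>In code, this normalized prediction is just a pair of matrix-vector products. Here is a minimal numpy sketch (the tiny ratings matrix is made up for illustration; the real \(r'\) and \(e\) come from the scraped data):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# m = 2 movies, n = 3 critics; zeros in r_prime mark missing ratings
r_prime = np.array([[79.0, 67.0, 0.0],
                    [90.0, 0.0, 55.0]])
e = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])

def predict(theta):
    # y_i(theta) = sum_j theta_j r'_ij / sum_j theta_j e_ij, for every movie i
    theta = np.asarray(theta)
    return (r_prime @ theta) / (e @ theta)

print(predict([0.5, 0.3, 0.2]))  # one predicted metascore per movie
</code></pre></div></div>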
<p>How does this function vary with \(\theta\) (once we fix the \(r_{ij}\) values)?
I wrote up
<a href="https://github.com/shashank025/metacritic-weights/blob/master/bin/mc_ytheta">a little script to plot \(y_4 (\theta_1, \theta_2)\)</a>
for the example involving the movie <em>Ex Machina</em>
(I fixed the critic ratings to
\(r_{41} = 79\) and \(r_{42} = 67\);
I know. Stupid critics!)
The following image of the plot
hopefully makes it clear that
\(y_i (\theta)\) is <em>not</em> linear in \(\theta\).
But the function is still <em>smooth</em> (<em>i.e.</em>, <em>differentiable</em>).</p>
<p><a href="/assets/images/plot-of-y-theta.png">
<img width="400" height="300" src="/assets/images/plot-of-y-theta.png" title="Plot of y(theta) - click to zoom" alt="Plot of y(theta) - click to zoom" />
</a></p>
<p>Now consider the \(m\)-vector \(d(\theta) = p - y(\theta)\).
This vector is a measure of how <em>off</em> the predictions
are from actual metascores for
a given \(\theta\).
We will try to find a \(\theta\)
that minimizes the value of the function
\(f(\theta) = \Vert d(\theta) \Vert\),
where \(\Vert \cdot \Vert\) represents the
<a href="https://en.wikipedia.org/wiki/Lp_space">\(L^2\) norm</a>.
Formally,</p>
\[\DeclareMathOperator*{\argmin}{\arg\!\min}
\begin{gather*}
\argmin_{\theta} \Vert p - y(\theta) \Vert \\
\text{subject to} \\
\sum_{j=1}^n \theta_j = 1, \text{and}\\
\theta_j > 0~\text{for all}~j.
\end{gather*}\]
<p>This is a standard
<a href="https://en.wikipedia.org/wiki/Constrained_optimization">constrained minimization problem</a>.
Our expectation is that any solution \(\theta\)
of the above system
<em>(a)</em> fits the training set well, and
<em>(b)</em> also predicts metascores for new movies.
Notice that \(d\) is not a <em>linear</em> function of \(\theta\)
because \(y(\theta)\) isn’t either.
So, we have to use a
<a href="https://en.wikipedia.org/wiki/Nonlinear_programming">nonlinear solver</a>.</p>
<h3 id="the-implementation">The Implementation</h3>
<p>With (most of the) annoying math out of the way, let’s write code!
The implementation pipeline consists of the following stages:</p>
<ol>
<li>Collect movie ratings data from metacritic.</li>
<li>Preprocess the data:
<ul>
<li>Remove ratings from critics who’ve rated very few movies, and</li>
<li>Create the \(r'_{ij}\) and \(e_{ij}\) matrices.</li>
</ul>
</li>
<li>Partition the data into a <em>training</em> set and a <em>test</em> set.</li>
<li>Find a best fit \(\theta\) by running the optimization routine on the training set.</li>
<li>Compute accuracy against the test set.</li>
<li>Output the results.</li>
</ol>
<p>It turns out that a Makefile is really well suited to
building these kinds of pipelines,
where each stage produces a <em>file</em> that can be used as a Make target
for that stage.
Each stage can be dependent on files produced in one or more previous stages.</p>
<h4 id="collecting-ratings-data-from-metacritic">Collecting ratings data from metacritic</h4>
<p>Unfortunately, metacritic does not, as far as I know,
have any APIs to make this data easily available.
So I periodically scrape metacritic’s
<a href="http://www.metacritic.com/browse/movies/release-date/theaters/metascore?view=condensed">New Movie Releases page</a>
for links to actual metacritic movie pages,
which I then scrape to get the overall metascore,
and the individual critic ratings.</p>
<p>I used a combination of
<a href="http://www.semicomplete.com/projects/xpathtool/">xpathtool</a>
and
the <a href="http://lxml.de/">lxml Python library</a>
for the scraping.</p>
<p>The output of this stage is a
<a href="https://docs.python.org/2/library/pickle.html">Python cPickle</a>
dump file that represents a dictionary of the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ movie_url -> (metascore, individual_ratings), ... }
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">individual_ratings</code> is itself a dictionary of the form</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ critic_name -> numeric_rating, ... }.
</code></pre></div></div>
<p>For example, this structure could look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
'http://www.metacritic.com/movie/mad-max-fury-road/critic-reviews' ->
(89,
{'Anthony Lane (The New Yorker)': 100,
'A.A. Dowd (TheWrap)': 95,
...}),
'http://www.metacritic.com/movie/ex-machina/critic-reviews' ->
(78,
{'Steven Rea (Philadelphia Inquirer)': 100,
'Manohla Dargis (The New York Times)': 90,
...}),
...
}
</code></pre></div></div>
<p>I know cPickle is not exactly the most portable format,
but it works well at this early stage.
In the long run, I want to persist all of the ratings data
in a database (sqlite? Postgres?).</p>
<h4 id="preprocessing">Preprocessing</h4>
<p>We first eliminate from our data set
the long tail of critics who’ve rated very few movies.
Not only are these critics not very influential
in the overall optimization routine,
but eliminating them also
helps reduce \(n\) (the matrix width).
Accordingly, there is a configurable <em>rating count threshold</em>,
currently set to \(5\).
We do one pass over the ratings data and construct
a dictionary of the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ critic_name -> movies_rated, ... }
</code></pre></div></div>
<p>We then do another pass through the data and remove ratings
from critics whose <code class="language-plaintext highlighter-rouge">movies_rated</code> value is lower than the
threshold.</p>
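<p>The two passes might look roughly like this (a sketch; the dictionary shape follows the pickle format above):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

THRESHOLD = 5  # the rating count threshold

def prune_rare_critics(ratings):
    # ratings: { movie_url: (metascore, { critic_name: rating }) }
    # Pass 1: count how many movies each critic has rated.
    counts = Counter(critic
                     for _, critic_ratings in ratings.values()
                     for critic in critic_ratings)
    # Pass 2: drop ratings from critics below the threshold.
    return {url: (score, {c: r for c, r in critic_ratings.items()
                          if counts[c] >= THRESHOLD})
            for url, (score, critic_ratings) in ratings.items()}
</code></pre></div></div>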
<p>The second preprocessing step is to construct the
\(r'_{ij}\) and \(e_{ij}\) matrices, which of course
is
<a href="https://en.wikipedia.org/wiki/Small_matter_of_programming">a simple matter of programming</a>.
I store these values as
<a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html">numpy matrices</a>.</p>
<h4 id="partitioning-the-data-set">Partitioning the data set</h4>
<p>This is straightforward.
I use a configurable <code class="language-plaintext highlighter-rouge">training_frac</code> parameter
(a value in the interval \([0, 1]\)) to probabilistically
split the cleaned up data into a test set and a training
set.</p>
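<p>A sketch of that split (the parameter name is as above; everything else is illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

def split(ratings, training_frac=0.8):
    # Assign each movie to the training set with probability training_frac.
    train, test = {}, {}
    for url, entry in ratings.items():
        (train if random.random() < training_frac else test)[url] = entry
    return train, test
</code></pre></div></div>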
<h4 id="optimization-routine">Optimization routine</h4>
<p>There are numerous “solvers” available for
constrained optimization problems of the type we
described above, but not all of them are
freely available.</p>
<p>I tried the following two solvers,
available as part of
<a href="http://docs.scipy.org/doc/scipy-0.13.0/reference/optimize.html">scipy.optimize</a>:</p>
<table>
<thead>
<tr>
<th>Solver</th>
<th>Differentiability requirements</th>
<th>Allows bounds?</th>
<th>Allows equality constraints?</th>
<th>Allows inequality constraints?</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.optimize.fmin_slsqp.html">Sequential Least Squares Programming (SLSQP)</a></td>
<td>The objective function and the constraints should be twice <a href="https://en.wikipedia.org/wiki/Differentiable_function#Differentiability_classes">continuously differentiable</a></td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td><a href="http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_cobyla.html">Constrained Optimization By Linear Approximations (COBYLA)</a></td>
<td>None</td>
<td>No</td>
<td>No</td>
<td>Yes</td>
</tr>
</tbody>
</table>
<p>Note that \(y(\theta)\) (and therefore \(f(\theta)\)) satisfies
the differentiability requirement of SLSQP.</p>
<p>Also, COBYLA does not allow you to specify bounds on
\(\theta\) values or equality constraints.
So, we employ a common technique in optimization formulations,
which is to push the constraints <em>into the objective function</em>.
Consider the “tub” function \(\tau(x, l, u)\) defined as:</p>
\[\tau (x, l, u) =
\begin{cases}
0 & \quad \text{if } l \leq x \leq u, \\
1 & \quad \text{otherwise}.
\end{cases}\]
<p>Our modified objective function (for use with COBYLA) becomes:</p>
\[f(\theta) = \Vert p - y(\theta) \Vert
+ P_h \cdot \Vert 1 - \sum_{j=1}^n \theta_j \Vert
+ P_b \cdot \sum_{j = 1}^n \tau(\theta_j, 0, 1),\]
<p>where \(P_h\) and \(P_b\) are configurable weights that decide
how much we should <em>penalize</em> the optimization algorithm when it
chooses:</p>
<ul>
<li>a \(\theta\) that doesn’t lie on the affine hyperplane,
and</li>
<li>\(\theta_j\) values outside the interval
\([0, 1]\),</li>
</ul>
<p>respectively.</p>
<p>Setting both \(P_h\) and \(P_b\) to 0 reduces our objective
function to its original form, so we can use the same
function for both solvers by simply tweaking these weights.</p>
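<p>A sketch of the penalized objective for COBYLA, reusing <code class="language-plaintext highlighter-rouge">p</code>, <code class="language-plaintext highlighter-rouge">predict</code>, and <code class="language-plaintext highlighter-rouge">n</code> from the earlier sketches (the penalty weights are illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def tau(x, lo, hi):
    # the "tub" function: 0 inside [lo, hi], 1 outside
    return 0.0 if lo <= x <= hi else 1.0

P_h, P_b = 100.0, 100.0  # illustrative penalty weights

def penalized_objective(theta):
    fit = np.linalg.norm(p - predict(theta))
    hyperplane_penalty = P_h * abs(1.0 - np.sum(theta))
    bounds_penalty = P_b * sum(tau(t, 0.0, 1.0) for t in theta)
    return fit + hyperplane_penalty + bounds_penalty

# COBYLA takes no bounds or equality constraints, so they live in the objective:
result = minimize(penalized_objective, x0=np.full(n, 1.0 / n), method='COBYLA')
</code></pre></div></div>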
<h3 id="results">Results</h3>
<p>Before I actually launch into details, I should note
the following issues right at the outset:</p>
<ul>
<li>I was actually unable to get either SLSQP or COBYLA
to ever successfully converge on a solution.</li>
<li>The \(\theta\) values (<em>i.e.</em>, critic weights)
learned by these solvers
were often <em>way</em> outside the interval \([0, 1]\).</li>
</ul>
<p>Most of the time, both routines finished their iterations
and failed with errors of this form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>optimization failed [8]: Positive directional derivative for linesearch
optimization failed [2]: Maximum number of function evaluations has been exceeded
</code></pre></div></div>
<p>If you have experience with the scipy optimization
library, I’d love to hear any suggestions you may
have on how to deal with these errors.</p>
<p>Perhaps more interestingly, in spite of the above issues,
the learned \(\theta\) values were still able to successfully
predict metascores for movies in the test set.</p>
<p>After removing ratings from insignificant critics,
I constructed a training set of about 2800 ratings
of 190 movies by 188 critics.</p>
<p>The following table lists
the <strong>top 20</strong> critics by weight
learned using the above training set
with each optimization routine.
Note that the weights are expressed as a fraction
of the weight of the <em>top</em> critic in each list.
Interestingly enough,
both algorithms think
<a href="http://connect.nola.com/user/mbscott/posts.html">Mike Scott of the New Orleans Times-Picayune</a>
is the metacritic MVP.
So, for example, according to SLSQP,
a review by Justin Lowe carries only
95% of the importance that is given
to a review by Mike Scott.</p>
<table>
<thead>
<tr>
<th style="text-align: center">SLSQP</th>
<th style="text-align: left"> </th>
<th style="text-align: center">COBYLA</th>
<th style="text-align: left"> </th>
</tr>
<tr>
<th style="text-align: center">Weight</th>
<th style="text-align: left">Critic</th>
<th style="text-align: center">Weight</th>
<th style="text-align: left">Critic</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">\(\cdot\)</td>
<td style="text-align: left">Mike Scott (New Orleans Times-Picayune)</td>
<td style="text-align: center">\(\cdot\)</td>
<td style="text-align: left">Mike Scott (New Orleans Times-Picayune)</td>
</tr>
<tr>
<td style="text-align: center">0.949948</td>
<td style="text-align: left">Justin Lowe (The Hollywood Reporter)</td>
<td style="text-align: center">0.902288</td>
<td style="text-align: left">Slant Magazine</td>
</tr>
<tr>
<td style="text-align: center">0.929495</td>
<td style="text-align: left">Jordan Hoffman (The Guardian)</td>
<td style="text-align: center">0.900024</td>
<td style="text-align: left">Ronnie Scheib (Variety)</td>
</tr>
<tr>
<td style="text-align: center">0.914186</td>
<td style="text-align: left">Marjorie Baumgarten (Austin Chronicle)</td>
<td style="text-align: center">0.890113</td>
<td style="text-align: left">Wes Greene (Slant Magazine)</td>
</tr>
<tr>
<td style="text-align: center">0.910820</td>
<td style="text-align: left">Fionnuala Halligan (Screen International)</td>
<td style="text-align: center">0.887845</td>
<td style="text-align: left">Chris Cabin (Slant Magazine)</td>
</tr>
<tr>
<td style="text-align: center">0.909801</td>
<td style="text-align: left">James Mottram (Total Film)</td>
<td style="text-align: center">0.885791</td>
<td style="text-align: left">Martin Tsai (Los Angeles Times)</td>
</tr>
<tr>
<td style="text-align: center">0.904564</td>
<td style="text-align: left">Variety</td>
<td style="text-align: center">0.863626</td>
<td style="text-align: left">Lawrence Toppman (Charlotte Observer)</td>
</tr>
<tr>
<td style="text-align: center">0.903029</td>
<td style="text-align: left">Guy Lodge (Variety)</td>
<td style="text-align: center">0.858237</td>
<td style="text-align: left">Anthony Lane (The New Yorker)</td>
</tr>
<tr>
<td style="text-align: center">0.897749</td>
<td style="text-align: left">Inkoo Kang (TheWrap)</td>
<td style="text-align: center">0.845864</td>
<td style="text-align: left">Fionnuala Halligan (Screen International)</td>
</tr>
<tr>
<td style="text-align: center">0.894605</td>
<td style="text-align: left">indieWIRE</td>
<td style="text-align: center">0.834088</td>
<td style="text-align: left">Boyd van Hoeij (The Hollywood Reporter)</td>
</tr>
<tr>
<td style="text-align: center">0.892237</td>
<td style="text-align: left">Ben Kenigsberg (The New York Times)</td>
<td style="text-align: center">0.820908</td>
<td style="text-align: left">Variety</td>
</tr>
<tr>
<td style="text-align: center">0.882656</td>
<td style="text-align: left">Mike D’Angelo (The Dissolve)</td>
<td style="text-align: center">0.814152</td>
<td style="text-align: left">Nicolas Rapold (The New York Times)</td>
</tr>
<tr>
<td style="text-align: center">0.875550</td>
<td style="text-align: left">Simon Abrams (Village Voice)</td>
<td style="text-align: center">0.735623</td>
<td style="text-align: left">Justin Lowe (The Hollywood Reporter)</td>
</tr>
<tr>
<td style="text-align: center">0.875385</td>
<td style="text-align: left">Martin Tsai (Los Angeles Times)</td>
<td style="text-align: center">0.629166</td>
<td style="text-align: left">Mark Olsen (Los Angeles Times)</td>
</tr>
<tr>
<td style="text-align: center">0.875062</td>
<td style="text-align: left">Manohla Dargis (The New York Times)</td>
<td style="text-align: center">0.625141</td>
<td style="text-align: left">The Globe and Mail (Toronto)</td>
</tr>
<tr>
<td style="text-align: center">0.874889</td>
<td style="text-align: left">Kyle Smith (New York Post)</td>
<td style="text-align: center">0.567734</td>
<td style="text-align: left">James Berardinelli (ReelViews)</td>
</tr>
<tr>
<td style="text-align: center">0.872482</td>
<td style="text-align: left">Nicolas Rapold (The New York Times)</td>
<td style="text-align: center">0.562606</td>
<td style="text-align: left">Peter Sobczynski (RogerEbert.com)</td>
</tr>
<tr>
<td style="text-align: center">0.869911</td>
<td style="text-align: left">James Berardinelli (ReelViews)</td>
<td style="text-align: center">0.558980</td>
<td style="text-align: left">John Anderson (Wall Street Journal)</td>
</tr>
<tr>
<td style="text-align: center">0.863118</td>
<td style="text-align: left">Ronnie Scheib (Variety)</td>
<td style="text-align: center">0.520855</td>
<td style="text-align: left">Steve Macfarlane (Slant Magazine)</td>
</tr>
<tr>
<td style="text-align: center">0.860796</td>
<td style="text-align: left">Nikola Grozdanovic (The Playlist)</td>
<td style="text-align: center">0.510842</td>
<td style="text-align: left">Jordan Mintzer (The Hollywood Reporter)</td>
</tr>
</tbody>
</table>
<p>I should clarify that the top 20 list can change from one run to the next,
since it depends on the training set, which is chosen at random.</p>
<p>Next, we show a few sample predicted metascores:</p>
<table>
<thead>
<tr>
<th>Movie</th>
<th>Actual Metascore</th>
<th>Predicted (SLSQP)</th>
<th>Predicted (COBYLA)</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://www.imdb.com/title/tt3247714/">Survivor</a></td>
<td><a href="http://www.metacritic.com/movie/survivor">26</a></td>
<td>30 (16.18%)</td>
<td>33 (28.12%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt1674771/">Entourage</a></td>
<td><a href="http://www.metacritic.com/movie/entourage">38</a></td>
<td>57 (50.25%)</td>
<td>43 (15.65%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt1823672/">Chappie</a></td>
<td><a href="http://www.metacritic.com/movie/chappie">41</a></td>
<td>46 (12.89%)</td>
<td>43 (6.00%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt3215846/">Dreamcatcher</a></td>
<td><a href="http://www.metacritic.com/movie/dreamcatcher-2015">86</a></td>
<td>77 (-10.46%)</td>
<td>83 (-2.77%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt2395427/">Avengers: Age of Ultron</a></td>
<td><a href="http://www.metacritic.com/movie/avengers-age-of-ultron">66</a></td>
<td>62 (-5.05%)</td>
<td>67 (1.57%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt2788556/">Gemma Bovery</a></td>
<td><a href="http://www.metacritic.com/movie/gemma-bovery">57</a></td>
<td>0 (-100.00%)</td>
<td>61 (7.15%)</td>
</tr>
<tr>
<td><a href="http://www.imdb.com/title/tt3218580/">Alleluia</a></td>
<td><a href="http://www.metacritic.com/movie/alleluia">84</a></td>
<td>90 (7.41%)</td>
<td>87 (4.13%)</td>
</tr>
</tbody>
</table>
<p>With a test set of size 60, the movie predictions
by the two algorithms had the following
<a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">RMSE</a> values:</p>
<table>
<thead>
<tr>
<th>SLSQP</th>
<th>COBYLA</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.053437</td>
<td>0.016942</td>
</tr>
</tbody>
</table>
<h2 id="takeaways">Takeaways</h2>
<ul>
<li>Working out the math was obviously fun!
<ul>
<li>I got to brush up on my long-dormant skills with numpy, scipy, matplotlib, etc.</li>
</ul>
</li>
<li>Even a toy project like this one can end up demanding
substantial time and attention,
especially if it is something you want to share
with the rest of the world.
For example, I ended up setting up proper Python package management for all the code.</li>
<li>Writing this blog post was also extremely useful because
it clarified my own thinking on the topic. I was actually
able to go back and refactor the code to better match
the implementation pipeline I described above.</li>
</ul>
<hr />
<h2>Bloom filters for set intersections? (2015-06-07)</h2>
<p>A Bloom filter is a probabilistic data structure that
approximately and efficiently answers the <em>set membership</em>
question. Given a Bloom filter \(B\) constructed out of the elements
of a set \(S\), \(B(x)\) returns \(YES\) or \(NO\),
indicating whether \(x\) is a member of \(S\).
Bloom filters are <em>approximate</em> since they can return
false positives. The false positive rate \(\epsilon\) for a Bloom filter
can be expressed by:</p>
\[Pr(B(x) = \text{YES} \mid x \notin S) = \epsilon.\]
<p>An interesting question is:
<em>Can Bloom filters be used to compute whether two sets are <strong>disjoint</strong>?</em></p>
<p>That is, given a set \(S\) represented by a Bloom filter \(B\),
and another set \(Q\), can we use the Bloom filter to accurately
determine whether \(Q \cap S = \varnothing\)?
An obvious procedure to compute disjointness is:</p>
<blockquote>
<p><em>Procedure \(P\):</em>
Iterate through each element \(q\) of the set \(Q\).
If \(B(q) = YES\) for any element \(q\),
then return \(NO\), else return \(YES\).</p>
</blockquote>
<p>Before we can reason about the accuracy of this procedure, let’s define
the following events:</p>
<ul>
<li>\(D\): the event that \(S\) and \(Q\) are disjoint
(and \(D'\) to be its complement), and</li>
<li>\(A\): the event that \(P\) is accurate (and \(A'\) to be its complement).</li>
</ul>
<p>Now, we make the following claims about our procedure \(P\):</p>
<blockquote>
<p><em>Claim 1</em>: \(Pr(A' \mid D') = 0\).</p>
</blockquote>
<p>In other words, our procedure \(P\) is guaranteed to be correct when
\(S\) and \(Q\) are <em>not</em> disjoint.
This claim follows easily from the fact that a Bloom filter cannot have
false negatives, <em>i.e.</em>, if \(q\) does exist in \(S\), then \(B(q)\) will
return \(YES\).</p>
<blockquote>
<p><em>Claim 2</em>: \(Pr(A' \mid D) \ge 0\).</p>
</blockquote>
<p>In other words, \(P\) <em>can</em> be inaccurate when
\(S\) and \(Q\) are disjoint.
Specifically, this procedure can be inaccurate if
the Bloom filter returns \(YES\) for at least one element in \(Q\),
given that no element in \(Q\) belongs to \(S\).
But <em>how</em> inaccurate?
It is often easier to compute the probability of the complementary event
\(A \mid D\),
which is the event that the Bloom filter returns \(NO\) for every element of \(Q\).
The probability that the Bloom filter returns \(NO\) for a given element
\(q \in Q\) can be expressed by:</p>
\[Pr(B(q) = \text{NO}~\mid~q \notin S) = 1 - \epsilon.\]
<p>Assuming independence of Bloom filter output for the different elements of
\(Q\), and assuming that \(Q\) has \(n\) elements, we can write:</p>
\[Pr(A \mid D) = (1 - \epsilon)^n.\]
<p>Further, notice that \(A\) and \(D\) are <em>not</em> independent:
by <em>Claim 1</em>, the procedure is always accurate when the sets are
<em>not</em> disjoint, so all of the inaccuracy comes from the disjoint case.
By the law of total probability:</p>
\[Pr(A') = Pr(A' \mid D) \cdot Pr(D) + Pr(A' \mid D') \cdot Pr(D')
= \left( 1 - (1 - \epsilon)^n \right) \cdot Pr(D).\]
<p>In the worst case, where the sets are in fact disjoint, this becomes:</p>
\[Pr(A') = 1 - (1 - \epsilon)^n.\]
<p>Since \(0 \lt \epsilon \lt 1\), this inaccuracy grows with \(n\).
This makes intuitive sense: the larger the set \(Q\),
the better the chances that we run into a Bloom filter false positive.
For example, with \(\epsilon = 0.01\) and \(n = 100\), the worst-case error
probability is already \(1 - 0.99^{100} \approx 63\%\).
In plain English: Bloom filters are probably <strong>not</strong> a good idea for computing
whether two sets are disjoint.</p>
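<p>For the curious, here is a toy simulation of procedure \(P\) (a sketch: the filter parameters are deliberately small so that false positives are easy to observe, and the hashing scheme is just one simple choice):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

class BloomFilter:
    def __init__(self, size, k):
        self.size, self.k, self.bits = size, k, [False] * size

    def _positions(self, x):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = True

    def contains(self, x):
        # B(x): YES iff every one of the k bit positions is set
        return all(self.bits[pos] for pos in self._positions(x))

def procedure_p(bloom, Q):
    # returns True ("not disjoint") if B(q) = YES for any q in Q
    return any(bloom.contains(q) for q in Q)

bloom = BloomFilter(size=200, k=2)  # tiny filter, so epsilon is large
for x in range(1000, 1100):
    bloom.add(x)
# S and Q are disjoint here, so printing True below is a false positive:
print(procedure_p(bloom, range(50)))
</code></pre></div></div>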
<h2>Bay Bridges Challenge (solved!) (2014-03-10)</h2>
<p>I
<a href="/2013/12/16/a-graph-theoretic-approach-to-the-bay-bridges-challenge">previously wrote</a>
about the
<a title="Bay Bridges Challenge" href="https://www.codeeval.com/open_challenges/109/">Bay Bridges Challenge</a>,
hosted at
<a title="CodeEval" href="https://www.codeeval.com/">CodeEval</a>.
In my last post, I showed that the problem could be modeled as a
<a title="Wikipedia page on Vertex Cover" href="http://en.wikipedia.org/wiki/Vertex_cover">minimum vertex cover</a> problem,
and wondered if we can do better than iterating through the
<a title="Wikipedia page on Power Sets" href="http://en.wikipedia.org/wiki/Power_set">power set</a>
of bridges, and picking the highest cardinality subset that is feasible.
I said that most likely,</p>
<blockquote><p>there is additional structure inherent in the problem, that can be exploited to make the problem more tractable.</p></blockquote>
<p>Spurred by some <a title="helpful comments on previous post about Bay Bridges challenge" href="http://shashankr.wordpress.com/2013/12/16/a-graph-theoretic-approach-to-the-bay-bridges-challenge/#comment-68">helpful recent comments</a>, I spent some more time on the problem. As it turns out, we <em>can</em> do better than an exhaustive search. But first, remember that a <strong><em>feasible solution</em></strong> is any set of bridges with <strong>no</strong> intersections. Our task is to find <em>an</em> <strong><em>optimal solution</em></strong>, which is simply the <em>largest</em> such feasible solution (note that there can be more than one).</p>
<p><strong><em>Claim 1:</em></strong> If there is no feasible solution with \(k\) bridges, then there cannot be a larger feasible solution.</p>
<p><em>Proof (by contradiction):</em> Assume that there <em>is</em> a solution with \(l = k + 1\) bridges. The removal of any bridge from this solution should still be a set of non-intersecting bridges, <em>i.e.</em>, a feasible solution, but of size \(k\), which is a contradiction. By induction, we can see that this is also true for higher values of \(l\). QED.</p>
<p>With this claim in hand, let us partition the power set of \(n\) bridges so that \(p(i)\) represents
the set of all bridge sets of size \(i\). We can then make the following observation about the
\(n\) partitions (we can safely ignore the empty partition \(p(0)\)):</p>
<blockquote>
<p>If any set in \(p(n/2)\) is <em>feasible</em>, the <em>optimal</em> solution is
in one of the partitions \(p(n/2)\) through \(p(n)\).
Otherwise, it is in one of the partitions \(p(1)\) through \(p(n/2 - 1)\).</p>
</blockquote>
<p>In other words, we can do a <em>binary search</em> on the partition index \(i\)
until we find the partition with the optimal solution. While this seems very promising,
the partition size \(|p(i)| = {n \choose i}\) unfortunately attains its
<em>maximum</em> at \(i = n/2\), and \(|p(n/2)| = {n \choose n/2}\) still grows exponentially in \(n\).
So, in the worst case, we still need to search an exponential number of bridge sets.
But in practice, this should still be significantly better than exhaustively searching each partition.</p>
<p>Anyway, I didn’t really feel like coding up this binary search, so I tried submitting a simpler,
more naive solution: Search each partition \(p(i)\), starting with \(p(n)\),
in decreasing order of \(i\), until you find the <em><strong>first</strong></em> feasible solution \(f\), and return \(f\).
Because of <em>Claim 1</em>, it is easy to see that \(f\) <em>will</em> be
<em>an</em> optimal solution to our problem.
It turns out that this approach was enough to get a score of 100 (with a ranking of 86).
So, now I am even less inclined to implement binary search.</p>
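<p>For reference, a sketch of that naive search (<code class="language-plaintext highlighter-rouge">crosses</code> is a stand-in for the segment-intersection test, which is not shown here):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import combinations

def crosses(b1, b2):
    # stand-in: True iff the two bridge segments intersect
    raise NotImplementedError

def feasible(bridges):
    return not any(crosses(x, y) for x, y in combinations(bridges, 2))

def largest_feasible(bridges):
    # search p(n), p(n-1), ...; by Claim 1, the first feasible set is optimal
    for size in range(len(bridges), 0, -1):
        for candidate in combinations(bridges, size):
            if feasible(candidate):
                return candidate
    return ()
</code></pre></div></div>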
<p>Also, I should really post the code for all of my solutions to github, which I hope to do soon.</p>
<p><em><strong>Update:</strong></em> I have checked in <a title="bay bridges solution code (Python)" href="https://github.com/shashank025/codeeval/blob/master/bridges.py">solution code for the Bay Bridges challenge</a> and other codeeval challenges at github. Check it out at <a title="my github repo for codeeval challenges" href="https://github.com/shashank025/codeeval">my github</a>.</p>
<hr />
<h2>A Graph Theoretic Approach to the Bay Bridges Challenge (2013-12-16)</h2>
<p><strong>Note</strong>:
<em>If you are impatient, see my newer post on
<a href="/2014/03/10/bay-bridges-challenge-solved">an actual solution to the bay bridges challenge</a>.</em></p>
<p>I recently came across the
<a title="Bay Bridges Challenge" href="http://blog.codeeval.com/bridges/">Bay Bridges Challenge</a>,
over at
<a title="CodeEval" href="https://www.codeeval.com/">CodeEval</a>.
The challenge is to pick the largest set of bridges for construction,
such that no two bridges cross or <em>intersect</em>.</p>
<p>My first instinct was to model this as a
<a title="Wikipedia page on Graph Theory" href="http://en.wikipedia.org/wiki/Graph_theory">graph</a> problem.
Suppose we construct an <a title="Wikipedia page on Intersection Graphs" href="http://en.wikipedia.org/wiki/Intersection_graph"><em>intersection graph</em></a>
<strong>G</strong>, as follows:
create a graph node (or a <em>vertex</em>) corresponding to each bridge,
and add an edge between two vertices <em>u</em> and <em>v</em>
if and only if the corresponding bridges cross.
Figure 1 below shows an example.
Then, it’s easy to see that the original problem is equivalent to
finding the <a title="Wikipedia page on Vertex Cover" href="http://en.wikipedia.org/wiki/Vertex_cover">minimum vertex cover</a> on G:
the bridges <em>outside</em> a minimum vertex cover form a largest set of mutually
non-crossing bridges (<em>i.e.</em>, a maximum independent set).</p>
<p><a href="https://docs.google.com/drawings/d/1bWFPkc1C-RjBSOSsaK_VUUA4FZ6EaKpABrT7iuN9BL4/pub?w=960&h=720">
<img width="480" height="360" src="/assets/images/bay_bridges.png" title="A Sample Intersection Graph" alt="A Sample Intersection Graph" />
</a>
<em>Figure 1:</em> A Sample Intersection Graph</p>
<p>It turns out that finding the minimum vertex cover is <a title="Wikipedia page on NP-hard problems" href="http://en.wikipedia.org/wiki/NP-hard">NP-hard</a>. For this problem, it means (roughly) that we can do <em>no better</em> than to iterate through the <a title="Wikipedia page on Power Sets" href="http://en.wikipedia.org/wiki/Power_set">power set</a> of bridges, and pick the highest cardinality subset that is <em>feasible</em> (i.e., no two bridges in that subset cross). That was a bit disheartening. But it is unlikely that a run-of-the-mill coding challenge requires coming up with acceptable approximation algorithms to hard problems. Most likely, there is additional structure inherent in the problem, that can be exploited to make the problem more tractable.</p>
<p>Since the bridges exist in a 2-D plane (roughly), we can use the <a title="Euclidean Distance" href="http://en.wikipedia.org/wiki/Euclidean_distance">distance metric</a> to automatically rule out large portions of the search space. For example, if a bridge does not cross any other bridge, then it is always included in every optimal solution. By eliminating such bridges from consideration, we might reduce the problem size considerably. Spatial data structures like bounding rectangles, <a title="k-d trees" href="http://en.wikipedia.org/wiki/K-d_tree">k-d trees</a>, or their cousins can be used to partition the set of bridges into smaller subsets, and solve many smaller, independent sub-problems (maybe even in parallel). But none of these approaches reduce the essential <em>hardness</em> of the problem, since we are still looking at exponential worst-case running times.</p>
<p>At this point, I was feeling a bit stuck. My attempts at finding
additional structure in the original problem space hadn’t really helped.
But going over the Wikipedia page on
<a title="Wikipedia page on Vertex Cover" href="http://en.wikipedia.org/wiki/Vertex_cover">vertex covers</a>,
I read that a minimum vertex cover can be found in polynomial time for
<a title="Wikipedia page on Bipartite Graphs" href="http://en.wikipedia.org/wiki/Bipartite_graph">bipartite graphs</a>.
Finally, some progress:
If the intersection graph G is bipartite
(a graph’s bipartite-ness can be tested in <em>linear</em> time;
see the sketch at the end of this post),
we have an efficient solution.
But it is not hard to come up with real world inputs
where the intersection graph is <em>not</em> bipartite.
For example, in Figure 1 above, the intersection graph
is not actually bipartite
(However, if you imagined that bridge b4 was removed from the input,
the remaining graph would indeed become bipartite).
So, we still don’t have a solution that always works.</p>
<p>As of this point, I haven’t written a line of code.
My hunch is that I need to look for a solution in the original problem space:
the roughly Euclidean plane on which the line segments corresponding to
the bridges exist,
and not in the intersection graph space.</p>
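<p><em>Aside:</em> the linear-time bipartite-ness test mentioned above is just breadth-first 2-coloring. A sketch, assuming the intersection graph is given as an adjacency list:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import deque

def is_bipartite(adj):
    # adj: { vertex: iterable of neighboring vertices }
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # odd cycle: not bipartite
    return True
</code></pre></div></div>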