## Math still not the answer

I wrote a quick (but not very elegant) `python` script to retrieve enough data locally from www.metacritic.com for pattern recognition purposes. The main goal is to help me decide how much I will enjoy a movie before watching it. I included the script at the end of the post, in case you want to try it yourself (and maybe improve it too!). It takes a while to complete, although it is quite entertaining to watch its progress on screen. At the end, it provides two lists of the same length: `critics`, a list of `str` containing the names of the critics; and `scoredMovies`, a list of `dict` containing, at index `k`, the scores of all the movies evaluated by the critic at index `k` in the previous list.

For example:

```python
>>> critics[43]
'James White'
>>> scoredMovies[43]
{'hall-pass': 60, 'the-karate-kid': 60, 'the-losers': 60,
 'the-avengers-2012': 80, 'the-other-guys': 60, 'shrek-forever-after': 80,
 'the-lincoln-lawyer': 80, 'the-company-men': 60, 'jonah-hex': 40,
 'arthur': 60, 'vampires-suck': 20, 'american-reunion': 40,
 'footloose': 60, 'real-steel': 60}
```

The number of films scored varies from critic to critic: some individuals gave their opinion on a few dozen movies, while others took the trouble to evaluate up to four thousand flicks! Note also that the names of the movies correspond to their web pages on www.metacritic.com. For example, to see what critics have to say about “The Karate Kid” and other relevant information online, point your browser to www.metacritic.com/movie/**the-karate-kid**. This convention also comes in very handy when there are several versions of a single title: *Which “Karate Kid” does this score refer to, the one from the eighties, or Jackie Chan’s?*

Feel free to download a copy of the resulting data [here] (note it is a large file: 1.6MB).

Having this data stored locally allows us to retrieve that information with simple `python` commands, and to perform more complex operations on it.

Let’s see, for example, which critics on the list scored the movie “Juno,” and how they liked it:

```python
>>> [[critics[index], x['juno']] for index, x in enumerate(scoredMovies)
     if 'juno' in x]
[['Roger Ebert', 100],
 ['Peter Travers', 88],
 ['Joe Morgenstern', 80],
 ['David Denby', 90],
 ['A.O. Scott', 90],
 ['James Berardinelli', 88],
 ['Michael Phillips', 88],
 ['Lou Lumenick', 100],
 ['Todd McCarthy', 80],
 ['Michael Sragow', 58],
 ['Claudia Puig', 100],
 ['J.R. Jones', 70],
 ['Scott Tobias', 83],
 ['Kirk Honeycutt', 80],
 ['Ty Burr', 0],
 ['Stephanie Zacharek', 80],
 ['Lawrence Toppman', 75],
 ['Richard Schickel', 80],
 ['Desson Thomson', 90],
 ['Liam Lacey', 75],
 ['Carrie Rickey', 88],
 ['Maitland McDonagh', 75]]
```

This is a particularly interesting example. Juno is one of those movies that I watched once, and that stayed on my mind for a long time: the topic leads to rich discussions; the characters are deep, although stereotypical; I resided in that part of the country, and it is always enjoyable to see the old sights again and take that trip down Memory Lane. For these and many other reasons, I would grant the flick a solid B+ (85 on *my* scale). Note that no critic in this sample scored it exactly like me, although quite a few came close (those that gave it 83 or 88, for instance).

This is the starting point for my algorithm. I will assign a *weight* to each critic, updated after each movie I score. For example, critics not in the previous list will get a weight of zero for this movie (since they did not even see the film). Critics in this list that gave Juno 83 points will have a weight close to 1, and the larger the difference between a critic’s score of Juno and mine, the smaller the weight. By picking movie after movie and scoring them, I **update** the weights for each critic (maybe by averaging all the positive values together, although there are better methods).
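The averaging can be kept incremental, so no per-movie weights need to be stored. A minimal sketch in Python 3, with a hypothetical helper name of my own choosing (this is the idea behind the update, not the exact code used later in the post):

```python
def update_average(avg, count, new_weight):
    """Fold one more per-movie weight into a critic's running average.

    `avg` is the critic's current average weight over `count` movies;
    returns the updated (average, count) after one more scored movie.
    """
    return (avg * count + new_weight) / (count + 1), count + 1

# example: a critic whose first two shared movies earned weights 0.9 and 0.7
avg, n = update_average(0.0, 0, 0.9)
avg, n = update_average(avg, n, 0.7)
print(avg, n)  # average settles at 0.8 over 2 movies
```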

As for the individual weights of the critics that evaluated Juno, we could follow this formula:

weight = exp( −(85 − score)² / 5211 )

The number 85 is of course the score I gave Juno: everybody gets their weight according to the difference between their evaluation and mine. The number 5211 is approximately 85²/ln 4, which is what you need to grant a smallest weight of 25% to any critic that watched the same movie. You can play with these parameters, of course.
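As a quick sanity check of the weighting scheme, here is a sketch in Python 3 (`critic_weight` is a name of my own, and the 25% floor is exposed as a parameter):

```python
import math

def critic_weight(my_score, critic_score, floor=0.25):
    # weight = exp(-ln(1/floor) * difference^2 / my_score^2);
    # with my_score = 85 and floor = 0.25, the denominator
    # 85**2 / ln(4) is roughly the 5211 quoted above
    diff = my_score - critic_score
    return math.exp(-math.log(1.0 / floor) * diff**2 / float(my_score**2))

print(critic_weight(85, 85))            # perfect agreement: weight 1.0
print(critic_weight(85, 0))             # maximal disagreement: weight 0.25
print(round(critic_weight(85, 88), 3))  # a close call keeps a weight near 1
```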

I can now put into practice the algorithm that I proposed in my previous post. Let me start by picking a small random set of movies that I have seen, starting, of course, with “The Blair Witch Project.” I use that training set to compute the weight for each critic, and then use those weights to assess different movies (some that I have seen, some that I haven’t):

```python
mymovies = dict({'oldboy': 85, 'juno': 85, 'vicky-cristina-barcelona': 85,
                 'pans-labyrinth': 95,
                 'indiana-jones-and-the-kingdom-of-the-crystal-skull': 85,
                 'planet-of-the-apes': 90, '8-women': 50,
                 'being-john-malkovich': 90, 'pulp-fiction': 95,
                 'munich': 95, 'district-9': 85, 'frequency': 75,
                 'the-adventures-of-tintin': 90, 'sleepy-hollow': 80,
                 'signs': 95, 'spider-man-3': 50, 'space-cowboys': 60,
                 'this-is-spinal-tap': 70, 'spanglish': 70, 'proof': 70,
                 'the-blair-witch-project': 40})
```

I can now compute the weights for all the critics by averaging the non-zero values of the basic weights indicated above:

```python
import numpy

nonzeros = numpy.zeros(len(movieData.critics))
weights = numpy.zeros(len(movieData.critics))

# one entry [criticIndex, basicWeight] per critic/movie pair we share
weighthelper = [[index,
                 numpy.exp((-1) * numpy.log(4) * (mymovies[x] - yourmovies[x])**2
                           / float(mymovies[x]**2))]
                for index, yourmovies in enumerate(movieData.scoredMovies)
                for x in mymovies if x in yourmovies]

# running average of the basic weights for each critic
for datum in weighthelper:
    weights[datum[0]] *= nonzeros[datum[0]]
    weights[datum[0]] += datum[1]
    nonzeros[datum[0]] += 1
    weights[datum[0]] /= nonzeros[datum[0]]

weights = weights.tolist()

def assessMovie(movieTag, scoredMovies, weights):
    # weighted average over the critics that actually reviewed this movie
    helper = [scoreOf[movieTag] * weights[index]
              for index, scoreOf in enumerate(scoredMovies)
              if movieTag in scoreOf]
    relevantWeights = [weights[index]
                       for index, scoreOf in enumerate(scoredMovies)
                       if movieTag in scoreOf]
    return reduce(lambda x, y: x + y, helper) / float(numpy.sum(relevantWeights))
```

Let’s test it:

```python
>>> weights
[0.9316894375373113, 0.9440456788129853, 0.8276943837371913,
 0.8354699447724799, 0.9096233757112504, 0.9131781397354621,
 0.0, 0.792063820900018, 0.0,
 0.7905341294370402, 0.8911293907698545, 0.9591397211822409,
 0.9122390115047787, 0.9094164789352943, 0.0,
 0.8743220704979506, 0.8091610605956614, 0.9830308800891736,
 0.9377611752511361, 0.0, 0.8588275926217391,
 0.8881609250607919, 0.9503520095147542, 0.0,
 0.9806218435940762, 0.8326500757874696, 0.8389774746326207,
 0.8743683771430136, 0.8894078063468858, 0.0,
 0.0, 0.7329668437056417, 0.916911984206937,
 0.41179550863378656, 0.0, 0.9830308800891736,
 0.0, 0.0, 0.8543417500931425,
 0.7472733165814486, 0.959985337534043, 0.25,
 0.0, 0.0, 0.9153304682482231,
 0.9213229176351494, 0.7186166247610994, 0.9638336270354847,
 0.5553963593823906, 0.9938576336658272, 0.9719197080733987,
 0.7084015185939553, 0.9291760858845585, 0.8504427376265632,
 0.9554366051158306, 0.9195026707321936, 0.7071067811865476,
 0.9047116150472949, 0.9123122074919429, 0.0,
 0.9374115172355054, 0.0, 0.0,
 0.8695144502247731, 0.8896854138078707, 0.25,
 0.817259795695882, 0.0, 1.0,
 0.8778357293248331, 0.6359295154878736, 0.9247356530790823,
 0.8374453617968352, 0.9830308800891736, 0.8317179210428074,
 0.9961672133641973, 0.0, 0.0,
 0.0, 0.7925811080710169, 0.909238787663484,
 0.9529153979813423, 0.7615303747676634, 0.9916488411455211,
 0.9565014890519766, 0.8675413259697287]
```

```python
>>> assessMovie('the-aristocrats', scoredMovies, weights)
73.20726258692754
>>> assessMovie('the-bridesmaid', scoredMovies, weights)
72.84280005703968
>>> assessMovie('pi', scoredMovies, weights)
74.84860303913176
>>> assessMovie('the-artist', scoredMovies, weights)
90.92750278128389
```
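For readers on Python 3 (where `reduce` moved to `functools`), the assessment step boils down to a weighted average over the critics that actually reviewed the film. A compact sketch on made-up toy data (`assess_movie` and the toy lists are my own illustrations, not the data from the post):

```python
def assess_movie(tag, scored_movies, weights):
    # weighted mean of the scores, skipping critics who never saw the film
    pairs = [(scores[tag], w)
             for scores, w in zip(scored_movies, weights) if tag in scores]
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

# toy data: only two of the three critics reviewed 'pi'
toy_scores = [{'pi': 80}, {'pi': 60}, {'juno': 50}]
toy_weights = [1.0, 0.5, 0.9]
print(assess_movie('pi', toy_scores, toy_weights))  # (80*1 + 60*0.5) / 1.5
```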

Compare these weighted averages with those shown on www.metacritic.com to realize the power of this scheme. And it only gets stronger the more movies I score and include in my training dictionary! But this leads back to the same question that I posed in my previous post, since I cannot figure out how Mathematics can help me choose a minimal set of movies that will guarantee the success of any of the proposed algorithms. What do you think?

In a series of follow-up posts, I will compare the results of this method with those of other data mining schemes. Ideally, I would like to devote a post to each method. This is a great way to illustrate the ideas behind the beautiful field of pattern recognition.

### Script to retrieve critics data

Apologies for the poorly commented (and written!) script. I will probably update it to something more readable and stylish soon.

```python
import urllib

import numpy

def retrieveInfo(page):
    # First, obtain the source code of the page
    f = urllib.urlopen(page)
    source = f.read()
    f.close()
    return source

# Retrieve from metacritic.com up to 100 popular critics
metacritic = "http://www.metacritic.com/browse/movies/critic/popular?num_items=100"
source = retrieveInfo(metacritic)
critics = []
pages = []
scoredMovies = []
while source.count("<h3><a href=\"/critic/") > 0:
    [a, b, c] = source.partition("<h3><a href=\"/critic/")
    [a, b, c] = c.partition("\">")
    print a
    if a.count("/") > 0:
        [a, b, source] = c.partition("</a>")
    else:
        pages.append("/critic/" + a)
        [a, b, source] = c.partition("</a>")
        critics.append(a)

# For each of them, compute how many movies they have evaluated
numberOfMovies = []
for critic in pages:
    print "\n\n\n***************** "
    print "Gathering information for", critic
    # retrieve the first page of scores for this critic
    criticPage = retrieveInfo(
        "http://metacritic.com" + critic
        + "?filter=movies&num_items=100&sort_options=critic_score")
    # remove the scores of trailers, if any
    [criticPage, b, c] = criticPage.partition("<div class=\"module list_trailers\">")
    [a, b, c] = criticPage.partition("<a href=\"" + critic + "\">")
    [a, b, c] = c.partition("</a>")
    numberOfMovies.append(int(a.replace(',', '')))
    numberOfPagesWithScores = int(numpy.ceil(int(a.replace(',', '')) / 100.0))
    criticMovies = dict()
    print "Movies evaluated:", a.replace(',', '')
    print "This person has ", numberOfPagesWithScores, "pages with scores"
    while c.count("data critscore") > 0:
        [a, b, c] = c.partition("data critscore")
        [a, b, c] = c.partition(">")
        [score, b, c] = c.partition("</span>")
        score = int(score.replace(',', ''))
        [a, b, c] = c.partition("<a href=\"/movie/")
        [movieTitle, b, c] = c.partition("\"")
        criticMovies[movieTitle] = score
        print dict({movieTitle: score})
    if numberOfPagesWithScores > 1:
        for pageNumber in range(1, numberOfPagesWithScores):
            print "\n***"
            print "Page (", pageNumber + 1, "/", numberOfPagesWithScores, ") for", critic
            c = retrieveInfo(
                "http://metacritic.com" + critic
                + "?filter=movies&num_items=100&sort_options=critic_score&page="
                + str(pageNumber))
            [c, b, a] = c.partition("<div class=\"module list_trailers\">")
            while c.count("data critscore") > 0:
                [a, b, c] = c.partition("data critscore")
                [a, b, c] = c.partition(">")
                [score, b, c] = c.partition("</span>")
                score = int(score.replace(',', ''))
                [a, b, c] = c.partition("<a href=\"/movie/")
                [movieTitle, b, c] = c.partition("\"")
                criticMovies[movieTitle] = score
                print dict({movieTitle: score})
    scoredMovies.append(criticMovies)
```
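The core trick in the script is the repeated use of `str.partition` to peel scores and movie slugs out of the raw HTML. Here is an offline illustration of that idiom on a hypothetical page fragment (the real Metacritic markup may differ), which runs on Python 3 as well:

```python
# hypothetical fragment mimicking one score entry on a Metacritic page
snippet = ('<span class="data critscore">88</span>'
           '<a href="/movie/juno">Juno</a>')

def extract_score_and_title(chunk):
    # same idiom as the script above: partition on known delimiters
    _, _, rest = chunk.partition('critscore')
    _, _, rest = rest.partition('>')
    score, _, rest = rest.partition('</span>')
    _, _, rest = rest.partition('<a href="/movie/')
    title, _, _ = rest.partition('"')
    return title, int(score)

print(extract_score_and_title(snippet))  # ('juno', 88)
```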

### Comments
Part of your struggle is that you are not accounting for a lot of other factors which lend themselves to something like a Bayesian analysis or a vector search. For example, if you have young kids at home, the Transformers cartoon movie may be something you watch (rated G) but the Transformers movies don’t get watched (rated PG-13). Or perhaps you normally like certain films, but one of them contains a lot of things that insult your religion.

Those are the kinds of things that are much harder to quantify and identify trends in, because many times there are only rare outliers that throw the whole thing off.

J.Ja