## Math still not the answer

I wrote a quick (but not very elegant) `python` script to download enough data from www.metacritic.com for pattern recognition purposes. The main goal is to help me decide how much I will enjoy a movie before watching it. I included the script at the end of the post, in case you want to try it yourself (and maybe improve it too!). It takes a while to complete, although it is quite entertaining to watch its progress on screen. At the end, it provides two lists of the same length: `critics`, a list of `str` containing the names of the critics; and `scoredMovies`, a list of `dict` containing, at index `k`, the scores of all the movies evaluated by the critic at index `k` in the previous list.

For example:

**>>> critics[43]**

'James White'

**>>> scoredMovies[43]**

{'hall-pass': 60, 'the-karate-kid': 60, 'the-losers': 60,

'the-avengers-2012': 80, 'the-other-guys': 60, 'shrek-forever-after': 80,

'the-lincoln-lawyer': 80, 'the-company-men': 60, 'jonah-hex': 40,

'arthur': 60, 'vampires-suck': 20, 'american-reunion': 40,

'footloose': 60, 'real-steel': 60}

The number of films scored varies from critic to critic: some individuals gave their opinion on only a few dozen movies, while others took the trouble to evaluate up to four thousand flicks! Note also that the names of the movies correspond to their web pages at www.metacritic.com. For example, to see what critics have to say about “The Karate Kid”, along with other relevant information, point your browser to www.metacritic.com/movie/**the-karate-kid**. This also comes in very handy when there are several versions of a single title: *Which “Karate Kid” does this score refer to, the one from the eighties, or Jackie Chan’s?*
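To get a feel for how uneven the coverage is, a minimal sketch like the following ranks critics by the number of movies they scored. It assumes `critics` and `scoredMovies` are the two parallel lists described above; the function name `movies_per_critic` and the three-critic sample data are made up purely for illustration.

```python
# Hypothetical toy data standing in for the real lists.
critics = ["Critic A", "Critic B", "Critic C"]
scoredMovies = [
    {"juno": 90, "pi": 70},
    {"juno": 75},
    {"juno": 88, "pi": 80, "signs": 60},
]

def movies_per_critic(critics, scoredMovies):
    """Return (name, count) pairs, most prolific critic first."""
    counts = [(name, len(scores)) for name, scores in zip(critics, scoredMovies)]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

ranking = movies_per_critic(critics, scoredMovies)
for name, count in ranking:
    print(name, count)
```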

Feel free to download a copy of the resulting data [here] (note it is a large file: 1.6MB).
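Since the scrape takes a while, it may be worth saving the two lists once it finishes. The following is only a sketch using the standard `json` module: the file name `movieData.json` and the functions `save_movie_data` and `load_movie_data` are hypothetical, and this is not necessarily the format of the file linked above.

```python
import json
import os
import tempfile

def save_movie_data(path, critics, scoredMovies):
    """Write both lists to a single JSON file."""
    with open(path, "w") as handle:
        json.dump({"critics": critics, "scoredMovies": scoredMovies}, handle)

def load_movie_data(path):
    """Read the two lists back, in the same order."""
    with open(path) as handle:
        data = json.load(handle)
    return data["critics"], data["scoredMovies"]

# Round trip with a couple of values from the example above.
critics = ["James White"]
scoredMovies = [{"hall-pass": 60, "the-karate-kid": 60}]
path = os.path.join(tempfile.gettempdir(), "movieData.json")  # hypothetical name
save_movie_data(path, critics, scoredMovies)
loaded_critics, loaded_scores = load_movie_data(path)
print(loaded_critics == critics and loaded_scores == scoredMovies)
```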

Having that data stored locally lets us query it with simple `python` commands, and perform many complex operations on it.

Let’s see, for example, which critics on the list scored the movie Juno, and how they liked it:

**>>> [[critics[index], x['juno']] for index, x in enumerate(scoredMovies) if 'juno' in x]**

[['Roger Ebert', 100],

['Peter Travers', 88],

['Joe Morgenstern', 80],

['David Denby', 90],

['A.O. Scott', 90],

['James Berardinelli', 88],

['Michael Phillips', 88],

['Lou Lumenick', 100],

['Todd McCarthy', 80],

['Michael Sragow', 58],

['Claudia Puig', 100],

['J.R. Jones', 70],

['Scott Tobias', 83],

['Kirk Honeycutt', 80],

['Ty Burr', 0],

['Stephanie Zacharek', 80],

['Lawrence Toppman', 75],

['Richard Schickel', 80],

['Desson Thomson', 90],

['Liam Lacey', 75],

['Carrie Rickey', 88],

['Maitland McDonagh', 75]]

This is a particularly interesting example. Juno is one of those movies that I watch once and that stays on my mind for a long time: the topic leads to rich discussions; the characters are deep although stereotypical; I resided in that part of the country, and it is always enjoyable to see the old sights again and take that trip down Memory Lane. For these and many other reasons, I would grant the flick a solid B+ (85 on *my* scale). Note that no critic in this sample scored it exactly like me, although quite a few were close (those that gave it 83 or 88, for instance).

This is the starting point for my algorithm. I will assign a *weight* to each critic after each movie scored by me. For example, critics not in the previous list will get a weight of zero (since they did not even see the film). Critics in this list that gave Juno 83 points will have a weight close to 1, and the larger the difference between a critic’s score of Juno and mine, the smaller the weight. By picking movie after movie and scoring them, I **update** the weights for each critic (maybe by averaging all the positive values together, although there are better methods).
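The averaging update can be sketched as a running mean: each critic keeps a current weight and a count of the movies we have in common, and every new per-movie weight is folded into that average. The function name `update_weight` and the sample values below are hypothetical, chosen only to illustrate the bookkeeping.

```python
def update_weight(current_weight, movies_counted, new_weight):
    """Fold one more per-movie weight into a critic's running average.

    Returns the updated (weight, movies_counted) pair.
    """
    total = current_weight * movies_counted + new_weight
    movies_counted += 1
    return total / movies_counted, movies_counted

# A critic starts at weight 0 with no movies in common with me.
weight, counted = 0.0, 0
for per_movie_weight in [1.0, 0.5, 0.75]:  # made-up per-movie weights
    weight, counted = update_weight(weight, counted, per_movie_weight)

print(weight, counted)
```

A critic who never shares a movie with me keeps the initial weight of zero, which matches the convention described above.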

As for the individual weights for critics that evaluated Juno, we could follow this formula:

$$w = \exp\left(-\frac{(85 - s)^2}{5211}\right),$$

where $s$ is the score that the critic gave Juno. The number 85 is of course the score I gave Juno: everybody gets their weight according to the difference between their evaluation and mine. The number 5211 is approximately $85^2/\ln 4$, which is what you need to grant a smallest weight of 25% to any critic that watched the same movie. You can play with these parameters, of course.
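As a quick check of how the weights behave, here is a small sketch that evaluates the expression used in the weight computation of this post, exp(-ln 4 · (mine - theirs)² / mine²), for a few of the Juno scores listed earlier. The function name `critic_weight` is an assumption made for this illustration.

```python
import math

def critic_weight(my_score, critic_score):
    """Weight exp(-ln(4) * (mine - theirs)^2 / mine^2) for one movie."""
    diff = my_score - critic_score
    return math.exp(-math.log(4) * diff ** 2 / float(my_score ** 2))

my_juno = 85
for critic_score in (83, 88, 100, 0):
    print(critic_score, round(critic_weight(my_juno, critic_score), 4))
```

A critic who agrees with me exactly gets weight 1, and a critic who gave Juno a 0 (the largest possible disagreement with my 85) gets weight 0.25. Note that the expression is undefined when my own score is 0.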

I can now put into practice the algorithm that I proposed in my previous post. Let me start by picking a small random set of movies that I have seen, starting of course with The Blair Witch Project. I use that training set to compute the weight for each critic, and then use those weights to assess different movies (some that I have seen, some that I haven’t):

    mymovies = dict({
        'oldboy': 85, 'juno': 85, 'vicky-cristina-barcelona': 85,
        'pans-labyrinth': 95,
        'indiana-jones-and-the-kingdom-of-the-crystal-skull': 85,
        'planet-of-the-apes': 90, '8-women': 50, 'being-john-malkovich': 90,
        'pulp-fiction': 95, 'munich': 95, 'district-9': 85, 'frequency': 75,
        'the-adventures-of-tintin': 90, 'sleepy-hollow': 80, 'signs': 95,
        'spider-man-3': 50, 'space-cowboys': 60, 'this-is-spinal-tap': 70,
        'spanglish': 70, 'proof': 70, 'the-blair-witch-project': 40})

I can now compute the weights for all the critics by averaging the non-zero values of the basic weights I indicated above:

    import numpy

    # movieData is assumed to be the module holding the critics and
    # scoredMovies lists produced by the script at the end of the post.
    nonzeros = numpy.zeros(len(movieData.critics))
    weights = numpy.zeros(len(movieData.critics))

    # One [critic index, basic weight] pair per movie scored by both of us
    weighthelper = [[index, numpy.exp((-1) * numpy.log(4) * (mymovies[x] - yourmovies[x])**2 / float(mymovies[x]**2))]
                    for index, yourmovies in enumerate(movieData.scoredMovies)
                    for x in mymovies if x in yourmovies]

    # Running average of the basic weights of each critic
    for datum in weighthelper:
        weights[datum[0]] *= nonzeros[datum[0]]
        weights[datum[0]] += datum[1]
        nonzeros[datum[0]] += 1
        weights[datum[0]] /= nonzeros[datum[0]]
    weights = weights.tolist()

    def assessMovie(movieTag, scoredMovies, weights):
        # Weighted average of the scores of the critics that evaluated movieTag
        helper = [scoreOf[movieTag] * weights[index]
                  for index, scoreOf in enumerate(scoredMovies) if movieTag in scoreOf]
        relevantWeights = [weights[index]
                           for index, scoreOf in enumerate(scoredMovies) if movieTag in scoreOf]
        return sum(helper) / float(numpy.sum(relevantWeights))
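To see the whole pipeline on something small enough to check by hand, here is a self-contained sketch of the same idea on made-up data. The names `compute_weights` and `assess`, the three toy critics, and the choice to return `None` when no weighted critic has scored the movie are all assumptions of this sketch, not part of the code above.

```python
import math

def compute_weights(mymovies, scoredMovies):
    """Average, per critic, the basic weights over the movies we share."""
    weights = []
    for theirs in scoredMovies:
        shared = [m for m in mymovies if m in theirs]
        if not shared:
            weights.append(0.0)  # no movie in common: weight zero
            continue
        per_movie = [
            math.exp(-math.log(4) * (mymovies[m] - theirs[m]) ** 2
                     / float(mymovies[m] ** 2))
            for m in shared
        ]
        weights.append(sum(per_movie) / len(per_movie))
    return weights

def assess(movieTag, scoredMovies, weights):
    """Weighted average of the scores given to movieTag, or None."""
    pairs = [(s[movieTag], w) for s, w in zip(scoredMovies, weights)
             if movieTag in s]
    total = sum(w for _, w in pairs)
    if total == 0:
        return None  # nobody with a positive weight scored this movie
    return sum(score * w for score, w in pairs) / total

# Toy data: one critic agrees with me on Juno, one disagrees completely,
# and one never scored Juno at all.
mymovies = {"juno": 85}
scoredMovies = [
    {"juno": 85, "pi": 90},
    {"juno": 0, "pi": 50},
    {"pi": 10},
]
weights = compute_weights(mymovies, scoredMovies)
estimate = assess("pi", scoredMovies, weights)
plain_mean = sum(s["pi"] for s in scoredMovies) / float(len(scoredMovies))
print(weights, estimate, plain_mean)
```

In this toy case the estimate is pulled strongly toward the critic who agreed with me, while the plain mean treats all three critics alike.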

Let’s test it:

**>>> weights**

[0.9316894375373113, 0.9440456788129853, 0.8276943837371913,

0.8354699447724799, 0.9096233757112504, 0.9131781397354621,

0.0, 0.792063820900018, 0.0,

0.7905341294370402, 0.8911293907698545, 0.9591397211822409,

0.9122390115047787, 0.9094164789352943, 0.0,

0.8743220704979506, 0.8091610605956614, 0.9830308800891736,

0.9377611752511361, 0.0, 0.8588275926217391,

0.8881609250607919, 0.9503520095147542, 0.0,

0.9806218435940762, 0.8326500757874696, 0.8389774746326207,

0.8743683771430136, 0.8894078063468858, 0.0,

0.0, 0.7329668437056417, 0.916911984206937,

0.41179550863378656, 0.0, 0.9830308800891736,

0.0, 0.0, 0.8543417500931425,

0.7472733165814486, 0.959985337534043, 0.25,

0.0, 0.0, 0.9153304682482231,

0.9213229176351494, 0.7186166247610994, 0.9638336270354847,

0.5553963593823906, 0.9938576336658272, 0.9719197080733987,

0.7084015185939553, 0.9291760858845585, 0.8504427376265632,

0.9554366051158306, 0.9195026707321936, 0.7071067811865476,

0.9047116150472949, 0.9123122074919429, 0.0,

0.9374115172355054, 0.0, 0.0,

0.8695144502247731, 0.8896854138078707, 0.25,

0.817259795695882, 0.0, 1.0,

0.8778357293248331, 0.6359295154878736, 0.9247356530790823,

0.8374453617968352, 0.9830308800891736, 0.8317179210428074,

0.9961672133641973, 0.0, 0.0,

0.0, 0.7925811080710169, 0.909238787663484,

0.9529153979813423, 0.7615303747676634, 0.9916488411455211,

0.9565014890519766, 0.8675413259697287]

**>>> assessMovie('the-aristocrats', scoredMovies, weights)**

73.20726258692754

**>>> assessMovie('the-bridesmaid', scoredMovies, weights)**

72.84280005703968

**>>> assessMovie('pi', scoredMovies, weights)**

74.84860303913176

**>>> assessMovie('the-artist', scoredMovies, weights)**

90.92750278128389

Compare these weighted averages with those indicated in www.metacritic.com, to realize the power of this scheme. And it only gets stronger the more movies I score and include in my training dictionary! But this leads again to the same question that I posed in my previous post, since I cannot figure out how Mathematics will help me choose a minimal set of movies that will guarantee success of any of the proposed algorithms. What do you think?

In a series of follow-up posts, I will compare the results of this method with other data mining schemes. Ideally, I would like to devote a post to each different method. This is a great way of illustrating the ideas behind the beautiful field of pattern recognition.

### Script to retrieve critics data

Apologies for the poorly commented (and written!) script. I will probably update it to something more readable and stylish soon.

    import numpy
    from urllib.request import urlopen

    def retrieveInfo(page):
        # Obtain the source code of the page as text
        f = urlopen(page)
        source = f.read().decode("utf-8", "replace")
        f.close()
        return source

    def parseScores(source, criticMovies):
        # Store in criticMovies every (movie tag, score) pair found in source
        c = source
        while c.count("data critscore") > 0:
            [a, b, c] = c.partition("data critscore")
            [a, b, c] = c.partition(">")
            [score, b, c] = c.partition("</span>")
            score = int(score.replace(',', ''))
            [a, b, c] = c.partition("<a href=\"/movie/")
            [movieTitle, b, c] = c.partition("\"")
            criticMovies[movieTitle] = score
            print({movieTitle: score})

    # Retrieve from metacritic.com up to 100 popular critics
    metacritic = "http://www.metacritic.com/browse/movies/critic/popular?num_items=100"
    source = retrieveInfo(metacritic)
    critics = []
    pages = []
    scoredMovies = []
    while source.count("<h3><a href=\"/critic/") > 0:
        [a, b, c] = source.partition("<h3><a href=\"/critic/")
        [a, b, c] = c.partition("\">")
        print(a)
        if a.count("/") > 0:
            # Not a critic page: skip this link
            [a, b, source] = c.partition("</a>")
        else:
            pages.append("/critic/" + a)
            [a, b, source] = c.partition("</a>")
            critics.append(a)

    # For each of them, compute how many movies they have evaluated
    numberOfMovies = []
    for critic in pages:
        print("\n\n\n***************** ")
        print("Gathering information for", critic)
        # Retrieve the first page of scores for this critic
        criticPage = retrieveInfo(
            "http://metacritic.com" + critic +
            "?filter=movies&num_items=100&sort_options=critic_score")
        # Remove the scores of trailers, if any
        [criticPage, b, c] = criticPage.partition("<div class=\"module list_trailers\">")
        [a, b, c] = criticPage.partition("<a href=\"" + critic + "\">")
        [a, b, c] = c.partition("</a>")
        numberOfMovies.append(int(a.replace(',', '')))
        numberOfPagesWithScores = int(numpy.ceil(int(a.replace(',', '')) / 100.0))
        criticMovies = dict()
        print("Movies evaluated:", a.replace(',', ''))
        print("This person has", numberOfPagesWithScores, "pages with scores")
        parseScores(c, criticMovies)
        # Retrieve the remaining pages of scores, if any
        for pageNumber in range(1, numberOfPagesWithScores):
            print("\n***")
            print("Page (", pageNumber + 1, "/", numberOfPagesWithScores, ") for", critic)
            c = retrieveInfo(
                "http://metacritic.com" + critic +
                "?filter=movies&num_items=100&sort_options=critic_score&page=" +
                str(pageNumber))
            [c, b, a] = c.partition("<div class=\"module list_trailers\">")
            parseScores(c, criticMovies)
        scoredMovies.append(criticMovies)


Part of your struggle is that you are not accounting for a lot of other factors which lend themselves to something like a Bayesian analysis or a vector search. For example, if you have young kids at home, the Transformers cartoon movie may be something you watch (rated G) but the Transformers movies don’t get watched (rated PG-13). Or perhaps you normally like certain films, but one of them contains a lot of things that insult your religion.

Those are the kinds of things that are much harder to quantify and identify trends in, because many times there are only rare outliers that throw the whole thing off.

J.Ja