## Math still not the answer

I wrote a quick (but not very elegant) Python script to retrieve enough data locally from www.metacritic.com for pattern-recognition purposes. The main goal is to help me decide how much I will enjoy a movie before watching it. I included the script at the end of the post, in case you want to try it yourself (and maybe improve it too!). It takes a while to complete, although it is quite entertaining to watch its progress on screen. At the end, it provides two lists of the same length: `critics`, a list of `str` containing the names of the critics; and `scoredMovies`, a list of `dict` containing, at index `k`, the evaluation of all the movies scored by the critic at index `k` in the first list.
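Since the two lists are parallel, looking up a critic's scores by name is a one-liner. Here is a minimal sketch, with made-up toy data standing in for the scraped lists:

```python
# Toy stand-ins for the two parallel lists produced by the script
critics = ['James White', 'Roger Ebert']
scoredMovies = [{'juno': 88}, {'juno': 100}]

def scores_for(name):
    # Find the critic's index in `critics` and return the matching dict
    return scoredMovies[critics.index(name)]

print(scores_for('Roger Ebert'))  # {'juno': 100}
```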

For example:

```
>>> critics[43]
'James White'
>>> scoredMovies[43]
{'hall-pass': 60, 'the-karate-kid': 60, 'the-losers': 60,
 'the-avengers-2012': 80, 'the-other-guys': 60, 'shrek-forever-after': 80,
 'the-lincoln-lawyer': 80, 'the-company-men': 60, 'jonah-hex': 40,
 'arthur': 60, 'vampires-suck': 20, 'american-reunion': 40,
 'footloose': 60, 'real-steel': 60}
```

The number of scored films per critic varies: some individuals gave their opinion on only a few dozen movies, while others took the trouble to evaluate up to four thousand flicks! Note also that the names of the movies correspond to their pages on www.metacritic.com. For example, to see what critics have to say about the “Karate Kid” and other relevant information online, point your browser to www.metacritic.com/movie/**the-karate-kid**. This also comes in very handy when there are several versions of a single title: *Which “Karate Kid” does this score refer to, the one from the eighties, or Jackie Chan’s?*

Feel free to download a copy of the resulting data [here] (note it is a large file: 1.6MB).

Having that data stored locally allows us to query it with simple Python commands, and to perform fairly complex operations on it.

Let’s see, for example, which critics on the list scored the movie *Juno*, and how they liked it:

```
>>> [[critics[index], x['juno']] for index, x in enumerate(scoredMovies)
     if 'juno' in x]
[['Roger Ebert', 100],
 ['Peter Travers', 88],
 ['Joe Morgenstern', 80],
 ['David Denby', 90],
 ['A.O. Scott', 90],
 ['James Berardinelli', 88],
 ['Michael Phillips', 88],
 ['Lou Lumenick', 100],
 ['Todd McCarthy', 80],
 ['Michael Sragow', 58],
 ['Claudia Puig', 100],
 ['J.R. Jones', 70],
 ['Scott Tobias', 83],
 ['Kirk Honeycutt', 80],
 ['Ty Burr', 0],
 ['Stephanie Zacharek', 80],
 ['Lawrence Toppman', 75],
 ['Richard Schickel', 80],
 ['Desson Thomson', 90],
 ['Liam Lacey', 75],
 ['Carrie Rickey', 88],
 ['Maitland McDonagh', 75]]
```

This is a particularly interesting example. *Juno* is one of those movies that I watch once and that stay on my mind for a long time: the topic leads to rich discussions; the characters are deep, although stereotypical; I resided in that part of the country, and it is always enjoyable to see the old sights again and take that trip down memory lane. For these and many other reasons, I would grant the flick a solid B+ (85 on *my* scale). Note that no critic in this sample scored it exactly like me, although quite a few came close (those that gave it 83 or 88, for instance).

This is the starting point for my algorithm. I will assign a *weight* to each critic after each movie I score. For example, critics not in the previous list will get a weight of zero (since they did not even see the film). Critics in this list that gave *Juno* 83 points will have a weight close to 1, and the larger the difference between a critic’s score for *Juno* and mine, the smaller the weight. By picking movie after movie and scoring them, I **update** the weights for each critic (maybe by averaging all the positive values together, although there are better methods).
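The averaging can be done incrementally, one movie at a time, without keeping all past per-movie weights around. A minimal sketch (the per-movie weights below are made up for illustration):

```python
def update_weight(current_avg, count, new_weight):
    # Running average: fold one new per-movie weight into a critic's
    # current average of `count` previous weights.
    return (current_avg * count + new_weight) / (count + 1.0), count + 1

avg, n = 0.0, 0
for w in [1.0, 0.5, 0.75]:  # hypothetical per-movie weights for one critic
    avg, n = update_weight(avg, n, w)
print(avg)  # 0.75
```

This is exactly the update performed in the loop over `weighthelper` further down.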

As for the individual weights for critics that evaluated *Juno*, we could follow this formula:

weight = exp( −ln(4) · (85 − score)² / 85² ) ≈ exp( −(85 − score)² / 5211 )

The number 85 is of course the score I gave *Juno*: everybody gets their weight according to the difference between their evaluation and mine. The number 5211 is approximately 85²/ln(4), which is what you need to grant a smallest weight of 25% to any critic that watched the same movie. You can play with these parameters, of course.

I can now put into practice the algorithm that I proposed in my previous post. Let me start by picking a small random set of movies that I have seen, starting of course with *The Blair Witch Project*. I use that training set to compute the weight of each critic, and use those weights to assess different movies (some that I have seen, some that I haven’t):

```python
mymovies = dict({'oldboy': 85, 'juno': 85, 'vicky-cristina-barcelona': 85,
                 'pans-labyrinth': 95,
                 'indiana-jones-and-the-kingdom-of-the-crystal-skull': 85,
                 'planet-of-the-apes': 90, '8-women': 50,
                 'being-john-malkovich': 90, 'pulp-fiction': 95, 'munich': 95,
                 'district-9': 85, 'frequency': 75,
                 'the-adventures-of-tintin': 90, 'sleepy-hollow': 80,
                 'signs': 95, 'spider-man-3': 50, 'space-cowboys': 60,
                 'this-is-spinal-tap': 70, 'spanglish': 70, 'proof': 70,
                 'the-blair-witch-project': 40})
```

I can now compute the weights for all the critics by averaging the non-zero values of the basic weights indicated above:

```python
nonzeros = numpy.zeros(len(movieData.critics))
weights = numpy.zeros(len(movieData.critics))

weighthelper = [[index, numpy.exp((-1) * numpy.log(4)
                                  * (mymovies[x] - yourmovies[x])**2
                                  / float(mymovies[x]**2))]
                for index, yourmovies in enumerate(movieData.scoredMovies)
                for x in mymovies if x in yourmovies]

# Running average of the per-movie weights for each critic
for datum in weighthelper:
    weights[datum[0]] *= nonzeros[datum[0]]
    weights[datum[0]] += datum[1]
    nonzeros[datum[0]] += 1
    weights[datum[0]] /= nonzeros[datum[0]]

weights = weights.tolist()

def assessMovie(movieTag, scoredMovies, weights):
    helper = [scoreOf[movieTag] * weights[index]
              for index, scoreOf in enumerate(scoredMovies)
              if movieTag in scoreOf]
    relevantWeights = [weights[index]
                       for index, scoreOf in enumerate(scoredMovies)
                       if movieTag in scoreOf]
    return reduce(lambda x, y: x + y, helper) / float(numpy.sum(relevantWeights))
```

Let’s test it:

```
>>> weights
[0.9316894375373113, 0.9440456788129853, 0.8276943837371913,
 0.8354699447724799, 0.9096233757112504, 0.9131781397354621,
 0.0, 0.792063820900018, 0.0,
 0.7905341294370402, 0.8911293907698545, 0.9591397211822409,
 0.9122390115047787, 0.9094164789352943, 0.0,
 0.8743220704979506, 0.8091610605956614, 0.9830308800891736,
 0.9377611752511361, 0.0, 0.8588275926217391,
 0.8881609250607919, 0.9503520095147542, 0.0,
 0.9806218435940762, 0.8326500757874696, 0.8389774746326207,
 0.8743683771430136, 0.8894078063468858, 0.0,
 0.0, 0.7329668437056417, 0.916911984206937,
 0.41179550863378656, 0.0, 0.9830308800891736,
 0.0, 0.0, 0.8543417500931425,
 0.7472733165814486, 0.959985337534043, 0.25,
 0.0, 0.0, 0.9153304682482231,
 0.9213229176351494, 0.7186166247610994, 0.9638336270354847,
 0.5553963593823906, 0.9938576336658272, 0.9719197080733987,
 0.7084015185939553, 0.9291760858845585, 0.8504427376265632,
 0.9554366051158306, 0.9195026707321936, 0.7071067811865476,
 0.9047116150472949, 0.9123122074919429, 0.0,
 0.9374115172355054, 0.0, 0.0,
 0.8695144502247731, 0.8896854138078707, 0.25,
 0.817259795695882, 0.0, 1.0,
 0.8778357293248331, 0.6359295154878736, 0.9247356530790823,
 0.8374453617968352, 0.9830308800891736, 0.8317179210428074,
 0.9961672133641973, 0.0, 0.0,
 0.0, 0.7925811080710169, 0.909238787663484,
 0.9529153979813423, 0.7615303747676634, 0.9916488411455211,
 0.9565014890519766, 0.8675413259697287]
```

```
>>> assessMovie('the-aristocrats', scoredMovies, weights)
73.20726258692754
>>> assessMovie('the-bridesmaid', scoredMovies, weights)
72.84280005703968
>>> assessMovie('pi', scoredMovies, weights)
74.84860303913176
>>> assessMovie('the-artist', scoredMovies, weights)
90.92750278128389
```

Compare these weighted averages with those indicated on www.metacritic.com to appreciate the power of this scheme. And it only gets stronger the more movies I score and include in my training dictionary! But this leads again to the same question that I posed in my previous post, since I cannot figure out how Mathematics will help me choose a minimal set of movies that will guarantee the success of any of the proposed algorithms. What do you think?
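For a quick sanity check, the unweighted mean over the same data (close in spirit to a plain site-wide average) is easy to compute as well. A sketch, with made-up toy scores:

```python
def plain_average(movieTag, scoredMovies):
    # Mean critic score for a movie, ignoring the personalized weights
    scores = [s[movieTag] for s in scoredMovies if movieTag in s]
    return sum(scores) / float(len(scores))

toy = [{'pi': 80}, {'pi': 60}, {'juno': 100}]  # hypothetical scores
print(plain_average('pi', toy))  # 70.0
```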

In a series of follow-up posts, I will compare the results of this method with other data mining schemes. Ideally, I would like to devote a post to each different method. This is a great way of illustrating the ideas behind the beautiful field of pattern recognition.

### Script to retrieve critics data

Apologies for the poorly commented (and written!) script. I will probably update it to something more readable and stylish soon.

```python
# Note: this script was written for Python 2 (urllib.urlopen, print statements).
import urllib, numpy

def retrieveInfo(page):
    # First, obtain the source code of the page
    f = urllib.urlopen(page)
    source = f.read()
    f.close()
    return source

# Retrieve from metacritic.com up to 100 popular critics
metacritic = "http://www.metacritic.com/browse/movies/critic/popular?num_items=100"
source = retrieveInfo(metacritic)

critics = []
pages = []
scoredMovies = []

while source.count("<h3><a href=\"/critic/") > 0:
    [a, b, c] = source.partition("<h3><a href=\"/critic/")
    [a, b, c] = c.partition("\">")
    print a
    if a.count("/") > 0:
        # skip entries whose slug contains an extra slash
        [a, b, source] = c.partition("</a>")
    else:
        pages.append("/critic/" + a)
        [a, b, source] = c.partition("</a>")
        critics.append(a)

# For each of them, compute how many movies they have evaluated
numberOfMovies = []
for critic in pages:
    print "\n\n\n***************** "
    print "Gathering information for", critic
    # retrieve the first page of scores for this critic
    criticPage = retrieveInfo("http://metacritic.com" + critic
                              + "?filter=movies&num_items=100&sort_options=critic_score")
    # remove the scores of trailers, if any
    [criticPage, b, c] = criticPage.partition("<div class=\"module list_trailers\">")
    [a, b, c] = criticPage.partition("<a href=\"" + critic + "\">")
    [a, b, c] = c.partition("</a>")
    numberOfMovies.append(int(a.replace(',', '')))
    numberOfPagesWithScores = int(numpy.ceil(int(a.replace(',', '')) / 100.0))
    criticMovies = dict()
    print "Movies evaluated:", a.replace(',', '')
    print "This person has", numberOfPagesWithScores, "pages with scores"
    while c.count("data critscore") > 0:
        [a, b, c] = c.partition("data critscore")
        [a, b, c] = c.partition(">")
        [score, b, c] = c.partition("</span>")
        score = int(score.replace(',', ''))
        [a, b, c] = c.partition("<a href=\"/movie/")
        [movieTitle, b, c] = c.partition("\"")
        criticMovies[movieTitle] = score
        print dict({movieTitle: score})
    if numberOfPagesWithScores > 1:
        for pageNumber in range(1, numberOfPagesWithScores):
            print "\n***"
            print "Page (", pageNumber + 1, "/", numberOfPagesWithScores, ") for", critic
            c = retrieveInfo("http://metacritic.com" + critic
                             + "?filter=movies&num_items=100&sort_options=critic_score&page="
                             + str(pageNumber))
            [c, b, a] = c.partition("<div class=\"module list_trailers\">")
            while c.count("data critscore") > 0:
                [a, b, c] = c.partition("data critscore")
                [a, b, c] = c.partition(">")
                [score, b, c] = c.partition("</span>")
                score = int(score.replace(',', ''))
                [a, b, c] = c.partition("<a href=\"/movie/")
                [movieTitle, b, c] = c.partition("\"")
                criticMovies[movieTitle] = score
                print dict({movieTitle: score})
    scoredMovies.append(criticMovies)
```
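The repeated `partition` pattern in the script could be isolated into a small helper, which would make the scraper shorter and easier to maintain. A sketch (the HTML fragment here is made up for illustration):

```python
def extract_between(text, start, end):
    # Return the substring between the first occurrence of `start` and the
    # following `end`, plus the remainder of the text after `end`.
    _, _, rest = text.partition(start)
    value, _, rest = rest.partition(end)
    return value, rest

html = '<span class="data critscore">88</span> <a href="/movie/juno">Juno</a>'
score, rest = extract_between(html, 'critscore">', '</span>')
title, rest = extract_between(rest, '<a href="/movie/', '"')
print(score, title)  # 88 juno
```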

### Comments

Part of your struggle is that you are not accounting for a lot of other factors which lend themselves to something like a Bayesian analysis or a vector search. For example, if you have young kids at home, the Transformers cartoon movie may be something you watch (rated G), but the Transformers movies don’t get watched (rated PG-13). Or perhaps you normally like certain films, but one of them contains a lot of things that insult your religion.

Those are the kinds of things that are much harder to quantify and identify trends in, because many times there are only rare outliers that throw the whole thing off.

J.Ja