data mining | Francisco Blanco-Silva

Which one is the fake?

December 7, 2012 5 comments


“Crab on its back”	“Willows at sunset”	“Still life: Potatoes in a yellow dish”

Categories: Approximation Theory, Data mining, Imaging, Probability, puzzles, Scientific Computing, Statistics Tags: approximation theory, art authentication, curvelets, data mining, Daubechies, scientific computing, shearlets, statistics, Van Gogh, wavelets

There is nothing naïve about Naïve Bayes, a very basic but extremely efficient data mining method for making decisions when a vast amount of data is available. The name comes from the fact that it is the simplest way to approach this problem, under the (naïve) assumption that the pieces of evidence are independent of one another given the hypothesis. It is based on Bayes’ rule of conditional probability: If you have a hypothesis $H$ and evidence $E$ that bears on that hypothesis, then

$\mathrm{Pr} \big( H \lvert E \big) = \displaystyle{ \frac{\mathrm{Pr} \big( E \lvert H\big) \mathrm{Pr}(H)}{\mathrm{Pr}(E)} }$

where, as usual, $\mathrm{Pr}(A)$ denotes the probability of the event $A,$ and $\mathrm{Pr}\big( A \lvert B \big)$ denotes the probability of the event $A$ conditional on another event $B.$
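As a quick numerical illustration of the rule above, here is a minimal sketch in Python. The function name `posterior` and all the probabilities are made up for the example; the denominator $\mathrm{Pr}(E)$ is expanded with the law of total probability over $H$ and its complement.

```python
def posterior(prior_h, likelihood_e_given_h, likelihood_e_given_not_h):
    """Return Pr(H | E) from Bayes' rule.

    Pr(E) is computed as Pr(E|H) Pr(H) + Pr(E|not H) Pr(not H).
    """
    evidence = (likelihood_e_given_h * prior_h
                + likelihood_e_given_not_h * (1.0 - prior_h))
    return likelihood_e_given_h * prior_h / evidence


# Hypothetical numbers, for illustration only:
#   H = "I will like the movie", E = "a given critic graded it A".
p = posterior(prior_h=0.5,
              likelihood_e_given_h=0.6,
              likelihood_e_given_not_h=0.2)
print(round(p, 3))  # 0.75
```

With these assumed values the evidence raises the probability of the hypothesis from 0.5 to 0.75.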

I would like to show an example of this technique with, of course, yet another decision-making algorithm that tries to guess my reaction to a movie I have not seen before. From the data obtained in a previous post, I create a simpler table with only those movies that have been scored more than 28 times (by a pool of 87 of the most popular critics featured in www.metacritic.com). [I posted the script to create that table at the end of the post.]

Let’s test it:

>>> table=prepTable(scoredMovies,28)
>>> len(table)
49

>>> [entry[0] for entry in table]
['rabbit-hole', 'carnage-2011', 'star-wars-episode-iii---revenge-of-the-sith',
 'shame', 'brokeback-mountain', 'drive', 'sideways', 'salt',
 'million-dollar-baby', 'a-separation', 'dark-shadows',
 'the-lord-of-the-rings-the-return-of-the-king', 'true-grit', 'inception',
 'hereafter', 'master-and-commander-the-far-side-of-the-world', 'batman-begins',
 'harry-potter-and-the-deathly-hallows-part-2', 'the-artist', 'the-fighter',
 'larry-crowne', 'the-hunger-games', 'the-descendants', 'midnight-in-paris',
 'moneyball', '8-mile', 'the-departed', 'war-horse',
 'the-lord-of-the-rings-the-fellowship-of-the-ring', 'j-edgar',
 'the-kings-speech', 'super-8', 'robin-hood', 'american-splendor', 'hugo',
 'eternal-sunshine-of-the-spotless-mind', 'the-lovely-bones', 'the-tree-of-life',
 'the-pianist', 'the-ides-of-march', 'the-quiet-american', 'alexander',
 'lost-in-translation', 'seabiscuit', 'catch-me-if-you-can', 'the-avengers-2012',
 'the-social-network', 'closer', 'the-girl-with-the-dragon-tattoo-2011']

>>> table[0]
['rabbit-hole', '', 'B+', 'B', '', 'C', 'C+', '', 'F', 'B+', 'F', 'C', 'F', 'D',
 '', '', 'A', '', '', '', '', 'B+', 'C+', '', '', '', '', '', '', 'C+', '', '',
 '', '', '', '', 'A', '', '', '', '', '', 'A', '', '', 'B+', 'B+', 'B', '', '',
 '', 'D', 'B+', '', '', 'C+', '', '', '', '', '', '', 'B+', '', '', '', '', '',
 '', 'A', '', '', '', '', '', '', '', 'D', '', '', 'C+', 'A', '', '', '', 'C+', '']
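To make the idea concrete, here is a rough sketch of how a Naïve Bayes classifier could work on rows shaped like the one above (movie name first, then one letter grade per critic, with an empty string where a critic gave no score). This is not the script from the post: the function names `train` and `predict`, the 'like'/'dislike' labels, the Laplace smoothing and the toy rows are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Grade labels that appear in the row shown above.
GRADES = ['A', 'B+', 'B', 'C+', 'C', 'D', 'F']


def train(rows, labels):
    """Count class frequencies and, per class and critic, grade frequencies."""
    class_counts = Counter(labels)
    # grade_counts[label][critic_index][grade] -> number of occurrences
    grade_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for critic, grade in enumerate(row[1:]):  # row[0] is the movie name
            if grade:                             # '' means "not scored"
                grade_counts[label][critic][grade] += 1
    return class_counts, grade_counts


def predict(row, class_counts, grade_counts):
    """Return the label with the highest log-posterior, assuming the
    critics' grades are independent given the label."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)           # log Pr(H)
        for critic, grade in enumerate(row[1:]):
            if not grade:
                continue                          # skip missing evidence
            seen = grade_counts[label][critic]
            # Laplace smoothing (an assumption of this sketch), so that an
            # unseen grade does not force the whole product to zero.
            score += math.log((seen[grade] + 1)
                              / (sum(seen.values()) + len(GRADES)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data with three hypothetical critics; the labels are my own made-up verdicts.
rows = [
    ['movie-1', 'A', 'B+', ''],
    ['movie-2', 'A', '', 'B'],
    ['movie-3', 'D', 'F', 'C'],
    ['movie-4', 'F', 'D', ''],
]
labels = ['like', 'like', 'dislike', 'dislike']

class_counts, grade_counts = train(rows, labels)
print(predict(['new-movie', 'A', 'B+', ''], class_counts, grade_counts))
```

Working with sums of logarithms instead of products of probabilities is a common choice here, since multiplying dozens of small numbers (one per critic) can underflow.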

Math still not the answer

May 16, 2012 1 comment

I wrote a quick (but not very elegant) Python script to retrieve locally enough data from www.metacritic.com for pattern recognition purposes. The main goal is to help me decide how much I will enjoy a movie before watching it. I included the script at the end of the post, in case you want to try it yourself (and maybe improve it too!). It takes a while to complete, although it is quite entertaining to watch its progress on screen. At the end, it provides two lists of the same length: critics, a list of str containing the names of the critics; and scoredMovies, a list of dict containing, at index k, the scores of all the movies evaluated by the critic at index k in the previous list.

For example:

>>> critics[43]
'James White'

>>> scoredMovies[43]
{'hall-pass': 60,          'the-karate-kid': 60,  'the-losers': 60,
 'the-avengers-2012': 80,  'the-other-guys': 60,  'shrek-forever-after': 80,
 'the-lincoln-lawyer': 80, 'the-company-men': 60, 'jonah-hex': 40,
 'arthur': 60,             'vampires-suck': 20,   'american-reunion': 40,
 'footloose': 60,          'real-steel': 60}

The number of films scored varies by critic: some individuals gave their opinion on a few dozen movies, while others took the trouble to evaluate up to four thousand flicks! Note also that the names of the movies correspond to their web pages in www.metacritic.com. For example, to see what critics have to say about the “Karate Kid” and other relevant information online, point your browser to www.metacritic.com/movie/the-karate-kid. This also comes in very handy when there are several versions of a single title: Which “Karate Kid” does this score refer to, the one from the eighties, or Jackie Chan’s?

Feel free to download a copy of the resulting data [here] (note it is a large file: 1.6MB).

But having that data stored locally allows us to gather that information with simple Python commands, and to perform many complex operations on it.
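As an example of the kind of simple commands meant here, the following sketch works on two small stand-in lists with the same structure as critics and scoredMovies. The critic names, the scores, the helper `average_score` and the threshold are invented for the illustration; with the real data only the first two assignments would change.

```python
from collections import Counter

# Toy stand-ins for the two lists (hypothetical critics and scores).
critics = ['Critic A', 'Critic B', 'Critic C']
scoredMovies = [
    {'the-karate-kid': 60, 'real-steel': 60},
    {'the-karate-kid': 80, 'arthur': 40, 'real-steel': 70},
    {'the-karate-kid': 50},
]

# How many movies did each critic score?
movies_per_critic = {name: len(scores)
                     for name, scores in zip(critics, scoredMovies)}

# How many critics scored each movie?
times_scored = Counter(movie for scores in scoredMovies for movie in scores)

# Keep only the movies scored more than `threshold` times.
threshold = 1
popular = sorted(movie for movie, n in times_scored.items() if n > threshold)


def average_score(movie):
    """Mean score of a movie over the critics that evaluated it."""
    values = [scores[movie] for scores in scoredMovies if movie in scores]
    return sum(values) / len(values)


print(movies_per_critic)
print(popular)
print(round(average_score('the-karate-kid'), 2))
```

Because each critic's scores live in a dict keyed by the movie's web-page name, membership tests such as `movie in scores` are cheap, which keeps this kind of query fast even for critics with thousands of entries.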
