Constraint and Consequence
 
   
  Craig Holman   Research   Jabberwocky   Quotations   Contact
 
  Home > Jabberwocky           Previous      Next
 
 
 

Jabberwocky for May 29, 2008

 
 

The Netflix Prize - Corated movies revisited

          First in thread      Previous in thread

I wrote yesterday about analyzing corated movies in order to identify clusters of movies. Some elaboration is called for.

Two movies are corated by a group of customers if each customer in the group has rated both movies.

For purposes of the following examples, let's assume that only eight movies have been rated and they are labeled 'A' through 'H'.

We'll construct a 8 by 8 grid of counters. Each row is labeled with 'A' through 'H', as is each column. Each counter is initialized to 0.

 
Corated Movies - Counter Grid - Initialized
A B C D E F G H
A 0 0 0 0 0 0 0 0
B 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0

For each customer, for each pair of different movies that the customer has rated, increment the counter at the intersection of the two movies. For convenience, each counter at Row[i]Column[j] will be shown as duplicated at Row[j]Column[i].

 
Corated Movies - Counter Grid - Populated
A B C D E F G H
A 0 23 1096 3 825 90 7192 256
B 23 0 84 902 1460 79 0 11
C 1096 84 0 8 304 66 2 101
D 3 902 8 0 42 6 280 703
E 825 1460 304 42 0 1045 827 320
F 90 79 66 6 1045 0 85 298
G 7192 0 2 280 827 85 0 55
H 256 11 101 703 320 298 55 0

The counter values range from 0 through 7192. Pick a threshold value in this range. I picked 300 for this next example.

For each counter in the grid, replace its value with 1 if the counter is greater than or equal to the threshold, or with 0 otherwise.

Corated Movies - Graph Adjacency Matrix - 300 and higher
A B C D E F G H
A 0 0 1 0 1 0 1 0
B 0 0 0 1 1 0 0 0
C 1 0 0 0 1 0 0 0
D 0 1 0 0 0 0 0 1
E 1 1 1 0 0 1 1 1
F 0 0 0 0 1 0 0 0
G 1 0 0 0 1 0 0 0
H 0 0 0 1 1 0 0 0

What we have constructed is the adjacency matrix for a graph, where the nodes of the graph are the movies 'A' through 'H' and there is an edge between a pair of nodes X and Y if at least 300 customers rated both movie X and movie Y.

Our next task is to discover the maximal cliques of this graph. Using the maximal cliques algorithm that I described earlier, the following maximal cliques were discovered (represented as sets of nodes):

{ A C E }  { A E G }  { B D }  { B E }  { D H }  { E F }  { E H }

For each maximal clique, all of the movies in the clique were rated by the same group of at least 300 customers.

There are 28 distinct values in the non-diagonal counters in the populated counter grid. We can construct 28 different graphs using the same process that we used for a threshold of 300. From each graph, we can discover the set of maximal cliques for that graph.

These 28 sets of maximal cliques can then be analyzed to discover movie clusters and to determine the degree of membership of each movie in each cluster. I'll explore how this can be accomplished later.

          First in thread      Previous in thread

 
 
 
  Home > Jabberwocky           Previous      Next
 
  Craig Holman   Research   Jabberwocky   Quotations   Contact  
 
 
 
  Patterncraft™ and Constraint and Consequence™
are trademarks of Craig Stewart Holman

Copyright © 2008 Craig Stewart Holman

Copyright and Legal Notices

All Rights Reserved.

 
   
 
  Patterncraft logo