![]() |
|
| | |
| Craig Holman Research Jabberwocky Quotations Contact | |
| Home >
Jabberwocky      |
|
Jabberwocky for May 28, 2008 |
|
The Netflix Prize     The Netflix Prize aims at improving algorithms for estimating how customers will rate movies and other DVDs by offering a million-dollar prize for significant improvement over what Netflix was able to do as of the beginning of October 2006. The Netflix Prize site is: www.NetflixPrize.com The rules of the competition, a leaders' board, a forum, and data downloads are available from that site. The task is to mine and analyze a large collection of customer rating records in order be able to make predictions for a user for whom you have some historical rating information.
Each rating record contains a customer ID (not the real ID), a movie ID, a rating, and the date that customer rated that movie. Each movie record contains a movie ID, a title, and a year of release
While many contestants wanted to use external data (e.g. directory, genre, actors), I remain convinced that this information isn't necessary to win the prize and, in fact, may be distracting and misleading.
The number of customers = 480189 The number of movies = 17770 The maximum number of ratings per customer = 17653 The largest customer ID = 2649429 The largest movie ID = 17770 The Netflix Prize challenge - My involvementI learned of the Netflix Prize shortly before it began in October 2006 and became intrigued with it. The problems that it poses are deep and difficult. I spent a fair amount of time working on it until my partner died in late December 2006. While I let my computer continue to crunch on discovering maximal cliques in order to establish movie clustering, my heart wasn't in it and I stopped working on the problem in February 2007. I've been tempted to resume work on the problem several times, but other research problems, especially graph isomorphism, have lured me away. I think my approach has some novel aspects and may be of interest to people who work in this area. The Netflix Prize challenge - Some issuesHere are some issues that have come to mind as I've thought about how to use the data:
The Netflix Prize challenge - My approach to a solutionMy approach to this problem requires a great deal of processing up front and much less when making preditions. Nearly all of the initial work is involved in determing significant clusters of movies and the degree of membership that each movie has in each cluster. Each customer's affinity for signficant clusters is determined. Each customer's rating bias within each cluster for which s/he has affinity is determined. Prediction of a customer's rating for a movie is to be done as follows: The Netflix Prize challenge - Movie clustersThis problem is, at heart, a clustering problem. How can we discover 'natural' clusters of movies - those movies that people actually corate highly together - and use them effectively for prediction? Here are a few questions related to movie clusters:
The Netflix Prize challenge - Corated moviesI use my maximal cliques algorithm to discover the signficant movie clusters and the strength of membership of each movie in each cluster The processing required by this approach is incredible. I'm sure that my initial approach can be significantly improved upon. For each number of people who corate a movie, I generate a graph and then identify all of the maximal cliques of that graph. For example, in the data that follows, I constructed a graph of movies that were corated by 54,000 or more customers in the data set. In example that follows, 54,000 or more customers all rated the first twelve movies listed. Once the graph is constructed, I then ran my maximal cliques algorithm to discover all of the maximal cliques of that graph. My plan, as yet unfulfilled, is to use this mountain of information to identify significant clusters and to determine membership strength.
Maximal cliques of size 11 or more of movies having 54,000 or more customers
corating them in the Netflix Prize data set
Maximal 12-cliques
30 : Something's Gotta Give
1180 : A Beautiful Mind
4356 : Road to Perdition
5085 : Seabiscuit
5862 : Memento
6972 : Armageddon
10550 : Cold Mountain
10583 : The School of Rock
12470 : Twister
13359 : Chicago
16242 : Con Air
16882 : Anger Management
Maximal 11-cliques
30 : Something's Gotta Give
1180 : A Beautiful Mind
4356 : Road to Perdition
5085 : Seabiscuit
5862 : Memento
9340 : Pearl Harbor
10550 : Cold Mountain
10583 : The School of Rock
12470 : Twister
13359 : Chicago
16882 : Anger Management
30 : Something's Gotta Give
457 : Kill Bill: Vol. 2
4356 : Road to Perdition
5862 : Memento
6972 : Armageddon
10550 : Cold Mountain
10583 : The School of Rock
12470 : Twister
13359 : Chicago
16242 : Con Air
16882 : Anger Management
30 : Something's Gotta Give
457 : Kill Bill: Vol. 2
4356 : Road to Perdition
5862 : Memento
6972 : Armageddon
10583 : The School of Rock
12470 : Twister
13359 : Chicago
15471 : Phone Booth
16242 : Con Air
16882 : Anger Management
30 : Something's Gotta Give
457 : Kill Bill: Vol. 2
4356 : Road to Perdition
6972 : Armageddon
10550 : Cold Mountain
10583 : The School of Rock
11182 : Master and Commander: The Far Side of the World
12470 : Twister
13359 : Chicago
16242 : Con Air
16882 : Anger Management
30 : Something's Gotta Give
4356 : Road to Perdition
5085 : Seabiscuit
5862 : Memento
6972 : Armageddon
10583 : The School of Rock
12470 : Twister
13359 : Chicago
15471 : Phone Booth
16242 : Con Air
16882 : Anger Management
30 : Something's Gotta Give
4356 : Road to Perdition
5085 : Seabiscuit
6972 : Armageddon
10550 : Cold Mountain
10583 : The School of Rock
11182 : Master and Commander: The Far Side of the World
12470 : Twister
13359 : Chicago
16242 : Con Air
16882 : Anger Management
30 : Something's Gotta Give
1180 : A Beautiful Mind
1220 : Man on Fire
4356 : Road to Perdition
5085 : Seabiscuit
5862 : Memento
10550 : Cold Mountain
10583 : The School of Rock
11089 : Monsters
13359 : Chicago
16882 : Anger Management
30 : Something's Gotta Give
1180 : A Beautiful Mind
1220 : Man on Fire
4356 : Road to Perdition
5085 : Seabiscuit
8782 : The Royal Tenenbaums
10550 : Cold Mountain
10583 : The School of Rock
11089 : Monsters
13359 : Chicago
16882 : Anger Management
30 : Something's Gotta Give
1180 : A Beautiful Mind
4356 : Road to Perdition
5085 : Seabiscuit
8782 : The Royal Tenenbaums
9340 : Pearl Harbor
10550 : Cold Mountain
10583 : The School of Rock
10947 : The Incredibles
13359 : Chicago
16882 : Anger Management
30 : Something's Gotta Give
1180 : A Beautiful Mind
4356 : Road to Perdition
5085 : Seabiscuit
5862 : Memento
9340 : Pearl Harbor
10550 : Cold Mountain
10583 : The School of Rock
10947 : The Incredibles
13359 : Chicago
16882 : Anger Management
30 : Something's Gotta Give
1180 : A Beautiful Mind
4356 : Road to Perdition
5085 : Seabiscuit
8782 : The Royal Tenenbaums
10550 : Cold Mountain
10583 : The School of Rock
11089 : Monsters
13081 : Troy
13359 : Chicago
16882 : Anger Management
30 : Something's Gotta Give
1180 : A Beautiful Mind
4356 : Road to Perdition
5085 : Seabiscuit
5862 : Memento
10550 : Cold Mountain
10583 : The School of Rock
11089 : Monsters
13081 : Troy
13359 : Chicago
16882 : Anger Management
1180 : A Beautiful Mind
4356 : Road to Perdition
5085 : Seabiscuit
5862 : Memento
6972 : Armageddon
10550 : Cold Mountain
12161 : Big Fish
12470 : Twister
13359 : Chicago
16242 : Con Air
16882 : Anger Management
More on this tomorrow...      |
| Home >
Jabberwocky      |
|
| Craig Holman Research Jabberwocky Quotations Contact | |
| Patterncraft and Constraint and Consequence are trademarks of Craig Stewart Holman Copyright © 2008 Craig Stewart Holman All Rights Reserved. |
|
| | |
![]() |
|