Constraint and Consequence
 
   
  Craig Holman   Research   Jabberwocky   Quotations   Contact
 
  Home > Jabberwocky           Previous      Next
 
 
 

Jabberwocky for May 28, 2008

 
 

The Netflix Prize

          Next in thread

The Netflix Prize aims at improving algorithms for estimating how customers will rate movies and other DVDs by offering a million-dollar prize for significant improvement over what Netflix was able to do as of the beginning of October 2006.

The Netflix Prize site is: www.NetflixPrize.com

The rules of the competition, a leaders' board, a forum, and data downloads are available from that site.

The task is to mine and analyze a large collection of customer rating records in order be able to make predictions for a user for whom you have some historical rating information.

Each rating record contains a customer ID (not the real ID), a movie ID, a rating, and the date that customer rated that movie.

Each movie record contains a movie ID, a title, and a year of release

While many contestants wanted to use external data (e.g. directory, genre, actors), I remain convinced that this information isn't necessary to win the prize and, in fact, may be distracting and misleading.

The number of customers = 480189

The number of movies = 17770

The maximum number of ratings per customer = 17653

The largest customer ID = 2649429

The largest movie ID = 17770

The Netflix Prize challenge - My involvement

I learned of the Netflix Prize shortly before it began in October 2006 and became intrigued with it. The problems that it poses are deep and difficult.

I spent a fair amount of time working on it until my partner died in late December 2006. While I let my computer continue to crunch on discovering maximal cliques in order to establish movie clustering, my heart wasn't in it and I stopped working on the problem in February 2007. I've been tempted to resume work on the problem several times, but other research problems, especially graph isomorphism, have lured me away.

I think my approach has some novel aspects and may be of interest to people who work in this area.

The Netflix Prize challenge - Some issues

Here are some issues that have come to mind as I've thought about how to use the data:

  • How to handle multiple people using the same account (i.e. customer ID) to rate movies
  • How to handle changing groups of people using the same account over time
  • How do ratings change over time for a member?
  • What are the consequences, if any, of the date being the date of rating rather than the date of viewing?
  • How do ratings change by season, time of the month, time of the week, and time within payday cycles?
  • How do are ratings affected by political calendars, religious calendars, sports calendars, wedding seasons, graduation season, and vacation seasons?
  • How are ratings affected by the number of ratings that a customer is making in one sitting or within a short period of time?
  • How does a customer's affiliation with a cluster of movies change over time (e.g. weariness, jadedness, change in expectations)?
  • How do life events (e.g. marriage, divorce, births, deaths, illness) alter a customer's rating tendencies?
  • Some people and companies use software to try to promote (or attack) one film at the expense of others. How can this 'gaming' of the system be detected and the data discarded?
  • Should older ratings be treated differently than newer ratings?
  • How should the release date of movies be used in analysis and prediction?
  • The Netflix Prize challenge - My approach to a solution

    My approach to this problem requires a great deal of processing up front and much less when making preditions.

    Nearly all of the initial work is involved in determing significant clusters of movies and the degree of membership that each movie has in each cluster.

    Each customer's affinity for signficant clusters is determined.

    Each customer's rating bias within each cluster for which s/he has affinity is determined.

    Prediction of a customer's rating for a movie is to be done as follows:

  • Identify the relevant clusters in which the movie has sufficient membership and for which the customer has sufficient affinity.
  • Apply the customer's cluster-related biases
  • Apply relevant temporal biases
  • Calculate the predicted rating for the movie by that user
  • The Netflix Prize challenge - Movie clusters

    This problem is, at heart, a clustering problem. How can we discover 'natural' clusters of movies - those movies that people actually corate highly together - and use them effectively for prediction?

    Here are a few questions related to movie clusters:

  • How can we discover the cores of clusters?
  • How can we determine which clusters are of sufficient significance?
  • Which movies belong to a cluster?
  • What is the strength of their membership in that cluster?
  • Is there a minimum threshold that establishes the boundary of cluster membership?
  • Should there be any limit on how many significant clusters a movie may belong to?
  • The Netflix Prize challenge - Corated movies

    I use my maximal cliques algorithm to discover the signficant movie clusters and the strength of membership of each movie in each cluster

    The processing required by this approach is incredible. I'm sure that my initial approach can be significantly improved upon.

    For each number of people who corate a movie, I generate a graph and then identify all of the maximal cliques of that graph. For example, in the data that follows, I constructed a graph of movies that were corated by 54,000 or more customers in the data set. In example that follows, 54,000 or more customers all rated the first twelve movies listed. Once the graph is constructed, I then ran my maximal cliques algorithm to discover all of the maximal cliques of that graph.

    My plan, as yet unfulfilled, is to use this mountain of information to identify significant clusters and to determine membership strength.

    Maximal cliques of size 11 or more of movies having 54,000 or more customers 
    corating them in the Netflix Prize data set
    
        Maximal 12-cliques
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         5862 :   Memento
         6972 :   Armageddon
        10550 :   Cold Mountain
        10583 :   The School of Rock
        12470 :   Twister
        13359 :   Chicago
        16242 :   Con Air
        16882 :   Anger Management
    
    
        Maximal 11-cliques
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         5862 :   Memento
         9340 :   Pearl Harbor
        10550 :   Cold Mountain
        10583 :   The School of Rock
        12470 :   Twister
        13359 :   Chicago
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
          457 :   Kill Bill: Vol. 2
         4356 :   Road to Perdition
         5862 :   Memento
         6972 :   Armageddon
        10550 :   Cold Mountain
        10583 :   The School of Rock
        12470 :   Twister
        13359 :   Chicago
        16242 :   Con Air
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
          457 :   Kill Bill: Vol. 2
         4356 :   Road to Perdition
         5862 :   Memento
         6972 :   Armageddon
        10583 :   The School of Rock
        12470 :   Twister
        13359 :   Chicago
        15471 :   Phone Booth
        16242 :   Con Air
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
          457 :   Kill Bill: Vol. 2
         4356 :   Road to Perdition
         6972 :   Armageddon
        10550 :   Cold Mountain
        10583 :   The School of Rock
        11182 :   Master and Commander: The Far Side of the World
        12470 :   Twister
        13359 :   Chicago
        16242 :   Con Air
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         5862 :   Memento
         6972 :   Armageddon
        10583 :   The School of Rock
        12470 :   Twister
        13359 :   Chicago
        15471 :   Phone Booth
        16242 :   Con Air
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         6972 :   Armageddon
        10550 :   Cold Mountain
        10583 :   The School of Rock
        11182 :   Master and Commander: The Far Side of the World
        12470 :   Twister
        13359 :   Chicago
        16242 :   Con Air
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         1220 :   Man on Fire
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         5862 :   Memento
        10550 :   Cold Mountain
        10583 :   The School of Rock
        11089 :   Monsters
        13359 :   Chicago
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         1220 :   Man on Fire
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         8782 :   The Royal Tenenbaums
        10550 :   Cold Mountain
        10583 :   The School of Rock
        11089 :   Monsters
        13359 :   Chicago
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         8782 :   The Royal Tenenbaums
         9340 :   Pearl Harbor
        10550 :   Cold Mountain
        10583 :   The School of Rock
        10947 :   The Incredibles
        13359 :   Chicago
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         5862 :   Memento
         9340 :   Pearl Harbor
        10550 :   Cold Mountain
        10583 :   The School of Rock
        10947 :   The Incredibles
        13359 :   Chicago
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         8782 :   The Royal Tenenbaums
        10550 :   Cold Mountain
        10583 :   The School of Rock
        11089 :   Monsters
        13081 :   Troy
        13359 :   Chicago
        16882 :   Anger Management
    
           30 :   Something's Gotta Give
         1180 :   A Beautiful Mind
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         5862 :   Memento
        10550 :   Cold Mountain
        10583 :   The School of Rock
        11089 :   Monsters
        13081 :   Troy
        13359 :   Chicago
        16882 :   Anger Management
    
         1180 :   A Beautiful Mind
         4356 :   Road to Perdition
         5085 :   Seabiscuit
         5862 :   Memento
         6972 :   Armageddon
        10550 :   Cold Mountain
        12161 :   Big Fish
        12470 :   Twister
        13359 :   Chicago
        16242 :   Con Air
        16882 :   Anger Management
    

    More on this tomorrow...

              Next in thread

     
     
     
      Home > Jabberwocky           Previous      Next
     
      Craig Holman   Research   Jabberwocky   Quotations   Contact  
     
     
     
      Patterncraft™ and Constraint and Consequence™
    are trademarks of Craig Stewart Holman

    Copyright © 2008 Craig Stewart Holman

    Copyright and Legal Notices

    All Rights Reserved.

     
       
     
      Patterncraft logo