Collaborative filtering techniques make recommendations based on user preferences and behavior. Collaborative filtering algorithms fall into two categories: memory-based and model-based. Memory-based algorithms use the entire ratings dataset directly to generate recommendations, while model-based algorithms learn a compact model from the data and use that model to make predictions. Some of the most popular collaborative filtering approaches include user-based collaborative filtering, item-based collaborative filtering, matrix factorization, and deep learning.
User-based collaborative filtering is a technique that recommends items based on the preferences of similar users. Item-based collaborative filtering is a technique that recommends items based on their similarity to items that a user has already rated. Matrix factorization is a technique that decomposes a large matrix into two smaller matrices that can be used to make predictions. Deep learning is a technique that uses neural networks to learn complex patterns in data.

Matrix factorization is a technique that decomposes a large matrix into two smaller matrices, one representing the users and the other representing the items. Multiplying these two matrices reconstructs an approximation of the original matrix and fills in the missing values. This way, we can predict how a user would rate an item that they have not seen before.
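As a minimal sketch of this idea (toy numbers, not the MovieLens data), we can factor a small ratings matrix with plain NumPy, running gradient descent only on the observed entries:

```python
import numpy as np

# Toy ratings matrix: 4 users x 3 movies, 0 = not rated
# (illustrative numbers, not from the MovieLens data).
R = np.array([[5, 3, 0],
              [4, 0, 0],
              [1, 1, 5],
              [0, 1, 4]], dtype=float)

k = 2                                    # number of latent factors
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, k))   # user factors
V = rng.normal(scale=0.1, size=(3, k))   # movie factors

mask = R > 0                             # only observed ratings count
lr = 0.05
for _ in range(2000):
    err = (U @ V.T - R) * mask           # error on observed entries
    U -= lr * err @ V                    # gradient step for users
    V -= lr * err.T @ U                  # gradient step for movies

pred = U @ V.T  # every cell now holds a predicted rating,
                # including the previously missing ones
```

After training, the zero cells of R are filled with predictions formed from the learned user and movie factors.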
I built a collaborative filtering model to predict the ratings that MovieLens users give to movies, using a dataset with 100,836 ratings from 610 users on 9,724 movies.
Data can be downloaded using this command line:
wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
The steps to build and train the Matrix Factorization model are as follows:
Why do we need to encode the data?
Collaborative filtering is a technique for building recommender systems that use the ratings or preferences of users to predict what they might like. One of the challenges of collaborative filtering is that the user and item ids are often not continuous integers, but rather strings or other types of identifiers. This makes it difficult to use them as indices for matrices or tensors. To solve this problem, we need to encode the user and item ids into continuous integers that range from 0 to n-1, where n is the number of unique values.
To do this, I will use a simple function from the fast.ai library called proc_col. This function takes a pandas column as input and returns a dictionary that maps each unique value to an integer, an array that contains the encoded values, and the number of unique values. For example, if we have a column with values ['a', 'b', 'c', 'a', 'b'], proc_col will return ({'a': 0, 'b': 1, 'c': 2}, [0, 1, 2, 0, 1], 3).
The first step is to encode the rating data into a sparse matrix, where each row represents a user and each column represents a movie. The matrix elements are the ratings given by the users to the movies. If a user has not rated a movie, the element is zero.
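The sparse layout described above can be sketched with SciPy; the encoded ids and ratings here are made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Already-encoded ids: rows = users, columns = movies, values = ratings.
user_ids  = np.array([0, 0, 1, 2])
movie_ids = np.array([0, 2, 1, 2])
ratings   = np.array([5.0, 3.0, 4.0, 1.0])

R = csr_matrix((ratings, (user_ids, movie_ids)), shape=(3, 3))
print(R.toarray())
# Unrated (user, movie) pairs stay zero:
# [[5. 0. 3.]
#  [0. 4. 0.]
#  [0. 0. 1.]]
```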
import numpy as np

def proc_col(col):
    """Encodes a pandas column with values between 0 and n-1,
    where n is the number of unique values.
    """
    uniq = col.unique()
    name2idx = {o: i for i, o in enumerate(uniq)}
    return name2idx, np.array([name2idx[x] for x in col]), len(uniq)
def encode_data(df):
    """Encodes rating data with continuous user and movie ids using
    the proc_col helper from above.
    Arguments:
        df: a dataframe with columns userId, movieId, rating
    Returns:
        df: a dataframe with the encoded data
        num_users
        num_movies
    """
    users_enc = proc_col(df.userId)
    num_users = users_enc[2]
    movie_enc = proc_col(df.movieId)
    num_movies = movie_enc[2]
    df.userId = users_enc[1]
    df.movieId = movie_enc[1]
    return df, num_users, num_movies
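A quick check of encode_data on a toy dataframe with non-contiguous ids (the two helpers are repeated in compact form so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Compact repeats of the helpers defined above.
def proc_col(col):
    uniq = col.unique()
    name2idx = {o: i for i, o in enumerate(uniq)}
    return name2idx, np.array([name2idx[x] for x in col]), len(uniq)

def encode_data(df):
    users_enc = proc_col(df.userId)
    movie_enc = proc_col(df.movieId)
    df.userId = users_enc[1]
    df.movieId = movie_enc[1]
    return df, users_enc[2], movie_enc[2]

# Made-up ratings with gaps in the id ranges, as in the real data.
df = pd.DataFrame({"userId":  [10, 10, 42, 42, 7],
                   "movieId": [501, 603, 501, 777, 603],
                   "rating":  [4.0, 3.5, 5.0, 2.0, 4.5]})
df, num_users, num_movies = encode_data(df)
print(num_users, num_movies)   # 3 3
print(df.userId.tolist())      # [0, 0, 1, 1, 2]
```

The ids 10, 42, 7 and 501, 603, 777 become 0..2 on each axis, so they can index directly into the factor matrices.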
We need to specify the number of latent factors (k) that we want to use to represent the users and the items. The latent factors are hidden features that capture the preferences and characteristics of the users and the items. For example, a latent factor could represent how much a user likes comedy movies or how funny a movie is.
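As a rough sketch: with k latent factors, each user and each movie becomes a length-k vector, and a predicted rating is their dot product. The value k = 100 below is an illustrative choice, not a value prescribed by the text; the matrix sizes match the MovieLens subset described earlier.

```python
import numpy as np

num_users, num_movies, k = 610, 9724, 100  # k chosen for illustration
rng = np.random.default_rng(0)

# One length-k latent vector per user and per movie (random init;
# these would be learned during training).
user_factors = rng.normal(scale=0.1, size=(num_users, k))
movie_factors = rng.normal(scale=0.1, size=(num_movies, k))

def predict(u, m):
    """Predicted rating of encoded user u for encoded movie m."""
    return user_factors[u] @ movie_factors[m]
```

Each coordinate of these vectors plays the role of one hidden feature, such as the "how much comedy" example above.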