Implicit Feedback and Collaborative Filtering
SAN FRANCISCO, CA — I have been re-reading this classic paper, and was impressed by both its clarity and its simplicity.
Often the most powerful ideas are also the most elegant!
Here, I will try to summarize the technique, with just enough math thrown in to facilitate understanding from a coding perspective.
Good Old Explicit
The classic problem of collaborative filtering (CF) is the one with explicit feedback: given a set of users and items, we observe the user-item interaction explicitly through, say, a rating $r_{ui}$, and then predict the ratings for the user-item pairs that are unknown. If we have $m$ users and $n$ items, we can envision an $m \times n$ ratings matrix such that each row corresponds to a given user and non-empty elements in that row correspond to the ratings that this user has given to the items. Because each user would typically rate a small proportion of the items, this matrix will be sparse.
The aim of Recommender Systems is to fill in the blanks of this matrix with predicted ratings using the information at hand: matrix completion. The popular method of accomplishing this is through matrix factorization, whereby the ratings matrix (that can routinely be a million users by a million items) is expressed as a product of two lower dimensional matrices: an $m \times f$ user-features matrix and an $n \times f$ item-features matrix. Typically, one uses $f \sim$ 50-200 features.
Let us introduce a bit of notation. I borrow from quantum mechanics and write vectors as $|x\rangle$, with the understanding that it is a column vector. This simplifies some algebraic manipulations later. The latent feature vector of user $u$ is denoted as:

$$|x_u\rangle = \begin{pmatrix} x_u^1 \\ x_u^2 \\ \vdots \\ x_u^f \end{pmatrix}$$
where the superscripts denote the indices of the elements. Its transpose is denoted by the row vector:

$$\langle x_u | = \left( x_u^1, x_u^2, \ldots, x_u^f \right)$$
so that the squared norm can be written as the inner product: $\langle x_u | x_u \rangle = \sum_{k=1}^{f} \left( x_u^k \right)^2$. Similarly, we can write the latent feature vector for item $i$ as $|y_i\rangle$.
The problem is formulated in terms of minimizing an objective function defined as a sum of residuals from known ratings:

$$\min_{x,y} \sum_{(u,i) \,\in\, \text{known}} \left( r_{ui} - \langle x_u | y_i \rangle \right)^2 + \lambda \left( \sum_u \langle x_u | x_u \rangle + \sum_i \langle y_i | y_i \rangle \right)$$
In other words, solve for the latent factors until the squared difference between the explicit and predicted ratings is minimized. The $\lambda$ term adds regularization to prevent overfitting. By merit of the sparsity of the ratings matrix this problem can be solved efficiently, typically using a Stochastic Gradient Descent type algorithm.
The sparsity of the ratings matrix is a good thing, but it is also its problem. In most settings, it is very hard to get a large fraction of users to rate a large fraction of the items. So there might just be too little signal if we stick to explicit ratings only. However, user-item interactions can happen in many different ways other than a rating. A user may be searching for an item (say a restaurant) on the web, or listening to/skipping a track, or clicking on links in an email campaign, and so on and so forth. Usually these implicit signals show the attachment of a user to an item through some sort of frequency. These actions do not explicitly state or quantify any preference of the user for the item, but instead give us confidence about the user's opinion. And we usually have plenty more of these signals than explicit ratings.
Consider, for example, listening behavior for songs. When I listen to music on a streaming service, I rarely ever rate a song that I like or dislike. But more often I skip a song, or listen only halfway through it if I dislike it. If I really like a song, I come back to it often. Even if I would like to rate a song, sometimes it is simply not possible, as I might be driving, or the song might be playing through Sonos, Roku, or some system which does not allow for an easy rating interface. So, to infer my musical taste profile, my listens, my repeat listens, my skips, the fraction of tracks listened to, etc. are far more valuable signals than explicit ratings. This is what this paper attempted to incorporate within the CF framework.
To keep it concrete, let us continue with the music streaming example, and let us say that we are going to use the number of times a user has listened to a track (the track listen count) as the only signal. For user $u$ and track $i$, let us reuse $r_{ui}$ to denote that number. We begin by defining a preference $p_{ui}$ of the user $u$ for track $i$ such that it is unity when the user has at least one listen and zero if the user has not interacted with the track at all:

$$p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{if } r_{ui} = 0 \end{cases}$$
So, at the end, the task of the recommendation system will be to predict the preference for the unseen tracks. Next, we use the track listen count to define a confidence $c_{ui}$ of the liking of the track by the user:

$$c_{ui} = 1 + \alpha \, r_{ui}$$

where $\alpha$ is a scaling constant (the paper reports that $\alpha = 40$ worked well in its experiments).
This is done in a somewhat ad hoc way, and the confidence could be defined in many other ways as long as it is a monotonic function of the frequency. The confidence has a base value of $1$ for all user-item pairs and it gets stronger as the number of listens increases. So a high value of confidence would mean that for that user-track pair there is a lot less noise about the preference of that particular user for that track. Therefore, it means that it is more important for the algorithm to predict a preference closer to unity for the pairs where the confidence is high. This leads to the following analog of the explicit objective function minimization problem:

$$\min_{x,y} \sum_{u,i} c_{ui} \left( p_{ui} - \langle x_u | y_i \rangle \right)^2 + \lambda \left( \sum_u \langle x_u | x_u \rangle + \sum_i \langle y_i | y_i \rangle \right)$$
A big $c_{ui}$ means that the prediction $\langle x_u | y_i \rangle$ has to be that much closer to $p_{ui}$ for that term not to contribute a lot to the sum.
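To make the setup concrete, here is a minimal numpy sketch that builds the preference and confidence matrices from a hypothetical toy matrix of listen counts; the value of `alpha` is just an assumed scaling constant.

```python
import numpy as np

# Hypothetical listen counts r[u, i] for 4 users and 5 tracks
# (a dense toy matrix; real data would be sparse).
r = np.array([
    [3, 0, 0, 1, 0],
    [0, 5, 2, 0, 0],
    [1, 0, 0, 0, 8],
    [0, 0, 4, 0, 1],
], dtype=float)

alpha = 40.0  # scaling constant in c_ui = 1 + alpha * r_ui (an assumption)

p = (r > 0).astype(float)   # preference: 1 if at least one listen, else 0
c = 1.0 + alpha * r         # confidence: grows with the listen count

print(p[0])  # [1. 0. 0. 1. 0.]
print(c[0])  # [121.   1.   1.  41.   1.]
```

Note that `p` is sparse in spirit, but `c` (and hence the objective) involves every user-item pair, which is exactly the difficulty discussed next.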
But suddenly now, the problem is no longer sparse, and the sum contains all $m \cdot n$ terms.
How to solve this, then?
Linear algebra tricks come to the rescue!
First, note that if the user latent factors are held constant the problem is quadratic in the item latent factors, and vice versa. This leads to the alternating least squares (ALS) approach for this problem. But that by itself does not solve the lack of sparsity issue.
The ALS proceeds in two steps:
STEP 1 of ALS: First, we hold the item latent vectors fixed, and solve for the user latent vectors $|x_u\rangle$. For a given user $u$, the optimization function can be re-written in matrix notation. First we define an $n \times f$ item-factor matrix $Y$ whose rows are the item latent vectors:

$$Y = \begin{pmatrix} \langle y_1 | \\ \langle y_2 | \\ \vdots \\ \langle y_n | \end{pmatrix}$$
Next, we define a vector of preferences for the user as:

$$|p(u)\rangle = \left( p_{u1}, p_{u2}, \ldots, p_{un} \right)^T$$
and an $n \times n$ diagonal matrix of confidence in items for this user as:

$$C^u = \mathrm{diag}\left( c_{u1}, c_{u2}, \ldots, c_{un} \right)$$
With these, the relevant part of the objective function can be rewritten as:

$$J_u = \left( |p(u)\rangle - Y|x_u\rangle \right)^T C^u \left( |p(u)\rangle - Y|x_u\rangle \right) + \lambda \, \langle x_u | x_u \rangle$$
To minimize this, we set the derivative w.r.t. $\langle x_u|$ to zero, and we find the solution (see derivation at the bottom) as

$$|x_u\rangle = \left( Y^T C^u Y + \lambda I \right)^{-1} Y^T C^u \, |p(u)\rangle$$
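As a sketch of this closed-form update for a single user, with random toy item factors and a hypothetical listen-count vector (the sizes and $\lambda$, $\alpha$ values are illustrative assumptions), the solve can be written directly with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, lam, alpha = 5, 3, 0.1, 40.0

Y = rng.normal(size=(n, f))                # item-factor matrix (n x f)
r_u = np.array([3.0, 0.0, 0.0, 1.0, 0.0])  # this user's listen counts (toy data)

p_u = (r_u > 0).astype(float)              # preference vector |p(u)>
Cu = np.diag(1.0 + alpha * r_u)            # diagonal confidence matrix C^u

# x_u = (Y^T C^u Y + lam I)^{-1} Y^T C^u |p(u)>
A = Y.T @ Cu @ Y + lam * np.eye(f)
b = Y.T @ Cu @ p_u
x_u = np.linalg.solve(A, b)                # solve, rather than invert explicitly
```

Using `np.linalg.solve` instead of forming the inverse is the standard numerically safer choice for a linear system like this.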
STEP 2 of ALS: In the next step, in an exactly analogous fashion, we can hold the user latent vectors fixed and vary the item latent vectors to minimize the cost function. We define an $m \times f$ user-factor matrix $X$, an $m \times m$ diagonal matrix $C^i$ with $C^i_{uu} = c_{ui}$, and a preference vector $|p(i)\rangle$ for all users against item $i$. We end up with the solution:

$$|y_i\rangle = \left( X^T C^i X + \lambda I \right)^{-1} X^T C^i \, |p(i)\rangle$$
The algorithm then alternates between STEP 1 and STEP 2 until the latent factors converge (usually $\sim 10$ sweeps).
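Putting the two steps together, a dense toy version of the whole alternating loop might look like the following sketch; the hyperparameters (`f`, `lam`, `alpha`, the number of sweeps) are illustrative assumptions, and a real implementation would use sparse storage and the speedup trick discussed below.

```python
import numpy as np

def als_implicit(r, f=8, lam=0.1, alpha=40.0, sweeps=10, seed=0):
    """ALS for implicit feedback (dense toy version).

    r : (m, n) array of listen counts. Returns user and item factor matrices."""
    rng = np.random.default_rng(seed)
    m, n = r.shape
    p = (r > 0).astype(float)          # preferences p_ui
    c = 1.0 + alpha * r                # confidences c_ui
    X = rng.normal(scale=0.1, size=(m, f))
    Y = rng.normal(scale=0.1, size=(n, f))
    reg = lam * np.eye(f)
    for _ in range(sweeps):
        # STEP 1: hold Y fixed, solve each |x_u> in closed form
        for u in range(m):
            A = Y.T @ (c[u][:, None] * Y) + reg   # Y^T C^u Y + lam I
            X[u] = np.linalg.solve(A, Y.T @ (c[u] * p[u]))
        # STEP 2: hold X fixed, solve each |y_i>
        for i in range(n):
            A = X.T @ (c[:, i][:, None] * X) + reg  # X^T C^i X + lam I
            Y[i] = np.linalg.solve(A, X.T @ (c[:, i] * p[:, i]))
    return X, Y
```

After fitting, `X @ Y.T` gives the matrix of predicted preferences for every user-item pair.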
Going back to STEP 1, we first note that $c_{ui} - 1$ is zero wherever $p_{ui}$ is zero, by construction. Therefore, for each of the $m$ users, $C^u |p(u)\rangle$ has only $n_u$ non-zero terms, where $n_u$ is the number of items for which the user has preference equal to unity, and typically $n_u \ll n$. The main bottleneck is the term $Y^T C^u Y$, whose naïve calculation would require $O(f^2 n)$ time for each user. Speedup can be obtained just by rewriting this term as

$$Y^T C^u Y = Y^T Y + Y^T \left( C^u - I \right) Y$$
The beauty of this expression is that the matrix $Y^T Y$ has to be calculated only once (being independent of the user), and $\left( C^u - I \right)$ is non-zero only for those items that the user has expressed any preference for! Therefore this part of the calculation reduces vastly in complexity, to $O(f^2 n_u)$ for each of the $m$ users.
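This rewriting is easy to check numerically. In the sketch below, a hypothetical user has touched only three of a thousand items; the naïve $O(f^2 n)$ product and the $O(f^2 n_u)$ shortcut agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, f, alpha = 1000, 8, 40.0
Y = rng.normal(size=(n, f))          # item-factor matrix

# Sparse listen counts for one user: only a handful of items touched.
rated = np.array([3, 17, 256])       # item indices (toy data)
counts = np.array([2.0, 5.0, 1.0])   # listen counts for those items

YtY = Y.T @ Y                        # computed once, shared by every user

# Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y, where (C^u - I) is non-zero
# only on the n_u rated items, so the correction costs O(f^2 n_u).
Ys = Y[rated]                        # (n_u x f) slice of rated items
fast = YtY + Ys.T @ (alpha * counts[:, None] * Ys)

# Naive O(f^2 n) computation over all n items, for comparison:
c = np.ones(n)
c[rated] += alpha * counts
slow = Y.T @ (c[:, None] * Y)

assert np.allclose(fast, slow)
```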
An analogous trick works for STEP 2 as well!
Once the solution converges, the predicted preference of a user $u$ for an item $i$ can be computed as

$$\hat{p}_{ui} = \langle x_u | y_i \rangle$$
Sort the tracks by this value, and voila, you have recommended tracks for every listener in your sample!
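A minimal sketch of this scoring and ranking step, assuming converged factor matrices `X` and `Y` (random stand-ins here):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, f = 3, 6, 4
X = rng.normal(size=(m, f))   # converged user factors (stand-in values)
Y = rng.normal(size=(n, f))   # converged item factors (stand-in values)

scores = X @ Y.T              # \hat{p}_{ui} = <x_u | y_i> for every pair

# Top-3 track indices per user, highest predicted preference first.
top_k = np.argsort(-scores, axis=1)[:, :3]
```

In production one would also filter out the tracks the user has already listened to before taking the top of the list.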
Suppose there are 100 million songs. If we are using 200 features, then $Y$ has dimension $10^8 \times 200$. Therefore, $Y^T Y$ is not a trivial multiplication. Conventional matrix multiplication will take a row of $Y^T$ that is 100 million wide and multiply it with a column of $Y$ that is also 100 million tall. Such computations can be costly and slow. However, the way we wrote down $Y$ in terms of the item vectors gives us another way to compute this:

$$Y^T Y = \sum_{i=1}^{n} |y_i\rangle \langle y_i|$$

i.e. the matrix product is the sum of the outer products of the rows, each of which is $f \times f$ and, being independent of each other, can be computed on different nodes and then added back together!
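This sum-of-outer-products view is easy to verify in numpy; the four "chunks" below stand in for worker nodes that would each compute a partial $f \times f$ sum:

```python
import numpy as np

rng = np.random.default_rng(3)
n, f = 100, 5
Y = rng.normal(size=(n, f))

# Y^T Y as a sum of f x f outer products |y_i><y_i| -- each term depends
# on a single item row, so disjoint chunks of rows can be handled on
# separate nodes and the partial sums added back together.
chunks = np.array_split(np.arange(n), 4)          # pretend: 4 worker nodes
partials = [Y[idx].T @ Y[idx] for idx in chunks]  # partial sums per chunk
total = sum(partials)

assert np.allclose(total, Y.T @ Y)
```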
Quick derivation of the update rule
We expand the cost function by propagating the transpose and then multiplying through by $C^u$:

$$J_u = \langle p(u) | C^u | p(u) \rangle - 2 \, \langle x_u | Y^T C^u | p(u) \rangle + \langle x_u | Y^T C^u Y | x_u \rangle + \lambda \, \langle x_u | x_u \rangle$$
Taking the derivative w.r.t. $\langle x_u |$ and setting it to zero gives:

$$- 2 \, Y^T C^u | p(u) \rangle + 2 \, Y^T C^u Y \, | x_u \rangle + 2 \lambda \, | x_u \rangle = 0$$
which immediately gives us the solution for $|x_u\rangle$ written above.