In building a recommender, it’s natural to ask: how well does it work? Ultimately, you’ll only know once you release it live to users and measure it against your targets, such as increased sales or user activity. It’s unlikely, however, that you’ll get it right first time round, and you’ll want to be able to improve it over time. With recommenders, as with many personalised systems, there are typically four methodologies for evaluating their quality:
- offline evaluations
- user surveys
- user studies
- online evaluations
Offline evaluations allow you to predict how well a recommender system will perform without directly involving real users. The most typical offline evaluation takes a set of user preferences and splits it into a training set and a test set. For example, you may have a data set containing the ratings that some users have given to movies, and you split it into two parts. Given the training set, it’s the job of the recommender to predict new recommendations that the users may appreciate. You can gauge how much users appreciate them by comparing the generated recommendations against the user preferences held out in the test set. The better the recommender predicts the preferences in the test set, bearing in mind that it was never exposed to them, the more confidence you can have that it’s generating good-quality recommendations. This setup is typically referred to as a prediction problem. Offline evaluations are particularly useful for testing how well a large range of recommenders and recommender settings will perform without having to go through the potentially expensive process of testing them with real users. You can also implement a wide range of metrics, from standard information retrieval ones such as precision and recall to more unusual ones that measure the diversity and novelty of recommendations. The main drawback of this methodology is that there is really no substitute for real user feedback, and in practice it can be very difficult to find an offline evaluation data set that represents the problem you’re trying to solve.
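As a concrete illustration of the train/test split described above, here is a minimal sketch in Python. The data, the popularity-based `recommend` function and the `precision_at_k` metric are all placeholder assumptions chosen for brevity, not a prescribed implementation:

```python
import random

# Hypothetical ratings log: (user, item) pairs the user liked.
ratings = [
    ("alice", "m1"), ("alice", "m2"), ("alice", "m3"), ("alice", "m4"),
    ("bob", "m2"), ("bob", "m3"), ("bob", "m5"), ("bob", "m6"),
]

# Hold out a portion of the preferences as the test set.
random.seed(42)
random.shuffle(ratings)
split = int(len(ratings) * 0.75)
train, test = ratings[:split], ratings[split:]

def recommend(user, train, k=3):
    """Placeholder recommender: the most popular items in the
    training set that the user hasn't already rated."""
    seen = {item for u, item in train if u == user}
    popularity = {}
    for _, item in train:
        popularity[item] = popularity.get(item, 0) + 1
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    return [item for item in ranked if item not in seen][:k]

def precision_at_k(test, train, k=3):
    """Fraction of recommended items that appear in each user's
    held-out test preferences, averaged over users."""
    scores = []
    for user in {u for u, _ in test}:
        recs = recommend(user, train, k)
        if not recs:
            continue
        held_out = {item for u, item in test if u == user}
        hits = sum(1 for item in recs if item in held_out)
        scores.append(hits / len(recs))
    return sum(scores) / len(scores) if scores else 0.0

print(f"precision@3 = {precision_at_k(test, train):.2f}")
```

In a real evaluation you would swap the popularity baseline for the recommender under test and repeat the split several times (or cross-validate) so a single lucky split doesn’t mislead you.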
The second common evaluation methodology is to construct and send user surveys. Constructing user surveys is an art form, but there are a number of good rules of thumb to follow. For example, to ease the analysis of the results, ask closed questions with predefined answers rather than open-ended ones; include questions that test the consistency of respondents’ answers (e.g. two different forms of the same question); and carefully understand the biases in the sample of users who receive the survey and go on to respond. User surveys are good for taking the temperature of how well your recommendations will be received, and they give users the opportunity to say what they want and expect from the recommender. It’s important to be cautious when interpreting the results, though: what people say in surveys about how they would act isn’t always how they actually act.
User studies are the third way to evaluate a recommender. From paper storyboards to functional prototypes, it’s a good idea to test how well your product will work by getting feedback from real users in person. Like user surveys, user studies let you go beyond testing just the recommender model and gather feedback on the entire product, including the user interface. They often take the form of user interviews. Depending on what you want to get out of the interviews, it’s common to set up a number of scenarios and ask participants to walk through them, voicing their thoughts as they go. User interviews are useful for discovering the major flaws in your product. After a dozen or so interviews, you’ll typically have discovered around 90% of the problems; getting at the remaining 10% means running much larger-scale tests, which can be very time consuming and expensive.
Online evaluations are a powerful methodology for evaluating a recommender’s performance. They are, as their name suggests, run live with real users, as opposed to offline evaluations, which are simulations without users. Online evaluations allow us to measure how users interact with the recommender. For example, we may measure the click-through rate of recommendations: the number of times users click on recommended items divided by the number of times those items are displayed to them. Once we can measure user interactions, we are in a position to compare two recommenders in an online setting. The most popular way of doing this is by running an A/B test. A/B tests split users into two groups, serving one group recommendations from one recommender (A) and the other group recommendations from another recommender (B). By putting two recommenders up against one another and measuring user interactions, it’s possible to tell how much difference there is between them. It’s good practice to run statistical tests against the user interaction logs; these tell you how much confidence you can have that one recommender is better than the other, depending on how different their results are and how many interactions were recorded.
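The click-through rate comparison and the statistical test can be sketched as follows. The click and impression counts are invented for illustration, and the two-proportion z-test is just one common choice of significance test for comparing rates, assuming both buckets are large enough for the normal approximation to hold:

```python
import math

def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference between two click-through
    rates. Returns (z statistic, p-value). Assumes both buckets are
    large enough for the normal approximation to be reasonable."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis that A and B are equal.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative counts, not real data: each recommender was shown
# 10,000 times; A got 350 clicks, B got 410.
z, p = two_proportion_z_test(350, 10_000, 410, 10_000)
print(f"CTR A = {ctr(350, 10_000):.3f}, CTR B = {ctr(410, 10_000):.3f}")
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up numbers the p-value falls below the conventional 0.05 threshold, so we’d have reasonable confidence that B’s higher click-through rate isn’t just noise; with smaller buckets the same difference in rates might not reach significance.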
In future posts, we’ll look at each of these methodologies in detail and give practical examples of how to evaluate recommenders using them.