Testing Recommenders

Why Test?

When I met fellow GroupLens alum Sean McNee, he had a bit of advice for me: Write tests for your code.

It took me some time to grasp the wisdom of this (after all, isn’t it just research code?), but testing has been a very valuable tool in our development of the LensKit recommender toolkit. In this post, I’ll go through some of the different ways we employ automated testing in LensKit development; many of our approaches should be widely applicable in recommender development and other data science applications.

Challenges of Testing

While testing software is widely regarded as a Good Idea, recommender systems (and similar information processing and machine learning systems) have some interesting challenges that make extensive testing difficult.

  • The algorithms are complex, making it difficult to manually compute correct solutions even for relatively simple data sets.
  • Many algorithms are nondeterministic, depending on random initialization or changing as the input data shifts subtly.
  • Algorithms can have costly run times, making testing on many inputs and outputs an expensive proposition.

Manageable Testing for Recommender Systems

In regular development of LensKit, we use a multi-pronged testing approach to help maintain code quality. Together, these testing methods give us a good degree of confidence in our code, and we are regularly adding new tests as we grow LensKit’s feature set and fix bugs that arise.

The first, most basic testing we do is unit testing of discrete components. Data structures, utility code, and individual, self-contained computations such as mean and similarity functions are the most heavily tested in this regard. We also have extensive unit tests for many of our infrastructural pieces, and have recently retooled our evaluation metric API to make evaluation metrics easier to test. The goal of this testing is to make sure that the individual components that make up our recommender algorithms are solid, to avoid bugs that arise from tricky data structure errors: if the pieces are sound, the final construction is more likely to be correct. We’re using JUnit 4 for all of this.
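To give a concrete (if simplified) picture, here is a minimal JUnit 4 test in the spirit described above. The mean helper is a stand-in for the kind of small utility these tests target, not LensKit’s actual API.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class MeanTest {
        // Stand-in for the kind of small utility the unit tests cover;
        // the real LensKit class and method names differ.
        static double mean(double[] values) {
            double sum = 0.0;
            for (double v : values) {
                sum += v;
            }
            return sum / values.length;
        }

        @Test
        public void meanOfKnownRatingsIsCorrect() {
            // Hand-computed expected value: (1 + 3 + 5) / 3 = 3
            assertEquals(3.0, mean(new double[]{1.0, 3.0, 5.0}), 1.0e-6);
        }
    }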

We’ve also been adding some randomized testing, and taking first steps toward property-based testing in the style of the QuickCheck framework, for some of our data structures and utilities.
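As a sketch of what such a property test can look like, here is a small example assuming the junit-quickcheck library (one Java port of QuickCheck); the mean function is again a toy stand-in, and the property simply states that the mean of n copies of a rating is that rating.

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;

    import org.junit.runner.RunWith;

    import com.pholser.junit.quickcheck.Property;
    import com.pholser.junit.quickcheck.generator.InRange;
    import com.pholser.junit.quickcheck.runner.JUnitQuickcheck;

    @RunWith(JUnitQuickcheck.class)
    public class MeanPropertyTest {
        static double mean(double[] values) {
            double sum = 0.0;
            for (double v : values) {
                sum += v;
            }
            return sum / values.length;
        }

        // For many randomly generated (rating, n) pairs, the mean of an
        // array holding n copies of the rating should be the rating itself.
        @Property
        public void meanOfConstantArrayIsThatConstant(
                @InRange(minDouble = 1.0, maxDouble = 5.0) double rating,
                @InRange(minInt = 1, maxInt = 100) int n) {
            double[] values = new double[n];
            Arrays.fill(values, rating);
            assertEquals(rating, mean(values), 1.0e-6);
        }
    }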

We then conduct more complex integration tests of components whose behavior can be reasonably specified (such as our configuration infrastructure), and test the basic functionality of our build-system plug-ins and command-line tools. A number of these tests live in our JUnit suite; others are external tests run by the build system. This makes sure that the wiring that makes LensKit work passes its basic specifications.

Whenever we find a bug in LensKit, we try to write a test (either a unit test or integration test) to check for it so that we can hopefully keep it from returning.

The next types of testing try to make sure that LensKit works as a recommender. End-to-end smoketests make sure that we can successfully train and/or run a recommender on real data without crashing. For this, we typically use the MovieLens ML-100K data set, as it is large enough to produce interesting results but small enough to run in a reasonable time frame. For these tests, we don’t examine the outputs; we just make sure that the code runs.
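The sketch below shows the general shape of such a smoketest. The command it runs is a placeholder (the real tests drive LensKit’s own APIs and tools, and the exact commands and paths depend on the project setup); the only assertion is that the run finishes cleanly.

    import static org.junit.Assert.assertEquals;

    import java.io.File;

    import org.junit.Test;

    public class TrainSmokeTest {
        @Test
        public void trainingOnMl100kDoesNotCrash() throws Exception {
            // Placeholder command; substitute the actual recommender
            // training invocation and the path to the ML-100K data.
            ProcessBuilder pb = new ProcessBuilder(
                    "./train-recommender.sh", "ml-100k/");
            pb.redirectErrorStream(true);
            pb.directory(new File("."));
            Process proc = pb.start();
            int exitCode = proc.waitFor();
            // We do not inspect the output; a zero exit code is enough.
            assertEquals(0, exitCode);
        }
    }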

Threshold-based integration tests run cross-validated accuracy estimates of our recommender algorithms on the ML-100K data set. Based on prior runs, we have set a target value for key metrics, such as RMSE, and fail the test if the algorithm’s accuracy is more than a set tolerance away from that target. The tolerance is somewhat wide, as the randomization in the crossfold procedure introduces variance in the measurements; for example, we check that the RMSE of a typical item-item configuration is in the range 0.85-0.95. These checks don’t verify the correctness of the algorithms (or the evaluator) in a strict sense, but they do detect when some change significantly alters an algorithm’s output. This helped catch one bug that cropped up in some evaluator revisions: after making changes, the accuracy was suddenly far better than it had been previously. Further investigation found that a typo caused us to train the recommender on the test data; the threshold tests let us catch the bug before shipping it to users.
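As an illustration, a threshold check of this kind can be as simple as the following sketch. Here runCrossfoldEval is a hypothetical helper standing in for the real evaluation run, and the numbers mirror the 0.85-0.95 range mentioned above.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class ItemItemThresholdTest {
        @Test
        public void itemItemRmseIsInExpectedRange() {
            double rmse = runCrossfoldEval("item-item", "ml-100k");
            // Target RMSE of 0.90 with a wide +/- 0.05 tolerance, since the
            // randomized crossfold introduces run-to-run variance.
            assertEquals(0.90, rmse, 0.05);
        }

        private double runCrossfoldEval(String algorithm, String dataset) {
            // Hypothetical helper: wire this up to the actual crossfold
            // evaluation that measures RMSE for the given configuration.
            throw new UnsupportedOperationException("not wired up in this sketch");
        }
    }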

We also regularly run static analysis of our code base using SonarQube; fixing issues found by the linter has uncovered real bugs at times.

And finally, we run continuous integration to apply all of these tests to new code. Every time a pull request is submitted to GitHub, Travis and AppVeyor run all of our tests (except the static analysis) on the new code on both Linux and Windows, across all supported Java versions. Even though the code is in Java, running tests regularly on Windows is important due to subtle differences in file system behavior. Static analysis is automatically run after the code is merged (due to limitations in SonarQube’s support for pre-merge analysis).

Future Directions for LensKit Testing

There are several ways that we are looking to improve our testing:

  • Improving test coverage of existing code (particularly unit tests).
  • Writing additional integration tests, with a particular eye towards testing on additional data sets.
  • A repeatable experiment to run with each new LensKit version, to produce up-to-date charts of the accuracy that various LensKit algorithms and configurations achieve on widely-used data sets, and to let us more easily detect problems such as one version of the software failing to correctly use the neighborhood size in the recommender.

Conclusion

Software testing is hard work, and testing complex non-deterministic systems is particularly hard. This portfolio of techniques has helped us catch a number of LensKit bugs and, overall, helps us maintain software quality. Hopefully it is also useful to you in keeping bugs at bay in your own systems.

Testing is also worthwhile even in straight research code that will never see production. Time spent writing tests is time I find I don’t have to spend second-guessing the correctness of the computations producing my research results.
