If you've worked much with recommender systems you'll know that offline evaluation of recommendation algorithms can bring on a serious headache in no time. It was nice of Netflix to offer researchers $1M to improve their recommender system, but I suspect many of them would have taken part even without such a serious prize, simply because the competition format means that someone else has to worry about preparing training and test data and choosing a success measure, not to mention the pure mechanics of running evaluations over and over again.
With the benefit of hindsight it's clear that rating prediction accuracy isn't a great way to evaluate a recommender system: it isn't applicable at all in most domains, where people can't or won't give ratings; it's a poor predictor of the relevance of the top few recommendations actually shown to users, which are what they care about; even if you can measure relevance, it's far from the only thing that influences uptake; and, last but not least, small changes to your user interface or improvements to your input data are both likely to have a much bigger impact than anything you do to fix up your algorithm. Consequently most practitioners would much prefer to do online A/B testing, which tells you reliably whether the changes you've made actually have a positive impact on your bottom line.
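To make the contrast concrete, here's a minimal sketch of the two styles of measurement, assuming a trained `model` with hypothetical `predict(user, item)` and `recommend(user, k)` methods and a held-out `test` dict of `{user: {item: rating}}` (none of these names come from any particular library). A model that scores well on the first number can still look mediocre on the second.

```python
import math

def rmse(model, test):
    """Rating prediction accuracy: error over all held-out ratings."""
    se, n = 0.0, 0
    for user, ratings in test.items():
        for item, rating in ratings.items():
            se += (model.predict(user, item) - rating) ** 2
            n += 1
    return math.sqrt(se / n)

def precision_at_k(model, test, k=10, like_threshold=4.0):
    """Top-N relevance: fraction of the k items shown that the user actually liked."""
    hits, shown = 0, 0
    for user, ratings in test.items():
        liked = {item for item, r in ratings.items() if r >= like_threshold}
        recs = model.recommend(user, k)     # the list users would actually see
        hits += len(set(recs) & liked)
        shown += k
    return hits / shown
```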
But unless you work for a huge company, online testing is a limited resource, especially compared to the vast parameter space you might need to explore when setting up a new recommender system. And if you're an academic you may not be able to reach users easily, and in any case you need formal results to publish. So the need to do offline evaluation isn't going away any time soon.
Here are some of the pain points that I've bumped into recently:
- public datasets aren't enough for (easily) reproducible research: take a look at the results reported here, here and here, all based on the same dataset but with different experimental setups, which makes comparing against them a lot of extra work (there's a sketch of one way to pin down the split after this list).
- toolkits need to scale: MyMediaLite, for example, is a great tutorial resource for recommender systems, with clear, well thought out implementations of a host of algorithms, and evaluation code too, but it doesn't handle even moderately large datasets.
- baseline methods need to be reproducible too: I just came across this impressive paper by Jagadeesh Gorla, which introduces a genuinely novel approach to making recommendations and reports great results, but I can't come close to replicating its baseline figures; and if the baselines don't reproduce, how much should I trust the results for the new method?
- evaluation needs resources: generating full similarity matrices for many different parameter settings, or recommendations for all users in a big dataset, can be way more compute intensive than running a recommender service for real (one of the sketches after this list shows the kind of job I mean).
- Hadoop doesn't help much: map-reduce isn't usually a good fit for machine learning algorithms, and the large per-process resources most recommender systems need mean that you can't even resort to map-only jobs just to launch code in parallel.
- distributed processing without Hadoop is hard: I got excited about StarCluster and IPython.parallel, as in principle they make it dead easy to set up a Python cluster on AWS and run code on multiple worker processes, but in practice I've found that processes easily get into an unrecoverable state, meaning frustration and wasted time tearing down clusters and setting up new ones. I've made faster progress by running parallel processes on the single biggest machine I can find locally (the last sketch below shows roughly what that looks like).
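On the reproducibility point: a lot of the setup variation comes down to how the train/test split was made. Here's a minimal sketch of one way to pin it down, assuming a tab-separated `user item rating` file and a seeded per-user holdout; the file layout, test fraction and seed are all assumptions, not a standard.

```python
# Deterministic, seeded per-user holdout: anyone with the same raw file and the
# same seed gets an identical train/test split. Column layout is an assumption.
import random
from collections import defaultdict

def split_per_user(ratings_file, test_fraction=0.2, seed=42):
    by_user = defaultdict(list)
    with open(ratings_file) as f:
        for line in f:
            user, item, rating = line.rstrip("\n").split("\t")[:3]
            by_user[user].append((item, float(rating)))
    rng = random.Random(seed)
    train, test = [], []
    for user in sorted(by_user):        # fixed iteration order
        events = sorted(by_user[user])  # fixed order before shuffling
        rng.shuffle(events)
        cut = max(1, int(len(events) * test_fraction))
        test += [(user, i, r) for i, r in events[:cut]]
        train += [(user, i, r) for i, r in events[cut:]]
    return train, test
```

Publishing something this small alongside reported results would already remove a lot of the guesswork.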
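And on the resources point, this is the sort of job I mean: item-item cosine similarities over the whole catalogue, repeated for every parameter setting you want to try. A sketch, assuming `R` is an items-by-users scipy CSR matrix with L2-normalised rows; computing it block by block and keeping only the top-k neighbours per item at least keeps memory bounded.

```python
# Block-wise item-item cosine similarities, keeping only the top-k neighbours
# per item so the full n_items x n_items matrix is never materialised.
import numpy as np

def topk_similarities(R, k=50, block=1000):
    n_items = R.shape[0]
    neighbours = {}
    for start in range(0, n_items, block):
        stop = min(start + block, n_items)
        sims = (R[start:stop] @ R.T).toarray()   # dense (block x n_items) slab
        for row, item in enumerate(range(start, stop)):
            sims[row, item] = -np.inf            # drop self-similarity
            top = np.argpartition(sims[row], -k)[-k:]
            neighbours[item] = sorted(zip(top.tolist(), sims[row, top].tolist()),
                                      key=lambda t: -t[1])
    return neighbours
```

Multiply that by every parameter setting you want to sweep and the cost quickly dwarfs serving recommendations live.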
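Finally, the "one big machine" fallback in code: a minimal sketch using the standard library's multiprocessing pool, assuming the trained model is just a pickle exposing a hypothetical `recommend(user, k)` method and can be loaded once per worker process.

```python
# Fan recommendation generation out over local cores: each worker loads the
# model once, then users are streamed through the pool in chunks.
import pickle
from multiprocessing import Pool

model = None  # set once per worker by init_worker

def load_model(path):
    # assumption: the trained model was pickled to disk beforehand
    with open(path, "rb") as f:
        return pickle.load(f)

def init_worker(model_path):
    global model
    model = load_model(model_path)

def recommend_for_user(user):
    return user, model.recommend(user, k=20)

def recommend_all(model_path, users, processes=8):
    with Pool(processes, initializer=init_worker, initargs=(model_path,)) as pool:
        return dict(pool.imap_unordered(recommend_for_user, users, chunksize=1000))
```

No cluster to babysit, and if a run dies you just restart the script.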
What's your experience?
Are there some great tools and frameworks that I should know about? These issues aren't unique to recommender systems. Have they already been solved in your field? If not, what kind of tools would you like to see made available?
I've just been invited to give a keynote at the workshop on Reproducibility and Replication in Recommender Systems Evaluation scheduled for this year's ACM RecSys conference in October, and I'd love to be able to talk about solutions and not just problems.
Let me know what you think!