We at News360 have been working on news personalization for a long, long time now. The idea of tracking user interests based on reading behaviour, and using that data to recommend articles, sources and other things is very appealing – especially if you already have a lot of the basic building blocks with named entity extraction.
The obvious, naive idea is to just build a classification system. You take a bunch of “events” that tell you something about a user’s relationship with an article – for example, opening an article is a weakly-positive event (the user probably liked the headline they clicked on, so the content is likely to be relevant to them), closing an article immediately after opening it is a strongly-negative event, scrolling to the end of the text is strongly-positive, and so on. And, of course, you can give users tools for explicit feedback on how much they liked an article, and what specifically they liked about it.
Once you have the events set up, it’s pretty trivial to track them for every user and train a classifier on all the positive/negative events the user has generated. A classifier needs features to connect those events to – you can use traditional bag-of-words approaches, or augment them with lists of named entities, sources, categories, tags and so on. Once the classifiers are trained, you look at the stream of new content and see which articles are likely to generate a positive event for a particular user, based on how their features coincide with the user’s interest graph.
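To make the mechanics concrete, here’s a rough sketch of that event-tracking and scoring loop in Python. The event names, their weights, and the feature labels are illustrative assumptions rather than our actual production values, and a real system would use a proper trained classifier instead of simple weight sums:

```python
from collections import defaultdict

# Hypothetical event weights -- illustrative assumptions, not real values.
EVENT_WEIGHTS = {
    "open": 0.3,           # weakly positive: the headline earned a click
    "bounce": -1.0,        # strongly negative: closed immediately after opening
    "read_to_end": 1.0,    # strongly positive: scrolled through the whole text
    "explicit_like": 1.5,  # explicit user feedback
    "explicit_dislike": -1.5,
}

def update_profile(profile, article_features, event):
    """Fold one (article, event) pair into a user's feature-weight map."""
    weight = EVENT_WEIGHTS[event]
    for feature in article_features:
        profile[feature] += weight

def score(profile, article_features):
    """Estimate how likely an article is to produce a positive event."""
    return sum(profile.get(f, 0.0) for f in article_features)

# Features mix bag-of-words terms with named entities and categories.
profile = defaultdict(float)
update_profile(profile, {"entity:Apple", "category:tech", "word:iphone"}, "read_to_end")
update_profile(profile, {"entity:Apple", "category:tech", "word:stock"}, "bounce")

print(score(profile, {"entity:Apple", "word:iphone"}))  # → 1.0
print(score(profile, {"entity:Apple", "word:stock"}))   # → -1.0
```

The namespace prefixes (`word:`, `entity:`, `category:`) are just one way to keep different feature types from colliding in a single feature space.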
On its own, though, this naive approach usually doesn’t work too well:
- It takes too long to train – the user must read hundreds of articles before you can get an accurate classifier going
- It tends to regress on itself – users are heavily biased to read what you show them. If all you give them are articles on tech gadgets, then all the positive/negative feedback you get will be on tech gadgets. And once you establish classifiers that filter content for a particular user, they become self-reinforcing. It’s a bit like finding a local minimum/maximum of a function and mistaking it for the global one.
- It’s not transparent – when a recommendation turns out to be wrong, it’s difficult to explain exactly why it was made, since multiple feature vectors usually contributed to the decision.
There are, however, several key positive ideas in this approach:
- Requires little attention from the user – as long as the user keeps reading, the system should get more accurate.
- Allows unified interest graphs – all users have interest graphs that exist in the same space, which means you can easily compare two users to see how “similar” their interests are. If you run the learning algorithm on all content from a single source, you get that source’s interest graph, which can then be compared against users’ interest graphs to determine how useful the source will be for each of them.
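Because all interest graphs live in the same feature space, comparing two of them reduces to a standard vector-similarity measure. Here’s a minimal sketch using cosine similarity over sparse feature-weight maps (the user and source profiles below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse interest graphs (feature -> weight)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# A user's interest graph and a source's interest graph, built the same way.
alice = {"entity:Tesla": 2.0, "category:tech": 1.0}
techcrunch = {"entity:Tesla": 1.0, "category:tech": 3.0, "entity:Google": 2.0}

print(round(cosine_similarity(alice, techcrunch), 3))  # → 0.598
```

The same function works for user-to-user comparison, which is what makes a single shared feature space so convenient.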
So, if our task is to create an effective personalization system, we must compensate for the negatives of this approach, and reinforce the positives. We’ll take a look at possible ways to do this in our next post.