Tuesday 9 April 2013

Reddit's "best" comment scoring algorithm as a multi-armed bandit task


  User-based scoring is a great way to automatically moderate user comments; many websites feature some form of upvote/downvote, digg/bury, agree/disagree, or happyface/sadface voting scheme. This post is about reddit's comment system, which I happen to be quite fond of (albeit for a multitude of reasons).
  The simplest, and most popular, form of moderation on such websites is to rank by a simple difference, score = #upvotes - #downvotes (hereafter denoted \(U\) and \(D\)).  However, this scoring function significantly over-weights comments that have a high number of views, typically comments posted within the first hour of a submission.  This drawback was the motivation for the "best" ranking on reddit, which was developed from a suggestion by xkcd's Randall Munroe, described in this blog entry.  Fundamentally, this scoring function uses the ratio \(U/(U+D)\) rather than the difference, and therefore should not change significantly as we collect more samples, at least in theory. Additionally, this ranking approach subtracts a confidence estimate from each comment's score, a penalty which naturally tightens as more samples are collected.  Intuitively, this approach ranks comments by the value "we're pretty sure it's at least this good" (with 95% confidence).  In this post, I point out that this ranking task is very similar to the multi-armed bandit problem, and that there are a number of alternative approaches developed in this area that likely outperform this lower-bound approach. As a bonus, the final algorithm I advocate is extremely simple: each time the comments page is loaded, the score for each comment \(i\) is sampled from a \(Beta(1+U_i,1+D_i)\) distribution, and comments are then ranked by this sampled score in descending order.  Before I get to why this might be a good idea, let's state the problem a bit more formally.
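
  To make the lower-bound idea concrete, here is a minimal Python sketch of the Wilson-score lower bound on the upvote ratio (roughly the construction behind the "best" sort; the function name and defaults are my own, and this is not reddit's actual code):

    import math

    def wilson_lower_bound(upvotes, downvotes, z=1.96):
        # Lower bound of the Wilson score interval on the upvote ratio;
        # z = 1.96 corresponds to roughly 95% confidence.
        n = upvotes + downvotes
        if n == 0:
            return 0.0
        p = upvotes / n
        return (p + z * z / (2 * n)
                - z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)

    # "At least this good": 9:1 already beats 1:0, but not an established 800:100.
    print(wilson_lower_bound(9, 1), wilson_lower_bound(1, 0), wilson_lower_bound(800, 100))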


  We model each comment's votes using a Bernoulli random variable; that is, given \(K\) comments we assume that each comment \(i\) has some fixed underlying probability, \(\mu_i\), that a user will click an upvote (we don't count non-votes, so \(1-\mu_i\) gives the probability of a downvote). We cannot, of course, directly observe the \(\mu_i\) values, but we can infer them from user actions: we assume we observe samples of the random variable \(X_{i,t} \in \{1,0\}\) at each time step (user click) \(t \in \{1,...,n_i\}\) (\(X=1\) for an upvote).  While it's clear that the most likely value for \(\mu_i\) is given by the empirical mean \(\hat{\mu}_i = \frac{1}{n_i}\sum_t X_{i,t}\), this value alone does not take into account the uncertainty we have about the estimate, i.e. the number of samples.  A very natural way of modelling this uncertainty is to use Bayesian inference, which allows us to consider a (posterior) distribution over all possible values of \(\mu_i\).

Bayesian model

  The posterior distribution over a Bernoulli parameter is, conveniently, given by the Beta distribution, whose parameters \(\alpha\) and \(\beta\) relate directly to the number of successes and failures respectively, plus the prior parameters. For example, consider a comment with true parameter \(\mu_i = \frac{3}{4}\): we begin with an uninformed (flat) prior (\(\alpha=1,\beta=1\)), shown in the leftmost figure; after observing 3 upvotes and 1 downvote we have the middle figure; and after 30 upvotes and 10 downvotes we have the rightmost plot.  Intuitively, these plots reflect our "belief" over the unobserved parameter. 

Density plots from left to right: Beta(1,1), Beta(4,2), Beta(31,11)

  There are two important advantages to this Bayesian formulation. First, the uncertainty around the unknown parameter reflects the number of samples collected, as seen in the figure above. Second, we can easily add in prior information and see how it affects our posterior beliefs; for example, if we are fairly sure that comments have a 5:1 upvote/downvote ratio we may want to set our prior parameters to (\(\alpha=5,\beta=1\)), or even (\(\alpha=50,\beta=10\)) if we are quite certain of this. This requires that a comment receive significantly more samples before its posterior can drift away from the prior mean. 
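
  As a quick illustration, here is a small scipy sketch of the posterior update (the posterior helper and its parameter names are just for illustration); it computes the three densities plotted above and shows how a confident 5:1 prior damps a single downvote:

    from scipy import stats

    def posterior(upvotes, downvotes, prior_a=1.0, prior_b=1.0):
        # Beta posterior over the upvote probability, given a Beta(prior_a, prior_b) prior.
        return stats.beta(prior_a + upvotes, prior_b + downvotes)

    # The three densities plotted above: a flat prior updated with (0,0), (3,1), (30,10) votes.
    for u, d in [(0, 0), (3, 1), (30, 10)]:
        post = posterior(u, d)
        print((u, d), round(post.mean(), 3), post.interval(0.95))

    # A confident 5:1 prior (alpha=50, beta=10) barely moves after a single downvote.
    print(round(posterior(0, 1, prior_a=50, prior_b=10).mean(), 3))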

  The current "best" ranking system can be seen as similar except that it incorporates a level of uncertainty around the empirical ratio in the form of a confidence region, assuming the statistic in normally distributed.  However, this assumption is not likely to hold in practice, especially
 since the comments we care about are like have on the order of 5, 10, 20 votes, far from the number required to invoke the central limit theorem.  In addition to the fact that the true parameter is defined only on \((0,1)\). 

Online learning and multi-armed bandits

  Online learning problems, such as the stochastic multi-armed bandit (MAB) problem, seem to be increasingly popular amongst today's "data scientists".  For example, a number of well-known engineers are making an effort to bring these methods to the mainstream, such as Ted Dunning at MapR and Steve Scott at Google.  In general, the MAB problem setting is widely used because it is extremely simple yet exhibits the practical challenges of online learning, most notably the exploration vs. exploitation dilemma.  I will mostly skip the background on online learning; if you are new to the topic you may want to check out Sébastien Bubeck's recent survey paper, Ted Dunning's blog post, or this blog entry I saw recently on /r/machinelearning.
  
  The stochastic MAB problem setting is typically described by the gambling analogy for which the domain is named.  Imagine you are in a casino with \(K\) different slot machines (a.k.a. one-armed bandits), and you must choose, sequentially, which slot machine to play at each time step \(t\) until a finite stopping time (\(n\) plays) has been reached; \(n\) is not known to you beforehand.  Your objective, as the gambler, is to devise a policy for choosing slot machines (arms) that maximizes the total winnings in expectation.  Again, we assume that the payoff of each machine \(j\) is a Bernoulli-distributed random variable \(X_j\) with unknown mean \(\mu_j\).  The challenge in this task is that you, the gambler, must always weigh the decision of playing the best machine in hindsight (\(\arg\max_j \hat{\mu}_j\)) (exploiting) against that of improving the current estimates of the unknown means (exploring). 

  First, some quick notation: let \(T_j\) denote the total number of pulls from arm \(j\), so that \(\hat\mu_j = \frac{1}{T_j}\sum_t X_{t,j}\) denotes the empirical mean, and let \(U_j = \sum_t X_{t,j}\) denote the number of "successes" so far and \(D_j = \sum_t (1-X_{t,j})\) the number of "failures".  Additionally, we let the asterisk denote the optimal (highest-paying) arm; for example, \(i^*\) gives the index of the best arm, \(X^*\) denotes the random variable for the payoff of arm \(i^*\), \(\mu^*\) stands for \(\mathbb{E}[X^*]\), and so on.

  Next, we consider the following three approaches (a small code sketch follows the list):



  • \(\epsilon\)-greedy: at each time step, with probability \((1-\epsilon)\) play the best-performing arm in hindsight, otherwise choose an arm uniformly at random. 
  • UCB: at each time step, for each arm calculate the confidence bound \(c_j = \sqrt{\frac{2\log t}{T_j}}\), then choose arm \(\arg\max_j \hat{\mu}_j + c_j\).
  • Thompson Sampling: at each time step, for each arm sample from the posterior \(\bar{\mu}_j \sim Beta(1 + U_{j}, 1 + D_{j})\), then choose arm \(\arg\max_j \bar{\mu}_j\).
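
  As a rough sketch, these three selection rules might look as follows in Python, assuming arrays U and D hold the per-arm success and failure counts and t is the current time step (toy code for the simulation below, not a production implementation):

    import numpy as np

    rng = np.random.default_rng(0)

    def eps_greedy(U, D, t, eps=0.01):
        # With probability eps explore uniformly at random, otherwise exploit the
        # arm with the highest empirical mean so far.
        if rng.random() < eps:
            return int(rng.integers(len(U)))
        mu_hat = U / np.maximum(U + D, 1)
        return int(np.argmax(mu_hat))

    def ucb(U, D, t):
        # Empirical mean plus an upper-confidence bonus; unplayed arms get an
        # infinite bonus so each arm is tried at least once.
        T = U + D
        mu_hat = U / np.maximum(T, 1)
        bonus = np.sqrt(2 * np.log(max(t, 2)) / np.maximum(T, 1e-9))
        bonus[T == 0] = np.inf
        return int(np.argmax(mu_hat + bonus))

    def thompson(U, D, t):
        # Sample a plausible mean for each arm from its Beta posterior and play
        # the arm with the largest sample.
        return int(np.argmax(rng.beta(1 + U, 1 + D)))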

  One of the most important aspects of this simple problem is that the policies can be analyzed mathematically, and researchers have been able to prove strict bounds on their performance.  To do so, we first define the notion of regret: if we let \(X_t\) denote the payoff received at time \(t \in \{1,...,n\}\), the regret is $$ R_n = n\mu^* - \sum_t X_t.$$  This value can simply be thought of as the best possible payoff that could be attained minus the payoff that was actually obtained; naturally, we seek to minimize this quantity.  Interestingly, [Lai and Robbins, 1985] showed that for any possible allocation scheme the regret is lower bounded by \(\Omega(\log n)\).  

In regards to the three simple algorithms above, we can first observe that, since \(\epsilon\)-greedy takes a suboptimal arm with fixed probability at each time step, its regret must grow linearly in \(n\), and is therefore much worse than the theoretical best.  Happily, both UCB [Auer et al., 2002] and Thompson sampling [Kaufmann et al., 2012] have been shown to have regret of \(\mathcal{O}(\log n)\), which implies that these algorithms are optimal to within constant factors. Additionally, in the same paper Auer et al. present an \(\mathcal{O}(\log n)\) version of the \(\epsilon\)-greedy approach in which \(\epsilon\) decays at a rate of \(\frac{cK}{d^2t}\), where \(d\) is a lower bound on the difference between the best and worst arms, and \(c\) is a fixed exploration constant. 
 Although these algorithms share the same order, their performance has been shown to be quite different in empirical studies such as [Chapelle and Li, 2011]. As a simple illustration, consider the plot below, which shows the cumulative regret for these methods in a simple scenario where \((K=100, \mu_1 = 0.9, \mu_{2:K} = 0.1, n=10^6)\). 
Cumulative regret for \((\epsilon=0.01)\)-greedy (green), \((\epsilon=\frac{5K}{(0.8)^2t})\)-greedy (magenta), 
UCB (blue), and Thompson sampling (red)
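
  The toy scenario is straightforward to reproduce; the sketch below reuses the selection rules above and tracks the cumulative pseudo-regret (the per-step gap between \(\mu^*\) and the mean of the arm actually played) over a shorter horizon, so the exact numbers will differ from the plot:

    def run(policy, mu, n=100000):
        # Simulate one bandit run and return the cumulative pseudo-regret,
        # i.e. the sum over t of (mu_star - mu of the arm actually played).
        K = len(mu)
        U = np.zeros(K)
        D = np.zeros(K)
        best = mu.max()
        regret = 0.0
        for t in range(1, n + 1):
            j = policy(U, D, t)
            reward = float(rng.random() < mu[j])  # Bernoulli payoff for the chosen arm
            U[j] += reward
            D[j] += 1.0 - reward
            regret += best - mu[j]
        return regret

    # Toy scenario from the plot: one good arm among 100 (shorter horizon than n=1e6).
    mu = np.full(100, 0.1)
    mu[0] = 0.9
    for name, policy in [("eps-greedy", eps_greedy), ("UCB", ucb), ("Thompson", thompson)]:
        print(name, run(policy, mu))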

  Additionally, studies have shown that even the Bayes-UCB approach, which uses confidence bounds constructed from the Bayesian posterior, is outperformed (empirically) by Thompson sampling [Kaufmann et al., 2012]. The important conclusion in all of this is that posterior-sampling approaches have advantages, both theoretically and empirically, over confidence-bound approaches, at least when it comes to the stochastic MAB problem.

  It is possible to implement a lower-confidence-bound algorithm, like the one used in the current reddit implementation; however, such an approach clearly has a non-zero probability of sticking with a suboptimal arm forever, and since the regret in that case grows linearly, the expected regret must be \(\mathcal{O}(n)\). In fact, in the simple tests I ran on the toy problem above, I found this algorithm to be extremely exploitative: it would very quickly converge to pulling one of the 99 suboptimal arms forever, and it was much, much worse than \(\epsilon\)-greedy.  This lack of exploration would, in a way, be mitigated by the fact that users actually see a full list of comments rather than just the top one, but this is not enough to bring the algorithm anywhere close to the performance of, say, Thompson sampling. 
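
  For reference, a lower-confidence-bound selection rule of this flavour looks roughly like the sketch below (same conventions as the listings above, and only a stand-in for reddit's Wilson-based score); note that an arm with no pulls gets a hugely pessimistic score, so it may never be explored at all:

    def lcb(U, D, t):
        # Rank by "we're pretty sure it's at least this good": empirical mean minus
        # a confidence term.  An unplayed arm gets a hugely pessimistic score, so
        # once some arm looks decent the rest may never be explored.
        T = U + D
        mu_hat = U / np.maximum(T, 1)
        penalty = np.sqrt(2 * np.log(max(t, 2)) / np.maximum(T, 1e-9))
        return int(np.argmax(mu_hat - penalty))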

Sample-based comment ranking (my proposal)

  The comment-scoring problem on reddit is slightly different from the basic setting described above, because we are not actually choosing a single comment to present to the user (pulling one arm) but instead producing a ranking of comments.  There are some interesting research papers on modelling this problem precisely, for example [Radlinski et al., 2008], but it turns out to be a (hard) combinatorial optimization problem.  Rather than going down this complex modelling path, one can simply rank all of the \(\bar{\mu}_i\) samples instead of taking just the max; this gives us a full ranking and, since the max is still at the top, is unlikely to adversely affect the convergence of the algorithm.  

  Again, the resulting ranking algorithm is quite straightforward: each time the comments page is loaded, the score for each comment is sampled from a \(Beta(1+U,1+D)\) distribution, and comments are then ranked by this score in descending order.
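
  A minimal sketch of this ranking, with the prior parameters exposed (they become relevant in the next paragraph), might look like:

    import numpy as np

    def rank_comments(upvotes, downvotes, prior_a=1.0, prior_b=1.0, rng=None):
        # Return comment indices ordered by a single Beta(prior_a + U, prior_b + D)
        # sample per comment, best sampled score first.
        rng = rng or np.random.default_rng()
        U = np.asarray(upvotes, dtype=float)
        D = np.asarray(downvotes, dtype=float)
        sampled_scores = rng.beta(prior_a + U, prior_b + D)
        return np.argsort(-sampled_scores)

    # Three comments with (U, D) = (80, 20), (9, 1) and (0, 0); the fresh comment
    # occasionally lands above the established ones.
    print(rank_comments([80, 9, 0], [20, 1, 0]))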

  This randomization has a unique benefit in that even untouched comments (U=1, D=0) have some chance of being seen, even in threads with 5000+ comments (something that is not happening now), but, at the same time, the user is not likely to be inundated with rating these new comments.  In the event that new comments appear too frequently, one can easily modify the prior to reflect our beliefs about comment scores.  Interestingly, this prior could be tuned in various ways: by the user directly, from either user's past voting percentage (the poster's or the viewer's), as some linear function of whatever "features" we may have for the users, and so on.   Lastly, the general idea of randomly sampling users to extract information relates naturally to the age-old idea of polling, in that both rely on the Monte Carlo principle that only a very small number of individuals must be sampled in order to get a good estimate of a statistic. 


References

[Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.
[Chapelle and Li, 2011] Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS).
[Kaufmann et al., 2012] Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory (ALT).
[Lai and Robbins, 1985] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
[Radlinski et al., 2008] Radlinski, F., Kleinberg, R., and Joachims, T. (2008). Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning (ICML). ACM.

Comments welcome, or tweet to me @jamneuf.


7 comments:

  1. I haven't had a chance to really work through it, but from what I've seen so far, this looks like a really good take on the problem! I appreciate your suggestions and will pass them along; I have a project I've been meaning to do which involves a very similar problem (nothing fancy, just an organizational thing for my own use), and I'll definitely incorporate your approach.

    Oh, and credit for the Wilson score algorithm reddit uses should go to Evan Miller —I just convinced reddit to use it.

    (Sorry for the deleted comment; it told me I'd double-posted).

  2. Very cool. Minor nitpick: when people have analyzed the attainable performance of the e-greedy algorithm, they typically consider an annealing variant with O(log T) performance. See, for example, the quick discussion in "Multi-Armed Bandit Algorithms and Empirical Evaluation" by Vermorel and Mohri.

    Replies
    1. Thanks for the comment; fwiw, I added the e-decreasing version with c=5 to the graph. The performance is no longer linear but still not really competitive with TS. Although setting c=1 gets it quite a bit closer (better than UCB), the algorithm then has a much higher chance of staying with a suboptimal arm for a long while at the start, which is hard to show in a plot like this. Additionally, this version requires us to know a lower bound, d, on the best-worst arm gap in order for the O(log n) result to hold, and putting in a small value to be extra safe is just like turning up the exploration constant a bunch. Also, in general, the e-greedy exploration is completely agnostic of the variance (or even the number of pulls) for each arm and, as such, it cannot target its exploration nearly as effectively as TS or even UCB. You can tell I'm not really a fan :)

  3. The approach - each time the comments page is loaded, the score for each comment is sampled from a Beta(1+U,1+D) and comments are then ranked by this score in descending order - could cause some confusion, no?
    With each page load, unseen comments could come in and replace others. If they came in at the top (or bottom) it's understandable, but having something come in somewhere in the middle wouldn't work well with the user's mental model. Am I mistaken?

    Replies
    1. There is certainly a question about what the user's mental model is, whether it's the same for each user, and whether we should (or even can) attempt to model it. I've assumed/argued that the up/down ratio is the right way to go, so we will definitely see a 9:1 comment appear ahead of an 800:100 comment, regardless of whether we are sampling or not. Whether this makes sense to the user may be up for debate, but people have the option of selecting the "best" ranking method either way.

      With respect to the sampling: first, a comment will only appear in a place where there is some reasonable chance that it actually belongs there. Secondly, and it's worth experimenting with yourself, in practice comments converge to roughly the correct parameters very quickly, after even 2-3 votes, so they will only potentially appear out of place for a small number of users. This is simply the sacrifice we have to make if we are to manage explore/exploit properly; there is, provably, no better solution.

      Lastly, a deterministic approach would either put all the new comments (1:0) at the top or all at the bottom (the latter describes the current system). Neither of these two solutions works well: either users will not like the ranking algorithm because they are always spammed with new comments, or the new comments are never seen and, consequently, never voted on.

  4. Hey, I'd love to get your take on the problem I mentioned. If you're interested, email me at randall@xkcd.com -- thanks!

  5. Nice article. I like the ideas but I have a few points/comments:

    1. There's an inherent assumption in this article that Reddit wants to maximize the number of upvotes in the comments (which is what you would get if you put the "best" article at the top). I imagine that's not actually true. Significant randomization may make it difficult for long comment threads to emerge, something that may be valuable for user engagement on Reddit. I could also tell a plausible story where the probability that a user stops reading comments is affected by the quality of the last comment they read, and if you randomize a poor article high into the ranking, I may just quit right then and never even see the better articles.

    2. Another implicit assumption is that votes are independent (a necessary condition for all the methods you listed), but they're not. In fact, the point of the recent "hide the score" feature for comments was to remove one signal that votes were being correlated on. I would imagine the correlation between votes isn't that bad, but I don't think it should be ignored.

    3. For the proper setting of epsilon, epsilon-greedy does achieve log T regret (the Auer paper you cited shows it). In addition, epsilon-greedy has a lot of other desirable properties (it's computationally easier than UCB, and it works better in combinatorial settings when run with other epsilon-greedy instances), so I wouldn't discount it.

    4. About the "normal" assumption for the best ranking; the normal assumption is assuming that the prior of bernoulli parameters is normally distributed, not that the votes on a given article will eventually converge to something normal looking. Therefore, you dont have to invoke the central limit theorem to get that assumption to be realistic, so the comment that a small number of votes is not enough for the CLT is not exactly accurate.


    It is a nice idea to apply the standard "explore-exploit" framework to comment ranking on Reddit, but I'm not really convinced that it solves the appropriate maximization problem. I have a strong feeling that introducing a strong amount of randomization might draw a negative response from the user community.
