# Tutorial: What formula should be used to determine "hot" questions?

### Question:

Right now the front page Popular tab is fairly broken -- it's a simple descending sort by views. As Joel said in podcast #18, it is "a self-fulfilling prophecy." But this is not intentional, it's only because we haven't had time to improve it yet!

As I sit down to write a better algorithm, I thought you guys might have some insight into what would make a good formula to determine the "hot" questions -- over the last 24 hours, the last week, the last month, and all time.

Here are the variables we have to work with:

• Views
• Whether the question has an accepted answer or not
• Time question was originally asked
• Time of last activity on question

Joel cited the reddit formula as an example, but I think our site is a little different than just a basic link submission site. Plus, we have more variables and data to work with.

I found a blog post describing a few formulas:

Y Combinator Hacker News

(p - 1) / (t + 2)^1.5

p = votes (points) from users
t = time since submission in hours
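The Hacker News formula quoted above can be sketched in a few lines of Python. This is only an illustration of the quoted expression; the function name and the `gravity` parameter name are my own labels, not anything from the blog post.

```python
def hacker_news_score(points: int, hours_since_submission: float,
                      gravity: float = 1.5) -> float:
    """Quoted formula: (p - 1) / (t + 2)^1.5."""
    # Subtracting 1 discounts the submitter's own vote; the +2 keeps the
    # denominator well above zero for brand-new submissions.
    return (points - 1) / (hours_since_submission + 2) ** gravity


# The same item decays as it ages.
fresh = hacker_news_score(points=11, hours_since_submission=2)
stale = hacker_news_score(points=11, hours_since_submission=24)
```

Because the exponent is greater than 1, age eventually outweighs any fixed number of points, which is what keeps the list changing.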

Reddit

t = (time of entry post) - (Dec 8, 2005)

x = upvotes - downvotes
y = {1 if x > 0, 0 if x = 0, -1 if x < 0}
z = {1 if |x| < 1, otherwise |x|}

log(z) + (y * t)/45000
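A rough Python sketch of the Reddit expression above, under several assumptions that the quoted text does not spell out: x is taken to be upvotes minus downvotes, t is measured in seconds, the logarithm is base 10, the reference date is midnight UTC, and z is clamped to at least 1 so that the logarithm is always defined.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumption: the quoted "Dec 8, 2005" is treated as midnight UTC.
REFERENCE_DATE = datetime(2005, 12, 8, tzinfo=timezone.utc)


def reddit_hot(ups: int, downs: int, posted: datetime) -> float:
    x = ups - downs                      # assumed meaning of x
    y = (x > 0) - (x < 0)                # sign of x: 1, 0 or -1
    z = max(abs(x), 1)                   # clamp so log10 is defined
    t = (posted - REFERENCE_DATE).total_seconds()
    return math.log10(z) + (y * t) / 45000


# A post made 45000 seconds (12.5 hours) later needs ten times
# the net votes to rank level with an older one.
later = REFERENCE_DATE + timedelta(seconds=45000)
```

Note that time only ever adds to the score here; nothing decays, newer posts simply start from a higher baseline.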

Thoughts? Ideas? I don't want to overthink this; I just want the "hot" questions to be relevant and useful to us. So the simpler the algorithm the better!

Based on my analysis of the above and the comments so far, here's the second version of what I have implemented so far. This might suck. I don't know:

((log(Qviews)*4) + ((Qanswers * Qscore)/5) + sum(Ascores)) / (((QageInHours+1) - ((QageInHours - Qupdated)/2)) ^ 1.5)

Note that accepted answers weight not at all in hotness. This is intentional, as I feel accepted answers are a fine social contract, but not a good data point for question or answer quality.

This algorithm will heavily favor questions with LOTS of answers, as the sum(Ascores) are now included -- one assumes if there are lots of answers, there will be a lot more voting on the answers, too.
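For anyone who wants to play with the numbers, here is a minimal Python sketch of the formula as written above. It assumes a base-10 logarithm (the question does not say which base), assumes Qupdated is the number of hours since the last activity, and floors the view count at 1 so the logarithm is defined. The function and parameter names are mine.

```python
import math


def hot_score(q_views, q_answers, q_score, answer_scores,
              q_age_hours, q_updated_hours):
    """Numerator rewards views, answers and votes; denominator decays with age."""
    numerator = (math.log10(max(q_views, 1)) * 4
                 + (q_answers * q_score) / 5
                 + sum(answer_scores))
    # Recent activity pulls the effective age halfway back towards
    # the time of the last update.
    effective_age = (q_age_hours + 1) - (q_age_hours - q_updated_hours) / 2
    return numerator / effective_age ** 1.5


# 100 views, 5 answers, question score 10, answer scores 3 and 2,
# asked 3 hours ago with no activity since.
example = hot_score(100, 5, 10, [3, 2], 3, 3)
```

Note how the (Qanswers * Qscore) term multiplies two counts together, so it grows much faster than the logarithmic views term.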

Update: Note that this formula is what is used on the hot tab linked from the homepage of each site. It is not the formula used to determine the network hot list. See How do the "arbitrary hotness points" work on the new Stack Exchange home page?

### Solution:1

I don't think that reputation of participants should be considered. As the beta has clearly shown, "reputable" people can have irrelevant discussions.

I believe it would be better to just consider amount of activity, recentness of activity, and total upvotes minus downvotes in discussion.

### Solution:2

Rather than jumping right to the formulas, I think it would be helpful to think about what kind of popularity you're trying to measure. Stackoverflow is a user-generated content site, so popularity should be largely governed by user actions. Most popularity driven sites (Digg, reddit) want their list to change regularly so people have a reason to come back to the site every day (or hour). They usually do this by incorporating some sort of temporal decay. Assuming that this is what you want when you say you want a popularity metric, you pretty much inevitably end up with a formula that is some variation on (user activity) / (time)

So how to measure user activity? Well you listed some possible inputs in the question.

• Votes [aka Score] - Probably the best popularity metric available.
• Views - A pretty poor substitute for votes. A view from someone who clicks on a link, spends ten seconds reading and decides this is boring gets counted just as much as a view from someone who reads the entire question and every answer.
• Answers - What are you trying to accomplish with the popularity page? Do you want it to be a tool to call attention to unanswered questions, in which case questions with fewer answers should be rated higher? Or do you want to call attention to good answers, in which case more answers should be rated higher? Perhaps there should be a default popularity (no answer-based weighting) with tabs for most popular unanswered questions and most popular questions with accepted answers.
• Whether the question has an accepted answer or not - Same concerns as answers.
• Time question was originally asked - If you incorporate the time originally asked in the denominator, the popular list will incorporate only recently asked questions. It will be all but impossible for an old question which sees a burst of activity to show up.
• Time of last activity on question - Using the time since the last activity makes it possible for old questions to show up again, but it also would make it possible for someone to game the system and keep a question they like "popular" for a long time, they just have to keep doing minor things to keep the time since last activity low.
• Reputation of asker - Given the huge disparities in reputation that have arisen just in the beta period, it seems like using this will mean that newbies will never appear in the most popular list, regardless of how good their question is or how much interest it attracts.
• Reputation(s) of answerers - Similar concerns to the asker's reputation, a question wouldn't show up on the list until someone with a lot of rep answered it.

I think number of votes is pretty clearly the best popularity metric. Number of answers/accepted answer could be incorporated depending on what you want to emphasize (I think a separate tab for popular unanswered questions would be a great addition).

I'm not really enamored with either of the time alternatives you listed. Instead, I'm thinking the best way to incorporate time would be to use it as a sampling window, rather than dividing popularity by some time measure. Look at the number of votes/views/answers in the last hour/day/week rather than dividing by some sort of time measure. This disadvantages questions that have just been posted, of course, since they haven't had a full hour/day/week to accumulate votes. You could weight the votes based on how long it's been since the question was posted up to the sampling window.

So, I would propose something like:

Popularity = (Number of Votes in last 24 hours) / (Hours since question was posted or 24, whichever is less)
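In Python, the proposal above might look like the sketch below. The one-hour floor on the divisor is my own assumption, added so that a question posted a few seconds ago does not get an enormous score from a single vote.

```python
def windowed_popularity(votes_in_last_24h: int, hours_since_posted: float) -> float:
    """Votes inside the sampling window, divided by the part of the
    window the question has actually existed for."""
    window_hours = min(hours_since_posted, 24.0)
    window_hours = max(window_hours, 1.0)  # assumed floor, not in the proposal
    return votes_in_last_24h / window_hours


old_question = windowed_popularity(48, hours_since_posted=100)   # 48 / 24
new_question = windowed_popularity(12, hours_since_posted=6)     # 12 / 6
```

Both example questions end up with the same popularity: the old one is judged only on its last 24 hours, and the new one is not penalized for having had less time to collect votes.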

### Solution:3

As far as I can tell, a substantial part of the Qanswers term in the current formula is problematic:

(log(Qviews)*4) + ((Qanswers * Qscore)/5) + sum(Ascores)

About a third of the answers studied here (83 of 254) have a score less than 1/100 of the top-voted post in their respective question. Given the high number of views and votes on the studied questions, it looks like once sufficient evidence has accumulated for a question, it is time to re-check the (Qanswers * Qscore) part of the algorithm to make sure that it still reflects the underlying assumption:

one assumes if there are lots of answers, there will be a lot more voting on the answers, too

The observed score difference ("less than 1/100 of top voted post") clearly indicates that not all answers satisfy the above assumption. The algorithm assumes voting on the answers, but the evidence strongly indicates that readers don't vote on some of them; thus the (Qanswers * Qscore) part becomes fake.

Given that the questions checked were ones with tens of thousands of views, such an insultingly low score indicates that assuming these answers to be popular wouldn't even be in the ballpark. Still, the formula pumps these into the Qanswers value, as if they were something everyone would be happy to read (hint: they aren't).

Consider tuning the formula so that the observed voting evidence (once there is enough of it to learn from) more closely matches the initial assumption of "a lot more voting on the answers".

When a question and its answers gain a lot of votes, begin ignoring answers with a low / non-positive score. Or better yet, ignore answers scored less than some reasonable fraction (e.g. 1/10) of the top one.
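To make the suggestion concrete, here is a small Python sketch of the filtering step. The 1/10 cutoff is the example fraction mentioned above; the function name is made up, and the sketch leaves out the "only once there are a lot of votes" condition for brevity.

```python
def counted_answer_scores(answer_scores, min_fraction=0.1):
    """Keep only answers scoring at least min_fraction of the top answer."""
    if not answer_scores:
        return []
    top = max(answer_scores)
    if top <= 0:
        return []  # nothing positively scored, so nothing counts
    threshold = top * min_fraction
    return [s for s in answer_scores if s > 0 and s >= threshold]


scores = [100, 40, 9, 0, -2]
kept = counted_answer_scores(scores)
effective_q_answers = len(kept)  # would replace Qanswers in the formula
```

With this filter, the answer count fed into the formula reflects only answers that readers actually voted for.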

• For the sake of completeness, another option would be to keep the current algorithm and instead modify the "feature specification" to better match the formula. Although I honestly cannot imagine how apparently useless answers can be explained / specified as contributing to the "hotness score" in a meaningful way.

Note there is a feature request at Prog.SE meta to test a less 'klingonic' modification of the formula:

Yet another indication of issues with the current formula is how easy it is to manipulate the hotness score. In my recent experiments with a typical hot question, changing the direction of a single question vote (up->down->up) changed the score by 10-15 points. This is because any question vote is leveraged by the number of answers. Together, 2-3 voters can "swing" the score by 30-50 (for comparison, the current top screen at collider shows me 6 questions scored from 77 to 38). Similarly, adding / removing an answer (any answer) in a highly voted question offers even more possibilities for cheating, because it is leveraged by the question score. If it was a game, I would call this somewhat boring because of poor game balance.

Actually, one can say that sort of unconscious coordinated manipulation already happens quite regularly, when hotness lemmings attracted by titles shown on top of collider land on the questions and begin posting multiple zero-effort braindumps, making formula pump the score even higher, which attracts more lemmings that break things further and so on. "My particular concern is the poisonous effect these mis-answers have on questions, making interesting and well presented problems look the same as non-constructive popularity contests." (quote source).

For the record, in August 2010 the algorithm was tweaked as follows:

• Succeeding questions from the same site are penalized by increasing amounts. (So, the first question from SO in the list gets multiplied by 1.0, the second by 0.98, the third by 0.96, etc.)

• Community wiki questions are penalized, to keep the entire home page from being Poll-type questions

• The benefit of many answers is capped at 10, and we only look at the score of the top 3 answers

• We only degrade based on question age, and not the last update date on a question, so questions don't pop back up to the top every time they're edited

• Views are not counted towards the score

The core of the formula (without the site-based degrading or traffic scaling) is:

((MIN(AnswerCount, 10) * QScore) / 5 + AnswerScore) / (MAX(QAgeInHours + 1, 6) ^ 1.4)

Yeah "benefit of many answers is capped at 10", how cute.

Even with the tweaked formula, stuffing 8 useless, zero-score answers into a +50 question would have the same effect as giving 80 upvotes to answers. At a +200 question, this would be like giving 320 (over three hundred!) upvotes to answers.
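The arithmetic is easy to check with a short Python sketch of the core formula quoted above. I am assuming here that AnswerScore is a single summed value (per the tweak list, presumably the top 3 answer scores added together); the function name is my own.

```python
def core_hot_score(answer_count, q_score, answer_score, q_age_hours):
    """Core of the August 2010 formula, without site penalties or traffic scaling."""
    numerator = (min(answer_count, 10) * q_score) / 5 + answer_score
    denominator = max(q_age_hours + 1, 6) ** 1.4
    return numerator / denominator


# 8 zero-score answers on a +50 question...
stuffed = core_hot_score(answer_count=8, q_score=50, answer_score=0, q_age_hours=10)
# ...versus no answers counted but 80 points of answer score.
voted = core_hot_score(answer_count=0, q_score=50, answer_score=80, q_age_hours=10)
```

Both calls give the same score, because (8 * 50) / 5 = 80: each extra answer on a +50 question is worth 10 answer upvotes regardless of its own score.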

No wonder that even with the above tweak, some questions stick to the top of the hot questions list forever.

### Solution:4

@Matt
You wrote:

The most important factors are the total number of votes on the question itself and whether the question has an accepted answer.

and then this formula.

This would keep the questions with an accepted answer at the top of the list. I think that it should be the opposite of that. When the question has been answered it doesn't need to be discussed anymore and can be archived. If someone has a similar question they will find the answer using search.

### Solution:5

1. Every 15 minutes, look at the questions asked in the last 24 hours that have not been promoted to the home page.
2. Put the question with the most views at the top of the home page.

That sounds good to me, but maybe do it for the top two/three of each cat rather than just one. Six new additions are better than 2 - even if it is every 15 minutes.

I'd definitely like the current setup as a separate tab however - I like how you don't have to spam your link to everyone to get it on the frontpage (where it has any real chance of being seen by the masses) a la Digg. Perhaps rename it to recent - it's recent activity (questions/comments) rather than new questions.

### Solution:6

1. Every 15 minutes, look at the questions asked in the last 24 hours that have not been promoted to the home page.

2. Put the question with the most views at the top of the home page.

Simple, right? You essentially have a list of the "hot" questions from the last 24 hours. There's no self-fulfilling prophecy. And you could use the same algorithm to select "hot" questions over the last week, the last month, and all time.

### Solution:7

Edit again: Micha(l) was right, I had the "answered" bit the wrong way round.

The most important factors are the total number of votes on the question itself and whether the question has an accepted answer. As @Blorgbeard says, reputation doesn't really affect it.

Age definitely has to be a factor, otherwise a highly voted question could stay around long after the technology it refers to is superseded. Perhaps last activity is all that needs to be factored in here, rather than when the question was asked.

Maybe something like:

((totalVotesOnQuestion + 1) ^ 2) * (answered ? 1 : 10) / timeSinceLastActivity

Where the denominator is scaled appropriately.
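As a sketch of how that expression behaves, here it is in Python. The answer leaves the scaling of the denominator open, so measuring it in hours with a one-hour floor is purely my assumption, as is the function name.

```python
def activity_weighted_score(total_votes_on_question: int, answered: bool,
                            hours_since_last_activity: float) -> float:
    # Unanswered questions get ten times the weight of answered ones.
    answered_weight = 1 if answered else 10
    # Assumed scaling: hours, floored at 1 to avoid dividing by ~0.
    scaled_time = max(hours_since_last_activity, 1.0)
    return ((total_votes_on_question + 1) ** 2) * answered_weight / scaled_time


unanswered = activity_weighted_score(4, answered=False, hours_since_last_activity=5)
answered = activity_weighted_score(4, answered=True, hours_since_last_activity=5)
```

Squaring the vote count means a question with twice the votes gets roughly four times the score, so votes dominate quickly.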

Edit: This would rely on people up-modding questions they want the answer to. I don't know if some extra helpful text would help this, prompting people to vote for the question if they, too, want to know the answer. This would tie in nicely with the suggestion on UserVoice to make the votes against a question count as "bounty".

### Solution:8

The first item in the list of variables to work with is "votes", but I assume that's the number of votes that the question has received. Is it possible to count the number of votes its answers have received?

To avoid the pitfall of the questions that historically got the most views being the questions that get the most views in the future, could you define popularity based on the amount of interaction questions are getting once someone has viewed them?

Ideally you'd want some kind of running average of the votes for the question, the votes for its answers and the frequency of answers, so that when a good solution is found, new answers stop coming in and discussion slows down, it will drop down the list of popular topics even if it is still getting many page views.

### Solution:9

Have you looked at the mouseover text on the voting arrows?

...you've demonstrated exactly why voting is useful, and it works. :)

Jeff here mentions again how important he believes voting should be where popularity is concerned. But as Inigo once said, "Why you keep using that word? I do not think it means what you think it means."

Topics and threads (because that's what the hottest "questions" have become) don't have to be helpful to be popular, so I might suggest that Jeff is as brutally misusing this metric as the guy he's chastising in this quote. To be fair, so are the vast majority of the people on the "hot" questions.

Solution : Have "Hot Discussions" and "Hot Questions" be separate items. Allow high-rep users to take a normal question that is clearly a discussion piece, and make it "Go Wiki" (Can't editors do that already?), and by doing so have it no longer count as a "Hot Question". Now you can apply almost any formula you like to the questions because any open-ended question that gets to the "Hot" board will be ripe to be instantly shoved aside. Almost all the things that you cite in your formula like views, votes, and time since the last update will be more relevant, because spammers, flamers, and trolls will gravitate to the "discussion" questions.

Making it easy to get into flamewars, and easy to segregate them, ala Wikipedia discussion page, has a pretty good track record in returning relevant results, no matter the formula.

### Solution:10

My suggestion, which has been rejected by numerous site frameworks but I continue to make... Store another variable, "Views originating from any list sorted by most views", and don't count those.

### Solution:11

The formula needs to control for the average votes per question on a site, as well as the average answers per question.

Instead of (in Atwood's notation) saying Qscore, use Qscore/Ascore (average score) in the formula (and adjust the formula accordingly).

This controls for sites whose vote frequencies vary. The same thing could also apply to answer. This would also even the distribution between questions on sites; it would stop sites that accumulate too many votes from gravitating into the Hot Questions list too often.
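A tiny Python illustration of the normalization idea follows. The floor of 1 on the site average is my own assumption, to avoid dividing by zero on a brand-new site, and the function name is made up.

```python
def relative_score(q_score: float, site_average_score: float) -> float:
    """Question score expressed as a multiple of the site's average score."""
    return q_score / max(site_average_score, 1.0)  # assumed floor of 1


# A +30 question on a site averaging 10 and a +6 question on a site
# averaging 2 are equally exceptional for their own site.
busy_site = relative_score(30, site_average_score=10)
quiet_site = relative_score(6, site_average_score=2)
```

The normalized value would then replace Qscore in the formula, with the other constants retuned to the new scale.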

### Solution:12

Maybe tags could be taken into account somewhere along the way. Especially since high-reputation people can re-tag other people's questions, avoiding mis-tagged questions.

Just a thought...

### Solution:13

One additional thought about ranking hot questions. I think the hot questions ranking should potentially be calculated or weighted differently for answered and un-answered questions. It seems that questions that are viewed lots, but not answered are potentially more interesting than questions that are already asked and answered.

### Solution:14

Have a single popularity field for each question.

When a user votes a question or one of its answers up, increase popularity by a fixed amount. Decrease it with down votes.

If a user cancels an up vote, decrease the question's popularity by the same fixed amount as the up vote would increase it.

Then decrease popularity periodically (once an hour, once a day) with this formula:

popularity = popularity * X, where X is carefully selected and 0 < X < 1, for example X = 0.95

Then just order questions by popularity. The question with highest value in popularity field is the hottest.

This would need some figuring out what the coefficient X should be and how often formula should be applied to questions.

This would give more weight to recent votes, and high popularity spikes would eventually fade away. This way the questions that recently got a lot of up votes would get to the top. This would work even if the question was very old but for some reason got a lot of activity recently.
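The scheme above is simple enough to sketch as a small in-memory Python class. The class and method names are made up for illustration, and a real site would keep the popularity field in the database and run the decay from a scheduled job.

```python
class PopularityTracker:
    """One popularity number per question: votes add to it, time decays it."""

    def __init__(self, vote_weight: float = 1.0, decay: float = 0.95):
        if not 0 < decay < 1:
            raise ValueError("decay (X) must satisfy 0 < X < 1")
        self.vote_weight = vote_weight
        self.decay = decay
        self.popularity = {}  # question id -> current popularity

    def _add(self, question_id, amount):
        self.popularity[question_id] = self.popularity.get(question_id, 0.0) + amount

    def upvote(self, question_id):
        self._add(question_id, self.vote_weight)

    def downvote(self, question_id):
        self._add(question_id, -self.vote_weight)

    def cancel_upvote(self, question_id):
        # Cancelling removes exactly what the upvote added.
        self._add(question_id, -self.vote_weight)

    def apply_decay(self):
        """Run periodically (e.g. hourly): popularity = popularity * X."""
        for question_id in self.popularity:
            self.popularity[question_id] *= self.decay

    def hottest(self, limit: int = 10):
        return sorted(self.popularity, key=self.popularity.get, reverse=True)[:limit]


tracker = PopularityTracker(decay=0.5)
for _ in range(3):
    tracker.upvote("old")
tracker.apply_decay()
tracker.apply_decay()      # "old" has faded from 3.0 to 0.75
tracker.upvote("new")      # a single recent vote now outranks it
```

The decay of 0.5 in the example is exaggerated so the effect shows after two steps; with X = 0.95 the fade is much slower.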

### Solution:15

I agree with Blorgbeard.. I don't think rep should play a major part in determining a question's popularity..

I certainly don't think it should be based on just the number of views or answers posted (both could obviously be gamed easily).

However, I believe it should be mainly based on score (?) (upvotes - downvotes). This should weight the up/down votes on the question higher than those for the answers, but those for the answers should obviously have an impact..

This would probably be made easier if the question score was automatically partially controlled by the answer content (i.e. 3 upmods on answers = +1 to question).

### Solution:16

I think that for a site like this, the hot question would be a function of what is being discussed the most. Now, if we leave it at that only, then an irrelevant discussion may also become hot.

In that case, we should take into account the reputation of the people answering (hence contributing to the discussion).

### Solution:17

I understand that using a simple reputation metric in the calculation for which topics should be "Hot" questions would lead to a situation where a new user would have a low chance of having his question make it to the list.

How about calculating a Net Reputation metric based upon the total number of points a user has obtained over the age of his account on Stack Overflow? The age variable itself may be based upon the number of days where a user has logged into the site for at least 10 minutes.

netReputation = (reputationPoints / numDaysUserVisits);

That should allow a user who has more reputation points, and happens to be active on the site to gain a higher metric.

### Solution:18

Have you looked into the "Best" algorithm that Reddit added? It's pretty nifty. Here's Randall Munroe explaining it.
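For reference, the "Best" ordering described in that explanation ranks items by the lower bound of the Wilson score confidence interval on the fraction of upvotes. Below is a minimal Python sketch of that lower bound; the z value of 1.96 (roughly 95% confidence) is my assumption, not necessarily what Reddit uses.

```python
import math


def wilson_lower_bound(ups: int, downs: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the upvote fraction."""
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    denominator = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / denominator


# One lone upvote is weaker evidence than 100 up / 10 down.
single_vote = wilson_lower_bound(1, 0)
well_tested = wilson_lower_bound(100, 10)
```

The appeal is that an item with few votes is not ranked above an item with many votes and a slightly lower ratio, which addresses the "early votes win" problem without any time term.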

### Solution:19

Personally I don't see the benefit of "hot" questions on Stack Overflow. It's probably just bias but I come here for hard answers to specific questions, not to lurk or partake in topical discussions.

### Solution:20

Could someone check whether this implementation of the hot algorithm is correct? PHP code:

    function hot($Qviews, $Qanswers, $Qscore, $Ascores, $date_ask, $date_active)
    {
        // Hours since the question was asked
        $Qage = (time() - strtotime(gmdate("Y-m-d H:i:s", strtotime($date_ask)))) / 3600;
        $Qage = round($Qage, 1);
        // Hours since the last activity on the question
        $Qupdated = (time() - strtotime(gmdate("Y-m-d H:i:s", strtotime($date_active)))) / 3600;
        $Qupdated = round($Qupdated, 1);

        $dividend = (log10($Qviews) * 4) + (($Qanswers * $Qscore) / 5) + $Ascores;
        $divisor = pow((($Qage + 1) - ($Qage - $Qupdated) / 2), 1.5);
        echo $dividend / $divisor . "\n";
    }

test result:

    hot(1265, 11, 80, 121, '2011-04-01 00:50:02Z', '2011-04-01 04:53:35Z');
    hot(85, 4, 0, 6, '2011-04-01 04:05:30Z', '2011-04-01 04:24:02Z');
    hot(15, 2, 0, 0, '2011-04-01 05:02:21Z', '2011-04-01 05:14:34Z');
    hot(33, 3, 0, 7, '2011-04-01 04:25:43Z', '2011-04-01 05:06:01Z');
    hot(12, 0, 0, 0, '2011-04-01 05:00:45Z', '2011-04-01 05:00:45Z');
    hot(12, 0, 0, 0, '2011-04-01 05:02:05Z', '2011-04-01 05:02:05Z');
    hot(11, 1, 0, 0, '2011-04-01 05:01:04Z', '2011-04-01 05:11:56Z');

    5.97048822863
    0.308542463568
    0.119073201261
    0.314512900506
    0.10786376822
    0.10786376822
    0.104757620731

I collected some data from the hot tab on SO, but the ranking values I compute are inconsistent with SO's ordering, which confuses me.

### Solution:21

Basically, a user posts a question just to get a proper answer.

As a user, I just want to get a list of suggested related questions that have already been answered and whose answer has been accepted by the OP. It is better if the answer has a few upvotes, but that doesn't matter much, because the user who posted the answer might be an unsung hero.

Also, as you pointed out, the variables to be considered are:

Among these two, I think 'Reputation of asker' should be given the least importance, as it is far less relevant. Consider this: Joshua_Bloch has just created an account on Stack Overflow and asked a very good question with a reputation of 1. Wouldn't you show his question in suggestions to others?

### Solution:22

If you'll permit me to get philosophical, what is the purpose of the 'popular' or 'hot' view? I think for a Q&A site it should be to attract people to interesting but unanswered questions, but there are other possibilities. Is it to show off how an interesting question was answered? Is it just to suggest, hey, lots of other people were interested in this question, you may be as well?

I think the score should be based on the total number of votes, number of views, (relatively weak) number of answers, and the time since posting, because each of those is a measure of engagement. So you have something like (A * views) + (B * votes) + (C * number of answers) with C less than A or B. Votes is total number of votes (up or down) on answers plus net positive votes on the question. Then divide by some rapidly increasing age factor, (age of the post ^ D + age of the post ^ E), so that the hot topics are always changing. Sadly all these constants need to be found empirically, but I think this will do it.
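Here is one way the shape of that formula could be written down in Python. Every constant below (A through E) is a placeholder, since the answer says they must be found empirically, and the one-hour floor on age is my own assumption to keep the denominator away from zero.

```python
def engagement_score(views, answer_votes, question_net_votes, answers, age_hours,
                     a=1.0, b=2.0, c=0.5, d=1.0, e=2.0):
    """(A*views + B*votes + C*answers) / (age^D + age^E), placeholder constants."""
    # "Votes" = all votes (up or down) on answers plus the question's
    # net votes, counted only when that net is positive.
    votes = answer_votes + max(question_net_votes, 0)
    age = max(age_hours, 1.0)  # assumed floor, not part of the answer
    return (a * views + b * votes + c * answers) / (age ** d + age ** e)


young = engagement_score(views=100, answer_votes=20, question_net_votes=-5,
                         answers=4, age_hours=2, a=1.0, b=1.0, c=0.5)
older = engagement_score(views=100, answer_votes=20, question_net_votes=-5,
                         answers=4, age_hours=20, a=1.0, b=1.0, c=0.5)
```

With D = 1 and E = 2 the denominator grows roughly quadratically, so the same engagement counts for far less after a day than after an hour.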

### Solution:23

Are you planning to make any exceptions in your algorithm for meta-questions? You joked around in one podcast about building a system that is being used to discuss the system, which of course isn't really what's happening when you count up the actual number of meta-questions, but when you look at the really popular questions it seems that way.

The general line of thinking is that once the site goes public, these posts will eventually fall by the wayside, but I'm not so sure that's true. There will always be interest in discussing how the site works, especially among your target audience.

Maybe just a slight subtraction for tag=stackoverflow after the beta is done. This would, of course, require some diligence with re-tagging.

### Solution:24

If it were me, I'd say weighted activity per minimum period through a linear filter: http://en.wikipedia.org/wiki/Linear_filter

### Solution:25

Usually, the best way to do these things is to keep it simple so that people can understand and maintain any coefficients in the algorithm. I guess they will require some tweaking as the masses flood the gates.

I am quite surprised how well the Firefox frecency algorithm works. It can't be translated directly to this, but switch some variables around and you have a pretty lightweight formula to order your posts.

### Solution:26

What I am suggesting here is not a solution but a general approach.

The problem at hand is generally referred to as the 'ranking problem' in machine learning. Given a set of examples, each of which is a vector of real numbers, induce a function to rank unseen examples. This problem is quite well studied.

What I am suggesting is that the creators of Stack Overflow extract a small subset of their database, manually rank it and put it online for researchers to download and work with (anonymize it if necessary). The UCI Machine Learning Repository would be a good venue. People who work in this area are bound to jump on it and use it to benchmark their algorithms and ideas. Soon you will see publications with real results on a subset of your database. You can get this at almost no cost to you. Many organizations have tried this approach, Netflix to name one.

### Solution:27

Random question. Maybe unrelated, pardon me in that case.

In our ratings, do we take into account "topper's bias", i.e. that people tend to support the widely acclaimed opinion? In our case, since the responses are sorted by votes, such a bias would be aggravated. E.g. in all likelihood, this response will be ignored ;)

### Solution:28

I don't see anyone doing this. But how about using the amount of time spent reading the question and its answers? In fact it is possible to get where on the screen the user is and whether or not the window is in focus. These parameters could be used to determine which question (and which of its answers) has the most time spent on it.

And I guess "hotness" can be a function of some of the parameters you mentioned along with the one above.

### Solution:29

I think that you have to factor in churn rate, as this is an issue for most here - a fantastic resource, heavily used IMHO.

So I would go for

and then a simple ranking of those.

View v Answers is a churn rate to answer rate (should be good as there have been lots viewing it and lots answering it) then add the Time Factor of the Question, that is, the speed of the question to the speed of the answers.

Note that this is a simple problem-solving algorithm, not a real one.

UPDATE

What is definition of hot topic? Views to Answers or Time from Question to Number of Answers.

I think what I was trying to do was combine these two perspectives on a hot topic, then provide a quick rank of the multiplication of those two routines...

UPDATE 2

Having just read this again - I still haven't put reputation and voting in...
On reputation (this is from people that have put time in) my mind thinks that a person coming with a rep of 10 may still provide the "best answer" so discount reputation out of the "hot topic"...

So the last piece would be to

Humm b*tch of a question = good luck =>

### Solution:30

I am starting to think that

• And to a lesser extent, views per person that saw the question near the top of the given list (this will be different for each tag the question has).

are good measurements to use.

As it may predict the outcome of highlighting the question, we are after all trying to get good answers and engagement (voting) of users.

If a lot of people that see the title of a question decide to view the questions, it is likely that showing the question title to more people will get more engagement.

I.e. the questions that get most benefit from being in the "hot" list should be put in the "hot" list.
