Enlister baidu's recommender system for the biggest chinese q&a website

Enlister
Baidu's Recommender System for
the Biggest Chinese Q&A Website
 JasonFu | fujuxuan@gmail.com

 2012-12-10, Monday

We are leaving the age of information and
entering the age of recommendation.
 克里斯·安德森，《长尾理论》作者

1/36

ABOUT THE PAPER

2/36

BACKGROUND

 Baidu Knows offers a Q&A platform to its users for knowledge
and experience sharing.

 Today, Baidu Knows website:
 There are over 400 million questions
 More than 170 million questions were answered by users
 Around 100 million users search for answers every day
 Over 12 new questions are posted online every second

 Baidu Knows is the biggest Q&A website in China.

6/36

BACKGROUND
 An eco-system of knowledge sharing between the website’s users.
 As more questions have been answered, the search result quality of
common users will be improved.
 “Answer” is contribution, not consumption.

7/36

APPLICATION DESCRIPTION

 Baidu Knows builds an intelligent RS, Enlister, to provide the Baidu
Knows users with questions that they may be willing to answer.

 Is based on the machine learning technology

 Apply a machine learning based CTR prediction methodology to
improve the recommendation accuracy

8/36

APPLICATION DESCRIPTION

Previous Enlister
 Content-based
 Typical content-based
 CTR prediction
 Cosine similarity degree
 Stream computing

9/36

ALGORITHM

 We expect the user models to disclose the nature of our users’

choices of answering questions.

 The models have to be simple enough to accommodate massive

calculation and industrial adoption.

10/36

ALGORITHM

1. User Model
2. Click Prediction

3. Diversity Adjustment

11/36

1. User Model
Age, Gender,
Education

attributes Tags

…
user model
Interest Term
Vector

Related
interest
Questions

Abstract
Interest Vector

12/36

1. User Model
 Interest Term Vector
 A vector contains weights of terms, which implicate the correlation between
user and the term on sematic level.

 Related Questions
 Questions are browsed or answered by a user.

 Abstract Interest Vector
 A vector contains weights of terms, which implicate the correlation between
a user and an abstract concept.
 Use PLSA technology to get a conceptual topic model from millions of
question-answer pairs on the Baidu Knows website .

13/36

ALGORITHM

1. User Model
2. Click Prediction


14/36

2. Click Prediction

 The click model that we created is a kind of probabilistic
classification model. It is a binary classification model to
calculate the probability of a sample belonging to a class.

15/36

2. Click Prediction

 Sample Collection
 Positive samples: questions that the user had examined and clicked.
 Negative samples: questions that randomly choose from the question pool.

 Feature Selection
 User Attributes: (1) basic user attributes (2) other features from the statistics
 Correlation Degree: the number of matched terms, cosine similarity and bm25
between the user interest term vectors and question vectors

 Classification Algorithm: two principles (1) probabilistic classification (2) linear
classifier

16/36

2. Click Prediction
 Classification Algorithm
 maximum entropy classifier
 The probability P of a user u will click the question q can be calculated as
follows:

 Global optimization solution：limited-memory Broyden-Fletcher-
Goldfarb-Shanno (L-BFGS) ; Stochastic Gradient Descent (SGD)

17/36

ALGORITHM

1. User Model
2. Click Prediction


18/36


 The head part and the tail part of a list garnered most attention from the
users.

 For the head part, we apply a loose filtering algorithm, which only deletes
some apparent duplication in the list.

 For the tail part, we use a strict filtering algorithm to take out any
questions that have noticeable semantic level similarity to each other in
the list.

19/36

SYSTEM SETUP
 The most important concept in the Enlister system design is real-time CTR
prediction. The major data process can be described as follows :

20/36

SYSTEM SETUP
 For building the data processing flow, we construct multiple logic queues between
processing nodes.
 The processing nodes are grouped into several node groups. Each group
represents a simple logic section.

21/36

Before login After login

23/36

EXPERIMENT & EVALUATION

1. Evaluation Metrics
2. Experiment

3. Online Evaluation

26/36

 Precision（查准率）：tp/(tp+fp),识别出的真正的正面观点数/所有的识别为正面观点的条数
 Recall（查全率）：tp/(tp+fn), 识别出的真正的正面观点数/样本中所有的真正正面观点的条数
 Accuracy（准确率）： (tp + tn)/(tp + fn + fp + tn), 正确识别观点数/所有观点的条数

Confusion Matrix

27/36


2. Experiment


28/36

2. Experiment
 Sample Selection
 100,000 questions that had been viewed and clicked by users are selected from
users’ logs as positive sample, which involves 10 thousands users with 10
records per user on average.
 Negative samples: random negative samples

29/36

2. Experiment

 Sample Proportion

30/36

2. Experiment

 Optimization Algorithm
 LBFGS algorithms is chosen as the optimization algorithm in the maximum
entropy model training.

31/36


2. Experiment


32/36


 Enlister was released to the Baidu Knows users and an online evaluation
was conducted from Feb. 11th, 2012.

33/36


34/36

CONCLUSION
① Have successfully built a real-time RS that serves millions of users every
day.
② The algorithm and system design fit the recommendation scenario quite
well.
③ Great improvement had been made on the accuracy and time-sensitive
issues.
④ The number of active users had grown substantially after the system was
officially launched.
 The future work : the timing of recommendation and the utilization of
relationships between users

35/36

Enlister baidu's recommender system for the biggest chinese q&a website

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Enlister baidu's recommender system for the biggest chinese q&a website

Similar to Enlister baidu's recommender system for the biggest chinese q&a website (20)

Recently uploaded

Recently uploaded (20)

Enlister baidu's recommender system for the biggest chinese q&a website

Editor's Notes