2010 09-17-張志威老師-confucius & “its” intelligent disciples

Confucius & “Its” Intelligent Disciples
Search + Social

Edward Chang
Director, Google Research, Beijing

2010-9-17 Ed Chang@NCCU Invited Talk 1

Google中国：我们在哪里？


Web 1.0

.htm

.htm .jpg

.jpg .msg

.htm .htm
.htm
.doc

.htm


Web 2.0 --- Web with People
.msg

.xls

.htm .htm

.jpg

.jpg

.htm .msg

.htm

.doc


Outline
• Search + Social Synergy
• Search  Social
• Social  Search
• Scalability and Elasticity


Related Papers
• AdHeat (Social Ads):
– AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads,
H.J. Bao and E. Y. Chang, WWW 2010 (best paper candidate), April 2010.
– Parallel Spectral Clustering in Distributed Systems,
Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and E. Y. Chang,
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2010.
• UserRank:
– Confucius and its Intelligent Disciples, X. Si, E. Y. Chang, Z. Gyongyi, VLDB,
September 2010 .
– Topic-dependent User Rank, Xiance Si, Z. Gyongyi, E. Y. Chang, and M.S. Sun,
Google Technical Report.
• Large-scale Collaborative Filtering:
– PLDA*: Parallel Latent Dirichlet Allocation for Large-Scale Applications, ACM
Transactions on Internet Technology, 2010.
– Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior,
W.-Y. Chen, J. Chu, E. Y. Chang, WWW 2009: 681-690.
– Combinational Collaborative Filtering for Personalized Community Recommendation,
W.-Y. Chen, E. Y. Chang, KDD 2008: 115-123.
– Parallel SVMs, E. Y. Chang, et al., NIPS 2007.

Query: What are must-see attractions at Yellowstone


Query: What are must-see attractions at Yosemite


Query: What are must-see attractions at Beijing

Hotel ads


Confucius: Google Q&A
Providing High-Quality Answers in a Timely Fashion
Differentiated Content

DC
U
S C
Search U
Community

CCS/Q&A

Discussion/Question

 Trigger a discussion/question session during search
 Provide labels to a post (semi-automatically)
 Given a post, find similar posts (automatically)
 Evaluate quality of a post, relevance and originality
 Evaluate user credentials in a topic sensitive way
 Route questions to experts
 Provide most relevant, high-quality content for Search to index
 Generate answers using NLP


Providing High-Quality Answers in a Timely Fashion

DC
U
S C
Search U
Community

CCS/Q&A

Discussion/Question

 Generate answers using NLP


Q&A Uses Machine Learning

Label suggestion
using LDA algorithm.

• Real Time topic-to-
topic (T2T)
recommendation
using LDA algorithm.

• Gives out related high
quality links to
previous questions
before human answer
appear.


Collaborative Filtering
Labels/Qs
1 1 1
1 1 1 1 1 1
Based on membership so far, 1 1 1
and memberships of others 1 1 1 1

Questions
1
1 1
Predict further membership 1 1
1 1
1 1
1 1
1 1 1 1 1


FIM-based Recommendation


FIM Preliminaries
• Observation 1: If an item A is not frequent, any pattern contains
A won’t be frequent [R. Agrawal]
 use a threshold to eliminate infrequent items
{A}  {A,B}
• Observation 2: Patterns containing A are subsets of (or found
from) transactions containing A [J. Han]
 divide-and-conquer: select transactions containing A to form a
conditional database (CDB), and find patterns containing A from
that conditional database
{A, B}, {A, C}, {A}  CDB A
{A, B}, {B, C}  CDB B


Preprocessing
• According to
Observation 1, we
count the support of
each item by
scanning the
database, and
eliminate those
infrequent items
from the
transactions.
• According to
Observation 3, we
sort items in each
transaction by the
order of descending
support value.

Example of Projection

Example of Projection of a database into CDBs.
Left: sorted transactions in order of f, c, a, b, m, p
Right: conditional databases of frequent items



Left: sorted transactions;


Recursive Projections [H. Li, et al. ACM RS 08]
• Recursive projection form
a search tree
• Each node is a CDB
• Using the order of items to
prevent duplicated CDBs.
• Each level of breath-first
search of the tree can be
done by a MapReduce
iteration.
• Once a CDB is small
enough to fit in memory,
we can invoke FP-growth
to mine this CDB, and no
more growth of the sub-
tree.

Projection using MapReduce

p:{fcam/fcam/cb} p:3, pc:3


Collaborative Filtering
Labels/Related Qs
1 1 1
1 1 1 1 1 1
Based on membership so far, 1 1 1

Questions
and memberships of others 1 1 1 1
1
1 1
Predict further membership 1 1
1 1
1 1
1 1
1 1 1 1 1


Distributed Latent Dirichlet Allocation (LDA)
• Search Users/Music/Ads/Question

– Construct a latent layer for better
for semantic matching

Users/Music/Ads/Answers
• Example:
– iPhone crack
– Apple pie

1 recipe pastry for a 9
Documents inch double crust How to install apps on
9 apples, 2/1 cup, Apple mobile phones?
brown sugar

Topic
Distribution

• Other Collaborative Filtering Apps
– Recommend Users  Users
Topic
Distribution
– Recommend Music  Users
– Recommend Ads  Users
– Recommend Answers  Q
iPhone crack Apple pie
User quries • Predict the ? In the light-blue cells

Latent Dirichlet Allocation [D. Blei, M. Jordan 04]
• α: uniform Dirichlet φ prior
α
for per document d topic
distribution (corpus level
parameter)
θ
• β: uniform Dirichlet φ prior
for per topic z word
distribution (corpus level
parameter) z
• θd is the topic distribution of
document d (document level)
• zdj the topic if the jth word in w φ β
d, wdj the specific word (word Nm K
level)
M

Combinational Collaborative Filtering Model (CCF)
【KDD2008】

Communities Communities Communities

users descriptions users descriptions

2010-9-17
33 Ed Chang@NCCU Invited Talk

LDA Gibbs Sampling: Inputs & Outputs
Inputs:
1. training data: documents as
bags of words
words topics
2. parameter: the number of
topics
docs docs
Outputs:
1. model parameters: a co-
occurrence matrix of topics and
words. topics
2. by-product: a co-occurrence words

matrix of topics and documents.


Parallel Gibbs Sampling
Inputs:
1. training data: documents as
bags of words
words topics
2. parameter: the number of
topics
docs docs
Outputs:
1. model parameters: a co-
occurrence matrix of topics and
words. topics
2. by-product: a co-occurrence words

matrix of topics and documents.


PLDA -- enhanced parallel LDA
[ACM Transactions on IT]

• PLDA is restricted by memory: Topic-word matrix has
to fit into memory
• Restricted by Amdahl’s Law: communication costs
too high

2010-9-17 Ed Chang@NCCU Invited Talk
36

PLDA -- enhanced parallel LDA
• Take advantage of bag of words modeling: each Pw
machine processes vocabulary in a word order
• Pipelining: fetching the updated topic distribution
matrix while doing Gibbs sampling

2010-9-17 Ed Chang@NCCU Invited Talk
37

Speedup
1,500x using 2,000 machines



DC
U
S C
Search U
Community

CCS/Q&A

Discussion/Question

 NLQA


UserRank
• Rank users by quantity (number of links)
and quality (weights on links) of
contributions
• Quality include:
– Relevance. Is an answer relevant to the
Q? Measured by KL divergence
between latent-topic vectors of A and Q
– Coverage. Compared among different
answers
– Originality. Detect potential plagiarism
and spam
– Promptness. Time between Q and A
posting time


Outline


Social + Data Networks


可擴展性和彈性
Scalability & Elasticity

• 計算的規模與極端
• 雲計算平台設計挑戰
• 谷歌云計算基礎技術
• 谷歌云計算新技術


計算的規模

千
3.8*10^5km
百万
十億
万億
千万億
百兆
十万兆
億兆
2010-9-17 京，亥 Ed Chang@NCCU Invited Talk 46

計算的規模

千
3.8*10^5km
百万
十億
1.2*10^10km
万億
千万億
百兆
十万兆
億兆

計算的規模

千
3.8*10^5km
百万
十億
1.2*10^10km
万億
千万億
百兆 1.5*10^16km

十万兆
億兆

高性能計算的極端
应用用户数精确度可靠度数据量

科學計算少極高低 -- 中等 Tera

股市交易大量高極高 Gega

基因排序少高高 Tera --
Peta

搜索引擎大量中等 -- 高中等 Peta


可擴展性和彈性
Scalability & Elasticity

• 計算的規模與極端
• 雲計算平台設計挑戰
• 雲計算基礎技術
• 雲計算新技術


Google文件系统 (GFS)
GFS Master

Replicas
Master Client
GFS Master
Client

C0 C1 C1 C0 C3
C3 C4 C3 C5 C4

• Master manages metadata
• Data transfers happen directly between
clients/chunkservers
• Files broken into chunks (typically 64 MB)
• Chunks triplicated across three machines for safety
• See SOSP^03 paper at Ed Chang@NCCU Invited Talk
2010-9-17 51
http://labs.google.com/papers/gfs.html

大表格(Big Table)
• 内存管理
• <Row, Column, Timestamp> triple for key -
lookup, insert, and delete API

• Example
– Row: a Web Page
– Columns: various aspects of the page
– Time stamps: time when a column was fetched

MapReduce

Data Block 1
map
Data Data Block 2 Results
datadatada Reduce datadata
tadatadata datadata
datadatada map datadata
tadatadata

Data Block 3 Data Block 5
map
map

map
Data Block 4


計算機平台(Platform)


樣品平台
• 計算機
– 16GB DRAM; 160 GB SSD; 5 x 1TB disks
• 計算機機架 (Rack)
– 40 计算机
– 48 port Gigabit Ethernet switch
• 計算機中心 (Center)
– 10,000計算機(250 racks)
– 2K port Gigabit Ethernet switch


儲存 ---計算機單機


儲存 ---計算機機架


儲存 ---計算機中心


設計挑戰
• 新程模型, 技術
– File system
– Flash (SSD); GPUs?
• 能源節約
– Hardware, software, warehouse
– Encode/compress/transmit data well
• 故障恢復
– Hardware/software faults
– Deal with stragglers


新文件系統技術: Colossus
• Latency Reduction
– Metadata distribution
– Encoding and Replication
– Data migration
• Throughput Improvement
– Adaptive cluster-level file system
• Fault Tolerant
– Reed-Solomon
– Avoid correlated faults


Colossus [Sigmod 09]


新parellel算法
MapReduce Pregel MPI
GFS/IO and task rescheduling Yes No No
overhead between iterations +1 +1

Flexibility of computation model AllReduce only ? Flexible
+0.5 +1 +1
Efficient AllReduce Yes Yes Yes
+1 +1 +1
Recover from faults between Yes Yes Apps
iterations +1 +1
Recover from faults within each Yes Yes Apps
iteration +1 +1
Final Score for scalable machine 3.5 5 4
learning


設計挑戰
• 新程式設計模型
– Parallel
• 能源節約
• 故障恢復


能源消耗全球增長趨勢


能源消耗地區增長趨勢


電源使用效率


電源使用分布


節能措施


能源消耗 --- 油當量


設計挑戰
• 新程式設計模型
– Parallel
• 能源節約
• 故障恢復


故障率
• 99.9% 正常運行時間, 一年故障?小时
– 9 小时故障/年
• 99.9%正常運行/天, 10,000 計算機, 不down機機率?
– 4.5173346 × 10-5
• 10,000 計算機 (中心级)
– 0.25 次斷電
– 3 次路由器故障
– 1,000 計算機故障
– 1,000s 硬盤故障
– etc., etc., etc.


故障後快速修復
• 複製
• 如果可能的話
– 鬆散的一致性（loosely consist）
– 近似的答案
– 不完整的答案


數據規模算法精度


赢在规模的例证
• 谷歌翻譯
• 谷歌語音識別
• 趨勢預測


流感趨勢圖


Outline


Confucius and Disciples


Concluding Remarks
• Search + Social
• Increasing quantity and complexity of data demands scalable
solutions
• Have parallelized key subroutines for mining massive data sets
– Spectral Clustering [ECML 08]
– Frequent Itemset Mining [ACM RS 08]
– PLSA [KDD 08]
– LDA [WWW 09, AAIM 09]
– UserRank [Google TR 09]
– Support Vector Machines [NIPS 07]
• Relevant papers
– http://infolab.stanford.edu/~echang/
• Open Source PSVM, PLDA
– http://code.google.com/p/psvm/
– http://code.google.com/p/plda/

Collaborators
• Prof. Chih-Jen Lin (NTU)
• Hongjie Bai (Google)
• Hongji Bao (Google)
• Wen-Yen Chen (UCSB)
• Jon Chu (MIT)
• Haoyuan Li (PKU)
• Zhiyuan Liu (Tsinghua)
• Xiance Si (Tsinghua)
• Yangqiu Song (Tsinghua)
• Matt Stanton (CMU)
• Yi Wang (Google)
• Dong Zhang (Google)
• Kaihua Zhu (Google)


References
[1] Alexa internet. http://www.alexa.com/.
[2] D. M. Blei and M. I. Jordan. Variational methods for the dirichlet process. In Proc. of the 21st international
conference on Machine learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-
1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the Seventeenth
International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In
Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic
analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm.
Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE
Transactions on Pattern recognition and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Articial Intelligence, pages 289-296,
1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information System, 22(1):89-
115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in
social networks: Experiments with enron and academic email. Technical report, Computer Science, University of
Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent dirichlet allocation. In
Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.


References (cont.)
[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted boltzmann machines for collaborative ltering. In Proc. Of the
24th international conference on Machine learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the orkut social
network. In Proc. of the 11th ACM SIGKDD international conference on Knowledge discovery in data mining,
pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griths. Probabilistic author-topic models for information discovery. In
Proc. of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 306-
315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions.
Journal on Machine Learning Research (JMLR), 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classiers. Journal of Machine Learning Research,
2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and
Information Systems (KAIS), 8:374-384, 2005.
[20] L. Admic and E. Adar. How to search a social network. 2004
[21] T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages
5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining social networks and collaborative filtering.
Communitcations of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielnski, A. Swami. Mining association rules between sets of items in large databses. SIGMOD
Rec., 22:207-116, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In
Proceedings of the Fourteenth Conference on Uncertainty in Artifical Intelligence, 1998.
[25] M.Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-
177, 2004.


References (cont.)
[26] B.M. Sarwar, G. Karypis, J.A. Konstan, and J. Reidl. Item-based collaborative filtering recommendation algorithms.
In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[27] M.Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-
177, 2004.
[28] B.M. Sarwar, G. Karypis, J.A. Konstan, and J. Reidl. Item-based collaborative filtering recommendation algorithms.
In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[29] M. Brand. Fast online svd revisions for lightweight recommender systems. In Proceedings of the 3rd SIAM
International Conference on Data Mining, 2003.
[30] D. Goldbberg, D. Nichols, B. Oki and D. Terry. Using collaborative filtering to weave an information tapestry.
Communication of ACM 35, 12:61-70, 1992.
[31] P. Resnik, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture for aollaborative
filtering of netnews. In Proceedings of the ACM, Conference on Computer Supported Cooperative Work. Pages
175-186, 1994.
[32] J. Konstan, et al. Grouplens: Applying collaborative filtering to usenet news. Communication ofACM 40, 3:77-87,
1997.
[33] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating “word of mouth”. In
Proceedings of ACM CHI, 1:210-217, 1995.
[34] G. Kinden, B. Smith and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet
Computing, 7:76-80, 2003.
[35] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal 42, 1:177-
196, 2001.
[36] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of International Joint
Conference in Artificial Intelligence, 1999.
[37] http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/collaborativefiltering.html
[38] E. Y. Chang, et. al., Parallelizing Support Vector Machines on Distributed Machines, NIPS, 2007.
[39] Wen-Yen Chen, Dong Zhang, and E. Y. Chang, Combinational Collaborative Filtering for personalized community
recommendation, ACM KDD 2008.
[40] Y. Sun, W.-Y. Chen, H. Bai, C.-j. Lin, and E. Y. Chang, Parallel Spectral Clustering, ECML 2008.
[41] Wen-Yen Chen, Jon Chu, Junyi Luan, Hongjie Bai, Yi Wang, and Edward Y. Chang , Collaborative Filtering for
Orkut Communities: Discovery of User Latent Behavior, International World Wide Web Conference (WWW)
Madrid, Spain, April 2009 .
[42] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang, PLDA: Parallel Latent Dirichlet
Allocation, International Conference on Algorithmic Aspects in Information and Management, June 2009.


2010 09-17-張志威老師-confucius & “its” intelligent disciples

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (8)

Similar to 2010 09-17-張志威老師-confucius & “its” intelligent disciples

Similar to 2010 09-17-張志威老師-confucius & “its” intelligent disciples (20)

Recently uploaded

Recently uploaded (20)

2010 09-17-張志威老師-confucius & “its” intelligent disciples