Web Search And Mining (Ntuim)


Published in: Technology, Business
  • Web Search And Mining (Ntuim)

    1. 1. Opportunities and Challenges of Web Search and Mining Lee-Feng Chien ( 簡立峰 ) Academia Sinica & National Taiwan University
    2. 2. Outline <ul><li>Web SE </li></ul><ul><ul><li>Inside SE </li></ul></ul><ul><ul><li>Google’s Business Models </li></ul></ul><ul><ul><li>Google’s Impacts </li></ul></ul><ul><ul><li>Recent Development </li></ul></ul><ul><ul><li>Next-Generation WSE </li></ul></ul><ul><li>Web Mining </li></ul>
    3. 3. WSE = Google Globalization!
    4. 4. WSE = Google
    5. 5. Problems of WSE. Inside WSE: fast, coverage, accuracy.
    6. 6. Problems of WSE. Inside WSE: fast, coverage, accuracy. Business: profitable, models, competition.
    7. 7. Problems of WSE. Inside WSE: fast, coverage, accuracy. Business: profitable, models, competition. Impacts: Web computing, knowledge windows, new paradigm of civilization.
    8. 8. I. Some Must-Know Statistics
    9. 9. Online Language Populations <ul><li>Source: Global Reach (global-reach.biz/globstats) </li></ul>
    10. 10. Top Ten Languages in the Web <ul><li>Source: Internet World Stats </li></ul>More and more non-English users!
        TOP TEN LANGUAGES IN THE INTERNET
        Language | Internet Users, by Language | Average Penetration | World Population Estimate for Language | Language as % of Total Internet Users
        English | 287,369,520 | 26.2 % | 1,098,654,265 | 35.9 %
        Chinese | 105,484,112 | 8.0 % | 1,321,669,200 | 13.2 %
        Japanese | 66,548,060 | 52.1 % | 127,853,600 | 8.3 %
        German | 54,035,201 | 56.3 % | 95,893,300 | 6.8 %
        Spanish | 53,670,063 | 13.9 % | 386,413,200 | 6.7 %
        French | 35,034,269 | 9.3 % | 375,164,185 | 4.4 %
        Korean | 30,670,000 | 41.0 % | 74,730,000 | 3.8 %
        Italian | 28,610,000 | 49.3 % | 57,987,100 | 3.6 %
        Portuguese | 23,058,254 | 10.3 % | 224,664,100 | 2.9 %
        Dutch | 13,657,170 | 56.6 % | 24,125,950 | 1.7 %
        TOP TEN LANGUAGES | 698,353,773 | 18.4 % | 3,787,154,900 | 87.3 %
        Rest of the Languages | 101,686,725 | 3.9 % | 2,602,992,587 | 12.7 %
        WORLD TOTAL | 800,040,498 | 12.5 % | 6,390,147,487 | 100.0 %
    11. 11. Web Content Source: Network Wizards Jan 99 Internet Domain Survey More and more non-English pages
    12. 12. Web Users and Pages (5 years ago) Challenge of Scalability ! Chinese Users: 110M Including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others. Source: Global Reach, 2004
    13. 13. Number of Chinese Web Pages 573,000,000 pages Scalability Problem !
    14. 14. Number of Web Pages The world’s largest search engine ? <ul><li>4,285,199,774 pages (Google) </li></ul><ul><li>4.28 billion Web pages, 880 million images, and other documents </li></ul>Billions Of Textual Documents Indexed As of Sept 2, 2003 KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch
    15. 15. The top 10 Internet trends 2004 predicted by eOneNet.com <ul><li>1. World Internet population will continue to grow at an exponential rate, with China taking the lead in Asia having more than 100 million Internet users. </li></ul><ul><li>2. Broadband Internet penetration will continue to grow with China and US in the lead with an expected growth rate exceeding 30% each. </li></ul><ul><li>3. Online retail sales will still be led by the US with an expected revenue exceeding US$80 billion. </li></ul><ul><li>4. Paid search will account for the biggest online ad spending. With the successful paid search business models of Google and Overture, more search engines will offer paid search advertising. </li></ul>
    16. 16. The top 10 Internet trends 2004 predicted by eOneNet.com <ul><li>5. Spam will increase at least 20% despite the new US anti-spam law. The US legislators will be forced to consider amending the anti-spam law from an opt-out law to an opt-in law. </li></ul><ul><li>6. Ads placed in opt-in email newsletters will increase 25% as legitimate marketers find this is the easier way to comply with the anti-spam law and a better way of targeting customers. </li></ul><ul><li>7. Rich media will continue to be hot. More than 25% of online ads served will contain rich media content. </li></ul>
    17. 17. The top 10 Internet trends 2004 predicted by eOneNet.com <ul><li>8. 20% more small businesses will develop their own websites or use the Internet as a sales and marketing channel. </li></ul><ul><li>9. Online entertainment will grow at a rapid pace, with more sites offering videos and digital music download services. </li></ul><ul><li>10. The Internet boom will revive with more Internet companies going for IPO both in the US and in Asia, in particular kicked off by the most anticipated Google IPO in Spring. </li></ul>
    18. 18. II. Inside WSE
    19. 19. Components <ul><li>Crawler/Spider </li></ul><ul><li>Index Server </li></ul><ul><li>Query Server </li></ul><ul><li>Document Delivery </li></ul>
    20. 20. Architecture [Diagram: Web, Spider, Archive, Indexer, scalable Index partitions, search engine (SE) front ends, Browser, and Log, with stages labeled (1) to (5). Annotations: 5B pages; spam and freshness; 1B queries/day; quality results.]
    21. 21. Spider <ul><li>Get all pages from the Web </li></ul><ul><li>Web traversal </li></ul><ul><li>Challenges </li></ul><ul><ul><li>Performance, e.g., #pages per PC </li></ul></ul><ul><ul><li>Coverage </li></ul></ul><ul><ul><li>Currency </li></ul></ul><ul><ul><li>Spam filtering </li></ul></ul><ul><ul><li>Hidden Web </li></ul></ul>
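The "Web traversal" step can be illustrated with a minimal breadth-first sketch. It is not the crawler of any particular search engine: `crawl`, `fetch_links`, and the toy link table `toy_web` are hypothetical names introduced here, and real HTTP fetching, politeness rules, and spam filtering are deliberately left out.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal from the seed URLs.

    fetch_links(url) is assumed to return the outgoing links of a page.
    """
    seen = set(seed_urls)      # URLs already queued, to avoid revisits
    queue = deque(seed_urls)   # frontier of URLs still to visit
    visited = []               # pages in the order they were crawled
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy in-memory "web" standing in for real page fetches.
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
order = crawl(["a"], lambda url: toy_web.get(url, []))
```

The `max_pages` cap stands in for the coverage and performance trade-off listed among the challenges: a real spider has to decide which part of the frontier is worth fetching.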
    22. 22. Index Server <ul><li>Index occurrences of all words in the pages </li></ul><ul><li>Data cleanliness </li></ul><ul><li>Challenges </li></ul><ul><ul><li>Space overhead, #pages/PC </li></ul></ul><ul><ul><li>Incremental updates </li></ul></ul><ul><ul><li>Scalability & distributed processing </li></ul></ul><ul><ul><li>Multiple languages </li></ul></ul>
    23. 23. System Anatomy
    24. 24. Data Structure. Lexicon: fits in memory, in two different forms. Hit list: accounts for most of the space; uses 2 bytes per hit to save space. Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID. Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID.
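The relation between the forward and inverted index can be sketched in a few lines. This is a simplified illustration only: it uses plain words instead of wordIDs, stores positions as hits, and ignores barrels and the 2-byte hit encoding. The function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Map each docID to its list of (word, position) hits."""
    return {
        doc_id: [(word, pos) for pos, word in enumerate(text.lower().split())]
        for doc_id, text in docs.items()
    }


def invert(forward):
    """Re-sort the same hits by word; each doc list is ordered by docID."""
    inverted = defaultdict(dict)
    for doc_id in sorted(forward):          # visit documents in docID order
        for word, pos in forward[doc_id]:
            inverted[word].setdefault(doc_id, []).append(pos)
    return dict(sorted(inverted.items()))   # words in sorted order


docs = {2: "web search engine", 1: "web mining and web search"}
inverted = invert(build_forward_index(docs))
```

Here `inverted["web"]` holds the positions of "web" per document, with document 1 listed before document 2.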
    25. 25. Query Server <ul><li>Find relevant URLs for queries by looking up the indices </li></ul><ul><li>Challenges </li></ul><ul><ul><li>Speed, measured in #queries per second </li></ul></ul><ul><ul><li>Functions supported </li></ul></ul><ul><ul><li>Localization </li></ul></ul>
    26. 26. PageRank
    27. 27. PageRank (Cont.) <ul><li>Let $B_u$ be the set of pages that point to $u$, let $N_u$ be the number of links from $u$, and let $c$ be a factor used for normalization; then a simplified version of PageRank is: $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$ </li></ul>
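A small iterative sketch of the simplified PageRank, assuming the recurrence R(u) = c · Σ R(v)/N_v over the pages v that link to u, with c chosen at each step so that the ranks sum to 1. The function name and the three-page toy graph are illustrative assumptions; the sketch also assumes every page has at least one outgoing link and omits the damping factor used in the full algorithm.

```python
def simplified_pagerank(links, iterations=50):
    """links maps each page to the list of pages it points to.

    Assumes every linked page is also a key and has outgoing links.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, outlinks in links.items():
            for u in outlinks:
                # Page v passes an equal share of its rank to each target.
                new_rank[u] += rank[v] / len(outlinks)
        total = sum(new_rank.values())
        # Normalization plays the role of the factor c.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(toy_links)
```

On this toy graph the ranks settle near 0.4 for `a` and `c` and 0.2 for `b`, since `b` only receives half of `a`'s rank.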
    28. 28. Search Functions <ul><li>Phrase search, e.g. &quot;petite galerie&quot; </li></ul><ul><li>Truncation, e.g. librar*, wom*n </li></ul><ul><li>Constraining search, e.g. title:&quot;The Wall Street Journal&quot; </li></ul><ul><li>Proximity search, e.g. gold near silver </li></ul><ul><li>Boolean, e.g. +noir +film -&quot;pinot noir&quot; </li></ul><ul><li>Parentheses and Nested Boolean, e.g. silver and not (gold or platinum) </li></ul><ul><li>Limit search, e.g. limit by date range </li></ul><ul><li>Capitalization, e.g. turkey vs. Turkey </li></ul><ul><li>Ranking fields and refine search </li></ul><ul><li>LiveTopics </li></ul><ul><li>Translate Service </li></ul><ul><li>Other </li></ul>
    29. 29. Document Delivery <ul><li>Bandwidth bottleneck </li></ul><ul><li>Presentation </li></ul><ul><li>Caching </li></ul><ul><ul><li>Queries, search results </li></ul></ul><ul><ul><li>Akamai model </li></ul></ul>
    30. 30. III. Business
    31. 31. What is Google? <ul><li>Specialized web search engine </li></ul><ul><li>Founded in 1998 by 2 graduate students at Stanford University (Larry Page and Sergey Brin) </li></ul><ul><li>Provides a comprehensive, relevant, and easy-to-use web search and browsing service (free) </li></ul><ul><li>Google’s features : fast, unbiased, and accurate results, allows access to over 4 billion web pages, and over 800 million images (most important; valid web pages) </li></ul>
    32. 32. Company Facts Employees: 1,300+ Languages spoken: 34 Worldwide Offices: 21 (Mostly in US & Europe) Annual Revenues: $900m
    33. 33. Google Revenue <ul><li>Revenue—(an e-business): </li></ul><ul><li> ½ from selling relevant text-based ads (sponsored links near search results) </li></ul><ul><li>½ from licensing its search technology to companies like Yahoo </li></ul>Source: Eric Schmidt Interview, PCWorld.com (January 30, 2002)
    34. 34. Sources of Revenue <ul><li>AdWords (150,000 advertisers): “sponsored links” ads </li></ul><ul><li>Cost-per-click pricing; advertisers pay only when people click on the link </li></ul><ul><li>-- Advertising is extremely cheap and effective </li></ul><ul><li>e.g., Edmunds.com spent “$250,000 a month in advertising” because each $1 spent generated $1.70. </li></ul><ul><li>Google Search Appliance </li></ul><ul><li>an integrated hardware/software solution that extends the power of Google to corporate intranets and web servers </li></ul><ul><li>-- Customers include: Cisco Systems, Sony, Procter & Gamble, Sun Microsystems, etc. </li></ul>
    35. 35. Challenges (cont.) <ul><li>Easy entry into the Search Engine Industry </li></ul><ul><li>Lack of customer lock-in (vs. Microsoft); </li></ul><ul><li>Google will focus on creating services to voluntarily draw in customers </li></ul><ul><li>Large, well-known competitors are focusing on in-house search technology (Yahoo, Microsoft, AOL, eBay, Amazon) </li></ul><ul><li>Customers are becoming competitors (Yahoo, AOL) </li></ul>
    36. 36. Competitors: eBay and Amazon <ul><li>eBay ( www.ebay.com ) E-commerce </li></ul><ul><li>Web-based marketplace in which a community of buyers and sellers are brought together to browse, buy and sell various items </li></ul><ul><li>-- Business revenue: charges fees on sale proceeds </li></ul><ul><li>5% on $0.01-$25; 2.5% on $25-$1,000; 1.25% over $1,000 </li></ul><ul><li>Amazon ( www.amazon.com ) E-commerce </li></ul><ul><li>a customer-centric company that sells a range of products that it purchases from manufacturers and distributors </li></ul>
    37. 37. Competitors: Microsoft and Yahoo <ul><li>Microsoft is developing its own search engine </li></ul><ul><li>-- Can “lasso” users into its search engine through its operating system </li></ul><ul><li>-- Has the “brainiacs” to implement top-of-the-line search engine technology </li></ul><ul><li>Yahoo was a customer of Google (may now become Google’s biggest competitor) </li></ul><ul><li>-- Offers placement under sponsored links and within actual results (“unethical”) </li></ul>
    38. 38. IV. Impacts
    39. 39. Impacts <ul><li>Web Computing </li></ul><ul><li>Knowledge Windows </li></ul><ul><li>New Web OS </li></ul>
    40. 40. Web Computing <ul><li>Faster than local search </li></ul><ul><li>Very-large scale of computing systems </li></ul><ul><li>Realize global users’ behaviors </li></ul><ul><li>Acquire global information sources </li></ul>
    41. 41. Web Computing <ul><li>Local disk or global disk? </li></ul><ul><li>Personal information management? </li></ul><ul><ul><li>Gmail </li></ul></ul><ul><ul><li>Photo search </li></ul></ul>
    42. 42. Knowledge Windows <ul><li>Windows of Information Search </li></ul><ul><li>Alliance with online databases </li></ul><ul><li>Windows of Personal Knowledge Management </li></ul><ul><li>Knowledge Windows </li></ul>
    43. 43. New Web OS <ul><li>Merged with Linux OS </li></ul><ul><li>Software download from end-users </li></ul><ul><li>Information Service OS </li></ul>
    44. 44. V. New Gen. of WSE
    45. 45. Advanced Google <ul><li>Is Google good enough? </li></ul><ul><ul><li>“Takano” </li></ul></ul><ul><ul><li>“Takano NII” </li></ul></ul><ul><ul><li>“Takano NII Japan” </li></ul></ul><ul><li>More about Google Services </li></ul><ul><ul><li>http://www.google.com/options/ </li></ul></ul>
    46. 46. New Features in Google <ul><li>Google Labs: http://labs.google.com/ </li></ul><ul><li>Google Desktop Search </li></ul><ul><ul><li>Searching text, Web, Word, Excel, PowerPoint, Outlook, AOL Instant Messenger </li></ul></ul><ul><li>Google SMS </li></ul><ul><ul><li>Searching phone book, dictionary, product prices, … </li></ul></ul><ul><li>Google Print </li></ul><ul><ul><li>Searching books </li></ul></ul>
    47. 48. Other Search Tools <ul><li>A9.com (by Amazon) </li></ul><ul><ul><li>Bookmark, history, discover, diary </li></ul></ul><ul><ul><li>Books, movies, … </li></ul></ul><ul><li>Clusty.com (by Vivisimo) </li></ul><ul><ul><li>Clustering engine </li></ul></ul><ul><li>Snap.com (by Idealab) </li></ul><ul><ul><li>Sorting by popularity, satisfaction, Web popularity, Web satisfaction, domain, … </li></ul></ul><ul><li>Alexa.com (by Amazon) </li></ul><ul><ul><li>Average user review ratings, … </li></ul></ul><ul><li>Others: Yahoo, AskJeeves, AOL Search, HotBot, MSN, Netscape, Lycos, Altavista, LookSmart, Gigablast, Overture, About, FindWhat, Teoma, InformSearch, … </li></ul>
    48. 49. Clusty.com
    49. 50. Example on Vivisimo
    50. 51. Vivisimo (cont.)
    51. 52. New Directions <ul><li>Personalization </li></ul><ul><ul><li>Photo search, email search & filtering </li></ul></ul><ul><li>Information Extraction </li></ul><ul><ul><li>EX: Scholar search </li></ul></ul><ul><li>Information Agent </li></ul><ul><li>Deep Web Search </li></ul>
    52. 53. VI. Web Mining
    53. 54. Web Search/Information Retrieval Millions of Users Web Search Engine Information Seeking
    54. 55. Improving Search via Mining Millions of Users Web texts, images, logs … Search Engine Knowledge Discovery
    55. 56. Valuable Web Resources Web logs, texts, images , … Knowledge Discovery Millions of Users Hyper Links Anchor Texts Search Result Pages Query Logs Query Session Logs Clicked Stream Logs Deep Web, ….
    56. 57. Discovered Knowledge Web logs, texts, images , … Knowledge Discovery Millions of Users Users’ Preferences/Need: Topic, Location, Timing, … Authority/Popularity: Site, File, People, Company, Product Clusters/Associations/ Relations: Site, Page, People, Company, Product, Query
    57. 58. Web Mining for IR Web logs, texts, images , … Knowledge Discovery Millions of Users Search Classification Clustering Cross-language IR Information Extraction Text mining Filtering
    58. 59. <ul><li>CS 276 / LING 239I Information Retrieval and Web Mining </li></ul><ul><li>Prabhakar Raghavan and Hinrich Schütze </li></ul><ul><li>Course Description: </li></ul><ul><li>Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean, vector space, and probabilistic retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification, wrapper, information extraction, and collaborative filtering systems; text mining. Projects can be chosen from diverse topics in information retrieval. </li></ul>
    59. 60. Computational Linguistics, Volume 29, Issue 3, September 2003.
    60. 61. Research at Web Knowledge Discovery Lab
    61. 62. Research at Web Knowledge Discovery Lab <ul><li>Live series </li></ul><ul><ul><li>LiveTrans </li></ul></ul><ul><ul><ul><li>SIGIR’04, ACL’04, JCDL’04 </li></ul></ul></ul><ul><ul><ul><li>ACM Trans. on Information Systems, 2004 </li></ul></ul></ul><ul><ul><ul><li>Online translation of unknown queries via the Web </li></ul></ul></ul><ul><ul><li>LiveClassifier </li></ul></ul><ul><ul><ul><li>WWW’04, IJCNLP’04 </li></ul></ul></ul><ul><ul><ul><li>ACM Trans. on Asian Language Information Processing (TALIP), 2004 </li></ul></ul></ul><ul><ul><ul><li>Training classifiers and classifying short text via the Web </li></ul></ul></ul>
    62. 63. Research at Web Knowledge Discovery Lab <ul><ul><li>LiveCluster </li></ul></ul><ul><ul><ul><li>CIKM’04 </li></ul></ul></ul><ul><ul><ul><li>ACM Trans. on Information Systems, 2004 </li></ul></ul></ul><ul><ul><ul><li>Generating taxonomy from terms or documents </li></ul></ul></ul>
    63. 64. LiveTrans: Cross-language Web Search
    64. 65. LiveClassifier : Classifying search results into user-defined classification tree
    65. 66. LiveClassifier : Paper Title Categorization Note: no labeled training data
    66. 67. LiveCluster : Taxonomy Generation
    67. 68. Terms Clustering
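    68. 69. Query Clustering [Figure: hierarchical clustering of query terms, cut into five clusters]
        Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (vocational training bureau), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruitment), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking)
        Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune-telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate reading), 姓名學 (name divination), 心理測驗 (psychological test), 星座 (horoscope), 愛情 (love)
        Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
        Cluster: 補帖 (patch), 大補帖 (big patch), 泡麵 (instant noodles), dbt
        Cluster: 武俠 (martial arts), 金庸 (Jin Yong), 武俠小說 (martial-arts novel), 黃易 (Huang Yi), 作家 (writer)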
    69. 71. Outline <ul><li>Translating Unknown Queries (SIGIR’04) </li></ul><ul><li>Training Text Classifiers (WWW’04) </li></ul><ul><li>Generating Taxonomy/Topic Hierarchies (TOIS’04) </li></ul>
    70. 72. Translating Unknown Queries <ul><li>Anchor Text Mining </li></ul><ul><ul><li>Probabilistic Modeling (ACM TALIP’02) </li></ul></ul><ul><ul><li>Transitive Translation (ACM TOIS’04) </li></ul></ul><ul><li>Search-Result Page Mining </li></ul><ul><ul><li>Translation Extraction & Selection (JCDL’04) </li></ul></ul><ul><ul><li>CLIR & Other Applications (SIGIR’04, ACL’04) </li></ul></ul>Note: First work dealing with online translation
    71. 73. Introduction (cont.) <ul><li>Bottleneck of CLIR service </li></ul><ul><ul><li>Real queries are often short </li></ul></ul><ul><ul><li>Out-of-dictionary terms </li></ul></ul><ul><ul><li>and might have local variations </li></ul></ul><ul><ul><ul><li>Ex: proper nouns, new terminologies, … </li></ul></ul></ul><ul><li>Need for a powerful query translation engine </li></ul><ul><ul><li>Up-to-date dictionary </li></ul></ul>
        English Terminologies | Chinese Translation
        louvre museum | 羅浮宮
        NII Japan | 国立情報学研究所
        Ishikawa | 石川県
        Banff | 班夫 / 班芙
        Digital library | 數位圖書館 / 數字圖書館
        SARS | 嚴重急性呼吸道症候群 / 非典 / 沙士
        Clinton | 柯林頓 / 克林頓
        Bill Gates | 比爾蓋茲
    72. 74. Web Mining of Query Translations <ul><li>Different problems for different resources </li></ul> [Figure labels: Source Term; Term Translation; Target Translations; Web Mining: Anchor-Text Mining, Search-Result Mining; OOD; example: Yahoo <-> 雅虎]
    73. 75. Anchor Text (Yahoo <-> 雅虎 ) <ul><li>Applies to most languages </li></ul><ul><li>Translation candidates are likely to appear in the same anchor-text-set </li></ul>
    74. 76. Search Result Page (National Palace Museum vs. 故宮博物院 ) <ul><li>Mixed-language characteristic in Chinese pages </li></ul>
    75. 77. Problems <ul><li>Term extraction </li></ul><ul><li>Translation selection & noise reduction </li></ul><ul><li>Language pairs with limited corpora </li></ul><ul><li>Processing speed </li></ul><ul><li>Data cleanliness (language identification) </li></ul><ul><li>Language independence </li></ul>
    76. 78. Term Extraction: SCPCD
    77. 79. Term Selection: Probabilistic Inference Model <ul><li>Integrating anchor texts and link structures into a probabilistic inference model </li></ul><ul><li>Based on co-occurrence & page authority </li></ul> [Figure labels: Page Authority, Co-occurrence, Page Rank]
    78. 80. Observation of Anchor Text Source Term(Ts) Translation(Tt) 雅虎 => Yahoo
    79. 81. Observation of Anchor Text [Figure labels: source query “Yahoo”; anchor-text fragments “- in USA” and “Taiwan -”; target pages www.yahoo.com and www.yahoo.com.tw]
    80. 82. Observation of Anchor Text [Figure labels: anchor-text set with “Yahoo”, “雅虎”, “- in USA”, “Taiwan -”, “台灣 -” (Taiwan), “- 搜尋引擎” (search engine); translation candidates; target pages www.yahoo.com and www.yahoo.com.tw]
    81. 83. Observation of Anchor Text [Figure labels: same anchor-text set as the previous slide, annotated with co-occurrence and page authority (#in-link= 187, #in-link= 21); target pages www.yahoo.com and www.yahoo.com.tw]
    82. 84. Search Result Mining [Figure: Source Query, Search Engine, Web Pages, Term Extraction, Term Selection, Target Translations] Term extraction uses the PAT-tree based term extraction method [Chien, SIGIR ‘97].
    83. 85. Term Selection <ul><li>How to decide the ranking? </li></ul><ul><li>S, T_i: frequently co-occur in the same pages </li></ul><ul><ul><li>Not necessarily true for synonyms and antonyms </li></ul></ul><ul><li>S, T_i: the result pages contain similar co-occurring context terms, used as feature vectors </li></ul> [Figure: query S linked to candidates T_1, T_2, ..., T_n]
    84. 86. Chi-Square Test <ul><li>Chi-Square Test: a statistical method for co-occurrence analysis [Gale & Church ‘91] </li></ul>a: # of pages containing both terms s and t; b: # of pages containing term s but not t; c: # of pages containing term t but not s; d: # of pages containing neither term s nor t; N: the total number of pages, i.e., N = a + b + c + d
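The scoring formula on this slide was an image and did not survive in the transcript. The sketch below assumes the standard chi-square statistic for a 2x2 contingency table, N(ad − bc)² / ((a+b)(a+c)(b+d)(c+d)), computed from the page counts a, b, c, d defined above. The function name is hypothetical, and this may differ in detail from the score used in the cited work.

```python
def chi_square_score(a, b, c, d):
    """Chi-square association between terms s and t from page counts.

    a: pages with both s and t      b: pages with s but not t
    c: pages with t but not s       d: pages with neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # a term that appears in all or no pages carries no signal
    return n * (a * d - b * c) ** 2 / denom


perfect = chi_square_score(10, 0, 0, 10)    # s and t always appear together
independent = chi_square_score(5, 5, 5, 5)  # no association at all
```

In practice the counts would come from search-engine hit counts for the queries "s AND t", "s", and "t"; that lookup is outside this sketch.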
    85. 87. Context Vector Analysis <ul><li>Context Vector Analysis: co-occurring context terms as feature vectors </li></ul><ul><li>Similarity measure: cosine measure </li></ul>
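A minimal sketch of the cosine measure over context vectors, assuming each term is represented as a sparse dictionary that maps co-occurring context terms to weights. The weights and sample vectors below are made up for illustration; how the weights are derived (for example, tf-idf) is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty context vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical context vectors for a source term and two candidates.
source = {"network": 3.0, "router": 4.0}
good_candidate = {"network": 6.0, "router": 8.0}
bad_candidate = {"fortune": 1.0, "card": 2.0}

sim_good = cosine(source, good_candidate)
sim_bad = cosine(source, bad_candidate)
```

Because cosine ignores vector length, a candidate whose context has the same shape as the source scores close to 1 even when its raw counts are larger.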
    86. 88. Indirect Association Problem [Figure labels: nodes s, t, s1, t1; terms Cisco, 思科 (Cisco), system, 系統 (system)] Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors.
    87. 89. Competitive Linking Algorithm [Figure labels: nodes s, t1, t2, St1, St2; terms Cisco, system, 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer)] Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2.
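The general idea of competitive linking can be sketched as a greedy one-to-one matching: link the highest-scoring pair first, then exclude both terms from later links. Whether this matches "Algorithm 2" in the source paper is not confirmed by the transcript; the function name and the association scores below are assumptions for illustration.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking of source terms to target terms.

    scores maps (source, target) pairs to an association score.
    """
    links = {}
    used_targets = set()
    # Visit pairs from the strongest association to the weakest.
    for (s, t), _score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in links and t not in used_targets:
            links[s] = t
            used_targets.add(t)
    return links


# Hypothetical scores: "system" is indirectly associated with 思科 (Cisco)
# because "Cisco" and "system" often appear together.
scores = {
    ("Cisco", "思科"): 0.90,
    ("system", "思科"): 0.85,
    ("system", "系統"): 0.80,
    ("Cisco", "系統"): 0.60,
}
links = competitive_linking(scores)
```

Picking the best target for each source independently would map both "Cisco" and "system" to 思科; the competition for targets is what pushes "system" to 系統.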
    88. 90. Combined Method <ul><li>To take advantage of both methods </li></ul><ul><ul><li>Anchor-text-based: higher precision </li></ul></ul><ul><ul><li>Search-result-based: higher coverage </li></ul></ul>R_m(s,t): the rank of candidate t for source term s under method m
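The combination formula itself is missing from the transcript. One plausible reading, used in the sketch below as an explicit assumption, is a weighted sum of reciprocal ranks: each method m contributes weight_m / R_m(s,t) to a candidate's combined score. The function name, method names, weights, and candidate lists are all hypothetical.

```python
def combine_by_rank(rankings, weights):
    """Merge per-method candidate rankings into one ordered list.

    rankings maps a method name to its candidates, best first.
    weights maps a method name to its assumed weight.
    """
    combined = {}
    for method, ranked in rankings.items():
        for rank, candidate in enumerate(ranked, start=1):
            # Assumed scoring: reciprocal rank scaled by the method weight.
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] / rank
    return sorted(combined, key=lambda c: -combined[c])


rankings = {
    "anchor_text": ["雅虎"],                     # high precision, low coverage
    "chi_square": ["搜尋", "雅虎", "奇摩"],
    "context_vector": ["雅虎", "搜尋", "奇摩"],
}
weights = {"anchor_text": 1.0, "chi_square": 1.0, "context_vector": 1.0}
merged = combine_by_rank(rankings, weights)
```

A candidate missing from one method's list simply receives no contribution from it, which lets the low-coverage anchor-text method boost candidates without vetoing the others.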
    89. 91. Experiments <ul><li>Performance on Query Translation </li></ul><ul><ul><li>Test Bed: real query terms from the Dreamer search engine log in Taiwan </li></ul></ul><ul><ul><li>228,566 unique terms, during a period of 3 months in 1998 </li></ul></ul><ul><ul><li>Random-query test set : </li></ul></ul><ul><ul><ul><li>50 query terms in Chinese, randomly selected from the top 20,000 queries in the log </li></ul></ul></ul><ul><ul><ul><li>40 of them were out-of-dictionary </li></ul></ul></ul>
    90. 92. Random Query Test Set <ul><li>Many query terms didn’t appear in anchor-text sets (coverage) </li></ul>
        Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set. (CV: context vector; χ²: chi-square; AT: anchor text)
        Method | Top-1 | Top-3 | Top-5 | Coverage
        CV | 40.0% | 54.0% | 54.0% | 68%
        χ² | 36.0% | 50.0% | 52.0% | 68%
        AT | 20.0% | 32.0% | 32.0% | 32%
        Combined | 44.0% | 64.0% | 66.0% | 72%
    91. 93. Other Experiments <ul><li>430 popular Chinese queries, 67.4% top-1 inclusion rate </li></ul><ul><li>Common terms: randomly selected 100 common nouns and 100 common verbs from general-purpose Chinese dictionary </li></ul>
    92. 94. Transitive Translation. Top-n inclusion rates obtained with different models.
        Model | Top1 | Top2 | Top3 | Top4 | Top5
        Direct Translation | 35.7% | 43.0% | 46.9% | 49.6% | 51.2%
        Indirect Translation | 44.2% | 55.1% | 58.0% | 59.7% | 60.5%
        Transitive Translation | 49.2% | 58.1% | 60.9% | 61.6% | 62.0%
    93. 95. Transitive Translation Model
    94. 96. Chinese-Japanese Translation
        Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
        Model | Top1 | Top2 | Top3 | Top4 | Top5
        Direct | 10.5% | 12.8% | 14.3% | 15.1% | 15.1%
        Indirect | 40.2% | 49.4% | 56.6% | 58.6% | 59.6%
        Transitive | 42.9% | 51.4% | 58.6% | 61.3% | 61.9%
        Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
        Source terms (Traditional Chinese) | English | Simplified Chinese | Japanese
        新力 | Sony | 索尼 | ソニー
        耐吉 | Nike | 耐克 | ナイキ
        史丹佛 | Stanford | 斯坦福 | スタンフォード
        雪梨 | Sydney | 悉尼 | シドニー
        網際網路 | internet | 互联网 | インターネット
        網路 | network | 网络 | ネットワーク
        首頁 | homepage | 主页 | ホームページ
        電腦 | computer | 计算机 | コンピューター
        資料庫 | database | 数据库 | データベース
        資訊 | information | 信息 | インフォメーション
    95. 97. Translation Lexicons with Regional Variations (a) Taiwan (b) Mainland China (c) Hong Kong. Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words “George Bush” from Google.
    96. 98. Summary <ul><li>A work dealing with live translation of unknown queries </li></ul><ul><li>Anchor-text-based </li></ul><ul><ul><li>High precision for high-frequency terms </li></ul></ul><ul><ul><li>Effective for proper nouns in multiple languages </li></ul></ul><ul><ul><li>Not applicable if the anchor-text set is too small </li></ul></ul><ul><li>Search-result-based </li></ul><ul><ul><li>Exploits rich Web resources </li></ul></ul><ul><ul><li>High coverage for the English-Chinese language pair </li></ul></ul>
    97. 99. LiveCluster: Generating Taxonomy from terms or documents
    98. 100. Taxonomy Generation from Terms
    99. 101. Hierarchical Query Clustering
    100. 102. The Steps <ul><li>Feature Extraction </li></ul><ul><ul><li>Use co-occurred seed terms extracted from retrieved top pages </li></ul></ul><ul><li>Term Vector </li></ul><ul><ul><li>Each query term is assigned a term vector </li></ul></ul><ul><ul><ul><li>Record the co-occurred feature terms and their frequency values in the retrieved documents. </li></ul></ul></ul><ul><li>Term Similarity </li></ul><ul><ul><li>tf*idf-based Cosine measurement </li></ul></ul><ul><li>Hierarchical Term Clustering </li></ul><ul><ul><li>Cluster popular query terms in the log into initial categories </li></ul></ul><ul><ul><li>Query terms with similar features are grouped into clusters. </li></ul></ul>
    101. 103. Feature Extraction <ul><ul><li>Use co-occurred seed terms extracted from retrieved top pages </li></ul></ul>
        Query term: nude
        Retrieved snippets: Creative Nude Photography Network -- Fine Art Nude and ... ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography ! Over 100 CNPN Member Sites ... Nude Places ... to be naked . Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude . 60 minutes $39.95. ... A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography . If you are under 18 or do not wish to view this material, You can ...
        Co-occurred feature terms:
        term | tf/df
        photography | 2/2
        art | 3/2
        … | …
        naked | 1/1
        erotic photography | 3/2
    102. 104. Term Weighting
    103. 105. Extraction of Basic Feature Terms <ul><li>Performance of different features: randomly selected, high-frequency, and seed terms </li></ul><ul><ul><li>Popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”, etc. </li></ul></ul><ul><ul><li>More expressive and distinguishable in describing a particular category </li></ul></ul><ul><ul><li>Two logs were compared, and 9,709 overlapping top query terms were extracted as feature terms </li></ul></ul>
    104. 106. Task I: Query Clustering (Cont.) <ul><li>Feature Extraction </li></ul><ul><ul><li>Use co-occurred seed terms extracted from retrieved top pages </li></ul></ul><ul><li>Term Vector </li></ul><ul><ul><li>Each query term is assigned a term vector </li></ul></ul><ul><ul><ul><li>Record the co-occurred feature terms and their frequency values in the retrieved documents. </li></ul></ul></ul><ul><li>Term Similarity </li></ul><ul><ul><li>TF *IDF-based Cosine measurement </li></ul></ul><ul><li>Hierarchical Term Clustering </li></ul><ul><ul><li>Cluster popular query terms in the log into initial categories </li></ul></ul><ul><ul><li>Query terms with similar features are grouped into clusters. </li></ul></ul>
    105. 107. Term Similarity
    106. 108. Hierarchical Term Clustering <ul><li>Agglomerative hierarchical clustering (AHC) </li></ul><ul><ul><li>Compute the similarity between all pairs of clusters </li></ul></ul><ul><ul><ul><li>Estimate similarity between all pairs of composed terms </li></ul></ul></ul><ul><ul><ul><li>Use the lowest term similarity value as the cluster similarity value </li></ul></ul></ul><ul><ul><li>Merge the most similar (closest) two clusters </li></ul></ul><ul><ul><ul><li>Complete linkage method </li></ul></ul></ul><ul><ul><li>Update the cluster vector of the new cluster </li></ul></ul><ul><ul><li>Repeat steps 2 and 3 until only a single cluster remains </li></ul></ul>
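The agglomerative steps above can be sketched with complete linkage, where the similarity of two clusters is the lowest cosine similarity between any pair of their member terms. This is an illustrative, unoptimized version: instead of maintaining a cluster vector it recomputes pairwise similarities from the member term vectors, and the function names and toy term vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def complete_linkage(c1, c2, vectors):
    """Cluster similarity = lowest similarity over all member pairs."""
    return min(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)


def agglomerate(vectors):
    """Merge the two most similar clusters until one remains.

    Returns the merge history as (cluster, cluster, similarity) tuples.
    """
    clusters = [(name,) for name in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = complete_linkage(clusters[i], clusters[j], vectors)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical term vectors over co-occurring feature terms.
vectors = {
    "job": {"work": 1.0, "resume": 1.0},
    "resume": {"work": 1.0, "resume": 2.0},
    "tarot": {"fortune": 1.0, "card": 1.0},
    "horoscope": {"fortune": 2.0, "star": 1.0},
}
merges = agglomerate(vectors)
```

The merge history is the hierarchy: cutting it at a similarity threshold (here, anything above 0) yields the flat clusters, in this toy case a job-related group and a fortune-related group.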
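    107. 110. Clustering Results [Figure: hierarchical clustering of query terms, cut into five clusters]
        Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (vocational training bureau), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruitment), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (talent seeking)
        Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune-telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate reading), 姓名學 (name divination), 心理測驗 (psychological test), 星座 (horoscope), 愛情 (love)
        Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
        Cluster: 補帖 (patch), 大補帖 (big patch), 泡麵 (instant noodles), dbt
        Cluster: 武俠 (martial arts), 金庸 (Jin Yong), 武俠小說 (martial-arts novel), 黃易 (Huang Yi), 作家 (writer)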
    108. 111. Cluster Partition
    109. 112. Quality Function
    110. 113. Quality Function (Cont.)
    111. 114. Quality Function (Cont.)
    112. 115. Preliminary Experiment <ul><li>Test queries </li></ul><ul><ul><ul><li>Two sets: top 1k queries and random 1k queries </li></ul></ul></ul><ul><ul><ul><li>Each test query has been manually assigned to its corresponding class </li></ul></ul></ul><ul><li>Evaluation metrics </li></ul><ul><ul><ul><li>F-Measure </li></ul></ul></ul>
    113. 116. Evaluation: F-Measure
    114. 117. Obtained F-Measures
    115. 119. Results of Hierarchical Structure Generation