Opportunities and Challenges of Web Search and Mining
Lee-Feng Chien (簡立峰), Academia Sinica & National Taiwan University
Outline
WSE = Google Globalization!
WSE = Google
Problems of WSE
  Inside WSE: Fast, Coverage, Accuracy
  Business: Profitable, Models, Competition
  Impacts: Web Computing, Knowledge Windows, New Paradigm of Civilization
I.  Some Must-Know   Statistics
Online Language Populations
Top Ten Languages in the Web
  More and more non-English users!
  TOP TEN LANGUAGES IN THE INTERNET
  Language              | Internet Users, by Language | Average Penetration | World Population Estimate for Language | Language as % of Total Internet Users
  English               | 287,369,520                 | 26.2 %              | 1,098,654,265                          | 35.9 %
  Chinese               | 105,484,112                 | 8.0 %               | 1,321,669,200                          | 13.2 %
  Japanese              | 66,548,060                  | 52.1 %              | 127,853,600                            | 8.3 %
  German                | 54,035,201                  | 56.3 %              | 95,893,300                             | 6.8 %
  Spanish               | 53,670,063                  | 13.9 %              | 386,413,200                            | 6.7 %
  French                | 35,034,269                  | 9.3 %               | 375,164,185                            | 4.4 %
  Korean                | 30,670,000                  | 41.0 %              | 74,730,000                             | 3.8 %
  Italian               | 28,610,000                  | 49.3 %              | 57,987,100                             | 3.6 %
  Portuguese            | 23,058,254                  | 10.3 %              | 224,664,100                            | 2.9 %
  Dutch                 | 13,657,170                  | 56.6 %              | 24,125,950                             | 1.7 %
  TOP TEN LANGUAGES     | 698,353,773                 | 18.4 %              | 3,787,154,900                          | 87.3 %
  Rest of the Languages | 101,686,725                 | 3.9 %               | 2,602,992,587                          | 12.7 %
  WORLD TOTAL           | 800,040,498                 | 12.5 %              | 6,390,147,487                          | 100.0 %
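The two percentage columns follow from the two count columns: penetration is users divided by population, and the share column is users divided by the world total of users. A minimal sketch that rechecks three rows copied from the table (the function and variable names are illustrative and do not come from the slides):

```python
# Internet users and population estimates copied from the table above.
ROWS = {
    "English": (287_369_520, 1_098_654_265),
    "Chinese": (105_484_112, 1_321_669_200),
    "Japanese": (66_548_060, 127_853_600),
}
WORLD_USERS = 800_040_498  # "WORLD TOTAL" row, Internet users


def penetration(users: int, population: int) -> float:
    """Percentage of a language's speakers who are online."""
    return 100.0 * users / population


def share_of_users(users: int, world_users: int = WORLD_USERS) -> float:
    """Percentage of all Internet users who use this language."""
    return 100.0 * users / world_users


for name, (users, population) in ROWS.items():
    print(f"{name}: penetration {penetration(users, population):.1f} %, "
          f"share {share_of_users(users):.1f} %")
```

Rounded to one decimal, the computed values agree with the table, which makes the point of the slide concrete: Chinese has a large share of users despite a penetration of only about 8 %.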
Web Content: More and more non-English pages. Source: Network Wizards, Jan 99 Internet Domain Survey
Web Users and Pages (5 years ago): Challenge of Scalability! Chinese users: 110M, including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others. Source: Global Reach, 2004
Number of Chinese Web Pages: 573,000,000 pages. Scalability Problem!
Number of Web Pages: The world's largest search engine? [Chart: billions of textual documents indexed as of Sept 2, 2003. KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch]
The top 10 Internet trends 2004 predicted by eOneNet.com
II.  Inside WSE
Components
Architecture [diagram, steps labeled (1) to (5)]: Web, Spider, Archive, Indexer, Index (three shown), SE (three shown), Browser, Log. Annotations: 5B pages, 1B queries/day, scalable, quality results, spam, freshness.
Spider
Index Server
System Anatomy
Data Structure
  Lexicon: fits in memory; kept in two different forms
  Hit list: accounts for most of the space; uses 2 bytes per hit to save space
  Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
  Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
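The relationship between the lexicon, the forward index and the inverted index can be illustrated with a toy in-memory version. This is only a sketch under simplifying assumptions: there are no barrels, a hit is reduced to a word position, and the documents and function names are hypothetical rather than taken from any real search engine.

```python
from collections import defaultdict

# Toy documents keyed by docID (hypothetical data).
DOCS = {
    1: "web search engine",
    2: "web mining and search",
}


def build_lexicon(docs):
    """Map each distinct word to a small integer wordID."""
    words = sorted({w for text in docs.values() for w in text.split()})
    return {w: i for i, w in enumerate(words)}


def build_forward_index(docs, lexicon):
    """docID -> {wordID: [positions]}; the position list stands in for a hit list."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for pos, word in enumerate(text.split()):
            hits[lexicon[word]].append(pos)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """wordID -> [(docID, positions)], keys sorted by wordID, doc lists by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(sorted(inverted.items()))


lexicon = build_lexicon(DOCS)
inverted = invert(build_forward_index(DOCS, lexicon))
print(inverted[lexicon["search"]])
```

The inverted index holds exactly the same hits as the forward index; only the sort key changes from docID to wordID, which is what makes lookups by query term cheap.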
Query Server
PageRank
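As a reminder of how PageRank behaves, the sketch below assumes the commonly cited damped random-surfer formulation with a damping factor of 0.85, computed by power iteration. The function name, the handling of pages without out-links, and the three-page graph are illustrative assumptions, not the formula from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page web: A and C both link to B, B links back to A.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["B"]})
print(max(ranks, key=ranks.get))
```

In this toy graph B collects rank from two pages and comes out on top, while C, which nobody links to, keeps only the random-jump share.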
Search Functions
Document Delivery
III.  Business
What is Google?
Company Facts
  Employees: 1,300+
  Languages spoken: 34
  Worldwide offices: 21 (mostly in US & Europe)
  Annual revenues: $900 million
Google Revenue. Source: Eric Schmidt interview, PCWorld.com (January 30, 2002)
Sources of Revenue
Challenges (cont.)
Competitors: eBay and Amazon
Competitors: Microsoft and Yahoo
IV.  Impacts
Impacts
Web Computing
Knowledge Windows
New Web OS
V.  New Gen. of WSE
Advanced Google
New Features in Google
Other Search Tools
Clusty.com
Example on Vivisimo
Vivisimo  (cont.)
New Directions
VI.  Web Mining
Web Search/Information Retrieval [diagram]: Millions of Users; Web Search Engine; Information Seeking
Improving Search via Mining [diagram]: Millions of Users; Web texts, images, logs, …; Search Engine; Knowledge Discovery
Valuable Web Resources [diagram]: Millions of Users; Web logs, texts, images, …; Knowledge Discovery
  Resources: Hyperlinks, Anchor Texts, Search Result Pages, Query Logs, Query Session Logs, Click-Stream Logs, Deep Web, …
Discovered Knowledge [diagram]: Millions of Users; Web logs, texts, images, …; Knowledge Discovery
  Users' Preferences/Needs: Topic, Location, Timing, …
  Authority/Popularity: Site, File, People, Company, Product
  Clusters/Associations/Relations: Site, Page, People, Company, Product, Query
Web Mining for IR [diagram]: Millions of Users; Web logs, texts, images, …; Knowledge Discovery
  Applications: Search, Classification, Clustering, Cross-language IR, Information Extraction, Text Mining, Filtering
Computational Linguistics, 29(3), September 2003.
Research at Web Knowledge Discovery Lab
LiveTrans: Cross-language Web Search
LiveClassifier: Classifying search results into a user-defined classification tree
LiveClassifier: Paper Title Categorization (Note: no labeled training data)
LiveCluster: Taxonomy Generation
Terms Clustering
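Query Clustering [dendrogram of Chinese query terms, cut into five clusters]
  Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (Bureau of Employment and Vocational Training), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruitment), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (hiring)
  Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune-telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate analysis), 姓名學 (name divination), 心理測驗 (psychological quiz), 星座 (zodiac signs), 愛情 (love)
  Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
  Cluster: 補帖 (short for 大補帖), 大補帖 (pirated-software compilation disc), 泡麵 (instant noodles), dbt
  Cluster: 武俠 (wuxia, martial arts fiction), 金庸 (Jin Yong), 武俠小說 (wuxia novels), 黃易 (Huang Yi), 作家 (writer)
Outline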
Translating Unknown Queries (Note: first work dealing with online translation)
Introduction (cont.)
  English Terminologies | Chinese Translation
  Bill Gates            | 比爾蓋茲
  Clinton               | 柯林頓 / 克林頓
  SARS                  | 嚴重急性呼吸道症候群 / 非典 / 沙士
  Digital library       | 數位圖書館 / 數字圖書館
  Banff                 | 班夫 / 班芙
  Ishikawa              | 石川県
  NII Japan             | 国立情報学研究所
  louvre museum         | 羅浮宮
Web Mining of Query Translations [diagram]: Source Term; Term Translation; Web Mining (Anchor-Text Mining, Search-Result Mining); OOD; Target Translations. Example: Yahoo <-> 雅虎
Anchor Text (Yahoo <-> 雅虎)
Search Result Page (National Palace Museum vs. 故宮博物院)
Problems
Term Extraction: SCPCD
Term Selection: Probabilistic Inference Model (labels: Page Authority, Co-occurrence, Page Rank)
Observation of Anchor Text [diagram built up over four slides]
  Source Term (Ts) => Translation (Tt): 雅虎 => Yahoo
  Pages: www.yahoo.com, www.yahoo.com.tw
  Anchor-text fragments shown: "Yahoo", "雅虎", "- in USA", "Taiwan -", "台灣 -" (Taiwan), "搜尋引擎" (search engine)
  Labels: Source Query, Translation Candidates, Anchor-Text Set, Page Authority (#in-link = 187, #in-link = 21), Co-occurrence
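The anchor-text observation can be sketched in code: terms that appear together in the anchor texts pointing at the same page are translation candidates, and pages with more in-links count for more. The scoring below is a simplified, assumed form (authority-weighted co-occurrence divided by authority-weighted occurrence of either term), not the exact probabilistic inference model from the slides; the page data, in-link counts and function names are hypothetical.

```python
# Hypothetical anchor-text sets: for each URL, the anchor texts of links
# pointing to it and a toy in-link count used as a crude authority weight.
PAGES = {
    "www.yahoo.com": {"in_links": 180, "anchors": {"Yahoo", "雅虎", "搜尋引擎"}},
    "www.yahoo.com.tw": {"in_links": 20, "anchors": {"Yahoo", "雅虎", "台灣"}},
    "search.example.org": {"in_links": 40, "anchors": {"搜尋引擎", "Search"}},
}


def score(source, target, pages):
    """Authority-weighted co-occurrence of two terms in the same anchor-text set."""
    total = sum(p["in_links"] for p in pages.values())
    both = either = 0.0
    for p in pages.values():
        weight = p["in_links"] / total  # page authority as a share of all in-links
        has_s = source in p["anchors"]
        has_t = target in p["anchors"]
        if has_s and has_t:
            both += weight
        if has_s or has_t:
            either += weight
    return both / either if either else 0.0


def rank_translations(source, pages):
    """All other anchor texts, ordered by their score against the source term."""
    candidates = {a for p in pages.values() for a in p["anchors"]} - {source}
    return sorted(candidates, key=lambda t: score(source, t, pages), reverse=True)


print(rank_translations("Yahoo", PAGES)[0])
```

With this toy data 雅虎 always accompanies Yahoo and therefore scores highest, while 搜尋引擎 (search engine) is penalized because it also labels an unrelated page.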
Search Result Mining [diagram]: Source Query; Search Engine; Web Pages; Term Extraction (PAT-tree based term extraction method [Chien, SIGIR '97]); Term Selection; Target Translations
Term Selection [diagram]: Query S; candidate terms T1, T2, ..., Tn
Chi-Square Test
  a: # of pages containing both terms s and t
  b: # of pages containing term s but not t
  c: # of pages containing term t but not s
  d: # of pages containing neither term s nor t
  N: the total number of pages, i.e., N = a + b + c + d
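Given the four page counts defined above, an association score between s and t can be computed directly. The sketch assumes the usual chi-square statistic for a 2x2 contingency table, N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)); the formula on the slide itself did not survive, and the example counts are made up.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """2x2 chi-square statistic from page counts a, b, c, d as defined above."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # One of the terms occurs on every page or on none: no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: the first pair of terms co-occurs far more than chance predicts.
strong = chi_square(a=30, b=10, c=20, d=940)
weak = chi_square(a=5, b=35, c=45, d=915)
print(strong > weak)
```

In a live setting the counts would come from search-engine hit counts for queries such as "s AND t", which is what makes the test usable without a parallel corpus.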
Context Vector Analysis
Indirect Association Problem
  Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors. [Nodes: s, t, s1, t1; terms shown: Cisco, system, 思科 (Cisco), 系統 (system)]
Competitive Linking Algorithm
  Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2. [Nodes: s, t1, t2, St1, St2; terms shown: Cisco, system, 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer)]
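The idea behind competitive linking can be sketched as a greedy one-to-one matching: the highest-scoring pair is linked first, and each term may be linked only once, so a strong correct pair blocks weaker indirect associations. This is a generic sketch of that idea, not the slides' Algorithm 2; the scores are invented for illustration.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking: best-scoring pairs first, each term used at most once."""
    links = {}
    used_targets = set()
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (source, target), _ in ordered:
        if source not in links and target not in used_targets:
            links[source] = target
            used_targets.add(target)
    return links


# Hypothetical association scores. Picking the best target for "system" in
# isolation would choose 思科 (Cisco), an indirect association error.
SCORES = {
    ("Cisco", "思科"): 0.90,
    ("system", "思科"): 0.70,
    ("system", "系統"): 0.65,
    ("Cisco", "系統"): 0.60,
}

print(competitive_linking(SCORES))
```

Because Cisco claims 思科 first, the only target left for "system" is 系統, which is the intended translation.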
Combined Method
  R_m(s,t): the rank of the score of (s, t) under method m
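One way to combine methods by rank rather than by raw score is a weighted sum of reciprocal ranks, i.e. adding alpha_m / R_m(s,t) over the methods m. The sketch below assumes that form; the weights, method labels and candidate lists are hypothetical, and the slide's actual formula is not reproduced here.

```python
def combine(rankings, weights):
    """rankings: {method: [candidates, best first]}; weights: {method: alpha_m}."""
    combined = {}
    for method, ranked in rankings.items():
        for rank, candidate in enumerate(ranked, start=1):
            # A candidate missing from a method's list simply gets no credit from it.
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] / rank
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical ranked candidate lists from three methods.
RANKINGS = {
    "AT": ["t1", "t2"],
    "chi2": ["t2", "t1", "t3"],
    "CV": ["t2", "t3", "t1"],
}
WEIGHTS = {"AT": 1.0, "chi2": 1.0, "CV": 1.0}

print(combine(RANKINGS, WEIGHTS))
```

Using ranks sidesteps the problem that scores from different methods are not on a common scale.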
Experiments
Random Query Test Set
  Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set. (CV = context vector, χ² = chi-square, AT = anchor text)
  Method   | Top-1 | Top-3 | Top-5 | Coverage
  CV       | 40.0% | 54.0% | 54.0% | 68%
  χ²       | 36.0% | 50.0% | 52.0% | 68%
  AT       | 20.0% | 32.0% | 32.0% | 32%
  Combined | 44.0% | 64.0% | 66.0% | 72%
Other Experiments
Transitive Translation
  Top-n inclusion rates obtained with different models.
  Model                  | Top1  | Top2  | Top3  | Top4  | Top5
  Direct Translation     | 35.7% | 43.0% | 46.9% | 49.6% | 51.2%
  Indirect Translation   | 44.2% | 55.1% | 58.0% | 59.7% | 60.5%
  Transitive Translation | 49.2% | 58.1% | 60.9% | 61.6% | 62.0%
Transitive Translation Model
Chinese-Japanese Translation
  Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
  Model      | Top1  | Top2  | Top3  | Top4  | Top5
  Direct     | 10.5% | 12.8% | 14.3% | 15.1% | 15.1%
  Indirect   | 40.2% | 49.4% | 56.6% | 58.6% | 59.6%
  Transitive | 42.9% | 51.4% | 58.6% | 61.3% | 61.9%
  Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
  Source term (Traditional Chinese) | English     | Simplified Chinese | Japanese
  新力                              | Sony        | 索尼               | ソニー
  耐吉                              | Nike        | 耐克               | ナイキ
  史丹佛                            | Stanford    | 斯坦福             | スタンフォード
  雪梨                              | Sydney      | 悉尼               | シドニー
  網際網路                          | internet    | 互联网             | インターネット
  網路                              | network     | 网络               | ネットワーク
  首頁                              | homepage    | 主页               | ホームページ
  電腦                              | computer    | 计算机             | コンピューター
  資料庫                            | database    | 数据库             | データベース
  資訊                              | information | 信息               | インフォメーション
Translation Lexicons with Regional Variations
  Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words "George Bush" from Google: (a) Taiwan, (b) Mainland China, (c) Hong Kong.
Summary
LiveCluster:  Generating Taxonomy from terms or documents
Taxonomy Generation from Terms
Hierarchical Query Clustering
The Steps
Feature Extraction
  Query term: nude
  Search-result snippets:
    Creative Nude Photography Network -- Fine Art Nude and ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...
    Nude Places ... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...
    A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...
  Co-occurring feature terms:
    term               | tf/df
    photography        | 2/2
    art                | 3/2
    …                  | …
    naked              | 1/1
    erotic photography | 3/2
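The tf/df figures in such a table can be obtained by counting how often each candidate feature term occurs across the snippets (tf) and in how many snippets it occurs (df). A minimal sketch with hypothetical snippets and a deliberately naive tokenizer; a real system would also need term extraction and, for Chinese, word segmentation.

```python
import re


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def count_phrase(tokens, phrase):
    """Number of positions at which the token sequence `phrase` occurs in `tokens`."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def tf_df(snippets, terms):
    """term -> (tf, df): total occurrences, and number of snippets containing the term."""
    token_lists = [tokenize(s) for s in snippets]
    stats = {}
    for term in terms:
        phrase = tokenize(term)
        counts = [count_phrase(tokens, phrase) for tokens in token_lists]
        stats[term] = (sum(counts), sum(1 for c in counts if c > 0))
    return stats


# Hypothetical search-result snippets for the query "jaguar".
SNIPPETS = [
    "Jaguar cars: official site for luxury cars and sports cars",
    "The jaguar is a big cat native to the Americas",
    "Used cars for sale, including Jaguar sports cars",
]

print(tf_df(SNIPPETS, ["cars", "sports cars", "big cat"]))
```

The resulting (tf, df) pairs become the raw material for the feature vector that represents the query term in later clustering steps.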
Term Weighting
Extraction of Basic Feature Terms
Task I: Query Clustering (Cont.)
Term Similarity
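Hierarchical Term Clustering
Clustering Results [dendrogram of Chinese query terms, cut into five clusters]
  Cluster: 勞委會 (Council of Labor Affairs), 職訓局 (Bureau of Employment and Vocational Training), 就業 (employment), 青輔會 (National Youth Commission), 自傳 (autobiography), 徵才 (recruitment), 人力資源 (human resources), 104 人力銀行 (104 Job Bank), 人力銀行 (job bank), 找工作 (job hunting), 履歷表 (résumé), 求職 (job seeking), 求才 (hiring)
  Cluster: 占卜 (divination), 塔羅牌 (tarot cards), 算命 (fortune-telling), 紫微斗數 (Zi Wei Dou Shu astrology), 命理 (fate analysis), 姓名學 (name divination), 心理測驗 (psychological quiz), 星座 (zodiac signs), 愛情 (love)
  Cluster: eva, 長榮航空 (EVA airline), 長榮 (EVA), 航空公司 (airline), 航空 (airway), 華航 (China airline), 中華航空 (China airline)
  Cluster: 補帖 (short for 大補帖), 大補帖 (pirated-software compilation disc), 泡麵 (instant noodles), dbt
  Cluster: 武俠 (wuxia, martial arts fiction), 金庸 (Jin Yong), 武俠小說 (wuxia novels), 黃易 (Huang Yi), 作家 (writer)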
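Hierarchical clustering of terms like those above can be sketched as bottom-up (agglomerative) merging of the most similar clusters. The version below assumes cosine similarity between feature vectors and average-link merging, with a similarity threshold standing in for the cut; the linkage criterion actually used on the slides is not preserved, and the vectors and names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {feature: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[a], vectors[b]) for a in c1 for b in c2]
    return sum(sims) / len(sims)


def cluster(vectors, threshold):
    """Merge the two most similar clusters until the best similarity drops below threshold."""
    clusters = [[name] for name in vectors]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = average_link(clusters[i], clusters[j], vectors)
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical feature vectors for four query terms.
VECTORS = {
    "EVA": {"airline": 1.0, "flight": 0.8},
    "China Airlines": {"airline": 0.9, "flight": 1.0},
    "tarot": {"fortune": 1.0, "cards": 0.7},
    "horoscope": {"fortune": 0.8, "zodiac": 1.0},
}

print(cluster(VECTORS, threshold=0.3))
```

Recording the order of merges instead of stopping at a threshold would give the full binary tree, which can then be cut at different levels to produce a multi-way taxonomy.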
Cluster Partition
Quality Function
Preliminary Experiment
Evaluation: F-Measure
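For reference, a clustering can be scored against manually assigned classes with an F-measure. The sketch assumes the widely used form in which each gold class is matched with its best-scoring cluster and the per-class values are weighted by class size; the exact definition used in the experiments is not preserved on the slide, and the data is a toy example.

```python
def f_measure(cluster, gold_class):
    """Harmonic mean of precision and recall of one cluster against one gold class."""
    overlap = len(set(cluster) & set(gold_class))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(gold_class)
    return 2 * precision * recall / (precision + recall)


def overall_f(clusters, classes):
    """Size-weighted average over gold classes of the best F against any cluster."""
    total = sum(len(c) for c in classes)
    return sum(
        len(c) / total * max(f_measure(k, c) for k in clusters)
        for c in classes
    )


# Toy example: one item ("c") ended up in the wrong cluster.
CLASSES = [["a", "b", "c"], ["d", "e"]]
CLUSTERS = [["a", "b"], ["c", "d", "e"]]

print(round(overall_f(CLUSTERS, CLASSES), 3))
```

A perfect clustering scores 1.0, and both splitting a class and mixing classes pull the value down, through recall and precision respectively.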
Obtained F-Measures
 
Results of Hierarchical Structure Generation

Web Search And Mining (Ntuim)
