Web Search Engine Design
 

    Presentation Transcript

    • Web Search Engine Design. Lee-Feng Chien (簡立峰), Web Knowledge Discovery Lab, Institute of Information Science, Academia Sinica, http://csmart.iis.sinica.edu.tw/
    • Outline
      • Basics of Search Engine Design
      • Why Google Can Do It?
      • New-Generation Search Technologies
    • About the Speaker
        • Working Position
          • Research Fellow, IIS, Academia Sinica (1993~)
        • Education
          • Ph.D., CS&IE, NTU, 1991
        • Professional Activities
          • Associate Editor, ACM Trans on Asian Lang. Info. Proc. (2000~)
          • Editorial Board Member, J. of Information Processing & Management (1995~2000)
          • PC member, ACM SIGIR (1999~2003)
          • Speech and Search Technology Consultant, Microsoft Research
    • Part I. Basics of Search Engine Design
    • Differences
      • Scale
        • Personal, site/intranet (Tornado/Verity), internet (Google)
        • Thousand, million or billion (documents, users, queries)
      • Media
        • Text, e.g., Web pages, documents, bibliographic data
        • Audio, e.g., music, speech, broadcast news
        • Image, e.g., pictures, computer graphics
        • Video, e.g., films
      • Subject
        • General or specific subjects, languages
      • Structure
        • Unstructured, semi-structured, structured
      • Interface
        • Web-based, WAP-based, voice-based
    • Components
      • Crawler/Spider
      • Index Server
      • Query Server
      • Document Delivery
    • Architecture [diagram with five numbered stages (1) to (5); components: Web, Spider, Indexer, Archive, Index (×3), SE (×3), Browser, Log; annotations: 1B queries/day, quality results, spam, freshness, 5B pages, scalable]
    • Spider
      • Get all pages from the Web
      • Web traversal
      • Challenges
        • Performance, e.g., pages crawled per PC
        • Coverage
        • Currency
        • Spam Filtering
        • Hidden Web
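The Web traversal performed by a spider can be sketched as a breadth-first walk over the link graph that visits each URL once. This is a minimal illustration, not the design from the talk: `crawl`, `fetch_links`, and the `TOY_WEB` graph with its URLs are hypothetical names, and an in-memory dictionary stands in for real HTTP fetching.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first traversal of a link graph, visiting each URL once."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip URLs that are already queued or visited
                seen.add(link)
                queue.append(link)
    return order


# Toy in-memory "web" standing in for real page fetches (hypothetical URLs).
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["a.example/1", "c.example/"],
    "c.example/": [],
}

pages = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(pages)
```

The `seen` set addresses only exact-URL duplicates; coverage, currency, spam filtering, and the hidden Web need separate mechanisms that this sketch does not attempt.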
    • Index Server
      • Index occurrences of all words in the pages
      • Data cleanliness
      • Challenges
        • Space overhead, e.g., pages indexed per PC
        • Incremental updates
        • Scalability & Distributed Processing
        • Multiple Languages
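Indexing the occurrences of all words can be illustrated with a small positional inverted index. This is a sketch under simplifying assumptions: `tokenize`, `build_index`, and the `DOCS` collection are hypothetical, tokenization is a plain regular expression, and everything lives in memory on one machine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


DOCS = {
    1: "Gold and silver prices",
    2: "Silver screen film noir",
    3: "Gold rush",
}

index = build_index(DOCS)
print(index["gold"])    # {1: [0], 3: [0]}
print(index["silver"])  # {1: [2], 2: [0]}
```

Storing positions costs extra space, which is one source of the space overhead listed above, but it is what makes phrase and proximity queries possible later.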
    • Query Server
      • Find relevant URLs for queries by looking up the indices
      • Challenges
        • Speed, e.g., queries answered per second
        • Functions supported
        • Localization
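Looking up the indices for a multi-word query can be sketched as an intersection of posting sets. The names `search_and` and `INDEX` are hypothetical, and the index here is a hand-written dictionary from word to document ids rather than a real index file.

```python
def search_and(index, query):
    """Return the sorted ids of documents that contain every query word."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return []
    postings.sort(key=len)  # start from the rarest term to keep sets small
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return sorted(result)


# Hand-written toy index: word -> set of document ids (assumed data).
INDEX = {"gold": {1, 3}, "silver": {1, 2}, "film": {2}}

print(search_and(INDEX, "gold silver"))  # [1]
print(search_and(INDEX, "gold film"))    # []
```

Starting from the shortest posting set is a common way to reduce work per query, which matters when the target is many queries per second.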
    • AltaVista’s Search Functions
      • Phrase search, e.g. "petite galerie"
      • Truncation, e.g. librar*, wom*n
      • Constraining search, e.g. title:"The Wall Street Journal"
      • Proximity search, e.g. gold near silver
      • Boolean, e.g. +noir +film -"pinot noir"
      • Parentheses and Nested Boolean, e.g. silver and not (gold or platinum)
      • Limit search, e.g. limit by date range
      • Capitalization, e.g. turkey vs. Turkey
      • Ranking fields and refine search
      • LiveTopics
      • Translate Service
      • Other
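Phrase search and proximity search from the list above both rely on word positions. The following sketch shows one way to evaluate them for a single document; it is not AltaVista's implementation, and `positions`, `phrase_match`, `near_match`, and the default `window` of 10 tokens are assumptions made for illustration.

```python
import re


def positions(text):
    """Map each word to the list of its token positions in one document."""
    pos = {}
    for i, word in enumerate(re.findall(r"\w+", text.lower())):
        pos.setdefault(word, []).append(i)
    return pos


def phrase_match(pos, phrase):
    """True if the words of `phrase` occur consecutively, in order."""
    words = phrase.lower().split()
    if not words:
        return False
    return any(
        all(start + k in pos.get(word, []) for k, word in enumerate(words))
        for start in pos.get(words[0], [])
    )


def near_match(pos, a, b, window=10):
    """True if words `a` and `b` occur within `window` tokens of each other."""
    return any(
        abs(i - j) <= window
        for i in pos.get(a.lower(), [])
        for j in pos.get(b.lower(), [])
    )


doc = positions("The petite galerie sells gold coins near old silver")
print(phrase_match(doc, "petite galerie"))          # True
print(phrase_match(doc, "galerie petite"))          # False
print(near_match(doc, "gold", "silver", window=5))  # True
```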
    • Document Delivery
      • Bandwidth bottleneck
      • Presentation
      • Caching
        • Queries, Search Results
        • Akamai model
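Caching queries and their search results can be sketched with a small least-recently-used (LRU) cache. The slide does not name an eviction policy, so LRU is an assumption here, and `ResultCache` is a hypothetical class.

```python
from collections import OrderedDict


class ResultCache:
    """Keep the results of the most recently used queries (LRU eviction)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, query):
        """Return cached results, or None on a cache miss."""
        if query not in self._items:
            return None
        self._items.move_to_end(query)  # mark as most recently used
        return self._items[query]

    def put(self, query, results):
        """Store results and evict the least recently used entry if full."""
        self._items[query] = results
        self._items.move_to_end(query)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)


cache = ResultCache(capacity=2)
cache.put("gold", [1, 3])
cache.put("silver", [1, 2])
cache.get("gold")            # "gold" is now the most recently used entry
cache.put("film", [2])       # evicts "silver"
print(cache.get("silver"))   # None
```

Because a small number of queries tends to account for a large share of traffic, even a modest cache in front of the query server can save index lookups.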
    • Others
      • Security
      • System Maintenance
    • Evaluation
      • Contents and Scope
      • Search Logic
      • Display of Results
      • Search Performance
      • Interface Design
    • Contents and Scope
      • Size of database
      • Sources covered (WWW, Usenet, FTP, Gopher, etc.)
      • Index depth (e.g., HTML title, header, summary of content, full text)
      • Indexing method (automatic or manual)
      • Currency and frequency of index updates
      • Multilingual processing
      • Types of service covered (e.g., Excite includes Web search, Usenet search, subject guide, city.net, NewsTracker, etc.)
      • Reviews provided
    • Search Logic
      • Boolean logic (e.g., AND, OR, NOT)
      • Nested Boolean logic
      • Truncation (automatic or user-defined)
      • Proximity operators (e.g., NEAR, FOLLOWED BY)
      • Phrase searching
      • Field search (e.g., URL, title, etc.)
      • Handling of capitalization, punctuation, special symbols, etc.
      • Keyword search
      • Natural-language query
      • Relevance feedback
      • Refine search (narrow down)
      • Weighted search
      • Duplicate detection
      • Search set manipulation
      • Other
    • Display of Results
      • Relevance ranking
      • Limit on the number of results displayed
      • Limit on the level of detail displayed (annotations or summaries)
      • Direct links to resources
    • Search Performance
      • Precision ratio
      • Recall ratio
      • Response time
      • Ease of connection (accessibility)
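The precision and recall ratios above have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with `precision_recall` and the sample document ids as hypothetical names and data:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The engine returned documents 1-4; documents 2, 4, and 5 are relevant.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(p)  # 0.5
print(r)  # 2 of the 3 relevant documents were found
```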
    • Part II. Why Google Can Do It?
    • Spider [diagram labels: indexed pages, out-links, duplication, authority, traversal via out-links, authorized pages]
    • Indexing & Ranking [diagram: an indexed page with page title "Academia Sinica" and an abstract; anchor texts "Government Research Institution in Taiwan" and "My CS Lab"; labels: Popularity, Authority]
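The slide's idea that a page can be indexed by the anchor text of links pointing to it, and that incoming links signal popularity, can be sketched as follows. This is an illustration only and does not reproduce Google's actual ranking: `index_anchor_text`, the link tuples, and the example URLs are hypothetical, and popularity is simplified to a count of distinct linking pages.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Index target pages by incoming anchor text and count their in-links.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    anchor_terms = defaultdict(set)  # word -> target pages described by it
    in_links = defaultdict(set)      # target page -> distinct source pages
    for source, target, text in links:
        in_links[target].add(source)
        for word in text.lower().split():
            anchor_terms[word].add(target)
    popularity = {url: len(sources) for url, sources in in_links.items()}
    return anchor_terms, popularity


# Hypothetical link data modeled on the anchor texts shown in the slide.
LINKS = [
    ("gov.example", "sinica.example", "Government Research Institution in Taiwan"),
    ("student.example", "sinica.example", "My CS Lab"),
    ("student.example", "other.example", "homepage"),
]

anchor_terms, popularity = index_anchor_text(LINKS)
print(anchor_terms["research"])      # {'sinica.example'}
print(popularity["sinica.example"])  # 2
```

With this index, a query for "research" can find the page even if that word never appears in the page's own title or body.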
    • Inverted File: Google’s Index File Structure
    • Distributed Search [diagram components: Query, Query Processor, SE (×4), Document Delivery]
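A query processor that fans a query out to several search engines and merges their answers can be sketched as a scatter-gather step. The details below are assumptions, not taken from the slide: each shard is a dictionary of documents, the score is a raw term count, shards are queried sequentially rather than in parallel, and `search_shard` and `distributed_search` are hypothetical names.

```python
import heapq


def search_shard(shard, term, k):
    """Return the top-k (score, doc_id) pairs from one shard.

    The score is simply how often the term occurs in the document.
    """
    hits = [(text.lower().split().count(term), doc_id)
            for doc_id, text in shard.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Query every shard, then merge the partial results into a global top-k."""
    partials = [search_shard(shard, term.lower(), k) for shard in shards]
    return heapq.nlargest(k, (hit for part in partials for hit in part))


# Two toy shards, each holding part of the collection (assumed data).
SHARD_1 = {"d1": "gold gold silver", "d2": "silver"}
SHARD_2 = {"d3": "gold", "d4": "gold gold gold"}

print(distributed_search([SHARD_1, SHARD_2], "gold", k=2))
# [(3, 'd4'), (2, 'd1')]
```

Each shard only needs to return its own top k hits, so the merge step handles at most k results per shard regardless of collection size.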
    • Abstract Modeling [diagram labels: User Space, Index Space, Document Space; Users, Authors; Information Need, Information Use; Seek, Use; Short Query, Subject Terms, Real Names; X, Y, X1, X2..., Y1, Y2...]
    • Facts (I)
      • Query
        • Short query problem
        • 50% are personal and company names
        • Boolean and natural-language queries are rare
      • Browsing
        • Users rarely go on to the second page of results
        • Precision is more important than recall
      • Information gathering (Robot)
        • Low coverage, dead links, garbage sites and pages
    • Facts (II): Accuracy
      • Whose responsibility?
        • Users
          • Short query or NLQ?
          • HFQ or LFQ?
        • Search engines
          • Technology, data volume, ranking?
    • Facts (III): Speed
      • Whose responsibility?
        • Users
          • Keywords, bandwidth
        • Search engines
          • Bandwidth, document delivery
    • Language proportions
    • Keyword length
    • Keyword frequency
    • Core keywords
    • Subject domains
    • Part III. New-Generation Search Technologies
    • New-Generation IR
      • Information Perspectives
        • Web IR
        • Multimedia IR
        • Semantic Web IR
        • User-Oriented IR
      • Retrieval Perspectives
        • Question Answering
        • Information Extraction
        • Information Filtering
        • Web Mining
    • New-Generation IR
      • Information Perspective
        • Web IR: Global/Specific Search Engines, Spiders
        • Multimedia IR: Speech, Music, Image, Video IR
        • Semantic Web IR: XML IR, Ontology, IE
        • User-Oriented IR: Log Mining, Ontology
      • Retrieval Perspective
        • Question Answering: NLQ, FAQ Search
        • Information Extraction
        • Information Filtering: e-mail Spam, Web Page
      • Mining Perspective
        • Web Mining, Log Mining
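    • Query by image
    • Audio/video browsing
    • Video summarization
    • Document classification
    • Cross-language search
    • Intelligent question answering
    • Ask an expert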
    • IR Research at WKD (I)
      • Information Perspective
        • Web IR:
          • Cross-Language Web Search
          • Concept-based Search
        • Multimedia IR:
          • Speech Retrieval
          • Image Retrieval
        • User-Oriented IR:
          • Query Taxonomy Generation
    • Cross-Language Web Search: LiveTrans
    • LiveConcept
    • LiveConcept: Concept-based Web Search
    • Speech Retrieval (Dr. 陳柏琳) [screenshot: spoken query "Please help me find the US-China military aircraft collision" (請幫我找中美軍機擦撞); indexing approach: query by exemplar; ranked retrieved documents with recording time and relevance score; recognition results for the speech query and for the spoken documents]
    • Web Image Annotation [figure: images shown with their top 4 keywords; keywords appearing in the figure: Rainbow, Weather, Flower, Nature, Sunflower, Plant, Desert, Seal, Mammal, Coast, Animal, Solar System, Comet, Tropical Fish, Universe, Waterfall, Landform, Cockroach, Dog, Pangolin, Sheep]
    • Web Image Annotation
    • Q&A. Thanks! Web Knowledge Discovery Lab, Institute of Information Science, Academia Sinica, http://csmart.iis.sinica.edu.tw/