Distributed Search - Solutions and Comparison Ngọc  Bùi [email_address]
  • Store the term as the row ID; the documents containing that term form one column family, in which the column identifiers are document IDs and the values are the positions where the term occurs.
  • Distributed search solutions and comparison

    1. 1. Distributed Search - Solutions and Comparison Ngọc Bùi [email_address]
    2. 2. Facts
        • Facebook:
            • 750 million active users
            • 3 billion photos uploaded each month; a record 750 million photos were uploaded over New Year's weekend
            • 14 million videos uploaded each month
            • More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
            • Terabytes of log data daily
        • HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?
    3. 3. Centralized Search – PROBLEM?
        • Lucene is great:
            • High-performance, full-featured search library
            • Incremental indexing
            • Boolean, fuzzy, range, multi-phrase, and wildcard queries, etc.
        • It's great, BUT:
            • Slow if the index is very big
            • The index can outgrow a single machine's hard disk
            • No load balancing
            • No failover
    4. 4. GOAL
        • Reliable index serving through failover (master and nodes)
        • Scalable for traffic and index size by adding nodes
        • Distributed TF-IDF
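The "Distributed TF-IDF" goal matters because each shard sees only its own documents, so an IDF computed locally can weight the same term very differently from shard to shard. The following is a minimal sketch of one way to address that, assuming each shard can report its document count and per-term document frequency. The `ShardStats` type, the function names, and the smoothed IDF formula are illustrative assumptions, not taken from the slides.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ShardStats:
    """Per-shard statistics (hypothetical structure, for illustration only)."""
    doc_count: int
    doc_freq: dict = field(default_factory=dict)  # term -> number of docs containing it


def local_idf(term, shard):
    """IDF as a single shard would compute it from its own documents."""
    df = shard.doc_freq.get(term, 0)
    return math.log((1 + shard.doc_count) / (1 + df)) + 1.0


def global_idf(term, shards):
    """IDF computed from statistics summed over all shards."""
    n = sum(s.doc_count for s in shards)
    df = sum(s.doc_freq.get(term, 0) for s in shards)
    return math.log((1 + n) / (1 + df)) + 1.0


# Two shards of equal size where the term is common on one and rare on the other.
shards = [
    ShardStats(doc_count=1000, doc_freq={"hbase": 500}),
    ShardStats(doc_count=1000, doc_freq={"hbase": 5}),
]
```

With purely local statistics the two shards would assign very different weights to "hbase"; using the summed statistics gives every shard the same weight, so scores from different shards become comparable when results are merged.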
    5. 5. Solution:
        • Documents are indexed in parallel on different machines in a cluster. When a user issues a search, it is fanned out to multiple machines in parallel.
        • Choices:
            • Katta
            • Elasticsearch
            • HbaseDirectory (our choice)
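The parallel search described on this slide is often called scatter-gather: the query is sent to every shard, each shard returns its own top hits, and the partial results are merged. The sketch below illustrates the idea under simplifying assumptions: each shard is a plain dict standing in for a Lucene index, the score is a raw term count, and the function names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, k):
    """Return the top-k (score, doc_id) hits from one shard.

    A shard is a dict mapping doc_id -> text (a stand-in for a real index).
    """
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Scatter the query to all shards in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, term, k), shards)
        all_hits = [hit for part in partials for hit in part]
    return heapq.nlargest(k, all_hits)


shards = [
    {"d1": "lucene index lucene", "d2": "hbase table"},
    {"d3": "lucene search", "d4": "lucene lucene lucene shard"},
]
top = distributed_search(shards, "lucene", k=2)
# top == [(3, "d4"), (2, "d1")]
```

Each shard only needs to return k hits, because no document outside a shard's local top k can appear in the global top k.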
    6. 6. Katta
        • Katta is a distributed application running on many commodity hardware servers
        • An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.
        • The distributed configuration and locking system ZooKeeper is used for master-node communication.
    7. 8. Pros and Cons
        • Pros:
            • Copies and distributes shards automatically to slaves.
            • Supports distributing queries and aggregating results.
        • Cons:
            • No indexing support.
            • Incremental index updates are hard.
            • Resharding is too expensive.
    8. 9. Elasticsearch (www.elasticsearch.org)
        • Elasticsearch is an open-source, distributed, RESTful search engine built on top of Lucene
        • Automatic shard allocation
        • Automatic index sharding and index updates
        • Network interface (HTTP) for indexing, searching, and administration through a purely RESTful API.
        • Schema-free.
        • Integrates well with Hadoop/MapReduce
    9. 11. Behind Elastic
    10. 12. Automatic shard allocation
        • There is no need for a load balancer in Elasticsearch: each node can receive a request, and if it can't handle it, it automatically delegates it to the appropriate node(s).
        • To scale out search, you can simply add more shards and more replicas per shard.
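The delegation described above works because any node can work out which shard a document belongs to. In its basic form, Elasticsearch's documentation describes this as a hash of the routing value (by default the document ID) modulo the number of primary shards. The sketch below illustrates that idea only: `zlib.crc32` is a stand-in for the real hash function, and the allocation table and node names are made up for the example.

```python
import zlib

NUM_SHARDS = 4

# Hypothetical allocation table: shard number -> node currently holding it.
ALLOCATION = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"}


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Map a document ID to a shard number (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(receiving_node, doc_id):
    """Return (owner, forwarded): which node serves the request and
    whether the receiving node had to delegate it."""
    owner = ALLOCATION[shard_for(doc_id)]
    return owner, owner != receiving_node
```

Because the mapping depends on the number of shards, changing that number moves documents between shards, which is one reason resharding tends to be expensive.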
    11. 13. HbaseDirectory – What? [Diagram: Directory]
    12. 14. HbaseDirectory – What? [Diagram: Indexing Phase, Searching Phase, Directory]
    13. 15. HbaseDirectory – What?
        • Is Directory distributed? No, but it is not impossible.
        • How to distribute it? Use Directory on top of a distributed storage system:
            • HDFS: too slow
            • HBase: our choice, since it is optimized for random access, which suits the access pattern of a Lucene index.
            • HbaseDirectory: treat HBase as a logical “Directory”.
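To make "treat the store as a logical Directory" concrete: a Lucene Directory is essentially a flat set of named files that can be listed, measured, written, and read at arbitrary offsets. The sketch below models that contract on top of a key-value table, with a Python dict standing in for HBase. The class name, method names, block size, and key layout are all assumptions for illustration, not the deck's actual implementation.

```python
class KVDirectory:
    """Toy Directory-like file store backed by a key-value table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # (file_name, block_no) -> bytes; stand-in for table cells
        self.lengths = {}  # file_name -> total length in bytes

    def write_file(self, name, data):
        """Split the file into fixed-size blocks and store each under its own key."""
        for start in range(0, len(data), self.block_size):
            block_no = start // self.block_size
            self.blocks[(name, block_no)] = data[start:start + self.block_size]
        self.lengths[name] = len(data)

    def list_all(self):
        return sorted(self.lengths)

    def file_length(self, name):
        return self.lengths[name]

    def read(self, name, offset, length):
        """Random-access read that fetches only the blocks covering the range."""
        end = min(offset + length, self.lengths[name])
        out = bytearray()
        pos = offset
        while pos < end:
            block_no, within = divmod(pos, self.block_size)
            chunk = self.blocks[(name, block_no)][within:within + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)


d = KVDirectory(block_size=4)
d.write_file("_0.cfs", b"0123456789")
```

Storing files in blocks means a read in the middle of a large index file touches only a few keys, which is the kind of random access the slide says HBase handles well.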
    14. 16. Two Modes
        • HbaseDirectory: lazy mode
            • Keep Lucene's index file structures and port them to HBase
            • Only rewrite 2 classes: FSDirectory & RAMDirectory (implementations of Directory)
        • HbaseDirectory: active mode
            • Redesign the index structure to exploit HBase's strengths.
            • Rewrite: the 2 above + IndexReader & IndexWriter
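For active mode, the slide note describes a layout in which the term is the row ID, the column identifiers are document IDs, and each value holds the positions where the term occurs. The sketch below models that layout with nested dicts in place of a real HBase table. The class name, method names, and whitespace tokenizer are illustrative assumptions.

```python
from collections import defaultdict


class TermTable:
    """In-memory model of a term-keyed table: row key -> {doc id -> positions}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def index(self, doc_id, text):
        """Add one document: every term becomes (or extends) a row."""
        for pos, term in enumerate(text.lower().split()):
            self.rows[term].setdefault(doc_id, []).append(pos)

    def get(self, term):
        """A single row lookup returns the full posting list for the term."""
        return self.rows.get(term, {})

    def search_all(self, *terms):
        """Document IDs containing every given term (boolean AND)."""
        postings = [set(self.get(t)) for t in terms]
        return sorted(set.intersection(*postings)) if postings else []


table = TermTable()
table.index("d1", "hbase stores lucene index")
table.index("d2", "lucene index on lucene")
```

In this layout a term lookup is a single row read, and adding a document only adds columns to existing rows, which suggests why incremental updates fit this design better than rewriting whole index files.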
    15. 17. Lucene index flow – Hbase flow
    16. 18. Performance & Conclusion
        • Refer to the Excel file
        • HbaseDirectory in active mode is the correct choice.
        • Improvement needed.
