Distributed search solutions and comparison

Published in: Technology

  • Store the term as the row ID. The documents containing that term form one column family, in which each column identifier is a document ID and the value is the list of positions where the term appears.
  • Transcript

    • 1. Distributed Search - Solutions and Comparison Ngọc Bùi [email_address]
    • 2. Facts
      • FB:
      • 750 million active users
      • 3B photos uploaded each month. Record: 750M photos uploaded to FB over New Year’s weekend.
      • 14M videos uploaded each month
      • More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.
      • TBs of log data daily
      • HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?
    • 3. Centralized Search – PROBLEM?
      • Lucene is great:
        • high-performance, full-featured search library
        • Incremental indexing
        • Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wildcard Query, etc.
      • It’s great, BUT:
        • Slow when the index is very big
        • The index can outgrow a single HDD
        • No load balancing
        • No failover
    • 4. GOAL
      • Reliable index serving - by failover (master and nodes)
      • Scalable for traffic and index size by adding nodes
      • Distributed TF-IDF
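
The "Distributed TF-IDF" goal matters because each shard sees only its own documents, so a rare term on one shard may be common on another and locally computed scores are not comparable. A minimal sketch in Python, assuming every shard can report its document count and per-term document frequency. The names `ShardStats` and `global_idf` are hypothetical, and the smoothed IDF formula is an illustrative choice, not necessarily the one Lucene uses.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ShardStats:
    """Per-shard statistics (hypothetical structure for illustration)."""
    doc_count: int
    doc_freq: dict = field(default_factory=dict)  # term -> docs containing it


def global_idf(term, shards):
    """IDF computed from statistics summed over all given shards."""
    total_docs = sum(s.doc_count for s in shards)
    total_df = sum(s.doc_freq.get(term, 0) for s in shards)
    # Smoothed variant log(1 + N / (df + 1)); an assumption for this sketch.
    return math.log(1 + total_docs / (total_df + 1))


shards = [
    ShardStats(doc_count=1000, doc_freq={"needle": 1}),
    ShardStats(doc_count=1000, doc_freq={"needle": 499}),
]
idf = global_idf("needle", shards)                       # cluster-wide value
local_idfs = [global_idf("needle", [s]) for s in shards]  # what each shard would compute alone
```

With these made-up numbers the two local values disagree with each other and with the cluster-wide value, which is why the statistics have to be aggregated before scoring.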
    • 5. Solution:
      • Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is fanned out to multiple machines in parallel and the partial results are merged.
      • Choices:
        • Katta
        • Elastic Search
        • HbaseDirectory (our choice)
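
The parallel search described on this slide is commonly implemented as scatter-gather: send the query to every shard, collect each shard's top hits, and merge them. A minimal sketch, assuming each shard is just an in-memory dict and using the raw term count as a stand-in score. The function names are hypothetical and none of the three products above is claimed to work exactly this way.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, k):
    """Return this shard's top-k (score, doc_id) hits for a term.

    A shard is a plain dict: doc_id -> list of tokens (illustrative only).
    The score is the raw term count, a stand-in for a real ranking function.
    """
    hits = [(tokens.count(term), doc_id) for doc_id, tokens in shard.items()]
    hits = [h for h in hits if h[0] > 0]
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Scatter the query to every shard in parallel, then merge the top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, k), shards))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)


shards = [
    {"d1": ["needle", "hay"], "d2": ["hay", "hay"]},
    {"d3": ["needle", "needle", "hay"], "d4": ["needle"]},
]
results = distributed_search(shards, "needle", k=2)
```

Each shard returns at most `k` hits, so the merge step handles at most `k` times the number of shards candidates, regardless of index size.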
    • 6. Katta
      • Katta is a distributed application running on many commodity hardware servers
      • An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.
      • The distributed configuration and locking system Zookeeper is used for master-node communication.
    • 7.  
    • 8. Pros and Cons
      • Pros:
        • Copies and distributes shards automatically across slaves.
        • Supports distributing queries and aggregating results.
      • Cons:
        • No indexing support.
        • Incremental index updates are hard.
        • Resharding is too expensive.
    • 9. Elastic Search (www.elasticsearch.org)
      • Elastic Search is an open-source, distributed, RESTful search engine built on top of Lucene
      • Automatic shard allocation
      • Automatic index sharding and index updates
      • HTTP network interface for indexing, searching, and administration: a purely RESTful API.
      • Schema Free.
      • Can be integrated well with Hadoop/Map-Reduce
    • 10.  
    • 11. Behind Elastic
    • 12. Automatic shard allocation
      • There is no need for a load balancer in elasticsearch: each node can receive a request, and if it can’t handle it, it automatically delegates it to the appropriate node(s).
      • If you want to scale out search, you can simply add more shards and more replicas per shard.
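
The delegation described above can be pictured as hash-based routing: any node accepts the request, computes which shard owns the document, and forwards to the node holding that shard. A toy sketch under stated assumptions: the `Node` and `Cluster` classes are hypothetical, CRC32 is a stand-in hash (not the one Elasticsearch uses), and shards are assigned to nodes round-robin.

```python
import zlib


class Node:
    """A toy cluster node; the class and its methods are hypothetical."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.docs = {}

    def index(self, doc_id, body):
        """Accept a request on any node and delegate to the owning node."""
        owner = self.cluster.owner_of(doc_id)
        owner.docs[doc_id] = body
        return owner.name

    def get(self, doc_id):
        """Serve a read from any node by asking the owning node."""
        return self.cluster.owner_of(doc_id).docs.get(doc_id)


class Cluster:
    def __init__(self, node_names, num_shards):
        self.num_shards = num_shards
        self.nodes = [Node(n, self) for n in node_names]

    def shard_of(self, doc_id):
        # CRC32 is a stand-in hash chosen because it is deterministic.
        return zlib.crc32(doc_id.encode("utf-8")) % self.num_shards

    def owner_of(self, doc_id):
        # Assumption: shards are spread round-robin over the nodes.
        return self.nodes[self.shard_of(doc_id) % len(self.nodes)]


cluster = Cluster(["node-a", "node-b", "node-c"], num_shards=6)
entry = cluster.nodes[0]                 # the client talks to an arbitrary node
owner_name = entry.index("doc-42", {"title": "needle"})
found = cluster.nodes[2].get("doc-42")   # a different node can serve the read
```

Because the shard number depends on the shard count, changing `num_shards` in this scheme moves documents between shards, which is one reason resharding tends to be expensive.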
    • 13. HbaseDirectory – What? Directory
    • 14. HbaseDirectory – What? Indexing Phase Searching Phase Directory
    • 15. HbaseDirectory – What?
      • Is Directory distributed? No, but it is not impossible.
      • How to distribute it? Put the Directory on a distributed storage system.
        • HDFS: too slow
        • HBase: our choice, since it is optimized for random access, which suits the access pattern of a Lucene index.
        • HBase Directory: treat HBase as a logical “Directory”.
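
To make the idea of a logical "Directory" concrete, here is a minimal sketch of a file-like store layered on a key-value map, with index files split into fixed-size blocks so that a read touches only the blocks it needs. The dict stands in for an HBase table; the class name `KVDirectory`, the block layout, and the file name are assumptions for illustration, not the actual HbaseDirectory design.

```python
class KVDirectory:
    """Toy 'Directory' storing index files as fixed-size blocks in a key-value map.

    The dict stands in for an HBase table keyed by (file name, block number).
    """

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # (file_name, block_no) -> bytes
        self.lengths = {}  # file_name -> total length in bytes

    def write_file(self, name, data):
        """Split the file into blocks and store each block under its own key."""
        for block_no, start in enumerate(range(0, len(data), self.block_size)):
            self.blocks[(name, block_no)] = data[start:start + self.block_size]
        self.lengths[name] = len(data)

    def read(self, name, offset, length):
        """Random-access read that fetches only the blocks covering the range."""
        end = min(offset + length, self.lengths[name])
        out = bytearray()
        pos = offset
        while pos < end:
            block_no, within = divmod(pos, self.block_size)
            chunk = self.blocks[(name, block_no)][within:within + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)

    def list_all(self):
        """Return the names of all stored files."""
        return sorted(self.lengths)


directory = KVDirectory(block_size=4)
directory.write_file("segments_1", b"0123456789")  # illustrative file name
part = directory.read("segments_1", 3, 5)
```

The block size of 4 bytes is only there to keep the example readable; a real store would use far larger blocks.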
    • 16. Two Modes
      • HBase Directory: lazy mode
        • Keep Lucene's index file structures and port them to HBase
        • Only rewrite 2 classes: FSDirectory & RAMDirectory (the Directory interface)
      • HBase Directory: active mode
        • Redesign the index structure to utilize HBase’s strengths.
        • Rewrite: the 2 above + IndexReader & IndexWriter
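
One way to picture a redesigned index structure is the layout given in the slide note: the term is the row key, and within one column family each column qualifier is a document ID whose value is the list of positions of the term. The sketch below mimics that layout with nested dicts rather than a real HBase client. The family name `docs`, the function names, and the whitespace tokenizer are assumptions for illustration.

```python
from collections import defaultdict

# table[row_key][column_family][qualifier] -> value, mimicking HBase's data model.
FAMILY = "docs"  # hypothetical column family name


def new_table():
    """Create an empty nested-dict stand-in for an HBase table."""
    return defaultdict(lambda: defaultdict(dict))


def index_document(table, doc_id, text):
    """Write one cell per (term, document): the term's positions in that document."""
    for position, term in enumerate(text.lower().split()):
        table[term][FAMILY].setdefault(doc_id, []).append(position)


def lookup(table, term):
    """Fetch a term's postings with a single row read: doc_id -> positions."""
    row = table.get(term, {})
    return dict(row.get(FAMILY, {}))


table = new_table()
index_document(table, "doc1", "find a needle in a haystack")
index_document(table, "doc2", "needle and thread needle")
postings = lookup(table, "needle")
```

In this layout a term lookup is a single row read, and adding a document only adds cells to existing rows, which fits the random-access strength mentioned earlier.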
    • 17. Lucene index flow – Hbase flow
    • 18. Performance & Conclusion
      • Refer to the Excel file
      • HbaseDirectory in active mode is the correct choice.
      • Further improvement is needed.
