Distributed search solutions and comparison

  • Slide note: store each term as the row ID. The documents containing that term form one column family, whose column identifiers are document IDs and whose values are the positions at which the term occurs.

Transcript

  • 1. Distributed Search - Solutions and Comparison Ngọc Bùi [email_address]
  • 2. Facts
    • FB:
    • 750 million active users
    • 3B photos uploaded each month. Record: 750M photos uploaded to FB over New Year’s weekend.
    • 14M videos uploaded each month
    • More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.
    • TBs of log data daily
    • HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?
  • 3. Centralized Search – PROBLEM?
    • Lucene is great:
      • high-performance, full-featured search library
      • Incremental indexing
      • Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wildcard Query, etc.
    • It’s great BUT :
      • Slow when the index is very big
      • The index can grow bigger than a single HDD
      • No load balancing
      • No failover
  • 4. GOAL
    • Reliable index serving - by failover (master and nodes)
    • Scalable for traffic and index size by adding nodes
    • Distributed TF-IDF
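The "Distributed TF-IDF" goal can be illustrated with a minimal sketch. It assumes each shard reports its document count and per-term document frequencies, and that a coordinator merges them into one cluster-wide IDF table so scores are comparable across shards. The function names and the plain log(N / df) formula are illustrative assumptions; Lucene's own scoring formula differs.

```python
import math
from collections import Counter

def shard_stats(docs):
    """Per-shard statistics: (number of docs, document frequency per term)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return len(docs), df

def global_idf(stats):
    """Merge per-shard statistics into one IDF table for the whole cluster."""
    total_docs = sum(n for n, _ in stats)
    total_df = Counter()
    for _, df in stats:
        total_df.update(df)
    # Plain textbook IDF; an assumption, not Lucene's exact formula.
    return {term: math.log(total_docs / d) for term, d in total_df.items()}

# Two hypothetical shards, each holding tokenized documents.
shard_a = [["lucene", "index"], ["lucene", "search"]]
shard_b = [["hbase", "index"], ["hbase", "table"], ["search", "cluster"]]
idf = global_idf([shard_stats(shard_a), shard_stats(shard_b)])
```

Without the merge step, "lucene" would look rare on shard B and common on shard A, so the same document could rank differently depending on where it is stored.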
  • 5. Solution:
    • Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is sent to multiple machines in parallel and their results are combined.
    • Choices:
      • Katta
      • Elastic Search
      • HbaseDirectory (our choice)
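The parallel search described on this slide is commonly implemented as scatter-gather. The sketch below is a simplified illustration, not code from Katta, Elastic Search, or HbaseDirectory: shards are plain dictionaries of tokenized documents, the score is a bare term count, and threads stand in for remote machines. All names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term, k):
    """Score one shard locally; here the score is simply the term count."""
    hits = [(doc.count(term), doc_id) for doc_id, doc in shard.items() if term in doc]
    return heapq.nlargest(k, hits)

def distributed_search(shards, term, k=3):
    """Send the query to every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]

# Hypothetical shards: document id -> list of tokens.
shards = [
    {"a1": ["needle", "hay", "needle"], "a2": ["hay"]},
    {"b1": ["needle"], "b2": ["needle", "needle", "needle"]},
]
top = distributed_search(shards, "needle", k=2)
```

Each shard returns only its own top k hits, so the coordinator merges at most k results per shard regardless of index size.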
  • 6. Katta
    • Katta is a distributed application running on many commodity hardware servers
    • An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.
    • The distributed configuration and locking system Zookeeper is used for master-node communication.
  • 7.  
  • 8. Pros and Cons
    • Pros :
      • Copies and distributes shards to slaves automatically.
      • Supports distributing queries and aggregating results.
    • Cons :
      • No indexing support.
      • Incremental index updates are hard.
      • Resharding is too expensive.
  • 9. Elastic Search (www.elasticsearch.org)
    • Elastic Search is an Open Source, Distributed, RESTful, Search Engine built on top of Lucene
    • Automatic Shard allocation
    • Automatic index sharding and index updates
    • HTTP network interface for data indexing, searching, and administration: a purely RESTful API.
    • Schema Free.
    • Can be integrated well with Hadoop/Map-Reduce
  • 10.  
  • 11. Behind Elastic
  • 12. Automatic shard allocation
    • There is no need for a load balancer in elasticsearch. Each node can receive a request, and if it can’t handle it, it automatically delegates it to the appropriate node(s).
    • If you want to scale out search, you can simply add more shards and more replicas per shard.
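The delegation behaviour described above can be sketched as follows. This is a simplified illustration and not Elasticsearch's actual implementation: the CRC32 hash, the shard count, the node names, and the shard-to-node table are all assumptions made for the example.

```python
import zlib

NUM_SHARDS = 4
# Hypothetical cluster state: which node currently holds which shard.
SHARD_TO_NODE = {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-b"}

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard deterministically from the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def handle(receiving_node, doc_id):
    """Return (node that serves the request, whether it had to be forwarded)."""
    owner = SHARD_TO_NODE[shard_for(doc_id)]
    return owner, owner != receiving_node
```

Because every node can compute the same mapping, a client may send a request to any node and still reach the shard that owns the document.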
  • 13. HbaseDirectory – What? Directory
  • 14. HbaseDirectory – What? Indexing Phase Searching Phase Directory
  • 15. HbaseDirectory – What?
    • Is Directory distributed? No, but it is not impossible.
    • How to distribute it? Use Directory on top of a distributed storage system.
      • HDFS: too slow
      • Hbase: our choice, since it is optimized for random access, which suits the way a Lucene index is accessed.
      • Hbase Directory: treat Hbase as a logical “Directory”.
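The idea of treating a key-value store as a logical Directory can be sketched like this. The class is a simplified Python stand-in, not Lucene's Java Directory API and not the authors' HbaseDirectory code: a dictionary plays the role of the HBase table, and the block size and method names are assumptions chosen for the example.

```python
class KeyValueDirectory:
    """Simplified stand-in for a Lucene-style Directory backed by a key-value table."""

    BLOCK_SIZE = 4  # tiny on purpose so the example spans several blocks

    def __init__(self):
        self.table = {}    # (file name, block number) -> bytes; stands in for an HBase table
        self.lengths = {}  # file name -> total length in bytes

    def write_file(self, name, data):
        """Store a file as fixed-size blocks (files are assumed to be written once)."""
        for i in range(0, len(data), self.BLOCK_SIZE):
            self.table[(name, i // self.BLOCK_SIZE)] = data[i:i + self.BLOCK_SIZE]
        self.lengths[name] = len(data)

    def read(self, name, offset, length):
        """Random-access read that fetches only the blocks covering the range."""
        end = min(offset + length, self.lengths[name])
        out = bytearray()
        for block in range(offset // self.BLOCK_SIZE, (end - 1) // self.BLOCK_SIZE + 1):
            out += self.table[(name, block)]
        start = offset % self.BLOCK_SIZE
        return bytes(out[start:start + (end - offset)])

    def list_all(self):
        """List the names of all stored files."""
        return sorted(self.lengths)
```

Keying blocks by file name and block number means a read in the middle of a large index file touches only a few rows, which is the random-access pattern the slide attributes to HBase.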
  • 16. Two Modes
    • Hbase Directory: lazy mode
      • Keep Lucene's index file structure and port it to Hbase
      • Only 2 classes are rewritten: FSDirectory & RAMDirectory (Directory interface)
    • Hbase Directory: active mode
      • Redesign the index structure to utilize Hbase’s strengths.
      • Rewrite: the 2 above + IndexReader & IndexWriter
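The speaker note on this deck describes a layout that presumably belongs to active mode: the term is the row ID, and each column in the row is a document ID whose value is the list of positions where the term occurs. The sketch below models that layout with nested dictionaries. It is an illustration under that assumption, with hypothetical function names, and it does not call any HBase client API.

```python
from collections import defaultdict

def build_term_rows(docs):
    """Return {term: {doc_id: [positions]}}, i.e. one row per term."""
    rows = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            rows[term].setdefault(doc_id, []).append(pos)
    return dict(rows)

def docs_with_phrase(rows, first, second):
    """Find docs where `second` directly follows `first`, using the stored positions."""
    a, b = rows.get(first, {}), rows.get(second, {})
    return sorted(d for d in a.keys() & b.keys()
                  if any(p + 1 in b[d] for p in a[d]))

# Hypothetical documents.
docs = {"d1": "distributed search with lucene", "d2": "search is distributed"}
rows = build_term_rows(docs)
```

With this layout, looking up a term is a single row read, and the stored positions are enough to answer phrase queries without reading the original documents.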
  • 17. Lucene index flow – Hbase flow
  • 18. Performance & Conclusion
    • Refer to the Excel file
    • HbaseDirectory – Active mode is the correct choice.
    • Improvement needed.