Distributed search solutions and comparison

  • Store each term as the row ID. The documents containing that term form one column family, in which the column identifiers are document IDs and the values are the positions where the term occurs.


  • 1. Distributed Search - Solutions and Comparison Ngọc Bùi [email_address]
  • 2. Facts
    • FB:
    • 750 million active users
    • 3B photos uploaded each month. Record: 750M photos uploaded to FB over New Year’s weekend.
    • 14M videos uploaded each month
    • More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.
    • Terabytes of log data generated daily
  • 3. Centralized Search – PROBLEM?
    • Lucene is great:
      • high-performance, full-featured search library
      • Incremental indexing
      • Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wild Card Query, etc.
    • It’s great, BUT:
      • Slow when the index is very big
      • The index can grow bigger than one HDD
      • No load balancing
      • No failover
  • 4. GOAL
    • Reliable index serving - by failover (master and nodes)
    • Scalable for traffic and index size by adding nodes
    • Distributed TF-IDF
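To illustrate why distributed TF-IDF is listed as a goal: each shard only knows its own document frequencies, so scores are comparable across shards only if they use statistics summed over the whole cluster. The following is a minimal Python sketch of that idea. The shard statistics, the function names, and the smoothed IDF formula are assumptions made for illustration, not the presentation's implementation.

```python
import math
from collections import Counter

def merge_stats(shard_stats):
    """Sum per-shard (doc_count, doc_freqs) pairs into global statistics."""
    total_docs = 0
    global_df = Counter()
    for doc_count, doc_freqs in shard_stats:
        total_docs += doc_count
        global_df.update(doc_freqs)  # adds counts term by term
    return total_docs, global_df

def idf(term, total_docs, doc_freqs):
    """Smoothed inverse document frequency (an assumed formula for illustration)."""
    return 1.0 + math.log(total_docs / (doc_freqs.get(term, 0) + 1))

# Hypothetical statistics reported by two shards.
shard_a = (100, {"lucene": 10, "hbase": 1})
shard_b = (300, {"lucene": 5, "hbase": 90})

total_docs, global_df = merge_stats([shard_a, shard_b])

# "hbase" looks rare on shard A alone but is common cluster-wide.
local_idf = idf("hbase", shard_a[0], shard_a[1])
global_idf = idf("hbase", total_docs, global_df)
```

With purely local statistics, shard A would rank "hbase" matches far higher than shard B would, which is why the merged statistics matter when results from several shards are combined.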
  • 5. Solution:
    • Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is fanned out to multiple machines in parallel and the partial results are merged.
    • Choices:
      • Katta
      • Elastic Search
      • HbaseDirectory (our choice)
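The fan-out search described on this slide can be sketched in Python as a scatter-gather over in-memory shards. This is only an illustration under stated assumptions: the shards are plain dicts, scoring is a raw term count, and threads stand in for separate machines. None of the names below come from Katta, Elastic Search, or HbaseDirectory.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, k):
    """Return the top-k (score, doc_id) hits from one shard (a toy in-memory index)."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(query)  # toy score: term count
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def distributed_search(shards, query, k=3):
    """Fan the query out to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)

# Two hypothetical shards, each holding part of the document set.
shards = [
    {"d1": "lucene search library", "d2": "hbase storage"},
    {"d3": "lucene lucene index", "d4": "katta shards for lucene"},
]
results = distributed_search(shards, "lucene", k=2)
```

Each shard returns only its own top-k, so the merge step handles at most k hits per shard regardless of index size.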
  • 6. Katta
    • Katta is a distributed application that runs on many commodity servers
    • An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.
    • The distributed configuration and locking system Zookeeper is used for master-node communication.
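The folder-of-shards layout can be pictured with a short Python sketch that lists the shard subfolders of an index folder and spreads them over nodes. The round-robin assignment, the folder names, and the node names are assumptions for illustration. Katta's actual distribution policy and its Zookeeper coordination are not shown.

```python
import tempfile
from pathlib import Path

def list_shards(index_dir):
    """Return the sorted names of the shard subfolders under an index folder."""
    return sorted(p.name for p in Path(index_dir).iterdir() if p.is_dir())

def assign_shards(shards, nodes):
    """Round-robin shards over nodes (a simplified stand-in for the master's job)."""
    assignment = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        assignment[nodes[i % len(nodes)]].append(shard)
    return assignment

# Build a throwaway index folder with three shard subfolders.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "my-index"
    for name in ("shard-0", "shard-1", "shard-2"):
        (index_dir / name).mkdir(parents=True)
    shards = list_shards(index_dir)

assignment = assign_shards(shards, ["node-a", "node-b"])
```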
  • 7.  
  • 8. Pros and Cons
    • Pros:
      • Copies and distributes shards automatically across slave nodes.
      • Supports distributing queries and aggregating results.
    • Cons:
      • No indexing support.
      • Incremental index updates are hard.
      • Resharding is too expensive.
  • 9. Elastic Search (www.elasticsearch.org)
    • Elastic Search is an Open Source, Distributed, RESTful, Search Engine built on top of Lucene
    • Automatic Shard allocation
    • Automatic index sharding and index updates
    • HTTP network interface for indexing, searching, and administration: a purely RESTful API.
    • Schema Free.
    • Can be integrated well with Hadoop/Map-Reduce
  • 10.  
  • 11. Behind Elastic
  • 12. Automatic shard allocation
    • Elasticsearch needs no load balancer: each node can receive a request and, if it can’t handle it, automatically delegates it to the appropriate node(s).
    • To scale out search, you can simply add more shards or more replicas per shard.
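The delegation behaviour can be sketched as follows: any node accepts a request, computes which shard the document belongs to, and either serves it or forwards it to the owning node. This Python sketch rests on assumptions: `zlib.crc32` stands in for the real hash function, and the shard-to-node map and node names are hypothetical. It is not Elasticsearch code.

```python
import zlib

NUM_SHARDS = 4
# Hypothetical cluster layout: which node holds each primary shard.
SHARD_TO_NODE = {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-b"}

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard deterministically from the document ID (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def handle_request(receiving_node, doc_id):
    """Serve locally if this node owns the shard, otherwise forward to the owner."""
    owner = SHARD_TO_NODE[shard_for(doc_id)]
    action = "local" if owner == receiving_node else "forwarded"
    return owner, action

owner_via_a, action_via_a = handle_request("node-a", "doc-1")
owner_via_b, action_via_b = handle_request("node-b", "doc-1")
```

Whichever node receives the request, the same owner is computed, so clients do not need to know the cluster layout.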
  • 13. HbaseDirectory – What? Directory
  • 14. HbaseDirectory – What? Indexing Phase Searching Phase Directory
  • 15. HbaseDirectory – What?
    • Is Directory distributed? No, but it is not impossible.
    • Distributed? Use Directory on top of a distributed storage system.
      • HDFS: too slow
      • HBase: our choice, since it is optimized for random access, which suits the way a Lucene index is read.
      • Hbase Directory: treat HBase as a logical “Directory”.
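The idea of treating a key-value store as a logical "Directory" can be sketched in Python: files are split into fixed-size blocks keyed by (file name, block number), so a read at an arbitrary offset touches only the blocks it needs. The class name `KVDirectory`, the block size, and the file name are hypothetical, and a dict stands in for an HBase table. This is not Lucene's Directory API or the HbaseDirectory code.

```python
class KVDirectory:
    """Toy file store over a key-value table; a dict stands in for an HBase table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # (file name, block number) -> bytes
        self.lengths = {}  # file name -> total length in bytes

    def write_file(self, name, data):
        """Store the file as consecutive fixed-size blocks."""
        for start in range(0, len(data), self.block_size):
            key = (name, start // self.block_size)
            self.blocks[key] = data[start:start + self.block_size]
        self.lengths[name] = len(data)

    def read(self, name, offset, length):
        """Random-access read that touches only the blocks covering the range."""
        end = min(offset + length, self.lengths[name])
        out = bytearray()
        pos = offset
        while pos < end:
            block_no, within = divmod(pos, self.block_size)
            chunk = self.blocks[(name, block_no)][within:within + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)

    def list_all(self):
        """Names of all stored files, like a directory listing."""
        return sorted(self.lengths)

directory = KVDirectory(block_size=4)
directory.write_file("segment.dat", b"0123456789")
middle = directory.read("segment.dat", 3, 5)
```

Keying by block is what makes a random-access store a better fit than a streaming file system here: a seek becomes a single keyed lookup.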
  • 16. Two Modes
    • Hbase Directory: lazy mode
      • Keep Lucene’s index file structures and port them to HBase
      • Rewrite only 2 classes: FSDirectory & RAMDirectory (Directory implementations)
    • Hbase Directory: active mode
      • Redesign the index structure to exploit HBase’s strengths.
      • Rewrite: the 2 above + IndexReader & IndexWriter
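One possible redesigned layout is the one given in the slide note at the top of this transcript: the term is the row key, the column identifiers are document IDs, and each value holds the positions where the term occurs. The Python sketch below models that layout with nested dicts standing in for an HBase table. The function names and sample documents are assumptions for illustration, not the HbaseDirectory implementation.

```python
from collections import defaultdict

def build_term_rows(docs):
    """Map term -> {doc_id: [positions]}, mirroring row key -> column -> value."""
    rows = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            rows[term].setdefault(doc_id, []).append(position)
    return rows

def phrase_match(rows, first, second):
    """Doc IDs where `second` appears immediately after `first`."""
    hits = []
    for doc_id, positions in rows.get(first, {}).items():
        following = set(rows.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical documents.
docs = {
    "doc1": "distributed search with lucene",
    "doc2": "search the distributed index",
}
rows = build_term_rows(docs)
matches = phrase_match(rows, "distributed", "search")
```

Looking up a term is a single row read, and storing positions per document is what allows phrase queries without touching the original text.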
  • 17. Lucene index flow – Hbase flow
  • 18. Performance & Conclusion
    • Refer to the Excel file
    • HbaseDirectory – Active mode is the correct choice.
    • Improvements are still needed.