Patterns for large scale search


Patterns for managing content on a distributed search system

Published in: Technology
  1. Large scale search: Patterns for dealing with large-scale search systems
  2. Overview
     • How to provide a scalable platform for both users and data
     • Issues introduced by a scalable platform
     • Patterns for dealing with the issues
  3. Big data: As data volumes increase, they can prove too large for any one server to manage.
  4. Big data: Partitioning. Data can be partitioned into suitably sized “shards” and placed on different servers.
  5. Partitioning: divide and conquer. Each user’s search queries all shards in parallel and combines the results to provide fast responses.
  6. Big search loads: However, as user volumes increase, the load on each shard server can become too great.
  7. Replication: To spread the load of many simultaneous users, indexes need to be replicated.
  8. Scaling summary
     • Partitioning: coping with data volumes
     • Replication: coping with user volumes (and providing redundancy in the event of failure)
  9. Issues: So far so good, but a scalable system with many servers raises the concern of balancing consistency and availability...
  10. Consistency vs availability: As the number of required servers increases, there is an increased probability that a server will fail or lag when adding new content.
      [Figure: content freshness per server]
  11. Consistency vs availability: These potential inconsistencies across servers introduce a dilemma: search the latest available content, or older, consistent content?
      [Figure: content freshness per server, marking the earliest cross-server consistent content and the latest available content]
  12. Consistency vs availability: What follows is a number of architectural patterns, each of which makes a trade-off between the consistency and the availability of the content being searched. From most consistent to most available:
      • Full consistency
      • Shard consistency
      • Managed inconsistency
      • Sticky user sessions
      • Every man for himself
  13. Patterns: Full consistency
      Description: All servers are designed to coordinate together when applying batches of new content. If any one server fails to apply updates, all servers abandon this batch of updates.
      Pros:
      • All users of the system see the same version of content, i.e. the same point in time.
      Cons:
      • Complex distributed transaction software is required to coordinate updates.
      • Any failure on a server delays the visibility of new content on all servers.
  14. Patterns: Shard consistency
      Description: Within each “shard”, replica servers strive to maintain identical copies of the same content. New content additions are coordinated within each shard, with any failure of a replica server aborting additions to that shard.
      Pros:
      • Update failures are isolated: they affect the availability of new content on a single shard only.
      • All users see the same (potentially uneven) content.
      Cons:
      • Complex distributed transaction software may be required to coordinate updates.
      • Any failure delays the visibility of new content in that shard.
      • Shards may be “uneven” in the points in time they represent.
  15. Patterns: Managed inconsistency
      Description: All servers apply updates independently, with an agreed tolerance for “drift” between the freshness of content held on servers. When this threshold is reached, the servers with the newest content halt updates until the drift gap is closed (this may require removing a failing replica server from active service).
      Pros:
      • New content is continually made available to users until pre-defined tolerances for failures are exceeded.
      Cons:
      • Different users may see different results depending on which (almost) replica the load balancer chooses to service their queries.
      • Individual users hitting the refresh button may also see different results as a result of non-exact replica servers.
      • Shards may be “uneven” in the points in time they represent.
  16. Patterns: Sticky user sessions
      Description: All servers are allowed to update independently. The load balancer is configured to route a user’s searches to the same choice of replica server in each shard whenever possible, to hide any temporary drift between replicas.
      Pros:
      • New content is continually made available to users.
      • Individual users should not experience a “step back in time” when repeating the same query due to inconsistent replicas.
      Cons:
      • Different users may see different results depending on which (almost) replica the load balancer chooses to service the query.
      • Shards may be “uneven” in the points in time they represent.
  17. Patterns: Every man for himself
      Description: All servers are allowed to update independently.
      Pros:
      • New content is continually made available to users.
      Cons:
      • Different users may see different results depending on which (almost) replica the load balancer chooses to service the query.
      • Individual users hitting the refresh button may also see different results as a result of non-exact replica servers.
  18. Considerations when selecting a pattern
      • Pick an acceptable user experience as a starting point:
        • “I always expect to be acting on the latest available information”
        • “I need all results to represent the same point in time”
        • “I expect hitting the refresh button to take me forward in time, never back”
        • “I expect to always see the same content as my colleagues”
      • Recognise that not all user requirements are realisable, so rank them by importance.
      • Pick a pattern that works best for the selected requirements:
        • Consider mixing some patterns, e.g. “Managed inconsistency” with “Sticky user sessions” seems a good compromise between maintaining (perceived) consistency and content availability.
        • Consider different strategies for different user groups, e.g. VIP users will always see the guaranteed-latest content.
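
The parallel query-and-combine step from slide 5 can be sketched as follows. This is a minimal illustration only: the in-memory `SHARDS` data, the term-count scoring, and the names `search_shard` and `search_all` are assumptions made for the example and do not come from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards: each maps a document id to its text.
SHARDS = [
    {"a1": "large scale search", "a2": "distributed systems"},
    {"b1": "search patterns", "b2": "replication basics"},
    {"c1": "search search search", "c2": "partitioning data"},
]


def search_shard(shard, term):
    """Score each document in one shard by how often the term occurs."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def search_all(shards, term, top_n=3):
    """Query every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term), shards))
    merged = [hit for partial in partials for hit in partial]
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(merged, key=lambda h: (-h[0], h[1]))[:top_n]


print(search_all(SHARDS, "search"))
```

Because every shard is queried at the same time, the response time is bounded by the slowest shard rather than by the sum of all shards.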
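
For the replication step in slide 7, one simple way to spread queries over the replicas of a shard is round-robin selection. The slides do not say which load-balancing policy is used, so the policy, the `ReplicaPool` class, and the replica names below are all assumptions.

```python
import itertools


class ReplicaPool:
    """Hands out the replicas of one shard in round-robin order."""

    def __init__(self, replicas):
        # cycle() repeats the replica list endlessly in the same order.
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        return next(self._cycle)


# Hypothetical replica names for a single shard.
pool = ReplicaPool(["replica-a", "replica-b", "replica-c"])
chosen = [pool.next_replica() for _ in range(4)]
print(chosen)
```

With this policy, consecutive queries from the same user can land on different replicas, which is the situation the later patterns have to deal with.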
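
The all-or-nothing batch update behind "Full consistency" (slide 13) and "Shard consistency" (slide 14) might look like the sketch below. It is a simplified prepare/commit scheme under stated assumptions: the `Server` class and `apply_batch` function are hypothetical, and a real distributed transaction would also have to handle failures that occur during the commit step, which this sketch ignores.

```python
class Server:
    """Hypothetical index server holding a list of applied documents."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.docs = []
        self._staged = None

    def prepare(self, batch):
        """Stage the batch; return False if this server cannot accept it."""
        if not self.healthy:
            return False
        self._staged = list(batch)
        return True

    def commit(self):
        """Make the staged batch visible to searches."""
        self.docs.extend(self._staged)
        self._staged = None

    def abort(self):
        """Discard the staged batch."""
        self._staged = None


def apply_batch(servers, batch):
    """Apply the batch to every server in the group, or to none of them."""
    prepared = []
    for server in servers:
        if server.prepare(batch):
            prepared.append(server)
        else:
            # One failure abandons the batch for the whole group.
            for p in prepared:
                p.abort()
            return False
    for server in servers:
        server.commit()
    return True
```

Passing every server in the system to `apply_batch` corresponds to full consistency. Calling it once per shard, with only that shard's replicas, corresponds to shard consistency: a failing replica then blocks new content on its own shard only.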
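
The drift check in "Managed inconsistency" (slide 15) can be illustrated with a small function. It assumes that each server reports the number of the last update batch it applied and that drift is measured in batches. Both are assumptions for the example, as is the name `servers_to_pause`; the slides do not specify how freshness is measured.

```python
def servers_to_pause(versions, tolerance):
    """Return the servers that should halt updates until laggards catch up.

    versions maps a server name to the number of the last batch it applied.
    A server is paused when it is at least `tolerance` batches ahead of the
    server with the oldest content.
    """
    oldest = min(versions.values())
    return sorted(
        name for name, version in versions.items()
        if version - oldest >= tolerance
    )


# Hypothetical state: server "c" is lagging three batches behind.
state = {"a": 10, "b": 10, "c": 7}
print(servers_to_pause(state, tolerance=3))
```

If the lagging server never catches up, the paused servers stay paused, which is why the slide notes that a failing replica may have to be removed from active service.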
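
The routing rule in "Sticky user sessions" (slide 16) can be approximated by hashing the user and shard identifiers to pick a replica. Hash-based routing is one possible technique, not necessarily what the author had in mind; `pick_replica` and the identifiers used are hypothetical.

```python
import hashlib


def pick_replica(user_id, shard_id, replicas):
    """Map a user to the same replica of a shard on every request."""
    key = f"{user_id}:{shard_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-a", "replica-b", "replica-c"]
first = pick_replica("alice", "shard-1", replicas)
second = pick_replica("alice", "shard-1", replicas)
print(first == second)
```

The mapping only stays stable while the replica list is unchanged. When a replica is removed, some users are rerouted, which matches the "whenever possible" qualifier on the slide.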
