Solr Compute Cloud - An Elastic SolrCloud Infrastructure

Scaling a search platform to serve hundreds of millions of documents with low latency and high throughput, at an optimized cost, is an extremely hard problem. BloomReach has implemented SC2, an elastic Solr infrastructure for big data applications that supports heterogeneous workloads and is hosted in the cloud. It dynamically grows and shrinks search servers to provide application and pipeline level isolation, NRT search and indexing, latency guarantees, and application-specific performance tuning. In addition, it provides high availability features such as differential real-time streaming, disaster recovery, context-aware replication, and automatic shard and replica rebalancing, all with a zero downtime guarantee for all consumers. This infrastructure currently serves hundreds of millions of documents with millisecond response times under a load on the order of 200-300K QPS.

This presentation will describe an innovative implementation of scaling Solr in an elastic fashion. It will review the architecture and take a deep dive into how its components interact to make the infrastructure truly elastic, real time, and robust while meeting latency needs.

Published in: Engineering

  1. 1. Solr Compute Cloud – An Elastic Solr Infrastructure Nitin Sharma - Member of technical staff, BloomReach - nitin.sharma@bloomreach.com
  2. 2. Abstract Scaling search platforms is an extremely hard problem • Serving hundreds of millions of documents • Low latency • High throughput workloads • Optimized cost. At BloomReach, we have implemented SC2, an elastic Solr infrastructure for big data applications that: • Supports heterogeneous workloads while hosted in the cloud. • Dynamically grows/shrinks search servers. • Provides application and pipeline level isolation, NRT search and indexing. • Offers latency guarantees and application-specific performance tuning. • Provides high-availability features like cluster replacement, cross-data center support, disaster recovery etc.
  3. 3. About Us BloomReach BloomReach has developed a personalized discovery platform that features applications that analyze big data to make our customers’ digital content more discoverable, relevant and profitable. Myself I work on search platform scaling for BloomReach’s big data. My relevant experience and background includes scaling real-time services for latency sensitive applications and building performance and search-quality metrics infrastructure for personalization platforms.
  4. 4. The BloomReach Personalized Discovery Platform
  5. 5. BloomReach’s Applications Content understanding • Organic Search. What it does: Content optimization, management and measurement. Benefit: Enhanced discoverability and customer acquisition in organic search. • SNAP. What it does: Personalized onsite search and navigation across devices. Benefit: Relevant and consistent onsite experiences for new and known users. • Compass. What it does: Merchandising tool that understands products and identifies opportunities. Benefit: Prioritize and optimize online merchandising.
  6. 6. Agenda • BloomReach search use cases and architecture • Old architecture and issues • Scaling challenges • Elastic SolrCloud architecture and benefits • Lessons learned
  7. 7. BloomReach Search Use Cases 1. Front-end (serving) queries – Uptime and Latency sensitive 2. Batch search pipelines – Throughput sensitive 3. Time bound indexing requirements – Customer Specific 4. Time bound Solr config updates
  8. 8. BloomReach Search Architecture Zookeeper Ensemble Map Reduce Solr Cluster Pipelines (Reads) Indexing Pipelines Pipeline 1 Pipeline 2 Pipeline n Indexing 1 Indexing 2 Indexing n Heavy Load Moderate Load Light Load Legend Public API Search Traffic Search Traffic
  9. 9. Throughput Issues… Zookeeper Ensemble Solr Cluster Pipeline 1 Pipeline 2 Pipeline n Indexing 1 Indexing 2 Indexing n Public API Search Traffic ● Heterogeneous read workload ● Same collection - different pipelines, different query patterns, different schedule ● Cache tuning is virtually impossible ● Larger pipeline starving the small ones ● Machine utilization determines throughput and stability of a pipeline at any point ● No isolation among jobs
  10. 10. Stability and Uptime Issues… Zookeeper Ensemble Solr Cluster Pipeline 1 Pipeline 2 Pipeline n Indexing 1 Indexing 2 Indexing n Public API Search Traffic ● Bad clients – bring down the cluster/degrade performance ● Bad queries (with heavy load) – render nodes unresponsive ● Garbage collection issues ● ZK stability issues (as we scale collections) ● CPU /Load Issues ● Higher number of concurrent pipelines, higher number of issues
  11. 11. Indexing Issues… Zookeeper Ensemble Solr Cluster Pipeline 1 Pipeline 2 Pipeline n Indexing 1 Indexing 2 Indexing n Public API Search Traffic ● Commit frequencies vary with indexer types ● Indexer run during another pipeline – performance ● Indexer client leaks ● Too many stored fields ● Non-batch updates
  12. 12. Rethinking… • Shared cluster for pipelines does not scale. • Guaranteeing an uptime of 99.99%+ is non-trivial. • Every job runs great in isolation. When you put them together, they fail. • Running index-heavy load and read-heavy load together causes cluster performance issues. • Any direct access to production cluster risks cluster stability (client leaks, bad queries etc.). What if every pipeline had its own cluster?
  13. 13. Solr Compute Cloud (SC2) • Elastic Infrastructure – Provision Solr Clusters on demand, on-the-fly. • Create, Use, Terminate Model - Create a temporary cluster with necessary data, use it and throw it away. • Technologies behind SC2 (built in House) Cluster Management API - Dynamic cluster provisioning and resource allocation. Solr HAFT – High availability and data management library for SolrCloud. • Isolation - Pipelines get their own cluster. One cannot disrupt another. • Dynamic Scaling – Every pipeline can state its own replication requirements. • Production Safeguard - No direct access. Safeguards from bad clients/access patterns. • Cost Saving – Provision for the average; withstand peak with elastic growth.
  14. 14. Solr Compute Cloud Zookeeper Ensemble Solr Cluster Request: {Collection: A, Replica: 6} Pipeline 1 Solr Compute Cloud API Solr Cluster Collection A Replicas: 6 1. Read pipeline requests collection and desired replicas from SC2 API. 2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data). 3. SC2 calls HAFT service to replicate data from production to provisioned cluster. 4. Pipeline uses this cluster to run job. 1 4 2 3 Solr HAFT Service 3 Read Replicate
  15. 15. Solr Compute Cloud… Zookeeper Ensemble Solr Cluster Pipeline 1 Solr Compute Cloud API Solr Cluster Collection A Replicas: 6 1. Pipeline finishes running the job. 2. Pipeline calls SC2 API to terminate the cluster. 3. SC2 terminates the cluster. Terminate: {Cluster} 2 3 Solr HAFT Service 1
  16. 16. Solr Compute Cloud – Read Pipeline View Zookeeper Ensemble Pipeline 1 Solr Compute Cloud API Solr Cluster Collection A Replicas: 6 Request: {Collection: A, Replica: 6} Pipeline 2 Solr Cluster Collection B Replicas: 2 Request: {Collection: B, Replica: 2} Pipeline n Solr Cluster Collection C Replicas: 1 Request: {Collection: C, Replica: 1} Solr HAFT Service Production Solr Cluster
  17. 17. Solr Compute Cloud – Indexing Zookeeper Ensemble Production Solr Cluster Request: {Collection: A, Replica: 2} Indexing Solr Compute Cloud API Solr Cluster Collection A Replicas: 6 1. Indexer requests collection and desired replicas from SC2 API. 2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data). 3. Indexer uses this cluster to index the data. 4. Indexer calls HAFT service to replicate the index from dynamic cluster to production. 5. HAFT service reads data from dynamic cluster and replicates to production Solr. 1 3 2 Replicate Solr HAFT Service 4 5 Read
  18. 18. Solr Compute Cloud – Global View Zookeeper Ensemble Solr Compute Cloud API Solr HAFT Service Production Solr Cluster Indexing Pipelines 1 Elastic Clusters Indexing Pipelines n Read Pipelines 1 Read Pipelines n Provision: {Cluster} Terminate: {Cluster} Replicate Index Replicate Index Run Job
  19. 19. Solr Compute Cloud API 1. API to provision clusters on demand. 2. Dynamic cluster and resource allocation (includes cost optimization) 3. Track request state, cluster performance and cost. 4. Terminate long-running, runaway clusters.
  20. 20. Solr HAFT Service 1. High availability and fault tolerance 2. Home-grown technology 3. Open Source -  (Work in progress) 4. Features • One push disaster recovery • High availability operations • Replace node • Add replicas • Repair collection • Collection versioning • Cluster backup operations • Dynamic replica creation • Cluster clone • Cluster swap • Cluster state reconstruction
  21. 21. Solr HAFT Service – Functional View Black Box Recording Index Management Actions Custom Commit Node Replacement Collection Versioning Solr HAFT Service Clone Collections Clone Alias Node Repair Clone Cluster Lucene Segment Optimize High Availability Actions Cluster Backup Operations Solr Metadata Zookeeper Metadata Dynamic Replica Creation Cluster Clone Cluster Swap Cluster State Reconstruction Verification Monitoring
  22. 22. Disaster Recovery in New Architecture Zookeeper Ensemble Old Production Solr Cluster Zookeeper Ensemble New Solr Cluster Push Button Recovery Solr HAFT Service Brave Soul on Pager Duty 1 2 3 DNS 1. Guy on Pager clicks the recovery button. 2. Solr HAFT Service triggers: Cluster Setup, State Reconstruction, Cluster Clone, Cluster Swap. 3. Production DNS – New Cluster
  23. 23. SC2 vs Non-SC2 (Stability Features) Property Non-SC2 SC2 Linear Scalability for Heterogeneous Workload Pipeline Level Isolation Dynamic Collection Scaling Prevention from Bad Clients Pipeline Specific Performance No Direct Access to Production Cluster Can Sleep at night? 
  24. 24. SC2 vs Non-SC2 (Availability Features) Property Non-SC2 SC2 Cross Data-Center Support Cluster Cloning Collection Versioning One-Push Disaster Recovery Repair API for Nodes/Collections Node Replacement
  25. 25. Lessons Learned 1. Solr is a search platform. Do not use it as a database (for scans and lookups). Evaluate your stored fields. 2. Understand access patterns, QPS and queries in detail. Be careful when tuning caches. 3. Have access control for large-scale jobs that directly talk to your cluster. (Internal DDOS attacks are hard to track.) 4. Instrument every piece of infrastructure and collect metrics. 5. Build automated disaster recovery (You will need it. )
  26. 26. Questions? Thank You! Nitin Sharma nitin.sharma@bloomreach.com https://www.linkedin.com/in/knitinsharma
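
The create, use, terminate model from slides 13 to 15 can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not BloomReach's implementation: `FakeSc2Api`, `provision`, `replicate_from_production`, `terminate` and `temporary_cluster` are hypothetical names invented for this sketch, and only the request shape (a collection name plus a replica count) comes from the slides.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Cluster:
    """A temporary cluster holding one collection (hypothetical model)."""
    cluster_id: int
    collection: str
    replicas: int
    has_data: bool = False


@dataclass
class FakeSc2Api:
    """In-memory stand-in for the SC2 API plus HAFT service (assumption)."""
    active: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def provision(self, collection: str, replicas: int) -> Cluster:
        # Step 2 on slide 14: provision a cluster on demand.
        cluster = Cluster(next(self._ids), collection, replicas)
        self.active[cluster.cluster_id] = cluster
        return cluster

    def replicate_from_production(self, cluster: Cluster) -> None:
        # Step 3 on slide 14: copy index data into the new cluster.
        # Here we only flip a flag; no real data is moved.
        cluster.has_data = True

    def terminate(self, cluster: Cluster) -> None:
        # Slide 15: throw the cluster away once the job is done.
        self.active.pop(cluster.cluster_id, None)


@contextmanager
def temporary_cluster(api: FakeSc2Api, collection: str, replicas: int):
    """Provision, populate, hand out, and always terminate a cluster."""
    cluster = api.provision(collection, replicas)
    try:
        api.replicate_from_production(cluster)
        yield cluster
    finally:
        # Runs even if the pipeline job raises, so clusters do not leak.
        api.terminate(cluster)


def run_pipeline(api: FakeSc2Api) -> str:
    """Mimics the request {Collection: A, Replica: 6} from slide 14."""
    with temporary_cluster(api, "A", 6) as cluster:
        return f"ran job on collection {cluster.collection}"
```

Wrapping the lifecycle in a context manager is one way to make termination unconditional: a failed job still releases its cluster, which complements the API-side cleanup of long-running, runaway clusters listed on slide 19.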
