My presentation focuses on how we implemented Solr 4 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 620M documents and is growing by 3 to 4 million documents per day. My presentation will include details about: 1) Designing a Solr Cloud cluster for scalability and high-availability using sharding and replication with Zookeeper, 2) Operations concerns like how to handle a failed node and monitoring, 3) How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput, 4) Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4 is scalable, stable, and is production ready.
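As a rough illustration of the batching pattern behind those throughput numbers, here is a minimal SolrJ sketch (Scala) using the Solr 4-era CloudSolrServer client; the ZooKeeper ensemble, collection name, and field names are placeholders rather than details from the talk.

```scala
import org.apache.solr.client.solrj.impl.CloudSolrServer
import org.apache.solr.common.SolrInputDocument
import scala.collection.JavaConverters._

object BulkIndexer extends App {
  // CloudSolrServer reads cluster state from ZooKeeper, so it always routes
  // updates to the current shard leaders and survives individual node failures.
  val solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr") // placeholder ensemble
  solr.setDefaultCollection("social")                               // hypothetical collection

  // Send documents in batches: per-document round trips are the first
  // throughput killer at thousands of docs/sec.
  val batch = (1 to 1000).map { i =>
    val doc = new SolrInputDocument()
    doc.addField("id", s"post-$i")
    doc.addField("text_t", s"sample social post $i")
    doc
  }
  solr.add(batch.asJava)

  // Avoid explicit commits per batch in production; rely on autoCommit /
  // autoSoftCommit so a new searcher is not opened on every request.
  solr.commit()
  solr.shutdown()
}
```

Keeping commits out of the hot path (autoCommit/autoSoftCommit rather than per-batch commits) is what stops searcher churn from throttling indexing.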
This document provides information about integrating Apache Solr and Apache Spark. It discusses using Solr as a data source and sink for Spark applications, including indexing data from Spark jobs into Solr in real-time and exposing Solr query results as Spark RDDs. The document also summarizes the Spark Streaming and RDD APIs and provides code examples for indexing tweets from Spark Streaming into Solr and reading from Solr into a DataFrame.
Scaling Through Partitioning and Shard Splitting in Solr 4 (thelabdude)
Over the past several months, Solr has reached a critical milestone: it can now elastically scale out to handle indexes reaching into the hundreds of millions of documents. At Dachis Group, we've scaled our largest Solr 4 index to nearly 900M documents, and it is still growing. As our index grows, so does our need to manage this growth.
In practice, it's common for indexes to continue to grow as organizations acquire new data. Over time, even the best designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this Webinar, you'll learn about new features in Solr to help manage large-scale clusters. Specifically, we'll cover data partitioning and shard splitting.
Partitioning helps you organize subsets of data based on data contained in your documents, such as a date or customer ID. We'll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into 2 smaller shards to increase parallelism during query execution.
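A small hedged sketch of how that routing looks in practice with the compositeId router, plus the Collections API call used to split a shard; host names, collection, and the customer prefix are invented for illustration:

```scala
import org.apache.solr.client.solrj.impl.CloudSolrServer
import org.apache.solr.common.SolrInputDocument

object RoutingExample extends App {
  val solr = new CloudSolrServer("zk1:2181/solr") // placeholder ZooKeeper address
  solr.setDefaultCollection("analytics")          // hypothetical collection

  // With the default compositeId router, everything before '!' is hashed to
  // pick the shard, so all of customer42's documents land on the same shard.
  val doc = new SolrInputDocument()
  doc.addField("id", "customer42!post-9001")
  doc.addField("text_t", "routed by customer prefix")
  solr.add(doc)
  solr.commit()

  // Queries for one customer can then be limited to that customer's shard:
  //   q=*:*&_route_=customer42!
  // And when a shard outgrows its hardware, it can be split in place:
  //   /admin/collections?action=SPLITSHARD&collection=analytics&shard=shard1
  solr.shutdown()
}
```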
Attendees will come away from this presentation with a real-world use case that proves Solr 4 is elastically scalable, stable, and is production ready.
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance (Lucidworks, Archived)
The document discusses benchmarking the performance of SolrCloud clusters. It describes Timothy Potter's experience operating a large SolrCloud cluster at Dachis Group. It outlines a methodology for benchmarking indexing performance by varying the number of servers, shards, and replicas. Results show near-linear scalability as nodes are added. The document also introduces the Solr Scale Toolkit for deploying and managing SolrCloud clusters using Python and AWS. It demonstrates integrating Solr with tools like Logstash and Kibana for log aggregation and dashboards.
ApacheCon NA 2015 Spark / Solr Integration (thelabdude)
Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to meet even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark.
Specifically, Tim covers how to populate Solr from a Spark streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering basic use cases, Tim digs a little deeper to show how to use MLLib to enrich documents before indexing in Solr, such as sentiment analysis (logistic regression), language detection, and topic modeling (LDA), and document classification.
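The streaming half of this integration follows a familiar shape. Below is a generic foreachRDD/foreachPartition sketch in Scala rather than the talk's actual spark-solr code; the socket source, ZooKeeper address, and field names are placeholder assumptions.

```scala
import org.apache.solr.client.solrj.impl.CloudSolrServer
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.JavaConverters._

object StreamToSolr extends App {
  val ssc = new StreamingContext(new SparkConf().setAppName("tweets-to-solr"), Seconds(10))
  val tweets = ssc.socketTextStream("localhost", 9999) // stand-in for a Twitter source

  tweets.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One client per partition; documents are buffered and sent as a batch.
      val solr = new CloudSolrServer("zk1:2181/solr") // placeholder ZooKeeper address
      solr.setDefaultCollection("tweets")             // hypothetical collection
      val docs = partition.map { text =>
        val doc = new SolrInputDocument()
        doc.addField("id", java.util.UUID.randomUUID.toString)
        doc.addField("text_t", text)
        doc
      }.toSeq
      if (docs.nonEmpty) solr.add(docs.asJava) // visibility comes from autoSoftCommit
      solr.shutdown()
    }
  }

  ssc.start()
  ssc.awaitTermination()
}
```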
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
How to make a simple, cheap, high-availability, self-healing Solr cluster (lucenerevolution)
Presented by Stephane Gamard, Chief Technology Officer, Searchbox
In this presentation we aim to show how to build a high-availability SolrCloud cluster on Solr 4.1 using only Solr and a few bash scripts. The goal is to present an infrastructure that is self-healing, using only cheap instances based on ephemeral storage. We will start by providing a comprehensive overview of the relationship between collections, Solr cores, shards, and cluster nodes. We continue with an introduction to Solr 4.x clustering using ZooKeeper, with a particular emphasis on cluster state status/monitoring and Solr collection configuration. The core of our presentation will be demonstrated using a live cluster.
We will show how to use cron and bash to monitor the state of the cluster and the state of its nodes. We will then show how we can extend our monitoring to auto-generate new nodes, attach them to the cluster, and assign them shards (selecting between missing shards or replication for HA). We will show that, using a high replication factor, it is possible to use ephemeral storage for shards without the risk of data loss, greatly reducing the cost and management overhead of the architecture. Future work, which might be undertaken as an open source effort, includes monitoring the activity of individual nodes so as to scale the cluster according to traffic and usage.
This document provides an overview of searching in the cloud using Apache Solr. It discusses how Solr allows for full-text search across distributed servers and datasets. Key features of SolrCloud include centralized configuration in Zookeeper, automatic failover, near-real-time indexing, leader election, and optimistic locking for durable writes across shards. The document also covers Solr schemas, indexing data from various sources, caching, and using SolrJ and SolrCloud.
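The optimistic-locking feature mentioned above works through Solr's _version_ field together with atomic updates. A brief hedged SolrJ sketch, with an illustrative document id, field, and version value:

```scala
import java.util.Collections
import org.apache.solr.client.solrj.impl.CloudSolrServer
import org.apache.solr.common.SolrInputDocument

object OptimisticUpdate extends App {
  val solr = new CloudSolrServer("zk1:2181/solr") // placeholder ZooKeeper address
  solr.setDefaultCollection("collection1")

  // Atomic update: send only the field to change, wrapped in a "set" map,
  // instead of re-sending the whole document.
  val doc = new SolrInputDocument()
  doc.addField("id", "doc-1")
  doc.addField("popularity_i", Collections.singletonMap("set", 42))

  // Optimistic locking: include the _version_ obtained from a real-time get
  // (/get?id=doc-1). If another writer changed the doc first, Solr rejects
  // the update with an HTTP 409 conflict instead of silently overwriting.
  doc.addField("_version_", 1632453999123456789L) // illustrative version value
  solr.add(doc)
  solr.shutdown()
}
```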
How SolrCloud Changes the User Experience In a Sharded Environment (lucenerevolution)
Presented by Erick Erickson, Lucid Imagination - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
The next major release of Solr (4.0) will include "SolrCloud", which provides new distributed capabilities for both in-house and externally-hosted Solr installations. Among the new capabilities are: Automatic Distributed Indexing, High Availability and Failover, Near Real Time searching and Fault Tolerance. This talk will focus, at a high level, on how these new capabilities impact the design of Solr-based search applications primarily from infrastructure and operational perspectives.
Solr cluster with SolrCloud at lucenerevolution (tutorial) (searchbox-com)
In this presentation we aim to show how to build a high-availability SolrCloud cluster on Solr 4.1 using only Solr and a few bash scripts. The goal is to present an infrastructure that is self-healing, using only cheap instances based on ephemeral storage. We will start by providing a comprehensive overview of the relationship between collections, Solr cores, shards, and cluster nodes. We continue with an introduction to Solr 4.x clustering using ZooKeeper, with a particular emphasis on cluster state status/monitoring and Solr collection configuration. The core of our presentation will be demonstrated using a live cluster. We will show how to use cron and bash to monitor the state of the cluster and the state of its nodes. We will then show how we can extend our monitoring to auto-generate new nodes, attach them to the cluster, and assign them shards (selecting between missing shards or replication for HA). We will show that, using a high replication factor, it is possible to use ephemeral storage for shards without the risk of data loss, greatly reducing the cost and management overhead of the architecture. Future work, which might be undertaken as an open source effort, includes monitoring the activity of individual nodes so as to scale the cluster according to traffic and usage.
This document provides tips for tuning Solr for high performance. It discusses optimizing queries and facets for CPU usage, tuning memory usage such as using docValues, optimizing disk usage through merge policies and commit settings, reducing network overhead through batching and caching, and techniques like deep paging to improve performance for large result sets. The document emphasizes only indexing and retrieving necessary fields to reduce resource usage and tuning garbage collection to avoid pauses.
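For the deep-paging technique mentioned above, Solr's cursorMark parameter avoids the cost of large start offsets, since no page requires scoring and skipping all the rows before it. A minimal SolrJ sketch with placeholder cluster and collection names:

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrServer
import org.apache.solr.common.params.CursorMarkParams

object DeepPaging extends App {
  val solr = new CloudSolrServer("zk1:2181/solr") // placeholder ZooKeeper address
  solr.setDefaultCollection("collection1")

  // Cursors require a deterministic sort that ends on the unique key.
  val query = new SolrQuery("*:*")
  query.setRows(500)
  query.setSort("id", SolrQuery.ORDER.asc)

  var cursor = CursorMarkParams.CURSOR_MARK_START
  var done = false
  while (!done) {
    query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor)
    val rsp = solr.query(query)
    rsp.getResults.forEach(doc => println(doc.getFieldValue("id")))
    val next = rsp.getNextCursorMark
    done = next == cursor // the cursor stops advancing after the last page
    cursor = next
  }
  solr.shutdown()
}
```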
SolrCloud uses Zookeeper to elect a leader node for each shard. The leader coordinates write requests to ensure consistency. When the leader dies, Zookeeper detects this and elects a new leader based on the nodes' sequence numbers registered with Zookeeper. The new leader syncs updates with replicas and can replay logs if any replicas are too far behind. This allows write requests to continue being served with high availability despite leader failures.
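That election scheme is ZooKeeper's standard ephemeral-sequential recipe. The stripped-down sketch below shows only the registration and lowest-number check; it omits the watch/re-elect loop and all of SolrCloud's sync/replay logic, and assumes an /election znode already exists:

```scala
import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}
import scala.collection.JavaConverters._

object LeaderElection extends App {
  // Connect to ZooKeeper (placeholder address); a real client also reacts
  // to session events delivered to this watcher.
  val zk = new ZooKeeper("zk1:2181", 15000, new Watcher {
    override def process(e: WatchedEvent): Unit = ()
  })

  // Each candidate registers an ephemeral, sequential znode. The znode
  // vanishes automatically if the node's session dies.
  val me = zk.create("/election/n_", Array.emptyByteArray,
    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)

  // Lowest sequence number wins; everyone else watches the candidate just
  // ahead of them to avoid a thundering herd when the leader dies.
  val children = zk.getChildren("/election", false).asScala.sorted
  if (me.endsWith(children.head)) println("I am the leader")
  else println(s"Follower; leader znode is ${children.head}")
}
```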
Solr Cloud allows Solr to be distributed and run across multiple servers for increased performance, scalability, availability, and elasticity. It uses Zookeeper for coordination and shares an index across multiple cores and collections. Documents are routed and replicated to shards and replicas based on a hashing function or custom routing rules to partition the data. Queries are distributed and results merged to provide scalable search across an elastic, fault-tolerant cluster.
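The hash-based partitioning can be pictured as a 32-bit hash space divided evenly among shards. A simplified, dependency-free sketch (Solr actually uses MurmurHash3 over the id or route key, not hashCode):

```scala
object HashRouting extends App {
  // The 32-bit hash space is divided evenly among shards; a document goes to
  // the shard whose range contains the hash of its id (or route key).
  def shardFor(docId: String, numShards: Int): Int = {
    val hash = docId.hashCode                      // stand-in for MurmurHash3
    val rangeSize = (1L << 32) / numShards         // width of each shard's range
    val offset = hash.toLong - Int.MinValue.toLong // shift hash into [0, 2^32)
    math.min((offset / rangeSize).toInt, numShards - 1)
  }

  Seq("post-1", "post-2", "post-3").foreach { id =>
    println(s"$id -> shard${shardFor(id, 4) + 1}")
  }
}
```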
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will demonstrate how to provision, configure, and manage a SolrCloud cluster in Amazon EC2, using a Fabric/boto based solution for automating SolrCloud operations. Attendees will come away with a solid understanding of how to operate a large-scale Solr cluster, as well as tools to help them do it. Tim will also demonstrate these tools live during his presentation. Covered technologies include: Apache Solr, Apache ZooKeeper, Linux, Python, Fabric, boto, Apache Kafka, Apache JMeter.
Solr Exchange: Introduction to SolrCloud (thelabdude)
SolrCloud is a set of features in Apache Solr that enable elastic scaling of search indexes using sharding and replication. In this presentation, Tim Potter will provide an architectural overview of SolrCloud and highlight its most important features. Specifically, Tim covers topics such as: sharding, replication, ZooKeeper fundamentals, leaders/replicas, and failure/recovery scenarios. Any discussion of a complex distributed system would not be complete without a discussion of the CAP theorem. Mr. Potter will describe why Solr is considered a CP system and how that impacts the design of a search application.
Integrating Spark and Solr (Timothy Potter, Lucidworks) (Spark Summit)
This document discusses integrating Solr and Spark. It provides an example of using Solr as a sink for streaming data from Spark Streaming. It also describes reading data from Solr into Spark using SolrRDD and exposing it as a Spark SQL DataFrame. Additional capabilities covered include querying Solr from the Spark shell, document matching using stored queries, and reading term vectors from Solr for machine learning with MLLib.
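Reading Solr into a DataFrame looks roughly like this, written in the style of the lucidworks/spark-solr connector's documented zkhost/collection/query options; the cluster address, collection, and field names are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SolrToDataFrame extends App {
  val spark = SparkSession.builder.appName("solr-to-df").getOrCreate()

  // "solr" is the data source registered by the spark-solr connector;
  // each shard is read in parallel using deep-paging cursors.
  val df = spark.read.format("solr")
    .option("zkhost", "zk1:2181/solr") // placeholder ZooKeeper ensemble
    .option("collection", "tweets")    // hypothetical collection
    .option("query", "text_t:spark")
    .load()

  df.printSchema()
  df.groupBy("author_s").count().show() // assumes an author_s field exists
}
```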
In the big data world, our data stores communicate over an asynchronous, unreliable network to provide a facade of consistency. However, to really understand the guarantees of these systems, we must understand the realities of networks and test our data stores against them.
Jepsen is a tool which simulates network partitions in data stores and helps us understand the guarantees of our systems and their failure modes. In this talk, I will help you understand why you should care about network partitions and how we can test datastores against partitions using Jepsen. I will explain what Jepsen is, how it works, and the kinds of tests it lets you create. We will try to understand the subtleties of distributed consensus and the CAP theorem, and demonstrate how different data stores such as MongoDB, Cassandra, Elastic, and Solr behave under network partitions. Finally, I will describe the results of the tests I wrote using Jepsen for Apache Solr and discuss the kinds of rare failures which were found by this excellent tool.
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh... (Lucidworks)
This document discusses scaling SolrCloud to support a large number of collections. It identifies four main problems in scaling: 1) large cluster state size, 2) overseer performance issues with thousands of collections, 3) difficulty moving data between collections, and 4) limitations in exporting full result sets. The document outlines solutions implemented to each problem, including splitting the cluster state, optimizing the overseer, improving data management between collections, and enabling distributed deep paging to export full result sets. Testing showed the ability to support 30 hosts, 120 nodes, 1000 collections, over 6 billion documents, and sustained performance targets.
This document discusses scaling Solr using SolrCloud. It provides an overview of Solr history and architectures. It then describes how SolrCloud addresses limitations of earlier architectures by utilizing Apache ZooKeeper for coordination across Solr nodes and shards. Key concepts discussed include collections, shards, replicas, and routing queries across shards. The document also covers configuration topics like caches, indexing tuning, and monitoring.
This document discusses SolrCloud failover and testing. It provides an overview of how SolrCloud uses ZooKeeper to elect an overseer node to monitor cluster state and automatically create a new replica on an available node when one goes down, allowing failover capability. It also discusses challenges with distributed testing and recommends focusing more on backfilling tests when changing code, fixing frequently failing tests, and adding more unit tests to improve Solr's testing culture.
Cross Datacenter Replication aka CDCR has been a long requested feature in Apache Solr. In this talk, we will discuss CDCR as released in Apache Solr 6.0 and beyond to understand its use-cases, limitations, setup and performance. We will also take a quick look at the future enhancements that can further simplify and scale this feature.
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM (Lucidworks)
This document discusses building and running Solr as a service in the cloud. It covers:
- The challenges of deploying Solr in cloud environments and the need for a managed service.
- The architecture of the Solr-as-a-Service platform, which uses Docker, Mesos, and other tools to provide multi-tenant Solr clusters.
- Key aspects of managing Solr clusters in the cloud service, including software upgrades, resizing clusters, handling replicas, and balancing clusters.
Webinar: Solr & Spark for Real Time Big Data Analytics (Lucidworks)
Lucidworks Senior Engineer and Lucene/Solr Committer Tim Potter presents common use cases for integrating Spark and Solr, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution with Spark and Solr.
These slides were presented at the Great Indian Developer Summit 2014 at Bangalore. See http://www.developermarch.com/developersummit/session.html?insert=ShalinMangar2
"SolrCloud" is the name given to Apache Solr's feature set for fault tolerant, highly available, and massively scalable capabilities. SolrCloud has enabled organizations to scale, impressively, into the billions of documents with sub-second search!
This document provides an overview of SolrCloud on Hadoop. It discusses how SolrCloud allows for distributed, highly scalable search capabilities on Hadoop clusters. Key components that work with SolrCloud are also summarized, including HDFS for storage, MapReduce for processing, and ZooKeeper for coordination services. The document demonstrates how SolrCloud can index and query large datasets stored in Hadoop.
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha... (Lucidworks)
The document describes Netflix's Rebalance API for SolrCloud, which provides strategies and capabilities for scaling Solr clusters while maintaining service level agreements (SLAs). The Rebalance API includes scaling strategies for operations like auto-sharding, redistributing data, and replacing nodes with zero downtime. It also includes allocation strategies that determine core placement. The goal is to allow for fine-grained SLA management when indexing and querying large datasets across multiple data centers. The Rebalance API has been open sourced and is aimed at forming the basis for auto-scaling in Solr.
Solr Compute Cloud - An Elastic SolrCloud Infrastructure (Nitin S)
Scaling search platforms for serving hundreds of millions of documents with low latency and high throughput workloads at an optimized cost is an extremely hard problem. BloomReach has implemented Sc2, which is an elastic Solr infrastructure for Big Data applications, supporting heterogeneous workloads and hosted in the cloud. It dynamically grows/shrinks search servers to provide application and pipeline level isolation, NRT search and indexing, latency guarantees, and application-specific performance tuning. In addition, it provides various high availability features such as differential real-time streaming, disaster recovery, context aware replication, and automatic shard and replica rebalancing, all with a zero downtime guarantee for all consumers. This infrastructure currently serves hundreds of millions of documents in millisecond response times under loads on the order of 200-300K QPS.
This presentation will describe an innovative implementation of scaling Solr in an elastic fashion. It will review the architecture and take a deep dive into how each of these components interacts to make the infrastructure truly elastic, real-time, and robust while serving latency needs.
Solr 4.0 dramatically improves scalability, performance, and flexibility. An overhauled Lucene underneath sports near real-time (NRT) capabilities allowing indexed documents to be rapidly visible and searchable. Lucene’s improvements also include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage. These Lucene improvements automatically make Solr much better, and Solr magnifies these advances with “SolrCloud.” SolrCloud enables highly available and fault tolerant clusters for large scale distributed indexing and searching. There are many other changes that will be surveyed as well. This talk will cover these improvements in detail, comparing and contrasting to previous versions of Solr.
SFDX is an amazing tool that allows for fast development, robust technology usage, and great deployment performance.
Over the last few years we have delivered multiple large projects, and many smaller ones, exclusively using SFDX.
We will look at how SFDX currently works, what it allows you to do, and which items do NOT work as one would expect. We will go through real-life scenarios we have encountered during our projects and explain how we should work with SFDX in a world of devs, admins, and business admins.
Topic 1 – Admins do work and get it into source control with one click
* Creating new field
* Deleting field / Renaming field
Topic 2 – Can profiles be managed by SFDX?
Topic 4 – Delta automatic deployment
* how to achieve them, what are the community tools to use
Topic 5 – Customer made changes to production, am I done?
Topic 6 – Metadata that doesn’t make sense and how to work with it
* we go through metadata that behaves differently, which every user needs to understand (translations, record types, and many more)
I would love attendees to leave the room knowing what is possible, what the quick wins are, what the gotchas are, and what not to try at all 🙂
The session will be guided by a presentation in which all points are explained and demonstrated via examples in the IDE on live scratch orgs.
The document discusses benchmarking the performance of Apache Solr. It describes testing the indexing performance of SolrCloud clusters of varying sizes. The results show that indexing performance scales nearly linearly as nodes are added. It also discusses using the Solr Scale Toolkit, which is a set of tools for deploying, managing, and benchmarking SolrCloud clusters. Future work mentioned includes benchmarking mixed workloads and integrating chaos monkey tests.
Real-time Inverted Search in the Cloud Using Lucene and Storm (lucenerevolution)
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr's full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
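The heart of that approach is inverting the usual flow: index one incoming document into a throwaway in-memory index, then run every stored subscription query against it. A hedged sketch using Lucene's MemoryIndex, with invented subscription queries and field names:

```scala
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.index.memory.MemoryIndex
import org.apache.lucene.queryparser.classic.QueryParser

object InvertedSearch extends App {
  val analyzer = new StandardAnalyzer()

  // Stored subscriptions: each user has a standing query (invented examples).
  val subscriptions = Map(
    "alice" -> "body:(solr AND storm)",
    "bob"   -> "body:\"inverted search\""
  )

  // Index ONE incoming document into a throwaway in-memory index, then run
  // every subscription query against it; a positive score means a match.
  def matchingUsers(docText: String): Iterable[String] = {
    val idx = new MemoryIndex()
    idx.addField("body", docText, analyzer)
    subscriptions.collect {
      case (user, q) if idx.search(new QueryParser("body", analyzer).parse(q)) > 0f => user
    }
  }

  println(matchingUsers("distributed inverted search with solr and storm"))
}
```

In a Storm topology, each bolt instance would hold a partition of the subscription set, so the matching work fans out across the cluster.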
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard (InfluxData)
1. The document provides an overview of InfluxEnterprise, including its core open source functionality, high availability features, scalability, fine-grained authorization, support options, and on-premise or cloud deployment options.
2. It discusses signs that an organization may be ready for InfluxEnterprise, such as high CPU usage, issues with single node deployments, and needing improved data durability or throughput.
3. The document covers InfluxEnterprise cluster architecture including meta nodes, data nodes, replication patterns, ingestion and query rates for different replication configurations, and examples for mothership, durable data ingest, and integrating with ElasticSearch deployments.
This is a summary of the sessions I attended at PASS Summit 2017. Out of the week-long conference, I put together these slides to summarize the event and present it at my company. The slides cover my favorite sessions, the ones I found most valuable, and include screenshots of demos that I personally developed and tested, just as the speakers did at the conference.
Debugging Skynet: A Machine Learning Approach to Log Analysis - Ianir Ideses,... (DevOpsDays Tel Aviv)
This document proposes using machine learning techniques to analyze logs and surface the most relevant ones. It discusses using both unsupervised and supervised learning. Unsupervised techniques like clustering could analyze large amounts of unlabeled data to group similar logs. Supervised learning would involve acquiring labels to train classifiers on what is relevant versus irrelevant. The proposed solution involves normalizing logs, acquiring labels, training models, and then classifying and enhancing new logs. It suggests this could be done at scale using tools like Spark.
Megastore is a scalable data storage system developed by Google to meet the requirements of modern interactive online services. It blends the scalability of NoSQL databases with the convenience of SQL, providing ACID transactions across entity groups. Megastore uses Bigtable for data storage and an improved Paxos algorithm to synchronously replicate transaction logs across data centers, achieving high availability even in the case of data center failures.
Lessons Learned Replatforming A Large Machine Learning Application To Apache ... (Databricks)
Morningstar’s Risk Model project is created by stitching together statistical and machine learning models to produce risk and performance metrics for millions of financial securities. Previously, we were running a single version of this application, but needed to expand it to allow for customizations based on client demand. With the goal of running hundreds of custom Risk Model runs at once at an output size of around 1TB of data each, we had a challenging technical problem on our hands! In this presentation, we’ll talk about the challenges we faced replatforming this application to Spark, how we solved them, and the benefits we saw.
Some things we’ll touch on include how we created customized models, the architecture of our machine learning application, how we maintain an audit trail of data transformations (for rigorous third party audits), and how we validate the input data our model takes in and output data our model produces. We want the attendees to walk away with some key ideas of what worked for us when productizing a large scale machine learning platform.
This document discusses building a social analytics tool using MongoDB from a developer's perspective. It covers using MongoDB for its schema-less data and ability to handle fast read-write operations. Key topics include using aggregation queries to gain insights from data by chaining queries together and filtering/manipulating results at each stage. JavaScript capabilities in MongoDB allow applying business logic directly to data. Examples demonstrate removing garbage data and stopwords. Indexes, current progress, and tips/tricks learned around cloning collections and removing vs dropping are also covered, with a demo planned.
Tips, Tricks & Best Practices for large scale HDInsight Deployments (Ashish Thapliyal)
The document discusses HDInsight cluster architecture and configuration. It describes how HDInsight clusters connect to Azure data stores like Azure Blob Storage and Azure Data Lake Store. It also discusses using Azure Data Factory for HDInsight orchestration and monitoring an HDInsight cluster.
MongoDB: How We Did It – Reanimating Identity at AOL (MongoDB)
AOL experienced explosive growth and needed a new database that was both flexible and easy to deploy with little effort. They chose MongoDB. Due to the complexity of internal systems and the data, most of the migration process was spent building a new identity platform and adapters for legacy apps to talk to MongoDB. Systems were migrated in 4 phases to ensure that users were not impacted during the switch. Turning on dual reads/writes to both legacy databases and MongoDB also helped get production traffic into MongoDB during the process. Ultimately, the project was successful with the help of MongoDB support. Today, the team has 15 shards, with 60-70 GB per shard.
Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, and fault-tolerant database. It originated at Facebook in 2007 to solve their inbox search problem. Some key companies using Cassandra include Twitter, Facebook, Digg, and Rackspace. Cassandra's data model is based on Google's Bigtable and its distribution design is based on Amazon's Dynamo.
Watch video: https://www.youtube.com/watch?v=SgmmoRCmIa4&list=PLIuWze7quVLDSxJKDj3pRSqvmHAzQ_9vd&index=6
Here is the summary of what you'll learn:
00:02:00 Welcome
00:03:32 Meet Chafik, CEO of Brainboard.co
00:05:00 Our goal at Brainboard
00:06:00 Terraform modules definition
00:20:00 Build your own modules
00:21:00 Azure
00:48:00 AWS
00:52:00 Best practices
00:56:00 Review some of the most used community modules
00:56:43 Lambda
01:00:30 AKS
01:04:00 Where to host your modules?
01:06:04 Challenges of maintaining modules within a team
01:09:00 Build your own modules’ catalog
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre... (InfluxData)
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
6 Ways to Solve Your Oracle Dev-Test Problems Using All-Flash Storage and Cop... (Catalogic Software)
By combining all-flash storage with copy data management, you can provision timely, space-efficient, masked Oracle copies both easily and automatically.
Real-Time Inverted Search NYC ASLUG Oct 2014 (Bryan Bende)
Building real-time notification systems is often limited to basic filtering and pattern matching against incoming records. Allowing users to query incoming documents using Solr’s full range of capabilities is much more powerful. In our environment we needed a way to allow for tens of thousands of such query subscriptions, meaning we needed to find a way to distribute the query processing in the cloud. By creating in-memory Lucene indices from our Solr configuration, we were able to parallelize our queries across our cluster. To achieve this distribution, we wrapped the processing in a Storm topology to provide a flexible way to scale and manage our infrastructure. This presentation will describe our experiences creating this distributed, real-time inverted search notification framework.
The document discusses various techniques for optimizing and scaling MongoDB deployments. It covers topics like schema design, indexing, monitoring workload, vertical scaling using resources like RAM and SSDs, and horizontal scaling using sharding. The key recommendations are to optimize the schema and indexes first before scaling, understand the workload, and ensure proper indexing when using sharding for horizontal scaling.
This document provides an introduction to SolrCloud, which enables horizontal scaling of a Solr search index using sharding and replication. Key terminology is defined, including ZooKeeper, nodes, collections, shards, replicas, and leaders. The document outlines the high-level SolrCloud architecture and discusses features like sharding, document routing, replication, distributed indexing and querying. Challenges around consistency and availability are also covered.
This document discusses building distributed search applications using Apache Solr. It provides an agenda that covers topics such as Solr architecture, schema configuration, indexing data, querying, SolrCloud, and performance factors. It also references a demo app that will be used for hands-on examples during the presentation.
Similar to Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Analytics
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Building Production Ready Search Pipelines with Spark and Milvus (Zilliz)
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to fix common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some patterns that can lead to unnecessary expenses, for example when a person document is used instead of a mail-in for shared mailboxes. We will show you such cases and their solutions. And of course we will explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It will give you the tools and the know-how to stay on top of things. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence (IndexBug)
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Analytics
1. Scaling Solr 4 to Power Big Search in Social Media Analytics
Timothy Potter
Architect, Big Data Analytics, Dachis Group / Co-author, Solr in Action
2. Audience poll
• Anyone running SolrCloud in production today?
• Who is running a pre-Solr 4 version in production?
• Who has fired up Solr 4.x in SolrCloud mode?
• Personal interest – who has purchased Solr in Action in MEAP?
3. Goals of this talk
• Gain insights into the key design decisions you need to make when using SolrCloud
• Wish I knew back then ...
• Solr 4 feature overview in context
  • Zookeeper
  • Distributed indexing
  • Distributed search
  • Real-time GET
  • Atomic updates
• A day in the life ...
  • Day-to-day operations
  • What happens if you lose a node?
4. About Dachis Group
Our business intelligence platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes.
6. Solution Highlights
• In production on 4.2.0
• 18 shards ~ 33M docs / shard, 25GB on disk per shard
• Multiple collections
  • ~620 million docs in main collection (still growing)
  • ~100 million docs in 30-day collection
• Inherent parent / child relationships (tweets and re-tweets)
• ~5M atomic updates to existing docs per day
• Batch-oriented updates
  • Docs come in bursts from Hadoop; 8,000 docs/sec
  • 3-4M new documents per day (deletes too)
• Business intelligence UI, low(ish) query volume
7. Pillars of my ideal search solution
• Scalability
  • Scale-out: sharding and replication
  • A little scale-up too: fast disks (SSD), lots of RAM!
• High-availability
  • Redundancy: multiple replicas per shard
  • Automated fail-over: automated leader election
• Consistency
  • Distributed queries must return consistent results
  • Accepted writes must be on durable storage
• Simplicity (work in progress)
  • Self-healing; easy to set up, maintain, and troubleshoot
• Elasticity (work in progress)
  • Add more replicas per shard at any time
  • Split large shards into two smaller ones
8. Nuts and Bolts
Nice tag cloud, wordle.net!
9. Zookeeper in a nutshell
1. Zookeeper needs at least 3 nodes to establish quorum with fault tolerance. Embedded mode is only for evaluation purposes; you need to deploy a stand-alone ensemble for production.
2. Every Solr core creates ephemeral "znodes" in Zookeeper, which automatically disappear if the Solr process crashes.
3. Zookeeper pushes notifications to all registered "watchers" when a znode changes; Solr caches cluster state.
4. Zookeeper provides "recipes" for solving common problems faced when building distributed systems, e.g. leader election.
5. Zookeeper provides centralized configuration distribution, leader election, and cluster state notifications.
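For concreteness, a minimal stand-alone three-node ensemble configuration (zoo.cfg) might look like the sketch below; the hostnames zk1-zk3 and the dataDir path are assumptions, not from the talk, and each server additionally needs a myid file in its dataDir matching its server.N entry:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

Solr nodes are then pointed at the ensemble via the zkHost connect string (e.g. -DzkHost=zk1:2181,zk2:2181,zk3:2181).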
10. Number of shards?
• Number and size of indexed fields
• Number of documents
• Update frequency
• Query complexity
• Expected growth
• Budget
Yay for shard splitting in 4.3 (SOLR-3755)!
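As a hedged example of what shard splitting looks like once you are on 4.3+, the Collections API exposes a SPLITSHARD action (the host, collection, and shard names below are illustrative):

http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1

The parent shard continues to serve requests while the two sub-shards are built, so a split is designed not to require downtime.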
11. Index Memory Management
We use Uwe Schindler's advice on 64-bit Linux:
<directoryFactory name="DirectoryFactory"
    class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
See: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
java -Xmx4g ...
(hint: the rest of our RAM goes to the OS so it can load the index via memory-mapped I/O)
Small cache sizes with aggressive eviction – spread the GC penalty out over time vs. all at once every time you open a new searcher:
<filterCache class="solr.LFUCache" size="50"
    initialSize="50" autowarmCount="25"/>
12. Leader = Replica + Addl' Work
• Not a master
• Leader is a replica (handles queries)
• Accepts update requests for the shard
• Increments the _version_ on the new or updated doc
• Sends updates (in parallel) to all replicas
13. Distributed Indexing
[Diagram: view of cluster state from Zk – 2 shards with 1 replica each, spread across Node 1 and Node 2. The CloudSolrServer "smart client" (1) gets the URLs of the current leaders from Zookeeper and (2) hashes on the docID to pick a shard; (3) the doc goes to that shard's leader, which sets the _version_ and writes to its tlog, then (4, 5) forwards the update to the shard's replicas, which write to their own tlogs.]
Don't let your tlogs get too big – use "hard" commits with openSearcher="false":
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
8,000 docs / sec to 18 shards
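A minimal SolrJ 4.x indexing sketch using CloudSolrServer (the Zookeeper connect string, collection name, and field names are illustrative assumptions):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexExample {
  public static void main(String[] args) throws Exception {
    // The "smart client" discovers shard leaders from Zookeeper
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "tweet-12345");          // hashed to pick the target shard
    doc.addField("text_t", "hello SolrCloud");  // illustrative field

    server.add(doc); // routed to the shard leader, which forwards to replicas
    // No explicit commit here: rely on autoCommit (openSearcher=false) plus a
    // soft-commit/commit policy that matches your search-latency needs
    server.shutdown();
  }
}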
14. Distributed search
Send a query request to any node. It is a two-stage process:
1. The query controller sends the query to all shards and merges the results. One host per shard must be online or queries fail.
2. The query controller sends a 2nd query to all shards that have documents in the merged result set, to get the requested fields.
Solr client applications built for 3.x do not need to change (our query code still uses SolrJ 3.6).
Limitations: JOINs / grouping need custom hashing.
[Diagram: view of cluster state from Zk – 2 shards with 1 replica each across Node 1 and Node 2. The CloudSolrServer client (1) gets the URLs of all live nodes from Zookeeper and (2) sends q=*:* to one of them, the "query controller" (or just a load balancer works too); the controller (3) queries all shards and merges the results, then (4) gets the fields for the merged hits.]
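A hedged SolrJ sketch of issuing a distributed query through CloudSolrServer (same assumed Zookeeper ensemble and collection as in the indexing sketch above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryExample {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    SolrQuery q = new SolrQuery("*:*"); // fans out to all shards by default
    q.setRows(10);

    QueryResponse rsp = server.query(q);
    System.out.println("numFound: " + rsp.getResults().getNumFound());
    server.shutdown();
  }
}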
15. Search by daily activity volume
Drive analysis that measures the impact of a social message over time ... a company posts a tweet on Monday; how much activity is there around that message on Thursday?
16. Atomic updates and real-time get
Problem: Find all documents that had activity on a specific day
• tweets that had retweets, or YouTube videos that had comments
• Use Solr join support to find parent documents by matching on child criteria:
fq=_val_:"{!join from=echo_grouping_id_s to=id}day_tdt:[2013-05-01T00:00:00Z TO 2013-05-02T00:00:00Z}" ...
... but joins don't work in distributed queries, and this is probably too slow anyway.
Solution: Index daily activity into multi-valued fields. Use real-time GET to look a document up by ID and read its current daily volume fields:
fq=daily_volume_tdtm('2013-05-02')
sort=daily_vol(daily_volume_s,'2013-04-01','2013-05-01')+desc
daily_volume_tdtm: [2013-05-01, 2013-05-02] <= doc has child signals on May 1 and 2
daily_volume_ssm: 2013-05-01|99, 2013-05-02|88 <= stored-only field; doc had 99 child signals on May 1, 88 on May 2
daily_volume_s: 13050288|13050199 <= flattened multi-valued field for sorting, using a custom ValueSource
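A sketch of this read-merge-write cycle in SolrJ 4.x (the /get handler and the "set" atomic-update modifier are standard Solr 4 features; the IDs, values, and the naive append are illustrative, and real code would de-duplicate by day since the daily volume is recomputed multiple times per day):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class DailyVolumeUpdate {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    // 1) Real-time GET: read the latest copy of the doc, committed or not
    SolrQuery rtg = new SolrQuery();
    rtg.setRequestHandler("/get");
    rtg.set("id", "tweet-12345");
    SolrDocument doc = (SolrDocument) server.query(rtg).getResponse().get("doc");

    // 2) Merge today's value into the existing multi-valued daily-volume field
    Collection<Object> volumes = new ArrayList<Object>();
    if (doc.getFieldValues("daily_volume_ssm") != null) {
      volumes.addAll(doc.getFieldValues("daily_volume_ssm"));
    }
    volumes.add("2013-05-02|88");

    // 3) Atomic update: "set" rewrites just this field (all fields must be stored)
    SolrInputDocument update = new SolrInputDocument();
    update.addField("id", "tweet-12345");
    update.addField("daily_volume_ssm", Collections.singletonMap("set", volumes));
    server.add(update);
    server.shutdown();
  }
}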
17. Lessons learned
• Will it work? Definitely!
• Search can be addictive for your organization; the queries we tested 6 months ago vs. what we have today are vastly different.
• Buy RAM – OOMs and aggressive garbage collection cause many issues.
• Give the RAM from ^ to the OS – MMapDirectory.
• You need a disaster recovery process in addition to SolrCloud replication; it helps with migrating to new hardware too.
• Use Jetty ;-)
• Store all fields! Atomic updates are a life saver.
18. Lessons learned cont.
• Your schema will evolve – we thought we understood our data model, but have since added at least 10 new fields and deprecated some too.
• Partition if you can! e.g. our 30-day collection.
• We don't optimize – segment merging works great.
• Size your staging environment so that shards have about as many docs and the same resources as prod. I have many more nodes in prod, but my staging servers have roughly the same number of docs per shard, just fewer shards.
• Don't be afraid to customize Solr! It's designed to be customized with plug-ins.
  • ValueSource is very powerful
  • Check out PostFilters: {!frange l=1 u=1 cost=200 cache=false}imca(53313,employee)
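To illustrate the PostFilter plug-in point, here is a skeletal sketch (it uses Solr's actual plugin interfaces, but the class name and matches() logic are hypothetical placeholders, not the imca function from the slide). A post filter runs after the main query and cheaper filters, which is why the local params above carry a high cost and cache=false:

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class ExamplePostFilter extends ExtendedQueryBase implements PostFilter {

  @Override
  public boolean getCache() {
    return false; // post filters must not be cached
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        if (matches(doc)) {
          super.collect(doc); // only matching docs continue down the chain
        }
      }
    };
  }

  // Hypothetical per-document check; expensive, non-index criteria
  // (e.g. an external lookup) would live here
  private boolean matches(int doc) {
    return true;
  }
}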
19. Minimum DevOps Reqts
• Backups
  .../replication?command=backup&location=/mnt/backups
• Monitoring
  • Are replicas serving queries?
  • Do all replicas report the same number of docs?
  • Zookeeper health
  • New searcher warm-up time
• Configuration update process
  Our solrconfig.xml changes frequently – see Solr's zkCli.sh
• Upgrade Solr process (it's moving fast right now)
• Recover failed replica process
• Add new replica
• Kill the JVM on OOM (from Mark Miller)
  -XX:OnOutOfMemoryError=/home/solr/on_oom.sh
  -XX:+HeapDumpOnOutOfMemoryError
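One way to script the "do all replicas report the same number of docs?" check is sketched below (the replica core URLs are assumptions; distrib=false keeps each count local to that core instead of fanning out across the collection):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ReplicaCountCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical replica core URLs for one shard
    String[] replicas = {
      "http://node1:8983/solr/collection1_shard1_replica1",
      "http://node2:8983/solr/collection1_shard1_replica2"
    };
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.set("distrib", "false"); // count only this core, not the whole collection
    for (String url : replicas) {
      HttpSolrServer server = new HttpSolrServer(url);
      long numFound = server.query(q).getResults().getNumFound();
      System.out.println(url + " => " + numFound);
      server.shutdown();
    }
  }
}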
20. Node recovery
• Nodes will crash! (ephemeral znodes)
• Or sometimes you just need to restart a JVM (rolling restarts to upgrade)
• Peer sync via the update log (tlog) if the replica missed no more than 100 updates; else ...
• Good ol' Solr replication from leader to replica
21. Roadmap / Futures
• Moving to a near real-time streaming model using Storm
• Buying more RAM per node
• Looking forward to shard splitting, as it has become difficult to re-index 600M docs
• Re-building the index with DocValues
• We've had shards get out of sync after a major failure – resolved it by going back to the raw data, doing a key-by-key comparison of what we expected to be in the index, and re-indexing any missing docs
• Custom hashing to put all docs for a specific brand in the same shard
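As a sketch of what that custom hashing could look like with the compositeId router (available since Solr 4.1; the brand prefix and method here are illustrative, not the deck's implementation): everything before the '!' in the document ID is hashed to choose the shard, so all of a brand's documents co-locate, which is what distributed joins and grouping need.

import org.apache.solr.common.SolrInputDocument;

public class CompositeIdExample {
  public static SolrInputDocument brandDoc(String brandId, String docId) {
    // e.g. "brand42!tweet-98765": the "brand42!" prefix drives shard routing
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", brandId + "!" + docId);
    return doc;
  }
}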
22. Obligatory lolcats slide
If you find yourself in this situation, buy more RAM!
Discuss SolrCloud concepts in the context of a real-world application. Feel free to contact me with additional questions (or better yet, post to the Solr user mailing list).
Solr is a fundamental part of our infrastructure
Not petabyte scale, but we do deep analytics on brand-related data harvested from social networks. Screen from our Advocate reporting interface, which uses Solr to compute analytics and find signals from brand advocates.
We use atomic updates, real-time GET, custom ValueSources, PostFilters, and CloudSolrServer for high-throughput indexing from Hadoop. We upgraded from Solr 4.1 to 4.2 in production without any downtime: we did a rolling restart, made sure at least one host per shard was online at all times, and did this in between index updates. Technical details: use MMapDirectory, which keeps our JVM heaps small(ish); use a small filterCache.
Solr requires both scaling out and scaling up – you must have fast disks, CPU, and lots of RAM. It is not quite there on simplicity and elasticity, but it has come a very long way in short order.
There is some new terminology, and some old concepts, like master/slave, no longer apply.
Zk gives the impression that SolrCloud is somehow complex because it uses Zookeeper. In practice it is a low-overhead technology once it is set up – we've had 100% uptime on Zookeeper (knock on wood).
We tended to overshard to allow for growth, but that can be expensive. Currently you must set the number of shards for a collection when bootstrapping the cluster; this will be less of a problem with Solr 4.3 and beyond with shard splitting. Even if you have small documents with only a few fields, if you have 100's of millions of these documents, you can benefit from sharding. Think about this in terms of sorting. Imagine a query that matches 10M documents with a custom sort criterion to return documents sorted by date in descending order. Solr has to sort all 10M matching hits just to produce a page size of 10, and the sort alone will require roughly 80 megabytes of memory. However, if your 100M docs are split across 10 shards, then each shard is sorting roughly 1M docs in parallel. There is a little overhead in that Solr needs to re-sort the top 10 hits per shard (100 total), but it should be obvious that sorting 1M documents on 10 nodes in parallel will always be faster than sorting 10M documents on a single node.
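The 80 MB estimate is just the matching-hit count times the size of one sort value, assuming an 8-byte value (e.g. a long-encoded date) per document:

10,000,000 docs x 8 bytes/doc = 80,000,000 bytes, or roughly 80 MB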
Remember – all objects in your caches must be invalidated when you open a new searcher; using overly large caches and a huge max heap can lead to high GC overhead.
The smart client knows the current leaders by asking Zk, but doesn't know which leader to assign the doc to (that is planned, though). The node accepting the new document computes a hash on the document ID to determine which shard to assign the doc to. The new document is sent to the appropriate shard leader, which sets the _version_ and indexes it. The leader saves the update request to durable storage in the update log (tlog), then sends the update to all active replicas in parallel. Commits are sent around the cluster. An interesting question is how batches of updates from the client are handled.
In our environment, the query controller selects 18 of 36 possible nodes to query. Warming queries should not be distributed; distrib=false gets set automatically for them.
Quick case study about real-time get and atomic updates.
We update millions of signals every day using atomic updates; all fields must be stored. So if on May 2 we want to update the daily count value, we get the document by ID, merge the updated value into the existing list, and then re-index just the updated fields plus the _version_ field. Optimistic locking allows our update code to run whenever, with less coordination, since we update other fields on docs from different workflows at different times. Atomic updates use the special _version_ field to support optimistic locking. We merge the updated daily value into an existing multi-valued field; we can't just append, because we compute the daily volume multiple times per day. The document doesn't have to be committed. The _version_ semantics are:
- _version_ > 1: versions must match or the update fails
- _version_ = 1: the document must exist
- _version_ < 0: the document must not exist
- _version_ = 0: no concurrency control desired; any existing value is overwritten
Work closely with the UI team to understand query needs and design them together; this helps avoid creating inefficient queries and helps identify criteria that should be included in warming queries. There is a lot of power in queries, and a lot of potential to cause problems at scale.
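A hedged SolrJ sketch of this optimistic locking (the _version_ semantics above are Solr's; the IDs, field values, and absence of a retry loop are illustrative). If another writer updated the doc first, the add fails with a version-conflict error and the caller can re-read and retry:

import java.util.Collections;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class OptimisticLockingExample {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    // Read the current doc (and its _version_) via real-time GET
    SolrQuery rtg = new SolrQuery();
    rtg.setRequestHandler("/get");
    rtg.set("id", "tweet-12345");
    SolrDocument current = (SolrDocument) server.query(rtg).getResponse().get("doc");
    Long version = (Long) current.getFieldValue("_version_");

    // Write back with the same _version_: fails if someone else updated first
    SolrInputDocument update = new SolrInputDocument();
    update.addField("id", "tweet-12345");
    update.addField("_version_", version);
    update.addField("daily_volume_ssm", Collections.singletonMap("set", "2013-05-02|89"));
    server.add(update); // throws on version conflict
    server.shutdown();
  }
}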
Hmmm ... anyone know a sys admin that dresses like that?
We upgraded from 4.1 to 4.2 without any downtime or manual intervention – all automated