Scale – How do we scale when throughput exceeds what was expected, or when resources are exhausted?
Cluster Mgmt – Cluster management is always a complex task.
Master or Slave – What happens when the master goes down? What happens when a slave becomes master and the old master then comes back up?
Backups / Restore – When and how do we take backups in a distributed setup?
Biz Impact – What is the business impact when the system is down for planned or unplanned maintenance?
Configuration Mess – Most distributed systems are a mess to configure as a cluster; there are too many settings.
Elasticsearch is a distributed system that addresses the problems we saw in the previous slide, and more. It is a fast, scalable, distributed search engine based on Lucene. An Elasticsearch node is just a Java process and can run anywhere (nothing prevents you from running two nodes on the same machine).
An index in Elasticsearch is equivalent to a database in the DB world.
A type is equivalent to a table in the DB world.
A document is equivalent to a row in a table in the DB world.
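To make the analogy concrete, here is a minimal sketch of how the three terms compose into the REST path that addresses a single document. The helper name document_path and the sample names are illustrative assumptions, not part of any client library.

```python
from urllib.parse import quote


def document_path(index: str, doc_type: str, doc_id: str) -> str:
    """Build the REST path /<index>/<type>/<id> for a single document."""
    # Percent-encode each segment so an unusual id stays inside its own segment.
    parts = (index, doc_type, doc_id)
    return "/" + "/".join(quote(part, safe="") for part in parts)


# Hypothetical names: index "library", type "books", document id "1".
print(document_path("library", "books", "1"))
```

Read left to right, the path goes from "database" to "table" to "row".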
Indexing a document
A document is any arbitrary JSON string, as long as it is valid. In this example, library is the index name and books is the type under which the document is stored. An index can hold as many types as required. Each type defines the structure of its documents in a mapping, which carries information about the fields, such as their type, how they are analyzed and the format used to parse them.

Auto Field Type Detection
ISBN is a string type
Price is a double type
When this document is indexed, ES auto-detects the field types and creates a mapping for the given type if none exists yet.

Mapping
The mapping can be queried over HTTP using _mapping under a given index and type. You can see that ES has identified the field types automatically from the data. This can be overridden by creating the mapping up front; each type has its own mapping covering every field. When a new field arrives as part of a new document, ES automatically identifies it and updates the type mapping with the new field. The mapping also contains details such as analyzer settings for fields, storage configuration, parsing formats and much more.
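The auto-detection described above can be imitated in a few lines. This is only a simplified sketch of the idea, not Elasticsearch's own logic (the real detection also handles dates, arrays and nested objects). The function names and the sample document values are assumptions made up for illustration.

```python
import json


def detect_field_type(value):
    """Guess a mapping type for one JSON value (simplified rules)."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return None  # nulls, arrays and objects are left out of this sketch


def infer_mapping(doc_json):
    """Return {field: {"type": ...}} for the top-level fields of a JSON document."""
    doc = json.loads(doc_json)
    mapping = {}
    for field, value in doc.items():
        field_type = detect_field_type(value)
        if field_type is not None:
            mapping[field] = {"type": field_type}
    return mapping


# Hypothetical book document; the ISBN and price are made-up values.
book = '{"isbn": "978-0000000000", "title": "Sample Title", "price": 12.5}'
print(infer_mapping(book))
```

As in the slide, the ISBN comes out as a string and the price as a double.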
This shows what a simple search query looks like. The query system in ES is very extensive: it has filter queries, boolean queries, faceted searches, term queries and almost everything else that Lucene supports. You can specify which fields to retrieve when there is a match, which fields to search for the given query, and so on.
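As a rough sketch of such a request, the snippet below assembles a search body that names both the fields to search and the fields to return. It only builds the JSON and does not contact a cluster. The field names are hypothetical, and the parameter used to limit returned fields has varied between releases, so treat "_source" here as an assumption to check against your version.

```python
import json


def build_search_body(text, search_fields, return_fields):
    """Assemble a search request body as a plain dict."""
    return {
        "query": {
            "multi_match": {
                "query": text,                  # what to look for
                "fields": list(search_fields),  # which fields to search in
            }
        },
        "_source": list(return_fields),         # which fields to return per hit
    }


body = build_search_body(
    "distributed search",
    search_fields=["title", "description"],
    return_fields=["title", "price"],
)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a request to the _search endpoint of an index.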
The hits array has an entry for every document that matches the given query. Each entry also carries a score, which indicates how closely that document matches compared to the other documents.
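A small sketch of reading that array on the client side follows. The sample response is hand-written in the general shape of a search response, and all ids, scores and titles in it are made up. The helper name ranked_hits is an assumption.

```python
def ranked_hits(response):
    """Return (id, score, source) tuples from a search response, best match first."""
    hits = response.get("hits", {}).get("hits", [])
    rows = [(hit["_id"], hit["_score"], hit.get("_source", {})) for hit in hits]
    # Sort explicitly by score so the caller does not rely on server ordering.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hand-written sample; values are illustrative only.
sample_response = {
    "hits": {
        "total": 2,
        "max_score": 1.2,
        "hits": [
            {"_index": "library", "_type": "books", "_id": "7",
             "_score": 0.4, "_source": {"title": "B"}},
            {"_index": "library", "_type": "books", "_id": "3",
             "_score": 1.2, "_source": {"title": "A"}},
        ],
    }
}

for doc_id, score, source in ranked_hits(sample_response):
    print(doc_id, score, source)
```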
Nodes – Three kinds of nodes
Master: Master nodes are candidates for being elected master of a cluster. The cluster master holds the cluster state and handles the distribution of shards in the cluster; in effect, it takes care of the well-being of the cluster. When the cluster master goes down, the cluster automatically starts a new election and elects a new master from the nodes with the master role.
Data: Data nodes hold the actual data in one or more shards, which are actually Lucene indexes. They are responsible for indexing and for executing search queries.
Client: Client nodes respond to the Elasticsearch REST interface. They route queries to the data nodes holding the relevant shards and aggregate the results from the individual shards. Client nodes can also act as a load balancer, so no external load balancer is needed.
Nodes discover each other automatically via IP multicast or unicast to form an Elasticsearch cluster.

Shards – Primary and Replica Shards
Indexes can be split horizontally into shards. Elasticsearch automatically distributes the shards across nodes, and shards can be replicated (using replicas) to stay resilient in case of failure.
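To picture how primaries and replicas end up on different nodes, here is a toy round-robin placement. It is not Elasticsearch's allocation algorithm, only an illustration of the one property that matters for resilience: a replica never shares a node with its own primary. The node names and the function place_shards are assumptions.

```python
def place_shards(nodes, num_primaries, num_replicas):
    """Toy round-robin placement: returns {node: [(shard_number, role), ...]}."""
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep copies apart")
    layout = {node: [] for node in nodes}
    for shard in range(num_primaries):
        # Copy 0 is the primary; copies 1..num_replicas are its replicas.
        # Offsets stay below len(nodes), so each copy lands on a distinct node.
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


layout = place_shards(["node-1", "node-2", "node-3"], num_primaries=3, num_replicas=1)
for node, shards in layout.items():
    print(node, shards)
```

With this layout, losing any single node still leaves one copy of every shard on the remaining nodes.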
Although it is seamless to configure a cluster and add more nodes to it, there are a few settings to be wary of in production:
cluster.name – Used as an identifier to auto-discover other nodes on the network.
discovery.zen.minimum_master_nodes – Avoids split brain.
bootstrap.mlockall – Locks the process memory so that it is not swapped out.

Misc
Plugin model (river plugins, site plugins, REST plugins)
Number of client SDKs (name a language and a client SDK is probably available already)
Flexibility to override configuration at node level and index level; some settings can be updated dynamically

Community
Very active, with frequent releases
Always improving the architecture and design: the move from field data to doc values (which are based on the file system rather than memory) reduces the heap memory footprint and leverages the file system cache for speed.
Percolator API changes made it more intuitive.
Backup and restore in a distributed system is done via an HTTP API, instead of running rsync across all the nodes and identifying the primary shards.
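For the discovery.zen.minimum_master_nodes setting mentioned above, the usual guidance is to require a strict majority of the master-eligible nodes, so that two halves of a partitioned cluster cannot both elect a master. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible_nodes):
    """Strict majority (quorum) of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for count in (1, 2, 3, 5):
    print(count, "master-eligible nodes ->", minimum_master_nodes(count))
```

Note that with only two master-eligible nodes the quorum is also two, so losing either node leaves the cluster without a master; three is the smallest count that tolerates one failure.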
NATC 2013 - Setup a scalable and distributed search platform in minutes by Srivatsa Katta, ThoughtWorks
Setup a Scalable and Distributed
Search Platform in Minutes
- Srivatsa Katta
• The problem
• Introducing elasticsearch concepts
• Cluster architecture
• That isn’t all of it..
Master or Slave
Backups and Restore