NATC 2013 - Setup a scalable and distributed search platform in minutes by Srivatsa Katta, ThoughtWorks
Published in: Technology, Education
  • The problem:
    • Scale: how do we scale when throughput exceeds expectations or resources are exhausted?
    • Cluster management: managing a cluster is always a complex task.
    • Master or slave: what happens when the master goes down, and what happens when a slave becomes master and the old master then comes back up?
    • Backups and restore: when and how do we take backups in a distributed setup?
    • Business impact: what does it cost the business when the system is down for planned or unplanned maintenance?
    • Configuration mess: most distributed systems are messy to configure as a cluster, with too many settings to manage.
  • Elasticsearch is one of the distributed systems that addresses all of the problems from the previous slide, and much more. It is a fast, scalable, distributed search engine based on Lucene. An Elasticsearch node is just a Java process and can run anywhere; nothing prevents you from running two nodes on the same machine.
  • Definitions, by analogy with the database world:
    • An index in Elasticsearch is equivalent to a database.
    • A type is equivalent to a table.
    • A document is equivalent to a row in a table.
  • Indexing:
    • Indexing a document: a document is any arbitrary JSON string, as long as it is valid. In this example, library is the index name and books is the type under which the document is stored. An index can hold as many types as required. Each type defines the structure of its documents, called a mapping, which carries information about the fields: their type, how to analyze them, the format to parse, and so on.
    • Auto field type detection: ISBN is a string type and price is a double type. When this document is indexed, ES detects the field types automatically and creates a mapping for the type if none exists already.
    • Mapping: the mapping can be queried over HTTP using _mapping under a given index and type. You can see that ES has identified the field types automatically from the data. This can be overridden by creating the mapping upfront; each type has its own mapping covering every field. When a new field arrives as part of a new document, ES automatically identifies it and updates the type mapping with the new field. A mapping also contains details such as analyzer settings for fields, storage configuration, parsing formats and much more.
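To make the auto field type detection above concrete, here is a minimal sketch in Python. It is a toy model only, not Elasticsearch's actual detection logic: the function names `detect_field_type` and `infer_mapping` are made up for illustration, and the type names simply follow the ones mentioned in the notes (string, double).

```python
import json


def detect_field_type(value):
    """Toy stand-in for dynamic mapping: guess a field type from a JSON value."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # Assumption for this sketch: an array takes the type of its first element.
        return detect_field_type(value[0]) if value else "unknown"
    if isinstance(value, dict):
        return "object"
    return "unknown"


def infer_mapping(document_json):
    """Build a mapping-like dict of field name -> detected type for one document."""
    document = json.loads(document_json)
    return {field: {"type": detect_field_type(value)} for field, value in document.items()}


# A trimmed version of the book document from the indexing slide.
book = '{"isbn": "0316206849", "tags": ["mystery", "thriller"], "price": 1136.00}'
mapping = infer_mapping(book)
```

Here `mapping` records isbn as a string and price as a double, which mirrors what the notes say ES detects for this document.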
  • This shows what a simple search query looks like. The query system in ES is very extensive: it has filter queries, boolean queries, faceted searches, term queries and almost everything else that Lucene supports. You can specify which fields to retrieve when there is a match, which fields to search the given query in, and so on.
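The search request from the slides can also be assembled programmatically. The sketch below builds the same request with Python's standard library and deliberately does not send it, so it runs without a cluster. The helper name `build_search_request` and the default host are assumptions for illustration; the body shape (`query_string` plus `fields`) mirrors the slide and may differ in other Elasticsearch versions.

```python
import json
import urllib.request


def build_search_request(index, doc_type, text, return_fields, host="http://localhost:9200"):
    """Build (but do not send) a search request shaped like the one on the slide."""
    body = {
        "query": {"query_string": {"query": text}},
        # Only these fields are returned for each matching document.
        "fields": return_fields,
    }
    return urllib.request.Request(
        url=f"{host}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("library", "books", "cuckoos", ["isbn"])
```

Passing `request` to `urllib.request.urlopen` would send it, assuming a node is listening on the given host.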
  • The hits array has an entry for every document that matches the given query. Each entry also carries a score, which signifies how closely that document matches compared to other documents.
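As a rough illustration of reading the hits array, the sketch below parses a trimmed copy of the search result from the slides (the parts elided on the slide are simply left out) and lists the hits in descending score order. The helper name `summarize_hits` is hypothetical.

```python
import json

# Trimmed copy of the search result slide; elided sections are omitted.
sample_response = json.loads("""
{
  "took": 2,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 0.09965338,
    "hits": [
      {"_id": "ne06dmUWRByZLx1g6jH1rg", "_score": 0.09965338,
       "fields": {"isbn": "0316206849"}}
    ]
  }
}
""")


def summarize_hits(response):
    """Return (id, score, fields) for each hit, best score first."""
    hits = response["hits"]["hits"]
    ranked = sorted(hits, key=lambda hit: hit["_score"], reverse=True)
    return [(hit["_id"], hit["_score"], hit.get("fields", {})) for hit in ranked]


summary = summarize_hits(sample_response)
```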
  • Nodes and shards:
    • Master nodes: master nodes are potential candidates for being elected master of a cluster. The cluster master holds the cluster state, handles the distribution of shards in the cluster, and in effect looks after the well-being of the cluster. When the cluster master goes down, the cluster automatically starts a new election and elects a new master from the nodes that have the master role.
    • Data nodes: data nodes hold the actual data in one or more shards, which are Lucene indexes. They are responsible for indexing and for executing search queries.
    • Client nodes: client nodes respond to the Elasticsearch REST interface. They route queries to the data nodes holding the relevant shards and aggregate the results from the individual shards. Client nodes can also act as load balancers, so no external load balancer is needed.
    • Discovery: nodes discover each other automatically via IP multicast or unicast to form an Elasticsearch cluster.
    • Shards (primary and replica): indexes can be split horizontally into shards. Elasticsearch then distributes the shards across different nodes automatically, and shards can be replicated (using replicas) to be resilient in case of failure.
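The idea of spreading primary and replica shards over nodes can be sketched with a simple round-robin placement. This is a toy model under stated assumptions, not Elasticsearch's allocation algorithm: the function `allocate_shards` is made up for illustration, and the only rule it enforces is that the copies of one shard land on different nodes.

```python
def allocate_shards(nodes, num_shards, num_replicas):
    """Place each primary and its replicas on distinct nodes, round-robin."""
    copies_per_shard = num_replicas + 1
    if copies_per_shard > len(nodes):
        # Assumption for this sketch: refuse layouts that would put two
        # copies of the same shard on one node.
        raise ValueError("need at least as many nodes as copies per shard")
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        for copy in range(copies_per_shard):
            # Shifting by the copy number keeps copies of one shard apart.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((f"S{shard + 1}", role))
    return layout


# Three nodes, three shards, two replicas each.
layout = allocate_shards(["Node-1", "Node-2", "Node-3"], num_shards=3, num_replicas=2)
```

With these inputs every node ends up holding a copy of S1, S2 and S3, so losing any single node still leaves a copy of every shard in the cluster.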
  • Although configuring a cluster and adding more nodes to it is seamless, there are a few settings to be wary of:
    • One setting is used as an identifier to auto-discover other nodes on the network.
    • discovery.zen.minimum_master_nodes avoids split brain.
    • bootstrap.mlockall controls memory swapping.
    • Misc: a plugin model (river plugins, site plugins, REST plugins); a large number of client SDKs (name a language and a client SDK is probably available already); and flexibility in overriding configuration at the node level and index level, with some settings updatable dynamically.
    • Community: very active, with frequent releases. The architecture and design keep improving. The move from field data to doc values, which are based on the file system rather than memory, reduces the heap footprint and leverages the file system cache for speed. Percolator API changes have made it more intuitive. Backup and restore in a distributed system is done via an HTTP API, instead of running rsync across all the nodes and identifying the primary shards.
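For discovery.zen.minimum_master_nodes, a commonly cited rule of thumb is a strict majority of the master-eligible nodes. The sketch below shows that arithmetic and why it guards against split brain; the function names are hypothetical, and the rule should be treated as a guideline rather than something stated in the slides.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Strict majority of master-eligible nodes: (N / 2) + 1, using integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes, master_eligible_nodes):
    """A partition may elect a master only if it can see a majority."""
    return visible_nodes >= minimum_master_nodes(master_eligible_nodes)


# A three-node cluster split into partitions of two nodes and one node.
quorum = minimum_master_nodes(3)
larger_side_can_elect = can_elect_master(2, 3)
smaller_side_can_elect = can_elect_master(1, 3)
```

Because two partitions cannot both hold a majority, at most one side can elect a master after a network split.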
  • Transcript

    • 1. Setup a Scalable and Distributed Search Platform in Minutes - Srivatsa Katta
    • 2. Agenda • The problem • Introducing elasticsearch concepts • Cluster architecture • That isn’t all of it..
    • 3. The problem: Scale, Cluster Mgmt, Master or Slave, Backups and Restore, Biz Impact, Configuration Mess
    • 4. Elasticsearch Concepts
    • 5. Definitions: Index, Table, Type, Document (example row: 2, John, Designer)
    • 6. Indexing: curl -XPOST http://localhost:9200/library/books -d '{ "isbn": "0316206849", "title": "The Cuckoos Calling", "description": "A brilliant debut mystery in a classic vein. Detective Cormoran Strike investigates a supermodels suicide.", "tags": [ "mystery", "thriller", "kindle" ], "category": "mystery", "price": 1136.00 }'
    • 7. Search: curl -XPOST localhost:9200/library/books/_search -d '{ "query": { "query_string": { "query": "cuckoos" } }, "fields": [ "isbn" ] }'
    • 8. Search Result { "took": 2, "timed_out": false, "_shards": { … }, "hits": { "total": 1, "max_score": 0.09965338, "hits": [ { … "_id": "ne06dmUWRByZLx1g6jH1rg", "_score": 0.09965338, "fields": { "isbn": "0316206849" } … }
    • 9. Cluster Architecture nodes, shards, discovery, fault tolerance …
    • 10. Nodes and Shards (diagram: Node-1 (Master), Node-2 and Node-3 each hold shards S1, S2 and S3, as primary or replica copies)
    • 11. Let's see it in action!
    • 12. That isn’t all of it.. Configurations (path.logs); Memory (bootstrap.mlockall); Split Brain (discovery.zen.minimum_master_nodes); gateway.recover_after_nodes; File handles
    • 13. Thank you !!