ElasticSearch 7
Presented By
Anurag
ES 1.1 - Introduction to ELK Stack
ElasticSearch
● Elasticsearch is a search engine based on the Apache Lucene library.
● Open-code business model
● REST-based
● Distributed
● One of the most popular enterprise search engines
● Used by Netflix, LinkedIn, Amazon, Oracle, and many other large companies
Elastic (ELK) Stack
Beats are lightweight data shippers, written in Go, that run on your servers to capture all sorts of operational data (logs, metrics, or network packet data). Beats send the operational data to Elasticsearch, either directly or via Logstash.
Logstash is a server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to your favorite "stash."
Kibana is a browser-based analytics and search dashboard for Elasticsearch.
Elasticsearch is the distributed, RESTful search engine at the center of the stack.
How Do Elasticsearch and Lucene Differ?
In the same way a car (ES) differs from its engine (Lucene): ES uses Lucene to manage its indices.
Lucene is a Java library. You can include it in your project and use its features through ordinary function calls.
Elasticsearch is a JSON-based, distributed web server built on top of Lucene. Although Lucene does the actual work underneath, Elasticsearch provides a convenient layer over it. Each shard created in Elasticsearch is a separate Lucene instance. To summarize:
1. Elasticsearch is built on Lucene and provides a JSON-based REST API for accessing Lucene features.
2. Elasticsearch provides a distributed system on top of Lucene. Lucene is neither aware of nor built for distribution; Elasticsearch provides that abstraction.
3. Elasticsearch provides other supporting features such as thread pools, queues, node/cluster monitoring APIs, data monitoring APIs, cluster management, etc.
ES 1.2 - Document Ranking
Indexing
● Elasticsearch achieves low-latency responses because it searches an index instead of scanning the text directly.
● Document: the basic unit of data in ES
● Inverted index (similar to the index at the back of a book)
○ Tokenize the terms in each document
○ Create a sorted list of all unique terms (terms are normalized, stemmed, etc.)
○ Associate each term with the list of documents in which it can be found
Doc1: I am learning the cool stuff
Doc2: I am learning to learn
Inverted Index:
am -> [Doc1, Doc2]
cool -> [Doc1]
i -> [Doc1, Doc2]
learn -> [Doc1, Doc2] // root form of "learning"
the -> [Doc1]
…
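The indexing steps above can be sketched in a few lines of Python. This is only a toy model: the `normalize` function and its suffix-stripping "stemmer" are assumptions made for illustration and are not how Lucene analyzers work internally.

```python
from collections import defaultdict


def normalize(token: str) -> str:
    """Lowercase a token and strip a trailing "ing" (a toy stand-in for stemming)."""
    token = token.lower()
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
    return token


def build_inverted_index(docs: dict) -> dict:
    """Map every normalized term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            postings[normalize(token)].add(doc_id)
    # Sort terms and posting lists so the index is deterministic.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    "Doc1": "I am learning the cool stuff",
    "Doc2": "I am learning to learn",
}
inverted_index = build_inverted_index(docs)

for term, doc_ids in inverted_index.items():
    print(f"{term} -> {doc_ids}")
```

A search for a term then becomes a dictionary lookup followed by a walk over a short list of document IDs, rather than a scan of every document.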
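Retrieving
● Term Frequency (TF)
○ Number of times the term occurs in a given document
● Document Frequency (DF)
○ Number of times the term occurs across all documents (a simplification; DF is usually defined as the number of documents that contain the term)
● IDF (Inverse Document Frequency)
○ IDF = 1 / DF
● Relevance
○ Relevance = TF * IDF = TF / DF
Search term: learn (after stemming, "learning" counts as "learn")
TF1 = 1 (Doc1)
TF2 = 2 (Doc2)
DF = TF1 + TF2 = 3
IDF = 1/3
Rev1 = TF1 * IDF = 1/3
Rev2 = TF2 * IDF = 2/3
Rev2 > Rev1, so Doc2 ranks above Doc1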
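The arithmetic on this slide can be reproduced with a short Python sketch. It follows the slide's simplified definitions (DF counted as total occurrences, IDF = 1 / DF), not the scoring formula Elasticsearch actually ships with; the function names and the hand-stemmed token lists are assumptions for illustration.

```python
from fractions import Fraction

# Tokens after normalization and stemming ("learning" -> "learn"),
# written out by hand so the arithmetic is easy to follow.
docs = {
    "Doc1": ["i", "am", "learn", "the", "cool", "stuff"],
    "Doc2": ["i", "am", "learn", "to", "learn"],
}


def term_frequency(term, tokens):
    """TF: how often the term occurs in one document."""
    return tokens.count(term)


def document_frequency(term, docs):
    """DF as defined on the slide: total occurrences across all documents."""
    return sum(tokens.count(term) for tokens in docs.values())


def relevance(term, doc_id, docs):
    """Relevance = TF * IDF = TF / DF, kept as an exact fraction."""
    df = document_frequency(term, docs)
    if df == 0:
        return Fraction(0)
    return Fraction(term_frequency(term, docs[doc_id]), df)


scores = {doc_id: relevance("learn", doc_id, docs) for doc_id in docs}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```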
ES 1.3 ES Cluster
Node Structure
● Index - a logical namespace for a collection of documents
● Shard - a horizontal partition of an index
○ E.g., documents 1-10 in one shard, 11-20 in another, and so on.
○ In Elasticsearch, each shard is a self-contained Lucene index.
Cluster Structure
Node 1: P1, R4
Node 2: P2, R1
Node 3: P3, R2
Node 4: P4, R3
● A cluster of 4 nodes
● Each node holds 2 shards: one primary (P) and one replica (R) of a different shard
● For robustness and fault tolerance, each shard is replicated
● Even if a node goes down and a primary shard is lost, a replica can be promoted to primary until recovery
● The number of primary shards is fixed when an index is created; the number of replica shards can be changed afterwards
● Writes go to the primary shard first and are then repeated on its replicas; reads can be served by either
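A small simulation may help make the failover idea concrete. The sketch below is a toy model with hypothetical helper names (`build_layout`, `fail_node`): it places each primary on its own node and the matching replica on the next node, then promotes replicas when a node is lost. Real shard allocation in Elasticsearch is decided by the master node and is considerably more involved.

```python
def build_layout(num_nodes):
    """Place primary Pi on node i and its replica Ri on the following node."""
    layout = {node: [] for node in range(1, num_nodes + 1)}
    for shard in range(1, num_nodes + 1):
        layout[shard].append(f"P{shard}")
        replica_node = shard % num_nodes + 1  # wrap around to node 1
        layout[replica_node].append(f"R{shard}")
    return {node: sorted(shards) for node, shards in layout.items()}


def fail_node(layout, failed):
    """Remove a node and promote the surviving replica of each lost primary."""
    lost = layout[failed]
    survivors = {n: list(s) for n, s in layout.items() if n != failed}
    for shard in lost:
        if shard.startswith("P"):
            replica = "R" + shard[1:]
            for shards in survivors.values():
                if replica in shards:
                    shards[shards.index(replica)] = shard  # promote R -> P
    return survivors


layout = build_layout(4)
after_failure = fail_node(layout, failed=1)
print(layout)
print(after_failure)
```

After node 1 fails, every primary P1 to P4 is still present on the remaining three nodes, which is why no data is lost as long as a primary and its replica never share a node.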
Types of Nodes
● Master Node
○ Cluster-wide operations (creating and deleting indices, keeping track of the nodes in the cluster, assigning shards, health checks, etc.)
● Data Node
○ Holds the data and the index
● Client Node
○ Acts as a load balancer that routes requests (neither a data node nor a master node)
ElasticSearch 1.4 CRUD - Write Operations
Breaking a Shard into Segments
● For ES, the basic unit of storage is a shard
● For Lucene, the basic unit of storage is a segment
● Each segment is an inverted index
● New documents are added to a new segment
● Segments are first created in memory and later persisted to disk
● Segments are immutable
Coordination Stage
● shard_number = hash(document_id) % num_of_primary_shards
● Every node knows where each shard lives
● The document is forwarded to the node that holds the primary shard with that shard_number
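The routing formula can be tried out directly. In the sketch below, `zlib.crc32` is only a stand-in for the hash function, chosen because it is deterministic and available in the standard library; the hash Elasticsearch uses internally is different, and `shard_for` is a hypothetical helper name.

```python
import zlib
from collections import Counter


def shard_for(document_id: str, num_primary_shards: int) -> int:
    """shard_number = hash(document_id) % num_of_primary_shards."""
    # crc32 is an assumption standing in for the real routing hash.
    return zlib.crc32(document_id.encode("utf-8")) % num_primary_shards


# The same ID always routes to the same shard.
first = shard_for("doc-42", 4)
second = shard_for("doc-42", 4)

# Many IDs spread across the available shards.
counts = Counter(shard_for(f"doc-{i}", 4) for i in range(1000))
print(first, second, dict(counts))
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing document IDs to different shards. That is the reason the primary shard count is fixed once an index has been created.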
Translog
Source: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html
Translog and Memory Buffer
● The request is written to the translog
● The document is added to the memory buffer (which stores all newly indexed documents)
● If the request succeeds on the primary shard, it is sent in parallel to the replica shards.
● In-sync shards are the replica copies that are kept in sync with the primary
● The client receives an acknowledgement that the request succeeded only after the translog has been fsync'ed on the primary and all in-sync shards.
Refresh Operation
● In Elasticsearch, the _refresh operation runs every second by default.
● During this operation, the contents of the in-memory buffer are copied to a newly created in-memory segment.
● As a result, new data becomes available for search.
Flush Operation
● A flush writes all the documents in the in-memory buffer to new Lucene segments.
● These, along with all existing in-memory segments, are committed to disk, after which the translog is cleared. This commit is essentially a Lucene commit.
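The interplay of translog, memory buffer, refresh, and flush can be modeled with a toy class. `ToyShard` and all of its attributes are hypothetical names used only for illustration; nothing here touches a disk, and replication and fsync are left out.

```python
class ToyShard:
    """In-memory model of the write path on a single shard."""

    def __init__(self):
        self.translog = []   # operations recorded since the last flush
        self.buffer = []     # newly indexed documents, not yet searchable
        self.segments = []   # searchable, immutable segments (tuples)
        self.committed = 0   # how many segments have been "committed to disk"

    def index(self, doc):
        # A write goes to the translog and to the memory buffer.
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        # Turn the buffer into a new segment so its documents become searchable.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Write out the buffer, commit all segments, then clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self, word):
        return [d for seg in self.segments for d in seg if word in d.split()]


shard = ToyShard()
shard.index("learning elasticsearch")
before_refresh = shard.search("elasticsearch")   # buffer is not searchable yet
shard.refresh()
after_refresh = shard.search("elasticsearch")
translog_before_flush = len(shard.translog)
shard.flush()
print(before_refresh, after_refresh, translog_before_flush, len(shard.translog))
```

The gap between indexing and the next refresh is why a freshly written document may not show up in search results right away.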
ElasticSearch 1.5 CRUD - Update & Delete
Elasticsearch Delete
● Documents in Elasticsearch are immutable, so they cannot be deleted or modified in place.
● Every segment on disk has a .del file associated with it.
● When a delete request is sent, the document is not physically removed; it is marked as deleted in the .del file.
● The document may still match a search query, but it is filtered out of the results.
● When segments are merged, documents marked as deleted in the .del file are not copied into the new merged segment.
Elasticsearch Update
● When a new document is created, Elasticsearch assigns it a version number.
● Every change to the document results in a new version number.
● When an update is performed, the old version is marked as deleted in the .del file and the new version is indexed in a new segment.
● The older version may still match a search query, but it is filtered out of the results.
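Delete, update, and merge can be illustrated with a toy index built from append-only segments. `ToyIndex` is a hypothetical class, and its `deleted` set merely stands in for the per-segment .del files; it is a sketch of the idea, not of Lucene's file formats.

```python
class ToyIndex:
    """Append-only segments plus a set of deletion markers."""

    def __init__(self):
        self.segments = []    # each segment: list of (doc_id, version, body)
        self.deleted = set()  # stand-in for .del files: (segment_no, position)
        self.versions = {}    # doc_id -> latest version number

    def _mark_deleted(self, doc_id):
        for s, segment in enumerate(self.segments):
            for p, (d, _, _) in enumerate(segment):
                if d == doc_id:
                    self.deleted.add((s, p))

    def index(self, doc_id, body):
        # An update marks the old version deleted and writes a new segment.
        self._mark_deleted(doc_id)
        version = self.versions.get(doc_id, 0) + 1
        self.versions[doc_id] = version
        self.segments.append([(doc_id, version, body)])
        return version

    def delete(self, doc_id):
        self._mark_deleted(doc_id)

    def search(self, word):
        # Deleted entries can still match; they are filtered out afterwards.
        return [
            (d, v, b)
            for s, segment in enumerate(self.segments)
            for p, (d, v, b) in enumerate(segment)
            if word in b.split() and (s, p) not in self.deleted
        ]

    def merge(self):
        # Only live entries are copied into the merged segment.
        live = [
            entry
            for s, segment in enumerate(self.segments)
            for p, entry in enumerate(segment)
            if (s, p) not in self.deleted
        ]
        self.segments = [live] if live else []
        self.deleted.clear()


idx = ToyIndex()
idx.index("1", "cool stuff")
idx.index("1", "cool things")            # update: version 2
hits = idx.search("cool")
stored_before = sum(len(seg) for seg in idx.segments)
idx.merge()
stored_after = sum(len(seg) for seg in idx.segments)
print(hits, stored_before, stored_after)
```

Until the merge runs, the old version still occupies space in its segment, which is why heavily updated indices shrink after merging.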
ElasticSearch 1.6 CRUD - Read Operations
ElasticSearch Read
● Query phase: the coordinating node routes the search request to a copy (primary or replica) of every shard in the index.
● Each shard performs the search independently and creates a set of results sorted by relevance score.
● Each shard returns the document IDs of its matched documents and their relevance scores to the coordinating node.
● By default, each shard sends its top 10 results to the coordinating node.
● The coordinating node sorts the results globally and creates a list of the top 10 hits.
● Fetch phase: the coordinating node then requests the original documents from the shards that hold those hits. The shards enrich the documents and return them to the coordinating node.
● The results are aggregated and sent to the client.
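The scatter-gather flow of a read can be sketched as follows. Shards are modeled as plain dictionaries, the score is just a raw term count standing in for a real relevance score, and the function names (`query_phase`, `search`) are assumptions made for this example.

```python
import heapq


def query_phase(shard_docs, term, size):
    """One shard returns its own top `size` hits as (score, doc_id) pairs."""
    hits = [(body.split().count(term), doc_id) for doc_id, body in shard_docs.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(size, hits)


def search(shards, term, size=10):
    """Coordinating node: scatter the query, merge globally, then fetch."""
    # Query phase: every shard searches independently.
    per_shard = [query_phase(docs, term, size) for docs in shards]
    # Global sort over the per-shard candidates.
    top = heapq.nlargest(size, (hit for hits in per_shard for hit in hits))
    # Fetch phase: pull the full documents only for the global top hits.
    results = []
    for score, doc_id in top:
        for docs in shards:
            if doc_id in docs:
                results.append({"_id": doc_id, "_score": score, "_source": docs[doc_id]})
                break
    return results


shards = [
    {"a": "learn to learn", "b": "cool stuff"},
    {"c": "learn fast"},
]
results = search(shards, "learn", size=2)
print(results)
```

Only IDs and scores travel in the query phase; full documents are fetched for the final hits alone, which keeps the amount of data moved between nodes small.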
That’s all folks!
References
1. https://qbox.io/blog/refresh-flush-operations-elasticsearch-guide
2. https://www.elastic.co/guide/index.html
3. https://blog.insightdatascience.com/anatomy-of-an-elasticsearch-cluster-part-i-7ac9a13b05db

Elasticsearch Architecture
