CF Software Package
Ernesto Reig
Damian McDonald
Elasticsearch – basics and beyond
Agenda
Introduction
• Elasticsearch definition and key points
• Inverted indexes
Cluster configuration and architecture
• Shards and replica
• Memory
• SSD Disks
• Logs
• Cluster topology
Modeling the data
• Mapping
• Analysis
• Handling relationships
JVM and Cluster monitoring
Introduction
Introduction (1): Elasticsearch definition and key points
Elasticsearch is not a NoSQL database
Elasticsearch is not itself a search engine (it uses Apache Lucene for that)
Elasticsearch is a server used to search & analyze data in real time.
• It is distributed, scalable and highly available.
• It is meant for real-time search and analytics capabilities.
• It comes with a sophisticated RESTful API.
3 key points in Elasticsearch:
• Proper cluster configuration and architecture
• Proper Data Mappings
• Proper JVM and cluster monitoring
Elasticsearch is fragile, delicate, sensitive, frail and tricky
“With great power comes great responsibility” Benjamin Parker
Introduction (2): Apache Lucene Inverted indexes
1. Spiderman is my favourite hero
2. Batman is a hero
3. Ernesto is a hero better than Spiderman and Batman
Term       Count  Docs
Spiderman  2      1, 3
is         3      1, 2, 3
my         1      1
favourite  1      1
hero       3      1, 2, 3
Batman     2      2, 3
a          2      2, 3
Ernesto    1      3
better     1      3
than       1      3
and        1      3
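A minimal sketch of how such an inverted index could be built from the three example sentences. It assumes a plain whitespace split and no normalization, which is a simplification of what Lucene actually does; the function name build_inverted_index is made up for this illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization only, no lowercasing or stemming.
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Spiderman is my favourite hero",
    2: "Batman is a hero",
    3: "Ernesto is a hero better than Spiderman and Batman",
}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    # Prints: term, number of documents containing it, and their ids.
    print(term, len(doc_ids), doc_ids)
```

A search for a term then becomes a dictionary lookup instead of a scan over every document.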
Cluster configuration and architecture
Configuration (1): Shards and Replica
• Shard: Apache Lucene Index
• Replica: copy of a shard
• Elasticsearch Index: 1 or more shards
• Question 1: How many shards do we need? And how many replicas?
• Question 2: Does it make sense to have a shard and its corresponding replica on the same node?
• Question 3: Is it useful to have a 1-node cluster with "number_of_replicas": 1?
• General rule:
– Max number of nodes = number of shards * (number of replicas + 1)
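The general rule can be checked with a few lines of arithmetic. This is only a sketch of the formula on the slide; the function name max_useful_nodes is hypothetical.

```python
def max_useful_nodes(number_of_shards, number_of_replicas):
    """Total shard copies of an index: each primary plus its replicas.

    Beyond this many nodes, extra nodes would hold no shard of the index.
    """
    return number_of_shards * (number_of_replicas + 1)


# Example: 5 primary shards with 1 replica each give 10 shard copies.
print(max_useful_nodes(5, 1))
# Example: a single shard without replicas can only use one node.
print(max_useful_nodes(1, 0))
```

Regarding questions 2 and 3: Elasticsearch does not allocate a replica on the same node as its primary, so on a 1-node cluster a replica stays unassigned.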
Configuration (2)
• The heap dedicated to Elasticsearch should not exceed 50% of the total memory available.
– Example with 16 GB of RAM:
• ./bin/elasticsearch -Xmx8g -Xms8g
• export ES_HEAP_SIZE=8g
– Xms (initial heap) and Xmx (maximum heap) should be set to the same value
• Do not give the heap more than 32 GB!
– (http://www.elastic.co/guide/en/elasticsearch/guide/master/heap-sizing.html#compressed_oops)
• Enable mlockall to avoid memory swapping:
– bootstrap.mlockall: true
• Use SSD disks
• Change logs path:
– path.logs: /var/log/elasticsearch
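The two memory rules (at most half of the RAM, and never more than 32 GB) can be combined in a small helper. This is a sketch only: the function name and the 31 GB ceiling are assumptions chosen to stay safely below the 32 GB limit mentioned above.

```python
def recommended_heap_gb(total_ram_gb, ceiling_gb=31):
    """Return a heap size in GB: half of the RAM, capped at ceiling_gb.

    ceiling_gb=31 is an assumed safety margin below the 32 GB limit.
    """
    half_of_ram = total_ram_gb // 2
    return max(1, min(half_of_ram, ceiling_gb))


for ram in (16, 64, 128):
    heap = recommended_heap_gb(ram)
    # Same value for Xms and Xmx, as recommended on the slide.
    print(f"{ram} GB RAM -> export ES_HEAP_SIZE={heap}g")
```

The memory left outside the heap is not wasted: the operating system uses it for the file system cache that Lucene relies on.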
Configuration (3): cluster topology (1)
• A well-designed topology helps the cluster to:
– Increase search speed
– Reduce CPU consumption
– Reduce memory consumption
– Accept more concurrent requests per second
– Reduce probability of split brain
– Reduce probability of other errors in general.
– Reduce hardware costs
• Data nodes and 2 types of non-data nodes:
– data nodes
• http.enabled: false
• node.data: true
• node.master: false
– dedicated master nodes
• http.enabled: false
• node.data: false
• node.master: true
– client nodes (smart load balancers)
• http.enabled: true
• node.data: false
• node.master: false
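The three node roles differ only in three settings, so they can be kept in one table and rendered per machine. The sketch below uses the setting names from the slide; the role labels, the NODE_ROLES dictionary and render_settings are hypothetical helpers, and the output is meant to be pasted into each node's elasticsearch.yml.

```python
# Settings per node role, exactly as listed on the slide.
NODE_ROLES = {
    "data": {"http.enabled": False, "node.data": True, "node.master": False},
    "master": {"http.enabled": False, "node.data": False, "node.master": True},
    "client": {"http.enabled": True, "node.data": False, "node.master": False},
}


def render_settings(role):
    """Render the settings of one role as YAML-style 'key: value' lines."""
    lines = []
    for key, value in NODE_ROLES[role].items():
        # YAML booleans are lowercase: true / false.
        lines.append(f"{key}: {str(value).lower()}")
    return "\n".join(lines)


for role in NODE_ROLES:
    print(f"# {role} node")
    print(render_settings(role))
```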
Configuration (4): cluster topology (2)
With this configuration we can use machines with a different hardware configuration for each type of node.
This way we can save a lot of money on hardware!
Example of cluster topology with 2 HTTP nodes, 2 master nodes and 1 to X data nodes
Modeling the data
Modeling the data (1): Mapping
• Mapping is the process of defining how a document should be mapped to the search engine
– Default Dynamic Mapping
• An index may store documents of different "mapping types"
• Mapping types are a way to divide the documents in an index into logical groups. Think of them as tables in a database
• Components:
– Fields: _id, _type, _source, _all, _parent, _index, _size,…
– Types: the datatype for each field in a document (e.g. strings, numbers, objects, etc.)
• Core Types: string, integer/long, float/double, boolean, and null.
• Array
• Object
• Nested
• IP
• Geo Point
• Geo Shape
• Attachment
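To make the list of types more concrete, here is a sketch of an explicit mapping built as a Python dictionary. It assumes the typed-mapping syntax of the Elasticsearch versions these slides target (mapping types inside an index and the string core type). The mapping type blogpost and every field name are invented for the illustration.

```python
import json

# Hypothetical "blogpost" mapping type; all field names are illustrative only.
blogpost_mapping = {
    "mappings": {
        "blogpost": {
            "properties": {
                "title": {"type": "string"},
                "views": {"type": "long"},
                "published": {"type": "boolean"},
                "author_ip": {"type": "ip"},
                "location": {"type": "geo_point"},
                "comments": {
                    "type": "nested",
                    "properties": {
                        "name": {"type": "string"},
                        "stars": {"type": "integer"},
                    },
                },
            }
        }
    }
}


def field_types(mapping, type_name):
    """Return {field name: datatype} for one mapping type."""
    properties = mapping["mappings"][type_name]["properties"]
    return {field: spec["type"] for field, spec in properties.items()}


# The JSON body that would be sent when creating the index.
print(json.dumps(blogpost_mapping, indent=2))
print(field_types(blogpost_mapping, "blogpost"))
```

Declaring the mapping explicitly avoids surprises from dynamic mapping, which guesses a type from the first value it sees for a field.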
Modeling the data (2): Analysis
• Analysis is a process that consists of the following:
– First, tokenizing a block of text into individual terms suitable for use in an inverted index,
– Then normalizing these terms into a standard form to improve their “searchability,” or recall
• This job is performed by analyzers. An analyzer is really just a wrapper that
combines three functions into a single package:
– 0 or more Character filters
– 1 Tokenizer
– 0 or more Token filters
• Analysis is performed both to:
– break analyzed fields into terms when a document is indexed
– process query strings at search time
• Elasticsearch provides many character filters, tokenizers, and token filters
out of the box. These can be combined to create custom analyzers
suitable for different purposes.
Modeling the data (3): Analysis steps example
Original sentence: Batman & Robin aren't my favourite heroes
1st) Character filter: Batman and Robin aren't my favourite heroes
2nd) Tokenizer: Batman | and | Robin | aren't | my | favourite | heroes
3rd) Token Filter: batman | -- | robin | aren | my | favourite | heroes
Indexed: the terms that remain after the token filter (batman, robin, aren, my, favourite, heroes)
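The three steps of this example can be sketched as a tiny pipeline. None of the functions below are Elasticsearch's built-in components; they are simplified stand-ins, chosen as assumptions so that the output matches the terms shown on the slide.

```python
def analyze(text, char_filters, tokenizer, token_filters):
    """Run 0..n character filters, exactly 1 tokenizer, then 0..n token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


def ampersand_to_and(text):
    # Character filter: works on the raw string before tokenization.
    return text.replace("&", "and")


def whitespace_tokenizer(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


STOPWORDS = {"and"}  # assumed stop list, just enough for this sentence


def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def strip_apostrophe_suffix(tokens):
    # Assumed filter that turns "aren't" into "aren", as on the slide.
    return [token.split("'")[0] for token in tokens]


terms = analyze(
    "Batman & Robin aren't my favourite heroes",
    char_filters=[ampersand_to_and],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, remove_stopwords, strip_apostrophe_suffix],
)
print(terms)
```

Because the same analyzer is applied to query strings, a search for "BATMAN" is reduced to the term batman and matches the indexed document.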
Modeling the data (4): Handling relationships
Handling relationships between entities is not as obvious as it is with a
dedicated relational store. The golden rule of a relational database—normalize
your data—does not apply to Elasticsearch.
Four common techniques are used to manage relational data in Elasticsearch:
• Application-side joins
• Data denormalization
• Nested objects
• Parent/child relationships
Modeling the data (5): Handling relationships – Application-side joins
We can (partly) emulate a relational database by implementing joins in our application:
PUT /my_index/user/1
{
  "name": "John Smith",
  "email": "john@smith.com",
  "dob": "1970/10/24"
}
PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body": "It's complicated...",
  "user": 1
}
Problem: This approach is only suitable when the first entity (the user in this example) has a small number of documents and, preferably, they seldom change.
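An application-side join takes two lookups: first find the ids of the matching users, then find the blog posts that reference those ids. The sketch below simulates both lookups with in-memory lists instead of real Elasticsearch searches; the helper names are hypothetical.

```python
# In-memory stand-ins for the "user" and "blogpost" documents above.
users = [
    {"id": 1, "name": "John Smith", "email": "john@smith.com"},
]
blogposts = [
    {"id": 2, "title": "Relationships", "body": "It's complicated...", "user": 1},
]


def find_user_ids_by_name(name):
    """First lookup: would be a search against the user type."""
    return [user["id"] for user in users if user["name"] == name]


def find_blogposts_by_user_ids(user_ids):
    """Second lookup: would be a search filtering blogposts on the user ids."""
    wanted = set(user_ids)
    return [post for post in blogposts if post["user"] in wanted]


def blogposts_by_author(name):
    # The join itself happens here, in application code.
    return find_blogposts_by_user_ids(find_user_ids_by_name(name))


for post in blogposts_by_author("John Smith"):
    print(post["title"])
```

The number of ids returned by the first lookup is what limits this approach: every id has to be passed into the second query.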
Modeling the data (6): Handling relationships – Data denormalization
Having redundant copies of data in each document that requires access to it removes the need for joins:
PUT /my_index/user/1
{
  "name": "John Smith",
  "email": "john@smith.com",
  "dob": "1970/10/24"
}
PUT /my_index/blogpost/2
{
  "title": "Relationships",
  "body": "It's complicated...",
  "user": {
    "id": 1,
    "name": "John Smith"
  }
}
Problem: if we want to update the name or remove a user object, we also have to reindex the whole blogpost document.
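The cost of denormalization shows up on updates. The following sketch uses in-memory documents rather than a real index, and posts 3 and 4 are invented for the illustration; it counts how many blog posts must be rewritten when one user changes name.

```python
# Blog posts that each embed a copy of the user's name (posts 3 and 4 are made up).
blogposts = [
    {"id": 2, "title": "Relationships", "user": {"id": 1, "name": "John Smith"}},
    {"id": 3, "title": "More relationships", "user": {"id": 1, "name": "John Smith"}},
    {"id": 4, "title": "Nest eggs", "user": {"id": 7, "name": "Alice White"}},
]


def rename_user(posts, user_id, new_name):
    """Update every embedded copy of the name; return the ids of touched posts."""
    touched = []
    for post in posts:
        if post["user"]["id"] == user_id:
            post["user"]["name"] = new_name
            # In Elasticsearch, each touched post would be reindexed in full.
            touched.append(post["id"])
    return touched


touched_ids = rename_user(blogposts, 1, "John A. Smith")
print(touched_ids)
```

Searching stays a single query, but one change to a user fans out to every document that carries a copy of it.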
Modeling the data (7): Handling relationships – Nested objects
Given the fact that creating, deleting, and updating a single document in Elasticsearch is atomic, it makes sense to store closely related entities within the same document:
PUT /my_index/blogpost/1
{
  "title": "Nest eggs",
  "body": "Making your money work...",
  "tags": [ "cash", "shares" ],
  "comments": [
    {
      "name": "John Smith",
      "comment": "Great article",
      "age": 28,
      "stars": 4,
      "date": "2014-09-01"
    },
    {
      "name": "Alice White",
      "comment": "More like this please",
      "age": 31,
      "stars": 5,
      "date": "2014-10-22"
    }
  ]
}
Problem: As with denormalization, to update, add, or remove a nested object, we have to reindex the whole blogpost document.
Modeling the data (8): Handling relationships – Parent/child relationship
The parent-child functionality allows you to associate one document type with another, in a one-to-many relationship: one parent to many children. Advantages:
• The parent document can be updated without reindexing the children.
• Child documents can be added, changed, or deleted without affecting either the parent or other children.
• Child documents can be returned as the results of a search request.
Define the parent/child mapping:
PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch"
      }
    }
  }
}
Find children by parent:
GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}
Find parents by children:
GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "term": {
          "name": "John"
        }
      }
    }
  }
}
JVM and Cluster monitoring
• Servers CPU and disk usage
• Elasticsearch logs
• Elasticsearch plugins:
– Marvel
– Bigdesk
– Watcher
• Watch stats (http://localhost:9200/_stats)
• JVM
– jstat: jstat -gcutil es_pid 2000 1000 (get the Elasticsearch pid with jps)
– Visual JVM plugin
– Memory dump – jmap
• Hot threads API
• Before going to production: Apache JMeter tests!
Thank You

More Related Content

What's hot

MongoDB
MongoDBMongoDB
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityIEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
Thamme Gowda
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
Thamme Gowda
 
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...MongoDB
 
Oslo
OsloOslo
Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....
Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....
Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....
QBiC_Tue
 
Modeling Data in MongoDB
Modeling Data in MongoDBModeling Data in MongoDB
Modeling Data in MongoDBlehresman
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
NoSQLmatters
 
MongoDB - An Introduction
MongoDB - An IntroductionMongoDB - An Introduction
MongoDB - An Introduction
dinkar thakur
 
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
QBiC_Tue
 
NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...
NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...
NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...
ijdms
 
Solr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologySolr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW Technology
Lucidworks
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 
Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013
Yadhu Kiran
 
Automating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsAutomating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic Datasets
Thomas Lee
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Annotating search results from web databases
Annotating search results from web databasesAnnotating search results from web databases
Annotating search results from web databases
IEEEFINALYEARPROJECTS
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
Globus
 

What's hot (20)

MongoDB
MongoDBMongoDB
MongoDB
 
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityIEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
 
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
Socialite, the Open Source Status Feed Part 1: Design Overview and Scaling fo...
 
MongoDB
MongoDBMongoDB
MongoDB
 
Oslo
OsloOslo
Oslo
 
Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....
Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....
Data Management for Quantitative Biology - Database systems, May 7, 2015, Dr....
 
Modeling Data in MongoDB
Modeling Data in MongoDBModeling Data in MongoDB
Modeling Data in MongoDB
 
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Du...
 
MongoDB - An Introduction
MongoDB - An IntroductionMongoDB - An Introduction
MongoDB - An Introduction
 
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
 
NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...
NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...
NOSQL IMPLEMENTATION OF A CONCEPTUAL DATA MODEL: UML CLASS DIAGRAM TO A DOCUM...
 
Solr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologySolr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW Technology
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
 
Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013
 
MongoDB DOC v1.5
MongoDB DOC v1.5MongoDB DOC v1.5
MongoDB DOC v1.5
 
Automating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsAutomating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic Datasets
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
Annotating search results from web databases
Annotating search results from web databasesAnnotating search results from web databases
Annotating search results from web databases
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 

Similar to Elasticsearch - basics and beyond

Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Spark Summit
 
Elk presentation1#3
Elk presentation1#3Elk presentation1#3
Elk presentation1#3
uzzal basak
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
Robert Dempsey
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
Daniel N
 
Database system
Database system Database system
Database system
Hitesh Mohapatra
 
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...
Continuent
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
Eric Rodriguez (Hiring in Lex)
 
Elasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational databaseElasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational database
Kristijan Duvnjak
 
Indexing in eXist database
Indexing in eXist database Indexing in eXist database
Indexing in eXist database
redchilly
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
Clifford James
 
DS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptxDS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptx
prakashvs7
 
Data Structures_Introduction
Data Structures_IntroductionData Structures_Introduction
Data Structures_Introduction
ThenmozhiK5
 
Multi-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresMulti-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated Polystores
Jiaheng Lu
 
The Genopolis Microarray database
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray database
Novartis Institutes for BioMedical Research
 
Ch 2-introduction to dbms
Ch 2-introduction to dbmsCh 2-introduction to dbms
Ch 2-introduction to dbms
Rupali Rana
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
Divij Sehgal
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
Vinay Kumar
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
cadejaumafiq
 

Similar to Elasticsearch - basics and beyond (20)

Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Elk presentation1#3
Elk presentation1#3Elk presentation1#3
Elk presentation1#3
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Database system
Database system Database system
Database system
 
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin...
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
JavaCro'15 - Elasticsearch as a search alternative to a relational database -...
JavaCro'15 - Elasticsearch as a search alternative to a relational database -...JavaCro'15 - Elasticsearch as a search alternative to a relational database -...
JavaCro'15 - Elasticsearch as a search alternative to a relational database -...
 
Elasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational databaseElasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational database
 
Indexing in eXist database
Indexing in eXist database Indexing in eXist database
Indexing in eXist database
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
DS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptxDS-UNIT 1 FINAL (2).pptx
DS-UNIT 1 FINAL (2).pptx
 
Data Structures_Introduction
Data Structures_IntroductionData Structures_Introduction
Data Structures_Introduction
 
Multi-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresMulti-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated Polystores
 
The Genopolis Microarray database
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray database
 
Ch 2-introduction to dbms
Ch 2-introduction to dbmsCh 2-introduction to dbms
Ch 2-introduction to dbms
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
 

Recently uploaded

一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 

Elasticsearch - basics and beyond

  • 1. CF Software Package Ernesto Reig Damian McDonald Elasticsearch – basics and beyond
  • 2. Agenda Introduction • Elasticsearch definition and key points • Inverted indexes Cluster configuration and architecture • Shards and replica • Memory • SSD Disks • Logs • Cluster topology Modeling the data • Mapping • Analysis • Handling relationships JVM and Cluster monitoring
  • 4. Introduction (1): Elasticsearch definition and key points Elasticsearch is not a NO-SQL database Elasticsearch is not a Search Engine (uses Apache Lucene) Elasticsearch is a server used to search & analyze data in real time. • It is distributed, scalable and highly available. • It is meant for real-time search and analytics capabilities. • It comes with a sophisticated RESTful API. 3 key points in Elasticsearch: • Proper cluster configuration and architecture • Proper Data Mappings • Proper JVM and cluster monitoring Elasticsearch is fragile, delicate, sensitive, frail and tricky “With great power comes great responsibility” Benjamin Parker
  • 5. Introduction (2): Apache Lucene inverted indexes
      Documents:
        1. Spiderman is my favourite hero
        2. Batman is a hero
        3. Ernesto is a hero better than Spiderman and Batman
      Inverted index:
        Term        Count   Docs
        Spiderman   2       1, 3
        is          3       1, 2, 3
        my          1       1
        favourite   1       1
        hero        3       1, 2, 3
        Batman      2       2, 3
        a           2       2, 3
        Ernesto     1       3
        better      1       3
        than        1       3
        and         1       3
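The table on slide 5 can be reproduced with a few lines of Python. This is only a toy sketch of the idea: it splits on whitespace and keeps the original casing, whereas Lucene's real index structures and analysis are far more involved. The function name `build_inverted_index` is made up for this illustration.

```python
from collections import defaultdict

# The three example documents from the slide, keyed by document id.
docs = {
    1: "Spiderman is my favourite hero",
    2: "Batman is a hero",
    3: "Ernesto is a hero better than Spiderman and Batman",
}

def build_inverted_index(documents):
    """Map every term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenizer
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(docs)

# Print the same three columns as the slide: term, count, document ids.
for term, doc_ids in index.items():
    print(f"{term:<10} {len(doc_ids)}  {doc_ids}")
```

Looking up a term is then a single dictionary access (`index["hero"]`), which is why searching an inverted index is fast compared with scanning every document.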
  • 7. Configuration (1): Shards and replicas
        • Shard: an Apache Lucene index
        • Replica: a copy of a shard
        • Elasticsearch index: 1 or more shards
        • Question 1: How many shards do we need? And how many replicas?
        • Question 2: Does it make sense to have a shard and its corresponding replica on the same node?
        • Question 3: Is it useful to have a 1-node cluster with "number_of_replicas": 1?
        • General rule:
            – Max number of nodes = number of shards * (number of replicas + 1)
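The general rule on slide 7 is plain arithmetic, sketched below. The function name `max_useful_nodes` is hypothetical; the formula is the one stated on the slide.

```python
def max_useful_nodes(number_of_shards: int, number_of_replicas: int) -> int:
    """Slide rule: max nodes = number of shards * (number of replicas + 1).

    Every primary shard plus each of its replicas is one shard copy, and a
    node needs at least one copy to do any work for the index.
    """
    if number_of_shards < 1 or number_of_replicas < 0:
        raise ValueError("need at least 1 shard and 0 or more replicas")
    return number_of_shards * (number_of_replicas + 1)

# 5 shards with 1 replica each give 10 shard copies, so at most 10 nodes.
print(max_useful_nodes(5, 1))
```

Beyond that number of nodes, some nodes would hold no copy of the index at all. In the other direction, a replica is not allocated on the node that already holds its primary, so on a 1-node cluster the replicas simply stay unassigned.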
  • 8. Configuration (2)
        • The memory dedicated to Elasticsearch should not be more than 50% of the total memory available.
            – Example for a machine with 16 GB:
                • ./bin/elasticsearch -Xmx8g -Xms8g
                • export ES_HEAP_SIZE=8g
            – The minimum (Xms) and maximum (Xmx) heap sizes should be the same
        • Do not give it more than 32 GB!
            – (http://www.elastic.co/guide/en/elasticsearch/guide/master/heap-sizing.html#compressed_oops)
        • Enable mlockall to avoid memory swapping:
            – bootstrap.mlockall: true
        • Use SSD disks
        • Change the logs path:
            – path.logs: /var/log/elasticsearch
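The two sizing rules on slide 8 (at most half of the RAM, never more than 32 GB) can be combined into a small helper. This is a sketch under the slide's numbers only: `recommended_heap_gb` and `heap_flags` are hypothetical names, and the 32 GB ceiling is taken from the slide (the linked guide explains that the real threshold for compressed object pointers sits slightly below that value).

```python
def recommended_heap_gb(total_ram_gb: float, ceiling_gb: float = 32.0) -> float:
    """Half of the machine's RAM, capped at the ceiling from the slide."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb / 2, ceiling_gb)

def heap_flags(total_ram_gb: float) -> str:
    """Render equal -Xms/-Xmx flags, rounding down to whole gigabytes."""
    heap = int(recommended_heap_gb(total_ram_gb))
    return f"-Xms{heap}g -Xmx{heap}g"

print(heap_flags(16))   # the 16 GB example from the slide
print(heap_flags(128))  # a large machine hits the 32 GB ceiling
```

The other half of the RAM is not wasted: it is left to the operating system, whose file system cache Lucene relies on heavily.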
  • 9. Configuration (3): cluster topology (1)
        • A well-designed topology helps the cluster to:
            – Increase search speed
            – Reduce CPU consumption
            – Reduce memory consumption
            – Accept more concurrent requests per second
            – Reduce the probability of split brain
            – Reduce the probability of other errors in general
            – Reduce hardware costs
        • Data nodes and 2 types of non-data nodes:
            – Data nodes
                • http.enabled: false
                • node.data: true
                • node.master: false
            – Dedicated master nodes
                • http.enabled: false
                • node.data: false
                • node.master: true
            – Client nodes (smart load balancers)
                • http.enabled: true
                • node.data: false
                • node.master: false
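The three node roles on slide 9 differ only in three boolean settings, so they can be kept in one table and rendered per machine. The sketch below does that; the setting names and values are the ones on the slide, while `NODE_ROLES` and `render_settings` are hypothetical helper names and the output is just text you would paste into each node's configuration file.

```python
# Settings per node role, exactly as listed on the slide.
NODE_ROLES = {
    "data":   {"http.enabled": False, "node.data": True,  "node.master": False},
    "master": {"http.enabled": False, "node.data": False, "node.master": True},
    "client": {"http.enabled": True,  "node.data": False, "node.master": False},
}

def render_settings(role: str) -> str:
    """Return the configuration lines for one node role."""
    settings = NODE_ROLES[role]  # raises KeyError for an unknown role
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())

for role in NODE_ROLES:
    print(f"# {role} node")
    print(render_settings(role))
```

Keeping the roles in one place makes it easy to check the invariant behind the topology: only client nodes accept HTTP traffic, only data nodes hold shards, and only dedicated masters are master-eligible.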
  • 10. Configuration (4): cluster topology (2)
      With this configuration we can use machines with a different hardware configuration for each type of node, which can save a lot of money on hardware.
      Figure: example of a cluster topology with 2 HTTP nodes, 2 master nodes and 1 to X data nodes.
  • 12. Modeling the data (1): Mapping
        • Mapping is the process of defining how a document should be mapped to the search engine
            – Default: dynamic mapping
        • An index may store documents of different "mapping types"
        • Mapping types are a way to divide the documents in an index into logical groups. Think of them as tables in a database
        • Components:
            – Fields: _id, _type, _source, _all, _parent, _index, _size, …
            – Types: the datatype for each field in a document (e.g. strings, numbers, objects)
                • Core types: string, integer/long, float/double, boolean, and null
                • Array
                • Object
                • Nested
                • IP
                • Geo Point
                • Geo Shape
                • Attachment
  • 13. Modeling the data (2): Analysis
        • Analysis is a process that consists of the following:
            – First, tokenizing a block of text into individual terms suitable for use in an inverted index
            – Then, normalizing these terms into a standard form to improve their "searchability", or recall
        • This job is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:
            – 0 or more character filters
            – 1 tokenizer
            – 0 or more token filters
        • Analysis is performed in two places:
            – on analyzed fields, to break them into terms when a document is indexed
            – on query strings, when a search is executed
        • Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes.
  • 14. Modeling the data (3): Analysis steps example
      Original sentence: Batman & Robin aren´t my favourite heroes
      1st) Character filter: Batman and Robin aren´t my favourite heroes
      2nd) Tokenizer: Batman and Robin aren´t my favourite heroes (split into individual terms)
      3rd) Token filter: batman -- robin aren my favourite heroes
      Indexed: the terms produced by the token filter
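The three analysis steps on slides 13 and 14 can be imitated in plain Python to see how they chain together. This is a toy pipeline, not one of Elasticsearch's built-in analyzers: the `&` to `and` mapping, the letters-only tokenizer and the one-word stop list are all assumptions chosen to resemble the slide's example.

```python
import re

# Assumption: a tiny stop word list, just enough for this example.
STOP_WORDS = {"and"}

def char_filter(text: str) -> str:
    """Character filter: rewrite the raw text before tokenizing."""
    return text.replace("&", "and")

def tokenize(text: str) -> list:
    """Tokenizer: split the text into runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

def token_filter(tokens: list) -> list:
    """Token filters: lowercase every token, then drop stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]

def analyze(text: str) -> list:
    """An analyzer is the three steps applied in order."""
    return token_filter(tokenize(char_filter(text)))

terms = analyze("Batman & Robin aren´t my favourite heroes")
print(terms)
```

Because this toy tokenizer splits on every non-letter, "aren´t" becomes the two tokens "aren" and "t"; the slide only shows "aren". The same `analyze` function would have to be applied to query strings, otherwise a search for "Batman" would never match the indexed term "batman".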
  • 15. Modeling the data (4): Handling relationships
      Handling relationships between entities is not as obvious as it is with a dedicated relational store. The golden rule of a relational database (normalize your data) does not apply to Elasticsearch.
      Four common techniques are used to manage relational data in Elasticsearch:
        • Application-side joins
        • Data denormalization
        • Nested objects
        • Parent/child relationships
  • 16. Modeling the data (5): Handling relationships – Application-side joins
      We can (partly) emulate a relational database by implementing joins in our application:
          PUT /my_index/user/1
          {
            "name": "John Smith",
            "email": "john@smith.com",
            "dob": "1970/10/24"
          }
          PUT /my_index/blogpost/2
          {
            "title": "Relationships",
            "body": "It's complicated...",
            "user": 1
          }
      Problem: This approach is only suitable when the first entity (the user in this example) has a small number of documents and, preferably, they seldom change.
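An application-side join means the application runs two lookups and stitches them together itself. The sketch below mimics that with in-memory dictionaries standing in for the `user` and `blogpost` types from slide 16; no Elasticsearch client is involved, and all function names are hypothetical.

```python
# Stand-ins for the two document types indexed on the slide.
users = {
    1: {"name": "John Smith", "email": "john@smith.com", "dob": "1970/10/24"},
}
blogposts = {
    2: {"title": "Relationships", "body": "It's complicated...", "user": 1},
}

def find_user_ids_by_name(name: str) -> list:
    """First query: resolve the user name to user ids."""
    return [user_id for user_id, user in users.items() if user["name"] == name]

def find_posts_by_user_ids(user_ids: list) -> list:
    """Second query: fetch the blog posts that reference those ids."""
    wanted = set(user_ids)
    return [post for post in blogposts.values() if post["user"] in wanted]

def posts_by_author(name: str) -> list:
    """The join itself happens here, in application code."""
    return find_posts_by_user_ids(find_user_ids_by_name(name))

print(posts_by_author("John Smith"))
```

The list of ids produced by the first step is sent along with the second query, which is why the approach only works well when the first entity matches a small number of documents.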
  • 17. Modeling the data (6): Handling relationships – Data denormalization
      Having redundant copies of data in each document that requires access to it removes the need for joins:
          PUT /my_index/user/1
          {
            "name": "John Smith",
            "email": "john@smith.com",
            "dob": "1970/10/24"
          }
          PUT /my_index/blogpost/2
          {
            "title": "Relationships",
            "body": "It's complicated...",
            "user": {
              "id": 1,
              "name": "John Smith"
            }
          }
      Problem: if we want to update the name of a user, or remove the user object, we also have to reindex every blogpost document that contains a copy of it.
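The cost of denormalization shows up on updates. The sketch below keeps a copy of the user's name inside every blog post, as on slide 17, and counts how many documents a single rename has to touch. The data lives in plain dictionaries; posts 3 and 4 and the function `rename_user` are invented for the illustration.

```python
# Blog posts with an embedded copy of the author. Post 2 is the one from
# the slide; posts 3 and 4 are hypothetical extra documents.
blogposts = {
    2: {"title": "Relationships", "user": {"id": 1, "name": "John Smith"}},
    3: {"title": "More relationships", "user": {"id": 1, "name": "John Smith"}},
    4: {"title": "Unrelated", "user": {"id": 7, "name": "Alice White"}},
}

def rename_user(user_id: int, new_name: str) -> list:
    """Rewrite every post that embeds the user; return the touched post ids."""
    touched = []
    for post_id, post in blogposts.items():
        if post["user"]["id"] == user_id:
            post["user"]["name"] = new_name  # in Elasticsearch: reindex the post
            touched.append(post_id)
    return touched

touched = rename_user(1, "John A. Smith")
print(touched)
```

Reads need no join at all, because each post already carries the author's name; the price is that one logical change fans out into one write per document holding a copy.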
  • 18. Modeling the data (7): Handling relationships – Nested objects
      Given the fact that creating, deleting, and updating a single document in Elasticsearch is atomic, it makes sense to store closely related entities within the same document:
          PUT /my_index/blogpost/1
          {
            "title": "Nest eggs",
            "body": "Making your money work...",
            "tags": [ "cash", "shares" ],
            "comments": [
              {
                "name": "John Smith",
                "comment": "Great article",
                "age": 28,
                "stars": 4,
                "date": "2014-09-01"
              },
              {
                "name": "Alice White",
                "comment": "More like this please",
                "age": 31,
                "stars": 5,
                "date": "2014-10-22"
              }
            ]
          }
      Problem: As with denormalization, to update, add, or remove a nested object, we have to reindex the whole blogpost document.
  • 19. Modeling the data (8): Handling relationships – Parent/child relationship
      The parent-child functionality allows you to associate one document type with another, in a one-to-many relationship: one parent to many children.
      Advantages:
        • The parent document can be updated without reindexing the children.
        • Child documents can be added, changed, or deleted without affecting either the parent or other children.
        • Child documents can be returned as the results of a search request.
      Define the parent/child mapping when creating the index:
          PUT /company
          {
            "mappings": {
              "branch": {},
              "employee": {
                "_parent": { "type": "branch" }
              }
            }
          }
      Find children by parent:
          GET /company/employee/_search
          {
            "query": {
              "has_parent": {
                "type": "branch",
                "query": {
                  "match": { "country": "UK" }
                }
              }
            }
          }
      Find parents by children:
          GET /company/branch/_search
          {
            "query": {
              "has_child": {
                "type": "employee",
                "query": {
                  "term": { "name": "John" }
                }
              }
            }
          }
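What the two queries on slide 19 compute can be shown with plain Python data. The sketch below is only a model of the semantics of `has_parent` and `has_child`, not of how Elasticsearch executes them; the branch ids, the sample documents and both function names are hypothetical.

```python
# Parent documents (branches), keyed by a made-up id.
branches = {
    "london": {"country": "UK"},
    "paris": {"country": "France"},
}

# Child documents (employees); each one stores only its parent's id.
employees = [
    {"name": "John", "parent": "london"},
    {"name": "Alice", "parent": "paris"},
]

def employees_with_parent_country(country: str) -> list:
    """Like has_parent: return the children whose parent matches."""
    return [e for e in employees if branches[e["parent"]]["country"] == country]

def branches_with_child_named(name: str) -> list:
    """Like has_child: return the parents that have a matching child."""
    parent_ids = {e["parent"] for e in employees if e["name"] == name}
    return [branches[parent_id] for parent_id in sorted(parent_ids)]

print(employees_with_parent_country("UK"))
print(branches_with_child_named("John"))
```

Since a child only holds a reference to its parent, changing a branch leaves the employee documents untouched, which is the first advantage listed on the slide.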
  • 20. JVM and Cluster monitoring
  • 21. JVM and Cluster monitoring
        • Server CPU and disk usage
        • Elasticsearch logs
        • Elasticsearch plugins:
            – Marvel
            – Bigdesk
            – Watcher
        • Watch the stats (http://localhost:9200/_stats)
        • JVM
            – jstat: jstat -gcutil es_pid 2000 1000 (get the ES pid with jps)
            – Visual JVM plugin
            – Memory dump with jmap
        • Hot threads API
        • Before going to production: Apache JMeter tests!