Elasticsearch
Crash Course for Data Engineers
Duy Do (@duydo)
About
● A Father, A Husband, A Software Engineer
● Founder of Vietnamese Elasticsearch Community
● Author of Vietnamese Elasticsearch Analysis Plugin
● Technical Consultant at Sentifi AG
● Co-Founder at Krom
● Follow me @duydo
Elasticsearch is Everywhere
What is Elasticsearch?
Elasticsearch is a distributed search and analytics engine, designed for horizontal scalability with easy management.
Basic Terms
● Cluster is a collection of nodes.
● Node is a single server, part of a cluster.
● Index is a collection of shards ~ database.
● Shard is a collection of documents.
● Type is a category/partition of an index ~ table in database.
● Document is a JSON object ~ record in database.
Distributed & Scalable
Shards & Replicas
One node, One shard
PUT /employees
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
Node 1 (employees): P0
Two nodes, One shard
PUT /employees
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
Node 1 (employees): P0
Node 2: (no shards)
One node, Two shards
PUT /employees
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}
Node 1 (employees): P0, P1
Two Nodes, Two Shards
PUT /employees
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}
Node 1 (employees): P0
Node 2 (employees): P1
Two nodes, Two shards, One replica of each shard
PUT /employees
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}
Node 1 (employees): P0, R1
Node 2 (employees): P1, R0
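The layouts above follow one rule: a replica never sits on the same node as its primary. The sketch below is a toy model of that rule only. The `allocate` function and its round-robin placement are assumptions made for illustration, not Elasticsearch's actual allocation algorithm.

```python
def allocate(num_nodes, number_of_shards, number_of_replicas):
    """Toy round-robin placement; NOT Elasticsearch's real allocator."""
    nodes = {f"Node {i + 1}": [] for i in range(num_nodes)}
    names = list(nodes)
    unassigned = []
    for shard in range(number_of_shards):
        primary_node = shard % num_nodes
        nodes[names[primary_node]].append(f"P{shard}")
        for replica in range(number_of_replicas):
            offset = replica + 1
            if offset >= num_nodes:
                # Every node already holds a copy of this shard,
                # so the replica cannot be placed anywhere.
                unassigned.append(f"R{shard}")
                continue
            target = names[(primary_node + offset) % num_nodes]
            nodes[target].append(f"R{shard}")
    return nodes, unassigned


print(allocate(2, 2, 1))  # ({'Node 1': ['P0', 'R1'], 'Node 2': ['R0', 'P1']}, [])
print(allocate(1, 1, 1))  # ({'Node 1': ['P0']}, ['R0'])
```

With a single node and one replica, the replica stays unassigned in this model, because the only node already holds the primary.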
Index Management
Create Index
PUT /employees
{
  "settings": {...},
  "mappings": {
    "type_one": {...},
    "type_two": {...}
  },
  "aliases": {
    "alias_one": {...},
    "alias_two": {...}
  }
}
Index Settings
PUT /employees/_settings
{
  "number_of_replicas": 1
}
Index Mappings
PUT /employees/_mapping/employee
{
  "employee": {
    "properties": {
      "name": {"type": "string"},
      "gender": {"type": "string", "index": "not_analyzed"},
      "email": {"type": "string", "index": "not_analyzed"},
      "dob": {"type": "date"},
      "country": {"type": "string", "index": "not_analyzed"},
      "salary": {"type": "double"}
    }
  }
}
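Why `not_analyzed` matters for fields such as `country` can be sketched with a toy analyzer. The `analyze` function below (lowercase, then split on non-word characters) is only a rough stand-in for what happens to an analyzed string field. The real analyzers are more involved, and `index_terms` is a hypothetical helper name.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def index_terms(value, index="analyzed"):
    """Terms stored for one field value in this toy model."""
    if index == "not_analyzed":
        return [value]  # kept as one exact, unmodified term
    return analyze(value)


print(index_terms("Duy Do"))                    # ['duy', 'do']
print(index_terms("VN"))                        # ['vn']
print(index_terms("VN", index="not_analyzed"))  # ['VN']
```

In this model an exact lookup for "VN" only finds the value when the field is `not_analyzed`, because the analyzed form has been lowercased to "vn".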
Delete Index
DELETE /employees
Put Data In, Get Data Out
Index a Document with ID
PUT /employees/employee/1
{
  "name": "Duy Do",
  "email": "duy.do@sentifi.com",
  "dob": "1984-06-20",
  "country": "VN",
  "gender": "male",
  "salary": 100.0
}
Index a Document without ID
POST /employees/employee/
{
  "name": "Duy Do",
  "email": "duy.do@sentifi.com",
  "dob": "1984-06-20",
  "country": "VN",
  "gender": "male",
  "salary": 100.0
}
Retrieve a Document
GET /employees/employee/1
Update a Document
POST /employees/employee/1/_update
{
  "doc": {
    "salary": 500.0
  }
}
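The difference between indexing and a partial update can be sketched with an in-memory dictionary. Everything here (`store`, `index_document`, `update_document`) is a hypothetical stand-in for the index, not a client API: indexing replaces the whole document, while an update with a "doc" body merges the given fields into the existing one.

```python
store = {}  # hypothetical in-memory stand-in for the employees index, keyed by ID


def index_document(doc_id, source):
    """Mimics PUT /employees/employee/{id}: replaces the whole document."""
    store[doc_id] = dict(source)


def update_document(doc_id, doc):
    """Mimics POST .../{id}/_update with a "doc" body: merges fields in place."""
    if doc_id not in store:
        raise KeyError(f"document {doc_id} does not exist")
    store[doc_id].update(doc)  # shallow merge, enough for flat documents
    return store[doc_id]


index_document(1, {"name": "Duy Do", "country": "VN", "salary": 100.0})
print(update_document(1, {"salary": 500.0}))
# {'name': 'Duy Do', 'country': 'VN', 'salary': 500.0}
```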
Delete a Document
DELETE /employees/employee/1
Searching
Structured Search
Dates, Times, Numbers, Text
● Finding Exact Values
● Finding Multiple Exact Values
● Ranges
● Working with Null Values
● Combining Filters
Finding Exact Values
GET /employees/employee/_search
{
  "query": {
    "term": {
      "country": "VN"
    }
  }
}
SQL: SELECT * FROM employee WHERE country = 'VN';
Finding Multiple Exact Values
GET /employees/employee/_search
{
  "query": {
    "terms": {
      "country": ["VN", "US"]
    }
  }
}
SQL: SELECT * FROM employee WHERE country = 'VN' OR country = 'US';
Ranges
GET /employees/employee/_search
{
  "query": {
    "range": {
      "dob": {"gt": "1984-01-01", "lt": "2000-01-01"}
    }
  }
}
SQL: SELECT * FROM employee WHERE dob > '1984-01-01' AND dob < '2000-01-01';
Working with Null values
GET /employees/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "exists": {"field": "email"}
      }
    }
  }
}
SQL: SELECT * FROM employee WHERE email IS NOT NULL;
Working with Null Values
GET /employees/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "missing": {"field": "email"}
      }
    }
  }
}
SQL: SELECT * FROM employee WHERE email IS NULL;
Combining Filters
GET /employees/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [{"exists": {"field": "email"}}],
          "must_not": [{"term": {"gender": "female"}}],
          "should": [{"terms": {"country": ["VN", "US"]}}]
        }
      }
    }
  }
}
Combining Filters
SQL:
SELECT * FROM employee
WHERE email IS NOT NULL
AND gender != 'female'
AND (country = 'VN' OR country = 'US');
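How the structured filters combine can be sketched by evaluating them against plain dictionaries. The `matches` function and the sample `employees` are hypothetical, and the `should` handling (at least one clause must match when any are given) is an assumption chosen to mirror the SQL on the slide.

```python
def matches(clause, doc):
    """Evaluate one filter clause against a dict (toy model, not the real engine)."""
    (kind, body), = clause.items()
    if kind == "term":
        (field, value), = body.items()
        return doc.get(field) == value
    if kind == "terms":
        (field, values), = body.items()
        return doc.get(field) in values
    if kind == "exists":
        return doc.get(body["field"]) is not None
    if kind == "missing":
        return doc.get(body["field"]) is None
    if kind == "range":
        (field, bounds), = body.items()
        value = doc.get(field)
        if value is None:
            return False
        checks = {"gt": lambda v, b: v > b, "gte": lambda v, b: v >= b,
                  "lt": lambda v, b: v < b, "lte": lambda v, b: v <= b}
        return all(checks[op](value, bound) for op, bound in bounds.items())
    if kind == "bool":
        should = body.get("should", [])
        return (all(matches(c, doc) for c in body.get("must", []))
                and not any(matches(c, doc) for c in body.get("must_not", []))
                and (not should or any(matches(c, doc) for c in should)))
    raise ValueError(f"unsupported filter: {kind}")


# Hypothetical sample data, for illustration only.
employees = [
    {"name": "A", "email": "a@example.com", "gender": "male", "country": "VN"},
    {"name": "B", "email": None, "gender": "male", "country": "US"},
    {"name": "C", "email": "c@example.com", "gender": "female", "country": "VN"},
    {"name": "D", "email": "d@example.com", "gender": "male", "country": "DE"},
]
combined = {"bool": {
    "must": [{"exists": {"field": "email"}}],
    "must_not": [{"term": {"gender": "female"}}],
    "should": [{"terms": {"country": ["VN", "US"]}}],
}}
print([e["name"] for e in employees if matches(combined, e)])  # ['A']
```

ISO-formatted date strings compare correctly as plain strings, so the same `range` clause works for the `dob` example without date parsing.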
More Queries
● Prefix
● Wildcard
● Regex
● Fuzzy
● Type
● Ids
● ...
Full-Text Search
Relevance, Analysis
● Match Query
● Combining Queries
● Boosting Query Clauses
Match Query - Single Word
GET /employees/employee/_search
{
  "query": {
    "match": {
      "name": {
        "query": "Duy"
      }
    }
  }
}
Match Query - Multi Words
GET /employees/employee/_search
{
  "query": {
    "match": {
      "name": {
        "query": "Duy Do",
        "operator": "and"
      }
    }
  }
}
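The effect of "operator": "and" can be sketched as follows. The query text is analyzed into terms, and the field must contain any of them (the default, "or") or all of them ("and"). The `analyze` and `match` functions are simplified, hypothetical stand-ins that ignore relevance scoring entirely.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: lowercase, split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def match(field_value, query, operator="or"):
    """True if the field contains any ("or") or all ("and") of the query terms."""
    field_terms = set(analyze(field_value))
    hits = [term in field_terms for term in analyze(query)]
    return all(hits) if operator == "and" else any(hits)


print(match("Duy Do", "Duy"))                     # True
print(match("Anh Do", "Duy Do"))                  # True: "do" matches
print(match("Anh Do", "Duy Do", operator="and"))  # False: "duy" is missing
```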
Combining Queries
GET /employees/employee/_search
{
  "query": {
    "bool": {
      "must": [{"match": {"name": "Do"}}],
      "must_not": [{"term": {"gender": "female"}}],
      "should": [{"terms": {"country": ["VN", "US"]}}]
    }
  }
}
Boosting Query Clauses
GET /employees/employee/_search
{
  "query": {
    "bool": {
      "must": [{"term": {"gender": "female"}}],              # default boost is 1
      "should": [
        {"term": {"country": {"value": "VN", "boost": 3}}},  # the most important
        {"term": {"country": {"value": "US", "boost": 2}}}   # more important than the default boost, less important than VN
      ]
    }
  }
}
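The ranking effect of boosts can be sketched with a deliberately simplified score: the sum of the boosts of all matching clauses. This is an assumption made for illustration. Real relevance scoring also depends on term statistics, and `score` is a hypothetical helper.

```python
def score(doc, must, should):
    """Toy score. Each clause is (field, value, boost).

    Returns None when a must clause fails, otherwise the sum of the
    boosts of every matching clause. NOT the real scoring formula.
    """
    if not all(doc.get(field) == value for field, value, _ in must):
        return None
    return sum(boost for field, value, boost in must + should
               if doc.get(field) == value)


must = [("gender", "female", 1)]                         # default boost
should = [("country", "VN", 3), ("country", "US", 2)]

print(score({"gender": "female", "country": "VN"}, must, should))  # 4
print(score({"gender": "female", "country": "US"}, must, should))  # 3
print(score({"gender": "female", "country": "DE"}, must, should))  # 1
print(score({"gender": "male", "country": "VN"}, must, should))    # None
```

Under this model, matching employees from VN rank above those from US, and both rank above employees who match only the `must` clause.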
More Queries
● Multi Match
● Common Terms
● Query Strings
● ...
Analytics
Aggregations
Analyze & Summarize
● How many needles in the haystack?
● What is the average length of the needles?
● What is the median length of the needles, broken down by manufacturer?
● How many needles are added to the haystacks each month?
● What are the most popular needle manufacturers?
● ...
Buckets & Metrics
SELECT COUNT(country)  -- a metric
FROM employee
GROUP BY country       -- a bucket
GET /employees/employee/_search
{
  "aggs": {
    "by_country": {
      "terms": {"field": "country"}
    }
  }
}
A bucket is a collection of documents that meet certain criteria.
A metric is a simple mathematical operation computed over a bucket, such as min, max, sum, or avg.
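A bucket with a count metric can be sketched in a few lines. The `terms_buckets` function and the sample documents are hypothetical. The `key` and `doc_count` names mirror how a terms aggregation reports its buckets.

```python
from collections import Counter


def terms_buckets(docs, field):
    """Group documents by a field value and count each group, largest first."""
    counts = Counter(doc[field] for doc in docs if doc.get(field) is not None)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


# Hypothetical sample data, for illustration only.
employees = [{"country": "VN"}, {"country": "US"}, {"country": "VN"}]
print(terms_buckets(employees, "country"))
# [{'key': 'VN', 'doc_count': 2}, {'key': 'US', 'doc_count': 1}]
```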
Combination
Buckets & Metrics
● Partition employees by country (bucket).
● Then partition each country bucket by gender (bucket).
● Finally, calculate the average salary for each gender bucket (metric).
Combination Query
GET /employees/employee/_search
{
  "aggs": {
    "by_country": { "terms": {"field": "country"},
      "aggs": {
        "by_gender": { "terms": {"field": "gender"},
          "aggs": {
            "avg_salary": {"avg": {"field": "salary"}}
          }
        }
      }
    }
  }
}
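What this nested aggregation computes can be sketched in plain Python: bucket by country, then by gender, then average the salaries in each innermost bucket. The `avg_salary` function and the sample data are hypothetical.

```python
from collections import defaultdict


def avg_salary(docs):
    """Bucket by country, then gender; compute the average salary per inner bucket."""
    salaries = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        salaries[doc["country"]][doc["gender"]].append(doc["salary"])
    return {
        country: {gender: sum(values) / len(values)
                  for gender, values in by_gender.items()}
        for country, by_gender in salaries.items()
    }


# Hypothetical sample data, for illustration only.
employees = [
    {"country": "VN", "gender": "male", "salary": 100.0},
    {"country": "VN", "gender": "male", "salary": 300.0},
    {"country": "VN", "gender": "female", "salary": 200.0},
    {"country": "US", "gender": "female", "salary": 400.0},
]
print(avg_salary(employees))
# {'VN': {'male': 200.0, 'female': 200.0}, 'US': {'female': 400.0}}
```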
More Aggregations
● Histogram
● Date Histogram
● Date Range
● Filter/Filters
● Missing
● Geo Distance
● Nested
● ...
Best Practices
Indexing
● Use the bulk indexing API.
● Tune your bulk size; 5-10 MB per request is a reasonable starting point.
● Partition your time series data by time period (monthly, weekly, daily).
● Use aliases for your indices.
● Turn off refresh and replicas while indexing. Turn them back on once it's done.
● Use multiple shards for parallel indexing.
● Use multiple replicas for parallel reading.
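The bulk-size advice above can be sketched as a generator that packs documents into newline-delimited bulk request bodies and starts a new body once a size limit is reached. The `bulk_bodies` name and the 5 MB default are assumptions. Sending each body to the `_bulk` endpoint is left out.

```python
import json


def bulk_bodies(docs, index, doc_type, max_bytes=5 * 1024 * 1024):
    """Yield bulk request bodies (action line + source line per document).

    A body is closed before it would exceed max_bytes; a single document
    larger than max_bytes still gets a body of its own.
    """
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index, "_type": doc_type}})
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


# Hypothetical sample data and a tiny limit, to make the chunking visible.
docs = [{"name": f"employee {i}"} for i in range(10)]
bodies = list(bulk_bodies(docs, "employees", "employee", max_bytes=200))
print(len(bodies), "bulk requests")
```

Measuring the limit in bytes rather than in document count keeps request sizes stable when document sizes vary.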
Mapping
● Disable the _all field.
● Keep the _source field; do not store individual fields.
● Use not_analyzed if possible.
Query
● Use filters instead of queries if possible.
● Consider the order and scope of your filters.
● Do not use the query string query.
● Do not load too many results with a single query; use the scroll API instead.
Tools
Kibana for Discovery, Visualization
Sense for Query
Marvel for Monitoring

Elasticsearch for Data Engineers

  • 1. Elasticsearch Crash Course for Data Engineers Duy Do (@duydo)
  • 2. About ● A Father, A Husband, A Software Engineer ● Founder of Vietnamese Elasticsearch Community ● Author of Vietnamese Elasticsearch Analysis Plugin ● Technical Consultant at Sentifi AG ● Co-Founder at Krom ● Follow me @duydo
  • 6. Elasticsearch is a distributed search and analytics engine, designed for horizontal scalability with easy management.
  • 7. Basic Terms ● Cluster is a collection of nodes. ● Node is a single server, part of a cluster. ● Index is a collection of shards ~ database. ● Shard is a collection of documents. ● Type is a category/partition of an index ~ table in database. ● Document is a JSON object ~ record in database.
  • 10. One node, One shard Node 1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } }
  • 11. Two nodes, One shard Node 1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 1, “number_of_replicas”: 0 } } Node 2
  • 12. One node, Two shards Node 1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } P1
  • 13. Two Nodes, Two Shards Node 1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 0 } } Node 2 employees P1
  • 14. Two nodes, Two shards, One replica of each shard Node 1 employees P0 PUT /employees { “settings”: { “number_of_shards”: 2, “number_of_replicas”: 1 } } R1 Node 2 employees P1 R0
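The layouts on slides 10-14 can be reproduced with a small Python sketch. The round-robin placement below is an assumption made for illustration only; Elasticsearch's real allocator weighs many more factors. The one rule it shares with the real thing is that a replica never sits on the same node as its primary. The function names are hypothetical.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard has `number_of_replicas` extra copies."""
    return number_of_shards * (1 + number_of_replicas)


def toy_allocate(number_of_shards, number_of_replicas, nodes):
    """Toy round-robin placement (NOT Elasticsearch's real algorithm).

    Copy c of shard s goes to node (s + c) % len(nodes), so the copies
    of one shard always land on different nodes.
    """
    if number_of_replicas >= len(nodes):
        # Real Elasticsearch would leave such replicas unassigned;
        # this toy version simply refuses.
        raise ValueError("not enough nodes to separate all copies")
    layout = {node: [] for node in nodes}
    for shard in range(number_of_shards):
        for copy in range(number_of_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            layout[node].append(label)
    return layout


if __name__ == "__main__":
    # Slide 14: two nodes, two shards, one replica of each shard.
    print(total_shard_copies(2, 1))
    print(toy_allocate(2, 1, ["Node 1", "Node 2"]))
```

With two shards and one replica the index holds four shard copies in total, which is why the second node in slide 14 carries both a primary and a replica.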
  • 16. Create Index PUT /employees { “settings”: {...}, “mappings”: { “type_one”: {...}, “type_two”: {...} }, “aliases”: { “alias_one”: {...}, “alias_two”: {...} } }
  • 18. Index Mappings PUT /employees/_mappings { “employee”: { “properties”: { “name”: {“type”: “string”}, “gender”: {“type”: “string”, “index”: “not_analyzed”}, “email”: {“type”: “string”, “index”: “not_analyzed”}, “dob”: {“type”: “date”}, “country”: {“type”: “string”, “index”: “not_analyzed”}, “salary”: {“type”: “double”} } } }
  • 20. Put Data In, Get Data Out
  • 21. Index a Document with ID PUT /employees/employee/1 { “name”: “Duy Do”, “email”: “duy.do@sentifi.com”, “dob”: “1984-06-20”, “country”: “VN”, “gender”: “male”, “salary”: 100.0 }
  • 22. Index a Document without ID POST /employees/employee/ { “name”: “Duy Do”, “email”: “duy.do@sentifi.com”, “dob”: “1984-06-20”, “country”: “VN”, “gender”: “male”, “salary”: 100.0 }
  • 23. Retrieve a Document GET /employees/employee/1
  • 24. Update a Document POST /employees/employee/1/_update { “doc”:{ “salary”: 500.0 } }
  • 25. Delete a Document DELETE /employees/employee/1
  • 27. Structured Search Date, Times, Numbers, Text ● Finding Exact Values ● Finding Multiple Exact Values ● Ranges ● Working with Null Values ● Combining Filters
  • 28. Finding Exact Values GET /employees/employee/_search { “query”: { “term”: { “country”: “VN” } } } SQL: SELECT * FROM employee WHERE country = ‘VN’;
  • 29. Finding Multiple Exact Values GET /employees/employee/_search { “query”: { “terms”: { “country”: [“VN”, “US”] } } } SQL: SELECT * FROM employee WHERE country = ‘VN’ OR country = ‘US’;
  • 30. Ranges GET /employees/employee/_search { “query”: { “range”: { “dob”: {“gt”: “1984-01-01”, “lt”: “2000-01-01”} } } } SQL: SELECT * FROM employee WHERE dob > ‘1984-01-01’ AND dob < ‘2000-01-01’;
  • 31. Working with Null Values GET /employees/employee/_search { “query”: { “filtered”: { “filter”: { “exists”: {“field”: “email”} } } } } SQL: SELECT * FROM employee WHERE email IS NOT NULL;
  • 32. Working with Null Values GET /employees/employee/_search { “query”: { “filtered”: { “filter”: { “missing”: {“field”: “email”} } } } } SQL: SELECT * FROM employee WHERE email IS NULL;
  • 33. Combining Filters GET /employees/employee/_search { “query”: { “filtered”: { “filter”: { “bool”: { “must”:[{“exists”: {“field”: “email”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } } } }
  • 34. Combining Filters SQL: SELECT * FROM employee WHERE email IS NOT NULL AND gender != ‘female’ AND (country = ‘VN’ OR country = ‘US’);
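The `filtered` query and the `missing` filter used above belong to older Elasticsearch releases; to my knowledge both were deprecated in 2.x and removed in 5.0. As a hedged sketch, the Python below builds what should be an equivalent request body for newer versions using a `bool` query. The helper names are hypothetical, and `minimum_should_match` is set explicitly because its default for `should` clauses has changed between versions. A second function mirrors the same clauses as a plain predicate so the logic can be checked without a cluster.

```python
import json


def combined_filter_body(countries):
    """Build a bool query: email exists, gender is not female,
    and country is one of `countries`."""
    return {
        "query": {
            "bool": {
                "filter": [{"exists": {"field": "email"}}],
                "must_not": [{"term": {"gender": "female"}}],
                "should": [{"terms": {"country": list(countries)}}],
                # Require at least one `should` clause to match.
                "minimum_should_match": 1,
            }
        }
    }


def matches(doc, countries):
    """In-memory mirror of the bool clauses above, for illustration."""
    return (
        doc.get("email") is not None
        and doc.get("gender") != "female"
        and doc.get("country") in countries
    )


if __name__ == "__main__":
    print(json.dumps(combined_filter_body(["VN", "US"]), indent=2))
```

In newer versions, "email IS NULL" would be expressed by moving the `exists` clause into `must_not`.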
  • 35. More Queries ● Prefix ● Wildcard ● Regex ● Fuzzy ● Type ● Ids ● ...
  • 36. Full-Text Search Relevance, Analysis ● Match Query ● Combining Queries ● Boosting Query Clauses
  • 37. Match Query - Single Word GET /employees/employee/_search { “query”: { “match”: { “name”: { “query”: “Duy” } } } }
  • 38. Match Query - Multi Words GET /employees/employee/_search { “query”: { “match”: { “name”: { “query”: “Duy Do”, “operator”: “and” } } } }
  • 39. Combining Queries GET /employees/employee/_search { “query”: { “bool”: { “must”:[{“match”: {“name”: “Do”}}], “must_not”:[{“term”: {“gender”: “female”}}], “should”:[{“terms”: {“country”: [“VN”, “US”]}}] } } }
  • 40. Boosting Query Clauses GET /employees/employee/_search { “query”: { “bool”: { “must”:[{“term”: {“gender”: “female”}}], # 1: default boost 1 “should”:[ {“term”: {“country”: {“value”:“VN”, “boost”:3}}}, # 2: the most important {“term”: {“country”: {“value”:“US”, “boost”:2}}} # 3: more important than #1 but not as important as #2 ] } } }
  • 41. More Queries ● Multi Match ● Common Terms ● Query Strings ● ...
  • 43. Aggregations Analyze & Summarize ● How many needles in the haystack? ● What is the average length of the needles? ● What is the median length of the needles, broken down by manufacturer? ● How many needles are added to the haystacks each month? ● What are the most popular needle manufacturers? ● ...
  • 44. Buckets & Metrics SELECT COUNT(country) # a metric FROM employee GROUP BY country # a bucket GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”: {“field”: “country”} } } }
  • 45. Bucket is a collection of documents that meet certain criteria.
  • 46. Metric is a simple mathematical operation such as min, max, sum, or avg.
  • 47. Combination Buckets & Metrics ● Partition employees by country (bucket) ● Then partition each country bucket by gender (bucket) ● Finally calculate the average salary for each gender bucket (metric)
  • 48. Combination Query GET /employees/employee/_search { “aggs”: { “by_country”: { “terms”: {“field”: “country”}, “aggs”: { “by_gender”: { “terms”: {“field”: “gender”}, “aggs”: { “avg_salary”: {“avg”: {“field”: “salary”}} } } } } } }
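To make the nested buckets concrete, here is a small in-memory Python sketch of what the combination query computes: a country bucket, a gender bucket inside it, and an average-salary metric per gender bucket. It does not call Elasticsearch, and the sample records are invented for illustration.

```python
from collections import defaultdict


def avg_salary_by_country_and_gender(employees):
    """Return {country: {gender: average salary}} for a list of dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # (country, gender) -> [sum, count]
    for employee in employees:
        key = (employee["country"], employee["gender"])
        totals[key][0] += employee["salary"]
        totals[key][1] += 1
    result = {}
    for (country, gender), (total, count) in totals.items():
        result.setdefault(country, {})[gender] = total / count
    return result


# Made-up sample data, not taken from the slides.
SAMPLE = [
    {"country": "VN", "gender": "male", "salary": 100.0},
    {"country": "VN", "gender": "male", "salary": 300.0},
    {"country": "VN", "gender": "female", "salary": 250.0},
    {"country": "US", "gender": "female", "salary": 400.0},
]

if __name__ == "__main__":
    print(avg_salary_by_country_and_gender(SAMPLE))
```

Each outer key corresponds to a `by_country` bucket, each inner key to a `by_gender` bucket, and each value to the `avg_salary` metric.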
  • 49. More Aggregations ● Histogram ● Date Histogram ● Date Range ● Filter/Filters ● Missing ● Geo Distance ● Nested ● ...
  • 51. Indexing ● Use bulk indexing APIs. ● Tune your bulk size to around 5-10 MB. ● Partition your time series data by time period (monthly, weekly, daily). ● Use aliases for your indices. ● Turn off refresh and replicas while indexing; turn them back on once it’s done. ● Use multiple shards for parallel indexing. ● Use multiple replicas for parallel reading.
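As a minimal sketch of the first tip, the Python below builds the newline-delimited body that the `_bulk` endpoint expects: one action line followed by one source line per document, with a trailing newline. The helper name is hypothetical, and the `_type` entry follows the typed indices used throughout this deck; newer Elasticsearch versions drop mapping types, so it would be omitted there.

```python
import json


def bulk_index_body(index, doc_type, docs):
    """Serialize (doc_id, source) pairs into a _bulk request body."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))   # action/metadata line
        lines.append(json.dumps(source))   # document source line
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [
        ("1", {"name": "Duy Do", "country": "VN", "salary": 100.0}),
        ("2", {"name": "Jane Roe", "country": "US", "salary": 200.0}),  # invented record
    ]
    body = bulk_index_body("employees", "employee", docs)
    print(body, end="")
    print(len(body.encode("utf-8")), "bytes")
```

Measuring the encoded body size, as in the last line, is one simple way to keep each request near the bulk size you are tuning for.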
  • 52. Mapping ● Disable _all field ● Keep _source field, do not store any field. ● Use not_analyzed if possible
  • 53. Query ● Use filters instead of queries if possible. ● Consider the order and scope of your filters. ● Do not use query string queries. ● Do not load too many results with a single query; use the scroll API instead.
  • 54. Tools
  • 55. Kibana for Discovery, Visualization