Introduction to Elastic Search and Beyond
Elasticsearch is a tool for querying written words. It can perform some other nifty tasks, but at its core it’s made for wading through text, returning text similar to a given query and/or statistical analyses of a corpus of text.

More specifically, Elasticsearch is a standalone database server, written in Java, that takes data in and stores it in a sophisticated format optimized for language-based searches. Working with it is convenient because its main protocol is implemented with HTTP/JSON. Elasticsearch is also easily scalable, supporting clustering and leader election out of the box.

Whether it’s searching a database of retail products by description, finding similar text in a body of crawled web pages, or searching through posts on a blog, Elasticsearch is a fantastic choice. When facing the task of cutting through the semi-structured muck that is natural language, Elasticsearch is an excellent tool.

Published in: Education, Technology
Intro to Elasticsearch and beyond

  1. 1. Introduction to Elastic Search and Beyond
  2. 2. Foursquare • Searching 50 million venues in real time? Foursquare does it using Elasticsearch
  3. 3. GitHub • Searches 20TB of data using Elasticsearch, including 1.3 billion files and 130 billion lines of code.
  4. 4. SoundCloud • Uses Elasticsearch to provide immediate and relevant results for their online audio distribution platform reaching 180 million people
  5. 5. StumbleUpon • Elasticsearch is mission critical to StumbleUpon, delivering millions of recommendations to their community every day.
  6. 6. Fog Creek • Elasticsearch powers Fog Creek Software’s instant search for 30 million requests per month across 40 billion lines of code.
  7. 7. Agenda ● Introduction ● Setting up Elastic Search ● Creating & Querying an ES index ● Analyzer ● Facets ● Percolator
  8. 8. Introduction Elastic What ???
  9. 9. Elastic Whattt .... ?? • ElasticSearch is a distributed, RESTful, free/open source search server based on Apache Lucene. • Developed by Shay Banon, written in Java. • Latest and Greatest - 0.90.1
  10. 10. Features and Advantages
  11. 11. Document Oriented & Schema Free ● Complex entities are stored as JSON documents. ● No need for a schema upfront. Detects the schema on the fly. ● The schema can also be customized at the field level.
  12. 12. Distributed & Highly Available ● Built to scale horizontally out of the box. ● Each index is fully sharded with a configurable number of shards, and each shard can have zero or more replicas. ● Each node hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically. ● Clusters will detect and remove failed nodes, and reorganize themselves to ensure that your data is safe and accessible.
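
To make "delegate operations to the correct shard(s)" a little more concrete, here is a minimal Python sketch of hash-based routing: a routing value (the document ID by default) is hashed and taken modulo the number of primary shards. The `pick_shard` helper, the shard count of 5, and the use of `zlib.crc32` are assumptions for illustration only; Elasticsearch uses its own internal hash function.

```python
import zlib


def pick_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Map a document ID to a primary shard number (toy stand-in hash)."""
    # crc32 is only an illustrative, deterministic hash. It is an assumption,
    # not the hash function Elasticsearch uses internally.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


shards = 5  # hypothetical number of primary shards for the index
placement = {doc_id: pick_shard(doc_id, shards) for doc_id in ["1", "2", "3", "abc"]}
print(placement)
```

Because the shard is derived from the ID and the shard count, any node can work out where a document lives without asking the others, which is also why the number of primary shards is fixed when the index is created.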
  13. 13. Distributed Search Master Node Search for Employee b
  14. 14. Redundancy Primary Shards Master Node Redundant Shards
  15. 15. Node Auto Discovery Relegated to a data node role Takes over as Master
  16. 16. Fail Safe New Master Node Automatically assumes Master role in case Master node goes down.
  17. 17. Multi Tenant with Multi Types ● A cluster can host multiple indices which can be queried independently or as a group. curl -XGET 'http://localhost:9200/demo,demo2/employee/_search?q=quick' ● Support for more than one type per index.
  18. 18. RESTful API & Clients ● Any action can be performed using a simple RESTful API using JSON over HTTP. ● Native Java and Groovy clients. ● Community-supported clients for Perl, Python, Ruby, PHP, JavaScript, Scala, Clojure and so on ...
  19. 19. What can it do? ● Search: Find all companies in the ‘search’ domain. ● Facets: Return the average annual revenue for all companies. ● Combine: Return the average annual revenue for all companies founded between 2010 and 2012 in the ‘search’ domain. ● Indexing: Index thousands of documents per second. All in near real time.
  20. 20. Use Cases Any application that needs searching, data analysis, aggregation capabilities.
  21. 21. Setting Up Installing Elastic Search Adding the HEAD plugin
  22. 22. How to install and run it? ● Download & extract from http://www.elasticsearch.org/download/ ● $ bin/elasticsearch -f ● There is no step 3 :D
  23. 23. Add On ... ● There is a nice web front end tool for browsing and interacting with an Elastic Search cluster. git clone git://github.com/mobz/elasticsearch-head.git firefox <cloned_dir_location>/index.html
  24. 24. Basic Operations Creating and Querying an Index
  25. 25. Quick API Intro curl -X<method-type> '<host>:<port>/<indices>/<type>' Elastic Search → Relational DB: Indices or Index → Database; Type → Table; Field → Column
  26. 26. Quick API Intro ● Index a document curl -XPOST 'localhost:9200/demo/employee' -d '{ "name" : "Mohit Garg", "age" : 28 }' ● Delete document curl -XDELETE 'localhost:9200/demo/employee/1'
  27. 27. Quick API Intro ● Finding a document based on free text search. curl -XGET 'localhost:9200/demo/_search?q=Mohit' ● Get/Retrieve a document by ID curl -XGET 'localhost:9200/crunchbase/person/1' ● Using a Query DSL curl -XPOST 'localhost:9200/_search?pretty' -d '{ "query" : { .... } }'
  28. 28. Quick API Intro ● Update (Update if found, else fail) curl -XPOST 'localhost:9200/demo/employee/3/_update' -d '{ "doc" : { "name" : "Mohit Garg" } }' ● Upsert (Update if found, else insert) curl -XPOST 'localhost:9200/demo1/employee/abc/_update' -d '{ "doc" : { "name" : "Mohit Garg" }, "upsert" : { "name" : "Mohit Garg", "age" : "24" } }'
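
The curl calls on the "Quick API Intro" slides map directly onto plain HTTP requests. The sketch below, using only the Python standard library, builds (without sending) a few of them so the method, URL and JSON body can be inspected. `build_request` and `BASE_URL` are hypothetical names, the `Content-Type` header is an addition not shown on the slides, and the `demo/employee` paths are reused from the slides for every call.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # host and port used on the slides


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


index_req = build_request("POST", "/demo/employee", {"name": "Mohit Garg", "age": 28})
get_req = build_request("GET", "/demo/employee/1")
update_req = build_request("POST", "/demo/employee/3/_update", {"doc": {"name": "Mohit Garg"}})
delete_req = build_request("DELETE", "/demo/employee/1")

for req in (index_req, get_req, update_req, delete_req):
    print(req.get_method(), req.full_url)

# To actually send one against a running node:
#   from urllib.request import urlopen
#   print(urlopen(index_req).read().decode("utf-8"))
```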
  29. 29. Searching Criteria
  30. 30. Match All Query A query that matches all documents. { "query": { "match_all" : { } } }
  31. 31. Query String Query ● A query that uses a query parser in order to parse its content. { "query": { "query_string": { "query": "Female OR Stephania" } } }
  32. 32. Range Query ● Matches documents with fields that have terms within a certain range. The following example returns all documents where age is between 10 and 20. { "query" : { "range" : { "age" : { "from" : 10, "to" : 20 } } } }
  33. 33. Sorting And Pagination ● Pagination { "from" : 5, "size" : 10, "query" : {...} } ● Sorting { "sort" : [{ "salary" : { "order" : "desc" } }], "query" : {...} }
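
As a rough illustration of how the pagination and sorting keys combine with a query, here is a small Python sketch that assembles a request body like the ones on the slides above. `build_search_body` and its parameters are hypothetical helpers, not part of any Elasticsearch client; the body it prints would be sent with `-d` to a `_search` endpoint.

```python
import json


def build_search_body(query, page=0, page_size=10, sort_field=None, descending=True):
    """Assemble a search request body with pagination and optional sorting."""
    body = {
        "from": page * page_size,  # number of hits to skip
        "size": page_size,         # number of hits to return
        "query": query,
    }
    if sort_field is not None:
        order = "desc" if descending else "asc"
        body["sort"] = [{sort_field: {"order": order}}]
    return body


# Range query from the slides: age between 10 and 20
age_range = {"range": {"age": {"from": 10, "to": 20}}}
body = build_search_body(age_range, page=2, page_size=10, sort_field="salary")
print(json.dumps(body, indent=2))
```

Computing `from` out of a page number keeps callers from doing offset arithmetic by hand; the third page of ten results starts at offset 20.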
  34. 34. Analyzer
  35. 35. Indexing it Right !!! 1) The quick brown fox jumps over the lazy dog 2) Fast jumping spiders
  36. 36. Indexing it Right !!! 1) The quick brown fox jumps over the lazy dog 2) Fast jumping spiders. Hmmm... Let's index everything ... Term → documents: The → 1; quick → 1; brown → 1; fox → 1; jumps → 1; over → 1; the → 1; lazy → 1; dog → 1; Fast → 2; jumping → 2; spiders → 2
  37. 37. Indexing it Right !!! 1) The quick brown fox jumps over the lazy dog 2) Fast jumping spiders. Wait.... Do we really need words like “the”, “a”, “or” ? Term → documents: quick → 1; brown → 1; fox → 1; jumps → 1; lazy → 1; dog → 1; Fast → 2; jumping → 2; spiders → 2
  38. 38. Indexing it Right !!! 1) The quick brown fox jumps over the lazy dog 2) Fast jumping spiders. Wait .... Some words “stem” from others. Term → documents: quick → 1; brown → 1; fox → 1; jump → 1,2; lazy → 1; dog → 1; Fast → 2; spiders → 2
  39. 39. Indexing it Right !!! 1) The quick brown fox jumps over the lazy dog 2) Fast jumping spiders. Wait .... Some words are simply synonyms of others. Term → documents: quick → 1,2; brown → 1; fox → 1; jump → 1,2; lazy → 1; dog → 1; Fast → 1,2; spiders → 2
  40. 40. Indexing it Right !!! 1) The quick brown fox jumps over the lazy dog 2) Fast jumping spiders. Wait .... Case Insensitive ?? Term → documents: quick → 1,2; brown → 1; fox → 1; jump → 1,2; lazy → 1; dog → 1; fast → 1,2; spiders → 2
  41. 41. Pre Process before Indexing ● Drop stopwords ● Lowercase everything ● Reduce words to stems ● Consider synonyms ● Let's test it out on ES: curl -XGET 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase,stop' -d 'The quick brown Fox jumped over the lazy Dog'
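
The pre-processing steps above can be sketched in a few lines of Python. This is a toy analyzer, not what Elasticsearch or Lucene actually run: the stopword list, the synonym map, and the suffix-stripping `stem` function are all simplifying assumptions, and the naive stemmer also reduces "spiders" to "spider", which the slides do not.

```python
STOPWORDS = {"the", "a", "or", "over"}   # tiny illustrative stopword list
SYNONYMS = {"fast": "quick"}             # map a synonym to one canonical term


def stem(token):
    """Very naive suffix stripping; real stemmers (e.g. Snowball) do far more."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, drop stopwords, stem, then normalize synonyms."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]


def build_inverted_index(docs):
    """Map each analyzed term to the sorted list of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Fast jumping spiders",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

The same `analyze` function would have to be applied to the query text as well, so that a search for "Fast" or "jumps" lands on the normalized terms stored in the index.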
  42. 42. Text analysis in ElasticSearch Text is analyzed at 2 levels .. ● In document fields when preparing it for indexing. ● In query when preparing it for searching.
  43. 43. Configuration ● Analyzers can be configured per index. curl -XPUT 'localhost:9200/demo' -d '{ "settings" : { "analysis" : { "analyzer" : { "my_analyzer" : { "tokenizer" : "whitespace", "filter" : [ "lowercase", "stop", "snowball" ] } } } } }'
  44. 44. What is Mapping? ● The indexed data is based on documents and fields. ● Mapping defines how these documents should be handled: ➢ How should the documents be indexed? ➢ What are the data types of the document fields? ➢ How to treat object typed fields? ➢ What are the relations between different types of docs? ➢ How to handle document metadata? ➢ Define boosts per field / document type
  45. 45. Elasticsearch Data Types ● Core Types include but are not limited to String, Boolean, Number (can be Double, Float, Integer etc.), Binary. ● Array ● Object ● Nested ● Multi-Field
  46. 46. Mapping API curl -XPOST 'localhost:9200/demo123' -d '{ "mappings" : { "employee" : { "properties" : { "name" : { "type" : "string", "analyzer" : "snowball" }, "department" : { "type" : "object", "properties" : { "since" : { "type" : "date" }, "name" : { "type" : "string" } } } } } } }' Get the current mappings of a specific type : curl -XGET 'localhost:9200/demo/employee/_mapping'
  47. 47. Advanced Queries
  48. 48. Match Query ● A query that accepts text/numerics/dates, analyzes the input, and constructs a query out of it. ● The default match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. { "query": { "match": { "name": { "query": "Mohit Garg", "operator": "and" } } } }
  49. 49. Term Or Terms Query ● Term Query : Matches documents that have fields that contain a term (not analyzed). { "query":{ "term" : { "age" : 24 } }} ● Terms Query : Same as Term Query but you can specify multiple values. { "query": { "terms": { "age": [ 30,40,50 ], "minimum_match": 2 }}}
  50. 50. Bool Query ● A query that matches documents matching boolean combinations of other queries ➢ must - The clause (query) must appear in matching documents. ➢ should - The clause (query) should appear in matching documents; use the minimum_number_should_match parameter to specify how many should clauses must match. ➢ must_not - The clause (query) must not appear in the matching documents.
  51. 51. Bool Query Example { "query": { "bool": { "must": { "term": { "gender": "Female" } }, "must_not": { "range": { "salary": { "from": 200, "to": 300 } } }, "should": [ { "match": { "department": "HR" } }, { "term": { "salary": 400 } } ], "minimum_number_should_match": 1 } } }
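
To see how `must`, `must_not`, `should` and `minimum_number_should_match` interact, the following Python sketch evaluates the same boolean logic over a few in-memory dictionaries. It only illustrates the matching semantics (not scoring, analysis, or how Elasticsearch executes the query); `matches_bool` and the sample employees are hypothetical.

```python
def matches_bool(doc, must=(), must_not=(), should=(), minimum_number_should_match=0):
    """Evaluate bool-query matching semantics with plain Python predicates."""
    if not all(clause(doc) for clause in must):
        return False  # every must clause has to match
    if any(clause(doc) for clause in must_not):
        return False  # no must_not clause may match
    should_hits = sum(1 for clause in should if clause(doc))
    return should_hits >= minimum_number_should_match


# Hypothetical sample documents
employees = [
    {"name": "A", "gender": "Female", "salary": 400, "department": "IT"},
    {"name": "B", "gender": "Female", "salary": 250, "department": "HR"},
    {"name": "C", "gender": "Male", "salary": 400, "department": "HR"},
    {"name": "D", "gender": "Female", "salary": 900, "department": "IT"},
]

hits = [
    e["name"]
    for e in employees
    if matches_bool(
        e,
        must=[lambda d: d["gender"] == "Female"],
        must_not=[lambda d: 200 <= d["salary"] <= 300],
        should=[lambda d: d["department"] == "HR", lambda d: d["salary"] == 400],
        minimum_number_should_match=1,
    )
]
print(hits)
```

Employee B is excluded by the `must_not` salary range even though the HR `should` clause matches, and D fails only because no `should` clause matches while at least one is required.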
  52. 52. Filters
  53. 53. Logical Filter (And, Or, Not) And, Or Filter : A filter that matches documents using the “and” or “or” boolean operator on other filters. { "filter": { "and": [ { "range": { "age": { "from": 20, "to": 30 } } }, { "term": { "salary": 2600 } } ] } } Not Filter : "filter": { "not": { "term": { "salary": 2600 } } }
  54. 54. Exists Or Missing Filter Exists Filter : Filters documents where a specific field has a value in them. { "query": { "match_all": {} }, "filter": { "exists": { "field": "age" } } } Missing Filter : Filters documents where a specific field has no value in them.
  55. 55. Faceting
  56. 56. Faceted Search ● The usual purpose of a full-text search engine is to return a small number of documents matching your query. ● Facets provide aggregated data based on a search query. ● In the simplest case, a terms facet can return facet counts for various facet values for a specific field. ● ElasticSearch supports more facet implementations, such as statistical or date histogram facets.
  57. 57. Facets .. What they look like
  58. 58. How to use Facet? ● Suppose we want to get the facet counts of age field. { "query" : { "match_all" : { } }, "facets" : { "age_facets" : { "terms" : { "field" : "age" }}}}
  59. 59. Facets Filters { "query" : { "match_all" : {} }, "facets" : { "age_facets" : { "terms" : { "field" : "age" }, "facet_filter" : { "range" : { "salary" : { "from" : 100, "to" : 150 } } } } } }
  60. 60. Range Facet ● Enables splitting numeric & date field values into range “buckets” and returns aggregated document counts per “bucket” (stats per range) { "query": { "match_all": {} }, "facets": { "age_groups": { "range": { "field": "age", "ranges": [ { "to": 25 }, {"from": 26,"to": 40}, {"from": 45} ] } } }}
  61. 61. Statistical Facet ● Computes statistics from numeric values "facets" : { "statistics" : { "statistical" : { "field" : "age" } } }
  62. 62. Histogram facet ● Computes a distribution of numeric values into fixed interval buckets. Each value is “rounded” into the bucket. So a value of 135 and a bucket interval of 10 will “fall” into bucket 130 ● Get the age distribution in intervals of 5 "facets": { "<facet_name>": { "histogram": { "field" : "age", "interval" : 5 } } }
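
The arithmetic behind the terms, statistical and histogram facets is simple enough to sketch in plain Python. The functions below are hypothetical stand-ins that work on an in-memory list of values; they are meant to show what gets computed, not the response format Elasticsearch returns.

```python
from collections import Counter


def terms_facet(values):
    """Count occurrences of each distinct value, like a terms facet."""
    return dict(Counter(values))


def statistical_facet(values):
    """Basic statistics over numeric values."""
    total = sum(values)
    return {
        "count": len(values),
        "total": total,
        "min": min(values),
        "max": max(values),
        "mean": total / len(values),
    }


def histogram_facet(values, interval):
    """Round each value down to its bucket key and count values per bucket."""
    buckets = Counter((v // interval) * interval for v in values)
    return dict(sorted(buckets.items()))


ages = [22, 24, 24, 31, 38, 38, 38, 45]  # hypothetical "age" field values
print(terms_facet(ages))
print(statistical_facet(ages))
print(histogram_facet(ages, 5))
print(histogram_facet([135], 10))  # the slide's example: 135 falls into bucket 130
```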
  63. 63. Percolator Search ... Reversed
  64. 64. What is Percolation? ● “Reversed search” Instead of storing documents, and then searching them with queries ... store queries and “percolate” documents through them
  65. 65. Percolator API • Registering a Query curl -XPUT 'http://localhost:9200/_percolator/demo/testpercolator' -d '{ "query":{ "term":{ "gender":"Female" } } }'
  66. 66. Percolator API ● Filter a document against a Percolator curl -XGET 'http://localhost:9200/demo/employee/_percolate' -d '{ "doc": { "age": 38, "name": "Alexis Wuckert", "gender": "Female", "salary": 3800 } }' Response { "ok" : true, "matches" : ["testpercolator"] }
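
The "store queries, then run documents through them" idea can be shown with a toy in-memory percolator in Python. `ToyPercolator` and the extra `high_salary` query are hypothetical; queries are modelled as plain predicates rather than Query DSL, so this only mirrors the shape of the register and percolate calls above.

```python
class ToyPercolator:
    """Stores named queries and reports which ones match a document."""

    def __init__(self):
        self._queries = {}

    def register(self, name, predicate):
        """Register a query (here: a predicate) under a name."""
        self._queries[name] = predicate

    def percolate(self, doc):
        """Return the sorted names of all registered queries matching the doc."""
        return sorted(name for name, pred in self._queries.items() if pred(doc))


percolator = ToyPercolator()
percolator.register("testpercolator", lambda d: d.get("gender") == "Female")
# Hypothetical second query, added only to show a non-matching case
percolator.register("high_salary", lambda d: d.get("salary", 0) > 5000)

doc = {"age": 38, "name": "Alexis Wuckert", "gender": "Female", "salary": 3800}
matches = percolator.percolate(doc)
print(matches)
```

This is the pattern behind alerting features: each saved search is a registered query, and every incoming document is percolated to find out whose alert should fire.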
  67. 67. References ● www.elasticsearch.com ● http://www.elasticsearch.org/guide/ ● fogcreek.com (Video) ● http://www.slideshare.net/dnoble00/scaling-analytics-withelasticsearch ● And lots of Googling ...
  68. 68. Let’s Connect IntelliGrape Software (P) Ltd SDF L-6, NSEZ, Noida Phase 2, India Phone : +91-120-6493668 Fax : +91-120-4207689 Email : query@intelligrape.com /company/intelligrape /intelligrape.software /IntelliGrape
