Montreal Elasticsearch Meetup

Loïc Bertron
Director of Research & Development @Cedrom-SNI
Working on Big Data for Cedrom-SNI : social media, TV & radio aggregation
Introduced Elasticsearch at Cedrom-SNI
Cedrom-SNI
10k+ different sources, 750k+ new docs/day
Our job : ingesting, enriching, and extracting analytics and intelligence from docs
loic.bertron@cedrom-sni.com
linkedin.com/in/loicbertron
@loicbertron
Who am I ?

«ElasticSearch offers advanced search features to any application or website easily, and scales to large amounts of data.»
ElasticSearch

Simple : Plug & Play - Schema free - RESTful API
Elastic : Automatically discovers all other instances
Strong : Replication & Load balancing - Scales massively - Lucene based
Fast : Requests executed in parallel - Real Time
Full featured : Search, Analytics, Facets, Percolator, Geo search, Suggest, Plugins …
What is ElasticSearch ?

Document as JSON
• Object representing your data
• Grouped in an index
• One index can have multiple types of documents
{
"message": "Introducing #ElasticSearch",
"post_date": "2014-03-12T18:30:00",
"author": {
"first_name" : "Loïc",
"email" : "loic.bertron@cedrom-sni.com"
},
"employee_at_Cedrom" : true,
"Tags" : ["Meetup","Montreal"]
}

• REST API : http://host:port/[index]/[type]/[_action/id]
• HTTP methods : GET, POST, PUT, DELETE
• Documents
• http://node1:9200/twitter/tweet/1 (POST)
• http://node1:9200/twitter/tweet/1 (GET)
• http://node1:9200/twitter/tweet/1 (DELETE)
• Search
• http://node1:9200/twitter/tweet/_search (GET)
• http://node1:9200/twitter/_search (GET)
• http://node1:9200/_search (GET)
• Metadata
• http://node1:9200/twitter/_status (GET)
• http://node1:9200/_shutdown (POST)
API
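The URL pattern above can be assembled mechanically. A minimal Python sketch, assuming a hypothetical helper named `build_url` (it is not part of any Elasticsearch client) and the `node1:9200` host used on the slides:

```python
from typing import Optional


def build_url(host: str, port: int, index: Optional[str] = None,
              doc_type: Optional[str] = None,
              action_or_id: Optional[str] = None) -> str:
    """Assemble http://host:port/[index]/[type]/[_action/id], skipping empty parts."""
    parts = [part for part in (index, doc_type, action_or_id) if part]
    return "http://{}:{}/{}".format(host, port, "/".join(parts))


# One document, then the three search scopes listed on the slide.
print(build_url("node1", 9200, "twitter", "tweet", "1"))
print(build_url("node1", 9200, "twitter", "tweet", "_search"))
print(build_url("node1", 9200, "twitter", action_or_id="_search"))
print(build_url("node1", 9200, action_or_id="_search"))
```

Leaving out the type or the index simply widens the scope of the search, from one type to one index to the whole cluster.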

Index a document
$ curl -X PUT http://node1:9200/twitter/tweet/1 -d '{
"user": "loicbertron",
"post_date": "2014-03-12T18:30:00",
"message": "Introducing #ElasticSearch"
}'

{
"ok":true,
"_index":"twitter",
"_type":"tweet",
"_id":"1",
"_version":1
}

Update a document
$ curl -X PUT http://node1:9200/twitter/tweet/1 -d '{
"user": "loicbertron",
"post_date": "2014-03-12T18:40:00",
"message": "Introducing #ElasticSearch to the #Community"
}'

{
"ok":true,
"_index":"twitter",
"_type":"tweet",
"_id":"1",
"_version":2
}

Search for documents
$ curl -XGET http://node1:9200/twitter/tweet/_search -d '{
"query": {
"term": { "message": "elasticsearch" }
}
}'
$ curl -XGET 'http://node1:9200/twitter/tweet/_search?q=elasticsearch'

{
"took" : 24,
"timed_out" : false,
"_shards" : { "total" : 2, "successful" : 2, "failed" : 0 },
"hits" : {
"total" : 1,
"max_score" : 0.227,
"hits" : [ {
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_score" : 0.227,
"_source" : {
"user": "loicbertron",
"post_date": "2014-03-12T18:40:00",
"message": "Introducing #ElasticSearch to the #Community"
}
} ]
}
}

Reading the response
• "took" : execution time, in milliseconds
• "hits.total" : number of matching documents
• "_type", "_id" : information identifying each hit
• "_score" : relevance score of the hit
• "_source" : the document itself

Search operand

Terms       quebec
            quebec ontario
Phrases     "city of montréal"
Proximity   "montreal collusion"~5
Fuzzy       schwarzenegger~0.8
Wildcards   queb*
Boosting    quebec^5 montreal
Range       [2011/03/12 TO 2014/03/12]
            [java TO json]
Boolean     quebec AND NOT montreal
            +quebec -montreal
            (quebec OR ottawa) AND NOT toronto
Fields      title:montreal^10 OR body:montreal

$ curl -XGET 'http://node1:9200/twitter/tweet/_search?q=<Your Query>'
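Operators such as `+`, `^`, `:` and spaces are not safe inside a URL as-is (a literal `+` in a query string is commonly decoded as a space), so the query usually has to be percent-encoded before it is placed after `?q=`. A small sketch using Python's standard `urllib.parse.quote`; `search_url` is a hypothetical helper name:

```python
from urllib.parse import quote


def search_url(base: str, query: str) -> str:
    """Return base + '?q=' + the percent-encoded query string."""
    return "{}?q={}".format(base, quote(query, safe=""))


BASE = "http://node1:9200/twitter/tweet/_search"

print(search_url(BASE, "+quebec -montreal"))
print(search_url(BASE, "title:montreal^10 OR body:montreal"))
```

The printed URLs can be passed to curl unchanged, with no further shell quoting of the operators.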

$ curl -XGET http://node1:9200/twitter/tweet/_search -d '{
  "query": {
    "filtered" : {
      "query" : {
        "bool" : {
          "must" : [
            {
              "match" : {
                "author.first_name" : {
                  "query" : "loic",
                  "fuzziness" : 0.1
                }
              }
            },
            {
              "multi_match" : {
                "query" : "elasticsearch",
                "fields" : ["title^10","body"]
              }
            }
          ]
        }
      },
      "filter": {
        "and" : [
          {"terms" : { "tags" : ["search","scale","store"] } },
          {"range" : { "created_at" : {"from": "2013" } } },
          {"term": { "featured" : true } }
        ]
      }
    }
  }
}'
Query DSL

Query DSL, piece by piece
• "match" : fuzzy full-text match on the author's first name
• "multi_match" : one query text searched across several fields
• "filter" : "and" of filters that restrict the results
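A bool query can carry several "must" clauses. When the request body is built from client code, it helps to know that many JSON libraries keep only the last value of a repeated key, so listing the clauses in an array is the safer form. A minimal sketch with Python's standard `json` module, reusing the field names from the request above (the variable names are illustrative):

```python
import json

# Two "must" keys in the same object: Python's json module keeps only the last one.
duplicated = '{"bool": {"must": {"match": {"a": "x"}}, "must": {"match": {"b": "y"}}}}'
parsed = json.loads(duplicated)
print(parsed["bool"])

# Listing the clauses in an array keeps both of them.
body = {
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "must": [
                        {"match": {"author.first_name": {"query": "loic", "fuzziness": 0.1}}},
                        {"multi_match": {"query": "elasticsearch", "fields": ["title^10", "body"]}},
                    ]
                }
            },
            "filter": {
                "and": [
                    {"terms": {"tags": ["search", "scale", "store"]}},
                    {"range": {"created_at": {"from": "2013"}}},
                    {"term": {"featured": True}},
                ]
            },
        }
    }
}
payload = json.dumps(body)  # string that could be sent as the request body
print(payload)
```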

[Figure: examples of Term and Range facets]
Facets

$ curl -XPOST http://node1:9200/articles/_search -d '{
"aggregations" : {
"tag_cloud" : { "terms" : {"field" : "tags"} }
}
}'
Tag Cloud
"aggregations" : {
"tag_cloud" :[
{"terms": "Quebec", "count" : 5},
{"terms": "Montréal", "count" : 3},
...
]
}

$ curl -XPOST 'http://node1:9200/students/_search?search_type=count' -d '{
"facets": {
"scores-per-subject" : {
"terms_stats" : {
"key_field" : "subject",
"value_field" : "score"
}
}
}
}'
Stats
"facets" : {
"scores-per-subject" : {
"_type" : "terms_stats",
"missing" : 0,
"terms" : [ {
"term" : "math",
"count" : 4,
"total_count" : 4,
"min" : 25.0,
"max" : 92.0,
"total" : 267.0,
"mean" : 66.75
}, […]
}
}

Advanced facets : Aggregations
{
"rank": "21",
"city": "Boston",
"state": "MA",
"population2012": "636479",
"population2010": "617594",
"land_area": "48.277",
"density": "12793",
"ansi": "619463",
"location": {
"lat": "42.332",
"lon": "71.0202"
}
}

curl -XGET "node1:9200/cities/_search?pretty" -d '{
"aggs" : {
"mean_density_by_state" : {
"terms" : {
"field" : "state"
},
"aggs": {
"mean_density": {
"avg" : {
"field" : "density"
}
}
}
}
}
}'
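The request above groups the cities by state (a "terms" aggregation) and computes the average density inside each group (an "avg" sub-aggregation). The same computation, sketched in plain Python to show what the nesting means. `mean_by_key` is an illustrative helper, and apart from Boston every city below is hypothetical sample data:

```python
from collections import defaultdict


def mean_by_key(docs, key_field, value_field):
    """Group docs by key_field, then average value_field inside each group."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [doc_count, running sum]
    for doc in docs:
        bucket = totals[doc[key_field]]
        bucket[0] += 1
        bucket[1] += float(doc[value_field])
    # Largest groups first; ties are broken alphabetically by key.
    ordered = sorted(totals.items(), key=lambda item: (-item[1][0], item[0]))
    return [{"term": key, "doc_count": count, "mean": total / count}
            for key, (count, total) in ordered]


cities = [
    {"city": "Boston", "state": "MA", "density": "12793"},
    {"city": "Sample City A", "state": "CA", "density": "4000"},  # hypothetical
    {"city": "Sample City B", "state": "CA", "density": "6000"},  # hypothetical
    {"city": "Sample City C", "state": "TX", "density": "2500"},  # hypothetical
]

for bucket in mean_by_key(cities, "state", "density"):
    print(bucket)
```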

"aggregations" : {
"mean_density_by_state" : {
"terms" : [ {
"term" : "CA",
"doc_count" : 69,
"mean_density" : {
"value" : 5558.623188405797
}
}, {
"term" : "TX",
"doc_count" : 32,
"mean_density" : {
"value" : 2496.625
}
}, {
"term" : "FL",
"doc_count" : 20,
"mean_density" : {
"value" : 4006.6
}
}, {
"term" : "CO",
"doc_count" : 11,

Facets
Terms
Terms Stats
Statistical
Range
Histogram
Date Histogram
Filter
Query
Geo Distance

[Diagram: a cluster with a single empty node, Node 1. Cluster state : Green]
$ curl -XPUT localhost:9200/twitter -d '{
"index" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
}
}'
[Diagram: Node 1 holds Shard 0 and Shard 1; the replicas cannot be allocated on a single node. Cluster state : Yellow]
[Diagram: adding a second node. Node 2 receives a copy of Shard 0 and Shard 1. Cluster state : Green]
Architecture
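The colours above follow a simple rule: Green when every primary and replica shard is allocated, Yellow when all primaries are allocated but some replicas are not, Red when a primary is missing. A replica is never placed on the same node as its primary, which is why one node is not enough here. A rough Python sketch of that rule; `allocate` and `cluster_health` are illustrative names, and the allocation model is deliberately simplified:

```python
def allocate(number_of_shards, number_of_replicas, node_count):
    """Simplified model: copy n of a shard can be placed only if an n-th node exists,
    because two copies of the same shard never share a node."""
    shards = []
    for shard_id in range(number_of_shards):
        for copy in range(1 + number_of_replicas):
            shards.append({
                "shard": shard_id,
                "primary": copy == 0,
                "assigned": copy < node_count,
            })
    return shards


def cluster_health(shards):
    """Red: a primary is unassigned. Yellow: only replicas are unassigned. Green: none."""
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"
    if any(not s["assigned"] for s in shards):
        return "yellow"
    return "green"


print(cluster_health(allocate(2, 1, node_count=1)))
print(cluster_health(allocate(2, 1, node_count=2)))
```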

[Diagram: two nodes. Node 1 and Node 2 each hold a copy of Shard 0 and Shard 1]
[Diagram: adding Node 3. One copy of Shard 1 moves to Node 3; Node 1 keeps Shard 0, Node 2 keeps Shard 0 and Shard 1]
[Diagram: adding Node 4. Shard copies are relocated across the four nodes]
Architecture

[Diagram: indexing tweet 1 (post_date 2014-03-12T18:30:00) on the four-node cluster. Doc 1 is written to one copy of Shard 0, then to the other copy of Shard 0, and the cluster answers:]
{
"ok":true,
"_index":"twitter",
"_type":"tweet",
"_id":"1",
"_version":1
}
Architecture
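Which shard receives a document is decided by hashing its routing value (by default the document id) modulo the number of primary shards. A minimal sketch of that idea, assuming `zlib.crc32` as a stand-in hash (Elasticsearch uses its own hash function, so the shard numbers printed here will not necessarily match a real cluster):

```python
import zlib


def pick_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Map a document id to a primary shard number.

    zlib.crc32 is only a stand-in hash chosen for this illustration.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


for doc_id in ("1", "2", "3"):
    print(doc_id, "->", "Shard", pick_shard(doc_id, 2))
```

Because the modulus is the number of primary shards, that number is fixed when the index is created, while the number of replicas can be changed later.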

[Diagram: indexing tweet 2 (post_date 2014-03-12T18:45:00, message "The crowd is on fire #ElasticSearch"). Doc 2 is written to one copy of Shard 1, then to the other copy of Shard 1, and the cluster answers:]
{
"ok":true,
"_index":"twitter",
"_type":"tweet",
"_id":"2",
"_version":1
}
Architecture

[Diagram: a search request ("query": { … }) is executed on one copy of Shard 0 and one copy of Shard 1, so both Doc 1 and Doc 2 can be found]
Architecture
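A search is run on one copy of every shard, and the per-shard results are then merged into a single ranked list. A simplified sketch of that merge step, assuming each shard already returns its hits sorted by descending `_score`; `merge_shard_hits` is an illustrative name and the score given to Doc 2 is hypothetical:

```python
import heapq
from itertools import islice


def merge_shard_hits(per_shard_hits, size=10):
    """Merge per-shard hit lists (each sorted by descending _score) and keep the top `size`."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: -hit["_score"])
    return list(islice(merged, size))


shard_0_hits = [{"_id": "1", "_score": 0.227}]
shard_1_hits = [{"_id": "2", "_score": 0.310}]  # hypothetical score

top_hits = merge_shard_hits([shard_0_hits, shard_1_hits], size=10)
print([hit["_id"] for hit in top_hits])
```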

[Diagram: Node 1 leaves the cluster. Node 2, Node 3 and Node 4 remain, with two copies of Shard 1 (Doc 2) but a single copy of Shard 0 (Doc 1)]
[Diagram: a new copy of Shard 0 (Doc 1) is created on one of the remaining nodes]
Architecture

[Diagram: indexing tweet 3 (post_date 2014-03-12T19:00:00, message "A third message about #ElasticSearch"). Doc 3 is written to one copy of Shard 0, then to the other copy of Shard 0, and the cluster answers:]
{
"ok":true,
"_index":"twitter",
"_type":"tweet",
"_id":"3",
"_version":1
}
Architecture

[Diagram: a search request ("query": { … }) is again executed on one copy of each shard: Shard 0 (Doc 1, Doc 3) and Shard 1 (Doc 2)]
[Diagram: last frame, labelled Node 2 and Node 4, showing Shard 0 (Doc 1, Doc 3) and both copies of Shard 1 (Doc 2)]
Architecture

How do users see search ?
[Diagram: User → Query → List of results → Result]

How does a search engine work ?
1. Fetch the document field
2. Pick the configured analyzer
3. Parse the text into tokens
4. Apply token filters
5. Store into the index
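The five steps above can be sketched as a toy inverted index. This is a simplified illustration, not how Lucene is implemented: the regular-expression tokenizer, the one-word stop list and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

STOPWORDS = {"du"}  # hypothetical, deliberately tiny stop list


def tokenize(text):
    """Step 3: split the text into word tokens (apostrophes inside a word are kept)."""
    return re.findall(r"\w+(?:'\w+)*", text)


def analyze(text):
    """Steps 2 to 4: one tokenizer followed by two token filters."""
    tokens = tokenize(text)
    tokens = [token.lower() for token in tokens]             # lowercase filter
    return [token for token in tokens if token not in STOPWORDS]  # stop filter


def index_documents(docs, field):
    """Steps 1 and 5: read the field of each document and store token -> document ids."""
    inverted = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            inverted[token].add(doc_id)
    return inverted


docs = {
    1: {"message": "Introducing #ElasticSearch"},
    2: {"message": "The crowd is on fire #ElasticSearch"},
}
inverted = index_documents(docs, "message")
print(sorted(inverted["elasticsearch"]))
print(analyze("Édith Piaf vedette du feu d'artifice"))
```

Searching then amounts to analyzing the query text the same way and looking the resulting tokens up in the index.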

Analyzer
curl -XGET "http://localhost:9200/docs/_analyze?analyzer=standard&pretty=1" -d "Édith Piaf vedette du feu d'artifice"

Analyzer
{
"tokens" : [ {
"token" : "édith",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "piaf",
"start_offset" : 6,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "vedette",
"start_offset" : 11,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "du",
"start_offset" : 19,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "feu",
"start_offset" : 22,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 5
}, {
"token" : "d'artifice",
"start_offset" : 26,
"end_offset" : 36,
"type" : "<ALPHANUM>",
"position" : 6
} ]
}

An analyzer is composed of a single tokenizer and zero or more filters
Analyzer

Splits a string into tokens :
Whitespace tokenizer :
«Édith Piaf» -> «Édith», «Piaf»
Standard tokenizer (followed by the lowercase filter, as in the standard analyzer) :
«Édith Piaf!» -> «édith», «piaf»
Tokenizer

Modify, delete or add tokens
Asciifolding filter :
«Édith Piaf» -> «Edith Piaf»
Stemmer filter (english) :
«stemming» -> «stem»
«fishing», «fished», «fisher» -> «fish»
«cats», «catlike» -> «cat»
Phonetic :
«quick» -> «Q200»
«quik» -> «Q200»
Edge nGram :
«Montreal» -> [«Mon», «Mont», «Montr»]
Filters
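Two of the filters above are easy to approximate in a few lines. The sketch below is an illustration only: `ascii_fold` relies on Unicode decomposition and therefore covers fewer characters than the real asciifolding filter, and the `min_gram`/`max_gram` defaults of `edge_ngrams` are chosen here to reproduce the slide's example.

```python
import unicodedata


def ascii_fold(token):
    """Drop combining accents after Unicode decomposition (approximation of asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=3, max_gram=5):
    """Return the prefixes of `token` whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


print(ascii_fold("Édith"))
print(edge_ngrams("Montreal"))
```

Edge n-grams indexed this way are a common basis for search-as-you-type, since a typed prefix such as "Mont" matches a stored token directly.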

Analyzer
{
"tokens" : [ {
"token" : "edith",
"start_offset" : 0,
"end_offset" : 5,
"position" : 1
}, {
"token" : "piaf",
"start_offset" : 6,
"end_offset" : 10,
"position" : 2
}, {
"token" : "vedet",
"start_offset" : 11,
"end_offset" : 18,
"position" : 3
}, {
"token" : "feu",
"start_offset" : 22,
"end_offset" : 25,
"position" : 5
}, {
"token" : "artific",
"start_offset" : 26,
"end_offset" : 36,
"position" : 6
} ]
}

1. Documents get indexed
2. I come back often to the search page to run my query
3. I hope that my document will be ranked well enough to appear at the top of the results page
4. If not, I will never see my document
Regular search engine usage

1. Register my query
2. When a document gets indexed, the percolator looks for matches against the registered queries
Percolator
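The percolator reverses the usual roles: queries are stored, and each incoming document is tested against them. A toy Python sketch of that idea, assuming queries reduced to a single field/term pair; the `Percolator` class here is purely illustrative and unrelated to the real implementation:

```python
import re


class Percolator:
    """Stores simple term queries and reports which ones match a given document."""

    def __init__(self):
        self.queries = {}  # query id -> (field, lowercased term)

    def register(self, query_id, field, term):
        self.queries[query_id] = (field, term.lower())

    def percolate(self, doc):
        """Return the ids of the registered queries that match `doc`."""
        matches = []
        for query_id, (field, term) in self.queries.items():
            tokens = re.findall(r"\w+", str(doc.get(field, "")).lower())
            if term in tokens:
                matches.append(query_id)
        return matches


percolator = Percolator()
percolator.register("elasticsearch", "message", "elasticsearch")

print(percolator.percolate({"message": "A third message about #ElasticSearch"}))
print(percolator.percolate({"message": "Nothing relevant here"}))
```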

Real Time Updates !
Percolator

Percolator
curl -XPUT 'http://node1:9200/twitter/.percolator/elasticsearch' -d '{
"query" : {
"match" : {
"message" : "elasticsearch"
}
}
}'

Percolator
$ curl -X GET http://node1:9200/twitter/tweet/_percolate -d '{
"doc" : {
"post_date": "2014-03-12T19:00:00",
"message": "A third message about #ElasticSearch"
}
}'

Percolator
{
"took" : 19,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"total" : 1,
"matches" : [
{
"_id" : "elasticsearch"
}
]
}

{
"name": "Jules Verne",
"biography": "One of the greatest authors",
"books": [
{
"title": "Vingt mille lieues sous les mers",
"genre": "Novel",
"publisher": "Hetzel"
},
{
"title": "Les Châteaux en Californie",
"genre": "Drama",
"publisher": "Marc Soriano"
}
]
}
Inner objects
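Inner objects are not stored as separate documents: roughly speaking, their fields are flattened into dotted paths on the enclosing document, and an array of inner objects becomes a set of multi-valued fields. A hedged sketch of that flattening in Python (`flatten` is an illustrative helper, not an Elasticsearch API):

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted field names, each holding a list of values."""
    fields = {}
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into the inner object and merge its values under the dotted path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields.setdefault(sub_path, []).extend(sub_values)
            else:
                fields.setdefault(path, []).append(item)
    return fields


author = {
    "name": "Jules Verne",
    "books": [
        {"title": "Vingt mille lieues sous les mers", "genre": "Novel"},
        {"title": "Les Châteaux en Californie", "genre": "Drama"},
    ],
}

for field, values in flatten(author).items():
    print(field, "=", values)
```

After flattening, the link between a given title and its own genre is lost, which is one reason to consider the parent/child model.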

curl -XPUT node1:9200/authors/bare_author/1 -d '{
"name": "Jules Verne",
"biography": "One of the greatest authors"
}'
curl -XPOST 'node1:9200/authors/book/1?parent=1' -d '{
"title": "Les Châteaux en Californie",
"genre": "Drama",
"publisher": "Marc Soriano"
}'
curl -XPOST 'node1:9200/authors/book/2?parent=1' -d '{
"title": "Vingt mille lieues sous les mers",
"genre": "Novel",
"publisher": "Hetzel"
}'
Parents / Children

Other features
• Suggest API : Did you mean ?, Autocomplete, …
• Results Highlight
• More like this
• Backup Data : Snapshot / Restore
  • File System
  • Amazon S3
  • HDFS
  • Google Compute Engine
  • Microsoft Azure
• Hadoop connector

Clients
• Perl
• Python
• Ruby
• PHP
• JavaScript
• Java
• .NET
• Scala
• Clojure
• Erlang
• Eventmachine
• Bash
• OCaml
• Smalltalk
• ColdFusion

Thank you
Thanks to David Pilato for his presentation : https://speakerdeck.com/dadoonet/tours-jug-elasticsearch
Thanks to Kevin Kluge for his presentation : https://speakerdeck.com/elasticsearch/elasticsearch-in-20-minutes

Suggest
curl -s -XPOST 'localhost:9200/_search?search_type=count' -d '{
"suggest" : {
"my-title-suggestions-1" : {
"text" : "devloping",
"term" : {
"size" : 3,
"field" : "title"
}
}
}
}'

Suggest
"suggest": {
"my-title-suggestions-1": [
{
"text": "devloping",
"offset": 0,
"length": 9,
"options": [
{
"text": "developing",
"freq": 77,
"score": 0.8888889
},
{
"text": "deloping",
"freq": 1,
"score": 0.875
},
{
"text": "deploying",
"freq": 2,
"score": 0.7777778
}
]
}
]
}
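The scores in the response above happen to be consistent with `1 - edit_distance / min(length)`, which gives a feel for how a term suggester ranks candidates. The sketch below implements that formula with a plain Levenshtein distance. It is an illustration under that assumption, not the suggester's actual scoring code, and the candidate dictionary is made up for the example:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(text, candidate):
    """Assumed scoring formula: 1 - distance / length of the shorter word."""
    return 1.0 - edit_distance(text, candidate) / min(len(text), len(candidate))


def suggest(text, dictionary, size=3):
    """Return the `size` dictionary words most similar to `text`, best first."""
    scored = [(similarity(text, word), word) for word in dictionary if word != text]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:size]]


dictionary = ["deploying", "deloping", "developing", "search"]  # hypothetical indexed terms
for word in suggest("devloping", dictionary):
    print(word, round(similarity("devloping", word), 7))
```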

More Like This
curl -XGET 'http://node1:9200/twitter/tweet/1/_mlt?mlt_fields=tag,content&min_doc_freq=1'
{
"more_like_this" : {
"fields" : ["name.first", "name.last"],
"like_text" : "text like this one",
"min_term_freq" : 1,
"max_query_terms" : 12,
"percent_terms_to_match" : 0.95
}
}

{
"query" : {...},
"highlight" : {
"number_of_fragments" : 3,
"fragment_size" : 150,
"tags_schema" : "styled",
"fields" : {
"_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
"bio.title" : { "number_of_fragments" : 0 },
"bio.author" : { "number_of_fragments" : 0 },
"bio.content" : { "number_of_fragments" : 5, "order" : "score" }
}
}
}
Highlight

Hadoop
• Java library for integrating Elasticsearch and Hadoop
• Pig, Hive, Cascading, MapReduce
• Search and Real Time Analytics with Elasticsearch, Hadoop as Data Lake
• Scales with Hadoop
