Data Exploration with Elasticsearch

Aleksander M. Stensby

Monokkel A/S

•  Aleksander M. Stensby
•  CEO at Monokkel AS
•  Previously COO at Integrasco AS
•  Working with search and data analysis since 2004

www.monokkel.io

•  Persistence, processing and presentation of data

Persistence – Processing – Presentation

Agenda

•  Search fundamentals primer
•  Intro to Elasticsearch
•  Search, filter and aggregate!

… and some bonus visualisation!

What we will not cover today…

•  All the different searches, filters and aggregations available in Elasticsearch :)
•  Details on tokenization, analyzers…
•  Elasticsearch in production and performance tuning…
•  Data integration

Search fundamentals 101

Fields (key-value): Title, Content, Signature

“We know what we are, but know not what we may be.”

Term vector

Term    Frequency
we      3
know    2
what    2
are     1
but     1
not     1
may     1
be      1

The Inverted Index

1. “We were born to run”
2. “No one told you when to run”
3. “Some were born to sing the blues”

Term (dictionary)   Frequency   Documents (postings)
blues               1           3
born                2           1,3
no                  1           2
one                 1           2
run                 2           1,2
sing                1           3
some                1           3
the                 1           3
to                  3           1,2,3
told                1           2
we                  1           1
were                2           1,3
when                1           2
you                 1           2
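The dictionary and postings above can be reproduced with a few lines of Python. This is only a minimal sketch: it lowercases and splits on whitespace, which is far simpler than what Lucene does, and the function name `build_inverted_index` is made up for illustration.

```python
from collections import defaultdict

# The three example documents from the slide, keyed by document id.
docs = {
    1: "We were born to run",
    2: "No one told you when to run",
    3: "Some were born to sing the blues",
}

def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the dictionary alphabetically, as on the slide.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
for term, ids in index.items():
    # Columns: term, number of documents containing it, postings list.
    print(f"{term:<6} {len(ids)}  {ids}")
```

Looking up a term is then a dictionary access, for example `index["born"]`.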

Searching

Query: born

The Boolean Model

Example queries against the dictionary and postings of the inverted index:

born
born blues
born OR blues
born AND blues
born NOT blues
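In the Boolean model, each operator corresponds to a set operation on postings lists. The sketch below assumes the postings from the inverted index slide, copied by hand for the terms used; the bare two-term query is left out because its meaning depends on the default operator.

```python
# Postings for a few terms, taken from the inverted index example.
postings = {
    "blues": {3},
    "born": {1, 3},
    "run": {1, 2},
    "to": {1, 2, 3},
}

def lookup(term):
    """Return the set of document ids for a term (empty if the term is unknown)."""
    return postings.get(term, set())

born, blues = lookup("born"), lookup("blues")

results = {
    "born": born,
    "born OR blues": born | blues,    # union
    "born AND blues": born & blues,   # intersection
    "born NOT blues": born - blues,   # difference
}

for query, doc_ids in results.items():
    print(f"{query:<15} -> {sorted(doc_ids)}")
```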

Relevancy and Ranking

•  Term frequency
•  Inverse document frequency
•  Field-length norm
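To make the three factors concrete, here is a rough sketch that multiplies them together for a single term. The formulas are modelled on the classic Lucene TF/IDF similarity as I understand it (square root of the term frequency, `1 + ln(numDocs / (docFreq + 1))`, and `1 / sqrt(field length)`); the real scoring function has more parts, so treat this as an illustration rather than Elasticsearch's actual score.

```python
import math

docs = {
    1: "We were born to run",
    2: "No one told you when to run",
    3: "Some were born to sing the blues",
}

def tokens(text):
    """Very naive analysis: lowercase and split on whitespace."""
    return text.lower().split()

def score(term, doc_id, documents):
    """Combine term frequency, inverse document frequency and field-length norm."""
    words = tokens(documents[doc_id])
    freq = words.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for text in documents.values() if term in tokens(text))
    tf = math.sqrt(freq)                                    # more occurrences, higher score
    idf = 1.0 + math.log(len(documents) / (doc_freq + 1))   # rarer terms weigh more
    norm = 1.0 / math.sqrt(len(words))                      # shorter fields weigh more
    return tf * idf * norm

for doc_id in docs:
    print(doc_id, round(score("born", doc_id, docs), 3), round(score("blues", doc_id, docs), 3))
```

With these documents, "blues" outscores "born" in document 3 because it is the rarer term, and "born" scores higher in document 1 than in document 3 because document 1 is shorter.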

Similarity

[Figure: the query and the three documents plotted as vectors in a two-dimensional term space, with axes “born” and “blues” (scale 0 to 5).]

Vectors as [born, blues]:

query: [2,5]
doc 3: [2,5]
doc 2: [0,0]
doc 1: [2,0]
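One common way to compare such vectors is cosine similarity, the cosine of the angle between the query vector and each document vector. The sketch below uses the weights from the slide; treating an all-zero vector as similarity 0 is my own convention for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Vectors are [born, blues] weights, as on the slide.
query = [2, 5]
documents = {"doc 1": [2, 0], "doc 2": [0, 0], "doc 3": [2, 5]}

ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
for name in ranked:
    print(name, round(cosine_similarity(query, documents[name]), 3))
```

Document 3 points in exactly the same direction as the query and ranks first, document 1 shares only the "born" component, and document 2 shares nothing.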

Search fundamentals 101!

•  Tokenization
•  Normalization (case, stop words etc.)
•  Stemming, synonyms
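These steps can be strung together into a toy analysis chain. Everything here is a deliberately naive assumption for illustration: the stop word list, the suffix-stripping "stemmer" and the synonym table are invented, and real analyzers are considerably more careful.

```python
import re

STOP_WORDS = {"the", "to", "a", "an", "and", "of"}  # illustrative list only
SYNONYMS = {"tune": "song"}                          # hypothetical synonym mapping

def stem(token):
    """Strip a couple of common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, lowercase, drop stop words, stem, then map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenization + case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    stems = [stem(t) for t in tokens]                    # stemming
    return [SYNONYMS.get(s, s) for s in stems]           # synonyms

print(analyze("Singing the Blues!"))
print(analyze("Tunes of the North"))
```

The same chain would be applied both when indexing a document and when analyzing a query, so that both sides end up with comparable terms.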

Brief history of Elasticsearch

Shay Banon
-> Abstraction layer on top of Lucene
-> Compass
-> Rewritten, high-performance, real-time, distributed
-> Elasticsearch
-> February 2010

Elasticsearch

•  Open source search engine - written in Java
•  Built on top of Lucene
•  Simple, coherent, RESTful API
•  Distributed, scalable search engine with real-time analytics

“more useable and concise API, scalability, and operational tools on top of Lucene’s search implementation”

Elasticsearch nodes and cluster

[Diagram labels: node, node, node, cluster]

Elasticsearch shards, nodes

[Diagram labels: index = shard, node]

Lucene index and segments

[Diagram labels: segments, Lucene index]

Much more than just search!

•  Real-time analytics
•  Log analysis
•  Prediction modelling
•  Recommendations

DEMO

•  Install Elasticsearch
•  Load in some data
•  Run a very basic search

Easy peasy…

•  http://www.elasticsearch.org/download
•  bin/elasticsearch
   or bin/elasticsearch.bat on Windows
•  http://localhost:9200/
   or curl -X GET http://localhost:9200/

Easy peasy lemon squeezy!

http://localhost:9200/<index>/<type>/[<id>]

Indexing data

curl -XPUT 'http://localhost:9200/monokkel/user/aleks' -d '{ "name" : "Aleksander Stensby" }'
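The same request can be prepared from Python with only the standard library. This is a sketch under assumptions: `build_index_request` is a made-up helper, the `Content-Type` header is an addition not shown on the slide (newer Elasticsearch versions expect it), and the request is only built here, not sent, so no running node is needed.

```python
import json
import urllib.request

def build_index_request(index, doc_type, doc_id, document, host="http://localhost:9200"):
    """Build a PUT request for <host>/<index>/<type>/<id> with a JSON body."""
    url = f"{host}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumption, not on the slide
    )

request = build_index_request("monokkel", "user", "aleks", {"name": "Aleksander Stensby"})
print(request.get_method(), request.full_url)

# With a node running locally, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```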

Indexing data

•  shakespeare.json
   – http://www.elasticsearch.org/guide/en/kibana/current/snippets/shakespeare.json
•  curl -XPUT localhost:9200/_bulk --data-binary @shakespeare.json
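The `_bulk` endpoint expects newline-delimited JSON: one action line followed by one document line per item, with a trailing newline at the end. The sketch below builds such a payload; the helper name `to_bulk_payload`, the type name `line` and the two sample documents are assumptions for illustration and are not copied from shakespeare.json.

```python
import json

def to_bulk_payload(index, doc_type, documents):
    """Serialize (id, document) pairs into the newline-delimited bulk format."""
    lines = []
    for doc_id, doc in documents:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # what to do and where
        lines.append(json.dumps(doc))     # the document itself
    return "\n".join(lines) + "\n"        # the body must end with a newline

# Sample documents using the field names that appear in the later query examples.
sample = [
    (1, {"speaker": "ROMEO", "text_entry": "But, soft! what light through yonder window breaks?"}),
    (2, {"speaker": "JULIET", "text_entry": "O Romeo, Romeo! wherefore art thou Romeo?"}),
]

payload = to_bulk_payload("shakespeare", "line", sample)
print(payload)
```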

http://localhost:9200/<index>/<type>/_search
http://localhost:9200/<index>/_search
http://localhost:9200/_search

Mapping

•  Is it a number? String? Date?
•  Combining multiple fields?
•  Default values?
•  Stored?
•  Analyzed?
•  How should we tokenize/analyse/normalize the field?

Mapping

curl -XPUT http://localhost:9200/shakespeare -d '
{
  "mappings" : {
    "_default_" : {
      "properties" : {
        "speaker" : { "type": "string", "index" : "not_analyzed" },
        "play_name" : { "type": "string", "index" : "not_analyzed" },
        "line_id" : { "type" : "integer" },
        "speech_number" : { "type" : "integer" }
      }
    }
  }
}
';

The Query DSL

{
  "query": {YOUR_QUERY_HERE}
}

Match Query

{
  "query": {
    "match": { "text_entry" : "romeo" }
  }
}

Multi Match Query

{
  "query": {
    "multi_match": {
      "query": "romeo",
      "fields": [ "text_entry", "speaker" ]
    }
  }
}

Bool Query

{
  "query": {
    "bool": {
      "must": [ ],
      "must_not": [ ],
      "should": [ ]
    }
  }
}

Bool Query

{
  "query": {
    "bool": {
      "must": { "match": { "text_entry": "romeo" }},
      "must_not": { "match": { "speaker": "ROMEO" }},
      "should": [
        { "match": { "speaker": "JULIET" }},
        { "match": { "speaker": "FRIAR LAURENCE" }}
      ]
    }
  }
}
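When queries like this are assembled in application code, small helper functions keep the nesting manageable. The helpers `match` and `bool_query` below are hypothetical names, not part of any Elasticsearch client; the sketch only builds the JSON body and prints it.

```python
import json

def match(field, text):
    """Build a single match clause."""
    return {"match": {field: text}}

def bool_query(must=(), must_not=(), should=()):
    """Combine clauses into a bool query, leaving out empty sections."""
    clauses = {"must": list(must), "must_not": list(must_not), "should": list(should)}
    return {"query": {"bool": {name: items for name, items in clauses.items() if items}}}

body = bool_query(
    must=[match("text_entry", "romeo")],          # must mention romeo
    must_not=[match("speaker", "ROMEO")],         # but not be spoken by Romeo
    should=[match("speaker", "JULIET"),           # preferably spoken by one of these
            match("speaker", "FRIAR LAURENCE")],
)

print(json.dumps(body, indent=2))
```

The printed body could be sent to one of the `_search` endpoints shown earlier.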

And lots more…

filtered query
prefix query
simple query string query
range query
regexp query
term query
terms query
wildcard query
dis max query
geoshape query
nested query
more like this query
more like this field query
boosting query
common terms query
constant score query
fuzzy like this query
fuzzy like this field query
function score query
fuzzy query
has child query
has parent query
ids query
indices query
span first query
span multi term query
span near query
span not query
span or query
span term query
top children query
minimum should match
multi term query rewrite
template query

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html

Filtering

•  Filters do not score, so they are faster to execute than queries
•  Filters can be cached in memory - significantly faster than queries

If relevance is not important, use filters; otherwise, use queries!

The Filtered Query:

{
  "query": {
    "filtered": {
      "query": {YOUR_QUERY_HERE},
      "filter": {YOUR_FILTER_HERE}
    }
  }
}

The Filtered Query:

{
  "query": {
    "filtered": {
      "query": { "match": { "content": "monokkel" }},
      "filter": { "term": { "tag": "awesome" }}
    }
  }
}
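The difference between the two halves of a filtered query can be illustrated with a toy in-memory version: the filter answers yes or no and its result can be cached, while the query computes a score for the documents that pass. The documents, the cache and the scoring by plain word count are all assumptions made for the example; Elasticsearch's own caching and scoring work differently in detail.

```python
# Hypothetical documents for the example.
docs = [
    {"id": 1, "content": "monokkel monokkel search", "tag": "awesome"},
    {"id": 2, "content": "monokkel data", "tag": "boring"},
    {"id": 3, "content": "data about monokkel", "tag": "awesome"},
    {"id": 4, "content": "nothing here", "tag": "awesome"},
]

filter_cache = {}

def term_filter(field, value):
    """Yes/no filter: return the ids of matching documents, cached per (field, value)."""
    key = (field, value)
    if key not in filter_cache:
        filter_cache[key] = frozenset(d["id"] for d in docs if d[field] == value)
    return filter_cache[key]

def match_score(doc, field, text):
    """Toy relevance score: how often the word occurs in the field."""
    return doc[field].lower().split().count(text.lower())

def filtered_search(query_field, query_text, filter_field, filter_value):
    """Apply the filter first, then score and rank only the remaining documents."""
    allowed = term_filter(filter_field, filter_value)
    hits = [(match_score(d, query_field, query_text), d["id"]) for d in docs if d["id"] in allowed]
    return [doc_id for score, doc_id in sorted(hits, reverse=True) if score > 0]

print(filtered_search("content", "monokkel", "tag", "awesome"))
```

A second search with the same filter reuses the cached id set and only repeats the scoring step.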

Term Filter

{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "speaker": "ROMEO"
        }
      }
    }
  }
}

Terms Filter

{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "speaker": ["ROMEO", "JULIET"]
        }
      }
    }
  }
}

Bool Filter

{
  "query": {
    "filtered": {
      "filter": {
        "bool" : {
          "must" : [],
          "should" : [],
          "must_not" : []
        }
      }
    }
  }
}

Range Filter

{
  "query": {
    "filtered": {
      "filter": {
        "range" : {
          "price" : {
            "gt" : 20,
            "lt" : 40
          }
        }
      }
    }
  }
}
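The bounds in a range filter map directly onto comparison operators: `gt` and `lt` are exclusive, and the related `gte` and `lte` are inclusive. The helper `in_range` and the price list below are invented for illustration.

```python
import operator

# Range keywords and the comparison each one stands for.
OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """Return True if the value satisfies every bound, e.g. {"gt": 20, "lt": 40}."""
    return all(OPERATORS[name](value, limit) for name, limit in bounds.items())

prices = [10, 20, 25, 40, 55]  # made-up sample values
selected = [p for p in prices if in_range(p, {"gt": 20, "lt": 40})]
print(selected)
```

Because both bounds are exclusive, the values 20 and 40 themselves are not selected.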

And lots more…

match all filter
and filter
not filter
or filter
prefix filter
query filter
regexp filter
type filter
geo bounding box filter
geo distance filter
geo distance range filter
geo polygon filter
geoshape filter
geohash cell filter
has child filter
has parent filter
ids filter
indices filter
limit filter
nested filter
script filter

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html

Kibana

•  http://www.elasticsearch.org/overview/kibana/installation/
•  bin/kibana
   or bin/kibana.bat on Windows
•  http://localhost:5601/

Aggregations

•  Buckets and metrics: partitioning documents based on a criterion

SELECT COUNT(color)
FROM table
GROUP BY color

metric: COUNT(color)
bucket: GROUP BY color

An aggregation is a combination of buckets and metrics.

Aggregations

{
  "aggs": {
    "speakers": {
      "terms": {
        "field": "speaker"
      }
    }
  }
}

"speakers": your aggregation name
"terms": bucket type
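Conceptually, a terms aggregation is a group-by with a count per group. The sketch below mimics that on a handful of made-up documents and shapes the result like the `key` / `doc_count` buckets in an aggregation response; it does not talk to Elasticsearch.

```python
from collections import Counter

# Made-up sample of indexed lines; only the aggregated field is shown.
lines = [
    {"speaker": "ROMEO"},
    {"speaker": "JULIET"},
    {"speaker": "ROMEO"},
    {"speaker": "FRIAR LAURENCE"},
    {"speaker": "ROMEO"},
]

# One bucket per distinct value, holding the number of documents in it.
counts = Counter(line["speaker"] for line in lines)
buckets = [{"key": key, "doc_count": n} for key, n in counts.most_common()]

for bucket in buckets:
    print(bucket)
```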

Aggregations

{
  "aggs": {
    "beertypes": {
      "terms": {
        "field": "beertype"
      }
    }
  }
}

Aggregations

{
  "aggs": {
    "beertypes": {
      "terms": {
        "field": "beertype"
      },
      "aggs": {
        "avg_ibu": {
          "avg": {
            "field": "ibu"
          }
        }
      }
    }
  }
}

"avg_ibu": your aggregation name
"avg": metric type
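Nesting a metric inside a bucket aggregation means the metric is computed once per bucket. The following sketch does the same by hand for a few invented beers; the field names follow the slide, while the values are made up.

```python
from collections import defaultdict

# Invented sample data using the field names from the slide.
beers = [
    {"beertype": "ipa", "ibu": 60},
    {"beertype": "ipa", "ibu": 70},
    {"beertype": "stout", "ibu": 40},
]

# Bucket step: group the ibu values by beertype.
groups = defaultdict(list)
for beer in beers:
    groups[beer["beertype"]].append(beer["ibu"])

# Metric step: compute the average inside each bucket.
buckets = [
    {"key": key, "doc_count": len(values), "avg_ibu": {"value": sum(values) / len(values)}}
    for key, values in groups.items()
]

for bucket in buckets:
    print(bucket)
```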

Aggregations

min
max
sum
avg
stats
extended stats
value count
percentiles
percentile ranks
cardinality
top hits
scripted metric

global
filter
filters
missing
nested
reverse nested
children
terms
significant terms
range
date range
ipv4 range
histogram
date histogram
geo bounds
geo distance
geohash grid

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html

And a whole lot more!

•  Geosearch, distance and bounds
•  “More Like This”
•  Suggesters / Autocomplete
•  Percolation
•  Language drivers
•  Scripting

Further reading and some great resources!

•  http://www.elasticsearch.org/guide/
•  http://blog.monokkel.io/
•  https://found.no/foundation/

Shameful self-promotion

/ Tarjei Romtveit
