Elasticsearch:
first steps with an
Aggregate-oriented
database
Jug Roma
28/11/2013
Matteo Moci
Me
Matteo Moci
@matteomoci
http://mox.fm
Software Engineer
R&D, new product development
Agenda
• 2 Use cases
• Elasticsearch Basics
• Data Design for scaling
Social Media Analytics Platform
for Marketing Agencies
Scenario

• Using Elasticsearch as:
• Analytics engine
• Aggregate repository
Use case 1

• count values distribution over time
Before

• ~10M documents
• Heaviest query: ~10 minutes
• Our staff had a problem
After

• ~10M documents
• Heaviest query: ~1 second (also with larger dataset)
Use case 2
• Aggregate-oriented repository
• ...as in DDD

http://ptgmedia.pearsoncmg.com/images/chap10_9780321834577/elem...
Elasticsearch
Distributed RESTful search and analytics
real time data and analytics
distributed
high availability
multi te...
Elasticsearch basics
• Install
• API
• Types mapping
• Facets
• Relations
Install
$ wget https://download.elasticsearch.org/...
$ tar -xf elasticsearch-0.90.7.tar.gz
Run!
Run!
$ ./elasticsearch-0.90.7/bin/elasticsearch -f

es
Hulk
Run!
$ ./elasticsearch-0.90.7/bin/elasticsearch -f
$ ./elasticsearch-0.90.7/bin/elasticsearch -f
es
Hulk
Run!
$ ./elasticsearch-0.90.7/bin/elasticsearch -f
$ ./elasticsearch-0.90.7/bin/elasticsearch -f
es
Hulk

Thor
Index a document
$ curl -X PUT localhost:9200/products/product/1 -d '{
  "name" : "Camera"
}'
Search
$ curl -X GET 'localhost:9200/products/product/_search?q=Camera'
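The two curl calls above (index, then search) can also be issued from Java without the Elasticsearch client library. The following is a minimal sketch, assuming Java 11+ and a node listening on localhost:9200; the class and method names are illustrative and not part of the talk. It only builds the requests; sending them would need `HttpClient.newHttpClient().send(...)` and a running node.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class ProductRequests {
    // Assumption: a local node on the default HTTP port.
    static final String BASE = "http://localhost:9200";

    /** Builds the PUT that indexes a JSON document as products/product/{id}. */
    static HttpRequest indexProduct(String id, String json) {
        return HttpRequest.newBuilder(URI.create(BASE + "/products/product/" + id))
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    /** Builds the GET that runs a query-string search (q=...) over the product type. */
    static HttpRequest searchProducts(String term) {
        String q = URLEncoder.encode(term, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(BASE + "/products/product/_search?q=" + q))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest put = indexProduct("1", "{\"name\" : \"Camera\"}");
        HttpRequest get = searchProducts("Camera");
        System.out.println(put.method() + " " + put.uri());
        System.out.println(get.method() + " " + get.uri());
    }
}
```

The URL layout (index/type/id) mirrors the curl examples; only the transport differs.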
Shards and Replicas
es
Hulk
Products
1

2

1

2
Shards and Replicas
es
Hulk
Products

Thor

1

2

1

2
Shards and Replicas
es
Hulk
Products

Thor
Products

1

2

1

2
Shards and Replicas
es
Hulk
Products

Thor
Products
2

1
1

2
Shards and Replicas
es
Hulk
Products

Thor
Products
2

1
2

1
Integration

Hulk

Thor
9300

9300
Integration
TransportClient

Hulk

Thor
9300

9300
Async Java API
this.client.prepareGet("documents", "document", id)
//async, non blocking APIs
//use a listener to handle r...
Mapping
Mappings define how primitive
types are stored and analyzed
Mapping
• JSON data is parsed on indexing
• Mapping is done on first field indexing
• Inferred if not configured (!)
• Types:...
"text": {
"type": "multi_field",
"fields": {
"text": {
"type": "string",
"index": "analyzed",
"index_analyzer": "whitespac...
Mapping - lessons
• schema can evolve (e.g. add fields)
• inferred if not specified (!)
• worst case: reindex
• use aliases ...
Search with Facets
final TermsFacetBuilder userFacet =
FacetBuilders.termsFacet(MENTION_FACET_NAME)
.field(USER_ID).size(m...
Query

Facets
Date Histogram Facet
The histogram facet works with numeric data by
building a histogram across intervals of the field valu...
{
    "query" : {
        "match_all" : {}
    },
    "facets" : {
        "histo1" : {
            "histogram" : {
                "field" : "followers",
                "interval" : 10
            }
        }
    }
}
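To make the bucketing idea concrete, here is a small sketch of what a histogram over a numeric field such as "followers" with interval 10 computes. It is an in-memory illustration, not Elasticsearch code: the class name and the sample values are made up, and the bucket key is assumed to be the value rounded down to a multiple of the interval.

```java
import java.util.Map;
import java.util.TreeMap;

final class HistogramSketch {
    /** Counts values per bucket; each bucket is keyed by its lower bound. */
    static Map<Long, Integer> histogram(long[] values, long interval) {
        Map<Long, Integer> buckets = new TreeMap<>();
        for (long value : values) {
            // Round down to the nearest multiple of the interval.
            long key = Math.floorDiv(value, interval) * interval;
            buckets.merge(key, 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        long[] followers = {3, 7, 12, 25, 29};   // hypothetical sample data
        // Two values fall in [0,10), one in [10,20), two in [20,30).
        System.out.println(histogram(followers, 10));
    }
}
```

In the real facet this counting happens on each shard and the per-bucket counts are added up by the node that coordinates the search.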
Facets - lessons
• Bug in 0.90.x: https://github.com/elasticsearch/elasticsearch/issues/1305 *
• Solutions: use 1 shar...
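Why would a single shard, or asking for more terms than needed, help? A likely explanation, stated here as an assumption rather than taken from the slides, is that each shard returns only its own top terms and the coordinating node sums what it receives, so a term that is second on every shard can be missed. The sketch below simulates that merge in memory; class name and data are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermsFacetSketch {
    /** Returns the n entries with the highest count (ties broken by term). */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    /** Every shard reports only its top `size` terms; the coordinator sums them. */
    static List<Map.Entry<String, Integer>> facet(List<Map<String, Integer>> shards, int size) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> shard : shards) {
            for (Map.Entry<String, Integer> e : top(shard, size)) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return top(merged, size);
    }

    public static void main(String[] args) {
        // Term "b" has the highest total (8) but is second on both shards.
        List<Map<String, Integer>> shards = List.of(
                Map.of("a", 5, "b", 4),
                Map.of("c", 5, "b", 4));
        System.out.println(facet(shards, 1)); // "b" is missed entirely
        System.out.println(facet(shards, 2)); // "b" comes first with its full count
    }
}
```

With one shard there is nothing to merge, and with a larger requested size fewer terms fall below each shard's cut-off, which matches the two workarounds listed.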
Analyzers
A Lucene analyzer consists of a tokenizer and
an arbitrary amount of filters (+ char filters)
{
"index":{
"analysis":{
"filter":{
"bigram_shingle_filter":{
"type":"shingle",
"max_shingle_size":2,
"min_shingle_size":2...
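As an illustration of what a whitespace tokenizer followed by a shingle filter produces, here is a plain-Java sketch. It is not Lucene code: it assumes tokens are joined with a single space, that unigrams are not emitted, and it ignores the "standard" token filter; the class name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ShingleSketch {
    /** Splits on whitespace and returns every run of `size` adjacent tokens. */
    static List<String> shingles(String text, int size) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        // Fewer tokens than `size` yields no output (no unigram fallback).
        for (int i = 0; i + size <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("elasticsearch first steps", 2)); // bigrams
        System.out.println(shingles("elasticsearch first steps", 3)); // trigrams
    }
}
```

Indexing the same text through a bigram and a trigram analyzer, as the multi_field mapping does, makes phrase-like terms available for facets and exact matching.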
Relations between
Documents
Author

1

N

Book

• nested: faster reads, update needs reindex, cross object match
• parent...
Nested Documents
Specify Book type is “nested” in Author’s Mapping
We can query Authors with a query on properties
of nest...
curl -XGET localhost:9200/authors/nested_author/
_search -d '
{
"query": {
"filtered": {
"query": {"match_all": {}},
"filt...
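The reason for declaring Books as "nested" can be shown with a small in-memory comparison. When an array of inner objects is not nested, its fields are flattened per document, so two conditions can be satisfied by two different books; a nested query requires a single book to satisfy both. The sketch below is an illustration only, with hypothetical class names and data.

```java
import java.util.List;

final class NestedMatchSketch {
    static final class Book {
        final String publisher;
        final String genre;

        Book(String publisher, String genre) {
            this.publisher = publisher;
            this.genre = genre;
        }
    }

    /** Flattened objects: each condition may be met by a different book. */
    static boolean flattenedMatch(List<Book> books, String publisher, String genre) {
        boolean anyPublisher = books.stream().anyMatch(b -> b.publisher.equals(publisher));
        boolean anyGenre = books.stream().anyMatch(b -> b.genre.equals(genre));
        return anyPublisher && anyGenre;
    }

    /** Nested objects: one and the same book must meet both conditions. */
    static boolean nestedMatch(List<Book> books, String publisher, String genre) {
        return books.stream()
                .anyMatch(b -> b.publisher.equals(publisher) && b.genre.equals(genre));
    }

    public static void main(String[] args) {
        // Hypothetical author: a Penguin crime novel and a sci-fi book from another publisher.
        List<Book> books = List.of(new Book("penguin", "crime"), new Book("gollancz", "scifi"));
        System.out.println(flattenedMatch(books, "penguin", "scifi")); // true: cross-object match
        System.out.println(nestedMatch(books, "penguin", "scifi"));    // false: no single book fits
    }
}
```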
Parent and Child
Indexing happens separately
Specify _parent type in Child mapping (Book)
When indexing Books, specify id ...
curl -XPOST localhost:9200/authors/book/_mapping -d
'{
"book":{
"_parent": {"type": "bare_author"}
}
}'

curl -XPOST local...
Parent and Child query
curl -XPOST localhost:9200/authors/bare_author/
_search -d '{
"query": {
"has_child": {
"type": "bo...
Data Design
Index Configurations
• One index “per user”
• Single index
• SI + Routing: 1 index + custom doc routing to s...
One Index per user
Hulk

Thor

User1 s0

User1 s1

User2 s0

+ different sharding per user
- small users own (and cost) at...
Single Index
Hulk

Thor

Users s0

Users s3

Users s2

+ filter by user id, support growth
- search hits all shards
Single Index + routing
Hulk

Thor

Users s0

Users s3

Users s2

+ a user’s data is all in one shard,
allows large overall...
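Custom routing works because the target shard is derived from a routing value: by default the document id, with custom routing something like the user id. A minimal sketch of the idea follows; `String.hashCode()` is only a stand-in for whatever hash function Elasticsearch uses internally, and the class name and user id are hypothetical.

```java
final class RoutingSketch {
    /** Maps a routing value to a shard number in the range [0, numShards). */
    static int shardFor(String routing, int numShards) {
        // Assumption: hashCode() replaces the real internal hash function.
        return Math.floorMod(routing.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        // Routed by user id: every document of user42 lands on the same shard.
        System.out.println(shardFor("user42", numShards));
        System.out.println(shardFor("user42", numShards));
        // Routed by document id (the default): documents spread across shards.
        for (String docId : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(docId + " -> shard " + shardFor(docId, numShards));
        }
    }
}
```

A search that passes the same routing value only needs to visit that one shard, which is what makes overallocating shards in a single index affordable.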
Index per time range
Hulk

Thor

2013_01 s1

2013_01 s2

2013_02 s1

+ allows change in future indices
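With one index per time window, the application has to derive index names from dates, both when indexing and when searching a range. A small sketch, assuming monthly indices named like 2013_01 as in the diagram; the class and method names are illustrative.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class TimeIndexSketch {
    // Assumption: monthly indices named year_month, e.g. 2013_01.
    private static final DateTimeFormatter NAME = DateTimeFormatter.ofPattern("uuuu_MM");

    /** Name of the index that holds documents for the given month. */
    static String indexFor(YearMonth month) {
        return month.format(NAME);
    }

    /** Names of all monthly indices from `from` to `to`, both inclusive. */
    static List<String> indicesBetween(YearMonth from, YearMonth to) {
        List<String> names = new ArrayList<>();
        for (YearMonth m = from; !m.isAfter(to); m = m.plusMonths(1)) {
            names.add(indexFor(m));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = indicesBetween(YearMonth.of(2013, 1), YearMonth.of(2013, 2));
        // A comma-separated list of index names can be used in a search URL.
        System.out.println(String.join(",", names));
    }
}
```

Because each new index is created separately, its shard count or mapping can differ from earlier months without reindexing old data.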
Data Design - lessons
Test, test, test your use case!
Take a single node with one shard and
throw load at it, checking the...
...ES has lots of other
features!
• Bulk operations
• Percolator (alerts, classification, …)
• Suggesters (“Did you mean …?...
Thanks!
@matteomoci
http://mox.fm
Elasticsearch first-steps

  1. 1. Elasticsearch: first steps with an Aggregate-oriented database Jug Roma 28/11/2013 Matteo Moci
  2. 2. Me Matteo Moci @matteomoci http://mox.fm Software Engineer R&D, new product development
  3. 3. Agenda • 2 Use cases • Elasticsearch Basics • Data Design for scaling
  4. 4. Social Media Analytics Platform for Marketing Agencies
  5. 5. Scenario • Using Elasticsearch as: • Analytics engine • Aggregate repository
  6. 6. Use case 1 • count values distribution over time
  7. 7. Before • ~10M documents • Heaviest query: ~10 minutes • Our staff had a problem
  8. 8. After • ~10M documents • Heaviest query: ~1 second (also with larger dataset)
  9. 9. Use case 2 • Aggregate-oriented repository • ...as in DDD http://ptgmedia.pearsoncmg.com/images/chap10_9780321834577/elementLinks/10fig05.jpg
  10. 10. Elasticsearch Distributed RESTful search and analytics real time data and analytics distributed high availability multi tenancy full-text search schema free RESTful, JSON API
  11. 11. Elasticsearch basics • Install • API • Types mapping • Facets • Relations
  12. 12. Install $ wget https://download.elasticsearch.org/... $ tar -xf elasticsearch-0.90.7.tar.gz
  13. 13. Run!
  14. 14. Run! $ ./elasticsearch-0.90.7/bin/elasticsearch -f es Hulk
  15. 15. Run! $ ./elasticsearch-0.90.7/bin/elasticsearch -f $ ./elasticsearch-0.90.7/bin/elasticsearch -f es Hulk
  16. 16. Run! $ ./elasticsearch-0.90.7/bin/elasticsearch -f $ ./elasticsearch-0.90.7/bin/elasticsearch -f es Hulk Thor
  17. 17. Index a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "name" : "Camera" }'
  18. 18. Search $ curl -X GET 'localhost:9200/products/product/_search?q=Camera'
  19. 19. Shards and Replicas es Hulk Products 1 2 1 2
  20. 20. Shards and Replicas es Hulk Products Thor 1 2 1 2
  21. 21. Shards and Replicas es Hulk Products Thor Products 1 2 1 2
  22. 22. Shards and Replicas es Hulk Products Thor Products 2 1 1 2
  23. 23. Shards and Replicas es Hulk Products Thor Products 2 1 2 1
  24. 24. Integration Hulk Thor 9300 9300
  25. 25. Integration TransportClient Hulk Thor 9300 9300
  26. 26. Async Java API this.client.prepareGet("documents", "document", id) //async, non blocking APIs //use a listener to handle result. non-blocking .execute(new ActionListener<GetResponse>() { @Override public void onResponse(GetResponse getFields) { // } @Override public void onFailure(Throwable e) { // }
  27. 27. Mapping Mappings define how primitive types are stored and analyzed
  28. 28. Mapping • JSON data is parsed on indexing • Mapping is done on first field indexing • Inferred if not configured (!) • Types: float, long, boolean, date (+formatting), object, nested • String type can have arbitrary analyzers • Fields can be split up in more fields
  29. 29. "text": { "type": "multi_field", "fields": { "text": { "type": "string", "index": "analyzed", "index_analyzer": "whitespace", "analyzer": "whitespace" }, "text_bigram": { "type": "string", "index": "analyzed", "index_analyzer": "bigram_analyzer", "search_analyzer": "bigram_analyzer" }, "text_trigram": { "type": "string", "index": "analyzed", "index_analyzer": "trigram_analyzer", "search_analyzer": "trigram_analyzer"
  30. 30. Mapping - lessons • schema can evolve (e.g. add fields) • inferred if not specified (!) • worst case: reindex • use aliases to enable zero downtime
  31. 31. Search with Facets final TermsFacetBuilder userFacet = FacetBuilders.termsFacet(MENTION_FACET_NAME) .field(USER_ID).size(maxUsersAmount); SearchResponse response; response = client.prepareSearch(Indices.USERS) .setTypes(USER_TYPE) .setQuery(someQuery).setSize(0) .setSearchType(SearchType.COUNT) .addFacet(userFacet).execute().actionGet() ; final TermsFacet facets = (TermsFacet) response.getFacets().facetsAsMap() .get(MENTION_FACET_NAME);
  32. 32. Query Facets
  33. 33. Date Histogram Facet The histogram facet works with numeric data by building a histogram across intervals of the field values. Each value is placed in a “bucket”
  34. 34. { "query" : { "match_all" : {} }, "facets" : { "histo1" : { "histogram" : { "field" : "followers", "interval" : 10 } } } }
  35. 35. Facets - lessons • Bug in 0.90.x: https://github.com/elasticsearch/elasticsearch/issues/1305 * • Solutions: use 1 shard, or ask for top 100 instead of 10 * will be solved in 1.0 with aggregation module
  36. 36. Analyzers A Lucene analyzer consists of a tokenizer and an arbitrary amount of filters (+ char filters)
  37. 37. { "index":{ "analysis":{ "filter":{ "bigram_shingle_filter":{ "type":"shingle", "max_shingle_size":2, "min_shingle_size":2, "output_unigrams":"false", "output_unigrams_if_no_shingles":"false" }, "trigram_shingle_filter":{ "type":"shingle", "max_shingle_size":3, "min_shingle_size":3, "output_unigrams":"false", "output_unigrams_if_no_shingles":"false" } }, "analyzer":{ "bigram_analyzer":{ "tokenizer":"whitespace", "filter":[ "standard", "bigram_shingle_filter" ] }, "trigram_analyzer":{ "tokenizer":"whitespace", "filter":[ "standard", "trigram_shingle_filter" ] } } ... } }
  38. 38. Relations between Documents Author 1 N Book • nested: faster reads, update needs reindex, cross object match • parent/child: same shard, no reindex on update, difficult sorting
  39. 39. Nested Documents Specify Book type is “nested” in Author’s Mapping We can query Authors with a query on properties of nested Books “Authors who published at least a book with Penguin, in scifi genre”
  40. 40. curl -XGET localhost:9200/authors/nested_author/ _search -d ' { "query": { "filtered": { "query": {"match_all": {}}, "filter": { "nested": { "path": "books", "query":{ "filtered": { "query": { "match_all": {}}, "filter": { "and": [ {"term": {"books.publisher": "penguin"}}, {"term": {"books.genre": "scifi"}} ]
  41. 41. Parent and Child Indexing happens separately Specify _parent type in Child mapping (Book) When indexing Books, specify id of Author
  42. 42. curl -XPOST localhost:9200/authors/book/_mapping -d '{ "book":{ "_parent": {"type": "bare_author"} } }' curl -XPOST localhost:9200/authors/book/1?parent=2 -d '{ "name": "Revelation Space", "genre": "scifi", "publisher": "penguin" }'
  43. 43. Parent and Child query curl -XPOST localhost:9200/authors/bare_author/ _search -d '{ "query": { "has_child": { "type": "book", "query" : { "filtered": { "query": { "match_all": {}}, "filter" : { "and": [ {"term": {"publisher": "penguin"}}, {"term": {"genre": "scifi"}} ]
  44. 44. Data Design Index Configurations • One index “per user” • Single index • SI + Routing: 1 index + custom doc routing to shards • Time: 1 index per time window * * we can search across indices
  45. 45. One Index per user Hulk Thor User1 s0 User1 s1 User2 s0 + different sharding per user - small users own (and cost) at least 1 shard
  46. 46. Single Index Hulk Thor Users s0 Users s3 Users s2 + filter by user id, support growth - search hits all shards
  47. 47. Single Index + routing Hulk Thor Users s0 Users s3 Users s2 + a user’s data is all in one shard, allows large overallocation
  48. 48. Index per time range Hulk Thor 2013_01 s1 2013_01 s2 2013_02 s1 + allows change in future indices
  49. 49. Data Design - lessons Test, test, test your use case! Take a single node with one shard and throw load at it, checking the shard capacity The shard is the scaling unit: overallocate to enable future scaling #shards > #nodes
  50. 50. ...ES has lots of other features! • Bulk operations • Percolator (alerts, classification, …) • Suggesters (“Did you mean …?”) • Index templates (Automatic index configuration) • Monitoring API (Amount of memory used, number of operations, …) • Plugins • ...
  51. 51. Thanks! @matteomoci http://mox.fm