Elasticsearch first-steps

Elasticsearch: first steps with an aggregate-oriented database

Transcript

  • 1. Elasticsearch: first steps with an Aggregate-oriented database Jug Roma 28/11/2013 Matteo Moci
  • 2. Me Matteo Moci @matteomoci http://mox.fm Software Engineer R&D, new product development
  • 3. Agenda • 2 Use cases • Elasticsearch Basics • Data Design for scaling
  • 4. Social Media Analytics Platform for Marketing Agencies
  • 5. Scenario • Using Elasticsearch as: • Analytics engine • Aggregate repository
  • 6. Use case 1 • count values distribution over time
  • 7. Before • ~10M documents • Heaviest query: ~10 minutes • Our staff had a problem
  • 8. After • ~10M documents • Heaviest query: ~1 second (also with a larger dataset)
  • 9. Use case 2 • Aggregate-oriented repository • ...as in DDD http://ptgmedia.pearsoncmg.com/images/chap10_9780321834577/elementLinks/10fig05.jpg
  • 10. Elasticsearch: distributed RESTful search and analytics • real-time data and analytics • distributed • high availability • multi-tenancy • full-text search • schema-free • RESTful JSON API
  • 11. Elasticsearch basics • Install • API • Types mapping • Facets • Relations
  • 12. Install $ wget https://download.elasticsearch.org/... $ tar -xf elasticsearch-0.90.7.tar.gz
  • 13. Run!
  • 14. Run! $ ./elasticsearch-0.90.7/bin/elasticsearch -f [diagram: es, Hulk]
  • 15. Run! $ ./elasticsearch-0.90.7/bin/elasticsearch -f $ ./elasticsearch-0.90.7/bin/elasticsearch -f [diagram: es, Hulk]
  • 16. Run! $ ./elasticsearch-0.90.7/bin/elasticsearch -f $ ./elasticsearch-0.90.7/bin/elasticsearch -f [diagram: es, Hulk, Thor]
  • 17. Index a document $ curl -X PUT localhost:9200/products/product/1 -d '{ "name" : "Camera" }'
  • 18. Search $ curl -X GET 'localhost:9200/products/product/_search?q=Camera'
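The two curl calls above can also be issued from Java. The following is a minimal sketch using only the JDK's `java.net.http` classes (Java 11+), not the Elasticsearch client shown later in the deck. The class name `ProductRequests` and the `http://localhost:9200` base URL are assumptions for illustration; the index and type names come from the slides.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class ProductRequests {
    // Assumption: a node listens on localhost:9200, as in the curl examples.
    static final String BASE = "http://localhost:9200";

    /** Builds the PUT request that indexes one document under products/product/{id}. */
    static HttpRequest indexProduct(String id, String json) {
        return HttpRequest.newBuilder(URI.create(BASE + "/products/product/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    /** Builds the GET request for a URI search (?q=term) on products/product. */
    static HttpRequest searchProducts(String term) {
        String q = URLEncoder.encode(term, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(BASE + "/products/product/_search?q=" + q))
                .GET()
                .build();
    }
}
```

The sketch only builds the requests; sending one requires a running node, for example with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.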
  • 19. Shards and Replicas es Hulk Products 1 2 1 2
  • 20. Shards and Replicas es Hulk Products Thor 1 2 1 2
  • 21. Shards and Replicas es Hulk Products Thor Products 1 2 1 2
  • 22. Shards and Replicas es Hulk Products Thor Products 2 1 1 2
  • 23. Shards and Replicas es Hulk Products Thor Products 2 1 2 1
  • 24. Integration Hulk Thor 9300 9300
  • 25. Integration TransportClient Hulk Thor 9300 9300
  • 26. Async Java API
      // Async, non-blocking API: a listener handles the result.
      this.client.prepareGet("documents", "document", id)
          .execute(new ActionListener<GetResponse>() {
              @Override
              public void onResponse(GetResponse getFields) {
                  // handle the retrieved document
              }

              @Override
              public void onFailure(Throwable e) {
                  // handle the failure
              }
          });
  • 27. Mapping Mappings define how primitive types are stored and analyzed
  • 28. Mapping • JSON data is parsed on indexing • Mapping is done the first time a field is indexed • Inferred if not configured (!) • Types: float, long, boolean, date (+formatting), object, nested • String type can have arbitrary analyzers • A field can be split into multiple fields
  • 29. Multi-field mapping
      "text": {
        "type": "multi_field",
        "fields": {
          "text": {
            "type": "string",
            "index": "analyzed",
            "index_analyzer": "whitespace",
            "analyzer": "whitespace"
          },
          "text_bigram": {
            "type": "string",
            "index": "analyzed",
            "index_analyzer": "bigram_analyzer",
            "search_analyzer": "bigram_analyzer"
          },
          "text_trigram": {
            "type": "string",
            "index": "analyzed",
            "index_analyzer": "trigram_analyzer",
            "search_analyzer": "trigram_analyzer"
          }
        }
      }
  • 30. Mapping - lessons • the schema can evolve (e.g. add fields) • the mapping is inferred if not specified (!) • worst case: reindex • use aliases to reindex with zero downtime
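The zero-downtime point relies on moving an alias from the old index to the reindexed one in a single call to the `_aliases` endpoint. Below is a minimal sketch that builds the request body for that call. The class name `AliasSwap` and the example names `products`, `products_v1` and `products_v2` are hypothetical, and the helper does no JSON escaping, so it assumes plain index and alias names.

```java
final class AliasSwap {
    /**
     * Builds the body for POST /_aliases that removes the alias from the old
     * index and adds it to the new one within the same request.
     * Assumption: names contain no characters that need JSON escaping.
     */
    static String swapBody(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
                + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
                + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
                + "]}";
    }

    public static void main(String[] args) {
        // Hypothetical names: clients keep querying "products" throughout.
        System.out.println(swapBody("products", "products_v1", "products_v2"));
    }
}
```

Because applications only ever address the alias, the reindex into the new index can run in the background and the switch is a single request.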
  • 31. Search with Facets
      // Terms facet on the user id field, limited to maxUsersAmount entries.
      final TermsFacetBuilder userFacet = FacetBuilders.termsFacet(MENTION_FACET_NAME)
              .field(USER_ID)
              .size(maxUsersAmount);

      // Only the facet is needed, so no hits are requested (size 0, COUNT).
      final SearchResponse response = client.prepareSearch(Indices.USERS)
              .setTypes(USER_TYPE)
              .setQuery(someQuery)
              .setSize(0)
              .setSearchType(SearchType.COUNT)
              .addFacet(userFacet)
              .execute()
              .actionGet();

      final TermsFacet facets = (TermsFacet) response.getFacets().facetsAsMap()
              .get(MENTION_FACET_NAME);
  • 32. Query Facets
  • 33. Date Histogram Facet • The histogram facet works with numeric data by building a histogram across intervals of the field values • Each value is placed in a “bucket”
  • 34. Histogram facet request
      {
        "query": {
          "match_all": {}
        },
        "facets": {
          "histo1": {
            "histogram": {
              "field": "followers",
              "interval": 10
            }
          }
        }
      }
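To make the bucketing concrete, here is a minimal sketch of what a histogram with a fixed interval computes: each value is rounded down to the nearest multiple of the interval, and values are counted per bucket. This is an illustration of the idea in plain Java, not Elasticsearch's implementation; the class name `HistogramSketch` and the sample values are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class HistogramSketch {
    /** Counts values per bucket, where a bucket key is the value rounded down to a multiple of interval. */
    static Map<Long, Long> histogram(long[] values, long interval) {
        Map<Long, Long> buckets = new TreeMap<>();
        for (long v : values) {
            long key = Math.floorDiv(v, interval) * interval;
            buckets.merge(key, 1L, Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Hypothetical "followers" values, bucketed with interval 10 as in the request above.
        long[] followers = {3, 12, 17, 25, 29, 30};
        System.out.println(histogram(followers, 10)); // {0=1, 10=2, 20=2, 30=1}
    }
}
```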
  • 35. Facets - lessons • Bug in 0.90.x: https://github.com/elasticsearch/elasticsearch/issues/1305 (*) • Solutions: use 1 shard, or ask for the top 100 instead of the top 10 • (*) will be solved in 1.0 with the aggregation module
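Both workarounds make sense once you model how a top-N terms count is merged across shards: each shard reports only its own top terms, so a term that is second everywhere can be dropped even when it leads overall. The sketch below is a simplified model of that merge, not the actual 0.90 code; the class name `TermsFacetSketch`, the `shardSize` parameter and the sample counts are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TermsFacetSketch {
    /** Top n entries by count (descending), ties broken by term for determinism. */
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    /** Each shard contributes only its top shardSize terms; the merged counts give the final top size. */
    static List<Map.Entry<String, Long>> facet(List<Map<String, Long>> shards, int size, int shardSize) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shards) {
            for (Map.Entry<String, Long> e : top(shard, shardSize)) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return top(merged, size);
    }

    public static void main(String[] args) {
        // Term "c" is second on both shards but first overall (18 in total).
        List<Map<String, Long>> shards = List.of(
                Map.of("a", 10L, "c", 9L),
                Map.of("b", 10L, "c", 9L));
        System.out.println(facet(shards, 1, 1)); // [a=10]: "c" never left its shards
        System.out.println(facet(shards, 1, 2)); // [c=18]: asking each shard for more fixes it
    }
}
```

With a single shard there is nothing to merge, and asking for a larger top-N makes it less likely that a globally frequent term is cut off on any one shard.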
  • 36. Analyzers • A Lucene analyzer consists of a tokenizer and an arbitrary number of token filters (plus optional char filters)
  • 37. Analysis settings
      {
        "index": {
          "analysis": {
            "filter": {
              "bigram_shingle_filter": {
                "type": "shingle",
                "max_shingle_size": 2,
                "min_shingle_size": 2,
                "output_unigrams": "false",
                "output_unigrams_if_no_shingles": "false"
              },
              "trigram_shingle_filter": {
                "type": "shingle",
                "max_shingle_size": 3,
                "min_shingle_size": 3,
                "output_unigrams": "false",
                "output_unigrams_if_no_shingles": "false"
              }
            },
            "analyzer": {
              "bigram_analyzer": {
                "tokenizer": "whitespace",
                "filter": [ "standard", "bigram_shingle_filter" ]
              },
              "trigram_analyzer": {
                "tokenizer": "whitespace",
                "filter": [ "standard", "trigram_shingle_filter" ]
              }
            }
            ...
          }
        }
      }
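As a rough illustration of what the bigram and trigram analyzers emit, the sketch below splits text on whitespace and joins each run of `size` adjacent tokens with a space, with no unigrams in the output. It is a plain-Java approximation, not Lucene's shingle filter, and it ignores the `standard` token filter from the settings; the class name `ShingleSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ShingleSketch {
    /** Whitespace-tokenizes text and returns every run of `size` adjacent tokens as one term. */
    static List<String> shingles(String text, int size) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + size)));
        }
        // Fewer than `size` tokens yields nothing, mirroring
        // output_unigrams_if_no_shingles = false.
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("elasticsearch first steps", 2)); // [elasticsearch first, first steps]
        System.out.println(shingles("elasticsearch first steps", 3)); // [elasticsearch first steps]
    }
}
```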
  • 38. Relations between Documents • Author 1:N Book • nested: faster reads, update needs reindex, cross object match • parent/child: same shard, no reindex on update, difficult sorting
  • 39. Nested Documents • Specify that the Book type is “nested” in the Author mapping • We can query Authors with a query on properties of nested Books • Example: “Authors who published at least one book with Penguin in the scifi genre”
  • 40. Nested query
      curl -XGET localhost:9200/authors/nested_author/_search -d '{
        "query": {
          "filtered": {
            "query": {"match_all": {}},
            "filter": {
              "nested": {
                "path": "books",
                "query": {
                  "filtered": {
                    "query": {"match_all": {}},
                    "filter": {
                      "and": [
                        {"term": {"books.publisher": "penguin"}},
                        {"term": {"books.genre": "scifi"}}
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      }'
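The reason the Book type has to be declared nested is that plain object arrays are flattened per field, so the link between one book's publisher and that same book's genre is lost. The sketch below contrasts the two matching behaviours in plain Java; it is a conceptual model rather than Elasticsearch code, and the class name `NestedSketch` and the sample books are hypothetical.

```java
import java.util.List;

final class NestedSketch {
    static final class Book {
        final String publisher;
        final String genre;

        Book(String publisher, String genre) {
            this.publisher = publisher;
            this.genre = genre;
        }
    }

    /** Flattened fields: any book may supply the publisher and any other book the genre. */
    static boolean flattenedMatch(List<Book> books, String publisher, String genre) {
        boolean anyPublisher = books.stream().anyMatch(b -> b.publisher.equals(publisher));
        boolean anyGenre = books.stream().anyMatch(b -> b.genre.equals(genre));
        return anyPublisher && anyGenre;
    }

    /** Nested documents: both conditions must hold for the same book. */
    static boolean nestedMatch(List<Book> books, String publisher, String genre) {
        return books.stream().anyMatch(b -> b.publisher.equals(publisher) && b.genre.equals(genre));
    }

    public static void main(String[] args) {
        // One author: a Penguin romance and a scifi title from another publisher.
        List<Book> books = List.of(new Book("penguin", "romance"), new Book("tor", "scifi"));
        System.out.println(flattenedMatch(books, "penguin", "scifi")); // true, a false positive
        System.out.println(nestedMatch(books, "penguin", "scifi"));    // false
    }
}
```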
  • 41. Parent and Child • Indexing happens separately • Specify the _parent type in the Child mapping (Book) • When indexing Books, specify the id of the Author
  • 42. Parent and Child mapping and indexing
      curl -XPOST localhost:9200/authors/book/_mapping -d '{
        "book": {
          "_parent": {"type": "bare_author"}
        }
      }'

      curl -XPOST 'localhost:9200/authors/book/1?parent=2' -d '{
        "name": "Revelation Space",
        "genre": "scifi",
        "publisher": "penguin"
      }'
  • 43. Parent and Child query
      curl -XPOST localhost:9200/authors/bare_author/_search -d '{
        "query": {
          "has_child": {
            "type": "book",
            "query": {
              "filtered": {
                "query": {"match_all": {}},
                "filter": {
                  "and": [
                    {"term": {"publisher": "penguin"}},
                    {"term": {"genre": "scifi"}}
                  ]
                }
              }
            }
          }
        }
      }'
  • 44. Data Design: Index Configurations • One index “per user” • Single index • SI + Routing: 1 index + custom doc routing to shards • Time: 1 index per time window (*) • (*) we can search across indices
  • 45. One Index per user Hulk Thor User1 s0 User1 s1 User2 s0 + different sharding per user - small users own (and cost) at least 1 shard
  • 46. Single Index Hulk Thor Users s0 Users s3 Users s2 + filter by user id, support growth - search hits all shards
  • 47. Single Index + routing Hulk Thor Users s0 Users s3 Users s2 + a user’s data is all in one shard, allows large overallocation
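Custom routing works because the target shard is derived from a hash of the routing value, so every document indexed with the same value lands on the same shard. The sketch below shows the idea with `String.hashCode()` as a stand-in; Elasticsearch uses its own hash function, so the shard numbers here are illustrative only, and the class name `RoutingSketch` and the user ids are hypothetical.

```java
final class RoutingSketch {
    /**
     * Maps a routing value to a shard number in [0, numberOfShards).
     * Assumption: String.hashCode() stands in for Elasticsearch's own hash.
     */
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int shards = 4;
        // The same user id always resolves to the same shard.
        System.out.println(shardFor("user1", shards) == shardFor("user1", shards)); // true
        System.out.println(shardFor("user2", shards));
    }
}
```

In practice the routing value is passed as the `routing` request parameter when indexing and searching, which lets a per-user search touch one shard instead of all of them. Since the shard count is part of the formula, it cannot change without reindexing, which is one reason to overallocate shards up front.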
  • 48. Index per time range Hulk Thor 2013_01 s1 2013_01 s2 2013_02 s1 + allows change in future indices
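With one index per time window, the application has to derive the index name from a document's date when writing and list the indices that cover a date range when searching. A minimal sketch, assuming monthly windows and the `2013_01` naming shown in the diagram; the class name `TimeIndices` is hypothetical.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class TimeIndices {
    // Assumption: one index per calendar month, named like "2013_01".
    private static final DateTimeFormatter NAME = DateTimeFormatter.ofPattern("yyyy_MM");

    /** Index name for the month a document belongs to. */
    static String indexFor(YearMonth month) {
        return month.format(NAME);
    }

    /** Names of all monthly indices from `from` to `to`, inclusive. */
    static List<String> indicesBetween(YearMonth from, YearMonth to) {
        List<String> names = new ArrayList<>();
        for (YearMonth m = from; !m.isAfter(to); m = m.plusMonths(1)) {
            names.add(indexFor(m));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = indicesBetween(YearMonth.of(2012, 12), YearMonth.of(2013, 2));
        // Comma-separated index names can be used in the path of a multi-index search.
        System.out.println(String.join(",", names)); // 2012_12,2013_01,2013_02
    }
}
```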
  • 49. Data Design - lessons • Test, test, test your use case! • Take a single node with one shard and throw load at it to check the shard capacity • The shard is the scaling unit: overallocate to enable future scaling • #shards > #nodes
  • 50. ...ES has lots of other features! • Bulk operations • Percolator (alerts, classification, …) • Suggesters (“Did you mean …?”) • Index templates (automatic index configuration) • Monitoring API (amount of memory used, number of operations, …) • Plugins • ...
  • 51. Thanks! @matteomoci http://mox.fm