Elasticsearch first-steps

Elasticsearch: first steps with an aggregate-oriented database

Usage Rights

© All Rights Reserved

    Elasticsearch first-steps: Presentation Transcript

    • Elasticsearch: first steps with an Aggregate-oriented database Jug Roma 28/11/2013 Matteo Moci
    • Me Matteo Moci @matteomoci http://mox.fm Software Engineer R&D, new product development
    • Agenda • 2 Use cases • Elasticsearch Basics • Data Design for scaling
    • Social Media Analytics Platform for Marketing Agencies
    • Scenario • Using Elasticsearch as: • Analytics engine • Aggregate repository
    • Use case 1 • count values distribution over time
    • Before • ~10M documents • Heaviest query: ~10 minutes • Our staff had a problem
    • After • ~10M documents • Heaviest query: ~1 second (also with a larger dataset)
    • Use case 2 • Aggregate-oriented repository • ...as in DDD http://ptgmedia.pearsoncmg.com/images/chap10_9780321834577/elementLinks/10fig05.jpg
    • Elasticsearch • Distributed RESTful search and analytics • real time data and analytics • distributed • high availability • multi tenancy • full-text search • schema free • RESTful, JSON API
    • Elasticsearch basics • Install • API • Types mapping • Facets • Relations
    • Install
          $ wget https://download.elasticsearch.org/...
          $ tar -xf elasticsearch-0.90.7.tar.gz
    • Run!
          $ ./elasticsearch-0.90.7/bin/elasticsearch -f    # first node (Hulk)
          $ ./elasticsearch-0.90.7/bin/elasticsearch -f    # second node (Thor)
      [diagram: cluster "es" with nodes Hulk and Thor]
    • Index a document
          $ curl -X PUT 'localhost:9200/products/product/1' -d '{ "name" : "Camera" }'
    • Search
          $ curl -X GET 'localhost:9200/products/product/_search?q=Camera'
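
The two calls above share one URL pattern: /index/type/id addresses a document, and /index/type/_search runs a query. A minimal Python sketch of that pattern, assuming the default HTTP port from the slides; the helper names `document_url` and `search_url` are made up for illustration and nothing here contacts a server.

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:9200"  # host and port as used on the slides


def document_url(index, doc_type, doc_id):
    """URL addressed by PUT/GET for a single document (hypothetical helper)."""
    return "/".join([BASE, quote(index), quote(doc_type), quote(str(doc_id))])


def search_url(index, doc_type, query_string):
    """URL for a URI search, i.e. the `q=` form shown on the Search slide."""
    path = "/".join([BASE, quote(index), quote(doc_type), "_search"])
    return path + "?" + urlencode({"q": query_string})


print(document_url("products", "product", 1))
print(search_url("products", "product", "Camera"))
```
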
    • Shards and Replicas [diagram, built up in steps: cluster "es" starts with node Hulk holding the Products index (shards 1 and 2); node Thor joins, and the copies of shards 1 and 2 are spread across Hulk and Thor]
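
The diagram can be approximated with a toy placement model. This is only a sketch, not Elasticsearch's real allocator: it assumes a simple round-robin rule and encodes the one constraint that matters here, namely that a replica is never placed on the node that already holds the same shard. The function name `allocate` and the labels (`1p` = primary of shard 1, `1r` = its replica) are invented for the example.

```python
def allocate(num_shards, num_replicas, nodes):
    """Toy placement: copy c of shard s goes to node (s + c) mod len(nodes).

    A copy that would land on a node already holding the same shard is left
    unassigned, since a replica never shares a node with its primary.
    """
    placement = {node: [] for node in nodes}
    unassigned = []
    for shard in range(1, num_shards + 1):
        used = set()
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            label = f"{shard}{'p' if copy == 0 else 'r'}"
            if node in used:
                unassigned.append(label)
            else:
                used.add(node)
                placement[node].append(label)
    return placement, unassigned


# One node: replicas have nowhere to go.
print(allocate(2, 1, ["Hulk"]))
# Two nodes: every shard has a copy on each node.
print(allocate(2, 1, ["Hulk", "Thor"]))
```

With a single node the replicas stay unassigned; once the second node joins, each node ends up with one copy of every shard, which is the end state the slides show.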
    • Integration • TransportClient [diagram: the client connects to nodes Hulk and Thor, each on port 9300]
    • Async Java API
          // Asynchronous, non-blocking API: a listener handles the result
          this.client.prepareGet("documents", "document", id)
              .execute(new ActionListener<GetResponse>() {
                  @Override
                  public void onResponse(GetResponse getFields) {
                      // handle the retrieved document
                  }

                  @Override
                  public void onFailure(Throwable e) {
                      // handle the failure
                  }
              });
    • Mapping Mappings define how primitive types are stored and analyzed
    • Mapping • JSON data is parsed on indexing • Mapping is done on first field indexing • Inferred if not configured (!) • Types: float, long, boolean, date (+formatting), object, nested • String type can have arbitrary analyzers • A field can be split into several fields (multi_field)
    • "text": {
        "type": "multi_field",
        "fields": {
          "text": {
            "type": "string",
            "index": "analyzed",
            "index_analyzer": "whitespace",
            "analyzer": "whitespace"
          },
          "text_bigram": {
            "type": "string",
            "index": "analyzed",
            "index_analyzer": "bigram_analyzer",
            "search_analyzer": "bigram_analyzer"
          },
          "text_trigram": {
            "type": "string",
            "index": "analyzed",
            "index_analyzer": "trigram_analyzer",
            "search_analyzer": "trigram_analyzer"
          }
        }
      }
    • Mapping - lessons • schema can evolve (e.g. add fields) • inferred if not specified (!) • worst case: reindex • use aliases to enable zero downtime
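
The alias trick in the last bullet can be sketched as follows: clients always talk to an alias, and after reindexing into a new index the alias is switched in one request to the `_aliases` endpoint. The index names `products_v1` and `products_v2` and the helper `alias_swap` are hypothetical; the code only builds the request body and does not send it.

```python
import json


def alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases that moves `alias` from old_index to new_index.

    Both actions travel in the same request, so readers of the alias never
    see a moment where it points at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap("products", "products_v1", "products_v2")
print(json.dumps(body, indent=2))
```
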
    • Search with Facets
          // Terms facet on the user id field, limited to maxUsersAmount entries
          final TermsFacetBuilder userFacet = FacetBuilders.termsFacet(MENTION_FACET_NAME)
              .field(USER_ID)
              .size(maxUsersAmount);
          // Size 0 with SearchType.COUNT: only facet counts come back, no hits
          final SearchResponse response = client.prepareSearch(Indices.USERS)
              .setTypes(USER_TYPE)
              .setQuery(someQuery)
              .setSize(0)
              .setSearchType(SearchType.COUNT)
              .addFacet(userFacet)
              .execute()
              .actionGet();
          final TermsFacet facets = (TermsFacet) response.getFacets().facetsAsMap()
              .get(MENTION_FACET_NAME);
    • Query Facets
    • Date Histogram Facet • The histogram facet works with numeric data by building a histogram across intervals of the field values. Each value is placed in a “bucket”.
    • {
        "query": {
          "match_all": {}
        },
        "facets": {
          "histo1": {
            "histogram": {
              "field": "followers",
              "interval": 10
            }
          }
        }
      }
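
What the histogram facet computes for this request can be sketched in a few lines of Python: each value is rounded down to the start of its interval, and the values per bucket are counted. The sample `followers` values are invented, and the sketch covers non-negative integers only.

```python
from collections import Counter


def histogram(values, interval):
    """Count values per bucket, where a bucket key is the interval's lower bound."""
    buckets = Counter((value // interval) * interval for value in values)
    return dict(sorted(buckets.items()))


followers = [3, 7, 12, 18, 25, 29, 31]  # made-up sample data
print(histogram(followers, 10))  # {0: 2, 10: 2, 20: 2, 30: 1}
```
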
    • Facets - lessons • Bug in 0.90.x: https://github.com/elasticsearch/elasticsearch/issues/1305 (*) • Solutions: • use 1 shard • ask for top 100 instead of 10 • (*) will be solved in 1.0 with the aggregation module
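
Why both workarounds help can be shown with a toy model. It assumes the problem is that every shard reports only its own top terms and the coordinating node sums what it receives; that reading is an assumption here, not a quote from the linked issue. The shard contents and the function name `merged_top_terms` are invented.

```python
from collections import Counter


def merged_top_terms(shards, size):
    """Each shard returns only its own top `size` terms; the results are summed."""
    merged = Counter()
    for counts in shards:
        for term, count in Counter(counts).most_common(size):
            merged[term] += count
    return merged.most_common(size)


shard_a = {"a": 10, "b": 9, "c": 8}
shard_b = {"c": 10, "d": 9, "a": 1}

# True totals are c=18 and a=11, but with size 2 shard_a never reports "c".
print(merged_top_terms([shard_a, shard_b], 2))
# Asking each shard for more terms recovers the right totals for the top ones.
print(merged_top_terms([shard_a, shard_b], 3))
```

With a single shard there is nothing to merge, so the counts are exact. With several shards, requesting a larger top list and trimming it client-side reduces the error.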
    • Analyzers • A Lucene analyzer consists of a tokenizer and an arbitrary number of token filters (plus char filters)
    • {
        "index": {
          "analysis": {
            "filter": {
              "bigram_shingle_filter": {
                "type": "shingle",
                "max_shingle_size": 2,
                "min_shingle_size": 2,
                "output_unigrams": "false",
                "output_unigrams_if_no_shingles": "false"
              },
              "trigram_shingle_filter": {
                "type": "shingle",
                "max_shingle_size": 3,
                "min_shingle_size": 3,
                "output_unigrams": "false",
                "output_unigrams_if_no_shingles": "false"
              }
            },
            "analyzer": {
              "bigram_analyzer": {
                "tokenizer": "whitespace",
                "filter": [ "standard", "bigram_shingle_filter" ]
              },
              "trigram_analyzer": {
                "tokenizer": "whitespace",
                "filter": [ "standard", "trigram_shingle_filter" ]
              }
            }
          }
          ...
        }
      }
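
What these analyzers emit can be approximated in Python: split on whitespace, then join every run of `size` adjacent tokens into one term. This is a rough sketch of the whitespace tokenizer plus a shingle filter with unigram output switched off; it ignores the `standard` filter and is not Lucene's implementation.

```python
def shingles(text, size):
    """Word n-grams of exactly `size` tokens; no unigrams, no shorter fallback."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(shingles("revelation space by alastair reynolds", 2))
print(shingles("revelation space by alastair reynolds", 3))
print(shingles("camera", 2))  # too short for a bigram, so nothing is emitted
```
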
    • Relations between Documents • Author 1:N Book • nested: faster reads, update needs reindex, cross object match • parent/child: same shard, no reindex on update, difficult sorting
    • Nested Documents • Specify that the Book type is “nested” in the Author’s mapping • We can query Authors with a query on properties of nested Books • Example: “Authors who published at least one book with Penguin, in the scifi genre”
    • curl -XGET 'localhost:9200/authors/nested_author/_search' -d '
      {
        "query": {
          "filtered": {
            "query": { "match_all": {} },
            "filter": {
              "nested": {
                "path": "books",
                "query": {
                  "filtered": {
                    "query": { "match_all": {} },
                    "filter": {
                      "and": [
                        { "term": { "books.publisher": "penguin" } },
                        { "term": { "books.genre": "scifi" } }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      }'
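
The point of the nested type is that both terms must match inside the same book. A small in-memory sketch of the difference, with made-up data and helper names: plain object fields behave as if all books were pooled together, while nested matching checks one book at a time.

```python
author = {
    "name": "Example Author",  # invented sample document
    "books": [
        {"publisher": "penguin", "genre": "crime"},
        {"publisher": "orbit", "genre": "scifi"},
    ],
}


def matches_flattened(doc, **terms):
    """Object-style matching: each term may be satisfied by a different book."""
    return all(
        any(book.get(field) == value for book in doc["books"])
        for field, value in terms.items()
    )


def matches_nested(doc, **terms):
    """Nested-style matching: a single book must satisfy every term."""
    return any(
        all(book.get(field) == value for field, value in terms.items())
        for book in doc["books"]
    )


print(matches_flattened(author, publisher="penguin", genre="scifi"))  # True
print(matches_nested(author, publisher="penguin", genre="scifi"))     # False
```

The flattened check reports a match although no single book is a Penguin scifi title; the nested check does not.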
    • Parent and Child • Indexing happens separately • Specify the _parent type in the Child mapping (Book) • When indexing Books, specify the id of the Author
    • curl -XPOST 'localhost:9200/authors/book/_mapping' -d '{
        "book": {
          "_parent": { "type": "bare_author" }
        }
      }'
      curl -XPOST 'localhost:9200/authors/book/1?parent=2' -d '{
        "name": "Revelation Space",
        "genre": "scifi",
        "publisher": "penguin"
      }'
    • Parent and Child query
      curl -XPOST 'localhost:9200/authors/bare_author/_search' -d '{
        "query": {
          "has_child": {
            "type": "book",
            "query": {
              "filtered": {
                "query": { "match_all": {} },
                "filter": {
                  "and": [
                    { "term": { "publisher": "penguin" } },
                    { "term": { "genre": "scifi" } }
                  ]
                }
              }
            }
          }
        }
      }'
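
The semantics of `has_child` can be mimicked in memory: children are stored separately and carry the id of their parent, and the query returns the parents that have at least one matching child. The sample documents and the function name are invented; this illustrates the result set only, not how Elasticsearch executes the query.

```python
authors = {1: "Author One", 2: "Author Two"}  # hypothetical parent documents
books = [  # child documents, each indexed with the id of its parent
    {"parent": 2, "name": "Revelation Space", "genre": "scifi", "publisher": "penguin"},
    {"parent": 1, "name": "Some Thriller", "genre": "crime", "publisher": "penguin"},
]


def has_child(parents, children, **terms):
    """Ids of parents with at least one child matching all the given terms."""
    matching = {
        child["parent"]
        for child in children
        if all(child.get(field) == value for field, value in terms.items())
    }
    return sorted(pid for pid in parents if pid in matching)


print(has_child(authors, books, publisher="penguin", genre="scifi"))  # [2]
```

Because a book is its own document, changing it does not touch the author document, which is the "no reindex on update" advantage listed earlier.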
    • Data Design • Index configurations: • One index “per user” • Single index • SI + Routing: 1 index + custom doc routing to shards • Time: 1 index per time window (*) • (*) we can search across indices
    • One Index per user [diagram: nodes Hulk and Thor holding shards User1 s0, User1 s1, User2 s0] • + different sharding per user • - small users own (and cost) at least 1 shard
    • Single Index [diagram: nodes Hulk and Thor holding shards Users s0, Users s3, Users s2] • + filter by user id, supports growth • - search hits all shards
    • Single Index + routing [diagram: nodes Hulk and Thor holding shards Users s0, Users s3, Users s2] • + a user’s data is all in one shard, allows large overallocation
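
Routing boils down to picking the shard from a hash of the routing value instead of the document id. A minimal sketch, assuming 5 shards and using CRC32 as a stand-in hash (Elasticsearch uses its own hash function, so real shard numbers will differ); the document ids and user names are invented.

```python
import zlib

NUM_SHARDS = 5  # assumed shard count for the example


def shard_for(routing_value, num_shards):
    """Shard chosen for a routing value; CRC32 is only a stand-in hash."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


# Ten made-up documents belonging to two users.
docs = [("doc-%d" % i, "user1" if i % 2 else "user2") for i in range(10)]

shards_by_user = {}
for doc_id, user in docs:
    # Route by user id rather than by document id.
    shards_by_user.setdefault(user, set()).add(shard_for(user, NUM_SHARDS))

print(shards_by_user)  # every user maps to exactly one shard
```

A search that passes the same routing value only needs to visit that one shard, which is why a single index can be overallocated with many shards without every query paying for all of them.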
    • Index per time range [diagram: nodes Hulk and Thor holding shards 2013_01 s1, 2013_01 s2, 2013_02 s1] • + allows changes in future indices
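
With one index per month, a query over a date range only has to name the indices that cover it, joined by commas in the request path. A sketch that derives those names, assuming the `YYYY_MM` naming seen in the diagram; the helper `monthly_indices` is made up.

```python
from datetime import date


def monthly_indices(start, end):
    """Index names (YYYY_MM) for every month from start to end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{year}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


names = monthly_indices(date(2012, 11, 15), date(2013, 2, 1))
print("/" + ",".join(names) + "/_search")
# /2012_11,2012_12,2013_01,2013_02/_search
```

New months get new indices, so settings such as the shard count can change for future data without touching the old indices.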
    • Data Design - lessons • Test, test, test your use case! • Take a single node with one shard and throw load at it to find the shard's capacity • The shard is the scaling unit: overallocate to enable future scaling • #shards > #nodes
    • ...ES has lots of other features! • Bulk operations • Percolator (alerts, classification, …) • Suggesters (“Did you mean …?”) • Index templates (automatic index configuration) • Monitoring API (amount of memory used, number of operations, …) • Plugins • ...
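
Why overallocation matters can be seen with simple arithmetic: a shard cannot be split across nodes, so a node added beyond the shard count has nothing to take over. A toy sketch that spreads primary shards evenly and ignores replicas; the function name is invented.

```python
def shards_per_node(num_shards, num_nodes):
    """Even spread of whole shards over nodes; a shard is never split."""
    base, extra = divmod(num_shards, num_nodes)
    return [base + (1 if i < extra else 0) for i in range(num_nodes)]


print(shards_per_node(2, 3))  # [1, 1, 0]: the third node stays idle
print(shards_per_node(6, 1))  # [6]: overallocated, everything on one node for now
print(shards_per_node(6, 3))  # [2, 2, 2]: the same index spreads as nodes are added
```
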
    • Thanks! @matteomoci http://mox.fm