Scalable Data Models with Elasticsearch

At bol.com, a leading ecommerce platform in the Netherlands, we have done extensive research into what it would take to use Elasticsearch as the main search provider. We will explain the specific challenges and requirements of running an Elasticsearch cluster at bol.com scale, and show how we used generated data to run performance and scalability tests on different ways to map a hierarchical data model onto Elasticsearch. We will describe the benefits and drawbacks of the different data model options, and their consequences for the design of the index and search applications.

Published in: Software

  1. 1. Elasticsearch Meetup | Amsterdam | April 7, 2016 Maarten Roosendaal & Anne Veling
  2. 2. introduction • Anne Veling – Elasticsearch consultancy and custom training – Performance and Stability Troubleshooting – Software Architect, Team Lead
  3. 3. bol.com challenge • Hierarchical data model, multiple levels • High volume – searches – data changes • Complex query requirements – Both Product and Offer fields in query – Facet on both levels
  4. 4. Products and Offers faster indexing faster searching
  5. 5. Test Data Creation • Node.js script creating random data – Product • Title: two random nouns from a noun list • Category: pick one out of 26 nouns • Half have no offer, half have between 1 and 4 – Offer • Random price between 1 and 20 • Seller: pick one out of 10k • Stream in memory, flush out to disk in 3 flavors – Each flavor keeping its own bulk size of 100k – For 1M, 10M and 100M products
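The generation rules on this slide can be sketched in a few lines. The original script was written in Node.js and is not shown in the deck; the Python version below is only an illustration under stated assumptions. The noun list, the function names and the seeding are hypothetical, and fields whose generation rule is not given on the slides (familyId, stock, deliveryCode) are left out.

```python
import random

# Hypothetical stand-in for the noun list; the real list is not in the slides.
NOUNS = ["lunchroom", "representative", "crime", "garden", "bicycle", "lamp"]
# The slides pick a category out of 26 nouns; a shorter list is assumed here.
CATEGORIES = NOUNS
NUM_SELLERS = 10_000


def make_product(index, rng):
    """Build one product and its list of 0 to 4 offers."""
    product = {
        "id": f"product{index}",
        "title": " ".join(rng.choice(NOUNS) for _ in range(2)),
        "category": rng.choice(CATEGORIES),
    }
    # Half of the products get no offer, the other half get 1 to 4.
    offer_count = 0 if rng.random() < 0.5 else rng.randint(1, 4)
    offers = [
        {
            "seller": f"seller{rng.randint(1, NUM_SELLERS)}",
            "price": rng.randint(1, 20),
        }
        for _ in range(offer_count)
    ]
    return product, offers


def generate(count, seed=0):
    """Yield (product, offers) pairs lazily so large sets can be streamed."""
    rng = random.Random(seed)
    for index in range(count):
        yield make_product(index, rng)
```

Because `generate` is a generator, a 100M-product run would not need to hold the whole data set in memory before writing it out.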
  6. 6. Document { "seller": "seller1203", "price": 7, "stock": 2, "deliveryCode": 1, "product": { "id": "product95826", "familyId": "family56744", "title": "lunchroom representative", "category": "crime" } }
  7. 7. Nested
  8. 8. Nested { "_id": "product95826", "familyId": "family56744", "title": "lunchroom representative", "category": "crime", "offers": [ { "seller": "seller1203", "price": 7, "stock": 2, "deliveryCode": 1 } ] }
  9. 9. Parent/Child { "_id": "product95826", "familyId": "family56744", "title": "lunchroom representative", "category": "crime" } { "_parent": "product95826", "seller": "seller1203", "price": 7, "stock": 2, "deliveryCode": 1 }
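The three flavors shown on the last few slides (Document, Nested, Parent/Child) can all be derived from the same product and its offers. The sketch below is illustrative only: the function names are hypothetical, and it mirrors the slide notation by placing `_id` and `_parent` inside the documents, whereas in a real bulk request they would normally travel in the action metadata. How an offerless product looks in the Document flavor is not shown in the deck, so the single product-only document is an assumption.

```python
def to_doc_flavor(product, offers):
    """Document flavor: one document per offer, product embedded in each."""
    if not offers:
        # Assumption: a product without offers still yields one document.
        return [{"product": dict(product)}]
    return [{**offer, "product": dict(product)} for offer in offers]


def to_nested_flavor(product, offers):
    """Nested flavor: one document per product, offers as a nested array."""
    doc = {"_id": product["id"]}
    doc.update({k: v for k, v in product.items() if k != "id"})
    doc["offers"] = [dict(offer) for offer in offers]
    return [doc]


def to_parent_child_flavor(product, offers):
    """Parent/Child flavor: one parent document plus one child per offer."""
    parent = {"_id": product["id"]}
    parent.update({k: v for k, v in product.items() if k != "id"})
    children = [{"_parent": product["id"], **offer} for offer in offers]
    return [parent] + children
```

Fed with the product and offer from the slides, each function returns the corresponding example document(s) shown above.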
  10. 10. Getting it there • Zipped data files – 1M: 86MB – 10M: 860MB – 100M: 8.6GB
  11. 11. Indexing?
  12. 12. Indexing • 1M product set, local naive – 80s Document – 41s Nested – 64s Parent/Child • ES index bottleneck: – Your source system and latency – ES can slurp it up faster than you can serve it
  13. 13. Let’s take a break
  14. 14. Use Cases (legend: • Product • Offer)
      Use Case A – Product Search: Word in Title; Order By: Relevance; Display for top N products: Product Fields, Cheapest Offer fields; Aggregate On: Category, ∀ Offer SellerId, ∀ Offer Price, ∀ Offer DeliveryCode
      Use Case B – Product Search: Word in Title, ∃ DeliveryC = 0; Order By: Relevance; Display for top N products: Product Fields, Correct Cheapest Offer fields; Aggregate On: Category, ∀ Correct Offers SellerId, ∀ Correct Offers Price, ∀ Correct Offers DeliveryCode
      Use Case C – Product Search: Word in Title, ∃ Price < P; Order By: (Lowest) Price; Display for top N products: Product Fields, Cheapest Offer fields; Aggregate On: Category, ∀ Correct Offers SellerId, ∀ Correct Offers Price, ∀ Correct Offers DeliveryCode
  15. 15. Use Cases D: query B, roll up by family • Families (with products with offers) – with product.title:lunchroom – filter by product.offer.deliveryCode:tomorrow
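For the nested model, the search part of use cases B and C (a word in the title plus the existence of at least one matching offer) maps naturally onto a `bool` query with a `nested` clause. The sketch below only builds the request body as a Python dictionary. The helper name and its parameters are hypothetical, the field names follow the nested example document from the slides, and this is not necessarily the query the authors ran.

```python
import json


def nested_product_search(word, delivery_code=None, max_price=None):
    """Build a search body: title contains `word` and some offer matches."""
    must = [{"term": {"title": word}}]

    offer_clauses = []
    if delivery_code is not None:
        # Use case B: at least one offer with the given delivery code.
        offer_clauses.append({"term": {"offers.deliveryCode": delivery_code}})
    if max_price is not None:
        # Use case C: at least one offer priced below the threshold.
        offer_clauses.append({"range": {"offers.price": {"lt": max_price}}})

    if offer_clauses:
        must.append({
            "nested": {
                "path": "offers",
                # All conditions must hold on the same offer.
                "query": {"bool": {"must": offer_clauses}},
            }
        })
    return {"query": {"bool": {"must": must}}}


if __name__ == "__main__":
    print(json.dumps(nested_product_search("lunchroom", delivery_code=0), indent=2))
```

Sorting by lowest price and the offer-level aggregations are left out here; they need additional clauses beyond this minimal search body.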
  16. 16. Searching for a lunchroom How hard can it be?
  17. 17. Let’s search POST /boltest1m_doc/_search -> 3046 { "query": { "term": { "product.title": { "value": "lunchroom" } } } } POST /boltest1m_nested/_search -> 2026 { "query": { "term": { "title": { "value": "lunchroom" } } } } POST /boltest1m_parentchild/_search -> 2022 { "query": { "has_parent": { "parent_type": "product", "query": { "term": { "title": { "value": "lunchroom" } } } } } }
  18. 18. Elasticsearch docs (and Lucene docs)
      Product with | Doc | Nested | Parent/Child
      no offer     | 1   | 1 (1)  | 1
      1 offer      | 1   | 1 (2)  | 2
      2 offers     | 2   | 1 (3)  | 3
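The counts on this slide follow a simple rule per model, which also explains why the three hit totals on the previous slide differ. As a small worked example (the function name is hypothetical, and the rule is inferred from the three rows shown rather than stated in the deck):

```python
def doc_counts(offer_count):
    """Return (Elasticsearch docs, Lucene docs) per model for one product."""
    return {
        # Document: one document per offer, or a single one without offers.
        "doc": (max(1, offer_count), max(1, offer_count)),
        # Nested: one visible document; each offer is a hidden Lucene doc.
        "nested": (1, 1 + offer_count),
        # Parent/Child: the product plus one child document per offer.
        "parent_child": (1 + offer_count, 1 + offer_count),
    }
```

A hit count on the nested index therefore counts products, while a hit count on the Document index counts offers (plus offerless products).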
  19. 19. Real Queries • Add Details, Sorting • Product Facets – Category • Offer Facets – Seller ID – Price Buckets – Delivery Code Compare the numbers… Explain the differences...
  20. 20. A: Doc
  21. 21. A: Nested
  22. 22. A: Parent/Child
  23. 23. Query Tips • Use aggregations – Cardinality – top_hits ♥️ (with top_score) • Smart Grouping & Field Collapsing • Slooooow 😢 – inner_hits • Don’t forget post-filtering or result page lookup
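To make the aggregation tip concrete: in the Document model every hit is an offer, so getting one row per product means grouping offers by product and keeping the best one. A minimal sketch of such a request body is shown below, using the `cardinality`, `terms` and `top_hits` aggregations. The helper name is hypothetical, the field names follow the Document example from the slides, and this is an assumed shape rather than the authors' actual query.

```python
def cheapest_offer_per_product(word, top_n=10):
    """Group matching offers by product and keep the cheapest offer of each."""
    return {
        "size": 0,  # only aggregation results are needed, not raw hits
        "query": {"term": {"product.title": word}},
        "aggs": {
            # Approximate number of distinct matching products.
            "product_count": {"cardinality": {"field": "product.id"}},
            "products": {
                "terms": {"field": "product.id", "size": top_n},
                "aggs": {
                    "cheapest_offer": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"price": {"order": "asc"}}],
                        }
                    }
                },
            },
        },
    }
```

By default a `terms` aggregation orders buckets by document count, not by relevance, so ordering products by score needs an extra sub-aggregation and a bucket order on top of this sketch.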
  24. 24. Ice Cream Bounty for making top_hits aggregation fast
  25. 25. Testing
  26. 26. Results [Two bar charts comparing doc, nested and parentchild on use cases a, b, c and d: "1m tun 30102015 32 GB new queries" (vertical axis 0 to 200) and "10m tun 30102015 32 GB new queries" (vertical axis 0 to 3500)]
  27. 27. Conclusions • Parent/Child has limitations – Combining cross-level queries with aggregations in one go • Doc not as fast as we’d expected – Because we needed top_hits aggregation • Elasticsearch scales predictably
  28. 28. Conclusions • For us, nested was the best solution • What is yours? • What are you searching for? – What are the rows? – What are the facets about?
  29. 29. Lessons Learned • Testing the scalability of your data model – Fast iterations early on – Valuable insight in indexing and search requirements • Data Modeling is hard – Do it early – Make it fun
  30. 30. Tech Lessons Learned • Don’t forget to tune the ES cluster – Configure memory ;) • If the last line of a bulk file has no \n, it gets ignored! – count the differences • 100k bulk files with .000 suffixes ought to be enough for everyone, right? • Do not underestimate Sneakernet
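The trailing-newline lesson is easy to guard against when the bulk body is built in code. The sketch below assembles a newline-delimited bulk body for the `_bulk` endpoint; the function name is hypothetical, and the index name in the test is taken from the query slide.

```python
import json


def bulk_body(index, docs):
    """Build a bulk request body: one action line and one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

Comparing the number of documents sent with the index count afterwards ("count the differences") remains a useful check, since it also catches problems this helper cannot prevent.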
  31. 31. Thank You @anneveling anne@beyondtrees.com