ElasticSearch from the trenches


Presentation given at the NYC software engineering group [http://www.meetup.nycsoftware.org/events/102941112/] on Feb 07, 2013.
An overview of the implementation of ElasticSearch as the new search and browse engine at emusic.com.
This talk shows the challenges that the team faced while putting this amazing solution to work.
By replacing a proprietary legacy Oracle Endeca product with ElasticSearch, emusic was able to reduce the number of nodes used by the search engine by 400% and bring response times down by 200%, all while handling an 80% increase in traffic load.
We serve over 5 million requests a day on our search servers.

Published in: Technology

  1. 1. ElasticSearch from the trenches Vinicius Carvalho / @vccarvalho blog.furiousbob.com
  2. 2. About: emusic.com is one of the leading digital music retailers and is committed to serving music enthusiasts with aggressive development of tools and features solely designed to enhance the personalized music experience. Searching is a BIG part of the emusic.com discovery experience.
  3. 3. Search challenges: Results MUST be relevant. How do we define relevancy? Low response times (< 100 ms). High availability (users don’t tolerate not finding what they are looking for). Static search results are old school; your engine should “know” your users’ preferences. How to recover from catastrophic crashes.
  4. 4. Goals: Improve results relevancy (the Adele 21 problem). Get rid of proprietary software. Have an extensible search engine. Understand what’s happening under the hood. Integrate with other user projects: recommendation, affinity, activity.
  5. 5. SEABRO Project: Replace the current proprietary Endeca search engine with a modern search engine. Get more relevant results. Flexible API. Facet searching (for browsing). Ability to scale out. 3 months in execution (1st phase).
  6. 6. Why ElasticSearch? Built from the ground up to scale out. Powerful DSL. Schema-less -> JSON. Distributed Lucene. Very powerful API that allows automation of the whole cluster using simple curl commands. It’s bonsai cool.
  7. 7. SOLR: The contender. Powered by Lucene. Just too many XML files to get anything done. Uses XML. No Query DSL.
  8. 8. Elasticsearch at a glance [diagram labels: client, node, q1, r1]
  9. 9. Cluster with 1 node [screenshot labels: Cluster state; Node name; Cool Names; Shards]
  10. 10. Cluster with 2 nodes: rebalancing [screenshot label: Shard being relocated]
  11. 11. Cluster with 2 nodes: stable [screenshot label: Recently relocated shard]
  12. 12. Cluster with 3 nodes: stable
  13. 13. Cluster with 3 nodes: adding replicas [screenshot label: Yellow state, not all replicas set]
  14. 14. Cluster with 3 nodes: adding replicas [screenshot labels: Index aliasing; Primary Shard]
  15. 15. Cluster with 3 nodes: node crash. Without replication, our cluster is now missing one shard.
  16. 16. Client Node types: Three types. Data node: joins the cluster, fetches shard data. Client node: joins the cluster, participates in sorting operations, doesn’t fetch data. Transport node: doesn’t join the cluster, only sends requests towards it.
  17. 17. Extensible architecture: plugins. Site: GUI interfaces. River: integration hook to fetch data from other sources (DB, MQ, FS ...). Transport: allows different transports to be plugged into the ES core. Scripting: allows adding new languages to the scripting system. Analysis: custom analyzers for indexing/searching. Misc: you know, everything else.
  18. 18. Site plugins
  19. 19. Our numbers: 25+ million documents. Multiple types: Songs, Albums, Artists, Audio Books, Composers, Labels, Authors, Conductors ... 5+ million search requests per day. ~100 GB index, and it only takes 1 hour to build it from the ground up (thanks to the Akka engine).
  20. 20. Indexing flow [diagram labels: Oracle, akka actors, ES cluster]
  21. 21. Lessons learned
  22. 22. Lesson #1: Get professional help. elasticsearch.org is very well documented, but when it comes to prod, ask the experts. Get professional support from elasticsearch.com.
  23. 23. Lesson #2: Understand shards. Sharding is what makes distributed search possible. Understanding what shards are and how they can speed up your engine is a must.
  24. 24. Lesson #2: Understand shards
  25. 25. Lesson #2: Understand shards. Increasing the number of shards will improve your query times. Each shard maps to a Lucene Index Reader/Writer; the more power your box has, the more shards you should have. Replication will improve cluster response time.
  26. 26. Lesson #3: Design your data flow ahead of your schema. The way you model your schemas has a deep impact on how fast your engine can become. Don’t be afraid to replicate information using a different structure.
  27. 27. Lesson #4: Master the Query DSL. Just like you know SQL, you should understand the query DSL very well. Indexed data won’t find itself. Understand that sometimes you must change the data representation in order to get things to be found.
  28. 28. Lesson #5: Learn at least a bit about Lucene internals. Understanding how Lucene’s scoring works helps in designing better queries. Elasticsearch supports custom scoring using scripts, which could hurt you on performance :(
  29. 29. Lesson #6: Put slow queries to work. Use explain. Explain gives useful information on how documents are being scored. The slow query log will show you which queries are actually hurting you. Sometimes it’s just document cache misses.
  30. 30. Lesson #7: Take GC by the horns. ES nodes can demand a lot of memory. The JDK still thinks it’s 2003 when it comes to memory size. Memory fragmentation. Full GC times can bring your cluster to its knees.
  31. 31. Lesson #7: Take GC by the horns. Maximum 30 GB per node. Beefy machines = more nodes per machine. Changed the full GC threshold to start when memory reaches 60% -> giving the JVM plenty of room until memory is claimed.
  32. 32. Lesson #8: Caching can eat up your memory. Caching is a necessary evil, but: the field cache stores sorted and faceted data; the filter cache stores filtered data. Cache eviction must be controlled.
  33. 33. Lesson #8: Caching can eat up your memory. Your queries and how you facet will have a huge impact on cache size. The bigger your shard is, the more memory you will need for caching. Facet caching for multi-valued fields in 0.20 is not optimal; take that into consideration.
  34. 34. Lesson #9: Monitor your cluster. Keep an eye on your cluster. It’s vital not only to monitor system metrics (CPU, memory, file system) but also to correlate them with query information. ES provides nice plugins like bigdesk and paramedic, but history is vital, so get something like Sematext SPM.
  35. 35. Lesson #10: Distributed systems are hard. Needless to say, but don’t expect all that power to come for free.
  36. 36. Lesson #11: Have an A/B testing suite ready. Defining relevancy is hard. People have different views on relevance. It is hard to explain to a user why Joe Doe does not show up in their query results.
  37. 37. Lesson #11: Have an A/B testing suite ready. Start with a baseline search that returns “relevant enough” results. Give points for every record found; the higher it ranks, the more points it gets. Sum it all, and you have your score. When updating your queries, run the suite and check if you get better results.
  38. 38. Lesson #12: Track user interaction. Monitor how many “clicks” your users are executing once you have changed queries. Again, your definition of relevant may not be what your users expect. Adapt.
  39. 39. Final words: In the end, ES proved to be a very reliable and affordable solution. Not only did we increase the quality of results, but we also reduced the query response times. Request time dropped over 200%. Cluster size was reduced by 400%, with an 80% increase in load. YES, we did save money and increase quality at the same time.
  40. 40. Next steps
  41. 41. Classify data: Classify data during indexing time instead of using custom scripts.
  42. 42. Search + Recommendation
  43. 43. Click analysis
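
The scoring scheme from Lesson #11 (points for every expected record found, more points the higher it ranks, summed into one score) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the suite used at emusic: the document IDs, the `depth` cutoff, the linear point scale, and the two canned "search engines" are all hypothetical stand-ins for real queries against a cluster.

```python
from typing import Callable, Dict, List, Set

# Hypothetical baseline: for each query, the records a "relevant enough"
# search is expected to return. IDs are made up for illustration.
BASELINE: Dict[str, Set[str]] = {
    "adele 21": {"album:adele-21"},
    "miles davis": {"artist:miles-davis", "album:kind-of-blue"},
}


def rank_score(results: List[str], expected: Set[str], depth: int = 10) -> int:
    """Give points for every expected record found in the top `depth` hits.

    A hit at position 0 earns `depth` points, position 1 earns `depth - 1`,
    and so on, so higher-ranked hits earn more (assumed linear scale).
    """
    score = 0
    for position, doc_id in enumerate(results[:depth]):
        if doc_id in expected:
            score += depth - position
    return score


def suite_score(search: Callable[[str], List[str]],
                baseline: Dict[str, Set[str]],
                depth: int = 10) -> int:
    """Sum the per-query scores over the whole baseline."""
    return sum(rank_score(search(query), expected, depth)
               for query, expected in baseline.items())


# Canned result lists standing in for two versions of the same queries.
OLD_RESULTS = {
    "adele 21": ["album:21-guns", "album:adele-21"],
    "miles davis": ["artist:miles-davis", "album:kind-of-blue"],
}
NEW_RESULTS = {
    "adele 21": ["album:adele-21", "album:21-guns"],
    "miles davis": ["artist:miles-davis", "album:kind-of-blue"],
}


def old_search(query: str) -> List[str]:
    return OLD_RESULTS[query]


def new_search(query: str) -> List[str]:
    return NEW_RESULTS[query]


if __name__ == "__main__":
    print("old queries:", suite_score(old_search, BASELINE))
    print("new queries:", suite_score(new_search, BASELINE))
```

In a real suite, `old_search` and `new_search` would issue the old and new query bodies against the cluster and return the ranked document IDs. A query change is kept only if the summed score does not drop.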