Introduction To ElasticSearch (DamnData)


Slides for the Introduction to ElasticSearch talk given at Damn Data. In these slides I present an ElasticSearch use case that details some of the core functionality of the search & analytics platform.

A blog post with more details about this subject is available here:

Published in: Technology
  • Good afternoon. My name is Jurriaan, and I want to thank Thijs for inviting me to speak at this conference.
  • I want to introduce ElasticSearch to you. I have worked with ElasticSearch for the last 2 years at a company called Engagor, a social media monitoring and management tool. We’re based in Gent, have an office in SFO and are a team of 25 people now, 10 of them technical. Instead of diving into the technical details first, I want to start by showing you one of the coolest things we’ve built with it so far.
  • Engagor is basically a huge database of social messages (profiles, keyword searches). Our clients use Engagor to address those messages, for example by replying to them, assigning them to a team member or adding metadata. This is a page in Engagor where you see the number of incoming messages per day and how often and how fast they are being replied to. The dataset is about 40k social messages, data from the last 28 days. The purple bars are the response times per day. And there are graphs with details on response times during and outside business hours. Our clients use this to evaluate the performance of the customer support they deliver, e.g. through Twitter & Facebook.
  • This is the same page, but now with data from the last 3 months (140k messages) and grouped by week. And they use it to improve their response times, as you can see here.
  • Not only this, but our users can also drill down, search and filter in the dataset. Here is a filter for messages with a certain tag, negative sentiment and from a certain region. They might want to have a better response time for certain types of messages, to give priorities. This filter can then be applied on the previous page, and you get statistics about the subset of data in real time.
  • This shows the page in our debug mode, showing a bunch of statistics about the page that’s rendered. This particular request took 32ms. But we see that on average, and also for bigger accounts, when searching in millions of messages we get great performance. And it’s real time: as soon as a message comes in or an action is done, it’s in these statistics. No pre-processing.
  • From the what and who to the how … The main component of Engagor is ElasticSearch. What I want to do for the next 20 minutes is … quickly go over a few very basic ES things … explain how the example I started with is implemented … and finish with some of the lessons we learned.
  • If we talk ES, we have to talk Lucene. That’s the search engine. ElasticSearch is built on top of Lucene. Notice the flat design of that Lucene logo. This was made in 1999.
  • Apache Foundation project. Lucene is a proven technology for search indices.
  • ES joined in 2010. And it was built to be a full-featured search product, with scalability features built in from the bottom up.
  • What does that mean?
  • So, how do we get started … Pretty simple: download, unzip, start.
  • When it’s running, you can check that it’s running in your browser. The easiest way to interact with ElasticSearch is through its REST API. It’s JSON in and JSON out.
  • Example of adding something to ElasticSearch from your command shell via an HTTP PUT request done by curl. You can do this right after installation. No need to create an index or configure anything, just add data right away. ElasticSearch is smart about what type of data you’re giving it.
  • Adding a second record (document).
  • Now we’re adding some more interesting data to the mix … Updating an existing record. (Mind the new version number.) (An update is an atomic DELETE & PUT.)
  • Want to view a document? HTTP GET. The URL consists of index, type & id.
  • So, what we have right now is a NoSQL store
  • Yes, ES is a NoSQL store (and you could use it to replace your current type of data source, but I’m not saying you should). But that’s not the field where ES shines.
  • ES is for search. So let’s do a search. Here we do a GET request (in the browser) that searches our newly created index for the word “croatia”. It returns the match from last Friday.
  • The language for searching in your ES cluster is called the Query DSL. There are 3 different basic types of searches.
  • In ES terms this maps to the following words …
  • By now I’ve added all matches from the qualification campaign, and I will do a full-text search for croatia. The query string supports everything you can do with Lucene, so that includes wildcards, near searches and field restrictions.
  • And these are the results. We played Croatia twice, and won once.
  • This is a filtered search, where we will only get back exact matches. If you can, it’s better to use ES filters, since they can then be cached by ES.
  • This ES HTTP request will facet our data on the opponent field, thus returning how often we played each opponent.
  • Which is obviously, now that we’re qualified, 2 times against each of the other 5 teams in our group. That covers it for the 3 basic types of searches.
  • I need to explain one more thing before we can go back to the example we started with. And that’s document relations, the equivalent of a MySQL join. There are 2 types. We are using nested documents for mentions & actions.
  • So, if you remember this screenshot … This is a single ES call … How is that call set up?
  • What setup and hardware is needed to make this work for Engagor?
  • With all of this, on a typical day this is how our ES dashboard looks … Lots of green, with blue indicating the current master.
  • And if by now you’re thinking … “I don’t believe you.” Well, you’re right.
  • There have been days where it looked like this
  • (Animation) And where Server Density (our monitoring platform) looked like this. We’ve seen servers with load averages up to 1800. I wonder if that’s a record-setting value? Getting the full set-up and configuration right is a bit more work than unzipping the software and starting up 20 nodes.
  • So I want to move on to some lessons learned
  • Our set-up (firehose): RabbitMQ in front. Sometimes (when relocating, initializing, recovering from …) indexing slows down.
  • I want to end with 2 quotes from a presentation from this summer on why ElasticSearch was built and what its goals were …
  • Use the right tool for the job. ES does filtering, free text & analytics in a single box. That’s definitely easier than having to move big piles of data between different systems.
  • When I look at the features we can build for our clients, it definitely does that for us. Thanks to ElasticSearch we’ve been able to offer our clients features like this, and it’s interesting to see which queries they’re using for their day-to-day business.
  • And I’m not sure when exactly this happened, probably around the time of their huge funding, but the tagline is no longer “You know, for search” … but is now fully buzzword compliant. And on that bombshell …
  • Ready to dive in?

    1. 1. LUCENE • Tagline: “Proven Search Capabilities” • Free & Open Source • Created in 1999 • Features: • Indexes & Analyzes Data • Tokenizing, Stemming, Filtering • Search Queries • Phrases, wildcards, proximity searches, ranges, fielded searches • Relevance Scoring, Field Sorting
    2. 2. ELASTICSEARCH • Tagline: “You know, for Search” • Free & Open Source • Created by Shay Banon @kimchy • Versions: • First public release, v0.4 in February 2010 • A rewrite of earlier “Compass” project, w/ scalability built-in from the very core • Latest release 0.90.5 • In Java, so inherently cross-platform
    3. 3. DISTRIBUTED & HIGHLY AVAILABLE • Multiple servers (nodes) running in a cluster • Acts as single service (internal routing) • Data is split into shards (# shards is configurable) • Zero or more replicas • Replicas on different servers (server pools) for failover • Node in cluster goes down? Replica takes over. • Self managing cluster • Automatic master detection + failover • Responsible for distribution/relocating shards
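As a rough illustration of how a document can end up on exactly one shard, the sketch below routes document ids with `hash(id) % number_of_shards`. The CRC32 hash, the `shard_for` helper and the shard count of 5 are assumptions made for this example only; ElasticSearch uses its own internal hash function.

```python
import zlib

NUM_SHARDS = 5  # assumed value; the sample search response in this deck reports 5 shards


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard for a document id (stand-in hash, not ElasticSearch's own)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


if __name__ == "__main__":
    # The same id always maps to the same shard, so any node can route a request.
    for doc_id in ["1", "2", "3", "4"]:
        print(f"document {doc_id} -> shard {shard_for(doc_id)}")
```

Because the result depends on the number of shards, that number cannot simply be changed later without moving documents around, which is one reason to plan it on expected growth.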
    4. 4. $ cd ~/Downloads
          $ wget https://download […] /elasticsearch-0.90.5.tar.gz
          $ tar -xzf elasticsearch-0.90.5.tar.gz
          $ cd elasticsearch-0.90.5/
          $ ./bin/elasticsearch
    5. 5. $ curl -XPUT http://localhost:9200/reddevils/matches/1 -d '{"date": "2013-10-15T19:00:00Z", "opponent": "Wales", "result": "1-1"}'
          {"ok":true,"_index":"reddevils","_type":"matches","_id":"1","_version":1}
    6. 6. $ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"}'
          {"ok":true,"_index":"reddevils","_type":"matches","_id":"2","_version":1}
    7. 7. $ curl -XPUT http://localhost:9200/reddevils/matches/2 -d '{"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2", "girlfriend_attention_span": 30}'
          {"ok":true,"_index":"reddevils","_type":"matches","_id":"2","_version":2}
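The version bump in the last response can be mimicked with a toy in-memory store. This is only a sketch of the observable behaviour (a second PUT on the same index/type/id replaces the whole document and increments `_version`); the `TinyStore` class is hypothetical and says nothing about how ElasticSearch stores data internally.

```python
class TinyStore:
    """In-memory toy that mimics versioned PUTs; not ElasticSearch's storage model."""

    def __init__(self):
        self._docs = {}  # (index, type, id) -> (version, source)

    def put(self, index, doc_type, doc_id, source):
        key = (index, doc_type, doc_id)
        version = self._docs.get(key, (0, None))[0] + 1
        # The whole document is replaced, mirroring "an update is a DELETE & PUT".
        self._docs[key] = (version, dict(source))
        return {"ok": True, "_index": index, "_type": doc_type,
                "_id": doc_id, "_version": version}

    def get(self, index, doc_type, doc_id):
        version, source = self._docs[(index, doc_type, doc_id)]
        return {"_index": index, "_type": doc_type, "_id": doc_id,
                "_version": version, "_source": source}


if __name__ == "__main__":
    store = TinyStore()
    store.put("reddevils", "matches", "2", {"opponent": "Croatia", "result": "1-2"})
    print(store.put("reddevils", "matches", "2",
                    {"opponent": "Croatia", "result": "1-2",
                     "girlfriend_attention_span": 30}))
```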
    8. 8. “Aha! A NoSQL store?!”
    9. 9. QUERY DSL • Full Text Search • Search for “Croatia” • Structured Search • Search for “All matches where outcome was ‘1-1’” • Analytics • Search for “Average attention span of my girlfriend” • Incl. custom functions (scripts) • … or a combination of those!
    10. 10. QUERY DSL (CONT’D) • Searching in your data set … • queries: full text search & relevance scoring • filters: exact matches • Aggregating information from your data set … • facets: • Averages • Sums • Date histograms • …
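As a small sketch of “a combination of those”, the snippet below builds one request body that pairs the `query_string` query with a `terms` facet, the two building blocks used in the curl examples of this deck. The `search_body` helper is hypothetical, and faceting on the `result` field is an assumption chosen for illustration.

```python
import json


def search_body(text, facet_field):
    """Combine a full-text query with a terms facet in a single request body."""
    return {
        "query": {"query_string": {"query": text}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    }


if __name__ == "__main__":
    # The printed JSON could be passed as the -d payload of a _search request.
    print(json.dumps(search_body("croatia", "result"), indent=2))
```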
    11. 11. curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{ "query": { "query_string": { "query": "croatia" } } }'
    12. 12. { "took" : 18, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 0.40240064, "hits" : [ { "_index" : "reddevils", "_type" : "matches", "_id" : "2", "_score" : 0.40240064, "_source" : {"date": "2013-10-11T15:00:00Z", "opponent": "Croatia", "result": "1-2"} }, { "_index" : "reddevils", "_type" : "matches", "_id" : "4", "_score" : 0.3125, "_source" : {"date": "2012-09-11T15:00:00Z", "opponent": "Croatia", "result": "1-1"} }] } }
    13. 13. curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{ "query": { "constant_score": { "filter": { "term": { "result": "1-1" } } } } }'
    14. 14. curl -XGET 'http://localhost:9200/reddevils/matches/_search?pretty=true' -d '{ "size": 0, "facets": { "opponent": { "terms": { "field": "opponent" } } } }'
    15. 15. { … "facets" : { "opponent" : { "_type" : "terms", "missing" : 0, "total" : 10, "other" : 0, "terms" : [ { "term" : "wales", "count" : 2 }, { "term" : "serbia", "count" : 2 }, … { "term" : "croatia", "count" : 2 }] …
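To make the terms facet concrete, the sketch below computes the same kind of counts locally over the three matches that appear in this deck. The `terms_facet` helper is hypothetical; lowercasing and whitespace splitting are a simplified stand-in for ElasticSearch's default text analysis, which is why the terms in the response read “wales” rather than “Wales”.

```python
from collections import Counter

# The three matches shown in the curl examples and search results of this deck.
matches = [
    {"opponent": "Wales", "result": "1-1"},
    {"opponent": "Croatia", "result": "1-2"},
    {"opponent": "Croatia", "result": "1-1"},
]


def terms_facet(docs, field):
    """Count how often each (lowercased) term occurs in the given field."""
    counts = Counter()
    missing = 0
    for doc in docs:
        value = doc.get(field)
        if value is None:
            missing += 1
            continue
        for term in value.lower().split():
            counts[term] += 1
    return {
        "_type": "terms",
        "missing": missing,
        "total": sum(counts.values()),
        "terms": [{"term": t, "count": c} for t, c in counts.most_common()],
    }


if __name__ == "__main__":
    print(terms_facet(matches, "opponent"))
```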
    16. 16. DOCUMENT RELATIONS • ElasticSearch provides 2 mechanisms • Parent/Child Documents • add links between documents by defining parent/child ids. • query example: “return children where parent matches x” • use case: linking “product” and “offer” documents. • query-time join • Nested Documents • use case: “actions” on a “mention” (Engagor) • denormalized in Lucene index • in Lucene index data is stored nearby • thus local join, thus very fast. • index-time join
    17. 17. EXAMPLE EXPLAINED • range filter on publish_date • query_string w/ (internal version of) user defined query string • date_histogram facet on mention-document publish_date field • terms_stats facet per action type on “delay” field of the nested document “action” of the mention-document • result contains: • amount of mentions with action • amount of actions • total delay of actions • facet_filter per defined facet.
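The per-action-type statistics listed on this slide can be sketched locally as follows. The sample mentions, the field names (`actions`, `type`, `delay`) and the `action_stats` helper are all made up for illustration; in Engagor the equivalent numbers come back from a facet on the nested action documents.

```python
from collections import defaultdict

# Hypothetical mention documents with nested "actions"; delays are in seconds.
mentions = [
    {"publish_date": "2013-10-14", "actions": [
        {"type": "reply", "delay": 120},
        {"type": "assign", "delay": 30},
    ]},
    {"publish_date": "2013-10-14", "actions": [{"type": "reply", "delay": 60}]},
    {"publish_date": "2013-10-15", "actions": []},
]


def action_stats(docs):
    """Per action type: mentions with that action, number of actions, total delay."""
    stats = defaultdict(lambda: {"mentions": 0, "actions": 0, "total_delay": 0})
    for doc in docs:
        seen = set()  # action types already counted for this mention
        for action in doc["actions"]:
            entry = stats[action["type"]]
            entry["actions"] += 1
            entry["total_delay"] += action["delay"]
            if action["type"] not in seen:
                entry["mentions"] += 1
                seen.add(action["type"])
    return dict(stats)


if __name__ == "__main__":
    for action_type, entry in action_stats(mentions).items():
        print(action_type, entry)
```

Dividing `total_delay` by `actions` gives the average response time that the purple bars in the opening example would plot.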
    18. 18. THE ENGAGOR SETUP • Running ES for 2 years • 1 billion social messages, sharded by client • 20-node cluster • 24GB RAM per node, 12-18GB reserved for ES • Main data source • Other storage systems in place mainly for backup • Usage: • write heavy (indexing new data all the time, real time) • fewer reads (no need for micro-optimizing read caches, yet) • # updates on data depends on client use case • social care and/or pure analytics
    19. 19. 3 lessons learned …
    20. 20. 1/3: INDEXING SPEED • Bulk Indexing is faster, obviously • Less network overhead • With RabbitMQ • Handles peaks in data • Allows us to slow down throughput to ES while still consuming firehoses from our 3rd party services • Bulk w/ Timeouts • (so Engagor users get their messages near-realtime)
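A minimal sketch of “bulk with timeouts”: documents are buffered and flushed either when the batch is full or when the oldest buffered document has waited too long, so users still see messages in near real time. The `BulkBuffer` class, its thresholds and the `flush` callback are assumptions for this example; `bulk_body` formats a batch in the newline-delimited layout that the bulk API expects.

```python
import json
import time


def bulk_body(index, doc_type, docs):
    """Format (id, source) pairs as a newline-delimited bulk request body."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


class BulkBuffer:
    """Collects documents; flushes when full or when the oldest one is too old."""

    def __init__(self, flush, max_docs=500, max_age_seconds=1.0, clock=time.monotonic):
        self.flush = flush                  # callback that receives a list of (id, source)
        self.max_docs = max_docs
        self.max_age_seconds = max_age_seconds
        self.clock = clock                  # injectable so the timeout can be tested
        self.docs = []
        self.first_added_at = None

    def add(self, doc_id, source):
        if not self.docs:
            self.first_added_at = self.clock()
        self.docs.append((doc_id, source))
        self.flush_if_due()

    def flush_if_due(self):
        """Call this periodically from the consumer loop, even when no data arrives."""
        if not self.docs:
            return
        too_many = len(self.docs) >= self.max_docs
        too_old = self.clock() - self.first_added_at >= self.max_age_seconds
        if too_many or too_old:
            self.flush(self.docs)
            self.docs = []
            self.first_added_at = None


if __name__ == "__main__":
    buffer = BulkBuffer(lambda docs: print(bulk_body("reddevils", "matches", docs)), max_docs=2)
    buffer.add("1", {"opponent": "Wales", "result": "1-1"})
    buffer.add("2", {"opponent": "Croatia", "result": "1-2"})
```

In a real consumer the `flush` callback would send the body to the cluster's `_bulk` endpoint, and the message queue would absorb peaks while a flush is in progress.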
    21. 21. 2/3: CHOOSE SHARDING STRATEGY WISELY • Plan # shards on expected growth, not on current set-up • But, take care … • We have several shards per monitored topic (related to # customers and volume of data) • Biggest problem in our cluster right now is the big # of shards • Bugfixes in latest versions • You can use “aliases” to create “virtual shards”/“windows on shards”
    22. 22. 3/3: TRY TO KEEP UP WITH RELEASES • ElasticSearch is a young product • 0.90 releases • September 2013 • August 2013 • June 2013 • May 2013 • April 2013 • The 1.0 release is planned for early 2014. • Updates help you • Great improvements in every release • Much needed bugfixes in every release • Bonus Tip: keep your JVM up to date
    23. 23. “filtering, free text search & analytics all in the same box”
    24. 24. “power of search and data-digging in the hands of your users”
    25. 25. flexible and powerful open source, distributed real-time search and analytics engine for the cloud
    26. 26. $ sudo bin/service/elasticsearch stop