
An Introduction to Elastic Search.


Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers who are used to building PHP/MySQL apps to broaden their horizons when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.

An accompanying blog post about this subject can be found at

Published in: Technology

  • This talk is about adding search to your own website: implementing a search engine for your own content. First, how it’s done in MySQL and what problems that brings with it; then what a search engine should do for you; then how Elastic Search helps. And code of course, lots of code.
  • I’m leading the development team at Engagor, a startup in the city centre of Gent, Belgium. We’re 2 years old. We build a social media monitoring & management product. Before that I worked for Massive Media for 5 years. As a lead developer I worked on the Netlog & Gatcha products.
  • Search can be a “simple” text search. Here I’m searching Tumblr for funny gifs, because that’s what Tumblr is for.
  • But search can go deeper and into more detail too. Here I’m using AND, OR, NOT; nesting; and restrictions on fields.
  • Or very difficult … Searching in a mixed set of data: profiles, photos, friend connections. Searching in a graph …
  • My first thought when I have to add search to a PHP/MySQL site … It sort of works …
  • Problems arise when you have lots of data … To speed things up you add indices to your MySQL tables.
  • And the library analogy for a MySQL index is this … An index card box.
  • MySQL has an index type especially for full-text search. Natural Language: case insensitive, accent insensitive. Query Expansion: a search for “database” returns results that often contain words like “mysql”, “oracle”, …; a second search with the extended query string then finds related documents too. Boolean Mode: adds operators to the Natural Language type.
  • Anyone using a similar system to this? Implemented yourself, or from a CMS? phpBB has/had a table like this.
  • Example of FULLTEXT MySQL search query.
  • Example of FULLTEXT MySQL search query in boolean mode. A bit more powerful.
  • Now we add restrictions on certain fields. You then need combined indices to keep this fast.
  • And even more restrictions. Indices on all involved fields? In all combinations? In all orders? Lots of indices make WRITE operations on your tables slower.
  • MySQL just isn’t built for complex search … So let’s look at what a system built for search needs …
  • So it’s old. But still active. Used a lot.
  • It is, however, a Java library. It’s not a fully managed service.
  • If you have lots of data, you need a search engine to be scalable & highly available.
  • And that’s where ElasticSearch comes in.
  • Download, unzip, start.
  • This time we install it in the right place, wrapped in a service.
  • Now it runs on your localhost as an HTTP service, so open your browser and surf to your Elastic Search server.
  • HTTP access, it’s brilliant! Do you want to secure it? Add firewalls … Do you want to cache it? Add Varnish …
  • Example of adding something to ElasticSearch from your command shell via an HTTP PUT request done by curl. You can do this right after installation. No need to create an index or configure anything, just add data right away.
  • Adding a second record (document).
  • Updating an existing record. (Mind the new version number.)
  • Want to see the record? Surf to it. The URL consists of index, type & id.
  • Want to add a new field (column)? Update your document with the new field added.
  • It’s right there.
  • So it’s schemaless. And actually we did ZERO CONFIGURATION. We didn’t have to create indices or tell Elastic Search what type of data we’ll be adding. Actually, you can configure a mapping/schema: to require certain fields to be of a certain type; to prevent text fields from being analyzed (text analysis). Basically: to speed things up …
  • What we’ve demoed so far is a NoSQL store. That’s cool. But not all.
  • Here we do a GET request (in the browser) that searches our newly created index for the word “smashing”. It returns the 2nd document.
  • Curl in PHP is simple. Simplest example of how to do a search to Elastic from PHP.
  • I’ve mentioned a few Elastic Search specific terms. Here’s the full breakdown of the terminology and how it relates to MySQL concepts.
  • Back to the clustering features Elastic Search adds …
  • Here you have an Engagor-specific dashboard that shows the 12 servers in the Engagor cluster. You see: server12 is the master; every node knows each other; there are 11K shards, each with one replica. Health = green means every shard has a replica (on a different server). If one server goes down: no problem.
  • Now let’s look at how Engagor uses Elastic Search. First, what do we do? From a technical point of view: Engagor = huge database of social messages. Facebook, Twitter, Instagram, news sites … We save those that our clients are interested in to our application and offer: statistics about the tracked data, workflow tools, automation tools.
  • This is the time to show a slide I stole from a presentation from our CEO Folke. We started 2 years ago. I joined after 6 months. Now a team of 16 people, 7 of them technical profiles: web developers, data scientists, backend developers, frontend developers. Our customers include Mobistar, Telenet, Microsoft Europe, European Commission, Alpro and several agencies. They use Engagor for customer support (“call center” software for social media), marketing insights, crisis detection, …
  • Here you see an example of a search on our cluster for a certain Twitter user’s handle. You see it returns 260 social messages. Each message has data like the id, the service it’s from, content, date and author details.
  • Here we’re searching in a topic about Belgacom & Mobile Vikings. “All messages from users with at least 1000 followers, that are negative and from Belgium or in Dutch.” On 4 different fields, and nested … It just works.
  • This is the Query DSL we’re sending to Elastic for the previous search.
  • Here we are in a topic about Coca Cola, thus high volume. About 50k messages per day, over 28 days. That’s 1.4M messages we’re searching in. This is a graph of messages per day.
  • This is the inbox, showing the last 10 messages in that topic. Performance: about half a second.
  • Same inbox, but now only showing messages with the word “thirsty”. Performance: again about half a second. (Only 1 sample, so this is not really a benchmark.)
  • Now, there’s another feature of search engines you might not immediately think about.
  • Think of it as an equivalent of a MySQL GROUP BY query.
  • The pie chart: facet on the sentiment field of a mention. Returns totals per value. Sentiment per day: facet on combined fields: sentiment + day(dateadd). Returns the data used for the second chart. You can also use the filter like in the inbox, to see these facets for a filtered set of data. E.g. sentiment per day for mentions coming from Belgium only.
  • For the Telenet Twitter profile: totals of messages per day, per type (retweets, mentions, own tweets). Every “segment” (color) is a facet with a custom filter/search query. A single ElasticSearch call gets all this information.
  • Percolation example: use it to route documents … E.g. you have a stream of data coming in and need to decide what to do with those documents based on queries. Everything matching x is for client A; everything matching y is for client B.
  • There’s good competition going on. Looks like Elastic Search has made Solr alert again, since they’re also focusing on clustering features now.
  • Having a great search engine doesn’t mean you have a great search experience on your site … Users aren’t very good at it … But searching for things is not that easy …
  • Look at how natural language differs from search query language.
  • This happens quite often at the Engagor office. Not that we blame our users … It’s just difficult to get it right.
  • If you want to play with it, and are excited … But your boss is all #meh.
  • ElasticSearch is a mature product. Soundcloud is using it.
  • StumbleUpon is using it.
  • Foursquare switched to it … When checking in somewhere, they show: locations close to you, locations you previously checked in at, locations that are popular. That’s a pretty nifty search query right there.
  • There’s also a real company behind Elastic Search, backing it with tutorials, training, support, SLAs and consulting.
  • Recently redesigned site. Documentation is a bit frightening at first, since it’s hard to know what to search for, but I hope this presentation solves that. The IRC channel is very active; quick answers.
  • If you’re interested in working with Elastic Search, or in fact, any of these other cool technologies … We can always use good profiles: junior or senior, backend or frontend … Send us your resumes ;). March 27th: right now we’re especially searching for someone who’s good at JavaScript.
  • I already gave an overview of what it’s like to work with the technologies at Engagor. Go to to find out what it’s like to work with the people of our team ;).
  • So, that’s it.
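
The percolation note above (store queries first, then send documents and see which queries match) can be illustrated without a running cluster. The sketch below is plain Python, not the Elastic Search percolator API: the `Percolator` class, its methods and the client names are hypothetical, and a "query" is simplified to a set of terms that must all occur in the document.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


class Percolator:
    """Toy percolator: stores queries up front, then matches documents against them."""

    def __init__(self):
        self.queries = {}  # query name -> set of required terms

    def register(self, name, required_terms):
        """Store a query; here a query is just a set of terms that must all be present."""
        self.queries[name] = {term.lower() for term in required_terms}

    def percolate(self, text):
        """Return the names of all stored queries that match the document."""
        tokens = tokenize(text)
        return sorted(name for name, terms in self.queries.items() if terms <= tokens)


# Hypothetical routing rules: one stored query per client.
percolator = Percolator()
percolator.register("client_a", ["telenet"])
percolator.register("client_b", ["coca", "cola"])

print(percolator.percolate("Thirsty! Grabbing a Coca Cola"))
print(percolator.percolate("Telenet is down again"))
```

Each incoming document is checked against every stored query, which is the reverse of a normal search where one query is checked against every stored document.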
Transcript

    • 1. Jurriaan Persyn @oemebamo – CTO Engagor
    • 2. SELECT * FROM myauwesomewebsite WHERE `text` LIKE '%shizzle%'
    • 3. TEXT SEARCH WITH PHP/MYSQL • FULLTEXT index • Only for CHAR, VARCHAR & TEXT columns • For MyISAM & InnoDB tables • Configurable Stop Words • Types: • Natural Language • Natural Language with Query Expansion • Boolean Full Text
    • 4. MYSQL FULLTEXT BOOLEAN MODE Operators: • + AND • - NOT • OR implied • () Nesting • * Wildcard • “ Literal matches
    • 5. TEXT SEARCH WITH PHP/MYSQL (CONT’D) • Typical columns for search table: • Type • Id • Text • Priority • Process: • Blog posts, comments, … • Save (filtered) duplicate of text in search table. • When searching … • Search table and translate to original data via type/id. This is how most php/mysql sites implement their search, right?
    • 6. SELECT * FROM mysearchtable WHERE MATCH(text) AGAINST ('shizzle')
    • 7. SELECT * FROM mysearchtable WHERE MATCH(text) AGAINST ('+shizzle -"manizzle"' IN BOOLEAN MODE)
    • 8. SELECT * FROM jobs WHERE role = 'DEVELOPER' AND MATCH(job_description) AGAINST ('node.js')
    • 9. SELECT * FROM jobs j JOIN jobs_benefits jb ON = jb.job_id WHERE j.role = 'DEVELOPER' AND MATCH(job_description) AGAINST ('node.js -asp' IN BOOLEAN MODE) AND jb.free_espresso = TRUE
    • 10. WHAT IS A SEARCH ENGINE? • Efficient indexing of data • On all fields / combination of fields • Analyzing data • Text Search • Tokenizing • Stemming • Filtering • Understanding locations • Date parsing • Relevance scoring
    • 11. TOKENIZING • Finding word boundaries • Not just explode(' ', $text); • Chinese has no spaces. (Not every single character is a word.) • Understand patterns: • URLs • Emails • #hashtags • Twitter @mentions • Currencies (EUR, €, …)
    • 12. STEMMING • “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.” • Conjugations • Plurals • Example: • Fishing, Fished, Fish, Fisher > Fish • Better > Good • Several ways to find the stem: • Lookup tables • Suffix-stripping • Lemmatization • … • Different stemmers for every language.
    • 13. FILTERING • Remove stop words • Different for every language • HTML • If you’re indexing web content, not every character is meaningful.
    • 14. UNDERSTANDING LOCATIONS • Geocoding of locations to longitude & latitude • Search on location: • Bounding box searches • Distance searches • Searching nearby • Geo Polygons • Searching a country (Note: MySQL also has geospatial indices.)
    • 15. RELEVANCE SCORING • From the matched documents, which ones do you show first? • Several strategies: • How many matches in document? • How many matches in document as percentage of length? • Custom scoring algorithms • At index time • At search time • … A combination. Think of Google PageRank.
    • 16. “There’s an app software for that.”
    • 17. APACHE LUCENE • “Information retrieval software library” • Free/open source • Supported by Apache Foundation • Created by Doug Cutting • Written in 1999
    • 18. “There’s software a Java library for that.”
    • 19. ELASTICSEARCH • “You know, for Search” • Also Free & Open Source • Built on top of Lucene • Created by Shay Banon @kimchy • Versions • First public release, v0.4 in February 2010 • A rewrite of earlier “Compass” project, now with scalability built-in from the very core • Now stable version at 0.20.6 • Beta branch at 0.90 (working towards 1.0 release) • In Java, so inherently cross-platform
    • 20. WHAT DOES IT ADD TO LUCENE? • RESTful Service • JSON API over HTTP • Want to use it from PHP? • CURL requests, as if you’d do requests to the Facebook Graph API. • High Availability & Performance • Clustering • Long Term Persistency • Write through to persistent storage system.
    • 21. $ cd ~/Downloads
          $ wget…/elasticsearch-0.20.5.tar.gz
          $ tar -xzf elasticsearch-0.20.5.tar.gz
          $ cd elasticsearch-0.20.5/
          $ ./bin/elasticsearch
    • 22. $ cd ~/Downloads
          $ wget…/elasticsearch-0.20.5.tar.gz
          $ tar -xzf elasticsearch-0.20.5.tar.gz
          $ git clone elasticsearch-servicewrapper
          $ sudo mv elasticsearch-0.20.5 /usr/local/share
          $ cd elasticsearch-servicewrapper
          $ sudo mv service /usr/local/share/elasticsearch-0.20.5/bin
          $ cd /usr/local/share
          $ sudo ln -s elasticsearch-0.20.5 elasticsearch
          $ sudo chown -R root:wheel elasticsearch
          $ cd /usr/local/share/elasticsearch
          $ sudo bin/service/elasticsearch start
    • 23. $ sudo bin/service/elasticsearch start
          Starting ElasticSearch...
          Waiting for ElasticSearch......
          running: PID:83071
          $
    • 24. $ curl -XPUT http://localhost:9200/test/stupid-hypes/planking -d '{"name":"Planking", "stupidity_level":"5"}'
          {"ok":true,"_index":"test","_type":"stupid-hypes","_id":"planking","_version":1}
    • 25. $ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-smashing -d '{"name":"Gallon Smashing","stupidity_level":"5"}'
          {"ok":true,"_index":"test","_type":"stupid-hypes","_id":"gallon-smashing","_version":1}
    • 26. $ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-smashing -d '{"name":"Gallon Smashing","stupidity_level":"10"}'
          {"ok":true,"_index":"test","_type":"stupid-hypes","_id":"gallon-smashing","_version":2}
    • 27. $ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-smashing -d '{"name":"Gallon Smashing","stupidity_level":"10", "lifetime":30}'
          {"ok":true,"_index":"test","_type":"stupid-hypes","_id":"gallon-smashing","_version":3}
    • 28. SCHEMALESS, DOCUMENT ORIENTED • No need to configure schema upfront • No need for slow ALTER TABLE-like operations • You can define a mapping (schema) to customize the indexing process • Require fields to be of certain type • If you want text fields that should not be analyzed (stemming, …)
    • 29. “Ok, so it’s a NoSQL store?”
    • 30. TERMINOLOGY
          MySQL                    Elastic Search
          Database                 Index
          Table                    Type
          Row                      Document
          Column                   Field
          Schema                   Mapping
          Index                    Everything is indexed
          SQL                      Query DSL
          SELECT * FROM table …    GET http://…
          UPDATE table SET …       PUT http://…
    • 31. DISTRIBUTED & HIGHLY AVAILABLE • Multiple servers (nodes) running in a cluster • Acting as single service • Nodes in cluster that store data or nodes that just help in speeding up search queries. • Sharding • Indices are sharded (# shards is configurable) • Each shard can have zero or more replicas • Replicas on different servers (server pools) for failover • One in the cluster goes down? No problem. • Master • Automatic Master detection + failover • Responsible for distribution/balancing of shards
    • 32. SCALING ISSUES? • No need for an external load balancer • Since cluster does its own routing. • Ask any server in the cluster, it will delegate to correct node. • What if … • More data > More shards. • More availability > More replicas per shard.
    • 33. PERFORMANCE TWEAKING • Bulk Indexing • Multi-Get • Avoids network latency (HTTP API) • API with administrative & monitoring interface • Cluster’s availability state • Health • Nodes’ memory footprint • Alternatives for HTTP API? • Java library • PHP wrappers (Sherlock, Elastica, …) • But simplicity of HTTP API is brilliant to work with, latency is hardly an issue.
    • 34. Still with me?
    • 35. Some Examples
    • 36. Query DSL Example: (language:nl OR OR author.followers:[1000 TO *] (-sub_category:like) ((-status:857.assigned) (-status:857.done))
    • 37. FACETS • Instead of returning the matching documents … • … return data about the distribution of values in the set of matching documents • Or a subset of the matching documents • Possibilities: • Totals per unique value • Averages of values • Distributions of values • …
    • 38. TERMINOLOGY (CONT’D)
          MySQL                                               Elastic Search
          SELECT field, COUNT(*) FROM table GROUP BY field    Facet
    • 39. ADVANCED FEATURES • Nested documents (Child-Parent) • Like MySQL joins? • Percolation Index • Store queries in Elastic • Send it documents • Get returned which queries match • Index Warming • Register search queries that cause heavy load • New data added to index will be warmed • So next time query is executed: pre cached
    • 40. WHAT ARE MY OTHER OPTIONS? • RDBMS • MySQL, … • NoSQL • MongoDB, … • Search Engines • Solr • Sphinx • Xapian • Lucene itself • SaaS • Amazon CloudSearch
    • 41. … VS. SOLR • + • Also built on Lucene • So similar feature set • Also exposes Lucene functionality, like Elastic Search, so easy to extend. • A part of Apache Lucene project • Perfect for Single Server search • - • Clustering is there. But it’s definitely not as simple as ElasticSearch’s • Fragmented code base. (Lots of branches.) Engagor used to run on Solr.
    • 42. … VS. SPHINX • + • Great for single server full text searches; • Has graceful integration with SQL database; • (Eg. for indexing data) • Faster than the others for simple searches; • - • No out of the box clustering; • Not built on Lucene; lacks some advanced features; Netlog & Twoo use Sphinx.
    • 43. WANT TO USE IT? • In an existing project: • As an extra layer next to your data … • Send to both your database & elasticsearch; • Consistency problems?; • Or as replacement for database • Elastic is as persistent as MySQL; • If you don’t need RDBMS features; • @Engagor: Our social messages are only in Elastic
    • 44. “Users are incredibly bad at finding and researching things on the web.” Nielsen (March 2013)
    • 45. “Pathetic and useless are words that come to mind after this year’s user testing.” Nielsen (March 2013)
    • 46. “I’m searching for apples and pears.”
    • 47. “apples AND pears”
    • 48. “apples OR pears”
    • 49. “It’s too young. Is it even stable enough?” Your boss (Tomorrow Morning)
    • 50. #elasticsearch
    • 51. elasticsearch node.js real time notifications rabbitmq backbone.js
    • 52. $ cd /usr/local/share/elasticsearch
          $ sudo bin/service/elasticsearch stop
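
Slides 37 and 38 describe a facet as the search-engine counterpart of SELECT field, COUNT(*) … GROUP BY field. The following is a minimal sketch of that idea in plain Python rather than an Elastic Search call: the `facet` helper, the sample messages and their field names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample documents; field names are illustrative only.
messages = [
    {"sentiment": "negative", "country": "BE", "day": "2013-03-25"},
    {"sentiment": "positive", "country": "BE", "day": "2013-03-25"},
    {"sentiment": "negative", "country": "NL", "day": "2013-03-26"},
    {"sentiment": "negative", "country": "BE", "day": "2013-03-26"},
]


def facet(documents, *fields, where=None):
    """Count documents per unique combination of the given fields.

    `where` is an optional predicate that restricts the set of documents
    first, like filtering the inbox before computing the facet.
    Keys of the result are tuples, one element per requested field.
    """
    matching = (doc for doc in documents if where is None or where(doc))
    return dict(Counter(tuple(doc[field] for field in fields) for doc in matching))


# Totals per sentiment value (the pie chart).
totals = facet(messages, "sentiment")

# Sentiment per day, for messages coming from Belgium only.
per_day_be = facet(messages, "sentiment", "day", where=lambda doc: doc["country"] == "BE")

print(totals)
print(per_day_be)
```

Passing two field names corresponds to the combined sentiment + day facet from the notes, and the `where` predicate plays the role of the search filter applied before counting.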