Jurriaan Persyn
@oemebamo – CTO Engagor
SELECT *
FROM myauwesomewebsite
   WHERE `text` LIKE
      „%shizzle%‟
TEXT SEARCH WITH PHP/MYSQL
• FULLTEXT index
  • Only for CHAR, VARCHAR & TEXT columns
    • For MyISAM & InnoDB tables
  • Configurable Stop Words
  • Types:
   • Natural Language
   • Natural Language with Query Expansion
   • Boolean Full Text
MYSQL FULLTEXT BOOLEAN MODE
Operators:
 •+          AND
 •-          NOT
 •           OR implied
 •()         Nesting
 •*          Wildcard
 •“          Literal matches
TEXT SEARCH WITH PHP/MYSQL (CONT‟D)
• Typical columns for search table:
  • Type
  • Id
  • Text
  • Priority

• Process:
  • Blog posts, comments, …
    • Save (filtered) duplicate of text in search table.
  • When searching …
    • Search table and translate to original data via type/id

This is how most php/mysql sites implement their search, right?
SELECT *
FROM mysearchtable
WHERE MATCH(text)
 AGAINST („shizzle‟)
SELECT *
    FROM mysearchtable
    WHERE MATCH(text)
  AGAINST („+shizzle –”ma
nizzle”‟ IN BOOLEAN MODE)
SELECT * FROM jobs
WHERE role = „DEVELOPER‟
AND MATCH(job_description)
   AGAINST („node.js‟)
SELECT * FROM jobs j
JOIN jobs_benefits jb ON j.id =
           jb.job_id
WHERE j.role = „DEVELOPER‟
AND (MATCH(job_description)
 AGAINST („node.js -asp‟ IN
     BOOLEAN MODE)
AND jb.free_espresso = TRUE
WHAT IS A SEARCH ENGINE?
• Efficient indexing of data
  • On all fields / combination of fields
• Analyzing data
  • Text Search
     • Tokenizing
     • Stemming
     • Filtering
  • Understanding locations
  • Date parsing
• Relevance scoring
TOKENIZING
• Finding word boundaries
  • Not just explode(„ „, $text);
  • Chinese has no spaces. (Not every single character is a
    word.)
• Understand patterns:
  • URLs
  • Emails
  • #hashtags
  • Twitter @mentions
  • Currencies (EUR, €, …)
STEMMING
• “Stemming is the process for reducing inflected (or sometimes
  derived) words to their stem, base or root form.”
   • Conjugations
   • Plurals
• Example:
   • Fishing, Fished, Fish, Fisher > Fish
   • Better > Good
• Several ways to find the stem:
   • Lookup tables
   • Suffix-stripping
   • Lemmatization
   •…
• Different stemmers for every language.
FILTERING
• Remove stop words
  • Different for every language
• HTML
  • If you‟re indexing web content, not every character is
    meaningful.
UNDERSTANDING LOCATIONS
• Reverse geocoding of locations to longitude & latitude
• Search on location:
  • Bounding box searches
  • Distance searches
    • Searching nearby
  • Geo Polygons
    • Searching a country

(Note: MySQL also has geospatial indeces.)
RELEVANCE SCORING
• From the matched documents, which ones do you show first?
• Several strategies:
  • How many matches in document?
  • How many matches in document as percentage of length?
  • Custom scoring algorithms
    • At index time
    • At search time
  • … A combination

Think of Google PageRank.
“There‟s an app software for
that.”
APACHE LUCENE
•   “Information retrieval software library”
•   Free/open source
•   Supported by Apache Foundation
•   Created by Doug Cutting
•   Written in 1999
“There‟s software a Java library
for that.”
ELASTICSEARCH
• “You know, for Search”
• Also Free & Open Source
• Built on top of Lucene
• Created by Shay Banon @kimchy
• Versions
   • First public release, v0.4 in February 2010
     • A rewrite of earlier “Compass” project, now with scalability
       built-in from the very core
   • Now stable version at 0.20.6
   • Beta branch at 0.90 (working towards 1.0 release)
• In Java, so inherently cross-platform
WHAT DOES IT ADD TO LUCENE?
• RESTfull Service
  • JSON API over HTTP
  • Want to use it from PHP?
     • CURL Requests, as if you‟d do requests to the Facebook
       Graph API.
• High Availability & Performance
  • Clustering
• Long Term Persistency
  • Write through to persistent storage system.
$ cd ~/Downloads
$ wget https://github.com/…/elasticsearch-0.20.5.tar.gz
$ tar –xzf elasticsearch-0.20.5.tar.gz
$ cd elasticsearch-0.20.5/
$ ./bin/elasticsearch
$ cd ~/Downloads
$ wget https://github.com/…/elasticsearch-0.20.5.tar.gz
$ tar –xzf elasticsearch-0.20.5.tar.gz
$ git clone https://github.com/elasticsearch/elasticsearch-
servicewrapper.git elasticsearch-servicewrapper
$ sudo mv elasticsearch-0.20.5 /usr/local/share
$ cd elasticsearch-servicewrapper
$ sudo mv service /usr/local/share/elasticsearch-0.20.5/bin
$ cd /usr/local/share
$ sudo ln -s elasticsearch-0.20.5 elasticsearch
$ sudo chown -R root:wheel elasticsearch
$ cd /usr/local/share/elasticsearch
$ sudo bin/service/elasticsearch start
$ sudo bin/service/elasticsearch start
Starting ElasticSearch...
Waiting for ElasticSearch...
.
.
.
running: PID:83071
$
$ curl -XPUT http://localhost:9200/test/stupid-hypes/planking
-d '{"name":"Planking", "stupidity_level":"5"}'

{"ok":true,"_index":"test","_type":"stupid-
hypes","_id":"planking","_version":1}
$ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-
smashing -d '{"name":"Gallon Smashing",
"stupidity_level":"5"}'

{"ok":true,"_index":"test","_type":"stupid-
hypes","_id":"gallon-smashing","_version":1}
$ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-
smashing -d '{"name":"Gallon Smashing",
"stupidity_level":"10"}'

{"ok":true,"_index":"test","_type":"stupid-
hypes","_id":"gallon-smashing","_version":2}
$ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-
smashing -d '{"name":"Gallon Smashing",
"stupidity_level":"10", "lifetime":30}’

{"ok":true,"_index":"test","_type":"stupid-
hypes","_id":"gallon-smashing","_version":3}
SCHEMALESS, DOCUMENT ORIENTED
• No need to configure schema upfront
• No need for slow ALTER TABLE –like operations
• You can define a mapping (schema) to customize the indexing
  process
  • Require fields to be of certain type
  • If you want text fields that should not be analyzed (stemming,
    …)
“Ok, so it‟s a NoSQL store?”
TERMINOLOGY

MySQL                   Elastic Search
Database                Index
Table                   Type
Row                     Document
Column                  Field
Schema                  Mapping
Index                   Everything is indexed
SQL                     Query DSL
SELECT * FROM table …   GET http://…
UPDATE table SET …      PUT http://…
DISTRIBUTED & HIGHLY AVAILABLE
• Multiple servers (nodes) running in a cluster
  • Acting as single service
  • Nodes in cluster that store data or nodes that just help in
    speeding up search queries.
• Sharding
  • Indeces are sharded (# shards is configurable)
  • Each shard can have zero or more replicas
     • Replicas on different servers (server pools) for failover
     • One in the cluster goes down? No problem.
• Master
  • Automatic Master detection + failover
  • Responsible for distribution/balancing of shards
SCALING ISSUES?
• No need for an external load balancer
  • Since cluster does it‟s own routing.
  • Ask any server in the cluster, it will delegate to correct node.

• What if …
  • More data             >       More shards.
  • More availability     >       More replicas per shard.
PERFORMANCE TWEAKING
• Bulk Indexing
• Multi-Get
  • Avoids network latency (HTTP Api)
• Api with administrative & monitoring interface
  • Cluster‟s availability state
  • Health
  • Nodes‟ memory footprint
• Alternatives voor HTTP Api?
  • Java library
  • PHP wrappers (Sherlock, Elastica, …)
     • But simplicity of HTTP Api is brilliant to work with, latency is
       hardly an issue.
Still with me?
Some Examples
Query DSL Example:

(language:nl OR location.country:be OR location.country:aa)
(tag:sentiment.negative) author.followers:[1000 TO *] (-
sub_category:like) ((-status:857.assigned) (-status:857.done))
FACETS
• Instead of returning the matching documents …
• … return data about the distribution of values in the set of
  matching documents
   • Or a subset of the matching documents
• Possibilities:
   • Totals per unique value
   • Averages of values
   • Distributions of values
   •…
TERMINOLOGY (CONT‟D)

MySQL                       Elastic Search
SELECT field, COUNT(*)      Facet
FROM table GROUP BY field
ADVANCED FEATURES
• Nested documents (Child-Parent)
   • Like MySQL joins?
• Percolation Index
   • Store queries in Elastic
   • Send it documents
   • Get returned which queries match
• Index Warming
   • Register search queries that cause heavy load
   • New data added to index will be warmed
   • So next time query is executed: pre cached
WHAT ARE MY OTHER OPTIONS?
• RDBMS
  • MySQL, …
• NoSQL
  • MongoDB, …
• Search Engines
  • Solr
  • Sphinx
  • Xapian
  • Lucene itself
• SaaS
  • Amazon CloudSearch
… VS. SOLR
•+
  • Also built on Lucene
    • So similar feature set
    • Also exposes Lucene functionality, like Elastic Search, so
      easy to extend.
  • A part of Apache Lucene project
  • Perfect for Single Server search
•-
  • Clustering is there. But it‟s definitely not as simple as
    ElasticSearch‟
  • Fragmented code base. (Lots of branches.)


Engagor used to run on Solr.
… VS. SPHINX
•+
  • Great for single server full text searches;
  • Has graceful integration with SQL database;
    • (Eg. for indexing data)
  • Faster than the others for simple searches;
•-
  • No out of the box clustering;
  • Not built on Lucene; lacks some advanced features;


Netlog & Twoo use Sphinx.
WANT TO USE IT?
• In an existing project:
   • As an extra layer next to your data …
     • Send to both your database & elasticsearch;
     • Consistency problems?;
   • Or as replacement for database
     • Elastic is as persistent as MySQL;
     • If you don‟t need RDBMS features;
        • @Engagor: Our social messages are only in Elastic
“Users are incredibly bad at
finding and researching things
on the web.”
                         Nielsen (March 2013)
       http://www.nngroup.com/articles/search-navigation/
“Pathetic and useless are
words that come to mind after
this year‟s user testing.”
                         Nielsen (March 2013)
       http://www.nngroup.com/articles/search-navigation/
“I‟m searching for apples and
pears.”
“apples AND pears”
“apples OR pears”
“It‟s too young. Is it even stable
enough?”
          Your boss (Tomorrow Morning)
elasticsearch.org

irc.freenode.net
   #elasticsearch
elasticsearch
          node.js
         socket.io
  real time notifications
         rabbitmq
       backbone.js
         gearman

jobs@engagor.com
$ cd /usr/local/share/elasticsearch
$ sudo bin/service/elasticsearch stop
Sources include:
•    http://www.elasticsearch.org/videos/2010/02/07/es-introduction.html
•    http://www.elasticsearchtutorial.com/
•    http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
•    http://www.slideshare.net/medcl/elastic-search-quick-intro
•    http://www.slideshare.net/macrochen/elastic-search-apachesolr-10881377
•    http://www.slideshare.net/cyber_jso/elastic-search-introduction
•    http://www.slideshare.net/infochimps/elasticsearch-v4
•    http://engineering.foursquare.com/2012/08/09/foursquare-now-uses-elastic-search-and-on-a-related-note-slashem-also-
     works-with-elastic-search/
•    http://stackoverflow.com/questions/10213009/solr-vs-elasticsearch
•    http://stackoverflow.com/questions/11115523/how-does-amazon-cloudsearch-compares-to-elasticsearch-solr-or-sphinx-
     in-terms-o
•    http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/

An Introduction to Elastic Search.

  • 3.
  • 7.
    SELECT * FROM myauwesomewebsite WHERE `text` LIKE „%shizzle%‟
  • 10.
    TEXT SEARCH WITHPHP/MYSQL • FULLTEXT index • Only for CHAR, VARCHAR & TEXT columns • For MyISAM & InnoDB tables • Configurable Stop Words • Types: • Natural Language • Natural Language with Query Expansion • Boolean Full Text
  • 11.
    MYSQL FULLTEXT BOOLEANMODE Operators: •+ AND •- NOT • OR implied •() Nesting •* Wildcard •“ Literal matches
  • 12.
    TEXT SEARCH WITHPHP/MYSQL (CONT‟D) • Typical columns for search table: • Type • Id • Text • Priority • Process: • Blog posts, comments, … • Save (filtered) duplicate of text in search table. • When searching … • Search table and translate to original data via type/id This is how most php/mysql sites implement their search, right?
  • 13.
    SELECT * FROM mysearchtable WHEREMATCH(text) AGAINST („shizzle‟)
  • 14.
    SELECT * FROM mysearchtable WHERE MATCH(text) AGAINST („+shizzle –”ma nizzle”‟ IN BOOLEAN MODE)
  • 15.
    SELECT * FROMjobs WHERE role = „DEVELOPER‟ AND MATCH(job_description) AGAINST („node.js‟)
  • 16.
    SELECT * FROMjobs j JOIN jobs_benefits jb ON j.id = jb.job_id WHERE j.role = „DEVELOPER‟ AND (MATCH(job_description) AGAINST („node.js -asp‟ IN BOOLEAN MODE) AND jb.free_espresso = TRUE
  • 18.
    WHAT IS ASEARCH ENGINE? • Efficient indexing of data • On all fields / combination of fields • Analyzing data • Text Search • Tokenizing • Stemming • Filtering • Understanding locations • Date parsing • Relevance scoring
  • 19.
    TOKENIZING • Finding wordboundaries • Not just explode(„ „, $text); • Chinese has no spaces. (Not every single character is a word.) • Understand patterns: • URLs • Emails • #hashtags • Twitter @mentions • Currencies (EUR, €, …)
  • 20.
    STEMMING • “Stemming isthe process for reducing inflected (or sometimes derived) words to their stem, base or root form.” • Conjugations • Plurals • Example: • Fishing, Fished, Fish, Fisher > Fish • Better > Good • Several ways to find the stem: • Lookup tables • Suffix-stripping • Lemmatization •… • Different stemmers for every language.
  • 21.
    FILTERING • Remove stopwords • Different for every language • HTML • If you‟re indexing web content, not every character is meaningful.
  • 22.
    UNDERSTANDING LOCATIONS • Reversegeocoding of locations to longitude & latitude • Search on location: • Bounding box searches • Distance searches • Searching nearby • Geo Polygons • Searching a country (Note: MySQL also has geospatial indeces.)
  • 23.
    RELEVANCE SCORING • Fromthe matched documents, which ones do you show first? • Several strategies: • How many matches in document? • How many matches in document as percentage of length? • Custom scoring algorithms • At index time • At search time • … A combination Think of Google PageRank.
  • 24.
    “There‟s an appsoftware for that.”
  • 26.
    APACHE LUCENE • “Information retrieval software library” • Free/open source • Supported by Apache Foundation • Created by Doug Cutting • Written in 1999
  • 27.
    “There‟s software aJava library for that.”
  • 30.
    ELASTICSEARCH • “You know,for Search” • Also Free & Open Source • Built on top of Lucene • Created by Shay Banon @kimchy • Versions • First public release, v0.4 in February 2010 • A rewrite of earlier “Compass” project, now with scalability built-in from the very core • Now stable version at 0.20.6 • Beta branch at 0.90 (working towards 1.0 release) • In Java, so inherently cross-platform
  • 31.
    WHAT DOES ITADD TO LUCENE? • RESTfull Service • JSON API over HTTP • Want to use it from PHP? • CURL Requests, as if you‟d do requests to the Facebook Graph API. • High Availability & Performance • Clustering • Long Term Persistency • Write through to persistent storage system.
  • 32.
    $ cd ~/Downloads $wget https://github.com/…/elasticsearch-0.20.5.tar.gz $ tar –xzf elasticsearch-0.20.5.tar.gz $ cd elasticsearch-0.20.5/ $ ./bin/elasticsearch
  • 33.
    $ cd ~/Downloads $wget https://github.com/…/elasticsearch-0.20.5.tar.gz $ tar –xzf elasticsearch-0.20.5.tar.gz $ git clone https://github.com/elasticsearch/elasticsearch- servicewrapper.git elasticsearch-servicewrapper $ sudo mv elasticsearch-0.20.5 /usr/local/share $ cd elasticsearch-servicewrapper $ sudo mv service /usr/local/share/elasticsearch-0.20.5/bin $ cd /usr/local/share $ sudo ln -s elasticsearch-0.20.5 elasticsearch $ sudo chown -R root:wheel elasticsearch $ cd /usr/local/share/elasticsearch $ sudo bin/service/elasticsearch start
  • 34.
    $ sudo bin/service/elasticsearchstart Starting ElasticSearch... Waiting for ElasticSearch... . . . running: PID:83071 $
  • 36.
    $ curl -XPUThttp://localhost:9200/test/stupid-hypes/planking -d '{"name":"Planking", "stupidity_level":"5"}' {"ok":true,"_index":"test","_type":"stupid- hypes","_id":"planking","_version":1}
  • 37.
    $ curl -XPUThttp://localhost:9200/test/stupid-hypes/gallon- smashing -d '{"name":"Gallon Smashing", "stupidity_level":"5"}' {"ok":true,"_index":"test","_type":"stupid- hypes","_id":"gallon-smashing","_version":1}
  • 38.
    $ curl -XPUThttp://localhost:9200/test/stupid-hypes/gallon- smashing -d '{"name":"Gallon Smashing", "stupidity_level":"10"}' {"ok":true,"_index":"test","_type":"stupid- hypes","_id":"gallon-smashing","_version":2}
  • 40.
    $ curl -XPUThttp://localhost:9200/test/stupid-hypes/gallon- smashing -d '{"name":"Gallon Smashing", "stupidity_level":"10", "lifetime":30}’ {"ok":true,"_index":"test","_type":"stupid- hypes","_id":"gallon-smashing","_version":3}
  • 42.
    SCHEMALESS, DOCUMENT ORIENTED •No need to configure schema upfront • No need for slow ALTER TABLE –like operations • You can define a mapping (schema) to customize the indexing process • Require fields to be of certain type • If you want text fields that should not be analyzed (stemming, …)
  • 43.
    “Ok, so it‟sa NoSQL store?”
  • 46.
    TERMINOLOGY MySQL Elastic Search Database Index Table Type Row Document Column Field Schema Mapping Index Everything is indexed SQL Query DSL SELECT * FROM table … GET http://… UPDATE table SET … PUT http://…
  • 47.
    DISTRIBUTED & HIGHLYAVAILABLE • Multiple servers (nodes) running in a cluster • Acting as single service • Nodes in cluster that store data or nodes that just help in speeding up search queries. • Sharding • Indeces are sharded (# shards is configurable) • Each shard can have zero or more replicas • Replicas on different servers (server pools) for failover • One in the cluster goes down? No problem. • Master • Automatic Master detection + failover • Responsible for distribution/balancing of shards
  • 49.
    SCALING ISSUES? • Noneed for an external load balancer • Since cluster does it‟s own routing. • Ask any server in the cluster, it will delegate to correct node. • What if … • More data > More shards. • More availability > More replicas per shard.
  • 50.
    PERFORMANCE TWEAKING • BulkIndexing • Multi-Get • Avoids network latency (HTTP Api) • Api with administrative & monitoring interface • Cluster‟s availability state • Health • Nodes‟ memory footprint • Alternatives voor HTTP Api? • Java library • PHP wrappers (Sherlock, Elastica, …) • But simplicity of HTTP Api is brilliant to work with, latency is hardly an issue.
  • 51.
  • 52.
  • 57.
    Query DSL Example: (language:nlOR location.country:be OR location.country:aa) (tag:sentiment.negative) author.followers:[1000 TO *] (- sub_category:like) ((-status:857.assigned) (-status:857.done))
  • 61.
    FACETS • Instead ofreturning the matching documents … • … return data about the distribution of values in the set of matching documents • Or a subset of the matching documents • Possibilities: • Totals per unique value • Averages of values • Distributions of values •…
  • 62.
    TERMINOLOGY (CONT‟D) MySQL Elastic Search SELECT field, COUNT(*) Facet FROM table GROUP BY field
  • 65.
    ADVANCED FEATURES • Nesteddocuments (Child-Parent) • Like MySQL joins? • Percolation Index • Store queries in Elastic • Send it documents • Get returned which queries match • Index Warming • Register search queries that cause heavy load • New data added to index will be warmed • So next time query is executed: pre cached
  • 66.
    WHAT ARE MYOTHER OPTIONS? • RDBMS • MySQL, … • NoSQL • MongoDB, … • Search Engines • Solr • Sphinx • Xapian • Lucene itself • SaaS • Amazon CloudSearch
  • 67.
    … VS. SOLR •+ • Also built on Lucene • So similar feature set • Also exposes Lucene functionality, like Elastic Search, so easy to extend. • A part of Apache Lucene project • Perfect for Single Server search •- • Clustering is there. But it‟s definitely not as simple as ElasticSearch‟ • Fragmented code base. (Lots of branches.) Engagor used to run on Solr.
  • 68.
    … VS. SPHINX •+ • Great for single server full text searches; • Has graceful integration with SQL database; • (Eg. for indexing data) • Faster than the others for simple searches; •- • No out of the box clustering; • Not built on Lucene; lacks some advanced features; Netlog & Twoo use Sphinx.
  • 69.
    WANT TO USEIT? • In an existing project: • As an extra layer next to your data … • Send to both your database & elasticsearch; • Consistency problems?; • Or as replacement for database • Elastic is as persistent as MySQL; • If you don‟t need RDBMS features; • @Engagor: Our social messages are only in Elastic
  • 70.
    “Users are incrediblybad at finding and researching things on the web.” Nielsen (March 2013) http://www.nngroup.com/articles/search-navigation/
  • 71.
    “Pathetic and uselessare words that come to mind after this year‟s user testing.” Nielsen (March 2013) http://www.nngroup.com/articles/search-navigation/
  • 72.
    “I‟m searching forapples and pears.”
  • 73.
  • 75.
  • 76.
    “It‟s too young.Is it even stable enough?” Your boss (Tomorrow Morning)
  • 81.
  • 82.
    elasticsearch node.js socket.io real time notifications rabbitmq backbone.js gearman jobs@engagor.com
  • 84.
    $ cd /usr/local/share/elasticsearch $sudo bin/service/elasticsearch stop
  • 86.
    Sources include: • http://www.elasticsearch.org/videos/2010/02/07/es-introduction.html • http://www.elasticsearchtutorial.com/ • http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch • http://www.slideshare.net/medcl/elastic-search-quick-intro • http://www.slideshare.net/macrochen/elastic-search-apachesolr-10881377 • http://www.slideshare.net/cyber_jso/elastic-search-introduction • http://www.slideshare.net/infochimps/elasticsearch-v4 • http://engineering.foursquare.com/2012/08/09/foursquare-now-uses-elastic-search-and-on-a-related-note-slashem-also- works-with-elastic-search/ • http://stackoverflow.com/questions/10213009/solr-vs-elasticsearch • http://stackoverflow.com/questions/11115523/how-does-amazon-cloudsearch-compares-to-elasticsearch-solr-or-sphinx- in-terms-o • http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/

Editor's Notes

  • #3 This talk is about adding search to your own website.Implementing a search engine for your own content.First the bit about how it’s done in MySQL, and what problems that brings with it;Then about what a search engine should do for you;Then about how Elastic Search helps;And code of course, lots of code.
  • #4 I’m leading the development team at Engagor, a startup in the city centreof Gent, Belgium.We’re 2 years old.We build a social media monitoring & management product.Before that I worked for Massive Media for 5 years.As a lead developer I worked on the Netlog & Gatchaproducts.
  • #5 Search can be a “simple” textsearch.Here I’m searching Tumblr for funny gifs, because that’s what Tumblris for.
  • #6 But search can go deeper and more into detail too ..Here I’m usingAND, OR, NOTNestingRestrictionson fields
  • #7 Or very difficult …Searching in a mixed set of dataProfilesPhotosFriend connectionsSearching in a graph …
  • #8 My first thought when I’d have to add search to a php/mysql site … It sort of works …
  • #9 Problems arisewhen you have lots of data …To speed things up you add indecesto your MySQL tables.
  • #10 And the library analogy for a MySQL index is this …An index card box.
  • #11 MySQL has an index type esp. for full text search.Natural Language:case insensitive, accent incensitiveQuery Expansion:Search for “database” > returns results that has often has words like “mysql”, “oracle”, … A second search with extended query string happens to find related documents too.Boolean ModeAdds operators to Natural Language type
  • #13 Anyone using a similar system to this?Implemented yourself, or from a CMS?phpBB has/had a table like this.
  • #14 Example of FULLTEXT MySQL search query.
  • #15 Example of FULLTEXT MySQL search query in boolean mode. A bit more powerful.
  • #16 Now, we add restrictions on a certain fields. Now you need combined indeces to keep this fast.
  • #17 And even more restrictions.Indeces on all involved feeds? In all combinations? In all orders?Lots of indeces make WRITE operations on your tables slower.
  • #18 MySQL just isn’t built for complex search …So let’s look at what a system built for search needs …
  • #27 So it’s old.But still active.Used a lot.
  • #28 It is however a java library. It’s not a fully managed service.
  • #29 If you have lots of data; you need a search engine to be scalable & highly available.
  • #30 And that’s where ElasticSearch comes in.
  • #33 Download, unzip, start.
  • #34 This time we install it in the right place, and wrapped in a service.
  • #35 Now it runs on your localhost. As a HTTP service, so open your browser and surf to your Elastic Search server.
  • #36 HTTP Access, it’s brilliant!Do you want to secure it? Add firewalls …Do you want to cache it? Add Varnish …
  • #37 Example of adding something to ElasticSearch from your command shell via a HTTP PUT request done by curl.You can do this right after installation. No need to create an index or configure anything, just add data right away.
  • #38 Adding a second record (document).
  • #39 Updating an existing record. (Mind the new version number.)
  • #40 Want to see the record?Surf to it.The url consist ofindex, type & id.
  • #41 Want to add a new field (column)?Update your document with new field added.
  • #42 It’s right there.
  • #43 So it’s schemaless.And actually we did ZERO CONFIGURATION. We didn’t have to create indeces or tell Elastic Search what type of data we’ll be adding.Actually, you can configure a mapping/schema:To require certain fields to be of a certain typeTo avoid text fields of being analyzed (text analysis)Basically: to speed things up …
  • #44 What we’ve demoed so far is aNoSQL store. That’s cool. But not all.
  • #45 Here we do a GET request (in the browser) that searches our newly created index for the word “smashing”It returns the 2nd document.
  • #46 Curl in PHP is simple.Simplest example of how to do a search to elastic from PHP.
  • #47 I’ve mentioned a few Elastic Search specific terms.Here’s the full breakdown of terminology and how they related to MySQL concepts.
  • #48 Back to the clustering features Elastic Search adds …
  • #49 Here you have a Engagor specific dashboard that shows the 12 servers in the Engagor cluster.You see:server12 is the master;That every node knows each other;There are 11K shards;Each with one replica.Health = green means every shard has a replica (on a different server).If one server goes down: no problem.
  • #54 Now let’s look at how Engagor uses Elastic Search.First, what do we do?From a technical point of view:Engagor = Huge database of social messages.Facebook, Twitter, Instagram, News sites …We save those that are clients are interested in to our application and offer:Statistics about the tracked dataWorkflow toolsAutomation tools
  • #55 This is the time to show a slide I stole from a presentation from our CEO Folke.We started 2 years ago.I joined after 6 months.Now a team of 16 people.7 of them technical profilesweb developersdata scientistsbackend developersfrontend developersOur customers includeMobistarTelenetMicrosoft EuropeEuropean CommisionAlproSeveral agenciesThey use Engagor for customer support (“call center” software for social media)marketing insightscrisis detection…
  • #56 Here you see an example of a search on our cluster for a certain twitter user’s handle.You see it returns 260 social messages.Each message has data like the:IdService it’s fromContentDateAuthor details
  • #57 Here we’re searching in a topic about Belgacom & Mobile Vikings.“All messages from users with at least 1000 followers, that are negative and from Belgium or in Dutch.”On 4 different fields, and nested … It just works.
  • #58 This is the Query SQL we’re sending to Elastic for the previous search.
  • #59 Here we are in a topic about Coca Cola, thus high volume.About 50k message per day. 28 days.That’s 1,4M messages we’re searching in.This is a graph of messages per day.
  • #60 This is the inbox. Showing the last 10 messages in that topic.Performance: about half a second.
  • #61 Same inbox, but now only showing messages with the word “thirsty”.Performance: again about half a second.(Only 1 sample, so this is not really a benchmark .)
  • #62 Now, there’s another feature of search engines you might not immediately think about.
  • #63 Think of it as an equivalent to MySQL GROUP BY query.
  • #64 The pie chart:Facet on sentiment field of a mention. Returns totals per value.Sentiment Per Day:Facet on combined fields: sentiment + day(dateadd).Returns data used for second chart.You can also use the filter like in the inbox, to see these facets for a filtered set of data. Eg. Sentiment per day for mentions coming from Belgium only.
  • #65 For the Telenet Twitter profile:Totals of messages per dayPer typeRetweetsMentionsOwn tweetsEvery “segment” (color) is facet with custom filter/search querySingle ElasticSearch call to get all this information.
  • #66 Percolation example:Use it to route documents …Eg. you have a stream of data coming in and need to decide what to do with those documents based on queries.Everything matching x is for client AEverything matching y is for client B
  • #68 There’s a good competition going on. Looks like Elastic Search has made Solr alert again; since they’re also focusing on clustering features now.
  • #71 It’s not because you have a great search engine, that you have great search experience on your site …Users aren’t very good at it …But searching for things is not that easy …
  • #73 Look at how natural language differs from search query language.
  • #75 This happens quite often at the Engagor office.Not that we blame our users …It’s just difficult to get it right.
  • #77 If you want to play with it, and are excited …But you’re boss is all #meh.
  • #78 ElasticSearch is a mature product.Soundcloud is using it.
  • #79 StumbleUpon is using it.
  • #80 Foursquare switched to it …When checking in somewhere, they show:Locations close to youLocations you previously checked inLocations that are popularThat’s a pretty nifty search query right there.
  • #81 There’s also a real company behind Elastic Search, backing it with:TutorialsTrainingSupportSLA’sConsulting
  • #82 Recently redesigned site. Documentation is a bit frightening at first, since it’s hard to know what to search for, but I hope this presentation solves that.IRC channel is very active; quick answers.
  • #83 If you’re interested in working with Elastic Search, or in fact, any of these other cool technologies …We can always use good profiles: junior or senior, backend or frontend … Send us your resumes ;).March 27th: Now we’re esp. searching for someone who’s good at Javascript.
  • #84 Already gave an overview of how it’s like to work with the technologies at Engagor.Go to http://startupshizzle.tumblr.com to find out how it’s like to work with the people of our team ;).
  • #85 So, that’s it.
  • #87 So, that’s it.