You know, for search
  February 18th, 2011, RivieraJUG
About me
●   Lukáš Vlček ( @lukasvlcek )
●   Java developer since 2001
●   Joined Red Hat (JBoss division) in 2010
●   Member of JBoss.org team, focusing on search
In the beginning there was...
Zzzzz...




...Shay Banon is not sleeping but dreaming!
Highly-available
                {JSON}



   RESTful
Search Engine                                     Distributed




                                              CLOUD
 Asynchronous




       Open Sourced
                         ... buzzword dreaming!
          (ASL2)
!                              *
                                 :-)

*
        ... and @ElasticSearch
                     was born!
Dreams do come true




https://github.com/elasticsearch/elasticsearch




        http://www.elasticsearch.org
What is ElasticSearch ?
●   Distributed
●   Highly-available
●   RESTful search engine (on top of Lucene)
●   Designed to speak JSON (JSON in, JSON out)
●   and more...


    ... but first, let's check simple examples.
Demo #1




RESTful JSON teaser
RESTful
●   Network interface for data indexing, searching
    and administration.

curl ­XGET 'http://localhost:9200/index1,index2/typeA,typeB/_search' ­d '{
  “query“ : { “match_all“ : {} }
}'




You can query one or more indices.
Indices can have aliases, you can also
use _all for all indices.


Each index have one or more types, something
like columns in DB table.
Highly available
●   For each index you can specify:
    ●   Number of shards
        –   Each index has fixed number of shards
    ●   Number of replicas
        –   Each shard can have 0-many replicas, can be changed
            dynamically
Distributed
●   Check next slides...
ZEN Discovery




      Node 1                    Node 2              Node 3              Node 4



                      A                    B                        C
A: { shards: 3, replicas: 2 }       B: { shards: 2, replicas: 3 }       C: { shards: 1, replicas: 0 }

 A1      A1      A1
                                     B1      B1      B1      B1
 A2      A2      A2                                                      C1
                                     B2      B2      B2      B2
 A3      A3      A3



        Gateway (longterm persistency of cluster data & metadata)
ZEN Discovery




      Node 1                    Node 2              Node 3              Node 4



                      A                    B                        C
A: { shards: 3, replicas: 2 }       B: { shards: 2, replicas: 3 }       C: { shards: 1, replicas: 4 }

 A1      A1      A1
                                     B1      B1      B1      B1
 A2      A2      A2                                                      C1     C1     C1      C1       C1
                                     B2      B2      B2      B2
 A3      A3      A3
                                                                                Can not allocate all replicas!
                                                                                Check Health API


        Gateway (longterm persistency of cluster data & metadata)
Talking to the cluster
●   Native client in Java and Groovy

          Client




                    Node 1   Node 2    Node 3


●   Client type:
    ● Node client
    ● Transport Client
Talking to the cluster
●   REST client

             Client




                       Node 1      Node 2      Node 3


●   Many clients built on top of REST API
    ●   Perl, PHP, Python, Ruby, Erlang, ... etc
Nodes do not have to be equal
●   Can be a master
●   Can be a data node
●   Can allow for REST transport interface
    ●     Http, memcached, thrift
●   Index store (file, memory)               Node 1
●   JMX enabled
●   Thread pool type
●   ...                             Node 2            Node 3
Gateway
●   Long time persistency allows for whole (and
    partial) cluster backup and recovery.

    Types:
    ●   Local (default)
    ●   NFS
    ●   HDFS
    ●   AWS: S3
Distributed queries
●   You can control the type of the search query
    per search request:
    ●   Query and Fetch
    ●   Query then Fetch
    ●   Dfs, Query and Fetch
    ●   Dfs, Query then Fetch
Demo #2




 Dynamic allocation of indices,
shards, replicas and Health API
Admin API
●   Indices
       –   Status
       –   CRUD operation
       –   Mapping, Open/Close, Update settings
       –   Flush, Refresh, Snapshot, Optimize
●   Cluster
       –   Health
       –   State
       –   Node Info and stats
       –   Shutdown
Demo #3




Admin API: getting JVM and OS stats
Rich query API
●   There is rich Query DSL for search, includes:
    ●   Queries
         –   Boolean, Fuzzy, MLT, Prefix, DisMax, ...
    ●   Filters
         –   And/Or/Not, Boolean, Geo, Missing, Exists, ...
    ●   Highlighting
    ●   Sort
    ●   Facets
         –   on a next slide...
Facets
●   Facets allows to provide aggregated data on
    the search request.
    ●   query
    ●   filter
    ●   terms
    ●   range
    ●   (date) histogram
    ●   statistical
    ●   geo distance
Scripting support
●   There is a support for using scripting languages
    in many places (for example for custom scoring,
    script fields, script key in facets ...)
    ●   mvel (default)
    ●   JS
    ●   Groovy
    ●   Python
Demo #4




      Java API: Indexing data
REST API: Faceted search, Highlighting
Parent / Child
●   The parent/child support allows to define a
    parent relationship from a child to a parent type.
    ●   has_child (query, filter)
    ●   top_children (filter)
River
●   Let's listen on stream of changes and index the
    data...
    ●   CouchDB
    ●   RabbitMQ
    ●   Twitter
    ●   Wikipedia
Versioning (new in 0.15)
●   “update if current” functionality
●   ie: I can get a document, change it and then put
    it back in (referencing the version ID I fetched)
    and it will either index or fail (if the document
    has been modified in the interim)
●   Completely real-time
Percolator (new in 0.15)
●   The percolator API allows to register queries
    against an index, and then send a percolate
    request which includes a document, and getting
    back the queries that match on that document
    out of set of registered queries.
Q&A
Thank you!

Elastic Search

  • 1.
    You know, forsearch February 18th, 2011, RivieraJUG
  • 2.
    About me ● Lukáš Vlček ( @lukasvlcek ) ● Java developer since 2001 ● Joined Red Hat (JBoss division) in 2010 ● Member of JBoss.org team, focusing on search
  • 3.
    In the beginningthere was...
  • 5.
    Zzzzz... ...Shay Banon isnot sleeping but dreaming!
  • 6.
    Highly-available {JSON} RESTful Search Engine Distributed CLOUD Asynchronous Open Sourced ... buzzword dreaming! (ASL2)
  • 7.
    ! * :-) * ... and @ElasticSearch was born!
  • 8.
    Dreams do cometrue https://github.com/elasticsearch/elasticsearch http://www.elasticsearch.org
  • 9.
    What is ElasticSearch? ● Distributed ● Highly-available ● RESTful search engine (on top of Lucene) ● Designed to speak JSON (JSON in, JSON out) ● and more... ... but first, let's check simple examples.
  • 10.
  • 11.
    RESTful ● Network interface for data indexing, searching and administration. curl ­XGET 'http://localhost:9200/index1,index2/typeA,typeB/_search' ­d '{   “query“ : { “match_all“ : {} } }' You can query one or more indices. Indices can have aliases, you can also use _all for all indices. Each index have one or more types, something like columns in DB table.
  • 12.
    Highly available ● For each index you can specify: ● Number of shards – Each index has fixed number of shards ● Number of replicas – Each shard can have 0-many replicas, can be changed dynamically
  • 13.
    Distributed ● Check next slides...
  • 14.
    ZEN Discovery Node 1 Node 2 Node 3 Node 4 A B C A: { shards: 3, replicas: 2 } B: { shards: 2, replicas: 3 } C: { shards: 1, replicas: 0 } A1 A1 A1 B1 B1 B1 B1 A2 A2 A2 C1 B2 B2 B2 B2 A3 A3 A3 Gateway (longterm persistency of cluster data & metadata)
  • 15.
    ZEN Discovery Node 1 Node 2 Node 3 Node 4 A B C A: { shards: 3, replicas: 2 } B: { shards: 2, replicas: 3 } C: { shards: 1, replicas: 4 } A1 A1 A1 B1 B1 B1 B1 A2 A2 A2 C1 C1 C1 C1 C1 B2 B2 B2 B2 A3 A3 A3 Can not allocate all replicas! Check Health API Gateway (longterm persistency of cluster data & metadata)
  • 16.
    Talking to thecluster ● Native client in Java and Groovy Client Node 1 Node 2 Node 3 ● Client type: ● Node client ● Transport Client
  • 17.
    Talking to thecluster ● REST client Client Node 1 Node 2 Node 3 ● Many clients built on top of REST API ● Perl, PHP, Python, Ruby, Erlang, ... etc
  • 18.
    Nodes do nothave to be equal ● Can be a master ● Can be a data node ● Can allow for REST transport interface ● Http, memcached, thrift ● Index store (file, memory) Node 1 ● JMX enabled ● Thread pool type ● ... Node 2 Node 3
  • 19.
    Gateway ● Long time persistency allows for whole (and partial) cluster backup and recovery. Types: ● Local (default) ● NFS ● HDFS ● AWS: S3
  • 20.
    Distributed queries ● You can control the type of the search query per search request: ● Query and Fetch ● Query then Fetch ● Dfs, Query and Fetch ● Dfs, Query then Fetch
  • 21.
    Demo #2 Dynamicallocation of indices, shards, replicas and Health API
  • 22.
    Admin API ● Indices – Status – CRUD operation – Mapping, Open/Close, Update settings – Flush, Refresh, Snapshot, Optimize ● Cluster – Health – State – Node Info and stats – Shutdown
  • 23.
    Demo #3 Admin API:getting JVM and OS stats
  • 24.
    Rich query API ● There is rich Query DSL for search, includes: ● Queries – Boolean, Fuzzy, MLT, Prefix, DisMax, ... ● Filters – And/Or/Not, Boolean, Geo, Missing, Exists, ... ● Highlighting ● Sort ● Facets – on a next slide...
  • 25.
    Facets ● Facets allows to provide aggregated data on the search request. ● query ● filter ● terms ● range ● (date) histogram ● statistical ● geo distance
  • 26.
    Scripting support ● There is a support for using scripting languages in many places (for example for custom scoring, script fields, script key in facets ...) ● mvel (default) ● JS ● Groovy ● Python
  • 27.
    Demo #4 Java API: Indexing data REST API: Faceted search, Highlighting
  • 28.
    Parent / Child ● The parent/child support allows to define a parent relationship from a child to a parent type. ● has_child (query, filter) ● top_children (filter)
  • 29.
    River ● Let's listen on stream of changes and index the data... ● CouchDB ● RabbitMQ ● Twitter ● Wikipedia
  • 30.
    Versioning (new in0.15) ● “update if current” functionality ● ie: I can get a document, change it and then put it back in (referencing the version ID I fetched) and it will either index or fail (if the document has been modified in the interim) ● Completely real-time
  • 31.
    Percolator (new in0.15) ● The percolator API allows to register queries against an index, and then send a percolate request which includes a document, and getting back the queries that match on that document out of set of registered queries.
  • 32.
  • 33.