Hippo meetup: enterprise search with Solr and elasticsearch

Presentation used at the Hippo meetup about enterprise search which took place in Amsterdam. The talk started with a general introduction about search with lucene, scaling with Solr and the distributed problems that elasticsearch successfully addresses.

Published in: Technology

  1. 15th January 2013 – Hippo meetup
     Luca Cavanna
     Software developer & Search consultant at Trifork Amsterdam
     luca.cavanna@trifork.nl - @lucacavanna
  2. Trifork (aka Jteam/Dutchworks/Orange11)
     Focus areas:
     – Big data & Search
     – Mobile
     – Custom solutions
     – Knowledge (GOTO Amsterdam)
     ● Hippo partner
     ● Hippo related search projects:
       – uva.nl
       – working on rijksoverheid.nl
  3. Agenda
     ● Search introduction
       – Lucene foundation
       – Why do we need Solr or elasticsearch?
     ● Scaling with Solr
     ● Elasticsearch distributed nature
     ● Elasticsearch features
  4. Apache Lucene
     ● High-performance, full-featured text search engine library written entirely in Java
     ● It indexes documents as collections of fields
     ● A field is a string based key-value pair
     ● What data structure does it use under the hood?
  5. Inverted index
     Documents:
       1  The old night keeper keeps the keep in the town
       2  In the big old house in the big old gown.
       3  The house in the town had the big old keep
       4  Where the old night keeper never did sleep.
       5  The night keeper keeps the keep in the night
       6  And keeps in the dark and sleeps in the light.
     Index:
       term     freq   posting list
       and      1      6
       big      2      2, 3
       dark     1      6
       did      1      4
       gown     1      2
       had      1      3
       house    2      2, 3
       in       5      1, 2, 3, 5, 6
       keep     3      1, 3, 5
       keeper   3      1, 4, 5
       keeps    3      1, 5, 6
       light    1      6
       never    1      4
       night    3      1, 4, 5
       old      4      1, 2, 3, 4
       sleep    1      4
       sleeps   1      6
       the      6      1, 2, 3, 4, 5, 6
       town     2      1, 3
       where    1      4
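The table on this slide can be reproduced with a few lines of plain Java. This is only a sketch of the idea, not Lucene's implementation: the class name `InvertedIndexSketch` is made up, and the "analysis" is a crude lowercase-and-split step standing in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (illustrative only, not how Lucene stores its data).
final class InvertedIndexSketch {

    // The six example documents from the slide; document numbers start at 1.
    static final List<String> DOCS = List.of(
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light.");

    // Maps each term to the ascending list of documents that contain it.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1;
            // Crude analysis step: lowercase, then split on non-letters.
            String[] terms = docs.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Add each document once, however often the term occurs in it.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Prints one line per term: term, document frequency, posting list.
        build(DOCS).forEach((term, postings) ->
                System.out.println(term + " " + postings.size() + " " + postings));
    }
}
```

Looking up a term is then a single map access, which is why a query never has to scan the documents themselves.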
  6. Inverted index
     ● Indexing
       – Text analysis
         ● Tokenization, lowercasing and more
     ● The inverted index can contain more data
       – Term offsets and more
     ● The inverted index itself doesn't contain the text for displaying the search results
  7. Indexing
     ● Lucene writes indexes as segments
     ● Segments are not modifiable: Write-Once
     ● Each segment is a searchable mini index
     ● Each segment contains
       – Inverted index
       – Stored fields
       – ...and more
  8. Indexing: the commit operation
     ● Documents are searchable only after a commit!
     ● Commit also gives durability
     ● The most expensive operation in Lucene!!!
  9. Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)
     ● With the Lucene near-real-time API you don't need a commit to make new documents searchable
     ● Less expensive than commit
     ● Doesn't guarantee durability though
     ● Exposed as soft commit in Solr 4.0
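The difference between a commit and a near-real-time refresh comes down to visibility versus durability. The toy model below tries to make that concrete. It is not Lucene code: `NearRealTimeSketch` and all its methods are hypothetical names, and "durable" is simulated with a list that survives a fake crash.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of visibility vs. durability (names are made up, not Lucene API).
final class NearRealTimeSketch {
    private final List<String> buffer = new ArrayList<>();     // added, not yet visible
    private final List<String> searchable = new ArrayList<>(); // visible to searches
    private final List<String> durable = new ArrayList<>();    // survives a crash

    void add(String doc) {
        buffer.add(doc);
    }

    // Cheap: makes buffered documents visible, but nothing is synced to disk.
    void refresh() {
        searchable.addAll(buffer);
        buffer.clear();
    }

    // Expensive: makes everything visible and durable.
    void commit() {
        refresh();
        durable.clear();
        durable.addAll(searchable);
    }

    // Simulated crash and restart: only committed documents come back.
    void crashAndRecover() {
        buffer.clear();
        searchable.clear();
        searchable.addAll(durable);
    }

    boolean isSearchable(String doc) {
        return searchable.contains(doc);
    }

    public static void main(String[] args) {
        NearRealTimeSketch index = new NearRealTimeSketch();
        index.add("doc-1");
        index.refresh();
        System.out.println("after refresh: " + index.isSearchable("doc-1"));
        index.crashAndRecover();
        System.out.println("after crash:   " + index.isSearchable("doc-1"));
    }
}
```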
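  10. Lucene code example – indexing data

          import java.io.File;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.*;
          import org.apache.lucene.index.*;
          import org.apache.lucene.store.*;
          import org.apache.lucene.util.Version;

          // Open an index writer on the "data" directory.
          IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                  new StandardAnalyzer(Version.LUCENE_40));
          Directory directory = FSDirectory.open(new File("data"));
          IndexWriter writer = new IndexWriter(directory, config);

          Document document = new Document();

          // id: indexed as a single token and stored.
          FieldType idFieldType = new FieldType();
          idFieldType.setIndexed(true);
          idFieldType.setStored(true);
          idFieldType.setTokenized(false);
          document.add(new Field("id", "id-1", idFieldType));

          // title: indexed and stored.
          FieldType titleFieldType = new FieldType();
          titleFieldType.setIndexed(true);
          titleFieldType.setStored(true);
          document.add(new Field("title", "This is the title", titleFieldType));

          // description: indexed only, so it is searchable but not returned.
          FieldType descriptionFieldType = new FieldType();
          descriptionFieldType.setIndexed(true);
          document.add(new Field("description", "This is the description", descriptionFieldType));

          writer.addDocument(document);
          writer.close(); // close() also commits the pending changes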
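  11. Lucene code example – querying and showing results

          import java.io.File;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.index.*;
          import org.apache.lucene.queryparser.classic.QueryParser;
          import org.apache.lucene.search.*;
          import org.apache.lucene.store.*;
          import org.apache.lucene.util.Version;

          // Parse the query string against the "title" field by default.
          QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
                  new StandardAnalyzer(Version.LUCENE_40));
          Query query = queryParser.parse(queryAsString);

          Directory directory = FSDirectory.open(new File("data"));
          IndexReader indexReader = DirectoryReader.open(directory);
          IndexSearcher indexSearcher = new IndexSearcher(indexReader);

          // Fetch the 10 best hits and print their stored fields.
          TopDocs topDocs = indexSearcher.search(query, 10);
          System.out.println("Total hits: " + topDocs.totalHits);
          for (ScoreDoc hit : topDocs.scoreDocs) {
              Document document = indexSearcher.doc(hit.doc);
              for (IndexableField field : document) {
                  System.out.println(field.name() + ": " + field.stringValue());
              }
          }
          indexReader.close();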
  12. What's missing?
      ● A common way to represent documents
      ● An interface to send documents to (HTTP)
      ● A way to represent queries
      ● An interface to send queries to (HTTP)
      ● Configuration
      ● Caching
      ● Distributed infrastructure
      ● And more...
  13. Enterprise search servers
  14. Scaling – why?
      ‣ The more concurrent searches you run, the slower they get
      ‣ Indexing and searching on the same machine will substantially harm search performance
      ‣ Segment merging can be a CPU- and I/O-intensive operation
      ‣ Disk cache invalidation
      ‣ Failover
  15. Solr replication example
  16. Solr replication (pull approach)
      • Master-slave based solution
      • Single machine for indexing data (master)
      • Multiple machines for querying (slaves)
      • Master is not aware of the slaves
      • Slave is aware of the master
      • Load balancer responsible for balancing the query requests
      • What about real-time search? No way!
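A rough way to picture the pull approach is a slave that periodically asks the master for its index version and copies the committed index when the version has changed. The sketch below models only that idea in plain Java. `PullReplicationSketch`, `Master` and `Slave` are hypothetical names and none of this is Solr's actual replication code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of pull replication (illustrative only, not Solr's implementation).
final class PullReplicationSketch {

    static final class Master {
        private final List<String> pending = new ArrayList<>();
        private List<String> committed = List.of();
        private long version = 0;

        void index(String doc) {
            pending.add(doc);
        }

        // A commit publishes a new index version; the master never contacts slaves.
        void commit() {
            List<String> next = new ArrayList<>(committed);
            next.addAll(pending);
            pending.clear();
            committed = List.copyOf(next);
            version++;
        }

        long version() {
            return version;
        }

        List<String> committedDocs() {
            return committed;
        }
    }

    static final class Slave {
        private final Master master; // the slave knows its master
        private long localVersion = 0;
        private List<String> docs = List.of();

        Slave(Master master) {
            this.master = master;
        }

        // Called on a timer; returns true when a newer index was pulled.
        boolean poll() {
            if (master.version() == localVersion) {
                return false;
            }
            docs = master.committedDocs();
            localVersion = master.version();
            return true;
        }

        boolean contains(String doc) {
            return docs.contains(doc);
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave(master);
        master.index("doc-1");
        master.commit();
        System.out.println("before poll: " + slave.contains("doc-1"));
        slave.poll();
        System.out.println("after poll:  " + slave.contains("doc-1"));
    }
}
```

A new document is searchable on a slave only after a commit on the master plus the next poll, which is why this setup does not fit real-time search.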
  17. SolrCloud
      • A set of new distributed capabilities in Solr
        • Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election
      • Whatever server (shard) you send data to:
        • the documents get distributed over the shards
      • A shard can be a leader or a replica and contains a subset of the data
      • Easily scale up by adding new Solr nodes
  18. elasticsearch
      ● Distributed search engine built on top of Lucene
      ● Apache 2 license
      ● Written in Java
      ● RESTful
      ● Created and mainly developed by Shay Banon
      ● A company behind it: elasticsearch.com
      ● Regular releases
        – Latest release 0.20.2
  19. elasticsearch
      ● Schemaless
        – Uses defaults and automatic type guessing
        – Custom mappings may be defined if needed
      ● JSON oriented
      ● Multi tenancy
        – Multiple indexes per node, multiple types per index
      ● Designed to be distributed from the beginning
      ● Almost everything is available as API (including configuration)
      ● Wide range of administration APIs
  20. elasticsearch distributed terminology
      ● Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server)
      ● Cluster: one or more nodes with the same cluster name
      ● Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards.
      ● Index: a logical namespace which points to one or more shards
        – Your code won't deal directly with a shard, only with an index
        – But an index is composed of multiple Lucene indexes (one per shard)
  21. elasticsearch distributed terminology
      ● More shards:
        – improve indexing performance
        – increase data distribution (depends on # of nodes)
        – Watch out: each shard has a cost as well!
      ● More replicas:
        – improve failover
        – improve querying performance
  22. Transaction Log
      • Indexed docs are fully persistent
        • No need for a Lucene IndexWriter#commit
      • Managed using a transaction log / WAL
      • Full single node durability (kill -9)
      • Utilized when doing hot relocation of shards
      • Periodically “flushed” (calling IW#commit)
      • Durability and real time search together!
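The write-ahead idea behind the transaction log can be sketched in a few lines: every operation is appended to a durable log before it is applied in memory, a flush commits and truncates the log, and recovery replays whatever is left in the log. The class below is a toy model under those assumptions. `TranslogSketch` and its methods are hypothetical names, and plain lists stand in for files on disk.

```java
import java.util.ArrayList;
import java.util.List;

// Toy write-ahead log (illustrative only, not elasticsearch's translog).
final class TranslogSketch {
    // "On disk": both lists survive a simulated crash.
    private final List<String> committed = new ArrayList<>();
    private final List<String> translog = new ArrayList<>();
    // "In memory": lost on a crash.
    private final List<String> inMemory = new ArrayList<>();

    void index(String doc) {
        translog.add(doc); // log first
        inMemory.add(doc); // then apply
    }

    // Periodic flush: commit the in-memory state, then truncate the log.
    void flush() {
        committed.addAll(inMemory);
        inMemory.clear();
        translog.clear();
    }

    // Simulated kill -9 and restart: replay the log on top of the last commit.
    void crashAndRecover() {
        inMemory.clear();
        inMemory.addAll(translog);
    }

    List<String> allDocs() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(inMemory);
        return all;
    }

    int translogSize() {
        return translog.size();
    }

    public static void main(String[] args) {
        TranslogSketch shard = new TranslogSketch();
        shard.index("doc-1");
        shard.flush();
        shard.index("doc-2"); // not committed yet
        shard.crashAndRecover();
        System.out.println(shard.allDocs());
    }
}
```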
  23. Index - Shards & Replicas
      Node                  Node

      Client:
          curl -XPUT localhost:9200/hippo -d '{
            "index" : {
              "number_of_shards" : 2,
              "number_of_replicas" : 1
            }
          }'
  24. Index - Shards & Replicas
      Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)

      Client:
          curl -XPUT localhost:9200/hippo -d '{
            "index" : {
              "number_of_shards" : 2,
              "number_of_replicas" : 1
            }
          }'
  25. Indexing - 1
      • Automatic sharding, push replication

      Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)

      Client:
          curl -XPUT localhost:9200/hippo/users/1 -d '{
            "name" : {
              "first" : "Luca",
              "last" : "Cavanna"
            }
          }'
  26. Indexing - 2
      Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)

      Client:
          curl -XPUT localhost:9200/hippo/users/2 -d '{
            "name" : {
              "first" : "Jeroen",
              "last" : "Reijn"
            }
          }'
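Automatic sharding boils down to hashing the document id (or a custom routing value) and taking it modulo the number of primary shards. The sketch below shows only that idea: `ShardRoutingSketch` is a made-up name and `String.hashCode()` is a stand-in, since elasticsearch uses its own hash function.

```java
// Toy shard routing (illustrative only; the real hash function differs).
final class ShardRoutingSketch {

    // Picks the primary shard for a document id.
    static int shardFor(String docId, int numberOfShards) {
        // floorMod keeps the result in [0, numberOfShards) even for negative hashes.
        return Math.floorMod(docId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"1", "2"}) {
            System.out.println("users/" + id + " goes to shard " + shardFor(id, 2));
        }
    }
}
```

Because the shard number depends on the shard count, changing `number_of_shards` later would send existing ids to different shards, whereas the replica count can change freely.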
  27. Search - 1
      • Scatter / Gather search

      Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)

      Client:
          curl -XGET 'localhost:9200/hippo/_search?q=luca'
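Scatter/gather means the query goes to one copy of every shard, each shard returns its own best hits, and the node that received the request merges them into the global result. A minimal sketch of the merge step, with hypothetical names (`ScatterGatherSketch`, `Hit`) and shards modelled as plain lists of already-scored hits:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy scatter/gather (illustrative only, not elasticsearch's search code).
final class ScatterGatherSketch {

    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Returns the k best hits by descending score.
    static List<Hit> topK(List<Hit> hits, int k) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Scatter: each shard computes its local top k. Gather: merge and cut to k.
    static List<Hit> search(List<List<Hit>> shards, int k) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> shard : shards) {
            merged.addAll(topK(shard, k));
        }
        return topK(merged, k);
    }

    public static void main(String[] args) {
        List<Hit> shard0 = List.of(new Hit("a", 0.9), new Hit("b", 0.1));
        List<Hit> shard1 = List.of(new Hit("c", 0.5), new Hit("d", 0.7));
        for (Hit hit : search(List.of(shard0, shard1), 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```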
  28. Search - 2
      • Automatic balancing between replicas

      Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)

      Client:
          curl -XGET 'localhost:9200/hippo/_search?q=luca'
  29. Search - 3
      • Automatic failover

      Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)
      (one shard copy is marked "failure" in the original diagram)

      Client:
          curl -XGET 'localhost:9200/hippo/_search?q=luca'
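Balancing between replicas and failing over can both be pictured as one selection rule: rotate over the copies of a shard and skip any copy that is known to have failed. The following is a sketch of that rule only, with the hypothetical class `ReplicaSelectionSketch`; it makes no claim about how elasticsearch actually picks a copy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy replica selection (illustrative only).
final class ReplicaSelectionSketch {
    private final List<String> copies; // nodes holding a copy of one shard
    private final Set<String> failed = new HashSet<>();
    private int next = 0;

    ReplicaSelectionSketch(List<String> copies) {
        this.copies = new ArrayList<>(copies);
    }

    void markFailed(String node) {
        failed.add(node);
    }

    // Round-robin over the copies, skipping the ones marked as failed.
    String pick() {
        for (int attempts = 0; attempts < copies.size(); attempts++) {
            String candidate = copies.get(next);
            next = (next + 1) % copies.size();
            if (!failed.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no live copy of the shard");
    }

    public static void main(String[] args) {
        ReplicaSelectionSketch shard0 = new ReplicaSelectionSketch(List.of("node-1", "node-2"));
        System.out.println(shard0.pick());
        System.out.println(shard0.pick());
        shard0.markFailed("node-2");
        System.out.println(shard0.pick());
    }
}
```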
  30. Adding a node
      • “Hot” reallocation of shards to the new node

      Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)
  31. Adding a node
      • “Hot” reallocation of shards to the new node

      Node                  Node                  Node
      Shard 0 (primary)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)
  32. Adding a node
      • “Hot” reallocation of shards to the new node

      Node                  Node                  Node
      Shard 0 (primary)     Shard 0 (replica)     Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)
  33. Node failure
      Node                  Node                  Node
      Shard 0 (primary)                           Shard 0 (replica)
      Shard 1 (replica)     Shard 1 (primary)
  34. Node failure - 1
      • Replicas can automatically become primaries

      Node                  Node
                            Shard 0 (primary)
      Shard 1 (primary)
  35. Node failure - 2
      • Shards are automatically assigned and do “hot” recovery

      Node                  Node
      Shard 0 (replica)     Shard 0 (primary)
      Shard 1 (primary)     Shard 1 (replica)
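The three node-failure slides can be read as two steps: promote a surviving replica wherever the lost node held a primary, then allocate fresh replicas on the remaining nodes. The toy cluster state below walks through exactly that. `FailoverSketch` and the node names are hypothetical, and the real allocation logic is far more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy cluster state with replica promotion (illustrative only).
final class FailoverSketch {
    private final Map<Integer, String> primaries = new TreeMap<>();
    private final Map<Integer, List<String>> replicas = new TreeMap<>();

    void assign(int shard, String primaryNode, String... replicaNodes) {
        primaries.put(shard, primaryNode);
        replicas.put(shard, new ArrayList<>(Arrays.asList(replicaNodes)));
    }

    // Drops every copy on the failed node and promotes a replica if needed.
    void nodeFailed(String node) {
        for (Integer shard : new ArrayList<>(primaries.keySet())) {
            List<String> shardReplicas = replicas.get(shard);
            shardReplicas.remove(node);
            if (node.equals(primaries.get(shard))) {
                if (shardReplicas.isEmpty()) {
                    primaries.remove(shard); // no copy left: the shard is lost
                } else {
                    primaries.put(shard, shardReplicas.remove(0));
                }
            }
        }
    }

    // "Hot" recovery: a new replica is built on another node.
    void allocateReplica(int shard, String node) {
        replicas.get(shard).add(node);
    }

    String primaryOf(int shard) {
        return primaries.get(shard);
    }

    List<String> replicasOf(int shard) {
        return replicas.get(shard);
    }

    public static void main(String[] args) {
        FailoverSketch cluster = new FailoverSketch();
        cluster.assign(0, "node-1", "node-3");
        cluster.assign(1, "node-2", "node-1");
        cluster.nodeFailed("node-1");
        System.out.println("shard 0 primary: " + cluster.primaryOf(0));
        System.out.println("shard 1 primary: " + cluster.primaryOf(1));
    }
}
```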
  36. Dynamic Replicas
      Node                  Node                  Node
      Shard 0 (primary)     Shard 0 (replica)

      Client:
          curl -XPUT localhost:9200/hippo -d '{
            "index" : {
              "number_of_shards" : 1,
              "number_of_replicas" : 1
            }
          }'
  37. Dynamic Replicas
      Node                  Node                  Node
      Shard 0 (primary)     Shard 0 (replica)     Shard 0 (replica)

      Client (the replica count of an existing index is changed through the update settings endpoint):
          curl -XPUT localhost:9200/hippo/_settings -d '{
            "index" : {
              "number_of_replicas" : 2
            }
          }'
  38. Indexing (Push) - ElasticSearch
      • Documents added through push requests
      • Full JSON Object representation of Documents supported
        • Embedded objects
        • 1st class Parent / Child and Versioning
      • Near Realtime index refreshing available
      • Realtime get supported

          {
            "name": "Luca Cavanna",
            "location": {
              "city": "Amsterdam",
              "country": "The Netherlands"
            }
          }
  39. Indexing (Pull) - ElasticSearch
      • Data flows from sources using ‘Rivers’
      • Continues to add data as it ‘flows’
      • Can be added, removed, configured dynamically
      • Out-of-the-box support for CouchDB, Twitter (implemented by the es team)
      • Community implementations for DBs, other NoSQL and Solr
  40. Searching - ElasticSearch
      • Search request in Request Body
      • Powerful and extensible Query DSL
      • Separation of Query and Filters
      • Named Filters allowing tracking of which Documents matched which Filters
      • By default storing the source of each document (_source field)
      • Catch all feature enabled by default (_all field)
      • Sorting of results
      • Highlighting, Faceting, Boosting...and more
  41. Search Example - ElasticSearch
      Request:
          $ curl -XGET http://localhost:9200/hippo/users/_search -d '{
            "query" : {
              "term" : { "first_name" : "luca" }
            }
          }'
      Response:
          {
            "_shards" : {
              "total" : 5,
              "successful" : 5,
              "failed" : 0
            },
            "hits" : {
              "total" : 1,
              "hits" : [
                {
                  "_index" : "hippo",
                  "_type" : "users",
                  "_id" : "1",
                  "_source" : {
                    "first_name" : "Luca",
                    "last_name" : "Cavanna"
                  }
                }
              ]
            }
          }
  42. Thanks
      There would be a lot more to say:
      • Query DSL
      • Scripting module (pluggable implementation)
      • Percolator
      • Running it embedded
      Check them out yourself if you are interested!
      Questions?