• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Hippo meetup: enterprise search with Solr and elasticsearch
 

Hippo meetup: enterprise search with Solr and elasticsearch

on

  • 1,665 views

Presentation used at the Hippo meetup about enterprise search which took place in Amsterdam. The talk started with a general introduction about search with lucene, scaling with Solr and the ...

Presentation used at the Hippo meetup about enterprise search which took place in Amsterdam. The talk started with a general introduction about search with lucene, scaling with Solr and the distributed problems that elasticsearch successfully addresses.

Statistics

Views

Total Views
1,665
Views on SlideShare
1,511
Embed Views
154

Actions

Likes
5
Downloads
27
Comments
0

1 Embed 154

https://twitter.com 154

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Hippo meetup: enterprise search with Solr and elasticsearch Hippo meetup: enterprise search with Solr and elasticsearch Presentation Transcript

    • 15th January 2013 – Hippo meetupLuca CavannaSoftware developer & Search consultant at Trifork Amsterdamluca.cavanna@trifork.nl - @lucacavanna
    • Trifork (aka Jteam/Dutchworks/Orange11) Focus areas: – Big data & Search – Mobile – Custom solutions – Knowledge (GOTO Amsterdam) ● Hippo partner ● Hippo related search projects: – uva.nl – working on rijksoverheid.nl
    • Agenda● Search introduction – Lucene foundation – Why do we need Solr or elasticsearch?● Scaling with Solr● Elasticsearch distributed nature● Elasticsearch features
    • Apache Lucene● High-performance, full-featured text search engine library written entirely in Java● It indexes documents as collections of fields● A field is a string based key-value pair● What data structure does it use under the hood?
    • Inverted index term freq Posting list1 The old night keeper keeps the keep in the town and 1 6 big 2 232 In the big old house in the big old gown. dark 1 63 The house in the town had the big old keep did 1 4 grown 1 24 Where the old night keeper never did sleep. had 1 3 house 2 235 The night keeper keeps the keep in the night in 5 123566 And keeps in the dark and sleeps in the light. keep 3 135 keeper 3 145 keeps 3 156 light 1 6 never 1 4 night 3 145 old 4 1234 sleep 1 4 sleeps 1 6 the 6 123456 town 2 13 where 1 4
    • Inverted index● Indexing – Text analysis ● Tokenization, lowercasing and more● The inverted index can contain more data – Term offsets and more● The inverted index itself doesnt contain the text for displaying the search results
    • Indexing● Lucene writes indexes as segments● Segments are not modifiable: Write-Once● Each segment is a searchable mini index● Each segment contains – Inverted index – Stored fields – ...and more
    • Indexing: the commit operation● Documents are searchable only after a commit!● Commit gives also durability● The most expensive operation in Lucene!!!
    • Near-real-time search (since Lucene 2.9, exposed in Solr 4.0) ● With the Lucene near-real time API you dont need a commit to make new documents searchable ● Less expensive than commit ● Doesnt guarantee durability though ● Exposed as soft commit in Solr 4.0
    • Lucene code example – indexing data IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)); Directory directory = FSDirectory.open(new File("data")); IndexWriter writer = new IndexWriter(directory, config); Document document = new Document(); FieldType idFieldType = new FieldType(); idFieldType.setIndexed(true); idFieldType.setStored(true); idFieldType.setTokenized(false); document.add(new Field("id","id-1", idFieldType)); FieldType titleFieldType = new FieldType(); titleFieldType.setIndexed(true); titleFieldType.setStored(true); document.add(new Field("title","This is the title", titleFieldType)); FieldType descriptionFieldType = new FieldType(); descriptionFieldType.setIndexed(true); document.add(new Field("description","This is the description", descriptionFieldType)); writer.addDocument(document); writer.close();
    • Lucene code example – querying and showing results QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title", new StandardAnalyzer(Version.LUCENE_40)); Query query = queryParser.parse(queryAsString); Directory directory = FSDirectory.open(new File("data")); IndexReader indexReader = DirectoryReader.open(directory); IndexSearcher indexSearcher = new IndexSearcher(indexReader); TopDocs topDocs = indexSearcher.search(query, 10); System.out.println("Total hits: " + topDocs.totalHits); for (ScoreDoc hit : topDocs.scoreDocs) { Document document = indexSearcher.doc(hit.doc); for (IndexableField field : document) { System.out.println(field.name() + ": " + field.stringValue()); } }
    • Whats missing? ● A common way to represent documents ● Interface to send document to (HTTP) ● A way to represent queries ● Interface to send queries to (HTTP) ● Configuration ● Caching ● Distributed infrastructure ● And more....
    • Enterprise search servers
    • Scaling – why? ‣ The more concurrent searches you run, the slower they get ‣ Indexing and searching on the same machine will substantially harm search performance ‣ Segment merging may be CPU/IO intensive operations ‣ Disk cache invalidation ‣ Fail over
    • Solr replication example
    • Solr replication (pull approach) • Master-slave based solution • Single machine for indexing data (master) • Multiple machines for querying (slaves) • Master is not aware of the slaves • Slave is aware of the master • Load balancer responsible for balancing the query requests • What about real-time search? No way!
    • SolrCloud • A set of new distributed capabilities in Solr • uses Apache Zookeeper as a system of record for the cluster state, for central configuration, and for leader election • Whatever server (shard) you send data to: • the documents get distributed over the shards • A shard can be a leader or a replica and contains a subset of the data • Easily scale up adding new Solr nodes
    • elasticsearch● Distributed search engine built on top of Lucene● Apache 2 license● Written in Java● RESTful● Created and mainly developed by Shay Banon● A company behind it: elasticsearch.com● Regular releases – Latest release 0.20.2
    • elasticsearch● Schemaless – Uses defaults and automatic type guessing – Custom mappings may be defined if needed● JSON oriented● Multi tenancy – Multiple indexes per node, multiple types per index● Designed to be distributed from the beginning● Almost everything is available as API (including configuration)● Wide range of administration APIs
    • elasticsearch distributed terminology● Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server)● Cluster: one or more nodes with the same cluster name● Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards.● Index: a logical namespace which points to one or more shards – Your code wont deal directly with a shard, only with an index – But an index is composed of more lucene indexes (one per shard)
    • elasticsearch distributed terminology● More shards: – improve indexing performance – increase data distribution (depends on # of nodes) – Watch out: each shard has a cost as well!● More replicas: – increase failover – improve querying performance
    • Transaction Log • Indexed docs are fully persistent • No need for a Lucene IndexWriter#commit • Managed using a transaction log / WAL • Full single node durability (kill dash 9) • Utilized when doing hot relocation of shards • Periodically “flushed” (calling IW#commit) • Durability and real time search together!
    • Index - Shards & Replicas Node Node curl -XPUT localhost:9200/hippo -d { "index" : { Client "number_of_shards" : 2, "number_of_replicas" : 1 } }
    • Index - Shards & Replicas Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary) curl -XPUT localhost:9200/hippo -d { "index" : { Client "number_of_shards" : 2, "number_of_replicas" : 1 } }
    • Indexing - 1 • Automatic sharding, push replication Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary) curl -XPUT localhost:9200/hippo/users/1 -d { "name" : { "first" : "Luca", Client "last" : "Cavanna" } }
    • Indexing - 2 Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary) curl -XPUT localhost:9200/hippo/users/2 -d { "name" : { Client "first" : "Jeroen", "last" : "Reijn" } }
    • Search - 1 • Scatter / Gather search Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary) Clientcurl -XPUT localhost:9200/hippo/_search?q=luca
    • Search - 2 • Automatic balancing between replicas Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary) Clientcurl -XPUT localhost:9200/hippo/_search?q=luca
    • Search - 3 • Automatic failover Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) failure (primary) Client curl -XPUT localhost:9200/hippo/_search?q=luca
    • Adding a node • “Hot” reallocation of shards to the new node Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary)
    • Adding a node • “Hot” reallocation of shards to the new node Node Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary)
    • Adding a node • “Hot” reallocation of shards to the new node Node Node Node Shard 0 Shard 0 Shard 0 (primary) (replica) (replica) Shard 1 Shard 1 (replica) (primary)
    • Node failure Node Node Node Shard 0 Shard 0 (primary) (replica) Shard 1 Shard 1 (replica) (primary)
    • Node failure - 1 • Replicas can automatically become primaries Node Node Shard 0 (primary) Shard 1 (primary)
    • Node failure - 2 • Shards are automatically assigned and do “hot” recovery Node Node Shard 0 Shard 0 (replica) (primary) Shard 1 Shard 1 (primary) (replica)
    • Dynamic Replicas Node Node Node Shard 0 Shard 0 (primary) (replica) curl -XPUT localhost:9200/hippo -d { "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 Client } }
    • Dynamic Replicas Node Node Node Shard 0 Shard 0 Shard 0 (primary) (replica) (replica) curl -XPUT localhost:9200/hippo -d { "index" : { Client "number_of_replicas" : 2 } }
    • Indexing (Push) - ElasticSearch • Documents added through push requests • Full JSON Object representation of Documents supported • Embedded objects • 1st class Parent / Child and Versioning • Near Realtime index refreshing available • Realtime get supported { "name": "Luca Cavanna", "location": { "city": "Amsterdam", "country": "The Netherlands" } }
    • Indexing (Pull) - ElasticSearch • Data flows from sources using ‘Rivers’ • Continues to add data as it ‘flows’ • Can be added, removed, configured dynamically • Out-of-the-box support for CouchDB, Twitter (implemented by the es team) • Community implementations for DBs, other NoSQL and Solr River River
    • Searching - ElasticSearch • Search request in Request Body • Powerful and extensible Query DSL • Separation of Query and Filters • Named Filters allowing tracking of which Documents matched which Filters • By default storing the source of each document (_source field) • Catch all feature enabled by default (_all field) • Sorting of results • Highlighting, Faceting, Boosting...and more
    • Search Example - ElasticSearch$ curl -XGET http://localhost:9200/hippo/users/_search -d { "query" : { { "term" : { "first_name" : "luca" } "_shards": { } "total" : 5,} "successful" : 5, "failed" : 0 }, "hits": { "total" : 1, "hits" : [ { "_index" : "hippo", "_type" : "users", "_id" : "1", "_source" : { "first_name" : "Luca", "last_name" : "Cavanna" } } ] } }
    • Thanks There would be a lot more to say: • Query DSL • Scripting module (pluggable implementation) • Percolator • Running it embedded Check them out yourself if you are interested! Questions?