Hippo meetup: enterprise search with Solr and elasticsearch
Presentation used at the Hippo meetup on enterprise search, held in Amsterdam. The talk starts with a general introduction to search with Lucene, then covers scaling with Solr and the distributed-search problems that elasticsearch addresses.

    Presentation Transcript

    • 15th January 2013 – Hippo meetup
        Luca Cavanna
        Software developer & Search consultant at Trifork Amsterdam
        luca.cavanna@trifork.nl - @lucacavanna
    • Trifork (aka Jteam/Dutchworks/Orange11)
        ● Focus areas:
            – Big data & Search
            – Mobile
            – Custom solutions
            – Knowledge (GOTO Amsterdam)
        ● Hippo partner
        ● Hippo related search projects:
            – uva.nl
            – working on rijksoverheid.nl
    • Agenda
        ● Search introduction
            – Lucene foundation
            – Why do we need Solr or elasticsearch?
        ● Scaling with Solr
        ● Elasticsearch distributed nature
        ● Elasticsearch features
    • Apache Lucene
        ● High-performance, full-featured text search engine library written entirely in Java
        ● It indexes documents as collections of fields
        ● A field is a string-based key-value pair
        ● What data structure does it use under the hood?
    • Inverted index
        Documents:
            1  The old night keeper keeps the keep in the town
            2  In the big old house in the big old gown.
            3  The house in the town had the big old keep
            4  Where the old night keeper never did sleep.
            5  The night keeper keeps the keep in the night
            6  And keeps in the dark and sleeps in the light.
        term     freq  posting list
        and      1     6
        big      2     2 3
        dark     1     6
        did      1     4
        gown     1     2
        had      1     3
        house    2     2 3
        in       5     1 2 3 5 6
        keep     3     1 3 5
        keeper   3     1 4 5
        keeps    3     1 5 6
        light    1     6
        never    1     4
        night    3     1 4 5
        old      4     1 2 3 4
        sleep    1     4
        sleeps   1     6
        the      6     1 2 3 4 5 6
        town     2     1 3
        where    1     4
    • Inverted index
        ● Indexing
            – Text analysis
                ● Tokenization, lowercasing and more
        ● The inverted index can contain more data
            – Term offsets and more
        ● The inverted index itself doesn't contain the text for displaying the search results
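
The term-to-documents mapping on the inverted index slides can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not of how Lucene stores postings: the class `InvertedIndexSketch`, its methods and the simple "lowercase and split on non-letters" analysis are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (hypothetical class): maps each term to the sorted list
// of ids of the documents that contain it.
class InvertedIndexSketch {

    // A TreeMap keeps the terms sorted, like the term column on the slide.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Simplified "analysis": lowercase, then split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Adds a document. Ids must be added in increasing order so posting lists stay sorted.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Returns the posting list for a term, or an empty list if the term is unknown.
    List<Integer> postingsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptyList());
    }

    // Builds the index from the six example lines used on the slide.
    static InvertedIndexSketch ofSlideDocuments() {
        String[] docs = {
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light."
        };
        InvertedIndexSketch index = new InvertedIndexSketch();
        for (int i = 0; i < docs.length; i++) {
            index.add(i + 1, docs[i]); // document ids start at 1, as on the slide
        }
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = ofSlideDocuments();
        System.out.println("keeps -> " + index.postingsFor("keeps"));
    }
}
```

A lookup is then a single map access, which is why a term query does not need to scan the documents themselves.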
    • Indexing
        ● Lucene writes indexes as segments
        ● Segments are not modifiable: write-once
        ● Each segment is a searchable mini index
        ● Each segment contains
            – Inverted index
            – Stored fields
            – ...and more
    • Indexing: the commit operation
        ● Documents are searchable only after a commit!
        ● Commit also gives durability
        ● The most expensive operation in Lucene!
    • Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)
        ● With the Lucene near-real-time API you don't need a commit to make new documents searchable
        ● Less expensive than commit
        ● Doesn't guarantee durability though
        ● Exposed as soft commit in Solr 4.0
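
The difference between a commit and a near-real-time reopen can be pictured with a toy model. The sketch below does not use Lucene at all: `CommitVisibilitySketch` and its three lists are hypothetical stand-ins for the indexing buffer, the searchable view and the data that would survive a crash, and it follows the simplified picture given on the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not Lucene code) of commit versus near-real-time reopen.
class CommitVisibilitySketch {
    private final List<String> buffered = new ArrayList<>();   // added, not yet visible
    private final List<String> searchable = new ArrayList<>(); // visible to searches
    private final List<String> durable = new ArrayList<>();    // would survive a crash

    void addDocument(String doc) {
        buffered.add(doc);
    }

    // Near-real-time reopen: makes buffered documents searchable, but not durable.
    void refresh() {
        searchable.addAll(buffered);
        buffered.clear();
    }

    // Commit: makes everything searchable and durable (the expensive step in real Lucene).
    void commit() {
        refresh();
        durable.clear();
        durable.addAll(searchable);
    }

    // Simulated crash and restart: only committed documents come back.
    void crashAndRestart() {
        buffered.clear();
        searchable.clear();
        searchable.addAll(durable);
    }

    boolean isSearchable(String doc) {
        return searchable.contains(doc);
    }
}
```

In this model a document that was only refreshed disappears after `crashAndRestart()`, while a committed one stays searchable, which is the trade-off the slide describes.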
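    • Lucene code example – indexing data
        import java.io.File;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.*;
        import org.apache.lucene.index.*;
        import org.apache.lucene.store.*;
        import org.apache.lucene.util.Version;

        public class IndexingExample {
            public static void main(String[] args) throws Exception {
                IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                        new StandardAnalyzer(Version.LUCENE_40));
                Directory directory = FSDirectory.open(new File("data"));
                IndexWriter writer = new IndexWriter(directory, config);
                Document document = new Document();

                // id: indexed and stored, but not tokenized (matched as a single term)
                FieldType idFieldType = new FieldType();
                idFieldType.setIndexed(true);
                idFieldType.setStored(true);
                idFieldType.setTokenized(false);
                document.add(new Field("id", "id-1", idFieldType));

                // title: indexed (analyzed) and stored, so it can be shown in results
                FieldType titleFieldType = new FieldType();
                titleFieldType.setIndexed(true);
                titleFieldType.setStored(true);
                document.add(new Field("title", "This is the title", titleFieldType));

                // description: indexed only, searchable but not retrievable
                FieldType descriptionFieldType = new FieldType();
                descriptionFieldType.setIndexed(true);
                document.add(new Field("description", "This is the description", descriptionFieldType));

                writer.addDocument(document);
                writer.close(); // close() commits the pending changes
            }
        }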
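    • Lucene code example – querying and showing results
        import java.io.File;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.*;
        import org.apache.lucene.index.*;
        import org.apache.lucene.queryparser.classic.QueryParser;
        import org.apache.lucene.search.*;
        import org.apache.lucene.store.*;
        import org.apache.lucene.util.Version;

        public class SearchExample {
            public static void main(String[] args) throws Exception {
                String queryAsString = args[0]; // the user's query, e.g. "title"
                QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
                        new StandardAnalyzer(Version.LUCENE_40));
                Query query = queryParser.parse(queryAsString);
                Directory directory = FSDirectory.open(new File("data"));
                IndexReader indexReader = DirectoryReader.open(directory);
                IndexSearcher indexSearcher = new IndexSearcher(indexReader);
                TopDocs topDocs = indexSearcher.search(query, 10); // top 10 hits
                System.out.println("Total hits: " + topDocs.totalHits);
                for (ScoreDoc hit : topDocs.scoreDocs) {
                    // Only stored fields are returned with the document
                    Document document = indexSearcher.doc(hit.doc);
                    for (IndexableField field : document) {
                        System.out.println(field.name() + ": " + field.stringValue());
                    }
                }
                indexReader.close();
            }
        }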
    • What's missing?
        ● A common way to represent documents
        ● An interface to send documents to (HTTP)
        ● A way to represent queries
        ● An interface to send queries to (HTTP)
        ● Configuration
        ● Caching
        ● Distributed infrastructure
        ● And more...
    • Enterprise search servers
    • Scaling – why?
        ‣ The more concurrent searches you run, the slower they get
        ‣ Indexing and searching on the same machine will substantially harm search performance
        ‣ Segment merging can be a CPU- and I/O-intensive operation
        ‣ Disk cache invalidation
        ‣ Failover
    • Solr replication example
    • Solr replication (pull approach)
        • Master-slave based solution
        • Single machine for indexing data (master)
        • Multiple machines for querying (slaves)
        • Master is not aware of the slaves
        • Slave is aware of the master
        • Load balancer responsible for balancing the query requests
        • What about real-time search? No way!
    • SolrCloud
        • A set of new distributed capabilities in Solr
        • Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election
        • Whichever server (shard) you send data to, the documents get distributed over the shards
        • A shard can be a leader or a replica and contains a subset of the data
        • Easily scale up by adding new Solr nodes
    • elasticsearch
        ● Distributed search engine built on top of Lucene
        ● Apache 2 license
        ● Written in Java
        ● RESTful
        ● Created and mainly developed by Shay Banon
        ● A company behind it: elasticsearch.com
        ● Regular releases
            – Latest release 0.20.2
    • elasticsearch
        ● Schemaless
            – Uses defaults and automatic type guessing
            – Custom mappings may be defined if needed
        ● JSON oriented
        ● Multi-tenancy
            – Multiple indexes per node, multiple types per index
        ● Designed to be distributed from the beginning
        ● Almost everything is available as API (including configuration)
        ● Wide range of administration APIs
    • elasticsearch distributed terminology
        ● Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server)
        ● Cluster: one or more nodes with the same cluster name
        ● Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards.
        ● Index: a logical namespace which points to one or more shards
            – Your code won't deal directly with a shard, only with an index
            – But an index is composed of multiple Lucene indexes (one per shard)
    • elasticsearch distributed terminology
        ● More shards:
            – improve indexing performance
            – increase data distribution (depends on # of nodes)
            – Watch out: each shard has a cost as well!
        ● More replicas:
            – increase failover
            – improve querying performance
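
How a document ends up in one particular shard can be sketched as a hash of its id modulo the number of shards. The slides do not show the routing function, so the code below is only an assumption-laden illustration: `ShardRoutingSketch` is a hypothetical class and `String.hashCode()` is a stand-in for whatever hash elasticsearch uses internally.

```java
// Hypothetical sketch of hash-based document routing; not elasticsearch code.
class ShardRoutingSketch {

    // Picks a shard for a document id. String.hashCode() is a stand-in hash (assumption).
    static int shardFor(String documentId, int numberOfShards) {
        if (numberOfShards <= 0) {
            throw new IllegalArgumentException("numberOfShards must be positive");
        }
        // floorMod keeps the result in [0, numberOfShards) even for negative hash codes.
        return Math.floorMod(documentId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"1", "2"}) {
            System.out.println("document " + id + " -> shard " + shardFor(id, 2));
        }
    }
}
```

Because the same id always maps to the same shard, a lookup by id only has to ask one shard, while a search still has to ask every shard of the index.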
    • Transaction Log
        • Indexed docs are fully persistent
        • No need for a Lucene IndexWriter#commit
        • Managed using a transaction log / WAL (write-ahead log)
        • Full single node durability (survives a kill -9)
        • Utilized when doing hot relocation of shards
        • Periodically "flushed" (calling IndexWriter#commit)
        • Durability and real-time search together!
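
The write-ahead idea behind the transaction log can be sketched as follows. This is a toy model under stated assumptions, not elasticsearch code: `TranslogSketch` is hypothetical, and plain in-memory lists stand in for the log file on disk, the committed Lucene segments and the uncommitted indexing buffer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy write-ahead log (hypothetical): every operation is logged before it is applied,
// so uncommitted work can be replayed after a crash.
class TranslogSketch {
    private final List<String> translog = new ArrayList<>();  // stands in for the log file on disk
    private final List<String> committed = new ArrayList<>(); // stands in for committed segments
    private final List<String> inMemory = new ArrayList<>();  // uncommitted buffer, lost on crash

    void index(String doc) {
        translog.add(doc); // 1. record the operation in the log first
        inMemory.add(doc); // 2. then apply it to the in-memory index
    }

    // Periodic flush: commit the buffer, after which the log can be truncated.
    void flush() {
        committed.addAll(inMemory);
        inMemory.clear();
        translog.clear();
    }

    // Simulated kill -9 followed by recovery: the buffer is lost, the log is replayed.
    void crashAndRecover() {
        inMemory.clear();
        inMemory.addAll(translog);
    }

    List<String> allDocuments() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(inMemory);
        return all;
    }
}
```

The expensive commit only happens in `flush()`, yet no acknowledged document is lost in between, which is how durability and frequent refreshes can coexist.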
    • Index - Shards & Replicas
        curl -XPUT localhost:9200/hippo -d '{
          "index" : {
            "number_of_shards" : 2,
            "number_of_replicas" : 1
          }
        }'
        [Diagram: a client sends the request to a two-node cluster. One node holds Shard 0 (primary) and Shard 1 (replica); the other holds Shard 0 (replica) and Shard 1 (primary).]
    • Indexing - 1
        • Automatic sharding, push replication
        curl -XPUT localhost:9200/hippo/users/1 -d '{
          "name" : {
            "first" : "Luca",
            "last" : "Cavanna"
          }
        }'
        [Diagram: client and two nodes; each node holds the primary of one shard and the replica of the other.]
    • Indexing - 2
        curl -XPUT localhost:9200/hippo/users/2 -d '{
          "name" : {
            "first" : "Jeroen",
            "last" : "Reijn"
          }
        }'
        [Diagram: client and two nodes; each node holds the primary of one shard and the replica of the other.]
    • Search - 1
        • Scatter / Gather search
        curl -XGET 'localhost:9200/hippo/_search?q=luca'
        [Diagram: client and two nodes; each node holds the primary of one shard and the replica of the other.]
    • Search - 2
        • Automatic balancing between replicas
        curl -XGET 'localhost:9200/hippo/_search?q=luca'
        [Diagram: client and two nodes; each node holds the primary of one shard and the replica of the other.]
    • Search - 3
        • Automatic failover
        curl -XGET 'localhost:9200/hippo/_search?q=luca'
        [Diagram: same two-node layout, with one shard copy marked as failed.]
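
The "gather" half of a scatter/gather search boils down to merging the per-shard results into one global ranking. The sketch below assumes each shard has already returned its own top hits. `ScatterGatherSketch`, `Hit` and `gather` are hypothetical names, and real implementations also deal with paging, ties and fetching the documents afterwards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the gather phase: merge per-shard hits into a global top-k.
class ScatterGatherSketch {

    // A scored hit as one shard would return it.
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Merges the hits of all shards and keeps the k best by descending score (k >= 0).
    static List<Hit> gather(List<List<Hit>> perShardHits, int k) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) {
            merged.addAll(shardHits);
        }
        merged.sort(Comparator.comparingDouble((Hit hit) -> hit.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
    }
}
```

Since every shard answers independently, it does not matter whether a primary or a replica copy serves the request, which is what makes balancing and failover between copies possible.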
    • Adding a node
        • "Hot" reallocation of shards to the new node
        [Diagram, three frames: a two-node cluster with Shard 0 and Shard 1 (one primary and one replica each); a third, empty node joins; a Shard 0 replica then appears on the new node.]
    • Node failure
        [Diagram: three nodes; Shard 0 and Shard 1 each have one primary and one replica.]
    • Node failure - 1
        • Replicas can automatically become primaries
        [Diagram: two remaining nodes, holding Shard 0 (primary) and Shard 1 (primary).]
    • Node failure - 2
        • Shards are automatically assigned and do "hot" recovery
        [Diagram: two nodes; one holds Shard 0 (replica) and Shard 1 (primary), the other holds Shard 0 (primary) and Shard 1 (replica).]
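
The promotion of a replica when its primary's node disappears can be pictured with a small bookkeeping sketch. It is a toy model, not the elasticsearch allocation logic: `ReplicaPromotionSketch` is hypothetical, and "the first node in the list is the primary" is a convention chosen only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks which nodes hold a copy of each shard and
// promotes a replica when the node holding the primary fails.
class ReplicaPromotionSketch {
    // shard id -> nodes holding a copy; by convention here, index 0 is the primary
    private final Map<Integer, List<String>> copies = new HashMap<>();

    void assign(int shard, String primaryNode, String... replicaNodes) {
        List<String> nodes = new ArrayList<>();
        nodes.add(primaryNode);
        for (String replica : replicaNodes) {
            nodes.add(replica);
        }
        copies.put(shard, nodes);
    }

    // Drops all copies on the failed node. If the primary was there,
    // the next replica moves to index 0 and thereby becomes the primary.
    void nodeFailed(String node) {
        for (List<String> nodes : copies.values()) {
            nodes.remove(node);
        }
    }

    // Returns the node holding the primary, or null if no copy is left.
    String primaryOf(int shard) {
        List<String> nodes = copies.get(shard);
        return (nodes == null || nodes.isEmpty()) ? null : nodes.get(0);
    }
}
```

After the promotion the cluster would still need to allocate fresh replicas on other nodes, which is the "hot" recovery step shown on the last slide of this sequence.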
    • Dynamic Replicas
        curl -XPUT localhost:9200/hippo -d '{
          "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
          }
        }'
        [Diagram: client and three nodes; two of them hold Shard 0 (primary) and Shard 0 (replica).]
    • Dynamic Replicas
        curl -XPUT localhost:9200/hippo/_settings -d '{
          "index" : {
            "number_of_replicas" : 2
          }
        }'
        [Diagram: client and three nodes; Shard 0 (primary), Shard 0 (replica) and a second Shard 0 (replica), one per node.]
    • Indexing (Push) - ElasticSearch
        • Documents added through push requests
        • Full JSON Object representation of Documents supported
        • Embedded objects
        • 1st class Parent / Child and Versioning
        • Near Realtime index refreshing available
        • Realtime get supported
        {
          "name": "Luca Cavanna",
          "location": {
            "city": "Amsterdam",
            "country": "The Netherlands"
          }
        }
    • Indexing (Pull) - ElasticSearch
        • Data flows from sources using 'Rivers'
        • Continues to add data as it 'flows'
        • Can be added, removed, configured dynamically
        • Out-of-the-box support for CouchDB, Twitter (implemented by the es team)
        • Community implementations for DBs, other NoSQL and Solr
    • Searching - ElasticSearch
        • Search request in Request Body
        • Powerful and extensible Query DSL
        • Separation of Query and Filters
        • Named Filters allowing tracking of which Documents matched which Filters
        • By default storing the source of each document (_source field)
        • Catch all feature enabled by default (_all field)
        • Sorting of results
        • Highlighting, Faceting, Boosting...and more
    • Search Example - ElasticSearch
        Request:
            $ curl -XGET http://localhost:9200/hippo/users/_search -d '{
                "query" : {
                  "term" : { "first_name" : "luca" }
                }
              }'
        Response:
            {
              "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0
              },
              "hits" : {
                "total" : 1,
                "hits" : [
                  {
                    "_index" : "hippo",
                    "_type" : "users",
                    "_id" : "1",
                    "_source" : {
                      "first_name" : "Luca",
                      "last_name" : "Cavanna"
                    }
                  }
                ]
              }
            }
    • Thanks
        There would be a lot more to say:
        • Query DSL
        • Scripting module (pluggable implementation)
        • Percolator
        • Running it embedded
        Check them out yourself if you are interested!
        Questions?