Elasticsearch & "PeopleSearch"
 

Presented on 10/11/12 at the Boston Elasticsearch meetup held at the Microsoft New England Research & Development Center. This talk gave a very high-level overview of Elasticsearch to newcomers and explained why ES is a good fit for Traackr's use case.

  • Important to differentiate from Solr Cloud:
    - Solr Cloud is in trunk but not quite out yet; it will come out with Lucene 4.0.
    - Solr Cloud uses Zookeeper to coordinate the cluster; in ES, coordination is built into every node (there is an issue with nodes losing connectivity with the cluster and electing themselves as master; ES can use ZK as a plugin).
    - ES uses multicast, so if the network does not support it, you need to switch to unicast.
    - Both support distributed NRT.
    - Refer to http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
  • Talk about how ES differs from Solr in that it detects the fields based on the content, whereas Solr has the wildcard definitions.
    - Solr schema.xml vs. the ES REST API-driven JSON DSL config, which can be dynamic.
  • If the curl statements get snoozes, show the real app demo.
  • Percolators? They don't trigger when a record becomes available for searching (Igor's comment).

Presentation Transcript

  • Elasticsearch & “PeopleSearch”: Leveraging Elasticsearch @ Traackr
  • About Traackr
    - A search engine
    - A people discovery engine
    - Subscription-based
    - Migrated from Solr to Elasticsearch in Q3 ’12
  • About me
    - 14+ years of experience building full-stack web software systems, with a past focus on e-commerce and publishing
    - VP Engineering @ Traackr, responsible for building engineering capability to enable Traackr's growth goals
    - about.me/george-stathis
  • About this talk
    - Short intro to Elasticsearch
    - How search is done @ Traackr
    - Why Elasticsearch was the right fit
  • About Elasticsearch
    - Lucene under the covers
    - Distributed from the ground up
    - Full support for Lucene Near Real-Time search
    - Native JSON Query DSL
    - Automatic schema detection (“schema-less”)
    - Supports document types
  • Elasticsearch - Distributed
    - Indices are broken into shards
    - Shards have 0 or more replicas
    - Data nodes hold one or more shards
    - Data nodes can coordinate/forward requests
    - Automatic routing & rebalancing, but overrides are available
    - Default discovery mode is multicast (zen discovery); unicast is available for multicast-unfriendly networks; an AWS plug-in is available; a Zookeeper plug-in is available, made possible by Sonian
    - YouTube demo: http://youtu.be/l4ReamjCxHo
    - Source: https://confluence.oceanobservatories.org/display/CIDev/Indexing+with+ElasticSearch
  • Elasticsearch - NRT
    - Uses Lucene’s IndexReader.open(IndexWriter writer, boolean applyAllDeletes)
    - Opens a near real time IndexReader from the IndexWriter
    - By default, refreshes the reader and makes new updates available for search every second
  • Elasticsearch - JSON DSL
    # Query String
    curl localhost:9200/test/_search?pretty=1 -d '{
        "query" : {
            "query_string" : { "query" : "tags:scala" }
        }
    }'
    # Range
    curl localhost:9200/test/_search?pretty=1 -d '{
        "query" : {
            "range" : { "price" : { "gt" : 15 } }
        }
    }'
    Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
  • Elasticsearch - JSON DSL (cont)
    # Filtered Query
    # Filters are similar to queries, except they do no scoring
    # and are easily cached.
    # There are many filter types as well, including range and term
    curl localhost:9200/test/_search?pretty=1 -d '{
        "query" : {
            "filtered" : {
                "query" : {
                    "query_string" : { "query" : "tags:scala" }
                },
                "filter" : {
                    "range" : { "price" : { "gt" : 15 } }
                }
            }
        }
    }'
    Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
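As a rough sketch, the filtered query on this slide can also be assembled programmatically before it is sent to the `_search` endpoint. The helper name `filtered_query` and its parameters are hypothetical; only the JSON shape comes from the slide. Later Elasticsearch releases dropped the `filtered` query in favor of a `bool` query with a `filter` clause, so this shape applies to the version presented in the talk.

```python
import json


def filtered_query(query_text, field, greater_than):
    """Build the body of a "filtered" query: a scored query_string query
    narrowed by a non-scoring range filter (hypothetical helper)."""
    return {
        "query": {
            "filtered": {
                "query": {"query_string": {"query": query_text}},
                "filter": {"range": {field: {"gt": greater_than}}},
            }
        }
    }


body = filtered_query("tags:scala", "price", 15)
# The serialized body is what would follow `-d` in the curl call.
print(json.dumps(body, indent=2))
```

Keeping the scored part and the filter part as separate arguments mirrors the point made on the slide: the filter narrows the result set without contributing to relevance.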
  • Elasticsearch - Schema
    - Dynamic object mapping with intelligent defaults
    - Can be turned off
    - Can be overridden globally or on a per-index basis:
      {
          "_default_" : {
              "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"]
          }
      }
  • Elasticsearch Demo
  • Search @ Traackr
    - Answering authors by searching posts
  • Traackr search requirements
    - Posts are coming in at about 1 million a day
    - Each author averages several hundred posts
    - Posts need to be available for search immediately
    - Relevance and sorting have to be rolled up/grouped at the author level
  • Early approach to search
    - Search posts
    - Group matched posts by author
    - For each grouped set, add up the Lucene scores of the posts
    - Combine the sum of post scores with author social and website metrics for the final group score
    - Sort groups (i.e. authors)
    - Try to do this quickly!
    - Callout: performance hit
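The early approach can be sketched in a few lines to show how much work lands in the application layer. This is a minimal illustration, not Traackr's code: the hit tuples, the `author_metric` values, and the choice to multiply the score sum by the metric are all assumptions, since the talk does not say how the two were combined.

```python
from collections import defaultdict

# Hypothetical search hits: (post_id, author_id, lucene_score).
hits = [
    ("p1", "a1", 1.5),
    ("p2", "a2", 2.0),
    ("p3", "a1", 1.0),
    ("p4", "a2", 0.25),
    ("p5", "a3", 3.0),
]

# Hypothetical per-author metric (stand-in for social and website metrics).
author_metric = {"a1": 2.0, "a2": 1.0, "a3": 0.5}


def rank_authors(hits, author_metric):
    """Group post hits by author, sum their scores, weight the sum by the
    author metric, and return (author_id, final_score) pairs, best first."""
    score_sum = defaultdict(float)
    for _post_id, author_id, score in hits:
        score_sum[author_id] += score
    # Assumed combination: a simple product of score sum and author metric.
    final = {a: s * author_metric.get(a, 1.0) for a, s in score_sum.items()}
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


ranking = rank_authors(hits, author_metric)
print(ranking)
```

Every matched post has to be pulled out of the engine before the grouping and sorting can start, which is where the cost grows with the number of hits.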
  • Room for improvement
    - How can we avoid the “late binding” performance penalty?
      - Get the search engine to do as much of the scoring as possible
      - Store all data needed for displaying results in the search engine (i.e. no db calls)
  • Alternatives - Denormalize?
    - Index authors and their posts together under one document.
    - Pros
      - Straightforward
      - Built-in post relevance sum
    - Cons
      - Each profile change would trigger the reindexing of all the author’s posts
      - Each new post would trigger the reindexing of all the author’s posts + profile
      - A non-starter for real-time search
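A back-of-the-envelope calculation suggests why the reindexing cost rules this option out. The daily post volume is the figure quoted earlier in the talk; the value of 300 posts per author is an assumption standing in for "several hundred", so the result is only an order-of-magnitude estimate.

```python
# Figure from the talk: about 1 million new posts per day.
posts_per_day = 1_000_000
# Assumption: "several hundred" posts per author is taken as 300 here.
avg_posts_per_author = 300

# Denormalized: one document per author, so every new post rewrites that
# author's whole document, i.e. roughly avg_posts_per_author posts.
denormalized_post_writes = posts_per_day * avg_posts_per_author

# Separate documents: every new post is indexed exactly once.
separate_post_writes = posts_per_day

amplification = denormalized_post_writes // separate_post_writes
print(f"{denormalized_post_writes:,} vs {separate_post_writes:,} "
      f"post writes per day ({amplification}x)")
```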
  • Alternatives - Solr Join?
    - “In many cases, documents have relationships between them and it is too expensive to denormalize them. Thus, a join operation is needed. Preserving the document relationship allows documents to be updated independently without having to reindex large numbers of denormalized documents.” - http://wiki.apache.org/solr/Join
    - E.g. find all post docs matching "search engines", then join them against author docs and return that list of authors:
      ...?q={!join+from=author_id+to=id}search+engines
    - Pros
      - Addresses the issue of loading author profiles from db
    - Cons
      - Does not preserve the post relevance scores -> non-starter
      - Submit patch to get scores? Wouldn’t touch SOLR-2272 with a ten-foot pole.
  • Alternatives - Solr Grouping?
    - Groups results by a given document field (e.g. author_id): http://wiki.apache.org/solr/FieldCollapsing
      ...&q=real+time+search&group=true&group.field=author_id
      [...]
      "grouped":{
        "author_id":{
          "matches":2,
          "groups":[{
              "groupValue":"04e3bc5078344ad1a065815f0bb9f14d",
              "doclist":{"maxScore":3.456747,"numFound":1,"start":0,"docs":[
                  {
                    "id":"5d09240934eb331bada1ff3f0b773153",
                    "title":"Refresh API",
                    "url":"http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html",
                    "author_id":"04e3bc5078344ad1a065815f0bb9f14d"}]
              }},
            {
              "groupValue":"9e4f40e1aa82f2e1a9368748d1268082",
              "doclist":{"maxScore":2.456747,"numFound":2,"start":0,"docs":[
                  {
                    "id":"831ce82bdff34abeb495f260bc7d67d2",
                    "title":"Realtime Search: Solr vs Elasticsearch",
                    "url":"http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/",
                    "author_id":"9e4f40e1aa82f2e1a9368748d1268082"},
                  [...]]
              }}]}}
  • Alternatives - Solr Grouping?
    - Pros
      - Faster than doing grouping at the app layer: no need for post counting
      - Possible to sort groups by sum of post relevance scores inside the engine (with some custom work)
    - Cons
      - No concept of author; author profiles still need to be fetched from db, so it still suffers from some performance penalty
      - Submit patch for group sort options? Not a lot of interest in sorting groups by anything other than max score
      - Don’t want to be stuck maintaining custom Solr code (been there, done that with HBase: http://www.slideshare.net/gstathis/finding-the-right-nosql-db-for-the-job-the-path-to-a-nonrdbms-solution-at-traackr)
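To make the cons concrete, here is a small sketch of what an application gets back from a grouped response shaped like the sample on the previous slide. The response below is trimmed and uses made-up ids and scores, and the helper `author_groups` is hypothetical. Each group carries a `maxScore` and a post count but no author profile and no score sum.

```python
# Trimmed, hypothetical response shaped like the grouped sample.
response = {
    "grouped": {
        "author_id": {
            "matches": 3,
            "groups": [
                {"groupValue": "author-a",
                 "doclist": {"maxScore": 3.5, "numFound": 1, "docs": [
                     {"id": "post-1", "author_id": "author-a"}]}},
                {"groupValue": "author-b",
                 "doclist": {"maxScore": 2.5, "numFound": 2, "docs": [
                     {"id": "post-2", "author_id": "author-b"},
                     {"id": "post-3", "author_id": "author-b"}]}},
            ],
        }
    }
}


def author_groups(response, field="author_id"):
    """Return (author_id, max_score, post_count) for each group."""
    groups = response["grouped"][field]["groups"]
    return [(g["groupValue"], g["doclist"]["maxScore"], g["doclist"]["numFound"])
            for g in groups]


summary = author_groups(response)
# Only ids come back, so profiles still need a lookup somewhere else.
author_ids_to_fetch = [author_id for author_id, _score, _count in summary]
print(summary)
print(author_ids_to_fetch)
```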
  • Alternatives - Elasticsearch!
    - Supports document types and parent/child document mappings: http://www.elasticsearch.org/guide/reference/mapping/parent-field.html
      {
          "post" : {
              "_parent" : { "type" : "author" }
          }
      }
    - Out-of-the-box support for querying child documents and obtaining their parents: http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html
      curl localhost:9200/traackr/_search?pretty=1 -d '{
          "query": {
              "top_children": {
                  "type": "post",
                  "query": {
                      "query_string": { "query": "elasticsearch NRT" }
                  },
                  "score": "sum"
              }
          }
      }'
    - Callout on "score": "sum": can order parent results by sum of child scores!
    - Parent documents can be sorted by sum/avg/max of [...]
    - Con: memory heavy
    - Big win
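The query body on this slide, plus a local illustration of what the `"score"` option means, can be sketched as follows. The `child_hits` data and the `roll_up` helper are hypothetical and only mimic the roll-up arithmetic; they do not reproduce how the engine executes a `top_children` query.

```python
import json
from collections import defaultdict

# Query body from the slide: match child "post" documents and score each
# parent by the sum of its matching children's scores.
top_children_body = {
    "query": {
        "top_children": {
            "type": "post",
            "query": {"query_string": {"query": "elasticsearch NRT"}},
            "score": "sum",
        }
    }
}

# Hypothetical child hits as (parent_author_id, post_score) pairs.
child_hits = [("a1", 0.5), ("a2", 2.0), ("a1", 1.0), ("a1", 1.0)]


def roll_up(child_hits, mode="sum"):
    """Aggregate child scores per parent with sum, max, or avg and return
    (parent_id, score) pairs, best first."""
    by_parent = defaultdict(list)
    for parent_id, score in child_hits:
        by_parent[parent_id].append(score)
    combine = {"sum": sum, "max": max, "avg": lambda s: sum(s) / len(s)}[mode]
    scored = {p: combine(s) for p, s in by_parent.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


print(json.dumps(top_children_body))
print(roll_up(child_hits, "sum"))
print(roll_up(child_hits, "max"))
```

With this sample data the two modes rank the authors differently: an author with several moderately relevant posts wins under `sum`, while an author with one strong post wins under `max`. That difference is why rolling scores up by sum mattered for author-level relevance.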
  • Top Children Demo
  • Other Elasticsearch benefits
    - Lucene: don’t have to give up query syntax if you come from Solr
    - In-JVM nodes: can use the Java API to unit test different permutations of indexing configurations (e.g. different analyzers and tokenizers); a great help for testing search on a qualitative basis; allows for embedded ES instances
    - Index API and Cluster API: a great deal of cluster and index configuration changes can be made on the fly through curl API calls without restarting the cluster; very convenient for testing and cluster management
    - Warmer API: significant help in avoiding search time drops due to segment merges; https://github.com/elasticsearch/elasticsearch/issues/1913
    - Percolators: register queries and let the engine tell you which queries match on a given document; great potential for real-time; http://www.elasticsearch.org/guide/reference/api/percolate.html
  • Q&A