Solr 101
 

An introduction to Solr. Presented at JavaZone 2012.

Upload Details

Uploaded as Apple Keynote

Usage Rights

CC Attribution-NonCommercial-ShareAlike License

  • Introductions: me and Solr.
  • Who: Sébastien Muller, Swiss, ex-Solr newbie, about one year of working with Solr almost daily on several projects.
    What: I work for Findwise as a "Findwizard", an information access consultant. Enterprise search!
    Where: Oslo, for a year.
    Why: the Oslo Solr Meetup community, with semi-regular meetings at a pub in Oslo.
    This is an INTRODUCTORY TALK.
  • Search matters internally and/or externally, both for finding information and for finding who has the information. E-commerce fails without search: if customers can't find what they want to buy, there is no sale.
    A lot of the data is unstructured.
    According to research performed by Google, approx. 85% of organisations can access less than 50% of the data they produce.
  • General description, the "sales pitch".
    Web-based app: runs in a servlet container and is deployed as a Java WAR. Works with all major application servers such as Tomcat or Jetty.
  • No point in building when open source and licensed options are readily available.
    Open source = free, but you might end up spending a lot to get it to do what you want. No vendor lock-in, complete customisability.
    Licensed = expensive, and you are still likely to spend a lot.
  • Although being open source gives Solr a very low barrier to adoption, there is no Service Level Agreement with an open source community.
    An open source, community-based project is likely to yield better long-term QA testing, and contributors are more personally invested in the quality of the project, but life > *.
    Issue tracking is public.
    Access to source code, but the documentation isn't necessarily comprehensive… Google :D
  • Inverted index: like the index of a book, a list of keywords paired with their locations. This makes for very fast queries compared with searching through documents for specific terms.
    Document: a collection of fields with optional boost values (a book, a page, a database entry, etc.), represented by a single result or hit.
    Field: single- or multi-valued.
    Stored: the original pre-analysis value is stored and returnable by queries. Necessary for some features, increases index size. A stored field can be stored and retrieved but not searched, unless it is indexed too.
    Indexed: searchable and facetable. Unless it is also stored, the value will not be returned in a search.
    Tokenization: breaking up text sequences, then filtering and transforming them to generate "terms" that are tied to a specific field.
    Filter: restricts the search space by creating a subset of indexed documents against which queries can be made.
    Function: contributes to relevance calculations at query time, can be customised.
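To make the book-index analogy in the notes concrete, here is a minimal sketch of an inverted index in plain Python. It is not how Lucene stores its index; the `tokenize` and `build_inverted_index` helpers and the sample documents are made up for illustration, and the "analysis" step is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def tokenize(text):
    """Very rough stand-in for analysis: lowercase, then split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library",
    3: "Solr builds on Lucene",
}

index = build_inverted_index(docs)
print(index["solr"])    # ids of the documents containing "solr"
print(index["search"])  # ids of the documents containing "search"
```

A lookup is a single dictionary access per term, which is why querying an inverted index is much faster than scanning every document for the term.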
  • Schema = defines data and field types.
    Solrconfig = search components, replication, request handlers, etc.
    DEMO: schema/solrconfig in Notepad++, and POSTing from NetBeans/browser.
  • Sort on relevance score or on the value of fields.
    If there is a tie, docs are sorted by date added (index time).
    More shiny examples to follow.
  • A single Solr instance with separate configurations (schema and config) and indexes, while maintaining the convenience of unified administration.
    Easy to add new cores or even swap cores with each other.
    Actions: STATUS, CREATE, SWAP, UNLOAD, ALIAS, RENAME.
    SWAP atomically swaps the names used to access two existing cores. This can be useful for replacing a "live" core with an "ondeck" core while keeping the old "live" core running in case you decide to roll back.
  • Basic: a good starting point, fine for a small index with few updates and a low query load, as all updates slow down querying.
    Master/slave: indexing on one instance, querying on the other. Every replication slows down indexing, but the replication interval can be modified. Improves query speed and qps.
    1:N: more qps, but not necessarily faster queries. Both setups require the index to remain on one machine.
  • Updates go to one of N machines. The unique key field must be unique across all shards. A couple of features aren't supported, e.g. More Like This and joins.
    Makes it easier to rebalance index operations across more servers.
    shards parameter syntax: shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr. It can also be added to a requestHandler set up specifically for shards.
  • A Norwegian bank and a Norwegian-based e-commerce group of sites.
  • One bank portal of about 1.5k docs, of which c. 50% were duplicates.
    Group publications are made globally available via the CMS, but individual banks are under no obligation to publish articles and there's no indication as to whether they had or not.
  • Semantic FAIL

Solr 101 Presentation Transcript

  • SOLR 101: JavaZone 2012, Oslo. Sébastien Muller, Findwise
  • Agenda: Introductions; Enterprise search; What is Solr, why choose it?; Solr terminology; Main Solr features; How it works; Anatomy of a query; Scalability; Case studies (SpareBank1, Komplett Group)
  • Enterprise Search: Search has become mission critical for most enterprises (intranet, web presence, e-commerce); Exponential growth of data; Cost of not finding information (knowledge sharing, time, money); Information black hole
  • What is Solr? Official definition: “Solr is an open source enterprise search platform based on the Lucene Java search library, with an HTTP interface using XML, JSON or other formats. It provides hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Apache Tomcat.” http://lucene.apache.org/solr
  • What is Solr? Open-source, license-free search engine; Uses the Apache Lucene library and adds enterprise search server features and capabilities; Web-based application that processes requests and returns responses via HTTP; Easy scalability and great performance; Industry-tested worldwide; Modern solution architecture based on XML and Java, easy to work with; Well integrated with the ecosystem around Big Data, such as Hadoop (also Nutch, Tika).
  • Why choose Solr? "Buy" > Build; Open source vs. commercial solution: open source software is free, licensed software can be very expensive; High quality and easily modifiable relevancy; Very fast query and indexing performance; Highly flexible data processing/transformation
  • Why choose Solr? Some challenges unique to open source: no guaranteed support or bug fixing from the community; no formal quality control or support for upgrades; limited support for less experienced developers. Some benefits unique to open source: widely used and tested; access to source code; access to development versions and unreleased patches. Ultimately search is a specialised field and requires specialists.
  • Solr Terminology: Index(ing); Inverted index; Document; Field; Stored and/or indexed fields; Analysis (tokenization, filters, terms); Query (filter, function, facet)
  • Main Solr Features: Full text search; Field search; Number and date searching; Facets; Spelling assistance ("Did you mean…?"); Replication; Master/slave architecture; Related hits; Query completion; Admin GUI
  • How it works: Easy configuration through XML (schema.xml, solrconfig.xml); Documents are POSTed via HTTP to Solr (add/update, delete, commit); Queries and responses are also sent via HTTP; Choice of formats
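The add/update, delete and commit operations on this slide are commonly expressed as small XML messages in the body of an HTTP POST. The sketch below only builds those messages with the standard library and does not contact a server; the helper names and the sample document are assumptions for illustration, and the endpoint they would be POSTed to (typically an update handler of the Solr webapp) depends on the installation.

```python
import xml.etree.ElementTree as ET


def add_message(doc):
    """Build an <add> message holding one document with one <field> per entry."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(doc_id):
    """Build a <delete> message that removes a document by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def commit_message():
    """Build the <commit> message that makes pending changes searchable."""
    return ET.tostring(ET.Element("commit"), encoding="unicode")


# Hypothetical document; the field names must exist in schema.xml.
print(add_message({"id": "1", "title": "Solr 101"}))
print(delete_message("1"))
print(commit_message())
```

Changes sent with add or delete messages generally become visible to queries only after a commit, which is why the slide lists it as a separate step.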
  • Anatomy of a Query: Common parameters: start, rows, fl, fq, sort (http://wiki.apache.org/solr/CommonQueryParameters). Example: ?q=*:*&start=0&rows=10&fl=title&fq=collection:popular&sort=title asc. Slightly more advanced: &facets, &qf
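The query string on this slide can be assembled programmatically so that special characters such as `:` and the space in `title asc` are escaped correctly. This is a minimal sketch using only the standard library; the base URL `http://localhost:8983/solr` and the `/select` path are assumptions about a local installation, and `build_select_url` is a made-up helper name.

```python
from urllib.parse import urlencode


def build_select_url(base_url, **params):
    """Append URL-encoded query parameters to an assumed /select handler path."""
    return f"{base_url}/select?{urlencode(params)}"


# Same parameters as on the slide: match everything, first 10 hits, titles only,
# restricted to the "popular" collection, sorted by title ascending.
url = build_select_url(
    "http://localhost:8983/solr",  # assumed host, port and webapp path
    q="*:*",
    start=0,
    rows=10,
    fl="title",
    fq="collection:popular",
    sort="title asc",
)
print(url)
```

Here `start` and `rows` drive paging, `fl` limits the returned fields, `fq` filters the result set, and `sort` orders it.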
  • What is Faceting? Navigation/discovery technique; Tally of docs for each distinct field value; Parameters: &facet=true, &facet.field=category; And so much more…
  • Scalability: Architecture goals: more queries per second (qps), faster query execution, bigger indexes, faster indexing. Scaling options: multicore, replication, sharding
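The "tally of docs for each distinct field value" can be illustrated without Solr at all. The sketch below counts values of a field over a small in-memory result set; the `facet_counts` helper and the sample documents are invented for illustration, and Solr computes such counts from its index rather than by looping over documents like this.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each distinct value of the given field."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field contribute nothing
        # Treat single-valued and multi-valued fields the same way.
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts.most_common()


# Hypothetical search results with a "category" field.
results = [
    {"id": 1, "category": "laptop"},
    {"id": 2, "category": "phone"},
    {"id": 3, "category": "laptop"},
    {"id": 4, "category": ["phone", "accessory"]},
    {"id": 5, "category": "laptop"},
    {"id": 6},
]

for value, count in facet_counts(results, "category"):
    print(value, count)
```

In a search UI these counts are what appears next to each navigation link, for example a category name followed by its number of matching hits.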
  • Scalability - Multicore: having more than one Solr core in one Solr webapp.
    <solr persistent="true" sharedLib="lib">
      <cores adminPath="/admin/cores">
        <core name="core0" instanceDir="core0" />
        <core name="core1" instanceDir="core1" />
      </cores>
    </solr>
    http://localhost:8080/solr/admin/cores?action=... with actions STATUS, CREATE, SWAP
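As a sketch of how the core administration URL from this slide is used, the helper below fills in the `action` parameter for STATUS and SWAP. The `core` and `other` parameter names for SWAP follow the CoreAdmin handler as I understand it, so verify them against the Solr version in use; the helper name is made up, and nothing is actually sent to a server.

```python
from urllib.parse import urlencode


def core_admin_url(base_url, action, **params):
    """Build a URL for the admin path configured as adminPath="/admin/cores"."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/cores?{query}"


base = "http://localhost:8080/solr"  # host and port taken from the slide

# Report the status of all cores.
status_url = core_admin_url(base, "STATUS")

# Swap the names of the two cores defined in the configuration above,
# e.g. to promote a freshly built core to "live".
swap_url = core_admin_url(base, "SWAP", core="core0", other="core1")

print(status_url)
print(swap_url)
```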
  • Scalability - Replication: Basic architecture: indexing and querying handled by one instance; 1:1 master/slave: indexing on one, querying on the other; 1:N master/slaves: different user groups
  • Scalability - Sharding: Distributed index: N masters with the index split between them, simple hashing to choose index; Sharding + replication: N masters with M slaves each; More shards = faster execution time; More slaves = higher average QPS. Example: &shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr
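"Simple hashing to choose index" can be sketched as follows: hash the document's unique key and take it modulo the number of shards, so the same key always lands on the same shard. This is an illustration under assumptions, not Solr's own routing; the CRC32 hash, the helper names and the document ids are arbitrary choices, while the shard addresses are the ones from the slide.

```python
import zlib


def choose_shard(doc_id, shards):
    """Pick a shard deterministically from the document's unique key."""
    digest = zlib.crc32(str(doc_id).encode("utf-8"))
    return shards[digest % len(shards)]


def shards_param(shards):
    """Join shard addresses into the value of the shards query parameter."""
    return ",".join(shards)


shards = ["solr1:8983/solr", "solr2:8983/solr"]

# Index time: each document is sent to exactly one shard.
for doc_id in ["ipod-1", "ipod-2", "solr-book"]:
    print(doc_id, "->", choose_shard(doc_id, shards))

# Query time: one request fans out to every shard listed in the parameter.
print("shards=" + shards_param(shards))
```

Because the mapping depends on the number of shards, adding a shard changes where keys land, so a scheme like this usually implies reindexing when the cluster grows.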
  • Case Studies
  • SpareBank1 - Background: SpareBank1 Gruppen; 19 individual localised bank portals and one parent front page; Boost 25 umbrella project; Semantic URLs: https://www2.sparebank1.no/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_; New search interface; Banking app; CMS with no easy way of tracking individual banks' publications; Mass duplicates; Access to irrelevant data
  • SpareBank1 - Requirements: Customer requirements: "bedre portal søk" ("better portal search")
  • SpareBank1 - Requirements: Basic search features include: high quality relevance and precision; relevant faceting; query completion; spell check and suggestions; search analytics
  • SpareBank1 – Live Demo https://www2.sparebank1.no/
  • Komplett - Background: Komplett NO, SE, DK… inWarhouse.se, MPX; Existing Solr solution; Mile-long query with boosting per field; Poor relevance; Peripherals/accessories ranked higher than products; Limited faceting; No query completion or spellcheck; Sloooooow indexing
  • Komplett - Requirements: Superior and customisable relevance model; Much more comprehensive indexing of products and specifications; Spellcheck; Query completion; So much more faceting
  • Sébastien Muller, sebastien.muller@findwise.com