Suche mit Apache Lucene & Co.
 


More and more applications could benefit from sophisticated, fast search functionality. After a brief introduction to Lucene, this workshop focuses on the practical uses of Apache Solr and ElasticSearch in various scenarios: simple full-text search, fast autocompletion, faceted search, content snippets, similar documents, simple data import, etc.


Suche mit Apache Lucene & Co. Presentation Transcript

  • 1. Suche mit Apache Lucene & Co.
    Christian Meder, Bernhard Pflugfelder, inovex GmbH
  • 2. Speaker: Christian Meder, CTO@inovex
    Background:
    ‣ open source (free software)
    ‣ Linux
    ‣ Web
    ‣ Java
    ‣ Android
  • 3. Speaker: Bernhard Pflugfelder, Big Data Engineer@inovex, bpflugfelder@inovex.de
    Background:
    ‣ Lucene
    ‣ Solr
    ‣ Text Mining Technologies, Information Retrieval
    ‣ Hadoop
    ‣ Java
  • 4. Agenda: Session I
    ‣ 09:00 - 09:30 Introduction, Search in a nutshell
    ‣ 09:30 - 10:00 Solr Exercise 1: Installation, Web Admin Interface
    ‣ 10:00 - 10:30 Solr Exercise 2: Indexing, Queries I
    ‣ 10:30 - 11:00 Coffee Break
    ‣ 11:30 - 12:00 Solr Exercise 3: Data ingestion XML / SQL, Queries II
  • 5. Agenda: Session II
    ‣ 12:00 - 12:30 Solr Exercise 4: Schema, Data types, Analyzers, Stemming
    ‣ 12:30 - 13:30 Lunch
    ‣ 13:30 - 14:00 Solr Exercise 5: Facet search, Filter search, Interval search
    ‣ 14:00 - 14:30 Solr Exercise 6: Dismax, Autosuggestion, MoreLikeThis
  • 6. Agenda: Session III
    ‣ 14:30 - 15:00 ES Exercise 1: Installation, Indexing, Queries I
    ‣ 15:00 - 15:30 Coffee Break
    ‣ 15:30 - 16:00 ES Exercise 2: Schema, Data types, Analyzers, Queries II
    ‣ 16:00 - 16:30 ES Exercise 3: Data ingestion SQL / XML
    ‣ 16:30 - 17:00 ES Exercise 4: Facet search, Filter search, Interval search
  • 7. Introduction: Search tag cloud [figure]
  • 8. Introduction: Classical search applications
    ‣ Classical search applications are applications focusing on information or document retrieval
    ‣ Requirement: find the information the user asks for!
    ‣ Some examples:
      ‣ Web search
      ‣ Enterprise search
      ‣ Document search (within DMS or CMS)
      ‣ Search on portals and archives
      ‣ Product search
      ‣ Specialized searches for people, companies, etc.
  • 9. Introduction: Where search is in: Enterprise search [figure]
  • 10. Introduction: Where search is in: Online shops [figure]
  • 11. Introduction: Where search is in: Semantic search @Google [figure]
  • 12. Introduction: Where search is in: Navigation & information access [figure]
  • 13. Introduction: Search-based applications: Data analysis [figure, http://datarpm.com/product]
  • 14. Introduction: NoSQL database, document store
    ‣ Can you think of other scenarios where search applications would also do a good job?
    ‣ Recall the key capabilities of search technologies:
      ‣ Persistence
      ‣ Flexible data model
      ‣ Unstructured data, but not only
      ‣ Extremely quick access to data
      ‣ Horizontal scalability
    There are plenty of application scenarios out there where search technologies should be considered!
  • 15. Projects: Hot open source search technologies
    ‣ http://lucene.apache.org
    ‣ http://lucene.apache.org/solr/
    ‣ http://www.elasticsearch.org
  • 16. Projects: Lucene overview (http://lucene.apache.org/)
    Lucene is an open source, pure Java API for enabling information retrieval
    ‣ Originally developed by Doug Cutting in 1999; joined the Apache Software Foundation (Jakarta) in 2001 and became a top-level project in 2005
    ‣ Licensed under the Apache License 2.0
    ‣ Pure Java library, with ports to other platforms:
      ‣ Lucene.NET (http://lucenenet.apache.org)
      ‣ PyLucene (http://lucene.apache.org/pylucene/)
      ‣ and more: http://wiki.apache.org/lucene-java/LuceneImplementations
    ‣ Large and very active developer community, well documented and supported (38 active committers!)
    ‣ Current stable release: 4.2.1
    ‣ Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/lucene-java/PoweredBy
  • 17. Projects: Lucene highlights
    ‣ Scalable, high-performance indexing
      ‣ over 95GB/hour on modern hardware
      ‣ small RAM requirements
      ‣ incremental indexing as fast as batch indexing
      ‣ index size roughly 20-30% the size of text indexed
    ‣ Powerful, accurate and efficient search algorithms
      ‣ ranked searching: best results returned first
      ‣ many powerful query types
      ‣ fielded searching (e.g., title, author, contents)
      ‣ date-range searching
      ‣ sorting by any field
      ‣ multiple-index searching with merged results
      ‣ allows simultaneous update and searching
    [From http://lucene.apache.org/core/features.html]
  • 18. Projects: Solr overview (http://lucene.apache.org/solr/)
    Solr is a standalone enterprise search server & document store based on Lucene
    ‣ Created by Yonik Seeley at CNET Networks in 2004
    ‣ Introduced as Apache Incubator in 2006, became TLP in 2007
    ‣ Licensed under the Apache License 2.0
    ‣ Seeley and others founded Lucid Imagination -> LucidWorks
    ‣ Large and very active developer community, well documented and supported (also a strong relationship to the Lucene community)
    ‣ Current stable release: 4.2.1
    ‣ Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/solr/PublicServers
  • 19. Projects: Solr highlights
    ‣ Architectural highlights
      ‣ Extensible Plugin Architecture
      ‣ SolrCloud: distributed indexing and search architecture
      ‣ Efficient Replication to other Solr Search Servers
      ‣ Configurable Query Result, Filter, and Document cache instances
    ‣ Access & Monitoring
      ‣ Standards Based Open Interfaces
      ‣ XML, JSON and HTTP
      ‣ REST-like API
      ‣ Comprehensive HTML Administration Interfaces
      ‣ Server statistics exposed over JMX for monitoring
  • 20. Projects: Solr highlights
    ‣ Data model
      ‣ Lucene's document oriented index data structure
      ‣ Schema for field types and fields of documents
    ‣ Analysis & Indexing highlights
      ‣ Out-of-box support for JSON, XML, CSV/delimited-text, DBMS
      ‣ Support of PDF, DOC, XLS, PPT, HTML
      ‣ Declarative Lucene Analyzer specification
      ‣ Many additional text analysis components including word splitting, regex and sounds-like filters
      ‣ External file-based configuration of stopword lists, synonym lists, and protected word lists
  • 21. Projects: Solr highlights
    ‣ Search highlights
      ‣ Facet search and filtering (values, queries, date/time ranges)
      ‣ Geospatial search (e.g. local search)
      ‣ Configurable caching
      ‣ Sorting (number of fields, complex functions of numeric fields)
      ‣ Autocomplete
      ‣ Highlighted context snippets
      ‣ Spelling suggestions for user queries
      ‣ More Like This suggestions for given document
      ‣ Function Query
      ‣ Advanced query parser for high relevancy results from user-entered queries
  • 22. Projects: Solr clients & tools
    ‣ Solr clients in various languages are freely available:
      ‣ Java, Scala, Ruby, Python, .NET, Javascript (AJAX), …
      ‣ http://wiki.apache.org/solr/IntegratingSolr
    ‣ Very helpful tools:
      ‣ Grep (log file analysis)
      ‣ Luke (index analysis)
      ‣ Solrmeter (performance analysis)
      ‣ Scalable Performance Monitoring for Solr (Monitoring)
  • 23. Projects: Solr documentation
    ‣ Getting started: http://lucene.apache.org/solr/4_0_0/tutorial.html
    ‣ Release documentation: http://lucene.apache.org/solr/4_0_0/
    ‣ Javadocs: http://lucene.apache.org/solr/4_0_0/solr-core/index.html
    ‣ Solr Wiki: http://wiki.apache.org/solr/
    ‣ Mailing lists: http://lucene.apache.org/solr/discussion.html
    ‣ Apache Solr 3 Enterprise Search Server: http://link.packtpub.com/2LjDxE
    ‣ Apache Solr 3.1 Cookbook: http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book
    ‣ LucidWorks Technical Support: http://support.lucidworks.com/home
  • 24. Projects: Solr pros & cons
    + Solr is a mature technology widely used in commercial applications
      ‣ Easy integration into third-party applications
      ‣ Big community, good documentation, good support
      ‣ You have a Solr problem? Most likely someone else has had it already
      ‣ Very helpful tools for analysis and monitoring
    + Solr provides a large bundle of features:
      ‣ Lots of analyzers and specific query types
      ‣ Individual relevance boosting
      ‣ Admin interface
    - Because Solr can do so much, it is a heavyweight technology:
      ‣ much to configure
      ‣ most of the configuration is static / no API access
      ‣ includes redundant functionality (e.g. similar request handlers)
  • 25. Projects: Search architecture [figure]
  • 26. Solr Exercise I
    ‣ Installation
    ‣ Administration
    ‣ Solr Web Admin Interface
  • 27. Solr Exercise I: Overview
    ‣ Solr is a pure Java application
    ‣ Solr is built upon:
      ‣ Lucene
      ‣ Zookeeper
      ‣ Guava libraries
      ‣ HttpComponents, SLF4J, various Commons libraries
    ‣ Solr source code available at:
      ‣ http://svn.apache.org/viewcvs.cgi/lucene/dev/ (Web access)
      ‣ http://svn.apache.org/repos/asf/lucene/dev/ (anonymous access)
    ‣ Solr needs a servlet container such as Jetty, Tomcat or Glassfish to run
    ‣ Embedded Jetty for easily playing with and testing Solr
  • 28. Solr Exercise I: Installation
    Run Solr on embedded Jetty:
    1. Unpack the Solr distribution to your desired location (= SOLR_MAIN)
    2. Change to directory SOLR_MAIN/example
    3. Start the example Solr instance: java -jar start.jar
    To verify the installation, open your browser and go to the Solr Admin page: http://localhost:8983/solr
  • 29. Solr Exercise I: Core vs. Collection
    ‣ Solr Core (aka Core)
      ‣ basically an isolated running instance of a Solr index
      ‣ each Core has its own solrconfig.xml, schema.xml and index data
      ‣ search results cannot be computed across Cores
    ‣ Solr Collection (aka Collection)
      ‣ Logical index distributed over multiple machines
      ‣ Physical partitioning using sharding
      ‣ Part of SolrCloud (Scalability, High Availability)
  • 30. Solr Exercise I: Solr Home
    Solr Home Directory as recommended:
    ‣ solr.xml
      ‣ primary configuration file Solr looks for when starting
      ‣ this file specifies the list of SolrCores it should load
    ‣ Solr Core Instance Directories
      ‣ contain configuration and data of a SolrCore
    ‣ lib/
      ‣ shared lib directory for the Solr instance
    ‣ zoo.cfg
      ‣ Zookeeper configuration when using SolrCloud
    ‣ How to tell Solr where SOLR_HOME is located?
      ‣ Use the Java system property: solr.solr.home
      ‣ e.g. java -Dsolr.solr.home=/some/dir -jar start.jar
  • 31. Solr Exercise I: Instance Directory
    Solr Core Instance Directory as recommended:
    ‣ conf/
      ‣ This directory is mandatory and must contain your solrconfig.xml and schema.xml.
      ‣ Any other optional configuration files would also be kept here.
    ‣ data/
      ‣ This directory is the default location where Solr will keep your index, and is used by the replication scripts for dealing with snapshots.
      ‣ You can override this location in the conf/solrconfig.xml.
    ‣ lib/
      ‣ This directory is optional. If it exists, Solr will load any Jars found in this directory and use them to resolve any "plugins" specified in your solrconfig.xml or schema.xml (i.e. Analyzers, Request Handlers, etc.).
  • 32. Solr Exercise I: Admin Web interface
    Solr includes an Admin Web interface providing you with
    ‣ General configuration details
    ‣ Core-specific configuration details
    ‣ Log information
    ‣ Run queries
    ‣ Document field / Term statistics
    ‣ Document fields
    ‣ Cache statistics
    ‣ Server cluster information
    Access it via http://localhost:8983/solr
  • 33. Solr Exercise II
    ‣ Indexing the first XML data
    ‣ Trying first simple queries
    ‣ Different query types
    ‣ Getting the result score
    ‣ Highlighting
  • 34. Solr Exercise II: Search Basics, Index-based search
    [figure: a document is indexed into a representation (tokens); a query is indexed (query analysis) into a representation (tokens); query evaluation matches the two]
  • 35. Solr Exercise II: Search Basics, Inverted index
    ‣ An inverted index is an index data structure that
      ‣ stores mappings from tokens to their locations (e.g. documents)
      ‣ allows fast access to those documents that contain specific tokens
    ‣ The purpose of an inverted index is to allow fast full-text searches
  • 36. Solr Exercise II: Search Basics, Data model
    [figure: an Index contains Documents; each Document contains Fields; each Field has a Name and a Value]
  • 37. Solr Exercise II: Search Basics, Data model
    Doc 1: "Penn State Football … football"
    Doc 2: "Football players … State"
    Posting table:
      id | word     | postings (doc, offset)
      1  | football | (Doc 1, 3), (Doc 1, 67), (Doc 2, 1)
      2  | penn     | (Doc 1, 1)
      3  | players  | (Doc 2, 2)
      4  | state    | (Doc 1, 2), (Doc 2, 13)
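The posting table on this slide can be reproduced with a few lines of Python. This is a minimal sketch and not how Lucene stores its index: it assumes whitespace tokenization and lower-casing, counts offsets in tokens starting at 1, and uses shortened versions of the two example documents. The function name `build_inverted_index` is made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Positions are 1-based token offsets, as in the slide's posting table.
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, position))
    return dict(index)


# Shortened versions of the slide's documents (the elided middle part is dropped).
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players State",
}
index = build_inverted_index(docs)

for word in sorted(index):
    print(word, index[word])
```

Looking up a word is then a single dictionary access, which is why full-text search over an inverted index is fast.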
  • 38. Solr Exercise II: Search Basics, Term selection
    ‣ How to select important terms?
    ‣ Simple method: using middle-frequency words
    [figure: frequency and informativity (min. to max.) plotted against term rank 1, 2, 3, …]
  • 39. Solr Exercise II: Search Basics, Term selection
    ‣ tf = term frequency
      ‣ frequency of a term/keyword in a document
      ‣ The higher the tf, the higher the importance (weight) for the doc.
    ‣ df = document frequency
      ‣ no. of documents containing the term
      ‣ distribution of the term
    ‣ idf = inverse document frequency
      ‣ the unevenness of term distribution in the corpus
      ‣ the specificity of term to a document
      ‣ The more evenly the term is distributed, the less specific it is to a document
    weight(t,D) = tf(t,D) * idf(t)
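A small worked example of weight(t,D) = tf(t,D) * idf(t). The slide does not define idf exactly, so this sketch assumes the common textbook form idf(t) = ln(N / df(t)); Lucene's own similarity uses a different, smoothed formula. The toy corpus is invented for illustration.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: how often the term occurs in one document."""
    return doc_tokens.count(term)


def idf(term, corpus):
    """Inverse document frequency, assumed here as ln(N / df)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0


def weight(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus (assumption): "search" occurs in every document, "solr" in one.
corpus = [
    ["solr", "search", "search"],
    ["lucene", "search"],
    ["search", "index"],
]

print(weight("search", corpus[0], corpus))  # evenly distributed term
print(weight("solr", corpus[0], corpus))    # term specific to one document
```

The term that appears in every document gets weight zero regardless of its tf, which is the point of the last bullet above.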
  • 40. Solr Exercise II: Search Basics, Querying
    ‣ 1-word query: the documents to be retrieved are those that include the word
      ‣ Retrieve the inverted list for the word
      ‣ Sort in decreasing order of the weight of the word
    ‣ Multi-word query?
      - Combining several lists
      - How to combine matches of these different lists?
      - How to interpret the weight? (IR model)
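One possible answer to the multi-word question, sketched in Python: intersect the inverted lists (AND semantics) and add up the term weights per document. Both choices are assumptions made for this sketch; the slide deliberately leaves them open, and real engines use an IR model such as the vector-space model. The weights are invented numbers.

```python
def search(query_terms, weights):
    """Rank documents that contain all query terms.

    weights maps term -> {doc_id: weight}; the score of a document is the
    sum of the weights of all query terms (an assumption of this sketch).
    """
    lists = [weights.get(term, {}) for term in query_terms]
    if not lists:
        return []
    matching = set(lists[0])
    for postings in lists[1:]:
        matching &= set(postings)  # keep only docs present in every list
    scored = [(doc, sum(p[doc] for p in lists)) for doc in matching]
    # Highest score first; doc id breaks ties to keep the order stable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Invented per-term weights for three documents.
weights = {
    "goethe": {"d1": 0.5, "d2": 1.5},
    "schiller": {"d2": 0.25, "d3": 2.0},
}

print(search(["goethe"], weights))
print(search(["goethe", "schiller"], weights))
```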
  • 41. Search Basics: Vector-space model
    ‣ Vector space = all the terms encountered: <t1, t2, t3, …, tn>
    ‣ Document D = <a1, a2, a3, …, an>, where ai = weight of ti in D
    ‣ Query Q = <b1, b2, b3, …, bn>, where bi = weight of ti in Q
    ‣ R(D,Q) = Sim(D,Q)
      ‣ Cosine Similarity (TF*IDF)
      ‣ Okapi BM25
    [figure: D and Q drawn as vectors in the t1/t2 plane]
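Cosine similarity between a document vector D and a query vector Q can be computed directly from the definitions above. A minimal sketch with plain Python lists; the example vectors are invented, and BM25 is not covered here.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


# Two-term vector space <t1, t2> with invented weights.
D = [3.0, 4.0]
Q = [4.0, 3.0]
print(cosine_similarity(D, Q))
```

Identical directions give 1.0 and vectors without shared terms give 0.0, independent of document length.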
  • 42. Solr Exercise II: Solr Indexing, Update request handlers
    ‣ The Solr UpdateRequestHandler defines the logic to deal with index update actions based on a specific data source or data format
    ‣ UpdateRequestHandlers must be defined in solrconfig.xml and are mapped to a specific URL path in order to be accessible via HTTP
    ‣ Solr supports several file types out of the box by using the specific update handler:
      ‣ Standard UpdateRequestHandler
        ‣ supporting XML, XSLT, JSON, CSV and javabin
      ‣ DataImportHandler
    ‣ Indexing events: Add/Replace, Commit, Soft Commit, Delete
      <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
  • 43. Solr Exercise II: Solr Indexing, XML Add
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type: text/xml' --data-binary '
      <add>
        <doc>
          <field name="id">etext78942</field>
          <field name="title">Solr textbook</field>
          <field name="subject">search technology</field>
          <field name="author">Bernhard Pflugfelder</field>
        </doc>
        [<doc> ... </doc>[<doc> ... </doc>]]
      </add>'
  • 44. Solr Exercise II: Solr Indexing, XML Update
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type: text/xml' --data-binary '
      <add>
        <doc>
          <field name="id">etext78942</field>
          <field name="author" update="set">Christian Meder</field>
          <field name="subject" update="add">open source</field>
        </doc>
      </add>'
  • 45. Solr Exercise II: Solr Indexing, XML Delete
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type: text/xml' --data-binary '
      <delete>
        <id>etext78942</id>
        <query>author:meder</query>
      </delete>'
  • 46. Solr Exercise II: Solr Indexing, XML Commit
      curl http://localhost:8983/solr/jax2013/update -H 'Content-Type: text/xml' --data-binary '<commit waitSearcher="false"/>'
      curl 'http://localhost:8983/solr/jax2013/update?optimize=true&waitFlush=false'
  • 47. Solr Exercise II: Solr Indexing, JSON Add / Delete / Commit
    ‣ Multiple index actions in one JSON
      curl http://localhost:8983/solr/jax2013/update/json -H 'Content-type:application/json' -d '
      {
        "add": {"commitWithin": 5000, "doc": {"f1": "v1", "f1": "v2"}},
        "commit": {},
        "delete": {"id": "ID"},
        "delete": {"query": "QUERY"},
        "delete": {"query": "QUERY", "commitWithin": 500}
      }'
  • 48. Solr Exercise II: Solr Indexing, JSON Atomic updates
    ‣ Commands add, set and inc
      curl http://localhost:8983/solr/jax2013/update/json -H 'Content-type:application/json' -d '
      [
        {
          "id": "etext78942",
          "title": {"set": "solr 4.2.1 textbook"},
          "viewcount": {"inc": 3},
          "author": {"add": "Bernhard Pflugfelder"}
        }
      ]'
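Hand-written JSON bodies are easy to break with a stray typographic quote. The following sketch builds the atomic-update body from this slide with Python's `json` module instead, so the quoting is always valid. The helper name `atomic_update` and its argument layout are made up for illustration; only the resulting JSON structure (`set`, `add`, `inc`) comes from the slide. Sending the body to Solr is left out on purpose.

```python
import json

# The three atomic-update commands named on the slide.
ALLOWED_COMMANDS = {"set", "add", "inc"}


def atomic_update(doc_id, changes):
    """Return the JSON body for one atomic update.

    changes maps field name -> (command, value), e.g. {"viewcount": ("inc", 3)}.
    """
    doc = {"id": doc_id}
    for field, (command, value) in changes.items():
        if command not in ALLOWED_COMMANDS:
            raise ValueError(f"unsupported command: {command}")
        doc[field] = {command: value}
    return json.dumps([doc])


body = atomic_update(
    "etext78942",
    {
        "title": ("set", "solr 4.2.1 textbook"),
        "viewcount": ("inc", 3),
        "author": ("add", "Bernhard Pflugfelder"),
    },
)
print(body)
```

The printed string can be passed to curl via `-d` or `--data-binary @file`.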
  • 49. Solr Exercise II: Solr Indexing, Try out
      cd SOLR_MAIN/example/exampledocs
      curl 'http://localhost:8983/solr/collection1/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

      cd SOLR_MAIN/example/exampledocs
      java -jar post.jar -h
      java -jar post.jar *.xml
  • 50. Solr Exercise II: Solr Queries, Query syntax
    ‣ q=+content:goethe +content:schiller
    ‣ q=+content:goethe -content:schiller
    ‣ q=title:faust
    ‣ q=title:faust AND -content:goethe
    ‣ q=content:"romeo and juliet"
    ‣ q=title:water*
    ‣ q=title:water~0.5
    ‣ q=created:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
    ‣ q=viewcount:[20 TO 50]
    ‣ q=viewcount:[100 TO *]
    Each query is sent as the POST body, e.g.:
      curl -XPOST 'http://localhost:8983/solr/jax2013/select' -d 'q=title:faust'
  • 51. Solr Exercise II: Solr Queries, Common parameters
      Param name | Param value       | Description
      q          | string            | The user query string
      start      | number            | Offset in the list of returned documents
      rows       | number            | Number of documents returned
      fq         | string            | A filter query
      fl         | string,string,…   | Fields returned for each document
      debugQuery | true / false      | Include debug info in the response
      curl -XPOST 'http://localhost:8983/solr/collection1/select' -d 'q=%2Bsolr%20-elasticsearch&start=20&rows=40&fl=*,score'
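Characters such as `+` and the space have a special meaning in URL-encoded form data, so query parameters should be encoded rather than pasted verbatim. A minimal sketch using only Python's standard library; the parameter values are the ones from the example above, and nothing is sent to a server.

```python
from urllib.parse import parse_qs, urlencode

# Parameters from the slide's example query.
params = {
    "q": "+solr -elasticsearch",
    "start": 20,
    "rows": 40,
    "fl": "*,score",
}

# urlencode percent-encodes "+" as %2B and writes the space as "+".
query_string = urlencode(params)
print(query_string)

# Decoding gives back exactly the values we started with.
decoded = parse_qs(query_string)
print(decoded["q"][0])
```

The resulting string can be used as the POST body or appended to the select URL after a `?`.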
  • 52. Solr Exercise II: Highlighting, Overview
      Param name                    | Param value      | Description
      hl                            | true / false     | Switch highlighting on / off
      hl.q                          | string           | Alternative highlighting query
      hl.fl                         | string,string,…  | Fields used for highlighting
      hl.snippets                   | number           | Maximum number of snippets
      hl.fragsize                   | number           | Number of characters per snippet
      hl.simple.pre, hl.simple.post | string           | Text inserted before / after a match
      curl -XPOST 'http://localhost:8983/solr/collection1/select' -d 'q=%2Bsolr%20-elasticsearch&start=20&rows=40&fl=*,score&hl=true&hl.fl=title,abstract'
  • 53. Solr Exercise III
    ‣ DataImportHandler SQL
    ‣ DataImportHandler XML
  • 54. Solr Exercise III: DataImportHandler, Overview
    ‣ The DataImportHandler makes it possible to:
      ‣ index data in relational databases
      ‣ compose documents from multiple columns and tables
      ‣ run a bulk import or an incremental update using the delta query mechanism
      ‣ schedule full imports and delta imports
      ‣ index data from XML/HTML using XPath expressions
    ‣ The DataImportHandler is part of Solr Contrib
    ‣ Define it in solrconfig.xml:
      <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
          <str name="config">/home/username/data-config.xml</str>
        </lst>
      </requestHandler>
  • 55. Solr Exercise III: DataImportHandler, Commands
    ‣ http://localhost:8983/solr/dataimport?command=full-import
    ‣ http://localhost:8983/solr/dataimport?command=delta-import
    ‣ http://localhost:8983/solr/dataimport?command=status
    ‣ http://localhost:8983/solr/dataimport?command=reload-config
    ‣ http://localhost:8983/solr/dataimport?command=abort
  • 56. Solr Exercise III: DataImportHandler, Configuration
    ‣ The data-config.xml defines the data source and which data shall be used to populate Solr documents during import
    ‣ It defines the tags:
      ‣ dataSource
      ‣ document
      ‣ entity
    ‣ The entity defines a specific data selection resulting in a Solr document
    ‣ The query gives the data needed to populate fields of the Solr document
      <dataConfig>
        <dataSource … />
        <document name="products">
          <entity name="item" query="select * from item" />
        </document>
      </dataConfig>
  • 57. Solr Exercise III: DataImportHandler, DataSource
    ‣ MySQL:
      <dataSource name="jdbc" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/dbname"
                  user="db_username" password="db_password"/>
    ‣ Oracle:
      <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver"
                  url="jdbc:oracle:thin:@//hostname:port/SID"
                  user="db_username" password="db_password"/>
    ‣ Multiple data sources can be used within one DIH config by giving each a name
    ‣ Each entity definition must then reference its data source by name as well
  • 58. Solr Exercise III: DataImportHandler, SQL full-import
      <dataConfig>
        <dataSource … />
        <document name="products">
          <entity name="item" query="select * from item">
            <field column="ID" name="id" />
            <field column="NAME" name="name" />
            <field column="MANU" name="manu" />
            <field column="WEIGHT" name="weight" />
            <field column="PRICE" name="price" />
            <field column="POPULARITY" name="popularity" />
            <field column="INSTOCK" name="inStock" />
            <field column="INCLUDES" name="includes" />
          </entity>
        </document>
      </dataConfig>
  • 59. Solr Exercise III: DataImportHandler, SQL full-import
      <dataConfig>
        <dataSource … />
        <document>
          <entity name="item" query="select * from item">
            <entity name="feature"
                    query="select description as features from feature where item_id=${item.ID}"/>
            <entity name="item_category"
                    query="select CATEGORY_ID from item_category where item_id=${item.ID}">
              <entity name="category"
                      query="select description as cat from category where id = ${item_category.CATEGORY_ID}"/>
            </entity>
          </entity>
        </document>
      </dataConfig>
  • 60. Solr Exercise III: DataImportHandler, SQL Delta-Import
    ‣ Incremental update of specific content of a relational database
    ‣ Avoids indexing already indexed data again
    ‣ http://localhost:8983/solr/dataimport?command=delta-import
    ‣ Provide three specific queries for each entity except root:
      ‣ The deltaImportQuery gives the data needed to populate fields when running a delta-import
      ‣ The deltaQuery gives the primary keys of the current entity which have changed since the last index time
      ‣ The parentDeltaQuery uses the changed rows of the current table (fetched with deltaQuery) to give the changed rows in the parent table. This is necessary because whenever a row in the child table changes, we need to re-generate the document which has that field.
  • 61. Solr Exercise III: DataImportHandler, SQL Delta-Import
      <entity name="item" pk="ID"
              query="select * from item"
              deltaImportQuery="select * from item where ID=${dih.delta.id}"
              deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'">
        <entity name="feature" pk="ITEM_ID"
                query="select description as features from feature where item_id=${item.ID}" />
        <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                query="select CATEGORY_ID from item_category where ITEM_ID=${item.ID}">
          <entity name="category" pk="ID"
                  query="select description as cat from category where id = ${item_category.CATEGORY_ID}" />
        </entity>
      </entity>
  • 62. Solr Exercise III: DataImportHandler, Other DataSources
    ‣ HTTP source:
      <dataConfig>
        <dataSource type="HttpDataSource" />
        …
      </dataConfig>
    ‣ XML file source:
      <dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8"/>
        …
      </dataConfig>
  • 63. Solr Exercise III: DataImportHandler, XML full-import
    ‣ The entity defines the location of the XML file
    ‣ Solr document fields are populated by evaluating XPath expressions
      <entity name="page"
              processor="XPathEntityProcessor"
              stream="true"
              forEach="/RDF/etext/"
              url="../../catalog.rdf.xml"
              transformer="RegexTransformer,DateFormatTransformer">
        <field column="id" xpath="/RDF/etext/@id" />
        <field column="title" xpath="/RDF/etext/title" />
        <field column="alternative" xpath="/RDF/etext/alternative" />
        <field column="author" xpath="/RDF/etext/creator" />
        <field column="multi_author" xpath="/RDF/etext/creator/Bag/li" />
        <field column="subject" xpath="//LCSH/value" />
        <field column="viewcount" xpath="/RDF/etext/downloads/nonNegativeInteger/value" />
        <field column="created" xpath="/RDF/etext/created/W3CDTF/value" dateTimeFormat="yyyy-MM-dd" />
      </entity>
  • 64. Solr Exercise IV
    ‣ Schema
    ‣ Data types
    ‣ Analyzers, Tokenizers
  • 65. Solr Exercise IV: Schema, Overview
    ‣ Defines document representation by specifying fields
      ‣ with a specific field type
      ‣ with specific field type properties
    ‣ Dynamic fields
    ‣ CopyField
    ‣ Define analyzers:
      ‣ Tokenizers
      ‣ Filters
      ‣ Synonym lists, stop word lists
      ‣ additional text analysis
    ‣ Assign analyzers to the text-based data types (solr.TextField)
    ‣ Example schema.xml
  • 66. Solr Exercise IV: Schema, Fields
    ‣ Field types
      ‣ int, long, float, double, boolean
      ‣ string, date, binary
      ‣ derived from solr.TextField
        ‣ text_general, text_de, text_en, …
    ‣ Field type properties
      ‣ indexed (true / false)
      ‣ stored (true / false)
      ‣ multiValued (true / false)
      ‣ termVectors (true / false)
  • 67. Solr Exercise IV: Analyzing / Tokenization, Overview
    Break stream of characters into tokens / terms
    ‣ Normalization (e.g. case)
    ‣ Stopwords
    ‣ Stemming
    ‣ Lemmatizer / Decomposer
    ‣ Part of Speech Tagger
    ‣ Information Extraction
  • 68. Solr Exercise IV: Analyzing / Tokenization, Stopwords
    ‣ Function words do not bear useful information for searching: of, in, about, with, I, although, …
    ‣ Stopword list: contains stopwords, which are not used as index terms
      ‣ Prepositions
      ‣ Articles
      ‣ Pronouns
      ‣ Some adverbs and adjectives
      ‣ Some frequent words (e.g. document)
    ‣ The removal of stopwords usually improves search quality
    ‣ Solr provides default stopword lists for various languages
  • 69. Solr Exercise IV: Analyzing / Tokenization, Stemming
    ‣ Apply strict algorithmic normalization of inflection forms (e.g. Porter)
    ‣ Strategy: removing some endings of words.
      Example: computer, compute, computes, computing, computed, computation are all normalized to comput
    ‣ But: going -> go, king -> k?
    ‣ Stemming might work well for English
    ‣ However, be careful when using stemming, especially for German
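The "removing some endings" strategy and its pitfall can be shown with a deliberately naive suffix stripper. This is not the Porter algorithm, which applies conditions on the remaining stem precisely to avoid cases like king -> k; the suffix list below is an assumption chosen to reproduce the slide's examples.

```python
# Suffixes are tried in this order; the first match is removed (assumed list).
SUFFIXES = ("ation", "ing", "ed", "es", "er", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, with no checks on what remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for word in ["computer", "compute", "computes", "computing",
             "computed", "computation", "going", "king"]:
    print(word, "->", naive_stem(word))
```

All six "comput..." forms collapse to the same stem, which helps recall, while "king" is reduced to a meaningless single letter.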
  • 70. Solr Exercise IV: Analyzing / Tokenization, Define an analyzer
      <fieldType name="NAME" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <!-- tokenizer and filters for indexing -->
          <tokenizer class="CLASS" PARAMS />
          <filter class="CLASS" PARAMS />
        </analyzer>
        <analyzer type="query">
          <!-- tokenizer and filters for search -->
          <tokenizer class="CLASS" PARAMS />
          <filter class="CLASS" PARAMS />
        </analyzer>
      </fieldType>
  • 71. Solr Exercise IV: Analyzing / Tokenization, Tokenizers & Filters
    ‣ TokenizerFactories
      ‣ solr.StandardTokenizerFactory
      ‣ solr.WhitespaceTokenizerFactory
      ‣ solr.KeywordTokenizerFactory
    ‣ TokenFilterFactories
      ‣ solr.LowerCaseFilterFactory
      ‣ solr.TrimFilterFactory
      ‣ solr.StopFilterFactory
      ‣ solr.WordDelimiterFilterFactory
      ‣ solr.SynonymFilterFactory
      ‣ solr.EdgeNGramFilterFactory
  • 72. Solr Exercise IV: Analyzing / Tokenization, Language analysis
    ‣ English
      ‣ solr.PorterStemFilterFactory
      ‣ solr.SnowballPorterFilterFactory
      ‣ solr.EnglishMinimalStemFilterFactory
    ‣ German
      ‣ solr.SnowballPorterFilterFactory
      ‣ solr.GermanLightStemFilterFactory
      ‣ solr.GermanMinimalStemFilterFactory
    ‣ More information at http://wiki.apache.org/solr/LanguageAnalysis
  • 73. Solr Exercise V
    ‣ Faceted search
    ‣ Filter query
    ‣ MoreLikeThis query
  • 74. Solr Exercise V: Faceted search, Overview [figure]
  • 75. Solr Exercise V: Faceted search, Motivation
    ‣ "The statement of one participant in a usability test of a faceted search solution, conducted as part of this study, therefore points the way:
    ‣ 'With this filter I get the feeling that even a mundane search can be real fun.'"
    ‣ Source: Faceted Search: Die neue Suche im Usability-Test (free download at http://usability.de)
  • 76. Solr Exercise V: Faceted search, Overview
    ‣ Faceted search (aka faceted navigation) organizes search results based on different categories or dimensions, giving the user the possibility to drill down into the search results
    ‣ Facets can be authors, titles, tags, dates, languages, file types …
    ‣ Typically, metadata describing concepts and meaning of documents is useful as facets
    ‣ Facets can be shown with counts
  • 77. Solr Exercise V: Faceted search, Solr Faceting
    ‣ Solr provides a faceting mechanism out of the box, including the returning of counts
    ‣ Important: facet fields must be defined with indexed=true
    ‣ Facet fields are often analyzed differently than search fields. Therefore it is common to define separate document fields for faceting in schema.xml
    ‣ Facet fields shall not be tokenized, lower-cased or stemmed
    ‣ Facet fields can be of type
      ‣ int, long, float, double, boolean
      ‣ solr.TextField
      ‣ date
    ‣ From a performance point of view, also define
      ‣ stored=false
      ‣ omitNorms=true
      <field name="facet_author" indexed="true" stored="false" omitNorms="true" />
  • 78. Solr Exercise V: Faceted search, Solr Faceting
    ‣ Solr provides two basic mechanisms to build facets
      ‣ Arbitrary faceting (facet.query=query)
      ‣ Field value faceting (facet.field=fieldname)
    ‣ In case of field value faceting, two faceting methods can be chosen
      ‣ Enum Based Field Queries (facet.method=enum)
      ‣ Field Cache (facet.method=fc)
    ‣ Other common parameters
      Param name     | Param value   | Description
      facet          | true / false  | Switch faceting on / off
      facet.prefix   | string        | Facet results must start with this prefix
      facet.sort     | count / index | Sort facet results
      facet.limit    | number        | Limit the number of facet results
      facet.mincount | number        | Minimum count for a facet value to be returned
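What field value faceting returns can be illustrated without a Solr server: count how often each value of a field occurs in the result set, then sort and trim. This is a sketch of the idea only, not of Solr's implementation; the function name and the example documents are invented, and the `mincount` and `limit` arguments merely mimic facet.mincount and facet.limit.

```python
from collections import Counter


def facet_counts(docs, field, mincount=1, limit=None):
    """Count field values over a result set, most frequent first."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]  # treat single values like one-element lists
        counts.update(values)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = [(value, n) for value, n in ranked if n >= mincount]
    return ranked[:limit]


# Invented result set.
docs = [
    {"author": "Goethe", "language": "de"},
    {"author": "Schiller", "language": "de"},
    {"author": "Goethe", "language": "en"},
]

print(facet_counts(docs, "author"))
print(facet_counts(docs, "language", mincount=2))
```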
  • 79. Solr Exercise V: Faceted search, Date faceting
      Param name       | Param value     | Description
      facet.date       | fieldname       | The name of the field of type date used for date faceting
      facet.date.start | date expression | The start date of the first date facet interval
      facet.date.end   | date expression | The upper bound for the last date facet interval
      facet.date.gap   | date expression | The size of each date range interval
      q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.date=created&facet.date.start=1996-01-31T23:00:00Z&facet.date.end=2013-04-021T00:00:00Z&facet.date.gap=%2B1YEAR
  • 80. Solr Exercise V: Faceted search, Range faceting
      Param name        | Param value | Description
      facet.range       | fieldname   | The name of a field with a numeric field type
      facet.range.start | number      | The lower bound of the first range interval
      facet.range.end   | number      | The upper bound for the last range interval
      facet.range.gap   | number      | The size of each range interval
      q=*:*&rows=0&wt=xml&indent=true&facet=true&facet.range=viewcount&facet.range.start=0&facet.range.end=150&facet.range.gap=20
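How start, end and gap turn into intervals is easiest to see by computing them. The sketch below uses the values from the example query (0, 150, 20). Because 150 is not a multiple of the gap, the last interval overshoots the end unless it is clipped; the `hardend` flag is modeled on Solr's facet.range.hardend parameter, which is not shown on the slide, and the exact boundary handling in Solr is not reproduced here.

```python
def range_buckets(start, end, gap, hardend=False):
    """Return (lower, upper) intervals of width gap covering start..end."""
    buckets = []
    lower = start
    while lower < end:
        upper = lower + gap
        if hardend and upper > end:
            upper = end  # clip the last interval at the requested end
        buckets.append((lower, upper))
        lower = upper
    return buckets


buckets = range_buckets(0, 150, 20)
print(len(buckets), buckets[0], buckets[-1])
print(range_buckets(0, 150, 20, hardend=True)[-1])
```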
  • 81. Solr Exercise V: Filter query, Overview
    ‣ Filter queries restrict the result set of the original query to a specific subset
    ‣ The scores of the documents are not influenced by filter queries
    ‣ Examples
      ‣ access permissions (ACLs)
      ‣ categories or tags
    ‣ Importantly, the results of filter queries are cached by default
      ‣ Solr uses a separate in-memory filter cache
      ‣ Thus, filter queries will be evaluated very fast if they are cached
    ‣ Complex, frequently used queries are good candidates for filter queries
    ‣ Keep in mind that the appropriate size of the filter cache depends on the search scenario and must therefore be tuned explicitly
  • 82. Solr Exercise V: Filter query, Examples
    ‣ Filter queries are defined by the query parameter fq
      q=content:arthur&fq=subject:fantasy&fl=title,author&rows=5
      q=content:arthur&fq=subject:fantasy&fq=viewcount:[* TO 100]&fl=title,author&rows=5
    ‣ Avoiding caching of a filter query
      q=content:arthur&fq=subject:fantasy&fq={!cache=false}viewcount:[* TO 100]&fl=title,author&rows=5
  • 83. Solr Exercise V: MoreLikeThis, Overview
    ‣ Idea of MoreLikeThis
      ‣ MoreLikeThis constructs a query based on the terms of a given set of fields
      ‣ Matching documents are "similar" based on the chosen set of fields
      ‣ Fields used by MoreLikeThis should define termVectors="true"
      Param name | Param value | Description
      mlt.fl     | fieldnames  | Fields to be used by MLT
      mlt.mintf  | number      | Minimum term frequency
      mlt.mindf  | number      | Minimum document frequency
      mlt.minwl  | number      | Minimum word length
      mlt.maxwl  | number      | Maximum word length
      mlt.maxqt  | number      | Maximum number of query terms
      q=content:schiller&mlt=true&mlt.fl=subject&mlt.mindf=50&mlt.mintf=1
  • 84. Solr Exercise VI
    ‣ Advanced queries:
      ‣ Dismax query parser
      ‣ Sorting
      ‣ Grouping
      ‣ Autosuggestion
  • 85. DisMax Parser: Overview (Solr Exercise VI)
    ‣ Motivation
        ‣ The standard Solr parser only supports simple query control
        ‣ One field can be defined as the default search field
        ‣ Supports only boolean combinations of sub queries (AND / OR)
        ‣ Strict query syntax to perform e.g. phrase queries
    ‣ Dismax (and eDismax) are more robust query parsers offering various additional query parameters and controls to optimize queries
    ‣ These additional query parameters and controls are hidden from the user
    ‣ Dismax stands for Disjunction Max
        ‣ Disjunction means that multiple fields can be searched simultaneously with different field weights
        ‣ Max means that the maximum score of the field matches is taken as the document score (instead of the sum)
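The "Max" part of Disjunction Max can be illustrated with a few lines of arithmetic. This is a sketch of the scoring idea only, not Solr's implementation; the per-field scores are made-up numbers, and the optional `tie` factor (which adds a fraction of the non-maximum field scores) is included as an assumption about how the tie breaker is usually described.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores the DisMax way: best field wins.

    With tie=0.0 only the maximum counts; with tie=1.0 the result
    equals the plain sum of all field scores.
    """
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others

# Hypothetical scores of one document for the query term "schiller".
scores = {"author": 2.0, "content": 0.5}

pure_max = dismax_score(scores)             # only the best field counts
with_tie = dismax_score(scores, tie=0.1)    # other fields contribute 10 %
plain_sum = dismax_score(scores, tie=1.0)   # degenerates to the sum
```

Taking the maximum avoids favouring documents that merely repeat a term across many fields over documents that match strongly in the most relevant field.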
  • 86. DisMax Parser: Parameters (Solr Exercise VI)
      Param name | Description
      q.alt      | Alternative query executed if the user query is not specified or blank
      qf         | The query fields to be searched. Each field can be given an individual field weight.
      mm         | Minimum number of query words that must match for a document to count as a match
      pf         | Defines phrase fields. Boosts documents that have the search terms in close proximity within the phrase fields.
      ps         | The phrase slop affecting the boosting of phrase queries evaluated on the pf fields
      qs         | The phrase slop for user-defined phrase queries
      bq         | A raw query that is added to the user query to influence scoring
      bf         | Function queries that are added to the user query to influence scoring
  • 87. DisMax Parser: Examples (Solr Exercise VI)
        http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3
        http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&bq=subject:drama^5.0
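The two example URLs differ only in the boost query, so they can be produced by one small helper. A minimal sketch; `dismax_url` is a hypothetical name, while the core URL, fields and weights are the ones from the slide. The function only builds the URL.

```python
from urllib.parse import urlencode

def dismax_url(base, user_query, field_weights, boost_query=None):
    """Build a DisMax select URL with weighted query fields (qf)."""
    params = {
        "q": user_query,
        "defType": "dismax",
        # "author^20.0 content^0.3": fields separated by spaces
        "qf": " ".join(f"{field}^{weight}" for field, weight in field_weights.items()),
    }
    if boost_query:
        params["bq"] = boost_query  # additional raw query that influences scoring
    return base + "?" + urlencode(params)

base = "http://localhost:8983/solr/jax2013/select"
weights = {"author": 20.0, "content": 0.3}

plain_url = dismax_url(base, "schiller", weights)
boosted_url = dismax_url(base, "schiller", weights, boost_query="subject:drama^5.0")
```

Note that `urlencode` escapes `^` as `%5E` and the space as `+`, which is equivalent to the hand-written form on the slide.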
  • 88. Sorting: Overview (Solr Exercise VI)
    ‣ Ranking (= ordering) the document results based on some criteria
    ‣ Default ranking is based on the document score
    ‣ The sort parameter allows ranking the document results based on an arbitrary field or even a function
    ‣ Sort fields must be defined as indexed=true and multiValued=false
    ‣ Syntax: …&sort=fieldname [asc/desc],fieldname [asc/desc],…
        http://localhost:8983/solr/jax2013/select?q=schiller&defType=dismax&qf=author^20.0+content^0.3&sort=viewcount+desc
  • 89. Grouping: Overview (Solr Exercise VI)
  • 90. Grouping: Parameters (Solr Exercise VI)
    ‣ Motivation
        ‣ Documents with a common value for some field are partitioned into groups
        ‣ Documents with the same field value are collapsed to a single result
      Query parameter | Query value          | Description
      group           | true / false         | Switch grouping on / off
      group.field     | fieldname            | Field to group on
      rows            | number               | Number of groups returned
      start           | number               | Offset into the list of returned groups
      group.limit     | number               | Number of docs returned for each group
      group.offset    | number               | Offset into the list of returned documents per group
      sort            | fieldname [asc/desc] | Sort groups on some field
      group.sort      | fieldname [asc/desc] | Sort documents of every group on some field
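What `group.field`, `group.limit` and `rows` mean for the shape of the result can be mimicked on a plain list of documents. The sketch below only imitates the semantics described in the table; it is not how Solr implements grouping, and the sample documents and the result keys `groupValue` / `docs` are illustrative assumptions.

```python
def group_docs(docs, field, group_limit=1, rows=10):
    """Partition docs by a field value and keep group_limit docs per group."""
    groups = {}
    for doc in docs:
        # Documents with the same field value end up in the same group.
        groups.setdefault(doc[field], []).append(doc)
    limited = [
        {"groupValue": value, "docs": members[:group_limit]}
        for value, members in groups.items()
    ]
    return limited[:rows]  # rows limits the number of groups, not documents

# Hypothetical sample documents.
sample_docs = [
    {"author": "schiller", "title": "a"},
    {"author": "goethe", "title": "b"},
    {"author": "schiller", "title": "c"},
]

grouped = group_docs(sample_docs, "author", group_limit=1)
```

With `group_limit=1` every author collapses to a single representative document, which is the "collapsed to a single result" behaviour mentioned on the slide.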
  • 91. Autosuggestion: Overview (Solr Exercise VI)
  • 92. Autosuggestion: Overview (Solr Exercise VI)
    ‣ Autosuggestion (aka autocomplete) is a common search feature that supports the user by providing query suggestions during typing
    ‣ Autosuggestion functionality can include
        ‣ the search index
        ‣ separate word lists
        ‣ synonyms / black lists
        ‣ grouping suggestions
        ‣ fuzziness
    ‣ Whatever mechanism is actually used to provide autosuggestion, it must deliver suggestions very quickly
    ‣ Solr provides different mechanisms to build autosuggestion functionality:
        ‣ using facet search
        ‣ using standard search (standard query parser)
        ‣ using the spellchecker Solr plugin
  • 93. Autosuggestion: Using faceting (Solr Exercise VI)
    ‣ Define a new field title_auto used for autosuggestion
        <field name="title" type="text_general" indexed="true" stored="true" />
        <field name="title_auto" type="text_auto" indexed="true" stored="true" />
        <copyField source="title" dest="title_auto" />
    ‣ Define the field type text_auto providing specific analysis for autosuggestion
        <fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>
    ‣ How to get suggestions for a user query?
        q=*:*&facet=true&facet.field=title_auto&facet.mincount=1&facet.prefix=schi
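The effect of combining a keyword tokenizer, lowercasing and `facet.prefix` can be imitated in a few lines. This is a sketch of the idea only: it treats every title as one lowercased term and counts the terms that start with the typed prefix. The sample titles are made up, and the ordering (by count, then alphabetically) is an assumption for the sake of a deterministic result.

```python
from collections import Counter

def suggest_by_prefix(titles, prefix, mincount=1):
    """Imitate facet.field=title_auto with facet.prefix on lowercased titles."""
    # KeywordTokenizer + LowerCaseFilter: the whole title is one lowercased term.
    counts = Counter(title.lower() for title in titles)
    matches = [
        (term, count)
        for term, count in counts.items()
        if term.startswith(prefix) and count >= mincount
    ]
    # Most frequent suggestions first, ties broken alphabetically.
    return sorted(matches, key=lambda item: (-item[1], item[0]))

# Hypothetical titles in the index.
sample_titles = ["Schiller Gedichte", "schiller gedichte", "Schillers Leben", "Faust"]

suggestions = suggest_by_prefix(sample_titles, "schi")
```

Because the prefix is compared against lowercased terms, the user input should be lowercased as well before it is passed as `facet.prefix`.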
  • 94. Autosuggestion: Using standard search (Solr Exercise VI)
    ‣ Again, define the new field title_auto as in the previous slide
    ‣ Next, redefine the field type text_auto as follows
        <fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
          </analyzer>
        </fieldType>
    ‣ Now, you can use the standard Solr query parser to get suggestions
        q=title_auto:query&q.op=AND&rows=5&fl=title
        q=title_auto:query&q.op=AND&rows=0&facet=true&facet.field=tag&facet.mincount=1&facet.limit=5
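To see why this field type enables prefix suggestions with a plain term query, it helps to look at the terms an edge n-gram filter with `side="front"` produces. The sketch below reproduces that behaviour for a single lowercased keyword token; `edge_ngrams` is a hypothetical helper, not a Solr API.

```python
def edge_ngrams(text, min_gram=1, max_gram=25):
    """Return the front edge n-grams of the lowercased input."""
    term = text.lower()  # LowerCaseFilter; KeywordTokenizer keeps the text as one token
    longest = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, longest + 1)]

grams = edge_ngrams("Schiller")
```

Every prefix of the title becomes an indexed term, so a query such as `title_auto:schi` matches without any wildcard. The price is a larger index, which is why `maxGramSize` is capped.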
  • 95. Overview (Projects)
    ‣ Elasticsearch is a “distributed-from-scratch” search server based on Lucene
    ‣ Created by Shay Banon with a first version made public in 02/2010:
        “ElasticSearch itself was born out of my frustration with the fact that there isn’t really a good, open source, solution for distributed search engine out there, which also combines what I expect of search engines after building Compass (and on that, I will blog later…). I have been working on this for the past several months, pouring my search and distributed knowledge into this (and portions of my heart and time ;) )”
        [http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]
    http://www.elasticsearch.org/
  • 96. Overview (Projects)
    ‣ Current stable version 0.20.6
    ‣ Licensed under the Apache License 2.0
    ‣ Small group of core developers, but strong support from valued Lucene committers
    ‣ Already a promising list of users (small and big companies)
        ‣ github, soundcloud, stackoverflow, mozilla, klout
        ‣ http://www.elasticsearch.org/users/
  • 97. Highlights (Projects)
    ‣ Pure Java application
    ‣ Search, indexing and scoring are done by Lucene
    ‣ Document-oriented
    ‣ Schema-less
        ‣ Well, ElasticSearch might be schema-less, Lucene isn’t!
        ‣ ElasticSearch therefore automatically detects the correct types
        ‣ However, a schema is still needed! Why?
    ‣ HTTP & JSON API for all interactions
        ‣ Indexing / Updating
        ‣ Searching
        ‣ Administration / Monitoring
    ‣ Distribution is a fundamental feature of ElasticSearch!
  • 98. Highlights (Projects)
    ‣ Facet search and filtering (values, queries, date/time ranges)
    ‣ Lots of query types
        ‣ Script filters
        ‣ Geospatial search called GeoShape Query
    ‣ Configurable caching for
        ‣ Filters
        ‣ Field data
    ‣ NRT search with separate API
    ‣ Sorting, Highlighting
    ‣ MoreLikeThis based on document or field
    ‣ Multi Tenancy:
        ‣ Define multiple indices that e.g. handle documents differently during indexing
        ‣ Still, you can search over them with one query
  • 99. Highlights (Projects)
    ‣ The ElasticSearch Gateway Module stores indices and metadata to:
        ‣ Local FS, Shared FS, Hadoop, Amazon S3
    ‣ River Interface:
        ‣ Pluggable service to constantly pull data
        ‣ Managed via a specific REST endpoint
        ‣ Implementations for CouchDB, MongoDB
    ‣ Lucene Analyzer specification via elasticsearch.yml or API
    ‣ Bulk indexing
        ‣ Default: single document indexing
        ‣ Bulk indexing via specific REST endpoints
  • 100. Pros & Cons (Projects)
    + Simple but effective architecture
    + Ease of use, even when using distributed search
    + High maturity, even though ES is young
    + Modern technologies used
    + HTTP and JSON only
    - Shard splitting is not trivial
    - Still a small community and a small group of core developers
    - Compared to Solr:
        - Fewer query types
        - Fewer possibilities for boosting
        - Fewer analyzers
        - Missing features such as clustering, autocomplete, spell checking
  • 101. ES Exercise I
    ‣ Installation
    ‣ Indexing
    ‣ Queries I
  • 102. Installation (ES Exercise I)
    ‣ On Linux systems
        unzip elasticsearch-0.20.6.zip
        cd elasticsearch-0.20.6
        bin/elasticsearch -f
    ‣ On Windows systems
        [unzip elasticsearch-0.20.6.zip]
        cd elasticsearch-0.20.6
        bin\elasticsearch.bat -f
    ‣ Run
        curl -X GET http://localhost:9200/
  • 103. Installation (ES Exercise I)
    ‣ On Linux systems
        unzip elasticsearch-0.20.6.zip
        cd elasticsearch-0.20.6
        bin/elasticsearch -p path/to/pidfile
    ‣ Run
        curl -X GET http://localhost:9200/
    ‣ Shutdown
        curl -XPOST 'http://localhost:9200/_shutdown'
        curl -XPOST 'http://localhost:9200/_cluster/nodes/_shutdown'
  • 104. ES_HOME (ES Exercise I)
    ‣ bin/
        ‣ elasticsearch [elasticsearch.bat] to start the elasticsearch server
        ‣ script plugin [plugin.bat] to install plugins
    ‣ config/
        ‣ contains the global configuration
        ‣ server config file elasticsearch.yml
        ‣ logging config file logging.yml
    ‣ data/
        ‣ standard directory containing index data
        ‣ configurable by path.data
  • 105. ES_HOME (ES Exercise I)
    ‣ lib/
        ‣ shared library directory
        ‣ place additional libraries here
    ‣ logs/
        ‣ log files are placed here when the default log configuration is used
        ‣ configurable by path.logs in elasticsearch.yml
  • 106. Terminology (ES Exercise I)
    ‣ cluster
        ‣ one or more nodes form a cluster
        ‣ usually distributed over various machines
        ‣ one master node that is chosen automatically
    ‣ node
        ‣ running instance of elasticsearch
        ‣ a node automatically discovers other nodes at start-up
        ‣ node discovery is done using either unicast or multicast messages
    ‣ index
        ‣ separate document database model with its own mapping and types
        ‣ is partitioned into one or more primary and replica shards
  • 107. Terminology (ES Exercise I)
    ‣ mapping
        ‣ schema definition defining types with their associated fields
        ‣ field types and properties
    ‣ shard
        ‣ low-level data structure of elasticsearch
        ‣ single Lucene index
        ‣ managed automatically by elasticsearch
    ‣ primary shard
        ‣ every document is stored in exactly one primary shard
        ‣ all primary shards together make up the documents of the index
        ‣ default: 5 primary shards
  • 108. Terminology (ES Exercise I)
    ‣ replica shard
        ‣ each primary shard is replicated 0 or more times
        ‣ replica shards are distributed automatically
        ‣ replica shards are used for search and primary shard fail-over
    ‣ type
        ‣ within an index zero or more types can be defined
        ‣ a type defines a certain set of fields, similar to a table structure
        ‣ types are defined in the mapping
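That every document lives in exactly one primary shard follows from a deterministic routing step: a hash of the routing value (by default the document ID) is taken modulo the number of primary shards. The sketch below illustrates that idea only. `pick_shard` is a hypothetical helper, and CRC32 is a stand-in hash, not the function Elasticsearch actually uses.

```python
import zlib

def pick_shard(doc_id, number_of_primary_shards=5):
    """Map a document ID to one primary shard (illustrative stand-in hash)."""
    # Assumption: crc32 merely stands in for the real routing hash.
    digest = zlib.crc32(doc_id.encode("utf-8"))
    return digest % number_of_primary_shards

# The same ID always lands on the same shard.
shard_for_doc_1 = pick_shard("1")
shard_assignments = {doc_id: pick_shard(doc_id) for doc_id in ["1", "2", "3", "4"]}
```

The modulo step also explains why the number of primary shards is fixed once an index has been created: changing it would send existing IDs to different shards.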
  • 109. Index API (ES Exercise I)
    ‣ Index API
        ‣ index (PUT/POST)
        ‣ update (PUT/POST)
        ‣ delete (DELETE)
        ‣ delete by query (DELETE)
    ‣ Documents are defined as JSON objects
    ‣ index and type are defined in the URL path
    ‣ automatic creation of an index and mapping
        ‣ action.auto_create_index
        ‣ index.mapper.dynamic
    ‣ elasticsearch automatically identifies field types based on the JSON input
    ‣ automatic ID generation
  • 110. Index API (ES Exercise I)
    ‣ Index a book
        $ curl -XPUT http://localhost:9200/books/book/1 -d '{
            "author" : "bernhard pflugfelder",
            "post_date" : "2013-04-22T14:12:12",
            "title" : "my first book",
            "abstract" : "this book is about elasticsearch"
        }'
    ‣ Index a book while defining a named type
        $ curl -XPUT http://localhost:9200/books/book/1 -d '{
            "book" : {
                "author" : "bernhard pflugfelder",
                "post_date" : "2013-04-22T14:12:12",
                "title" : "my first book",
                "abstract" : "this book is about elasticsearch"
            }
        }'
  • 111. Index API (ES Exercise I)
    ‣ Index a book with automatic ID generation
        $ curl -XPOST http://localhost:9200/books/book/ -d '{
            "author" : "bernhard pflugfelder",
            "post_date" : "2013-04-22T14:12:12",
            "title" : "my first book",
            "abstract" : "this book is about elasticsearch"
        }'
    ‣ Result
        {
            "ok" : true,
            "_index" : "books",
            "_type" : "book",
            "_id" : "6a8ca01c-7896-48e9-81cc-9f70661fcb32",
            "_version" : 1
        }
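The difference between the two indexing calls (PUT with an explicit ID, POST with automatic ID generation) can be captured in one helper. A minimal sketch using only the Python standard library; `index_request` is a hypothetical name, and the host, index and type are the ones from the slides. The request objects are only constructed here, nothing is sent.

```python
import json
from urllib.request import Request

def index_request(base_url, index, doc_type, doc, doc_id=None):
    """Build the HTTP request that indexes one JSON document."""
    if doc_id is None:
        # No ID given: POST to the type and let the server generate one.
        url, method = f"{base_url}/{index}/{doc_type}/", "POST"
    else:
        # Explicit ID: PUT to the full document path.
        url, method = f"{base_url}/{index}/{doc_type}/{doc_id}", "PUT"
    body = json.dumps(doc).encode("utf-8")  # json.dumps never emits trailing commas
    return Request(url, data=body, method=method)

book = {
    "author": "bernhard pflugfelder",
    "post_date": "2013-04-22T14:12:12",
    "title": "my first book",
    "abstract": "this book is about elasticsearch",
}

with_id = index_request("http://localhost:9200", "books", "book", book, doc_id=1)
auto_id = index_request("http://localhost:9200", "books", "book", book)
```

Against a running node, a request built this way could be sent with `urllib.request.urlopen(with_id)`.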
  • 112. Index API (ES Exercise I)
    ‣ Update operations are done by providing a script that manipulates the field structure
    ‣ The update process consists of the following steps:
        ‣ fetch the requested document
        ‣ apply the script
        ‣ index the result as a new document
    ‣ Only the source field _source can be updated
    ‣ _source is stored in the index by default
        ‣ stores the actual JSON used at index time
        ‣ can be disabled for every type separately
        ‣ can be compressed (from version 0.90 compression is done automatically)
        {
            "book" : {
                "_source" : { "enabled" : false }
            }
        }
  • 113. Index API (ES Exercise I)
    ‣ Create a new field tag
        curl -XPOST localhost:9200/books/book/1/_update -d '{
            "script" : "ctx._source.tag = \"search\""
        }'
    ‣ Replace the value of field tag
        curl -XPOST localhost:9200/books/book/1/_update -d '{
            "script" : "ctx._source.tag = \"search technologies\""
        }'
    ‣ Add an additional value for the field tag
        curl -XPOST localhost:9200/books/book/1/_update -d '{
            "script" : "ctx._source.tags += tag",
            "params" : { "tag" : "open source technologies" }
        }'
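Update scripts contain double quotes that must be escaped inside the JSON body, which is easy to get wrong by hand. Letting a JSON library serialize the body avoids this. A small sketch; the script texts are those from the slide, and `update_body` is a hypothetical helper that only produces the request body.

```python
import json

def update_body(script, params=None):
    """Serialize the body of an _update request with correct escaping."""
    body = {"script": script}
    if params:
        body["params"] = params  # values referenced by name inside the script
    return json.dumps(body)

# Inner double quotes are escaped automatically as \" in the output.
create_tag = update_body('ctx._source.tag = "search"')

# Passing the value as a parameter avoids quoting inside the script altogether.
add_tag = update_body("ctx._source.tags += tag", {"tag": "open source technologies"})
```

Using `params` rather than string concatenation also keeps the script text constant across requests.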
  • 114. Index API (ES Exercise I)
    ‣ Delete a document based on its unique ID
        $ curl -XDELETE http://localhost:9200/books/book/1
    ‣ Delete documents based on a search query
        $ curl -XDELETE http://localhost:9200/books/book/_query -d '{
            "term" : { "author" : "bernhard pflugfelder" }
        }'
  • 115. Search API (ES Exercise I)
    ‣ Term query
        $ curl -XGET http://localhost:9200/books/book/_search -d '{
            "query" : {
                "term" : { "author" : "bernhard" }
            }
        }'
    ‣ Terms query
        $ curl -XGET http://localhost:9200/books/book/_search -d '{
            "query" : {
                "terms" : {
                    "author" : [ "bernhard", "pflugfelder" ],
                    "minimum_match" : 1
                }
            }
        }'
  • 116. Search API (ES Exercise I)
    ‣ Match queries accept text, numeric and date values
    ‣ Match queries are applied per field; the proper analyzer is chosen automatically
    ‣ Types of match queries
        ‣ boolean (default)
        ‣ phrase match
        ‣ phrase prefix match
        ‣ multi match (two or more fields are searched)
  • 117. Search API (ES Exercise I)
    ‣ Simple syntax
        $ curl -XGET http://localhost:9200/books/book/_search -d '{
            "query" : {
                "match" : { "author" : "bernhard" }
            }
        }'
    ‣ Extended syntax
        {
            "match" : {
                "abstract" : {
                    "query" : "about elasticsearch",
                    "operator" : "and"
                }
            }
        }
      Param name | Param value | Description
      operator   | "and", "or" | boolean operator
      fuzziness  | 0.0 – 1.0   | add fuzziness to the original terms
  • 118. Search API (ES Exercise I)
    ‣ Simple syntax
        $ curl -XGET http://localhost:9200/books/book/_search -d '{
            "query" : {
                "match_phrase" : { "abstract" : "about elasticsearch" }
            }
        }'
    ‣ Extended syntax
        {
            "match_phrase" : {
                "abstract" : {
                    "query" : "about elasticsearch",
                    "operator" : "and"
                }
            }
        }
      Param name | Param value | Description
      slop       | number      | phrase sloppiness
      analyzer   | string      | analyzer name to be used for the query
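The simple and the extended syntax differ only in whether the field maps to a plain string or to an object with a `query` key plus options. A minimal sketch of a builder that covers both forms; `match_query` is a hypothetical helper that returns the request body as a Python dictionary and sends nothing.

```python
import json

def match_query(field, text, phrase=False, **options):
    """Build a match or match_phrase search body in simple or extended syntax."""
    kind = "match_phrase" if phrase else "match"
    if options:
        # Extended syntax: the field maps to an object with "query" plus options.
        value = {"query": text, **options}
    else:
        # Simple syntax: the field maps directly to the query text.
        value = text
    return {"query": {kind: {field: value}}}

simple = match_query("abstract", "about elasticsearch")
extended = match_query("abstract", "about elasticsearch", operator="and")
sloppy_phrase = match_query("abstract", "about elasticsearch", phrase=True, slop=2)

# Serialized form as it would be passed to curl -d.
extended_json = json.dumps(extended)
```

The option names (`operator`, `slop`) are the ones listed in the parameter tables; the values are examples.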
  • 119. ES Exercise II
    ‣ Mapping (aka schema)
    ‣ Field types
    ‣ Analyzers
    ‣ Queries II
  • 120. Mapping (ES Exercise II)
    ‣ The schema mapping defines the index structure and document representation
    ‣ Elasticsearch works without an explicit schema (“schema-less”)
        ‣ Automatic inference is, however, dangerous in many situations
        ‣ Thus, defining an explicit schema is the preferred way
    ‣ A mapping consists of:
        ‣ type name
        ‣ list of fields (i.e. properties)
        ‣ each property defines a field type and, optionally, field attributes
    ‣ Mappings are formatted in JSON
    ‣ Mappings are managed using the Mapping API (PUT / POST / GET)
  • 121. Mapping (ES Exercise II)
    ‣ Define a mapping for type books
        $ echo '{
            "books" : {
                "properties" : {
                    "id" : { "type" : "string" },
                    "title" : { "type" : "string" },
                    "author" : { "type" : "string" },
                    "subject" : { "type" : "string" },
                    "view_count" : { "type" : "integer" },
                    "created" : { "type" : "date", "format" : "dateOptionalTime" }
                }
            }
        }' > book.json
        $ curl -XPUT localhost:9200/gutenberg/books/_mapping -d @book.json
    ‣ Retrieve the current mapping for type books
        $ curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'
  • 122. Mapping (ES Exercise II)
    ‣ Field types
        ‣ string, date
        ‣ number
            ‣ byte, short, integer, long, float, double
        ‣ boolean, binary (BASE64)
    ‣ Common field attributes
      Name       | Value    | Description
      index_name | string   | Field name stored within the index
      index      | yes / no | Field shall be searchable
      store      | yes / no | Original values shall be stored
      analyzer   | string   | Analyzer used for that field
      null_value | value    | Default field value if a document does not provide a value
  • 123. Analyzers (ES Exercise II)
    ‣ Analyzers are defined either
        ‣ in elasticsearch.yml or elasticsearch.json
        ‣ by the Index API
    ‣ Common analyzers
        ‣ standard
        ‣ whitespace
        ‣ stop
        ‣ keyword
        ‣ language
        ‣ snowball
        curl 'localhost:9200/_analyze?analyzer=standard' -d 'elasticsearch is groovy!'
        curl 'localhost:9200/_analyze?analyzer=whitespace' -d 'elasticsearch is groovy!'
        curl 'localhost:9200/_analyze?analyzer=stop' -d 'elasticsearch is groovy!'
        curl 'localhost:9200/_analyze?analyzer=keyword' -d 'elasticsearch is groovy!'
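Before running the `_analyze` calls it can help to have an expectation of what the two simplest analyzers do. The sketch below imitates only the whitespace and the keyword analyzer, whose behaviour is easy to state: the first splits on whitespace without further processing, the second keeps the whole input as one token. The other analyzers involve stop words, lowercasing or stemming and are deliberately not imitated here. Both function names are hypothetical.

```python
def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def keyword_analyzer(text):
    """Treat the complete input as a single token."""
    return [text]

sample = "elasticsearch is groovy!"
whitespace_tokens = whitespace_analyzer(sample)
keyword_tokens = keyword_analyzer(sample)
```

Comparing these lists with the tokens returned by the `_analyze` endpoint is a quick way to check that the intended analyzer was actually applied.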
  • 124. Analyzers (ES Exercise II)
        discovery.zen.multicast.enabled: false
        http:
          max_content_length: 100000
        index:
          number_of_shards: 1
          analysis:
            analyzer:
              default:
                type: standard
              lowercase_analyzer:
                type: custom
                tokenizer: standard
                filter: [standard, lowercase]
  • 125. Highlighting (ES Exercise II)
    ‣ Elasticsearch provides two highlighting algorithms
        ‣ fast vector highlighter
        ‣ highlighter (standard implementation)
    ‣ Requirement to use the fast vector highlighter
        {
            "books" : {
                "properties" : {
                    "title" : {
                        "type" : "string",
                        "term_vector" : "with_positions_offsets"
                    }
                }
            }
        }
    ‣ Highlighting request
        {
            "query" : {...},
            "highlight" : {
                "pre_tags" : ["<tag1>", "<tag2>"],
                "post_tags" : ["</tag1>", "</tag2>"],
                "fields" : {
                    "_all" : {}
                }
            }
        }
  • 126. ES Exercise III
    ‣ Faceted search
    ‣ Filter query
    ‣ Sorting
    ‣ More Like This
  • 127. Faceted search (ES Exercise III)
    ‣ Elasticsearch provides the following facet mechanisms:
        ‣ Group results by a field value
        ‣ Group by numeric or date ranges
        ‣ Group numeric or date values in equally sized buckets (histogram)
        ‣ Group results around a coordinate based on the geo distance
    ‣ Basic facet definition
        {
            "facets" : {
                "<FACET NAME>" : {
                    "<FACET TYPE>" : { ... },
                    "global" : true
                }
            }
        }
    ‣ Facet types: terms, range, histogram, date_histogram, geo_distance
  • 128. Faceted search (ES Exercise III)
        curl -X POST 'http://localhost:9200/gutenberg/books/_search?pretty=1' -d '{
            "from": 0,
            "size": 10,
            "query": {
                "match": { "author": "schiller" }
            },
            "facets": {
                "tagsFacet": {
                    "terms": { "field": "subject", "size": 10 }
                }
            }
        }'
  • 129. Faceted search (ES Exercise III)
        {
            "query" : { "match_all" : {} },
            "facets" : {
                "range1" : {
                    "range" : {
                        "view_count" : [
                            { "to" : 50 },
                            { "from" : 20, "to" : 70 },
                            { "from" : 70, "to" : 120 },
                            { "from" : 150 }
                        ]
                    }
                }
            }
        }
  • 130. Faceted search (ES Exercise III)
    ‣ The histogram facet works on any numeric field
    ‣ Field values are rounded to fit in the respective bucket
    ‣ The property interval defines the bucket size
        {
            "query" : { "match_all" : {} },
            "facets" : {
                "histo1" : {
                    "histogram" : {
                        "field" : "view_count",
                        "interval" : 100
                    }
                }
            }
        }
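The "rounding" of field values into buckets is plain integer arithmetic: each value is rounded down to the nearest multiple of the interval, and that multiple serves as the bucket key. A minimal sketch of the calculation, assuming non-negative integer values; the view counts are made-up sample data and `histogram` is a hypothetical helper, not an Elasticsearch API.

```python
from collections import Counter

def histogram(values, interval):
    """Count values per bucket, keyed by the bucket's lower bound."""
    # Round each value down to the nearest multiple of the interval.
    keys = ((value // interval) * interval for value in values)
    return dict(sorted(Counter(keys).items()))

# Hypothetical view counts with the interval of 100 used on the slide.
buckets = histogram([5, 99, 100, 250], 100)
```

In this sample, 5 and 99 share the bucket with key 0, while 100 already starts the next bucket.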
  • 131. Filter query (ES Exercise III)
    ‣ Elasticsearch also provides filter queries, which are cached internally for optimal performance
    ‣ A filter can be applied to the result set returned by the query, as shown here
        curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
            "query" : {
                "term" : { "title" : "schiller" }
            },
            "filter" : {
                "term" : { "subject" : "drama" }
            },
            "facets" : {
                "tag" : {
                    "terms" : { "field" : "subject" }
                }
            }
        }'
  • 132. Filter query (ES Exercise III)
    ‣ Alternatively, the filter is applied while the user query is being evaluated in the first place
    ‣ Difference to the previous filter query?
        curl -XPOST 'localhost:9200/books/_search?pretty=1' -d '{
            "query" : {
                "filtered" : {
                    "query" : {
                        "term" : { "author" : "schiller" }
                    },
                    "filter" : {
                        "range" : {
                            "view_count" : { "from" : 50, "to" : 100 }
                        }
                    }
                }
            }
        }'
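The two filter variants differ in what the facets get to see. A top-level filter narrows only the returned hits, so facets are still computed on everything the query matched; a filtered query narrows the matches themselves, so hits and facets shrink together. The following sketch imitates this on a small in-memory list. The documents, the helper names and the predicate style are all illustrative assumptions, not Elasticsearch code.

```python
def term(field, value):
    """Return a predicate that matches documents with field == value."""
    return lambda doc: doc[field] == value

def facet_counts(docs, field):
    """Count documents per value of the given field."""
    counts = {}
    for doc in docs:
        counts[doc[field]] = counts.get(doc[field], 0) + 1
    return counts

def search_with_top_level_filter(docs, query, flt, facet_field):
    matched = [doc for doc in docs if query(doc)]
    facets = facet_counts(matched, facet_field)    # facets see all query matches
    hits = [doc for doc in matched if flt(doc)]    # the filter narrows only the hits
    return hits, facets

def search_with_filtered_query(docs, query, flt, facet_field):
    matched = [doc for doc in docs if query(doc) and flt(doc)]
    return matched, facet_counts(matched, facet_field)  # facets see filtered matches

# Hypothetical documents.
library = [
    {"title": "schiller", "subject": "drama"},
    {"title": "schiller", "subject": "poetry"},
    {"title": "goethe", "subject": "drama"},
]
query, flt = term("title", "schiller"), term("subject", "drama")

hits_a, facets_a = search_with_top_level_filter(library, query, flt, "subject")
hits_b, facets_b = search_with_filtered_query(library, query, flt, "subject")
```

Both variants return the same hit, but only the top-level filter still reports the "poetry" facet value, which is what a navigation sidebar typically needs.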
  • 133. Sorting (ES Exercise III)
    ‣ Sorting is done based on one or multiple fields
    ‣ In case of multiple sort fields, sorting is done per field in the given order
    ‣ ascending / descending sorting
    ‣ _score refers to sorting based on the score
        curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
            "sort" : [
                { "view_count" : { "order" : "desc" } },
                "_score"
            ],
            "query" : {
                "term" : { "title" : "schiller" }
            }
        }'
  • 134. mlt query (ES Exercise III)
        curl -XPOST 'localhost:9200/gutenberg/books/_search?pretty=1' -d '{
            "query" : {
                "more_like_this" : {
                    "fields" : ["title", "subject"],
                    "like_text" : "text like this one",
                    "min_term_freq" : 1,
                    "max_query_terms" : 12
                }
            }
        }'
      Name                   | Value        | Description
      fields                 | fieldname(s) | List of fields used for mlt
      like_text              | string       | The text to find similar docs for
      min_term_freq          | number       | Minimum term frequency
      max_query_terms        | number       | Maximum number of query terms
      min_doc_freq           | number       | Minimum document frequency
      max_doc_freq           | number       | Maximum document frequency
      percent_terms_to_match | 0.0 – 1.0    | Percentage of terms that must match