Your Own Search Engine Based on Apache Solr

Presentation prepared by Tomasz Sobczak.
http://stacja.it

Published in: Technology

  1. 1. Your Own Search Engine Based on Apache Solr STACJA.IT 25.06.2016
  2. 2. Agenda 09.15 – 10.15 1. Introduction 2. Installation and overview of Apache Solr 3. Preparing the data model 10.30 – 13.00 1. Indexing content from a website 2. Break 11.30-11.45 3. Full-text search 13.00 – 14.00 LUNCH 14.00 – 14.45 1. Building dynamic navigation 2. Building query suggestions 15.00 – 16.30 1. Search engine: web application and user interface 16.30 – 17.00 1. Full training offer 2. Discussion, end of the training
  3. 3. About me • Consultant at Findwise • Lucene / Solr / Elasticsearch • and other Apache stuff • All about search! • Enterprise Search Warsaw Meetup • https://pl.linkedin.com/in/sobczakt • https://twitter.com/sobczakt
  4. 4. Apache Solr
  5. 5. Introduction Why? • Relevancy • Performance • Scalability • Flexibility Challenges: • Support & bug fixing • Quality control • Upgrades
  6. 6. Introduction (Virtual) Machine Operating System Java Virtual Machine Tomcat/Jetty/WebSphere/JBoss Solr faceting, replication, caching, distributed search, admin, Lucene best practices Lucene Java core search, analysis tools, hit highlighting, spell checking
  7. 7. Introduction Caches RequestHandlers Schema Updates Search Admin Data Import Handler Lucene Transaction Log Update Processors
  8. 8. Introduction Index Query Request Handler Response Writer Response UpdateHandler Solr Lucene Data Source
  10. 10. Introduction Query Response http://hostname:port/solr/core/select?q=search { "responseHeader":{ "status":0, "QTime":1, "params":{ "indent":"true", "q":"search", "wt":"json" } }, "response":{ "numFound":100, "start":0, "docs":[ ] } } Solr
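
The request/response cycle on this slide can be sketched in Python. This is a minimal illustration, not part of the original deck: the helper names `build_select_url` and `summarize_response` are hypothetical, the host `localhost:8983` is an assumption, and the response body is the sample JSON from the slide rather than output from a live server.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, core, query, **params):
    """Return a Solr select URL for the given core and query string."""
    query_params = {"q": query, "wt": "json", **params}
    return f"{base_url}/solr/{core}/select?{urlencode(query_params)}"


def summarize_response(raw_json):
    """Extract status, hit count and documents from a Solr JSON response."""
    body = json.loads(raw_json)
    return {
        "status": body["responseHeader"]["status"],
        "num_found": body["response"]["numFound"],
        "docs": body["response"]["docs"],
    }


# Sample response body copied from the slide (not fetched from a server).
SAMPLE = """{"responseHeader": {"status": 0, "QTime": 1,
  "params": {"indent": "true", "q": "search", "wt": "json"}},
 "response": {"numFound": 100, "start": 0, "docs": []}}"""

# Host and port are assumptions for a local installation.
url = build_select_url("http://localhost:8983", "core", "search")
summary = summarize_response(SAMPLE)
```

In a real client the URL would be fetched with an HTTP library and the body passed to `summarize_response`.
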
  11. 11. Introduction Search components: there are many of them, e.g. Query, Spelling, Faceting, Suggest and more...
  12. 12. Lab 1. Run Solr 2. Get to know the Solr Admin GUI
  13. 13. Data model
  14. 14. Schema • Types o All types o Order doesn't matter • Fields o All fields (must have a type) o Order doesn't matter • Settings
  15. 15. Schema Schema controls analysis Index Query Data Analysis Analysis
  16. 16. Schema
  17. 17. Attributes
  18. 18. Dynamic fields Dynamic fields allow Solr to index fields that you did not explicitly define in your schema <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
  19. 19. Analyzers • An analyzer processes the text for a field • Each field type has its own analyzer • An analyzer is a combination of other classes o CharFilter o Tokenizer o TokenFilter
  20. 20. Character filters <fieldType name="text_ws" class="solr.TextField"> <analyzer> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> He went to the café. → He went to the cafe.
  21. 21. Tokenizers He went to the cafe. → [He] [went] [to] [the] [cafe.] <fieldType name="text_ws" class="solr.TextField"> <analyzer> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
  22. 22. Filters [He] [went] [to] [the] [cafe.] → [he] [went] [to] [the] [cafe.] <fieldType name="text_ws" class="solr.TextField"> <analyzer> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
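
The three analysis stages from the last slides (character filter, tokenizer, token filter) can be imitated in plain Python. This is only a rough stand-in for what Solr does: accent folding here uses Unicode decomposition instead of the `mapping-ISOLatin1Accent.txt` mapping file, and all function names are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Char filter stand-in: fold accented Latin characters to their base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def whitespace_tokenize(text):
    """Tokenizer stand-in: split on whitespace only, punctuation stays attached."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter stand-in: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the chain in the same order as the fieldType: char filter, tokenizer, filter."""
    return lowercase_filter(whitespace_tokenize(strip_accents(text)))


tokens = analyze("He went to the café.")
```

Note that the whitespace tokenizer keeps the trailing period on the last token, which is why a different tokenizer is usually chosen for natural-language text.
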
  23. 23. Schema browser
  24. 24. Analysis
  25. 25. Schemaless • Lets you construct a schema by indexing sample data, without manually editing the schema • 3 features: o Enable Managed Schema o Define an UpdateRequestProcessorChain o Make the UpdateRequestProcessorChain the Default for the UpdateRequestHandler curl "http://localhost:8983/solr/gettingstarted/update?commit=true" -H "Content-type:application/csv" -d ' id,Artist,Album,Released,Rating,FromDistributor,Sold 44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'
  26. 26. Lab 1. Create field types: text_pl, text_st 2. Create fields: title, content, url, indextime, category, title_facet 3. Don't forget about Morfologik
  27. 27. Data indexing
  28. 28. Indexing
  29. 29. Indexing: Lucene index files
      Name | Extension | Brief Description
      Segments File | segments.gen, segments_N | Stores information about segments
      Lock File | write.lock | The write lock prevents multiple IndexWriters from writing to the same file.
      Compound File | .cfs | An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.
      Compound File Entry table | .cfe | The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (since 3.4)
      Fields | .fnm | Stores information about the fields
      Field Index | .fdx | Contains pointers to field data
      Field Data | .fdt | The stored fields for documents
      Term Infos | .tis | Part of the term dictionary, stores term info
      Term Info Index | .tii | The index into the Term Infos file
      Frequencies | .frq | Contains the list of docs which contain each term along with frequency
      Positions | .prx | Stores position information about where a term occurs in the index
      Norms | .nrm | Encodes length and boost factors for docs and fields
      Term Vector Index | .tvx | Stores offset into the document data file
      Term Vector Documents | .tvd | Contains information about each document that has term vectors
      Term Vector Fields | .tvf | The field level info about term vectors
      Deleted Documents | .del | Info about what files are deleted
  30. 30. Indexing [Diagram: an index consists of Segment 0, Segment 1, Segment 2 ... Segment N plus a segments file; each segment holds a term dict, stored fields, term pos, term freq and norms]
  31. 31. Indexing Inverted index • An alternative to comparing strings • Words take little space, but their occurrences take quite a lot • Lucene stores a lot of data associated with documents
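
To make the inverted index idea concrete, here is a toy version in Python. It is a simplified sketch, not Lucene's on-disk format: it maps each term to the documents containing it and to the positions of each occurrence, which also shows why occurrences take more space than the words themselves. The two sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map term -> {doc_id: [positions]} for whitespace-separated, lowercased text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical sample documents.
docs = {
    1: "solr is a search server",
    2: "lucene is a search library",
}
index = build_inverted_index(docs)

# Looking up a term is a dictionary access instead of a scan over every document.
matching_docs = sorted(index["search"])
```
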
  32. 32. Indexing Most common ways of loading data: • HTTP requests • Client API • Data Import Handler • ExtractingRequestHandler and Tika
  33. 33. Data Import Handler • Datasource, entity, processor, transformer • Methods • abort, delta-import, full-import, reload-config, status • Data sources • JdbcDataSource, FileDataSource, URLDataSource • Processors • SqlEntityProcessor, XPathEntityProcessor, TikaEntityProcessor • Transformers • DateFormatTransformer, HTMLStripTransformer, RegexTransformer
  34. 34. Data Import Handler <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" /> <document> <entity name="item" query="select * from item" deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'"> <field column="NAME" name="name" /> <entity name="feature" query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'" parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"> <field name="features" column="DESCRIPTION" /> </entity> <entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'" parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}"> <entity name="category" query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'" deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'" parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"> <field column="description" name="cat" /> </entity> </entity> </entity> </document> </dataConfig>
  36. 36. Data Import Handler <dataConfig> <dataSource type="HttpDataSource"/> <document> <entity name="slashdot" pk="link" url="http://rss.slashdot.org/Slashdot/slashdot" processor="XPathEntityProcessor" forEach="/RDF/channel | /RDF/item" transformer="DateFormatTransformer"> <field column="source" xpath="/RDF/channel/title" commonField="true"/> <field column="source-link" xpath="/RDF/channel/link" commonField="true"/> <field column="subject" xpath="/RDF/channel/subject" commonField="true"/> <field column="title" xpath="/RDF/item/title"/> <field column="link" xpath="/RDF/item/link"/> <field column="description" xpath="/RDF/item/description"/> <field column="creator" xpath="/RDF/item/creator"/> <field column="item-subject" xpath="/RDF/item/subject"/> <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/> <field column="slash-department" xpath="/RDF/item/department"/> <field column="slash-section" xpath="/RDF/item/section"/> <field column="slash-comments" xpath="/RDF/item/comments"/> </entity> </document> </dataConfig>
  37. 37. Indexing External pipeline Update handler Analysis Index
  38. 38. Indexing UpdateRequestProcessor UpdateRequestProcessor UpdateRequestProcessor UpdateRequestProcessor UpdateRequestProcessorChain
  39. 39. Indexing • A segment is an index o But treated as part of a whole • New data means new segments o Each commit makes a new segment • Segments get merged when there are too many o Controlled by "Merge Policy"
  40. 40. Indexing • Segmented, write-once architecture has benefits o Replication o Updates while active o Efficient storage • Drawbacks o Updates expensive o Updates can be slow o Merge costs
  41. 41. Indexing • Solr scales very well with data size • A search engine is quick to query but time-consuming to insert into/index o The opposite of a database, which is quick to insert into while queries can be time-consuming o Considerations on index-time vs query-time ▪ Some settings apply at both index time and query time (language, stemming) ▪ Some settings require choosing index time or query time (for example synonyms)
  42. 42. Commits • A commit makes new data available to Solr • Data in RAM is flushed • Any data on disk since last commit is included • This forms a new segment • A new segments file is written with updated information • Causes the opening of a new IndexSearcher
  43. 43. Commits The Commit Process • As content is added to a Lucene/Solr index it is not searchable until a commit occurs. (soft commit is enough) • Solr uses an IndexSearcher to hold a view of the index in memory • A commit initiates the opening of a new IndexSearcher that will have a new view of the index which includes newly written segments since the last commit occurred
  44. 44. Commits • Soft Commit – kept in RAM/transaction log, visible to searches • Hard Commit – writes to disk, configurable to open a new searcher • Optimize – merges index files into one, perform infrequently
  45. 45. Optimize • Deleting documents from the index creates "empty space" in the Lucene index files • Optimizing squeezes out the empty space • Indexing and queries can sometimes be slower during an optimize • This is one reason why it’s advisable to have separate index and searcher nodes • When optimizing is finished, the entire Lucene index file set is new • Replication copies the entire index after an optimize
  46. 46. Updating document • Atomic updates o set, add, remove, removeregex, inc o Fields need to be stored • Optimistic Concurrency o _version_ or own field • Combining both strategies
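
An atomic update is sent to Solr as a document whose field values are small objects naming the modifier (`set`, `add`, `remove`, `removeregex`, `inc`). The sketch below only builds such a JSON payload; it does not send it. The helper name `atomic_update` and the field names `price` and `popularity` are assumptions for illustration; including `_version_` is how optimistic concurrency is combined with the update.

```python
import json

# Modifiers listed on the slide.
ALLOWED_MODIFIERS = {"set", "add", "remove", "removeregex", "inc"}


def atomic_update(doc_id, version=None, **modifiers):
    """Build a JSON update body; each keyword is field=(modifier, value)."""
    doc = {"id": doc_id}
    if version is not None:
        # Optimistic concurrency: the update is rejected if the stored version differs.
        doc["_version_"] = version
    for field, (modifier, value) in modifiers.items():
        if modifier not in ALLOWED_MODIFIERS:
            raise ValueError(f"unsupported modifier: {modifier}")
        doc[field] = {modifier: value}
    return json.dumps([doc])


# Hypothetical document id, version and fields.
payload = atomic_update("doc1", version=123, price=("set", 99), popularity=("inc", 1))
```

The resulting string would be posted to the core's update handler with a JSON content type.
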
  47. 47. Lab 1. Index data with DIH 2. Index data with Norconex
  48. 48. Searching
  49. 49. Sorting • Solr can sort by o Score o A value in a field o A function • In ascending or descending order • &sort=price asc,manu_exact desc • By default, Solr sorts in Unicode order o Locale-aware collation can differ: for example in Danish, Aalop comes after Zebra, because Aa is considered an alternate spelling of Å
  50. 50. Query parsers • The Standard Query Parser • The Extended DisMax Query Parser • Other Parsers
  51. 51. Query parsers • The Standard Query Parser • The Extended DisMax Query Parser • Other Parsers • Common parameters defType, sort, start, rows, fq, fl, debug, explainOther, timeAllowed, omitHeader, wt, logParamsList • Local parameters q={!q.op=AND df=title}solr rocks q={!dismax qf=myfield}solr rocks
  52. 52. Standard Query Parser • q, q.op, df • Breaks the query into terms and operators • Single terms and phrases • Boosts: ^ • Constant score: ^= • Boolean operators
  53. 53. Standard Query Parser • Boolean query +this -that this AND that • Field query title:this description:that • Range query price:[0 TO 100] -price:[100 TO *]
  54. 54. Standard Query Parser • Phrase / proximity query "Harry Potter" "Harry Potter"~1 • Multi-term query title:apache website title:(jakarta OR apache) AND website • Fuzzy query roam~ • Wildcard query te?t tes*
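
When queries like the ones above are assembled in application code, the phrase quoting and URL encoding are easy to get wrong. The following sketch shows one way to compose a proximity phrase and a range clause and encode them into a `q` parameter. The helper names `phrase` and `field_range` are hypothetical and the escaping covers only backslashes and double quotes inside a phrase.

```python
from urllib.parse import parse_qs, urlencode


def phrase(text, slop=None):
    """Quote a phrase, optionally adding a proximity slop such as "a b"~1."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    quoted = f'"{escaped}"'
    return quoted if slop is None else f"{quoted}~{slop}"


def field_range(field, low="*", high="*"):
    """Build an inclusive range clause such as price:[0 TO 100]."""
    return f"{field}:[{low} TO {high}]"


query = f"title:{phrase('Harry Potter', slop=1)} AND {field_range('price', 0, 100)}"

# urlencode takes care of quotes, colons, brackets and spaces in the query string.
encoded = urlencode({"q": query})
```
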
  55. 55. eDisMax Query Parser • Chooses the max of a field score • TF-IDF can change dramatically between fields • qf=text^1 catch_line^5 tags^3 • tie of 1.0 just turns the score into something of a "DisSum" o max(scores) + tie * sum(otherscores) • Parameters: q, q.alt, qf, mm, pf, pf2, pf3, ps, ps2, ps3, qs, tie, bq, bf, boost, uf, stopwords
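
The tie-breaker formula on this slide, max(scores) + tie * sum(otherscores), can be checked with a few lines of Python. This is a sketch of the formula only, with made-up per-field scores; it does not reproduce how Solr computes the individual field scores.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: the best field wins, the others count by `tie`."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Hypothetical scores of one document in three fields listed in qf.
scores = [2.0, 1.0, 0.5]

pure_max = dismax_score(scores, tie=0.0)   # only the best field counts
dis_sum = dismax_score(scores, tie=1.0)    # all fields are summed ("DisSum")
blended = dismax_score(scores, tie=0.1)    # mostly max, slight credit for other fields
```
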
  56. 56. Other parsers: Block Join Query Parsers, Boost Query Parser, Collapsing Query Parser, Complex Phrase Query Parser, Field Query Parser, Function Query Parser, Function Range Query Parser, Join Query Parser, Lucene Query Parser, Max Score Query Parser, More Like This Query Parser, Nested Query Parser, Old Lucene Query Parser, Prefix Query Parser, Raw Query Parser, Re-Ranking Query Parser, Simple Query Parser, Spatial Query Parsers, Surround Query Parser, Switch Query Parser, Term Query Parser, Terms Query Parser
  57. 57. Join query parser fq={!join from=blog_id to=id} body:netflix id: blog1 name: Solr ‘n Stuff owner: Yonik Seeley started: 2007-10-26 id: blog2 name: lifehacker owner: Gawker Media started: 2005-1-31 id: post1 blog_id: blog1 author: Yonik Seeley title: Solr relevancy function queries body: Lucene’s default ranking […] id: post2 blog_id: blog1 author: Yonik Seeley title: Solr result grouping body: Result Grouping, also called […] id: post3 blog_id: blog2 author: Whitson Gordon title: How to Install Netflix on Android
  58. 58. Join query parser • Only show posts from blogs started in 2010 or after &fq={!join from=id to=blog_id}started:[2010 TO *] • If a post in a blog mentions embassy, search q=bomb&fq={!join from=blog_id to=blog_id}embassy • If a blog post mentions embassy, search all emails with the same blog owner for bomb q=email_body:bomb &fq={!join from=owner_email_user to=email_user} {!join from=blog_id to=id}embassy
  59. 59. Block Join Query Parser • Allows indexing and searching of relational content that has been indexed as nested documents
  60. 60. Block Join Query Parser • Block Join Children Query Parser q={!child of="content_type:parentDocument"}title:lucene • Block Join Parent Query Parser q={!parent which="content_type:parentDocument"}comments:SolrCloud
  61. 61. Block Join Query Parser • Block Join Children Query Parser q={!child of="content_type:parentDocument"}title:lucene <doc> <str name="id">4</str> <str name="comments">Lots of new features</str> </doc> • Block Join Parent Query Parser q={!parent which="content_type:parentDocument"}comments:SolrCloud <doc> <str name="id">1</str> <arr name="title"><str>Solr has block join support</str></arr> <arr name="content_type"><str>parentDocument</str></arr> </doc>
  62. 62. How it works
  63. 63. How it works
  64. 64. How it works
  65. 65. How it works
  66. 66. TFIDF • TF (term frequency) - tf(t in d) • IDF (inverse document frequency) - idf(t)
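
A small worked example may help with the two factors. The sketch below uses the textbook definitions (raw term count for TF, log(N / df) for IDF), which is an assumption made for clarity: Lucene's similarity implementations use dampened and smoothed variants of the same idea. The tiny corpus is made up.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: how often the term occurs in one document."""
    return doc_tokens.count(term)


def idf(term, corpus):
    """Inverse document frequency: rarer terms get a higher weight."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq) if doc_freq else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical, pre-tokenized corpus.
corpus = [
    ["harry", "potter", "was", "a", "wizard"],
    ["harry", "was", "a", "nice", "man"],
    ["solr", "is", "a", "search", "server"],
]

rare = tf_idf("potter", corpus[0], corpus)    # occurs in one of three documents
common = tf_idf("a", corpus[0], corpus)       # occurs everywhere, so it carries no weight
```
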
  67. 67. Precision The percentage of documents in the returned results that are relevant.
  68. 68. Recall The percentage of relevant results returned out of all relevant results in the system
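
Precision and recall as defined on the last two slides can be computed directly from the set of returned documents and the set of relevant documents. The document ids below are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list and a set of relevant ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 documents returned, 5 relevant documents exist in the system.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
```

Here two of the four returned documents are relevant, and two of the five relevant documents were found, which illustrates why the two measures usually have to be traded off against each other.
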
  69. 69. Compromise A perfect system would have 100% precision and 100% recall for every user and every query
  70. 70. Relevance Take into account: • the needs of various users • the meaningful categories • any inherent relevance of • the age of documents
  71. 71. Relevance Take into account: • the needs of various users • the meaningful categories • any inherent relevance of • the age of documents Testing methodologies!
  72. 72. Relevance What's most important? Title:potter Author:potter Description:potter Reviews:potter
  73. 73. Relevance Which is a better match? Query: harry potter Text: Harry was a nice man. He lived on main street, next door to a potter. Harry Potter was a wizard
  74. 74. Boosting • Index-time boosts o document o field • Query-time boosts o function query
  75. 75. Lab Try standard dismax queries yourself in different combinations
  76. 76. Facets
  77. 77. Faceting
  78. 78. Faceting • Values in fields o Strings, dates, numbers, etc. o Can have one value, or multiple • Queries o Range queries are most common for faceting by query o This year (Jan 1 to Dec 31) o Last year • Data in a particular format o You don't want the words in the field separated
  79. 79. Range faceting Range facets divide a range into equal sized buckets 100 300 500 700 900 facet.range.start=100 facet.range.end=900 facet.range.gap=200 facet.range.other=before facet.range.other=after Before After
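
The bucketing described by `facet.range.start`, `facet.range.end` and `facet.range.gap` can be simulated in Python. This sketch assumes each bucket includes its lower bound and excludes its upper bound, and that a value equal to the end falls into "after"; in Solr the exact boundary handling is governed by `facet.range.include`, so treat the edge cases here as an assumption. The sample values are made up.

```python
def range_buckets(start, end, gap):
    """Return (lower, upper) pairs covering start..end in steps of gap."""
    buckets = []
    lower = start
    while lower < end:
        buckets.append((lower, min(lower + gap, end)))
        lower += gap
    return buckets


def facet_counts(values, start, end, gap):
    """Count values per bucket, plus 'before' and 'after' as with facet.range.other."""
    counts = {"before": 0, "after": 0}
    counts.update({bucket: 0 for bucket in range_buckets(start, end, gap)})
    for value in values:
        if value < start:
            counts["before"] += 1
        elif value >= end:
            counts["after"] += 1
        else:
            lower = start + ((value - start) // gap) * gap
            counts[(lower, min(lower + gap, end))] += 1
    return counts


# Parameters from the slide; the values are hypothetical.
counts = facet_counts([50, 100, 299, 300, 850, 900, 1200], start=100, end=900, gap=200)
```
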
  80. 80. Hierarchical Faceting • Complex data structure • Can need more levels • sometimes lots • of them Count All • Books (2,123,456) • Computers & Technology (601,234) • Computer Science (123,456) • Artificial Intelligence (27,665) • Human-Computer Interaction (1,353) • Information Theory (2,004) • … Count Leaf Only • Books • Computers & Technology • Computer Science • Artificial Intelligence (27,665) • Human-Computer Interaction (1,353) • Information Theory (2,004) • …
  81. 81. Hierarchical Faceting • A value is represented by a path /Books/Computers/Artificial Intelligence • This may be indexed as category:/Books category:/Books/Computers category:/Books/Computers/Artificial Intelligence • PathHierarchyTokenizerFactory
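
The way a path value expands into one indexed token per hierarchy level can be mimicked as follows. This is a plain-Python approximation of the idea behind `PathHierarchyTokenizerFactory`, not its implementation; the function name is hypothetical and the sketch assumes paths that start with the delimiter.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Expand '/a/b/c' into ['/a', '/a/b', '/a/b/c'], one token per level."""
    parts = [part for part in path.split(delimiter) if part]
    return [
        delimiter + delimiter.join(parts[:depth])
        for depth in range(1, len(parts) + 1)
    ]


tokens = path_hierarchy_tokens("/Books/Computers/Artificial Intelligence")
```

Because every ancestor path is indexed, faceting on the field yields counts for each level of the hierarchy, and a filter on `/Books/Computers` matches all documents below it.
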
  82. 82. Lab Try facet queries yourself
  83. 83. Suggestions
  84. 84. Suggestions <searchComponent name="suggest" class="solr.SuggestComponent"> <lst name="suggester"> <str name="name">mySuggester</str> <str name="lookupImpl">FuzzyLookupFactory</str> <str name="dictionaryImpl">DocumentDictionaryFactory</str> <str name="field">cat</str> <str name="weightField">price</str> <str name="suggestAnalyzerFieldType">string</str> <str name="buildOnStartup">false</str> </lst> </searchComponent>
  85. 85. Suggestions <searchComponent name="suggest" class="solr.SuggestComponent"> <lst name="suggester"> <str name="name">mySuggester</str> <str name="lookupImpl">FuzzyLookupFactory</str> <!-- options: AnalyzingLookupFactory, FuzzyLookupFactory, AnalyzingInfixLookupFactory… --> <str name="dictionaryImpl">DocumentDictionaryFactory</str> <!-- options: DocumentDictionaryFactory, DocumentExpressionDictionaryFactory, HighFrequencyDictionaryFactory --> <str name="field">cat</str> <str name="weightField">price</str> <str name="suggestAnalyzerFieldType">string</str> <str name="buildOnStartup">false</str> </lst> </searchComponent>
  86. 86. Suggestions
  87. 87. Lab Build query completion (QC) based on a dismax query and the Suggest Component
  88. 88. Search application
  89. 89. Search Design • Query completion • Ambiguous phrases • Grouping • Related searches
  90. 90. Lab Let's build a search app!
  91. 91. Summary
  92. 92. Read more • https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide • http://lucenerevolution.org/ • http://solr.pl/en/ • http://yonik.com/ • https://lucidworks.com/blog/ • https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=solr • https://www.linkedin.com/groups/1557747
  93. 93. • Founded in 2005 • 100 employees • Vendor independent consultants • Sweden, Denmark, Norway, Finland, Poland & UK
  94. 94. FINDABILITY
  95. 95. Search Taxonomy and metadata Content structure and navigation Big data Training SERVICES
  96. 96. Want to know more? Training courses allow individual work with every participant • we work in groups of 4-8 people • the program can be adapted to the group's expectations • we work through and answer participants' individual questions • we have much more time :)
  97. 97. Training tailored to you. Take a look at the training programs: • Your own search engine based on Apache Solr / Elasticsearch • Scaling Apache Solr / Elasticsearch • All aspects of working on search result relevance in Apache Solr / Elasticsearch
  98. 98. Our supporters
