JavaEdge09 : Java Indexing and Searching



From AlphaCSP's Java conference, JavaEdge09: the presentation that Evgeny Borisov and I gave on 'Java Indexing and Searching'.

In this session we discussed the need for Full Text Search (as opposed to regular textual/SQL search), Lucene and its OO mismatches, the solution that Hibernate Search provides to those mismatches, and then a bit about Lucene's scoring algorithm.

  • JIRA: search for issues. Eclipse: search its documentation.
  • Talk about Hibernate and ORM very briefly, then move on to the next slide.
  • Execution: sync or async (default: sync). thread_pool.size (default: 1). buffer_queue.max (default: infinite). Be aware of OutOfMemoryError.

    1. 1. Java Indexing and Searching<br />By: Shay Sofer & Evgeny Borisov<br />
    2. 2. Motivation<br />Lucene Intro<br />Hibernate Search<br />Indexing<br />Searching<br />Scoring<br />Alternatives<br />Agenda<br />
    3. 3. Motivation<br />What is Full Text Search and why do I need it?<br />
    4. 4. Motivation<br />Use case<br />“Book” table<br />Good practices for Gava<br />
    5. 5. We’d like to:<br />Index the information efficiently<br />Answer queries using that index<br />More common than you think<br />Full Text Search<br />Motivation<br />
    6. 6. Integrated full text search engine in the database<br /> e.g. DBSight, Recent versions of MySQL, MS SQL Server, Oracle Text, etc <br />Out of the box Search Appliances <br />e.g. Google Search Appliance<br />Third party libraries<br />Full Text Search Solutions<br />Motivation<br />
    7. 7. Lucene Intro<br />
    8. 8. The most popular full text search library<br />Scalable and high performance<br />Around for about 9 years<br />Open source <br />Supported by the Apache Software Foundation<br />Apache Lucene<br />Lucene Intro<br />
    9. 9. Lucene Intro<br />
    10. 10. “Word-oriented” search<br />Powerful query syntax<br />Wildcards, typos, proximity search.<br />Sorting by relevance (Lucene’s scoring algorithm) or any other field<br />Fast searching, fast indexing<br />Inverted index.<br />Lucene’s Features<br />Lucene Intro<br />
    11. 11. Lucene Intro<br />Inverted Index<br />DB documents (id = title): 0 = Head First Java; 1 = Best of the best of the best; 2 = Chuck Norris in action; 3 = JBoss in action<br />
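To make the inverted-index idea concrete, here is a minimal sketch in plain Java that indexes the four titles from the slide. It is not Lucene's implementation: the class name `InvertedIndexSketch` and its methods are hypothetical, and tokenization is reduced to lower-casing and splitting on non-word characters.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class InvertedIndexSketch {
    // Maps each term to the sorted ids of the documents that contain it.
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Splits the text into lower-cased words and records the document id for each word.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    // Builds an index over the four titles shown on the slide.
    static InvertedIndexSketch slideExample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Head First Java");
        index.add(1, "Best of the best of the best");
        index.add(2, "Chuck Norris in action");
        index.add(3, "JBoss in action");
        return index;
    }
}
```

A lookup is a single map access on the term instead of a scan over every title, which is why searching such an index stays fast as the number of documents grows.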
    12. 12. A Field is a key+value. Value is always represented as a String (Textual)<br />A Document can contain as many Fields as we’d like<br />Lucene’s index is a collection of Documents<br />Basic Definitions<br />Lucene Intro<br />
    13. 13. Lucene Intro<br />Using Lucene API…<br />IndexSearcher is = new IndexSearcher("BookIndex");<br />QueryParser parser = new QueryParser("title", analyzer);<br />Query query = parser.parse("Good practices for Gava");<br />return;<br />
    14. 14. OO domain model Vs. Lucene’s Index structure<br />Lucene Intro<br />
    15. 15. The Structural Mismatch<br />Converting objects to strings and vice versa<br />No representation of relations between Documents<br />The Synchronization Mismatch<br />DB must be synced with the index<br />The Retrieval Mismatch<br />Retrieving documents (= pairs of key + value) and not objects<br />Object vs Flat text mismatches<br />Lucene Intro<br />
    16. 16. Hibernate Search<br />Emmanuel Bernard<br />
    17. 17. Leverages ORM and Lucene together to solve those mismatches<br />Complements Hibernate Core by providing FTS on persistent domain models.<br />It’s actually a bridge that hides the sometimes complex Lucene API usage.<br />Open source.<br />Hibernate Search<br />
    18. 18. Document = Class (Mapped POJO)<br />Hibernate Search metadata can be described by Annotations only<br />Regardless, you can still use Hibernate Core with XML descriptors (hbm files)<br />Let’s create our first mapping – Book<br />Mapping<br />Hibernate Search<br />
    19. 19. @Entity @Indexed<br />public class Book implements Serializable {<br />@Id<br />private Long id;<br />@Boost(2.0f)<br />@Field<br />private String title;<br />@Field<br />private String description;<br />private String imageURL;<br />@Field(index = Index.UN_TOKENIZED)<br />private String isbn;<br />…<br />}<br />Hibernate Search<br />
    20. 20. Types will be converted via “Field Bridge”.<br />It is a bridge between the Java type and its representation in Lucene (aka String)<br />Hibernate Search comes with a set for most standard types (Numbers – primitives and wrappers, Date, Class etc)<br />They are extendable, of course<br />Bridges<br />Hibernate Search<br />
    21. 21. We can use a field bridge…<br />@FieldBridge(impl = MyPaddedFieldBridge.class,<br />params = {@Parameter(name = "padding", value = "5")})<br />public Double getPrice() {<br />return price;<br />}<br />Or a class bridge, in case the data we want to index is more than just the field itself<br />e.g. a concatenation of 2 fields<br />Custom Bridges<br />Hibernate Search<br />
    22. 22. In order to create a custom bridge we need to implement the interface StringBridge<br />ParameterizedBridge – to inject params<br />Custom Bridges<br />Hibernate Search<br />
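The slides name `MyPaddedFieldBridge` but do not show its body, so the following is only a sketch of what such a bridge might do. To stay self-contained it declares two local stand-in interfaces that mirror the shape of Hibernate Search's `StringBridge` and `ParameterizedBridge`; real code would import them from the library, and their exact signatures may differ between versions. The fixed two-decimal format is likewise an assumption.

```java
import java.util.Locale;
import java.util.Map;

// Local stand-ins for the Hibernate Search interfaces (assumed shape, declared here
// only so the sketch compiles without the library).
interface StringBridge {
    String objectToString(Object object);
}

interface ParameterizedBridge {
    void setParameterValues(Map<String, String> parameters);
}

// Hypothetical body for the bridge named on the slide: left-pads numbers with zeros
// so that their string form sorts in numeric order.
final class MyPaddedFieldBridge implements StringBridge, ParameterizedBridge {
    private int padding = 5; // number of integer digits; overridable via the "padding" parameter

    @Override
    public void setParameterValues(Map<String, String> parameters) {
        String value = parameters.get("padding");
        if (value != null) {
            padding = Integer.parseInt(value);
        }
    }

    @Override
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        double number = ((Number) object).doubleValue();
        // Total width = integer digits + decimal point + two decimals (non-negative values only).
        int width = padding + 3;
        return String.format(Locale.ROOT, "%0" + width + ".2f", number);
    }
}
```

Padding matters because the index compares strings: unpadded, "7.0" sorts after "12.5", while the padded forms compare in numeric order.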
    23. 23. Directory is where Lucene stores its index structure.<br />Filesystem Directory Provider<br />In-memory Directory Provider<br />Clustering<br />Directory Providers<br />Hibernate Search<br />
    24. 24. Default<br />Most efficient<br />Limited only by the disk’s free space<br />Can be easily replicated<br />Luke support<br />Filesystem Directory Provider<br />Hibernate Search<br />
    25. 25. Index dies as soon as SessionFactory is closed.<br />Very useful when unit testing (alongside in-memory DBs).<br />Data can be made persistent at any moment, if needed.<br />Obviously, be aware of OutOfMemoryError<br />In-memory Directory Provider<br />Hibernate Search<br />
    26. 26. &lt;!-- Hibernate Search Config --&gt;<br />&lt;propertyname=&quot;;&gt;<br /><br />&lt;/property&gt;<br />&lt;propertyname=<br />&quot;;&gt;<br /><br />&lt;/property&gt;<br />Directory Providers Config Example<br />Hibernate Search<br />
    27. 27. Correlated queries - How do we navigate from one entity to another?<br />Lucene doesn’t support relationships between documents<br />Hibernate Search to the rescue - Denormalization<br />Relationships<br />Hibernate Search<br />
    28. 28. Hibernate Search<br />
    29. 29. @Entity @Indexed<br />public class Book {<br />@ManyToOne<br />@IndexEmbedded<br />private Author author;<br />}<br />@Entity @Indexed<br />public class Author {<br />private String firstName;<br />}<br />Object navigation is easy (author.firstName)<br />Relationships<br />Hibernate Search<br />
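Denormalization here means copying the associated entity's values into the owning entity's document under a dotted field name. The sketch below shows that flattening in plain Java under simplifying assumptions: `DenormalizationSketch`, its nested classes, the `title` property, and the `toDocument` helper are all hypothetical, and a plain map stands in for a Lucene Document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of how an embedded association ends up as flat fields.
final class DenormalizationSketch {
    static final class Author {
        final String firstName;
        Author(String firstName) { this.firstName = firstName; }
    }

    static final class Book {
        final String title;
        final Author author;
        Book(String title, Author author) {
            this.title = title;
            this.author = author;
        }
    }

    // Flattens the book and its author into key/value pairs, the only shape the index stores.
    static Map<String, String> toDocument(Book book) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", book.title);
        if (book.author != null) {
            // The association is copied into the book's own document under a dotted name.
            fields.put("author.firstName", book.author.firstName);
        }
        return fields;
    }
}
```

Because the author's name is copied into each book document, a later change to the author leaves those copies stale unless the books are re-indexed.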
    30. 30. Entities can be referenced by other entities.<br />Relationships – Denormalization Pitfall<br />Hibernate Search<br />
    33. 33. The solution: The association pointing back to the parent will be marked with @ContainedIn<br />@Entity @Indexed<br />public class Book {<br />@ManyToOne<br />@IndexEmbedded<br />private Author author;<br />}<br />@Entity @Indexed<br />public class Author {<br />@OneToMany(mappedBy = "author")<br />@ContainedIn<br />private Set<Book> books;<br />}<br />Relationships – Solution<br />Hibernate Search<br />
    34. 34. Responsible for tokenizing and filtering words<br />Tokenizing – not as trivial as it seems<br />Filtering – Clearing the noise (case, stop words etc) and applying “other” operations<br />Creating a custom analyzer is easy<br />The default analyzer is Standard Analyzer<br />Analyzers<br />Hibernate Search<br />
    35. 35. StandardTokenizer: Splits words and removes punctuation.<br />StandardFilter: Removes apostrophes and dots from acronyms.<br />LowerCaseFilter: Lowercases words.<br />StopFilter: Eliminates common words.<br />Standard Analyzer<br />Hibernate Search<br />
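The four stages listed above can be imitated in a few lines of plain Java. This is a deliberately simplified sketch, not Lucene's Standard Analyzer: the class `AnalyzerChainSketch`, its tiny stop-word list, and the regular expressions are assumptions, and acronym handling is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical, simplified imitation of a tokenizer followed by three filters.
final class AnalyzerChainSketch {
    // A tiny illustrative stop-word list; real analyzers ship longer ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "for", "in", "of", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // 1. Tokenize: split on runs of characters that are not letters, digits or apostrophes.
        for (String raw : text.split("[^\\p{L}\\p{Nd}']+")) {
            // 2. Drop a trailing possessive 's and any remaining apostrophes.
            String token = raw.replaceAll("'[sS]$", "").replace("'", "");
            // 3. Lower-case the token.
            token = token.toLowerCase(Locale.ROOT);
            // 4. Skip empty tokens and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

The same chain has to run on both the indexed text and the query text; otherwise a query for "Rings" would never meet the indexed token "rings".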
    36. 36. Other cool Filters….<br />Hibernate Search<br />
    37. 37. N-Gram algorithm – Indexing a sequence of n consecutive characters. <br /> Usually when a typo occurs, part of the word is still correct<br />Encyclopedia in 3-grams =<br />Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia<br />Approximative Search<br />Hibernate Search<br />
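A minimal sketch of n-gram generation, reproducing the "Encyclopedia" example from the slide. The class `NGramSketch` and the `shared` helper are hypothetical; a real setup would use an n-gram token filter inside the analyzer rather than hand-written code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of character n-grams.
final class NGramSketch {
    // Returns every run of n consecutive characters in the word, in order.
    static List<String> ngrams(String word, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Counts the distinct n-grams two words have in common, ignoring case.
    static int shared(String a, String b, int n) {
        Set<String> common = new HashSet<>(ngrams(a.toLowerCase(Locale.ROOT), n));
        common.retainAll(ngrams(b.toLowerCase(Locale.ROOT), n));
        return common.size();
    }
}
```

With n = 3, the misspelling "encyclopdia" still shares 7 of the 10 trigrams of "encyclopedia", which is what lets an n-gram index match a query containing a typo.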
    38. 38. Algorithms for indexing of words by their pronunciation <br />The most widely known algorithm is Soundex<br />Other Algorithms that are available : RefinedSoundex, Metaphone, DoubleMetaphone<br />Phonetic Approximation<br />Hibernate Search<br />
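As an illustration of phonetic indexing, here is a compact sketch of the classic American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent equal codes (also across H and W), and pad to four characters. The class name `SoundexSketch` is hypothetical, and a production system would more likely rely on an existing phonetic filter or library than on this hand-rolled version.

```java
import java.util.Locale;

// Hypothetical illustration of the American Soundex encoding.
final class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' marks vowels and other uncoded letters.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder code = new StringBuilder().append(letters.charAt(0));
        char previous = digit(letters.charAt(0));
        for (int i = 1; i < letters.length() && code.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters that share a code
            }
            char d = digit(c);
            if (d != '0' && d != previous) {
                code.append(d);
            }
            previous = d; // a vowel resets the previous code, so a repeated code is kept
        }
        while (code.length() < 4) {
            code.append('0'); // pad short codes with zeros
        }
        return code.toString();
    }

    private static char digit(char c) {
        return CODES.charAt(c - 'A');
    }
}
```

Indexing the Soundex code next to the original term lets "Tolkein" find "Tolkien", since both names encode to the same four characters.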
    39. 39. Synonyms<br />You can expand your synonym dictionary with your own rules (e.g. Business oriented words)<br />Stemming<br />Stemming is the process of reducing words to their stem, base or root form.<br />“Fishing”, “Fisher”, “Fish” and “Fished” → Fish<br />Snowball stemming language – supports over 15 languages<br />Synonyms & Stemming<br />Hibernate Search<br />
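To show what stemming does to the "Fish" family of words, here is a deliberately naive suffix stripper. It is not the Snowball algorithm: the class `NaiveStemmerSketch`, its four-suffix list, and the three-character minimum are assumptions chosen only to cover the slide's example.

```java
import java.util.Locale;

// Hypothetical, naive suffix stripper; real stemmers apply far more careful rules.
final class NaiveStemmerSketch {
    private static final String[] SUFFIXES = {"ing", "ed", "er", "s"};

    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words such as "red" stay intact.
            if (lower.endsWith(suffix) && lower.length() - suffix.length() >= 3) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }
}
```

All four words from the slide reduce to "fish", so a query for any of them matches documents containing the others. A rule this crude also over-stems (it would turn "butter" into "butt"), which is why a proper stemmer is used in practice.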
    40. 40. Lucene is bundled with the basic analyzers, tokenizers and filters.<br />More can be found in Lucene’s contrib modules and in Apache Solr<br />Additional Analyzers<br />Hibernate Search<br />
    41. 41. No free Hebrew analyzer for Lucene<br />Itamar Syn-Hershko<br />Involved in the creation of CLucene (the C++ port of Lucene)<br />Creating a Hebrew analyzer as a side project<br />Looking to join forces<br />Hebrew?<br />Hibernate Search<br />
    42. 42. Hibernate Search<br />שר הטבעות, גירסה ראשונה:אחוות הטבעת (Hebrew: “The Lord of the Rings, first version: The Fellowship of the Ring”)<br />
    43. 43. Motivation<br />Lucene Intro<br />Hibernate Search<br />Indexing<br />Searching<br />Scoring<br />Alternatives<br />Agenda<br />
    44. 44. When has the data changed?<br />Which data has changed?<br />When should the changed data be indexed?<br />How can it all be done efficiently?<br />Hibernate Search will do it for you!<br />Transparent indexing<br />Indexing<br />
    45. 45. Indexing – On Rollback<br />Diagram labels: Application; Start Transaction; Session (Entity Manager); Insert/update/delete; Queue; DB; Lucene Index<br />
    46. 46. Indexing – On Rollback<br />Diagram labels: Application; Start Transaction; Session (Entity Manager); Insert/update/delete; Queue; DB; Lucene Index; Transaction failed; Rollback<br />
    47. 47. Indexing – On Commit<br />Diagram labels: Application; Session (Entity Manager); Insert/update/delete; Queue; DB; Lucene Index; Transaction Committed (√)<br />
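The three diagrams suggest that index changes are collected in a queue while the transaction is open, thrown away on rollback, and applied to the Lucene index only on commit. The sketch below models that behaviour in plain Java; `TransactionalIndexQueueSketch` and its methods are hypothetical, and simple lists stand in for the real queue and index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of transaction-bound index updates.
final class TransactionalIndexQueueSketch {
    private final List<String> luceneIndex = new ArrayList<>(); // stand-in for the real index
    private final List<String> pending = new ArrayList<>();     // work queued during the transaction

    // Called for every insert/update/delete performed through the session.
    void enqueue(String indexWork) {
        pending.add(indexWork);
    }

    // Transaction committed: apply the queued work to the index.
    void commit() {
        luceneIndex.addAll(pending);
        pending.clear();
    }

    // Transaction failed: drop the queued work so the index never sees it.
    void rollback() {
        pending.clear();
    }

    List<String> indexed() {
        return Collections.unmodifiableList(luceneIndex);
    }
}
```

Deferring the index write until commit is what keeps the index from containing entities that the database rolled back.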
    48. 48. <property name="org.hibernate.worker.execution">async</property><br /><property name="org.hibernate.worker.thread_pool.size">2</property><br /><property name="org.hibernate.worker.buffer_queue.max">10</property><br />hibernate.cfg.xml<br />Indexing<br />
    49. 49. Indexing<br />It’s too late! I already have a database without Lucene! <br />
    50. 50. FullTextSession extends Hibernate Core’s Session<br />Session session = sessionFactory.openSession();<br />FullTextSession fts = Search.getFullTextSession(session);<br />index(Object entity)<br />purge(Class entityType, Serializable id)<br />purgeAll(Class entityType)<br />Manual indexing<br />Indexing<br />
    51. 51. tx = fullTextSession.beginTransaction();<br />// read the data from the database<br />Criteria criteria = fullTextSession.createCriteria(Book.class);<br />List<Book> books = criteria.list();<br />for (Book book : books) {<br />fullTextSession.index(book);<br />}<br />tx.commit();<br />Manual indexing<br />Indexing<br />
    52. 52. tx = fullTextSession.beginTransaction();<br />List<Integer> ids = getIds();<br />for (Integer id : ids) {<br />if (…) {<br />fullTextSession.purge(Book.class, id);<br />}<br />}<br />tx.commit();<br />fullTextSession.purgeAll(Book.class);<br />Removing objects from the Lucene index<br />Indexing<br />
    53. 53. Indexing<br />Rrrr!!! I got an OutOfMemoryError!<br />
    54. 54. session.setFlushMode(FlushMode.MANUAL);<br />session.setCacheMode(CacheMode.IGNORE);<br />Transaction tx = session.beginTransaction();<br />ScrollableResults results = session.createCriteria(Item.class)<br />.scroll(ScrollMode.FORWARD_ONLY);<br />int index = 0;<br />while( {<br />index++;<br />session.index(results.get(0));<br />if (index % BATCH_SIZE == 0) {<br />session.flushToIndexes();<br />session.clear();<br />}<br />}<br />tx.commit();<br />Indexing<br />100<br />
    55. 55. Searching<br />
    56. 56. title:lord title:rings<br />+title:lord +title:rings<br />title:lord -author:Tolkien<br />title:r?ngs<br />title:r*gs<br />title:"Lord of the Rings"<br />title:"Lord Rings"~5<br />title:rengs~0.8<br />title:lord author:Tolkien^2<br />And more…<br />Lucene’s Query Syntax<br />Searching<br />
    57. 57. To build FTS queries we need to:<br />Create a Lucene query<br />Create a Hibernate Search query that wraps the Lucene query<br />Why?<br />No need to build a framework around Lucene<br />Converting documents to objects happens transparently.<br />Seamless integration with Hibernate Core API<br />Querying<br />Searching<br />
    58. 58. String stringToSearch = "rings";<br />Term term = new Term("title", stringToSearch);<br />TermQuery query = new TermQuery(term);<br />FullTextQuery hibQuery = session.createFullTextQuery(query, Book.class);<br />List<Book> results = hibQuery.list();<br />Hibernate Queries Examples<br />Searching<br />
    59. 59. String stringToSearch = "r??gs";<br />Term term = new Term("title", stringToSearch);<br />WildcardQuery query = new WildcardQuery(term);<br />...<br />List<Book> results = hibQuery.list();<br />WildcardQuery Example<br />Searching<br />
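To make the wildcard semantics concrete (`?` stands for exactly one character, `*` for any run of characters), the sketch below translates a wildcard pattern into a regular expression and tests it against single terms. It illustrates the matching rule only and is not how Lucene evaluates wildcard queries internally; `WildcardSketch` and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of wildcard matching against a single indexed term.
final class WildcardSketch {
    // Translates '?' to "any one character" and '*' to "any run of characters".
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True when the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }
}
```

Both `r??gs` and `r*gs` match the term "rings", while `r?ngs` does not match "rngs" because `?` requires exactly one character.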
    60. 60. Motivation<br />Use case<br />Book table<br />Good practices for Gava<br />
    61. 61. HS Query Flowchart<br />Diagram labels: Client; Hibernate Search Query; Query the index; Lucene Index; Receive matching ids; Loads objects from the Persistence Context; Persistence Context; DB access (if needed); DB<br />Searching<br />
    62. 62. You can use list(), uniqueResult(), iterate(), scroll() – just like in Hibernate Core!<br />Multistage search engine<br />Sorting<br />Explanation object<br />Querying tips<br />Searching<br />
    63. 63. Score<br />
    64. 64. Mostly based on Salton’s Vector Space Model<br />Score<br />
    66. 66. Term Rating<br />$w_i = \log\left(\frac{N}{n_i}\right)$<br />$w_i$: term weight<br />$N$: number of documents in the index<br />$n_i$: total number of documents containing term $i$<br />Example query: best java in action books<br />Score<br />
    67. 67. Term Rating Calculation<br />Score<br />
    68. 68. Head First Java<br />Best of the best of the best<br />Best examples from Hibernate in action<br />The best action of Chuck Norris<br />Scoring example<br />Score<br />Search for: “best java in action books”<br />0.60206<br />0.12494<br />0.30103<br />
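The three numbers on this slide equal log10(4/1), log10(4/3) and log10(4/2), which fits a term weight defined as the logarithm of the number of documents in the index divided by the number of documents containing the term. The sketch below recomputes them from the four titles. The base-10 logarithm, the pairing of values with the terms "java", "best" and "action", and the names `TermWeightSketch`, `weight` and `documentFrequency` are assumptions, and this is not Lucene's full scoring formula.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical recomputation of the term weights from the scoring example.
final class TermWeightSketch {
    // The four documents from the scoring example.
    static final List<String> TITLES = Arrays.asList(
            "Head First Java",
            "Best of the best of the best",
            "Best examples from Hibernate in action",
            "The best action of Chuck Norris");

    // Term weight: log10 of (documents in the index / documents containing the term).
    static double weight(int totalDocs, int docsWithTerm) {
        if (docsWithTerm == 0) {
            return 0.0; // a term that appears nowhere contributes nothing
        }
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    // Counts the documents whose lower-cased words include the term.
    static int documentFrequency(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            List<String> words = Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\W+"));
            if (words.contains(term)) {
                count++;
            }
        }
        return count;
    }
}
```

Under these assumptions "java" (one title) weighs about 0.60206, "action" (two titles) about 0.30103, and "best" (three titles) about 0.12494: the rarer the term, the more it contributes to a document's score.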
    69. 69. Conventional Boolean retrieval<br />Calculating score for only matching documents<br />Customizing similarity algorithm<br />Query boosting<br />Custom scoring algorithms<br />Lucene’s scoring approach<br />Score<br />
    70. 70. Alternatives<br />
    71. 71. Alternatives<br />Shay Banon<br />
    72. 72. Alternatives<br />Distributed<br />Spring support<br />Simple<br />Lucene based<br />Integrates with popular ORM frameworks<br />Configurable via XML or annotations<br />Local & External TX Manager<br />
    73. 73. Alternatives<br />
    74. 74. Enterprise Search Server<br />Supports multiple protocols (xml, json, ruby, etc...)<br />Runs as a standalone Full Text Search server within a servlet<br />e.g. Tomcat<br />Heavily based on Lucene<br />JSA – Java Search API (based on JPA)<br />ODM (Object/Document Mapping)<br /> Spring integration (Transactions)<br />Apache Solr<br />Alternatives<br />
    75. 75. Powerful Web Administration Interface<br />Can be tailored without any Java coding!<br />Extensive plugin architecture<br />Server statistics exposed over JMX<br />Scalability – easily replicated<br />Apache Solr<br />Alternatives<br />
    76. 76. Resources<br />Lucene<br />Lucene contrib<br />Hibernate Search<br />Hibernate Search in Action / Emmanuel Bernard, John Griffin<br />Compass<br />Apache Solr<br />
    77. 77. Thank you!<br />Q & A<br />