JavaEdge09 : Java Indexing and Searching

From AlphaCSP's Java conference - JavaEdge09. The presentation of myself and Evgeny Borisov about 'Java Indexing and Searching'

In this session we discussed the need for Full Text Search (as opposed to regular textual/SQL search), Lucene and its OO mismatches, the solution that Hibernate Search provides to those mismatches, and then a bit about Lucene's scoring algorithm.

Published in: Technology

  • JIRA – search for issues
  • Eclipse – search for its documentation
  • Talk about Hibernate and ORM very briefly, then move on to the next slide
  • Execution – sync or async (default: sync). thread_pool.size (default: 1). buffer_queue.max (default: infinite); be aware of OutOfMemoryError
  • Transcript

    • 1. Java Indexing and Searching
      By: Shay Sofer & Evgeny Borisov
    • 2. Motivation
      Lucene Intro
      Hibernate Search
      Indexing
      Searching
      Scoring
      Alternatives
      Agenda
    • 3. Motivation
      What is Full Text Search and why do I need it?
    • 4. Motivation
      Use case
      “Book” table
      Good practices for Gava
    • 5. We’d like to:
      Index the information efficiently
      Answer queries using that index
      More common than you think
      Full Text Search
      Motivation
    • 6. Integrated full text search engine in the database
      e.g. DBSight, recent versions of MySQL, MS SQL Server, Oracle Text, etc.
      Out of the box Search Appliances
      e.g. Google Search Appliance
      Third party libraries
      Full Text Search Solutions
      Motivation
    • 7. Lucene Intro
    • 8. The most popular full text search library
      Scalable and high performance
      Around for about 9 years
      Open source
      Supported by the Apache Software Foundation
      Apache Lucene
      Lucene Intro
    • 9. Lucene Intro
    • 10. “Word-oriented” search
      Powerful query syntax
      Wildcards, typos, proximity search.
      Sorting by relevance (Lucene’s scoring algorithm) or any other field
      Fast searching, fast indexing
      Inverted index.
      Lucene’s Features
      Lucene Intro
    • 11. Lucene Intro
      Inverted Index
      DB
      Head First Java
      0
      Best of the best of the best
      1
      Chuck Norris in action
      2
      JBoss in action
      3
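The inverted index on slide 11 maps each word to the ids of the titles that contain it. A minimal sketch of that idea in plain Java follows; the class `InvertedIndexSketch` and its methods are hypothetical names for illustration, and Lucene's real index is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's actual data structure.
final class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    /** Adds a document and returns the id assigned to it (0, 1, 2, ...). */
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // Naive tokenization: lower-case, then split on non-word characters.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of the documents containing the term; empty if none. */
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Head First Java");               // id 0
        index.add("Best of the best of the best");  // id 1
        index.add("Chuck Norris in action");        // id 2
        index.add("JBoss in action");               // id 3
        System.out.println(index.lookup("action")); // [2, 3]
        System.out.println(index.lookup("best"));   // [1]
    }
}
```

A lookup is a single map access per term, which is why searching stays fast however many documents are indexed.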
    • 12. A Field is a key+value. Value is always represented as a String (Textual)
      A Document can contain as many Fields as we’d like
      Lucene’s index is a collection of Documents
      Basic Definitions
      Lucene Intro
    • 13. Lucene Intro
      Using Lucene API…
      // Open the index stored under "BookIndex" and search the "title" field
      IndexSearcher is = new IndexSearcher("BookIndex");
      QueryParser parser = new QueryParser("title", analyzer);
      Query query = parser.parse("Good practices for Gava");
      return is.search(query);
    • 14. OO domain model Vs. Lucene’s Index structure
      Lucene Intro
    • 15. The Structural Mismatch
      Converting objects to string and vice versa
      No representation of relation between Documents
      The Synchronization Mismatch
      DB must be synced with the index
      The Retrieval Mismatch
      Retrieving documents (= pairs of key + value) and not objects
      Object vs Flat text mismatches
      Lucene Intro
    • 16. Hibernate Search
      Emmanuel Bernard
    • 17. Leverages ORM and Lucene together to solve those mismatches
      Complements Hibernate Core by providing FTS on persistent domain models.
      It’s actually a bridge that hides the sometimes complex Lucene API usage.
      Open source.
      Hibernate Search
    • 18. Document = Class (Mapped POJO)
      Hibernate Search metadata can be described by Annotations only
      Regardless, you can still use Hibernate Core with XML descriptors (hbm files)
      Let’s create our first mapping – Book
      Mapping
      Hibernate Search
    • 19. @Entity @Indexed
      public class Book implements Serializable {
          @Id
          private Long id;
          @Boost(2.0f)                        // boost the title's weight in scoring
          @Field
          private String title;
          @Field
          private String description;
          private String imageURL;            // no @Field: not indexed
          @Field(index = Index.UN_TOKENIZED)  // indexed as a single, unanalyzed term
          private String isbn;
      }
      Hibernate Search
    • 20. Types will be converted via “Field Bridge”.
      It is a bridge between the Java type and its representation in Lucene (i.e. a String)
      Hibernate Search comes with a set for most standard types (numbers – primitives and wrappers, Date, Class, etc.)
      They are extendable, of course
      Bridges
      Hibernate Search
    • 21. We can use a field bridge…
      @FieldBridge(impl = MyPaddedFieldBridge.class,
          params = {@Parameter(name = "padding", value = "5")})
      public Double getPrice() {
          return price;
      }
      Or a class bridge, in case the data we want to index is more than just the field itself,
      e.g. a concatenation of 2 fields
      Custom Bridges
      Hibernate Search
    • 22. In order to create a custom bridge we need to implement the interface StringBridge
      ParameterizedBridge – to inject params
      Custom Bridges
      Hibernate Search
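Because Lucene stores and compares values as strings, a numeric field is usually padded so that string order matches numeric order. The sketch below shows only the padding logic that a bridge such as the `MyPaddedFieldBridge` from slide 21 might contain. `PaddedNumberSketch` is a hypothetical stand-alone class: it does not implement the real Hibernate Search interfaces, and rounding the price to whole units is an assumption made to keep the example short.

```java
// Illustrative only: padding logic a custom bridge might contain.
// This class is hypothetical and does not implement StringBridge or
// ParameterizedBridge from Hibernate Search.
final class PaddedNumberSketch {
    private final int padding;

    PaddedNumberSketch(int padding) {
        this.padding = padding;
    }

    /** Converts a non-negative number to a zero-padded string of whole units. */
    String objectToString(Number value) {
        long rounded = Math.round(value.doubleValue()); // assumption: whole units only
        if (rounded < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String digits = Long.toString(rounded);
        if (digits.length() > padding) {
            throw new IllegalArgumentException("value does not fit in " + padding + " digits");
        }
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < padding; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    public static void main(String[] args) {
        PaddedNumberSketch bridge = new PaddedNumberSketch(5);
        System.out.println(bridge.objectToString(9.0));   // 00009
        System.out.println(bridge.objectToString(10.0));  // 00010
        System.out.println("9".compareTo("10") > 0);      // true: unpadded strings sort wrongly
    }
}
```

In a real bridge the `padding` value would arrive through the parameters map that `ParameterizedBridge` receives, rather than through a constructor.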
    • 23. Directory is where Lucene stores its index structure.
      Filesystem Directory Provider
      In-memory Directory Provider
      Clustering
      Directory Providers
      Hibernate Search
    • 24. Default
      Most efficient
      Limited only by the disk’s free space
      Can be easily replicated
      Luke support
      Filesystem Directory Provider
      Hibernate Search
    • 25. Index dies as soon as SessionFactory is closed.
      Very useful for unit testing (alongside in-memory DBs)
      Data can be made persistent at any moment, if needed.
      Obviously, be aware of OutOfMemoryError
      In-memory Directory Provider
      Hibernate Search
    • 26. <!-- Hibernate Search Config -->
      <property name="hibernate.search.default.directory_provider">
      org.hibernate.search.store.FSDirectoryProvider
      </property>
      <property name=
      "hibernate.search.com.alphacsp.Book.directory_provider">
      org.hibernate.search.store.RAMDirectoryProvider
      </property>
      Directory Providers Config Example
      Hibernate Search
    • 27. Correlated queries - How do we navigate from one entity to another?
      Lucene doesn’t support relationships between documents
      Hibernate Search to the rescue - Denormalization
      Relationships
      Hibernate Search
    • 28. Hibernate Search
    • 29. @Entity @Indexed
      public class Book {
          @ManyToOne
          @IndexEmbedded
          private Author author;
      }
      @Entity @Indexed
      public class Author {
          private String firstName;
      }
      Object navigation is easy (author.firstName)
      Relationships
      Hibernate Search
    • 30. Entities can be referenced by other entities.
      Relationships – Denormalization Pitfall
      Hibernate Search
    • 31. Entities can be referenced by other entities.
      Relationships – Denormalization Pitfall
      Hibernate Search
    • 32. Entities can be referenced by other entities.
      Relationships – Denormalization Pitfall
      Hibernate Search
    • 33. The solution: The association pointing back to the parent will be marked with @ContainedIn
      @Entity @Indexed
      public class Book {
          @ManyToOne
          @IndexEmbedded
          private Author author;
      }
      @Entity @Indexed
      public class Author {
          @OneToMany(mappedBy = "author")
          @ContainedIn
          private Set<Book> books;
      }
      Relationships – Solution
      Hibernate Search
    • 34. Responsible for tokenizing and filtering words
      Tokenizing – not as trivial as it seems
      Filtering – Clearing the noise (case, stop words etc) and applying “other” operations
      Creating a custom analyzer is easy
      The default analyzer is Standard Analyzer
      Analyzers
      Hibernate Search
    • 35. StandardTokenizer: Splits text into words and removes punctuation.
      StandardFilter: Removes apostrophes and dots from acronyms.
      LowerCaseFilter: Converts words to lower case.
      StopFilter: Eliminates common words.
      Standard Analyzer
      Hibernate Search
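The analyzer chain on slide 35 can be pictured as a small pipeline: tokenize, lower-case, drop stop words. The following is a rough sketch in plain Java, not Lucene's StandardAnalyzer. The class name and the stop-word list are assumptions, and the apostrophe and acronym handling of StandardFilter is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: mimics the stages of an analyzer chain with plain Java.
final class AnalyzerSketch {
    // Hypothetical stop-word list; Lucene ships its own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // "Tokenizer" stage: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // "LowerCaseFilter" stage.
            String token = raw.toLowerCase(Locale.ROOT);
            // "StopFilter" stage.
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Lord of the Rings")); // [lord, rings]
    }
}
```

The same chain must be applied to both the indexed text and the query text, otherwise the terms will not match.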
    • 36. Other cool Filters….
      Hibernate Search
    • 37. N-Gram algorithm – Indexing a sequence of n consecutive characters.
      Usually when a typo occurs, part of the word is still correct
      Encyclopedia in 3-grams =
      Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia
      Approximative Search
      Hibernate Search
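The 3-gram split of "Encyclopedia" on slide 37 can be reproduced with a short loop. This is a sketch of the idea only: `NGramSketch` is a hypothetical class, and a real setup would use an n-gram token filter inside the analyzer rather than hand-written code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: character n-grams and a simple overlap count.
final class NGramSketch {
    /** Returns every run of n consecutive characters in the word. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct n-grams two words share, ignoring case. */
    static int shared(String a, String b, int n) {
        Set<String> common = new HashSet<>(ngrams(a.toLowerCase(Locale.ROOT), n));
        common.retainAll(ngrams(b.toLowerCase(Locale.ROOT), n));
        return common.size();
    }

    public static void main(String[] args) {
        // Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia
        System.out.println(String.join(" | ", ngrams("Encyclopedia", 3)));
        // A typo ("Encyclopdia") still shares 7 of the 10 trigrams.
        System.out.println(shared("Encyclopedia", "Encyclopdia", 3));
    }
}
```

The overlap count shows why the approach tolerates typos: most of the indexed grams still match even though the whole word does not.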
    • 38. Algorithms for indexing of words by their pronunciation
      The most widely known algorithm is Soundex
      Other Algorithms that are available : RefinedSoundex, Metaphone, DoubleMetaphone
      Phonetic Approximation
      Hibernate Search
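To make the phonetic idea concrete, here is a compact sketch of the classic American Soundex rules: keep the first letter, code the remaining consonants as digits, collapse repeats, and pad to four characters. `SoundexSketch` is a hypothetical class written for illustration; in practice a library implementation (for example the Soundex encoder in Apache Commons Codec) would be the safer choice.

```java
import java.util.Locale;

// Illustrative only: a compact version of the classic American Soundex rules.
final class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        // Keep ASCII letters only, upper-cased.
        StringBuilder letters = new StringBuilder();
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                letters.append(c);
            }
        }
        if (letters.length() == 0) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char previous = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != previous) {
                out.append(code);
            }
            previous = code; // a vowel resets the "same code" check
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert")); // R163
        System.out.println(soundex("Rupert")); // R163
    }
}
```

Words that sound alike get the same code, so indexing the code alongside the word lets a query for one spelling find the other.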
    • 39. Synonyms
      You can expand your synonym dictionary with your own rules (e.g. Business oriented words)
      Stemming
      Stemming is the process of reducing words to their stem, base or root form.
      “Fishing”, “Fisher”, “Fish” and “Fished” → Fish
      Snowball stemming language – supports over 15 languages
      Synonyms & Stemming
      Hibernate Search
    • 40. Lucene is bundled with the basic analyzers, tokenizers and filters.
      More can be found in Lucene’s contrib modules and in Apache Solr
      Additional Analyzers
      Hibernate Search
    • 41. No free Hebrew analyzer for Lucene
      Itamar Syn-Hershko
      Involved in the creation of CLucene (The C++ port of Lucene)
      Creating a Hebrew analyzer as a side project
      Looking to join forces
      itamar@divrei-tora.com
      Hebrew?
      Hibernate Search
    • 42. Hibernate Search
      The Lord of the Rings, first version: The Fellowship of the Ring
    • 43. Motivation
      Lucene Intro
      Hibernate Search
      Indexing
      Searching
      Scoring
      Alternatives
      Agenda
    • 44. When has the data changed?
      Which data has changed?
      When to index the changing data?
      How to do it all efficiently?
      Hibernate Search will do it for you!
      Transparent indexing
      Indexing
    • 45. Indexing – On Rollback
      Application
      Queue
      DB
      Start Transaction
      Session
      (Entity Manager)
      Insert/update
      delete
      Lucene Index
    • 46. Indexing – On Rollback
      Transaction failed
      Application
      Queue
      DB
      Rollback
      Start Transaction
      Session
      (Entity Manager)
      Insert/update
      delete
      Lucene Index
    • 47. Indexing – On Commit
      Transaction Committed
      Application
      Queue
      DB
      Session
      (Entity Manager)
      Insert/update
      delete

      Lucene Index
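The rollback and commit diagrams on slides 45 to 47 suggest that index changes are queued during the transaction and only applied to the Lucene index once the transaction commits. A minimal sketch of that behaviour, assuming a simple in-memory map in place of the index; `TransactionalIndexSketch` and its methods are hypothetical names, not Hibernate Search classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: queue index work per transaction, apply it on commit.
final class TransactionalIndexSketch {
    private final Map<Long, String> index = new TreeMap<>(); // stands in for the Lucene index
    private final List<Runnable> queue = new ArrayList<>();  // pending work for this transaction

    void addOrUpdate(long id, String text) {
        queue.add(() -> index.put(id, text));
    }

    void delete(long id) {
        queue.add(() -> index.remove(id));
    }

    /** Transaction committed: apply the queued work to the index, in order. */
    void commit() {
        for (Runnable work : queue) {
            work.run();
        }
        queue.clear();
    }

    /** Transaction failed: drop the queued work, the index stays untouched. */
    void rollback() {
        queue.clear();
    }

    Map<Long, String> snapshot() {
        return new TreeMap<>(index);
    }

    public static void main(String[] args) {
        TransactionalIndexSketch tx = new TransactionalIndexSketch();
        tx.addOrUpdate(1L, "Head First Java");
        tx.rollback();
        System.out.println(tx.snapshot()); // {}
        tx.addOrUpdate(2L, "JBoss in action");
        tx.commit();
        System.out.println(tx.snapshot()); // {2=JBoss in action}
    }
}
```

Deferring the work this way keeps the index consistent with the database: nothing reaches the index for a transaction that never committed.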
    • 48. <property name="org.hibernate.worker.execution">async</property>
      <property name="org.hibernate.worker.thread_pool.size">2</property>
      <property name="org.hibernate.worker.buffer_queue.max">10</property>
      hibernate.cfg.xml
      Indexing
    • 49. Indexing
      It’s too late! I already have a database without Lucene!
    • 50. FullTextSession extends Hibernate Core’s Session
      Session session = sessionFactory.openSession();
      FullTextSession fts = Search.getFullTextSession(session);
      index(Object entity)
      purge(Class entityType, Serializable id)
      purgeAll(Class entityType)
      Manual indexing
      Indexing
    • 51. Transaction tx = fullTextSession.beginTransaction();
      // read the data from the database
      Criteria criteria = fullTextSession.createCriteria(Book.class);
      List<Book> books = criteria.list();
      for (Book book : books) {
          // add (or re-add) each entity to the Lucene index
          fullTextSession.index(book);
      }
      tx.commit();
      Manual indexing
      Indexing
    • 52. tx = fullTextSession.beginTransaction();
      List<Integer> ids = getIds();
      for (Integer id : ids) {
          if (…) {
              // remove a single entity from the index
              fullTextSession.purge(Book.class, id);
          }
      }
      tx.commit();
      // remove all Book entities from the index
      fullTextSession.purgeAll(Book.class);
      Removing objects from the Lucene index
      Indexing
    • 53. Indexing
      Rrrr!!! I got an OutOfMemoryError!
    • 54. // session is a FullTextSession
      session.setFlushMode(FlushMode.MANUAL);
      session.setCacheMode(CacheMode.IGNORE);
      Transaction tx = session.beginTransaction();
      // scroll through the entities instead of loading them all at once
      ScrollableResults results = session.createCriteria(Item.class)
          .scroll(ScrollMode.FORWARD_ONLY);
      int index = 0;
      while (results.next()) {
          index++;
          session.index(results.get(0));
          // BATCH_SIZE: the slide shows 100
          if (index % BATCH_SIZE == 0) {
              session.flushToIndexes(); // apply the queued index work
              session.clear();          // free memory held by the session
          }
      }
      tx.commit();
      Indexing
    • 55. Searching
    • 56. title:lord title:rings
      +title:lord +title:rings
      title:lord -author:Tolkien
      title:r?ngs
      title:r*gs
      title:"Lord of the Rings"
      title:"Lord Rings"~5
      title:rengs~0.8
      title:lord author:Tolkien^2
      And more…
      Lucene’s Query Syntax
      Searching
    • 57. To build FTS queries we need to:
      Create a Lucene query
      Create a Hibernate Search query that wraps the Lucene query
      Why?
      No need to build framework around Lucene
      Converting document to object happens transparently.
      Seamless integration with Hibernate Core API
      Querying
      Searching
    • 58. String stringToSearch = "rings";
      Term term = new Term("title", stringToSearch);
      TermQuery query = new TermQuery(term);
      // wrap the Lucene query so that results come back as Book entities
      FullTextQuery hibQuery =
          session.createFullTextQuery(query, Book.class);
      List<Book> results = hibQuery.list();
      Hibernate Queries Examples
      Searching
    • 59. String stringToSearch = "r??gs";
      Term term = new Term("title", stringToSearch);
      WildcardQuery query = new WildcardQuery(term);
      ...
      List<Book> results = hibQuery.list();
      WildcardQuery Example
      Searching
    • 60. Motivation
      Use case
      Book table
      Good practices for Gava
    • 61. HS Query Flowchart
      Searching
      Hibernate
      Search
      Query
      Query the index
      Lucene
      Index
      Client
      Receive matching ids
      Loads objects from the Persistence Context
      DB
      DB access
      (if needed)
      Persistence Context
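The flowchart on slide 61 can be read as: query the index, receive matching ids, then load the objects from the persistence context, going to the database only when an object is not already there. The sketch below mimics that flow with plain maps; every name in it is hypothetical and none of it is Hibernate Search code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: index lookup -> ids -> persistence context -> database if needed.
final class QueryFlowSketch {
    private final Map<String, Set<Long>> index;     // stands in for the Lucene index
    private final Map<Long, String> database;       // stands in for the DB
    private final Map<Long, String> persistenceContext = new HashMap<>();
    private int databaseHits = 0;

    QueryFlowSketch(Map<String, Set<Long>> index, Map<Long, String> database) {
        this.index = index;
        this.database = database;
    }

    List<String> search(String term) {
        List<String> results = new ArrayList<>();
        // 1. Query the index and receive the matching ids.
        Set<Long> ids = index.getOrDefault(term, Collections.<Long>emptySet());
        for (Long id : ids) {
            // 2. Load from the persistence context when possible.
            String entity = persistenceContext.get(id);
            if (entity == null) {
                // 3. Otherwise access the database and remember the result.
                entity = database.get(id);
                databaseHits++;
                persistenceContext.put(id, entity);
            }
            results.add(entity);
        }
        return results;
    }

    int databaseHits() {
        return databaseHits;
    }

    public static void main(String[] args) {
        Map<String, Set<Long>> index = new HashMap<>();
        index.put("action", new TreeSet<>(Arrays.asList(2L, 3L)));
        Map<Long, String> db = new HashMap<>();
        db.put(2L, "Chuck Norris in action");
        db.put(3L, "JBoss in action");

        QueryFlowSketch flow = new QueryFlowSketch(index, db);
        System.out.println(flow.search("action")); // [Chuck Norris in action, JBoss in action]
        flow.search("action");                     // second call is served from the context
        System.out.println(flow.databaseHits());   // 2
    }
}
```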
    • 62. You can use list(), uniqueResult(), iterate(), scroll() – just like in Hibernate Core!
      Multistage search engine
      Sorting
      Explanation object
      Querying tips
      Searching
    • 63. Score
    • 64. Mostly based on Salton’s Vector Space Model
      Score
    • 65. Mostly based on Salton’s Vector Space Model
      Score
    • 66. Term Rating
      Score
      w_i = log10(N / n_i)
      w_i – term weight
      N – number of documents in the index
      n_i – total number of documents containing term i
      Example query: best java in action books
    • 67. Term Rating Calculation
      Score
    • 68. Head First Java
      Best of the best of the best
      Best examples from Hibernate in action
      The best action of Chuck Norris
      Scoring example
      Score
      Search for: “best java in action books”
      0.60206
      0.12494
      0.30103
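The three numbers on slide 68 appear to be term weights of the form log10(N / n), where N is the number of documents (4 titles) and n is the number of titles containing the term: "java" appears in 1 title, "best" in 3, and "action" in 2. The sketch below recomputes them under that reading. `TermWeightSketch` is a hypothetical class, the tokenization is deliberately naive, and returning 0 for a term that appears nowhere (such as "books") is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: a log10(N / n) term weight, not Lucene's full scoring formula.
final class TermWeightSketch {
    static double termWeight(String term, List<String> documents) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int containing = 0;
        for (String document : documents) {
            // Naive tokenization: lower-case, then split on non-word characters.
            List<String> tokens =
                    Arrays.asList(document.toLowerCase(Locale.ROOT).split("\\W+"));
            if (tokens.contains(wanted)) {
                containing++;
            }
        }
        if (containing == 0) {
            return 0.0; // assumption: a term found nowhere contributes nothing
        }
        return Math.log10((double) documents.size() / containing);
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Head First Java",
                "Best of the best of the best",
                "Best examples from Hibernate in action",
                "The best action of Chuck Norris");
        for (String term : Arrays.asList("java", "best", "action")) {
            System.out.println(String.format(Locale.ROOT, "%s %.5f",
                    term, termWeight(term, titles)));
        }
        // java 0.60206
        // best 0.12494
        // action 0.30103
    }
}
```

Rare terms get a higher weight than common ones, which is why a title matching "java" can outrank one that only matches "best".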
    • 69. Conventional Boolean retrieval
      Calculating score for only matching documents
      Customizing similarity algorithm
      Query boosting
      Custom scoring algorithms
      Lucene’s scoring approach
      Score
    • 70. Alternatives
    • 71. Alternatives
      Shay Banon
    • 72. Alternatives
      Distributed
      Spring support
      Simple
      Lucene based
      Integrates with popular ORM frameworks
      Configurable via XML or annotations
      Local & External TX Manager
    • 73. Alternatives
    • 74. Enterprise Search Server
      Supports multiple protocols (XML, JSON, Ruby, etc.)
      Runs as a standalone Full Text Search server within a servlet container,
      e.g. Tomcat
      Heavily based on Lucene
      JSA – Java Search API (based on JPA)
      ODM (Object/Document Mapping)
      Spring integration (Transactions)
      Apache Solr
      Alternatives
    • 75. Powerful Web Administration Interface
      Can be tailored without any Java coding!
      Extensive plugin architecture
      Server statistics exposed over JMX
      Scalability – easily replicated
      Apache Solr
      Alternatives
    • 76. Resources
      Lucene
      Lucene contrib
      Hibernate Search
      Hibernate Search in Action / Emmanuel Bernard, John Griffin
      Compass
      Apache Solr
    • 77. Thank you!
      Q & A