Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
APACHE LUCENE FOR
JAVAEE DEVELOPERS
VIRTUAL:JBUG
by @SanneGrinovero
A QUICK INTRODUCTION
HIBERNATE TEAM
Hibernate Search project lead
Hibernate OGM team, occasionally Hibernate ORM
INFINISPAN TEAM
The Lucene guy: Infinispan Query, Infinispan Lucene Directory
OTHER PROJECTS I HELP WITH...
WildFly, JGroups, Apache Lucene, ...
SUPPORTED BY
Red Hat and a lot of passion for OSS
AGENDA
What is and how can it help you
Integrations with a JPA application via
How does this all relate with and
Lucene in...
THE SEARCH PROBLEM
Hello, I'm looking for a book in your online
shop having primary key #2342
SQL CAN HANDLE TEXT
The LIKE operator?
LET'S REFRESH SOME HISTORY ON THE
WIKIPEDIA
Select * from WikipediaPages p where p.content LIKE ?;
Select * from Wikipedia...
AM I CHEATING?
I'm quoting some very successfull web companies.
How many can you list which do not provide an effective
se...
REQUIREMENTS FOR A SEARCH ENGINE
Need to guess what you want w/o you typing all of the
content
We all hate forms
We want t...
SOME MORE THINGS TO CONSIDER:
Approximate word matches
Stemming / Language specific analysis
Typos
Synonyms, Abbreviations...
BASICS: KEYWORD EXTRACTION
On how to improve running by Scott
1. Tokenization & Analysis:
how
improv
run
scott
2. Scoring
APACHE LUCENE
Open source Apache™ top level project
Primarily Java, ported to many other languages and
platforms
Extremely...
AS A JAVAEE DEVELOPER:
You are familiar with JPA
But Lucene is much better than a relational database to
address this prob...
LET'S INTRODUCE APACHE LUCENE VIA
HIBERNATE SEARCH
Deeply but transparently integrated with Hibernate's
EntityManager
Internally uses advanced Apache Lucene features, but
protects your deadlines from the lower level details
Ge...
Does not prevent you to perform any form of advanced /
native Lucene query
Transparent index state synchronization
Transaction integrations
Options to rebuild the index efficiently
Failover and clu...
HIBERNATE SEARCH QUICKSTART
<dependency>
   <groupid>org.hibernate</groupid>
   <artifactid>hibernate­search­orm</artifact...
HOW TO INDEX A TRIVIAL DOMAIN
MODEL
@Entity
public class Actor {
 
  @Id
  Integer id;
 
  String name;
}
@Indexed @Entity
public class Actor {
 
  @Id
  Integer id;
 
  String name;
}
@Indexed @Entity
public class Actor {
 
  @Id
  Integer id;
 
  @Field
  String name;
}
LET'S INTRODUCE RELATIONS
@Entity
public class DVD {
  @Id
  Integer id;
  String title;
  @ManyToMany
  Set<Actor> actors...
@Indexed @Entity
public class DVD {
  @Id
  Integer id;
  @Field
  String title;
  @ManyToMany @IndexedEmbedded
  Set<Acto...
INDEX FIELDS FOR ACTOR
id name
1 Harrison Ford
2 Kirsten Dunst
INDEX FIELDS FOR DVD
id title actors.name *
1 Melancholia {...
RUN A LUCENE QUERY BUT GET JPA
MANAGED RESULTS
String[] productFields = { "title", "actors.name" };
org.apache.lucene.sear...
HIBERNATE SEARCH: BASICS DEMO
LUCENE & TEXT ANALYSIS
@Indexed(index = "tweets")
@Analyzer(definition = "english")
@AnalyzerDef(name = "english",
  token...
FILTERS
List results = fullTextEntityManager
   .createFullTextQuery( query, Product.class )
   .enableFullTextFilter( "mi...
List results = fullTextEntityManager
   .createFullTextQuery( query, Product.class )
   .enableFullTextFilter( "minorsFilt...
FACETING
"MORE LIKE THIS"
Coffee decaffInstance = ... // you already have one
QueryBuilder qb = getCoffeeQueryBuilder();
Query mltQ...
SPATIAL FILTERING
HOW TO RUN THIS ON WILDFLY?
ARCHITECTURE & INDEX MANAGEMENT
ah, the catch!
Indexes need to be stored, updated and read from.
You can have many indexes, managed independently
An index can be written...
THE INFINISPAN / LUCENE
INTEGRATIONS
WHAT IS INFINISPAN?
In Memory Key/Value Store
ASL v2 License
Scalable
JTA Transactions
Persistence (File/JDBC/LevelDB/...)...
INFINISPAN / JAVAEE?
JavaEE: JCache implementation
A core component of WildFly
"Embedded" mode does not depend on WildFly
...
INFINISPAN / APACHE LUCENE?
Lucene integrations for Querying the datagrid!
Lucene integrations to store the index!
Hiberna...
HIBERNATE SEARCH & INFINISPAN
QUERY
Same underlying technology
Same API to learn
Same indexing configuration options
Same annotations, not an entity:
@Indexed
@AnalyzerDef(name = "lowercaseKeyword",
        tokenizer = @TokenizerDef(factor...
Store the Java POJO classes in the cache directly:
Country uk = ...
cache.put("UK", uk );
USING LUCENE QUERY PARSER
QueryParser qp = new QueryParser("default", new StandardAnalyzer());
                
Query luce...
COUNT ENTITIES
import org.apache.lucene.search.MatchAllDocsQuery;
MatchAllDocsQuery allDocsQuery = new MatchAllDocsQuery()...
USING LUCENE INDEXREADER
DIRECTLY
SearchIntegrator searchFactory = Search.getSearchManager(cache)
                .getSear...
GETTING STARTED WITH INFINISPAN
<dependency>
   <groupId>org.infinispan</groupId>
   <artifactId>infinispan­embedded</arti...
ADD PERSISTENCE (XML)
<infinispan>
     <cache­container> 
         <local­cache name="testCache">
            <persistenc...
ADD PERSISTENCE
(PROGRAMMATIC)
Configuration configuration = new ConfigurationBuilder()
    .persistence()
    .addStore(L...
CLUSTERING - REPLICATED
GlobalConfiguration globalCfg = new GlobalConfigurationBuilder()
     .transport().defaultTranspor...
CLUSTERING - DISTRIBUTED
GlobalConfiguration globalCfg = new GlobalConfigurationBuilder()
     .transport().defaultTranspo...
QUERYING
Apache Lucene Index
Native Map Reduce
Index-less
Hadoop and Spark (coming)
INDEXING - CONFIGURATION
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.ALL)
     ....
QUERY - SYNC/ASYNC
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCAL)
         ....
QUERY - RAM STORAGE
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCAL)
         ...
QUERY - INFINISPAN STORAGE
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCAL)
  ...
QUERY - FILESYSTEM STORAGE
Configuration configuration = new ConfigurationBuilder()
   .indexing().index(Index.LOCAL)
    ...
QUERY - INFINISPAN INDEXMANAGER
Configuration configuration = new ConfigurationBuilder()
     .indexing().index(Index.LOCA...
THE INFINISPAN LUCENE DIRECTORY
Storing the Apache Lucene index in an high-performance in
memory data grid
You don't need Hibernate Search to cluster your existing
Lucene application, but you'll need some external
coordination to...
INFINISPAN QUERY AND THE LUCENE
DIRECTORY IN ACTION
Weather Demo by Gustavo Nalle Fernandes
DEMO
Indexed
NOAA.gov data fro...
WHAT'S ON THE HORIZON FOR
INFINISPAN
Improvements in indexing performance
Hadoop and Spark integration experiments
Combini...
WHAT'S COMING FOR HIBERNATE
SEARCH
Upgrading to Lucene 5
Experiment integrations with REST based Lucene servers
(Solr, Ela...
THANK YOU!
Some references:
, the super simple
, the
The website, the by
The website
Our team's blog
(requires Chrome)
Apa...
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Apache Lucene for Java EE Developers
Upcoming SlideShare
Loading in …5
×

Apache Lucene for Java EE Developers

3,525 views

Published on

Apache Lucene is the de-facto standard open source library for Java developers to implement full-text-search capabilities.

While it’s thriving in its field, it is rarely mentioned in the scope of Java EE development.

In this talk we will see for which features many developers love Lucene, make some concrete examples of common problems it elegantly solves, and see some best practices about using it in a Java EE stack.

Finally we'll see how some popular OSS projects such as Hibernate ORM (JPA provider), WildFly (Java EE runtime) and Infinispan (in-memory datagrid, JCache implementor) actually provide great Lucene integration capabilities.

Published in: Technology
  • Be the first to comment

Apache Lucene for Java EE Developers

  1. 1. APACHE LUCENE FOR JAVAEE DEVELOPERS VIRTUAL:JBUG by @SanneGrinovero
  2. 2. A QUICK INTRODUCTION HIBERNATE TEAM Hibernate Search project lead Hibernate OGM team, occasionally Hibernate ORM
  3. 3. INFINISPAN TEAM The Lucene guy: Infinispan Query, Infinispan Lucene Directory
  4. 4. OTHER PROJECTS I HELP WITH... WildFly, JGroups, Apache Lucene, ...
  5. 5. SUPPORTED BY Red Hat and a lot of passion for OSS
  6. 6. AGENDA What is and how can it help you Integrations with a JPA application via How does this all relate with and Lucene index management Plans and wishlist for the future Apache Lucene Hibernate Search Infinispan WildFly
  7. 7. THE SEARCH PROBLEM Hello, I'm looking for a book in your online shop having primary key #2342
  8. 8. SQL CAN HANDLE TEXT The LIKE operator?
  9. 9. LET'S REFRESH SOME HISTORY ON THE WIKIPEDIA Select * from WikipediaPages p where p.content LIKE ?; Select * from WikipediaPages p where p.title LIKE ?; Select * from WikipediaPages p where   (lowercase(p.content) LIKE %:1% OR    lowercase(p.content) LIKE %:2% OR    lowercase(p.content) LIKE %:3% OR   ...);
  10. 10. AM I CHEATING? I'm quoting some very successfull web companies. How many can you list which do not provide an effective search engine? Why is that?
  11. 11. REQUIREMENTS FOR A SEARCH ENGINE Need to guess what you want w/o you typing all of the content We all hate forms We want the results in the blink of an eye We want the right result on top: Relevance
  12. 12. SOME MORE THINGS TO CONSIDER: Approximate word matches Stemming / Language specific analysis Typos Synonyms, Abbreviations, Technical Language specializations
  13. 13. BASICS: KEYWORD EXTRACTION On how to improve running by Scott 1. Tokenization & Analysis: how improv run scott 2. Scoring
  14. 14. APACHE LUCENE Open source Apache™ top level project Primarily Java, ported to many other languages and platforms Extremely popular, it's everywhere! High pace of improvement, excellent team Most impressive testing
  15. 15. AS A JAVAEE DEVELOPER: You are familiar with JPA But Lucene is much better than a relational database to address this problem Easy integration with the platform is a requirement
  16. 16. LET'S INTRODUCE APACHE LUCENE VIA HIBERNATE SEARCH Deeply but transparently integrated with Hibernate's
  17. 17. EntityManager Internally uses advanced Apache Lucene features, but protects your deadlines from the lower level details Gets great performance out of it Simple annotations, yet many flexible override options
  18. 18. Does not prevent you to perform any form of advanced / native Lucene query
  19. 19. Transparent index state synchronization Transaction integrations Options to rebuild the index efficiently Failover and clustering integration points Flexible Error handling
  20. 20. HIBERNATE SEARCH QUICKSTART <dependency>    <groupid>org.hibernate</groupid>    <artifactid>hibernate­search­orm</artifactid>    <version>5.4.0.CR1</version> </dependency> <dependency>    <groupid>org.hibernate</groupid>    <artifactid>hibernate­core</artifactid>    <version>5.0.0.CR2</version> </dependency> <dependency>    <groupid>org.hibernate</groupid>    <artifactid>hibernate­entitymanager</artifactid>    <version>5.0.0.CR2</version> </dependency>
  21. 21. HOW TO INDEX A TRIVIAL DOMAIN MODEL @Entity public class Actor {     @Id   Integer id;     String name; }
  22. 22. @Indexed @Entity public class Actor {     @Id   Integer id;     String name; }
  23. 23. @Indexed @Entity public class Actor {     @Id   Integer id;     @Field   String name; }
  24. 24. LET'S INTRODUCE RELATIONS @Entity public class DVD {   @Id   Integer id;   String title;   @ManyToMany   Set<Actor> actors = new HashSet<>(); }
  25. 25. @Indexed @Entity public class DVD {   @Id   Integer id;   @Field   String title;   @ManyToMany @IndexedEmbedded   Set<Actor> actors = new HashSet<>(); }
  26. 26. INDEX FIELDS FOR ACTOR id name 1 Harrison Ford 2 Kirsten Dunst INDEX FIELDS FOR DVD id title actors.name * 1 Melancholia {Kirsten Dunst, Charlotte Gainsbourg, Kiefer Sutherland} 2 The Force Awakens {Harrison Ford, Mark Hamill, Carrie Fisher}
  27. 27. RUN A LUCENE QUERY BUT GET JPA MANAGED RESULTS String[] productFields = { "title", "actors.name" }; org.apache.lucene.search.Query luceneQuery = // ... FullTextEntityManager ftEm =    Search.getFullTextEntityManager( entityManager ); FullTextQuery query = // extends javax.persistence.Query    ftEm.createFullTextQuery( luceneQuery, DVD.class ); List dvds = // Managed entities!    query.setMaxResults(100).getResultList(); int totalNbrOfResults = query.getResultSize();
  28. 28. HIBERNATE SEARCH: BASICS DEMO
  29. 29. LUCENE & TEXT ANALYSIS @Indexed(index = "tweets") @Analyzer(definition = "english") @AnalyzerDef(name = "english",   tokenizer = @TokenizerDef(     factory = StandardTokenizerFactory.class),     filters = {     @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),     @TokenFilterDef(factory = LowerCaseFilterFactory.class),     @TokenFilterDef(factory = StopFilterFactory.class, params = {        @Parameter(name = "words", value = "stoplist.properties"),        @Parameter(name = "ignoreCase", value = "false")        }) }) @Entity public class Tweet {
  30. 30. FILTERS List results = fullTextEntityManager    .createFullTextQuery( query, Product.class )    .enableFullTextFilter( "minorsFilter" )    .list();
  31. 31. List results = fullTextEntityManager    .createFullTextQuery( query, Product.class )    .enableFullTextFilter( "minorsFilter" )    .enableFullTextFilter( "specialDayOffers" )       .setParameter( "day", “20150714” )    .enableFullTextFilter( "inStockAt" )       .setParameter( "location", "Newcastle" )    .list();
  32. 32. FACETING
  33. 33. "MORE LIKE THIS" Coffee decaffInstance = ... // you already have one QueryBuilder qb = getCoffeeQueryBuilder(); Query mltQuery = qb   .moreLikeThis()   .comparingAllFields()   .toEntityWithId( decaffInstance.getId() )   .createQuery(); List results = fullTextEntityManager   .createFullTextQuery( mltQuery, Coffee.class )   .list();
  34. 34. SPATIAL FILTERING
  35. 35. HOW TO RUN THIS ON WILDFLY?
  36. 36. ARCHITECTURE & INDEX MANAGEMENT ah, the catch!
  37. 37. Indexes need to be stored, updated and read from. You can have many indexes, managed independently An index can be written by an exclusive writer only Backends can be configured differently per index Index storage - the Directory - can also be configured per index
  38. 38. THE INFINISPAN / LUCENE INTEGRATIONS
  39. 39. WHAT IS INFINISPAN? In Memory Key/Value Store ASL v2 License Scalable JTA Transactions Persistence (File/JDBC/LevelDB/...) Local/Clustered Embedded/Server ...
  40. 40. INFINISPAN / JAVAEE? JavaEE: JCache implementation A core component of WildFly "Embedded" mode does not depend on WildFly Hibernate 2n level cache
  41. 41. INFINISPAN / APACHE LUCENE? Lucene integrations for Querying the datagrid! Lucene integrations to store the index! Hibernate Search integrations!
  42. 42. HIBERNATE SEARCH & INFINISPAN QUERY Same underlying technology Same API to learn Same indexing configuration options
  43. 43. Same annotations, not an entity: @Indexed @AnalyzerDef(name = "lowercaseKeyword",         tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),         filters = {@TokenFilterDef(factory = LowerCaseFilterFactory.class)} ) @SerializeWith(CountryExternalizer.class) public class Country {     @Field(store = Store.YES)     @Analyzer(definition = "lowercaseKeyword")     private String name;
  44. 44. Store the Java POJO classes in the cache directly: Country uk = ... cache.put("UK", uk );
  45. 45. USING LUCENE QUERY PARSER QueryParser qp = new QueryParser("default", new StandardAnalyzer());                  Query luceneQ = qp  .parse("+station.name:airport +year:2014 +month:12 +(avgTemp < 0)"); CacheQuery cq = Search.getSearchManager(cache)                            .getQuery(luceneQ, DaySummary.class);                   List<Object> results = query.list();             
  46. 46. COUNT ENTITIES import org.apache.lucene.search.MatchAllDocsQuery; MatchAllDocsQuery allDocsQuery = new MatchAllDocsQuery();                  CacheQuery query = Search.getSearchManager(cache)                               .getQuery(allDocsQuery, DaySummary.class);                  int count = query.getResultSize();             
  47. 47. USING LUCENE INDEXREADER DIRECTLY SearchIntegrator searchFactory = Search.getSearchManager(cache)                 .getSearchFactory();                  IndexReader indexReader = searchFactory                 .getIndexReaderAccessor().open(DaySummary.class);                  IndexSearcher searcher = new IndexSearcher(indexReader);             
  48. 48. GETTING STARTED WITH INFINISPAN <dependency>    <groupId>org.infinispan</groupId>    <artifactId>infinispan­embedded</artifactId>    <version>7.2.3.Final</version> </dependency>          EmbeddedCacheManager cacheManager = new DefaultCacheManager(); Cache<String,String> cache = cacheManager.getCache(); cache.put("key", "data goes here");         
  49. 49. ADD PERSISTENCE (XML) <infinispan>      <cache­container>           <local­cache name="testCache">             <persistence>                <leveldb­store path="/tmp/folder"/>             </persistence>          </local­cache>      </cache­container> </infinispan>              DefaultCacheManager cm = new DefaultCacheManager("infinispan.xml"); Cache<Integer, String> cache = cacheManager.getCache("testCache");             
  50. 50. ADD PERSISTENCE (PROGRAMMATIC) Configuration configuration = new ConfigurationBuilder()     .persistence()     .addStore(LevelDBStoreConfigurationBuilder.class)     .build(); DefaultCacheManager cm = new DefaultCacheManager(configuration); Cache<Integer, String> cache = cm.getCache();             
  51. 51. CLUSTERING - REPLICATED GlobalConfiguration globalCfg = new GlobalConfigurationBuilder()      .transport().defaultTransport()      .build();       Configuration cfg = new ConfigurationBuilder()      .clustering().cacheMode(CacheMode.REPL_SYNC)      .build();                  EmbeddedCacheManager cm = new DefaultCacheManager(globalCfg, cfg); Cache<Integer, String> cache = cm.getCache();             
  52. 52. CLUSTERING - DISTRIBUTED GlobalConfiguration globalCfg = new GlobalConfigurationBuilder()      .transport().defaultTransport()      .build();       Configuration configuration = new ConfigurationBuilder()      .clustering().cacheMode(CacheMode.DIST_SYNC)      .hash().numOwners(2).numSegments(100)      .build(); EmbeddedCacheManager cm = new DefaultCacheManager(globalConfiguration, confi Cache<Integer, String> cache = cm.getCache();             
  53. 53. QUERYING Apache Lucene Index Native Map Reduce Index-less Hadoop and Spark (coming)
  54. 54. INDEXING - CONFIGURATION Configuration configuration = new ConfigurationBuilder()      .indexing().index(Index.ALL)      .build(); EmbeddedCacheManager cm = new DefaultCacheManager(configuration); Cache<Integer, DaySummary> cache = cm.getCache();             
  55. 55. QUERY - SYNC/ASYNC Configuration configuration = new ConfigurationBuilder()      .indexing().index(Index.LOCAL)          .addProperty("default.worker.execution", "async")      .build(); EmbeddedCacheManager cm = new DefaultCacheManager(configuration); Cache<Integer, DaySummary> cache = cm.getCache();             
  56. 56. QUERY - RAM STORAGE Configuration configuration = new ConfigurationBuilder()      .indexing().index(Index.LOCAL)          .addProperty("default.worker.execution", "async")          .addProperty("default.directory_provider", "ram")      .build(); EmbeddedCacheManager cm = new DefaultCacheManager(configuration); Cache<Integer, DaySummary> cache = cm.getCache();             
  57. 57. QUERY - INFINISPAN STORAGE Configuration configuration = new ConfigurationBuilder()      .indexing().index(Index.LOCAL)          .addProperty("default.worker.execution", "async")          .addProperty("default.directory_provider", "infinispan")      .build(); EmbeddedCacheManager cm = new DefaultCacheManager(configuration); Cache<Integer, DaySummary> cache = cm.getCache();             
  58. 58. QUERY - FILESYSTEM STORAGE Configuration configuration = new ConfigurationBuilder()    .indexing().index(Index.LOCAL)          .addProperty("default.directory_provider", "filesystem")          .addProperty("default.indexBase", "/path/to/index); .build();             
  59. 59. QUERY - INFINISPAN INDEXMANAGER Configuration configuration = new ConfigurationBuilder()      .indexing().index(Index.LOCAL)          .addProperty("default.worker.execution", "async")          .addProperty("default.indexmanager",             "org.infinispan.query.indexmanager.InfinispanIndexManager"      .build(); EmbeddedCacheManager cm = new DefaultCacheManager(configuration); Cache<Integer, DaySummary> cache = cm.getCache();             
  60. 60. THE INFINISPAN LUCENE DIRECTORY Storing the Apache Lucene index in an high-performance in memory data grid
  61. 61. You don't need Hibernate Search to cluster your existing Lucene application, but you'll need some external coordination to guarantee the single IndexWriter.
  62. 62. INFINISPAN QUERY AND THE LUCENE DIRECTORY IN ACTION Weather Demo by Gustavo Nalle Fernandes DEMO Indexed NOAA.gov data from 1901 to 2014 ~10M summaries Yearly country max recorded temperature by month Cache<Integer, DaySummary>
  63. 63. WHAT'S ON THE HORIZON FOR INFINISPAN Improvements in indexing performance Hadoop and Spark integration experiments Combining indexed & non-indexed query capabilities, including remote queries
  64. 64. WHAT'S COMING FOR HIBERNATE SEARCH Upgrading to Lucene 5 Experiment integrations with REST based Lucene servers (Solr, ElasticSearch) Improved backends to simplify clustering setup GSOC: generic JPA support and improved developer tooling A lot more! See also the roadmap
  65. 65. THANK YOU! Some references: , the super simple , the The website, the by The website Our team's blog (requires Chrome) Apache Lucene website Hibernate Search website Hibernate Search JPA demo WildFly integration tests Infinispan Weather Demo @gustavonalle WildFly in.relation.to Export these slides to PDF

×