This document provides the agenda and slides for a one-day Lucene boot camp tutorial. The schedule includes sessions on introducing Lucene and search, indexing, analysis, searching, and performance. The tutorial aims to help attendees understand Lucene's core capabilities through real examples, real code, and real data, and it encourages attendees to ask questions.
2. Intro
• My Background
• Your Background
• Brief History of Lucene
• Goals for Tutorial
– Understand Lucene core capabilities
– Real examples, real code, real data
• Ask Questions!!!!!
3. Schedule
1. 10-10:10 Introducing Lucene and Search
2. 10:10-12 Indexing, Analysis, Searching, Performance
3. 12-12:05 Break
4. 12:05-1 More on Indexing, Analysis, Searching, Performance
5. 1-2:30 Lunch
6. 2:30-2:40 Recap, Questions, Content
7. 2:40-4 Class Example
8. 4-4:20 Break
9. 4:20-5 Class Example
10. 5-5:20 Lucene Contributions (time permitting)
11. 5:20-5:25 Open Discussion (time permitting)
12. 5:25-5:30 Resources/Wrap Up
4. Lucene is…
• NOT a crawler
– See Nutch
• NOT an application
– See PoweredBy on the Wiki
• NOT a library for doing Google PageRank or other link analysis algorithms
– See Nutch
• A library for enabling text-based search
5. A Few Words about Solr
• HTTP-based Search Server
• XML Configuration
• XML, JSON, Ruby, PHP, Java support
• Caching, Replication
• Many, many nice features that Lucene users need
• http://lucene.apache.org/solr
6. Search Basics
• Goal: Identify documents that are similar to the input query
• Lucene uses a modified Vector Space Model (VSM)
– Boolean + VSM
– TF-IDF
– The words in the document and the query each define a vector in an n-dimensional space
– Sim(q1, d1) = cos Θ
– In Lucene, the boolean approach restricts which documents to score
• Vector definitions:
– dj = <w1,j, w2,j, …, wn,j>
– q = <w1,q, w2,q, …, wn,q>
– w = weight assigned to term
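The boolean + VSM combination above can be sketched in a few lines. This is a toy model for illustration (plain Python, not Lucene code, and a simplified weighting formula): the boolean step restricts the candidate set to documents containing every query term, then TF-IDF cosine similarity scores the survivors.

```python
import math
from collections import Counter

def search(query, docs):
    """Toy boolean + VSM ranking: query and each doc are token lists.

    Boolean step: keep only docs containing every query term.
    Scoring step: cosine similarity between TF-IDF vectors.
    """
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))          # document frequency
    idf = {t: math.log(1 + n / df[t]) for t in df}         # smoothed, always > 0

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(query)
    hits = [(cosine(q, vec(d)), i)
            for i, d in enumerate(docs)
            if all(t in d for t in query)]                 # boolean restriction
    return sorted(hits, reverse=True)
```

A document that repeats a query term gets a higher TF weight and therefore a higher cosine score, while documents missing any query term never get scored at all.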
7. Indexing
• Process of preparing and adding text to Lucene
– Optimized for searching
• Key Point: Lucene only indexes Strings
– What does this mean?
• Lucene doesn’t care about XML, Word, PDF, etc.
– There are many good open source extractors available
• It’s our job to convert whatever file format we have into something Lucene can use
8. Indexing Classes
• Analyzer
– Creates tokens using a Tokenizer and filters them through zero or more TokenFilters
• IndexWriter
– Responsible for converting text into the internal Lucene format
9. Indexing Classes
• Directory
– Where the Index is stored
– RAMDirectory, FSDirectory, others
• Document
– A collection of Fields
– Can be boosted
• Field
– Free text, keywords, dates, etc.
– Defines attributes for storing, indexing
– Can be boosted
– Field Constructors and parameters
• Open up Fieldable and Field in IDE
10. How to Index
• Create IndexWriter
• For each input
– Create a Document
– Add Fields to the Document
– Add the Document to the IndexWriter
• Close the IndexWriter
• Optimize (optional)
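The steps above can be modeled with a toy in-memory writer. This is a Python sketch of the control flow only, not the Lucene API; the class and method names here are invented for illustration:

```python
class ToyIndexWriter:
    """Mimics the create / add documents / close workflow of an index writer."""
    def __init__(self):
        self.inverted = {}   # (field, term) -> set of doc ids
        self.docs = []
        self.closed = False

    def add_document(self, fields):
        """fields: dict mapping field name -> text (Lucene only indexes Strings)."""
        assert not self.closed, "writer already closed"
        doc_id = len(self.docs)
        self.docs.append(fields)
        for field, text in fields.items():
            for term in text.lower().split():   # stand-in for real analysis
                self.inverted.setdefault((field, term), set()).add(doc_id)
        return doc_id

    def close(self):
        self.closed = True

# Create the writer, add a Document per input, then close
writer = ToyIndexWriter()
for title, body in [("Stanley Cup", "hockey finals"), ("World Series", "baseball finals")]:
    writer.add_document({"title": title, "body": body})
writer.close()
```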
11. Task 1.a
• From the Boot Camp Files, use the basic.ReutersIndexer skeleton to start
• Index the small Reuters Collection using the IndexWriter, a Directory and StandardAnalyzer
– Boost every 10 documents by 3
• Questions to Answer:
– What Fields should I define?
– What attributes should each Field have?
• What Fields should OMIT_NORMS?
– Pick a field to boost and give a reason why you think it should be boosted
13. Searching
• Key Classes:
– Searcher
• Provides methods for searching
• Take a moment to look at the Searcher class declaration
• IndexSearcher, MultiSearcher, ParallelMultiSearcher
– IndexReader
• Loads a snapshot of the index into memory for searching
– Hits
• Storage/caching of results from searching
– QueryParser
• JavaCC grammar for creating Lucene Queries
• http://lucene.apache.org/java/docs/queryparsersyntax.html
– Query
• Logical representation of program’s information need
14. Query Parsing
• Basic syntax:
title:hockey +(body:stanley AND body:cup)
• OR/AND must be uppercase
• Default operator is OR (can be changed)
• Supports fairly advanced syntax, see the website
– http://lucene.apache.org/java/docs/queryparsersyntax.html
• Doesn’t always play nice, so beware
– Many applications construct queries programmatically or restrict syntax
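A restricted, programmatic parser as mentioned above can be very small. This sketch (illustrative Python, not the JavaCC-based Lucene QueryParser) handles only `field:term` clauses with `+`/`-` prefixes and a default field, deliberately ignoring grouping, AND/OR, and phrases:

```python
def parse_query(s, default_field="body"):
    """Parse a tiny subset of Lucene query syntax into (occur, field, term)
    clauses, where occur is '+' (required), '-' (prohibited), or '' (should)."""
    clauses = []
    for token in s.split():
        occur = ""
        if token[0] in "+-":
            occur, token = token[0], token[1:]
        field, _, term = token.rpartition(":")   # empty field -> use default
        clauses.append((occur, field or default_field, term.lower()))
    return clauses
```

Restricting the grammar this way sidesteps most of the "doesn't always play nice" cases, at the cost of expressiveness.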
15. Task 1.b
• Using the ReutersIndexerTest.java skeleton in the boot camp files
– Search your newly created index using queries you develop
– Delete a Document by the doc id
• Hints:
– Use an IndexSearcher
– Create a Query using the QueryParser
– Display the results from the Hits
• Questions:
– What is the default field for the QueryParser?
– What Analyzer to use?
16. Task 1 Results
• Locks
– Lucene maintains locks on files to prevent index corruption
– Located in the same directory as the index
• Scores from Hits are normalized
– Scores across queries are NOT comparable
• Lucene 2.3 has some transactional semantics for indexing, but it is not a DB
17. Deletion and Updates
• Deletions can be a bit confusing
– Both IndexReader and IndexWriter have delete methods
• Updates are always a delete and an add
• Updates are always a delete and an add
– Yes, that is a repeat!
– Nature of the data structures used in search
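Why an update must be a delete plus an add can be seen with a minimal posting-list model (an illustrative Python sketch, not Lucene internals): terms point at doc ids, so changing a document means removing its id from every old posting list and then re-indexing the new tokens.

```python
def update(postings, docs, doc_id, new_text):
    """Replace docs[doc_id] in a toy inverted index: delete, then add."""
    for term in set(docs[doc_id].split()):      # delete: strip old postings
        postings[term].discard(doc_id)
        if not postings[term]:
            del postings[term]
    docs[doc_id] = new_text                     # add: index the new version
    for term in set(new_text.split()):
        postings.setdefault(term, set()).add(doc_id)

# Build a small index, then "update" a document
docs = {1: "stanley cup", 2: "world series"}
postings = {}
for i, text in docs.items():
    for t in text.split():
        postings.setdefault(t, set()).add(i)
update(postings, docs, 1, "stanley cup finals")
```

There is no in-place edit path: the index is organized term-first, not document-first, so the only way to find everything a document contributed is to walk its old terms and delete them.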
18. Analysis
• Analysis is the process of creating Tokens to be indexed
• Analysis is usually done to improve results overall, but it comes with a price
• Lucene comes with many different Analyzers, Tokenizers and TokenFilters, each with their own goals
– See contrib/analyzers
• StandardAnalyzer is included with the core JAR and does a good job for most English and Latin-based tasks
• Oftentimes you want the same content analyzed in different ways
• Consider a catch-all Field in addition to other Fields
20. Indexing in a Nutshell
• For each Document
– For each Field to be tokenized
• Create the tokens using the specified Tokenizer
– Tokens consist of a String, position, type and offset information
• Pass the tokens through the chained TokenFilters, where they can be changed or removed
• Add the end result to the inverted index
• Position information can be altered
– Useful when removing words or to prevent phrases from matching
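The chain above can be sketched as composable generator "filters" (a Python toy of the Tokenizer/TokenFilter idea, not the Lucene classes). Note how the stop filter keeps positions honest by leaving a gap rather than renumbering, which is what prevents false phrase matches across a removed word:

```python
STOPWORDS = {"the", "a", "of"}

def tokenize(text):
    """Tokenizer: yield (term, position) pairs."""
    for pos, term in enumerate(text.split()):
        yield term, pos

def lowercase_filter(tokens):
    """TokenFilter: change tokens in the stream."""
    for term, pos in tokens:
        yield term.lower(), pos

def stop_filter(tokens):
    """TokenFilter: remove tokens but preserve position gaps."""
    for term, pos in tokens:
        if term not in STOPWORDS:
            yield term, pos

def analyze(text):
    return list(stop_filter(lowercase_filter(tokenize(text))))
```

For "The Stanley Cup of Hockey", "cup" ends up at position 2 and "hockey" at position 4, so an exact phrase query for "cup hockey" would not match across the removed stopword.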
22. Tokenization
• Split words into Tokens to be processed
• Tokenization is fairly straightforward for most languages that use a space for word segmentation
– More difficult for some East Asian languages
– See the CJK Analyzer
23. Modifying Tokens
• TokenFilters are used to alter the token stream to be indexed
• Common tasks:
– Remove stopwords
– Lower case
– Stem/Normalize -> Wi-Fi -> Wi Fi
– Add Synonyms
• StandardAnalyzer does things that you may not want
24. Custom Analyzers
• Solution: write your own Analyzer
• Better solution: write a configurable Analyzer so you only need one Analyzer that you can easily change for your projects
– See Solr
• Tokenizers and TokenFilters must be newly constructed for each input
25. Special Cases
• Dates and numbers need special treatment to be searchable
– o.a.l.document.DateTools
– org.apache.solr.util.NumberUtils
• Altering Position Information
– Increase the position gap between sentences to prevent phrases from crossing sentence boundaries
– Index synonyms at the same position so a query can match regardless of which synonym is used
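Why numbers need special treatment: the index compares terms as strings, so "9" sorts after "10". A common workaround, sketched here in Python to show the idea behind utilities like NumberUtils (this is not the actual encoding those classes use), is zero-padding to a fixed width so lexicographic order matches numeric order:

```python
def encode_number(n, width=12):
    """Encode a non-negative int so lexicographic order == numeric order."""
    return str(n).zfill(width)

values = [9, 10, 2, 100]
raw_order = sorted(str(v) for v in values)       # string sort: "10" comes before "2"
encoded_order = [int(s) for s in sorted(encode_number(v) for v in values)]
```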
27. Indexing Performance
• Behind the Scenes
– Lucene indexes Documents into memory
– At certain trigger points, in-memory segments are flushed to the Directory
– Segments are periodically merged
• Lucene 2.3 has significant performance improvements
28. IndexWriter Performance Factors
• maxBufferedDocs
– Minimum # of docs before a merge occurs and a new segment is created
– Usually, larger == faster, but more RAM
• mergeFactor
– How often segments are merged
– Smaller == less RAM, better for incremental updates
– Larger == faster, better for batch indexing
• maxFieldLength
– Limit the number of terms in a Document
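The interplay of these knobs can be sketched with a toy segment model (illustrative Python with invented names; real Lucene merge policies are considerably more sophisticated): documents buffer in RAM until the buffer limit is hit, are flushed as a new segment, and once enough segments accumulate they merge into one.

```python
class ToySegmentWriter:
    """Buffer docs in RAM, flush to segments, merge segments periodically."""
    def __init__(self, max_buffered_docs=2, merge_factor=3):
        self.max_buffered_docs = max_buffered_docs  # flush trigger: RAM vs speed
        self.merge_factor = merge_factor            # merge trigger: RAM vs speed
        self.buffer = []
        self.segments = []                          # each segment: list of docs

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered_docs:
            self.flush()

    def flush(self):
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []
        if len(self.segments) >= self.merge_factor:  # simplified merge policy
            self.segments = [[d for seg in self.segments for d in seg]]

writer = ToySegmentWriter()
for doc in range(6):
    writer.add(doc)
```

Even in this toy you can see the trade-offs: a larger buffer means fewer flushes (faster, more RAM), and a larger merge factor means fewer, bigger merges (good for batch indexing, worse for incremental updates).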
29. Lucene 2.3 IndexWriter Changes
• setRAMBufferSizeMB
– New model for automagically controlling indexing factors based on the amount of memory in use
– Obsoletes setMaxBufferedDocs and setMergeFactor
• Takes storage and term vectors out of the merge process
• Turn off auto-commit if there are stored fields and term vectors
• Provides a significant performance increase
30. Index Threading
• IndexWriter and IndexReader are thread-safe
and can be shared between threads without
external synchronization
• One open IndexWriter per Directory
• Parallel Indexing
– Index to separate Directory instances
– Merge using IndexWriter.addIndexes
– Could also distribute and collect
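The parallel-indexing pattern above might be sketched like this: each worker thread fills its own Directory, then the parts are merged in a single call.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class MergeExample {
  // "parts" were filled by separate worker threads, one Directory each
  static void merge(Directory finalDir, Directory[] parts) throws IOException {
    IndexWriter writer = new IndexWriter(finalDir, new StandardAnalyzer(), true);
    writer.addIndexes(parts); // optimizes, then merges the part indexes in
    writer.close();
  }
}
```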
31. Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2
and trunk (2.3)
– contrib/benchmark/conf:
• indexing.alg
• indexing-multithreaded.alg
• Info:
– Mac Pro 2 x 2GHz Dual-Core Xeon
– 4 GB RAM
– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
33. Searching
• Earlier we touched on basics of search
using the QueryParser
• Now look at:
– Searcher/IndexReader Lifecycle
– Query classes
– More details on the QueryParser
– Filters
– Sorting
34. Lifecycle
• Recall that the IndexReader loads a snapshot
of the index into memory
– This means updates made since loading the index will
not be seen
• Business rules are needed to define how often to
reload the index, if at all
– IndexReader.isCurrent() can help
• Loading an index is an expensive operation
– Do not open a Searcher/IndexReader for every
search
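One way to apply these lifecycle rules is a holder that shares a single Searcher and reopens it only when `isCurrent()` reports staleness. This is a simplified sketch: it closes the old searcher immediately, which assumes no searches are still in flight.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class SearcherHolder {
  private final Directory dir;
  private IndexSearcher searcher;

  public SearcherHolder(Directory dir) throws IOException {
    this.dir = dir;
    this.searcher = new IndexSearcher(IndexReader.open(dir));
  }

  // Call on whatever schedule the business rules dictate
  public synchronized IndexSearcher getSearcher() throws IOException {
    if (!searcher.getIndexReader().isCurrent()) {
      IndexSearcher fresh = new IndexSearcher(IndexReader.open(dir));
      searcher.close(); // assumes in-flight searches have finished
      searcher = fresh;
    }
    return searcher;
  }
}
```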
35. Query Classes
• TermQuery is the basis for all non-span queries
• BooleanQuery combines multiple Query
instances as clauses
– should
– required
• PhraseQuery finds terms occurring near each
other, position-wise
– “slop” is the allowed positional edit distance
between the terms
• Take 2-3 minutes to explore Query
implementations
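A small sketch combining the classes above; the "body"/"title" field names and terms are hypothetical.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExamples {
  static Query build() {
    BooleanQuery bq = new BooleanQuery();
    // required clause
    bq.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST);
    // optional clause
    bq.add(new TermQuery(new Term("title", "tutorial")), BooleanClause.Occur.SHOULD);

    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("body", "open"));
    pq.add(new Term("body", "source"));
    pq.setSlop(1); // allow one positional move between the terms
    bq.add(pq, BooleanClause.Occur.SHOULD);
    return bq;
  }
}
```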
36. Spans
• Spans provide information about where
matches took place
• Not supported by the QueryParser
• Can be used in BooleanQuery clauses
• Take 2-3 minutes to explore SpanQuery
classes
– SpanNearQuery useful for doing phrase
matching
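A sketch of SpanNearQuery used for phrase matching (field name and terms are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanExample {
  static SpanNearQuery bootCamp() {
    SpanQuery[] clauses = new SpanQuery[] {
      new SpanTermQuery(new Term("body", "boot")),
      new SpanTermQuery(new Term("body", "camp"))
    };
    // slop 0, in order: matches like an exact phrase, but the
    // resulting spans report where each match occurred
    return new SpanNearQuery(clauses, 0, true);
  }
}
```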
37. QueryParser
• MultiFieldQueryParser
• Boolean operators cause confusion
– Better to think in terms of required (+ operator) and not
allowed (- operator)
• Check JIRA for QueryParser issues
• http://www.gossamer-threads.com/lists/lucene/java-user/40945
• Most applications either modify QP, create their
own, or restrict to a subset of the syntax
• Your users may not need all the “flexibility” of
the QP
38. Sorting
• Lucene default sort is by score
• Searcher has several methods that take in a
Sort object
• Sorting should be addressed during indexing
• Sorting is done on Fields containing a single
term that can be used for comparison
• The SortField defines the different sort types
available
– AUTO, STRING, INT, FLOAT, CUSTOM, SCORE,
DOC
39. Sorting II
• Look at Searcher, Sort and
SortField
• Custom sorting is done with a
SortComparatorSource
• Sorting can be very expensive
– Terms are cached in the FieldCache
• SortFilterTest.java example
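Putting Searcher, Sort, and SortField together might look like this sketch (the untokenized "date" field is an assumption about the index):

```java
import java.io.IOException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortExample {
  static Hits search(Searcher searcher, Query query) throws IOException {
    // Sort by "date" descending, ties broken by relevance score
    Sort sort = new Sort(new SortField[] {
      new SortField("date", SortField.STRING, true),
      SortField.FIELD_SCORE
    });
    return searcher.search(query, sort);
  }
}
```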
40. Filters
• Filters restrict the search space to a
subset of Documents
• Use Cases
– Search within a Search
– Restrict by date
– Rating
– Security
– Author
41. Filter Classes
• QueryWrapperFilter (QueryFilter)
– Restrict to subset of Documents that match a Query
• RangeFilter
– Restrict to Documents that fall within a range
– Better alternative to RangeQuery
• CachingWrapperFilter
– Wrap another Filter and provide caching
• SortFilterTest.java example
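The filter classes above compose; this sketch caches a one-day date restriction (`19870226` in the `yyyyMMdd` convention, a hypothetical field). The cache is keyed per IndexReader, so the wrapped filter pays its cost once per reader rather than once per query.

```java
import java.io.IOException;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.Searcher;

public class FilterExample {
  // Restrict to one day, inclusive on both ends, and cache the bits
  static final Filter DATE_FILTER = new CachingWrapperFilter(
      new RangeFilter("date", "19870226", "19870226", true, true));

  static Hits search(Searcher searcher, Query query) throws IOException {
    return searcher.search(query, DATE_FILTER);
  }
}
```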
42. Expert Results
• Searcher has several “expert” methods
– Hits is not always what you need due to:
• Caching
• Normalized Scores
• Re-execution of the Query as results are accessed
• HitCollector allows low-level access to all
Documents as they are scored
• TopDocs represents top n docs that match
– TopDocsTest in examples
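Both expert paths in one sketch: TopDocs for the top n with raw scores, and a HitCollector that sees every match as it is scored.

```java
import java.io.IOException;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;

public class ExpertSearchExample {
  static void run(Searcher searcher, Query query) throws IOException {
    // Top 10 by raw score: no caching, no re-execution, no normalization
    TopDocs top = searcher.search(query, null, 10);
    for (int i = 0; i < top.scoreDocs.length; i++) {
      System.out.println(top.scoreDocs[i].doc + " " + top.scoreDocs[i].score);
    }
    // Or visit every match as it is scored:
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        // called once per matching document, in doc-id order
      }
    });
  }
}
```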
43. Searchers
• MultiSearcher
– Search over multiple Searchables, including remote
• MultiReader
– Not a Searcher, but can be used with
IndexSearcher to achieve the same results
for local indexes
• ParallelMultiSearcher
– Like MultiSearcher, but threaded
• RemoteSearchable
– RMI based remote searching
• Look at MultiSearcherTest in example
code
44. Search Performance
• Search speed is based on a number of factors:
– Query Type(s)
– Query Size
– Analysis
– Occurrences of Query Terms
– Optimize
– Index Size
– Index type (RAMDirectory, other)
– Usual Suspects
• CPU
• Memory
• I/O
• Business Needs
45. Query Types
• Be careful with WildcardQuery as it rewrites
to a BooleanQuery containing all the terms
that match the wildcards
• Avoid starting a WildcardQuery with a wildcard
• Use ConstantScoreRangeQuery instead of
RangeQuery
• Be careful with range queries and dates
– User mailing list and Wiki have useful tips for
optimizing date handling
46. Query Size
• Stopword removal
• Search an “all” field instead of many fields with the same
terms
• Disambiguation
– May be useful when doing synonym expansion
– Difficult to automate and may be slower
– Some applications may allow the user to disambiguate
• Relevance Feedback/More Like This
– Use most important words
– “Important” can be defined in a number of ways
47. Usual Suspects
• CPU
– Profile your application
• Memory
– Examine your heap size, garbage collection approach
• I/O
– Cache your Searcher
• Define business logic for refreshing based on indexing needs
– Warm your Searcher before going live -- See Solr
• Business Needs
– Do you really need to support Wildcards?
– What about date range queries down to the millisecond?
48. Explanations
• explain(Query, int) method is
useful for understanding why a Document
scored the way it did
• ExplainsTest in sample code
• Open Luke and try some queries and then
use the “explain” button
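A sketch of `explain` in use: fetch the top hit, then print the nested score breakdown (tf, idf, boosts) for it.

```java
import java.io.IOException;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;

public class ExplainExample {
  static void explainTopHit(Searcher searcher, Query query) throws IOException {
    TopDocs top = searcher.search(query, null, 1);
    if (top.totalHits > 0) {
      Explanation expl = searcher.explain(query, top.scoreDocs[0].doc);
      System.out.println(expl.toString()); // nested tf/idf/boost breakdown
    }
  }
}
```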
49. FieldSelector
• Prior to version 2.1, Lucene always loaded all
Fields in a Document
• FieldSelector API addition allows Lucene to
skip large Fields
– Options: Load, Lazy Load, No Load, Load and Break,
Load for Merge, Size, Size and Break
• Makes storage of original content more viable
without the large cost of loading it when not used
• FieldSelectorTest in example code
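A sketch of the FieldSelector API ("title" is a hypothetical field): only the selected field is loaded, so large stored fields such as the full body text are never read off disk.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

public class FieldSelectorExample {
  static String title(IndexReader reader, int docId) throws IOException {
    // Load only "title"; other stored fields are skipped entirely
    FieldSelector selector = new MapFieldSelector(new String[] { "title" });
    Document doc = reader.document(docId, selector);
    return doc.get("title");
  }
}
```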
50. Scoring and Similarity
• Lucene has sophisticated scoring
mechanism designed to meet most needs
• Has hooks for modifying scores
• Scoring is handled by the Query, Weight
and Scorer classes
51. Affecting Relevance
• FunctionQuery from Solr (variation in
Lucene)
• Override Similarity
• Implement own Query and related classes
• Payloads
• HitCollector
• Take 5 to examine these
54. Next Up
• Dealing with Content
– File Formats
– Extraction
• Large Task
• Miscellaneous
• Wrapping Up
55. File Formats
• Several open source libraries, projects for extracting content to use in
Lucene
– PDF: PDFBox
• http://www.pdfbox.org/
– Word: POI, Open Office, TextMining
• http://www.textmining.org/textmining.zip
– XML: SAX or Pull parser
– HTML: Neko, Jtidy
• http://people.apache.org/~andyc/neko/doc/html/
• http://jtidy.sourceforge.net/
• Tika
– http://incubator.apache.org/tika/
• Aperture
– http://aperture.sourceforge.net
56. Aperture Basics
• Crawlers
• Data Connectors
• Extraction Wrappers
– POI, PDFBox, HTML, XML, etc.
• http://aperture.wiki.sourceforge.net/Extractors
will give you info on what comes back from
Aperture
• LuceneApertureCallbackHandler
in example code
57. Large Task
• Using the skeleton files in the
com.lucenebootcamp.training.full package:
– Get some content:
• Web, file system
• Different file formats
– Index it
• Plan out your fields, boosts, field properties
• Support updates and deletes
• Optional:
– How fast can you make it go? Divide and conquer?
Multithreaded?
58. Large Task
• Search Content
– Allow for arbitrary user queries across multiple
Fields via command line or simple web interface
– How fast can you make it?
• Support:
– Sort
– Filter
– Explains
• How much slower is it to retrieve an explanation?
59. Large Task
• Document Retrieval
– Display/write out one or more documents
– Support FieldSelector
60. Large Task
• Optional Tasks
– Hit Highlighting using contrib/Highlighter
– Multithreaded indexing and Search
– Explore other Field construction options
• Binary fields, term vectors
– Use Lucene trunk version and try out some of the
changes in indexing
– Try out Solr or Nutch at http://lucene.apache.org/
• What do they offer that Lucene Java doesn’t that you might
need?
61. Large Task Metadata
– Pair up if you want
– Ask questions
– 2 hours
– Use Luke to check your index!
– Explore other parts of Lucene that you are
interested in
– Be prepared to discuss/share with the class
63. Term Information
• TermEnum gives access to terms and how many
Documents they occur in
– IndexReader.terms()
– IndexReader.termPositions()
• TermDocs gives access to the frequency of a
term in a Document
– IndexReader.termDocs()
• Term Vectors give access to term frequency
information in a given Document
– IndexReader.getTermFreqVector
• TermsTest in sample code
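The TermEnum/TermDocs path might be sketched like this: walk every term, report its document frequency, and sum its within-document frequencies.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class TermDumpExample {
  static void dump(IndexReader reader) throws IOException {
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term term = terms.term();
      int totalFreq = 0;
      TermDocs docs = reader.termDocs(term);
      while (docs.next()) {
        totalFreq += docs.freq(); // occurrences within one document
      }
      docs.close();
      System.out.println(term.field() + ":" + term.text()
          + " docFreq=" + terms.docFreq() + " totalFreq=" + totalFreq);
    }
    terms.close();
  }
}
```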
64. Lucene Contributions
• Many people have generously contributed code to
help solve common problems
• These are in contrib directory of the source
• Popular:
– Analyzers
– Highlighter
– Queries and MoreLikeThis
– Snowball Stemmers
– Spellchecker
65. Open Discussion
• Multilingual Best Practices
– UNICODE
– One Index versus many
• Advanced Analysis
• Distributed Lucene
• Crawling
• Hadoop
• Nutch
• Solr
66. Resources
• http://lucene.apache.org/
• http://en.wikipedia.org/wiki/Vector_space_model
• Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
• Lucene In Action by Hatcher and Gospodnetić
• Wiki
• Mailing Lists
– java-user@lucene.apache.org
• Discussions on how to use Lucene
– java-dev@lucene.apache.org
• Discussions on how to develop Lucene
• Issue Tracking
– https://issues.apache.org/jira/secure/Dashboard.jspa
• We always welcome patches
– Ask on the mailing list before reporting a bug
68. Finally…
• Please take the time to fill out a survey to
help me improve this training
– Located in base directory of source
– Email it to me at trainer@lucenebootcamp.com
• There are several Lucene related talks on
Friday
70. Task 2
• Take 10-15 minutes, pair up, and write an
Analyzer and Unit Test
– Examine results in Luke
– Run some searches
• Ideas:
– Combine existing Tokenizers and TokenFilters
– Normalize abbreviations
– Filter out all words beginning with the letter A
– Identify/Mark sentences
• Questions:
– What would help improve search results?
71. Task 2 Results
• Share what you did and why
• Improving Results (in most cases)
– Stemming
– Ignore Case
– Stopword Removal
– Synonyms
– Pay attention to business needs
72. Grab Bag
• Accessing Term Information
– TermEnum
– TermDocs
– Term Vectors
• FieldSelector
• Scoring and Similarity
• File Formats
73. Task 6
• Count and print all the unique terms in the
index and their frequencies
– Notes:
• Half of the class write it using TermEnum and
TermDocs
• Other Half write it using Term Vectors
• Time your Task
• Only count the title and body content
74. Task 6 Results
• Term Vector approach is faster on smaller
collections
• TermEnum approach is faster on larger
collections
75. Task 4
• Re-index your collection
– Add in a “rating” field that randomly assigns a number
between 0 and 9
• Write searches to sort by
• Date
• Title
• Rating, Date, Doc Id
• A Custom Sort
• Questions
– How to sort the title?
– How to sort multiple Fields?
77. Task 5
• Create and search using Filters to:
– Restrict to all docs written on Feb. 26, 1987
– Restrict to all docs with the word “computer”
in title
• Also:
– Create a Filter where the length of the body +
title is greater than X
78. Task 5 Results
• Solr has more advanced Filter
mechanisms that may be worth using
• Cache filters
79. Task 7
• Pair up if you like and take 30-40 minutes to:
– Pick two file formats to work on
– Identify content in that format
• Can you index contents on your hard drive?
• Project Gutenberg, Creative Commons, Wikipedia
• Combine w/ Reuters collection
– Extract the content and index it using the appropriate
library
– Store the content as a Field
– Search the content
– Load Documents with and without
FieldSelector and measure performance
80. Task 7 (cont.)
• Include score and explanation in results
• Dump results to XML or HTML
• Be prepared to share with class what you did
– What libraries did you use?
– What content did you use?
– What is your Document structure?
– What issues did you have?