These are the slides for the session I presented at SoCal Code Camp Los Angeles on October 14, 2012.
http://www.socalcodecamp.com/session.aspx?sid=a4774b3c-7a2d-45db-8721-f54c5a314e17
Introduction to search engine-building with Lucene
Kai Chan
1. Introduction to Search Engine-Building with Lucene
Kai Chan
SoCal Code Camp, October 2012
2. How to Search
• One (common) approach to searching all your documents:
for each document d {
    if (query is a substring of d's content) {
        add d to the list of results
    }
}
sort the results (or not)
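The scan above can be sketched as runnable code. This is a minimal Java illustration of the naive approach; the class and method names are invented for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Naive search: scan every document and test for a substring match.
// Cost is proportional to the total size of the collection on every
// query, which is exactly the scalability problem described next.
public class NaiveSearch {
    public static List<Integer> search(List<String> docs, String query) {
        List<Integer> results = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            if (docs.get(d).contains(query)) {
                results.add(d); // add d to the list of results
            }
        }
        return results; // unsorted: no notion of relevance
    }
}
```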
3. How to Search
• Problems
– Slow: reads the whole database for each search
– Not scalable: if your database grows by 10x, your search slows down by 10x
– How to show the most relevant documents first?
4. Inverted Index
• (term -> document list) map
Documents:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Inverted index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
5. Inverted Index
• (term -> <document, position> list) map
T0 = "it is what it is" (positions 0 1 2 3 4)
T1 = "what is it" (positions 0 1 2)
T2 = "it is a banana" (positions 0 1 2 3)
6. Inverted Index
• (term -> <document, position> list) map
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
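A positional inverted index like the one above can be built with a simple map. The following is a hedged Java sketch, assuming whitespace tokenization only; the class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Build a positional inverted index: term -> list of (docId, position)
// pairs, matching the T0/T1/T2 example above.
public class PositionalIndex {
    public static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new LinkedHashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                // append this (docId, pos) occurrence to the term's posting list
                index.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                     .add(new int[] { docId, pos });
            }
        }
        return index;
    }
}
```

With the three example documents, `build` yields "banana" -> {(2, 3)} and "is" -> {(0, 1), (0, 4), (1, 1), (2, 1)}, as on the slide.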
7. Inverted Index
• Speed
– Term list
• Very small compared to documents' content
• Tends to grow at a slower speed than documents (after a certain level)
– Term lookup
• O(1) to O(log of the number of terms)
– For a particular term:
• Document lists: very small
• Document + position lists: still small
– Few terms per query
8. Inverted Index
• Relevance
– Extra information in the index
• Stored in an easily accessible way
• Used to determine the relevance of each document to the query
– Enables sorting by (decreasing) relevance
9. Determining Relevancy
• Two models used in the searching process
– Boolean model
• AND, OR, NOT, etc.
• Either a document matches a query, or not
– Vector space model
• How often a query term appears in a document vs. how often the term appears in all documents
• Scoring and sorting by relevancy possible
10. Determining Relevancy
• Lucene uses both models:
all documents
→ filtering (Boolean Model)
→ some documents (unsorted)
→ scoring (Vector Space Model)
→ some documents (sorted by score)
12. Scoring
• Term frequency (TF)
– How many times does this term (t) appear in this document (d)?
– Score proportional to TF
• Document frequency (DF)
– How many documents have this term (t)?
– Score proportional to the inverse of DF (IDF)
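To make TF and IDF concrete, here is a toy scoring sketch. It is deliberately simplified (score = tf · idf, with idf = log(N / DF)) and is not Lucene's actual practical scoring function; the names are illustrative:

```java
import java.util.List;

// Toy TF-IDF: a term contributes more when it is frequent in this
// document (TF) and rare across the whole collection (IDF).
public class TfIdf {
    // tf(t, d): number of times term t appears in document d
    public static long tf(String term, String doc) {
        return List.of(doc.split("\\s+")).stream().filter(term::equals).count();
    }

    // idf(t) = log(N / df(t)); rarer terms get a higher weight
    public static double idf(String term, List<String> docs) {
        long df = docs.stream()
                      .filter(d -> List.of(d.split("\\s+")).contains(term))
                      .count();
        return Math.log((double) docs.size() / df);
    }

    public static double score(String term, String doc, List<String> docs) {
        return tf(term, doc) * idf(term, docs);
    }
}
```

Note that a term appearing in every document (like "is" in the T0/T1/T2 example) gets idf = log(1) = 0, so it contributes nothing to the score.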
13. Scoring
• Coordination factor (coord)
– Documents that contain all or most query terms get higher scores
• Normalizing factor (norm)
– Adjusts for field length and query complexity
14. Scoring
• Boost
– "Manual override": ask Lucene to give a higher score to some particular thing
– Index-time
• Document
• Field (of a particular document)
– Search-time
• Query
15. Scoring
score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · boost(t) · norm(t, d) )
– coord(q, d): coordination factor
– queryNorm(q): query normalizing factor
– tf(t in d): term frequency
– idf(t): inverse document frequency
– boost(t): term boost
– norm(t, d): document boost, field boost, and length normalizing factor
http://lucene.apache.org/core/3_6_0/scoring.html
16. Work Flow
• Indexing
– Index: storage of inverted index + documents
– Add fields to a document
– Add the document to the index
– Repeat for every document
• Searching
– Generate a query
– Search with this query
– Get back a sorted document list (top N docs)
17. Adding Field to Document
• Store?
• Index?
– Analyzed (split text into multiple terms)
– Not analyzed (treat the whole text as ONE term)
– Not indexed (this field will not be searchable)
– Store norms?
18. Analyzed vs. Not Analyzed
Text: "the quick brown fox"
Analyzed: 4 terms
1. the
2. quick
3. brown
4. fox
Not analyzed: 1 term
1. the quick brown fox
19. Index-time Analysis
• Analyzer
– Determine which TokenStream classes to use
• TokenStream
– Does the actual hard work
– Tokenizer: text to tokens
– Token filter: tokens to tokens
23. Attributes
• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
– Efficiency, flexibility
– Ask for the attributes you want
– Receive attribute objects
– Use these objects for information about tokens
24.
// create token stream
TokenStream tokenStream =
    analyzer.reusableTokenStream(fieldName, reader);
tokenStream.reset();

// obtain each attribute you want to know about
CharTermAttribute term =
    tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offset =
    tokenStream.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc =
    tokenStream.addAttribute(PositionIncrementAttribute.class);

// go to the next token
while (tokenStream.incrementToken()) {
    // use information about the current token
    doSomething(term.toString(),
                offset.startOffset(),
                offset.endOffset(),
                posInc.getPositionIncrement());
}

// end and close token stream
tokenStream.end();
tokenStream.close();
25. Query-time Analysis
• Text in a query is analyzed like fields
• Use the same analyzer that analyzed the particular field
+field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")
Analyzed terms: quick brown fox / lazy dog / cozy cat
26. Query Formation
• Query parsing
– A query parser in core code
– Additional query parsers in contributed code
• Or build query from the Lucene query classes
28. Term Range Query
• Matches documents with any of the terms in a particular range
– Field
– Lowest term text
– Highest term text
– Include lowest term text?
– Include highest term text?
29. Prefix Query
• Matches documents with any of the terms with a particular prefix
– Field
– Prefix
30. Wildcard/Regex Query
• Matches documents with any of the terms that match a particular pattern
– Field
– Pattern
• Wildcard: * for 0 or more characters, ? for exactly 1 character
• Regular expression
• Pattern matching on individual terms only
31. Fuzzy Query
• Matches documents with any of the terms that are "similar" to a particular term
– Levenshtein distance ("edit distance"): number of character insertions, deletions, or substitutions needed to transform one string into another
• e.g. kitten -> sitten -> sittin -> sitting (3 edits)
– Field
– Text
– Minimum similarity score
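Levenshtein distance itself is straightforward to compute with dynamic programming. A self-contained sketch (not Lucene's optimized implementation):

```java
// Levenshtein ("edit") distance: minimum number of single-character
// insertions, deletions, or substitutions to turn one string into another.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                            d[i][j - 1] + 1),   // insertion
                                   d[i - 1][j - 1] + sub);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

For the slide's example, distance("kitten", "sitting") is 3.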
32. Phrase Query
• Matches documents with all the given words present and being "near" each other
– Field
– Terms
– Slop
• Number of "moves of words" permitted
• Slop = 0 means exact phrase match required
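With positions available, an exact phrase match (slop = 0) amounts to finding the phrase terms at consecutive positions. A simplified sketch over tokenized text, assuming whitespace tokenization and invented names:

```java
// Exact phrase match (slop = 0): every phrase term must appear in the
// document, at consecutive positions. A positional inverted index lets
// a search engine perform this check without rescanning document text;
// here we scan the tokens directly to keep the sketch self-contained.
public class PhraseMatch {
    public static boolean matches(String doc, String... phrase) {
        String[] terms = doc.split("\\s+");
        outer:
        for (int start = 0; start + phrase.length <= terms.length; start++) {
            for (int k = 0; k < phrase.length; k++) {
                if (!terms[start + k].equals(phrase[k])) continue outer;
            }
            return true; // found the phrase at consecutive positions
        }
        return false;
    }
}
```

A nonzero slop would relax the "consecutive positions" condition, allowing a bounded number of position moves between the terms.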
33. Boolean Query
• Conceptually similar to boolean operators ("AND", "OR", "NOT"), but not identical
• Why not AND, OR, and NOT?
– http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
– In short, boolean operators do not handle > 2 clauses well
34. Boolean Query
• Three types of clauses
– Must
– Should
– Must not
• For a boolean query to match a document
– All "must" clauses must match
– All "must not" clauses must not match
– At least one "must" or "should" clause must match
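The matching rules above can be expressed directly. A minimal sketch over sets of document terms; the names are invented for illustration and this is not the Lucene BooleanQuery API:

```java
import java.util.List;
import java.util.Set;

// Boolean clause semantics: all MUST clauses match, no MUST_NOT clause
// matches, and at least one MUST or SHOULD clause matches.
public class BooleanMatch {
    public static boolean matches(Set<String> docTerms,
                                  List<String> must,
                                  List<String> should,
                                  List<String> mustNot) {
        if (!docTerms.containsAll(must)) return false;       // every MUST matches
        for (String t : mustNot) {
            if (docTerms.contains(t)) return false;          // no MUST_NOT matches
        }
        if (!must.isEmpty()) return true;                    // a MUST clause matched
        return should.stream().anyMatch(docTerms::contains); // else need one SHOULD
    }
}
```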
35. Span Query
• Asks Lucene not only what documents the query matches, but also where it matches ("spans")
• Span
– Particular parts or places in a document
– <document ID, start position, end position> tuple
36.
T0 = "it is what it is" (positions 0 1 2 3 4)
T1 = "what is it" (positions 0 1 2)
T2 = "it is a banana" (positions 0 1 2 3)
<doc ID, start pos., end pos.>
"it is": <0, 0, 2>, <0, 3, 5>, <2, 0, 2>
37. Span Query
• SpanTermQuery
– Same as TermQuery, except you can build other span queries with it
• SpanOrQuery
– Matches spans that are matched by any of some span queries
• SpanNotQuery
– Matches spans that are matched by one span query but not the other span query
38.
[Diagram: the spans over the text "apple orange" matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]), and spanNot(apple, orange)]
39. Span Query
• SpanNearQuery
– Matches spans that are within a certain distance ("slop") of each other
– Slop: max number of positions between spans
– Can specify whether order matters
41. Filtering
• A Filter narrows down the search result
– Creates a set of document IDs
– Decides what documents get processed further
– Does not affect scoring, i.e. does not score/rank documents that pass the filter
– Can be cached easily
– Useful for access control, presets, etc.
42. Notable Filter classes
• TermsFilter
– Allows documents with any of the given terms
• TermRangeFilter
– Filter version of TermRangeQuery
• PrefixFilter
– Filter version of PrefixQuery
• QueryWrapperFilter
– “Adapts” a query into a filter
• CachingWrapperFilter
– Cache the result of the wrapped filter
43. Sorting
• Score (default)
• Index order
• Field
– Requires the field be indexed & not analyzed
– Specify type (string, int, etc.)
– Normal or reverse order
– Single or multiple fields
44. Interfacing Lucene with “Outside”
• Embedding directly
• Language bridge
– E.g. PHP/Java Bridge
• Web service
– E.g. Jetty + your own request handler
• Solr
– Lucene + Jetty + lots of useful functionality
45. Books
• Lucene in Action, 2nd Edition
– Written by 3 committers and PMC members
– http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
– Not specific to Lucene, but about IR concepts
– Free e-book
– http://nlp.stanford.edu/IR-book/
47. Getting Started
– Download lucene-3.6.1.zip (or .tgz)
– Add lucene-core-3.6.1.jar to your classpath
– Consider using an IDE (e.g. Eclipse)
– Luke (Lucene Index Toolbox): http://code.google.com/p/luke/
I bet this is exactly how many systems are handling search right now.Perhaps many systems do not think about how to sort the result and just throws back the result list to the user, without considering what should go first.
Imagine the slowdown if your website goes from “nobody besides our employees and friends uses it” to being “the next Facebook”. People lose interest in your application easily if the first few things your search results present do not look exactly like what they are trying to find.
Expand on the inverted index we just saw. Positions start at zero.
There are only so many words that people commonly use. You can hash the terms, organize them as a prefix tree, sort them and use binary search, and so on. For the purpose of deciding which documents match, you only need to store document IDs (integers).
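A toy version of the idea above: a hash map from term to a sorted postings list of document IDs, with an AND query implemented as a postings intersection. The class and method names are illustrative, not Lucene's:

```java
import java.util.*;

// A toy inverted index: term -> sorted list of document IDs.
// Hashing the terms gives O(1) term lookup, as the notes suggest.
public class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId); // IDs stay sorted because docs are added in order
            }
        }
    }

    // AND query: documents containing every given term.
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            List<Integer> docs =
                postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
            if (result == null) result = new LinkedHashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(0, "apple orange");
        idx.add(1, "apple banana");
        idx.add(2, "orange banana");
        System.out.println(idx.search("apple", "orange")); // [0]
        System.out.println(idx.search("banana"));          // [1, 2]
    }
}
```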
Extra info: determine how good a match a document is for a query. Put the best matches near the top of the search result list.
The highest-scored (most relevant) document is the first in the result list.
In VSM, documents and queries are represented as vectors in an n-dimensional space, where n is the total number of unique terms in the document collection and each dimension corresponds to a separate term. A vector's value in a particular dimension is non-zero if the document or the query contains that term. Document vector closer to query vector = document more relevant to the query.
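The "closer vector = more relevant" idea is usually measured with cosine similarity. A minimal sketch using raw term-frequency vectors (illustrative only; Lucene's actual scoring formula also folds in idf, length norms, and boosts):

```java
import java.util.*;

// Cosine similarity between two bag-of-words term-frequency vectors.
public class CosineDemo {
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+"))
            v.merge(t, 1, Integer::sum);
        return v;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        double na = 0, nb = 0;
        for (int x : a.values()) na += (double) x * x;
        for (int x : b.values()) nb += (double) x * x;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = tf("quick fox");
        double s1 = cosine(query, tf("the quick brown fox"));
        double s2 = cosine(query, tf("lazy dog"));
        // The document sharing terms with the query scores higher.
        System.out.println(s1 > s2); // true
    }
}
```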
The term might be a common word that appears everywhere.
The existence of an index helps with searching, but the index must be created in the first place before we can search with it.
Storing the field means that the original text is stored in the index and can be retrieved at search time. Indexing the field means that the field is made searchable.
Some fields (e.g. serial numbers) should not be analyzed, as they contain information that cannot be logically broken into pieces.
Token = term, at index time, with start/end position information, and not tied to a document already in the index.
Case sensitivity, punctuation, apostrophes, how to break up URLs and e-mail addresses; what needs to be kept in one piece or broken down, and where.
WhitespaceAnalyzer: whitespace as separators; punctuation is part of tokens.
StopAnalyzer: non-letters as separators; makes everything lowercase; removes common stop words like “the”.
StandardAnalyzer: sophisticated rules to handle punctuation, hyphens, etc.; recognizes (and avoids breaking up) e-mail addresses and internet hostnames.
Character folding: turns an “a” with an accent mark above it into an “a” without the accent mark.
Stemming: the words “consistent” and “consistency” have the same stem, which is “consist”.
Synonyms: like “country” and “nation”.
Shingles: “the quick”, “quick brown”, “brown fox”; useful for searching text in Asian languages like Chinese and Japanese; reduces the number of unique terms in an index and reduces overhead.
Offsets: character offsets of this token from the beginning of the field's text.
Position increment: position of this token relative to the previous token; usually 1.
This query has clauses about 3 fields, so you analyze 3 pieces of text and get back 3 sets of tokens. A good practice is to use the same analyzer that analyzed the particular field you are searching.
Examples of ranges: January 1st to December 31st of 2012 (inclusive); 1 to 10 (excluding 10).
Your pattern describes a term, not a document, so you cannot put a phrase or a sentence in a pattern and expect the query to match that phrase or sentence.
Minimum similarity score is based on the edit distance.
It takes two moves to swap two words in a phrase.
Lucene does not have the standard boolean operators.
Lucene has these instead (of the “standard” boolean operators).
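A sketch of those clause-occurrence flags in the Lucene 3.x API. The field name "body" and the terms are illustrative:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// MUST / SHOULD / MUST_NOT instead of the "standard" AND / OR / NOT:
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("body", "apple")),  BooleanClause.Occur.MUST);     // required
q.add(new TermQuery(new Term("body", "orange")), BooleanClause.Occur.SHOULD);   // optional; boosts score
q.add(new TermQuery(new Term("body", "banana")), BooleanClause.Occur.MUST_NOT); // excluded
```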
End position is actually one plus the position of the last term in the span.
This "slop" is different from the "slop" in Phrase Query.
Total number of positions between spans = 2 + 1 + 0 = 3. The first two queries match this document because their slops are at least 3. The third query does not match because its slop is less than 3. The fourth query does not match because, even though the required slop is large enough, the query requires all the spans to be in the given order, and the spans in this document are not. The fifth query matches because the given order matches the order of the spans in the document.
CachingWrapperFilter is good for filters that don't change a lot, e.g. access restrictions.
Index order = order in which docs are added to the index.
Indexed and not analyzed = whole field as one token/term.
Embedding directly: good when the rest of your application is also in Java. In most use cases, you would be dealing with Solr rather than Lucene directly. But you would still be indirectly using Lucene, and you can still benefit from understanding many of the things discussed in this session.
Eclipse has many useful features, such as setting up the classpath and compiling your code for you. The website has both Lucene 3 and 4. Lucene 4 is still in beta. The book and most resources out there cover Lucene 3.
It shows you what your index looks like and what fields and terms it has. You can look at individual documents, run queries, and try out different analyzers.