This talk summarizes some unexpected uses of the Apache Lucene library beyond traditional text search. Lucene can be used as a fast key-value store, to index and store content in various file formats, and for machine learning tasks such as classifying unlabeled documents into predefined categories with vector space models and analyzing document similarity. The talk also covers record linkage, question answering systems, randomized testing to improve code quality, and performance improvements in newer Lucene versions.
Bet you didn't know Lucene can...
1. Thinking Lucene Think Lucid
Bet You Didn’t Know Lucene Can…
Grant Ingersoll
Chief Scientist | Lucid Imagination
@gsingers
CONFIDENTIAL | 1
2. A Funny Thing Happened On the Way To…
“Apache Lucene(TM) is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly any
application that requires full-text search, especially cross-platform.”
- http://lucene.apache.org
3. What can Lucene solve?
DB/NoSQL-like problems
Search-like problems
Stuff
4. … Find your Keys?
Lucene/Solr is a reasonably fast key-value store
– Bonus: search your values!
NoSQL before NoSQL was cool
10 M doc index: 600,000 lookups per second, single-threaded, read-only
– Not hard to remove the read-only assumption or the single-node assumption
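The key-value idea can be sketched without Lucene at all. The following is a minimal, in-memory illustration in plain Python, not Lucene's API: `TinyIndex` and its method names are hypothetical. It stores values by key and also keeps an inverted index over the values, which is what makes the "search your values" bonus possible.

```python
from collections import defaultdict


class TinyIndex:
    """Hypothetical in-memory stand-in for an index used as a key-value store."""

    def __init__(self):
        self.docs = {}                    # key -> stored value
        self.postings = defaultdict(set)  # term -> keys whose value contains the term

    def put(self, key, value):
        # Replace any earlier value so stale postings do not linger.
        if key in self.docs:
            self.delete(key)
        self.docs[key] = value
        for term in value.lower().split():
            self.postings[term].add(key)

    def delete(self, key):
        value = self.docs.pop(key)
        for term in value.lower().split():
            self.postings[term].discard(key)

    def get(self, key):
        # Plain key lookup, the "key-value store" part.
        return self.docs.get(key)

    def search(self, term):
        # The bonus: find keys by a term that occurs in their value.
        return set(self.postings.get(term.lower(), set()))
```

Usage: after `idx.put("user:1", "flute repair shop")`, both `idx.get("user:1")` and `idx.search("repair")` find the entry. Keys only need to be hashable here, which mirrors the note that keys need not be strings.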
5. …Store your Content?
Solr or Tika + Lucene can index popular office formats
Solr can back up/replicate and scale as content grows
Commit/rollback functionality
Can dynamically add fields
– No schema required up front
Retrieval is fast for keys or arbitrary text
Trunk/4.x:
– Column storage
– Pluggable storage capabilities
– Joins (a few variations)
7. … Find you a Date?
Meet Bob
Sex: Male
Seeking: Female
Age: 53
Job: Flute Repair shop owner
Location: Moose Jaw, Saskatchewan
Likes: rap music, cricket, long walks on the beach, Thai food
Dislikes: classical music, cats
Likes (indexed terms): Rap music | Cricket | Long walks on the beach | Thai food
Payload: 5 2 10
8. Along comes Mary
Meet Mary
Sex: Female
Seeking: Male
Age: 47
Job: CEO
Location: Moose Jaw, Saskatchewan
Likes: Hip hop, sunsets, Korean food
Dislikes: cats
Filters
– Sex, Seeking, Age (as RangeQuery), Job, Location (as spatial)
Queries
– Likes: OR, Phrases, Payload Queries
Dislikes: As Not Queries or down boosted or perhaps ignore?
Boosts: Popularity, Secret Sauce
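The split between filters and scored queries can be sketched in plain Python (this is not Lucene's query API). Everything below is an assumption made for illustration: the profile layout, the per-like weights standing in for payloads, the age range, and the pre-expanded list of Mary's likes. Filters decide whether a candidate is eligible at all; the likes, dislikes and popularity boost only affect ranking.

```python
def matches_filters(seeker, candidate):
    """Hard constraints: a candidate either passes or is excluded."""
    low, high = seeker["age_range"]
    return (
        candidate["sex"] == seeker["seeking"]
        and candidate["seeking"] == seeker["sex"]
        and low <= candidate["age"] <= high          # the range-query part
        and candidate["location"] == seeker["location"]
    )


def score(seeker, candidate):
    """Soft preferences: OR over likes, weighted by the candidate's payload-like weights."""
    total = 0.0
    for like in seeker["likes"]:
        total += candidate["likes"].get(like, 0)
    for dislike in seeker["dislikes"]:
        # Down-boost rather than exclude when the candidate likes what the seeker dislikes.
        total -= candidate["likes"].get(dislike, 0)
    return total * candidate.get("popularity", 1.0)


# Hypothetical data: the weights are made up, not the payload values from the slide.
bob = {
    "sex": "male", "seeking": "female", "age": 53,
    "location": "Moose Jaw, Saskatchewan",
    "likes": {"rap music": 5, "cricket": 2, "long walks on the beach": 10, "thai food": 1},
}
mary = {
    "sex": "female", "seeking": "male", "age_range": (45, 60),
    "location": "Moose Jaw, Saskatchewan",
    # Assumed to be already expanded with synonyms ("hip hop" -> "rap music").
    "likes": ["hip hop", "rap music", "sunsets", "korean food"],
    "dislikes": ["cats"],
}
```

With this data Bob passes Mary's filters, and his score comes only from the "rap music" overlap; without the synonym expansion the two profiles would share no terms.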
9. Will Mary and Bob Find Love?
?
Match
CEO: Owner, Chief Executive Officer, Executive
Sunsets: Beaches, outdoors
Korean Food: Asian Food
Age Range Match: Yes
10. … Label Your Content?
Given a new, unseen document, label it with one or more predefined labels
Supervised Machine Learning
Train
– Set of data annotated with predefined labels
Test
– Evaluate how well classifier can determine your content
11. Simple Vector Space Classifiers
K Nearest Neighbor (kNN)
– Each Training Document indexed with id, category and text field
– Pick Category based on whichever category has the most hits in the top K
Simple TF-IDF (TFIDF)
– Training
• Index category and concatenation of all content with that label
– Pick Category based on whichever document has best score
Query: “Important” terms from new, unseen document
– Use Lucene’s More Like This to generate the Query
(Chapter 7)
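The kNN variant can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Lucene code: plain term overlap stands in for Lucene's scoring, every term of the input is used as the query (rather than the "important" terms More Like This would pick), and the function names and training examples are hypothetical.

```python
from collections import Counter

# Hypothetical training documents: (category, text).
TRAINING = [
    ("politics", "obama fundraising"),
    ("politics", "republican fundraising"),
    ("politics", "obama clashes with republicans"),
    ("sports", "vikings win super bowl"),
    ("sports", "minnesota twins capture world series"),
    ("entertainment", "spongebob caught shoplifting"),
    ("entertainment", "megastar clashes with paparazzi"),
]


def overlap(query_terms, text):
    """Crude relevance score: number of distinct query terms found in the text."""
    return len(query_terms & set(text.lower().split()))


def knn_classify(text, training, k=3):
    """Label `text` with the category that has the most hits in the top k."""
    query_terms = set(text.lower().split())
    hits = [(overlap(query_terms, doc), category) for category, doc in training]
    hits = [hit for hit in hits if hit[0] > 0]          # keep matching documents only
    hits.sort(key=lambda hit: hit[0], reverse=True)     # best-scoring first
    votes = Counter(category for _, category in hits[:k])
    return votes.most_common(1)[0][0] if votes else None
```

For example, `knn_classify("patriots lose super bowl", TRAINING)` matches only the Super Bowl headline and so votes for `"sports"`.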
12. Training Data
Politics
– Obama fundraising
– Republican Fundraising
– Obama clashes with Republicans
Sports
– Vikings win Super Bowl
– Carolina Hurricanes earn first Stanley Cup
– Minnesota Twins capture World Series
Entertainment
– Spongebob caught shoplifting
– Brangelina on a Rampage
– Megastar clashes with Paparazzi
13. Simple TF-IDF Model
Training
– Politics: obama fundraising republican fundraising obama clashes with republicans
– Sports: vikings win super bowl carolina hurricanes earn first stanley cup minnesota twins capture world series
– Entertainment: spongebob caught shoplifting brangelina rampage megastar clashes paparazzi
Test/Production
– Input document is the query!
– e.g.: patriots lose super bowl
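A rough sketch of this model in plain Python: one concatenated document per category, and the input document used as the query. The weighting below (square root of term frequency times a smoothed inverse document frequency) is a simplification chosen for the example and is not Lucene's exact scoring formula; `tfidf_classify` is a hypothetical name.

```python
import math
from collections import Counter

# One "document" per category: the concatenation of all its training text.
CATEGORY_DOCS = {
    "politics": "obama fundraising republican fundraising obama clashes with republicans",
    "sports": ("vikings win super bowl carolina hurricanes earn first stanley cup "
               "minnesota twins capture world series"),
    "entertainment": "spongebob caught shoplifting brangelina rampage megastar clashes paparazzi",
}


def tfidf_classify(text, category_docs):
    """Return the category whose concatenated document scores best for `text`."""
    term_counts = {cat: Counter(doc.split()) for cat, doc in category_docs.items()}
    n_docs = len(term_counts)

    def idf(term):
        # Terms that occur in fewer category documents are more discriminating.
        df = sum(1 for counts in term_counts.values() if term in counts)
        return math.log(1 + n_docs / df) if df else 0.0

    query_terms = text.lower().split()
    scores = {
        cat: sum(math.sqrt(counts[term]) * idf(term) for term in query_terms)
        for cat, counts in term_counts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With the slide's example, `tfidf_classify("patriots lose super bowl", CATEGORY_DOCS)` picks `"sports"`, because only the sports document contains "super" and "bowl".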
14. Help you Learn a New Language?
Manu Konchady uses Lucene to teach new languages
Find exactly where a match occurred
Can also identify languages! (Solr)
Analyzers can help you tokenize, stem, etc. many languages
15. … Detect Plagiarism?
For each document
– For each sentence
• Index Sentence and calculate a hash for each document
Hash function has property that similar sentences will hash to the same value
For each new document
– For each sentence
• Query: hash (optionally also search for the sentence)
Can also do this at the document level by calculating hash for whole document
Contrib’d by Andrzej Bialecki and Erik Hatcher
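The sentence-hash idea can be sketched as follows. The signature used here is an assumption picked for illustration (lowercase, drop punctuation and a few stopwords, sort the remaining distinct words, then hash), so that reordered or lightly edited sentences collide; it is not the signature contributed to Solr. `SignatureIndex` and the other names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "on"}


def sentence_signature(sentence):
    """Hash a normalized form of the sentence so near-duplicates get the same value."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    kept = sorted({t for t in tokens if t not in STOPWORDS})
    return hashlib.md5(" ".join(kept).encode("utf-8")).hexdigest()


def split_sentences(text):
    """Very rough sentence splitter, good enough for the sketch."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


class SignatureIndex:
    def __init__(self):
        self.by_hash = defaultdict(set)  # signature -> ids of documents containing it

    def add(self, doc_id, text):
        # Indexing: for each document, for each sentence, store its hash.
        for sentence in split_sentences(text):
            self.by_hash[sentence_signature(sentence)].add(doc_id)

    def suspects(self, text):
        # Querying: for each sentence of the new document, look up its hash.
        found = defaultdict(int)
        for sentence in split_sentences(text):
            for doc_id in self.by_hash.get(sentence_signature(sentence), ()):
                found[doc_id] += 1
        return dict(found)
```

`suspects` returns, per indexed document, how many sentences of the new document matched by signature; a real system would follow up by comparing the matching sentences directly.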
16. … Find the Bad Guys?
Problem: Is Bob “Bad Guy” Johnson the same person as Robert William Johnson?
Called Record Linkage or Entity Resolution
– Common problem in business, finance, marketing, etc.
Index contains all user profiles
Ad hoc
– Query: incoming user profile
– Tricks: fuzzy queries, alternate queries
– Post process results
Systematic: pairwise similarity (More Like This for all docs)
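The ad hoc tricks (alternate names plus fuzzy matching) can be illustrated in plain Python with `difflib` from the standard library. The nickname table, the alias-stripping rule and the averaging scheme are all assumptions made for this sketch; in Lucene the equivalent work would be done with fuzzy and alternate queries against the profile index.

```python
import re
from difflib import SequenceMatcher

# Hypothetical nickname table used to generate "alternate" forms of a name.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}


def normalize(name):
    """Lowercase, drop quoted aliases such as “Bad Guy”, and expand nicknames."""
    name = re.sub(r'["“][^"”]*["”]', " ", name.lower())
    tokens = re.findall(r"[a-z]+", name)
    return [NICKNAMES.get(token, token) for token in tokens]


def token_similarity(a, b):
    """Fuzzy match between two name tokens, 1.0 meaning identical."""
    return SequenceMatcher(None, a, b).ratio()


def similarity(name_a, name_b):
    """Average best fuzzy match of each token in the shorter name against the longer one."""
    tokens_a, tokens_b = normalize(name_a), normalize(name_b)
    if not tokens_a or not tokens_b:
        return 0.0
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    best = [max(token_similarity(s, l) for l in longer) for s in shorter]
    return sum(best) / len(best)
```

Here `similarity('Bob "Bad Guy" Johnson', "Robert William Johnson")` is 1.0, since both remaining tokens of the first name match exactly once the nickname is expanded. The post-processing step would then apply a threshold and further checks such as address or date of birth.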
17. …Make you more money?
Who says a search needs to just do keyword matching using good old TF-IDF?
Solr makes it easy to:
– Rerank documents based on things like price, inventory, margin, popularity, etc.
– Apply Business Rules
– Hardcode results
– Scale for the Holiday season
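As a rough illustration of reranking and hardcoded results, the sketch below combines a text relevance score with business signals in plain Python. The formula, the field names and `rerank` itself are assumptions for the example; in Solr this kind of logic is normally expressed through boosts and function queries rather than application code.

```python
def rerank(hits, pinned_ids=()):
    """Reorder search hits by relevance adjusted for business signals.

    `hits` is a list of dicts with hypothetical fields: id, relevance,
    margin (0..1), popularity and inventory.
    """
    def business_score(hit):
        if hit["inventory"] == 0:
            return 0.0  # business rule: out-of-stock items sink to the bottom
        return hit["relevance"] * (1.0 + hit["margin"]) * (1.0 + 0.1 * hit["popularity"])

    pinned = [hit for hit in hits if hit["id"] in pinned_ids]      # hardcoded results
    rest = [hit for hit in hits if hit["id"] not in pinned_ids]
    rest.sort(key=business_score, reverse=True)
    return pinned + rest
```

A hit with lower text relevance but a higher margin can overtake a more relevant one, and anything listed in `pinned_ids` stays on top regardless of score.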
18. … Play Jeopardy!?
Indeed, IBM Watson uses Lucene
Critical component of Question Answering (QA) is often retrieval
How to build a simple QA system?
– Documents can be:
• Whole text, paragraph, sentences
• Position-based queries (spans) to find where keywords match
• Index part of speech tags and possibly other analysis
– Queries:
• Classify based on Answer Type
• Retrieve passages based on keywords plus answer type
• Score passages!
(Chapter 9)
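The three query steps (classify the answer type, retrieve passages by keywords plus answer type, score passages) can be sketched in plain Python. Everything here is a deliberately crude assumption: answer types come from the question word, candidates come from regular expressions, the score is keyword overlap, and the passages are made up. A real system would index part-of-speech or entity tags and use span queries instead.

```python
import re

# Crude, hypothetical detectors for two answer types.
ANSWER_PATTERNS = {
    "DATE": re.compile(r"\b(1\d{3}|20\d{2})\b"),
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
}
STOPWORDS = {"when", "who", "did", "the", "a", "of", "in", "his", "is", "was"}


def answer_type(question):
    """Step 1: classify the question by expected answer type."""
    q = question.lower()
    if q.startswith("when"):
        return "DATE"
    if q.startswith("who"):
        return "PERSON"
    return None


def answer(question, passages):
    """Steps 2 and 3: keep passages that share keywords and contain a candidate
    of the right type, then return the candidate from the best-scoring passage."""
    atype = answer_type(question)
    if atype is None:
        return None
    keywords = set(re.findall(r"[a-z]+", question.lower())) - STOPWORDS
    best = None
    for passage in passages:
        overlap = len(keywords & set(re.findall(r"[a-z]+", passage.lower())))
        candidates = ANSWER_PATTERNS[atype].findall(passage)
        if overlap and candidates and (best is None or overlap > best[0]):
            best = (overlap, candidates[0])
    return best[1] if best else None


# Made-up passages for illustration only.
PASSAGES = [
    "Bob Johnson opened a flute repair shop in Moose Jaw in 1995.",
    "Mary Smith became CEO of the company in 2008.",
]
```

The capitalized-pair pattern would also tag "Moose Jaw" as a person, which shows why indexing real part-of-speech or entity analysis matters for anything beyond a toy.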
20. … Make you a Better Programmer?
If your tests aren’t failing from time to time, are you really doing enough testing?
We’ve introduced some serious randomized testing
– We run randomized tests every 30 minutes, ad infinitum
– Random Locales, time zones, index file format, much, much more
– Some in the community also randomize JVMs continuously
We liked what we built so much, we now publish it as its own module
– https://issues.apache.org/jira/browse/LUCENE-3492
– https://github.com/carrotsearch/randomizedtesting
More References at end of talk
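The core of randomized testing is that every run draws fresh inputs from a seed, and a failure reports that seed so the exact case can be replayed. The sketch below shows the idea in plain Python; it is not the randomizedtesting module linked above, and `run_randomized` and the sample property are hypothetical.

```python
import random


def run_randomized(test, iterations=100, seed=None):
    """Run `test(rng)` many times, each with its own reproducible seed.

    On failure, the per-case seed is included in the error so the case
    can be replayed with `test(random.Random(case_seed))`.
    """
    master = random.Random(seed)  # seed=None picks a fresh seed on each run
    for _ in range(iterations):
        case_seed = master.getrandbits(32)
        try:
            test(random.Random(case_seed))
        except AssertionError as exc:
            raise AssertionError(f"failed with seed {case_seed}: {exc}") from exc
    return iterations


def check_join_split_roundtrip(rng):
    """Sample property: joining random words with spaces and splitting restores them."""
    words = [
        "".join(rng.choice("abc") for _ in range(rng.randint(1, 5)))
        for _ in range(rng.randint(0, 10))
    ]
    assert " ".join(words).split() == words
```

Leaving `seed` unset gives different inputs on every run, which is what surfaces rare failures over time; passing the reported seed back in makes a failure reproducible.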
21. … Run Circles Around Previous Versions of Lucene?
Finite State Transducers
Pluggable Indexing Models
– Codecs
http://bit.ly/dawid-weiss-lucene-rev
Pluggable Scoring Models
– BM25, Information based, others
23. …Play Chess?!? – THOUGHT EXPERIMENT
Well, maybe not play, but, could we help?
Premise: Even though chess has a very large number of possibilities, most board positions have been played before
Could you assist with real time analysis?
– Index large collection of previously played games
Document A
– Sequence of all moves of the game
– Metadata
– Query: PrefixQuery of current board + Function
– Results: Ranked list of moves most likely to lead to a win
Alternatives: index board positions, subsequences of moves (n-grams)
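The thought experiment can be sketched in plain Python: treat each game as a document holding its move sequence and result, match games whose moves start with the current game (the prefix query), and rank the next moves by how often they led to a win (the function part). The games below are made up, and `suggest_moves` is a hypothetical name rather than anything from Lucene.

```python
from collections import defaultdict

# Made-up "documents": (sequence of moves, winner).
GAMES = [
    (["e4", "e5", "Nf3", "Nc6", "Bb5"], "white"),
    (["e4", "e5", "Nf3", "Nc6", "Bc4"], "black"),
    (["e4", "e5", "Nf3", "Nc6", "Bb5"], "white"),
    (["e4", "c5", "Nf3"], "black"),
    (["d4", "d5", "c4"], "white"),
]


def suggest_moves(games, moves_so_far, side):
    """Rank candidate next moves by the win rate of `side` in matching games."""
    n = len(moves_so_far)
    stats = defaultdict(lambda: [0, 0])  # next move -> [wins, games seen]
    for moves, winner in games:
        # Prefix match: the game must start with the moves played so far.
        if len(moves) > n and moves[:n] == moves_so_far:
            entry = stats[moves[n]]
            entry[1] += 1
            if winner == side:
                entry[0] += 1
    ranked = sorted(
        stats.items(),
        key=lambda item: (item[1][0] / item[1][1], item[1][1]),  # win rate, then support
        reverse=True,
    )
    return [(move, wins / total) for move, (wins, total) in ranked]
```

For the prefix `["e4", "e5", "Nf3", "Nc6"]` and side `"white"`, this returns `Bb5` first with a win rate of 1.0 and `Bc4` second with 0.0.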
24. What else?
In case you haven’t noticed, Lucene can do a lot of things that are not “traditional search”
I’d love to hear your use cases!
That’s the description of Lucene, but hey, it’s good for other things too. Let’s explore these. We’ll start easy, then get into things that are mathematically similar to search, and then talk some crazy stuff.
Oh, BTW, it can do search over the values. Keys can be anything, not just strings.
Commit/rollback not totally the same as DB
Lucene is a perfectly good content-based recommendation engine. In fact, this can fall under the category of “search”. Lots of flexibility around representing features. http://www.lucidimagination.com/search/document/5485be0137448eca/problems_with_itembasedrecommender_with_lucene#c82c577e1e28259f
You remembered your synonyms and associations, right? Maybe bootstrap from WordNet or another resource? Perhaps you even used Lucene to calculate co-occurrences. You can tweak the system as needed to come up with appropriate queries, etc.
Let’s say you have a bunch of training data
Pairwise similarity: compare all documents
Scoring is easier said than done, but simple approach can be effective for fact-based questions