SlideShare a Scribd company logo
1 of 37
Download to read offline
Lucene And Solr Document
Classification
Alessandro Benedetti, Software Engineer, Sease Ltd.
Alessandro Benedetti
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Semantic, NLP, Machine Learning Technologies passionate
● Beach Volleyball Player & Snowboarder
Who I am
● Classification
● Lucene Approach
● Solr Integration
● Demo
● Extensions
● Future Work
Agenda
“Classification is the problem of identifying
to which of a set of categories
(sub-populations) a new observation
belongs, on the basis of a training set of data
containing observations (or instances)
whose category membership is known. “
Wikipedia
Classification
● E-mail spam filter
● Document categorization
● Sexually explicit content detection
● Medical diagnosis
● E-commerce
● Language identification
Real World Use Cases
● Supervised learning
● Labelled training samples
● Documents modelled as
feature vectors
● Term occurrences as features
● Model predicts unseen documents
label
Basics Of Text Classification
Apache Lucene
Apache LuceneTM
is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text search,
especially cross-platform.
Apache Lucene is an open source project available for free download.
● Lucene index has complex data structures
● Lot of organizations have already indexes in place
● Pre existent data can be used to classify
● No need to train a model from a separate training set
● From training set to Inverted index
Apache Lucene For Classification
● Advanced configurable text analysis
● Term frequencies
● Term positions
● Document frequencies
● Norms
● Part of speech tags and custom payload
Apache Lucene For Classification
● Given an index with labelled documents
● Each document has a class field
● Given an unknown document in input
● Given a set of relevant fields
● Search the top K most similar documents
● Fetch the classes from the retrieved documents
● Return most occurring class(es)
● Class ranking in retrieved documents is important !
K Nearest Neighbours
● KNN uses Lucene More Like This
● Lucene query component
● Extract interesting terms* from the input document fields
● Build a Lucene query
● Run the query against the search index
● Resulting documents are “the similar documents”
* an interesting term is a term :
- occurring frequently in the seed document (high term frequency)
- but quite rare in the corpus (high inverted document frequency)
More Like This
Assumptions
● Term occurrences are probabilistic independent features
● Terms positions are irrelevant ( bag of words )
Calculate the probability score of each available class C
● Prior ( #DocsInClassC / #Docs )
● Likelihood ( P(d|c) = P(t1, t2,..., tn|c) == P(t1|c) * P(t2|c) * … * P(tn|c))
Where given term t
P(t|c) = TF(t) in documents of class c +1 /
#terms in all documents of class c + #docs of class c
Assign top scoring class
Naive Bayes Classifier
● Documents are the Lucene unit of information
● Documents are a map field -> value
● Each field may be analysed differently
(different tokenization and token filtering)
● Each field may have a different weight for the classification
(affecting differently the similarity score)
Document Classification
Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project.
Its major features include powerful full-text search, hit highlighting, faceted
search and analytics, rich document parsing, geospatial search, extensive
REST APIs as well as parallel SQL.
Apache Solr
Index Time Integration - SOLR-7739
● Ingest the document
● Assign the class
● Set the class as a field value
● Index the document
Request Handler Integration (TO DO) - SOLR-7738
Return an assigned class :
● Given a text and a field
● Given an input document
● Given an indexed document id
Solr Integration
● Pipeline of processors
● Each single document flows
through the chain
● Each processor is executed once
● Last processor triggers the
update command
Update Request Processor Chain
● Update Component
● Configurable Singleton Factory
● Single instance per request thread
● Process a single Document
● SolrCloud compatible*
* Pre processor / Post processor
Update Request Processor
● Access the Index Reader
● A Lucene Document Classifier is instantiated
● A class is assigned by the classifier
● A new field is added to the original Document, with the class
● The document goes through the next processing steps
Classification Update Request Processor
...
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
<lst name="defaults">
<str name="df">text</str>
<str name="update.chain">classification</str>
</lst>
</initParams>
...
Solrconfig.xml - Update Handler
...
<updateRequestProcessorChain name="classification">
<processor class="solr.ClassificationUpdateProcessorFactory">
...
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
...
Solrconfig.xml - Chain configuration
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">title^1.5,content,author</str>
<str name="classField">cat</str>
<str name="algorithm">knn</str>
<str name="knn.k">20</str>
<str name="knn.minTf">1</str>
<str name="knn.minDf">5</str>
</processor>
N.B. classField must be stored
Solrconfig.xml - K nearest neighbour classifier config
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">title^1.5,content,author</str>
<str name="classField">cat</str>
<str name="algorithm">bayes</str>
</processor>
N.B. classField must be Indexed (take care of analysis)
Solrconfig.xml - Naive Bayes classifier config
● Lucene >= 6.0
● Solr >= 6.1
● Classification needs a training set ->
An index with initially human assigned classes is required
Solr Classification - Important Notes
● Sci-Fi StackExchange dataset
● Roughly 18.000 questions and answers
● Roughly 6.000 tagged
● 70 % Training Set + 30% test set
Solr Classification - Demo
● Index the training set documents
(this is our ground truth)
● Index the test set
(classification will happen automatically at indexing time)
● Evaluate the test set
(a simple java app to verify that the automatically assigned classes
are consistent with what expected)
Solr Classification - Demo
● True Positive : Predicted class == actual class
● False Positive : Predicted class != actual class
● True Negative : Not predicted class != actual class
● False Negative : Not predicted class == actual class
Precision = TP / TP+FP
Recall = TP / TP+FN
Solr Classification - System Evaluation Metrics
● Index the training set documents
(this is our ground truth)
● Index the test set
(classification will happen automatically at indexing time)
● Evaluate the test set
(a simple java app to verify that the automatically assigned classes
are consistent with what expected)
Solr Classification - Demo
MaxOutputClasses 1
[System Global Accuracy]0.5095676824946846
[System Globel Recall]0.2686846038863976
TP{star-wars}59
FP{star-wars}75
FN{star-wars}7
[Precision (of predicted)]{star-wars}0.44029850746268656
[Recall for class)]{star-wars}0.8939393939393939
TP{harry-potter}147
FP{harry-potter}137
FN{harry-potter}3
[Precision (of predicted)]{harry-potter}0.5176056338028169
[Recall for class]{harry-potter}0.98
Solr Classification - Demo - Full Dataset
MaxOutputClasses 5
[System Global Accuracy]0.20481927710843373
[System Globel Recall]0.5399850523168909
TP{star-wars}66
FP{star-wars}400
FN{star-wars}0
[Precision (of predicted)]{star-wars}0.14163090128755365
[Recall for class)]{star-wars}1.0
TP{harry-potter}150
FP{harry-potter}584
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.20435967302452315
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Full Dataset
MaxOutputClasses 1
[System Global Accuracy]0.9907407407407407
[System Globel Recall]0.6750788643533123
TP{star-wars}64
FP{star-wars}0
FN{star-wars}2
[Precision (of predicted)]{star-wars}1.0
[Recall for class)]{star-wars}0.9696969696969697
TP{harry-potter}150
FP{harry-potter}2
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.9868421052631579
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Partial Dataset
MaxOutputClasses 5
[System Global Accuracy]0.24259259259259258
[System Globel Recall]0.8264984227129337
TP{star-wars}66
FP{star-wars}52
FN{star-wars}0
[Precision (of predicted)]{star-wars}0.559322033898305
[Recall for class)]{star-wars}1.0
TP{harry-potter}150
FP{harry-potter}48
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.7575757575757576
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Partial Dataset
Multi classes support
● Class field may be multi valued
● Assign multiple classes
● Not only the top scoring but top N (parameter)
Split human/auto assigned classes
● classTrainingField
● classOutputField
Default : use the same field
Solr Classification - Extensions SOLR-8871
Classification Context Filtering
● Reduce the document space to consider ->
reduce the training set
● Useful when only a subset of the index may be interesting for
classification
● Consider only the human labelled documents as training data
Solr Classification - Extensions SOLR-8871
Individual Field Weighting
● When classifying, each field has a different importance
e.g.
title vs content
● Set a different boost per field
● Knn compatible
● Bayes compatible
Solr Classification - Extensions SOLR-8871
● Numeric Field Support (Knn)
(Euclidean distance based)
● Lat lon support (Knn)
(geo distance based)
● SolrCloud support
(use the entire sharded index as training set)
Solr Classification - Future Work
Questions ?
● Special thanks to Tommaso Teofili,
Apache committer who followed the developments and made possible the
contributions.
● And to the
Audience :)

More Related Content

What's hot

Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...Thomas Riley
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneSease
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
The Knowledge Graph Explosion
The Knowledge Graph ExplosionThe Knowledge Graph Explosion
The Knowledge Graph ExplosionNeo4j
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartLucidworks
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkroutconfluent
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScaleSeunghyun Lee
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query BasicsIdo Green
 
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...Citus Data
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureScyllaDB
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysisAmazon Web Services
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...DataStax
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)KafkaZone
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Edureka!
 
How to Build a Fraud Detection Solution with Neo4j
How to Build a Fraud Detection Solution with Neo4jHow to Build a Fraud Detection Solution with Neo4j
How to Build a Fraud Detection Solution with Neo4jNeo4j
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Roopa Tangirala
 

What's hot (20)

Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
The Knowledge Graph Explosion
The Knowledge Graph ExplosionThe Knowledge Graph Explosion
The Knowledge Graph Explosion
 
Delta Architecture
Delta ArchitectureDelta Architecture
Delta Architecture
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Thanos - Prometheus on Scale
Thanos - Prometheus on ScaleThanos - Prometheus on Scale
Thanos - Prometheus on Scale
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
Architecting peta-byte-scale analytics by scaling out Postgres on Azure with ...
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database Architecture
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
 
How to Build a Fraud Detection Solution with Neo4j
How to Build a Fraud Detection Solution with Neo4jHow to Build a Fraud Detection Solution with Neo4j
How to Build a Fraud Detection Solution with Neo4j
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup)
 

Similar to Apache Lucene/Solr Document Classification

PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiSatoshi Nagayasu
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Basic R Learning
Basic R LearningBasic R Learning
Basic R LearningKumar P
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfcadejaumafiq
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampKais Hassan, PhD
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenchesIsmail Mayat
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation EnginesTrey Grainger
 

Similar to Apache Lucene/Solr Document Classification (20)

PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
advancedR.pdf
advancedR.pdfadvancedR.pdf
advancedR.pdf
 
Advanced r
Advanced rAdvanced r
Advanced r
 
Basic R Learning
Basic R LearningBasic R Learning
Basic R Learning
 
Advanced R cheat sheet
Advanced R cheat sheetAdvanced R cheat sheet
Advanced R cheat sheet
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science Bootcamp
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Apache solr
Apache solrApache solr
Apache solr
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Android Database
Android DatabaseAndroid Database
Android Database
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 

More from Sease

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors LuceneSease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveSease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale IndexingSease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfSease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Sease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneSease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information RetrievalSease
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to RankSease
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 

More from Sease (20)

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information Retrieval
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Apache Lucene/Solr Document Classification

  • 1. Lucene And Solr Document Classification Alessandro Benedetti, Software Engineer, Sease Ltd.
  • 2. Alessandro Benedetti ● Search Consultant ● R&D Software Engineer ● Master in Computer Science ● Apache Lucene/Solr Enthusiast ● Semantic, NLP, Machine Learning Technologies passionate ● Beach Volleyball Player & Snowboarder Who I am
  • 3. ● Classification ● Lucene Approach ● Solr Integration ● Demo ● Extensions ● Future Work Agenda
  • 4. “Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. “ Wikipedia Classification
  • 5. ● E-mail spam filter ● Document categorization ● Sexually explicit content detection ● Medical diagnosis ● E-commerce ● Language identification Real World Use Cases
  • 6. ● Supervised learning ● Labelled training samples ● Documents modelled as feature vectors ● Term occurrences as features ● Model predicts unseen documents label Basics Of Text Classification
  • 7. Apache Lucene Apache LuceneTM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
  • 8. ● Lucene index has complex data structures ● Lot of organizations have already indexes in place ● Pre existent data can be used to classify ● No need to train a model from a separate training set ● From training set to Inverted index Apache Lucene For Classification
  • 9. ● Advanced configurable text analysis ● Term frequencies ● Term positions ● Document frequencies ● Norms ● Part of speech tags and custom payload Apache Lucene For Classification
  • 10. ● Given an index with labelled documents ● Each document has a class field ● Given an unknown document in input ● Given a set of relevant fields ● Search the top K most similar documents ● Fetch the classes from the retrieved documents ● Return most occurring class(es) ● Class ranking in retrieved documents is important ! K Nearest Neighbours
  • 11. ● KNN uses Lucene More Like This ● Lucene query component ● Extract interesting terms* from the input document fields ● Build a Lucene query ● Run the query against the search index ● Resulting documents are “the similar documents” * an interesting term is a term : - occurring frequently in the seed document (high term frequency) - but quite rare in the corpus (high inverted document frequency) More Like This
  • 12. Assumptions ● Term occurrences are probabilistic independent features ● Terms positions are irrelevant ( bag of words ) Calculate the probability score of each available class C ● Prior ( #DocsInClassC / #Docs ) ● Likelihood ( P(d|c) = P(t1, t2,..., tn|c) == P(t1|c) * P(t2|c) * … * P(tn|c)) Where given term t P(t|c) = TF(t) in documents of class c +1 / #terms in all documents of class c + #docs of class c Assign top scoring class Naive Bayes Classifier
  • 13. ● Documents are the Lucene unit of information ● Documents are a map field -> value ● Each field may be analysed differently (different tokenization and token filtering) ● Each field may have a different weight for the classification (affecting differently the similarity score) Document Classification
  • 14. Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Apache Solr
  • 15. Index Time Integration - SOLR-7739 ● Ingest the document ● Assign the class ● Set the class as a field value ● Index the document Request Handler Integration (TO DO) - SOLR-7738 Return an assigned class : ● Given a text and a field ● Given an input document ● Given an indexed document id Solr Integration
  • 16. ● Pipeline of processors ● Each single document flows through the chain ● Each processor is executed once ● Last processor triggers the update command Update Request Processor Chain
  • 17. ● Update Component ● Configurable Singleton Factory ● Single instance per request thread ● Process a single Document ● SolrCloud compatible* * Pre processor / Post processor Update Request Processor
  • 18. ● Access the Index Reader ● A Lucene Document Classifier is instantiated ● A class is assigned by the classifier ● A new field is added to the original Document, with the class ● The document goes through the next processing steps Classification Update Request Processor
  • 19. ... <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse"> <lst name="defaults"> <str name="df">text</str> <str name="update.chain">classification</str> </lst> </initParams> ... Solrconfig.xml - Update Handler
  • 20. ... <updateRequestProcessorChain name="classification"> <processor class="solr.ClassificationUpdateProcessorFactory"> ... </processor> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> ... Solrconfig.xml - Chain configuration
  • 21. <processor class="solr.ClassificationUpdateProcessorFactory"> <str name="inputFields">title^1.5,content,author</str> <str name="classField">cat</str> <str name="algorithm">knn</str> <str name="knn.k">20</str> <str name="knn.minTf">1</str> <str name="knn.minDf">5</str> </processor> N.B. classField must be stored Solrconfig.xml - K nearest neighbour classifier config
  • 22. <processor class="solr.ClassificationUpdateProcessorFactory"> <str name="inputFields">title^1.5,content,author</str> <str name="classField">cat</str> <str name="algorithm">bayes</str> </processor> N.B. classField must be Indexed (take care of analysis) Solrconfig.xml - Naive Bayes classifier config
  • 23. ● Lucene >= 6.0 ● Solr >= 6.1 ● Classification needs a training set -> An index with initially human assigned classes is required Solr Classification - Important Notes
  • 24. ● Sci-Fi StackExchange dataset ● Roughly 18.000 questions and answers ● Roughly 6.000 tagged ● 70 % Training Set + 30% test set Solr Classification - Demo
  • 25. ● Index the training set documents (this is our ground truth) ● Index the test set (classification will happen automatically at indexing time) ● Evaluate the test set (a simple java app to verify that the automatically assigned classes are consistent with what expected) Solr Classification - Demo
  • 26. ● True Positive : Predicted class == actual class ● False Positive : Predicted class != actual class ● True Negative : Not predicted class != actual class ● False Negative : Not predicted class == actual class Precision = TP / TP+FP Recall = TP / TP+FN Solr Classification - System Evaluation Metrics
  • 27. ● Index the training set documents (this is our ground truth) ● Index the test set (classification will happen automatically at indexing time) ● Evaluate the test set (a simple java app to verify that the automatically assigned classes are consistent with what expected) Solr Classification - Demo
  • 28. MaxOutputClasses 1 [System Global Accuracy]0.5095676824946846 [System Globel Recall]0.2686846038863976 TP{star-wars}59 FP{star-wars}75 FN{star-wars}7 [Precision (of predicted)]{star-wars}0.44029850746268656 [Recall for class)]{star-wars}0.8939393939393939 TP{harry-potter}147 FP{harry-potter}137 FN{harry-potter}3 [Precision (of predicted)]{harry-potter}0.5176056338028169 [Recall for class]{harry-potter}0.98 Solr Classification - Demo - Full Dataset
  • 29. MaxOutputClasses 5 [System Global Accuracy]0.20481927710843373 [System Globel Recall]0.5399850523168909 TP{star-wars}66 FP{star-wars}400 FN{star-wars}0 [Precision (of predicted)]{star-wars}0.14163090128755365 [Recall for class)]{star-wars}1.0 TP{harry-potter}150 FP{harry-potter}584 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.20435967302452315 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Full Dataset
  • 30. MaxOutputClasses 1 [System Global Accuracy]0.9907407407407407 [System Globel Recall]0.6750788643533123 TP{star-wars}64 FP{star-wars}0 FN{star-wars}2 [Precision (of predicted)]{star-wars}1.0 [Recall for class)]{star-wars}0.9696969696969697 TP{harry-potter}150 FP{harry-potter}2 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.9868421052631579 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Partial Dataset
  • 31. MaxOutputClasses 5 [System Global Accuracy]0.24259259259259258 [System Globel Recall]0.8264984227129337 TP{star-wars}66 FP{star-wars}52 FN{star-wars}0 [Precision (of predicted)]{star-wars}0.559322033898305 [Recall for class)]{star-wars}1.0 TP{harry-potter}150 FP{harry-potter}48 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.7575757575757576 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Partial Dataset
  • 32. Multi classes support ● Class field may be multi valued ● Assign multiple classes ● Not only the top scoring but top N (parameter) Split human/auto assigned classes ● classTrainingField ● classOutputField Default : use the same field Solr Classification - Extensions SOLR-8871
  • 33. Classification Context Filtering ● Reduce the document space to consider -> reduce the training set ● Useful when only a subset of the index may be interesting for classification ● Consider only the human labelled documents as training data Solr Classification - Extensions SOLR-8871
  • 34. Individual Field Weighting ● When classifying, each field has a different importance e.g. title vs content ● Set a different boost per field ● Knn compatible ● Bayes compatible Solr Classification - Extensions SOLR-8871
  • 35. ● Numeric Field Support (Knn) (Euclidean distance based) ● Lat lon support (Knn) (geo distance based) ● SolrCloud support (use the entire sharded index as training set) Solr Classification - Future Work
  • 37. ● Special thanks to Tommaso Teofili, Apache committer who followed the developments and made possible the contributions. ● And to the Audience :)