SlideShare a Scribd company logo
1 of 30
Download to read offline
Text categorization with
   Lucene and Solr
     Tommaso Teofili
     tommaso [at] apache [dot] org




                                     1
About me
l      ASF member having fun with:
  l    Lucene / Solr
  l    Hama
  l    UIMA
  l    Stanbol
  l    … some others
l      SW engineer @ Adobe R&D




                                      2
Agenda
l    Classification
l    Lucene classification module
l    Solr text categorization services
l    Conclusions




                                          3
Classification
l    Let the algorithm assign one or more labels
      (classes) to some item given some
      previous knowledge
      l    Spam filter
      l    Tagging system
      l    Digit recognition system
      l    Text categorization
      l    etc.




                                           4
Classification?
why with Lucene?




                   5
The short story
l      Lucene already has a lot of features for
        common information retrieval needs
  l    Postings
  l    Term vectors
  l    Statistics
  l    Positions
  l    TF / IDF
  l    maybe Payloads
  l    etc.
l      We may avoid bringing in new components
        to do classification just leveraging what we
        get for free from Lucene
                                              6
The (slightly) longer story #1
l      While playing with NLP stuff
l      Need to implement a naïve bayes classifier
  l    Not possible to plug in stuff requiring touching
        the architecture
  l    Not really interested in (near) real time
        performance
l      Iteration 1
  l    Plain in memory Java stuff
l      Iteration 2
  l    Same stuff but using Lucene instead of loading
        things into memory
         l    Too much faster J
                                                    7
The (slightly) longer story #2
l      So I realized
  l    Lucene has so many features stored you can
        take advantage of for free
  l    Therefore writing the classification algorithm is
        relatively simple
  l    In many cases you’re just not adding anything to
        the architecture
         l    Your Lucene index was already there for searching
  l    Lucene index is, to some extent, already a model
        which we just need to “query” with the proper
        algorithm
  l    And it is fast enough
                                                           8
Lucene classification module
l      Work in progress on trunk
l      LUCENE-4345
  l    Establishing classification API
  l    With currently two implementations
         l    Naïve bayes
         l    K nearest neighbor




                                             9
Lucene classification module
l      Classifier API
  l    Training
  l    void train(atomicReader, contentField,
        classField, analyzer) throws IOException

         l    atomicReader : the reader on the Lucene index to
               use for classification
               l    still unsure if IR’d be better
         l    textFieldName : the name of the field which contains
               documents’ texts
         l    classFieldName : the name of the field which contains
               the class assigned to existing documents
         l    analyzer : the item used for analyzing the unseen
               texts
                                                            10
Lucene classification module
l      Classifier API
  l    Classifying
  l    ClassificationResult assignClass(String text)
        throws IOException
         l    text: the unseen text of the document to classify
         l    ClassificationResult : the object containing the
               assigned class along with the related score




                                                              11
K Nearest neighbor classifier
l    Fairly simple classification algorithm
l    Given some new unseen item
l    I search in my knowledge base the k items
      which are nearer to the new one
l    I get the k classes assigned to the k
      nearest items
l    I assign to the new item the class that is
      most frequent in the k returned items



                                           12
K Nearest neighbor classifier
l      How can we do this in Lucene?
  l    We have VSM for representing documents as
        vectors and eventually find distances
  l    Lucene MoreLikeThis module can do a lot for it
  l    Given a new document
         l    It’s represented as a MoreLikeThisQuery which filters
               out too frequent words and helps on keeping only the
               relevant tokens for finding the neighbors
         l    The query is executed returning only the first k results
         l    The result is then browsed in order to find the most
               frequent class and that is then assigned with a score
               of classFreq / k


                                                               13
Naïve Bayes classifier
l      Slightly more complicated
l      Based on probabilities
l      C = argmax( P(d|c) * P(c) )
  l    P(d|c) : likelihood
  l    P(c) : prior
  l    With some assumptions:
         l    bag of words assumption: positions don't matter
         l    conditional independence: the feature probabilities
               are independent given a class




                                                             14
Naïve Bayes classifier
l      Prior calculation is easy
  l    It’s the relative frequency of each class
        #docsWithClassC / #docs
l      Likelihood is easy too because of the bag
        of words assumption
  l    P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c)
  l    So we just need probabilities of single terms
         l    P(x|c) := (tf of x in documents with class c + 1)/
               (#terms in docs with class c + #docs)




                                                                15
Naïve Bayes classifier
l      Does the bag of words assumption affect
        the classifier’s precision?
  l    Yes in theory
         l    in text documents (nearby) words are strictly
               correlated
  l    Not always in practice
         l    depending on your index data it may or not have an
               impact




                                                               16
Using different indexes
l      The Classifier API makes usage of an
        AtomicReader to get the data for the
        training
l      It must not be the very same index used for
        every day index / search
  l    For performance reasons
  l    For enhancing classifier effectiveness
         l    Using more specific analyzers
         l    Indexing data in a different way
                l    e.g. one big document for each class and use kNN (with a
                      small k) or TF-IDF similarity



                                                                       17
Things to consider - bootstrapping
l      How are your first documents classified?
  l    Manually
         l    Categories are already there in the documents
         l    Someone is explicitly charged to do that (e.g. article
               authors) at some point in time
  l    (semi) automatically
         l    Using some existing service / library
                l    With or without human supervision

  l    In either case the classifier needs something to
        be fed with to be effective



                                                               18
Things to consider – tokenizing
l      How are your content field tokenized?
  l    Whitespace
         l    It doesn’t work for each language
  l    Standard
  l    Sentence
  l    What about using N-Grams?
  l    What about using Shingles?




                                                   19
Things to consider - filtering
l      Some words may / should be filtered while
  l    Training
  l    Classifying
l      Often
  l    Stopwords
  l    Punctuation
  l    Not relevant PoS tagged tokens



                                            20
Raw benchmarking
l      Tried both algorithms on ~1M docs index
  l    Naïve bayes is affected by the # of classes
  l    kNN is affected by k being large
l      None of them took more than 1-2m to train
        even with great number of classes or large
        k values




                                                  21
From Lucene to Solr
l      The Lucene classifiers can be easily used
        in Solr
  l    As specific search services
         l    A classification based more like this
  l    While indexing
         l    For automatic text categorization




                                                       22
Classification based MLT
l      Use case:
  l    “give me all the documents that belong to the
        same category of a new not indexed document”
  l    Slightly different from basic MLT since it does
        not return the nearest docs
  l    That is useful if the user doesn’t want / need to
        index the document and still want to find all the
        other documents of the same category, whatever
        this category means


                                                   23
Classification based MLT
l      ClassificationMLTHandler
  l    String d = req.getParams().get(DOC);
  l    ClassificationResult r = classifier.assignClass(d);
  l    String c = r.getAssignedClass();
  l    req.getSearcher().search(new TermQuery(new
        Term(classFieldName, c)), rows);




                                                    24
Automatic text categorization
l      Once a doc reaches Solr
l      We can use the Lucene classifiers to
        automate assigning document’s category
l      We can leverage existing Solr facilites for
        enhancing the indexing pipeline
  l    An UpdateChain can be decorated with one or
        more UpdateRequestProcessors




                                               25
Automatic text categorization
l      Configuration
  l    <updateRequestProcessorChain name=“ctgr">
  l    <processor
        class=”solr.CategorizationUpdateRequestProces
        sorFactory">
  l    <processor
        class="solr.RunUpdateProcessorFactory" />
  l    </updateRequestProcessorChain>



                                               26
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      void processAdd(AddUpdateCommand
        cmd) throws IOException
         l    String text = solrInputDocument.getFieldValue(“text”);
         l    String class = classifier.assignClass(text);
         l    solrInputDocument.addField(“cat”, class);
  l    Every now and then need to retrain to get latest
        stuff in the current index, but that can be done in
        the background without affecting performances


                                                             27
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      Finer grained control
  l    Use automatic text categorization only if a value
        does not exist for the “cat” field
  l    Add the classifier output class to the “cat” field
        only if it’s above a certain score




                                                      28
Wrap up
l      Simple classifiers with no or little effort
l      No architecture change
l      Both available to Lucene and Solr
l      Still reasonably fast
l      A lot more can be done
  l    Implement a MaxEnt Lucene based classifier
         l    which takes into account words correlation



                                                            29
Thanks!




          30

More Related Content

What's hot

constructors in java ppt
constructors in java pptconstructors in java ppt
constructors in java pptkunal kishore
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Kira
 
File and directories in python
File and directories in pythonFile and directories in python
File and directories in pythonLifna C.S
 
11. Java Objects and classes
11. Java  Objects and classes11. Java  Objects and classes
11. Java Objects and classesIntro C# Book
 
Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab)
Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab) Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab)
Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab) Minnie Seungmin Cho
 
Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Jean Brenda
 
Generics and collections in Java
Generics and collections in JavaGenerics and collections in Java
Generics and collections in JavaGurpreet singh
 
PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)
PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)
PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)JunSuzuki21
 
PHPUnit 入門介紹
PHPUnit 入門介紹PHPUnit 入門介紹
PHPUnit 入門介紹Jace Ju
 
#살아있다 #자프링외길12년차 #코프링2개월생존기
#살아있다 #자프링외길12년차 #코프링2개월생존기#살아있다 #자프링외길12년차 #코프링2개월생존기
#살아있다 #자프링외길12년차 #코프링2개월생존기Arawn Park
 
الوراثة في الجافا
الوراثة في الجافاالوراثة في الجافا
الوراثة في الجافاGhadeerAhmedAljishi
 
Object-Oriented Programming with Java UNIT 1
Object-Oriented Programming with Java UNIT 1Object-Oriented Programming with Java UNIT 1
Object-Oriented Programming with Java UNIT 1SURBHI SAROHA
 

What's hot (20)

Json
JsonJson
Json
 
constructors in java ppt
constructors in java pptconstructors in java ppt
constructors in java ppt
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
File and directories in python
File and directories in pythonFile and directories in python
File and directories in python
 
11. Java Objects and classes
11. Java  Objects and classes11. Java  Objects and classes
11. Java Objects and classes
 
Models for hierarchical data
Models for hierarchical dataModels for hierarchical data
Models for hierarchical data
 
C# classes objects
C#  classes objectsC#  classes objects
C# classes objects
 
Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab)
Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab) Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab)
Azure OpenAI 및 ChatGPT 실습가이드 (Hands-on-lab)
 
Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval
 
Input output streams
Input output streamsInput output streams
Input output streams
 
Generics and collections in Java
Generics and collections in JavaGenerics and collections in Java
Generics and collections in Java
 
PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)
PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)
PAKDD2023_Tutorial_T2 (Overview, Part 1, and Part 2)
 
PHPUnit 入門介紹
PHPUnit 入門介紹PHPUnit 入門介紹
PHPUnit 入門介紹
 
#살아있다 #자프링외길12년차 #코프링2개월생존기
#살아있다 #자프링외길12년차 #코프링2개월생존기#살아있다 #자프링외길12년차 #코프링2개월생존기
#살아있다 #자프링외길12년차 #코프링2개월생존기
 
الوراثة في الجافا
الوراثة في الجافاالوراثة في الجافا
الوراثة في الجافا
 
Dot Net Course Syllabus
Dot Net Course SyllabusDot Net Course Syllabus
Dot Net Course Syllabus
 
C# Access modifiers
C# Access modifiersC# Access modifiers
C# Access modifiers
 
Generics C#
Generics C#Generics C#
Generics C#
 
Inner classes in java
Inner classes in javaInner classes in java
Inner classes in java
 
Object-Oriented Programming with Java UNIT 1
Object-Oriented Programming with Java UNIT 1Object-Oriented Programming with Java UNIT 1
Object-Oriented Programming with Java UNIT 1
 

Similar to Text Categorization with Lucene and Solr

Object Oriented Programming.pptx
Object Oriented Programming.pptxObject Oriented Programming.pptx
Object Oriented Programming.pptxSAICHARANREDDYN
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture OneAngelo Corsaro
 
Object oriented programming in python
Object oriented programming in pythonObject oriented programming in python
Object oriented programming in pythonnitamhaske
 
Summer Training Project On Python Programming
Summer Training Project On Python ProgrammingSummer Training Project On Python Programming
Summer Training Project On Python ProgrammingKAUSHAL KUMAR JHA
 
What To Leave Implicit
What To Leave ImplicitWhat To Leave Implicit
What To Leave ImplicitMartin Odersky
 
Conventional and Reasonable
Conventional and ReasonableConventional and Reasonable
Conventional and ReasonableKevlin Henney
 
PRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxPRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxabhishek364864
 
Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersHoffman Lab
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learningStanley Wang
 
Learning puppet chapter 2
Learning puppet chapter 2Learning puppet chapter 2
Learning puppet chapter 2Vishal Biyani
 
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptLotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptBill Buchan
 
EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP Max Kleiner
 

Similar to Text Categorization with Lucene and Solr (20)

Oop basic concepts
Oop basic conceptsOop basic concepts
Oop basic concepts
 
Java Notes
Java NotesJava Notes
Java Notes
 
Object Oriented Programming.pptx
Object Oriented Programming.pptxObject Oriented Programming.pptx
Object Oriented Programming.pptx
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture One
 
Object oriented programming in python
Object oriented programming in pythonObject oriented programming in python
Object oriented programming in python
 
Proyecto
ProyectoProyecto
Proyecto
 
Summer Training Project On Python Programming
Summer Training Project On Python ProgrammingSummer Training Project On Python Programming
Summer Training Project On Python Programming
 
What To Leave Implicit
What To Leave ImplicitWhat To Leave Implicit
What To Leave Implicit
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Seminar dm
Seminar dmSeminar dm
Seminar dm
 
Scala Days NYC 2016
Scala Days NYC 2016Scala Days NYC 2016
Scala Days NYC 2016
 
Conventional and Reasonable
Conventional and ReasonableConventional and Reasonable
Conventional and Reasonable
 
PRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxPRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptx
 
Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitioners
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learning
 
Learning puppet chapter 2
Learning puppet chapter 2Learning puppet chapter 2
Learning puppet chapter 2
 
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptLotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
 
EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP
 
Java_Interview Qns
Java_Interview QnsJava_Interview Qns
Java_Interview Qns
 

More from Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr Tommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache HamaTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on codeTommaso Teofili
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesTommaso Teofili
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationTommaso Teofili
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 

More from Tommaso Teofili (19)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Recently uploaded

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Recently uploaded (20)

SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Text Categorization with Lucene and Solr

  • 1. Text categorization with Lucene and Solr Tommaso Teofili tommaso [at] apache [dot] org 1
  • 2. About me l  ASF member having fun with: l  Lucene / Solr l  Hama l  UIMA l  Stanbol l  … some others l  SW engineer @ Adobe R&D 2
  • 3. Agenda l  Classification l  Lucene classification module l  Solr text categorization services l  Conclusions 3
  • 4. Classification l  Let the algorithm assign one or more labels (classes) to some item given some previous knowledge l  Spam filter l  Tagging system l  Digit recognition system l  Text categorization l  etc. 4
  • 6. The short story l  Lucene already has a lot of features for common information retrieval needs l  Postings l  Term vectors l  Statistics l  Positions l  TF / IDF l  maybe Payloads l  etc. l  We may avoid bringing in new components to do classification just leveraging what we get for free from Lucene 6
  • 7. The (slightly) longer story #1 l  While playing with NLP stuff l  Need to implement a naïve bayes classifier l  Not possible to plug in stuff requiring touching the architecture l  Not really interested in (near) real time performance l  Iteration 1 l  Plain in memory Java stuff l  Iteration 2 l  Same stuff but using Lucene instead of loading things into memory l  Too much faster J 7
  • 8. The (slightly) longer story #2 l  So I realized l  Lucene has so many features stored you can take advantage of for free l  Therefore writing the classification algorithm is relatively simple l  In many cases you’re just not adding anything to the architecture l  Your Lucene index was already there for searching l  Lucene index is, to some extent, already a model which we just need to “query” with the proper algorithm l  And it is fast enough 8
  • 9. Lucene classification module l  Work in progress on trunk l  LUCENE-4345 l  Establishing classification API l  With currently two implementations l  Naïve bayes l  K nearest neighbor 9
  • 10. Lucene classification module l  Classifier API l  Training l  void train(atomicReader, contentField, classField, analyzer) throws IOException l  atomicReader : the reader on the Lucene index to use for classification l  still unsure if IR’d be better l  textFieldName : the name of the field which contains documents’ texts l  classFieldName : the name of the field which contains the class assigned to existing documents l  analyzer : the item used for analyzing the unseen texts 10
  • 11. Lucene classification module l  Classifier API l  Classifying l  ClassificationResult assignClass(String text) throws IOException l  text: the unseen text of the document to classify l  ClassificationResult : the object containing the assigned class along with the related score 11
  • 12. K Nearest neighbor classifier l  Fairly simple classification algorithm l  Given some new unseen item l  I search in my knowledge base the k items which are nearer to the new one l  I get the k classes assigned to the k nearest items l  I assign to the new item the class that is most frequent in the k returned items 12
  • 13. K Nearest neighbor classifier l  How can we do this in Lucene? l  We have VSM for representing documents as vectors and eventually find distances l  Lucene MoreLikeThis module can do a lot for it l  Given a new document l  It’s represented as a MoreLikeThisQuery which filters out too frequent words and helps on keeping only the relevant tokens for finding the neighbors l  The query is executed returning only the first k results l  The result is then browsed in order to find the most frequent class and that is then assigned with a score of classFreq / k 13
  • 14. Naïve Bayes classifier l  Slightly more complicated l  Based on probabilities l  C = argmax( P(d|c) * P(c) ) l  P(d|c) : likelihood l  P(c) : prior l  With some assumptions: l  bag of words assumption: positions don't matter l  conditional independence: the feature probabilities are independent given a class 14
  • 15. Naïve Bayes classifier l  Prior calculation is easy l  It’s the relative frequency of each class #docsWithClassC / #docs l  Likelihood is easy too because of the bag of words assumption l  P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c) l  So we just need probabilities of single terms l  P(x|c) := (tf of x in documents with class c + 1)/ (#terms in docs with class c + #docs) 15
  • 16. Naïve Bayes classifier l  Does the bag of words assumption affect the classifier’s precision? l  Yes in theory l  in text documents (nearby) words are strictly correlated l  Not always in practice l  depending on your index data it may or not have an impact 16
  • 17. Using different indexes l  The Classifier API makes usage of an AtomicReader to get the data for the training l  It must not be the very same index used for every day index / search l  For performance reasons l  For enhancing classifier effectiveness l  Using more specific analyzers l  Indexing data in a different way l  e.g. one big document for each class and use kNN (with a small k) or TF-IDF similarity 17
  • 18. Things to consider - bootstrapping l  How are your first documents classified? l  Manually l  Categories are already there in the documents l  Someone is explicitly charged to do that (e.g. article authors) at some point in time l  (semi) automatically l  Using some existing service / library l  With or without human supervision l  In either case the classifier needs something to be fed with to be effective 18
  • 19. Things to consider – tokenizing l  How are your content field tokenized? l  Whitespace l  It doesn’t work for each language l  Standard l  Sentence l  What about using N-Grams? l  What about using Shingles? 19
  • 20. Things to consider - filtering l  Some words may / should be filtered while l  Training l  Classifying l  Often l  Stopwords l  Punctuation l  Not relevant PoS tagged tokens 20
  • 21. Raw benchmarking l  Tried both algorithms on ~1M docs index l  Naïve bayes is affected by the # of classes l  kNN is affected by k being large l  None of them took more than 1-2m to train even with great number of classes or large k values 21
  • 22. From Lucene to Solr l  The Lucene classifiers can be easily used in Solr l  As specific search services l  A classification based more like this l  While indexing l  For automatic text categorization 22
  • 23. Classification based MLT l  Use case: l  “give me all the documents that belong to the same category of a new not indexed document” l  Slightly different from basic MLT since it does not return the nearest docs l  That is useful if the user doesn’t want / need to index the document and still want to find all the other documents of the same category, whatever this category means 23
  • 24. Classification based MLT l  ClassificationMLTHandler l  String d = req.getParams().get(DOC); l  ClassificationResult r = classifier.assignClass(d); l  String c = r.getAssignedClass(); l  req.getSearcher().search(new TermQuery(new Term(classFieldName, c)), rows); 24
  • 25. Automatic text categorization l  Once a doc reaches Solr l  We can use the Lucene classifiers to automate assigning document’s category l  We can leverage existing Solr facilites for enhancing the indexing pipeline l  An UpdateChain can be decorated with one or more UpdateRequestProcessors 25
  • 26. Automatic text categorization l  Configuration l  <updateRequestProcessorChain name=“ctgr"> l  <processor class=”solr.CategorizationUpdateRequestProces sorFactory"> l  <processor class="solr.RunUpdateProcessorFactory" /> l  </updateRequestProcessorChain> 26
  • 27. Automatic text categorization l  CategorizationUpdateRequestProcessor l  void processAdd(AddUpdateCommand cmd) throws IOException l  String text = solrInputDocument.getFieldValue(“text”); l  String class = classifier.assignClass(text); l  solrInputDocument.addField(“cat”, class); l  Every now and then need to retrain to get latest stuff in the current index, but that can be done in the background without affecting performances 27
  • 28. Automatic text categorization l  CategorizationUpdateRequestProcessor l  Finer grained control l  Use automatic text categorization only if a value does not exist for the “cat” field l  Add the classifier output class to the “cat” field only if it’s above a certain score 28
  • 29. Wrap up l  Simple classifiers with no or little effort l  No architecture change l  Both available to Lucene and Solr l  Still reasonably fast l  A lot more can be done l  Implement a MaxEnt Lucene based classifier l  which takes into account words correlation 29
  • 30. Thanks! 30