Building an open-source based search solution –
                  first steps

                      Roman Kern

             Institute of Knowledge Management
                Graz University of Technology
                       Know-Center Graz
            rkern@tugraz.at, rkern@know-center.at


          Data Science Meetup / 2012-04-12
Overview




  Motivation


  Background


  Solr Ecosystem


  Solr Features


  Conclusions




Motivation




   Search
       Change in users' expectations
       Missing or sub-optimal search causes frustration

   Science
       Information retrieval
       Success story
       Mostly focused on web search

   Industry
       Enterprise search
       Heterogeneous data sources

Background of the Speaker




http://a1.net




                            http://wissen.de

Apache Lucene Umbrella Project




   Components
       Search engine ⇒ Lucene
       Search server ⇒ Solr
       Web search engine ⇒ Nutch
       Lightweight crawler ⇒ Droids
       File-format parsing ⇒ Tika
       Communicate with CMS ⇒ ManifoldCF
       Distributed coordination ⇒ ZooKeeper
       Natural language processing ⇒ OpenNLP
       Related projects: Hadoop, Mahout, Carrot2, ...

   Common aspects
   Apache license, implemented in Java, community

Lucene




  Search Engine Library
         Java API
             Only for expert users
         Search-Index
             File-system
             In-memory index
         Advanced features
             Incremental indexing
             Update while searching
         Base for many projects
             Solr
             ir-lib
             elasticsearch
         LIA (Lucene in Action)


  http://lucene.apache.org/core/

Nutch




  Web search engine
        Builds upon Solr
        Web crawler
            Link database, crawl database
        Distributed
            Runs on Hadoop
        Mode of operation
            Crawl a single domain
            Crawl the web with seed sites



  http://nutch.apache.org/




Droids




   Crawler component
         Lightweight crawler
         Main features
             Throttling
             Multi-threaded
             Well behaved (robots.txt)
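
A "well behaved" crawler checks robots.txt before fetching a page. The following is a minimal sketch of that check using Python's standard library, not Droids itself; the robots.txt content, the bot name and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is an assumed user-agent name.
allowed = parser.can_fetch("example-bot", "http://example.org/index.html")
blocked = parser.can_fetch("example-bot", "http://example.org/private/data.html")

# Throttling: the number of seconds the site asks crawlers to wait between requests.
delay = parser.crawl_delay("example-bot")
```

A crawler would skip URLs for which `can_fetch` returns False and sleep for `delay` seconds between requests to the same host.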



   http://incubator.apache.org/droids/




Tika




   Text extraction
        Text & meta-data
        File-formats
             Office
                  Microsoft Formats (Apache POI)
                  OpenDocument
             Common text formats
                  PDF (PDFBox)
                  HTML (tagsoup)
             Non-text
                  Images
                  Sound



   http://tika.apache.org/


ManifoldCF




  Content Management System Connectors
       Communicate with CMS/DMS
       Connectors
            FileNet P8 (IBM)
            Documentum (EMC)
            LiveLink (OpenText)
            Meridio (Autonomy)
            Windows shares (Microsoft)
            SharePoint (Microsoft)
            More: Alfresco, JDBC, ...
       Data is then stored and indexed
            e.g. Solr



  http://incubator.apache.org/connectors/


ZooKeeper




  Distributed coordination
       Orchestrate servers
       Distributed
            Configuration
            Name lookup
            Synchronization




  http://zookeeper.apache.org/

OpenNLP




  Natural language processing
       Process plain text
       Maximum entropy classification with beam search
       Models
            Sentence splitting
            Token splitting
            Part-of-speech (POS) tagging
            Named entity recognition
            more: chunker, parser, co-reference resolution



  http://opennlp.sourceforge.net/




Hadoop




  Distributed computing
       Scale out framework
       Distributed file-system
            Data is partitioned
            Stored on multiple nodes
       Map/Reduce paradigm
            Map your algorithms to mappers & reducers
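
To illustrate what "mapping an algorithm to mappers and reducers" means, here is a single-process sketch of a word count in the Map/Reduce style. It does not use the Hadoop API; the function names are illustrative, and the shuffle step that Hadoop performs between nodes is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["to be or not to be"])
```

Because each mapper call sees only one line and each reducer call sees only one key, both steps can run in parallel on different nodes.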



  Related projects: HBase, Pig, Hive, ...


  http://hadoop.apache.org/



Mahout




  Distributed machine learning
       Scale out framework
       Machine learning
            Recommender systems
            Clustering
            Classification
       Integration
            Standalone
            Hadoop
            Amazon EC2



  http://mahout.apache.org/



Details




Search Server




   What Solr is
       Web-Service
       Full-text indexing & search
       Support to store arbitrary content

   What Solr isn’t
       Solr = grep
       Database
            But somewhat similar to NoSQL databases

   Solr vs. IR-Lib
       Solr: easy to use, easy to integrate, XML configuration
       IR-Lib: expert knowledge to use, Java configuration, fast

Index Structure




   Inverted Index
       Dictionary of words (terms)
       Map from term to document

   Document
       List of fields
       Input fields are then mapped according to the schema

   Field-types
       Defined in the schema
       Type (string, boolean, date, number) - internally mapped to
       string
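
The inverted index can be pictured as a dictionary from term to the set of documents containing it. The sketch below is a toy version under simplifying assumptions (single text field, whitespace tokenization, no positions or scores); it is not how Lucene stores its index on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build a term -> set of document ids mapping from {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Look up a single term; the result is the sorted list of matching ids."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical example documents.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search engine library",
    3: "Tika extracts text",
}
index = build_inverted_index(docs)
hits = search(index, "search")
```

A lookup only touches the dictionary entry of the query term, which is why search time does not grow with the full text of every document.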


Index Management




  API
        HTTP Server
        Various formats (XML, binary, JavaScript, ...)

  Document life-cycle
        There is no update
        Delete (done automatically by Solr)
        Insert
        Implications
            A unique id is necessary
            Use batch updates
        Commit, rollback (and optimize)
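
The life-cycle above can be modelled in a few lines. This is an in-memory sketch, not the Solr API: the class and method names are invented, and it only shows why a unique id is needed (an "update" is a delete of the old document plus an insert of the new one) and what commit and rollback do to pending changes.

```python
class TinyIndex:
    """Toy model of the document life-cycle: add, commit, rollback."""

    def __init__(self, id_field="id"):
        self.id_field = id_field
        self._committed = {}  # documents visible to searches
        self._pending = {}    # documents added since the last commit

    def add(self, docs):
        """Batch insert; a document with a known id replaces the old one."""
        for doc in docs:
            self._pending[doc[self.id_field]] = dict(doc)

    def commit(self):
        """Make all pending documents visible."""
        self._committed.update(self._pending)
        self._pending.clear()

    def rollback(self):
        """Discard everything added since the last commit."""
        self._pending.clear()

    def get(self, doc_id):
        return self._committed.get(doc_id)


index = TinyIndex()
index.add([{"id": "a", "title": "first"}, {"id": "b", "title": "second"}])
index.commit()

index.add([{"id": "a", "title": "first, revised"}])  # replaces document "a"
index.commit()

index.add([{"id": "b", "title": "never visible"}])
index.rollback()
```

Sending documents in batches and committing once at the end avoids paying the cost of a commit per document.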


Input Handling




   Different input formats
       XML
       CSV
       JDBC (database)
            DIH (data import handler)
            Support incremental updates (via timestamps)
       Solr Cell
            Binary content
            Apache Tika
            Text content and metadata




Text Processing




   Scope
       During indexing & query

   Tokenization
       Split text into tokens
       Lower-casing
       Stemming (e.g. ponies, pony ⇒ poni, triplicate ⇒ triplic, ...)
       Synonyms (via Thesaurus)
       Stop-word filtering
       Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
       n-grams, soundex, umlauts
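
A few of these steps can be chained as follows. This is a simplified sketch in plain Python rather than a Solr analyzer configuration: the stop-word list is a tiny made-up sample, and stemming, synonyms, n-grams and soundex are left out.

```python
import re

# Tiny illustrative stop-word list; real lists are language specific and longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "is"}


def analyze(text):
    """Tokenize, split multi-word tokens, lower-case and drop stop words."""
    # Tokenization: runs of letters/digits, keeping hyphenated words together.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    result = []
    for token in tokens:
        # Multi-word splitting: "Wi-Fi" becomes "Wi" and "Fi".
        for part in token.split("-"):
            part = part.lower()  # lower-casing
            if part not in STOP_WORDS:  # stop-word filtering
                result.append(part)
    return result


terms = analyze("The Wi-Fi signal is strong")
```

The same chain has to run at indexing time and at query time; otherwise the query term `Wi-Fi` would never match the indexed terms `wi` and `fi`.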


Query Processing




   Query parsers
        Lucene query parser (rich syntax)
              AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
              Boosting of individual parts
              Example: ((boltzmann OR schroedinger) NOT einstein)
        Dismax query parser
              No query syntax
              Searches over multiple fields (separate boost for each field)
              Configure how many of the terms are mandatory
              Distance between terms is used for ranking (phrase boosting)


   Dismax is a good starting point, but may become expensive
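
The difference between the two parsers shows up in the request parameters. The sketch below only builds request URLs and sends nothing; `q`, `defType`, `qf` and `mm` are Solr request parameters, while the host, port, path and the field names `title` and `body` are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlencode

# Assumed location of a local Solr instance; adjust to the actual deployment.
SELECT_URL = "http://localhost:8983/solr/select"


def lucene_query_url(query):
    """Lucene query parser: the syntax (AND, OR, NOT, ...) is part of q."""
    return SELECT_URL + "?" + urlencode({"q": query})


def dismax_query_url(user_input, field_boosts, min_match):
    """Dismax: plain user input, searched over several boosted fields."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {
        "defType": "dismax",
        "q": user_input,       # no query syntax expected here
        "qf": qf,              # fields to search, each with its own boost
        "mm": str(min_match),  # how many terms have to match
    }
    return SELECT_URL + "?" + urlencode(params)


lucene_url = lucene_query_url("(boltzmann OR schroedinger) NOT einstein")
dismax_url = dismax_query_url("boltzmann entropy", {"title": 2, "body": 1}, 2)
dismax_params = parse_qs(dismax_url.split("?", 1)[1])
```

With Dismax the user input can be passed through unchanged, which is one reason it works well as a starting point for end-user search boxes.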




Search Features




   Query filter
       Additional query
       No impact on ranking
       Results are cached

   Boosting query
       Only in Dismax

   Query elevation
       Fix certain queries

   Request handler
       Pre-define clauses
       Invariants

Search Result




   Ranking
       Relevance
       Sort on field value (only single term per document)

   Available data & features
       Sequence of IDs & score
       Stored fields
       Snippets (plus highlighting)
       Facets
             Count the search hits
             Types: field value, dates, queries
             Sort, prefix, ...
             Could be used for term suggestion (a.k.a. query suggestion)
       Field collapsing (grouping)
       Spell checking (did-you-mean)
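
Field-value facets boil down to counting, for each value of a field, how many search hits carry it. The sketch below computes such counts over an in-memory hit list; the documents and the field name `category` are made up, and Solr computes the counts inside the index rather than over a result list like this.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many hits carry each value of the given field."""
    counts = Counter()
    for doc in hits:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]  # treat a single value like a one-element list
        counts.update(values)
    return counts.most_common()  # highest count first


# Hypothetical search hits.
hits = [
    {"id": 1, "category": "physics"},
    {"id": 2, "category": ["physics", "history"]},
    {"id": 3, "category": "physics"},
    {"id": 4},
]
facets = facet_counts(hits, "category")
```

The resulting (value, count) pairs are what a user interface typically renders as clickable filters next to the result list.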

Additional Solr Features




   Query by Example
       More like this

   Stats
       Per field
       Min, max, sum, missing, ...

   Admin-GUI
       Webapp to troubleshoot queries
       Browse schema

   JMX
       Read properties & statistics
       Can be accessed remotely

Integration




   Deployment
       Within a web application server
       Embedded

   Monitor
       Log output

   Access
       Various language bindings
       Java, Ruby, JavaScript, PHP, ...



Multi-core




   Multiple indices
       Each index has its own configuration

   Operations
       Reload (when configuration has been changed)
       Rename
       Swap
       Merge
       Create, Status




Scale Solr




   Replication
       Master and slave nodes
       Replication
       Slaves poll master

   Dispatch search request
       Load balancer
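
Dispatching search requests over the replicas can be as simple as round-robin selection. The following sketch assumes a fixed list of replica URLs (the host names are invented) and ignores health checks, which a production load balancer would add.

```python
from itertools import cycle


class RoundRobinDispatcher:
    """Hand out replica URLs in turn so that searches are spread evenly."""

    def __init__(self, replicas):
        replicas = list(replicas)
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = cycle(replicas)

    def next_replica(self):
        return next(self._replicas)


# Hypothetical replica addresses.
dispatcher = RoundRobinDispatcher(["http://solr-a:8983/solr", "http://solr-b:8983/solr"])
picked = [dispatcher.next_replica() for _ in range(3)]
```

Index updates are not dispatched this way; they go to the master, and the replicas pick them up when they poll.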




Sharding Indexes




   Single index
       Index spans multiple machines
       Search is done in parallel

   Mapping
       Application has to provide a deterministic mapping
       Document ⇒ index
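
One common way to obtain a deterministic document-to-index mapping is to hash the unique id and take it modulo the number of shards. This is a sketch of that idea, not something the slides prescribe; the function name and the choice of CRC32 are assumptions.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in the range 0..num_shards-1."""
    if num_shards < 1:
        raise ValueError("num_shards must be positive")
    # CRC32 yields the same value in every process and on every machine.
    # Python's built-in hash() is randomized per process for strings,
    # so it would not give a stable mapping.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


shard = shard_for("doc-42", 4)
```

Note that changing the number of shards changes the mapping for most documents, so the shard count is best fixed before indexing starts.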




Conclusions




   Ecosystem
          Vibrant community
          Corporate backing

   Solr
          Easy to get started
          Hard to optimize for specific requirements




The End




  Thank you!




                                               30 / 28

More Related Content

Similar to DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps

Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intromarpierc
 
Coding serbia meetup 29.09.2015.
Coding serbia meetup 29.09.2015.Coding serbia meetup 29.09.2015.
Coding serbia meetup 29.09.2015.Matija Gobec
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
eResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software developmenteResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software developmentAndrea Wiggins
 
Adcom2006 Full 6
Adcom2006 Full 6Adcom2006 Full 6
Adcom2006 Full 6umavanth
 
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Baruch Sadogursky
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesIan Foster
 
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...San Diego Supercomputer Center
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
OGCE SciDAC2010 Tutorial
OGCE SciDAC2010 TutorialOGCE SciDAC2010 Tutorial
OGCE SciDAC2010 Tutorialmarpierc
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Serve and Scale ML Models ( Low latency prediction systems) at Scale
Serve and Scale ML Models ( Low latency prediction systems) at Scale Serve and Scale ML Models ( Low latency prediction systems) at Scale
Serve and Scale ML Models ( Low latency prediction systems) at Scale Srinivasa Rao Aravilli
 
grid mining
grid mininggrid mining
grid miningARNOLD
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023Timothy Chen
 
ryssiouk_anastasia_arestyposter
ryssiouk_anastasia_arestyposterryssiouk_anastasia_arestyposter
ryssiouk_anastasia_arestyposterAnastasia Ryssiouk
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...Amazon Web Services
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersVladimir Pavlov
 
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...Stuart Chalk
 

Similar to DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps (20)

Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
 
Coding serbia meetup 29.09.2015.
Coding serbia meetup 29.09.2015.Coding serbia meetup 29.09.2015.
Coding serbia meetup 29.09.2015.
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
eResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software developmenteResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software development
 
Adcom2006 Full 6
Adcom2006 Full 6Adcom2006 Full 6
Adcom2006 Full 6
 
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
 
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
OGCE SciDAC2010 Tutorial
OGCE SciDAC2010 TutorialOGCE SciDAC2010 Tutorial
OGCE SciDAC2010 Tutorial
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Serve and Scale ML Models ( Low latency prediction systems) at Scale
Serve and Scale ML Models ( Low latency prediction systems) at Scale Serve and Scale ML Models ( Low latency prediction systems) at Scale
Serve and Scale ML Models ( Low latency prediction systems) at Scale
 
grid mining
grid mininggrid mining
grid mining
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
 
ryssiouk_anastasia_arestyposter
ryssiouk_anastasia_arestyposterryssiouk_anastasia_arestyposter
ryssiouk_anastasia_arestyposter
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted Waters
 
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
Eureka Research Workbench: A Semantic Approach to an Open Source Electroni...
 

Recently uploaded

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Recently uploaded (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps

  • 1. Building an open-source based search solution – first steps Roman Kern Institute of Knowledge Management Graz University of Technology Know-Center Graz rkern@tugraz.at, rkern@know-center.at Data Science Meetup / 2012-04-12
  • 2. Overview Graz University of Technology Motivation Background Solr Ecosystem Solr Features Conclusions 2 / 28
  • 3. Motivation Graz University of Technology Search Change in users expectations Missing, sub-optimal search causes frustration Science Information retrieval Success story Mostly focused on web search Industry Enterprise search Heterogeneous data sources 3 / 28
  • 4. Background of the Speaker: http://a1.net, http://wissen.de
  • 5. Apache Lucene Umbrella Project. Components: Search engine ⇒ Lucene; Search server ⇒ Solr; Web search engine ⇒ Nutch; Lightweight crawler ⇒ Droids; File-format parsing ⇒ Tika; Communicate with CMS ⇒ ManifoldCF; Distributed coordination ⇒ ZooKeeper; Natural language processing ⇒ OpenNLP; Related projects: Hadoop, Mahout, Carrot2, ... Common aspects: Apache license, implemented in Java, community.
  • 6. Lucene. Search Engine Library: Java API; Only for expert users. Search-Index: File-system; In-memory index. Advanced features: Incremental indexing; Update while searching. Base for many projects: Solr; ir-lib; elasticsearch. LIA (Lucene in Action). http://lucene.apache.org/core/
  • 7. Nutch. Web search engine: Builds upon Solr. Web crawler: Link database, crawl database. Distributed: Runs on Hadoop. Mode of operation: Crawl a single domain; Crawl the web with seed sites. http://nutch.apache.org/
  • 8. Droids. Crawler component: Lightweight crawler. Main features: Throttling; Multi-threaded; Well behaved (robots.txt). http://incubator.apache.org/droids/
  • 9. Tika. Text extraction: Text & meta-data. File-formats. Office: Microsoft Formats (Apache POI), OpenDocument; Common text formats: PDF (PDFBox), HTML (tagsoup); Non-text: Images, Sound. http://tika.apache.org/
  • 10. ManifoldCF. Content Management System Connectors: Communicate with CMS/DMS. Connectors: FileNet P8 (IBM); Documentum (EMC); LiveLink (OpenText); Meridio (Autonomy); Windows shares (Microsoft); SharePoint (Microsoft); More: Alfresco, JDBC, ... Data is then stored and indexed, e.g. in Solr. http://incubator.apache.org/connectors/
  • 11. ZooKeeper. Distributed coordination: Orchestrate servers. Distributed: Configuration; Name lookup; Synchronization. http://zookeeper.apache.org/
  • 12. OpenNLP. Natural language processing: Process plain text; Maximum entropy classification with beam search. Models: Sentence splitting; Token splitting; Part-of-speech (POS) tagging; Named entity recognition; more: chunker, parser, co-reference resolution. http://opennlp.sourceforge.net/
  • 13. Hadoop. Distributed computing: Scale out framework. Distributed file-system: Data is partitioned; Stored on multiple nodes. Map/Reduce paradigm: Map your algorithms to mappers & reducers. Related projects: HBase, Pig, Hive, ... http://hadoop.apache.org/
  • 14. Mahout. Distributed machine learning: Scale out framework. Machine learning: Recommender systems; Clustering; Classification. Integration: Standalone; Hadoop; Amazon EC2. http://mahout.apache.org/
  • 15. Details
  • 16. Search Server. What Solr is: Web-Service; Full-text indexing & search; Supports storing arbitrary content. What Solr isn’t: Solr = grep; Database (but somewhat similar to NoSQL databases). Solr vs. IR-Lib. Solr: easy to use, easy to integrate, XML configuration; IR-Lib: expert knowledge to use, Java configuration, fast.
  • 17. Index Structure. Inverted Index: Dictionary of words (terms); Map from term to document. Document: List of fields; Input fields are then mapped according to the schema. Field-types: Defined in the schema; Type (string, boolean, date, number), internally mapped to string.
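The inverted index on slide 17 can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene stores its index: the function name `build_inverted_index`, the sample documents, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; a real analyzer does much more.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents.
docs = {1: "Solr is a search server", 2: "Lucene is a search library"}
index = build_inverted_index(docs)
print(index["search"])  # ids of all documents containing "search"
print(index["solr"])    # ids of all documents containing "solr"
```

A lookup is then a dictionary access on the term, which is why searching scales with the number of query terms rather than the number of documents.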
  • 18. Index Management. API: HTTP Server; Various formats (XML, binary, JavaScript, ...). Document life-cycle: There is no update; Delete (done automatically by Solr); Insert. Implications: A unique id is necessary; Use batch updates; Commit, rollback (and optimize).
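The document life-cycle on slide 18 (no in-place update, only delete followed by insert, made visible on commit) can be mimicked with a small in-memory model. The class `ToyIndex` and its methods are hypothetical names for this sketch and do not correspond to any Solr client API.

```python
class ToyIndex:
    """Toy stand-in for a search index; not Solr's actual implementation."""

    def __init__(self):
        self.committed = {}  # documents visible to searches
        self.pending = []    # documents added since the last commit

    def add_batch(self, docs):
        """Queue a batch of documents; each must carry a unique 'id'."""
        for doc in docs:
            if "id" not in doc:
                raise ValueError("a unique id is necessary")
            self.pending.append(doc)

    def commit(self):
        """Apply pending documents: delete any old version, then insert."""
        for doc in self.pending:
            self.committed.pop(doc["id"], None)  # delete (if present)
            self.committed[doc["id"]] = doc      # insert
        self.pending = []

    def rollback(self):
        """Discard everything added since the last commit."""
        self.pending = []


idx = ToyIndex()
idx.add_batch([{"id": "a", "title": "old"}, {"id": "b", "title": "other"}])
idx.commit()
idx.add_batch([{"id": "a", "title": "new"}])
idx.rollback()                      # the re-added document is dropped
title_after_rollback = idx.committed["a"]["title"]
idx.add_batch([{"id": "a", "title": "new"}])
idx.commit()                        # now the old version is replaced
title_after_commit = idx.committed["a"]["title"]
```

The sketch also shows why batching helps: changes only become visible at commit time, so many documents can share one commit.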
  • 19. Input Handling. Different input formats: XML; CSV. JDBC (database): DIH (data import handler); Supports incremental updates (via timestamps). Solr Cell: Binary content; Apache Tika; Text content and metadata.
  • 20. Text Processing. Scope: During indexing & query. Tokenization: Split text into tokens; Lower-case alignment; Stemming (e.g. ponies, pony ⇒ poni, triplicate ⇒ triplic, ...); Synonyms (via Thesaurus); Stop-word filtering; Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi); n-grams, soundex, umlauts.
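A few of the text-processing steps from slide 20 can be chained as follows. This is a rough sketch under simplifying assumptions: the stop-word list is illustrative, the tokenizer only handles ASCII letters and digits (so umlauts are not covered), and stemming and synonyms are left out entirely. The name `analyze` is made up for this example.

```python
import re

# Illustrative stop-word list; real deployments configure their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}


def analyze(text):
    """Tokenize, lower-case, and drop stop words (no stemming or synonyms)."""
    # Splitting on anything that is not an ASCII letter or digit also
    # performs the multi-word splitting from the slide (Wi-Fi => Wi, Fi).
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


tokens = analyze("The Wi-Fi router is fast")
print(tokens)  # ['wi', 'fi', 'router', 'fast']
```

As the slide notes, the same chain has to run at indexing time and at query time, otherwise query terms would not match the terms stored in the index.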
  • 21. Query Processing. Query parsers. Lucene query parser (rich syntax): AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query; Boosting of individual parts; Example: ((boltzmann OR schroedinger) NOT einstein). Dismax query parser: No query syntax; Searches over multiple fields (separate boost for each field); Configure the number of terms that must match; Distance between terms is used for ranking (phrase boosting). Dismax is a good starting point, but may become expensive.
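The Dismax options from slide 21 map onto request parameters: `qf` lists the fields with their boosts, `mm` sets how many terms must match, and `pf` names the fields used for phrase boosting. The sketch below only builds the request URL and sends nothing. The base URL and the field names `title` and `body` are assumptions for illustration, not taken from the talk.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def dismax_url(base_url, user_query):
    """Build a Dismax select URL; field names and boosts are placeholders."""
    params = {
        "defType": "dismax",     # select the Dismax query parser
        "q": user_query,         # plain user input, no query syntax needed
        "qf": "title^2.0 body",  # search both fields, boost title matches
        "mm": "2",               # at least two query terms must match
        "pf": "title body",      # phrase boosting on these fields
    }
    return base_url + "?" + urlencode(params)


# Assumed local Solr endpoint; adjust to the actual deployment.
url = dismax_url("http://localhost:8983/solr/select", "open source search")
parsed = parse_qs(urlsplit(url).query)
print(parsed["qf"])
```

Keeping these settings in one place (or in a request handler on the server) makes it easier to tune boosts later without touching the calling code.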
  • 22. Search Features. Query filter: Additional query; No impact on ranking; Results are cached. Boosting query: Only in Dismax. Query elevation: Fix certain queries. Request handler: Pre-define clauses; Invariants.
  • 23. Search Result. Ranking: Relevance; Sort on field value (only single term per document). Available data & features: Sequence of IDs & score; Stored fields; Snippets (plus highlighting). Facets: Count the search hits; Types: field value, dates, queries; Sort, prefix, ...; Could be used for term suggestion (a.k.a. query suggestion). Field collapsing (grouping). Spell checking (did-you-mean).
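Field-value faceting, as listed on slide 23, amounts to counting the search hits per distinct value of a field. The following is a minimal sketch over an in-memory hit list. The function name `facet_counts`, the field name `category`, and the sample hits are hypothetical, and Solr computes facets on the index rather than on returned documents.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many hits carry each value of the given field."""
    return Counter(doc[field] for doc in hits if field in doc)


# Hypothetical search hits.
hits = [
    {"id": 1, "category": "book"},
    {"id": 2, "category": "book"},
    {"id": 3, "category": "dvd"},
    {"id": 4},  # no category: does not contribute to the facet
]
counts = facet_counts(hits, "category")
print(counts.most_common())  # [('book', 2), ('dvd', 1)]
```

Sorting the counts in descending order gives the usual facet display, and the same counts over a prefix-restricted term list are what makes facets usable for term suggestion.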
  • 24. Additional Solr Features. Query by Example: More like this. Stats: Per field; Min, max, sum, missing, ... Admin-GUI: Webapp to troubleshoot queries; Browse schema. JMX: Read properties & statistics; Can be accessed remotely.
  • 25. Integration. Deployment: Within a web application server; Embedded. Monitor: Log output. Access: Various language bindings (Java, Ruby, JavaScript, PHP, ...).
  • 26. Multi-core. Multiple indices: Each index has its own configuration. Operations: Reload (when configuration has been changed); Rename; Swap; Merge; Create, Status.
  • 27. Scale Solr. Replication: Master and slave nodes; Replication: Slaves poll master; Dispatch search request: Load balancer.
  • 28. Sharding Indexes. Single index: Index spread over multiple machines; Search is done in parallel. Mapping: Application has to provide a deterministic mapping; Document ⇒ index.
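One common way for an application to provide the deterministic document ⇒ index mapping from slide 28 is to hash the unique document id. This is a sketch of that idea, not something the talk prescribes. The function name `shard_for` and the choice of MD5 are assumptions; a stable hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in a repeatable way."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# The same id always lands on the same shard, on any machine.
shard = shard_for("doc-42", 4)
same_again = shard_for("doc-42", 4)
```

Note that a plain modulo mapping moves most documents when the number of shards changes, so the shard count is best fixed up front or the index rebuilt after a change.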
  • 29. Conclusions. Ecosystem: Vibrant community; Corporate backing. Solr: Easy to get started; Hard to optimize for specific requirements.
  • 30. The End: Thank you!