Search is Not Enough

               Using Solr for Analytics




Steve Kearns
Director of Product Management
www.basistech.com
Agenda

• Basis Technology

• Search and Metadata

• Text + Text Analytics = Metadata

• Solr += Analytics
    Configuration
    Interface


• Conclusion
About Basis Technology

• Global leader in computational linguistics as applied to
  search-based applications, information discovery, and
  identity resolution

• Developer of the most capable, most mature, and most
  widely used platform for multilingual text analytics

• Solutions for commercial enterprises expanding globally
  and for government agencies dealing with foreign
  intelligence

• Offices: Boston, Washington, San Francisco, London,
  Tokyo
Search
Metadata
Search and Metadata

• Search alone
    Helps find documents/records
    May return many unnecessary results
    Inefficient way to solve a specific problem


• Search with Metadata
    New ways to visualize, navigate and explore
    Helps enable users to take action on documents/records
    Provides context to aid decision making
    New ways to connect disparate data sources
    New knobs to tune relevance


• Structured Data Sources
    Link unstructured data against structured
Metadata in Action
Metadata – Where does it come from?

• Structured Information associated with documents
    Author, publish date, part number, price, provenance, etc.


• Manual Annotation

• Text Analytics
Text Analytics


A set of automated analytical methods designed to add structure to unstructured content
Text Analytics techniques
Categorization of Text Analytics Technology

• Document-level Analytics
    Language Identification
    Summarization
    Categorization


• Sub-document Analysis
    Lemmatization – Improved Search
    Entity Extraction
    Fact/Relationship Extraction
    Topic Extraction
    Sentiment


• Cross-Document Analysis
    Document Clustering (at index or query time)
    Entity Search and Co-reference Resolution
Text Analytics in Action: E-Discovery

• Demo!
Document Level Analysis: Language Identification


• Sub-document Lang ID is possible

[Figure: sample passages in French, Russian, and Japanese labeled with their detected languages, followed by a mixed-language document with per-language shares: German 29%, French 33%, Japanese 21%, Arabic 17%.]
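A minimal sketch of the sub-document idea, assuming paragraphs are the unit of analysis. It detects the dominant Unicode script per paragraph, which is a much cruder technique than real language identification: it cannot tell French from German, since both use Latin script. The function names and labels are illustrative only.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return a coarse label for the most common letter script in text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
        elif name.startswith(("HIRAGANA", "KATAKANA", "CJK")):
            counts["Japanese/CJK"] += 1
        elif name.startswith("ARABIC"):
            counts["Arabic"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        else:
            counts["Other"] += 1
    return counts.most_common(1)[0][0] if counts else "Unknown"

def script_shares(paragraphs):
    """Fraction of paragraphs per dominant script."""
    labels = [dominant_script(p) for p in paragraphs]
    total = len(labels)
    return {label: n / total for label, n in Counter(labels).items()}

if __name__ == "__main__":
    sample = ["Le président nigérian", "Американская компания",
              "Der Kanzler strahlte", "端末側で行単位に"]
    print(script_shares(sample))
```

A statistical language identifier would replace `dominant_script`; the per-segment aggregation in `script_shares` stays the same.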
Document Level Analysis: Categorization

       • Group Documents into Pre-defined categories




http://news.google.com/
http://www.bbc.co.uk/
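As a rough illustration of grouping documents into pre-defined categories, the sketch below scores a document by keyword overlap. The category names and keyword sets are made up for the example; production categorizers are usually trained statistical models rather than keyword lists.

```python
CATEGORIES = {  # hypothetical categories and keywords
    "sports": {"match", "team", "goal", "league"},
    "business": {"market", "shares", "profit", "merger"},
    "technology": {"software", "search", "index", "server"},
}

def categorize(text, categories=CATEGORIES):
    """Return the category with the largest keyword overlap, or None."""
    words = set(text.lower().split())
    scores = {name: len(words & kws) for name, kws in categories.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    print(categorize("The team scored a late goal"))
```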
Sub-Document Analysis: Linguistics

• Segmentation of Asian languages

• Lemmatization

[Figure labels: Stemming, N-Gram, Morphological, Lemmatization, Segmentation]
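The contrast between these techniques can be sketched with toy implementations. Everything here is a simplification under stated assumptions: the suffix list and the lemma dictionary are invented for the example, and real lemmatizers and segmenters rely on full morphological analysis.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams, a common fallback for unsegmented text."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix; crude by design."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary standing in for a real morphological lexicon.
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse"}

def lemmatize(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)

if __name__ == "__main__":
    print(char_ngrams("東京都庁"))
    print(naive_stem("running"), lemmatize("running"))
    print(naive_stem("ran"), lemmatize("ran"))
```

Bigrams of 東京都庁 (Tokyo Metropolitan Government) include 京都 (Kyoto), so an n-gram index can match a query for Kyoto against a document about Tokyo. Stemming "running" with this suffix list gives "runn" and leaves "ran" unchanged, while the dictionary lookup maps both to "run".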
Sub-Document Analysis: Sentiment

      • Sentence, paragraph, entity, aspect, emotion




http://twittersentiment.appspot.com/search?query=Lucene
http://maps.google.com/maps/place?cid=7410753351872099397
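A sentence-level sentiment pass can be sketched with a tiny word lexicon. The word lists below are placeholders, and this approach ignores negation, entities, and aspects, all of which a real sentiment engine would need to handle.

```python
import re

POSITIVE = {"good", "great", "fast", "love"}   # toy lexicon
NEGATIVE = {"bad", "slow", "broken", "hate"}

def sentence_sentiment(text):
    """Score each sentence as (positive word count - negative word count)."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        results.append((sentence, score))
    return results

if __name__ == "__main__":
    for sentence, score in sentence_sentiment(
            "Lucene is great. The old index was slow and broken."):
        print(score, sentence)
```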
Sub-Document Analysis: Entity Extraction

      • Identify Named Concepts in Unstructured Text
              Statistical, rules, lists




http://www.twitscoop.com/
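Two of the three approaches named above, lists and rules, are easy to sketch. The gazetteer entries and the e-mail pattern are illustrative assumptions; the statistical approach needs a trained model and is not shown.

```python
import re

GAZETTEER = {  # hypothetical list-based lookup
    "Basis Technology": "ORGANIZATION",
    "Boston": "LOCATION",
    "Tokyo": "LOCATION",
}
EMAIL_RULE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # simple rule

def extract_entities(text):
    """Return sorted (start, end, type, surface) tuples from lists and rules."""
    found = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.end(), etype, m.group()))
    for m in EMAIL_RULE.finditer(text):
        found.append((m.start(), m.end(), "EMAIL", m.group()))
    return sorted(found)

if __name__ == "__main__":
    text = "Write to info@basistech.com at Basis Technology in Boston."
    for entity in extract_entities(text):
        print(entity)
```

Keeping character offsets alongside each entity lets a user interface highlight the mention in the original text.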
Sub-Document: Fact / Relationship / Event Extraction

      • Identify Facts, Link Entities, Events and Times




http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
Cross-Document: Entity Co-reference Resolution

• Map extracted entities to real-world Concepts
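In its simplest form, mapping mentions to real-world concepts is a lookup from normalized surface forms to a canonical identifier. The alias table and the identifier scheme below are assumptions for illustration; real resolvers also use context to disambiguate names that are shared by several people or places.

```python
ALIASES = {  # hypothetical alias table: normalized mention -> canonical ID
    "olusegun obasanjo": "person/obasanjo",
    "president obasanjo": "person/obasanjo",
    "obasanjo": "person/obasanjo",
    "catherine colonna": "person/colonna",
}

def resolve(mentions_by_doc):
    """Group (doc_id, mention) pairs under the canonical entity they map to."""
    resolved = {}
    for doc_id, mentions in mentions_by_doc.items():
        for mention in mentions:
            key = " ".join(mention.lower().split())  # normalize case and spacing
            entity = ALIASES.get(key)
            if entity is not None:
                resolved.setdefault(entity, []).append((doc_id, mention))
    return resolved

if __name__ == "__main__":
    print(resolve({"d1": ["Olusegun Obasanjo"],
                   "d2": ["President Obasanjo", "Catherine Colonna"]}))
```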
Cross-Document Analysis: Clustering

• Near Duplicate Detection

• Unsupervised Clustering
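One common way to detect near duplicates is to compare sets of word shingles with Jaccard similarity. The sketch below compares every pair of documents, which is only practical for small collections; the shingle size and threshold are arbitrary choices for the example.

```python
def shingles(text, k=3):
    """Set of k-word shingles from lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(docs, threshold=0.5):
    """Return pairs of document IDs whose shingle sets meet the threshold."""
    ids = sorted(docs)
    sets = {d: shingles(docs[d]) for d in ids}
    return [(x, y) for i, x in enumerate(ids) for y in ids[i + 1:]
            if jaccard(sets[x], sets[y]) >= threshold]

if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "solr indexes documents for search",
    }
    print(near_duplicates(docs))
```

Documents "a" and "b" share 6 of 8 distinct shingles, a similarity of 0.75, so they are reported as a pair.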
Text Analytics: Entity Search
Solr += Analytics
Text Analytics in/around Solr

• Analyzer/Tokenizer/TokenFilter

• UpdateRequestProcessor
    Run Analysis in Solr
    Call External Analysis Service


• Pre-Processor to Solr
Integration Point: Analyzer

• Good for:
    Linguistics
    Segmentation of Asian Languages
    Customized Segmentation


• Limitations:
    No access to document object


• An Analyzer is:
    CharFilter
    Tokenizer
    Set of TokenFilters
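The three-stage structure can be modeled in a few lines of Python. This is a conceptual sketch, not the Lucene API: the function names are invented, and each stage is a plain function rather than a factory-built component.

```python
import re

def strip_markup(text):
    """Plays the role of a CharFilter: rewrites the raw character stream."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Plays the role of a Tokenizer: turns text into tokens."""
    return text.split()

def lowercase_filter(tokens):
    """A TokenFilter: transforms each token."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "of"})):
    """A TokenFilter: removes tokens."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

if __name__ == "__main__":
    print(analyze("<b>The</b> Future of Search",
                  [strip_markup], whitespace_tokenizer,
                  [lowercase_filter, stop_filter]))
```

Note that `analyze` only ever sees one field's text, which mirrors the limitation listed above: an analyzer has no access to the rest of the document.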
Analyzer/Tokenizer Configuration

• schema.xml

   FieldType
     • Analyzer (Index)
        – CharFilter
        – Tokenizer
        – TokenFilter
     • Analyzer (Query)
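To make the outline concrete, the sketch below embeds an example fieldType definition as a string and lists its index and query chains. The field type name `text_example` and the choice of components are assumptions for illustration; the factory classes shown ship with Solr, but check the documentation for your version before copying them.

```python
import xml.etree.ElementTree as ET

# Example schema.xml fragment; names and component choices are illustrative.
FIELD_TYPE_XML = """
<fieldType name="text_example" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""

def describe_chains(xml_text):
    """Map each analyzer type to its ordered (element, class) components."""
    root = ET.fromstring(xml_text.strip())
    return {
        analyzer.get("type"): [(c.tag, c.get("class")) for c in analyzer]
        for analyzer in root.findall("analyzer")
    }

if __name__ == "__main__":
    for kind, chain in describe_chains(FIELD_TYPE_XML).items():
        print(kind, chain)
```

In schema.xml, TokenFilters are declared with the `<filter>` element. Index and query chains are often similar but need not be identical.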
Integration Point: UpdateRequestProcessor

• Runs Before Analyzers

• Full Access to Document


• Two options:
    Run the analysis directly in Solr
    Call out to external analysis services



• Limitations:
    Analysis runs at index time, so think through your indexing strategy
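Conceptually, a processor chain receives the whole document and can add or change fields before any analyzer runs. The Python model below illustrates that flow; it is not the Solr Java API, and both processors are hypothetical.

```python
class LanguageTagger:
    """Hypothetical processor: tags Cyrillic text as 'ru', anything else as 'und'."""
    def process(self, doc):
        text = doc.get("text", "")
        has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
        doc["language"] = "ru" if has_cyrillic else "und"
        return doc

class WordCounter:
    """Hypothetical processor: stores a whitespace word count."""
    def process(self, doc):
        doc["word_count"] = len(doc.get("text", "").split())
        return doc

def run_chain(doc, processors):
    """Apply each processor in order to a copy of the document."""
    enriched = dict(doc)
    for processor in processors:
        enriched = processor.process(enriched)
    return enriched

if __name__ == "__main__":
    doc = {"id": "1", "text": "Американская софтверная компания"}
    print(run_chain(doc, [LanguageTagger(), WordCounter()]))
```

Because each processor sees every field, one stage can write metadata (here, `language`) that later stages or the schema can rely on.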
Integration Point: UpdateRequestProcessor

• Run the analysis directly in Solr
    Good for lightweight, stateless document analytics
    Not good for cross-document analytics




• Call out to external analysis services
    Web Services, UIMA, OpenPipeline, GATE, custom code
    Note that these external calls are synchronous
    Additional complexity / points of failure
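Because the external call is synchronous, a slow or failing service stalls or breaks indexing unless the caller decides what to do on failure. The sketch below shows one possible policy: index the document anyway and record the failure. The `analyze` callable stands in for any external service client, and the field names are assumptions.

```python
def enrich_with_fallback(doc, analyze, field="entities"):
    """Call an external analyzer synchronously; keep the document if it fails."""
    enriched = dict(doc)
    try:
        enriched[field] = analyze(doc["text"])
        enriched["analysis_status"] = "ok"
    except Exception as exc:  # e.g. network errors, timeouts, bad responses
        enriched[field] = []
        enriched["analysis_status"] = f"failed: {type(exc).__name__}"
    return enriched

if __name__ == "__main__":
    def unavailable_service(text):
        raise TimeoutError("analysis service did not respond")

    print(enrich_with_fallback({"id": "1", "text": "some text"}, unavailable_service))
```

Recording a status field makes it possible to find and re-process the affected documents later. The alternative policy, rejecting the document, keeps the index consistent at the cost of failed updates.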
UpdateRequestProcessor Configuration

• solrconfig.xml
    RequestHandler
       • update.processor = UpdateRequestProcessorChain.name
    UpdateRequestProcessorChain
       • Processors
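The wiring between a request handler and its processor chain can be illustrated with an example fragment, embedded here as a string and inspected with the standard library. The chain name `analytics` and the class `com.example.LanguageTaggerProcessorFactory` are hypothetical. The handler class and the `update.processor` parameter follow the older Solr syntax this deck refers to; as far as I know, later releases renamed the parameter to `update.chain`, so verify against your version.

```python
import xml.etree.ElementTree as ET

# Illustrative solrconfig.xml fragment (older Solr syntax, hypothetical names).
SOLRCONFIG_FRAGMENT = """
<config>
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">analytics</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name="analytics">
    <processor class="com.example.LanguageTaggerProcessorFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
</config>
"""

def chain_for_handler(xml_text, handler_name):
    """Return (chain name, ordered processor classes) for a request handler."""
    root = ET.fromstring(xml_text.strip())
    handler = next(h for h in root.findall("requestHandler")
                   if h.get("name") == handler_name)
    chain_name = next(s.text for s in handler.iter("str")
                      if s.get("name") == "update.processor")
    chain = next(c for c in root.findall("updateRequestProcessorChain")
                 if c.get("name") == chain_name)
    return chain_name, [p.get("class") for p in chain.findall("processor")]

if __name__ == "__main__":
    print(chain_for_handler(SOLRCONFIG_FRAGMENT, "/update"))
```

Custom processors go before `solr.RunUpdateProcessorFactory`, which is the step that actually hands the document to the index.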
Integration Point: Pre-Processor

• Index in Solr as Last Step of Analysis




• Good For:
    Finer-grained control
    Managing dependencies between analytic components
    Scalability


• Limitations:
    Complexity / New points of failure
    Cannot use Solr’s content acquisition features
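A pre-processor pipeline runs every analysis stage outside Solr and only then builds the update request. In the sketch below the stages are hypothetical stand-ins, and the second stage depends on a field written by the first, which is the kind of dependency such a pipeline has to manage. The payload is only serialized, not sent; the endpoint it would be posted to depends on your Solr setup.

```python
import json

def detect_language(doc):
    """Hypothetical stage: a real one would call a language identifier."""
    doc["language"] = "en"
    return doc

def tag_capitalized(doc):
    """Hypothetical stage that requires 'language' to be set first."""
    if "language" not in doc:
        raise ValueError("entity tagging needs a language")
    doc["entities"] = [w for w in doc["text"].split() if w.istitle()]
    return doc

def build_update_payload(raw_docs, stages):
    """Run every stage on every document, then serialize as a JSON array."""
    processed = []
    for raw in raw_docs:
        doc = dict(raw)
        for stage in stages:
            doc = stage(doc)
        processed.append(doc)
    return json.dumps(processed, ensure_ascii=False)

if __name__ == "__main__":
    docs = [{"id": "1", "text": "Steve Kearns spoke in Boston"}]
    print(build_update_payload(docs, [detect_language, tag_capitalized]))
```

Running the stages in the wrong order raises an error before anything reaches the index, which is one benefit of keeping indexing as the last step.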
Integration Summary

• There are Many Options!

• Document-Level Analysis:
    Generally safe to run in UpdateRequestProcessor


• Sub-Document Analysis:
    UpdateRequestProcessor or external


• Cross-Document Analysis:
    Run external


• Multiple-Analysis Components:
    Run external document processing pipeline
Other Concerns

• Re-indexing may be expensive, so when linking against
  structured data:
    Index RowID if structured DB allows changes
       • Retrieve row details at page rendering time to enable faceting
    Index content if DB is static
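The row-ID approach can be sketched with an in-memory SQLite database standing in for the structured source. The table, columns, and hit format are assumptions for the example. Search hits carry only `row_id`, and the current row details are joined in when the page is rendered.

```python
import sqlite3

def render_hits(hits, conn):
    """Attach current row details to search hits that carry only a row_id."""
    ids = [h["row_id"] for h in hits]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, name, price FROM parts WHERE id IN ({marks})", ids
    ).fetchall()
    by_id = {r[0]: {"name": r[1], "price": r[2]} for r in rows}
    return [{**h, **by_id.get(h["row_id"], {})} for h in hits]

# Hypothetical structured source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany("INSERT INTO parts VALUES (?, ?, ?)",
                 [(1, "gasket", 2.5), (2, "valve", 19.0)])

if __name__ == "__main__":
    hits = [{"doc_id": "d7", "row_id": 2}]
    print(render_hits(hits, conn))
    conn.execute("UPDATE parts SET price = 21.0 WHERE id = 2")
    print(render_hits(hits, conn))  # new price appears without re-indexing
```

The trade-off is one extra database query per results page in exchange for never re-indexing when a row changes.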


• Field Collapsing of Similar Documents
    Powerful way to reduce the number of results without losing information
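The effect of collapsing can be shown on a plain result list. This is a client-side illustration of the idea rather than Solr's own implementation: the `sig` field is assumed to hold a similarity signature computed at index time, and the `collapsed` count is added so that no information is lost.

```python
def collapse(results, field):
    """Keep the top-scoring result per field value and count the rest."""
    groups = {}
    for r in sorted(results, key=lambda r: r["score"], reverse=True):
        key = r[field]
        if key in groups:
            groups[key]["collapsed"] += 1
        else:
            groups[key] = {**r, "collapsed": 0}
    return list(groups.values())

if __name__ == "__main__":
    results = [
        {"id": "a", "sig": "x", "score": 1.0},
        {"id": "b", "sig": "x", "score": 2.0},
        {"id": "c", "sig": "y", "score": 1.5},
    ]
    for row in collapse(results, "sig"):
        print(row)
```

A results page can then show "1 similar document" next to result "b" and expand the group on demand.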
Dashboard
Search and Filter
Detailed Document View
Entity Search – Cross Language
Search/Filter/Explore
Summary




Text Analytics Enables Productive Search
For More Information

• Visit www.basistech.com

• Write to info@basistech.com

• Call 617-386-2090 or 800-697-2062
Thank You!



Steve Kearns
Director of Product Management
www.basistech.com

Search is Not Enough: Using Solr for Analytics

  • 1. Search is Not Enough Using Solr for Analytics Steve Kearns Director of Product Management www.basistech.com
  • 2. Agenda • Basis Technology • Search and Metadata • Text + Text Analytics = Metadata • Solr += Analytics  Configuration  Interface • Conclusion
  • 3. About Basis Technology • Global leader in computational linguistics as applied to search-based applications, information discovery, and identity resolution • Developer of the most capable, most mature, and most widely used platform for multilingual text analytics • Solutions for commercial enterprises expanding globally and for government agencies dealing with foreign intelligence • Offices: Boston, Washington, San Francisco, London, Tokyo
  • 6. Search and Metadata • Search alone  Helps find documents/records  May return many unnecessary results  Inefficient way to solve a specific problem • Search with Metadata  New ways to visualize, navigate and explore  Helps enable users to take action on documents/records  Provides context to aid decision making  New ways to connect disparate data sources  New knobs to tune relevance • Structured Data Sources  Link unstructured data against structured
  • 9. Metadata – Where does it come from? • Structured Information associated with documents  Author, publish date, part number, price, provenance, etc. • Manual Annotation • Text Analytics
  • 11. Text Analytics A set of automated analytical methods designed to add structure to unstructured content
  • 13. Categorization of Text Analytics Technology • Document-level Analytics  Language Identification  Summarization  Categorization • Sub-document Analysis  Lemmatization – Improved Search  Entity Extraction  Fact/Relationship Extraction  Topic Extraction  Sentiment • Cross-Document Analysis  Document Clustering (at index or query time)  Entity Search and Co-reference Resolution
  • 14. Text Analytics in Action: E-Discovery • Demo!
  • 15. Document Level Analysis: Language Identification • Sub-document Lang ID is possible La Grande-Bretagne Американская a de son côté jugé Après avoir La Grande-Bretagne a 「端末側で行単位に(あるい софтверная компания querencontréde l'accord les de son côté jugé que становится Le président は一画面分)編集しておいて l'accord de данный момент Luxembourg deOlusegun présidents nigérian 、 「端末側で行単位に(あるい В пользующимся спросом constituaitdes cinq salué quatre un Obasanjo a pays は一画面分)編集しておいて 送信キーによりまとめて送信 Luxembourg правительство США,私ごとになりますが、ちょうどこの у спецслужб США constituait un ころ大学院生でしたが、ACOS-6 экспертом в области véritable africains (Afrique du cette l'engagement 、 FNPがコンピュータと端末の する」という方式と、 обвиняющее véritable changement 用のある言語処理系の開発を請 лингвистики (в changement dans la Sud,du G8, déclarant Algérie, 間にあって、実際の端末との 送信キーによりまとめて送信 радикальную 「端末には知能はなく、一字 dans la stratégie мусульманскую け負って作っていました。ACOS- и частности, изучения stratégie Sénégal, "la condition que Nigeria) やりとりを制御するのです。そ する」という方式と、 一字すべてがその都度送ら agricole de l'Europe, "Аль 6はMulticsの概念に非常に近い группировку обработки информации membres du comité して、コンピュータとFNPの間 「端末には知能はなく、一字 majeure au れ処理される」 tandis queКаида" в терактах 2 ものを持っていました、あるいは l'Irlande y a на арабском языке) の通信は、 一字すべてがその都度送ら 持とうとしていました。 de pilotage du développement est vu un gage de stabilité года назад, после терактов 11 少量の転送には不向きで、大 れ処理される」 et et de sécurité pour свое また、ハードウェアも大変似てい активизирует сентября 2001 г. French 量の一括転送に向いていまし les agriculteurs. внимание к арабскому ました。シールをはがすと、 Le président nigérian языку и программам その下から別のアメリカの会社の его обработки. 
名前が出てくるマシンでテスト Olusegun Obasanjo a salué cette Japanese Грамматика языков したこともありました。1年間ほと du G8, 「端末側で行単位に(あるいは一 l'engagement 画面分)編集しておいて、 данной группы んど休みなしにマシンルームque "la déclarant Программное 送信キーによりまとめて送信する にこもっていて、ここでの議論とcondition majeure au обеспечение Basis Американская 疑問を自分のテーマとしても 」という方式と、 Программное обеспечение développement est Technology позволяет софтверная 扱ったことがあるのです。それで 「端末には知能はなく、一字一字 позволяет l'absence de conflit". осуществлять поиск Basis Technology компания момент В данный Bild vergrößern осуществлять 、よーくわかるのです。 すべてがその都度送られ処理さпоиск слов с La porte-parole de la словстановится с близкими правительство German Berlin (AP) Der Kanzler れる」 близкими значениями, а présidence française, значениями, а также США, обвиняющее пользующимся strahlte: «Ich gestehe, dass 29% という方式は、究極的に前者は также транслитерировать Catherine Colonna, a транслитерировать радикальную спросом у спецслужб ich 90 Prozent Zustimmung 半二重通信、後者は全二重通信 арабские и фарси-буквы в pour sa part qualifié la とフィットします。латинские. ПродуктFNPがコンピュータと端末の間に был réunion мусульманскую США экспертом в EVIAN (AP) - Les membres du French 後者では、入力のエコーもコンピ разработан по あって、実際の端末とのやりとり d'"exceptionnelle". группировку "Аль области G8 se sont engagés dimanche 33% ュータ側で制御されます。 を制御するのです。そして、コン Каида" в терактах 2 специальному заказу soir à soutenir la つまり、入力した字の表示はキーСША с правительства ピュータとFNPの間の通信は、 これはファンドマネージャー Japanese 入力がコンピュータに送られ、 それが送り返されて表示されま целью оптимизации 少量の転送には不向きで、大量 の一括転送に向いていました。 процесса анализа арабских Russian さんが嘘をついているという 21% す。 текстов. FNPによるコンピュータへの割り わけではありません。計算 込み要求は高価なものだったか ilHaaqa-n bikitaabinaa s- Arabic らです。Multicsでのプロセスの sirriyyi r-raqiimi fii yurjae wake upも高価だということもあり ittikhaadha maa yulzamu 17% ました。
  • 16. Document Level Analysis: Categorization • Group Documents into Pre-defined categories http://news.google.com/ http://www.bbc.co.uk/
  • 17. Sub-Document Analysis: Linguistics • Segmentation of Asian languages • Lemmatization Stemming N-Gram Morphological Lemmatization Segmentation
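The techniques named on this slide can be contrasted with a small sketch. Everything here is a toy assumption for illustration: the suffix list, the tiny lemma dictionary, and the function names are hypothetical, and real stemmers and lemmatizers are far more elaborate.

```python
def char_ngrams(token, n=2):
    """Overlapping character n-grams, a common fallback for unsegmented text."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def naive_stem(token, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token


# Toy lemma dictionary (assumption); a real lemmatizer uses morphology.
LEMMAS = {"mice": "mouse", "ran": "run", "running": "run"}


def lemmatize(token):
    """Map a word form to its dictionary headword when it is known."""
    return LEMMAS.get(token, token)
```

In this sketch `naive_stem("running")` yields the non-word `runn`, while `lemmatize("running")` yields `run`, which is the kind of difference that makes lemmatization attractive for search.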
  • 18. Sub-Document Analysis: Sentiment • Sentence, paragraph, entity, aspect, emotion http://twittersentiment.appspot.com/search?query=Lucene http://maps.google.com/maps/place?cid=7410753351872099397
  • 19. Sub-Document Analysis: Entity Extraction • Identify Named Concepts in Unstructured Text  Statistical, rules, lists http://www.twitscoop.com/
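Of the three approaches listed (statistical, rules, lists), the list-based one is the simplest to sketch. The example below is a minimal gazetteer lookup under stated assumptions: the entries, the type labels, and the function name `extract_entities` are hypothetical, and it does no disambiguation or word-boundary checking.

```python
import re

# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {
    "Basis Technology": "ORGANIZATION",
    "Boston": "LOCATION",
    "Tokyo": "LOCATION",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (name, type, start_offset) for each gazetteer name in text."""
    # Try longer names first so multi-word entries win over their substrings.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return [(m.group(0), gazetteer[m.group(0)], m.start())
            for m in pattern.finditer(text)]
```

A list lookup like this is precise for known names but finds nothing new, which is why it is usually combined with rules or statistical models.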
  • 20. Sub-Document: Fact / Rel. / Event Extraction • Identify Facts, Link Entities, Events and Times http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
  • 21. Cross-Document: Entity Co-reference Resolution • Map extracted entities to real-world Concepts
  • 22. Cross-Document Analysis: Clustering • Near Duplicate Detection • Unsupervised Clustering
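One common way to approach near-duplicate detection is to compare sets of word shingles with Jaccard similarity. The sketch below assumes that technique; the slides do not say which method is used, and the shingle size and any duplicate threshold are choices the reader would need to tune.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    # Texts shorter than k words yield a single shingle with all their words.
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(doc_a), shingles(doc_b))  # 6 shared of 8 total
```

Because every document must be compared with others, this kind of analysis needs a view across the whole collection, not just the document being indexed.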
  • 25. Text Analytics in/around Solr • Analyzer/Tokenizer/TokenFilter • UpdateRequestProcessor  Run Analysis in Solr  Call External Analysis Service • Pre-Processor to Solr
  • 26. Integration Point: Analyzer • Good for:  Linguistics  Segmentation of Asian Languages  Customized Segmentation • Limitations:  No access to document object • An Analyzer is:  CharFilter  Tokenizer  Set of TokenFilters
  • 27. Analyzer/Tokenizer Configuration • schema.xml FieldType • Analyzer (Index) – CharFilter – Tokenizer – TokenFilter • Analyzer (Query)
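The CharFilter, Tokenizer, TokenFilter ordering described on these two slides can be modeled conceptually as follows. This is plain Python and not Solr's API: the function names, the markup-stripping regex, and the stop-word list are all assumptions made for illustration.

```python
import re


def strip_markup(text):
    """Char filter: operates on raw characters before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    """Tokenizer: turns the character stream into tokens."""
    return text.split()


def lowercase(tokens):
    """Token filter: normalizes each token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Token filter: drops tokens that carry little meaning."""
    return [t for t in tokens if t not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Apply char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Separate chains for indexing and querying, mirroring the two analyzers.
index_terms = analyze("<b>The</b> Power of Solr", [strip_markup],
                      whitespace_tokenize, [lowercase, remove_stopwords])
query_terms = analyze("Solr POWER", [], whitespace_tokenize, [lowercase])
```

Note that each stage sees only text or tokens for one field, which matches the stated limitation that an analyzer has no access to the document object.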
  • 28. Integration Point: UpdateRequestProcessor • Runs Before Analyzers • Full Access to Document • Two options:  Run the analysis directly in Solr  Call out to external analysis services • Limitations:  Think through your indexing strategy
  • 29. Integration Point: UpdateRequestProcessor • Run the analysis directly in Solr  Good for lightweight, stateless document analytics  Not good for cross-document analytics • Call out to external analysis services  Web Services, UIMA, OpenPipeline, GATE, custom code  Note that these external calls are synchronous  Additional complexity / points of failure
  • 30. UpdateRequestProcessor Configuration • solrconfig.xml  RequestHandler • update.processor = UpdateRequestProcessorChain.name  UpdateRequestProcessorChain • Processors
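The two UpdateRequestProcessor options can be pictured as steps in a chain that each receive the whole document before it is analyzed and indexed. The sketch below is a conceptual model in plain Python, not Solr's Java API; the names `run_chain`, `add_word_count`, and `make_external_step` are hypothetical, and the external service is represented by any callable.

```python
def add_word_count(doc):
    """Lightweight, stateless analysis that runs in-process."""
    enriched = dict(doc)
    enriched["word_count"] = len(doc.get("body", "").split())
    return enriched


def make_external_step(service, field):
    """Wrap a call to an external analysis service (any callable here)."""
    def step(doc):
        enriched = dict(doc)
        try:
            enriched[field] = service(doc.get("body", ""))
        except Exception:
            # The call is synchronous, so a failure surfaces right here.
            # This sketch records it instead of aborting the update.
            enriched["analysis_failed"] = True
        return enriched
    return step


def run_chain(doc, processors):
    """Pass the document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc
```

How to treat a failed external call (skip the field, retry, or reject the document) is part of the indexing strategy the slides ask you to think through.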
  • 31. Integration Point: Pre-Processor • Index in Solr as Last Step of Analysis • Good For:  Finer-grained control  Managing dependencies between analytic components  Scalability • Limitations:  Complexity / New points of failure  Cannot use Solr’s content acquisition features
  • 32. Integration Summary • There are Many Options! • Document-Level Analysis:  Generally, safe to run in UpdateRequestProcessor • Sub-Document Analysis:  UpdateRequestProcessor or external • Cross-Document Analysis:  Run external • Multiple-Analysis Components:  Run external document processing pipeline
  • 33. Other Concerns • Re-indexing may be expensive, so when linking against structured data:  Index RowID if structured DB allows changes • Retrieve row details at page rendering time to enable faceting  Index content if DB is static • FieldCollapsing of Similar Documents  Powerful way to reduce the number of results without losing information
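The RowID suggestion can be sketched as follows: the index stores only a reference to the structured row, and the row's current values are fetched when results are rendered. This is a minimal illustration under assumptions; the dictionary stands in for a database table, and the field and function names are hypothetical.

```python
def index_document(doc_id, text, row_id):
    """Build the indexed record: it stores the row reference, not the row."""
    return {"id": doc_id, "text": text, "row_id": row_id}


def render_hit(hit, table):
    """Join a search hit with the current row values at rendering time."""
    row = table.get(hit["row_id"], {})
    return {"id": hit["id"], "text": hit["text"], **row}


# A dictionary standing in for a mutable database table (assumption).
parts = {42: {"part_number": "A-100", "price": 9.99}}
hit = index_document("d1", "Widget manual", row_id=42)

# The row changes after indexing; no re-index is needed to show the new price.
parts[42]["price"] = 12.50
rendered = render_hit(hit, parts)
```

If the database is static, copying the row values into the index at indexing time avoids the extra lookup.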
  • 37. Entity Search – Cross Language
  • 39. Summary Text Analytics Enables Productive Search
  • 40. For More Information • Visit www.basistech.com • Write to info@basistech.com • Call 617-386-2090 or 800-697-2062
  • 41. Thank You! Steve Kearns Director of Product Management www.basistech.com