Search is Not Enough: Using Solr for Analytics

Presented by Steve Kearns, Basis Technology. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Search is everywhere, and it is a crucial capability in any enterprise, application, or website. However, increasingly sophisticated users expect their search engine to bring them more than just document hits: they want the facts, answers, and context that connect the results with their workflow. In this talk, Steve Kearns will discuss and demonstrate how the combination of structured data, text analytics on unstructured data, and Solr can power advanced analytics applications at scale. This includes integrating text analytics components into Solr, adjusting the Solr schema, and making UI-level changes that support the integration of structured and unstructured data from several sources.

    Presentation Transcript

    • Search is Not Enough: Using Solr for Analytics
      Steve Kearns, Director of Product Management
      www.basistech.com
    • Agenda
      • Basis Technology
      • Search and Metadata
      • Text + Text Analytics = Metadata
      • Solr += Analytics
        • Configuration
        • Interface
      • Conclusion
    • About Basis Technology
      • Global leader in computational linguistics as applied to search-based applications, information discovery, and identity resolution
      • Developer of the most capable, most mature, and most widely used platform for multilingual text analytics
      • Solutions for commercial enterprises expanding globally and for government agencies dealing with foreign intelligence
      • Offices: Boston, Washington, San Francisco, London, Tokyo
    • Search
    • Metadata
    • Search and Metadata
      • Search alone
        • Helps find documents/records
        • May return many unnecessary results
        • Inefficient way to solve a specific problem
      • Search with Metadata
        • New ways to visualize, navigate and explore
        • Helps enable users to take action on documents/records
        • Provides context to aid decision making
        • New ways to connect disparate data sources
        • New knobs to tune relevance
      • Structured Data Sources
        • Link unstructured data against structured
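To make "new ways to navigate" and "new knobs" concrete, here is a minimal sketch of how metadata fields might be used at query time. It only builds a Solr select URL with the standard `q`, `fq`, `facet`, and `facet.field` parameters; the host, the path, and the field names (`language`, `person`, `organization`) are assumptions for illustration and do not come from the talk.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and field names; adjust to your own schema.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_faceted_query(text, filters=None, facet_fields=()):
    """Return a Solr select URL that searches `text`, narrows the results by
    metadata filters (fq), and asks for facet counts on metadata fields."""
    params = [("q", text), ("wt", "json")]
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", f) for f in facet_fields)
    return SOLR_SELECT + "?" + urlencode(params)

url = build_faceted_query(
    "merger agreement",
    filters={"language": "French"},
    facet_fields=["person", "organization"],
)
print(url)
```

The filter narrows the result set without affecting scoring, while the facet counts give the user something to click on besides the ranked list.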
    • Metadata in Action
    • Metadata in Action
    • Metadata – Where does it come from?
      • Structured Information associated with documents
        • Author, publish date, part number, price, provenance, etc.
      • Manual Annotation
      • Text Analytics
    • Text Analytics
    • Text Analytics: a set of automated analytical methods designed to add structure to unstructured content
    • Text Analytics techniques
    • Categorization of Text Analytics Technology
      • Document-level Analytics
        • Language Identification
        • Summarization
        • Categorization
      • Sub-document Analysis
        • Lemmatization – Improved Search
        • Entity Extraction
        • Fact/Relationship Extraction
        • Topic Extraction
        • Sentiment
      • Cross-Document Analysis
        • Document Clustering (at index or query time)
        • Entity Search and Co-reference Resolution
    • Text Analytics in Action: E-Discovery
      • Demo!
    • Document Level Analysis: Language Identification
      • Sub-document Lang ID is possible
      [Figure: overlapping sample documents in French, Japanese, Russian, German, and transliterated Arabic, annotated with detected-language labels and percentages]
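As a rough illustration of sub-document language identification, the sketch below reports what share of a text's letters belongs to each Unicode script. This is a deliberate simplification and an assumption on my part: script is only a crude proxy for language (it cannot separate French from German, or Japanese kanji from Chinese), and production language identifiers rely on statistical models instead.

```python
import unicodedata
from collections import Counter

# Assumption: the Unicode script of a character hints at its language.
# Labels are scripts or script families, not reliably languages.
SCRIPT_PREFIXES = {
    "CYRILLIC": "Cyrillic",
    "HIRAGANA": "Japanese",
    "KATAKANA": "Japanese",
    "CJK": "Japanese",      # Han characters also occur in Chinese text
    "ARABIC": "Arabic",
    "LATIN": "Latin",
}

def script_of(ch):
    """Return a script label for one character, or None for digits,
    punctuation, whitespace, and anything unrecognized."""
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES.items():
        if name.startswith(prefix):
            return label
    return None

def script_profile(text):
    """Percentage of recognized characters per script label."""
    counts = Counter(s for s in map(script_of, text) if s)
    total = sum(counts.values())
    if not total:
        return {}
    return {label: round(100 * n / total) for label, n in counts.items()}

print(script_profile("La Grande-Bretagne a jugé Американская компания"))
```

Running the same function on each paragraph or sentence, instead of the whole document, gives a per-region breakdown of mixed-language content.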
    • Document Level Analysis: Categorization
      • Group Documents into Pre-defined categories
      • http://news.google.com/
      • http://www.bbc.co.uk/
    • Sub-Document Analysis: Linguistics
      • Segmentation of Asian languages
      • Lemmatization
      [Diagram labels: Stemming, N-Gram, Morphological, Lemmatization, Segmentation]
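The techniques named on this slide can be contrasted with a small sketch. All three functions are toy stand-ins written for illustration: the suffix list and the `LEMMAS` dictionary are hypothetical, and a real system would use a proper stemmer and a morphological lexicon.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams, a common fallback for indexing
    languages written without spaces between words."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def naive_stem(word):
    """Toy suffix stripper; real stemmers apply many more rules."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a morphological lexicon.
LEMMAS = {"mice": "mouse", "ran": "run", "running": "run", "better": "good"}

def lemmatize(word):
    """Dictionary lookup that maps an inflected form to its base form."""
    return LEMMAS.get(word, word)

print(char_ngrams("東京都庁"))
print(naive_stem("running"), lemmatize("running"))
print(naive_stem("mice"), lemmatize("mice"))
```

The bigrams of 東京都庁 include 京都 (Kyoto), so an n-gram index would match a query for Kyoto against a document about Tokyo; morphological segmentation avoids that kind of false hit. Likewise, the stemmer leaves "mice" untouched and chops "running" into a non-word, while the dictionary-based lemmatizer returns real base forms.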
    • Sub-Document Analysis: Sentiment
      • Sentence, paragraph, entity, aspect, emotion
      • http://twittersentiment.appspot.com/search?query=Lucene
      • http://maps.google.com/maps/place?cid=7410753351872099397
    • Sub-Document Analysis: Entity Extraction
      • Identify Named Concepts in Unstructured Text
        • Statistical, rules, lists
      • http://www.twitscoop.com/
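Of the three approaches listed, lists and rules are easy to sketch. The gazetteer entries and the e-mail pattern below are hypothetical examples chosen for illustration; the statistical approach needs a trained model and is not shown.

```python
import re

# Hypothetical gazetteer ("lists") and pattern ("rules").
GAZETTEER = {
    "Basis Technology": "ORGANIZATION",
    "Boston": "LOCATION",
    "Tokyo": "LOCATION",
}
EMAIL_RULE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_entities(text):
    """Return (start offset, surface form, type) tuples sorted by offset."""
    found = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.group(), etype))
    for m in EMAIL_RULE.finditer(text):
        found.append((m.start(), m.group(), "EMAIL"))
    return sorted(found)

sample = "Write to info@basistech.com or visit Basis Technology in Boston."
for start, surface, etype in extract_entities(sample):
    print(start, surface, etype)
```

Each extracted type can then be stored in its own multi-valued field, which is what makes entity facets and entity filters possible later.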
    • Sub-Document: Fact / Relationship / Event Extraction
      • Identify Facts, Link Entities, Events and Times
      • http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
    • Cross-Document: Entity Co-reference Resolution
      • Map extracted entities to real-world Concepts
    • Cross-Document Analysis: Clustering
      • Near Duplicate Detection
      • Unsupervised Clustering
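One common way to approach near-duplicate detection is to compare sets of word shingles with Jaccard similarity. The sketch below is an assumption about how this could be done, not the method used in the talk; the threshold and shingle size are arbitrary, and the all-pairs comparison only suits small collections.

```python
def shingles(text, k=3):
    """Set of k-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(docs, threshold=0.5, k=3):
    """Return id pairs whose shingle sets overlap at or above threshold."""
    sigs = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sigs)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if jaccard(sigs[a], sigs[b]) >= threshold]

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy cat",
    "d3": "solr powers analytics at scale",
}
print(near_duplicates(docs))
```

Because every document is compared with every other, this is cross-document work that fits poorly inside a per-document indexing step.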
    • Text Analytics: Entity Search
    • Solr += Analytics
    • Text Analytics in/around Solr
      • Analyzer/Tokenizer/TokenFilter
      • UpdateRequestProcessor
        • Run Analysis in Solr
        • Call External Analysis Service
      • Pre-Processor to Solr
    • Integration Point: Analyzer
      • Good for:
        • Linguistics
        • Segmentation of Asian languages
        • Customized Segmentation
      • Limitations:
        • No access to document object
      • An Analyzer is:
        • CharFilter
        • Tokenizer
        • Set of TokenFilters
    • Analyzer/Tokenizer Configuration
      • schema.xml
        • FieldType
          • Analyzer (Index)
            • CharFilter
            • Tokenizer
            • TokenFilter
          • Analyzer (Query)
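The three analyzer stages can be mimicked in a few lines. This is a conceptual sketch only: it is not Solr or Lucene API, and the function names are made up for illustration. It shows the order of the stages (CharFilter, then Tokenizer, then TokenFilters) and why the limitation above holds: each stage sees only one field's text, never the whole document.

```python
import re

def html_strip_char_filter(text):
    """CharFilter stage: rewrites the raw character stream before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer stage: turns the character stream into tokens."""
    return text.split()

def lowercase_filter(tokens):
    """TokenFilter stage: normalizes case."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "of"})):
    """TokenFilter stage: drops very common words."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run one field value through the chain in the fixed stage order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

index_tokens = analyze("<b>The</b> Future of Search",
                       [html_strip_char_filter], whitespace_tokenizer,
                       [lowercase_filter, stop_filter])
print(index_tokens)
```

Index-time and query-time analyzers are configured separately, so the same `analyze` call could be made with a different list of token filters for queries.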
    • Integration Point: UpdateRequestProcessor
      • Runs Before Analyzers
      • Full Access to Document
      • Two options:
        • Run the analysis directly in Solr
        • Call out to external analysis services
      • Limitations:
        • Think through your indexing strategy
    • Integration Point: UpdateRequestProcessor
      • Run the analysis directly in Solr
        • Good for lightweight, stateless document analytics
        • Not good for cross-document analytics
      • Call out to external analysis services
        • Web Services, UIMA, OpenPipeline, GATE, custom code
        • Note that these external calls are synchronous
        • Additional complexity / points of failure
    • UpdateRequestProcessor Configuration
      • solrconfig.xml
        • RequestHandler
          • update.processor = UpdateRequestProcessorChain.name
        • UpdateRequestProcessorChain
          • Processors
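The chain idea can be sketched independently of Solr. In Solr itself the processors are Java classes configured in solrconfig.xml; the Python below is only a conceptual model, and every name in it (`language_processor`, `make_external_processor`, `fake_service`) is hypothetical. It shows the two options from the slides: one step that analyzes the document in-process and one that blocks on an outside service, both with access to the full document before it is indexed.

```python
def language_processor(doc):
    """Runs in-process: light, stateless, document-level analysis.
    Toy rule: any hiragana or katakana character means Japanese."""
    text = doc.get("text", "")
    has_kana = any("\u3040" <= c <= "\u30ff" for c in text)
    doc["language"] = "ja" if has_kana else "unknown"
    return doc

def make_external_processor(analysis_service):
    """Wrap a blocking call to an outside analysis service."""
    def processor(doc):
        # Synchronous: indexing waits here until the service answers.
        doc["entities"] = analysis_service(doc.get("text", ""))
        return doc
    return processor

def run_chain(doc, processors, index):
    """Pass the whole document through each processor, then index it."""
    for process in processors:
        doc = process(doc)
    index.append(doc)  # stand-in for the final step that writes to the index
    return doc

def fake_service(text):
    """Stand-in for a remote entity extractor."""
    return ["Tokyo"] if "Tokyo" in text else []

index = []
chain = [language_processor, make_external_processor(fake_service)]
print(run_chain({"id": "1", "text": "Tokyo office"}, chain, index))
```

If `fake_service` were slow or unavailable, every document in the chain would stall or fail with it, which is the extra point of failure the slides warn about.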
    • Integration Point: Pre-Processor
      • Index in Solr as Last Step of Analysis
      • Good For:
        • Finer-grained control
        • Managing dependencies between analytic components
        • Scalability
      • Limitations:
        • Complexity / New points of failure
        • Cannot use Solr’s content acquisition features
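A pre-processor does all analysis outside Solr and sends the enriched document as its last step. The sketch below assumes Solr's JSON update format (`{"add": {"doc": ...}}`) and an update URL that will differ by Solr version and core; the analysis steps are hypothetical placeholders. The request is built but not sent, so the example runs without a Solr instance.

```python
import json
from urllib.request import Request

# Assumed endpoint; the update path differs between Solr versions and cores.
SOLR_UPDATE = "http://localhost:8983/solr/update/json"

def enrich(doc, analyzers):
    """Apply each analysis step in order; later steps may read fields
    added by earlier ones."""
    for analyzer in analyzers:
        doc = {**doc, **analyzer(doc)}
    return doc

def build_update_request(doc):
    """Build (but do not send) the HTTP request that indexes one document."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(SOLR_UPDATE, data=body,
                   headers={"Content-Type": "application/json"})

# Hypothetical analysis steps illustrating a dependency between components.
def detect_language(doc):
    return {"language": "en"}

def count_tokens(doc):
    # Depends on detect_language having run first.
    return {"token_count": len(doc["text"].split()),
            "analyzed_as": doc["language"]}

doc = enrich({"id": "doc-1", "text": "Search is not enough"},
             [detect_language, count_tokens])
request = build_update_request(doc)
# urllib.request.urlopen(request) would send it to a running Solr instance.
print(request.get_method(), request.full_url)
```

Because the order of the steps is controlled by the caller, dependencies between components are explicit, at the cost of running and monitoring one more piece of infrastructure.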
    • Integration Summary
      • There are Many Options!
      • Document-Level Analysis:
        • Generally safe to run in UpdateRequestProcessor
      • Sub-Document Analysis:
        • UpdateRequestProcessor or external
      • Cross-Document Analysis:
        • Run external
      • Multiple Analysis Components:
        • Run external document processing pipeline
    • Other Concerns
      • Re-indexing may be expensive, so when linking against structured data:
        • Index the RowID if the structured DB allows changes
          • Retrieve row details at page rendering time to enable faceting
        • Index the content if the DB is static
      • Field collapsing of similar documents
        • Powerful way to reduce the number of results without losing information
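The RowID approach can be sketched with an in-memory SQLite database standing in for the structured source. The table, columns, and search hits are hypothetical; the point is that the index stores only the row id, and current row details are fetched when the page is rendered.

```python
import sqlite3

# In-memory stand-in for the structured database; schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE parts (row_id INTEGER PRIMARY KEY, name TEXT, price REAL)")
db.executemany("INSERT INTO parts VALUES (?, ?, ?)",
               [(1, "valve", 9.5), (2, "gasket", 1.25)])

# What the search index would return: only ids, never the mutable row data.
search_hits = [{"doc_id": "manual-7", "row_id": 2},
               {"doc_id": "memo-3", "row_id": 1}]

def render_rows(hits, conn):
    """Fetch current row details for all hits in one query, keeping hit order."""
    ids = [h["row_id"] for h in hits]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    rows = {r[0]: r for r in conn.execute(
        f"SELECT row_id, name, price FROM parts WHERE row_id IN ({marks})", ids)}
    return [{**h, "name": rows[h["row_id"]][1], "price": rows[h["row_id"]][2]}
            for h in hits if h["row_id"] in rows]

# A price change in the database shows up without re-indexing.
db.execute("UPDATE parts SET price = 1.5 WHERE row_id = 2")
page = render_rows(search_hits, db)
print(page)
```

The trade-off is one extra database round trip per results page in exchange for never having to re-index when a row changes.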
    • Dashboard
    • Search and Filter
    • Detailed Document View
    • Entity Search – Cross Language
    • Search/Filter/Explore
    • Summary
      • Text Analytics Enables Productive Search
    • For More Information
      • Visit www.basistech.com
      • Write to info@basistech.com
      • Call 617-386-2090 or 800-697-2062
    • Thank You!
      Steve Kearns, Director of Product Management
      www.basistech.com