Introduction to Apache Solr & Lucid Imagination
Grant Ingersoll
Thursday, 29 July 2010



Co-sponsored by / Sponsored by
Steve Odart, www.ixxus.com
    We consult and design.
    We architect and build.
    We support.
    And we realise the true value of your content...
    We deliver information solutions.
 © 2010                       Lucid Imagination, Inc.   2
Agenda
             Introductions
             About Lucid Imagination &
             Open Source Search
             LucidWorks for Solr
             Searching your domain with Solr
             Putting Solr into production
             Questions
    Slides are posted for download at the end of this presentation; full replay available within ~48 hours of live webcast

About me
           Grant Ingersoll
             Lucene/Solr committer
             Co-founder Apache Mahout project
             Co-author of upcoming “Taming Text”
             Chair, Apache Lucene PMC




About Lucid Imagination



  Build on and complement the open source technology & install base of Apache Lucene and Solr
  Deliver subscription-based value-add software, support and training to enhance & extend Lucene/Solr
  Center of excellence for Lucene/Solr app developers




Company Background


  Lucene Project Launched:  1997
  Solr Project Launched:    2006
  Company Launched:         Aug. 2007
  Financing:                Shasta Ventures, Granite Ventures, Walden International, In-Q-Tel
  Paying Customers:         100+ (and counting…)
  HQ:                       San Mateo, California, USA
  Partners:                 US, Europe, Japan, Latin America




Lucid Imagination Offerings




  Search Customers: Building Better, Faster, Less Costly Search Applications
    Consulting
    Subscriptions
    Training
    Certified Distributions
    Best Practices
    Health Checks




Lucene/Solr Success Stories with Lucid Imagination




Data Happens



           Data constantly growing faster, more diverse
             Mix of content, composition, and repositories: new terms, fields, range of data types grow in tandem with volume
           Diversity and location of data are an application development problem
             Search and discovery tools are the solution
             Scalability, performance and relevancy are key to user success
             Transparency, breadth and flexibility are key to development success

Lucene/Solr


Lucene: powerful, flexible search library
     Speed, accuracy, scalability, efficiency
     Cross-platform portability of indexes
     Java, ported to 7 other environments (PHP, C++, Python, etc.)
     Liberal Apache License
     One of Top 5 Apache Projects
     Top 10 Open Source Project

Solr: The Lucene Search Server
     REST-like interface
     Faceting
     Rich Document Handling
     Easy configuration
     Hit highlighting
     RDBMS integration
     Distributed scalability

Lucene, Solr and their logos are trademarks of the Apache Software Foundation


Lucene/Solr Open Source
Quality @ the tipping point


            Scalability
              823 billion documents searched by Lucene at MySpace.com
            Performance
              Real time: LinkedIn search covers 48 million members, adding one new member (with new content) per second
            Relevancy
              Open source APIs deliver better customization and the ability to fine-tune results
            Economics
              5-8x reduction in server footprint over commercial search
              No vendor lock-in lowers lifecycle costs


Creating Lasting Business Value


 Three key trends…
     Reduced risk from being locked into single-vendor relationships
     Shorter time to market, resulting from direct communication between innovators and users
     Better fit: access to code results in increased adaptability of process to systems
 …result in:
     CREATING COMPETITIVE ADVANTAGE: Focus on core process innovations unique to your business instead of operating and maintaining 3rd party software packages
Search 101
            Search tools are designed for dealing with fuzzy data
              Works well with structured and unstructured data
                  Performs well when dealing with large volumes of data
              Many apps don’t need the limits that databases place on content
                  Search fits well alongside a DB too

            Given a user’s information need (query), find and, optionally, score content relevant to that need
              Many different ways to solve this problem, each with tradeoffs
            What does “relevant” mean?


Two Foundation Concepts


Relevance
   Vector Space Model (VSM) for relevance
   Common across many search engines
   Apache Lucene is a highly optimized implementation of the VSM
Indexing
   Finds and maps terms and documents
   Conceptually similar to a book index
   At the heart of fast search/retrieve
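
The two concepts above can be illustrated with a toy sketch: an inverted index that maps each term to the documents containing it, and a cosine-similarity ranking over TF-IDF weighted vectors. This is not Lucene's actual scoring formula or data structure; the corpus, the whitespace tokenizer and the `log(N/df) + 1` weighting are assumptions chosen only to keep the example small.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; ids and text are made up for illustration.
DOCS = {
    1: "solr is a search server",
    2: "lucene is a search library",
    3: "a book index maps terms to pages",
}


def tokenize(text):
    # Simplistic analysis: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by cosine similarity of TF-IDF vectors."""

    def weight(term, tf):
        # Assumed weighting: tf * (log(N / df) + 1).
        idf = math.log(n_docs / len(index[term])) + 1.0
        return tf * idf

    # Squared length of every document vector.
    doc_norm = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            doc_norm[doc_id] += weight(term, tf) ** 2

    # Only the postings of the query terms are visited.
    query_tf = Counter(t for t in tokenize(query) if t in index)
    dot = defaultdict(float)
    query_norm = 0.0
    for term, qtf in query_tf.items():
        wq = weight(term, qtf)
        query_norm += wq ** 2
        for doc_id, tf in index[term].items():
            dot[doc_id] += wq * weight(term, tf)

    ranked = [
        (doc_id, score / math.sqrt(doc_norm[doc_id] * query_norm))
        for doc_id, score in dot.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Calling `search(build_index(DOCS), len(DOCS), "lucene library")` returns a list of `(doc_id, score)` pairs, best match first. Like looking a word up in a book index, only the documents listed under the query terms are ever scored.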




Solr Basics
            Content is modeled via Documents and Fields
              Content can be text, integers, floats, dates, custom
              Analysis can be employed to alter content before indexing
              Controlled via schema.xml
            Searches are supported through a wide range of Query options
              Keyword
              Terms
              Phrases
              Wildcards, other



Solr Basics
            Schema
              Define Fields, field metadata and Analysis
              <field name="name" type="text" indexed="true" stored="true"/>


            Solr Config
              Define low-level Lucene controls
              Specify how clients interact with Solr via Request Handlers (“mini servlets”)
              Configure highlighting, spell checking, admin, etc.

Getting Started
     1.     Install LucidWorks Certified Distribution
     2.     Model your domain
     3.     Index your content
     4.     Test
     5.     Deploy




LucidWorks Certified Distribution
            Free certified distribution
            Installer
              Simple
              Plugins and enhancements
              Updateable
              Complete Reference Guide
              Support for Linux, Windows, Mac
              UI and headless both available
            Get started at http://lucene.li/R

Master Your Domain with Solr


            Get to know
            your content


            Get to know
            your users



Modeling your Content

           Collection/Aggregate
            Examine collection level stats, like:
             MIME Types
             Number of Docs
             Update rates
             Languages present
             Much, much more
            Look for patterns and relationships
            Identify helpful resources
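
Collection-level stats such as document count and MIME type distribution can be gathered with a few lines of scripting before any schema work starts. The sketch below guesses MIME types from file extensions only, using Python's standard `mimetypes` module; the file names are hypothetical, and a real survey would more likely inspect file contents.

```python
import mimetypes
from collections import Counter


def collection_stats(paths):
    """Count documents and tally MIME types guessed from file extensions."""
    types = Counter()
    for path in paths:
        mime, _encoding = mimetypes.guess_type(path)
        types[mime or "unknown"] += 1
    return {"num_docs": len(paths), "mime_types": dict(types)}


# Hypothetical file names, for illustration only.
SAMPLE_PATHS = ["report.pdf", "index.html", "notes.txt", "about.html", "README"]
stats = collection_stats(SAMPLE_PATHS)
```

Here `stats["mime_types"]` maps each guessed type to its count, with extensionless files grouped under `"unknown"`.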
Modeling your Content
           Randomly sample a set of your documents


           Look for:
             Common structures like titles, tables, columns, etc.
             Important metadata
             Tokenization issues
               Try out in http://localhost:8983/solr/admin/analysis.jsp
             Importance Indicators
             May also look at paragraph, sentence, word and character issues
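
Drawing the random sample mentioned above might look like the following sketch. The function name, the sample size and the fixed seed are assumptions; the seed only makes the sample repeatable, so the same documents can be re-examined later.

```python
import random


def sample_documents(doc_ids, k, seed=0):
    """Pick up to k distinct document ids uniformly at random."""
    rng = random.Random(seed)  # fixed seed: the same sample on every run
    population = list(doc_ids)
    return rng.sample(population, min(k, len(population)))


# Hypothetical collection of 1000 document ids.
picked = sample_documents(range(1000), 25)
```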



Understanding your Users
    Sophisticated vs. Simple
    Speed and Relevance
    Search and Discovery
          Search
          Faceting
          Did you mean?
          Similar Pages (More Like This)
          Highlighting
    UI expectations



Build your Application

    Map your content into Documents and Fields via the Solr schema
    Set up your Solr access patterns in solrconfig.xml
    Index your content
    Search/Browse/Discover




Indexing
    Many Clients
          Java, PHP, Ruby, etc.
          See example/exampledocs
    Example:
    Upload CSV, Solr XML
    <add><doc>
      <field name="id">EN7800GTX/2DHTV/256M</field>
      <field name="manu">ASUS Computer Inc.</field>
      <field name="cat">electronics</field>
    </doc></add>
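
As a rough sketch of what a client does under the hood, the Solr XML message above can be generated and wrapped in an HTTP POST with the Python standard library alone. The helper names are hypothetical, and the `/update` path under the `http://localhost:8983/solr` base URL used elsewhere in these slides is an assumption about a default install. The request is only constructed here, not sent; added documents typically become searchable only after a commit.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize a dict of field name -> value as a Solr <add><doc> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body, base_url="http://localhost:8983/solr"):
    """Prepare (but do not send) a POST to the assumed /update handler."""
    return urllib.request.Request(
        base_url + "/update",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


xml_body = build_add_xml(
    {"id": "EN7800GTX/2DHTV/256M", "manu": "ASUS Computer Inc.", "cat": "electronics"}
)
request = build_update_request(xml_body)
# To actually index against a running Solr: urllib.request.urlopen(request)
```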
Search

   Clients also support search through API calls

   HTTP support by definition:
          http://localhost:8983/solr/select/?q=*:*&fl=score,id
          http://localhost:8983/solr/select/?q=name:iPod&fl=score,id
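
The two search URLs above can be assembled programmatically. This is a minimal sketch with a hypothetical helper name; it keeps `*`, `:` and `,` unescaped so the result matches the URLs on the slide, while other special characters (spaces, quotes) are still percent-encoded.

```python
from urllib.parse import urlencode


def select_url(q, fl="score,id", base_url="http://localhost:8983/solr"):
    """Build a Solr select URL with a query (q) and a field list (fl)."""
    params = urlencode({"q": q, "fl": fl}, safe="*:,")
    return f"{base_url}/select/?{params}"


match_all = select_url("*:*")
ipod = select_url("name:iPod")
```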

Getting to Production



            Some Issues to think about:
             Scaling
             Improving Findability




Scaling Solr




 Get the most out of each machine
    Typical Hardware (your mileage may vary):
       Modern multicore CPU, Fast disk (SSD?), 4-16 GB RAM

 High Query Volume
 Large Index
 Both

 See http://lucene.li/V
Improving Findability

Common Techniques
    Analysis:
          Lowercase, stemming, synonyms, stopwords, compound analysis (e.g. STR-AV220 -> STR AV 220)
    Faceting
    Spell Checking
    Editorial


    See http://lucene.li/U
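
To make the analysis techniques above concrete, here is a toy analysis chain in plain Python. It is not Solr's analyzer implementation: the stopword and synonym lists are hypothetical, stemming is left out, and compound splitting is approximated with a regular expression that breaks tokens at punctuation and at letter/digit boundaries.

```python
import re

# Hypothetical word lists, for illustration only.
STOPWORDS = {"a", "an", "the", "of"}
SYNONYMS = {"tv": "television"}


def split_compound(token):
    """Split at punctuation and letter/digit boundaries: STR-AV220 -> STR, AV, 220."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def analyze(text):
    """Whitespace tokenize, split compounds, lowercase, drop stopwords, map synonyms."""
    terms = []
    for raw in text.split():
        for part in split_compound(raw):
            term = part.lower()
            if term in STOPWORDS:
                continue
            terms.append(SYNONYMS.get(term, term))
    return terms
```

Running the same chain over both indexed content and queries is what lets a search for "str av 220" match a document that contains "STR-AV220".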
Improving Findability
            Phrase Queries and other Position-based Queries (SpanQuery)
            Disjunction Max Query (aka “DisMax”)
            Intent Analysis
            Invisible Queries
            Fake Queries
            Relevance Feedback and “More Like This”


            See http://lucene.li/S
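
The idea behind a Disjunction Max query can be shown in a few lines: when the same query is run against several fields, the document's score is the best per-field score, plus a tie-breaker fraction of the remaining field scores. The function below is a sketch of that combination rule only, with made-up per-field scores; it is not the Lucene implementation.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: the maximum plus tie * (sum of the others)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for one document in three fields (e.g. title, body, tags).
combined = dismax_score([0.2, 0.9, 0.1], tie=0.1)
```

With `tie=0.0` a document is scored purely by its best-matching field; raising `tie` toward 1.0 moves the result toward a plain sum across fields.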


Resources
    Websites
           http://www.lucidimagination.com
           http://search.lucidimagination.com
           http://lucene.apache.org/solr


    Solr Support
           http://www.lucidimagination.com/How-We-Can-Help
           solr-user@lucene.apache.org




Q&A
            Slides are posted for download at http://lucene.li/a; full replay available within ~48 hours of live webcast
© 2010    Lucid Imagination, Inc.

More Related Content

Viewers also liked

Adobe Photoshop
Adobe PhotoshopAdobe Photoshop
Adobe PhotoshopLaRue
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Pista American Idiot
Pista American IdiotPista American Idiot
Pista American Idiottanica
 
Integration of apache solr with crawlers
Integration of apache solr with crawlersIntegration of apache solr with crawlers
Integration of apache solr with crawlersLucidworks (Archived)
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucidworks (Archived)
 
Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Lucidworks (Archived)
 
across the universe
across the universeacross the universe
across the universetanica
 
Technology opportunities in hampton roads (kaszubowski ), nasa technology day...
Technology opportunities in hampton roads (kaszubowski ), nasa technology day...Technology opportunities in hampton roads (kaszubowski ), nasa technology day...
Technology opportunities in hampton roads (kaszubowski ), nasa technology day...Marty Kaszubowski
 
第4回「ブラウザー勉強会」オープニング トーク
第4回「ブラウザー勉強会」オープニング トーク第4回「ブラウザー勉強会」オープニング トーク
第4回「ブラウザー勉強会」オープニング トーク彰 村地
 
Pangaea providing access to geoscientific data using apache lucene java
Pangaea   providing access to geoscientific data using apache lucene javaPangaea   providing access to geoscientific data using apache lucene java
Pangaea providing access to geoscientific data using apache lucene javaLucidworks (Archived)
 
The mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketThe mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketPaul Williamson
 
Maroon5
Maroon5Maroon5
Maroon5tanica
 
Updated: You Have An Idea ... Do You Have A Business?
Updated: You Have An Idea ...  Do You Have A Business?Updated: You Have An Idea ...  Do You Have A Business?
Updated: You Have An Idea ... Do You Have A Business?Marty Kaszubowski
 
Hellosong
HellosongHellosong
Hellosongtanica
 
Updated: Marketing your Technology
Updated: Marketing your TechnologyUpdated: Marketing your Technology
Updated: Marketing your TechnologyMarty Kaszubowski
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationLucidworks (Archived)
 

Viewers also liked (20)

Adobe Photoshop
Adobe PhotoshopAdobe Photoshop
Adobe Photoshop
 
Learn How to Master Solr1 4
Learn How to Master Solr1 4Learn How to Master Solr1 4
Learn How to Master Solr1 4
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Pista American Idiot
Pista American IdiotPista American Idiot
Pista American Idiot
 
Integration of apache solr with crawlers
Integration of apache solr with crawlersIntegration of apache solr with crawlers
Integration of apache solr with crawlers
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 
Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"
 
across the universe
across the universeacross the universe
across the universe
 
Technology opportunities in hampton roads (kaszubowski ), nasa technology day...
Technology opportunities in hampton roads (kaszubowski ), nasa technology day...Technology opportunities in hampton roads (kaszubowski ), nasa technology day...
Technology opportunities in hampton roads (kaszubowski ), nasa technology day...
 
第4回「ブラウザー勉強会」オープニング トーク
第4回「ブラウザー勉強会」オープニング トーク第4回「ブラウザー勉強会」オープニング トーク
第4回「ブラウザー勉強会」オープニング トーク
 
Pangaea providing access to geoscientific data using apache lucene java
Pangaea   providing access to geoscientific data using apache lucene javaPangaea   providing access to geoscientific data using apache lucene java
Pangaea providing access to geoscientific data using apache lucene java
 
The mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketThe mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the market
 
Maroon5
Maroon5Maroon5
Maroon5
 
Crazy
CrazyCrazy
Crazy
 
Updated: You Have An Idea ... Do You Have A Business?
Updated: You Have An Idea ...  Do You Have A Business?Updated: You Have An Idea ...  Do You Have A Business?
Updated: You Have An Idea ... Do You Have A Business?
 
Hellosong
HellosongHellosong
Hellosong
 
Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4
 
Updated: Marketing your Technology
Updated: Marketing your TechnologyUpdated: Marketing your Technology
Updated: Marketing your Technology
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to Information
 

More from Lucidworks (Archived)

Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Lucidworks (Archived)
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and SolrLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchLucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCLucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKLucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarLucidworks (Archived)
 

More from Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 


Introduction to Apache Lucene/Solr

  • 1. Introduction to Apache Solr & Lucid Imagination. Grant Ingersoll. Thursday, 29 July 2010.
  • 2. Co-sponsored by… Steve Odart, www.ixxus.com. We consult and design. We architect and build. We support. And we realise the true value of your content... We deliver information solutions. © 2010 Lucid Imagination, Inc.
  • 3. Agenda: Introductions; About Lucid Imagination & Open Source Search; LucidWorks for Solr; Searching your domain with Solr; Putting Solr into production; Questions. Slides are posted for download at the end of this presentation; full replay available within ~48 hours of live webcast.
  • 4. About me: Grant Ingersoll. Lucene/Solr committer; Co-founder, Apache Mahout project; Co-author of upcoming “Taming Text”; Chair, Apache Lucene PMC.
  • 5. About Lucid Imagination: Build on and complement the open source technology & install base of Apache Lucene and Solr. Deliver subscription-based value-add software, support and training to enhance & extend Lucene/Solr. Center of excellence for Lucene/Solr app developers.
  • 6. Company Background. Lucene Project Launched: 1997. Solr Project Launched: 2006. Company Launched: Aug. 2007. Financing: Shasta Ventures, Granite Ventures, Walden International, In-Q-Tel. Paying Customers: 100+ (and counting…). HQ: San Mateo, California, USA. Partners: US, Europe, Japan, Latin America.
  • 7. Lucid Imagination Offerings: Consulting; Subscriptions; Training; Certified Search Distributions; Health Checks; Best Practices. Customers Building Better, Faster, Less Costly Search Applications.
  • 8. Lucene/Solr Success Stories with Lucid Imagination
  • 9. Data Happens. Data constantly growing faster, more diverse. Mix of content, composition, and repositories: new terms, fields, range of data types grow in tandem with volume. Diversity and location of data are an application development problem. Search and discovery tools are the solution. Scalability, performance and relevancy are key to user success. Transparency, breadth and flexibility are key to development success.
  • 10. © 2010 Lucid Imagination, Inc. 10
  • 11. Lucene/Solr. Lucene: powerful, flexible search library. Speed, accuracy, scalability, efficiency; Java, ported to 7 other environments (PHP, C++, Python, etc.); Liberal Apache License; Cross-platform portability of indexes; One of Top 5 Apache Projects; Top 10 Open Source Project. Solr: The Lucene Search Server. REST-like interface; Hit highlighting; Faceting; RDBMS integration; Rich Document Handling; Distributed scalability; Easy configuration. Lucene, Solr and their logos are trademarks of the Apache Software Foundation.
  • 12. Lucene/Solr Open Source Quality @ the tipping point. Scalability: 823 billion documents searched by Lucene at MySpace.com. Performance: Real time; LinkedIn search covers 48 million members, adding one new member (with new content) per second. Relevancy: Open source APIs deliver better customization and the ability to fine-tune results. Economics: 5-8x reduction in server footprint over commercial search; no vendor lock-in lowers lifecycle costs.
  • 13. Creating Lasting Business Value. Three key trends… Reduced risk: from being locked into single-vendor relationships. Shorter time to market: resulting from direct communication between innovators and users. Better fit: access to code results in increased adaptability of process to systems. …result in: CREATING COMPETITIVE ADVANTAGE: Focus on core process innovations unique to your business instead of operating and maintaining 3rd party software packages.
  • 14. Search 101. Search tools are designed for dealing with fuzzy data. Works well with structured and unstructured data. Performs well when dealing with large volumes of data. Many apps don’t need the limits that databases place on content. Search fits well alongside a DB too. Given a user’s information need (query), find and, optionally, score content relevant to that need. Many different ways to solve this problem, each with tradeoffs. What does “relevant” mean?
  • 15. Two Foundation Concepts. Relevance: Vector Space Model (VSM) for relevance; Common across many search engines; Apache Lucene is a highly optimized implementation of the VSM. Indexing: Finds and maps terms and documents; Conceptually similar to a book index; At the heart of fast search/retrieve.
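
The two concepts on this slide can be sketched in a few lines of Python. This is a toy illustration only, assuming whitespace tokenization and a simple TF-IDF weighting; the function names are hypothetical, and Lucene's actual index format and scoring are far more elaborate than this.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by the cosine of TF-IDF vectors (vector space model)."""
    def idf(term):
        # Rarer terms get a higher weight; the exact formula is an assumption.
        return math.log(1 + num_docs / len(index[term]))

    # Length of each document vector over all indexed terms.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf(term)) ** 2

    # Dot product between the query vector and each matching document vector.
    dots = defaultdict(float)
    for term, q_tf in Counter(tokenize(query)).items():
        if term not in index:
            continue
        weight = idf(term)
        for doc_id, tf in index[term].items():
            dots[doc_id] += (q_tf * weight) * (tf * weight)

    # The query norm is the same for every document, so it is omitted:
    # it would not change the ranking.
    scored = [(doc_id, dot / math.sqrt(norms[doc_id])) for doc_id, dot in dots.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "solr search server",
    "d2": "lucene search library",
    "d3": "apache mahout",
}
index = build_index(docs)
for doc_id, score in search(index, len(docs), "solr search"):
    print(doc_id, round(score, 3))
```

Only documents that share at least one term with the query are visited, which is what makes the inverted index fast: the lookup cost depends on the postings of the query terms, not on the size of the collection.
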
  • 16. Solr Basics. Content is modeled via Documents and Fields. Content can be text, integers, floats, dates, custom. Analysis can be employed to alter content before indexing. Controlled via schema.xml. Searches are supported through a wide range of Query options: Keyword Terms; Phrases; Wildcards, other.
  • 17. Solr Basics. Schema: Define Fields, field metadata and Analysis, e.g. <field name="name" type="text" indexed="true" stored="true"/>. Solr Config: Define low-level Lucene controls; Specify how clients interact with Solr via Request Handlers (“mini servlets”); Configure highlighting, spell checking, admin, etc.
  • 18. Getting Started: 1. Install LucidWorks Certified Distribution; 2. Model your domain; 3. Index your content; 4. Test; 5. Deploy.
  • 19. LucidWorks Certified Distribution. Free certified distribution; Installer; Simple; Plugins and enhancements; Updateable; Complete Reference Guide; Support for Linux, Windows, Mac; UI and headless both available. Get started at http://lucene.li/R
  • 20. Master Your Domain with Solr: Get to know your content. Get to know your users.
  • 21. Modeling your Content: Collection/Aggregate. Examine collection-level stats, like: MIME Types; Number of Docs; Update rates; Languages present; Much, much more. Look for patterns and relationships. Identify helpful resources.
  • 22. Modeling your Content. Randomly sample a set of your documents. Look for: Common structures like titles, tables, columns, etc.; Important metadata; Tokenization issues (try out in http://localhost:8983/solr/admin/analysis.jsp); Importance Indicators. May also look at paragraph, sentence, word and character issues.
  • 23. Understanding your Users. Sophisticated vs. Simple; Speed and Relevance; Search and Discovery: Search, Faceting, Did you mean?, Similar Pages (More Like This), Highlighting; UI expectations.
  • 24. Build your Application. Map your content into Documents and Fields via the Solr schema. Set up your Solr access patterns in solrconfig.xml. Index your content. Search/Browse/Discover.
  • 25. Indexing. Many Clients: Java, PHP, Ruby, etc. See example/exampledocs. Example: Upload CSV, Solr XML: <add><doc> <field name="id">EN7800GTX/2DHTV/256M</field> <field name="manu">ASUS Computer Inc.</field> <field name="cat">electronics</field> </doc></add>
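
As a rough sketch of how a client might produce the Solr XML shown on this slide, the following Python builds the same `<add><doc>` message from plain dictionaries using only the standard library. The helper name `to_solr_add_xml` is hypothetical; sending the payload to a running Solr instance (for example with an HTTP POST to its update handler) is left out and would depend on your installation.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of {field name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


payload = to_solr_add_xml([{
    "id": "EN7800GTX/2DHTV/256M",
    "manu": "ASUS Computer Inc.",
    "cat": "electronics",
}])
print(payload)
```

Building the message with an XML library rather than string concatenation means field values containing characters such as `&` or `<` are escaped correctly.
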
  • 26. Search. Clients also support search through API calls. HTTP support by definition: http://localhost:8983/solr/select/?q=*:*&fl=score,id and http://localhost:8983/solr/select/?q=name:iPod&fl=score,id
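
The query URLs on this slide can be assembled programmatically. The sketch below uses Python's standard `urllib.parse.urlencode`; the base URL and the `q` and `fl` parameters come from the slide, while the helper name `build_select_url` is an assumption made for illustration.

```python
from urllib.parse import urlencode

# Select handler URL as it appears on the slide (a local example install).
SOLR_SELECT = "http://localhost:8983/solr/select/"


def build_select_url(query, fields):
    """Build a select URL with a query (q) and a field list (fl)."""
    params = {"q": query, "fl": ",".join(fields)}
    # Keep '*', ':' and ',' readable; other reserved characters are encoded.
    return SOLR_SELECT + "?" + urlencode(params, safe="*:,")


print(build_select_url("*:*", ["score", "id"]))
print(build_select_url("name:iPod", ["score", "id"]))
```

Letting `urlencode` handle the escaping keeps user-entered queries with spaces or ampersands from corrupting the URL.
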
  • 27. Getting to Production. Some Issues to think about: Scaling; Improving Findability.
  • 28. Scaling Solr. Get the most out of each machine: http://lucene.li/V. Typical Hardware (your mileage may vary): Modern multicore CPU, Fast disk (SSD?), 4-16 GB RAM. High Query Volume; Large Index; Both.
  • 29. Improving Findability. Common Techniques. Analysis: Lowercase, stemming, synonyms, stopwords, compound analysis (e.g. STR-AV220 -> STR AV 220); Faceting; Spell Checking; Editorial. See http://lucene.li/U
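
To make the analysis techniques listed on this slide concrete, here is a minimal Python sketch of an analysis chain: compound splitting, lowercasing, stopword removal, synonym mapping and a crude stemmer. Everything in it (the stopword list, the synonym table, the plural-stripping rule) is an illustrative assumption; in Solr these steps are configured as tokenizers and filters in schema.xml, not written by hand like this.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "for"}  # illustrative list
SYNONYMS = {"tv": "television"}                      # illustrative mapping


def split_compound(token):
    """Split on punctuation and on letter/digit boundaries (ASCII only)."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def stem(token):
    """Crude plural stripping; a stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Run each whitespace-separated token through the chain, in order."""
    tokens = []
    for raw in text.split():
        for part in split_compound(raw):
            term = part.lower()
            if term in STOPWORDS:
                continue
            term = SYNONYMS.get(term, term)
            tokens.append(stem(term))
    return tokens


print(analyze("The STR-AV220 receivers"))
```

The same chain should be applied to both indexed content and incoming queries, so that a search for "str av 220" and a document containing "STR-AV220" end up with matching terms.
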
  • 30. Improving Findability. Phrase Queries and other Position-based Queries (SpanQuery); Disjunction Max Query (aka “DisMax”); Intent Analysis; Invisible Queries; Fake Queries; Relevance Feedback and “More Like This”. See http://lucene.li/S
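
The idea behind a Disjunction Max query can be sketched as follows: when the same query is run against several fields, the document's score is the best single-field score, plus a small "tie breaker" fraction of the remaining field scores. The Python below is a simplified illustration with made-up scores; the function and field names are hypothetical, and real per-field scores would come from the search engine.

```python
def dismax_score(field_scores, tie=0.0):
    """Best per-field score plus `tie` times the sum of the other scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    rest = sum(field_scores.values()) - best
    return best + tie * rest


# Hypothetical per-field scores for one document and one query.
scores = {"title": 2.0, "body": 1.0, "tags": 0.5}
print(dismax_score(scores))           # pure max: only the best field counts
print(dismax_score(scores, tie=0.5))  # other fields contribute half weight
```

With `tie=0.0` a document that matches strongly in one field beats one that matches weakly in many; raising `tie` moves the behaviour toward a plain sum across fields.
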
  • 31. Resources. Websites: http://www.lucidimagination.com; http://search.lucidimagination.com; http://lucene.apache.org/solr. Solr Support: http://www.lucidimagination.com/How-We-Can-Help; solr-user@lucene.apache.org
  • 32. Q&A. Slides are posted for download at http://lucene.li/a; full replay available within ~48 hours of live webcast.