SlideShare a Scribd company logo
1 of 47
Download to read offline
Introduction
         to
        Solr
  NFJS - Boston, September 2011
     Presented by Erik Hatcher
erik.hatcher@lucidimagination.com
         Lucid Imagination
 http://www.lucidimagination.com
About me...
• Co-author, "Lucene in Action" (and "Java
  Development with Ant" / "Ant in Action"
  once upon a time)
• "Apache guy" - Lucene/Solr committer;
  member of Lucene PMC, member of
  Apache Software Foundation
• Co-founder, evangelist, trainer, coder @
  Lucid Imagination
About Lucid Imagination...
•   Lucid Imagination provides commercial-grade
    support, training, high-level consulting and value-
    added software for Lucene and Solr.

•   We make Lucene ‘enterprise-ready’ by offering:

    •   Free, certified, distributions and downloads.

    •   Support, training, and consulting.

    •   LucidWorks Enterprise, a commercial search
        platform built on top of Solr.
Abstract
 Apache Solr serves search requests at the enterprises and
the largest companies around the world. Built on top of the
 top-notch Apache Lucene library, Solr makes indexing and
searching integration into your applications straightforward.
Solr provides faceted navigation, spell checking, highlighting,
  clustering, grouping, and other search features. Solr also
  scales query volume with replication and collection size
with distributed capabilities. Solr can index rich documents
       such as PDF, Word, HTML, and other file types.
What is Solr?
•   An open source search server

•   Indexes content sources, processes query requests, returns
    search results

•   Uses Lucene as the "engine", but adds full enterprise search
    server features and capabilities

•   A web-based application that processes HTTP requests and
    returns HTTP responses.

•   Initially started in 2004 and developed by CNET as an in-house
    project to add search capability for the company website.

•   Donated to ASF in 2006.
Who uses Solr?




  And many many many many more...!
Which Solr version?
•   There’s more than one answer!

•   The current, released, stable version is 3.3
    (soon to be 3.4)

•   The development release is referred to as “trunk”.

•   This is where the new, less tested work goes on

•   Also referred to as 4.0

•   LucidWorks Enterprise is built on a trunk snapshot
    + additional features.
What is Lucene?
•   An open source search library (not an application)

•   100% Java

•   Continuously improved and tuned for more than 10
    years

•   Compact, portable index representation

•   Programmable text analyzers, spell checking and
    highlighting

•   Not, itself, a crawler or a text extraction tool
Inverted Index
•   Lucene stores input data in what is known as an
    inverted index

•   In an inverted index each indexed term points to a
    list of documents that contain the term

•   Similar to the index provided at the end of a book

•   In this case "inverted" simply means the list of terms
    point to documents

•   It is much faster to find a term in an index, than to
    scan all the documents
Inverted Index Example
Ingestion
• API / Solr XML, JSON, and javabin/SolrJ
• CSV
• Relational databases
• File system
• Web crawl (using Nutch, or others)
• Others - XML feeds (e.g. RSS/Atom), e-mail
Solr indexing options
Solr XML
POST to /update
<add>
  <doc>
    <field name="id">rawxml1</field>
    <field name="content_type">text/xml</field>
    <field name="category">index example</field>
    <field name="title">Simple Example</field>
    <field name="filename">addExample.xml</field>
    <field name="text">A very simple example of
         adding a document to the index.</field>
  </doc>
</add>
Solr JSON

POST to /update/json
[
  {"id" : "TestDoc1", "title" : "test1"},
  {"id" : "TestDoc2", "title" : "another test"}
]
CSV indexing
•   http://localhost:8983/solr/update/csv

•   Files can be sent over HTTP:

    •   curl http://localhost:8983/solr/update/
        csv --data-binary @data.csv -H 'Content-
        type:text/plain; charset=utf-8’

•   or streamed from the file system:

    •   curl http://localhost:8983/solr/update/
        csv?stream.file=exampledocs/
        data.csv&stream.contentType=text/
        plain;charset=utf-8
Rich documents
•   Solr uses Tika for extraction. Tika is a toolkit for detecting and
    extracting metadata and structured text content from various
    document formats using existing parser libraries.

•   Tika identifies MIME types and then uses the appropriate parser
    to extract text.

•   The ExtractingRequestHandler uses Tika to identify types and
    extract text, and then indexes the extracted text.

•   The ExtractingRequestHandler is sometimes called "Solr Cell",
    which stands for Content Extraction Library.

•   File formats include MS Office, Adobe PDF, XML, HTML, MPEG
    and many more.
Solr Cell parameters
•   The literal parameter is very important.

    •   A way to add other fields not indexed using Tika to documents.

    •   &literal.id=12345

    •   &literal.category=sports

•   Using curl to index a file on the file system:
    •   curl 'http://localhost:8983/solr/update/extract?
        literal.id=doc1&commit=true' -F
        myfile=@tutorial.html

•   Streaming a file from the file system:
    •   curl "http://localhost:8983/solr/update/extract?
        stream.file=/some/path/
        news.doc&stream.contentType=application/
        msword&literal.id=12345"
Streaming remote docs

• Streaming a file from a URL:
 • curl  http://localhost:8983/solr/
    update/extract?
    literal.id=123&stream.url=http://
    www.solr.com/content/file.pdf -H
    'Content-type:application/pdf’
DataImportHandler
•   An "in-process" module that can be used to index data directly
    from relational databases and other data sources

•   Configuration driven

•   A tool that can aggregate data from multiple database tables, or
    even multiple data sources to be indexed as a single Solr
    document

•   Provides powerful and customizable data transformation tools

•   Can do full import or delta import

•   Pluggable to allow indexing of any type of data source
DIH Examples

• Rich documents
• Relational database
• E-mail
Other commands
• <commit/> and <optimize/>
• <delete>...</delete>
 • <id>Q-36</id>
 • <query>category:electronics</query>
• To update a document, simply add a
  document with same unique key
Configuring Solr
• schema.xml
 • defines field types, fields, and unique key
• solrconfig.xml
 • Lucene settings
 • request handler, component, and plugin
    definitions and customizations
Searching Basics
•   http://localhost:8983/solr/select?q=*:*

    •   q - main query

    •   rows - maximum number of "hits" to return

    •   start - zero-based hit starting point

    •   fl - comma-separated field list

        •   * for all stored fields, score for computed
            Lucene score
Other Common Search
     Parameters
• sort - specify sort criteria either by field(s)
  or function(s) in ascending or descending
  order
• fq - filter queries, multiple values supported
• wt - writer type - format of Solr response
• debugQuery - adds debugging info to
  response
Filtering results
• Use fq to filter results in addition to main
  query constraints
• fq results are independently cached in Solr's
  filterCache
• filter queries do not contribute to ranking
  scores
• Commonly used for filtering on facets
Typical Solr Request

• http://localhost:8983/solr/select
  ?q=ipod
  &facet=on
  &facet.field=cat
  &fq=cat:electronics
Features
•   Faceting              •   Distributed search

•   Highlighting          •   Replication

•   Spellchecking         •   Suggest

•   More-like-this        •   Geospatial support

•   Clustering            •   UIMA integration

•   Grouping              •   Extensible
Integration
• It's just HTTP
 • and CSV, JSON, XML, etc on the requests
    and responses
• Any language or environment can work
  with Solr easily
• Many libraries/layers exist on top
Ruby indexing example
SolrJ searching example

SolrServer solrServer = new CommonsHttpSolrServer(
   "http://localhost:8983/solr");
SolrQuery query = new SolrQuery();
query.setQuery(userQuery);
query.setFacet(true);
query.setFacetMinCount(1);
query.addFacetField("category");

QueryResponse queryResponse = solrServer.query(query);
Devilish Details

• analysis: tokenization and token filtering
• query parsing
• relevancy tuning
• performance and scalability
SolrMeter




http://code.google.com/p/solrmeter/
e.g. data.gov
Data.gov CSV catalog
URL,Title,Agency,Subagency,Category,Date Released,Date Updated,Time
Period,Frequency,Description,Data.gov Data Category Type,Specialized Data Category
Designation,Keywords,Citation,Agency Program Page,Agency Data Series Page,Unit of
Analysis,Granularity,Geographic Coverage,Collection Mode,Data Collection
Instrument,Data Dictionary/Variable List,Applicable Agency Information Quality
Guideline Designation,Data Quality Certification,Privacy and Confidentiality,Technical
Documentation,Additional Metadata,FGDC Compliance (Geospatial Only),Statistical
Methodology,Sampling,Estimation,Weighting,Disclosure Avoidance,Questionnaire
Design,Series Breaks,Non-response Adjustment,Seasonal Adjustment,Statistical
Characteristics,Feeds Access Point,Feeds File Size,XML Access Point,XML File Size,CSV/
TXT Access Point,CSV/TXT File Size,XLS Access Point,XLS File Size,KML/KMZ Access
Point,KML File Size,ESRI Access Point,ESRI File Size,Map Access Point,Data Extraction
Access Point,Widget Access Point
"http://www.data.gov/details/4","Next Generation Radar (NEXRAD) Locations","Department of Commerce","National Oceanic
and Atmospheric Administration","Geography and Environment","1991","Irregular as needed","1991 to present","Between 4
and 10 minutes","This geospatial rendering of weather radar sites gives access to an historical archive of Terminal
Doppler Weather Radar data and is used primarily for research purposes. The archived data includes base data and
derived products of the National Weather Service (NWS) Weather Surveillance Radar 88 Doppler (WSR-88D) next generation
(NEXRAD) weather radar. Weather radar detects the three meteorological base data quantities: reflectivity, mean radial
velocity, and spectrum width. From these quantities, computer processing generates numerous meteorological analysis
products for forecasts, archiving and dissemination. There are 159 operational NEXRAD radar systems deployed
throughout the United States and at selected overseas locations. At the Radar Operations Center (ROC) in Norman OK,
personnel from the NWS, Air Force, Navy, and FAA use this distributed weather radar system to collect the data needed
to warn of impending severe weather and possible flash floods; support air traffic safety and assist in the management
of air traffic flow control; facilitate resource protection at military bases; and optimize the management of water,
agriculture, forest, and snow removal. This data set is jointly owned by the National Oceanic and Atmospheric
Administration, Federal Aviation Administration, and Department of Defense.","Raw Data Catalog",...
Debugging
http://localhost:8983/solr/data.gov?q=searching&debugQuery=true
Custom pages


• Document detail page
• Multiple query intersection comparison
  with Venn visualization
Document detail
http://localhost:8983/solr/data.gov/document
?id=http%3A%2F%2Fwww.data.gov%2Fdetails%2F61
Query intersection

• Just showing off.... how easy it is to do
  something with a bit of visual impact
• Compare three independent queries,
  intersecting them in a Venn diagram
  visualization
What now?
• Download Solr
• "install" it (unzip it)
• Start Solr: java -jar start.jar
• Ingest your data
• Iterate on schema & config
• Ship It!
UI / prototyping


• Solritas - aka VelocityResponseWriter
• Blacklight - projectblacklight.org
Blacklight @ UVa
Blacklight @ Stanford
For more information...
•   http://www.lucidimagination.com

•   LucidFind

    •   search Lucene ecosystem: mailing lists, wikis, JIRA, etc

    •   http://search.lucidimagination.com

•   Getting started with LucidWorks Enterprise:

    •   http://www.lucidimagination.com/products/
        lucidworks-search-platform/enterprise

•   http://lucene.apache.org/solr - wiki, e-mail lists
LucidFind




http://www.lucidimagination.com/search/?q=user+interface
Thank You!

More Related Content

What's hot

Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchhypto
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchIsmaeel Enjreny
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic IntroductionMayur Rathod
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search medcl
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineLeveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineTrey Grainger
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Kafka Connect - debezium
Kafka Connect - debeziumKafka Connect - debezium
Kafka Connect - debeziumKasun Don
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to KibanaVineet .
 
KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...
KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...
KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...confluent
 
Introduction à ElasticSearch
Introduction à ElasticSearchIntroduction à ElasticSearch
Introduction à ElasticSearchFadel Chafai
 
ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010Elasticsearch
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack IntroductionVikram Shinde
 

What's hot (20)

Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic Introduction
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineLeveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Kafka Connect - debezium
Kafka Connect - debeziumKafka Connect - debezium
Kafka Connect - debezium
 
Introduction to Kibana
Introduction to KibanaIntroduction to Kibana
Introduction to Kibana
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...
KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...
KSQL and Security: The Current State of Affairs (Victoria Xia, Confluent) Kaf...
 
Introduction à ElasticSearch
Introduction à ElasticSearchIntroduction à ElasticSearch
Introduction à ElasticSearch
 
ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 

Similar to Introduction to Solr

Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr WorkshopJSGB
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered LuceneErik Hatcher
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrTrey Grainger
 

Similar to Introduction to Solr (20)

Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Apache solr
Apache solrApache solr
Apache solr
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered Lucene
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
SOLR
SOLRSOLR
SOLR
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
Solr 101
Solr 101Solr 101
Solr 101
 

More from Erik Hatcher

Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Erik Hatcher
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksErik Hatcher
 
Solr Powered Libraries
Solr Powered LibrariesSolr Powered Libraries
Solr Powered LibrariesErik Hatcher
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query ParsingErik Hatcher
 
"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - ChicagoErik Hatcher
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and TricksErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0Erik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)Erik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conferenceErik Hatcher
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Erik Hatcher
 

More from Erik Hatcher (20)

Ted Talk
Ted TalkTed Talk
Ted Talk
 
Solr Payloads
Solr PayloadsSolr Payloads
Solr Payloads
 
it's just search
it's just searchit's just search
it's just search
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis Tricks
 
Solr Powered Libraries
Solr Powered LibrariesSolr Powered Libraries
Solr Powered Libraries
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query Parsing
 
"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and Tricks
 
Solr 4
Solr 4Solr 4
Solr 4
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Solr Flair
Solr FlairSolr Flair
Solr Flair
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conference
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
 

Recently uploaded

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Introduction to Solr

  • 1. Introduction to Solr NFJS - Boston, September 2011 Presented by Erik Hatcher erik.hatcher@lucidimagination.com Lucid Imagination http://www.lucidimagination.com
  • 2. About me... • Co-author, "Lucene in Action" (and "Java Development with Ant" / "Ant in Action" once upon a time) • "Apache guy" - Lucene/Solr committer; member of Lucene PMC, member of Apache Software Foundation • Co-founder, evangelist, trainer, coder @ Lucid Imagination
  • 3. About Lucid Imagination... • Lucid Imagination provides commercial-grade support, training, high-level consulting and value- added software for Lucene and Solr. • We make Lucene ‘enterprise-ready’ by offering: • Free, certified, distributions and downloads. • Support, training, and consulting. • LucidWorks Enterprise, a commercial search platform built on top of Solr.
  • 4. Abstract Apache Solr serves search requests at the enterprises and the largest companies around the world. Built on top of the top-notch Apache Lucene library, Solr makes indexing and searching integration into your applications straightforward. Solr provides faceted navigation, spell checking, highlighting, clustering, grouping, and other search features. Solr also scales query volume with replication and collection size with distributed capabilities. Solr can index rich documents such as PDF, Word, HTML, and other file types.
  • 5. What is Solr? • An open source search server • Indexes content sources, processes query requests, returns search results • Uses Lucene as the "engine", but adds full enterprise search server features and capabilities • A web-based application that processes HTTP requests and returns HTTP responses. • Initially started in 2004 and developed by CNET as an in-house project to add search capability for the company website. • Donated to ASF in 2006.
  • 6. Who uses Solr? And many many many many more...!
  • 7. Which Solr version? • There’s more than one answer! • The current, released, stable version is 3.3 (soon to be 3.4) • The development release is referred to as “trunk”. • This is where the new, less tested work goes on • Also referred to as 4.0 • LucidWorks Enterprise is built on a trunk snapshot + additional features.
  • 8. What is Lucene? • An open source search library (not an application) • 100% Java • Continuously improved and tuned for more than 10 years • Compact, portable index representation • Programmable text analyzers, spell checking and highlighting • Not, itself, a crawler or a text extraction tool
  • 9. Inverted Index • Lucene stores input data in what is known as an inverted index • In an inverted index each indexed term points to a list of documents that contain the term • Similar to the index provided at the end of a book • In this case "inverted" simply means the list of terms point to documents • It is much faster to find a term in an index, than to scan all the documents
  • 11. Ingestion • API / Solr XML, JSON, and javabin/SolrJ • CSV • Relational databases • File system • Web crawl (using Nutch, or others) • Others - XML feeds (e.g. RSS/Atom), e-mail
  • 13. Solr XML POST to /update <add> <doc> <field name="id">rawxml1</field> <field name="content_type">text/xml</field> <field name="category">index example</field> <field name="title">Simple Example</field> <field name="filename">addExample.xml</field> <field name="text">A very simple example of adding a document to the index.</field> </doc> </add>
  • 14. Solr JSON POST to /update/json [ {"id" : "TestDoc1", "title" : "test1"}, {"id" : "TestDoc2", "title" : "another test"} ]
  • 15. CSV indexing • http://localhost:8983/solr/update/csv • Files can be sent over HTTP: • curl http://localhost:8983/solr/update/ csv --data-binary @data.csv -H 'Content- type:text/plain; charset=utf-8’ • or streamed from the file system: • curl http://localhost:8983/solr/update/ csv?stream.file=exampledocs/ data.csv&stream.contentType=text/ plain;charset=utf-8
  • 16. Rich documents • Solr uses Tika for extraction. Tika is a toolkit for detecting and extracting metadata and structured text content from various document formats using existing parser libraries. • Tika identifies MIME types and then uses the appropriate parser to extract text. • The ExtractingRequestHandler uses Tika to identify types and extract text, and then indexes the extracted text. • The ExtractingRequestHandler is sometimes called "Solr Cell", which stands for Content Extraction Library. • File formats include MS Office, Adobe PDF, XML, HTML, MPEG and many more.
  • 17. Solr Cell parameters • The literal parameter is very important. • A way to add other fields not indexed using Tika to documents. • &literal.id=12345 • &literal.category=sports • Using curl to index a file on the file system: • curl 'http://localhost:8983/solr/update/extract? literal.id=doc1&commit=true' -F myfile=@tutorial.html • Streaming a file from the file system: • curl "http://localhost:8983/solr/update/extract? stream.file=/some/path/ news.doc&stream.contentType=application/ msword&literal.id=12345"
  • 18. Streaming remote docs • Streaming a file from a URL: • curl http://localhost:8983/solr/ update/extract? literal.id=123&stream.url=http:// www.solr.com/content/file.pdf -H 'Content-type:application/pdf’
  • 19. DataImportHandler • An "in-process" module that can be used to index data directly from relational databases and other data sources • Configuration driven • A tool that can aggregate data from multiple database tables, or even multiple data sources to be indexed as a single Solr document • Provides powerful and customizable data transformation tools • Can do full import or delta import • Pluggable to allow indexing of any type of data source
  • 20. DIH Examples • Rich documents • Relational database • E-mail
  • 21. Other commands • <commit/> and <optimize/> • <delete>...</delete> • <id>Q-36</id> • <query>category:electronics</query> • To update a document, simply add a document with same unique key
  • 22. Configuring Solr • schema.xml • defines field types, fields, and unique key • solrconfig.xml • Lucene settings • request handler, component, and plugin definitions and customizations
  • 23. Searching Basics • http://localhost:8983/solr/select?q=*:* • q - main query • rows - maximum number of "hits" to return • start - zero-based hit starting point • fl - comma-separated field list • * for all stored fields, score for computed Lucene score
  • 24. Other Common Search Parameters • sort - specify sort criteria either by field(s) or function(s) in ascending or descending order • fq - filter queries, multiple values supported • wt - writer type - format of Solr response • debugQuery - adds debugging info to response
  • 25. Filtering results • Use fq to filter results in addition to main query constraints • fq results are independently cached in Solr's filterCache • filter queries do not contribute to ranking scores • Commonly used for filtering on facets
  • 26. Typical Solr Request • http://localhost:8983/solr/select ?q=ipod &facet=on &facet.field=cat &fq=cat:electronics
  • 27. Features • Faceting • Distributed search • Highlighting • Replication • Spellchecking • Suggest • More-like-this • Geospatial support • Clustering • UIMA integration • Grouping • Extensible
  • 28. Integration • It's just HTTP • and CSV, JSON, XML, etc on the requests and responses • Any language or environment can work with Solr easily • Many libraries/layers exist on top
  • 30. SolrJ searching example SolrServer solrServer = new CommonsHttpSolrServer( "http://localhost:8983/solr"); SolrQuery query = new SolrQuery(); query.setQuery(userQuery); query.setFacet(true); query.setFacetMinCount(1); query.addFacetField("category"); QueryResponse queryResponse = solrServer.query(query);
  • 31. Devilish Details • analysis: tokenization and token filtering • query parsing • relevancy tuning • performance and scalability
  • 34. Data.gov CSV catalog URL,Title,Agency,Subagency,Category,Date Released,Date Updated,Time Period,Frequency,Description,Data.gov Data Category Type,Specialized Data Category Designation,Keywords,Citation,Agency Program Page,Agency Data Series Page,Unit of Analysis,Granularity,Geographic Coverage,Collection Mode,Data Collection Instrument,Data Dictionary/Variable List,Applicable Agency Information Quality Guideline Designation,Data Quality Certification,Privacy and Confidentiality,Technical Documentation,Additional Metadata,FGDC Compliance (Geospatial Only),Statistical Methodology,Sampling,Estimation,Weighting,Disclosure Avoidance,Questionnaire Design,Series Breaks,Non-response Adjustment,Seasonal Adjustment,Statistical Characteristics,Feeds Access Point,Feeds File Size,XML Access Point,XML File Size,CSV/ TXT Access Point,CSV/TXT File Size,XLS Access Point,XLS File Size,KML/KMZ Access Point,KML File Size,ESRI Access Point,ESRI File Size,Map Access Point,Data Extraction Access Point,Widget Access Point "http://www.data.gov/details/4","Next Generation Radar (NEXRAD) Locations","Department of Commerce","National Oceanic and Atmospheric Administration","Geography and Environment","1991","Irregular as needed","1991 to present","Between 4 and 10 minutes","This geospatial rendering of weather radar sites gives access to an historical archive of Terminal Doppler Weather Radar data and is used primarily for research purposes. The archived data includes base data and derived products of the National Weather Service (NWS) Weather Surveillance Radar 88 Doppler (WSR-88D) next generation (NEXRAD) weather radar. Weather radar detects the three meteorological base data quantities: reflectivity, mean radial velocity, and spectrum width. From these quantities, computer processing generates numerous meteorological analysis products for forecasts, archiving and dissemination. There are 159 operational NEXRAD radar systems deployed throughout the United States and at selected overseas locations. At the Radar Operations Center (ROC) in Norman OK, personnel from the NWS, Air Force, Navy, and FAA use this distributed weather radar system to collect the data needed to warn of impending severe weather and possible flash floods; support air traffic safety and assist in the management of air traffic flow control; facilitate resource protection at military bases; and optimize the management of water, agriculture, forest, and snow removal. This data set is jointly owned by the National Oceanic and Atmospheric Administration, Federal Aviation Administration, and Department of Defense.","Raw Data Catalog",...
  • 35.
  • 37. Custom pages • Document detail page • Multiple query intersection comparison with Venn visualization
  • 39. Query intersection • Just showing off.... how easy it is to do something with a bit of visual impact • Compare three independent queries, intersecting them in a Venn diagram visualization
  • 40.
  • 41. What now? • Download Solr • "install" it (unzip it) • Start Solr: java -jar start.jar • Ingest your data • Iterate on schema & config • Ship It!
  • 42. UI / prototyping • Solritas - aka VelocityResponseWriter • Blacklight - projectblacklight.org
  • 45. For more information... • http://www.lucidimagination.com • LucidFind • search Lucene ecosystem: mailing lists, wikis, JIRA, etc • http://search.lucidimagination.com • Getting started with LucidWorks Enterprise: • http://www.lucidimagination.com/products/ lucidworks-search-platform/enterprise • http://lucene.apache.org/solr - wiki, e-mail lists