www.edureka.co/apache-solr
New-Age Search through Apache Solr
View Apache Solr course details at www.edureka.co/apache-solr
For Queries:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
For more details please contact us:
US : 1800 275 9730 (toll free)
INDIA : +91 88808 62004
Email Us : sales@edureka.co
Slide 2
How it Works?
LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz
Project Work
Verifiable Certificate
Slide 3
Objectives
At the end of this module, you will be able to understand:
The need for a search engine in enterprise-grade applications
The objectives and challenges of a search engine
How indexing and searching are handled in Lucene
Solr and its architecture
Near real-time search with Solr
Leveraging Solr capabilities with Hadoop
Solr with YARN
Job opportunities for Solr developers
Slide 4
Why Do I Need Search Engines?
Slide 5
Search Engine: Why do I need them?
1. Text Based Search
2. Filter
3. Documents
Slide 6
Search Engine – What should it be?
If you need a storage engine to search records or documents using text-based keywords, it should support the following features:
1. Should be optimized for fast text searches
2. Should have a flexible schema
3. Should support sorting of documents
4. Web scale: should be optimized for reads
5. Should be document oriented
Slide 7
Cleartrip Spatial Search
Slide 8
What is Lucene?
• Lucene is a powerful Java search library that lets you easily add search or Information Retrieval (IR) to applications
• Used by LinkedIn, Twitter, … and many more (see http://wiki.apache.org/lucene-java/PoweredBy )
• Scalable, high-performance indexing
• Powerful, accurate and efficient search algorithms
• Cross-platform solution
» Open source and 100% pure Java
» Index-compatible implementations are available in other programming languages
Creator: Doug Cutting
Slide 9
Indexing – How does it work?
Document - 1 (“D1”): I like edureka courses
Document - 2 (“D2”): Edureka teaches big data courses
Document - 3 (“D3”): Edureka helps learn new technologies easily
“edureka” = {D1, D2, D3}
“courses” = {D1, D2}
“teaches” = {D2}
“big” = {D2}
“data” = {D2}
“helps” = {D3}
Search term: “edureka”
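The term-to-document mapping on this slide is an inverted index. A minimal sketch of how one could be built for the three example documents follows; the `tokenize` helper is an assumption standing in for a real analyzer, and none of this is Lucene's own API.

```python
import re
from collections import defaultdict

# The three example documents from the slide.
docs = {
    "D1": "I like edureka courses",
    "D2": "Edureka teaches big data courses",
    "D3": "Edureka helps learn new technologies easily",
}


def tokenize(text):
    # Lowercase and keep runs of letters; a simple stand-in for an analyzer.
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)
print(sorted(index["edureka"]))  # ['D1', 'D2', 'D3']
print(sorted(index["courses"]))  # ['D1', 'D2']
```

Looking up a term is then a single dictionary access, which is why searching an index is much cheaper than scanning every document.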
Slide 10
Lucene – Writing to Index
Document (Field, Field, Field, Field) -> Analyzer -> IndexWriter -> Directory
Figure: Classes used when indexing documents with Lucene
Slide 11
Lucene – Searching in Index
Expression -> QueryParser -> Query object -> IndexSearcher
QueryParser <-> Analyzer (text fragments)
• QueryParser translates a textual expression from the end user into an arbitrarily complex query for searching
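To make the idea of turning a textual expression into a query object concrete, here is a toy sketch. It is not Lucene's QueryParser: the `parse` and `search` functions, the left-to-right handling of AND/OR, and the small `INDEX` dictionary are all assumptions chosen for illustration.

```python
# Toy postings: term -> set of document ids (hypothetical sample data).
INDEX = {
    "edureka": {"D1", "D2", "D3"},
    "courses": {"D1", "D2"},
    "teaches": {"D2"},
    "helps": {"D3"},
}


def parse(expression):
    """Turn 'a AND b OR c' into a nested tuple, combined left to right."""
    tokens = expression.split()
    query = ("term", tokens[0].lower())
    # Operators sit at odd positions, operands at even positions.
    for op, word in zip(tokens[1::2], tokens[2::2]):
        if op not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {op}")
        query = (op, query, ("term", word.lower()))
    return query


def search(query, index):
    """Evaluate a parsed query against the postings with set operations."""
    kind = query[0]
    if kind == "term":
        return index.get(query[1], set())
    left = search(query[1], index)
    right = search(query[2], index)
    return left & right if kind == "AND" else left | right


query = parse("edureka AND courses")
print(query)
print(sorted(search(query, INDEX)))
```

The parsed tuple plays the role of the query object: it can be built once from user text and then evaluated against any index of the same shape.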
Slide 12
What is Solr?
Solr is an open source enterprise search server / web application
Solr uses the Lucene search library and extends it
Solr exposes Lucene's Java APIs as RESTful services
You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP
You query it via HTTP GET and receive XML, JSON, CSV or binary results
Slide 13
Solr: Key Features
Advanced full-text search capabilities
Optimized for high-volume web traffic
Standards-based open interfaces: XML, JSON and HTTP
Comprehensive HTML administration interfaces
Server statistics exposed over JMX for monitoring
Near real-time indexing
Adaptable with XML configuration
Linearly scalable, with automatic index replication
Extensible plugin architecture
Slide 14
Solr: Architecture
Slide 15
Solr: Search Process
Components: Request Handler, Query Parser, Response Writer, Index
qt: selects a RequestHandler for a query using /select (by default, the DisMaxRequestHandler is used)
defType: selects a query parser for the query (by default, uses whatever has been configured for the RequestHandler)
qf: selects which fields to query in the index (by default, all fields are required)
wt: selects a response writer for formatting the query response
fq: filters the query by applying an additional query to the initial query's results, and caches the results
rows: specifies the number of rows to be displayed at one time
start: specifies an offset (by default 0) into the query results where the returned response should begin
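The request parameters on this slide travel as an ordinary HTTP query string. The sketch below only builds such a URL and does not contact a server; the base URL follows the localhost address used elsewhere in this deck, and the field names `title`, `description` and `category` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, q, **params):
    # q carries the user's query; the remaining parameters tune how it is handled.
    query = {"q": q, **params}
    return f"{base_url}/select?{urlencode(query)}"


url = build_select_url(
    "http://localhost:8983/solr",  # assumed local Solr address
    "big data",
    defType="dismax",        # which query parser to use
    qf="title description",  # hypothetical fields to search
    fq="category:courses",   # hypothetical filter query
    wt="json",               # response writer
    start=10,                # offset into the results
    rows=10,                 # page size
)
print(url)

# Decode the query string again to check what a server would receive.
print(parse_qs(urlsplit(url).query))
```

With `start=10` and `rows=10` the request asks for the second page of ten results, which is how paging is usually expressed with these two parameters.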
Slide 16
Near Real-Time Search
• Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed: additions and updates to documents are seen in 'near' real time
http://localhost:8983/solr/update?stream.body=<add><doc><field name="id">testdoc</field></doc></add>&commit=true
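The update URL on this slide contains raw XML, which needs percent-encoding before it can be sent safely. A minimal sketch of building the same request follows; `build_update_url` is a hypothetical helper, and whether a server accepts `stream.body` may depend on the Solr version and its configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from xml.sax.saxutils import escape


def build_update_url(base_url, doc_id, commit=True):
    # Escape the id so characters such as < or & cannot break the XML.
    body = f'<add><doc><field name="id">{escape(doc_id)}</field></doc></add>'
    params = {"stream.body": body, "commit": "true" if commit else "false"}
    return f"{base_url}/update?{urlencode(params)}"


url = build_update_url("http://localhost:8983/solr", "testdoc")
print(url)

# Decoding the query string recovers the XML shown on the slide.
print(parse_qs(urlsplit(url).query)["stream.body"][0])
```

Passing `commit=true` makes the new document visible to searches right after the request, which is what the slide uses to demonstrate near real-time behaviour.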
Slide 17
Real-Time Get
• The realtime get feature allows retrieval (by unique key) of the latest version of any document without the associated cost of reopening a searcher
• This is primarily useful when using Solr as a NoSQL data store and not just a search index
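A realtime get request is a lookup by unique key rather than a search. The sketch below builds such a URL, assuming the handler is registered at `/get` and accepts `id` for one key or `ids` for several; the helper name and the localhost address are assumptions, and the handler path can differ per configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_realtime_get_url(base_url, *doc_ids):
    if not doc_ids:
        raise ValueError("at least one document id is required")
    if len(doc_ids) == 1:
        params = {"id": doc_ids[0]}
    else:
        # Several keys are sent as one comma-separated value.
        params = {"ids": ",".join(doc_ids)}
    return f"{base_url}/get?{urlencode(params)}"


single = build_realtime_get_url("http://localhost:8983/solr", "testdoc")
several = build_realtime_get_url("http://localhost:8983/solr", "doc1", "doc2")
print(single)
print(parse_qs(urlsplit(several).query))
```

Because the lookup is by key, it reads like a key-value fetch, which matches the slide's point about using Solr as a NoSQL data store.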
Slide 18
Leveraging Solr Capabilities with Hadoop
• Solr provides fast, efficient, powerful full-text search and near real-time indexing; SolrCloud adds flexible distributed search and indexing, with features such as automatic failover
• Hence it is well suited as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS
• We can do scalable indexing using a Hadoop MapReduce or Pig job and then load the indexed data into Solr
• Solr integrates easily with all the major Hadoop distributions, such as Cloudera, Hortonworks and MapR
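The MapReduce indexing idea can be sketched in plain Python without Hadoop. The map step emits (term, document id) pairs for each input split and the reduce step groups them into postings; the function names and the way the documents are split are assumptions for illustration, not the Hadoop or Solr API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    # Emit one (term, doc_id) pair per token in the document.
    return [(term, doc_id) for term in text.lower().split()]


def reduce_phase(pairs):
    # Group pairs by term and collect the ids of documents containing it.
    grouped = defaultdict(set)
    for term, doc_id in pairs:
        grouped[term].add(doc_id)
    return {term: sorted(ids) for term, ids in grouped.items()}


# Each split stands for the part of the input handled by one mapper.
splits = [
    {"D1": "I like edureka courses"},
    {
        "D2": "Edureka teaches big data courses",
        "D3": "Edureka helps learn new technologies easily",
    },
]

mapped = chain.from_iterable(
    map_phase(doc_id, text) for split in splits for doc_id, text in split.items()
)
postings = reduce_phase(mapped)
print(postings["edureka"])  # ['D1', 'D2', 'D3']
print(postings["courses"])  # ['D1', 'D2']
```

Since each split is mapped independently, adding more mappers scales the indexing step, and only the grouped result needs to be loaded into the search server.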
Slide 19
Figure: Scalable Indexing
Input data: raw files (PDF, Word, HTML, ...) stored in HDFS (Hadoop Distributed File System)
MapReduce indexing job: raw files -> indexed (Lucene)
Solr instances serve the index; the search web app sends a query and receives a response
Slide 20
Solr with YARN
Slide 21
Job trends for Apache Solr
Slide 22
Disclaimer
Criteria and guidelines mentioned in this presentation may change. Please visit our website for the latest and additional information on Apache Solr
Slide 23
Course Topics
• Module 1
» Introduction to Apache Lucene
• Module 2
» Exploring Lucene
• Module 3
» Introduction to Apache Solr
• Module 4
» Solr Indexing
• Module 5
» Solr Searching
• Module 6
» Solr Extended Features
• Module 7
» Solr Cloud & Administration
• Module 8
» Final Project