SlideShare a Scribd company logo
THE FIRST CLASS INTEGRATION OF SOLR WITH
HADOOP
Mark Miller (Cloudera)
WHO AM I?
Cloudera employee, Lucene/Solr committer, Lucene PMC member,
Apache member

!
First job out of college was in the Newspaper archiving business.

!
First full time employee at LucidWorks - a startup around Lucene/Solr.

!
Spent a couple years as “Core” engineering manager, reporting to the
VP of engineering.
•

Very fast and feature rich ‘core’ search engine library.

•

Compact and powerful, Lucene is an extremely popular full-text search library.

•

Provides low level API’s for analyzing, indexing, and searching text, along with a
myriad of related features.

!
!
!

•

Just the core - either you write the ‘glue’ or use a higher level search engine built
with Lucene.
• Solr (pronounced "solar") is an open source enterprise search
platform from the Apache Lucene project. Its major features
include full-text search, hit highlighting, faceted search,
dynamic clustering, database integration, and rich document
(e.g., Word, PDF) handling. Providing distributed search and
index replication, Solr is highly scalable. Solr is the most
popular enterprise search engine.



- Wikipedia
SEARCH ON HADOOP HISTORY
•
•
•
•
•
•
•

Katta
Blur
SolBase
HBASE-3529
SOLR-1301
SOLR-1045
Ad-Hoc
...
THE PLAN: STRENGTHEN THE FAMILY BONDS
•

No need to build something radically new - we have the pieces we need.

•

Focus on integration points.

•

Create high quality, first class integrations and contribute the work to the projects
involved.

!
!
!

•

Focus on integration and quality first - then performance and scale.
SOLRCLOUD
SOLR INTEGRATION
•

Read and Write directly to HDFS

•
•

First Class Custom Directory Support in Solr
Support Solr Replication on HDFS

•

Other improvements around usability and configuration

!
!
READ AND WRITE DIRECTLY TO HDFS

•

Lucene did not historically support append only file system

•

“Flexible Indexing” brought around support for append only filesystem support

•

Lucene support append only filesystem by default since 4.2

!
!
LUCENE DIRECTORY ABSTRACTION
•
•

It’s how Lucene interacts with index files.
Solr uses the Lucene library and offers DirectoryFactory

!
•
•
•
•
•
•
•
•

Class Directory {
listAll();
createOutput(file, context);
openInput(file, context);
deleteFile(file);
makeLock(file);
clearLock(file);
…
PUTTING THE INDEX IN HDFS
•

Solr relies on the filesystem cache to operate at full speed.

•

HDFS not known for it’s random access speed.

•

Apache Blur has already solved this with an HdfsDirectory that works on top of a
BlockDirectory.

!
!
!

•

The “block cache” caches the hot blocks of the index off heap (direct byte array)
and takes the place of the filesystem cache.

!
•

We contributed back optional ‘write’ caching.

!
!
PUTTING THE TRANSACTIONLOG IN HDFS
•

HdfsUpdateLog added - extends UpdateLog

•

Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no
additional configuration necessary.

!
!

•

Same extensive testing as used on UpdateLog
RUNNING SOLR ON HDFS
•

Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in
hdfs.

!
•

Set LockType to ‘hdfs’

•

Use an UpdateLog dataDir location that begins with ‘hdfs:/’

•
•
•

Or java -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lockType=solr.HdfsLockFactory
-Dsolr.updatelog=hdfs://host:port/path -jar start.jar

!
!
SOLR REPLICATION ON HDFS
!
•

While Solr has exposed a plug-able DirectoryFactory for a long time now, it was
really quite limited.

!
•

Most glaring, only a local file system based Directory would work with replication.

•

There where also other more minor areas that relied on a local filesystem Directory
implementation.

!
FUTURE SOLR REPLICATION ON HDFS
•

Take advantage of “distributed filesystem” and allow for something similar to HBase
regions.

!
•

If a node goes down, the data is still available in HDFS - allow for that index to be
automatically served by a node that is still up if it has the capacity.

Solr
Node

Solr
Node

Solr
Node
HDFS
• Leader reads and writes index files to HDFS
• Replicas only read from HDFS, write to /dev/null

Leader

Replica

HDFS

Replica
MAP REDUCE INDEX BUILDING
•

Scalable index creation via map-reduce

•

Many initial ‘homegrown’ implementations sent documents from reducer to
SolrCloud over http

!
!

•

To really scale, you want the reducers to create the indexes in HDFS and then load
them up with Solr

!
•

The ideal impl will allow using as many reducers as are available in your hadoop
cluster, and then merge the indexes down to the correct number of ‘shards’
MR INDEX BUILDING
Mapper:
Parse input

Mapper:
Parse input

Mapper:
Parse input

Arbitrary reducing steps of indexing and merging

End-Reducer

End-Reducer

Index

Index
SOLRCLOUD AWARE
•

Can ‘inspect’ ZooKeeper to learn about Solr cluster.

•

What URL’s to GoLive to.

•

The Schema to use when building indexes.

•

Match hash -> shard assignments of a Solr cluster.

!
!
!
GOLIVE
!
•
•
•
•

After building your indexes with map-reduce, how do you deploy them to
your Solr cluster?
We want it to be easy - so we built the GoLive option.
GoLive allows you to easily merge the indexes you have created
atomically into a live running Solr cluster.
Paired with the ZooKeeper Aware ability, this allows you to simply point
your map-reduce job to your Solr cluster and it will automatically discover
how many shards to build and what locations to deliver the final indexes to
in HDFS.
FLUME SOLR SYNC
• Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of
log data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model that allows
for online analytic application.
FLUME SOLR SYNC
Logs

Other
Flume

Flume

Solr
HDFS
SOLRCLOUD AWARE
•

Can ‘inspect’ ZooKeeper to learn about Solr cluster.

•

What URL’s to send data to.

•

The Schema for the collection being indexed to.

!
!
HBASE INTEGRATION
•
•
•
•
•
•
•
•

Collaboration between NGData & Cloudera
NGData are creators of the Lily data management platform
Lily HBase Indexer
Service which acts as a HBase replication listener
HBase replication features, such as filtering, supported
Replication updates trigger indexing of updates (rows)
Integrates Morphlines library for ETL of rows
AL2 licensed on github https://github.com/ngdata
Triggers on
updates

interactive load

HBase

HDFS

Indexer(s
)

Solr
Solr
server
Solr
server
Solr
server
Solr
server
server
MORPHLINES
•

A morphline is a configuration file that allows you to define ETL transformation
pipelines

!
•

Extract content from input files, transform content, load content (eg to Solr)

•

Uses Tika to extract content from a large variety of input documents

•

Part of the CDK (Cloudera Development Kit)

!
!
syslog

Flume
Agent

Solr Sink
Command: readLine
Command: grok
Command: loadSolr

Solr

•
•
•
•
•
•
•
•
•
•
•

Open Source framework for simple ETL
Ships as part Cloudera Developer Kit (CDK)
It’s a Java library
AL2 licensed on github https://github.com/cloudera/cdk
Similar to Unix pipelines
Configuration over coding
Supports common Hadoop formats
Avro
Sequence file
Text
Etc…

!
•
•
•
•
•
•
•
•

Integrate with and load into Apache Solr
Flexible log file analysis
Single-line record, multi-line records, CSV files
Regex based pattern matching and extraction
Integration with Avro
Integration with Apache Hadoop Sequence Files
Integration with SolrCell and all Apache Tika parsers
Auto-detection of MIME types from binary data using Apache Tika
•
•
•
•
•
•
•
•
•
•

Scripting support for dynamic java code
Operations on fields for assignment and comparison
Operations on fields with list and set semantics
if-then-else conditionals
A small rules engine (tryRules)
String and timestamp conversions
slf4j logging
Yammer metrics and counters
Decompression and unpacking of arbitrarily nested container file
formats
Etc…
MORPHLINES EXAMPLE CONFIG

Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 po
Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.

morphlines : [
 {
   id : morphline1
   importCommands : ["com.cloudera.**", "org.apache.solr.**"]
   commands : [
     { readLine {} }                    
     { 
       grok { 




         dictionaryFiles : [/tmp/grok-dictionaries]                               
         expressions : { 
           message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %
{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?: %{GREEDYDATA:syslog_message}"""
         }
       }
     }
     { loadSolr {} }     
    ]
 }
HUE INTEGRATION
•
•
•
•
•

Hue
Simple UI
Navigated, faceted drill down
Customizable display
Full text search, standard Solr
API and query language
CLOUDERA SEARCH
•

https://ccp.cloudera.com/display/SUPPORT/Downloads

•

Or Google

•

“cloudera search download”

!
!
Mark Miller, Cloudera

@heismark

More Related Content

What's hot

Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
Cloudera, Inc.
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
gutierrezga00
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
bigdatagurus_meetup
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
Shalin Shekhar Mangar
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsMay 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
Yahoo Developer Network
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera, Inc.
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Lucidworks
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
markgrover
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
Owen O'Malley
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
Renato Javier Marroquín Mogrovejo
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
Guy Harrison
 
Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database
huguk
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
Cloudera, Inc.
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
Lucidworks
 

What's hot (20)

Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsMay 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
 

Similar to The First Class Integration of Solr with Hadoop

Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
Mark Kerzner
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetupgregchanan
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413gregchanan
 
Intro to Apache Solr for Drupal
Intro to Apache Solr for DrupalIntro to Apache Solr for Drupal
Intro to Apache Solr for Drupal
Chris Caple
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Lecture 2_ Intro to laravel.pptx
Lecture 2_ Intro to laravel.pptxLecture 2_ Intro to laravel.pptx
Lecture 2_ Intro to laravel.pptx
SaziaRahman
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
Grant Ingersoll
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
Hortonworks
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Lucidworks
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
Tommaso Teofili
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
gagravarr
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
whoschek
 
Laravel ppt
Laravel pptLaravel ppt
Laravel ppt
Mayank Panchal
 
Solr 8 interview
Solr 8 interview Solr 8 interview
Solr 8 interview
Alihossein shahabi
 
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesCloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
anynines GmbH
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
gagravarr
 

Similar to The First Class Integration of Solr with Hadoop (20)

Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Intro to Apache Solr for Drupal
Intro to Apache Solr for DrupalIntro to Apache Solr for Drupal
Intro to Apache Solr for Drupal
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Lecture 2_ Intro to laravel.pptx
Lecture 2_ Intro to laravel.pptxLecture 2_ Intro to laravel.pptx
Lecture 2_ Intro to laravel.pptx
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Laravel ppt
Laravel pptLaravel ppt
Laravel ppt
 
Solr 8 interview
Solr 8 interview Solr 8 interview
Solr 8 interview
 
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesCloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
lucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
lucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
lucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
lucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
lucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
lucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
lucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
lucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
lucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
lucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
lucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
lucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
lucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

The First Class Integration of Solr with Hadoop

  • 1.
  • 2. THE FIRST CLASS INTEGRATION OF SOLR WITH HADOOP Mark Miller (Cloudera)
  • 3. WHO AM I? Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member ! First job out of college was in the Newspaper archiving business. ! First full time employee at LucidWorks - a startup around Lucene/Solr. ! Spent a couple years as “Core” engineering manager, reporting to the VP of engineering.
  • 4. • Very fast and feature rich ‘core’ search engine library. • Compact and powerful, Lucene is an extremely popular full-text search library. • Provides low level API’s for analyzing, indexing, and searching text, along with a myriad of related features. ! ! ! • Just the core - either you write the ‘glue’ or use a higher level search engine built with Lucene.
  • 5. • Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine.
 
 - Wikipedia
  • 6. SEARCH ON HADOOP HISTORY • • • • • • • Katta Blur SolBase HBASE-3529 SOLR-1301 SOLR-1045 Ad-Hoc
  • 7. ...
  • 8. THE PLAN: STRENGTHEN THE FAMILY BONDS • No need to build something radically new - we have the pieces we need. • Focus on integration points. • Create high quality, first class integrations and contribute the work to the projects involved. ! ! ! • Focus on integration and quality first - then performance and scale.
  • 10. SOLR INTEGRATION • Read and Write directly to HDFS • • First Class Custom Directory Support in Solr Support Solr Replication on HDFS • Other improvements around usability and configuration ! !
  • 11. READ AND WRITE DIRECTLY TO HDFS • Lucene did not historically support append only file system • “Flexible Indexing” brought around support for append only filesystem support • Lucene support append only filesystem by default since 4.2 ! !
  • 12. LUCENE DIRECTORY ABSTRACTION • • It’s how Lucene interacts with index files. Solr uses the Lucene library and offers DirectoryFactory ! • • • • • • • • Class Directory { listAll(); createOutput(file, context); openInput(file, context); deleteFile(file); makeLock(file); clearLock(file); …
  • 13. PUTTING THE INDEX IN HDFS • Solr relies on the filesystem cache to operate at full speed. • HDFS not known for it’s random access speed. • Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory. ! ! ! • The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the filesystem cache. ! • We contributed back optional ‘write’ caching. ! !
  • 14. PUTTING THE TRANSACTIONLOG IN HDFS • HdfsUpdateLog added - extends UpdateLog • Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no additional configuration necessary. ! ! • Same extensive testing as used on UpdateLog
  • 15. RUNNING SOLR ON HDFS • Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in hdfs. ! • Set LockType to ‘hdfs’ • Use an UpdateLog dataDir location that begins with ‘hdfs:/’ • • • Or java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lockType=solr.HdfsLockFactory -Dsolr.updatelog=hdfs://host:port/path -jar start.jar ! !
  • 16. SOLR REPLICATION ON HDFS ! • While Solr has exposed a plug-able DirectoryFactory for a long time now, it was really quite limited. ! • Most glaring, only a local file system based Directory would work with replication. • There where also other more minor areas that relied on a local filesystem Directory implementation. !
  • 17. FUTURE SOLR REPLICATION ON HDFS • Take advantage of “distributed filesystem” and allow for something similar to HBase regions. ! • If a node goes down, the data is still available in HDFS - allow for that index to be automatically served by a node that is still up if it has the capacity. Solr Node Solr Node Solr Node HDFS
  • 18. • Leader reads and writes index files to HDFS • Replicas only read from HDFS, write to /dev/null Leader Replica HDFS Replica
  • 19. MAP REDUCE INDEX BUILDING • Scalable index creation via map-reduce • Many initial ‘homegrown’ implementations sent documents from reducer to SolrCloud over http ! ! • To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr ! • The ideal impl will allow using as many reducers as are available in your hadoop cluster, and then merge the indexes down to the correct number of ‘shards’
  • 20. MR INDEX BUILDING Mapper: Parse input Mapper: Parse input Mapper: Parse input Arbitrary reducing steps of indexing and merging End-Reducer End-Reducer Index Index
  • 21. SOLRCLOUD AWARE • Can ‘inspect’ ZooKeeper to learn about Solr cluster. • What URL’s to GoLive to. • The Schema to use when building indexes. • Match hash -> shard assignments of a Solr cluster. ! ! !
  • 22. GOLIVE ! • • • • After building your indexes with map-reduce, how do you deploy them to your Solr cluster? We want it to be easy - so we built the GoLive option. GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster. Paired with the ZooKeeper Aware ability, this allows you to simply point your map-reduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.
  • 23. FLUME SOLR SYNC • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
  • 25. SOLRCLOUD AWARE • Can ‘inspect’ ZooKeeper to learn about Solr cluster. • What URL’s to send data to. • The Schema for the collection being indexed to. ! !
  • 26. HBASE INTEGRATION • • • • • • • • Collaboration between NGData & Cloudera NGData are creators of the Lily data management platform Lily HBase Indexer Service which acts as a HBase replication listener HBase replication features, such as filtering, supported Replication updates trigger indexing of updates (rows) Integrates Morphlines library for ETL of rows AL2 licensed on github https://github.com/ngdata
  • 28. MORPHLINES • A morphline is a configuration file that allows you to define ETL transformation pipelines ! • Extract content from input files, transform content, load content (eg to Solr) • Uses Tika to extract content from a large variety of input documents • Part of the CDK (Cloudera Development Kit) ! !
  • 29. syslog Flume Agent Solr Sink Command: readLine Command: grok Command: loadSolr Solr • • • • • • • • • • • Open Source framework for simple ETL Ships as part Cloudera Developer Kit (CDK) It’s a Java library AL2 licensed on github https://github.com/cloudera/cdk Similar to Unix pipelines Configuration over coding Supports common Hadoop formats Avro Sequence file Text Etc… !
  • 30. • • • • • • • • Integrate with and load into Apache Solr Flexible log file analysis Single-line record, multi-line records, CSV files Regex based pattern matching and extraction Integration with Avro Integration with Apache Hadoop Sequence Files Integration with SolrCell and all Apache Tika parsers Auto-detection of MIME types from binary data using Apache Tika
  • 31. • • • • • • • • • • Scripting support for dynamic java code Operations on fields for assignment and comparison Operations on fields with list and set semantics if-then-else conditionals A small rules engine (tryRules) String and timestamp conversions slf4j logging Yammer metrics and counters Decompression and unpacking of arbitrarily nested container file formats Etc…
  • 32. MORPHLINES EXAMPLE CONFIG Example Input <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 po Output Record syslog_pri:164 syslog_timestamp:Feb  4 10:46:14 syslog_hostname:syslog syslog_program:sshd syslog_pid:607 syslog_message:listening on 0.0.0.0 port 22. morphlines : [  {    id : morphline1    importCommands : ["com.cloudera.**", "org.apache.solr.**"]    commands : [      { readLine {} }                          {        grok { 
          dictionaryFiles : [/tmp/grok-dictionaries]                                         expressions : {            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} % {DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?: %{GREEDYDATA:syslog_message}"""          }        }      }      { loadSolr {} }          ]  }
  • 33. HUE INTEGRATION • • • • • Hue Simple UI Navigated, faceted drill down Customizable display Full text search, standard Solr API and query language