Solr + Hadoop = Big Data Search
Mark Miller
Who Am I?
Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member.
First job out of college was in the newspaper archiving business.
First full-time employee at LucidWorks - a startup around Lucene/Solr.
Spent a couple of years as "Core" engineering manager, reporting to the VP of engineering.
Very fast and feature-rich 'core' search engine library.
Compact and powerful, Lucene is an extremely popular full-text search library.
Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features.
Just the core - either you write the 'glue' or use a higher-level search engine built with Lucene.
Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine.
- Wikipedia
Search on Hadoop History
• Katta
• Blur
• SolBase
• HBASE-3529
• SOLR-1301
• SOLR-1045
• Ad-Hoc
Family Tree
...
Strengthen the Family Bonds
• No need to build something radically new - we have the pieces we need.
• Focus on integration points.
• Create high-quality, first-class integrations and contribute the work to the projects involved.
• Focus on integration and quality first - then performance and scale.
SolrCloud
Solr Integration
• Read and write directly to HDFS
• First-class custom Directory support in Solr
• Support Solr replication on HDFS
• Other improvements around usability and configuration
Read and Write Directly to HDFS
• Lucene did not historically support append-only file systems.
• "Flexible Indexing" brought around support for append-only file systems.
• Lucene supports append-only file systems by default since 4.2.
Lucene Directory Abstraction
It's how Lucene interacts with index files.
Solr uses the Lucene library and offers DirectoryFactory.

class Directory {
    listAll();
    createOutput(file, context);
    openInput(file, context);
    deleteFile(file);
    makeLock(file);
    clearLock(file);
    ...
}
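To make the abstraction concrete, here is a minimal in-memory sketch in Python (a hypothetical analogue, not Lucene's actual API) of the same contract: callers only see listAll/createOutput/openInput/deleteFile, and where the bytes actually live is the implementation's business.

```python
import io

class InMemoryDirectory:
    """Toy analogue of Lucene's Directory: maps file names to bytes.

    Real implementations (FSDirectory, an HDFS-backed directory, ...)
    differ only in where the bytes are stored, which is the point of
    the abstraction."""

    def __init__(self):
        self._files = {}

    def list_all(self):
        return sorted(self._files)

    def create_output(self, name):
        # Append-only writer; committed into the directory on close.
        return _Output(self._files, name, io.BytesIO())

    def open_input(self, name):
        return io.BytesIO(self._files[name])

    def delete_file(self, name):
        del self._files[name]

class _Output:
    def __init__(self, files, name, buf):
        self._files, self._name, self._buf = files, name, buf

    def write(self, data):
        self._buf.write(data)

    def close(self):
        self._files[self._name] = self._buf.getvalue()
```

An HDFS-backed implementation swaps the dict for HDFS streams without touching any calling code.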
Putting the Index in HDFS
• Solr relies on the filesystem cache to operate at full speed.
• HDFS is not known for its random access speed.
• Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory.
• The "block cache" caches the hot blocks of the index off-heap (direct byte array) and takes the place of the filesystem cache.
• We contributed back optional 'write' caching.
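The block-cache idea can be sketched as an LRU cache keyed by (file, block number). This is an on-heap Python simplification to show the mechanism; Blur's real cache keeps blocks off-heap in a direct byte array, and the block size and API here are illustrative.

```python
from collections import OrderedDict

BLOCK_SIZE = 8192  # illustrative; real block caches tune this

class BlockCache:
    """LRU cache of fixed-size index blocks, standing in for the
    filesystem cache that a local Directory would get for free."""

    def __init__(self, max_blocks):
        self._max = max_blocks
        self._blocks = OrderedDict()  # (file, block_no) -> bytes

    def get(self, file, block_no):
        key = (file, block_no)
        if key in self._blocks:
            self._blocks.move_to_end(key)  # mark as hot
            return self._blocks[key]
        return None  # miss: caller reads the block from HDFS

    def put(self, file, block_no, data):
        key = (file, block_no)
        self._blocks[key] = data
        self._blocks.move_to_end(key)
        if len(self._blocks) > self._max:
            self._blocks.popitem(last=False)  # evict the coldest block
```

On a miss, the directory reads the block from HDFS and `put`s it, so repeated random reads of hot index regions stop paying HDFS latency.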
Putting the TransactionLog in HDFS
• HdfsUpdateLog added - extends UpdateLog.
• Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no additional configuration necessary.
• Same extensive testing as used on UpdateLog.
Running Solr on HDFS
• Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS.
• Set LockType to 'hdfs'.
• Use an UpdateLog dataDir location that begins with 'hdfs:/'.
• Or:

java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lockType=solr.HdfsLockFactory \
     -Dsolr.updatelog=hdfs://host:port/path -jar start.jar
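The same settings can also live in solrconfig.xml. A sketch along the lines of the Solr 4.x HDFS support follows; the exact property names (e.g. `solr.hdfs.home`) vary by version, so treat this as an outline and check the reference guide for yours:

```xml
<!-- Sketch only: store the index in HDFS instead of the local disk -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://host:port/solr</str>
</directoryFactory>

<indexConfig>
  <!-- HDFS-aware locking instead of native filesystem locks -->
  <lockType>hdfs</lockType>
</indexConfig>
```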
Solr Replication on HDFS
• While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited.
• Most glaring: only a local filesystem-based Directory would work with replication.
• There were also other, more minor areas that relied on a local filesystem Directory implementation.
Future Solr Replication on HDFS
• Take advantage of the "distributed filesystem" and allow for something similar to HBase regions.
• If a node goes down, the data is still available in HDFS - allow for that index to be automatically served by a node that is still up, if it has the capacity.

(diagram: Solr Nodes sharing indexes stored in HDFS)
MR Index Building
• Scalable index creation via MapReduce.
• Many initial 'homegrown' implementations sent documents from the reducer to SolrCloud over HTTP.
• To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr.
• The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of 'shards'.
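The "merge down to the correct number of shards" step amounts to grouping many reducer outputs into a few merge targets. A scheduling sketch in Python (a hypothetical helper, not the actual indexer tool's code):

```python
def plan_merges(index_dirs, num_shards):
    """Group N reducer-built indexes into num_shards merge groups.

    Round-robin assignment keeps group sizes within one of each
    other; each group would then be merged (e.g. with Lucene's
    index merging) into one final shard index in HDFS."""
    groups = [[] for _ in range(num_shards)]
    for i, directory in enumerate(index_dirs):
        groups[i % num_shards].append(directory)
    return groups
```

So a job free to use 8 reducers can still produce exactly 2 shards: `plan_merges([...8 dirs...], 2)` yields two groups of four indexes, each merged into one shard.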
MR Index Building

(diagram)
Mappers (xN): parse input into indexable documents
    |
Arbitrary reducing steps of indexing and merging
    |
End-Reducer (shard 1): index documents -> index shard 1
End-Reducer (shard 2): index documents -> index shard 2
SolrCloud Aware
• Can 'inspect' ZooKeeper to learn about the Solr cluster:
• What URLs to GoLive to.
• The schema to use when building indexes.
• Match hash -> shard assignments of a Solr cluster.
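Matching the cluster's hash -> shard assignments can be sketched as range partitioning of a 32-bit hash space. SolrCloud's router actually hashes the document id with MurmurHash3; the `crc32` below is only a stand-in to keep the sketch self-contained.

```python
import zlib

def shard_ranges(num_shards):
    """Split the 32-bit hash space into contiguous per-shard ranges."""
    step = 2 ** 32 // num_shards
    ranges = [(i * step, (i + 1) * step - 1) for i in range(num_shards - 1)]
    ranges.append(((num_shards - 1) * step, 2 ** 32 - 1))
    return ranges

def shard_for(doc_id, ranges):
    # Stand-in hash; SolrCloud uses MurmurHash3 on the id.
    h = zlib.crc32(doc_id.encode()) & 0xFFFFFFFF
    for shard, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return shard
```

An index-building job that uses the same hash and ranges as the live cluster produces shards that drop straight into place.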
GoLive
• After building your indexes with MapReduce, how do you deploy them to your Solr cluster?
• We want it to be easy - so we built the GoLive option.
• GoLive allows you to easily merge the indexes you have created atomically into a live, running Solr cluster.
• Paired with the ZooKeeper-aware ability, this allows you to simply point your MapReduce job at your Solr cluster, and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.
Flume Solr Sink
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
- Apache Flume Website

(diagram: logs and other sources)
Flume Solr Sink

(diagram: Flume Agents delivering data to HDFS and Solr)
SolrCloud Aware
• Can 'inspect' ZooKeeper to learn about the Solr cluster:
• What URLs to send data to.
• The schema for the collection being indexed to.
HBase Integration
• Collaboration between NGData & Cloudera
• NGData are creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as an HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates Morphlines library for ETL of rows
• AL2 licensed on GitHub: https://github.com/ngdata
HBase Integration

(diagram: interactive load into HBase on HDFS; Indexer(s) trigger on replication updates and feed a cluster of Solr servers)
Morphlines
• A morphline is a configuration file that allows you to define ETL transformation pipelines.
• Extract content from input files, transform content, load content (e.g., to Solr).
• Uses Tika to extract content from a large variety of input documents.
• Part of the CDK (Cloudera Development Kit).
Morphlines

(diagram: syslog -> Flume Agent with SolrSink running commands readLine -> grok -> loadSolr -> Solr)

• Open source framework for simple ETL
• Ships as part of the Cloudera Development Kit (CDK)
• It's a Java library
• AL2 licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines
• Configuration over coding
• Supports common Hadoop formats: Avro, sequence files, text
Morphlines
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line records, multi-line records, CSV files
• Regex-based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop sequence files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
Morphlines
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
Morphlines Example Config

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
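The readLine -> grok -> loadSolr pipeline above can be mimicked with a plain regex: the pattern below hand-expands the grok macros for this one message shape. It is a sketch of what the pipeline computes, not the Morphlines engine.

```python
import re

# Hand-expanded equivalent of the grok expression for this syslog shape.
SYSLOG = re.compile(
    r"^<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>\w+)(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)$"
)

def grok_syslog(line):
    """readLine + grok: turn one syslog line into a field dict
    (the record that loadSolr would then send to Solr)."""
    m = SYSLOG.match(line)
    return m.groupdict() if m else None
```

Running it on the example input yields exactly the output record shown above.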
Hue Integration
• Simple UI
• Navigated, faceted drill-down
• Customizable display
• Full-text search, standard Solr API and query language
Cloudera Search
https://ccp.cloudera.com/display/SUPPORT/Downloads
Or Google "cloudera search download"

Mark Miller, Cloudera - @heismark

Solr + Hadoop = Big Data Search