Solr+Hadoop = Big Data Search
 

From Solr committer Mark Miller

Presentation Transcript

  • 1 Solr + Hadoop = Big Data Search. Mark Miller
  • 2 Who Am I? Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member. First job out of college was in the newspaper archiving business. First full-time employee at LucidWorks, a startup around Lucene/Solr. Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.
  • 3 Very fast and feature-rich ‘core’ search engine library. Compact and powerful, Lucene is an extremely popular full-text search library. Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features. Just the core: either you write the ‘glue’ or use a higher-level search engine built with Lucene.
  • 4 Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine. - Wikipedia
  • 5 Search on Hadoop History: Katta, Blur, SolBase, HBASE-3529, SOLR-1301, SOLR-1045, Ad-Hoc
  • 6 Family Tree ...
  • 7 Strengthen the Family Bonds: No need to build something radically new; we have the pieces we need. Focus on integration points. Create high quality, first class integrations and contribute the work to the projects involved. Focus on integration and quality first, then performance and scale.
  • 8 SolrCloud
  • 9 Solr Integration: Read and write directly to HDFS. First class custom Directory support in Solr. Support Solr replication on HDFS. Other improvements around usability and configuration.
  • 10 Read and Write directly to HDFS: Lucene did not historically support append-only filesystems. “Flexible Indexing” brought support for append-only filesystems. Lucene has supported append-only filesystems by default since 4.2.
  • 11 Lucene Directory Abstraction: It’s how Lucene interacts with index files. Solr uses the Lucene library and offers DirectoryFactory.
    Class Directory {
        listAll();
        createOutput(file, context);
        openInput(file, context);
        deleteFile(file);
        makeLock(file);
        clearLock(file);
        …
    }
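The Directory contract on slide 11 can be illustrated with a small in-memory model. This is a Python sketch, not Lucene's Java API: the class name `InMemoryDirectory` and the snake_case methods are assumptions that mirror `listAll`, `createOutput`, `openInput`, and `deleteFile`, while locking and the `context` argument are left out.

```python
import io


class InMemoryDirectory:
    """Toy stand-in for a Lucene-style Directory: a flat namespace of files."""

    def __init__(self):
        self._files = {}  # file name -> bytes

    def list_all(self):
        # Mirrors listAll(): names of all files, sorted for determinism.
        return sorted(self._files)

    def create_output(self, name):
        # Mirrors createOutput(): returns a writable stream whose contents
        # are published to the directory when the stream is closed.
        directory = self

        class _Output(io.BytesIO):
            def close(self):
                if not self.closed:
                    directory._files[name] = self.getvalue()
                super().close()

        return _Output()

    def open_input(self, name):
        # Mirrors openInput(): a readable, seekable view of an existing file.
        return io.BytesIO(self._files[name])

    def delete_file(self, name):
        # Mirrors deleteFile().
        del self._files[name]


directory = InMemoryDirectory()
with directory.create_output("segments_1") as out:
    out.write(b"index data")
with directory.open_input("segments_1") as inp:
    inp.seek(6)          # random access, as a searcher would do
    tail = inp.read()
```

Because callers only see these few operations, the backing store can be swapped (local disk, memory, HDFS) without changing the code above it, which is the point the slide makes about DirectoryFactory.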
  • 12 Putting the Index in HDFS: Solr relies on the filesystem cache to operate at full speed. HDFS is not known for its random access speed. Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory. The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the filesystem cache. We contributed back optional ‘write’ caching.
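The block cache idea on slide 12 can be sketched as an LRU cache of fixed-size blocks sitting in front of a slow store. This is an illustration only: `BlockCache`, its parameters, and the tiny block size are assumptions, the cache lives in ordinary Python objects rather than off-heap memory, and the real BlockDirectory in Blur and Solr is considerably more involved.

```python
from collections import OrderedDict


class BlockCache:
    """LRU cache of fixed-size blocks in front of a slow backing store."""

    def __init__(self, backing, block_size, max_blocks):
        self.backing = backing        # bytes standing in for a file in HDFS
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()   # block number -> bytes, oldest first
        self.misses = 0               # how often the backing store was read

    def _block(self, number):
        if number in self.blocks:
            self.blocks.move_to_end(number)   # mark as most recently used
            return self.blocks[number]
        self.misses += 1
        start = number * self.block_size
        data = self.backing[start:start + self.block_size]  # the "slow" read
        self.blocks[number] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)   # evict least recently used
        return data

    def read(self, offset, length):
        # Serve a random-access read by stitching together cached blocks.
        out = bytearray()
        end = min(offset + length, len(self.backing))
        while offset < end:
            number, within = divmod(offset, self.block_size)
            chunk = self._block(number)[within:within + (end - offset)]
            out += chunk
            offset += len(chunk)
        return bytes(out)


cache = BlockCache(backing=b"abcdefghijklmnop", block_size=4, max_blocks=2)
first = cache.read(2, 4)    # spans blocks 0 and 1, both fetched
second = cache.read(3, 3)   # same blocks again, served from the cache
```

After the two reads only two fetches from the backing store have happened, which is the effect the slide describes: hot blocks are answered from memory instead of HDFS.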
  • 13 Putting the TransactionLog in HDFS: HdfsUpdateLog added; it extends UpdateLog. Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/, with no additional configuration necessary. Same extensive testing as used on UpdateLog.
  • 14 Running Solr on HDFS: Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS. Set LockType to ‘hdfs’. Use an UpdateLog dataDir location that begins with ‘hdfs:/’. Or:
    java -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lockType=solr.HdfsLockFactory -Dsolr.updatelog=hdfs://host:port/path -jar start.jar
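As a small illustration of the settings on slide 14, the start command can be assembled programmatically. The helper name `solr_hdfs_command` is hypothetical; the property names and values are copied from the slide, and `hdfs://host:port/path` is the slide's own placeholder rather than a real location.

```python
def solr_hdfs_command(update_log_dir):
    """Assemble the start command from slide 14 as an argument list."""
    if not update_log_dir.startswith("hdfs:/"):
        # Slide 13: HDFS update logging is keyed off this prefix.
        raise ValueError("update log dataDir must start with hdfs:/")
    properties = {
        "solr.directoryFactory": "HdfsDirectoryFactory",
        "solr.lockType": "solr.HdfsLockFactory",
        "solr.updatelog": update_log_dir,
    }
    flags = [f"-D{key}={value}" for key, value in properties.items()]
    return ["java", *flags, "-jar", "start.jar"]


command = solr_hdfs_command("hdfs://host:port/path")
```

Building the command as a list keeps each system property as its own argument, so it can be handed to a process launcher without shell quoting concerns.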
  • 15 Solr Replication on HDFS: While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited. Most glaring, only a local file system based Directory would work with replication. There were also other more minor areas that relied on a local filesystem Directory implementation.
  • 16 Future Solr Replication on HDFS: Take advantage of the “distributed filesystem” and allow for something similar to HBase regions. If a node goes down, the data is still available in HDFS; allow for that index to be automatically served by a node that is still up if it has the capacity. (Diagram labels: Solr Node, Solr Node, Solr Node, HDFS.)
  • 17 MR Index Building: Scalable index creation via map-reduce. Many initial ‘homegrown’ implementations sent documents from the reducer to SolrCloud over HTTP. To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr. The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’.
  • 18 MR Index Building (diagram): Three mappers each parse input into indexable documents. Arbitrary reducing steps of indexing and merging follow. End-reducer (shard 1) indexes documents into index shard 1; end-reducer (shard 2) indexes documents into index shard 2.
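The flow on slides 17 and 18 can be simulated in plain Python, with no Hadoop involved: mappers parse lines into documents, documents are spread over many reducers, each reducer builds a small index, and the partial indexes are merged down to the shard count. Everything here is an assumption made for illustration: the tab-separated input format, `zlib.crc32` as the hash, a dict of term to document ids as the "index", and the rule that reducer `r` feeds shard `r % num_shards`.

```python
import zlib
from collections import defaultdict


def map_line(line):
    # Mapper: parse one input line ("id<TAB>text") into an indexable document.
    doc_id, text = line.split("\t", 1)
    return {"id": doc_id, "text": text}


def partition(doc, num_partitions):
    # Stable hash of the document id; crc32 is a stand-in for Solr's hashing.
    return zlib.crc32(doc["id"].encode("utf-8")) % num_partitions


def reduce_to_index(docs):
    # Reducer: build a tiny inverted index (term -> set of document ids).
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["id"])
    return index


def merge_indexes(indexes):
    merged = defaultdict(set)
    for index in indexes:
        for term, ids in index.items():
            merged[term] |= ids
    return merged


def build_shards(lines, num_reducers, num_shards):
    # num_reducers must be a multiple of num_shards so that
    # (hash % num_reducers) % num_shards == hash % num_shards.
    if num_reducers % num_shards != 0:
        raise ValueError("num_reducers must be a multiple of num_shards")
    buckets = defaultdict(list)
    for line in lines:
        doc = map_line(line)
        buckets[partition(doc, num_reducers)].append(doc)
    partial = {r: reduce_to_index(docs) for r, docs in buckets.items()}
    return [
        merge_indexes(idx for r, idx in partial.items() if r % num_shards == s)
        for s in range(num_shards)
    ]


lines = ["1\tsolr on hadoop", "2\thadoop search", "3\tsolr search", "4\tbig data"]
shards = build_shards(lines, num_reducers=4, num_shards=2)
```

Using more reducers than shards lets the indexing work spread across the cluster, while the final merge step still yields exactly one index per shard.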
  • 19 SolrCloud Aware: Can ‘inspect’ ZooKeeper to learn about the Solr cluster. What URLs to GoLive to. The schema to use when building indexes. Match hash -> shard assignments of a Solr cluster.
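Matching hash -> shard assignments, as mentioned on slide 19, amounts to looking up a document's hash in a table of per-shard hash ranges. The sketch below assumes equal-width ranges over the signed 32-bit space and uses `zlib.crc32` as a stand-in hash; in a real cluster the ranges would come from the cluster state in ZooKeeper and the hash function would be Solr's own, so the assignments printed here will not match a real deployment.

```python
import zlib

MIN_HASH, MAX_HASH = -2**31, 2**31 - 1


def shard_ranges(num_shards):
    # Split the signed 32-bit hash space into contiguous, equal-width ranges.
    width = 2**32 // num_shards
    ranges = []
    for i in range(num_shards):
        low = MIN_HASH + i * width
        high = MAX_HASH if i == num_shards - 1 else low + width - 1
        ranges.append((f"shard{i + 1}", low, high))
    return ranges


def signed_hash(doc_id):
    # crc32 returns 0..2**32-1, so shift it into the signed 32-bit range.
    return zlib.crc32(doc_id.encode("utf-8")) - 2**31


def shard_for(doc_id, ranges):
    h = signed_hash(doc_id)
    for name, low, high in ranges:
        if low <= h <= high:
            return name
    raise ValueError("hash ranges do not cover the full hash space")


ranges = shard_ranges(2)
assignment = {doc_id: shard_for(doc_id, ranges) for doc_id in ["a", "b", "c"]}
```

An index builder that uses the same ranges as the live cluster produces shards that can be merged in directly, because every document already sits in the shard that would have received it through normal indexing.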
  • 20 GoLive: After building your indexes with map-reduce, how do you deploy them to your Solr cluster? We want it to be easy, so we built the GoLive option. GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster. Paired with the ZooKeeper Aware ability, this allows you to simply point your map-reduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.
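One plausible way to picture the merge step on slide 20 is a CoreAdmin request per shard. Solr's CoreAdmin API does have a MERGEINDEXES action that takes `core` and `indexDir` parameters, but whether GoLive issues exactly these requests is an assumption, and the host names, core names, and HDFS paths below are hypothetical.

```python
from urllib.parse import urlencode


def merge_request_url(solr_base_url, core, index_dir):
    # CoreAdmin MERGEINDEXES merges the index at index_dir into a live core.
    query = urlencode({"action": "MERGEINDEXES", "core": core, "indexDir": index_dir})
    return f"{solr_base_url}/admin/cores?{query}"


# Hypothetical cluster layout, as it might be discovered from ZooKeeper.
shards = [
    {"url": "http://solr1:8983/solr", "core": "collection1_shard1",
     "dir": "hdfs://nn:8020/out/part-00000"},
    {"url": "http://solr2:8983/solr", "core": "collection1_shard2",
     "dir": "hdfs://nn:8020/out/part-00001"},
]
urls = [merge_request_url(s["url"], s["core"], s["dir"]) for s in shards]
```

The sketch only builds the URLs; sending them, committing, and handling failures are left out.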
  • 21 Flume Solr Sync: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. - Apache Flume Website
  • 22 Flume Solr Sync (diagram labels: Logs, Other, Flume Agent, Flume Agent, HDFS, Solr)
  • 23 SolrCloud Aware: Can ‘inspect’ ZooKeeper to learn about the Solr cluster. What URLs to send data to. The schema for the collection being indexed to.
  • 24 HBase Integration: Collaboration between NGData & Cloudera. NGData are creators of the Lily data management platform. Lily HBase Indexer: a service which acts as an HBase replication listener. HBase replication features, such as filtering, supported. Replication updates trigger indexing of updates (rows). Integrates Morphlines library for ETL of rows. AL2 licensed on github https://github.com/ngdata
  • 25 HBase Integration (diagram labels: HDFS, HBase, interactive load, Indexer(s), triggers on updates, five Solr servers)
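The indexer role on slides 24 and 25 (row updates arriving via replication and being turned into index updates) can be sketched as a pure function. The event dictionaries are an invented shape for illustration and are not the Lily HBase Indexer's API; the output follows the add/delete structure of Solr's JSON update format.

```python
def to_solr_updates(events, indexed_columns):
    """Turn replicated row events into Solr-style add/delete operations."""
    updates = []
    for event in events:
        if event["type"] == "delete":
            updates.append({"delete": {"id": event["row"]}})
            continue
        # Filtering: only the configured columns are indexed.
        fields = {c: v for c, v in event["columns"].items() if c in indexed_columns}
        if fields:
            updates.append({"add": {"doc": {"id": event["row"], **fields}}})
    return updates


events = [
    {"type": "put", "row": "row1", "columns": {"title": "Solr on HDFS", "blob": "..."}},
    {"type": "put", "row": "row2", "columns": {"blob": "..."}},
    {"type": "delete", "row": "row3"},
]
updates = to_solr_updates(events, indexed_columns={"title"})
```

A row whose changed columns are all filtered out produces no update at all, which keeps unrelated HBase writes from generating index traffic.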
  • 26 Morphlines: A morphline is a configuration file that allows you to define ETL transformation pipelines. Extract content from input files, transform content, load content (e.g. to Solr). Uses Tika to extract content from a large variety of input documents. Part of the CDK (Cloudera Development Kit).
  • 27 Morphlines (diagram: syslog -> Flume Agent -> Solr Sink [Command: readLine -> Command: grok -> Command: loadSolr] -> Solr): Open source framework for simple ETL. Ships as part of the Cloudera Development Kit (CDK). It’s a Java library. AL2 licensed on github (https://github.com/cloudera/cdk). Similar to Unix pipelines. Configuration over coding. Supports common Hadoop formats: Avro, Sequence file, Text.
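The "similar to Unix pipelines" idea on slide 27 can be sketched as a chain of commands, each taking a record and passing a record on. This is a Python illustration, not the Morphlines Java library: `read_line` and `make_loader` loosely mirror `readLine` and `loadSolr`, `drop_blank` is an invented filter step, and the loader appends to a list instead of talking to Solr.

```python
def read_line(record):
    # Like readLine: expose the raw line as the "message" field.
    return {"message": record["raw"].rstrip("\n")}


def drop_blank(record):
    # Returning None stops processing for this record.
    return record if record["message"].strip() else None


def make_loader(sink):
    # Like loadSolr, but appends to an in-memory list instead of Solr.
    def load(record):
        sink.append(record)
        return record
    return load


def run_pipeline(commands, records):
    for record in records:
        for command in commands:
            record = command(record)
            if record is None:
                break


loaded = []
pipeline = [read_line, drop_blank, make_loader(loaded)]
run_pipeline(pipeline, [{"raw": "first line\n"}, {"raw": "\n"}, {"raw": "second line\n"}])
```

Because the pipeline is just an ordered list, changing the processing means editing the list rather than the commands, which is the "configuration over coding" point from the slide.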
  • 28 Morphlines: Integrate with and load into Apache Solr. Flexible log file analysis. Single-line records, multi-line records, CSV files. Regex based pattern matching and extraction. Integration with Avro. Integration with Apache Hadoop Sequence Files. Integration with SolrCell and all Apache Tika parsers. Auto-detection of MIME types from binary data using Apache Tika.
  • 29 Morphlines: Scripting support for dynamic Java code. Operations on fields for assignment and comparison. Operations on fields with list and set semantics. if-then-else conditionals. A small rules engine (tryRules). String and timestamp conversions. slf4j logging. Yammer metrics and counters. Decompression and unpacking of arbitrarily nested container file formats.
  • 30 Morphlines Example Config
    morphlines : [
      {
        id : morphline1
        importCommands : ["com.cloudera.**", "org.apache.solr.**"]
        commands : [
          { readLine {} }
          {
            grok {
              dictionaryFiles : [/tmp/grok-dictionaries]
              expressions : {
                message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
              }
            }
          }
          { loadSolr {} }
        ]
      }
    ]
    Example Input
    <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.
    Output Record
    syslog_pri:164
    syslog_timestamp:Feb  4 10:46:14
    syslog_hostname:syslog
    syslog_program:sshd
    syslog_pid:607
    syslog_message:listening on 0.0.0.0 port 22.
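To see what the grok expression on slide 30 extracts, it can be approximated with a plain Python regular expression using named groups. The sub-patterns below are simplified stand-ins for the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), not their actual definitions, and the sample line carries the trailing period shown in the slide's output record.

```python
import re

# Each named group corresponds to one %{PATTERN:field} in the grok expression.
SYSLOG = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)

line = "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22."
record = SYSLOG.match(line).groupdict()
for field, value in record.items():
    print(f"{field}:{value}")
```

The escaped brackets around the pid group matter: without the backslashes, `[...]` would be read as a character class instead of the literal `[607]` in the log line.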
  • 31 Hue Integration: Simple UI. Navigated, faceted drill down. Customizable display. Full text search, standard Solr API and query language.
  • 32 Cloudera Search: https://ccp.cloudera.com/display/SUPPORT/Downloads or Google “cloudera-search-download”
  • Mark Miller, Cloudera @heismark