Building a Near Real Time Search Engine & Analytics for Logs Using Solr

Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd

Consolidating and indexing logs so that they can be searched in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Log events are mostly small, around 200 bytes to a few KBs, which makes them harder to handle: the smaller a log event, the more documents there are to index. In this session, we will discuss the challenges we faced and the solutions we developed to overcome them. The talk covers the following items.

Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds.
Tips and techniques used to manage distributed index generation and search across multiple shards.
How choosing a layer-based partition strategy helped us bring down search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.

    Presentation Transcript

    • Building a Near Real Time “Logs Search Engine & Analytics” using Solr. Lucene/Solr Revolution 2013, May 1st, 2013. Rahul Jain (jainr@ivycomptech.com)
    • Who am I? Software Engineer, member of Core Technology @ IVY Comptech, Hyderabad, India. 6 years of programming experience. Areas of expertise/interest: high-traffic web applications, Java/J2EE, big data, NoSQL, information retrieval, machine learning.
    • Agenda: Overview, Indexing, Search, Analytics, Architecture, Lessons Learned, Q&A
    • Overview
    • Issues keep coming in “Production”
      • java.net.ConnectException: Connection refused, ServerNotRunningException, Too many open files, DBException, NullPointerException, OutOfMemory
      • Issues: hidden bugs, DB is down, server crashed, OutOfMemory, connection reset, nodes going out of the cluster (due to long GC pauses), DoS (Denial of Service) attacks from a lot of requests in a short time frame
    • Why Logs Search?
      • Enables the production support team to immediately check for issues in “one place”
        – saves time otherwise spent logging on to multiple servers to check the logs
      • Debugging production issues
        – is it server-specific, or is it occurring on all other servers for that application?
      • Allows tracking user activity across multiple servers/applications
      • Correlation of multiple issues with each other
        – e.g. logins might be failing on node X due to OutOfMemory on node Y
    • Key Problems
      • Hundreds of servers/services generating logs
      • Terabytes of unstructured logs per day to index in near real time
      • Millions of log events (priority one)
      • Full-text search & storage of log content
      • High indexing rate of 1 GB/min
      • Search latency in seconds is acceptable
    • Logs are different
      • Varying size
        – from a few bytes to several KBs
        – hence a larger number of documents
      • On average 6-8 million log messages in 1 GB of logs
        – each line forms one log message, except an “exception stack trace”
      • Different types
        – exception stack traces
        – application logs
        – HTTP access/error logs
        – GC logs
      • Logging format is not uniform across all logs
    • Indexing
    • Improving Indexing Performance
      • Solr in Embedded Mode
      • Bypassing XML Marshalling/Unmarshalling
      • Moving to an Async Approach
      • Route traffic to the Alternate Shard once “Commit” starts on the Main Shard
      • Other optimizations
        – add document does an update (add + delete)
        – changing the buffer size in BufferedIndexInput and BufferedIndexOutput
        – reusing the Lucene document object
    • Old Architecture (diagram): Production Server → Logs Transfer → Centralized Log Collection Server → Solr Servers → Search UI
    • Old Architecture (diagram, continued): the same flow, annotated with the two data copies it involves (Data Copy 1, Data Copy 2)
    • Direct Logs Transfer (diagram): production servers send logs straight to the indexing servers. Open question: since the indexing system is now exposed to production servers, what if a new indexing server is added on the fly, or one of them is down?
    • Solr in Embedded Mode (diagram): on each indexing server, the application and Solr run in a single JVM via SolrJ's EmbeddedSolrServer, so there is no network latency (a sketch follows)
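      A minimal sketch of embedded-mode indexing, assuming a SolrJ version (5.1 or later) where EmbeddedSolrServer can be built straight from a solr home path; the 2013-era code would have gone through CoreContainer, but the effect is the same. The solr home path, core name, and field names are placeholders:

        import java.nio.file.Paths;
        import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
        import org.apache.solr.common.SolrInputDocument;

        public class EmbeddedIndexer {
            public static void main(String[] args) throws Exception {
                // Run a Solr core inside the indexing JVM itself: no HTTP hop, no network latency.
                EmbeddedSolrServer solr =
                        new EmbeddedSolrServer(Paths.get("/opt/solr/home"), "logs");

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "1");
                doc.addField("message", "java.net.ConnectException: Connection refused");
                solr.add(doc);
                solr.commit();
                solr.close();   // shuts down the embedded core container
            }
        }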
    • Improving Indexing Performance (recap): Solr in Embedded Mode; Bypassing XML Marshalling/Unmarshalling; Moving to an Async Approach; Route traffic to the Alternate Shard once “Commit” starts on the Main Shard; Other optimizations
    • Message Flow (diagram): even within a single JVM, a SolrInputDocument is XML-marshalled by UpdateRequest into an <add><doc><field>…</field>…</doc></add> payload and then XML-unmarshalled by XMLLoader back into a new SolrInputDocument object
    • Bypassing XML Marshalling/Unmarshalling (diagram): within the single JVM, the direct reference of the SolrInputDocument object is passed instead, skipping the XML marshalling (UpdateRequest) and unmarshalling (XMLLoader) steps. Custom hooks: LMEmbeddedSolrServer#add(List<SolrInputDocument>) → DocUpdateRequest#add(List<SolrInputDocument>) → DocContentStream#getSolrInputDocuments() → RefDocumentLoader#load()
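      The classes named on the slide (DocContentStream, RefDocumentLoader, DocUpdateRequest, LMEmbeddedSolrServer) are custom, so the following is only a reconstruction of the idea against Solr's ContentStream/ContentStreamLoader extension points, not the original code:

        import java.io.InputStream;
        import java.util.List;

        import org.apache.solr.common.SolrInputDocument;
        import org.apache.solr.common.util.ContentStream;
        import org.apache.solr.common.util.ContentStreamBase;
        import org.apache.solr.handler.loader.ContentStreamLoader;
        import org.apache.solr.request.SolrQueryRequest;
        import org.apache.solr.response.SolrQueryResponse;
        import org.apache.solr.update.AddUpdateCommand;
        import org.apache.solr.update.processor.UpdateRequestProcessor;

        // A ContentStream that carries live SolrInputDocument references instead of XML bytes.
        class DocContentStream extends ContentStreamBase {
            private final List<SolrInputDocument> docs;
            DocContentStream(List<SolrInputDocument> docs) { this.docs = docs; }
            List<SolrInputDocument> getSolrInputDocuments() { return docs; }
            @Override
            public InputStream getStream() {
                throw new UnsupportedOperationException("documents are passed by reference, never serialized");
            }
        }

        // A loader that pulls the documents straight off the stream and feeds the
        // update processor chain, so XMLLoader never runs.
        class RefDocumentLoader extends ContentStreamLoader {
            @Override
            public void load(SolrQueryRequest req, SolrQueryResponse rsp,
                             ContentStream stream, UpdateRequestProcessor processor) throws Exception {
                for (SolrInputDocument doc : ((DocContentStream) stream).getSolrInputDocuments()) {
                    AddUpdateCommand cmd = new AddUpdateCommand(req);
                    cmd.solrDoc = doc;            // reuse the caller's object, no unmarshalling step
                    processor.processAdd(cmd);
                }
            }
        }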
    • Improving Indexing Performance (recap): Solr in Embedded Mode; Bypassing XML Marshalling/Unmarshalling; Moving to an Async Approach; Route traffic to the Alternate Shard once “Commit” starts on the Main Shard; Other optimizations
    • Old Architecture (Sync) (diagram): unstructured incoming messages become structured log events and are collected into a batch of SolrInputDocuments (10K). Once the batch size reaches 10k, one thread from a multi-threaded pool adds the documents to Solr as a sync call and waits for the response.
      • Time taken:
        – indexing one chunk (10k) takes anywhere between 400 ms and 3000 ms#
        – during a commit it is 6000 ms to 23000 ms, and even more
        – in 1 GB there are around 600 chunks
        – so most of the time is just spent waiting for responses
      # Indexing time varies based on several factors, e.g. hardware configuration, application type, nature of the data, number of indexed/stored fields, analyzer type, etc.
    • Moving to an Asynchronous Architecture (diagram): an analyzer thread pool transforms incoming messages into log events and adds batches of them to an event pipeline (a BlockingQueue); an indexer thread pool removes batches from the pipeline, converts each log event to a SolrInputDocument, and adds it to Solr (a sketch follows)
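      A minimal sketch of that pipeline with a bounded BlockingQueue between the analyzer and indexer pools; pool sizes, queue capacity, and class names are illustrative, not the talk's production values:

        import java.util.List;
        import java.util.concurrent.ArrayBlockingQueue;
        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        import org.apache.solr.client.solrj.SolrClient;
        import org.apache.solr.common.SolrInputDocument;

        public class AsyncIndexPipeline {
            // Bounded pipeline between the analyzer pool and the indexer pool;
            // producers block only when the indexers fall far behind.
            private final BlockingQueue<List<SolrInputDocument>> pipeline = new ArrayBlockingQueue<>(100);
            private final ExecutorService indexers = Executors.newFixedThreadPool(4);

            // Called by analyzer threads: add a batch of converted log events and return immediately.
            public void submit(List<SolrInputDocument> batch) throws InterruptedException {
                pipeline.put(batch);
            }

            // Indexer threads drain batches from the pipeline and push them into Solr.
            public void start(SolrClient solr) {
                for (int i = 0; i < 4; i++) {
                    indexers.submit(() -> {
                        while (!Thread.currentThread().isInterrupted()) {
                            try {
                                solr.add(pipeline.take());
                            } catch (InterruptedException e) {
                                Thread.currentThread().interrupt();
                            } catch (Exception e) {
                                e.printStackTrace();   // the real system would retry or dead-letter the batch
                            }
                        }
                    });
                }
            }
        }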
    • Improving Indexing Performance (recap): Solr in Embedded Mode; Bypassing XML Marshalling/Unmarshalling; Moving to an Async Approach; Route traffic to the Alternate Shard once “Commit” starts on the Main Shard; Other optimizations
    • Commit Strategy (diagram): a partition function routes incoming SolrInputDocuments for indexing across the cores of a shard on a single node, named 20130501_0, 20130501_1, 20130501_2
    • Indexing traffic on the Alternate Shard once a “Commit” starts on the Main Shard (diagram): the main shard's cores (20130501_0/1/2) are paired with alternate cores (20130501_3/4/5); while a commit runs on the main shard, the partition function routes new documents to the paired alternate shard
    • Commit Strategy
      • Merits
        – scales well
        – indexing can run continuously
      • De-merits
        – search needs to be done on both cores
        – but at the end of the day the two can be merged into one core
      (a sketch of the routing follows)
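      A sketch of the main/alternate pairing in code; the class and method names are illustrative, and the end-of-day merge of the two cores is left out:

        import java.util.List;
        import java.util.concurrent.atomic.AtomicBoolean;

        import org.apache.solr.client.solrj.SolrClient;
        import org.apache.solr.common.SolrInputDocument;

        final class ShardPair {
            private final SolrClient main;        // e.g. core 20130501_0
            private final SolrClient alternate;   // its pair, e.g. core 20130501_3
            private final AtomicBoolean committing = new AtomicBoolean(false);

            ShardPair(SolrClient main, SolrClient alternate) {
                this.main = main;
                this.alternate = alternate;
            }

            // Route each batch around whichever core is currently paying the commit cost.
            void add(List<SolrInputDocument> batch) throws Exception {
                (committing.get() ? alternate : main).add(batch);
            }

            void commitMain() throws Exception {
                committing.set(true);      // divert indexing traffic to the alternate core
                try {
                    main.commit();
                } finally {
                    committing.set(false); // resume indexing on the main core
                }
            }
        }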
    • Improving Indexing Performance (recap): Solr in Embedded Mode; Bypassing XML Marshalling/Unmarshalling; Moving to an Async Approach; Route traffic to the Alternate Shard once “Commit” starts on the Main Shard; Other optimizations
    • Other Optimizations
      • In Solr, adding a document does an update (add + delete)
        – for each add-document call, Solr internally creates a delete term on the “id” field
        – but log messages are always unique
      • Changing the buffer size in BufferedIndexInput and BufferedIndexOutput
        – increasing the buffer size improves indexing performance, especially if the disk is slow
        – more process heap is required accordingly, as a lot of files are created when the data volume is high
      • Reusing Lucene Document and Field instances (a sketch follows)
        – see org/apache/lucene/benchmark/byTask/feeds/DocMaker.java
      • More on improving indexing performance: http://rahuldausa.wordpress.com/2013/01/14/scaling-lucene-for-indexing-a-billion-documents/
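      A minimal sketch of the Document/Field reuse pattern that DocMaker demonstrates, using current Lucene field types; the field names are illustrative:

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.StringField;
        import org.apache.lucene.document.TextField;

        // Allocate Document/Field once per indexing thread and only swap the
        // values for each log event instead of building new objects every time.
        final class ReusableDoc {
            private final Document doc = new Document();
            private final Field id = new StringField("id", "", Field.Store.YES);
            private final Field message = new TextField("message", "", Field.Store.YES);

            ReusableDoc() {
                doc.add(id);
                doc.add(message);
            }

            Document fill(String idValue, String messageValue) {
                id.setStringValue(idValue);          // mutate in place, no new Field allocations
                message.setStringValue(messageValue);
                return doc;
            }
        }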
    • The Result
    • Data Volume vs. Indexing Time (chart, GB/minutes): before-and-after indexing times for 1 GB, 4 GB, 8 GB, 17 GB and 35 GB of data
    • Search
    • Partition
      • Partitioning the data properly improves search performance significantly
      • Partition types
        – Server-based partition: the number of documents does not balance out evenly across all shards
        – Date-and-time-based partition: hotspots a single shard
        – Least-loaded shard (by number of documents): balances documents evenly across all shards, but cannot provide optimal search performance, as all shards need to be hit
      • Hybrid approach (diagram): incoming message → server-based partition → date & time based partition → Solr shard
    • Multi-tier Partition (diagram): incoming messages from production servers (e.g. jacob: {message: hello lucene, time: 20130501:11:00:00}, mia: {message: hello solr and lucene, time: 20130501:04:00:00}) are routed on the indexing server by a server-based partition and then a date & time based partition to a Solr shard named date_hour_shardId (e.g. 20130501_00_0, 20130501_00_1, 20130501_06_0)
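      A sketch of that two-level routing reduced to a pure function; the hash-based server slot and the class name are assumptions, only the date_hour_shardId naming comes from the slide:

        import java.time.LocalDateTime;
        import java.time.format.DateTimeFormatter;

        final class ShardRouter {
            private static final DateTimeFormatter DATE_HOUR = DateTimeFormatter.ofPattern("yyyyMMdd_HH");
            private final int shardsPerHour;

            ShardRouter(int shardsPerHour) { this.shardsPerHour = shardsPerHour; }

            // First tier: the event's date and hour. Second tier: a slot derived from the server name.
            String shardFor(String serverName, LocalDateTime eventTime) {
                int serverSlot = Math.floorMod(serverName.hashCode(), shardsPerHour);
                return eventTime.format(DATE_HOUR) + "_" + serverSlot;   // e.g. 20130501_06_0
            }
        }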
    • Distributed Search
      • One shard is chosen as the leader shard
        – forwards the request to all other shards and collects the responses
      • Requires every document to have a unique key (e.g. “id”) across all shards, and it should be stored
        – we used an epoch-based approach to generate a unique id across the cluster, inspired by the Instagram engineering blog#
      • Unique id: a combination of epoch time, a unique node id, and an incremented number (see the sketch below)
      # http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
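      A minimal sketch of such an Instagram-style id, assuming the common 41/10/12-bit split; the custom epoch, bit widths, and class name are assumptions, not the talk's exact layout:

        import java.util.concurrent.atomic.AtomicLong;

        final class LogEventIdGenerator {
            private static final long CUSTOM_EPOCH = 1356998400000L; // 2013-01-01, an illustrative epoch
            private final long nodeId;                                // unique per indexing server (0..1023)
            private final AtomicLong counter = new AtomicLong();

            LogEventIdGenerator(long nodeId) { this.nodeId = nodeId; }

            long nextId() {
                long millis = System.currentTimeMillis() - CUSTOM_EPOCH;
                long sequence = counter.getAndIncrement() & 0xFFF;    // 12-bit rolling counter
                return (millis << 22) | (nodeId << 12) | sequence;    // 41 + 10 + 12 bits
            }
        }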
    • How Search Works (diagram): the indexing server (maestro) pushes its shard mapping to Zookeeper (zk); the search server (Tomcat) creates a watcher on the zk node and updates its in-memory shard mapping on change. A user query goes through the query parser, which looks up the in-memory shard mapping and sends the search query with the shards parameter to the indexing server
    • How Search Works (cont’d) (diagram): queries such as “from: now-24hour”, “server: jacob, from: now-4hour”, and “from: now-11hour” are resolved by the leader shard (maestro) to the relevant indexing servers: a lookup on the shards for today, the shard(s) having data for jacob from the last 6-hour shard, and the shard(s) having data for the last 12 hours
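      In SolrJ terms, the leader fan-out boils down to a query carrying a shards parameter. A sketch assuming SolrJ 6+ (HttpSolrClient.Builder); the host names, core names, and field names are illustrative:

        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.impl.HttpSolrClient;
        import org.apache.solr.client.solrj.response.QueryResponse;

        public class DistributedLogSearch {
            public static void main(String[] args) throws Exception {
                // The leader core fans the query out to the cores named in the shards parameter.
                try (HttpSolrClient leader =
                         new HttpSolrClient.Builder("http://indexer1:8983/solr/20130501_06_0").build()) {
                    SolrQuery query = new SolrQuery("message:OutOfMemory AND server:jacob");
                    query.set("shards",
                              "indexer1:8983/solr/20130501_06_0,indexer2:8983/solr/20130501_06_1");
                    QueryResponse rsp = leader.query(query);
                    System.out.println("hits: " + rsp.getResults().getNumFound());
                }
            }
        }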
    • Analytics
      • Young GC timings/chart
      • Full GC timings
      • DB access/update timings
        – reveals whether there is any pattern across all DB servers
      • Real-time exceptions/issues reporting using facet queries (a sample query follows)
      • Apache access/error KPIs
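      The real-time exception report can be driven by a facet query along these lines; the field names (timestamp, exception_class) are assumptions about the schema, not fields named in the talk:

        import org.apache.solr.client.solrj.SolrQuery;

        final class ExceptionReport {
            // Count each exception class seen in the last hour; only the facet counts are needed.
            static SolrQuery lastHourByException() {
                SolrQuery q = new SolrQuery("*:*");
                q.addFilterQuery("timestamp:[NOW-1HOUR TO NOW]");
                q.setRows(0);                        // no documents, just counts
                q.addFacetField("exception_class");  // e.g. NullPointerException, OutOfMemoryError
                q.setFacetMinCount(1);
                return q;
            }
        }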
    • Analytics
      • Custom report based on a “key:value” pair, e.g. time - key:value
        – 18:28:28,541 - activeThreadCount:5
        – 18:28:29,541 - activeThreadCount:8
        – 18:28:30,541 - activeThreadCount:9
        – 18:28:31,541 - activeThreadCount:3
      • (chart: activeThreadCount over time)
    • Architecture
    • Data Flows (diagram): log files are periodically pushed to a Zero Copy server, while in-memory log events are transferred in real time through a Kafka broker (via a Log4j appender); both paths feed the indexing servers behind the Search UI
      • Zero Copy server
        – deployed on each indexing server for data locality
        – writes incoming files to disk, as the indexing server does not index at the same rate
      • Kafka broker
        – the Kafka appender passes the messages from in-memory
    • Periodic Push (diagram): on each production server (node 1 … node n), a logs-transfer daemon pushes log files to the Zero Copy server on an indexing server, which writes them to disk; Zookeeper tracks the indexing servers
    • Real Time Transfer (diagram): production servers (with a Kafka appender) publish log events to the Kafka broker; multiple indexing servers consume them, updating the consumed message offset in Zookeeper, and serve the Search UI (a sketch of the appender follows)
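      A hedged sketch of that appender-to-broker hop. The talk predates today's Kafka client, so this uses the current KafkaProducer API, and the broker address, topic name, and class name are illustrative:

        import java.util.Properties;

        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.log4j.AppenderSkeleton;
        import org.apache.log4j.spi.LoggingEvent;

        public class KafkaLog4jAppender extends AppenderSkeleton {
            private KafkaProducer<String, String> producer;

            @Override
            public void activateOptions() {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka-broker:9092"); // illustrative broker address
                props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                producer = new KafkaProducer<>(props);
            }

            @Override
            protected void append(LoggingEvent event) {
                // Fire-and-forget: the application thread never waits on the indexing path.
                String message = layout != null ? layout.format(event) : event.getRenderedMessage();
                producer.send(new ProducerRecord<>("app-logs", message));
            }

            @Override
            public void close() {
                if (producer != null) {
                    producer.close();
                }
            }

            @Override
            public boolean requiresLayout() {
                return false;
            }
        }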
    • Conclusion / Lessons Learned
      • Always find sweet spots for
        – the number of indexer threads that can run in parallel
        – merge factor, commit interval, ramBufferSize (randomize and measure)
        – cache size: increasing it helps bring down search latency, but with a Full GC penalty
      • An index size of more than 5 GB in one core does not go well with search
      • Search across a lot of cores does not provide optimal response time
        – the overall query response time is limited by the slowest shard's performance
      • Solr scales both vertically and horizontally
      • Batch log messages by message size (~10 KB) in a MessageSet
        – Kafka adds 10 bytes to each message
        – most of the time log messages are < 100 bytes
      (a sketch of the Lucene-level tuning knobs follows this slide)
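      The sweet-spot knobs above map to a handful of Lucene settings. A minimal sketch, assuming a recent Lucene where IndexWriterConfig takes just an Analyzer; the concrete values are illustrative, not the production settings from the talk:

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.index.LogByteSizeMergePolicy;

        final class IndexTuning {
            static IndexWriterConfig tunedConfig() {
                IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
                cfg.setRAMBufferSizeMB(256);            // flush less often than the small default buffer

                LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
                mp.setMergeFactor(20);                  // fewer, larger merges while bulk indexing
                cfg.setMergePolicy(mp);
                return cfg;
            }
        }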
    • Thank You! jainr@ivycomptech.com