Building a Near Real Time Search Engine & Analytics for Logs Using Solr

Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd

Consolidating and indexing logs so that they can be searched in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Log events are mostly small, around 200 bytes to a few KBs, which makes them harder to handle: the smaller the log event, the greater the number of documents to index. In this session, we will discuss the challenges we faced and the solutions we developed to overcome them. The talk covers the following items:

Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds.
Tips and techniques used to manage distributed index generation and search across multiple shards.
How choosing a layer-based partition strategy helped us bring down search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.

Transcript of "Building a Near Real Time Search Engine & Analytics for Logs Using Solr"

1. Building a Near Real Time "Logs Search Engine & Analytics" Using Solr
   Lucene/Solr Revolution 2013, May 1st, 2013
   Rahul Jain (jainr@ivycomptech.com)
2. Who am I?
   • Software Engineer
   • Member of Core Technology @ IVY Comptech, Hyderabad, India
   • 6 years of programming experience
   • Areas of expertise/interest:
     – High-traffic web applications
     – Java/J2EE
     – Big data, NoSQL
     – Information retrieval, machine learning
3. Agenda
   • Overview
   • Indexing
   • Search
   • Analytics
   • Architecture
   • Lessons learned
   • Q&A
4. Overview
5. Issues keep coming in "Production"
   java.net.ConnectException: Connection refused • ServerNotRunningException • Too many open files • DBException • NullPointerException • OutOfMemory
   Issues:
   • Hidden bugs
   • DB is down
   • Server crashed
   • OutOfMemory
   • Connection reset
   • Nodes go out of the cluster (due to long GC pauses)
   • DoS (Denial of Service) attacks: a lot of requests sent in a short time frame
6. Why Logs Search?
   • Enables the production support team to immediately check for issues in "one place"
     – Saves the time spent logging on to multiple servers to check the logs
   • Debugging production issues
     – Is the issue server-specific, or is it occurring on all other servers for that application?
   • Allows user activity to be tracked across multiple servers/applications
   • Correlation of multiple issues with each other
     – e.g. logins might be failing on node X due to an OutOfMemory on node Y
7. Key Problems
   • Hundreds of servers/services generating logs
   • Terabytes of unstructured logs per day to index in near real time
   • Millions of log events (priority one)
   • Full-text search and storage of log content
   • High indexing rate of 1 GB/min
   • Search latency in seconds is acceptable
8. Logs are different
   • Varying size
     – From a few bytes to several KBs
     – The smaller the events, the greater the number of documents
   • On average, 6-8 million log messages in 1 GB of logs
     – Each line forms one log message, except for exception stack traces
   • Different types
     – Exception stack traces
     – Application logs
     – HTTP access/error logs
     – GC logs
   • Logging format is not uniform across all logs
9. Indexing
10. Improving Indexing Performance
    • Solr in embedded mode
    • Bypassing XML marshalling/unmarshalling
    • Moving to an async approach
    • Routing traffic to an alternate shard once a "commit" starts on the main shard
    • Other optimizations:
      – Add document does update (add + delete)
      – Changing the buffer size in BufferedIndexInput and BufferedIndexOutput
      – Reusing Lucene document objects
11. Old Architecture
    (Diagram: production servers transfer logs to a centralized log collection server, which feeds the Solr servers behind the search UI.)
12. Old Architecture
    (Same diagram, highlighting that the logs are copied twice: once to the centralized log collection server, and again to the Solr servers.)
13. Direct Logs Transfer
    (Diagram: production servers now send logs straight to the indexing servers, with no centralized collection server in between.)
    Open question: since the indexing system is now exposed to the production servers, what happens if a new indexing server is added on the fly, or one of them goes down?
14. Solr in Embedded Mode
    (Diagram: on each indexing server, the application and Solr run in a single JVM, talking through SolrJ's EmbeddedSolrServer.)
    No network latency.
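Embedded mode is plain SolrJ; here is a minimal sketch of what it looks like with the Solr 4.x API of the era (the solr home path and the core name "logs" are placeholders, not from the talk):

```java
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedIndexer {
    public static void main(String[] args) throws Exception {
        // Boot Solr inside this JVM: no HTTP hop, no network latency.
        CoreContainer container = new CoreContainer("/opt/solr-home"); // placeholder path
        container.load();
        EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "logs"); // placeholder core

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("message", "java.net.ConnectException: Connection refused");
        solr.add(doc);    // an in-process method call, not a remote request
        solr.commit();
        solr.shutdown();
    }
}
```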
15. Improving Indexing Performance (same checklist as slide 10)
16. Message Flow
    (Diagram: within the single JVM, each SolrInputDocument is XML-marshalled by UpdateRequest into <add><doc><field>...</field></doc></add>, then XML-unmarshalled by XMLLoader back into a new SolrInputDocument object.)
17. Bypassing XML Marshalling/Unmarshalling
    Pass the direct reference of the SolrInputDocument object instead, so the receiving side gets the referenced object rather than a new one. Custom classes:
    • LMEmbeddedSolrServer#add(List<SolrInputDocument>)
    • DocUpdateRequest#add(List<SolrInputDocument>)
    • DocContentStream#getSolrInputDocuments()
    • RefDocumentLoader#load()
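Only the class and method names above come from the slide; their bodies are not shown. A hedged sketch of the core trick, assuming a ContentStream subclass that carries the documents by reference so the loader can skip XML parsing entirely:

```java
import java.io.InputStream;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.ContentStreamBase;

// Guessed implementation: the stream never serializes anything; it just
// hands the original SolrInputDocument objects to the custom loader.
public class DocContentStream extends ContentStreamBase {
    private final List<SolrInputDocument> docs;

    public DocContentStream(List<SolrInputDocument> docs) {
        this.docs = docs;
    }

    // The custom loader (RefDocumentLoader in the slides) would call this
    // instead of getStream(), receiving the documents with zero copying.
    public List<SolrInputDocument> getSolrInputDocuments() {
        return docs;
    }

    @Override
    public InputStream getStream() {
        throw new UnsupportedOperationException("documents are passed by reference");
    }
}
```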
18. Improving Indexing Performance (same checklist as slide 10)
19. Old Architecture (Sync)
    Incoming message → LogEvent (unstructured → structured) → batch of SolrInputDocuments (10K) → Solr.
    A thread pool with multiple threads; once the batch size reaches 10K, one of the threads adds the documents to Solr as a sync call and waits for the response.
    Time taken:
    • Indexing one chunk (10K docs) takes anywhere between 400 ms and 3,000 ms#
    • During a commit it is 6,000-23,000 ms, and even more
    • 1 GB contains around 600 chunks, so most of the time is just spent waiting for responses
    # Indexing time varies based on several factors, e.g. hardware configuration, application type, nature of data, number of indexed/stored fields, analyzer type, etc.
20. Moving to an Asynchronous Architecture
    Incoming message → batch of LogEvents added to an event pipeline (BlockingQueue).
    • An analyzer thread pool performs the log message transformation
    • An indexer thread pool removes batches from the pipeline, converts each LogEvent to a SolrInputDocument, and adds it to Solr
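A minimal sketch of such a pipeline, assuming a BlockingQueue between producers and a fixed indexer pool (queue capacity, pool size, and field names are placeholders):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AsyncIndexPipeline {
    private final BlockingQueue<List<String>> pipeline = new ArrayBlockingQueue<>(1000);
    private final ExecutorService indexers = Executors.newFixedThreadPool(4);
    private final SolrServer solr; // e.g. the EmbeddedSolrServer from slide 14

    public AsyncIndexPipeline(SolrServer solr) {
        this.solr = solr;
        for (int i = 0; i < 4; i++) {
            indexers.submit(this::drainLoop);
        }
    }

    // Producer side: returns as soon as the batch is queued; no caller
    // waits for Solr's response on this thread any more.
    public void submit(List<String> logEventBatch) throws InterruptedException {
        pipeline.put(logEventBatch);
    }

    private void drainLoop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                List<String> batch = pipeline.take();   // blocks until work arrives
                List<SolrInputDocument> docs = new ArrayList<>(batch.size());
                for (String line : batch) {
                    docs.add(toDoc(line));              // unstructured -> structured
                }
                solr.add(docs);                         // only indexer threads wait here
            }
        } catch (Exception e) {
            // real code would log and keep the pipeline alive
        }
    }

    private SolrInputDocument toDoc(String line) {
        SolrInputDocument d = new SolrInputDocument();
        d.addField("message", line);
        return d;
    }
}
```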
21. Improving Indexing Performance (same checklist as slide 10)
22. Commit Strategy
    (Diagram: a partition function routes each SolrInputDocument to one of the cores 20130501_0, 20130501_1, 20130501_2 on a single-node shard.)
23. Indexing Traffic on the Alternate Shard
    Once a commit starts on the "main shard" (cores 20130501_0/1/2), the partition function routes indexing traffic to its paired "alternate shard" (cores 20130501_3/4/5) on the same single node.
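A hedged sketch of the pairing logic; the class and the flag are hypothetical, the slides only show the routing behavior:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Each main core has a paired alternate core; while a commit runs on the
// main core, new documents are diverted to its pair so indexing never stalls.
public class ShardPair {
    private final SolrServer main;
    private final SolrServer alternate;
    private final AtomicBoolean committing = new AtomicBoolean(false);

    public ShardPair(SolrServer main, SolrServer alternate) {
        this.main = main;
        this.alternate = alternate;
    }

    public void add(List<SolrInputDocument> docs) throws Exception {
        // Route to the alternate core only while the main core is committing.
        (committing.get() ? alternate : main).add(docs);
    }

    public void commitMain() throws Exception {
        committing.set(true);      // divert traffic first
        try {
            main.commit();         // slow: 6-23 s per slide 19
        } finally {
            committing.set(false);
        }
    }
}
```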
24. Commit Strategy
    • Merits
      – Scales well
      – Indexing can run continuously
    • De-merits
      – Search needs to be done on both cores
      – But at the end of the day the two can be merged into one core
25. Improving Indexing Performance (same checklist as slide 10)
26. Other Optimizations
    • In Solr, add document does update (add + delete)
      – For each add-document call, Solr internally creates a delete term on the "id" field
      – But log messages are always unique, so the delete is unnecessary
    • Changing the buffer size in BufferedIndexInput and BufferedIndexOutput
      – Increasing the buffer size improves indexing performance, especially if the disk is slow
      – More process heap is required accordingly, as a lot of files are created when the data volume is high
    • Reusing Lucene Document and Field instances (see the sketch below)
      – See org/apache/lucene/benchmark/byTask/feeds/DocMaker.java
    • For more on improving indexing performance, see
      http://rahuldausa.wordpress.com/2013/01/14/scaling-lucene-for-indexing-a-billion-documents/
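A sketch of the DocMaker-style reuse, assuming Lucene 4.x field types: allocate the Document and Field objects once per indexing thread and only swap the values:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ReusableDoc {
    private final Document doc = new Document();
    private final Field id = new StringField("id", "", Field.Store.YES);
    private final Field message = new TextField("message", "", Field.Store.YES);

    public ReusableDoc() {
        doc.add(id);
        doc.add(message);
    }

    public void index(IndexWriter writer, String docId, String msg) throws IOException {
        id.setStringValue(docId);      // mutate in place instead of new Field(...)
        message.setStringValue(msg);   // avoids per-document garbage at 6-8M docs/GB
        writer.addDocument(doc);       // the writer consumes the values during add
    }
}
```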
27. The Result
28. Data Volume vs. Indexing Time (GB / minutes)
    Data volume:   1 GB   4 GB   8 GB   17 GB   35 GB
    Before (min):  3      14     38     56      112
    After  (min):  0.5    2      4.5    9       22
29. Search
30. Partition
    Partitioning the data properly improves search performance significantly.
    Partition types:
    • Server-based partition
      – The number of documents does not balance out evenly across shards
    • Date-and-time-based partition
      – Hotspots a single shard
    • Least-loaded shard (index), by number of documents
      – Balances documents evenly across all shards
      – Cannot provide optimal search performance, as all shards need to be hit
    Hybrid approach: incoming message → server-based partition → date & time-based partition → Solr shard.
31. Multi-tier Partition
    (Diagram: an incoming message from a production server, e.g. jacob: {message: "hello lucene", time: 20130501 11:00:00} or mia: {message: "hello solr and lucene", time: 20130501 04:00:00}, is routed first by the server-based partition, then by the date & time-based partition, to a Solr shard named date_hour_shardId, e.g. 20130501_00_0, 20130501_06_0, 20130501_00_1.)
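A hypothetical sketch of the two-tier routing; the shard count, the hash choice, and the 6-hour time bucket (suggested by the _00/_06 core names on the slide) are all assumptions:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Produces core names of the form date_hour_shardId, e.g. 20130501_06_0.
    public String coreFor(String serverName, Date eventTime) {
        // Tier 1: hash the source server to a stable shard id (jacob -> 0, mia -> 1, ...).
        int shardId = (serverName.hashCode() & 0x7fffffff) % numShards;
        // Tier 2: bucket by date and a 6-hour window (assumed window size).
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        SimpleDateFormat hour = new SimpleDateFormat("HH");
        int bucket = (Integer.parseInt(hour.format(eventTime)) / 6) * 6;
        return String.format("%s_%02d_%d", day.format(eventTime), bucket, shardId);
    }
}
```

With this scheme, mia's 04:00 event lands in a 20130501_00_* core and jacob's 11:00 event in a 20130501_06_* core, matching the slide's example names.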
32. Distributed Search
    • One shard is chosen as the leader shard
      – It forwards the request to all other shards and collects the responses
    • Requires every document to have a unique key (e.g. "id") across all shards, and it should be stored
      – We used an epoch-based approach to generate a unique id across the cluster, inspired by the Instagram engineering blog#
    • Unique id: a combination of epoch time, a unique node id, and an incremented number
      epoch_time | unique_node_id | incremented_number
    # http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
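A sketch of such an id generator; the 64-bit layout (41 bits of epoch millis, 10 bits of node id, 13 bits of sequence) follows the Instagram post and is an assumption, not the talk's exact layout:

```java
import java.util.concurrent.atomic.AtomicLong;

public class UniqueIdGenerator {
    private static final long CUSTOM_EPOCH = 1356998400000L; // 2013-01-01 UTC, assumed
    private final long nodeId;                               // 0..1023
    private final AtomicLong sequence = new AtomicLong();

    public UniqueIdGenerator(long nodeId) {
        this.nodeId = nodeId & 0x3FF;
    }

    // time (41 bits) | node id (10 bits) | sequence (13 bits)
    public long nextId() {
        long time = System.currentTimeMillis() - CUSTOM_EPOCH;
        long seq = sequence.getAndIncrement() & 0x1FFF; // wraps past 8192 ids per ms
        return (time << 23) | (nodeId << 13) | seq;
    }
}
```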
33. How Search Works
    • The indexing server (maestro) pushes the shard mapping to ZooKeeper (zk)
    • The search server (Tomcat) creates a watcher on the zk node and updates its in-memory shard mapping on change
    • User query → query parser → lookup in the shard mapping (in-memory structure) → search query with the shards parameter sent to the indexing servers
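On the search-server side, the watcher is a few lines of plain ZooKeeper API; a sketch, where the znode path and the mapping format are assumptions:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ShardMappingWatcher implements Watcher {
    private static final String PATH = "/logsearch/shards"; // hypothetical znode
    private final ZooKeeper zk;
    private volatile byte[] mapping; // the in-memory shard mapping, raw form

    public ShardMappingWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        refresh();
    }

    // ZooKeeper watches fire once, so we re-register on every read.
    private void refresh() throws Exception {
        mapping = zk.getData(PATH, this, null);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                refresh(); // the indexing servers pushed a new mapping
            } catch (Exception e) {
                // real code would retry with backoff and log
            }
        }
    }
}
```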
34. How Search Works (Cont'd)
    (Diagram: the leader shard (maestro) resolves example queries against the shard mapping and fans out only to the matching indexing servers:
    • "from: now-24hour" → lookup on the shards for today
    • "server: jacob, from: now-4hour" → the shard(s) having data for jacob from the last 6 hours
    • "from: now-11hour" → the shard(s) having data for the last 12 hours)
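Once the lookup has resolved a time range to a core list, the fan-out is Solr's standard distributed search via the shards parameter; a sketch with made-up host and core names:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuery {
    public static void main(String[] args) throws Exception {
        // Any core can play leader; it forwards to everything in "shards".
        SolrServer leader = new HttpSolrServer("http://indexer1:8983/solr/20130501_06_0");

        SolrQuery q = new SolrQuery("message:OutOfMemory AND server:jacob");
        q.set("shards",
              "indexer1:8983/solr/20130501_06_0," +
              "indexer2:8983/solr/20130501_06_1");

        QueryResponse rsp = leader.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}
```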
35. Analytics
    • Young GC timings/chart
    • Full GC timings
    • DB access/update timings
      – Reveals whether there is any pattern across all DB servers
    • Real-time exception/issue reporting using facet queries
    • Apache access/error KPIs
36. Analytics
    Custom reports based on "key:value" pairs, e.g. (time - key:value):
    18:28:28, 541 - activeThreadCount:5
    18:28:29, 541 - activeThreadCount:8
    18:28:30, 541 - activeThreadCount:9
    18:28:31, 541 - activeThreadCount:3
    (Chart: activeThreadCount plotted over time on a 0-10 scale.)
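A sketch of the extraction that could feed such a report; the log layout is assumed from the example lines above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueExtractor {
    // "18:28:28, 541 - activeThreadCount:5" -> (time, millis, key, value)
    private static final Pattern KV =
        Pattern.compile("^(\\d{2}:\\d{2}:\\d{2}),\\s*(\\d+)\\s*-\\s*(\\w+):(\\d+)$");

    public static void main(String[] args) {
        Matcher m = KV.matcher("18:28:28, 541 - activeThreadCount:5");
        if (m.matches()) {
            String time = m.group(1);                  // 18:28:28
            String key = m.group(3);                   // activeThreadCount
            int value = Integer.parseInt(m.group(4));  // 5
            // index (time, key, value) as fields; a facet/stats query then
            // charts the value over time, as on this slide
            System.out.println(time + " " + key + "=" + value);
        }
    }
}
```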
37. Architecture
38. Data Flows
    (Diagram: log files reach the indexing servers over two paths: a periodic push through a zero-copy server, and a real-time transfer from memory via a Log4j appender through a Kafka broker; the indexing servers feed the search UI.)
    • Zero-copy server
      – Deployed on each indexing server for data locality
      – Writes incoming files to disk, as the indexing server doesn't index at the same rate
    • Kafka broker
      – The Kafka appender passes the messages from memory
39. Periodic Push
    (Diagram: a logs-transfer daemon on each production server pushes logs to the zero-copy servers (node 1..n), which write them to disk for the indexing servers; ZooKeeper coordinates the nodes.)
40. Real Time Transfer
    (Diagram: the production servers (Kafka appender) publish log events to the Kafka broker; the indexing servers consume them, updating the consumed-message offset in ZooKeeper, and feed the search UI.)
41. Conclusion: Lessons Learned
    • Always find sweet spots for (see the tuning sketch after this list)
      – The number of indexer threads that can run in parallel
      – Randomize:
        • Merge factor
        • Commit interval
        • ramBufferSize
      – Increasing the cache size helps bring down search latency
        • But with a full-GC penalty
    • An index of more than 5 GB in one core does not go well with search
    • Search across a lot of cores does not provide optimal response times
      – The overall query response time is limited by the slowest shard's performance
    • Solr scales both vertically and horizontally
    • Batch log messages by size (~10 KB) into a MessageSet
      – Kafka adds 10 bytes to each message
      – Most log messages are < 100 bytes
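For reference, a sketch of where those knobs live when driving Lucene 4.x directly; the values are placeholders, not the sweet spots the talk found:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.util.Version;

public class IndexTuning {
    public static IndexWriterConfig tunedConfig() {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
        cfg.setRAMBufferSizeMB(256);            // ramBufferSize: placeholder value

        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(10);                  // merge factor: placeholder value
        cfg.setMergePolicy(mp);

        // The commit interval is not set here: in this design, commits are
        // driven by the scheduler that also flips traffic to the alternate
        // shard (slide 23).
        return cfg;
    }
}
```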
42. Thank You
    jainr@ivycomptech.com
