
Realtimestream and realtime fastcatsearch


  1. Realtimestream and realtime search engine
     Sang Song, fastcatsearch.org
  2. About Me
     • www.linkedin.com/profile/view?id=295484775
     • facebook.com/songaal
     • swsong@websqrd.com
  3. Agenda
     • Introduction
     • Search Architecture
     • Realtime Indexing
  4. Introduction
  5. Goal
     • Like Splunk
     • Index streaming log data
     • Search log data in real time
  6. Big data
     • Data sets too large and complex for a database
     • Difficult to process using traditional data processing
     • 3Vs
       • Volume: large quantity of data
       • Variety: diverse set of data
       • Velocity: speed of data
     Source: Wikipedia
  7. About Fastcatsearch
     • Distributed system
     • Fast indexing
     • Fast queries
     • Popular-keyword feature
     • GS certification
     • 70+ references
     • Open source
     • Multi-platform
     • Easy web management tool
     • Dictionary management
     • Plugin extension
  8. References
  9. History
     • Fastcatsearch v1 (2010-2011)
       • Single machine
       • <150 QPS
     • Fastcatsearch v2 (2013-now)
       • Distributed system
       • Multi-collection result aggregation
       • 200+ queries per second
     • Fastcatsearch v3 (alpha)
       • Realtime indexing/searching
       • Schema-free
       • Shard/replica
       • Geospatial search
 10. Search Architecture
 11. (architecture diagram)
 12. Realtime Indexing
 13. Store log data
     • HDFS
       • Write-once static files
     • Flume
       • Collecting, aggregating, and moving large amounts of log data
 14. (diagram)
 15. Flume config

     agent1.sources = r1
     agent1.sinks = hdfssink
     agent1.channels = c1

     agent1.sources.r1.type = netcat
     agent1.sources.r1.bind = localhost
     agent1.sources.r1.port = 44443

     agent1.sinks.hdfssink.type = hdfs
     agent1.sinks.hdfssink.hdfs.path = hdfs://192.168.189.173:9000/flume/events
     agent1.sinks.hdfssink.hdfs.fileType = SequenceFile #DataStream
     agent1.sinks.hdfssink.hdfs.writeFormat = Text
     agent1.sinks.hdfssink.hdfs.batchSize = 10

     agent1.channels.c1.type = memory
     agent1.channels.c1.capacity = 1000
     agent1.channels.c1.transactionCapacity = 100

     agent1.sources.r1.channels = c1
     agent1.sinks.hdfssink.channel = c1

     $ ./flume-ng agent -f /home/swsong/flume/conf/flume.conf -n agent1
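A netcat source like the one configured above listens on a plain TCP port and treats each newline-terminated line as one event. As a rough sketch of how a client could feed it, here is a minimal sender; `NetcatClient` and `sendEvent` are hypothetical names for illustration, not part of Flume's API:

```java
import java.io.PrintWriter;
import java.net.Socket;

// Hypothetical sketch: write newline-terminated lines to the TCP port
// that the Flume netcat source is bound to (localhost:44443 above).
class NetcatClient {
    static void sendEvent(String host, int port, String line) throws Exception {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println(line); // one line = one Flume event
        }
    }
}
```

With the agent running, `NetcatClient.sendEvent("localhost", 44443, "my log line")` would push one event through the memory channel into the HDFS sink.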
 16. Flume append?
 17. Fastcatsearch
     (diagram: HDFS, Indexer, Merger, Segments, Searcher, Index File)
     Issues:
     - Segment file commit
     - Doc deletion
 18. Import using Flume

     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(URI.create(uriPath), conf);
     FileStatus[] status = fs.listStatus(new Path(dirPath));
     for (int i = 0; i < status.length; i++) {
         SequenceFile.Reader.Option opt = SequenceFile.Reader.file(status[i].getPath());
         SequenceFile.Reader reader = new SequenceFile.Reader(conf, opt);
         Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
         Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
         while (reader.next(key, value)) {
             // parseEvent() turns one SequenceFile record into a document map
             Map<String, Object> parsedEvent = parseEvent(key.toString(), value.toString());
             if (parsedEvent != null) {
                 eventQueue.add(parsedEvent);
             }
         }
         reader.close();
     }
 19. Making an index segment
     • An index has multiple segments
     • Document writer
     • Index writers
       • Search index writer
       • Field index writer
       • Group index writer
 20. Segment commit issue
     • Update / delete documents
       • No in-place update
       • An update is an append plus a delete
     • Deletion against previous segments
       • Mark as deleted
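The mark-as-deleted scheme can be sketched as a segment whose document store is append-only plus a deletion bitmap; an update appends the new version and flags the old doc id. This is a minimal illustration, the class and method names are hypothetical and not Fastcatsearch's actual code:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an append-only segment with a deletion bitmap.
// Documents are never rewritten in place; deletion just sets a flag.
class Segment {
    final Map<String, Integer> docIdByKey = new HashMap<>(); // key -> local doc id
    final BitSet deleted = new BitSet();                     // mark-as-deleted flags
    int nextDocId = 0;

    int append(String key) {          // append-only write path
        int id = nextDocId++;
        docIdByKey.put(key, id);
        return id;
    }

    void markDeleted(int docId) {     // no in-place update: just flag the old doc
        deleted.set(docId);
    }

    boolean isLive(int docId) {
        return docId < nextDocId && !deleted.get(docId);
    }
}
```

An "update" of document A is then `oldSegment.markDeleted(oldId)` followed by an `append` into the currently open segment; the flagged documents are physically dropped only when segments are merged.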
 21. Segment merge issue
     • Performance
       • 2(n+m) in time and space
     • Size
       • Compaction: deleted docs are removed
     (diagram: segments #1-#4 merged into a new segment)
 22. Segment merge issue
     • Why merge?
       • Segment count grows fast
       • A search must scan all leaf segments in turn
       • Document deletion
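Compaction during a merge can be sketched as a single linear pass that copies only live documents from the input segments into one new segment, which is why the cost is proportional to the combined input size. A minimal sketch, with hypothetical names:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of segment compaction: stream the documents of
// several small segments into one new segment, skipping any document
// whose id is flagged in that segment's deletion bitmap.
class SegmentMerger {
    static List<String> merge(List<List<String>> segments, List<BitSet> deletions) {
        List<String> merged = new ArrayList<>();
        for (int s = 0; s < segments.size(); s++) {
            List<String> docs = segments.get(s);
            BitSet deleted = deletions.get(s);
            for (int d = 0; d < docs.size(); d++) {
                if (!deleted.get(d)) {   // deleted docs are dropped here
                    merged.add(docs.get(d));
                }
            }
        }
        return merged;
    }
}
```

After the pass, queries only have to scan the single merged segment instead of every small one, which addresses both the growing segment count and the deletion backlog.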
 23. Inverted indexing
     (diagram: a posting index over the sorted terms term1, term3, term5, term7;
     each index entry points into a postings file holding a posting list per term)
     Good for sequential writing to disk
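The layout above keeps terms sorted so the whole index can be streamed to disk in one forward-only pass. As a minimal sketch of that idea (hypothetical names, not Fastcatsearch's writer classes), a sorted map from term to posting list gives exactly this flush order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: terms kept in sorted order, each mapped to its
// posting list (doc ids). Iterating the map yields the terms in the
// order they would be written to disk - one sequential pass, no seeks.
class InvertedIndex {
    final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }

    // The sequential flush order: sorted terms, each followed by its postings.
    List<String> flushOrder() {
        return new ArrayList<>(postings.keySet());
    }
}
```

Because writes happen in key order, the on-disk file is produced append-only, which is the property the slide contrasts with the B-tree approach that follows.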
 24. Inverted indexing
     How about a B-tree?
     (diagram: B-tree blocks in memory flushed to blocks in a file)
     A flush causes a lot of random writes to disk
 25. Search in realtime
     1. A newly created segment becomes searchable
     (diagram: searchable data = seg #1, seg #2, seg #3, seg #4)
 26. Search in realtime
     2. Merge segments
     (diagram: searchable data = seg #1, seg #2, seg #3, seg #4)
 27. Search in realtime
     3. The new merged segment appears
     4. The merged-away segments are removed
     (diagram: searchable data = seg #1, seg #2, seg #3, seg #4, seg #5)
 28. Search in realtime
     5. Search the data
     (diagram: searchable data = seg #1, seg #5)
 29. Search in realtime
     A newly created segment (seg #6) arrives; this process repeats constantly
     (diagram: searchable data = seg #1, seg #5, seg #6)
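The cycle on these slides amounts to maintaining a list of searchable segments that grows as new segments arrive and shrinks when a merged segment replaces the ones it absorbed. A minimal sketch, assuming a single indexer thread updates the list while searches read immutable snapshots; the names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the segment lifecycle: searches read a
// snapshot of the segment list while a single indexer thread adds new
// segments and swaps in merge results.
class SegmentRegistry {
    private final AtomicReference<List<String>> segments =
            new AtomicReference<>(List.of());

    void addSegment(String name) {                        // a new segment arrives
        List<String> next = new ArrayList<>(segments.get());
        next.add(name);
        segments.set(Collections.unmodifiableList(next)); // publish new snapshot
    }

    void replaceMerged(List<String> merged, String result) { // merge result swap
        List<String> next = new ArrayList<>(segments.get());
        next.removeAll(merged);                           // drop absorbed segments
        next.add(result);                                 // add the merged segment
        segments.set(Collections.unmodifiableList(next));
    }

    List<String> snapshot() {                             // what a search sees
        return segments.get();
    }
}
```

A search that started before the swap keeps using its old snapshot, so queries never observe a half-replaced segment set; the old segment files can be deleted once no search references them.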
 30. Visualization
     • Lucene's merge visualization
     • http://www.youtube.com/watch?v=ojcpvIY3QgA
     • Python script + Python Imaging Library + MEncoder
 31. Questions?
 32. Learn More
     • http://fastcatsearch.org/
     • https://www.facebook.com/groups/fastcatsearch/
