Realtime stream and 
realtime search engine 
Sang Song 
fastcatsearch.org 
1
About Me 
• www.linkedin.com/profile/view?id=295484775 
• facebook.com/songaal 
• swsong@websqrd.com 
2
Agenda 
• Introduction 
• Search Architecture 
• Realtime Indexing 
3
Introduction 
4
Goal 
• Like Splunk 
• Indexing streaming log data 
• Search log data in real-time 
5
Big data 
• Data sets too large and complex for a traditional database 
• Difficult to process with traditional data-processing tools 
• 3Vs 
• Volume : Large quantity of data 
• Variety : Diverse set of data 
• Velocity : Speed of data 
Source: Wikipedia 
6
About Fastcatsearch 
• Distributed system 
• Fast indexing 
• Fast queries 
• Popular keywords 
• GS certification 
• 70+ references 
• Open source 
• Multi-platform 
• Easy web management tool 
• Dictionary management 
• Plugin extension 
7
Reference 
8
History 
• Fastcatsearch v1 (2010-2011) 
• Single machine 
• <150 QPS 
• Fastcatsearch v2 (2013-Now) 
• Distributed system 
• Multi-collection result aggregation 
• 200+ QPS 
• Fastcatsearch v3 (alpha) 
• Realtime indexing/searching 
• Schema-free 
• Shard/replica 
• Geospatial search 
9
Search Architecture 
10
[diagram: search architecture] 
11
Realtime Indexing 
12
Store log data 
• HDFS 
• Write-once static files 
• Flume 
• Collecting, aggregating, and moving large amounts of log data 
13
[diagram: collecting log data with Flume into HDFS] 
14
Flume config 
agent1.sources = r1 
agent1.sinks = hdfssink 
agent1.channels = c1 
agent1.sources.r1.type = netcat 
agent1.sources.r1.bind = localhost 
agent1.sources.r1.port = 44443 
agent1.sinks.hdfssink.type = hdfs 
agent1.sinks.hdfssink.hdfs.path = hdfs://192.168.189.173:9000/flume/events 
agent1.sinks.hdfssink.hdfs.fileType = SequenceFile # or DataStream 
agent1.sinks.hdfssink.hdfs.writeFormat = Text 
agent1.sinks.hdfssink.hdfs.batchSize = 10 
agent1.channels.c1.type = memory 
agent1.channels.c1.capacity = 1000 
agent1.channels.c1.transactionCapacity = 100 
agent1.sources.r1.channels = c1 
agent1.sinks.hdfssink.channel = c1 
$ ./flume-ng agent -f /home/swsong/flume/conf/flume.conf -n agent1 
15
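Since the source is netcat listening on localhost:44443, the setup can be smoke-tested by piping a line into it (assuming nc is installed): 
$ echo "sample log line" | nc localhost 44443 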
Flume append? 
16
Fastcatsearch 
[diagram: HDFS Indexer, Merger, Segment Searcher, Index File] 
Issues 
- Segment file commit 
- Doc deletion 
17
Import using Flume 
Configuration conf = new Configuration(); 
FileSystem fs = FileSystem.get(URI.create(uriPath), conf); 
FileStatus[] status = fs.listStatus(new Path(dirPath)); 
for (int i = 0; i < status.length; i++) { 
    // Open each SequenceFile that Flume wrote into the directory. 
    SequenceFile.Reader.Option opt = SequenceFile.Reader.file(status[i].getPath()); 
    SequenceFile.Reader reader = new SequenceFile.Reader(conf, opt); 
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf); 
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf); 
    while (reader.next(key, value)) { 
        Map<String, Object> parsedEvent = parseEvent(key.toString(), value.toString()); 
        if (parsedEvent != null) { 
            eventQueue.add(parsedEvent); // hand off to the indexer 
        } 
    } 
    reader.close(); 
} 
18
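parseEvent() is not shown on the slide. Below is a minimal hypothetical sketch of such a method for the importer above, assuming the SequenceFile key holds the Flume event timestamp and the value holds the raw log line: 

import java.util.HashMap; 
import java.util.Map; 

// Hypothetical sketch -- the slide does not show parseEvent()'s body. 
// Assumes key = event timestamp from Flume's HDFS sink, value = raw log line. 
static Map<String, Object> parseEvent(String key, String value) { 
    if (value == null || value.trim().isEmpty()) { 
        return null; // the caller drops nulls, so empty events are skipped 
    } 
    Map<String, Object> doc = new HashMap<String, Object>(); 
    doc.put("timestamp", key);        // kept as-is; the format depends on the sink config 
    doc.put("message", value.trim()); // the log line becomes the document body 
    return doc; 
} 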
Making index segment 
• An index has multiple segments 
• Document writer 
• Index writer (fan-out sketched below) 
• Search index writer 
• Field index writer 
• Group index writer 
19
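A rough sketch of how the writers listed above could fan out per segment; the common interface and method names are assumptions, not Fastcatsearch's actual API: 

import java.util.ArrayList; 
import java.util.List; 
import java.util.Map; 

// Hypothetical sketch: the slide lists the writer kinds; the shared 
// interface and the fan-out loop are assumed. 
interface IndexWriter { 
    void write(int docId, Map<String, Object> doc); 
} 

class SegmentWriter { 
    private int nextDocId = 0; 
    // One writer per index kind: search, field, and group index writers. 
    private final List<IndexWriter> writers = new ArrayList<IndexWriter>(); 

    void addWriter(IndexWriter writer) { 
        writers.add(writer); 
    } 

    int addDocument(Map<String, Object> doc) { 
        int docId = nextDocId++;      // segment-local document id 
        for (IndexWriter writer : writers) { 
            writer.write(docId, doc); // each writer builds its own structure 
        } 
        return docId; 
    } 
} 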
Segment commit issue 
• Update / Delete documents 
• No in-place update 
• An update is an append plus a delete 
• Deletions apply to previous segments 
• Mark as deleted (sketched below) 
20
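A minimal sketch of mark-as-deleted, assuming a per-segment bitset keyed by segment-local doc id (the slide names the idea, not the data structure): 

import java.util.BitSet; 

// Hypothetical sketch: deletes are only marks against an existing 
// segment; the segment file itself is never rewritten in place. 
class DeleteSet { 
    private final BitSet deleted = new BitSet(); 

    // An update = append the new version, then mark the old one deleted. 
    void markDeleted(int docId) { 
        deleted.set(docId); 
    } 

    boolean isDeleted(int docId) { 
        return deleted.get(docId); 
    } 

    int deletedCount() { 
        return deleted.cardinality(); // how much garbage a merge would reclaim 
    } 
} 

Under this scheme a commit only has to persist the delete marks; the segment files themselves stay immutable. 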
Segment merge issue 
• Performance 
• ~2(n+m) in time and space: read both inputs, write the merged output 
• Size compaction: deleted docs are removed (see the sketch below) 
[diagram: segment #1 + segment #2 + segment #3 merge to new segment #4] 
21
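A sketch of the cost argument, treating segments as arrays of doc ids (hypothetical types; real merging also merges the per-term postings): 

import java.util.Arrays; 
import java.util.function.IntPredicate; 

// Hypothetical sketch: merging reads n + m docs and writes at most 
// n + m docs, hence ~2(n+m) I/O; docs marked deleted are skipped, 
// which is the size compaction. 
class SegmentMerger { 
    static int[] merge(int[] segA, int[] segB, IntPredicate isDeleted) { 
        int[] out = new int[segA.length + segB.length]; 
        int k = 0; 
        for (int docId : segA) {                        // read n docs 
            if (!isDeleted.test(docId)) out[k++] = docId; 
        } 
        for (int docId : segB) {                        // read m docs 
            if (!isDeleted.test(docId)) out[k++] = docId; 
        } 
        return Arrays.copyOf(out, k);                   // write <= n+m survivors 
    } 
} 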
Segment merge issue 
• Why merge? 
• Segment count grows fast 
• Searching the index means searching every leaf segment in turn (sketched below) 
• Document deletion 
22
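Without merging, a query has to fan out over every live segment, as in this hypothetical sketch: 

import java.util.ArrayList; 
import java.util.List; 

// Hypothetical sketch: every query visits every leaf segment in turn, 
// so query cost grows with the segment count -- the reason to merge. 
interface Segment { 
    List<Integer> search(String term); // matching doc ids in this segment 
    boolean isDeleted(int docId);      // per-segment delete marks 
} 

class MultiSegmentSearcher { 
    static List<Integer> search(List<Segment> segments, String term) { 
        List<Integer> hits = new ArrayList<Integer>(); 
        for (Segment segment : segments) {        // one lookup per segment 
            for (int docId : segment.search(term)) { 
                if (!segment.isDeleted(docId)) {  // filter docs marked deleted 
                    hits.add(docId); 
                } 
            } 
        } 
        return hits; 
    } 
} 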
Inverted Indexing 
[diagram: sparse posting index (term1, term3, term5, term7) pointing into the postings file] 
[postings file: term1 posting1 | term2 posting2 | term3 posting3 | term4 posting4 | term5 posting5 | term6 posting6] 
Good for sequential writing to disk (sketched below) 
23
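A sketch of why this layout writes sequentially, assuming the in-memory postings are kept sorted by term (e.g. in a TreeMap): 

import java.io.DataOutputStream; 
import java.io.IOException; 
import java.util.List; 
import java.util.Map; 
import java.util.SortedMap; 

// Hypothetical sketch: terms arrive sorted, so the whole index file is 
// produced in one sequential pass with no seeks. 
class PostingsWriter { 
    static void writeSegment(SortedMap<String, List<Integer>> postings, 
                             DataOutputStream out) throws IOException { 
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) { 
            out.writeUTF(entry.getKey());          // term 
            out.writeInt(entry.getValue().size()); // posting count 
            for (int docId : entry.getValue()) { 
                out.writeInt(docId);               // one posting per doc id 
            } 
        } 
    } 
} 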
Inverted Indexing 
How about a B-tree? 
[diagram: B-tree blocks in memory, flushed to a file on disk] 
Flushing causes a large amount of random writes to disk 
24
Search in realtime 
seg #1 seg #2 seg #3 seg #4 
1. Newly created segment 
Searchable data 
25
Search in realtime 
seg #1 seg #2 seg #3 seg #4 
2. Merge segments 
Searchable data 
26
Search in realtime 
seg #1 seg #2 seg #3 seg #4 seg #5 
3. New merged segment 
4. Remove old segments 
Searchable data 
27
Search in realtime 
Searchable data 
seg #1 seg #5 
5. Searching data 
28
Search in realtime 
Searchable data 
seg #1 seg #5 
seg #6 
Newly created segment 
Repeat this process continuously (sketched below) 
29
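The cycle on slides 25-29 as a hypothetical sketch; a copy-on-write list stands in for whatever Fastcatsearch actually uses, so readers always see a consistent segment set: 

import java.util.List; 
import java.util.concurrent.CopyOnWriteArrayList; 

// Hypothetical sketch: add new segments, swap in the merged segment, 
// drop the old ones, repeat -- while queries keep searching. 
class SegmentManager { 
    private final List<String> searchable = new CopyOnWriteArrayList<String>(); 

    void addSegment(String segment) {       // 1. newly created segment 
        searchable.add(segment); 
    } 

    void swapMerged(List<String> olds, String merged) { 
        searchable.add(merged);             // 3. merged segment goes live first 
        searchable.removeAll(olds);         // 4. then old segments are removed 
    } 

    List<String> currentSegments() {        // 5. queries search this set 
        return searchable; 
    } 
} 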
Visualization 
• Lucene's merge visualization 
• http://www.youtube.com/watch?v=ojcpvIY3QgA 
• Python script + Python Imaging Library (PIL) + MEncoder 
30
Questions? 
31
Learn More 
• http://fastcatsearch.org/ 
• https://www.facebook.com/groups/fastcatsearch/ 
32
