2. Who am I
• Evans Ye @
– Dumbo Team
• Dumbo In Taiwan Blog
– Talk in TWHUG 2013 Q4
• Building Hadoop Based Big Data Environment
– Apache Bigtop Contributor
3/10/2014
Copyright 2013 Trend Micro Inc.
3. Agenda
• Problem to Solve
• Solution Design
• Flume ETL Process
• Experience Sharing
• Future Work
5. Network Traffic Analysis Example
• [Diagram: C&C 1–3 servers on the INTERNET communicating with VICTIM 1–4 hosts on the INTRANET, across a TW branch and a US branch]
6. Find Malicious Connections by Searching Netflow Logs
• ArcSight Common Event Format
– Volume: 250 GB / 180 million records per day
7. Valuable Fields in Netflow log
• src: source ip
• dst: destination ip
• spt: source port
• dpt: destination port
• proto: protocol, e.g. TCP, UDP, …
• rt: timestamp in epoch milliseconds, e.g. 1386018915000
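These fields can be pulled out of the extension part of an ArcSight CEF record. A hypothetical sketch (not from the talk); the sample line is invented for illustration, and real records carry many more key=value pairs:

```python
# Hypothetical sketch: extract key=value fields from a CEF record's
# extension. The sample line below is made up for illustration.
def parse_cef_extension(line):
    """Return the key=value pairs from a CEF record's extension field."""
    # The extension is everything after the seventh '|' separator.
    extension = line.split('|', 7)[7]
    fields = {}
    for pair in extension.split():  # note: breaks on values containing spaces
        if '=' in pair:
            key, value = pair.split('=', 1)
            fields[key] = value
    return fields

sample = ('CEF:0|Vendor|Netflow|1.0|100|flow|3|'
          'src=10.1.2.3 dst=8.8.4.4 spt=51234 dpt=53 proto=UDP rt=1386018915000')
fields = parse_cef_extension(sample)
```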
10. Choosing The Right Tool
• Big data solutions
• Why HBase?
– We want to explore the limitations of the HBase Thrift interface
– and see how HBase performs on this kind of problem
12. Architecture
• Data source sends Netflow via syslog
• HBase Thrift Server lets clients talk to HBase using C++, Python, PHP, Ruby, Perl…
• Queries are served by a simple Python web framework (only one file, under 150k)
13. User Requirement
• Searchable Fields
– src: source ip
– dst: destination ip
– spt: source port
– dpt: destination port
– proto: protocol, TCP, UDP…
– rt: timestamp, 1386018915000
• Values
– in, cn2, ad.tcp__flags
14. HBase Rowkey Design – First Attempt
• Compose searchable fields to be rowkey
• For client query, scan by applying HBase Filter
– RowFilter (=, 'regexstring:^src#dst#[^#]*#spt#dpt#proto$')
– See HBase Thrift Filter doc
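A minimal sketch of this first-attempt scheme, with the field order inferred from the RowFilter pattern on the slide (src#dst#rt#spt#dpt#proto):

```python
import re

# First-attempt rowkey: all searchable fields joined with '#'.
def compose_rowkey(src, dst, rt, spt, dpt, proto):
    return '#'.join([src, dst, str(rt), str(spt), str(dpt), proto])

# Build the RowFilter regexstring; '[^#]*' matches any timestamp,
# as in the slide's example.
def row_filter_regex(src, dst, spt, dpt, proto):
    return '^{0}#{1}#[^#]*#{2}#{3}#{4}$'.format(src, dst, spt, dpt, proto)

key = compose_rowkey('10.1.2.3', '8.8.4.4', 1386018915000, 51234, 53, 'UDP')
pattern = row_filter_regex('10.1.2.3', '8.8.4.4', 51234, 53, 'UDP')
```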
16. Performance
• Test on 12 million records of sample data
• The search performance……
• Since we need to store at least 3 months of data for query, the performance might not be good enough…
17. Lessons Learned
• Avoid full table scans
– HBase Filters only keep unwanted data from being sent to the client side
– On the server side, HBase still has to compare every rowkey when applying filters
– Set STARTROW and STOPROW instead
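A minimal sketch of deriving the STARTROW/STOPROW pair, assuming rowkeys of the form `<ip>#...`, so the server never looks at rowkeys outside the range:

```python
# Bound a prefix scan with STARTROW/STOPROW (assumed '<ip>#...' key layout).
def prefix_scan_range(ip):
    start_row = ip + '#'
    # '$' (0x24) is the byte right after '#' (0x23), so every rowkey
    # starting with '<ip>#' sorts inside [start_row, stop_row).
    stop_row = ip + '$'
    return start_row, stop_row

start, stop = prefix_scan_range('10.1.2.3')
```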
18. Avoid Full Table Scan
• HBase natively stores data sorted by rowkey
• So scanning rows is fast when a rowkey prefix is specified
– With our rowkey this is only fast when the source ip is specified
– What about destination ip, port, protocol, …?
19. Rethink The User Requirement
• Searchable Fields
– src: source ip (required)
– dst: destination ip
– spt: source port
– dpt: destination port
– proto: protocol
– rt: timestamp
• Users want to track down suspicious connections
– A query needs to contain at least an IP
20. HBase Rowkey Design – Second Attempt !
– Search on source ip
– Search on destination ip
– Put netflow timestamp into HBase timestamp to leverage HBase
TimeRange Scan
– Set VERSIONS => 2147483647 to avoid collisions
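The second attempt can be sketched as two mirrored puts per record. Table names here are assumptions; the netflow rt becomes the HBase cell timestamp so TimeRange scans work, and raising VERSIONS keeps records that share a rowkey and qualifier from overwriting each other:

```python
# Sketch of the second-attempt write path (table names assumed): every
# record is written twice, keyed src#dst and dst#src, with the netflow
# rt used as the HBase cell timestamp.
def build_puts(src, dst, spt, dpt, proto, rt):
    qualifier = '{0}#{1}#{2}'.format(spt, dpt, proto)
    return [
        # (table, rowkey, column qualifier, hbase timestamp)
        ('src_table', '{0}#{1}'.format(src, dst), qualifier, rt),
        ('dst_table', '{0}#{1}'.format(dst, src), qualifier, rt),
    ]

puts = build_puts('10.1.2.3', '8.8.4.4', 51234, 53, 'UDP', 1386018915000)
```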
21. HBase Rowkey Design – Second Attempt !
• Search other searchable fields by applying Qualifier
Filter:
– QualifierFilter (=, 'regexstring:^spt#dpt#proto$')
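A sketch of building that QualifierFilter string; passing `'*'` (an assumed wildcard convention, not from the talk) leaves a field unconstrained via the `[^#]*` regex fragment:

```python
# Build the QualifierFilter string for the remaining searchable fields.
# '*' is an assumed wildcard meaning "no constraint on this field".
def qualifier_filter(spt='*', dpt='*', proto='*'):
    parts = ['[^#]*' if f == '*' else str(f) for f in (spt, dpt, proto)]
    return "QualifierFilter (=, 'regexstring:^{0}#{1}#{2}$')".format(*parts)

f = qualifier_filter(spt=51234, proto='UDP')  # dpt left unconstrained
```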
22. Check The User Requirement
• Searchable Fields
– src: source ip → specify STARTROW/STOPROW
– dst: destination ip → specify STARTROW/STOPROW
– spt: source port → apply qualifier filter
– dpt: destination port → apply qualifier filter
– proto: protocol → apply qualifier filter
– rt: timestamp → specify HBase TimeRange
24. Performance
• Test on 70 million records of sample data
• The search performance……
• Enough?
– Since malicious connections won't have large volume, 80% of queries should be responded to within a second
• Duplication issue:
– Since we only store the needed fields in HBase, the data volume is only 150 MB/day, or 300 MB/day duplicated
– Storing 3 months of data = 13.5 GB, or 27 GB duplicated (gzipped) (record count = 12 billion)
25. Test on Even Larger Data
• Test on 240 million records of sample data
• The search performance……
• The query time is robust for 80% of query cases
28. Flume Process
• Pipeline: Data Source → Flume Spooling Directory Source → Flume File Channel → Flume HBase Sink
• Serializer:
1. Extract needed fields from the Netflow log
2. Create an HBase Put object for the Sink to execute
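Flume serializers are implemented in Java; this Python sketch only mirrors the serializer's two steps, with the field names and key layout from this talk and everything else assumed:

```python
# Sketch of the serializer's two steps (Python stand-in for the Java class).
def serialize(event_body):
    # Step 1: extract the needed fields from the Netflow log line.
    fields = dict(pair.split('=', 1)
                  for pair in event_body.split() if '=' in pair)
    # Step 2: describe the HBase put for the sink to execute.
    return {
        'rowkey': '{0}#{1}'.format(fields['src'], fields['dst']),
        'qualifier': '{0}#{1}#{2}'.format(
            fields['spt'], fields['dpt'], fields['proto']),
        'timestamp': int(fields['rt']),
    }

put = serialize('src=10.1.2.3 dst=8.8.4.4 spt=51234 dpt=53 proto=UDP rt=1386018915000')
```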
30. More Elegant Way
• Hook a prePut Coprocessor on the src table
– Step 1: a put triggers the prePut coprocessor
– Step 2: in the coprocessor, put to the dst table in dst#src format
– Step 3: do the regular put to the src table in src#dst format
• Pipeline: Data Source (Infosec) → Flume Spooling Directory Source → Channel1 → Sink1 → src table → (coprocessor) → dst table
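The prePut coprocessor itself is Java; this sketch only shows the key transformation it performs before writing to the dst table:

```python
# Mirror a src-table rowkey ('src#dst') into a dst-table rowkey ('dst#src'),
# as the prePut coprocessor does in Step 2.
def mirror_rowkey(src_table_rowkey):
    src, dst = src_table_rowkey.split('#', 1)
    return '{0}#{1}'.format(dst, src)

mirrored = mirror_rowkey('10.1.2.3#8.8.4.4')
```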
32. Experience Sharing
• Thrift
– Thrift is not a first-class citizen of HBase; for example, Thrift does not support Scan with TimeRange or versions
– It does not support newer Filters such as FuzzyRowFilter, since Thrift has its own Filter Language
• Bottle
– It doesn't hurt much if you have to throw away and rewrite web backend code implemented with Bottle
33. Experience Sharing
• Flume
– There is also a Flume Syslog UDP Source, but it does not work well without extra work:
• 768-byte per-message limitation (fixed in FLUME-2130)
• Still a 2048-byte limitation in the netty event decoder
• Data may be lost due to messages being concatenated…
– The Spooling Directory Source is much more stable
34. Future Work
• Make the index table transparent to clients
– Use a coprocessor to hook the client scan and decide which table to scan
• Make Thrift scan support specifying versions:
– For now I use scan to fetch the rows and qualifiers, then getVer to fetch the different versions (Thrift does support versions on get)