Network Traffic Search using
Apache HBase

Evans Ye @ TWHUG 2014 Q1
2014/3/8
Transcript of "Network Traffic Search using Apache HBase"

  1. Network Traffic Search using Apache HBase
     Evans Ye @ TWHUG 2014 Q1, 2014/3/8
  2. Who am I
     • Evans Ye @
       – Dumbo Team
         • Dumbo In Taiwan Blog
       – Talk in TWHUG 2013 Q4
         • Building Hadoop Based Big Data Environment
       – Apache Bigtop Contributor
     3/10/2014 Copyright 2013 Trend Micro Inc.
  3. Agenda
     • Problem to Solve
     • Solution Design
     • Flume ETL Process
     • Experience Sharing
     • Future Work
  4. Security Department: Hey SPN, I have a big data problem…
     Step aside and let the professionals handle it!
  5. Network Traffic Analysis Example
     [Diagram labels: C&C 1, C&C 2, C&C 3 (Internet); Victim 1, Victim 2, TW branch; Victim 3, Victim 4, US branch (intranet)]
  6. Find Malicious Connections by Searching Netflow Logs
     • ArcSight Common Event Format
       – Volume: 250 GB / 180 million records per day
  7. Valuable Fields in Netflow Log
     • src: source IP
     • dst: destination IP
     • spt: source port
     • dpt: destination port
     • proto: protocol (TCP, UDP…)
     • rt: timestamp (e.g. 1386018915000)
  8. Search for Connections
     [Diagram: a query against the Netflow Logger takes about 8~10 min]
  9. Big Data Problem
  10. Choosing The Right Tool
      • Big data solutions
      • Why HBase?
        – We wanted to try HBase Thrift and find out its limitations
        – We wanted to see how HBase performs on this kind of problem
  11. Solution Design
  12. Architecture
      [Diagram labels: Data Source; Send Netflow via syslog; Query; HBase Thrift Server]
      • Talk to HBase using C++, Python, PHP, Ruby, Perl…
      • A simple Python web framework
        – Only one file
        – Under 150k
  13. User Requirement
      • Searchable Fields
        – src: source IP
        – dst: destination IP
        – spt: source port
        – dpt: destination port
        – proto: protocol (TCP, UDP…)
        – rt: timestamp (e.g. 1386018915000)
      • Values
        – in, cn2, ad.tcp__flags
  14. HBase Rowkey Design – First Attempt
      • Compose the searchable fields into the rowkey
      • For client queries, scan with an HBase Filter
        – RowFilter (=, 'regexstring:^src#dst#[^#]*#spt#dpt#proto$')
        – See the HBase Thrift Filter doc
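The first-attempt design on slide 14 can be sketched in plain Python. This is a minimal illustration, not the speaker's code: the helper names are hypothetical, and the field order (with `rt` in the third, wildcarded position) is an assumption read off the regex on the slide.

```python
import re

# Assumed field order; the slide's regex wildcards the third position,
# which is taken here to be the timestamp (rt).
FIELDS = ("src", "dst", "rt", "spt", "dpt", "proto")


def make_rowkey(record):
    """Join all searchable fields with '#' to form the rowkey."""
    return "#".join(str(record[f]) for f in FIELDS)


def make_row_regex(query):
    """Build an anchored regex; fields missing from the query match anything but '#'."""
    parts = [re.escape(str(query[f])) if f in query else "[^#]*" for f in FIELDS]
    return "^" + "#".join(parts) + "$"


def make_thrift_filter(query):
    """Wrap the regex in the filter-string form shown on the slide."""
    return "RowFilter (=, 'regexstring:{}')".format(make_row_regex(query))


record = {"src": "10.1.2.3", "dst": "8.8.8.8", "rt": 1386018915000,
          "spt": 51234, "dpt": 53, "proto": "UDP"}
rowkey = make_rowkey(record)
print(rowkey)
print(make_thrift_filter({"proto": "UDP"}))
```

Every row still has to be compared against the regex on the server, which is the cost the following slides set out to remove.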
  15. RD Style Search Portal
  16. Performance
      • Test on 12 million sample records
      • The search performance……
      • Since we need to keep at least 3 months of data available for queries, the performance might not be good enough…
  17. Lesson Learned
      • Avoid full table scans
        – HBase Filters only keep unwanted data from being sent to the client side
        – On the server side, HBase still needs to compare every rowkey when applying filters
        – → Set STARTROW and STOPROW
  18. Avoid Full Table Scan
      • HBase is natively designed to store data sorted by rowkey
      • Scanning rows is fast when a rowkey prefix is specified
        – With the first-attempt rowkey, this is only fast when the source IP is specified
        – How about destination IP, port, protocol,…?
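Why a prefix makes the scan fast can be shown with a sorted list standing in for an HBase table. This is a sketch under stated assumptions: `prefix_scan_range` and `scan` are hypothetical helpers, and binary search over a Python list only imitates how a sorted store can jump straight to the start row.

```python
from bisect import bisect_left


def prefix_scan_range(prefix):
    """Return (start_row, stop_row) covering exactly the keys that start with prefix."""
    stop = bytearray(prefix)
    # Drop trailing 0xFF bytes, then increment the last remaining byte.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, b""  # empty stop row: scan to the end of the table
    stop[-1] += 1
    return prefix, bytes(stop)


def scan(sorted_keys, start, stop):
    """Range scan over sorted keys: start inclusive, stop exclusive."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop) if stop else len(sorted_keys)
    return sorted_keys[lo:hi]


keys = sorted([
    b"10.1.2.2#9.9.9.9",
    b"10.1.2.3#1.1.1.1",
    b"10.1.2.3#8.8.8.8",
    b"10.1.2.30#8.8.8.8",
])
# Include the '#' delimiter so that 10.1.2.30 is not picked up by the prefix.
start, stop = prefix_scan_range(b"10.1.2.3#")
hits = scan(keys, start, stop)
print(start, stop, hits)
```

Only the rows between the two boundaries are touched, regardless of how large the rest of the table is.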
  19. Rethink The User Requirement
      • Searchable Fields
        – src: source IP
        – dst: destination IP
        – spt: source port
        – dpt: destination port
        – proto: protocol
        – rt: timestamp
        (slide annotation beside the IP fields: "required")
      • Users want to track down suspicious connections
        – A query needs to contain at least one IP
  20. HBase Rowkey Design – Second Attempt!
      – Search on source IP
      – Search on destination IP
      – Put the netflow timestamp into the HBase timestamp to leverage HBase TimeRange scans
      – Set VERSIONS => 2147483647 so that records sharing a rowkey and qualifier do not overwrite each other
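The write side of the second attempt can be modelled with a dictionary of versioned cells. This is a simplified stand-in, not HBase code: the `src#dst` rowkey and `spt#dpt#proto` qualifier layout is inferred from slides 21, 29 and 30, and the function names and empty cell value are assumptions for illustration.

```python
from collections import defaultdict


def to_cell(record):
    """Map a netflow record to (rowkey, qualifier, cell timestamp)."""
    row = "{}#{}".format(record["src"], record["dst"])
    qualifier = "{}#{}#{}".format(record["spt"], record["dpt"], record["proto"])
    return row, qualifier, int(record["rt"])


def put(table, record, value=b""):
    """Store one version per timestamp, as a column with unlimited versions would."""
    row, qualifier, ts = to_cell(record)
    table[(row, qualifier)][ts] = value


def scan_time_range(table, row, min_ts, max_ts):
    """Return (qualifier, ts) pairs for a row with min_ts <= ts < max_ts."""
    hits = []
    for (r, qualifier), versions in table.items():
        if r != row:
            continue
        for ts in sorted(versions):
            if min_ts <= ts < max_ts:
                hits.append((qualifier, ts))
    return hits


table = defaultdict(dict)
flow = {"src": "10.1.2.3", "dst": "8.8.8.8", "spt": 51234,
        "dpt": 53, "proto": "UDP", "rt": "1386018915000"}
put(table, flow)
put(table, dict(flow, rt="1386018975000"))  # same connection, one minute later
print(scan_time_range(table, "10.1.2.3#8.8.8.8", 1386018915000, 1386018975000))
```

Two flows with identical addresses and ports land in the same cell and survive as separate versions, which is why the version limit has to be raised.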
  21. HBase Rowkey Design – Second Attempt!
      • Search the other searchable fields by applying a Qualifier Filter:
        – QualifierFilter (=, 'regexstring:^spt#dpt#proto$')
  22. Check The User Requirement
      • Searchable Fields
        – src: source IP → specify STARTROW/STOPROW
        – dst: destination IP → specify STARTROW/STOPROW
        – spt: source port → apply qualifier filter
        – dpt: destination port → apply qualifier filter
        – proto: protocol → apply qualifier filter
        – rt: timestamp → specify HBase TimeRange
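The field-to-mechanism mapping on slide 22 can be read as a small query planner. The sketch below is one possible interpretation: the table names `src_table` and `dst_table`, the `rt_from`/`rt_to` query keys, and the returned dictionary shape are all hypothetical, and values are not regex-escaped because ports and protocol names contain no special characters.

```python
def plan_query(q):
    """Translate a query dict into scan settings (all names are illustrative)."""
    if "src" in q:
        table, lead, other = "src_table", q["src"], q.get("dst")
    elif "dst" in q:
        table, lead, other = "dst_table", q["dst"], q.get("src")
    else:
        raise ValueError("a query needs at least one IP")

    if other is not None:
        # Both IPs known: scan exactly one row.
        start = lead + "#" + other
        stop = start + "\x00"
    else:
        # One IP known: scan every row with the prefix "ip#" ('$' follows '#').
        start = lead + "#"
        stop = lead + "$"

    qualifier_filter = None
    if any(f in q for f in ("spt", "dpt", "proto")):
        parts = [str(q[f]) if f in q else "[^#]*" for f in ("spt", "dpt", "proto")]
        qualifier_filter = "QualifierFilter (=, 'regexstring:^{}$')".format("#".join(parts))

    time_range = None
    if "rt_from" in q and "rt_to" in q:
        time_range = (q["rt_from"], q["rt_to"])

    return {"table": table, "start_row": start, "stop_row": stop,
            "filter": qualifier_filter, "time_range": time_range}


plan = plan_query({"dst": "8.8.8.8", "dpt": 53})
print(plan)
```

A query without any IP is rejected up front, matching the requirement from slide 19 that every query contain at least one IP.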
  23. Deliver New Portal
  24. Performance
      • Test on 70 million sample records
      • The search performance……
      • Enough?
        – Since malicious connections won't have large volume, 80% of queries should get a response within a second
      • Duplication issue:
        – Since we only store the needed fields in HBase, the data volume is only 150 MB/day → 300 MB/day once duplicated
        – Storing 3 months of data = 13.5 GB → 27 GB once duplicated (gzipped) (record count = 12 billion)
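The storage figures on slide 24 follow from simple multiplication. The short check below assumes a 3-month window of 90 days and 1 GB = 1000 MB; both are assumptions, not statements from the slides.

```python
DAILY_MB = 150   # stored fields only, per day (from the slide)
DAYS = 90        # assumed length of the 3-month retention window
TABLES = 2       # one copy keyed by source IP, one by destination IP

single_gb = DAILY_MB * DAYS / 1000
dual_gb = single_gb * TABLES
print(single_gb, dual_gb)
```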
  25. Test on Even Larger Data
      • Test on 240 million sample records
      • The search performance……
      • Query time stays stable for 80% of query cases
  26. Flume ETL Process
  27. Architecture
      [Diagram labels: Data Source; Send Netflow via syslog; Query; HBase Thrift Server]
  28. Serializer
      [Diagram: Flume process from Data Source to HBase: Flume Spooling Directory Source, Flume File Channel, Flume HBase Sink with Serializer]
      1. Extract the needed fields from the Netflow log
      2. Create an HBase Put object for the Sink to execute
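The first serializer step, extracting the needed fields, can be illustrated on a CEF line. This sketch is in Python rather than the Java a Flume serializer would use, the sample log line is invented, and the parser ignores CEF escaping rules (escaped pipes or equals signs), so treat it as a simplification.

```python
import re

NEEDED = ("src", "dst", "spt", "dpt", "proto", "rt")

# key=value pairs in the CEF extension; a value runs until the next " key=" or end of line.
_KV = re.compile(r"([\w.]+)=(.*?)(?=\s+[\w.]+=|$)")


def extract_fields(cef_line):
    """Return only the needed fields from the extension part of a CEF line."""
    # A CEF header has seven pipe-separated fields; the extension is what follows.
    extension = cef_line.split("|", 7)[-1]
    pairs = dict(_KV.findall(extension))
    return {k: pairs[k] for k in NEEDED if k in pairs}


# Hypothetical sample line, made up for this example.
line = ("CEF:0|ExampleVendor|ExampleProduct|1.0|100|netflow|1|"
        "src=10.1.2.3 dst=8.8.8.8 spt=51234 dpt=53 proto=UDP "
        "rt=1386018915000 in=120")
fields = extract_fields(line)
print(fields)
```

The second step would turn this dictionary into a Put whose rowkey, qualifier and timestamp follow the rowkey design from the earlier slides.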
  29. Dual Table Write
      [Diagram: Data Source (Infosec), Flume Spooling Directory Source, Channel1 and Channel2, Sink1 and Sink2]
      flume.conf
      …
      agent1.sinks.sink1.serializer.rowKey = src, dst
      agent1.sinks.sink2.serializer.rowKey = dst, src
      Duplicate, Again!
  30. More Elegant Way
      [Diagram: Data Source (Infosec), Flume Spooling Directory Source, Channel1, Sink1, src table with a prePut Coprocessor hooked in, dst table]
      • Step 1: A put triggers the prePut Coprocessor
      • Step 2: The coprocessor puts to the dst table in dst#src format
      • Step 3: The regular put to the src table in src#dst format proceeds
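The three steps on slide 30 can be simulated in plain Python. A real HBase coprocessor is written in Java and runs on the region server, so the `Table` class and `mirror_to_dst` hook below are only a hypothetical model of the control flow; keeping the qualifier unchanged in the dst table is also an assumption.

```python
class Table:
    """Toy table with an optional hook that runs before every put."""

    def __init__(self, pre_put=None):
        self.rows = {}
        self.pre_put = pre_put

    def put(self, row, qualifier, ts, value):
        if self.pre_put is not None:
            self.pre_put(row, qualifier, ts, value)  # Step 1: put triggers the hook
        self.rows.setdefault((row, qualifier), {})[ts] = value  # Step 3: regular put


dst_table = Table()


def mirror_to_dst(row, qualifier, ts, value):
    """Step 2: write the same cell to the dst table with the rowkey reversed."""
    src, dst = row.split("#", 1)
    dst_table.put(dst + "#" + src, qualifier, ts, value)


src_table = Table(pre_put=mirror_to_dst)
src_table.put("10.1.2.3#8.8.8.8", "51234#53#UDP", 1386018915000, b"")
print(sorted(src_table.rows), sorted(dst_table.rows))
```

With the mirroring done on the server, Flume needs only one channel and one sink, which is the simplification the slide is after.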
  31. Experience Sharing & Future Work
  32. Experience Sharing
      • Thrift
        – Thrift is not a first-class citizen of HBase; for example, Thrift does not support Scan with TimeRange and Version
        – It does not support new Filters (for example, FuzzyRowFilter), since Thrift has its own Filter Language
      • Bottle
        – It won't hurt to throw away web backend code that was implemented with Bottle
  33. Experience Sharing
      • Flume
        – There is also a Flume Syslogudp Source, but it does not work well without extra work
          • 768 bytes per message limitation (fixed in FLUME-2130)
          • Still has a 2048-byte limitation in the netty event decoder
          • Data may be lost due to concatenated messages...
        – Spooling Directory Source is much more stable
  34. Future Work
      • Make the index table transparent to clients
        – Use a coprocessor to hook into the client scan and decide which table to scan
      • Make Thrift scan support specifying versions:
        – Currently I use scan to fetch rows and qualifiers, then use getVer to fetch the different versions (Thrift does support "version" on get)
  35. Questions?
  36. Thank you!