Network Traffic Search using Apache HBase
Network Traffic Search using Apache HBase: Presentation Transcript

  • Network Traffic Search using Apache HBase Evans Ye @ TWHUG 2014 Q1 2014/3/8
  • Who am I • Evans Ye @ – Dumbo Team • Dumbo In Taiwan Blog – Talk in TWHUG 2013 Q4 • Building Hadoop Based Big Data Environment – Apache Bigtop Contributor 3/10/2014 Copyright 2013 Trend Micro Inc.
  • Agenda
      • Problem to Solve
      • Solution Design
      • Flume ETL Process
      • Experience Sharing
      • Future Work
  • Security Department: Hey SPN, I have a big data problem… (Reply: Step aside, let the professionals handle it!)
  • Network Traffic Analysis Example (diagram: C&C 1, C&C 2 and C&C 3 on the Internet; on the intranet, Victim 1 and Victim 2 in the TW branch, Victim 3 and Victim 4 in the US branch)
  • Find Malicious Connections by Searching Netflow Logs
      • ArcSight Common Event Format
          – Volume: 250 GB / 180 million records per day
  • Valuable Fields in Netflow Log
      • src: source ip
      • dst: destination ip
      • spt: source port
      • dpt: destination port
      • proto: protocol (TCP, UDP, …)
      • rt: timestamp, e.g. 1386018915000
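The field list above can be illustrated with a minimal sketch that pulls those six keys out of one CEF record. The sample record is hypothetical (the talk shows no raw log line), and the parser assumes plain space-separated `key=value` pairs in the CEF extension without escaped characters.

```python
import re

# Fields the slides call valuable; everything else in the record is dropped.
NEEDED = ("src", "dst", "spt", "dpt", "proto", "rt")

# key=value pairs in the CEF extension; a value runs until the next " key=" or end of line.
_PAIR = re.compile(r"([\w.]+)=(.*?)(?=\s+[\w.]+=|$)")


def extract_fields(cef_line):
    """Return only the needed fields from one CEF record (simplified parser)."""
    # CEF has seven pipe-separated header fields, followed by the extension.
    extension = cef_line.split("|", 7)[-1]
    pairs = dict(_PAIR.findall(extension))
    return {key: pairs[key] for key in NEEDED if key in pairs}


# Hypothetical sample record, not taken from the talk.
SAMPLE = (
    "CEF:0|ExampleVendor|ExampleLogger|1.0|100|netflow|1|"
    "src=10.1.2.3 dst=192.0.2.7 spt=51234 dpt=443 proto=TCP "
    "rt=1386018915000 in=1520"
)

fields = extract_fields(SAMPLE)
print(fields)
```

Keeping only these keys is what later makes the stored volume so much smaller than the raw log volume.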
  • Search for Connections: a query against the Netflow Logger takes about 8~10 min
  • Big Data Problem
  • Choosing The Right Tool
      • Big data solutions
      • Why HBase?
          – We wanted to find out the limitations of HBase Thrift
          – We wanted to see how HBase performs on this kind of problem
  • Solution Design
  • Architecture (diagram)
      • Data source: sends Netflow via syslog
      • HBase Thrift Server: talk to HBase using C++, Python, PHP, Ruby, Perl…
      • Web portal: a simple Python web framework (only one file, under 150k) that queries the HBase Thrift Server
  • User Requirement
      • Searchable Fields
          – src: source ip
          – dst: destination ip
          – spt: source port
          – dpt: destination port
          – proto: protocol (TCP, UDP, …)
          – rt: timestamp, e.g. 1386018915000
      • Values
          – in, cn2, ad.tcp__flags
  • HBase Rowkey Design – First Attempt
      • Compose the searchable fields into the rowkey
      • For a client query, scan with an HBase Filter applied
          – RowFilter (=, 'regexstring:^src#dst#[^#]*#spt#dpt#proto$')
          – See the HBase Thrift Filter doc
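A rough sketch of this first attempt follows, assuming the rowkey order is `src#dst#rt#spt#dpt#proto` (inferred from the regex on the slide, where the `[^#]*` slot sits where the timestamp would be). The helper names are hypothetical, and Python's `re` module only stands in for the regex comparison HBase would run on the server.

```python
import re

# Assumed field order inside the rowkey (not confirmed by the slides).
FIELD_ORDER = ("src", "dst", "rt", "spt", "dpt", "proto")


def make_rowkey(record):
    """Join all searchable fields into one '#'-separated rowkey."""
    return "#".join(str(record[name]) for name in FIELD_ORDER)


def row_regex(query):
    """Regex over the whole rowkey; unspecified fields become wildcards."""
    parts = [
        re.escape(str(query[name])) if name in query else "[^#]*"
        for name in FIELD_ORDER
    ]
    return "^" + "#".join(parts) + "$"


def row_filter(query):
    """Filter string in the Thrift filter language, as written on the slide."""
    return "RowFilter (=, 'regexstring:%s')" % row_regex(query)


records = [
    {"src": "10.1.2.3", "dst": "192.0.2.7", "rt": 1386018915000,
     "spt": 51234, "dpt": 443, "proto": "TCP"},
    {"src": "10.1.2.30", "dst": "192.0.2.7", "rt": 1386018916000,
     "spt": 51235, "dpt": 443, "proto": "TCP"},
]
query = {"src": "10.1.2.3", "dpt": 443}

# Every rowkey is tested against the regex, which is what a filter-only scan does.
pattern = re.compile(row_regex(query))
hits = [key for key in map(make_rowkey, records) if pattern.match(key)]
print(row_filter(query))
print(hits)
```

The list comprehension touches every rowkey, which mirrors why a filter-only query degrades as the table grows.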
  • RD Style Search Portal
  • Performance
      • Test on 12 million sample records
      • The search performance… (chart not preserved in the transcript)
      • Since we need to keep at least 3 months of data available for queries, the performance might not be good enough…
  • Lesson Learned
      • Avoid full table scans
          – HBase Filters only keep unwanted data from being sent to the client side
          – On the server side, HBase still needs to compare every rowkey when applying filters
          – → Set STARTROW and STOPROW
  • Avoid Full Table Scan
      • HBase is natively designed to store data sorted by rowkey
      • Scanning rows is fast when a rowkey prefix is specified
          – With this rowkey, that is only fast when the source ip is specified
          – How about destination ip, port, protocol, …?
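The prefix idea can be sketched as follows: derive a start row and a stop row from a rowkey prefix, then read only the sorted range between them. The start row is inclusive and the stop row exclusive, as in an HBase scan; the sorted list and `bisect` merely simulate HBase's sorted storage, and the sample rowkeys are hypothetical.

```python
from bisect import bisect_left


def prefix_range(prefix):
    """Return (start_row, stop_row) covering every rowkey that begins with prefix."""
    stop = bytearray(prefix)
    # Drop trailing 0xFF bytes, which cannot be incremented.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, b""  # empty stop row: scan to the end of the table
    stop[-1] += 1
    return prefix, bytes(stop)


# Rowkeys as HBase would hold them: sorted byte strings.
rows = sorted([
    b"10.1.2.3#192.0.2.7",
    b"10.1.2.3#198.51.100.9",
    b"10.1.2.30#192.0.2.7",
    b"10.9.9.9#192.0.2.7",
])

# Including the '#' separator keeps 10.1.2.30 out of a search for 10.1.2.3.
start_row, stop_row = prefix_range(b"10.1.2.3#")
matched = rows[bisect_left(rows, start_row):bisect_left(rows, stop_row)]
print(start_row, stop_row)
print(matched)
```

Only the rows inside the range are read; rows for other source addresses are never compared.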
  • Rethink The User Requirement
      • Searchable Fields
          – src: source ip (required)
          – dst: destination ip
          – spt: source port
          – dpt: destination port
          – proto: protocol
          – rt: timestamp
      • Users want to track down suspicious connections
          – A query needs to contain at least one IP
  • HBase Rowkey Design – Second Attempt!
      – Search on source ip
      – Search on destination ip
      – Put the netflow timestamp into the HBase timestamp to leverage HBase TimeRange scans
      – Set VERSIONS => 2147483647 to avoid collisions
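Why the version limit matters can be shown with a toy in-memory model, not HBase itself. It assumes rowkey `src#dst` and qualifier `spt#dpt#proto`, as the later slides suggest, so two connections between the same endpoints differ only by timestamp. The class and the stored value are hypothetical.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of HBase cells: (row, qualifier) -> {timestamp: value}."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.cells = defaultdict(dict)

    def put(self, row, qualifier, timestamp, value):
        versions = self.cells[(row, qualifier)]
        versions[timestamp] = value
        # Keep only the newest max_versions timestamps, like a VERSIONS limit.
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]

    def scan(self, row, min_ts, max_ts):
        """Return (timestamp, qualifier) pairs with min_ts <= timestamp < max_ts."""
        return sorted(
            (ts, qualifier)
            for (r, qualifier), versions in self.cells.items() if r == row
            for ts in versions if min_ts <= ts < max_ts
        )


def load(table):
    # Two connections with identical endpoints and ports, one second apart.
    table.put("10.1.2.3#192.0.2.7", "51234#443#TCP", 1386018915000, "flags")
    table.put("10.1.2.3#192.0.2.7", "51234#443#TCP", 1386018916000, "flags")
    return table.scan("10.1.2.3#192.0.2.7", 1386018000000, 1386019000000)


one_version = load(VersionedTable(max_versions=1))
all_versions = load(VersionedTable(max_versions=2147483647))
print(len(one_version), len(all_versions))
```

With a limit of one version the earlier connection is overwritten; with the maximum limit both survive and remain reachable through the time range.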
  • HBase Rowkey Design – Second Attempt!
      • Search the other searchable fields by applying a QualifierFilter:
          – QualifierFilter (=, 'regexstring:^spt#dpt#proto$')
  • Check The User Requirement
      • Searchable Fields
          – src: source ip → specify STARTROW/STOPROW
          – dst: destination ip → specify STARTROW/STOPROW
          – spt: source port → apply qualifier filter
          – dpt: destination port → apply qualifier filter
          – proto: protocol → apply qualifier filter
          – rt: timestamp → specify HBase TimeRange
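The mapping above can be sketched as a small query planner that turns a user query into scan parameters. The table names `src_table` and `dst_table`, the query keys `rt_from` and `rt_to`, and the returned dictionary layout are assumptions for illustration; the talk does not show the portal's code.

```python
import re

QUALIFIER_FIELDS = ("spt", "dpt", "proto")


def plan_scan(query):
    """Translate a query dict into table, row range, filter and time range."""
    if "src" in query:
        table, lead, other = "src_table", query["src"], query.get("dst")
    elif "dst" in query:
        table, lead, other = "dst_table", query["dst"], query.get("src")
    else:
        raise ValueError("a query needs at least one IP")

    if other:
        # Both IPs known: one exact row; the stop row is the next possible key.
        start_row = lead + "#" + other
        stop_row = start_row + "\x00"
    else:
        # One IP known: every row starting with "ip#"; '$' is the byte after '#'.
        start_row = lead + "#"
        stop_row = lead + chr(ord("#") + 1)

    filter_string = None
    if any(name in query for name in QUALIFIER_FIELDS):
        parts = [
            re.escape(str(query[name])) if name in query else "[^#]*"
            for name in QUALIFIER_FIELDS
        ]
        filter_string = "QualifierFilter (=, 'regexstring:^%s$')" % "#".join(parts)

    time_range = None
    if "rt_from" in query and "rt_to" in query:
        time_range = (query["rt_from"], query["rt_to"])

    return {"table": table, "start_row": start_row, "stop_row": stop_row,
            "filter": filter_string, "time_range": time_range}


plan = plan_scan({"dst": "192.0.2.7", "dpt": 443,
                  "rt_from": 1386018000000, "rt_to": 1386019000000})
print(plan)
```

A query that names only a destination IP is routed to the destination-keyed table, so it still gets a row range instead of a full scan.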
  • Deliver New Portal
  • Performance
      • Test on 70 million sample records
      • The search performance… (chart not preserved in the transcript)
      • Enough?
          – Malicious connections do not have large volume, so 80% of queries should get a response within a second
      • Duplication issue:
          – Since we store only the needed fields in HBase, the data volume is only 150 MB/day → duplicated: 300 MB/day
          – Storing 3 months of data = 13.5 GB → duplicated: 27 GB (GZed) (record count = 12 billion)
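The storage figures above can be reproduced with a few lines of arithmetic, assuming "3 months" means 90 days and 1 GB = 1000 MB. The raw-log comparison uses the 250 GB/day figure from the earlier slide.

```python
DAILY_MB = 150        # needed fields only, per day (from the slide)
RETENTION_DAYS = 90   # assumption: "3 months" taken as 90 days
TABLE_COPIES = 2      # data is written twice: src-keyed and dst-keyed
RAW_DAILY_GB = 250    # raw Netflow log volume per day (from the earlier slide)

single_gb = DAILY_MB * RETENTION_DAYS / 1000
dual_gb = single_gb * TABLE_COPIES
raw_tb = RAW_DAILY_GB * RETENTION_DAYS / 1000

print(single_gb)  # 13.5
print(dual_gb)    # 27.0
print(raw_tb)     # 22.5
```

Even after duplication, the stored data stays far below the 22.5 TB that 90 days of raw logs would occupy.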
  • Test on Even Larger Data
      • Test on 240 million sample records
      • The search performance… (chart not preserved in the transcript)
      • Query time stays stable for 80% of query cases
  • Flume ETL Process
  • Architecture (diagram)
      • Data source: sends Netflow via syslog
      • Query goes through the HBase Thrift Server
  • Serializer
      1. Extract the needed fields from the Netflow log
      2. Create an HBase Put object for the Sink to execute
      • Flume process (diagram): Data source → Flume Spooling Directory Source → Flume File Channel → Flume HBase Sink (with Serializer)
  • Dual Table Write
      • Diagram: Data source (Infosec) → Flume Spooling Directory Source → Channel1 → Sink1, and Channel2 → Sink2
      • flume.conf
          …
          agent1.sinks.sink1.serializer.rowKey = src, dst
          agent1.sinks.sink2.serializer.rowKey = dst, src
      • Duplicate, Again!
  • More Elegant Way
      • Diagram: Data source (Infosec) → Flume Spooling Directory Source → Channel1 → Sink1 → src table (hooked with a prePut Coprocessor) → dst table
      • Step 1: A put triggers the prePut Coprocessor
      • Step 2: The coprocessor puts to the dst table in dst#src format
      • Step 3: The regular put to the src table in src#dst format proceeds
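The three steps can be simulated in memory as below. A real coprocessor would be Java code implementing HBase's `RegionObserver.prePut`; this Python model, with its hypothetical `Table` class and `mirror_to` helper, only shows the control flow of one client put producing two writes.

```python
class Table:
    """Toy table that runs registered hooks before each put."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.pre_put_hooks = []

    def put(self, row, qualifier, timestamp, value):
        # Step 1: the put triggers every registered pre-put hook.
        for hook in self.pre_put_hooks:
            hook(row, qualifier, timestamp, value)
        # Step 3: the regular put to this table proceeds.
        self.rows.setdefault(row, {})[(qualifier, timestamp)] = value


def mirror_to(dst_table):
    """Build a hook that writes the same cell under a dst#src rowkey."""
    def hook(row, qualifier, timestamp, value):
        src, dst = row.split("#", 1)
        # Step 2: put to the dst table with the rowkey parts swapped.
        dst_table.put(dst + "#" + src, qualifier, timestamp, value)
    return hook


src_table = Table("src")
dst_table = Table("dst")
src_table.pre_put_hooks.append(mirror_to(dst_table))

# The client (the Flume sink in the talk) issues a single put.
src_table.put("10.1.2.3#192.0.2.7", "51234#443#TCP", 1386018915000, "flags")
print(sorted(src_table.rows), sorted(dst_table.rows))
```

With the hook in place, Flume needs only one channel and one sink; the second copy is produced on the HBase side.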
  • Experience Sharing & Future Work
  • Experience Sharing
      • Thrift
          – Thrift is not a first-class citizen of HBase; for example, Thrift does not support Scan with TimeRange and Version
          – New filters (for example, FuzzyRowFilter) are not supported, since Thrift has its own filter language
      • Bottle
          – It won't hurt to throw away web backend code that is implemented with Bottle
  • Experience Sharing
      • Flume
          – There is also a Flume Syslog UDP Source, but it does not work well without extra work
              • 768 bytes per message limitation (fixed in FLUME-2130)
              • Still has a 2048-byte limitation in the netty event decoder
              • Data may be lost when messages get concatenated...
          – Spooling Directory Source is much more stable
  • Future Work
      • Make the index table transparent to clients
          – Use a coprocessor to hook into the client scan and decide which table to scan
      • Make Thrift scan support specifying versions:
          – Currently I use scan to fetch rows and qualifiers, then use getVer to fetch the different versions (Thrift does support "version" on get)
  • Questions?
  • Thank you !