This document summarizes a research paper on scalable NetFlow analysis using Hadoop. It discusses:
1) The challenges of analyzing large volumes of Internet traffic data, including scalability, fault tolerance, and extensibility.
2) How Hadoop can help address these challenges by providing distributed computing and storage capabilities to process petabytes of data across thousands of nodes.
3) The design of a Hadoop-based traffic processing tool for collecting, storing, and analyzing NetFlow and packet data at scale through MapReduce jobs.
1. Scalable NetFlow Analysis with Hadoop
Yeonhee Lee and Youngseok Lee
{yhlee06, lee}@cnu.ac.kr
http://networks.cnu.ac.kr/~yhlee
Chungnam National University, Korea
January 8, 2013
FloCon 2013
4. Internet Measurement
• Challenges
  • Scalability
  • Fault tolerance
  • Extensibility
• CAIDA data
  • Capture, curation, storage, search, sharing, analysis, and visualization
  • Ark topology: 1.8 TB
  • Telescope: 102 TB
  • Packet headers: 18.8 TB
Josh Polterock, "CAIDA: A Data Sharing Case Study," Security at the Cyber Border: Exploring Cybersecurity for International Research Network Connections workshop, 2012
5. Harness Distributed Computing and Storage?
Google MapReduce (2004) and the Apache Hadoop project
• 1 PB sorting by Google
  • 2008: 6 hours and 2 minutes on 4,000 computers
  • 2011: 33 minutes on 8,000 computers
  • 2011: 10 PB on 8,000 computers in 6 hours and 27 minutes
6. Our Proposal
Hadoop-based Traffic Measurement and Analysis Platform
[Architecture diagram: an administrator queries results through a web visualizer and Hive; a Traffic Collector on the master node ingests NetFlow v5 and packet data; the Traffic Analyzer runs traffic-analysis mappers and reducers on the slave nodes through pcap, binary, and NetFlow I/O modules backed by HDFS (Hadoop).]
1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
2. Yeonhee Lee and Youngseok Lee, "A Hadoop-based Packet Trace Processing Tool," TMA, April 2011
3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop," ACM CoNEXT Student Workshop, Dec. 2011
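As context for the NetFlow I/O module above: NetFlow v5 is a fixed-layout binary format, so decoding is straightforward. Below is a minimal Java sketch based on Cisco's published v5 layout (a 24-byte header followed by 48-byte flow records); it is an illustration, not the authors' released code, and the NetFlowV5Record name is ours.

```java
// A minimal NetFlow v5 decoder sketch (not the authors' released code).
// Offsets follow Cisco's published v5 layout: a 24-byte header carrying
// the record count and export time, then up to thirty 48-byte flow records.
import java.nio.ByteBuffer;

public class NetFlowV5Record {
    public final long unixSecs;          // export time from the header
    public final int srcAddr, dstAddr;   // raw 32-bit IPv4 addresses
    public final long packets, octets;   // per-flow packet and byte counts
    public final int srcPort, dstPort;
    public final int protocol;           // IP protocol number
    public final int srcAs, dstAs;       // source/destination AS numbers

    private NetFlowV5Record(long unixSecs, ByteBuffer r) {
        this.unixSecs = unixSecs;
        srcAddr  = r.getInt(0);
        dstAddr  = r.getInt(4);
        packets  = r.getInt(16) & 0xFFFFFFFFL;  // unsigned 32-bit
        octets   = r.getInt(20) & 0xFFFFFFFFL;
        srcPort  = r.getShort(32) & 0xFFFF;     // unsigned 16-bit
        dstPort  = r.getShort(34) & 0xFFFF;
        protocol = r.get(38) & 0xFF;
        srcAs    = r.getShort(40) & 0xFFFF;
        dstAs    = r.getShort(42) & 0xFFFF;
    }

    /** Decodes all flow records of one v5 export datagram. */
    public static NetFlowV5Record[] decode(byte[] datagram) {
        ByteBuffer buf = ByteBuffer.wrap(datagram);   // network byte order
        int version = buf.getShort(0) & 0xFFFF;
        int count   = buf.getShort(2) & 0xFFFF;
        if (version != 5 || datagram.length < 24 + 48 * count)
            throw new IllegalArgumentException("not a complete NetFlow v5 datagram");
        long unixSecs = buf.getInt(8) & 0xFFFFFFFFL;
        NetFlowV5Record[] records = new NetFlowV5Record[count];
        for (int i = 0; i < count; i++) {
            buf.position(24 + 48 * i);
            records[i] = new NetFlowV5Record(unixSecs, buf.slice());
        }
        return records;
    }
}
```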
7. Related Work
• Traffic analysis of the DNS root server (RIPE, Nov. 2011)
• PacketPig (Mar. 2012): big-data security analytics platform
• Sherpasurfing: open-source cyber security solution (Hadoop World 2011)
  • Firewall/IDS logs, NetFlow/packet data
• "Performing Network and Security Analytics with Hadoop" (Travis Dawson, Narus), Hadoop Summit 2012
• Distributed Bro (IDS)
15. Block-level I/O vs. File-level I/O
[Chart: completion time in minutes of the IP analysis job with block-level I/O vs. file-level I/O as the number of nodes grows; block-level I/O reaches a speedup of roughly 3.5x to 4.3x over file-level I/O.]
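The gap in the chart reduces to how map input splits are formed. Below is a Java sketch of the relevant knob, assuming the comparison is realized through Hadoop's standard input-split mechanism; this is an illustration, not the authors' implementation, and a concrete subclass would still need a RecordReader that aligns 48-byte records to split boundaries.

```java
// Sketch of file-level vs. block-level I/O in Hadoop terms (an
// illustration, not the authors' implementation).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public abstract class TraceInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

    // File-level I/O: returning false makes a whole trace file one split,
    // so a single mapper streams it, often from remote DataNodes.
    // Block-level I/O: returning true (the FileInputFormat default) turns
    // each HDFS block into its own split, processed by a mapper scheduled
    // near that block; that locality is the source of the speedup above.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // flip to true for block-level I/O
    }
}
```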
16. Challenges
1. Data handling issues in Hadoop
2. Distributed traffic-analysis MapReduce algorithms
3. Performance tuning in a large-scale Hadoop testbed
17. DistributedCache
Aggregation filtering rule:
cnu;srcip=168.188.0.0-168.188.255.255
Aggregation rule:
as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;srcsubnet;dstsubnet;srcport;dstport;
[Data-flow diagram: blocks are read from HDFS; in the map phase each IP/UDP packet is decoded into a NetFlow v5 header and records, filtered, identified by key type (AS, port, protocol, subnet), and aggregated under a group key K: time|AS with value V: counts per AS (# of octets, # of packets, # of flows); the reduce phase sums the counts and writes the results back to HDFS.]
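To make this data flow concrete, here is a hedged Java sketch of the aggregation job (not the authors' released code): the filtering rule is loaded from DistributedCache in setup(), the mapper emits the group key K: time|AS with per-flow counts, and the reducer sums them. The 5-minute time bin, the AsAggregation/AsMapper names, and the assumption that the input format delivers one v5 datagram per record (decoded with the NetFlowV5Record sketch above) are ours.

```java
// Hedged sketch of the per-AS aggregation from slide 17 (not the authors'
// released code): rules come from DistributedCache, the mapper emits
// time|AS keys, and the reducer sums (octets, packets, flows) per key.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AsAggregation {

    public static class AsMapper
            extends Mapper<LongWritable, BytesWritable, Text, Text> {
        private long srcIpLow, srcIpHigh;   // filter range from the rule file

        @Override
        protected void setup(Context context) throws IOException {
            // Rule file shipped via job.addCacheFile(...#rules), so it is
            // symlinked as "rules" in the task's working directory.
            // Format from slide 17: cnu;srcip=168.188.0.0-168.188.255.255
            try (BufferedReader in = new BufferedReader(new FileReader("rules"))) {
                String[] range = in.readLine().split("=")[1].split("-");
                srcIpLow  = ipToLong(range[0]);
                srcIpHigh = ipToLong(range[1]);
            }
        }

        @Override
        protected void map(LongWritable offset, BytesWritable datagram,
                           Context ctx) throws IOException, InterruptedException {
            for (NetFlowV5Record rec : NetFlowV5Record.decode(datagram.copyBytes())) {
                long src = rec.srcAddr & 0xFFFFFFFFL;
                if (src < srcIpLow || src > srcIpHigh) continue;  // filtering
                // Group key K: time|AS (5-minute bin assumed);
                // value V: octets, packets, one flow.
                String key = (rec.unixSecs / 300) + "|" + rec.srcAs;
                ctx.write(new Text(key),
                          new Text(rec.octets + "," + rec.packets + ",1"));
            }
        }

        private static long ipToLong(String dotted) {
            long v = 0;
            for (String p : dotted.split("\\.")) v = (v << 8) | Integer.parseInt(p);
            return v;
        }
    }

    public static class SumReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long octets = 0, packets = 0, flows = 0;
            for (Text v : values) {
                String[] c = v.toString().split(",");
                octets  += Long.parseLong(c[0]);
                packets += Long.parseLong(c[1]);
                flows   += Long.parseLong(c[2]);
            }
            ctx.write(key, new Text(octets + "," + packets + "," + flows));
        }
    }
}
```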
31. Summary
• NetFlow analysis with Hadoop
  • NetFlow v5 processing module
  • MapReduce algorithms: statistics
• Distributed computing and storage with Hadoop
  • Fits Internet measurement applications
  • Scalability
• Source code (packet and NetFlow tools) is available at
  • https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop
  • https://github.com/ssallys/pcap-on-Hadoop
32. Ongoing Work
• Distributed real-time monitoring
  • Rule matching for streamed NetFlow
  • Developing rules for MapReduce
  • Rule classification for dedicated rule matching
• Scalable collection
  • e.g., 10GE → 10 x 1 GE for HDFS
• Integration
  • Streaming packages
• Enhanced analytics
  • Data mining: Mahout
  • Machine learning
[Diagram: the Hadoop analysis stack from HDFS and MapReduce at the bottom up through Pig, Hive, and Mahout to RHive, RHadoop, and Rhipe; productivity increases toward the top of the stack, performance toward the bottom.]
33. Reference
• Papers
  1. Y. Lee and Y. Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
  2. Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace Processing Tool," The Third TMA, April 2011
  3. Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop," ACM CoNEXT Student Workshop, Dec. 2011
• Software
  1. http://networks.cnu.ac.kr/~yhlee
  2. https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop
  3. https://github.com/ssallys/pcap-on-Hadoop