This document summarizes a research paper on scalable NetFlow analysis using Hadoop. It discusses:
1) The challenges of analyzing large volumes of Internet traffic data, including scalability, fault tolerance, and extensibility.
2) How Hadoop can help address these challenges by providing distributed computing and storage capabilities to process petabytes of data across thousands of nodes.
3) The design of a Hadoop-based traffic processing tool for collecting, storing, and analyzing NetFlow and packet data at scale through MapReduce jobs.
1. Scalable NetFlow Analysis with Hadoop
Yeonhee Lee and Youngseok Lee
{yhlee06, lee}@cnu.ac.kr
http://networks.cnu.ac.kr/~yhlee
Chungnam National University, Korea
January 8, 2013
FloCon 2013
4. Internet Measurement
• Challenges
  • Scalability
  • Fault tolerance
  • Extensibility
• CAIDA data
  • Capture, curation, storage, search, sharing, analysis, and visualization
  • Ark topology: 1.8 TB
  • Telescope: 102 TB
  • Packet headers: 18.8 TB
Josh Polterock, "CAIDA: A Data Sharing Case Study," Security at the Cyber Border: Exploring Cybersecurity for International Research Network Connections workshop, 2012
5. Harness Distributed Computing and Storage?
Google MapReduce (2004) and the Apache Hadoop project
• 1 PB sorting by Google
  • 2008: 6 hours and 2 minutes on 4,000 computers
  • 2011: 33 minutes on 8,000 computers
  • 2011: 10 PB on 8,000 computers in 6 hours and 27 minutes
6. Our Proposal
Hadoop-based Traffic Measurement and Analysis Platform
[Architecture diagram: an administrator queries results through a web visualizer and Hive; a Traffic Collector on the master node ingests NetFlow v5 and packet data; the Traffic Analyzer runs traffic-analysis mappers and reducers on the slave nodes through pcap, binary, and NetFlow I/O modules backed by HDFS (Hadoop).]
1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
2. Yeonhee Lee and Youngseok Lee, "A Hadoop-based Packet Trace Processing Tool," TMA, April 2011
3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop," ACM CoNEXT Student Workshop, Dec. 2011
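As context for the NetFlow I/O module above: NetFlow v5 is a fixed-layout binary format, so decoding is straightforward. Below is a minimal Java sketch based on Cisco's published v5 layout (a 24-byte header followed by 48-byte flow records); it is an illustration, not the authors' released code, and the NetFlowV5Record name is ours.

```java
// A minimal NetFlow v5 decoder sketch (not the authors' released code).
// Offsets follow Cisco's published v5 layout: a 24-byte header carrying
// the record count and export time, then up to thirty 48-byte flow records.
import java.nio.ByteBuffer;

public class NetFlowV5Record {
    public final long unixSecs;          // export time from the header
    public final int srcAddr, dstAddr;   // raw 32-bit IPv4 addresses
    public final long packets, octets;   // per-flow packet and byte counts
    public final int srcPort, dstPort;
    public final int protocol;           // IP protocol number
    public final int srcAs, dstAs;       // source/destination AS numbers

    private NetFlowV5Record(long unixSecs, ByteBuffer r) {
        this.unixSecs = unixSecs;
        srcAddr  = r.getInt(0);
        dstAddr  = r.getInt(4);
        packets  = r.getInt(16) & 0xFFFFFFFFL;  // unsigned 32-bit
        octets   = r.getInt(20) & 0xFFFFFFFFL;
        srcPort  = r.getShort(32) & 0xFFFF;     // unsigned 16-bit
        dstPort  = r.getShort(34) & 0xFFFF;
        protocol = r.get(38) & 0xFF;
        srcAs    = r.getShort(40) & 0xFFFF;
        dstAs    = r.getShort(42) & 0xFFFF;
    }

    /** Decodes all flow records of one v5 export datagram. */
    public static NetFlowV5Record[] decode(byte[] datagram) {
        ByteBuffer buf = ByteBuffer.wrap(datagram);   // network byte order
        int version = buf.getShort(0) & 0xFFFF;
        int count   = buf.getShort(2) & 0xFFFF;
        if (version != 5 || datagram.length < 24 + 48 * count)
            throw new IllegalArgumentException("not a complete NetFlow v5 datagram");
        long unixSecs = buf.getInt(8) & 0xFFFFFFFFL;
        NetFlowV5Record[] records = new NetFlowV5Record[count];
        for (int i = 0; i < count; i++) {
            buf.position(24 + 48 * i);
            records[i] = new NetFlowV5Record(unixSecs, buf.slice());
        }
        return records;
    }
}
```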
7. Related Work
• Traffic analysis of the DNS root server (RIPE, Nov. 2011)
• PacketPig (Mar. 2012): big-data security analytics platform
• Sherpasurfing: open-source cyber security solution (Hadoop World 2011)
  • Firewall/IDS logs, NetFlow/packet data
• "Performing Network and Security Analytics with Hadoop" (Travis Dawson, Narus), Hadoop Summit 2012
• Distributed Bro (IDS)
15. Block-level I/O vs. File-level I/O
[Chart: completion time in minutes of the IP analysis job with block-level I/O vs. file-level I/O as the number of nodes grows; block-level I/O reaches a speedup of roughly 3.5x to 4.3x over file-level I/O.]
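The gap in the chart reduces to how map input splits are formed. Below is a Java sketch of the relevant knob, assuming the comparison is realized through Hadoop's standard input-split mechanism; this is an illustration, not the authors' implementation, and a concrete subclass would still need a RecordReader that aligns 48-byte records to split boundaries.

```java
// Sketch of file-level vs. block-level I/O in Hadoop terms (an
// illustration, not the authors' implementation).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public abstract class TraceInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

    // File-level I/O: returning false makes a whole trace file one split,
    // so a single mapper streams it, often from remote DataNodes.
    // Block-level I/O: returning true (the FileInputFormat default) turns
    // each HDFS block into its own split, processed by a mapper scheduled
    // near that block; that locality is the source of the speedup above.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // flip to true for block-level I/O
    }
}
```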
16. Challenges
1. Data handling issues in Hadoop
2. Distributed traffic-analysis MapReduce algorithms
3. Performance tuning in a large-scale Hadoop testbed
17. DistributedCache
Aggregation filtering rule:
cnu;srcip=168.188.0.0-168.188.255.255
Aggregation rule:
as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;srcsubnet;dstsubnet;srcport;dstport;
[Data-flow diagram: blocks are read from HDFS; in the map phase each IP/UDP packet is decoded into a NetFlow v5 header and records, filtered, identified by key type (AS, port, protocol, subnet), and aggregated under a group key K: time|AS with value V: counts per AS (# of octets, # of packets, # of flows); the reduce phase sums the counts and writes the results back to HDFS.]
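To make this data flow concrete, here is a hedged Java sketch of the aggregation job (not the authors' released code): the filtering rule is loaded from DistributedCache in setup(), the mapper emits the group key K: time|AS with per-flow counts, and the reducer sums them. The 5-minute time bin, the AsAggregation/AsMapper names, and the assumption that the input format delivers one v5 datagram per record (decoded with the NetFlowV5Record sketch above) are ours.

```java
// Hedged sketch of the per-AS aggregation from slide 17 (not the authors'
// released code): rules come from DistributedCache, the mapper emits
// time|AS keys, and the reducer sums (octets, packets, flows) per key.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AsAggregation {

    public static class AsMapper
            extends Mapper<LongWritable, BytesWritable, Text, Text> {
        private long srcIpLow, srcIpHigh;   // filter range from the rule file

        @Override
        protected void setup(Context context) throws IOException {
            // Rule file shipped via job.addCacheFile(...#rules), so it is
            // symlinked as "rules" in the task's working directory.
            // Format from slide 17: cnu;srcip=168.188.0.0-168.188.255.255
            try (BufferedReader in = new BufferedReader(new FileReader("rules"))) {
                String[] range = in.readLine().split("=")[1].split("-");
                srcIpLow  = ipToLong(range[0]);
                srcIpHigh = ipToLong(range[1]);
            }
        }

        @Override
        protected void map(LongWritable offset, BytesWritable datagram,
                           Context ctx) throws IOException, InterruptedException {
            for (NetFlowV5Record rec : NetFlowV5Record.decode(datagram.copyBytes())) {
                long src = rec.srcAddr & 0xFFFFFFFFL;
                if (src < srcIpLow || src > srcIpHigh) continue;  // filtering
                // Group key K: time|AS (5-minute bin assumed);
                // value V: octets, packets, one flow.
                String key = (rec.unixSecs / 300) + "|" + rec.srcAs;
                ctx.write(new Text(key),
                          new Text(rec.octets + "," + rec.packets + ",1"));
            }
        }

        private static long ipToLong(String dotted) {
            long v = 0;
            for (String p : dotted.split("\\.")) v = (v << 8) | Integer.parseInt(p);
            return v;
        }
    }

    public static class SumReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long octets = 0, packets = 0, flows = 0;
            for (Text v : values) {
                String[] c = v.toString().split(",");
                octets  += Long.parseLong(c[0]);
                packets += Long.parseLong(c[1]);
                flows   += Long.parseLong(c[2]);
            }
            ctx.write(key, new Text(octets + "," + packets + "," + flows));
        }
    }
}
```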
31. Summary
• NetFlow analysis with Hadoop
  • NetFlow v5 processing module
  • MapReduce algorithms: statistics
• Distributed computing and storage with Hadoop
  • Fits Internet measurement applications
  • Scalability
• Source code (packet and NetFlow tools) is available at
  • https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop
  • https://github.com/ssallys/pcap-on-Hadoop
32. Ongoing Work
• Distributed real-time monitoring
  • Rule matching for streamed NetFlow
  • Developing rules for MapReduce
  • Rule classification for dedicated rule matching
• Scalable collection
  • e.g., 10GE → 10 x 1 GE for HDFS
• Integration
  • Streaming packages
• Enhanced analytics
  • Data mining: Mahout
  • Machine learning
[Diagram: the Hadoop analysis stack from HDFS and MapReduce at the bottom up through Pig, Hive, and Mahout to RHive, RHadoop, and Rhipe; productivity increases toward the top of the stack, performance toward the bottom.]
33. Reference
• Papers
  1. Y. Lee and Y. Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
  2. Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace Processing Tool," The Third TMA, April 2011
  3. Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop," ACM CoNEXT Student Workshop, Dec. 2011
• Software
  1. http://networks.cnu.ac.kr/~yhlee
  2. https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop
  3. https://github.com/ssallys/pcap-on-Hadoop