OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
Security Threats to Hadoop: Data Leakage Attacks and Investigation
1. Security Threats to Hadoop: Data Leakage Attacks
and Investigation
Presented By
KIRAN GAJBHIYE
2. Contents
• Introduction
• Hadoop security
• Hadoop Components
• Data Leakage Attack
• Analyze Data Leakage in Data analyzer
• Disadvantages of Data Leakage
• overcome the threat
• Conclusion
3. Introduction
• Hadoop is an open-source framework that allows to store and process big
data
• popular platforms for big data storage and analysis
ex - manufacturing, healthcare, insurance, and retail
• Hadoop consists of core libraries
• A distributed file system (HDFS) decrease data losses
• YARN for resource management and application scheduling
• Map Reduce for data processing
• powerful processing capacity, huge storage capacity, scalability
4. Hadoop Security
• Hadoop Contains Sensitive Data
As Hadoop adoption grows too has the types of data organizations look to store.
Often the data is proprietary or personal and it must be protected
• Add Kerberos-based authentication to server
– Establishes identity for clients, hosts and services
– Prevents impersonation/passwords are never sent over the wire
• Data stored in encrypted form
• MySQL is often used
5. Hadoop Components
MapReduce (version 1.2.1)
• Scheduling
• Map Reduce is a framework used to write the applications that process
large amounts of data on hardware
• MapReduce framework divides an input file into many chunks and then a
mapper for each chunk reads the data, does computations and provides
outputs in the form of key/value pairs
6. Continue……
HDFS(Hadoop Distributed File System)
• The HDFS is the Java portable file system which is more scalable, reliable
and distributed in the Hadoop framework environment
• Store large data sets and cope with hardware failure
• Namenode
consists of a single Namenode, a master server that manages the file system
namespaces and regulates the access to files by clients
• DataNode
responsible for serving read and write requests from file system clients
8. Data Leakage Attack
application layer data leakage
attackers can obtain private data by application-layer vulnerability or
malware
ex - A vulnerability in the current Hadoop audit mechanism is that it only
records the operation type, time, and content, but no information on who did this
operation
9. Continue…
operating system layer data leakage
• if attackers have the permission to log-in to the host operating
system of a Hadoop node
• They can bypass the monitor of Hadoop and steal data directly
in the OS layer
10. Investigate on Data Leakage
• an investigation framework aimed at data leakage attacks in Hadoop was
proposed
• This framework is composed of many data collectors and one data analyzer
• The data collector is located in the kernel of the host operating system in Hadoop
nodes
• It actively monitors the accesses to important data on each node and transmits
these behavior logs to the data analyzer
• The data it collects includes Hadoop logs, the Fsimage file, our own monitor (i.e.,
HProgger) logs and images or logs of files, processes, networks, and system
13. Continue…
OS-layer data leakage attack, the detection algorithm works in
four dimensions:
• Abnormal directory
In collected logs, if any directory that is out of this range is found, an attack may
have happened.
• Abnormal user
if any other user instead of the Hadoop super user is found in HProgger logs, an
attack may have happened
14. Continue…
• Abnormal operation
if an attacker wants to steal a block from the physical machine, they need to copy
(move) it to another directory and sometimes may rename the block, so if any of
these operations are found, the attack may have happened
• Block Proportion
SusBlockNum (FileID) represent the number of suspicious blocks in a file dataset
and AllBlockNumber (FileID) represent the total number of blocks in a file dataset.
We define block proportion (BP) as the result of dividing SusBlockNum (FileID) by
AllBlockNumber (FileID).
15. Disadvantages of data leakage
• Lost of important and confidential documents
• Hack the whole system
•
16. Overcome the threat
• Transport-level encryption
• Kerberos used for strong authentication
• Apache Knox used for perimeter security
• Data-centric security
• Data-at-Rest Encryption
• Data-in-Motion Encryption
17. Conclusion
This including an on-demand data collection method and an automatic analysis
method for data leakage attacks in Hadoop. It collects data from the machines in the
Hadoop cluster to our forensic server and then analyzes them.
With the automatic detection algorithm, it can find out whether there exist
suspicious data leakage behaviors and give warnings and evidence to users. This
collected evidence can be used to find the attackers and reconstruct the attack
scenarios.
18. References
• D. Das and O. O’Malley, “Adding Security to Apache Hadoop,” Hortonworks
Technical Report, 2011.
• S. Park and Y. Lee, “Secure Hadoop with Encrypted HDFS,” GPC 2013, Seoul,
Korea, 2013, May 9–11.
• R. K. L. Ko and M. A. Will. “Progger: An Efficient, Tamper-Evident Kernel-
Space Logger for Cloud Data Provenance Tracking,” CLOUD 2014, Anchorage,
AK, 2014, June 27– July 2.