Forensic Readiness on Hadoop Platform: Non-Ambari HDP as a Case Study

Myat Nandar Oo¹, Thandar Thein²
¹University of Computer Studies, Yangon; ²University of Computer Studies, Maubin
¹myatnandaroo@gmail.com, ²thandartheinn@gmail.com
Abstract: Forensic examiners are in an uninterrupted battle with criminals over the use of the Hadoop Platform, so forensic investigation of composite Hadoop Platforms is an emerging field for forensic practitioners. The major challenge in this environment is generating effective evidence from the sheer amount of Hadoop backlogs awaiting analysis to embody the criminal activity; extracting evidence from such a volume of backlogs can consume considerable time and resources. To address these challenges in generating evidence, forensic readiness can assist forensic practitioners in performing effective forensic work. This paper undertakes forensic research with a twofold contribution: (i) it identifies the forensically important artifacts on a Hadoop Platform, the Non-Ambari Hortonworks Data Platform (HDP), and (ii) it proposes forensic readiness by means of analysis of those residual artifacts. The outcomes of this paper contribute a Hadoop Platform forensic readiness that enables effective evidence generation in real-world forensics on the Hadoop Platform.
Keywords: forensic investigation, forensic readiness, Hadoop Platform, residual artifact
I. INTRODUCTION
Hadoop is a reliable platform for storage and analysis, with adaptable facility software packages for cloud and Big Data environments. Hadoop is built for distributed storage and computing. Because it is written in Java, a platform-neutral language, it is a cross-platform solution that can run on a wide array of operating systems. Hadoop itself serves as its own abstract platform layer; it can be accessed and run almost entirely independently of the host operating system.
Hadoop version 0.1.0 was published in April 2006, and new versions have followed steadily since [13]. At the time of writing, the latest release, Apache Hadoop 2.7, became available in June 2016 [3]. Hadoop is changing rapidly, and new software packages are continually being added to it. Over time, parts of the original Apache Hadoop project have grown into separate software such as Hue, Avro, HBase, Pig, HCatalog, Hive, Flume, Oozie, Sqoop, and Zookeeper [14].
According to a Statista report [9], the Hadoop market was valued at 6 billion U.S. dollars worldwide in 2015. A number of companies bundle Hadoop and related technologies into their own distributions, the Hadoop Platforms. The three prominent Hadoop Platforms are MapR, Cloudera, and Hortonworks [11]; among them, HDP is the fully open-source Hadoop Platform. HDP is also growing faster than the competing Hadoop Platforms: Hortonworks' total gross profit was $38.1 million for the first quarter of 2017, compared to $25.0 million for the same period a year earlier [15]. Hue provides a Web application interface for Hadoop, supporting a file browser and a web user interface for all applications on Hadoop.
Many organizations do business on Hadoop Platforms, so criminals also find ways to use them illegally. There is a need for forensic capabilities that support the investigation of crime on Hadoop Platforms. A forensic investigation of HDP may take excessive time if it is conducted without knowing where residual artifacts may reside. Forensic readiness needs to be implemented to assist real-world forensics on such complex platforms. To the best of our knowledge, there are no other publications on forensic readiness for the Hadoop Platform.
In this paper, the residual artifacts on the Hadoop Platform Non-Ambari HDP are discovered for forensic investigation. The paper focuses on implementing forensic readiness through cluster analysis of these residual artifacts. It aims to extract effective evidence, helping forensic examiners perform their work while reducing the time and resources spent in real-world investigations on the Hadoop Platform.
This paper is organized as follows: Section 2 describes the theoretical background and current literature on digital forensics, Hadoop forensics, log analysis for forensics, and forensic readiness. Section 3 describes the architecture of the targeted Hadoop Platform, Non-Ambari HDP. Section 4 presents the research questions for forensic investigation on the targeted platform, together with the research methodology that answers them. Section 5 demonstrates generating evidence for a crime case by applying the proposed forensic readiness. Section 6 summarizes the paper.
II. RELATED WORKS
Related work on Hadoop forensic investigation, covering various aspects, is discussed in this section. The following literature review explores the approaches used by other researchers in this field.
A. Hadoop Forensics
The evolution of the Hadoop Data Platform brings new challenges to forensic investigation. Forensic investigation of Hadoop is a novel field, hence there are limited publications in this area. The following subsections discuss the related work.
Cho et al. [5] highlighted that preceding forensic procedures are not suitable for HDFS-based cloud systems because of their characteristics: a gigantic volume of distributed data, multiple users, and multi-layered data structures. These characteristics create two problems in the evidence-gathering phase: file blocks are replicated on different nodes, and copying the original data requires excessive time and storage. They proposed a general forensic procedure and guideline for Hadoop-based cloud systems. In this procedure, the authors added live analysis and live collection to the original forensic procedure to avoid suspending the system. By conducting static and live collection simultaneously, Hadoop forensic analysis can reduce the time needed for proof collection. However, they did not present a case study or a specific scenario to illustrate their proposals.
A specific type of data leakage, namely data spillage, occurs when classified or sensitive information is moved onto an unauthorized or undesignated compute node or memory medium. Spillage of sensitive information from a Hadoop cluster may allow unauthorized nodes to access that information. To remedy this data spillage challenge, Alabi et al. [12] developed a forensic framework for collecting provenance data and investigating data spillage in Hadoop. The authors aimed to support the development of tools and prevention mechanisms by analyzing data spillage events in the context of Hadoop environments and MapReduce applications. System-level metadata was utilized to map the data motion in terms of how data are stored; who accesses data, where, and how; and how data change throughout their life cycle in a Hadoop cluster.
In [2], the authors discussed how Hadoop Big Data systems pose new difficulties and challenges to forensic investigators. The paper highlighted that understanding Hadoop's internal structure is an important point for forensic investigators. They pointed out that big data forensics can be performed with a range of tools and technologies, and demonstrated that an automated tool (Autopsy) can help find evidence in big data efficiently.
Information stored in Hadoop log files plays an important role as forensic evidence, which has led to large volumes of Hadoop log files awaiting analysis. Analysis of Hadoop log files can give unobtrusive insights into the behavior of users and can embody illegal usage. This paper discovers the forensically important residual artifacts on the Hadoop Platform within the huge amount of log files, which can reduce the time and resources spent in real-world forensics on that platform.
B. Log Analysis for Forensics
Backlogs are valuable sources of forensic evidence. They store information about where data was
stored, where data inputs originated, jobs that have been run, and other event-based information.
Analysis techniques have been successfully introduced in many different fields. Recently, log analysis techniques have also been applied to criminal and digital forensics. Log analysis is performed to extract powerful evidence from a large amount of data and to embody criminal activities. Examples include detecting deceptive criminal identities and identifying groups of criminals engaged in various illegal activities; such analyses typically aim to produce insight from large volumes of data.
The paper [1] explored a correlation method for establishing the provenance of time-stamped data for use as digital evidence. This work has a deep and relevant impact on digital forensics research, as it reiterated the complexity of dealing with timestamps because of drift, offsets, and possible human tampering.
In 2006 as well, research [3] explored event analysis to develop profiles for computer forensic investigation purposes. Abraham analyzed computer data in search of owner or usage profiles based on sequences of events that may occur on a system, categorizing an owner profile with four attributes: subject, object, action, and time stamp.
In 2007, the paper [6] proposed pre-retrieval and post-retrieval clustering of digital forensics text string search results. Though their work focused on text mining, the data clustering algorithms used have shown success in improving the efficiency and effectiveness of information retrieval.
C. Forensic Readiness
The initial idea of Digital Forensic Readiness (DFR) was proposed in 2001, and it has been an active research field attracting many researchers since then. DFR combines forensic expertise, hardware, and software engineering. Forensic readiness is a proactive measure that organizations need to enforce so that they can comply with forensic investigations with sufficient forensic preparedness.
Tan [7] identified the factors that affect digital forensic readiness: evidence handling, forensic acquisition, logging methods, and intrusion detection methods.
A DFR system has two objectives: maximizing an environment's ability to collect credible digital evidence, and minimizing the cost of forensics. DFR includes data collection activities that concern components such as RAM, registers, raw disks, and logs [16].
Rowlingson [3] described forensic readiness as an objective to maximize the environment's capability of collecting digital forensic information whilst minimizing the cost of the forensic investigation during an incident response.
In this paper, we implement forensic readiness by analyzing the residual artifacts. This forensic readiness can assist forensic examiners in performing effective forensic work on the Hadoop Platform.
III. ARCHITECTURE OF TARGETED HADOOP PLATFORM: NON-AMBARI HDP WITH HUE WEB UI
The HDP includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. The platform is designed to handle data from many sources and formats. It includes core Hadoop technology such as the Hadoop Distributed File System, MapReduce, and YARN, as well as facility software packages (HBase, Storm, etc.) [14].
Figure 1: Architecture of Targeted Hadoop Platform: Non-Ambari HDP (the Hue Web UI and Hue Server front the Non-Ambari HDP cluster, whose components include the Resource Manager with its Scheduler and Application Manager, the Application Master, Node Manager, Container, Namenode, Secondary Namenode, Datanode, and the Zookeeper Admin and Agent scripts)
The Ambari package is the open-source installation and management system for Hadoop Platforms. Some HDP installations use the Ambari management system (Ambari-HDP); in Ambari-HDP, Ambari can automatically install core Hadoop and the other facility software packages. Other HDP installations exclude Ambari and are installed entirely by hand; this type of HDP is called Non-Ambari HDP.
Hue (Hadoop User Experience) is an open-source web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license. Figure 1 shows the architecture of the targeted Hadoop Platform: HDP 2.3 installed as Non-Ambari HDP, with Hue 2.6.12 applied as the web UI and gateway for all operations on HDP.
IV. RESEARCH QUESTIONS FOR FORENSIC INVESTIGATION ON TARGETED HADOOP PLATFORM
For forensic investigation on the targeted Hadoop Platform, the research questions shown in Table 1 are raised.
Table 1: Research Questions for Forensic Investigation on the Targeted Hadoop Platform

Q1: What backlogs are updated per operation on the targeted Hadoop Platform?
Q2: How can the volume challenge in generating evidence for investigations on the targeted Hadoop Platform be addressed?
Q3: Among the sheer amount of backlogs, which are forensically useful as residual artifacts?
Q4: How can forensic readiness be implemented for the targeted Hadoop Platform?

The following sub-question is raised from the primary questions above.
Q3.1: For forensic readiness, how are the residual artifacts analyzed?
A. Research Methodology
With the aim of establishing forensic readiness, forensic research on the targeted Hadoop Platform, conducted as a proactive investigation, can address the challenges of real-world forensic investigation. A major challenge in this environment is the ongoing growth in the volume of Hadoop Platform backlogs presented for analysis when generating evidence. However, not all of these backlogs are forensically significant.
This paper discovers the forensically important artifacts within the sheer amount of backlogs. A corpus of potential evidence is built as forensic readiness, implemented by means of analysis of these residual artifacts. When an examiner investigates a crime case, he can effortlessly extract evidence and embody the criminal's illegal usage by applying the forensic readiness.
Figure 2 shows the workflow of the research on the targeted Hadoop Platform. It describes the responsibilities of the forensic researcher in the forensic research and how examiners apply the readiness to generate evidence that embodies the criminal activity.
Figure 2: Work Flow of Forensic Research and Investigation on Targeted Hadoop Platform (forensic research proceeds from the backlogs, through discovering the residual artifacts, to implementing forensic readiness as a corpus of potential evidence; forensic investigation proceeds from a crime case, through extracting evidence relating to the case from that corpus, to investigating the crime)
B. Residual Artifacts for Forensic Investigation on Targeted Hadoop Platform
On the Hadoop Platform, many logs are updated per operation. However, not all logs contain relevant evidence. Exposing the residual artifacts among the sheer amount of backlogs reduces the volume of data to analyze.
C. Backlogs on Hadoop Platform
By the nature of Hadoop, many log files are updated during processing, so there is a large amount of backlogs. In practice, however, most of them are not forensically valuable.
The following types of logs can be found on machines which are running HDP:
Hadoop daemon logs: These are stored in the host operating system; these .log files contain error and
warning information. By default, these log files will have a Hadoop prefix in the filename.
log4j: These logs store information from the log4j process. The log4j application is an Apache logging
interface that is used by many Hadoop applications. These logs are stored in the /var/log/hadoop directory.
Standard out and standard error: Each Hadoop TaskTracker creates and maintains these error logs
to store information written to standard out or standard error. These logs are stored in each TaskTracker node's
/var/log/hadoop/userlogs directory.
Job configuration XML: The Hadoop JobTracker creates these files within HDFS to track job summary details about the configuration and the job run. These files can be found in the /var/log/hadoop and /var/log/hadoop/history directories.
Job statistics: The Hadoop JobTracker creates these logs to store information about the number of job
step attempts and the job runtime for each job.
Some noticeable log files which are updated per Hadoop operation are the following (a sketch for enumerating them on disk follows the list):
 Hadoop daemon logs: Hadoop-<user-running-Hadoop>-<daemon>-<hostname>.log, for example Hadoop-Hadoop-Datanode-IP-xxxx.log.
 JobTracker logs: created by the jobtracker and stored in two places, /var/log/Hadoop and /var/log/Hadoop/history, or in home/Hadoop/logs/history/done/version x/<host-job_id>/<year>/<month>/<day>/<serial>.
 TaskTracker logs (for a particular task attempt): created by each tasktracker and captured when a task attempt is run, under /var/log/Hadoop/userlogs/attempt_<job-id>_<map-or-reduce>_<attempt-id>.
 HDFS metadata:
   - fsimage: contains the complete file system state at a point in time; each change is allocated a unique, monotonically increasing transaction ID.
   - edits: a log listing each file system change (file creation, deletion, or modification) made after the most recent fsimage.
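As an illustration of how these backlogs could be enumerated on disk, the following minimal sketch globs the default locations restated from the list above; the directories and patterns are assumptions that may differ per installation.

```python
from pathlib import Path

# Default log locations restated from the list above; adjust per cluster.
LOG_PATTERNS = {
    "daemon": ("/var/log/hadoop", "hadoop-*-*-*.log"),
    "jobtracker_history": ("/var/log/hadoop/history", "*"),
    "tasktracker": ("/var/log/hadoop/userlogs", "attempt_*"),
}

def enumerate_backlogs(patterns=LOG_PATTERNS):
    """Yield (category, path) pairs for every backlog file found."""
    for category, (directory, pattern) in patterns.items():
        for path in Path(directory).glob(pattern):
            if path.is_file():
                yield category, path

for category, path in enumerate_backlogs():
    print(f"{category}: {path}")
```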
D. Residual Artifacts on Targeted Hadoop Platform
The major challenge to digital forensic analysis is the ongoing growth in the volume of backlogs seized and presented for analysis. This paper therefore seeks to identify and locate the backlogs that are forensically important as residual artifacts, which leads to an efficient way to extract effective forensic evidence while reducing time and resources.
Among the large number of log files, the contents of the namenode.log, syslog, blk_#####, job-######, hdfs-audit.log, and hue-access.log files are useful in a forensic context; their contents are able to uncover criminal activity. Among them, the forensically complete logs that can embody a crime scene are hue-access.log and hdfs-audit.log: criminal activity can be embodied by applying only these two log files, even excluding all other backlogs. In the configuration files that handle these logs, the maximum log size can be adjusted; when the maximum size is reached, the old files are kept as backup log files. Table 2 gives the locations of these two log files and the configuration files that store their variables and default locations.
Table 2: Residual Artifacts (Forensically Important Log Files) of the Targeted Hadoop Platform

File Name       | Location                       | Configuration Files
hue-access.log  | /var/log/hue/                  | /etc/hue/config/log.config, access.py
hdfs-audit.log  | /usr/hdp/Hadoop/hdfs-audit.log | /etc/hadoop/log4j.properties
(i) hue-access.log
When a Hadoop file operation is accessed via the Hue Web UI, hue-access.log maintains a history of the requests. Each entry records information about the request, including the date/time, log level, client IP address, name of the user operating on the file, access method, and HTTP protocol. More recent entries are appended to the end of the file. Figure 3 shows the parameters and values of a record (entry) in hue-access.log.
[20/Jul/2017 20:41:27 +0000] INFO 192.168.80.1 hue - "GET /filebrowser/download//dirBb/Forensic Investigation through Data Remnants on Hadoop Big Data Storage System.pdf HTTP/1.0"

Figure 3: Parameters and Values of a Record in 'hue-access.log' (the figure annotates the fields: date, log level, client IP, user name, access method, file directory, file name, and HTTP protocol)
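As an illustration, the following minimal sketch parses such an entry into structured fields; the regular expression and field names are our own assumptions derived from the record layout in Figure 3, not part of Hue itself.

```python
import re

# Hypothetical pattern for the hue-access.log layout shown in Figure 3:
# [date] LEVEL client_ip user - "METHOD /path HTTP/x.y"
HUE_ACCESS_PATTERN = re.compile(
    r'\[(?P<date>[^\]]+)\]\s+'
    r'(?P<level>\w+)\s+'
    r'(?P<client_ip>\S+)\s+'
    r'(?P<user>\S+)\s+-\s+'
    r'"(?P<method>\w+)\s+(?P<path>.*?)\s+(?P<protocol>HTTP/[\d.]+)"'
)

def parse_hue_access_line(line: str):
    """Return a dict of fields for one hue-access.log entry, or None."""
    match = HUE_ACCESS_PATTERN.search(line)
    return match.groupdict() if match else None

sample = ('[20/Jul/2017 20:41:27 +0000] INFO 192.168.80.1 hue - '
          '"GET /filebrowser/download//dirBb/file.pdf HTTP/1.0"')
print(parse_hue_access_line(sample))
# {'date': '20/Jul/2017 20:41:27 +0000', 'level': 'INFO', ...}
```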
(ii) hdfs-audit.log
Audit logging is an accounting process that records all operations happening in Hadoop and gives higher-resolution insight into Hadoop operations. It is configured in the log4j properties. By default, the log4j.properties file has the audit log threshold set to WARN; setting this level to INFO turns audit logging on. The snippet in Table 3 shows the configuration when HDFS audit logging is turned on.
Table 3: 'log4j.properties' Configuration for Audit Logging

SHARED_HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
  -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=128m -XX:MaxNewSize=128m
  -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m
  -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,RFAAUDIT"
Figure 4 shows the parameters and values contained in each record of hdfs-audit.log.
2017-07-14 10:01:00,632 INFO FSNamesystem.audit: allowed=true ugi=root (auth:PROXY) via hue (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/user/hue dst=null perm=null proto=webhdfs

Figure 4: Parameters and Values of a Record in 'hdfs-audit.log' (the figure annotates the fields: date, access level, user group, user name, authentication, host IP, file operation command, file name, and protocol)
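A companion sketch for hdfs-audit.log entries, which are largely key=value pairs; the helper below is our own illustrative code, assuming the record layout of Figure 4.

```python
def parse_hdfs_audit_line(line: str) -> dict:
    """Split one hdfs-audit.log entry into its key=value fields.

    The timestamp and log-level prefix are kept under one key;
    everything after 'FSNamesystem.audit:' is key=value pairs.
    """
    prefix, _, body = line.partition("FSNamesystem.audit:")
    record = {"prefix": prefix.strip()}
    for token in body.split():
        if "=" in token:
            key, _, value = token.partition("=")
            record[key] = value
        else:
            # Tokens such as '(auth:PROXY)' or 'via' carry no key;
            # collect them as free-text remainder.
            record.setdefault("extra", []).append(token)
    return record

sample = ("2017-07-14 10:01:00,632 INFO FSNamesystem.audit: allowed=true "
          "ugi=root (auth:PROXY) via hue (auth:SIMPLE) ip=/127.0.0.1 "
          "cmd=getfileinfo src=/user/hue dst=null perm=null proto=webhdfs")
print(parse_hdfs_audit_line(sample)["cmd"])   # getfileinfo
```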
E. Forensic Readiness for the Targeted Hadoop Platform
Forensic readiness can be defined as the ability of a proactive investigation (forensic research) to maximize the potential to extract effective evidence whilst minimizing the costs of a real-world investigation. It is essential to prepare a ready condition in which an examiner can easily extract relevant and sufficient evidence.
In this paper, we implement a forensic readiness that contains a ready corpus of potential evidence, namely the clusters resulting from the log analysis model.
a. Log Analysis for Implementation of Forensic Readiness
Analyzing backlogs can give unobtrusive insights into file operations and the behavior of users, and helps draw the event line of the criminals. In this paper, cluster analysis of the discovered residual artifacts implements a corpus of potential evidence as forensic readiness, which can assist forensic examiners in extracting evidence easily.
Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to objects in other groups. The following outlines the steps of the proposed log analysis model for implementing forensic readiness, as shown in Figure 5.
(i) Collection
This step collects the forensically important data, the so-called residual artifacts, from the large amount of backlogs through special software (a write blocker) that creates a bit-by-bit copy of the actual data. The contents of the hue-access.log and hdfs-audit.log files are collected for analysis in a timely manner and stored in a data store.
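A minimal sketch of this collection step is given below, assuming the default locations from Table 2. The hash manifest is our own addition to show how a copy can be verified against the original; it illustrates rather than replaces the write blocker the paper prescribes.

```python
import hashlib
import shutil
from pathlib import Path

# Default artifact locations from Table 2 (adjust for the actual cluster).
RESIDUAL_ARTIFACTS = [
    Path("/var/log/hue/hue-access.log"),
    Path("/usr/hdp/Hadoop/hdfs-audit.log"),
]

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large logs do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def collect(artifacts, data_store: Path) -> dict:
    """Copy each artifact into the data store and record its hash."""
    data_store.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for source in artifacts:
        target = data_store / source.name
        shutil.copy2(source, target)          # copy2 preserves timestamps
        source_hash, target_hash = sha256_of(source), sha256_of(target)
        if source_hash != target_hash:
            raise IOError(f"copy of {source} does not match the original")
        manifest[str(source)] = target_hash
    return manifest

# manifest = collect(RESIDUAL_ARTIFACTS, Path("/forensics/data_store"))
```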
Figure 5: Log Analysis Model for Implementing the Forensic Readiness System (data collection produces the data set, which passes through preprocessing and cluster analysis; the resulting clusters form the corpus of potential evidence, i.e. the forensic readiness)
(ii) Preprocessing
Preprocessing is concerned with data cleaning and data transformation. The goal of data cleaning is to remove irrelevant information; in this system, some parameters in the log files are not meaningful for analysis, and eliminating those irrelevant data reduces processing time.
Data transformation converts raw data items into structured information. The residual artifacts above, the two backlogs, are unstructured data, so for analysis purposes the logs are transformed into a structured dataset.
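To make the transformation concrete, the sketch below turns raw hue-access.log lines into a structured dataset, reusing the hypothetical parse_hue_access_line helper from the sketch after Figure 3; the choice of retained fields is our own illustration, not prescribed by the paper.

```python
# Illustrative cleaning: keep only the parameters that matter for the
# cluster analysis (field names follow the hue-access.log sketch above).
RELEVANT_FIELDS = ("date", "level", "client_ip", "user", "method", "path")

def preprocess(raw_lines):
    """Transform unstructured log lines into structured records."""
    dataset = []
    for line in raw_lines:
        fields = parse_hue_access_line(line)   # from the earlier sketch
        if fields is None:
            continue                           # cleaning: drop malformed lines
        dataset.append({key: fields[key] for key in RELEVANT_FIELDS})
    return dataset

# with open("/forensics/data_store/hue-access.log") as handle:
#     dataset = preprocess(handle)
```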
(iii) Cluster Analysis
This model applies the proposed Cluster Analysis Algorithm to group the data by each significant parameter. The output of the algorithm is the set of clusters, which forms the corpus of potential evidence.
Algorithm 1: Cluster Analysis Algorithm
Input: data set D = {P1, ..., Pn}
Output: clusters C = {C1{S1}, ..., Cn{Sn}}
1. for each parameter Pi in D
2.   for each record Rj in D
3.     if Rj{Pi(value)} == Rj+1{Pi(value)}
4.       Rj is assigned to Ci{Sk}
5.     else
6.       Rj is assigned to Ci{Sk+1}
7.     end if
8.   end for
9. end for
10. return the sets in the clusters

Figure 6: Cluster Analysis Algorithm
The number of sets in each cluster is calculated by Eq. 1 and Eq. 2. Eq. 3 describes the value of a cluster and defines the symbols used in Algorithm 1.

Eq. 1: number of clusters N(C) = number of parameters in the dataset N(P)
Eq. 2: number of sets in a cluster N(S) = number of distinct values of each parameter = N(P(value))
Eq. 3: Ci = {Ci{S1}, Ci{S2}, ..., Ci{Sn}}, where i = 1, ..., N(C); S = a set in each cluster; n = N(S) in Ci
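As a concrete sketch of Algorithm 1 in Python, the code below groups records by every parameter, under our reading that each parameter yields one cluster whose sets collect the records sharing a value (so N(C) = N(P) and N(S) = N(P(value))); this dictionary-based grouping is our own equivalent of the pairwise comparison in the pseudocode.

```python
from collections import defaultdict

def cluster_analysis(dataset):
    """Group records by every parameter, per Algorithm 1.

    dataset: list of dicts, one per log record (the structured output
    of preprocessing). Returns one cluster per parameter; each cluster
    maps a parameter value to the list of records sharing it, so that
    N(C) == N(P) and N(S) == N(P(value)).
    """
    clusters = {}
    parameters = dataset[0].keys() if dataset else []
    for parameter in parameters:              # one cluster per parameter
        sets = defaultdict(list)
        for record in dataset:                # one set per distinct value
            sets[record[parameter]].append(record)
        clusters[parameter] = dict(sets)
    return clusters

# Example: clusters["user"]["Mr.Dewel"] holds every record of that user.
```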
The output of the log analysis model, the clusters of artifacts, implements the ready corpus of potential evidence as the forensic readiness, so forensic examiners can easily extract the evidence from this ready corpus to embody a crime case.
V. GENERATING EVIDENCE FOR INVESTIGATING A CRIME CASE
In this section, we test the investigation of a crime case on the targeted Hadoop Platform by applying the proposed forensic readiness. The sample crime case is described as follows:
Crime Case: The DEF organization uses the services of a Hadoop Platform of Non-Ambari HDP with Hue Web UI. Every authorized person in the organization can access it through 'http://@@@@@:8000/' from their own web browser to upload, download, and open files on it. In one suspected case, Mr. Dewel is suspected of betraying the organization; the organization therefore needs to know Mr. Dewel's usage pattern.
In this case, with the aim of tracing the activity of Mr. Dewel, the examiner applies the corpus of potential evidence in the forensic readiness. He extracts the records with username = Mr. Dewel from the cluster grouped by username. The examiner can thus trace the activity of Mr. Dewel easily, without consuming excessive time and resources.
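A minimal sketch of this extraction step against the clusters produced by the cluster_analysis sketch above; the parameter and value names are illustrative.

```python
# Hypothetical query: pull every record for the suspected user from the
# username cluster of the ready corpus built by cluster_analysis().
def extract_evidence(clusters, parameter, value):
    """Return the records in the set matching one parameter value."""
    return clusters.get(parameter, {}).get(value, [])

dewel_activity = extract_evidence(clusters, "user", "Mr.Dewel")
for record in dewel_activity:
    print(record["date"], record["method"], record["path"])
```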
VI. CONCLUSION
With the aim of tracing criminal activities, information stored in Hadoop backlogs plays an important role. However, extracting evidence from a large amount of backlogs for forensics on emerging Hadoop Platforms is a challenge for forensic investigators and researchers, and it may consume a significant amount of time and resources. Forensic readiness is one solution to this challenge. Thus, the forensic research in this paper makes a twofold contribution: first, it exposes the forensically important artifacts on the Hadoop Platform Non-Ambari Hortonworks Data Platform (HDP) with Hue Web UI; second, it proposes forensic readiness by means of cluster analysis of the exposed residual artifacts. The outcomes of this paper contribute a forensic readiness that enables effective evidence generation, reducing the time and resources spent in real-world investigations on the Hadoop Platform and assisting forensic examiners.
REFERENCES
[1] A. Clark, et al., "A correlation method for establishing provenance of timestamps in digital evidence," 6th Annual Digital Forensic Research Workshop, Digital Investigation, vol. 3, supplement 1, pp. 98-107, 2006.
[2] S. A. Thanekar, et al., "A study on digital forensics in Hadoop," International Journal of Control Theory and Applications, International Science Press, vol. 9, no. 18, pp. 8927-8933, 2016.
[3] R. Rowlingson, "A ten step process for forensic readiness," International Journal of Digital Evidence, vol. 2, no. 3, pp. 1-28, 2004.
[4] A. Pichan, et al., "Cloud forensics: Technical challenges, solutions and comparative analysis," Digital Investigation, vol. 13, pp. 38-57, 2015.
[5] C. H. Cho, et al., "Cyber forensic for Hadoop based cloud system," International Journal of Security and its Applications, vol. 6, no. 3, pp. 83-90, July 2012.
[6] J. Guynes, et al., "Digital forensics text string searching: Improving information retrieval effectiveness by thematically clustering search results," 6th Annual Digital Forensic Research Workshop, vol. 4, pp. 49-54, 2007.
[7] J. Tan, "Forensic readiness," technical report, Cambridge, USA, 2001.
[8] Association of Chief Police Officers, "Good practice guide for computer based electronic evidence," ACPO, Tech. Rep.
[9] "Hadoop/Big Data market size worldwide 2015-2020 | Statistic," Statista, 2016. Available: https://www.statista.com/statistics/587051/worldwide-Hadoop-bigdata-market/. Accessed: Nov. 8, 2016.
[10] J. Galloway, et al., "Network data mining: methods and techniques for discovering deep linkage between attributes," in APCCM '06: Proceedings of the 3rd Asia-Pacific Conference on Conceptual Modelling, pp. 21-32, Australian Computer Society, Inc., Darlinghurst, Australia, 2006. ISBN 1-920-68235-X.
[11] M. Gualtieri and N. Yuhanna, "The Forrester Wave: Big Data Hadoop Solutions, Q1 2014," Forrester, 2014.
[12] O. Alabi, et al., "Toward a data spillage prevention process in Hadoop using data provenance," Proceedings of the 2015 Workshop on Changing Landscapes in HPC Security, pp. 9-13, 2015.
[13] "Welcome to Apache Hadoop," Available: http://hadoop.apache.org/. Accessed: Nov. 8, 2016.
[14] V. Chemitiganti, "What is Apache Hadoop?," in Business Values of Hadoop, Hortonworks, 2016. Available: http://hortonworks.com/apache/hadoop.
[15] Hortonworks, "Hortonworks Q1 2017 results," Available: https://hortonworks.com/press-releases/hortonworks-q1-2017-results. Accessed: Aug. 18, 2017.
[16] L. De Marco, F. Ferrucci, and M-T. Kechadi, "Reference architecture for a cloud forensic readiness system," EAI Endorsed Transactions on Security and Safety.
More Related Content

What's hot

Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Bernardo Najlis
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methodsijcsity
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.Tatiana Tarasova
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online NewsBernardo Najlis
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopIJTET Journal
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF SourcesMore Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF SourcesAndré Valdestilhas
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAdnan Akhter
 
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...rahulmonikasharma
 
Big data for SAS programmers
Big data for SAS programmersBig data for SAS programmers
Big data for SAS programmersKevin Lee
 
Improving access to geospatial Big Data in the hydrology domain
Improving access to geospatial Big Data in the hydrology domainImproving access to geospatial Big Data in the hydrology domain
Improving access to geospatial Big Data in the hydrology domainClaudia Vitolo
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...Geoffrey Fox
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogsandrea huang
 
DIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVINGDIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVINGIJCI JOURNAL
 

What's hot (18)

Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
Social Media World News Impact on Stock Index Values - Investment Fund Analyt...
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methods
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
 
ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.ParlBench: a SPARQL-benchmark for electronic publishing applications.
ParlBench: a SPARQL-benchmark for electronic publishing applications.
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
 
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San Jose
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF SourcesMore Complete Resultset Retrieval from Large Heterogeneous RDF Sources
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning TechniquesAn Empirical Evaluation of RDF Graph Partitioning Techniques
An Empirical Evaluation of RDF Graph Partitioning Techniques
 
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
 
Big data for SAS programmers
Big data for SAS programmersBig data for SAS programmers
Big data for SAS programmers
 
Improving access to geospatial Big Data in the hydrology domain
Improving access to geospatial Big Data in the hydrology domainImproving access to geospatial Big Data in the hydrology domain
Improving access to geospatial Big Data in the hydrology domain
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
 
20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs20160818 Semantics and Linkage of Archived Catalogs
20160818 Semantics and Linkage of Archived Catalogs
 
DIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVINGDIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVING
 

Similar to Forensic Readiness on Hadoop Platform: Non-Ambari HDP as a Case Study

Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data DiscoveryBenjamin Ashkar
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Moving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and PerspectivesMoving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and PerspectivesIJRESJOURNAL
 
Data carving using artificial headers info sec conference
Data carving using artificial headers   info sec conferenceData carving using artificial headers   info sec conference
Data carving using artificial headers info sec conferenceRobert Daniel
 
An Overview Of Apache Pig And Apache Hive
An Overview Of Apache Pig And Apache HiveAn Overview Of Apache Pig And Apache Hive
An Overview Of Apache Pig And Apache HiveJoe Andelija
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSaciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutionsaciijournal
 
The Study of the Large Scale Twitter on Machine Learning
The Study of the Large Scale Twitter on Machine LearningThe Study of the Large Scale Twitter on Machine Learning
The Study of the Large Scale Twitter on Machine LearningIRJET Journal
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Ashok Royal
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)JonySaini2
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data CenterGilles Fedak
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction EMC
 

Similar to Forensic Readiness on Hadoop Platform: Non-Ambari HDP as a Case Study (20)

Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data Discovery
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
HDFS
HDFSHDFS
HDFS
 
Hadoop
Hadoop Hadoop
Hadoop
 
Moving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and PerspectivesMoving Toward Big Data: Challenges, Trends and Perspectives
Moving Toward Big Data: Challenges, Trends and Perspectives
 
Data carving using artificial headers info sec conference
Data carving using artificial headers   info sec conferenceData carving using artificial headers   info sec conference
Data carving using artificial headers info sec conference
 
IJET-V3I2P14
IJET-V3I2P14IJET-V3I2P14
IJET-V3I2P14
 
An Overview Of Apache Pig And Apache Hive
An Overview Of Apache Pig And Apache HiveAn Overview Of Apache Pig And Apache Hive
An Overview Of Apache Pig And Apache Hive
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
The Study of the Large Scale Twitter on Machine Learning
The Study of the Large Scale Twitter on Machine LearningThe Study of the Large Scale Twitter on Machine Learning
The Study of the Large Scale Twitter on Machine Learning
 
Hadoop
HadoopHadoop
Hadoop
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data Center
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction
 
Research Poster
Research PosterResearch Poster
Research Poster
 

Recently uploaded

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Axa Assurance Maroc - Insurer Innovation Award 2024
artifacts may reside. Forensic readiness therefore needs to be implemented to assist real-world forensics on such complex platforms. To the best of our knowledge, there are no other publications on forensic readiness for the Hadoop Platform.

In this paper, the residual artifacts on the Hadoop Platform Non-Ambari HDP are discovered for forensic investigation, and forensic readiness is implemented by cluster analysis of these residual artifacts. The intent is to extract effective evidence and thereby help forensic examiners perform powerful forensic work while reducing the time and resources spent in real-world investigations on the Hadoop Platform.

This paper is organized as follows. Section II describes the theoretical background and current literature on digital forensics, Hadoop forensics, log analysis for forensics, and forensic readiness. Section III describes the architecture of the targeted Hadoop Platform: Non-Ambari HDP. Section IV presents the research questions for forensic investigation on the targeted platform, together with the research methodology that answers them. Section V generates evidence for a crime case by applying the proposed forensic readiness, and Section VI summarizes the paper.

II. RELATED WORKS

This section discusses related work on various aspects of Hadoop forensic investigation. The following literature review explores the approaches used by other researchers in this field.

A. Hadoop Forensics

The evolution of the Hadoop Data Platform brings challenges to forensic investigation. Forensic investigation of Hadoop is a novel field; hence there are limited publications in this area. The following paragraphs discuss the related work.

Cho et al. [5] highlighted that preceding forensic procedures are not suitable for HDFS-based cloud systems because of their characteristics: a gigantic volume of distributed data, multiple users, and multi-layered data structures. These characteristics create two problems in the evidence-gathering phase: file blocks are replicated on different nodes, and copying the original data requires excessive time and storage. They proposed a general forensic procedure and guideline for Hadoop-based cloud systems, adding live analysis and live collection to the original forensic procedure to avoid system suspension. By conducting static and live collection simultaneously, Hadoop forensic analysis can diminish the time needed for proof collection. However, they did not present a case study or specific scenario to illustrate their proposals.

A specific type of data leakage, namely data spillage, occurs when classified or sensitive information is moved onto an unauthorized or undesignated compute node or memory medium. Spillage of sensitive information from a Hadoop cluster may allow unauthorized nodes to access that information. To remedy this challenge, Alabi et al. [12] developed a forensic framework for collecting provenance data and investigating data spillage in Hadoop. The authors aimed to support tool development and prevention mechanisms by analyzing data spillage events in the context of Hadoop environments and MapReduce applications.
System-level metadata was utilized to map data motion in terms of how data are stored; who accesses data, where, and how; and how data change throughout their life cycle in a Hadoop cluster.
In the paper [2], the authors discussed how Hadoop Big Data systems give rise to new difficulties and challenges for forensic investigators. The paper highlighted that understanding Hadoop's internal structure is an important point for forensic investigators. They pointed out that the forensics of big data can be carried out with different tools and technologies, and then demonstrated that an automated tool (Autopsy) can help find evidence in big data efficiently.

Information stored in Hadoop log files plays an important role when gathered as forensic evidence, and this has led to large volumes of Hadoop log files awaiting analysis. Analysis of Hadoop log files can give unobtrusive insights into the behavior of users and can embody illegal usage. This paper discovers the forensically important residual artifacts on the Hadoop Platform from the huge amount of log files, which can reduce the time and resources spent in real-world forensics on that platform.

B. Log Analysis for Forensics

Backlogs are valuable sources of forensic evidence. They store information about where data was stored, where data inputs originated, which jobs have been run, and other event-based information. Analysis techniques have been successfully introduced in many different fields, and recently log analysis techniques have also been applied to criminal and digital forensics. Log analysis is performed to extract powerful evidence from a large amount of data and to embody criminal activities. Examples include detecting deceptive criminal identities and identifying groups of criminals engaging in various illegal activities; such analyses typically aim to produce insight from large volumes of data.

The paper [1] explored a correlation method for establishing the provenance of time-stamped data for use as digital evidence. This work has a deep and relevant impact on digital forensics research, as it reiterated the complexity of dealing with timestamps because of drift, offsets, and possible human tampering. In 2006 as well, research [3] explored event analysis to develop profiles for computer forensic investigation purposes: Abraham analyzed computer data in search of owner or usage profiles based upon sequences of events that may occur on a system, categorizing an owner profile with four attributes: subject, object, action, and time stamp. In 2007, the paper [6] proposed pre-retrieval and post-retrieval clustering of digital forensics text string search results. Though that work was focused on text mining, the data clustering algorithms used have shown success in improving the efficiency of information retrieval efforts.

C. Forensic Readiness

The initial idea of Digital Forensic Readiness (DFR) was proposed in 2001, and it has been an active research field attracting many researchers since that time. DFR combines forensic expertise, hardware, and software engineering. Forensic readiness is a proactive measure that organizations need to enforce so that they can comply with forensic investigations with sufficient forensic preparedness. Tan [7] identified the factors that affect digital forensic readiness: evidence handling, forensic acquisition, logging methods, and intrusion detection methods. A DFR system has two objectives: maximizing an environment's ability to collect credible digital evidence, and minimizing the cost of forensics. DFR includes data collection activities concerning components such as RAM, registers, raw disks, and logs [16].
Rowlingson [3] described forensic readiness as an objective to maximize the environment's capability of collecting digital forensic information whilst minimizing the cost of the forensic investigation during an incident response.

In this paper, we implement forensic readiness by analyzing the residual artifacts. This forensic readiness can assist forensic examiners in powerful forensic work on the Hadoop Platform.

III. ARCHITECTURE OF TARGETED HADOOP PLATFORM: NON-AMBARI HDP WITH HUE WEB UI

The HDP includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. The platform is designed to deal with data from many sources and formats. It includes core Hadoop technology such as the Hadoop Distributed File System (HDFS), MapReduce, and YARN, as well as facility software packages (HBase, Storm, etc.) [14].

Figure 1: Architecture of Targeted Hadoop Platform: Non-Ambari HDP. (The figure shows the Hue Server with its Web UI as the gateway to the cluster: the Resource Manager with its Scheduler and Application Manager, the Node Manager with Application Master and Container, the Namenode, Secondary Namenode, and Datanode, and Zookeeper with its Admin and Agent scripts.)

The package Ambari is the open-source installation and management system for Hadoop Platforms. Some installations of HDP use the Ambari management system (Ambari-HDP), in which Ambari can automatically install core Hadoop and the other facility software packages. Other HDP installations exclude Ambari and are performed entirely by hand; that type of HDP is called Non-Ambari HDP. Hue (Hadoop User Experience) is an open-source web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license. Figure 1 shows the architecture of the targeted Hadoop Platform, HDP 2.3 as a Non-Ambari HDP, with Hue 2.6.12 applied as the web UI and gateway for all operations on HDP; a brief sketch of exercising this file-access path programmatically follows.
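Hue and other clients reach HDFS through REST interfaces such as WebHDFS (the proto=webhdfs field that appears in the audit records examined later). The following minimal sketch lists an HDFS directory over WebHDFS; the host name is hypothetical, and port 50070 is assumed as the NameNode web port, its usual default on Hadoop 2.x. Every such call leaves a corresponding trace in the platform's backlogs.

import json
import urllib.request

# Hypothetical NameNode address; 50070 is the usual Hadoop 2.x
# NameNode web port, and /webhdfs/v1 is the WebHDFS REST root.
NAMENODE = "http://namenode.example.org:50070"

def list_hdfs_dir(path, user="hue"):
    """List an HDFS directory over WebHDFS (op=LISTSTATUS).
    Calls like this are what later surface as proto=webhdfs
    entries in hdfs-audit.log."""
    url = f"{NAMENODE}/webhdfs/v1{path}?op=LISTSTATUS&user.name={user}"
    with urllib.request.urlopen(url) as resp:
        statuses = json.load(resp)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"], s["owner"]) for s in statuses]

if __name__ == "__main__":
    for name, ftype, owner in list_hdfs_dir("/user/hue"):
        print(f"{ftype:9} {owner:10} {name}")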
IV. RESEARCH QUESTIONS FOR FORENSIC INVESTIGATION ON TARGETED HADOOP PLATFORM

For forensic investigation on the targeted Hadoop Platform, the research questions shown in Table 1 are raised.

Table 1: Research Questions for Forensic Investigation on Targeted Hadoop Platform
Q1: What backlogs are updated per operation on the targeted Hadoop Platform?
Q2: How can the volume challenge in generating evidence for investigation on the targeted Hadoop Platform be addressed?
Q3: Among the sheer amount of backlogs, which are forensically useful as residual artifacts?
Q4: How can forensic readiness be implemented for the targeted Hadoop Platform?

A sub-question is raised from the primary questions above.
Q3.1: For forensic readiness, how are the residual artifacts analyzed?

A. Research Methodology

With the aim of establishing forensic readiness, forensic research on the targeted Hadoop Platform, conducted as a proactive investigation, can address the challenges of real-world forensic investigation. A major challenge in this environment is the ongoing growth in the volume of Hadoop Platform backlogs presented for analysis when generating evidence, yet not all of these backlogs are forensically significant. This paper discovers the forensically important artifacts among the sheer amount of backlogs. A corpus of potential evidences is prepared as forensic readiness, implemented by means of analysis of these residual artifacts. When an examiner investigates a crime case, he can then effortlessly extract evidence and embody the illegal usage of the criminal by applying the forensic readiness.

Figure 2 shows the workflow of research on the targeted Hadoop Platform. It describes the responsibilities of the forensic researcher in the forensic research and how examiners apply the readiness to generate evidence that embodies the criminal activity.

Figure 2: Work Flow of Forensic Research and Investigation on Targeted Hadoop Platform. (Forensic research proceeds from the backlogs to discovering residual artifacts and implementing forensic readiness, a corpus of potential evidences; forensic investigation then extracts evidences relating to a crime case from that corpus to investigate the crime.)

B. Residual Artifacts for Forensic Investigation on Targeted Hadoop Platform

In the Hadoop Platform, many log data are updated per operation, but not all logs contain relevant evidence. Exposing the residual artifacts among the sheer amount of backlogs can reduce the volume of data to be analyzed.
C. Backlogs on Hadoop Platform

By the nature of Hadoop, many log files are updated per processing step, so there is a large amount of backlogs; in practice, however, most of them are not forensically valuable. The following types of logs can be found on machines running HDP:

Hadoop daemon logs: These .log files are stored in the host operating system and contain error and warning information. By default, these log files have a Hadoop prefix in the filename.

log4j: These logs store information from the log4j process. The log4j application is an Apache logging interface that is used by many Hadoop applications. These logs are stored in the /var/log/hadoop directory.

Standard out and standard error: Each Hadoop TaskTracker creates and maintains these error logs to store information written to standard out or standard error. These logs are stored in each TaskTracker node's /var/log/hadoop/userlogs directory.

Job configuration XML: The Hadoop JobTracker creates these files within HDFS to track job summary details about the configuration and the job run. These files can be found in the /var/log/hadoop and /var/log/hadoop/history directories.

Job statistics: The Hadoop JobTracker creates these logs to store information about the number of job step attempts and the job runtime for each job.

Some noticeable log files that are updated per Hadoop operation are listed below; a small inventory sketch follows the list.

- Hadoop-<user-running-Hadoop>-<daemon>-<hostname>.log
  (for example: Hadoop-Hadoop-Datanode-IP-xxxx.log)
- JobTracker logs: created by the JobTracker; stored in two places, /var/log/Hadoop and /var/log/Hadoop/history, or in home/Hadoop/logs/history/done/version x/<host-job_id>/<year>/<month>/<day>/<serial>
- TaskTracker logs (for a particular task attempt): created by each TaskTracker and captured when a task attempt is run; stored in /var/log/Hadoop/userlogs/attempt_<job-id>_<map-or-reduce>_<attempt-id>
- HDFS metadata:
  fsimage contains the complete file system state at a point in time and allocates a unique, monotonically increasing transaction ID;
  edits is a log listing each file system change (file creation, deletion, or modification) made after the most recent fsimage.
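To make the extent of these backlogs concrete, the following minimal sketch walks the default log directories named above and inventories every file by path, size, and modification time. The directory list reflects the defaults cited in this section and is an assumption that would need adjusting to a given installation.

import os

# Default log locations cited above; adjust per installation.
LOG_DIRS = [
    "/var/log/hadoop",
    "/var/log/hadoop/history",
    "/var/log/hadoop/userlogs",
]

def inventory_backlogs(dirs=LOG_DIRS):
    """Walk the candidate log directories and record the path, size,
    and modification time of every file found, so the volume of
    backlogs awaiting analysis can be measured."""
    found = []
    for log_dir in dirs:
        if not os.path.isdir(log_dir):
            continue  # directory absent on this node
        for root, _subdirs, files in os.walk(log_dir):
            for name in files:
                path = os.path.join(root, name)
                info = os.stat(path)
                found.append((path, info.st_size, info.st_mtime))
    return found

if __name__ == "__main__":
    for path, size, _mtime in sorted(inventory_backlogs()):
        print(f"{size:>12} {path}")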
D. Residual Artifacts on Targeted Hadoop Platform

The major challenge to digital forensic analysis is the ongoing growth in the volume of backlogs seized and presented for analysis. Thus, this paper seeks to identify and locate the backlogs that are forensically important as residual artifacts, which leads to a decent way of extracting effective forensic evidence while reducing time and resources. Among the large number of log files, the contents of the namenode.log, syslog, blk_#####, job-######, hdfs-audit.log, and hue-access.log files are useful for the forensic environment, and these contents are able to uncover criminal activity. Among them, the forensically complete logs that can embody a crime scene are hue-access.log and hdfs-audit.log; criminal activity can be embodied by applying only these two log files, even excluding all other backlogs. In the configuration files for handling the logs, the maximum log size can be adjusted; once the maximum is reached, the old files are kept as backup log files. Table 2 gives the locations of these two log files and the configuration files that store their variables and default locations.

Table 2: Residual Artifacts (Forensically Important Log Files) of Targeted Hadoop Platform

File Name      | Location                       | Configuration Files
hue-access.log | /var/log/hue/                  | /etc/hue/config/log.config, access.py
hdfs-audit.log | /usr/hdp/Hadoop/hdfs-audit.log | /etc/hadoop/log4j.properties

(i) hue-access.log

When a Hadoop file operation is accessed via the Hue Web UI, hue-access.log maintains a history of requests. It records information about each request, including the requested date/time, log level, client IP address, name of the user operating on the file, access method, and HTTP protocol. More recent entries are typically appended to the end of the file. Figure 3 shows the parameters and values of a record (entry) in hue-access.log; a parsing sketch follows.

[20/Jul/2017 20:41:27 +0000] INFO 192.168.80.1 hue - "GET /filebrowser/download//dirBb/Forensic Investigation through Data Remnants on Hadoop Big Data Storage System.pdf HTTP/1.0"

Figure 3: Parameters and Values of a Record in 'hue-access.log' (date, log level, client IP, user name, access method, file directory and file name, HTTP protocol)
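As a sketch of how such a record can be decomposed into the fields labelled in Figure 3, the following regular expression gives one possible reading of the entry layout; the pattern is inferred from the single sample above and is an assumption, not a documented Hue log format.

import re

# Pattern inferred from the sample record in Figure 3 (an assumption,
# not an official Hue specification of hue-access.log).
HUE_ACCESS_RE = re.compile(
    r"\[(?P<date>[^\]]+)\]\s+"      # [20/Jul/2017 20:41:27 +0000]
    r"(?P<level>\w+)\s+"            # INFO
    r"(?P<client_ip>[\d.]+)\s+"     # 192.168.80.1
    r"(?P<user>\S+)\s+-\s+"         # hue -
    r'"(?P<method>\w+)\s+'          # "GET
    r"(?P<path>.+?)\s+"             # /filebrowser/download//dirBb/...
    r'(?P<protocol>HTTP/[\d.]+)"'   # HTTP/1.0"
)

def parse_hue_access(line):
    """Return the labelled fields of one hue-access.log record,
    or None when the line does not match the inferred layout."""
    match = HUE_ACCESS_RE.search(line)
    return match.groupdict() if match else None

sample = ('[20/Jul/2017 20:41:27 +0000] INFO 192.168.80.1 hue - '
          '"GET /filebrowser/download//dirBb/Forensic Investigation through '
          'Data Remnants on Hadoop Big Data Storage System.pdf HTTP/1.0"')
print(parse_hue_access(sample))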
(ii) hdfs-audit.log

Audit logging is an accounting process that logs all operations happening in Hadoop and gives higher resolution into Hadoop operations. Its configuration is presented in the log4j properties. By default, the log4j.properties file has the log threshold set to WARN; by setting this level to INFO, audit logging is turned on. The snippet in Table 3 shows the log4j.properties configuration when HDFS audit logging is turned on.

Table 3: 'log4j.properties' Configuration File for Audit Logging

SHARED_HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
  -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=128m -XX:MaxNewSize=128m
  -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx1024m
  -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,RFAAUDIT"

Figure 4 shows the parameters and values contained in each record of hdfs-audit.log; a parsing sketch follows.

2017-07-14 10:01:00,632 INFO FSNamesystem.audit: allowed=true ugi=root (auth:PROXY) via hue (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/user/hue dst=null perm=null proto=webhdfs

Figure 4: Parameters and Values of a Record in 'hdfs-audit.log' (date, access level, user group and user name with authentication, host IP, file operation command, source and destination file names, permissions, protocol)
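Analogously to the hue-access.log sketch, the key=value layout of an audit record lends itself to simple decomposition. The sketch below splits a record such as the one in Figure 4 into its labelled fields; the layout is again inferred from the sample entry rather than taken from a formal specification.

import re

# key=value pairs as they appear in the Figure 4 sample record,
# e.g. allowed=true, ugi=..., ip=/127.0.0.1, cmd=getfileinfo.
AUDIT_FIELD_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")

def parse_hdfs_audit(line):
    """Split one hdfs-audit.log record into a field dictionary.
    The timestamp precedes 'FSNamesystem.audit:'; the remainder
    is a run of key=value pairs (layout inferred from the sample)."""
    head, _, tail = line.partition("FSNamesystem.audit:")
    record = {"timestamp": head.split(" INFO")[0].strip()}
    record.update({k: v.strip() for k, v in AUDIT_FIELD_RE.findall(tail)})
    return record

sample = ("2017-07-14 10:01:00,632 INFO FSNamesystem.audit: allowed=true "
          "ugi=root (auth:PROXY) via hue (auth:SIMPLE) ip=/127.0.0.1 "
          "cmd=getfileinfo src=/user/hue dst=null perm=null proto=webhdfs")
print(parse_hdfs_audit(sample))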
E. Forensic Readiness for Targeted Hadoop Platform

Forensic readiness can be defined as the ability of a proactive investigation (forensic research) to maximize the potential to extract effective evidence whilst minimizing the costs of a real-world investigation. It is essential to prepare a ready condition in which an examiner can easily extract relevant and sufficient evidence. In this paper, we implement a forensic readiness that contains a ready corpus of potential evidences, namely the clusters resulting from the log analysis model.

a. Log Analysis for Implementation of Forensic Readiness

Analyzing backlogs can give unobtrusive insights into file operations and the behavior of users, and helps to draw the event line of the criminals. In this paper, cluster analysis of the discovered residual artifacts implements a corpus of potential evidences as forensic readiness, which can assist forensic examiners in extracting evidence easily. Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group form a cluster. The steps of the proposed log analysis model for implementing forensic readiness, shown in Figure 5, are outlined below.

Figure 5: Log Analysis Model for Implementing the Forensic Readiness System. (Data collection produces a data set, which passes through preprocessing and cluster analysis; the resulting clusters form the corpus of potential evidences, that is, the forensic readiness.)

(i) Collection

This model collects the forensically important data, the so-called residual artifacts, from the large amount of backlogs through special software called a write-blocker, which creates a bit-by-bit copy of the actual data itself. The contents of the hue-access.log and hdfs-audit.log files are collected for analysis in a timely manner and stored in a data store.

(ii) Preprocessing

Preprocessing is concerned with data cleaning and data transformation. The goal of data cleaning is to remove irrelevant information; in this system, some parameters in the log files are not meaningful for analysis, and eliminating those irrelevant data reduces processing time. Data transformation converts raw data items into structured information. The residual artifacts above, the two backlogs, are unstructured data; thus, for analysis purposes, the logs are transformed into a structured data set.

(iii) Cluster Analysis

This model applies the proposed Cluster Analysis Algorithm to group the data by each significant parameter. The output of the algorithm is the clusters, which form the corpus of potential evidence.

Algorithm 1: Cluster Analysis Algorithm
Input: Data Set D {P1, ..., Pn}
Output: Clusters C {C1{S1}, ..., Cn{Sn}}
1.  for each Parameter Pi in D
2.    for each Record Rj in D
3.      if Rj{Pi(value)} == Rj+1{Pi(value)}
4.        Rj is set to Ci{Sk}
5.      else
6.        Rj is set to Ci{Sk+1}
7.      end if
8.    end for
9.  end for
10. return Sets in Clusters

Figure 6: Cluster Analysis Algorithm

The number of sets in each cluster is calculated by Eq. 1 and Eq. 2, and Eq. 3 describes the composition of a cluster while defining the symbols used in Algorithm 1 above.

Eq. 1: N(C) = N(P), the number of clusters equals the number of parameters in the data set.
Eq. 2: N(S) = N(P(value)), the number of sets in a cluster equals the number of distinct values of its parameter.
Eq. 3: Ci = {Ci{S1}, Ci{S2}, ..., Ci{Sn}}, where i = 1, ..., N(C), S is a set in a cluster, and n = N(S) in Ci.

For example, a data set with five significant parameters yields five clusters, and if the user-name parameter takes twelve distinct values, the user-name cluster contains twelve sets. A minimal implementation sketch of the grouping step follows.
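The sketch below realizes the grouping as a dictionary-based partition rather than the pairwise comparison of Algorithm 1, but it produces the same structure: one cluster per parameter and one set per distinct value. The structured records are assumed to be dictionaries, such as those produced by the parsing sketches earlier; the trailing lines anticipate the Section V scenario by pulling every record of one suspected user out of the ready corpus.

from collections import defaultdict

def cluster_records(records, parameters):
    """Group structured log records into one cluster per parameter;
    within a cluster, records sharing a value form one set. Returns
    {parameter: {value: [records]}}, the corpus of potential
    evidences that constitutes the forensic readiness."""
    clusters = {p: defaultdict(list) for p in parameters}
    for record in records:
        for p in parameters:
            clusters[p][record.get(p)].append(record)
    return clusters

# Hypothetical structured records, e.g. parser output after
# preprocessing; field names and values are illustrative only.
records = [
    {"user": "hue", "cmd": "getfileinfo", "ip": "/127.0.0.1"},
    {"user": "Mr.Dewel", "cmd": "open", "ip": "/192.168.80.1"},
    {"user": "Mr.Dewel", "cmd": "delete", "ip": "/192.168.80.1"},
]
corpus = cluster_records(records, ["user", "cmd", "ip"])

# Evidence extraction: all activity of one suspected user, taken
# directly from the cluster grouped by user name.
for entry in corpus["user"]["Mr.Dewel"]:
    print(entry)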
The output of the log analysis model, the clusters of artifacts, implements the ready corpus of potential evidence as forensic readiness, so forensic examiners can easily extract evidence from this ready corpus to embody a crime case.

V. GENERATING EVIDENCES FOR INVESTIGATING A CRIME CASE

In this section, we test the investigation of a crime case on the targeted Hadoop Platform by applying the proposed forensic readiness. The sample crime case is described as follows.

Crime Case: The DEF organization uses the services of a Hadoop Platform of Non-Ambari HDP with the Hue Web UI. Every authorized person in this organization can access it through 'http://@@@@@:8000/' from their own web browser to obtain the services of uploading, downloading, and opening files on it. In one suspected case, Mr. Dewel is suspected of betraying the organization; therefore, the organization needs to know the usage pattern of Mr. Dewel.

In this case, with the aim of tracing the activity of Mr. Dewel, the examiner applies the potential evidence corpus in the forensic readiness. He extracts the records with username = Mr. Dewel from the cluster that is grouped by username. The examiner can thus trace the activity of Mr. Dewel easily, without consuming excessive time and resources.

VI. CONCLUSION

With the aim of tracing criminal activities, information stored in Hadoop backlogs plays an important role. However, extracting evidence from a large amount of backlogs for forensics on emerging Hadoop Platforms is identified as a challenge for forensic investigators and researchers, as it may consume a significant amount of time and resources. Forensic readiness is one solution to address these challenges, and the forensic research of this paper makes two contributions. First, it exposes the forensically important artifacts on the Hadoop Platform: Non-Ambari Hortonworks Data Platform (HDP) with the Hue Web UI. Second, forensic readiness is proposed by means of cluster analysis of the exposed residual artifacts. The outcomes of this paper contribute to a forensic readiness that can result in effective evidence generation, reducing the time and resources spent in real-world investigations on the Hadoop Platform and assisting the forensic examiners.

REFERENCES

[1] A. Clark, et al., "A correlation method for establishing provenance of timestamps in digital evidence," 6th Annual Digital Forensic Research Workshop, Digital Investigation, vol. 3, supplement 1, pp. 98-107, 2006.
[2] S. A. Thanekar, et al., "A study on digital forensics in Hadoop," International Journal of Control Theory and Applications, International Science Press, vol. 9, no. 18, pp. 8927-8933, 2016.
[3] R. Rowlingson, "A ten step process for forensic readiness," International Journal of Digital Evidence, vol. 2, no. 3, pp. 1-28, 2004.
[4] A. Pichan, et al., "Cloud forensics: Technical challenges, solutions and comparative analysis," Digital Investigation, vol. 13, pp. 38-57, 2015.
[5] C. H. Cho, et al., "Cyber forensic for Hadoop based cloud system," International Journal of Security and its Applications, vol. 6, no. 3, pp. 83-90, July 2012.
[6] J. Guynes, et al., "Digital forensics text string searching: Improving information retrieval effectiveness by thematically clustering search results," in 6th Annual Digital Forensic Research Workshop, vol. 4, pp. 49-54, 2007.
[7] J. Tan, "Forensics readiness," Technical report, Cambridge, USA, 2001.
[8] Association of Chief Police Officers, "Good practice guide for computer based electronic evidence," ACPO, Tech. Rep.
[9] "Hadoop/Big Data market size worldwide 2015-2020 | Statistic," Statista, 2016. Available: https://www.statista.com/statistics/587051/worldwide-Hadoop-bigdata-market/. Accessed: Nov. 8, 2016.
[10] J. Galloway, et al., "Network data mining: methods and techniques for discovering deep linkage between attributes," in APCCM '06: Proceedings of the 3rd Asia-Pacific Conference on Conceptual Modelling, pp. 21-32, Australian Computer Society, Inc., Darlinghurst, Australia, 2006. ISBN 1-920-68235-X.
[11] M. Gualtieri and N. Yuhanna, "The Forrester Wave: Big Data Hadoop Solutions, Q1 2014," Forrester, 2014.
[12] O. Alabi, et al., "Toward a data spillage prevention process in Hadoop using data provenance," Proceedings of the 2015 Workshop on Changing Landscapes in HPC Security, pp. 9-13, 2015.
[13] "Welcome to Apache Hadoop," Available: http://hadoop.apache.org/. Accessed: Nov. 8, 2016.
[14] V. Chemitiganti, "What is Apache Hadoop?," in Business Values of Hadoop, Hortonworks, 2016. Available: http://hortonworks.com/apache/hadoop.
[15] "Hortonworks Q1 2017 results," Available: https://hortonworks.com/press-releases/hortonworks-q1-2017-results. Accessed: Aug. 18, 2017.
[16] L. De Marco, F. Ferrucci, and M-T. Kechadi, "Reference architecture for a cloud forensic readiness system," EAI Endorsed Transactions on Security and Safety.