Forensic Readiness on Hadoop Platform: Non-Ambari HDP as a Case Study
Myat Nandar Oo1, Thandar Thein2
1 University of Computer Studies, Yangon; 2 University of Computer Studies, Maubin
1 myatnandaroo@gmail.com, 2 thandartheinn@gmail.com
Abstract: Forensic examiners are in an uninterrupted battle with criminals over the use of the Hadoop Platform. Forensic investigation of composite Hadoop Platforms is therefore an emerging field for forensic practitioners. The major challenge in this environment is generating effective evidence from the sheer volume of Hadoop backlogs awaiting analysis to embody the criminal activity; extracting evidence from such a volume of backlogs can consume substantial time and resources. To address these challenges in generating evidence, forensic readiness can assist forensic practitioners in performing effective forensic work. This paper undertakes forensic research with a twofold contribution: (i) it identifies the forensically important artifacts on a Hadoop Platform, the Non-Ambari Hortonworks Data Platform (HDP), and (ii) it proposes forensic readiness by means of analysis of those residual artifacts. The outcome of this paper is a Hadoop Platform forensic readiness that can result in effective evidence generation in real-world forensics on the Hadoop Platform.
Keywords: forensic investigation, forensic readiness, Hadoop Platform, residual artifact
I. INTRODUCTION
Hadoop is a reliable platform for storage and analysis, with adaptable facility software packages for cloud and Big Data environments. Hadoop is built for distributed storage and computing. It is a cross-platform, Java-based solution that can run on a wide array of operating systems because it is written in Java, a platform-neutral language. Hadoop itself serves as its own abstract platform layer; it can be accessed and run almost entirely independently of the host operating system.
Hadoop version 0.1.0 was published in April 2006, and new versions have been released continually since [13]. At the time of writing, the latest release, Apache Hadoop 2.7, became available in June 2016 [3]. Hadoop is evolving rapidly, and new software packages keep being added to it. Recently, parts of the original Apache Hadoop project have grown into separate software projects, such as Hue, Avro, HBase, Pig, HCatalog, Hive, Flume, Oozie, Sqoop, and Zookeeper [14].
According to a Statista report [9], the Hadoop market was valued at 6 billion U.S. dollars worldwide in 2015. A number of companies have bundled Hadoop and related technologies into their own Hadoop distributions, the Hadoop Platforms. The three prominent Hadoop Platforms are MapR, Cloudera, and Hortonworks [11]; among them, HDP is the fully open-source Hadoop Platform. The development rate of HDP is faster than that of competing Hadoop Platforms: its total gross profit was $38.1 million for the first quarter of 2017, compared to $25.0 million for the same period of the previous year. Hue [15] provides a web application interface for Hadoop. It supports a file browser and a web user interface for all applications on Hadoop.
Many organizations do business on Hadoop Platforms, so criminals also find ways to use them illegally. There is a need for forensic capabilities that support investigations of crime on Hadoop Platforms. The forensic investigation of HDP may take excessive time if it is conducted without knowing where residual
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
193 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
artifacts may reside. Forensic readiness needs to be implemented to assist real-world forensics on such complex platforms. To the best of our knowledge, there are no other publications on forensic readiness for the Hadoop Platform.
In this paper, the residual artifacts for forensic investigation on the Hadoop Platform Non-Ambari HDP are discovered. The paper focuses on implementing forensic readiness through cluster analysis of these residual artifacts. It aims to extract effective evidence, helping forensic examiners perform effective forensic work by reducing the time and resources spent in real-world investigations on the Hadoop Platform.
This paper is organized as follows: Section II describes the theoretical background and current literature on digital forensics, Hadoop forensics, log analysis for forensics, and forensic readiness. Section III describes the architecture of the targeted Hadoop Platform, Non-Ambari HDP. Section IV presents the research questions for forensic investigation on the targeted Hadoop Platform, together with the research methodology that addresses them. Section V generates evidence for a crime case by applying the proposed forensic readiness. Section VI summarizes the paper.
II. RELATED WORKS
Related work on various aspects of Hadoop forensic investigation is discussed in this section. The following literature review explores the approaches used by other researchers in this field.
A. Hadoop Forensics
The evolution of the Hadoop Data Platform brings challenges to forensic investigation. Forensic investigation of Hadoop is a novel field; hence there are limited publications in this area. The following paragraphs review the related work.
Cho et al. [5] highlighted that existing forensic procedures are not suitable for HDFS-based cloud systems because of their characteristics: a gigantic volume of distributed data, multiple users, and multi-layered data structures. These characteristics create two problems in the evidence-gathering phase: file blocks are replicated on different nodes, and copying the original data requires excessive time and storage. They proposed a general forensic procedure and guideline for Hadoop-based cloud systems. In this procedure, the authors added live analysis and live collection to the original forensic procedure to avoid system suspension. By conducting static and live collection simultaneously, Hadoop forensic analysis can reduce the time needed for evidence collection. However, they did not present a case study or specific scenario to illustrate their proposals.
A specific type of data leakage, data spillage, occurs when classified or sensitive information is moved onto an unauthorized or undesignated compute node or memory media. Sensitive information spilled from a Hadoop cluster may become accessible to unauthorized nodes. To remedy this data spillage challenge, Alabi et al. [12] developed a forensic framework for collecting provenance data and investigating data spillage in Hadoop. The authors aimed to provide tools and prevention mechanisms by analyzing data spillage events in the context of Hadoop environments and MapReduce applications. System-level metadata was used to map data motion: how data are stored; who accesses data, where, and how; and how data change throughout their life cycle in a Hadoop cluster.
In [2], the authors discussed how Hadoop Big Data systems pose new difficulties and challenges for forensic investigators. The paper highlighted that understanding Hadoop's internal structure is essential for forensic investigators. They pointed out that different tools and technologies can be used for Big Data forensics, and demonstrated that an automated tool (Autopsy) can help find evidence in Big Data efficiently.
Information stored in Hadoop log files plays an important role as forensic evidence, which has led to large volumes of Hadoop log files awaiting analysis. Analysis of Hadoop log files can give unobtrusive insights into the behavior of users and can embody illegal usage. This paper discovers the forensically important residual artifacts on the Hadoop Platform from among the huge number of log files, which can reduce the time and resources required in real-world forensics on that platform.
B. Log Analysis for Forensics
Backlogs are valuable sources of forensic evidence. They store information about where data was stored, where data inputs originated, which jobs have been run, and other event-based information.
Analysis techniques have been successfully applied in many different fields. Recently, log analysis techniques have also been applied to criminal and digital forensics. Log analysis is performed to extract strong evidence from a large amount of data and to embody criminal activities. Examples include detecting deceptive criminal identities and identifying groups of criminals engaged in various illegal activities; such analyses typically aim to produce insight from large volumes of data.
The paper [1] explored a correlation method for establishing the provenance of time-stamped data for use as digital evidence. This work has a deep and relevant impact on digital forensics research, as it reiterated the complexity of dealing with timestamps because of drift, offsets, and possible human tampering.
Also in 2006, research [3] explored event analysis to develop profiles for computer forensic investigation purposes. Abraham analyzed computer data to discover owner or usage profiles based on sequences of events occurring on a system, characterizing an owner profile with four attributes: subject, object, action, and time stamp.
In 2007, the paper [6] proposed pre-retrieval and post-retrieval clustering of digital forensics text string search results. Though their work focused on text mining, the data clustering algorithms used showed success in efficiency and in improving information retrieval.
C. Forensic Readiness
The initial idea of Digital Forensic Readiness (DFR) was proposed in 2001, and it has been an active research field attracting many researchers since. DFR combines forensic expertise with hardware and software engineering. Forensic readiness is a proactive measure that organizations need to enforce so that they can support forensic investigations with sufficient forensic preparedness.
Tan [7] identified the factors that affect digital forensic readiness: evidence handling, forensic acquisition, logging methods, and intrusion detection methods.
A DFR system has two objectives: maximizing an environment's ability to collect credible digital evidence, and minimizing the cost of forensics. DFR includes data collection activities concerning components such as RAM, registers, raw disks, and logs [16].
Rowlingson [3] described forensic readiness as the objective of maximizing an environment's capability to collect digital forensic information whilst minimizing the cost of the forensic investigation during an incident response.
In this paper, we implement forensic readiness by analyzing the residual artifacts. This forensic readiness can assist forensic examiners in performing effective forensic work on the Hadoop Platform.
III. ARCHITECTURE OF TARGETED HADOOP PLATFORM: NON-AMBARI HDP WITH HUE
WEB UI
The HDP includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. The platform is designed to deal with data from many sources and in many formats. It includes core Hadoop technology such as the Hadoop Distributed File System, MapReduce, and YARN, as well as facility software packages (HBase, Storm, etc.) [14].
[Figure 1 depicts the components of the targeted Non-Ambari HDP: the Resource Manager (Scheduler, Application Manager), the Node Manager (Application Master, Container), the Namenode, Secondary Namenode, and Datanode, Zookeeper with its Zookeeper Admin, and the Hue Web UI backed by the Hue Server and agent scripts.]
Figure 1: Architecture of Targeted Hadoop Platform: Non Ambari HDP
The Ambari package is an open-source installation and management system for Hadoop Platforms. Some HDP installations use the Ambari management system (Ambari-HDP); in Ambari-HDP, Ambari can automatically install core Hadoop and the other facility software packages. Other HDP installations exclude Ambari and are performed entirely by hand; this type of HDP is called Non-Ambari HDP.
Hue (Hadoop User Experience) is an open-source web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license. Figure 1 shows the architecture of the targeted Hadoop Platform, Non-Ambari HDP 2.3. Hue 2.6.12 is applied as the web UI and gateway for all operations on HDP.
IV. RESEARCH QUESTIONS FOR FORENSIC INVESTIGATION ON TARGETED HADOOP
PLATFORM
For forensic investigation on the targeted Hadoop Platform, the research questions are raised as shown in Table 1.
Table 1: Research Questions for Forensic Investigation on Targeted Hadoop Platform
Q1: What backlogs are updated per operation on the targeted Hadoop Platform?
Q2: How can the volume challenge in generating evidence for investigation on the targeted Hadoop Platform be addressed?
Q3: Among the sheer amount of backlogs, which are forensically useful as residual artifacts?
Q4: How can forensic readiness for the targeted Hadoop Platform be implemented?
The following sub-question is raised from the primary questions above.
Q3.1: For forensic readiness, how are the residual artifacts analyzed?
A. Research Methodology
With the aim of achieving forensic readiness, forensic research on the targeted Hadoop Platform as a proactive investigation can address the challenges of real-world forensic investigation. A major challenge in this environment is the ongoing growth in the volume of Hadoop Platform backlogs presented for analysis when generating evidence; however, not all of these backlogs are forensically significant.
This paper discovers the forensically important artifacts among the sheer amount of backlogs. A corpus of potential evidence is built as forensic readiness, implemented by means of analysis of these residual artifacts. When an examiner investigates a crime case, he can effortlessly extract evidence and embody the illegal usage of the criminal by applying the forensic readiness.
Figure 2 shows the workflow of the research on the targeted Hadoop Platform. It describes the responsibilities of the forensic researcher during forensic research, and how examiners apply the readiness to generate evidence that embodies the criminal activity.
[Figure 2 shows two phases. Forensic research: backlogs → discover residual artifacts from backlogs → residual artifacts → implement forensic readiness → forensic readiness (corpus of potential evidences). Forensic investigation: crime case → extract evidences relating to the crime case from the corpus → evidences → investigate the crime.]
Figure 2: Work Flow of Forensic Research and Investigation on Targeted Hadoop Platform
B. Residual Artifacts for Forensic Investigation on Targeted Hadoop Platform
On the Hadoop Platform, many log data are updated per operation; however, not all logs contain relevant evidence. Exposing the residual artifacts among the sheer amount of backlogs reduces the volume to be analyzed.
C. Backlogs on Hadoop Platform
By the nature of Hadoop, many log files are updated during processing, so there is a large amount of backlog. In practice, however, most of it is not forensically valuable.
The following types of logs can be found on machines running HDP:
Hadoop daemon logs: These .log files are stored in the host operating system and contain error and warning information. By default, these log files have a Hadoop prefix in the filename.
log4j: These logs store information from the log4j process. The log4j application is an Apache logging interface used by many Hadoop applications. These logs are stored in the /var/log/hadoop directory.
Standard out and standard error: Each Hadoop TaskTracker creates and maintains these error logs to store information written to standard out or standard error. They are stored in each TaskTracker node's /var/log/hadoop/userlogs directory.
Job configuration XML: The Hadoop JobTracker creates these files within HDFS to track job summary details about the configuration and the job run. They can be found in the /var/log/hadoop and /var/log/hadoop/history directories.
Job statistics: The Hadoop JobTracker creates these logs to store information about the number of job step attempts and the job runtime for each job.
Some noticeable log files which are updated per Hadoop operation are the following.
Hadoop daemon logs:
  hadoop-<user-running-hadoop>-<daemon>-<hostname>.log
  For example: hadoop-hadoop-datanode-ip-xxxx.log
JobTracker logs:
  - are created by the jobtracker;
  - are stored in two places, /var/log/hadoop and /var/log/hadoop/history, or in home/hadoop/logs/history/done/version x/<host-job_id>/<year>/<month>/<day>/<serial>.
TaskTracker logs (for a particular task attempt):
  - are created by each tasktracker;
  - are captured when a task attempt is run;
  - are stored in /var/log/hadoop/userlogs/attempt_<job-id>_<map-or-reduce>_<attempt-id>.
HDFS metadata:
  - fsimage: contains the complete state of the file system at a point in time; each change is allocated a unique, monotonically increasing transaction ID.
  - edits: a log listing each file system change (file creation, deletion, or modification) made after the most recent fsimage.
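The daemon-log naming convention above lends itself to programmatic triage. A minimal sketch, assuming the pattern hadoop-<user>-<daemon>-<hostname>.log (the function and field names are illustrative, not part of any Hadoop tool):

```python
import re

# Pattern for the daemon-log naming convention described above:
#   hadoop-<user-running-hadoop>-<daemon>-<hostname>.log
DAEMON_LOG = re.compile(
    r"^hadoop-(?P<user>[^-]+)-(?P<daemon>[^-]+)-(?P<hostname>.+)\.log$"
)

def parse_daemon_log_name(filename):
    """Split a daemon log filename into its user, daemon, and hostname parts."""
    m = DAEMON_LOG.match(filename)
    return m.groupdict() if m else None
```

Such a parser lets an examiner quickly group collected log files by daemon or by host before any content analysis.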
D. Residual Artifacts on Targeted Hadoop Platform
The major challenge for digital forensic analysis is the ongoing growth in the volume of backlogs seized and presented for analysis. Thus, this paper seeks to identify and locate the backlogs that are forensically important as residual artifacts. This leads to an efficient way to extract effective forensic evidence while reducing time and resources.
Among the large number of log files, the contents of the namenode.log, syslog, blk_#####, job-######, hdfs-audit.log, and hue-access.log files are useful in a forensic context. The contents of these log files are able to uncover criminal activity. Among them, the forensically complete logs that can embody the crime scene are hue-access.log and hdfs-audit.log: the criminal activity can be embodied by applying only these two log files, even excluding the other backlogs. In the configuration files for handling the logs, the maximum log size can be adjusted; when the maximum size is reached, the old files are kept as backup log files. Table 2 lists the locations of these two log files and the configuration files that store their settings and default locations.
Table 2: Residual Artifacts (Forensically Important Log Files) of Targeted Hadoop Platform
File Name      | Location                       | Configuration Files
hue-access.log | /var/log/hue/                  | /etc/hue/config/log.config, access.py
hdfs-audit.log | /usr/hdp/Hadoop/hdfs-audit.log | /etc/hadoop/log4j.properties
(i) hue-access.log
When a Hadoop file operation is accessed via the Hue Web UI, hue-access.log maintains a history of requests. Each record holds information about the request, including the request date/time, log level, client IP address, name of the user operating on the file, access method, and HTTP protocol. More recent entries are typically appended to the end of the file. Figure 3 shows the parameters and values of a record (entry) in hue-access.log.
[20/Jul/2017 20:41:27 +0000] INFO 192.168.80.1 hue - "GET /filebrowser/download//dirBb/Forensic
Investigation through Data Remnants on Hadoop Big Data Storage System.pdf HTTP/1.0"
[Annotated fields: date, log level, client IP, user name, access method, file directory, file name, HTTP protocol]
Figure 3: Parameters and Values of a Record in ‘hue-access.log’
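A record with the layout shown in Figure 3 can be split into its fields with a single regular expression. This is a hedged sketch: the exact layout may vary between Hue versions, and the sample path is an invented placeholder rather than the full filename from the record above.

```python
import re

# One named group per field in Figure 3.
HUE_ACCESS = re.compile(
    r'^\[(?P<date>[^\]]+)\]\s+'          # request date/time
    r'(?P<level>\w+)\s+'                 # log level
    r'(?P<client_ip>[\d.]+)\s+'          # client IP address
    r'(?P<user>\S+)\s+-\s+'              # user name
    r'"(?P<method>\w+)\s+'               # access method (GET, POST, ...)
    r'(?P<path>.+)\s+'                   # file directory and file name
    r'(?P<protocol>HTTP/[\d.]+)"$'       # HTTP protocol
)

def parse_hue_access(line):
    """Return the fields of one hue-access.log entry, or None on mismatch."""
    m = HUE_ACCESS.match(line)
    return m.groupdict() if m else None

# Hypothetical entry modelled on the record shown above:
rec = parse_hue_access(
    '[20/Jul/2017 20:41:27 +0000] INFO 192.168.80.1 hue - '
    '"GET /filebrowser/download//dirBb/report.pdf HTTP/1.0"'
)
```

The resulting dictionary is the structured form that the later preprocessing and cluster analysis steps operate on.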
(ii) hdfs-audit.log
Audit logging is an accounting process that logs all operations happening in Hadoop and gives higher-resolution insight into Hadoop operations. Its configuration is presented in the log4j properties. By default, the log4j.properties file has the log threshold set to WARN. By setting this level to INFO, audit logging can be
turned on. The snippet in Table 3 shows the configuration when HDFS audit logging is turned on.
Table 3: ‘log4j.properties’ Configuration File for Audit Logging
SHARED_HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log
-XX:NewSize=128m -XX:MaxNewSize=128m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -
XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps -Xms1024m -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS
-Dhdfs.audit.logger=INFO,RFAAUDIT"
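The snippet in Table 3 sets the audit logger via a JVM option; the logger side of this configuration usually lives in log4j.properties itself. A sketch of the relevant lines, using the RFAAUDIT appender name from Table 3 (the file path is an assumption and varies by installation):

```
# Raise the audit threshold from the default to INFO to turn audit logging on
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
# RFAAUDIT: rolling file appender that receives the audit events
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=/var/log/hadoop/hdfs/hdfs-audit.log
```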
Figure 4 shows the parameters and values contained in each record of hdfs-audit.log.
2017-07-14 10:01:00,632 INFO FSNamesystem.audit: allowed=true ugi=root (auth:PROXY) via hue
(auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/user/hue dst=null perm=null proto=webhdfs
proto=rpc
[Annotated fields: date, access level, user group, user name, authentication, host IP, file operation command, file name, protocol]
Figure 4: Parameters and Value of a Record in ‘hdfs-audit.log’
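An hdfs-audit.log record with the layout of Figure 4 can likewise be decomposed field by field. A hedged sketch; note that the ugi field may contain spaces (e.g. "root (auth:PROXY) via hue (auth:SIMPLE)"), so it is matched lazily up to the ip= key:

```python
import re

# Named groups follow the key=value layout shown in Figure 4.
AUDIT = re.compile(
    r'^(?P<date>\d{4}-\d{2}-\d{2} [\d:,]+)\s+'
    r'(?P<level>\w+)\s+FSNamesystem\.audit:\s+'
    r'allowed=(?P<allowed>\S+)\s+'
    r'ugi=(?P<ugi>.+?)\s+'        # lazy: ugi may contain spaces
    r'ip=(?P<ip>\S+)\s+'
    r'cmd=(?P<cmd>\S+)\s+'
    r'src=(?P<src>\S+)\s+'
    r'dst=(?P<dst>\S+)\s+'
    r'perm=(?P<perm>\S+)\s+'
    r'proto=(?P<proto>\S+)'
)

def parse_hdfs_audit(line):
    """Return the fields of one hdfs-audit.log entry, or None on mismatch."""
    m = AUDIT.match(line)
    return m.groupdict() if m else None

# Entry modelled on the record shown above:
rec = parse_hdfs_audit(
    '2017-07-14 10:01:00,632 INFO FSNamesystem.audit: allowed=true '
    'ugi=root (auth:PROXY) via hue (auth:SIMPLE) ip=/127.0.0.1 '
    'cmd=getfileinfo src=/user/hue dst=null perm=null proto=webhdfs'
)
```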
E. Forensic Readiness for Targeted Hadoop Platform
Forensic readiness can be defined as the ability of a proactive investigation (forensic research) to maximize the potential to extract effective evidence whilst minimizing the cost of a real-world investigation. It is essential to prepare a ready condition in which an examiner can easily extract relevant and sufficient evidence.
In this paper, we implement a forensic readiness that contains a ready corpus of potential evidence, namely the clusters resulting from the log analysis model.
a. Log Analysis for Implementation of Forensic Readiness
Analyzing backlogs can give unobtrusive insights into file operations and the behavior of users, and helps draw the event line of the criminals. In this paper, cluster analysis of the discovered residual artifacts builds a corpus of potential evidence as forensic readiness, which can assist forensic examiners in extracting evidence easily.
Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to objects in other groups. The steps of the proposed log analysis model for implementing forensic readiness are outlined below, as shown in Figure 5.
(i) Collection
This step collects the forensically important data, the so-called residual artifacts, from the large amount of backlog through special software called a write-blocker, which creates a bit-by-bit copy of the actual
data itself. The contents of the hue-access.log and hdfs-audit.log files are collected for analysis in a timely manner and stored in a data store.
[Figure 5 shows the log analysis model: data collection → data set → pre-processing → cluster analysis → clusters, which form the corpus of potential evidences constituting the forensic readiness.]
Figure 5: Log Analysis Model for Implementing Forensic Readiness System
(ii) Preprocessing
Preprocessing is concerned with data cleaning and data transformation. The goal of data cleaning is to remove irrelevant information; in this system, some parameters in the log files are not meaningful for analysis, and eliminating such irrelevant data reduces processing time.
Data transformation converts raw data items into structured information. A transformation is performed to turn the unstructured log files into structured data sets: the residual artifacts above, i.e., the two backlogs, are unstructured data, so for analysis purposes the logs are transformed into a structured dataset.
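The cleaning and transformation steps can be sketched as follows. The choice of irrelevant fields and the column order are assumptions made for illustration, not prescribed by the model:

```python
# Assumed-irrelevant parameters to drop during data cleaning
IRRELEVANT = {"perm", "proto", "level"}
# Fixed column order imposed during data transformation
COLUMNS = ("date", "ugi", "ip", "cmd", "src", "dst")

def to_structured(records):
    """Clean parsed log records and transform them into uniform dataset rows."""
    rows = []
    for rec in records:
        # cleaning: remove fields that carry no analytical value
        cleaned = {k: v for k, v in rec.items() if k not in IRRELEVANT}
        # transformation: emit a fixed-order tuple, filling gaps with "null"
        rows.append(tuple(cleaned.get(col, "null") for col in COLUMNS))
    return rows
```

The resulting fixed-width rows are what the cluster analysis step groups by parameter.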
(iii) Cluster Analysis
This model applies the proposed Cluster Analysis Algorithm to group the data by each significant parameter. The output of the algorithm is the set of clusters, which forms the corpus of potential evidence.
Algorithm 1: Cluster Analysis Algorithm
Input: Data set D{P1, ..., Pn}
Output: Clusters C{C1{S1}, ..., Cn{Sn}}
1. for each parameter Pi in D
2.   for each record Rj in D
3.     if Rj{Pi(value)} == Rj+1{Pi(value)}
4.       Rj is set to Ci{Sk}
5.     else
6.       Rj is set to Ci{Sk+1}
7.     end if
8.   end for
9. end for
10. return sets in clusters
Figure 6: Cluster Analysis Algorithm
The number of clusters and the number of sets in each cluster are given in Eq. 1 and Eq. 2. Eq. 3 describes the composition of a cluster and defines the symbols used in Algorithm 1.
Eq. 1: number of clusters N(C) = number of parameters in the dataset N(P)
Eq. 2: number of sets in a cluster N(S) = number of distinct values of each parameter = N(P(value))
Eq. 3: Ci = {Ci{S1}, Ci{S2}, ..., Ci{Sn}}
where i = 1, ..., N(C); S = a set in each cluster; n = N(S) in Ci
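Algorithm 1 and the equations above can be realized compactly; a minimal sketch, assuming each record is a dict whose keys are the parameters (the function name is illustrative):

```python
from collections import defaultdict

def cluster_by_parameters(dataset):
    """Sketch of Algorithm 1: one cluster per parameter (Eq. 1); within a
    cluster, records sharing the same parameter value form one set (Eq. 2)."""
    if not dataset:
        return {}
    clusters = {}
    for param in dataset[0]:                 # each parameter Pi in D
        sets = defaultdict(list)
        for record in dataset:               # each record Rj in D
            sets[record[param]].append(record)
        clusters[param] = dict(sets)
    return clusters
```

Grouping by value directly (rather than comparing adjacent records as in lines 3-6 of Algorithm 1) yields the same sets without requiring the dataset to be sorted.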
The output of the log analysis model, the clusters of artifacts, constitutes the ready corpus of potential evidence as forensic readiness, so forensic examiners can easily extract evidence from this ready corpus to embody a crime case.
V. GENERATING EVIDENCES FOR INVESTIGATING A CRIME CASE
In this section, we test the investigation of a crime case on the targeted Hadoop Platform by applying the proposed forensic readiness. The sample crime case is described as follows:
Crime Case: The DEF organization uses the services of a Hadoop Platform, Non-Ambari HDP with Hue Web UI. Every authorized person in this organization can access it through 'http://@@@@@:8000/' from their own web browser to upload, download, and open files on it. In one suspected case, Mr. Dewel is suspected of betraying the organization; therefore, the organization needs to know Mr. Dewel's usage pattern.
In this case, with the aim of tracing the activity of Mr. Dewel, the examiner applies the corpus of potential evidence in the forensic readiness. He extracts the records with username = Mr. Dewel from the cluster grouped by username. The examiner can thus trace the activity of Mr. Dewel easily, without excessive consumption of time and resources.
VI. CONCLUSION
With the aim of tracing criminal activities, information stored in Hadoop backlogs plays an important role. However, extracting evidence from a large amount of backlog for forensics on emerging Hadoop Platforms is a recognized challenge for forensic investigators and researchers, as it may consume a significant amount of time and resources. Forensic readiness is one solution to these challenges. The forensic research in this paper therefore makes a twofold contribution. First, it exposes the forensically important artifacts on a Hadoop Platform, Non-Ambari Hortonworks Data Platform (HDP) with Hue Web UI. Second, it proposes forensic readiness by means of cluster analysis of the exposed residual artifacts. The outcome of this paper is a forensic readiness that can yield effective evidence generation, reducing the time and resources spent in real-world investigations on the Hadoop Platform and assisting forensic examiners.
REFERENCES
[1] A. Clark, et al., "A correlation method for establishing provenance of timestamps in digital evidence", 6th Annual Digital Forensic Research Workshop, Digital Investigation, vol. 3, supplement 1, pp. 98-107, 2006.
[2] S. A. Thanekar, et al., "A Study on Digital Forensics in Hadoop", International Journal of Control Theory and Applications, International Science Press, vol. 9, no. 18, pp. 8927-8933, 2016.
[3] R. Rowlingson, "A ten step process for forensic readiness", International Journal of Digital Evidence, vol. 2, no. 3, pp. 1-28, 2004.
[4] A. Pichan, et al., "Cloud forensics: Technical challenges, solutions and comparative analysis", Digital Investigation, vol. 13, pp. 38-57, 2015.
[5] C. H. Cho, et al., "Cyber Forensic for Hadoop Based Cloud System", International Journal of Security and its Applications, vol. 6, no. 3, pp. 83-90, July 2012.
[6] J. Guynes, et al., "Digital forensics text string searching: Improving information retrieval effectiveness by thematically clustering search results", 6th Annual Digital Forensic Research Workshop, vol. 4, pp. 49-54, 2007.
[7] J. Tan, "Forensic readiness", Technical report, Cambridge, USA, 2001.
[8] Association of Chief Police Officers, "Good practice guide for computer based electronic evidence", ACPO, Tech. Rep.
[9] "Hadoop/Big Data Market Size Worldwide 2015-2020 | Statistic", Statista, 2016. Available: https://www.statista.com/statistics/587051/worldwide-Hadoop-bigdata-market/. Accessed: Nov. 8, 2016.
[10] J. Galloway, et al., "Network data mining: methods and techniques for discovering deep linkage between attributes", in APCCM '06: Proceedings of the 3rd Asia-Pacific Conference on Conceptual Modelling, pp. 21-32, Australian Computer Society, Inc., Darlinghurst, Australia, 2006. ISBN 1-920-68235-X.
[11] M. Gualtieri and N. Yuhanna, "The Forrester Wave: Big Data Hadoop Solutions, Q1 2014", Forrester, 2014.
[12] O. Alabi, et al., "Toward a Data Spillage Prevention Process in Hadoop using Data Provenance", Proceedings of the 2015 Workshop on Changing Landscapes in HPC Security, pp. 9-13, 2015.
[13] "Welcome to Apache Hadoop", Available: http://hadoop.apache.org/. Accessed: Nov. 8, 2016.
[14] V. Chemitiganti, "What is Apache Hadoop?", in Business Values of Hadoop, Hortonworks, 2016. Available: http://hortonworks.com/apache/hadoop.
[15] Available: https://hortonworks.com/press-releases/hortonworks-q1-2017-results. Accessed: August 18, 2017.
[16] L. De Marco, F. Ferrucci, M-T. Kechadi, "Reference Architecture for a Cloud Forensic Readiness System", EAI Endorsed Transactions on Security and Safety.