File recovery is one of the stages in the computer forensic investigative process used to identify an acquired file for use as digital evidence. Recovery is performed on files that have been deleted from a file system. However, in order to recover a deleted file, some considerations must be taken into account: a deleted file is potentially modified from its original condition, because another file may have partly or entirely overwritten its content. A typical approach to recovering a deleted file is to apply the Boyer-Moore algorithm, which has rather high time complexity for string searching. Therefore, a better string-matching approach for recovering deleted files is required. We propose an Aho-Corasick parsing technique that reads file attributes from the master file table (MFT) in order to examine the file's condition. If the file was deleted, the parser searches for the file content in order to reconstruct the file. Experiments were conducted at several modification levels: 0% (unmodified), 18.98%, 32.21% and 59.77%. From the experimental results we found that the file reconstruction process on the file system was performed successfully. The average success rate for file recovery across four experiments at each modification level was 87.50%, and the average string-matching time for searching file names was 0.32 seconds.
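Since the contribution hinges on Aho-Corasick multi-pattern matching, a minimal sketch of that automaton may help. This is not the authors' implementation: the file names and the raw byte string standing in for MFT data are invented examples.

```python
# Minimal Aho-Corasick matcher: build a trie with failure links, then find all
# pattern occurrences in one pass over the text. Patterns/input are examples.
from collections import deque

def build_automaton(patterns):
    goto = [{}]          # state -> {char: next state}
    out = [set()]        # state -> patterns that end at this state
    fail = [0]           # state -> failure state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); out.append(set()); fail.append(0)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())      # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]   # inherit matches from the fail state
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = build_automaton(patterns)
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i - len(pat) + 1, pat  # (offset, matched file name)

# Hypothetical usage: look for two deleted file names in a raw data dump.
raw = "....report.docx....photo.jpg...."
print(list(search(raw, ["report.docx", "photo.jpg"])))
```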
During forensic examination, analysis of the unallocated space of seized storage media is essential to extract previously deleted or overwritten files when the file system metadata is missing or corrupted. The process of recovering files from unallocated space based on file-type-specific information (header and footer) and/or file contents is known as data carving. Research in this domain has witnessed various technological enhancements in tools and techniques over the past years. This paper surveys various data carving techniques, in particular for multimedia files, and classifies the research in the domain into three categories: classical carving techniques, smart carving techniques and modern carving techniques. Further, seven popular multimedia carving tools are empirically evaluated. We conclude with the need to develop new techniques for carving multimedia files, since fragmentation and compression are very common issues for these files.
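To make the header/footer idea concrete, here is a minimal sketch of classical carving for a single file type (JPEG). The input name disk.img is a placeholder; real carvers also handle fragmentation, nested markers and many formats.

```python
# Classical header/footer carving sketch: scan raw bytes for the JPEG start
# marker (FF D8 FF) and end marker (FF D9) and extract everything between.
HEADER = b"\xff\xd8\xff"
FOOTER = b"\xff\xd9"

def carve_jpegs(data):
    carved, pos = [], 0
    while True:
        start = data.find(HEADER, pos)
        if start == -1:
            break
        end = data.find(FOOTER, start + len(HEADER))
        if end == -1:
            break
        carved.append(data[start:end + len(FOOTER)])
        pos = end + len(FOOTER)
    return carved

with open("disk.img", "rb") as f:        # placeholder raw image of unallocated space
    for i, blob in enumerate(carve_jpegs(f.read())):
        with open(f"carved_{i}.jpg", "wb") as out:
            out.write(blob)
```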
Comparison of Compression Algorithms in text data for Data Mining (IJAEMS Journal)
Text data compression is the process of reducing text size by encoding it. In our daily lives, we sometimes reach the point where the physical limitations of a text, or of data sent for processing, need to be addressed more quickly. Approaches to text compression should be adaptive and robust. Compressing text data without losing the original content is important because it significantly reduces storage and communication costs. This study focuses on different techniques for encoding text files and compares them for data processing. Several memory-efficient encoding programs are analyzed and executed in this work: Shannon-Fano coding, Huffman coding, Huffman comparative coding, LZW coding and run-length coding. The analyses show how these coding techniques work, how much compression each achieves, and how much memory each requires, making it possible to compare the techniques and determine which one is better under which conditions. Experiments show that Huffman comparative coding yields a higher compression ratio. In addition, the improved RLE encoding offers higher performance than the typical variant when compressing text data.
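A compact Huffman coder makes the comparison concrete: build a code from symbol frequencies, encode a sample string, and report the ratio against 8 bits per character. The input text is an arbitrary example, not data from the study.

```python
# Huffman coding sketch: greedy merge of the two lowest-frequency nodes.
import heapq
from collections import Counter

def huffman_code(text):
    # Node layout: [frequency, tiebreaker, symbol (leaves only), left, right]
    heap = [[f, i, s, None, None] for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, [lo[0] + hi[0], i, None, lo, hi])
        i += 1
    codes = {}
    def walk(node, prefix):
        if node[2] is not None:          # leaf: assign its bit string
            codes[node[2]] = prefix or "0"
        else:
            walk(node[3], prefix + "0")
            walk(node[4], prefix + "1")
    walk(heap[0], "")
    return codes

text = "abracadabra" * 20
codes = huffman_code(text)
bits = sum(len(codes[ch]) for ch in text)
print(f"{len(text) * 8} bits raw -> {bits} bits encoded "
      f"(ratio {len(text) * 8 / bits:.2f}:1)")
```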
An Analyzing of different Techniques and Tools to Recover Data from Volatile ... (ijsrd.com)
Computer forensics has recently gained significant popularity with many local law enforcement agencies. It is currently employed in fraud, theft, drug enforcement and almost every other enforcement activity. Many relatively new tools have been developed to recover and dissect the information that can be gleaned from data storage such as hard disks and pen drives, as well as from volatile memory; but because this is a relatively new and fast-growing field, many forensic analysts do not know about or take advantage of these assets. Volatile memory may contain many pieces of information relevant to a forensic investigation, such as passwords, cryptographic keys, and other data. Knowing which methods and tools are needed to recover that data is essential, and this capability is becoming increasingly relevant as hard drive encryption and other security mechanisms make traditional hard disk forensics more challenging. This research covers the theory behind volatile memory analysis, including why it is important, what kinds of data can be recovered, and the potential pitfalls of this type of analysis, as well as techniques for recovering and analyzing volatile data and currently available toolkits developed for this purpose.
This document summarizes an article on automatic document clustering. It discusses how documents are preprocessed using techniques like tokenization, stop word removal, and stemming before being clustered. Clustering algorithms group similar documents together based on similarity metrics like TF-IDF. Documents can be searched by name, extension, or keywords to retrieve them from the clustered storage in a faster and more organized manner. Future work may include clustering multimedia files and improving algorithm performance.
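A minimal sketch of the pipeline described above, using scikit-learn: TF-IDF vectorisation (with built-in tokenisation and stop-word removal) followed by k-means grouping. The documents are toy examples, not the article's corpus.

```python
# TF-IDF + k-means document clustering sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "invoice payment due accounting ledger",
    "quarterly accounting report ledger audit",
    "neural network training gradient descent",
    "deep learning model gradient optimisation",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents with similar vocabulary share a cluster id
```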
Data Mining for XML Query-Answering Support (IOSR Journals)
This document presents a technique for mining tree-based association rules (TARs) from XML documents to support efficient XML query answering. The framework first parses XML documents and mines frequent subtrees. It then extracts TARs from the subtrees and stores them in a new XML file. When queries are received, the stored TARs are used to quickly retrieve relevant information. Experimental results show that query answering time is significantly reduced compared to approaches without mined knowledge. Extraction time increases linearly with the number of nodes in XML documents. The technique provides a useful way to gain information from XML databases in real-time applications.
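The paper's TAR miner is not reproduced here; as a crude illustration of its first step only, the sketch below parses an XML document and counts parent-to-child tag pairs, the simplest "subtree" shape whose frequent instances a TAR-style miner would extend. The XML snippet is invented.

```python
# Count parent->child tag edges in an XML tree as a proxy for frequent structure.
import xml.etree.ElementTree as ET
from collections import Counter

xml = """<library>
  <book><title>A</title><author>X</author></book>
  <book><title>B</title><author>Y</author></book>
</library>"""

root = ET.fromstring(xml)
pairs = Counter()
def visit(node):
    for child in node:
        pairs[(node.tag, child.tag)] += 1
        visit(child)
visit(root)
for (parent, child), n in pairs.most_common():
    print(f"{parent} -> {child}: {n}")   # e.g. book -> title appears twice
```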
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING (csandit)
This study discusses the utilization of three types of relationships in performing data replication. As grid computing offers the ability to share huge amounts of resources, resource availability is an important issue to be addressed. The undertaken approach combines the viewpoints of the user, the system and the grid itself in ensuring resource availability. The realization of the proposed strategy is demonstrated via OptorSim, and evaluation is made based on execution time, storage usage, network bandwidth and computing element usage. Results suggest that the proposed strategy produces a better outcome than an existing method even when various job workloads are introduced.
International Journal of Engineering Research and Applications (IJERA) is a team of researchers, not a publication service or private publisher running journals for monetary benefit; we are an association of scientists and academics who focus only on supporting authors who want to publish their work. The articles published in our journal can be accessed online, and all articles are archived for real-time access.
Our journal system primarily aims to bring out the research talent and the work done by scientists, academics, engineers, practitioners, scholars, and postgraduate students of engineering and science. The journal aims to cover scientific research in a broad sense rather than publishing a niche area of research, facilitating researchers from various verticals to publish their papers. It also aims to provide a platform for researchers to publish in a shorter time, enabling them to continue further research. All published articles are freely available to scientific researchers in government agencies, to educators and to the general public. We are taking serious efforts to promote our journal across the globe in various ways, and we are sure that our journal will act as a scientific platform for all researchers to publish their works online.
Comparison Study of Lossless Data Compression Algorithms for Text Data (IOSR Journals)
This document compares the performance of three lossless data compression algorithms (Huffman, Shannon-Fano, and LZW) on text data files of various types and sizes. Ten text files ranging from 5KB to 528KB were used as a test bed. The algorithms were implemented in Java and evaluated based on compression ratio, compression/decompression time, and file size savings. Results showed that LZW generally provided better compression ratios and file size reductions than the other two algorithms, though it took longer for compression. Shannon-Fano compression times were generally faster than Huffman's. The best algorithm to use depends on the file type and priorities around compression speed versus ratio.
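To make the LZW entries in the comparison concrete, here is a textbook LZW encoder; it is a generic sketch, not the study's Java implementation, and the input string is a standard example.

```python
# Textbook LZW encoding: grow a phrase dictionary while emitting codes.
def lzw_encode(text):
    table = {chr(i): i for i in range(256)}   # start with single-byte codes
    next_code, w, out = 256, "", []
    for ch in text:
        if w + ch in table:
            w += ch                           # extend the current phrase
        else:
            out.append(table[w])
            table[w + ch] = next_code         # learn the new phrase
            next_code += 1
            w = ch
    if w:
        out.append(table[w])
    return out

codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT")
print(len(codes), "codes for 24 characters:", codes)
```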
Data Deduplication: Venti and its improvements (Umair Amjad)
This document summarizes the data deduplication system called Venti and improvements over it. Venti identifies duplicate data blocks using cryptographic hashes of block contents. It stores only a single copy of each unique block. The document discusses three key limitations of Venti: hash collisions, fixed-size chunking sensitivity, and access control. It then summarizes approaches taken by other systems to improve on these limitations, such as using multiple hash functions to reduce collisions, variable-length chunking, and stronger authentication and encryption. In conclusion, while Venti was effective at eliminating data duplication, later systems aimed to address its remaining challenges to handle growing archive sizes securely and efficiently.
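A Venti-style sketch of the core mechanism: split data into fixed-size blocks, key each block by the SHA-256 of its contents, and store only one copy per unique hash. Block size and input are arbitrary choices for illustration.

```python
# Content-addressed block store: duplicate blocks are stored exactly once.
import hashlib

BLOCK = 4096
store = {}                       # content hash -> block bytes

def put(data):
    recipe = []                  # ordered hash list needed to rebuild the data
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        h = hashlib.sha256(block).hexdigest()
        store.setdefault(h, block)          # keep only the first copy
        recipe.append(h)
    return recipe

def get(recipe):
    return b"".join(store[h] for h in recipe)

data = b"A" * 10000 + b"B" * 10000 + b"A" * 10000   # repetitive content
r = put(data)
assert get(r) == data
print(f"{len(r)} blocks referenced, {len(store)} unique blocks stored")
```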
A Survey: Enhanced Block Level Message Locked Encryption for data Deduplication (IRJET Journal)
This document summarizes various techniques for data deduplication. It discusses inline and post-process deduplication approaches and encryption-based deduplication methods like message locked encryption (MLE) and block-level message locked encryption (BL-MLE). The document also reviews literature on deduplication schemes using sampling techniques, flash memory indexing, and combining deduplication with Hadoop Distributed File System. Overall, the document provides an overview of different data deduplication methods and their advantages and disadvantages.
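A toy illustration of the message-locked idea only: the key is derived from the plaintext itself, so identical files encrypt to identical ciphertexts and remain deduplicable. The SHA-256 counter keystream below is NOT a real cipher and is not any scheme from the survey; production MLE uses proper authenticated encryption.

```python
# Toy message-locked encryption: key = hash(plaintext), XOR with a hash keystream.
import hashlib

def mle_encrypt(plaintext: bytes) -> bytes:
    key = hashlib.sha256(plaintext).digest()   # key depends only on content
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, stream))

a = mle_encrypt(b"same file contents")
b = mle_encrypt(b"same file contents")
print(a == b)   # True: equal plaintexts -> equal ciphertexts -> dedupable
```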
An Evaluation and Overview of Indices Based on Arabic Documents (IJCSEA Journal)
The paper aims to give an overview of inverted files, signature files, suffix arrays and suffix trees based on an Arabic document collection. It also aims to give the points of comparison between all these techniques and the performance of each technique on each point. Any information retrieval system is usually evaluated through its efficiency and effectiveness. Moreover, there are two aspects of efficiency: time and space. The time measure represents the time needed to retrieve a document relevant to a specified query, while space represents the memory capacity needed to create the indices.
In this paper, four indices are built: inverted file, signature file, suffix array and suffix tree. To measure the performance of each one, a retrieval system is built to compare the results of using these indices.
A collection of 242 Arabic abstracts from the proceedings of the Saudi Arabian National Computer Conferences was used in these systems, and a collection of 60 Arabic queries was run on them. We found that the retrieval result for inverted files is better than that of the other indices.
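A minimal inverted-file index of the kind the paper found most effective: map each term to the set of documents containing it, then answer a query by intersecting posting lists. Toy English documents stand in for the Arabic abstracts used in the paper.

```python
# Inverted index sketch: term -> set of document ids; query by intersection.
from collections import defaultdict

docs = {
    1: "parallel algorithms for matrix computation",
    2: "arabic text retrieval with inverted files",
    3: "inverted files versus signature files",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def query(terms):
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(query(["inverted", "files"]))   # {2, 3}
```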
Duplicate File Analyzer using N-layer Hash and Hash Table (AM Publications)
With the advancement of data storage technology, the cost per gigabyte has reduced significantly, causing users to negligently store redundant files on their systems. These may be created while taking manual backups or by improperly written programs. Often, files with exactly the same content have different file names, and files with different content may have the same name. Hence, an algorithm that identifies redundant files based on their file name and/or size alone is not enough. In this paper, the authors propose a novel approach where the N-layer hash of every file is individually calculated and stored in a hash table data structure. If the N-layer hash of a file matches a hash that already exists in the hash table, that file is marked as a duplicate, which can be deleted or moved to a specific location as per the user's choice. The hash table data structure helps achieve O(n) time complexity, and the use of N-layer hashes improves the accuracy of identifying redundant files. This approach can be used for a folder-specific, drive-specific or system-wide scan as required.
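A sketch of such a duplicate scan: hash every file, keep digests in a dict (a hash table), and flag files whose digest was already seen. "N-layer" is interpreted here as re-hashing the digest N times; the paper's exact layering may differ, and the directory path is a placeholder.

```python
# Duplicate-file finder: N-layer SHA-256 digest as the hash-table key.
import hashlib
from pathlib import Path

def n_layer_hash(path, n=3):
    digest = hashlib.sha256(path.read_bytes()).digest()
    for _ in range(n - 1):                 # apply further hash layers
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

seen = {}                                  # digest -> first file seen with it
for f in Path("target_dir").rglob("*"):    # placeholder directory
    if f.is_file():
        h = n_layer_hash(f)
        if h in seen:
            print(f"duplicate: {f} == {seen[h]}")
        else:
            seen[h] = f
```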
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS (ijsc)
Classification is an important technique used in information retrieval. Supervised classification suffers from certain limitations concerning the collection and labeling of the training dataset. When facing multi-domain classification, multiple training datasets and classifiers are needed, which is relatively difficult. In this paper an unsupervised classification system is proposed that can also manage the multi-domain classification problem. It is a multi-domain system where each domain is represented by an ontology. A document is mapped onto each ontology based on the weights of the mutual tokens between them, with the help of fuzzy sets, resulting in a mapping degree of the document with each domain, as the sketch below illustrates. An experiment was carried out showing satisfactory classification results, with an improvement in the evaluation results of the proposed system compared to Apache Lucene.
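A hedged sketch of the mapping step only: score a document against each domain ontology by the weight of their mutual tokens, normalised into [0, 1] as a fuzzy mapping degree. The ontologies, weights and document are invented, not the paper's.

```python
# Fuzzy mapping degree: weighted token overlap between document and ontology.
ontologies = {
    "medicine": {"patient": 1.0, "dosage": 0.9, "clinical": 0.8},
    "finance":  {"market": 1.0, "equity": 0.9, "dosage": 0.1},
}

def mapping_degree(tokens, ontology):
    mutual = set(tokens) & set(ontology)
    total = sum(ontology.values())
    return sum(ontology[t] for t in mutual) / total if total else 0.0

doc = "the patient received a higher dosage in the clinical trial".split()
for domain, onto in ontologies.items():
    print(domain, round(mapping_degree(doc, onto), 2))   # medicine scores highest
```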
This document discusses the challenges of collecting, storing, and analyzing large volumes of internet measurement data. It examines issues such as distributed and resilient data collection, handling multi-timescale and heterogeneous data from various sources, and developing standardized tools and formats. The paper proposes the "datapository" - an internet data repository designed to address these challenges through a collaborative framework for data sharing, storage, and analysis tools. The goal is to help both network operators and researchers more effectively harness the wealth of data available.
Files in Python represent sequences of bytes stored on disk for permanent storage. They can be opened in different modes like read, write, append etc using the open() function, which returns a file object. Common file operations include writing, reading, seeking to specific locations, and closing the file. The with statement is recommended for opening and closing files to ensure they are properly closed even if an exception occurs.
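A minimal demonstration of the pattern described above; notes.txt is an example name.

```python
# The 'with' statement opens the file and guarantees it is closed afterwards,
# even if an exception occurs inside the block.
with open("notes.txt", "w") as f:
    f.write("first line\n")

with open("notes.txt") as f:        # default mode is read
    for line in f:
        print(line.rstrip())
```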
A file is the basic unit of information storage on a secondary storage device. Almost every form of data and information resides on these devices in the form of files, whether audio or video, text or binary.
Files may be classified on different bases as follows:
1. On the basis of content:
Text files: files containing data/information in textual form; merely a collection of characters, e.g. document files.
Binary files: files containing machine code. The contents are not human-readable and can be interpreted only in a specified way, using the same application that created them, e.g. executable program files, audio files, video files.
Dissertation title and final project: Data source registration in the Virtual Laboratory. The subject of the thesis and related project was to integrate EGEE/WLCG data sources into GridSpace Virtual Laboratory (http://gs.cyfronet.pl/).
Poster presentation entitled Integrating EGEE Storage Services with the Virtual Laboratory:
http://www.plgrid.pl/en/pr_materials/posters
Dissertation available at http://virolab.cyfronet.pl/trac/vlvl#MasterofScienceThesesrelatedtoViroLab
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex... (Paolo Nesi)
Abstract: The recent growth of the World Wide Web at increasing rate and speed, and the number of online resources populating the Internet, represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely represented in unstructured natural language formats. In order to automatically ingest and process such huge amounts of data, single-machine, non-distributed architectures are proving inefficient for tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and computational power needs have increased significantly, requiring solutions such as distributed frameworks and parallel computing programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. This has been achieved by integrating the APIs of the widespread GATE open source NLP platform in a multi-node cluster built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.
Files are containers used for storing data on computer storage devices. They allow data to be preserved even after a program terminates, and make it easy to enter large amounts of data and transfer data between computers. There are two main types of files: text files, which contain plain text and have a .txt extension, and binary files, which store data as 0s and 1s and have other extensions like .bin. Common file operations include creating, opening, closing, and reading from or writing to files using functions like fopen(), fclose(), fprintf(), and fscanf() in C.
fread() and fwrite() are functions used to read and write structured data from files. fread() reads an entire structure block from a file into memory. fwrite() writes an entire structure block from memory to a file. These functions allow efficient reading and writing of complex data types like structures and arrays from binary files.
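Since the examples in this listing are sketched in Python, here is a Python analogue of the fread()/fwrite() pattern: the struct module packs a fixed-layout record to bytes and unpacks it back, much like writing and reading a C struct block. The record layout is an invented example.

```python
# Fixed-layout binary records with struct: pack on write, unpack on read.
import struct

RECORD = struct.Struct("<i20sf")     # little-endian: int id, 20-byte name, float score

with open("records.bin", "wb") as f:
    f.write(RECORD.pack(1, b"alice", 91.5))   # struct pads the name with NULs
    f.write(RECORD.pack(2, b"bob", 84.0))

with open("records.bin", "rb") as f:
    while chunk := f.read(RECORD.size):       # one full record per iteration
        rec_id, name, score = RECORD.unpack(chunk)
        print(rec_id, name.rstrip(b"\0").decode(), round(score, 1))
```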
Metadata for digital long-term preservation (Michael Day)
Presentation given at the Max Planck Gesellschaft eScience Seminar 2008: Aspects of long-term archiving, hosted by the Gesellschaft für Wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), Göttingen, Germany, 19-20 June 2008
Report blocking, management of files in secondary memory, static vs dynamic a... (NoorMustafaSoomro)
Three key topics were discussed:
1) Record blocking - the process of grouping related data records into blocks for storage. Fixed, variable-length spanned, and unspanned blocking methods were described.
2) File management in secondary memory - files have attributes like name, size, permissions. Common file operations are create, open, read, write. Directory structures and access paths organize files.
3) Memory allocation - static allocation assigns memory at compile time while dynamic allocation occurs at runtime using functions like malloc(), free(), calloc(), realloc(). Contiguous, linked lists and indexing are approaches to storing files and managing free space.
Data carving using artificial headers info sec conference (Robert Daniel)
This document proposes a new approach to data carving called File Recovery using Artificial Headers (FRAH) that can recover files with corrupted or missing headers. An evaluation of existing data carving tools found they have difficulty recovering fragmented files. FRAH works by inserting an artificial header onto files to circumvent missing headers. Testing showed FRAH could successfully recover files that standard tools could not. However, FRAH has limitations in recovering files where payload data is also missing. Further research is needed to make FRAH more robust.
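A sketch of the FRAH idea rather than the paper's tool: if a carved payload lacks its header, prepend a known-good header for the suspected format so that ordinary viewers and parsers can open it. A minimal JPEG/JFIF header serves as the artificial header; the file names are placeholders.

```python
# Prepend an artificial JFIF (JPEG) header to a headerless carved payload.
JFIF_HEADER = bytes.fromhex("ffd8ffe000104a46494600010100000100010000")

with open("headerless_payload.bin", "rb") as f:   # placeholder carved fragment
    payload = f.read()

if not payload.startswith(b"\xff\xd8"):           # header really is missing
    with open("repaired.jpg", "wb") as out:
        out.write(JFIF_HEADER + payload)
```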
Comparison of data recovery techniques on master file table between Aho-Coras... (TELKOMNIKA JOURNAL)
This document compares two data recovery techniques - Aho-Corasick and logical data recovery - based on efficiency using three datasets. The logical data recovery algorithm was found to be faster than Aho-Corasick according to processing speed, recovering files at speeds of up to 12.8 MB/s compared to 11.9 MB/s for Aho-Corasick. However, both algorithms achieved the same recovery accuracy rates of 94-100% depending on the dataset. The document concludes that a hybrid algorithm combining the strengths of both - using logical recovery for speed and Aho-Corasick for targeted file searches - could achieve even higher efficiency for data recovery.
A large number of video files are available due to the widespread use of surveillance cameras, CCTV, mobile cameras, etc. The chance of corruption or damage to a video file is a critical condition in investigation cases, as video files sometimes play a crucial role in criminal cases. Recovery of a corrupted or damaged video file is therefore critical in digital forensics. A forensic analyst examining a disk may encounter many fragments of deleted digital files but be unable to determine the proper sequence of fragments needed to rebuild the files. File restoration can be done using several approaches, including a file-based approach and a frame-based approach. This paper surveys various techniques used for video file restoration. Through this analysis we found that frame-based video file restoration provides better efficiency than other techniques, and that fragmented as well as partially overwritten files can be recovered using frame-based recovery.
Searching Keyword-lacking Files based on Latent Interfile Relationships (Takashi Kobayashi)
FRIDAL is a desktop search method that mines latent inter-file relationships from file access logs to find keyword-lacking files related to search keywords. It calculates relationship strengths based on the overlap of approximate file use durations extracted from the logs. An experiment showed FRIDAL found related files with fewer checks and smaller costs than traditional full-text search.
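A hedged sketch of FRIDAL's core signal: estimate how strongly two files are related by how much their use intervals from the access log overlap. Intervals are (start, end) timestamps in seconds, and the numbers are invented; the paper's exact formula may differ.

```python
# Relationship strength from overlapping file-use durations.
def overlap(a, b):
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def relationship_strength(uses_a, uses_b):
    shared = sum(overlap(a, b) for a in uses_a for b in uses_b)
    total = sum(e - s for s, e in uses_a) + sum(e - s for s, e in uses_b)
    return 2 * shared / total if total else 0.0    # 1.0 = always used together

report_doc = [(0, 600), (900, 1500)]      # two editing sessions
figure_png = [(100, 550), (950, 1400)]    # opened during both sessions
print(round(relationship_strength(report_doc, figure_png), 2))   # 0.86
```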
Fota Delta Size Reduction Using File Similarity Algorithms (Shivansh Gaur)
This paper proposes algorithms to reduce the size of firmware updates transmitted over the air (FOTA). It uses a three-stage approach: chunking files into variable-sized pieces, hashing the chunks, and comparing hashes to find similar chunks between versions. This allows creating "delta" updates that transmit only the changed parts rather than full files. The algorithms were able to reduce FOTA delta sizes by up to 30% compared to existing tools on test firmware pairs, saving bandwidth. Experimental results on three firmware pairs demonstrate the size reductions and performance compared to Google's existing FOTA update method.
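A sketch of the chunk/hash/compare stage, simplified to fixed-size chunks (the paper uses variable-size chunking): hash the old image's chunks, then emit only the new image's chunks whose hashes are absent, i.e. the delta. The firmware bytes are invented.

```python
# Chunk-hash delta: transmit only chunks not already present in the old image.
import hashlib

CHUNK = 1024

def chunks(data):
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

def delta(old, new):
    known = {hashlib.sha256(c).hexdigest() for c in chunks(old)}
    return [c for c in chunks(new) if hashlib.sha256(c).hexdigest() not in known]

old_fw = b"\x00" * 8192 + b"bootloader v1" + b"\x00" * 8192
new_fw = b"\x00" * 8192 + b"bootloader v2" + b"\x00" * 8192
changed = delta(old_fw, new_fw)
print(f"{len(changed)} of {len(chunks(new_fw))} chunks must be transmitted")
```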
The Google File System was innovatively created by Google engineers and was ready for production in record time. Google's success is attributed to its efficient search algorithm, and also to the underlying commodity hardware. As Google runs a large number of applications, its goal became to build a vast storage network out of inexpensive commodity hardware, so Google created its own file system, the Google File System (GFS). GFS is one of the largest file systems in operation: a scalable distributed file system for large, distributed, data-intensive applications. In its design phase, stress was placed on component failures being the norm, files being huge, and files being mutated by appending data. The entire file system is organized hierarchically in directories and identified by pathnames. The architecture comprises multiple chunk servers, multiple clients and a single master. Files are divided into chunks, and chunk size is a key design parameter. GFS also uses leases and mutation order in its design to achieve atomicity and consistency. As for fault tolerance, GFS is highly available, and replicas of the chunk servers and the master exist.
This document discusses different file organization techniques for conventional database management systems. It describes sequential file organization where records are stored consecutively. Indexed sequential file organization is introduced to improve query response time for sequential files by adding an index. Direct file organization and multi-key file organization are also mentioned, which allow accessing records using different keys. Trade-offs among these techniques are discussed.
An Efficient Search Engine for Searching Desired File (IDES Editor)
With ever-increasing data in the form of e-files, there has always been a need for a good application to search for information in those files efficiently. This paper extends the implementation of our previous algorithm in the form of a Windows application. The algorithm has a search time complexity of Θ(n) with no pre-processing time and is thus very efficient for searching sentences in a pool of files.
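A sketch in the spirit of that claim, not the paper's algorithm: a single linear pass over each file with no pre-processing, reporting every file and line where the query occurs. The directory name and query are placeholders.

```python
# Linear, index-free search across a pool of text files.
from pathlib import Path

def search_pool(directory, query):
    for path in Path(directory).rglob("*.txt"):       # the "pool of files"
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, 1):
            if query in line:                         # single scan per line
                yield path, lineno, line.strip()

for path, lineno, line in search_pool("documents", "file recovery"):
    print(f"{path}:{lineno}: {line}")
```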
Google File System (GFS) is a distributed file system designed by Google to provide efficient, scalable, and reliable storage. It is organized hierarchically with directories and files identified by pathnames. Files are divided into chunks which are replicated across multiple servers for fault tolerance. The architecture includes a single master server, multiple chunk servers, and clients. The design focuses on handling partial failures, large files, and append-only mutations. It provides features like leases to ensure consistency, atomic record appends, and snapshots. The master manages the namespace and replication while the system aims for high availability, performance, and scalability.
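A toy model of the master's metadata described above, not Google's implementation: the master maps a file path to chunk handles and each handle to replica locations, and clients then fetch chunk data from the chunkservers directly. All names are invented.

```python
# GFS-style metadata lookup: path -> chunk handle -> replica locations.
CHUNK_SIZE = 64 * 1024 * 1024                     # 64 MB chunks

file_to_chunks = {"/logs/web.log": ["h1", "h2", "h3"]}        # path -> handles
chunk_locations = {"h1": ["cs1", "cs4", "cs7"],               # handle -> replicas
                   "h2": ["cs2", "cs4", "cs9"],
                   "h3": ["cs1", "cs5", "cs8"]}

def locate(path, offset):
    handle = file_to_chunks[path][offset // CHUNK_SIZE]       # which chunk
    return handle, chunk_locations[handle]                    # where its replicas live

print(locate("/logs/web.log", 70 * 1024 * 1024))   # ('h2', ['cs2', 'cs4', 'cs9'])
```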
An Efficient Approach to Manage Small Files in Distributed File Systems (IRJET Journal)
This document proposes an efficient approach to manage small files in distributed file systems. It discusses challenges in storing and retrieving a large number of small files due to their metadata size. The proposed system aims to improve performance by designing a new metadata structure that decreases original metadata size and increases file access speed. It presents a system architecture with clients, metadata server and data servers. The system classifies files into different storage types and combines existing techniques into a single architecture. It provides algorithms for reading local metadata and file data in two phases. The performance of the approach is evaluated on a distributed file system environment.
1. The essay describes the proper technique for holding a deck of cards in one's hand in preparation for various card manipulation moves.
2. It outlines the starting position for holding the deck, with the fingers curled around and supporting the deck in a tilted position resting on the forefinger.
3. Different spreads are explained, such as the regular spread where the left thumb pushes cards to the right and the right hand collects them in a straight line.
The world contains numerous kinds of biodiversity, and two thirds of it is aquatic, in fresh water and marine water. Knowledge of these kinds and this diversity, and proper file management of that knowledge, is very important for fisheries science.
Fresh and marine waters hold a huge number of vertebrate and invertebrate animals and plants. Identifying them and knowing their uses is very important for fisheries and aquaculture, so proper file management plays a useful role in both.
Without proper file management it is not possible to know the full range of plants and animals.
Knowledge and file management of culturable species and their predators is very important for aquaculture; the habitats and food habits of culturable species are essential for successful aquaculture, as are data on breeding seasons, behaviour and fish with high growth rates. Proper management of these records, and of study documents for fisheries students, is very important. So file management is very important in fisheries science.
The document discusses various digital preservation activities the author undertook as part of an assignment, including archiving, harvesting, mirroring files, extracting metadata, and verifying checksums. The author learned how to use tools like PeaZip, Xena, emulators, and metadata extraction software. They created disk images and analyzed them using bulk extractor to identify sensitive data. The author automated a workflow to generate checksums and write them to an Excel file. Overall, the assignment helped the author gain hands-on experience with digital preservation concepts and tools.
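A sketch of the checksum step of such a workflow: hash every file under a directory and write the results to a CSV file (standing in for the Excel file mentioned above). The directory name is a placeholder.

```python
# Generate SHA-256 checksums for every file under a directory, written to CSV.
import csv
import hashlib
from pathlib import Path

with open("checksums.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["path", "sha256"])
    for f in sorted(Path("archive").rglob("*")):    # placeholder directory
        if f.is_file():
            writer.writerow([str(f), hashlib.sha256(f.read_bytes()).hexdigest()])
```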
1) Files are the basic unit of storage in an operating system. They provide a logical view of information storage that is abstracted from physical storage devices.
2) A file has attributes like its name, size, location, and permissions. The operating system performs basic operations on files like creating, reading, writing, deleting and truncating files.
3) There are different methods for organizing files and allocating storage space, including contiguous, linked, and indexed allocation schemes as well as single-level, two-level, and tree directory structures. This allows files to be efficiently organized and accessed.
EFFICIENT MIXED MODE SUMMARY FOR MOBILE NETWORKS (ijwmn)
This document proposes a new lossless compression scheme called Mixed Mode Summary-based Lossless Compression for Mobile Networks log files (MMSLC). MMSLC uses the Apriori algorithm to mine frequent patterns from log files. It then assigns unique codes to frequent patterns based on their compression gain. MMSLC exploits correlations between consecutive log files by using a mixed online and offline compression approach. It applies frequent patterns mined from previous files during online compression of current files, while also mining patterns from current files for future compression. The method achieves high compression ratios and provides summaries of frequent patterns to aid in network monitoring.
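A toy version of the code-assignment idea only, not MMSLC itself (and using simple line counting rather than Apriori): give the most compression-profitable log patterns the shortest codes, where gain is approximated as frequency times length. The log lines are invented.

```python
# Assign short codes to frequent log patterns by estimated compression gain.
from collections import Counter

log = ["link up on port 3", "heartbeat ok", "heartbeat ok",
       "link up on port 3", "heartbeat ok", "auth failure for admin"]

freq = Counter(log)
by_gain = sorted(freq, key=lambda p: freq[p] * len(p), reverse=True)
codebook = {pattern: f"#{i}" for i, pattern in enumerate(by_gain)}

compressed = [codebook[line] for line in log]
print(codebook)
print(compressed)          # repeated patterns collapse to short codes
```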
IRJET - Proficient Recovery Over Records using Encryption in Cloud Computing (IRJET Journal)
This document proposes a scheme for securely storing and retrieving encrypted documents in cloud computing based on attributes. It first designs a hierarchical attribute-based encryption scheme to encrypt document collections such that documents with shared attributes can be encrypted together efficiently. It then constructs an Attribute-based Retrieval Features (ARF) tree index structure based on document vectors incorporating term frequency-inverse document frequency and document attributes. A depth-first search algorithm is designed for efficient retrieval from the encrypted index. The scheme aims to allow fine-grained access control of documents while supporting accurate and efficient searches over the encrypted collection.
The document provides an introduction to database management systems and databases. It discusses:
1) Why we need DBMS and examples of common databases like bank, movie, and railway databases.
2) The definitions of data, information, databases, and DBMS. A DBMS allows for the creation, storage, and retrieval of data from a database.
3) Different types of file organization methods like heap, sorted, indexed, and hash files and their pros and cons. File organization determines how records are stored and accessed in a database.
Database Management System is a software program that allows for the creation, maintenance and use of databases. It provides convenient methods for defining, storing and retrieving data. There are different types of file organizations used in DBMS, including heap, sorted, indexed and hash. Each type has advantages and disadvantages for storage, retrieval, updating and deleting of records from the database. The document discusses these file organization methods in detail.
This document discusses memory management in operating systems. It covers topics like how memory management keeps track of allocated and free memory, provides protection using base and limit registers, and different address binding schemes. It also discusses dynamic loading, dynamic linking, logical versus physical addresses, swapping, memory allocation techniques like single allocation and multiple partitions, and issues like fragmentation. Paging and segmentation techniques for managing memory are also summarized.
Similar to File Reconstruction in Digital Forensic (20)
Amazon products reviews classification based on machine learning, deep learni...TELKOMNIKA JOURNAL
In recent times, the trend of online shopping through e-commerce stores and websites has grown to a huge extent. Whenever a product is purchased on an e-commerce platform, people leave their reviews about the product. These reviews are very helpful for the store owners and the product’s manufacturers for the betterment of their work process as well as product quality. An automated system is proposed in this work that operates on two datasets D1 and D2 obtained from Amazon. After certain preprocessing steps, N-gram and word embedding-based features are extracted using term frequency-inverse document frequency (TF-IDF), bag of words (BoW) and global vectors (GloVe), and Word2vec, respectively. Four machine learning (ML) models support vector machines (SVM), logistic regression (RF), logistic regression (LR), multinomial Naïve Bayes (MNB), two deep learning (DL) models convolutional neural network (CNN), long-short term memory (LSTM), and standalone bidirectional encoder representations (BERT) are used to classify reviews as either positive or negative. The results obtained by the standard ML, DL models and BERT are evaluated using certain performance evaluation measures. BERT turns out to be the best-performing model in the case of D1 with an accuracy of 90% on features derived by word embedding models while the CNN provides the best accuracy of 97% upon word embedding features in the case of D2. The proposed model shows better overall performance on D2 as compared to D1.
Design, simulation, and analysis of microstrip patch antenna for wireless app...TELKOMNIKA JOURNAL
In this study, a microstrip patch antenna that works at 3.6 GHz was built and tested to see how well it works. In this work, Rogers RT/Duroid 5880 has been used as the substrate material, with a dielectric permittivity of 2.2 and a thickness of 0.3451 mm; it serves as the base for the examined antenna. The computer simulation technology (CST) studio suite is utilized to show the recommended antenna design. The goal of this study was to get a more extensive transmission capacity, a lower voltage standing wave ratio (VSWR), and a lower return loss, but the main goal was to get a higher gain, directivity, and efficiency. After simulation, the return loss, gain, directivity, bandwidth, and efficiency of the supplied antenna are found to be -17.626 dB, 9.671 dBi, 9.924 dBi, 0.2 GHz, and 97.45%, respectively. Besides, the recreation uncovered that the transfer speed side-lobe level at phi was much better than those of the earlier works, at -28.8 dB, respectively. Thus, it makes a solid contender for remote innovation and more robust communication.
Design and simulation an optimal enhanced PI controller for congestion avoida...TELKOMNIKA JOURNAL
This document describes using a snake optimization algorithm to tune the gains of an enhanced proportional-integral controller for congestion avoidance in a TCP/AQM system. The controller aims to maintain a stable and desired queue size without noise or transmission problems. A linearized model of the TCP/AQM system is presented. An enhanced PI controller combining nonlinear gain and original PI gains is proposed. The snake optimization algorithm is then used to tune the parameters of the enhanced PI controller to achieve optimal system performance and response. Simulation results are discussed showing the proposed controller provides a stable and robust behavior for congestion control.
Improving the detection of intrusion in vehicular ad-hoc networks with modifi...TELKOMNIKA JOURNAL
Vehicular ad-hoc networks (VANETs) are wireless-equipped vehicles that form networks along the road. The security of this network has been a major challenge. The identity-based cryptosystem (IBC) previously used to secure the networks suffers from membership authentication security features. This paper focuses on improving the detection of intruders in VANETs with a modified identity-based cryptosystem (MIBC). The MIBC is developed using a non-singular elliptic curve with Lagrange interpolation. The public key of vehicles and roadside units on the network are derived from number plates and location identification numbers, respectively. Pseudo-identities are used to mask the real identity of users to preserve their privacy. The membership authentication mechanism ensures that only valid and authenticated members of the network are allowed to join the network. The performance of the MIBC is evaluated using intrusion detection ratio (IDR) and computation time (CT) and then validated with the existing IBC. The result obtained shows that the MIBC recorded an IDR of 99.3% against 94.3% obtained for the existing identity-based cryptosystem (EIBC) for 140 unregistered vehicles attempting to intrude on the network. The MIBC shows lower CT values of 1.17 ms against 1.70 ms for EIBC. The MIBC can be used to improve the security of VANETs.
Conceptual model of internet banking adoption with perceived risk and trust f...TELKOMNIKA JOURNAL
Understanding the primary factors of internet banking (IB) acceptance is critical for both banks and users; nevertheless, our knowledge of the role of users’ perceived risk and trust in IB adoption is limited. As a result, we develop a conceptual model by incorporating perceived risk and trust into the technology acceptance model (TAM) theory toward the IB. The proper research emphasized that the most essential component in explaining IB adoption behavior is behavioral intention to use IB adoption. TAM is helpful for figuring out how elements that affect IB adoption are connected to one another. According to previous literature on IB and the use of such technology in Iraq, one has to choose a theoretical foundation that may justify the acceptance of IB from the customer’s perspective. The conceptual model was therefore constructed using the TAM as a foundation. Furthermore, perceived risk and trust were added to the TAM dimensions as external factors. The key objective of this work was to extend the TAM to construct a conceptual model for IB adoption and to get sufficient theoretical support from the existing literature for the essential elements and their relationships in order to unearth new insights about factors responsible for IB adoption.
Efficient combined fuzzy logic and LMS algorithm for smart antennaTELKOMNIKA JOURNAL
The smart antennas are broadly used in wireless communication. The least mean square (LMS) algorithm is a procedure that is concerned in controlling the smart antenna pattern to accommodate specified requirements such as steering the beam toward the desired signal, in addition to placing the deep nulls in the direction of unwanted signals. The conventional LMS (C-LMS) has some drawbacks like slow convergence speed besides high steady state fluctuation error. To overcome these shortcomings, the present paper adopts an adaptive fuzzy control step size least mean square (FC-LMS) algorithm to adjust its step size. Computer simulation outcomes illustrate that the given model has fast convergence rate as well as low mean square error steady state.
Design and implementation of a LoRa-based system for warning of forest fireTELKOMNIKA JOURNAL
This paper presents the design and implementation of a forest fire monitoring and warning system based on long range (LoRa) technology, a novel ultra-low power consumption and long-range wireless communication technology for remote sensing applications. The proposed system includes a wireless sensor network that records environmental parameters such as temperature, humidity, wind speed, and carbon dioxide (CO2) concentration in the air, as well as taking infrared photos.The data collected at each sensor node will be transmitted to the gateway via LoRa wireless transmission. Data will be collected, processed, and uploaded to a cloud database at the gateway. An Android smartphone application that allows anyone to easily view the recorded data has been developed. When a fire is detected, the system will sound a siren and send a warning message to the responsible personnel, instructing them to take appropriate action. Experiments in Tram Chim Park, Vietnam, have been conducted to verify and evaluate the operation of the system.
Wavelet-based sensing technique in cognitive radio networkTELKOMNIKA JOURNAL
Cognitive radio is a smart radio that can change its transmitter parameter based on interaction with the environment in which it operates. The demand for frequency spectrum is growing due to a big data issue as many Internet of Things (IoT) devices are in the network. Based on previous research, most frequency spectrum was used, but some spectrums were not used, called spectrum hole. Energy detection is one of the spectrum sensing methods that has been frequently used since it is easy to use and does not require license users to have any prior signal understanding. But this technique is incapable of detecting at low signal-to-noise ratio (SNR) levels. Therefore, the wavelet-based sensing is proposed to overcome this issue and detect spectrum holes. The main objective of this work is to evaluate the performance of wavelet-based sensing and compare it with the energy detection technique. The findings show that the percentage of detection in wavelet-based sensing is 83% higher than energy detection performance. This result indicates that the wavelet-based sensing has higher precision in detection and the interference towards primary user can be decreased.
A novel compact dual-band bandstop filter with enhanced rejection bandsTELKOMNIKA JOURNAL
In this paper, we present the design of a new wide dual-band bandstop filter (DBBSF) using nonuniform transmission lines. The method used to design this filter is to replace conventional uniform transmission lines with nonuniform lines governed by a truncated Fourier series. Based on how impedances are profiled in the proposed DBBSF structure, the fractional bandwidths of the two 10 dB-down rejection bands are widened to 39.72% and 52.63%, respectively, and the physical size has been reduced compared to that of the filter with the uniform transmission lines. The results of the electromagnetic (EM) simulation support the obtained analytical response and show an improved frequency behavior.
Deep learning approach to DDoS attack with imbalanced data at the application...TELKOMNIKA JOURNAL
A distributed denial of service (DDoS) attack is where one or more computers attack or target a server computer, by flooding internet traffic to the server. As a result, the server cannot be accessed by legitimate users. A result of this attack causes enormous losses for a company because it can reduce the level of user trust, and reduce the company’s reputation to lose customers due to downtime. One of the services at the application layer that can be accessed by users is a web-based lightweight directory access protocol (LDAP) service that can provide safe and easy services to access directory applications. We used a deep learning approach to detect DDoS attacks on the CICDDoS 2019 dataset on a complex computer network at the application layer to get fast and accurate results for dealing with unbalanced data. Based on the results obtained, it is observed that DDoS attack detection using a deep learning approach on imbalanced data performs better when implemented using synthetic minority oversampling technique (SMOTE) method for binary classes. On the other hand, the proposed deep learning approach performs better for detecting DDoS attacks in multiclass when implemented using the adaptive synthetic (ADASYN) method.
The appearance of uncertainties and disturbances often effects the characteristics of either linear or nonlinear systems. Plus, the stabilization process may be deteriorated thus incurring a catastrophic effect to the system performance. As such, this manuscript addresses the concept of matching condition for the systems that are suffering from miss-match uncertainties and exogeneous disturbances. The perturbation towards the system at hand is assumed to be known and unbounded. To reach this outcome, uncertainties and their classifications are reviewed thoroughly. The structural matching condition is proposed and tabulated in the proposition 1. Two types of mathematical expressions are presented to distinguish the system with matched uncertainty and the system with miss-matched uncertainty. Lastly, two-dimensional numerical expressions are provided to practice the proposed proposition. The outcome shows that matching condition has the ability to change the system to a design-friendly model for asymptotic stabilization.
Implementation of FinFET technology based low power 4×4 Wallace tree multipli...TELKOMNIKA JOURNAL
Many systems, including digital signal processors, finite impulse response (FIR) filters, application-specific integrated circuits, and microprocessors, use multipliers. The demand for low power multipliers is gradually rising day by day in the current technological trend. In this study, we describe a 4×4 Wallace multiplier based on a carry select adder (CSA) that uses less power and has a better power delay product than existing multipliers. HSPICE tool at 16 nm technology is used to simulate the results. In comparison to the traditional CSA-based multiplier, which has a power consumption of 1.7 µW and power delay product (PDP) of 57.3 fJ, the results demonstrate that the Wallace multiplier design employing CSA with first zero finding logic (FZF) logic has the lowest power consumption of 1.4 µW and PDP of 27.5 fJ.
Evaluation of the weighted-overlap add model with massive MIMO in a 5G systemTELKOMNIKA JOURNAL
The flaw in 5G orthogonal frequency division multiplexing (OFDM) becomes apparent in high-speed situations. Because the doppler effect causes frequency shifts, the orthogonality of OFDM subcarriers is broken, lowering both their bit error rate (BER) and throughput output. As part of this research, we use a novel design that combines massive multiple input multiple output (MIMO) and weighted overlap and add (WOLA) to improve the performance of 5G systems. To determine which design is superior, throughput and BER are calculated for both the proposed design and OFDM. The results of the improved system show a massive improvement in performance ver the conventional system and significant improvements with massive MIMO, including the best throughput and BER. When compared to conventional systems, the improved system has a throughput that is around 22% higher and the best performance in terms of BER, but it still has around 25% less error than OFDM.
Reflector antenna design in different frequencies using frequency selective s...TELKOMNIKA JOURNAL
In this study, it is aimed to obtain two different asymmetric radiation patterns obtained from antennas in the shape of the cross-section of a parabolic reflector (fan blade type antennas) and antennas with cosecant-square radiation characteristics at two different frequencies from a single antenna. For this purpose, firstly, a fan blade type antenna design will be made, and then the reflective surface of this antenna will be completed to the shape of the reflective surface of the antenna with the cosecant-square radiation characteristic with the frequency selective surface designed to provide the characteristics suitable for the purpose. The frequency selective surface designed and it provides the perfect transmission as possible at 4 GHz operating frequency, while it will act as a band-quenching filter for electromagnetic waves at 5 GHz operating frequency and will be a reflective surface. Thanks to this frequency selective surface to be used as a reflective surface in the antenna, a fan blade type radiation characteristic at 4 GHz operating frequency will be obtained, while a cosecant-square radiation characteristic at 5 GHz operating frequency will be obtained.
Reagentless iron detection in water based on unclad fiber optical sensorTELKOMNIKA JOURNAL
A simple and low-cost fiber based optical sensor for iron detection is demonstrated in this paper. The sensor head consist of an unclad optical fiber with the unclad length of 1 cm and it has a straight structure. Results obtained shows a linear relationship between the output light intensity and iron concentration, illustrating the functionality of this iron optical sensor. Based on the experimental results, the sensitivity and linearity are achieved at 0.0328/ppm and 0.9824 respectively at the wavelength of 690 nm. With the same wavelength, other performance parameters are also studied. Resolution and limit of detection (LOD) are found to be 0.3049 ppm and 0.0755 ppm correspondingly. This iron sensor is advantageous in that it does not require any reagent for detection, enabling it to be simpler and cost-effective in the implementation of the iron sensing.
Impact of CuS counter electrode calcination temperature on quantum dot sensit...TELKOMNIKA JOURNAL
In place of the commercial Pt electrode used in quantum sensitized solar cells, the low-cost CuS cathode is created using electrophoresis. High resolution scanning electron microscopy and X-ray diffraction were used to analyze the structure and morphology of structural cubic samples with diameters ranging from 40 nm to 200 nm. The conversion efficiency of solar cells is significantly impacted by the calcination temperatures of cathodes at 100 °C, 120 °C, 150 °C, and 180 °C under vacuum. The fluorine doped tin oxide (FTO)/CuS cathode electrode reached a maximum efficiency of 3.89% when it was calcined at 120 °C. Compared to other temperature combinations, CuS nanoparticles crystallize at 120 °C, which lowers resistance while increasing electron lifetime.
In place of the commercial Pt electrode used in quantum sensitized solar cells, the low-cost CuS cathode is created using electrophoresis. High resolution scanning electron microscopy and X-ray diffraction were used to analyze the structure and morphology of structural cubic samples with diameters ranging from 40 nm to 200 nm. The conversion efficiency of solar cells is significantly impacted by the calcination temperatures of cathodes at 100 °C, 120 °C, 150 °C, and 180 °C under vacuum. The fluorine doped tin oxide (FTO)/CuS cathode electrode reached a maximum efficiency of 3.89% when it was calcined at 120 °C. Compared to other temperature combinations, CuS nanoparticles crystallize at 120 °C, which lowers resistance while increasing electron lifetime.
A progressive learning for structural tolerance online sequential extreme lea...TELKOMNIKA JOURNAL
This article discusses the progressive learning for structural tolerance online sequential extreme learning machine (PSTOS-ELM). PSTOS-ELM can save robust accuracy while updating the new data and the new class data on the online training situation. The robustness accuracy arises from using the householder block exact QR decomposition recursive least squares (HBQRD-RLS) of the PSTOS-ELM. This method is suitable for applications that have data streaming and often have new class data. Our experiment compares the PSTOS-ELM accuracy and accuracy robustness while data is updating with the batch-extreme learning machine (ELM) and structural tolerance online sequential extreme learning machine (STOS-ELM) that both must retrain the data in a new class data case. The experimental results show that PSTOS-ELM has accuracy and robustness comparable to ELM and STOS-ELM while also can update new class data immediately.
Electroencephalography-based brain-computer interface using neural networksTELKOMNIKA JOURNAL
This study aimed to develop a brain-computer interface that can control an electric wheelchair using electroencephalography (EEG) signals. First, we used the Mind Wave Mobile 2 device to capture raw EEG signals from the surface of the scalp. The signals were transformed into the frequency domain using fast Fourier transform (FFT) and filtered to monitor changes in attention and relaxation. Next, we performed time and frequency domain analyses to identify features for five eye gestures: opened, closed, blink per second, double blink, and lookup. The base state was the opened-eyes gesture, and we compared the features of the remaining four action gestures to the base state to identify potential gestures. We then built a multilayer neural network to classify these features into five signals that control the wheelchair’s movement. Finally, we designed an experimental wheelchair system to test the effectiveness of the proposed approach. The results demonstrate that the EEG classification was highly accurate and computationally efficient. Moreover, the average performance of the brain-controlled wheelchair system was over 75% across different individuals, which suggests the feasibility of this approach.
Adaptive segmentation algorithm based on level set model in medical imagingTELKOMNIKA JOURNAL
For image segmentation, level set models are frequently employed. It offer best solution to overcome the main limitations of deformable parametric models. However, the challenge when applying those models in medical images stills deal with removing blurs in image edges which directly affects the edge indicator function, leads to not adaptively segmenting images and causes a wrong analysis of pathologies wich prevents to conclude a correct diagnosis. To overcome such issues, an effective process is suggested by simultaneously modelling and solving systems’ two-dimensional partial differential equations (PDE). The first PDE equation allows restoration using Euler’s equation similar to an anisotropic smoothing based on a regularized Perona and Malik filter that eliminates noise while preserving edge information in accordance with detected contours in the second equation that segments the image based on the first equation solutions. This approach allows developing a new algorithm which overcome the studied model drawbacks. Results of the proposed method give clear segments that can be applied to any application. Experiments on many medical images in particular blurry images with high information losses, demonstrate that the developed approach produces superior segmentation results in terms of quantity and quality compared to other models already presented in previeous works.
Automatic channel selection using shuffled frog leaping algorithm for EEG bas...TELKOMNIKA JOURNAL
Drug addiction is a complex neurobiological disorder that necessitates comprehensive treatment of both the body and mind. It is categorized as a brain disorder due to its impact on the brain. Various methods such as electroencephalography (EEG), functional magnetic resonance imaging (FMRI), and magnetoencephalography (MEG) can capture brain activities and structures. EEG signals provide valuable insights into neurological disorders, including drug addiction. Accurate classification of drug addiction from EEG signals relies on appropriate features and channel selection. Choosing the right EEG channels is essential to reduce computational costs and mitigate the risk of overfitting associated with using all available channels. To address the challenge of optimal channel selection in addiction detection from EEG signals, this work employs the shuffled frog leaping algorithm (SFLA). SFLA facilitates the selection of appropriate channels, leading to improved accuracy. Wavelet features extracted from the selected input channel signals are then analyzed using various machine learning classifiers to detect addiction. Experimental results indicate that after selecting features from the appropriate channels, classification accuracy significantly increased across all classifiers. Particularly, the multi-layer perceptron (MLP) classifier combined with SFLA demonstrated a remarkable accuracy improvement of 15.78% while reducing time complexity.
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
An improved modulation technique suitable for a three level flying capacitor ...IJECEIAES
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed
simplified modulation technique paves the way for more straightforward and
efficient control of multilevel inverters, enabling their widespread adoption and
integration into modern power electronic systems. Through the amalgamation of
sinusoidal pulse width modulation (SPWM) with a high-frequency square wave
pulse, this controlling technique attains energy equilibrium across the coupling
capacitor. The modulation scheme incorporates a simplified switching pattern
and a decreased count of voltage references, thereby simplifying the control
algorithm.
Null Bangalore | Pentesters Approach to AWS IAMDivyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using hands on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
-Allows a user to pass a specific IAM role to an AWS service (ec2), typically used for service access delegation. Then exploit PassRole Misconfiguration granting unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
2. TELKOMNIKA ISSN: 1693-6930
File Reconstruction in Digital Forensic (Opim Salim Sitompul)
777
The Aho-Corasick algorithm first creates a tree-like automata engine called a trie. A trie is an ordered tree data structure used to store dynamic sets or associative arrays in which the keys are usually strings. A trie has many advantages over the binary tree [5] and can also be implemented to replace hash tables. A trie has additional links between the internal nodes of the keywords or existing patterns. These additional links enable rapid transitions when a failure occurs during the pattern matching process, allowing the automata to move to another trie branch that shares a similar prefix without the need for backtracking. The Aho-Corasick algorithm has been applied to solve numerous problems such as signature-based anti-virus applications [2], set matching in Bioinformatics [4], structural-to-syntactic matching for identical documents [6], searching of text strings in digital forensics [7], and text mining [8].
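As a concrete illustration of the trie and failure links described above, the following Python sketch builds a small Aho-Corasick automaton and searches a text for multiple patterns in a single pass. It is a minimal sketch for illustration only; the class and function names are ours and are not taken from the implementation described in this paper.

from collections import deque

class AhoCorasick:
    """A trie with failure links for multi-pattern matching in one pass."""

    def __init__(self, patterns):
        self.goto = [{}]        # goto[state][char] -> next state
        self.fail = [0]         # failure link of each state
        self.output = [set()]   # patterns that end at each state
        for p in patterns:
            self._insert(p)
        self._build_failure_links()

    def _insert(self, pattern):
        # Walk/extend the trie for one pattern.
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append(set())
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.output[state].add(pattern)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep the root as their failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.output[nxt] |= self.output[self.fail[nxt]]

    def search(self, text):
        # Follow goto transitions; on a mismatch, follow failure links (no backtracking).
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for p in self.output[state]:
                hits.append((i - len(p) + 1, p))
        return hits

# Illustrative usage: find several filename extensions in one pass over a text.
ac = AhoCorasick([".docx", ".pdf", ".jpg"])
print(ac.search("report.docx holiday.jpg notes.pdf"))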
In terms of recovering files, research work by [3] implemented a carving method using the Boyer-Moore algorithm to recover deleted files. According to the results, issues such as lengthy processing time and high-capacity storage requirements were faced in the carving process. Over 1.1 million files with a total size of 250GB were produced when carving an 8GB target disk, including a very large number of false positives. The research concluded that the Boyer-Moore algorithm, whose header and footer matching process is O(mn), was not recommended for this task. In 2010, [9] conducted research to reconstruct MP3 file fragments using Variable Bit Rate (VBR). The proposed method successfully increased the success rate of finding the correct file fragment to be reconstructed. The increase for high quality MP3 files was 49.20–69.42%, 1.80–3.75% for medium quality files, and 41.2–100.00% for low quality files. An increased rate of finding file fragments improves the performance of the carving process. Another study by [10], conducted in 2011, applied a carving method for multimedia files. The results showed that the method was able to successfully recover MP3, AVI, and WAV multimedia files that were allocated continuously. Even when a file was allocated intermittently, it could still be identified through its characteristics after the recovery process. Despite the difficulty of recovering a compressed multimedia file saved in NTFS, it could still be restored using the carving method.
Recent works on media file forensics, such as audio, photo, and video, are found in [11],[12]-[13], respectively, as well as digital forensics on Hadoop [14]. In [11], an audio forensic analysis of identical microphones was conducted using a statistical based method, while in [12] a photo forensic algorithm was proposed to detect image manipulation using error level analysis. Forgery detection on video inter-frames had also been conducted in [13] for surveillance and mobile recorded videos. As big data analysis has flourished in recent years, [14] conducted a study on digital forensics in Hadoop.
This paper is an extended version of our previous publication [15], while the initial stage of the research had also been described in [16]. Our previous research works related to this study were described in [17], which encompassed file type identification using the Distributed Adaptive Neural Network that was introduced and derived from [18-20].
The objective of this research is to restore all deleted files from the file system using a file undelete approach and the Aho-Corasick algorithm, so that each file can be analyzed to check whether it is an undamaged file or one containing fragments of other files. The scope of this research is focused on a hard disk with the NTFS file system that was examined and utilized in the recovery process. Furthermore, it is required that the storage media has not undergone any wiping or data overwriting process that would damage the Master File Table.
2. Research Method
The proposed method for this research can be described in four stages: disk imaging, accessing the MFT, file type identification and corruption check, and file reconstruction consisting of undelete, verification and analysis steps. Figure 1 depicts the general architecture of every stage performed to reconstruct deleted files in a file system. Each stage can be described in the following detailed steps.
Step 1. Duplication of the storage media contents (disk imaging) to obtain a duplicate that is identical to the actual storage media;
Step 2. Accessing and reading MFT records to search the records of all existing files and directories;
Step 3. Metadata extraction from each MFT record through a parsing process;
Figure 1. Proposed method-general architecture (the diagram shows the four stages: disk imaging via sector-by-sector copy, Master File Table record parsing into metadata, file type identification and corruption check using Aho-Corasick tries for signatures and filename extensions, and file reconstruction with undelete, verification and analysis)
Step 4. Parsing the file name to obtain the filename extension;
Step 5. Taking the first 32 bytes of the cluster occupied by the file as a sample;
Step 6. Building tries using the Aho-Corasick algorithm based on the actual signatures and file extensions;
Step 7. Identifying the file type based on signature and filename extension;
Step 8. Comparing the result of file type identification based on filename extension with the result of signature based identification to see whether the file is damaged;
Step 9. Registering the file along with its details, such as the file type (based on signature and filename extension), timestamp, file condition, and other information;
Step 10. Reconstructing the file based on metadata obtained from the MFT; verifying the reconstructed file by opening it and checking its signature to determine whether the recovered file experienced any damage; analyzing the damaged files and reading the readable information of each file;
Step 11. File recovery of the deleted files on the file system can then be performed after all previous steps are done.
After performing the steps in the proposed method, the developed program is able to recover deleted files on the file system and to select any file that needs to be reconstructed based on the given keywords. Each step is described in detail in the following sub-sections.
2.1. Disk Imaging
This stage duplicates the content of the storage media at the sector level so that the acquired duplicate is identical to the original storage media, including the boot sector and the MFT. The acquired duplicate is then used to access and read the MFT records. This stage can be optional if the storage media used is a secondary drive, as it will not be accessed directly by the operating system or other applications. Figure 2 shows the disk imaging scheme.
Figure 2. Disk Imaging Scheme
This research used a 4GB secondary storage media with the NTFS file system (the effective size is 90% of the storage media size, i.e., 3.60GB or 3,873,783,808 bytes) with a cluster size of 4KB (4096 bytes). There were 56 files of various file types with a total size of 3.54GB (3,797,409,792 bytes). A CRC-32 value was calculated for each file to be used as a comparison variable in the verification process.
2.2. Master File Table (MFT)
In this stage, the cluster number containing the MFT was read from the boot sector (sector 0); it is located at offset 0x30 with a length of 8 bytes and is read using the little-endian byte order. Each record in the MFT was then accessed to read information about each file and directory in the storage media. Each record went through a parsing process that breaks the record up based on the MFT entry header. The attribute header contains information about the type, size, and name of the file as well as the attribute value pointing to the actual data.
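A minimal sketch of the boot-sector read described above, assuming the image produced in the disk imaging stage: the 8 bytes at offset 0x30 of the NTFS boot sector are interpreted in little-endian order to obtain the starting cluster of the MFT. Function and variable names are ours.

import struct

def mft_start_cluster(image_path, sector_size=512):
    """Read the logical cluster number of the $MFT from the NTFS boot sector."""
    with open(image_path, "rb") as img:
        boot_sector = img.read(sector_size)
    # Offset 0x30, length 8 bytes, little-endian: starting cluster of the MFT.
    (mft_lcn,) = struct.unpack_from("<Q", boot_sector, 0x30)
    return mft_lcn

# Bytes per sector (offset 0x0B, 2 bytes) and sectors per cluster (offset 0x0D, 1 byte)
# can be read the same way to convert the cluster number into a byte offset.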
Each file and directory contained in the storage media has some information stored in
the MFT record. The MFT record provides information such as:
1. The type and condition of the record. This information is obtained from the 2-byte value at offset 0x16 of the MFT entry.
a. If the value is 0x00, the record is a file record that is no longer in use (the file has been deleted from the file system).
b. If the value is 0x01, the record is a file record that is still in use (the file is still listed in the file system).
c. If the value is 0x02, the record is a directory record that is no longer in use (the directory has been deleted).
d. If the value is 0x03, the record is a directory record that is still in use (the directory is still listed in the file system).
2. File condition. The file condition is identified by comparing the results of file identification based on the filename extension and on the signature. A deleted file can be in several conditions, i.e.:
a. Good. The file is considered to be in a good condition if, after deletion, the clusters occupied by the file are not used by another file.
b. Damaged. If, after deletion, the original clusters are occupied by another file, the file is damaged.
i. If the overriding file is larger than or equal to the original file, the original file will be completely overwritten. A completely overwritten file will have content that differs from the original file.
ii. If the overriding file is smaller than the original file, the original file will be partially overwritten. If a file is partially overwritten, some of the information from the original file is still readable. Fragments of the original file may contain information that can still be recovered.
3. File size (in bytes).
A parsing process is mandatory to obtain the file metadata needed for the reconstruction process, such as the filename, the logical cluster numbers occupied by the file, the flag identifying deleted files, and other information. Based on the metadata obtained, all files carrying the deleted flag are listed. An MFT record consisting of hexadecimal numbers is shown in Table 1.
Table 1. MFT Record
Offset 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00 46 49 4C 45 30 00 03 00 57 34 80 00 00 00 00 00
10 03 00 01 00 38 00 01 00 E0 01 00 00 00 04 00 00
20 00 00 00 00 00 00 00 00 06 00 00 00 03 00 00 00
30 03 00 00 00 00 00 00 00 10 00 00 00 48 00 00 00
40 00 00 18 00 00 00 00 00 30 00 00 00 18 00 00 00
50 24 CC 73 D0 62 D9 CF 01 24 CC 73 D0 62 D9 CF 01
60 24 CC 73 D0 62 D9 CF 01 24 CC 73 D0 62 D9 CF 01
70 06 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80 30 00 00 00 68 00 00 00 00 00 18 00 00 00 01 00
90 50 00 00 00 18 00 01 00 05 00 00 00 00 00 05 00
The MFT entry parsing process is as follows:
a. Offset 0x00 with a length of 4 bytes is the magic number “FILE”.
b. Offset 0x06 with a length of 2 bytes is the number of fixup array entries, which is 0x0003 = 3 entries.
c. The offset of the first attribute is obtained from offset 0x14 with a length of 2 bytes; the little-endian reading gives 0x0038.
d. Offset 0x16 with a length of 2 bytes is the flag; since the value is 0x0001, this record is a record for a file that is still in use. Parsing is then performed on the attributes located in the MFT record.
e. The first attribute is found at offset 0x0038 of the record; its first 4 bytes, 0x00000010, mark the attribute type. This attribute contains $STANDARD_INFORMATION, i.e., standard information.
f. The next 4 bytes are the length of the attribute, 0x00000048 = 72 bytes.
g. The next 1 byte is the non-resident flag; since the value is 0x00, the attribute is a resident attribute.
h. The attribute has a content size of 0x00000030 = 48 bytes and starts at offset 0x0018 = 24 (relative to the attribute).
Every record goes through this parsing process in order to obtain the metadata needed for the reconstruction process, such as the file name, the logical cluster numbers occupied by the file, the flag indicating deleted files, the flag indicating a file or directory record, timestamps, and other information.
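The parsing steps a-h can be expressed compactly in code. The sketch below reads only the fields used in the example above (magic number, fixup count, first-attribute offset, flags, and the attribute headers) from a single MFT record; it omits fixup application and full attribute decoding, and all names are illustrative rather than taken from the paper.

import struct

def parse_mft_entry(record: bytes):
    """Extract the header fields discussed in steps a-h from one MFT record."""
    if record[0:4] != b"FILE":                                       # a. magic number
        raise ValueError("not a valid MFT record")
    fixup_count = struct.unpack_from("<H", record, 0x06)[0]          # b. fixup array entries
    first_attr_offset = struct.unpack_from("<H", record, 0x14)[0]    # c. offset of first attribute
    flags = struct.unpack_from("<H", record, 0x16)[0]                # d. 0x00/0x01 file, 0x02/0x03 directory

    attributes = []
    offset = first_attr_offset
    while offset + 16 <= len(record):
        attr_type = struct.unpack_from("<I", record, offset)[0]      # e. attribute type (e.g., 0x10)
        if attr_type == 0xFFFFFFFF:                                   # end-of-attributes marker
            break
        attr_len = struct.unpack_from("<I", record, offset + 4)[0]    # f. attribute length
        non_resident = record[offset + 8]                             # g. resident if 0x00
        attributes.append((attr_type, attr_len, non_resident))
        if attr_len == 0:
            break
        offset += attr_len
    return {"fixups": fixup_count, "flags": flags, "attributes": attributes}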
2.3. File Type Identification and Corruption Checking
In this stage, the files contained in the storage media were identified to determine whether or not they were corrupted. File type identification was performed based on the filename extension and the file signature. The filename extension was derived from the file name, while
the file signature was obtained from the file header by taking a sample of the first 32 bytes, as shown in Figure 3.
Figure 3. Filename extension and signature sample (the filename extension is taken from the file name, e.g., Example_01.docx, while the signature sample is the first 32 bytes of the file header)
The filename extension and the signature of the actual file are the information required to identify the file type. The filename extensions and signatures to be utilized are shown in Table 2.
Table 2. List of Filename Extension and Signature to be utilized
No | Filename Extension | Signature | Description
1 | DOC, DOCX, PPT, PPTX, MSI, VSD, XLS, XLSX | 50 4B 03 04 14 00 06 00; D0 CF 11 E0 A1 B1 1A E1 | Document or Microsoft file
2 | PDF, FDF | 25 50 44 46 | Adobe Portable Document Format
3 | JPEG, JPG | FF D8 FF E0 00 10 4A 46 49 46 00 01 01; FF D8 FF E0; FF D8 FF E1; FF D8 FF E8; FF D8 FF | JPEG file
4 | PNG | 89 50 4E 47 0D 0A 1A 0A | PNG file
5 | GIF | 47 49 46 38 39 61 4E 01 53 00 C4 | GIF file
6 | MP3 | 49 44 33 | MP3 file
7 | MKV | 1A 45 DF A3 93 42 82 88 6D 61 74 72 6F 73 6B 61 | Matroska file
8 | MP4, M4V | 66 74 79 70 33 67 70 35; 66 74 79 70 4D 53 4E 56; 66 74 79 70 6D 70 34 32 | MPEG-4 video file
9 | EXE | 4D 5A | Executable file
10 | RAR | 52 61 72 21 1A 07 00; 52 61 72 21 1A 07 01 00 | Compressed archive
11 | ZIP, JAR | 57 69 6E 5A 69 70; 50 4B 03 04; 50 4B 05 06; 50 4B 07 08 | Zip archive
The information in Table 2 is converted into two types of tries, namely a filename extension trie and a signature trie. An example of these two tries is shown in Figure 4.
Figure 4. Tries for filename extension and signature
After all filename extension and signature information is converted into tries, file type identification is performed. The Aho-Corasick algorithm identifies the filename extension and the signature using the filename extension trie and the signature trie, respectively. The identification process generates two results, which are compared to determine the file condition. The comparison of the two results and the resulting file condition are shown in Table 3.
Table 3. The Comparison of Filename Extension and Signature Identification Results and Files Condition
No | File type (filename extension) | File type (signature) | Comparison of the two identification results | Condition
1 | Identified | Identified | Match | Good
2 | Identified | Identified | Not match | Damaged
3 | Unidentified | Identified | Not match | Damaged
4 | Identified | Unidentified | Not match | Damaged
5 | Unidentified | Unidentified | - | Unknown
From Table 3 there are three types of file conditions, namely “Good”, “Damaged”, and “Unknown”, for five comparison outcomes, which can be described as follows (a small code sketch of this comparison is given after the list).
a. If the file type is identified based on both the filename extension and the signature and both identifications give the same result, then the file is still in “Good” condition.
b. If the file type is identified based on both the filename extension and the signature but the two identifications give different results, then the file is “Damaged”.
c. If the file type fails to be identified based on the filename extension but is successfully identified by the signature, then the file is damaged. This condition can occur for files that have experienced forgery or have been overwritten by other data.
d. If the file type is identified based on the filename extension but fails to be identified by the signature, then the file is damaged. This condition can occur for files that have been overwritten by other data.
e. If the file type fails to be identified based on both the filename extension and the signature, then the file condition is “Unknown”.
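Assuming a small signature table in the spirit of Table 2, the comparison logic of Table 3 and of the conditions above reduces to a few lines. For brevity this sketch uses simple prefix matching instead of the Aho-Corasick tries of Figure 4, and the abbreviated signature list and function names are our own.

# Abbreviated signature table (hex prefixes) and extension map; see Table 2 for the full list.
SIGNATURES = {
    bytes.fromhex("25504446"): "PDF",            # %PDF
    bytes.fromhex("89504E470D0A1A0A"): "PNG",
    bytes.fromhex("504B0304"): "ZIP/Office",     # also DOCX, XLSX, PPTX containers
    bytes.fromhex("4D5A"): "EXE",
}
EXTENSIONS = {"pdf": "PDF", "png": "PNG", "docx": "ZIP/Office", "zip": "ZIP/Office", "exe": "EXE"}

def identify(filename: str, first_32_bytes: bytes):
    """Compare extension-based and signature-based identification (Table 3 logic)."""
    ext_type = EXTENSIONS.get(filename.rsplit(".", 1)[-1].lower())
    sig_type = next((t for sig, t in SIGNATURES.items()
                     if first_32_bytes.startswith(sig)), None)
    if ext_type is None and sig_type is None:
        return "Unknown"
    if ext_type == sig_type:
        return "Good"
    return "Damaged"

# Example: a .docx whose header no longer starts with the ZIP signature is reported as Damaged.
print(identify("Example_01.docx", bytes(32)))   # -> "Damaged"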
2.4. File Reconstruction
Metadata obtained from MFT records through the parsing process is used to perform the file reconstruction (recovery). The important information required for file reconstruction is the file name, the Logical Cluster Number (LCN), and the allocation size. The complete file information collected from the file system is listed in Table 4.
Table 4. Information Used in Undelete Process
No | Information | Details
1 | Filename | The filename corresponding to the entry on the MFT.
2 | File/Directory | Entry type, whether an entry for a file or directory. Obtained from flags on MFT records.
3 | Deleted | Whether the entry has been deleted or is still in use. Retrieved from the flag in the MFT record.
4 | Condition | Condition of the file damage, obtained from the identification and comparison of file types.
5 | Write Time | The time a file is written.
6 | Create Time | The time a file was created.
7 | Access Time | The time a file is accessed.
8 | Signature Filetype | File type based on the identification of the first 32 bytes of the file.
9 | Filename Extension Filetype | File type based on the identification of the filename extension.
10 | Allocation Size | The size of the file allocation in the storage media. This information will be used in file reconstruction.
11 | Logical Cluster Number | The numbers of the clusters occupied by the file. This information will be used to read the data stored in the clusters during the file reconstruction process.
Although file reconstruction itself only requires the filename, the LCN, and the file allocation size, other information such as the file condition (damaged or good), file type, and timestamps should also be collected to help in selecting the files to recover.
2.4.1. Undelete Process
The undelete process is the first step of the file reconstruction, as illustrated in Figure 5 (a minimal code sketch is given after the list).
a. Creating a new empty file with the same name as the file to be restored and the same size as the allocation size of the restored file.
b. Opening the LCN of the recovered file and reading the contents of the cluster into a buffer. The amount of data read from the cluster equals the allocated buffer size. The contents of the buffer are then written into the new file as hexadecimal values. After the data are written into the empty file, the buffer is re-used to accommodate the next data, and writing proceeds from the last offset of the previously written data. This process continues iteratively until the contents of all clusters occupied by the deleted file have been moved to the new file.
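A minimal sketch of the undelete steps above, assuming a contiguous cluster run, a 4 KB cluster size as in the experiment, and a file-like object opened over the disk image; fragmented files would require walking the data runs of the $DATA attribute instead. All names are illustrative.

CLUSTER_SIZE = 4096   # 4 KB clusters, as used in the experiment

def undelete(image, output_path, start_lcn, allocation_size):
    """Copy the clusters of a deleted file from the image into a new file.

    Assumes a contiguous cluster run starting at start_lcn.
    """
    remaining = allocation_size
    with open(output_path, "wb") as out:
        image.seek(start_lcn * CLUSTER_SIZE)
        while remaining > 0:
            buffer = image.read(min(CLUSTER_SIZE, remaining))   # one cluster into the buffer
            if not buffer:
                break
            out.write(buffer)          # append at the last written offset
            remaining -= len(buffer)

# Hypothetical usage over a raw image file:
# with open("evidence.img", "rb") as img:
#     undelete(img, "Example_01.docx", start_lcn=1056, allocation_size=4886528)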
Figure 5. File reconstruction process at the time of recovery (a new file of the allocated size is created, and the values inside each cluster are read into a buffer and then written into the new file)
2.4.2. Verification Process
Files recovered after deletion can be in several conditions, such as:
a. The file is restored properly. A file can be restored properly only if it has not been overwritten by another file. Thus, if the file is restored, its condition is the same as before it was deleted.
b. Files whose contents have been partially overwritten by other files. This condition can occur if, after the file was deleted, the storage media was filled with another file smaller than the deleted file that uses the location of the deleted file's clusters. If the deleted file was overwritten in the header section, then after the file has been restored it cannot be opened with its original application. However, some of the file contents are still readable.
c. Files that are entirely overwritten by other files. This condition occurs if the cluster locations originally occupied by the deleted file are filled by another file of the same size or larger. If such a file is restored, its contents will differ from the original file.
Based on the above conditions, the recovered files must go through a verification process. The verification process determines whether or not the file is damaged. Verification is performed by opening the file with the default application for that file type. If the file is damaged, the hexadecimal values of the file are read instead, and the file is analyzed to obtain information from its readable parts.
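A small sketch of the verification idea, under the simplifying assumption that verification is done programmatically by checking a known header signature and an optional CRC-32 recorded before deletion, rather than by opening the file with its default application as described above. The signature list and names are illustrative.

import zlib

def verify(recovered_path, expected_crc=None,
           known_signatures=(b"%PDF", b"\x50\x4B\x03\x04", b"\xFF\xD8\xFF")):
    """Verify a recovered file: check a known header signature and, if given, the CRC-32."""
    with open(recovered_path, "rb") as f:
        data = f.read()
    header_ok = any(data.startswith(sig) for sig in known_signatures)
    crc_ok = expected_crc is None or (zlib.crc32(data) & 0xFFFFFFFF) == expected_crc
    if header_ok and crc_ok:
        return "Good"
    return "Damaged"   # fall back to reading the hex dump and analysing readable parts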
2.4.3. Analysis Process
The analysis of a damaged file is conducted to determine whether the file is partially or completely corrupted. An entirely overwritten file results in a recovered file that is different from the actual file, while a partially overwritten file still retains some information from the original file. Once it is known that the file is partially overwritten, a procedure to read the remaining information of the original file can be applied. The steps are:
a. Determine the size of the data occupying the original file location. Two ways can be taken, i.e., by searching for the footer of the overriding data or by finding the hexadecimal value used by the operating system to fill the slack space (generally the hexadecimal value 0x00 or NULL), as shown in Figure 6.
Figure 6. Analysis of damaged files (hex dump of a recovered Portable Document Format (.PDF) file showing the footer signature of the overriding file followed by the 0x00 values used by the operating system to fill up the slack space)
b. After the end of the overriding data is found, the size of the overriding data can be calculated by subtracting the initial data offset from the final offset.
c. Because file allocation in the storage media is based on clusters, the number of clusters used by the overriding data can be calculated by dividing the size of the overriding data by the cluster size.
d. If n is the number of clusters used by the overriding data, then the readable part of the original file is in cluster n + 1, i.e., at offset m + 1, with m representing the last offset of cluster n. A small sketch of this calculation is given below.
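The offset arithmetic in steps b-d can be written down directly. The sketch below approximates the end of the overriding data by the first long run of 0x00 slack-filler bytes; the length of that run, and the names used, are assumptions of ours rather than values from the paper.

CLUSTER_SIZE = 4096

def readable_offset(recovered: bytes, filler_run=16):
    """Estimate where the original file's readable data resumes (steps b-d)."""
    # Step a/b: find the end of the overriding data, here approximated by the
    # first long run of 0x00 filler bytes written by the operating system.
    end_of_override = recovered.find(b"\x00" * filler_run)
    if end_of_override < 0:
        return None                       # no filler run found; file may be fully overwritten
    # Step c: number of clusters consumed by the overriding data.
    n = (end_of_override + CLUSTER_SIZE - 1) // CLUSTER_SIZE
    # Step d: readable data resumes in cluster n + 1, i.e. at offset m + 1,
    # where m is the last byte offset of cluster n.
    return n * CLUSTER_SIZE

# Example: if the overriding data ends at byte 0x167FF0, it occupies
# ceil(0x167FF0 / 4096) = 360 clusters, so readable data resumes at 360 * 4096 = 0x168000.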
3. Results and Discussion
This section describes the results obtained from the reconstruction process. There were 56 files used in this experiment with a total size of 3.54GB (3,797,409,792 bytes). The recovery process was conducted on various file types, including .docx, .pdf, .jpg, .png, and .exe files, among others. Of the 56 files, 55 files (98.21%) were successfully recovered, with a total size of 3.52GB (3,781,166,380 bytes), or 99.71% of the total size of the deleted files.
In the testing process, recovering the same file twice can result in different elapsed
times. This is due to hardware conditions such as processor speed, memory size, and storage
media access speed. Even when the recovery is performed on the same hardware, differences
in elapsed time can still occur because the processor load may differ at the time the recovery is
performed. If the hardware factors (processor speed, memory size and speed, and storage
media access speed) and the data processing speed are considered constant for all files, then
the elapsed time is proportional to the size of the file to be recovered.
A line chart showing the relation between the elapsed time needed to recover the files
and the number of bytes recovered is presented in Figure 7. The graph in Figure 7 shows that
the elapsed time needed to recover a file is proportional to the size of the file. As for the file
types, it was found that they had no significant impact on the elapsed time of the recovery
process.
Figure 7. Relation between elapsed time and the number of bytes recovered
Some factors that might affect the speed of file recovery are hardware specifications,
such as processor speed, the size and speed of memory, and the speed of storage media
access. Other factors that should be considered are the processor load and the file size.
Meanwhile, the factors affecting the success rate of the undelete process were identified as the
condition of the MFT, the size of the data overwritten on the storage media, and the size of the
file to be recovered. A screenshot of the undelete process in the system preview is shown in
Figure 8.
Figure 8. System preview of the undelete process
3.1. Result of Undelete Process
Analysis was performed based on the information obtained from damaged files in order
to determine their sizes and readable parts. A DOCX file was used in this analysis process.
Information obtained from damaged files is given in Table 5.
Table 5. Information of damaged files
Parameter                                        Value
File type based on filename extension            Microsoft Document (.DOCX)
File size                                        4.65 MB (4,884,115 bytes)
File size with slack space (in storage media)    4.66 MB (4,886,528 bytes)
Sample of the first 32 bytes of the file         F2 71 16 55 92 CE 0D C1 8F 13 45 0E 3F E6 D4 F9
                                                 01 0F E4 D3 2C F4 89 53 88 A0 58 44 C1 D5 F8 EB
                                                 (no known signature found)
File type based on signature                     Unknown
Examination result of the file damage            Damaged file
Sample of the last 32 bytes of the file          62 65 72 69 6E 67 2E 78 6D 6C 50 4B 05 06 00 00
                                                 00 00 54 00 54 00 D3 16 00 00 AA 6F 4A 00 00 00
                                                 (footer for a Microsoft document file found)
No known signature was found at the beginning of the file; however, the footer of a
DOCX file was found at its end. The beginning of the file had been overwritten by a fragment of
another file whose signature value was not present in the signature trie, so the overriding file
type could not be identified.
The analysis process then continued by searching for the value used by the operating
system to fill the slack space: the hexadecimal value 0x00 was found from offset 0x167E73 to
0x167FFF, and the footer of a PDF file was found from offset 0x167E6D to 0x167E72, as
illustrated in Figure 9. Thus, it was identified that the beginning of the file had been overwritten
by a fragment of another file. Once the condition of the file damage was identified, the offsets
of the overriding data and of the original file could be calculated.
The file had the filename extension DOCX and a size of 4,884,115 bytes, occupying
offsets 0x00 to 0x4A8692. The hexadecimal values contained in the file were then traced. The
32 bytes found at the beginning of the file (offsets 0x00 to 0x1F) are F2 71 16 55 92 CE 0D C1
8F 13 45 0E 3F E6 D4 F9 01 0F E4 D3 2C F4 89 53 88 A0 58 44 C1 D5 F8 EB. Among these
hexadecimal values, the signature of a DOCX document file (50 4B 03 04 14 00 06 00) was not
found, nor was the signature of any other known file type.
Furthermore, offsets 0x167E73 to 0x167FFF were filled with the value 00, which is the value
used by the operating system to fill the slack space in a sector that has already been populated
with data. In addition, the hexadecimal values 0A 25 25 45 4F 46, which form the signature of
the PDF footer, were found at offsets 0x167E6D to 0x167E72. Thus, the data from offsets 0x00
to 0x167E72 belong to a PDF file fragment.
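As a rough illustration of this part of the search (a sketch only, with hypothetical file and function names, not the parser used in this work), the footer signature of the overriding PDF fragment and the end of the 0x00 slack-space run can be located as follows:

```python
PDF_FOOTER = bytes([0x0A, 0x25, 0x25, 0x45, 0x4F, 0x46])  # "\n%%EOF"

def find_footer(data: bytes, footer: bytes = PDF_FOOTER) -> int:
    """Offset of the first occurrence of the footer signature, or -1."""
    return data.find(footer)

def slack_fill_end(data: bytes, start: int) -> int:
    """Last offset of the run of 0x00 bytes beginning at offset `start`."""
    end = start
    while end < len(data) and data[end] == 0x00:
        end += 1
    return end - 1

with open("damaged.docx", "rb") as f:      # hypothetical file name
    data = f.read()
footer_at = find_footer(data)              # 0x167E6D in the analysed file
if footer_at >= 0:
    override_end = footer_at + len(PDF_FOOTER) - 1       # 0x167E72
    slack_end = slack_fill_end(data, override_end + 1)   # 0x167FFF
    print(hex(override_end), hex(slack_end))
```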
According to the results of the searching process, the information obtained is as follows:
a. Offset range containing other file data: 0x00 to 0x167E72, equivalent to 1,474,163 bytes.
b. Offset range of slack space filled by the operating system (hexadecimal value 00):
0x167E73 to 0x167FFF.
c. Number of occupied clusters = ⌈Data Size (bytes) / Cluster Size (bytes)⌉ = ⌈1,474,163 / 4,096⌉ = ⌈359.9⌉ = 360.
d. The occupied size = number of clusters × cluster size = 360 × 4,096 = 1,474,560 bytes
(offsets 0x00 to 0x167FFF).
e. The data size of the uncorrupted original file: offsets 0x168000 to 0x4A8692, i.e.,
3,409,555 bytes.
Figure 9. The value used to fill the slack space and the footer of the PDF file
Therefore, the number of clusters required to accommodate 1,474,163 bytes of data is
359.9 clusters. Because addressing is performed per cluster, the required number of clusters is
rounded up to 360 (equivalent to 1,474,560 bytes), so the offsets used by the overriding data
are 0x00 to 0x167FFF. The readable data of the original file then starts at offset 0x168000 and
extends to 0x4A8692. The size of the data that overwrote the actual file data is 1,474,163 bytes
(1,474,560 bytes when the slack space is included), and the size of the readable original file
data is 3,409,555 bytes. The analysis result of the data contained in the file is shown in
Figure 10.
After the file went through the analysis process, the remaining information could be read
from the uncorrupted part of the original file. However, the amount of readable information
depends on the level of file damage and on the encoding method the file uses. Figure 11
shows that, even though the file had been corrupted, some information from the file was still
readable. The file in Figure 11 was damaged in its header section. Although the header had
been overwritten by data from another file, some information from the text file, encoded in
ASCII, was still readable.
The readable information of a damaged file is not limited to text. An image file can be
embedded in other files, such as Microsoft documents and Adobe Portable Document Format
(PDF) files. Figure 12 shows how a JPEG image can still be read from a damaged document
(overwritten by an HTML file at its beginning).
Figure 10. Analysis result of the data obtained in a file (hexadecimal dump of the damaged DOCX file: an unknown signature at the beginning, the overriding PDF fragment with its footer around offset 0x167E70, the slack space filled with 0x00 up to offset 0x167FFF, and the readable part of the DOCX file ending with its footer at the end of the file)
Figure 11. Readable information of a damaged file
Figure 12. Image file recovery of damaged document file
To recover an image file from a damaged document file, the bytes of the document lying
between the header and footer signatures of the image were copied, in the form of
hexadecimal values, into an empty file. The resulting file was thereby identified as an image file
and could be opened with an application for image files.
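A minimal carving sketch of this idea is shown below; it assumes the standard JPEG header and footer markers and hypothetical file names, and it simply copies the bytes between the first header and the following footer, which is a simplification of the procedure described above.

```python
from typing import Optional

JPEG_HEADER = bytes([0xFF, 0xD8, 0xFF])  # JPEG start-of-image marker
JPEG_FOOTER = bytes([0xFF, 0xD9])        # JPEG end-of-image marker

def carve_jpeg(data: bytes) -> Optional[bytes]:
    """Return the first byte range between a JPEG header and footer, or None."""
    start = data.find(JPEG_HEADER)
    if start < 0:
        return None
    end = data.find(JPEG_FOOTER, start + len(JPEG_HEADER))
    if end < 0:
        return None
    return data[start:end + len(JPEG_FOOTER)]

# Hypothetical usage: carve an embedded image out of a damaged document file.
with open("damaged_document.docx", "rb") as f:
    fragment = carve_jpeg(f.read())
if fragment is not None:
    with open("carved_image.jpg", "wb") as out:
        out.write(fragment)
```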
The readable information of a damaged document depends on the type and damage
level of the document; some information remains readable in spite of the damage. In the
example, one of the documents with readable contents was a PDF document, even though it
had suffered damage in which some of its data had been overwritten by another file, as shown
in Table 6.
Table 6. Information of damaged PDF document
Parameter                                        Value
File type based on filename extension            Adobe Portable Document Format (.PDF)
File type based on signature                     Unknown
Result of file damage test                       File damaged
File size                                        1.03 MB (1,081,946 bytes)
File size with slack space (in storage media)    1.03 MB (1,085,440 bytes)
Sample of the first 32 bytes                     B9 BC BD D4 6E 2C ED 27 9B 31 25 AE 97 6B 3C C9
                                                 00 BE F3 E6 6B 78 51 63 49 1E 6B 74 79 DE 40 20
                                                 (no signature found for the PDF header)
Footer signatures found                          a. FF D9 (JPEG footer) at offset 0xB85BD
                                                 b. 45 4F 46 0D (PDF footer) at the end of the file (offset 0x108256)
Initial readable data file size                  a. Based on the difference with the overriding data size: 317 KB
                                                 b. Actual readable size: 290 KB
Number of readable pages                         104
The damaged PDF document had been overwritten by a JPG image (the JPG footer,
followed by the 0x00 slack-space fill values, was found at offset 0xB85BD). The number of
clusters overwritten by the JPG image is ⌈184.35⌉ = 185 clusters. Therefore, the size of the
overwritten data is 757,760 bytes and the unaffected data size is 1,081,946 − 757,760 =
324,186 bytes, or about 317 KB. These 317 KB are not entirely readable: of the 317 KB
fragment, the readable data amounts to 290 KB, or 104 pages. Some pages of the PDF
document are still readable, as shown in Figure 13.
Figure 13. Some readable pages from the damaged PDF file
3.2. False Positive Analysis
A false positive occurs when a file is damaged or overwritten by another file but is still
identified as a good file. This is possible if the file was overwritten by a smaller file of the same
type. Because the overriding file type is the same as the original file type, the header found at
the beginning of the file carries the same signature as the original file. Therefore, if the
identification result based on the signature was compared with the result of
identification based on the filename extension, the file was identified as a good file. From the
analysis, false positives were found for Microsoft documents (DOCX), PDF documents, and
JPG image files.
Microsoft documents (DOCX) that were overwritten by other Microsoft documents could
not be opened after they were restored, even though the identification results indicated that the
files were in good condition. When such a document was checked, more than one Microsoft
document footer signature was found, which indicates that the document had been overwritten
by another Microsoft document. Information of one such Microsoft document is shown in
Table 7.
Table 7. Information of Microsoft documents (false positive)
Parameter                                              Value
File type based on filename extension                  Microsoft Document (.DOCX)
File type based on signature                           Microsoft Document (.DOCX)
Result of file damage test                             Good file
File size                                              4.65 MB (4,884,115 bytes)
File size with slack space (in storage media)          4.66 MB (4,886,528 bytes)
Number of footer signatures found                      2, at offsets:
                                                       0x010068 (50 4B 05 06 00 00 00 00 0E 00 0E 00 8D 03 00 00 DB FC 00 00 00 00)
                                                       0x4A867D (50 4B 05 06 00 00 00 00 54 00 54 00 D3 16 00 00 AA 6F 4A 00 00 00)
The Microsoft document footer signature found at offset 0x4A867D (the end of the file)
legitimately marks the end of the file. However, the footer found in the middle of the file, at
offset 0x010068, also marks an end of file, namely the end of the smaller overriding document.
Two footers found at two different offsets indicate that the actual file has been altered.
Figure 14 illustrates the Microsoft document footer signature found at the two offsets; a simple
detection sketch follows Figure 14.
Figure 14. Signature of the DOCX footer found at two different offsets
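The following sketch (illustrative only; the file name is hypothetical and the "more than one footer" rule is a simplification of the comparison described above) flags a possible false positive by counting DOCX footer signatures in a recovered file:

```python
DOCX_FOOTER = bytes([0x50, 0x4B, 0x05, 0x06])  # ZIP/DOCX end-of-central-directory signature

def footer_offsets(data: bytes, footer: bytes = DOCX_FOOTER) -> list:
    """Return every offset at which the footer signature occurs."""
    offsets, pos = [], data.find(footer)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(footer, pos + 1)
    return offsets

with open("recovered.docx", "rb") as f:        # hypothetical file name
    hits = footer_offsets(f.read())
if len(hits) > 1:
    # e.g. offsets 0x010068 and 0x4A867D for the document in Table 7
    print("possible false positive, footers at:", [hex(h) for h in hits])
```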
The 4.66 MB document was overwritten by another document of a smaller size, 65,641
bytes (64.1 KB). As a result, about 68 KB of the original document (65,641 bytes rounded up to
17 clusters of 4,096 bytes) was overwritten, so the actual data of the document was damaged.
Because the overriding file type is the same as the initial file type, the two files share the same
header signature. Thus, the file was a corrupted file although it was identified as a good file
(a false positive).
Unlike Microsoft documents, which cannot be opened if some of their data has been
overwritten, JPG images and PDF files can often still be opened even after being overwritten.
This makes such a file look as if it is still good, because it can be opened properly with its
application. Some of these false positives can be detected by checking the file size: the file
size is too large compared with the file content. This can happen when the size difference
between the overriding file and the initial file is large. For example, a JPG file is a compressed
image
file; when viewed from the width, height, resolution, and bit depth of the image, the file size is
too large for a JPG image. Two JPG image footer signatures were found in the image file, at
the different offsets 0xD391 and 0xBA3EA. Information of the JPG image file that produced a
false positive can be seen in Table 8.
Table 8. Information of JPG image (false positive)
Parameter                                        Value
File type based on filename extension            JPG file
File type based on signature                     JPG file
Result of file damage test                       Good file
Image width                                      800 pixels
Image height                                     800 pixels
Horizontal resolution                            96 dpi
Vertical resolution                              96 dpi
Bit depth                                        24
File size                                        744 KB (762,860 bytes)
File size with slack space (in storage media)    744 KB (765,952 bytes)
Number of footer signatures found                2, at offsets:
                                                 a. 0xD391 (FF D9)
                                                 b. 0xBA3EA (FF D9)
Figure 15. The search of the JPG file content (hexadecimal values)
In Figure 15, the search of the file contents shows that the JPG image that could be opened
successfully, with dimensions of 800 × 800 pixels, is the image that overwrote the previous
image. This image has a size of 52.8 KB (56 KB when the slack space is included), consistent
with the footer signature found at offset 0xD391. Thus, a JPG image file of 744 KB had
previously existed and was overwritten by a smaller JPG image. The override damaged the
first 56 KB of the initial JPG file, so the remaining fragment of the initial JPG file, with a size of
688 KB, starts from offset 0xE011.
The search of the hexadecimal values for another false positive, a PDF document, is
shown in Figure 16. The PDF file that produced the false positive contains 4 pages consisting
mostly of text; however, its size of 1.39 MB (1,459,518 bytes) is too large for such content.
According to the search, the 4-page contents occupy only the first 56 KB (56,849 bytes) of the
file, up to offset 0x00DE15. Two PDF footers were also found within the file: one at offset
0x00DE0F with the value “0A 25 25 45 4F 46 0A” and one at the end of the file, at offset
0x164537, with the value “0D 25 25 45 4F 46 0D”.
Figure 16. The search of PDF file content (hexadecimal values)
3.3. Testing Results
We conducted four tests with different amounts of data written to the storage media. The
first test shows that, with 0 bytes of additional data written to the storage media and 56 file
entries in the MFT, 55 files could be recovered (98.21%) and 1 file could not be recovered
(1.79%) even though its entry was still present in the MFT. In the second test, the data
overwritten by other data on the storage media amounted to 724,167,330 bytes, or 18.98% of
the total size of the storage media; 55 file entries were readable in the MFT, 47 files were
successfully recovered (83.92%), 8 files were recovered with some damage, 1 file could not be
recovered, and the total damaged file size was 933,351,059 bytes. In the third test, 55 file
entries were successfully read from the MFT, 48 files were successfully recovered, 7 files were
recovered with damage, 1 file failed to be recovered, and the total damaged file size was
1,631,210,540 bytes. Lastly, the fourth test shows 55 file entries in the MFT, of which 46 files
were recovered, 9 were recovered with damage, 1 failed to be recovered, and the total
damaged file size was 2,337,387,248 bytes. The results of the first to the fourth tests show that
a larger amount of overwritten data does not necessarily mean that more files will be damaged.
Rather, the larger a recoverable file is, the more likely it is to be damaged when overwriting
occurs.
The success rate of the undelete process depends on several factors, such as:
a. Condition of the MFT. The undelete process requires metadata obtained from the MFT
attributes, so if the MFT is damaged (for example, when the storage media has been
formatted), the undelete process cannot be performed.
b. The size of the data overwritten on the storage media. The smaller the amount of
overwritten data, the more successful the file recovery, and vice versa.
c. The size of the file to be recovered. The larger the file, the more likely it is that the file is
overwritten by other data after deletion, and the smaller the successful recovery rate
becomes.
Based on the results of the first to the fourth tests, the average test results are given in
Table 9. On average, 98.66% of the MFT entries could still be read after file deletion, 87.50% of
the deleted files could be recovered successfully, 10.71% of the files were recovered with some
damage, and 1.79% failed to be recovered. A short calculation sketch reproducing these
averages is given after Table 9.
Table 9. Average Test Values
Parameter                                    Average
Number of MFT entries read                   98.66%
Number of files successfully recovered       87.50%
Number of files recovered with damage        10.71%
Number of files that failed to be recovered  1.79%
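The averages in Table 9 can be reproduced from the per-test counts reported above; the short sketch below does this, taking 56 as the total number of files and inferring a damage count of zero for the first test, which is not stated explicitly in the text.

```python
TOTAL_FILES = 56
tests = [
    # (MFT entries read, recovered, recovered with damage, failed)
    (56, 55, 0, 1),  # test 1: no data overwritten (damage count inferred as 0)
    (55, 47, 8, 1),  # test 2
    (55, 48, 7, 1),  # test 3
    (55, 46, 9, 1),  # test 4
]
for label, idx in [("MFT entries read", 0), ("successfully recovered", 1),
                   ("recovered with damage", 2), ("failed to be recovered", 3)]:
    average = sum(t[idx] for t in tests) / len(tests) / TOTAL_FILES
    print(f"{label}: {average:.2%}")  # 98.66%, 87.50%, 10.71%, 1.79%
```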
4. Conclusion and Future Work
In this paper, we have implemented a file undelete method and the Aho-Corasick
algorithm to reconstruct files in order to recover deleted files from a file system. The
implementation of the proposed file undelete algorithm was able to recover 55 files with a total
size of 3.52 GB in 229.418 seconds, which corresponds to an average data processing speed
of 15.77 MB/s. However, the proposed file reconstruction method depends entirely on the
condition of the Master File Table (MFT), meaning that a damaged MFT affects the recovery
result. In addition, the size of the files to be recovered and the portion of each file that has been
overwritten also affect the success rate of the recovery process.
The string-matching method using the Aho-Corasick algorithm, implemented for file type
identification and damage checking, was capable of finding file signatures and determining
damage by comparing the identification result based on the filename extension with the result
based on the signature. It should be noted that the identification of a damaged file can
generate a false positive if the file was overwritten by a file of a similar type; the signature
found in the recovered file is then the same as in the original file, so the file is identified as a
good file even though some of its contents have been overwritten. In this research, the
Aho-Corasick string-matching method was applied only to search for signatures in the first
32 bytes of a file in order to identify the file type. A minimal sketch of such signature matching
is given below.
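To make this concrete, the sketch below builds a small Aho-Corasick automaton over a few assumed header signatures and scans the first 32 bytes of a file; it is a generic textbook construction with hypothetical file and signature choices, not the signature trie actually used in this research.

```python
from collections import deque

SIGNATURES = {  # assumed subset of header signatures, for illustration only
    bytes([0x50, 0x4B, 0x03, 0x04]): "DOCX/ZIP container",
    bytes([0x25, 0x50, 0x44, 0x46]): "PDF",
    bytes([0xFF, 0xD8, 0xFF]): "JPEG",
}

def build_automaton(patterns):
    """Build the goto, failure, and output tables of the Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    for pattern, label in patterns.items():
        state = 0
        for b in pattern:
            if b not in goto[state]:
                goto[state][b] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            state = goto[state][b]
        out[state].append(label)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for b, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and b not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(b, 0)
            out[s].extend(out[fail[s]])
    return goto, fail, out

def scan(data, goto, fail, out):
    """Return (end_offset, label) pairs for every signature found in data."""
    state, hits = 0, []
    for i, b in enumerate(data):
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for label in out[state]:
            hits.append((i, label))
    return hits

goto, fail, out = build_automaton(SIGNATURES)
with open("recovered_file.bin", "rb") as f:  # hypothetical file name
    header = f.read(32)                      # only the first 32 bytes are checked
print(scan(header, goto, fail, out))
```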
For future research, we aim to identify file types from file fragments so that the content of
the fragments in a damaged file can be read using the appropriate file signature. We could also
reconstruct the header of a damaged file so that the file can be read and reopened by suitable
applications. Furthermore, we could perform information extraction from a damaged file using a
more efficient method so that all information in the damaged file can be recovered.
Acknowledgement
The authors would like to thank Lembaga Penelitian Universitas Sumatera Utara for
supporting this research work.