SlideShare a Scribd company logo
1 of 32
Download to read offline
Increasing Security Awareness in Enterprise
Using Automated Feature Extraction with
Long n-gram Analysis for
Filetype Identification
Master Thesis
Burman Noviansyah
MSISPM ’14
December 10th, 2013
1
Context Information Security: Project MASON – Incident Response.
Source:
http://www.thetimes.co.uk/tto/multimedia/archive/00372/DOC10011
3-100120132_372895a.pdf
2
Problem in Enterprise
• Prevent IPR or confidential documents left the
network
• Policy revisited: Data Loss Prevention.
Commercial Product of the shelves only use
simple recognitions methods to detect file type,
that is file extension.
• Anti-forensics case: altering file header and
footer, file container to conceal true file type
• File Type Identification (FTI) is a challenging
task
3
File Type Identification
(FTI)
Previous Works on FTI:
• Libmagic (Darwin, 1973)
• File Extension (Hiller, 1996)
• Byte Frequency Analysis, Byte Frequency
Cross-Correlation, File Header/Trailer
Algorithm (McDaniel, 2001)
• Fileprints with 1-gram Analysis (Li, Wang
and Stolfo)
• Long summarized n-gram (Mayer)
4
N-Gram as Features in FTI
• Introduced in (Damashek, 1995) as a subsequence
of n consecutive tokens in stream of tokens.
• N-gram in FTI: 1 bytes x n
• Feature: Specific and common n-gram among same
file type, and unique between all other file type.
This feature can be choose to a identify file type
• Location of feature is anywhere inside the file, not
necessarily only in header or footer
• Can occurred once in file, or even higher frequency
of occurrences in one single file.
5
Proposed Solution
Characteristics
• Doesn’t interfere the extraction process to create predictability
flag (requires prior knowledge towards file structure)
• Doesn’t summarized (to see exact content without modifying
any byte)
• Doesn’t eliminate short n-gram that as part as longer n-gram
(the position of short n-gram is unknown, whether a subset of
longer n-gram, or indeed stand alone short n-gram)
• Training set was chosen specifically so that only pristine file
types were processed
• Prevent the similar and identic file processed more than once.
• Same data set for prediction rate calculation
6
Proposed Solution
Combination of:
• File Extension
• Libmagic ‘file’ linux command
• ‘exiftool’ linux command
• N-gram analysis
Outcome: consistency check between all
of results from those four techniques
7
Proposed Solution
8
Algorithm: Learning
• Determine the maximum size of n-gram will be extracted from file. In
this thesis, the author chooses the maximum size is 20-gram.
• For each n-gram size, repeat these steps:
– Create sliding window with the size of n, started with value of n = 1.
– Copy the gram inside this current window to memory object as extracted n-grams. If
this gram has not been stored in memory object, then simply put it on memory
object. Otherwise, proceed to next step 2c. This step guarantees that memory
object only contains unique extracted n-grams that exists on this current file and not
repeated between other n-gram.
– Slide the window 1 byte away, then repeat the step 2b above until sliding window
reach the end of the file.
– Write all unique extracted n-grams in memory object to database, along with its file. Until
this step, the database contains all unique extracted n-grams data from size n = 1 up until
n <= 20.
– Increase the size of n with 1, and then repeat from step 2a until the maximum
size of n is reached.
9
Algorithm: Learning
e9e442930c0c13bd9b6f5c074c762d28
Check whether to avoid re-learn the identic file using
md5 hash value
10
Algorithm: Learning
Sliding window of n-size to extract the content
89504E47 for PNG
504E470D for PNG
4E470D0A for PNG
11
Algorithm: Learning
• Repeat this process for all size of sliding
windows that represent n-gram
• Repeat this process for all files in
learning data set
12
Algorithm: Learning
• Find all same n-gram size that belongs
to same file type à common n-gram
• Find all common n-gram that only
belongs to one file type à features
89504E47 for PNG
504E470D for PNG
4E470D0A for PNG
89504E47 for PNG
554E470D for PNG
4E470D0A for PNG
FFD8FFE0 for JPG
D8FFE090 for JPG
4E470D0A for JPG
D8FFE000 for JPG
FFD8FFE0 for JPG
4E470D0A for JPG
FFD8FFE0 for JPG
89504E47 for PNG
13
Algorithm: Prediction
• Determine the maximum size of n-gram will be compared with file. In this thesis, the
author chooses the maximum size is 20-gram similar with Extraction process.
• For each n-gram size, repeat these steps:
– Get all final features from the database based on n-gram size, regardless what file type associated
with it, along with its total final features counter for each file type.
– For each file type in final features with current size of n:
Create a final features statistics container to hold statistics of all matched final features per file type. This container will
be used to count how many final features in occurred in array of bytes. This final features statistics container contains
[file extension, size of n, total final features counter, matched final features counter].
Repeat process 2.b.i until all of file types have each different final features statistics container.
– Create sliding window with size is current n.
– Check whether this current gram in final features. If current gram exists in final features then
updates its statistics in final features statistics container. Otherwise, just skip it.
– Slide the window 1 byte away from the original.
– Repeat from 2.d until the window reaches the end of file.
• Calculate the percentage match based on value in final features statistics container.
• Determine the acceptance threshold toward percentage match so that this matching
feature can be used to determine whether this is accepted as specific file type. For this
thesis, author set the threshold of acceptance is 95%. Hence, if a set of features in size
of n for file type X matches more than 95% of total features, than this file is classified
as file type X.
14
Algorithm: Prediction
Sliding window of n-size to extract the content
89504E47 for PNG
FFD8FFE0 for JPG
If it is match with any feature,
count for matching rate
15
Data Set
Learning Data Set:
• Standard Corpus from (Garfinkel,
Farrell and Roussev)
• Limit to 11 data types with 20 files each
Prediction Data Set:
• Axelsson’s NEW-EXP-0
• Limit to 11 data types
16
Result: ‘file’, ‘exiftool’
17
Result: n-gram
18
Result: Aggregate Prediction
19
Analysis
• A PDF file classified
as PNG?
• This PDF contains
several image
element
• File Container and
file inside container
detected.
20
Analysis
• A DOCX file
classified as PNG,
JPG, and GIF?
• This DOCX
contains several
image element, too.
• File insides this
container is not
encoded
21
Analysis
Confusion Matrix for Testing Result
Num of Samples: Given fact that shows number of file samples
being tested for a specific file type
Classification: Testing result shows number of files that have more
than 95% matching rates for each features extracted for file type22
Analysis
23
Analysis
• ‘Accuracy’: 80% vs 49%
• Average Accuracy: 92.63% vs 96.33%
• Average Precision: 76.07% vs 51%
• Average Recall: 81.09% vs 48%24
What if it is tested on …
• Altered File Extension?
• Altered Libmagic in File Header?
• Altered File Header?
• Random bytes of data?
25
Altered File Extension
26
Altered File Header
27
Altered File Header
28
Conclusion
• Different way to choose learning data
set makes automated extraction of n-
gram in files faster. The prediction
result yields with high accuracy,
precision, and recall rate
• Can detect file container and the file
type inside the container
29
Further Development
• To reduce learning time, use multi-
threaded algorithm and multi-processor
• Possibility to develop using GPU such
nvidia CUDA to have faster calculation
process using jCuda library
• Statistical approach to classification:
Support Vector Machine, k-Nearest
Neighbor for higher rate of accuracy,
precision, and recall
30
Bibliography
• Darwin, F. Ian. Fine Free File Command. 9 9 2013
<http://www.darwinsys.com/file/>.
• McDaniel, Mason B. An Algorithm for Content-Based Automated File Type
Recognition. Master Thesis. James Madison University. Harrisonburg, 2001.
• Damashek, Marc. "Gauging Similarity with n-Grams: Language-Independent
Categorization of Text." Science 267.5199 (1995): 843-848.
• Gopal, Siddharth, et al. "Statistical Learning for File-Type Identification."
International Conference on Machine Learning and Applications and Workshops
(ICMLA) 10th. Honolulu: Software Engineering Institute, 2011.
• Hiller, Scot. The System Registry. 1996. Microsoft. 5 October 2013
<http://msdn.microsoft.com/en-us/library/ms970651.aspx>.
• Mayer, Ryan C. Filetype Identification Using Long, Summarized n-Grams. Monterey:
Master Thesis at Naval Postgraduate School, 2011.
• Axelsson, Stefan. "The Normalised Compression Distance as a file fragment
classifier." The International Journal of Digital Forensics & Incident Response
7.Supplement (2010): S24-S31.
31
Thank You
32

More Related Content

What's hot

Access strategies ppt_ind
Access strategies ppt_indAccess strategies ppt_ind
Access strategies ppt_indItamarCohen16
 
Presented by Anu Mattatholi
Presented by Anu MattatholiPresented by Anu Mattatholi
Presented by Anu MattatholiAnu Mattatholi
 
Managing data in computational edge clouds
Managing data in computational edge cloudsManaging data in computational edge clouds
Managing data in computational edge cloudsNitinder Mohan
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTDavide Gallitelli
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseFlorian Lautenschlager
 
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...Ansgar Scherp
 

What's hot (8)

Access strategies ppt_ind
Access strategies ppt_indAccess strategies ppt_ind
Access strategies ppt_ind
 
Presented by Anu Mattatholi
Presented by Anu MattatholiPresented by Anu Mattatholi
Presented by Anu Mattatholi
 
User biglm
User biglmUser biglm
User biglm
 
Managing data in computational edge clouds
Managing data in computational edge cloudsManaging data in computational edge clouds
Managing data in computational edge clouds
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDT
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
Formalization and Preliminary Evaluation of a Pipeline for Text Extraction Fr...
 
1083 wang
1083 wang1083 wang
1083 wang
 

Similar to Increasing Security Awareness in Enterprise Using Automated Feature Extraction with Long n-gram Analysis for Filetype Identification

Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsStavros Kontopoulos
 
Google File System
Google File SystemGoogle File System
Google File SystemDreamJobs1
 
Yihan Lian & Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]
Yihan Lian &  Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]Yihan Lian &  Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]
Yihan Lian & Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]RootedCON
 
talks-afanasyev2013ndnsim-tutorial.pptx
talks-afanasyev2013ndnsim-tutorial.pptxtalks-afanasyev2013ndnsim-tutorial.pptx
talks-afanasyev2013ndnsim-tutorial.pptxhazwan30
 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositoriesfeiwin
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for BeginnersSanghamitra Deb
 
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...Evaluation of a New Incremental Classification Tree Algorithm for Mining High...
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...mlaij
 
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...mlaij
 
Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...
Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...
Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...Camille Maumet
 
Data management for TA's
Data management for TA'sData management for TA's
Data management for TA'saaroncollie
 
Digitization Basics for Archives and Special Collections – Part 2: Store and ...
Digitization Basics for Archives and Special Collections – Part 2: Store and ...Digitization Basics for Archives and Special Collections – Part 2: Store and ...
Digitization Basics for Archives and Special Collections – Part 2: Store and ...WiLS
 
Applying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsApplying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsPaloma De Las Cuevas
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsVíctor Zabalza
 
Digital Document Preservation Simulation - Boston Python User's Group
Digital Document  Preservation Simulation - Boston Python User's GroupDigital Document  Preservation Simulation - Boston Python User's Group
Digital Document Preservation Simulation - Boston Python User's GroupMicah Altman
 
What Does Your Next NetApp Refresh Look Like?
What Does Your Next NetApp Refresh Look Like?What Does Your Next NetApp Refresh Look Like?
What Does Your Next NetApp Refresh Look Like?Storage Switzerland
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...HostedbyConfluent
 
Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator ProgramGoDataDriven
 

Similar to Increasing Security Awareness in Enterprise Using Automated Feature Extraction with Long n-gram Analysis for Filetype Identification (20)

Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming Applications
 
XgBoost.pptx
XgBoost.pptxXgBoost.pptx
XgBoost.pptx
 
Google File System
Google File SystemGoogle File System
Google File System
 
Yihan Lian & Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]
Yihan Lian &  Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]Yihan Lian &  Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]
Yihan Lian & Zhibin Hu - Smarter Peach: Add Eyes to Peach Fuzzer [rooted2017]
 
talks-afanasyev2013ndnsim-tutorial.pptx
talks-afanasyev2013ndnsim-tutorial.pptxtalks-afanasyev2013ndnsim-tutorial.pptx
talks-afanasyev2013ndnsim-tutorial.pptx
 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositories
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for Beginners
 
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...Evaluation of a New Incremental Classification Tree Algorithm for Mining High...
Evaluation of a New Incremental Classification Tree Algorithm for Mining High...
 
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
 
Deduplication in Open Spurce Cloud
Deduplication in Open Spurce CloudDeduplication in Open Spurce Cloud
Deduplication in Open Spurce Cloud
 
Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...
Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...
Supporting image-based meta-analysis with NIDM: Standardized reporting of neu...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Data management for TA's
Data management for TA'sData management for TA's
Data management for TA's
 
Digitization Basics for Archives and Special Collections – Part 2: Store and ...
Digitization Basics for Archives and Special Collections – Part 2: Store and ...Digitization Basics for Archives and Special Collections – Part 2: Store and ...
Digitization Basics for Archives and Special Collections – Part 2: Store and ...
 
Applying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsApplying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systems
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Digital Document Preservation Simulation - Boston Python User's Group
Digital Document  Preservation Simulation - Boston Python User's GroupDigital Document  Preservation Simulation - Boston Python User's Group
Digital Document Preservation Simulation - Boston Python User's Group
 
What Does Your Next NetApp Refresh Look Like?
What Does Your Next NetApp Refresh Look Like?What Does Your Next NetApp Refresh Look Like?
What Does Your Next NetApp Refresh Look Like?
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
 
Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator Program
 

Recently uploaded

Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfPharmatech-rx
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil RecordSangram Sahoo
 
Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfbyp19971001
 
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Ansari Aashif Raza Mohd Imtiyaz
 
Electricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 studentsElectricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 studentslevieagacer
 
Efficient spin-up of Earth System Models usingsequence acceleration
Efficient spin-up of Earth System Models usingsequence accelerationEfficient spin-up of Earth System Models usingsequence acceleration
Efficient spin-up of Earth System Models usingsequence accelerationSérgio Sacani
 
Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxGlendelCaroz
 
Fun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdfFun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdfhoangquan21999
 
Technical english Technical english.pptx
Technical english Technical english.pptxTechnical english Technical english.pptx
Technical english Technical english.pptxyoussefboujtat3
 
Polyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptxPolyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptxMuhammadRazzaq31
 
Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptxMuhammadRazzaq31
 
A Scientific PowerPoint on Albert Einstein
A Scientific PowerPoint on Albert EinsteinA Scientific PowerPoint on Albert Einstein
A Scientific PowerPoint on Albert Einsteinxgamestudios8
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxPat (JS) Heslop-Harrison
 
MSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfMSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfSuchita Rawat
 
GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisAreesha Ahmad
 
Mining Activity and Investment Opportunity in Myanmar.pptx
Mining Activity and Investment Opportunity in Myanmar.pptxMining Activity and Investment Opportunity in Myanmar.pptx
Mining Activity and Investment Opportunity in Myanmar.pptxKyawThanTint
 
Factor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary GlandFactor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary GlandRcvets
 
Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloChristian Robert
 
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET
 

Recently uploaded (20)

Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdf
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil Record
 
Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdf
 
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...
 
Electricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 studentsElectricity and Circuits for Grade 9 students
Electricity and Circuits for Grade 9 students
 
Efficient spin-up of Earth System Models usingsequence acceleration
Efficient spin-up of Earth System Models usingsequence accelerationEfficient spin-up of Earth System Models usingsequence acceleration
Efficient spin-up of Earth System Models usingsequence acceleration
 
Warming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptxWarming the earth and the atmosphere.pptx
Warming the earth and the atmosphere.pptx
 
Fun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdfFun for mover student's book- English book for teaching.pdf
Fun for mover student's book- English book for teaching.pdf
 
Technical english Technical english.pptx
Technical english Technical english.pptxTechnical english Technical english.pptx
Technical english Technical english.pptx
 
Polyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptxPolyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptx
 
Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptx
 
A Scientific PowerPoint on Albert Einstein
A Scientific PowerPoint on Albert EinsteinA Scientific PowerPoint on Albert Einstein
A Scientific PowerPoint on Albert Einstein
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
 
MSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfMSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdf
 
GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of Asepsis
 
HIV AND INFULENZA VIRUS PPT HIV PPT INFULENZA VIRUS PPT
HIV AND INFULENZA VIRUS PPT HIV PPT  INFULENZA VIRUS PPTHIV AND INFULENZA VIRUS PPT HIV PPT  INFULENZA VIRUS PPT
HIV AND INFULENZA VIRUS PPT HIV PPT INFULENZA VIRUS PPT
 
Mining Activity and Investment Opportunity in Myanmar.pptx
Mining Activity and Investment Opportunity in Myanmar.pptxMining Activity and Investment Opportunity in Myanmar.pptx
Mining Activity and Investment Opportunity in Myanmar.pptx
 
Factor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary GlandFactor Causing low production and physiology of mamary Gland
Factor Causing low production and physiology of mamary Gland
 
Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
 
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
 

Increasing Security Awareness in Enterprise Using Automated Feature Extraction with Long n-gram Analysis for Filetype Identification

  • 1. Increasing Security Awareness in Enterprise Using Automated Feature Extraction with Long n-gram Analysis for Filetype Identification Master Thesis Burman Noviansyah MSISPM ’14 December 10th, 2013 1
  • 2. Context Information Security: Project MASON – Incident Response. Source: http://www.thetimes.co.uk/tto/multimedia/archive/00372/DOC10011 3-100120132_372895a.pdf 2
  • 3. Problem in Enterprise • Prevent IPR or confidential documents left the network • Policy revisited: Data Loss Prevention. Commercial Product of the shelves only use simple recognitions methods to detect file type, that is file extension. • Anti-forensics case: altering file header and footer, file container to conceal true file type • File Type Identification (FTI) is a challenging task 3
  • 4. File Type Identification (FTI) Previous Works on FTI: • Libmagic (Darwin, 1973) • File Extension (Hiller, 1996) • Byte Frequency Analysis, Byte Frequency Cross-Correlation, File Header/Trailer Algorithm (McDaniel, 2001) • Fileprints with 1-gram Analysis (Li, Wang and Stolfo) • Long summarized n-gram (Mayer) 4
  • 5. N-Gram as Features in FTI • Introduced in (Damashek, 1995) as a subsequence of n consecutive tokens in stream of tokens. • N-gram in FTI: 1 bytes x n • Feature: Specific and common n-gram among same file type, and unique between all other file type. This feature can be choose to a identify file type • Location of feature is anywhere inside the file, not necessarily only in header or footer • Can occurred once in file, or even higher frequency of occurrences in one single file. 5
  • 6. Proposed Solution Characteristics • Doesn’t interfere the extraction process to create predictability flag (requires prior knowledge towards file structure) • Doesn’t summarized (to see exact content without modifying any byte) • Doesn’t eliminate short n-gram that as part as longer n-gram (the position of short n-gram is unknown, whether a subset of longer n-gram, or indeed stand alone short n-gram) • Training set was chosen specifically so that only pristine file types were processed • Prevent the similar and identic file processed more than once. • Same data set for prediction rate calculation 6
  • 7. Proposed Solution Combination of: • File Extension • Libmagic ‘file’ linux command • ‘exiftool’ linux command • N-gram analysis Outcome: consistency check between all of results from those four techniques 7
  • 9. Algorithm: Learning • Determine the maximum size of n-gram will be extracted from file. In this thesis, the author chooses the maximum size is 20-gram. • For each n-gram size, repeat these steps: – Create sliding window with the size of n, started with value of n = 1. – Copy the gram inside this current window to memory object as extracted n-grams. If this gram has not been stored in memory object, then simply put it on memory object. Otherwise, proceed to next step 2c. This step guarantees that memory object only contains unique extracted n-grams that exists on this current file and not repeated between other n-gram. – Slide the window 1 byte away, then repeat the step 2b above until sliding window reach the end of the file. – Write all unique extracted n-grams in memory object to database, along with its file. Until this step, the database contains all unique extracted n-grams data from size n = 1 up until n <= 20. – Increase the size of n with 1, and then repeat from step 2a until the maximum size of n is reached. 9
  • 10. Algorithm: Learning e9e442930c0c13bd9b6f5c074c762d28 Check whether to avoid re-learn the identic file using md5 hash value 10
  • 11. Algorithm: Learning Sliding window of n-size to extract the content 89504E47 for PNG 504E470D for PNG 4E470D0A for PNG 11
  • 12. Algorithm: Learning • Repeat this process for all size of sliding windows that represent n-gram • Repeat this process for all files in learning data set 12
  • 13. Algorithm: Learning • Find all same n-gram size that belongs to same file type à common n-gram • Find all common n-gram that only belongs to one file type à features 89504E47 for PNG 504E470D for PNG 4E470D0A for PNG 89504E47 for PNG 554E470D for PNG 4E470D0A for PNG FFD8FFE0 for JPG D8FFE090 for JPG 4E470D0A for JPG D8FFE000 for JPG FFD8FFE0 for JPG 4E470D0A for JPG FFD8FFE0 for JPG 89504E47 for PNG 13
  • 14. Algorithm: Prediction • Determine the maximum size of n-gram will be compared with file. In this thesis, the author chooses the maximum size is 20-gram similar with Extraction process. • For each n-gram size, repeat these steps: – Get all final features from the database based on n-gram size, regardless what file type associated with it, along with its total final features counter for each file type. – For each file type in final features with current size of n: Create a final features statistics container to hold statistics of all matched final features per file type. This container will be used to count how many final features in occurred in array of bytes. This final features statistics container contains [file extension, size of n, total final features counter, matched final features counter]. Repeat process 2.b.i until all of file types have each different final features statistics container. – Create sliding window with size is current n. – Check whether this current gram in final features. If current gram exists in final features then updates its statistics in final features statistics container. Otherwise, just skip it. – Slide the window 1 byte away from the original. – Repeat from 2.d until the window reaches the end of file. • Calculate the percentage match based on value in final features statistics container. • Determine the acceptance threshold toward percentage match so that this matching feature can be used to determine whether this is accepted as specific file type. For this thesis, author set the threshold of acceptance is 95%. Hence, if a set of features in size of n for file type X matches more than 95% of total features, than this file is classified as file type X. 14
  • 15. Algorithm: Prediction Sliding window of n-size to extract the content 89504E47 for PNG FFD8FFE0 for JPG If it is match with any feature, count for matching rate 15
  • 16. Data Set Learning Data Set: • Standard Corpus from (Garfinkel, Farrell and Roussev) • Limit to 11 data types with 20 files each Prediction Data Set: • Axelsson’s NEW-EXP-0 • Limit to 11 data types 16
  • 20. Analysis • A PDF file classified as PNG? • This PDF contains several image element • File Container and file inside container detected. 20
  • 21. Analysis • A DOCX file classified as PNG, JPG, and GIF? • This DOCX contains several image element, too. • File insides this container is not encoded 21
  • 22. Analysis Confusion Matrix for Testing Result Num of Samples: Given fact that shows number of file samples being tested for a specific file type Classification: Testing result shows number of files that have more than 95% matching rates for each features extracted for file type22
  • 24. Analysis • ‘Accuracy’: 80% vs 49% • Average Accuracy: 92.63% vs 96.33% • Average Precision: 76.07% vs 51% • Average Recall: 81.09% vs 48%24
  • 25. What if it is tested on … • Altered File Extension? • Altered Libmagic in File Header? • Altered File Header? • Random bytes of data? 25
  • 29. Conclusion • Different way to choose learning data set makes automated extraction of n- gram in files faster. The prediction result yields with high accuracy, precision, and recall rate • Can detect file container and the file type inside the container 29
  • 30. Further Development • To reduce learning time, use multi- threaded algorithm and multi-processor • Possibility to develop using GPU such nvidia CUDA to have faster calculation process using jCuda library • Statistical approach to classification: Support Vector Machine, k-Nearest Neighbor for higher rate of accuracy, precision, and recall 30
  • 31. Bibliography • Darwin, F. Ian. Fine Free File Command. 9 9 2013 <http://www.darwinsys.com/file/>. • McDaniel, Mason B. An Algorithm for Content-Based Automated File Type Recognition. Master Thesis. James Madison University. Harrisonburg, 2001. • Damashek, Marc. "Gauging Similarity with n-Grams: Language-Independent Categorization of Text." Science 267.5199 (1995): 843-848. • Gopal, Siddharth, et al. "Statistical Learning for File-Type Identification." International Conference on Machine Learning and Applications and Workshops (ICMLA) 10th. Honolulu: Software Engineering Institute, 2011. • Hiller, Scot. The System Registry. 1996. Microsoft. 5 October 2013 <http://msdn.microsoft.com/en-us/library/ms970651.aspx>. • Mayer, Ryan C. Filetype Identification Using Long, Summarized n-Grams. Monterey: Master Thesis at Naval Postgraduate School, 2011. • Axelsson, Stefan. "The Normalised Compression Distance as a file fragment classifier." The International Journal of Digital Forensics & Incident Response 7.Supplement (2010): S24-S31. 31