MTA-KDD’19: A Dataset for
Malware Traffic Detection
Authors:
Ivan Letteri
Giuseppe Della Penna
Luca Di Vita
Maria Teresa Grifa
University of L’Aquila
Road Map
● 1st - MTA-KDD’19 Dataset
○ legitimate traffic
○ malware traffic
● 2nd - Dataset Features
○ from 50 to 33 features
● 3rd - Dataset Generation Process
○ phase I. Feature Extraction
○ phase II. Dataset Profiling
○ phase III. Imputation
○ phase IV. Scaling
● 4th - Dataset Evaluation
○ Outlier Detection
○ Classification Experiment
Generation Process
● Network logs
○ Feature Extraction phase from
pcap files
■ Legitimate traffic
■ Malware traffic
- Legitimate traffic comes from the pcap files marked as Normal in MCFP: 15 pcap files with a
total size greater than 7 GBytes, which produced ≃ 45 thousand samples
- Malware traffic comes from the MTA repository: ≃ 2112 pcap files with a total size of
about 4.8 GB. These observations cover a time span from June 2013 to August 2019.
MCFP
Generation Process ( 50 Dataset Features )
1. {Ack,Syn,Fin,Psh,Urg,Rst}FlagDist
2. {TCP,UDP,DNS}OverIP
3. MaxLen, MinLen, AvgLen,
StdDevLen, MaxIAT, MinIAT, ...
4. PktIORatio
5. FirstPktLen
6. DNSQDist,DNSADist, DNSRDist, ...
7. {RepeatedPkt, SmallPkt} Ratio
8. AvgDomain{Char,Dot,Hyph,Digit}
9. AvgTTL
10. NumConnections, SynAcksyn
11. NumDstAddr, NumPorts
12. DistinctUA, AvgDistinctUALen, ...
- The selected features are inspired by best practices in the field of malware detection
- For each pcap file, we group the packets based on the source address
- We extracted a total of 50 features with ...
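The grouping step above can be sketched as follows. This is a minimal illustration, not the authors' Feature Extractor: the packet tuples are hypothetical, and the feature definitions (for example, a flag "distribution" taken as the fraction of packets carrying that flag, and a population standard deviation for StdDevLen) are assumptions made for the sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical packet records: (source address, length in bytes, TCP flags).
packets = [
    ("10.0.0.1", 60, {"SYN"}),
    ("10.0.0.1", 1500, {"ACK"}),
    ("10.0.0.1", 40, {"ACK", "FIN"}),
    ("10.0.0.2", 80, {"SYN"}),
]

def extract_features(packets):
    """Group packets by source address and compute a few per-source features."""
    groups = defaultdict(list)
    for src, length, flags in packets:
        groups[src].append((length, flags))

    rows = {}
    for src, pkts in groups.items():
        lengths = [length for length, _ in pkts]
        n = len(pkts)
        rows[src] = {
            "MaxLen": max(lengths),
            "MinLen": min(lengths),
            "AvgLen": mean(lengths),
            # Assumption: population standard deviation of packet lengths.
            "StdDevLen": pstdev(lengths),
            "FirstPktLen": lengths[0],
            # Assumption: fraction of packets with the given flag set.
            "AckFlagDist": sum("ACK" in f for _, f in pkts) / n,
            "SynFlagDist": sum("SYN" in f for _, f in pkts) / n,
        }
    return rows

features = extract_features(packets)
```

Each source address becomes one row, which is the shape a CSV dataset with one sample per row would need.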
Generation Process
● Gathering
○ where the dataset comes from
■ MTA
■ MCFP
○ 50 features extracted
■ dataset in CSV file format
○ Next: profiling activity
Dataset Profiling
- drop duplicates, missing values ...
- remove features with high correlation, and so on ...
- … the idea is to apply minimal preprocessing to the data
Generation Process (Profiling)
● Issues
○ detect low-complexity
○ remove low-complexity
● High-correlation
○ Pearson coefficient
○ High correlation > 0.95
● Not a Number (NaN)
○ ‘Real’ Zeros
○ Unavailable ( incomplete )
○ Missing Values ( noisy )
- Profiling detects and removes certain low-complexity issues that may affect the data
- Highly correlated features are identified using the Pearson coefficient and removed (from 50 features to 33 features)
- Some features may have zero values for different reasons: ‘Real’ Zeros, Unavailable, and Missing values
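A correlation filter of the kind described above might look like the sketch below. The 0.95 threshold comes from the slide; everything else is an assumption: the column values are made up, the absolute value of the coefficient is compared, and the first feature of a correlated pair is the one kept.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: coefficient undefined, treated as 0 here
    return cov / sqrt(vx * vy)

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns, not values from the dataset.
columns = {
    "MaxLen": [100.0, 200.0, 300.0, 400.0],
    "AvgLen": [50.0, 100.0, 150.0, 200.0],  # exactly 0.5 * MaxLen
    "AvgTTL": [64.0, 128.0, 64.0, 128.0],
}
kept = drop_correlated(columns)
```

Here AvgLen is dropped because it is a linear function of MaxLen, while AvgTTL survives.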
Generation Process (Imputation)
● remaining NaN
○ StdDevLen feature
○ StdDevLenRx feature
● replacing NaN
○ with real(istic) values
○ MICE
○ DataWig, KNN
- MICE (Multivariate Imputation by Chained Equations) works by filling in the missing data multiple times
- Multiple imputations are better than a single imputation because they measure the uncertainty of the
missing values more precisely
- The chained equations approach is flexible and can handle different data types (e.g., continuous or binary)
Imputing (MICE, KNN, DataWig)
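Of the three imputers named above, KNN is the simplest to illustrate. The sketch below is a plain nearest-neighbour imputation, not MICE and not the authors' code: the rows are hypothetical, and it assumes only the target column contains NaN and that at least one complete row exists.

```python
from math import sqrt, isnan

NAN = float("nan")

def knn_impute(rows, target, k=2):
    """Fill NaN in column `target` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns)."""
    complete = [r for r in rows if not isnan(r[target])]
    others = [c for c in range(len(rows[0])) if c != target]
    result = []
    for r in rows:
        filled = list(r)
        if isnan(r[target]):
            def distance(s, r=r):
                return sqrt(sum((r[c] - s[c]) ** 2 for c in others))
            nearest = sorted(complete, key=distance)[:k]
            filled[target] = sum(s[target] for s in nearest) / len(nearest)
        result.append(filled)
    return result

# Hypothetical rows: (AvgLen, MaxLen, StdDevLen), one StdDevLen missing.
rows = [
    [100.0, 200.0, 10.0],
    [110.0, 210.0, 12.0],
    [900.0, 1500.0, 300.0],
    [105.0, 205.0, NAN],
]
imputed = knn_impute(rows, target=2, k=2)
```

The missing StdDevLen is replaced by the average of its two closest neighbours, so the distant third row does not influence it.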
Generation Process (Scaling)
● 4 Scalers and 2 Transformers
○ MinMax scaler
○ MaxAbs scaler
○ Standard scaler
○ Robust scaler
○ Quantile transformer
○ Power transformer
● Precision metric
○ Gaussian Naive Bayes (GNB)
○ Logistic Regression (LogReg)
○ Random Forest (RF)
- The dataset contains features that vary widely in magnitude, units, and range
- Feature scaling may heavily influence the results of some algorithms
- To get an idea of the impact of such scaling, we trained 3 “fast” classifiers (GNB, LogReg, RF)
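To make the difference between the four scalers concrete, here is a small sketch of their usual textbook definitions applied to one feature column. The sample values are hypothetical, the functions are illustrative rather than the library implementations used for the experiments, and constant columns (which would divide by zero) are not handled.

```python
from statistics import mean, median, pstdev, quantiles

def minmax_scale(xs):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def maxabs_scale(xs):
    """Divide by the largest absolute value, giving values in [-1, 1]."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]

def standard_scale(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def robust_scale(xs):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

# Hypothetical feature column with one large outlier.
lengths = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled_minmax = minmax_scale(lengths)
scaled_robust = robust_scale(lengths)
```

With an outlier present, min-max scaling squeezes the ordinary values close to 0, whereas robust scaling keeps them spread around 0 and leaves the outlier far away.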
Dataset Evaluation
● Outlier Detection
○ feature importance with
Random Forest (Gini index)
○ 6 most important features
○ Matching between:
Outlier in BoxPlot <-> pkt in PCAP
● Classification Experiment
○ Multi Layer Perceptron
○ Accuracy, Precision, Recall,
Specificity
- Further quality evaluation: looking for the match between outliers in the CSV dataset and the
traffic in the PCAP files, on a subset of 6 “important” features
- The figure at the bottom shows the confusion matrix taken from one of the experiments, with an
accuracy of 99.74%
MTA pcap 258.5 MB
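The four metrics listed for the classification experiment all follow from the entries of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the values from the confusion matrix in the figure.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),    # flagged as malware and truly malware
        "recall": tp / (tp + fn),       # malware samples that were detected
        "specificity": tn / (tn + fp),  # legitimate samples correctly passed
    }

# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

Specificity is worth reporting next to recall because it shows how much legitimate traffic would be wrongly flagged.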
Conclusion
● MTA-KDD’19
○ malware traffic analysis
○ Data source up-to-date
○ Open source
● More Accurate
○ Feature Selection
○ Data preprocessing
● Future Work
○ more complex features
○ different NN architectures
○ other ML models
○ handle Unavailable values
- The data sources are continuously updated, making the dataset quite realistic
- Accurate feature selection and data preprocessing make the dataset as small as possible
- More complex feature selection strategies could further reduce the number of features to the most
informative ones.
- Validate the dataset with different neural network architectures and other machine learning models.
These slides are available here: https://www.ivanletteri.it/itasec2020/
● Reduced Dataset
○ 33 features
● Source Code
○ PCAP downloader
○ Feature Extractor
● Entire Dataset
○ Dataset 4 pkts tracking
● Call For Collaboration
○ Registration form
○ List of projects
● Form Questions
○ repeat experiment (I. Letteri, G.D.P.)
○ maths questions (M.T. Grifa)
○ data mining code (L. Di Vita)
ivanletteri.it/call-for-collaboration/
ivanletteri.it/mta-kdd19-qf/
github.com/IvanLetteri/MTA-KDD-19

Itasec2020
