Malware Detection -
A Machine Learning Perspective
C.K.Chen
2014.06.05
Outline
• A Large Wave of Malware Is Coming
• Is Machine Learning the Savior?
• You Can't Make Something out of Nothing
• A Garbage In, Garbage Out Game?
• Model, Model, It’s All About The Model
• Every Evaluation in Every Paper is ‘Perfect’
• Democracy World in Machine Learning
• WYSIWYG
• Know Where Your Enemy Is
A Large Wave of Malware Is Coming
• Millions of new malware samples are created every year
Source: McAfee Labs Threat Report, Fourth Quarter 2013
Your Anti-Virus Will Not Tell You
• Although the overall detection rate looks good
Attack Windows in Anti-Virus
• Attack windows
[Figure: anti-virus life cycle compared with malware life cycle]
Is Machine Learning the Savior?
• The problem:
  • Signature generation is manual work and time-consuming
• Most malware is not brand new, but modified or rewritten from older malware
  • Automatic malware creation tool chains
  • Mutation techniques
  • These may leave some clues for us
• Machine learning sheds light on automatically constructing models to detect malware
How Does Machine Learning Work?
• Training
  • Feature Extraction -> Learning Algorithm -> Generated Classifier
• Testing
  • Feature Extraction -> Classifier -> Classification Result
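The training and testing flow above can be sketched in a few lines of Python. This is only an illustration under assumptions: the three byte-ratio features, the nearest-centroid learning algorithm, and the toy byte strings are all hypothetical stand-ins, not the features or classifier used in the talk.

```python
from collections import Counter

def extract_features(sample: bytes) -> list:
    """Toy feature extraction: ratio of zero, printable, and high (>= 0x80) bytes."""
    n = max(len(sample), 1)
    zero = sum(b == 0 for b in sample) / n
    printable = sum(32 <= b < 127 for b in sample) / n
    high = sum(b >= 128 for b in sample) / n
    return [zero, printable, high]

def train(samples, labels):
    """Training: feature extraction -> learning algorithm -> classifier.

    The learning algorithm here is nearest centroid (an assumption for
    illustration); the returned classifier maps each label to its centroid.
    """
    sums, counts = {}, Counter()
    for sample, label in zip(samples, labels):
        feats = extract_features(sample)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(classifier, sample):
    """Testing: feature extraction -> classifier -> classification result."""
    feats = extract_features(sample)
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(classifier, key=lambda label: squared_distance(classifier[label]))

# Hypothetical toy training set: text-like bytes vs. opcode-like bytes.
train_samples = [
    b"hello world, plain text",
    b"configuration=1 name=demo",
    bytes(range(128, 200)),
    bytes([0x90] * 40 + [0xCC] * 10),
]
train_labels = ["benign", "benign", "malicious", "malicious"]

clf = train(train_samples, train_labels)
print(classify(clf, b"another readable string"))
print(classify(clf, bytes([0xEB, 0xFE] * 20)))
```

The same extract_features function is used in both phases, which mirrors the slide: whatever representation the classifier was trained on must also be produced at test time.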
Categories of Machine Learning Approaches
• Categorized by representation / feature selection / classification algorithm
You Can't Make Something out of Nothing
• The data set is the first step for ML
• Without data, ML can do nothing
• Where to collect samples
  • Web, honeypots, user uploads
• Balanced vs. imbalanced data
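To make the balanced vs. imbalanced point concrete: with 95 benign files and 5 malware samples, a detector that always answers "benign" is 95% accurate and detects nothing. One common way to rebalance is random undersampling of the larger class, sketched below. The sample names, the class ratio, and the undersample helper are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes until all classes
    have as many samples as the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [y for _, y in balanced]

# Hypothetical imbalanced data set: 95 benign files, 5 malware samples.
samples = [f"benign_{i}" for i in range(95)] + [f"malware_{i}" for i in range(5)]
labels = ["benign"] * 95 + ["malware"] * 5

balanced_samples, balanced_labels = undersample(samples, labels)
print(Counter(labels))
print(Counter(balanced_labels))
```

Undersampling throws data away, so it is only one option; class weighting or oversampling the minority class are alternatives with different trade-offs.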
A Garbage In, Garbage Out Game
• There are many features to choose from
• The quality of the features determines the precision of machine learning
• Features
  • Static / Dynamic / PE structure
  • N-gram
• Feature selection is needed
  • ReliefF
  • Chi-squared
  • F-statistic
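As a rough sketch of how n-gram features and chi-squared feature selection fit together, the code below extracts byte 2-grams and scores each one with the standard chi-squared statistic for a 2x2 presence/class table, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The four byte strings and their labels are made-up toy data, and the function names are hypothetical.

```python
def byte_ngrams(data: bytes, n: int = 2) -> set:
    """Return the set of distinct byte n-grams that occur in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def chi_squared(samples, labels, positive="malware", n=2):
    """Score every n-gram by the chi-squared statistic of its
    2x2 table (n-gram present/absent vs. positive/negative class)."""
    grams = [byte_ngrams(s, n) for s in samples]
    total = len(samples)
    n_pos = sum(y == positive for y in labels)
    n_neg = total - n_pos
    scores = {}
    for g in set().union(*grams):
        a = sum(g in gs and y == positive for gs, y in zip(grams, labels))  # present, positive
        b = sum(g in gs and y != positive for gs, y in zip(grams, labels))  # present, negative
        c, d = n_pos - a, n_neg - b                                        # absent in each class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[g] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Hypothetical toy data: two "malware" samples share an opcode-like run.
samples = [b"\x90\x90\xeb\xfeABCD", b"XY\x90\x90\xeb\xfe", b"ABCDXYZ", b"hello ABCD"]
labels = ["malware", "malware", "benign", "benign"]

scores = chi_squared(samples, labels)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

N-grams that appear in every malware sample and no benign sample get the highest score, while n-grams such as b"AB" that appear in both classes score low; keeping only the top-scoring n-grams is the feature selection step.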
Model, Model, It’s All About The Model
• The most important part
• You need to choose the model that fits your data most closely
• How to choose a model
  • Numerical data: classical classifiers (SVM)
  • Categorical data: dummy variables, decision trees
  • Sequence data: n-gram algorithms, Bayes, Markov chains
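For sequence data, a Markov chain is one of the models listed above. The sketch below trains one first-order chain per class and assigns a new sequence to the class whose chain gives it the higher log-likelihood. The API-call names, the training sequences, and the add-one smoothing are assumptions made for this example.

```python
import math
from collections import defaultdict

def train_markov(sequences, alphabet):
    """Estimate first-order transition probabilities with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        model[prev] = {nxt: (counts[prev][nxt] + 1) / total for nxt in alphabet}
    return model

def log_likelihood(model, seq):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))

def classify(models, seq):
    """Pick the label whose Markov chain explains the sequence best."""
    return max(models, key=lambda label: log_likelihood(models[label], seq))

# Hypothetical API-call traces (all symbols must come from the alphabet).
alphabet = ["open", "read", "write", "close", "connect", "send"]
benign_traces = [["open", "read", "close"], ["open", "read", "write", "close"]]
malware_traces = [["open", "read", "connect", "send"],
                  ["connect", "send", "connect", "send"]]

models = {
    "benign": train_markov(benign_traces, alphabet),
    "malware": train_markov(malware_traces, alphabet),
}
print(classify(models, ["open", "read", "close"]))
print(classify(models, ["connect", "send", "connect", "send"]))
```

Smoothing matters here: without it, a single transition never seen in training would give a probability of zero and make the log-likelihood undefined.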
Every Evaluation in Every Paper Is ‘Perfect’
• Unlike other research areas, malware detection has no standard benchmark
  • New malware is created every day
  • Privacy concerns
• There is also no guideline for evaluation
• Therefore, some researchers have examined this problem and produced a great survey
  • They provide some rules for evaluation
Is Machine Learning the Savior?
• Machine learning can help us recognize similar and variant malware
• It cannot identify brand-new malware
• A machine-learning-based detector needs careful training and a long time for tuning
Democracy World in Machine Learning
• There are many types of classifiers
  • SVM, decision tree, neural network, ...
• Voting to increase precision
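Voting can be sketched as a simple majority over the labels returned by several classifiers. In the example below, three hand-written rules stand in for trained models such as an SVM, a decision tree, and a neural network; the rules, thresholds, and feature names are hypothetical.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for three independently trained classifiers.
def by_entropy(sample):
    return "malware" if sample["entropy"] > 7.0 else "benign"

def by_imports(sample):
    return "malware" if sample["imports"] < 5 else "benign"

def by_packer(sample):
    return "malware" if sample["packed"] else "benign"

classifiers = [by_entropy, by_imports, by_packer]

packed_sample = {"entropy": 7.4, "imports": 20, "packed": True}
plain_sample = {"entropy": 5.0, "imports": 3, "packed": False}

print(majority_vote(classifiers, packed_sample))
print(majority_vote(classifiers, plain_sample))
```

Each sample is misjudged by one rule but the other two outvote it. Using an odd number of classifiers avoids ties in a two-class setting.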
WYSIWYG
Know Where Your Enemy Is
• In the security field, the bad guys always try to break your system
• Causative game
  • Attacker poisons data
  • Defender trains ML on poisoned data
• Exploratory game
  • Defender trains on clean data
  • Attacker evades learned classifier/detector
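The two games can be illustrated with a deliberately simple detector: a single threshold on a one-dimensional "suspiciousness" score, learned as the midpoint between the class means. The scores, the detector, and the attacker's moves below are all invented for this sketch and are far simpler than a real classifier.

```python
def train_threshold(scores, labels):
    """Learn a threshold halfway between the mean benign and mean malware score."""
    benign = [s for s, y in zip(scores, labels) if y == "benign"]
    malware = [s for s, y in zip(scores, labels) if y == "malware"]
    return (sum(benign) / len(benign) + sum(malware) / len(malware)) / 2

def detect(threshold, score):
    return "malware" if score > threshold else "benign"

# Clean training data (hypothetical scores).
clean_scores = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
clean_labels = ["benign"] * 3 + ["malware"] * 3
clean_threshold = train_threshold(clean_scores, clean_labels)

attack_sample = 6.0
print(detect(clean_threshold, attack_sample))      # caught by the clean detector

# Causative game: the attacker injects high-scoring samples labeled benign,
# and the defender trains on the poisoned data.
poisoned_scores = clean_scores + [8.0, 9.0, 10.0]
poisoned_labels = clean_labels + ["benign"] * 3
poisoned_threshold = train_threshold(poisoned_scores, poisoned_labels)
print(detect(poisoned_threshold, attack_sample))   # same sample now slips through

# Exploratory game: the detector is trained on clean data, but the attacker
# modifies the sample until its score falls below the learned threshold.
evasive_sample = attack_sample - 1.5
print(detect(clean_threshold, evasive_sample))
```

In the causative game the attacker moves the decision boundary; in the exploratory game the boundary stays put and the attacker moves the sample.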
References
1. McAfee Labs Threat Report in Fourth Quarter 2013
2. http://www.fireeye.com/blog/corporate/2014/05/ghost-hunting-with-anti-virus.html
3. AV alone is not enough to protect PC from zero-day malware
4. AV Isn't Dead, It Just Can't Keep Up
5. AV comparatives, File Detection Test of Malicious Software, 2014
6. G. Yan, N. Brown, and D. Kong, “Exploring Discriminatory Features for Automated Malware
Classification,” DIMVA, 2013.
7. A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, “Detection of malicious code by applying
machine learning classifiers on static features: A state-of-the-art survey,” Inf. Secur. Tech. Rep.,
2009.
8. C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, and M. Van
Steen, “Prudent Practices for Designing Malware Experiments: Status Quo and Outlook,” IEEE
S&P, 2012.
9. D. Kong and G. Yan, “Discriminant malware distance learning on structural information for
automated malware classification,” Proc. 19th ACM SIGKDD KDD ’13, 2013.
