Data Mining Techniques 
for malware detection 
-BY Aditya Deshmukh(TE-CSE1) 
-BY ULLAS KAKANADAN(TE-CSE1) 
-BY ANKIT GELDA(TE-CSE1) 
-BY SUDARSHAN RANDIVE(TE-CSE1)
CONTENTS 
•DATA MINING??? 
•TECHNIQUES??? 
•WHAT IS MALWARE??? 
•TECHNIQUES OVER MALWARE 
•VARIOUS APPLICATIONS 
•CONCLUSION 
•QUESTION?
WHY MINE DATA???
• Lots of data is being collected and warehoused
• Stored data is a potentially valuable resource
• Stored data grows very fast
• Information is crucial
DATA MINING
Extracting information from data that is:
• IMPLICIT
• PREVIOUSLY UNKNOWN
• POTENTIALLY USEFUL
Needed: programs that detect patterns and regularities in the data
Also known as Knowledge Discovery in Data
KNOWLEDGE DISCOVERY PROCESS
DATA, INFORMATION, AND KNOWLEDGE
• Data
  - operational or transactional data
  - nonoperational data
  - metadata: data about the data itself
• Information
  - patterns, associations, or relationships among all this data
• Knowledge
HOW DOES DATA MINING WORK?
• Classes: Stored data is used to locate data in predetermined groups.
• Clusters: Data items are grouped according to logical relationships or consumer preferences.
• Associations: Data can be mined to identify associations between items.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends.
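
As an illustration of the "Associations" task, here is a minimal sketch that counts how often pairs of items occur together and reports simple rules with their support and confidence. The function name `pair_rules`, the thresholds, and the example API-call names are assumptions made for this sketch; they do not come from the slides, and a real system would more likely use a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) tuples
    for item pairs that meet both thresholds."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))       # de-duplicate, fix pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                    # fraction of transactions with both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical data: the set of API calls observed in each program sample.
samples = [
    {"CreateFile", "WriteFile", "RegSetValue"},
    {"CreateFile", "WriteFile"},
    {"CreateFile", "Connect"},
    {"WriteFile", "RegSetValue", "CreateFile"},
]
for antecedent, consequent, support, confidence in pair_rules(samples):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Only pairs are considered here to keep the sketch short; larger item sets would need the candidate-generation step that Apriori provides.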
WHAT IS MALWARE???
• Short for "malicious software"
• As old as software itself
• Deliberately created by a programmer
• Most common types:
  - Viruses
  - Trojans
  - Worms
  - Zombies
  - Spyware
VIRUS
• The most well-known type of malware
• Main goal is not to cause damage, but to clone itself onto another host
• If a virus causes damage, it is more likely to be detected
• Has a very small footprint
• Can remain undetected for a very long time
WORMS
• Very similar to viruses in many ways
• Worms are network-aware
• Overcome the computer-to-computer hurdle by seeking new hosts on the network
• Capable of spreading worldwide in a very short time
• Very hard to control and stop
TROJANS
• Conceal themselves inside other software
• Named after the legend in which the Greeks were able to enter the fortified city of Troy by hiding their soldiers in a big wooden horse given to the Trojans as a gift
• The disguises a trojan can take are limited only by the programmer's imagination
• Cyber-crooks often use viruses, trojans and worms
• Trojans can also drop spyware
ZOMBIES
• Work in a similar way to spyware
• Infection mechanisms remain the same
• An infected machine just sits there waiting for commands from the hacker
• Hackers can infect tens of thousands of computers, turning them into zombie machines
• Zombie machines can be used to launch a distributed denial-of-service (DDoS) attack
ALGORITHMS IN DATA MINING
• C4.5 and beyond
• The k-means algorithm
• Support vector machines
• The Apriori algorithm
• The EM algorithm
MALWARE DETECTION TECHNIQUES
• Anomaly-based detection technique
• Signature-based detection technique
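
To make the difference between the two techniques concrete, here is a minimal sketch under stated assumptions. Signature-based detection is modeled as a lookup of a file's SHA-256 digest in a set of known-bad digests, and anomaly-based detection as a z-score test of one observed value against a baseline of normal behavior. The names `KNOWN_BAD_HASHES`, `signature_match`, and `is_anomalous`, the placeholder payload, and the threshold of 3 standard deviations are all hypothetical choices for illustration, not something the slides prescribe.

```python
import hashlib
from statistics import mean, stdev

# Hypothetical signature database: SHA-256 digests of known-bad files.
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"pretend-malicious-payload").hexdigest(),
}


def signature_match(data: bytes) -> bool:
    """Signature-based check: flag data whose digest is a known signature."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD_HASHES


def is_anomalous(value: float, baseline: list, threshold: float = 3.0) -> bool:
    """Anomaly-based check: flag a value that lies more than `threshold`
    sample standard deviations from the mean of the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)        # requires at least two baseline values
    if sigma == 0:
        return value != mu         # any deviation from a constant baseline
    return abs(value - mu) / sigma > threshold


# Hypothetical baseline: outbound connections per minute on a normal day.
normal_traffic = [10, 12, 11, 9, 10, 11]
print(signature_match(b"pretend-malicious-payload"))  # known signature
print(signature_match(b"harmless document"))          # unknown digest
print(is_anomalous(50, normal_traffic))               # far above the baseline
print(is_anomalous(11, normal_traffic))               # within normal range
```

The sketch also shows the usual trade-off: the signature check can only recognize samples already in its database, while the anomaly check can flag unseen behavior but depends on how well the baseline represents normal activity.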
K-MEANS ALGORITHM
• Takes as many components of the population as the final required number of clusters and uses them as the initial clusters
• Examines each remaining component in the population
• Assigns it to the cluster whose centroid is at the minimum distance
• Recalculates the centroid's position every time a component is added
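
The steps on this slide can be sketched as a single-pass (sequential) k-means. This is only an illustration under assumptions: the first k points are used as the initial centroids, distance is Euclidean, and the function name `sequential_kmeans` is hypothetical. The more common batch form of k-means repeats the assign-and-update cycle until the assignments stop changing, which this sketch does not do.

```python
import math


def sequential_kmeans(points, k):
    """Single-pass k-means: seed with the first k points, then assign each
    remaining point to its nearest centroid and update that centroid."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")

    centroids = [list(p) for p in points[:k]]  # first k points seed the clusters
    counts = [1] * k                           # number of members per cluster
    labels = list(range(k))                    # seed points label themselves

    for p in points[k:]:
        # Pick the cluster whose centroid is at the minimum distance.
        j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
        counts[j] += 1
        # Incremental mean: move the centroid toward the newly added point.
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], p)]
        labels.append(j)
    return centroids, labels


# Hypothetical 2-D feature vectors, e.g. two numeric features per sample.
data = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0), (9.0, 9.0)]
centroids, labels = sequential_kmeans(data, k=2)
print(centroids)
print(labels)
```

Because each point is seen only once, the result depends on the order of the input and on which points happen to come first.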
FLOWCHART
ADVANTAGES OF DATA MINING
• Marketing/Retailing
• Banking/Crediting
• Law enforcement
• Researchers
DISADVANTAGES OF DATA MINING
• Privacy issues
• Security issues
• Misuse of information/inaccurate information
