Intelligent Software Engineering:
Synergy between AI and Software
Engineering
Tao Xie
University of Illinois at Urbana-Champaign
taoxie@illinois.edu
http://taoxie.cs.illinois.edu/
Artificial Intelligence ⇄ Software Engineering
[Diagram: Artificial Intelligence + Software Engineering → Intelligent Software Engineering]
1st International Workshop on
Intelligent Software Engineering (WISE 2017)
Organizing Committee:
• Tao Xie, University of Illinois at Urbana-Champaign, USA
• Abhik Roychoudhury, National University of Singapore, Singapore
• Wolfram Schulte, Facebook, USA
• Qianxiang Wang, Huawei, China
Co-Located with ASE 2017
https://isofteng.github.io/wise2017/
Workshop Program
8 invited speakers
1 panel discussion
https://isofteng.github.io/wise2017/
International Workshop on Intelligent Software Engineering (WISE 2017)
Artificial Intelligence ⇄ Software Engineering
[Diagram: Artificial Intelligence + Software Engineering → Intelligent Software Engineering]
Past: Automated Software Testing
• 10 years of collaboration with Microsoft Research on Pex
• .NET Test Generation Tool based on Dynamic Symbolic Execution (illustrative sketch after this slide)
• Example Challenges
• Path explosion [DSN’09: Fitnex]
• Method sequence explosion [OOPSLA’12: Seeker]
• Shipped in Visual Studio 2015/2017 Enterprise Edition
• As IntelliTest
• Code Hunt w/ > 6 million (6,114,978) users after 3.5 years
• Including registered users playing on www.codehunt.com, anonymous users, and accounts that access api.codehunt.com directly via the documented REST APIs (http://api.codehunt.com/)
http://taoxie.cs.illinois.edu/publications/ase14-pexexperiences.pdf
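A minimal sketch of the dynamic-symbolic-execution idea behind Pex/IntelliTest (not Pex itself): observe one concrete path, negate a branch condition, and ask an SMT solver for an input that covers the other branch. The function under test and the constant are made up for illustration; the Z3 Python bindings (z3-solver) are assumed to be installed.

```python
# Illustrative sketch of dynamic symbolic execution: flip a branch condition
# from an observed path and solve it with Z3 to obtain a new test input.
from z3 import Int, Solver, sat

def function_under_test(x):            # hypothetical code under test
    if x * 37 - 5 == 1401:             # branch that random testing rarely hits
        return "bug"
    return "ok"

# A concrete run with x = 0 takes the "ok" path under constraint x*37 - 5 != 1401.
# Negate that constraint and solve it to drive the other path.
x = Int("x")
s = Solver()
s.add(x * 37 - 5 == 1401)              # flipped branch condition
if s.check() == sat:
    new_input = s.model()[x].as_long()
    print(new_input, function_under_test(new_input))   # 38 bug
```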
Past: Android App Testing
• 2 years of collaboration with Tencent Inc. WeChat testing team
• Guided Random Test Generation Tool improved over Google Monkey (guidance sketch after this slide)
• Resulting tool deployed in daily WeChat testing practice
• WeChat = WhatsApp + Facebook + Instagram + PayPal + Uber …
• Monthly active users: 963 million as of 2017 Q2
• Daily: tens of billions of messages sent, hundreds of millions of photos uploaded, hundreds of millions of payment transactions executed
• First studies on testing industrial Android apps [FSE’16IN][ICSE’17SEIP]
• Beyond the open-source Android apps that academia has focused on
WeChat
http://taoxie.cs.illinois.edu/publications/esecfse17industry-replay.pdf
http://taoxie.cs.illinois.edu/publications/fse16industry-wechat.pdf
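A minimal sketch of the guided-random idea behind the WeChat testing tool (not the deployed tool itself): like Monkey, events are chosen randomly, but selection is biased toward actions that have been exercised least in the current abstract GUI state. The device/driver object and its methods are hypothetical.

```python
# Guided random GUI testing sketch: bias event selection toward the
# least-exercised actions in the current abstract GUI state.
import random
from collections import defaultdict

visit_count = defaultdict(int)          # (state, action) -> times executed

def choose_action(state, actions):
    # Keep some pure randomness (Monkey-style) to avoid getting stuck,
    # otherwise prefer the least-exercised action in this state.
    if random.random() < 0.2:
        return random.choice(actions)
    return min(actions, key=lambda a: visit_count[(state, a)])

def run(device, num_events=5000):
    # `device` is a hypothetical driver exposing state abstraction and events.
    for _ in range(num_events):
        state = device.current_abstract_state()   # e.g., activity + widget hash
        actions = device.enumerate_actions()      # taps, swipes, key events
        action = choose_action(state, actions)
        visit_count[(state, action)] += 1
        device.execute(action)
```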
Next: Intelligent Software Testing(?)
• Learning from others working on the same things
• Our work on mining API usage method sequences to test the API [ESEC/FSE’09:
MSeqGen] http://taoxie.cs.illinois.edu/publications/esecfse09-mseqgen.pdf
• Visser et al. Green: Reducing, reusing and recycling constraints in program
analysis. FSE’12.
• Learning from others working on similar things (constraint-reuse sketch after this slide)
• Jia et al. Enhancing reuse of constraint solutions to improve symbolic execution. ISSTA’15.
• Aquino et al. Heuristically Matching Solution Spaces of Arithmetic Formulas to
Efficiently Reuse Solutions. ICSE’17.
[Jia et al. ISSTA’15]
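A minimal sketch of the constraint-solution-reuse idea explored by Green and the follow-up work cited above (not those tools' actual algorithms): canonicalize a path constraint, look it up in a cache, and only call the solver on a miss. The canonicalization here is deliberately simplistic; real systems rename variables, slice independent sub-constraints, and more.

```python
# Constraint-solution reuse sketch: cache solver results keyed by a
# canonicalized form of the constraint (here, SMT-LIB 2 text).
from z3 import Solver, parse_smt2_string, sat

solution_cache = {}   # canonical constraint text -> model (or None if unsat)

def canonicalize(smt2_text):
    # Placeholder canonicalization: only normalize whitespace.
    return " ".join(smt2_text.split())

def solve_with_reuse(smt2_text):
    key = canonicalize(smt2_text)
    if key in solution_cache:
        return solution_cache[key]            # reuse an earlier result
    s = Solver()
    s.add(parse_smt2_string(smt2_text))
    model = s.model() if s.check() == sat else None
    solution_cache[key] = model
    return model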
Mining and Understanding Software Enclaves (MUSE)
http://materials.dagstuhl.de/files/15/15472/15472.SureshJagannathan1.Slides.pdf
DARPA
Pliny: Mining Big
Code to help
programmers
(Rice U., UT Austin,
Wisconsin, Grammatech)
http://pliny.rice.edu/ http://news.rice.edu/2014/11/05/next-for-darpa-autocomplete-for-programmers-2/
$11 million (4 years)
Program Synthesis: NSF Expeditions in Computing
https://excape.cis.upenn.edu/ https://www.sciencedaily.com/releases/2016/08/160815134941.htm
$10 million (5 years)
Software-related data are pervasive
Runtime traces
Program logs
System events
Perf counters
…
Usage log
User surveys
Online forum posts
Blog & Twitter
…
Source code
Bug history
Check-in history
Test cases
Keystrokes
…
Software Analytics
[Diagram: vertical and horizontal analytics areas supported by three technology pillars: Information Visualization, Data Analysis Algorithms, Large-scale Computing]
Software analytics is to enable software practitioners to
perform data exploration and analysis in order to obtain
insightful and actionable information for data-driven tasks
around software and services.
In Collaboration with Microsoft Research Asia
http://taoxie.cs.illinois.edu/publications/malets11-analytics.pdf
Past: Software Analytics
• StackMine [ICSE’12]: performance debugging in the large
• Data Source: Performance call stack traces from Windows end users
• Analytics Output: Ranked clusters of call stack traces based on shared patterns (clustering sketch after this slide)
• Impact: Deployed/used in daily practice of Windows Performance Analysis team
• XIAO [ACSAC’12]: code-clone detection and search
• Data Source: Source code repos (+ given code segment optionally)
• Analytics Output: Code clones
• Impact: Shipped in Visual Studio 2012; deployed/used in daily practice of
Microsoft Security Response Center
In Collaboration with Microsoft Research Asia
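A simplified sketch of the StackMine idea (not the actual StackMine algorithm): group performance call-stack traces whose frame sequences share long common patterns, and rank the clusters so engineers inspect clusters instead of millions of raw traces. The similarity measure, the greedy clustering, and the sample frame names are all illustrative placeholders.

```python
# Cluster call-stack traces by shared frame subsequences and rank by size.
from difflib import SequenceMatcher

def stack_similarity(stack_a, stack_b):
    # Ratio of matching frame subsequences, between 0.0 and 1.0.
    return SequenceMatcher(None, stack_a, stack_b).ratio()

def cluster_stacks(stacks, threshold=0.6):
    clusters = []                          # each cluster: list of stacks
    for stack in stacks:
        for cluster in clusters:
            if stack_similarity(stack, cluster[0]) >= threshold:
                cluster.append(stack)
                break
        else:
            clusters.append([stack])
    # Rank clusters by size as a crude proxy for aggregate performance impact.
    return sorted(clusters, key=len, reverse=True)

traces = [  # made-up frames for illustration
    ["ntoskrnl!KiSwapContext", "win32k!xxxRealSleepThread", "user32!NtUserWaitMessage"],
    ["ntoskrnl!KiSwapContext", "win32k!xxxRealSleepThread", "user32!GetMessageW"],
    ["ntdll!NtWaitForSingleObject", "kernel32!WaitForSingleObjectEx"],
]
print([len(c) for c in cluster_stacks(traces)])   # -> [2, 1]
```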
Past: Software Analytics
• Service Analysis Studio [ASE’13]: service incident management
• Data Source: Transaction logs, system metrics, past incident reports
• Analytics Output: Healing suggestions/likely root causes of the given incident
• Impact: Deployed and used by an important Microsoft service (hundreds of
millions of users) for incident management
In Collaboration with Microsoft Research Asia
Next: Intelligent Software Analytics(?)
Microsoft Research Asia - Software Analytics Group - Smart Data Discovery
IN4: INteractive, Intuitive, Instant, INsights
Quick Insights -> Microsoft Power BI
Gartner Magic Quadrant for Business
Intelligence & Analytics Platforms
Microsoft Research Asia - Software Analytics Group
https://www.hksilicon.com/articles/1213020
Open Topics in Intelligent Software Engineering (ISE)
• How to determine whether a software engineering tool is indeed
“intelligent”?
• A Turing test for such a tool?
• What sub-areas/problems in ISE should the research community invest effort in as high priority?
• How to turn ISE research results into industrial/open source practice?
• …
Artificial Intelligence ⇄ Software Engineering
[Diagram: Artificial Intelligence + Software Engineering → Intelligent Software Engineering]
White-House-Sponsored Workshop (2016 June 28)
http://www.cmu.edu/safartint/
Self-Driving Tesla Involved in Fatal Crash (2016 June 30)
http://www.nytimes.com/2016/07/01/business/self-driving-tesla-fatal-crash-investigation.html
“A Tesla car in autopilot crashed into a trailer because the autopilot system failed to recognize the trailer as an obstacle due to its ‘white color against a brightly lit sky’ and the ‘high ride height’.”
http://www.cs.columbia.edu/~suman/docs/deepxplore.pdf
Microsoft's Teen Chatbot Tay
Turned into Genocidal Racist (2016 March 23/24)
http://www.businessinsider.com/ai-expert-explains-why-microsofts-tay-chatbot-is-so-racist-2016-3
"There are a number of precautionary steps
they [Microsoft] could have taken. It wouldn't
have been too hard to create a blacklist of
terms; or narrow the scope of replies.
They could also have simply manually
moderated Tay for the first few days, even if
that had meant slower responses."
“businesses and other AI developers will
need to give more thought to the protocols
they design for testing and training AIs
like Tay.”
• An example
• A declaration on the PuTTY website about a change in Google’s search results
• This declaration suggests that Google’s search results for “putty” at some point may not have been satisfactory and may have caused confusion among users.
Motivation: Testing Search Engines
23
Our Past Work on Testing Search Engines
• Given a large set of search results, find the failing tests among them
• Test oracles: relevance judgments
• Test selection
• Inspect parts of search results for some queries by mining search results of all
queries
24
[Pipeline: Queries and Search Results → Feature Vectors (Itemsets) → Association Rules → Detecting Violations]
http://taoxie.cs.illinois.edu/publications/ase11-searchenginetest.pdf
[ASE 2011]
• Query items
• Query words, query types, query length, etc.
• Search result items
• Domain, domain’s Alexa rank, etc.
• Query-result matching items
• Whether the domain name has the query, whether the title has the query, etc.
• Search engine items
• Search engine names
Mining Output Rules
25
• SE:bing, Q:boston colleges, QW:boston, QW:colleges, TwoWords,
CommonQ, top10:searchboston.com, top1:searchboston.com,
top10:en.wikipedia.org, …, SOMEGE100K, SOMELE1K
• SE:bing, Q:day after day, QW:day, QW:after, ManyWords,
CommonQ, top10:en.wikipedia.org, top1:en.wikipedia.org,
top10:dayafterday.org, …, SOMEGE100K, SOMELE1K
Example Itemsets
26
Mining Association Rules
• Mining frequent itemsets with a length constraint (sketch after this slide)
• An itemset is frequent if its support is larger than min_support
e.g., {SE:bing, top10:en.wikipedia.org}
• Generating rules with only one item on the right-hand side
• For each item xi in a frequent itemset Y, generate a rule Y − {xi} => xi
e.g., SE:bing => top10:en.wikipedia.org
27
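A minimal sketch of the mining-and-checking scheme described above, on tiny hypothetical data: count frequent itemsets up to a length cap, emit single-consequent rules that meet the confidence threshold, and flag search results whose itemsets match a rule's left-hand side but lack its right-hand side. Itemsets are sets of string items such as "SE:bing"; the thresholds and data are placeholders, not the paper's.

```python
# Association-rule mining with a length constraint, plus violation checking.
from itertools import combinations
from collections import Counter

def mine_rules(itemsets, min_support=2, min_conf=0.95, max_len=3):
    counts = Counter()
    for items in itemsets:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                counts[frozenset(combo)] += 1
    rules = []
    for itemset, support in counts.items():
        if support < min_support or len(itemset) < 2:
            continue
        for rhs in itemset:                       # one item on the RHS: Y - {x} => x
            lhs = itemset - {rhs}
            conf = support / counts[lhs]
            if conf >= min_conf:
                rules.append((lhs, rhs, support, conf))
    return rules

def find_violations(result_items, rules):
    # A result violates a rule if it matches the LHS but lacks the RHS item.
    return [(lhs, rhs) for lhs, rhs, _, _ in rules
            if lhs <= result_items and rhs not in result_items]
```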
Experiments
• Search results
• Use the Web services of Google and Bing to collect the top 10
search results of each query from December 25, 2010 to April 21,
2011
• 390,797 ranked lists of search results (each list contains the top 10 search results)
28
The Mined Rules
• Mining from one search engine's results in one day
• Google's search results on Dec. 25, 2010
• minsup = 20, minconf = 0.95, and maxL = 3
29
The Mined Rules
• Mining from multiple search engines' results in one day
• Google and Bing's search results on Dec. 25, 2010
• minsup = 20, minconf = 0.95, and maxL = 3
• Rules 9-12 show the search engines' differing opinions of certain websites
30
The Mined Rules
• Mining from one search engine's results over multiple days
• Google's search results from December 25, 2010 to March 31, 2011.
• minsup = 20, minconf = 0.95, and maxL = 2
• Rules 13-18 are rules about the top-1 results for the queries
31
Example Violations
• Search results of Bing on April 1st, 2011 violate the following
rule
• The actual Bing result, http://www.jcu.edu/index.php, points to the homepage of John Carroll University, making it hard to get the answer to the query
32
Our Recent Work: Multiple-Implementation
Testing of Machine Learning Software
Faults Detected for kNN and Naive Bayes
Implementations
Detected Fault Types
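A minimal sketch of multiple-implementation (differential) testing for ML code: run several kNN implementations on the same inputs and flag test points where they disagree, as candidate faults (or tie-breaking differences) to inspect. The hand-rolled kNN and the use of scikit-learn are stand-ins, not the implementations studied in the work above; inputs are assumed to be NumPy arrays.

```python
# Differential testing of two kNN implementations on identical inputs.
from collections import Counter
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def simple_knn(train_X, train_y, x, k=3):
    # Deliberately simple reference implementation (Euclidean distance, majority vote).
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest.tolist()).most_common(1)[0][0]

def differential_test(train_X, train_y, test_X, k=3):
    sk = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    disagreements = []
    for i, x in enumerate(test_X):
        votes = [sk.predict([x])[0], simple_knn(train_X, train_y, x, k)]
        if len(set(votes)) > 1:                 # implementations disagree
            disagreements.append((i, votes))    # candidate fault to inspect
    return disagreements
```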
Others’ Work at Columbia/Lehigh U.:
SOSP 2017 Best Paper Award
http://www.cs.columbia.edu/~suman/docs/deepxplore.pdf
Our Most Recent Work:
Malware Detection in Adversarial Settings:
Exploiting Feature Evolutions and Confusions in Android Apps
Wei Yang, Deguang Kong, Tao Xie, and Carl A. Gunter
Annual Computer Security Applications Conference (ACSAC 2017)
http://taoxie.cs.illinois.edu/publications/acsac17-malware.pdf
http://web.engr.illinois.edu/~weiyang3/
Wei Yang on the job market!
Machine learning (ML) in adversarial settings
38
• The development of machine learning is algorithm-oriented, often neglecting real-world constraints: machine budget, time cost, security, privacy, etc.
• The real world is complex, uncertain, and adversarial.
• Our work
• Robustness of ML algorithms in security systems (well known).
• Practicability of previously proposed attacks on ML-based security systems (something new).
• We use mobile malware detection systems as an example for study.
• Complexity (handled by logic), uncertainty (handled by probability), and adversarial behavior (the focus of our work).
Robustness of ML algorithms in security systems
39
[Pipeline: A Dataset → Training Data / Test Data → Feature Extraction → Feature Vector → ML Algorithm → Trained Classifier → Result]
• Conventionally, to train an ML model, researchers separate training data and test data from the same dataset.
• Overly strong assumption: the training data shares a similar stationary distribution with the test data (i.e., the training data is representative).
Robustness of machine learning algorithms in
security systems
40
• In adversarial settings, such an assumption often does not hold
• Training data and test data may not come from the same dataset.
• Poisoning attack: engineering the features or labels of training data
• Compromise the discriminant classifier.
• Evasion attack: engineering the features of test data
• Change discriminant feature values to non-discriminant ones.
Focus: Evasion Attack on Mobile Malware
Detection
• Goals: Understand classifier robustness;
Build better classifiers (or give up)
• Existing work:
43
Existing work
• Purely based on mathematics.
• Making slight perturbations (e.g., δx2) to incur a dramatic change in the output (e.g., Output(x1, x2)).
• Example:
• Derivative-based approach (generic sketch after this slide)
44
[Figure: feature space with a decision boundary separating the Malicious and Benign regions]
https://arxiv.org/abs/1511.07528
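A minimal sketch of a derivative-based evasion attack on a linear (logistic) classifier: perturb feature values in the direction that most decreases the "malicious" score. This is a generic gradient-style illustration, not the saliency-map method of the cited paper, and the weights and feature vector are made up.

```python
# Gradient-style perturbation of a feature vector to lower a malicious score.
import numpy as np

w = np.array([2.0, -1.0, 3.0])        # hypothetical trained weights
b = -0.5

def malicious_score(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))   # sigmoid probability of "malicious"

x = np.array([1.0, 0.0, 1.0])          # feature vector classified as malicious
grad = malicious_score(x) * (1 - malicious_score(x)) * w   # d(score)/dx
x_adv = x - 0.8 * np.sign(grad)        # step against the gradient sign
print(malicious_score(x), malicious_score(x_adv))   # score drops toward benign
```

The next slides make the key point: such perturbations live purely in feature space and say nothing about whether the corresponding code changes are feasible.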
What is missing: Practicability
• These proposed attacks manipulate a malware sample’s feature vector in theory, without considering the feasibility/impact of the mutations on the malware’s code.
• Mutations may cause compilation errors.
• Mutations may crash the app, cause undesirable behaviors, or disable
malicious functionalities.
45
Three practical constraints to craft a realistic attack
• Preserving Malicious Behaviors
• Maintain the original malicious purposes.
• Maintaining the Robustness of Apps
• Robust enough to be installed and executed on mobile devices.
• Evading Malware Detectors w/ little or no knowledge of the model
• Identify and change feature values to evade detection.
46
Malware Recomposition Variation (MRV)
• Malware Evolution Attack (based on existing malware evolution histories)
• Reapply feature mutations observed in malware evolution to create new malware variants.
• Malware Confusion Attack (based on existing evasive malware)
• Mutate a malware’s feature values to mimic those of apps classified as benign, creating new malware variants.
• Insight: feature mutations are less likely to break the apps when the mutations follow feature patterns of existing malware.
47
Intuition: Evolving space of malware
• Malware Evolution: make a feature vector Vi that is correctly classified as malware become misclassified as benign by mutating Vi’s feature values.
Three options:
• Mutating Vi to be exactly the same as a benign feature vector (e.g., B1)
• Mutating Vi to be exactly the same as a malware feature vector that is misclassified as benign (e.g., M2)
• Mutating Vi to a new feature vector Vn that can potentially be classified as benign (e.g., Mv)
Malware Detection
• A feature vector space V,
• The feature vectors of existing benign apps, B,
• The feature vectors of existing malware, M,
• The feature vectors detected by the detection model, D,
• The feature vectors of all potential malware, M′.
[Venn diagram: V containing B, M, D, and M′]
Malware Detection
If the feature selection is perfect (i.e., all differences between benign apps and malware are captured by the selected features), no malware sample and benign app should be projected to the same feature vector.
Insight 1: Differentiability of Feature Set
Malware detection usually performs poorly in differentiating between malware and benign apps that share the same feature vector.
We try to convert malware with unique malicious feature vectors (i.e., M \ (B ∩ M)) to possess feature vectors shared by benign apps (i.e., B ∩ M).
Attack 1: Malware confusion attack
Mutate a malware’s feature values that exist only in malware to values that exist in both malware and benign apps (sketch after this slide)
E.g., M1 -> M2
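A minimal sketch of the confusion-attack intuition: for a malware feature vector, pick the nearest feature vector that benign apps also exhibit (the B ∩ M region) and derive a mutation plan toward it. Binary feature vectors, Hamming distance, and the sample data are simplifications for illustration; MRV additionally checks each mutation for feasibility at the code level (see the later slides).

```python
# Confusion-attack sketch: move a malware-only feature vector toward the
# nearest feature vector shared with benign apps.
def nearest_shared_vector(malware_vec, shared_vectors):
    # shared_vectors: feature vectors observed for both malware and benign apps
    return min(shared_vectors,
               key=lambda v: sum(a != b for a, b in zip(malware_vec, v)))

def confusion_attack(malware_vec, shared_vectors):
    target = nearest_shared_vector(malware_vec, shared_vectors)
    # The plan lists which feature positions must change.
    plan = [i for i, (a, b) in enumerate(zip(malware_vec, target)) if a != b]
    return target, plan

m1 = (1, 1, 0, 1)                       # hypothetical malware-only feature vector
shared = [(1, 0, 0, 1), (0, 1, 1, 0)]   # hypothetical vectors in B ∩ M
print(confusion_attack(m1, shared))     # -> ((1, 0, 0, 1), [1])
```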
Insight 2: Robustness of detection model
• Detection models are limited to detecting existing malware.
• A robust malware detection model should aim to detect malware variants produced through known transformations.
The robustness of a detection model can be represented by how much of the potential-malware space M′ its detected region D covers.
Attack 2: Malware evolution attack
• Mutate malware feature values into the “blind spots” of malware detection.
• Mimic feature mutations observed in the actual evolution of malware in the real world.
• We conduct a phylogenetic analysis for each malware family
• E.g., apply the mutation M1 -> M2 (e.g., c -> c’ in F3) on M3 to derive Mv
Why MRV works
• A large feature set has numerous non-essential features
• non-informative or even misleading features.
• Observation 1: Malware detectors often mistake non-essential features in code clones for discriminative features.
• Learning-based detectors may regard non-essential features (e.g., minor implementation details) in code clones as a major discriminant factor
• Observation 2: Using a universal set of features for all malware families results in a large number of non-essential features for each family → opportunities to mutate them to evade detection while preserving malicious purposes
• Features essential to malicious behaviors differ for each malware family
55
Overview
56
Feature Model
• A substitute model
• Summarize essential features (i.e., security-sensitive resources) and
contextual features (e.g., when, where, how the security-sensitive resources
are obtained and used) commonly used in malware detection
• Transferability property
• Some adversarial samples generated based on the substitute model can also
evade the original detectors
57
Approach
• Mutation strategy synthesis
• Phylogenetic analysis for evolution attack
• Similarity metric for confusion attack
• Program mutation
• Program transplantation/refactoring
58
Practicability of attacks
• Checking the preservation of malicious behaviors
• Our impact analysis is based on the insight that the component-based nature of
Android constrains the impact of mutations within certain components.
• Checking the robustness of mutated apps (sketch after this slide)
• Each mutated app is tested against 5,000 events randomly generated by Monkey to ensure that the app does not crash.
59
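A minimal sketch of the robustness check, assuming adb is on the PATH and the mutated app is already installed on a connected device: drive the app's package with 5,000 Monkey events and treat any reported crash or ANR as a failed mutation. The package name is hypothetical and the output parsing is deliberately simplistic.

```python
# Run Android Monkey against a package and check its output for crash markers.
import subprocess

def survives_monkey(package, events=5000, seed=42):
    out = subprocess.run(
        ["adb", "shell", "monkey", "-p", package, "-s", str(seed),
         "--throttle", "100", str(events)],
        capture_output=True, text=True).stdout
    # Monkey reports "// CRASH" on app crashes and "// NOT RESPONDING" on ANRs.
    return "// CRASH" not in out and "// NOT RESPONDING" not in out

# e.g., survives_monkey("com.example.mutated_app")
```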
Results
60
Strengthening the robustness of detection
• Adversarial Training (sketch after this slide)
• Randomly add half of our generated malware variants to the training set used to train the model
• Variant Detector
• Create a new classifier, the variant detector, to detect whether an app is a variant derived from existing malware.
• Weight Bounding
• Constrain the weight on a few dominant features to make feature weights
more evenly distributed.
61
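A minimal sketch of the adversarial-training defense listed above: fold half of the generated malware variants back into the training set and retrain the detector. The feature matrices and the linear SVM are placeholders, not the exact detectors evaluated in the paper.

```python
# Adversarial training sketch: augment the training set with generated variants.
import numpy as np
from sklearn.svm import LinearSVC

def adversarially_train(X_train, y_train, X_variants):
    # Randomly pick half of the generated variants (all labeled as malware).
    idx = np.random.permutation(len(X_variants))[: len(X_variants) // 2]
    X_aug = np.vstack([X_train, X_variants[idx]])
    y_aug = np.concatenate([y_train, np.ones(len(idx))])   # 1 = malware
    return LinearSVC().fit(X_aug, y_aug)

# The held-out half of the variants can then be used to measure whether the
# retrained detector catches variants it has never seen.
```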
Results
62
Artificial Intelligence ⇄ Software Engineering
[Diagram: Artificial Intelligence + Software Engineering → Intelligent Software Engineering]