ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Software Engineering
1. Intelligent Software Engineering:
Synergy between AI and Software
Engineering
Tao Xie
University of Illinois at Urbana-Champaign
taoxie@illinois.edu
http://taoxie.cs.illinois.edu/
Innovations in Software Engineering Conference (ISEC 2018)
Feb 9-11, 2018, Hyderabad, India
3. 1st International Workshop on
Intelligent Software Engineering (WISE 2017)
Tao Xie
University of Illinois at
Urbana-Champaign, USA
Abhik Roychoudhury
National University of
Singapore, Singapore
Organizing Committee
Wolfram Schulte
Facebook, USA
Qianxiang Wang
Huawei, China
Co-Located with ASE 2017
https://isofteng.github.io/wise2017/
4. Workshop Program
8 invited speakers
1 panel discussion
https://isofteng.github.io/wise2017/
International Workshop on Intelligent Software Engineering (WISE 2017)
6. Past: Automated Software Testing
• 10 years of collaboration with Microsoft Research on Pex
• .NET Test Generation Tool based on Dynamic Symbolic Execution (see the sketch after this slide)
• Example Challenges
• Path explosion [DSN’09: Fitnex]
• Method sequence explosion [OOPSLA’11: Seeker]
• Shipped in Visual Studio 2015/2017 Enterprise Edition
• As IntelliTest
• Code Hunt [ICSE’15 JSEET] w/ > 6 million (6,114,978) users after 3.5 years
• Including registered users playing on www.codehunt.com, anonymous users, and accounts that access http://api.codehunt.com/ directly via the documented REST APIs
https://www.codehunt.com/
http://taoxie.cs.illinois.edu/publications/ase14-pexexperiences.pdf
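To make the core technique concrete, here is a minimal sketch (not Pex's implementation) of one dynamic-symbolic-execution step: run the program on a concrete seed, mirror the branch condition symbolically, then negate it and ask an SMT solver for an input that drives the other path. The toy program and the use of z3's Python bindings are illustrative assumptions.

import z3

def branch_conditions(x):
    # Toy program under test: mirror each concrete branch with a symbolic condition.
    sx = z3.Int('x')
    conds = []
    if x * 2 == 10:
        conds.append(sx * 2 == 10)
    else:
        conds.append(sx * 2 != 10)
    return conds

seed = 0                              # arbitrary concrete seed input
path = branch_conditions(seed)        # path taken for the seed: x * 2 != 10
solver = z3.Solver()
solver.add(z3.Not(path[-1]))          # negate the last branch to target the unexplored path
if solver.check() == z3.sat:
    print("input covering the other branch:", solver.model()[z3.Int('x')])  # prints 5

Pex layers search strategies (e.g., the fitness-guided strategy of Fitnex) on top of this basic explore-negate-solve loop.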
7. Past: Android App Testing
• 2 years of collaboration with Tencent Inc. WeChat testing team
• Guided Random Test Generation Tool improved over Google Monkey (see the sketch after this slide)
• Resulting tool deployed in daily WeChat testing practice
• WeChat = WhatsApp + Facebook + Instagram + PayPal + Uber …
• #monthly active users: 963 million @ 2017 Q2
• Daily: tens of billions of messages sent, hundreds of millions of photos uploaded, hundreds of millions of payment transactions executed
• First studies on testing industrial Android apps [FSE’16IN][ICSE’17SEIP]
• Beyond the open-source Android apps that academia focuses on
http://taoxie.cs.illinois.edu/publications/esecfse17industry-replay.pdf
http://taoxie.cs.illinois.edu/publications/fse16industry-wechat.pdf
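As a rough illustration of what “guided” (as opposed to uniform Monkey-style) random testing can mean, the sketch below biases event selection toward events that recently reached unseen abstract UI states. The event names, state abstraction, and scoring rule are all hypothetical, not the design of the deployed WeChat tool.

import random
from collections import defaultdict

EVENTS = ["tap", "swipe", "back", "text_input", "rotate"]
novelty = defaultdict(lambda: 1.0)    # per-event score: how often it reached new states
seen_states = set()

def pick_event():
    # Weighted random choice instead of Monkey's uniform choice.
    return random.choices(EVENTS, weights=[novelty[e] for e in EVENTS], k=1)[0]

def execute(event):
    # Stand-in for sending the event to the app; returns an abstract UI-state id.
    return (event, random.randrange(3))

for _ in range(1000):
    e = pick_event()
    state = execute(e)
    if state not in seen_states:
        seen_states.add(state)
        novelty[e] += 1.0                         # reward: the event found a new state
    else:
        novelty[e] = max(0.1, novelty[e] * 0.99)  # decay events that stopped paying off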
8. Next: Intelligent Software Testing(?)
• Learning from others working on the same things
• Our work on mining API usage method sequences to test the API
[ESEC/FSE’09: MSeqGen]
• Visser et al. Green: Reducing, reusing and recycling constraints in program
analysis. FSE’12.
• Learning from others working on similar things
• Jia et al. Enhancing reuse of constraint solutions to improve symbolic execution.
ISSTA’15.
• Aquino et al. Heuristically Matching Solution Spaces of Arithmetic Formulas to
Efficiently Reuse Solutions. ICSE’17.
[Jia et al. ISSTA’15]
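The reuse idea in Green [Visser et al., FSE'12] and its successors can be sketched as memoization over canonicalized constraints: if two path conditions normalize to the same form, the second solver call becomes a cache lookup. This sketch uses z3 and a deliberately naive canonicalization (the real systems normalize far more aggressively, e.g., renaming variables and slicing constraints).

import z3

_cache = {}

def canonical_key(constraint):
    # Naive canonical form: z3's simplified s-expression string.
    return z3.simplify(constraint).sexpr()

def solve_with_reuse(constraint):
    key = canonical_key(constraint)
    if key in _cache:
        return _cache[key]            # reuse a previously computed result
    solver = z3.Solver()
    solver.add(constraint)
    result = solver.check()
    _cache[key] = result
    return result

x = z3.Int('x')
print(solve_with_reuse(x + 1 > 3))    # hits the solver
print(solve_with_reuse(x + 1 > 3))    # served from the cache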
9. Mining and Understanding Software Enclaves (MUSE) (DARPA)
http://materials.dagstuhl.de/files/15/15472/15472.SureshJagannathan1.Slides.pdf
10. Pliny: Mining Big Code to Help Programmers
(Rice U., UT Austin, Wisconsin, GrammaTech)
$11 million (4 years)
http://pliny.rice.edu/
http://news.rice.edu/2014/11/05/next-for-darpa-autocomplete-for-programmers-2/
11. Program Synthesis: NSF Expeditions in Computing
https://excape.cis.upenn.edu/
https://www.sciencedaily.com/releases/2016/08/160815134941.htm
$10 million (5 years)
12. Software related data are pervasive
Runtime traces
Program logs
System events
Perf counters
…
Usage log
User surveys
Online forum posts
Blog & Twitter
…
Source code
Bug history
Check-in history
Test cases
Keystrokes
…
13. Software Analytics
In Collaboration with Microsoft Research Asia
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
http://taoxie.cs.illinois.edu/publications/malets11-analytics.pdf
14. Past: Software Analytics
• StackMine [ICSE’12, IEEESoft’13]: performance debugging in the large
• Data Source: Performance call stack traces from Windows end users
• Analytics Output: Ranked clusters of call stack traces based on shared patterns
• Impact: Deployed/used in daily practice of Windows Performance Analysis team
• XIAO [ACSAC’12, ICSE’17 SEIP]: code-clone detection and search (see the sketch after this slide)
• Data Source: Source code repos (+ given code segment optionally)
• Analytics Output: Code clones
• Impact: Shipped in Visual Studio 2012; deployed/used in daily practice of
Microsoft Security Response Center
In Collaboration with Microsoft Research Asia
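As a rough illustration of the kind of analysis behind code-clone detection and search (this is not XIAO's algorithm, which supports tunable near-miss clone detection at scale): abstract identifier names away, shingle the token stream, and compare snippets by Jaccard similarity.

KEYWORDS = {"if", "else", "for", "while", "return", "free", "NULL"}

def tokens(code):
    # Abstract identifiers to "ID" so renamed variables still match.
    return ["ID" if t.isidentifier() and t not in KEYWORDS else t
            for t in code.split()]

def shingles(code, n=4):
    ts = tokens(code)
    return {tuple(ts[i:i + n]) for i in range(len(ts) - n + 1)}

def similarity(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

s1 = "if ( p != NULL ) { free ( p ) ; p = NULL ; }"
s2 = "if ( q != NULL ) { free ( q ) ; q = NULL ; }"
print(similarity(s1, s2))             # 1.0: flagged as a clone despite the rename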
15. Past: Software Analytics
• Service Analysis Studio [ASE’13-EX]: service incident management
• Data Source: Transaction logs, system metrics, past incident reports
• Analytics Output: Healing suggestions/likely root causes of the given incident
• Impact: Deployed and used by an important Microsoft service (hundreds of
millions of users) for incident management
In Collaboration with Microsoft Research Asia
16. Next: Intelligent Software Analytics(?)
Microsoft Research Asia - Software Analytics Group - Smart Data Discovery
IN4: INteractive, INtuitive, INstant INsights
Quick Insights → Microsoft Power BI
Gartner Magic Quadrant for Business
Intelligence & Analytics Platforms
17. Microsoft Research Asia - Software Analytics Group
https://www.hksilicon.com/articles/1213020
18. Deep Learning for NL→Regex: Get Real!
Existing approaches on NL→regex:
[Ranta 1998], [Kushman and Barzilay 2013], [Locascio et al. 2016]
Used only synthetic data for training and testing
Are these approaches effective in real-world situations?
Zhong et al. Generating Regular Expressions from Natural Language Specifications: Are We There Yet? AAAI 2018 Workshop on NLP for Software Engineering (NL4SE 2018)
http://taoxie.cs.illinois.edu/publications/nl4se18-regex.pdf
19. Characteristic Study
Synthetic datasets:
• KB13 [Kushman and Barzilay 2013] (824 pairs): write NL sentences to capture the example strings
• NL-RX [Locascio et al. 2016] (10,000 pairs): parse a regex and generate initial NL sentences based on a predefined grammar, then paraphrase the generated sentences
Real-world dataset:
• RegexLib (3,619 pairs): from regexlib.com
20. Complexity of Regular Expressions
• Synthetic datasets support only a subset of the regex language:
e.g., ‘?’ ∈ RegexLib, but ∉ NL-RX or KB13
[Figure: length statistics of regular expressions]
21. Complexity of NL Sentences
• # of distinct words: 13,491 (RegexLib) vs. 715 (KB13) vs. 560 (NL-RX)
[Figure: #words statistics of NL sentences]
22. Experimental Study
Deep-Regex [Locascio et al. 2016]: regular expression generation as machine translation, via sequence-to-sequence learning (evaluation sketched below)
https://github.com/nicholaslocascio/deep-regex
https://aclweb.org/anthology/D/D16/D16-1197.pdf
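Deep-Regex counts a prediction as correct when it is semantically equivalent to the ground truth, not merely string-identical; the paper checks exact DFA equivalence. The sketch below is a brute-force approximation of that check: it compares two regexes on all strings up to a small length, so it can refute equivalence with a witness but never prove it.

import re
from itertools import product

def approx_equivalent(r1, r2, alphabet="ab01", max_len=6):
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False, s       # witness string on which the regexes disagree
    return True, None

print(approx_equivalent(r"a+b", r"aa*b"))              # (True, None): same language
print(approx_equivalent(r"1[0-9][0-9]", r"([0-9])*"))  # (False, ''): differ on the empty string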
24. Experiments on the Real-world Dataset
Experiment settings:
• Use Deep-Regex to train a model on the synthetic NL-RX dataset
• Build a test set (1,091 pairs) from RegexLib, eliminating long NL sentences
Results:
• Without beam search: cannot generate any correct regex
• With beam search (size 20): generates correct regexes for only 5 NL sentences (0.46%)
• Huge drop of Top-20 accuracy! (90.9% → 0.46%)
25. New Causes of Errors on the Real-world Dataset
• Variations of NL sentences: in NL-RX, NL sentences are generated from a predefined grammar; augmenting the training data may alleviate this error
• Numerical ranges:
Description: Match the numbers 100 to 199.
Ground truth: 1[0-9][0-9]
Predicted result: ([0-9])*
26. Ongoing Work: Large Real-world Benchmark
• RegexLib is too sparse to be a sufficient training set
• Collect sufficient labeled real-world data
• Synthesize data to supplement the collected real-world data

Dataset     # Pairs    # Distinct Words
NL-RX       10,000     560
RegexLib    3,619      13,491
27. Ongoing Work: Testability of Regular Expressions
• String test cases can handle the ambiguity of NL sentences
• String test cases can differentiate regular expression candidates and help select the best candidate during beam search (see the sketch below)

Description: Items with a small letter preceding “dog”, at least thrice
Ground truth: ([a-b].*dog.*){3,}
Predicted result: ([a-b]).*((dog){3,})
Test case: “adogadogadog”
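A minimal sketch of this idea: given beam-search candidates for one NL description, keep only the candidates that satisfy the string test cases (should-match and should-not-match examples). The candidate list below is illustrative, echoing the table above, not actual model output.

import re

def passes(regex, positives, negatives):
    try:
        p = re.compile(regex)
    except re.error:
        return False                  # discard ill-formed candidates from the beam
    return (all(p.fullmatch(s) for s in positives)
            and not any(p.fullmatch(s) for s in negatives))

candidates = [r"([a-b]).*((dog){3,})", r"([a-b].*dog.*){3,}"]  # beam output (illustrative)
positives = ["adogadogadog"]          # the test case from this slide
negatives = ["dog"]                   # assumed should-not-match example

print([c for c in candidates if passes(c, positives, negatives)])
# Keeps only ([a-b].*dog.*){3,}, the ground-truth-style candidate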
29. AI for SE Startups Rooted in Research
• Diffblue (http://www.diffblue.com/): Oxford University spin-off, Daniel Kroening et al.
• aiXcoder (http://aixcoder.com/): Peking University spin-off, Ge Li et al.
• Codota (https://www.codota.com/): Technion spin-off, Eran Yahav et al.
• Qualicen (https://www.qualicen.de/en/): Technical University of Munich spin-off, Benedikt Hauptmann et al.
30. Open Topics in Intelligent Software Engineering (ISE)
• How to determine whether a software engineering tool is indeed “intelligent”?
• A Turing test for such tools?
• What sub-areas/problems in ISE should the research community invest effort in as high priority?
• How to turn ISE research results into industrial/open source practice?
• …
33. Self-Driving Tesla Involved in Fatal Crash (2016 June 30)
http://www.nytimes.com/2016/07/01/business/self-driving-tesla-fatal-crash-investigation.html
“A Tesla car in autopilot crashed into a trailer because the autopilot system failed to recognize the trailer as an obstacle due to its ‘white color against a brightly lit sky’ and its ‘high ride height.’”
http://www.cs.columbia.edu/~suman/docs/deepxplore.pdf
34. Microsoft's Teen Chatbot Tay
Turned into a Genocidal Racist (2016 March 23/24)
http://www.businessinsider.com/ai-expert-explains-why-microsofts-tay-chatbot-is-so-racist-2016-3
"There are a number of precautionary
steps they [Microsoft] could have taken.
It wouldn't have been too hard to create
a blacklist of terms; or narrow the scope
of replies. They could also have simply
manually moderated Tay for the first few
days, even if that had meant slower
responses."
“businesses and other AI developers will
need to give more thought to the
protocols they design for testing and
training AIs like Tay.”
35. NSF New Program: Formal Methods in the Field
• Anticipated funding: $8 million; anticipated number of awards: 8
• Deadline: May 8, 2018
Machine Learning: The sheer complexity of machine learning algorithms
and their applications makes it hard to ensure correctness. Exploration of
new formal methods can be used to characterize boundaries of behavior,
and may bring much needed rigor to machine learning algorithms and
applications. These techniques could range from novel programming
languages and compilers for more robust machine learning to formal
verification techniques for machine learning systems that could provide
assurances of safety, correctness, and fairness. The interplay between
program synthesis and machine learning offers many interesting
possibilities to both improve machine learning and formal techniques.
https://www.nsf.gov/pubs/2018/nsf18536/nsf18536.htm
36. Problems in Testing ML Software
● ML software suffers from the “no oracle problem”
○ Previous approach @ Columbia U. on metamorphic testing: check satisfaction of a property with different inputs in the same equivalence class (see the sketch below)
https://medium.com/trustableai/testing-ai-with-metamorphic-testing-61d690001f5c
● Inaccuracy may be desirable to avoid the overfitting problem
● Auto-generated test inputs have no expected outputs
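A minimal sketch of a metamorphic test that needs no oracle: for 1-NN classification, permuting the order of the training data must not change any prediction (assuming no distance ties). The tiny 1-NN implementation and data are illustrative.

import random

def knn1_predict(train, x):
    # train: list of (features, label); Euclidean distance, k = 1
    return min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

train = [((0.0, 0.0), "A"), ((1.0, 1.0), "B"), ((0.9, 1.2), "B")]
tests = [(0.1, 0.2), (0.8, 0.9), (0.5, 0.4)]

shuffled = train[:]
random.shuffle(shuffled)
for x in tests:
    # Metamorphic relation: predictions are invariant under training-set permutation.
    assert knn1_predict(train, x) == knn1_predict(shuffled, x), x
print("metamorphic relation held")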
38. Evaluation Setup
● kNN:
○ 19 implementations (including Weka, RapidMiner, and KNIME)
○ Parameters: k = 1, Euclidean-distance metric
○ 3 data sets: Iris, Breast Cancer Wisconsin (BCW), Glass Identification
(Glass)
● Naive Bayes (NB):
○ 7 implementations (including Weka, RapidMiner, and KNIME)
○ Parameters: none
○ 3 data sets: Breast Cancer Wisconsin (BCW), Haberman’s Survival Data
(Haberman), Hayes-Roth (Hayes)
● Randomly split each data set into training and test sets with a ratio of 4:1
● The data sets contain about 1,000 instances in total
39. Effectiveness of Majority Oracle
Overall, 20.5% of the tests are deviating tests, and 97.5% of the deviating tests reveal faults (the oracle itself is sketched below).

Algorithm   Deviating Tests (%)   Fault-Revealing Tests (%)   #Faults
kNN         23.84%                100.00%                     13
NB          16.29%                94.31%                      16
kNN+NB      20.50%                97.50%                      29
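A minimal sketch of the majority oracle: run one test input through many implementations of the same algorithm, take the majority output as the expected result, and flag the implementations that deviate. The three implementations here are toy stand-ins, not the study's real kNN implementations.

from collections import Counter

def majority_oracle(impls, test_input):
    outputs = {name: f(test_input) for name, f in impls.items()}
    expected, _ = Counter(outputs.values()).most_common(1)[0]
    deviating = [name for name, out in outputs.items() if out != expected]
    return expected, deviating

impls = {
    "impl_a": lambda x: x % 3,          # hypothetical implementations
    "impl_b": lambda x: x % 3,
    "impl_c": lambda x: (x + 1) % 3,    # seeded deviation
}
print(majority_oracle(impls, 7))        # (1, ['impl_c']): impl_c produces a deviating test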
44. Fault Example 2 (in kNN)
● When k = 1, the method returns the first element without sorting (illustrated below)
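An illustrative reconstruction of the fault pattern (hypothetical code, not the subject project's source): a k = 1 shortcut that returns the first training element instead of computing the nearest neighbor.

from collections import Counter

def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def majority_label(neighbors):
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_predict_buggy(train, x, k):
    if k == 1:
        return train[0][1]            # FAULT: returns the first element, never sorts
    ranked = sorted(train, key=lambda t: dist(t[0], x))
    return majority_label(ranked[:k])

def knn_predict_fixed(train, x, k):
    ranked = sorted(train, key=lambda t: dist(t[0], x))  # k = 1 needs no special case
    return majority_label(ranked[:k])

train = [((5.0, 5.0), "A"), ((0.0, 0.0), "B")]
print(knn_predict_buggy(train, (0.1, 0.1), k=1))  # "A": wrong, nearest neighbor is "B"
print(knn_predict_fixed(train, (0.1, 0.1), k=1))  # "B"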
45. Others’ Work at Columbia/Lehigh U.:
SOSP 2017 Best Paper Award
http://www.cs.columbia.edu/~suman/docs/deepxplore.pdf
https://github.com/peikexin9/deepxplore
46. Others’ Work at Columbia U./UVa: ICSE 2018
https://arxiv.org/pdf/1708.08559.pdf
47. Our Most Recent Work:
“Testing” a Classifier (aka Adversarial Machine Learning)
Malware Detection in Adversarial Settings:
Exploiting Feature Evolutions and Confusions in Android Apps
Wei Yang, Deguang Kong, Tao Xie, and Carl A. Gunter
Annual Computer Security Applications Conference (ACSAC 2017)
http://taoxie.cs.illinois.edu/publications/acsac17-malware.pdf
48. Evasion attack on classifiers
• Goals: understand classifier robustness; generate test samples to help build better classifiers
• Example: (see the sketch below)
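A minimal sketch of an evasion attack against a toy linear detector: greedily remove the most incriminating feature until the sample crosses the decision boundary. This naive feature removal is exactly what MRV avoids; MRV mutates the app itself so the variant still runs and stays malicious. Weights and features below are hypothetical.

import numpy as np

w = np.array([2.0, 1.5, -0.5, 0.3])   # hypothetical linear-detector weights
b = -1.0
x = np.array([1.0, 1.0, 0.0, 1.0])    # binary feature vector flagged as malicious

def is_malicious(v):
    return w @ v + b > 0

x_adv = x.copy()
while is_malicious(x_adv):
    on = [i for i in range(len(x_adv)) if x_adv[i] == 1 and w[i] > 0]
    if not on:
        break
    i = max(on, key=lambda j: w[j])   # most incriminating feature still present
    x_adv[i] = 0.0                    # naively drop it (MRV instead mutates the app)
print(x_adv, is_malicious(x_adv))     # [0. 0. 0. 1.] False: the sample now evades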
50. Three practical constraints to craft a realistic
attack against mobile malware classifiers
• Preserving Malicious Behaviors.
• Maintaining the Robustness of Apps.
• Evading Malware Detectors.
51. Malware Recomposition Variation (MRV)
• Malware Evolution Attack
• Malware Confusion Attack
• Insight: follow existing patterns!
• In our mutation strategies, the feature patterns are extracted from existing malware evolution histories and existing evasive malware
Figure Credit: Trend Micro
Figure Credit: Malware News
52. Why MRV works
• Large feature set has numerous non-informative or even misleading
features.
• Insight 1: Malware detectors often confuse non-essential features in code
clones as discriminative features.
• Insight 2: Using a universal set of features for all malware families would
result in a large number of non-essential features to characterize each
family.
53. Feature Model
• A substitute model
• Resource/Temporal/Locale/Dependency model: summarizes the essential features and contextual features commonly used in malware detection
• Transferability property (see the sketch below)
[Figure: substitute-model attack workflow: labeled data trains the substitute model; adversarial samples crafted against the substitute attack the target model]
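A toy sketch of the workflow in the figure, using scikit-learn stand-ins: query the black-box target for labels, train a local substitute, craft an adversarial sample against the substitute, and rely on transferability to fool the target. Models, data, and the perturbation rule are illustrative assumptions, not the paper's feature model.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
target = DecisionTreeClassifier().fit(X, y)   # black box from the attacker's view

X_q = rng.random((300, 4))                    # attacker's query inputs
substitute = LogisticRegression().fit(X_q, target.predict(X_q))

x = X_q[substitute.predict(X_q) == 1][0].copy()
direction = substitute.coef_[0] / np.linalg.norm(substitute.coef_[0])
x_adv = x - 0.8 * direction                   # push against the substitute's boundary
print(target.predict(np.vstack([x, x_adv])))  # transferability: often flips the target too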
54. Approach
• Mutation strategy synthesis:
• Phylogenetic analysis for the evolution attack
• Similarity metric for the confusion attack
• Program mutation:
• Program transplantation/refactoring
55. Practicability of attacks
• Check the preservation of malicious behaviors
• Our impact analysis builds on the insight that the component-based nature of Android confines the impact of mutations within certain components
• Check the robustness of mutated apps
• Each mutated app was tested against 5,000 events randomly generated by Monkey to ensure that the app does not crash
56. Evaluation
• Malware detection techniques:
• AppContext, a malware detector leveraging semantic features extracted from
call graphs and control-flow graphs.
• Drebin, a malware detector leveraging eight categories of features that reside
either in the manifest file or in the disassembled code.
• Subjects: 1,917 malware and 1,935 benign apps
• Baseline:
• OCTOPUS, a syntactic app obfuscation tool similar to DroidChameleon.
• Random MRV
57. Results - Defeating existing malware detection
• ORI: original test dataset
• MRV: test dataset with adversarial samples
58. Results – Comparing with Baselines
• MRV produces many more evasive variants than both OCTOPUS and Random MRV for all three tools, especially the learning-based tools
59. Results – Comparing with Baselines
• Random MRV generates more than 320,000 variants, but only 212 of them can run without crashing (and only 2 can evade detection by AppContext)
60. Strengthening the robustness of detection
• Adversarial Training
• We randomly add half of our generated malware variants to the training set to train the model
• Variant Detector
• We create a new classifier, called the variant detector, to detect whether an app is a variant derived from existing malware
• Weight Bounding (see the sketch below)
• We constrain the weights of a few dominant features to make feature weights more evenly distributed
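A minimal sketch of the weight-bounding idea for a linear detector: clip per-feature weights so that no small set of dominant features decides the outcome. The cap value and weights are illustrative.

import numpy as np

def bound_weights(w, cap=0.5):
    # Constrain dominant features; evasion now requires changing many features.
    return np.clip(w, -cap, cap)

w = np.array([3.0, 0.2, -2.5, 0.1])   # hypothetical learned weights
print(bound_weights(w))               # [ 0.5  0.2 -0.5  0.1]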