
ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Software Engineering

ISEC'18 Keynote: Intelligent Software Engineering: Synergy between AI and Software Engineering https://isoft.acm.org/isec2018/keynote.php



  1. Intelligent Software Engineering: Synergy between AI and Software Engineering. Tao Xie, University of Illinois at Urbana-Champaign, taoxie@illinois.edu, http://taoxie.cs.illinois.edu/. Innovations in Software Engineering Conference (ISEC 2018), Feb 9-11, 2018, Hyderabad, India
  2. Artificial Intelligence ↔ Software Engineering → Intelligent Software Engineering
  3. 1st International Workshop on Intelligent Software Engineering (WISE 2017). Organizing Committee: Tao Xie (University of Illinois at Urbana-Champaign, USA), Abhik Roychoudhury (National University of Singapore, Singapore), Wolfram Schulte (Facebook, USA), Qianxiang Wang (Huawei, China). Co-located with ASE 2017. https://isofteng.github.io/wise2017/
  4. Workshop Program: 8 invited speakers, 1 panel discussion. International Workshop on Intelligent Software Engineering (WISE 2017). https://isofteng.github.io/wise2017/
  5. Artificial Intelligence ↔ Software Engineering → Intelligent Software Engineering
  6. Past: Automated Software Testing • 10 years of collaboration with Microsoft Research on Pex, a .NET test generation tool based on dynamic symbolic execution • Example challenges: path explosion [DSN'09: Fitnex], method sequence explosion [OOPSLA'11: Seeker] • Shipped in Visual Studio 2015/2017 Enterprise Edition as IntelliTest • Code Hunt [ICSE'15 JSEET] with > 6 million (6,114,978) users after 3.5 years (including registered users playing on www.codehunt.com, plus anonymous users and accounts that access http://api.codehunt.com/ directly via the documented REST APIs) https://www.codehunt.com/ http://taoxie.cs.illinois.edu/publications/ase14-pexexperiences.pdf
  7. Past: Android App Testing • 2 years of collaboration with Tencent Inc.'s WeChat testing team • Guided random test generation tool improved over Google Monkey; the resulting tool is deployed in daily WeChat testing practice • WeChat = WhatsApp + Facebook + Instagram + PayPal + Uber + … • Monthly active users: 963 million (2017 Q2) • Daily: dozens of billions of messages sent, hundreds of millions of photos uploaded, hundreds of millions of payment transactions executed • First studies on testing industrial Android apps [FSE'16 IN][ICSE'17 SEIP], going beyond the open-source Android apps that academia has focused on http://taoxie.cs.illinois.edu/publications/esecfse17industry-replay.pdf http://taoxie.cs.illinois.edu/publications/fse16industry-wechat.pdf
  8. Next: Intelligent Software Testing(?) • Learning from others working on the same things • Our work on mining API usage method sequences to test the API [ESEC/FSE'09: MSeqGen] • Visser et al. Green: Reducing, reusing and recycling constraints in program analysis. FSE'12. • Learning from others working on similar things • Jia et al. Enhancing reuse of constraint solutions to improve symbolic execution. ISSTA'15. • Aquino et al. Heuristically Matching Solution Spaces of Arithmetic Formulas to Efficiently Reuse Solutions. ICSE'17. [Figure: Jia et al. ISSTA'15]
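
Constraint-solution reuse of the kind Green and the follow-up work explore is easy to prototype. A minimal sketch (the canonicalization and the stub solver below are deliberately simplified stand-ins, not the cited systems): canonicalize each path constraint so alpha-equivalent formulas share a cache key, and call the expensive solver only on a cache miss.

```python
import re

KEYWORDS = {"and", "or", "not"}

def canonicalize(constraint: str) -> str:
    """Rename variables to v0, v1, ... in order of first occurrence, so
    alpha-equivalent constraints map to the same cache key."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name in KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = "v%d" % len(mapping)
        return mapping[name]
    return re.sub(r"[A-Za-z_]\w*", rename, constraint)

class ReusingSolver:
    def __init__(self, solve_fn):
        self.cache = {}           # canonical constraint -> cached solution
        self.solve_fn = solve_fn  # expensive underlying solver call

    def solve(self, constraint: str):
        key = canonicalize(constraint)
        if key not in self.cache:  # the solver runs only on a cache miss
            self.cache[key] = self.solve_fn(constraint)
        return self.cache[key]

# "x > 5 and x < 9" and "y > 5 and y < 9" hit the same cache entry.
solver = ReusingSolver(solve_fn=lambda c: {"witness": 6})  # stand-in for a real SMT call
print(solver.solve("x > 5 and x < 9"))  # solver invoked
print(solver.solve("y > 5 and y < 9"))  # cache hit: same canonical form
```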
  9. Mining and Understanding Software Enclaves (MUSE), DARPA. http://materials.dagstuhl.de/files/15/15472/15472.SureshJagannathan1.Slides.pdf
  10. Pliny: Mining Big Code to help programmers (Rice U., UT Austin, Wisconsin, GrammaTech); $11 million (4 years). http://pliny.rice.edu/ http://news.rice.edu/2014/11/05/next-for-darpa-autocomplete-for-programmers-2/
  11. Program Synthesis: NSF Expeditions in Computing; $10 million (5 years). https://excape.cis.upenn.edu/ https://www.sciencedaily.com/releases/2016/08/160815134941.htm
  12. Software-related data are pervasive: runtime traces, program logs, system events, perf counters, …; usage logs, user surveys, online forum posts, blogs & Twitter, …; source code, bug history, check-in history, test cases, keystrokes, …
  13. Software Analytics: software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. In collaboration with Microsoft Research Asia. http://taoxie.cs.illinois.edu/publications/malets11-analytics.pdf
  14. Past: Software Analytics (in collaboration with Microsoft Research Asia) • StackMine [ICSE'12, IEEE Software'13]: performance debugging in the large • Data source: performance call-stack traces from Windows end users • Analytics output: ranked clusters of call-stack traces based on shared patterns • Impact: deployed/used in the daily practice of the Windows Performance Analysis team • XIAO [ACSAC'12, ICSE'17 SEIP]: code-clone detection and search • Data source: source code repos (plus, optionally, a given code segment) • Analytics output: code clones • Impact: shipped in Visual Studio 2012; deployed/used in the daily practice of the Microsoft Security Response Center
  15. Past: Software Analytics (in collaboration with Microsoft Research Asia) • Service Analysis Studio [ASE'13-EX]: service incident management • Data source: transaction logs, system metrics, past incident reports • Analytics output: healing suggestions/likely root causes of the given incident • Impact: deployed and used by an important Microsoft service (hundreds of millions of users) for incident management
  16. Next: Intelligent Software Analytics(?) Microsoft Research Asia, Software Analytics Group: Smart Data Discovery; IN4: INteractive, INtuitive, INstant, INsights; Quick Insights -> Microsoft Power BI; Gartner Magic Quadrant for Business Intelligence & Analytics Platforms
  17. Microsoft Research Asia, Software Analytics Group. https://www.hksilicon.com/articles/1213020
  18. Deep Learning for NL→Regex: Get Real! Existing approaches on NL → regular expressions ([Ranta 1998], [Kushman and Barzilay 2013], [Locascio et al. 2016]) used only synthetic data for training and testing. Are these approaches effective in real-world situations? Zhong et al. Generating Regular Expressions from Natural Language Specifications: Are We There Yet? In AAAI 2018 Workshop on NLP for Software Engineering (NL4SE 2018). http://taoxie.cs.illinois.edu/publications/nl4se18-regex.pdf
  19. Characteristic Study. Synthetic datasets: KB13 [Kushman and Barzilay 2013] (824 pairs) • Write NL sentences to capture the example strings; NL-RX [Locascio et al. 2016] (10,000 pairs) • Parse a regex and generate initial NL sentences based on a predefined grammar • Paraphrase the generated sentences. Real-world dataset: RegexLib (3,619 pairs) • From regexlib.com
  20. Complexity of regular expressions • Synthetic datasets support only a subset of the regex language: e.g., ‘?’ ∈ RegexLib, but ∉ NL-RX or KB13 [Figure: length statistics of regular expressions]
  21. Complexity of NL sentences • # of distinct words: 13,491 (RegexLib) vs. 715 (KB13) vs. 560 (NL-RX) [Figure: #words statistics of NL sentences]
  22. Experimental Study. Deep-Regex [Locascio et al. 2016]: regular expression generation framed as machine translation, using sequence-to-sequence learning. https://github.com/nicholaslocascio/deep-regex https://aclweb.org/anthology/D/D16/D16-1197.pdf
  23. Effectiveness on Synthetic Datasets • String-Equal: exact matching • DFA-Equal: semantic matching (DFA: deterministic finite automaton)
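
For intuition, a minimal Python sketch of the two metrics (ours, not from the paper). True DFA-Equal would compare minimized automata; `behaviorally_equal` below only approximates semantic equivalence by enumerating short strings over a small alphabet, so treat it as illustrative.

```python
import itertools
import re

def string_equal(r1: str, r2: str) -> bool:
    """String-Equal: the two regexes are textually identical."""
    return r1 == r2

def behaviorally_equal(r1: str, r2: str, alphabet="ab01", max_len=4) -> bool:
    """Approximation of DFA-Equal: compare matching behavior on all
    strings over a small alphabet up to max_len. A sound check would
    compare minimized DFAs instead of sampling."""
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False
    return True

# "a+" and "aa*" denote the same language, but String-Equal misses that.
print(string_equal("a+", "aa*"))        # False
print(behaviorally_equal("a+", "aa*"))  # True (on the sampled strings)
```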
  24. Experiments on Real-world Dataset. Experiment settings: use Deep-Regex to train a model on the synthetic NL-RX dataset; build a testing set (1,091 pairs) from RegexLib; eliminate long NL sentences. Results: without beam search, the model cannot generate any correct regex; with beam search (size 20), it generates correct regexes for only 5 NL sentences (0.46%). Huge drop in top-20 accuracy! (90.9% → 0.46%)
  25. New Causes of Errors on Real-world Dataset. Variations of NL sentences • NL-RX: NL sentences are generated from a predefined grammar • Augmenting training data may alleviate the error. Numerical ranges, e.g.: Description: "Match the numbers 100 to 199." Ground truth: 1[0-9][0-9]. Predicted result: ([0-9])*
  26. Ongoing Work: Large Real-world Benchmark Dataset. RegexLib is too sparse to be a sufficient training set. Plan: collect sufficient labeled real-world data; synthesize data to supplement the collected real-world data. (NL-RX: 10,000 pairs, 560 distinct words; RegexLib: 3,619 pairs, 13,491 distinct words)
  27. Ongoing Work: Testability of Regular Expressions. String test cases can handle the ambiguity of NL sentences; they can differentiate regular expression candidates and help select the best candidate during beam search (see the sketch below). Example: Description: "Items with a small letter preceding 'dog', at least thrice." Ground truth: ([a-b].*dog.*){3,}. Predicted result: ([a-b]).*((dog){3,}). Test case: "adogadogadog"
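
A minimal sketch of that selection step (the helper name and test-case encoding are illustrative, not from the paper): rank the beam-search candidates by how many string test cases they satisfy.

```python
import re

def pick_best_candidate(candidates, test_cases):
    """Rank regex candidates by the number of string test cases they
    pass; each test case is (string, should_fully_match)."""
    def score(regex):
        try:
            pattern = re.compile(regex)
        except re.error:
            return -1  # discard syntactically invalid candidates
        return sum(bool(pattern.fullmatch(s)) == expected
                   for s, expected in test_cases)
    return max(candidates, key=score)

# The slide's example: a small letter preceding "dog", at least thrice.
candidates = [r"([a-b]).*((dog){3,})",  # mispredicted candidate
              r"([a-b].*dog.*){3,}"]    # ground truth
tests = [("adogadogadog", True)]
print(pick_best_candidate(candidates, tests))  # ([a-b].*dog.*){3,}
```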
  28. https://medium.com/ai-for-software-engineering/ai-for-software-engineering-industry-landscape-d8c7c7f82ba
  29. AI for SE Startups Rooted in Research • Diffblue (http://www.diffblue.com/): Oxford University spin-off, Daniel Kroening et al. • aiXcoder (http://aixcoder.com/): Peking University spin-off, Ge Li et al. • Codota (https://www.codota.com/): Technion spin-off, Eran Yahav et al. • Qualicen (https://www.qualicen.de/en/): Technical University of Munich spin-off, Benedikt Hauptmann et al.
  30. Open Topics in Intelligent Software Engineering (ISE) • How to determine whether a software engineering tool is indeed "intelligent"? A Turing test for such tools? • Which sub-areas/problems in ISE should the research community prioritize? • How to turn ISE research results into industrial/open-source practice? • …
  31. Artificial Intelligence ↔ Software Engineering → Intelligent Software Engineering
  32. White-House-Sponsored Workshop (June 28, 2016). http://www.cmu.edu/safartint/
  33. Self-Driving Tesla Involved in Fatal Crash (June 30, 2016). http://www.nytimes.com/2016/07/01/business/self-driving-tesla-fatal-crash-investigation.html "A Tesla car in autopilot crashed into a trailer because the autopilot system failed to recognize the trailer as an obstacle due to its 'white color against a brightly lit sky' and its 'high ride height'." http://www.cs.columbia.edu/~suman/docs/deepxplore.pdf
  34. Microsoft's Teen Chatbot Tay Turned into Genocidal Racist (March 23-24, 2016). http://www.businessinsider.com/ai-expert-explains-why-microsofts-tay-chatbot-is-so-racist-2016-3 "There are a number of precautionary steps they [Microsoft] could have taken. It wouldn't have been too hard to create a blacklist of terms; or narrow the scope of replies. They could also have simply manually moderated Tay for the first few days, even if that had meant slower responses." "Businesses and other AI developers will need to give more thought to the protocols they design for testing and training AIs like Tay."
  35. NSF New Program: Formal Methods in the Field • Anticipated funding: $8 million; number of awards: 8 • Deadline: May 8, 2018. Machine learning: the sheer complexity of machine learning algorithms and their applications makes it hard to ensure correctness. Exploration of new formal methods can be used to characterize boundaries of behavior, and may bring much-needed rigor to machine learning algorithms and applications. These techniques could range from novel programming languages and compilers for more robust machine learning to formal verification techniques for machine learning systems that could provide assurances of safety, correctness, and fairness. The interplay between program synthesis and machine learning offers many interesting possibilities to improve both machine learning and formal techniques. https://www.nsf.gov/pubs/2018/nsf18536/nsf18536.htm
  36. Problems in Testing ML Software ● ML software suffers from the "no oracle problem" ○ Previous approach at Columbia U. based on metamorphic testing: check satisfaction of a property across different inputs in equivalence classes (see the sketch below) https://medium.com/trustableai/testing-ai-with-metamorphic-testing-61d690001f5c ● Inaccuracy may be desirable to avoid the overfitting problem ● Auto-generated test inputs have no expected outputs
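
To make the metamorphic idea concrete, here is a self-contained sketch (the toy 1-NN classifier and the translation-invariance property are our own simplification, not the cited Columbia work): a 1-NN classifier with Euclidean distance should be invariant to translating every point by the same offset, so equivalent inputs must produce equal outputs even though no single expected output is known.

```python
def nearest_neighbor_predict(train, labels, x):
    """Toy 1-NN classifier using squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(pt, x)) for pt in train]
    return labels[dists.index(min(dists))]

def metamorphic_translation_test(train, labels, queries, offset):
    """Metamorphic relation: translating all training and query points
    by the same offset must not change any prediction."""
    shift = lambda pt: tuple(a + b for a, b in zip(pt, offset))
    shifted_train = [shift(pt) for pt in train]
    for q in queries:
        original = nearest_neighbor_predict(train, labels, q)
        translated = nearest_neighbor_predict(shifted_train, labels, shift(q))
        assert original == translated, f"metamorphic violation at {q}"

train = [(0.0, 0.0), (1.0, 1.0), (4.0, 5.0)]
labels = ["neg", "neg", "pos"]
metamorphic_translation_test(train, labels,
                             queries=[(0.2, 0.1), (3.5, 4.0)],
                             offset=(10.0, -3.0))
print("no metamorphic violations found")
```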
  37. Multiple-Implementation Testing. Srisakaokul et al. Multiple-Implementation Testing of Supervised Learning Software. In AAAI-18 Workshop on Engineering Dependable and Secure Machine Learning Systems (EDSMLS 2018). http://taoxie.cs.illinois.edu/publications/edsmls18-mitest.pdf
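
A minimal sketch of the majority-oracle idea (the toy "implementations" below are stand-ins for independent libraries such as Weka, RapidMiner, and KNIME; all names are ours): run each test input through every implementation, take the majority output as the expected result, and flag any implementation that deviates.

```python
from collections import Counter

def majority_oracle_test(implementations, test_inputs):
    """Multiple-implementation testing: for each input, the majority
    output across implementations serves as the test oracle; any
    implementation disagreeing with the majority is flagged."""
    deviations = []
    for x in test_inputs:
        outputs = {name: impl(x) for name, impl in implementations.items()}
        majority, _ = Counter(outputs.values()).most_common(1)[0]
        for name, out in outputs.items():
            if out != majority:
                deviations.append((name, x, out, majority))
    return deviations

# Three stand-in "implementations" of the same function; impl_c is buggy.
impls = {
    "impl_a": lambda x: x * x,
    "impl_b": lambda x: x ** 2,
    "impl_c": lambda x: x * x if x >= 0 else -(x * x),  # sign bug
}
for name, x, got, expected in majority_oracle_test(impls, [-2, -1, 0, 3]):
    print(f"{name} deviates on input {x}: got {got}, majority says {expected}")
```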
  38. Evaluation Setup ● kNN: ○ 19 implementations (including Weka, RapidMiner, and KNIME) ○ Parameters: k = 1, Euclidean-distance metric ○ 3 data sets: Iris, Breast Cancer Wisconsin (BCW), Glass Identification (Glass) ● Naive Bayes (NB): ○ 7 implementations (including Weka, RapidMiner, and KNIME) ○ Parameters: none ○ 3 data sets: Breast Cancer Wisconsin (BCW), Haberman's Survival Data (Haberman), Hayes-Roth (Hayes) ● Randomly split each data set into training and test sets with a 4:1 ratio ● The data sets contain about 1,000 instances in total
  39. Effectiveness of Majority Oracle. Overall, 20.5% of the tests are deviating tests, and 97.5% of the deviating tests reveal faults. (Deviating tests / fault-revealing tests / #faults under the majority oracle: kNN: 23.84% / 100.00% / 13; NB: 16.29% / 94.31% / 16; kNN+NB: 20.50% / 97.50% / 29)
  40. Effectiveness of Majority Oracle (cont.)
  41. Effectiveness of Majority Oracle (cont.)
  42. Fault Example 1 (in kNN) ● Returns NaN ● !(max==min) should be !(maxValue==minValue)
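
This fault pattern is easy to reproduce. An illustrative Python analogue (the actual faulty code was a Java kNN implementation; variable names are ours): the guard tests one pair of variables while the division uses another, so a constant-valued attribute slips past the check and produces NaN.

```python
import numpy as np

def normalize_buggy(column, other_max, other_min):
    """Buggy min-max normalization mirroring the reported fault: the
    guard checks the wrong pair of variables (other_max/other_min
    instead of max_value/min_value), so a constant column divides by
    zero and yields NaN."""
    max_value, min_value = column.max(), column.min()
    if not other_max == other_min:  # bug: should test max_value == min_value
        return (column - min_value) / (max_value - min_value)
    return np.zeros_like(column)

constant_column = np.array([5.0, 5.0, 5.0])
with np.errstate(divide="ignore", invalid="ignore"):
    print(normalize_buggy(constant_column, other_max=9.0, other_min=1.0))
# -> [nan nan nan]: any distance computed from these values is NaN too
```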
  43. Effectiveness of Majority Oracle (cont.)
  44. Fault Example 2 (in kNN) ● When k = 1, the method returns the first element without sorting
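
An illustrative Python analogue of this fault as well (again, the original was Java; names are ours): a special-cased k = 1 branch returns the first training element's label instead of the nearest neighbor's.

```python
def knn_predict_buggy(train, labels, x, k):
    """Buggy kNN mirroring the reported fault: the k == 1 shortcut
    skips the distance sort and returns the first training example's
    label rather than the nearest neighbor's."""
    if k == 1:
        return labels[0]  # bug: no distance computation or sorting happened
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    top = [labels[i] for i in order[:k]]
    return max(set(top), key=top.count)  # majority vote among k nearest

train = [(10.0, 10.0), (0.0, 0.0)]
labels = ["far", "near"]
print(knn_predict_buggy(train, labels, x=(0.1, 0.2), k=1))  # "far", should be "near"
```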
  45. 45. Other’s Work at Columbia/Lehigh U.: SOSP 2017 Best Paper Award http://www.cs.columbia.edu/~suman/docs/deepxplore.pdf https://github.com/peikexin9/deepxplore
  46. 46. Other’s Work at Columbia U./UVa: ICSE 2018 https://arxiv.org/pdf/1708.08559.pdf
  47. Our Most Recent Work: "Testing" a Classifier (aka Adversarial Machine Learning). Malware Detection in Adversarial Settings: Exploiting Feature Evolutions and Confusions in Android Apps. Wei Yang, Deguang Kong, Tao Xie, and Carl A. Gunter. Annual Computer Security Applications Conference (ACSAC 2017). http://taoxie.cs.illinois.edu/publications/acsac17-malware.pdf
  48. Evasion attack on classifiers • Goals: understand classifier robustness; generate testing samples to help build better classifiers • Example:
  49. Generating adversarial examples helps build better classifiers. [Figure credit: Goodfellow 2016]
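
The Goodfellow-style construction is a one-liner once gradients are available. A minimal fast-gradient-sign-method sketch for a logistic-regression classifier (weights, inputs, and epsilon are made up; a realistic malware attack must additionally respect the feasibility constraints discussed next):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_adversarial(x, y, w, b, eps):
    """Fast gradient sign method for logistic regression: perturb x by
    eps in the direction that increases the cross-entropy loss, i.e.,
    the sign of the loss gradient with respect to the input."""
    grad_x = (sigmoid(w @ x + b) - y) * w  # dLoss/dx for cross-entropy
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -1.0, 0.5])
b = -0.3
x = np.array([0.4, 0.2, 0.9])  # correctly classified as malicious (y = 1)
x_adv = fgsm_adversarial(x, y=1, w=w, b=b, eps=0.25)
print(sigmoid(w @ x + b), "->", sigmoid(w @ x_adv + b))
# score drops below 0.5: the perturbed sample evades the classifier
```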
  50. Three practical constraints to craft a realistic attack against mobile malware classifiers • Preserving malicious behaviors • Maintaining the robustness of apps • Evading malware detectors
  51. Malware Recomposition Variation (MRV) • Malware evolution attack • Malware confusion attack • Insight: follow existing patterns! In our mutation strategies, the feature patterns are extracted from existing malware evolution histories and existing evasive malware. [Figure credits: Trend Micro; Malware News]
  52. Why MRV works • A large feature set has numerous non-informative or even misleading features • Insight 1: malware detectors often confuse non-essential features in code clones as discriminative features • Insight 2: using a universal set of features for all malware families results in a large number of non-essential features to characterize each family
  53. Feature Model • A substitute model • Resource Temporal Locale Dependency model • Summarizes the essential features and contextual features commonly used in malware detection • Transferability property. [Figure: a substitute model trained on labeled data generates adversarial samples used to attack the target model]
  54. Approach • Mutation strategy synthesis: phylogenetic analysis for the evolution attack; similarity metric for the confusion attack • Program mutation: program transplantation/refactoring
  55. Practicability of attacks • Check that malicious behaviors are preserved • Our impact analysis is based on the insight that the component-based nature of Android constrains the impact of mutations within certain components • Check the robustness of mutated apps • Each mutated app was tested against 5,000 events randomly generated by Monkey to ensure that the app does not crash (see the scripted check below)
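
That robustness check can be scripted. A sketch driving Android's Monkey over adb (the package name is a placeholder; this assumes a connected device or emulator and the standard `adb shell monkey` interface):

```python
import subprocess

def survives_monkey(package, events=5000, seed=42):
    """Robustness check in the spirit of the slide: fire `events`
    pseudo-random UI events at the app via Android's Monkey and treat
    a clean exit with no crash report as 'did not crash'."""
    result = subprocess.run(
        ["adb", "shell", "monkey", "-p", package, "-s", str(seed),
         "--throttle", "100", str(events)],
        capture_output=True, text=True)
    crashed = "// CRASH" in result.stdout or result.returncode != 0
    return not crashed

print(survives_monkey("com.example.mutated_app"))  # placeholder package name
```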
  56. Evaluation • Malware detection techniques: AppContext, a malware detector leveraging semantic features extracted from call graphs and control-flow graphs; Drebin, a malware detector leveraging eight categories of features that reside either in the manifest file or in the disassembled code • Subjects: 1,917 malware and 1,935 benign apps • Baselines: OCTOPUS, a syntactic app obfuscation tool similar to DroidChameleon; Random MRV
  57. Results: Defeating Existing Malware Detection • ORI: original test dataset • MRV: test dataset with adversarial samples
  58. Results: Comparing with Baselines • MRV produces many more evasive variants than both OCTOPUS and Random MRV for all three tools, especially the learning-based tools
  59. Results: Comparing with Baselines • Random MRV generates more than 320,000 variants, but only 212 of them can run without crashing (and only 2 can evade detection by AppContext)
  60. Strengthening the robustness of detection • Adversarial training: we randomly chose half of our generated malware variants to add to the training set used to train the model • Variant detector: we create a new classifier, called the variant detector, to detect whether an app is a variant derived from existing malware • Weight bounding: we constrain the weights on a few dominant features to make feature weights more evenly distributed (a sketch follows)
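
A hedged sketch of the weight-bounding idea for a linear detector (the percentile cap is our guess at one plausible instantiation; the paper's exact scheme may differ): clip the largest-magnitude feature weights so that no small set of features dominates the decision.

```python
import numpy as np

def bound_weights(w, cap_percentile=90):
    """Weight bounding (illustrative): clip each weight's magnitude to
    the given percentile of |w|, spreading influence across features so
    that mutating a few dominant features no longer flips the decision."""
    cap = np.percentile(np.abs(w), cap_percentile)
    return np.clip(w, -cap, cap)

w = np.array([8.0, -6.5, 0.3, 0.2, -0.1, 0.4])  # two dominant features
print(bound_weights(w))  # dominant weights pulled down toward the cap
```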
  61. Results: Strengthening the Robustness of Detection
  62. Artificial Intelligence ↔ Software Engineering → Intelligent Software Engineering
  63. Thank You! Q & A. This work was supported in part by NSF under grant nos. CCF-1409423, CNS-1434582, CNS-1513939, and CNS-1564274.
  64. Artificial Intelligence ↔ Software Engineering → Intelligent Software Engineering
