Software Analytics: Data Analytics for Software Engineering

  1. Software Analytics: Data Analytics for Software Engineering. Achievements and Opportunities. Tao Xie, Department of Computer Science, University of Illinois at Urbana-Champaign, USA, taoxie@illinois.edu. In collaboration with Microsoft Research; most slides adapted from slides co-presented with Dongmei Zhang (Microsoft Research, http://research.microsoft.com/en-us/groups/sa/)
  2. New era… Software itself is changing: from software to services.
  3.–6. How people use software is changing… (progressive build) From individual, isolated use with not much data/content generation, to social, collaborative use with huge amounts of data/artifacts generated anywhere, anytime.
  7.–9. How software is built and operated is changing… (progressive build) From long product cycles, experience and gut feeling, in-lab testing, centralized development, and code-centric practice, to continuous release, informed decision making, debugging in the large, distributed development, and data-pervasive practice, …
  10.–11. Software Analytics. Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. [Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf] See also http://research.microsoft.com/en-us/groups/sa/ and http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
  12. Research topics: the trinity view. Software Users, Software Development Process, Software System. • Covering different areas of the software domain • Throughout the entire development cycle • Enabling practitioners to obtain insights
  13. Data sources. Runtime traces, program logs, system events, perf counters, … Usage logs, user surveys, online forum posts, blog and Twitter posts, … Source code, bug history, check-in history, test cases, eye tracking, MRI/EMG, …
  14.–16. Target audience: software practitioners. Developer, tester, program manager, usability engineer, designer, support engineer, management personnel, operations engineer.
  17. Output: insightful information. • Conveys meaningful and useful understanding or knowledge towards completing the target task • Not easily attainable by directly investigating raw data without the aid of analytics technologies • Examples: it is easy to count the number of re-opened bugs, but how do we find out the primary reasons for these re-opened bugs? When the availability of an online service drops below a threshold, how do we localize the problem?
  18. Output: actionable information. • Enables software practitioners to come up with concrete solutions towards completing the target task • Examples: Why were bugs re-opened? A list of bug groups, each with the same reason for re-opening. Why did the availability of online services drop? A list of problematic areas with associated confidence values.
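The distinction above can be made concrete with a minimal sketch: turning a flat list of re-opened bugs into actionable groups, one group per re-opening reason. The bug records and reason labels below are invented for illustration; in a real analytics pipeline the reason would itself be mined from bug text and history rather than read from a field.

```python
from collections import defaultdict

# Hypothetical re-opened bug records; in practice the "reason" label
# would be the output of an analysis step, not an input field.
bugs = [
    {"id": 101, "reason": "incomplete fix"},
    {"id": 102, "reason": "wrong environment"},
    {"id": 103, "reason": "incomplete fix"},
]

def group_by_reason(bugs):
    """Group re-opened bugs so each group shares one re-opening reason."""
    groups = defaultdict(list)
    for bug in bugs:
        groups[bug["reason"]].append(bug["id"])
    return dict(groups)

print(group_by_reason(bugs))
# {'incomplete fix': [101, 103], 'wrong environment': [102]}
```

The actionable output is not the count of re-opened bugs but the grouped list, which tells an engineer which fix process to repair first.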
  19. Research topics and technology pillars. Vertical: Software Users, Software Development Process, Software System. Horizontal: Information Visualization, Data Analysis Algorithms, Large-scale Computing.
  20.–21. Connection to practice. • Software analytics is naturally tied to software development practice • Getting real: real data, real problems, real users, real tools.
  22. Outline. • Software Engineering for Big Data • Overview of Software Analytics • Selected projects: StackMine (performance debugging in the large), XIAO (scalable code clone analysis), SAS (incident management of online services), Code Hunt/Pex4Fun (teaching and learning via interactive gaming)
  23. StackMine: performance debugging in the large via mining millions of stack traces. http://research.microsoft.com/en-us/groups/sa/stackmine_icse2012.pdf
  24.–25. Performance issues in the real world. • One of the top user complaints • Impacting large numbers of users every day • High impact on usability and productivity. Examples: high disk I/O, high CPU consumption. As modern software systems tend to get more and more complex, given limited time and resources before software release, development-site testing and debugging become more and more insufficient to ensure satisfactory software performance.
  26.–35. Performance debugging in the large (system diagram, progressive build): traces are collected over the network into trace storage; trace analysis leads to bug filing in the bug database; discovered patterns feed a problematic pattern repository used for pattern matching and bug updates. Trace analysis is the key to issue discovery and the bottleneck of scalability. Open questions: How many issues are still unknown? Which trace file should I investigate first?
  36. Problem definition. Given OS traces collected from tens of thousands (potentially millions) of users, help domain experts identify impactful program execution patterns (those that cause the most impactful underlying performance problems) with limited time and resources.
  37. Challenges. • Combination of expertise: generic machine learning tools without domain-knowledge guidance do not work well • Highly complex analysis: numerous program runtime combinations triggering performance problems; multi-layer runtime components from application to kernel being intertwined • Large-scale trace data: TBs of trace files and increasing; millions of events in a single trace stream
  38. Technical highlights. • Machine learning for the systems domain: formulate the discovery of problematic execution patterns as callstack mining and clustering; systematic mechanism to incorporate domain knowledge • Interactive performance analysis system: parallel mining infrastructure based on HPC + MPI; visualization-aided interactive exploration
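As a rough illustration of the callstack mining and clustering formulation (not StackMine's actual algorithm), callstacks can be compared with a normalized longest-common-subsequence similarity and greedily clustered; the frame names below are made up.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two frame lists."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def similarity(s1, s2):
    """Normalized callstack similarity in [0, 1]."""
    return 2.0 * lcs_len(s1, s2) / (len(s1) + len(s2))

def cluster(stacks, threshold=0.6):
    """Greedy clustering: attach each stack to the first close-enough cluster."""
    clusters = []
    for s in stacks:
        for c in clusters:
            if similarity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

stacks = [
    ["ntoskrnl!KiSwap", "app!ReadFile", "app!ParseConfig"],
    ["ntoskrnl!KiSwap", "app!ReadFile", "app!ParseIni"],
    ["gdi!Render", "app!Paint"],
]
print(len(cluster(stacks)))  # 2: the two ReadFile stacks group together
```

Incorporating domain knowledge, as StackMine does, would mean weighting frames (e.g., ignoring boilerplate kernel frames) rather than treating all frames equally as here.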
  39. Impact. "We believe that the MSRA tool is highly valuable and much more efficient for mass trace (100+ traces) analysis. For 1000 traces, we believe the tool saves us 4-6 weeks of time to create new signatures, which is quite a significant productivity boost." Highly effective new-issue discovery on Windows mini-hangs; continuous impact on future Windows versions.
  40. XIAO: scalable code clone analysis (2012). http://research.microsoft.com/jump/175199
  41. XIAO: Code Clone Analysis. • Motivation: copy-and-paste is a common developer behavior; a real tool widely adopted internally and externally • XIAO enables code clone analysis with high tunability, high scalability, high compatibility, and high explorability
  42. High tunability: what you tune is what you get. Intuitive similarity metric: effective control of the degree of syntactical differences between two code snippets, e.g., between

      for (i = 0; i < n; i++) { a++; b++; c = foo(a, b); d = bar(a, b, c); e = a + c; }

  and

      for (i = 0; i < n; i++) { c = foo(a, b); a++; b++; d = bar(a, b, c); e = a + d; e++; }
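A minimal sketch of such a similarity metric, using bag-of-tokens comparison as a simplified stand-in for XIAO's actual metric (which works on statement sequences with tunable thresholds):

```python
import re

def tokens(code):
    """Crude lexer: identifiers, numbers, and single-char operators."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def clone_similarity(a, b):
    """Bag-of-tokens similarity in [0, 1]: shared tokens (with
    multiplicity) over total tokens. A real clone metric would also
    account for statement order and structure."""
    ta, tb = tokens(a), tokens(b)
    remaining = list(tb)
    shared = 0
    for t in ta:
        if t in remaining:
            remaining.remove(t)
            shared += 1
    return 2.0 * shared / (len(ta) + len(tb))

snippet1 = "for (i = 0; i < n; i++) { a++; b++; c = foo(a, b); d = bar(a, b, c); e = a + c; }"
snippet2 = "for (i = 0; i < n; i++) { c = foo(a, b); a++; b++; d = bar(a, b, c); e = a + d; e++; }"
print(round(clone_similarity(snippet1, snippet2), 2))
# high similarity despite the reordering and the extra statement
```

Tuning then means picking the threshold above which two snippets count as clones: a high threshold catches near-identical copies only, a lower one also surfaces copy-paste-modify clones like the pair above.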
  43. High scalability. • Four-step analysis process: pre-processing, coarse matching, pruning, fine matching • Easily parallelizable based on source-code partitioning
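The four-step process can be sketched as a pipeline. The stage implementations below are deliberately minimal stand-ins (XIAO's real stages work on tokenized, partitioned source code), but they show the shape: a cheap coarse filter and a pruning step keep the expensive fine matching tractable.

```python
def preprocess(snippet):
    """Normalize: collapse whitespace, lowercase identifiers."""
    return " ".join(snippet.split()).lower()

def coarse_match(a, b):
    """Cheap filter: snippets share at least half of their token set."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) >= len(sa | sb) / 2

def prune(pairs):
    """Drop candidate pairs too short to be interesting clones."""
    return [(a, b) for a, b in pairs if len(a.split()) >= 3]

def fine_match(a, b):
    """Expensive check; here simply equality after normalization."""
    return a == b

def find_clones(snippets):
    norm = [preprocess(s) for s in snippets]
    candidates = [(a, b) for i, a in enumerate(norm)
                  for b in norm[i + 1:] if coarse_match(a, b)]
    return [p for p in prune(candidates) if fine_match(*p)]

code = ["int x = 0 ;", "int  X = 0 ;", "return y ;"]
print(len(find_clones(code)))  # 1: the first two snippets normalize identically
```

Parallelization then falls out naturally: partition the source files, run the pipeline per partition pair, and merge the resulting clone lists.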
  44. High compatibility. • Compiler independent • Lightweight built-in parsers for C/C++ and C# • Open architecture for plug-in parsers to support different languages • Easy adoption by product teams: different build environments; almost zero cost for trial
  45. High explorability. Clone exploration: clone navigation based on source-tree hierarchy; pivoting of folder-level statistics; folder-level statistics; clone-function list in the selected folder; clone-function filters; sorting by bug or refactoring potential; tagging. Clone comparison: block correspondence; block types; block navigation; copying; bug filing; tagging.
  46. Scenarios and solutions. Quality gates at milestones: architecture refactoring, code clone clean-up, bug fixing. Post-release maintenance: security bug investigation, bug investigation for sustained engineering. Development and testing: checking for similar issues before check-in, reference info for code review, supporting tool for bug triage. Offered as online code clone search and offline code clone analysis.
  47. Benefiting the developer community. Available in Visual Studio 2012 RC: searching for similar snippets to fix a bug once, and finding refactoring opportunities.
  48. More secure Microsoft products. Code Clone Search service integrated into the workflow of the Microsoft Security Response Center; over 590 million lines of code indexed across multiple products; real security issues proactively identified and addressed.
  49. Example: MS Security Bulletin MS12-034, Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012. Three publicly disclosed vulnerabilities and seven privately reported ones were involved; one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document. The root cause was an insufficient bounds check within the font-parsing subsystem of win32k.sys, with cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and the Windows Journal viewer. From the Microsoft Technet blog about this bulletin: "However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a 'Cloned Code Detection' system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
  50. SAS: incident management of online services. http://research.microsoft.com/apps/pubs/?id=202451
  51. Motivation. Incident management (IcM) is a critical task in assuring service quality: online services are increasingly popular and important, and high service quality is the key.
  52. Incident management: workflow. Detect a service issue → alert on-call engineers (OCEs) → investigate the problem → restore the service → fix the root cause via postmortem analysis.
  53. Incident management: characteristics. Shrink-wrapped software debugging: root cause and fix; debugger; controlled environment. Online service incident management: workaround; no debugger; live data.
  54. Incident management: challenges. Large volume of noisy data; highly complex problem space; no knowledge of the entire system; knowledge not well organized.
  55. SAS: incident management of online services. SAS was developed and deployed to effectively reduce MTTR (mean time to restore) by automatically analyzing monitoring data. Design principles of SAS: automating analysis; handling heterogeneity; accumulating knowledge; supporting human-in-the-loop (HITL).
  56. Techniques overview. • System metrics: identifying incident beacons • Transaction logs: mining suspicious execution patterns • Historical incidents: mining historical workaround solutions
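The transaction-log technique can be illustrated with a contrast-mining sketch: score each log event by how much more often it appears in failed transactions than in successful ones. This is a crude stand-in for SAS's actual pattern mining, and the event names are invented.

```python
from collections import Counter

def suspicious_patterns(failed, passed, min_delta=0.5):
    """Return events whose support among failed transactions exceeds
    their support among passed ones by at least min_delta."""
    def support(logs):
        # Fraction of transaction logs containing each event.
        c = Counter(e for log in logs for e in set(log))
        return {e: n / len(logs) for e, n in c.items()}
    sf, sp = support(failed), support(passed)
    return sorted(e for e in sf if sf[e] - sp.get(e, 0.0) >= min_delta)

failed = [["open", "retry", "timeout"], ["open", "timeout"]]
passed = [["open", "close"], ["open", "send", "close"]]
print(suspicious_patterns(failed, passed))
# ['retry', 'timeout'] -- 'open' appears everywhere and is not suspicious
```

SAS mines whole execution patterns (event sequences) rather than single events, but the principle is the same: surface what distinguishes failing transactions from healthy ones.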
  57. Industry impact of SAS. Deployment: SAS has been deployed to worldwide datacenters for Service X (serving hundreds of millions of users) since June 2011. Usage: OCEs now heavily depend on SAS; ~76% of the SAS-assisted service incidents were successfully diagnosed with its help.
  58. Code Hunt/Pex4Fun: teaching and learning via interactive gaming. http://research.microsoft.com/pubs/189240/icse13see-pex4fun.pdf http://pex4fun.com/ https://www.codehunt.com/
  59. Coding Duels. 1,841,880 clicked "Ask Pex!" http://pex4fun.com/
  60. Code Hunt: Redesigned as a Game. https://www.codehunt.com/
  61. Behind the scene: the game checks whether the behavior of the secret implementation matches the player's implementation (Secret Impl == Player Impl).

      // Secret implementation
      class Secret {
          public static int Puzzle(int x) {
              if (x <= 0) return 1;
              return x * Puzzle(x - 1);
          }
      }

      // Player implementation
      class Player {
          public static int Puzzle(int x) {
              return x;
          }
      }

      // Test driver
      class Test {
          public static void Driver(int x) {
              if (Secret.Puzzle(x) != Player.Puzzle(x))
                  throw new Exception("Mismatch");
          }
      }
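The duel mechanic is differential testing: find an input where the player's code disagrees with the secret implementation. Pex finds such inputs with dynamic symbolic execution; the Python transcription below substitutes simple enumeration over candidate inputs.

```python
def secret_puzzle(x):
    """Hidden reference implementation (factorial, as on the slide)."""
    if x <= 0:
        return 1
    return x * secret_puzzle(x - 1)

def player_puzzle(x):
    """The player's current attempt."""
    return x

def find_mismatch(inputs):
    """Return the first input where player and secret disagree, else None.
    (Pex generates such inputs automatically; here we just enumerate.)"""
    for x in inputs:
        if secret_puzzle(x) != player_puzzle(x):
            return x
    return None

print(find_mismatch([1, 2, 3]))  # 3: secret gives 3! = 6, player returns 3
```

Each reported mismatch becomes the player's next clue, driving the iterative guess-refine gameplay described on the following slides.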
  62. Coding Duels: fun and engaging. Iterative gameplay; adaptive; personalized; no cheating; clear winning criterion.
  63. Teaching and Learning
  64. Coding Duels for course assignments in a graduate software engineering course. http://pexforfun.com/gradsofteng Observed benefits: • Automatic grading • Real-time feedback (for both students and teachers) • Fun learning experiences
  65. Coding Duel competition at ICSE 2011. http://pexforfun.com/icse2011
  66. Example user feedback (Pex4Fun has been released since 2010): "It really got me *excited*. The part that got me most is about spreading interest in teaching CS: I do think that it's REALLY great for teaching | learning!" "I used to love the first-person shooters and the satisfaction of blowing away a whole team of Noobies playing Rainbow Six, but this is far more fun." "I'm afraid I'll have to constrain myself to spend just an hour or so a day on this really exciting stuff, as I'm really stuffed with work."
  67. Usage scenarios. Students proceed through a sequence of puzzles to learn and practice. Educators exercise different parts of a curriculum and track students' progress. Recruiters use contests to inspire communities and results to guide hiring. Researchers mine extensive data in Azure to evaluate how people code and learn. ICSE 2015 JSEET: http://taoxie.cs.illinois.edu/publications/icse15jseet-codehunt.pdf ICSE 2013 SEE: http://taoxie.cs.illinois.edu/publications/icse13see-pex4fun.pdf
  68. Conclusion. Vertical: Software Users, Software Development Process, Software System. Horizontal: Information Visualization, Data Analysis Algorithms, Large-scale Computing.
  69. Q & A. http://taoxie.cs.illinois.edu/ Contact: taoxie@illinois.edu
  70. Conclusion. Vertical: Software Users, Software Development Process, Software System. Horizontal: Information Visualization, Data Analysis Algorithms, Large-scale Computing.
  71.–83. Code Hunt Programming Game (screenshot slides; title only).
  84. Leaderboard and dashboard. The leaderboard is publicly visible and updated during the contest; the dashboard is visible only to the organizer.
  85. It's a game! Iterative gameplay; adaptive; personalized; no cheating; clear winning criterion. The gameplay loop alternates between the player's code and generated test cases.
