Software Analytics: Data Analytics for Software Engineering and Security
1. Software Analytics: Data Analytics
for Software Engineering and Security
Tao Xie
Department of Computer Science
University of Illinois at Urbana-Champaign, USA
taoxie@illinois.edu
In Collaboration with Microsoft Research and NC State University
5. How people use software is changing…
Individual → Social
Isolated → Collaborative
Not much data/content generation → Huge amount of data/artifacts generated anywhere, anytime
8. How software is built & operated is changing…
Long product cycle → Continuous release
Experience & gut-feeling → Informed decision making
In-lab testing → Debugging in the large
Centralized development → Distributed development
Code centric → Data pervasive
… …
10. Software Analytics
Software analytics is to enable software
practitioners to perform data exploration and
analysis in order to obtain insightful and
actionable information for data-driven tasks
around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software
Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
11. Software Analytics
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
12. Data sources
• Runtime traces, program logs, system events, perf counters, …
• Usage logs, user surveys, online forum posts, blogs & Twitter, …
• Source code, bug history, check-in history, test cases, eye tracking, MRI/EMG, …
15. Target audience – software practitioners
Developer
Tester
Program Manager
Usability engineer
Designer
Support engineer
Management personnel
Operation engineer
16. Output – insightful information
• Conveys meaningful and useful understanding or knowledge towards completing the target task
• Not easily attainable by directly investigating raw data without the aid of analytics technologies
• Example
– It is easy to count the number of re-opened bugs, but how does one find the primary reasons for these re-opened bugs?
17. Output – actionable information
• “So what” – enables software practitioners to come up with concrete solutions towards completing the target task
• Example
– Why were bugs re-opened?
• A list of bug groups, each with the same reason for re-opening
18. Research topics & technology pillars
Research topics (vertical): Software Users, Software Development Process, Software System
Technology pillars (horizontal): Information Visualization, Data Analysis Algorithms, Large-scale Computing
19. Outline
• Overview of Software Analytics
• Software Engineering Tasks
– XIAO: Scalable code clone analysis
– SAS: Incident management of online services
• Mobile App Security Tasks
– WHYPER: NLP on app descriptions
– AppContext: Machine learning to classify malware
21. XIAO: Code Clone Analysis
• Motivation
– Copy-and-paste is a common developer behavior
– A real tool widely adopted internally and externally
• XIAO enables code clone analysis with the following properties
– High tunability
– High scalability
– High compatibility
– High explorability
22. High tunability – what you tune is what you get
• Intuitive similarity metric: effective control of the degree of syntactical difference between two code snippets

Snippet A:
    for (i = 0; i < n; i++) {
      a++;
      b++;
      c = foo(a, b);
      d = bar(a, b, c);
      e = a + c;
    }

Snippet B:
    for (i = 0; i < n; i++) {
      c = foo(a, b);
      a++;
      b++;
      d = bar(a, b, c);
      e = a + d;
      e++;
    }
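XIAO's actual similarity metric is not spelled out on the slide, so the following is only a rough illustration of the idea: a normalized longest-common-subsequence (LCS) score over token sequences. Unlike a set-based measure, it penalizes the reordered statements and the extra increment in the second snippet, so the two clones above score high but below 1.0. All names here are invented for the sketch.

```java
import java.util.*;

// Illustrative sketch only -- not XIAO's real metric. The "degree of
// syntactical difference" between two snippets is approximated by a
// normalized LCS over their token sequences.
public class CloneSimilarity {

    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        for (String t : code.split("[^A-Za-z0-9_]+"))
            if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    // Classic O(m*n) dynamic-programming LCS length.
    static int lcs(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++)
            for (int j = 1; j <= b.size(); j++)
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return dp[a.size()][b.size()];
    }

    // 1.0 = identical token sequences; lower = more syntactic difference.
    static double similarity(String codeA, String codeB) {
        List<String> ta = tokenize(codeA), tb = tokenize(codeB);
        if (ta.isEmpty() && tb.isEmpty()) return 1.0;
        return 2.0 * lcs(ta, tb) / (ta.size() + tb.size());
    }

    public static void main(String[] args) {
        String a = "for (i = 0; i < n; i++) { a++; b++; c = foo(a, b); d = bar(a, b, c); e = a + c; }";
        String b = "for (i = 0; i < n; i++) { c = foo(a, b); a++; b++; d = bar(a, b, c); e = a + d; e++; }";
        System.out.printf("similarity = %.2f%n", similarity(a, b));
    }
}
```

Tuning then amounts to choosing the similarity threshold above which two snippets count as clones.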
23. High explorability
Clone exploration view (figure callouts 1–7):
1. Clone navigation based on source-tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone-function list in selected folder
5. Clone-function filters
6. Sorting by bug or refactoring potential
7. Tagging
Clone detail view (figure callouts 1–6):
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
24. Scenarios & Solutions
• Quality gates at milestones: architecture refactoring; code clone clean-up; bug fixing
• Post-release maintenance: security bug investigation; bug investigation for sustained engineering
• Development and testing: checking for similar issues before check-in; reference info for code review; supporting tool for bug triage
Solutions: online code clone search; offline code clone analysis
26. More secure Microsoft products
• Code Clone Search service integrated into the workflow of the Microsoft Security Response Center
• Over 590 million lines of code indexed across multiple products
• Real security issues proactively identified and addressed
27. Example – MS Security Bulletin MS12-034
Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight; published Tuesday, May 08, 2012.
Three publicly disclosed and seven privately reported vulnerabilities were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document:
• insufficient bounds check within the font-parsing subsystem of win32k.sys
• cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and the Windows Journal viewer
From the Microsoft Technet blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a ‘Cloned Code Detection’ system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
29. Motivation
Incident Management (IcM) is a critical task for assuring service quality:
• Online services are increasingly popular & important
• High service quality is the key
30. Incident Management: Workflow
Detect a service issue → Alert On-Call Engineers (OCEs) → Investigate the problem → Restore the service → Fix root cause via postmortem analysis
33. SAS: Incident management of online services
SAS was developed and deployed to effectively reduce MTTR (Mean Time To Restore) by automatically analyzing monitoring data.
Design principles of SAS:
• Automating analysis
• Handling heterogeneity
• Accumulating knowledge
• Supporting human-in-the-loop (HITL)
35. Industry Impact of SAS
Deployment
• SAS deployed to worldwide datacenters for Service X (serving hundreds of millions of users) since June 2011
• OCEs now heavily depend on SAS
Usage
• SAS helped successfully diagnose ~76% of the service incidents in which it was used
36. Outline
• Overview of Software Analytics
• Software Engineering Tasks
– XIAO: Scalable code clone analysis
– SAS: Incident management of online services
• Mobile App Security Tasks
– WHYPER: NLP on app descriptions
– AppContext: Machine learning to classify malware
37. “Conceptual” Model
(Diagram)
• App developers: app functional requirements; app security requirements
• App users: user functional requirements; user security requirements
• Requirements communicated informally via the app description, the permission list, etc.
• App Code
42. WHYPER: Text Analytics for Mobile Security
• Focus on linking permissions to app descriptions
• Permissions (protecting user-understandable resources) should be discussed in the description
• What do users expect (w.r.t. app functionality)?
– GPS Tracker: record and send location
– Phone-Call Recorder: record audio during phone call
• Goal: app description sentence ↔ permission linkage
Pandita et al. WHYPER: Towards Automating Risk Assessment of Mobile Applications. USENIX Security 2013.
http://web.engr.illinois.edu/~taoxie/publications/usenixsec13-whyper.pdf
43. WHYPER Overview
(Overview figure: application market, WHYPER, developers, users)
• Enhance user experience while installing apps
• Enforce functionality disclosure on developers
• Complement program analysis to ensure justifications
44. Natural Language Processing on App Description
• “Also you can share the yoga exercise to your friends via Email and SMS.”
– Implies use of the contacts permission (a permission sentence)
• Confounding effects:
– Certain keywords such as “contact” have a confounding meaning
– E.g., “... displays user contacts, ...” vs. “... contact me at abc@xyz.com”
• Semantic inference:
– Sentences can describe a sensitive action without using the keywords
– E.g., “share yoga exercises with your friends via Email and SMS”
Approach: NLP + semantic graphs/ontologies derived from Android API documents
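The confounding-effect problem above can be seen with a deliberately naive matcher (a sketch, not WHYPER's approach): flagging every sentence that contains a permission-related keyword turns the email-address sentence into a false positive, which is exactly why WHYPER needs semantics rather than keywords.

```java
// Illustrative only: the kind of naive keyword matching that WHYPER improves
// on. Any sentence containing the keyword is flagged, so uses of "contact"
// that have nothing to do with the contacts permission are flagged too.
public class KeywordMatcher {
    static boolean mentionsPermission(String sentence, String keyword) {
        return sentence.toLowerCase().contains(keyword.toLowerCase());
    }

    public static void main(String[] args) {
        String s1 = "... displays user contacts, ...";   // genuine permission sentence
        String s2 = "... contact me at abc@xyz.com";     // confounding use of "contact"
        System.out.println(mentionsPermission(s1, "contact")); // flagged, correctly
        System.out.println(mentionsPermission(s2, "contact")); // flagged, a false positive
    }
}
```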
45. Challenges
• Synonym analysis
– Example non-permission sentence: “You can now turn recordings into ringtones.”
– Functionality that lets users create ringtones from previously recorded sounds, but does NOT require permission to record audio
– False positive due to the synonym pair (turn, start)
• Limitations of semantic graphs
– Example permission sentence: “blow into the mic to extinguish the flame like a real candle”
– False negative due to failing to associate “blow into” with “record”
– Possible remedy: automatic mining from user comments and forums
48. Not All Malware Developers Are “Dumb” or “Lazy”
Benign? Malicious?
49. Our Insight
Benign apps and malware have different goals.
• Benign apps
– Meet requirements from users (deliver utility)
• Malware
– Trigger malicious behaviors frequently (maximize profit)
– Evade detection (prolong lifetime)
50. Differentiating characteristics
Mobile malware (vs. benign apps) must strike a balance:
– Frequent enough to meet the need: frequent occurrences of imperceptible system events
• E.g., many malware families trigger malicious behaviors via background events
– Not too frequent, so users do not notice the anomaly: checks on indicative states of the external environment
• E.g., send a premium SMS only every 12 hours
51. Example of malicious app
(Call-graph figure; nodes: ActionReceiver.OnReceive(), MainService.OnCreate(), DummyMainMethod(), SplashActivity.OnCreate(), SendTextActivity$4.onClick(), SendTextActivity$5.run(), MainService.b(), ContextWrapper.StartService(), SmsManager.sendTextMessage())
Code in ActionReceiver.OnReceive():
    Date date = new Date();
    if (date.getHours() > 23 || date.getHours() < 5) {
        ContextWrapper.StartService(MainService);
    …
Code guarding SmsManager.sendTextMessage():
    long last = db.query("LastConnectTime");
    long current = System.currentTimeMillis();
    if (current - last > 43200000) { // 12 hours in milliseconds
        SmsManager.sendTextMessage();
        db.save("LastConnectTime", current);
    …
The app will send an SMS when:
• the user clicks a button in the app
(Highlighted path: SendTextActivity$4.onClick → SmsManager.sendTextMessage)
52. Example of malicious app (cont.)
(Same call-graph figure; highlighted: the Android.intent.action.SIG_STR broadcast entering ActionReceiver.OnReceive(), and the time check "if (date.getHours() > 23 || date.getHours() < 5)")
The app will send an SMS when:
• phone signal strength changes (frequent)
• the current time is within 11 PM–5 AM (not too frequent; the user is not around)
53. Example of malicious app (cont.)
(Same call-graph figure; highlighted: the throttling check "if (current - last > 43200000)", i.e., 12 hours)
The app will send an SMS when:
• the user enters the app (frequent)
• (current time - time when the last message was sent) > 12 hours (not too frequent)
54. AppContext
• Capture differentiating characteristics with
contexts of security-sensitive behavior.
• Leverage contexts in machine learning
(classification) to differentiate malware and
benign apps.
Yang et al. AppContext: Differentiating Malicious and Benign Mobile App Behavior Under Contexts. ICSE 2015.
http://taoxie.cs.illinois.edu/publications/icse15-appcontext.pdf
55. Techniques
• Abstraction for expressing the context of security-sensitive behaviors (e.g., a permission-protected API method)
– To precisely capture the differentiating characteristics
• Inter-component analysis for extracting contexts
– To identify entry points for activation events
– To connect control flows for context factors
56. Context of security-sensitive behavior
• Activation events: events that trigger the behavior
– E.g., signal strength changes
• Context factors: environmental attributes that determine whether the security-sensitive behavior is invoked
– E.g., current system time
58. Context-based Security-Behavior Classification
Pipeline: Transforming → Labelling → Training → Classifying
Step 1 (Transforming). Transform the contexts of each app’s security behavior into features, e.g.:
• Context1: (Event: signal strength changes), (Factor: Calendar)
• Context2: (Event: entering app), (Factor: Database, SystemTime)
• Context3: (Event: clicking a button)
59. Context-based Security-Behavior Classification (Cont.)
Pipeline: Transforming → Labelling → Training → Classifying
Step 2 (Labelling). Systematically label security-sensitive method calls as malicious based on existing malware signatures.
Step 3 (Training). Train a Support Vector Machine (SVM) classifier:
• SVM is resilient to over-fitting
• SVM can handle high-dimensional data such as our context-factor data (dimension reduction may be another option)
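The transforming step can be sketched as follows: each behavior's context (its activation event plus context factors) becomes a binary feature vector over the vocabulary of all observed events and factors, which is then fed to the classifier. The vocabulary strings and encoding here are illustrative assumptions, not AppContext's actual feature format.

```java
import java.util.*;

// Hypothetical sketch of Step 1 (transforming): a context is a set of
// event/factor labels, encoded as a 0/1 vector over a fixed vocabulary.
public class ContextFeatures {
    static int[] toFeatureVector(Set<String> context, List<String> vocabulary) {
        int[] v = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++)
            v[i] = context.contains(vocabulary.get(i)) ? 1 : 0;
        return v;
    }

    public static void main(String[] args) {
        // Assumed vocabulary, built from the contexts on the previous slide.
        List<String> vocab = Arrays.asList(
            "event:signal_strength_changes", "event:entering_app",
            "event:clicking_button", "factor:Calendar",
            "factor:Database", "factor:SystemTime");
        // Context1: (Event: signal strength changes), (Factor: Calendar)
        Set<String> context1 = new HashSet<>(Arrays.asList(
            "event:signal_strength_changes", "factor:Calendar"));
        System.out.println(Arrays.toString(toFeatureVector(context1, vocab)));
        // -> [1, 0, 0, 1, 0, 0]
    }
}
```

Each app contributes one such vector per security-sensitive behavior; the labelled vectors are then used to train the SVM.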
60. Evaluation
Subjects: 846 Android apps
• 633 benign apps: randomly selected from popular apps on Google Play
• 202 malicious apps: collected from three different malware datasets (Genome, VirusShare, Contagio)
• 11 open-source apps: randomly selected from F-Droid
61. Research Questions
• RQ1: How effective is AppContext in identifying
malware?
• RQ2: How do activation events and context factors
in our context definition contribute to the
effectiveness of malware identification?
• RQ3: How accurate is our static analysis in inferring
contexts?
65. Limitations
• False negatives
– Malicious behaviors triggered by UI events and lacking context factors
• UI events give less indication of the maliciousness of a security-sensitive method call
• False positives
– Reflective method calls and dynamic code loading in benign apps
– Uncommon security-sensitive method calls used in benign apps