1. HACETTEPE UNIVERSITY
Department : Computer Engineering
Course : BIL 656 (Advanced Computer and Network Security)
Supervisor : Asst. Prof. Sevil Sen
Student : Ulvi Ismayilov (ID:N14124839)
Date : 04/11/2014
2. A Machine-Learning Approach for
Classifying and Categorizing Android
Sources and Sinks
Technical University of Darmstadt
Authors :
Steven Arzt (EC SPRIDE)
Siegfried Rasthofer (EC SPRIDE)
Eric Bodden (EC SPRIDE)
3. CONTENTS :
1. Introduction
2. Definition of Sources and Sinks
3. Example of a source-sink connection (leak)
4. Classification Approach
5. Evaluation (Tradeoffs)
6. Related Work
7. Conclusion
4. 1.1 Motivation
Why do we need a machine-learning approach
for identifying sources and sinks?
Information-flow tools require specifications of sources and sinks.
Analysis approaches often use a small, hand-selected set of sources and sinks known from the literature.
These lists are incomplete, so many data leaks go undetected.
Manual identification of all sources and sinks is impractical (over 110,000 public methods in Android 4.2).
5. 1.2 What is SuSi?
An automated, machine-learning guided approach for identifying sources and sinks directly from the code of the Android API.
Features:
SuSi analyzes not only the framework API methods but also the code of pre-installed applications.
Cross-validation yields more than 92% precision and recall.
SuSi does not use permission lists for detecting sources and sinks.
SuSi is an open-source project, available at:
https://github.com/secure-software-engineering/SuSi
SuSi can detect sources and sinks in new, previously unseen Android versions.
Main Goal:
Fully automated generation of a categorized list of sources and sinks for Android applications
(the list can be used directly by existing static and dynamic analysis approaches).
6. 2.1 Definition of Sources and Sinks
Two concepts must be introduced before defining sources and sinks:
1) Data: a value or a reference to a value.
2) Resource method: a method that reads data from or writes data to a shared resource.
One restriction applies: if a method's values (return value, parameter values) are constant, the resource method is neither a source nor a sink.
                | Example 1                                   | Example 2
Resource        | The phone’s hardware                        | GSM network
Data            | IMEI (as numerical value)                   | Message (as string value)
Resource method | getDeviceId() in the TelephonyManager class | sendTextMessage() in the SmsManager class
Source          | getDeviceId()                               |
Sink            |                                             | sendTextMessage()
7. 2.2 Definition of Sources and Sinks
Android Source:
Sources are calls into resource methods returning non-constant values into the application code.
Ex: getLac() returns the Location Area Code, which is not a constant value.
Android Sink:
Sinks are calls into resource methods accepting at least one non-constant data value from the application code as a parameter, if and only if a new value is written to the resource or an existing one is overwritten.
Ex: sendTextMessage(a, b) receives 2 non-constant parameters:
a) the message text  b) the phone number
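The two definitions above can be read as simple predicates. The following is a minimal sketch, assuming a hypothetical `ResourceCall` record whose boolean fields are filled in by hand; SuSi itself learns these decisions from features rather than from such flags.

```python
from dataclasses import dataclass


@dataclass
class ResourceCall:
    """Hypothetical, hand-written summary of one call into a resource method."""
    name: str
    returns_non_constant: bool      # a non-constant value flows into the app
    takes_non_constant_param: bool  # a non-constant value flows out of the app
    writes_resource: bool           # a value is written or overwritten on the resource


def is_source(call: ResourceCall) -> bool:
    # Source: returns a non-constant value into the application code.
    return call.returns_non_constant


def is_sink(call: ResourceCall) -> bool:
    # Sink: accepts non-constant data and writes it to the resource.
    return call.takes_non_constant_param and call.writes_resource


get_lac = ResourceCall("getLac", True, False, False)
send_sms = ResourceCall("sendTextMessage", False, True, True)
print(is_source(get_lac), is_sink(send_sms))
```

Under these assumptions getLac() qualifies as a source only and sendTextMessage() as a sink only.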
8. 3.1 Example of a source-sink connection
The example creates a publicly accessible file on the phone’s internal storage, which arbitrary other applications can access without requiring any permissions.
The code uses a resource method that SuSi identifies as a ”FILE” sink but that is normally hidden from the SDK.
9. 3.2 Example of a source-sink connection
Line 12:
The code checks for a specific, well-known cell-tower ID in Berlin (the result is true or false).
Line 14:
Converts the required data to a string (stored as the tainted value).
Line 15:
Creates a shared location from which an attacker can easily reach the private data.
Line 17:
The code uses a little-known Android system function instead of Java’s normal writing functions.
10. 4.1 Classification Approach
Android resource methods are classified in two steps:
Identification
SuSi decides whether a method is a source, a sink, or neither.
Categorization
SuSi assigns the sources and sinks identified in the first step to specific categories.
Note: All methods identified as neither source nor sink are ignored in the second step.
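The two-step flow can be sketched as a small pipeline. This is an illustration only: `toy_identify` and `toy_categorize` are hypothetical rule-based stand-ins for the trained classifiers, and the category names are placeholders rather than SuSi's actual category list.

```python
def classify(methods, identify, categorize):
    """Step 1 labels each method; step 2 only sees sources and sinks."""
    result = {}
    for method in methods:
        kind = identify(method)      # "source", "sink" or "neither"
        if kind == "neither":
            continue                 # ignored in the second step
        result[method] = (kind, categorize(method, kind))
    return result


def toy_identify(method):
    # Hypothetical stand-in for the identification classifier.
    if method.startswith("get"):
        return "source"
    if method.startswith("send"):
        return "sink"
    return "neither"


def toy_categorize(method, kind):
    # Hypothetical stand-in for the categorization classifier.
    return "MESSAGING" if "Message" in method else "OTHER"


result = classify(["getDeviceId", "sendTextMessage", "toString"],
                  toy_identify, toy_categorize)
print(result)
```

Methods labelled "neither" never reach the categorization step, which mirrors the note above.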
11. 4.2 Simple machine-learning explanation
As shown in Table I, there are three features (inputs):
1) Driving experience: negatively correlated with the accident rate
2) Blood alcohol level: positively correlated with the accident rate
3) Driver’s phone number: completely unrelated
Note:
The impact of a single feature on the overall estimate is deduced from its value distribution over the annotated training set.
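A small numeric illustration of the three kinds of feature follows. The six drivers below are made-up values chosen for this sketch (they are not the data of Table I), and the Pearson correlation is used here only as one simple way to read a feature's relation to the label from an annotated set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical annotated training set: one entry per driver.
experience_years = [1, 2, 3, 10, 15, 20]
blood_alcohol = [0.8, 1.0, 0.5, 0.0, 0.1, 0.0]
phone_last_digit = [3, 7, 1, 7, 3, 1]
accident = [1, 1, 1, 0, 0, 0]          # 1 = accident, 0 = no accident

corr_experience = pearson(experience_years, accident)
corr_alcohol = pearson(blood_alcohol, accident)
corr_phone = pearson(phone_last_digit, accident)
print(corr_experience, corr_alcohol, corr_phone)
```

On this toy data the first coefficient is negative, the second positive, and the third is zero up to rounding, matching the three cases listed above.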
12. 4.3 Support Vector Machines
Tested approaches:
A simple rule-based classifier
Problem: In some cases the classifier would effectively pick randomly, since both "accident: yes" and "accident: no" are equally likely.
A probabilistic classifier (Naive Bayes)
Problem: Gives very imprecise results, because our classification is almost rule-based and has fixed semantics.
Pruned C4.5 decision tree
Problem: The rule set lacks flexibility.
Support Vector Machine (SMO in Weka)
Chosen for implementation: usually, but not always, gives the best results; the classification can be expressed more appropriately by shifting the separating hyperplane.
13. 4.3 Support Vector Machines
An SVM is a supervised learning model used to train a classifier.
The main principle is to represent the datasets of two classes (in our scenario “sink” and “source”) as vectors in a vector space.
If the data is not linearly separable, the problem can be transformed into a higher-dimensional space (which you may picture as a multidimensional matrix).
SMO can only separate two classes.
However, SuSi has three classes in the first step (Sink, Source, Neither).
Solution: the one-against-all technique is applied.
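The one-against-all idea can be sketched with any binary classifier: train one model per class against all other classes, then pick the class whose model scores highest. In the sketch below a plain perceptron stands in for SMO (an assumption made to keep the example dependency-free), and the two binary features are hypothetical.

```python
def train_binary(samples, labels, epochs=20):
    """Perceptron stand-in for a binary SVM; labels are +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:                       # misclassified: adjust
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def train_one_vs_all(samples, classes):
    """Train one binary model per class: this class versus all others."""
    models = {}
    for cls in set(classes):
        binary = [1 if c == cls else -1 for c in classes]
        models[cls] = train_binary(samples, binary)
    return models


def predict(models, x):
    """Return the class whose binary model gives the highest score."""
    def score(model):
        weights, bias = model
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return max(models, key=lambda cls: score(models[cls]))


# Hypothetical features: (name starts with "get", returns void)
samples = [(1, 0), (0, 1), (0, 0)]
classes = ["source", "sink", "neither"]
models = train_one_vs_all(samples, classes)
print(predict(models, (1, 0)), predict(models, (0, 1)), predict(models, (0, 0)))
```

Each of the three binary problems in this toy set is linearly separable, so the perceptron converges well within the default number of epochs.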
14. 4.4 SuSi’s overall architecture
Training dataset << Test dataset
Identification: 0.7% training and 99.3% test data
Categorization: 0.4% training and 99.6% test data
• No-category concept and adding a new category
16. 4.6 Feature Database
Feature classes and the output of the classifier they indicate:
Feature class                    | Source                         | Sink                                                             | Neither source nor sink
Method Name                      | Starts with ”get”              |                                                                  |
Method has Parameters            | Fewer parameters               | More parameters                                                  |
Method Return Value Type         | Returned cursor                | Void value type                                                  |
Method Parameter Type            | Specific types, e.g. java.io.* | Specific types, e.g. java.io.*                                   |
Method Parameter is an Interface |                                |                                                                  | Does not perform any actual operation on the data itself
Method Modifiers                 | Public methods                 | Public methods                                                   | Static methods
Class Modifiers                  |                                |                                                                  | Methods declared in protected classes
Dataflow to Sink                 |                                | Method parameter is passed to another specific method, e.g. update() |
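Several of the feature classes above are purely syntactic and can be read off a method signature. The sketch below assumes a hypothetical `extract_features` helper and hand-written signature data; it does not reproduce SuSi's actual feature database.

```python
def extract_features(name, param_types, return_type, modifiers):
    """Derive a few syntactic features from a method signature (illustrative)."""
    return {
        "name_starts_with_get": name.startswith("get"),
        "has_parameters": len(param_types) > 0,
        "returns_void": return_type == "void",
        "has_java_io_parameter": any(t.startswith("java.io.") for t in param_types),
        "is_public": "public" in modifiers,
        "is_static": "static" in modifiers,
    }


# Signatures written out by hand for this example.
get_device_id = extract_features(
    "getDeviceId", [], "java.lang.String", {"public"})
send_text_message = extract_features(
    "sendTextMessage",
    ["java.lang.String", "java.lang.String", "java.lang.String",
     "android.app.PendingIntent", "android.app.PendingIntent"],
    "void", {"public"})
print(get_device_id)
print(send_text_message)
```

Each dictionary is one feature vector; a classifier would be trained on many such vectors together with their hand-annotated labels.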
17. 4.7 Dataflow features
Semantic features turn out to be much more suitable for identifying sources and sinks than for categorizing them.
On the source-code level, Android’s sources and sinks share common patterns that dataflow features can exploit.
Based on the initialization, we then run a fixed-point iteration with the following rules:
When the first source-to-sink connection is found, the iteration is aborted and the feature returns “True”. If the dataflow analysis completes without any source-to-sink connection, the feature returns “False”.
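The abort-on-first-connection behaviour can be sketched as a small fixed-point loop. Everything specific here is an assumption made for illustration: the statement encoding as tuples, the single propagation rule for assignments, and the choice of method parameters as the initial taint with callee names starting with `update` treated as the sink side.

```python
def flows_to_sink(params, statements, sink_prefix="update"):
    """Return True as soon as a tainted value reaches a sink-like call."""
    tainted = set(params)              # initialization: parameters are tainted
    changed = True
    while changed:                     # iterate until nothing new is tainted
        changed = False
        for stmt in statements:
            if stmt[0] == "assign":    # ("assign", target, value)
                _, target, value = stmt
                if value in tainted and target not in tainted:
                    tainted.add(target)
                    changed = True
            elif stmt[0] == "call":    # ("call", callee, argument)
                _, callee, arg = stmt
                if arg in tainted and callee.startswith(sink_prefix):
                    return True        # first connection found: abort
    return False                       # fixed point reached, no connection


leaky = [("call", "updateRecord", "tmp"), ("assign", "tmp", "value")]
clean = [("assign", "tmp", "other"), ("call", "updateRecord", "tmp")]
print(flows_to_sink(["value"], leaky), flows_to_sink(["value"], clean))
```

The first example needs a second pass over the statements before the connection is found, which is why the loop repeats until no new variable becomes tainted.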
18. 5.1 Evaluation (Cross validation)
Precision is the fraction of correctly classified elements of a class among all elements that were assigned to that class.
Recall is the fraction of correctly classified elements of a class among all elements that should have been assigned to that class.
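Both measures can be computed per class from predicted and true labels. The following is a minimal sketch with made-up labels; the function name `precision_recall` and the five example methods are assumptions of this illustration.

```python
def precision_recall(predicted, actual, cls):
    """Precision and recall of one class from parallel label lists."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == cls and a == cls)
    assigned = sum(1 for p in predicted if p == cls)   # assigned to the class
    expected = sum(1 for a in actual if a == cls)      # should have been assigned
    precision = correct / assigned if assigned else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Hypothetical classifier output versus hand annotation for five methods.
predicted = ["source", "source", "sink", "neither", "source"]
actual = ["source", "sink", "sink", "neither", "neither"]

source_precision, source_recall = precision_recall(predicted, actual, "source")
sink_precision, sink_recall = precision_recall(predicted, actual, "sink")
print(source_precision, source_recall, sink_precision, sink_recall)
```

In this toy case three methods are assigned to "source" but only one of them is correct, so precision is one third while recall is 1; for "sink" the situation is reversed, with precision 1 and recall one half.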
Interestingly, the average precision and recall are almost the same with and without the permission feature.
Implicit annotations for virtual dispatch: a generic machine-learning tool has no knowledge of the language semantics of Java.
SuSi was evaluated on the extended test set using the implicit annotations and again achieved more than 92% precision.
19. 5.2 Sources and Sinks in Malware Apps
Tested 11,000 malware apps from VirusShare and found that current malware is leaking more private information.
A second example is LeakMiner. It creates its own source and sink list from a permission map. But SuSi determined that there are many other, less well-known methods that do not need a permission.
Ex: getSimOperatorName(), getCountry(), getSimCountryIso()
SuSi found plenty of wrapper methods in internal Android classes or pre-installed apps that return privacy-sensitive information, such as the IMEI.
20. 5.3 Changes during different Android versions
From the figure we can clearly deduce that new sources are introduced with every version.
The results show that SuSi detects the changes between API versions very well.
SuSi reliably finds new sources and sinks that were added to the Android platform.
However, newly detected sources and sinks that SuSi cannot categorize must be handled by hand (by creating a new category).
21. 5.4 Source and Sink lists used by other analysis tools
Analysis tool | Method for identifying source lists
LeakMiner     | Permission map
CHEX          | Semi-automatic approach (not public)
ScanDal       | Not provided
AndroidLeaks  | Not provided
Aurasium      | Intercepts calls at system-level libraries (Linux and Android)
TaintDroid    | Like Aurasium, but at a lower, system-internal level
Scandroid     | Not public, but source and sink specifications were extracted from the source code; the resulting list is fully covered by SuSi’s output
23. 5.6 Disadvantages of SuSi
If a specific category contains only a few methods, the precision of categorization decreases
(Ex: the BLUETOOTH category has just a few methods among the 110,000 Android API methods).
Many developers of the Android framework do in fact follow a certain regular coding style or duplicate parts of a method’s implementation. These aspects lead to regularity and redundancy in the code base.
That is why a machine-learning approach can take advantage of it.
But what if a developer does not follow a regular coding style?
Callback methods receive data from the operating system; SuSi cannot detect these methods as sources or sinks
(e.g., onNmeaReceived() instead of onLocationChanged()).
24. 6.1 Related Work
MERLIN
Probabilistic approach
Uses incomplete specifications of sources and sinks
Based on string-related vulnerabilities (cross-site scripting or SQL injection)
Needs information about the client or application
Fits a web-application scenario, whereas SuSi focuses on privacy-related aspects of Android, where data is usually not of type string
25. 6.2 Related Work
Machine learning used for security:
1) Automatic spam detection
2) Anomaly detection (network traffic)
3) MCA (Multiple Correspondence Analysis)
Identifies malware from different markets
Difference between MCA and SuSi:
SuSi works on independent and discrete classes, but MCA requires a logical ordering of records
26. Conclusion
Future aims for improving the project:
1) Port it to other platforms (J2EE, PHP, C++, etc.)
2) Automated detection of sensitive callbacks