3. Specifications
• Does the program do what it is
supposed to do?
• Will it continue to do so in the future?
• How to define what it's supposed to do?
Formal Methods
4. Flappy Bird
• Your aim is to move a little bird up and down
such that it does not hit an obstacle.
• As a developer, you list undesired properties (no
crashes, no spying).
• How to specify gameplay to a computer?
• Can we teach a computer how to check a
program against expectations?
• Learn what program behavior is normal in a
given context?
5. APP MINING
App mining leverages common knowledge in thousands of
apps to automatically learn what is “normal” behavior—
and in contrast, automatically identify “abnormal” behavior.
6. APP MINING
• Leverage the knowledge encoded into the hundreds
of thousands of apps available in app stores
• Determine what would be normal behavior, to
detect what would be abnormal behavior
• Guide programmers and users toward better security
and usability
Apps in app stores have three features
1. Apps come with all sorts of metadata, such as names, categories,
and user interfaces. All of these can be associated with program
features, so you can, for instance, associate program behavior with
descriptions.
2. Apps are pretty much uniform. They use the same libraries, which,
moreover, follow fairly recent designs. All this makes apps easy to analyze,
execute, and test, and consequently easy to compare.
3. Apps are redundant. There are plenty of apps that all address
similar problems. This is in sharp contrast to open source programs.
This redundancy in apps allows us to learn common patterns of how
problems are addressed, and, in return, detect anomalies.
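Because apps ship as uniform byte code, extracting the APIs they call is largely mechanical. A minimal sketch, assuming decompiled smali output such as that produced by tools like apktool; the two-line excerpt below is made up for illustration, not taken from a real app:

```python
import re

# Toy excerpt of decompiled smali code (illustrative, not from a real app).
smali = """
invoke-virtual {v0, v1, v2, v3, v4}, Landroid/telephony/SmsManager;->sendTextMessage(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;Landroid/app/PendingIntent;Landroid/app/PendingIntent;)V
invoke-virtual {v5, v6}, Landroid/location/LocationManager;->getLastKnownLocation(Ljava/lang/String;)Landroid/location/Location;
"""

# Pull the fully qualified class and method name out of each invoke instruction.
calls = re.findall(r"invoke-\w+ \{[^}]*\}, (L[\w/]+;->\w+)", smali)
print(calls)
```

Comparing such API-call sets across thousands of apps is then a matter of set operations, which is exactly what makes uniform byte code attractive for mining.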
7. DETECTING ABNORMAL BEHAVIOR
The problem with “normal” behavior is that it varies according to the
app’s purpose:
• If an app sends out text messages, that would normally be a sign of
malicious behavior—unless it is a messaging application, where
sending text messages is one of the advertised features.
• If an app continuously monitors your position, this might be
malicious behavior—unless it is a tracking app that again advertises
this as a feature.
• Simply checking for a set of predefined “undesired” features is not
enough—if the features are clearly advertised, then it is reasonable
to assume the user tolerates, or even wants these features, because
otherwise, she would not have chosen the app.
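To make the argument concrete, here is a hedged sketch contrasting a fixed blocklist check with a description-aware check; the app names, topic labels, and the topic-to-API mapping are all invented for illustration:

```python
# Why a fixed blocklist of "undesired" APIs is not enough:
# it flags advertised, legitimate uses just as readily as covert ones.
UNDESIRED = {"sendTextMessage", "getLastKnownLocation"}

apps = [
    {"name": "ChatNow",  "topics": {"messaging"}, "apis": {"sendTextMessage"}},
    {"name": "RunTrack", "topics": {"tracking"},  "apis": {"getLastKnownLocation"}},
    {"name": "Torch",    "topics": {"utility"},   "apis": {"sendTextMessage"}},
]

# What each description topic plausibly advertises (assumed mapping).
ADVERTISED = {
    "messaging": {"sendTextMessage"},
    "tracking":  {"getLastKnownLocation"},
    "utility":   set(),
}

def naive_flag(app):
    # Flags every app that uses an undesired API, advertised or not.
    return bool(app["apis"] & UNDESIRED)

def context_flag(app):
    # Flags only API uses that no description topic advertises.
    advertised = set().union(*(ADVERTISED[t] for t in app["topics"]))
    return bool((app["apis"] & UNDESIRED) - advertised)

print([a["name"] for a in apps if naive_flag(a)])    # flags all three apps
print([a["name"] for a in apps if context_flag(a)])  # flags only "Torch"
```

The naive check flags the messaging and tracking apps along with the flashlight app; only the description-aware check singles out the flashlight app that sends text messages without advertising it.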
8. Introducing CHABADA
• To determine what is normal, we thus must assess program behavior together with its description. If the
behavior is advertised, it is fine; if not, it may come as a surprise to the user, and thus should be flagged.
• This is the idea we followed in our first app mining work, the CHABADA tool.
• A general tool to detect mismatches between the behavior of an app and its description
• Applied to a set of 22,500 apps, CHABADA can detect 74% of novel malware, with a false positive rate
below 10%.
• Our recent MUDFLOW prototype, which learns normal data flows from apps, can even detect more than
90% of novel malware leaking sensitive data.
“Checking App Behavior Against Descriptions of Apps”
9. CHABADA
• CHABADA starts with a (large) set of apps to be analyzed.
• It first applies tried-and-proven natural language
processing techniques (stemming, topic analysis via
LDA, Latent Dirichlet Allocation) to abstract the app
descriptions into topics.
• It builds clusters of those apps whose topics have the
most in common. Thus, all apps whose descriptions refer
to messaging end up in a “Messaging” cluster.
10. CHABADA
• Within each cluster, CHABADA will now search for outliers
regarding app behavior.
• As its behavior abstraction, it simply uses the set of API calls
contained in each app; these are easy to extract using simple
static analysis tools.
• CHABADA uses tried-and-proven outlier analysis techniques,
which provide a ranking of the apps in a cluster, depending
on how far away their API usage is from the norm. Those
apps that are ranked highest are the most likely outliers.
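A sketch of this within-cluster outlier step, with binary API-usage vectors as features and a simple distance-to-centroid ranking standing in for the tool's actual outlier analysis; the API names and app data are invented:

```python
# Rank apps in one cluster by how far their API usage is from the norm.
import numpy as np

apis = ["sendTextMessage", "getLastKnownLocation", "openConnection",
        "getDeviceId", "getContacts"]

# Each row: which APIs one app in a "Messaging" cluster uses (1 = used).
cluster = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],   # also reads location and device ID: suspicious
])

# Distance of each app's usage vector from the cluster centroid.
centroid = cluster.mean(axis=0)
dist = np.linalg.norm(cluster - centroid, axis=1)

# Most anomalous app first; here, the last app tops the ranking.
ranking = np.argsort(-dist)
print(ranking[0])
```

The actual tool replaces this centroid distance with a proper outlier analysis technique, but the output is the same in spirit: a per-cluster ranking with the likeliest outliers on top.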
11. A TREASURE OF DATA …
1. Future techniques will tie program analysis to user interface analysis.
2. Mining user interaction may reveal behavior patterns we could reuse in various contexts.
3. Violating behavior patterns may also imply usability issues. If a button named “Login” does nothing, for
instance, it would be very different from the other “Login” buttons used in other apps—and hopefully be
flagged as an anomaly.
4. Given good test generators, one can systematically explore the dynamic behavior, and gain information on
the concrete text and resources accessed.
A number of ideas that app stores all make possible
12. OBSTACLES
1. Getting apps is not hard, but not easy either. Besides the official stores, there is no publicly available repository
of apps where you could simply download thousands of apps, because of copyright restrictions.
2. For apps, there’s no easily accessible source code, version, or bug information. If you monitor a store for a
sufficient time, you may be able to access and compare releases, but that’s it. Vendors are not going to help you,
and open source is limited. Fortunately, app byte code is not too hard to analyze.
3. Metadata is only a very weak indicator of program quality. Lots of one-star reviews may refer to a recent price
increase or political reasons; but reviews talking about crashes or malicious behavior might give clear signs.
4. Never underestimate developers. Vendors typically have a pretty clear picture of what their users do. If you think
you can mine metadata to predict release dates, reviews, or sentiments, talk to vendors first and check your
proposal against the realities of app development.