1. Finding the Hidden Scenes Behind Android Applications
Joey Allen
Mentor: Xiangyu Niu
CURENT REU Program: Final Presentation
7/16/2014
2. Previous Work
• Crawled Google Play Store
• Scraped Descriptions, Author, and Categories of Applications
• Applied LDA Model
  • Descriptions
  • Permissions
• Applied Author Topic Model
  • Descriptions
3. APPIC Framework
Figure 1. Flow Chart of APPIC Framework.
1. User requests to download app A.
2. Description, category, and permissions are filtered.
3. Category is assigned to C_a.
4. Embedded topic models auto-tag the description, S_a, and permissions, p_a.
5. C_a, S_a, and p_a are compared.
6. If they all match, the app is considered safe.
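The comparison in steps 5 and 6 can be sketched as follows. This is a minimal illustration, not the APPIC implementation: the `AppRecord` class, its field names, and the example labels are all hypothetical, and the sketch assumes each of the three labels is a single string.

```python
from dataclasses import dataclass


@dataclass
class AppRecord:
    """Hypothetical container for the three labels compared in step 5."""
    category: str          # category label taken from the store listing
    description_tag: str   # tag a topic model assigned to the description
    permission_tag: str    # tag a topic model assigned to the permissions


def is_considered_safe(app: AppRecord) -> bool:
    """Step 6: the app is considered safe only if all three labels match."""
    return app.category == app.description_tag == app.permission_tag


consistent = AppRecord("Weather", "Weather", "Weather")
mismatched = AppRecord("Weather", "Weather", "Communication")
print(is_considered_safe(consistent))  # True
print(is_considered_safe(mismatched))  # False
```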
4. LDA Model
• Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora [1].
• The LDA model creates topics that are distributions over words.
• The words in a document can then be compared to a set of topics, and a category can be chosen for the document.
Figure 2. Graphical Representations of LDA Model [1].
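The idea of comparing a document's words to a set of topics can be sketched as below. This is a simplification, assuming the topics have already been learned: full LDA inference estimates a mixture of topics per document, whereas this sketch only picks the single topic whose word distribution gives the document the highest log-likelihood. The topic names, vocabulary, and probabilities are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
TOPICS = {
    "weather": {"forecast": 0.5, "rain": 0.3, "message": 0.1, "chat": 0.1},
    "communication": {"forecast": 0.1, "rain": 0.1, "message": 0.4, "chat": 0.4},
}


def log_likelihood(words, topic_dist, floor=1e-6):
    """Sum the log-probabilities of the words under one topic's distribution.

    Words missing from the topic get a small floor probability (an assumption
    made here to avoid log(0)).
    """
    counts = Counter(words)
    return sum(n * math.log(topic_dist.get(w, floor)) for w, n in counts.items())


def choose_topic(words, topics=TOPICS):
    """Return the name of the topic that best explains the words."""
    return max(topics, key=lambda name: log_likelihood(words, topics[name]))


print(choose_topic("rain forecast forecast chat".split()))  # weather
```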
5. Author Topic Model
• The author-topic model is a generative model for documents that extends LDA to include authorship information [2].
• Each author is represented by a distribution over topics, and each topic by a distribution over words.
Figure 3. Graphical Model of Author-Topic Model [2].
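The generative process behind the author-topic model [2] can be sketched as follows: for each word, an author of the document is picked uniformly, a topic is drawn from that author's topic distribution, and a word is drawn from that topic's word distribution. The author names, topics, and probabilities below are hypothetical, and the sketch only generates data; it does not fit the model.

```python
import random

# Hypothetical parameters for illustration only.
AUTHORS = ["dev_a", "dev_b"]
AUTHOR_TOPICS = {
    "dev_a": {"games": 0.9, "tools": 0.1},
    "dev_b": {"games": 0.2, "tools": 0.8},
}
TOPIC_WORDS = {
    "games": {"play": 0.6, "score": 0.4},
    "tools": {"scan": 0.5, "backup": 0.5},
}


def sample_from(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_document(authors, author_topics, topic_words, length, rng):
    """Generate `length` words following the author-topic generative process."""
    words = []
    for _ in range(length):
        author = rng.choice(authors)                      # author picked uniformly
        topic = sample_from(author_topics[author], rng)   # topic from that author
        words.append(sample_from(topic_words[topic], rng))  # word from that topic
    return words


print(generate_document(AUTHORS, AUTHOR_TOPICS, TOPIC_WORDS, 8, random.Random(0)))
```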
6. Calculating Results
Flow chart steps:
• User reads application description.
• Compare APPIC tags with author's tags.
Outcomes (CI = Correct Inference, II = Incorrect Inference):
• APPIC finds app in wrong category. (CI + 1)
• APPIC incorrectly categorizes application. (II + 1)
• APPIC and author incorrectly categorize app. (II + 1)
$$\mathrm{Accuracy} = \frac{CI}{II + CI}$$
(5) Calculating Accuracy
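The accuracy formula can be sketched as a small function. The function and argument names are chosen here for illustration, and raising an error when no inferences have been recorded is an assumption, since the slide does not say how that case is handled.

```python
def accuracy(correct_inferences: int, incorrect_inferences: int) -> float:
    """Accuracy = CI / (II + CI)."""
    total = correct_inferences + incorrect_inferences
    if total == 0:
        # Assumption: an empty tally is treated as an error rather than 0.
        raise ValueError("no inferences recorded")
    return correct_inferences / total


print(accuracy(8, 2))  # 0.8
```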
11. Conclusion
• LDA performed better than AT at categorizing descriptions.
• More tags increase accuracy but decrease efficiency.
• The AT model was not as accurate at categorizing applications.
• The AT model is useful for finding authors that create similar apps.
12. Future Work
• Find a better method to calculate accuracy.
• Learn a different method to categorize permissions.
• Dependencies between permissions and descriptions.
• Modify the AT model.
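13. Author-Topic Model (Modified)
[Figure: graphical model of the modified Author-Topic Model. Labels recovered from the diagram:]
• D: Document
• β, ϕ (T): Topic distribution over words
• w: Word
• z: Topic
• α, θ (A): Distribution of permissions over topics
• x: Permissions
• Nd
• pd: Uniform distribution of documents over permissions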
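One way to read the diagram labels is as the generative process of the original author-topic model [2] with permissions taking the place of authors. The notation below is an assumption based on that reading: the slide gives only the labels, not the equations, and the indices p, t, d, and i are introduced here for illustration.

```latex
\begin{align*}
\theta_p &\sim \mathrm{Dirichlet}(\alpha) && \text{for each permission } p = 1, \dots, A \\
\phi_t &\sim \mathrm{Dirichlet}(\beta) && \text{for each topic } t = 1, \dots, T \\
x_{di} &\sim \mathrm{Uniform}(p_d) && \text{permission chosen for word } i \text{ of document } d \\
z_{di} &\sim \mathrm{Multinomial}(\theta_{x_{di}}) && \text{topic drawn from that permission} \\
w_{di} &\sim \mathrm{Multinomial}(\phi_{z_{di}}) && \text{word drawn from that topic}
\end{align*}
```

Here p_d is the set of permissions requested by document (app) d, and i runs over the N_d words of its description.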
14. References
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[2] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, “The author-topic model for authors and documents,” in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004, pp. 487–494.
[3] Y. Yang, J. S. Sun, and M. W. Berry, “APPIC: Finding the Hidden Scene Behind Description Files for Android Apps.”