AYOUB MAHDI - SUMMARY of FLOWPRINT: SEMI-SUPERVISED MOBILE-APP FINGERPRINTING ON ENCRYPTED NETWORK TRAFFIC .pdf
FLOWPRINT: SEMI-SUPERVISED
MOBILE-APP FINGERPRINTING ON
ENCRYPTED NETWORK TRAFFIC
Bachelor's thesis in Electronic and Computer Engineering
Department of Engineering and Architecture
CANDIDATE: MAHDI AYOUB
SUPERVISOR: PROF. ALBERTO BARTOLI
ACADEMIC YEAR 2022-2023
INTRODUCTION
The numerous devices that a single user can have and the enormous number of
apps that may be installed on a device can make keeping a network secure a
challenge to the operators.
In large networks there is a high risk that harmful or malicious software reaches
the network infrastructure and causes real damage.
This problem requires a practical solution that provides a reasonable level of
security and control over the network infrastructure, even at the cost of some
intrusion into the privacy of the users.
In other words, a network with many users and even more applications is a real
challenge for operators to control, because they must monitor a very large volume
of mobile data traffic.
For that reason, network operators have turned to Machine Learning to tackle the
problem.
However, the statistics on mobile data traffic show how hard it is to detect apps,
because most of the data is encrypted [1]. The nature of the traffic adds further
difficulties:
● Evolving: apps are installed, uninstalled, and updated, so the approach must keep
up with the newest app versions by relying on the traffic itself rather than on
labeled data.
● Homogeneous: different apps use the same protocols to transfer data, and many of
them contact the same destinations (e.g., CDNs, advertisement networks, cloud
providers).
● Dynamic: traffic depends on user behavior, so the same app can generate
different traffic every time it is used.
The authors propose FlowPrint, a semi-supervised approach to mobile-app
fingerprinting on encrypted network traffic. It works in real time, recognizes the
apps running on the network, and detects previously unseen apps without prior
knowledge.
But how does it perform on the encrypted data of the examined datasets (ReCon,
Andrubis, Cross Platform), given the evolving, dynamic, and homogeneous nature of
the traffic?
APPROACH
The key idea of FlowPrint is to study the characteristics of the communication
patterns between apps and their destinations, and to treat each pattern as a
fingerprint.
This enables accurate real-time detection of applications without prior knowledge
or labeled data.
FlowPrint can also detect previously unseen apps: each new fingerprint is matched
against the known ones, and if it matches none of them, it belongs to a previously
unseen app.
To evaluate app detection, the authors used several datasets of encrypted network
traffic labeled per app. Together they cover different types of users, application
stores, operating systems, versions of the same app, and potentially harmful apps,
which makes the results more realistic.
To generate the fingerprints, FlowPrint extracts features from the traffic flows
and scores each of them with the Adjusted Mutual Information (AMI: “a metric for
scoring features in unsupervised learning”). [13]
The features fall into 4 categories:
1. Temporal features: the inter-flow timing and the packet inter-arrival time
(incoming).
2. Device features: the source IP address.
3. Destination features: the IP address of the server and various TLS certificate
fields.
4. Size features: both incoming and outgoing packet size features.
Only the features with a high AMI were retained.
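The feature-scoring step can be sketched as follows. This is only an illustration, assuming that each feature is discretized and compared against the app labels, and that AMI is normalized by the arithmetic mean of the two entropies (one of several variants); the feature names and values are hypothetical and are not taken from the datasets.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two labelings of the same samples."""
    n = len(u)
    cu, cv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(nij / n * log(n * nij / (cu[i] * cv[j]))
               for (i, j), nij in joint.items())


def expected_mi(u, v):
    """Expected MI of two random labelings with the same cluster sizes."""
    n = len(u)

    def log_fact(k):
        return lgamma(k + 1)

    total = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Log of the hypergeometric probability of this overlap.
                log_p = (log_fact(a) + log_fact(b)
                         + log_fact(n - a) + log_fact(n - b)
                         - log_fact(n) - log_fact(nij)
                         - log_fact(a - nij) - log_fact(b - nij)
                         - log_fact(n - a - b + nij))
                total += nij / n * log(n * nij / (a * b)) * exp(log_p)
    return total


def ami(u, v):
    """Adjusted Mutual Information, arithmetic-mean normalization."""
    mi, emi = mutual_information(u, v), expected_mi(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    return (mi - emi) / denom if abs(denom) > 1e-12 else 1.0


# Hypothetical flows: app label plus two discretized feature values.
apps = ["mail", "mail", "chat", "chat", "video", "video"]
dest_ip = ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2",
           "10.0.0.3", "10.0.0.3"]
noise = ["a", "b", "a", "b", "a", "b"]

scores = {"dest_ip": ami(apps, dest_ip), "noise": ami(apps, noise)}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

A feature whose values split the flows exactly as the app labels do reaches the maximum score, while a feature unrelated to the app scores at or below zero, which is why only high-AMI features are worth keeping.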
Fig. 1 gives a brief overview of the approach, which uses the features extracted
from the TCP/UDP flows in the network traces of each device.
FlowPrint first isolates browsers, using a separate detector trained on labeled
data that recognizes the flows they generate. Browsers can distort the results
because they rely on CDNs and follow a different communication pattern. For that
reason, it is better not to consider them when generating fingerprints.
After computing the cross-correlation [14] between destinations, the software
removes the weak edges between destinations and identifies the maximal cliques
(fully connected subgraphs) in the remaining graph.
A cross-correlation is considered weak if it is lower than a threshold τ-correlation.
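The pruning and clique-finding step can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the cross-correlation scores are already available as a dictionary of destination pairs, and it uses the basic Bron-Kerbosch recursion as one possible way to enumerate maximal cliques. The destination names, scores, and the value of the threshold are hypothetical.

```python
def strong_edges(correlation, tau):
    """Build an adjacency map, dropping edges whose correlation is below tau."""
    graph = {}
    for (a, b), score in correlation.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= tau:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical cross-correlation scores between destinations.
correlation = {
    ("api.example", "cdn.example"): 0.90,
    ("api.example", "ads.example"): 0.80,
    ("cdn.example", "ads.example"): 0.85,
    ("ads.example", "tracker.example"): 0.10,  # weak edge, removed
}
graph = strong_edges(correlation, tau=0.3)
cliques = maximal_cliques(graph)
for clique in cliques:
    print(sorted(clique))
```

In this toy graph the three strongly correlated destinations form one clique, while the weakly linked destination is left on its own.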
Once the weak correlations are removed, only the strong clusters remain, each
consisting of a group of TLS certificates and destinations. Such a cluster is strong
enough to be treated as a fingerprint, which can be matched against the fingerprints
already collected during earlier monitoring. FlowPrint decides whether a fingerprint
is new by computing the Jaccard similarity between the new fingerprint and the
existing ones [15].
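The matching step can be illustrated with a short sketch. It assumes a fingerprint is represented as a set of destination identifiers and that a fingerprint counts as a match when its best Jaccard similarity reaches a threshold; the function names, the threshold value of 0.5, and the example data are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_fingerprint(new_fp, known, tau_similarity=0.5):
    """Return (app, score) for the best match, or (None, score) if unseen.

    `known` maps an app name to a list of its stored fingerprints.
    `tau_similarity` is an illustrative threshold, not a value from the paper.
    """
    best_app, best_score = None, 0.0
    for app, fingerprints in known.items():
        for fp in fingerprints:
            score = jaccard(new_fp, fp)
            if score > best_score:
                best_app, best_score = app, score
    if best_score >= tau_similarity:
        return best_app, best_score
    return None, best_score


# Hypothetical stored fingerprints (sets of destinations).
known = {
    "chat": [{"api.chat.example", "cdn.example", "push.example",
              "ads.example"}],
    "mail": [{"imap.mail.example", "smtp.mail.example"}],
}

seen = match_fingerprint(
    {"api.chat.example", "cdn.example", "push.example"}, known)
unseen = match_fingerprint({"new.example", "other.example"}, known)
print(seen, unseen)
```

A fingerprint that shares no destination with any stored one gets a similarity of zero and is therefore reported as belonging to a previously unseen app.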
EVALUATION
The authors aimed to get the best performance out of the experiment, so they tuned
the parameters for a high F1 score. To put the performance of the approach in
context, they also compared it with one of the best existing tools (AppScanner
[7, 16]), as shown in Table 1.
The table shows that FlowPrint kept an excellent performance across the datasets
despite their differences:
ReCon (to test performance across different app versions released over a long
period of time), Cross Platform (to test differences in user-generated data), and
Andrubis (to test potentially harmful apps) [8, 9, 10, 11, 12].
(Flow: a TCP/UDP flow of data between a monitored device and the destination it
interacts with.)
The numbers in the table show that FlowPrint is on par with AppScanner, but there
is more to say. Precision is an important metric in ML experiments: it is the ratio
of correctly predicted positive observations to the total number of predicted
positive observations (Precision = TP / (TP + FP)).
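As a small worked example of this metric, the sketch below computes precision together with recall and the F1 score, using their standard definitions. The counts are hypothetical and chosen only so that the precision comes out near the value discussed in this section; they are not taken from the paper.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """F1 = harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts (NOT from the paper), for illustration only.
tp, fp, fn = 947, 53, 60

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.4f} recall={r:.4f} f1={f1_score(p, r):.4f}")
```

Note the parentheses in the denominator: without them the formula would read (TP / TP) + FP, which is not a ratio at all.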
Taking one number from the table as an example, the precision of 0.9470 on the
ReCon dataset means that about 95% of the app predictions made by FlowPrint were
correct. Thus, the approach is not only precise but also remains valid over a long
period of time, since it coped with the different app versions in that dataset.
FlowPrint can generate multiple confident fingerprints for the same app, because
the traffic of an app depends on how the users use it, and different usage produces
different flow patterns. As a result, the apps in Cross Platform have the highest
number of fingerprints. The nature of mobile network traffic (homogeneous, dynamic,
evolving) creates different challenges. The results of the experiment show that
FlowPrint is robust against the homogeneity and the dynamic nature of the traffic.
The evolving nature is handled less well. The authors noticed that if the labeled
data in the app detector is not updated after new versions are released, or after
some time has passed, the detection rate slowly drops because the destination-based
features change. This can be resolved by keeping the app detector data up to date.
The size of the training data also influences the performance, depending on how
many apps are installed on the devices. FlowPrint performs best with a few
installed apps; as the number of installed apps grows, the performance drops
slightly and then stabilizes.
The authors used a mid-range, commercially available laptop to run the software,
create fingerprints, and match them against the existing ones in the app detector.
From that experiment they conclude that FlowPrint is a practical and reasonable
real-time solution.
CONCLUSIONS
In conclusion, FlowPrint can generate fingerprints of mobile apps from encrypted
data traffic in real time.
It is robust against several challenges and against most malicious attempts to game
the system. It is a useful tool with many potential applications, which makes it a
possible solution to the problem of network security. It is not the first or the
only solution [2, 3, 4, 5, 6], nor is it perfect: its performance decreases as the
number of apps grows, it has difficulties with browser traffic, and it works only
with mobile applications. Still, it is a promising approach, and its suitability
depends on the purpose for which FlowPrint is used.
References
[1] Google. An Update on Android TLS Adoption.
https://security.googleblog.com/2019/12/an-update-on-android-tls-adoption.html, December 2019.
[2] Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, and Antonio Pescapé. Multi-Classification
Approaches for Classifying Mobile App Traffic. Journal of Network and Computer Applications, 2018.
[3] Khaled Al-Naami, Swarup Chandra, Ahmad Mustafa, Latifur Khan, Zhiqiang Lin, Kevin Hamlen, and Bhavani
Thuraisingham. Adaptive Encrypted Traffic Fingerprinting with Bi-Directional Dependence. In Proc. of the Annual
Computer Security Applications Conference (ACSAC), 2016.
[4] Hasan Faik Alan and Jasleen Kaur. Can Android Applications Be Identified Using Only TCP/IP Headers of
Their Launch Time Traffic? In Proc. of the ACM Conference on Security & Privacy in Wireless and Mobile
Networks (WiSec), 2016.
[5] Yi Chen, Wei You, Yeonjoon Lee, Kai Chen, XiaoFeng Wang, and Wei Zou. Mass Discovery of Android
Traffic Imprints through Instantiated Partial Execution. In Proc. of the ACM Conference on Computer and
Communications Security (CCS), 2017.
[6] Shuaifu Dai, Alok Tongaonkar, Xiaoyin Wang, Antonio Nucci, and Dawn Song. NetworkProfiler: Towards
Automatic Fingerprinting of Android Apps. In Proc. of the IEEE International Conference on Computer
Communications (INFOCOM), 2013.
[7] Vincent F. Taylor, Riccardo Spolaor, Mauro Conti, and Ivan Martinovic. AppScanner: Automatic
Fingerprinting of Smartphone Apps from Encrypted Network Traffic. In Proc. of the IEEE European Symposium
on Security and Privacy (EuroS&P), 2016.
[8] Martina Lindorfer, Matthias Neugschwandtner, Lukas Weichselbaum, Yanick Fratantonio, Victor Van Der
Veen, and Christian Platzer. Andrubis - 1,000,000 Apps Later: A View on Current Android Malware Behaviors. In
Proc. of the IEEE International Workshop on Building Analysis Datasets and Gathering Experience Returns for
Security (BADGERS), 2014.
[9] Jingjing Ren, Daniel J. Dubois, and David Choffnes. An International View of Privacy Risks for Mobile Apps,
2019.
[10] Jingjing Ren, Martina Lindorfer, Daniel Dubois, Ashwin Rao, David Choffnes, and Narseo
Vallina-Rodriguez. Bug Fixes, Improvements, ... and Privacy Leaks – A Longitudinal Study of PII Leaks Across
Android App Versions. In Proc. of the ISOC Network and Distributed System Security Symposium (NDSS), 2018.
[11] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and
Controlling PII Leaks in Mobile Network Traffic. In Proc. of the International Conference on Mobile Systems,
Applications and Services (MobiSys), 2016.
[12] Statcounter. Mobile Browser Market Share Worldwide.
https://gs.statcounter.com/browser-market-share/mobile/worldwide. Accessed: February 2019.
[13] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information Theoretic Measures for Clusterings
Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning
Research, 2010.
[14] Lawrence R. Rabiner and Bernard Gold. Theory and Application of Digital Signal Processing. Prentice
Hall, 1975.
[15] Paul Jaccard. The Distribution of the Flora of the Alpine Zone. New Phytologist, 1912.
[16] Vincent F Taylor, Riccardo Spolaor, Mauro Conti, and Ivan Martinovic. Robust Smartphone App
Identification via Encrypted Network Traffic Analysis. IEEE Transactions on Information Forensics and Security,
2018.