This talk presents apkfile, a library for extracting machine learning features from Android apps, and describes several high-value features for malware detection, such as compiler fingerprinting, anti-VM detection, and Markov models for detecting unusual strings. It also offers tips for improving model performance through data preparation, feature selection, model tuning, and model blending.
2. WHO AM I
• Researcher @ SentinelOne
• Previously @ Lookout and @SourceClear
• Enjoy reading, cryptocurrency, economics
• Made Simplify and other Android tools
• @caleb_fenton
• github.com/CalebFenton
CALEB
3. WHO ARE WE
• rednaga.io
• Banded together by the love of 0days and hot sauces
• Collaborate and try to improve the community
• Disclosures / Code / Lessons on GitHub
• @RedNagaSec
• github.com/RedNaga
RED NAGA
4. TALK OVERVIEW
1. Machine learning overview
2. Using apkfile for feature extraction
3. Useful features for Android malware
4. Tips for building good models
6. STEP 1: UNDERSTAND THE FORMAT
• Android apps come as APK files
• APKs are just ZIPs
• APKs are rich with variety
• Android manifest / binary XML
• Dalvik executables
• Signing certificates
• Other resources (icons, maps, sounds, …)
• Offensive & Defensive Android Reverse Engineering
github.com/rednaga/training/tree/master/DEFCON23
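Because an APK is an ordinary ZIP archive, its contents can be inventoried with the standard library alone. The sketch below is a minimal illustration, not part of apkfile: the class name ApkEntryKinds and the classification rules are assumptions based on the conventional APK layout (AndroidManifest.xml and classes*.dex at the archive root, v1 signature blocks under META-INF/). Obfuscated apps may deliberately break these conventions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of apkfile.
class ApkEntryKinds {
    enum Kind { MANIFEST, DEX, SIGNATURE, OTHER }

    // Classify a ZIP entry name using the conventional APK layout (an assumption).
    static Kind classify(String name) {
        if (name.equals("AndroidManifest.xml")) {
            return Kind.MANIFEST;
        }
        // DEX files normally sit at the archive root: classes.dex, classes2.dex, ...
        if (!name.contains("/") && name.endsWith(".dex")) {
            return Kind.DEX;
        }
        String upper = name.toUpperCase(Locale.ROOT);
        if (upper.startsWith("META-INF/")
                && (upper.endsWith(".RSA") || upper.endsWith(".DSA") || upper.endsWith(".EC"))) {
            return Kind.SIGNATURE;
        }
        return Kind.OTHER;
    }

    // Count how many non-directory entries of each kind the archive contains.
    static Map<Kind, Integer> countKinds(InputStream apkBytes) throws IOException {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        try (ZipInputStream zip = new ZipInputStream(apkBytes)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    counts.merge(classify(entry.getName()), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: ApkEntryKinds <path-to-apk>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(countKinds(in));
        }
    }
}
```

Counts like these (for example, the number of DEX files) can already serve as simple features before any deeper parsing.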
7. STEP 2: COLLECT SAMPLES
• Need lots of good and bad samples
• Diversity of good and bad is important
• Sample sources:
• VirusTotal, VirusShare, market crawlers, other researchers, friends
10. STEP 3: ENGINEER FEATURES
Example 1
App label: MX Player Pro
Package: com.mxtech.videoplayer.pro
Certificate: CN=Kim Jae hyun, O=MX Technologies, L=Seoul, ST=South Korea, C=KR
Example 2
App label: Google Service Updater
Package: it.googleandroid.updater
Certificate: CN=GService inc, OU=G Service inc, O=G, L=New York, ST=New York, C=US
11. STEP 3: ENGINEER FEATURES
• Certificate details - common name, country, …
• Suspicious strings - “pm uninstall”, “google”
• Permissions - which ones and how many
• API calls - send SMS, load DEX file
• Overall app quality - default icons, typos
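To make the feature ideas above concrete, here is a minimal sketch of turning certificate details, permissions, and suspicious strings into a named numeric feature map. Everything in it is an assumption for illustration: the class name SimpleApkFeatures, the feature names, and the two-entry watch list (taken from the slide). It parses the certificate subject with the JDK's javax.naming.ldap.LdapName rather than with apkfile.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical helper, not part of apkfile.
class SimpleApkFeatures {
    // Illustrative watch list; a real one would be much longer.
    static final List<String> SUSPICIOUS = List.of("pm uninstall", "google");

    // Parse a distinguished name such as "CN=..., O=..., C=US" into a type-to-value map.
    static Map<String, String> parseDn(String dn) {
        Map<String, String> parts = new LinkedHashMap<>();
        try {
            for (Rdn rdn : new LdapName(dn).getRdns()) {
                parts.put(rdn.getType().toUpperCase(Locale.ROOT), String.valueOf(rdn.getValue()));
            }
        } catch (InvalidNameException e) {
            // A malformed subject is itself a potentially useful signal.
            parts.put("PARSE_ERROR", "1");
        }
        return parts;
    }

    // Build a sparse feature map: categorical values become one-hot style keys.
    static Map<String, Double> extract(String certSubject, List<String> permissions, List<String> strings) {
        Map<String, Double> features = new LinkedHashMap<>();
        Map<String, String> dn = parseDn(certSubject);
        features.put("cert.country=" + dn.getOrDefault("C", "?"), 1.0);
        features.put("cert.has_ou", dn.containsKey("OU") ? 1.0 : 0.0);
        features.put("permission.count", (double) permissions.size());
        for (String permission : permissions) {
            features.put("permission=" + permission, 1.0);
        }
        for (String needle : SUSPICIOUS) {
            long hits = strings.stream()
                    .filter(s -> s.toLowerCase(Locale.ROOT).contains(needle))
                    .count();
            features.put("string.contains=" + needle, (double) hits);
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "CN=GService inc, OU=G Service inc, O=G, L=New York, ST=New York, C=US",
                List.of("android.permission.SEND_SMS"),
                List.of("pm uninstall com.example.av")));
    }
}
```

Keeping features in a name-to-value map makes it easy to inspect what the model sees before the map is converted to a sparse matrix row.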
12. STEP 4: BUILD AND TUNE MODELS
• Collect and prepare data
• Drop low value features
• Try many algorithms
• Train and blend multiple models
15. WHAT IS APKFILE?
• APK feature extraction library (Java)
• github.com/CalebFenton/apkfile
• Parses DEX files (dexlib2)
• Parses APK certificates
• Parses Android manifest (based on ArscBlamer)
• Hardened for use against obfuscation
• Everything is an object for easy inspection
16. EXAMPLE: ANDROID MANIFEST
ApkFile apkFile = new ApkFile("someapp.apk");
AndroidManifest androidManifest = apkFile.getAndroidManifest();
// Get some manifest properties
String packageName = androidManifest.getPackageName();
String appLabel = androidManifest.getApplication().getLabel();
// Print permission names
for (Permission permission : androidManifest.getPermissions()) {
    System.out.println("permission: " + permission.getName());
}
// Print exported services
for (Service service : androidManifest.getApplication().getServices()) {
    if (service.isExported()) {
        System.out.println("exported: " + service.getName());
    }
}
17. EXAMPLE: APK CERTIFICATE
ApkFile apkFile = new ApkFile("example-malware.apk");
Certificate certificate = apkFile.getCertificate();
Collection<Certificate.SubjectAndIssuerRdns> allRdns =
        certificate.getAllRdns();
// APK may be signed by multiple certificates
for (Certificate.SubjectAndIssuerRdns rdns : allRdns) {
    Map<String, String> subjectRdns = rdns.getSubjectRdns();
    // Get certificate subject CN and O properties
    System.out.println("Subject common name: " + subjectRdns.get("CN"));
    System.out.println("Subject organization: " + subjectRdns.get("O"));
    // Print all certificate properties
    System.out.println("Issuer RDNS: " + rdns.getIssuerRdns());
}
18. EXAMPLE: DALVIK EXECUTABLES
// apkFile is the ApkFile instance created in the earlier examples
Map<String, DexFile> pathToDexFile = apkFile.getDexFiles();
for (Map.Entry<String, DexFile> e : pathToDexFile.entrySet()) {
    String path = e.getKey();
    DexFile dexFile = e.getValue();
    System.out.println("Analyzing " + path);
    dexFile.analyze();
    // Average cyclomatic complexity, also available for each method
    System.out.println("Cyclomatic complexity: " + dexFile.getCyclomaticComplexity());
    // Get API call counts over all methods
    // Trove maps avoid boxing, which makes incrementing counts faster
    TObjectIntIterator<MethodReference> iterator = dexFile.getApiCounts().iterator();
    while (iterator.hasNext()) {
        iterator.advance();
        MethodReference methodRef = iterator.key();
        int count = iterator.value();
        // E.g. Ljava/lang/StringBuilder;->toString called 18 times
        System.out.println(methodRef + " called " + count + " times");
    }
    // Print op code histograms for each method
    for (Map.Entry<String, DexMethod> me : dexFile.getMethodDescriptorToMethod().entrySet()) {
        String methodDescriptor = me.getKey();
        // E.g. Lit/googleandroid/updater/a;->a(Ljava/lang/String;)Ljava/lang/String; op counts
        System.out.println(methodDescriptor + " op counts");
        DexMethod dexMethod = me.getValue();
        TObjectIntIterator<Opcode> opIter = dexMethod.getOpCounts().iterator();
        while (opIter.hasNext()) {
            opIter.advance();
            // E.g. MOVE_RESULT_OBJECT: 46
            System.out.println(" " + opIter.key() + ": " + opIter.value());
        }
    }
}
20. ANDROID MANIFEST
• Has main launcher activity
• No launcher implies no user interaction
• Number of activity package paths
• Malicious activities injected?
• Permissions / number of permissions
• Good clue what app may do
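The "number of activity package paths" feature can be sketched with plain collections. This is an illustrative assumption of how such a feature might be computed, not apkfile's implementation: the class ManifestFeatures and its methods are hypothetical, and the input is assumed to be fully qualified activity class names already read from the manifest.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of apkfile.
class ManifestFeatures {
    // Package portion of a fully qualified class name ("" if there is none).
    static String packageOf(String className) {
        int lastDot = className.lastIndexOf('.');
        return lastDot < 0 ? "" : className.substring(0, lastDot);
    }

    // Number of distinct packages that declare activities.
    static int countActivityPackagePaths(Collection<String> activityClassNames) {
        Set<String> packages = new HashSet<>();
        for (String name : activityClassNames) {
            packages.add(packageOf(name));
        }
        return packages.size();
    }

    // Number of activities that live outside the app's own package.
    static int countOutsideAppPackage(String appPackage, Collection<String> activityClassNames) {
        int outside = 0;
        for (String name : activityClassNames) {
            if (!name.startsWith(appPackage + ".")) {
                outside++;
            }
        }
        return outside;
    }

    public static void main(String[] args) {
        List<String> activities = List.of(
                "com.example.app.MainActivity",
                "com.example.app.SettingsActivity",
                "com.injected.ads.PopupActivity");
        System.out.println("package paths: " + countActivityPackagePaths(activities));
        System.out.println("outside app package: " + countOutsideAppPackage("com.example.app", activities));
    }
}
```

Legitimate apps also declare activities from third-party SDKs, so these counts are weak signals on their own and are best left for the model to weigh against other features.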
21. APKID FEATURES
• “PEiD for Android” - detects compilers, packers, …
• Compiler - dx (native) / dexlib (modified)
• Anti-VM strings - avoiding VM analysis
• Build.MANUFACTURER, SIM operator, device ID, subscriber ID
• Detecting Pirated and Malicious Android Apps with APKiD
rednaga.io/2016/07/31/detecting_pirated_and_malicious_android_apps_with_apkid/
22. STRINGS
• Number of gibberish strings
• Find weird certificate details
• Find unusual obfuscation
• Using Markov Chains for Android Malware Detection
calebfenton.github.io/2017/08/23/using-markov-chains-for-android-malware-detection/
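One way to count gibberish strings is a character-level Markov model: learn how often each character follows another in normal text, then score a candidate by its average log transition probability. The sketch below is a simplified illustration and not necessarily the model from the linked post. The class MarkovGibberish, the 28-symbol alphabet, the add-one smoothing, and any threshold passed to countGibberish are all assumptions.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical character-bigram model; a simplification for illustration.
class MarkovGibberish {
    // Symbols: a-z (0-25), space (26), and one bucket for everything else (27).
    private static final int K = 28;
    private final int[][] counts = new int[K][K];
    private final int[] rowTotals = new int[K];

    private static int symbol(char c) {
        if (c >= 'a' && c <= 'z') {
            return c - 'a';
        }
        return c == ' ' ? 26 : 27;
    }

    // Count character-to-character transitions in known-good text.
    void train(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 1 < s.length(); i++) {
            int from = symbol(s.charAt(i));
            int to = symbol(s.charAt(i + 1));
            counts[from][to]++;
            rowTotals[from]++;
        }
    }

    // Average log probability per transition, with add-one smoothing.
    // Higher (closer to zero) means the string looks more like the training text.
    double score(String candidate) {
        String s = candidate.toLowerCase(Locale.ROOT);
        if (s.length() < 2) {
            return 0.0; // no transitions to judge
        }
        double sum = 0.0;
        for (int i = 0; i + 1 < s.length(); i++) {
            int from = symbol(s.charAt(i));
            int to = symbol(s.charAt(i + 1));
            sum += Math.log((counts[from][to] + 1.0) / (rowTotals[from] + K));
        }
        return sum / (s.length() - 1);
    }

    // Feature: how many strings score below the chosen threshold.
    long countGibberish(List<String> strings, double threshold) {
        return strings.stream().filter(s -> score(s) < threshold).count();
    }

    public static void main(String[] args) {
        MarkovGibberish model = new MarkovGibberish();
        model.train("please download the update and install the application to continue");
        System.out.println("download: " + model.score("download"));
        System.out.println("xzqjvkqw: " + model.score("xzqjvkqw"));
    }
}
```

In practice the model would be trained on a large corpus of strings from benign apps, and the threshold chosen by looking at the score distributions of known-good and known-bad samples.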
24. TIPS
• Most guides are for toy data sets
• No one talks about large data set problems
• Everyone assumes you have a dense matrix
• Tips assume sklearn, but apply to other libraries
25. PREPARING DATA
• Normalization is important
• Scale with MaxAbs or MinMax if many 0s
• Needed for some algorithms (not decision trees)
• Needed for dropping invariant features (variance thresholds assume a common scale)
• Drop invariant (near-constant) features
• Reduces chance of overfitting
• Example: file hash, app label, rare API calls
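The two preparation steps above can be sketched without any library: scale each column by its maximum absolute value (which keeps zeros at zero, so sparse data stays sparse), then drop columns whose variance falls at or below a threshold. This is a dense-array illustration of the idea behind sklearn's MaxAbsScaler and VarianceThreshold, not a port of either; the class DataPrep and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating max-abs scaling and variance-based filtering.
class DataPrep {
    // Divide each column by its maximum absolute value; all-zero columns stay zero.
    static double[][] maxAbsScale(double[][] x) {
        int cols = x.length == 0 ? 0 : x[0].length;
        double[] maxAbs = new double[cols];
        for (double[] row : x) {
            for (int j = 0; j < cols; j++) {
                maxAbs[j] = Math.max(maxAbs[j], Math.abs(row[j]));
            }
        }
        double[][] scaled = new double[x.length][cols];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < cols; j++) {
                scaled[i][j] = maxAbs[j] == 0.0 ? 0.0 : x[i][j] / maxAbs[j];
            }
        }
        return scaled;
    }

    // Indices of columns whose population variance is strictly above the threshold.
    static int[] columnsAboveVariance(double[][] x, double threshold) {
        int cols = x.length == 0 ? 0 : x[0].length;
        List<Integer> keep = new ArrayList<>();
        for (int j = 0; j < cols; j++) {
            double mean = 0.0;
            for (double[] row : x) {
                mean += row[j];
            }
            mean /= x.length;
            double variance = 0.0;
            for (double[] row : x) {
                variance += (row[j] - mean) * (row[j] - mean);
            }
            variance /= x.length;
            if (variance > threshold) {
                keep.add(j);
            }
        }
        return keep.stream().mapToInt(Integer::intValue).toArray();
    }

    // Copy only the chosen columns into a new matrix.
    static double[][] selectColumns(double[][] x, int[] columns) {
        double[][] out = new double[x.length][columns.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < columns.length; j++) {
                out[i][j] = x[i][columns[j]];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] raw = {{0, 10, 5}, {0, 0, 5}, {2, 0, 5}, {0, 0, 5}};
        double[][] scaled = maxAbsScale(raw);
        int[] keep = columnsAboveVariance(scaled, 0.0);
        System.out.println("kept " + keep.length + " of " + raw[0].length + " columns");
    }
}
```

Scaling first matters because a raw count in the thousands and a 0/1 flag have very different variances; after scaling, one threshold can be applied to every column.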
26. SELECTING FEATURES
• Score features and plot scores to build intuition
• Usually long tail of useless features
• Gives ideas for new features
• Top 100 features almost as good as top 1000
• Run experiments with subsets of features
• Improves speed
• Only interested in relative differences
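As a rough illustration of scoring features and keeping only the top k, the sketch below ranks columns by the absolute difference between their mean in malicious samples and their mean in benign samples. That score is a deliberately simple stand-in chosen for this example, not the scoring function used in the talk; the class FeatureRanking, the 0/1 label convention, and the method names are assumptions. Features are assumed to be scaled already.

```java
import java.util.Arrays;

// Hypothetical helper: rank features by a simple class-separation score.
class FeatureRanking {
    // Score per column: |mean over label-1 rows - mean over label-0 rows|.
    static double[] meanDifferenceScores(double[][] x, int[] labels) {
        int cols = x.length == 0 ? 0 : x[0].length;
        double[] sumPositive = new double[cols];
        double[] sumNegative = new double[cols];
        int positives = 0;
        int negatives = 0;
        for (int i = 0; i < x.length; i++) {
            boolean positive = labels[i] == 1;
            if (positive) {
                positives++;
            } else {
                negatives++;
            }
            for (int j = 0; j < cols; j++) {
                if (positive) {
                    sumPositive[j] += x[i][j];
                } else {
                    sumNegative[j] += x[i][j];
                }
            }
        }
        double[] scores = new double[cols];
        for (int j = 0; j < cols; j++) {
            double meanPositive = positives == 0 ? 0.0 : sumPositive[j] / positives;
            double meanNegative = negatives == 0 ? 0.0 : sumNegative[j] / negatives;
            scores[j] = Math.abs(meanPositive - meanNegative);
        }
        return scores;
    }

    // Indices of the k highest-scoring features, best first (ties keep index order).
    static int[] topK(double[] scores, int k) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        int n = Math.min(k, order.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 0, 0.5}, {1, 1, 0.5}, {0, 0, 0.5}, {0, 1, 0.5}};
        int[] labels = {1, 1, 0, 0};
        double[] scores = meanDifferenceScores(x, labels);
        System.out.println("scores: " + Arrays.toString(scores));
        System.out.println("top 2: " + Arrays.toString(topK(scores, 2)));
    }
}
```

Printing or plotting the sorted scores is usually enough to see the long tail, and running the same experiment with the top 100 and top 1000 indices gives the relative comparison the slide describes.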
27. BUILDING MODELS
• Grid search to find best algorithms and parameters
• Iterate on several, smaller searches
• Decision tree ensembles aren’t hip, but work well
sentinelone.com/blog/detecting-malware-pre-execution-static-analysis-machine-learning/
• Build and blend multiple models
sentinelone.com/blog/measuring-the-usefulness-of-multiple-models/
• Feature Selection and Grid Searching Hyper-parameters
gist.github.com/CalebFenton/66aa04af7b4a4d98efca059cb8c2e7aa
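Grid searching and blending are both simple enough to sketch in a few lines. The code below is a library-free illustration and is not the code from the linked gist: the class ModelTuning is hypothetical, parameters are restricted to numeric values for brevity, the scorer stands in for a cross-validated metric, and models are represented as plain functions from a feature vector to a malware probability.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical helper illustrating exhaustive grid search and weighted blending.
class ModelTuning {
    // Try every combination of parameter values and return the highest-scoring one.
    static Map<String, Double> gridSearch(Map<String, List<Double>> grid,
                                          ToDoubleFunction<Map<String, Double>> scorer) {
        // Build the Cartesian product one parameter at a time.
        List<Map<String, Double>> combos = new ArrayList<>();
        combos.add(new LinkedHashMap<>());
        for (Map.Entry<String, List<Double>> parameter : grid.entrySet()) {
            List<Map<String, Double>> next = new ArrayList<>();
            for (Map<String, Double> partial : combos) {
                for (Double value : parameter.getValue()) {
                    Map<String, Double> combo = new LinkedHashMap<>(partial);
                    combo.put(parameter.getKey(), value);
                    next.add(combo);
                }
            }
            combos = next;
        }
        Map<String, Double> best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map<String, Double> combo : combos) {
            double score = scorer.applyAsDouble(combo); // e.g. mean cross-validation score
            if (score > bestScore) {
                bestScore = score;
                best = combo;
            }
        }
        return best;
    }

    // Blend several models by taking a weighted average of their probabilities.
    static double blend(List<ToDoubleFunction<double[]>> models, double[] weights, double[] features) {
        if (models.size() != weights.length) {
            throw new IllegalArgumentException("one weight per model is required");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < weights.length; i++) {
            weightedSum += weights[i] * models.get(i).applyAsDouble(features);
            totalWeight += weights[i];
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> grid = new LinkedHashMap<>();
        grid.put("maxDepth", List.of(2.0, 4.0, 8.0));
        grid.put("trees", List.of(10.0, 100.0));
        // Toy scorer standing in for a real evaluation.
        System.out.println(gridSearch(grid, p -> -Math.abs(p.get("maxDepth") - 4.0) + p.get("trees") / 1000.0));
    }
}
```

Because the number of combinations multiplies with every added parameter, several small grids refined in turn are usually cheaper than one large grid, which matches the advice to iterate on smaller searches.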