Malicious domains are one of the main resources used to mount attacks over the Internet. It is important to detect such activity by mining large-scale network traffic data and identifying malicious URLs, domains, or IPs. Attackers often take advantage of vulnerabilities in DNS to carry out activities such as stealing private information, spamming, phishing, and DDoS attacks, and they tend to bypass botnet detection by generating domain clusters with Domain Generation Algorithms (DGAs). We have billions of DNS records per day, so the Spark AI platform serves as an efficient distributed platform for processing and mining this huge amount of data. We work on the following two cybersecurity use cases.

1. Detect DGA, porn, and gambling domains. Each malware-compromised host machine issues a large number of DNS requests in sequential order. The domain names are either generated by DGAs or preserve particular string patterns by design. We use Spark to generate sequences of requested domains and use Word2Vec to estimate embeddings of the domains. We then estimate similarity in the embedding space, and the domains most similar to known malicious ones are flagged as potential malicious domains.

2. Detect cryptocurrency mining pool domains. Attackers are interested in accessing computing resources to mine cryptocurrency. Malware-infected computers are directed to attacker-controlled mining pool domains. This type of DNS request does not preserve sequential order and is relatively random. Since each mining pool domain cluster is visited by a wide range of different host machines, we use LSH to evaluate the similarity among sets of hosts. LSH generates a domain-bucket bipartite graph, and the FastUnfolding algorithm is used to discover the domain clusters.

We leveraged Spark AI for large-scale DNS data analysis and discovered hundreds of thousands of malicious domains each day at high precision.
Authors: Ting Chen, Hao Guo
3. About The Speakers
Hao Guo
• Applied Research Scientist @ Tencent Security
• Master's degree in Computer Science from HIT, with research
interests in NLP, deep learning, and large-scale machine
learning
Ting Chen
• Director, Applied Machine Learning @ Tencent Jarvis Lab
• PhD in Computer Science from UFL, with research
interests in computer vision and machine learning
• Previously, Senior ML engineer and DS manager at Uber
3#UnifiedAnalytics #SparkAISummit
26. LSH based detection
Host IP set accessing domain1: $S_1 = \{IP_0, IP_1, IP_2, IP_3\}$
Host IP set accessing domain2: $S_2 = \{IP_1, IP_2, IP_3, IP_4\}$
The similarity of domain1 and domain2 is the Jaccard similarity of their victim-host sets:
$$sim(domain1, domain2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{|\{IP_1, IP_2, IP_3\}|}{|\{IP_0, IP_1, IP_2, IP_3, IP_4\}|} = \frac{3}{5}$$
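The calculation on this slide can be reproduced in a few lines. This is a minimal sketch in plain Python; the function name `jaccard` and the string labels for the IPs are illustrative choices, not part of the original talk.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; returns 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Host IP sets from the slide (placeholder labels, not real addresses).
s1 = {"IP0", "IP1", "IP2", "IP3"}  # hosts that queried domain1
s2 = {"IP1", "IP2", "IP3", "IP4"}  # hosts that queried domain2

similarity = jaccard(s1, s2)
print(similarity)  # 0.6, i.e. 3/5
```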
27. Why LSH
• High dimensional and sparse
– tens of millions of hosts
• O(N*N) comparisons
– millions of unique domains
• Spark provides Locality-Sensitive Hashing for fast near-duplicate detection
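To make the O(N*N) cost concrete, the number of unordered domain pairs can be counted directly. The value of one million domains below is an assumed round figure for illustration, not a number reported in the talk.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_domains = 1_000_000  # assumed illustrative value
print(pair_count(n_domains))  # 499999500000 pairwise Jaccard comparisons
```

Each of those comparisons would also involve a sparse vector over tens of millions of hosts, which is why an approximate method is attractive here.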
28. LSH
If domain1 and domain2 are similar, then with high probability they are
hashed into the same bucket.
If domain1 and domain2 are dissimilar, then with high probability they are
hashed into different buckets.
29. Minhash and Jaccard Similarity
• There is a suitable hash function for
Jaccard similarity: minhash
• The probability that
minhash(domain1) = minhash(domain2)
is equal to
Jaccard(domain1, domain2)
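The property above suggests a simple estimator: apply many independent minhash functions and count how often the two sets agree. The following is a rough sketch under stated assumptions; the seeded SHA-1 hash `h`, the helper names, and the choice of 512 hash functions are all illustrative and are not taken from the talk or from Spark.

```python
import hashlib


def h(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of item; each seed acts as a separate hash function."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set, num_hashes: int) -> list:
    """For each seeded hash function, keep the minimum hash over a non-empty set."""
    return [min(h(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions where the two minhash values agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


s1 = {"IP0", "IP1", "IP2", "IP3"}  # hosts that queried domain1
s2 = {"IP1", "IP2", "IP3", "IP4"}  # hosts that queried domain2

sig1 = minhash_signature(s1, 512)
sig2 = minhash_signature(s2, 512)
estimate = estimate_jaccard(sig1, sig2)
print(estimate)  # an estimate of the true Jaccard similarity, 3/5
```

More hash functions tighten the estimate at the cost of longer signatures.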
33. Key Functions
Domain-host Mapping Generation
Input RDD[(IP, domain)] // IP and domain of each DNS query
zipWithIndex, join input into RDD[(IP_id, domain_id)] // replace IPs and domains with integer ids
map, combineByKey into RDD[(domain_id, List[IP_ids])] // each domain with the set of IPs that queried it
map into RDD[(domain_id, sparseVector(IP_ids))] // convert the IP id list into a sparse vector
Output RDD[(domain_id, sparseVector(IP_ids))]
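The steps on this slide can be mirrored on a toy dataset. This sketch uses plain Python collections as a stand-in for the Spark RDD operations, so it shows the data flow only, not the distributed implementation; the sample records and the helper names `build_index` and `to_sparse` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (IP, domain) pairs taken from DNS queries.
records = [
    ("10.0.0.1", "a.example"),
    ("10.0.0.2", "a.example"),
    ("10.0.0.2", "b.example"),
    ("10.0.0.3", "b.example"),
    ("10.0.0.1", "a.example"),  # repeated query, collapsed by the set below
]


def build_index(values):
    """Assign a dense integer id to each distinct value (the role of zipWithIndex)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


ip_index = build_index(ip for ip, _ in records)
domain_index = build_index(domain for _, domain in records)

# Group IP ids by domain id (the role of combineByKey).
hosts_by_domain = defaultdict(set)
for ip, domain in records:
    hosts_by_domain[domain_index[domain]].add(ip_index[ip])


def to_sparse(ip_ids, size):
    """Binary sparse vector stored as (size, sorted indices of the non-zero entries)."""
    return (size, sorted(ip_ids))


sparse_by_domain = {
    d: to_sparse(ips, len(ip_index)) for d, ips in hosts_by_domain.items()
}
print(sparse_by_domain)  # {0: (3, [0, 1]), 1: (3, [1, 2])}
```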
34. Key Functions
LSH
Input RDD[(domain_id, sparseVector(IP_ids))] // domain_id and high-dimensional sparse vector
MinHashLSH into RDD[(domain_id, List(hashvalues))] // reduce the dimension to hundreds
flatMap into bucket RDD[(domain_id, List(bucket_id))] // map similar domains into the same bucket
FastUnfolding // self-implemented function
Output RDD[(domain_id, bucket_id)]
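The bucketing and clustering flow can be sketched on toy data. Several assumptions apply: the seeded SHA-1 hash is illustrative, a bucket id is taken to be a (hash table index, minhash value) pair, and connected components via union-find are used as a much simpler stand-in for the speakers' self-implemented FastUnfolding step, which is a community detection algorithm and is not reproduced here.

```python
import hashlib
from collections import defaultdict


def h(seed, item):
    """Deterministic 64-bit hash; each seed acts as one hash table's hash function."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_ids(ip_ids, num_tables):
    """One bucket per hash table: (table index, minhash of the host set)."""
    return [(t, min(h(t, ip) for ip in ip_ids)) for t in range(num_tables)]


# Hypothetical domain_id -> set of IP ids that queried the domain.
hosts_by_domain = {
    0: {1, 2, 3, 4},
    1: {1, 2, 3, 4},      # same hosts as domain 0
    2: {100, 101, 102},   # unrelated hosts
}

# Domain-bucket bipartite graph, stored as bucket -> domains.
domains_by_bucket = defaultdict(set)
for domain_id, ip_ids in hosts_by_domain.items():
    for bucket in bucket_ids(ip_ids, num_tables=8):
        domains_by_bucket[bucket].add(domain_id)

# Union-find over domains: domains sharing any bucket end up in one cluster.
parent = {d: d for d in hosts_by_domain}


def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


for members in domains_by_bucket.values():
    first, *rest = sorted(members)
    for other in rest:
        root_a, root_b = find(first), find(other)
        if root_a != root_b:
            parent[root_b] = root_a

clusters = defaultdict(set)
for d in hosts_by_domain:
    clusters[find(d)].add(d)
print(sorted(sorted(c) for c in clusters.values()))
```

Connected components merge any two domains that share a single bucket, so on real data they over-merge compared with a modularity-based method; the sketch only illustrates how the bipartite graph turns into clusters.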