Large-Scale Malicious Domain Detection with Spark AI

Hao Guo, Tencent
Ting Chen, Tencent
#UnifiedAnalytics #SparkAISummit

About The Speakers
Hao Guo
• Applied Research Scientist @ Tencent Security
• Master's degree in Computer Science from HIT, with research interests in NLP, deep learning, and large-scale machine learning
Ting Chen
• Director, Applied Machine Learning @ Tencent Jarvis Lab
• PhD in Computer Science from UFL, with research interests in computer vision and machine learning
• Previously, Senior ML engineer and DS manager at Uber

Agenda
• DDoS Attack & Advanced Persistent Threat
• Sequence based detection
• Crypto Mining Malware
• Locality-Sensitive Hashing based detection
• Conclusion

What is DDoS Attack
https://medium.com/@kapil.sharma91812/understanding-ddos-attack

DDoS Attack Trend
2018 https://securelist.com/ddos-attacks-in-q4-2018/89565/

DDoS Attack Trend
http://francescomolfese.it/en/2018/12/la-protezione-da-attacchi-ddos-in-azure/

What is APT
Symantec APT white paper

APT Activities
https://www.fireeye.com/current-threats/annual-threat-report.html

Behind The Attacks: C&C

Behind The Attacks: DGA

Malicious Domain Detection
Scenario 1: DGA
uuybcc.com
igmgdc.com
lpppxa.com
swdosv.com
grevun.com
djiyei.com
cvevrm.com
vyjyui.com
[Figure: victim hosts and the sequences of domains accessed by each host]

Crypto Mining Malware

Malicious Domain Detection
Scenario 2: Crypto mining domains
cab217f6.space
6850c644.space
cbb21989.space
c8b214d0.space
ceb21e42.space
cfb21fd5.space
c9b21663.space
d4b227b4.space
[Figure: victim hosts]

Data Scale @ Tencent
DNS Records: billions per day
Domains: billions per day
IPs/Hosts: tens of millions per day
Storage: TB
Spark enables large-scale data analysis.

Sequence Based Detection

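[Figure: each domain is treated as a word, and the sequence of domains accessed by a victim host as a sentence]

[Figure: the domain sequences of all victim hosts together form the document]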

Domain2Vec Representation
Key Idea
• Estimate the domain2vec (word2vec) representation of each domain with CBOW
CBOW framework
• Predict domain i from its context domains i-2, i-1, i+1, i+2
Example domain2vec representation
• uymbkc.com
-0.10 0.58 -0.04 … 0.29 0.26 -0.17
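
As a rough illustration of the CBOW setup above, the sketch below enumerates (context, target) pairs from one host's domain sequence. The function name `cbow_pairs` and the window of 2 are assumptions chosen for illustration; they are not taken from the talk.

```python
from typing import List, Tuple


def cbow_pairs(sequence: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Return (context domains, target domain) pairs for one host's sequence."""
    pairs = []
    for i, target in enumerate(sequence):
        # Up to `window` domains before and after position i form the context.
        left = sequence[max(0, i - window):i]
        right = sequence[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


sequence = ["uuybcc.com", "igmgdc.com", "lpppxa.com", "swdosv.com", "grevun.com"]
pairs = cbow_pairs(sequence, window=2)
```

In CBOW training, each context is used to predict its target domain, so domains that keep appearing in the same neighbourhoods end up with similar vectors.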

Domain Clustering
• Start with seed domains (known malicious
domains)
• Find most similar domains
\[ \mathit{similarity} = \cos\theta = \frac{\mathit{domainVec}_1 \cdot \mathit{domainVec}_2}{\lVert \mathit{domainVec}_1 \rVert \, \lVert \mathit{domainVec}_2 \rVert} \]
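
The seed-based lookup above can be sketched in plain Python. This is a minimal illustration, not the talk's implementation: the two-dimensional vectors are made up, `example.org` is a hypothetical benign domain, and `most_similar` is an illustrative name.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(seed: str, vectors: Dict[str, List[float]], top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank every other domain by cosine similarity to the seed domain."""
    seed_vec = vectors[seed]
    scored = [(d, cosine_similarity(seed_vec, v)) for d, v in vectors.items() if d != seed]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy vectors, invented for this example only.
vectors = {
    "uuybcc.com": [1.0, 0.0],
    "igmgdc.com": [0.9, 0.1],
    "example.org": [0.0, 1.0],
}
neighbours = most_similar("uuybcc.com", vectors, top_k=1)
```

Starting from a known malicious seed, the top-ranked neighbours are the candidates to review as members of the same family.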

Domain Clustering

Example Clustering Results
Virus Domains
uuybcc.com
igmgdc.com
lpppxa.com
swdosv.com
grevun.com
djiyei.com
cvevrm.com
vyjyui.com
... ...
Nymaim Domains
pjbgwwt.ru
mphsnkjgnfh.biz
nkposroyfkr.net
uusuux.ru
dsvlvlnkcj.ru
jtduakh.ru
... ...
Conficker Domains
kicuxexj.org
xiaxyvyn.net
jhpruj.biz
cvlyfcz.org
dsvevamq.biz
blqdisrp.cc
eujyvcvj.org
... ....

Implementation Framework
billions/day

Key Functions
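from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import format_number as fmt  # assumed: the slide does not show where fmt comes from
from pyspark.ml.feature import Word2Vec, Word2VecModel

# Domain Sequence Generation
# to_list, append, extend and sortbytime are user-defined helpers not shown on the slide.
domain_sequence_rdd = input_rdd.combineByKey(to_list, append, extend, numPartitions=N).mapValues(sortbytime)
# Keep only each host's domain list (assumes sortbytime returns a list of domain names);
# Word2Vec expects an array-of-strings column.
domain_sequence_rdd = domain_sequence_rdd.map(lambda r: Row(r[1]))
domain_sequence = spark.createDataFrame(domain_sequence_rdd, ["domainSequence"])

# Domain2Vec Calculation
domain2vec = Word2Vec(vectorSize=100, minCount=3, numPartitions=N, seed=42,
                      inputCol="domainSequence", outputCol="model", windowSize=8, maxSentenceLength=1000)
model = domain2vec.fit(domain_sequence)

# Similarity Domain Clustering: the M domains most similar to a seed domain
model.findSynonyms(domain_seed, M).select("word", fmt("similarity", m).alias("similarity"))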
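
The slide does not show the helpers passed to `combineByKey`. The sketch below gives one plausible shape for them, assuming each record is `(host, (timestamp, domain))`; that record layout and the `combine_locally` stand-in are assumptions made so the logic can be checked without a Spark cluster.

```python
from typing import Dict, Iterable, List, Tuple

Query = Tuple[int, str]  # (timestamp, domain); assumed record shape


def to_list(value: Query) -> List[Query]:
    """createCombiner: start a new list for the first query seen for a host."""
    return [value]


def append(acc: List[Query], value: Query) -> List[Query]:
    """mergeValue: add one more query to a host's list within a partition."""
    acc.append(value)
    return acc


def extend(left: List[Query], right: List[Query]) -> List[Query]:
    """mergeCombiners: merge lists built on different partitions."""
    left.extend(right)
    return left


def sortbytime(queries: List[Query]) -> List[str]:
    """Order a host's queries by timestamp and keep only the domain names."""
    return [domain for _, domain in sorted(queries, key=lambda q: q[0])]


def combine_locally(records: Iterable[Tuple[str, Query]]) -> Dict[str, List[str]]:
    """Single-machine stand-in for combineByKey(...).mapValues(sortbytime)."""
    combined: Dict[str, List[Query]] = {}
    for host, query in records:
        if host in combined:
            combined[host] = append(combined[host], query)
        else:
            combined[host] = to_list(query)
    return {host: sortbytime(queries) for host, queries in combined.items()}


records = [
    ("10.0.0.1", (3, "lpppxa.com")),
    ("10.0.0.2", (1, "grevun.com")),
    ("10.0.0.1", (1, "uuybcc.com")),
    ("10.0.0.1", (2, "igmgdc.com")),
]
sequences = combine_locally(records)
```

On a cluster, `extend` only comes into play when the same host's queries are split across partitions, which is why the local stand-in never calls it.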

LSH Based Detection

LSH based detection
Jaccard similarity
Set of host IPs accessing domain1: \( S_1 = \{IP_0, IP_1, IP_2, IP_3\} \)
Set of host IPs accessing domain2: \( S_2 = \{IP_1, IP_2, IP_3, IP_4\} \)
Similarity of domain1 and domain2 is:
\[ sim(domain1, domain2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{|\{IP_1, IP_2, IP_3\}|}{|\{IP_0, IP_1, IP_2, IP_3, IP_4\}|} = \frac{3}{5} \]
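
The worked example above can be checked with a few lines of Python. The function name `jaccard` and the string labels for the IPs are illustrative choices.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


s1 = {"IP0", "IP1", "IP2", "IP3"}  # hosts that accessed domain1
s2 = {"IP1", "IP2", "IP3", "IP4"}  # hosts that accessed domain2
similarity = jaccard(s1, s2)  # 3 shared IPs out of 5 distinct IPs
```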

Why LSH
• High dimensional and sparse
– tens of millions of hosts
• O(N^2) pairwise comparisons
– million-scale unique domains
• Spark provides Locality-Sensitive Hashing (LSH) for fast near-duplicate detection

LSH
If domain1 and domain2 are similar, then with high probability they are hashed into the same bucket.
If domain1 and domain2 are dissimilar, then with high probability they are hashed into different buckets.

Minhash and Jaccard Similarity
• There is a suitable hash function for the Jaccard similarity: minhash
• The probability that minhash(domain1) = minhash(domain2) equals the Jaccard similarity Jaccard(domain1, domain2)
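
The property above suggests a simple estimator: apply many independent hash functions, keep the minimum hash value per set, and count how often the minima agree. The sketch below is a plain-Python illustration under assumptions of its own (seeded SHA-1 as a stand-in for a family of random hash functions, 256 hashes, non-empty sets); it is not Spark's `MinHashLSH`.

```python
import hashlib
from typing import Iterable, List


def stable_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: Iterable[str], num_hashes: int = 256) -> List[int]:
    """One minimum per seeded hash function; `items` must be non-empty."""
    items = list(items)
    return [min(stable_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


s1 = {"IP0", "IP1", "IP2", "IP3"}
s2 = {"IP1", "IP2", "IP3", "IP4"}
estimate = estimate_jaccard(minhash_signature(s1), minhash_signature(s2))
```

The estimate approaches the exact Jaccard similarity (3/5 for these two sets) as the number of hash functions grows, while each domain is summarized by a fixed-length signature instead of its full host set.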

FastUnfolding
[Figure: graph whose nodes are domains, including 6850c644.space, cbb21989.space, ceb21e42.space, d4b227b4.space, abc.com, defag.ur and gha.com]

FastUnfolding
Modularity:
Fast unfolding of communities in large networks, V. D. Blondel et al., J. Stat. Mech. (2008)
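
For reference, the cited paper defines modularity as Q = (1/2m) Σ_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j). The sketch below only evaluates Q for a given community assignment on an unweighted, undirected graph without self-loops; it is not the community search itself, and the toy graph and function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]], community: Dict[Hashable, Hashable]) -> float:
    """Q = sum over communities c of [ L_c / m - (d_c / (2m))^2 ].

    L_c is the number of edges inside community c, d_c the total degree of its
    nodes, and m the number of edges. This is the per-community form of the
    pairwise definition for an unweighted, undirected graph.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the toy graph into its two triangles scores higher than putting every node in one community, which is the kind of improvement the Fast Unfolding method searches for greedily.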

Implementation Framework
billions /day

Key Functions
Domain-host Mapping Generation
Input RDD[(IP, domain)] // IP and domain of each DNS query
zipWithIndex, join input into RDD[(IP_id, domain_id)]
map, combineByKey into RDD[(domain_id, List[IP_ids])] // each domain with the set of IPs that queried it
map into RDD[(domain_id, sparseVector(IP_ids))] // represent each IP set as a sparse vector
Output RDD[(domain_id, sparseVector(IP_ids))]

Key Functions
LSH
Input RDD[(domain_id, sparseVector(IP_ids))] // domain_id and high-dimensional sparse vector
MinHashLSH into RDD[(domain_id, List(hash_values))] // reduce dimension to hundreds of hash values
flatMap into bucket RDD[(domain_id, List(bucket_id))] // map similar domains into the same bucket
fastunfolding // self-implemented function
Output RDD[(domain_id, bucket_id)]
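
One simple reading of the bucketing step is sketched below in plain Python, assuming each (hash table index, minhash value) pair acts as a bucket key, so two domains become a candidate pair if they collide in at least one table. The seeded SHA-1 hashing, the four tables, the integer IP ids and all function names are assumptions; the talk's own pipeline uses Spark's `MinHashLSH` and a self-implemented fastunfolding step that is not shown.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def stable_hash(seed: int, item: int) -> int:
    """Deterministic 64-bit hash of an IP id under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_values(ip_ids: Set[int], num_tables: int) -> List[int]:
    """One minhash value per hash table; `ip_ids` must be non-empty."""
    return [min(stable_hash(table, ip) for ip in ip_ids) for table in range(num_tables)]


def bucketize(domain_to_ips: Dict[str, Set[int]], num_tables: int = 4) -> Dict[Tuple[int, int], Set[str]]:
    """Group domains by (table index, minhash value)."""
    buckets: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
    for domain, ip_ids in domain_to_ips.items():
        for table, value in enumerate(minhash_values(ip_ids, num_tables)):
            buckets[(table, value)].add(domain)
    return buckets


def candidate_pairs(buckets: Dict[Tuple[int, int], Set[str]]) -> Set[Tuple[str, str]]:
    """All domain pairs that share at least one bucket."""
    pairs = set()
    for domains in buckets.values():
        ordered = sorted(domains)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


domain_to_ips = {
    "cab217f6.space": {1, 2, 3},
    "6850c644.space": {1, 2, 3},
    "abc.com": {7, 8, 9},
}
pairs = candidate_pairs(bucketize(domain_to_ips))
```

Only domains that collide in some bucket are ever compared, which avoids the O(N^2) pairwise comparison over all domains.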

Clustering Result
• Crypto mining domains
cab217f6.space
6850c644.space
cbb21989.space
c8b214d0.space
ceb21e42.space
cfb21fd5.space
c9b21663.space
d4b227b4.space
... ...

Conclusion
• Sequence-based mining to detect DGA domains
– Sequence and co-occurrence
– Find near-neighbors/most similar domains
• LSH-based detection of cryptocurrency mining domains
– Jaccard similarity
– Clustering

Tencent Security

Thank You
Acknowledgement
Shubing Long Chunhua Hong
Yu Liang Yong Deng
Mengling Han Xiangqian Wei
Na Yi Tingwei Mao
Tencent Security
