DLP Systems: Models, Architecture and Algorithms

Copyright 2011 Trend Micro Inc.Classification 8/2/2013 1
DLP Systems: Models, Architecture and Algorithms
Liwei Ren, Ph.D, Sr. Architect
Data Security Research, Trend Micro™
May, 2013, UCSC, Santa Cruz, CA

Copyright 2011 Trend Micro Inc.
Backgrounds:
• Liwei Ren, Data Security Research, Trend Micro™
– Research interests:
• DLP, differential compression, data de-duplication, file transfer protocols, database
security, and practical algorithms.
– Education:
• MS/BS in mathematics, Tsinghua University, Beijing
• Ph.D in mathematics, MS in information science, University of Pittsburgh
– Relevant works for this talk:
• Provilla, Inc : a startup focusing on endpoint based DLP products and solutions. It was
co-founded by Liwei and acquired by Trend Micro a few years ago.
• Patents --- Liwei holds 10+ patents for DLP, mostly, for DLP content inspection
techniques.
• Trend Micro™
– Global security software company with headquarter in Tokyo, and R&D centers in
Nanjing, Taipei and Silicon Valley.
– One of top 3 anti-malware vendors
– Pioneer in cloud security
– DLP vendor via Provilla™ acquisition
2

Agenda
• What is Data Loss Prevention (DLP) ?
• Concepts, Models, Architecture
• Content Inspection Problems
• Practical Algorithms for DLP
• Summary
• References
• Q&A
Classification 8/2/2013 3

What Is Data Loss Prevention?
• What is Data Loss Prevention?
– Data loss prevention (aka, DLP) is a data security technology that detects
data breach incidents in timely manner and prevents them by monitoring
data in-use (endpoints), in-motion (network traffic), and at-rest (data
storage) in an organization’s network.
– A.k.a. ,Data Leak Prevention (DLP),Information Leak Prevention (ILP) or
Information Leak Detection and Prevention (ILDP).

What Is Data Loss Prevention?
• A Few Elements of a DLP system:
– WHAT data to protect?
– WHO leaks data?
– HOW the data is leaked?
– WHERE to protect data?
– WHAT actions to take?

Concepts, Models and Architecture
• WHAT data to protect?
• WHO causes data leaks?
External Hackers

Three Data States:

• Data-in-use:
• Data-in-motion:

• Data-at-rest at risk:

• DLP for data-in-use and data-in-motion:
• A conceptual view!

• DLP for data-in-use and data-in-motion:
• A technical view!

• DLP Model for data-in-use and data-in-motion:
– If DATA flows from SOURCE to DESTINATION via CHANNEL, the
system takes ACTIONs
– DATA specifies what confidential data is
– SOURCE can be an user, an endpoint, an email
address, or a group of them
– DESTINATION can be an endpoint, an email address,
or a group of them, or simply the external world
– CHANNEL indicates the data leak channel such as
USB, email, network protocols and etc
– ACTION is the action that needs to be taken by the
DLP system when an incident occurs

• DLP for data-at-rest:

• DLP Model for data-at-rest:
– If DATA resides at SOURCE , the system takes ACTIONs
– DATA specifies what the sensitive data (which has
potential for leakage) is
– SOURCE can be an endpoint, a storage server or a
group of them
– ACTION is the action that needs to be taken by the
DLP system when confidential data is identified at
rest.

• Typical DLP systems:
– DLP Management Console
– DLP Endpoint Agent
– DLP Network Gateway
– Data Discovery Agent (or Appliance)

• Typical DLP system architecture:

Agenda
•Content Inspection Problems
• Summary
• References
• Q&A

Content Inspection Problems
• Two fundamental problems for a DLP system:
• It is a pair of problems that always come together:
• One determines data sensitivity based on what has been
defined.

• Four typical approaches for <defining, determining>
sensitive data in a DLP system:
1. Document fingerprinting
2. Database record fingerprinting
3. Multiple Keyword matching
4. Regular expression matching

• Document fingerprinting:
• A technique for identifying modified versions of known documents
• Problem Definition (Model 1):
– Let S= { T1, T2, …,Tn} be a set of known texts
– Given a query text T, one needs to determine if there exist at least a
document t ϵ S such that T and t share common textual content
significantly, where multiple returned documents are ranked by how
much common content are shared.

• An alternative model (Model 2):
– Let S= { T1, T2, …,Tn} be a set of known texts
– Given a query text T and X%, one needs to determine if there exist at
least a text t ϵ S such that SIM(T,t)≥ X%, where SIM() is a function to
measure the similarity between two texts.
• Multiple documents are ranked by the percentiles .

• Database record fingerprinting:
– A technique for identifying sensitive data records within a text.
– A.k.a., Exact Match in DLP field
• Use Case:
– We have several personal data records of <SSN, Phone#, address>
that are included in a text, we want to extract all records from the
text to determine the sensitivity of the file.

Hhhhhdds ghghg 178-76-6754 ggkjkfddfdkkkk879-45-6785kjkjjk 43
Atword Street, Pittsburgh, PA 15260 kllkll 412-876-6789 kjkjjkj 76
Parkview Ave, Sunnyvale, CA 94086 hhsjskkdhjhjhj 408-780-8876
hjhjkjkjjj 159-87-8965 hjhjhjhjmnnmnxcbls w243 54y45 wefddew
dddw3n nn xxxxxxxxxx
23
SSN Phone # Address
178-76-6754 412-876-6789 43 Atword Street, Pittsburgh, PA 15260
159-87-8965 408-780-8876 76 Parkview Ave, Sunnyvale, CA 94086
…… …… ……
An example: a text contains a few data records:

• Problem Definition (Model 3) :
– Let S= { R1, R2, …,Rn} be a set of known data records from a same table.
– Given any text T, one needs to extract all records or sub-records from T
while the record cells may appear randomly within the text.

• Problem Definition for Keyword Match:
– Let S= {K1,K2,…,Kn} be a dictionary of keywords.
– Given any text T, one needs to identify all keyword occurrences in T.
• Problem Definition for RegEx Match:
– Let S= {P1,P2,…,Pm} be a set of RegEx patterns.
– Given any text T, one needs to identify all pattern instances from T.
Easy problems?
– Not at all! For large n and m, one will
have performance issue.
– That’s the problem of scalability.
– Scalable algorithms must be provided.

Agenda
• Summary
• References
• Q&A

Practical Algorithms for DLP
• We investigate some algorithms for 2 problems:
1. Document fingerprinting
2. Multiple keyword matching
Assumption: a text T is a sequence of UTF-8 characters without
loss of generality.

Document Fingerprinting Algorithms
• Lets investigate algorithmic solutions for Model 2 ( document
fingerprinting).
• Analysis for Solution:
1. We need to construct the function SIM(T,t). For example:
– SIM(T,t) = |T ∩t| /Min(|T|,|t|) based on common sub-strings.
2. An Obvious Challenge:
– If n is large, say, in scale of millions, we can not compute SIM(T, Tk) one by one
to find the t that satisfies SIM(T,t) ≥ X%
– We need to figure out an approach that can identify a possible candidate quickly.
3. General search engines like Google use keywords to index/identify
the documents. Should we? There are too many keywords and
language dependency. The answer is NO.
4. So, which features can we use for indexing/searching?
– One answer is documents fingerprints.
28

• What are document fingerprints?
– A fingerprint is a hash value
– One text has multiple fingerprints
– Unique to the text: two irrelevant texts do not share any fingerprints.
– Robustness: it can survive moderate textual changes.
29

• How to extract fingerprints from a text?
– Anchoring point:
• A point in the text that can endure the moderate changes.
• Its neighborhood (of fixed size) is unique to the text
– We select a few anchoring points to fingerprints:
• To generate hash values around their neighborhoods.
• These hash values are the fingerprints
30
•Samples of anchoring points and their neighborhood:
Thereareabundantliteraturesonhowtogeneratedifferencebetween
twofilesBasicallytherearetwofundamentalapproachestoattackthisgenericp
roblemLCSmodelwhereLCSstandsforlargestcommonsubsequenceCalculate
thelargestcommonsubsequenceoftwostringFindasequenceofeditoperation
sbasedontheLCSsothatonecanapplytheeditoperationstothereferencefiletoc
onstructthetargetfileBlock movemodel

• Conclusion : we have a solution that consists of two
algorithms and one search technology:
– An algorithm for computing SIM(T,t)
– An algorithm for fingerprint generator FPGEN(T)
– Fingerprint search engine
31

• Fingerprint generation algorithm 1:
– INPUT: String T
•Select top M candidate characters based on a score function
– Character frequency n
– Character positions in the text T: P(1), …, P(n)
– SCORE(c) = SQRT(D(n) * [ P(n)-P(1)] / SQRT(D)
» Where D= [(P(2)-P(1)]2+ [(P(3)-P(2)] 2 + … + *(P(n)-P(n-1)] 2
•For each selected character c
– Create a hash around the neighborhood at each occurrence
– Sort these hashes
– Select the top N hashes
– These N hashes are fingerprints
– OUTPUT: M*N fingerprints
32
Note 2: Two keys of this algorithm are (a)
the score function; (b)sorting the hashes.
Note 1: M and N are pre-defined.

33
• About the score function:
– Why SQRT(n) ?
• Measurement of frequency for the given character
• The larger the value, more stable the character is
– Why [ P(n)-P(1)] / SQRT(D) ?
• Measurement of distribution for the given character
• The larger the value, more even distributed the character, and more
stable the character;
• WHY? Think about a constrained optimization problem:
– min f(X1,X2 , … Xm) = X1
2+ X2
2 + … Xm
2
» subject to
» X1+ X2 + … Xm = c AND
» Xk ≥ 0, k=1,2,…,m
Note: The solution of the
optimization problem is Xk
= c/m, k=1,2,…,m

There are alternative algorithms to construct a
fingerprint generation function.
34
We recently constructed algorithm 2:
– A novel approach based on rolling hash function
H(x);
– It selects anchoring points with first filter H(x) = 0
mod p;
– It further selects anchoring points with a heuristic
second filter.
– It also employs the asymmetric architecture of
fingerprint match;
Note 1: The anchoring
points have better
distribution across text.
Note 2: Two keys of this algorithm are (a) Rolling hash;
(b)Asymmetric use of two filters.

Multiple Keyword Match
Essentially, it is a multi-pattern
string match problems.
35
Problem Definition:
– Let S={P1,P2,…,Pk} be multiple short strings as
patterns;
– Given any string T, one needs to identify all pattern
occurrences in T.

Existing string match algorithms:
36
Algorithm Type
Naïve string match One pattern
Knuth–Morris–Pratt One pattern
Boyer-Moore One pattern
Boyer-Moore-Horspool One pattern
Boyer-Moore-Horspool-Raita One pattern
Rabin-Karp Multi-patterns
Aho-Corasick Multi-patterns
Sun-Manber Multi-patterns

37
Key elements of the algorithm:
– Character comparison can be made from right to left, starting from the end of
the pattern.
– Ending Character Heuristics
• Consider that we are pointing to character R[i] and try to compare it with the
ending character of P
• Bad character
– If R[i- ≠P,m- and R,i- is not included in P’s alphabet, then it is safe for the pointer to skip
m positions arriving at R[i+m].
– If R[i- ≠P,m-, R,i- is included in P’s alphabet, and R,i-’s last occurrence within P has
distance q from the end of P, then it is safe for the pointer to skip q positions arriving at
R[i+q].
• Good character
– If R[i] =P[m] , P is not matched , and R[i] has no other occurrences within P, then it is safe
for the pointer to skip m positions arriving at R[i+m].
– If R[i] =P[m] , P is not matched and R[i-’s last occurrence other than P,m- has distance q
from the end of P, then it is safe for the pointer to skip q positions arriving at R[i+q].
• Matched instance
– If R[i] =P[m] and P is matched, then save the instance.
– It is almost safe to move the pointer to skip m positions arriving at R[i+m].
Boyer-Moore-Horspool (BMH) Algorithm

38
• Rabin-Karp Algorithm
– Hash based string match
• Rabin-Karp hash function H(S):
– For a given string S = x1x2…xm with length m, a hash function can be
constructed as:
• H(S) = x1bm-1 + x2 bm-2 + … + xm-1 b + xm mod q
• Where b is a base number, usually we take b=256 , and q is a big prime
number.
– For pattern P, H(P) = p1bm-1 + p2 bm-2 + … + pm-1 b + pm mod q
– If we denote Rk = R[k,k+m-1], we can derive H(Rk+1) from H(Rk) with
relatively small cost
– H(Rk+1) = [ H(Rk) – rkbm-1 ] b + rk+m mod q
– This is an iterative formula which is a common practice for algorithm
optimization

39
• Rabin-Karp hash function:
– The quantity bm-1 mod q can be pre-calculated to save CPU time.
– For each iteration, we only need 5 arithmetic operations.
• It can be further reduced to 4
• One considers the number rkbm-1
– Horner’s rule
• H(S) = (…( (x1b + x2)b + x3) b + … + x m-1 ) b + xm mod q
• Yet another formula for performance tuning

40
• Rabin-Karp algorithm for multiple patterns:
– Input:
• String R, multiple patterns {P1,…,Pk},
• n= Length(R), mj =Length(Pj), q, b,
– Procedure:
• Step 0:
– Let m = Min(mk)
– Calculate the number bm-1 mod q
– Calculate all H(Pj,1,…,m-) (j=1,..,k) and H(R1) by Horner’s rule
• Step 1: Let i=1
• Step 2:
If there exists j in *1,2,…,k+ such that
H(Pj,1,…,m-) = H(Ri) and Pj = R[i,…, mj +i-1],
it is a match and output the instance
• Step 3: i = i + 1
• Step 4: If i > n-m, stop
• Step 5: Calculate H(Ri+1) using the iterative formula.
• Step 6 Go to step 2
– Output: All matched instances

41
A practical hybrid method:
– BMH or Rabin-Karp
– If k < Magic-number,
• Use BMH k times,
• Otherwise, use Rabin-Harp
– Magic-number=100 is my exercise in DLP products.
Rabin-Karp has its weakness :
• when Min({Length(Pi)| i =1,2,…,k +) is
small, say, less than 4, we have trouble.
• We need to introduce efficient multiple
pattern match for short patterns.

42
We have a complimentary solution to RK algorithm when
handling multiple short patterns
– This is Reverse-trie matching algorithm.
A reverse-trie presents a set of keywords,
especially, it is good for CJK languages in
UTF-8 encoding :
c d
b
a
c
b a
a
root
The keyword set: {abc,abcd,acd}

Agenda
• Summary
• References
• Q&A

Summary
• What DLP is.
• DLP Security Model
• Architecture of a DLP System
• Four Content Inspection Problems
• Two Algorithms for DLP Content Inspection
– Document Fingerprinting
– Multi-keyword matching

References
• Liwei Ren et al., Document fingerprinting with asymmetric selection of anchor
points, US patent 8359472
• Liwei Ren et al., Two tiered architecture of named entity recognition engine, US
patent 8321434.
• Yingqiang Lin el al., Scalable document signature search engine, US patent
8266150
• Liwei Ren et al., Fingerprint based entity extraction, US patent 7950062
• Liwei Ren et al., Document match engine using asymmetric signature generation,
US patent 7860853
• Liwei Ren et al., Match engine for querying relevant documents, US patent
7747642
• Liwei Ren et al., Matching engine with signature generation, US patent 7516130

Q&A
Any questions?

Thank You!
Innovation is not a part
time job, and it is not even
a full-time job. It’s a life
style.

DLP Systems: Models, Architecture and Algorithms

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to DLP Systems: Models, Architecture and Algorithms

Similar to DLP Systems: Models, Architecture and Algorithms (20)

More from Liwei Ren任力偉

More from Liwei Ren任力偉 (20)

Recently uploaded

Recently uploaded (20)

DLP Systems: Models, Architecture and Algorithms