Mathematical modeling is an important step in developing advanced technologies in domains such as network security and data mining. This lecture introduces a process that the speaker distilled from his practice of mathematical modeling and algorithmic solutions in the IT industry as an applied mathematician, algorithm specialist, software engineer, and entrepreneur. A practical problem from a DLP (data leak prevention) system is used as an example of creating math models and providing algorithmic solutions.
Mathematical Modeling for Practical Problems
1. Copyright 2011 Trend Micro Inc. 1
Liwei Ren, Ph.D
Scientific Adviser, Trend Micro
May 12, 2014, UC Santa Cruz, Silicon Valley Center, Santa Clara
Background:
• Liwei Ren
– Research interests:
• DLP, cloud data security, network security, differential compression, math modeling &
practical algorithms.
– Education:
• MS/BS in mathematics, Tsinghua University, Beijing
• Ph.D in mathematics, MS in information science, University of Pittsburgh
– Relevant works for this talk:
• Provilla: a startup focused on endpoint-based DLP products and solutions. It was co-founded by Liwei and acquired by Trend Micro.
• Patents: Liwei has been granted 20 patents in DLP and differential compression; most of these works include strong algorithmic elements.
• Trend Micro™
– Global security software company with headquarters in Tokyo and R&D centers in Nanjing, Taipei and Silicon Valley.
– Acquired Provilla™ in 2007.
Agenda
• What Is a Math Model?
• A Process of Practice
• A Problem from a Startup
• Math Modeling
• Math Modeling Again
• Summary
What is a Math Model?
• A math model describes a practical problem in mathematical
language:
– Using mathematical symbols, expressions, concepts, and even logic
operations;
– Using mathematical equations;
– Using mathematical structures such as graphs;
– Using mathematical procedures such as algorithms.
• A math model may describe a practical problem
approximately:
– It needs to capture the most essential parts of the problem while ignoring unimportant features.
– However, we must not go too far in ignoring features.
What is a Math Model?
• A simple example:
– Problem: Two cars are driving toward each other on a street, initially one and a half miles apart. A naughty dog runs back and forth between them. The cars drive at 4 miles/hr and 6 miles/hr respectively. The dog runs at 20 miles/hr. How many miles in total does the dog run before the cars meet?
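– Analysis:
– To calculate the distance d that the dog runs, one needs to know the time T that it runs, which is how long the two cars take to meet;
– T = D / (V1 + V2), where D is the initial distance and V1, V2 are the speeds of the two cars.
– Math model: d = V * D / (V1 + V2), where V is the speed of the dog.
– Solution: d = 20 * 1.5 / (4 + 6) = 3 miles.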
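The model is small enough to check numerically. A minimal Python sketch follows; the function name `dog_distance` and its parameter names are illustrative and do not come from the talk.

```python
def dog_distance(initial_gap, car1_speed, car2_speed, dog_speed):
    """Distance the dog covers before the two cars meet.

    Implements d = V * D / (V1 + V2): the dog runs at constant
    speed V for exactly the time the cars need to close the gap D.
    """
    closing_speed = car1_speed + car2_speed
    if closing_speed <= 0:
        raise ValueError("the cars must be approaching each other")
    return dog_speed * initial_gap / closing_speed


# Values from the example: 1.5 miles apart, 4 and 6 miles/hr, dog at 20 miles/hr.
print(dog_distance(1.5, 4, 6, 20))
```

Note that the model ignores how often the dog turns around; only the total running time matters.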
What is a Math Model?
• A notable example:
– Seven Bridges of Königsberg (in Prussia, 18th century)
– Problem Proposal: to find a walk through the city that would cross
each bridge once and only once.
– Analysis: by Leonhard Euler in 1735.
– Model: find a path (an Euler trail) that uses each edge exactly once in the undirected graph whose four vertices are the land masses and whose seven edges are the bridges.
• Solution: Euler proved that no such walk exists.
• Contribution: this problem started two important branches of modern mathematics: graph theory and topology.
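Euler's argument can be replayed as a degree count. The sketch below assumes the standard criterion for a connected graph (an Euler trail exists only if zero or two vertices have odd degree); the vertex labels A to D and the function names are illustrative.

```python
from collections import Counter

# Seven bridges between four land masses. A is the central island;
# the labels themselves are arbitrary.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_euler_trail(edges):
    """Euler's criterion; assumes the graph is connected."""
    return len(odd_degree_vertices(edges)) in (0, 2)


print(odd_degree_vertices(BRIDGES))
print(has_euler_trail(BRIDGES))
```

All four land masses have odd degree, so the criterion fails and no walk crosses each bridge exactly once.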
A Process of Practice
• Let me summarize a process from my experience:
– How to create mathematical models from practical
problems.
A Problem from a Startup
• A conversation in 2004 :
• Text Model for constructing EvalSim:
Data Inspection Problem:
S is a set of documents. For any document d, we need to find the documents D in S such that EvalSim(D, d) ≥ X%, where EvalSim scores the similarity of two documents and X% is a chosen threshold.
Math Modeling
• To solve the DLP Data Inspection Problem, we introduce the
concept of fingerprints:
1. To identify unique and robust features from a string;
2. To generate fingerprints from these features by hashing.
• Given a string T, we denote its fingerprints as:
– SFP(T) = {FP_1, FP_2, …, FP_m}, where the number of fingerprints m = m(T) depends on T
NOTE: Many years later, we realized that this problem is close to the problem of Near Duplicate Document Detection.
Math Modeling
• With fingerprints, the problem is divided into two parts:
– Indexing:
• Each string T ∊ S is assigned a unique string ID, SID. We generate its fingerprints SFP(T), then index SID under every fingerprint in SFP(T).
• The whole index is stored in a fingerprint database, FP-DB.
– Searching + Matching:
• For a given T, we compute SFP(T). We search SFP(T) against FP-DB to identify possible candidates (i.e., suspects) of similar strings, say, {t1, t2, …, tk}
• Calculate EvalSim(T, tj) for j = 1, 2, …, k.
– Pick those with EvalSim(T, tj) ≥ X% as the result.
• The above is similar to keyword-based search if we view
fingerprints as keywords.
• What remains :
– How to generate fingerprints from a given string?
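The indexing and searching steps can be sketched as an inverted index. Everything below is a simplified illustration under stated assumptions: the class name `FingerprintDB` is hypothetical, and the fingerprint function and EvalSim are passed in as parameters because the talk defines them later. The toy stand-ins at the bottom (word sets and Jaccard similarity) are placeholders, not the talk's fingerprints or its EvalSim.

```python
from collections import defaultdict


class FingerprintDB:
    """Inverted index: fingerprint -> IDs of the strings that produced it."""

    def __init__(self, fingerprint_fn):
        self.fingerprint_fn = fingerprint_fn
        self.index = defaultdict(set)
        self.strings = {}

    def add(self, sid, text):
        """Indexing: register SID under every fingerprint of the string."""
        self.strings[sid] = text
        for fp in self.fingerprint_fn(text):
            self.index[fp].add(sid)

    def candidates(self, text):
        """Searching: IDs sharing at least one fingerprint with the query."""
        found = set()
        for fp in self.fingerprint_fn(text):
            found |= self.index.get(fp, set())
        return found

    def search(self, text, eval_sim, threshold):
        """Matching: keep the candidates whose similarity reaches the threshold."""
        return sorted(
            sid for sid in self.candidates(text)
            if eval_sim(text, self.strings[sid]) >= threshold
        )


# Toy stand-ins, for illustration only.
def word_fingerprints(text):
    return set(text.split())


def jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


db = FingerprintDB(word_fingerprints)
db.add(1, "the quick brown fox")
db.add(2, "lorem ipsum dolor")
print(db.search("quick brown dog", jaccard, 0.3))
```

Only candidates that share a fingerprint are ever passed to the similarity function, which is what makes the lookup behave like a keyword search.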
Math Modeling
• String fingerprints :
1. Fingerprints are generated from features of a given string.
2. Robust: we expect SFP(T1) ∩ SFP(T2) ≠ ∅ if T1 and T2 are similar;
3. Unique: we expect SFP(T1) ∩ SFP(T2) = ∅ if T1 and T2 are unrelated.
• How to select robust and unique features?
– Selecting anchor points may be a good choice.
– A character in the string is an anchor point if
• Its neighborhood (of fixed length M) is, with high probability, a common substring across similar strings;
– A fingerprint is generated by hashing the neighborhood:
• When M is long enough, we should have uniqueness;
• The high probability means robustness:
– Resilient to changes.
Math Modeling
• Anchor points and fingerprints:
• How to identify anchor points?
Math Modeling
• Review: a character in the string is an anchor point if
• Its neighborhood is, with high probability, a common substring across similar strings.
• This definition is not rigorous.
• Let us try a rigorous way to describe anchor points:
– That is what mathematical modeling is about.
• Math Modeling for Anchor Points:
– Let A = {0x00, 0x01, …, 0xFF} be the binary alphabet.
– Let K be a small integer (say, 5). We select K different characters from A to identify anchor point candidates.
– Two requirements:
1. The candidates must occur with high frequency in the given string;
2. They must be as evenly distributed as possible.
Math Modeling
• Math Modeling for Anchor Points:
– We use a score function F to describe the requirements:
[formula for F(b) not preserved in this transcript]
where b ∈ A, n is the number of occurrences of character b, and {P_1, P_2, …, P_n} are the offsets of all occurrences of b in the string.
– The 1st term measures the frequency of character b … intuitively!
– The 2nd term measures its distribution.
• WHY?
Math Modeling
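• Let us consider the constrained optimization problem:
[objective not preserved in this transcript]
where X_1 + X_2 + … + X_m = C (C is a constant), and X_i ≥ 0, i = 1, 2, …, m
• It is equivalent to the problem:
[objective and constraint not preserved in this transcript]
where X_i ≥ 0, i = 1, 2, …, m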
Math Modeling
• Its solution is X_i = C/m, i = 1, 2, …, m
• It means the even distribution of character b in the string:
– Let X_i = P_{i+1} - P_i, i = 1, 2, …, m, with m = n - 1;
– For an even distribution, we have P_{i+1} - P_i = C/(n-1) for i = 1, 2, …, n-1.
– Meaning: if character b appears n times within a range of constant length C, F(b) achieves its maximum value when the occurrences are evenly distributed!
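The slide's own objective function is not preserved in this transcript. Purely as an illustration, one objective with exactly this solution is the sum of squared gaps: by the Cauchy-Schwarz inequality, X_1² + … + X_m² ≥ C²/m whenever X_1 + … + X_m = C, with equality only at X_i = C/m. The sketch below uses that assumed objective; `sum_of_squared_gaps` is a hypothetical name and may differ from the term actually used in F(b).

```python
def sum_of_squared_gaps(offsets):
    """Sum of (P[i+1] - P[i])^2 over consecutive offsets.

    Assumed, illustrative measure of unevenness: for a fixed span
    C = P[-1] - P[0] it is smallest when all gaps are equal.
    """
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    return sum(g * g for g in gaps)


# Same character count (n = 4) and same span (C = 30) in both cases.
even = [0, 10, 20, 30]
clustered = [0, 1, 2, 30]

print(sum_of_squared_gaps(even))       # equals C**2 / m with m = 3
print(sum_of_squared_gaps(clustered))  # larger, since the gaps are unequal
```

A score that subtracts (or divides by) such a term would therefore favor evenly spread characters among those with the same frequency.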
Math Modeling
• With this score function F(b), we select the K characters {b_1, b_2, …, b_K} from A with the top K scores.
• For each selected character b_k, at each of its occurrences in the string, we generate a fingerprint from its neighborhood with a hash function H1:
[formula not preserved in this transcript]
• We obtain a set of fingerprints {FP_1, FP_2, …, FP_n}.
• We sort them in ascending order and pick the first N fingerprints. The number N may be chosen in advance depending on the string size.
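The selection steps above can be sketched in Python. This is not DataDNA 1.0; it is a minimal illustration under several assumptions that the transcript does not confirm: the score defaults to the occurrence count alone (the slide's F(b) also rewards even spacing), the neighborhood is taken as the M bytes starting at the occurrence, and H1 is replaced by a truncated SHA-1. All function and parameter names are hypothetical.

```python
import hashlib
from collections import defaultdict


def h1(chunk: bytes) -> int:
    """Stand-in for the hash H1: first 8 bytes of SHA-1 as an integer."""
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")


def frequency_score(offsets):
    """Placeholder score: occurrence count only (assumption, not F(b))."""
    return len(offsets)


def model1_fingerprints(data: bytes, k=5, m=16, n=4, score=frequency_score):
    """Pick the k best-scoring byte values, hash the m-byte neighborhood of
    each occurrence, and keep the n smallest fingerprints per byte value."""
    offsets = defaultdict(list)
    for pos, byte in enumerate(data):
        if pos + m <= len(data):  # keep only full-length neighborhoods
            offsets[byte].append(pos)
    # Highest score first; ties broken by byte value for determinism.
    top = sorted(offsets, key=lambda b: (-score(offsets[b]), b))[:k]
    fingerprints = set()
    for b in top:
        fps = sorted({h1(data[p:p + m]) for p in offsets[b]})
        fingerprints.update(fps[:n])
    return fingerprints


text = b"the quick brown fox jumps over the lazy dog " * 3
print(len(model1_fingerprints(text)))
```

Keeping the N smallest hash values is a deterministic way to sample occurrences, so two similar strings tend to keep the same fingerprints wherever their neighborhoods coincide.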
Math Modeling
• We get K*N anchor points (to generate K*N fingerprints).
• We are done with modeling the anchor points:
– It should be easy to provide an algorithm based on the model.
• Let us name this math model (of anchor points) MODEL 1.
• With MODEL 1, we developed an algorithm to generate fingerprints from a given string:
– DataDNA 1.0.
• With DataDNA 1.0, we solve the DLP Data Inspection Problem:
S is a set of documents. For any document d, we need to find the documents D in S such that EvalSim(D, d) ≥ X%.
Math Modeling Again
• Before long, we started to face a few challenges:
1. If more than 60% of a document D is changed, the new document d may share no fingerprints with D;
2. Our customers challenged us with a question:
• If we copy and paste a small piece of text into a very large document, does your DLP Data Inspection technology still work?
3. Due to a change in product architecture, we replaced EvalSim with a new formula:
[formula not preserved in this transcript]
NOTE: This is because the original EvalSim has to compare two strings byte by byte for common substrings. The new formula is based on the number of common fingerprints.
• We have an issue: the anchor points selected by DataDNA 1.0 are not evenly distributed over the string, so the EvalSim() calculated above is not as accurate as expected. We need to fix it!
Math Modeling Again
• We had to propose a new model to select anchor points.
– This time we use a rolling hash H to describe anchor points.
NOTE 1: Many applications use a similar trick to identify anchor points:
• Data de-duplication (cut points)
• SSDEEP
NOTE 2: We can use
• Karp-Rabin rolling hash, or
• Adler-32.
Math Modeling Again
• After identifying anchor points, we can generate fingerprints from the right neighborhoods (of the anchor points) with another hash function h:
[formula not preserved in this transcript]
– This h can be a regular hash function; however, it is better to use a second rolling hash for performance.
Math Modeling Again
• This is MODEL 2 for describing anchor points. It can solve the 3 issues that we raised.
• WHY?
– Statistically, the condition H(x) = 0 mod p yields one anchor point per p consecutive characters on average.
– This is close to our expectation:
• Even distribution of anchor points.
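MODEL 2 can be sketched as follows. The slide that defines the anchor condition is not preserved here, so this sketch assumes an anchor is any offset where a Karp-Rabin hash H of the next `w` bytes satisfies H = 0 mod p, and it takes the `m` bytes starting at the anchor as its right neighborhood. The constants `BASE` and `MOD`, the default values of `w`, `p` and `m`, and all function names are illustrative; `window_hash` also stands in for the second hash h, which the talk suggests would be a rolling hash in practice.

```python
BASE = 257
MOD = (1 << 61) - 1  # a large prime modulus (assumed, not from the talk)


def window_hash(window: bytes) -> int:
    """Polynomial (Karp-Rabin style) hash computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def anchor_points(data: bytes, w: int = 8, p: int = 64):
    """Offsets i where the hash of data[i:i+w] is 0 mod p."""
    if len(data) < w:
        return []
    top = pow(BASE, w - 1, MOD)  # weight of the byte that leaves the window
    h = window_hash(data[:w])
    anchors = [0] if h % p == 0 else []
    for i in range(1, len(data) - w + 1):
        # Roll the window: drop data[i-1], shift, append data[i+w-1].
        h = ((h - data[i - 1] * top) * BASE + data[i + w - 1]) % MOD
        if h % p == 0:
            anchors.append(i)
    return anchors


def model2_fingerprints(data: bytes, w: int = 8, p: int = 64, m: int = 32):
    """Hash the m-byte right neighborhood of every anchor point."""
    return {
        window_hash(data[i:i + m])
        for i in anchor_points(data, w, p)
        if i + m <= len(data)
    }


sample = bytes((i * 37 + 11) % 251 for i in range(2000))
print(len(anchor_points(sample)), len(model2_fingerprints(sample)))
```

Because the anchor test depends only on local content, inserting text in front of a document shifts the anchors without changing their neighborhoods, so the original fingerprints survive inside the larger document. That is the property behind the copy-and-paste challenge raised earlier.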
Math Modeling Again
• With MODEL 2, we developed an algorithm to generate fingerprints from a given string:
– DataDNA 2.0
• With DataDNA 2.0, we solve the DLP Data Inspection Problem with a better solution and a simple EvalSim function:
[formula not preserved in this transcript]
S is a set of documents. For any document d, we need to find the documents D in S such that EvalSim(D, d) ≥ X%.
Summary
• We proposed a process for math modeling of real-world problems.
• We practiced the process with the DLP Data Inspection Problem.
– Proposed by a DLP startup many years ago.
• The problem was reduced to a string fingerprinting problem.
• MODEL 1 was introduced to describe anchor points for generating fingerprints.
• MODEL 2 was introduced to describe evenly distributed anchor points for generating fingerprints.
• The DLP Data Inspection Problem has been studied as the problem of Near Duplicate Document Detection.
• Many applications:
– Data leak prevention
– Document classification and clustering
– Anti-plagiarism
– eDiscovery
– Web search engines: index optimization
– More…
Q&A
• Thank you for your attention.
• Do you have questions?
References
1. US Patent 8359472, Document fingerprinting with asymmetric selection of anchor points, Jan 2013
2. US Patent 8266150, Scalable document signature search engine, Sep 2012
3. US Patent 7860853, Document matching engine using asymmetric signature generation, Dec 28, 2010
4. US Patent 7516130, Matching engine with signature generation, April 2009
5. My information:
– Email: liwei_ren@trendmicro.com
– LinkedIn: http://www.linkedin.com/in/drliweiren
– Academic space: https://pittsburgh.academia.edu/LiweiRen