Copyright 2011 Trend Micro Inc. 1
Mathematical Modeling for Practical
Problems
Liwei Ren, Ph.D
Scientific Adviser, Trend Micro
May 12, 2014, UC Santa Cruz, Silicon Valley Center, Santa Clara
Background:
• Liwei Ren
– Research interests:
• DLP, cloud data security, network security, differential compression, math modeling &
practical algorithms.
– Education:
• MS/BS in mathematics, Tsinghua University, Beijing
• Ph.D in mathematics, MS in information science, University of Pittsburgh
– Relevant works for this talk:
• Provilla : a startup focusing on endpoint based DLP products and solutions. It was co-
founded by Liwei and acquired by Trend Micro.
• Patents --- Liwei has 20 patents granted in both DLP & differential compression … most
works include strong algorithmic elements.
• Trend Micro™
– Global security software company headquartered in Tokyo, with R&D centers in
Nanjing, Taipei and Silicon Valley.
– Acquired Provilla™ in 2007.
Agenda
• What Is a Math Model?
• A Process of Practice
• A Problem from a Startup
• Math Modeling
• Math Modeling Again
• Summary
What is a Math Model?
• A math model describes a practical problem in mathematical
language:
– Using mathematical symbols, expressions, concepts, and even logic
operations;
– Using mathematical equations;
– Using mathematical structures such as graphs;
– Using mathematical procedures such as algorithms.
• A math model may describe a practical problem
approximately:
– It needs to include the most essential parts of the problem while ignoring
unimportant features.
– However, we must not go too far in ignoring features.
What is a Math Model?
• A simple example:
– Problem: Two cars are driving toward each other on a street, starting one
and a half miles apart. A naughty dog runs back and forth between
them. The cars drive at 4 miles/hr and 6 miles/hr respectively. The dog
runs at 20 miles/hr. What total distance, in miles, does the dog run before the cars meet?
What is a Math Model?
• A simple example:
– Analysis:
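– To calculate the distance d that the dog runs, one needs to know the
time T it runs, which is how long the two cars take to meet;
– T = D / (V1 + V2), where D is the initial distance and V1, V2 are the speeds of the cars.
– Math model: d = V * D / (V1 + V2), where V is the speed of the dog.
– Solution: d = 20 * 1.5 / (4 + 6) = 3 miles.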
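The model above can be checked with a few lines of Python. This is only a minimal sketch; the function and parameter names are made up for illustration and do not come from the slides.

```python
def dog_distance(initial_gap_miles, car1_mph, car2_mph, dog_mph):
    """Distance the dog runs before the two cars meet: d = V * D / (V1 + V2)."""
    # T = D / (V1 + V2): the cars close the gap at their combined speed.
    time_to_meet_hr = initial_gap_miles / (car1_mph + car2_mph)
    # d = V * T: the dog runs at constant speed for the whole time.
    return dog_mph * time_to_meet_hr


# Values from the slide: D = 1.5 miles, V1 = 4, V2 = 6, V = 20 miles/hr.
print(dog_distance(1.5, 4, 6, 20))
```

The point of the model is that the zig-zag path of the dog never has to be traced; only the time until the cars meet matters.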
What is a Math Model?
• A notable example:
– Seven Bridges of Königsberg (in Prussia, 18th century)
– Problem Proposal: to find a walk through the city that would cross
each bridge once and only once.
What is a Math Model?
• A notable example :
– Analysis : Leonhard Euler in 1735.
What is a Math Model?
• Classic example:
– Model: to find a path ( or Euler Trail) that uses each edge in this
undirected graph exactly once.
• Solution: Euler proved that there exists no solution.
• Contribution: This problem started 2 important branches of
modern mathematics --- graph theory & topology.
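Euler's argument can be replayed on the graph model. The sketch below assumes the usual encoding of the city as a multigraph with four land masses (labelled A to D here, with A as the central island; the labels are an assumption) and seven bridges as edges. It uses the known criterion that a connected multigraph has an Euler trail only if it has zero or two vertices of odd degree.

```python
from collections import Counter

# Seven bridges as edges between four land masses.
# Labels A-D are illustrative; A stands for the central island.
BRIDGES = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]


def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_euler_trail(edges):
    """Euler's criterion; assumes the graph is connected (true for BRIDGES)."""
    return len(odd_degree_vertices(edges)) in (0, 2)


print(odd_degree_vertices(BRIDGES))  # all four land masses have odd degree
print(has_euler_trail(BRIDGES))      # so no walk crosses each bridge exactly once
```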
A Process of Practice
• Let me summarize a process from my experience:
– How to create mathematical models from practical
problems.
A Problem from a Startup
• A conversation in 2004 :
A Problem from a Startup
• Text Model for constructing EvalSim:
A Problem from a Startup
• A conversation in 2004 :
A Problem from a Startup
• A conversation in 2004 :
Data Inspection Problem:
S is a set of documents. For any document d, we need to find D from S such that EvalSim(D,d) ≥ X%.
A Problem from a Startup
• A conversation in 2004 :
Math Modeling
• To solve the DLP Data Inspection Problem, we introduce the
concept of fingerprints:
1. To identify unique and robust features from a string;
2. To generate fingerprints from these features by hashing.
• Given a string T, we denote its fingerprints as:
– SFP(T) = {FP1, FP2 ,…, FPm(T)}
NOTE: Many years later, we realized this problem
is close to a well-known problem:
• Near Duplicate Document Detection.
Math Modeling
• With fingerprints, the problem is divided into two parts:
– Indexing:
• For each string T ∊ S that is assigned a unique string ID as SID, we
generate fingerprints SFP(T), then we index SID with all fingerprints in
SFP(T).
• The complete index is stored in FP-DB.
– Searching + Matching:
• For given T, we have SFP(T). We search SFP(T) against FP-DB to identify
possible candidates (i.e., suspects) of similar strings, say, {t1, t2 ,…, tk}
• Calculate EvalSim(T, tj) where j = 1,2,…,k.
– Pick those with EvalSim(T,*) ≥ X% as result.
• The above is similar to keyword-based search if we view
fingerprints as keywords.
• What remains :
– How to generate fingerprints from a given string?
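The indexing and searching steps above can be sketched as an inverted index from fingerprints to string IDs. Everything below is an illustrative assumption: `sfp` hashes every fixed-length substring as a placeholder (it is not the anchor-point method developed later in the talk), and `eval_sim` uses `difflib.SequenceMatcher` as a stand-in because the talk's EvalSim formula is not reproduced in this transcript.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher


def sfp(text, m=8):
    """Placeholder fingerprint set: hash of every length-m substring."""
    return {hashlib.sha1(text[i:i + m].encode()).hexdigest()[:16]
            for i in range(max(len(text) - m + 1, 0))}


def eval_sim(a, b):
    """Stand-in similarity in percent (not the talk's EvalSim)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def build_fp_db(docs):
    """Indexing: map each fingerprint to the SIDs of the strings containing it."""
    fp_db = defaultdict(set)
    for sid, text in docs.items():
        for fp in sfp(text):
            fp_db[fp].add(sid)
    return fp_db


def search(query, docs, fp_db, threshold=50.0):
    """Searching + matching: collect suspects by shared fingerprints, then verify."""
    candidates = set()
    for fp in sfp(query):
        candidates |= fp_db.get(fp, set())
    return sorted(sid for sid in candidates
                  if eval_sim(query, docs[sid]) >= threshold)


docs = {1: "the quick brown fox jumps over the lazy dog",
        2: "mathematical modeling for practical problems"}
fp_db = build_fp_db(docs)
print(search("the quick brown fox jumped over a lazy dog", docs, fp_db))
```

Only the candidates returned by the index are passed to the expensive similarity function, which is the same division of labor as in keyword-based search.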
Math Modeling
• String fingerprints :
1. Fingerprints are generated from features of a given string.
2. Robust: we expect SFP(T1) ∩ SFP(T2) ≠ NIL if they are similar;
3. Unique: SFP(T1) ∩ SFP(T2) = NIL if they are irrelevant.
• How to select robust and unique features?
– Selecting anchor points may be a good choice.
– A character in the string is an anchor point if
• Its neighborhood ( of fixed length M) could be a common sub-string across
similar strings with high probability;
– A fingerprint is generated by hashing the neighborhood:
• When M is long enough, we should have uniqueness;
• The high probability means robustness:
– Resilient to changes.
Math Modeling
• Anchor points and fingerprints:
• How to identify anchor points?
Math Modeling
• Review: A character in the string is an anchor point if
• Its neighborhood could be a common sub-string across similar strings with
high probability;
• This definition is not rigorous.
• Let us try a rigorous way to describe anchor points:
– That is what mathematical modeling is about.
• Math Modeling for Anchor Points:
– Let A = {0x00, 0x01, …, 0xFF} be the binary alphabet.
– Let K be a small integer (say, 5). We select K different binary
characters from A to identify anchor point candidates.
– Two requirements:
1. Those candidates must have high frequency in the given string;
2. They are as evenly distributed as possible.
Math Modeling
• Math Modeling for Anchor Points:
– We use a score function F to describe the requirements :
where b ∈ A, n is the number of occurrences of character b, and {P1,
P2, …, Pn} are the offsets of b in the string.
– The 1st term measures the frequency of character b (intuitively).
– The 2nd term measures its
distribution.
• WHY ?
Math Modeling
• Let us consider the constrained optimization problem :
where (C is a constant), and Xi ≥ 0, i=1,2,…,m
• It is equivalent to the problem:
where and Xi ≥ 0, i=1,2,…,m
Math Modeling
• Its solution is Xi = C/m, i = 1, 2, …, m
• It means the even distribution of character b in the string:
– Let Xi = Pi+1 - Pi , i = 1, 2 , …, m, and m=n-1;
– For even distribution, we have Pi+1 - Pi = C/(n-1) for i = 1,
2 , …, n-1.
– Meaning : If character b appears n times in a constant range C,
F(b) achieves the maximum value when evenly distributed!
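The effect of even spacing can be illustrated numerically. The objective function of the optimization problem appears only as an image in the slides, so the sketch below assumes a stand-in measure of unevenness, the sum of squared gaps between consecutive offsets. For a fixed range C and m = n-1 gaps, this sum is smallest (equal to C²/m) exactly when every gap is C/m, which matches the solution stated above.

```python
def sum_of_squared_gaps(offsets):
    """Stand-in unevenness measure: sum of (P[i+1] - P[i])**2 over sorted offsets."""
    return sum((b - a) ** 2 for a, b in zip(offsets, offsets[1:]))


C = 12                       # constant range covered by the occurrences
even = [0, 3, 6, 9, 12]      # n = 5 occurrences, every gap is C/(n-1) = 3
skewed = [0, 1, 2, 3, 12]    # same n and same range, but uneven gaps

# A score that subtracts this penalty is therefore largest for the even layout.
print(sum_of_squared_gaps(even), sum_of_squared_gaps(skewed))
```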
Math Modeling
• With this score function F(b), we select K characters {b1, b2, …,bK} from
A with K top scores.
• For each selected character bk , at each occurrence in string, we generate
a fingerprint from its neighborhood with a hash function H1:
• We obtain a set of fingerprints {FP1, FP2, …, FPn}.
• We sort them in ascending order and pick the first N fingerprints.
The number N may be pre-selected depending on the string size.
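The selection procedure above might look roughly like the following sketch. Several pieces are assumptions, since the slides show them only as images: `score` is a hypothetical stand-in for F(b) (occurrence count scaled by how even the gaps are), the neighborhood is taken as the M bytes starting at each occurrence, and SHA-1 truncated to 64 bits plays the role of the hash function H1.

```python
import hashlib
from collections import defaultdict


def score(offsets):
    """Hypothetical stand-in for F(b): count times gap evenness (1.0 when gaps are equal)."""
    n = len(offsets)
    if n < 2:
        return 0.0
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    span = offsets[-1] - offsets[0]
    evenness = (span ** 2 / len(gaps)) / sum(g * g for g in gaps)
    return n * evenness


def model1_fingerprints(data, k=5, n_keep=4, m=8):
    """Pick the K top-scoring bytes; per byte, hash neighborhoods and keep the N smallest."""
    offsets = defaultdict(list)
    for pos, byte in enumerate(data):
        offsets[byte].append(pos)
    # Ties are broken by byte value so the result is deterministic.
    top = sorted(offsets, key=lambda b: (-score(offsets[b]), b))[:k]
    selected = set()
    for b in top:
        fps = sorted(
            int.from_bytes(hashlib.sha1(data[p:p + m]).digest()[:8], "big")
            for p in offsets[b] if p + m <= len(data)
        )
        selected.update(fps[:n_keep])  # first N fingerprints in ascending order
    return selected


sample = b"A math model describes a practical problem in mathematical language."
print(len(model1_fingerprints(sample)))  # at most K * N fingerprints
```

Keeping the N smallest hash values is a deterministic way to subsample: two similar strings tend to keep the same fingerprints because they share most of their neighborhoods.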
Math Modeling
• We get K*N anchor points ( to generate K*N fingerprints).
• We are done with modeling the anchor points:
– It should be very easy to provide an algorithm based on the model.
• Let us name the Math Model ( of anchor points) as MODEL 1.
• With MODEL 1, we developed an algorithm to generate
fingerprints from a given string:
– DataDNA 1.0.
• With DataDNA 1.0, we solve DLP Data Inspection Problem:
S is a set of documents . For any document d, we need to find D from S such that
EvalSim(D,d) ≥ X%.
Math Modeling Again
• Before long, we started to face a few challenges:
1. If we make more than 60% change to a document D, we find the
new document d may share 0 fingerprints with D;
2. Our customers challenged us with a question:
• If we copy & paste a small text into a very large document, does your
DLP Data Inspection technology work?
3. Due to a product architecture change, we replaced EvalSim with a new formula:
NOTE: This is because the original EvalSim has to compare two strings
byte-by-byte for common sub-strings. The new formula is based on the
number of common fingerprints.
• We have an issue: the anchor points selected by DataDNA 1.0 are not
evenly distributed over the string, so EvalSim() as calculated above is
not as accurate as expected. We need to fix it!
Math Modeling Again
• We had to propose a new model to select anchor points.
– We use a rolling hash H to describe anchor points this time.
NOTE 1: Many applications use
a similar trick for identifying
anchor points:
• Data de-duplication ( cut
points)
• SSDEEP
NOTE 2: We can use
• Karp-Rabin rolling hash OR
• Adler-32 .
Math Modeling Again
• After identifying anchor points, we can generate fingerprints
from right neighborhoods (of anchor points) with another
hash function h:
– This h can be a regular hash function; however, it is better to use a second
rolling hash for performance.
Math Modeling Again
• This is MODEL 2 for describing anchor points. It can solve
the 3 issues that we raised.
• WHY?
– Statistically, H(x) = 0 mod p provides us with one anchor point per p
consecutive characters on average.
– This is close to our expectation:
• Even distribution of anchor points.
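A rough sketch of this idea follows, using a Karp-Rabin style polynomial rolling hash. The window size, base, modulus, divisor p, neighborhood length, and the use of truncated SHA-1 for the second hash h are all illustrative assumptions, not the parameters of DataDNA 2.0.

```python
import hashlib
import random


def anchor_points(data, window=8, p=16, base=257, modulus=(1 << 61) - 1):
    """Offsets i where the rolling hash H of data[i:i+window] satisfies H = 0 mod p."""
    if len(data) < window:
        return []
    top = pow(base, window - 1, modulus)  # weight of the byte leaving the window
    h = 0
    for byte in data[:window]:
        h = (h * base + byte) % modulus
    anchors = [0] if h % p == 0 else []
    for i in range(1, len(data) - window + 1):
        h = (h - data[i - 1] * top) % modulus              # drop the outgoing byte
        h = (h * base + data[i + window - 1]) % modulus    # add the incoming byte
        if h % p == 0:
            anchors.append(i)
    return anchors


def fingerprints(data, window=8, p=16, m=32):
    """Hash the right neighborhood data[a:a+m] of each anchor a with a full neighborhood."""
    return {hashlib.sha1(data[a:a + m]).hexdigest()[:16]
            for a in anchor_points(data, window, p) if a + m <= len(data)}


rng = random.Random(0)
doc = bytes(rng.randrange(256) for _ in range(2000))
shifted = b"inserted prefix " + doc  # same content moved to a different offset

print(len(anchor_points(doc)), len(doc) // 16)            # observed vs. expected count
print(fingerprints(doc) <= fingerprints(shifted))          # fingerprints survive the shift
```

Because an anchor depends only on the bytes inside its window, inserting text elsewhere in the document moves the anchors but does not change which content they mark, so the fingerprints of the untouched regions stay the same.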
Math Modeling Again
• With MODEL 2, we developed an algorithm to generate
fingerprints from a given string.
– DataDNA 2.0
• With DataDNA 2.0, we solve the DLP Data Inspection Problem
with a better solution and a simple EvalSim function:
where
S is a set of documents . For any document d, we need to find D from S such that
EvalSim(D,d) ≥ X%.
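The fingerprint-based EvalSim formula is not reproduced in this transcript; the talk only states that it is based on the number of common fingerprints. The sketch below therefore uses a hypothetical stand-in, the percentage of d's fingerprints that also occur in D, to show how the Data Inspection Problem can be answered from fingerprint sets alone. The function names are made up for illustration.

```python
def eval_sim_fp(sfp_D, sfp_d):
    """Hypothetical stand-in: percent of d's fingerprints that also occur in D."""
    if not sfp_d:
        return 0.0
    return 100.0 * len(sfp_D & sfp_d) / len(sfp_d)


def inspect(sfp_d, fp_sets, x_percent):
    """Return the IDs of documents D in S with eval_sim_fp(D, d) >= X%."""
    return sorted(sid for sid, sfp_D in fp_sets.items()
                  if eval_sim_fp(sfp_D, sfp_d) >= x_percent)


# Toy fingerprint sets; integers stand in for real fingerprint values.
fp_sets = {"a": {1, 2, 3, 4}, "b": {3, 4, 5, 9}}
print(inspect({3, 4, 5, 6}, fp_sets, 60))
```

No byte-by-byte comparison of the original strings is needed, which is why evenly distributed anchor points matter: the ratio is only a fair estimate of similarity if the fingerprints sample the string uniformly.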
Summary
• We proposed a process for math modeling of real-world
problems.
• We practiced the process with the DLP Data Inspection Problem.
– Proposed by a DLP startup many years ago.
• The problem was reduced to a string fingerprinting problem:
• MODEL 1 was introduced to describe anchor points in order
to generate fingerprints.
• MODEL 2 was introduced to describe evenly distributed
anchor points in order to generate fingerprints.
Summary
• The problem of DLP Data Inspection has been studied as the
problem of Near Duplicate Document Detection.
• Many applications:
– Data leak prevention
– Document classification and clustering
– Anti-plagiarism
– eDiscovery
– Web search engine: index optimization.
– More….
Q&A
• Thank you for your attention.
• Do you have questions?
References
1. US patent 8359472, Document fingerprinting with asymmetric
selection of anchor points, Jan 2013
2. US Patent 8266150, Scalable document signature search engine,
Sep 2012
3. US patent 7860853, Document matching engine using
asymmetric signature generation, Dec 28, 2010
4. US patent 7516130, Matching engine with signature generation,
April, 2009
5. My Information:
– Email : liwei_ren@trendmicro.com
– Linkedin: http://www.linkedin.com/in/drliweiren
– Academic Space: https://pittsburgh.academia.edu/LiweiRen
34

More Related Content

What's hot

introduction to Numerical Analysis
introduction to Numerical Analysisintroduction to Numerical Analysis
introduction to Numerical AnalysisGhulam Mehdi Sahito
 
Complex Number's Applications
Complex Number's ApplicationsComplex Number's Applications
Complex Number's ApplicationsNikhil Deswal
 
application of complex numbers
application of complex numbersapplication of complex numbers
application of complex numbersKaustubh Garud
 
Applications of analytic functions and vector calculus
Applications of analytic functions and vector calculusApplications of analytic functions and vector calculus
Applications of analytic functions and vector calculusPoojith Chowdhary
 
Numerical solution of ordinary differential equation
Numerical solution of ordinary differential equationNumerical solution of ordinary differential equation
Numerical solution of ordinary differential equationDixi Patel
 
Fractional calculus and applications
Fractional calculus and applicationsFractional calculus and applications
Fractional calculus and applicationsPlusOrMinusZero
 
Imaginary numbers
Imaginary numbersImaginary numbers
Imaginary numbersJordan Vint
 
Chapter 3 mathematical modeling
Chapter 3 mathematical modelingChapter 3 mathematical modeling
Chapter 3 mathematical modelingBin Biny Bino
 
Lesson 11: Limits and Continuity
Lesson 11: Limits and ContinuityLesson 11: Limits and Continuity
Lesson 11: Limits and ContinuityMatthew Leingang
 
ROOT OF NON-LINEAR EQUATIONS
ROOT OF NON-LINEAR EQUATIONSROOT OF NON-LINEAR EQUATIONS
ROOT OF NON-LINEAR EQUATIONSfenil patel
 
Laplace transforms
Laplace transformsLaplace transforms
Laplace transformsRahul Narang
 
First order linear differential equation
First order linear differential equationFirst order linear differential equation
First order linear differential equationNofal Umair
 
Second order homogeneous linear differential equations
Second order homogeneous linear differential equations Second order homogeneous linear differential equations
Second order homogeneous linear differential equations Viraj Patel
 
MATLAB : Numerical Differention and Integration
MATLAB : Numerical Differention and IntegrationMATLAB : Numerical Differention and Integration
MATLAB : Numerical Differention and IntegrationAinul Islam
 
An Introduction to Mathematical Modelling
An Introduction to Mathematical ModellingAn Introduction to Mathematical Modelling
An Introduction to Mathematical ModellingHeni Widayani
 
Continutiy of Functions.ppt
Continutiy of Functions.pptContinutiy of Functions.ppt
Continutiy of Functions.pptLadallaRajKumar
 

What's hot (20)

introduction to Numerical Analysis
introduction to Numerical Analysisintroduction to Numerical Analysis
introduction to Numerical Analysis
 
Numerical analysis ppt
Numerical analysis pptNumerical analysis ppt
Numerical analysis ppt
 
Complex Number's Applications
Complex Number's ApplicationsComplex Number's Applications
Complex Number's Applications
 
application of complex numbers
application of complex numbersapplication of complex numbers
application of complex numbers
 
Applications of analytic functions and vector calculus
Applications of analytic functions and vector calculusApplications of analytic functions and vector calculus
Applications of analytic functions and vector calculus
 
Numerical solution of ordinary differential equation
Numerical solution of ordinary differential equationNumerical solution of ordinary differential equation
Numerical solution of ordinary differential equation
 
Chapter 17 - Multivariable Calculus
Chapter 17 - Multivariable CalculusChapter 17 - Multivariable Calculus
Chapter 17 - Multivariable Calculus
 
Fractional calculus and applications
Fractional calculus and applicationsFractional calculus and applications
Fractional calculus and applications
 
Imaginary numbers
Imaginary numbersImaginary numbers
Imaginary numbers
 
Chapter 3 mathematical modeling
Chapter 3 mathematical modelingChapter 3 mathematical modeling
Chapter 3 mathematical modeling
 
Lesson 11: Limits and Continuity
Lesson 11: Limits and ContinuityLesson 11: Limits and Continuity
Lesson 11: Limits and Continuity
 
ROOT OF NON-LINEAR EQUATIONS
ROOT OF NON-LINEAR EQUATIONSROOT OF NON-LINEAR EQUATIONS
ROOT OF NON-LINEAR EQUATIONS
 
Laplace transforms
Laplace transformsLaplace transforms
Laplace transforms
 
First order linear differential equation
First order linear differential equationFirst order linear differential equation
First order linear differential equation
 
Graph of functions
Graph of functionsGraph of functions
Graph of functions
 
Second order homogeneous linear differential equations
Second order homogeneous linear differential equations Second order homogeneous linear differential equations
Second order homogeneous linear differential equations
 
Galois theory
Galois theoryGalois theory
Galois theory
 
MATLAB : Numerical Differention and Integration
MATLAB : Numerical Differention and IntegrationMATLAB : Numerical Differention and Integration
MATLAB : Numerical Differention and Integration
 
An Introduction to Mathematical Modelling
An Introduction to Mathematical ModellingAn Introduction to Mathematical Modelling
An Introduction to Mathematical Modelling
 
Continutiy of Functions.ppt
Continutiy of Functions.pptContinutiy of Functions.ppt
Continutiy of Functions.ppt
 

Viewers also liked

Transfer function and mathematical modeling
Transfer  function  and  mathematical  modelingTransfer  function  and  mathematical  modeling
Transfer function and mathematical modelingvishalgohel12195
 
Lecture 2 ME 176 2 Mathematical Modeling
Lecture 2 ME 176 2 Mathematical ModelingLecture 2 ME 176 2 Mathematical Modeling
Lecture 2 ME 176 2 Mathematical ModelingLeonides De Ocampo
 
Lecture 4 ME 176 2 Mathematical Modeling
Lecture 4 ME 176 2 Mathematical ModelingLecture 4 ME 176 2 Mathematical Modeling
Lecture 4 ME 176 2 Mathematical ModelingLeonides De Ocampo
 
Class 10 mathematical modeling of continuous stirred tank reactor systems (...
Class 10   mathematical modeling of continuous stirred tank reactor systems (...Class 10   mathematical modeling of continuous stirred tank reactor systems (...
Class 10 mathematical modeling of continuous stirred tank reactor systems (...Manipal Institute of Technology
 
Modern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of SystemsModern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of SystemsAmr E. Mohamed
 

Viewers also liked (7)

Transfer function and mathematical modeling
Transfer  function  and  mathematical  modelingTransfer  function  and  mathematical  modeling
Transfer function and mathematical modeling
 
Lecture 2 ME 176 2 Mathematical Modeling
Lecture 2 ME 176 2 Mathematical ModelingLecture 2 ME 176 2 Mathematical Modeling
Lecture 2 ME 176 2 Mathematical Modeling
 
Lecture 4 ME 176 2 Mathematical Modeling
Lecture 4 ME 176 2 Mathematical ModelingLecture 4 ME 176 2 Mathematical Modeling
Lecture 4 ME 176 2 Mathematical Modeling
 
Class 6 basics of mathematical modeling
Class 6   basics of mathematical modelingClass 6   basics of mathematical modeling
Class 6 basics of mathematical modeling
 
Class 10 mathematical modeling of continuous stirred tank reactor systems (...
Class 10   mathematical modeling of continuous stirred tank reactor systems (...Class 10   mathematical modeling of continuous stirred tank reactor systems (...
Class 10 mathematical modeling of continuous stirred tank reactor systems (...
 
Modern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of SystemsModern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of Systems
 
Class 7 mathematical modeling of liquid-level systems
Class 7   mathematical modeling of liquid-level systemsClass 7   mathematical modeling of liquid-level systems
Class 7 mathematical modeling of liquid-level systems
 

Similar to Mathematical Modeling for Practical Problems

Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and ApplicationsLiwei Ren任力偉
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential CompressionLiwei Ren任力偉
 
Bytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringBytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringLiwei Ren任力偉
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsLiwei Ren任力偉
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial IntelligenceZavain Dar
 
Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator ProgramGoDataDriven
 
The Concurrent Constraint Programming Research Programmes -- Redux
The Concurrent Constraint Programming Research Programmes -- ReduxThe Concurrent Constraint Programming Research Programmes -- Redux
The Concurrent Constraint Programming Research Programmes -- ReduxPierre Schaus
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool EvaluationLiwei Ren任力偉
 
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...Codemotion
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.pptRahulTr22
 
Data science programming .ppt
Data science programming .pptData science programming .ppt
Data science programming .pptGanesh E
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.pptkalai75
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.pptAravind Reddy
 
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph MiningDetection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph MiningMikel Emaldi Manrique
 
Era ofdataeconomyv4short
Era ofdataeconomyv4shortEra ofdataeconomyv4short
Era ofdataeconomyv4shortJun Miyazaki
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015lbishal
 
From a sea of projects to collaboration opportunities within seconds
From a sea of projects to collaboration opportunities within secondsFrom a sea of projects to collaboration opportunities within seconds
From a sea of projects to collaboration opportunities within secondsMichel Drescher
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017Manish Pandey
 

Similar to Mathematical Modeling for Practical Problems (20)

Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and Applications
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential Compression
 
Bytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringBytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clustering
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial Intelligence
 
21AI401 AI Unit 1.pdf
21AI401 AI Unit 1.pdf21AI401 AI Unit 1.pdf
21AI401 AI Unit 1.pdf
 
Data Science Accelerator Program
Data Science Accelerator ProgramData Science Accelerator Program
Data Science Accelerator Program
 
The Concurrent Constraint Programming Research Programmes -- Redux
The Concurrent Constraint Programming Research Programmes -- ReduxThe Concurrent Constraint Programming Research Programmes -- Redux
The Concurrent Constraint Programming Research Programmes -- Redux
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
Deep Learning for Machine Translation: a paradigm shift - Alberto Massidda - ...
 
Data Science
Data Science Data Science
Data Science
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
 
Data science programming .ppt
Data science programming .pptData science programming .ppt
Data science programming .ppt
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
 
Lec1cgu13updated.ppt
Lec1cgu13updated.pptLec1cgu13updated.ppt
Lec1cgu13updated.ppt
 
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph MiningDetection of Related Semantic Datasets Based on Frequent Subgraph Mining
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
 
Era ofdataeconomyv4short
Era ofdataeconomyv4shortEra ofdataeconomyv4short
Era ofdataeconomyv4short
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015
 
From a sea of projects to collaboration opportunities within seconds
From a sea of projects to collaboration opportunities within secondsFrom a sea of projects to collaboration opportunities within seconds
From a sea of projects to collaboration opportunities within seconds
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
 

More from Liwei Ren任力偉

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇Liwei Ren任力偉
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural NetworkLiwei Ren任力偉
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维Liwei Ren任力偉
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究Liwei Ren任力偉
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based SecurityLiwei Ren任力偉
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemLiwei Ren任力偉
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsLiwei Ren任力偉
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Liwei Ren任力偉
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsLiwei Ren任力偉
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemLiwei Ren任力偉
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsLiwei Ren任力偉
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyLiwei Ren任力偉
 
Securing Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the CloudSecuring Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the CloudLiwei Ren任力偉
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsLiwei Ren任力偉
 

Mathematical Modeling for Practical Problems

  • 1. Mathematical Modeling for Practical Problems. Liwei Ren, Ph.D., Scientific Adviser, Trend Micro. May 12, 2014, UC Santa Cruz, Silicon Valley Center, Santa Clara. Copyright 2011 Trend Micro Inc.
  • 2. Background
      • Liwei Ren
          – Research interests: DLP, cloud data security, network security, differential compression, math modeling, and practical algorithms.
          – Education: MS/BS in mathematics, Tsinghua University, Beijing; Ph.D. in mathematics and MS in information science, University of Pittsburgh.
          – Relevant work for this talk:
              • Provilla: a startup focused on endpoint-based DLP products and solutions. It was co-founded by Liwei and acquired by Trend Micro.
              • Patents: Liwei has 20 patents granted in DLP and differential compression; most of this work includes strong algorithmic elements.
      • Trend Micro™
          – Global security software company with headquarters in Tokyo and R&D centers in Nanjing, Taipei, and Silicon Valley.
          – Acquired Provilla™ in 2007.
  • 3. Agenda
      • What Is a Math Model?
      • A Process of Practice
      • A Problem from a Startup
      • Math Modeling
      • Math Modeling Again
      • Summary
  • 4. What Is a Math Model?
      • A math model describes a practical problem in mathematical language:
          – using mathematical symbols, expressions, concepts, and even logic operations;
          – using mathematical equations;
          – using mathematical structures such as graphs;
          – using mathematical procedures such as algorithms.
      • A math model may describe a practical problem only approximately:
          – It must capture the most essential parts of the problem while ignoring unimportant features.
          – However, we must not go too far in discarding features.
  • 5. What Is a Math Model?
      • A simple example:
          – Problem: Two cars are driving toward each other on a street, initially one and a half miles apart. A naughty dog runs back and forth between them. The cars drive at 4 miles/hr and 6 miles/hr respectively. The dog runs at 20 miles/hr. What total distance, in miles, does the dog run before the cars meet?
  • 6. What Is a Math Model?
      • A simple example:
          – Analysis: to calculate the distance the dog runs, one needs the time T that it runs, which is how long the two cars take to meet. With D the initial distance and V1, V2 the speeds of the two cars, T = D / (V1 + V2).
          – Math model: with V the speed of the dog, d = V * D / (V1 + V2).
          – Solution: d = 20 * 1.5 / (4 + 6) = 3 miles.
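The model on this slide can be checked with a few lines of Python. This is only a small sketch: the function name `dog_distance` and its parameter names are chosen here for illustration and do not come from the talk.

```python
def dog_distance(initial_gap_miles, car1_mph, car2_mph, dog_mph):
    """Distance the dog runs before the cars meet: d = V * D / (V1 + V2)."""
    closing_speed = car1_mph + car2_mph  # V1 + V2: how fast the gap shrinks
    if closing_speed <= 0:
        raise ValueError("the cars must be approaching each other")
    # Same order of operations as the slide: V * D / (V1 + V2)
    return dog_mph * initial_gap_miles / closing_speed


print(dog_distance(1.5, 4, 6, 20))  # 3.0
```

The point of the model is that the dog's zigzag path never has to be traced; only the total running time matters.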
  • 7. What Is a Math Model?
      • A notable example:
          – Seven Bridges of Königsberg (in Prussia, 18th century).
          – Problem proposal: find a walk through the city that crosses each bridge once and only once.
  • 8. What Is a Math Model?
      • A notable example:
          – Analysis: Leonhard Euler in 1735.
  • 9. What Is a Math Model?
      • Classic example:
          – Model: find a path (an Euler trail) that uses each edge of this undirected graph exactly once.
      • Solution: Euler proved that no such path exists.
      • Contribution: this problem gave rise to two important branches of modern mathematics, graph theory and topology.
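Euler's argument can be replayed on the graph model with a degree count: a connected undirected multigraph has an Euler trail only if it has zero or two vertices of odd degree. The sketch below encodes the seven bridges as edges; the vertex labels A to D are illustrative names for the four land masses, and the code assumes the graph is connected rather than checking it.

```python
from collections import Counter

# Seven bridges as edges between four land masses.
# Labels are illustrative: A = the central island, B and C = the two river banks,
# D = the eastern land mass.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_euler_trail(edges):
    """Euler's condition for a connected multigraph (connectivity is assumed)."""
    return len(odd_degree_vertices(edges)) in (0, 2)


print(odd_degree_vertices(BRIDGES))  # ['A', 'B', 'C', 'D']
print(has_euler_trail(BRIDGES))      # False
```

All four land masses have odd degree, so the walk asked for in the problem cannot exist.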
  • 10. A Process of Practice
      • Let me summarize a process from my experience: how to create mathematical models from practical problems.
  • 11. A Problem from a Startup
      • A conversation in 2004:
  • 12. A Problem from a Startup
      • Text model for constructing EvalSim:
  • 13. A Problem from a Startup
      • A conversation in 2004:
  • 14. A Problem from a Startup
      • A conversation in 2004:
          – Data Inspection Problem: S is a set of documents. For any document d, we need to find D from S such that EvalSim(D, d) ≥ X%.
  • 15. A Problem from a Startup
      • A conversation in 2004:
  • 16. Math Modeling
      • To solve the DLP Data Inspection Problem, we introduce the concept of fingerprints:
          1. Identify unique and robust features in a string;
          2. Generate fingerprints from these features by hashing.
      • Given a string T, we denote its set of fingerprints as SFP(T) = {FP1, FP2, …, FPm(T)}.
      • NOTE: Many years later, we realized that this problem is close to the problem of Near Duplicate Document Detection.
  • 17. Math Modeling
      • With fingerprints, the problem is divided into two parts:
          – Indexing:
              • Each string T ∈ S is assigned a unique string ID, SID. We generate its fingerprints SFP(T), then index SID under every fingerprint in SFP(T).
              • The complete index is stored in FP-DB.
          – Searching + matching:
              • For a given T, we compute SFP(T) and search it against FP-DB to identify candidates (i.e., suspects) of similar strings, say {t1, t2, …, tk}.
              • Calculate EvalSim(T, tj) for j = 1, 2, …, k, and pick those with EvalSim(T, tj) ≥ X% as the result.
      • This is similar to keyword-based search if we view fingerprints as keywords.
      • What remains: how do we generate fingerprints from a given string?
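The indexing and searching steps on this slide can be sketched as an inverted index from fingerprint to string ID. The class and method names below are hypothetical, FP-DB is modeled as an in-memory dictionary, and the fingerprints are toy integers; the talk does not describe the real storage layer.

```python
from collections import defaultdict


class FingerprintIndex:
    """Toy stand-in for FP-DB: maps each fingerprint to the SIDs that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, sid, fingerprints):
        """Indexing: register the string ID under every one of its fingerprints."""
        for fp in fingerprints:
            self._postings[fp].add(sid)

    def candidates(self, fingerprints):
        """Searching: return {sid: number of fingerprints shared with the query}."""
        hits = defaultdict(int)
        for fp in set(fingerprints):
            for sid in self._postings.get(fp, ()):
                hits[sid] += 1
        return dict(hits)


index = FingerprintIndex()
index.add("doc1", {1, 2, 3})
index.add("doc2", {3, 4})
print(index.candidates({2, 3, 9}))  # doc1 shares 2 fingerprints, doc2 shares 1
```

In the matching step, only the returned candidates would be passed to EvalSim and filtered against the X% threshold, which keeps the expensive comparison off the rest of S.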
  • 18. Math Modeling
      • String fingerprints:
          1. Fingerprints are generated from features of a given string.
          2. Robust: we expect SFP(T1) ∩ SFP(T2) ≠ NIL if T1 and T2 are similar.
          3. Unique: we expect SFP(T1) ∩ SFP(T2) = NIL if T1 and T2 are unrelated.
      • How do we select robust and unique features?
          – Selecting anchor points may be a good choice.
          – A character in the string is an anchor point if its neighborhood (of fixed length M) could be a common sub-string across similar strings with high probability.
          – A fingerprint is generated by hashing the neighborhood:
              • When M is long enough, we should have uniqueness;
              • The high probability means robustness, i.e., resilience to changes.
  • 19. Math Modeling
      • Anchor points and fingerprints:
      • How do we identify anchor points?
  • 20. Math Modeling
      • Review: a character in the string is an anchor point if its neighborhood could be a common sub-string across similar strings with high probability.
      • This definition is not rigorous.
      • Let us try to describe anchor points rigorously; that is what mathematical modeling is about.
      • Math modeling for anchor points:
          – Let A = {0x00, 0x01, …, 0xFF} be the binary alphabet.
          – Let K be a small integer (say, 5). We select K different characters from A to identify anchor point candidates.
          – Two requirements:
              1. The candidates must occur with high frequency in the given string;
              2. They must be distributed as evenly as possible.
  • 21. Math Modeling
      • Math modeling for anchor points:
          – We use a score function F(b) to describe the two requirements [formula image not preserved in this transcript], where b ∈ A, n is the number of occurrences of character b, and {P1, P2, …, Pn} are the offsets of b in the string.
          – Intuitively, the first term measures the frequency of character b.
          – The second term measures its distribution.
      • Why?
  • 22. Math Modeling
      • Let us consider the constrained optimization problem [objective not preserved in this transcript], where X1 + X2 + … + Xm = C (C is a constant) and Xi ≥ 0, i = 1, 2, …, m.
      • It is equivalent to the problem [objective not preserved in this transcript], where [constraint not preserved in this transcript] and Xi ≥ 0, i = 1, 2, …, m.
  • 23. Math Modeling
      • Its solution is Xi = C/m, i = 1, 2, …, m.
      • This corresponds to an even distribution of character b in the string:
          – Let Xi = Pi+1 − Pi, i = 1, 2, …, m, with m = n − 1;
          – For an even distribution, we have Pi+1 − Pi = C/(n − 1) for i = 1, 2, …, n − 1.
          – Meaning: if character b appears n times within a constant range C, F(b) achieves its maximum value when the occurrences are evenly distributed.
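The objective function from the slides is not preserved in this transcript, so the following is only an illustration under an assumed form, not the author's formula. Suppose the distribution term is a sum of g(X_i) for some strictly concave function g (the choice of g is an assumption made here). Jensen's inequality then explains why equal gaps are optimal:

```latex
\[
\sum_{i=1}^{m} g(X_i)
  \;\le\; m\, g\!\left(\frac{1}{m}\sum_{i=1}^{m} X_i\right)
  \;=\; m\, g\!\left(\frac{C}{m}\right)
\qquad \text{subject to } \sum_{i=1}^{m} X_i = C,\; X_i \ge 0,
\]
\[
\text{with equality if and only if } X_1 = X_2 = \dots = X_m = \frac{C}{m}.
\]
```

With X_i = P_{i+1} - P_i and m = n - 1, the maximizer is exactly the evenly spaced case P_{i+1} - P_i = C/(n - 1) stated on the slide.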
  • 24. Math Modeling
      • With this score function F(b), we select the K characters {b1, b2, …, bK} from A with the top K scores.
      • For each selected character bk, at each of its occurrences in the string, we generate a fingerprint from its neighborhood with a hash function H1 [formula image not preserved in this transcript].
      • We obtain a set of fingerprints {FP1, FP2, …, FPn}.
      • We sort them in ascending order and pick the first N fingerprints. The number N may be pre-selected depending on the string size.
  • 25. Math Modeling
      • We get K*N anchor points (which generate K*N fingerprints).
      • We are done with modeling the anchor points; it should be straightforward to derive an algorithm from the model.
      • Let us name this math model of anchor points MODEL 1.
      • With MODEL 1, we developed an algorithm to generate fingerprints from a given string: DataDNA 1.0.
      • With DataDNA 1.0, we solve the DLP Data Inspection Problem:
          – S is a set of documents. For any document d, we need to find D from S such that EvalSim(D, d) ≥ X%.
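The MODEL 1 pipeline (score the characters, keep the top K, hash each neighborhood, keep the N smallest hashes per character) can be sketched as follows. This is not DataDNA 1.0: the score function F and the hash H1 are not preserved in this transcript, so the sketch takes the score as a parameter and defaults to a plain occurrence count as a stand-in, uses truncated SHA-1 in place of H1, and assumes a neighborhood of M bytes roughly centered on the character. All names are hypothetical.

```python
import hashlib
from collections import defaultdict


def model1_fingerprints(data: bytes, k=5, n=4, m=8, score=None):
    """Sketch of MODEL 1; returns at most k * n fingerprints as integers."""
    # Stand-in score (assumption): occurrence count only, NOT the slide's F(b).
    score = score or (lambda offsets: len(offsets))

    offsets_by_char = defaultdict(list)
    for pos, byte in enumerate(data):
        offsets_by_char[byte].append(pos)

    # Top K characters by score; ties are broken by byte value for determinism.
    top = sorted(offsets_by_char, key=lambda b: (-score(offsets_by_char[b]), b))[:k]

    fingerprints = set()
    for b in top:
        hashes = []
        for pos in offsets_by_char[b]:
            start = max(0, pos - m // 2)          # assumed neighborhood placement
            neighborhood = data[start:start + m]
            if len(neighborhood) == m:            # skip truncated windows at the end
                digest = hashlib.sha1(neighborhood).digest()
                hashes.append(int.from_bytes(digest[:8], "big"))
        fingerprints.update(sorted(hashes)[:n])   # ascending order, first N
    return fingerprints


text = b"the quick brown fox jumps over the lazy dog " * 4
print(len(model1_fingerprints(text, k=3, n=2)))
```

Keeping the N smallest hash values is what makes the selection repeatable: two similar strings that share a neighborhood will tend to pick the same fingerprint for it.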
  • 26. Math Modeling Again
      • Before long, we started to face a few challenges:
          1. If more than 60% of a document D is changed, the new document d may share no fingerprints with D.
          2. Our customers challenged us with a question: if we copy and paste a small piece of text into a very large document, does your DLP Data Inspection technology still work?
          3. Due to a change in product architecture, we replaced the original EvalSim with a new formula [formula image not preserved in this transcript].
      • NOTE: The replacement was needed because the original EvalSim has to compare two strings byte by byte to find common sub-strings. The new formula is based on the number of common fingerprints.
      • We have an issue: the anchor points selected by DataDNA 1.0 are not evenly distributed over the string, so EvalSim() calculated this way is not as accurate as expected. We need to fix it!
  • 27. Math Modeling Again
      • We had to propose a new model for selecting anchor points.
          – This time we use a rolling hash H to describe anchor points.
      • NOTE 1: Many applications use a similar trick to identify anchor points:
          – data de-duplication (cut points);
          – SSDEEP.
      • NOTE 2: We can use a Karp-Rabin rolling hash or Adler-32.
  • 28. Math Modeling Again
      • After identifying anchor points, we can generate fingerprints from the right neighborhoods of the anchor points with another hash function h:
          – This h can be a regular hash function; however, it is better to use a second rolling hash for performance.
  • 29. Math Modeling Again
      • This is MODEL 2 for describing anchor points. It can solve the 3 issues that we raised.
      • Why?
          – Statistically, the condition H(x) = 0 mod p yields one anchor point per p consecutive characters on average.
          – This is close to our expectation: an even distribution of anchor points.
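The MODEL 2 idea can be sketched with a Karp-Rabin style polynomial rolling hash: an offset is treated as an anchor point when the hash of the window starting there is 0 mod p. This is a sketch under stated assumptions, not DataDNA 2.0: the window size, base, modulus, value of p, the convention that the anchor is the window's first byte, and the use of truncated SHA-1 as the second hash h are all choices made here for illustration.

```python
import hashlib


def rolling_anchor_points(data: bytes, window=16, p=64,
                          base=257, modulus=(1 << 61) - 1):
    """Offsets i where the rolling hash of data[i:i+window] is 0 mod p."""
    if len(data) < window:
        return []
    high = pow(base, window - 1, modulus)  # weight of the byte leaving the window
    h = 0
    for byte in data[:window]:
        h = (h * base + byte) % modulus
    anchors = []
    for i in range(len(data) - window + 1):
        if h % p == 0:
            anchors.append(i)
        if i + window < len(data):
            # Slide the window: drop data[i], append data[i + window].
            h = ((h - data[i] * high) * base + data[i + window]) % modulus
    return anchors


def model2_fingerprints(data: bytes, window=16, p=64, neighborhood=32):
    """Hash the right neighborhood of each anchor point (assumed length)."""
    fingerprints = set()
    for i in rolling_anchor_points(data, window, p):
        chunk = data[i:i + neighborhood]
        if len(chunk) == neighborhood:
            fingerprints.add(hashlib.sha1(chunk).hexdigest()[:16])
    return fingerprints
```

Because the anchor test depends only on the bytes inside the window, inserting or deleting text elsewhere shifts the anchor offsets but does not change which neighborhoods are selected, which is the resilience MODEL 1 lacked. On data that behaves like random bytes, roughly one window in p passes the test.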
  • 30. Math Modeling Again
      • With MODEL 2, we developed an algorithm to generate fingerprints from a given string: DataDNA 2.0.
      • With DataDNA 2.0, we solve the DLP Data Inspection Problem with a better solution and a simple EvalSim function [formula image not preserved in this transcript]:
          – S is a set of documents. For any document d, we need to find D from S such that EvalSim(D, d) ≥ X%.
  • 31. Summary
      • We proposed a process for math modeling of real-world problems.
      • We practiced the process on the DLP Data Inspection Problem, which was posed by a DLP startup many years ago.
      • The problem was reduced to a string fingerprinting problem.
      • MODEL 1 was introduced to describe anchor points for generating fingerprints.
      • MODEL 2 was introduced to describe evenly distributed anchor points for generating fingerprints.
  • 32. Summary
      • The problem of DLP Data Inspection has been studied as the problem of Near Duplicate Document Detection.
      • Many applications:
          – Data leak prevention
          – Document classification and clustering
          – Anti-plagiarism
          – eDiscovery
          – Web search engines: index optimization
          – More…
  • 33. Q&A
      • Thank you for your attention.
      • Do you have questions?
  • 34. References
      1. US Patent 8359472, Document fingerprinting with asymmetric selection of anchor points, January 2013.
      2. US Patent 8266150, Scalable document signature search engine, September 2012.
      3. US Patent 7860853, Document matching engine using asymmetric signature generation, December 28, 2010.
      4. US Patent 7516130, Matching engine with signature generation, April 2009.
      5. My information:
          – Email: liwei_ren@trendmicro.com
          – LinkedIn: http://www.linkedin.com/in/drliweiren
          – Academic space: https://pittsburgh.academia.edu/LiweiRen