 Data leakage is the unauthorized transmission of sensitive data or information from within an organization to an external destination or recipient.
 Sensitive data of companies and organizations includes:
 intellectual property,
 financial information,
 patient information,
 personal credit card data,
and other information, depending upon the business and the industry.
 In the course of doing business, data must sometimes be handed over to trusted third parties for enhancement or other operations.
 Sometimes these trusted third parties may act as points of data leakage.
 Examples:
a) A hospital may give patient records to researchers who will devise new treatments.
b) A company may have partnerships with other companies that require sharing of customer data.
c) An enterprise may outsource its data processing, so data must be given to various other companies.
 Typical sharing channels include supply chains, demand chains, development chains, outsourcing, and business hubs.
 The owner of the data is termed the distributor, and the third parties are called the agents.
 In case of data leakage, the distributor must
assess the likelihood that the leaked data
came from one or more agents, as opposed
to having been independently gathered by
other means.
Watermarking
Overview:
A unique code is embedded in each distributed
copy. If that copy is later discovered in the hands of an
unauthorized party, the leaker can be identified.
Mechanism:
The main idea is to generate a watermark [W(x; y)]
using a secret key chosen by the sender such that W(x;
y) is indistinguishable from random noise for any
entity that does not know the key (i.e., the recipients).
 The sender adds the watermark W(x; y) to the
information object I(x; y) and thus forms a transformed
object TI(x; y) before sharing it with the recipient(s).
 It is then hard for any recipient to guess the
watermark W(x; y) (and subtract it from the
transformed object TI(x; y));
 The sender on the other hand can easily extract and
verify a watermark (because it knows the key).
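The scheme above can be sketched as keyed pseudorandom noise added to a numeric object. This is only an illustrative sketch: the function names, the noise amplitude, and the correlation threshold are assumptions, not part of any standard watermarking scheme.

```python
import hashlib
import random

def make_watermark(key: str, length: int, amplitude: float = 0.01):
    """Derive W from the secret key: noise that looks random to
    anyone who does not know the key."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-amplitude, amplitude) for _ in range(length)]

def embed(obj, key):
    """TI = I + W: the sender adds the watermark before sharing."""
    w = make_watermark(key, len(obj))
    return [x + d for x, d in zip(obj, w)]

def verify(obj, key, threshold=0.5):
    """The sender regenerates W from the key and checks how strongly
    the suspect copy correlates with it."""
    w = make_watermark(key, len(obj))
    num = sum(x * d for x, d in zip(obj, w))
    den = sum(d * d for d in w)
    return den > 0 and num / den > threshold
```

A recipient without the key cannot separate W from the transformed object, while the sender can check any suspect copy against the watermark it derived from the key.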
 One drawback is that watermarking involves some modification of the data: it perturbs the original objects by altering their attributes.
 A second drawback is that watermarks can sometimes be destroyed if the recipient is malicious.
Thus we need a data leakage detection technique which fulfils the following objective and abides by the given constraints.
CONSTRAINTS
 Satisfy agent requests by providing them with the number of objects they request, or with all available objects that satisfy their conditions.
 Avoid perturbing the original data before handing it to agents.
OBJECTIVE
 Be able to detect an agent who leaks any portion of his data.
 Entities and Agents:
 A distributor owns a set T = {t1, . . . , tm} of valuable data objects.
 The distributor wants to share some of the objects with a set of agents U1, U2, ..., Un, but does not wish the objects to be leaked to other third parties.
 The distributor distributes a set of records S to each agent based on its request, which may be a sample request or an explicit request.
Sample request Ri = SAMPLE(T, mi): any subset of mi records from T can be given to Ui.
Explicit request Ri = EXPLICIT(T, condition): agent Ui receives all objects in T that satisfy the condition.
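The two request types can be sketched as follows; this is a minimal illustration, and the helper names are my own, not from the original model:

```python
import random

def sample_request(T, m, seed=0):
    """Ri = SAMPLE(T, mi): any subset of mi records from T."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(T), m))

def explicit_request(T, condition):
    """Ri = EXPLICIT(T, condition): all objects of T satisfying it."""
    return {t for t in T if condition(t)}
```

For instance, `explicit_request({1, 2, 3, 4}, lambda t: t % 2 == 0)` yields `{2, 4}`, while a sample request of size 3 may return any 3-element subset.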
 Fake Objects:
Fake objects are objects generated by the distributor that are not in the set T. They are designed to look like real objects and are distributed to agents together with the genuine objects, in order to increase the chances of detecting agents that leak data.
 Data Allocation Problem:
“How can the distributor intelligently give data to agents in order to improve the chances of detecting a guilty agent?”
There are four instances of this problem, depending on the type of data requests made by agents (sample or explicit) and whether “fake objects” are allowed.
 Sample data requests:
• The distributor has the freedom to select the data items to provide the agents with.
• General idea:
– Provide agents with data sets that are as disjoint as possible.
• Problem: there are cases where the distributed data must overlap, e.g., when |R1| + … + |Rn| > |T|.
 Explicit data requests:
 The distributor must provide agents with exactly the data they request.
 General idea:
 Add fake objects to the distributed data to minimize the overlap between agents' sets.
 Problem: agents can collude and identify the fake data.
 Evaluation of a Sample Data Request (selecting the next object for agent Ui):
1: Initialize min_overlap ← 1, the minimum among the maximum relative overlaps that the allocations of different objects to Ui yield.
2: for k ∈ {k′ | tk′ ∉ Ri} do
     Initialize max_rel_ov ← 0, the maximum relative overlap between Ri ∪ {tk} and any set Rj that the allocation of tk to Ui yields.
3:   for all j = 1, ..., n : j ≠ i and tk ∈ Rj do
       Calculate the absolute overlap as abs_ov ← |Ri ∩ Rj| + 1
       Calculate the relative overlap as rel_ov ← abs_ov / min(mi, mj)
4:     Find the maximum relative overlap as max_rel_ov ← MAX(max_rel_ov, rel_ov)
5:   if max_rel_ov ≤ min_overlap then
       min_overlap ← max_rel_ov
       ret_k ← k
6: return ret_k
 For example: T = {t1, t2, t3}, agents U = {U1, U2, U3}, each making a sample request Ri = SAMPLE(T, 2).
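The steps above can be turned into a runnable sketch. Here `R[j]` holds agent Uj's current allocation and `m[j]` its requested sample size; the parameter names are my own, not from the original pseudocode:

```python
def select_object(i, T, R, m):
    """Return the object in T not yet given to Ui whose allocation to Ui
    yields the smallest maximum relative overlap with the other agents."""
    min_overlap = 1.0      # best (smallest) max relative overlap so far
    ret_k = None
    for k in sorted(T - R[i]):      # candidate objects tk with tk not in Ri
        max_rel_ov = 0.0
        for j in range(len(R)):
            if j != i and k in R[j]:
                abs_ov = len(R[i] & R[j]) + 1   # +1 counts tk itself
                rel_ov = abs_ov / min(m[i], m[j])
                max_rel_ov = max(max_rel_ov, rel_ov)
        if max_rel_ov <= min_overlap:
            min_overlap = max_rel_ov
            ret_k = k
    return ret_k
```

Repeatedly calling `select_object` for each requesting agent, and adding the returned object to that agent's set, builds allocations whose pairwise overlaps stay as small as the requests permit.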
 Evaluation of an Explicit Data Request:
1: Calculate the total number of fake records as the sum of fake records allowed per agent.
2: While total fake objects > 0:
3:   Select the agent that yields the greatest improvement in the sum objective, i.e.
       i = argmax ( 1/|Ri| − 1/(|Ri| + 1) ) Σj |Ri ∩ Rj|
4:   Create a fake record.
5:   Add this fake record to that agent's set and also to the set of fake records.
6:   Decrement the total fake record count.
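A greedy sketch of this allocation, under simplifying assumptions of my own: a single global fake-object budget B, every agent willing to accept fakes, every Ri nonempty, and a caller-supplied make_fake generator:

```python
def allocate_fakes(R, B, make_fake):
    """Greedily hand out B fake objects, each time to the agent whose
    enlarged set most reduces the sum-of-overlaps objective."""
    n = len(R)
    fakes = set()
    while B > 0:
        def improvement(i):
            # (1/|Ri| - 1/(|Ri|+1)) * sum_j |Ri ∩ Rj|
            overlap = sum(len(R[i] & R[j]) for j in range(n) if j != i)
            return (1 / len(R[i]) - 1 / (len(R[i]) + 1)) * overlap
        i = max(range(n), key=improvement)   # step 3: best agent
        f = make_fake(i)                     # step 4: fresh fake record
        R[i].add(f)                          # step 5: give it to Ui ...
        fakes.add(f)                         # ... and record it
        B -= 1                               # step 6
    return fakes
```

Each fake object dilutes the relative overlap between the chosen agent's set and everyone else's, which is what improves the chances of singling out a guilty agent later.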
 Future work includes the investigation of agent guilt
models that capture leakage scenarios that are not yet
considered.
 The extension of data allocation strategies so that they can handle agent requests in an online fashion.
 The presented strategies assume that there is a fixed
set of agents with requests known in advance.
 The distributor may have a limit on the number of fake
objects.
 It helps in detecting whether the distributor’s sensitive data has been leaked by supposedly trustworthy, authorized agents.
 It helps to identify the agents who leaked the data.
 It reduces cybercrime.
 Though leakers can be identified using the traditional technique of watermarking, certain data cannot admit watermarks.
 In spite of these difficulties, it is possible to assess the
likelihood that an agent is responsible for a leak.
 We observed that distributing data judiciously can make a
significant difference in identifying guilty agents using the
different data allocation strategies.