surances that there is no threat, and when those assurances are kept. Hence the point is that preserving privacy in full while handicapping counter-terrorism efforts achieves neither goal: privacy in a threatened atmosphere seems absurd.
Therefore, terrorism cannot be ignored, and effective counter-terrorism measures are a must in order to achieve the goals of both security and privacy. A host of counter-measures surround us all the time: from in-uniform security personnel to under-cover police to intelligence agencies, and from traditional technologies like CCTV networks and X-ray machines to complex face recognition and sensor networks. A lot of effort has been put in and many initiatives taken to counter terrorism.
While a number of measures have certainly been taken, the need is to maintain objectivity and evaluate each measure by its effectiveness. Most of the measures are different forms of "security theatre", a concept described by Bruce Schneier [1]. Security theatre means securing against a very specific attack: securing against a second 9/11, or securing the Super Bowl, historical monuments, subways and metros against terrorist attacks. Schneier argues that such a strategy, trying to secure against each possible attack, is grossly ineffective. The main reasons for the ineffectiveness are:
1. The number of possible attacks is limitless. By securing against one set of attacks we only force the terrorists to modify the plan slightly and mount some other attack: by securing airports, we only get the subways blown up.
2. There is no dearth of terrifying ideas, yet we do not see them realized very often, because terrorism is hard to carry out. Terrorism is very rare, and when the number of attacks is few, each attack is a new attack, not a copy of previous ones, with a new target and a new tactic. Hence security theatre cannot work, since it is based on tactics terrorists have used in the past.
It is essential that each measure taken be effective, since otherwise there is not only a loss of resources and privacy due to that measure, but also a loss of security, as a possibly better alternative measure goes neglected.
One particular counter-terrorism initiative is the use of information technology in the form of data mining, which is what we are particularly interested in. The governments of terrorism-affected nations have shown a strong interest in this field. Numerous programs are believed to exist under the US government that specifically perform pattern-based data mining over huge databases. These databases contain large amounts of data from disparate sources, including detailed data on US citizens. Subjecting such data to data mining has raised various privacy issues. Articles on this topic by Bruce Schneier and several papers published in this field have been surveyed and summarized here to present a view of the privacy-invasive nature of the work being done in counter-terrorism data mining, ways to introduce privacy-preserving technologies into this field, and arguments on whether data mining can be an effective tool for national security at all.
The structure of this paper is as follows: Section 2 describes the privacy-invasive nature of these data mining measures; Section 3 is about how to make the tradeoff between security and privacy in the context of counter-terrorism; Section 4 presents the arguments on why data mining would never work for the purpose of national security; Section 5 describes a framework which would ensure that data mining practices do not lead to privacy invasions; Section 6 provides ways of doing privacy-preserving data mining where the privacy-preserving nature is built into the tool; finally, we conclude in Section 7.
2 Privacy Invasive Terrorism Informatics
Past instances of terror strikes - 9/11 and the Madrid and London bombings - have shown that terrorists integrate into society to seek invisibility [3]. This has led governments to look for terrorists blended into their own societies in addition to looking for them in foreign lands. Data mining is one of the strategies adopted in this regard. Vast databases have been created which record everyday information about individuals: educational, health, financial, communications. These records are then subjected to data mining algorithms to find patterns. The assumption is that terrorist activity leaves behind a trail in everyday activities and that there are patterns which could identify it.
Two types of data mining are being used aggressively:
1. Subject-Based - Used to gather information about individuals already suspected of wrong-doing. This type of data mining has been used for a long time and forms a major source of investigations.
2. Pattern-Based - A model is built which is considered to characterize activities related to terrorism and is matched against the sea of everyday data. Any hit is treated as a possible terrorist plan or a potentially culpable individual. The aim of such a program is to find terrorists hidden in society. This type of data mining for national security purposes started after 9/11.
While in subject-based data mining there is an initial suspect around whom the data mining revolves, there is no such center of suspicion in pattern-based data mining, which instead relies on the predictive power of data linkages [3]. This has caused concern, as people who have done nothing to warrant suspicion are suddenly being watched day in and day out. Almost all of the privacy concerns regarding data mining for national security purposes have been about the pattern-based type.
Although the goal of such programs is the security of citizens, the means are privacy-invasive, since the sensitive data of citizens is scrutinized. The process of extracting information about individuals used to be expensive and time-consuming, which ensured that privacy violations were not practically feasible. This effect was termed "practical obscurity" by the U.S. Supreme Court [2]. In the twenty-first century, though, practical obscurity has been eroded by developments in technology.
3 Trading Security with Privacy
Whenever security and privacy come face to face, the security measure automatically wins over civil liberties, as the security threat is always more apparent and the concept of privacy is poorly understood. Usually, no reasoning is done about whether the measure is even effective. This is a bad tradeoff for civil liberties, as well as a loss for security, since there might be better alternatives which do not get the attention and the resources.
Reference [4] specifically discusses the tradeoff that exists between security and privacy and puts forward a rational way to balance security with liberty. It argues that the tradeoff between security and privacy is not a linear equation: alternatives may exist with better security promises as well as fewer infringements of civil liberties. Also, protecting privacy does not necessarily require the proposed measure to be scrapped completely; certain measures ensuring accountability might be enough. But the courts are not ready to go even that far, as the gravity of the security threat automatically wins over the loss of privacy.
In order to rationally trade security with privacy, [4] puts forward the following methodology and applies it to the case of terrorism as the threat and data mining as the security measure:
1. First, assess the gravity of the security threat. About terrorism, the author says that the threat is overhyped, as the number of people dying due to terrorism is minuscule; panic and fear cause the threat to be overstated. I would contest this perspective, as I have done earlier in this paper: the consequences of rare terrorist strikes are long-lasting and very akin to the consequences of privacy violations. In my view, the threat of terrorism cannot be taken lightly and should be given enough weight.
2. Secondly, assess the effectiveness of the proposed security measure against the given security threat. About data mining as a security measure against terrorism, the author says that it is effective in commercial settings, where the appetite for false positives is much higher, and that it raises serious concerns for governmental purposes due to the harms of false positives. The author also says that there is no evidence of data mining having proved its efficiency and worth.
3. Based on the above two factors, decide whether the loss of civil liberties is justified.
In the case of data mining for counter-terrorism, as mentioned above, the author feels the threat of terrorism is overhyped, and says that the lack of any example proving the efficiency of data mining for such purposes, together with the highly covert nature of such technologies, makes it hard to gauge their possible worth. The verdict of the author is fully captured in these lines:
"Given the significant potential privacy issues and other constitutional concerns, combined with speculative and unproven security benefits as well as many other alternative means of promoting security, should data mining still be on the table as a viable policy option? Of course, one could argue that data mining at least should be investigated and studied. There is nothing wrong with doing so, but the cost must be considered in light of alternative security measures that might already be effective and lack as many potential problems."
In my view, the threat of terrorism would always qualify for consideration of possible security measures, and I would give it enough weight to consider even privacy-violating measures. I feel this is the problem with the method: it is qualitative in nature. I cannot quantitatively assess a security threat like terrorism and check whether it qualifies for a certain amount of privacy violation (which too cannot be quantified). Under this method as well, it comes down to the whims of the judge to say whether a particular security threat is grave enough to justify a list of privacy violations; but the security advocates and privacy advocates will already have chosen their sides.
Evaluating the security measure, though, certainly seems a logical requirement for performing the tradeoff between security and privacy. The effectiveness of a security measure is much more quantifiable and apparent, so it makes sense to match the effectiveness of a security measure against the privacy violations. Still, a certain degree of ambiguity remains. It may seem naive to argue, but suppose a particular security measure saves one life per year in return for particular privacy violations. How would you decide whether the trade-off is balanced? How would you balance a certain number of lives saved against any amount of privacy violation?
Thus, the only step I would really stand by while performing the trade-off is comparing the possible security measures against each other. It is vital to choose the most effective security measure, or the measure with the best ratio of effectiveness to privacy invasion, if that ratio is measurable.
4 Why Data Mining won’t work for National Security
Bruce Schneier [1] has maintained, from 2001 until today, that data mining would never work for national security purposes. The main reasons pointed out are:
1. The attacks are very rare.
2. There is no well-defined profile to search for.
3. The cost of false positives is high.
The author says that data mining works when there is a reasonable number of attacks per year and a well-defined profile to search for. In the case of terrorism, though there is a pattern common to many terrorist attacks, the pattern is shared by many, many other events as well. And since the number of actual attacks is far smaller than the number of those other events, the number of false positives per true positive is massive. Further, the author says that the cost of the false positives, both financially and in terms of civil liberties, is very high.
Hence, in Bruce Schneier's view, the only way to fight terrorism is through on-the-ground intelligence work and investigation.
Reference [1] performs a qualitative assessment of data mining and puts forward the current roadblocks preventing data mining from proving efficient for national security:
1. Data Quality
Duplicate records, lack of data standards, untimely updates and human error are some factors that make data mining inaccurate. The reports describing the various governmental data mining programs have frequently presented evidence of such data inaccuracies. Further, the high stakes of such errors for individuals make the problem even harder.
2. Data Matching
There is no single huge database, and data mining requires integration across many different databases. Linking different databases together is a difficult and sometimes infeasible task, as databases might have different formats, the data about the same individual might be in different forms, the data itself might be unstructured, and so on. The government often does not have control over the disparate sources of data, which makes rectifying this issue even harder.
3. Data Mining Tools
It is hard to comment directly on the efficiency of governmental data mining, as there are no examples of its success and it is otherwise carried out in a classified manner. Inferring from the efficiency of data mining in the commercial sector, however, there is a problem of inaccuracy, mainly in the form of huge numbers of false positives. Compared to the private sector, there are many factors that should further diminish the performance of governmental data mining:
The targets for government are far fewer in number than the targets for the private sector.
The terrorist can blend in.
It is hard to derive a pattern to search for, as there have not been many terrorist attacks, and those that have occurred are very different from each other. [1] puts it properly: "With a relatively small number of attempts every year and only one or two major terrorist incidents every few years—each one distinct in terms of planning and execution—there are no meaningful patterns that show what behavior indicates planning or preparation for terrorism."
Data mining efforts are reactive, i.e. they respond to previous examples of terrorist incidents, but national security requires proactive efforts, as the terrorists can always come up with an entirely new plot.
In the private space the targets do not care much, but in counter-terrorism the terrorists will make every effort to avoid getting caught.
As Paul Rosenzweig, Deputy Assistant Secretary for Policy at DHS, put it: "[t]he only certainty [in data mining] is that there will be false positives."
5 Framework to Prevent Privacy Invasion
The Fourth Amendment restricts the government from obtaining personal information about individuals through "general searches" [1]. Thus, while the Fourth Amendment allows the specific searches encountered in subject-based data mining, the general searches of pattern-based data mining are blocked by it. However, the boundary between specific searches and general searches has dissolved into a distinction merely between reasonable and unreasonable searches. The Fourth Amendment applies to searches performed by the US government for national security and intelligence purposes.
But the Fourth Amendment does not apply to data collected by third parties, i.e. private parties. And since most of the data used for data mining purposes is collected from these private third parties, the Fourth Amendment imposes almost no restraint on the use of this data.
Apart from the Fourth Amendment, the Privacy Act of 1974 tries to regulate the government's collection and usage of private data. This act requires agencies to [1]:
Store no more information than required by the executive order.
Maintain data quality.
Ensure security of the stored data.
But there are various exceptions in this act which let the government pursue its aims regardless.
To evaluate the efficacy of its data mining programs and the privacy violations due to them, the US government established the Technology and Privacy Advisory Committee (TAPAC). In its recommendations to the government, TAPAC proposed a framework for carrying out data mining activities. This framework has been generally accepted and is advocated in [1] and [6]:
Legal Authorization - Requires the agency head to write an authorization letter stating the purpose of the project and how the information will be used, establishing acceptable false positive rates and the ways to deal with them.
Access Control - Ensure that only authorized users get access to the data and that they do not misuse it.
Anonymization and Selective Revelation - Reveal the minimum amount of private information; further detail is shown only when needed, and then only selectively.
Audit - Keep a record of which information was viewed by which analyst. This allows investigation into data breaches and misappropriation of data.
Address False Positives – Instead of directly taking actions on the results
of data mining, perform an intermediate step where analysts investigate
the result. If a false positive is found, use the result to improve the data
mining program.
Accountability Measures – Internal and external reviews of the program
should be held. The government should validate the models being used in
these programs and the results.
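A minimal sketch of the Access Control and Audit items above, with all names, record formats and masking conventions hypothetical, could look like this:

```python
import datetime

# Hypothetical sketch: every read of a record is checked against an
# authorization list and appended to an audit trail for later review.
AUTHORIZED = {"analyst_7"}
audit_log = []  # (timestamp, user, record_id) entries

def read_record(user, record_id, database):
    if user not in AUTHORIZED:
        raise PermissionError(user + " is not authorized")
    audit_log.append((datetime.datetime.now(datetime.timezone.utc).isoformat(),
                      user, record_id))
    return database[record_id]

# selective revelation: the stored view already masks the sensitive fields
db = {"r1": {"name": "[redacted]", "zip": "601**"}}
record = read_record("analyst_7", "r1", db)
```

An auditor can later replay audit_log to see which analyst viewed which record, matching the Audit recommendation; an unauthorized user never reaches the data at all.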
6 Privacy Preserving Data Mining
Reference [5] surveys the data mining techniques used in the closely related field of fraud detection. The survey shows that all kinds of learning algorithms are in extensive use in this field:
Supervised Approaches - Using labeled examples of fraudulent and authentic transactions, a mathematical model is created to distinguish between the two. Supervised learning algorithms that have been used for such purposes include neural networks, SVMs, Bayesian networks, naive Bayes, association rule mining and genetic programming. Popular supervised algorithms like neural networks, Bayesian networks and decision trees have been combined to create hybrid approaches that improve results.
Supervised + Unsupervised Hybrids - Some studies show that supervised algorithms outperform the unsupervised ones on telecommunications data, while the best results are achieved when both are used in conjunction.
Unsupervised Approaches - These techniques use unlabelled examples to find patterns and structures inherent in the data. Link analysis and graph mining are considered hot research topics in security areas like counter-terrorism and law enforcement. Unsupervised approaches like cluster analysis, outlier detection, spike detection and unsupervised neural networks have been applied to fraud detection.
Due to the privacy-invasive nature of these techniques, many efforts have been made to develop privacy-preserving mining techniques. Data mining is a combination of the tools and the data, not just one of them; thus various techniques are possible which work on either the data or the tool [6].
Reference [7] classifies privacy-preserving data mining techniques into three classes:
1. Heuristic Based - In heuristic-based techniques the data is modified in a way that leads to the least loss in utility. For example, data mining algorithms like association rule mining can be made privacy-preserving by ensuring that sensitive rules do not receive the required support or confidence, which can be done by hiding the item sets from which these rules are derived.
2. Cryptographic - Cryptography-based techniques are applied where data mining is done on distributed data. The privacy concern in such a scenario is that each data holder does not want to expose its raw data to the others, while all are interested in the end product of the computation. Data mining algorithms are hence required to perform secure multiparty computation (SMC). Various techniques have been proposed which convert normal computations into SMCs, and various SMC methods have been proposed which can support certain data mining algorithms. One particular SMC algorithm, for decision tree learning through ID3, has been proposed by [8]. We look at this algorithm in detail later in this section.
3. Reconstruction Based - Reconstruction-based techniques perturb the data while keeping it possible to infer the distribution of the data. Hence, though the data is perturbed at the granular level, the higher-level view is still maintained.
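The heuristic approach in class 1 can be illustrated with a toy sketch (the item names and the support threshold are invented): the sensitive itemset's support is pushed below the miner's threshold by deleting one of its items from transactions.

```python
# Toy sketch of heuristic rule hiding: drop an item of the sensitive
# itemset from transactions until its support falls below the threshold.
transactions = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"bread"}, {"milk"},
]
sensitive = {"bread", "milk"}
MIN_SUPPORT = 3  # hypothetical absolute support threshold of the miner

def support(itemset, db):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in db)

for t in transactions:
    if support(sensitive, transactions) < MIN_SUPPORT:
        break                      # itemset is hidden, stop sanitizing
    if sensitive <= t:
        t.discard("milk")          # sacrifice one item of the sensitive set
```

Only one transaction has to be touched here, which is the "least loss in utility" aspect: the non-sensitive itemset {bread} keeps its original support while {bread, milk} can no longer reach the mining threshold.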
Reference [6] performs another classification, of techniques which ensure that certain sensitive rules cannot be inferred while the non-sensitive rules can be:
1. Limiting Access - Provide a sample view of the database so that inferences drawn do not imply strict support.
2. Fuzz the data - Alter the data, or put aggregate values in place of individual values.
3. Eliminate unnecessary groupings - Keep the data as random as possible. Do not attach meanings to data meant for some other purpose, which could then be mined.
4. Augment the data - Add dummy data.
5. Audit - Not feasible when the data is publicly available, but for use within an organization it can induce accountability.
6. Attack the Algorithm
- The logic by which the algorithm finds rules can be attacked so as to ensure that dummy rules get created and sensitive rules are not found.
- The performance of the algorithm can be attacked to make the algorithm infeasible to apply on the given dataset.
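The "fuzz the data" idea (and the reconstruction-based class described earlier) can be sketched as additive noise: individual values become unreliable while aggregate statistics survive. The distributions and noise scale below are invented for illustration.

```python
import random
import statistics

# Perturb each record with zero-mean noise: the published value hides the
# individual, but the population-level distribution is still recoverable.
random.seed(0)
true_ages = [random.gauss(40, 10) for _ in range(10_000)]
published = [age + random.gauss(0, 25) for age in true_ages]  # what the miner sees

# zero-mean noise cancels out in aggregate, so the mean survives
gap = abs(statistics.mean(published) - statistics.mean(true_ages))
```

A single published value can be off by tens of years, yet the gap between the published and true means typically stays well under a year; reconstruction-based techniques exploit exactly this asymmetry.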
In the rest of this section, [8] is summarized to describe the proposed SMC technique.
[8] proposes privacy-preserving decision tree learning for a scenario where two parties hold parts of the database and do not wish to reveal the contents of their databases, while both are interested in the decision tree learnt on the union of their databases. No third party is assumed. This is a case of SMC where the number of participating parties is two. The proposed technique ensures that each participating party can learn no more than what can be learnt from its own database (its input) and the resulting decision tree (the output). A semi-honest adversary is assumed, so the technique preserves privacy in the face of any passive attack: the adversary may try to break the privacy of a participating party, but only while adhering to the protocols of the proposed technique.
Decision trees are machine learning tools for classification tasks. A decision tree consists of nodes where each internal node is a rule defined on one of the attributes of the data, and each leaf node is one of the possible classes. A decision tree is learnt for a given database using some decision tree learning algorithm. Once the tree is learnt, any test instance is traversed down the tree starting at the root, and the leaf node at which the traversal ends is the predicted class for the test instance. ID3 is a specific supervised learning algorithm to learn a decision tree for a given database. ID3 attempts to create the shortest tree possible by trying to finish the classification using the least number of nodes/attributes. This is done by ordering the attributes in decreasing order of their information gain over the training data; the attribute with maximum information gain completely classifies the maximum number of the existing unclassified transactions. Hence, ID3 recursively calculates the information gain of each attribute over the unclassified transactions in the training set, picks the one with maximum gain and puts it into the tree, until no unclassified transaction is left. The information gain for an attribute depends on the entropy of the attribute. The entropy of an attribute is given by:

Hc(T|A) = Σj=1..m ( |T(aj)| / |T| ) · Hc(T(aj))

where Hc(T|A) is the entropy of attribute A over the set of training transactions T when the set of possible classes is C, |T| is the number of transactions, |T(aj)| is the number of transactions having value aj for attribute A, m is the number of possible values of attribute A, and Hc(T(aj)) is the entropy of classifying the transactions having attribute value A = aj.
ID3 thus calculates the entropy for each attribute, selects the one with minimum entropy and puts it into the tree. ID3-delta is an extension of ID3 where the entropy for each attribute is approximated, and attributes having entropies within delta of each other can come in either order.
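The entropy-driven attribute selection described above can be sketched in a few lines of Python (the toy records and attribute names are invented):

```python
import math
from collections import Counter

def conditional_entropy(rows, attr, cls):
    """Hc(T|A): weighted entropy of the class within each value of attr."""
    total = len(rows)
    h = 0.0
    for value in {r[attr] for r in rows}:
        part = [r for r in rows if r[attr] == value]
        counts = Counter(r[cls] for r in part)
        h_part = -sum((c / len(part)) * math.log2(c / len(part))
                      for c in counts.values())
        h += (len(part) / total) * h_part
    return h

# toy training transactions (hypothetical)
T = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "rain",  "windy": "yes", "play": "yes"},
    {"outlook": "rain",  "windy": "no",  "play": "yes"},
    {"outlook": "rain",  "windy": "yes", "play": "no"},
]
# ID3 puts the minimum-entropy (i.e. maximum-gain) attribute at the root
root = min(["outlook", "windy"], key=lambda a: conditional_entropy(T, a, "play"))
```

Here outlook wins: it classifies the two sunny rows completely, so its conditional entropy (about 0.55 bits) is lower than that of windy (about 0.95 bits). ID3-delta would additionally treat attributes whose entropies lie within delta of each other as interchangeable.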
The problem being solved is a two-party computation, which is often denoted by (x, y) |→ (f1(x, y), f2(x, y)), where x is the input of the first party, y is the input of the second party, the first party wishes to receive f1(x, y) and the second party wishes to receive f2(x, y). The particular case of the problem at hand can thus be denoted as (D1, D2) |→ (ID3(D1 ∪ D2), ID3(D1 ∪ D2)), where D1 is the database possessed by the first party, D2 is the database possessed by the second party, and both parties are interested in the common output ID3(D1 ∪ D2).
The aim of SMC is to provide a private protocol to carry out the above computation (in the two-party case). A protocol is private if the view of each party can be simulated using just its input and the protocol's output, which means that the party does not learn anything new from the protocol execution. The proposed technique is a private protocol for calculating ID3-delta; hence the view of the first party can be simulated given only D1 and ID3-delta(D1 ∪ D2), and similarly for the second party.
Since the problem being solved is a case of SMC, existing generic solutions for SMC do solve it. Yao [9] proposed a protocol for computing any probabilistic polynomial-time functionality f(x, y), where x and y are the inputs of the two parties respectively. The protocol works by the first party computing f(x, ·) and sending it to the second party in encrypted form. The encryption is such that it allows partial decryption by the second party to give f(x, y). The keys used by the second party are received from the first party corresponding to y; this can be done without revealing y by carrying out |y| instances of a 1-out-of-2 oblivious transfer protocol [8]. A 1-out-of-2 oblivious transfer protocol is ((x0, x1), σ) |→ (λ, xσ): the first party inputs a pair (x0, x1) and the second party inputs a bit σ ∈ {0, 1}; the protocol outputs x0 or x1 to the second party depending on the input bit, while the first party learns nothing. While this generic solution applies to the problem of privately computing ID3-delta as well, its complexity is proportional to the input size, i.e. the size of the database, and it has a huge communication overhead. Hence, it scales badly for data mining purposes, where databases are very large.
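For concreteness, 1-out-of-2 oblivious transfer can be sketched with textbook RSA, in the style of the classic Even-Goldreich-Lempel construction (not necessarily the instantiation used in [8]). The parameters below are tiny and the sketch is completely insecure; it only illustrates the ((x0, x1), σ) |→ (λ, xσ) functionality.

```python
import random

# Textbook RSA modulus (p=61, q=53): insecure toy parameters, illustration only
N, E, D = 3233, 17, 2753  # E*D ≡ 1 mod lcm(60, 52)

def ot_1_of_2(m0, m1, sigma, rng=random):
    """Sender holds (m0, m1); receiver with choice bit sigma learns only m_sigma."""
    # Sender: publish two random values, one per message slot
    x0, x1 = rng.randrange(N), rng.randrange(N)
    # Receiver: blind the chosen slot with an RSA-encrypted random k
    k = rng.randrange(1, N)
    v = ((x0, x1)[sigma] + pow(k, E, N)) % N
    # Sender: RSA-decrypt v relative to each slot; exactly one result equals k,
    # but the sender cannot tell which, so it learns nothing about sigma
    k0 = pow((v - x0) % N, D, N)
    k1 = pow((v - x1) % N, D, N)
    c0, c1 = (m0 + k0) % N, (m1 + k1) % N
    # Receiver: unblind the chosen ciphertext; the other stays masked
    return ((c0, c1)[sigma] - k) % N
```

For messages below N, ot_1_of_2(42, 99, 0) returns 42 and ot_1_of_2(42, 99, 1) returns 99, while the unchosen message remains hidden behind an RSA pre-image the receiver cannot compute.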
Due to the inefficiency of generic protocols, research has focused on developing efficient solutions to specific problems. In this direction, [8] proposes a solution for two-party distributed private computation of ID3-delta. The proposed algorithm provides an efficient protocol by cutting the communication overhead: each party engages mostly in independent computations.
The assumptions throughout the proposed protocol are:
The databases D1 and D2 possessed by the two parties have the same structure.
Attribute names are public.
The possible attribute values are public for each attribute.
The total size |D1 ∪ D2| is public.
As seen before, the main task of ID3 is finding the attribute with minimum entropy, which is performed recursively until all training transactions are classified. In order to cut the complexity of performing this task, the entropy is rewritten in the following form:

|T| · Hc(T|A) = Σj |T(aj)| log |T(aj)| − Σj Σi |T(aj, ci)| log |T(aj, ci)|

Since |T|, the number of transactions, is constant across all attributes, it can be ignored. Now, to compute the entropy of any attribute A, two quantities are required: |T(aj)| and |T(aj, ci)|, where |T(aj)| is the number of transactions having value aj for attribute A, and |T(aj, ci)| is the number of transactions having attribute value A = aj and class value ci. Now, |T(aj)| = |T1(aj)| + |T2(aj)|, and similarly |T(aj, ci)| = |T1(aj, ci)| + |T2(aj, ci)|, where T1 denotes the transactions in D1 and T2 the transactions in D2. Therefore, a non-private method of finding the minimum entropy attribute would be for the first party to compute |T1(aj)| and |T1(aj, ci)| for each attribute and send them to the second party, which could then calculate |T(aj)| and |T(aj, ci)| and hence the entropy of each attribute. The communication complexity in this case is reduced to logarithmic in the number of transactions.
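The non-private counting method above amounts to each party computing local counters and adding them; a sketch with hypothetical attribute values a1, a2 and classes c1, c2:

```python
from collections import Counter

# Horizontally partitioned toy data: each row is (attribute value, class value)
D1 = [("a1", "c1"), ("a1", "c2"), ("a2", "c1")]   # first party's transactions
D2 = [("a1", "c1"), ("a2", "c2")]                  # second party's transactions

def local_counts(rows):
    t_a = Counter(a for a, _ in rows)   # |Ti(aj)|
    t_ac = Counter(rows)                # |Ti(aj, ci)|
    return t_a, t_ac

t1_a, t1_ac = local_counts(D1)
t2_a, t2_ac = local_counts(D2)

# only the counters travel between parties; they combine by simple addition
t_a = t1_a + t2_a      # |T(aj)|    = |T1(aj)|    + |T2(aj)|
t_ac = t1_ac + t2_ac   # |T(aj,ci)| = |T1(aj,ci)| + |T2(aj,ci)|
```

Sending counters rather than transactions is what makes the communication logarithmic in the number of transactions; it is of course not yet private, since the raw counts |T1(aj, ci)| are revealed, which is exactly what the private protocol below avoids.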
In order to turn this into a private protocol, the key observation is that privately computing ID3-delta is equivalent to privately finding the attribute with the minimum entropy, which in turn means privately computing the entropy of each attribute, Hc(T|A). This quantity has been written above as a sum of expressions of the form (v1 + v2) ln(v1 + v2) (the log can be changed to ln, since we only have to compare this quantity across attributes), where v1 = |T1(aj, ci)| or |T1(aj)| and v2 = |T2(aj, ci)| or |T2(aj)|.
The task of privately finding the minimum entropy attribute is done by computing random shares of Hc(T|A) for each attribute A and distributing them between the parties such that the sum of the shares equals Hc(T|A); the task of privately computing Hc(T|A) in turn reduces to privately computing the expression (v1 + v2) ln(v1 + v2). The protocol for privately computing this expression takes inputs v1 and v2 from the two parties and outputs shares of an approximation of (v1 + v2) ln(v1 + v2) to the two parties, such that the sum of the shares equals the approximation.
Now, Hc(T|A) is a sum of expressions of the form (v1 + v2) ln(v1 + v2), and by the above protocol each party holds shares whose sum approximates each such expression. Therefore, each party can independently sum its own shares over all the (v1 + v2) ln(v1 + v2) expressions for Hc(T|A) to obtain its share of Hc(T|A). Hence, by following the protocol for all such expressions over all attributes, each party holds its shares of an approximation of Hc(T|A) for each attribute A. The only part remaining is, given the shares, to find the minimum entropy attribute, i.e. the attribute for which the sum of the corresponding shares held by the two parties is minimum. This is done using Yao's protocol: it takes as input the shares for all attributes from each party and outputs the attribute for which the sum of shares is minimum.
Hence, the task of finding the attribute with minimum entropy is performed by invoking two separate private sub-protocols:
1. Privately calculating (v1 + v2) ln(v1 + v2) and distributing its shares.
2. Yao's protocol for privately finding the minimum entropy attribute, given the shares of Hc(T|A) for all A.
This composition of two private sub-protocols results in a private protocol, since the first protocol yields shares which are uniformly distributed in a finite field [8]. Hence the resulting protocol is private.
It remains to describe the protocol for privately computing (v1 + v2) ln(v1 + v2). As stated, it takes as input v1 and v2 from the two parties and outputs shares of an approximation of (v1 + v2) ln(v1 + v2). The protocol is carried out in two steps:
1. Distribute shares of ln(v1 + v2)
Let v1 + v2 = x; the task of this step is thus to create shares of ln x. For a given x, we start by finding the n for which 2^n is closest to x, so that x = 2^n (1 + E) where -1/2 <= E <= 1/2. Taking ln on both sides gives:

ln(x) = ln(2^n) + ln(1 + E) = n ln 2 + E - E^2/2 + E^3/3 - E^4/4 ...

Yao's protocol is used to compute 2^n E and 2^n n ln 2. Then shares are calculated for the above Taylor series approximation using oblivious polynomial evaluation. The shares obtained in this step (u1, u2) sum to ln x.
2. Given v1, v2 and shares of ln(v1 + v2), find shares of (v1 + v2) ln(v1 + v2) using a private multiplication protocol, which is also based on oblivious polynomial evaluation. Each party invokes the multiplication protocol twice to receive shares of u1 · v2 and u2 · v1. Party 1's share w1 is then the sum of its two shares and u1 · v1, and party 2's share w2 is the sum of its two shares and u2 · v2. We get:

w1 + w2 = u1 v1 + u1 v2 + u2 v1 + u2 v2 = (u1 + u2)(v1 + v2) ≈ x ln x

These shares are then used in the protocol for finding the attribute with minimum entropy.
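Both steps can be checked in the clear with a short sketch: the Taylor approximation of ln x from step 1, and the share arithmetic of step 2, where plain random additive sharing stands in for the oblivious polynomial evaluations (all concrete values are invented, and the real protocol works over shared, scaled integers rather than floats):

```python
import math
import random

rng = random.Random(1)

# --- Step 1: approximate ln x via x = 2**n * (1 + E) and a Taylor series ---
def ln_via_taylor(x, terms=8):
    """ln x ≈ n*ln 2 + E - E^2/2 + E^3/3 - ..., with x = 2**n * (1 + E)."""
    n = round(math.log2(x))        # power of two closest to x (in log scale)
    E = x / 2**n - 1               # remainder term, roughly in [-1/2, 1/2]
    series = sum((-1) ** (k + 1) * E**k / k for k in range(1, terms + 1))
    return n * math.log(2) + series

# --- Step 2: share arithmetic for (v1 + v2) ln(v1 + v2) ---
v1, v2 = 30, 70                    # the two parties' private counts
x = v1 + v2
u1 = rng.uniform(-10, 10)          # random additive share of ln x ...
u2 = ln_via_taylor(x) - u1         # ... so that u1 + u2 ≈ ln x

def add_shares(value):
    """Split value into two random additive shares (a stand-in for the
    output of the private multiplication protocol)."""
    s1 = rng.uniform(-100.0, 100.0)
    return s1, value - s1

a1, a2 = add_shares(u1 * v2)       # shares of the cross term u1*v2
b1, b2 = add_shares(u2 * v1)       # shares of the cross term u2*v1
w1 = u1 * v1 + a1 + b1             # party 1's final share
w2 = u2 * v2 + a2 + b2             # party 2's final share
# w1 + w2 = (u1 + u2)(v1 + v2) ≈ x ln x
```

Neither w1 nor w2 alone says anything about x ln x, since each is masked by fresh randomness; only their sum, computed inside the final Yao comparison, is meaningful.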
7 Conclusion
While there has not been any evidence of its success, data mining for national security has serious privacy-invasive implications. Faith in data mining ranges from one end, where Bruce Schneier has consistently condemned it, to the other, where the U.S. government runs numerous programs using data mining for national security purposes.
In this survey paper, the privacy implications of national-security-driven data mining have been put forward, some reasons for its lack of success have been explored, and ways to find a balance between privacy and security in this field, through formal frameworks and through data mining techniques which inherently preserve privacy, have been examined.
8 Acknowledgement
I would like to deeply thank Dr. Shishir Nagaraja for letting me perform this Independent Study under his guidance.
9 References
1. Bruce Schneier: Crypto-Gram essays on terrorism, http://www.schneier.com/essays-terrorism.html
2. Fred H. Cate: Government Data Mining: The Need for a Legal Framework
3. Ira S. Rubinstein et al.: Data Mining and Internet Profiling: Emerging Regulatory and Technological Approaches
4. Daniel J. Solove: Data Mining and the Security-Liberty Debate
5. Clifton Phua et al.: A Comprehensive Survey of Data Mining-based Fraud Detection Research
6. Chris Clifton et al.: Security and Privacy Implications of Data Mining
7. Vassilios S. Verykios et al.: State-of-the-art in Privacy Preserving Data Mining
8. Yehuda Lindell et al.: Privacy Preserving Data Mining
9. A. C. Yao: How to Generate and Exchange Secrets. In: Proceedings of the 27th Symposium on Foundations of Computer Science (FOCS), IEEE, 1986, pp. 162-167