This document discusses the roles involved in data mining processes and privacy concerns. It describes the roles of data provider, data collector, data miner, and decision maker. For each role, it outlines their privacy concerns and approaches that can be used to address those concerns, such as limiting data access, anonymization techniques, and secure multi-party computation. The goal of privacy-preserving data mining is to protect sensitive information while still allowing for useful knowledge discovery from data.
Privacy and Data Security in Data Mining, by Abhishek L.R
A presentation on privacy and security in data mining: the mining process, how it is carried out, and the major threats that arise along the way.
2. Introduction
Data Mining Roles
Data Provider
Data Collector
Data Miner
Decision Maker
Game Theory
Non-Technical Solutions
Future Research Area
Conclusion
References
3. Big Data
Is a term that describes large volumes of data, both structured and unstructured.
Is a term used for data sets so large or complex that they are difficult to process using traditional database and software techniques.
Data Mining
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
Data mining has been successfully applied to many domains, such as business intelligence, web search, scientific discovery, digital libraries, etc.
5. Data mining is also referred to as "Knowledge Discovery from Data" (KDD).
Useful knowledge is obtained from data through the following steps:
Step 1: Data Preprocessing (data selection, cleaning, and integration)
Step 2: Data Transformation (transform the data into a form appropriate for the mining task)
Step 3: Data Mining (extract data patterns)
Step 4: Pattern Evaluation and Presentation (present the knowledge in an easy-to-understand form)
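The four steps above can be sketched end to end in a few lines. This is a minimal illustration only: the records, the field names, and the "average purchases per age group" pattern are all hypothetical, not part of the original presentation.

```python
# Minimal sketch of the four KDD steps; records, fields, and the mined
# pattern (average purchases per age group) are hypothetical illustrations.
from collections import defaultdict

records = [
    {"age": 34, "zip": "47677", "purchases": 12},
    {"age": 34, "zip": "47602", "purchases": 3},
    {"age": None, "zip": "47678", "purchases": 7},  # incomplete record
    {"age": 51, "zip": "47905", "purchases": 15},
]

# Step 1: Data Preprocessing -- select and clean (drop incomplete records).
clean = [r for r in records if r["age"] is not None]

# Step 2: Data Transformation -- derive a form suited to the mining task.
transformed = [{"age_group": r["age"] // 10 * 10, "purchases": r["purchases"]}
               for r in clean]

# Step 3: Data Mining -- extract a simple pattern from the transformed data.
by_group = defaultdict(list)
for r in transformed:
    by_group[r["age_group"]].append(r["purchases"])
patterns = {group: sum(v) / len(v) for group, v in by_group.items()}

# Step 4: Pattern Evaluation and Presentation -- report in an
# easy-to-understand form.
for group, avg in sorted(patterns.items()):
    print(f"ages {group}-{group + 9}: average purchases = {avg:.1f}")
```

Real pipelines replace step 3 with proper mining algorithms (association rules, clustering, classification), but the staging is the same.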
6. Data mining technologies bring serious threats to the security of individuals' sensitive information.
To reduce the privacy risk brought by data mining operations, we need to modify the data in such a way that data mining algorithms can still be performed effectively without compromising the security of the sensitive information contained in the data.
7. An individual's privacy may be violated due to unauthorized access to personal data. Thus there is a conflict between data mining and privacy security.
Privacy Preserving Data Mining (PPDM)
Deals with the privacy issues in data mining.
The objective of PPDM is to safeguard sensitive information from unsolicited or unsanctioned disclosure and, meanwhile, preserve the utility of the data.
Considerations of PPDM are:
1. Sensitive raw data (IDs, phone numbers, etc.) should not be used in data mining.
2. Sensitive mining results whose disclosure would result in privacy violations should be excluded.
8. Data flow: Data Provider → Data Collector → Database → Data Miner → Extracted Info. → Information Transmitter → Decision Maker
Data Provider: the user who owns some data that are desired by the data mining task.
Data Collector: the user who collects data from data providers and then publishes it to the data miner.
Data Miner: the user who performs data mining tasks on the data.
Decision Maker: the user who makes decisions based on the data mining results in order to achieve certain goals.
9. Privacy Concerns of Each Role
Approaches to Privacy Protection
Data Provider
Data Collector
Data Miner
Decision Maker
10. Data Provider: the user who owns some data that are desired by the data mining task
11. If the Data Provider reveals his data to the Data Collector, his privacy might be compromised due to an unexpected data breach.
The privacy concern of the Data Provider is whether he can control what kind of, and how much, information other people can obtain from his data.
The Data Provider should be able to make his sensitive data inaccessible to the Data Collector. However, the Data Provider has to provide some data, and should get enough compensation for the possible loss in privacy.
12. Limit the Access
Security tools developed for the Internet environment to protect data:
Anti-tracking extensions (Do Not Track Me, Ghostery, etc.)
Advertisement and script blockers (AdBlock Plus, NoScript, FlashBlock, etc.)
Encryption tools (MailCloak, TorChat, etc.)
Trade Privacy
The Data Provider needs to make a trade-off between the loss of privacy and the benefit brought by participating in data mining.
The Data Provider needs to know how to negotiate with the Data Collector, so that he will get enough compensation for any possible loss in privacy.
The Data Provider may be willing to provide his sensitive data to a Data Collector who promises that his sensitive information will not be revealed.
Provide False Data
Using "sockpuppets" to hide one's true activities
Using a fake identity to create phony information
Using security tools to mask one's identity
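One principled way for a Data Provider to "provide false data" is randomized response, a classic survey technique (not named in the slides, so treat it as a supplementary illustration): each provider answers truthfully only with a known probability, so no single answer exposes the true value, yet aggregate statistics remain recoverable. The probabilities and survey values below are hypothetical.

```python
# Randomized response sketch: report the truth with probability p, otherwise
# report a fair coin flip. No individual report is reliable on its own, but
# the population rate can still be estimated from the noisy reports.
import random

def randomized_response(truth: bool, p: float = 0.5) -> bool:
    """Report the truth with probability p; otherwise report a coin flip."""
    if random.random() < p:
        return truth
    return random.random() < 0.5

def estimate_true_rate(reports, p: float = 0.5) -> float:
    """Invert the noise: observed = p * true + (1 - p) * 0.5,
    so true = (observed - (1 - p) / 2) / p."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) / 2) / p

random.seed(0)
true_answers = [True] * 300 + [False] * 700   # true population rate: 30%
reports = [randomized_response(t) for t in true_answers]
print(round(estimate_true_rate(reports), 2))  # close to 0.30
```

The larger the population, the closer the estimate gets to the true rate, even though every individual retains plausible deniability.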
13. Data Collector: the user who collects data from data providers and then publishes it to the data miner
14. Data flow (recap): Data Provider → Data Collector → Database → Data Miner → Extracted Info. → Information Transmitter → Decision Maker
15. The original data collected from Data Providers usually contains sensitive information about individuals. If the Data Collector does not take sufficient precautions before releasing the data to the public or to data miners, that sensitive information may be disclosed.
It is therefore necessary for the Data Collector to modify the original data before releasing it to others, so that sensitive information about the Data Providers cannot be found.
The modifications should retain sufficient utility of the data.
16. 1. Basics of PPDP
2. Privacy-Preserving Publishing of Social Media
3. Attack Model
4. Privacy-Preserving Publishing of Trajectory Data
17. Basics of PPDP
The data modification process adopted by the Data Collector, with the goal of preserving privacy and utility simultaneously, is usually called Privacy-Preserving Data Publishing (PPDP).
The original data is assumed to be a private table consisting of multiple records; each record contains: Identifier (ID), Quasi-Identifier (QID), Sensitive Attribute (SA), and Non-Sensitive Attribute (NSA).
The table should be anonymized before being published to others: IDs should be removed and QIDs should be modified.
k-anonymity is the most widely used privacy model, among other privacy models.
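As a concrete illustration of k-anonymity: a table satisfies it if every combination of quasi-identifier values is shared by at least k records. The check below sketches this; the table and column names are hypothetical examples, not data from the presentation.

```python
# k-anonymity check sketch: group records by their quasi-identifier (QID)
# values and require every group to contain at least k records.
from collections import Counter

def is_k_anonymous(records, qid_keys, k):
    """Return True if every QID combination appears in at least k records."""
    groups = Counter(tuple(r[key] for key in qid_keys) for r in records)
    return all(count >= k for count in groups.values())

table = [
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "476**", "disease": "cold"},
    {"age": "50-59", "zip": "479**", "disease": "flu"},
    {"age": "50-59", "zip": "479**", "disease": "asthma"},
]
print(is_k_anonymous(table, ["age", "zip"], k=2))  # True: each QID group has 2 rows
print(is_k_anonymous(table, ["age", "zip"], k=3))  # False: no group reaches 3 rows
```

An attacker who knows someone's age range and ZIP prefix can narrow them down to a group of k records at best, which bounds the re-identification probability by 1/k.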
18. Basics of PPDP: Anonymization operations
Generalization: replace some values with a parent value
Suppression: replace some values with a special value, e.g. '*'
Anatomization: de-associate the relationship between the QID and the sensitive attribute
Permutation: de-associate the relationship between the QID and the numerical sensitive attribute
Perturbation: replace the original data values with synthetic data values, so that computations on the perturbed data do not differ significantly from computations on the original data
The anonymization operations reduce the utility of the data; there are various metrics for measuring the information loss.
A fundamental problem of PPDP is how to make a trade-off between privacy and utility.
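The first two operations can be sketched directly. The record, the ten-year age ranges, and the three-digit ZIP prefix below are illustrative assumptions, not parameters prescribed by the presentation.

```python
# Generalization replaces a value with a parent value (exact age -> range);
# suppression replaces part of a value with the special value '*'.

def generalize_age(age: int) -> str:
    """Generalize an exact age to a ten-year range (its parent value)."""
    low = age // 10 * 10
    return f"{low}-{low + 9}"

def suppress_zip(zipcode: str, keep: int = 3) -> str:
    """Suppress the trailing digits of a ZIP code with '*'."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

record = {"id": "A123", "age": 34, "zip": "47677", "disease": "flu"}
anonymized = {
    # the explicit identifier ("id") is removed entirely before publishing
    "age": generalize_age(record["age"]),
    "zip": suppress_zip(record["zip"]),
    "disease": record["disease"],  # sensitive attribute kept for utility
}
print(anonymized)  # {'age': '30-39', 'zip': '476**', 'disease': 'flu'}
```

Coarser ranges or shorter prefixes strengthen privacy but lose more information, which is exactly the privacy-utility trade-off named above.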
20. Privacy-Preserving Publishing of Social Media
A social network is usually modeled as a graph, where a vertex represents an entity and an edge represents the relationship between two entities.
PPDP in the context of social networks mainly deals with anonymizing graph data, which is more challenging than anonymizing relational data tables.
There are three challenges in social networks:
Modeling the adversary's background knowledge about the network is much harder.
Measuring the information loss in anonymizing social network data is harder than for relational data.
Devising anonymization methods for social network data is much harder than for relational data.
21. ATTACK MODEL Given the anonymized network data, adversaries usually rely on background knowledge to de-
anonymize individuals and learn the relationships between de-anonymized individuals.
The goal of the attack model is to find the social relationships between the de-anonymized individuals.
Types of background knowledge:
Attributes of vertices, vertex degrees, link relationships, neighborhoods, embedded subgraphs,
and graph metrics
A proposed algorithm called 'Seed-and-Grow' identifies users from an anonymized social graph.
The algorithm identifies a seed sub-graph, which is either planted by an attacker or divulged by the
collusion of a small group of users, then grows the seed based on existing knowledge of the
user's social relations. Examples of attacks: structural attack, mutual-friend attack, friendship
attack, degree attack.
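The 'Seed-and-Grow' details are not reproduced here, but the simplest of the listed attacks, the degree attack, is easy to sketch: if an attacker knows each target's number of friends, any vertex with a unique degree is immediately re-identified. The graph and names below are invented for illustration:

```python
# A published graph with labels removed (vertices n0..n3), as adjacency sets.
anon_graph = {
    "n0": {"n1", "n2", "n3"},
    "n1": {"n0"},
    "n2": {"n0", "n3"},
    "n3": {"n0", "n2"},
}
# Background knowledge: the attacker knows each target's number of friends.
background = {"Alice": 3, "Bob": 1}

degrees = {v: len(nbrs) for v, nbrs in anon_graph.items()}
for person, deg in background.items():
    candidates = [v for v, d in degrees.items() if d == deg]
    if len(candidates) == 1:  # a unique degree de-anonymizes the vertex
        print(f"{person} must be {candidates[0]}")
```

Alice (degree 3) can only be `n0` and Bob (degree 1) can only be `n1`; only `n2` and `n3`, which share degree 2, remain ambiguous. This is exactly why degree-based anonymity models for graphs exist.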
22.
23.
24. ATTACK MODEL Privacy Model
To protect the privacy of relationships from the mutual-friend attack, a variant of k-anonymity
called k-NMF anonymity has been introduced.
If the network satisfies k-NMF anonymity, then for each edge e there will be at least k - 1 other
edges with the same number of mutual friends as e. It can then be guaranteed that the probability of
an edge being identified is not greater than 1/k.
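The k-NMF condition can be verified directly: compute the number of mutual friends for every edge, then check that each mutual-friend value is shared by at least k edges. A minimal sketch on a toy undirected graph (adjacency sets):

```python
from collections import Counter

def mutual_friend_counts(graph):
    """Number of mutual friends for every edge (u, v), listed once with u < v."""
    counts = {}
    for u, nbrs in graph.items():
        for v in nbrs:
            if u < v:
                counts[(u, v)] = len(graph[u] & graph[v])
    return counts

def satisfies_k_nmf(graph, k):
    """k-NMF: every mutual-friend value occurs on at least k edges."""
    freq = Counter(mutual_friend_counts(graph).values())
    return all(n >= k for n in freq.values())

# A triangle: every edge has exactly one mutual friend.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(satisfies_k_nmf(triangle, k=3))  # True: 3 edges share the value 1
print(satisfies_k_nmf(triangle, k=4))  # False: only 3 such edges exist
```

In the triangle, an adversary who knows "these two people have exactly one mutual friend" still cannot distinguish among the three edges, so the identification probability is at most 1/3.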
25. ATTACK MODEL Data Utility
In the context of network data anonymization, data utility means whether, and to what
extent, properties of the graph are preserved.
Most existing k-anonymization algorithms for network data publishing perform edge insertion
and/or deletion operations, and try to minimize the resulting utility loss.
26. PRIVACY-PRESERVING PUBLISHING OF
TRAJECTORY DATA Location-Based Services (LBS) utilize the location information of individuals,
e.g. to locate a restaurant or to monitor traffic congestion levels.
The use of private location information raises privacy issues in LBS, in particular when
publishing individuals' trajectory data.
k-anonymity has been redefined for trajectories, leading to the proposed (k, δ)-anonymity model.
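The source does not spell out the (k, δ) definition, but its intuition can be sketched: a trajectory is protected if at least k - 1 other published trajectories stay within distance δ of it at every timestamp. A simplified, illustrative check (not the published algorithm):

```python
def within_delta(traj_a, traj_b, delta):
    """Two trajectories are co-localized if their positions are at most
    delta apart (Euclidean distance) at every timestamp."""
    return all(
        ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 <= delta
        for (xa, ya), (xb, yb) in zip(traj_a, traj_b)
    )

def satisfies_k_delta(trajectories, k, delta):
    """Simplified (k, delta)-anonymity check: every trajectory must be
    co-localized with at least k - 1 others."""
    for i, t in enumerate(trajectories):
        peers = sum(
            within_delta(t, other, delta)
            for j, other in enumerate(trajectories) if j != i
        )
        if peers < k - 1:
            return False
    return True

# Trajectories as lists of (x, y) positions, one per timestamp.
trajs = [[(0, 0), (1, 1)], [(0.5, 0), (1, 1.5)], [(10, 10), (11, 11)]]
print(satisfies_k_delta(trajs, k=2, delta=1.0))  # False: the third user is isolated
```

The first two users move together within δ = 1.0 of each other, but the third user's trajectory is unique, so the set as a whole fails the check.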
27. The user who performs data mining tasks on the
data.
28. Data flow: Data Provider → Data Collector (Database) → Data Miner → Extracted Info. → Information Transmitter → Decision Maker
29. Personal information can be directly observed in the data, and a data breach happens
if the Data Miner is able to find out information underlying the data. (Sometimes
data mining may reveal sensitive information about the data owners.)
The Data Miner also faces the privacy-utility trade-off problem.
The main concern of the Data Miner is HOW to prevent sensitive information from
appearing in the mining results.
To perform privacy-preserving data mining, the Data Miner usually needs to modify
the data obtained from the Data Collector.
30. Based on the distribution of the data, PPDM approaches can be classified into:
Approaches for centralized data mining
Approaches for distributed data mining:
Horizontally partitioned data
Vertically partitioned data
31. For distributed data mining, Secure Multi-party Computation (SMC) is widely
used.
The goal of SMC is to ensure that each participant obtains the correct data
mining result without revealing their own data to the others.
Participants: P1, P2, P3, …, Pm
Private data: X1, X2, X3, …, Xm
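The classic introductory SMC protocol is secure sum: each participant Pi splits its private value Xi into random additive shares, one per participant, so that no single party ever sees another party's raw input, yet the sum of all pooled shares equals the true total. A minimal sketch (the protocol simulated in one process, which is an idealization):

```python
import random

def secure_sum(private_values, modulus=10**9):
    """Secure-sum sketch: each participant splits its value into random
    additive shares modulo `modulus`, one per participant. Summing all
    pooled shares recovers the exact total without pooling raw inputs."""
    m = len(private_values)
    pooled = [0] * m  # pooled[j] = sum of the shares participant j receives
    for x in private_values:
        shares = [random.randrange(modulus) for _ in range(m - 1)]
        last = (x - sum(shares)) % modulus  # final share fixes the total
        for j, s in enumerate(shares + [last]):
            pooled[j] = (pooled[j] + s) % modulus
    return sum(pooled) % modulus

# Participants P1..P4 with private data X1..X4:
data = [12, 7, 30, 1]
print(secure_sum(data))  # 50 — the correct result, computed from shares only
```

Each `pooled[j]` is uniformly random on its own, so an honest-but-curious participant learns nothing about any individual Xi; only the final aggregate is revealed, which is exactly the SMC goal stated above.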
33. PRIVACY-PRESERVING ASSOCIATION
RULE MINING Privacy-Preserving Association Rule Mining
Association rule mining finds interesting associations and correlation relationships among
large sets of data items (e.g. market basket analysis).
Some of the rules are considered sensitive, so a sanitized data set is generated (rule hiding):
Heuristic distortion approaches
Heuristic blocking approaches
Probabilistic distortion approaches
Reconstruction-based approaches:
Hybrid partial hiding (HPH)
Inverse frequent set mining (IFM)
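Association rules are scored by support and confidence, and heuristic distortion hides a sensitive rule by deleting items from supporting transactions until the rule drops below a threshold. A toy basket-analysis sketch (the transactions are invented):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk", "beer"},
    {"bread", "milk"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(lhs, rhs, txns):
    """Confidence of the rule lhs -> rhs."""
    return support(lhs | rhs, txns) / support(lhs, txns)

# Sensitive rule {milk} -> {bread}: confidence 3/4 before sanitization.
print(confidence({"milk"}, {"bread"}, transactions))  # 0.75

# Heuristic distortion (rule hiding): delete 'bread' from one supporting
# transaction so the rule falls below, say, a 0.7 confidence threshold.
transactions[0].discard("bread")
print(confidence({"milk"}, {"bread"}, transactions))  # 0.5
```

The distortion hides the sensitive rule at a utility cost: non-sensitive rules involving bread are also weakened, which is the same privacy-utility tension seen elsewhere in the deck.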
34.
35.
36. PRIVACY-PRESERVING
CLASSIFICATION Privacy-Preserving Classification
Classification is a form of data analysis that extracts models describing important data
classes.
Data classification can be seen as a two-step process:
Step 1: the learning step, in which a classification algorithm is employed to build a classifier
(classification model).
Step 2: the classifier is used for classification.
Classification models:
Decision Tree
Naïve Bayesian Classification
Support Vector Machine
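The two-step process maps directly onto one of the listed models, naïve Bayesian classification: step 1 estimates class priors and per-feature likelihoods from labeled samples; step 2 scores an unseen sample against each class. A self-contained toy sketch (the weather data is invented):

```python
from collections import Counter, defaultdict

# Step 1 (learning): estimate class priors and per-feature value counts.
def train_naive_bayes(samples):
    priors = Counter(label for _, label in samples)
    likelihoods = defaultdict(Counter)
    for features, label in samples:
        for i, value in enumerate(features):
            likelihoods[(label, i)][value] += 1
    return priors, likelihoods

# Step 2 (classification): pick the label maximizing prior * likelihoods.
def classify(features, priors, likelihoods):
    total = sum(priors.values())
    def score(label):
        p = priors[label] / total
        for i, value in enumerate(features):
            counts = likelihoods[(label, i)]
            p *= (counts[value] + 1) / (sum(counts.values()) + 2)  # Laplace smoothing
        return p
    return max(priors, key=score)

data = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"),
        (("sunny", "mild"), "no"), (("rainy", "hot"), "yes")]
priors, likelihoods = train_naive_bayes(data)
print(classify(("rainy", "hot"), priors, likelihoods))  # 'yes'
```

In a privacy-preserving setting, the same two steps run on randomized or reconstructed data, so the counts in step 1 are estimates rather than exact tallies.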
38. The Data Miner can modify the original data via randomization, blocking, or
reconstruction. The modification often has a negative effect on the utility of the
data.
The Data Miner needs to strike a balance between privacy and utility. The implications
of privacy and utility vary with the characteristics of the data and the purpose of the
mining task.
39. The user who makes decisions based on the data
mining results in order to achieve certain goals.
40. Data flow: Data Provider → Data Collector (Database) → Data Miner → Extracted Info. → Information Transmitter → Decision Maker
41. The privacy concerns of the Decision Maker are:
How to prevent unwanted disclosure of sensitive mining result
How to evaluate the credibility of the received mining result.
42. 1st Issue:
Legal measures
Make a contract with the Data Miner that forbids the miner from disclosing the mining results to a
third party.
2nd Issue:
The Decision Maker can utilize methodologies from data provenance, credibility
analysis of web information, or other related research fields.
43. DATA PROVENANCE Data Provenance:
The information that helps determine the derivation history of the data, starting from its original sources.
Provenance, which describes where the data came from and how the data evolved over time, can
help people evaluate the credibility of the data.
Provenance contains two kinds of information:
The ancestral data from which the current data evolved.
The transformations applied to the ancestral data that helped produce the current data.
However, in most cases the provenance of data mining results is not available.
The major approach to presenting provenance information is to add annotations to the data.
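The annotation approach can be sketched as data values carrying a provenance record with exactly the two kinds of information listed above: the ancestral source and the transformations applied. All names below (`sensor_17/raw_feed`, the transformation labels) are illustrative:

```python
# Each value carries an annotation recording its ancestral source and the
# transformations applied to it (all names here are invented for illustration).
record = {
    "value": 72.5,
    "provenance": {
        "source": "sensor_17/raw_feed",                       # ancestral data
        "transformations": ["deduplicate", "celsius_to_fahrenheit"],  # how it evolved
    },
}

def describe(rec):
    """Render the derivation history a Decision Maker would inspect."""
    p = rec["provenance"]
    steps = " -> ".join(p["transformations"])
    return f"{rec['value']} came from {p['source']} via {steps}"

print(describe(record))
```

A Decision Maker receiving such annotated results can trace each figure back to its source and judge its credibility, which is precisely what is missing when mining results arrive without provenance.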
44. WEB INFORMATION CREDIBILITY
Web Information Credibility
Users can differentiate false information from the truth based on:
Authority: the real author of false information is usually unclear
Accuracy: false information does not contain accurate data
Objectivity: false information is often prejudicial
Currency: for false information, the data about its source, and the time and place of its origin, is
incomplete, out of date, or missing
Coverage: false information usually contains no effective links to other online information
45.
46. Game theory provides a formal approach to modeling situations where a group of
agents must choose optimal actions considering the mutual effects of the other
agents' decisions.
The essential elements of a game are: players, actions, payoffs, and information.
Players have actions that they can perform at designated times in the game; as a
result of the performed actions, players receive payoffs.
47. PRIVATE DATA COLLECTION AND PUBLICATION
In this data collection game, the level of privacy protection has a significant influence on
each player's actions and payoffs.
PRIVACY-PRESERVING DISTRIBUTED DATA MINING
SMC-based privacy-preserving distributed data mining
Recommender systems
Linear regression as a non-cooperative game
DATA ANONYMIZATION
48. Game Model:
Define the elements of the game, namely the players, the actions, and the payoffs.
Determine the type of the game: static or dynamic, complete information or incomplete
information.
Solve the game to find its equilibria.
Analyze the equilibria to obtain implications for practice.
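The four modeling steps above can be walked through on a tiny static game of complete information. The payoffs below are invented to illustrate a "share data vs. withhold" privacy game; the solver finds pure-strategy Nash equilibria by checking that neither player gains from a unilateral deviation:

```python
# Step 1 - elements: two players, actions, and payoff matrices.
# Rows index player 1's action, columns index player 2's action.
actions = ["share", "withhold"]
payoff1 = [[3, 0], [2, 1]]   # player 1's payoff for each (a1, a2) - illustrative
payoff2 = [[3, 2], [0, 1]]   # player 2's payoff - illustrative

# Step 2 - type: static, complete information (both matrices are common knowledge).

# Step 3 - solve: a cell is a pure Nash equilibrium if neither player can
# improve their own payoff by deviating alone.
def pure_nash_equilibria(p1, p2):
    eq = []
    for i in range(2):
        for j in range(2):
            best1 = p1[i][j] >= max(p1[k][j] for k in range(2))
            best2 = p2[i][j] >= max(p2[i][k] for k in range(2))
            if best1 and best2:
                eq.append((actions[i], actions[j]))
    return eq

print(pure_nash_equilibria(payoff1, payoff2))
```

Step 4, the analysis: this toy game has two equilibria, (share, share) and (withhold, withhold), so cooperation in data sharing is sustainable but not guaranteed; which equilibrium arises depends on the players' expectations, which is the kind of practical implication the modeling process is meant to surface.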
49.
50. The Data Collector wants Data Providers to participate in the data mining
activity, i.e. to hand over their private data, but Data Providers may choose to
opt out because of privacy concerns. In order to get useful data mining results,
the Data Collector needs to design mechanisms that encourage Data Providers to
opt in.
Mechanisms for truthful data sharing:
A mechanism requires agents to report their preferences over the outcomes.
Privacy auctions
51.
52. Laws and regulations
USA: Privacy Act of 1974
European Commission: General Data Protection Regulation (proposed 2012)
Industry conventions:
Agreements between organizations on how to collect, analyze, and store personal data
should help create a privacy-safe environment.
Enhance education to increase awareness of information security.
53.
54. Personalized Privacy Preservation
Developing practical personalized anonymization methods.
Introducing personalized privacy into other types of PPDP/PPDM.
Data Customization
A concept called "Reverse Data Management" (RDM), similar to inverse data mining, has been
introduced. RDM covers many data problems: inversion mapping, provenance, data generation,
view updates, constraint-based repair, etc.
(We may consider RDM to be a family of data customization methods.)
Provenance for Data Mining
New techniques and mechanisms that can support provenance in the data mining context
should receive more attention.
55.
56. Each user role has its own privacy concerns and its own approaches for preserving
privacy while maintaining data utility.
57.
58. Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan and Yong Ren,
"Information Security in Big Data: Privacy and Data Mining",
IEEE Access, 2014