CS 590M Fall 2001: Security
Issues in Data Mining
Chris Clifton
Tuesdays and Thursdays, 9-10:15
Heavilon Hall 123
Course Goals:
Knowledge
At the end of this course, you will:
• Have a basic understanding of the
technology involved in Data Mining
• Know how data mining impacts
information security
• Understand leading-edge research on
data mining and security
Course Goals:
Skills
At the end of this course, you will:
• Be able to understand new technology
through reading the research literature
• Have given conference-style
presentations on difficult research topics
• Have written journal-style critical
reviews of research papers
Course Topics
• Data Mining (as necessary)
– What is it?
– How does it work?
• Research in the use of Data Mining to
improve security
• Research in the security problems posed
by the availability of Data Mining
technology
Process
Initial phase of course: Data Mining
background
• Lectures, handouts, suggested reading
• Length/material to be determined by
what you already know
Expect a quiz at the end of this phase
Process
• Phase 2: Student Presentations
• Two paper presentations per class
– Student presenting will read paper and prepare
presentation materials
You must prepare materials yourself – no fair using
material obtained from the authors
• Any week you do not present, you will do a
journal quality review of one of the papers
being presented that week
You may request papers to review/present; I will make the
final assignment
Evaluation/Grading
Evaluation will be a subjective process; however,
it will be based primarily on your
understanding of the material as evidenced in:
• Your presentations
• Your written reviews
• Your contribution to classroom discussions
• Post phase-1 quiz
Policy on Academic Integrity
• Basic idea: You are learning to do Original
Research
– Work you do for the class should be original
(yours)
– Don’t borrow authors’ slides for presentations, even
if they are available.
Copying images/graphs okay where necessary
• More details on course web site:
http://www.cs.purdue.edu/homes/clifton/cs590m
• When in doubt, ASK!
What is Data Mining?
Searching through large amounts of data for
correlations, sequences, and trends.
Current “driving applications” in sales (targeted
marketing, inventory) and finance (stock
picking)
Sales data
Sequence
Classify
Inference
Cluster
“70% of customers who purchase comforters later purchase curtains”
Select information to be mined
Choose mining tool (based on type of results wanted)
Evaluate results
adapted from:
U. Fayyad, et al. (1995), “From Knowledge Discovery to Data
Mining: An Overview,” Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Knowledge Discovery in Databases: Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
See also: http://www.crisp-dm.org
What is Data Mining?
History
• Knowledge Discovery in Databases workshops
started ‘89
– Now a conference under the auspices of ACM
SIGKDD
– IEEE conference series starting 2001
• Key founders / technology contributors:
– Usama Fayyad, JPL (then Microsoft, now has his
own company, Digimine)
– Gregory Piatetsky-Shapiro (then GTE, now his own
data mining consulting company, Knowledge
Stream Partners)
– Rakesh Agrawal (IBM Research)
What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations
Clustering
• Find groups of similar data
items
• Statistical techniques require
definition of “distance” (e.g.
between travel profiles),
conceptual techniques use
background concepts and
logical descriptions
Uses:
• Demographic analysis
Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering
“Group people with
similar travel
profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
Clusters
Top Stories clustering
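As a minimal sketch of the statistical approach on this slide, the following groups people whose profiles lie within a chosen distance of one another. The feature vectors (trips per year, average trip length in days), the threshold, and the function name are hypothetical and chosen only for illustration; the slide does not say how a travel profile is measured.

```python
import math

# Hypothetical travel profiles: (trips per year, average trip length in days).
profiles = {
    "George": (20, 2), "Patricia": (22, 3),
    "Jeff": (5, 10), "Evelyn": (6, 9), "Chris": (4, 11),
    "Rob": (40, 1),
}

def threshold_clusters(points, max_dist):
    """Single-linkage grouping: items within max_dist of a member join its cluster."""
    clusters = []
    for name, vec in points.items():
        # Find every existing cluster that has a member near this item.
        near = [c for c in clusters
                if any(math.dist(vec, points[m]) <= max_dist for m in c)]
        merged = {name}
        for c in near:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters

clusters = threshold_clusters(profiles, max_dist=5.0)
```

The result depends entirely on the definition of "distance" and on the threshold, which is the point the slide makes about statistical techniques.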
Classification
• Find ways to separate data
items into pre-defined groups
– We know X and Y belong
together, find other things in
same group
• Requires “training data”:
Data items where group is
known
Uses:
• Profiling
Technologies:
• Generate decision trees
(results are human
understandable)
• Neural Nets
“Route documents to
most likely interested
parties”
– English or non-English?
– Domestic or Foreign?
[Figure: training data → tool produces classifier → groups]
Association Rules
• Identify dependencies in
the data:
– X makes Y likely
• Indicate significance of
each dependency
• Bayesian methods
Uses:
• Targeted marketing
Technologies:
• AIS, SETM, Hugin,
TETRAD II
“Find groups of items
commonly purchased
together”
– People who purchase fish
are extraordinarily likely
to purchase wine
– People who purchase
turkey are
extraordinarily likely to
purchase cranberries
Date  Time   Register  Fish  Turkey  Cranberries  Wine  …
12/6  13:15  2         N     Y       Y            Y     …
12/6  13:16  3         Y     N       N            Y     …
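The "significance" of a dependency is commonly expressed as support and confidence. The sketch below computes both for rules of the form "X makes Y likely". Only the first two baskets come from the table on this slide; the other three are hypothetical rows added so the numbers are not trivial, and the function names are illustrative.

```python
# Each basket lists the items bought in one transaction. The first two rows
# come from the slide's table; the remaining rows are hypothetical.
baskets = [
    {"turkey", "cranberries", "wine"},   # 12/6 13:15, register 2
    {"fish", "wine"},                    # 12/6 13:16, register 3
    {"fish", "wine"},                    # hypothetical
    {"turkey", "cranberries"},           # hypothetical
    {"fish"},                            # hypothetical
]

def support(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Estimated probability that rhs is in a basket, given that lhs is."""
    return support(lhs | rhs) / support(lhs)

fish_wine = confidence({"fish"}, {"wine"})
turkey_cranberries = confidence({"turkey"}, {"cranberries"})
```

A mining tool would search over all candidate item sets rather than test rules named in advance; this sketch only shows how one candidate rule is scored.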
Sequential Associations
• Find event sequences that are
unusually likely
• Requires “training” event list,
known “interesting” events
• Must be robust in the face of
additional “noise” events
Uses:
• Failure analysis and
prediction
Technologies:
• Dynamic programming
(Dynamic time warping)
• “Custom” algorithms
“Find common sequences
of warnings/faults
within 10 minute
periods”
– Warn 2 on Switch C
preceded by Fault 21 on
Switch B
– Fault 17 on any switch
preceded by Warn 2 on
any switch
Time   Switch  Event
21:10  B       Fault 21
21:11  A       Warn 2
21:13  C       Warn 2
21:20  A       Fault 17
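A brute-force check of one "preceded by" pattern over the event log on this slide can be sketched as follows. This is not the dynamic-programming approach listed under Technologies; it is a simple pairwise scan, and the function name and the same-day assumption (no wrap past midnight) are mine.

```python
from datetime import datetime, timedelta

# Event log from the slide's table: (time, switch, event).
log = [
    ("21:10", "B", "Fault21"),
    ("21:11", "A", "Warn2"),
    ("21:13", "C", "Warn2"),
    ("21:20", "A", "Fault17"),
]

def preceded_by(log, later, earlier, window_minutes=10):
    """Return ((switch, earlier), (switch, later)) pairs where `earlier`
    happens before `later` and at most window_minutes apart."""
    window = timedelta(minutes=window_minutes)
    parsed = [(datetime.strptime(t, "%H:%M"), sw, ev) for t, sw, ev in log]
    pairs = []
    for t1, sw1, ev1 in parsed:
        for t2, sw2, ev2 in parsed:
            in_window = timedelta(0) < t2 - t1 <= window
            if ev1 == earlier and ev2 == later and in_window:
                pairs.append(((sw1, ev1), (sw2, ev2)))
    return pairs

warn_then_fault = preceded_by(log, later="Fault17", earlier="Warn2")
```

Unrelated events that fall between the two matched events are simply ignored, which is one simple way to stay robust to "noise" events.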
Deviation Detection
• Find unexpected values,
outliers
• Uses:
• Failure analysis
• Anomaly discovery for
analysis
• Technologies:
• clustering/classification
methods
• Statistical techniques
• visualization
• “Find unusual
occurrences in IBM
stock prices”
Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market closed
58/07/07  370.00  314.50  .022561
Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     Not traded       1 time
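One simple statistical technique for finding outliers is to flag values that lie far from the median, measured in median absolute deviations (MAD). The sketch below is one such rule, not the method used to produce the slide's results. The first three closing prices are taken from the table on this slide; the last three, the cutoff k, and the function name are hypothetical.

```python
import statistics

def deviations(values, k=5.0):
    """Indices of values more than k median-absolute-deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [i for i, v in enumerate(values) if abs(v - med) > k * mad]

# First three closes are from the slide; the rest are hypothetical, with one
# value halved to imitate the effect of an unadjusted stock split.
closes = [369.50, 369.25, 370.00, 369.75, 185.00, 370.25]
outliers = deviations(closes)
```

A flagged value is only a candidate for analysis: as the event table suggests, many deviations turn out to have mundane explanations such as a closed market or a split.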
Large-scale Endeavors
Clustering Classification Association Sequence Deviation
SAS: Decision Trees
SPSS: √ √
Oracle (Darwin): √ ANN
IBM: Time Series, Decision Trees, √ √ √
DBMiner (Simon Fraser): √ √
Products
Research
War Stories:
Warehouse Product Allocation
The second project, identified as "Warehouse Product Allocation," was also initiated in
late 1995 by RS Components' IS and Operations Departments. In addition to their
warehouse in Corby, the company was in the process of opening another 500,000-
square-foot site in the Midlands region of the U.K. To efficiently ship product from
these two locations, it was essential that RS Components know in advance what
products should be allocated to which warehouse. For this project, the team used IBM
Intelligent Miner and additional optimization logic to split RS Components' product
sets between these two sites so that the number of partial orders and split shipments
would be minimized.
Parker says that the Warehouse Product Allocation project has directly contributed to a
significant savings in the number of parcels shipped, and therefore in shipping costs. In
addition, he says that the Opportunity Selling project not only increased the level of
service, but also made it easier to provide new subsidiaries with the value-added
knowledge that enables them to quickly ramp-up sales.
"By using the data mining tools and some additional optimization logic, IBM helped us
produce a solution which heavily outperformed the best solution that we could have
arrived at by conventional techniques," said Parker. "The IBM group tracked historical
order data and conclusively demonstrated that data mining produced increased revenue
that will give us a return on investment 10 times greater than the amount we spent on
the first project."
http://direct.boulder.ibm.com/dss/customer/rscomp.html
War Stories:
Inventory Forecasting
American Entertainment Company
Forecasting demand for inventory is a central problem for any
distributor. Ship too much and the distributor incurs the cost of
restocking unsold products; ship too little and sales opportunities
are lost.
IBM Data Mining Solutions assisted this customer by providing
an inventory forecasting model, using segmentation and predictive
modeling. This new model has proven to be considerably more
accurate than any prior forecasting model.
More war stories (many humorous) starting with slide 21 of:
http://robotics.stanford.edu/~ronnyk/chasm.pdf
Data Mining as a Threat to
Security
• Data mining gives us “facts” that are not obvious to human
analysts of the data
• Enables inspection and analysis of huge amounts of data
• Possible threats:
– Predict information about classified work from correlation with
unclassified work (e.g. budgets, staffing)
– Detect “hidden” information based on “conspicuous” lack of
information
– Mining “Open Source” data to determine predictive events (e.g.,
Pizza deliveries to the Pentagon)
• It isn’t the data we want to protect, but correlations among data
items
• Published in Chris Clifton and Don Marks, “Security and Privacy
Implications of Data Mining”, Proceedings of the 1996 ACM
SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery
Background – Inference
Problem
• MLS database – “high” and “low” data
– Problem if we can infer “high” data from “low” data
– Progress has been made (Morgenstern, Marks, ...)
• Problem: What if the inference isn’t “strict”?
– “Default inference” problems – Birds fly, an Ostrich is a bird,
so Ostriches fly – not true, so we can’t infer birds fly (and we
don’t prevent such an inference)
– But “birds fly” is useful, even if not strictly true
– Only limited work in detecting/preventing “imprecise”
inferences (Rath, Jones, Hale, Shenoi)
• Data mining specializes in finding imprecise inferences
Data mining – Inference from
Large Data
• Data mining gives us probabilistic “inferences”:
– 25% of group X is Y, but only 2% of population is Y.
• Key to data mining: Don’t need to pre-specify X and
Y.
– Define total population
– Define parameters that can be used to create group X
– Define parameters that can be used to create group Y
– Note the combinatorial explosion in the number of possible
groups: if three parameters are used to create group X, there are
n^3 possible groups
• Data mining tool determines groups X and Y where
“inference” is unusually likely
• Existing inference prevention based on guaranteed
truth of inference, but is this good enough?
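The two numbers on this slide can be made concrete with a short sketch. The ratio of the in-group rate to the population rate (often called lift) measures how "unusually likely" an inference is. For the group count, the code assumes that n means the number of distinct values each parameter can take, which the slide does not state; the function names are illustrative.

```python
def lift(p_y_given_x, p_y):
    """How many times more likely Y is inside group X than in the population."""
    return p_y_given_x / p_y

def candidate_groups(values_per_parameter, parameters_used):
    """Number of groups obtained by fixing one value for each parameter used
    (assumes every parameter has the same number of values)."""
    return values_per_parameter ** parameters_used

slide_lift = lift(0.25, 0.02)          # 25% of group X vs 2% of population
groups = candidate_groups(values_per_parameter=10, parameters_used=3)
```

With 10 values per parameter, three parameters already give 1,000 candidate groups for X alone, which is why the search is left to a tool rather than specified by hand.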
Motivating Example:
Mortgage Application
• Idea: Mortgage company buys market research data to develop
profile of people likely to default
– Marketing data available
– Mortgage companies have history of current client defaults
• Problem: If 20% of profile defaults, it may make business sense
to reject all – but is it fair to the 80% that wouldn’t?
• Information Provider doesn’t want this done (potential public
backlash, e.g. Lotus)
Name Golfs Skis Mail-order Car ... Default
Dennis Y N $25 BMW N
Chris N Y $815 Ford Y
Denise N Y $790 Ford N
...
Eric N Y $830 Ford ?
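To see how such a profile could be applied, here is a minimal nearest-neighbour sketch that fills in Eric's "?" from the rows visible in the table. It uses only those three rows (the elided rows are not guessed), and the distance function and its scaling are hypothetical. With so little data the prediction carries no real weight; the point is how easily marketing attributes become a default predictor.

```python
# Rows visible in the slide's table:
# (name, golfs, skis, mail_order_usd, car, default).
known = [
    ("Dennis", "Y", "N", 25, "BMW", "N"),
    ("Chris", "N", "Y", 815, "Ford", "Y"),
    ("Denise", "N", "Y", 790, "Ford", "N"),
]
eric = ("Eric", "N", "Y", 830, "Ford")

def distance(a, b):
    """Count mismatched categorical fields, plus a scaled mail-order difference."""
    mismatches = sum(a[i] != b[i] for i in (1, 2, 4))
    return mismatches + abs(a[3] - b[3]) / 100  # hypothetical scaling

def predict_default(applicant, training):
    """Copy the Default label of the most similar known client (1-nearest-neighbour)."""
    nearest = min(training, key=lambda row: distance(applicant, row))
    return nearest[0], nearest[5]

neighbour, predicted = predict_default(eric, known)
```

Eric is matched to Chris and labelled a likely defaulter, even though Denise, who is almost as similar, did not default. That is the fairness problem raised on this slide.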
Goal – Technical Solution
We want to protect the information
provider.
• Prevent others from finding any meaningful
correlations
– Must still provide access to individual data
elements (e.g. phone book)
• Prevent specific correlations (or classes of
correlations)
– Preserve ability to mine in desired fashion (e.g.
targeted marketing, inventory prediction)
What Can We Do?
• Prevent useful results from mining
– Algorithms only find “facts” with sufficient confidence and
support
– Limit data access to ensure low confidence and support
– Extra data (“cover stories”) to give “false” results with high
confidence and support
• Exploit weaknesses in mining algorithms
– Performance “blowups” under certain conditions
– Alter data to prevent exact matches
• Example: Extra digit at end of telephone number
• Remove information providing unwanted correlations
– Strip identifiers
– Group identifiers (e.g. census blocks, not addresses)
• “You mine the data, I’ll send the mailings”
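The telephone-number example can be sketched as follows: appending one extra digit leaves each record usable for lookup by a person but stops a naive exact-match join against another data set. All records, phone numbers, and names here are hypothetical, and the sketch only defeats exact matching; a miner who knows about the extra digit could strip it.

```python
import random

def append_extra_digit(phone, rng):
    """Append one random digit so the released value no longer matches exactly."""
    return phone + str(rng.randrange(10))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical records held by two different parties, keyed by phone number.
released = {"555-0101": "skis", "555-0102": "golfs"}
other_source = {"555-0101": "defaulted", "555-0103": "paid"}

altered = {append_extra_digit(p, rng): v for p, v in released.items()}

# An exact-match join links records only when the keys are identical.
matches_before = released.keys() & other_source.keys()
matches_after = altered.keys() & other_source.keys()
```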
What We Have Learned So Far:
Qualitative Results
• Avoid unnecessary groupings of data
– Ranges of instances can give information
• Department encodes center, division
• Employee number encodes hire date
– Knowing the meaning of a grouping is not necessary; the
existence of a meaningful grouping allows us to mine
– Moral: Assign “id numbers” randomly (still serve to identify)
• Providing only samples of data can lower confidence
in mining results
– Key: Provable limits for validity of mining results given a
sample
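The moral about ID numbers can be sketched in a few lines: draw identifiers at random so that their order says nothing about hire date, department, or any other grouping. The employee names, the ID range, and the function name are hypothetical.

```python
import random

def assign_random_ids(names, rng, id_space=10**6):
    """Give each name a distinct ID drawn without regard to list order."""
    ids = rng.sample(range(id_space), k=len(names))
    return dict(zip(names, ids))

# Hypothetical employees, listed in hire order.
hire_order = ["Avery", "Blake", "Casey", "Devon"]
ids = assign_random_ids(hire_order, random.Random(42))
```

Because the IDs are sampled without replacement they still identify each person uniquely, but a range of ID values no longer corresponds to a meaningful group.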
Data Mining to Handle
Security Problems
• Data mining tools can be used to examine audit data
and flag abnormal behavior
• Some work in Intrusion detection
– e.g., Neural networks to detect abnormal patterns
• SRI work on IDES
• Harris Corporation work
• Tools are being examined as a means to determine
abnormal patterns and also to determine the type of
problem
– Classification techniques
• Can draw heavily on Fraud detection
– Credit cards, calling cards, etc.
– Work by SRA Corporation
Data Mining to Improve
Security
• Intrusion Detection
– Relies on “training data”
– We’ll go into detail on this area (lots of new work)
• User profiling (what is normal behavior for a
user)
– Lots of work in the telecommunications industry
(caller fraud)
– Work is happening in computer security community
Various work in “command sequence” profiles

AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
gestioneergodomus
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
aqil azizi
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 

Recently uploaded (20)

Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
DfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributionsDfMAy 2024 - key insights and contributions
DfMAy 2024 - key insights and contributions
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 

Lecture1

  • 1. CS 590M Fall 2001: Security Issues in Data Mining Chris Clifton Tuesdays and Thursdays, 9-10:15 Heavilon Hall 123
  • 2. Course Goals: Knowledge At the end of this course, you will: • Have a basic understanding of the technology involved in Data Mining • Know how data mining impacts information security • Understand leading-edge research on data mining and security
  • 3. Course Goals: Skills At the end of this course, you will: • Be able to understand new technology through reading the research literature • Have given conference-style presentations on difficult research topics • Have written journal-style critical reviews of research papers
  • 4. Course Topics • Data Mining (as necessary) – What is it? – How does it work? • Research in the use of Data Mining to improve security • Research in the security problems posed by the availability of Data Mining technology
  • 5. Process Initial phase of course: Data Mining background • Lectures, handouts, suggested reading • Length/material to be determined by what you already know Expect a quiz at the end of this phase
  • 6. Process • Phase 2: Student Presentations • Two paper presentations per class – Student presenting will read paper and prepare presentation materials You must prepare materials yourself – no fair using material obtained from the authors • Any week you do not present, you will do a journal-quality review of one of the papers being presented that week You may request papers to review/present; I will make the final assignment
  • 7. Evaluation/Grading Evaluation will be a subjective process, however it will be based primarily on your understanding of the material as evidenced in: • Your presentations • Your written reviews • Your contribution to classroom discussions • Post phase-1 quiz
  • 8. Policy on Academic Integrity • Basic idea: You are learning to do Original Research – Work you do for the class should be original (yours) – Don’t borrow authors’ slides for presentations, even if they are available. Copying images/graphs okay where necessary • More details on course web site: http://www.cs.purdue.edu/homes/clifton/cs590m • When in doubt, ASK!
  • 9. What is Data Mining? Searching through large amounts of data for correlations, sequences, and trends. Current “driving applications” in sales (targeted marketing, inventory) and finance (stock picking). [Diagram: Sales data; select information to be mined; choose mining tool (Sequence, Classify, Inference, Cluster) based on type of results wanted; evaluate results, e.g. “70% of customers who purchase comforters later purchase curtains”]
  • 10. Knowledge Discovery in Databases: Process. [Diagram: Data, Selection, Target Data, Preprocessing, Preprocessed Data, Data Mining, Patterns, Interpretation/Evaluation, Knowledge] Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press. See also: http://www.crisp-dm.org
  • 11. What is Data Mining? History • Knowledge Discovery in Databases workshops started ‘89 – Now a conference under the auspices of ACM SIGKDD – IEEE conference series starting 2001 • Key founders / technology contributors: – Usama Fayyad, JPL (then Microsoft, now has his own company, Digimine) – Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners) – Rakesh Agrawal (IBM Research)
  • 12. What Can Data Mining Do? • Cluster • Classify – Categorical, Regression • Summarize – Summary statistics, Summary rules • Link Analysis / Model Dependencies – Association rules • Sequence analysis – Time-series analysis, Sequential associations • Detect Deviations
  • 13. Clustering • Find groups of similar data items • Statistical techniques require definition of “distance” (e.g. between travel profiles), conceptual techniques use background concepts and logical descriptions Uses: • Demographic analysis Technologies: • Self-Organizing Maps • Probability Densities • Conceptual Clustering “Group people with similar travel profiles” – George, Patricia – Jeff, Evelyn, Chris – Rob Clusters Top Stories clustering
  • 14. Classification • Find ways to separate data items into pre-defined groups – We know X and Y belong together, find other things in same group • Requires “training data”: Data items where group is known Uses: • Profiling Technologies: • Generate decision trees (results are human understandable) • Neural Nets “Route documents to most likely interested parties” – English or non- english? – Domestic or Foreign? Groups Training Data tool produces classifier
  • 15. Association Rules • Identify dependencies in the data: – X makes Y likely • Indicate significance of each dependency • Bayesian methods Uses: • Targeted marketing Technologies: • AIS, SETM, Hugin, TETRAD II “Find groups of items commonly purchased together” – People who purchase fish are extraordinarily likely to purchase wine – People who purchase Turkey are extraordinarily likely to purchase cranberries
      Date  Time   Register  Fish  Turkey  Cranberries  Wine  …
      12/6  13:15  2         N     Y       Y            Y     …
      12/6  13:16  3         Y     N       N            Y     …
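To make "X makes Y likely" and "significance of each dependency" on the Association Rules slide concrete, here is a minimal sketch of the two usual measures, support and confidence. It is not any of the tools the slide names (AIS, SETM, Hugin, TETRAD II). The function names and all baskets beyond the first two are made up for illustration; the first two baskets mirror the two rows shown on the slide.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical baskets; only the first two come from the slide's table.
baskets = [
    {"turkey", "cranberries", "wine"},  # 12/6 13:15, register 2
    {"fish", "wine"},                   # 12/6 13:16, register 3
    {"fish", "wine", "cranberries"},
    {"turkey", "cranberries"},
    {"fish"},
]

fish_wine_support = support(baskets, {"fish", "wine"})
fish_wine_confidence = confidence(baskets, {"fish"}, {"wine"})
```

On this toy data the rule "fish implies wine" holds in two of the five baskets (support) and in two of the three baskets that contain fish (confidence).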
  • 16. Sequential Associations • Find event sequences that are unusually likely • Requires “training” event list, known “interesting” events • Must be robust in the face of additional “noise” events Uses: • Failure analysis and prediction Technologies: • Dynamic programming (Dynamic time warping) • “Custom” algorithms “Find common sequences of warnings/faults within 10 minute periods” – Warn 2 on Switch C preceded by Fault 21 on Switch B – Fault 17 on any switch preceded by Warn 2 on any switch
      Time   Switch  Event
      21:10  B       Fault 21
      21:11  A       Warn 2
      21:13  C       Warn 2
      21:20  A       Fault 17
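The query on the Sequential Associations slide ("common sequences of warnings/faults within 10 minute periods") can be illustrated with a naive pair count. This is a sketch only: it is not dynamic time warping or any "custom" algorithm the slide refers to, it ignores which switch raised the event, and it assumes all timestamps fall on the same day. The function names are hypothetical; the event log is the one shown on the slide.

```python
from collections import Counter


def minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def pairs_within(events, window=10):
    """Count (earlier event, later event) pairs at most `window` minutes apart."""
    ordered = sorted(events, key=lambda e: minutes(e[0]))
    counts = Counter()
    for i, (t1, _switch1, name1) in enumerate(ordered):
        for t2, _switch2, name2 in ordered[i + 1:]:
            if minutes(t2) - minutes(t1) > window:
                break  # later events are even further away
            counts[(name1, name2)] += 1
    return counts


# Event log from the slide: (time, switch, event).
log = [
    ("21:10", "B", "Fault21"),
    ("21:11", "A", "Warn2"),
    ("21:13", "C", "Warn2"),
    ("21:20", "A", "Fault17"),
]

pair_counts = pairs_within(log)
```

A real tool would compare these counts against what chance alone would produce, which is where the "training" event list and robustness to noise events come in.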
  • 17. Deviation Detection • Find unexpected values, outliers • Uses: • Failure analysis • Anomaly discovery for analysis • Technologies: • clustering/classification methods • Statistical techniques • visualization • “Find unusual occurrences in IBM stock prices”
      Date      Close    Volume  Spread
      58/07/02  369.50   314.08  .022561
      58/07/03  369.25   313.87  .022561
      58/07/04  Market closed
      58/07/07  370.00   314.50  .022561

      Sample date  Event              Occurrences
      58/07/04     Market closed      317 times
      59/01/06     2.5% dividend      2 times
      59/04/04     50% stock split    7 times
      73/10/09     not traded         1 time
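As one example of the "statistical techniques" listed on the Deviation Detection slide, the sketch below flags values that lie more than a chosen number of sample standard deviations from the mean. The threshold of 2.0, the function name, and the price series are assumptions for illustration; the series is not the IBM data from the slide.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=2.0):
    """Return (index, value) pairs more than `threshold` sample
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]


# Hypothetical closing prices with one obvious jump at the end.
closes = [369.50, 369.25, 370.00, 369.75, 370.25, 369.50, 410.00]

outliers = z_score_outliers(closes)
```

With very short series a single extreme value inflates the standard deviation, so the z-score of an outlier is bounded; robust measures such as the median absolute deviation are a common alternative.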
  • 18. Large-scale Endeavors Clustering Classification Association Sequence Deviation SAS Decision Trees SPSS √ √ Oracle (Darwin) √ ANN IBM Time Series Decision Trees √ √ √ DBMiner (Simon Fraser) √ √ Products Research
  • 19. War Stories: Warehouse Product Allocation The second project, identified as "Warehouse Product Allocation," was also initiated in late 1995 by RS Components' IS and Operations Departments. In addition to their warehouse in Corby, the company was in the process of opening another 500,000-square-foot site in the Midlands region of the U.K. To efficiently ship product from these two locations, it was essential that RS Components know in advance what products should be allocated to which warehouse. For this project, the team used IBM Intelligent Miner and additional optimization logic to split RS Components' product sets between these two sites so that the number of partial orders and split shipments would be minimized. Parker says that the Warehouse Product Allocation project has directly contributed to a significant savings in the number of parcels shipped, and therefore in shipping costs. In addition, he says that the Opportunity Selling project not only increased the level of service, but also made it easier to provide new subsidiaries with the value-added knowledge that enables them to quickly ramp-up sales. "By using the data mining tools and some additional optimization logic, IBM helped us produce a solution which heavily outperformed the best solution that we could have arrived at by conventional techniques," said Parker. "The IBM group tracked historical order data and conclusively demonstrated that data mining produced increased revenue that will give us a return on investment 10 times greater than the amount we spent on the first project." http://direct.boulder.ibm.com/dss/customer/rscomp.html
  • 20. War Stories: Inventory Forecasting American Entertainment Company Forecasting demand for inventory is a central problem for any distributor. Ship too much and the distributor incurs the cost of restocking unsold products; ship too little and sales opportunities are lost. IBM Data Mining Solutions assisted this customer by providing an inventory forecasting model, using segmentation and predictive modeling. This new model has proven to be considerably more accurate than any prior forecasting model. More war stories (many humorous) starting with slide 21 of: http://robotics.stanford.edu/~ronnyk/chasm.pdf
  • 21. Data Mining as a Threat to Security • Data mining gives us “facts” that are not obvious to human analysts of the data • Enables inspection and analysis of huge amounts of data • Possible threats: – Predict information about classified work from correlation with unclassified work (e.g. budgets, staffing) – Detect “hidden” information based on “conspicuous” lack of information – Mining “Open Source” data to determine predictive events (e.g., Pizza deliveries to the Pentagon) • It isn’t the data we want to protect, but correlations among data items • Published in Chris Clifton and Don Marks, “Security and Privacy Implications of Data Mining”, Proceedings of the 1996 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
  • 22. Background – Inference Problem • MLS database – “high” and “low” data – Problem if we can infer “high” data from “low” data – Progress has been made (Morgenstern, Marks, ...) • Problem: What if the inference isn’t “strict”? – “Default inference” problems – Birds fly, an Ostrich is a bird, so Ostriches fly – not true, so we can’t infer birds fly (and we don’t prevent such an inference) – But “birds fly” is useful, even if not strictly true – Only limited work in detecting/preventing “imprecise” inferences (Rath, Jones, Hale, Shenoi) • Data mining specializes in finding imprecise inferences
  • 23. Data mining – Inference from Large Data • Data mining gives us probabilistic “inferences”: – 25% of group X is Y, but only 2% of population is Y. • Key to data mining: Don’t need to pre-specify X and Y. – Define total population – Define parameters that can be used to create group X – Define parameters that can be used to create group Y – Note the combinatorial explosion in the number of possible groups: if three parameters used to create group X, possible n^3 groups • Data mining tool determines groups X and Y where “inference” is unusually likely • Existing inference prevention based on guaranteed truth of inference, but is this good enough?
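The arithmetic behind the slide on inference from large data can be worked through in a few lines. The ratio of the rate inside group X to the rate in the whole population is usually called lift; the slide does not use that term. Reading n as the number of values each parameter can take is an assumption, and the choice of 10 values is purely illustrative.

```python
def lift(rate_in_group, rate_in_population):
    """How many times more common Y is inside group X than overall."""
    return rate_in_group / rate_in_population


def candidate_groups(values_per_parameter, parameters):
    """Number of distinct groups when each parameter takes one of n values."""
    return values_per_parameter ** parameters


# Figures from the slide: 25% of group X is Y, 2% of the population is Y.
slide_lift = lift(0.25, 0.02)

# Assumption: 10 possible values for each of three parameters.
groups_for_x = candidate_groups(10, 3)
```

So members of X are roughly 12.5 times more likely to be Y than a random member of the population, and even a modest n yields a thousand candidate definitions of X before group Y is considered.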
  • 24. Motivating Example: Mortgage Application • Idea: Mortgage company buys market research data to develop profile of people likely to default – Marketing data available – Mortgage companies have history of current client defaults • Problem: If 20% of profile defaults, it may make business sense to reject all – but is it fair to the 80% that wouldn’t? • Information Provider doesn’t want this done (potential public backlash, e.g. Lotus)
      Name    Golfs  Skis  Mail-order  Car   ...  Default
      Dennis  Y      N     $25         BMW        N
      Chris   N      Y     $815        Ford       Y
      Denise  N      Y     $790        Ford       N
      ...
      Eric    N      Y     $830        Ford       ?
  • 25. Goal – Technical Solution We want to protect the information provider. • Prevent others from finding any meaningful correlations – Must still provide access to individual data elements (e.g. phone book) • Prevent specific correlations (or classes of correlations) – Preserve ability to mine in desired fashion (e.g. targeted marketing, inventory prediction)
  • 26. What Can We Do? • Prevent useful results from mining – Algorithms only find “facts” with sufficient confidence and support – Limit data access to ensure low confidence and support – Extra data (“cover stories”) to give “false” results with high confidence and support • Exploit weaknesses in mining algorithms – Performance “blowups” under certain conditions – Alter data to prevent exact matches • Example: Extra digit at end of telephone number • Remove information providing unwanted correlations – Strip identifiers – Group identifiers (e.g. census blocks, not addresses) • “You mine the data, I’ll send the mailings”
  • 27. What We Have Learned So Far: Qualitative Results • Avoid unnecessary groupings of data – Ranges of instances can give information • Department encodes center, division • Employee number encodes hire date – Knowing the meaning of a grouping is not necessary; the existence of a meaningful grouping allows us to mine – Moral: Assign “id numbers” randomly (still serve to identify) • Providing only samples of data can lower confidence in mining results – Key: Provable limits for validity of mining results given a sample
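The moral on the qualitative-results slide, "assign id numbers randomly", can be sketched as follows. Sequential employee numbers leak hire order, whereas numbers drawn without replacement from a large space still identify each person but carry no ordering. The function name, the size of the ID space, and the employee list are assumptions for illustration.

```python
import secrets


def assign_random_ids(records, id_space=10**6):
    """Give each record a unique ID drawn at random from range(id_space)."""
    rng = secrets.SystemRandom()  # OS-provided randomness, not seedable
    ids = rng.sample(range(id_space), len(records))  # sample() never repeats
    return dict(zip(records, ids))


# Hypothetical staff list, written in hire order.
employees = ["George", "Patricia", "Jeff", "Evelyn", "Chris", "Rob"]

employee_ids = assign_random_ids(employees)
```

The IDs differ on every run, so the mapping would need to be stored once and reused; the point is only that neither the numeric order nor ranges of the IDs reveal anything about the records.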
  • 28. Data Mining to Handle Security Problems • Data mining tools can be used to examine audit data and flag abnormal behavior • Some work in Intrusion detection – e.g., Neural networks to detect abnormal patterns • SRI work on IDES • Harris Corporation work • Tools are being examined as a means to determine abnormal patterns and also to determine the type of problem – Classification techniques • Can draw heavily on Fraud detection – Credit cards, calling cards, etc. – Work by SRA Corporation
  • 29. Data Mining to Improve Security • Intrusion Detection – Relies on “training data” – We’ll go into detail on this area (lots of new work) • User profiling (what is normal behavior for a user) – Lots of work in the telecommunications industry (caller fraud) – Work is happening in computer security community Various work in “command sequence” profiles

Editor's Notes

  1. Mine for: Selection, Aggregation, Abstraction, Visualization, Transformation/Conversion, Statistical Analysis, “Cleaning”
  2. Problem is that we may not know what may be learned from mining. Can’t “classify everything”, as some data is open source or may have large benefits to being accessible. This is the opposite of statistical queries – we are concerned about preventing generalities from specifics, rather than specifics from generalities – but conceptually similar. Not the same as induction – data mining finds “rules” that are generally true (high confidence and support), but not necessarily exact.