This document summarizes a course on security issues in data mining. The course will cover data mining techniques and how they can impact security. Students will learn about leading research at the intersection of data mining and security. They will give presentations on research papers and write reviews. Course topics include data mining techniques, how data mining can improve security, and security problems posed by data mining. The process will start with a data mining background phase followed by student presentations and reviews. Students will be evaluated based on their presentations, reviews, contributions to discussions, and a post-background quiz.
Lecture1
1. CS 590M Fall 2001: Security
Issues in Data Mining
Chris Clifton
Tuesdays and Thursdays, 9-10:15
Heavilon Hall 123
2. Course Goals:
Knowledge
At the end of this course, you will:
• Have a basic understanding of the
technology involved in Data Mining
• Know how data mining impacts
information security
• Understand leading-edge research on
data mining and security
3. Course Goals:
Skills
At the end of this course, you will:
• Be able to understand new technology
through reading the research literature
• Have given conference-style
presentations on difficult research topics
• Have written journal-style critical
reviews of research papers
4. Course Topics
• Data Mining (as necessary)
– What is it?
– How does it work?
• Research in the use of Data Mining to
improve security
• Research in the security problems posed
by the availability of Data Mining
technology
5. Process
Initial phase of course: Data Mining
background
• Lectures, handouts, suggested reading
• Length/material to be determined by
what you already know
Expect a quiz at the end of this phase
6. Process
• Phase 2: Student Presentations
• Two paper presentations per class
– Student presenting will read paper and prepare
presentation materials
You must prepare materials yourself – no fair using
material obtained from the authors
• Any week you do not present, you will do a
journal quality review of one of the papers
being presented that week
You may request papers to review/present; I will make the
final assignment
7. Evaluation/Grading
Evaluation will be a subjective process; however,
it will be based primarily on your
understanding of the material as evidenced in:
• Your presentations
• Your written reviews
• Your contribution to classroom discussions
• Post phase-1 quiz
8. Policy on Academic Integrity
• Basic idea: You are learning to do Original
Research
– Work you do for the class should be original
(yours)
– Don’t borrow authors’ slides for presentations, even
if they are available.
Copying images/graphs okay where necessary
• More details on course web site:
http://www.cs.purdue.edu/homes/clifton/cs590m
• When in doubt, ASK!
9. What is Data Mining?
Searching through large amounts of data for
correlations, sequences, and trends.
Current “driving applications” in sales (targeted
marketing, inventory) and finance (stock
picking)
Process:
• Select information to be mined (e.g. sales data)
• Choose mining tool based on type of results wanted (sequence, classify, inference, cluster)
• Evaluate results (e.g. “70% of customers who purchase comforters later purchase curtains”)
10. Knowledge Discovery in Databases: Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data
Mining: An Overview,” Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
See also: http://www.crisp-dm.org
11. What is Data Mining?
History
• Knowledge Discovery in Databases workshops
started ‘89
– Now a conference under the auspices of ACM
SIGKDD
– IEEE conference series starting 2001
• Key founders / technology contributors:
– Usama Fayyad, JPL (then Microsoft, now has his
own company, Digimine)
– Gregory Piatetsky-Shapiro (then GTE, now his own
data mining consulting company, Knowledge
Stream Partners)
– Rakesh Agrawal (IBM Research)
12. What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations
13. Clustering
• Find groups of similar data
items
• Statistical techniques require
definition of “distance” (e.g.
between travel profiles),
conceptual techniques use
background concepts and
logical descriptions
Uses:
• Demographic analysis
Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering
“Group people with
similar travel
profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
Clusters
Top Stories clustering
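As a minimal sketch of the statistical approach on this slide, the following groups people by a numeric travel profile using a distance threshold. The profile values, the choice of Euclidean distance, and the threshold are all assumptions made for illustration; the slide names the people but gives no data, and the listed technologies (self-organizing maps, probability densities) are considerably more sophisticated.

```python
import math

# Hypothetical travel profiles: (trips per year, average trip length in days).
profiles = {
    "George": (2, 14), "Patricia": (3, 12),
    "Jeff": (20, 2), "Evelyn": (22, 3), "Chris": (19, 2),
    "Rob": (8, 30),
}

def distance(a, b):
    """Euclidean distance between two profiles (one possible definition)."""
    return math.dist(a, b)

def cluster(items, threshold):
    """Single-pass clustering: join the first cluster whose first member is close enough."""
    clusters = []
    for name, point in items.items():
        for group in clusters:
            if distance(point, items[group[0]]) <= threshold:
                group.append(name)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([name])
    return clusters

clusters = cluster(profiles, threshold=5.0)
print(clusters)
```

With these made-up numbers the three groups from the slide fall out, but the result depends entirely on how "distance" is defined, which is the point the slide makes.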
14. Classification
• Find ways to separate data
items into pre-defined groups
– We know X and Y belong
together, find other things in
same group
• Requires “training data”:
Data items where group is
known
Uses:
• Profiling
Technologies:
• Generate decision trees
(results are human
understandable)
• Neural Nets
“Route documents to
most likely interested
parties”
– English or non-English?
– Domestic or Foreign?
[Figure: training data → tool produces classifier → groups]
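To illustrate how a tool turns training data into a classifier, here is a sketch of the simplest possible decision tree: a single threshold test (a "decision stump") for the English/non-English routing example. The single feature and all training values are hypothetical; a real tool would consider many features and build a deeper tree.

```python
# Hypothetical training data: (fraction of words found in an English word list, known group).
training = [
    (0.92, "english"), (0.85, "english"), (0.78, "english"),
    (0.30, "non-english"), (0.22, "non-english"), (0.41, "non-english"),
]

def train_stump(data):
    """Pick the threshold on the single feature that misclassifies the fewest items."""
    values = sorted(v for v, _ in data)
    # Candidate thresholds: midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((v >= t) != (label == "english") for v, label in data)

    return min(candidates, key=errors)

def classify(value, threshold):
    """Apply the learned rule to a new, unlabeled item."""
    return "english" if value >= threshold else "non-english"

threshold = train_stump(training)
print(threshold, classify(0.88, threshold), classify(0.15, threshold))
```

The learned rule is human-readable ("at or above the threshold, route as English"), which is the property the slide attributes to decision trees.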
15. Association Rules
• Identify dependencies in
the data:
– X makes Y likely
• Indicate significance of
each dependency
• Bayesian methods
Uses:
• Targeted marketing
Technologies:
• AIS, SETM, Hugin,
TETRAD II
“Find groups of items
commonly purchased
together”
– People who purchase fish
are extraordinarily likely
to purchase wine
– People who purchase
turkey are
extraordinarily likely to
purchase cranberries
Date  Time   Register  Fish  Turkey  Cranberries  Wine  …
12/6  13:15  2         N     Y       Y            Y     …
12/6  13:16  3         Y     N       N            Y     …
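The "significance" of a dependency is commonly expressed as support and confidence. The sketch below computes both for the rule "X makes Y likely". The first two transactions are the rows shown on the slide; the other two are hypothetical additions so the numbers are not trivial.

```python
# Each transaction is the set of items purchased.
transactions = [
    {"turkey", "cranberries", "wine"},   # 12/6 13:15, register 2 (from the slide)
    {"fish", "wine"},                    # 12/6 13:16, register 3 (from the slide)
    {"fish", "wine", "cranberries"},     # hypothetical
    {"turkey", "cranberries"},           # hypothetical
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Of the transactions containing x, the fraction that also contain y."""
    return support(x | y) / support(x)

print(support({"fish", "wine"}), confidence({"fish"}, {"wine"}))
print(confidence({"wine"}, {"fish"}))
```

Note that confidence is directional: in this toy data everyone who buys fish buys wine, but not everyone who buys wine buys fish.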
16. Sequential Associations
• Find event sequences that are
unusually likely
• Requires “training” event list,
known “interesting” events
• Must be robust in the face of
additional “noise” events
Uses:
• Failure analysis and
prediction
Technologies:
• Dynamic programming
(Dynamic time warping)
• “Custom” algorithms
“Find common sequences
of warnings/faults
within 10 minute
periods”
– Warn 2 on Switch C
preceded by Fault 21 on
Switch B
– Fault 17 on any switch
preceded by Warn 2 on
any switch
Time   Switch  Event
21:10  B       Fault21
21:11  A       Warn2
21:13  C       Warn2
21:20  A       Fault17
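A minimal sketch of the counting step, using the four-row event log from the slide: it tallies ordered pairs of events that occur within a 10-minute window, ignoring which switch raised them. Treating the window boundary as inclusive is an assumption, and a real tool would need many logs plus a baseline to decide which pairs are "unusually likely" rather than noise.

```python
from collections import Counter
from itertools import combinations

# Event log from the slide: (time as "HH:MM", switch, event).
log = [("21:10", "B", "Fault21"), ("21:11", "A", "Warn2"),
       ("21:13", "C", "Warn2"), ("21:20", "A", "Fault17")]

def minutes(hhmm):
    """Convert "HH:MM" to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def ordered_pairs(events, window=10):
    """Count (earlier event, later event) pairs at most `window` minutes apart."""
    events = sorted(events, key=lambda e: minutes(e[0]))
    counts = Counter()
    for (t1, _, e1), (t2, _, e2) in combinations(events, 2):
        if 0 < minutes(t2) - minutes(t1) <= window:
            counts[(e1, e2)] += 1
    return counts

pairs = ordered_pairs(log)
print(pairs)
```

In this log, Warn2 follows Fault21 twice and Fault17 follows Warn2 twice, matching the two example sequences on the slide.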
18. Large-scale Endeavors
Approaches: Clustering, Classification, Association, Sequence, Deviation
Products / Research:
SAS: Decision Trees
SPSS: √ √
Oracle (Darwin): √, ANN
IBM: Time Series, Decision Trees, √ √ √
DBMiner (Simon Fraser): √ √
19. War Stories:
Warehouse Product Allocation
The second project, identified as "Warehouse Product Allocation," was also initiated in
late 1995 by RS Components' IS and Operations Departments. In addition to their
warehouse in Corby, the company was in the process of opening another 500,000-
square-foot site in the Midlands region of the U.K. To efficiently ship product from
these two locations, it was essential that RS Components know in advance what
products should be allocated to which warehouse. For this project, the team used IBM
Intelligent Miner and additional optimization logic to split RS Components' product
sets between these two sites so that the number of partial orders and split shipments
would be minimized.
Parker says that the Warehouse Product Allocation project has directly contributed to a
significant savings in the number of parcels shipped, and therefore in shipping costs. In
addition, he says that the Opportunity Selling project not only increased the level of
service, but also made it easier to provide new subsidiaries with the value-added
knowledge that enables them to quickly ramp-up sales.
"By using the data mining tools and some additional optimization logic, IBM helped us
produce a solution which heavily outperformed the best solution that we could have
arrived at by conventional techniques," said Parker. "The IBM group tracked historical
order data and conclusively demonstrated that data mining produced increased revenue
that will give us a return on investment 10 times greater than the amount we spent on
the first project."
http://direct.boulder.ibm.com/dss/customer/rscomp.html
20. War Stories:
Inventory Forecasting
American Entertainment Company
Forecasting demand for inventory is a central problem for any
distributor. Ship too much and the distributor incurs the cost of
restocking unsold products; ship too little and sales opportunities
are lost.
IBM Data Mining Solutions assisted this customer by providing
an inventory forecasting model, using segmentation and predictive
modeling. This new model has proven to be considerably more
accurate than any prior forecasting model.
More war stories (many humorous) starting with slide 21 of:
http://robotics.stanford.edu/~ronnyk/chasm.pdf
21. Data Mining as a Threat to
Security
• Data mining gives us “facts” that are not obvious to human
analysts of the data
• Enables inspection and analysis of huge amounts of data
• Possible threats:
– Predict information about classified work from correlation with
unclassified work (e.g. budgets, staffing)
– Detect “hidden” information based on “conspicuous” lack of
information
– Mining “Open Source” data to determine predictive events (e.g.,
Pizza deliveries to the Pentagon)
• It isn’t the data we want to protect, but correlations among data
items
• Published in Chris Clifton and Don Marks, “Security and Privacy
Implications of Data Mining”, Proceedings of the 1996 ACM
SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery
22. Background – Inference
Problem
• MLS database – “high” and “low” data
– Problem if we can infer “high” data from “low” data
– Progress has been made (Morgenstern, Marks, ...)
• Problem: What if the inference isn’t “strict”?
– “Default inference” problems – Birds fly, an Ostrich is a bird,
so Ostriches fly – not true, so we can’t infer birds fly (and we
don’t prevent such an inference)
– But “birds fly” is useful, even if not strictly true
– Only limited work in detecting/preventing “imprecise”
inferences (Rath, Jones, Hale, Shenoi)
• Data mining specializes in finding imprecise inferences
23. Data mining – Inference from
Large Data
• Data mining gives us probabilistic “inferences”:
– 25% of group X is Y, but only 2% of population is Y.
• Key to data mining: Don’t need to pre-specify X and
Y.
– Define total population
– Define parameters that can be used to create group X
– Define parameters that can be used to create group Y
– Note the combinatorial explosion in the number of possible
groups: if three parameters are used to create group X, there are
n^3 possible groups
• Data mining tool determines groups X and Y where
“inference” is unusually likely
• Existing inference prevention based on guaranteed
truth of inference, but is this good enough?
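The two numbers on this slide can be made concrete with a short sketch. It computes how much more common Y is inside group X than in the population (often called lift), and enumerates candidate definitions of group X. The parameter names and values are hypothetical, and reading n as the number of values each parameter can take is an assumption.

```python
from itertools import product

def lift(pct_in_group, pct_in_population):
    """How many times more common Y is inside group X than overall."""
    return pct_in_group / pct_in_population

# Hypothetical parameters, each with n = 4 possible values.
parameters = {
    "age_band": ["<30", "30-44", "45-59", "60+"],
    "region": ["N", "S", "E", "W"],
    "income_band": ["low", "mid", "high", "top"],
}
# Every combination of one value per parameter defines a candidate group X.
groups = list(product(*parameters.values()))

print(lift(25, 2))   # the slide's figures: 25% of group X vs 2% of the population
print(len(groups))   # n**3 candidate groups for three parameters
```

A mining tool searches this space automatically, which is why the analyst does not need to pre-specify X and Y.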
24. Motivating Example:
Mortgage Application
• Idea: Mortgage company buys market research data to develop
profile of people likely to default
– Marketing data available
– Mortgage companies have history of current client defaults
• Problem: If 20% of profile defaults, it may make business sense
to reject all – but is it fair to the 80% that wouldn’t?
• Information Provider doesn’t want this done (potential public
backlash, e.g. Lotus)
Name    Golfs  Skis  Mail-order  Car   ...  Default
Dennis  Y      N     $25         BMW        N
Chris   N      Y     $815        Ford       Y
Denise  N      Y     $790        Ford       N
...
Eric    N      Y     $830        Ford       ?
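A sketch of the profiling step, using only the rows visible in the table (the elided rows are not reconstructed). It estimates a default rate for Eric from past clients who match him on a few attributes; which attributes define the profile is an assumption, and the mail-order amount is left out for simplicity.

```python
# Visible rows from the slide's table.
history = [
    {"name": "Dennis", "golfs": True, "skis": False, "car": "BMW", "default": False},
    {"name": "Chris", "golfs": False, "skis": True, "car": "Ford", "default": True},
    {"name": "Denise", "golfs": False, "skis": True, "car": "Ford", "default": False},
]
applicant = {"name": "Eric", "golfs": False, "skis": True, "car": "Ford"}

def default_rate(applicant, history, keys=("golfs", "skis", "car")):
    """Share of past clients matching the applicant's profile who defaulted."""
    matches = [r for r in history if all(r[k] == applicant[k] for k in keys)]
    if not matches:
        return None  # no comparable clients, so no estimate
    return sum(r["default"] for r in matches) / len(matches)

rate = default_rate(applicant, history)
print(rate)
```

Eric matches both Chris (defaulted) and Denise (did not), so rejecting him on the profile alone would treat him like Chris even though he may be like Denise. That is the fairness problem the slide raises.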
25. Goal – Technical Solution
We want to protect the information
provider.
• Prevent others from finding any meaningful
correlations
– Must still provide access to individual data
elements (e.g. phone book)
• Prevent specific correlations (or classes of
correlations)
– Preserve ability to mine in desired fashion (e.g.
targeted marketing, inventory prediction)
26. What Can We Do?
• Prevent useful results from mining
– Algorithms only find “facts” with sufficient confidence and
support
– Limit data access to ensure low confidence and support
– Extra data (“cover stories”) to give “false” results with high
confidence and support
• Exploit weaknesses in mining algorithms
– Performance “blowups” under certain conditions
– Alter data to prevent exact matches
• Example: Extra digit at end of telephone number
• Remove information providing unwanted correlations
– Strip identifiers
– Group identifiers (e.g. census blocks, not addresses)
• “You mine the data, I’ll send the mailings”
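One of the options above, stripping identifiers and grouping them, can be sketched as follows. The names and mail-order amounts echo the mortgage example; the addresses and the address-to-census-block lookup are hypothetical.

```python
# Hypothetical records and lookup table.
records = [
    {"name": "Dennis", "address": "12 Oak St", "mail_order": 25},
    {"name": "Chris", "address": "14 Oak St", "mail_order": 815},
    {"name": "Denise", "address": "3 Elm Ave", "mail_order": 790},
]
census_block = {"12 Oak St": "block-0417", "14 Oak St": "block-0417",
                "3 Elm Ave": "block-0522"}

def release(record):
    """Strip the direct identifier and replace the address with its group."""
    return {"block": census_block[record["address"]],
            "mail_order": record["mail_order"]}

released = [release(r) for r in records]
print(released)
```

The released data still supports block-level marketing analysis, but a correlation found in it can no longer be tied directly to a named individual.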
27. What We Have Learned So Far:
Qualitative Results
• Avoid unnecessary groupings of data
– Ranges of instances can give information
• Department encodes center, division
• Employee number encodes hire date
– Knowing the meaning of a grouping is not necessary; the
existence of a meaningful grouping allows us to mine
– Moral: Assign “id numbers” randomly (still serve to identify)
• Providing only samples of data can lower confidence
in mining results
– Key: Provable limits for validity of mining results given a
sample
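The moral about ID numbers can be sketched in a few lines: draw unique identifiers at random so that their order and ranges carry no information about hire date or department. The employee names and the six-digit ID range are assumptions for illustration.

```python
import random

def assign_random_ids(employees, seed=None):
    """Give each employee a unique ID unrelated to the order of the input list."""
    rng = random.Random(seed)
    # sample() draws without replacement, so the IDs still serve to identify.
    ids = rng.sample(range(100000, 1000000), k=len(employees))
    return dict(zip(employees, ids))

# Hypothetical employees, listed in hire order.
employees = ["George", "Patricia", "Jeff", "Evelyn"]
ids = assign_random_ids(employees, seed=1)
print(ids)
```

The fixed seed is only there to make the demonstration repeatable; in practice the assignment should not be reproducible by an outsider.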
28. Data Mining to Handle
Security Problems
• Data mining tools can be used to examine audit data
and flag abnormal behavior
• Some work in Intrusion detection
– e.g., Neural networks to detect abnormal patterns
• SRI work on IDES
• Harris Corporation work
• Tools are being examined as a means to determine
abnormal patterns and also to determine the type of
problem
– Classification techniques
• Can draw heavily on Fraud detection
– Credit cards, calling cards, etc.
– Work by SRA Corporation
29. Data Mining to Improve
Security
• Intrusion Detection
– Relies on “training data”
– We’ll go into detail on this area (lots of new work)
• User profiling (what is normal behavior for a
user)
– Lots of work in the telecommunications industry
(caller fraud)
– Work is happening in computer security community
Various work in “command sequence” profiles
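A minimal sketch of a "command sequence" profile, assuming the profile is simply the set of consecutive command pairs seen in a user's training data. A session is scored by the fraction of its pairs that were never seen before. The command histories are hypothetical, and published approaches use richer models than this.

```python
def bigrams(commands):
    """Consecutive command pairs, e.g. ["ls", "cd", "vi"] -> [("ls", "cd"), ("cd", "vi")]."""
    return list(zip(commands, commands[1:]))

def build_profile(history):
    """Normal behavior for a user: the set of command pairs seen in training data."""
    return set(bigrams(history))

def anomaly_score(session, profile):
    """Fraction of command pairs in the session that the profile has never seen."""
    pairs = bigrams(session)
    if not pairs:
        return 0.0
    return sum(p not in profile for p in pairs) / len(pairs)

# Hypothetical training data and sessions.
history = ["ls", "cd", "vi", "make", "ls", "cd", "make", "ls"]
profile = build_profile(history)
normal = ["ls", "cd", "make", "ls"]
odd = ["ls", "cd", "wget", "chmod"]
print(anomaly_score(normal, profile), anomaly_score(odd, profile))
```

As the slide notes for intrusion detection generally, the result is only as good as the training data: a user who legitimately changes habits will also score as abnormal.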
The problem is that we may not know in advance what can be learned from mining.
We can’t “classify everything”: some data is open source, and some has large benefits from being accessible.
This is the opposite of statistical queries – we are concerned about preventing generalities from specifics, rather than specifics from generalities – but conceptually similar.
Not the same as induction – data mining finds “rules” that are generally true (high confidence and support), but not necessarily exact.