PPT Presentation - CODATA, The Committee on Data for Science and ...
Upcoming SlideShare
Loading in...5
×
 

PPT Presentation - CODATA, The Committee on Data for Science and ...

on

  • 695 views

 

Statistics

Views

Total Views
695
Views on SlideShare
695
Embed Views
0

Actions

Likes
0
Downloads
13
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • If you have any question during my presentation, please feel free to ask.
  • This quotation is one of the most commonly used definitions of DMKD. What does each of these terms mean? Non-trivial process of identification. Knowledge discovery is a process with many component steps, including data preparation, data mining, pattern analysis, and others. The process is non-trivial in that it is typically complex and may take many iterations to refine and tune the data to the problem at hand. Patterns in data . The goal of knowledge discovery is to identify patterns in data. The patterns to be identified can be qualified as follows Valid . The patterns must be correct in representing the data themselves. Novel . If the patterns are not novel, no knowledge has been discovered. Potentially useful . The goal of knowledge discovery is to obtain knowledge so that it can be used. Understandable . If the data patterns cannot be understood, they are difficult to use. Knowledge cannot be used unless it is understood. The term “data mining” is overloaded. Sometimes it is used synonymously with knowledge discovery, while other times it is used to refer to the machine-learning phase of knowledge discovery.
  • This slide is borrowed from Dr Jiawei Han’s Data Mining textbook. The pyramid shows a more detailed view of each steps of DM. the data sources may be Paper, Files, Information Providers, Database Systems, OLTP. Data exploration uses Statistical Analysis, Querying and Reporting. The top is the knowledge for end user. To summarize, we are drowning in data, but starving for knowledge. So what is the difference between data, information, and knowledge? Where does wisdom fit into all of this?
  • Now let’s see how a caveman did data mining. In this cartoon, a caveman gathers information about the world around him, infers knowledge from this information, and reasons using this knowledge to obtain wisdom . (This process is conceptually similar to many modern rule-based systems, where rules such as “wolves eat rabbits” and “rabbits eat grass” are chained together to conduct reasoning.) Data. Data are a collection of observations or measurements that have no meaning in and of themselves (e.g., “10”). Information. Information is data that has been given context and is meaningful to a particular audience (e.g., “Orange Line trains are running 10 minutes late”). Knowledge. Knowledge is information with a broader, general understanding of the subject matter (e.g., “The single track in the Rosslyn tunnel is shared by both directions of the Orange and Blue lines and frequently causes delays”). Wisdom. Wisdom is knowledge embedded with the reflection or application of that knowledge (e.g., “Orange line train passengers should expect regular delays and plan accordingly”). This organization of terms is often referred to as the “data–information– knowledge–wisdom hierarchy” (DIKW)
  • The traditional approach to mining data from distributed sources has been to gather all data into a central data warehouse, then run data mining algorithms against the warehouse. There are a number of applications that are infeasible under such a constraint, leading to a need for distributed data mining. Issues include performance, connectivity, heterogeneity of sources, and privacy of sources.
  • Simply creating a data mining algorithm that extracts data directly from the local sources doesn’t solve these problems – all the issues that prevented creating a data warehouse still exist, they are just merged in with the data mining tool.
  • Local data mining distributes control over data mining! This solves many of the problems we have highlighted. In particular, we are addressing the privacy issue – this is a significant impediment to many uses of data mining, particularly in areas of general public interest. As an example, the Centers for Disease Control may want to use data mining to identify trends and patterns in disease outbreaks, such as understanding and predicting the progression of a flu epidemic. Insurance companies have considerable data that would be useful for such a task, but privacy considerations prevent them from releasing the data. An alternative is to have each of the statistics provide statistics on their data that cannot be traced to individual patients, but can be used to identify the trends and patterns of interested to the CDC.
  • Compared with other data mining areas, medical data mining has some unique characteristics. Since medical files are related to human subjects, privacy concern is taken more seriously than other data mining tasks. We take two approaches to protect privacy: one approach is to vertically partition the medical data and mine these partitioned data at multiple sites; the other approach is to horizontally split data across multiple sites. In the vertical partition approach, each site uses a portion of the attributes to compute its results and the distributed results are assembled at a central trusted party using majority-vote ensemble method. In the horizontal partition approach, data are distributed among several sites. Each site computes its own data and a central trusted party is responsible to integrate these results. Figure in the left tells us that the classification result using the whole dataset is better than using sub-datasets and the majority-vote result, except for number 2. The majority-vote result is slightly higher than the average of 9 sub-datasets for malignant class (3.32% vs. 3.59%) and slight lower than the average of 9 sub-datasets for benign class (2.84% vs. 2.6%). To summarize, using vertical data separation techniques, we can protect data privacy while sacrifice some classification accuracy. Figure in the right tells us that the classification result using the combined dataset is better than using individual dataset and the majority-vote result. The majority-vote result is better than the average of three individual dataset for both classes (heart-disease: 15.99% vs. 23.35%; normal: 16.63% vs. 18.28%). To summarize, using horizontal data separation techniques, we can both protect data privacy and achieve higher classification accuracy.

PPT Presentation - CODATA, The Committee on Data for Science and ... PPT Presentation - CODATA, The Committee on Data for Science and ... Presentation Transcript

  • College of Information Science & Technology University of Nebraska’s
  • Privacy-preserving Data Mining of Medical Data using Data Separation Based Techniques Gang Kou, Yi Peng, Yong Shi, and Zhengxin Chen
      • Data Mining and Knowledge Discovery
    “ Knowledge Discovery is the non-trivial process of identifying valid , novel , potentially useful , and ultimately understandable patterns in data . ” — U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996 View slide
  • We are drowning in data, but starving for knowledge! Adopted from Jiawei Han, UIUC Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration OLAP, MDA Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts Data Sources Paper, Files, Information Providers, Database Systems, OLTP Knowledge View slide
  • A Caveman's Wisdom Source: The Futurist, December 1982, Tom Chalkley Information is data that has been given context and is meaningful to a particular audience Knowledge is information with a broader, general understanding of the subject matter Wisdom is knowledge embedded with the reflection or application of that knowledge
      • Introduction
    Data mining is concerned with the extraction of useful knowledge from various types of data. Medical data mining has been a popular data mining topic of late. Compared with other data mining areas, medical data mining has some unique characteristics. Since medical files are related to human subjects, privacy concern is taken more seriously than other data mining tasks. This paper applied data separation-based techniques to preserve privacy in classification of medical data. We implement these two approaches using medical datasets from UCI KDD archive and report the experimental results.
      • Medical Datasets in this research
    Wisconsin prognostic breast cancer dataset has 699 records and 9 variables. These records belong to either benign or malignant class. We create nine different sub-datasets by removing one variable at a time. The heart-disease dataset has 797 records and 13 variables. These records belong to either heart-disease or normal class. This data was collected from Cleveland Clinic Foundation, Hungarian Institute of Cardiology, University Hospital of Zurich, and Long Beach V.A. Medical Center.
  • Distributed Data Mining: The “Standard” Method Local Data Local Data Local Data Warehouse Data Mining Combined valid results The Data Warehouse Approach
  • Private Distributed Mining: What is it? Local Data Local Data Local Data Data Mining Combined valid results What Won’t Work
  • Private Distributed Mining: What is it? Local Data Mining Local Data Local Data Local Data Data Mining Combined valid results What Will Work Data Mining Combiner Local Data Mining Local Data Mining
      • Privacy-preserving Data Mining
    • – Data obfuscation
    • Nobody sees the real data so it hide the protected information
    • Approaches– Randomly modify data– Swap values between records– Controlled modification of data to hide secrets
    • – Summarization
    • Only the needed facts are exposed
    • Approaches– Overall collection statistics– Limited query functionality
    • – Data separation
    • Data remains with trusted parties
    • Approaches– Vertical separation– Horizontal separation
      • Privacy-preserving Classification (Vertical) Process
    Input : The Medical dataset M , each of the medical records has m attributes Output : Average classification accuracies for benign and malignant of the dataset in 10-fold cross-validation; scores for all records; decision trees ensemble. Step 1 Generate m subsets with one different attribute removed from M at each time. Step 2 Training each subset with See5 with adaptive boosting and 10-fold cross validation to get m decision trees. Step 3 Ensemble the final decision function D, via majority vote of the m decision trees from step 2. Step 4 Classify M by the final decision function. END
      • Privacy-preserving Classification (Horizontal) Process
    Input : The Medical dataset from r different sources, M1, M2, Mr, each of the medical records has m attributes Output : Average classification accuracies for Normal and Heart-disease of the dataset in 10-fold cross-validation; scores for all records; decision trees ensemble. Step 1 Training r datasets with See5 with adaptive boosting and 10-fold cross validation to get r decision trees . Step 2 Ensemble the final decision function D, via majority vote of the r decision trees from step 1. Step 3 Classify all r datasets by the final decision function. END
      • Privacy-preserving Data Mining
    Vertical partition - each site uses a portion of the attributes to compute its results and the distributed results are assembled at a central trusted party using majority-vote ensemble method. Horizontal partition - data are distributed among several sites. Each site computes its own data and a central party is responsible to integrate these results.
      • Conclusion
    Privacy-preserving is an important issue in medical data mining. This paper investigates data separation techniques in medical data classification. The experiments demonstrate that data separation techniques can not only protect data privacy, but also increase classification accuracy sometimes (e.g., horizontally partitioned data).
  • Comment & Question 谢谢大家! Thank You! University of Nebraska’s