Tutorial on High Performance Data Mining
Upcoming SlideShare
Loading in...5
×
 

Tutorial on High Performance Data Mining

on

  • 1,393 views

 

Statistics

Views

Total Views
1,393
Views on SlideShare
1,390
Embed Views
3

Actions

Likes
1
Downloads
75
Comments
0

1 Embed 3

http://www.slideshare.net 3

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • MineSet from Silicon Graphics
  • Pp 234-235

Tutorial on High Performance Data Mining Tutorial on High Performance Data Mining Presentation Transcript

  • Data Mining Dr. Mohsen Kahani Email: [email_address] http:// www.um.ac.ir/~kahani /
  • Overview
    • Introduction
    • Data Mining Functions and Models
    • Data Mining Methodologies
    • Data Mining Case Studies
    • Final Remarks
  • Motivation: “Necessity is the Mother of Invention”
    • Data explosion problem:
      • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
    • We are drowning in data, but starving for knowledge!
    View slide
  • Data pyramid Data Information Knowledge Wisdom Data + context Information + rules Knowledge + experience View slide
  • Related Fields Statistics Machine Learning Databases Visualization Data Mining and Knowledge Discovery
  • Transformed Data Target Data RawData Knowledge Data Mining Transformation Interpretation & Evaluation Selection & Cleaning Integration Understanding Knowledge Discovery Process DATA Ware house Knowledge __ ____ __ ____ __ ____ Patterns and Rules
  • Data Mining and Business Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration OLAP, MDA Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts Data Sources Paper, Files, Information Providers, Database Systems, OLTP
  • Definition of Data Mining
    • “… The non-trivial process of identifying valid , novel , potentially useful , and ultimately understandable patterns in data…”
    • Fayyad, Piatetsky-Shapiro, Smyth [1996]
  • The Evolution of Data Analysis Evolutionary Step Business Question Enabling Technologies Product Providers Characteristics Data Collection (1960s) "What was my total revenue in the last five years?" Computers, tapes, disks IBM, CDC Retrospective, static data delivery Data A ccess (1980s) "What were unit sales in New England last March?" Relational databases (RDBMS), Structured Query Language (SQL), ODBC Oracle, Sybase, Informix, IBM, Microsoft Retrospective, dynamic data delivery at record level Data Warehousing & Decis ion Support (1990s) "What were unit sales in New England last March? Drill down to Boston." On - line analytic processing (OLAP), multidimensional databases, data warehouses SPSS, Comshare, Arbor, Cognos, Microstrategy,NCR Retrospective, dynamic data d elivery at multiple levels Data Mining (Emerging Today) "What’s likely to happen to Boston unit sales next month? Why?" Advanced algorithms, multiprocessor computers, massive databases SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous s tartups Prospective, proactive information delivery
  • Need for Data Mining
    • Data accumulate and double every 9 months
    • There is a big gap from stored data to knowledge; and the transition won’t occur automatically.
    • Manual data analysis is not new but a bottleneck
    • Fast developing Computer Science and Engineering generates new demands
    • Seeking knowledge from massive data
      • Any personal experience?
  • When is DM useful
    • Data rich world
    • Large data (dimensionality and size)
      • Image data (size)
      • Gene chip data (dimensionality)
    • Little knowledge about data (exploratory data analysis)
      • What if we have some knowledge?
  • Challenges
    • Increasing data dimensionality and data size
    • Various data forms
    • New data types
      • Streaming data, multimedia data
    • Efficient search and access to data/knowledge
    • Intelligent update and integration
  • Data Mining Survey
    • Industry Pioneers
    • 23% Manufacturing
    • 19% Financial Serv.
    • 17% Tele/Data communication
    • 13% Media
    • 12% Retail/Wholesaler
    • Objectives
    • 21.4% Understanding Customer Segments and Preferences,
    • 19,5% Identifying Profitable Customers and Acquiring New ones,
    • 14,1% Increasing Revenue From Customers.
    • World Data Mining Survey, 6 August, 2002.
  • Results of Data Mining Include:
    • Forecasting what may happen in the future
    • Classifying people or things into groups by recognizing patterns
    • Clustering people or things into groups based on their attributes
    • Associating what events are likely to occur together
    • Sequencing what events are likely to lead to later events
  • Data Mining versus OLAP
    • OLAP - On-line Analytical Processing
      • Provides you with a very good view of what is happening, but can not predict what will happen in the future or why it is happening
  • Data Mining Versus Statistical Analysis
    • Data Analysis
      • Tests for statistical correctness of models
        • Are statistical assumptions of models correct?
          • Eg Is the R-Square good?
      • Hypothesis testing
        • Is the relationship significant?
          • Use a t-test to validate significance
      • Tends to rely on sampling
      • Techniques are not optimised for large amounts of data
      • Requires strong statistical skills
    • Data Mining
      • Originally developed to act as expert systems to solve problems
      • Less interested in the mechanics of the technique
      • If it makes sense then let’s use it
      • Does not require assumptions to be made about data
      • Can find patterns in very large amounts of data
      • Requires understanding of data and business problem
  • Data Mining Taxonomy
    • Predictive Method
    • - … predict the value of a particular attribute…
    • Descriptive Method
    • - … foundation of human-interpretable patterns that describe the data…
  • Data Mining Tasks...
    • Classification [ Predictive ]
    • Clustering [ Descriptive ]
    • Association Rule Discovery [ Descriptive ]
    • Sequential Pattern Discovery [ Descriptive ]
    • Deviation Detection [ Predictive ]
  • Data Mining Tasks: Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ...
  • Classification: Linear Regression
    • Linear Regression
      • w 0 + w 1 x + w 2 y >= 0
    • Regression computes w i from data to minimize squared error to ‘fit’ the data
    • Not flexible enough
  • Classification: Decision Trees X Y if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue 5 2 3
  • Decision Trees
    • -a way of representing a series of rules that lead to a class or value;
    • -basic components of a decision tree: decision node, branches and leaves;
    • Income>40,000
    • Job>5 High Debt
    • Low Risk High Risk High Risk Low Risk
    No Yes Yes No Yes No
  • Decision Trees (cont.)
    • handle very well non-numeric data;
    • work best when the predictor variables are categorical;
  • Example Decision Tree categorical categorical continuous class Refund MarSt TaxInc YES NO NO NO Yes No Married Single, Divorced < 80K > 80K Splitting Attributes The splitting attribute at a node is determined based on the Gini index.
  • Classification: Neural Networks
    • efficiently model large and complex problems;
    • may be used in classification problems or for regressions;
    • Starts with input layer => hidden layer => output layer
    1 2 3 4 5 6 Inputs Output Hidden Layer
  • Neural Networks (cont.)
    • can be easily implemented to run on massively parallel computers;
    • can not be easily interpret;
    • require an extensive amount of training time;
    • require a lot of data preparation (involve very careful data cleansing, selection, preparation, and pre-processing);
    • require sufficiently large data set and high signal-to noise ratio.
  • Kohonen Network
    • Description
    • unsupervised
    • seeks to describe dataset in terms of natural clusters of cases
  • Classification Example categorical categorical continuous class Training Set Learn Classifier Test Set Model
  • Classification Application
    • Direct Marketing
    • Fraud Detection
    • Customer Attrition/Churn
    • Sky Survey Cataloging
  • Data Mining Tasks: Clustering
    • Goal is to identify categories
    • Natural grouping of customers by processing all the available data about them.
    • Other applications
      • market segmentation, discovering affinity groups, and defect analysis
  • Data Mining Tasks: Association Rule Discovery
    • Given a set of records each of which contain some number of items from a given collection;
      • Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
    Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
  • Association Rule Discovery Application
    • Marketing and Sales Promotion
    • Supermarket Shelf Management
    • Inventory Management
  • Deviation Detection & Pattern Discovery
    • Deviation Detection:
    • … discovering most significant changes in data from previously measured or normative values…
    • V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
    • Sequential Pattern Discovery:
    • … process of looking for patterns and rules that predict strong sequential dependencies among different events…
    • V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
  • Sequential Patterns
    • Identify frequently occurring sequences from given records
    • 40 percent of female customers buy a gray skirt six months after buying a red jacket
  • Data Mining Methodology: SAS
    • Sample
      • Extract a portion of the dataset for data mining
    • Explore
    • Modify
      • create, select and transform variables with the intention of building a model
    • Model
      • Specify a relationship of variables that reliably predicts a desired goal
    • Assess
      • Evaluate the practical value of the findings and the model resulting from the data mining effort
  • Data Mining Methodology: CRISP-DM
    • Data understanding
    • Data preparation
    • Modeling
    • Evaluation
    • Deployment
  • CRISP-DM Phases
  • Phases and Tasks Business Understanding Data Understanding Evaluation Data Preparation Modeling Determine Business Objectives Background Business Objectives Business Success Criteria Situation Assessment Inventory of Resources Requirements, Assumptions, and Constraints Risks and Contingencies Terminology Costs and Benefits Determine Data Mining Goal Data Mining Goals Data Mining Success Criteria Produce Project Plan Project Plan Initial Asessment of Tools and Techniques Collect Initial Data Initial Data Collection Report Describe Data Data Description Report Explore Data Data Exploration Report Verify Data Quality Data Quality Report Data Set Data Set Description Select Data Rationale for Inclusion / Exclusion Clean Data Data Cleaning Report Construct Data Derived Attributes Generated Records Integrate Data Merged Data Format Data Reformatted Data Select Modeling Technique Modeling Technique Modeling Assumptions Generate Test Design Test Design Build Model Parameter Settings Models Model Description Assess Model Model Assessment Revised Parameter Settings Evaluate Results Assessment of Data Mining Results w.r.t. Business Success Criteria Approved Models Review Process Review of Process Determine Next Steps List of Possible Actions Decision Plan Deployment Deployment Plan Plan Monitoring and Maintenance Monitoring and Maintenance Plan Produce Final Report Final Report Final Presentation Review Project Experience Documentation Deployment
  • Major Application Areas for Data Mining Solutions
    • Fraud/Non-Compliance Anomaly detection
      • Isolate the factors that lead to fraud, waste and abuse
      • Target auditing and investigative efforts more effectively
    • Credit/Risk Scoring
    • Intrusion detection
    • Parts failure prediction
    • Recruiting/Attracting customers
    • Maximizing profitability (cross selling, identifying profitable customers)
    • Service Delivery and Customer Retention
      • Build profiles of customers likely to use which services
    • Web Mining
    • Health Care
  • Case Study: Search Engines
    • Early search engines used mainly keywords on a page – were subject to manipulation
    • Google success is due to its algorithm which uses mainly links to the page
    • Google founders Sergey Brin and Larry Page were students in Stanford doing research in databases and data mining in 1998 which led to Google
  • Case Study: Direct Marketing and CRM
    • Most major direct marketing companies are using modeling and data mining
    • Most financial companies are using customer modeling
    • Modeling is easier than changing customer behaviour
    • Some successes
      • Verizon Wireless reduced churn rate from 2% to 1.5%
  • Biology: Molecular Diagnostics
    • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)
      • 72 samples, about 7,000 genes
    ALL AML
      • Results: 33 correct (97% accuracy),
      • 1 error (sample suspected mislabelled)
      • Outcome predictions?
  • Case Study: Security and Fraud Detection
    • Credit Card Fraud Detection
    • Money laundering
      • FAIS (US Treasury)
    • Securities Fraud
      • NASDAQ Sonar system
    • Phone fraud
      • AT&T, Bell Atlantic, British Telecom/MCI
    • Bio-terrorism detection at Salt Lake Olympics 2002
  • 3D example by MineSet
  • Data Mining and Privacy
    • Data Mining looks for patterns, not people!
    • Technical solutions can limit privacy invasion
      • Replacing sensitive personal data with anon. ID
      • Give randomized outputs
      • Multi-party computation – distributed data
  • The Hype Curve for Data Mining and Knowledge Discovery Over-inflated expectations Disappointment Growing acceptance and mainstreaming rising expectations
  • Final Remarks
    • Data Mining can be utilized for any field that needs to find patterns or relationships in their data.
  • Questions?
  • Special Data Types
    • Spatial Data
    • Streamed Data
    • Multimedia data
  • Spatial Mining Spatial Data and Structures Images Spatial Mining Algorithms
  • Definitions
    • Spatial data is about instances located in a physical space
    • Spatial data has location or geo-referenced features
    • Some of these features are:
      • Address, latitude/longitude (explicit)
      • Location-based partitions in databases (implicit)
  • Applications and Problems
    • Geographic information systems (GIS) store information related to geographic locations on Earth
      • Weather, community infrastructure needs, disaster management, and hazardous waste
    • Homeland security issues such as prediction of unexpected events and planning of evacuation
    • Remote sensing and image classification
    • Biomedical applications include medical imaging and illness diagnosis
  • Use of Spatial Data
    • Map overlay – merging disparate data
      • Different views of the same area: (Level 1) streets, power lines, phone lines, sewer lines, (Level 2) actual elevations, building locations, and rivers
    • Spatial selection – find all houses near WSU
    • Spatial join – nearest for points, intersection for areas
    • Other basic spatial operations
      • Region/range query for objects intersecting a region
      • Nearest neighbor query for objects closest to a given place
      • Distance scan asking for objects within a certain radius
  • Spatial Data Structures
    • Minimum bounding rectangles (MBR)
    • Different tree structures
      • Quad tree
      • R-Tree
      • kd-Tree
    • Image databases
  • MBR
    • Representing a spatial object by the smallest rectangle [(x1,y1), (x2,y2)] or rectangles
    (x1,y1) (x2,y2)
  • R-Tree
    • Indexing MBRs in a tree
      • An R-tree of order m has at most m entries in one node
      • An example (order of 3)
    R8 R7 R6 R3 R2 R1 R5 R4 R8 R1 R2 R3 R6 R5 R4 R7
  • Common Tasks dealing with Spatial Data
    • Data focusing
      • Spatial queries
      • Identifying interesting parts in spatial data
      • Progress refinement can be applied in a tree structure
    • Feature extraction
      • Extracting important/relevant features for an application
    • Classification or others
      • Using training data to create classifiers
      • Many mining algorithms can be used
        • Classification, clustering, associations
  • Spatial Mining Tasks
    • Spatial classification
    • Spatial clustering
    • Spatial association rules
  • Spatial Classification
    • Use spatial information at different (coarse/fine) levels (different indexing trees) for data focusing
    • Determine relevant spatial or non-spatial features
    • Perform normal supervised learning algorithms
      • e.g., Decision trees,
  • Spatial Clustering
    • Use tree structures to index spatial data
    • DBSCAN: R-tree
    • CLIQUE: Grid or Quad tree
    • Clustering with spatial constraints (obstacles  need to adjust notion of distance)
  • Spatial Association Rules
    • Spatial objects are of major interest, not transactions
    • A  B
      • A, B can be either spatial or non-spatial (3 combinations)
      • What is the fourth combination?
    • Association rules can be found w.r.t. the 3 types
  • Summary
    • Spatial data can contain both spatial and non-spatial features.
    • When spatial information becomes dominant interest, spatial data mining should be applied.
    • Spatial data structures can facilitate spatial mining.
    • Standard data mining algorithms can be modified for spatial data mining, with a substantial part of preprocessing to take into account of spatial information.
  • Characteristics of Data Streams
    • Data Streams
      • Data streams — continuous, ordered, changing, fast, huge amount
      • Traditional DBMS — data stored in finite, persistent data sets
    • Characteristics
      • Huge volumes of continuous data, possibly infinite
      • Fast changing and requires fast, real-time response
      • Data stream captures nicely our data processing needs of today
      • Random access is expensive — single linear scan algorithm ( can only have one look )
      • Store only the summary of the data seen thus far
      • Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing
  • Stream Data Applications
    • Telecommunication calling records
    • Business: credit card transaction flows
    • Network monitoring and traffic engineering
    • Financial market: stock exchange
    • Engineering & industrial processes: power supply & manufacturing
    • Sensor, monitoring & surveillance: video streams
    • Security monitoring
    • Web logs and Web page click streams
    • Massive data sets (even saved but random access is too expensive)
  • Architecture: Stream Query Processing Scratch Space (Main memory and/or Disk) User/Application Stream Query Processor Results Multiple streams SDMS (Stream Data Management System) Continuous Query
  • Challenges of Stream Data Processing
    • Multiple, continuous, rapid, time-varying, ordered streams
    • Main memory computations
    • Queries are often continuous
      • Evaluated continuously as stream data arrives
      • Answer updated over time
    • Queries are often complex
      • Beyond element-at-a-time processing
      • Beyond stream-at-a-time processing
      • Beyond relational queries (scientific, data mining, OLAP)
    • Multi-level/multi-dimensional processing and data mining
      • Most stream data are at pretty low-level or multi-dimensional in nature
  • Processing Stream Queries
    • Query types
      • One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
      • Predefined query vs. ad-hoc query (issued on-line)
    • Unbounded memory requirements
      • For real-time response, main memory algorithm should be used
      • Memory requirement is unbounded if one will join future tuples
    • Approximate query answering
      • With bounded memory, it is not always possible to produce exact answers
      • High-quality approximate answers are desired
      • Data reduction and synopsis construction methods
        • Sketches, random sampling, histograms, wavelets, etc.
  • Stream Data Mining vs. Stream Querying
    • Stream mining — A more challenging task
      • It shares most of the difficulties with stream querying
      • Patterns are hidden and more general than querying
      • It may require exploratory analysis
        • Not necessarily continuous queries
    • Stream data mining tasks
      • Multi-dimensional on-line analysis of streams
      • Mining outliers and unusual patterns in stream data
      • Clustering data streams
      • Classification of stream data
  • Stream Data Mining Tasks
    • Multi-dimensional (on-line) analysis of streams
    • Clustering data streams
    • Classification of data streams
    • Mining frequent patterns in data streams
    • Mining sequential patterns in data streams
    • Mining partial periodicity in data streams
    • Mining notable gradients in data streams
    • Mining outliers and unusual patterns in data streams
    • …… , more?
  • Challenges for Mining Dynamics in Data Streams
    • Most stream data are at pretty low-level or multi-dimensional in nature: needs ML/MD processing
    • Analysis requirements
      • Multi-dimensional trends and unusual patterns
      • Capturing important changes at multi-dimensions/levels
      • Fast, real-time detection and response
      • Comparing with data cube: Similarity and differences
    • Stream (data) cube or stream OLAP: Is this feasible?
      • Can we implement it efficiently?
  • Multi-Dimensional Stream Analysis: Examples
    • Analysis of Web click streams
      • Raw data at low levels: seconds, web page addresses, user IP addresses, …
      • Analysts want: changes, trends, unusual patterns, at reasonable levels of details
      • E.g., A verage clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours .”
    • Analysis of power consumption streams
      • Raw data: power consumption flow for every household, every minute
      • Patterns one may find: average hourly power consumption surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago
  • A Key Step — Stream Data Reduction
    • Challenges of OLAPing stream data
      • Raw data cannot be stored
      • Simple aggregates are not powerful enough
      • History shape and patterns at different levels are desirable: multi-dimensional regression analysis
    • Proposal
      • A scalable multi-dimensional stream “data cube” that can aggregate regression model of stream data efficiently without accessing the raw data
    • Stream data compression
      • Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
  • Data Warehouse
  • Data Warehouse Architecture
  • Data Warehouse Options