Additional theme: Visual Data Mining
 

  • From Han and Kamber “Data Mining” p. 463
  • Stat from http://www.galaxy.gmu.edu/stats/syllabi/inft979/VisualDataMining.pdf
  • http://books.elsevier.com/companions/1558606890/pictures/Chapter_01/fig1-6b.gif
  • “two attributes mapped to axes, remaining attributes mapped to angle or length of limbs”. Look at texture pattern
  • Picture: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
  • http://www.cs.umd.edu/hcil/treemap-history/treemap-nba.gif
  • Picture of MineSet decision tree visualizer
  • How the interactive Clementine knowledge discovery process works
    • See your solution discovery process clearly: The interactive stream approach to data mining is the key to Clementine's power. Using icons that represent steps in the data mining process, you mine your data by building a stream - a visual map of the process your data flows through. Start by simply dragging a source icon from the object palette onto the Clementine desktop to access your data flow. Then, explore your data visually with graphs. Apply several types of algorithms to build your model by simply placing the appropriate icons onto the desktop to form a stream.
    • Discover knowledge interactively: Data mining with Clementine is a "discovery-driven" process. Work toward a solution by applying your business expertise to select the next step in your stream, based on the discoveries made in the previous step. You can continually adapt or extend initial streams as you work through the solution to your business problem.
    • Easily build and test models: All of Clementine's advanced techniques work together to quickly give you the best answer to your business problems. You can build and test numerous models to immediately see which model produces the best result. Or you can even combine models by using the results of one model as input into another model. These "meta-models" consider the initial model's decisions and can improve results substantially.
    • Understand variations in your business with visualized data: Powerful data visualization techniques help you understand key relationships in your data and guide the way to the best results. Spot characteristics and patterns at a glance with Clementine's interactive graphs. Then "query by mouse" to explore these patterns by selecting subsets of data or deriving new variables on the fly from discoveries made within the graph.
  • How Clementine scales to the size of the challenge: The Clementine approach to scaling is unique in the way it aims to scale the complete data mining process to the size of large, challenging datasets. Clementine executes common operations used throughout the data mining process in the database through SQL queries. This process leverages the power of the database for faster processing, enabling you to get better results with large datasets.
  • Figure from http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/Kdd-99.final.pdf

Presentation Transcript

  • Data Mining: Concepts and Techniques, Chapter 11: Applications and Trends in Data Mining. Additional Theme: Visual Data Mining
    • Jiawei Han and Micheline Kamber
    • Department of Computer Science
    • University of Illinois at Urbana-Champaign
    • www.cs.uiuc.edu/~hanj
    • ©2006 Jiawei Han and Micheline Kamber. All rights reserved.
  • Visual Data Mining: An Overview
    • What is Visual Data Mining?
    • Survey of techniques
      • Data Visualization
      • Visualizing Data Mining Results
      • Visual Data Mining
  • What Is Visual Data Mining?
    • Visual data mining “discovers implicit and useful knowledge from large data sets using data and/or knowledge visualization techniques”
    • Data visualization + Data mining techniques
  • Why Visual Data Mining?
    • Advantages of human visual system
      • Highly parallel processor
      • Sophisticated reasoning engine
      • Large knowledge base
    • Can be used to comprehend data distributions, patterns, clusters, and outliers
    • Visualization vs. data mining algorithms (from the slide's comparison figure): user interaction + / –, flexibility + / –, evaluation – / +, actionable – / +
  • Why Not Only Visual Data Mining?
    • Disadvantages of human visual system
      • Needs training
      • Not automated
      • Intrinsic bias
      • Limit of about 10^6 or 10^7 observations (Wegman 1995)
    • Power of integration with analytical methods
  • Scope of Visual Data Mining
    • Visualization : Use of computer graphics to create visual images which aid in the understanding of complex, often massive representations of data
    • Visual Data Mining : The process of discovering implicit but useful knowledge from large data sets using visualization techniques
    • Related fields (from the slide's figure): Computer Graphics, High Performance Computing, Pattern Recognition, Human Computer Interfaces, Multimedia Systems
  • Purpose of Visualization
    • Gain insight into an information space by mapping data onto graphical primitives
    • Provide qualitative overview of large data sets
    • Search for patterns, trends, structure, irregularities, relationships among data
    • Help find interesting regions and suitable parameters for further quantitative analysis
    • Provide a visual proof of computer representations derived
  • Visual Data Mining & Data Visualization
    • Integration of visualization and data mining
      • data visualization
      • data mining result visualization
      • data mining process visualization
      • interactive visual data mining
    • Data visualization
      • Data in a database or data warehouse can be viewed
        • at different levels of abstraction
        • as different combinations of attributes or dimensions
      • Data can be presented in various visual forms
  • Abilities of Humans and Computers
  • Visual Mining vs. Scientific Vis. & Graphics
    • Scientific Visualization
      • Often visualize physical model, low dimensionality
    • Graphics
      • More concerned with how to render (draw) rather than what to render
  • Data Visualization
    • View data in database or data warehouse
    • User may control
      • Different levels of details
      • Subset of attributes
    • Drawn using boxplots, histograms, polylines, etc.
  • Historical Overview of Exploratory Data Visualization Techniques (cf. [WB 95])
    • Pioneering works of Tufte [Tuf 83, Tuf 90] and Bertin [Ber 81] focus on
      • Visualization of data with inherent 2D-/3D-semantics
      • General rules for layout, color composition, attribute mapping, etc.
    • Development of visualization techniques for different types of data with an underlying physical model
      • Geographic data, CAD data, flow data, image data, voxel data, etc.
    • Development of visualization techniques for arbitrary multidimensional data (without an underlying physical model)
      • Applicable to databases and other information resources
  • Dimensions of Exploratory Data Visualization
  • Classification of Data Visualization Techniques
    • Geometric Techniques:
      • Scatterplots, Landscapes, Projection Pursuit, Prosection Views, Hyperslice, Parallel Coordinates, ...
    • Icon-based Techniques:
      • Chernoff Faces, Stick Figures, Shape-Coding, Color Icons, TileBars, ...
    • Pixel-oriented Techniques:
      • Recursive Pattern Technique, Circle Segments Technique, Spiral- & Axes-Techniques, ...
    • Hierarchical Techniques:
      • Dimensional Stacking, Worlds-within-Worlds, Treemap, Cone Trees, InfoCube, ...
    • Graph-Based Techniques:
      • Basic Graphs (Straight-Line, Polyline, Curved-Line, ...)
      • Specific Graphs (e.g., DAG, Symmetric, Cluster, ...)
      • Systems (e.g., Tom Sawyer, Hy+, SeeNet, Narcissus, ...)
    • Hybrid Techniques: arbitrary combinations from above
  • Distortion & Dynamic/Interaction Techniques
    • Distortion Techniques
      • Simple Distortion (e.g., Perspective Wall, Bifocal Lenses, TableLens, Graphical Fisheye Views, ...)
      • Complex Distortion (e.g., Hyperbolic Repr., Hyperbox, ...)
    • Dynamic/Interaction Techniques
      • Data-to-Visualization Mapping (e.g., Auto Visual, S Plus, XGobi, IVEE, ...)
      • Projections (e.g., GrandTour, S Plus, XGobi, ...)
      • Filtering (Selection, Querying) (e.g., MagicLens, Filter/Flow Queries, InfoCrystal, ...)
      • Linking & Brushing (e.g., Xmdv-Tool, XGobi, DataDesk, ...)
      • Zooming (e.g., PAD++, IVEE, DataSpace, ...)
      • Detail on Demand (e.g., IVEE, TableLens, MagicLens, VisDB, ...)
  • Visual Survey
    • Data visualization techniques
      • Scatterplot Matrices, Landscapes, Parallel Coordinates
      • Icon-based, Dimensional Stacking, Treemaps
  • Direct Visualization: Ribbons with Twists Based on Vorticity
  • Geometric Techniques
    • Basic Idea
      • Visualization of geometric transformations and projections of the data
    • Methods
      • Landscapes [Wis 95]
      • Projection Pursuit Techniques [Hub 85] (a technique for finding meaningful projections of multidimensional data)
      • Scatterplot-Matrices [And 72, Cle 93]
      • Prosection Views [FB 94, STDS 95]
      • Hyperslice [WL 93]
      • Parallel Coordinates [Ins 85, ID 90]
  • Scatterplot-Matrices [Cleveland 93]
    • Matrix of scatterplots (x-y-diagrams) of the k-dimensional data [total of (k² − k)/2 distinct scatterplots]
    • Used by permission of M. Ward, Worcester Polytechnic Institute
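To make the panel count concrete, here is a minimal sketch that enumerates the attribute pairs of a scatterplot matrix. It assumes each unordered pair of attributes gets one panel, so k attributes give k(k − 1)/2 distinct scatterplots; the attribute names are hypothetical.

```python
from itertools import combinations

def scatterplot_pairs(attributes):
    """Return every unordered attribute pair, one per distinct scatterplot panel."""
    return list(combinations(attributes, 2))

# Hypothetical attribute names, used only for illustration.
attributes = ["age", "income", "education", "hours"]
pairs = scatterplot_pairs(attributes)
k = len(attributes)

# The number of panels matches k(k - 1)/2.
print(len(pairs), k * (k - 1) // 2)
for x_attr, y_attr in pairs:
    print(f"panel: {x_attr} vs. {y_attr}")
```

Because the count grows quadratically with k, a full matrix quickly becomes hard to read for data with many attributes.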
  • Landscapes [Wis 95]
    • Visualization of the data as perspective landscape
    • The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
    • Figure: news articles visualized as a landscape (used by permission of B. Wright, Visible Decisions Inc.)
  • Parallel Coordinates [Ins 85, ID 90]
    • n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
    • The axes are scaled to the [minimum, maximum] range of the corresponding attribute
    • Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
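The construction above can be sketched in a few lines. This is only an illustration under simple assumptions: the axes are vertical, equally spaced along x, and each attribute is scaled linearly from its [minimum, maximum] range to [0, 1]. The function name and the sample rows are hypothetical.

```python
def parallel_coordinates(rows, width=1.0):
    """Return one polyline (list of (x, y) vertices) per data item.

    Axis i is the vertical line x = i * spacing. Each value is scaled to
    [0, 1] using the [minimum, maximum] range of its attribute.
    Assumes at least two attributes per row.
    """
    n = len(rows[0])
    lows = [min(r[i] for r in rows) for i in range(n)]
    highs = [max(r[i] for r in rows) for i in range(n)]
    spacing = width / (n - 1)
    polylines = []
    for r in rows:
        line = []
        for i in range(n):
            span = highs[i] - lows[i]
            # A constant attribute has no range; place it mid-axis.
            y = (r[i] - lows[i]) / span if span else 0.5
            line.append((i * spacing, y))
        polylines.append(line)
    return polylines

# Hypothetical three-attribute data set.
rows = [(1.0, 200.0, 5.0), (3.0, 100.0, 7.0), (2.0, 150.0, 9.0)]
lines = parallel_coordinates(rows, width=2.0)
for line in lines:
    print(line)
```

Drawing each returned vertex list as a connected line gives the polygonal line of one data item.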
  • Parallel Coordinates
  • Icon-Based Techniques
    • Basic Idea
      • Visualization of the data values as features of icons
    • Overview
      • Chernoff-Faces [Che 73, Tuf 83]
      • Stick Figures [Pic 70, PG 88]
      • Shape Coding [Bed 90]
      • Color Icons [Lev 91, KK 94]
      • TileBars [Hea 95] (use of small icons representing the relevance feature vectors in document retrieval)
  • Stick Figures
    • Figure: census data showing age, income, sex, education, etc. (used by permission of G. Grinstein, University of Massachusetts at Lowell)
  • Hierarchical Techniques
    • Basic Idea :  Visualization of the data using a hierarchical partitioning into subspaces.
    • Overview
      • Dimensional Stacking [LWW 90]
      • Worlds-within-Worlds [FB 90a/b]
      • Treemap [Shn 92, Joh 93]
      • Cone Trees [RMC 91]
      • InfoCube [RG 93]
  • Dimensional Stacking [LWW 90]
    • Partitioning of the n-dimensional attribute space in 2-dimensional subspaces which are ‘stacked’ into each other
    • Partitioning of the attribute value ranges into classes
    • The important attributes should be used on the outer levels
    • Adequate especially for data with ordinal attributes of low cardinality
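A minimal sketch of the stacking idea for four attributes, assuming equal-width classes and a two-level layout: the outer pair of attributes selects a block, and the inner pair selects a cell inside that block. The helper names, value ranges, and bin counts are hypothetical.

```python
def bin_index(value, low, high, bins):
    """Partition [low, high] into equal-width classes and return the class index."""
    if high <= low:
        return 0
    idx = int((value - low) / (high - low) * bins)
    return max(0, min(idx, bins - 1))  # clamp so value == high lands in the last class

def stacked_cell(outer, inner, inner_bins):
    """Combine outer (x, y) and inner (x, y) class indices into one grid cell."""
    (ox, oy), (ix, iy) = outer, inner
    bx, by = inner_bins
    return (ox * bx + ix, oy * by + iy)

# Hypothetical record: two outer attributes with 4 classes each,
# two inner attributes with 2 classes each.
outer = (bin_index(25.0, 0.0, 100.0, 4), bin_index(80.0, 0.0, 100.0, 4))
inner = (bin_index(0.9, 0.0, 1.0, 2), bin_index(10.0, 0.0, 50.0, 2))
cell = stacked_cell(outer, inner, inner_bins=(2, 2))
print(outer, inner, cell)
```

Every extra pair of attributes multiplies the grid size by its class counts, which is one reason the technique suits low-cardinality attributes.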
  • Dimensional Stacking
    • Figure: visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
    • Used by permission of M. Ward, Worcester Polytechnic Institute
  • Dimensional Stacking
    • Disadvantages:
      • Difficult to display more than nine dimensions
      • Important to map dimensions appropriately
      • May be difficult to understand visualizations at first
  • Treemap [JS 91, Shn 92, Joh 93]
    • Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
    • The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
    • Figure: MSR Netscan image
  • Treemap of a File System (Shneiderman)
  • Treemaps
    • The attributes used for the partitioning and their ordering are user-defined (the most important attributes should be used first)
    • The color of the regions may correspond to an additional attribute
    • Suitable to get an overview over large amounts of hierarchical data (e.g., file system) and for data with multiple ordinal attributes (e.g., census data)
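The alternating partitioning can be sketched as the classic slice-and-dice layout. This is an illustration, not any particular tool's implementation: the tree encoding `(name, size, children)` and the sample file-system sizes are hypothetical, and only leaf rectangles are returned.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out a tree as nested rectangles, alternating the split direction.

    node is (name, size, children). Returns a list of (name, x, y, w, h)
    for the leaves; each leaf's area is proportional to its size.
    """
    if out is None:
        out = []
    name, size, children = node
    if not children:
        out.append((name, x, y, w, h))
        return out
    total = sum(child[1] for child in children)
    offset = 0.0
    for child in children:
        share = child[1] / total
        if depth % 2 == 0:
            # Even levels partition the x-dimension.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd levels partition the y-dimension.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out

# Hypothetical file-system tree with sizes in arbitrary units.
tree = ("root", 8, [
    ("A", 4, []),
    ("B", 4, [("B1", 1, []), ("B2", 3, [])]),
])
rects = slice_and_dice(tree, 0, 0, 8, 4)
for rect in rects:
    print(rect)
```

A color for each rectangle could then be looked up from an additional attribute, as the slide notes.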
  • Data Mining Result Visualization
    • Presentation of the results or knowledge obtained from data mining in visual forms
    • Examples
      • Scatter plots and boxplots (obtained from descriptive data mining)
      • Decision trees
      • Association rules
      • Clusters
      • Outliers
      • Generalized rules
      • Text mining
  • Boxplots from Statsoft: Multiple Variable Combinations
  • Visualization of Data Mining Results in SAS Enterprise Miner: Scatter Plots
  • Visualization of Association Rules in SGI/MineSet 3.0
  • Visualization of Decision Tree in SGI/MineSet 3.0
  • Visualization of Decision Trees
  • Visualization of Cluster Grouping IBM Intelligent Miner
  • Association Rules (MineSet)
    • LHS and RHS items are mapped to x-, y-axis
    • Confidence, support correspond to height of the bar or disc, respectively
    • Interestingness is mapped to Color
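For readers who want the quantities behind this display, here is a small sketch that computes support and confidence for a rule LHS → RHS using the standard definitions. The transactions are made up for illustration, and mapping the results to bar and disc heights follows the bullet above; it is not MineSet code.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(LHS and RHS together) divided by support(LHS)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)

# In the display described above, the rule's grid cell would carry a bar
# whose height encodes confidence and a disc whose height encodes support.
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```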
  • MineSet: Association Rules
  • Association Ball Graph (DBMiner)
    • Items are visualized as balls
    • Arrows indicate rule implication
    • Size represents support
  • Classification ( SAS EM [SAS 01])
    • Color corresponds to relative frequency of a class in a node
    • Branch line thickness is proportional to the square root of the number of objects
    • Figure: Tree Viewer
  • Cluster Analysis (H-BLOB: Hierarchical BLOB) [SBG 00]
    • Cluster
    • Form ellipsoids
    • Form blobs (implicit surfaces)
  • H-BLOB
  • Text Mining ( ThemeRiver [WCF+ 00])
    • Visualization of thematic changes in documents
    • Vertical distance indicates collective strength of the themes
  • Data Mining Process Visualization
    • Presentation of the various processes of data mining in visual forms so that users can see the flow of data cleaning, integration, preprocessing, mining
      • Data extraction process
      • Where the data is extracted
      • How the data is cleaned, integrated, preprocessed, and mined
      • Method selected for data mining
      • Where the results are stored
      • How they may be viewed
  • Visualization of Data Mining Processes by Clementine
    • Understand variations with visualized data
    • See your solution discovery process clearly
  • Interactive Visual Data Mining
    • Using visualization tools in the data mining process to help users make smart data mining decisions
    • Example
      • Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by either a circle or a set of columns)
      • Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be
  • Visual data mining
    • Projection Pursuits
    • (Class) Tours [Dhillon et al. ’98]
    • Visual Classification [Ankerst et al. KDD ’99]
  • Projection Pursuits
    • Exploratory projection pursuit:
      • Goal: reduce dimensionality
      • Define “interestingness” index to each possible projection of a data set
      • Maximize this index, project linearly
      • Not always possible/useful
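A toy sketch of exploratory projection pursuit, under loud assumptions: the data is 2D, projections are 1D directions searched on a grid of angles, and the "interestingness" index is the absolute excess kurtosis of the projected values (one common choice that rewards departure from a Gaussian shape). Real implementations use other indices and optimizers; the function names and data are hypothetical.

```python
import math

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

def best_projection(points, steps=180):
    """Grid-search directions in [0, 180) degrees; return (angle, index)."""
    best_angle, best_index = None, -1.0
    for step in range(steps):
        theta = math.pi * step / steps
        c, s = math.cos(theta), math.sin(theta)
        projected = [x * c + y * s for x, y in points]  # linear projection
        index = abs(excess_kurtosis(projected))         # assumed index
        if index > best_index:
            best_angle, best_index = 180.0 * step / steps, index
    return best_angle, best_index

# Hypothetical data: two groups along x (x = -1 or +1), evenly spread along y.
points = [(x, y) for x in (-1.0, 1.0) for y in (-2.0, -1.0, 0.0, 1.0, 2.0)]
angle, index = best_projection(points)
print(angle, index)
```

On this data the search picks the x direction, the one along which the two groups separate.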
  • Class Tours
    • “Visualizing Class Structure of Multidimensional Data” by Dhillon et al. 1998
    • Problem: Visualize multidimensional data categorized into classes
    • Solution: Project data into 2D while preserving distances between class means
  • Class-Preserving Projection: Preserves distances between projected means
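One way to see why distances between class means can survive a 2D projection: for three classes, the three means span a plane, and orthogonal projection onto that plane leaves their pairwise distances unchanged. The sketch below builds such a plane with Gram-Schmidt. It illustrates the idea only and is not the paper's code; the data and function names are hypothetical, and it assumes the three means are not collinear.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def class_mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def class_preserving_basis(m1, m2, m3):
    """Orthonormal basis of the plane through three (non-collinear) class means."""
    u = unit(sub(m2, m1))
    v = sub(m3, m1)
    v = unit(sub(v, tuple(dot(v, u) * x for x in u)))  # Gram-Schmidt step
    return u, v

def project(point, basis):
    """Coordinates of the orthogonal projection in the 2D basis."""
    return tuple(dot(point, b) for b in basis)

# Hypothetical 4-dimensional data in three classes.
classes = {
    "A": [(0, 0, 0, 0), (2, 0, 0, 0)],
    "B": [(0, 3, 0, 1), (0, 5, 0, 1)],
    "C": [(1, 1, 4, 0), (1, 1, 6, 2)],
}
means = {name: class_mean(pts) for name, pts in classes.items()}
basis = class_preserving_basis(means["A"], means["B"], means["C"])
projected = {name: project(m, basis) for name, m in means.items()}

for a, b in (("A", "B"), ("A", "C"), ("B", "C")):
    print(a, b, math.dist(means[a], means[b]), math.dist(projected[a], projected[b]))
```

The same `project` call can be applied to every data point to obtain the 2D scatterplot.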
  • Tours
    • Tours are animated and interpolated sequences of 2D projections [Asimov 1985]
    • Class tours: sequences of class-preserving 2-dimensional projections
    • Captures “inter-class structure of complex, multi-dimensional data”
  • Interactive Visual Mining by Perception-Based Classification (PBC)
  • Visual Classification
    • “Visual Classification: An Interactive Approach to Decision Tree Construction” by Ankerst et al. KDD 99
    • Exploit expert’s domain knowledge and human visual processing
  • Visual Classification
  • Visual Classification Results
    • Comparable classification accuracy
    • Can produce more understandable decision trees
    • Expert domain knowledge can be exploited
  • Audio Data Mining
    • Uses audio signals to indicate the patterns of data or the features of data mining results
      • An interesting alternative to visual mining
      • An inverse task of mining audio (such as music) databases, which is to find patterns in audio data
      • Visual data mining may disclose interesting patterns using graphical displays, but requires users to concentrate on watching patterns
    • Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual
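As a rough illustration of turning data into pitches, the sketch below maps values linearly onto MIDI note numbers and converts notes to frequencies with the equal-temperament formula (note 69 is A4 at 440 Hz). The note range, function names, and data are assumptions chosen for the example; no audio is actually played.

```python
def to_midi_notes(values, low_note=48, high_note=84):
    """Map values linearly onto MIDI note numbers in [low_note, high_note]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [low_note for _ in values]
    return [low_note + round((v - lo) / span * (high_note - low_note)) for v in values]

def note_frequency(note):
    """Equal-temperament frequency in Hz; MIDI note 69 is A4 = 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

# Hypothetical series with one unusual value.
values = [10, 12, 11, 30, 12]
notes = to_midi_notes(values)
for value, note in zip(values, notes):
    print(value, note, round(note_frequency(note), 1))
```

The unusual value lands three octaves above the lowest note, so it would stand out when the sequence is played.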
  • Summary
    • Many visualization methods available
    • How to evaluate and compare methods?
    • Need for:
      • Integrated visualization/exploration systems
      • Studies of interaction techniques for mining
      • Practical case studies
  • Acknowledgments
    • Many slides and images from Mihael Ankerst, Boeing, Daniel A. Keim, AT&T, Tutorial at PKDD'2001
    • Some pictures from Information Visualization in Data Mining and Knowledge Discovery , edited by Usama Fayyad, Georges Grinstein and Andreas Wierse
    • A good set of slides was prepared by Andrew Wu (Spring 2004)