Data Mining
  • http://books.elsevier.com/companions/1558606890/pictures/Chapter_01/fig1-6b.gif
  • “two attributes mapped to axes, remaining attributes mapped to angle or length of limbs”. Look at texture pattern
  • Picture: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
  • Transcript

    • 1. Main Concepts of Data Mining Introduction to Data Preprocessing
    • 2. Learning Objectives <ul><li>Study some examples of data mining systems </li></ul><ul><li>Understand why to preprocess the data. </li></ul><ul><li>Understand how to understand the data (descriptive data summarization) </li></ul>
    • 3. Acknowledgements <ul><li>Some of these slides are adapted from Jiawei Han and Micheline Kamber </li></ul>
    • 4. Learning Objectives <ul><li>Study some examples of data mining systems </li></ul><ul><li>Understand why to preprocess the data. </li></ul><ul><li>Understand how to understand the data (descriptive data summarization) </li></ul>
    • 5. Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Other Disciplines Information Science Machine Learning Visualization
    • 6. Data Mining: Classification Schemes <ul><li>General functionality </li></ul><ul><ul><li>Descriptive data mining </li></ul></ul><ul><ul><li>Predictive data mining </li></ul></ul><ul><li>Different views, different classifications </li></ul><ul><ul><li>Kinds of databases to be mined </li></ul></ul><ul><ul><li>Kinds of knowledge to be discovered </li></ul></ul><ul><ul><li>Kinds of techniques utilized </li></ul></ul><ul><ul><li>Kinds of applications adapted </li></ul></ul>
    • 7. Major Issues in Data Mining (1) <ul><li>Mining methodology and user interaction </li></ul><ul><ul><li>Mining different kinds of knowledge in databases </li></ul></ul><ul><ul><li>Interactive mining of knowledge at multiple levels of abstraction </li></ul></ul><ul><ul><li>Incorporation of background knowledge </li></ul></ul><ul><ul><li>Data mining query languages and ad-hoc data mining </li></ul></ul><ul><ul><li>Expression and visualization of data mining results </li></ul></ul><ul><ul><li>Handling noise and incomplete data </li></ul></ul><ul><ul><li>Pattern evaluation: the interestingness problem </li></ul></ul><ul><li>Performance and scalability </li></ul><ul><ul><li>Efficiency and scalability of data mining algorithms </li></ul></ul><ul><ul><li>Parallel, distributed and incremental mining methods </li></ul></ul>
    • 8. Major Issues in Data Mining (2) <ul><li>Issues relating to the diversity of data types </li></ul><ul><ul><li>Handling relational and complex types of data </li></ul></ul><ul><ul><li>Mining information from heterogeneous databases and global information systems (WWW) </li></ul></ul><ul><li>Issues related to applications and social impacts </li></ul><ul><ul><li>Application of discovered knowledge </li></ul></ul><ul><ul><ul><li>Domain-specific data mining tools </li></ul></ul></ul><ul><ul><ul><li>Intelligent query answering </li></ul></ul></ul><ul><ul><ul><li>Process control and decision making </li></ul></ul></ul><ul><ul><li>Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem </li></ul></ul><ul><ul><li>Protection of data security, integrity, and privacy </li></ul></ul>
    • 9. Main Concepts in Data Mining <ul><li>Data mining: discovering interesting patterns from large amounts of data </li></ul><ul><li>A natural evolution of database technology, in great demand, with wide applications </li></ul><ul><li>A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation </li></ul><ul><li>Mining can be performed in a variety of information repositories </li></ul><ul><li>Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. </li></ul><ul><li>Classification of data mining systems </li></ul><ul><li>Major issues in data mining </li></ul>
    • 10. Case-Based Reasoning <ul><li>Case-based reasoning (CBR) </li></ul><ul><ul><li>Problem-solving method from artificial intelligence (AI) that proposes to reuse previously solved and memorized problem situations, called cases </li></ul></ul><ul><ul><li>Instance-based method from machine learning </li></ul></ul><ul><ul><li>Can be used for classification/prediction tasks </li></ul></ul>
    • 11. Case-Based Reasoning New Case Target case Interpretation Retrieve Reuse Revise Retain Retrieved Case Solved Case Solution Tested Case USER INTERFACE PROBLEM SOLUTION CASE BASE Previous Cases
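The retrieve / reuse / revise / retain cycle shown in the diagram can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the `Case` class, the Euclidean distance used for retrieval, and the optional `revise` callback are all hypothetical choices, not part of the slides.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Case:
    problem: tuple   # numeric feature vector describing the problem (assumed)
    solution: str    # stored solution, here simply a class label (assumed)


def retrieve(case_base, problem):
    """Retrieve: return the stored case whose problem is closest to the new one."""
    return min(case_base, key=lambda c: dist(c.problem, problem))


def solve(case_base, problem, revise=None):
    """Run one pass of the CBR cycle and return the confirmed solution."""
    nearest = retrieve(case_base, problem)              # Retrieve
    proposed = nearest.solution                         # Reuse (copied unchanged)
    confirmed = revise(problem, proposed) if revise else proposed  # Revise
    case_base.append(Case(tuple(problem), confirmed))   # Retain
    return confirmed


case_base = [Case((1.0, 1.0), "low risk"), Case((8.0, 9.0), "high risk")]
answer = solve(case_base, (7.5, 8.0))
```

In a real system the reuse step would usually adapt the retrieved solution rather than copy it, and the revise step would involve testing or expert feedback.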
    • 12. Fifth Workshop on Case-Based Reasoning in the Health Sciences Isabelle Bichindaritz University of Washington, Tacoma, Washington, USA [email_address] Stefania Montani University of Piemonte Orientale, Italy stefania.montani@unipmn.it
    • 13. Workshop Stats <ul><li>Papers accepted: 10 papers </li></ul><ul><li>Attendees: 19 participants </li></ul><ul><li>Good news !!! </li></ul>
    • 14. Workshop Goals <ul><li>Provide a forum for identifying important contributions and opportunities for research on the application of CBR to the Health Sciences </li></ul><ul><li>Promote the systematic study of how to apply CBR to the Health Sciences </li></ul><ul><li>Showcase applications of CBR in the Health Sciences </li></ul>
    • 15. A CBR Solution for Missing Medical Data Olga Vorobieva and Rainer Schmidt Institute for Medical Informatics and Biometry University of Rostock, Germany Alexander Rumiantzev Pavlov State Medical University, St.Petersburg, Russia
    • 16. Summary <ul><li>Application domain: dialysis medicine; effects of fitness on dialysis </li></ul><ul><li>System context: ISOR, a CBR system that explains the exceptional cases – those for which fitness does not improve renal function </li></ul><ul><li>Task / problem addressed: restoration of missing data </li></ul><ul><li>Research hypothesis: case-based reasoning can be applied to restore missing data in a dataset/case base </li></ul><ul><li>Main contribution: synergy between CBR and statistics (statistical modeling). </li></ul>
    • 18. A Case-Based Reasoning Approach to Dose Planning in Radiotherapy Xueyan Song¹, Sanja Petrovic¹, and Santhanam Sundar² ¹Automated Scheduling, Optimisation and Planning Group, School of Computer Science, University of Nottingham, UK ²Dept. of Oncology, City Hospital Campus, Nottingham University Hospitals NHS Trust, Nottingham, UK
    • 19. Summary <ul><li>Application domain: dose planning in radiotherapy for prostate cancer </li></ul><ul><li>System context: trade-off between the benefit in terms of cancer control and the risk in terms of harmful side effects to neighboring tissues </li></ul><ul><li>Task / problem addressed: planning problem – designing a radiotherapy dose plan </li></ul><ul><li>Research hypothesis: case-based reasoning can be applied to propose dose plans </li></ul><ul><li>Main contribution: fuzzy representation of attribute values and similarity measure; fusion of similar cases by Dempster-Shafer theory. </li></ul>
    • 21. On-Line Domain Knowledge Management for Case-Based Medical Recommendation Amélie Cordier¹, Béatrice Fuchs¹, Jean Lieber², and Alain Mille¹ ¹LIRIS CNRS, UMR 5202, Université Lyon 1, INSA Lyon, Université Lyon 2, ECL 43, bd du 11 Novembre 1918, Villeurbanne Cedex, France, {Amelie.Cordier, Beatrice.Fuchs, Alain.Mille}@liris.cnrs.fr ²LORIA (UMR 7503 CNRS–INRIA–Nancy Universities), BP 239, 54506 Vandoeuvre-lès-Nancy, France [email_address]
    • 22. Summary <ul><li>Application domain: breast cancer treatment </li></ul><ul><li>System context: Kasimir is a knowledge management and decision-support system in oncology focusing on case-based protocol treatment recommendations </li></ul><ul><li>Task / problem addressed: planning problem – recommending a treatment plan based on a protocol </li></ul><ul><li>Research hypotheses: conservative adaptation is recommended for adapting a protocol to a new case through case-based reasoning; new domain knowledge can be acquired by analysis of failures </li></ul><ul><li>Main contribution: improvement of adaptation method for learning from failures of the case-based reasoning. </li></ul>
    • 24. Concepts for Novelty Detection and Handling based on Case-Based Reasoning Petra Perner Institute of Computer Vision and applied Computer Sciences, IBaI
    • 25. Summary <ul><li>Application domain: Hep-2 cell image interpretation </li></ul><ul><li>System context: case-based image interpretation </li></ul><ul><li>Task / problem addressed: classification problem – improve recognition of over 30 different nuclear and cytoplasmic patterns when patterns change over time or new patterns emerge </li></ul><ul><li>Research hypothesis: case-based reasoning can be applied to the problem of novelty detection and also of concept drift </li></ul><ul><li>Main contribution: novel application for CBR: detecting novelty, detecting concept drift. </li></ul>
    • 27. Similarity of Medical Cases in Health Care Using Cosine Similarity and Ontology Shahina Begum, Mobyen Uddin Ahmed, Peter Funk, Ning Xiong, Bo von Schéele Mälardalen University, Department of Computer Science and Electronics PO Box 883 SE-721 23, Västerås, Sweden {firstname.lastname}@mdh.se
    • 28. Summary <ul><li>Application domain: any medical domain </li></ul><ul><li>System context: electronic medical records </li></ul><ul><li>Task / problem addressed: retrieval task – finding similar cases represented with structured and semi-structured data </li></ul><ul><li>Research hypothesis: a hybrid similarity measure combining the cosine similarity measure, an ontology, and the nearest neighbor method permits successful retrieval of similar cases </li></ul><ul><li>Main contribution: synergy between case-based reasoning and information retrieval. </li></ul>
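To make the cosine similarity component of that hybrid measure concrete, here is a minimal sketch. It covers only the cosine part and a nearest-neighbor lookup; the ontology component of the paper is not modeled, and the term-weight vectors and case names are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def most_similar(query, cases):
    """Nearest neighbor: name of the stored case with the highest cosine similarity."""
    return max(cases, key=lambda name: cosine_similarity(query, cases[name]))


# Hypothetical term-weight vectors for three stored cases.
cases = {
    "case_a": [1.0, 0.0, 2.0],
    "case_b": [0.0, 3.0, 0.0],
    "case_c": [2.0, 0.0, 3.9],
}
best = most_similar([1.0, 0.0, 2.0], cases)
```

Cosine similarity ignores vector length, so two records that use the same terms in the same proportions score 1 even if one is much longer than the other.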
    • 30. Towards Case-Based Reasoning for Diabetes Management Cindy Marling¹, Jay Shubrook² and Frank Schwartz² ¹School of Electrical Engineering and Computer Science, Russ College of Engineering and Technology, Ohio University, Athens, Ohio 45701, USA [email_address] ²Appalachian Rural Health Institute, Diabetes and Endocrine Center, College of Osteopathic Medicine, Ohio University, Athens, Ohio 45701, USA shubrook@ohio.edu, schwartf@ohio.edu
    • 31. Summary <ul><li>Application domain: type I diabetes management </li></ul><ul><li>System context: real-time monitoring of glucose level through insulin pump </li></ul><ul><li>Task / problem addressed: treatment planning – adjusting insulin dosage </li></ul><ul><li>Research hypotheses: case-based reasoning can adjust insulin dosage in real time; cases required for the future CBR system can be acquired through an online Web-based interface </li></ul><ul><li>Main contribution: planning the development of a case-based reasoning system for automatic type I diabetes monitoring. </li></ul>
    • 32. Hypothetico-Deductive Case-Based Reasoning David McSherry School of Computing and Information Engineering, University of Ulster, Northern Ireland
    • 33. Summary <ul><li>Application domain: contact lenses classification </li></ul><ul><li>System context: conversational CBR </li></ul><ul><li>Task / problem addressed: classification problem – recommending type of contact lenses </li></ul><ul><li>Research hypothesis: a hypothetico-deductive CBR approach to test selection can minimize the number of tests required to confirm a hypothesis proposed by the system or user </li></ul><ul><li>Main contribution: synergy between case-based reasoning and hypothetico-deductive reasoning; explanations in CBR. </li></ul>
    • 35. Other Papers Summaries <ul><li>Case-based Reasoning for managing non-compliance with clinical guidelines, Stefania Montani, University of Piemonte Orientale, Alessandria, Italy. A CBR system able to: </li></ul><ul><ul><li>Retrieve similar past episodes (cases) of non-compliance to guidelines, to be suggested to the physician </li></ul></ul><ul><ul><li>Learn more general indications from ground non-compliance cases, adoptable for a formal guideline (GL) revision by an expert committee </li></ul></ul><ul><li>CBR for Temporal Abstractions Configuration in Haemodialysis, Leonardi Giorgio, Bottrighi Alessio, Portinale Luigi, Montani Stefania, University of Piemonte Orientale, Alessandria, Italy. A CBR system able to choose the appropriate parameters for the configuration of temporal abstractions in the medical domain of haemodialysis </li></ul>
    • 36. Other Papers Summaries <ul><li>Prototypical Cases for Knowledge Maintenance in Biomedical CBR, Isabelle Bichindaritz, University of Washington, Tacoma, WA, USA. Prototypical cases have served various purposes in biomedical CBR systems: organizing and structuring the memory, guiding the retrieval and reuse of cases, and bootstrapping a CBR system's memory when real cases are not available in sufficient quantity and/or quality. Knowledge maintenance is yet another role that these prototypical cases can play in biomedical CBR systems </li></ul>
    • 37. Discussion <ul><li>Trends and issues </li></ul><ul><ul><li>Integration of CBR with electronic patient records and/or in clinical practice (Begum et al., Marling et al.) </li></ul></ul><ul><ul><li>Importance of prototypical cases (Bichindaritz) </li></ul></ul><ul><ul><li>Incompleteness / non-reliability of cases or CBR system knowledge (Vorobieva et al., Cordier et al., Bichindaritz) </li></ul></ul><ul><ul><li>Novel domains of applications for CBR (Perner, Leonardi et al., Montani) </li></ul></ul><ul><ul><li>Need for synergy with other AI methods (Song et al., McSherry) </li></ul></ul>
    • 38. Discussion <ul><li>Pearls of wisdom </li></ul><ul><ul><li>Remember Occam’s razor – introducing complexity in CBR should be carefully justified </li></ul></ul><ul><ul><li>Knowledge in medical cases / domain knowledge is often questionable – finding methods for dealing with this reality is essential for the development of CBR in biomedical domains </li></ul></ul><ul><ul><li>CBR can be promoted as the methodology of choice for evidence gathering in evidence-based medicine </li></ul></ul>
    • 39. Future Plans <ul><li>A second special issue on CBR in the Health Sciences, based on papers from this Fifth Workshop on CBR in the Health Sciences is going to be published in Computational Intelligence. </li></ul><ul><li>The Web-site (version 1.beta) and mailing list for our research group are now live : http://www.cbr-health.org http://www.cbr-biomed.org </li></ul>
    • 42. Learning Objectives <ul><li>Study some examples of data mining systems </li></ul><ul><li>Understand why to preprocess the data. </li></ul><ul><li>Understand how to understand the data (descriptive data summarization) </li></ul>
    • 43. Why Data Preprocessing? <ul><li>Data mining aims at discovering relationships and other forms of knowledge from data in the real world. </li></ul><ul><li>Data map entities in the application domain to symbolic representation through a measurement function. </li></ul><ul><li>Data in the real world is dirty </li></ul><ul><ul><li>incomplete : missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data </li></ul></ul><ul><ul><li>noisy : containing errors, such as measurement errors, or outliers </li></ul></ul><ul><ul><li>inconsistent : containing discrepancies in codes or names </li></ul></ul><ul><ul><li>distorted : sampling distortion </li></ul></ul><ul><li>No quality data, no quality mining results! ( GIGO ) </li></ul><ul><ul><li>Quality decisions must be based on quality data </li></ul></ul><ul><ul><li>Data warehouse needs consistent integration of quality data </li></ul></ul>
    • 44. Multi-Dimensional Measure of Data Quality <ul><li>Data quality is multidimensional: </li></ul><ul><ul><li>Accuracy </li></ul></ul><ul><ul><li>Preciseness (=reliability) </li></ul></ul><ul><ul><li>Completeness </li></ul></ul><ul><ul><li>Consistency </li></ul></ul><ul><ul><li>Timeliness </li></ul></ul><ul><ul><li>Believability (=validity) </li></ul></ul><ul><ul><li>Value added </li></ul></ul><ul><ul><li>Interpretability </li></ul></ul><ul><ul><li>Accessibility </li></ul></ul><ul><li>Broad categories: </li></ul><ul><ul><li>intrinsic, contextual, representational, and accessibility. </li></ul></ul>
    • 45. Major Tasks in Data Preprocessing <ul><li>Data cleaning </li></ul><ul><ul><li>Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies and errors </li></ul></ul><ul><li>Data integration </li></ul><ul><ul><li>Integration of multiple databases, data cubes, or files </li></ul></ul><ul><li>Data transformation </li></ul><ul><ul><li>Normalization and aggregation </li></ul></ul><ul><li>Data reduction </li></ul><ul><ul><li>Obtains reduced representation in volume but produces the same or similar analytical results </li></ul></ul><ul><li>Data discretization </li></ul><ul><ul><li>Part of data reduction but with particular importance, especially for numerical data </li></ul></ul>
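Two of the tasks listed above, filling in missing values (data cleaning) and normalization (data transformation), can be illustrated with a short sketch. Mean imputation and min-max scaling are just one possible choice for each task, and the `ages` data is made up for illustration.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute: nothing to spread
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


ages = [20, None, 40, 60]               # hypothetical attribute with a gap
cleaned = fill_missing_with_mean(ages)  # the gap becomes the mean of 20, 40, 60
scaled = min_max_normalize(cleaned)
```

Mean imputation is simple but shrinks the variance of the attribute, so it is best treated as a baseline rather than a default.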
    • 46. Forms of data preprocessing
    • 47. Learning Objectives <ul><li>Study some examples of data mining systems </li></ul><ul><li>Understand why to preprocess the data. </li></ul><ul><li>Understand how to understand the data (descriptive data summarization) </li></ul>
    • 48. Mining Data Descriptive Characteristics <ul><li>Motivation </li></ul><ul><ul><li>To better understand the data: central tendency, variation and spread </li></ul></ul><ul><li>Data dispersion characteristics </li></ul><ul><ul><li>median, max, min, quantiles, outliers, variance, etc. </li></ul></ul><ul><li>Numerical dimensions correspond to sorted intervals </li></ul><ul><ul><li>Data dispersion: analyzed with multiple granularities of precision </li></ul></ul><ul><ul><li>Boxplot or quantile analysis on sorted intervals </li></ul></ul><ul><li>Dispersion analysis on computed measures </li></ul><ul><ul><li>Folding measures into numerical dimensions </li></ul></ul><ul><ul><li>Boxplot or quantile analysis on the transformed cube </li></ul></ul>
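    • 49. Measuring the Central Tendency <ul><li>Mean (algebraic measure) (sample vs. population): x̄ = (1/n) Σ x_i (sample), μ = (Σ x) / N (population) </li></ul><ul><ul><li>Weighted arithmetic mean: x̄ = (Σ w_i x_i) / (Σ w_i) </li></ul></ul><ul><ul><li>Trimmed mean: chopping extreme values before averaging </li></ul></ul><ul><li>Median: a holistic measure </li></ul><ul><ul><li>Middle value if odd number of values, or average of the middle two values otherwise </li></ul></ul><ul><ul><li>Estimated by interpolation (for grouped data): median ≈ L1 + ((n/2 − (Σ f)_l) / f_median) × c, where L1 is the lower boundary of the median interval, (Σ f)_l the total frequency of the intervals below it, f_median its frequency, and c its width </li></ul></ul><ul><li>Mode </li></ul><ul><ul><li>Value that occurs most frequently in the data </li></ul></ul><ul><ul><li>Unimodal, bimodal, trimodal </li></ul></ul><ul><ul><li>Empirical formula (moderately skewed unimodal data): mean − mode ≈ 3 × (mean − median) </li></ul></ul>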
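The measures on this slide can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the `salaries` values are example data, and `weighted_mean` and `trimmed_mean` are small helper functions written for this example, not library calls.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # number of values chopped per end
    return mean(ordered[k:len(ordered) - k])


# Example data (in thousands); the value 110 pulls the mean upward.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

avg = mean(salaries)                 # sensitive to the extreme value 110
mid = median(salaries)               # average of the two middle values
modes = multimode(salaries)          # every most-frequent value (bimodal here)
trimmed = trimmed_mean(salaries, 0.1)
```

Comparing `avg`, `mid`, and `trimmed` on the same data shows how a single extreme value moves the mean while leaving the median untouched.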
    • 50. Symmetric vs. Skewed Data <ul><li>Median, mean and mode of symmetric, positively and negatively skewed data </li></ul>positively skewed negatively skewed symmetric
    • 51. Measuring the Dispersion of Data <ul><li>Quartiles, outliers and boxplots </li></ul><ul><ul><li>Quartiles: Q1 (25th percentile), Q3 (75th percentile) </li></ul></ul><ul><ul><li>Inter-quartile range: IQR = Q3 – Q1 </li></ul></ul><ul><ul><li>Five-number summary: min, Q1, M, Q3, max </li></ul></ul><ul><ul><li>Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually </li></ul></ul><ul><ul><li>Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1 </li></ul></ul><ul><li>Variance and standard deviation (sample: s, population: σ) </li></ul><ul><ul><li>Variance (algebraic, scalable computation): s² = (1/(n−1)) Σ (x_i − x̄)² (sample), σ² = (1/N) Σ (x_i − μ)² (population) </li></ul></ul><ul><ul><li>Standard deviation s (or σ) is the square root of the variance s² (or σ²) </li></ul></ul>
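A small sketch of the five-number summary, the 1.5 × IQR outlier rule, and the variance, using the standard `statistics` module. Quartile conventions differ between textbooks and tools; the "inclusive" method used here is one assumption among several, and the `data` and `spread` values are made up.

```python
from statistics import quantiles, pvariance, variance


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, m, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, m, q3, max(values)


def iqr_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(data)
outliers = iqr_outliers(data)

spread = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = pvariance(spread)     # population variance: divides by N
sample_var = variance(spread)   # sample variance: divides by n - 1
```

With a different quartile method the fences shift slightly, so borderline points may or may not be flagged as outliers.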
    • 52. Boxplot Analysis <ul><li>Five-number summary of a distribution: </li></ul><ul><ul><li>Minimum, Q1, M, Q3, Maximum </li></ul></ul><ul><li>Boxplot </li></ul><ul><ul><li>Data is represented with a box </li></ul></ul><ul><ul><li>The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR </li></ul></ul><ul><ul><li>The median is marked by a line within the box </li></ul></ul><ul><ul><li>Whiskers: two lines outside the box extend to Minimum and Maximum </li></ul></ul>
    • 53. Visualization of Data Dispersion: 3-D Boxplots
    • 54. Properties of Normal Distribution Curve <ul><li>The normal (distribution) curve </li></ul><ul><ul><li>From μ – σ to μ + σ : contains about 68% of the measurements ( μ : mean, σ : standard deviation) </li></ul></ul><ul><ul><li>From μ –2 σ to μ +2 σ : contains about 95% of it </li></ul></ul><ul><ul><li>From μ –3 σ to μ +3 σ : contains about 99.7% of it </li></ul></ul>
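The 68 / 95 / 99.7 percentages can be checked numerically. For a normal distribution, the fraction of values within k standard deviations of the mean equals erf(k / √2); the function name `fraction_within` below is introduced only for this illustration.

```python
from math import erf, sqrt


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```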
    • 55. Graphic Displays of Basic Statistical Descriptions <ul><li>Boxplot: graphic display of the five-number summary </li></ul><ul><li>Histogram: the x-axis shows values, the y-axis represents frequencies </li></ul><ul><li>Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i </li></ul><ul><li>Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another </li></ul><ul><li>Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane </li></ul><ul><li>Loess (local regression) curve: add a smooth curve to a scatter plot to provide better perception of the pattern of dependence </li></ul>
    • 56. Histogram Analysis <ul><li>Graph displays of basic statistical class descriptions </li></ul><ul><ul><li>Frequency histograms </li></ul></ul><ul><ul><ul><li>A univariate graphical method </li></ul></ul></ul><ul><ul><ul><li>Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data </li></ul></ul></ul>
    • 57. Histograms Often Tell More than Boxplots <ul><li>The two histograms shown on the left may have the same boxplot representation </li></ul><ul><ul><li>The same values for: min, Q1, median, Q3, max </li></ul></ul><ul><li>But they have rather different data distributions </li></ul>
    • 58. Quantile Plot <ul><li>Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) </li></ul><ul><li>Plots quantile information </li></ul><ul><ul><li>For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i </li></ul></ul>
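The points of a quantile plot can be computed as follows. The slide does not say how f_i is obtained; the sketch assumes the common convention f_i = (i − 0.5) / N for the i-th smallest of N values, and the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([40, 10, 30, 20])
# Each tuple is (f_i, x_i); plotting f on the x-axis and x on the y-axis
# gives the quantile plot.
```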
    • 59. Quantile-Quantile (Q-Q) Plot <ul><li>Graphs the quantiles of one univariate distribution against the corresponding quantiles of another </li></ul><ul><li>Allows the user to view whether there is a shift in going from one distribution to another </li></ul>
    • 60. Scatter plot <ul><li>Provides a first look at bivariate data to see clusters of points, outliers, etc </li></ul><ul><li>Each pair of values is treated as a pair of coordinates and plotted as points in the plane </li></ul>
    • 61. Loess Curve <ul><li>Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence </li></ul><ul><li>Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression </li></ul>
    • 62. Positively and Negatively Correlated Data <ul><li>The left half fragment is positively correlated </li></ul><ul><li>The right half is negatively correlated </li></ul>
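One common way to quantify what these scatter plots show is Pearson's correlation coefficient, which is close to +1 for positively correlated data, close to −1 for negatively correlated data, and near 0 when there is no linear relationship. The slides do not name a specific measure, so the choice of Pearson's r here is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing line
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly decreasing line
```

The function divides by zero if either sequence is constant, which matches the fact that correlation is undefined in that case.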
    • 63. Not Correlated Data
    • 64. Data Visualization and Its Methods <ul><li>Why data visualization? </li></ul><ul><ul><li>Gain insight into an information space by mapping data onto graphical primitives </li></ul></ul><ul><ul><li>Provide qualitative overview of large data sets </li></ul></ul><ul><ul><li>Search for patterns, trends, structure, irregularities, relationships among data </li></ul></ul><ul><ul><li>Help find interesting regions and suitable parameters for further quantitative analysis </li></ul></ul><ul><ul><li>Provide a visual proof of computer representations derived </li></ul></ul><ul><li>Typical visualization methods: </li></ul><ul><ul><li>Geometric techniques </li></ul></ul><ul><ul><li>Icon-based techniques </li></ul></ul><ul><ul><li>Hierarchical techniques </li></ul></ul>
    • 65. Direct Data Visualization Ribbons with Twists Based on Vorticity
    • 66. Geometric Techniques <ul><li>Visualization of geometric transformations and projections of the data </li></ul><ul><li>Methods </li></ul><ul><ul><li>Landscapes </li></ul></ul><ul><ul><li>Projection pursuit technique </li></ul></ul><ul><ul><ul><li>Finding meaningful projections of multidimensional data </li></ul></ul></ul><ul><ul><li>Scatterplot matrices </li></ul></ul><ul><ul><li>Prosection views </li></ul></ul><ul><ul><li>Hyperslice </li></ul></ul><ul><ul><li>Parallel coordinates </li></ul></ul>
    • 67. Scatterplot Matrices <ul><li>Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of k(k−1)/2 distinct scatterplots, one per pair of attributes] </li></ul>Used by permission of M. Ward, Worcester Polytechnic Institute
    • 68. Landscapes <ul><li>Visualization of the data as perspective landscape </li></ul><ul><li>The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data </li></ul>news articles visualized as a landscape Used by permission of B. Wright, Visible Decisions Inc.
    • 69. Parallel Coordinates <ul><li>n equidistant axes which are parallel to one of the screen axes and correspond to the attributes </li></ul><ul><li>The axes are scaled to the [minimum, maximum] range of the corresponding attribute </li></ul><ul><li>Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute </li></ul>
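The mapping described above, from a data item to a polygonal line, can be sketched without any plotting library. Each attribute is scaled to its own [minimum, maximum] range, and each row becomes a list of (axis index, height) vertices. The function name, the sample `rows`, and the choice to place constant attributes at mid-height are assumptions made for this illustration.

```python
def to_polylines(rows):
    """Turn each data row into the vertices of its parallel-coordinates polyline."""
    columns = list(zip(*rows))
    ranges = [(min(col), max(col)) for col in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, (value, (lo, hi)) in enumerate(zip(row, ranges)):
            # Scale to [0, 1] along this axis; a constant attribute sits mid-axis.
            height = 0.5 if hi == lo else (value - lo) / (hi - lo)
            line.append((axis, height))
        polylines.append(line)
    return polylines


rows = [(1.0, 200.0, 5.0), (3.0, 100.0, 5.0), (2.0, 150.0, 5.0)]
lines = to_polylines(rows)
```

Drawing each list of vertices as a connected line over evenly spaced vertical axes reproduces the display; crossing lines between two adjacent axes suggest a negative relationship between those attributes.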
    • 70. Parallel Coordinates of a Data Set
    • 71. Icon-based Techniques <ul><li>Visualization of the data values as features of icons </li></ul><ul><li>Methods: </li></ul><ul><ul><li>Chernoff Faces </li></ul></ul><ul><ul><li>Stick Figures </li></ul></ul><ul><ul><li>Shape Coding: </li></ul></ul><ul><ul><li>Color Icons: </li></ul></ul><ul><ul><li>TileBars: The use of small icons representing the relevance feature vectors in document retrieval </li></ul></ul>
    • 72. Chernoff Faces <ul><li>A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc. </li></ul><ul><li>The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson) </li></ul><ul><li>REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993 </li></ul><ul><li>Weisstein, Eric W. "Chernoff Face." From MathWorld, a Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html </li></ul>
    • 73. Stick Figures: census data showing age, income, gender, education, etc. Used by permission of G. Grinstein, University of Massachusetts at Lowell
    • 74. Hierarchical Techniques <ul><li>Visualization of the data using a hierarchical partitioning into subspaces. </li></ul><ul><li>Methods </li></ul><ul><ul><li>Dimensional Stacking </li></ul></ul><ul><ul><li>Worlds-within-Worlds </li></ul></ul><ul><ul><li>Treemap </li></ul></ul><ul><ul><li>Cone Trees </li></ul></ul><ul><ul><li>InfoCube </li></ul></ul>
    • 75. Dimensional Stacking <ul><li>Partitioning of the n-dimensional attribute space into 2-D subspaces which are ‘stacked’ into each other </li></ul><ul><li>Partitioning of the attribute value ranges into classes; the important attributes should be used on the outer levels </li></ul><ul><li>Adequate for data with ordinal attributes of low cardinality </li></ul><ul><li>But difficult to display more than nine dimensions </li></ul><ul><li>Important to map dimensions appropriately </li></ul>
    • 76. Dimensional Stacking <ul><ul><ul><li>Used by permission of M. Ward, Worcester Polytechnic Institute </li></ul></ul></ul>Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
    • 77. Tree-Map <ul><li>Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values </li></ul><ul><li>The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes) </li></ul>MSR Netscan Image
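The alternating partitioning of the x- and y-dimension can be sketched as a "slice-and-dice" layout. This is a simplified illustration: each node is a hypothetical `(name, size, children)` tuple, and the screen is split in proportion to `size` rather than by attribute classes as on the slide.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out a tree as nested rectangles, alternating the split direction per level."""
    if out is None:
        out = []
    name, _, children = node
    out.append((name, x, y, w, h))
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        share = child[1] / total
        if depth % 2 == 0:   # even levels: partition along the x-dimension
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:                # odd levels: partition along the y-dimension
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Hypothetical hierarchy: sizes of inner nodes equal the sum of their children.
tree = ("root", 4, [("a", 1, []), ("b", 3, [("b1", 1, []), ("b2", 2, [])])])
rects = slice_and_dice(tree, 0.0, 0.0, 100.0, 100.0)
```

Every leaf ends up with an area proportional to its size, which is what makes the display screen-filling.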
    • 78. Tree-Map of a File System (Shneiderman)
