SlideShare a Scribd company logo
Exploring Data A preliminary exploration of the data to better understand its characteristics Key motivations of data exploration include Helping to select the right tool for preprocessing or analysis Making use of humans’ abilities to recognize pattern  People can recognize patterns not captured by data analysis tools  Related to the area of Exploratory Data Analysis (EDA) Created by statistician John Tukey
Contd… In EDA, as originally defined by Tukey The focus was on visualization Clustering and anomaly detection were viewed as exploratory techniques In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory In our discussion of data exploration, we focus on Summary statistics Visualization Online Analytical Processing (OLAP)
Iris Sample Data Set  Many of the exploratory data techniques are illustrated with the Iris Plant data set. Three flower types (classes): Setosa Virginica Versicolour Four (non-class) attributes  Sepal width and length  Petal width and length
Summary Statistics Summary statistics  are numbers that summarize properties of the data Summarized properties include frequency, location and spread  Examples: 	location - mean                   	spread - standard deviation
Frequency and Mode The frequency of an attribute value is the percentage of time the value occurs in the data set  For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time. The mode of a an attribute is the most frequent attribute value    The notions of frequency and mode are typically used with categorical data
Percentiles For continuous data, the notion of a percentile is more useful.  Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value     of x such that p% of the observed values of x are less than    .  For instance, the 50th percentile is the value      such that 50% of all values of x are less than
Measures of Location: Mean and Median The mean is the most common measure of the location of a set of points.   However, the mean is very sensitive to outliers.    Thus, the median or a trimmed mean is also commonly used
Measures of Spread: Range and Variance Range is the difference between the max and min The variance or standard deviation is the most common measure of the spread of a set of points
Visualization Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. Humans have a well developed ability to analyze large amounts of information that is presented visually Can detect general patterns and trends Can detect outliers and unusual patterns
Visualization techniques-Histogram Histogram  Usually shows the distribution of values of a single variable Divide the values into bins and show a bar plot of the number of objects in each bin.  The height of each bar indicates the number of objects Shape of histogram depends on the number of bins
Contd…. Ex: petal width 10 bins
Visualization Techniques: Box Plots Invented by J. Tukey Another way of displaying the distribution of data  Following figure shows the basic part of a box plot outlier 75th percentile 50th percentile 25th percentile 10th percentile 10th percentile
Visualization Techniques: Scatter Plots Scatter plots  Attributes values determine the position Two-dimensional scatter plots most common, but can have three-dimensional scatter plots Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects  It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
Contd…
Visualization Techniques: Contour Plots Contour plots  Useful when a continuous attribute is measured on a spatial grid They partition the plane into regions of similar values The contour lines that form the boundaries of these regions connect points with equal values	 The most common example is contour maps of elevation
Contour Plot Example: SST Dec, 1998
Visualization Techniques: Matrix Plots Matrix plots  Can plot the data matrix This can be useful when objects are sorted according to class Typically, the attributes are normalized to prevent one attribute from dominating the plot	 Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects Examples of matrix plots are p
Visualization of the Iris Data Matrix
Visualization Techniques: Parallel Coordinates Parallel Coordinates  Used to plot the attribute values of high-dimensional data Instead of using perpendicular axes, use a set of parallel axes  The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line	 Thus, each object is represented as a line
Other Visualization Techniques Star Plots  Similar approach to parallel coordinates, but axes radiate from a central point The line connecting the values of an object is a polygon
Contd… Chernoff Faces Approach created by Herman Chernoff This approach associates each attribute with a characteristic of a face The values of each attribute determine the appearance of the corresponding facial characteristic	 Each object becomes a separate face Relies on human’s ability to distinguish faces
OLAP On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database. Relational databases put data into tables, while OLAP uses a multidimensional array representation.  Such representations of data previously existed in statistics and other fields There are a number of data analysis and data exploration operations that are easier with such a data representation.
OLAP Operations: Data Cube The key operation of a OLAP is the formation of a data cube A data cube is a multidimensional representation of data, together with all possible aggregates. By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions. For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional entry with three entries, each of which gives the number of flowers of each type.
OLAP Operations: Slicing and Dicing Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.  Dicing involves selecting a subset of cells by specifying a range of attribute values.  This is equivalent to defining a subarray from the complete array.  In practice, both operations can also be accompanied by aggregation over some dimensions.
OLAP Operations: Roll-up and Drill-down Attribute values often have a hierarchical structure. Each date is associated with a year, month, and week. A location is associated with a continent, country, state (province, etc.), and city.  Products can be divided into various categories, such as clothing, electronics, and furniture.
Contd… Note that these categories often nest and form a tree or lattice A year contains months which contains day A country contains a state which contains a city
Contd… This hierarchical structure gives rise to the roll-up and drill-down operations. For sales data, we can aggregate (roll up) the sales across all the dates in a month.  Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals. Likewise, we can drill down or roll up on the location or product ID attributes.
Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net

More Related Content

What's hot

Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
lavanya marichamy
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
Salah Amean
 
Adbms 11 object structure and type constructor
Adbms 11 object structure and type constructorAdbms 11 object structure and type constructor
Adbms 11 object structure and type constructor
Vaibhav Khanna
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
Khwaja Aamer
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialBilkent University
 
Data Visualization.pptx
Data Visualization.pptxData Visualization.pptx
Data Visualization.pptx
Ultimate Multimedia Consult
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
Kamal Acharya
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
Peter Reimann
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Gajanand Sharma
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
DataminingTools Inc
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
Gopal Sakarkar
 
Architecture of data mining system
Architecture of data mining systemArchitecture of data mining system
Architecture of data mining system
ramya marichamy
 
Data preparation
Data preparationData preparation
Data preparation
Tony Nguyen
 
Classification of data
Classification of dataClassification of data
Classification of data
Dr. C.V. Suresh Babu
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
Krish_ver2
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
Sreenatha Reddy K R
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
Gramener
 

What's hot (20)

Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Adbms 11 object structure and type constructor
Adbms 11 object structure and type constructorAdbms 11 object structure and type constructor
Adbms 11 object structure and type constructor
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorial
 
Data Visualization.pptx
Data Visualization.pptxData Visualization.pptx
Data Visualization.pptx
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Data Mining
Data MiningData Mining
Data Mining
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Architecture of data mining system
Architecture of data mining systemArchitecture of data mining system
Architecture of data mining system
 
Data preparation
Data preparationData preparation
Data preparation
 
Classification of data
Classification of dataClassification of data
Classification of data
 
Ontology engineering
Ontology engineering Ontology engineering
Ontology engineering
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 

Viewers also liked

Chapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
error007
 
2015 06-27 use-r2015_dendextend_tal_galili_01
2015 06-27 use-r2015_dendextend_tal_galili_012015 06-27 use-r2015_dendextend_tal_galili_01
2015 06-27 use-r2015_dendextend_tal_galili_01
Tal Galili
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
Yanchang Zhao
 
Machine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and PreparationMachine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and Preparation
Pier Luca Lanzi
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
Krish_ver2
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
Benazir Income Support Program (BISP)
 
Data Quality Definitions
Data Quality DefinitionsData Quality Definitions
Data Quality Definitions
Michael Küsters
 

Viewers also liked (7)

Chapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.4 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
2015 06-27 use-r2015_dendextend_tal_galili_01
2015 06-27 use-r2015_dendextend_tal_galili_012015 06-27 use-r2015_dendextend_tal_galili_01
2015 06-27 use-r2015_dendextend_tal_galili_01
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Machine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and PreparationMachine Learning and Data Mining: 15 Data Exploration and Preparation
Machine Learning and Data Mining: 15 Data Exploration and Preparation
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Data Quality Definitions
Data Quality DefinitionsData Quality Definitions
Data Quality Definitions
 

Similar to Exploring Data

Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3
OllieShoresna
 
17329274.ppt
17329274.ppt17329274.ppt
17329274.ppt
KowsalyaG17
 
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfGraphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Himakshi7
 
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
deboshreechatterjee2
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docx
scet315
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
Exploring Data (1).pptx
Exploring Data (1).pptxExploring Data (1).pptx
Exploring Data (1).pptx
gina458018
 
Statistics Class 10 CBSE
Statistics Class 10 CBSE Statistics Class 10 CBSE
Statistics Class 10 CBSE
Smitha Sumod
 
Research Methodology-Data Processing
Research Methodology-Data ProcessingResearch Methodology-Data Processing
Research Methodology-Data Processing
DrMAlagupriyasafiq
 
Research methodology-Research Report
Research methodology-Research ReportResearch methodology-Research Report
Research methodology-Research Report
DrMAlagupriyasafiq
 
Exploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data AnalyticsExploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data Analytics
harshrnotaria
 
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg Girls High
 
Edited economic statistics note
Edited economic statistics noteEdited economic statistics note
Edited economic statistics note
haramaya university
 
Visualizations in Exploratory Data Analysis
Visualizations in Exploratory Data AnalysisVisualizations in Exploratory Data Analysis
Visualizations in Exploratory Data Analysis
OluwatobiAdefami
 
Introduction to Descriptive Statistics
Introduction to Descriptive StatisticsIntroduction to Descriptive Statistics
Introduction to Descriptive Statistics
Sanju Rusara Seneviratne
 
Unit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptxUnit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptx
JANNU VINAY
 
QQ Plot.pptx
QQ Plot.pptxQQ Plot.pptx
QQ Plot.pptx
Rahul Borate
 
Bba 2001
Bba 2001Bba 2001
Bba 2001
Monie Joey
 
Survey data & sampling
Survey data & samplingSurvey data & sampling
Survey data & sampling
Syed Iqrar Hussain
 
DATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxDATA VISUALIZATION.pptx
DATA VISUALIZATION.pptx
PraneethBhai1
 

Similar to Exploring Data (20)

Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3
 
17329274.ppt
17329274.ppt17329274.ppt
17329274.ppt
 
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfGraphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
 
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docx
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
Exploring Data (1).pptx
Exploring Data (1).pptxExploring Data (1).pptx
Exploring Data (1).pptx
 
Statistics Class 10 CBSE
Statistics Class 10 CBSE Statistics Class 10 CBSE
Statistics Class 10 CBSE
 
Research Methodology-Data Processing
Research Methodology-Data ProcessingResearch Methodology-Data Processing
Research Methodology-Data Processing
 
Research methodology-Research Report
Research methodology-Research ReportResearch methodology-Research Report
Research methodology-Research Report
 
Exploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data AnalyticsExploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data Analytics
 
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statistics
 
Edited economic statistics note
Edited economic statistics noteEdited economic statistics note
Edited economic statistics note
 
Visualizations in Exploratory Data Analysis
Visualizations in Exploratory Data AnalysisVisualizations in Exploratory Data Analysis
Visualizations in Exploratory Data Analysis
 
Introduction to Descriptive Statistics
Introduction to Descriptive StatisticsIntroduction to Descriptive Statistics
Introduction to Descriptive Statistics
 
Unit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptxUnit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptx
 
QQ Plot.pptx
QQ Plot.pptxQQ Plot.pptx
QQ Plot.pptx
 
Bba 2001
Bba 2001Bba 2001
Bba 2001
 
Survey data & sampling
Survey data & samplingSurvey data & sampling
Survey data & sampling
 
DATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxDATA VISUALIZATION.pptx
DATA VISUALIZATION.pptx
 

More from Datamining Tools

Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
Datamining Tools
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
Datamining Tools
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
Datamining Tools
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
Datamining Tools
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
Datamining Tools
 
Data Mining: Data warehouse and olap technology
Data Mining: Data warehouse and olap technologyData Mining: Data warehouse and olap technology
Data Mining: Data warehouse and olap technology
Datamining Tools
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processing
Datamining Tools
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
Datamining Tools
 
Data mining: Classification and Prediction
Data mining: Classification and PredictionData mining: Classification and Prediction
Data mining: Classification and Prediction
Datamining Tools
 
Data Mining: Data mining classification and analysis
Data Mining: Data mining classification and analysisData Mining: Data mining classification and analysis
Data Mining: Data mining classification and analysis
Datamining Tools
 
Data Mining: Data mining and key definitions
Data Mining: Data mining and key definitionsData Mining: Data mining and key definitions
Data Mining: Data mining and key definitions
Datamining Tools
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalization
Datamining Tools
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data mining
Datamining Tools
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
Datamining Tools
 
AI: Planning and AI
AI: Planning and AIAI: Planning and AI
AI: Planning and AI
Datamining Tools
 
AI: Logic in AI 2
AI: Logic in AI 2AI: Logic in AI 2
AI: Logic in AI 2
Datamining Tools
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
Datamining Tools
 
AI: Learning in AI 2
AI: Learning in AI  2AI: Learning in AI  2
AI: Learning in AI 2
Datamining Tools
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
Datamining Tools
 
AI: Introduction to artificial intelligence
AI: Introduction to artificial intelligenceAI: Introduction to artificial intelligence
AI: Introduction to artificial intelligence
Datamining Tools
 

More from Datamining Tools (20)

Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
Data Mining: Data warehouse and olap technology
Data Mining: Data warehouse and olap technologyData Mining: Data warehouse and olap technology
Data Mining: Data warehouse and olap technology
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processing
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Data mining: Classification and Prediction
Data mining: Classification and PredictionData mining: Classification and Prediction
Data mining: Classification and Prediction
 
Data Mining: Data mining classification and analysis
Data Mining: Data mining classification and analysisData Mining: Data mining classification and analysis
Data Mining: Data mining classification and analysis
 
Data Mining: Data mining and key definitions
Data Mining: Data mining and key definitionsData Mining: Data mining and key definitions
Data Mining: Data mining and key definitions
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalization
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data mining
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
AI: Planning and AI
AI: Planning and AIAI: Planning and AI
AI: Planning and AI
 
AI: Logic in AI 2
AI: Logic in AI 2AI: Logic in AI 2
AI: Logic in AI 2
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
AI: Learning in AI 2
AI: Learning in AI  2AI: Learning in AI  2
AI: Learning in AI 2
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
 
AI: Introduction to artificial intelligence
AI: Introduction to artificial intelligenceAI: Introduction to artificial intelligence
AI: Introduction to artificial intelligence
 

Recently uploaded

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 

Recently uploaded (20)

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 

Exploring Data

  • 1. Exploring Data A preliminary exploration of the data to better understand its characteristics Key motivations of data exploration include Helping to select the right tool for preprocessing or analysis Making use of humans’ abilities to recognize pattern People can recognize patterns not captured by data analysis tools Related to the area of Exploratory Data Analysis (EDA) Created by statistician John Tukey
  • 2. Contd… In EDA, as originally defined by Tukey The focus was on visualization Clustering and anomaly detection were viewed as exploratory techniques In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory In our discussion of data exploration, we focus on Summary statistics Visualization Online Analytical Processing (OLAP)
  • 3. Iris Sample Data Set Many of the exploratory data techniques are illustrated with the Iris Plant data set. Three flower types (classes): Setosa Virginica Versicolour Four (non-class) attributes Sepal width and length Petal width and length
  • 4. Summary Statistics Summary statistics are numbers that summarize properties of the data Summarized properties include frequency, location and spread Examples: location - mean spread - standard deviation
  • 5. Frequency and Mode The frequency of an attribute value is the percentage of time the value occurs in the data set For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time. The mode of a an attribute is the most frequent attribute value The notions of frequency and mode are typically used with categorical data
  • 6. Percentiles For continuous data, the notion of a percentile is more useful. Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value of x such that p% of the observed values of x are less than . For instance, the 50th percentile is the value such that 50% of all values of x are less than
  • 7. Measures of Location: Mean and Median The mean is the most common measure of the location of a set of points. However, the mean is very sensitive to outliers. Thus, the median or a trimmed mean is also commonly used
  • 8. Measures of Spread: Range and Variance Range is the difference between the max and min The variance or standard deviation is the most common measure of the spread of a set of points
  • 9. Visualization Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. Humans have a well developed ability to analyze large amounts of information that is presented visually Can detect general patterns and trends Can detect outliers and unusual patterns
  • 10. Visualization techniques-Histogram Histogram Usually shows the distribution of values of a single variable Divide the values into bins and show a bar plot of the number of objects in each bin. The height of each bar indicates the number of objects Shape of histogram depends on the number of bins
  • 11. Contd…. Ex: petal width 10 bins
  • 12. Visualization Techniques: Box Plots Invented by J. Tukey Another way of displaying the distribution of data Following figure shows the basic part of a box plot outlier 75th percentile 50th percentile 25th percentile 10th percentile 10th percentile
  • 13. Visualization Techniques: Scatter Plots Scatter plots Attributes values determine the position Two-dimensional scatter plots most common, but can have three-dimensional scatter plots Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
  • 15. Visualization Techniques: Contour Plots Contour plots Useful when a continuous attribute is measured on a spatial grid They partition the plane into regions of similar values The contour lines that form the boundaries of these regions connect points with equal values The most common example is contour maps of elevation
  • 16. Contour Plot Example: SST Dec, 1998
  • 17. Visualization Techniques: Matrix Plots Matrix plots Can plot the data matrix This can be useful when objects are sorted according to class Typically, the attributes are normalized to prevent one attribute from dominating the plot Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects Examples of matrix plots are p
  • 18. Visualization of the Iris Data Matrix
  • 19. Visualization Techniques: Parallel Coordinates Parallel Coordinates Used to plot the attribute values of high-dimensional data Instead of using perpendicular axes, use a set of parallel axes The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line Thus, each object is represented as a line
  • 20. Other Visualization Techniques Star Plots Similar approach to parallel coordinates, but axes radiate from a central point The line connecting the values of an object is a polygon
  • 21. Contd… Chernoff Faces Approach created by Herman Chernoff This approach associates each attribute with a characteristic of a face The values of each attribute determine the appearance of the corresponding facial characteristic Each object becomes a separate face Relies on human’s ability to distinguish faces
  • 22. OLAP On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database. Relational databases put data into tables, while OLAP uses a multidimensional array representation. Such representations of data previously existed in statistics and other fields There are a number of data analysis and data exploration operations that are easier with such a data representation.
  • 23. OLAP Operations: Data Cube The key operation of a OLAP is the formation of a data cube A data cube is a multidimensional representation of data, together with all possible aggregates. By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions. For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional entry with three entries, each of which gives the number of flowers of each type.
  • 24. OLAP Operations: Slicing and Dicing Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions. Dicing involves selecting a subset of cells by specifying a range of attribute values. This is equivalent to defining a subarray from the complete array. In practice, both operations can also be accompanied by aggregation over some dimensions.
  • 25. OLAP Operations: Roll-up and Drill-down Attribute values often have a hierarchical structure. Each date is associated with a year, month, and week. A location is associated with a continent, country, state (province, etc.), and city. Products can be divided into various categories, such as clothing, electronics, and furniture.
  • 26. Contd… Note that these categories often nest and form a tree or lattice A year contains months which contains day A country contains a state which contains a city
  • 27. Contd… This hierarchical structure gives rise to the roll-up and drill-down operations. For sales data, we can aggregate (roll up) the sales across all the dates in a month. Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals. Likewise, we can drill down or roll up on the location or product ID attributes.
  • 28. Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net