The stem-and-leaf display shows that the estimated percentage of households with only wireless phone service ranges from 5.6% to 20.0%, with most values centered around 16%. There is one unusual low value of 8.0% and the distribution appears roughly symmetrical.
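For readers unfamiliar with the display, a stem-and-leaf plot can be sketched in a few lines of Python. The values below are made up for illustration and are not the survey data summarized above; the function name `stem_and_leaf` is likewise hypothetical, and the sketch assumes non-negative values with one decimal place.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative values with one decimal place.

    The stem is the integer part and the leaf is the first decimal digit.
    """
    plot = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # work in integer tenths to avoid float noise
        stem, leaf = divmod(tenths, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical percentages, NOT the data behind the display described above.
sample = [5.6, 8.0, 14.2, 15.9, 16.1, 16.4, 16.8, 17.3, 20.0]

for stem, leaves in stem_and_leaf(sample).items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one stem followed by its leaves, so long rows mark where the values concentrate and isolated rows mark unusual values.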
Data Mining: Concepts and Techniques — Chapter 2 —Salah Amean
The presentation contains the following:
-Data Objects and Attribute Types.
-Basic Statistical Descriptions of Data.
-Data Visualization.
-Measuring Data Similarity and Dissimilarity.
-Summary.
This document discusses data and attributes in data mining. It defines data as a collection of objects and their properties or attributes. Attributes can be nominal, ordinal, interval or ratio. The document describes different types of attributes and data sets, as well as important characteristics like dimensionality and sparsity. It also covers data quality issues, preprocessing techniques like aggregation, sampling and feature selection, and measures of similarity and dissimilarity between data objects.
This technical report explores using set-valued attributes for decision tree induction algorithms. Conventional algorithms use single-valued attributes, but the authors argue set-valued attributes can improve accuracy and speed. They describe modifying decision tree algorithms for splitting, pruning, and classification when attributes can have set values. Experiments show the proposed approach works well with only simple pre-pruning needed to limit excessive instance replication across tree branches. The set-valued approach is intended to better handle noise and variability in data values.
This document describes a lesson on measures of variation. The lesson introduces concepts like standard deviation and variance as measures of risk. Students will analyze stock return data for two stocks (A and B) and calculate summary statistics. They will discover that investing half in each stock reduces risk compared to investing fully in one stock, as the standard deviation is lower for a mixed portfolio. The lesson aims to show students that variation measures provide important information beyond just averages.
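The diversification effect described in the lesson can be checked numerically. The sketch below uses invented yearly returns for stocks A and B (not the lesson's data) and compares the sample standard deviation of each stock with that of a portfolio split evenly between them. The reduction shown depends on the two return series not moving in lockstep.

```python
import statistics

# Hypothetical yearly returns (in %) for two stocks; not data from the lesson.
returns_a = [12.0, -4.0, 9.0, -2.0, 10.0]
returns_b = [-3.0, 11.0, -1.0, 8.0, 5.0]

# A portfolio with half of the money in each stock earns the average of the two returns.
returns_mix = [(a + b) / 2 for a, b in zip(returns_a, returns_b)]

sd_a = statistics.stdev(returns_a)      # sample standard deviation of stock A
sd_b = statistics.stdev(returns_b)      # sample standard deviation of stock B
sd_mix = statistics.stdev(returns_mix)  # sample standard deviation of the 50/50 mix

print(f"std A   = {sd_a:.2f}")
print(f"std B   = {sd_b:.2f}")
print(f"std mix = {sd_mix:.2f}")
```

With these invented numbers the mixed portfolio has a lower standard deviation than either stock alone, while its mean return is simply the average of the two means.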
A random forest is an ensemble learning method that constructs multiple decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees. It improves upon a single decision tree by reducing variance. The algorithm works by:
1) Randomly sampling cases and variables to grow each tree.
2) Splitting nodes using the Gini index or information gain on the randomly selected variables.
3) Growing each tree fully without pruning.
4) Aggregating the predictions of all trees using a majority vote. This reduces variance compared to a single decision tree.
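The sampling and voting steps above can be sketched with the standard library alone. This is not a full random forest: tree growing (steps 2 and 3) is omitted, and the helper names `bootstrap_sample`, `random_features`, and `majority_vote`, as well as the toy rows, are assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Step 1 (cases): draw len(rows) cases with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_features(n_features, k, rng):
    """Step 1 (variables): pick k distinct candidate column indices."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Step 4: return the most common class among the trees' predictions.

    On a tie, the class that was seen first wins.
    """
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [(5.1, 3.5, "a"), (6.2, 2.9, "b"), (4.9, 3.0, "a"), (6.7, 3.1, "b")]

sample = bootstrap_sample(rows, rng)            # training cases for one tree
features = random_features(n_features=2, k=1, rng=rng)
print(len(sample), features)

# Pretend five trees voted on one new case.
print(majority_vote(["a", "b", "a", "a", "b"]))
```

Because every tree sees a different bootstrap sample and a different subset of variables, the individual trees make partly independent errors, which is what lets the vote reduce variance.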
This document provides an overview of cluster analysis techniques. It begins with definitions of cluster analysis and discusses how it differs from discriminant analysis.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document discusses Classification and Regression Trees (CART), a data mining technique for classification and regression. CART builds decision trees by recursively splitting data into purer child nodes based on a split criterion, with the goal of minimizing heterogeneity. It describes an eight-step CART generation process: 1) testing all possible splits of variables, 2) evaluating splits using reduction in impurity, 3) selecting the best split, 4) repeating for all variables, 5) selecting the split with the most reduction in impurity, 6) assigning classes, 7) repeating on child nodes, and 8) pruning trees to avoid overfitting.
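Steps 1 to 3 of that process can be illustrated for a single numeric variable. The sketch below assumes Gini impurity as the split criterion (the summary does not say which impurity measure the document uses) and tries every midpoint between neighbouring distinct values; the function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Try every midpoint between sorted distinct values of one numeric
    variable and return (threshold, impurity_reduction) for the best one."""
    parent = gini(labels)
    n = len(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Child impurities are weighted by the share of cases in each child.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        reduction = parent - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

# Hypothetical toy data: one numeric variable and a binary class.
ages = [22, 25, 31, 38, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(ages, bought))
```

On this toy data the threshold 34.5 separates the two classes perfectly, so the impurity drops from 0.5 in the parent to 0 in both children. A full CART implementation would repeat the search over every variable and then recurse into the child nodes.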
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
CART: Not only Classification and Regression Trees, by Marc Garcia
Decision trees are very simple methods compared to Support Vector Machines, or Deep Learning. But they have some interesting properties that make them unique. For classification, for regression, or to extract probabilities, decision trees are easy to set up, and debug. And they are excellent to get a better understanding of your data.
This talk will cover Decision Trees, from theory, to their implementation in Python.
The talk will have a very practical approach, using examples and real cases to illustrate how to use decision trees, what we can expect from using them, and what kind of problems we will need to address.
The main topics covered will include:
* What are decision trees?
* How are decision trees trained?
* Data preprocessing for decision trees
* Understanding your data better with decision tree visualization
* Debugging decision trees using common sense and prior domain knowledge
* Avoiding overfitting, without cross-validation
* Python implementation
* Performance
The document discusses decision tree algorithms. It begins with an introduction and example, then covers the principles of entropy and information gain used to build decision trees. It provides explanations of key concepts like entropy, information gain, and how decision trees are constructed and evaluated. Examples are given to illustrate these concepts. The document concludes with strengths and weaknesses of decision tree algorithms.
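Entropy and information gain are easy to compute by hand for small examples. The following sketch uses base-2 logarithms (entropy in bits) and an invented two-way split; the function names are assumptions for illustration and are not taken from the document.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the groups
    produced by splitting on some attribute."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Hypothetical toy example: 4 "yes" and 4 "no" cases split by a two-valued attribute.
parent = ["yes"] * 4 + ["no"] * 4
split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]

print(round(entropy(parent), 3))
print(round(information_gain(parent, split), 3))
```

A tree builder would compute this gain for every candidate attribute and split on the one with the largest value.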
This document provides an overview of key concepts for describing numerical data, including measures of central tendency (such as the mean, median, mode, weighted mean, and geometric mean) and measures of dispersion (such as the range, mean deviation, variance and standard deviation). It defines each measure and provides examples to demonstrate how to calculate and interpret the measures. The learning objectives cover explaining the concept of central tendency, identifying and computing various measures of central tendency and dispersion, and applying the measures to analyze datasets.
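Most of these measures are available in Python's standard `statistics` module, and the rest are one-liners. The data and weights below are invented for illustration, and the variance and standard deviation shown are the sample versions (dividing by n - 1), which may differ from the convention used in the document.

```python
import statistics

data = [4, 8, 8, 5, 3, 8, 6]      # hypothetical sample
weights = [1, 2, 2, 1, 1, 2, 1]   # hypothetical weights for the weighted mean

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)
geometric_mean = statistics.geometric_mean(data)   # requires Python 3.8+

# Measures of dispersion
value_range = max(data) - min(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
variance = statistics.variance(data)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)       # square root of the sample variance

print(mean, median, mode, round(weighted_mean, 2), round(geometric_mean, 2))
print(value_range, round(mean_deviation, 2), round(variance, 2), round(std_dev, 2))
```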
This presentation covers Classification and Regression Trees (CART): the CART decision tree methodology, classification trees, regression trees, differences in CART, when to use CART, advantages of CART, limitations of CART, and what CART is in machine learning.
For more topics stay tuned with Learnbay.
1. Discretization involves dividing the range of continuous attributes into intervals to reduce data size. Concept hierarchy formation recursively groups low-level concepts like numeric values into higher-level concepts like age groups.
2. Common techniques for discretization and concept hierarchy generation include binning, histogram analysis, clustering analysis, and entropy-based discretization. These techniques can be applied recursively to generate hierarchies.
3. Discretization and concept hierarchies reduce data size, provide more meaningful interpretations, and make data mining and analysis easier.
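As a minimal sketch of the two ideas in this summary, the code below shows equal-width binning (one of the binning techniques mentioned) and a hand-written concept hierarchy that maps numeric ages to age groups. The cut-offs 18 and 65, the group labels, and the sample ages are assumptions chosen for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def age_group(age):
    """A hand-written concept hierarchy: numeric age -> higher-level label."""
    if age < 18:
        return "youth"
    if age < 65:
        return "adult"
    return "senior"

ages = [3, 17, 18, 25, 40, 64, 65, 83]   # hypothetical values
print(equal_width_bins(ages, 4))
print([age_group(a) for a in ages])
```

Either mapping replaces many distinct numeric values with a handful of labels, which is the data-size reduction the summary refers to.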
This presentation guides you through Linear Discriminant Analysis (LDA): an overview, the assumptions of LDA, and preparing the data for LDA.
Cluster analysis is a major tool in a number of applications in many fields, including business and engineering (Theodoridis and Koutroumbas, 1999):
Data reduction.
Hypothesis generation.
Hypothesis testing.
Prediction based on groups.
pratik meshram-Unit 5 (contemporary mkt r sch), by Pratik Meshram
This document discusses various data analysis techniques including cluster analysis, multidimensional scaling, perceptual mapping, and discriminant analysis. It provides details on cluster analysis methods and processes. Cluster analysis involves grouping similar observations into clusters so that observations within a cluster are more similar to each other than observations in other clusters. The document discusses different clustering algorithms and applications. It also provides an example of using cluster analysis to segment customers of an auto insurance company based on preferences.
This document discusses various methods for evaluating and improving the accuracy of classification models, including:
- Confusion matrices and measures like accuracy, sensitivity, and precision to evaluate classifier performance.
- Ensemble methods like bagging and boosting that combine multiple models to improve accuracy. Bagging averages predictions from models trained on bootstrap samples, while boosting gives higher weight to instances harder to classify.
- Model selection techniques like statistical tests and ROC curves to compare models and determine the best performing one. ROC curves show the tradeoff between true and false positives for threshold-based classifiers.
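The confusion-matrix measures in the first bullet can be computed directly from paired lists of actual and predicted labels. The labels below are invented, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, fn, tn

# Hypothetical labels for eight test cases.
actual    = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="pos")
accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all cases classified correctly
sensitivity = tp / (tp + fn)                 # share of actual positives that were found
precision = tp / (tp + fp)                   # share of predicted positives that were right

print(tp, fp, fn, tn)
print(accuracy, sensitivity, precision)
```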
Cluster analysis is often described as an unsupervised ('non-supervised') classification technique.
It is a multivariate technique used to determine group membership for cases or variables.
The document contains a frequency distribution table that summarizes data from a questionnaire regarding a university computer service system. The table includes frequencies and percentages for respondent demographics like gender (72% male, 28% female), age group (50% between 21-23 years old, 34% 24-26, 16% over 26), GPA range, and academic program of study. A second section of the questionnaire assessed satisfaction levels of system quality on a 5-point scale.
The document discusses decision tree algorithms. It begins with an introduction and example, then covers the principles of entropy and information gain used to build decision trees. It provides explanations of key concepts like evaluating decision trees using training and testing accuracy. The document concludes with strengths and weaknesses of decision tree algorithms.
This document describes using decision trees and linear regression for a statistical learning project on housing data. It discusses building decision trees and regression trees on latitude, longitude and other variables to predict housing prices. Linear regression performs poorly with an R-squared of 0.24, while regression trees more accurately identify areas with above-median home values. Further optimizing the regression tree with additional variables like income and population improves the model fit and predictions.
This document provides an overview and agenda for a presentation on multivariate analysis. It introduces the presenter, Dr. Nisha Arora, and lists her qualifications and areas of expertise, which include statistics, data analysis, machine learning, and online teaching. The presentation agenda covers topics like cluster analysis using SPSS, including different clustering algorithms, applications of cluster analysis, and how to interpret and validate clustering outputs and solutions.
Cluster analysis is a descriptive technique that groups similar objects into clusters. It finds natural groupings within data according to characteristics in the data. Cluster analysis is used for taxonomy development, data simplification, and relationship identification. Some applications of cluster analysis include market segmentation in marketing, grouping users on social networks, and reducing markers on maps. It requires representative data and assumes groups will be sufficiently sized and not distorted by outliers.
This chapter discusses descriptive statistics including organizing and graphing qualitative and quantitative data, measures of central tendency, and measures of dispersion. It covers frequency distributions, histograms, polygons, measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation), skewness, and cumulative frequency distributions. The objectives are to describe and interpret graphical displays of data, compute various statistical measures, and identify shapes of distributions.
This document discusses frequency distributions and graphical presentations of data. It defines frequency distributions as the pattern of frequencies of a variable's values or grouped values. There are four main types of frequency distributions: ungrouped, grouped, relative, and cumulative. The document also describes three common graphical presentations: pie charts to show relative frequencies of categorical variables, bar charts to display frequency distributions of categorical variables, and histograms to illustrate quantitative variable distributions. The purpose of graphical presentations is to visually compare and relate data.
This document provides an introduction to quantitative methods and statistics. It defines statistics as the science of collecting, organizing, presenting, analyzing and interpreting data to assist in decision making. It outlines descriptive and inferential statistics, and describes variables, levels of measurement, characteristics of statistical data, uses of statistics, and limitations of statistics. It also discusses topics such as frequency distributions, measures of central tendency including the mean, median and mode, and measures of dispersion.
This chapter discusses how to organize and present both qualitative and quantitative data using frequency tables, bar charts, pie charts, histograms, frequency polygons, and cumulative frequency distributions. It provides examples of how to construct frequency tables by determining the number of classes, class width, and class limits. It also explains how to convert frequency distributions to relative frequency distributions and how to represent the distributions graphically.
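The construction described here (choose the number of classes, derive a class width, set class limits, then convert to relative and cumulative frequencies) can be sketched as follows. The sketch assumes integer data with inclusive class limits and rounds the class width up; the chapter may use a different rule for choosing the width, and the scores are invented.

```python
import math

def frequency_table(values, n_classes):
    """Group integer values into n_classes equal-width classes and return rows of
    (lower_limit, upper_limit, frequency, relative_frequency, cumulative_frequency)."""
    lo, hi = min(values), max(values)
    # Round the class width up so the classes cover the whole range.
    width = math.ceil((hi - lo + 1) / n_classes)
    rows, cumulative = [], 0
    for i in range(n_classes):
        lower = lo + i * width
        upper = lower + width - 1            # inclusive limits for integer data
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append((lower, upper, freq, freq / len(values), cumulative))
    return rows

scores = [52, 55, 61, 64, 67, 70, 71, 73, 78, 82, 85, 91]  # hypothetical integer data

for lower, upper, freq, rel, cum in frequency_table(scores, 4):
    print(f"{lower}-{upper}: f={freq}  rel={rel:.2f}  cum={cum}")
```

The frequency column is what a histogram plots, and the cumulative column is what a cumulative frequency distribution plots.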
This document provides a tutorial on statistics and data displays. It includes definitions and examples of key statistical concepts like measures of central tendency, measures of spread, and types of data and variables. It also describes commonly used data displays like line plots, bar graphs, circle graphs, and stem-and-leaf plots. For each display, it explains what they consist of, when they work best, and includes examples. Questions to consider when analyzing each display are also provided. The tutorial is intended to refresh teachers' knowledge but can also be used by students.
The document discusses the meaning and objectives of descriptive statistics. It defines descriptive statistics as a branch of statistics that deals with describing and summarizing collected data through organization, classification, and presentation. The key aspects covered include:
- Organizing data through classification, tabulation, and graphical/diagrammatic presentation. This includes frequency distributions, histograms, polygons, etc.
- Measures of central tendency and variability that summarize data distributions, such as mean, median, and standard deviation.
- Descriptive statistics involves organizing and summarizing raw data to define characteristics of populations. This enables researchers to describe phenomena based on sample data.
This document provides an introduction to statistics, including what statistics is, who uses it, and different types of variables and data presentation. Statistics is defined as collecting, organizing, analyzing, and interpreting numerical data to assist with decision making. Descriptive statistics organizes and summarizes data, while inferential statistics makes estimates or predictions about populations based on samples. Variables can be qualitative or quantitative, and quantitative variables can be discrete or continuous. Data can be presented through frequency tables, graphs like histograms and polygons, and cumulative frequency distributions.
The document discusses the importance and key concepts of statistics. It introduces three main reasons to study statistics: to be an informed consumer of information, to understand and make decisions, and to evaluate decisions that affect one's life. It then defines important statistical terms like population, sample, variable, and data types. It also provides examples of different data sets and ways to organize data, such as through frequency distributions, bar charts, and dot plots.
This document provides an overview of key concepts in statistics including:
- Descriptive statistics such as frequency distributions which organize and summarize data
- Inferential statistics which make estimates or predictions about populations based on samples
- Types of variables including quantitative, qualitative, discrete and continuous
- Levels of measurement including nominal, ordinal, interval and ratio
- Common measures of central tendency (mean, median, mode) and dispersion (range, standard deviation)
Data Mining Steps, by sharondabriggs
Data Mining Steps
> Problem Definition
- Market Analysis: Customer Profiling, Identifying Customer Requirements, Cross Market Analysis, Target Marketing, Determining Customer Purchasing Patterns
- Corporate Analysis and Risk Management: Finance Planning and Asset Evaluation, Resource Planning, Competition
- Fraud Detection
- Customer Retention
- Production Control
- Science Exploration
> Data Preparation
Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling. It is a solid practice to start with an initial dataset to get familiar with the data, to discover first insights into the data and have a good understanding of any possible data quality issues. The Datasets you are provided in these projects were obtained from kaggle.com.
Variable selection and description
Numerical – Ratio, Interval
Categorical – Ordinal, Nominal
Simplifying variables: From continuous to discrete
Formatting the data
Basic data integrity checks: missing data, outliers
> Data Exploration
Data Exploration is about describing the data by means of statistical and visualization techniques.
· Data Visualization:
o Univariate analysis explores variables (attributes) one by one. Variables could be either categorical or numerical.
Univariate Analysis - Categorical
Statistics | Visualization | Description
Count | Bar Chart | The number of values of the specified variable.
Count% | Pie Chart | The percentage of values of the specified variable.
Univariate Analysis - Numerical
Statistics | Visualization | Equation | Description
Count | Histogram | N | The number of values (observations) of the variable.
Minimum | Box Plot | Min | The smallest value of the variable.
Maximum | Box Plot | Max | The largest value of the variable.
Mean | Box Plot | | The sum of the values divided by the count.
Median | Box Plot | | The middle value. Below and above median lies an equal number of values.
Mode | Histogram | | The most frequent value. There can be more than one mode.
Quantile | Box Plot | | A set of 'cut points' that divide a set of data into groups containing equal numbers of values (Quartile, Quintile, Percentile, ...).
Range | Box Plot | Max-Min | The difference between maximum and minimum.
Variance | Histogram | | A measure of data dispersion.
Standard Deviation | Histogram | | The square root of variance.
Coefficient of Deviation | Histogram | | A measure of data dispersion divided by mean.
Skewness | Histogram | | A measure of symmetry or asymmetry in the distribution of data.
Kurtosis | Histogram | | A measure of whether the data are peaked or flat relative to a normal distribution.
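Several of the statistics in the table above can be computed with Python's standard `statistics` module. The following is a minimal sketch: the sample values are invented for illustration, and the use of the sample (n - 1) variance is an assumption, since the table does not say which form is meant.

```python
import statistics

# Hypothetical sample, invented purely for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

summary = {
    "count": len(values),
    "minimum": min(values),
    "maximum": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "modes": statistics.multimode(values),    # there can be more than one mode
    "range": max(values) - min(values),
    "variance": statistics.variance(values),  # sample variance (n - 1), an assumption
    "std_dev": statistics.stdev(values),      # square root of the sample variance
}
# Dispersion divided by the mean, as described in the table.
summary["coefficient_of_deviation"] = summary["std_dev"] / summary["mean"]

for name, value in summary.items():
    print(f"{name}: {value}")
```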
Note: There are two types of numerical variables, interval and ratio. An interval variable has values whose differences are interpretable, but it does not have a true zero. A good example is temperature in Centigrade degrees. Data on an int ...
This document discusses different types of data and how to classify them. It defines attributes data as focusing on specific non-numerical characteristics of a population, while variables data measures characteristics on a continuous scale. Attributes data is counted as discrete events, while variables data derives a numeric estimate. The document also discusses distributional models and their relationship to different types of charts that can be used for attributes data analysis.
This document provides an introduction to statistics. It defines key statistical concepts such as descriptive statistics, inferential statistics, populations, samples, variables, and different types of data. It also discusses methods for organizing and summarizing data, including frequency distributions, histograms, frequency polygons, ogives, time series graphs and pie charts. The goal of statistics is to collect, organize, analyze and draw conclusions from data.
Frequency distribution, types of frequency distribution.
Ungrouped frequency distribution
Grouped frequency distribution
Cumulative frequency distribution
Relative frequency distribution
Relative cumulative frequency distribution
Graphical representation of frequency distribution
I. Representation of Grouped data
1.Line graphs
2.Bar diagrams
a) Simple bar diagram
b)Multiple/Grouped bar diagram
c)Sub-divided bar diagram.
d) % bar diagram
3. Pie charts
4.Pictogram
II. Graphical representation of ungrouped data
1. Histogram
2.Frequency polygon
3.Cumulative change diagram
4. Proportional change diagram
5. Ratio diagram
This document discusses various methods for analyzing quantitative data, including coding data, creating a codebook, entering data into a grid format for analysis, checking data for accuracy, and using computers and statistical software to analyze data. It covers descriptive statistics for one and two variables, such as frequency distributions, measures of central tendency and variation, scatterplots, cross-tabulations, and measures of association between two variables.
This document discusses data mining classification and decision trees. It defines classification, provides examples, and discusses techniques like decision trees. It covers decision tree induction processes like determining the best split, measures of impurity, and stopping criteria. It also addresses issues like overfitting, model evaluation methods, and comparing model performance.
This document discusses data mining classification and decision trees. It defines classification, provides examples, and discusses techniques like decision trees. It covers decision tree induction processes like determining the best split, measures of impurity, and stopping criteria. It also addresses issues like overfitting and model evaluation, discussing metrics, methods of evaluation like cross validation, and comparing models.
This document discusses various methods for presenting data numerically and graphically, including frequency distributions, charts, and graphs. It describes steps for constructing frequency distributions and tables, and types of charts like histograms, frequency polygons, ogives, pie charts, bar charts, and time series graphs. The purpose is to summarize large data sets in a concise and understandable way.
Case Study: Prediction on Iris Dataset Using KNN AlgorithmIRJET Journal
This document summarizes a case study that uses the K-Nearest Neighbors (KNN) machine learning algorithm to classify iris flowers into species using measurements from the well-known Iris dataset. The case study loads the Iris dataset, splits it into training and test sets, trains a KNN model with k=3 neighbors to classify the iris flowers based on their sepal length, sepal width, petal length, and petal width measurements, and evaluates the model's accuracy on the test set. The KNN model achieved an accuracy of 95.5% on this classification task using the Iris dataset.
THE SACRIFICE HOW PRO-PALESTINE PROTESTS STUDENTS ARE SACRIFICING TO CHANGE T...indexPub
The recent surge in pro-Palestine student activism has prompted significant responses from universities, ranging from negotiations and divestment commitments to increased transparency about investments in companies supporting the war on Gaza. This activism has led to the cessation of student encampments but also highlighted the substantial sacrifices made by students, including academic disruptions and personal risks. The primary drivers of these protests are poor university administration, lack of transparency, and inadequate communication between officials and students. This study examines the profound emotional, psychological, and professional impacts on students engaged in pro-Palestine protests, focusing on Generation Z's (Gen-Z) activism dynamics. This paper explores the significant sacrifices made by these students and even the professors supporting the pro-Palestine movement, with a focus on recent global movements. Through an in-depth analysis of printed and electronic media, the study examines the impacts of these sacrifices on the academic and personal lives of those involved. The paper highlights examples from various universities, demonstrating student activism's long-term and short-term effects, including disciplinary actions, social backlash, and career implications. The researchers also explore the broader implications of student sacrifices. The findings reveal that these sacrifices are driven by a profound commitment to justice and human rights, and are influenced by the increasing availability of information, peer interactions, and personal convictions. The study also discusses the broader implications of this activism, comparing it to historical precedents and assessing its potential to influence policy and public opinion. The emotional and psychological toll on student activists is significant, but their sense of purpose and community support mitigates some of these challenges. 
However, the researchers call for acknowledging the broader Impact of these sacrifices on the future global movement of FreePalestine.
Brand Guideline of Bashundhara A4 Paper - 2024khabri85
It outlines the basic identity elements such as symbol, logotype, colors, and typefaces. It provides examples of applying the identity to materials like letterhead, business cards, reports, folders, and websites.
How to Manage Reception Report in Odoo 17Celine George
A business may deal with both sales and purchases occasionally. They buy things from vendors and then sell them to their customers. Such dealings can be confusing at times. Because multiple clients may inquire about the same product at the same time, after purchasing those products, customers must be assigned to them. Odoo has a tool called Reception Report that can be used to complete this assignment. By enabling this, a reception report comes automatically after confirming a receipt, from which we can assign products to orders.
How to Setup Default Value for a Field in Odoo 17Celine George
In Odoo, we can set a default value for a field during the creation of a record for a model. We have many methods in odoo for setting a default value to the field.
4. Suppose that a PE coach records the
height of each student in his class.
This is an example of
univariate data
Univariate – consist of observations on a
single variable made on individuals in a
sample or population
5. Suppose that the PE coach records the
height and weight of each student in his
class.
This is an example of
bivariate data
Bivariate - data that consist of pairs of
numbers from two variables for each
individual in a sample or population
6. Suppose that the PE coach records the
height, weight, number of sit-ups, and
number of push-ups for each student in
his class.
This is an example of
multivariate data
Multivariate - data that consist of
observations on two or more variables
8. Categorical variables
• Qualitative
• Consist of categorical responses
Which of these variables are NOT categorical variables?
1. Car model
2. Birth year
3. Type of cell phone
4. Your zip code
5. Which club you have joined
They are all categorical variables!
9. Numerical variables
• quantitative
• observations or measurements take on numerical values
It makes sense to perform math operations on these values.
There are two types of numerical variables - discrete and continuous
Which of these variables are NOT numerical?
1. GPAs
2. Height of students
3. Codes to combination locks
4. Number of text messages per day
5. Weight of textbooks
Does it make sense to find an average code to combination locks?
10. Two types of variables
• categorical
• numerical: discrete or continuous
11. Discrete (numerical)
• Isolated points along a number line
• usually counts of items
• Example: number of textbooks purchased
12. Continuous (numerical)
• Variable that can be any value in a
given interval
• usually measurements of something
• Examples: GPAs or height or weight
13. Are the following variables categorical or numerical (discrete or continuous)?
1. the color of cars in the teacher’s lot: Categorical
2. the number of calculators owned by students at your college: Discrete numerical
3. the zip code of an individual: Categorical
4. the amount of time it takes students to drive to school: Continuous numerical
5. the appraised value of homes in your city: Discrete numerical (Is money a measurement or a count?)
14. Use the following table to determine an appropriate graphical display for a data set.
Graphical Display | Variable Type | Data Type | Purpose
Bar Chart | Univariate | Categorical | Display data distribution
Comparative Bar Chart | Univariate for 2 or more groups | Categorical | Compare 2 or more groups
Dotplot | Univariate | Numerical | Display data distribution
Comparative dotplot | Univariate for 2 or more groups | Numerical | Compare 2 or more groups
Stem-and-leaf display | Univariate | Numerical | Display data distribution
Comparative stem-and-leaf | Univariate for 2 groups | Numerical | Compare 2 or more groups
Histogram | Univariate | Numerical | Display data distribution
Scatterplot | Bivariate | Numerical | Investigate relationship between 2 numerical variables
Time series plot | Univariate, collected over time | Numerical | Investigate trend over time
What types of graphs can be used with categorical data?
In section 2.3, we will see how the various graphical displays for univariate, numerical data compare.
16. Bar Chart
When to Use: Univariate, Categorical data
A bar chart is a graphical display for categorical data.
To comply with new standards from the U. S. Department of Transportation, helmets should reach the bottom of the motorcyclist’s ears. The report “Motorcycle Helmet Use in 2005 – Overall Results” (National Highway Traffic Safety Administration, August 2005) summarized data collected by observing 1700 motorcyclists nationwide at selected roadway locations.
Each time a motorcyclist passed by, the observer noted whether the rider was wearing no helmet (N), a noncompliant helmet (NC), or a compliant helmet (C).
The data are summarized in this table:
Helmet Use | Frequency
N | 731
NC | 153
C | 816
Total | 1700
This is called a frequency distribution. A frequency distribution is a table that displays the possible categories along with the associated frequencies or relative frequencies.
The frequency for a particular category is the number of times that category appears in the data set.
The total of the frequencies should equal the total number of observations.
17. Bar Chart
To comply with new standards from the U. S. Department of
Transportation, helmets should reach the bottom of the
motorcyclist’s ears. The report “Motorcycle Helmet Use in 2005 –
Overall Results” (National Highway Traffic Safety Administration,
August 2005) summarized data collected by observing 1700
motorcyclists nationwide at selected roadway locations.
Each time a motorcyclist passed by, the observer noted whether
the rider was wearing no helmet (N), a noncompliant helmet (NC),
or a compliant helmet (C).
The data are summarized in this table:
Helmet Use | Frequency | Relative Frequency
N | 731 | 0.430
NC | 153 | 0.090
C | 816 | 0.480
Total | 1700 | 1.000
The relative frequencies should sum to 1 (allowing for rounding).
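The relative frequencies for the helmet data can be checked with a few lines of Python. This is a small sketch using the counts reported on the slide; the variable names and the text bar chart are illustrative choices.

```python
# Frequencies from the helmet-use table (N, NC, C).
helmet_counts = {"N": 731, "NC": 153, "C": 816}

total = sum(helmet_counts.values())  # should equal the number of observations

# Relative frequency = frequency / total number of observations.
relative = {category: count / total for category, count in helmet_counts.items()}

for category, share in relative.items():
    bar = "#" * round(share * 50)  # crude text bar, 50 characters = 100%
    print(f"{category:>2} {share:.3f} {bar}")

print("sum of relative frequencies:", round(sum(relative.values()), 3))
```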
18. Bar Chart
How to construct
1. Draw a horizontal line; write the categories or labels below the line at regularly spaced intervals
2. Draw a vertical line; label the scale using frequency or relative frequency
3. Place a rectangular bar above each category label with a height determined by its frequency or relative frequency
All bars should have the same width so that both the height and the area of the bar are proportional to the frequency or relative frequency of the corresponding categories.
19. Bar Chart
What to Look For
Frequently or infrequently occurring
categories
Here is the
completed bar chart
for the motorcycle
helmet data.
Describe this graph.
20. Comparative Bar Charts
When to Use: Univariate, Categorical data for two or more groups
Bar charts can also be used to provide a visual comparison of two or more groups.
How to construct
• Constructed by using the same horizontal and vertical axes for the bar charts of two or more groups
• Usually color-coded to indicate which bars correspond to each group
• Should use relative frequencies on the vertical axis
Why? You use relative frequency rather than frequency on the vertical axis so that you can make meaningful comparisons even if the sample sizes are not the same.
21. Each year the Princeton Review conducts a survey of students applying to college and of parents of college applicants. In 2009, 12,715 high school students responded to the question “Ideally how far from home would you like the college you attend to be?”
Also, 3007 parents of students applying to college responded to the question “how far from home would you like the college your child attends to be?” Data are displayed in the frequency table below.
Frequency
Ideal Distance | Students | Parents
Less than 250 miles | 4450 | 1594
250 to 500 miles | 3942 | 902
500 to 1000 miles | 2416 | 331
More than 1000 miles | 1907 | 180
Create a comparative bar chart with these data.
What should you do first?
22. Relative Frequency
Ideal Distance Students Parents
Less than 250 miles .35 .53
250 to 500 miles .31 .30
500 to 1000 miles .19 .11
More than 1000 miles .15 .06
Found by dividing the frequency by the total
number of students
Found by dividing the frequency by the total
number of parents
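Because the two groups have very different sample sizes (12,715 students and 3007 parents), each column is divided by its own total before the groups are compared. A minimal Python sketch of that step, using the frequencies from the slide (the names are illustrative):

```python
distances = ["Less than 250 miles", "250 to 500 miles",
             "500 to 1000 miles", "More than 1000 miles"]
students = [4450, 3942, 2416, 1907]
parents = [1594, 902, 331, 180]

def relative_frequencies(counts):
    """Divide each frequency by the group's own total, rounded to 2 places."""
    total = sum(counts)
    return [round(count / total, 2) for count in counts]

students_rel = relative_frequencies(students)
parents_rel = relative_frequencies(parents)

for label, s, p in zip(distances, students_rel, parents_rel):
    print(f"{label:<22} students {s:.2f}  parents {p:.2f}")
```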
What does this
graph show about
the ideal distance
college should be
from home?
24. Dotplot
When to Use Univariate, Numerical data
How to construct
1. Draw a horizontal line and mark it with an
appropriate numerical scale
2. Locate each value in the data set along the
scale and represent it by a dot. If there are
two or more observations with the same
value, stack the dots vertically
25. Dotplot
What to Look For
• A representative or typical value (center) in the data set
• The extent to which the data values spread out
• The nature of the distribution (shape) along the number line
• The presence of unusual values (gaps and outliers)
An outlier is an unusually large or small data value. A precise rule for deciding when an observation is an outlier is given in Chapter 3.
What we look for with univariate, numerical data sets is similar for dotplots, stem-and-leaf displays, and histograms.
26. Professor Norm gave a 10-question quiz last week in his introductory statistics class. The number of correct answers for each student is recorded below.
6 8 6 5 4 7 9 4 5
8 5 6 4 7 7 3 8 7
6 7 6 6 6 5 5 9
First draw a horizontal line with an appropriate scale.
The first three observations are plotted; note that you stack the points if values are repeated.
This is the completed dotplot.
[Dotplot. Horizontal axis: Number of correct answers, 2 to 10]
Write a sentence or two describing this distribution.
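The stacking rule for a dotplot can be sketched in text form with Python. The scores below are the quiz values as transcribed from the slide (their order is not certain, which does not affect the plot); the function name and the "o" marker are illustrative choices.

```python
from collections import Counter

# Quiz scores as transcribed from the slide.
scores = [6, 8, 6, 5, 4, 7, 9, 4, 5,
          8, 5, 6, 4, 7, 7, 3, 8, 7,
          6, 7, 6, 6, 6, 5, 5, 9]

counts = Counter(scores)   # how many dots to stack above each value
axis = list(range(2, 11))  # numerical scale from 2 to 10

def dotplot_rows(counts, axis):
    """Return the text rows of a dotplot, tallest stack level first."""
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        cells = ["o" if counts[value] >= level else " " for value in axis]
        rows.append("".join(f"{cell:>3}" for cell in cells).rstrip())
    return rows

rows = dotplot_rows(counts, axis)
for row in rows:
    print(row)
print("".join(f"{value:>3}" for value in axis))
print("Number of correct answers")
```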
27. What to Look For
• The representative or typical value (center) in the data set
• The extent to which the data values spread out
• The nature of the distribution (shape) along the number line
• The presence of unusual values
A symmetrical distribution is one that has a vertical line of symmetry where the left half is a mirror image of the right half.
If we draw a curve, smoothing out this dotplot, we will see that there is ONLY one peak.
Distributions with a single peak are said to be unimodal. Distributions with two peaks are bimodal, and with more than two peaks are multimodal.
[Dotplot. Horizontal axis: Number of correct answers, 2 to 10]
The center for the distribution of the number of correct answers is about 6. There is not a lot of variability in the observations. The distribution is approximately symmetrical with no unusual observations.
28. Comparative Dotplots
When to Use Univariate, numerical data with
observations from 2 or more groups
How to construct
• Constructed using the same numerical scale
for two or more dotplots
• Be sure to include group labels for the
dotplots in the display
What to Look For
Comment on the same four attributes, but
comparing the dotplots displayed.
29. In another introductory statistics class, Professor Skew also gave a 10-question quiz. The number of correct answers for each student is recorded below.
8 6 10 8 8 7 9 8 10
7 9 8 8 8 7 7 3 7
8 7 6 6 6 5 5 9 8
Create a comparative dotplot with the data sets from the two statistics classes, Professors Norm and Skew.
[Comparative dotplot: Prof. Skew and Prof. Norm. Horizontal axis: Number of correct answers, 2 to 10]
Is the distribution for Prof. Skew’s class symmetric? Why or why not?
Notice that the left side (or lower tail) of the distribution is longer than the right side (or upper tail). This distribution is said to be negatively skewed (or skewed to the left).
Distributions where the right tail is longer than the left are said to be positively skewed (or skewed to the right).
The direction of skewness is always in the direction of the longer tail.
Write a few sentences comparing these distributions.
The center of the distribution for the number of correct answers in Prof. Skew’s class is larger than the center of Prof. Norm’s class. There is also more variability in Prof. Skew’s distribution. Prof. Skew’s distribution appears to have an unusual observation where one student only had 2 answers correct while there were no unusual observations in Prof. Norm’s class. The distribution for Prof. Skew is negatively skewed while Prof. Norm’s distribution is more symmetrical.
30. Stem-and-Leaf Displays
When to Use: Univariate, Numerical data
Stem-and-leaf displays are an effective way to summarize univariate numerical data when the data set is not too large.
Each observation is split into two parts:
Stem – consists of the first digit(s)
Leaf - consists of the final digit(s)
How to construct
• Select one or more of the leading digits for the stem
• List the possible stem values in a vertical column
• Record the leaf for each observation beside the corresponding stem value
• Indicate the units for stems and leaves someplace in the display
Be sure to list every stem from the smallest to the largest value.
31. Stem-and-Leaf Displays
What to Look For
• A representative or typical value (center)
in the data set
• The extent to which the data values
spread out
• The presence of unusual values (gaps and
outliers)
• The extent of symmetry in the data
distribution
• The number and location of peaks
32. The article “Going Wireless” (AARP Bulletin, June 2009) reported the estimated percentage of households with only wireless phone service (no landline) for the 50 U.S. states and the District of Columbia. Data for the 19 Eastern states are given here.
5.6 5.7 20.0 16.8 16.5 13.4 10.8 9.3 11.6 8.0
11.4 16.3 14.0 10.8 7.8 20.6 10.8 5.1 11.6
What is the variable of interest? Wireless percent
A stem-and-leaf display is an appropriate way to summarize these data. (A dotplot would also be a reasonable choice.)
Let 5.6% be represented as 05.6% so that all the numbers have two digits in front of the decimal. If we use the 2-digits, we would have stems from 05 to 20; that’s way too many stems! So let’s just use the first digit (tens) as our stems. So the leaf will be the last two digits.
With 05.6%, the leaf is 5.6 and it will be written behind the stem 0. For the second number, 5.7 also is written behind the stem 0 (with a comma between).
What is the leaf for 20.0% and where should that leaf be written?
The completed stem-and-leaf display is shown below.
0 | 5.6, 5.7, 9.3, 8.0, 7.8, 5.1
1 | 6.8, 6.5, 3.4, 0.8, 1.6, 1.4, 6.3, 4.0, 0.8, 0.8, 1.6
2 | 0.0, 0.6
However, it is somewhat difficult to read due to the 2-digit leaves. A common practice is to drop all but the first digit in the leaf. This makes the display easier to read, and DOES NOT change the overall distribution of the data set.
33. The article “Going Wireless” (AARP Bulletin, June
2009) reported the estimated percentage of
households with only wireless phone service (no
landline) for the 50 U.S. states and the District of
Columbia. Data for the 19 Eastern states are given
here.
0 | 555789
1 | 00011134666
2 | 00
Stem: tens
Leaf: ones
While it is not necessary to write the leaves in order from smallest to largest, by doing so, the center of the distribution is more easily seen.
Write a few sentences describing this distribution.
The center of the distribution for the estimated percentage of households with only wireless phone service is approximately 11%. There does not appear to be much variability. This display appears to be a unimodal, symmetric distribution with no outliers.
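The construction described for the wireless data (tens digit as the stem, ones digit as the leaf, tenths dropped) can be sketched in Python. The function name is hypothetical and the data are the 19 Eastern-state values from the slide; this is one way to build the display, not the only one.

```python
# Estimated wireless-only percentages for the 19 Eastern states (from the slide).
eastern = [5.6, 5.7, 20.0, 16.8, 16.5, 13.4, 10.8, 9.3, 11.6, 8.0,
           11.4, 16.3, 14.0, 10.8, 7.8, 20.6, 10.8, 5.1, 11.6]

def stem_and_leaf(values):
    """Tens digit is the stem, ones digit is the leaf; the tenths are dropped."""
    whole = [int(value) for value in values]  # truncate, do not round
    stems = range(min(whole) // 10, max(whole) // 10 + 1)
    display = {stem: [] for stem in stems}    # list every stem, even empty ones
    for number in whole:
        display[number // 10].append(number % 10)
    return {stem: sorted(leaves) for stem, leaves in display.items()}

display = stem_and_leaf(eastern)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
print("Stem: tens")
print("Leaf: ones")
```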
34. Comparative Stem-and-Leaf Displays
When to Use Univariate, numerical data with
observations from 2 or more groups
How to construct
• List the leaves for one data set to the right
of the stems
• List the leaves for the second data set to the
left of the stems
• Be sure to include group labels to identify
which group is on the left and which is on the
right
35. The article “Going Wireless” (AARP Bulletin, June
2009) reported the estimated percentage of
households with only wireless phone service (no
landline) for the 50 U.S. states and the District of
Columbia. Data for the 13 Western states are given here.
11.7 18.9 9.0 16.7 8.0 22.1 9.2 10.8
21.1 17.7 25.5 16.3 11.4
Create a comparative stem-and-leaf display comparing the distributions of the Eastern and Western states.
Western States | Stem | Eastern States
998 | 0 | 555789
8766110 | 1 | 00011134666
521 | 2 | 00
Stem: tens
Leaf: ones
Write a few sentences comparing these distributions.
The center of the distribution of the estimated percentage of households with only wireless phone service for the Western states is a little larger than the center for the Eastern states. Both distributions are symmetrical with approximately the same amount of variability.
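A comparative (back-to-back) display can be sketched the same way: one group's leaves go to the right of the shared stems, and the other group's leaves go to the left, written so that they grow outward from the stem. The helper names below are hypothetical; the data are the Western and Eastern values from the slides.

```python
western = [11.7, 18.9, 9.0, 16.7, 8.0, 22.1, 9.2, 10.8,
           21.1, 17.7, 25.5, 16.3, 11.4]
eastern = [5.6, 5.7, 20.0, 16.8, 16.5, 13.4, 10.8, 9.3, 11.6, 8.0,
           11.4, 16.3, 14.0, 10.8, 7.8, 20.6, 10.8, 5.1, 11.6]

def leaves_by_stem(values, stems):
    """Sorted ones-digit leaves for each tens-digit stem (tenths dropped)."""
    table = {stem: [] for stem in stems}
    for value in values:
        table[int(value) // 10].append(int(value) % 10)
    return {stem: sorted(leaves) for stem, leaves in table.items()}

all_whole = [int(value) for value in western + eastern]
stems = range(min(all_whole) // 10, max(all_whole) // 10 + 1)
west = leaves_by_stem(western, stems)
east = leaves_by_stem(eastern, stems)

rows = []
for stem in stems:
    left = "".join(str(leaf) for leaf in reversed(west[stem]))  # grows leftward
    right = "".join(str(leaf) for leaf in east[stem])
    rows.append((left, stem, right))

print(f"{'Western States':>14} |   | Eastern States")
for left, stem, right in rows:
    print(f"{left:>14} | {stem} | {right}")
```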
36. Histograms
When to Use: Univariate numerical data
Dotplots and stem-and-leaf displays are not effective ways to summarize numerical data when the data set contains a large number of data values.
Histograms are displays that don’t work well for small data sets but do work well for larger numerical data sets.
Histograms are constructed differently for discrete versus continuous data.
How to construct: Discrete data
Discrete numerical data almost always result from counting. In such cases, each observation is a whole number.
• Draw a horizontal scale and mark it with the possible values for the variable
• Draw a vertical scale and mark it with frequency or relative frequency
• Above each possible value, draw a rectangle centered at that value with a height corresponding to its frequency or relative frequency
What to look for
Center or typical value; spread; general shape and location and number of peaks; and gaps or outliers
37. Queen honey bees mate shortly after they become adults.
During a mating flight, the queen usually takes multiple
partners, collecting sperm that she will store and use
throughout the rest of her life.
A paper, “The Curious Promiscuity of Queen Honey Bees”
(Annals of Zoology [2001]: 255-265), provided the
following data on the number of partners for 30 queen
bees.
12 2 4 6 6 7 8 7 8 11
8 3 5 6 7 10 1 9 7 6
9 7 5 4 7 4 6 7 8 10
Here is a dotplot of these data.
[Dotplot. Horizontal axis: Number of Partners, 2 to 12]
38. Queen honey bees continued
The variable, number of partners, is discrete. To create a histogram: we already have a horizontal axis; we need to add a vertical axis for frequency.
The bars should be centered over the discrete data values and have heights corresponding to the frequency of each data value.
[Histogram drawn over the dotplot. Horizontal axis: Number of partners, 0 to 12. Vertical axis: Frequency]
In practice, histograms for discrete data ONLY show the rectangular bars. We built the histogram on top of the dotplot to show that the bars are centered over the discrete data values and that the heights of the bars are the frequency of each data value.
The distribution for the number of partners of queen honey bees is approximately symmetric with a center at 7 partners and a somewhat large amount of variability. There don’t appear to be any outliers.
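For discrete data, the bar heights are simply the frequencies of each distinct value. A short Python sketch with the 30 queen bee observations from the slide; the text bars stand in for the rectangles, and the variable names are illustrative.

```python
from collections import Counter

# Number of partners for 30 queen bees (from the slide).
partners = [12, 2, 4, 6, 6, 7, 8, 7, 8, 11,
            8, 3, 5, 6, 7, 10, 1, 9, 7, 6,
            9, 7, 5, 4, 7, 4, 6, 7, 8, 10]

frequency = Counter(partners)
n = len(partners)

# One bar per possible value; height = frequency (or relative frequency).
for value in range(min(partners), max(partners) + 1):
    count = frequency[value]
    print(f"{value:>2} {'#' * count:<8} frequency={count} relative={count / n:.3f}")
```

Dividing every frequency by the same total changes only the vertical scale, so a frequency histogram and a relative frequency histogram of one data set have the same shape.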
39. Here are two histograms showing the “queen bee data set”. One uses frequency on the vertical axis, while the other uses relative frequency.
What do you notice about the shapes of these two histograms?
40. Histograms with equal width intervals
When to Use Univariate numerical data
How to construct Continuous data
• Mark the boundaries of the class intervals on the
horizontal axis
• Use either frequency or relative frequency on the
vertical axis
• Draw a rectangle for each class interval directly above
that interval. The height of each rectangle is the
frequency or relative frequency of the corresponding
interval
What to look for
Center or typical value; spread; general shape and
location and number of peaks; and gaps or outliers
41. Consider the following data on carry-on luggage weight for 25 airline passengers.
25.0 17.9 30.0 18.0 28.2 27.8 10.1 27.6 28.7
20.9 20.8 28.5 33.8 31.4 27.6 21.9 19.9 28.0
24.9 22.7 22.4 26.4 22.0 34.5 25.3
This is a continuous numerical data set.
Here is a dotplot of this data set.
Looking at this dotplot, it is easy to see that we could use intervals with a width of 5.
With continuous data, the rectangular bars cover an interval of data values (not just one value).
This interval includes 10 and all values up to but not including 15. The next interval will include 15 and all values up to but not including 20, and so on.
The top dotplot shows all the data values in each interval stacked in the middle of the interval.
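The class-interval rule (each interval includes its lower boundary and excludes its upper boundary) can be sketched in Python. The weights are the 25 values as transcribed from the slide; the starting boundary of 10 and the width of 5 follow the slide, while the function name is hypothetical.

```python
# Carry-on luggage weights for 25 passengers, as transcribed from the slide.
weights = [25.0, 17.9, 30.0, 18.0, 28.2, 27.8, 10.1, 27.6, 28.7,
           20.9, 20.8, 28.5, 33.8, 31.4, 27.6, 21.9, 19.9, 28.0,
           24.9, 22.7, 22.4, 26.4, 22.0, 34.5, 25.3]

def interval_counts(values, start, width, n_intervals):
    """Count values in [start, start + width), [start + width, ...), and so on."""
    counts = [0] * n_intervals
    for value in values:
        index = int((value - start) // width)  # lower boundary included
        counts[index] += 1
    return counts

counts = interval_counts(weights, start=10, width=5, n_intervals=5)
for i, count in enumerate(counts):
    low = 10 + 5 * i
    print(f"{low} to < {low + 5}: frequency={count}, "
          f"relative frequency={count / len(weights):.2f}")
```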
42. From the dotplot, it is easy to see how the
continuous histogram is created.
43. Comparative Histograms
• Must use two separate histograms with the same horizontal axis and relative frequency on the vertical axis
The article “Early Television Exposure and Subsequent Attention Problems in Children” (Pediatrics, April 2004) investigated the television viewing habits of U.S. children. These graphs show the viewing habits of 1-year old and 3-year old children.
[Histograms: 1-yr-olds and 3-yr-olds]
The biggest difference between the two histograms is at the low end, with a much higher proportion of 3-year-old children falling in the 0-2 TV hours interval than 1-year-old children.
44. Histograms with unequal width intervals
When to use
when you have a concentration of data in the
middle with some extreme values
How to construct
construct similar to histograms with
continuous data, but with density on the
vertical axis
density = (relative frequency for interval) / (width of interval)
45. When people are asked for values such as age or weight, they sometimes shade the truth in their responses. The article “Self-Report of Academic Performance” (Social Methods and Research [November 1981]: 165-185) focused on SAT scores and grade point average (GPA). For each student in the sample, the difference between reported GPA and actual GPA was determined. Positive differences resulted from individuals reporting GPAs larger than the correct value.
Class Interval | Relative Frequency
-2.0 to < -0.4 | 0.023
-0.4 to < -0.2 | 0.055
-0.2 to < -0.1 | 0.097
-0.1 to < 0 | 0.210
0 to < 0.1 | 0.189
0.1 to < 0.2 | 0.139
0.2 to < 0.4 | 0.116
0.4 to < 2.0 | 0.171
When using relative frequency on the vertical axis, the proportional area principle is violated.
Notice the relative frequency for the interval 0.4 to < 2.0 is smaller than the relative frequency for the interval -0.1 to < 0, but the area of the bar is MUCH larger.
46. GPAs continued
To fix this problem, we need to find the density of each interval.
density = (relative frequency for interval) / (width of interval)
Class Interval | Relative Frequency | Width | Density
-2.0 to < -0.4 | 0.023 | 1.6 | 0.014
-0.4 to < -0.2 | 0.055 | 0.2 | 0.275
-0.2 to < -0.1 | 0.097 | 0.1 | 0.970
-0.1 to < 0 | 0.210 | 0.1 | 2.100
0 to < 0.1 | 0.189 | 0.1 | 1.890
0.1 to < 0.2 | 0.139 | 0.1 | 1.390
0.2 to < 0.4 | 0.116 | 0.2 | 0.580
0.4 to < 2.0 | 0.171 | 1.6 | 0.107
This is a correct
histogram with unequal
widths.
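The density column can be reproduced by dividing each relative frequency by its interval width. A minimal Python sketch using the class intervals and relative frequencies from the GPA example; the third interval is taken as -0.2 to < -0.1, which is what its width of 0.1 implies.

```python
# (lower boundary, upper boundary, relative frequency) for each class interval.
intervals = [
    (-2.0, -0.4, 0.023),
    (-0.4, -0.2, 0.055),
    (-0.2, -0.1, 0.097),
    (-0.1, 0.0, 0.210),
    (0.0, 0.1, 0.189),
    (0.1, 0.2, 0.139),
    (0.2, 0.4, 0.116),
    (0.4, 2.0, 0.171),
]

densities = []
for low, high, rel_freq in intervals:
    width = high - low
    density = rel_freq / width  # density = relative frequency / width
    densities.append(round(density, 3))
    print(f"{low:>5} to < {high:<5} width={width:.1f} density={density:.3f}")

# With density on the vertical axis, bar area (density * width) equals the
# relative frequency, so the areas add up to 1.
total_area = sum(rel_freq for _, _, rel_freq in intervals)
print("total area:", round(total_area, 3))
```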
48. Scatterplots
When to Use Bivariate Numerical data
How to construct
1. Draw horizontal and vertical axes. Label the
horizontal axis and include an appropriate scale for
the x-variable. Label the vertical axis and include
an appropriate scale for the y-variable.
2. For each (x, y) pair in the data set, add a dot in
the appropriate location in the display.
What to look for
Relationship between x and y
49. The accompanying table gives the cost (in
dollars) and an overall quality rating for 10
different brands of men’s athletic shoes
(www.consumerreports.org).
Cost 65 45 45 80 110 110 30 80 110 70
Rating 71 70 62 59 58 57 56 52 51 51
Is there a relationship between x = cost and
y = quality rating?
A scatterplot can help
answer this question
50. Cost 65 45 45 80 110 110 30 80 110 70
Rating 71 70 62 59 58 57 56 52 51 51
Is there a relationship between x = cost and
y = quality rating?
First, draw and label appropriate horizontal
and vertical axes.
Next, plot each (x, y) pair.
Here is the completed scatterplot.
[Scatterplot: Rating (vertical axis, 50 to 70) versus Cost (horizontal axis, 20 to 100)]
There appears to be a negative relationship
between cost of athletic shoes and their quality
rating. Does that surprise you?
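As a numerical check on what the scatterplot suggests, the sketch below computes the Pearson correlation coefficient for the ten (cost, rating) pairs. The slides themselves only use the plot; the correlation and the helper name `pearson_r` are additions for illustration.

```python
from math import sqrt

# Cost (dollars) and quality rating for the 10 brands of athletic shoes.
cost = [65, 45, 45, 80, 110, 110, 30, 80, 110, 70]
rating = [71, 70, 62, 59, 58, 57, 56, 52, 51, 51]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(cost, rating)
print(f"r = {r:.3f}")  # a negative value agrees with the downward pattern
```

A negative r is consistent with the visual impression, but with only 10 points the relationship is moderate rather than strong.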
51. Time Series Plots
When to Use Bivariate data with time and
another variable
How to construct
1. Draw horizontal and vertical axes. Label the
horizontal axis and include an appropriate scale
for the x-variable. Label the vertical axis and
include an appropriate scale for the y-variable.
2. For each (x, y) pair in the data set, add a dot in
the appropriate location in the display.
3. Connect each dot in order
What to look for
trends or patterns over time
52. The Christmas Price Index is computed each year by
PNC Advisors. It is a humorous look at the cost of
giving all the gifts described in the popular Christmas
song “The Twelve Days of Christmas”
(www.pncchristmaspriceindex.com).
Describe any
trends or
patterns
that you see.
Why is there a downward
trend between 1993 & 1995?
54. Pie (Circle) Chart
When to Use Categorical data
How to construct
• A circle is used to represent the whole data set.
• “Slices” of the pie represent the categories
• The size of a particular category’s slice is
proportional to its frequency or relative
frequency.
• Most effective for summarizing data sets when
there are not too many categories
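The rule that a slice is proportional to its relative frequency can be turned into slice angles: angle = 360 × relative frequency. The sketch below assumes made-up category counts (they are hypothetical and do not come from any survey in these slides), and the function name `slice_angles` is likewise invented for the example.

```python
# Hypothetical category counts, for illustration only.
counts = {"Category A": 50, "Category B": 30, "Category C": 20}


def slice_angles(counts):
    """Return the central angle (in degrees) of each category's slice."""
    total = sum(counts.values())
    # Each slice gets the same share of 360 degrees as its share of the data.
    return {category: 360 * count / total for category, count in counts.items()}


angles = slice_angles(counts)
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```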
55. Pie (Circle) Chart
The article “Fred Flintstone, Check Your Policy” (The Washington
Post, October 2, 2005) summarized a survey of 1014 adults
conducted by the Life and Health Insurance Foundation for
Education. Each person surveyed was asked to select which of five
fictional characters had the greatest need for life insurance:
Spider-Man, Batman, Fred Flintstone, Harry Potter, and Marge
Simpson. The data are summarized in the pie chart.
The survey results were quite
different from the assessment
of an insurance expert.
The insurance expert felt that
Batman, a wealthy bachelor, and
Spider-Man did not need life
insurance as much as Fred
Flintstone, a married man with
dependents!
56. Segmented (or Stacked) Bar Charts
A pie chart can be difficult to construct by
hand. The circular shape sometimes makes
it difficult to compare areas for different
categories, particularly when the relative
frequencies are similar.
So, we could use a segmented bar chart.
When to Use Categorical data
How to construct
• Use a rectangular bar rather than a circle
to represent the entire data set.
• The bar is divided into segments, with
different segments representing
different categories.
• The area of the segment is proportional to
the relative frequency for the particular
category.
57. Segmented (or Stacked) Bar Charts
Each year, the Higher Education Research Institute
conducts a survey of college seniors. In 2008,
approximately 23,000 seniors participated in the survey
(“Findings from the 2008 Administration of the College
Senior Survey,” Higher Education Research Institute,
June 2009).
This segmented bar
chart summarizes
student responses to
the question: “During
the past year, how much
time did you spend
studying and doing
homework in a typical
week?”
59. Avoid these Common Mistakes
1. Areas should be proportional to frequency,
relative frequency, or magnitude of the
number being represented.
The eye is naturally drawn to
large areas in graphical displays.
Sometimes, in an effort to make
the graphical displays more
interesting, designers lose sight
of this important principle.
Consider this graph (USA Today,
October 3, 2002).
By replacing the bars of a bar
chart with milk buckets,
areas are distorted.
The two buckets for 1980
represent 32 cows, whereas
the one bucket for 1970
represents 19 cows.
60. Avoid these Common Mistakes
1. Areas should be proportional to frequency,
relative frequency, or magnitude of the
number being represented.
Another common distortion
occurs when a third
dimension is added to bar
charts or pie charts. This
distorts the areas and
makes it much more
difficult to interpret.
61. Avoid these Common Mistakes
2. Be cautious of graphs with broken axes (axes
that don’t start at 0).
• The use of broken axes in a scatterplot does not result
in a misleading picture of the relationship of bivariate
data.
• In time series plots, broken axes can sometimes
exaggerate the magnitude of change over time.
• In bar charts and histograms, the vertical axis should
NEVER be broken. This violates the “proportional
area” principle.
62. Avoid these Common Mistakes
2. Be cautious of graphs with broken axes (axes
that don’t start at 0).
This bar chart is similar to
one in an advertisement for
a software product designed
to raise student test scores.
Areas of the bars are not
proportional to the
magnitude of the numbers
represented – the area of
the rectangle representing 68 is more
than three times the area of
the rectangle representing
55!
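The size of this distortion can be worked out directly. In the sketch below, the axis baseline of 50 is an assumption chosen for illustration (the slide does not say where the broken axis starts), and `bar_height_ratio` is a made-up helper name. Since the bars have equal width, the ratio of areas equals the ratio of drawn heights.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of two bars when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


# Axis starting at 0: the bars compare as the numbers themselves do.
true_ratio = bar_height_ratio(68, 55)

# Assumed broken axis starting at 50 (hypothetical value, not from the slide).
broken_ratio = bar_height_ratio(68, 55, baseline=50)

print(f"ratio with axis at 0:  {true_ratio:.2f}")
print(f"ratio with axis at 50: {broken_ratio:.2f}")
```

Under this assumed baseline the taller bar looks several times larger than the shorter one, even though 68 is only about 24% larger than 55.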
63. Avoid these Common Mistakes
3. Watch out for unequal time spacing in time
series plots.
If observations over time are not made at regular
time intervals, special care must be taken in
constructing the time series plot.
Notice that the intervals between observations are
irregular, yet the points in the plot are equally spaced
along the time axis. This makes it difficult to assess
the rate of change over time.
Here is a correct time series plot.
64. Avoid these Common Mistakes
4. Be careful how you interpret patterns in
scatterplots.
Consider the following scatterplot showing the relationship between
the number of Methodist ministers in New England and the amount
of Cuban rum imported into Boston from 1860 to 1940
(Education.com).
[Scatterplot: Number of Barrels of Imported Rum (vertical axis, 5000 to 35000) versus Number of Methodist Ministers (horizontal axis, 0 to 300); r = .999973]
A strong pattern in a scatterplot means that
the two variables tend to vary together in a
predictable way, BUT it does not mean that there
is a cause-and-effect relationship.
Does an increase in the number of Methodist
ministers CAUSE the increase in imported rum?
65. Avoid these Common Mistakes
5. Make sure that a graphical display creates
the right first impression.
Consider the following graph
from USA Today (June 25,
2001). Although this graph
does not violate the
proportional area principle,
the way the “bar” for the
none category is displayed
makes this graph difficult to
read. A quick glance at this
graph may leave the reader
with an incorrect impression.