SlideShare a Scribd company logo
Lecture #1: Exploratory Data Analysis
CS 109A, STAT 121A, AC 209A
Pavlos Protopapas Kevin Rader
Margo Levine Rahul Dave
1
Lecture Outline
What are Data?
Data Exploration
Descriptive Statistics
Visualizations
An Example
2
What are Data?
The Data Science Process
Recall the data science process.
• Ask questions
• Data Collection
• Data Exploration
• Data Modeling
• Data Analysis
• Visualization and Presentation of Results
Today we will begin introducing the data collection and data
exploration steps.
3
The Data Science Process
4
What are data?
“A datum is a single measurement of something on a scale that is
understandable to both the recorder and the reader. Data are
multiple such measurements.”
Claim: everything is (can be) data!
5
Where do data come from?
• Internal sources: already collected by or is part of the overall
data collection of you organization.
For example: business-centric data that is available in the
organization data base to record day to day operations;
scientific or experimental data
• Existing External Sources: available in ready to read format
from an outside source for free or for a fee.
For example: public government databases, stock market
data, Yelp reviews
• External Sources Requiring Collection Efforts: available
from external source but acquisition requires special
processing.
For example: data appearing only in print form, or data on
websites
6
Where do data come from?
How to get data generated, published or hosted online:
• API (Application Programming Interface): using a
prebuilt set of functions developed by a company to access
their services. Often pay to use. For example: Google Map
API, Facebook API, Twitter API
• RSS (Rich Site Summary): summarizes frequently updated
online content in standard format. Free to read if the site has
one. For example: news-related sites, blogs
• Web scraping: using software, scripts or by-hand extracting
data from what is displayed on a page or what is contained in
the HTML file.
7
Web Scraping
• Why do it? Older government or smaller news sites might not
have APIs for accessing data, or publish RSS feeds or have
databases for download. You dont want to pay to use the API
or the database.
• How do you do it? See HW1
• Should you do it?
• You just want to explore: Are you violating their terms of
service? Privacy concerns for website and their clients?
• You want to publish your analysis or product: Do they have an
API or fee that you are bypassing? Are they willing to share
this data? Are you violating their terms of service? Are there
privacy concerns?
8
Types of Data
What kind of values are in your data (data types)? Simple or
atomic:
• Numeric: integers, floats
• Boolean: binary or true false values
• Strings: sequence of symbols
9
Types of Data
What kind of values are in your data (data types)? Compound,
composed of a bunch of atomic types:
• Date and time: compound value with a specific structure
• Lists: a list is a sequence of values
• Dictionaries: A dictionary is a collection of key-value pairs, a
pair of values x : y where x is usually a string called key
representing the “name” of the value, and y is a value of any
type.
Example: Student record
• First: Kevin
• Last: Rader
• Classes: [CS109A, STAT121A, AC209A, STAT139]
10
How is your data represented and stored (data format)?
• Tabular Data: a dataset that is a two-dimensional table,
where each row typically represents a single data record, and
each column represents one type of measurement (csv, tsp,
xlsx etc.).
• Structured Data: each data record is presented in a form of
a, possibly complex and multi-tiered, dictionary (json, xml
etc.)
• Semistructured Data: not all records are represented by the
same set of keys or some data records are not represented
using the key-value pair structure.
11
How is your data represented and stored (data format)?
• Textual Data
• Temporal Data
• Geolocation Data
12
Tabular Data
In tabular data, we expect each record or observation to represent
a set of measurements of a single object or event. We’ve seen this
already in Lecture 0:
Each type of measurement is called a variable or an attribute of
the data (e.g. seq id, status and duration are variables or
attributes). The number of attributes is called the dimension.
We expect each table to contain a set of records or observations
of the same kind of object or event (e.g. our table above contains
observations of rides/checkouts).
13
Types of Data
We’ll see later that its important to distinguish between classes of
variables or attributes based on the type of values they can take on.
• Quantitative variable: is numerical and can be either:
• discrete - a finite number of values are possible in any
bounded interval.
For example: “Number of siblings” is a discrete variable
• continuous - an infinite number of values are possible in any
bounded interval
For example: “Height” is a continuous variable
• Categorical variable: no inherent order among the values For
example: “What kind of pet you have” is a categorical variable
14
Common Issues
Common issues with data:
• Missing values: how do we fill in?
• Wrong values: how can we detect and correct?
• Messy format
• Not usable: the data cannot answer the question posed
15
Handling Messy Data
The following is a table accounting for the number of produce
deliveries over a weekend.
What are the variables in this dataset?
What object or event are we measuring?
Friday Saturday Sunday
Morning 15 158 10
Afternoon 2 90 20
Evening 55 12 45
What’s the issue? How do we fix it?
16
Handling Messy Data
Were measuring individual deliveries; the variables are Time, Day,
Number of Produce.
Friday Saturday Sunday
Morning 15 158 10
Afternoon 2 90 20
Evening 55 12 45
Problem: each column header represents a single value rather than
a variable. Row headers are “hiding” the Day variable. The values
of the variable,“Number of Produce”, is not recorded in a single
column.
17
Handling Messy Data
We need to reorganize the information to make explicit the event
were observing and the variables associated to this event.
ID Time Day Number
1 Morning Friday 15
2 Morning Saturday 158
3 Morning Sunday 10
4 Afternoon Friday 2
5 Afternoon Saturday 9
6 Afternoon Sunday 20
7 Evening Friday 55
8 Evening Saturday 12
9 Evening Sunday 45
18
More Messiness
What object or event are we measuring?
What are the variables in this dataset?
How do we fix?
19
More Messiness
Were measuring individual deliveries; the variables are Time, Day,
Number of Produce:
20
Common causes of messiness are:
• Column headers are values, not variable names
• Variables are stored in both rows and columns
• Multiple variables are stored in one column
• Multiple types of experimental units stored in same table
In general, we want each file to correspond to a dataset, each
column to represent a single variable and each row to represent a
single observation.
We want to tabularize the data. This makes Python happy.
21
Data Exploration
Basics of Sampling
Population versus sample:
• A population is the entire set of objects or events under
study. Population can be hypothetical “all students” or all
students in this class.
• A sample is a “representative” subset of the objects or events
under study. Needed because its impossible or intractable to
obtain or compute with population data.
Biases in samples:
• Selection bias: some subjects or records are more likely to be
selected
• Volunteer/nonresponse bias: subjects or records who are
not easily available are not represented
Examples? 22
Sample mean
The mean of a set of n observations of a variable is denoted ¯x and
is defined as:
¯x =
x1 + x2 + ... + xn
n
=
1
n
n
i=1
xi
The mean describes what a “typical” sample value looks like, or
where is the “center” of the distribution of the data.
Key theme: there is always uncertainty involved when calculating a
sample mean to estimate a population mean.
23
Sample median
The median of a set of n number of observations in a sample,
ordered by value, of a variable is is defined by
Median =
x(n+1)/2 if n is odd
xn/2+x(n+1)/2
2 if n is even
Example (already in order):
Ages: 17, 19, 21, 22, 23, 23, 23, 38
Median = (22+23)/2 = 22.5
The median also describes what a typical observation looks like, or
where is the center of the distribution of the sample of
observations.
24
Mean vs. Median
The mean is sensitive to extreme values (outliers)
25
Mean vs. Median
The mean is sensitive to extreme values (outliers).
The above distribution is called right-skewed since the mean is
greater than the median.
26
Measures of Centrality: computation
How hard (in terms of algorithmic complexity) is it to calculate
• the mean
• the median
27
Measures of Centrality: Computation Time
How hard (in terms of algorithmic complexity) is it to calculate
• the mean: at most O(n)
• the median: at least O(n log n)
Note: Practicality of implementation should be considered!
28
Regarding Categorical Variables...
For categorical variables, neither mean or median make sense.
Why?
The mode might be a better way to find the most “representative”
value.
29
Measures of Spread: Range
The spread of a sample of observations measures how well the
mean or median describes the sample.
One way to measure spread of a sample of observations is via the
range.
Range = Maximum Value - Minimum Value
30
Measures of Spread: Variance
The (sample) variance, denoted s2, measures how much on
average the sample values deviate from the mean:
s2
=
1
n − 1
n
i=1
|xi − ¯x|2
Note: the term |xi − ¯x| measures the amount by which each xi
deviates from the mean ¯x. Squaring these deviations means that
s2 is sensitive to extreme values (outliers).
Note: s2 doesnt have the same units as the xi :(
What does a variance of 1,008 mean? Or 0.0001?
31
Measures of Spread: Standard Deviation
The (sample) standard deviation, denoted s, is the square root of
the variance
s =
√
s2 =
1
n − 1
n
i=1
|xi − ¯x|2
Note: s does have the same units as the xi . Phew!
32
Anscombe’s Data
The following four data sets comprise the Anscombes Quartet; all
four sets of data have identical simple summary statistics.
33
Anscombe’s Data 2
Summary statistics clearly don’t tell the story of how they differ.
But a picture can be worth a thousand words:
34
More Visualization Motivation
If I tell you that the average score for Homework 0 is: 7.64/15.
What does that suggest?
35
More Visualization Motivation
If I tell you that the average score for Homework 0 is: 7.64/15.
What does that suggest?
And what does the graph suggest?
35
More Visualization Motivation
Visualizations help us to analyze and explore the data. They help
to:
• Identify hidden patterns and trends
• Formulate/test hypotheses
• Communicate any modeling results
• Present information and ideas succinctly
• Provide evidence and support
• Influence and persuade
• Determine the next step in analysis/modeling
36
Principles of Visualizations
Some basic data visualization guidelines from Edward Tufte:
1. Maximize data to ink ratio: show the data
Good Better
37
Principles of Visualizations
Some basic data visualization guidelines from Edward Tufte:
1. Maximize data to ink ratio: show the data
2. Dont lie with scale: minimize size of effect in graph
size of effect in data (Lie Factor)
Bad Better
38
Principles of Visualizations
Some basic data visualization guidelines from Edward Tufte:
1. Maximize data to ink ratio: show the data
2. Dont lie with scale: minimize size of effect in graph
size of effect in data (Lie Factor)
3. Minimize chart-junk: show data variation, not design variation
Bad Better
39
Principles of Visualizations
Some basic data visualization guidelines from Edward Tufte:
1. Maximize data to ink ratio: show the data
2. Dont lie with scale: minimize size of effect in graph
size of effect in data (Lie Factor)
3. Minimize chart-junk: show data variation, not design variation
4. Clear, detailed and thorough labeling
More Tufte details can be found here:
http://www2.cs.uregina.ca/ rbm/cs100/notes/spreadsheets/tufte paper.html
40
Types of Visualizations
What do you want your visualization to show about your data?
• Distribution: how a variable or variables in the dataset
distribute over a range of possible values.
• Relationship: how the values of multiple variables in the
dataset relate
• Composition: how the dataset breaks down into subgroups
• Comparison: how trends in multiple variable or datasets
compare
41
Histograms to visualize distribution
A histogram is a way to visualize how 1-dimensional data is
distributed across certain values.
Note: Trends in histograms are sensitive to number of bins.
42
Scatter plots to visualize relationships
A scatter plot is a way to visualize how multi-dimensional data
are distributed across certain values.
A scatter plot is also a way to visualize the relationship between
two different attributes of multi-dimensional data.
43
Pie chart for a categorical variable
A pie chart is a way to visualize the static composition (aka,
distribution) of a variable (or single group).
44
Stacked area graph to show trend over time
A stacked area graph is a way to visualize the composition of a
group as it changes over time (or some other quantitative
variable). This shows the relationship of a categorical variable
(AgeGroup) to a quantitative variable (year).
45
Multiple histograms
Plotting multiple histograms and/or distribution curves on the
same axes is a way to visualize how different variables compare (or
how a variable differs over specific groups)
46
Boxplots
A boxplot is a simplified visualization to compare a quantitative
variable across groups. It highlights the range, quartiles, median
and any outliers present in a data set.
47
[Not] Anything is possible!
Often your dataset seem too complex to visualize:
• Data is too high dimensional (how do you plot 100 variables
on the same set of axes?)
• Some variables are categorical (how do you plot values like
Cat or No?)
48
More dimensions not always better
When the data is high dimensional, a scatter plot of all data
attributes can be impossible or unhelpful.
49
Reducing complexity
Relationships may be easier to spot by producing multiple plots of
lower dimensionality.
50
Adding a dimension
For 3D data, color coding a categorical attribute can be effective.
The above visualizes a set of Iris measurements. The variables are:
petal length, sepal length, Iris type (setosa, versicolor, virginica).
51
3D can work
For 3D data, a quantitative attribute can be encoded by size in a
bubble chart.
The above visualizes a set of consumer products. The variables
are: revenue, consumer rating, product type and product cost.
52
An Example
An example
Use some simple visualizations to explore the following dataset:
53
An example
A bar graph showing resistance of each bacteria to each drug:
What do you notice?
54
An example
Bar graph showing resistance of each bacteria to each drug
(grouped by Group Number):
Now what do you notice?
55
An example
Scatter plot of Drug #1 vs Drug #3 resistance:
Key: the process of data exploration is iterative (visualize for
trends, re-visualize to confirm)!
56
Quantifying Relationships
We can see that birth weight is positively correlated with femur
length.
Can we quantify exactly how they are correlated? Can we predict
birthweight based on femur length (or vice versa) through a
statistical model?
57
Prediction
We can see that types of iris seem to be distinguished by petal and
sepal lengths.
Can we predict the type of iris given petal and sepal lengths
through some sort of statistical model?
58

More Related Content

What's hot

SPSS statistics - how to use SPSS
SPSS statistics - how to use SPSSSPSS statistics - how to use SPSS
SPSS statistics - how to use SPSS
csula its training
 
Spss
SpssSpss
SPSS an intro...
SPSS an intro...SPSS an intro...
SPSS an intro...
Jithin Zcs
 
Evaluation Spss
Evaluation SpssEvaluation Spss
Evaluation Spss
jackng
 
Introduction To Spss - Opening Data File and Descriptive Analysis
Introduction To Spss - Opening Data File and Descriptive AnalysisIntroduction To Spss - Opening Data File and Descriptive Analysis
Introduction To Spss - Opening Data File and Descriptive Analysis
Dr Ali Yusob Md Zain
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
Andrea Wiggins
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
rkalidasan
 
Spss as a research tool
Spss  as a research tool Spss  as a research tool
Spss as a research tool
Integral university, Lucknow
 
Spss and software Application
Spss and software ApplicationSpss and software Application
Spss and software Application
Ashok Pandey
 
Uses of SPSS and Excel to analyze data
Uses of SPSS and Excel   to analyze dataUses of SPSS and Excel   to analyze data
Uses of SPSS and Excel to analyze data
Kudrat-E- Khoda(Prince)
 
introduction to spss
introduction to spssintroduction to spss
introduction to spss
Omid Minooee
 
Statistical software packages
Statistical software packagesStatistical software packages
Statistical software packages
Km Ashif
 
Application of spss usha (1)
Application of spss usha (1)Application of spss usha (1)
Application of spss usha (1)
Rajat Kumar Pandeya
 
SPSS
SPSSSPSS
Final spss hands on training (descriptive analysis) may 24th 2013
Final spss  hands on training (descriptive analysis) may 24th 2013Final spss  hands on training (descriptive analysis) may 24th 2013
Final spss hands on training (descriptive analysis) may 24th 2013
Tin Myo Han
 
SPSS
SPSSSPSS
Preprocessing
PreprocessingPreprocessing
Preprocessing
mmuthuraj
 
softwares in public health
softwares in public healthsoftwares in public health
softwares in public health
Pragyan Parija
 
What Is the Use of SPSS in Data Analysis
What Is the Use of SPSS in Data AnalysisWhat Is the Use of SPSS in Data Analysis
What Is the Use of SPSS in Data Analysis
SPSSResearch
 

What's hot (19)

SPSS statistics - how to use SPSS
SPSS statistics - how to use SPSSSPSS statistics - how to use SPSS
SPSS statistics - how to use SPSS
 
Spss
SpssSpss
Spss
 
SPSS an intro...
SPSS an intro...SPSS an intro...
SPSS an intro...
 
Evaluation Spss
Evaluation SpssEvaluation Spss
Evaluation Spss
 
Introduction To Spss - Opening Data File and Descriptive Analysis
Introduction To Spss - Opening Data File and Descriptive AnalysisIntroduction To Spss - Opening Data File and Descriptive Analysis
Introduction To Spss - Opening Data File and Descriptive Analysis
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
 
Spss as a research tool
Spss  as a research tool Spss  as a research tool
Spss as a research tool
 
Spss and software Application
Spss and software ApplicationSpss and software Application
Spss and software Application
 
Uses of SPSS and Excel to analyze data
Uses of SPSS and Excel   to analyze dataUses of SPSS and Excel   to analyze data
Uses of SPSS and Excel to analyze data
 
introduction to spss
introduction to spssintroduction to spss
introduction to spss
 
Statistical software packages
Statistical software packagesStatistical software packages
Statistical software packages
 
Application of spss usha (1)
Application of spss usha (1)Application of spss usha (1)
Application of spss usha (1)
 
SPSS
SPSSSPSS
SPSS
 
Final spss hands on training (descriptive analysis) may 24th 2013
Final spss  hands on training (descriptive analysis) may 24th 2013Final spss  hands on training (descriptive analysis) may 24th 2013
Final spss hands on training (descriptive analysis) may 24th 2013
 
SPSS
SPSSSPSS
SPSS
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
softwares in public health
softwares in public healthsoftwares in public health
softwares in public health
 
What Is the Use of SPSS in Data Analysis
What Is the Use of SPSS in Data AnalysisWhat Is the Use of SPSS in Data Analysis
What Is the Use of SPSS in Data Analysis
 

Similar to EDA

Week_2_Lecture.pdf
Week_2_Lecture.pdfWeek_2_Lecture.pdf
Week_2_Lecture.pdf
AlbertoLugoGonzalez
 
Pelatihan Data Analitik
Pelatihan Data AnalitikPelatihan Data Analitik
Pelatihan Data Analitik
John Sihotang, Dr, MM, Ir
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
TanujaSomvanshi1
 
classIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptxclassIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptx
XICSStudents
 
Lec 3.pptx
Lec 3.pptxLec 3.pptx
Lec 3.pptx
AliAkbar99386
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
AmmarAhmedSiddiqui2
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptx
PratikshaSurve4
 
EDA-Unit 1.pdf
EDA-Unit 1.pdfEDA-Unit 1.pdf
EDA-Unit 1.pdf
Nirmalavenkatachalam
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01
Henock Beyene
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSIS
BabasID2
 
WEEK-1-IS-20022023-094301am.pdf
WEEK-1-IS-20022023-094301am.pdfWEEK-1-IS-20022023-094301am.pdf
WEEK-1-IS-20022023-094301am.pdf
MdDahri
 
Data Science 1.pdf
Data Science 1.pdfData Science 1.pdf
Data Science 1.pdf
ArchanaArya17
 
Data Analysis Toolkit_Final v1.0
Data Analysis Toolkit_Final v1.0Data Analysis Toolkit_Final v1.0
Data Analysis Toolkit_Final v1.0
lee_anderson40
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
Venkata Reddy Konasani
 
Data analysis
Data analysisData analysis
Data analysis
Titus Mutambu Mweta
 
IDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notesIDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notes
AnkurTiwari813070
 
Data Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityData Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & Normality
Ikbal Ahmed
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
wekineheshete
 
Data Analysis
Data AnalysisData Analysis
Data_Preparation_Modeling_Evaluation.ppt
Data_Preparation_Modeling_Evaluation.pptData_Preparation_Modeling_Evaluation.ppt
Data_Preparation_Modeling_Evaluation.ppt
AronMozart1
 

Similar to EDA (20)

Week_2_Lecture.pdf
Week_2_Lecture.pdfWeek_2_Lecture.pdf
Week_2_Lecture.pdf
 
Pelatihan Data Analitik
Pelatihan Data AnalitikPelatihan Data Analitik
Pelatihan Data Analitik
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
classIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptxclassIX_DS_Teacher_Presentation.pptx
classIX_DS_Teacher_Presentation.pptx
 
Lec 3.pptx
Lec 3.pptxLec 3.pptx
Lec 3.pptx
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptx
 
EDA-Unit 1.pdf
EDA-Unit 1.pdfEDA-Unit 1.pdf
EDA-Unit 1.pdf
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSIS
 
WEEK-1-IS-20022023-094301am.pdf
WEEK-1-IS-20022023-094301am.pdfWEEK-1-IS-20022023-094301am.pdf
WEEK-1-IS-20022023-094301am.pdf
 
Data Science 1.pdf
Data Science 1.pdfData Science 1.pdf
Data Science 1.pdf
 
Data Analysis Toolkit_Final v1.0
Data Analysis Toolkit_Final v1.0Data Analysis Toolkit_Final v1.0
Data Analysis Toolkit_Final v1.0
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
Data analysis
Data analysisData analysis
Data analysis
 
IDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notesIDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notes
 
Data Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityData Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & Normality
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Data_Preparation_Modeling_Evaluation.ppt
Data_Preparation_Modeling_Evaluation.pptData_Preparation_Modeling_Evaluation.ppt
Data_Preparation_Modeling_Evaluation.ppt
 

More from VuTran231

4 work readiness and reflection on wr and pol
4 work readiness and reflection on wr and pol4 work readiness and reflection on wr and pol
4 work readiness and reflection on wr and pol
VuTran231
 
2 project-oriented learning examined
2 project-oriented learning examined2 project-oriented learning examined
2 project-oriented learning examined
VuTran231
 
2 blended learning directions
2 blended learning directions2 blended learning directions
2 blended learning directions
VuTran231
 
4 jigsaw review
4 jigsaw review4 jigsaw review
4 jigsaw review
VuTran231
 
2 learning alone and together
2  learning alone and together2  learning alone and together
2 learning alone and together
VuTran231
 
Cs229 notes11
Cs229 notes11Cs229 notes11
Cs229 notes11
VuTran231
 
Cs229 notes10
Cs229 notes10Cs229 notes10
Cs229 notes10
VuTran231
 
Cs229 notes9
Cs229 notes9Cs229 notes9
Cs229 notes9
VuTran231
 
Cs229 notes8
Cs229 notes8Cs229 notes8
Cs229 notes8
VuTran231
 
Cs229 notes7b
Cs229 notes7bCs229 notes7b
Cs229 notes7b
VuTran231
 
Cs229 notes7a
Cs229 notes7aCs229 notes7a
Cs229 notes7a
VuTran231
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12
VuTran231
 
Cs229 notes-deep learning
Cs229 notes-deep learningCs229 notes-deep learning
Cs229 notes-deep learning
VuTran231
 
Cs229 notes4
Cs229 notes4Cs229 notes4
Cs229 notes4
VuTran231
 

More from VuTran231 (14)

4 work readiness and reflection on wr and pol
4 work readiness and reflection on wr and pol4 work readiness and reflection on wr and pol
4 work readiness and reflection on wr and pol
 
2 project-oriented learning examined
2 project-oriented learning examined2 project-oriented learning examined
2 project-oriented learning examined
 
2 blended learning directions
2 blended learning directions2 blended learning directions
2 blended learning directions
 
4 jigsaw review
4 jigsaw review4 jigsaw review
4 jigsaw review
 
2 learning alone and together
2  learning alone and together2  learning alone and together
2 learning alone and together
 
Cs229 notes11
Cs229 notes11Cs229 notes11
Cs229 notes11
 
Cs229 notes10
Cs229 notes10Cs229 notes10
Cs229 notes10
 
Cs229 notes9
Cs229 notes9Cs229 notes9
Cs229 notes9
 
Cs229 notes8
Cs229 notes8Cs229 notes8
Cs229 notes8
 
Cs229 notes7b
Cs229 notes7bCs229 notes7b
Cs229 notes7b
 
Cs229 notes7a
Cs229 notes7aCs229 notes7a
Cs229 notes7a
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12
 
Cs229 notes-deep learning
Cs229 notes-deep learningCs229 notes-deep learning
Cs229 notes-deep learning
 
Cs229 notes4
Cs229 notes4Cs229 notes4
Cs229 notes4
 

Recently uploaded

TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
HODECEDSIET
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
NazakatAliKhoso2
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
zubairahmad848137
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
rpskprasana
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
KrishnaveniKrishnara1
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
MIGUELANGEL966976
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
RadiNasr
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
mahammadsalmanmech
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
Aditya Rajan Patra
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
gerogepatton
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 

Recently uploaded (20)

TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
Textile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdfTextile Chemical Processing and Dyeing.pdf
Textile Chemical Processing and Dyeing.pdf
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
 
Recycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part IIRecycled Concrete Aggregate in Construction Part II
Recycled Concrete Aggregate in Construction Part II
 
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELDEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 

EDA

  • 1. Lecture #1: Exploratory Data Analysis CS 109A, STAT 121A, AC 209A Pavlos Protopapas Kevin Rader Margo Levine Rahul Dave 1
  • 2. Lecture Outline What are Data? Data Exploration Descriptive Statistics Visualizations An Example 2
  • 4. The Data Science Process Recall the data science process. • Ask questions • Data Collection • Data Exploration • Data Modeling • Data Analysis • Visualization and Presentation of Results Today we will begin introducing the data collection and data exploration steps. 3
  • 5. The Data Science Process 4
  • 6. What are data? “A datum is a single measurement of something on a scale that is understandable to both the recorder and the reader. Data are multiple such measurements.” Claim: everything is (can be) data! 5
  • 7. Where do data come from? • Internal sources: already collected by or is part of the overall data collection of you organization. For example: business-centric data that is available in the organization data base to record day to day operations; scientific or experimental data • Existing External Sources: available in ready to read format from an outside source for free or for a fee. For example: public government databases, stock market data, Yelp reviews • External Sources Requiring Collection Efforts: available from external source but acquisition requires special processing. For example: data appearing only in print form, or data on websites 6
  • 8. Where do data come from? How to get data generated, published or hosted online: • API (Application Programming Interface): using a prebuilt set of functions developed by a company to access their services. Often pay to use. For example: Google Map API, Facebook API, Twitter API • RSS (Rich Site Summary): summarizes frequently updated online content in standard format. Free to read if the site has one. For example: news-related sites, blogs • Web scraping: using software, scripts or by-hand extracting data from what is displayed on a page or what is contained in the HTML file. 7
  • 9. Web Scraping • Why do it? Older government or smaller news sites might not have APIs for accessing data, or publish RSS feeds or have databases for download. You dont want to pay to use the API or the database. • How do you do it? See HW1 • Should you do it? • You just want to explore: Are you violating their terms of service? Privacy concerns for website and their clients? • You want to publish your analysis or product: Do they have an API or fee that you are bypassing? Are they willing to share this data? Are you violating their terms of service? Are there privacy concerns? 8
  • 10. Types of Data What kind of values are in your data (data types)? Simple or atomic: • Numeric: integers, floats • Boolean: binary or true false values • Strings: sequence of symbols 9
  • 11. Types of Data What kind of values are in your data (data types)? Compound, composed of a bunch of atomic types: • Date and time: compound value with a specific structure • Lists: a list is a sequence of values • Dictionaries: A dictionary is a collection of key-value pairs, a pair of values x : y where x is usually a string called key representing the “name” of the value, and y is a value of any type. Example: Student record • First: Kevin • Last: Rader • Classes: [CS109A, STAT121A, AC209A, STAT139] 10
  • 12. How is your data represented and stored (data format)? • Tabular Data: a dataset that is a two-dimensional table, where each row typically represents a single data record, and each column represents one type of measurement (csv, tsp, xlsx etc.). • Structured Data: each data record is presented in a form of a, possibly complex and multi-tiered, dictionary (json, xml etc.) • Semistructured Data: not all records are represented by the same set of keys or some data records are not represented using the key-value pair structure. 11
  • 13. How is your data represented and stored (data format)? • Textual Data • Temporal Data • Geolocation Data 12
  • 14. Tabular Data In tabular data, we expect each record or observation to represent a set of measurements of a single object or event. We’ve seen this already in Lecture 0: Each type of measurement is called a variable or an attribute of the data (e.g. seq id, status and duration are variables or attributes). The number of attributes is called the dimension. We expect each table to contain a set of records or observations of the same kind of object or event (e.g. our table above contains observations of rides/checkouts). 13
  • 15. Types of Data We’ll see later that its important to distinguish between classes of variables or attributes based on the type of values they can take on. • Quantitative variable: is numerical and can be either: • discrete - a finite number of values are possible in any bounded interval. For example: “Number of siblings” is a discrete variable • continuous - an infinite number of values are possible in any bounded interval For example: “Height” is a continuous variable • Categorical variable: no inherent order among the values For example: “What kind of pet you have” is a categorical variable 14
  • 16. Common Issues Common issues with data: • Missing values: how do we fill in? • Wrong values: how can we detect and correct? • Messy format • Not usable: the data cannot answer the question posed 15
  • 17. Handling Messy Data The following is a table accounting for the number of produce deliveries over a weekend. What are the variables in this dataset? What object or event are we measuring? Friday Saturday Sunday Morning 15 158 10 Afternoon 2 90 20 Evening 55 12 45 What’s the issue? How do we fix it? 16
  • 18. Handling Messy Data Were measuring individual deliveries; the variables are Time, Day, Number of Produce. Friday Saturday Sunday Morning 15 158 10 Afternoon 2 90 20 Evening 55 12 45 Problem: each column header represents a single value rather than a variable. Row headers are “hiding” the Day variable. The values of the variable,“Number of Produce”, is not recorded in a single column. 17
  • 19. Handling Messy Data We need to reorganize the information to make explicit the event were observing and the variables associated to this event. ID Time Day Number 1 Morning Friday 15 2 Morning Saturday 158 3 Morning Sunday 10 4 Afternoon Friday 2 5 Afternoon Saturday 9 6 Afternoon Sunday 20 7 Evening Friday 55 8 Evening Saturday 12 9 Evening Sunday 45 18
  • 20. More Messiness What object or event are we measuring? What are the variables in this dataset? How do we fix? 19
  • 21. More Messiness Were measuring individual deliveries; the variables are Time, Day, Number of Produce: 20
  • 22. Common causes of messiness are: • Column headers are values, not variable names • Variables are stored in both rows and columns • Multiple variables are stored in one column • Multiple types of experimental units stored in same table In general, we want each file to correspond to a dataset, each column to represent a single variable and each row to represent a single observation. We want to tabularize the data. This makes Python happy. 21
  • 24. Basics of Sampling Population versus sample: • A population is the entire set of objects or events under study. Population can be hypothetical “all students” or all students in this class. • A sample is a “representative” subset of the objects or events under study. Needed because its impossible or intractable to obtain or compute with population data. Biases in samples: • Selection bias: some subjects or records are more likely to be selected • Volunteer/nonresponse bias: subjects or records who are not easily available are not represented Examples? 22
  • 25. Sample mean The mean of a set of n observations of a variable is denoted ¯x and is defined as: ¯x = x1 + x2 + ... + xn n = 1 n n i=1 xi The mean describes what a “typical” sample value looks like, or where is the “center” of the distribution of the data. Key theme: there is always uncertainty involved when calculating a sample mean to estimate a population mean. 23
  • 26. Sample median The median of a set of n number of observations in a sample, ordered by value, of a variable is is defined by Median = x(n+1)/2 if n is odd xn/2+x(n+1)/2 2 if n is even Example (already in order): Ages: 17, 19, 21, 22, 23, 23, 23, 38 Median = (22+23)/2 = 22.5 The median also describes what a typical observation looks like, or where is the center of the distribution of the sample of observations. 24
  • 27. Mean vs. Median The mean is sensitive to extreme values (outliers) 25
  • 28. Mean vs. Median The mean is sensitive to extreme values (outliers). The above distribution is called right-skewed since the mean is greater than the median. 26
  • 29. Measures of Centrality: computation How hard (in terms of algorithmic complexity) is it to calculate • the mean • the median 27
  • 30. Measures of Centrality: Computation Time How hard (in terms of algorithmic complexity) is it to calculate • the mean: at most O(n) • the median: at least O(n log n) Note: Practicality of implementation should be considered! 28
  • 31. Regarding Categorical Variables... For categorical variables, neither mean or median make sense. Why? The mode might be a better way to find the most “representative” value. 29
  • 32. Measures of Spread: Range The spread of a sample of observations measures how well the mean or median describes the sample. One way to measure spread of a sample of observations is via the range. Range = Maximum Value - Minimum Value 30
  • 33. Measures of Spread: Variance The (sample) variance, denoted s2, measures how much on average the sample values deviate from the mean: s2 = 1 n − 1 n i=1 |xi − ¯x|2 Note: the term |xi − ¯x| measures the amount by which each xi deviates from the mean ¯x. Squaring these deviations means that s2 is sensitive to extreme values (outliers). Note: s2 doesnt have the same units as the xi :( What does a variance of 1,008 mean? Or 0.0001? 31
  • 34. Measures of Spread: Standard Deviation The (sample) standard deviation, denoted s, is the square root of the variance s = √ s2 = 1 n − 1 n i=1 |xi − ¯x|2 Note: s does have the same units as the xi . Phew! 32
  • 35. Anscombe’s Data The following four data sets comprise the Anscombes Quartet; all four sets of data have identical simple summary statistics. 33
  • 36. Anscombe’s Data 2 Summary statistics clearly don’t tell the story of how they differ. But a picture can be worth a thousand words: 34
  • 37. More Visualization Motivation If I tell you that the average score for Homework 0 is: 7.64/15. What does that suggest? 35
  • 38. More Visualization Motivation If I tell you that the average score for Homework 0 is: 7.64/15. What does that suggest? And what does the graph suggest? 35
  • 39. More Visualization Motivation Visualizations help us to analyze and explore the data. They help to: • Identify hidden patterns and trends • Formulate/test hypotheses • Communicate any modeling results • Present information and ideas succinctly • Provide evidence and support • Influence and persuade • Determine the next step in analysis/modeling 36
  • 40. Principles of Visualizations Some basic data visualization guidelines from Edward Tufte: 1. Maximize data to ink ratio: show the data Good Better 37
  • 41. Principles of Visualizations Some basic data visualization guidelines from Edward Tufte: 1. Maximize data to ink ratio: show the data 2. Dont lie with scale: minimize size of effect in graph size of effect in data (Lie Factor) Bad Better 38
  • 42. Principles of Visualizations Some basic data visualization guidelines from Edward Tufte: 1. Maximize data to ink ratio: show the data 2. Dont lie with scale: minimize size of effect in graph size of effect in data (Lie Factor) 3. Minimize chart-junk: show data variation, not design variation Bad Better 39
  • 43. Principles of Visualizations Some basic data visualization guidelines from Edward Tufte: 1. Maximize data to ink ratio: show the data 2. Dont lie with scale: minimize size of effect in graph size of effect in data (Lie Factor) 3. Minimize chart-junk: show data variation, not design variation 4. Clear, detailed and thorough labeling More Tufte details can be found here: http://www2.cs.uregina.ca/ rbm/cs100/notes/spreadsheets/tufte paper.html 40
  • 44. Types of Visualizations What do you want your visualization to show about your data? • Distribution: how a variable or variables in the dataset distribute over a range of possible values. • Relationship: how the values of multiple variables in the dataset relate • Composition: how the dataset breaks down into subgroups • Comparison: how trends in multiple variable or datasets compare 41
  • 45. Histograms to visualize distribution A histogram is a way to visualize how 1-dimensional data is distributed across certain values. Note: Trends in histograms are sensitive to number of bins. 42
  • 46. Scatter plots to visualize relationships A scatter plot is a way to visualize how multi-dimensional data are distributed across certain values. A scatter plot is also a way to visualize the relationship between two different attributes of multi-dimensional data. 43
  • 47. Pie chart for a categorical variable A pie chart is a way to visualize the static composition (aka, distribution) of a variable (or single group). 44
  • 48. Stacked area graph to show trend over time A stacked area graph is a way to visualize the composition of a group as it changes over time (or some other quantitative variable). This shows the relationship of a categorical variable (AgeGroup) to a quantitative variable (year). 45
  • 49. Multiple histograms Plotting multiple histograms and/or distribution curves on the same axes is a way to visualize how different variables compare (or how a variable differs over specific groups) 46
  • 50. Boxplots A boxplot is a simplified visualization to compare a quantitative variable across groups. It highlights the range, quartiles, median and any outliers present in a data set. 47
  • 51. [Not] Anything is possible! Often your dataset seem too complex to visualize: • Data is too high dimensional (how do you plot 100 variables on the same set of axes?) • Some variables are categorical (how do you plot values like Cat or No?) 48
  • 52. More dimensions not always better When the data is high dimensional, a scatter plot of all data attributes can be impossible or unhelpful. 49
  • 53. Reducing complexity Relationships may be easier to spot by producing multiple plots of lower dimensionality. 50
  • 54. Adding a dimension For 3D data, color coding a categorical attribute can be effective. The above visualizes a set of Iris measurements. The variables are: petal length, sepal length, Iris type (setosa, versicolor, virginica). 51
  • 55. 3D can work For 3D data, a quantitative attribute can be encoded by size in a bubble chart. The above visualizes a set of consumer products. The variables are: revenue, consumer rating, product type and product cost. 52
  • 57. An example Use some simple visualizations to explore the following dataset: 53
  • 58. An example A bar graph showing resistance of each bacteria to each drug: What do you notice? 54
  • 59. An example Bar graph showing resistance of each bacteria to each drug (grouped by Group Number): Now what do you notice? 55
  • 60. An example Scatter plot of Drug #1 vs Drug #3 resistance: Key: the process of data exploration is iterative (visualize for trends, re-visualize to confirm)! 56
  • 61. Quantifying Relationships We can see that birth weight is positively correlated with femur length. Can we quantify exactly how they are correlated? Can we predict birthweight based on femur length (or vice versa) through a statistical model? 57
  • 62. Prediction We can see that types of iris seem to be distinguished by petal and sepal lengths. Can we predict the type of iris given petal and sepal lengths through some sort of statistical model? 58