The document discusses basic descriptive quantitative data analysis techniques such as tables, graphs, and summary statistics. It covers topics like frequency distributions, contingency tables, bar graphs, pie charts, and measures of central tendency and variation. The objectives are to learn how to perform these analyses in Excel and how they are useful for understanding complex quantitative data and communicating findings to others. Employers value these types of quantitative and data visualization skills.
2. Objectives
Learn about basic descriptive quantitative analysis
How to perform these tasks in Excel
Starting point for 502B
Excel knowledge and quantitative skills are highly desired by
Employers
EC stream
3. Introduction
Without data, it is anyone’s opinion
Why use tables, graphs, summary stats?
“At their best, tables, graphs, and statistics are instruments for reasoning about complex quantitative information.”
Why learn how to design them appropriately?
“At their worst, tables, graphs and summary statistics are instruments of evil used for deceiving a naive viewer.”
Does your mindset match my dataset!
http://www.ted.com/talks/hans_rosling_at_state.html
7. Frequency Distribution
A convenient way of summarizing a lot of tabular data
What is a Frequency Distribution?
A frequency distribution is a list or a table containing class groupings (categories or ranges within which the data fall) and the corresponding frequencies with which data fall within each class or category
For nominal/ordinal data
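A frequency distribution of this kind can be sketched in a few lines of Python. This is only an illustration: the `responses` list and the `frequency_distribution` helper are hypothetical and are not part of the original slides, which use Excel.

```python
from collections import Counter

# Hypothetical survey answers, invented for illustration (not WBES data).
responses = ["100%", "90-99%", "100%", "<50%", "100%", "90-99%", "<50%", "100%"]


def frequency_distribution(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 2)) for cat, n in counts.most_common()]


for category, count, percent in frequency_distribution(responses):
    print(f"{category:>7}  {count:>3}  {percent:>6.2f}%")
```

Each row pairs a category with its count and its share of the total, which is the same layout as a univariate frequency table.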
9. Table 1
Univariate Frequencies of Percentage of Sales Reported to Tax Authorities
% of Sales Reported    Frequency    Percent (%)
100%                   3307         40.56
90-99%                 1096         13.44
80-89%                 916          11.24
70-79%                 703          8.62
60-69%                 501          6.14
50-59%                 694          8.51
<50%                   936          11.48
Total                  8153         100
Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations
http://www.enterprisesurveys.org/
10. Contingency/Pivot/Cross Table
May also want to produce a table with more categories
Cross table or Contingency table or Pivot table
Suitable if you have two nominal/ordinal variables
Simple extension to a univariate table
Considers relationship between two variables
Row variable (Dependent)
Column variable (Independent)
11. Table 2
Percentage of Sales Reported to Tax Authorities by Region
% of Sales Reported  Africa  Transition Europe  Asia   Latin America  OECD  Former Soviet Countries  Total
100%                 490     554                416    794            446   607                      3,307
90-99%               266     196                142    119            145   228                      1,096
80-89%               158     152                117    192            73    224                      916
70-79%               162     117                103    153            43    125                      703
60-69%               140     69                 70     115            22    85                       501
50-59%               140     105                141    118            16    174                      694
<50%                 100     106                283    296            25    126                      936
Total                1,456   1,299              1,272  1,787          770   1,569                    8,153
Source: 1999 World Bank World Business Environment Survey (WBES)
* Excludes missing observations
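One way to read a contingency table is to convert the counts to column percentages, so that regions of different sizes can be compared. The sketch below does this for the Table 2 counts. The helper names `column_totals` and `column_percentages` are assumptions made for illustration, and the slides themselves do this in Excel.

```python
regions = ["Africa", "Transition Europe", "Asia", "Latin America",
           "OECD", "Former Soviet Countries"]

# Counts from Table 2: rows are % of sales reported, columns follow `regions`.
table2 = {
    "100%":   [490, 554, 416, 794, 446, 607],
    "90-99%": [266, 196, 142, 119, 145, 228],
    "80-89%": [158, 152, 117, 192, 73, 224],
    "70-79%": [162, 117, 103, 153, 43, 125],
    "60-69%": [140, 69, 70, 115, 22, 85],
    "50-59%": [140, 105, 141, 118, 16, 174],
    "<50%":   [100, 106, 283, 296, 25, 126],
}


def column_totals(table):
    """Sum each column (region) across all rows."""
    return [sum(col) for col in zip(*table.values())]


def column_percentages(table):
    """Express every cell as a percentage of its column total."""
    totals = column_totals(table)
    return {row: [round(100 * n / t, 1) for n, t in zip(counts, totals)]
            for row, counts in table.items()}


totals = column_totals(table2)
shares = column_percentages(table2)
for region, pct in zip(regions, shares["100%"]):
    print(f"{region}: {pct}% of firms report 100% of sales")
```

The computed column totals should match the Total row of the table, which is a quick check that the counts were entered correctly.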
12. Features of a Table
Title that accurately summarizes the data
Simple, indicates major variables, and time frame (if applicable)
Source: data set or origin of table
Explanatory footnotes
Easy to read & separated from text
Properly formatted for style (see APA Rules)
Necessary to advance analysis
See Module 7 for APA Table Checklist
Reproduced from APA manual
14. Bar Graph
Often used to describe categorical data
Ordinal/Nominal
Draws attention to the frequency of each category
16. Bar Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
18. Pie Graph
Emphasizes the proportion of each category
Something that may be good for our tax evasion data
Circle represents the total
Segments represent the shares of the total
Segment size is proportional to frequency
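The proportionality rule can be made concrete: each segment's angle is its share of the total multiplied by 360 degrees. A minimal sketch using the Table 1 frequencies follows. The `pie_angles` helper is a hypothetical name introduced here for illustration.

```python
# Frequencies from Table 1 (% of sales reported to tax authorities).
frequencies = {
    "100%": 3307, "90-99%": 1096, "80-89%": 916, "70-79%": 703,
    "60-69%": 501, "50-59%": 694, "<50%": 936,
}


def pie_angles(freqs):
    """Map each category to its segment angle in degrees (full circle = 360)."""
    total = sum(freqs.values())
    return {cat: 360 * n / total for cat, n in freqs.items()}


angles = pie_angles(frequencies)
for category, angle in angles.items():
    print(f"{category:>7}: {angle:6.1f} degrees")
```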
19. Pie Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
24. Bar Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
25. Segmented Bar Chart
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
26. Pie Graph
Figure 2
Percentage of sales reported to tax authority by region
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
29. Time Series Graph
Time series are often used in social sciences
Data collected at various time periods: daily, weekly, monthly, quarterly, annually, etc.
Examples include GDP, Unemployment, University Tuition
Plot series of interest over time
Let’s look at a graph of the unemployment rate by gender and age
31. Histogram
Used for continuous data
Frequency distribution for continuous data
Summary graph showing a count of the data points falling in various ranges
Rough approximation of the distribution of the data
A histogram is a way to summarize data
The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data
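The counting step behind a histogram can be sketched as follows: each value is assigned to a fixed-width range (bin) and the values per bin are counted. The data, the bin width, and the `histogram` helper are all assumptions made for illustration.

```python
def histogram(values, bin_width, start):
    """Count values per bin [lower, lower + bin_width), keyed by lower bound."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)  # index of the bin containing v
        lower = start + k * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical continuous data, e.g. hours worked per week.
hours = [12.5, 37.0, 40.0, 41.5, 38.2, 22.0, 45.0, 39.9, 40.0, 18.3]

bins = histogram(hours, bin_width=10, start=10)
for lower, n in bins.items():
    print(f"{lower:>3}-{lower + 10:<3} {'#' * n}")
```

The choice of bin width changes how the shape looks, so it is worth trying a few widths before settling on one.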
35. Principles of Graphical Excellence
Well-designed presentation of interesting data
Substance & design
Simplicity of design, complexity of data
Proportion and Balance
Clear, precise, efficient
Know what you are trying to show (have a story)
Make sure your graph shows it
Well formatted, professional
Choose format that reflects your data and the story
Informative and legible axis
Fully labelled & legible
Gets across main point(s) in the shortest time with the least ink in the smallest space
Adds information not otherwise available to the reader
But supplemented with text describing the figure
Tells the truth about the data
Limits complexity and confusion
Avoid Chart Junk
36. Examples of Chartjunk
[Figure: two example charts with quarterly categories (1st Qtr to 4th Qtr); legend series: West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, International]
37. Examples of Chartjunk
[Figure: annotated quarterly chart (1st Qtr to 4th Qtr). Annotations: Gridlines!; Vibration; Pointless Fake 3-D Effects; Filled “Floor”; Clip Art; In or out?; Filled “Walls”; Borders and Fills Galore; Unintentional Heavy or Double Lines; Filled Labels; Serif Font with Thin & Thick Lines]
38. Displaying Data: “Mistakes”
Graphs are also instruments of evil used for deceiving a naive viewer.
Non-zero origin
Omitting data that refutes your “evidence”
Limiting scope of data
39. What is Wrong with this Graph?
Provincial Personal Income Taxes
Single individual with $45,000 in income claiming basic personal tax credits
45. Describing Data Numerically
Central Tendency: Simple Arithmetic Mean, Median, Mode
Variation: Variance, Standard Deviation, Range
Association: Covariance, Correlation
Shape of the Distribution
46. Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode or several modes
What are the modes for the displayed data?
[Figure: two dot plots, one on an axis from 0 to 14 and one on an axis from 0 to 6]
47. Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
[Figure: dot plot on an axis from 0 to 14] Mode = 9
[Figure: dot plot on an axis from 0 to 6] No Mode
48. Mode
There may be several modes
[Figure: dot plot on an axis from 0 to 14] Mode = 5 & 9
49. Mode
Caution: Mode may not be representative of the data
{0.1, 0.1, 5000, 4900, 4500, 5200,…}
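The three cases described for the mode (one mode, several modes, no mode) can be checked with Python's `statistics.multimode`. The `modes` wrapper and the sample lists below are hypothetical. In particular, treating "every value occurs equally often" as "no mode" is a convention assumed here to match the slides.

```python
from statistics import multimode


def modes(values):
    """Return all most-frequent values, or [] when all values tie (no mode)."""
    values = list(values)
    result = multimode(values)
    # If every distinct value is equally frequent, report no mode.
    if len(result) > 1 and len(result) == len(set(values)):
        return []
    return result


# Hypothetical samples, invented for illustration.
print(modes([1, 2, 2, 9, 9, 9, 12]))   # a single mode
print(modes([5, 5, 9, 9, 3]))          # two modes
print(modes([0, 1, 2, 3, 4, 5, 6]))    # every value occurs once
```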
50. Median
In an ordered list, the median is the “middle” number (50% above, 50% below)
[Figure: two dot plots on an axis from 0 to 10]
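A short sketch of how the middle value is found: sort the data, then take the middle element, or the average of the two middle elements when the count is even (the usual convention, assumed here). The `median_of` helper and the sample lists are hypothetical.

```python
def median_of(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two


# Hypothetical samples, invented for illustration.
odd_sample = [7, 1, 3, 9, 5]
even_sample = [7, 1, 3, 9, 5, 11]
print(median_of(odd_sample), median_of(even_sample))
```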
51. Mean
The “balancing point” (centre of gravity) of the data
E.g. The data “balances” at 5
[Figure: number line from 1 to 9 with deviations of -2, -1 and +3 from the balancing point]
52. Arithmetic Mean
The arithmetic mean (mean) is the most common measure of central tendency
Calculated by summing the observed values and dividing by the number of observations
For a sample of size n:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where n is the number of observations and x_1, ..., x_n are the observed values
53. Arithmetic Mean
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
What is the mean for these examples?
[Figure: two dot plots on an axis from 0 to 10]
54. Arithmetic Mean
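The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
[Figure: dot plot of the values 1, 2, 3, 4, 5 on an axis from 0 to 10] Mean = 3
$$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$$
[Figure: dot plot of the values 1, 2, 3, 4, 10 on an axis from 0 to 10] Mean = 4
$$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$$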
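The effect of an extreme value on the mean can be reproduced with the two five-value examples from this slide. The sketch uses Python's `statistics` module. The variable names are illustrative, and the median is included only for contrast.

```python
from statistics import mean, median

without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]   # same data, largest value replaced by 10

for label, data in (("without outlier", without_outlier),
                    ("with outlier", with_outlier)):
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

The mean moves from 3 to 4 when one value changes, while the median stays at 3.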
55. Measures of Central Tendency
Overview
Mean: arithmetic average, $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Median: midpoint of ranked values (50% below, 50% above)
Mode: most frequently observed value
56. The “Shape of a Distribution”
Use information on mean, median, and mode to “visualize” the data
A data distribution is said to be symmetric if its shape is the same on both sides of the median
Symmetry implies that median = arithmetic mean
If a distribution is unimodal and symmetric, then median = mean = mode
57. The “Shape of a Distribution”
[Figure: symmetric, unimodal histogram; x-axis: Value (1 to 7), y-axis: # of Obs. (0 to 9); the median splits the data 50% / 50%]
Symmetric: Median = Mean
Symmetric & Unimodal: Median = Mean = Mode
58. The “Shape of a Distribution”
[Figure: symmetric, bimodal histogram; x-axis: Value (1 to 7), y-axis: # of Obs. (0 to 9); the median splits the data 50% / 50%]
Symmetric: Median = Mean
Symmetric & Bimodal: Median = Mean ≠ Mode
59. The “Shape of a Distribution”
[Figure: symmetric histogram with no mode; x-axis: Values (1 to 7), y-axis: # of Obs. (0 to 8); the median splits the data 50% / 50%]
Symmetric: Median = Mean
Symmetric & no mode: Median = Mean (Uniform)
60. The “Shape of a Distribution”
An asymmetric distribution is said to be skewed
1. Negatively if Mean < Median < Mode
2. Positively if Mean > Median > Mode
Hence, by comparing our measures of central tendency, we can start to visualize the shape and characteristics of the data
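The comparison rule above can be sketched as a small check. This is a rule of thumb rather than a formal skewness measure, and the `skew_direction` helper and both samples are hypothetical.

```python
from statistics import mean, median, multimode


def skew_direction(values):
    """Classify skew by comparing mean, median and mode (rule of thumb)."""
    modes = multimode(values)
    if len(modes) != 1:
        return "unclear"  # the rule needs a single mode
    avg, mid, mode = mean(values), median(values), modes[0]
    if mode < mid < avg:
        return "positive"
    if avg < mid < mode:
        return "negative"
    return "unclear"


# Hypothetical samples, invented for illustration.
long_right_tail = [1, 2, 2, 2, 3, 3, 4, 5, 8, 12]
long_left_tail = [1, 5, 8, 9, 10, 10, 11, 11, 11, 12]
print(skew_direction(long_right_tail), skew_direction(long_left_tail))
```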
61. The “Shape of a Distribution”
[Figure: positively skewed histogram; x-axis from 1 to 8, y-axis from 0 to 12; Mode = 2, Median = 3 (50% / 50% split), Mean = 3.2]
Mode < Median < Mean = positively skewed distribution
62. Example: Positively skewed variable
The Distribution of After-Tax Income shows the distribution of income across all Canadian households
63. Example: Positively skewed variable
The mode income is the most common income and was in the range from $15,000 to $19,999.
The median income is the level of income that separates the population into two groups of equal size and was $39,700.
The mean income is the average income and was $48,400.
64. Example: Positively skewed variable
A distribution in which the mean exceeds the median and the median exceeds the mode is positively skewed, which means it has a long tail of high values.
The distribution of income in Canada is positively skewed.
For such data, the median is more likely to be reported than the mean, since the long tail distorts the average.
65. Example: Positively skewed variable
Volunteer hours
Charitable contributions
# of Cigarette packs smoked (excluding 0)
Collective bargaining agreement duration (in years)
# of beers consumed on a Saturday night
Duration of low income (in years)
Number of children
66. The “Shape of a Distribution”
[Figure: negatively skewed histogram; x-axis from 0 to 7, y-axis from 0 to 12; Mode = 6, Median = 5 (50% / 50% split), Mean = 4.7]
Mean < Median < Mode = negatively skewed distribution
69. Measures of Dispersion/Variability
Variation: Variance, Standard Deviation, Range
Measures of variation give information on the spread or variability of the data values.
[Figure: distributions with the same center but different variation]
71. Range
Simplest measure of variation
Difference between the largest and the smallest observations:
Range = Xlargest – Xsmallest
Example: [Figure: observations plotted on a number line from 0 to 14.]
Range = 14 - 1 = 13
72. The Range
Problems:
• Ignores all but two data points
• These values may be “outliers” (i.e. not representative)
73. Disadvantages of the Range
Ignores the way in which data are distributed:
[Figure: two differently shaped distributions over the values 7 to 12, each with Range = 12 - 7 = 5.]
Sensitive to outliers:
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
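The outlier example can be checked with a short Python sketch. The two lists are the ones shown on the slide. The helper name `data_range` is hypothetical, and the median is added only as a point of comparison.

```python
from statistics import median

def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

# The two data sets from the slide; they differ only in the last value.
without_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                   2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
with_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120]

print(data_range(without_outlier))  # 4
print(data_range(with_outlier))     # 119

# The median ignores the size of the extreme value, so it is the same for both.
print(median(without_outlier), median(with_outlier))  # 2 2
```

Changing a single observation moves the range from 4 to 119, while the median stays at 2. This is why the range alone is a fragile summary of spread.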
74. The Variance
• A single summary measure of dispersion would be more helpful
• The variance takes account of all data values
75. The Variance
1. Variance
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
2. Standard Deviation
$$ s = \sqrt{s^2} = \sqrt{\text{variance}} $$
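As a rough sketch of the sample variance and standard deviation defined above, the following Python computes both by hand and cross-checks them against the standard library. The data values and the function name `sample_variance` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical data: mean is 5, squared deviations sum to 32, n - 1 = 7.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)
s = sqrt(s2)  # standard deviation is the square root of the variance
print(s2, s)

# statistics.variance and statistics.stdev also use the n - 1 divisor.
print(variance(data), stdev(data))
```

The hand-written version and the library calls should agree up to floating-point rounding, since both divide by n - 1.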
77. Comparing Standard Deviations
[Figure: three data sets plotted on a number line from 11 to 21, all with the same mean but different spread.]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
79. The Sample Covariance
The covariance measures the strength of the linear relationship between two variables
The sample covariance:
$$ \mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
Only concerned with the strength of the relationship
No causal effect is implied
80. Interpreting Covariance
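Covariance between two variables:
Cov(x,y) > 0: x and y tend to move in the same direction
Cov(x,y) < 0: x and y tend to move in opposite directions
Cov(x,y) = 0: x and y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not establish independence)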
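A minimal Python sketch of the sample covariance and its sign, under the assumption of small made-up data sets (all values and the function name `sample_covariance` are hypothetical). The third case is a caution: a covariance of zero only rules out a linear relationship, because here y is completely determined by x.

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 5, 4, 6]      # y tends to rise with x
opposite_direction = [6, 4, 5, 4, 2]  # y tends to fall as x rises

print(sample_covariance(x, same_direction))      # positive
print(sample_covariance(x, opposite_direction))  # negative

# y = x squared on a range symmetric around zero: strongly related, not linear.
x_sym = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x_sym]
print(sample_covariance(x_sym, y_curve))         # 0.0
```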
81. Coefficient of Correlation
Measures the relative strength of the linear relationship between two variables
Sample correlation coefficient:
$$ r = \frac{\mathrm{Cov}(x, y)}{s_x s_y} $$
82. Features of Correlation Coefficient, r
Unit free
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
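The "unit free" property can be illustrated with a short Python sketch. The data are hypothetical (hours against spending recorded first in dollars, then in cents), and `sample_correlation` is an illustrative helper, not a library function. Rescaling one variable changes the covariance but leaves r unchanged.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """r = Cov(x, y) / (s_x * s_y), using n - 1 divisors throughout."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
dollars = [2, 4, 5, 4, 6]
cents = [100 * d for d in dollars]  # same spending in different units

r_dollars = sample_correlation(hours, dollars)
r_cents = sample_correlation(hours, cents)
print(r_dollars, r_cents)  # the two values agree up to rounding

# A perfectly linear, downward-sloping relationship gives r close to -1.
print(sample_correlation(hours, [10 - 2 * h for h in hours]))
```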
86. Fun with Graphs
Does your mindset match my dataset!
http://www.ted.com/talks/hans_rosling_at_state.html
87. Looking ahead
SRs to client (cc) and Turnitin on Wednesday by noon
No class next week
Work on 598 critiques
598 Critiques due in class & Turnitin Nov. 30
Comments on your SRs will be ready Nov. 30
Final SRs (if required) due Dec. 8 @11:55PM PST
Note carefully the requirements
Moodle site will be inaccessible sometime in December
Final Grades reported via usource once approved by the Director
Editor's Notes
Graph makes the frequencies pop more
Or that which could have been a bar chart can be made into a line by connecting the midpoints
Remember our cross table?
Can we present this graphically?
Note legend is on right as no room on left hand side
Or we can display this as a stacked bar where the proportion of each region in each category is displayed.
Called a segmented bar chart
Mancession Video 4 minutes
Unemployment Rates sheet
ExcelTutorial5_timeseriesgraph
The main defence of the lying graph is that at least it was approximately correct: we were just trying to show the general direction of change or magnitude.
So yes, taxes are low in BC but not as low as shown in the original graph
Non-zero origins are a great way to lie
Very popular in government
Remember this time series graph. Look at what happens if we change the scale on the Y axis
Boy, that really changes your impression of the data and the underlying trend. The drop from 1992 to 1997 was 7%. Does this graph under or overstate a 7% change over this period?
Dr. Kendall used his diagram to demonstrate that we are drinking too much when really there are more people drinking due to population growth
No mode
If the mean=median and there is no mode, your distribution looks something like this
Not as frequently occurring in economic data so I actually do not have many examples
What does the standard deviation tell us? It tells us how far from the mean the data points tend to be. A bigger number tells us that the observations are further away from the mean than if there is a small standard deviation. It tells us how representative of the data the mean is.
Since the standard deviation can be thought of as measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction. The mean for this example was about 15.5
For the first distribution we have 15.5+3.338= 18.838 and 15.5-3.338=12.162
Assuming this is how much restaurant patrons spend, what this means is that most of the patrons probably spend between $12.16 and $18.84.
In the second example, we have 15.5+0.926=16.43 and 15.5-0.926=14.57 which as you can see shows less spread in the data.
In the third example we have 15.5+4.57=20.07 and 15.5-4.57=10.93 which is the most spread.
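The three "mean plus or minus one standard deviation" calculations in the notes above can be reproduced with a small Python sketch. The mean and standard deviations are the ones quoted in the notes. The function name `one_sd_interval` and the labels are hypothetical.

```python
def one_sd_interval(mean, sd):
    """Return (mean - sd, mean + sd), rounded to 3 decimals for display."""
    return (round(mean - sd, 3), round(mean + sd, 3))

# Mean and standard deviations quoted in the notes.
MEAN = 15.5
spreads = {"first": 3.338, "second": 0.926, "third": 4.570}

intervals = {name: one_sd_interval(MEAN, s) for name, s in spreads.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low} to {high} (width {round(high - low, 3)})")
```

The width of each interval is twice the standard deviation, so the second data set has the narrowest band and the third the widest.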
Excel 4 minutes
Food Expenditures 2
ExcelTutorial9_Dispersion.mp4
Measures of Relationships Between Variables
More often than not, we are interested in describing the relationship between variables
On Oct. 28 we learned about scatter plots as a graphical way to describe a relationship between two variables.
We also learned about cross tabs aka contingency tables for nominal/ordinal variables
Let’s look a little more closely at measures of relationships for ratio-level data