Exploratory data analysis (EDA) involves analyzing datasets to discover patterns, trends, and relationships. EDA techniques include graphical methods such as histograms, box plots, and scatter plots, as well as summary statistics. The goal of EDA is to understand the data's structure and the relationships between variables through visual and numerical techniques, without beginning with a specific hypothesis. EDA is used to generate hypotheses for later confirmatory analysis and to identify outliers, anomalies, and other unusual data characteristics. Lattice graphics and other plotting functions in R are useful EDA tools for visualizing univariate and bivariate relationships in data.
There are 100,000 applicants for loans. Who is likely to default? How can loans be offered effectively?
There are 100,000 consumers. Who is likely to buy my product? How can I market my product effectively?
There are more than 1,000,000,000 transactions in a day. How can the fraudulent transactions be identified?
There are 1,000,000 claims every year. How can the fake claims be identified?
Missing data handling is typically done in an ad-hoc way. Without understanding the repercussions of a missing data handling technique, approaches that only let you get to the "next step" in your analytics pipeline lead to terrible outputs, conclusions that aren't robust, and biased estimates. Handling missing data in data sets requires a structured approach. In this workshop, we will cover the key tenets of handling missing data in a structured way.
Exploratory data analysis and data visualization:
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
Maximize insight into a data set.
Uncover underlying structure.
Extract important variables.
Detect outliers and anomalies.
Test underlying assumptions.
Develop parsimonious models.
Determine optimal factor settings.
How to validate a model?
What is the best model?
Types of data
Types of errors
The problem of overfitting
The problem of underfitting
Bias-variance tradeoff
Cross-validation
K-fold cross-validation
Bootstrap cross-validation
Just finished a basic course on data science (highly recommend it if you wish to explore what data science is all about). Here are my takeaways from the course.
2. WHAT IS EDA?
• The analysis of datasets based on various numerical methods and graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies, and strange structures.
• It facilitates discovering the unexpected as well as confirming the expected.
• Another definition: an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical).
4. AIM OF EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
5. AIM OF EDA
• The goal of EDA is to explore data open-mindedly.
• Tukey: EDA is detective work… Unless the detective finds the clues, the judge or jury has nothing to consider.
• Here, the judge or jury is confirmatory data analysis.
• Tukey: Confirmatory data analysis goes further, assessing the strengths of the evidence.
• With EDA, we can examine the data and try to understand the meaning of variables and what the abbreviations stand for.
6. Exploratory vs Confirmatory Data Analysis
EDA:
• No hypothesis at first
• Generates hypotheses
• Uses (mostly) graphical methods
CDA:
• Starts with a hypothesis
• Tests the null hypothesis
• Uses statistical models
7. STEPS OF EDA
• Generate good research questions.
• Data restructuring: you may need to make new variables from the existing ones.
• Instead of using two variables, obtain rates or percentages from them.
• Create dummy variables for categorical variables (see the short R sketch after this list).
• Based on the research questions, use appropriate graphical tools and obtain descriptive statistics. Try to understand the data structure, relationships, anomalies, and unexpected behaviors.
• Try to identify confounding variables, interaction relations, and multicollinearity, if any.
• Handle missing observations.
• Decide on the need for transformation (of the response and/or explanatory variables).
• Decide on the hypotheses based on your research questions.
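A minimal sketch of the restructuring step in base R (the data frame and variable names are hypothetical, purely for illustration):
> # hypothetical claims data: two counts and one categorical variable
> claims <- data.frame(n_fake = c(12, 30), n_total = c(400, 950),
+ region = factor(c("north", "south")))
> # a rate instead of two raw counts
> claims$fake_rate <- claims$n_fake / claims$n_total
> # model.matrix() expands the factor into dummy (0/1) variables
> model.matrix(~ region, data = claims)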
8. AFTER EDA
• Confirmatory data analysis: verify the hypothesis by statistical analysis.
• Draw conclusions and present your results clearly.
9. Classification of EDA*
• Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics, while graphical methods summarize the data in a diagrammatic or pictorial way.
• Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables.
• It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
10. EXAMPLE 1
Data from the Places Rated Almanac (Boyer and Savageau, 1985)
9 variables for 329 metropolitan areas in the USA:
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city
Questions:
1. How is climate related to location?
2. Are there clusters in the data (excluding location)?
3. Are nearby cities similar?
4. Any relation between economic outlook and crime?
5. What else?
11. EXAMPLE 2
• In breast cancer research, the main questions of interest might be:
• Does any treatment method result in a higher survival rate? Can a particular treatment be suggested to a woman with specific characteristics?
• Is there any difference between patients in terms of survival rates? (E.g., are white women more likely to survive than black women when both are at the same stage of disease?)
12. EXAMPLE 3
• In a project investigating the well-being of teenagers after an economic hardship, the main questions can be:
• Is there a positive (and significant) effect of economic problems on distress?
• Which other factors are most related to the distress of teenagers? E.g., age, gender, ...?
13. EXAMPLE 4*
New cancer cases in the U.S. based on a cancer registry
• The rows in the registry are called observations; they correspond to individuals.
• The columns are variables or data fields; they correspond to attributes of the individuals.
*https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf
14. Examples of Variables
• Identifier(s):
- patient number
- visit # or measurement date (if measured more than once)
• Attributes at study start (baseline):
- enrollment date
- demographics (age, BMI, etc.)
- prior disease history, labs, etc.
- assigned treatment or intervention group
- outcome variable
• Attributes measured at subsequent times:
- any variables that may change over time
- outcome variable
15. Data Types and Measurement Scales
• Variables may be one of several types, and have a defined set of valid values.
• Two main classes of variables are:
Continuous variables (quantitative, numeric): continuous data can be rounded or binned to create categorical data.
Categorical variables (discrete, qualitative): some categorical variables (e.g., counts) are sometimes treated as continuous.
16. Categorical Data
• Unordered categorical data (nominal)
2 possible values (binary or dichotomous)
Examples: gender, alive/dead, yes/no.
More than 2 possible values, with no order to the categories
Examples: marital status, religion, country of birth, race.
• Ordered categorical data (ordinal)
Ratings or preferences
Cancer stage
Quality of life scales
National Cancer Institute's NCI Common Toxicity Criteria (severity grades 1-5)
Number of copies of a recessive gene (0, 1, or 2)
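In R, the nominal/ordinal distinction maps onto unordered versus ordered factors; a small sketch with invented values:
> # nominal: no ordering among the levels
> marital <- factor(c("single", "married", "divorced", "married"))
> # ordinal: the levels have a defined order, so comparisons make sense
> stage <- factor(c("I", "III", "II"),
+ levels = c("I", "II", "III", "IV"), ordered = TRUE)
> stage < "III"
[1] TRUE FALSE TRUE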
17. EDA Part 2: Summarizing Data With Tables and Plots
Examine the entire data set using basic techniques before starting a formal statistical analysis:
• Familiarize yourself with the data.
• Find possible errors and anomalies.
• Examine the distribution of values for each variable.
18. Summarizing Variables
• Categorical variables
Frequency tables - how many observations are in each category?
Relative frequency tables - the percent in each category.
Bar charts and other plots.
• Continuous variables
Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as ordered categorical.
Plots specific to continuous variables.
The goal for both categorical and continuous data is data reduction while preserving/extracting key information about the process under investigation.
20. Frequency Table
• Frequency table: categories with counts
• Relative frequency table: percentage in each category
21. Graphing a Frequency Table - Bar Chart:
Plot the number of observations in each category (a short R sketch follows).
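The original table is not reproduced here, so as a hedged stand-in, the same steps on the built-in mtcars data:
> # frequency table: counts of cars per cylinder category
> counts <- table(mtcars$cyl)
> # relative frequency table: percentage in each category
> round(100 * prop.table(counts), 1)
> # bar chart of the counts
> barplot(counts, xlab = "Cylinders", ylab = "Number of Cars")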
22. Continuous Data - Tables
Example: ages of 10 adult leukemia patients:
35, 40, 52, 27, 31, 42, 43, 28, 50, 35
One option is to group these ages into decades and create a categorical age variable.
23. We can then create a frequency table for this new categorical age variable; a sketch of both steps follows.
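Both steps in base R, using the ages listed above (the decade breakpoints are one plausible choice):
> ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)
> # bin into decades: [20,30), [30,40), [40,50), [50,60)
> agecat <- cut(ages, breaks = seq(20, 60, 10), right = FALSE)
> # frequency table for the new categorical age variable
> table(agecat)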
24. Continuous Data - Plots
A histogram is a bar chart constructed using the frequencies or relative frequencies of a grouped (or "binned") continuous variable.
It discards some information (the exact values), retaining only the frequencies in each bin.
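Continuing the same example, a base-R histogram with the same decade bins:
> ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)
> hist(ages, breaks = seq(20, 60, 10), right = FALSE,
+ main = "Ages of Leukemia Patients", xlab = "Age")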
32. Plotting Functions
R has several distinct plotting systems:
Base R functions
• hist()
• barplot()
• boxplot()
• plot()
The lattice package
The ggplot2 package
34. • The boxplot function can also take a formula as an argument: mpg ~ cyl reads as "mpg conditional on cyl".
> boxplot(mpg ~ cyl,
+ data = mtcars,
+ main = "Miles per Gallon by Number of Cylinders",
+ xlab = "Cylinders",
+ ylab = "Miles per Gallon")
35. > # Expand the formula
> boxplot(mpg ~ cyl + am,
+ data = mtcars,
+ main = "MPG by Number of Cylinders & Transmissions")
37. Bar Chart
Use the table function to create a two-way frequency table, and plotting options to group bars:
> counts <- table(mtcars$cyl, mtcars$am)
> colnames(counts) <- c("Auto", "Manual")
> barplot(counts,
+ main = "Number of Cars by Transmission and Cylinders",
+ xlab = "Transmission",
+ beside = TRUE,
+ legend = rownames(counts))
39. > # create a vector for conditional color coding
> colorcode <- ifelse(mtcars$am == 0, "red", "blue")
> plot(mtcars$mpg,
+ mtcars$hp,
+ xlab = "Miles per Gallon",
+ ylab = "Horsepower",
+ col = colorcode)
40. Lattice graphics*
• lattice is an add-on package that implements Trellis graphics (originally developed for S and S-PLUS) in R. It is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data.
• To fix ideas, we start with a few simple examples. We use the Chem97 dataset from the mlmRev package.
> library(mlmRev)
> data(Chem97, package = "mlmRev")
> head(Chem97)
lea school student score gender age gcsescore gcsecnt
1 1 1 1 4 F 3 6.625 0.3393157
2 1 1 2 10 F -3 7.625 1.3393157
3 1 1 3 10 F -4 7.250 0.9643157
4 1 1 4 10 F -2 7.500 1.2143157
5 1 1 5 8 F -1 6.444 0.1583157
6 1 1 6 10 F 4 7.750 1.4643157
*All notes related to lattice graphics: https://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf
41. Variables in the Chem97 Data
• A data frame with 31022 observations on the following 8 variables.
• lea: Local Education Authority - a factor
• school: School identifier - a factor
• student: Student identifier - a factor
• score: Point score on A-level Chemistry in 1997
• gender: Student's gender
• age: Age in months, centered at 222 months (18.5 years)
• gcsescore: Average GCSE score of the individual
• gcsecnt: Average GCSE score of the individual, centered at the mean
42. Lattice graphics
• The dataset records information on students appearing in the 1997 A-level chemistry examination in Britain.
• We are only interested in the following variables:
• score: point score in the A-level exam, with six possible values (0, 2, 4, 6, 8, 10).
• gcsescore: average score in GCSE exams. This is a continuous score that may be used as a predictor of the A-level score.
• gender: gender of the student.
• Using lattice, we can draw a histogram of all the gcsescore values using
> library(lattice)
> histogram(~ gcsescore, data = Chem97)
43. Lattice graphics
histogram(~ gcsescore, data = Chem97)
This plot shows a reasonably symmetric unimodal distribution, but is otherwise uninteresting. A more interesting display would be one where the distribution of gcsescore is compared across different subgroups, say those defined by the A-level exam score (see the sketch below).
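In lattice, such conditioning is written with the | operator; a minimal sketch, following the conditioning syntax of the tutorial cited above:
> histogram(~ gcsescore | factor(score), data = Chem97)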
45. Lattice graphics
• More effective comparison is enabled by direct superposition. This is hard to do with conventional histograms, but easier using kernel density estimates. In the following example, we use the same subgroups as before in the different panels, but additionally subdivide the gcsescore values by gender within each panel.
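A plausible form of that call (assuming the same conditioning as above): lattice's groups argument superposes the densities within each panel, plot.points = FALSE suppresses the data points, and auto.key adds a legend.
> densityplot(~ gcsescore | factor(score), data = Chem97,
+ groups = gender, plot.points = FALSE, auto.key = TRUE)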
47. Lattice graphics
• Several standard statistical graphics are intended to visualize the distribution of a continuous random variable. We have already seen histograms and density plots, which are both estimates of the probability density function. Another useful display is the normal Q-Q plot, which is related to the distribution function F(x) = P(X ≤ x). Normal Q-Q plots can be produced by the lattice function qqmath().
• Normal Q-Q plots plot empirical quantiles of the data against quantiles of the normal distribution (or some other theoretical distribution). They can be regarded as an estimate of the distribution function F, with the probability axis transformed by the normal quantile function. They are designed to detect departures from normality; for a good fit, the points lie approximately along a straight line. In the plot above, the systematic convexity suggests that the distributions are left-skewed, and the change in slopes suggests changing variance.
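A call of roughly this form would produce the display being described (f.value = ppoints(100) thins each sample to 100 quantiles; type and aspect are the arguments discussed on the next slide):
> qqmath(~ gcsescore | factor(score), data = Chem97,
+ f.value = ppoints(100), type = c("p", "g"), aspect = "xy")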
49. Lattice graphics
The type argument adds a common reference grid to each panel that makes it easier to see the upward shift in gcsescore across panels. The aspect argument automatically computes an aspect ratio. Two-sample Q-Q plots compare quantiles of two samples (rather than one sample and a theoretical distribution). They can be produced by the lattice function qq(), with a formula that has two primary variables. In the formula y ~ x, y needs to be a factor with two levels, and the samples compared are the subsets of x for the two levels of y. For example, we can compare the distributions of gcsescore for males and females, conditioning on A-level score:
> qq(gender ~ gcsescore | factor(score), Chem97,
+ f.value = ppoints(100), type = c("p", "g"), aspect = 1)
50. The plot suggests that females do better than males in the GCSE exam for a given A-level score (in other words, males tend to improve more from the GCSE exam to the A-level exam), and also have smaller variance (except in the first panel).
51. Lattice graphics
• A well-known graphical design that allows comparison between an arbitrary number of samples is the comparative box-and-whisker plot.
• Box-and-whisker plots can be produced by the lattice function bwplot().
> bwplot(factor(score) ~ gcsescore | gender, Chem97)
52. The decreasing lengths of the boxes and whiskers suggest decreasing variance, and the large number of outliers on one side indicates heavier left tails (characteristic of a left-skewed distribution).