This document provides an introduction and overview of graphics and plotting in R. It discusses high level and low level plotting functions, interacting with graphics, and modifying plots. It also covers plotting different variable types including dichotomous, categorical, ordinal, and continuous variables. Examples are provided for various plot types including histograms, bar plots, dot plots, boxplots, and more.
Introduction to Data Analysis and Graphics in R
Hellen Gakuruh
2017-04-03
Slide 5: Graphics in R
Outline
What we will cover:
• Introduction
• High level plotting functions
• Low level plotting functions
• Interacting with graphics
• Modifying a graph
• Plotting dichotomous and categorical variables
• Plotting ordinal variables
• Plotting continuous variables
Introduction
• R is renowned for its plotting facilities; not only does it have all the well-known graphs, it also offers the opportunity to build an entirely new type of graph
• There are three well-known graphics systems in R: “base graphics”, “grid graphics” (often used through package lattice) and “ggplot2”
• In an interactive session R opens a graphics device for you: X11() on Unix, windows() on Windows and quartz() on macOS
• Plotting functions fall under three types of commands: high-level, low-level, and interactive
• Plots can be customized with “graphical parameters”
High level plotting functions
• They are designed to generate a complete plot with axes, labels and titles
unless they are suppressed (with graphical parameters)
• They start a new plot
• Core R’s plotting function is plot()
• plot() can produce a variety of different plots depending on type/class of
first argument (hence, plot() is completely reliant on class(object))
Expected output of “plot()”
• If only “x” is given:
– if it is a time series object (class = ts), a line plot is produced; otherwise, if it is numeric, a scatter plot of its values (x) against their index is generated
– if class(x) == "factor", a bar plot is produced
– it is an error when class(x) == "character", as plot() needs finite numeric values to set up a plotting window
• If two variables are given and they are both numeric, output is a scatter
plot
Expected output of “plot()”
• If a factor and a numeric vector are given, box plots are produced
• If both vectors are factors, a spine plot (a stacked bar plot whose bar widths reflect the counts of the first factor) is produced
• If the object passed is not a vector but a matrix, data frame or list, plot() will make plots according to the type of its elements
• We produce a few of these as examples using plain plot(obj) (without changing or supplying other arguments)
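The dispatch rules above can be tried out with a small sketch. The objects x_num, x_fac and y_fac are made-up illustration data (an assumption), not the objects used on the slides.

```r
# Illustrative data; names and values are assumptions, not from the slides
set.seed(1)
x_num <- rnorm(50, mean = 10)                                  # numeric vector
x_fac <- factor(sample(c("a", "b", "c"), 50, replace = TRUE))  # 3-level factor
y_fac <- factor(sample(c("F", "M"), 50, replace = TRUE))       # 2-level factor

op <- par(mfrow = c(2, 2))   # 2 x 2 grid of plots; op keeps the old layout
plot(x_num)                  # numeric only: values against their index
plot(x_fac)                  # factor only: bar plot of counts
plot(x_fac, x_num)           # factor + numeric: one box plot per level
plot(x_fac, y_fac)           # two factors: spine plot (stacked bars)
par(op)                      # restore the previous layout
```

Each call is the plain plot(obj) form, so any difference between the four panels comes only from the class of the arguments.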
Time series object
ts <- ts(rnorm(12, 50), start = 1, end = 12, frequency = 1)
class(ts)
[1] "ts"
plot(ts)
Factor and numeric vector
set.seed(5)
num3 <- rnorm(100, 88)
class(num3)
[1] "numeric"
# fac is a factor of length 100 created on a slide that is missing from this transcript
plot(fac, num3)
Two factor vectors
fac2 <- factor(sample(c("F", "M"), 100, replace = TRUE, prob = c(0.8, 0.2)))
class(fac2)
[1] "factor"
plot(fac, fac2)
Summary
• In all these plots, axes and labels (except a title) are provided, and in some, color as well; this makes them communicative
• However, they might not be aesthetically up to requirements; this can be changed by passing other arguments, including ones that suppress the axes
Other arguments to “plot”
• Type of plot produced by plot() depends on the first (“x”) and second (“y”) arguments, but how it is drawn depends on values passed to the other arguments
• Plot type can also be changed with argument “type”, though only do this when sure it makes sense
• “xlim” and “ylim” define x and y limits (min and max axis values); these can be changed, especially if a bit more padding is needed
Other arguments to “plot” function cont.
• For customized axes, such as log scales, the default axes can be suppressed with argument “axes” (axes = FALSE) and then drawn with axis()
• To annotate a plot with additional graphical parameters, add them as arguments to high and low level plotting functions or make a call to par() ... more on this later (read ?par)
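A minimal sketch of these arguments in use. The data x and y are invented for illustration, and the tick positions are arbitrary choices.

```r
# Illustrative data (an assumption, not from the slides)
x <- 1:20
y <- x^2

# type = "b" draws both points and lines; xlim/ylim add some padding
plot(x, y, type = "b", xlim = c(0, 21), ylim = c(0, 420))

# Suppress the default axes, then draw custom ones with axis()
plot(x, y, log = "y", axes = FALSE)
axis(1)                                     # default x axis
axis(2, at = c(1, 10, 100, 400), las = 1)   # hand-picked ticks on the log scale
box()                                       # frame around the plot region
```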
Other High-level plots
• hist() for histograms (univariate continuous distributions)
• boxplot() for box-and-whiskers plot (for univariate numerical variables
alone or categorised by a categorical variable)
• barplot() for bar plots (for categorical distribution)
• pie() for pie chart (for categorical distribution)
Low level plotting functions
• These functions add more information to an existing plot
• Used to customize plots
• Some of the most frequently used functions are: points(), lines(), text(), title(), abline(), polygon(), legend(), and axis()
• We use some of these when plotting some of the example distributions
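As a rough illustration, the sketch below starts an empty plot with a high-level call and then builds it up with low-level functions only. The data are simulated for the purpose of the example.

```r
# Illustrative data (assumed for this sketch)
set.seed(2)
x <- 1:10
y <- 2 * x + rnorm(10)

plot(x, y, type = "n")                      # high-level call: empty frame
points(x, y, pch = 19, col = "blue")        # add the observations
lines(x, 2 * x, lty = 2)                    # add the line y = 2x
abline(h = mean(y), col = "grey")           # horizontal line at the mean of y
text(x[10], y[10], "last point", pos = 2)   # label to the left of a point
title("Low-level functions on one plot")
legend("topleft", legend = c("data", "y = 2x"),
       pch = c(19, NA), lty = c(NA, 2), col = c("blue", "black"))
```

None of the low-level calls starts a new plot; each one draws on the frame opened by plot().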
Interacting with graphics
• Interaction means extracting or adding information to a plot using a mouse
(rather than inputting data to plot)
• Two functions for interaction in R are locator() and identify()
• locator(n, type): one can select up to “n” points using the left mouse button; a list with two components x and y is returned, and if type is specified the selected points are also drawn according to “type”
• locator() is particularly handy in locating position for legends, and labels
e.g. text(locator(1), "Outlier", adj=0)
Interacting with graphics cont.
• identify(x, y, labels) is used to highlight any of the points defined
by x and y (using left mouse button)
• These can be used to identify certain points and possibly label
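A small sketch of both functions, guarded so that it only waits for mouse clicks in an interactive session. The data are invented, with one deliberately extreme value.

```r
# Illustrative data (assumed); interaction needs an on-screen device
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 30)    # the last value is an obvious outlier

plot(x, y)
if (interactive()) {
  # Click once on the plot; the label is drawn at the clicked position
  text(locator(1), "Outlier", adj = 0)
  # Click near points to label them with their index; how to finish
  # depends on the device (see ?identify)
  picked <- identify(x, y, labels = seq_along(x))
}
```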
Demonstration on interacting with graphics
Graphical parameters “par()”
• Almost every aspect of a plot can be customized by graphical parameters
• Graphical parameters come in “name=value” pair with all having a default
value
• To see the current values of all parameters, call par()
• For a specific parameter, pass its name: par("parameter"), e.g. par("mfrow")
• Changing a parameter can be done globally with par() (not recommended, as it affects all later plots) or individually as an argument to a plotting call
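The following sketch contrasts the two ways of changing parameters. The parameter values chosen here are arbitrary examples.

```r
par("mfrow")                         # query a single parameter

# Global change: par() returns the old values, so they can be restored
op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(1:10)
plot(10:1)
par(op)                              # restore the previous settings

# Individual change: pass the parameter to the plotting call instead
plot(1:10, pch = 17, col = "red", las = 1)
```

Saving the return value of par() and restoring it afterwards keeps a global change from leaking into later plots.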
Plotting dichotomous and categorical variables
• Plotting of any distribution depends on whether it’s univariate (one vari-
able), bi-variate (two variables) or multi-variate
• Plots for univariate categorical variables (dichotomous included) are:
– Pie charts (for few values e.g. 2)
– Bar plots, and
– Cleveland’s dot plots
Plotting dichotomous and categorical variables cont.
• Bi-variate plots
– Stacked/besides bar plots
– Four-fold display
• Multi-variate plots
– Mosaic
– Four-fold plots
Pie chart
• Suitable when there are few categories
• Useful for showing percentages
• Highly discouraged because angles are hard to judge visually; in addition it uses a lot of ink
Pie chart example
set.seed(5)
response <- sample(c("Yes", "No"), 300, replace = TRUE, prob = c(0.68, 0.32))
tab_response <- table(response)
pie(tab_response, col = c("#99CCFF", "#6699CC"))
labs <- paste0("(", round(as.vector(prop.table(tab_response)*100)), "%)")
text(x = c(0.78, -0.50), y = c(0.80, -1), labels = c(labs[1], labs[2]))
Bar plot
• Consists of a sequence of rectangular bars with heights given by the values supplied
• Ideally, bars should be ordered by frequency rather than by bar label
• Not recommended due to its low data-to-ink ratio (an alternative is Cleveland’s dot plot)
Bar plot cont.
barplot(sort(tab_response, decreasing = TRUE), las = 1, col = c("#6699CC", "#99CCFF"))
title("Bar chart", xlab = "Response", ylab = "Frequency")
Cleveland’s dot plot
• An alternative to bar chart (uses less ink for the same data, i.e. a higher data:ink ratio)
• As an example, generate a “Cleveland’s dot plot” of the following data set; it should:
– be titled “Total students trained by quarters (2016)”
– have an x axis titled “Total students trained”
– have a sub-title “Data Mania Inc” (grey in color and slanted)
– have a y axis titled “Quarters”, labelled according to the (ordered) months given (Mar, Jun, Sep and Dec), and
– have blue colored points
Cleveland’s dot plot cont.
• Example data: hypothetical (randomly generated) totals of students trained per quarter in 2016
set.seed(5)
months <- sample(month.abb[c(3, 6, 9, 12)], size = 300, replace = TRUE)
tab_months <- table(months)[c("Mar", "Jun", "Sep", "Dec")]
tab_months
months
Mar Jun Sep Dec
 81  78  60  81
Cleveland’s dot plot cont.
dotchart(as.numeric(tab_months), xlab = "Total student's Trained", ylab = "Quarters", bg = 4
title("Total students trained by quarters (2016)", sub = "Data Mania Inc.,", font.sub = 3, c
axis(2, at = 1:4, labels = names(tab_months), las = 2)
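The calls above are cut off in this transcript. As a self-contained sketch of a dot plot meeting the same requirements, the block below hard-codes the counts from the table shown earlier; the argument choices (pch = 21, bg = "blue", col.sub = "grey") are one plausible way to do it, not necessarily what the original slide used.

```r
# Counts copied from the table shown earlier in this section
tab_months <- c(Mar = 81, Jun = 78, Sep = 60, Dec = 81)

# labels puts the quarter names on the y axis;
# pch = 21 with bg gives filled blue points
dotchart(tab_months, labels = names(tab_months), pch = 21, bg = "blue",
         xlab = "Total students trained", ylab = "Quarters")

# font.sub = 3 is italic; col.sub sets the sub-title colour
title("Total students trained by quarters (2016)",
      sub = "Data Mania Inc.", font.sub = 3, col.sub = "grey")
```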
Bi-variate stacked/beside bar plots and dot plot
• Following the earlier example, generate stacked/beside bar plots and a bi-variate Cleveland’s dot plot
• Adding a second variable: gender composition of students trained
Bi-variate stacked/beside bar plots and dot plot cont.
set.seed(5)
gender <- sample(c("Female", "Male"), 300, TRUE, c(0.7, 0.3))
monthgen_tab <- table(gender, months)[, c("Dec", "Sep", "Jun", "Mar")]
monthgen_tab
        months
gender   Dec Sep Jun Mar
  Female   0  49  78  81
  Male    81  11   0   0
Bi-variate stacked/beside bar plots and dot plot cont.
barplot(monthgen_tab, col = c("#6699CC", "#99CCFF"), beside = TRUE)
legend("topright", legend = c("Female", "Male"), pch = 22 , pt.bg = c("#6699CC", "#99CCFF"),
title("Student's trained by gender and month (2016)", xlab = "Month", ylab = "Number trained
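To compare the stacked and beside layouts directly, here is a short sketch that rebuilds the table above by hand and draws both versions next to each other. The legend placement is an arbitrary choice.

```r
# Counts copied from the gender-by-month table above (filled column by column)
monthgen_tab <- matrix(c(0, 81, 49, 11, 78, 0, 81, 0), nrow = 2,
                       dimnames = list(gender = c("Female", "Male"),
                                       months = c("Dec", "Sep", "Jun", "Mar")))
cols <- c("#6699CC", "#99CCFF")

op <- par(mfrow = c(1, 2))
barplot(monthgen_tab, col = cols, main = "Stacked")       # default: stacked
barplot(monthgen_tab, col = cols, beside = TRUE, main = "Beside",
        legend.text = TRUE, args.legend = list(x = "top"))
par(op)
```

With legend.text = TRUE, barplot() takes the legend labels from the row names of the matrix.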
Bivariate Cleveland’s dot plot
dotchart(as.matrix(monthgen_tab)[, c("Mar", "Jun", "Sep", "Dec")], bg = 4, xlab = "Total num
title("Total student's trained by gender and month", sub = "Data Mania Inc.", font.sub = 3,
title(ylab = "Gender and month", line = 2.5)
Four-fold plots
• Used to display association (or lack of it)
• Designed for two binary variables (2 x 2 tables); these can be stratified by a third categorical variable with k levels (2 x 2 x k tables)
• Association is indicated if diagonally opposite cells in one direction tend to differ in size from those in the other direction
• Color is used to show this direction
Four-fold plots cont.
• Rings around each quarter circle are confidence rings; if the rings of adjacent quadrants overlap, the data are consistent with H_0: no association
• Example data: R’s “Titanic” data (but only for passengers)
# Convert Titanic data
titanic_passengers <- colSums(Titanic[-4,,,])
titanic_passengers
, , Survived = No
        Age
Sex      Child Adult
  Male      35   659
  Female    17   106
, , Survived = Yes
        Age
Sex      Child Adult
  Male      29   146
  Female    28   296
Four-fold for Titanic Passengers
# Plotting four fold plot
fourfoldplot(titanic_passengers, std = "margins")
• Plot shows an association (rings do not overlap and diagonally opposite cells differ in size) between Titanic passengers’ age (child/adult) and sex (male/female), stratified by survival status (No/Yes)
• A four-fold plot differs from a pie chart in that it varies the radius while holding the angle constant, whereas a pie chart varies the angle while holding the radius constant
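A four-fold plot is a picture of the odds ratio in each 2 x 2 table, so the numbers behind the display can be checked directly. This sketch uses a small helper, odds_ratio(), which is a name introduced here for illustration and is not part of base R.

```r
# Passengers only (drop crew), summed over class: Sex x Age x Survived
titanic_passengers <- colSums(Titanic[-4, , , ])

# Hypothetical helper: odds ratio of a 2 x 2 table, (n11 * n22) / (n12 * n21)
odds_ratio <- function(tab) {
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

# One odds ratio per survival stratum; a value of 1 means no association
or_by_survival <- apply(titanic_passengers, 3, odds_ratio)
or_by_survival
```

Values far from 1 in a stratum correspond to quadrants that clearly differ in size along the two diagonals of that panel.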
Mosaic plots
• Originally proposed by Hartigan and Kleiner (1981, 1984)
• Similar to a divided bar plot: it displays counts of a contingency table directly by tiles whose area is proportional to the observed cell frequency
• Later extended by Friendly (1992, 1994b)
• Extended version generates greater visual impact by using color and shading to reflect the size of residuals from independence (no association)
• Used for exploratory data analysis (establishing associations) and model building (displaying residuals of a log-linear model)
mosaicplot(titanic_passengers, color = TRUE)
• Width of each column of tiles in the above figure is proportional to the marginal frequency of that column (sex), and the height of each tile is determined by the conditional probabilities of the rows (age) within each column (sex).
# Height of tiles
prop.table(apply(titanic_passengers, 1:2, sum), 1)
        Age
Sex           Child     Adult
  Male   0.07364787 0.9263521
  Female 0.10067114 0.8993289
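The tile heights are computed above; the column widths and the shaded (extended) version of the plot can be sketched in the same way. The plot title is an arbitrary choice.

```r
# Passengers only (drop crew), summed over class: Sex x Age x Survived
titanic_passengers <- colSums(Titanic[-4, , , ])

# Width of columns: marginal proportions of the first variable (Sex)
sex_share <- prop.table(apply(titanic_passengers, 1, sum))
sex_share

# shade = TRUE colours tiles by their residuals from the independence model
mosaicplot(titanic_passengers, shade = TRUE, main = "Titanic passengers")
```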
Plotting continuous variables
• Display will depend on whether it is univariate, bi-variate or multivariate
• Some often used displays for univariate data:
– Histograms
– Density plots
– Box-and-whisker plots
– Dot plots
– Stem-and-leaf plots
Plotting continuous variables cont.
• Some bi-variate displays
– Scatter plot (both variables are continuous)
– Box-and-whisker plot (one variable is continuous and the other categorical)
Histogram
• Displays distribution of observations in intervals called “bins”
• Each bin is represented by a rectangle whose width is the interval
• Intervals can be equal throughout (equidistant, R’s default) or not
• Height of each rectangle corresponds to the number of observations falling within its interval (bin)
• Generated with function hist(); plot(x, type = "h") only draws one vertical line per observation and does not bin the data
• hist() constructs bins from argument “breaks”
Histogram cont.
• Breaks are the boundary points of each interval or bin
• Calling hist() without this argument is okay (R will compute them), but it is usually good to change them to show the best picture of the distribution
• Argument “nclass” (kept for compatibility with S) can also be used to suggest the number of bins
• Histograms are excellent for data with numerous observations
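To see what hist() does with “breaks”, the sketch below inspects the bins without drawing and then supplies explicit break points. It assumes that sepal stands for iris$Sepal.Length, and the bin width of 0.25 is an arbitrary example.

```r
sepal <- iris$Sepal.Length       # assumed stand-in for the slides' `sepal`

# plot = FALSE returns the computed bins without drawing anything
h_default <- hist(sepal, plot = FALSE)
h_default$breaks                 # break points chosen by R
h_default$counts                 # number of observations per bin

# Explicit break points: bins of width 0.25 covering the data range
my_breaks <- seq(4, 8, by = 0.25)
hist(sepal, breaks = my_breaks, col = "#99CCFF",
     main = "Custom breaks", xlab = "Sepal length")
```

A single number passed as breaks (or nclass) is treated as a suggestion, while a vector of break points like my_breaks is used exactly as given.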
Code used to plot
# sepal is defined on a slide missing from this transcript; the axis titles suggest iris$Sepal.Length
op <- par("mfrow")
par(mfrow = c(1, 2))
hist(sepal, col = "#99CCFF", ann = FALSE)
title("Breaks = 10", xlab = "Sepal Length", ylab = "Frequency")
hist(sepal, nclass = 15, col = "#6699cc", ann = FALSE)
title("Breaks = 15", xlab = "Sepal Length", ylab = "Frequency")
par(mfrow = op)
Density Plots
• Fit a “smooth” curve by computing kernel density estimates
• The curve estimates the probability density of the variable, so the area under it is 1
dens_sepal <- density(sepal)
plot(dens_sepal, type = "n")
polygon(dens_sepal, col = "#99CCFF")
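How smooth the curve looks depends on the bandwidth. As a rough sketch, again assuming sepal stands for iris$Sepal.Length, the default estimate can be compared with a smoother one:

```r
sepal <- iris$Sepal.Length                # assumed stand-in for `sepal`

d_default <- density(sepal)               # default bandwidth
d_smooth  <- density(sepal, adjust = 2)   # twice the default bandwidth

plot(d_default, main = "Kernel density of sepal length")
lines(d_smooth, lty = 2)                  # smoother estimate, dashed
rug(sepal)                                # tick marks at the observations
```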
Box-and-whisker plot (univariate)
• Used to visualize a distribution in terms of its quartiles
• Shows outliers
• Good for comparisons, as multiple variables or groups can be plotted side-by-side
states <- as.data.frame(state.x77[, c("Illiteracy", "Life Exp", "Murder", "HS Grad")])
# Layout (1 row by 2 columns)
op <- par("mfrow")
par(mfrow = c(1, 2))
# Visualise distributions
boxplot(states$Illiteracy, col = "#99CCFF")
boxplot(states$'Life Exp', col = "#6699CC")
# Reset original layout
par(mfrow = op)
• Neither distribution has outliers (points beyond whiskers)
• First distribution has most of its values at the lower end, suggesting positive skewness (right tail)
• Second distribution looks almost symmetrical, as its lower and upper quarters look alike, though its middle value lies more towards the lower side
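The numbers a box plot is drawn from can be inspected with boxplot.stats(), which is a convenient way to check statements about outliers and symmetry. A minimal sketch for the two variables plotted above:

```r
states <- as.data.frame(state.x77[, c("Illiteracy", "Life Exp")])

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   observations drawn as individual points beyond the whiskers
bs_illit <- boxplot.stats(states$Illiteracy)
bs_life  <- boxplot.stats(states$`Life Exp`)

bs_illit$stats
bs_illit$out
bs_life$stats
bs_life$out
```

An empty out component means that no points would be drawn beyond the whiskers for that variable.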
Dot plots (Uni-variate)
• An alternative to box plot when n (sample size) is small
• They are one dimensional scatter plots
• Called stripchart in R
• Example data: 49.3, 48.1, 51.4, 48.1, 49, 49.3, 49.5, 49.8, 49.9, 50.4, 50.1
and 50.3
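num <- c(49.3, 48.1, 51.4, 48.1, 49, 49.3, 49.5, 49.8, 49.9, 50.4, 50.1, 50.3)  # example data listed above
col <- c("#99CCFF", "#6699CC")  # assumed palette; col is not defined in this transcript
stripchart(round(num, 1), pch = 22, bg = col[1])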
title("Dot plot for small sample size", xlab = "Observations")
Stem-and-leaf plot
• Used to show the distribution of observations
• Uses actual values rather than points
• Stem is the whole number and is plotted on the left side, while on the right side (separated by a vertical bar) are the fractional digits (leaves)
# Example data (sorted)
sort(round(num, 1))
[1] 48.1 48.1 49.0 49.3 49.3 49.5 49.8 49.9 50.1 50.3 50.4 51.4
# Stem-and-leaf plot
stem(round(num, 1))
  The decimal point is at the |
  48 | 11
  49 | 033589
  50 | 134
  51 | 4
Scatter plot
• Used to show relationship between two continuous variables
• A relationship is said to exist if points show a visible pattern (positive or negative)
• No relationship is apparent if no pattern is visible, i.e. points are scattered at random
plot(states[, 1:2], pch = 21, bg = col[1])
title("Association between Illiteracy and Life Expectancy")
• Scatter plot shows a negative pattern, suggesting an association between “Life Expectancy” and “Illiteracy” (cor = -0.5884779)
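The correlation quoted above can be reproduced from state.x77, and a fitted straight line makes the negative pattern easier to see. This is a sketch; the fitted line is an addition for illustration and is not part of the original plot.

```r
illiteracy <- state.x77[, "Illiteracy"]
life_exp   <- state.x77[, "Life Exp"]

r   <- cor(illiteracy, life_exp)     # Pearson correlation
fit <- lm(life_exp ~ illiteracy)     # straight-line fit

plot(illiteracy, life_exp, pch = 21, bg = "#99CCFF",
     xlab = "Illiteracy", ylab = "Life expectancy")
abline(fit, lty = 2)                 # overlay the fitted line
r
```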
Box-and-whisker plot (bi-variate)
• Useful to display a numerical variable by strata or groups of another categorical variable
• Can also be used to compare two numerical distributions
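A minimal self-contained sketch of a grouped box plot, using the state.division factor that ships with R as the grouping variable. The margin sizes and color are arbitrary choices made here.

```r
# state.division ships with R: each state's census division,
# in the same (alphabetical) state order as the rows of state.x77
life_exp <- state.x77[, "Life Exp"]

op <- par(mar = c(9, 4, 3, 1))       # extra bottom margin for rotated labels
boxplot(life_exp ~ state.division, las = 2, col = "#99CCFF",
        xlab = "", ylab = "Life expectancy")
par(op)                              # restore the previous margins
```

The formula life_exp ~ state.division asks for one box per level of the factor on the right-hand side.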
# Add xlab
mtext("Divisions", side = 1, line = 6, font = 2)
# Annotate plot
title("Life expectancy for each US division", ylab = "Life expectancy")
# Reset parameter
par(mar = op)
• Using box plots to compare similar distributions
• Example data: Edgar Anderson’s Iris Data
# Comparing lengths (Sepal and Petal)
boxplot(iris[, c("Sepal.Length", "Petal.Length")], col = col)
title("Comparing length of Irises of Gaspe Peninsula")
# Comparing width (Sepal and Petal)
boxplot(iris[, c("Sepal.Width", "Petal.Width")], col = col)
title("Comparing width of Irises of Gaspe Peninsula")
• Sepal seems to be larger than petal in terms of both length and width
• Will this pattern hold under different species?
• Pattern still holds: sepal length is higher than petal length across all species
• Pattern still holds, as sepal width is higher than petal width across all species; however, it is interesting to see that “setosa” has a higher sepal width than the others.
# High level functions
boxplot(iris$Sepal.Length~iris$Species, col = col[1], ylim = c(min(iris$Petal.Length) - 0.1,
boxplot(iris$Petal.Length~iris$Species, col = 4, add = TRUE)
# Low level functions
legend("bottomright", c("Sepal", "Petal"), pch = 22, pt.bg = c(col[1], 4), title = "Iris Typ
title("Comparison of Iris Length by species", xlab = "Species", ylab = "Length")
# High level functions
boxplot(iris$Sepal.Width~iris$Species, col = col[1], ylim = c(min(iris$Petal.Width) - 0.1, m
boxplot(iris$Petal.Width~iris$Species, col = 4, add = TRUE)
# Low level functions
legend("bottomright", c("Sepal", "Petal"), pch = 22, pt.bg = c(col[1], 4), title = "Iris Typ
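Several of the calls above are cut off in this transcript. As a self-contained sketch of the same idea for lengths, the block below overlays the two box plots per species and checks the pattern numerically with medians. Colors and legend position are choices made here.

```r
# Median length per species for each flower part
med <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species,
                 data = iris, FUN = median)
med
med$Sepal.Length > med$Petal.Length   # TRUE where sepals are longer

# Overlay: the first call sets up the axes, the second adds to the same plot
boxplot(Sepal.Length ~ Species, data = iris, col = "#99CCFF",
        ylim = range(iris$Petal.Length, iris$Sepal.Length),
        xlab = "Species", ylab = "Length")
boxplot(Petal.Length ~ Species, data = iris, col = "#6699CC",
        add = TRUE, axes = FALSE)
legend("bottomright", c("Sepal", "Petal"), pch = 22,
       pt.bg = c("#99CCFF", "#6699CC"))
```

Setting ylim to the range of both variables in the first call keeps the petal boxes from being clipped when they are added.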