A brief introduction to data visualisation using R. It covers both basic and advanced visualisation techniques with sample code. Most of the datasets used ship with R itself.
2. About Me
I am Baijayanti Chakraborty, a postgraduate student at Great Lakes Institute
of Management. I am pursuing a PG in Business Analytics and Business Intelligence.
You can find me on:
1. LinkedIn: https://www.linkedin.com/in/baijayanti-chakraborty/
2. Twitter: twitter.com/baijayantic
3. Mail: baijayantichakraborty96@gmail.com
4. GitHub: https://github.com/baijayantichakraborty
5. Kaggle: https://www.kaggle.com/baijayanti94
3. Today’s Spot Of Interest
❖ What is visualization and why do we need it?
❖ Basic Visualizations
❖ Advanced Visualizations
4. What is visualization and why do we need it?
Data visualization is the art of turning numbers into useful knowledge. Images are usually easier to
understand than long passages of raw information.
Consider this example: a snippet from the iris dataset, which ships with R. It is quite difficult
to comprehend anything from such a large table of raw values, so we use visualization
techniques to make the data easier to understand.
8. Selecting the right kind of chart
There are four basic presentation types:
1. Comparison
2. Composition
3. Distribution
4. Relationship
To determine which of these types best suits the data at hand,
we should be able to answer the questions below:
● How many variables do you want to show in a single
chart?
● How many data points will you display for each variable?
● Will you display values over a period of time, or among
items or groups?
11. Histogram
A histogram is a plot that breaks the data into bins (or breaks) and shows the frequency distribution across these bins.
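As a minimal sketch of the idea, the code below draws a histogram with base R's hist(). The choice of the Sepal.Length column from the built-in iris dataset, the number of breaks and the colours are illustrative assumptions, not taken from the original slides.

```r
# Histogram of a continuous variable from the built-in iris dataset.
sepal_length <- iris$Sepal.Length

# `breaks` is only a suggestion for the number of bins;
# R may adjust it to obtain "pretty" cut points.
h <- hist(sepal_length,
          breaks = 10,
          main   = "Distribution of Sepal Length",
          xlab   = "Sepal length (cm)",
          col    = "steelblue",
          border = "white")

# The returned object stores the bin edges and the frequency in each bin.
h$breaks
h$counts
```

Changing the breaks argument and re-running the call is a quick way to see how the bin size affects the shape of the plot.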
12. Bar/Line Chart
● Bar charts are recommended when you want to plot a categorical variable, or a combination of a continuous and a categorical
variable.
● Line charts are commonly preferred when analysing a trend over a period of time.
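The two chart types can be sketched with base R as follows. The datasets (mtcars for the bar chart, the AirPassengers time series for the line chart) are illustrative choices and may differ from those used in the original slides.

```r
# Bar chart: number of cars per cylinder count (a categorical variable) in mtcars.
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts,
        main = "Cars by number of cylinders",
        xlab = "Cylinders",
        ylab = "Number of cars",
        col  = "steelblue")

# Line chart: monthly airline passenger totals, a built-in time series.
plot(AirPassengers,
     type = "l",
     main = "Monthly airline passengers",
     xlab = "Year",
     ylab = "Passengers (thousands)")
```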
14. Boxplots
Box plots are used to plot a combination of categorical and continuous variables. This plot is useful for visualizing the spread of the
data and detecting outliers. It shows five summary statistics: the minimum, the 25th percentile, the median, the 75th
percentile and the maximum.
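A box plot along these lines can be sketched with base R's boxplot(). The iris columns and the colours are illustrative assumptions. Note that by default R draws the whiskers to the most extreme points within 1.5 times the interquartile range and plots anything beyond them as individual outlier points.

```r
# Box plot of a continuous variable (Sepal.Length) split by a
# categorical variable (Species), using the formula interface.
bp <- boxplot(Sepal.Length ~ Species,
              data = iris,
              main = "Sepal length by species",
              xlab = "Species",
              ylab = "Sepal length (cm)",
              col  = c("lightblue", "lightgreen", "lightpink"))

# One column per species: lower whisker, 25th percentile, median,
# 75th percentile, upper whisker.
bp$stats

# Values drawn as outliers, if any.
bp$out
```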
20. Some advanced visualisation packages in R
● Lattice graphs: the lattice package is essentially an improvement upon the base R graphics package and is used to
visualize multivariate data. Some kinds of visualisations available with the lattice package are:
1. Kernel density plots
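A kernel density plot with lattice might look like the sketch below. The iris dataset and the grouping by Species are illustrative assumptions.

```r
library(lattice)

# Kernel density of Sepal.Length, with one curve per species overlaid.
p_density <- densityplot(~ Sepal.Length,
                         data        = iris,
                         groups      = Species,
                         plot.points = FALSE,
                         auto.key    = TRUE,
                         main        = "Kernel density of sepal length",
                         xlab        = "Sepal length (cm)")
print(p_density)

# The same densities as a trellis display: one panel per species.
p_panels <- densityplot(~ Sepal.Length | Species,
                        data        = iris,
                        plot.points = FALSE,
                        xlab        = "Sepal length (cm)")
print(p_panels)
```

The formula after the | symbol is what produces the separate panels, one per level of the conditioning variable.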
● ggplot2: this package is one of the most widely used visualisation packages in R. It enables users to create
sophisticated visualisations with little code using the Grammar of Graphics.
● Plotly: an R package that creates interactive web-based graphs via the open source JavaScript graphing library
plotly.js. It can also easily translate ‘ggplot2’ graphs into web-based versions.
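The translation from ggplot2 to plotly can be sketched with ggplotly(). The plot itself (iris sepal length against petal length) is an illustrative assumption, and both packages need to be installed.

```r
library(ggplot2)
library(plotly)

# A static ggplot2 scatter plot, built layer by layer.
p_static <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()

# ggplotly() converts the ggplot2 object into an interactive plotly widget
# with hover tooltips, zooming and panning.
p_interactive <- ggplotly(p_static)
p_interactive
```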
23. Advanced Scatter Plots
Besides the basic version of scatterplots, we can also create them using the “ggplot2” library.
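A ggplot2 scatter plot might be sketched as follows. The mtcars columns, point size and labels are illustrative assumptions rather than the code from the original slides.

```r
library(ggplot2)

# Scatter plot of fuel efficiency against weight,
# coloured by the number of cylinders.
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(title  = "Fuel efficiency vs weight",
       x      = "Weight (1000 lbs)",
       y      = "Miles per gallon",
       colour = "Cylinders")
print(p_scatter)
```

Wrapping cyl in factor() makes ggplot2 treat it as a category, so each cylinder count gets its own discrete colour instead of a continuous gradient.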
25. HeatMaps
A heat map uses the intensity (density) of colours to display the relationship between two, three or more variables
in a two-dimensional image. It lets us explore two dimensions as the axes and a third dimension through the
intensity of colour.
The colour of the bars in the heat map
depends on the cyl variable of the dataset.
The dataset used here is mtcars, which is
built into R.
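One way to get a heat map of mtcars with side bars coloured by cyl is base R's heatmap() with the RowSideColors argument. This is a sketch under that assumption; the original slide may have used a different function, and the three colours are arbitrary.

```r
# heatmap() needs a numeric matrix.
cars <- as.matrix(mtcars)

# Map each cylinder count (4, 6, 8) to a colour for the row side bar.
cyl_levels   <- sort(unique(mtcars$cyl))
cyl_palette  <- c("gold", "darkorange", "firebrick")
side_colours <- cyl_palette[match(mtcars$cyl, cyl_levels)]

# scale = "column" standardises each variable so that columns measured
# in different units are comparable in colour intensity.
heatmap(cars,
        scale         = "column",
        RowSideColors = side_colours,
        margins       = c(6, 8),
        main          = "mtcars heat map")
```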
26. HeatMaps contd….
Using the “plotly” library, heat maps can be made interactive.
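An interactive heat map can be sketched with plot_ly() and type = "heatmap". The min-max rescaling of each mtcars column is an assumption added here so that variables on different scales share one colour range. It is not taken from the original slides.

```r
library(plotly)

# Rescale every column to the 0-1 range so the colours are comparable.
cars_scaled <- apply(as.matrix(mtcars), 2, function(column) {
  (column - min(column)) / (max(column) - min(column))
})

# Rows of z correspond to y (cars), columns of z to x (variables).
p_heat <- plot_ly(x    = colnames(mtcars),
                  y    = rownames(mtcars),
                  z    = cars_scaled,
                  type = "heatmap")
p_heat
```

Hovering over a cell in the rendered widget shows the car, the variable and the rescaled value.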
27. Correlogram
A correlogram is used to examine the level of correlation among the variables available in the data
set. The cells of the matrix can be shaded or coloured to show the correlation value.
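A basic correlogram can be sketched by computing a correlation matrix with cor() and plotting it. The corrplot package used below is an assumption, since the slides do not name the package, and mtcars is an illustrative dataset.

```r
library(corrplot)

# Pairwise Pearson correlations between all numeric columns of mtcars.
cor_matrix <- cor(mtcars)

# Shade each cell by its correlation value; show only the upper triangle
# because the matrix is symmetric.
corrplot(cor_matrix,
         method = "color",
         type   = "upper",
         tl.col = "black")
```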
28. Correlogram contd...
It is possible to use “ggplot2” aesthetics on the chart, for instance to colour each category. The “GGally” library
offers several variations on the simple correlogram.
29. Correlogram contd….
The type of plot used on each part of the correlogram can be changed. This is done with the upper and lower arguments.
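Both ideas, colouring by category and changing the plot type per region, can be sketched with ggpairs() from GGally. The iris dataset and the particular choices for upper and lower are illustrative assumptions.

```r
library(ggplot2)
library(GGally)

# Columns of iris to include: the four numeric measurements.
numeric_columns <- 1:4

p_pairs <- ggpairs(iris,
                   columns = numeric_columns,
                   # ggplot2 aesthetics: colour every panel by species.
                   mapping = aes(colour = Species),
                   # Upper triangle: correlation coefficients.
                   upper   = list(continuous = "cor"),
                   # Lower triangle: scatter plots with a fitted line.
                   lower   = list(continuous = "smooth"))
print(p_pairs)
```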
30. Area Chart
An area chart is used to show continuity across a variable or data set. It is very similar to a line chart and is commonly used
for time series plots. It can also be used to plot continuous variables and analyze the underlying trends.
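An area chart of a time series might be sketched with geom_area() from ggplot2. The economics dataset (bundled with ggplot2) and its unemploy column are illustrative choices, not necessarily the data used in the original slides.

```r
library(ggplot2)

# Area chart of the number of unemployed people over time.
# `unemploy` is recorded in thousands.
p_area <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_area(fill = "steelblue", alpha = 0.6) +
  geom_line(colour = "navy") +
  labs(title = "Unemployment over time",
       x     = "Date",
       y     = "Unemployed (thousands)")
print(p_area)
```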
31. 3D Plots
● A 3D plot can be created in R
with the help of the scatterplot3d package.
● scatterplot3d is very simple to use and
can be easily extended by adding
supplementary points or regression
planes to an already generated
graphic.
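The sketch below draws a 3D scatter plot and then extends it with a regression plane and supplementary points, as described above. The mtcars variables and the weight threshold are illustrative assumptions.

```r
library(scatterplot3d)

# 3D scatter plot: weight and displacement against fuel efficiency.
s3d <- scatterplot3d(x     = mtcars$wt,
                     y     = mtcars$disp,
                     z     = mtcars$mpg,
                     pch   = 16,
                     color = "steelblue",
                     type  = "h",
                     main  = "mtcars in 3D",
                     xlab  = "Weight (1000 lbs)",
                     ylab  = "Displacement (cu. in.)",
                     zlab  = "Miles per gallon")

# Add a regression plane; the predictors must be in the same order as x and y.
fit <- lm(mpg ~ wt + disp, data = mtcars)
s3d$plane3d(fit, lty = "dashed")

# Add supplementary points: highlight the heaviest cars in red.
heavy <- mtcars[mtcars$wt > 5, ]
s3d$points3d(heavy$wt, heavy$disp, heavy$mpg, col = "red", pch = 17)
```

The object returned by scatterplot3d() carries the plane3d() and points3d() functions, which is what allows drawing onto the existing graphic.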
33. Quick Information
For quick reference, you can check the RStudio cheatsheets page:
https://rstudio.com/resources/cheatsheets/
References:
1. https://rstudio.com/resources/cheatsheets/
2. https://www.slant.co/topics/2354/~best-data-visualization-tools-for-massive-datasets
3. https://policyviz.com/product/core-principles-of-data-visualization-cheatsheet/
4. https://eazybi.com/blog/data_visualization_and_chart_types/
5. https://www.r-graph-gallery.com/199-correlation-matrix-with-ggally.html
6. https://towardsdatascience.com/a-guide-to-data-visualisation-in-r-for-beginners-ef6d41a34174?#0689
Why do we need visualization? It is the million-dollar question, because before starting any kind of analysis we need to draw insights from the data. The relationships among the variables in the data need to be understood, and there is no better way to understand them than visually. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
To understand a dataset properly, we need to know which type of chart to use and when.
1. Used for continuous variables.
2. It breaks the data into bins and shows the frequency distribution of these bins.
3. We can always change the bin size and see the effect it has on the visualization.
brewer.pal makes the color palettes from ColorBrewer available as R palettes.
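As a small sketch of how brewer.pal() is used, the code below requests three colours from a ColorBrewer palette and applies them to a bar chart. The "Set2" palette and the iris species counts are illustrative assumptions, and the RColorBrewer package needs to be installed.

```r
library(RColorBrewer)

# Request 3 colours from the qualitative "Set2" palette.
# The result is a character vector of hex colour codes.
species_colours <- brewer.pal(n = 3, name = "Set2")
species_colours

# Use one colour per category in a base R chart.
barplot(table(iris$Species),
        col  = species_colours,
        main = "Observations per species")
```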
Boxplots are also used to detect outliers in the dataset.
Outlier detection and removal is an essential step of successful data exploration.
We can read off the median and also treat the outliers.
Using the ~ sign, we can visualize how the spread (of Sepal Length) varies across categories (of Species). The last two graphs showed examples of color palettes. A color palette is a group of colors used to make a graph more appealing and to help create visual distinctions in the data.
Lattice enables the use of trellis graphs. Trellis graphs show the relationship between variables conditioned on one or more other variables.
The Grammar of Graphics is a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers. The popularity of ggplot2 has increased tremendously in recent years, since it makes it possible to create graphs of both univariate and multivariate data in a very simple manner.
Advanced visualisations include graphs such as heat maps, geographical maps and 3D charts, which can also be made easily with visualisation tools such as Tableau.
The darker the color, the stronger the correlation between variables. Positive correlations are displayed in blue and negative correlations in red. Color intensity is proportional to the correlation value.
GGally extends ggplot2 by adding several functions to reduce the complexity of combining geoms with transformed data. Some of these functions include a pairwise plot matrix, a scatterplot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.