2. Line Graph
• Visualizes how a value changes over time
• Open the data file “Crude Oil Prices.xls”
• Create the following chart
• Date vs Price
3. Bar Chart for Categorical Data
• Presents categorical data as rectangular bars whose heights or
lengths are proportional to the values they represent.
• Open the data file “Energy Drink Survey.xls”
• Create the following chart
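As a rough sketch of what a bar chart encodes, the following Python snippet counts how often each category occurs and prints one text bar per category, with the bar length proportional to the count. The `responses` list is made-up illustration data, not the contents of “Energy Drink Survey.xls”, and `text_bar_chart` is a hypothetical helper name.

```python
from collections import Counter

def text_bar_chart(values, symbol="#"):
    """Return (category, count, bar) tuples, most frequent category first."""
    counts = Counter(values)
    # One symbol per observation, so the bar length is proportional to the count.
    return [(cat, n, symbol * n) for cat, n in counts.most_common()]

# Hypothetical survey answers (illustration only, not the real survey data)
responses = ["Brand A", "Brand B", "Brand A", "Brand C", "Brand A", "Brand B"]

for category, count, bar in text_bar_chart(responses):
    print(f"{category:<8} {bar} ({count})")
```

A plotting library would draw the same counts as rectangles; the counting step is the part that turns raw categorical answers into bar heights.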
4. Scatterplot
• Displays the relationship between two numerical variables
• Open the data file “House Sales.xls”
• Create the following chart
• Selling Price vs Lot Cost
5. Scatterplot
• Displays the relationship between two numerical variables
• Open the data file “Boston Housing.xls”
• Create the following chart
• Lstat vs Medv
%% Octave / Matlab
clf;
medv = [24 21.6 34.7 …
lstat= [4.98 9.14 4.03 …
x = lstat;
y = medv;
ukuran = 200; % marker size
scatter (x, y, ukuran, 0, "filled");
6. Scatterplot with color added
• Example using Octave
clf;
x = [1, 2, 3, 4, 5, 6, 7];
y = [1.9, 1.76, 1.34, 1.67, 1.72, 1.89, 1.91];
warna = [1,1,1,2,2,3,3]; % color index for each point
ukuran = 200; % marker size
scatter (x, y, ukuran, warna, "filled");
7. Scatterplot with color added
• Displays the relationship between two numerical variables
• Open the data file “Boston Housing.xls”
• Create the following chart
• Lstat vs Nox
clf;
medv = [24 21.6 34.7 …
lstat= [4.98 9.14 4.03 …
nox = [0.538 0.469 0.469 …
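med = median(medv);
warna = ones(size(medv)); % start with color index 1 for every point
iwarna = medv <= med; % points at or below the median of medv
warna(iwarna) = 0; % give those points color index 0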
x = lstat;
y = nox;
ukuran = 200;
scatter (x, y, ukuran, warna, "filled");
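For comparison, the same median-based color coding can be sketched in plain Python. This uses only the first three `medv` values visible above (the rest of the column is elided in the slide), and `median_split_colors` is a hypothetical helper name.

```python
from statistics import median

def median_split_colors(values):
    """Return 0 for values at or below the median, 1 for values above it."""
    med = median(values)
    return [0 if v <= med else 1 for v in values]

# Only the first three medv values shown in the slide; the rest are elided there.
medv = [24, 21.6, 34.7]
warna = median_split_colors(medv)
print(warna)  # → [0, 0, 1]
```

The resulting list plays the same role as `warna` in the Octave code: one color index per point, passed to the plotting call as the color argument.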
8. Scatterplot using R
data(iris)
pairs(iris[1:4],main="Iris Data(red=setosa,green=versicolor,blue=virginica)",
pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
9. Scatterplot using Python
# data iris
import pandas as pd
import os
#Set working directory and load data
os.chdir('C:myworkspacepython') # C:myworkspacepython
# load dataset
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
#Create numeric classes for species (0,1,2)
iris.loc[iris['Species']=='Iris-virginica','Species']=0
iris.loc[iris['Species']=='Iris-versicolor','Species']=1
iris.loc[iris['Species']=='Iris-setosa','Species'] = 2
iris = iris[iris['Species']!=2]
#
X = iris['PetalLength'].values.T
Y = iris['PetalWidth'].values.T
warna = iris[['Species']].values.T
warna = warna.astype('uint8')
#Make a scatter plot
import matplotlib.pyplot as plt
plt.scatter(X, Y, c=warna[0,:], s=40, cmap=plt.cm.Spectral);
plt.title("IRIS DATA | Blue - Versicolor, Red - Virginica ")
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
10. Scatterplot for all attributes using Python
# Scatter plots of all pairs of attributes
# pip install seaborn
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Set working directory and load data
os.chdir('C:myworkspacepython') # C:myworkspacepython
# load dataset
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
#Create numeric classes for species (0,1,2)
iris.loc[iris['Species']=='Iris-virginica','Species']=0
iris.loc[iris['Species']=='Iris-versicolor','Species']=1
iris.loc[iris['Species']=='Iris-setosa','Species'] = 2
plt.close()
# the `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(irisdata, hue = 'Species', height = 2, diag_kind = 'kde')
plt.show()
11. Histogram
• Displays the distribution of the outcome variable
• Shows how many times each value (or range of values) occurs in a data set
• Open the data file “House Sales.xls”
• Create the following chart
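The counting behind a histogram can be sketched in a few lines of Python. This is a minimal version that assumes equal-width bins; `histogram_counts` is a hypothetical helper name and `prices` is made-up illustration data, not the contents of “House Sales.xls”.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; cannot form equal-width bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical selling prices (illustration only)
prices = [100, 120, 130, 150, 155, 160, 200, 210, 300]
edges, counts = histogram_counts(prices, 4)
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {'*' * n}")
```

Each printed row corresponds to one bar of the histogram; changing `n_bins` changes how coarse or fine the picture of the distribution is.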
12. Histogram using R
• Displays the distribution of the outcome variable
# try a histogram
HouseSales<-read.csv("House Sales.csv",header=T)
hist(HouseSales$SellingPrice)
13. Histogram
• Displays the distribution of the outcome variable
• Shows how many times each value (or range of values) occurs in a data set
• Open the data file “Boston Housing.xls”
• Create the following chart
# try a histogram
BostonHousing<-read.csv("Boston Housing.csv",header=T)
hist(BostonHousing$medv)
14. Histogram using Python
import pandas as pd
import os
import matplotlib.pyplot as plt
#Set working directory and load data
os.chdir('C:myworkspacepython') # C:myworkspacepython
# load dataset
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
#Create numeric classes for species (0,1,2)
iris.loc[iris['Species']=='Iris-virginica','Species']=0
iris.loc[iris['Species']=='Iris-versicolor','Species']=1
iris.loc[iris['Species']=='Iris-setosa','Species'] = 2
# Histograms of distribution of input attributes
irisdata.hist()
his = plt.gcf()
his.set_size_inches(12, 6)
plt.show()
15. Boxplot
• Depicts groups of numerical data through their quartiles
• Useful for comparing subgroups
• Open the data file “President's Inn Guest Database.xls”
• Create the following chart
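The quartiles a boxplot draws can be computed directly, which makes it clearer what is being compared across subgroups. The sketch below uses Python's `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between tools, so other software may report slightly different values. The `groups` data and the helper name `five_number_summary` are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a boxplot is built from."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical subgroups (illustration only, not from the guest database)
groups = {
    "group A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "group B": [2, 4, 6, 8, 10, 12, 14, 16, 18],
}
for name, values in groups.items():
    print(name, five_number_summary(values))
```

Placing the summaries side by side is the numerical counterpart of drawing one box per subgroup.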
16. Boxplot
• Depicts groups of numerical data through their quartiles
• Useful for comparing subgroups
• Open the data file “Boston Housing.xls”
• Create the following chart
17. Boxplot using R
• Depicts groups of numerical data through their quartiles
• Useful for comparing subgroups
# try a boxplot for the medv column
BostonHousing<-read.csv("Boston Housing.csv",header=T)
boxplot(BostonHousing$medv)
# try boxplots for the medv and lstat columns
BostonHousing<-read.csv("Boston Housing.csv",header=T)
boxplot(BostonHousing$medv, BostonHousing$lstat,
main = "Multiple boxplots for comparison",
at = c(1,2), names = c("medv", "lstat") )
18. Boxplot using Python
import pandas as pd
import os
#Set working directory and load data
os.chdir('C:myworkspacepython') # C:myworkspacepython
# load dataset
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
#Create numeric classes for species (0,1,2)
iris.loc[iris['Species']=='Iris-virginica','Species']=0
iris.loc[iris['Species']=='Iris-versicolor','Species']=1
iris.loc[iris['Species']=='Iris-setosa','Species'] = 2
import matplotlib.pyplot as plt
# Box and whisker plots(Give idea about distribution of input attributes)
irisdata.plot(kind = 'box', subplots = True, layout = (2, 2), sharex = False, sharey = False)
plt.show()
19. Heatmap
• Correlation Matrix
• To highlight correlations
• Open the data file “Boston Housing.xls”
• Create the following chart
Load the Analysis ToolPak in Excel
• Click the File tab, click Options, and then click the Add-Ins category.
• If you're using Excel 2007, click the Microsoft Office Button, and then
click Excel Options.
• In the Manage box, select Excel Add-ins and then click Go.
• If you're using Excel for Mac, in the file menu go to Tools > Excel Add-ins.
• In the Add-Ins box, check the Analysis ToolPak check box, and then click OK.
• If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to
locate it.
• If you are prompted that the Analysis ToolPak is not currently installed on
your computer, click Yes to install it.
• Select Data > Data Analysis > Correlation
• Select the input range; in our case, columns F:Q
• Check the box for “Labels in first row”
• Select the output location: either a new worksheet or a location in the
current sheet
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
crim 1
zn -0,20047 1
indus 0,406583 -0,53383 1
chas -0,05589 -0,0427 0,062938 1
nox 0,420972 -0,5166 0,763651 0,091203 1
rm -0,21925 0,311991 -0,39168 0,091251 -0,30219 1
age 0,352734 -0,56954 0,644779 0,086518 0,73147 -0,24026 1
dis -0,37967 0,664408 -0,70803 -0,09918 -0,76923 0,205246 -0,74788 1
rad 0,625505 -0,31195 0,595129 -0,00737 0,611441 -0,20985 0,456022 -0,49459 1
tax 0,582764 -0,31456 0,72076 -0,03559 0,668023 -0,29205 0,506456 -0,53443 0,910228 1
ptratio 0,289946 -0,39168 0,383248 -0,12152 0,188933 -0,3555 0,261515 -0,23247 0,464741 0,460853 1
b -0,38506 0,17552 -0,35698 0,048788 -0,38005 0,128069 -0,27353 0,291512 -0,44441 -0,44181 -0,17738 1
lstat 0,455621 -0,41299 0,6038 -0,05393 0,590879 -0,61381 0,602339 -0,497 0,488676 0,543993 0,374044 -0,36609 1
medv -0,3883 0,360445 -0,48373 0,17526 -0,42732 0,69536 -0,37695 0,249929 -0,38163 -0,46854 -0,50779 0,333461 -0,73766 1
• Conditional formatting
• Color scales
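Each cell in the correlation table is a Pearson correlation coefficient between two columns. As a minimal sketch of that calculation in plain Python, the snippet below builds a small correlation matrix from scratch. The `columns` data is hypothetical (not the Boston Housing values), and `pearson` and `correlation_matrix` are illustrative helper names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of named columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}

# Hypothetical columns (illustration only)
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # rises with a
    "c": [8.0, 6.0, 4.0, 2.0],   # falls as a rises
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(r, 3) for r in row.values()])
```

A heatmap then maps each coefficient to a color, which is what the color-scale formatting does for the table: values near 1 and near -1 stand out, values near 0 fade.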
20. Heatmap using R
bostonhousing<-read.csv("Boston Housing.csv",header=T)
x <- as.matrix(bostonhousing)
xx <- cor(x)
my_palette <- colorRampPalette(c("red", "blue", "yellow"))(n = 256)
your_palette <- cm.colors(256)
rc <- rainbow(nrow(xx), start = 0, end = .3)
cc <- rainbow(ncol(xx), start = 0, end = .3)
hv <- heatmap(xx, col = my_palette, scale = "column",
RowSideColors = rc, ColSideColors = cc, margins = c(5,10),
xlab = "xlabel", ylab = "ylabel",
main = "heatmap"
)
utils::str(hv) # the two re-ordering index vectors
21. Heatmap using Python
# data iris
import pandas as pd
import os
#Set working directory and load data
os.chdir('C:myworkspacepython') # C:myworkspacepython
# load dataset
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
#Create numeric classes for species (0,1,2)
iris.loc[iris['Species']=='Iris-virginica','Species']=0
iris.loc[iris['Species']=='Iris-versicolor','Species']=1
iris.loc[iris['Species']=='Iris-setosa','Species'] = 2
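import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(7,5))
# numeric_only=True skips the text 'Species' column (requires a pandas version that supports this argument)
sns.heatmap(irisdata.corr(numeric_only=True),annot=True,cmap='RdYlGn_r')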
plt.show()
22. Treemaps
• Shows the relative size of the items in your data by area.
The larger the area, the more important the data item.
• Open the data file “daftar-file.xls”
• Create the following chart
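The idea that area is proportional to value can be sketched with a deliberately simple layout: a single row of side-by-side rectangles. Real treemap libraries use more elaborate layouts, so this is only an illustration of the proportionality. The helper name `slice_layout` and the `file_sizes` data are hypothetical.

```python
def slice_layout(sizes, total_width, height):
    """Split a total_width x height rectangle into side-by-side slices.

    Each slice gets a width proportional to its size, so its area is
    proportional to its size as well. Returns {name: (x, width, area)}.
    """
    total = sum(sizes.values())
    layout, x = {}, 0.0
    for name, size in sizes.items():
        width = total_width * size / total
        layout[name] = (x, width, width * height)
        x += width
    return layout

# Hypothetical file sizes (illustration only, not from "daftar-file.xls")
file_sizes = {"a.pdf": 50, "b.xls": 30, "c.txt": 20}
layout = slice_layout(file_sizes, 100.0, 10.0)
for name, (x, width, area) in layout.items():
    print(f"{name}: x={x:.1f} width={width:.1f} area={area:.1f}")
```

An item that accounts for half of the total size receives half of the total area, which is the property a treemap relies on.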
23. Treemaps using R
• Shows the relative size of the items in your data by area.
The larger the area, the more important the data item.
• Open the data file “daftar-file.csv”
# install.packages("treemap");
library(treemap)
dataku<-read.csv("daftar-file.csv",header=T)
treemap(dataku,
index=c("subdir2", "namafile"),
vSize="ukuran",
vColor="ukuran",
type="value",
format.legend = list(scientific = FALSE, big.mark = " "))
24. Treemaps using R
• Shows the relative size of the items in your data by area.
The larger the area, the more important the data item.
library(treemap)
data(GNI2014)
treemap(GNI2014,
index=c("continent", "iso3"),
vSize="population",
vColor="GNI",
type="value",
format.legend = list(scientific = FALSE, big.mark = " "))
iso3  country  continent      population  GNI
BMU   Bermuda  North America       67837  106140
NOR   Norway   Europe            4676305  103630
25. Treemap using Python
import pandas as pd
import os
#Set working directory and load data
os.chdir('C:myworkspacepython') # C:myworkspacepython
# load dataset
datax = pd.read_csv('daftar-file.csv')
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import squarify
# keep only the files larger than 25K
mydata = datax[datax["ukuran"]>25]
#Utilise matplotlib to scale our goal numbers between the min and max, then assign this scale to our values.
norm = matplotlib.colors.Normalize(vmin=min(mydata.ukuran), vmax=max(mydata.ukuran))
colors = [matplotlib.cm.Blues(norm(value)) for value in mydata.ukuran]
#Create our plot and resize it.
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
#Use squarify to plot our data, label it and add colours. We add an alpha layer to ensure black labels show through
squarify.plot(label=mydata.namafile,sizes=mydata.ukuran, color = colors, alpha=.6)
plt.title("My files larger than 25K",fontsize=23,fontweight="bold")
#Remove our axes and display the plot
plt.axis('off')
plt.show()
26. Descriptive statistics for Iris Dataset using Python
import pandas as pd
import os
#Set working directory and load data
os.chdir('C:myworkspacepython') # C:myworkspacepython
# load dataset
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
#Create numeric classes for species (0,1,2)
iris.loc[iris['Species']=='Iris-virginica','Species']=0
iris.loc[iris['Species']=='Iris-versicolor','Species']=1
iris.loc[iris['Species']=='Iris-setosa','Species'] = 2
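# print() is needed when running as a script; an interactive console shows the results without it
print(iris.shape)
print(iris.describe())
irisdata.info()
print(irisdata.head())
print(irisdata[irisdata['Species']=='Iris-virginica'].describe())
print(irisdata.groupby('Species').size())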
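To make the numbers behind these calls less opaque, here is a small standard-library sketch that computes a few of the statistics that `describe()` reports and the group sizes that `groupby('Species').size()` returns. The `rows` data is hypothetical (not the real iris measurements), and the helper name `describe_values` is illustrative. The sample standard deviation is used, matching the pandas default.

```python
from collections import Counter
from statistics import mean, stdev

def describe_values(values):
    """Return count, mean, sample standard deviation, min and max."""
    return {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }

# Hypothetical (species, petal length) rows, illustration only
rows = [
    ("Iris-setosa", 1.0), ("Iris-setosa", 2.0), ("Iris-setosa", 3.0),
    ("Iris-virginica", 5.0), ("Iris-virginica", 7.0),
]

summary = describe_values([length for _, length in rows])
group_sizes = Counter(species for species, _ in rows)
print(summary)
print(dict(group_sizes))
```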