Data visualization practice with Excel, py

Data Visualization Experiments
Using Excel, Octave/Matlab, R, and Python
atjahyanto@gmail.com

Linegraph
• To visualize the value of something over time
• Open the data file “Crude Oil Prices.xls”
• Create a line chart of Date vs Price

Bar Chart for Categorical Data
• Presents categorical data as rectangular bars whose heights or lengths are proportional to the values they represent.
• Open the data file “Energy Drink Survey.xls”
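As a rough illustration of how bar lengths follow category counts, the sketch below tallies a handful of made-up survey answers and prints one text bar per category. The brand names are placeholders, not the contents of “Energy Drink Survey.xls”.

```python
from collections import Counter

# Hypothetical survey answers; not taken from "Energy Drink Survey.xls".
answers = ["Brand A", "Brand B", "Brand A", "Brand C", "Brand A", "Brand B"]

# Count how often each category occurs.
counts = Counter(answers)

# Build one text bar per category; the bar length equals the count.
bars = {category: "#" * n for category, n in counts.items()}
for category in sorted(bars):
    print(f"{category:8s} {bars[category]} ({counts[category]})")
```

A charting tool does the same tally before drawing the rectangles; only the rendering differs.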

Scatterplot
• Displays relationship between two numerical variables
• Open the data file “House Sales.xls”
• Selling Price vs Lot Cost

Scatterplot
• Open the data file “Boston Housing.xls”
• Lstat vs Medv
%% Octave / Matlab
clf;
medv = [24 21.6 34.7 …
lstat= [4.98 9.14 4.03 …
x = lstat;       % horizontal axis
y = medv;        % vertical axis
ukuran = 200;    % marker size
scatter (x, y, ukuran, 0, "filled");

Scatterplot with color added
• Example using Octave
clf;
x = [1, 2, 3, 4, 5, 6, 7];
y = [1.9, 1.76, 1.34, 1.67, 1.72, 1.89, 1.91];
warna = [1,1,1,2,2,3,3];   % colour index for each point
ukuran = 200;              % marker size
scatter (x, y, ukuran, warna, "filled");

Scatterplot with color added
• Lstat vs Nox
clf;
medv = [24 21.6 34.7 …
lstat= [4.98 9.14 4.03 …
nox = [0.538 0.469 0.469 …
med = median(medv);
warna = ones(size(medv));   % colour 1: medv above the median
iwarna = medv <= med;       % points with medv at or below the median
warna(iwarna) = 0;          % colour 0 for those points
x = lstat;
y = nox;
ukuran = 200;
scatter (x, y, ukuran, warna, "filled");
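The median split used above for the colour codes can also be sketched in Python. This is only an illustration: it uses the three medv values visible on the slide, because the rest of the column is elided there.

```python
from statistics import median

# Only the first three values shown on the slide; the full column is elided there.
medv = [24, 21.6, 34.7]

med = median(medv)
# Colour code 0 for values at or below the median, 1 for values above it.
warna = [0 if value <= med else 1 for value in medv]
print(med, warna)
```

The resulting list plays the same role as the `warna` vector passed to `scatter` in the Octave code.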

Scatterplot using R
data(iris)
pairs(iris[1:4],main="Iris Data(red=setosa,green=versicolor,blue=virginica)",
pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Scatterplot using Python
# Iris data
import os
import pandas as pd
import matplotlib.pyplot as plt

# Set the working directory (adjust the path to your own folder)
os.chdir('C:myworkspacepython')
# Load the dataset twice: irisdata keeps the original species labels
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
# Create numeric classes for species (0, 1, 2)
iris['Species'] = iris['Species'].map(
    {'Iris-virginica': 0, 'Iris-versicolor': 1, 'Iris-setosa': 2})
# Keep only virginica (0) and versicolor (1)
iris = iris[iris['Species'] != 2]
X = iris['PetalLength'].values
Y = iris['PetalWidth'].values
warna = iris['Species'].values.astype('uint8')  # colour code per point
# Make a scatter plot
plt.scatter(X, Y, c=warna, s=40, cmap=plt.cm.Spectral)
plt.title("IRIS DATA | Blue - Versicolor, Red - Virginica")
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()

Scatterplot for all attributes using Python
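# Scatter plots of all pairs of attributes
# pip install seaborn
import matplotlib.pyplot as plt
import seaborn as sns
plt.close()
# height replaces the size argument, which newer seaborn versions no longer accept
sns.pairplot(irisdata, hue='Species', height=2, diag_kind='kde')
plt.show()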

Histogram
• Displays the distribution of the outcome variable
• Displays “how many” of each value occur in a data set
• Open the data file “House Sales.xls”
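To make the “how many” idea concrete, the sketch below counts values per bin with plain Python and prints a text histogram. The prices and the bin width of 20 are made up for illustration and are not taken from “House Sales.xls”.

```python
def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = {}
    for value in values:
        # Lower edge of the bin that contains this value.
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical selling prices (in thousands); not taken from "House Sales.xls".
prices = [105, 118, 123, 131, 137, 139, 142, 158, 171]
counts = histogram_counts(prices, bin_width=20)
for lower_edge, n in counts.items():
    print(f"{lower_edge:>4}-{lower_edge + 20:<4} {'#' * n}")
```

Changing `bin_width` changes the shape of the histogram, which is worth trying before settling on one view of the data.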

Histogram using R
# try a histogram
HouseSales <- read.csv("House Sales.csv", header=TRUE)
hist(HouseSales$SellingPrice)

Histogram
• Displays “how many” of each value occur in a data set
# try a histogram
BostonHousing <- read.csv("Boston Housing.csv", header=TRUE)
hist(BostonHousing$medv)

Histogram using Python
# Assumes irisdata has been loaded as in the scatterplot example
import matplotlib.pyplot as plt
# Histograms of the distribution of the input attributes
irisdata.hist()
his = plt.gcf()
his.set_size_inches(12, 6)
plt.show()

Boxplot
• Depicts groups of numerical data through their quartiles
• Useful for comparing subgroups
• Open the data file “President's Inn Guest Database.xls”
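The quartiles behind a boxplot can be sketched with the Python standard library. The values below are made up for illustration, and the "inclusive" quartile method is an assumption: Excel, R, and Python each offer several quartile conventions that can give slightly different results.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical values for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(data)
print(summary)
```

The box spans Q1 to Q3 with a line at the median; how far the whiskers extend depends on the plotting tool.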

Boxplot using R
• Depicts groups of numerical data through their quartiles
# try a boxplot for the medv column
boxplot(BostonHousing$medv)
# try side-by-side boxplots for the medv and lstat columns
boxplot(BostonHousing$medv, BostonHousing$lstat,
main = "Multiple boxplots for comparison",
at = c(1,2), names = c("medv", "lstat") )

Boxplot using Python
# Assumes irisdata has been loaded as in the scatterplot example
import matplotlib.pyplot as plt
# Box-and-whisker plots (give an idea of the distribution of the input attributes)
irisdata.plot(kind = 'box', subplots = True, layout = (2, 2), sharex = False, sharey = False)
plt.show()

Heatmap
• Correlation Matrix
• To highlight correlations
Load the Analysis ToolPak in Excel
• Click the File tab, click Options, and then click the Add-Ins category.
• If you're using Excel 2007, click the Microsoft Office Button, and then click Excel Options.
• In the Manage box, select Excel Add-ins and then click Go.
• If you're using Excel for Mac, go to Tools > Excel Add-ins in the menu.
• In the Add-Ins box, check the Analysis ToolPak check box, and then click OK.
• If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it.
• If you are prompted that the Analysis ToolPak is not currently installed on your computer, click Yes to install it.
• Select Data > Data Analysis > Correlation
• Select the input range, in our case columns F:Q
• Check the box for “Labels in first row”
• Select the output location: either a new worksheet or a range in the current sheet
              crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b     lstat      medv
crim             1
zn        -0.20047         1
indus     0.406583  -0.53383         1
chas      -0.05589   -0.0427  0.062938         1
nox       0.420972   -0.5166  0.763651  0.091203         1
rm        -0.21925  0.311991  -0.39168  0.091251  -0.30219         1
age       0.352734  -0.56954  0.644779  0.086518   0.73147  -0.24026         1
dis       -0.37967  0.664408  -0.70803  -0.09918  -0.76923  0.205246  -0.74788         1
rad       0.625505  -0.31195  0.595129  -0.00737  0.611441  -0.20985  0.456022  -0.49459         1
tax       0.582764  -0.31456   0.72076  -0.03559  0.668023  -0.29205  0.506456  -0.53443  0.910228         1
ptratio   0.289946  -0.39168  0.383248  -0.12152  0.188933   -0.3555  0.261515  -0.23247  0.464741  0.460853         1
b         -0.38506   0.17552  -0.35698  0.048788  -0.38005  0.128069  -0.27353  0.291512  -0.44441  -0.44181  -0.17738         1
lstat     0.455621  -0.41299    0.6038  -0.05393  0.590879  -0.61381  0.602339    -0.497  0.488676  0.543993  0.374044  -0.36609         1
medv       -0.3883  0.360445  -0.48373   0.17526  -0.42732   0.69536  -0.37695  0.249929  -0.38163  -0.46854  -0.50779  0.333461  -0.73766         1
• Select the correlation matrix and apply Conditional Formatting
• Choose Color Scales so that each cell is colored by its value
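Each cell of such a correlation matrix is a Pearson correlation coefficient between two columns. The sketch below computes a small matrix by hand with the standard library. The column names `u`, `v`, `w` and their values are hypothetical, not taken from “Boston Housing.xls”.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build a nested dict {name: {name: r}} from a dict of columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}

# Hypothetical columns for illustration only.
columns = {"u": [1, 2, 3, 4, 5], "v": [2, 4, 6, 8, 10], "w": [5, 3, 4, 1, 2]}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
```

Values near 1 or -1 are the cells a colour scale makes stand out; values near 0 fade toward the middle of the scale.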

Heatmap using R
bostonhousing <- read.csv("Boston Housing.csv", header=TRUE)
x <- as.matrix(bostonhousing)
xx <- cor(x)  # correlation matrix of all columns
# colour palette for the cells
my_palette <- colorRampPalette(c("red", "blue", "yellow"))(n = 256)
your_palette <- cm.colors(256)  # alternative palette
# colours for the row and column side bars
rc <- rainbow(nrow(xx), start = 0, end = .3)
cc <- rainbow(ncol(xx), start = 0, end = .3)
hv <- heatmap(xx, col = my_palette, scale = "column",
RowSideColors = rc, ColSideColors = cc, margins = c(5,10),
xlab = "xlabel", ylab = "ylabel",
main = "heatmap"
)
utils::str(hv) # the two re-ordering index vectors

Heatmap using Python
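# Assumes irisdata has been loaded as in the scatterplot example
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(7,5))
# numeric_only=True skips the non-numeric Species column
sns.heatmap(irisdata.corr(numeric_only=True), annot=True, cmap='RdYlGn_r')
plt.show()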

Treemaps
• Shows the size of each item in your data as an area: the larger the area, the larger (and usually the more important) the item.
• Open the data file “daftar-file.xls”
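The idea of area proportional to size can be sketched without any plotting library. The function below splits one rectangle into vertical strips; this is the simplest possible layout, not the squarified layout that treemap libraries normally use, and the sizes are made up rather than read from “daftar-file.xls”.

```python
def slice_layout(sizes, width, height):
    """Split a width x height rectangle into vertical strips.

    Each strip's area is proportional to its size.
    """
    total = sum(sizes)
    rects = []
    x = 0.0
    for size in sizes:
        strip_width = width * size / total
        rects.append((x, 0.0, strip_width, height))  # (x, y, w, h)
        x += strip_width
    return rects

# Hypothetical file sizes in KB for illustration only.
rects = slice_layout([50, 30, 20], width=100, height=10)
for rect in rects:
    print(rect)
```

Squarified layouts follow the same proportionality rule but arrange the rectangles so that they stay closer to squares, which makes them easier to compare.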

Treemaps using R
• Shows the size of each item in your data as an area: the larger the area, the larger (and usually the more important) the item.
• Open the data file “daftar-file.csv”
# install.packages("treemap");
library(treemap)
dataku<-read.csv("daftar-file.csv",header=T)
treemap(dataku,
index=c("subdir2", "namafile"),
vSize="ukuran",
vColor="ukuran",
type="value",
format.legend = list(scientific = FALSE, big.mark = " "))

Treemaps using R
• Shows the size of each item in your data as an area: the larger the area, the larger (and usually the more important) the item.
library(treemap)
data(GNI2014)
treemap(GNI2014,
index=c("continent", "iso3"),
vSize="population",
vColor="GNI",
type="value",
format.legend = list(scientific = FALSE, big.mark = " "))
iso3  country  continent      population     GNI
BMU   Bermuda  North America       67837  106140
NOR   Norway   Europe            4676305  103630

Treemap using Python
import matplotlib.colors
import matplotlib.pyplot as plt
import pandas as pd
import squarify  # pip install squarify

# Load the file list (columns used below: namafile, ukuran)
datax = pd.read_csv('daftar-file.csv')
# Keep only files larger than 25K
mydata = datax[datax["ukuran"] > 25]
# Scale the file sizes between their min and max, then map them to shades of blue
norm = matplotlib.colors.Normalize(vmin=mydata.ukuran.min(), vmax=mydata.ukuran.max())
colors = [plt.cm.Blues(norm(value)) for value in mydata.ukuran]
# Create the figure and resize it
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
# Plot with squarify, label the tiles and add colours; alpha keeps the black labels readable
squarify.plot(sizes=mydata.ukuran, label=mydata.namafile, color=colors, alpha=.6, ax=ax)
plt.title("My files larger than 25K", fontsize=23, fontweight="bold")
# Remove the axes and display the plot
plt.axis('off')
plt.show()

Descriptive statistics for Iris Dataset
Using Python
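import os
import pandas as pd
os.chdir('C:myworkspacepython')  # adjust the path to your own folder
# load dataset
irisdata = pd.read_csv('iris-UCI-header.csv')
iris = pd.read_csv('iris-UCI-header.csv')
print(iris.shape)        # (rows, columns)
print(iris.describe())   # summary statistics of the numeric columns
irisdata.info()          # column types and non-null counts
print(irisdata.head())   # first five rows
print(irisdata[irisdata['Species']=='Iris-virginica'].describe())
print(irisdata.groupby('Species').size())  # number of rows per species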
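What `groupby('Species').size()` and a per-group `describe()` compute can be sketched with the standard library. The rows below are hypothetical and are not read from 'iris-UCI-header.csv'.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows (species, petal length) for illustration only.
rows = [
    ("Iris-setosa", 1.4), ("Iris-setosa", 1.6),
    ("Iris-versicolor", 4.5),
    ("Iris-virginica", 5.5), ("Iris-virginica", 6.1),
]

# Collect the petal lengths of each species.
groups = defaultdict(list)
for species, petal_length in rows:
    groups[species].append(petal_length)

# Group size and group mean, the two simplest descriptive statistics.
sizes = {species: len(values) for species, values in groups.items()}
means = {species: mean(values) for species, values in groups.items()}
print(sizes)
print(means)
```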
