This document discusses exploratory data analysis (EDA) and data visualization techniques in Python. It introduces commonly used Python packages like matplotlib, pandas, and seaborn for EDA and visualization. Specific visualization methods covered include histograms, scatter plots, line plots, bar plots, boxplots, heatmaps, and jointplots. The document also discusses concepts like normal distribution and how to test if a variable is normally distributed. For homework, students are asked to visualize and analyze the winequality-red.csv dataset using various charts.
2. About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data / ML / AIoT / AI Columnist
5. EDA
• EDA is the critical process of performing initial investigations
on data to discover patterns, spot anomalies, test
hypotheses, and check assumptions with the help of summary
statistics and graphical representations.
6. EDA
• Useful Python packages for
EDA:
• matplotlib
• pandas
• seaborn
• A useful interactive Python
visualization tool:
• dash
Ref: https://dash.plotly.com/basic-callbacks
7. Using pandas
• First, load the CSV file into a DataFrame
• Then check the DataFrame's basic information with these methods and attributes:
• head()
• tail()
• shape
• info()
• describe(include='all')
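The inspection steps above can be sketched as follows. This is a minimal example, not the content of the course notebook: the inline CSV text and its column names are hypothetical stand-ins for a real file, used so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
csv_text = """name,height_cm,city
Ann,160.5,Taipei
Bob,172.0,Taichung
Cat,168.2,Taipei
Dan,181.3,Tainan
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first rows
print(df.tail(2))   # last rows
print(df.shape)     # (rows, columns); an attribute, not a method
df.info()           # column dtypes and non-null counts (prints directly)

# include="all" summarizes non-numeric columns as well as numeric ones.
summary = df.describe(include="all")
print(summary)
```

With a real dataset, the `io.StringIO(csv_text)` argument would be replaced by the path to the CSV file.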
8. Using pandas
• Visualize directly from the DataFrame with these methods:
• corr
• hist
• scatter
• line
• bar
• pie
• boxplot
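As a rough illustration of the methods listed above, the sketch below draws each chart type from one small DataFrame. The synthetic data, the column names (`x`, `y`, `group`), and the output filename are assumptions made for the example. The `Agg` backend is selected only so the script runs without a display; in a notebook the figure would render inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (hypothetical columns) so the example is self-contained.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=100),
    "group": rng.choice(["a", "b", "c"], size=100),
})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=100)

# corr() returns the pairwise (Pearson) correlation matrix; it is a table, not a plot.
corr = df[["x", "y"]].corr()
print(corr)

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
df["x"].plot.hist(bins=20, ax=axes[0, 0], title="hist")
df.plot.scatter(x="x", y="y", ax=axes[0, 1], title="scatter")
df["y"].plot.line(ax=axes[0, 2], title="line")

counts = df["group"].value_counts()   # category frequencies for bar and pie
counts.plot.bar(ax=axes[1, 0], title="bar")
counts.plot.pie(ax=axes[1, 1], title="pie")

df.boxplot(column="y", by="group", ax=axes[1, 2])  # one box per group
fig.tight_layout()
fig.savefig("pandas_plots.png")  # hypothetical output filename
```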
pandas.ipynb
9. Using seaborn
• Seaborn is a statistical visualization library built on matplotlib
that works with NumPy and pandas data structures.
• heatmap
• kdeplot/displot
• cut, cumulative (kdeplot parameters)
• jointplot
• pairplot
• lmplot
• barplot
• countplot
• catplot
seaborn.ipynb
15. How to test for a normal distribution
• The following variables are approximately normally distributed:
• Height of a population
• Blood pressure of adult humans
• Position of a particle that experiences diffusion
• Measurement errors
• Residuals in regression
• Shoe size of a population
• Amount of time it takes for employees to reach home
• A large number of educational measures
16. How to test for a normal distribution
• A normal distribution is a distribution
that is fully determined by two
parameters: the mean and
the standard deviation, both of which can be estimated from a sample.
• Mean: the average value of all the
points in the sample, computed by
summing the values and then dividing by
the total number of values in the sample.
• Standard deviation: indicates how
far the values typically deviate from the mean
of the sample.
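One possible way to turn these two parameters into a normality check is sketched below. It estimates the mean and standard deviation of a sample and then applies the Shapiro-Wilk test from SciPy. The choice of Shapiro-Wilk, the helper name `describe_and_test`, the 0.05 threshold, and the synthetic samples are all assumptions for illustration; the course notebook may use a different test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic samples: one drawn from a normal distribution, one clearly skewed.
normal_sample = rng.normal(loc=170.0, scale=8.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)


def describe_and_test(sample, alpha=0.05):
    """Hypothetical helper: estimate mean/std and run a Shapiro-Wilk test."""
    mean = sample.mean()
    std = sample.std(ddof=1)               # ddof=1 gives the sample standard deviation
    stat, p_value = stats.shapiro(sample)  # null hypothesis: the data are normal
    return {
        "mean": float(mean),
        "std": float(std),
        "p_value": float(p_value),
        # A small p-value is evidence against normality.
        "reject_normality": bool(p_value < alpha),
    }


normal_result = describe_and_test(normal_sample)
skewed_result = describe_and_test(skewed_sample)
print(normal_result)
print(skewed_result)
```

A p-value above the threshold does not prove the data are normal; it only means the test found no strong evidence against normality. A histogram or Q-Q plot is a useful visual complement.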
Ref: https://www.varsitytutors.com/hotmath/hotmath_help/topics/normal-distribution-of-data
test_for_a_Normal_Distribution.ipynb