REGULATION – 2021
CS3361 – DATA SCIENCE LABORATORY
LAB MANUAL
YEAR / SEMESTER: II / III
Prepared by
P.SANTHIYA
Assistant Professor
Department of Computer Science and Engineering
CS3362 DATA SCIENCE LABORATORY L T P C
0 0 4 2
COURSE OBJECTIVES:
To understand the python libraries for data science.
To understand the basic Statistical and Probability measures for data science.
To learn descriptive analytics on the benchmark data sets.
To apply correlation and regression analytics on standard data sets.
To present and interpret data using visualization packages in Python.
LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter,
Statsmodels and Pandas packages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring
various commands for doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data
set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
LIST OF EQUIPMENTS :(30 Students per Batch)
Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statsmodels, seaborn,
plotly, bokeh
Note: Example data sets like: UCI, Iris, Pima Indians Diabetes etc.
TOTAL: 60 PERIODS
COURSE OUTCOMES:
At the end of this course, the students will be able to:
 Make use of the python libraries for data science.
 Make use of the basic Statistical and Probability measures for data science.
 Perform descriptive analytics on the benchmark data sets.
 Perform correlation and regression analytics on standard data sets.
 Present and interpret data using visualization packages in Python.
Ex.No 1
Download, install and explore the features of NumPy, SciPy,
Jupyter, Statsmodels and Pandas packages
Date:
AIM:
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
Downloading and Installing Anaconda on Linux:
1. Getting Started:
2. Getting through the License Agreement:
3. Choose Installation Location:
4. Extracting Files and packages:
5. Initializing Anaconda Installation:
6. Finishing up the Installation:
7. Working with Anaconda:
>> anaconda-navigator
a) Installing Jupyter Notebook using Anaconda:
To install Jupyter using Anaconda, just go through the following instructions:
1. Launch Anaconda Navigator:
2. Click on the Install Jupyter Notebook Button:
3. Beginning the Installation:
4. Loading Packages:
5. Finished Installation:
6. Launching Jupyter:
b) Installing Jupyter Notebook using pip:
To install Jupyter using pip, first run the following command to update pip:
>> python3 -m pip install --upgrade pip
After updating the pip version, follow the instructions provided below to install Jupyter:
Command to install Jupyter:
>>pip3 install Jupyter
1. Beginning Installation:
2. Collecting Files and Data:
3. Downloading Packages:
4. Running Installation:
5. Finished Installation:
6. Launching Jupyter:
Use the following command to launch Jupyter using command-line:
>>jupyter notebook
Explore the following features of python packages:
1. NumPy:
NumPy stands for Numerical Python. It is an open-source library for the Python programming
language, used for scientific computing and working with arrays. The source code for NumPy is
located at this GitHub repository: https://github.com/numpy/numpy.
Features:
1. High-performance N-dimensional array object.
2. It contains tools for integrating code from C/C++ and Fortran.
3. It contains a multidimensional container for generic data.
4. Additional linear algebra, Fourier transform, and random number capabilities.
5. It consists of broadcasting functions.
6. It has data type definition capability to work with varied databases.
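The short snippet below is only an illustrative sketch (not one of the prescribed experiments) showing a few of these features: the N-dimensional array object, explicit dtype control, broadcasting, and the built-in linear algebra, Fourier transform and random number capabilities.
import numpy as np

# N-dimensional array object with an explicit data type
a = np.array([[1, 2, 3], [4, 5, 6]], dtype='float32')
print(a.ndim, a.shape, a.dtype)          # 2 (2, 3) float32

# Broadcasting: the 1-D row is stretched across both rows of a
row = np.array([10, 20, 30])
print(a + row)

# Linear algebra, Fourier transform and random number capabilities
print(np.linalg.norm(a))                 # Frobenius norm of a
print(np.fft.fft([1, 0, 0, 0]))          # a simple FFT
print(np.random.default_rng(0).integers(0, 10, size=3))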
2. SciPy:
SciPy stands for Scientific Python. SciPy is a scientific computation library that uses NumPy
underneath. The source code for SciPy is located at this github repository
https://github.com/scipy/scipy
Features:
1. SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems,
algebraic equations, differential equations, statistics and many other classes of problems.
2. It provides more utility functions for optimization, stats and signal processing.
Numpy vs. SciPy
NumPy and SciPy are both used for mathematical and numerical analysis. NumPy is suitable for
basic operations such as sorting, indexing and similar array manipulations, because it provides the core
array data structure, whereas SciPy supplies the higher-level numerical routines that operate on that data.
NumPy contains basic functions for linear algebra, Fourier transforms, etc., whereas the SciPy library
contains full-featured versions of the linear algebra modules as well as many other numerical algorithms.
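As an illustrative sketch (an assumed example, not part of the syllabus), the following shows SciPy building on NumPy for optimization, integration and interpolation:
import numpy as np
from scipy import optimize, integrate, interpolate

# Optimization: minimize a simple quadratic, minimum is at x = 3
res = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(res.x)                     # approximately 3.0

# Integration: integrate sin(x) from 0 to pi (exact value is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)

# Interpolation: linear interpolator built from sample points
f = interpolate.interp1d([0, 1, 2], [0, 2, 4])
print(f(1.5))                    # 3.0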
3. Pandas:
Python Pandas is defined as an open-source library that provides high-performance data
manipulation in Python. The name Pandas is derived from "Panel Data", an econometrics term for
multidimensional data. It is used for data analysis in Python. Pandas is built on top of the NumPy
package, which means NumPy is required for operating Pandas.
Features:
1. Group by data for aggregations and transformations.
2. It has a fast and efficient DataFrame object with the default and customized indexing.
3. Used for reshaping and pivoting of the data sets.
4. It is used for data alignment and integration of the missing data.
5. Provide the functionality of Time Series.
6. Process a variety of data sets in different formats like matrix data, tabular heterogeneous data and time
series.
7. Handle multiple operations of the data sets such as subsetting, slicing, filtering, groupBy, re-
ordering, and re-shaping.
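An illustrative sketch (an assumed example) touching several of the listed features: customized indexing, missing-data handling, group by and pivoting.
import pandas as pd
import numpy as np

# DataFrame object with customized indexing
df = pd.DataFrame({'dept': ['CSE', 'CSE', 'ECE'],
                   'marks': [85, np.nan, 72]},
                  index=['s1', 's2', 's3'])

# Handling of missing data
print(df['marks'].fillna(df['marks'].mean()))

# Group by for aggregation
print(df.groupby('dept')['marks'].mean())

# Reshaping / pivoting
print(df.pivot_table(values='marks', index='dept', aggfunc='count'))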
4. Statsmodels:
statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests, and statistical data
exploration. The package is released under the open source Modified BSD (3-clause) license. The
online documentation is hosted at statsmodels.org.
Features:
1. Linear regression models like Ordinary least squares, Generalized least squares,
Weighted least squares, Least squares with autoregressive errors.
2. Bayesian Mixed GLM for Binomial and Poisson
3. GEE: Generalized Estimating Equations for one-way clustered or longitudinal data
4. Nonparametric statistics: Univariate and multivariate kernel density estimators
5. Datasets: Datasets used for examples and in testing
6. Sandbox: statsmodels contains a sandbox folder with code in various
stages of development and testing.
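A minimal illustrative OLS sketch (an assumed example, not taken from the experiments) showing the statsmodels estimation workflow that is used later in Ex.No 5.c:
import numpy as np
import statsmodels.api as sm

# Synthetic data: y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(scale=0.5, size=100)

# Ordinary least squares: add a constant column, fit and summarize
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.params)              # intercept and slope, close to [1, 2]
print(model.summary())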
RESULT:
Thus, the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were successfully
downloaded and installed, and their features were explored.
Ex.No 2
Working with Numpy arrays
Date:
AIM:
To write a Numpy arrays program to demonstrate basic array concepts in Jupyter Notebook.
PROGRAM:
1. Creating Arrays from Python Lists:
In[1]: import numpy as np
In[2]: # integer array:
np.array([1, 4, 2, 5, 3])
Out[2]: array([1, 4, 2, 5, 3])
#NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will
upcast ifpossible (here, integers are upcast to floating point):
In[3]: np.array([3.14, 4, 2, 3])
Out[3]: array([ 3.14, 4. , 2. , 3. ])
#If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:
In[4]: np.array([1, 2, 3, 4], dtype='float32')
Out[4]: array([ 1., 2., 3., 4.], dtype=float32)
In[5]: # nested lists result in multidimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
Out[5]: array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
2. NumPy Array Attributes:
In[1]: import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and
size (thetotal size of the array):
In[2]: print("x3 ndim: ",
x3.ndim) print("x3 shape:",
x3.shape)print("x3 size: ",
x3.size)
Out[2]:x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
In[3]: print("dtype:", x3.dtype)# data type of the array
Out[3]:dtype: int64
In[4]: print("itemsize:", x3.itemsize,
"bytes")print("nbytes:", x3.nbytes,
"bytes")
Out[4]:itemsize: 8 bytes
Out[4]:nbytes: 480 bytes
3. Array Indexing: Accessing Single Elements:
In[5]: x1
Out[5]: array([5, 0, 3, 3, 7, 9])
In[6]: x1[0]
Out[6]: 5
In[7]: x1[4]
Out[7]: 7
#To index from the end of the array, you can use negative indices
In[8]: x1[-1]
Out[8]: 9
In[9]: x1[-2]
Out[9]: 7
#In a multidimensional array, you access items using a comma-separated tuple of indices
In[10]: x2
Out[10]: array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
In[11]: x2[0, 0]
Out[11]: 3
In[12]: x2[2, 0]
Out[12]: 1
In[13]: x2[2, -1]
Out[13]: 7
#modify values using any of the above index notation
In[14]: x2[0, 0] = 12
x2
Out[14]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[15]: x1[0] = 3.14159 # this will be truncated!
x1
Out[15]: array([3, 0, 3, 3, 7, 9])
4. Array Slicing: Accessing Subarrays
#One-dimensional subarrays
In[16]: x = np.arange(10)
x
Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In[17]: x[:5] # first five elements
Out[17]: array([0, 1, 2, 3, 4])
In[18]: x[5:] # elements after index 5
Out[18]: array([5, 6, 7, 8, 9])
In[19]: x[4:7] # middle subarray
Out[19]: array([4, 5, 6])
In[20]: x[::2] # every other element
Out[20]: array([0, 2, 4, 6, 8])
In[21]: x[1::2] # every other element, starting at index 1
Out[21]: array([1, 3, 5, 7, 9])
In[22]: x[::-1] # all elements, reversed
Out[22]: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In[23]: x[5::-2] # reversed every other from index 5
Out[23]: array([5, 3, 1])
5. Multidimensional subarrays:
In[24]: x2
Out[24]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[25]: x2[:2, :3] # two rows, three columns
Out[25]: array([[12, 5, 2],
[ 7, 6, 8]])
In[26]: x2[:3, ::2] # all rows, every other column
Out[26]: array([[12, 2],
[ 7, 8],
[ 1, 7]])
#Finally, subarray dimensions can even be reversed together:
In[27]: x2[::-1, ::-1]
Out[27]: array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])
6. Accessing array rows and columns:
In[28]: print(x2[:, 0]) # first column of x2
[12 7 1]
In[29]: print(x2[0, :]) # first row of x2
[12 5 2 4]
#In the case of row access, the empty slice can be omitted for a more compact syntax:
In[30]: print(x2[0]) # equivalent to x2[0, :]
[12 5 2 4]
In[31]: print(x2)
[[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
#extract a 2×2 subarray from this:
In[32]: x2_sub = x2[:2, :2]
print(x2_sub)
[[12 5]
[ 7 6]]
#modify this subarray
In[33]: x2_sub[0, 0] = 99
print(x2_sub)
[[99 5]
[ 7 6]]
In[34]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
7. Creating copies of arrays:
In[35]: x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99 5]
[ 7 6]]
#modify this subarray
In[36]: x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42 5]
[ 7 6]]
In[37]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Reshaping of Arrays:
In[38]: grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]
In[39]: x = np.array([1, 2, 3])
# row vector via reshape
x.reshape((1, 3))
Out[39]: array([[1, 2, 3]])
In[40]: # row vector via newaxis
x[np.newaxis, :]
Out[40]: array([[1, 2, 3]])
In[41]: # column vector via reshape
x.reshape((3, 1))
Out[41]: array([[1],
[2],
[3]])
In[42]: # column vector via newaxis
x[:, np.newaxis]
Out[42]: array([[1],
[2],
[3]])
8. Array Concatenation and Splitting:
In[43]: x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Out[43]: array([1, 2, 3, 3, 2, 1])
#concatenate more than two arrays at once:
In[44]: z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
#np.concatenate can also be used for two-dimensional arrays:
In[45]: grid = np.array([[1, 2, 3],
[4, 5, 6]])
In[46]: # concatenate along the first axis
np.concatenate([grid, grid])
Out[46]: array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[4, 5, 6]])
In[47]: # concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
Out[47]: array([[1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6]])
In[48]: x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
# vertically stack the arrays
np.vstack([x, grid])
Out[48]: array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
In[49]: # horizontally stack the arrays
y = np.array([[99], [99]])
np.hstack([grid, y])
Out[49]: array([[ 9, 8, 7, 99],
[ 6, 5, 4, 99]])
Splitting of arrays:
In[50]: x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
In[51]: grid = np.arange(16).reshape((4, 4))
grid
Out[51]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In[52]: upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
In[53]: left, right = np.hsplit(grid, [2])
print(left)
print(right)
Out[53]: [[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
9. Exploring NumPy’s UFuncs:
In[54]: x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) # floor division
Out[54]: x = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [ 0. 0.5 1. 1.5]
x // 2 = [0 0 1 1]
In[8]: print("-x = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2 = ", x % 2)
Out[8]: -x = [ 0 -1 -2 -3]
x ** 2 = [0 1 4 9]
x % 2 = [0 1 0 1]
In[9]: -(0.5*x + 1) ** 2
Out[9]: array([-1. , -2.25, -4. , -6.25])
In[10]: np.add(x, 2)
Out[10]: array([2, 3, 4, 5])
10. Absolute value:
In[11]: x = np.array([-2, -1, 0, 1, 2])
abs(x)
Out[11]: array([2, 1, 0, 1, 2])
In[12]: np.absolute(x)
Out[12]: array([2, 1, 0, 1, 2])
In[13]: np.abs(x)
Out[13]: array([2, 1, 0, 1, 2])
In[14]: x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])
np.abs(x)
Out[14]: array([ 5., 5., 2., 1.])
11. Trigonometric functions:
In[15]: theta = np.linspace(0, np.pi, 3)
In[16]: print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
Out[16]:theta = [ 0. 1.57079633 3.14159265]
sin(theta) = [ 0.00000000e+00 1.00000000e+00 1.22464680e-16]
cos(theta) = [ 1.00000000e+00 6.12323400e-17 -1.00000000e+00]
tan(theta) = [ 0.00000000e+00 1.63312394e+16 -1.22464680e-16]
In[17]: x = [-1, 0, 1]
print("x = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
Out[17]:x = [-1, 0, 1]
arcsin(x) = [-1.57079633 0. 1.57079633]
arccos(x) = [ 3.14159265 1.57079633 0. ]
arctan(x) = [-0.78539816 0. 0.78539816]
12. Exponents and logarithms:
In[18]: x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
Out[18]:x = [1, 2, 3]
e^x = [ 2.71828183 7.3890561 20.08553692]
2^x = [ 2. 4. 8.]
3^x = [ 3 9 27]
In[19]: x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
Out[19]:x = [1, 2, 4, 10]
ln(x) = [ 0. 0.69314718 1.38629436 2.30258509]
log2(x) = [ 0. 1. 2. 3.32192809]
log10(x) = [ 0. 0.30103 0.60205999 1. ]
In[20]: x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))
Out[20]:exp(x) - 1 = [ 0. 0.0010005 0.01005017 0.10517092]
log(1 + x) = [ 0. 0.0009995 0.00995033 0.09531018]
13. Specialized ufuncs:
In[21]: from scipy import special
In[22]: # Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2) =", special.beta(x, 2))
Out[22]: gamma(x) = [ 1.00000000e+00 2.40000000e+01 3.62880000e+05]
ln|gamma(x)| = [ 0. 3.17805383 12.80182748]
beta(x, 2) = [ 0.5 0.03333333 0.00909091]
In[23]: # Error function (integral of Gaussian) its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x) =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))
Out[23]:erf(x) = [ 0. 0.32862676 0.67780119 0.84270079]
erfc(x) = [ 1. 0.67137324 0.32219881 0.15729921]
erfinv(x) = [ 0. 0.27246271 0.73286908 inf]
14. Aggregates:
In[26]: x = np.arange(1, 6)
np.add.reduce(x)
Out[26]: 15
In[27]: np.multiply.reduce(x)
Out[27]: 120
In[28]: np.add.accumulate(x)
Out[28]: array([ 1, 3, 6, 10, 15])
In[29]: np.multiply.accumulate(x)
Out[29]: array([ 1, 2, 6, 24, 120])
15. Outer products:
In[30]: x = np.arange(1, 6)
np.multiply.outer(x, x)
Out[30]: array([[ 1, 2, 3, 4, 5],
[ 2, 4, 6, 8, 10],
[ 3, 6, 9, 12, 15],
[ 4, 8, 12, 16, 20],
[ 5, 10, 15, 20, 25]])
RESULT:
Thus, the Numpy array program was successfully executed and verified.
Ex.No 3
Working with Pandas DataFrames
Date:
AIM:
To write a Pandas program that creates a DataFrame from a sample dictionary and performs various operations on it.
Sample DataFrame:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
PROGRAM:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print(df)
print("Summary of the basic information about this DataFrame and its data:")
print(df.info())
Sample Output:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Summary of the basic information about this DataFrame and its data:
<class 'pandas.core.frame.DataFrame'>Index:
10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype
0 name 10 non-null object
1 score 8 non-null float64
2 attempts 10 non-null int64
3 qualify 10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
None
i. To get the first 3 rows of a given DataFrame.
print("First three rows of the data frame:")
print(df.iloc[:3])
Sample Output:
First three rows of the data frame:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
ii. To select the 'name' and 'score' columns from the following DataFrame.
print("Select specific columns:")
print(df[['name', 'score']])
Sample Output:
Select specific columns:
name score
a Anastasia 12.5
b Dima 9.0
c Katherine 16.5
d James NaN
e Emily 9.0
f Michael 20.0
g Matthew 14.5
h Laura NaN
i Kevin 8.0
j Jonas 19.0
iii. To select the specified columns and rows from a given DataFrame. Select 'name' and 'score'
columns in rows 1, 3, 5, 6 from the following data frame.
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes
iv. To select the rows where the number of attempts in the examination is greater than 2.
print("Number of attempts in the examination is greater than 2:")
print(df[df['attempts'] > 2])
Sample Output:
Number of attempts in the examination is greater than 2:
name score attempts qualify
b Dima 9.0 3 no
d James NaN 3 no
f Michael 20.0 3 yes
v. To select the rows where the score is missing, i.e. is NaN.
print("Rows where score is missing:")
print(df[df['score'].isnull()])
Sample Output:
Rows where score is missing: attempts
name qualify score
d 3 James no NaN
h 1 Laura no NaN
vi. To change the score in row 'd' to 11.5.
print("nOriginal data frame:")print(df)
print("nChange the score in row 'd' to 11.5:")df.loc['d',
'score'] = 11.5
print(df)
Sample Output:
Original data frame:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Change the score in row 'd' to 11.5:
Attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no 11.5
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
vii. To calculate the sum of the examination score by the students.
print("nSum of the examination attempts by the students:")
print(df['score'].sum())
Sample Output:
Sum of the examination attempts by the students:
108.5
viii. To append a new row 'k' to DataFrame with given values for each column. Now delete thenew
row and return the original data frame.
print("Original rows:")print(df)
print("nAppend a new row:") df.loc['k'] =
[1, 'Suresh', 'yes', 15.5]
print("Print all records after insert a new record:")print(df)
print("nDelete the new row and display the original rows:")df =
df.drop('k')
print(df)
Sample Output:
Original rows:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Append a new row:
Print all records after insert a new record:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
k 1 Suresh yes 15.5
Delete the new row and display the original rows:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
ix. To delete the 'attempts' column from the DataFrame.
print("Original rows:")print(df)
print("nDelete the 'attempts' column from the data frame:")
df.pop('attempts')
print(df)
Sample Output:
Original rows:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Delete the 'attempts' column from the data frame:
name qualify score
a Anastasia yes 12.5
b Dima no 9.0
c Katherine yes 16.5
d James no NaN
e Emily no 9.0
f Michael yes 20.0
g Matthew yes 14.5
h Laura no NaN
i Kevin no 8.0
j Jonas yes 19.0
x. To insert a new column in existing DataFrame.
print("Original rows:")print(df)
color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']df['color'] =
color
32
print("nNew DataFrame after inserting the 'color' column")print(df)
Sample Output
Original rows:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
New DataFrame after inserting the 'color' column
Attempts name qualify score color
a 1 Anastasia yes 12.5 Red
b 3 Dima no 9.0 Blue
c 2 Katherine yes 16.5 Orange
d 3 James no NaN Red
e 2 Emily no 9.0 White
f 3 Michael yes 20.0 White
g 1 Matthew yes 14.5 Blue
h 1 Laura no NaN Green
i 2 Kevin no 8.0 Green
j 1 Jonas yes 19.0 Red
RESULT:
Thus, the working of a Pandas DataFrame created from a dictionary was executed and verified successfully.
Ex.No:4
Descriptive analytics on the Iris data set
Date:
AIM:
To read data from text files, Excel and the web, and to explore various commands for doing descriptive
analytics on the Iris data set.
PROCEDURE:
Download the Iris.csv file from https://www.kaggle.com/datasets/uciml/iris and use the Pandas library to
load this CSV file and convert it into a dataframe. The read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
df = pd.read_csv("Music/Iris.csv")# Reading the CSV file
print(df)
print(df.dtypes)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm 
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
[150 rows x 6 columns]
Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object
# Printing top 5 rows
print(df.head())
# Use the shape attribute to get the shape of the dataset.
print(df.shape)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
(150, 6)
#To know the columns and their data types use the info() method.
df.info()
<class
'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to
149
35
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
Species
150 non-null object
dtypes: float64(4),
int64(1),
object(1)
memory usage: 7.2+ KB
print(df.describe())
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000
# Missing values can occur when no information is provided
print(df.isnull().sum())
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
# To check whether the dataset contains any duplicates
data = df.drop_duplicates(subset="Species")
print(data)
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species
0      1            5.1           3.5            1.4           0.2      Iris-setosa
50    51            7.0           3.2            4.7           1.4  Iris-versicolor
100  101            6.3           3.3            6.0           2.5   Iris-virginica
#To find unique species from the given dataset
print(df.value_counts("Species"))
Species
Iris-setosa        50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
#matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#seaborn
import seaborn as sns
#plot the variable 'SepalWidthCm'
plt.scatter(df.index, df['SepalWidthCm'])
plt.show()
#visualize the same plot by considering its variety using the sns.scatterplot() function of the
seaborn library.
sns.scatterplot(x=df.index,y=df['SepalWidthCm'],hue=df['Species'])
#visualizes data by connecting the data points via line segments.
plt.figure(figsize=(6,6))
plt.title("line plot for petal length")
plt.xlabel('index',fontsize=20)
plt.ylabel('PetalLengthCm',fontsize=20)
plt.plot(df.index, df['PetalLengthCm'], markevery=1, marker='d')
for name, group in df.groupby('Species'):
    plt.plot(group.index, group['PetalLengthCm'], label=name, markevery=1, marker='d')
plt.legend()
plt.show()
#Plotting histogram using the matplotlib plt.hist() function :
plt.hist(df["PetalWidthCm"])
sns.distplot(df["PetalWidthCm"],kde=False,color='RED',bins=10)
<AxesSubplot:xlabel='PetalWidthCm'>
RESULT:
Thus, the descriptive analysis on the iris data set was successfully executed and practically
verified.
Ex.No:5.a
Univariate analysis using the UCI diabetes data set
Date:
AIM:
To read data from CSV files and explore various commands for doing univariate analysis using the
UCI Pima Indians Diabetes data set.
PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
and use the Pandas library to load this CSV file and convert it into a dataframe.
The read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
df = pd.read_csv("E:DATA
SCIENCEPima_indian_diabetesdiabetes.CSV")print(df)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI 
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
..           ...      ...            ...            ...      ...   ...
763 10 101 76 48 180 32.9
764 2 122 70 27 0 36.8
765 5 121 72 23 112 26.2
766 1 126 60 0 0 30.1
767 1 93 70 31 0 30.4
DiabetesPedigreeFunction Age Outcome
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
.. ... ... ...
763 0.171 63 0
764 0.340 27 0
765 0.245 30 0
766 0.349 47 1
767 0.315 23 0
[768 rows x 9 columns]
# To know data type
print(df.dtypes)
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome int64
dtype: object
#To print first 5 rows
print(df.head())
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI 
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
DiabetesPedigreeFunction Age Outcome
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
# Use the shape attribute to get the shape of the dataset.
print(df.shape)
(768, 9)
#calculate mean
print("Mean of Preganancies: %f"
%df['Pregnancies'].mean()) print("Mean of BloodPressure:
%f" %df['BloodPressure'].mean())print("Mean of Glucose:
%f" %df['Glucose'].mean()) print("Mean of Age: %f"
%df['Age'].mean())
Sample Output:
Mean of Pregnancies: 3.845052
Mean of BloodPressure: 69.105469
Mean of Glucose: 120.894531
Mean of Age: 33.240885
#calculate median
print("median of Preganancies: %f"
%df['Pregnancies'].median()) print("median of BloodPressure:
%f" %df['BloodPressure'].median())print("medianf Glucose:
%f" %df['Glucose'].median())
print("median of Age: %f" %df['Age'].median())
Sample Output:
median of Pregnancies: 3.000000
median of BloodPressure: 72.000000
median of Glucose: 117.000000
median of Age: 29.000000
#calculate standard deviation
print("standard deviation for BloodPressure: %f" % df['BloodPressure'].std())
print("standard deviation for Glucose: %f" % df['Glucose'].std())
print("standard deviation for Pregnancies: %f" % df['Pregnancies'].std())
Sample Output:
standard deviation for BloodPressure: 19.355807
standard deviation for Glucose: 31.972618
standard deviation for Pregnancies: 3.369578
#To describe the data
df.Glucose.describe()
Sample Output:
count 768.000000
mean 120.894531
std 31.972618
min 0.000000
25% 99.000000
50% 117.000000
75% 140.250000
max 199.000000
Name: Glucose, dtype: float64
#create frequency table
df['Glucose'].value_counts()
Sample Output:
99 17
100 17
111 14
129 14
125 14
..
191 1
177 1
44 1
62 1
190 1
Name: Glucose, Length: 136, dtype: int64
#create frequency table
df['Glucose'].value_counts()
Sample Output:
99 17
100 17
111 14
129 14
125 14
..
191 1
177 1
44 1
62 1
190 1
Name: Glucose, Length: 136, dtype: int64
#skewness and kurtosis
print("Skewness: %f" %
df['Pregnancies'].skew())print("Kurtosis:
%f" % df['Pregnancies'].kurt()) Sample
Output:
Skewness: 0.901674
Kurtosis: 0.159220
#find frequency of each outcome
pd.crosstab(index=df['Outcome'], columns='count')
Sample Output:
col_0 count
Outcome
0 500
1 268
#create frequency table for 'Pregnancies'
df['Pregnancies'].value_counts()
Sample Output:
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64
#find frequency of each Pregnancies value
pd.crosstab(index=df['Pregnancies'], columns='count')
Sample Output:
col_0 count
Pregnancies
0 111
1 135
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
12 9
13 10
14 2
15 1
17 1
import matplotlib.pyplot as plt
df.hist(column='BloodPressure', grid=False,
edgecolor='black')
Sample Output:
array([[<AxesSubplot:title={'center':'BloodPressure'}>]], dtype=object)
#to create a density curve
import seaborn as sns
sns.kdeplot(df['BloodPressure'])
<AxesSubplot:xlabel='BloodPressure', ylabel='Density'>
#visualize the same plot by considering its variety using the sns.scatterplot() function of the seaborn library.
sns.scatterplot(x=df.index, y=df['Age'], hue=df['Outcome'])
import numpy as np
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100, dtype=int)
preg = pd.DataFrame({'month':preg_month, 'count_of_preg_prop':preg_proportion, 'percentage_proportion':preg_proportion_perc})
preg.set_index(['month'], inplace=True)
preg.head(10)
Sample Output:
month count_of_preg_prop percentage_proportion
1 135 17
0 111 14
2 103 13
3 75 9
4 68 8
5 57 7
6 50 6
7 45 5
8 38 4
9 28 3
import warnings
warnings.filterwarnings("igno
re")
fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.countplot('Pregnancies',data=df,ax=axes[0][0],color='gre
en') axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.countplot('Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.', fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution', fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()
plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1], label='Non-Diab.')
plot11_2 = df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1], label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6')  # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6')  # for legend title
plt.tight_layout()
plot20 = sns.boxplot(df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()
plot21 = sns.boxplot(x='Outcome', y='Pregnancies', data=df, ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()
Sample Output:
RESULT:
Thus, the Univariate analysis using the UCI diabetes data set was successfully executed and
practically verified.
Ex.No:5.b
Bivariate analysis using the UCI diabetes data set
Date:
AIM:
To read data from text files, Excel and the web, and to explore various commands for doing bivariate
analysis using the UCI diabetes data set.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/ and use the Pandas
library to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
Linear regression modelling
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-50]
diabetes_X_test = diabetes_X[-50:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-50]
diabetes_y_test = diabetes_y[-50:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print("Coefficients: \n", regr.coef_)
Sample output:
Coefficients: [945.4992184]
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
Sample output:
Mean squared error: 3471.92
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
Sample output:
Coefficient of determination: 0.41
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Logistic regression modelling
#Import Sklearn Packages
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#to create plot_bar,histogram,boxplot etc
import seaborn as sns
import matplotlib.pyplot as plt
#calculate accurancy measure and confusion matrix
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")
#Loading Data
diabetes=pd.read_csv("E:DATA SCIENCEPima_indian_diabetesdiabetes.csv")
diabetes
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

[768 rows x 9 columns]
#Train/Test split
X=df.drop("Outcome",axis=1)
Y=df[["Outcome"]]
# target variable
# split data into training and validation datasets
X_train, X_test, y_train, y_t est = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.linear_model import LogisticRegression
# instantiate the model
model = LogisticRegression()
# fitting the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred[0:5]
# metrics
print("Accuracy for test set is {}.".format(round(metrics.accuracy_score(y_test, y_pred), 4)))
print("Precision for test set is {}.".format(round(metrics.precision_score(y_test, y_pred), 4)))
print("Recall for test set is {}.".format(round(metrics.recall_score(y_test, y_pred), 4)))
Sample Output:
Accuracy for test set is 0.7917.
Precision for test set is 0.7115.
Recall for test set is 0.5968.
print(metrics.classification_report(y_test, y_pred))
Sample Output:
precision recall f1-score support
0 0.82 0.88 0.85 130
1 0.71 0.60 0.65 62
accuracy 0.79 192
macro avg 0.77 0.74 0.75 192
weighted avg 0.79 0.79 0.79 192
#Visualization
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(diabetes.corr(), cmap="GnBu", annot=True, linewidths=0.5, fmt='.1f', ax=ax)
plt.show()
Result:
Thus, the Bivariate analysis using the UCI diabetes data set was successfully executed and
practically verified.
Ex.No:5.c
Multiple Regression analysis using the UCI diabetes data set
Date:
AIM:
To read data from Excel/CSV files and explore various commands for doing multiple regression analysis
using the UCI diabetes data set.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/ and use the Pandas
library to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
#import our Libraries
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
#Loading Data
diabetes=pd.read_csv("E:DATA SCIENCEPima_indian_diabetesdiabetes.csv")
diabetes
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

[768 rows x 9 columns]
# calculate the correlation matrix
corr=diabetes.corr()
# display the correlation matrix
display(corr)
                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age   Outcome
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683                 -0.033523  0.544341  0.221898
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071                  0.137337  0.263514  0.466581
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805                  0.041265  0.239528  0.065068
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573                  0.183928 -0.113970  0.074752
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859                  0.185071 -0.042163  0.130548
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000                  0.140647  0.036242  0.292695
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647                  1.000000  0.033561  0.173844
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242                  0.033561  1.000000  0.238356
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695                  0.173844  0.238356  1.000000
# plot the correlation heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu')
<AxesSubplot:>
#Train/Test split
X=diabetes.drop("Outcome",axis=1)
Y=diabetes[["Outcome"]]
# target variable
# split data into training and validation datasets
# Split X and y into X_train, X_test, Y_train, Y_test
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, Y_train)
LinearRegression()
# let's grab the coefficient of our model and the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {:.4}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {:.2}".format(coef[0],coef[1]))
Sample output:
The intercept for our model is -0.879
The Coefficient for Pregnancies is 0.015
The Coefficient for Glucose is 0.0057
The Coefficient for BloodPressure is -0.0021
The Coefficient for SkinThickness is 0.001
The Coefficient for Insulin is -0.00017
The Coefficient for BMI is 0.013
The Coefficient for DiabetesPedigreeFunction is 0.14
The Coefficient for Age is 0.0038
# Get multiple predictions
y_predict = regression_model.predict(X_test)
# Show the first 5 predictions
y_predict[:5]
array([[1.01391226],
[0.21532924],
[0.09157383],
[0.60583158],
[0.15988782]])
# define our intput
X2=sm.add_constant(X)
# create a OLS model
model=sm.OLS(Y, X2)
# fit the data
est = model.fit()
# print out a summary
print(est.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Outcome R-squared: 0.303
Model: OLS Adj. R-squared: 0.296
Method: Least Squares F-statistic: 41.29
Date: Sat, 15 Oct 2022 Prob (F-statistic): 7.36e-55
Time: 19:14:26 Log-Likelihood: -381.91
No. Observations: 768 AIC: 781.8
Df Residuals: 759 BIC: 823.6
Df Model: 8
Covariance Type: nonrobust
============================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                      -0.8539      0.085     -9.989      0.000      -1.022      -0.686
Pregnancies                 0.0206      0.005      4.014      0.000       0.011       0.031
Glucose                     0.0059      0.001     11.493      0.000       0.005       0.007
BloodPressure              -0.0023      0.001     -2.873      0.004      -0.004      -0.001
SkinThickness               0.0002      0.001      0.139      0.890      -0.002       0.002
Insulin                    -0.0002      0.000     -1.205      0.229      -0.000       0.000
BMI                         0.0132      0.002      6.344      0.000       0.009       0.017
DiabetesPedigreeFunction    0.1472      0.045      3.268      0.001       0.059       0.236
Age                         0.0026      0.002      1.693      0.091      -0.000       0.006
==============================================================================
Omnibus:                       41.539   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.183
Skew:                           0.395   Prob(JB):                     1.69e-07
Kurtosis:                       2.408   Cond. No.                     1.10e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# make some confidence intervals, 95% by default
est.conf_int()
0 1
const -1.021709 -0.686079
Pregnancies 0.010521 0.030663
Glucose 0.004909 0.006932
BloodPressure -0.003925 -0.000739
SkinThickness -0.002029 0.002338
Insulin -0.000475 0.000114
BMI 0.009146 0.017343
DiabetesPedigreeFunction 0.058792 0.235682
Age -0.000419 0.005662
# estimate the p-values
est.pvalues
Sample output:
const 3.707465e-22
Pregnancies 6.561462e-05
Glucose 2.691192e-28
BloodPressure 4.178788e-03
SkinThickness 8.895424e-01
Insulin 2.285711e-01
BMI 3.853484e-10
DiabetesPedigreeFunction    1.131733e-03
Age                         9.092163e-02
dtype: float64
import math
# calculate the mean squared error
model_mse = mean_squared_error(Y_test, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(Y_test, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
MSE 0.148
MAE 0.322
RMSE 0.384
model_r2 = r2_score(Y_test, y_predict)
print("R2: {:.2}".format(model_r2))
R2: 0.32
import pickle
# pickle the model
with open('my_mulitlinear_regression.sav','wb') as f:
pickle.dump(regression_model, f)
# load it back in
with open('my_mulitlinear_regression.sav', 'rb') as pickle_file:
regression_model_2 = pickle.load(pickle_file)
# make a new prediction
regression_model_2.predict([X_test.loc[150]])
array([[0.42308994]])
Result:
Thus, the multiple regression analysis using the UCI diabetes data set was successfully
executed and practically verified.
Ex.No:6
Apply and explore various plotting functions on UCI data sets
Date:
AIM:
To read data from Excel/CSV files and to apply and explore various plotting functions on UCI
data sets.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/ and use the Pandas
library to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes =True)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#To run numerical descriptive stats for the data set
diabetes.describe()
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
sns.kdeplot(diabetes["Pregnancies"], color = "green",shade =
True)plt.show()
plt.figure()
plt.figure(figsize=(6,6))
sns.kdeplot(diabetes["Glucose"], color = "green",shade = True)
plt.show()
plt.figure()
plt.figure(figsize=(8,8))
sns.kdeplot(diabetes["Age"], diabetes["BloodPressure"],cmap="RdYlBu", shade =
True)plt.show()
plt.figure()
plt.figure(figsize=(6,6))
sns.kdeplot(x=diabetes.Age, y=diabetes.Glucose, cmap="PRGn", shade=True, bw_adjust=1)
plt.show()
# calculate the correlation matrix
corr = diabetes.corr()
# display the correlation matrix
display(corr)
                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age   Outcome
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683                 -0.033523  0.544341  0.221898
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071                  0.137337  0.263514  0.466581
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805                  0.041265  0.239528  0.065068
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573                  0.183928 -0.113970  0.074752
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859                  0.185071 -0.042163  0.130548
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000                  0.140647  0.036242  0.292695
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647                  1.000000  0.033561  0.173844
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242                  0.033561  1.000000  0.238356
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695                  0.173844  0.238356  1.000000
import seaborn as sns
sns.scatterplot(x="Pregnancies", y="Glucose", data=corr);
sns.lmplot(x="Pregnancies", y="Glucose", hue="Outcome", data=corr);
# Histogram+Density Plot
sns.distplot(diabetes["Age"], color="green")
plt.show()
plt.figure()
# Adding Two Plots In One
sns.kdeplot(diabetes[diabetes.Outcome == 0]['Age'],
color = "blue")
sns.kdeplot(diabetes[diabetes.Outcome == 1]['Age'],
color = "orange", shade = True)
plt.show()
dia1 = diabetes[diabetes.Outcome==1]
dia0 = diabetes[diabetes.Outcome==0]
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(diabetes.Glucose, kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Glucose,kde=False,color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label = "Gloc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=diabetes.Outcome,y=diabetes.Glucose)
plt.title("Boxplot for Glucose by Outcome")
Text(0.5, 1.0, 'Boxplot for Glucose by Outcome')
Three dimensional plotting:
import numpy as np
# linear algebra
import pandas as pd
# data processing, CSV file I/O (e.g. pd.read_csv)
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt
import matplotlib
import functools
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#Loading Data
diabetes=pd.read_csv("E:DATA SCIENCEPima_indian_diabetesdiabetes.csv")
diabetes
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

[768 rows x 9 columns]
x=diabetes.Age[:20]
y=diabetes.Glucose[:20]
def f(x, y):
return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z');
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,cmap='viridis', edgecolor='none')
ax.set_title('surface');
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z')
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.scatter(X,Y,Z, cmap='viridis', linewidth=0.5);
ax.set_title('scatter');
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z')
Result:
Thus, the Three dimensional plotting using the UCI diabetes data set was successfully executed and
practically verified.
Ex.No:7
Visualizing Geographic Data with Basemap
Date:
AIM:
To read geographic data from a CSV file and visualize it on a map of India using the Basemap
toolkit.
PROCEDURE:
Download the csv file from the https://www.kaggle.com/ and use the Pandas library to load this CSV file, and
convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
import numpy as np
from numpy import array
import matplotlib as mpl
# for plots
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap import Basemap
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import warnings
warnings.filterwarnings("ignore")
cities = pd.read_csv(r"C:\Users\Admin\Downloads\datasets_557_1096_cities_r2.csv")
cities.head()
fig = plt.figure(figsize=(10,8))
states = cities.groupby('state_name')['name_of_city'].count().sort_values(ascending=True)
states.plot(kind="barh", fontsize = 20)
plt.grid(b=True, which='both', color='Black', linestyle='-')
plt.xlabel('No of cities taken for analysis', fontsize=20)
plt.show()
fig = plt.figure(figsize=(8,8))
ax=fig.add_subplot(111)
map=Basemap(llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,projection="lcc",lat_0=28,lon_0=77)
#map.bluemarble()
#map.fillcontinents(color="red")
map.drawmapboundary(color="red")
map.drawcountries(color="brown")
map.drawcoastlines(color="blue")
#draw state from shapefile
map.readshapefile(r"C:\Users\Admin\Music\India_State_Shapefile\India_State_Boundary", "India_State_Boundary")
cities['latitude'] = cities['location'].apply(lambda x: x.split(',')[0])
cities['longitude'] = cities['location'].apply(lambda x: x.split(',')[1])
print("The Top 10 Cities sorted according to the Total Population (Descending Order)")
top_pop_cities = cities.sort_values(by='population_total',ascending=False)
top10_pop_cities=top_pop_cities.head(10)
#plt.subplots(figsize=(20, 15))
lg=array(top10_pop_cities['longitude'])
lt=array(top10_pop_cities['latitude'])
pt=array(top10_pop_cities['population_total'])
nc=array(top10_pop_cities['name_of_city'])
x, y = map(lg, lt)
population_sizes = top10_pop_cities["population_total"].apply(lambda x: int(x /5000))
plt.scatter(x, y, s=population_sizes, marker="o", c=population_sizes, cmap=cm.Dark2, alpha=0.7)
for ncs, xpt, ypt in zip(nc, x, y):
    plt.text(xpt+60000, ypt+30000, ncs, fontsize=10, fontweight='bold')
plt.title('Top 10 Populated Cities in India',fontsize=20)
The Top 10 Cities sorted according to the Total Population (Descending Order)
Text(0.5, 1.0, 'Top 10 Populated Cities in India')
Result:
Thus, the Visualizing Geographic Data with Basemap was successfully executed and practically
verified.
VIVA QUESTIONS
NumPy
1. What is Numpy?
Ans: NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python, offering, among other things, a powerful N-dimensional array object and sophisticated (broadcasting) functions.
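A minimal sketch (array values chosen arbitrarily) of the N-dimensional array object and a broadcasting operation:
import numpy as np
a = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2-D ndarray, shape (2, 3)
b = np.array([10, 20, 30])     # 1-D ndarray, shape (3,)
# Broadcasting stretches b across each row of a without copying data
print(a + b)
# [[11 22 33]
#  [14 25 36]]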
2. Why is NumPy used in Python?
Ans: NumPy is a package in Python used for scientific computing. The NumPy package is used to perform different operations. The ndarray (NumPy array) is a multidimensional array used to store values of the same data type. These arrays are indexed just like sequences, starting with zero.
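A tiny illustration (values are arbitrary) of the same-dtype rule and zero-based indexing described above:
import numpy as np
arr = np.array([3, 1, 4, 1, 5])
print(arr.dtype)    # a single dtype (e.g. int64) shared by every element
print(arr[0])       # 3 -- indexing starts at zero, like a Python sequence
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64 -- the integers were upcast so the dtype stays uniform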
3. What does NumPy mean in Python?
Ans: NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a library for
the Python programming language, adding support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate on these arrays.
4. Where is NumPy used?
Ans: NumPy is an open-source numerical Python library. NumPy contains multi-dimensional array and matrix data structures. It can be utilised to perform a number of mathematical operations on arrays, such as trigonometric, statistical and algebraic routines. NumPy is the successor of the earlier Numeric and Numarray libraries.
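A brief sketch (made-up numbers) of the trigonometric, statistical, and algebraic routines mentioned above:
import numpy as np
x = np.linspace(0, np.pi, 5)
print(np.sin(x))                  # trigonometric routine applied element-wise
data = np.array([2.0, 4.0, 6.0, 8.0])
print(data.mean(), data.std())    # basic statistical routines
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(np.linalg.solve(A, b))      # algebraic routine: solve A @ x = b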
Pandas
1. What is Pandas?
Ans: Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real-world data analysis in Python.
2. What is Python pandas used for?
Ans: Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. pandas is free software released under the three-clause BSD
license.
3. What is a Series in Pandas?
Ans: Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas Series is essentially a single column in an Excel sheet.
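For instance, a Series with explicit labels (values loosely modelled on the earlier exam-data example) can be built and accessed by label:
import pandas as pd
s = pd.Series([12.5, 9.0, 16.5], index=['a', 'b', 'c'], name='score')
print(s['b'])     # 9.0 -- access by index label
print(s.index)    # Index(['a', 'b', 'c'], dtype='object')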
4. Mention the different types of data structures in pandas?
Ans: There are two data structures supported by pandas library, Series and DataFrames. Both of the
data structures are built on top of Numpy. Series is a one-dimensional data structure in pandas and
DataFrame is the two-dimensional data structure in pandas. A third structure, Panel, was a three-dimensional data structure with items, major_axis, and minor_axis; it has been removed from recent pandas versions.
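A minimal sketch (dummy names and marks) contrasting the two supported structures:
import pandas as pd
s = pd.Series([1, 3, 5], name='marks')              # one-dimensional
df = pd.DataFrame({'name':  ['Anu', 'Dev', 'Raj'],  # two-dimensional
                   'marks': [1, 3, 5]})
print(s.ndim, df.ndim)   # 1 2
print(df.shape)          # (3, 2)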
5. Explain Reindexing in pandas?
Ans: Re-indexing means to conform DataFrame to a new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index. It changes the row labels and column
labels of a DataFrame.
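A small example of reindexing (labels and scores are arbitrary); the new label 'c' has no previous value and is therefore filled with NaN:
import pandas as pd
df = pd.DataFrame({'score': [10, 20]}, index=['a', 'b'])
reindexed = df.reindex(['b', 'a', 'c'])   # 'c' was not in the old index, so it gets NaN
print(reindexed)
#    score
# b   20.0
# a   10.0
# c    NaN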
6. What are the key features of pandas library ?
Ans: There are various features in pandas library and some of them are mentioned below
 Data Alignment
 Memory Efficient
 Reshaping
 Merge and join
 Time Series
7. What is pandas Used For ?
Ans: This library is written for the Python programming language for performing operations like data
manipulation, data analysis, etc. The library provides various operations as well as data structures to
manipulate time series and numerical tables.
8. How can we create copy of series in Pandas?
Ans: pandas.Series.copy
Series.copy(deep=True)
pandas.Series.copy makes a deep copy, including a copy of the data and the indices. With deep=False, neither the indices nor the data are copied. Note that even when deep=True data is copied, actual Python objects will not be copied recursively, only the reference to the object.
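A short demonstration with throw-away values; only the deep-copy behaviour is shown directly, and the shallow case is noted in the comments:
import pandas as pd
s = pd.Series([1, 2, 3])
deep = s.copy(deep=True)   # data and index are copied
s.iloc[0] = 99
print(s.iloc[0])      # 99
print(deep.iloc[0])   # 1 -- the deep copy is unaffected by changes to s
# With deep=False the new Series shares its data with s, so (outside of
# copy-on-write mode) in-place changes to s can be seen through the copy.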
9. What is Time Series in pandas?
Ans: A time series is an ordered sequence of data which basically represents how some quantity
changes over time. pandas contains extensive capabilities and features for working with time series
data for all domains.
pandas supports:
Parsing time series information from various sources and formats
Generating sequences of fixed-frequency dates and time spans
Manipulating and converting date times with timezone information
Resampling or converting a time series to a particular frequency
Performing date and time arithmetic with absolute or relative time increments
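A compact sketch (dates and values are synthetic) of generating fixed-frequency dates and resampling them:
import pandas as pd
import numpy as np
# Generate a fixed-frequency (daily) date range with synthetic readings
idx = pd.date_range('2023-01-01', periods=6, freq='D')
ts = pd.Series(np.arange(6.0), index=idx)
# Resample the daily series to a 2-day frequency, averaging each bin
print(ts.resample('2D').mean())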
10. What is pylab?
Ans: PyLab is a module that bundles NumPy, SciPy, and Matplotlib into a single namespace.
Jupyter Notebook
1. What is Jupyter Notebook?
Jupyter Notebook is a web-based interactive computing platform that allows users to create and share
code, equations, visualizations, and narrative text. Jupyter Notebook is popular among data scientists
and engineers as it allows for rapid prototyping and iteration.
2. What are the main features of Jupyter Notebook?
Jupyter Notebook's main features include live code execution in cells, rich inline output (plots, tables, and images), narrative text and equations written with Markdown and LaTeX, and easy sharing of notebooks. It provides an easy way to mix code, output, and explanatory text all in one place, and it is also used by educators to teach programming and data science concepts.
3. How can you create a new notebook in Jupyter?
You can create a new notebook in Jupyter by clicking on the “New” button in the upper right corner
and selecting “Notebook” from the drop-down menu.
4. Can you explain what the data science workflow involves?
The data science workflow generally involves four main steps: data wrangling, exploratory data analysis, modeling, and evaluation. Data wrangling is the process of cleaning and preparing data for analysis. Exploratory data analysis is the process of exploring data to find patterns and relationships. Modeling is the process of building models to make predictions or recommendations based on data. Evaluation is the process of assessing the accuracy of models and using them to make decisions.
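A toy sketch of those four steps using the same toolchain as the experiments above; the column names mimic the diabetes data used earlier, but the values here are invented and the model choice is only illustrative:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# 1. Data wrangling: build a tiny frame and drop incomplete rows
df = pd.DataFrame({'Glucose': [148, 85, 183, 89, 137, None, 116, 78],
                   'Age':     [50, 31, 32, 21, 33, 30, 29, 26],
                   'Outcome': [1, 0, 1, 0, 1, 0, 0, 0]}).dropna()
# 2. Exploratory data analysis: quick summary statistics
print(df.describe())
# 3. Modeling: fit a simple classifier on the cleaned data
X, y = df[['Glucose', 'Age']], df['Outcome']
model = LogisticRegression(max_iter=1000).fit(X, y)
# 4. Evaluation: in-sample accuracy (a real project would use a held-out set)
print(accuracy_score(y, model.predict(X)))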
5. What are some common use cases for Jupyter Notebook?
Jupyter Notebook is a popular tool for data scientists and analysts because it allows for an interactive
coding experience. Jupyter Notebook is often used for exploratory data analysis and for visualizing data.
*******
More Related Content

What's hot

5ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#135ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#13Uptime Technologies LLC (JP)
 
E1000 is faster than VMXNET3
E1000 is faster than VMXNET3E1000 is faster than VMXNET3
E1000 is faster than VMXNET3Eric Sloof
 
오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)
오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)
오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)I Goo Lee
 
Apache Sparkについて
Apache SparkについてApache Sparkについて
Apache SparkについてBrainPad Inc.
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernelAdrian Huang
 
Swap Administration in linux platform
Swap Administration in linux platformSwap Administration in linux platform
Swap Administration in linux platformashutosh123gupta
 
File System Hierarchy
File System HierarchyFile System Hierarchy
File System Hierarchysritolia
 
USENIX Vault'19: Performance analysis in Linux storage stack with BPF
USENIX Vault'19: Performance analysis in Linux storage stack with BPFUSENIX Vault'19: Performance analysis in Linux storage stack with BPF
USENIX Vault'19: Performance analysis in Linux storage stack with BPFTaeung Song
 
SIP Testing with FreeSWITCH
SIP Testing with FreeSWITCHSIP Testing with FreeSWITCH
SIP Testing with FreeSWITCHMoises Silva
 
FreeSWITCH Modules for Asterisk Developers
FreeSWITCH Modules for Asterisk DevelopersFreeSWITCH Modules for Asterisk Developers
FreeSWITCH Modules for Asterisk DevelopersMoises Silva
 
Architecture Of The Linux Kernel
Architecture Of The Linux KernelArchitecture Of The Linux Kernel
Architecture Of The Linux Kernelguest547d74
 
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...NTT DATA Technology & Innovation
 
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Shinya Takamaeda-Y
 
Structure of the page table
Structure of the page tableStructure of the page table
Structure of the page tableduvvuru madhuri
 

What's hot (20)

5ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#135ステップで始めるPostgreSQLレプリケーション@hbstudy#13
5ステップで始めるPostgreSQLレプリケーション@hbstudy#13
 
E1000 is faster than VMXNET3
E1000 is faster than VMXNET3E1000 is faster than VMXNET3
E1000 is faster than VMXNET3
 
오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)
오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)
오픈소스로 만드는 DB 모니터링 시스템 (w/graphite+grafana)
 
Apache Sparkについて
Apache SparkについてApache Sparkについて
Apache Sparkについて
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernel
 
nodeMCU IOT教學03 - NodeMCU導論
nodeMCU IOT教學03 - NodeMCU導論nodeMCU IOT教學03 - NodeMCU導論
nodeMCU IOT教學03 - NodeMCU導論
 
4 threads
4 threads4 threads
4 threads
 
Swap Administration in linux platform
Swap Administration in linux platformSwap Administration in linux platform
Swap Administration in linux platform
 
Making Linux do Hard Real-time
Making Linux do Hard Real-timeMaking Linux do Hard Real-time
Making Linux do Hard Real-time
 
File System Hierarchy
File System HierarchyFile System Hierarchy
File System Hierarchy
 
USENIX Vault'19: Performance analysis in Linux storage stack with BPF
USENIX Vault'19: Performance analysis in Linux storage stack with BPFUSENIX Vault'19: Performance analysis in Linux storage stack with BPF
USENIX Vault'19: Performance analysis in Linux storage stack with BPF
 
Linux vs windows
Linux vs windowsLinux vs windows
Linux vs windows
 
SIP Testing with FreeSWITCH
SIP Testing with FreeSWITCHSIP Testing with FreeSWITCH
SIP Testing with FreeSWITCH
 
FreeSWITCH Modules for Asterisk Developers
FreeSWITCH Modules for Asterisk DevelopersFreeSWITCH Modules for Asterisk Developers
FreeSWITCH Modules for Asterisk Developers
 
Architecture Of The Linux Kernel
Architecture Of The Linux KernelArchitecture Of The Linux Kernel
Architecture Of The Linux Kernel
 
Oracle Data Masking and Subsettingのご紹介
Oracle Data Masking and Subsettingのご紹介Oracle Data Masking and Subsettingのご紹介
Oracle Data Masking and Subsettingのご紹介
 
Linux device drivers
Linux device driversLinux device drivers
Linux device drivers
 
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...
Apache Sparkの基本と最新バージョン3.2のアップデート(Open Source Conference 2021 Online/Fukuoka ...
 
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
 
Structure of the page table
Structure of the page tableStructure of the page table
Structure of the page table
 

Similar to DS LAB MANUAL.pdf

Introduction to Machine Learning by MARK
Introduction to Machine Learning by MARKIntroduction to Machine Learning by MARK
Introduction to Machine Learning by MARKMRKUsafzai0607
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docxrohithprabhas1
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 
Machine learning Experiments report
Machine learning Experiments report Machine learning Experiments report
Machine learning Experiments report AlmkdadAli
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxhkabir55
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxPremaGanesh1
 
2. Data Preprocessing.pdf
2. Data Preprocessing.pdf2. Data Preprocessing.pdf
2. Data Preprocessing.pdfJyoti Yadav
 
pythonlibrariesandmodules-210530042906.docx
pythonlibrariesandmodules-210530042906.docxpythonlibrariesandmodules-210530042906.docx
pythonlibrariesandmodules-210530042906.docxRameshMishra84
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Python Libraries and Modules
Python Libraries and ModulesPython Libraries and Modules
Python Libraries and ModulesRaginiJain21
 
House price prediction
House price predictionHouse price prediction
House price predictionSabahBegum
 

Similar to DS LAB MANUAL.pdf (20)

Introduction to Machine Learning by MARK
Introduction to Machine Learning by MARKIntroduction to Machine Learning by MARK
Introduction to Machine Learning by MARK
 
Session 2
Session 2Session 2
Session 2
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
Machine learning Experiments report
Machine learning Experiments report Machine learning Experiments report
Machine learning Experiments report
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptx
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
 
2. Data Preprocessing.pdf
2. Data Preprocessing.pdf2. Data Preprocessing.pdf
2. Data Preprocessing.pdf
 
pythonlibrariesandmodules-210530042906.docx
pythonlibrariesandmodules-210530042906.docxpythonlibrariesandmodules-210530042906.docx
pythonlibrariesandmodules-210530042906.docx
 
Scientific Python
Scientific PythonScientific Python
Scientific Python
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Python Libraries and Modules
Python Libraries and ModulesPython Libraries and Modules
Python Libraries and Modules
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Manual orange
Manual orangeManual orange
Manual orange
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019
 
Python ml
Python mlPython ml
Python ml
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 

Recently uploaded

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 

Recently uploaded (20)

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

DS LAB MANUAL.pdf

  • 1. REGULATION – 2021 CS3361 – DATA SCIENCE LABORATORY LAB MANUAL YEAR / SEMESTER: II / III Prepared by P.SANTHIYA Assistant Professor Department of Computer Science and Engineering
  • 2. CS3362 DATA SCIENCE LABORATORY L T P C 0 0 4 2 COURSE OBJECTIVES: To understand the python libraries for data science. To understand the basic Statistical and Probability measures for data science.To learn descriptive analytics on the benchmark data sets. To apply correlation and regression analytics on standard data sets. To present and interpret data using visualization packages in Python. LIST OF EXPERIMENTS: 1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodelsand Pandaspackages. 2. Working with Numpy arrays 3. Working with Pandas data frames 4. Reading data from text files, Excel and the web and exploring variouscommands for doingdescriptive analytics on the Iris data set. 5. Use the diabetes data set from UCI and Pima Indians Diabetes data set forperforming thefollowing: a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, StandardDeviation,Skewness and Kurtosis. b. Bivariate analysis: Linear and logistic regression modeling c. Multiple Regression analysis d. Also compare the results of the above analysis for the two data sets. 6. Apply and explore various plotting functions on UCI data sets. a. Normal curves b. Density and contour plots c. Correlation and scatter plots d. Histograms e. Three dimensional plotting 7. Visualizing Geographic Data with Basemap
  • 3. LIST OF EQUIPMENTS :(30 Students per Batch) Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statmodels, seaborn, plotly, bokeh Note: Example data sets like: UCI, Iris, Pima Indians Diabetes etc. TOTAL: 60 PERIODS COURSE OUTCOMES: At the end of this course, the students will be able to:  Make use of the python libraries for data science.  Make use of the basic Statistical and Probability measures for data science.  Perform descriptive analytics on the benchmark data sets.  Perform correlation and regression analytics on standard data sets.  Present and interpret data using visualization packages in Python.
  • 4. 1 Ex.No 1 Download, install and explore the features of NumPy, SciPy, Jupyter,Statsmodels andpackages Date: AIM: Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and packages. Downloading and Installing Anaconda on Linux: 1. Getting Started: 2. Getting through the License Agreement:
  • 5. 2 3. Choose Installation Location: 4. Extracting Files and packages:
  • 6. 3 5. Initializing Anaconda Installation: 6. Finishing up the Installation:
  • 7. 4 7. Working with Anaconda: >> anaconda-navigator
  • 8. 5 a) Installing Jupyter Notebook using Anaconda: To install Jupyter using Anaconda, just go through the following instructions: 1. Launch Anaconda Navigator: 2. Click on the Install Jupyter Notebook Button:
  • 9. 6 3. Beginning the Installation: 4. Loading Packages:
  • 10. 7 5. Finished Installation: 6. Launching Jupyter:
  • 11. 8 b) Installing Jupyter Notebook using pip: To install Jupyter using pip, the following command to update pip: >> python3 -m pip install --upgrade pip After updating the pip version, follow the instructions provided below to install Jupyter: Command to install Jupyter: >>pip3 install Jupyter 1. Beginning Installation:
  • 12. 9 2. Collecting Files and Data: 3. Downloading Packages:
  • 13. 10 4. Running Installation: 5. Finished Installation:
  • 14. 11 6. Launching Jupyter: Use the following command to launch Jupyter using command-line: >>jupyter notebook
  • 15. 12 Explore the following features of python packages: 1. NumPy: NumPy stands for Numerical Python.NumPy (Numerical Python) is an open-source library for the Python programming language. It is used for scientific computing and working with arrays. The source code for NumPy is located at this github repository https://github.com/numpy/numpy. Features: 1. High-performance N-dimensional array object. 2. It contains tools for integrating code from C/C++ and Fortran. 3. It contains a multidimensional container for generic data. 4. Additional linear algebra, Fourier transform, and random number capabilities. 5. It consists of broadcasting functions. 6. It had data type definition capability to work with varied databases. 2. SciPy: SciPy stands for Scientific Python. SciPy is a scientific computation library that uses NumPy underneath. The source code for SciPy is located at this github repository https://github.com/scipy/scipy Features: 1. SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems. 2. It provides more utility functions for optimization, stats and signal processing. Numpy vs. SciPy Numpy and SciPy both are used for mathematical and numerical analysis. Numpy is suitable for basicoperations such as sorting, indexing and many more because it contains array data, whereas SciPy consistsof all the numeric data. Numpy contains many functions that are used to resolve the linear algebra, Fourier transforms, etc. whereas SciPy library contains full featured version of the linear algebra module as well many other numerical algorithms.
  • 16. 13 3. Pandas: Python Pandas is defined as an open-source library that provides high-performance data manipulation in Python. The name ofPandas isderived from the word Panel Data, which meansan Econometrics from Multidimensional data. It is used for data analysis in Python. Pandas is built on top of the Numpy package, means Numpy is required for operating the Pandas. Features: 1. Group by data for aggregations and transformations. 2. It has a fast and efficient DataFrame object with the default and customized indexing. 3. Used for reshaping and pivoting of the data sets. 4. It is used for data alignment and integration of the missing data. 5. Provide the functionality of Time Series. 6. Process a variety of data sets in different formats like matrix data, tabular heterogeneous,time series. 7. Handle multiple operations of the data sets such as subsetting, slicing, filtering, groupBy, re- ordering, and re-shaping. 4. Statsmodels: statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. The package is released under the open source Modified BSD (3-clause) license. The online documentation is hosted at statsmodels.org. Features: 1. Linear regression models like Ordinary least squares, Generalized least squares,Weighted least squares, Least squares with autoregressive errors. 2. Bayesian Mixed GLM for Binomial and Poisson 3. GEE: Generalized Estimating Equations for one-way clustered or longitudinal data 4. Nonparametric statistics: Univariate and multivariate kernel density estimators 5. Datasets: Datasets used for examples and in testing 6. Sandbox: statsmodels contains a sandbox folder with code in various stages ofdevelopment and testing. RESULT: Thus, the NumPy, SciPy, Jupyter, Statsmodels packages have been successfully download,install and explore their features.
  • 17. 14 Ex.No 2 Working with Numpy arrays Date: AIM: To write a Numpy arrays program to demonstrate basic array concepts in Jupyter Notebook. PROGRAM: 1. Creating Arrays from Python Lists: In[1]: import numpy as np In[2]: # integer array: np.array([1, 4, 2, 5, 3]) Out[2]: array([1, 4, 2, 5, 3]) #NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will upcast ifpossible (here, integers are upcast to floating point): In[3]: np.array([3.14, 4, 2, 3]) Out[3]: array([ 3.14, 4. , 2. , 3. ]) #If we want to explicitly set the data type of the resulting array, we can use the dtype keyword: In[4]: np.array([1, 2, 3, 4], dtype='float32') Out[4]: array([ 1., 2., 3., 4.], dtype=float32) In[5]: # nested lists result in multidimensional arrays np.array([range(i, i + 3) for i in [2, 4, 6]]) Out[5]: array([[2, 3, 4], [4, 5, 6], [6, 7, 8]]) 2. NumPy Array Attributes: In[1]: import numpy as np np.random.seed(0) # seed for reproducibility x1 = np.random.randint(10, size=6) # One-dimensional array x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
  • 18. 15 Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (thetotal size of the array): In[2]: print("x3 ndim: ", x3.ndim) print("x3 shape:", x3.shape)print("x3 size: ", x3.size) Out[2]:x3 ndim: 3 x3 shape: (3, 4, 5) x3 size: 60 In[3]: print("dtype:", x3.dtype)# data type of the array Out[3]:dtype: int64 In[4]: print("itemsize:", x3.itemsize, "bytes")print("nbytes:", x3.nbytes, "bytes") Out[4]:itemsize: 8 bytes Out[4]:nbytes: 480 bytes 3. Array Indexing: Accessing Single Elements: In[5]: x1 Out[5]: array([5, 0, 3, 3, 7, 9]) In[6]: x1[0] Out[6 ]: 5 In[7]: x1[4] Out[7]: 7 #To index from the end of the array, you can use negative indices In[8]: x1[-1] Out[8] : 9 In[9]: x1[-2] Out[9]: 7
  • 19. 16 #In a multidimensional array, you access items using a comma-separated tuple of indices In[10]: x2 Out[10]: array([[3, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]) In[11]: x2[0, 0] Out[11]: 3 In[12]: x2[2, 0] Out[12]: 1 In[13]: x2[2, -1] Out[13]: 7 #modify values using any of the above index notation In[14]: x2[0, 0] = 12 x2 Out[14]: array([[12, 5, 2, 4], [ 7, 6, 8, 8], [ 1, 6, 7, 7]]) In[15]: x1[0] = 3.14159 # this will be truncated! x1 Out[15]: array([3, 0, 3, 3, 7, 9]) 4. Array Slicing: Accessing Subarrays #One-dimensional subarrays In[16]: x = np.arange(10)x Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) In[17]: x[:5] # first five elements Out[17]: array([0, 1, 2, 3, 4]) In[18]: x[5:] # elements after index 5 Out[18]: array([5, 6, 7, 8, 9]) In[19]: x[4:7] # middle subarray Out[19]: array([4, 5, 6]) In[20]: x[::2] # every other element Out[20]: array([0, 2, 4, 6, 8]) In[21]: x[1::2] # every other element, starting at index 1 Out[21]: array([1, 3, 5, 7, 9])
  • 20. 17 In[22]: x[::-1] # all elements, reversed Out[22]: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0]) In[23]: x[5::-2] # reversed every other from index 5 Out[23]: array([5, 3, 1]) 5. Multidimensional subarrays: In[24]: x2 Out[24]: array([[12, 5, 2, 4], [ 7, 6, 8, 8], [ 1, 6, 7, 7]]) In[25]: x2[:2, :3] # two rows, three columns Out[25]: array([[12, 5, 2], [ 7, 6, 8]]) In[26]: x2[:3, ::2] # all rows, every other column Out[26]: array([[12, 2], [ 7, 8], [ 1, 7]]) #Finally, subarray dimensions can even be reversed together: In[27]: x2[::-1, ::-1] Out[27]: array([[ 7, 7, 6, 1], [ 8, 8, 6, 7], [ 4, 2, 5, 12]]) 6. Accessing array rows and columns: In[28]: print(x2[:, 0]) # first column of x2 [12 7 1] In[29]: print(x2[0, :]) # first row of x2 [12 5 2 4] #In the case of row access, the empty slice can be omitted for a more compact syntax: In[30]: print(x2[0]) # equivalent to x2[0, :] [12 5 2 4] In[31]: print(x2) [[12 5 2 4] [ 7 6 8 8] [ 1 6 7 7]] #extract a 2×2 subarray from this: In[32]: x2_sub = x2[:2, :2]
  • 21. 18 print(x2_sub) [[12 5] [ 7 6]] #modify this subarray In[33]: x2_sub[0, 0] = 99 print(x2_sub) [[99 5] [ 7 6]] In[34]: print(x2) [[99 5 2 4] [ 7 6 8 8] [ 1 6 7 7]] 7. Creating copies of arrays: In[35]: x2_sub_copy = x2[:2, :2].copy()print(x2_sub_copy) [[99 5] [ 7 6]] #modify this subarray In[36]: x2_sub_copy[0, 0] = 42 print(x2_sub_copy) [[42 5] [ 7 6]] In[37]: print(x2) [[99 5 2 4] [ 7 6 8 8] [ 1 6 7 7]] Reshaping of Arrays: In[38]: grid = np.arange(1, 10).reshape((3, 3))print(grid) [[1 2 3] [4 5 6] [7 8 9]]
  • 22. 19 In[39]: x = np.array([1, 2, 3]) # row vector via reshape x.reshape((1, 3)) Out[39]: array([[1, 2, 3]]) In[40]: # row vector via newaxis x[np.newaxis, :] Out[40]: array([[1, 2, 3]]) In[41]: # column vector via reshape x.reshape((3, 1)) Out[41]: array([[1],[2], [3]]) In[42]: # column vector via newaxis x[:, np.newaxis] Out[42]: array([[1], [2], [3]]) 8. Array Concatenation and Splitting: In[43]: x = np.array([1, 2, 3]) y = np.array([3, 2, 1]) np.concatenate([x, y]) Out[43]: array([1, 2, 3, 3, 2, 1]) #concatenate more than two arrays at once: In[44]: z = [99, 99, 99] print(np.concatenate([x, y, z]))[ 1 2 3 3 2 1 99 99 99] #np.concatenate can also be used for two-dimensional arrays: In[45]: grid = np.array([[1, 2, 3], [4, 5, 6]]) In[46]: # concatenate along the first axis np.concatenate([grid, grid]) Out[46]: array([[1, 2, 3], [4, 5, 6], [1, 2, 3],
  • 23. 20 [4, 5, 6]]) In[47]: # concatenate along the second axis (zero-indexed) np.concatenate([grid, grid], axis=1) Out[47]: array([[1, 2, 3, 1, 2, 3], [4, 5, 6, 4, 5, 6]]) In[48]: x = np.array([1, 2, 3]) grid = np.array([[9, 8, 7], [6, 5, 4]]) # vertically stack the arrays np.vstack([x, grid]) Out[48]: array([[1, 2, 3], [9, 8, 7], [6, 5, 4]]) In[49]: # horizontally stack the arraysy = np.array([[99],[99]]) np.hstack([grid, y]) Out[49]: array([[ 9, 8, 7, 99], [ 6, 5, 4, 99]]) Splitting of arrays: In[50]: x = [1, 2, 3, 99, 99, 3, 2, 1] x1, x2, x3 = np.split(x, [3, 5]) print(x1, x2, x3) [1 2 3] [99 99] [3 2 1] In[51]: grid = np.arange(16).reshape((4, 4))grid Out[51]: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15]]) In[52]: upper, lower = np.vsplit(grid, [2])print(upper) print(lower) [[0 1 2 3]
  • 24. 21 [4 5 6 7]] [[ 8 9 10 11] [12 13 14 15]] In[53]: left, right = np.hsplit(grid, [2]) print(left) print(right) Out[53]: [[ 0 1] [ 4 5] [ 8 9] [12 13]] [[ 2 3] [ 6 7] [10 11] [14 15]] 9. Exploring NumPy’s UFuncs: In[54]: x = np.arange(4) print("x =", x) print("x + 5 =", x + 5) print("x - 5 =", x - 5) print("x * 2 =", x * 2) print("x / 2 =", x / 2) print("x // 2 =", x // 2) # floor division Out[54]: x = [0 1 2 3] x + 5 = [5 6 7 8] x - 5 = [-5 -4 -3 -2] x * 2 = [0 2 4 6] x / 2 = [ 0. 0.5 1. 1.5] x // 2 = [0 0 1 1] In[8]: print("-x = ", -x) print("x ** 2 = ", x ** 2) print("x % 2 = ", x % 2) Out[8]:-x = [ 0 -1 -2 -3] x ** 2 = [0 1 4 9]
  • 25. 22 x % 2 = [0 1 0 1] In[9]: -(0.5*x + 1) ** 2 Out[9]: array([-1. , -2.25, -4. , -6.25]) In[10]: np.add(x, 2) Out[10]: array([2, 3, 4, 5]) 10. Absolute value: In[11]: x = np.array([-2, -1, 0, 1, 2]) abs(x) Out[11]: array([2, 1, 0, 1, 2]) In[12]: np.absolute(x) Out[12]: array([2, 1, 0, 1, 2]) In[13]: np.abs(x) Out[13]: array([2, 1, 0, 1, 2]) In[14]: x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])np.abs(x) Out[14]: array([ 5., 5., 2., 1.]) 11. Trigonometric functions: In[15]: theta = np.linspace(0, np.pi, 3) In[16]: print("theta = ", theta) print("sin(theta) = ", np.sin(theta)) print("cos(theta) = ", np.cos(theta)) print("tan(theta) = ", np.tan(theta)) Out[16]:theta = [ 0. 1.57079633 3.14159265] sin(theta) = [ 0.00000000e+00 1.00000000e+00 1.22464680e-16] cos(theta) = [ 1.00000000e+00 6.12323400e-17 -1.00000000e+00] tan(theta) = [ 0.00000000e+00 1.63312394e+16 -1.22464680e-16] In[17]: x = [-1, 0, 1] print("x = ", x) print("arcsin(x) = ", np.arcsin(x)) print("arccos(x) = ", np.arccos(x)) print("arctan(x) = ", np.arctan(x)) Out[17]:x = [-1, 0, 1] arcsin(x) = [-1.57079633 0. 1.57079633]
  • 26. 23 arccos(x) = [ 3.14159265 1.57079633 0. ] arctan(x) = [-0.78539816 0. 0.78539816] 12. Exponents and logarithms: In[18]: x = [1, 2, 3] print("x =", x) print("e^x =", np.exp(x)) print("2^x =", np.exp2(x)) print("3^x =", np.power(3, x)) Out[18]:x = [1, 2, 3] e^x = [ 2.71828183 7.3890561 20.08553692] 2^x = [ 2. 4. 8.] 3^x = [ 3 9 27] In[19]: x = [1, 2, 4, 10] print("x =", x) print("ln(x) =", np.log(x)) print("log2(x) =", np.log2(x)) print("log10(x) =", np.log10(x)) Out[19]:x = [1, 2, 4, 10] ln(x) = [ 0. 0.69314718 1.38629436 2.30258509] log2(x) = [ 0. 1. 2. 3.32192809] log10(x) = [ 0. 0.30103 0.60205999 1. ] In[20]: x = [0, 0.001, 0.01, 0.1] print("exp(x) - 1 =", np.expm1(x)) print("log(1 + x) =", np.log1p(x)) Out[20]:exp(x) - 1 = [ 0. 0.0010005 0.01005017 0.10517092] log(1 + x) = [ 0. 0.0009995 0.00995033 0.09531018] 13. Specialized ufuncs: In[21]: from scipy import special In[22]: # Gamma functions (generalized factorials) and related functions x = [1, 5, 10] print("gamma(x) =", special.gamma(x)) print("ln|gamma(x)| =", special.gammaln(x)) print("beta(x, 2) =", special.beta(x, 2)) Out[22]:gamma(x) = [ 1.00000000e+00 2.40000000e+01 3.62880000e+05]ln|gamma(x)| = [ 0. 3.17805383 12.80182748]
  • 27. 24 beta(x, 2) = [ 0.5 0.03333333 0.00909091] In[23]: # Error function (integral of Gaussian) its complement, and its inverse x = np.array([0, 0.3, 0.7, 1.0]) print("erf(x) =", special.erf(x)) print("erfc(x) =", special.erfc(x)) print("erfinv(x) =", special.erfinv(x)) Out[23]:erf(x) = [ 0. 0.32862676 0.67780119 0.84270079] erfc(x) = [ 1. 0.67137324 0.32219881 0.15729921] erfinv(x) = [ 0. 0.27246271 0.73286908 inf] 14. Aggregates: In[26]: x = np.arange(1, 6) np.add.reduce(x) In[27]: np.multiply.reduce(x) Out[27]: 120 In[28]: np.add.accumulate(x) Out[28]: array([ 1, 3, 6, 10, 15]) In[29]: np.multiply.accumulate(x) Out[29]: array([ 1, 2, 6, 24, 120]) 15. Outer products: In[30]: x = np.arange(1, 6) np.multiply.outer(x, x) Out[30]: array([[ 1, 2, 3, 4, 5], [ 2, 4, 6, 8, 10], [ 3, 6, 9, 12, 15], [ 4, 8, 12, 16, 20], [ 5, 10, 15, 20, 25]]) RESULT: Thus, the Numpy array program was successfully executed and verified.
  • 28. 25 Ex.No 3 Working of Pandas DataFrame Date: AIM: To write a Pandas program using dictionary sample dataframe to perform opertations in its DataFrame. Sample DataFrame: exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']} labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] PROGRAM: import pandas as pd import numpy as np exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']} labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] df = pd.DataFrame(exam_data , index=labels) print(df) print("Summary of the basic information about this DataFrame and its data:") print(df.info())
  • 29. 26 Sample Output: attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no NaN e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0 Summary of the basic information about this DataFrame and its data: <class 'pandas.core.frame.DataFrame'>Index: 10 entries, a to j Data columns (total 4 columns): # Column Non-Null Count Dtype 0 name 10 non-null object 1 score 8 non-null float64 2 attempts 10 non-null int64 3 qualify 10 non-null object dtypes: float64(1), int64(1), object(2)memory usage: 400.0+ bytes None i. To get the first 3 rows of a given DataFrame. print("First three rows of the data frame:") print(df.iloc[:3]) Sample Output: First three rows of the data frame: attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 ii. To select the 'name' and 'score' columns from the following DataFrame. print("Select specific columns:") print(df[['name', 'score']])
  • 30. 27 Sample Output: Select specific columns: name score a Anastasia 12.5 b Dima 9.0 c Katherine 16.5 d James NaN e Emily 9.0 f Michael 20.0 g Matthew 14.5 h Laura NaN i Kevin 8.0 j Jonas 19.0 iii. To select the specified columns and rows from a given DataFrame. Select 'name' and'score' columns in rows 1, 3, 5, 6 from the following data frame. print("Select specific columns and rows:") print(df.iloc[[1, 3, 5, 6], [1, 3]]) Select specific columns and rows: score qualify b 9.0 no d NaN no f 20.0 yes g 14.5 yes iv. To select the rows where the number of attempts in the examination is greater than 2. print("Number of attempts in the examination is greater than 2:") print(df[df['attempts'] > 2]) Sample Output: Number of attempts in the examination is greater than 2: name score attempts qualify b Dima 9.0 3 no d James NaN 3 no f Michael 20.0 3 yes v. To select the rows where the score is missing, i.e. is NaN. print("Rows where score is missing:") print(df[df['score'].isnull()])
  • 31. 28 Sample Output: Rows where score is missing: attempts name qualify score d 3 James no NaN h 1 Laura no NaN vi. To change the score in row 'd' to 11.5. print("nOriginal data frame:")print(df) print("nChange the score in row 'd' to 11.5:")df.loc['d', 'score'] = 11.5 print(df) Sample Output: Original data frame: attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no NaN e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0 Change the score in row 'd' to 11.5: Attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no 11.5 e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0
  • 32. 29 vii. To calculate the sum of the examination score by the students. print("nSum of the examination attempts by the students:") print(df['score'].sum()) Sample Output: Sum of the examination attempts by the students: 108.5 viii. To append a new row 'k' to DataFrame with given values for each column. Now delete thenew row and return the original data frame. print("Original rows:")print(df) print("nAppend a new row:") df.loc['k'] = [1, 'Suresh', 'yes', 15.5] print("Print all records after insert a new record:")print(df) print("nDelete the new row and display the original rows:")df = df.drop('k') print(df) Sample Output: Original rows: attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no NaN e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0 Append a new row: Print all records after insert a new record:attempts name qualify score
  • 33. 30 a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no NaN e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0 k 1 Suresh yes 15.5 Delete the new row and display the original rows:attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no NaN e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0 ix. To delete the 'attempts' column from the DataFrame. print("Original rows:")print(df) print("nDelete the 'attempts' column from the data frame:") df.pop('attempts') print(df)
  • 34. 31 Sample Output: Original rows: attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no NaN e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0 Delete the 'attempts' column from the data frame: name qualify score a Anastasia yes 12.5 b Dima no 9.0 c Katherine yes 16.5 d James no NaN e Emily no 9.0 f Michael yes 20.0 g Matthew yes 14.5 h Laura no NaN i Kevin no 8.0 j Jonas yes 19.0 x. To insert a new column in existing DataFrame. print("Original rows:")print(df) color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']df['color'] = color
  • 35. 32 print("nNew DataFrame after inserting the 'color' column")print(df) Sample Output Original rows: attempts name qualify score a 1 Anastasia yes 12.5 b 3 Dima no 9.0 c 2 Katherine yes 16.5 d 3 James no NaN e 2 Emily no 9.0 f 3 Michael yes 20.0 g 1 Matthew yes 14.5 h 1 Laura no NaN i 2 Kevin no 8.0 j 1 Jonas yes 19.0 New DataFrame after inserting the 'color' column Attempts name qualify score color a 1 Anastasia yes 12.5 Red b 3 Dima no 9.0 Blue c 2 Katherine yes 16.5 Orange d 3 James no NaN Red e 2 Emily no 9.0 White f 3 Michael yes 20.0 White g 1 Matthew yes 14.5 Blue h 1 Laura no NaN Green i 2 Kevin no 8.0 Green j 1 Jonas yes 19.0 Red RESULT: Thus, the working of Pandas Dataframe using Dictionary was executed and verified successfully.
  • 36. 33 Ex.No:4 Descriptive analytics on the Iris data set Date: AIM: To Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set. PROCEDURE: Download the Iris.csv file from the https://www.kaggle.com/datasets/uciml/iris and use the Pandas library to load this CSV file,and convert it into the dataframe. read_csv() method is used to read CSV files. PROGRAM: import pandas as pd df = pd.read_csv("Music/Iris.csv")# Reading the CSV file print(df) print(df.d types) Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm 0 1 5.1 3.5 1.4 0.2 1 2 4.9 3.0 1.4 0.2 2 3 4.7 3.2 1.3 0.2 3 4 4.6 3.1 1.5 0.2 4 5 5.0 3.6 1.4 0.2 .. ... ... ... ... ... 145 146 6.7 3.0 5.2 2.3 146 147 6.3 2.5 5.0 1.9 147 148 6.5 3.0 5.2 2.0 148 149 6.2 3.4 5.4 2.3 149 150 5.9 3.0 5.1 1.8 Species 0 Iris-setosa 1 Iris-setosa 2 Iris-setosa 3 Iris-setosa 4 Iris-setosa
  • 37. 34 .. ... 145 Iris-virginica 146 Iris-virginica 147 Iris-virginica 148 Iris-virginica 149 Iris-virginica [150 rows x 6 columns] Id int64 SepalLengthCm float64 SepalWidthCm float64 PetalLengthCm float64 PetalWidthCm float64 Species object dtype: object # Printing top 5 rows print(df.head()) #To shape parameter to get the shape of the dataset. print(df.shape) Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species 0 1 5.1 3.5 1.4 0.2 Iris-setosa 1 2 4.9 3.0 1.4 0.2 Iris-setosa 2 3 4.7 3.2 1.3 0.2 Iris-setosa 3 4 4.6 3.1 1.5 0.2 Iris-setosa 4 5 5.0 3.6 1.4 0.2 Iris-setosa (150, 6) #To know the columns and their data types use the info() method. df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 150 entries, 0 to 149
  • 38. 35 Data columns (total 6 columns): # Column Non-Null Count Dtype 0 Id 150 non-null int64 1 SepalLengthCm 150 non-null float64 2 SepalWidthCm 150 non-null float64 3 PetalLengthCm 150 non-null float64 4 PetalWidthCm 150 non-null float64 Species 150 non-null object dtypes: float64(4), int64(1), object(1) memory usage: 7.2+ KB print(df.describe()) Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm count 150.000000 150.000000 150.000000 150.000000 150.000000 m e a n 75.500000 5.843333 3.054000 3.758667 1.198667 s t d 43.445368 0.828066 0.433594 1.764420 0.763161 m i n 1.000000 4.300000 2.000000 1.000000 0.100000 2 5 % 38.250000 5.100000 2.800000 1.600000 0.300000 5 0 % 75.500000 5.800000 3.000000 4.350000 1.300000 7 5 % 112.750000 6.400000 3.300000 5.100000 1.800000 m a x 150.0000 00 7.900000 4.40000 0 6.90000 0 2.500000
  • 39. 36 # Missing values can occur when no information is provided print(df.isnull().sum()) Id 0 SepalLengthCm 0 SepalWidthCm 0 PetalLengthCm 0 PetalWidthCm 0 Species 0 dtype: int64 # To check dataset contains any duplicates or notdata = df.drop_duplicates(subset ="Species",) print(data) Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm 0 1 5.1 3.5 1.4 0.2 5 0 5 1 7.0 3. 2 4.7 1.4 1 0 0 1 0 1 6.3 3.3 6.0 2.5 Species 0 Iris- setosa 50 Iris-versicolor 100 Iris-virginica #To find unique species from the given dataset print(df.value_counts("Species")) Species Iris-setosa50 Iris-versicolor 50 Iris-virginica 50 dtype: int64
  • 40. 37 #matplotlib import matplotlib.pyplot as plt %matplotlib inline #seaborn import seaborn as sns #plot the variable ‘sepal.width’ plt.scatter(df.index,df['SepalWidthCm'])plt.show() #visualize the same plot by considering its variety using the sns.scatterplot() function of the seaborn library. sns.scatterplot(x=df.index,y=df['SepalWidthCm'],hue=df['Species'])
  • 41. 38 #visualizes data by connecting the data points via line segments. plt.figure(figsize=(6,6)) plt.title("line plot for petal length") plt.xlabel('index',fontsize=20) plt.ylabel('PetalLengthCm',fontsize=20) plt.plot(df.index,df['PetalLengthCm'],markevery=1,marker='d') for name,group in df.groupby('Species'): plt.plot(group.index,group['PetalLengthCm'],label=name,markevery=1,marker='d') plt.legend() plt.show() #Plotting histogram using the matplotlib plt.hist() function : plt.hist(df["PetalWidthCm"])
  • 43. 40 Ex.No:5.a Univariate analysis using the UCI diabetes data set Date: AIM: To Reading data from csv files exploring various commands for doing Univariate analysis using the UCI diabetes data set. PROCEDURE: Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/datasets/uciml/pima- indians-diabetes-database and use the Pandaslibrary to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files. PROGRAM: import pandas as pd df = pd.read_csv("E:DATA SCIENCEPima_indian_diabetesdiabetes.CSV")print(df) Pregnancies Glucose BloodPressure SkinThickness Insulin BMI 0 6 148 72 35 0 33.6 1 1 85 66 29 0 26.6 2 8 183 64 0 0 23.3 3 1 89 66 23 94 28.1 4 0 137 40 35 168 43.1 .. ... ... . . . . . . .. . ... 763 10 101 76 48 180 32.9 764 2 122 70 27 0 36.8 765 5 121 72 23 112 26.2 766 1 126 60 0 0 30.1 767 1 93 70 31 0 30.4 DiabetesPedigreeFunction Age Outcome 0 0.627 50 1 1 0.351 31 0 2 0.672 32 1 3 0.167 21 0 4 2.288 33 1 .. ... ... ... 763 0.171 63 0 764 0.340 27 0
  • 44. 41 765 0.245 30 0 766 0.349 47 1 767 0.315 23 0 [768 rows x 9 columns] # To know data type print(df.dtypes) Pregnancies int64 Glucose int64 BloodPressure int64 SkinThickness int64 Insulin int64 BMI float64 DiabetesPedigreeFunction float64Age int64 Outcome int64 dtype: object #To print fiest 5 rows print(df.head()) Pregnancies Glucose BloodPressure SkinThickness Insulin BMI 0 6 148 72 35 0 33.6 1 1 85 66 29 0 26.6 2 8 183 64 0 0 23.3 3 1 89 66 23 94 28.1 4 0 137 40 35 168 43.1 DiabetesPedigreeFunction Age Outcome 0 0.627 50 1 1 0.351 31 0 2 0.672 32 1 3 0.167 21 0 4 2.288 33 1 #To shape parameter to get the shape of the dataset. print(df.shape)(768, 9)
  • 45. 42 #calculate mean print("Mean of Preganancies: %f" %df['Pregnancies'].mean()) print("Mean of BloodPressure: %f" %df['BloodPressure'].mean())print("Mean of Glucose: %f" %df['Glucose'].mean()) print("Mean of Age: %f" %df['Age'].mean()) Sample Output: Mean of Preganancies: 3.845052 Mean of BloodPressure: 69.105469 Mean of Glucose: 120.894531 Mean of Age: 33.240885 #calculate median print("median of Preganancies: %f" %df['Pregnancies'].median()) print("median of BloodPressure: %f" %df['BloodPressure'].median())print("medianf Glucose: %f" %df['Glucose'].median()) print("median of Age: %f" %df['Age'].median()) Sample Output: median of Preganancies: 3.000000 median of BloodPressure: 72.000000 medianf Glucose: 117.000000 median of Age: 29.000000 #calculate standard deviation of 'points' print("standard deviation for BloodPressure: %f" % df['BloodPressure'].std())print("standard deviation for Glucose: %f" % df['Glucose'].std()) print("standard deviation for Pregnancies: %f" % df['Pregnancies'].std()) Sample Output: standard deviation for BloodPressure: 19.355807standard deviation for Glucose: 31.972618 standard deviation for Pregnancies: 3.369578 #To describe the data
  • 46. 43 df.Glucose.describe() Sample Output: count 768.000000 mean 120.894531 std 31.972618 min 0.000000 25% 99.000000 50% 117.000000 75% 140.250000 max 199.000000 Name: Glucose, dtype: float64 #create frequency table df['Pregnancies'].value_counts() Sample Output: 99 17 100 17 111 14 129 14 125 14 .. 191 1 177 1 44 1 62 1 190 1 Name: Glucose, Length: 136, dtype: int64#create frequency table df['Glucose'].value_counts() Sample Output: 99 17 100 17 111 14 129 14 125 14
  • 47. 44 .. 191 1 177 1 44 1 62 1 190 1 Name: Glucose, Length: 136, dtype: int64 #skewness and kurtosis print("Skewness: %f" % df['Pregnancies'].skew())print("Kurtosis: %f" % df['Pregnancies'].kurt()) Sample Output: Skewness: 0.901674 Kurtosis: 0.159220 #find frequency of each letter grade pd.crosstab(index=df['Outcome'], columns='count')Sample Output: col_0 count Outcome 0 500 1 268 #create frequency table for 'points' df['Pregnancies'].value_count s() Sample Output: 1 135 0 111 2 103 3 75 4 68 5 57 6 50 7 45 8 38 9 28
  • 48. 45 10 24 11 11 13 10 12 9 14 2 15 1 17 1 Name: Pregnancies, dtype: int64 #find frequency of each letter grade pd.crosstab(index=df['Pregnancies'], columns='count')Sample Output: col_0 count Pregnancies 0 111 1 135 2 103 3 75 4 68 5 57 6 50 7 45 8 38 9 28 10 24 11 11 12 9 13 10 14 2 15 1 17 1 import matplotlib.pyplot as plt df.hist(column='BloodPressure', grid=False, edgecolor='black') Sample Output: array([[<AxesSubplot:title={'center':'BloodPressure'}>]], dtype=object)
#to create a density curve
import seaborn as sns
sns.kdeplot(df['BloodPressure'])

<AxesSubplot:xlabel='BloodPressure', ylabel='Density'>

#visualize Age for each record, coloured by Outcome, using the sns.scatterplot() function of the seaborn library
sns.scatterplot(x=df.index, y=df['Age'], hue=df['Outcome'])
import numpy as np

preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion), 3)*100, dtype=int)

preg = pd.DataFrame({'month': preg_month, 'count_of_preg_prop': preg_proportion, 'percentage_proportion': preg_proportion_perc})
preg.set_index(['month'], inplace=True)
preg.head(10)

Sample Output:
       count_of_preg_prop  percentage_proportion
month
1                     135                     17
0                     111                     14
2                     103                     13
3                      75                      9
4                      68                      8
5                      57                      7
6                      50                      6
7                      45                      5
8                      38                      4
9                      28                      3
import warnings
warnings.filterwarnings("ignore")

fig, axes = plt.subplots(nrows=3, ncols=2, dpi=120, figsize=(8, 6))

plot00 = sns.countplot('Pregnancies', data=df, ax=axes[0][0], color='green')
axes[0][0].set_title('Count', fontdict={'fontsize': 8})
axes[0][0].set_xlabel('Month of Preg.', fontdict={'fontsize': 7})
axes[0][0].set_ylabel('Count', fontdict={'fontsize': 7})
plt.tight_layout()

plot01 = sns.countplot('Pregnancies', data=df, hue='Outcome', ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize': 8})
axes[0][1].set_xlabel('Month of Preg.', fontdict={'fontsize': 7})
axes[0][1].set_ylabel('Count', fontdict={'fontsize': 7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.distplot(df['Pregnancies'], ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution', fontdict={'fontsize': 8})
axes[1][0].set_xlabel('Pregnancy Class', fontdict={'fontsize': 7})
axes[1][0].set_ylabel('Freq/Dist', fontdict={'fontsize': 7})
plt.tight_layout()

plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1], label='Non-Diab.')
plot11_2 = df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1], label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize': 8})
axes[1][1].set_xlabel('Pregnancy Class', fontdict={'fontsize': 7})
axes[1][1].set_ylabel('Freq/Dist', fontdict={'fontsize': 7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6')   # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6')   # for legend title
plt.tight_layout()
plot20 = sns.boxplot(df['Pregnancies'], ax=axes[2][0], orient='v')
axes[2][0].set_title('Pregnancies', fontdict={'fontsize': 8})
axes[2][0].set_xlabel('Pregnancy', fontdict={'fontsize': 7})
axes[2][0].set_ylabel('Five Point Summary', fontdict={'fontsize': 7})
plt.tight_layout()

plot21 = sns.boxplot(x='Outcome', y='Pregnancies', data=df, ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize': 8})
axes[2][1].set_xlabel('Pregnancy', fontdict={'fontsize': 7})
axes[2][1].set_ylabel('Five Point Summary', fontdict={'fontsize': 7})
plt.xticks(ticks=[0, 1], labels=['Non-Diab.', 'Diab.'], fontsize=7)
plt.tight_layout()
plt.show()

Sample Output:

RESULT:
Thus, the univariate analysis using the UCI diabetes data set was successfully executed and practically verified.
Ex.No: 5.b    Bivariate analysis using the UCI diabetes data set
Date:

AIM:
To read data from CSV files and explore various commands for doing bivariate analysis (linear and logistic regression modelling) using the UCI diabetes data set.

PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/ and use the Pandas library to load this CSV file and convert it into a DataFrame. The read_csv() method is used to read CSV files.

PROGRAM:

Linear regression modelling

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-50]
diabetes_X_test = diabetes_X[-50:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-50]
diabetes_y_test = diabetes_y[-50:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print("Coefficients: \n", regr.coef_)
Sample output:
Coefficients:
[945.4992184]

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))

Sample output:
Mean squared error: 3471.92

# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))

Sample output:
Coefficient of determination: 0.41

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
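The linear regression program above uses scikit-learn's bundled diabetes data set rather than the Pima Indians CSV. For comparing results across the two data sets (as required in part d of this exercise), a minimal sketch that fits the same kind of simple linear model on the Pima CSV is given below; the file path and the choice of BMI and Glucose as the variable pair are assumptions, not part of the original program.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the Pima Indians Diabetes CSV (path assumed, as in Ex.No 5.a)
df = pd.read_csv(r"E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")

# Simple bivariate model: predict Glucose from BMI
X = df[['BMI']]
y = df['Glucose']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

print("Coefficient:", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))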
Logistic regression modelling

#Import Sklearn packages
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#to create bar plots, histograms, boxplots etc.
import seaborn as sns
import matplotlib.pyplot as plt

#calculate accuracy measure and confusion matrix
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

#Loading Data
diabetes = pd.read_csv(r"E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")
diabetes

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

768 rows × 9 columns

#Train/Test split
X = diabetes.drop("Outcome", axis=1)
y = diabetes["Outcome"]    # target variable

# split data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

from sklearn.linear_model import LogisticRegression

# instantiate the model
model = LogisticRegression()

# fitting the model
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred[0:5]

# metrics
print("Accuracy for test set is {}.".format(round(metrics.accuracy_score(y_test, y_pred), 4)))
print("Precision for test set is {}.".format(round(metrics.precision_score(y_test, y_pred), 4)))
print("Recall for test set is {}.".format(round(metrics.recall_score(y_test, y_pred), 4)))

Sample Output:
Accuracy for test set is 0.7917.
Precision for test set is 0.7115.
Recall for test set is 0.5968.

print(metrics.classification_report(y_test, y_pred))

Sample Output:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       130
           1       0.71      0.60      0.65        62

    accuracy                           0.79       192
   macro avg       0.77      0.74      0.75       192
weighted avg       0.79      0.79      0.79       192
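The import comment above mentions a confusion matrix, but the program never displays one. A minimal sketch using the fitted model and the metrics module already imported (variable names as defined above) is:

# Confusion matrix for the logistic regression predictions
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Optionally display it as a labelled heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='GnBu',
            xticklabels=['Non-Diab.', 'Diab.'], yticklabels=['Non-Diab.', 'Diab.'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()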
#Visualization
f, ax = plt.subplots(figsize=(8,6))
sns.heatmap(diabetes.corr(), cmap="GnBu", annot=True, linewidths=0.5, fmt='.1f', ax=ax)
plt.show()

Result:
Thus, the bivariate analysis using the UCI diabetes data set was successfully executed and practically verified.
Ex.No: 5.c    Multiple Regression analysis using the UCI diabetes data set
Date:

AIM:
To read data from CSV files and explore various commands for doing multiple regression analysis using the UCI diabetes data set.

PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/ and use the Pandas library to load this CSV file and convert it into a DataFrame. The read_csv() method is used to read CSV files.

PROGRAM:

#import our libraries
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

#Loading Data
diabetes = pd.read_csv(r"E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")
diabetes

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

768 rows × 9 columns

# calculate the correlation matrix
corr = diabetes.corr()

# display the correlation matrix
display(corr)

                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695

                          DiabetesPedigreeFunction       Age   Outcome
Pregnancies                              -0.033523  0.544341  0.221898
Glucose                                   0.137337  0.263514  0.466581
BloodPressure                             0.041265  0.239528  0.065068
SkinThickness                             0.183928 -0.113970  0.074752
Insulin                                   0.185071 -0.042163  0.130548
BMI                                       0.140647  0.036242  0.292695
DiabetesPedigreeFunction                  1.000000  0.033561  0.173844
Age                                       0.033561  1.000000  0.238356
Outcome                                   0.173844  0.238356  1.000000
# plot the correlation heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu')

<AxesSubplot:>

#Train/Test split
X = diabetes.drop("Outcome", axis=1)
Y = diabetes[["Outcome"]]    # target variable

# split data into training and validation datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# create a Linear Regression model object
regression_model = LinearRegression()

# pass through the X_train & Y_train data set
regression_model.fit(X_train, Y_train)

LinearRegression()
# let's grab the coefficient of our model and the intercept
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]

print("The intercept for our model is {:.4}".format(intercept))
print('-'*100)

# loop through the features and print each coefficient
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {:.2}".format(coef[0], coef[1]))

Sample output:
The intercept for our model is -0.879
The Coefficient for Pregnancies is 0.015
The Coefficient for Glucose is 0.0057
The Coefficient for BloodPressure is -0.0021
The Coefficient for SkinThickness is 0.001
The Coefficient for Insulin is -0.00017
The Coefficient for BMI is 0.013
The Coefficient for DiabetesPedigreeFunction is 0.14
The Coefficient for Age is 0.0038

# Get multiple predictions
y_predict = regression_model.predict(X_test)

# Show the first 5 predictions
y_predict[:5]

array([[1.01391226],
       [0.21532924],
       [0.09157383],
       [0.60583158],
       [0.15988782]])

# define our input
X2 = sm.add_constant(X)

# create an OLS model
model = sm.OLS(Y, X2)

# fit the data
est = model.fit()
# print out a summary
print(est.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                Outcome   R-squared:                       0.303
Model:                            OLS   Adj. R-squared:                  0.296
Method:                 Least Squares   F-statistic:                     41.29
Date:                Sat, 15 Oct 2022   Prob (F-statistic):           7.36e-55
Time:                        19:14:26   Log-Likelihood:                -381.91
No. Observations:                 768   AIC:                             781.8
Df Residuals:                     759   BIC:                             823.6
Df Model:                           8
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -0.8539      0.085     -9.989      0.000      -1.022      -0.686
Pregnancies                  0.0206      0.005      4.014      0.000       0.011       0.031
Glucose                      0.0059      0.001     11.493      0.000       0.005       0.007
BloodPressure               -0.0023      0.001     -2.873      0.004      -0.004      -0.001
SkinThickness                0.0002      0.001      0.139      0.890      -0.002       0.002
Insulin                     -0.0002      0.000     -1.205      0.229      -0.000       0.000
BMI                          0.0132      0.002      6.344      0.000       0.009       0.017
DiabetesPedigreeFunction     0.1472      0.045      3.268      0.001       0.059       0.236
Age                          0.0026      0.002      1.693      0.091      -0.000       0.006
==============================================================================
Omnibus:                       41.539   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.183
Skew:                           0.395   Prob(JB):                     1.69e-07
Kurtosis:                       2.408   Cond. No.                     1.10e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are strong multicollinearity or other numerical problems.

# make some confidence intervals, 95% by default
est.conf_int()

                                 0         1
const                    -1.021709 -0.686079
Pregnancies               0.010521  0.030663
Glucose                   0.004909  0.006932
BloodPressure            -0.003925 -0.000739
SkinThickness            -0.002029  0.002338
Insulin                  -0.000475  0.000114
BMI                       0.009146  0.017343
DiabetesPedigreeFunction  0.058792  0.235682
Age                      -0.000419  0.005662

# estimate the p-values
est.pvalues

Sample output:
const                       3.707465e-22
Pregnancies                 6.561462e-05
Glucose                     2.691192e-28
BloodPressure               4.178788e-03
SkinThickness               8.895424e-01
Insulin                     2.285711e-01
BMI                         3.853484e-10
DiabetesPedigreeFunction    1.131733e-03
Age                         9.092163e-02
dtype: float64
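Note [2] above flags a large condition number, which can point to multicollinearity. variance_inflation_factor is imported at the top of this program but never used; a minimal sketch of a VIF check on the X2 matrix defined earlier (supplementary, not part of the original program) is:

# Variance Inflation Factor for each predictor; values well above 5-10 suggest multicollinearity
vif = pd.DataFrame()
vif["feature"] = X2.columns
vif["VIF"] = [variance_inflation_factor(X2.values, i) for i in range(X2.shape[1])]
print(vif)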
import math

# calculate the mean squared error
model_mse = mean_squared_error(Y_test, y_predict)

# calculate the mean absolute error
model_mae = mean_absolute_error(Y_test, y_predict)

# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)

# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))

MSE 0.148
MAE 0.322
RMSE 0.384

model_r2 = r2_score(Y_test, y_predict)
print("R2: {:.2}".format(model_r2))

R2: 0.32

import pickle

# pickle the model
with open('my_mulitlinear_regression.sav', 'wb') as f:
    pickle.dump(regression_model, f)

# load it back in
with open('my_mulitlinear_regression.sav', 'rb') as pickle_file:
    regression_model_2 = pickle.load(pickle_file)

# make a new prediction
regression_model_2.predict([X_test.loc[150]])

array([[0.42308994]])

Result:
Thus, the multiple regression analysis using the UCI diabetes data set was successfully executed and practically verified.
Ex.No: 6    Apply and explore various plotting functions on UCI data sets
Date:

AIM:
To read data from CSV files and apply and explore various plotting functions on UCI data sets.

PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/ and use the Pandas library to load this CSV file and convert it into a DataFrame. The read_csv() method is used to read CSV files.

PROGRAM:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

#Loading Data (as in Ex.No 5)
diabetes = pd.read_csv(r"E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")

#To run numerical descriptive stats for the data set
diabetes.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000
sns.kdeplot(diabetes["Pregnancies"], color="green", shade=True)
plt.show()
plt.figure()
plt.figure(figsize=(6,6))
sns.kdeplot(diabetes["Glucose"], color="green", shade=True)
plt.show()
plt.figure()

plt.figure(figsize=(8,8))
sns.kdeplot(diabetes["Age"], diabetes["BloodPressure"], cmap="RdYlBu", shade=True)
plt.show()
plt.figure()
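This exercise also lists normal curves. A minimal sketch that overlays a fitted normal curve on the Glucose histogram is given below; it uses scipy.stats.norm, which is an assumption since the program above does not import it.

from scipy.stats import norm

# Fit a normal distribution to Glucose and overlay its PDF on the histogram
data = diabetes['Glucose']
mu, sigma = data.mean(), data.std()

plt.figure(figsize=(6,6))
plt.hist(data, bins=30, density=True, alpha=0.5, color='green', edgecolor='black')
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), color='red', linewidth=2)
plt.title('Normal curve fitted to Glucose')
plt.xlabel('Glucose')
plt.ylabel('Density')
plt.show()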
plt.figure(figsize=(6,6))
sns.kdeplot(x=diabetes.Age, y=diabetes.Glucose, cmap="PRGn", shade=True, bw_adjust=1)
plt.show()

# calculate the correlation matrix
corr = diabetes.corr()

# display the correlation matrix
display(corr)

                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695

                          DiabetesPedigreeFunction       Age   Outcome
Pregnancies                              -0.033523  0.544341  0.221898
Glucose                                   0.137337  0.263514  0.466581
BloodPressure                             0.041265  0.239528  0.065068
SkinThickness                             0.183928 -0.113970  0.074752
Insulin                                   0.185071 -0.042163  0.130548
BMI                                       0.140647  0.036242  0.292695
DiabetesPedigreeFunction                  1.000000  0.033561  0.173844
Age                                       0.033561  1.000000  0.238356
Outcome                                   0.173844  0.238356  1.000000
import seaborn as sns
sns.scatterplot(x="Pregnancies", y="Glucose", data=diabetes);
sns.lmplot(x="Pregnancies", y="Glucose", hue="Outcome", data=diabetes);

# Histogram + Density Plot
sns.distplot(diabetes["Age"], color="green")
plt.show()
plt.figure()

# Adding Two Plots In One
sns.kdeplot(diabetes[diabetes.Outcome == 0]['Age'], color="blue")
sns.kdeplot(diabetes[diabetes.Outcome == 1]['Age'], color="orange", shade=True)
plt.show()
dia1 = diabetes[diabetes.Outcome == 1]
dia0 = diabetes[diabetes.Outcome == 0]

plt.figure(figsize=(20, 6))

plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(diabetes.Glucose, kde=False)

plt.subplot(1,3,2)
sns.distplot(dia0.Glucose, kde=False, color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label="Gluc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()

plt.subplot(1,3,3)
sns.boxplot(x=diabetes.Outcome, y=diabetes.Glucose)
plt.title("Boxplot for Glucose by Outcome")

Text(0.5, 1.0, 'Boxplot for Glucose by Outcome')
Three dimensional plotting:

import numpy as np                 # linear algebra
import pandas as pd                # data processing, CSV file I/O (e.g. pd.read_csv)
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt
import matplotlib
import functools
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

#Loading Data
diabetes = pd.read_csv(r"E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")
diabetes

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0
768 rows × 9 columns

x = diabetes.Age[:20]
y = diabetes.Glucose[:20]

def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z');

fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='viridis', edgecolor='none')
ax.set_title('surface');
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z')
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.scatter(X, Y, Z, cmap='viridis', linewidth=0.5);
ax.set_title('scatter');
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z')
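plotly.express is imported in this program but never used. As a supplementary sketch (not part of the original program), an interactive 3D scatter of the diabetes data can be drawn with it; the choice of Age, Glucose and BMI as the axes is an assumption.

# Interactive 3D scatter of Age, Glucose and BMI, coloured by Outcome
fig = px.scatter_3d(diabetes, x='Age', y='Glucose', z='BMI', color='Outcome', opacity=0.7)
fig.show()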
Result:
Thus, the various plotting functions (normal curves, density and contour plots, correlation and scatter plots, histograms and three dimensional plotting) using the UCI diabetes data set were successfully executed and practically verified.
Ex.No: 7    Visualizing Geographic Data with Basemap
Date:

AIM:
To read data from a CSV file and visualize geographic data with Basemap.

PROCEDURE:
Download the CSV file from https://www.kaggle.com/ and use the Pandas library to load this CSV file and convert it into a DataFrame. The read_csv() method is used to read CSV files.

PROGRAM:

import pandas as pd
import numpy as np
from numpy import array
import matplotlib as mpl

# for plots
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap import Basemap
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import warnings
warnings.filterwarnings("ignore")

cities = pd.read_csv(r"C:\Users\Admin\Downloads\datasets_557_1096_cities_r2.csv")
cities.head()

fig = plt.figure(figsize=(10,8))
states = cities.groupby('state_name')['name_of_city'].count().sort_values(ascending=True)
states.plot(kind="barh", fontsize=20)
plt.grid(b=True, which='both', color='Black', linestyle='-')
plt.xlabel('No of cities taken for analysis', fontsize=20)
plt.show()
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
map = Basemap(llcrnrlon=67, llcrnrlat=5, urcrnrlon=99, urcrnrlat=37, projection="lcc", lat_0=28, lon_0=77)
#map.bluemarble()
#map.fillcontinents(color="red")
map.drawmapboundary(color="red")
map.drawcountries(color="brown")
map.drawcoastlines(color="blue")

#draw states from the shapefile
map.readshapefile(r"C:\Users\Admin\Music\India_State_Shapefile\India_State_Boundary", "India_State_Boundary")

cities['latitude'] = cities['location'].apply(lambda x: x.split(',')[0])
cities['longitude'] = cities['location'].apply(lambda x: x.split(',')[1])

print("The Top 10 Cities sorted according to the Total Population (Descending Order)")
top_pop_cities = cities.sort_values(by='population_total', ascending=False)
top10_pop_cities = top_pop_cities.head(10)

#plt.subplots(figsize=(20, 15))
lg = array(top10_pop_cities['longitude'])
lt = array(top10_pop_cities['latitude'])
pt = array(top10_pop_cities['population_total'])
nc = array(top10_pop_cities['name_of_city'])

x, y = map(lg, lt)
population_sizes = top10_pop_cities["population_total"].apply(lambda x: int(x / 5000))
plt.scatter(x, y, s=population_sizes, marker="o", c=population_sizes, cmap=cm.Dark2, alpha=0.7)
for ncs, xpt, ypt in zip(nc, x, y):
    plt.text(xpt+60000, ypt+30000, ncs, fontsize=10, fontweight='bold')
plt.title('Top 10 Populated Cities in India', fontsize=20)

The Top 10 Cities sorted according to the Total Population (Descending Order)
Text(0.5, 1.0, 'Top 10 Populated Cities in India')
Result:
Thus, visualizing geographic data with Basemap was successfully executed and practically verified.
VIVA QUESTIONS

NumPy

1. What is NumPy?
Ans: NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python, offering a powerful N-dimensional array object and sophisticated (broadcasting) functions.

2. Why is NumPy used in Python?
Ans: NumPy is a package in Python used for scientific computing. The NumPy package is used to perform different operations. The ndarray (NumPy Array) is a multidimensional array used to store values of the same datatype. These arrays are indexed just like sequences, starting with zero.

3. What does NumPy mean in Python?
Ans: NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

4. Where is NumPy used?
Ans: NumPy is an open source numerical Python library. NumPy contains multi-dimensional array and matrix data structures. It can be utilised to perform a number of mathematical operations on arrays, such as trigonometric, statistical and algebraic routines. NumPy is an extension of Numeric and Numarray.
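A short, hypothetical illustration of the points above (not part of the manual's exercises):

import numpy as np

# A 2-D ndarray and a few of the operations mentioned above
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)          # (2, 3) - dimensions of the array
print(a.dtype)          # a single datatype shared by all elements
print(a * 2)            # broadcasting: every element multiplied by 2
print(np.sin(a))        # element-wise trigonometric routine
print(a.mean(axis=0))   # statistical routine along the first axis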
Pandas

1. What is Pandas?
Ans: Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

2. What is Python pandas used for?
Ans: Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. pandas is free software released under the three-clause BSD license.

3. What is a Series in Pandas?
Ans: A Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas Series is comparable to a single column in an Excel sheet.

4. Mention the different types of data structures in pandas?
Ans: There are two primary data structures supported by the pandas library, Series and DataFrame. Both are built on top of NumPy. Series is a one-dimensional data structure in pandas and DataFrame is the two-dimensional data structure in pandas. There is one more axis label known as Panel, a three-dimensional data structure that includes items, major_axis, and minor_axis.

5. Explain Reindexing in pandas?
Ans: Re-indexing means to conform a DataFrame to a new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. It changes the row labels and column labels of a DataFrame.

6. What are the key features of the pandas library?
Ans: There are various features in the pandas library and some of them are mentioned below:
 Data Alignment
 Memory Efficient
 Reshaping
 Merge and join
 Time Series

7. What is pandas used for?
Ans: This library is written for the Python programming language for performing operations like data manipulation, data analysis, etc. The library provides various operations as well as data structures to manipulate time series and numerical tables.
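A short, hypothetical illustration of Series, DataFrame and reindexing as described above:

import pandas as pd

# Series: a one-dimensional labelled array
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# DataFrame: a two-dimensional labelled data structure
df = pd.DataFrame({'marks': s})
df['grade'] = ['A', 'B', 'C']

# Reindexing: conform to a new index; missing labels are filled with NaN
print(df.reindex(['a', 'b', 'c', 'd']))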
8. How can we create a copy of a Series in Pandas?
Ans: pandas.Series.copy
Series.copy(deep=True)
Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied. Note that when deep=True data is copied, actual Python objects will not be copied recursively, only the reference to the object.

9. What is Time Series in pandas?
Ans: A time series is an ordered sequence of data which basically represents how some quantity changes over time. pandas contains extensive capabilities and features for working with time series data for all domains. pandas supports:
Parsing time series information from various sources and formats
Generating sequences of fixed-frequency dates and time spans
Manipulating and converting date time with timezone information
Resampling or converting a time series to a particular frequency
Performing date and time arithmetic with absolute or relative time increments

10. What is pylab?
Ans: PyLab is a package that combines NumPy, SciPy, and Matplotlib into a single namespace.

Jupyter Notebook

1. What is Jupyter Notebook?
Jupyter Notebook is a web-based interactive computing platform that allows users to create and share code, equations, visualizations, and narrative text. Jupyter Notebook is popular among data scientists and engineers as it allows for rapid prototyping and iteration.

2. What are the main features of Jupyter Notebook?
Jupyter Notebook lets users mix live code, its output, equations, visualizations, and explanatory text in a single document. It supports many programming languages through kernels, offers in-browser editing with syntax highlighting, displays rich output such as plots and tables inline, and notebooks can be easily shared. It is also used by educators to teach programming and data science concepts.

3. How can you create a new notebook in Jupyter?
You can create a new notebook in Jupyter by clicking on the "New" button in the upper right corner and selecting "Notebook" from the drop-down menu.

4. Can you explain what the data science workflow involves?
The data science workflow generally involves four main steps: data wrangling, exploratory data analysis, modeling, and evaluation. Data wrangling is the process of cleaning and preparing data for
analysis. Exploratory data analysis is the process of exploring data to find patterns and relationships. Modeling is the process of building models to make predictions or recommendations based on data. Evaluation is the process of assessing the accuracy of models and using them to make decisions.

5. What are some common use cases for Jupyter Notebook?
Jupyter Notebook is a popular tool for data scientists and analysts because it allows for an interactive coding experience. Jupyter Notebook is often used for exploratory data analysis and for visualizing data.

*******