DS LAB MANUAL.pdf

REGULATION – 2021
CS3361 – DATA SCIENCE LABORATORY
LAB MANUAL
YEAR / SEMESTER: II / III
Prepared by
P.SANTHIYA
Assistant Professor
Department of Computer Science and Engineering

CS3362 DATA SCIENCE LABORATORY L T P C
0 0 4 2
COURSE OBJECTIVES:
To understand the python libraries for data science.
To understand the basic Statistical and Probability
measures for data science.To learn descriptive analytics
on the benchmark data sets.
To apply correlation and regression analytics on
standard data sets. To present and interpret data
using visualization packages in Python.
LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter,
Statsmodelsand Pandaspackages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring
variouscommands for doingdescriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data
set forperforming thefollowing:
a. Univariate analysis: Frequency, Mean, Median, Mode,
Variance, StandardDeviation,Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap

LIST OF EQUIPMENTS :(30 Students per Batch)
Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statmodels, seaborn,
plotly, bokeh
Note: Example data sets like: UCI, Iris, Pima Indians Diabetes etc.
TOTAL: 60 PERIODS
COURSE OUTCOMES:
At the end of this course, the students will be able to:
 Make use of the python libraries for data science.
 Make use of the basic Statistical and Probability measures for data science.
 Perform descriptive analytics on the benchmark data sets.
 Perform correlation and regression analytics on standard data sets.
 Present and interpret data using visualization packages in Python.

1
Ex.No 1
Download, install and explore the features of NumPy, SciPy,
Jupyter,Statsmodels andpackages
Date:
AIM:
Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
packages.
Downloading and Installing Anaconda on Linux:
1. Getting Started:
2. Getting through the License Agreement:

2
3. Choose Installation Location:
4. Extracting Files and packages:

3
5. Initializing Anaconda Installation:
6. Finishing up the Installation:

4
7. Working with Anaconda:
>> anaconda-navigator

5
a) Installing Jupyter Notebook using Anaconda:
To install Jupyter using Anaconda, just go through the following instructions:
1. Launch Anaconda Navigator:
2. Click on the Install Jupyter Notebook Button:

6
3. Beginning the Installation:
4. Loading Packages:

7
5. Finished Installation:
6. Launching Jupyter:

8
b) Installing Jupyter Notebook using pip:
To install Jupyter using pip, the following command to update pip:
>> python3 -m pip install --upgrade pip
After updating the pip version, follow the instructions provided below to install Jupyter:
Command to install Jupyter:
>>pip3 install Jupyter
1. Beginning Installation:

9
2. Collecting Files and Data:
3. Downloading Packages:

10
4. Running Installation:
5. Finished Installation:

11
6. Launching Jupyter:
Use the following command to launch Jupyter using command-line:
>>jupyter notebook

12
Explore the following features of python packages:
1. NumPy:
NumPy stands for Numerical Python.NumPy (Numerical Python) is an open-source library for
the Python programming language. It is used for scientific computing and working with arrays. The
source code for NumPy is located at this github repository https://github.com/numpy/numpy.
Features:
1. High-performance N-dimensional array object.
2. It contains tools for integrating code from C/C++ and Fortran.
3. It contains a multidimensional container for generic data.
4. Additional linear algebra, Fourier transform, and random number capabilities.
5. It consists of broadcasting functions.
6. It had data type definition capability to work with varied databases.
2. SciPy:
SciPy stands for Scientific Python. SciPy is a scientific computation library that uses NumPy
underneath. The source code for SciPy is located at this github repository
https://github.com/scipy/scipy
Features:
1. SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems,
algebraic equations, differential equations, statistics and many other classes of problems.
2. It provides more utility functions for optimization, stats and signal processing.
Numpy vs. SciPy
Numpy and SciPy both are used for mathematical and numerical analysis. Numpy is suitable for
basicoperations such as sorting, indexing and many more because it contains array data, whereas SciPy
consistsof all the numeric data.
Numpy contains many functions that are used to resolve the linear algebra, Fourier transforms,
etc. whereas SciPy library contains full featured version of the linear algebra module as well many other
numerical algorithms.

13
3. Pandas:
Python Pandas is defined as an open-source library that provides high-performance data
manipulation in Python. The name ofPandas isderived from the word Panel Data, which meansan
Econometrics from Multidimensional data. It is used for data analysis in Python. Pandas is built
on top of the Numpy package, means Numpy is required for operating the Pandas.
Features:
1. Group by data for aggregations and transformations.
2. It has a fast and efficient DataFrame object with the default and customized indexing.
3. Used for reshaping and pivoting of the data sets.
4. It is used for data alignment and integration of the missing data.
5. Provide the functionality of Time Series.
6. Process a variety of data sets in different formats like matrix data, tabular heterogeneous,time
series.
7. Handle multiple operations of the data sets such as subsetting, slicing, filtering, groupBy, re-
ordering, and re-shaping.
4. Statsmodels:
statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests, and statistical data
exploration. The package is released under the open source Modified BSD (3-clause) license. The
online documentation is hosted at statsmodels.org.
Features:
1. Linear regression models like Ordinary least squares, Generalized least
squares,Weighted least squares, Least squares with autoregressive errors.
2. Bayesian Mixed GLM for Binomial and Poisson
3. GEE: Generalized Estimating Equations for one-way clustered or longitudinal data
4. Nonparametric statistics: Univariate and multivariate kernel density estimators
5. Datasets: Datasets used for examples and in testing
6. Sandbox: statsmodels contains a sandbox folder with code in various
stages ofdevelopment and testing.
RESULT:
Thus, the NumPy, SciPy, Jupyter, Statsmodels packages have been successfully
download,install and explore their features.

14
Ex.No 2
Working with Numpy arrays
Date:
AIM:
To write a Numpy arrays program to demonstrate basic array concepts in Jupyter Notebook.
PROGRAM:
1. Creating Arrays from Python Lists:
In[1]: import numpy as np
In[2]: # integer array:
np.array([1, 4, 2, 5, 3])
Out[2]: array([1, 4, 2, 5, 3])
#NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will
upcast ifpossible (here, integers are upcast to floating point):
In[3]: np.array([3.14, 4, 2, 3])
Out[3]: array([ 3.14, 4. , 2. , 3. ])
#If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:
In[4]: np.array([1, 2, 3, 4], dtype='float32')
Out[4]: array([ 1., 2., 3., 4.], dtype=float32)
In[5]: # nested lists result in multidimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
Out[5]: array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
2. NumPy Array Attributes:
In[1]: import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

15
Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and
size (thetotal size of the array):
In[2]: print("x3 ndim: ",
x3.ndim) print("x3 shape:",
x3.shape)print("x3 size: ",
x3.size)
Out[2]:x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
In[3]: print("dtype:", x3.dtype)# data type of the array
Out[3]:dtype: int64
In[4]: print("itemsize:", x3.itemsize,
"bytes")print("nbytes:", x3.nbytes,
"bytes")
Out[4]:itemsize: 8 bytes
Out[4]:nbytes: 480 bytes
3. Array Indexing: Accessing Single Elements:
In[5]: x1
Out[5]: array([5, 0, 3, 3, 7, 9])
In[6]:
x1[0]
Out[6
]: 5
In[7]:
x1[4]
Out[7]: 7
#To index from the end of the array, you can use negative indices
In[8]:
x1[-1]
Out[8]
: 9
In[9]:
x1[-2]
Out[9]: 7

16
#In a multidimensional array, you access items using a comma-separated tuple of indices
In[10]: x2
Out[10]: array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
In[11]: x2[0, 0]
Out[11]: 3
In[12]: x2[2, 0]
Out[12]: 1
In[13]: x2[2, -1]
Out[13]: 7
#modify values using any of the above index notation
In[14]: x2[0, 0] = 12
x2
Out[14]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[15]: x1[0] = 3.14159 # this will be truncated!
x1
Out[15]: array([3, 0, 3, 3, 7, 9])
4. Array Slicing: Accessing Subarrays
#One-dimensional subarrays
In[16]: x =
np.arange(10)x
Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In[17]: x[:5] # first five elements
Out[17]: array([0, 1, 2, 3, 4])
In[18]: x[5:] # elements after index 5
Out[18]: array([5, 6, 7, 8, 9])
In[19]: x[4:7] # middle subarray
Out[19]: array([4, 5, 6])
In[20]: x[::2] # every other element
Out[20]: array([0, 2, 4, 6, 8])
In[21]: x[1::2] # every other element, starting at index 1
Out[21]: array([1, 3, 5, 7, 9])

17
In[22]: x[::-1] # all elements, reversed
Out[22]: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In[23]: x[5::-2] # reversed every other from index 5
Out[23]: array([5, 3, 1])
5. Multidimensional subarrays:
In[24]: x2
Out[24]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[25]: x2[:2, :3] # two rows, three columns
Out[25]: array([[12, 5, 2],
[ 7, 6, 8]])
In[26]: x2[:3, ::2] # all rows, every other column
Out[26]: array([[12, 2],
[ 7, 8],
[ 1, 7]])
#Finally, subarray dimensions can even be reversed together:
In[27]: x2[::-1, ::-1]
Out[27]: array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])
6. Accessing array rows and columns:
In[28]: print(x2[:, 0]) # first column of x2
[12 7 1]
In[29]: print(x2[0, :]) # first row of x2
[12 5 2 4]
#In the case of row access, the empty slice can be omitted for a more compact syntax:
In[30]: print(x2[0]) # equivalent to x2[0, :]
[12 5 2 4]
In[31]: print(x2)
[[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
#extract a 2×2 subarray from this:
In[32]: x2_sub = x2[:2, :2]

18
print(x2_sub)
[[12 5]
[ 7 6]]
#modify this subarray
In[33]: x2_sub[0, 0] = 99
print(x2_sub)
[[99 5]
[ 7 6]]
In[34]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
7. Creating copies of arrays:
In[35]: x2_sub_copy = x2[:2,
:2].copy()print(x2_sub_copy)
[[99 5]
[ 7 6]]
#modify this subarray
In[36]: x2_sub_copy[0,
0] = 42
print(x2_sub_copy) [[42
5]
[ 7 6]]
In[37]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Reshaping of Arrays:
In[38]: grid = np.arange(1,
10).reshape((3, 3))print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]

19
In[39]: x = np.array([1, 2, 3])
# row vector via
reshape
x.reshape((1, 3))
Out[39]: array([[1, 2, 3]])
In[40]: # row vector via newaxis
x[np.newaxis, :]
Out[40]: array([[1, 2, 3]])
In[41]: # column vector via reshape
x.reshape((3, 1))
Out[41]:
array([[1],[2],
[3]])
In[42]: # column vector via newaxis
x[:, np.newaxis]
Out[42]: array([[1],
[2],
[3]])
8. Array Concatenation and Splitting:
In[43]: x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Out[43]: array([1, 2, 3, 3, 2, 1])
#concatenate more than two arrays at once:
In[44]: z = [99, 99, 99]
print(np.concatenate([x, y, z]))[
1 2 3 3 2 1 99 99 99]
#np.concatenate can also be used for two-dimensional arrays:
In[45]: grid = np.array([[1, 2, 3],
[4, 5, 6]])
In[46]: # concatenate along the first
axis
np.concatenate([grid, grid])
Out[46]: array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],

20
[4, 5, 6]])
In[47]: # concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
Out[47]: array([[1, 2, 3, 1,
2, 3],
[4, 5, 6, 4, 5, 6]])
In[48]: x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
# vertically stack the arrays
np.vstack([x, grid])
Out[48]: array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
In[49]: # horizontally stack the
arraysy = np.array([[99],[99]])
np.hstack([grid, y])
Out[49]: array([[ 9, 8,
7, 99],
[ 6, 5, 4, 99]])
Splitting of arrays:
In[50]: x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
In[51]: grid =
np.arange(16).reshape((4, 4))grid
Out[51]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In[52]: upper, lower =
np.vsplit(grid, [2])print(upper)
print(lower)
[[0 1 2 3]

21
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
In[53]: left, right = np.hsplit(grid, [2])
print(left)
print(right)
Out[53]: [[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
9. Exploring NumPy’s UFuncs:
In[54]: x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) # floor division
Out[54]: x = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [ 0. 0.5 1. 1.5]
x // 2 = [0 0 1 1]
In[8]: print("-x = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2 = ", x % 2)
Out[8]:-x = [ 0 -1
-2 -3]
x ** 2 = [0 1 4 9]

22
x % 2 = [0 1 0 1]
In[9]: -(0.5*x + 1) ** 2
Out[9]: array([-1. , -2.25, -4. , -6.25])
In[10]: np.add(x, 2)
Out[10]: array([2, 3, 4, 5])
10. Absolute value:
In[11]: x = np.array([-2, -1, 0, 1, 2])
abs(x)
Out[11]: array([2, 1, 0, 1, 2])
In[12]: np.absolute(x)
Out[12]: array([2, 1, 0, 1, 2])
In[13]: np.abs(x)
Out[13]: array([2, 1, 0, 1, 2])
In[14]: x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0
+ 1j])np.abs(x)
Out[14]: array([ 5., 5., 2., 1.])
11. Trigonometric functions:
In[15]: theta = np.linspace(0, np.pi, 3)
In[16]: print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
Out[16]:theta = [ 0. 1.57079633 3.14159265]
sin(theta) = [ 0.00000000e+00 1.00000000e+00 1.22464680e-16]
cos(theta) = [ 1.00000000e+00 6.12323400e-17 -1.00000000e+00]
tan(theta) = [ 0.00000000e+00 1.63312394e+16 -1.22464680e-16]
In[17]: x = [-1, 0, 1]
print("x = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
Out[17]:x = [-1, 0, 1]
arcsin(x) = [-1.57079633 0. 1.57079633]

23
arccos(x) = [ 3.14159265 1.57079633 0. ]
arctan(x) = [-0.78539816 0. 0.78539816]
12. Exponents and logarithms:
In[18]: x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
Out[18]:x = [1, 2, 3]
e^x = [ 2.71828183 7.3890561 20.08553692]
2^x = [ 2. 4. 8.]
3^x = [ 3 9 27]
In[19]: x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
Out[19]:x = [1, 2, 4, 10]
ln(x) = [ 0. 0.69314718 1.38629436 2.30258509]
log2(x) = [ 0. 1. 2. 3.32192809]
log10(x) = [ 0. 0.30103 0.60205999 1. ]
In[20]: x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))
Out[20]:exp(x) - 1 = [ 0. 0.0010005 0.01005017 0.10517092]
log(1 + x) = [ 0. 0.0009995 0.00995033 0.09531018]
13. Specialized ufuncs:
In[21]: from scipy import special
In[22]: # Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2) =", special.beta(x, 2))
Out[22]:gamma(x) = [ 1.00000000e+00 2.40000000e+01
3.62880000e+05]ln|gamma(x)| = [ 0. 3.17805383 12.80182748]

24
beta(x, 2) = [ 0.5 0.03333333 0.00909091]
In[23]: # Error function (integral of Gaussian) its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x) =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))
Out[23]:erf(x) = [ 0. 0.32862676 0.67780119 0.84270079]
erfc(x) = [ 1. 0.67137324 0.32219881 0.15729921]
erfinv(x) = [ 0. 0.27246271 0.73286908 inf]
14. Aggregates:
In[26]: x = np.arange(1, 6)
np.add.reduce(x)
In[27]: np.multiply.reduce(x)
Out[27]: 120
In[28]: np.add.accumulate(x)
Out[28]: array([ 1, 3, 6, 10, 15])
In[29]: np.multiply.accumulate(x)
Out[29]: array([ 1, 2, 6, 24, 120])
15. Outer products:
In[30]: x = np.arange(1, 6)
np.multiply.outer(x, x)
Out[30]: array([[ 1, 2, 3, 4, 5],
[ 2, 4, 6, 8, 10],
[ 3, 6, 9, 12, 15],
[ 4, 8, 12, 16, 20],
[ 5, 10, 15, 20, 25]])
RESULT:
Thus, the Numpy array program was successfully executed and verified.

25
Ex.No 3
Working of Pandas DataFrame
Date:
AIM:
To write a Pandas program using dictionary sample dataframe to perform opertations in its DataFrame.
Sample DataFrame:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
PROGRAM:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print(df)
print("Summary of the basic information about this DataFrame and its data:")
print(df.info())

26
Sample Output:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Summary of the basic information about this DataFrame and its data:
<class 'pandas.core.frame.DataFrame'>Index:
10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype
0 name 10 non-null object
1 score 8 non-null float64
2 attempts 10 non-null int64
3 qualify 10 non-null object
dtypes: float64(1), int64(1), object(2)memory
usage: 400.0+ bytes
None
i. To get the first 3 rows of a given DataFrame.
print("First three rows of the data frame:")
print(df.iloc[:3])
Sample Output:
First three rows of the data frame:
b 3 Dima no 9.0
ii. To select the 'name' and 'score' columns from the following DataFrame.
print("Select specific columns:")
print(df[['name', 'score']])

27
Sample Output:
Select specific columns:
name score
a Anastasia 12.5
b Dima 9.0
c Katherine 16.5
d James NaN
e Emily 9.0
f Michael 20.0
g Matthew 14.5
h Laura NaN
i Kevin 8.0
j Jonas 19.0
iii. To select the specified columns and rows from a given DataFrame. Select 'name' and'score'
columns in rows 1, 3, 5, 6 from the following data frame.
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes
iv. To select the rows where the number of attempts in the examination is greater than 2.
print("Number of attempts in the examination is greater than 2:")
print(df[df['attempts'] > 2])
Sample Output:
Number of attempts in the examination is greater than 2:
name score attempts qualify
b Dima 9.0 3 no
d James NaN 3 no
f Michael 20.0 3 yes
v. To select the rows where the score is missing, i.e. is NaN.
print("Rows where score is missing:")
print(df[df['score'].isnull()])

28
Sample Output:
Rows where score is missing: attempts
name qualify score
d 3 James no NaN
h 1 Laura no NaN
vi. To change the score in row 'd' to 11.5.
print("nOriginal data frame:")print(df)
print("nChange the score in row 'd' to 11.5:")df.loc['d',
'score'] = 11.5
print(df)
Sample Output:
Original data frame:
b 3 Dima no 9.0
d 3 James no NaN
e 2 Emily no 9.0
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Change the score in row 'd' to 11.5:
Attempts name qualify score
b 3 Dima no 9.0
d 3 James no 11.5
e 2 Emily no 9.0
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0

29
vii. To calculate the sum of the examination score by the students.
print("nSum of the examination attempts by the students:")
print(df['score'].sum())
Sample Output:
Sum of the examination attempts by the students:
108.5
viii. To append a new row 'k' to DataFrame with given values for each column. Now delete thenew
row and return the original data frame.
print("Original rows:")print(df)
print("nAppend a new row:") df.loc['k'] =
[1, 'Suresh', 'yes', 15.5]
print("Print all records after insert a new record:")print(df)
print("nDelete the new row and display the original rows:")df =
df.drop('k')
print(df)
Sample Output:
Original rows:
b 3 Dima no 9.0
d 3 James no NaN
e 2 Emily no 9.0
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Append a new row:
Print all records after insert a new record:attempts
name qualify score

30
b 3 Dima no 9.0
d 3 James no NaN
e 2 Emily no 9.0
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
k 1 Suresh yes 15.5
Delete the new row and display the original rows:attempts
name qualify score
b 3 Dima no 9.0
d 3 James no NaN
e 2 Emily no 9.0
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
ix. To delete the 'attempts' column from the DataFrame.
print("nDelete the 'attempts' column from the data frame:")
df.pop('attempts')
print(df)

31
Sample Output:
Original rows:
b 3 Dima no 9.0
d 3 James no NaN
e 2 Emily no 9.0
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Delete the 'attempts' column from the data frame:
name qualify score
a Anastasia yes 12.5
b Dima no 9.0
c Katherine yes 16.5

d James no NaN
e Emily no 9.0
f Michael yes 20.0
g Matthew yes 14.5
h Laura no NaN
i Kevin no 8.0
j Jonas yes 19.0
x. To insert a new column in existing DataFrame.
color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']df['color'] =
color

32
print("nNew DataFrame after inserting the 'color' column")print(df)
Sample Output
Original rows:
b 3 Dima no 9.0
d 3 James no NaN
e 2 Emily no 9.0
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
New DataFrame after inserting the 'color' column
Attempts name qualify score color
a 1 Anastasia yes 12.5 Red
b 3 Dima no 9.0 Blue
c 2 Katherine yes 16.5 Orange
d 3 James no NaN Red
e 2 Emily no 9.0 White
f 3 Michael yes 20.0 White
g 1 Matthew yes 14.5 Blue
h 1 Laura no NaN Green
i 2 Kevin no 8.0 Green
j 1 Jonas yes 19.0 Red
RESULT:
Thus, the working of Pandas Dataframe using Dictionary was executed and verified successfully.

33
Ex.No:4
Descriptive analytics on the Iris data set
Date:
AIM:
To Reading data from text files, Excel and the web and exploring various commands for doing descriptive
analytics on the Iris data set.
PROCEDURE:
Download the Iris.csv file from the https://www.kaggle.com/datasets/uciml/iris and use the Pandas library to
load this CSV file,and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
df = pd.read_csv("Music/Iris.csv")# Reading the CSV file
print(df)
print(df.d
types)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa

34
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
[150 rows x 6 columns]
Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object dtype: object
# Printing top 5 rows
print(df.head())
#To shape parameter to get the shape of the dataset.
print(df.shape)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
(150, 6)
#To know the columns and their data types use the info() method.
df.info()
<class
'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to
149

35
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
Species
150 non-null object
dtypes: float64(4),
int64(1),
object(1)
memory usage: 7.2+ KB
print(df.describe())
count 150.000000 150.000000 150.000000 150.000000 150.000000
m
e
a
n
75.500000 5.843333 3.054000 3.758667 1.198667
s
t
d
43.445368 0.828066 0.433594 1.764420 0.763161
m
i
n
1.000000 4.300000 2.000000 1.000000 0.100000
2
5
%
38.250000 5.100000 2.800000 1.600000 0.300000
5
0
%
75.500000 5.800000 3.000000 4.350000 1.300000
7
5
%
112.750000 6.400000 3.300000 5.100000 1.800000
m
a
x
150.0000
00
7.900000 4.40000
0
6.90000
0
2.500000

36
# Missing values can occur when no information is provided
print(df.isnull().sum())
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
# To check dataset contains any duplicates
or notdata = df.drop_duplicates(subset
="Species",) print(data)
0 1 5.1 3.5 1.4 0.2
5
0
5
1
7.0 3.
2
4.7 1.4
1
0
0
1
0
1
6.3 3.3 6.0 2.5
Species
0 Iris-
setosa
50 Iris-versicolor
100 Iris-virginica
#To find unique species from the given dataset
print(df.value_counts("Species"))
Species
Iris-setosa50
Iris-versicolor 50
Iris-virginica 50
dtype: int64

37
#matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#seaborn
import seaborn as sns
#plot the variable ‘sepal.width’
plt.scatter(df.index,df['SepalWidthCm'])plt.show()
#visualize the same plot by considering its variety using the sns.scatterplot() function of the
seaborn library.
sns.scatterplot(x=df.index,y=df['SepalWidthCm'],hue=df['Species'])

38
#visualizes data by connecting the data points via line segments.
plt.figure(figsize=(6,6))
plt.title("line plot for petal length")
plt.xlabel('index',fontsize=20)
plt.ylabel('PetalLengthCm',fontsize=20)
plt.plot(df.index,df['PetalLengthCm'],markevery=1,marker='d') for name,group in
df.groupby('Species'):
plt.plot(group.index,group['PetalLengthCm'],label=name,markevery=1,marker='d')
plt.legend()
plt.show()
#Plotting histogram using the matplotlib plt.hist() function :
plt.hist(df["PetalWidthCm"])

39
sns.distplot(df["PetalWidthCm"],kde=False,color='RED',bins=10)
<AxesSubplot:xlabel='PetalWidthCm'>
RESULT:
Thus, the descriptive analysis on the iris data set was successfully executed and practically
verified.

40
Ex.No:5.a
Univariate analysis using the UCI diabetes data set
Date:
AIM:
To Reading data from csv files exploring various commands for doing Univariate analysis using the
UCI diabetes data set.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/datasets/uciml/pima-
indians-diabetes-database and use the Pandaslibrary to load this CSV file, and convert it into the dataframe.
read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
df = pd.read_csv("E:DATA
SCIENCEPima_indian_diabetesdiabetes.CSV")print(df)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
.. ... ...
.
.
.
.
.
.
..
.
...
763 10 101 76 48 180 32.9
764 2 122 70 27 0 36.8
765 5 121 72 23 112 26.2
766 1 126 60 0 0 30.1
767 1 93 70 31 0 30.4
DiabetesPedigreeFunction Age Outcome
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
.. ... ... ...
763 0.171 63 0
764 0.340 27 0

41
765 0.245 30 0
766 0.349 47 1
767 0.315 23 0
[768 rows x 9 columns]
# To know data type print(df.dtypes)
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction
float64Age int64
Outcome int64
dtype: object
#To print fiest 5 rows
print(df.head())
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
DiabetesPedigreeFunction Age Outcome
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
#To shape parameter to get the shape of the dataset.
print(df.shape)(768, 9)

42
#calculate mean
print("Mean of Preganancies: %f"
%df['Pregnancies'].mean()) print("Mean of BloodPressure:
%f" %df['BloodPressure'].mean())print("Mean of Glucose:
%f" %df['Glucose'].mean()) print("Mean of Age: %f"
%df['Age'].mean())
Sample Output:
Mean of Preganancies:
3.845052 Mean of
BloodPressure: 69.105469
Mean of Glucose:
120.894531 Mean of Age:
33.240885 #calculate
median
print("median of Preganancies: %f"
%df['Pregnancies'].median()) print("median of BloodPressure:
%f" %df['BloodPressure'].median())print("medianf Glucose:
%f" %df['Glucose'].median())
print("median of Age: %f" %df['Age'].median())
Sample Output:
median of Preganancies:
3.000000 median of
BloodPressure: 72.000000
medianf Glucose: 117.000000
median of Age: 29.000000
#calculate standard deviation of 'points'
print("standard deviation for BloodPressure: %f" %
df['BloodPressure'].std())print("standard deviation for Glucose: %f" %
df['Glucose'].std()) print("standard deviation for Pregnancies: %f" %
df['Pregnancies'].std()) Sample Output:
standard deviation for BloodPressure:
19.355807standard deviation for Glucose:
31.972618 standard deviation for
Pregnancies: 3.369578 #To describe the
data

43
df.Glucose.describe()
Sample Output:
count 768.000000
mean 120.894531
std 31.972618
min 0.000000
25% 99.000000
50% 117.000000
75% 140.250000
max 199.000000
Name: Glucose, dtype: float64
#create frequency table
df['Pregnancies'].value_counts()
Sample Output:
99 17
100 17
111 14
129 14
125 14
..
191 1
177 1
44 1
62 1
190 1
Name: Glucose, Length: 136, dtype:
int64#create frequency table
df['Glucose'].value_counts()
Sample Output:
99 17
100 17
111 14
129 14
125 14

44
..
191 1
177 1
44 1
62 1
190 1
Name: Glucose, Length: 136, dtype: int64
#skewness and kurtosis
print("Skewness: %f" %
df['Pregnancies'].skew())print("Kurtosis:
%f" % df['Pregnancies'].kurt()) Sample
Output:
Skewness: 0.901674
Kurtosis: 0.159220
#find frequency of each letter grade
pd.crosstab(index=df['Outcome'],
columns='count')Sample Output:
col_0 count
Outcome
0 500
1 268
#create frequency table for
'points'
df['Pregnancies'].value_count
s() Sample Output:
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28

45
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64
#find frequency of each letter grade
pd.crosstab(index=df['Pregnancies'],
columns='count')Sample Output:
col_0 count
Pregnancies
0 111
1 135
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
12 9
13 10
14 2
15 1
17 1
df.hist(column='BloodPressure', grid=False,
edgecolor='black')
Sample Output:
array([[<AxesSubplot:title={'center':'BloodPressure'}>]], dtype=object)

46
#to create a density curve
sns.kdeplot(df['BloodPres
sure'])
<AxesSubplot:xlabel='BloodPressure', ylabel='Density'>
#visualize the same plot by considering its variety using the sns.scatterplot() function of the seaborn
library.sns.scatterplot(x=df.index,y=df['Age'],hue=df['Outcome'])

47
import numpy as np
preg_proportion =
np.array(df['Pregnancies'].value_counts()) preg_month =
np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc =
np.array(np.round(preg_proportion/sum(preg_proportion),3)*100,dtype=int)preg =
pd.DataFrame({'month':preg_month,'count_of_preg_prop':preg_proportion,'percentage_proportion':preg_pr
oportion_perc})
preg.set_index(['month'],inplace=
True)preg.head(10)
Sample Output:
month count_of_preg_prop percentage_proportion
1 135 17
0 111 14
2 103 13
3 75 9
4 68 8
5 57 7
6 50 6
7 45 5
8 38 4
9 28 3

48
import warnings
warnings.filterwarnings("igno
re")
fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.countplot('Pregnancies',data=df,ax=axes[0][0],color='gre
en') axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()
plot01=sns.countplot('Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-
Diab.',fontdict={'fontsize':8})axes[0][1].set_xlabel('Month
of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies
Distribution',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()
plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-
Diab.')
plot11_2=df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.
') axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for
legend textplt.setp(axes[1][1].get_legend().get_title(), fontsize='6')
# for legend title plt.tight_layout()

49
plot20 = sns.boxplot(df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()
plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,
ax=axes[2][1])axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()
Sample Output:
RESULT:
Thus, the Univariate analysis using the UCI diabetes data set was successfully executed and
practically verified.

50
Ex.No:5.b
Bivariate analysis using the UCI diabetes data set
Date:
AIM:
To Reading data from text files, Excel and the web and exploring various commands for doing Bivariate
analysis using the UCI diabetes data set.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/ and use the Pandas
library to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
Linear regression modelling
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:,
np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-50]
diabetes_X_test = diabetes_X[-50:]
# Split the targets into
training/testing sets
diabetes_y_train = diabetes_y[:-50]
diabetes_y_test = diabetes_y[-50:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients print("Coefficients: n", regr.coef_)

51
Sample output:
Coefficients: [945.4992184]
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
Sample output:
Mean squared error: 3471.92
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
Sample output:
Coefficient of determination: 0.41
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue",
linewidth=3)plt.xticks(())
plt.yticks(())plt.show()

52
Logistic regression modelling
#Import Sklearn Packages
import numpy as np
import pandas as pd
from sklearn.linear_model
import LogisticRegressionfrom sklearn.model_selection
import train_test_split
#to create plot_bar,histogram,boxplot etc
#calculate accurancy measure and confusion matrix
from sklearn import metrics
import warnings warnings.filterwarnings("ignore")
#Loading Data
diabetes=pd.read_csv("E:DATA SCIENCEPima_indian_diabetesdiabetes.csv")
diabetes
Preg
nanci
es
Glucose Blood
Pressure
Skin
Thickness
Insulin DiabetesPedigree
Function
Outcome
6 14
8
72 35 0 0.627 1
1 85 66 29 0 0.351 0
8 18
3
64 0 0 0.672 1
1 89 66 23 94 0.167 0
0 13
7
40 35 168 2.288 1
.
.
.
... ... ... ... ... ...
1
0
10
1
76 48 180 0.171 0
2 12
2
70 27 0 0.340 0
5 12
1
72 23 112 0.245 0

53
Pregnancies Glucose BloodPressure SkinThickn
ess
DiabetesPedigreeFunction Outcome
1 126 60 0 0.349 1
1 93 70 31 0.315 0
768 rows × 9 columns
#Train/Test split
X=df.drop("Outcome",axis=1)
Y=df[["Outcome"]]
# target variable
# split data into training and validation datasets
X_train, X_test, y_train, y_t est = train_test_split(X, y, test_size=0.25, random_state=0)
import LogisticRegression
# instantiate the model
model = LogisticRegression()
# fitting the model
model.fit(X_train, y_train) y_pred = model.predict(X_test)y_pred[0:5]
# metrics
print("Accuracy for test set is {}.".format(round(metrics.accuracy_score(y_test, y_pred), 4)))
print("Precision for test set is {}.".format(round(metrics.precision_score(y_test, y_pred), 4)))
print("Recall for test set is {}.".format(round(metrics.recall_score(y_test, y_pred), 4)))
Sample Output:
Accuracy for test set is
0.7917.Precision for test
set is 0.7115.Recall for
test set is 0.5968.
print(metrics.classification_report(y_test, y_pred))
Sample Output:
precision recall f1-score support
0 0.82 0.88 0.85 130
1 0.71 0.60 0.65 62
accuracy 0.79 192
macro avg 0.77 0.74 0.75 192
weighted avg 0.79 0.79 0.79 192

54
#Visualization
f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(df.corr(), cmap="GnBu", annot=True, linewidths=0.5, fmt=
'.1f',ax=ax)plt.show()
Result:
Thus, the Bivariate analysis using the UCI diabetes data set was successfully executed and

55
Ex.No:5.c
Multiple Regression analysis using the UCI diabetes data set
Date:
AIM:
To Reading data from Excel and exploring various commands for doing Multiple Regression analysis
using the UCI diabetes data set.
PROCEDURE:
PROGRAM:
#import our Libraries
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats
import diagnostic as diag
from statsmodels.stats.outliers_influence
import variance_inflation_factor
import LinearRegression
from sklearn.model_selection
import train_test_split
from sklearn.metrics
import mean_squared_error, r2_score, mean_absolute_error
import warnings warnings.filterwarnings("ignore")
%matplotlib inline
#Loading Data
diabetes
Preg
nanci
es
Glucose Blood
Pressure
Skin
Thickness
Insulin DiabetesPedigree
Function
Outcome
6 148 72 35 0 0.627 1
1
85 66 29 0 0.351 0
8
183 64 0 0 0.672 1
1
89 66 23 94 0.167 0
0
137 40 35 168 2.288 1

56
.
.
.
... ... ... ... ... ...
1
0
101 76 48 180 0.171 0
2
122 70 27 0 0.340 0
5
121 72 23 112 0.245 0
# calculate the correlation matrix
corr=diabetes.corr()
# display the correlation matrix
display(corr)
# plot
the
Pre
gna
ncie
s
Glucose
BloodPr
essure
SkinT
hic
kness
Insuli
n BMI
Diabetes
Pedigree
Function
Age Outcome
Pregnan
cies
1.00
000
0
0.129
459
0.1412
82
-
0.0816
72
-
0.073
535
0.017
683
-
0.033523
0.54
4
341
0.221
898
Glucose
0.12
945
9
1.000
000
0.1525
90
0.057
328
0.331
357
0.221
071 0.137337
0.26
3
514
0.466
581
BloodPr
essure
0.14
128
2
0.152
590
1.0000
00
0.207
371
0.088
933
0.281
805 0.041265
0.23
9
528
0.065
068
SkinThi
ckness
-
0.08
167
2
0.057
328
0.2073
71
1.000
000
0.436
783
0.392
573 0.183928
-
0.1
13
970
0.074
752
Insulin
-
0.07
353
5
0.331
357
0.0889
33
0.436
783
1.000
000
0.197
859 0.185071
-
0.0
42
163
0.130
548
BMI
0.01
768
3
0.221
071
0.2818
05
0.392
573
0.197
859
1.000
000 0.140647
0.03
6
242
0.292
695
DiabetesP
edigree
Function
-
0.03
352
3
0.137
337
0.0412
65
0.183
928
0.185
071
0.140
647 1.000000
0.03
3
561
0.173
844
Age
0.54
434
1
0.263
514
0.2395
28
-
0.1139
70
-
0.042
163
0.036
242 0.033561
1.00
0
000
0.238
356

57
correlation heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu')
<AxesSubplot:>
#Train/Test split
X=diabetes.drop("Outcome",axis=1)
Y=diabetes[["Outcome"]]
# target variable
# split data into training and validation datasets
# Split X and y into X_
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train,Y_train) LinearRegression()
Outcom
e
0.22
189
8
0.466
581
0.0650
68
0.074
752
0.130
548
0.292
695 0.173844
0.23
8
356
1.000
000

58
# let's grab the coefficient of our model and the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {:.4}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {:.2}".format(coef[0],coef[1]))
Sample output:
The intercept for our model is -0.879
The Coefficient for Pregnancies is 0.015
The Coefficient for Glucose is 0.0057
The Coefficient for BloodPressure is -0.0021
The Coefficient for SkinThickness is 0.001
The Coefficient for Insulin is -0.00017
The Coefficient for BMI is 0.013
The Coefficient for DiabetesPedigreeFunction is 0.14
The Coefficient for Age is 0.0038
# Get multiple predictions
y_predict = regression_model.predict(X_test)
# Show the first 5
predictionsy_predict[:5]
array([[1.01391226],
[0.21532924],
[0.09157383],
[0.60583158],
[0.15988782]])
# define our intput
X2=sm.add_constant(X)
# create a OLS model
model=sm.OLS(Y, X2)
# fit the data
est = model.fit()

59
# print out a summary
print(est.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Outcome R-squared: 0.303
Model: OLS Adj. R-squared: 0.296
Method: Least Squares F-statistic: 41.29
Date: Sat, 15 Oct 2022 Prob (F-statistic): 7.36e-55
Time: 19:14:26 Log-Likelihood: -381.91
No. Observations: 768 AIC: 781.8
Df Residuals: 759 BIC: 823.6
Df Model: 8
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
const -0.8539 0.085 -9.989 0.000 -
1.022
-0.686
Pregnancies 0.0206 0.005 4.014 0.000 0.011 0.031
Glucose 0.0059 0.
00
1
11.
493
0.0
00
0.005 0.007
BloodPressure -0.0023 0.
00
1
-
2.8
7
3
0.0
0
4
-
0.004
-0.001
SkinThickness 0.0002 0.001 0.139 0.890 -0.002 0.002
Insulin -0.0002 0.
00
0
-
1.2
05
0.2
29
-
0.000
0.000
BMI 0.0132 0.
00
2
6.3
44
0.0
00
0.009 0.017
DiabetesPedigreeFunction 0.1472 0.0
45
3.2
68
0
.
0
0
1
0.05
9
0
.
2
3
6
Age 0.0026 0.002 1.6
93
0.0
91
-
0
.
0
0
0
0.00
6
==============================================================================
Omnibus: 41.539 Durbin-Watson: 1.982
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31.183
Skew: 0.395 Prob(JB): 1.69e-07
Kurtosis: 2.408 Cond. No..................... 1.10e+03
==============================================================================

60
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that
there arestrong multicollinearity or other numerical problems.
# make some confidence intervals, 95% by default
est.conf_int()
0 1
const -1.021709 -0.686079
Pregnancies 0.010521 0.030663
Glucose 0.004909 0.006932
BloodPressure -0.003925 -0.000739
SkinThickness -0.002029 0.002338
Insulin -0.000475 0.000114
BMI 0.009146 0.017343
DiabetesPedigreeFunction 0.058792 0.235682
Age -0.000419 0.005662
# estimate the p-values
est.pvalues
Sample output:
const 3.707465e-22
Pregnancies 6.561462e-05
Glucose 2.691192e-28
BloodPressure 4.178788e-03
SkinThickness 8.895424e-01
Insulin 2.285711e-01
BMI 3.853484e-10
DiabetesPedigreeFunction
1.131733e-03Age
9.092163e-02
dtype: float64import math
# calculate the mean squared error
model_mse = mean_squared_error(Y_test, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(Y_test, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)

61
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
MSE 0.148
MAE 0.322
RMSE 0.384
model_r2 = r2_score(Y_test, y_predict)print("R2: {:.2}".format(model_r2))
R2: 0.32
import pickle
# pickle the model
with open('my_mulitlinear_regression.sav','wb') as f:
pickle.dump(regression_model, f)
# load it back in
with open('my_mulitlinear_regression.sav', 'rb') as pickle_file:
regression_model_2 = pickle.load(pickle_file)
# make a new prediction
regression_model_2.predict([X_test.loc[150]])array([[0.42308994]])
Result:
Thus, the multiple regression analysis using the UCI diabetes data set was successfully
executedand practically verified.

62
Ex.No:6
Apply and explore various plotting functions on UCI data sets
Date:
AIM:
To Reading data from Excel and exploring various commands for Apply and explore various plotting
functions on UCI
data sets.
PROCEDURE:
PROGRAM:
import numpy as np
import pandas as pd
sns.set(color_codes =True)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#To run numerical descriptive stats for the data set
diabetes.describe()
Pregnan
cies
Glucose
Blood
Pressure
Skin
Thick
ness
Insuli
n
BMI
Diabet
es
Pedigr
ee
F
unct
ion
Age
Outcome
cou
nt
768.000000 768.00000
0
768.00000
0
768.00000
0
768.000
000
768.00
0
000
768.0000
00
768.00
0
000
768.000
000
me
an
3.845052 120.89453
1 69.10546
9
20.536458
79.7994
79
31.992
5
78
0.471876
33.240
8
85
0.34895
8

63
50%3.00000
0
117.000
000
72.000000 23.000000
30.5000
00
32.0000
00
0.372500
29.0000
00
0.00000
0
75%6.00000
0
140.250
000
80.000000 32.000000
127.250
000
36.6000
00
0.626250
41.0000
00
1.00000
0
max 17.0000
00
199.000
000
122.00000
0
99.000000
846.000
000
67.1000
00
2.420000
81.0000
00
1.00000
0
sns.kdeplot(diabetes["Pregnancies"], color = "green",shade =
True)plt.show()
plt.figure()
std
3.369578 31.972618
19.35580
7
15.952218
115.244
002
7.8841
6
0
0.331329
11.760
2
32
0.47695
1
min
0.000000 0.000000
0.00000
0
0.000000
0.00000
0
0.0000
0
0
0.078000
21.000
0
00
0.00000
0
25% 1.000000 99.000000
62.00000
0
0.000000
0.00000
0
27.300
0
00
0.243750
24.000
0
00
0.00000
0

64
sns.kdeplot(diabetes["Glucose"], color = "green",shade = True)
plt.show()
plt.figure()
sns.kdeplot(diabetes["Age"], diabetes["BloodPressure"],cmap="RdYlBu", shade =
True)plt.show()
plt.figure()

65
sns.kdeplot(x=diabetes.Age, y=diabetes.Glucose, cmap="PRGn", shade=True,
bw_adjust=1)plt.show()
# calculate the correlation
matrixcorr=diabetes.corr()
# display the correlation
matrixdisplay(corr)
Pregna
ncies
Gluc
ose
BloodPre
ssure
SkinThic
kness
Insuli
n
BMI
Diabetes
Pedigree
Function
Age
Outco
me
Pregnancies
1.00000
0
0.129
459
0.141282 0.081672
0.073
535
0.017
683
0.033523
0.544
341
0.221
898
Glucose
0.12945
9
1.000
000
0.152590 0.057328
0.331
357
0.221
071
0.137337
0.263
514
0.466
581

66
Pregna
ncies
Glu
cose
BloodPre
ssure
SkinThic
kness
Insulin
BMI
DiabetesP
edigree
Function
Age
Outtco
me
BloodPre
ssure
0.14128
2
0.152
590
1.000000 0.207371
0.088
933
0.281
805
0.041265
0.239
528
0.065
068
SkinThic
kness
-
0.08167
2
0.057
328
0.207371 1.000000
0.436
783
0.392
573
0.183928 0.113
9
7
0
0.074
752
Insulin
0.07353
5
0
.
3
3
1
3
5
7
0.088933 0.436783
1.000
000
0.197
859
0.185071
0.042
163
0.130
548
BMI
0.01768
3
0
.
2
2
1
0
7
1
0.281805 0.392573
0.197
859
1.000
000
0.140647
0.036
242
0.292
695
DiabetesP
edigree
Function
0.03352
3
0
.
1
3
7
3
3
7
0.041265 0.183928
0.185
071
0.140
647
1.000000
0.033
561
0.173
844
Age
0.54434
1
0
.
2
6
3
5
1
4
0.239528 0.113970
0.042
163
0.036
242
0.033561
1.000
000
0.238
356
Outcome
0.22189
8
0
.
4
6
6
5
8
1
0.065068 0.074752
0.130
548
0.292
695
0.173844
0.238
356
1.000
000

67
sns.scatterplot(x="Pregnancies", y="Glucose", data=corr);
sns.lmplot(x="Pregnancies", y="Glucose", hue="Outcome", data=corr);
# Histogram+Density Plot

68
sns.distplot(diabetes["Age"], color =
"green")plt.show()
plt.figure()
# Adding Two Plots In One
sns.kdeplot(diabetes[diabetes.Outcome == 0]['Age'],
color = "blue")
sns.kdeplot(diabetes[diabetes.Outcome == 1]['Age'],
color = "orange", shade = True)
plt.show()

69
dia1 = diabetes[diabetes.Outcome==1]
dia0 = diabetes[diabetes.Outcome==0]
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(diabetes.Glucose,
kde=False)plt.subplot(1,3,2)
sns.distplot(dia0.Glucose,kde=False,color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label = "Gloc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend() plt.subplot(1,3,3)
sns.boxplot(x=diabetes.Outcome,y=diabetes.Glucose)
plt.title("Boxplot for Glucose by Outcome")
Text(0.5, 1.0, 'Boxplot for Glucose by Outcome')

70
Three dimensional plotting:
import numpy as np
# linear algebra
import pandas as pd
# data processing, CSV file I/O (e.g. pd.read_csv)
from mpl_toolkits
import mplot3d
import matplotlib
import functools
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
import warnings
#Loading Data
diabetes
Pregnan
cies
Glucose Blood
Pressure
Skin
Thickness
Insulin BMI Diabetes
Pedigree
Function
Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
Pregnanci es ucos e Blood
Pressu re
Skin
Thickne ss
Insuli n BM I Diabetes
Pedigree
Funct ion
Ag e Outcom e
2 8 183 4 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 0 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
76
3
10 101 76 48 180 32.9 0.171 63 0
76
4
2 122 70 27 0 36.8 0.340 27 0
76
5
5 121 72 23 112 26.2 0.245 30 0
76
6
1 126 60 0 0 30.1 0.349 47 1
76
7
1 93 70 31 0 30.4 0.315 23 0

71
x=diabetes.Age[:20]
y=diabetes.Glucose[:20]
def f(x, y):
return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)X,
Y = np.meshgrid(x, y)Z = f(X, Y)
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z');
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,cmap='viridis', edgecolor='none')
ax.set_title('surface');
ax.set_zlabel('z')

72
ax.scatter(X,Y,Z, cmap='viridis', linewidth=0.5);
ax.set_title('scatter');
ax.set_zlabel('z')

73
Result:
Thus, the Three dimensional plotting using the UCI diabetes data set was successfully executed and

74
Ex.No:7
Visualizing Geographic Data with Basemap
Date:
AIM:
To Reading data from Excel and exploring various commands for Apply and explore various plotting
functions on UCI
data sets.
PROCEDURE:
Download the csv file from the https://www.kaggle.com/ and use the Pandas library to load this CSV file, and
convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
import numpy as np
from numpy import array
import matplotlib as mpl
# for plots
from matplotlib import cm
from mpl_toolkits.basemap
import Basemap
%matplotlib inline
from matplotlib.patches
import Polygon
from matplotlib.collections
import PatchCollection
import warnings
cities = pd.read_csv (r"C:UsersAdminDownloadsdatasets_557_1096_cities_r2.csv")
cities.head()
states = cities.groupby('state_name')['name_of_city'].count().sort_values(ascending=True)
states.plot(kind="barh", fontsize = 20)
plt.grid(b=True, which='both',
color='Black',linestyle='-')plt.xlabel('No of cities
taken for analysis', fontsize = 20) plt.show ()

75
ax=fig.add_subplot(111)
map=Basemap(llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,projection="lcc",lat_0=28,lon_0=77)
#map.bluemarble()
#map.fillcontinents(color="red")
map.drawmapboundary(color="red")
map.drawcountries(color="brown")
map.drawcoastlines(color="blue")
#draw state from shapefile
map.readshapefile(r"C:UsersAdminMusicIndia_State_ShapefileIndia_State_Boundary","India_St
ate_Bo undary")
cities['latitude'] = cities['location'].apply(lambda x: x.split(',')[0]) cities['longitude'] =
cities['location'].apply(lambda x: x.split(',')[1])
print("The Top 10 Cities sorted according to the Total Population (Descending Order)")
top_pop_cities = cities.sort_values(by='population_total',ascending=False)

76
top10_pop_cities=top_pop_cities.head(10)
#plt.subplots(figsize=(20, 15))
lg=array(top10_pop_cities['longitude'])
lt=array(top10_pop_cities['latitude'])
pt=array(top10_pop_cities['population_total'])
nc=array(top10_pop_cities['name_of_city'])
x, y = map(lg, lt)
population_sizes = top10_pop_cities["population_total"].apply(lambda x: int(x /5000))
plt.scatter(x, y, s=population_sizes, marker="o", c=population_sizes, cmap=cm.Dark2,
alpha=0.7)for ncs, xpt, ypt in zip(nc, x, y):
plt.text(xpt+60000, ypt+30000, ncs, fontsize=10, fontweight='bold')
plt.title('Top 10 Populated Cities in India',fontsize=20)
The Top 10 Cities sorted according to the Total Population (Descending
Order)Text(0.5, 1.0, 'Top 10 Populated Cities in India')

77
Result:
Thus, the Visualizing Geographic Data with Basemap was successfully executed and practically
verified.

78
VIVA QUESTIONS
NumPy
1. What is Numpy?
Ans: NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental package
for scientific computing with Python. … A powerful N-dimensional array object. Sophisticated
(broadcasting)functions.
2. Why NumPy is used in Python?
Ans: NumPy is a package in Python used for Scientific Computing. NumPy package is used to
perform different operations. The ndarray (NumPy Array) is a multidimensional array used to store
values of samedatatype. These arrays are indexed just like Sequences, starts with zero.
3. What does NumPy mean in Python?
Ans: NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a library for
the Python programming language, adding support for large, multi-dimensional arrays and matrices,
alongwith a large collection of high-level mathematical functions to operate on these arrays.
4. Where is NumPy used?
Ans: NumPy is an open source numerical Python library. NumPy contains a multi-dimentional array
andmatrix data structures. It can be utilised to perform a number of mathematical operations on arrays
such astrigonometric, statistical and algebraic routines. NumPy is an extension of Numeric and
Numarray.

79
Pandas
1. What is Pandas?
Ans: Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
fundamental high-levelbuilding block for doing practical, real world data analysis in Python.
2. What is Python pandas used for?
Ans: Pandas is a software library written for the Python programming language for data
manipulation andanalysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. pandas is free software released under the three-clause BSD
license.
3. What is a Series in Pandas?
Ans: Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer,
string,float, python objects, etc.). The axis labels are collectively called index. Pandas Series is
nothing but a column in an excel sheet.
4. Mention the different Types of Data structures in pandas??
Ans: There are two data structures supported by pandas library, Series and DataFrames. Both of the
data structures are built on top of Numpy. Series is a one-dimensional data structure in pandas and
DataFrame isthe two-dimensional data structure in pandas. There is one more axis label known as
Panel which is a three- dimensional data structure and it includes items, major_axis, and minor_axis.
5. Explain Reindexing in pandas?
Ans: Re-indexing means to conform DataFrame to a new index with optional filling logic, placing
NA/NaNin locations having no value in the previous index. It changes the row labels and column
labels of a DataFrame.
6. What are the key features of pandas library ?
Ans: There are various features in pandas library and some of them are mentioned below
 Data Alignment
 Memory Efficient
 Reshaping
 Merge and join
 Time Series
7. What is pandas Used For ?
Ans: This library is written for the Python programming language for performing operations like data
manipulation, data analysis, etc. The library provides various operations as well as data structures to
manipulate time series and numerical tables.

80
8. How can we create copy of series in Pandas?
Ans: pandas.Series.copy
Series.copy(deep=True)
pandas.Series.copy. Make a deep copy, including a copy of the data and the indices. With
deep=False neither the indices or the data are copied. Note that when deep=True data is copied,
actual python objectswill not be copied recursively, only the reference to the object.
9. What is Time Series in pandas?
Ans: A time series is an ordered sequence of data which basically represents how some quantity
changesover time. pandas contains extensive capabilities and features for working with time series
data for all domains.
pandas supports:
Parsing time series information from various sources and formats
Generate sequences of fixed-frequency dates and time spans
Manipulating and converting date time with timezone information
Resampling or converting a time series to a particular frequency
Performing date and time arithmetic with absolute or relative time increments
10. What is pylab?
Ans: PyLab is a package that contains NumPy, SciPy, and Matplotlib into a single namespace.
Jupyter Notebook
1. What is Jupyter
Notebook?
Jupyter Notebook is a web-based interactive computing platform that allows users to create and share
code,equations, visualizations, and narrative text. Jupyter Notebook is popular among data scientists
and engineers as it allows for rapid prototyping and iteration.
2. What are the main features of Jupyter Notebook?
Jupyter Notebook is a web-based interactive computing platform that allows users to create and share
code,equations, visualizations, and narrative text. Jupyter Notebook is popular among data scientists
and engineers as it provides an easy way to mix code, output, and explanatory text all in one place.
Jupyter Notebook is also used by educators to teach programming and data science concepts.
3. How can you create a new notebook in Jupyter?
You can create a new notebook in Jupyter by clicking on the “New” button in the upper right corner
andselecting “Notebook” from the drop-down menu.
4. Can you explain what the data science workflow involves?
The data science workflow generally involves four main steps: data wrangling, exploratory data
analysis,modeling, and evaluation. Data wrangling is the process of cleaning and preparing data for

81
analysis.
Exploratory data analysis is the process of exploring data to find patterns and relationships. Modeling
is the process of building models to make predictions or recommendations based on data. Evaluation is
the processof assessing the accuracy of models and using them to make decisions.
5. What are some common use cases for Jupyter Notebook?
Jupyter Notebook is a popular tool for data scientists and analysts because it allows for an interactive
Coding experience. Jupyter Notebook is often used for exploratory data analysis and for visualizing data.
*******

DS LAB MANUAL.pdf

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to DS LAB MANUAL.pdf

Similar to DS LAB MANUAL.pdf (20)

Recently uploaded

Recently uploaded (20)

DS LAB MANUAL.pdf