A Critique of the Proposed National Education Policy Reform
DS LAB MANUAL.pdf
1. REGULATION – 2021
CS3361 – DATA SCIENCE LABORATORY
LAB MANUAL
YEAR / SEMESTER: II / III
Prepared by
P.SANTHIYA
Assistant Professor
Department of Computer Science and Engineering
2. CS3362 DATA SCIENCE LABORATORY L T P C
0 0 4 2
COURSE OBJECTIVES:
To understand the python libraries for data science.
To understand the basic Statistical and Probability
measures for data science.To learn descriptive analytics
on the benchmark data sets.
To apply correlation and regression analytics on
standard data sets. To present and interpret data
using visualization packages in Python.
LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter,
Statsmodelsand Pandaspackages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring
variouscommands for doingdescriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data
set forperforming thefollowing:
a. Univariate analysis: Frequency, Mean, Median, Mode,
Variance, StandardDeviation,Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
3. LIST OF EQUIPMENTS :(30 Students per Batch)
Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statmodels, seaborn,
plotly, bokeh
Note: Example data sets like: UCI, Iris, Pima Indians Diabetes etc.
TOTAL: 60 PERIODS
COURSE OUTCOMES:
At the end of this course, the students will be able to:
Make use of the python libraries for data science.
Make use of the basic Statistical and Probability measures for data science.
Perform descriptive analytics on the benchmark data sets.
Perform correlation and regression analytics on standard data sets.
Present and interpret data using visualization packages in Python.
4. 1
Ex.No 1
Download, install and explore the features of NumPy, SciPy,
Jupyter,Statsmodels andpackages
Date:
AIM:
Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
packages.
Downloading and Installing Anaconda on Linux:
1. Getting Started:
2. Getting through the License Agreement:
8. 5
a) Installing Jupyter Notebook using Anaconda:
To install Jupyter using Anaconda, just go through the following instructions:
1. Launch Anaconda Navigator:
2. Click on the Install Jupyter Notebook Button:
11. 8
b) Installing Jupyter Notebook using pip:
To install Jupyter using pip, the following command to update pip:
>> python3 -m pip install --upgrade pip
After updating the pip version, follow the instructions provided below to install Jupyter:
Command to install Jupyter:
>>pip3 install Jupyter
1. Beginning Installation:
15. 12
Explore the following features of python packages:
1. NumPy:
NumPy stands for Numerical Python.NumPy (Numerical Python) is an open-source library for
the Python programming language. It is used for scientific computing and working with arrays. The
source code for NumPy is located at this github repository https://github.com/numpy/numpy.
Features:
1. High-performance N-dimensional array object.
2. It contains tools for integrating code from C/C++ and Fortran.
3. It contains a multidimensional container for generic data.
4. Additional linear algebra, Fourier transform, and random number capabilities.
5. It consists of broadcasting functions.
6. It had data type definition capability to work with varied databases.
2. SciPy:
SciPy stands for Scientific Python. SciPy is a scientific computation library that uses NumPy
underneath. The source code for SciPy is located at this github repository
https://github.com/scipy/scipy
Features:
1. SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems,
algebraic equations, differential equations, statistics and many other classes of problems.
2. It provides more utility functions for optimization, stats and signal processing.
Numpy vs. SciPy
Numpy and SciPy both are used for mathematical and numerical analysis. Numpy is suitable for
basicoperations such as sorting, indexing and many more because it contains array data, whereas SciPy
consistsof all the numeric data.
Numpy contains many functions that are used to resolve the linear algebra, Fourier transforms,
etc. whereas SciPy library contains full featured version of the linear algebra module as well many other
numerical algorithms.
16. 13
3. Pandas:
Python Pandas is defined as an open-source library that provides high-performance data
manipulation in Python. The name ofPandas isderived from the word Panel Data, which meansan
Econometrics from Multidimensional data. It is used for data analysis in Python. Pandas is built
on top of the Numpy package, means Numpy is required for operating the Pandas.
Features:
1. Group by data for aggregations and transformations.
2. It has a fast and efficient DataFrame object with the default and customized indexing.
3. Used for reshaping and pivoting of the data sets.
4. It is used for data alignment and integration of the missing data.
5. Provide the functionality of Time Series.
6. Process a variety of data sets in different formats like matrix data, tabular heterogeneous,time
series.
7. Handle multiple operations of the data sets such as subsetting, slicing, filtering, groupBy, re-
ordering, and re-shaping.
4. Statsmodels:
statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests, and statistical data
exploration. The package is released under the open source Modified BSD (3-clause) license. The
online documentation is hosted at statsmodels.org.
Features:
1. Linear regression models like Ordinary least squares, Generalized least
squares,Weighted least squares, Least squares with autoregressive errors.
2. Bayesian Mixed GLM for Binomial and Poisson
3. GEE: Generalized Estimating Equations for one-way clustered or longitudinal data
4. Nonparametric statistics: Univariate and multivariate kernel density estimators
5. Datasets: Datasets used for examples and in testing
6. Sandbox: statsmodels contains a sandbox folder with code in various
stages ofdevelopment and testing.
RESULT:
Thus, the NumPy, SciPy, Jupyter, Statsmodels packages have been successfully
download,install and explore their features.
17. 14
Ex.No 2
Working with Numpy arrays
Date:
AIM:
To write a Numpy arrays program to demonstrate basic array concepts in Jupyter Notebook.
PROGRAM:
1. Creating Arrays from Python Lists:
In[1]: import numpy as np
In[2]: # integer array:
np.array([1, 4, 2, 5, 3])
Out[2]: array([1, 4, 2, 5, 3])
#NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will
upcast ifpossible (here, integers are upcast to floating point):
In[3]: np.array([3.14, 4, 2, 3])
Out[3]: array([ 3.14, 4. , 2. , 3. ])
#If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:
In[4]: np.array([1, 2, 3, 4], dtype='float32')
Out[4]: array([ 1., 2., 3., 4.], dtype=float32)
In[5]: # nested lists result in multidimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
Out[5]: array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
2. NumPy Array Attributes:
In[1]: import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
18. 15
Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and
size (thetotal size of the array):
In[2]: print("x3 ndim: ",
x3.ndim) print("x3 shape:",
x3.shape)print("x3 size: ",
x3.size)
Out[2]:x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
In[3]: print("dtype:", x3.dtype)# data type of the array
Out[3]:dtype: int64
In[4]: print("itemsize:", x3.itemsize,
"bytes")print("nbytes:", x3.nbytes,
"bytes")
Out[4]:itemsize: 8 bytes
Out[4]:nbytes: 480 bytes
3. Array Indexing: Accessing Single Elements:
In[5]: x1
Out[5]: array([5, 0, 3, 3, 7, 9])
In[6]:
x1[0]
Out[6
]: 5
In[7]:
x1[4]
Out[7]: 7
#To index from the end of the array, you can use negative indices
In[8]:
x1[-1]
Out[8]
: 9
In[9]:
x1[-2]
Out[9]: 7
19. 16
#In a multidimensional array, you access items using a comma-separated tuple of indices
In[10]: x2
Out[10]: array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
In[11]: x2[0, 0]
Out[11]: 3
In[12]: x2[2, 0]
Out[12]: 1
In[13]: x2[2, -1]
Out[13]: 7
#modify values using any of the above index notation
In[14]: x2[0, 0] = 12
x2
Out[14]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[15]: x1[0] = 3.14159 # this will be truncated!
x1
Out[15]: array([3, 0, 3, 3, 7, 9])
4. Array Slicing: Accessing Subarrays
#One-dimensional subarrays
In[16]: x =
np.arange(10)x
Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In[17]: x[:5] # first five elements
Out[17]: array([0, 1, 2, 3, 4])
In[18]: x[5:] # elements after index 5
Out[18]: array([5, 6, 7, 8, 9])
In[19]: x[4:7] # middle subarray
Out[19]: array([4, 5, 6])
In[20]: x[::2] # every other element
Out[20]: array([0, 2, 4, 6, 8])
In[21]: x[1::2] # every other element, starting at index 1
Out[21]: array([1, 3, 5, 7, 9])
20. 17
In[22]: x[::-1] # all elements, reversed
Out[22]: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In[23]: x[5::-2] # reversed every other from index 5
Out[23]: array([5, 3, 1])
5. Multidimensional subarrays:
In[24]: x2
Out[24]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[25]: x2[:2, :3] # two rows, three columns
Out[25]: array([[12, 5, 2],
[ 7, 6, 8]])
In[26]: x2[:3, ::2] # all rows, every other column
Out[26]: array([[12, 2],
[ 7, 8],
[ 1, 7]])
#Finally, subarray dimensions can even be reversed together:
In[27]: x2[::-1, ::-1]
Out[27]: array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])
6. Accessing array rows and columns:
In[28]: print(x2[:, 0]) # first column of x2
[12 7 1]
In[29]: print(x2[0, :]) # first row of x2
[12 5 2 4]
#In the case of row access, the empty slice can be omitted for a more compact syntax:
In[30]: print(x2[0]) # equivalent to x2[0, :]
[12 5 2 4]
In[31]: print(x2)
[[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
#extract a 2×2 subarray from this:
In[32]: x2_sub = x2[:2, :2]
28. 25
Ex.No 3
Working of Pandas DataFrame
Date:
AIM:
To write a Pandas program using dictionary sample dataframe to perform opertations in its DataFrame.
Sample DataFrame:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
PROGRAM:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print(df)
print("Summary of the basic information about this DataFrame and its data:")
print(df.info())
29. 26
Sample Output:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Summary of the basic information about this DataFrame and its data:
<class 'pandas.core.frame.DataFrame'>Index:
10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype
0 name 10 non-null object
1 score 8 non-null float64
2 attempts 10 non-null int64
3 qualify 10 non-null object
dtypes: float64(1), int64(1), object(2)memory
usage: 400.0+ bytes
None
i. To get the first 3 rows of a given DataFrame.
print("First three rows of the data frame:")
print(df.iloc[:3])
Sample Output:
First three rows of the data frame:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
ii. To select the 'name' and 'score' columns from the following DataFrame.
print("Select specific columns:")
print(df[['name', 'score']])
30. 27
Sample Output:
Select specific columns:
name score
a Anastasia 12.5
b Dima 9.0
c Katherine 16.5
d James NaN
e Emily 9.0
f Michael 20.0
g Matthew 14.5
h Laura NaN
i Kevin 8.0
j Jonas 19.0
iii. To select the specified columns and rows from a given DataFrame. Select 'name' and'score'
columns in rows 1, 3, 5, 6 from the following data frame.
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes
iv. To select the rows where the number of attempts in the examination is greater than 2.
print("Number of attempts in the examination is greater than 2:")
print(df[df['attempts'] > 2])
Sample Output:
Number of attempts in the examination is greater than 2:
name score attempts qualify
b Dima 9.0 3 no
d James NaN 3 no
f Michael 20.0 3 yes
v. To select the rows where the score is missing, i.e. is NaN.
print("Rows where score is missing:")
print(df[df['score'].isnull()])
31. 28
Sample Output:
Rows where score is missing: attempts
name qualify score
d 3 James no NaN
h 1 Laura no NaN
vi. To change the score in row 'd' to 11.5.
print("nOriginal data frame:")print(df)
print("nChange the score in row 'd' to 11.5:")df.loc['d',
'score'] = 11.5
print(df)
Sample Output:
Original data frame:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Change the score in row 'd' to 11.5:
Attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no 11.5
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
32. 29
vii. To calculate the sum of the examination score by the students.
print("nSum of the examination attempts by the students:")
print(df['score'].sum())
Sample Output:
Sum of the examination attempts by the students:
108.5
viii. To append a new row 'k' to DataFrame with given values for each column. Now delete thenew
row and return the original data frame.
print("Original rows:")print(df)
print("nAppend a new row:") df.loc['k'] =
[1, 'Suresh', 'yes', 15.5]
print("Print all records after insert a new record:")print(df)
print("nDelete the new row and display the original rows:")df =
df.drop('k')
print(df)
Sample Output:
Original rows:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Append a new row:
Print all records after insert a new record:attempts
name qualify score
33. 30
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
k 1 Suresh yes 15.5
Delete the new row and display the original rows:attempts
name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
ix. To delete the 'attempts' column from the DataFrame.
print("Original rows:")print(df)
print("nDelete the 'attempts' column from the data frame:")
df.pop('attempts')
print(df)
34. 31
Sample Output:
Original rows:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Delete the 'attempts' column from the data frame:
name qualify score
a Anastasia yes 12.5
b Dima no 9.0
c Katherine yes 16.5
d James no NaN
e Emily no 9.0
f Michael yes 20.0
g Matthew yes 14.5
h Laura no NaN
i Kevin no 8.0
j Jonas yes 19.0
x. To insert a new column in existing DataFrame.
print("Original rows:")print(df)
color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']df['color'] =
color
35. 32
print("nNew DataFrame after inserting the 'color' column")print(df)
Sample Output
Original rows:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
New DataFrame after inserting the 'color' column
Attempts name qualify score color
a 1 Anastasia yes 12.5 Red
b 3 Dima no 9.0 Blue
c 2 Katherine yes 16.5 Orange
d 3 James no NaN Red
e 2 Emily no 9.0 White
f 3 Michael yes 20.0 White
g 1 Matthew yes 14.5 Blue
h 1 Laura no NaN Green
i 2 Kevin no 8.0 Green
j 1 Jonas yes 19.0 Red
RESULT:
Thus, the working of Pandas Dataframe using Dictionary was executed and verified successfully.
36. 33
Ex.No:4
Descriptive analytics on the Iris data set
Date:
AIM:
To Reading data from text files, Excel and the web and exploring various commands for doing descriptive
analytics on the Iris data set.
PROCEDURE:
Download the Iris.csv file from the https://www.kaggle.com/datasets/uciml/iris and use the Pandas library to
load this CSV file,and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
df = pd.read_csv("Music/Iris.csv")# Reading the CSV file
print(df)
print(df.d
types)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
37. 34
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
[150 rows x 6 columns]
Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object dtype: object
# Printing top 5 rows
print(df.head())
#To shape parameter to get the shape of the dataset.
print(df.shape)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
(150, 6)
#To know the columns and their data types use the info() method.
df.info()
<class
'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to
149
38. 35
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
Species
150 non-null object
dtypes: float64(4),
int64(1),
object(1)
memory usage: 7.2+ KB
print(df.describe())
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 150.000000 150.000000 150.000000 150.000000 150.000000
m
e
a
n
75.500000 5.843333 3.054000 3.758667 1.198667
s
t
d
43.445368 0.828066 0.433594 1.764420 0.763161
m
i
n
1.000000 4.300000 2.000000 1.000000 0.100000
2
5
%
38.250000 5.100000 2.800000 1.600000 0.300000
5
0
%
75.500000 5.800000 3.000000 4.350000 1.300000
7
5
%
112.750000 6.400000 3.300000 5.100000 1.800000
m
a
x
150.0000
00
7.900000 4.40000
0
6.90000
0
2.500000
39. 36
# Missing values can occur when no information is provided
print(df.isnull().sum())
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
# To check dataset contains any duplicates
or notdata = df.drop_duplicates(subset
="Species",) print(data)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
5
0
5
1
7.0 3.
2
4.7 1.4
1
0
0
1
0
1
6.3 3.3 6.0 2.5
Species
0 Iris-
setosa
50 Iris-versicolor
100 Iris-virginica
#To find unique species from the given dataset
print(df.value_counts("Species"))
Species
Iris-setosa50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
40. 37
#matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#seaborn
import seaborn as sns
#plot the variable ‘sepal.width’
plt.scatter(df.index,df['SepalWidthCm'])plt.show()
#visualize the same plot by considering its variety using the sns.scatterplot() function of the
seaborn library.
sns.scatterplot(x=df.index,y=df['SepalWidthCm'],hue=df['Species'])
41. 38
#visualizes data by connecting the data points via line segments.
plt.figure(figsize=(6,6))
plt.title("line plot for petal length")
plt.xlabel('index',fontsize=20)
plt.ylabel('PetalLengthCm',fontsize=20)
plt.plot(df.index,df['PetalLengthCm'],markevery=1,marker='d') for name,group in
df.groupby('Species'):
plt.plot(group.index,group['PetalLengthCm'],label=name,markevery=1,marker='d')
plt.legend()
plt.show()
#Plotting histogram using the matplotlib plt.hist() function :
plt.hist(df["PetalWidthCm"])
45. 42
#calculate mean
print("Mean of Preganancies: %f"
%df['Pregnancies'].mean()) print("Mean of BloodPressure:
%f" %df['BloodPressure'].mean())print("Mean of Glucose:
%f" %df['Glucose'].mean()) print("Mean of Age: %f"
%df['Age'].mean())
Sample Output:
Mean of Preganancies:
3.845052 Mean of
BloodPressure: 69.105469
Mean of Glucose:
120.894531 Mean of Age:
33.240885 #calculate
median
print("median of Preganancies: %f"
%df['Pregnancies'].median()) print("median of BloodPressure:
%f" %df['BloodPressure'].median())print("medianf Glucose:
%f" %df['Glucose'].median())
print("median of Age: %f" %df['Age'].median())
Sample Output:
median of Preganancies:
3.000000 median of
BloodPressure: 72.000000
medianf Glucose: 117.000000
median of Age: 29.000000
#calculate standard deviation of 'points'
print("standard deviation for BloodPressure: %f" %
df['BloodPressure'].std())print("standard deviation for Glucose: %f" %
df['Glucose'].std()) print("standard deviation for Pregnancies: %f" %
df['Pregnancies'].std()) Sample Output:
standard deviation for BloodPressure:
19.355807standard deviation for Glucose:
31.972618 standard deviation for
Pregnancies: 3.369578 #To describe the
data
49. 46
#to create a density curve
import seaborn as sns
sns.kdeplot(df['BloodPres
sure'])
<AxesSubplot:xlabel='BloodPressure', ylabel='Density'>
#visualize the same plot by considering its variety using the sns.scatterplot() function of the seaborn
library.sns.scatterplot(x=df.index,y=df['Age'],hue=df['Outcome'])
53. 50
Ex.No:5.b
Bivariate analysis using the UCI diabetes data set
Date:
AIM:
To Reading data from text files, Excel and the web and exploring various commands for doing Bivariate
analysis using the UCI diabetes data set.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/ and use the Pandas
library to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
Linear regression modelling
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:,
np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-50]
diabetes_X_test = diabetes_X[-50:]
# Split the targets into
training/testing sets
diabetes_y_train = diabetes_y[:-50]
diabetes_y_test = diabetes_y[-50:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients print("Coefficients: n", regr.coef_)
54. 51
Sample output:
Coefficients: [945.4992184]
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
Sample output:
Mean squared error: 3471.92
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
Sample output:
Coefficient of determination: 0.41
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue",
linewidth=3)plt.xticks(())
plt.yticks(())plt.show()
56. 53
Pregnancies Glucose BloodPressure SkinThickn
ess
DiabetesPedigreeFunction Outcome
1 126 60 0 0.349 1
1 93 70 31 0.315 0
768 rows × 9 columns
#Train/Test split
X=df.drop("Outcome",axis=1)
Y=df[["Outcome"]]
# target variable
# split data into training and validation datasets
X_train, X_test, y_train, y_t est = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.linear_model
import LogisticRegression
# instantiate the model
model = LogisticRegression()
# fitting the model
model.fit(X_train, y_train) y_pred = model.predict(X_test)y_pred[0:5]
# metrics
print("Accuracy for test set is {}.".format(round(metrics.accuracy_score(y_test, y_pred), 4)))
print("Precision for test set is {}.".format(round(metrics.precision_score(y_test, y_pred), 4)))
print("Recall for test set is {}.".format(round(metrics.recall_score(y_test, y_pred), 4)))
Sample Output:
Accuracy for test set is
0.7917.Precision for test
set is 0.7115.Recall for
test set is 0.5968.
print(metrics.classification_report(y_test, y_pred))
Sample Output:
precision recall f1-score support
0 0.82 0.88 0.85 130
1 0.71 0.60 0.65 62
accuracy 0.79 192
macro avg 0.77 0.74 0.75 192
weighted avg 0.79 0.79 0.79 192
58. 55
Ex.No:5.c
Multiple Regression analysis using the UCI diabetes data set
Date:
AIM:
To Reading data from Excel and exploring various commands for doing Multiple Regression analysis
using the UCI diabetes data set.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/ and use the Pandas
library to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
#import our Libraries
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats
import diagnostic as diag
from statsmodels.stats.outliers_influence
import variance_inflation_factor
from sklearn.linear_model
import LinearRegression
from sklearn.model_selection
import train_test_split
from sklearn.metrics
import mean_squared_error, r2_score, mean_absolute_error
import warnings warnings.filterwarnings("ignore")
%matplotlib inline
#Loading Data
diabetes=pd.read_csv("E:DATA SCIENCEPima_indian_diabetesdiabetes.csv")
diabetes
Preg
nanci
es
Glucose Blood
Pressure
Skin
Thickness
Insulin DiabetesPedigree
Function
Outcome
6 148 72 35 0 0.627 1
1
85 66 29 0 0.351 0
8
183 64 0 0 0.672 1
1
89 66 23 94 0.167 0
0
137 40 35 168 2.288 1
60. 57
correlation heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu')
<AxesSubplot:>
#Train/Test split
X=diabetes.drop("Outcome",axis=1)
Y=diabetes[["Outcome"]]
# target variable
# split data into training and validation datasets
# Split X and y into X_
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train,Y_train) LinearRegression()
Outcom
e
0.22
189
8
0.466
581
0.0650
68
0.074
752
0.130
548
0.292
695 0.173844
0.23
8
356
1.000
000
61. 58
# let's grab the coefficient of our model and the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {:.4}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {:.2}".format(coef[0],coef[1]))
Sample output:
The intercept for our model is -0.879
The Coefficient for Pregnancies is 0.015
The Coefficient for Glucose is 0.0057
The Coefficient for BloodPressure is -0.0021
The Coefficient for SkinThickness is 0.001
The Coefficient for Insulin is -0.00017
The Coefficient for BMI is 0.013
The Coefficient for DiabetesPedigreeFunction is 0.14
The Coefficient for Age is 0.0038
# Get multiple predictions
y_predict = regression_model.predict(X_test)
# Show the first 5
predictionsy_predict[:5]
array([[1.01391226],
[0.21532924],
[0.09157383],
[0.60583158],
[0.15988782]])
# define our intput
X2=sm.add_constant(X)
# create a OLS model
model=sm.OLS(Y, X2)
# fit the data
est = model.fit()
63. 60
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that
there arestrong multicollinearity or other numerical problems.
# make some confidence intervals, 95% by default
est.conf_int()
0 1
const -1.021709 -0.686079
Pregnancies 0.010521 0.030663
Glucose 0.004909 0.006932
BloodPressure -0.003925 -0.000739
SkinThickness -0.002029 0.002338
Insulin -0.000475 0.000114
BMI 0.009146 0.017343
DiabetesPedigreeFunction 0.058792 0.235682
Age -0.000419 0.005662
# estimate the p-values
est.pvalues
Sample output:
const 3.707465e-22
Pregnancies 6.561462e-05
Glucose 2.691192e-28
BloodPressure 4.178788e-03
SkinThickness 8.895424e-01
Insulin 2.285711e-01
BMI 3.853484e-10
DiabetesPedigreeFunction
1.131733e-03Age
9.092163e-02
dtype: float64import math
# calculate the mean squared error
model_mse = mean_squared_error(Y_test, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(Y_test, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
64. 61
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
MSE 0.148
MAE 0.322
RMSE 0.384
model_r2 = r2_score(Y_test, y_predict)print("R2: {:.2}".format(model_r2))
R2: 0.32
import pickle
# pickle the model
with open('my_mulitlinear_regression.sav','wb') as f:
pickle.dump(regression_model, f)
# load it back in
with open('my_mulitlinear_regression.sav', 'rb') as pickle_file:
regression_model_2 = pickle.load(pickle_file)
# make a new prediction
regression_model_2.predict([X_test.loc[150]])array([[0.42308994]])
Result:
Thus, the multiple regression analysis using the UCI diabetes data set was successfully
executedand practically verified.
65. 62
Ex.No:6
Apply and explore various plotting functions on UCI data sets
Date:
AIM:
To Reading data from Excel and exploring various commands for Apply and explore various plotting
functions on UCI
data sets.
PROCEDURE:
Download the Pima_indian_diabetes data as csv file from the https://www.kaggle.com/ and use the Pandas
library to load this CSV file, and convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes =True)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#To run numerical descriptive stats for the data set
diabetes.describe()
Pregnan
cies
Glucose
Blood
Pressure
Skin
Thick
ness
Insuli
n
BMI
Diabet
es
Pedigr
ee
F
unct
ion
Age
Outcome
cou
nt
768.000000 768.00000
0
768.00000
0
768.00000
0
768.000
000
768.00
0
000
768.0000
00
768.00
0
000
768.000
000
me
an
3.845052 120.89453
1 69.10546
9
20.536458
79.7994
79
31.992
5
78
0.471876
33.240
8
85
0.34895
8
76. 73
Result:
Thus, the Three dimensional plotting using the UCI diabetes data set was successfully executed and
practically verified.
77. 74
Ex.No:7
Visualizing Geographic Data with Basemap
Date:
AIM:
To Reading data from Excel and exploring various commands for Apply and explore various plotting
functions on UCI
data sets.
PROCEDURE:
Download the csv file from the https://www.kaggle.com/ and use the Pandas library to load this CSV file, and
convert it into the dataframe. read_csv() method is used to read CSV files.
PROGRAM:
import pandas as pd
import numpy as np
from numpy import array
import matplotlib as mpl
# for plots
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap
import Basemap
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.patches
import Polygon
from matplotlib.collections
import PatchCollection
import warnings
warnings.filterwarnings("ignore")
cities = pd.read_csv (r"C:UsersAdminDownloadsdatasets_557_1096_cities_r2.csv")
cities.head()
fig = plt.figure(figsize=(10,8))
states = cities.groupby('state_name')['name_of_city'].count().sort_values(ascending=True)
states.plot(kind="barh", fontsize = 20)
plt.grid(b=True, which='both',
color='Black',linestyle='-')plt.xlabel('No of cities
taken for analysis', fontsize = 20) plt.show ()
81. 78
VIVA QUESTIONS
NumPy
1. What is Numpy?
Ans: NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental package
for scientific computing with Python. … A powerful N-dimensional array object. Sophisticated
(broadcasting)functions.
2. Why NumPy is used in Python?
Ans: NumPy is a package in Python used for Scientific Computing. NumPy package is used to
perform different operations. The ndarray (NumPy Array) is a multidimensional array used to store
values of samedatatype. These arrays are indexed just like Sequences, starts with zero.
3. What does NumPy mean in Python?
Ans: NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a library for
the Python programming language, adding support for large, multi-dimensional arrays and matrices,
alongwith a large collection of high-level mathematical functions to operate on these arrays.
4. Where is NumPy used?
Ans: NumPy is an open source numerical Python library. NumPy contains a multi-dimentional array
andmatrix data structures. It can be utilised to perform a number of mathematical operations on arrays
such astrigonometric, statistical and algebraic routines. NumPy is an extension of Numeric and
Numarray.
82. 79
Pandas
1. What is Pandas?
Ans: Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
fundamental high-levelbuilding block for doing practical, real world data analysis in Python.
2. What is Python pandas used for?
Ans: Pandas is a software library written for the Python programming language for data
manipulation andanalysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. pandas is free software released under the three-clause BSD
license.
3. What is a Series in Pandas?
Ans: Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer,
string,float, python objects, etc.). The axis labels are collectively called index. Pandas Series is
nothing but a column in an excel sheet.
4. Mention the different Types of Data structures in pandas??
Ans: There are two data structures supported by pandas library, Series and DataFrames. Both of the
data structures are built on top of Numpy. Series is a one-dimensional data structure in pandas and
DataFrame isthe two-dimensional data structure in pandas. There is one more axis label known as
Panel which is a three- dimensional data structure and it includes items, major_axis, and minor_axis.
5. Explain Reindexing in pandas?
Ans: Re-indexing means to conform DataFrame to a new index with optional filling logic, placing
NA/NaNin locations having no value in the previous index. It changes the row labels and column
labels of a DataFrame.
6. What are the key features of pandas library ?
Ans: There are various features in pandas library and some of them are mentioned below
Data Alignment
Memory Efficient
Reshaping
Merge and join
Time Series
7. What is pandas Used For ?
Ans: This library is written for the Python programming language for performing operations like data
manipulation, data analysis, etc. The library provides various operations as well as data structures to
manipulate time series and numerical tables.
83. 80
8. How can we create copy of series in Pandas?
Ans: pandas.Series.copy
Series.copy(deep=True)
pandas.Series.copy. Make a deep copy, including a copy of the data and the indices. With
deep=False neither the indices or the data are copied. Note that when deep=True data is copied,
actual python objectswill not be copied recursively, only the reference to the object.
9. What is Time Series in pandas?
Ans: A time series is an ordered sequence of data which basically represents how some quantity
changesover time. pandas contains extensive capabilities and features for working with time series
data for all domains.
pandas supports:
Parsing time series information from various sources and formats
Generate sequences of fixed-frequency dates and time spans
Manipulating and converting date time with timezone information
Resampling or converting a time series to a particular frequency
Performing date and time arithmetic with absolute or relative time increments
10. What is pylab?
Ans: PyLab is a package that contains NumPy, SciPy, and Matplotlib into a single namespace.
Jupyter Notebook
1. What is Jupyter
Notebook?
Jupyter Notebook is a web-based interactive computing platform that allows users to create and share
code,equations, visualizations, and narrative text. Jupyter Notebook is popular among data scientists
and engineers as it allows for rapid prototyping and iteration.
2. What are the main features of Jupyter Notebook?
Jupyter Notebook is a web-based interactive computing platform that allows users to create and share
code,equations, visualizations, and narrative text. Jupyter Notebook is popular among data scientists
and engineers as it provides an easy way to mix code, output, and explanatory text all in one place.
Jupyter Notebook is also used by educators to teach programming and data science concepts.
3. How can you create a new notebook in Jupyter?
You can create a new notebook in Jupyter by clicking on the “New” button in the upper right corner
andselecting “Notebook” from the drop-down menu.
4. Can you explain what the data science workflow involves?
The data science workflow generally involves four main steps: data wrangling, exploratory data
analysis,modeling, and evaluation. Data wrangling is the process of cleaning and preparing data for
84. 81
analysis.
Exploratory data analysis is the process of exploring data to find patterns and relationships. Modeling
is the process of building models to make predictions or recommendations based on data. Evaluation is
the processof assessing the accuracy of models and using them to make decisions.
5. What are some common use cases for Jupyter Notebook?
Jupyter Notebook is a popular tool for data scientists and analysts because it allows for an interactive
Coding experience. Jupyter Notebook is often used for exploratory data analysis and for visualizing data.
*******