Data Preprocessing
SY Btech Sem:III
What is Data Preprocessing?
• Data preprocessing is the process of preparing
raw data and making it suitable for a
machine learning model. It is the first, and a
crucial, step in creating a machine learning
model.
Why do we need Data Preprocessing?
• Real-world data generally contains noise and missing
values, and may be in an unusable format
• Preprocessing covers the tasks of cleaning the data and
making it suitable for a machine learning model
• It can increase the accuracy and efficiency of a
machine learning model
Steps in Data Preprocessing
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
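The last four steps can be sketched with NumPy alone. This is only an illustration under assumptions: the tiny dataset, the meaning of its columns, the 75/25 split, and the choice of mean imputation and standardization are all made up for the example. In practice, libraries such as Pandas and scikit-learn are commonly used for these steps.

```python
import numpy as np

# Hypothetical toy dataset: age and salary, with np.nan marking missing values.
numeric = np.array([[25.0, 50000.0],
                    [np.nan, 62000.0],
                    [47.0, np.nan],
                    [35.0, 58000.0]])
country = np.array(["India", "France", "India", "Spain"])  # categorical column

# Finding and filling missing data: replace each nan with its column mean.
missing = np.isnan(numeric)
col_means = np.nanmean(numeric, axis=0)
filled = np.where(missing, col_means, numeric)

# Encoding categorical data: one-hot encode the country column.
categories, codes = np.unique(country, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# Combine features, then split into training and test sets (75% / 25%).
X = np.hstack((filled, one_hot))
rng = np.random.default_rng(0)          # fixed seed so the split is repeatable
order = rng.permutation(len(X))
n_train = int(0.75 * len(X))
X_train, X_test = X[order[:n_train]], X[order[n_train:]]

# Feature scaling: standardize the numeric columns using training statistics only.
mean = X_train[:, :2].mean(axis=0)
std = X_train[:, :2].std(axis=0)
X_train[:, :2] = (X_train[:, :2] - mean) / std
X_test[:, :2] = (X_test[:, :2] - mean) / std

print(missing.sum(), "missing values filled")
print("train shape:", X_train.shape, "test shape:", X_test.shape)
```

The scaling statistics are computed from the training rows only and then reused for the test rows, so no information from the test set leaks into the preprocessing.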
Python Libraries for Data Preprocessing
• NumPy
• Pandas
• Matplotlib
NumPy: Numerical Python
• NumPy is used for working with arrays.
• It also has functions for working in the domains of
linear algebra, Fourier transforms, and
matrices.
• NumPy was created in 2005 by Travis
Oliphant.
• It is an open source project and we can use it
freely.
Import NumPy
• import numpy
• import numpy as np
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Create a NumPy ndarray Object
• The array object in NumPy is called ndarray.
• We can create a NumPy ndarray object by
using the array() function.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
Dimensions in Arrays
• 0-D Arrays
• 1-D Arrays
import numpy as np
arr = np.array(42)
print(arr)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Array cont…
• 2-D Arrays
• 3-D arrays
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Check Number of Dimensions?
• NumPy arrays provide the ndim attribute,
which returns an integer telling us how many
dimensions the array has.
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3],
[4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
NumPy Array Indexing
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Cont…
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd row: ', arr[1, 4])
Cont…
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])
Arrays, creation
• np.ones, np.zeros
• np.arange
• np.concatenate
• arr.astype (a method of the array, not a top-level np function)
• np.zeros_like,
np.ones_like
• np.random.random
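A short sketch of the functions listed above may help. The shapes and values are arbitrary choices for illustration, and the variable names are hypothetical.

```python
import numpy as np

ones = np.ones((2, 3))            # 2x3 array of 1.0 (float64 by default)
zeros = np.zeros(4)               # 1-D array of four 0.0 values
steps = np.arange(0, 10, 2)       # start, stop (exclusive), step -> [0 2 4 6 8]
joined = np.concatenate((zeros, steps))   # result is float because zeros is float
as_int = joined.astype(int)       # astype is an array method; returns a new array
like = np.zeros_like(ones)        # zeros with the same shape and dtype as `ones`
rand = np.random.random((2, 2))   # uniform floats in [0.0, 1.0)

print(steps)
print(joined.dtype, as_int.dtype)
print(like.shape, rand.shape)
```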
Arrays, danger zone
• Must be dense, no holes.
• Must be one type
• Arrays of different shapes cannot be combined element-wise
unless the shapes are compatible under broadcasting
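Two of these restrictions can be seen directly. The sketch below uses made-up arrays: it shows that mixed inputs are promoted to a single type, and that element-wise addition fails for incompatible shapes but works when NumPy can broadcast them.

```python
import numpy as np

# One type: mixing ints and a float promotes every element to float.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)                # float64

# Different shapes: element-wise addition fails when shapes are incompatible.
a = np.array([1, 2, 3])
b = np.array([1, 2])
try:
    a + b
    failed = False
except ValueError as err:
    failed = True
    print("cannot add:", err)

# Compatible shapes are broadcast: (2, 3) + (3,) adds `a` to each row.
grid = np.zeros((2, 3))
print(grid + a)
```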
Slicing arrays
• taking elements from one given index to
another given index.
• [start:end]
• [start:end:step]
• If we don't pass start, it is considered 0
• If we don't pass end, it is considered the length of
the array in that dimension
• If we don't pass step, it is considered 1
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[:4])
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
import numpy as np
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])
Data Types in NumPy
• strings - used to represent text data; the text is
enclosed in quotation marks, e.g. "ABCD"
• integer - used to represent integer numbers, e.g.
-1, -2, -3
• float - used to represent real numbers, e.g. 1.2,
42.42
• boolean - used to represent True or False
• complex - used to represent complex numbers,
e.g. 1.0 + 2.0j, 1.5 + 2.5j
Cont…
import numpy as np
arr = np.array([1, 2, 3, 4], dtype='i4')
print(arr)
print(arr.dtype)
import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype(int)
print(newarr)
print(newarr.dtype)
NumPy Array Shape/Reshape
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2)
print(newarr)
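Reshaping only works when the new shape holds exactly the same number of elements. The sketch below illustrates this with the same 12 values. The `-1` placeholder, which asks NumPy to infer one dimension, is not on the slide and is added here as an extra.

```python
import numpy as np

arr = np.arange(1, 13)            # 12 elements: 1..12
cube = arr.reshape(2, 3, 2)       # 2 * 3 * 2 == 12, so this works
print(cube.shape)                 # (2, 3, 2)

auto = arr.reshape(4, -1)         # -1 asks NumPy to infer that dimension
print(auto.shape)                 # (4, 3)

try:
    arr.reshape(5, 3)             # 5 * 3 == 15 != 12
    ok = True
except ValueError as err:
    ok = False
    print("reshape failed:", err)
```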
NumPy Array Iterating
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
for x in arr:
    print(x)
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]],
                [[7, 8, 9], [10, 11, 12]]])
for x in arr:
    for y in x:
        for z in y:
            print(z)
Iterating Arrays Using nditer()
import numpy as np
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in np.nditer(arr):
    print(x)
import numpy as np
arr = np.array([1, 2, 3])
for idx, x in np.ndenumerate(arr):
    print(idx, x)
Joining NumPy Arrays
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr)
Joining Arrays Using Stack Functions
• np.stack joins arrays along a new axis
• Stacking Along Rows: np.hstack
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2), axis=1)
print(arr)
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.hstack((arr1, arr2))
print(arr)
Stacking Along Columns: np.vstack
• Stacking Along Height (depth): np.dstack
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.vstack((arr1, arr2))
print(arr)
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.dstack((arr1, arr2))
print(arr)
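Comparing the result shapes side by side can make the difference between the stack functions easier to see. This sketch reuses the same two 1-D arrays; the `results` dictionary is just a convenience for printing.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

results = {
    "stack axis=0": np.stack((arr1, arr2), axis=0),
    "stack axis=1": np.stack((arr1, arr2), axis=1),
    "hstack": np.hstack((arr1, arr2)),
    "vstack": np.vstack((arr1, arr2)),
    "dstack": np.dstack((arr1, arr2)),
}
for name, out in results.items():
    print(f"{name:13s} shape={out.shape}")
```

For 1-D inputs, hstack produces one longer 1-D array, vstack produces one row per input, and dstack pairs the elements along a third axis.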
Splitting NumPy Arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
NumPy Searching Arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 0)
print(x)
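Note that np.where returns a tuple containing one index array per dimension, which is why the printed result is wrapped in parentheses. A small sketch of unpacking it for a 1-D array, with hypothetical variable names:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])
result = np.where(arr == 4)       # a tuple with one index array per dimension
indices = result[0]
print(indices)                    # [3 5 6]
print(arr[indices])               # [4 4 4]
```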
Sorting Arrays
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
import numpy as np
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))
Random Numbers in NumPy
• What is a Random Number?
– Random means something that cannot be
predicted logically.
• Generate Random Number
from numpy import random
x = random.randint(100)
print(x)
Generate Random Float
• Generate Random Array
– x = random.randint(100, size=(3, 5))
– x = random.rand(3, 5)
– x = random.choice([3, 5, 7, 9])
from numpy import random
x = random.rand()
print(x)
from numpy import random
x = random.randint(100, size=5)
print(x)
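The three array-generating calls listed on this slide can be run together as below. The seed call is an addition for this sketch, so that repeated runs give the same numbers; it is optional.

```python
from numpy import random

random.seed(0)                            # optional: makes the run repeatable

ints = random.randint(100, size=(3, 5))   # integers in [0, 100), shape (3, 5)
floats = random.rand(3, 5)                # floats in [0.0, 1.0), shape (3, 5)
pick = random.choice([3, 5, 7, 9])        # one value drawn from the list

print(ints.shape, floats.shape, pick)
```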
