NUMPY BASICS: ARRAYS AND VECTORIZED COMPUTATION
NumPy (Numerical Python) is a fundamental library in Python for numerical and scientific
computing. It provides support for arrays (multi-dimensional, homogeneous data structures)
and a wide range of mathematical functions to perform vectorized computations efficiently. This
guide will cover some of the basics of working with NumPy arrays and performing vectorized
computations.
Installing NumPy
Before using NumPy, you need to make sure it's installed. You can install it using pip:
pip install numpy
Importing NumPy
To use NumPy in your Python code, you should import it:
import numpy as np
By convention, it's common to import NumPy as np for brevity.
Creating NumPy Arrays
You can create NumPy arrays using various methods:
1. From Python Lists:
arr = np.array([1, 2, 3, 4, 5])
2. Using NumPy Functions:
zeros_arr = np.zeros(5) # Creates an array of zeros with 5 elements
ones_arr = np.ones(3) # Creates an array of ones with 3 elements
rand_arr = np.random.rand(3, 3) # Creates a 3x3 array with random values in the range [0, 1)
3. Using NumPy's Range Function:
range_arr = np.arange(0, 10, 2) # Creates an array with values [0, 2, 4, 6, 8]
BASIC ARRAY OPERATIONS
Once you have NumPy arrays, you can perform various operations on them:
1. Element-wise Operations:
NumPy allows you to perform element-wise operations, like addition, subtraction, multiplication,
and division:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b # Element-wise addition: [5, 7, 9]
d = a * b # Element-wise multiplication: [4, 10, 18]
2. Indexing and Slicing:
You can access individual elements and slices of NumPy arrays using indexing and slicing:
arr = np.array([0, 1, 2, 3, 4, 5])
element = arr[2] # Access element at index 2 (value: 2)
sub_array = arr[2:5] # Slice from index 2 to 4 (values: [2, 3, 4])
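One detail worth illustrating is that a basic slice is a view onto the original array rather than a copy. The sketch below reuses the array from above; the names view, independent, and evens are chosen here for illustration only.

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5])

# A basic slice is a view: it shares memory with the original array.
view = arr[2:5]
view[0] = 99            # this also changes arr[2]

# .copy() produces an independent array.
independent = arr[2:5].copy()
independent[1] = -1     # arr is not affected

# Boolean indexing keeps the elements where the mask is True (and returns a copy).
evens = arr[arr % 2 == 0]

print(arr)          # [ 0  1 99  3  4  5]
print(independent)  # [99 -1  4]
print(evens)        # [0 4]
```

If you need to modify a slice without touching the source array, call .copy() first.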
3. Array Shape and Reshaping:
You can check and change the shape of NumPy arrays:
arr = np.array([[1, 2, 3], [4, 5, 6]])
shape = arr.shape # Get the shape (2, 3)
reshaped = arr.reshape(3, 2) # Reshape the array to (3, 2)
4. Aggregation Functions:
NumPy provides functions to compute statistics on arrays:
arr = np.array([1, 2, 3, 4, 5])
mean = np.mean(arr) # Calculate the mean (average)
max_val = np.max(arr) # Find the maximum value
min_val = np.min(arr) # Find the minimum value
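On multi-dimensional arrays, the same aggregation functions accept an axis argument. The following is a small sketch with a made-up 2x3 array; the variable names are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

total = matrix.sum()              # aggregate over every element
col_sums = matrix.sum(axis=0)     # collapse the rows: one value per column
row_means = matrix.mean(axis=1)   # collapse the columns: one value per row

print(total)      # 21
print(col_sums)   # [5 7 9]
print(row_means)  # [2. 5.]
```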
VECTORIZED COMPUTATION
Vectorized computation in Python refers to performing operations on entire arrays or sequences
of data without the need for explicit loops. This approach leverages highly optimized, low-level
code to achieve faster and more efficient computations. The primary library for vectorized
computation in Python is NumPy.
Traditional Loop-Based Computation
In traditional Python programming, you might use explicit loops to perform operations on arrays
or lists. For example:
# Using loops to add two lists element-wise
list1 = [1, 2, 3]
list2 = [4, 5, 6]
result = []
for i in range(len(list1)):
    result.append(list1[i] + list2[i])
# Result: [5, 7, 9]
Vectorized Computation with NumPy
NumPy allows you to perform operations on entire arrays, making code more concise and efficient. Here's how
you can achieve the same result using NumPy:
import numpy as np
# Using NumPy for element-wise addition
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
result = arr1 + arr2
# Result: array([5, 7, 9])
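To get a rough feel for the difference in speed, you can time both approaches on larger inputs. This is only a sketch: the array size and the helper names add_with_loop and add_vectorized are assumptions made for the example, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000
list1 = list(range(n))
list2 = list(range(n))
arr1 = np.arange(n)
arr2 = np.arange(n)

def add_with_loop():
    # Explicit Python-level loop over the indices.
    return [list1[i] + list2[i] for i in range(n)]

def add_vectorized():
    # The loop runs inside NumPy's compiled code.
    return arr1 + arr2

loop_time = timeit.timeit(add_with_loop, number=10)
vec_time = timeit.timeit(add_vectorized, number=10)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```

Both functions compute the same values; only the place where the loop runs differs.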
INTRODUCTION TO PANDAS DATA STRUCTURES
Pandas is a popular Python library for data manipulation and analysis. It provides two primary data structures:
the DataFrame and the Series. These data structures are designed to handle structured data, making it easier
to work with datasets in a tabular format.
DataFrame:
A DataFrame is a 2-dimensional, labeled data structure that resembles a spreadsheet or SQL table.
It consists of rows and columns, where each column can have a different data type (e.g., integers, floats,
strings, or even custom data types).
You can think of a DataFrame as a collection of Series objects, where each Series is a column.
DataFrames are highly versatile and are used for a wide range of data analysis tasks, including data
cleaning, exploration, and transformation.
Here's a basic example of how to create a DataFrame using Pandas:
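A minimal sketch of DataFrame creation from a dictionary follows. The column names and values are invented sample data, not taken from any particular dataset.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column.
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "City": ["Paris", "London", "Berlin"],
}
df = pd.DataFrame(data)

print(df)
print(df.dtypes)  # each column keeps its own data type
```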
Series:
A Series is a one-dimensional labeled array that can hold data of any data type.
It is like a column in a DataFrame or a single variable in statistics.
Series objects are commonly used for time series data, as well as other one-dimensional data.
Key characteristics of a Pandas Series:
Homogeneous Data: Unlike a Python list, and like a NumPy array, a Pandas Series has a single data type (dtype)
shared by all of its values. For example, if you create a Series with integer values, it gets an integer dtype such
as int64. If you mix types, such as numbers and strings, Pandas falls back to the generic object dtype.
Labeled Data: Series have two parts: the data itself and an associated index. The index provides labels or
names for each data point in the Series. By default, Series have a numeric index starting from 0, but you can
specify custom labels if needed.
Size and Shape: A Series has a size (the number of elements) and shape (1-dimensional) but does not have
columns or rows like a DataFrame.
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
# Display the Series
print(series)
0 10
1 20
2 30
3 40
4 50
dtype: int64
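The index can also carry custom labels. The sketch below uses made-up labels and values to show label-based and position-based access, and what happens to the dtype when values of different types are mixed.

```python
import pandas as pd

# Custom labels instead of the default 0, 1, 2, ... index.
prices = pd.Series([10, 20, 30], index=["a", "b", "c"])

b_value = prices["b"]     # access by label
first = prices.iloc[0]    # access by integer position

# Mixing types makes Pandas fall back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])

print(b_value, first)  # 20 10
print(mixed.dtype)     # object
```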
Some common tasks you can perform with Pandas:
Data Loading: Pandas can read data from various sources, including CSV files, Excel spreadsheets, SQL
databases, and more.
Data Cleaning: You can clean and preprocess data by handling missing values, removing duplicates, and
transforming data types.
Data Selection: Easily select specific rows and columns of interest using various indexing techniques.
Data Aggregation: Perform group by operations, calculate statistics, and aggregate data based on specific
criteria.
Data Visualization: You can use Pandas in conjunction with visualization libraries like Matplotlib and
Seaborn to create informative plots and charts.
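As a rough illustration of the cleaning and aggregation tasks listed above, the sketch below builds a small in-memory table (the city and sales values are invented), removes incomplete and duplicate rows, and then averages by group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", "Rome", "Rome", "Rome"],
    "sales": [100.0, np.nan, 80.0, 80.0, 120.0],
})

# Data cleaning: drop rows with missing values, then drop exact duplicates.
cleaned = df.dropna().drop_duplicates()

# Data aggregation: mean sales per city.
mean_by_city = cleaned.groupby("city")["sales"].mean()

print(cleaned)
print(mean_by_city)
```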
A DataFrame in Python typically refers to a two-dimensional, size-mutable, and potentially heterogeneous
tabular data structure provided by the popular library called Pandas. It is a fundamental data structure for data
manipulation and analysis in Python.
Here's how you can work with DataFrames in Python using Pandas:
1. Import Pandas:
First, you need to import the Pandas library.
import pandas as pd
2. Creating a DataFrame:
You can create a DataFrame in several ways. Here are a few
common methods:
From a dictionary:
data = {'Column1': [value1, value2, ...],
'Column2': [value1, value2, ...]}
df = pd.DataFrame(data)
• From a list of lists:
data = [[value1, value2],
[value3, value4]]
df = pd.DataFrame(data, columns=['Column1', 'Column2'])
• From a CSV file:
df = pd.read_csv('file.csv')
3. Viewing Data:
You can use various methods to view and explore your DataFrame:
df.head(): Displays the first few rows of the DataFrame.
df.tail(): Displays the last few rows of the DataFrame.
df.shape: Returns the number of rows and columns.
df.columns: Returns the column names.
df.info(): Provides information about the DataFrame, including data types and non-null counts.
4. Selecting Data:
You can select specific columns or rows from a DataFrame using indexing or filtering. For example:
df['Column1'] # Select a specific column
df[['Column1', 'Column2']] # Select multiple columns
df[df['Column1'] > 5] # Filter rows based on a condition
5. Modifying Data:
You can modify the DataFrame by adding or modifying columns, updating values, or appending rows. For
example:
df['NewColumn'] = [new_value1, new_value2, ...] # Add a new column
df.at[index, 'Column1'] = new_value # Update a specific value
new_row = pd.DataFrame([{'Column1': value1, 'Column2': value2}])
df = pd.concat([df, new_row], ignore_index=True) # Append a new row (DataFrame.append() was removed in pandas 2.0)
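The placeholders above can be filled in to give a runnable sketch. The column names follow the placeholders used in this section; the values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Column1": [3, 8], "Column2": [1.5, 2.5]})

# Add a column computed from an existing one.
df["NewColumn"] = df["Column1"] * 2

# Update a single cell, addressed by row label and column name.
df.at[0, "Column1"] = 10

# Append a row by concatenating a one-row DataFrame.
new_row = pd.DataFrame([{"Column1": 7, "Column2": 3.5, "NewColumn": 14}])
df = pd.concat([df, new_row], ignore_index=True)

# Filter rows based on a condition.
filtered = df[df["Column1"] > 7]

print(df)
print(filtered)
```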
6. Data Analysis:
Pandas provides various functions for data analysis, such
as describe(), groupby(), agg(), and more.
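A short sketch of these three functions on a made-up table of team scores (the column names team and score are assumptions for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10, 20, 30, 50],
})

# describe() returns count, mean, std, min, quartiles, and max.
summary = df["score"].describe()

# groupby() splits the rows by team; agg() computes several statistics per group.
per_team = df.groupby("team")["score"].agg(["mean", "max"])

print(summary)
print(per_team)
```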
7. Saving Data:
You can save the DataFrame to a CSV file or other formats:
df.to_csv('output.csv', index=False)
INDEX OBJECTS-INDEXING, SELECTION, AND FILTERING
In Pandas, the Index object is a fundamental component of both Series and DataFrame data
structures.
It provides the labels or names for the rows or columns of your data. You can use indexing,
selection, and filtering techniques with these indexes to access specific data points or subsets of
your data. Here's how you can work with index objects in Pandas:
1. Indexing:
Indexing allows you to access specific elements or rows in your data using labels. You can use .loc[] for
label-based indexing and .iloc[] for integer-position-based indexing.
• Label-based indexing:
df.loc['label'] # Access a specific row by its label
df.loc['label', 'column_name'] # Access a specific element by label and column name
• Integer-based indexing:
df.iloc[0] # Access the first row
df.iloc[0, 1] # Access an element by row and column index
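The difference between the two accessors is easiest to see on a DataFrame with a non-numeric index. The sketch below uses invented fruit data; note that label slices with .loc[] include the end label, while position slices with .iloc[] exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [1.5, 2.0, 0.5], "qty": [10, 4, 25]},
    index=["apple", "banana", "cherry"],
)

cell_by_label = df.loc["banana", "qty"]   # row label, column name
cell_by_position = df.iloc[0, 1]          # row position, column position

label_slice = df.loc["apple":"banana"]    # end label is included: 2 rows
position_slice = df.iloc[0:1]             # end position is excluded: 1 row

print(cell_by_label)     # 4
print(cell_by_position)  # 10
```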
2. Selection:
You can use various methods to select specific data based on conditions or criteria.
• Select rows based on a condition:
df[df['Column'] > 5] # Select rows where 'Column' is greater than 5
• Select rows by multiple conditions:
df[(df['Column1'] > 5) & (df['Column2'] < 10)] # Rows where 'Column1' > 5 and 'Column2' < 10
3. Filtering:
Filtering allows you to create a boolean mask based on a condition and then apply that mask to your
DataFrame to select rows meeting the condition.
Create a boolean mask:
condition = df['Column'] > 5
Apply the mask to the DataFrame:
filtered_df = df[condition]
4. Setting a New Index:
You can set a specific column as the index of your DataFrame using the .set_index() method.
df.set_index('Column_Name', inplace=True)
5. Resetting the Index:
If you've set a column as the index and want to revert to the default integer-based index, you can use the
.reset_index() method.
df.reset_index(inplace=True)
6. Multi-level Indexing:
You can create DataFrames with multi-level indexes, allowing you to work with more complex hierarchical data
structures.
df.set_index(['Index1', 'Index2'], inplace=True)
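A minimal sketch of working with a two-level index follows. The column names Index1 and Index2 come from the line above; the value column and all of the data are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Index1": ["x", "x", "y"],
    "Index2": [1, 2, 1],
    "value": [10, 20, 30],
})

# Returns a new DataFrame; the original df is left unchanged.
indexed = df.set_index(["Index1", "Index2"])

# A tuple selects one row by both levels.
single = indexed.loc[("x", 2), "value"]

# A single outer label selects every row under it.
outer = indexed.loc["x"]

# reset_index() turns both levels back into ordinary columns.
restored = indexed.reset_index()

print(single)  # 20
print(outer)
```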
Index objects in Pandas are versatile and powerful for working with data because they enable you to
access and manipulate your data in various ways, whether it's for data retrieval, filtering, or
restructuring.
ARITHMETIC AND DATA ALIGNMENT IN PANDAS
Arithmetic and data alignment in Pandas refer to how mathematical operations are performed between Series and
DataFrames when they have different shapes or indices. Pandas automatically aligns data based on the labels of
the objects involved in the operation, which ensures that each value in the result is computed from correctly
matched entries. Here are some key aspects of arithmetic and data alignment in Pandas:
1. Automatic Alignment:
When you perform mathematical operations (e.g., addition, subtraction, multiplication, division) between two
Series or DataFrames, Pandas aligns the data based on their labels (index or column names). The operation is
computed only for labels present in both objects, while the result covers the union of the labels.
series1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
series2 = pd.Series([4, 5, 6], index=['B', 'C', 'D'])
result = series1 + series2
In this example, the result Series will have NaN values for the 'A' and 'D' labels because those labels don't match
between series1 and series2.
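The aligned result can be inspected directly. The sketch below repeats the two Series and also shows the add() method with fill_value, which treats a label missing from one side as 0 instead of producing NaN.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
series2 = pd.Series([4, 5, 6], index=["B", "C", "D"])

aligned = series1 + series2
# A    NaN
# B    6.0
# C    8.0
# D    NaN

filled = series1.add(series2, fill_value=0)
# A    1.0
# B    6.0
# C    8.0
# D    6.0

print(aligned)
print(filled)
```

The result is float even though both inputs hold integers, because NaN is a floating-point value.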
2. Missing Data (NaN):
When labels don't match, Pandas fills in the result with NaN (Not-a-Number) to indicate missing values.
3. DataFrame Alignment:
The same principles apply to DataFrames when performing operations between them. The alignment occurs both
for rows (based on the index) and columns (based on column names).
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]}, index=['Y', 'Z'])
result = df1 + df2
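In this case, result has rows 'X', 'Y', 'Z' and columns 'A', 'B', 'C'. Only the cell at row 'Y', column 'B' holds a
value (4 + 5 = 9), because it is the only label pair present in both df1 and df2. Every other cell is NaN.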
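To see the alignment on both axes, the sketch below prints the result for the two DataFrames above and counts the cells that hold a value. It also shows add() with fill_value=0, which fills a cell when only one side is missing it; a cell missing from both sides stays NaN.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["X", "Y"])
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]}, index=["Y", "Z"])

result = df1 + df2
print(result)
#     A    B   C
# X NaN  NaN NaN
# Y NaN  9.0 NaN
# Z NaN  NaN NaN

non_missing = int(result.notna().sum().sum())   # number of cells with a value

# Treat a value missing from only one side as 0.
result_filled = df1.add(df2, fill_value=0)
print(result_filled)
```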
4. Handling Missing Data:
You can use methods like .fillna() to replace NaN values with a specific value or use .dropna() to remove rows or
columns with missing data.
result_filled = result.fillna(0) # Replace NaN with 0
result_dropped = result.dropna() # Remove rows that contain NaN (pass axis=1 to remove columns instead)
5. Alignment with Broadcasting:
Pandas allows you to perform operations between a Series and a scalar value, and it broadcasts the scalar to
match the shape of the Series.
series = pd.Series([1, 2, 3])
scalar = 2
result = series * scalar
In this example, result will be a Series with values [2, 4, 6].
Automatic alignment in Pandas is a powerful feature that simplifies data manipulation and allows you to work
with datasets of different shapes without needing to manually align them. It ensures that operations are
performed in a way that maintains the integrity and structure of your data.
ARITHMETIC AND DATA ALIGNMENT IN NUMPY
NumPy also performs arithmetic element-wise, but unlike Pandas it has no labels to align on. It matches
elements purely by position and shape, and it is primarily focused on numerical computations with homogeneous
arrays (arrays of the same data type). Here's how arithmetic and shape matching work in NumPy:
Automatic Alignment:
NumPy arrays perform element-wise operations. If two arrays have different but compatible shapes, NumPy
broadcasts the smaller array across the larger one so that the shapes match. If the shapes are not compatible,
the operation raises a ValueError.
import numpy as np
arr1 = np.array([1, 2, 3]) # shape (3,)
arr2 = np.array([[10], [20]]) # shape (2, 1)
result = arr1 + arr2 # shape (2, 3)
In this example, NumPy broadcasts both arrays to shape (2, 3), resulting in [[11, 12, 13], [21, 22, 23]]. By
contrast, np.array([1, 2, 3]) + np.array([4, 5]) raises a ValueError, because the shapes (3,) and (2,) are not
compatible.
Broadcasting Rules:
NumPy follows specific rules when broadcasting arrays:
If the arrays have a different number of dimensions, pad the smaller shape with ones on the left side.
Compare the shapes element-wise, starting from the right. If dimensions are equal or one of them is 1, they are
compatible.
If the dimensions are incompatible, NumPy raises a "ValueError: operands could not be broadcast together" error.
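The rules above can be checked with a few small arrays. This is a sketch with made-up shapes; the variable names are illustrative.

```python
import numpy as np

a = np.ones((2, 3))
b = np.array([10, 20, 30])    # shape (3,)  is treated as (1, 3)
c = np.array([[1], [2]])      # shape (2, 1)

shape_ab = (a + b).shape      # (2, 3) with (1, 3): compatible
shape_bc = (b + c).shape      # (1, 3) with (2, 1): both are stretched

# Shapes (3,) and (2,) differ in the last dimension and neither is 1.
try:
    np.array([1, 2, 3]) + np.array([4, 5])
except ValueError as exc:
    message = str(exc)

print(shape_ab)  # (2, 3)
print(shape_bc)  # (2, 3)
print(message)
```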
Handling Missing Data:
NumPy does not introduce missing values the way Pandas alignment does. NaN exists in NumPy as the floating-point
value np.nan, but arithmetic never inserts it to pad mismatched arrays. If you perform operations between arrays
with mismatched shapes, NumPy will either broadcast or raise an error, depending on whether broadcasting is
possible.
Element-Wise Operations:
NumPy performs arithmetic operations element-wise by default. This means that each element in the resulting
array is the result of applying the operation to the corresponding elements in the input arrays.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
result = arr1 * arr2
In this case, result will be [4, 10, 18].
WHAT IS VECTORIZATION?
Vectorization speeds up Python code by replacing explicit loops with operations on whole arrays.
Because the looping happens in optimized, compiled code, vectorized functions can reduce running time considerably.
Common vector operations include the dot product, also known as the scalar product because it produces a
single number; the outer product, which for two vectors of length n produces an n x n matrix; and
element-wise multiplication, which multiplies the elements at the same index and leaves the dimensions
unchanged.
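The three operations can be sketched with two small vectors. The vectors u and v are made-up examples.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

dot = np.dot(u, v)       # 1*4 + 2*5 + 3*6, a single number
outer = np.outer(u, v)   # 3 x 3 matrix with outer[i, j] = u[i] * v[j]
elementwise = u * v      # same shape as the inputs

print(dot)          # 32
print(outer)
# [[ 4  5  6]
#  [ 8 10 12]
#  [12 15 18]]
print(elementwise)  # [ 4 10 18]
```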
APPLYING FUNCTIONS AND MAPPING
In NumPy, you can apply functions and perform element-wise operations on arrays using various techniques,
including built-in vectorized functions, np.apply_along_axis(), and np.vectorize(), which can also be used
for mapping operations. Here's an overview of these approaches:
Vectorized Functions:
NumPy is designed to work efficiently with vectorized operations, meaning you can apply functions to entire
arrays or elements of arrays without the need for explicit loops. NumPy provides built-in functions that can be
applied element-wise to arrays.
import numpy as np
arr = np.array([1, 2, 3, 4])
# Applying a function element-wise
result = np.square(arr) # Square each element
In this example, the np.square() function is applied element-wise to the arr array.
HOW TO CREATE YOUR OWN UFUNC
To create your own ufunc (universal function), define a function as you would any normal Python function,
then convert it to a NumPy ufunc with the frompyfunc() function.
Built-in ufuncs are used to implement vectorization in NumPy, which is much faster than iterating over elements in Python.
They also provide broadcasting and additional methods like reduce, accumulate etc. that are very helpful for
computation.
A ufunc created with frompyfunc() still calls the Python function once per element and returns an array of dtype
object, so it offers the ufunc interface rather than the speed of a built-in ufunc.
ufuncs also take additional keyword arguments, such as where, dtype, and out.
The frompyfunc() function takes the following arguments:
1. function - the Python function to wrap.
2. inputs - the number of input arguments (arrays).
3. outputs - the number of output arrays.
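A minimal sketch of the three arguments in use follows. The function name my_add and the input values are assumptions made for this example.

```python
import numpy as np

def my_add(x, y):
    # Ordinary Python function operating on two scalars.
    return x + y

# 2 input arguments, 1 output.
add_ufunc = np.frompyfunc(my_add, 2, 1)

result = add_ufunc([1, 2, 3], [10, 20, 30])   # applied element-wise
as_int = result.astype(int)                   # convert from dtype object

# Like any ufunc, it broadcasts a scalar against an array.
shifted = add_ufunc(np.arange(3), 10)

print(type(add_ufunc))  # <class 'numpy.ufunc'>
print(result.dtype)     # object
print(as_int)           # [11 22 33]
```

Because the result has dtype object, a conversion such as astype(int) is usually the next step before further numeric work.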
np.apply_along_axis():
You can use the np.apply_along_axis() function to apply a function along a specified axis of a multi-dimensional
array. This is useful when you want to apply a function to each row or column of a 2D array.
import numpy as np
arr = np.array([[1, 2, 3],
[4, 5, 6]])
# Apply a function along the rows (axis=1)
def sum_of_row(row):
    return np.sum(row)
result = np.apply_along_axis(sum_of_row, axis=1, arr=arr)
In this example, sum_of_row is applied to each row along axis=1, resulting in the 1D array [6, 15].
np.vectorize():
The np.vectorize() function allows you to create a vectorized version of a Python function, which can then be
applied element-wise to NumPy arrays.
import numpy as np
arr = np.array([1, 2, 3, 4])
# Define a Python function
def my_function(x):
    return x * 2
# Create a vectorized version of the function
vectorized_func = np.vectorize(my_function)
# Apply the vectorized function to the array
result = vectorized_func(arr)
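This approach is useful when you have a custom function that you want to apply to an array. Note that
np.vectorize() is provided for convenience rather than performance: internally it is essentially a Python loop.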
Mapping with np.vectorize():
You can use np.vectorize() to map a function to each element of an array.
import numpy as np
arr = np.array([1, 2, 3, 4])
# Define a Python function
def my_function(x):
    return x * 2
# Create a vectorized version of the function
vectorized_func = np.vectorize(my_function)
# Map the function to each element
result = vectorized_func(arr)
Mapping uses the same element-wise mechanism. It is convenient when the mapping logic, for example
branching on each value, is not available as a built-in NumPy function.
These methods allow you to apply functions and perform mapping operations efficiently on NumPy
arrays, making it a powerful library for numerical and scientific computing tasks.
SORTING AND RANKING
Sorting and ranking are common data manipulation operations in data analysis and are widely supported in
Python through libraries like NumPy and Pandas. These operations help organize data in a desired order or
rank elements based on specific criteria. Here's how to perform sorting and ranking in both libraries:
Sorting in NumPy:
In NumPy, you can sort NumPy arrays using the np.sort() and np.argsort() functions.
np.sort(): This function returns a new sorted array without modifying the original array.
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
sorted_arr = np.sort(arr)
np.sort() returns the sorted array, whereas np.argsort() returns an array of the corresponding indices.
[Figure: a sorting algorithm transforming the unsorted array [10, 6, 8, 2, 5, 4, 9, 1] into the sorted
array [1, 2, 4, 5, 6, 8, 9, 10].]
np.argsort(): This function returns the indices that would sort the array. You can use these indices to sort the
original array.
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
indices = np.argsort(arr)
sorted_arr = arr[indices]
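The indices from np.argsort() are most useful for reordering a second array in step with the first. The sketch below uses invented names and ages.

```python
import numpy as np

names = np.array(["Alice", "Bob", "Charlie", "David"])
ages = np.array([25, 30, 22, 35])

order = np.argsort(ages)             # indices that sort ages in ascending order
names_by_age = names[order]          # reorder names to match
names_oldest_first = names[order[::-1]]   # reverse the order for descending

print(order)               # [2 0 1 3]
print(names_by_age)        # ['Charlie' 'Alice' 'Bob' 'David']
print(names_oldest_first)  # ['David' 'Bob' 'Alice' 'Charlie']
```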
Sorting in Pandas:
In Pandas, you can sort Series and DataFrames using the sort_values() method. You can specify the column(s)
to sort by and the sorting order.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 22, 35]}
df = pd.DataFrame(data)
# Sort by 'Age' column in ascending order
sorted_df = df.sort_values(by='Age', ascending=True)
Ranking in NumPy:
NumPy doesn't have a built-in ranking function, but you can apply np.argsort() twice to get the rank of each element.
Tied values receive distinct consecutive ranks rather than a shared rank.
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
indices = np.argsort(arr)
ranked_arr = np.argsort(indices) + 1 # Add 1 to start ranking from 1 instead of 0
Ranking in Pandas:
In Pandas, you can rank data using the rank() method. You can specify the sorting order and how to handle
ties (e.g., assigning the average rank to tied values).
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 22, 30]}
df = pd.DataFrame(data)
# Rank by 'Age' column in descending order and assign average rank to tied values
df['Rank'] = df['Age'].rank(ascending=False, method='average')
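The effect of the method argument shows up only where values tie. The sketch below ranks the same ages as above with three tie-handling methods.

```python
import pandas as pd

ages = pd.Series([25, 30, 22, 30])

# The two 30s occupy ranks 1 and 2 in descending order.
average = ages.rank(ascending=False, method="average")  # ties share the mean rank
minimum = ages.rank(ascending=False, method="min")      # ties share the lowest rank
dense = ages.rank(ascending=False, method="dense")      # like min, but no gaps after ties

print(average.tolist())  # [3.0, 1.5, 4.0, 1.5]
print(minimum.tolist())  # [3.0, 1.0, 4.0, 1.0]
print(dense.tolist())    # [2.0, 1.0, 3.0, 1.0]
```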
SUMMARIZING AND COMPUTING DESCRIPTIVE STATISTICS
1. Summary Statistics:
NumPy provides functions to compute summary statistics directly on arrays.
import numpy as np
data = np.array([25, 30, 22, 35, 28])
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)
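By default np.std() and np.var() compute the population statistics, dividing by N. Passing ddof=1 gives the sample versions, dividing by N - 1. The sketch below reuses the data array from above.

```python
import numpy as np

data = np.array([25, 30, 22, 35, 28])

population_var = np.var(data)        # sum of squared deviations / N
sample_var = np.var(data, ddof=1)    # sum of squared deviations / (N - 1)
sample_std = np.std(data, ddof=1)    # square root of the sample variance

print(population_var)  # approximately 19.6
print(sample_var)      # approximately 24.5
```

Pandas uses ddof=1 by default in its std() and var() methods, so the two libraries give different answers on the same data unless ddof is set explicitly.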
2. Percentiles and Quartiles:
You can compute specific percentiles and quartiles using the np.percentile() function.
percentile_25 = np.percentile(data, 25)
percentile_75 = np.percentile(data, 75)
3. Correlation and Covariance:
You can compute correlation and covariance between arrays using np.corrcoef() and np.cov().
# data1 and data2 stand for any two 1-D numeric arrays of equal length
correlation_matrix = np.corrcoef(data1, data2)
covariance_matrix = np.cov(data1, data2)
CORRELATION AND COVARIANCE
In NumPy, you can compute correlation and covariance between arrays using the np.corrcoef() and np.cov()
functions, respectively. These functions are useful for analyzing relationships and dependencies between
variables. Here's how to use them:
Computing Correlation Coefficient (Correlation):
The correlation coefficient measures the strength and direction of a linear relationship between two variables.
It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear
correlation.
import numpy as np
# Create two arrays representing variables
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
# Compute the correlation coefficient between x and y
correlation_matrix = np.corrcoef(x, y)
# The correlation coefficient is in the (0, 1) element of the matrix
correlation_coefficient = correlation_matrix[0, 1]
In this example, correlation_coefficient will contain the Pearson correlation coefficient between x and y. It is
1.0 here (up to floating-point rounding), because y = x + 1 is an exact linear function of x.
Computing Covariance:
Covariance measures the degree to which two variables change together. Positive values indicate a positive
relationship (both variables increase or decrease together), while negative values indicate an inverse
relationship (one variable increases as the other decreases).
import numpy as np
# Create two arrays representing variables
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
# Compute the covariance between x and y
covariance_matrix = np.cov(x, y)
# The covariance is in the (0, 1) element of the matrix
covariance = covariance_matrix[0, 1]
In this example, covariance will contain the covariance between x and y, which is 2.5 here. By default,
np.cov() computes the sample covariance, dividing by N - 1.
Both np.corrcoef() and np.cov() can accept multiple arrays as input, allowing you to compute correlations and
covariances for multiple variables simultaneously. For example, if you have a dataset with multiple columns,
you can compute the correlation matrix or covariance matrix for all pairs of variables.
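A sketch of the multi-variable case follows, with three made-up variables. By default both functions treat each row as a variable; pass rowvar=False when the variables are stored in columns instead.

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 6]   # rises with x
z = [5, 4, 3, 2, 1]   # falls as x rises

stacked = np.array([x, y, z])   # shape (3, 5): one row per variable

corr = np.corrcoef(stacked)     # 3 x 3 correlation matrix
cov = np.cov(stacked)           # 3 x 3 covariance matrix (divides by N - 1)

# Same data with one column per variable.
corr_from_columns = np.corrcoef(stacked.T, rowvar=False)

print(np.round(corr, 2))
print(cov)
```

Entry [i, j] of each matrix describes the pair formed by variable i and variable j, and the diagonal of the covariance matrix holds each variable's variance.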