Groupby
groupby() splits a DataFrame into groups based on one or more
columns, applies a function to each group, and combines the results.
Use the groupby() method on the Store column and calculate the total sales by
summing the Amount column
df_sales = pd.DataFrame({
'Store': ['Store A', 'Store A', 'Store B', 'Store C', 'Store B', 'Store C'],
'Amount': [100, 200, 150, 300, 250, 100]
})
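A minimal sketch of the task above, assuming pandas is imported as pd. The variable name total_sales is only illustrative.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Store': ['Store A', 'Store A', 'Store B', 'Store C', 'Store B', 'Store C'],
    'Amount': [100, 200, 150, 300, 250, 100],
})

# Split the rows by Store, then sum the Amount column within each group.
# The result is a Series indexed by store name.
total_sales = df_sales.groupby('Store')['Amount'].sum()
print(total_sales)
```

Calling .reset_index() on the result turns it back into a regular DataFrame with Store and Amount columns, if that shape is easier to work with.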
3.
Exercise
You have a DataFrame that contains grades for different subjects. Each
row represents a student's grade in a specific subject.
Find the average grade for each subject
df_grades = pd.DataFrame({
'Subject': ['Math', 'Science', 'Math', 'English', 'Science', 'English'],
'Grade': [85, 90, 78, 88, 92, 81]
})
Try reading the data from the gapminder-FiveYearData Excel file
4.
Merge
Merging combines two DataFrames based on a key or common column(s).
This is similar to SQL joins, where you combine data from different
sources based on a shared identifier (which can span more than one column/feature).
df_employees = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df_salaries = pd.DataFrame({'EmployeeID': [1, 2], 'Salary': [50000, 55000]})
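A short sketch of merging these two DataFrames on EmployeeID, assuming pandas is imported as pd. The names inner and left are only illustrative; how='inner' is the default for pd.merge.

```python
import pandas as pd

df_employees = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df_salaries = pd.DataFrame({'EmployeeID': [1, 2], 'Salary': [50000, 55000]})

# Inner merge: keep only EmployeeIDs present in both DataFrames (1 and 2).
inner = pd.merge(df_employees, df_salaries, on='EmployeeID', how='inner')

# Left merge: keep every employee; Charlie has no salary row, so Salary is NaN.
left = pd.merge(df_employees, df_salaries, on='EmployeeID', how='left')

print(inner)
print(left)
```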
5.
Exercise
Create a table that only includes employees who have a valid department
Using these DataFrames:
df_employees = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'DepartmentID': [1, 2, 1, 4]
})
df_departments = pd.DataFrame({
'DepartmentID': [1, 2, 3, 5],
'DepartmentName': ['HR', 'IT', 'Sales', 'Marketing']
})
Try it with an outer merge, then drop
the rows that contain NaN
6.
Join
The join() method in Pandas combines two DataFrames based
on their index (row labels) rather than columns. It’s useful when
both DataFrames already share the same index.
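A minimal sketch of an index-based join. The data here (df_prices, df_stock) is hypothetical and not from the slides; it only illustrates that join() aligns rows by index label and does a left join by default.

```python
import pandas as pd

# Hypothetical data: both DataFrames use the product name as the index.
df_prices = pd.DataFrame({'Price': [1.5, 0.5, 2.0]},
                         index=['apple', 'banana', 'cherry'])
df_stock = pd.DataFrame({'Stock': [10, 0]},
                        index=['apple', 'banana'])

# Default how='left': keep every row of df_prices; 'cherry' gets NaN for Stock.
joined = df_prices.join(df_stock)

# how='inner': keep only index labels present in both DataFrames.
joined_inner = df_prices.join(df_stock, how='inner')

print(joined)
print(joined_inner)
```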
7.
Exercise
Ensure all students in the df_students DataFrame are included, even if
there is no corresponding score in df_scores
df_students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [20, 21, 22, 23]
}, index=['S001', 'S002', 'S003', 'S004'])
df_scores = pd.DataFrame({
'Math': [85, 90, 78, 88],
'Science': [92, 80, 85, 91]
}, index=['S001', 'S002', 'S003', 'S005'])
8.
Concat
Concatenation in Pandas stacks or combines DataFrames along
a particular axis (rows or columns): it puts two or more
DataFrames together either vertically (like stacking one on top of
another) or horizontally (side by side).
df_jan = pd.DataFrame({'Store': ['A', 'B'], 'Sales': [200, 150]})
df_feb = pd.DataFrame({'Store': ['A', 'B'], 'Sales': [250, 100]})
df_combined = pd.concat([df_jan, df_feb], ignore_index=True)
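The call above stacks the two months vertically. A sketch of the horizontal case follows, assuming we first set Store as the index and rename each Sales column (Jan, Feb) so the result has distinct column names. Those renames are an illustrative choice, not part of the original example.

```python
import pandas as pd

df_jan = pd.DataFrame({'Store': ['A', 'B'], 'Sales': [200, 150]})
df_feb = pd.DataFrame({'Store': ['A', 'B'], 'Sales': [250, 100]})

# Vertical: rows of df_feb are appended below df_jan; index is renumbered 0..3.
stacked = pd.concat([df_jan, df_feb], ignore_index=True)

# Horizontal: align on the Store index and place the months side by side.
side_by_side = pd.concat(
    [
        df_jan.set_index('Store')['Sales'].rename('Jan'),
        df_feb.set_index('Store')['Sales'].rename('Feb'),
    ],
    axis=1,
)

print(stacked)
print(side_by_side)
```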
9.
Pivot
Pivoting reshapes data by turning unique values from one column into
separate columns. It’s useful for reorganizing data for easy analysis or
visualization.
Determine each product's sales in each month
df_sales = pd.DataFrame({ 'Product': ['A', 'A', 'B', 'B'], 'Region': ['North', 'South',
'North', 'South'], 'Month': ['Jan', 'Jan', 'Feb', 'Feb'], 'Sales': [100, 150, 200, 250] })
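One possible sketch for this task. Each Product/Month pair appears twice here (once per Region), so a plain pivot() would fail on the duplicate entries; this sketch instead uses pivot_table() with aggfunc='sum' to add the regions together. The fill_value=0 for missing months and the name monthly are illustrative choices.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Region': ['North', 'South', 'North', 'South'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
    'Sales': [100, 150, 200, 250],
})

# Rows: one per Product. Columns: one per Month. Cells: summed Sales.
monthly = df_sales.pivot_table(
    index='Product',
    columns='Month',
    values='Sales',
    aggfunc='sum',
    fill_value=0,  # months with no sales for a product show 0 instead of NaN
)
print(monthly)
```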
10.
Exercise
Transform this data to show the total sales per store on specific dates
df_sales = pd.DataFrame({
'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
'date': ['2024-10-01', '2024-10-02', '2024-10-01', '2024-10-02', '2024-10-01', '2024-10-02'],
'product': ['A', 'A', 'B', 'B', 'A', 'B'],
'sales': [100, 120, 150, 160, 90, 110]
})
Identify the Index Column: In this case, we want each store to be the index, so store will be the index.
Identify the Columns to Expand: We want each date to become a column in the resulting DataFrame.
Values to Fill: We want to fill each cell with the sales amount for each product in each store and date.
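The three steps above map onto the index, columns, and values arguments. A possible sketch, assuming pivot_table() with aggfunc='sum' so that totals would still be correct if a store had several rows for the same date; the name by_date is only illustrative.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
    'date': ['2024-10-01', '2024-10-02', '2024-10-01', '2024-10-02', '2024-10-01', '2024-10-02'],
    'product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'sales': [100, 120, 150, 160, 90, 110],
})

by_date = df_sales.pivot_table(
    index='store',    # step 1: each store becomes a row
    columns='date',   # step 2: each date becomes a column
    values='sales',   # step 3: cells hold the sales amount
    aggfunc='sum',    # total the sales if a store/date pair repeats
)
print(by_date)
```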