This document provides an introduction to data analysis using Pandas and NumPy. It discusses the key data structures in Pandas like Series and DataFrames, and how to load CSV files into DataFrames. It also covers common DataFrame methods for exploring data like shape, head, tail, info, and describe. The document then discusses data cleansing techniques. Finally, it introduces NumPy, describing it as a memory efficient library for scientific computing with N-dimensional arrays and various array manipulation functions.
4. Data Analysis: the process of discovering useful
information in raw data to support data-driven
business decisions. It is the detailed examination of the
elements or structure of something.
Data Analytics: the systematic computational analysis
of data or statistics.
5. Process Flow of Data Analysis:
• Requirements gathering and planning
• Data collection
• Data cleansing
• Data preparation
• Data analysis
• Data interpretation and result summarization
• Data visualization
7. Pandas Data Structure
Series
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
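A minimal sketch of the Series described above, assuming pandas is installed. The values, labels, and variable names are made up for illustration and do not come from the slides.

```python
import pandas as pd

# A Series built from a plain Python list; pandas assigns a default integer index.
ages = pd.Series([25, 32, 47], name="age")

# A Series can hold other data types too, here strings with custom labels.
names = pd.Series(["Ana", "Ben", "Cy"], index=["a", "b", "c"], name="name")

print(ages)
print(names["b"])  # label-based access to a single element
```

Like a table column, a Series has a single name and one value per index label.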
8. Pandas Data Structure
DataFrame
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional
array or a table with rows and columns.
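A small sketch of a DataFrame as a table of rows and columns. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [25, 32, 47],
})

print(df)

age_column = df["age"]  # selecting one column gives a Series
first_row = df.iloc[0]  # selecting one row by position also gives a Series
```

Each column of a DataFrame is itself a Series, which is how the two structures relate.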
9. Pandas Data Structure
Load CSV file
• A simple way to store large data sets is to use CSV (comma-separated
values) files.
• CSV files contain plain text and are a well-known format that almost any
program, including Pandas, can read.
employees.csv
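A hedged sketch of loading CSV data with pandas.read_csv. The slides do not list the actual columns of employees.csv, so the columns below are hypothetical, and an in-memory string stands in for the file so the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file; these columns are hypothetical.
csv_text = """name,department,salary
Ana,Sales,52000
Ben,IT,61000
Cy,IT,58000
"""

# read_csv accepts a file path or a file-like object.
# With a real file on disk this would be: pd.read_csv("employees.csv")
employees = pd.read_csv(io.StringIO(csv_text))

print(employees)
```

The first line of the CSV text is used as the column header by default.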
11. Exploring data of a DataFrame
• DataFrame.shape
The shape attribute returns a tuple with the number of rows and columns.
For example, a DataFrame with 320 rows and 9 columns has the shape (320, 9).
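A minimal sketch of reading the shape of a DataFrame. It uses a tiny made-up table rather than the 320-row data set from the slides.

```python
import pandas as pd

# A small illustrative table with 4 rows and 3 columns.
df = pd.DataFrame({
    "a": range(4),
    "b": list("wxyz"),
    "c": [0.5, 1.5, 2.5, 3.5],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = df.shape
print(df.shape)  # (4, 3)
```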
12. Exploring data of a DataFrame
• DataFrame.head(n) and DataFrame.tail(n)
head(n) returns the first n rows and tail(n) returns the last n rows; n defaults to 5.
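A short sketch of head and tail on a made-up ten-row DataFrame; the column name is hypothetical.

```python
import pandas as pd

# Ten rows holding the numbers 1 to 10.
df = pd.DataFrame({"n": range(1, 11)})

first_three = df.head(3)  # rows holding 1, 2, 3
last_two = df.tail(2)     # rows holding 9, 10
default_head = df.head()  # without an argument, n is 5

print(first_three)
print(last_two)
```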
13. Exploring data of a DataFrame
• DataFrame.info(): a useful tool for getting a quick overview of a
DataFrame. It can be used to identify the data types of the
columns, the number of rows and columns, and the memory
usage of the DataFrame. This information can be helpful for
understanding the DataFrame and for planning further analysis.
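A sketch of calling info() on a small made-up DataFrame. By default info() prints its summary; passing a buffer through the buf parameter, as assumed here, captures the same text as a string.

```python
import io

import pandas as pd

# One missing name so that the non-null counts differ between columns.
df = pd.DataFrame({
    "name": ["Ana", "Ben", None],
    "age": [25, 32, 47],
})

# Prints the index range, column names, non-null counts, dtypes and memory usage.
df.info()

# The same summary captured as text instead of being printed.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
```

The non-null counts in the summary are a quick way to spot columns with missing values.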
14. The DataFrame.describe() method
calculates the following statistics for each
numeric column in the DataFrame:
• Count: the number of non-null values in the column.
• Mean: the average value of the column.
• Standard deviation: the standard deviation of the column.
• Minimum: the minimum value in the column.
• 25%: the 25th percentile of the column.
• 50%: the 50th percentile of the column, also known as the median.
• 75%: the 75th percentile of the column.
• Maximum: the maximum value in the column.
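A minimal sketch of describe() on made-up data. It assumes the default behaviour, in which only numeric columns are summarised, so the text column is left out of the result.

```python
import pandas as pd

df = pd.DataFrame({
    "score": [10, 20, 30, 40, 50],
    "label": list("abcde"),  # non-numeric, skipped by default
})

stats = df.describe()
print(stats)

# Individual statistics can be looked up by row label and column name.
mean_score = stats.loc["mean", "score"]
median_score = stats.loc["50%", "score"]
```

The result is itself a DataFrame whose row labels are count, mean, std, min, 25%, 50%, 75% and max.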
15. The DataFrame.dtypes attribute gives the data
types of the columns in a DataFrame. It is an
attribute, not a method, so it is used without
parentheses. It returns a Series whose index
holds the column names and whose values
hold the data type of each column.
The data types reported by DataFrame.dtypes
include:
• object: strings, lists, or other non-numeric data.
• int64: integers.
• float64: floating-point numbers.
• datetime64[ns]: dates and times.
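A hedged sketch of inspecting column data types on a made-up DataFrame. The exact dtype names that are printed (for example for text or datetime columns) can differ between pandas versions, so the example checks the kind of each type rather than its exact name.

```python
import pandas as pd

# Hypothetical columns covering text, integers, floats and dates.
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "age": [25, 32],
    "height": [1.65, 1.80],
    "joined": pd.to_datetime(["2021-01-15", "2022-06-01"]),
})

# dtypes is an attribute: a Series indexed by column name.
types = df.dtypes
print(types)

# Helper functions from pandas.api.types test the kind of a dtype.
age_is_integer = pd.api.types.is_integer_dtype(types["age"])
height_is_float = pd.api.types.is_float_dtype(types["height"])
joined_is_datetime = pd.api.types.is_datetime64_any_dtype(types["joined"])
```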
17. Data Cleansing
• Handling duplicate data
• Dropping or deleting duplicate records
• Handling missing values in data
• Dropping rows that have missing data, or filling in missing values
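The cleansing steps above can be sketched as follows. The data is made up, and filling the gap with the column mean is only one possible choice, not something the slides prescribe.

```python
import pandas as pd

# One exact duplicate row (Ben, 32) and one missing age.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy"],
    "age": [25, 32, 32, None],
})

# Handling duplicates: count them, then drop the repeated records.
n_duplicates = df.duplicated().sum()
deduplicated = df.drop_duplicates()

# Handling missing values, option 1: drop rows that contain a missing value.
dropped = deduplicated.dropna()

# Option 2: fill the missing age, here with the mean of the known ages.
filled = deduplicated.fillna({"age": deduplicated["age"].mean()})

print(filled)
```

Each of these methods returns a new DataFrame and leaves the original unchanged.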
19. Why NumPy?
NumPy (Numerical Python):
• Is a widely used Python library for scientific computing
• Is memory efficient and fast
• Provides N-dimensional array objects and a rich collection of
routines to process and analyse them
• Stores homogeneous arrays (all elements share the same data type)
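A small sketch of what a homogeneous array means in practice, using made-up values. When the inputs mix integers and floats, NumPy converts them to one common type; operations are then applied to every element at once.

```python
import numpy as np

ints = np.array([1, 2, 3])      # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])   # mixed input: every element is stored as a float

print(ints.dtype, mixed.dtype)

# Arithmetic is applied element by element without an explicit Python loop.
doubled = ints * 2
print(doubled)  # [2 4 6]
```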
20. • To create an ndarray, pass a list, tuple, or any array-like
object to the numpy.array() function, which converts it into
an ndarray.
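A minimal sketch of creating ndarrays from a list, a tuple, and a nested list; the values are illustrative only.

```python
import numpy as np

from_list = np.array([1, 2, 3])
from_tuple = np.array((4, 5, 6))

# A nested list produces a two-dimensional array.
from_nested = np.array([[1, 2, 3], [4, 5, 6]])

print(type(from_list))
print(from_nested.ndim, from_nested.shape)  # 2 (2, 3)
```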
23. NumPy array manipulation
Function      Description
reshape()     Returns an array with a new shape without changing its data
flat          A one-dimensional iterator over the array (an attribute, not a method); indexing it returns the element at that position
flatten()     Returns a one-dimensional copy of the input array
ravel()       Returns a one-dimensional array, as a view of the input where possible
transpose()   Permutes the axes of the array (reverses them by default)
resize()      Like reshape(), but the ndarray.resize() method modifies the array in place
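The functions in the table can be tried on a small made-up array. This sketch highlights the difference between a copy (flatten) and a view (ravel, for a contiguous array such as this one), and between reshape, which returns a result, and ndarray.resize, which changes the array in place.

```python
import numpy as np

a = np.arange(6)           # [0 1 2 3 4 5]
grid = a.reshape(2, 3)     # new shape, same data

third = grid.flat[2]       # element at flat position 2

flat_copy = grid.flatten() # independent copy
flat_view = grid.ravel()   # view of the same data for this contiguous array

flat_copy[0] = 99          # does not affect grid
flat_view[0] = -1          # also changes grid[0, 0]

swapped = grid.transpose() # shape (3, 2)

b = np.arange(6)
b.resize((3, 2))           # changes b itself and returns None

print(grid)
print(swapped.shape, b.shape)
```

Writing through the ravel() result changes the original array, which is why flatten() is the safer choice when an independent copy is needed.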