Session 2: Data Preprocessing
Lay Puthineath
Contents
- Introducing Data Analysis
- Introducing Pandas
- Introducing NumPy
Data Analysis?
Data Analysis: the process of discovering useful information in raw data to support data-driven business decisions. It is the detailed examination of the elements or structure of something.
Data Analytics: the systematic computational analysis of data or statistics.
Process Flow of Data Analysis:
• Requirements gathering and planning
• Data collection
• Data cleansing
• Data preparation
• Data analysis
• Data interpretation and result summarization
• Data visualization
Why use Pandas?
Pandas Data Structure
Series
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
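A minimal sketch of a Series, assuming pandas is installed. The values and the name `age` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical values: a single "column" of ages.
ages = pd.Series([25, 32, 47], name="age")

print(ages)       # the values alongside the default integer index
print(ages.ndim)  # a Series always has exactly one dimension

# A Series can hold values of different types; its dtype is then object.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```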
Pandas Data Structure
DataFrame
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
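As a rough illustration, a DataFrame can be built from a dictionary that maps column names to lists of values. The column names and values here are made up for the example.

```python
import pandas as pd

# Hypothetical employee records: each key becomes a column.
employees = pd.DataFrame({
    "name": ["Ana", "Ben", "Chi"],
    "salary": [52000, 61000, 58000],
})

print(employees)
print(employees.ndim)  # two dimensions: rows and columns

# Selecting one column of a DataFrame gives back a Series.
salaries = employees["salary"]
print(type(salaries))
```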
Pandas Data Structure
Load CSV file
• A simple way to store large data sets is to use CSV (comma-separated values) files.
• CSV files contain plain text in a well-known format that most tools, including Pandas, can read.
Example file: employees.csv
If the data has no header row, we can supply column names when loading it.
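The sketch below shows one way to load CSV data with `pandas.read_csv`, with and without a header row. To keep it self-contained, it reads from in-memory text instead of a real employees.csv; the column names and rows are assumptions made for the example, not the contents of the original file.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up.
csv_with_header = io.StringIO(
    "name,job_title,salary\n"
    "Ana,Analyst,52000\n"
    "Ben,Engineer,61000\n"
)
df = pd.read_csv(csv_with_header)  # the first line is used as the header

# The same data without a header row.
csv_without_header = io.StringIO(
    "Ana,Analyst,52000\n"
    "Ben,Engineer,61000\n"
)
# header=None says there is no header line; names= supplies the column names.
df_named = pd.read_csv(
    csv_without_header,
    header=None,
    names=["name", "job_title", "salary"],
)

print(df)
print(df_named)
```

With a real file, the first argument would be a path such as "employees.csv" in place of the `io.StringIO` object.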
Exploring data of a DataFrame
• DataFrame.shape
The shape attribute returns the number of rows and columns as a tuple.
In the slide's example, the data contains 320 rows and 9 columns, so shape is (320, 9).
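A small sketch of reading `shape`, using a made-up DataFrame much smaller than the one described above.

```python
import pandas as pd

# Hypothetical data: 4 rows and 2 columns.
df = pd.DataFrame({"a": range(4), "b": list("wxyz")})

# shape is an attribute, so it is read without parentheses.
rows, cols = df.shape
print(df.shape)
print(rows, cols)
```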
Exploring data of a DataFrame
• DataFrame.head(n) and DataFrame.tail(n)
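As a brief illustration, `head(n)` returns the first n rows and `tail(n)` returns the last n rows; both default to 5 when n is omitted. The DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical data: ten rows with values 0..9.
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # n defaults to 5

print(first_three)
print(last_two)
```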
Exploring data of a DataFrame
• DataFrame.info(): a useful tool for getting a quick overview of a
DataFrame. It can be used to identify the data types of the
columns, the number of rows and columns, and the memory
usage of the DataFrame. This information can be helpful for
understanding the DataFrame and for planning further analysis.
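A minimal sketch of calling `info()` on a made-up DataFrame. Note that `info()` prints its summary and returns None, so there is nothing to assign.

```python
import pandas as pd

# Hypothetical data with a missing value in one column.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chi"],
    "salary": [52000.0, None, 58000.0],
})

# Prints the index range, column names, non-null counts, dtypes and memory usage.
df.info()

# memory_usage="deep" asks pandas to measure object columns more precisely.
df.info(memory_usage="deep")
```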
The DataFrame.describe() method calculates the following statistics for each numeric column in the DataFrame:
• Count: the number of non-null values in the column.
• Mean: the average value of the column.
• Standard deviation: the standard deviation of the column.
• Minimum: the minimum value in the column.
• 25%: the 25th percentile of the column.
• 50%: the 50th percentile of the column, also known as the median.
• 75%: the 75th percentile of the column.
• Maximum: the maximum value in the column.
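The statistics listed above can be seen in a small sketch. The data is hypothetical; by default `describe()` summarises only the numeric columns when the DataFrame has any, so the text column is left out.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
df = pd.DataFrame({
    "salary": [40, 50, 60, 70],
    "dept": ["a", "a", "b", "b"],
})

summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "salary"])
print(summary.loc["50%", "salary"])
```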
The DataFrame.dtypes attribute returns the data types of the columns in a DataFrame as a Series. The index of the Series holds the column names, and the values are the data types of the corresponding columns.
The data types reported by DataFrame.dtypes include:
• object: strings, lists, or other non-numeric data.
• int64: integers.
• float64: floating-point numbers.
• datetime64[ns]: dates and times.
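A short sketch of inspecting `dtypes` on a made-up DataFrame. The exact labels shown for text and datetime columns can differ between pandas versions, so treat the printed names as indicative.

```python
import pandas as pd

# Hypothetical data covering several common column types.
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "age": [31, 45],
    "score": [7.5, 8.0],
    "hired": pd.to_datetime(["2020-01-15", "2021-06-01"]),
})

# dtypes is an attribute (no parentheses) and is itself a Series.
column_types = df.dtypes
print(column_types)

# The index of that Series is the list of column names.
print(list(column_types.index))
```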
Calling value_counts() on a column, for example the "Job Title" column, returns the count of each unique value in that column.
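A minimal sketch of counting unique values in a "Job Title" column. The job titles are invented for the example.

```python
import pandas as pd

# Hypothetical job titles.
df = pd.DataFrame({
    "Job Title": ["Analyst", "Engineer", "Analyst", "Manager", "Analyst"],
})

# value_counts() on a single column returns a Series of counts,
# sorted from the most frequent value to the least frequent.
counts = df["Job Title"].value_counts()
print(counts)
```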
Data Cleansing
• Handling duplicate data
  • Dropping or deleting duplicate records
• Handling missing values in data
  • Dropping rows that have missing data, or filling in the missing values
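The cleansing steps above could be sketched as follows. The data is hypothetical, and filling missing salaries with the column mean is only one possible choice, not something the slides prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one duplicated row and one missing salary.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Chi"],
    "salary": [52000.0, 61000.0, 61000.0, np.nan],
})

# Handling duplicates: flag repeated rows, then drop them.
duplicate_mask = df.duplicated()
deduped = df.drop_duplicates()

# Handling missing values, option 1: drop rows that contain any missing value.
dropped = deduped.dropna()

# Option 2: fill the gaps, here with the mean of the known salaries (an assumption).
filled = deduped.fillna({"salary": deduped["salary"].mean()})

print(duplicate_mask)
print(dropped)
print(filled)
```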
Data Summary
• Grouping data
• Sorting
• Ranking
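A rough sketch of grouping, sorting, and ranking with pandas, using made-up department and salary values.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR"],
    "name": ["Ana", "Ben", "Chi", "Dee"],
    "salary": [60, 50, 40, 45],
})

# Grouping: average salary per department.
mean_by_dept = df.groupby("dept")["salary"].mean()

# Sorting: highest salary first.
sorted_df = df.sort_values("salary", ascending=False)

# Ranking: 1 is the highest salary; ranks are returned as floats.
df["salary_rank"] = df["salary"].rank(ascending=False)

print(mean_by_dept)
print(sorted_df)
print(df)
```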
Why NumPy?
NumPy (Numerical Python) is a widely used Python library for scientific computation.
• It is memory efficient and fast.
• It provides N-dimensional array objects and a rich collection of routines to process and analyse them.
• Its arrays are homogeneous (all elements share the same data type).
• To create an ndarray, we can pass a list, tuple, or any array-like object to the array() function, and it will be converted into an ndarray.
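A minimal sketch of creating arrays with `np.array()` from a list, a tuple, and a nested list. The values are arbitrary.

```python
import numpy as np

from_list = np.array([1, 2, 3])           # 1-D array from a list
from_tuple = np.array((1.5, 2.5))         # 1-D array from a tuple
from_nested = np.array([[1, 2], [3, 4]])  # 2-D array from a nested list

# Arrays are homogeneous: mixing ints and floats upcasts everything to float.
mixed = np.array([1, 2.5])

print(type(from_list))
print(from_nested.shape)
print(mixed.dtype)
```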
NumPy array manipulation
Function      Description
reshape()     Returns an array with the specified shape without modifying the data
flat          A one-dimensional iterator over the array (an attribute, not a function); indexing it returns the element at that flattened position
flatten()     Returns a one-dimensional copy of the input array
ravel()       Returns a one-dimensional view of the input array where possible, otherwise a copy
transpose()   Transposes the axes
resize()      Like reshape(), but the ndarray.resize() method modifies the array it is called on in place
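The sketch below exercises each entry in the table on a small array. The array contents are arbitrary, and the copy-versus-view behaviour shown for `ravel()` holds for this contiguous array; NumPy falls back to a copy when a view is not possible.

```python
import numpy as np

a = np.arange(6)          # [0 1 2 3 4 5]

b = a.reshape(2, 3)       # new shape; a keeps its original shape
third = b.flat[2]         # flat is indexed directly, without calling it
copy_1d = b.flatten()     # always an independent copy
view_1d = b.ravel()       # a view here, because b is contiguous
t = b.transpose()         # axes swapped: shape (3, 2)

copy_1d[0] = 99           # changing the copy leaves b untouched
view_1d[1] = -1           # changing the view also changes b (and a)

c = np.arange(6)
c.resize(3, 2)            # changes c itself instead of returning a new array

print(b)
print(t.shape)
print(c.shape)
```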
References
• Dixit, R. (2022). Data Analysis with Python: Introducing NumPy,
Pandas, Matplotlib, and Essential Elements of Python
Programming (English Edition). India: BPB Publications.
