Data Wrangling
Due to the rapid expansion of the amount of data and data sources available today, storing and
organizing large quantities of data for analysis is becoming increasingly necessary.
Data wrangling is the process of removing errors and combining complex data sets to make
them more accessible and easier to analyze.
Data wrangling can be defined as the process of cleaning, organizing, and transforming
raw data into the desired format for analysts to use for prompt decision-making.
Data wrangling enables businesses to tackle more complex data in less time, produce more
accurate results, and make better decisions.
Organizations increasingly rely on data wrangling tools to make
data ready for downstream analytics.
Importance of Data Wrangling
The main reasons for using data wrangling tools are:
•Making raw data usable. Accurately wrangled data helps ensure that only quality data enters the
downstream analysis.
•Gathering data from various sources into a centralized location so it can be used.
•Piecing together raw data according to the required format and understanding the business context of
the data.
•Cleaning and converting source data into a standard format that can be reused according to end
requirements, often with automated data integration tools. Businesses use this standardized data to
perform crucial cross-data-set analytics.
•Cleansing the data of noise and of flawed or missing elements.
•Preparing data for the data mining process, which involves gathering data and making sense of it.
•Helping business users make concrete, timely decisions.
Data wrangling software typically performs six iterative steps of Discovering, Structuring,
Cleaning, Enriching, Validating, and Publishing data before it is ready for analytics.
1. Data discovery
The first step helps you make sense of the data you're working with. You'll also
need to keep the primary goal of the data analysis in mind during this step. For example, if
your organization wants to gain customer behavior insight, you might sort
customer data according to location, promotional codes, and purchases.
2. Data structuring
Once you've finished the first step, you might find raw data that is disorganized,
incomplete, or misformatted for your purposes. That's where data
structuring comes into play. This is the process in which you transform that raw
data into a form appropriate for the analytical model you want to use to interpret
the data.
3. Data cleaning
During the data cleaning step, you remove data errors that might distort or damage the value of your analysis.
This includes tasks like standardising inputs, deleting empty cells, removing outliers, and deleting blank rows.
Ultimately, the goal is to ensure the data is as error-free as possible.
4. Enriching data
Once you've transformed your data into a more usable state, you must determine if you have all the data you
need for the project. If you don't, you can enrich it by adding values from other data sets. And if you do so, you
might have to repeat steps one through three for that new data.
5. Validating data
When you work on data validation, you verify that your data is consistent and of sufficient quality. During this
step, you might find some issues you need to address or that the data is ready to be analysed. This step is
typically completed using automated processes and requires some programming skills.
6. Publishing data
After validating your data, you're ready to publish it. In this step, you'll put it into whatever format you prefer for
sharing with other organisation members for analysis purposes. Use written reports or digital files, depending on
the nature of the data and the organisation's overarching goals.
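The cleaning and validating steps above can be sketched in pandas. This is only an illustration: the customer table, its column names, and the two validation rules are assumptions made up for the sketch, not part of the original slides.

```python
import pandas as pd

# Hypothetical raw customer data, invented for this sketch.
raw = pd.DataFrame({
    "customer": [" alice ", "BOB", None, "Carol"],
    "city": ["pune", "Pune ", None, "DELHI"],
    "purchases": [3, 5, None, 2],
})

# Cleaning: drop rows that are entirely blank, then standardise text inputs.
clean = raw.dropna(how="all").copy()
clean["customer"] = clean["customer"].str.strip().str.title()
clean["city"] = clean["city"].str.strip().str.title()

# Validating: simple automated checks on the cleaned data (assumed rules).
assert clean["purchases"].notna().all(), "purchases must not be missing"
assert (clean["purchases"] >= 0).all(), "purchases must be non-negative"

print(clean)
```

In a real project the checks would come from the business rules for the data set; a failed assertion signals that the earlier steps need another pass.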
Benefits of Data Wrangling
•Data wrangling helps to improve data usability as it converts data into a compatible
format for the end system.
•It helps to quickly build data flows within an intuitive user interface and easily
schedule and automate the data-flow process.
•It integrates various types of information and their sources (like databases, web
services, files, etc.).
•It helps users process very large volumes of data easily and share data-flow
techniques.
Data Wrangling Tools
Some examples of basic data munging tools are:
•Spreadsheets / Excel Power Query - The most basic manual data wrangling tool
•OpenRefine - An open-source data cleaning tool; its more advanced transformations call for some scripting skills
•Tabula - A tool for extracting tables from PDF files
•Google DataPrep - A data service that explores, cleans, and prepares data
•Data Wrangler - A data cleaning and transforming tool
Data Wrangling Examples
Data wrangling techniques serve a variety of use cases. Common examples include:
•Merging several data sources into one data set for analysis
•Identifying gaps or empty cells in data and either filling or removing them
•Deleting irrelevant or unnecessary data
•Identifying severe outliers in data and either explaining the inconsistencies or deleting them to
facilitate analysis
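One way to identify severe outliers, as mentioned in the list above, is the interquartile range (IQR) rule. The sketch below is a minimal illustration: the sample values and the conventional 1.5 multiplier are assumptions, and other rules may suit a given data set better.

```python
import pandas as pd

# Hypothetical measurements with one extreme value, invented for this sketch.
values = pd.Series([10, 12, 11, 13, 12, 95], name="value")

# IQR rule: flag points that lie far outside the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])   # candidates to explain or remove
print(values[~is_outlier])  # remaining data
```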
Businesses also use data wrangling tools to
•Detect corporate fraud
•Support data security
•Ensure accurate and recurring data modeling results
•Ensure business compliance with industry standards
•Perform Customer Behavior Analysis
•Reduce time spent on preparing data for analysis
•Promptly recognize the business value of your data
•Find out data trends
Data Wrangling Skills
• To perform a series of data transformations such as merging, ordering, and aggregating
• To use data science programming languages such as R, Python, Julia, and SQL on specified
data sets
• To make logical judgments based on underlying business context
Data Wrangling in Python
Data wrangling is a crucial topic in data science and data analysis. In Python, the pandas library is
commonly used for data wrangling.
Data wrangling in Python deals with the following functionalities:
1.Data exploration: The data is studied, analyzed, and understood by visualizing
representations of it.
2.Dealing with missing values: Most large datasets contain missing values (NaN). They need to be
handled, either by replacing them with the mean or the mode (the most frequent value of the
column), or by dropping the rows that contain a NaN value.
3.Reshaping data: The data is manipulated according to the requirements; new data can be
added or pre-existing data can be modified.
4.Filtering data: Sometimes datasets contain unwanted rows or columns, which need to be
removed or filtered out.
5.Other: After applying the functionalities above to the raw dataset, we get an efficient
dataset that meets our requirements and can be used for purposes such as data
analysis, machine learning, data visualization, and model training.
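Reshaping is the one item in the list above that gets no example later on. One common form of reshaping, assumed here for illustration, is converting between a wide layout (one column per subject) and a long layout (one row per observation). The table and column names are made up for the sketch.

```python
import pandas as pd

# Hypothetical wide table: one column per subject (invented for this sketch).
wide = pd.DataFrame({
    "Name": ["Jai", "Princi"],
    "Maths": [90, 76],
    "Science": [85, 80],
})

# Wide -> long: one row per (Name, Subject) pair.
long = wide.melt(id_vars="Name", var_name="Subject", value_name="Marks")

# Long -> wide again: pivot restores one column per subject.
restored = long.pivot(index="Name", columns="Subject", values="Marks")

print(long)
print(restored)
```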
Data exploration in Python
Here in data exploration, we load the data into a dataframe and then display it in tabular form.
# Import pandas package
import pandas as pd
# Assign data (the string 'NaN' marks a missing mark)
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
# Convert into DataFrame
df = pd.DataFrame(data)
# Display data
print(df)
Dealing with missing values in Python
As the previous output shows, the Marks column contains NaN entries, which are missing values
in the dataframe. Here they are handled by replacing them with the column mean.
# Compute the average of the numeric marks
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c
# Replace missing values with the average
df = df.replace(to_replace="NaN", value=avg)
# Display data
print(df)
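The loop above can also be expressed with pandas' own helpers. The following is a minimal sketch, assuming the same Marks values and the same 'NaN' string marker: pd.to_numeric turns non-numeric entries into real NaN values, and fillna replaces them with the mean of the observed marks.

```python
import pandas as pd

# Same Marks values as in the example, with the string 'NaN' as the missing marker.
marks = pd.Series([90, 76, 'NaN', 74, 65, 'NaN', 71], name='Marks')

# Convert to numbers; anything that is not numeric becomes a real NaN.
marks = pd.to_numeric(marks, errors='coerce')

# Replace missing values with the mean of the observed values.
filled = marks.fillna(marks.mean())
print(filled)
```

Because mean() skips NaN by default, the fill value is computed from the five observed marks only.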
Data Replacing in Data Wrangling
In the Gender column, we can replace the categories with numeric codes.
# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0,
                                 'F': 1}).astype(float)
# Display data
print(df)
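The same encoding can be derived from pandas' categorical type instead of a hand-written dictionary. This is a sketch of an alternative, not the method used in the slides; the category order is stated explicitly so that the codes match the mapping M to 0 and F to 1.

```python
import pandas as pd

gender = pd.Series(['M', 'F', 'M', 'M', 'M', 'F', 'F'], name='Gender')

# Declare the categories explicitly so the codes follow this order: M -> 0, F -> 1.
codes = pd.Categorical(gender, categories=['M', 'F']).codes
print(list(codes))
```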
Filtering data in Data Wrangling
Suppose we need the name, gender, and marks of the top-scoring students. Here we remove the
unwanted data by filtering the rows with a boolean condition and dropping the column we do
not need.
# Filter top scoring students
df = df[df['Marks'] >= 75].copy()
# Remove age column from filtered DataFrame
df.drop('Age', axis=1, inplace=True)
# Display data
print(df)
Hence, we have finally obtained an efficient dataset that can be further used for
various purposes.
Data Wrangling Using Merge Operation
The merge operation combines two raw data sets into the desired format.
Syntax: pd.merge(data_frame1, data_frame2, on="field")
Here field is the name of a column that is present in both dataframes.
For example: suppose a teacher has two sets of data. The first contains
the details of the students and the second contains the pending fees
status, taken from the accounts office. The teacher can use the merge
operation to combine the two and give the data meaning. This makes the
analysis easier and saves the time and effort of merging the records manually.
Creating First Dataframe to Perform Merge Operation using Data Wrangling:
# import module
import pandas as pd
# creating DataFrame for Student Details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106,
           107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})
# printing details
print(details)
Creating Second Dataframe to Perform Merge operation using Data Wrangling:
# Import module
import pandas as pd
# Creating Dataframe for Fees_Status
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105,
            106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL',
                 '9000', '15000', 'NIL',
                 '4500', '1800', '250', 'NIL']})
# Printing fees_status
print(fees_status)
Data Wrangling Using Merge Operation:
# Import module
import pandas as pd
# Creating Dataframe
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})
# Creating Dataframe
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105,
            106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL',
                 '9000', '15000', 'NIL',
                 '4500', '1800', '250', 'NIL']})
# Merging Dataframe
print(pd.merge(details, fees_status, on='ID'))
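In the example above every ID appears in both tables, so the default inner join loses nothing. When the keys do not line up, the how and indicator parameters of pd.merge help show what happened. The smaller tables below are hypothetical and only illustrate that case.

```python
import pandas as pd

# Hypothetical tables: student 103 has no fees record, and fees row 104 has no student.
details = pd.DataFrame({'ID': [101, 102, 103],
                        'NAME': ['Jagroop', 'Praveen', 'Harjot']})
fees_status = pd.DataFrame({'ID': [101, 102, 104],
                            'PENDING': ['5000', '250', 'NIL']})

# Inner join (the default): only IDs present in both tables.
inner = pd.merge(details, fees_status, on='ID')

# Left join: keep every student; indicator=True adds a _merge column
# saying whether each row was found in both tables or only in the left one.
left = pd.merge(details, fees_status, on='ID', how='left', indicator=True)

print(inner)
print(left)
```

Rows marked left_only in the _merge column are students without a matching fees record, which is often worth checking before the analysis continues.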
Data Wrangling Using Grouping Method
The grouping method in data wrangling selects or summarises results by group within a large
data set.
The pandas groupby() method splits the data set into groups based on the values of a column.
Example: a car-selling company deals in brands from various car manufacturers such as Maruti,
Toyota, Mahindra, and Ford, and has data on how many cars were sold in different years. The
company wants to wrangle only the data for cars sold during the year 2010. For this problem,
we use another data wrangling technique, the pandas groupby() method.
Creating a dataframe to use grouping methods [car selling dataset]:
# Import module
import pandas as pd
# Creating Data
car_selling_data = {'Brand': ['Maruti', 'Maruti',
                              'Maruti', 'Maruti', 'Hyundai', 'Hyundai', 'Toyota', 'Mahindra',
                              'Mahindra', 'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013, 2010, 2011, 2011, 2010, 2013,
                             2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2]}
# Creating Dataframe of car_selling_data
df = pd.DataFrame(car_selling_data)
# printing Dataframe
print(df)
Creating a dataframe to use grouping methods [data of the year 2010]:
# Import module
import pandas as pd
# Creating Data
car_selling_data = {'Brand': ['Maruti', 'Maruti', 'Maruti',
                              'Maruti', 'Hyundai', 'Hyundai', 'Toyota', 'Mahindra', 'Mahindra',
                              'Ford', 'Toyota', 'Ford'],
                    'Year': [2010, 2011, 2009, 2013, 2010, 2011, 2011, 2010, 2013,
                             2010, 2010, 2011],
                    'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2]}
# Creating Dataframe for Provided Data
df = pd.DataFrame(car_selling_data)
# Group the data by year and select the group for year = 2010
grouped = df.groupby('Year')
print(grouped.get_group(2010))
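Beyond selecting a single group, groupby() is usually paired with an aggregation. The sketch below reuses the car selling data and shows two things that are not in the original slides: the total number of cars sold per year, and a plain boolean filter as an alternative way to pick out the 2010 rows.

```python
import pandas as pd

car_selling_data = {
    'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti', 'Hyundai', 'Hyundai',
              'Toyota', 'Mahindra', 'Mahindra', 'Ford', 'Toyota', 'Ford'],
    'Year': [2010, 2011, 2009, 2013, 2010, 2011,
             2011, 2010, 2013, 2010, 2010, 2011],
    'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2],
}
df = pd.DataFrame(car_selling_data)

# Aggregate per group: total cars sold in each year.
total_per_year = df.groupby('Year')['Sold'].sum()

# Alternative to selecting one group: filter the rows with a boolean condition.
sold_2010 = df[df['Year'] == 2010]

print(total_per_year)
print(sold_2010)
```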
Data Wrangling by Removing Duplication
The pandas duplicated() method helps us find duplicate values in large data so that they can be
removed. Removing duplicate values is an important part of data wrangling.
Syntax: DataFrame.duplicated(subset=None, keep='first')
Here subset is the column (or list of columns) in which to look for duplicate values.
For keep, we have 3 options:
•if keep='first', the first occurrence is treated as the original and all later occurrences
are marked as duplicates.
•if keep='last', the last occurrence is treated as the original and all earlier occurrences
are marked as duplicates.
•if keep=False, every value that occurs more than once is marked as a duplicate, including
its first occurrence.
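The three keep options are easiest to compare on a small series. The roll numbers below are a shortened, hypothetical sample; duplicated() returns a boolean mask in which True marks a row considered a duplicate.

```python
import pandas as pd

roll_no = pd.Series([23, 54, 29, 23, 54, 23], name='Roll_no')

# keep='first': every occurrence after the first is marked True.
first = list(roll_no.duplicated(keep='first'))
# keep='last': every occurrence before the last is marked True.
last = list(roll_no.duplicated(keep='last'))
# keep=False: all occurrences of a repeated value are marked True.
none_kept = list(roll_no.duplicated(keep=False))

print(first)
print(last)
print(none_kept)
```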
For example, a university is organizing an event. To participate, students
have to fill in their details in an online form so that the organizers can contact
them. A student may fill out the form more than once, and multiple entries from
a single student make the organizers' work harder. The data the organizers
receive can be easily wrangled by removing the duplicate entries.
Creating a dataset of students who want to participate in the event:
# Import module
import pandas as pd
# Initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop', 'Rahul',
                         'Vishal', 'Suraj', 'Rishab', 'Satyapal', 'Amit', 'Rahul', 'Praveen',
                         'Amit'],
                'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
                'Email': ['', '',
                          '', '',
                          '', '',
                          '', '',
                          '', '',
                          '', '']}
# Creating Dataframe of Data
df = pd.DataFrame(student_data)
# Printing Dataframe
print(df)
Removing Duplicate data from the Dataset using Data wrangling:
# import module
import pandas as pd
# initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
'Rahul', 'Vishal', 'Suraj','Rishab', 'Satyapal', 'Amit','Rahul',
'Praveen', 'Amit'],
'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
'Email': ['', '',
'', '',
'', '',
'', '',
'', '',
'', '']}
# creating dataframe
df = pd.DataFrame(student_data)
# df.duplicated('Roll_no') marks rows whose Roll_no has already appeared.
# The ~ (NOT) operator inverts that mask to keep the non-duplicate rows.
non_duplicate = df[~df.duplicated('Roll_no')]
# printing non-duplicate values
print(non_duplicate)
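pandas also offers drop_duplicates(), which combines the masking and the filtering in one call. The sketch below uses a shortened, hypothetical version of the student data and checks that both approaches give the same rows.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Amit', 'Praveen', 'Rahul', 'Rahul', 'Praveen'],
                   'Roll_no': [23, 54, 36, 36, 54]})

# Keep the first row for each Roll_no and drop the later repeats.
non_duplicate = df.drop_duplicates(subset='Roll_no', keep='first')

# Same result as inverting the duplicated() mask.
same_as_mask = non_duplicate.equals(df[~df.duplicated('Roll_no')])

print(non_duplicate)
print(same_as_mask)
```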
Creating New Datasets Using the Concatenation of Two Datasets in Data Wrangling
We can join two dataframes in several ways. For our example of concatenating two datasets, we
use the pd.concat() function.
Creating Two Dataframes for Concatenation
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
# The last address is assumed to be 'Kannuaj', matching Anuj's entry in data2
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age':[27, 24, 22, 32],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification':['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}
# Define a dictionary containing employee data
data2 = {'Name':['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
'Age':[22, 32, 12, 52],
'Address':['Allahabad', 'Kannuaj',
'Allahabad', 'Kannuaj'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons'],
'Salary':[1000, 2000, 3000, 4000]}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])
We will join these two dataframes along axis 0.
res = pd.concat([df, df1])
print(res)
Since data1 does not have a Salary column, the Salary values for the four rows of res that come
from data1 are NaN. Likewise, Mobile No is NaN for the four rows that come from data2.
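Two pd.concat() parameters are useful when the frames do not share all columns or index labels. The sketch below uses small hypothetical frames rather than the employee data above: ignore_index=True renumbers the rows, and join='inner' keeps only the columns that both frames have.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Mobile No': [97, 91]}, index=[0, 1])
df1 = pd.DataFrame({'Name': ['Dhiraj', 'Hitesh'], 'Salary': [3000, 4000]}, index=[1, 2])

# Default (outer) concat: union of columns, missing cells become NaN.
res = pd.concat([df, df1])

# ignore_index=True renumbers the rows 0..n-1 instead of keeping repeated labels.
res_reindexed = pd.concat([df, df1], ignore_index=True)

# join='inner' keeps only the columns that both frames share.
res_common = pd.concat([df, df1], join='inner')

print(res)
print(res_reindexed)
print(res_common)
```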

More Related Content

Similar to data wrangling (1).pptx kjhiukjhknjbnkjh

The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfData Science Council of America
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learningSandeep Garg
Running head CS688 – Data Analytics with R1CS688 – Data Analyt.docx
Running head CS688 – Data Analytics with R1CS688 – Data Analyt.docxRunning head CS688 – Data Analytics with R1CS688 – Data Analyt.docx
Running head CS688 – Data Analytics with R1CS688 – Data Analyt.docxtodd271
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxRupaRani28
data collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptxdata collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptxSourabhkumar729579
What is data science ?
What is data science ?What is data science ?
What is data science ?ShahlKv
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxPratikshaSurve4
ETL processes , Datawarehouse and Datamarts.pptx
ETL processes , Datawarehouse and Datamarts.pptxETL processes , Datawarehouse and Datamarts.pptx
ETL processes , Datawarehouse and Datamarts.pptxParnalSatle
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
Technical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdfTechnical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdfShristi Shrestha

Similar to data wrangling (1).pptx kjhiukjhknjbnkjh (20)

The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
Planning Data Warehouse
Planning Data WarehousePlanning Data Warehouse
Planning Data Warehouse
Data processing
Data processingData processing
Data processing
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learning
Running head CS688 – Data Analytics with R1CS688 – Data Analyt.docx
Running head CS688 – Data Analytics with R1CS688 – Data Analyt.docxRunning head CS688 – Data Analytics with R1CS688 – Data Analyt.docx
Running head CS688 – Data Analytics with R1CS688 – Data Analyt.docx
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptx
data collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptxdata collection, data integration, data management, data modeling.pptx
data collection, data integration, data management, data modeling.pptx
What is data science ?
What is data science ?What is data science ?
What is data science ?
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptx
ETL processes , Datawarehouse and Datamarts.pptx
ETL processes , Datawarehouse and Datamarts.pptxETL processes , Datawarehouse and Datamarts.pptx
ETL processes , Datawarehouse and Datamarts.pptx
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
Technical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdfTechnical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdf
Data Mining
Data MiningData Mining
Data Mining
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
KIT601 Unit I.pptx
KIT601 Unit I.pptxKIT601 Unit I.pptx
KIT601 Unit I.pptx

Recently uploaded

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001

Recently uploaded (20)

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing

data wrangling (1).pptx kjhiukjhknjbnkjh

  • 2. Due to the rapid expansion of the amount of data and data sources available today, storing and organizing large quantities of data for analysis is becoming increasingly necessary. Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze.
  • 3. Data wrangling can be defined as the process of cleaning, organizing, and transforming raw data into the desired format for analysts to use for prompt decision-making. Data wrangling enables businesses to tackle more complex data in less time, produce more accurate results, and make better decisions. More and more organizations are increasingly relying on data wrangling tools to make data ready for downstream analytics.
  • 4. Importance of Data Wrangling The primary importance of using data wrangling tools can be described as: •Making raw data usable. Accurately wrangled data guarantees that quality data is entered into the downstream analysis. •Getting all data from various sources into a centralized location so it can be used. •Piecing together raw data according to the required format and understanding the business context of data •Automated data integration tools are used as data wrangling techniques that clean and convert source data into a standard format that can be used repeatedly according to end requirements. Businesses use this standardized data to perform crucial, cross-data set analytics. •Cleansing the data from the noise or flawed, missing elements •Data wrangling acts as a preparation stage for the data mining process, which involves gathering data and making sense of it. •Helping business users make concrete, timely decisions
  • 5. Data wrangling software typically performs six iterative steps of Discovering, Structuring, Cleaning, Enriching, Validating, and Publishing data before it is ready for analytics.
      1. Data discovery
         The first step helps you make sense of the data you're working with. You'll also need to keep the primary goal of the data analysis in mind during this step. For example, if your organization wants to gain customer behavior insight, you might sort customer data according to location, promotional codes, and purchases.
      2. Data structuring
         Once you've finished the first step, you might find raw data that is disorganized, incomplete, or misformatted for your purposes. That's where data structuring comes into play. This is the process in which you transform that raw data into a form appropriate for the analytical model you want to use to interpret the data.
  • 6. 3. Data cleaning
         During the data cleaning step, you remove data errors that might distort or damage the value of your analysis. This includes tasks like standardising inputs, deleting empty cells, removing outliers, and deleting blank rows. Ultimately, the goal is to ensure the data is as error-free as possible.
      4. Enriching data
         Once you've transformed your data into a more usable state, you must determine if you have all the data you need for the project. If you don't, you can enrich it by adding values from other data sets. If you do so, you might have to repeat steps one through three for that new data.
      5. Validating data
         When you work on data validation, you verify that your data is consistent and of sufficient quality. During this step, you might find some issues you need to address, or you might find that the data is ready to be analysed. This step is typically completed using automated processes and requires some programming skills.
      6. Publishing data
         After validating your data, you're ready to publish it. In this step, you'll put it into whatever format you prefer for sharing with other organisation members for analysis purposes. Use written reports or digital files, depending on the nature of the data and the organisation's overarching goals.
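The validation step described above is usually automated. A minimal sketch of what such a check might look like in pandas follows; the `orders` table, its column names, and the `validate` helper are hypothetical and only illustrate the idea of rule-based checks.

```python
import pandas as pd

# Hypothetical order data; the column names are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [25.0, 40.5, 13.2, 99.9],
    "country": ["IN", "US", "IN", "DE"],
})


def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("order_id contains duplicates")
    if df["amount"].isna().any():
        problems.append("amount has missing values")
    if (df["amount"] < 0).any():
        problems.append("amount has negative values")
    if not df["country"].str.len().eq(2).all():
        problems.append("country is not a two-letter code")
    return problems


problems = validate(orders)
print(problems or "all checks passed")
```

Returning a list of problems rather than raising on the first failure lets one run report every issue that must be addressed before publishing.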
  • 7. Benefits of Data Wrangling
      • Data wrangling helps to improve data usability, as it converts data into a format compatible with the end system.
      • It helps to quickly build data flows within an intuitive user interface and to easily schedule and automate the data-flow process.
      • It integrates various types of information and their sources (databases, web services, files, etc.).
      • It helps users process very large volumes of data and share data-flow techniques easily.
  • 8. Data Wrangling Tools
      Some examples of basic data munging tools are:
      • Spreadsheets / Excel Power Query - The most basic manual data wrangling tool
      • OpenRefine - A data cleaning tool whose more advanced transformations require some programming skills
      • Tabula - A tool for extracting tables from PDF files
      • Google DataPrep - A data service that explores, cleans, and prepares data
      • Data wrangler - A data cleaning and transforming tool
  • 9. Data Wrangling Examples
      Data wrangling techniques are used for various use cases. The most commonly used examples of data wrangling are for:
      • Merging several data sources into one data set for analysis
      • Identifying gaps or empty cells in data and either filling or removing them
      • Deleting irrelevant or unnecessary data
      • Identifying severe outliers in data and either explaining the inconsistencies or deleting them to facilitate analysis
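As an illustration of the outlier use case, the sketch below flags severe outliers with the common 1.5 × IQR rule. The slides do not prescribe a particular rule, so the rule and the made-up `daily_sales` values are assumptions.

```python
import pandas as pd

# Made-up values; 95 is an obvious outlier.
values = pd.Series([12, 14, 13, 15, 14, 13, 95], name="daily_sales")

# Interquartile range (IQR) fences: points beyond 1.5 * IQR are flagged.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])      # rows to explain or delete
cleaned = values[~is_outlier]  # data set without the flagged rows
```

Whether a flagged value is deleted or kept with an explanation remains a judgment call based on the business context.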
  • 10. Businesses also use data wrangling tools to:
      • Detect corporate fraud
      • Support data security
      • Ensure accurate and recurring data modeling results
      • Ensure business compliance with industry standards
      • Perform customer behavior analysis
      • Reduce time spent on preparing data for analysis
      • Promptly recognize the business value of their data
      • Find out data trends
  • 11. Data Wrangling Skills
      • To be able to perform a series of data transformations such as merging, ordering, and aggregating
      • To use data science programming languages such as R, Python, Julia, and SQL on specified data sets
      • To make logical judgments based on the underlying business context
  • 12. Data Wrangling in Python
      Data wrangling is a crucial topic for data science and data analysis. The pandas library is used for data wrangling in Python. Data wrangling in Python deals with the functionalities below:
      1. Data exploration: In this process, the data is studied, analyzed, and understood by visualizing representations of the data.
      2. Dealing with missing values: Most large datasets contain missing values (NaN). They need to be taken care of by replacing them with the mean or the mode (the most frequent value) of the column, or simply by dropping the rows that have a NaN value.
      3. Reshaping data: In this process, data is manipulated according to the requirements; new data can be added or pre-existing data can be modified.
      4. Filtering data: Sometimes datasets contain unwanted rows or columns that need to be removed or filtered out.
      5. Other: After applying the above functionalities to the raw dataset, we get an efficient dataset that meets our requirements and can be used for a required purpose such as data analysis, machine learning, data visualization, or model training.
  • 13. Data exploration in Python
      Here in data exploration, we load the data into a dataframe, and then we view the data in a tabular format.
        # Import pandas package
        import pandas as pd

        # Assign data (the string 'NaN' stands for a missing mark)
        data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ravi', 'Natasha', 'Riya'],
                'Age': [17, 17, 18, 17, 18, 17, 17],
                'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
                'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

        # Convert into DataFrame
        df = pd.DataFrame(data)

        # Display data
        df
  • 14. Dealing with missing values in Python
      As we can see from the previous output, there are NaN values in the Marks column. These are missing values in the dataframe, and data wrangling takes care of them by replacing them with the column mean.
        # Compute the average of the numeric marks
        c = avg = 0
        for ele in df['Marks']:
            if str(ele).isnumeric():
                c += 1
                avg += ele
        avg /= c

        # Replace missing values (stored as the string "NaN") with the average
        df = df.replace(to_replace="NaN", value=avg)

        # Display data
        df
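The manual loop above works because the missing marks are stored as the string "NaN". A more idiomatic sketch, assuming the same marks data, first converts the column to real missing values and then fills them with the column mean. The name `df_marks` is chosen here only to avoid clashing with `df` from the slides.

```python
import pandas as pd

df_marks = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj", "Ravi", "Natasha", "Riya"],
    "Marks": [90, 76, "NaN", 74, 65, "NaN", 71],
})

# Convert the column to numbers; anything non-numeric becomes a real NaN.
df_marks["Marks"] = pd.to_numeric(df_marks["Marks"], errors="coerce")

# Series.mean() skips missing values by default.
mean_marks = df_marks["Marks"].mean()

# Fill the gaps with the column mean.
df_marks["Marks"] = df_marks["Marks"].fillna(mean_marks)
print(df_marks)
```

Working with real NaN values also makes `isna()`, `dropna()`, and `fillna()` available, which the string placeholder does not.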
  • 15. Data Replacing in Data Wrangling
      We can replace the values in the Gender column by encoding each category as a number.
        # Categorize gender
        df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)

        # Display data
        df
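One detail worth knowing about `map()` with a dictionary: values that are not in the dictionary become NaN. The sketch below uses a made-up series containing an unexpected value "X" to show how to spot such rows, and shows pandas' categorical codes as an alternative encoding.

```python
import pandas as pd

# Made-up data with one unexpected value.
gender = pd.Series(["M", "F", "M", "F", "X"])

# Values missing from the mapping dictionary become NaN.
codes = gender.map({"M": 0, "F": 1})
unmapped = gender[codes.isna()]
print(unmapped)

# Alternative: let pandas assign integer codes to every category it finds.
as_category = gender.astype("category")
print(dict(enumerate(as_category.cat.categories)))  # code -> label
print(as_category.cat.codes)
```

Categorical codes are assigned in sorted order of the labels, so they are convenient for encoding but carry no meaning of their own.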
  • 16. Filtering data in Data Wrangling
      Suppose there is a requirement for the name, gender, and marks of the top-scoring students. Here we remove the unwanted rows with a boolean filter and then drop the column that is not needed.
        # Filter top-scoring students
        df = df[df['Marks'] >= 75].copy()

        # Remove Age column from the filtered DataFrame
        df.drop('Age', axis=1, inplace=True)

        # Display data
        df
      Hence, we have finally obtained an efficient dataset that can be further used for various purposes.
  • 17. Data Wrangling Using Merge Operation
      The merge operation is used to combine two raw data sets into the desired format.
      Syntax: pd.merge(data_frame1, data_frame2, on="field")
      Here field is the name of the column that is common to both dataframes.
      For example: Suppose that a teacher has two types of data. The first consists of student details, and the second consists of pending fee status taken from the accounts office. The teacher can use the merge operation to combine the two and give the data meaning. This makes the data easier to analyze and saves the teacher the time and effort of merging it manually.
  • 18. Creating First Dataframe to Perform Merge Operation using Data Wrangling:
        # Import module
        import pandas as pd

        # Creating DataFrame for student details
        details = pd.DataFrame({
            'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
            'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul',
                     'Nikita', 'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
            'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
                       'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

        # Printing details
        print(details)
  • 19. Creating Second Dataframe to Perform Merge operation using Data Wrangling:
        # Import module
        import pandas as pd

        # Creating DataFrame for fees status
        fees_status = pd.DataFrame({
            'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
            'PENDING': ['5000', '250', 'NIL', '9000', '15000',
                        'NIL', '4500', '1800', '250', 'NIL']})

        # Printing fees_status
        print(fees_status)
  • 20. Data Wrangling Using Merge Operation:
        # Import module
        import pandas as pd

        # Creating DataFrame for student details
        details = pd.DataFrame({
            'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
            'NAME': ['Jagroop', 'Praveen', 'Harjot', 'Pooja', 'Rahul',
                     'Nikita', 'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
            'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
                       'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

        # Creating DataFrame for fees status
        fees_status = pd.DataFrame({
            'ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
            'PENDING': ['5000', '250', 'NIL', '9000', '15000',
                        'NIL', '4500', '1800', '250', 'NIL']})

        # Merging the two DataFrames on the shared ID column
        print(pd.merge(details, fees_status, on='ID'))
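In the example above every ID appears in both dataframes, so the default inner join keeps all rows. The sketch below uses deliberately mismatched, shortened versions of the two tables (an assumption made for illustration) to show how the `how` and `indicator` parameters of `pd.merge` expose rows that have no partner.

```python
import pandas as pd

# Shortened, deliberately mismatched versions of the two tables.
details = pd.DataFrame({"ID": [101, 102, 103],
                        "NAME": ["Jagroop", "Praveen", "Harjot"]})
fees_status = pd.DataFrame({"ID": [101, 103, 104],
                            "PENDING": ["5000", "NIL", "9000"]})

# Default how="inner": only IDs present in both tables survive.
inner = pd.merge(details, fees_status, on="ID")

# how="left" keeps every student; indicator=True adds a _merge column
# that records where each row came from.
left = pd.merge(details, fees_status, on="ID", how="left", indicator=True)
print(left)
```

Rows marked `left_only` in the `_merge` column are students with no fee record, which is often the first thing to check after a merge.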
  • 21. Data Wrangling Using Grouping Method
      The grouping method in data wrangling is used to produce results for the various groups found in a large data set. The pandas groupby() method splits a large data set into groups of rows that share a value.
      Example: A car-selling company stocks brands from various car manufacturers such as Maruti, Toyota, Mahindra, and Ford, and has data on how many cars were sold in different years. The company wants to wrangle only the data for cars sold during the year 2010. For this problem, we use another data wrangling technique, the pandas groupby() method.
  • 22. Creating dataframe to use Grouping methods [Car selling datasets]:
        # Import module
        import pandas as pd

        # Creating data
        car_selling_data = {
            'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti', 'Hyundai', 'Hyundai',
                      'Toyota', 'Mahindra', 'Mahindra', 'Ford', 'Toyota', 'Ford'],
            'Year': [2010, 2011, 2009, 2013, 2010, 2011,
                     2011, 2010, 2013, 2010, 2010, 2011],
            'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2]}

        # Creating DataFrame of car_selling_data
        df = pd.DataFrame(car_selling_data)

        # Printing DataFrame
        print(df)
  • 23. Creating Dataframe to use Grouping methods [DATA OF THE YEAR 2010]:
        # Import module
        import pandas as pd

        # Creating data
        car_selling_data = {
            'Brand': ['Maruti', 'Maruti', 'Maruti', 'Maruti', 'Hyundai', 'Hyundai',
                      'Toyota', 'Mahindra', 'Mahindra', 'Ford', 'Toyota', 'Ford'],
            'Year': [2010, 2011, 2009, 2013, 2010, 2011,
                     2011, 2010, 2013, 2010, 2010, 2011],
            'Sold': [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2]}

        # Creating DataFrame for the provided data
        df = pd.DataFrame(car_selling_data)

        # Group by year and select the group where Year == 2010
        grouped = df.groupby('Year')
        print(grouped.get_group(2010))
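Selecting a single group is one use of `groupby()`. A sketch of two related operations on the same car data follows: a plain boolean filter that picks out the 2010 rows, and an aggregation that totals the units sold per year. The variable names are illustrative.

```python
import pandas as pd

cars = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "Maruti", "Hyundai", "Hyundai",
              "Toyota", "Mahindra", "Mahindra", "Ford", "Toyota", "Ford"],
    "Year": [2010, 2011, 2009, 2013, 2010, 2011,
             2011, 2010, 2013, 2010, 2010, 2011],
    "Sold": [6, 7, 9, 8, 3, 5, 2, 8, 7, 2, 4, 2],
})

# A boolean filter picks out the rows for 2010.
sold_2010 = cars[cars["Year"] == 2010]

# groupby() is most useful when every group needs a summary.
units_per_year = cars.groupby("Year")["Sold"].sum()
print(sold_2010)
print(units_per_year)
```

When only one year is needed, the filter is the simpler choice; grouping pays off once a summary for every year or brand is required.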
  • 24. Data Wrangling by Removing Duplication
      The pandas duplicated() method helps us find duplicate values in large data so that they can be removed. An important part of data wrangling is removing duplicate values from a large data set.
      Syntax: DataFrame.duplicated(subset=None, keep='first')
      Here subset is the column (or list of columns) in which to look for duplicate values. The method returns a boolean Series in which True marks a row that is considered a duplicate. For keep, we have 3 options:
      • keep='first': the first occurrence is treated as the original, and every later occurrence is marked as a duplicate.
      • keep='last': the last occurrence is treated as the original, and every earlier occurrence is marked as a duplicate.
      • keep=False (the boolean, not a string): every value that occurs more than once is marked as a duplicate.
  • 25. For example, a university is organizing an event. To participate, students have to fill in their details in an online form so that the organizers can contact them. A student may fill out the form multiple times, and multiple entries from a single student cause difficulty for the event organizer. The data that the organizers receive can be easily wrangled by removing duplicate values.
  • 26. Creating a dataset of students who want to participate in the event:
        # Import module
        import pandas as pd

        # Initializing data
        student_data = {
            'Name': ['Amit', 'Praveen', 'Jagroop', 'Rahul', 'Vishal', 'Suraj',
                     'Rishab', 'Satyapal', 'Amit', 'Rahul', 'Praveen', 'Amit'],
            'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
            'Email': ['', '', '', '', '', '', '', '', '', '', '', '']}

        # Creating DataFrame of data
        df = pd.DataFrame(student_data)

        # Printing DataFrame
        print(df)
  • 27. Removing Duplicate data from the Dataset using Data wrangling:
        # Import module
        import pandas as pd

        # Initializing data
        student_data = {
            'Name': ['Amit', 'Praveen', 'Jagroop', 'Rahul', 'Vishal', 'Suraj',
                     'Rishab', 'Satyapal', 'Amit', 'Rahul', 'Praveen', 'Amit'],
            'Roll_no': [23, 54, 29, 36, 59, 38, 12, 45, 34, 36, 54, 23],
            'Email': ['', '', '', '', '', '', '', '', '', '', '', '']}

        # Creating DataFrame
        df = pd.DataFrame(student_data)

        # df.duplicated('Roll_no') marks repeated Roll_no entries as True,
        # so ~ (NOT) selects the non-duplicate rows.
        non_duplicate = df[~df.duplicated('Roll_no')]

        # Printing non-duplicate values
        print(non_duplicate)
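pandas also provides `DataFrame.drop_duplicates()`, which combines the marking and filtering steps shown above. The sketch below uses a shortened, made-up version of the student data to compare the three `keep` options.

```python
import pandas as pd

# Shortened version of the student data, for illustration only.
students = pd.DataFrame({
    "Name": ["Amit", "Praveen", "Rahul", "Amit", "Praveen", "Amit"],
    "Roll_no": [23, 54, 36, 23, 54, 23],
})

# keep="first": keep the first entry for each Roll_no.
first = students.drop_duplicates(subset="Roll_no", keep="first")

# keep="last": keep the most recent entry for each Roll_no.
last = students.drop_duplicates(subset="Roll_no", keep="last")

# keep=False: drop every Roll_no that appears more than once.
unique_only = students.drop_duplicates(subset="Roll_no", keep=False)

print(first, last, unique_only, sep="\n\n")
```

For a sign-up form, `keep="last"` may be the better fit if a later submission is meant to correct an earlier one.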
  • 28. Creating New Datasets Using the Concatenation of Two Datasets in Data Wrangling. We can join two dataframes in several ways. For our example of concatenating two datasets, we use the pd.concat() function.
  • 29. Creating Two Dataframes for Concatenation.
        # Importing pandas module
        import pandas as pd

        # Define a dictionary containing the first set of employee data
        data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                 'Age': [27, 24, 22, 32],
                 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
                 'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
                 'Mobile No': [97, 91, 58, 76]}

        # Define a dictionary containing the second set of employee data
        data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
                 'Age': [22, 32, 12, 52],
                 'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
                 'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
                 'Salary': [1000, 2000, 3000, 4000]}

        # Convert the first dictionary into a DataFrame
        df = pd.DataFrame(data1, index=[0, 1, 2, 3])

        # Convert the second dictionary into a DataFrame
        df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])
  • 30. We will join these two dataframes along axis 0.
        res = pd.concat([df, df1])
      Because data1 does not have a Salary column, the Salary values in the four rows of res that come from data1 are NaN. Likewise, data2 has no Mobile No column, so Mobile No is NaN in the four rows that come from data2.
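Two `pd.concat` parameters are often useful in this situation. The sketch below uses small made-up frames (not the employee data above) to show `ignore_index=True`, which renumbers the rows, and `join="inner"`, which keeps only the columns that both frames share.

```python
import pandas as pd

# Small made-up frames with one column each that the other lacks.
a = pd.DataFrame({"Name": ["Jai", "Princi"], "Age": [27, 24],
                  "Mobile No": [97, 91]})
b = pd.DataFrame({"Name": ["Dhiraj", "Hitesh"], "Age": [12, 52],
                  "Salary": [3000, 4000]})

# Default join="outer": union of columns, gaps filled with NaN.
# ignore_index=True replaces the original row labels with 0..n-1.
outer = pd.concat([a, b], ignore_index=True)

# join="inner": keep only the columns present in both frames.
inner = pd.concat([a, b], join="inner", ignore_index=True)

print(outer)
print(inner)
```

Renumbering the rows avoids repeated index labels, which would otherwise occur when the inputs share labels, as with the labels 2 and 3 in the example above.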