SlideShare a Scribd company logo
1 of 12
Data Processing for
Machine Learning in
PYTHON
• What is data processing?
• Need of data preprocessing.
• Steps in data processing.
• Conclusion
Overview
What is data processing?
• Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
• Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other
words, whenever the data is gathered from different sources it is collected in raw format which is not
feasible for the analysis.
Need of Data Preprocessing.
• For achieving better results from the applied model in Machine Learning projects the
format of the data has to be in a proper manner. Some specified Machine Learning
model needs information in a specified format, for example, Random Forest algorithm
does not support null values, therefore to execute random forest algorithm null values
have to be managed from the original raw data set.
• Another aspect is that data set should be formatted in such a way that more than one
Machine Learning and Deep Learning algorithms are executed in one data set, and
best out of them is chosen.
Steps in Data Processing.
Step 1: Preparing for the Preparation;
Data preparation can be seen in the CRISP-DM model (though it can be reasonably argued
that "data understanding" falls within our definition as well). We can also equate our data
preparation with the framework of the KDD Process — specifically the first 3 major steps — which
are selection, preprocessing, and transformation. We can break these down into finer granularity,
but at a macro level, these steps of the KDD Process encompass what data wrangling is.
Step 2: Exploratory Data Analysis;
The purpose of Exploratory data analysis (EDA) is to use summary statistics and
visualizations to better understand data, and find clues about the tendencies of the data, its
quality and to formulate assumptions and the hypothesis of our analysis.
The basic gist is that we need to know the makeup of our data before we can effectively select
predictive algorithms or map out the remaining steps of our data preparation. Throwing our
dataset at the hottest algorithm and hoping for the best is not a strategy.
Step 3: Missing Values;
Some commonly used methods for dealing with missing values include:
 Dropping instances with missing values
 Dropping attributes with missing values
 Imputing the attribute { mean | median | mode } for all missing values
 Imputing the attribute missing values via linear regression
Combination strategies may also be employed: drop any instances with more than 2 missing
values and use the mean attribute value imputation those which remain. Clearly the type of
modeling methods being employed will have an effect on your decision — for example, decision
trees are not amenable to missing values. Additionally, you could technically entertain any
statistical method you could think of for determining missing values from the dataset, but the listed
approaches are tried, tested, and commonly used.
Step 4: Outliers;
Outliers can be the result of poor data collection, or they can be genuinely good, anomalous
data. These are 2 different scenarios, and must be approached differently, and so no "one size fits
all" advice is applicable here, similar to that of dealing with missing values.
One option is to try a transformation. Square root and log transformations both pull in high
numbers. This can make assumptions work better if the outlier is a dependent variable and can
reduce the impact of a single point if the outlier is an independent variable.
Step 5: Imbalanced Data;
A good explanation of why we can run into imbalanced data, and why we can do so in some
domains much more frequently than in others (from 7 Techniques to Handle Imbalanced Data,
below):
1. Use the right evaluation metrics
2. Resample the training set
3. Use K-fold Cross-Validation in the right way
4.Ensemble different resampled datasets
5.Resample with different ratios
6.Cluster the abundant class
7.Design your own models
• However, most machine learning algorithms do not work very well with imbalanced datasets. The
following seven techniques can help you, to train a classifier to detect the abnormal class.
• Note that, while this may not genuinely be a data preparation task, such a dataset characteristic
will make itself known early in the data preparation stage (the importance of EDA), and the validity
of such data can certainly be assessed preliminarily during this preparation stage.
Step 6: Data Transformations;
Transforming data is one of the most important aspects of data preparation, requiring
more finesse than some others. When missing values manifest themselves in data, they are
generally easy to find, and can be dealt with by one of the common methods outlined above
— or by more complex measures gained from insight over time in a domain.
Standardization and normalization are a pair of often employed data transformations in
machine learning projects. Both are data scaling methods: standardization refers to scaling
the data to have a mean of 0 and a standard deviation of 1; normalization refers to the scaling
the data values to fit into a predetermined range, generally between 0 and 1.
• One-hot encoding is a method for transforming categorical features to a format which will
better work for classification and regression.Logarithmic distribution transformation is useful
for transforming non-linear models into linear models and working with skewed data.
• There are numerous additional standard data transformations which are regularly employed,
depending on the data and your requirements. Experience with data preprocessing and
preparation should provide intuition on what types of transformations are required in which
circumstance.
Step 7: Finishing Touches & Moving Ahead;
Alright. Your data is "clean." But what do you do with it?
If you want to go right to feeding your data into a machine learning algorithm in order to attempt
building a model, you probably need your data in a more appropriate representation. In the
Python ecosystem, that would generally be a Numpy ndarray (or matrix).
Conclusion
The future of data processing lies in the cloud. Cloud technology builds on the convenience of
current electronic data processing methods and accelerates its speed and effectiveness. Faster,
higher-quality data means more data for each organization to utilize and more valuable insights to
extract. Python Development services delivered by Suma Soft make use of an Agile
methodology. Our services help improve time to market.
Suma Soft’s Outsourced Python Development services include Python Web Development using
Django framework, Python Flask Web Development, Python Web Crawler Development, Python
Integration and maintenance, Migration Services and many more.
We have delivered 100+ Python Development outsourcing projects to clients from 8+industries.
Our expert team has proficiency in all versions of PHP, including the latest Python 3.6 version.
We maintain 100% transparency throughout Python Development process.
Suma Soft Pvt Lmt.
sales@sumasoft.com
https://www.sumasoft.com/
Contact us
+91 20 4013 0400

More Related Content

What's hot

Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningKai Koenig
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - ReportAkanksha Gohil
 
On multi dimensional cubes of census data: designing and querying
On multi dimensional cubes of census data: designing and queryingOn multi dimensional cubes of census data: designing and querying
On multi dimensional cubes of census data: designing and queryingJaspreet Issaj
 
Application of data mining tools for
Application of data mining tools forApplication of data mining tools for
Application of data mining tools forIJDKP
 
4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processingMahmoud Alfarra
 
Final Report
Final ReportFinal Report
Final Reportimu409
 
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...IRJET Journal
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
 
Selecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - PhdassistanceSelecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - PhdassistancePhD Assistance
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEIJDKP
 
Selecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - PhdassistanceSelecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - PhdassistancePhD Assistance
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection methodIJSRD
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesRajendran
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...csandit
 
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...Sagar Deogirkar
 
Data science lecture4_doaa_mohey
Data science lecture4_doaa_moheyData science lecture4_doaa_mohey
Data science lecture4_doaa_moheyDoaa Mohey Eldin
 

What's hot (20)

Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
 
On multi dimensional cubes of census data: designing and querying
On multi dimensional cubes of census data: designing and queryingOn multi dimensional cubes of census data: designing and querying
On multi dimensional cubes of census data: designing and querying
 
Application of data mining tools for
Application of data mining tools forApplication of data mining tools for
Application of data mining tools for
 
4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processing
 
Final Report
Final ReportFinal Report
Final Report
 
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
Selecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - PhdassistanceSelecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - Phdassistance
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
 
Selecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - PhdassistanceSelecting the Right Type of Algorithm for Various Applications - Phdassistance
Selecting the Right Type of Algorithm for Various Applications - Phdassistance
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
G046024851
G046024851G046024851
G046024851
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...
Feature Selection : A Novel Approach for the Prediction of Learning Disabilit...
 
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...
Comparative Study of Machine Learning Algorithms for Sentiment Analysis with ...
 
Data science lecture4_doaa_mohey
Data science lecture4_doaa_moheyData science lecture4_doaa_mohey
Data science lecture4_doaa_mohey
 
Excel Datamining Addin Advanced
Excel Datamining Addin AdvancedExcel Datamining Addin Advanced
Excel Datamining Addin Advanced
 

Similar to Data processing

Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptxLuminous8
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfShaikSikindar1
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhVISHALMARWADE1
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningNandakumar P
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Qualitypriyanka rajput
 
Mind Map Test Data Management Overview
Mind Map Test Data Management OverviewMind Map Test Data Management Overview
Mind Map Test Data Management Overviewdublinx
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousingSunny Gandhi
 
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREEA ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREEijcsa
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17AnwarrChaudary
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And IntegrityGerrit Klaschke, CSM
 
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challengesijcnes
 
Data Analyst Interview Questions & Answers
Data Analyst Interview Questions & AnswersData Analyst Interview Questions & Answers
Data Analyst Interview Questions & AnswersSatyam Jaiswal
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
Foundational Methodology for Data Science
Foundational Methodology for Data ScienceFoundational Methodology for Data Science
Foundational Methodology for Data ScienceJohn B. Rollins, Ph.D.
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxAkash527744
 

Similar to Data processing (20)

KDD assignmnt data.docx
KDD assignmnt data.docxKDD assignmnt data.docx
KDD assignmnt data.docx
 
Anwar kamal .pdf.pptx
Anwar kamal .pdf.pptxAnwar kamal .pdf.pptx
Anwar kamal .pdf.pptx
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdf
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
 
Mind Map Test Data Management Overview
Mind Map Test Data Management OverviewMind Map Test Data Management Overview
Mind Map Test Data Management Overview
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREEA ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
 
Data Collection Process And Integrity
Data Collection Process And IntegrityData Collection Process And Integrity
Data Collection Process And Integrity
 
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challenges
 
Data Analyst Interview Questions & Answers
Data Analyst Interview Questions & AnswersData Analyst Interview Questions & Answers
Data Analyst Interview Questions & Answers
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Foundational Methodology for Data Science
Foundational Methodology for Data ScienceFoundational Methodology for Data Science
Foundational Methodology for Data Science
 
Unit II.pdf
Unit II.pdfUnit II.pdf
Unit II.pdf
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptx
 

More from AnupamSingh211

Factors that influence app development cost
Factors that influence app development costFactors that influence app development cost
Factors that influence app development costAnupamSingh211
 
Unexpected benefits of .net development outsourcing. 2
Unexpected benefits of .net development outsourcing. 2Unexpected benefits of .net development outsourcing. 2
Unexpected benefits of .net development outsourcing. 2AnupamSingh211
 
5 benefits of mobile app development outsourcing
5 benefits of mobile app development outsourcing5 benefits of mobile app development outsourcing
5 benefits of mobile app development outsourcingAnupamSingh211
 
5 Benefits of Offshoring your Android App Development
5 Benefits of Offshoring your Android App Development5 Benefits of Offshoring your Android App Development
5 Benefits of Offshoring your Android App DevelopmentAnupamSingh211
 
Software Development services
Software Development servicesSoftware Development services
Software Development servicesAnupamSingh211
 
5 mistakes to avoid in outsourcing
5 mistakes to avoid in outsourcing5 mistakes to avoid in outsourcing
5 mistakes to avoid in outsourcingAnupamSingh211
 
Dot Net development cost
Dot Net development cost Dot Net development cost
Dot Net development cost AnupamSingh211
 

More from AnupamSingh211 (7)

Factors that influence app development cost
Factors that influence app development costFactors that influence app development cost
Factors that influence app development cost
 
Unexpected benefits of .net development outsourcing. 2
Unexpected benefits of .net development outsourcing. 2Unexpected benefits of .net development outsourcing. 2
Unexpected benefits of .net development outsourcing. 2
 
5 benefits of mobile app development outsourcing
5 benefits of mobile app development outsourcing5 benefits of mobile app development outsourcing
5 benefits of mobile app development outsourcing
 
5 Benefits of Offshoring your Android App Development
5 Benefits of Offshoring your Android App Development5 Benefits of Offshoring your Android App Development
5 Benefits of Offshoring your Android App Development
 
Software Development services
Software Development servicesSoftware Development services
Software Development services
 
5 mistakes to avoid in outsourcing
5 mistakes to avoid in outsourcing5 mistakes to avoid in outsourcing
5 mistakes to avoid in outsourcing
 
Dot Net development cost
Dot Net development cost Dot Net development cost
Dot Net development cost
 

Recently uploaded

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Data processing

  • 1. Data Processing for Machine Learning in PYTHON
  • 2. • What is data processing? • Need of data preprocessing. • Steps in data processing. • Conclusion Overview
  • 3. What is data processing? • Pre-processing refers to the transformations applied to our data before feeding it to the algorithm. • Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.
  • 4. Need of Data Preprocessing. • For achieving better results from the applied model in Machine Learning projects the format of the data has to be in a proper manner. Some specified Machine Learning model needs information in a specified format, for example, Random Forest algorithm does not support null values, therefore to execute random forest algorithm null values have to be managed from the original raw data set. • Another aspect is that data set should be formatted in such a way that more than one Machine Learning and Deep Learning algorithms are executed in one data set, and best out of them is chosen.
  • 5. Steps in Data Processing.
  • 6. Step 1: Preparing for the Preparation; Data preparation can be seen in the CRISP-DM model (though it can be reasonably argued that "data understanding" falls within our definition as well). We can also equate our data preparation with the framework of the KDD Process — specifically the first 3 major steps — which are selection, preprocessing, and transformation. We can break these down into finer granularity, but at a macro level, these steps of the KDD Process encompass what data wrangling is. Step 2: Exploratory Data Analysis; The purpose of Exploratory data analysis (EDA) is to use summary statistics and visualizations to better understand data, and find clues about the tendencies of the data, its quality and to formulate assumptions and the hypothesis of our analysis. The basic gist is that we need to know the makeup of our data before we can effectively select predictive algorithms or map out the remaining steps of our data preparation. Throwing our dataset at the hottest algorithm and hoping for the best is not a strategy. Step 3: Missing Values; Some commonly used methods for dealing with missing values include:  Dropping instances with missing values  Dropping attributes with missing values  Imputing the attribute { mean | median | mode } for all missing values  Imputing the attribute missing values via linear regression
  • 7. Combination strategies may also be employed: drop any instances with more than 2 missing values and use the mean attribute value imputation those which remain. Clearly the type of modeling methods being employed will have an effect on your decision — for example, decision trees are not amenable to missing values. Additionally, you could technically entertain any statistical method you could think of for determining missing values from the dataset, but the listed approaches are tried, tested, and commonly used. Step 4: Outliers; Outliers can be the result of poor data collection, or they can be genuinely good, anomalous data. These are 2 different scenarios, and must be approached differently, and so no "one size fits all" advice is applicable here, similar to that of dealing with missing values. One option is to try a transformation. Square root and log transformations both pull in high numbers. This can make assumptions work better if the outlier is a dependent variable and can reduce the impact of a single point if the outlier is an independent variable.
  • 8. Step 5: Imbalanced Data; A good explanation of why we can run into imbalanced data, and why we can do so in some domains much more frequently than in others (from 7 Techniques to Handle Imbalanced Data, below): 1. Use the right evaluation metrics 2. Resample the training set 3. Use K-fold Cross-Validation in the right way 4.Ensemble different resampled datasets 5.Resample with different ratios 6.Cluster the abundant class 7.Design your own models • However, most machine learning algorithms do not work very well with imbalanced datasets. The following seven techniques can help you, to train a classifier to detect the abnormal class. • Note that, while this may not genuinely be a data preparation task, such a dataset characteristic will make itself known early in the data preparation stage (the importance of EDA), and the validity of such data can certainly be assessed preliminarily during this preparation stage.
  • 9. Step 6: Data Transformations; Transforming data is one of the most important aspects of data preparation, requiring more finesse than some others. When missing values manifest themselves in data, they are generally easy to find, and can be dealt with by one of the common methods outlined above — or by more complex measures gained from insight over time in a domain. Standardization and normalization are a pair of often employed data transformations in machine learning projects. Both are data scaling methods: standardization refers to scaling the data to have a mean of 0 and a standard deviation of 1; normalization refers to the scaling the data values to fit into a predetermined range, generally between 0 and 1. • One-hot encoding is a method for transforming categorical features to a format which will better work for classification and regression.Logarithmic distribution transformation is useful for transforming non-linear models into linear models and working with skewed data. • There are numerous additional standard data transformations which are regularly employed, depending on the data and your requirements. Experience with data preprocessing and preparation should provide intuition on what types of transformations are required in which circumstance.
  • 10. Step 7: Finishing Touches & Moving Ahead; Alright. Your data is "clean." But what do you do with it? If you want to go right to feeding your data into a machine learning algorithm in order to attempt building a model, you probably need your data in a more appropriate representation. In the Python ecosystem, that would generally be a Numpy ndarray (or matrix).
  • 11. Conclusion The future of data processing lies in the cloud. Cloud technology builds on the convenience of current electronic data processing methods and accelerates its speed and effectiveness. Faster, higher-quality data means more data for each organization to utilize and more valuable insights to extract. Python Development services delivered by Suma Soft make use of an Agile methodology. Our services help improve time to market. Suma Soft’s Outsourced Python Development services include Python Web Development using Django framework, Python Flask Web Development, Python Web Crawler Development, Python Integration and maintenance, Migration Services and many more. We have delivered 100+ Python Development outsourcing projects to clients from 8+industries. Our expert team has proficiency in all versions of PHP, including the latest Python 3.6 version. We maintain 100% transparency throughout Python Development process.
  • 12. Suma Soft Pvt Lmt. sales@sumasoft.com https://www.sumasoft.com/ Contact us +91 20 4013 0400