Python software development offers developers ease of programming and delivers quick results for projects of any size. Suma Soft is an expert company providing complete Python software development services for small, mid-sized, and large companies, with 19 years of expertise and strong client patronage. To know more- https://www.sumasoft.com/python-software-development
2. Overview
• What is data preprocessing?
• Need for data preprocessing.
• Steps in data preprocessing.
• Conclusion
3. What is data preprocessing?
• Preprocessing refers to the transformations applied to our data before feeding it to the algorithm.
• Data preprocessing is a technique used to convert raw data into a clean data set. In other
words, data gathered from different sources is collected in a raw format that is not
feasible for analysis.
4. Need for Data Preprocessing.
• To achieve better results from the applied model in Machine Learning projects, the
data has to be in a proper format. Some Machine Learning models need information
in a specific format; for example, the Random Forest algorithm does not support null
values, so to execute the Random Forest algorithm, null values have to be managed
in the original raw data set.
• Another aspect is that the data set should be formatted in such a way that more than one
Machine Learning or Deep Learning algorithm can be executed on the same data set, and
the best of them chosen.
6. Step 1: Preparing for the Preparation;
Data preparation can be seen in the CRISP-DM model (though it can be reasonably argued
that "data understanding" falls within our definition as well). We can also equate our data
preparation with the framework of the KDD Process — specifically the first 3 major steps — which
are selection, preprocessing, and transformation. We can break these down into finer granularity,
but at a macro level, these steps of the KDD Process encompass what data wrangling is.
Step 2: Exploratory Data Analysis;
The purpose of exploratory data analysis (EDA) is to use summary statistics and
visualizations to better understand the data: to find clues about its tendencies and
quality, and to formulate the assumptions and hypotheses of our analysis.
The basic gist is that we need to know the makeup of our data before we can effectively select
predictive algorithms or map out the remaining steps of our data preparation. Throwing our
dataset at the hottest algorithm and hoping for the best is not a strategy.
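As a minimal sketch of EDA with summary statistics, the following uses pandas on a small hypothetical dataset (the column names and values are illustrative, not from the deck):

```python
import pandas as pd

# Hypothetical toy dataset for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40000, 52000, 71000, 68000, 59000],
})

# Per-column summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# Pairwise correlations hint at relationships worth exploring further.
print(df.corr())
```

In practice this would be complemented with visualizations such as histograms and scatter plots to inspect distributions and spot anomalies.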
Step 3: Missing Values;
Some commonly used methods for dealing with missing values include:
Dropping instances with missing values
Dropping attributes with missing values
Imputing the attribute { mean | median | mode } for all missing values
Imputing the attribute missing values via linear regression
7. Combination strategies may also be employed: drop any instances with more than 2 missing
values and use mean attribute value imputation for those which remain. Clearly the type of
modeling methods being employed will have an effect on your decision — for example, decision
trees are not amenable to missing values. Additionally, you could technically entertain any
statistical method you could think of for determining missing values from the dataset, but the listed
approaches are tried, tested, and commonly used.
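The first three listed approaches can be sketched in pandas on a hypothetical dataset (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 2.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Option 1: drop instances (rows) containing any missing value.
dropped_rows = df.dropna()

# Option 2: drop attributes (columns) containing missing values.
dropped_cols = df.dropna(axis=1)

# Option 3: impute the attribute mean for all missing values.
imputed = df.fillna(df.mean())
```

Swapping `df.mean()` for `df.median()` or `df.mode().iloc[0]` gives the median and mode variants.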
Step 4: Outliers;
Outliers can be the result of poor data collection, or they can be genuinely good, anomalous
data. These are two different scenarios and must be approached differently, so no "one size fits
all" advice is applicable here, much as with missing values.
One option is to try a transformation. Square root and log transformations both pull in high
numbers. This can make assumptions work better if the outlier is a dependent variable and can
reduce the impact of a single point if the outlier is an independent variable.
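A quick sketch with NumPy, using hypothetical values, shows how both transformations pull in a high outlier:

```python
import numpy as np

# Hypothetical skewed values with one large outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Log transform: log1p (log(1 + x)) is safe at zero and compresses the range.
log_values = np.log1p(values)

# Square-root transform: a milder compression of high values.
sqrt_values = np.sqrt(values)

print(values.max() / values.min())          # raw spread: 1000x
print(log_values.max() / log_values.min())  # far smaller after the log
```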
8. Step 5: Imbalanced Data;
Imbalanced data arises when one class greatly outnumbers another, and it occurs much more
frequently in some domains than in others. Most machine learning algorithms do not work very
well with imbalanced datasets. The following seven techniques (from 7 Techniques to Handle
Imbalanced Data, below) can help you train a classifier to detect the abnormal class:
1. Use the right evaluation metrics
2. Resample the training set
3. Use K-fold Cross-Validation in the right way
4. Ensemble different resampled datasets
5. Resample with different ratios
6. Cluster the abundant class
7. Design your own models
• Note that, while this may not genuinely be a data preparation task, such a dataset characteristic
will make itself known early in the data preparation stage (the importance of EDA), and the validity
of such data can certainly be assessed preliminarily during this preparation stage.
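Technique 2 (resample the training set) can be sketched as naive random oversampling with pandas; the 90/10 class split and the "normal"/"abnormal" labels are hypothetical. Libraries such as imbalanced-learn offer more sophisticated resamplers (e.g. SMOTE):

```python
import pandas as pd

# Hypothetical imbalanced dataset: 90 "normal" rows vs 10 "abnormal" rows.
df = pd.DataFrame({
    "feature": range(100),
    "label": ["normal"] * 90 + ["abnormal"] * 10,
})

majority = df[df["label"] == "normal"]
minority = df[df["label"] == "abnormal"]

# Duplicate minority rows with replacement until the classes balance.
oversampled_minority = minority.sample(
    n=len(majority), replace=True, random_state=0
)
balanced = pd.concat([majority, oversampled_minority])

print(balanced["label"].value_counts())
```

Note that resampling should be applied only to the training split, never to the evaluation data.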
9. Step 6: Data Transformations;
Transforming data is one of the most important aspects of data preparation, requiring
more finesse than some others. When missing values manifest themselves in data, they are
generally easy to find, and can be dealt with by one of the common methods outlined above
— or by more complex measures gained from insight over time in a domain.
Standardization and normalization are a pair of often-employed data transformations in
machine learning projects. Both are data scaling methods: standardization refers to scaling
the data to have a mean of 0 and a standard deviation of 1; normalization refers to scaling
the data values to fit into a predetermined range, generally between 0 and 1.
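Both scalings follow directly from their definitions; a minimal sketch with NumPy on a hypothetical feature column:

```python
import numpy as np

# Hypothetical feature column.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation,
# giving mean 0 and standard deviation 1.
standardized = (x - x.mean()) / x.std()

# Normalization (min-max scaling): map values into the range [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())
```

scikit-learn's `StandardScaler` and `MinMaxScaler` implement the same transformations with fit/transform semantics suited to train/test splits.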
• One-hot encoding is a method for transforming categorical features into a format that works
better for classification and regression. Logarithmic transformation is useful for
transforming non-linear relationships into linear ones and for working with skewed data.
• There are numerous additional standard data transformations which are regularly employed,
depending on the data and your requirements. Experience with data preprocessing and
preparation should provide intuition on what types of transformations are required in which
circumstance.
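The one-hot encoding transformation mentioned above can be sketched with pandas; the "color" feature and its values are hypothetical:

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
```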
10. Step 7: Finishing Touches & Moving Ahead;
Alright. Your data is "clean." But what do you do with it?
If you want to go right to feeding your data into a machine learning algorithm in order to attempt
building a model, you probably need your data in a more appropriate representation. In the
Python ecosystem, that would generally be a Numpy ndarray (or matrix).
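A common way to get there from a pandas DataFrame, sketched on a hypothetical cleaned dataset with made-up column names:

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned dataset.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "x2": [4.0, 5.0, 6.0],
    "target": [0, 1, 0],
})

# Split features from labels and convert to NumPy ndarrays — the
# representation most Python machine learning libraries expect.
X = df[["x1", "x2"]].to_numpy()
y = df["target"].to_numpy()

print(type(X), X.shape)
```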
11. Conclusion
The future of data processing lies in the cloud. Cloud technology builds on the convenience of
current electronic data processing methods and accelerates its speed and effectiveness. Faster,
higher-quality data means more data for each organization to utilize and more valuable insights to
extract. Python Development services delivered by Suma Soft make use of an Agile
methodology. Our services help improve time to market.
Suma Soft’s Outsourced Python Development services include Python Web Development using
Django framework, Python Flask Web Development, Python Web Crawler Development, Python
Integration and maintenance, Migration Services and many more.
We have delivered 100+ Python Development outsourcing projects to clients from 8+ industries.
Our expert team has proficiency in all versions of Python, including the latest Python 3.6 version.
We maintain 100% transparency throughout the Python Development process.