1. IETE: “AI & ML For IOT” Workshop
Analysis and Preprocessing of Data in IoT Environment
Dr. Sankirti Sandeep Shiravale
Associate Professor, MMCOE, Pune
sankirtishiravale@mmcoe.edu.in
8/8/2023
2. Contents
● Understanding of data
● Data Pre-processing
● Selection of ML models
● IoT- ML architecture
● Tiny ML
● Summary
3. Understanding of data
● Data: Raw output generated by IoT devices, sensors, cameras, etc. is referred to as data.
● Information: Data processed to derive some meaningful content is called information.
● Knowledge: Information used in a decision-making process is called knowledge.
E.g. 1.
Data: Birthdate
Information: Derive age
Knowledge: Younger age people are more tech savvy.
E.g.2.
A temperature (LM35) sensor produces analog data, which is converted into digital values (Celsius/Fahrenheit)
and used to predict the room temperature state, i.e., normal/cool/hot
4. Data Types
● Nominal
● Binary
● Numeric: quantitative
○ Interval-scaled
○ Ratio-scaled
● Discrete vs. Continuous
5. Data Types
■ Nominal: categories, states, or “names of things”
■ Hair_color = {auburn, black, blond, brown, grey, red, white}
■ marital status, occupation, ID numbers, zip codes
■ Binary
■ Nominal attribute with only 2 states (0 and 1)
■ Symmetric binary: both outcomes equally important
■ E.g. gender
■ Asymmetric binary: outcomes not equally important.
■ e.g. medical test (positive vs. negative)
■ Convention: assign 1 to most important outcome (e.g., HIV positive)
■ Ordinal
■ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
■ Size = {small, medium, large}, grades, army rankings
6. Data Types: Numerical
● Quantity (integer or real-valued)
● Interval
■ Measured on a scale of equal-sized units
■ Values have order
● E.g., temperature in C˚or F˚, calendar dates
■ No true zero-point
● Ratio
■ Inherent zero-point
■ We can speak of values as being an order of magnitude larger than the
unit of measurement (10 K is twice as high as 5 K).
● e.g., temperature in Kelvin, length, counts, monetary quantities
7. Discrete vs. Continuous Attributes
● Discrete Attribute
○ Has only a finite or countably infinite set of values
E.g., zip codes, profession
○ Sometimes, represented as integer variables
○ Binary attributes are a special case of discrete attributes
● Continuous Attribute
○ Has real numbers as attribute values
E.g., temperature, height, or weight
○ Practically, real values can only be measured and represented using a finite
number of digits
○ Continuous attributes are typically represented as floating-point variables
9. Why data preprocessing?
● Data generated from IoT devices may be
incomplete, noisy, and inconsistent. Such
data is called raw data.
● ML models applied to preprocessed data
are more accurate than the same models
applied to raw data.
● Hence, preprocessing is a mandatory step
in any AI/ML-based IoT application.
11. Steps for data preprocessing
Fig. Steps for data preprocessing [1]
12. Data Cleaning
● Incomplete data may come from
○ “Not applicable” data value when collected
○ Different considerations between the time when the data was collected and when it is analyzed.
○ Human/hardware/software problems
● Noisy data (incorrect values) may come from
○ Faulty data collection instruments
○ Human or computer error at data entry
○ Errors in data transmission
● Inconsistent data may come from
○ Different data sources
○ Functional dependency violation (e.g., modify some linked data)
● Duplicate records also need data cleaning
13. Data Cleaning
● Data cleaning is the process of filling in missing values, removing noise, and correcting inconsistencies
● Data is not always available
○ E.g., many tuples have no recorded value for several attributes, such as customer income in
sales data
● Missing data may be due to
○ equipment malfunction
○ inconsistent with other recorded data and thus deleted
○ data not entered due to misunderstanding
○ certain data may not be considered important at the time of entry
○ history or changes of the data were not registered
14. Data Cleaning: handling missing values
● Ignore the tuple: usually done when the class label is missing (assuming a classification
task); not effective when the percentage of missing values per attribute varies
considerably.
● Fill in the missing value manually: tedious and often infeasible
● Fill it in automatically with
○ a global constant : e.g., “unknown”, a new class?!
○ the attribute mean
○ the attribute mean for all samples belonging to the same class: smarter
○ the most probable value: inference-based such as Bayesian formula or decision tree
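The fill strategies above can be sketched with pandas; the DataFrame below is hypothetical (a "room" class and a temperature attribute with gaps), not from the slides.

```python
import pandas as pd
import numpy as np

# Hypothetical sensor readings; NaN marks missing values
df = pd.DataFrame({
    "room": ["lab", "lab", "office", "office"],
    "temp": [22.0, np.nan, 30.0, np.nan],
})

# Fill with a global constant
const_filled = df["temp"].fillna(-1)

# Fill with the attribute mean over all samples
mean_filled = df["temp"].fillna(df["temp"].mean())

# Smarter: fill with the mean of samples in the same class (room)
class_filled = df.groupby("room")["temp"].transform(
    lambda s: s.fillna(s.mean())
)
```

Here the global mean fills both gaps with 26.0, while the class-conditional fill uses 22.0 for "lab" and 30.0 for "office".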
15. Data Cleaning: noise removal
● Noise: random error or variance in a measured variable
● Incorrect attribute values may be due to
○ faulty data collection instruments
○ data entry problems
○ data transmission problems
○ technology limitation
○ inconsistency in naming convention
● Other data problems which require data cleaning
○ duplicate records
○ incomplete data
○ inconsistent data
16. Data Cleaning: noise removal
● Binning
○ first sort data and partition into (equal-frequency) bins
○ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
● Regression
○ smooth by fitting the data into regression functions
● Clustering
○ detect and remove outliers
● Combined computer and human inspection
○ detect suspicious values and check by human (e.g., deal with possible outliers)
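Binning with smoothing by bin means can be sketched in a few lines of NumPy; the price list is a small illustrative example of the kind used in [1].

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning: sort the data, split it into
    n_bins bins, and replace each value by its bin mean."""
    data = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(data, n_bins)
    return [float(b.mean()) for b in bins for _ in b]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
# bin [4, 8, 15] -> 9, [21, 21, 24] -> 22, [25, 28, 34] -> 29
```

Smoothing by bin medians or bin boundaries follows the same pattern, only the per-bin replacement rule changes.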
17. Data Cleaning: inconsistency correction
- Inconsistency may occur due to poor design, errors in data entry,
or data decay.
- E.g., differences in data representation, such as date format (dd/mm/yy)
vs. (mm/dd/yy)
- Inconsistency can be handled using
- Meta data
- Domain knowledge
- Scrubbing tools
- Audit tools
- Apply unique rules, null rules, and consecutive rules
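A minimal scrubbing rule for the date-format inconsistency above: try each known source format and rewrite to one canonical form. The format list here is an assumption; a real tool would take it from metadata or domain knowledge, and ambiguous formats (dd/mm vs. mm/dd with the same separator) still need a human decision.

```python
from datetime import datetime

def normalize_date(raw):
    """Rewrite a date string from any known source format
    to the canonical ISO form YYYY-MM-DD."""
    for fmt in ("%d/%m/%Y", "%m-%d-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"Unrecognized date format: {raw!r}")

iso = normalize_date("25/12/2023")   # dd/mm/yyyy -> "2023-12-25"
```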
18. Data Integration
● Data integration combines data from multiple sources into a coherent store
● Schema integration:
○ Entity identification problem: e.g., A.cust-id ≡ B.cust-#
○ Integrate metadata from different sources
● Redundancy can be detected using correlation analysis
● Detecting and resolving data value conflicts
○ For the same real world entity, attribute values from different sources are different
○ Possible reasons: different representations, different scales, e.g., metric vs. British units
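Both the entity-identification problem and correlation-based redundancy detection can be sketched with pandas; the two source tables below are hypothetical.

```python
import pandas as pd

# Two sources with mismatched schemas: A uses "cust_id", B uses "cust-#"
a = pd.DataFrame({"cust_id": [1, 2, 3], "temp_c": [20.0, 25.0, 30.0]})
b = pd.DataFrame({"cust-#": [1, 2, 3], "temp_f": [68.0, 77.0, 86.0]})

# Resolve the entity-identification problem by renaming, then merge
merged = a.merge(b.rename(columns={"cust-#": "cust_id"}), on="cust_id")

# Correlation analysis: temp_f is perfectly correlated with temp_c
# (it is the same measurement in a different scale), so it is redundant
r = merged["temp_c"].corr(merged["temp_f"])
```

A correlation coefficient near ±1 between two integrated attributes is the classic signal that one of them can be dropped.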
19. Data Transformation
● Smoothing: remove noise from data
● Aggregation: summarization, data cube construction
● Generalization: concept hierarchy climbing
● Normalization: scaled to fall within a small, specified range
○ min-max normalization
○ z-score normalization
○ normalization by decimal scaling
● Attribute/feature construction
○ New attributes constructed from the given ones
Fig. Aggregation
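The aggregation step can be sketched as a pandas group-by; the hourly readings and day labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly sensor readings, aggregated (summarized) to daily means
readings = pd.DataFrame({
    "day":  ["Mon", "Mon", "Tue", "Tue"],
    "temp": [20.0, 24.0, 28.0, 32.0],
})
daily = readings.groupby("day", sort=False)["temp"].mean()
# Mon -> 22.0, Tue -> 30.0
```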
20. Data Transformation : Normalization
● Why transformation?
● Sensor 1 produces output values in [0, 1]; Sensor 2 produces output in [1, 100]
E.g., when we compare two tuples t1(5, 50) and t2(10, 40) using the Euclidean distance measure, the
result is dominated by the larger values produced by Sensor 2, so the ML model will make biased
decisions.
● Solution: normalization, e.g., min-max normalization
E.g., let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to
(73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709
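Min-max and z-score normalization can both be written in a few lines of NumPy; the income figures reuse the example above, and the sensor values are hypothetical.

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale x to [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Z-score normalization: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

incomes = [12_000, 73_000, 98_000]
scaled = min_max(incomes)          # $73,000 -> ~0.709

sensor2 = z_score([50.0, 40.0, 60.0])  # hypothetical Sensor 2 readings
```

After normalization, both sensors contribute on the same scale, so Euclidean distances are no longer dominated by the sensor with the larger range.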
21. Data Reduction
● Why data reduction?
○ A database/data warehouse/big data may store terabytes of data
○ Complex data analysis/ML models may take a very long time to run on the complete data set
● Data reduction
○ Obtain a reduced representation of the data set that is much smaller in volume yet produces
the same (or almost the same) analytical results
● Data reduction strategies
○ Data cube aggregation:
○ Dimensionality reduction — e.g., remove unimportant attributes
○ Data Compression
○ Numerosity reduction — e.g., fit data into models
○ Discretization and concept hierarchy generation
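One standard dimensionality-reduction technique (not the only strategy listed above) is principal component analysis, sketched here in plain NumPy; the input matrix is random and purely illustrative.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center the attributes
    cov = np.cov(Xc, rowvar=False)          # attribute covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues ascending
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 tuples, 5 attributes
X_reduced = pca_reduce(X, 2)    # reduced to 2 attributes
```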
24. Tools for data preprocessing
ETL Tools
• PowerBI
• Informatica
• IBM Cognos
Data Analytical Tools
• Pentaho
• RapidMiner
• Knime
• Weka
• R programming
• Python Libraries:
– Numpy
– Pandas
– NLTK
25. Selection of ML Models
● Nature of the problem statement
○ E.g., supervised/unsupervised, or prediction/classification
● Availability of data
○ E.g., CNN models perform better if the training dataset is large enough
● Understanding of data
○ E.g., a temperature attribute with numeric values (temp = 30 °C) leads to a prediction problem,
while temperature with categorical values (temp = high) is a classification problem
● Availability of processing power
○ E.g., CNN architectures cannot be executed directly on a Raspberry Pi, but simple rule-based ML
can be
26. IoT- ML architecture
Fig. Conventional IoT-ML architecture: the IoT device takes inputs from sensors and performs basic processing; the ML models run elsewhere.
● Conventional IoT-ML architectures are cloud based
27. Tiny ML
● Latency and bandwidth bottlenecks are the
major drawbacks of cloud-based
processing
● Tiny ML is a new, emerging research
area
● Compressed and optimized ML
models are installed and executed directly on
IoT devices for better performance
and cost effectiveness
Fig. TinyML architecture: inputs from sensors → basic processing → Tiny ML models → output, all on the IoT device.
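TinyML toolchains typically shrink models through quantization before deploying them on-device. A minimal, library-free sketch of the core idea (symmetric int8 quantization of a weight vector; the weights are hypothetical):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric int8 quantization: store weights as int8 plus one
    float scale -- roughly 4x smaller than float32 storage."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # close to w, at a fraction of the storage
```

Production frameworks (e.g., TensorFlow Lite) combine quantization with pruning and operator fusion, but the space/accuracy trade-off is the same one sketched here.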
28. Summary
● A tremendous amount of data is generated by edge technologies such as IoT
devices, cameras, and phones.
● Understanding and preprocessing the data improves the accuracy of ML
models.
● Data cleaning, integration, transformation, and reduction are the four major steps
of data preprocessing.
● Availability of data, processing devices, the nature of the problem statement, and
data insights are the key parameters for selecting ML models.
● The TinyML framework in IoT aims to provide low latency, effective bandwidth
utilization, stronger data safety, enhanced privacy, and reduced cost [2].
29. References
1. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and
Techniques", 3rd Edition, Morgan Kaufmann, 2011.
2. Lachit Dutta, Swapna Bharali, "TinyML Meets IoT: A Comprehensive
Survey", Internet of Things, Volume 16, 2021, 100461.