1. IETE: “AI & ML For IOT” Workshop
Analysis and Preprocessing of Data in IoT Environment
Dr. Sankirti Sandeep Shiravale
Associate Professor, MMCOE, Pune
sankirtishiravale@mmcoe.edu.in
8/8/2023
2. Contents
● Understanding of data
● Data Pre-processing
● Selection of ML models
● IoT- ML architecture
● Tiny ML
● Summary
3. Understanding of data
● Data: Raw output generated by IoT devices, sensors, cameras, etc. is referred to as data.
● Information: Data processed to derive some meaningful content is called information.
● Knowledge: Information used in a decision-making process is called knowledge.
E.g. 1.
Data: Birthdate
Information: Derive age
Knowledge: Younger age people are more tech savvy.
E.g.2.
A temperature (LM35) sensor produces analog data, which is converted into digital values (Celsius/Fahrenheit)
and used to predict the room temperature state, i.e., normal/cool/hot
4. Data Types
● Nominal
● Binary
● Numeric: quantitative
○ Interval-scaled
○ Ratio-scaled
● Discrete vs. Continuous
5. Data Types
■ Nominal: categories, states, or “names of things”
■ Hair_color = {auburn, black, blond, brown, grey, red, white}
■ marital status, occupation, ID numbers, zip codes
■ Binary
■ Nominal attribute with only 2 states (0 and 1)
■ Symmetric binary: both outcomes equally important
■ E.g. gender
■ Asymmetric binary: outcomes not equally important.
■ e.g. medical test (positive vs. negative)
■ Convention: assign 1 to most important outcome (e.g., HIV positive)
■ Ordinal
■ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
■ Size = {small, medium, large}, grades, army rankings
6. Data Types: Numerical
● Quantity (integer or real-valued)
● Interval
■ Measured on a scale of equal-sized units
■ Values have order
● E.g., temperature in C˚or F˚, calendar dates
■ No true zero-point
● Ratio
■ Inherent zero-point
■ We can speak of values as being an order of magnitude larger than the
unit of measurement (10 K is twice as high as 5 K).
● e.g., temperature in Kelvin, length, counts, monetary quantities
7. Discrete vs. Continuous Attributes
● Discrete Attribute
○ Has only a finite or countably infinite set of values
E.g., zip codes, profession
○ Sometimes, represented as integer variables
○ Binary attributes are a special case of discrete attributes
● Continuous Attribute
○ Has real numbers as attribute values
E.g., temperature, height, or weight
○ Practically, real values can only be measured and represented using a finite
number of digits
○ Continuous attributes are typically represented as floating-point variables
9. Why data preprocessing?
● Data generated from IoT devices may be
incomplete, noisy, and inconsistent. Such
data is called raw data.
● ML models applied to preprocessed data
are more accurate than the same models
applied to raw data.
● Hence, preprocessing is a mandatory step
in any AI/ML-based IoT application.
11. Steps for data preprocessing
Fig. Steps for data preprocessing [1]
12. Data Cleaning
● Incomplete data may come from
○ “Not applicable” data value when collected
○ Different considerations between the time when the data was collected and when it is analyzed.
○ Human/hardware/software problems
● Noisy data (incorrect values) may come from
○ Faulty data collection instruments
○ Human or computer error at data entry
○ Errors in data transmission
● Inconsistent data may come from
○ Different data sources
○ Functional dependency violation (e.g., modify some linked data)
● Duplicate records also need data cleaning
13. Data Cleaning
● Data cleaning is the process of filling in missing values, removing noise, and correcting inconsistencies
● Data is not always available
○ E.g., many tuples have no recorded value for several attributes, such as customer income in
sales data
● Missing data may be due to
○ equipment malfunction
○ inconsistent with other recorded data and thus deleted
○ data not entered due to misunderstanding
○ certain data may not be considered important at the time of entry
○ history or changes of the data were not registered
14. Data Cleaning: handling missing values
● Ignore the tuple: usually done when the class label is missing (assuming a classification
task); not effective when the percentage of missing values per attribute varies
considerably.
● Fill in the missing value manually: tedious and often infeasible
● Fill it in automatically with
○ a global constant : e.g., “unknown”, a new class?!
○ the attribute mean
○ the attribute mean for all samples belonging to the same class: smarter
○ the most probable value: inference-based such as Bayesian formula or decision tree
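The fill strategies above can be sketched with pandas; the DataFrame below is hypothetical (a "room" class and a temperature attribute with gaps), not from the slides.

```python
import pandas as pd
import numpy as np

# Hypothetical sensor readings; NaN marks missing values
df = pd.DataFrame({
    "room": ["lab", "lab", "office", "office"],
    "temp": [22.0, np.nan, 30.0, np.nan],
})

# Fill with a global constant
const_filled = df["temp"].fillna(-1)

# Fill with the attribute mean over all samples
mean_filled = df["temp"].fillna(df["temp"].mean())

# Smarter: fill with the mean of samples in the same class (room)
class_filled = df.groupby("room")["temp"].transform(
    lambda s: s.fillna(s.mean())
)
```

Here the global mean fills both gaps with 26.0, while the class-conditional fill uses 22.0 for "lab" and 30.0 for "office".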
15. Data Cleaning: noise removal
● Noise: random error or variance in a measured variable
● Incorrect attribute values may be due to
○ faulty data collection instruments
○ data entry problems
○ data transmission problems
○ technology limitation
○ inconsistency in naming convention
● Other data problems which require data cleaning
○ duplicate records
○ incomplete data
○ inconsistent data
16. Data Cleaning: noise removal
● Binning
○ first sort data and partition into (equal-frequency) bins
○ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
● Regression
○ smooth by fitting the data into regression functions
● Clustering
○ detect and remove outliers
● Combined computer and human inspection
○ detect suspicious values and check by human (e.g., deal with possible outliers)
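Binning with smoothing by bin means can be sketched in a few lines of NumPy; the price list is a small illustrative example of the kind used in [1].

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning: sort the data, split it into
    n_bins bins, and replace each value by its bin mean."""
    data = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(data, n_bins)
    return [float(b.mean()) for b in bins for _ in b]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
# bin [4, 8, 15] -> 9, [21, 21, 24] -> 22, [25, 28, 34] -> 29
```

Smoothing by bin medians or bin boundaries follows the same pattern, only the per-bin replacement rule changes.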
17. Data Cleaning: inconsistency correction
- Inconsistency may occur due to poor design, errors in data entry,
or data decay.
- E.g., differences in data representation, such as date format (dd/mm/yy)
vs. (mm/dd/yy)
- Inconsistency can be handled using
- Meta data
- Domain knowledge
- Scrubbing tools
- Audit tools
- Apply unique rules, null rules, and consecutive rules
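A minimal scrubbing rule for the date-format inconsistency above: try each known source format and rewrite to one canonical form. The format list here is an assumption; a real tool would take it from metadata or domain knowledge, and ambiguous formats (dd/mm vs. mm/dd with the same separator) still need a human decision.

```python
from datetime import datetime

def normalize_date(raw):
    """Rewrite a date string from any known source format
    to the canonical ISO form YYYY-MM-DD."""
    for fmt in ("%d/%m/%Y", "%m-%d-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"Unrecognized date format: {raw!r}")

iso = normalize_date("25/12/2023")   # dd/mm/yyyy -> "2023-12-25"
```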
18. Data Integration
● Data integration combines data from multiple sources into a coherent store
● Schema integration:
○ Entity identification problem: e.g., A.cust-id ≡ B.cust-#
○ Integrate metadata from different sources
● Redundancy can be detected using correlation analysis
● Detecting and resolving data value conflicts
○ For the same real world entity, attribute values from different sources are different
○ Possible reasons: different representations, different scales, e.g., metric vs. British units
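Both the entity-identification problem and correlation-based redundancy detection can be sketched with pandas; the two source tables below are hypothetical.

```python
import pandas as pd

# Two sources with mismatched schemas: A uses "cust_id", B uses "cust-#"
a = pd.DataFrame({"cust_id": [1, 2, 3], "temp_c": [20.0, 25.0, 30.0]})
b = pd.DataFrame({"cust-#": [1, 2, 3], "temp_f": [68.0, 77.0, 86.0]})

# Resolve the entity-identification problem by renaming, then merge
merged = a.merge(b.rename(columns={"cust-#": "cust_id"}), on="cust_id")

# Correlation analysis: temp_f is perfectly correlated with temp_c
# (it is the same measurement in a different scale), so it is redundant
r = merged["temp_c"].corr(merged["temp_f"])
```

A correlation coefficient near ±1 between two integrated attributes is the classic signal that one of them can be dropped.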
19. Data Transformation
● Smoothing: remove noise from data
● Aggregation: summarization, data cube construction
● Generalization: concept hierarchy climbing
● Normalization: scaled to fall within a small, specified range
○ min-max normalization
○ z-score normalization
○ normalization by decimal scaling
● Attribute/feature construction
○ New attributes constructed from the given ones
Fig. Aggregation
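The aggregation step can be sketched as a pandas group-by; the hourly readings and day labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly sensor readings, aggregated (summarized) to daily means
readings = pd.DataFrame({
    "day":  ["Mon", "Mon", "Tue", "Tue"],
    "temp": [20.0, 24.0, 28.0, 32.0],
})
daily = readings.groupby("day", sort=False)["temp"].mean()
# Mon -> 22.0, Tue -> 30.0
```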
20. Data Transformation : Normalization
● Why transformation?
● Sensor 1 produces output values in [0, 1]; Sensor 2 produces output in [1, 100]
E.g., when we compare two tuples t1(5, 50) and t2(10, 40) using the Euclidean distance measure, the
result is dominated by the larger values produced by Sensor 2, so the ML model will make biased
decisions.
● Solution: normalization, e.g., min-max normalization
E.g., let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to
(73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709
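Min-max and z-score normalization can both be written in a few lines of NumPy; the income figures reuse the example above, and the sensor values are hypothetical.

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale x to [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Z-score normalization: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

incomes = [12_000, 73_000, 98_000]
scaled = min_max(incomes)          # $73,000 -> ~0.709

sensor2 = z_score([50.0, 40.0, 60.0])  # hypothetical Sensor 2 readings
```

After normalization, both sensors contribute on the same scale, so Euclidean distances are no longer dominated by the sensor with the larger range.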
21. Data Reduction
● Why data reduction?
○ A database/data warehouse/big data may store terabytes of data
○ Complex data analysis/ML models may take a very long time to run on the complete data set
● Data reduction
○ Obtain a reduced representation of the data set that is much smaller in volume yet produces
the same (or almost the same) analytical results
● Data reduction strategies
○ Data cube aggregation:
○ Dimensionality reduction — e.g., remove unimportant attributes
○ Data Compression
○ Numerosity reduction — e.g., fit data into models
○ Discretization and concept hierarchy generation
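One standard dimensionality-reduction technique (not the only strategy listed above) is principal component analysis, sketched here in plain NumPy; the input matrix is random and purely illustrative.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center the attributes
    cov = np.cov(Xc, rowvar=False)          # attribute covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues ascending
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 tuples, 5 attributes
X_reduced = pca_reduce(X, 2)    # reduced to 2 attributes
```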
24. Tools for data preprocessing
ETL Tools
• PowerBI
• Informatica
• IBM Cognos
Data Analytical Tools
• Pentaho
• RapidMiner
• Knime
• Weka
• R programming
• Python Libraries:
– Numpy
– Pandas
– NLTK
25. Selection of ML Models
● Nature of the problem statement
○ E.g., supervised/unsupervised, or prediction/classification
● Availability of data
○ E.g., CNN models perform better if the training dataset is large enough
● Understanding of data
○ E.g., a temperature attribute with numeric values (temp = 30 °C) leads to a prediction problem,
while temperature with categorical values (temp = high) is a classification problem
● Availability of processing power
○ E.g., CNN architectures cannot be executed directly on a Raspberry Pi, but simple rule-based ML
can be
26. IoT- ML architecture
Fig. Conventional IoT-ML architecture: the IoT device takes inputs from sensors and performs basic processing; the ML models run elsewhere.
● Conventional IoT-ML architectures are cloud based
27. Tiny ML
● Latency and bandwidth bottlenecks are the
major drawbacks of cloud-based
processing
● Tiny ML is a new, emerging research
area
● Compressed and optimized ML
models are installed and executed directly on
IoT devices for better performance
and cost effectiveness
Fig. TinyML architecture: inputs from sensors → basic processing → Tiny ML models → output, all on the IoT device.
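TinyML toolchains typically shrink models through quantization before deploying them on-device. A minimal, library-free sketch of the core idea (symmetric int8 quantization of a weight vector; the weights are hypothetical):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric int8 quantization: store weights as int8 plus one
    float scale -- roughly 4x smaller than float32 storage."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)   # close to w, at a fraction of the storage
```

Production frameworks (e.g., TensorFlow Lite) combine quantization with pruning and operator fusion, but the space/accuracy trade-off is the same one sketched here.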
28. Summary
● A tremendous amount of data is generated by edge technologies such as IoT
devices, cameras, and phones.
● Understanding and preprocessing the data improves the accuracy of ML
models.
● Data cleaning, integration, transformation, and reduction are the four major steps
of data preprocessing.
● Availability of data, processing devices, the nature of the problem statement, and
data insights are the key parameters for selecting ML models.
● The TinyML framework in IoT aims to provide low latency, effective bandwidth
utilization, stronger data safety, enhanced privacy, and reduced cost [2].
29. References
1. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and
Techniques", 3rd Edition, Morgan Kaufmann, 2011.
2. Lachit Dutta, Swapna Bharali, "TinyML Meets IoT: A Comprehensive
Survey", Internet of Things, Volume 16, 2021, 100461.