IETE: “AI & ML For IOT” Workshop
Analysis and Preprocessing of Data in IoT
Environment
Dr. Sankirti Sandeep Shiravale
Associate Professor, MMCOE, Pune
sankirtishiravale@mmcoe.edu.in
8/8/2023
Contents
● Understanding of data
● Data Pre-processing
● Selection of ML models
● IoT- ML architecture
● Tiny ML
● Summary
Understanding of data
● Data: The raw output generated by IoT devices, sensors, cameras, etc.
● Information: Data that has been processed to derive meaningful content.
● Knowledge: Information that is used in a decision-making process.
E.g. 1.
Data: Birthdate
Information: Derived age
Knowledge: Younger people tend to be more tech-savvy.
E.g. 2.
A temperature sensor (LM35) produces an analog output, which is converted into digital values (Celsius/Fahrenheit) and
used to classify the room temperature as normal, cool, or hot.
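The LM35 example can be sketched in a few lines of Python. This is only an illustration: the 10-bit ADC, the 5 V reference, and the cool/hot thresholds are assumptions that do not come from the slides. The 10 mV per °C scale factor is the LM35's nominal output characteristic.

```python
def lm35_celsius(adc_reading, vref=5.0, adc_max=1023):
    """Convert a raw ADC count (data) into degrees Celsius (information).

    Assumes a 10-bit ADC with a 5 V reference; both are illustrative defaults.
    """
    voltage = adc_reading * vref / adc_max  # raw count -> volts
    return voltage * 100.0                  # 10 mV per degree C => 100 degrees C per volt


def celsius_to_fahrenheit(temp_c):
    return temp_c * 9.0 / 5.0 + 32.0


def room_state(temp_c, cool_below=20.0, hot_above=30.0):
    """Turn the temperature into a decision-level label (knowledge).

    The thresholds are hypothetical and would be tuned per application.
    """
    if temp_c < cool_below:
        return "cool"
    if temp_c > hot_above:
        return "hot"
    return "normal"


reading = 51                      # example raw ADC count
temp_c = lm35_celsius(reading)    # about 24.9 degrees C under the assumptions above
print(round(temp_c, 1), round(celsius_to_fahrenheit(temp_c), 1), room_state(temp_c))
```

The three functions mirror the data, information, and knowledge levels described above.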
Data Types
● Nominal
● Binary
● Ordinal
● Numeric: quantitative
○ Interval-scaled
○ Ratio-scaled
● Discrete Vs Continuous
Data Types
■ Nominal: categories, states, or “names of things”
■ Hair_color = {auburn, black, blond, brown, grey, red, white}
■ marital status, occupation, ID numbers, zip codes
■ Binary
■ Nominal attribute with only 2 states (0 and 1)
■ Symmetric binary: both outcomes equally important
■ E.g. gender
■ Asymmetric binary: outcomes not equally important.
■ e.g. medical test (positive vs. negative)
■ Convention: assign 1 to most important outcome (e.g., HIV positive)
■ Ordinal
■ Values have a meaningful order (ranking) but magnitude between successive values is not
known.
■ Size = {small, medium, large}, grades, army rankings
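The attribute type usually decides how a value is encoded before it is passed to an ML model. A minimal sketch, assuming one-hot encoding for nominal attributes and rank encoding for ordinal attributes (common choices, not ones prescribed by the slides):

```python
HAIR_COLORS = ["auburn", "black", "blond", "brown", "grey", "red", "white"]
SIZE_ORDER = ["small", "medium", "large"]  # order matters for ordinal attributes


def encode_one_hot(value, categories):
    """Nominal: no order, so each category gets its own 0/1 indicator."""
    return [1 if value == category else 0 for category in categories]


def encode_ordinal(value, ordered_categories):
    """Ordinal: keep the ranking, but the gaps between ranks carry no meaning."""
    return ordered_categories.index(value)


print(encode_one_hot("black", HAIR_COLORS))
print(encode_ordinal("medium", SIZE_ORDER))
```

Encoding a nominal attribute as a single integer would impose an order that does not exist in the data, which is why the two cases are handled differently here.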
Data Types: Numerical
● Quantity (integer or real-valued)
● Interval
■ Measured on a scale of equal-sized units
■ Values have order
● E.g., temperature in °C or °F, calendar dates
■ No true zero-point
● Ratio
■ Inherent zero-point
■ We can speak of one value as being a multiple of another
(e.g., 10 K is twice as high as 5 K).
● e.g., temperature in Kelvin, length, counts, monetary quantities
Discrete Vs Continuous Attribute
● Discrete Attribute
○ Has only a finite or countably infinite set of values
E.g., zip codes, profession
○ Sometimes, represented as integer variables
○ Binary attributes are a special case of discrete attributes
● Continuous Attribute
○ Has real numbers as attribute values
E.g., temperature, height, or weight
○ Practically, real values can only be measured and represented using a finite
number of digits
○ Continuous attributes are typically represented as floating-point variables
“No quality data, no quality results”
Why data preprocessing?
● Data generated by IoT devices may be incomplete, noisy, and inconsistent. Such data is called raw data.
● ML models trained on preprocessed data are generally more accurate than models trained on raw data.
● Hence, preprocessing is an essential step in any AI/ML-based IoT application.
Steps for data preprocessing
Fig. Steps for data preprocessing [1]
Data Cleaning
● Incomplete data may come from
○ “Not applicable” data value when collected
○ Different considerations between the time when the data was collected and when it is analyzed.
○ Human/hardware/software problems
● Noisy data (incorrect values) may come from
○ Faulty data collection instruments
○ Human or computer error at data entry
○ Errors in data transmission
● Inconsistent data may come from
○ Different data sources
○ Functional dependency violation (e.g., modify some linked data)
● Duplicate records also need data cleaning
Data Cleaning
● Data cleaning is the process of filling in missing values, removing noise, and correcting inconsistencies
● Data is not always available
○ E.g., many tuples have no recorded value for several attributes, such as customer income in
sales data
● Missing data may be due to
○ equipment malfunction
○ values that were inconsistent with other recorded data and were therefore deleted
○ data not entered due to misunderstanding
○ data not considered important at the time of entry
○ history or changes of the data not being recorded
Data Cleaning: handling missing values
● Ignore the tuple: usually done when the class label is missing (assuming a classification task). Not effective when the
percentage of missing values per attribute varies considerably.
● Fill in the missing value manually: tedious and often infeasible
● Fill it in automatically with
○ a global constant : e.g., “unknown”, a new class?!
○ the attribute mean
○ the attribute mean for all samples belonging to the same class: smarter
○ the most probable value: inference-based such as Bayesian formula or decision tree
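Two of the automatic strategies listed above (attribute mean and class-wise attribute mean) can be sketched as follows. The record layout, the `room` and `temp` field names, and the sample values are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(rows, attr):
    """Replace missing values (None) with the mean of the observed values."""
    attr_mean = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values with the mean of the samples in the same class.

    Raises KeyError if a class has no observed value at all for `attr`.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    class_means = {cls: mean(values) for cls, values in observed.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


rows = [
    {"room": "server", "temp": 30.0},
    {"room": "server", "temp": 32.0},
    {"room": "server", "temp": None},   # missing reading
    {"room": "office", "temp": 22.0},
    {"room": "office", "temp": 24.0},
    {"room": "office", "temp": None},   # missing reading
]

print(fill_with_mean(rows, "temp")[2]["temp"])                # global mean
print(fill_with_class_mean(rows, "temp", "room")[2]["temp"])  # server-room mean
```

With this sample, the global mean places the missing server-room reading between the two rooms, while the class-wise mean keeps it close to the other server-room readings.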
Data Cleaning: noise removal
● Noise: random error or variance in a measured variable
● Incorrect attribute values may be due to
○ faulty data collection instruments
○ data entry problems
○ data transmission problems
○ technology limitation
○ inconsistency in naming convention
● Other data problems that require data cleaning
○ duplicate records
○ incomplete data
○ inconsistent data
Data Cleaning: noise removal
● Binning
○ first sort data and partition into (equal-frequency) bins
○ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
● Regression
○ smooth by fitting the data into regression functions
● Clustering
○ detect and remove outliers
● Combined computer and human inspection
○ detect suspicious values and check by human (e.g., deal with possible outliers)
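The binning approach listed above can be sketched as follows: sort the data, split it into equal-frequency bins, then smooth each bin by its mean or by its nearest boundary. The sample values and the bin size of 4 are chosen only for illustration.

```python
from statistics import mean


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_bin_means(values, bin_size):
    """Replace every value by the mean of its bin."""
    return [
        [mean(bin_values)] * len(bin_values)
        for bin_values in equal_frequency_bins(values, bin_size)
    ]


def smooth_by_bin_boundaries(values, bin_size):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for bin_values in equal_frequency_bins(values, bin_size):
        low, high = bin_values[0], bin_values[-1]
        smoothed.append([low if v - low <= high - v else high for v in bin_values])
    return smoothed


readings = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(readings, 4))
print(smooth_by_bin_boundaries(readings, 4))
```

Ties between the two boundaries are resolved toward the lower one here; that is a design choice of this sketch, not a rule of the method.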
Data Cleaning: inconsistency correction
- Inconsistency may occur due to poor design, data-entry errors,
or data decay.
- E.g., differences in data representation, such as date format (dd/mm/yy vs.
mm/dd/yy)
- Inconsistency can be handled using
- Meta data
- Domain knowledge
- Scrubbing tools
- Audit tools
- Apply Unique rules, null rules, consecutive rules
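As one illustration of using metadata to resolve the date-format inconsistency mentioned above, the sketch below converts dates from each source to ISO 8601. The source names and the per-source format table are hypothetical metadata invented for this example.

```python
from datetime import datetime

# Hypothetical metadata: the date format each source is known to use.
SOURCE_DATE_FORMATS = {
    "sensor_a": "%d/%m/%y",  # dd/mm/yy
    "sensor_b": "%m/%d/%y",  # mm/dd/yy
}


def to_iso_date(raw_date, source):
    """Parse a date using the format recorded for its source, return YYYY-MM-DD."""
    date_format = SOURCE_DATE_FORMATS[source]
    return datetime.strptime(raw_date, date_format).date().isoformat()


# The same string means two different days depending on the source.
print(to_iso_date("08/07/23", "sensor_a"))
print(to_iso_date("08/07/23", "sensor_b"))
```

Without the metadata, a value such as 08/07/23 cannot be interpreted reliably, which is why the format is looked up rather than guessed.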
Data Integration
● Data integration combines data from multiple sources into a coherent store
● Schema integration:
○ Entity identification problem: e.g., A.cust-id ≡ B.cust-#
○ Integrate metadata from different sources
● Redundancy can be detected using correlation analysis
● Detecting and resolving data value conflicts
○ For the same real world entity, attribute values from different sources are different
○ Possible reasons: different representations, different scales, e.g., metric vs. British units
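Redundancy detection by correlation analysis can be sketched with the Pearson correlation coefficient. The example assumes two sources that report the same temperature in different units, plus an unrelated humidity attribute; all values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


temp_c = [20.0, 22.5, 25.0, 27.5, 30.0]       # source A, degrees Celsius
temp_f = [c * 9 / 5 + 32 for c in temp_c]     # source B, same readings in Fahrenheit
humidity = [40.0, 55.0, 38.0, 60.0, 45.0]     # unrelated attribute

print(round(pearson(temp_c, temp_f), 3))      # close to 1: one attribute is redundant
print(round(pearson(temp_c, humidity), 3))    # weak correlation: keep both
```

A coefficient close to +1 or -1 suggests that one attribute can be derived from the other, so only one of them needs to be kept after integration.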
Data Transformation
● Smoothing: remove noise from data
● Aggregation: summarization, data cube construction
● Generalization: concept hierarchy climbing
● Normalization: scaled to fall within a small, specified range
○ min-max normalization
○ z-score normalization
○ normalization by decimal scaling
● Attribute/feature construction
○ New attributes constructed from the given ones
Fig. Aggregation
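The three normalization methods named in the list above can be sketched as follows. The function names are illustrative, and the z-score version assumes the population standard deviation.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so that min(values) -> new_min and max(values) -> new_max.

    Assumes the values are not all equal (otherwise the range is zero).
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer that makes every |v| < 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

Min-max normalization is sensitive to outliers because a single extreme value sets the range; z-score normalization is often preferred when the true minimum and maximum are unknown.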
Data Transformation : Normalization
● Why transformation?
● Sensor 1 produces output values in [0, 10]; Sensor 2 produces output values in [1, 100]
E.g. When we compare two tuples t1(5, 50) and t2(10, 40) using the Euclidean distance measure, the
result is dominated by the larger-valued attribute produced by Sensor 2, so the ML model may make a biased
decision.
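● Solution is normalization, e.g. min-max normalization: v' = (v − min) / (max − min) × (new_max − new_min) + new_min
E.g. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to
(73,000 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.709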
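The effect of scale on the Euclidean distance can be checked numerically. In this sketch the sensor ranges are assumptions chosen so that both example tuples fall inside them (Sensor 1 in [0, 10], Sensor 2 in [1, 100]); the income figures are the ones from the example above.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def min_max(value, low, high):
    """Map a value from [low, high] onto [0.0, 1.0]."""
    return (value - low) / (high - low)


# Assumed sensor ranges, for illustration only.
RANGES = [(0.0, 10.0), (1.0, 100.0)]
t1, t2 = (5.0, 50.0), (10.0, 40.0)

raw_distance = dist(t1, t2)  # sqrt(5**2 + 10**2): Sensor 2 contributes most of it

n1 = [min_max(v, low, high) for v, (low, high) in zip(t1, RANGES)]
n2 = [min_max(v, low, high) for v, (low, high) in zip(t2, RANGES)]
scaled_distance = dist(n1, n2)  # each sensor now counts relative to its own range

income = min_max(73_000, 12_000, 98_000)

print(round(raw_distance, 3), round(scaled_distance, 3), round(income, 3))
```

Before normalization, Sensor 2 accounts for 100 of the 125 units of squared distance. After normalization, the difference on Sensor 1 (half of its range) outweighs the difference on Sensor 2 (about a tenth of its range).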
Data Reduction
● Why data reduction?
○ A database/data warehouse/big data may store terabytes of data
○ Complex data analysis/ML models may take a very long time to run on the complete data set
● Data reduction
○ Obtain a reduced representation of the data set that is much smaller in volume yet produces
the same (or almost the same) analytical results
● Data reduction strategies
○ Data cube aggregation
○ Dimensionality reduction — e.g., remove unimportant attributes
○ Data Compression
○ Numerosity reduction — e.g., fit data into models
○ Discretization and concept hierarchy generation
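Two of the strategies above, aggregation and discretization, can be sketched for a stream of temperature readings. The window size, cut points, and labels are hypothetical values chosen for illustration.

```python
from bisect import bisect_right
from statistics import mean


def aggregate(readings, window):
    """Replace each block of `window` consecutive readings by its mean."""
    return [mean(readings[i:i + window]) for i in range(0, len(readings), window)]


def discretize(value, cut_points, labels):
    """Map a numeric value to a label; needs len(labels) == len(cut_points) + 1."""
    return labels[bisect_right(cut_points, value)]


readings = [21.0, 21.4, 21.2, 29.8, 30.4, 31.0]    # six raw readings
summary = aggregate(readings, window=3)             # two values instead of six
levels = [discretize(v, [20.0, 30.0], ["cool", "normal", "hot"]) for v in summary]

print(summary, levels)
```

Aggregation reduces the number of records, while discretization replaces a continuous attribute with a small set of higher-level concepts, as in a concept hierarchy.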
Data Reduction: concept hierarchy generation
“Good data preparation is key to producing valid
and reliable models”
Tools for data preprocessing
ETL Tools
• Power BI
• Informatica
• IBM Cognos
Data Analytical Tools
• Pentaho
• RapidMiner
• KNIME
• Weka
• R programming
• Python libraries:
– NumPy
– pandas
– NLTK
Selection of ML Models
● Nature of problem statement
○ E.g. supervised vs. unsupervised learning, or prediction vs. classification
● Availability of data
○ E.g. CNN models generally perform better when the training dataset is large enough
● Understanding of data
○ E.g. a temperature attribute with numeric values (temp = 30 °C) leads to a numeric prediction (regression) problem,
whereas a temperature attribute with categorical values (temp = high) leads to a classification problem.
● Availability of processing power
○ E.g. CNN architectures are often too heavy to run directly on a Raspberry Pi, whereas simple rule-based models can
run on it
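The "understanding of data" criterion can be illustrated with a small heuristic that looks at the type of the target attribute. This is a rough sketch rather than a general rule: numeric class codes (for example 0/1 labels) would need to be declared as categorical explicitly.

```python
from numbers import Number


def infer_task(target_values):
    """Guess the problem type from the target attribute's values."""
    def is_numeric(v):
        return isinstance(v, Number) and not isinstance(v, bool)

    if all(is_numeric(v) for v in target_values):
        return "regression"      # numeric target, e.g. temp = 30
    return "classification"      # categorical target, e.g. temp = "high"


print(infer_task([30.0, 28.5, 31.2]))
print(infer_task(["high", "normal", "low"]))
```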
IoT- ML architecture
● Conventional IoT-ML architectures are cloud based
Fig. Conventional IoT-ML architecture (blocks: inputs from sensors, IoT device with basic processing, ML models)
Tiny ML
● Latency and bottlenecks are the major drawbacks of cloud-based processing
● Tiny ML is an emerging research area
● Compressed and optimized ML models are installed and executed on IoT devices for better performance
and cost effectiveness
Fig. Tiny ML architecture (blocks: inputs from sensors, IoT device with basic processing and Tiny ML models, output)
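One common way to compress a model for small devices is to store weights as 8-bit integers instead of 32-bit floats. The slides do not name a specific compression technique, so the symmetric int8 quantization below is an assumption chosen for illustration, and the weights are made up.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are 0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]          # made-up layer weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)                            # small integers, one byte each
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case rounding error
```

Each weight then needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight.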
Summary
● A tremendous amount of data is generated by edge technologies such as IoT devices,
cameras, and phones.
● Understanding and preprocessing the data improves the accuracy of ML
models.
● Data cleaning, integration, transformation, and reduction are the four major steps
of data preprocessing.
● Availability of data, processing devices, the nature of the problem statement, and
data insights are the key parameters for selecting ML models.
● The TinyML framework in IoT aims to provide low latency and efficient bandwidth
utilization, strengthen data safety, enhance privacy, and reduce cost [2].
References
1. Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition.
2. Lachit Dutta and Swapna Bharali, “TinyML Meets IoT: A Comprehensive
Survey”, Internet of Things, Volume 16, 2021, 100461.
Thank You
30

More Related Content

What's hot

Cryptography and Network Security
Cryptography and Network SecurityCryptography and Network Security
Cryptography and Network SecurityRamki M
 
Deployment Models in Cloud Computing
Deployment Models in Cloud ComputingDeployment Models in Cloud Computing
Deployment Models in Cloud ComputingAnirban Pati
 
ML_ Unit_1_PART_A
ML_ Unit_1_PART_AML_ Unit_1_PART_A
ML_ Unit_1_PART_ASrimatre K
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —Salah Amean
 
Substitution techniques
Substitution techniquesSubstitution techniques
Substitution techniquesvinitha96
 
Introduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data MiningIntroduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data MiningAarshDhokai
 
Intro/Overview on Machine Learning Presentation
Intro/Overview on Machine Learning PresentationIntro/Overview on Machine Learning Presentation
Intro/Overview on Machine Learning PresentationAnkit Gupta
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering methodrajshreemuthiah
 
Encapsulation of operations, methods & persistence
Encapsulation of operations, methods & persistenceEncapsulation of operations, methods & persistence
Encapsulation of operations, methods & persistencePrem Lamsal
 
Trends in DM.pptx
Trends in DM.pptxTrends in DM.pptx
Trends in DM.pptxImXaib
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
Machine learning
Machine learningMachine learning
Machine learningeonx_32
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 

What's hot (20)

Cryptography and Network Security
Cryptography and Network SecurityCryptography and Network Security
Cryptography and Network Security
 
Deployment Models in Cloud Computing
Deployment Models in Cloud ComputingDeployment Models in Cloud Computing
Deployment Models in Cloud Computing
 
ML_ Unit_1_PART_A
ML_ Unit_1_PART_AML_ Unit_1_PART_A
ML_ Unit_1_PART_A
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Substitution techniques
Substitution techniquesSubstitution techniques
Substitution techniques
 
Introduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data MiningIntroduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data Mining
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Intro/Overview on Machine Learning Presentation
Intro/Overview on Machine Learning PresentationIntro/Overview on Machine Learning Presentation
Intro/Overview on Machine Learning Presentation
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering method
 
Encapsulation of operations, methods & persistence
Encapsulation of operations, methods & persistenceEncapsulation of operations, methods & persistence
Encapsulation of operations, methods & persistence
 
Trends in DM.pptx
Trends in DM.pptxTrends in DM.pptx
Trends in DM.pptx
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Machine learning
Machine learningMachine learning
Machine learning
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 

Similar to Data preprocessing.pdf

KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentationgustavosouto
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
01. introduction to data structures
01. introduction to data structures01. introduction to data structures
01. introduction to data structuresረዳኢ በሪሁ
 
EDI Training Module 5: Creating Clean Data foro Publishing
EDI Training Module 5:  Creating Clean Data foro PublishingEDI Training Module 5:  Creating Clean Data foro Publishing
EDI Training Module 5: Creating Clean Data foro PublishingEnvironmental Data Initiative
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process Omid Vahdaty
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...LibbySchulze
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLEDB
 
Chapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxChapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxManishaPatil932723
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handoutYi-Shin Chen
 

Similar to Data preprocessing.pdf (20)

KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentation
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Data science guide
Data science guideData science guide
Data science guide
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
data-structures_unit-01.pdf
data-structures_unit-01.pdfdata-structures_unit-01.pdf
data-structures_unit-01.pdf
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
algo 1.ppt
algo 1.pptalgo 1.ppt
algo 1.ppt
 
Druid
DruidDruid
Druid
 
01. introduction to data structures
01. introduction to data structures01. introduction to data structures
01. introduction to data structures
 
EDI Training Module 5: Creating Clean Data foro Publishing
EDI Training Module 5:  Creating Clean Data foro PublishingEDI Training Module 5:  Creating Clean Data foro Publishing
EDI Training Module 5: Creating Clean Data foro Publishing
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
 
[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊
 
Chapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptxChapter 3 Data Preprocessing techniques.pptx
Chapter 3 Data Preprocessing techniques.pptx
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handout
 

Recently uploaded

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 

Recently uploaded (20)

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 

Data preprocessing.pdf

  • 1. IETE: “AI & ML For IOT” Workshop Analysis and Preprocessing of Data in IoT Environment Dr. Sankirti Sandeep Shiravale Associate Professor, MMCOE, Pune sankirtishiravale@mmcoe.edu.in 8/8/2023
  • 2. Contents ● Understanding of data ● Data Pre-processing ● Selection of ML models ● IoT- ML architecture ● Tiny ML ● Summary 2
  • 3. Understanding of data ● Data: Raw output generated by IoT devices, sendores, cameras etc is referred as data. ● Information: Data is processed to derive some meaning full contents is called as Information. ● Knowledge: If Information is used in decision making process then called as knowledge. E.g. 1. Data: Birthdate Information: Derive age Knowledge: Younger age people are more tech savvy. E.g.2. Temperature (LM35) sensor produces analog data , converted into digital values (celsius / fahrenheit) and used to predict room temperature i.e. normal/cool/hot 3
  • 4. Data Types ● Nominal ● Binary ● Numeric: quantitative ○ Interval-scaled ○ Ratio-scaled ● Discrete Vs Continuous 4
  • 5. Data Types ■ Nominal: categories, states, or “names of things” ■ Hair_color = {auburn, black, blond, brown, grey, red, white} ■ marital status, occupation, ID numbers, zip codes ■ Binary ■ Nominal attribute with only 2 states (0 and 1) ■ Symmetric binary: both outcomes equally important ■ E.g. gender ■ Asymmetric binary: outcomes not equally important. ■ e.g. medical test (positive vs. negative) ■ Convention: assign 1 to most important outcome (e.g., HIV positive) ■ Ordinal ■ Values have a meaningful order (ranking) but magnitude between successive values is not known. ■ Size = {small, medium, large}, grades, army rankings 5
  • 6. Data Types: Numerical ● Quantity (integer or real-valued) ● Interval ■ Measured on a scale of equal-sized units ■ Values have order ● E.g., temperature in C˚or F˚, calendar dates ■ No true zero-point ● Ratio ■ Inherent zero-point ■ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚). ● e.g., temperature in Kelvin, length, counts, monetary quantities 6
  • 7. Discrete Vs Continuous Attribute ● Discrete Attribute ○ Has only a finite or countably infinite set of values, e.g., zip codes, profession ○ Sometimes represented as integer variables ○ Binary attributes are a special case of discrete attributes ● Continuous Attribute ○ Has real numbers as attribute values, e.g., temperature, height, or weight ○ Practically, real values can only be measured and represented using a finite number of digits ○ Continuous attributes are typically represented as floating-point variables
  • 8. “No quality data, no quality results”
  • 9. Why data preprocessing? ● Data generated by IoT devices may be incomplete, noisy, and inconsistent. Such data is called raw data. ● ML models applied to preprocessed data are more accurate than the same models applied to raw data. ● Hence, to improve the accuracy of ML models, preprocessing is a mandatory step in any AI-ML based IoT application.
  • 10. Steps for data preprocessing
  • 11. Steps for data preprocessing Fig. Steps for data preprocessing [1]
  • 12. Data Cleaning ● Incomplete data may come from ○ “Not applicable” data value when collected ○ Different considerations between the time when the data was collected and when it is analyzed ○ Human/hardware/software problems ● Noisy data (incorrect values) may come from ○ Faulty data collection instruments ○ Human or computer error at data entry ○ Errors in data transmission ● Inconsistent data may come from ○ Different data sources ○ Functional dependency violation (e.g., modify some linked data) ● Duplicate records also need data cleaning
  • 13. Data Cleaning ● Data cleaning is the process of filling in missing values, removing noise, and correcting inconsistencies ● Data is not always available ○ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data ● Missing data may be due to ○ equipment malfunction ○ data that was inconsistent with other recorded data and thus deleted ○ data not entered due to misunderstanding ○ certain data not being considered important at the time of entry ○ history or changes of the data not being registered
  • 14. Data Cleaning: handling missing values ● Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably. ● Fill in the missing value manually: tedious and often infeasible ● Fill it in automatically with ○ a global constant, e.g., “unknown” (which effectively creates a new class) ○ the attribute mean ○ the attribute mean for all samples belonging to the same class: smarter ○ the most probable value: inference-based, such as a Bayesian formula or a decision tree
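The automatic filling strategies listed above can be illustrated with a few lines of plain Python. This is a minimal sketch under assumptions: the records, the attribute names `temp` and `label`, and the helper names are hypothetical, and `None` stands in for a missing sensor reading.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labelled sensor records; None marks a missing reading.
rows = [
    {"temp": 24.0, "label": "normal"},
    {"temp": None, "label": "normal"},
    {"temp": 26.0, "label": "normal"},
    {"temp": 35.0, "label": "hot"},
    {"temp": None, "label": "hot"},
    {"temp": 37.0, "label": "hot"},
]


def fill_with_constant(rows, attr, constant):
    """Replace missing values with a global constant such as 'unknown'."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    """Replace missing values with the mean of the observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean of the same class.

    Raises KeyError if a class has no observed value at all.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[class_attr]].append(r[attr])
    class_means = {c: mean(values) for c, values in observed.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


print(fill_with_mean(rows, "temp")[1]["temp"])                 # global mean
print(fill_with_class_mean(rows, "temp", "label")[1]["temp"])  # class mean
```

On this sample the class-wise mean keeps the filled value inside the range of its own class, which is why the slide calls it the smarter option.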
  • 15. Data Cleaning: noise removal ● Noise: random error or variance in a measured variable ● Incorrect attribute values may be due to ○ faulty data collection instruments ○ data entry problems ○ data transmission problems ○ technology limitations ○ inconsistency in naming conventions ● Other data problems that require data cleaning ○ duplicate records ○ incomplete data ○ inconsistent data
  • 16. Data Cleaning: noise removal ● Binning ○ first sort the data and partition it into (equal-frequency) bins ○ then smooth by bin means, bin medians, bin boundaries, etc. ● Regression ○ smooth by fitting the data to regression functions ● Clustering ○ detect and remove outliers ● Combined computer and human inspection ○ detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
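Smoothing by binning, as described above, can be sketched as follows. The sample values and function names are illustrative assumptions, and the partitioning helper assumes the number of values divides evenly into the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into bins holding the same number of values.

    Assumes len(values) is divisible by n_bins.
    """
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # illustrative readings
bins = equal_frequency_bins(data, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```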
  • 17. Data Cleaning: inconsistency correction - Inconsistency may occur due to poor design, errors in data entry, or data decay. - E.g., differences in data representation, such as date format (dd/mm/yy) vs. (mm/dd/yy) - Inconsistency can be handled using - Metadata - Domain knowledge - Scrubbing tools - Audit tools - Unique rules, null rules, and consecutive rules
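The date-format inconsistency above is a case where metadata resolves the ambiguity: a string like 08/09/23 cannot be interpreted without knowing which convention its source uses. The sketch below assumes a hypothetical metadata table `SOURCE_FORMATS` with two made-up source names, and converts every date to ISO 8601.

```python
from datetime import datetime

# Hypothetical metadata: the date convention used by each data source.
SOURCE_FORMATS = {
    "sensor_a": "%d/%m/%y",  # dd/mm/yy
    "sensor_b": "%m/%d/%y",  # mm/dd/yy
}


def to_iso_date(raw: str, source: str) -> str:
    """Parse a date with its source's format and return YYYY-MM-DD."""
    parsed = datetime.strptime(raw, SOURCE_FORMATS[source])
    return parsed.date().isoformat()


# The same raw string means two different days depending on the source.
print(to_iso_date("08/09/23", "sensor_a"))
print(to_iso_date("08/09/23", "sensor_b"))
```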
  • 18. Data Integration ● Data integration combines data from multiple sources into a coherent store ● Schema integration: ○ Entity identification problem: e.g., A.cust-id ≡ B.cust-# ○ Integrate metadata from different sources ● Redundancy can be detected using correlation analysis ● Detecting and resolving data value conflicts ○ For the same real-world entity, attribute values from different sources are different ○ Possible reasons: different representations, different scales, e.g., metric vs. British units
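Redundancy detection by correlation analysis can be sketched with the Pearson correlation coefficient. In this hypothetical example two integrated sources report the same temperature in different units, so one attribute is redundant. The sample values and the 0.95 threshold are assumptions for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


temp_c = [20.0, 22.5, 25.0, 27.5, 30.0]        # source A, Celsius
temp_f = [c * 9 / 5 + 32 for c in temp_c]      # source B, same signal in Fahrenheit
humidity = [40.0, 55.0, 38.0, 60.0, 45.0]      # unrelated attribute

REDUNDANCY_THRESHOLD = 0.95  # hypothetical cut-off

for name, column in [("temp_f", temp_f), ("humidity", humidity)]:
    r = pearson(temp_c, column)
    verdict = "redundant" if abs(r) >= REDUNDANCY_THRESHOLD else "keep"
    print(name, round(r, 3), verdict)
```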
  • 19. Data Transformation ● Smoothing: remove noise from data ● Aggregation: summarization, data cube construction ● Generalization: concept hierarchy climbing ● Normalization: scaled to fall within a small, specified range ○ min-max normalization ○ z-score normalization ○ normalization by decimal scaling ● Attribute/feature construction ○ New attributes constructed from the given ones Fig. Aggregation
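Two of the normalization methods named above, z-score normalization and normalization by decimal scaling, can be sketched as follows. The function names and sample values are assumptions. The z-score helper uses the population standard deviation, which is one of two common choices.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


print(z_score([10, 20, 30]))
print(decimal_scaling([-986, 917]))
```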
  • 20. Data Transformation: Normalization ● Why transformation? ● Sensor 1 produces output values in [0, 1]; sensor 2 produces output values in [1, 100]. E.g., when two tuples t1(5, 50) and t2(10, 40) are compared using the Euclidean distance, the result is dominated by the larger-valued attribute produced by sensor 2, so the ML model will produce a biased decision. ● The solution is normalization, e.g., min-max normalization. E.g., let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
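Min-max normalization and its effect on a Euclidean distance can be sketched as below. The income figures come from the slide, while the two sensor readings `a` and `b` are hypothetical values chosen to lie inside the stated ranges [0, 1] and [1, 100].

```python
from math import dist


def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Income example: $73,000 in the range $12,000 to $98,000.
print(round(min_max(73_000, 12_000, 98_000), 3))  # 0.709

# Hypothetical readings (sensor 1 in [0, 1], sensor 2 in [1, 100]).
a = (0.2, 50.0)
b = (0.9, 40.0)

raw_distance = dist(a, b)  # dominated by the sensor 2 difference of 10

a_norm = (a[0], min_max(a[1], 1, 100))
b_norm = (b[0], min_max(b[1], 1, 100))
norm_distance = dist(a_norm, b_norm)  # both sensors now contribute on the same scale

print(round(raw_distance, 3), round(norm_distance, 3))
```

Before normalization the large gap on sensor 1 (0.2 vs. 0.9) is almost invisible in the distance; after normalization it becomes the main contribution.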
  • 21. Data Reduction ● Why data reduction? ○ A database/data warehouse/big data store may hold terabytes of data ○ Complex data analysis/ML models may take a very long time to run on the complete data set ● Data reduction ○ Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results ● Data reduction strategies ○ Data cube aggregation ○ Dimensionality reduction, e.g., remove unimportant attributes ○ Data compression ○ Numerosity reduction, e.g., fit data into models ○ Discretization and concept hierarchy generation
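Two of the reduction strategies above can be illustrated on a small time series: aggregation (replace each window of readings with its mean) and numerosity reduction (fit a model and keep only its parameters). This is a sketch under assumptions: the readings are synthetic, the helper names are hypothetical, and a straight line is only a suitable model when the data is close to linear.

```python
def window_means(values, size):
    """Aggregation: replace each full window of `size` readings with its mean."""
    return [
        sum(values[i:i + size]) / size
        for i in range(0, len(values) - size + 1, size)
    ]


def fit_line(xs, ys):
    """Numerosity reduction: least-squares line, returned as (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


minutes = list(range(10))
readings = [20.0 + 0.5 * t for t in minutes]  # synthetic, perfectly linear

print(window_means(readings, 5))              # 10 readings reduced to 2
slope, intercept = fit_line(minutes, readings)
print(slope, intercept)                       # 10 readings reduced to 2 parameters
```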
  • 22. Data Reduction: concept hierarchy generation
  • 23. “Good data preparation is key to producing valid and reliable models”
  • 24. Tools for data preprocessing ● ETL Tools: Power BI, Informatica, IBM Cognos ● Data Analytical Tools: Pentaho, RapidMiner, KNIME, Weka, R programming, Python libraries (NumPy, Pandas, NLTK)
  • 25. Selection of ML Models ● Nature of problem statement ○ E.g., supervised/unsupervised or prediction/classification ● Availability of data ○ E.g., CNN models should perform better if the training dataset is large enough ● Understanding of data ○ E.g., a temperature attribute with numeric values (temp = 30 °C) makes it a prediction problem, while one with categorical values (temp = high) makes it a classification problem. ● Availability of processing power ○ E.g., CNN architectures will not run directly on a Raspberry Pi, but simple rule-based ML can
  • 26. IoT-ML architecture ● Conventional IoT-ML architectures are cloud based [Figure labels: IoT device, Inputs from sensors, Basic Processing, ML Models]
  • 27. Tiny ML ● Latency and bottlenecks are the major drawbacks of cloud-based processing ● Tiny ML is an emerging research area ● Compressed and optimized ML models are installed and executed on IoT devices for better performance and cost effectiveness [Figure labels: Inputs from sensors, Basic Processing, Tiny ML Models, IoT device output]
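The slide does not say how models are compressed. One widely used technique is weight quantization, sketched below as symmetric 8-bit linear quantization: each floating-point weight is stored as a small integer plus one shared scale factor. The weights and function names are hypothetical, and real TinyML toolchains apply this (and more) automatically.

```python
def quantize(weights, n_bits=8):
    """Map float weights to signed integers sharing a single scale factor."""
    q_max = 2 ** (n_bits - 1) - 1                          # 127 for 8 bits
    scale = max(abs(w) for w in weights) / q_max or 1.0    # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in q_weights]


weights = [0.42, -1.27, 0.05, 0.90]  # hypothetical model weights
q_weights, scale = quantize(weights)
restored = dequantize(q_weights, scale)

print(q_weights)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case error
```

Each weight now needs 1 byte instead of the 4 bytes of a 32-bit float, at the cost of a rounding error of at most half the scale factor per weight.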
  • 28. Summary ● A tremendous amount of data is generated by edge technologies such as IoT devices, cameras, and phones. ● Understanding and preprocessing data improves the accuracy of ML models. ● Data cleaning, integration, transformation, and reduction are the four major steps of data preprocessing. ● Availability of data, processing devices, the nature of the problem statement, and data insights are the key parameters for selecting ML models. ● The TinyML framework in IoT aims to provide low latency, effective bandwidth utilization, stronger data safety, enhanced privacy, and reduced cost [2].
  • 29. References 1. Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, 3rd Edition. 2. Dr. Lachit Dutta, Swapna Bharali, “TinyML Meets IoT: A Comprehensive Survey”, Internet of Things, Volume 16, 2021, 100461.