ML Data Acquisition Techniques

ME-438
AI AND INTERNET OF THINGS
ELECTIVE COURSE
NED University of Engineering & Technology

THIS WEEK
• Data Acquisition in Machine Learning
• Data Acquisition Techniques and Tools
AI and Internet of Things
DR. HAIDER ALI 2

DATA ACQUISITION IN MACHINE LEARNING

DATA ACQUISITION
“Data acquisition is the process of sampling signals that
measure real-world physical conditions and converting the
resulting samples into digital numeric values that a computer
can manipulate.”
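
The definition above can be illustrated with a minimal sketch of sampling and quantization. The sensor function, the 0 V to 5 V range, the 8 Hz sample rate, and the 8-bit resolution are all assumptions chosen for illustration; they do not come from the lecture.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, n_bits, v_min, v_max):
    """Sample a continuous signal(t) and map each sample to an integer code."""
    levels = 2 ** n_bits
    step = (v_max - v_min) / (levels - 1)      # voltage covered by one code
    n_samples = int(duration_s * sample_rate_hz)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                 # sampling instant in seconds
        v = min(max(signal(t), v_min), v_max)  # clip to the converter's input range
        codes.append(round((v - v_min) / step))
    return codes

# Hypothetical sensor: a 1 Hz sine wave swinging between 0 V and 5 V
def sensor_voltage(t):
    return 2.5 + 2.5 * math.sin(2 * math.pi * 1.0 * t)

codes = sample_and_quantize(sensor_voltage, duration_s=1.0, sample_rate_hz=8,
                            n_bits=8, v_min=0.0, v_max=5.0)
print(codes)  # eight integer codes in the range 0..255
```

A higher sample rate captures faster changes in the signal, and more bits reduce the rounding error per sample.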

LIFE-CYCLE OF A MACHINE LEARNING PROJECT
A Machine Learning project goes through the following stages:
1. Defining the project objective: Identifying the business problem, converting it into a
statistical problem, and then into an optimization problem.
2. Data Acquisition or Collection: Acquiring and merging the data from all the appropriate
sources.
3. Data Exploration and Pre-processing: Cleaning and preprocessing the data to make it
consistent, then performing exploratory data analysis and statistical analysis to understand the
relationships between the variables.
4. Feature Engineering: Creating new features based on empirical relationships and selecting
significant variables using dimensionality reduction techniques.
5. Model Building: Selecting the appropriate ML algorithms and training the model on the
dataset to identify the patterns.
6. Execution & Model Validation: Running the model and validating it, including
fine-tuning its parameters.
7. Deployment: Delivering the business-usable results of the ML process. Models
are deployed to enterprise apps, systems, and data stores.
8. Interpretation, Data Visualization, and Documentation: Interpreting, visualizing, and
communicating the model insights, documenting the modeling process for reproducibility, and
creating the model monitoring and maintenance plan.

DATA ACQUISITION IN MACHINE LEARNING
• Collection and Integration of the data
• Formatting
• Labeling

COLLECTION AND INTEGRATION OF THE DATA
The data is extracted from various sources and is usually stored in
different places, so multiple datasets need to be combined before
they can be used. The acquired data is typically in raw format
and not suitable for immediate consumption and analysis.
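
As a rough illustration of collection and integration, the sketch below combines two sources on a shared key using only the Python standard library. The two sources, their field names (`machine_id`, `temperature_c`, `days_since_service`), and the values are hypothetical.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export of machine sensor readings
SENSOR_CSV = """machine_id,temperature_c
M1,71.5
M2,64.0
M3,80.2
"""

# Hypothetical source 2: a JSON maintenance log from a different system
MAINTENANCE_JSON = (
    '[{"machine_id": "M1", "days_since_service": 12},'
    ' {"machine_id": "M3", "days_since_service": 90}]'
)

def integrate(sensor_csv, maintenance_json):
    """Left-join the sensor rows with the maintenance records on machine_id."""
    maintenance = {rec["machine_id"]: rec for rec in json.loads(maintenance_json)}
    merged = []
    for row in csv.DictReader(io.StringIO(sensor_csv)):
        extra = maintenance.get(row["machine_id"], {})
        merged.append({
            "machine_id": row["machine_id"],
            # raw CSV values arrive as strings and must be converted
            "temperature_c": float(row["temperature_c"]),
            # None marks machines with no maintenance record
            "days_since_service": extra.get("days_since_service"),
        })
    return merged

dataset = integrate(SENSOR_CSV, MAINTENANCE_JSON)
for record in dataset:
    print(record)
```

The missing maintenance record for `M2` shows why integrated raw data usually still needs cleaning before analysis.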

FORMATTING
• Prepare or organize the datasets as per the analysis requirements.

LABELING
• After gathering the data, it must be labeled. For example, in a
factory application, one would label the images of the components
as defective or not defective.

THE DATA ACQUISITION PROCESS
The process of data acquisition involves searching for datasets that
can be used to train the Machine Learning models. This is not a simple
task. There are various approaches to acquiring data; here they are
grouped into three main categories:
• Data Discovery
• Data Augmentation
• Data Generation

DATA DISCOVERY
The first approach to acquiring data is data discovery. It is
the key step for indexing, sharing, and searching for new
datasets available on the web and in data lakes.
It can be broken into two steps: sharing and searching.
First, the data must be labeled or indexed and published
for sharing, using one of the many collaborative systems
available for this purpose. Once published, the datasets
can be searched.

DATA AUGMENTATION
The next approach to data acquisition is data
augmentation. To augment means to make something greater
by adding to it, so in the context of data acquisition we
are enriching the existing data by adding more
external data. In machine learning and deep learning, it is
common to use pre-trained models and embeddings to increase
the number of features to train on.
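
A minimal sketch of this idea, assuming a small lookup table stands in for pre-trained embeddings. The table `PRETRAINED_EMBEDDINGS`, its three-dimensional vectors, and the record fields are hypothetical; in practice the vectors would come from a model trained elsewhere.

```python
# Hypothetical pre-trained embeddings, keyed by component type
PRETRAINED_EMBEDDINGS = {
    "bearing": [0.9, 0.1, 0.3],
    "gear": [0.8, 0.2, 0.4],
    "valve": [0.1, 0.7, 0.5],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback vector for component types not in the table

def augment(records, embeddings):
    """Add embedding-based features to each existing record."""
    augmented = []
    for rec in records:
        vector = embeddings.get(rec["component"], UNKNOWN)
        # original numeric feature followed by the external embedding features
        features = [rec["operating_hours"]] + list(vector)
        augmented.append({**rec, "features": features})
    return augmented

existing = [
    {"component": "bearing", "operating_hours": 1200.0},
    {"component": "pump", "operating_hours": 300.0},
]
enriched = augment(existing, PRETRAINED_EMBEDDINGS)
print(enriched[0]["features"])  # one original feature plus three added ones
```

Each record starts with one feature and ends with four, without any new measurements being collected.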

DATA GENERATION
As the name suggests, the data is generated. If we do not have enough data and
no external data is available, the option is to generate the datasets
manually or automatically.
Instead of collecting and labeling large datasets, there are several techniques
for generating synthetic data that has similar properties to real data. Synthetic
data has major advantages, including reduced cost, higher accuracy in data
labeling (because the labels in synthetic data are already known), scalability (it
is easy to create vast amounts of simulated data), and variety. Synthetic data
can also be used to create data samples for edge cases that do not occur
frequently in the real world.
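
The sketch below generates a small labeled dataset from an assumed statistical model. The feature name `vibration`, the Gaussian means (0.2 and 0.8), and the spread (0.05) are invented for illustration and do not describe any real process.

```python
import random

def generate_synthetic(n_samples, defect_rate=0.5, seed=0):
    """Generate labeled samples; the label is known because we choose it first."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    data = []
    for _ in range(n_samples):
        defective = rng.random() < defect_rate
        # Assumption: defective parts vibrate more than healthy ones
        mean = 0.8 if defective else 0.2
        data.append({"vibration": rng.gauss(mean, 0.05), "defective": defective})
    return data

samples = generate_synthetic(1000)
n_defective = sum(s["defective"] for s in samples)
print(len(samples), "samples,", n_defective, "labeled defective")
```

Raising `defect_rate` is one way to produce more examples of a case that is rare in real data.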

DATA ACQUISITION TECHNIQUES AND TOOLS
The major tools and techniques for data acquisition are:
1. Data Warehouses and ETL
2. Data Lakes and ELT
3. Cloud Data Warehouse providers

DATA WAREHOUSES AND ETL
A data warehouse is a type of database that is used for storing and managing
large amounts of data. It is designed to facilitate the process of querying and
analyzing data, and is often used by organizations to support business
intelligence and decision-making activities. Data warehouses typically store data
from multiple sources, such as operational databases, transactional systems,
and external sources, and are designed to support the efficient execution of
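complex queries and analysis. This allows organizations to gain insights into
their data and make informed decisions based on that information. Data usually
enters a warehouse through Extract, Transform, Load (ETL): it is extracted from
the sources, transformed to fit the warehouse schema, and then loaded.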
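
A minimal Extract, Transform, Load (ETL) sketch, using an in-memory SQLite database as a stand-in for a warehouse. The `orders` table, its columns, and the sample rows are hypothetical; a production warehouse would use a dedicated system rather than SQLite.

```python
import sqlite3

# Extract: hypothetical rows pulled from an operational source
raw_orders = [
    {"order_id": 1, "amount": "19.99", "country": "pk"},
    {"order_id": 2, "amount": "5.00", "country": "PK "},
    {"order_id": 3, "amount": "42.50", "country": "us"},
]

def transform(rows):
    """Clean the rows BEFORE loading: numeric amounts, normalized country codes."""
    return [
        (r["order_id"], float(r["amount"]), r["country"].strip().upper())
        for r in rows
    ]

def load(conn, rows):
    """Load the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(raw_orders))

# Analytical query over the clean, structured data
totals = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Because the transformation happens before loading, every query sees data in one consistent schema.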

DATA LAKES AND ELT
A data lake is a storage repository that can hold large amounts of
structured, semi-structured, and unstructured data. It can store
images, videos, audio recordings, and PDF files, and it allows faster
ingestion of new data.
Unlike data warehouses, data lakes store everything, are more flexible, and
follow the Extract, Load, and Transform (ELT) approach. The data is loaded
first and transformed only when needed, so it is processed
later as per the requirements.
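
The ELT approach can be sketched as follows, with a plain Python list standing in for the lake's storage. The payloads and the field name `temp_c` are hypothetical.

```python
import json

data_lake = []  # stand-in for object storage; holds raw payloads untouched

def load_raw(payload):
    """Load step: store the payload exactly as received, with no schema check."""
    data_lake.append(payload)

def read_temperatures(lake):
    """Transform step, run only when an analysis needs temperature values."""
    temps = []
    for raw in lake:
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON (e.g. free text); skip for this analysis
        if isinstance(rec, dict) and "temp_c" in rec:
            temps.append(float(rec["temp_c"]))
    return temps

load_raw('{"sensor": "s1", "temp_c": "21.5"}')
load_raw("free-text operator note: pump was noisy")
load_raw('{"sensor": "s2", "humidity": 40}')
load_raw('{"sensor": "s3", "temp_c": 19}')

temps = read_temperatures(data_lake)
print(temps)
```

All four payloads are kept in the lake, while the transformation picks out only what one particular analysis requires.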

CLOUD DATA WAREHOUSE PROVIDERS
A cloud data warehouse is a hosted service that collects, organizes, and
stores data. Cloud data warehouses are quicker and cheaper to set up because
no physical hardware needs to be procured. Providers include:
• Amazon Redshift
• Snowflake
• Google BigQuery
• IBM Db2 Warehouse
• Microsoft Azure Synapse
• Oracle Autonomous Data Warehouse
• SAP Data Warehouse Cloud
• Yellowbrick Data
• Teradata Integrated Data Warehouse

THANK YOU