This document discusses data acquisition techniques for machine learning. It describes data acquisition as the process of sampling real-world signals and converting them to digital values. The document then outlines the typical lifecycle of a machine learning project, which includes steps like data collection, preprocessing, model building, and deployment. Further, it discusses approaches to data acquisition like data discovery, augmentation, and generation. Finally, the document lists some common tools and techniques for data acquisition, such as data warehouses, data lakes, cloud data warehouses, and ETL/ELT processes.
1. ME-438
AI AND INTERNET OF THINGS
ELECTIVE COURSE
NED University of Engineering & Technology
2. THIS WEEK
Data Acquisition in Machine Learning
Data Acquisition Techniques and Tools
AI and Internet of Things
DR. HAIDER ALI 2
3. DATA ACQUISITION IN MACHINE LEARNING
4. DATA ACQUISITION
“Data acquisition is the process of sampling signals that
measure real-world physical conditions and converting the
resulting samples into digital numeric values that a computer
can manipulate.”
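The definition above can be illustrated with a minimal sketch of sampling and quantization. The sensor function, voltage range, sample rate, and bit depth are all assumptions chosen for illustration; they do not come from the slides.

```python
import math

def sample_and_quantize(signal, sample_rate_hz, duration_s, n_bits, v_min, v_max):
    """Sample a continuous signal at fixed intervals and map each sample
    to an integer code, as an idealized analog-to-digital converter would."""
    levels = 2 ** n_bits
    n_samples = int(sample_rate_hz * duration_s)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                      # sampling instant in seconds
        v = min(max(signal(t), v_min), v_max)       # clip to the input range
        codes.append(round((v - v_min) / (v_max - v_min) * (levels - 1)))
    return codes

# Hypothetical sensor: a 1 Hz sine wave swinging between 0 V and 5 V.
def sensor_voltage(t):
    return 2.5 + 2.5 * math.sin(2 * math.pi * 1.0 * t)

codes = sample_and_quantize(sensor_voltage, sample_rate_hz=8, duration_s=1.0,
                            n_bits=8, v_min=0.0, v_max=5.0)
print(codes)
```

Each printed value is a digital code between 0 and 255 that a computer can store and process in place of the original voltage.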
5. LIFE-CYCLE OF A MACHINE LEARNING PROJECT
A Machine Learning project typically follows these life-cycle steps:
1. Defining the project objective: Identifying the business problem, converting it into a
statistical problem, and then into an optimization problem
2. Data Acquisition or Collection: Acquiring and merging the data from all the appropriate
sources
3. Data Exploration and Pre-processing: Cleaning and preprocessing the data to create
homogeneity, performing exploratory data analysis and statistical analysis to understand the
relationships between the variables.
4. Feature Engineering: Creating new features based on empirical relationships and selecting
significant variables using dimensionality reduction techniques.
6. LIFE-CYCLE OF A MACHINE LEARNING PROJECT
5. Model Building: Selecting the appropriate ML algorithms and training the model on the
dataset to identify the patterns.
6. Execution & Model Validation: Running the model and validating it, including
fine-tuning its parameters.
7. Deployment: Delivering the business-usable results of the ML process. Models
are deployed to enterprise apps, systems, and data stores.
8. Interpretation, Data Visualization, and Documentation: Interpreting, visualizing, and
communicating the model insights. Documenting the modeling process for reproducibility and
creating the model monitoring and maintenance plan.
8. DATA ACQUISITION IN MACHINE LEARNING
Collection and Integration of the data
Formatting
Labeling
9. COLLECTION AND INTEGRATION OF THE DATA
The data is extracted from various sources and usually resides in
different places, so multiple datasets need to be combined before
they can be used. The acquired data is typically in raw format
and not suitable for immediate consumption and analysis.
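As a rough sketch of collection and integration, the example below combines two raw exports that share a key. The CSV contents, column names, and the integrate helper are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical raw exports from two separate systems.
sensor_csv = "device_id,temperature_c\nA1,21.5\nB2,19.0\nC3,23.1\n"
location_csv = "device_id,site\nA1,Lab\nB2,Workshop\n"

def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def integrate(readings, locations):
    """Combine readings with site information, keeping every reading.
    Devices without a known site get None (a left join on device_id)."""
    site_by_device = {row["device_id"]: row["site"] for row in locations}
    combined = []
    for row in readings:
        combined.append({
            "device_id": row["device_id"],
            "temperature_c": float(row["temperature_c"]),  # raw text to number
            "site": site_by_device.get(row["device_id"]),
        })
    return combined

combined = integrate(read_rows(sensor_csv), read_rows(location_csv))
for record in combined:
    print(record)
```

The raw values arrive as text, which is why the temperature is converted to a number during integration.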
10. FORMATTING
Prepare or organize the datasets as per the analysis requirements.
LABELING
After gathering the data, it must be labeled. For example, in a
factory application, one would want to label the images of the
components as defective or not defective.
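A minimal sketch of the labeling step might look like the following, assuming a human inspector has already flagged the defective images. The file names and the label_components helper are hypothetical.

```python
# Hypothetical image files and the subset an inspector flagged as defective.
image_files = ["gear_001.png", "gear_002.png", "gear_003.png"]
flagged_by_inspector = {"gear_002.png"}

def label_components(files, defective):
    """Return (file, label) pairs where the label is 'defective' or 'ok'."""
    return [(f, "defective" if f in defective else "ok") for f in files]

labels = label_components(image_files, flagged_by_inspector)
for name, label in labels:
    print(name, label)
```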
11. THE DATA ACQUISITION PROCESS
The process of data acquisition involves searching for datasets that
can be used to train Machine Learning models. This is not a simple
task. There are various approaches to acquiring data, bucketed here
into three main segments:
Data Discovery
Data Augmentation
Data Generation
12. DATA DISCOVERY
The first approach to acquiring data is Data discovery. It is a
key step when indexing, sharing, and searching for new
datasets available on the web and in data lakes.
It can be broken into two steps: Searching and Sharing.
For sharing, the data must first be labeled or indexed and
published using one of the many collaborative systems
available for this purpose.
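The two steps can be sketched with a toy in-memory catalog. This is only an illustration under stated assumptions: the catalog structure and the publish and search helpers are hypothetical and do not represent any real collaborative system.

```python
catalog = []  # hypothetical shared index of published datasets

def publish(catalog, name, description, tags):
    """Sharing: index a dataset by its description and lowercase tags."""
    catalog.append({
        "name": name,
        "description": description,
        "tags": {t.lower() for t in tags},
    })

def search(catalog, keyword):
    """Searching: return names whose tags or description match the keyword."""
    kw = keyword.lower()
    return [e["name"] for e in catalog
            if kw in e["tags"] or kw in e["description"].lower()]

publish(catalog, "gear-images", "Photos of machined gears", ["images", "defects"])
publish(catalog, "room-temps", "Hourly room temperature logs", ["sensors", "temperature"])

hits = search(catalog, "temperature")
print(hits)
```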
13. DATA AUGMENTATION
The next approach to data acquisition is Data
augmentation. To augment means to make something greater
by adding to it. In the context of data acquisition, we
are essentially enriching the existing data by adding more
external data. In machine learning and deep learning, it is
common to use pre-trained models and embeddings to
increase the number of features to train on.
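As a hedged sketch of this idea, the example below enriches existing records with vectors from a small lookup table standing in for pre-trained embeddings. The embedding values, record fields, and the augment helper are made up for illustration.

```python
# Stand-in for a pre-trained embedding table; the values are made up.
pretrained_embeddings = {
    "gear": [0.9, 0.1, 0.0],
    "valve": [0.1, 0.8, 0.3],
}

def augment(records, embeddings, dim=3):
    """Return new records with an extra 'embedding' feature.
    Components missing from the table get a zero vector."""
    augmented = []
    for r in records:
        vec = embeddings.get(r["component"], [0.0] * dim)
        augmented.append({**r, "embedding": list(vec)})
    return augmented

records = [
    {"component": "gear", "weight_g": 120.0},
    {"component": "bolt", "weight_g": 15.0},
]
augmented = augment(records, pretrained_embeddings)
for row in augmented:
    print(row)
```

The original records are left unchanged; each augmented record carries the original fields plus the added external feature.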
16. DATA GENERATION
As the name suggests, the data is generated. If we do not have enough data and
no external data is available, the option is to generate the datasets
manually or automatically.
Instead of collecting and labeling large datasets, there are several techniques
for generating synthetic data that has similar properties to real data. Synthetic
data has major advantages, including reduced cost, higher accuracy in data
labeling (because the labels in synthetic data are already known), scalability (it
is easy to create vast amounts of simulated data), and variety. Synthetic data
can be used to create data samples for edge cases that do not frequently occur
in the real world.
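One simple way to generate labeled synthetic data is to draw samples from assumed distributions, as sketched below. The feature name, the means, the spread, and the defect rate are assumptions for illustration, not measured properties of any real process.

```python
import random

def generate_synthetic(n, defect_rate=0.1, seed=0):
    """Generate n synthetic part measurements with known labels.
    Defective parts are drawn around an assumed larger mean diameter."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    data = []
    for _ in range(n):
        defective = rng.random() < defect_rate
        mean_mm = 12.0 if defective else 10.0  # assumed means, not real specs
        data.append({
            "diameter_mm": rng.gauss(mean_mm, 0.5),
            "label": "defective" if defective else "ok",
        })
    return data

samples = generate_synthetic(1000)
n_defective = sum(1 for s in samples if s["label"] == "defective")
print(len(samples), "samples,", n_defective, "labeled defective")
```

Because the generator decides each label before drawing the measurement, the labels are known by construction, and raising defect_rate produces more of the rare edge cases on demand.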
18. DATA ACQUISITION TECHNIQUES AND TOOLS
The major tools and techniques for data acquisition are:
1. Data Warehouses and ETL
2. Data Lakes and ELT
3. Cloud Data Warehouse providers
20. DATA WAREHOUSES AND ETL
A data warehouse is a type of database that is used for storing and managing
large amounts of data. It is designed to facilitate the process of querying and
analyzing data, and is often used by organizations to support business
intelligence and decision-making activities. Data warehouses typically store data
from multiple sources, such as operational databases, transactional systems,
and external sources, and are designed to support the efficient execution of
complex queries and analysis. This allows organizations to gain insights into
their data and make informed decisions based on that information.
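ETL stands for Extract, Transform, Load: data is cleaned and shaped before it enters the warehouse. The sketch below illustrates that order using an in-memory SQLite database as a stand-in for a warehouse. The order records, table schema, and transform helper are hypothetical.

```python
import sqlite3

# Extract: hypothetical raw records pulled from an operational system.
raw_orders = [
    {"order_id": "1", "region": " north ", "amount": "120.50"},
    {"order_id": "2", "region": "South", "amount": "80.00"},
    {"order_id": "3", "region": "north", "amount": "n/a"},  # unusable amount
]

def transform(rows):
    """Transform: fix types, normalize region names, drop unusable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), row["region"].strip().lower(), amount))
    return clean

# Load: only transformed rows reach the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", transform(raw_orders))

# Analysis query against the cleaned, structured data.
totals = dict(warehouse.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))
print(totals)
```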
21. DATA LAKES AND ELT
A data lake is a storage repository that can hold large amounts of
structured, semi-structured, and unstructured data. It can store
images, videos, audio recordings, and PDF files, and it supports faster
ingestion of new data.
Unlike data warehouses, data lakes store everything, are more flexible, and
follow the Extract, Load, and Transform (ELT) approach. The data is loaded
first and is transformed only when needed, so it is processed
later as per the requirements.
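The ELT order can be sketched as follows: raw payloads are stored untouched, and parsing happens only when a question is asked. An in-memory SQLite table stands in for the lake here, and the event payloads and the average_temperature helper are hypothetical.

```python
import json
import sqlite3

# Extract: hypothetical raw events with inconsistent fields.
raw_events = [
    '{"device": "A1", "temp_c": 21.5}',
    '{"device": "B2", "note": "no reading"}',
    '{"device": "A1", "temp_c": 22.5}',
]

# Load: store every payload as-is, with no cleaning or schema enforcement.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE raw_events (payload TEXT)")
lake.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in raw_events])

def average_temperature(conn, device):
    """Transform at read time: parse payloads only when this analysis needs them."""
    temps = []
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event.get("device") == device and "temp_c" in event:
            temps.append(event["temp_c"])
    return sum(temps) / len(temps) if temps else None

print(average_temperature(lake, "A1"))
print(average_temperature(lake, "B2"))
```

Nothing is discarded at load time, so a later analysis with different requirements can still reach fields that this one ignores.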
22. CLOUD DATA WAREHOUSE PROVIDERS
A cloud data warehouse is another service that collects, organizes, and
stores data. Cloud data warehouses are quicker and cheaper to set up as
no physical hardware needs to be procured.
• Amazon Redshift
• Snowflake
• Google BigQuery
• IBM Db2 Warehouse
• Microsoft Azure Synapse
• Oracle Autonomous Data Warehouse
• SAP Data Warehouse Cloud
• Yellowbrick Data
• Teradata Integrated Data Warehouse