Topic 2
Steps for Building
Artificial Intelligence-Based Solutions
Dr. Sunu Wibirama
Artificial Intelligence Course Module
Course code: UGMx 001001132012
April 14, 2022
1 Course Learning Outcomes
This topic addresses course learning outcome (CPMK) 2: the ability to explain the steps
for building an AI-based system and to understand the strengths and weaknesses of
AI-based solutions. The indicators that this outcome has been achieved are the ability to
distinguish data science projects from AI projects and to understand the development
process of AI-based solutions, along with their strengths and weaknesses.
2 Scope of Material
This topic covers the following material:
a) Know Your Data: explains the basics of big data and the aspects that need attention
in big data, such as volume (data size), velocity (speed of change), variety (different
forms of data sources), and veracity (uncertainty of data).
b) Types of Data: discusses commonly encountered data types, such as nominal
(categorical), ordinal, interval, and ratio data. It also gives examples of these data
types in datasets used to train machine learning models.
c) Common Problems with Data: discusses problems that arise in data obtained from
real environments or sensor devices, such as incomplete data, repeated (redundant)
data, and mixed numeric and character data. It also discusses the causes of errors in
data, such as errors made while measuring (measurement error), errors made while
sampling (sampling error), and data leakage.
d) Machine Learning vs. Data Science: discusses the difference in perspective between
machine learning and data science. Machine learning usually focuses on developing
AI-based systems or software. Data science focuses on the aspects that matter for
decision making, and its output is delivered as a presentation or a report for
decision makers.
e) Data Science Life Cycle: explains why data science is important as input for decision
making. It also explains the steps of the data science process: discovery, data
preparation, model planning, model building, operationalize, and communicate results.
f) Workflow of a Data Science Project: walks through an example of the data science
life cycle for predicting whether a person has diabetes.
g) What Machine Learning Can and Cannot Do: discusses what machine learning can and
cannot do. Machine learning can be applied when the problem has a simple objective,
when the problem is too complex to be captured by hand-written rules (a rule-based
program), when the problem is a perception problem (for example, speech recognition
or face recognition), and when the developer has a large amount of data.
h) Various Tasks in Machine Learning: discusses the tasks that machine learning can
perform, such as automate, alert or prompt, organize, annotate, extract, recommend,
classify, quantify, synthesize, answer an explicit question (yes/no), transform its
input, and detect anomaly.
i) Workflow of a Machine Learning Project: covers the general machine learning project
life cycle and matters related to machine learning projects, such as the project's
impact and cost and the factors to weigh when estimating that cost.
13/04/2022
1
sunu@ugm.ac.id
Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1
Sunu Wibirama
sunu@ugm.ac.id
Department of Electrical and Information Engineering
Faculty of Engineering
Universitas Gadjah Mada
INDONESIA
First Things First: Know Your Data
Kecerdasan Buatan | Artificial Intelligence
Version: January 2022
The biggest impact
of Industry 4.0:
we are great at producing data,
but we struggle to understand the
underlying human behaviour
What is Big Data?
• Several ways to think about how “Big” is “Big Data”:
• When the size of the data is a challenge
(data without a limit on size)
• When the data cannot fit on one machine
• When the data is part of cultural and daily activities
• Collecting and using a lot of data rather than small
samples
• Accepting messiness in our data
What should we know about data?
• Type of data
• Quality of data
• Preprocessing of data
• Relationship of data
Types of Data
Types of data
• To analyze data, we must understand
the type of data. This tells us:
• What we can and cannot do with
different types of data
• How to transform data from one
type to another
• Four types of data:
• Nominal data
• Ordinal data
• Interval data
• Ratio data
Nominal data
• Unordered groups or categories
• Nominal values provide only enough information to distinguish
one object from another. Without order, we cannot say one is
better than another
• Examples:
• Zip codes
• Employee ID numbers
• Eye color
• Types of operating systems
• Possible operations:
• Mode
• Frequency
• Chi-squared test
Ordinal data
• Ordered groups or categories
• Data are ordered in a certain way, but the intervals between
measurements are not meaningful
• Examples:
• Self-reported data / questionnaire
• Website rating (excellent, good, fair, poor)
• Possible operations:
• Frequency
• Mode
• Median
• Percentiles
• Chi-squared test
• Wilcoxon rank sum test
Interval data
• Continuous data where differences between
measurements are meaningful
• Zero point on the scale is arbitrary
• Examples:
• calendar dates
• temperature in Celsius or Fahrenheit
• Possible operations:
• All descriptive statistics
(e.g.: mean, standard deviation)
• Pearson’s correlation
• t and F tests
Ratio data
• Same as interval data, but with a meaningful zero point
• Examples:
• Age: the difference between a person of 38 and 35
years old is the same as the difference between a
person of 15 and 12 years old
• Time to complete a particular task
• Distance, length, weight, height
• Possible operations:
• All descriptive statistics
(e.g.: mean, standard deviation)
• Pearson’s correlation
• t and F tests
• Regression analysis
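The operations listed for the four measurement scales can be sketched in a few lines of Python. This is a minimal illustration, assuming small made-up samples (none of the values come from the lecture) and using only the standard `statistics` module.

```python
import statistics

# Hypothetical samples, one per measurement scale (illustrative values only).
eye_color = ["brown", "blue", "brown", "green", "brown"]   # nominal
rating = ["poor", "fair", "good", "good", "excellent"]     # ordinal
temp_celsius = [20.0, 22.0, 25.0, 21.0, 22.0]              # interval
task_seconds = [30.0, 45.0, 60.0, 90.0, 75.0]              # ratio

# Nominal: only counting is meaningful, so report the mode.
nominal_mode = statistics.mode(eye_color)

# Ordinal: map categories to ranks, then take the middle rank (the median).
order = ["poor", "fair", "good", "excellent"]
ranks = sorted(order.index(r) for r in rating)
ordinal_median = order[ranks[len(ranks) // 2]]

# Interval: differences are meaningful, so mean and standard deviation apply.
interval_mean = statistics.mean(temp_celsius)
interval_stdev = statistics.stdev(temp_celsius)

# Ratio: a true zero exists, so statements like "three times as long" make sense.
ratio_longest_vs_shortest = max(task_seconds) / min(task_seconds)

print(nominal_mode, ordinal_median, interval_mean, ratio_longest_vs_shortest)
```

Taking a mean of `eye_color` or a ratio of two Celsius temperatures would run into the limits described above: the first has no numeric meaning and the second depends on an arbitrary zero point.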
Other classifications of data
• Binary data: A set of just two values
(e.g., gender)
• Textual data: free form, usually short
text data (e.g.: name, address)
• Numeric data: True numeric values
that allow arithmetic operations (e.g.,
price, age)
Observing Real Case Training Dataset
Review: types of data
• Four types of data: nominal, ordinal, interval, and ratio
• The type of data determines what we can and cannot do with it,
and how we can transform it from one type to another
Put them together
Our previous example?
Categorical Ratio Ratio Categorical Categorical
Data acquisition
- Manual labeling
Car Not car Car Not car Not car
- Observing behavior of machine
Machine ID   Temperature (°C)   Pressure (psi)   Machine fault
1897         60                 7.65             N
8977         100                26.4             N
1345         120                75.7             Y
1376         140                100.0            Y
5476         160                125.0            Y
Data acquisition
- Download from website/partnership
Amazon SageMaker, Label Studio
Displaying data: understanding types of data
[Source: Aurelien Geron, 2021]
Common Problems With Data (Part 01)
Data are messy
• Real-world data are messy: about 80% of the effort of most AI
engineers goes into cleaning and pre-processing the data.
• Garbage in, garbage out: do not feed raw data straight
into your machine learning model, because you will get
“wrong” results.
• Data is not reality:
• Humans define the phenomenon that they want to measure
• They design the system to collect the data
• They clean and pre-process the data
• They interpret the results
Even with the same dataset, two people can reach vastly
different conclusions.
Noisy dataset
Source: https://medium.com/well-red/cleaning-a-messy-dataset-using-python-7d7ab0bf199b
Another problem: undefined goals
Common Problems With Data (Part 02)
Definition error
You are developing an AI system for a small department store.
How do you define the “customer” of your client?
Do you need to specify different segments for your customers?
Capture error
Without counterbalancing (capture error)
With counterbalancing
Measurement error
• Errors occur when the software or hardware
used to capture the data goes awry.
• Example:
• You develop a machine learning algorithm
to detect Covid-19 based on daily habits,
using the user’s smartphone.
• Information about user behavior may be
lost if the user experiences connectivity
issues.
Sampling error
Coverage error Nonresponse error
[Source: https://www.geopoll.com/blog/sample-frame-sample-error-research]
Other considerations about data
• Accessibility of the data:
• Is it accessible physically, contractually, and ethically?
• How much do the data cost you / your company?
• Agreement / copyright problems
• Potential privacy issues (e.g., a classified government
project)
• Reliability of the data:
• Can you trust the labels? (e.g., doubts about data labeled
through Mechanical Turk)
• Feedback loop: the data used to train the model is
obtained using the model itself (e.g., clicks from a website
user on products recommended by a machine learning model are
unreliable, because they reflect the model's intervention
rather than the user's pure interest)
Data leakage
Data leakage: unintentional introduction of information about the target that
should not be made available. Training on contaminated data leads to overly
optimistic expectations about the model performance.
[Source: A. Burkov, Machine Learning Engineering, 2020]
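One common way contamination creeps in is through preprocessing: if scaling statistics are computed on all rows, information from the held-out rows shapes the training data. The sketch below illustrates that one form only, assuming a made-up single feature and a hypothetical `standardize` helper; it is not taken from the cited book.

```python
import statistics

def standardize(values, mean, stdev):
    """Scale values using the given mean and standard deviation."""
    return [(v - mean) / stdev for v in values]

# Hypothetical feature values; the last two rows play the role of the test set.
feature = [10.0, 12.0, 14.0, 16.0, 50.0, 60.0]
train, test = feature[:4], feature[4:]

# Leaky: statistics computed on ALL rows, so test rows influence the training data.
leaky_mean = statistics.mean(feature)
leaky_stdev = statistics.stdev(feature)
leaky_train = standardize(train, leaky_mean, leaky_stdev)

# Safer: statistics computed on the training rows only, then reused for the test rows.
train_mean = statistics.mean(train)
train_stdev = statistics.stdev(train)
clean_train = standardize(train, train_mean, train_stdev)
clean_test = standardize(test, train_mean, train_stdev)

print("mean over all rows:", leaky_mean, "| mean over training rows:", train_mean)
```

The two means differ noticeably here because the test rows are much larger than the training rows; a model trained on `leaky_train` has already "seen" that shift.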
Machine Learning vs. Data Science
Machine Learning and Data Science
[Source: https://www.edureka.co/blog/what-is-data-science/]
Machine Learning and Data Science
Machine Learning:
• Field of study that gives computers the ability to learn
without being explicitly programmed (Arthur Samuel, 1959).
• It is used for making predictions and classifying the result
for new data points.
• A machine learning engineer needs skills such as
computer science fundamentals, programming in
Python or R, and statistics and probability concepts.
• It mostly requires structured data to work on.
• Machine learning engineers spend a lot of time handling
and cleansing data, as well as managing the complexities
that occur when implementing algorithms and
the mathematical concepts behind them.
• Software is mainly the product of a machine learning project.
Data Science:
• Science of understanding and finding hidden patterns or
useful insights in the data, which helps in making smarter
business decisions.
• It is used for discovering insights from the data.
• A data scientist needs skills in big data tools
such as Hadoop, Hive, and Pig, in statistics, and in programming in
Python, R, or Scala.
• It can work with raw, structured, and unstructured data.
• Data scientists spend a lot of time handling the data,
cleansing the data, and understanding its patterns.
• A slide deck or report is mainly the product of a data science
project.
Machine learning point of view
House price in Yogyakarta (2015-2017)
Size (square meter)   # of bedrooms   # of bathrooms   Newly renovated   Price (IDR 1,000,000)
123                   1               2                N                 300
125                   1               3                N                 360
346                   2               1                N                 580
432                   3               3                Y                 630
547                   4               4                N                 730
678                   5               5                Y                 850
Using four features/attributes to predict house price
Output of the project:
a website or a mobile app for house price prediction with an AI system inside
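From the machine learning point of view, the deliverable is a function that returns a price for a new house. The sketch below is a deliberate simplification: it fits an ordinary least-squares line on the size column only, whereas the slide uses four features. The function names `fit_line` and `predict_price` are hypothetical.

```python
# Rows from the house price table:
# (size in square meters, bedrooms, bathrooms, renovated, price in million IDR).
houses = [
    (123, 1, 2, "N", 300),
    (125, 1, 3, "N", 360),
    (346, 2, 1, "N", 580),
    (432, 3, 3, "Y", 630),
    (547, 4, 4, "N", 730),
    (678, 5, 5, "Y", 850),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

sizes = [row[0] for row in houses]
prices = [row[4] for row in houses]
slope, intercept = fit_line(sizes, prices)

def predict_price(size_m2):
    """Predicted price (million IDR) for a house of the given size."""
    return slope * size_m2 + intercept

print(round(predict_price(500)))
```

In a real project, `predict_price` would sit behind the website or mobile app mentioned above and would be trained on far more than six rows.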
Data science point of view
House price in Yogyakarta (2015-2017)
Size (square meter)   # of bedrooms   # of bathrooms   Newly renovated   Price (IDR 1,000,000)
123                   1               2                N                 300
125                   1               3                N                 360
346                   2               1                N                 580
432                   3               3                Y                 630
547                   4               4                N                 730
678                   5               5                Y                 850
Output of the project:
a slide deck / presentation / report
Insight:
• Houses with 3 bathrooms are more expensive than houses with 2 bathrooms of a similar size
• Size of house and number of bedrooms are two strong predictors of house price
• Renovating a house significantly increases its price
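From the data science point of view, the same table is used to produce an insight rather than software. As a rough sketch, the Pearson correlation of each numeric column with price can be computed by hand; the `pearson` helper is written out here as an assumption-free formula, but six rows are far too few to treat the numbers as evidence.

```python
import math

# Numeric columns from the house price table:
# (size in square meters, bedrooms, bathrooms, price in million IDR).
houses = [
    (123, 1, 2, 300),
    (125, 1, 3, 360),
    (346, 2, 1, 580),
    (432, 3, 3, 630),
    (547, 4, 4, 730),
    (678, 5, 5, 850),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

prices = [row[3] for row in houses]
correlations = {
    name: pearson([row[i] for row in houses], prices)
    for i, name in enumerate(["size", "bedrooms", "bathrooms"])
}

# Report features from most to least correlated with price.
for name, r in sorted(correlations.items(), key=lambda kv: -kv[1]):
    print(f"{name}: r = {r:.2f}")
```

On these six rows, size and number of bedrooms correlate with price more strongly than number of bathrooms, which matches the second insight on the slide.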
Data-Driven Decision Making (3DM)
• End user: clients or stakeholders
• Business analysts
• Data scientists
• Database administrators / data engineers
Goal: improving the accuracy of business decision making
Source: A. Tan, “Data management: A foundation of effective data science”, The Capco Institute Journal of Financial Transformation, 2019, p.31
How data science helps business
Source:
http://www.keywebmetrics.com/2015/02/big-data-job-without-being-data-scientist/
Data Science Life Cycle
Data Science Life Cycle
Source: https://www.edureka.co/blog/what-is-data-science/
1. Discovery
• Understand the problem to be solved and the business goals
• Determine the success criteria of your solution
• Understand the constraints:
• requirements of the solution
• costs and budgets
• priorities
• Understand the availability of resources for data analytics:
• People
• Technology
• Time
• Data
• Understand the characteristics of your data:
• Structured or unstructured
• Relevance and completeness
• Using an available dataset to simulate the solution (secondary
data) or using your own data (primary data)
2. Data preparation
Data preparation encompasses all activities to construct
and clean the data set:
• Handling errors, missing or invalid values
• Normalization
• Eliminating duplicate rows
• Formatting properly (standardization)
• Combining multiple data sources
• Transforming data
• Feature engineering
• Text analysis
• Accelerating data preparation by automating common steps
• Validation
Source: https://www.edureka.co/blog/what-is-data-science/
Arguably the most time-consuming step: “80% of the entire DS process
is in data cleaning and preparation”
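Three of the preparation activities above (eliminating duplicate rows, handling missing values, and normalization) can be sketched with plain Python. The records are made up for illustration, and mean imputation and min-max scaling are just one possible choice for each step, not a method prescribed by the lecture.

```python
# Hypothetical raw records: (machine_id, temperature); None marks a missing value.
raw = [
    ("A", 60.0),
    ("B", None),
    ("A", 60.0),   # exact duplicate of the first row
    ("C", 100.0),
    ("D", 140.0),
]

# 1. Eliminate duplicate rows while keeping the original order.
deduplicated = list(dict.fromkeys(raw))

# 2. Handle missing values: here, impute with the mean of the observed values.
observed = [t for _, t in deduplicated if t is not None]
mean_temp = sum(observed) / len(observed)
imputed = [(m, mean_temp if t is None else t) for m, t in deduplicated]

# 3. Normalize to the [0, 1] range (min-max normalization).
low = min(t for _, t in imputed)
high = max(t for _, t in imputed)
normalized = [(m, (t - low) / (high - low)) for m, t in imputed]

print(normalized)
```

The order of the steps matters: removing duplicates first keeps the repeated row from being counted twice when the mean is computed.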
3-4. Model Planning and Building
• You determine the methods and techniques to draw
the relationships between variables.
• You also develop datasets for training and testing
purposes.
• You consider whether your existing tools will suffice for
running the machine learning models or whether you need a more
robust environment (such as fast, parallel processing).
• You analyze various machine learning techniques, such as
regression, classification, association, and clustering, to
build the model.
• You perform model planning and building in a simulated
environment.
Source: https://www.edureka.co/blog/what-is-data-science/
Generally, these steps require technical knowledge of programming,
statistics, and machine learning.
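Developing datasets for training and testing usually starts with a random split. A minimal sketch follows, assuming a hypothetical `train_test_split` helper written here from scratch (it is not a library function) and a 75/25 split chosen only for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 hypothetical data rows
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is kept aside so that the reported performance reflects rows the model has not seen.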
5. Operationalize
You deliver final reports, briefings, code, and
technical documents.
In addition, a pilot project is sometimes
implemented in a real-time production
environment to provide online insight into:
• Integration viability in the enterprise architecture
• Monitoring and maintenance
• Scalability on real data
This gives you a clear picture of the
performance and other related constraints on a
small scale before full deployment.
Source:
https://dotdata.com/blog/data-science-operationalization-what-the-heck-is-it/
https://medium.datadriveninvestor.com/10-dimensions-of-making-data-science-work-part-8-ff1d672b7571
6. Communicate results
It is important to evaluate whether you have
achieved the goal that you set in
the discovery phase.
So, in the last phase, you identify all the key
findings and communicate them to the stakeholders.
Determine whether the results of the project are a
success or a failure based on the criteria
developed in the discovery phase.
Source: https://www.annalect.fi/actionable-insights-data-presentation/
Workflow of a Data Science Project
Case study: Diabetes Prevention
Problem: how can we predict the occurrence of diabetes and
take appropriate measures beforehand to prevent it?
Goal: predict the “positive” or “negative” diagnosis of diabetes
Type 1 vs. type 2 diabetes
Step 1: Discovery
First, we collect the data based on the medical
history of the patient.
Attributes:
1. npreg – Number of times pregnant
2. glucose – Plasma glucose concentration
3. bp – Blood pressure
4. skin – Triceps skinfold thickness
5. bmi – Body mass index
6. ped – Diabetes pedigree function
7. age – Age
8. income – Income
Step 2: Preparation
• Once we have the data, we need
to clean and prepare it for
analysis.
• This data has many inconsistencies,
such as missing values, blank columns,
abrupt values, and incorrect data formats,
which need to be cleaned.
• Here, we have organized the data into a
single table under different attributes,
making it look more structured.
Our data after cleansing process
Step 3: Model planning
• First, we apply various statistical functions
to the data. We can explore the number of
missing values and unique values.
• We can also use a summary function,
which gives us statistical information
such as mean, median, range, min, and max
values.
• Then, we use visualization techniques such as
histograms, line graphs, and box plots to get a
fair idea of the distribution of the data.
• Observe the correlation between features: we
may need to use only one of two
correlated features
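The first two bullets of this step (counting missing and unique values, then summarizing) can be sketched for a single attribute. The glucose values below are invented for illustration and are not the case-study data; only the standard `statistics` module is used.

```python
import statistics

# Hypothetical values for one attribute (glucose); None marks a missing entry.
glucose = [148, 85, None, 89, 137, 85, None, 197]

# Summary statistics are computed on the observed values only.
present = [g for g in glucose if g is not None]
summary = {
    "missing": glucose.count(None),
    "unique": len(set(present)),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "min": min(present),
    "max": max(present),
    "range": max(present) - min(present),
}
print(summary)
```

Running the same summary for every attribute gives a quick picture of which columns need imputation and which have suspicious extremes before any model is chosen.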
Step 4: Model building
• Based on the insights derived from the
previous step, the best fit for this kind of
problem is the decision tree.
• Since we already have the major
attributes for analysis, such as npreg and bmi,
we use a supervised learning
technique to build the model.
• Further, we use a
decision tree because it takes all
attributes into consideration in one go.
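To make the idea of a decision tree concrete, the sketch below builds a decision stump, that is, a tree with a single split on one attribute. It is a simplification under stated assumptions: the (glucose, diagnosis) pairs are invented, the split is chosen by counting misclassifications rather than by an impurity measure such as Gini, and a full tree would repeat this search over all attributes and recurse.

```python
# Hypothetical (glucose, diagnosis) pairs; the values are illustrative only.
samples = [(85, "negative"), (89, "negative"), (110, "negative"),
           (137, "positive"), (148, "positive"), (197, "positive")]

def errors_at(threshold):
    """Count mistakes of the rule: glucose >= threshold -> positive."""
    return sum(
        ("positive" if g >= threshold else "negative") != label
        for g, label in samples
    )

# Candidate thresholds are midpoints between consecutive sorted glucose values.
values = sorted(g for g, _ in samples)
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
best_threshold = min(candidates, key=errors_at)

def predict(glucose):
    """Apply the learned single-split rule to a new glucose value."""
    return "positive" if glucose >= best_threshold else "negative"

print(best_threshold, predict(160))
```

Because the labels were supplied with the training pairs, this is a supervised learning technique in the sense used above.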
Step 5 and Step 6:
Operationalize and communicate results
• Step 5:
• In this phase, we run a small pilot project to check whether our results are
appropriate.
• We also look for performance constraints, if any. If the results are
not accurate, we need to replan and rebuild the model.
• Step 6:
• Once we have executed the project successfully, we share the
output for full deployment.
• We should use real data in the deployment environment and observe
model performance
What Machine Learning Can and Cannot Do – Part 01
Things to know before implementing machine learning
• Engineers should develop intuition about
what machine learning can and cannot do.
• Academic literature tends to report
positive results or success stories, so
people think that machine learning can do
almost everything.
• To successfully implement a machine
learning project, failed cases should be
recognized as well, so that we can learn
from past experience.
Supervised learning
Input               Output            Application
email               spam? (yes/no)    spam filtering
audio               text transcript   speech recognition
ads, user info      click? (yes/no)   online advertising
Indonesian          Japanese          machine translation
plate number pics   plate number      computer vision
Anything you can do with 1 second of thought,
we can probably now or soon automate
What machine learning cannot do
Conduct user experience research on over 500
participants and write an extensive 50-page
UX report in 10 seconds
Even a team of humans
cannot do that!
What machine learning can and cannot do
Case:
“The toy that I bought was broken, so I could not give the toy to my
daughter for her birthday party. Can I request a refund?”
Machine learning can do this: “Detecting a refund request”
• Input: text from the email
• Output: Refund/Shipping/Other
Machine learning cannot do this: “Generating a complicated piece of text
and empathizing with you”, for example:
“We apologize for the inconvenience.
We will forward your request to the finance department. We
hope your daughter had a fun and lively birthday party…..”
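The "can do" side of this slide, routing an email to Refund/Shipping/Other, is a text classification task. The sketch below assumes a tiny hand-made training set and uses a multinomial naive Bayes classifier with add-one smoothing; the lecture does not name a specific algorithm, so this is only one plausible choice.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled emails (illustrative only).
training = [
    ("the toy was broken can i request a refund", "Refund"),
    ("i want my money back please refund my order", "Refund"),
    ("when will my package arrive", "Shipping"),
    ("my shipping is in transit where is my package", "Shipping"),
    ("can i write a review for this product", "Other"),
    ("do you sell gift cards", "Other"),
]

word_counts = defaultdict(Counter)   # label -> word frequencies
label_counts = Counter()             # label -> number of emails
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        # Start from the log prior, then add one log likelihood per known word.
        score = math.log(label_counts[label] / len(training))
        for word in text.lower().split():
            if word in vocabulary:
                count = word_counts[label][word] + 1
                score += math.log(count / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("Can I request a refund"))
```

The classifier only picks one of three labels, which is a simple objective; writing the apologetic reply itself is the much harder task discussed on this slide.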
Using machine learning to generate an empathetic response?
Input (A): user email
Output (B): 2-3 paragraph response
1000 examples
• “My package was damaged” → “Thank you for your email”
• “Can I write a review for this product?” → “Thank you for your email”
• “Do you have any refund policy” → “Thank you for your email”
• “My shipping is in transit, when will
I receive my package?” → “Yes now thank we ….”
When to use machine learning?
1. When the problem has a simple objective:
Detecting an object with less than 1 second of thought (yes/no)
2. When the problem is too complex for coding:
Writing rules for detecting spam in your inbox
3. When it is a perception problem:
Detecting speech, or recognizing people in a video
4. When you have a lot of good data:
Good data are essential to train the machine learning model
Best practices
• Before deciding whether machine learning can tackle a problem:
• Look at similar projects or case studies:
papers, presentations, YouTube videos, books, etc.
• Ask yourself before building your first prototype:
• Does the problem have a simple objective?
• Is the problem too complex to be coded manually?
• Do you have enough good data to train the AI model?
• Is it a perception problem?
What Machine Learning Can and Cannot Do – Part 02
Self-driving car: car detection vs. intention detection
Car detection:
Input: scene video from dashboard camera
Output: position of other cars on the road
Intention detection:
Gestures from a construction worker, a tourist, and a biker
Why is it difficult?
Intention detection (construction worker, tourist, biker):
• Data: the number of ways
people can gesture at you is
very large (i.e., too many
variations), not counting local
customs
• Need for high accuracy: critical in
some cases; a construction
worker's sign requires 100% clarity
X-ray diagnosis
Source: Wang D, Mo J, Zhou G, Xu L, Liu Y (2020) An efficient mixture of deep and machine learning models for COVID-19 diagnosis in chest X-ray images. PLoS ONE 15(11): e0242535. https://doi.org/10.1371/journal.pone.0242535
• Can do: diagnose Covid-19 from > 1000 labeled images
• Cannot do: diagnose Covid-19 from 8 images in a medical textbook chapter explaining pneumonia and Covid-19
Take-home messages
• Machine learning generally works well when:
  o The problem has a simple objective
  o A lot of data is available
  o The problem is too complex to be coded by hand
• Machine learning works poorly when:
  o The problem has a complicated objective and only a small amount of data
  o The ML model is asked to perform on new types of data
Various Tasks in Machine Learning Projects
So, what kinds of tasks can machine learning handle?
Tasks that an ML model can do (1/3)
• Automate (for example, by taking action on the user’s
behalf or by starting or stopping a specific activity on a
server),
• Alert or prompt (for example, by asking the user if an
action should be taken or by asking a system
administrator if the traffic seems suspicious),
• Organize, by presenting a set of items in an order that
might be useful for a user (for example, by sorting
pictures or documents in the order of similarity to a
query or according to the user’s preferences),
• Annotate (for instance, by adding contextual
annotations to displayed information, or by highlighting,
in a text, phrases relevant to the user’s task),
Tasks that an ML model can do (2/3)
• Extract (for example, by detecting smaller pieces
of relevant information in a larger input, such as
named entities in the text: keywords, proper
names, companies, or locations),
• Recommend (for example, by detecting and
showing to a user highly relevant items in a large
collection based on item’s content or user’s
reaction to the past recommendations),
• Classify (for example, by dispatching input
examples into one, or several, of a predefined set
of distinctly-named groups),
• Quantify (for example, by assigning a number,
such as a price, to an object, such as a house)
Tasks that an ML model can do (3/3)
• Synthesize (for example, by generating new text,
image, sound, or another object similar to the objects
in a collection),
• Answer an explicit question, not an open-ended one
(for example, “Does this text describe that
image?” or “Are these two images similar?”),
• Transform its input (for example, by reducing its
dimensionality for visualization purposes,
paraphrasing a long text as a short abstract,
translating a sentence into another language, or
augmenting an image by applying a filter to it),
• Detect a novelty or an anomaly.
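As one illustration of the last task, the sketch below flags anomalous sensor readings with a simple z-score rule. This is only one of many possible approaches and is not prescribed by the slides; the function name `find_anomalies`, the threshold of 2.0, and the readings are all hypothetical.

```python
from statistics import mean, pstdev


def find_anomalies(values, z_threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > z_threshold]


# Hypothetical temperature readings, for illustration only.
readings = [60, 61, 59, 60, 62, 61, 60, 140]
print(find_anomalies(readings))
```

A z-score rule assumes the normal readings cluster around one typical value. Real projects often need a learned model of what "normal" looks like.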
Workflow of A Machine Learning Project
General machine learning project life cycle
[Source: A. Burkov, Machine Learning Engineering, 2020]
Notes on the ML life cycle
• An engineering project may or may not have a machine learning part.
• A machine learning project must first have a well-defined goal:
  o What kind of input the ML model uses
  o What kind of output the ML model generates
  o Success and failure criteria for the ML model
• The goal of an ML project is not always the same as the business objective. Example:
  o Business objective of Google Mail (Gmail): to make Gmail the most-used email service in the world.
  o ML project objective: to distinguish primary email from promotions with accuracy above 90%
• Prioritization of a machine learning project depends on impact and cost.
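A well-defined goal can be written down as a check that the model either passes or fails. The sketch below is a minimal illustration of the "accuracy above 90%" criterion from the Gmail example; the helper names `accuracy` and `meets_goal` and the hand-made labels are hypothetical, not taken from any real system.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def meets_goal(y_true, y_pred, threshold=0.90):
    """Success criterion: accuracy must be strictly above the threshold."""
    return accuracy(y_true, y_pred) > threshold


# Hypothetical hand-made labels, for illustration only.
y_true = ["primary", "promotions", "primary", "primary", "promotions",
          "primary", "promotions", "primary", "primary", "promotions"]
y_pred = ["primary", "promotions", "primary", "promotions", "promotions",
          "primary", "promotions", "primary", "primary", "promotions"]

print(accuracy(y_true, y_pred))    # 9 of the 10 predictions are correct
print(meets_goal(y_true, y_pred))  # exactly 90% is not "above 90%"
```

Writing the criterion as code forces the team to settle details such as whether the threshold is inclusive and which dataset the accuracy is measured on.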
Impact of a machine learning project
• The impact of an ML project is high when:
  o ML can replace a complex part of your engineering project
  o There is great benefit in getting inexpensive (but not 100% perfect) predictions.
Example: predicting student failure during two years of study at a university:
• Using a rule-based expert system is impossible with thousands of academic records.
• Some records are easy to categorize → send all data to an ML algorithm that separates “straightforward considered” cases (easy data) from “considered decision” cases (complicated data). Considered-decision cases need human intervention.
• If the ML algorithm makes a mistake, easy data will be classified as “considered decision”. Such mistakes do no harm, since a human can still make the decision for these easy cases.
Source:
Qazdar, A., Er-Raha, B., Cherkaoui, C. et al. A machine learning
algorithm framework for predicting students performance: A case study of
baccalaureate students in Morocco. Educ Inf Technol 24, 3577–3589
(2019). https://doi.org/10.1007/s10639-019-09946-8
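The split between easy cases and cases that need a human can be sketched as a confidence-based triage. This is an assumption about how such routing might be implemented, not the method of the cited paper; `triage`, `toy_failure_probability`, the 0.9 confidence level, and the student records are all hypothetical.

```python
def toy_failure_probability(student):
    """Hypothetical stand-in for a trained model: lower GPA means higher risk."""
    return max(0.0, min(1.0, (4.0 - student["gpa"]) / 4.0))


def triage(records, predict_proba, confidence=0.9):
    """Split records into confident predictions and cases for human review."""
    straightforward, considered = [], []
    for record in records:
        p_fail = predict_proba(record)
        if p_fail >= confidence:
            straightforward.append((record, True))   # confidently "will fail"
        elif p_fail <= 1 - confidence:
            straightforward.append((record, False))  # confidently "will pass"
        else:
            considered.append(record)                # uncertain: ask a human
    return straightforward, considered


# Hypothetical student records, for illustration only.
students = [{"id": 1, "gpa": 3.9}, {"id": 2, "gpa": 0.2},
            {"id": 3, "gpa": 2.0}, {"id": 4, "gpa": 2.6}]

easy, needs_human = triage(students, toy_failure_probability)
print(len(easy), len(needs_human))
```

Raising `confidence` sends more records to human review. That costs more staff time but lowers the risk of acting on a wrong automatic prediction.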
Cost of a machine learning project
• Three factors strongly influence the cost of a machine learning project:
  o The difficulty of the problem
    • Whether an implemented algorithm or a software library capable of solving the problem is available
    • Whether significant computational power is needed to build the ML model or to run it in a production environment
  o The cost of data
    • Can the data be generated automatically? (If manual labelling is still needed, then the data are costly.)
    • How many examples are needed to cover the various classes to be classified?
  o The need for accuracy
    • How costly is each wrong prediction? (E.g., when detecting Covid-19, we prefer a false positive to a false negative.)
    • What is the lowest accuracy level below which the model becomes impractical?
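To make "how costly is each wrong prediction" concrete, the sketch below compares two hypothetical models that have the same accuracy but make different kinds of errors. The cost values (a false negative counted as ten times worse than a false positive) are assumptions chosen only for illustration, as are the labels and function names.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
    """Weighted cost of errors; the weights are illustrative assumptions."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fp * cost_fp + fn * cost_fn


# Hypothetical test labels: 1 = Covid-19 positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 0]  # one false negative
model_b = [1, 1, 1, 1, 0, 0, 0, 0]  # one false positive

print(total_cost(y_true, model_a))
print(total_cost(y_true, model_b))
```

Both models are correct on 7 of 8 cases, yet under these assumed costs the model that misses a positive case is far more expensive. Accuracy alone therefore does not settle which model to deploy.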

Modul Topik 2 - Kecerdasan Buatan.pdf

  • 1.
    Topik 2 Langkah MembangunSolusi Berbasis Kecerdasan Buatan Dr. Sunu Wibirama Modul Kuliah Kecerdasan Buatan Kode mata kuliah: UGMx 001001132012 April 14, 2022
  • 2.
    April 14, 2022 1Capaian Pembelajaran Mata Kuliah Topik ini akan memenuhi CPMK 2, yakni mampu menjelaskan langkah-langkah mem- bangun sistem berbasis kecerdasan buatan serta mengerti kelebihan dan kekurangan solusi berbasis kecerdasan buatan. Adapun indikator tercapainya CPMK tersebut adalah mampu membedakan proyek sains data dan AI, memahami proses pengembangan solusi berbasis AI serta kelebihan dan kekurangannya. 2 Cakupan Materi Cakupan materi dalam topik ini sebagai berikut: a) Know Your Data: materi ini menjelaskan dasar-dasar big data dan aspek-aspek yang perlu diperhatikan dalam big data—seperti: volume (data size), velocity (speed of change), variety (different forms of data sources), dan veracity (uncertainty of data). b) Types of Data: materi ini membahas tipe-tipe data yang umum dijumpai, seperti data nominal (kategori), data ordinal, data interval, dan data rasio. Selain itu, materi ini juga menjelaskan contoh-contoh tipe data tersebut dalam dataset yang digunakan untuk melatih model machine learning. c) Common Problems with Data: materi ini membahas berbagai masalah yang muncul pada data yang didapatkan dari lingkungan riil atau perangkat sensor, misalnya keti- daklengkapan data, data yang berulang (redundan), tercampurnya data numeris dan karakter, dan sebagainya. Selain permasalahan yang muncul pada data, materi ini juga membahas penyebab munculnya kesalahan pada data, misalnya kesalahan pada saat melakukan pengukuran (measurement error), kesalahan pada saat mengambil sampel (sampling error), dan data leakage. d) Machine Learning vs. Data Science: materi ini membahas perbedaan sudut pan- dang antara machine learning dan data science. Machine learning biasanya berfokus pada pengembangan sistem atau perangkat lunak yang berbasis AI. Sementara itu data science berfokus pada aspek-aspek yang penting untuk pengambilan keputusan dan luarannya ditungakan dalam presentasi atau laporan khusus untuk pengambil keputusan (decision maker). 
e) Data Science Life Cycle: materi ini menjelaskan alasan pentingnya data science se- bagai bahan pertimbangan untuk pengambilan keputusan. Selain itu materi ini juga menjelaskan langkah-langkah atau proses pada data science, misalnya: discovery, data preparation, model planning, model building, operationalize, dan communicate results. f) Workflow of A Data Science Project: materi ini menjelaskan contoh dari data science life cycle untuk memprediksi apakah seseorang menderita diabetes. g) What Machine Learning Can and Cannot Do: materi ini membahas tentang hal-hal yang dapat dan tidak dapat dilakukan oleh machine learning. Machine learning dapat 1
  • 3.
    April 14, 2022 diterapkanuntuk permasalahan yang memiliki tujuan simpel (simple objective), prob- lem yang diselesaikan terlalu kompleks untuk dibuat aturannya dengan pemrograman (rule based program), problem yang dipecahkan adalah problem persepsi (misalnya: pengenalan suara atau pengenalan wajah), dan jika pengembang memiliki data dalam jumlah besar. h) Various Tasks in Machine Learning: materi ini membahas hal-hal yang dapat di- lakukan dengan machine learning, misalnya: automate, alert or prompt, organize, annotate, extract, recommend, classify, quantify, synthesize, answer an explicit ques- tion (yes/no), transform its input, dan detect anomaly. i) Workflow of A Machine Learning Project: materi ini mencakup general machine learn- ing project life cycle dan hal-hal yang terkait dengan proyek machine learning, misalnya pertimbangan pada dampak dan biaya proyek, serta berbagai macam pertimbangan untuk melihat biaya proyek machine learning. 2
  • 4.
    13/04/2022 1 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA First Things First: Know Your Data Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 The biggest impact of Industry 4.0: we are great on producing data, but struggling to understand the underlying human behaviour
  • 5.
    13/04/2022 2 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 What is Big Data? • Several other ways to think about how “Big” is “Big Data”: • When the size of data is a challenge. Data without limit of size. • When the data can’t be fit on one machine • When the data is part of cultural and daily activities • Collecting and using a lot of data rather than small samples • Accepting messiness in our data sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4
  • 6.
    13/04/2022 3 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 What we should know about data? Type of data Quality of data Preprocessing of data Relationship of data sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 6 End of File
  • 7.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Types of Data Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Types of data • To analyze data, we must understand type of data: • What you can do and can’t do with different types of data • Helping you transforming data from one type to other types • Four types of data: • Nominal data • Ordinal data • Interval data • Ratio data
  • 8.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Nominal data • Unordered group or categories • Nominal values provide only enough information to distinguish one object from another. Without order, cannot say one is better than another • Examples: • Zip codes • Employee ID numbers • Eye color • Types of operating systems • Possible operations: • Mode • Frequency • Chi-squared test sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Ordinal data • Ordered group or categories • Data are ordered in certain ways but intervals between measurements are not meaningful • Examples: • Self-reported data / questionnaire • Website rating (excellent, good, fair, poor) • Possible operations: • Frequency • Mode • Median • Percentiles • Chi-squared test • Wilcoxon rank sum test
  • 9.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Interval data • Continuous data where differences between measurements are meaningful • Zero point on the scale is arbitrary • Examples: • calendar dates • temperature in Celsius or Fahrenheit • Possible operations: • All descriptive statistics (e.g.: mean, standard deviation) • Pearson’s correlation • t and F tests sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Ratio data • Same with interval data, but with meaningful zero • Examples: • Difference between a person of 38 and 35 years old is same as difference between a person of 15 and 12 years old • Time to complete a particular task • Distance, length, weight, height • Possible operations: • All descriptive statistics (e.g.: mean, standard deviation) • Pearson’s correlation • t and F tests • Regression analysis
  • 10.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Other classifications of data • Binary data: A set of just two values (e.g., gender) • Textual data: free form, usually short text data (e.g.: name, address) • Numeric data: True numeric values that allow arithmetic operations (e.g., price, age) sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 8 End of File
  • 11.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Observing Real Case Training Dataset Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Review: types of data • To analyze data, we must understand type of data: • What you can do and can’t do with different types of data • Helping you transforming data from one type to other types • Four types of data: • Nominal data • Ordinal data • Interval data • Ratio data
  • 12.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Put them together sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Our previous example? Categorical Ratio Ratio Categorical Categorical
  • 13.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Data acquisition - Manual labeling Car Not car Car Not car Not car - Observing behavior of machine Machine ID Temperature (OC) Pressure (psi) machine fault 1897 8977 1345 1376 5476 60 100 120 140 160 7.65 26.4 75.7 100.0 125.0 N N Y Y Y sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Data acquisition - Download from website/partnership Amazon Sage Maker Label Studio
  • 14.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Displaying data: understanding types of data [Source: Aurelien Geron, 2021] sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 End of File
  • 15.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Common Problems With Data (Part 01) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Data are messy • Real world data are messy, 80% effort for most AI engineers is to clean and pre-process the data. • Garbage in – garbage out : do not throw your data right away to your machine learning model, because you will get “wrong” results. • Data is not reality: • Humans define phenomenon that they want to measure • They design system to collect data • They clean and pre-process the data • They interpret the results Even with the same datasets, two people can form vastly different conclusions
  • 16.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Noisy dataset Source: https://medium.com/well-red/cleaning-a-messy-dataset-using-python-7d7ab0bf199b
  • 17.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Another problem: undefined goals sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 End of File
  • 18.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Common Problems With Data (Part 02) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Definition error You are developing an AI system for a small department store How do you define ”customer” of your client? Do you need to specify different segment for your customers?
  • 19.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Capture error Without counterbalancing (capture error) With counterbalancing sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Measurement error • Errors occur when the software or hardware to capture the data goes awry. • Example: • You develop a machine learning algorithm to detect Covid-19 based on daily habits using user’s smartphone. • Information about user behavior may be lost if the user experiences connectivity issue.
  • 20.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Sampling error Coverage error Nonresponse error [Source: https://www.geopoll.com/blog/sample-frame-sample-error-research] sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Another consideration about data • Accessibility of the data: • Physically, contractually, ethically • How much do the data cost you / your company? • Agreement / copyright problems • Potential privacy issues (e.g., classified project of government) • Reliability of the data: • Can you trust the label? (e.g., doubt from Mechanical Turk’s labeled data) • Feedback loop: the data used to train the model is obtained using the model itself. (e.g., there are unreliability in data from a website user that clicks on product recommended by a machine learning. There is intervention of machine learning instead of pure interest from the user)
  • 21.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Data leakage Data leakage: unintentional introduction of information about the target that should not be made available. Training on contaminated data leads to overly optimistic expectations about the model performance. [Source: A. Burkov, Machine Learning Engineering, 2020] sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 8 End of File
  • 22.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Machine Learning vs. Data Science Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Machine Learning and Data Science [Source: https://www.edureka.co/blog/what-is-data-science/]
  • 23.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Machine Learning and Data Science Machine Learning Data Science Field of study that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). Science of understanding and finding hidden patterns or useful insights from the data, which helps to take smarter business decisions. It is used for making predictions and classifying the result for new data points. It is used for discovering insights from the data. A machine learning engineer needs to have skills such as computer science fundamentals, programming skills in Python or R, statistics and probability concepts, etc. A data scientist needs to have skills to use big data tools like Hadoop, Hive and Pig, statistics, programming in Python, R, or Scala. It mostly requires structured data to work on. It can work with raw, structured, and unstructured data. Machine learning engineers spend a lot of time in handling and cleansing data, as well as managing the complexities that occur during the implementation of algorithms and mathematical concepts behind that. Data scientists spent lots of time in handling the data, cleansing the data, and understanding its patterns. Software is mainly the product of machine learning project Slide deck or report is mainly the product of data science project sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Machine learning point of view size of house (square meter) # of bedrooms # of bathrooms newly renovated price (IDR 1000,000) 123 1 2 N 300 125 1 3 N 360 346 2 1 N 580 432 3 3 Y 630 547 4 4 N 730 678 5 5 Y 850 Using four features/attributes to predict house price Output of the project: a website or a mobile app of house prediction with AI system inside House price in Yogyakarta (2015-2017)
  • 24.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Data science point of view size of house (square meter) # of bedrooms # of bathrooms newly renovated price (IDR 1000,000) 123 1 2 N 300 125 1 3 N 360 346 2 1 N 580 432 3 3 Y 630 547 4 4 N 730 678 5 5 Y 850 Output of the project: a slide deck / presentation / report House price in Yogyakarta (2015-2017) Insight: • Houses with 3 bathrooms are more expensive than houses with 2 bathrooms of a similar size • Size of house and number of bedrooms are two strong predictors of house price • Renovating houses significantly increase house price sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Data-Driven Decision Making (3DM) End user: Clients or stakeholders Business analysts Data scientists Database administrator/ data engineers Improving accuracy of business decision making Source: A. Tan, “Data management: A foundation of effective data science”, The Capco Institute Journal of Financial Transformation, 2019, p.31
  • 25.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama How data science helps business Source: http://www.keywebmetrics.com/2015/02/big-data-job-without-being-data-scientist/ sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 End of File
  • 26.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Data Science Life Cycle Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Data-Driven Decision Making (3DM) End user: Clients or stakeholders Business analysts Data scientists Database administrator/ data engineers Improving accuracy of business decision making Source: A. Tan, “Data management: A foundation of effective data science”, The Capco Institute Journal of Financial Transformation, 2019, p.31
  • 27.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Data Science Life Cycle Source: https://www.edureka.co/blog/what-is-data-science/ sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 1. Discovery • Understand the problem to be solved and the business goals • Determining success criteria of your solution • Understand the constraints: • requirement of the solutions • costs and budgets • priorities • Understand availability of resources for data analytics: • People • Technology • Time • Data • Understand the characteristics of your data: • Structured or unstructured • Relevance and completeness • Using available dataset for simulating solution (secondary data) or using your own data (primary data)
  • 28.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 2. Data preparation Data preparation encompasses all activities to construct and clean the data set: Handling errors, missing or invalid values Normalization Eliminating duplicate rows Formatting properly (standardization) Combining multiple data sources Transforming data Feature engineering Text analysis Accelerate data preparation by automating common steps Validation Source: https://www.edureka.co/blog/what-is-data-science/ Arguably the most time-consuming step,“80% of the entire DS process, is in data cleaning and preparation” sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 3~4. Model Planning and Building You will determine the methods and techniques to draw the relationships between variables. You will also develop datasets for training and testing purposes. You will consider whether your existing tools will suffice for running the machine learning models or it will need a more robust environment (like fast and parallel processing). You will analyze various machine learning techniques like regression, classification, association and clustering to build the model You perform model planning and building in a simulated environment Source: https://www.edureka.co/blog/what-is-data-science/ Generally, these steps require technical knowledge on programming, statistics, and machine learning
  • 29.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 5. Operationalize You deliver final reports, briefings, code, and technical documents. In addition, sometimes a pilot project is also implemented in a real-time production environment to provide online insight: Integration viability in enterprise architecture Monitoring and maintenance Scalability on real data This will provide you a clear picture of the performance and other related constraints on a small scale before full deployment. Source: https://dotdata.com/blog/data-science-operationalization-what-the-heck-is-it/ https://medium.datadriveninvestor.com/10-dimensions-of-making-data-science-work-part-8-ff1d672b7571 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 6. Communicate results It is important to evaluate if you have been able to achieve your goal that you had planned in the discovery phase. So, in the last phase, you identify all the key findings, communicate to the stakeholders. Determine if the results of the project are a success or a failure based on the criteria developed in discovery phase. Source: https://www.annalect.fi/actionable-insights-data-presentation/
  • 30.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 9 End of File
  • 31.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Workflow of a Data Science Project Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Data Science Life Cycle Source: https://www.edureka.co/blog/what-is-data-science/
  • 32.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Case study: Diabetes Prevention Problem: how can we predict the occurrence of diabetes and take appropriate measures beforehand to prevent it? Goal: predict the “positive” or “negative” diagnosis of diabetes sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Type 1 vs. type 2 diabetes
  • 33.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Attributes: 1. npreg – Number of times pregnant 2. glucose – Plasma glucose concentration 3. bp – Blood pressure 4. skin – Triceps skinfold thickness 5. bmi – Body mass index 6. ped – Diabetes pedigree function 7. age – Age 8. income – Income Step 1: Discovery First, we will collect the data based on the medical history of the patient sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 • Now, once we have the data, we need to clean and prepare the data for data analysis. • This data has a lot of inconsistencies like missing values, blank columns, abrupt values and incorrect data format which need to be cleaned. • Here, we have organized the data into a single table under different attributes – making it look more structured. Step 2: Preparation
  • 34.
    13/04/2022 sunu@ugm.ac.id Copyright © 2022Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Our data after cleansing process sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 Step 3: Model planning • First, we apply various statistical functions on it. We can explore the number of missing values and unique values. • We can also use the summary function which will give us statistical information like mean, median, range, min and max values. • Then, we use visualization techniques like histograms, line graphs, box plots to get a fair idea of the distribution of data. • Observe correlation between features: we may need to use only one among two correlated features
  • 35.
    Step 4: Model building
    • Based on insights derived from the previous step, the best fit for this kind of problem is a decision tree.
    • Since we already have the major attributes for analysis, such as npreg and bmi, we will use a supervised learning technique to build the model.
    • Further, we use a decision tree in particular because it takes all attributes into consideration in one go.
    Step 5 and Step 6: Model building and operationalize
    • Step 5:
        • In this phase, we run a small pilot project to check whether our results are appropriate.
        • We also look for performance constraints, if any. If the results are not accurate, we need to replan and rebuild the model.
    • Step 6:
        • Once we have executed the project successfully, we share the output for full deployment.
        • We should use real data in the deployment environment and observe the model's performance.
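To make Step 4 and the pilot check of Step 5 concrete, here is a minimal sketch of the simplest possible decision tree: a single split (a "decision stump") chosen by Gini impurity, then evaluated on a small pilot set. It is an illustration under stated assumptions, not the model built in the lecture: the data values are made up, only two attributes (glucose, bmi) are used, and all function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (feature_index, threshold) with the lowest weighted Gini impurity."""
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def majority(labels):
    """Most frequent label (ties go to the positive class)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0


def fit_stump(rows, labels):
    """Fit a depth-1 tree; assumes at least one valid split exists."""
    f, t = best_split(rows, labels)
    left = [y for r, y in zip(rows, labels) if r[f] <= t]
    right = [y for r, y in zip(rows, labels) if r[f] > t]
    return {"feature": f, "threshold": t,
            "left": majority(left), "right": majority(right)}


def predict(stump, row):
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


def accuracy(stump, rows, labels):
    hits = sum(predict(stump, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


# Made-up training data: (glucose, bmi) -> 1 = positive, 0 = negative.
train_x = [(85, 26.6), (89, 28.1), (110, 30.0), (137, 43.1), (148, 33.6), (160, 25.0)]
train_y = [0, 0, 0, 1, 1, 1]

# Made-up pilot data, playing the role of the small pilot project in Step 5.
pilot_x = [(95, 31.0), (150, 29.0), (120, 27.0)]
pilot_y = [0, 1, 0]

stump = fit_stump(train_x, train_y)
print("learned rule:", stump)
print("training accuracy:", accuracy(stump, train_x, train_y))
print("pilot accuracy:", accuracy(stump, pilot_x, pilot_y))
```

The gap between training accuracy and pilot accuracy is the signal Step 5 looks for: if the pilot result is not accurate enough, the model needs to be replanned and rebuilt before deployment.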
  • 36.
    End of File
  • 37.
    What Machine Learning Can and Cannot Do – Part 01
    Sunu Wibirama, sunu@ugm.ac.id
    Department of Electrical and Information Engineering, Faculty of Engineering, Universitas Gadjah Mada, Indonesia
    Kecerdasan Buatan | Artificial Intelligence
    Version: January 2022
    Things before implementing machine learning
    • Engineers should develop an intuition for what machine learning can and cannot do.
    • Academic literature tends to report positive results or success stories, so people think that machine learning can do almost everything.
    • To implement a machine learning project successfully, failed cases should be recognized as well, so that we can learn from past experience.
  • 38.
    Supervised learning
    Input               Output            Application
    email               spam? (yes/no)    spam filtering
    audio               text transcript   speech recognition
    ads, user info      click? (yes/no)   online advertising
    Indonesian          Japanese          machine translation
    plate number pics   plate number      computer vision
    Anything you can do with 1 second of thought, we can probably now or soon automate.
    What can machine learning not do?
    • Conduct user experience research with over 500 participants and write an extensive 50-page UX report in 10 seconds.
    • Even a team of humans cannot do that!
  • 39.
    What machine learning can and cannot do
    Case: “The toy that I bought was broken, so I could not give the toy to my daughter for her birthday party. Can I request a refund?”
    • Machine learning can do this: detecting a refund request. Input: text from an email. Output: Refund/Shipping/Other.
    • Machine learning cannot do this: generating a complicated piece of text that empathizes with you, such as “We apologize for the inconvenience. We will forward your request to the finance department. We hope your daughter had a fun and lively birthday party…..”
    Using machine learning to generate an empathetic response?
    • Input (A): user email
    • Output (B): 2-3 paragraph response
    • Training data: 1000 examples
    • Example inputs: “My package was damaged”, “Can I write a review for this product?”, “Do you have any refund policy”, “My shipping is in transit, when will I receive my package?”
    • Example outputs: “Thank you for your email”, “Thank you for your email”, “Thank you for your email”, “Yes now thank we ….”
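The feasible half of this case, sorting an incoming email into Refund/Shipping/Other, can be sketched with a small multinomial naive Bayes classifier. Naive Bayes is our own illustrative choice here; the slides do not name an algorithm. The training emails are made up, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Made-up training emails, three per category.
TRAINING = {
    "Refund": [
        "can i request a refund",
        "i want my money back refund please",
        "the item was broken refund",
    ],
    "Shipping": [
        "when will my package arrive",
        "my shipping is in transit",
        "track my package shipping",
    ],
    "Other": [
        "can i write a review",
        "thank you for the product",
        "do you have a store",
    ],
}


def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(training):
    """Count words per category and collect the vocabulary."""
    counts = {label: Counter() for label in training}
    for label, emails in training.items():
        for email in emails:
            counts[label].update(tokenize(email))
    vocabulary = set()
    for counter in counts.values():
        vocabulary.update(counter)
    n_docs = sum(len(emails) for emails in training.values())
    priors = {label: len(emails) / n_docs for label, emails in training.items()}
    return counts, vocabulary, priors


def classify(text, model):
    """Return the category with the highest log-probability (Laplace smoothing)."""
    counts, vocabulary, priors = model
    scores = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + len(vocabulary)
        score = math.log(priors[label])
        for word in tokenize(text):
            score += math.log((counter[word] + 1) / total)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
email = ("The toy that I bought was broken, so I could not give the toy "
         "to my daughter for her birthday party. Can I request a refund?")
print(classify(email, model))
```

The output is one label from a small, fixed set, which is what makes the task tractable; writing a multi-paragraph empathetic reply has no such small output space.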
  • 40.
    When to use machine learning?
    1. When the problem has a simple objective: detecting an object with less than 1 second of thought (yes/no).
    2. When the problem is too complex for coding: for example, writing rules for detecting spam in your inbox.
    3. When it is a perceptive problem: detecting speech or recognizing people in a video.
    4. When you have a lot of good data: good data are essential to train the machine learning model.
    Best practices
    • Before deciding whether machine learning can tackle a problem:
        • Look at similar projects or case studies: papers, presentations, YouTube videos, books, etc.
    • Ask yourself before building your first prototype:
        • Does the problem have a simple objective?
        • Is the problem too complex to be coded manually?
        • Do you have enough good data to train the AI model?
        • Is it a perceptive problem?
  • 41.
    End of File
  • 42.
    What Machine Learning Can and Cannot Do – Part 02
    Self-driving car vs. intention detection
    • Car detection. Input: scene video from a dashboard camera. Output: position of other cars on the road.
    • Intention detection (figure): construction worker, tourist, biker.
  • 43.
    Why is it difficult?
    Intention detection (figure): construction worker, tourist, biker.
    • Data: the number of ways people can gesture at you is very large (i.e., too many variations), not counting local wisdom.
    • Need for high accuracy: this is critical in some cases. For example, a construction worker's sign requires 100% clarity.
    X-ray diagnosis
    • Can do: diagnose Covid-19 from more than 1000 labeled images.
    • Cannot do: diagnose Covid-19 from 8 images in a medical textbook chapter explaining pneumonia and Covid-19.
    Source: Wang D, Mo J, Zhou G, Xu L, Liu Y (2020) An efficient mixture of deep and machine learning models for COVID-19 diagnosis in chest X-ray images. PLoS ONE 15(11): e0242535. https://doi.org/10.1371/journal.pone.0242535
  • 44.
    Take-home messages
    • Machine learning generally works well when:
        • The problem has a simple objective
        • You have a lot of available data
        • The problem is too complex for coding
    • Machine learning works poorly when:
        • The problem has a complicated objective with a small amount of data
        • The ML model is asked to perform on new types of data
    End of File
  • 45.
    Various Tasks in Machine Learning Projects
    So, what kinds of tasks can machine learning handle?
  • 46.
    Tasks that an ML model can do (1/3)
    • Automate (for example, by taking action on the user’s behalf or by starting or stopping a specific activity on a server),
    • Alert or prompt (for example, by asking the user if an action should be taken or by asking a system administrator if the traffic seems suspicious),
    • Organize, by presenting a set of items in an order that might be useful for a user (for example, by sorting pictures or documents in the order of similarity to a query or according to the user’s preferences),
    • Annotate (for instance, by adding contextual annotations to displayed information, or by highlighting, in a text, phrases relevant to the user’s task),
    Tasks that an ML model can do (2/3)
    • Extract (for example, by detecting smaller pieces of relevant information in a larger input, such as named entities in the text: keywords, proper names, companies, or locations),
    • Recommend (for example, by detecting and showing to a user highly relevant items in a large collection based on the items’ content or the user’s reaction to past recommendations),
    • Classify (for example, by dispatching input examples into one, or several, of a predefined set of distinctly named groups),
    • Quantify (for example, by assigning a number, such as a price, to an object, such as a house),
  • 47.
    Tasks that an ML model can do (3/3)
    • Synthesize (for example, by generating new text, image, sound, or another object similar to the objects in a collection),
    • Answer an explicit question, not an open-ended one (for example, “Does this text describe that image?” or “Are these two images similar?”),
    • Transform its input (for example, by reducing its dimensionality for visualization purposes, paraphrasing a long text as a short abstract, translating a sentence into another language, or augmenting an image by applying a filter to it),
    • Detect a novelty or an anomaly.
    End of File
  • 48.
    Workflow of a Machine Learning Project
    General machine learning project life cycle (figure) [Source: A. Burkov, Machine Learning Engineering, 2020]
  • 49.
    Notes on ML life cycle
    • An engineering project may or may not have a machine learning part.
    • A machine learning project must first have a well-defined goal:
        • What kind of input the ML model uses
        • What kind of output the ML model generates
        • The criteria for success and failure of the ML model
    • The goal of an ML project is not always the same as the business objective. Example:
        • Business objective of Google Mail (Gmail): to make Gmail the most-used email service in the world.
        • ML project objective: to distinguish primary email from promotions with accuracy above 90%.
    • Prioritization of a machine learning project depends on impact and cost.
    Impact of machine learning project
    • The impact of an ML project is high when:
        • ML can replace a complex part of your engineering project
        • There is great benefit in getting inexpensive (but not 100% perfect) predictions
    • Example: predicting student failure during two years of study at a university:
        • Using a rule-based expert system is not feasible with thousands of academic records.
        • Some data are easy to categorize, so all data are sent to an ML algorithm that classifies each case as “straightforward considered” (easy data) or “considered decision” (complicated data). A considered decision needs human intervention.
        • If the ML algorithm makes a mistake, easy data will be classified as “considered decision”. Such mistakes do no harm, since a human can make the decision for these easy data.
    Source: Qazdar, A., Er-Raha, B., Cherkaoui, C. et al. A machine learning algorithm framework for predicting students performance: A case study of baccalaureate students in Morocco. Educ Inf Technol 24, 3577–3589 (2019). https://doi.org/10.1007/s10639-019-09946-8
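One possible way to realize the routing described in the student-failure example is to send a case to a human whenever the model is not confident that it is easy. The sketch below assumes, hypothetically, that the model outputs a probability that a case is “straightforward considered”; the threshold value and the scores are made up, and the cited study may implement the routing differently.

```python
STRAIGHTFORWARD = "straightforward considered"
CONSIDERED = "considered decision"


def route(p_straightforward, threshold=0.8):
    """Route a case to the automatic path only when the model is confident."""
    if p_straightforward >= threshold:
        return STRAIGHTFORWARD
    return CONSIDERED  # handled by a human


# Made-up model scores for five student records.
scores = [0.95, 0.40, 0.85, 0.79, 0.10]
routes = [route(p) for p in scores]

# Share of cases that still need human intervention.
human_share = routes.count(CONSIDERED) / len(routes)
print(routes)
print("share handled by humans:", human_share)
```

Raising the threshold pushes more easy cases to humans, which costs reviewer time but does no harm; lowering it saves time but risks automating a case that needed human judgment.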
  • 50.
    Cost of machine learning project
    • Three factors highly influence the cost of a machine learning project:
        o The difficulty of the problem
            • Whether an implemented algorithm or a software library capable of solving the problem is available
            • Whether significant computational power is needed to build the ML model or to run it in the production environment
        o The cost of data
            • Can the data be generated automatically (i.e., if manual labelling is still needed, then the data are costly)?
            • How many examples are needed to cover the various classes to be classified?
        o The need for accuracy
            • How costly is each wrong prediction (e.g., when detecting Covid-19, we prefer a false positive to a false negative)?
            • What is the lowest accuracy level below which the model becomes impractical?
    End of File
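The point about the cost of each wrong prediction can be made concrete by weighting false negatives more heavily than false positives and comparing decision thresholds. This is a minimal sketch: the labels, model scores, and cost values (a false negative counted as 10 times worse than a false positive) are made-up assumptions, not figures from the lecture.

```python
COST_FALSE_POSITIVE = 1    # hypothetical cost of an unnecessary follow-up test
COST_FALSE_NEGATIVE = 10   # hypothetical cost of missing an infected patient

# Made-up ground truth (1 = infected) and model scores for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.55, 0.8]


def errors(threshold):
    """Count false positives and false negatives at a decision threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return fp, fn


def total_cost(threshold):
    fp, fn = errors(threshold)
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE


def accuracy(threshold):
    fp, fn = errors(threshold)
    return (len(y_true) - fp - fn) / len(y_true)


for threshold in (0.5, 0.3):
    print(threshold, "accuracy:", accuracy(threshold), "cost:", total_cost(threshold))
```

In this toy data both thresholds give the same accuracy, yet the lower threshold is far cheaper because it trades a missed case for an extra false alarm. Accuracy alone therefore does not capture the need stated in the slide.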