Introduction to Predictive Analytics with case studies

Eakasit Pacharawongsakda, Ph.D.
Co-founder of Data Cube &
Big Data Engineering Program, CITE, DPU
 
28 June 2017 at
The 2nd NIDA Business Analytics and Data Sciences Conference

http://dataminingtrend.com http://facebook.com/datacube.th
About us
• Name: Eakasit Pacharawongsakda
• Education:
  • Ph.D. in Computer Science, Sirindhorn International Institute of Technology (SIIT), Thammasat University
  • Master's degree in Computer Engineering, Kasetsart University
  • Bachelor's degree in Computer Engineering, Kasetsart University (second-class honors)
• Experience:
  • Certified RapidMiner Analyst and Ambassador
  • Research collaboration with Western Digital (Thailand), phase 1, 6 months
  • Co-researcher on a Tourism Authority of Thailand (TAT) project surveying data for in-depth analysis of tourist behavior using data mining
  • Trainer for open source data mining software

About us
• Data Mining book (Thai edition)

About us
RapidMiner Analyst Certification
This is to certify that Eakasit Pacharawongsakda successfully passed the examination for the Certified RapidMiner Analyst.
The RapidMiner Analyst certification level is designed for individuals who wish to demonstrate a fundamental understanding of how RapidMiner software works and is used.
Certified Analyst professionals will be able to prepare data and create predictive models in standard data environments typically found within most analyst positions.
The candidate has proven the ability to:
• Prepare data
• Build predictive models
• Evaluate the model's quality
• Score new data sets
• Deploy data mining models
With: RapidMiner Studio, RapidMiner Server

About us
• www.facebook.com/datacube.th
• www.dataminingtrend.com

Our customers (Financial sector)
Training participants from various organizations

Outline
• Part 1: Introduction to Big Data
• Part 2: Introduction to Data Mining Techniques
• Part 3: Introduction to Predictive Modelling
• Part 4: Unbalanced data problem
• Part 5: Feature Selection
• Part 6: Applications

A working day

source: http://pad1.whstatic.com/images/thumb/a/aa/Reduce-Anxiety-About-Driving-if-You're-a-Teenager-Step-5-Version-2.jpg/aid196018-728px-Reduce-Anxiety-About-Driving-if-You're-a-Teenager-Step-5-Version-2.jpg
07:00 Leave for work

source: http://www.clipartkid.com/images/259/research-and-report-writing-9-23-12-9-30-12-q2r0wg-clipart.jpg
07:45 Still stuck in traffic

08:00 The boss phones to ask about work
source: https://d1ai9qtk9p41kl.cloudfront.net/assets/mc/psuderman/2011_07/text-drive.png

08:05 Crash into another car

10:00 Arrive at the office and carry on working
source: http://stuffpoint.com/anime-and-manga/image/285181-anime-and-manga-girl-working-in-the-computer.jpg

18:00 Stop to do some shopping on the way home

20:00 Arrive home, alone

A working day with Big Data technology

Navigation systems
• Waze app

Self-driving cars
• Waymo (Google self-driving car)

Smart egg tray
• Egg Minder

Stores without checkout queues
• Amazon Go

Technology that makes daily life more convenient

Why are women single?
source: https://pishetshotisak.wordpress.com/2016/12/07/ทำไมผู้หญิงถึงขึ้นคาน-ค/

What is Big Data?
source: http://dataconomy.com/2014/08/infographic-how-to-explain-big-data-to-your-grandmother/

Outline

Business without analytics
image source: http://www.oknation.net/blog/print.php?id=434843

Business with analytics
source: https://www.youtube.com/watch?v=7tAgbni9kpY

Where does data come from?
source: https://www.youtube.com/watch?v=Y_JlkzzhAgw

Where does data come from?
• Data grouped by origin
  • Inside the company/organization
    • Sales transaction data
    • Customer history data
    • Employee history data
  • Outside the company/organization
    • Data from social media
    • News data
    • Image and audio data
source: http://dailyprivacy.files.wordpress.com/2013/02/2012_big_data_study_infographic_600.jpg

Database & warehouse & mining
[Figure: Sales, Accounting, and CRM databases → Extract, Transform, Load (ETL) → Data Warehouse → Data Mining]
image source:https://sites.google.com/a/whps.org/diamond-teamkp/ 
http://www.iconarchive.com/tag/data

Database & warehouse & mining
• Database
  • Stores data and reduces data redundancy; the focus is on storing, inserting, updating, and deleting data
• Data warehouse
  • Gathers data from many databases and transforms it into a consistent form; suited to read-only viewing for summary reports
• Data Mining
  • Analyzes data to find useful relationships or patterns in a database

BI & Data Mining
[Figure: Business Intelligence and Data Mining positioned by Time (Past to Future) and Analytical Approach (Explanatory, Exploratory)]
source: Data Science and Big Data Analytics: Discovering, analyzing, visualizing and presenting data
BI questions
• What happened last quarter?
• How many units were sold?
• Where is the problem? In which situations?
Data Mining questions
• What if … ?
• What will happen next?
• Why is this happening?

What is data mining
• “The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules” – Data Mining Techniques (3rd Edition)
  • Analyzing data to find patterns or relations among the data in a large database
• “Extraction of interesting (non-trivial, previously unknown and potentially useful) information from data in large databases” – Data Mining Concepts & Techniques (3rd Edition)
  • The process of extracting interesting, useful, and previously unknown information from a large database
image sources: https://binarylinks.wordpress.com/tag/data-mining/ 
http://www.amazon.com/Data-Mining-Techniques-Relationship-Management/dp/0470650931

What is data mining
[Figure: data → data mining techniques → useful patterns]
image source:http://www.computerrepairanaheim.net 
https://sites.google.com/a/whps.org/diamond-teamkp/ 
http://meetings2.informs.org/wordpress/analytics2014/2014/04/01/why-oranalytics-people-need-to-know-about-database-technology/

Data Mining Applications
• Examples of data mining in use
source: http://www.youtube.com/watch?v=f2Kji24833Y

• Loyalty cards
  • Track each customer's purchasing behavior through the loyalty card
  • Analyze it and offer special promotions tailored to each person
  • Increase the chance of selling to the customer
  • Encourage customers to buy more, e.g. a special discount for buying today prompts an immediate purchase decision
image source: http://www.positioningmag.com

• Knowing customers' purchasing behavior makes it possible to analyze it and offer special promotions tailored to each person

• Beer and diapers
  • Walmart found that every Friday after 1 p.m., male customers aged 25 to 35 bought beer and diapers more than any other products

• Predicting pregnancy
  • Target analyzed the purchasing behavior of its female customers
  • It found a pattern: a customer who starts buying vitamins, nutritional supplements, or extra cabinets and beds is likely to be in early pregnancy
  • Target then sends promotions to those customers

• Recommending related products
  • amazon.com recommends books related to RapidMiner
  • Netflix recommends movies similar to ones already watched, e.g. Life of Pi

• Google Self-Driving Car
source: https://www.youtube.com/watch?v=8fjNSUWX7nQ

• Airfare price trends

• Predicting employee resignation
[Figure: example model for predicting employee resignation, a decision tree that tests Receive Promotion, Years with firm < 5, and Partner changed job, with leaves Quit and Not Quit]

• Analyzing opinions on various aspects from social networks

• Analyzing opinions on various aspects from social networks (Thai language)

• Predicting age and gender from photos
source: http://www.how-old.net

Outline

Data Science/Data Mining methods
• Steps in analyzing data with data mining techniques
• Association analysis
  • Finds relationships among items that occur together
  • e.g. finding products that are frequently bought together
• Clustering
  • Splits the data into several groups by similarity
  • e.g. grouping customers by usage behavior
• Classification
  • Builds a model from existing data to predict the future
  • e.g. predicting the next day's rainfall

• Association analysis: finding relationships that occur in the data
[Figure: basket 1, basket 2, basket 3]

• Segmentation
image source: Major Development PCL Facebook

Segmentation by RFM
• Segment customers by their purchasing behavior
  • Recency: time (number of days) since the most recent purchase
  • Frequency: how often the customer buys
  • Monetary: how much the customer spends

order detail table (dates are MM-DD-YYYY)
OrderID  Customer ID  Order Date  Total Amount
O14001   C10003       01-01-2014  10.00
O14002   C10001       02-13-2014  20.00
O14003   C10002       03-14-2014  200.00
O14004   C10001       04-15-2014  10.00
O14005   C10001       08-10-2014  30.00
O14006   C10002       09-14-2014  300.00

RFM values (note: computed as of 01/08/2015)
Customer ID  Recency  Frequency  Monetary
C10001       151      3          60
C10002       116      2          500
C10003       372      1          10
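
The RFM values in the slide can be reproduced with a short script. This is a minimal sketch, assuming the order dates are month-first and that the note "01/08/2015" means 8 January 2015, which is the reading that yields the recency values shown; the names `ORDERS`, `AS_OF`, and `compute_rfm` are illustrative only.

```python
from datetime import date

# Order detail rows from the slide: (order id, customer id, order date, total amount).
ORDERS = [
    ("O14001", "C10003", date(2014, 1, 1), 10.00),
    ("O14002", "C10001", date(2014, 2, 13), 20.00),
    ("O14003", "C10002", date(2014, 3, 14), 200.00),
    ("O14004", "C10001", date(2014, 4, 15), 10.00),
    ("O14005", "C10001", date(2014, 8, 10), 30.00),
    ("O14006", "C10002", date(2014, 9, 14), 300.00),
]

# Assumption: the reference date 01/08/2015 is 8 January 2015 (month first).
AS_OF = date(2015, 1, 8)


def compute_rfm(orders, as_of):
    """Return {customer: (recency_days, frequency, monetary)}."""
    summary = {}
    for _order_id, customer, order_date, amount in orders:
        last, freq, money = summary.get(customer, (None, 0, 0.0))
        if last is None or order_date > last:
            last = order_date  # keep the most recent purchase date
        summary[customer] = (last, freq + 1, money + amount)
    return {
        customer: ((as_of - last).days, freq, money)
        for customer, (last, freq, money) in summary.items()
    }


rfm_table = compute_rfm(ORDERS, AS_OF)
for customer in sorted(rfm_table):
    print(customer, *rfm_table[customer])
```

Recency is the gap in days between the reference date and the customer's latest order, so it changes whenever the reference date does, while Frequency and Monetary depend only on the order history.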

Segmentation by RFM
• Segmenting customers with the RFM method
  • Sort the data
    • Recency from low to high
    • Frequency and Monetary from high to low
  • Split the data into 5 groups of equal size (quintiles)
  • Compute the RFM score of each group
source: http://www.b-eye-network.com/view/10256
[Figure: Recency sorted from low to high, Frequency and Monetary sorted from high to low; each column is cut into five bands of 20% of the data, scored 5 at the top down to 1 at the bottom]
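
The scoring step can be sketched as follows. The customer values here are hypothetical (they are not from the slides), ties are not given any special treatment, and `quintile_scores` is an illustrative helper rather than a library function.

```python
def quintile_scores(values, higher_is_better):
    """Return a 1-5 score per value; the best 20% of values score 5."""
    n = len(values)
    # Indices ordered from best to worst.
    order = sorted(range(n), key=lambda i: values[i], reverse=higher_is_better)
    scores = [0] * n
    for rank, index in enumerate(order):
        scores[index] = 5 - (rank * 5) // n
    return scores


# Hypothetical data: ten customers, one value per measure.
recency = [5, 12, 30, 45, 60, 90, 120, 200, 300, 372]      # days, lower is better
frequency = [9, 1, 7, 3, 5, 2, 8, 4, 6, 10]                # orders, higher is better
monetary = [50, 500, 20, 300, 10, 80, 700, 150, 60, 900]   # spend, higher is better

r = quintile_scores(recency, higher_is_better=False)
f = quintile_scores(frequency, higher_is_better=True)
m = quintile_scores(monetary, higher_is_better=True)

# Concatenate the three scores into an RFM code such as "555".
rfm_codes = [f"{a}{b}{c}" for a, b, c in zip(r, f, m)]
print(rfm_codes)
```

Sorting Recency in the opposite direction from Frequency and Monetary is what makes a score of 5 mean "best" on all three measures.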

Segmentation by RFM
• Segmenting customers with the RFM method
  • Customers in each group have different characteristics, for example:
  • Customers with RFM = 555
    • The most valuable customer group
  • Customers who buy often but spend little each time
    • Run campaigns that encourage them to buy higher-priced products (up-selling)
  • Customers who buy only occasionally but buy high-priced products
    • Run campaigns that encourage them to buy more often
[Figure: cube with Recency, Frequency, and Monetary axes, each scored from 1 to 5]

• Clustering: grouping data by similarity
[Figure: six customers, numbered 1 to 6, grouped into three clusters (1 and 6, 4 and 5, 2 and 3); the clusters are labelled customers who make many calls, customers who send many SMS, and customers with low usage]

• Classification (predicting what will happen in the future)

Outline

Classification in daily life
• Weather forecasting
[Figure: today's weather → next day's weather]

• Face recognition
image source: http://www.bloomberg.com/news/articles/2012-06-18/facebook-buys-face-com-adds-facial-recognition-software

• Spam e-mail

Classification example
• Example: spam e-mail classification
• Identify which e-mails are spam
ID Text Type
1 Please call our customer service representative on FREE PHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash spam
2 You have won $1,000 cash or a $2,000 prize! To claim, call 09050000327 spam
3 I'm gonna be home soon and I don't want to talk about this stuff anymore tonight normal
4 Is that seriously how you spell his name? normal
5 Double mins and txts 4 6months FREE Bluetooth on Orange. Available on Sony, Nokia Motorola phones. spam
6 FREE RINGTONE text FIRST to 87131 for a poly or text GET to 87131 for a true tone! spam
7 Sorry, I'll you call later in meeting. normal
8 Congratulations - in this week's competition draw u have won the £1450 prize to claim just call 09050002311 spam
9 Thanks a lot for your wishes on my birthday. Thanks you for making my birthday truly memorable. normal
10 Hello, What are you doing? Did you attend the training course today? normal

• Find keywords that indicate spam e-mail
keywords: Free, Won, Cash
ID Free Won Cash Type
1 Y Y Y spam
2 N Y Y spam
3 N N N normal
4 N N N normal
5 Y N N spam
6 Y N N spam
7 N N N normal
8 N Y N spam
9 N N N normal
10 N N N normal
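
The keyword table can be derived from the message texts mechanically. The sketch below assumes case-insensitive matching on whole words, which is one plausible reading of the slide; `keyword_flags` and the other names are illustrative.

```python
import re

# Message texts from the example table, keyed by ID.
MESSAGES = {
    1: "Please call our customer service representative on FREE PHONE 0808 145 4742 "
       "between 9am-11pm as you have WON a guaranteed £1000 cash",
    2: "You have won $1,000 cash or a $2,000 prize! To claim, call 09050000327",
    3: "I'm gonna be home soon and I don't want to talk about this stuff anymore tonight",
    4: "Is that seriously how you spell his name?",
    5: "Double mins and txts 4 6months FREE Bluetooth on Orange. "
       "Available on Sony, Nokia Motorola phones.",
    6: "FREE RINGTONE text FIRST to 87131 for a poly or text GET to 87131 for a true tone!",
    7: "Sorry, I'll you call later in meeting.",
    8: "Congratulations - in this week's competition draw u have won the £1450 prize "
       "to claim just call 09050002311",
    9: "Thanks a lot for your wishes on my birthday. "
       "Thanks you for making my birthday truly memorable.",
    10: "Hello, What are you doing? Did you attend the training course today?",
}

KEYWORDS = ("free", "won", "cash")


def keyword_flags(text, keywords=KEYWORDS):
    """Return a Y/N flag per keyword, matching whole words case-insensitively."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return tuple("Y" if keyword in words else "N" for keyword in keywords)


features = {msg_id: keyword_flags(text) for msg_id, text in MESSAGES.items()}
for msg_id, flags in features.items():
    print(msg_id, *flags)
```

Turning free text into a fixed set of Y/N attributes is what lets the later slides treat each message as a row of a training table.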

• Build a classification model from training data that carries a label
training data: IDs 1 to 10 of the keyword table (attributes Free, Won, Cash; label Type)
classification model
Free
  = Y: Spam
  = N: Won
    = Y: Spam
    = N: Normal
• Use the model to predict new (unseen) data
unseen data
ID Free Won Cash Type
11 Y Y N ?
12 N Y N ?
prediction results
ID Type
11 spam
12 spam
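
The small tree on these slides (test Free first, then Won) can be written as a function and applied to the unseen rows. This is only a sketch of that one model; the function name `classify` is an assumption for illustration.

```python
# Training rows from the keyword table: (ID, Free, Won, Cash, Type).
TRAINING = [
    (1, "Y", "Y", "Y", "spam"), (2, "N", "Y", "Y", "spam"),
    (3, "N", "N", "N", "normal"), (4, "N", "N", "N", "normal"),
    (5, "Y", "N", "N", "spam"), (6, "Y", "N", "N", "spam"),
    (7, "N", "N", "N", "normal"), (8, "N", "Y", "N", "spam"),
    (9, "N", "N", "N", "normal"), (10, "N", "N", "N", "normal"),
]

# Unseen rows from the slide: (ID, Free, Won, Cash).
UNSEEN = [(11, "Y", "Y", "N"), (12, "N", "Y", "N")]


def classify(free, won):
    """Follow the tree: Free = Y is spam; otherwise Won decides."""
    if free == "Y":
        return "spam"
    return "spam" if won == "Y" else "normal"


predictions = {row_id: classify(free, won) for row_id, free, won, _cash in UNSEEN}
print(predictions)

# The tree also reproduces every label in the training data.
training_hits = sum(classify(free, won) == label for _, free, won, _cash, label in TRAINING)
print(training_hits, "of", len(TRAINING), "training rows classified correctly")
```

The Cash attribute is never consulted, because Free and Won already separate the two classes in this training set.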

Classification & Regression
• Take existing data that already has the answer of interest, called the class, and build a model to find the answer for new (unseen) data
  • The class is a category (nominal)
    • rain or no rain
    • spam email or normal email
• Regression (estimating values)
  • Works like classification, except that the class of interest is a number (numeric)
    • the next day's temperature
    • next quarter's sales

Classification & Regression task
• Building the model and testing its performance
[Figure: the training data (IDs 1 to 3) is used to build the model; the model is then applied to the testing data and the prediction results are compared with the true Type]
prediction results
ID Type Predicted
1 spam spam
2 spam spam
3 normal normal

Performance (classification)
• Performance measures for a classification model
  • Confusion Matrix
    • True Positive (TP), True Negative (TN)
    • False Positive (FP), False Negative (FN)
  • Precision and Recall
  • F-Measure
  • Accuracy
  • ROC Graph & Area Under Curve (AUC)

• Consider the class normal
  • True Positive (TP): the number of predictions that match the actual data for the class under consideration
  • True Negative (TN): the number of predictions that match the actual data for the class not under consideration
  • False Positive (FP): the number of examples wrongly predicted as the class under consideration
  • False Negative (FN): the number of examples wrongly predicted as the class not under consideration
ID Type Predicted
1 spam spam
2 spam spam
3 normal normal
4 normal spam
5 spam spam
6 spam spam
7 normal spam
8 spam normal
9 normal normal
10 normal normal
11 spam spam
12 spam spam
13 spam normal
14 spam normal
15 normal normal
confusion matrix for the class normal (rows: predicted, columns: true)
          normal   spam
normal    TP = 4   FP = 3
spam      FN = 2   TN = 6
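
The four counts can be tallied directly from the 15 prediction rows. A minimal sketch, with `confusion_counts` as an illustrative helper name:

```python
# (actual Type, Predicted) for IDs 1 to 15, copied from the prediction table.
ROWS = [
    ("spam", "spam"), ("spam", "spam"), ("normal", "normal"),
    ("normal", "spam"), ("spam", "spam"), ("spam", "spam"),
    ("normal", "spam"), ("spam", "normal"), ("normal", "normal"),
    ("normal", "normal"), ("spam", "spam"), ("spam", "spam"),
    ("spam", "normal"), ("spam", "normal"), ("normal", "normal"),
]


def confusion_counts(rows, positive):
    """Count TP, TN, FP, FN, treating `positive` as the class under consideration."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in rows:
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


normal_counts = confusion_counts(ROWS, positive="normal")
spam_counts = confusion_counts(ROWS, positive="spam")
print("normal:", normal_counts)
print("spam:  ", spam_counts)
```

Switching the class under consideration swaps the roles of the cells: the TN of one class is the TP of the other, and FP and FN trade places.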


• Precision
  • Of the examples predicted as the class under consideration, the share that were predicted correctly
  • Precision = True Positive / (True Positive + False Positive)
  • Precision for normal
    • Predicted as normal: IDs 3, 8, 9, 10, 13, 14, 15, of which IDs 3, 9, 10, 15 are truly normal
    • 4/7 x 100 = 57.14%
  • Precision for spam
    • Predicted as spam: IDs 1, 2, 4, 5, 6, 7, 11, 12, of which IDs 1, 2, 5, 6, 11, 12 are truly spam
    • 6/8 x 100 = 75%

• Recall
  • Of the examples that actually belong to the class under consideration, the share that were predicted correctly
  • Recall = True Positive / (True Positive + False Negative)
  • Recall for normal
    • Actually normal: IDs 3, 4, 7, 9, 10, 15, of which IDs 3, 9, 10, 15 are predicted normal
    • 4/6 x 100 = 66.67%
  • Recall for spam
    • Actually spam: IDs 1, 2, 5, 6, 8, 11, 12, 13, 14, of which IDs 1, 2, 5, 6, 11, 12 are predicted spam
    • 6/9 x 100 = 66.67%


• F-Measure
  • The harmonic mean of Precision and Recall
  • F-Measure = (2 x Precision x Recall) / (Precision + Recall)
  • F-Measure for normal (Precision = 4/7 x 100 = 57.14%, Recall = 4/6 x 100 = 66.67%)
    • (2 x 57.14 x 66.67) / (57.14 + 66.67) = 61.54%
  • F-Measure for spam (Precision = 75%, Recall = 66.67%)
    • (2 x 75 x 66.67) / (75 + 66.67) = 70.59%


• Accuracy
  • The share of all examples, across every class, that were predicted correctly
  • Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
  • Correct predictions: IDs 1, 2, 3, 5, 6, 9, 10, 11, 12, 15
  • 10/15 x 100 = 66.67%
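
The measures above follow directly from the four confusion-matrix counts. The sketch below plugs in the counts for this example (TP = 4, FP = 3, FN = 2, TN = 6 for the class normal); the function names are illustrative.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Counts for the class normal, taken from the confusion matrix of the example.
TP, FP, FN, TN = 4, 3, 2, 6

p_normal = precision(TP, FP)
r_normal = recall(TP, FN)
f_normal = f_measure(p_normal, r_normal)

# For the class spam the roles swap: its TP is TN above, its FP is FN above, and so on.
p_spam = precision(TN, FN)
r_spam = recall(TN, FP)
f_spam = f_measure(p_spam, r_spam)

acc = accuracy(TP, TN, FP, FN)

print(f"normal: precision={p_normal:.2%} recall={r_normal:.2%} F={f_normal:.2%}")
print(f"spam:   precision={p_spam:.2%} recall={r_spam:.2%} F={f_spam:.2%}")
print(f"accuracy={acc:.2%}")
```

Accuracy is the same whichever class is treated as positive, whereas precision, recall, and F-Measure are reported per class.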


ROC Graph & Area
• Receiver Operating Characteristics (ROC): a graph of the true positive rate (Y axis) against the false positive rate (X axis)
ID Type Predicted Score TP rate FP rate
1 normal spam 0.80 1.00 1.00
2 spam spam 0.85 1.00 0.67
4 normal spam 0.87 0.80 0.67
5 spam spam 0.90 0.80 0.33
6 spam spam 0.92 0.60 0.33
7 normal spam 0.95 0.40 0.33
11 spam spam 0.98 0.40 0.00
12 spam spam 0.99 0.20 0.00
[Figure: ROC Curve drawn point by point from the (FP rate, TP rate) pairs in the table; X axis False Positive Rate (FP rate), Y axis True Positive rate (TP rate), both from 0 to 1.0]

ROC Graph & Area
• A ROC curve that climbs toward a true positive rate of 1 indicates good performance
  • because the model finds many True Positives
[Figure: three ROC curves labelled The best, Good, and Bad; X axis False Positive, Y axis True Positive, both from 0 to 1.0]

ROC Graph & Area
• Area Under Curve (AUC) is the area under the ROC graph
  • The larger the value (the closer to 1), the better
[Figure: two ROC plots with the region under each curve marked AUC; X axis False Positive, Y axis True Positive]
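
As an illustration, the ROC points and the AUC for the eight scored examples in the ROC table can be computed as below. This is a sketch that treats spam as the positive class, lowers the threshold one example at a time from the highest score, and assumes there are no tied scores; `roc_points` and `auc` are illustrative names.

```python
# (ID, actual Type, score for the class spam) from the ROC table.
SCORED = [
    (1, "normal", 0.80), (2, "spam", 0.85), (4, "normal", 0.87),
    (5, "spam", 0.90), (6, "spam", 0.92), (7, "normal", 0.95),
    (11, "spam", 0.98), (12, "spam", 0.99),
]


def roc_points(scored, positive):
    """Return (FP rate, TP rate) points, starting at (0, 0)."""
    n_pos = sum(1 for _, actual, _ in scored if actual == positive)
    n_neg = len(scored) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit examples from the highest score down; each one moves the curve
    # up (a true positive) or to the right (a false positive).
    for _, actual, _ in sorted(scored, key=lambda row: row[2], reverse=True):
        if actual == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(SCORED, positive="spam")
area = auc(points)
print(points)
print(f"AUC = {area:.3f}")
```

The last rows of the table correspond to the first points generated here, because the table lists scores in ascending order while the curve is traced from the strictest threshold downward.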

Validation
• Ways of splitting the data to test a model's performance
  • Self consistency test (use training set)
  • Split test
  • Cross-validation test

Self Consistency test
• Uses the training data itself to test the model's performance
[Figure: the same data set (IDs 1 to 3) serves as both training data and testing data]
prediction results
ID Type Predicted
1 spam spam
2 spam spam
3 normal normal

Split test
• Split the data into 2 sets
  • training data for building the model and testing data for testing it
  • 2/3 of the data is used to build the model (IDs 1 and 2)
  • 1/3 of the data is used to test the model (ID 3)
prediction results
ID Type Predicted
3 normal normal

Cross-validation
• Split the data into N sets, e.g. N = 5 or 10
• Use N-1 sets to build the model and the remaining set to test it; repeat until each of the N sets has been used for testing
Example with three rows (IDs 1 to 3):
• Round 1: IDs 1 and 2 build the model, ID 3 tests it (Type normal, Predicted normal)
• Round 2: IDs 1 and 3 build the model, ID 2 tests it (Type spam, Predicted spam)
• Round 3: IDs 2 and 3 build the model, ID 1 tests it (Type spam, Predicted spam)
Cross-validation
• Example of 5-fold cross-validation
training data
ID Attributes Label
1 X1 spam
2 X2 spam
3 X3 normal
4 X4 spam
5 X5 spam
6 X6 spam
7 X7 spam
8 X8 normal
9 X9 normal
10 X10 normal
11 X11 spam
12 X12 spam
13 X13 normal
14 X14 normal
15 X15 normal
The training data is split into parts 1 to 5, and one model is built in each round:
Round  training parts  testing part
1      2, 3, 4, 5      1
2      1, 3, 4, 5      2
3      1, 2, 4, 5      3
4      1, 2, 3, 5      4
5      1, 2, 3, 4      5
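
The rotation of training and testing parts can be sketched in a few lines. The slide does not say how rows are assigned to the five parts, so this example assumes consecutive blocks of IDs; `k_fold_splits` is an illustrative helper, not a library function.

```python
def k_fold_splits(ids, k):
    """Yield (training ids, testing ids) for each of the k rounds."""
    fold_size, remainder = divmod(len(ids), k)
    folds, start = [], 0
    for i in range(k):
        # Assumption: folds are consecutive blocks; early folds absorb any remainder.
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(ids[start:end])
        start = end
    for i in range(k):
        testing = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, testing


ids = list(range(1, 16))  # the 15 example IDs
splits = list(k_fold_splits(ids, k=5))
for round_number, (training, testing) in enumerate(splits, start=1):
    print(f"round {round_number}: test={testing} train={training}")
```

Every row is used for testing exactly once and for training in the other four rounds, which is what distinguishes cross-validation from a single split test.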

Decision Tree
• Overview of a Decision Tree
Logins 4 weeks
  > 6.5: yes
  < 6.5: Email
    = premium: yes
    = free: Sales 4 weeks
      > 2: yes
      < 2: no
• Root: top internal node
• Internal Node: decision on a variable
• Branch: outcome of a test
• Leaf Node: class label

Decision Tree
• Rules can be generated from a Decision Tree by following each path of the tree
business rules derived from the decision tree model
• IF Logins 4 weeks > 6.5 THEN Response = yes
• IF Logins 4 weeks < 6.5 AND Email = premium THEN Response = yes
• IF Logins 4 weeks < 6.5 AND Email = free AND Sales 4 weeks > 2 THEN Response = yes
• IF Logins 4 weeks < 6.5 AND Email = free AND Sales 4 weeks < 2 THEN Response = no
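
Walking each root-to-leaf path is mechanical, so the rules can be generated from a tree structure. In this sketch the tree is encoded as nested tuples and dictionaries, which is only one possible representation; `TREE` and `extract_rules` are illustrative names.

```python
# The tree from the slide: (attribute, {branch test: subtree or leaf label}).
TREE = ("Logins 4 weeks", {
    "> 6.5": "yes",
    "< 6.5": ("Email", {
        "= premium": "yes",
        "= free": ("Sales 4 weeks", {"> 2": "yes", "< 2": "no"}),
    }),
})


def extract_rules(node, conditions=()):
    """Return one IF ... THEN rule per path from the root to a leaf."""
    if isinstance(node, str):  # a leaf holds the class label
        return ["IF " + " AND ".join(conditions) + " THEN Response = " + node]
    attribute, branches = node
    rules = []
    for test, child in branches.items():
        rules.extend(extract_rules(child, conditions + (f"{attribute} {test}",)))
    return rules


rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```

A tree with four leaves yields exactly four rules, and the length of each rule equals the depth of its leaf.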

Decision Tree
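• A popular technique for classification
• Building a decision tree means choosing the attributes that are most strongly related to the class
• The choice is made by computing Entropy and Information Gain (IG)
Entropy(S) = -[p(c1) × log2 p(c1) + p(c2) × log2 p(c2) + ...], where c1, c2, ... are the classes in the set S
IG(parent, children) = Entropy(parent) – [p(child1) × Entropy(child1) + p(child2) × Entropy(child2) + ...], where p(child) is the fraction of the parent's examples that fall in that child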

Decision Tree
• Shape of the Entropy value
[Figure: Entropy curve; Y axis Entropy from 0.0 to 1.0]

Decision Tree
• Weather data
  • Weather conditions recorded over 14 days, used to decide whether a sports match can be played
ID Outlook Temperature Humidity Windy Play
1 sunny hot high FALSE no
2 sunny hot high TRUE no
3 overcast hot high FALSE yes
4 rainy mild high FALSE yes
5 rainy cool normal FALSE yes
6 rainy cool normal TRUE no
7 overcast cool normal TRUE yes
8 sunny mild high FALSE no
9 sunny cool normal FALSE yes
10 rainy mild normal FALSE yes
11 sunny mild normal TRUE yes
12 overcast mild high TRUE yes
13 overcast hot normal FALSE yes
14 rainy mild high TRUE no

Decision Tree
• Compute the Entropy of all 14 examples
  • p(Play=yes) = 9/14 = 0.64 and p(Play=no) = 5/14 = 0.36
  • -[p(Play=yes) × log2 p(Play=yes) + p(Play=no) × log2 p(Play=no)]
  • -[0.64 × log2(0.64) + 0.36 × log2(0.36)]
  • -[0.64 × -0.64 + 0.36 × -1.47]
  • 0.94

Decision Tree
• Compute the Entropy when Outlook is sunny
  • p(Play=yes) = 2/5 = 0.4 and p(Play=no) = 3/5 = 0.6
  • -[0.4 × log2(0.4) + 0.6 × log2(0.6)]
  • -[0.4 × -1.32 + 0.6 × -0.74]
  • 0.97
• Compute the Entropy when Outlook is overcast
  • p(Play=yes) = 4/4 = 1.0 and p(Play=no) = 0/4 = 0.0
  • -[1.0 × log2(1.0) + 0.0 × log2(0.0)], where 0 × log2(0) is taken to be 0
  • -[1.0 × 0.0 + 0.0]
  • 0.00
• Compute the Entropy when Outlook is rainy
  • p(Play=yes) = 3/5 = 0.6 and p(Play=no) = 2/5 = 0.4
  • -[0.6 × log2(0.6) + 0.4 × log2(0.4)]
  • -[0.6 × -0.74 + 0.4 × -1.32]
  • 0.97

Decision Tree
• Compute the IG when splitting on Outlook
  • Entropy(parent) – [p(Outlook=sunny) × Entropy(Outlook=sunny) + p(Outlook=overcast) × Entropy(Outlook=overcast) + p(Outlook=rainy) × Entropy(Outlook=rainy)]
  • 0.940 – [5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971]
  • 0.940 – 0.694
  • 0.25 (rounded to two decimal places)
  • For the full calculation, see the book An Introduction to Data Mining Techniques
Decision Tree
• Compute the IG of the attribute Temperature
  • Temperature = cool: p(Play=yes) = 3/4 = 0.75, p(Play=no) = 1/4 = 0.25
  • Temperature = hot: p(Play=yes) = 2/4 = 0.5, p(Play=no) = 2/4 = 0.5
  • Temperature = mild: p(Play=yes) = 4/6 = 0.67, p(Play=no) = 2/6 = 0.33
  • IG = 0.03
• Compute the IG of the attribute Humidity
  • Humidity = high: p(Play=yes) = 3/7 = 0.43, p(Play=no) = 4/7 = 0.57
  • Humidity = normal: p(Play=yes) = 6/7 = 0.86, p(Play=no) = 1/7 = 0.14
  • IG = 0.15
• Compute the IG of the attribute Windy
  • Windy = FALSE: p(Play=yes) = 6/8 = 0.75, p(Play=no) = 2/8 = 0.25
  • Windy = TRUE: p(Play=yes) = 3/6 = 0.50, p(Play=no) = 3/6 = 0.50
  • IG = 0.05
attribute IG
Outlook 0.25
Temperature 0.03
Humidity 0.15
Windy 0.05
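
The Entropy and IG calculations above can be checked with a short script. This sketch takes Temperature for IDs 7 and 9 as cool, which is what the slide's counts for Temperature (cool: 3 yes, 1 no; mild: 4 yes, 2 no) imply; the names `WEATHER`, `entropy`, and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")

# The 14 weather examples (IDs 1 to 14, in order).
# Assumption: IDs 7 and 9 have Temperature = cool, matching the slide's counts.
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def entropy(labels):
    """Entropy in bits; classes with zero count contribute nothing."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute_index, label_index=-1):
    """Entropy of the parent minus the weighted entropy of the children."""
    parent = entropy([row[label_index] for row in rows])
    children = {}
    for row in rows:
        children.setdefault(row[attribute_index], []).append(row[label_index])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in children.values())
    return parent - weighted


gains = {name: information_gain(WEATHER, i) for i, name in enumerate(COLUMNS[:-1])}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
root = max(gains, key=gains.get)
print("root attribute:", root)
```

Because only classes that actually occur are counted, a pure child such as Outlook = overcast gets an entropy of 0 without ever evaluating log2(0).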

Decision Tree
• Select the attribute Outlook as the root node, since it has the highest IG, then split each branch that still mixes both classes
Outlook = sunny: p(Play=yes) = 2/5 = 0.4, p(Play=no) = 3/5 = 0.6
  Humidity = high: p(Play=yes) = 0/3 = 0.0, p(Play=no) = 3/3 = 1.0
  Humidity = normal: p(Play=yes) = 2/2 = 1.0, p(Play=no) = 0/2 = 0.0
Outlook = overcast: p(Play=yes) = 4/4 = 1.0, p(Play=no) = 0/4 = 0.0
Outlook = rainy: p(Play=yes) = 3/5 = 0.6, p(Play=no) = 2/5 = 0.4
  Windy = TRUE: p(Play=yes) = 0/2 = 0.0, p(Play=no) = 2/2 = 1.0
  Windy = FALSE: p(Play=yes) = 3/3 = 1.0, p(Play=no) = 0/3 = 0.0

Decision Tree
• Using the model to predict new data
Outlook
  = sunny -> Humidity
      = high -> No
      = normal -> Yes
  = overcast -> Yes
  = rainy -> Windy
      = TRUE -> No
      = FALSE -> Yes
Test data
ID Outlook Temperature Humidity Windy
1 sunny hot high FALSE
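
Applying the tree to a new record means walking from the root to a leaf, following the branch that matches each attribute value. A minimal sketch, assuming the tree from the diagram is stored as nested dictionaries (the `TREE` layout and the `predict` helper are illustrative and do not come from the slides):

```python
# Internal nodes are dicts; leaves are the predicted label as a string.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "sunny": {
            "attribute": "Humidity",
            "branches": {"high": "No", "normal": "Yes"},
        },
        "overcast": "Yes",
        "rainy": {
            "attribute": "Windy",
            "branches": {"TRUE": "No", "FALSE": "Yes"},
        },
    },
}

def predict(node, record):
    """Follow matching branches until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node

# The test record from the slide
record = {"Outlook": "sunny", "Temperature": "hot",
          "Humidity": "high", "Windy": "FALSE"}
prediction = predict(TREE, record)
print(prediction)
```

For this record the walk goes Outlook = sunny, then Humidity = high, and ends at the leaf No. Temperature is never consulted because it does not appear in the tree.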

Decision Tree
• Numeric data
• Sort the numeric values in ascending order
• Split the data into 2 parts at the midpoint between two adjacent values
• Compute the Information Gain of the 2 resulting parts
• Keep the midpoint that gives the highest Information Gain
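
The procedure for numeric attributes can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: the function names are made up for the example, the 14 (Humidity, Play) rows are the ones listed in the slides, and candidate cut points are taken between distinct adjacent values only.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def split_gain(rows, cut):
    """Information gain of splitting (value, label) rows at a cut point."""
    left = [y for x, y in rows if x < cut]
    right = [y for x, y in rows if x > cut]
    n = len(rows)
    return (entropy([y for _, y in rows])
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# (Humidity, Play) rows from the slides, already sorted by Humidity
rows = [(65.0, "no"), (70.0, "no"), (70.0, "yes"), (70.0, "yes"),
        (75.0, "yes"), (78.0, "no"), (80.0, "yes"), (80.0, "no"),
        (80.0, "yes"), (85.0, "yes"), (90.0, "yes"), (90.0, "yes"),
        (95.0, "yes"), (96.0, "no")]

# Midpoints between distinct adjacent values are the candidate cut points
values = sorted({x for x, _ in rows})
cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]

gains = {cut: split_gain(rows, cut) for cut in cuts}
for cut, ig in gains.items():
    print(f"{cut:5.1f}  {ig:.3f}")
```

On these 14 rows the sketch gives about 0.113 at the cut point 67.5 and about 0.025 at 72.5.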

Decision Tree
• Using Humidity = 67.5 as the split point gives IG = 0.11
ID Humidity Play
7 65.0 no
6 70.0 no
9 70.0 yes
11 70.0 yes
13 75.0 yes
3 78.0 no
5 80.0 yes
10 80.0 no
14 80.0 yes
1 85.0 yes
2 90.0 yes
12 90.0 yes
8 95.0 yes
4 96.0 no
Midpoint of 65.0 and 70.0 = 67.5
ID Humidity Play
7 < 67.5 no
6 > 67.5 no
9 > 67.5 yes
11 > 67.5 yes
13 > 67.5 yes
3 > 67.5 no
5 > 67.5 yes
10 > 67.5 no
14 > 67.5 yes
1 > 67.5 yes
2 > 67.5 yes
12 > 67.5 yes
8 > 67.5 yes
4 > 67.5 no

Decision Tree
ID Humidity Play
7 < 72.5 no
6 < 72.5 no
9 < 72.5 yes
11 < 72.5 yes
13 > 72.5 yes
3 > 72.5 no
5 > 72.5 yes
10 > 72.5 no
14 > 72.5 yes
1 > 72.5 yes
2 > 72.5 yes
12 > 72.5 yes
8 > 72.5 yes
4 > 72.5 no

Decision Tree
ID Humidity Play
7 < 76.5 no
6 < 76.5 no
9 < 76.5 yes
11 < 76.5 yes
13 < 76.5 yes
3 > 76.5 no
5 > 76.5 yes
10 > 76.5 no
14 > 76.5 yes
1 > 76.5 yes
2 > 76.5 yes
12 > 76.5 yes
8 > 76.5 yes
4 > 76.5 no

Decision Tree
ID Humidity Play
7 < 79.0 no
6 < 79.0 no
9 < 79.0 yes
11 < 79.0 yes
13 < 79.0 yes
3 < 79.0 no
5 > 79.0 yes
10 > 79.0 no
14 > 79.0 yes
1 > 79.0 yes
2 > 79.0 yes
12 > 79.0 yes
8 > 79.0 yes
4 > 79.0 no

Decision Tree
ID Humidity Play
7 < 82.5 no
6 < 82.5 no
9 < 82.5 yes
11 < 82.5 yes
13 < 82.5 yes
3 < 82.5 no
5 < 82.5 yes
10 < 82.5 no
14 < 82.5 yes
1 > 82.5 yes
2 > 82.5 yes
12 > 82.5 yes
8 > 82.5 yes
4 > 82.5 no

Decision Tree
ID Humidity Play
7 < 87.5 no
6 < 87.5 no
9 < 87.5 yes
11 < 87.5 yes
13 < 87.5 yes
3 < 87.5 no
5 < 87.5 yes
10 < 87.5 no
14 < 87.5 yes
1 < 87.5 yes
2 > 87.5 yes
12 > 87.5 yes
8 > 87.5 yes
4 > 87.5 no

Decision Tree
ID Humidity Play
7 < 92.5 no
6 < 92.5 no
9 < 92.5 yes
11 < 92.5 yes
13 < 92.5 yes
3 < 92.5 no
5 < 92.5 yes
10 < 92.5 no
14 < 92.5 yes
1 < 92.5 yes
2 < 92.5 yes
12 < 92.5 yes
8 > 92.5 yes
4 > 92.5 no

Decision Tree
ID Humidity Play
7 < 95.5 no
6 < 95.5 no
9 < 95.5 yes
11 < 95.5 yes
13 < 95.5 yes
3 < 95.5 no
5 < 95.5 yes
10 < 95.5 no
14 < 95.5 yes
1 < 95.5 yes
2 < 95.5 yes
12 < 95.5 yes
8 < 95.5 yes
4 > 95.5 no

Decision Tree
Cut points and their Information Gain (IG)
Cut point  IG
67.5       0.11
72.5       0.03
76.5       0.00
79.0       0.05
82.5       0.05
87.5       0.01
92.5       0.01
95.5       0.11
The cut points 67.5 and 95.5 give the highest IG (0.11).

Probability
• Probability
• The chance that an event occurs out of all possible outcomes, written P() or Pr()
• Tossing a one-baht coin (heads or tails)
• The probability of heads is 1/2 = 0.5
• The probability of tails is 1/2 = 0.5
• Probability of encountering a spam email
• There are 100 emails in total
• 20 of them are spam
• 80 of them are normal
• The probability that an email is spam is 20/100 = 0.2, or P(spam) = 0.2
• The probability that an email is normal is 80/100 = 0.8, or P(normal) = 0.8
[Figure: all email (100 messages) divided into spam (20 messages) and normal (80 messages)]

Probability
• Joint Probability
• The probability that 2 events occur together
• Example: the probability that an email contains the word Free and is spam
• Notation: P(Free=Y ∩ spam)
[Figure: all email (100 messages) divided into spam (20 messages) and normal (80 messages), overlapped by the set of emails containing the word Free. Labeled regions: probability of the word Free in a normal email; probability of a spam email; probability of the word Free in a spam email]

Naive Bayes
• Build a model to predict spam email
Training data
ID Free Won Cash Type
3 N N N normal
4 N N N normal
7 N N N normal
9 N N N normal
10 N N N normal
1 Y Y Y spam
2 N Y Y spam
5 Y N N spam
6 Y N N spam
8 N Y N spam
Naive Bayes model
P(Type = normal) = 5/10 = 0.50
P(Type = spam) = 5/10 = 0.50
attribute   Type = normal   Type = spam
Free = Y    0/5 = 0.00      3/5 = 0.60
Free = N    5/5 = 1.00      2/5 = 0.40
Won = Y     0/5 = 0.00      3/5 = 0.60
Won = N     5/5 = 1.00      2/5 = 0.40
Cash = Y    0/5 = 0.00      2/5 = 0.40
Cash = N    5/5 = 1.00      3/5 = 0.60

Naive Bayes
• Using the model to predict new data
Naive Bayes model
P(Type = normal) = 5/10 = 0.50
P(Type = spam) = 5/10 = 0.50
attribute   Type = normal   Type = spam
Free = Y    0/5 = 0.00      3/5 = 0.60
Free = N    5/5 = 1.00      2/5 = 0.40
Won = Y     0/5 = 0.00      3/5 = 0.60
Won = N     5/5 = 1.00      2/5 = 0.40
Cash = Y    0/5 = 0.00      2/5 = 0.40
Cash = N    5/5 = 1.00      3/5 = 0.60
New record A
ID Free Won Cash
1 Y Y Y
P(C|A) ∝ P(A|C) x P(C) (the denominator P(A) is the same for every class, so it is omitted)
P(Type = normal|A) ∝ P(Free = Y|Type = normal) x P(Won = Y|Type = normal) x P(Cash = Y|Type = normal) x P(Type = normal)
= 0.00 x 0.00 x 0.00 x 0.50
= 0.00
P(Type = spam|A) ∝ P(Free = Y|Type = spam) x P(Won = Y|Type = spam) x P(Cash = Y|Type = spam) x P(Type = spam)
= 0.60 x 0.60 x 0.40 x 0.50
= 0.07
Type = spam has the highest probability, so the record is predicted as spam.
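
The model table and the prediction in these two slides can be reproduced from the 10 training rows. A minimal sketch, assuming hypothetical helper names `train` and `score`; it uses the same unnormalized product P(A|C) x P(C) as the slide and applies no smoothing.

```python
# (Free, Won, Cash, Type) rows for IDs 1..10 from the training data
training = [
    ("Y", "Y", "Y", "spam"), ("N", "Y", "Y", "spam"),
    ("N", "N", "N", "normal"), ("N", "N", "N", "normal"),
    ("Y", "N", "N", "spam"), ("Y", "N", "N", "spam"),
    ("N", "N", "N", "normal"), ("N", "Y", "N", "spam"),
    ("N", "N", "N", "normal"), ("N", "N", "N", "normal"),
]
ATTRS = ("Free", "Won", "Cash")

def train(rows):
    """Estimate class priors and per-class conditional probabilities by counting."""
    classes = sorted({r[-1] for r in rows})
    prior = {c: sum(r[-1] == c for r in rows) / len(rows) for c in classes}
    cond = {}
    for c in classes:
        members = [r for r in rows if r[-1] == c]
        for i, attr in enumerate(ATTRS):
            for value in ("Y", "N"):
                cond[(attr, value, c)] = (
                    sum(r[i] == value for r in members) / len(members))
    return prior, cond

def score(prior, cond, record):
    """Unnormalized posterior P(A|C) x P(C) for each class."""
    scores = {}
    for c, p in prior.items():
        for attr, value in zip(ATTRS, record):
            p *= cond[(attr, value, c)]
        scores[c] = p
    return scores

prior, cond = train(training)
scores = score(prior, cond, ("Y", "Y", "Y"))
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Because Free = Y never occurs in a normal email in this training set, the normal score collapses to zero regardless of the other attributes. Practical implementations commonly add a smoothing term such as a Laplace correction to avoid this; the sketch leaves it out to match the slide.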

Classification: Balanced data
• Building a model requires training data for it to learn from
• Regular attributes are the attributes, or variables, used to build the model
• The label attribute is the attribute holding the answer we want the model to predict, e.g. spam/normal, response/no response
• Training data should contain an equal, or nearly equal, number of examples for each label (balanced data) so that the model can learn from every label
Balanced training data (attributes: Free, Won, Cash; label: Type)
ID Free Won Cash Type
1 Y Y Y spam
2 N Y Y spam
3 N N N normal
4 N N N normal
5 Y N N spam
6 Y N N spam
7 N N N normal
8 N Y N spam
9 N N N normal
10 N N N normal

Classification: unbalanced data
• In most real-world data the labels do not occur in similar numbers
• Fraud data has 2 labels
• the normal label is very frequent, e.g. 90%
• the fraud label is rare, e.g. 10%
• Promotion response data has 2 labels
• the no label is very frequent, e.g. 95%
• the yes label is rare, e.g. 5%
• Service cancellation (churn) data has 2 labels
• the no churn label is very frequent, e.g. 90%
• the churn label is rare, e.g. 10%

• Data in which the label counts differ greatly is called “imbalanced data” or “unbalanced data”; the labels are called
• majority class: the label with more examples
• minority class: the label with fewer examples
• The minority class is usually the important one, for example
• fraudulent transactions (fraud)
• customers who respond to a promotion (response = yes)
• customers who will cancel the service (churn = yes)
• The problem: some models maximize overall accuracy and therefore predict the majority class for everything. For example, if non-fraudulent (normal) transactions make up 90% of the data (majority class), a model that answers normal for every record still reaches 90% accuracy

• Example
[Figure: a model is built from training data (normal 85%, fraud 15%) and applied to testing data; accuracy = 85%]

Performance of unbalanced data
• Accuracy = 17/20 = 85%
• Recall (normal) = 17/17 = 100%
• Recall (fraud) = 0/3 = 0%
• Precision (normal) = 17/20 = 85%
• Precision (fraud) = 0% (no record was predicted as fraud)
ID Type Predicted
1 fraud normal
2 fraud normal
3 fraud normal
4 normal normal
5 normal normal
6 normal normal
7 normal normal
8 normal normal
9 normal normal
10 normal normal
11 normal normal
12 normal normal
13 normal normal
14 normal normal
15 normal normal
16 normal normal
17 normal normal
18 normal normal
19 normal normal
20 normal normal
Confusion matrix
              true normal  true fraud
pred. normal  17           3
pred. fraud   0            0
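
The figures on this slide follow directly from the 20 (Type, Predicted) pairs. A minimal sketch, assuming a hypothetical helper `precision_recall`; when a class is never predicted, precision is 0/0, which the sketch reports as 0 to match the slide.

```python
# True labels and predictions for IDs 1..20: the model always answers "normal"
actual = ["fraud"] * 3 + ["normal"] * 17
predicted = ["normal"] * 20

def precision_recall(actual, predicted, label):
    """Precision and recall for one label; undefined ratios are reported as 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
normal_precision, normal_recall = precision_recall(actual, predicted, "normal")
fraud_precision, fraud_recall = precision_recall(actual, predicted, "fraud")

print(f"accuracy = {accuracy:.0%}")
print(f"normal: precision = {normal_precision:.0%}, recall = {normal_recall:.0%}")
print(f"fraud:  precision = {fraud_precision:.0%}, recall = {fraud_recall:.0%}")
```

The accuracy looks acceptable, yet recall for fraud is zero: the model never finds the class that matters.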

• Handling imbalanced data
• sampling approach
• under-sampling
• randomly sample the majority class down to fewer examples
• over-sampling
• generate additional examples of the minority class
• cost-sensitive approach
• assign a different weight to each label
• the minority class gets a larger weight
• the majority class gets a smaller weight
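
The three remedies can be sketched in plain Python. This is an illustration under assumptions, not a specific tool's behavior: the function names are made up, over-sampling is done by randomly duplicating minority rows (only one of several ways to generate extra examples), and the weight formula total / (number of classes x class count) is one common inverse-frequency choice.

```python
import random
from collections import defaultdict

def group_by_label(rows):
    """Group rows by their last field, which holds the label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[-1]].append(row)
    return groups

def under_sample(rows, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    size = min(len(g) for g in groups.values())
    return [row for g in groups.values() for row in rng.sample(g, size)]

def over_sample(rows, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    size = max(len(g) for g in groups.values())
    return [row for g in groups.values()
            for row in g + rng.choices(g, k=size - len(g))]

def class_weights(rows):
    """Inverse-frequency weights: rarer labels receive larger weights."""
    groups = group_by_label(rows)
    return {label: len(rows) / (len(groups) * len(g))
            for label, g in groups.items()}

# 3 fraud rows and 17 normal rows, as (ID, Type)
rows = [(i, "fraud") for i in range(1, 4)] + [(i, "normal") for i in range(4, 21)]

print(len(under_sample(rows)), len(over_sample(rows)), class_weights(rows))
```

Under-sampling discards majority examples and over-sampling by duplication repeats minority examples, so both change the data the model sees; class weights leave the data untouched and change how errors are counted instead.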

Sampling approach
[Figure: unbalanced data, rebalanced by under-sampling and by over-sampling]

Attribute (Feature) Selection
• Classification performance depends on the attributes (features) used
• Attribute selection picks the attributes (features) that matter for building the model
• choose attributes that are strongly correlated with the label attribute
• choose attributes that are weakly correlated with each other
• Attribute selection is suited to
• data with a large number of attributes, e.g. text mining
• models that take a long time to build

• Attribute selection
• makes the model perform better, because irrelevant attributes are removed
• makes processing faster, because fewer attributes remain

Attribute selection falls into 2 approaches
Attribute Selection
  Filter Approach: Information Gain, Chi-square
  Wrapper Approach: Forward Selection, Backward Elimination

• Attribute selection falls into 2 approaches
• Filter approach: compute a weight (or correlation) for each attribute and keep only the important attributes
• Wrapper approach: score attribute subsets by using a classification model to measure their performance
Attribute Selection: Filter Approach
All attributes in the training data
ID Free Won Cash Call Service Type
1 Y Y Y Y Y spam
2 N Y Y Y N spam
-> compute weight ->
Attributes after selection
ID Free Won Type
1 Y Y spam
2 N Y spam
Attribute Selection: Wrapper Approach
All attributes in the training data
ID Free Won Cash Call Service Type
1 Y Y Y Y Y spam
2 N Y Y Y N spam
-> classification model ->
Attributes after selection
ID Free Won Type
1 Y Y spam
2 N Y spam
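
A filter approach can be sketched by weighting each attribute independently of any classifier and keeping the top-ranked ones. The example below is an assumption-laden illustration: it uses information gain as the weight, the 10-row spam training set from the Naive Bayes slides (which has only Free, Won and Cash, not Call or Service), and an arbitrary choice of keeping the top 2 attributes.

```python
from collections import Counter
from math import log2

# (Free, Won, Cash, Type) rows for IDs 1..10
ROWS = [
    ("Y", "Y", "Y", "spam"), ("N", "Y", "Y", "spam"),
    ("N", "N", "N", "normal"), ("N", "N", "N", "normal"),
    ("Y", "N", "N", "spam"), ("Y", "N", "N", "spam"),
    ("N", "N", "N", "normal"), ("N", "Y", "N", "spam"),
    ("N", "N", "N", "normal"), ("N", "N", "N", "normal"),
]
ATTRS = ("Free", "Won", "Cash")

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """Information gain of the attribute in column `col` with respect to the label."""
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

# Weight every attribute, then keep the 2 with the largest weight
weights = {attr: info_gain(ROWS, i) for i, attr in enumerate(ATTRS)}
selected = sorted(weights, key=weights.get, reverse=True)[:2]
print(weights, selected)
```

On this data Free and Won receive the same weight and both rank above Cash, so the filter keeps Free and Won.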

Wrapper Approach
• Attributes are added to or removed from the set used to build the model, and the best-performing set of attributes is kept
• Using the Free attribute only
ID Free Won Cash Type
1 Y Y Y spam
2 N Y Y spam
3 N N N normal
4 N N N normal
5 Y N N spam
6 Y N N spam
7 N N N normal
8 N Y N spam
9 N N N normal
10 N N N normal
ID Free Type
1 Y spam
2 N spam
3 N normal
4 N normal
5 Y spam
6 Y spam
7 N normal
8 N spam
9 N normal
10 N normal

Wrapper Approach
• Using the Won attribute only
ID Won Type
1 Y spam
2 Y spam
3 N normal
4 N normal
5 N spam
6 N spam
7 N normal
8 Y spam
9 N normal
10 N normal

Wrapper Approach
• Using the Cash attribute only
ID Cash Type
1 Y spam
2 Y spam
3 N normal
4 N normal
5 N spam
6 N spam
7 N normal
8 N spam
9 N normal
10 N normal

Wrapper Approach
• Using the Free and Won attributes
ID Free Won Type
1 Y Y spam
2 N Y spam
3 N N normal
4 N N normal
5 Y N spam
6 Y N spam
7 N N normal
8 N Y spam
9 N N normal
10 N N normal
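
The wrapper idea shown across these slides (try Free, then Won, then Cash, then Free and Won) can be sketched as greedy forward selection. Everything below is an assumption made for illustration: the classifier is a simple majority-vote lookup over the selected attribute values, and it is scored by accuracy on the training rows themselves, whereas a real wrapper would normally score each subset on held-out data or by cross-validation.

```python
from collections import Counter, defaultdict

# (Free, Won, Cash, Type) rows for IDs 1..10
ROWS = [
    ("Y", "Y", "Y", "spam"), ("N", "Y", "Y", "spam"),
    ("N", "N", "N", "normal"), ("N", "N", "N", "normal"),
    ("Y", "N", "N", "spam"), ("Y", "N", "N", "spam"),
    ("N", "N", "N", "normal"), ("N", "Y", "N", "spam"),
    ("N", "N", "N", "normal"), ("N", "N", "N", "normal"),
]
ATTRS = ("Free", "Won", "Cash")

def accuracy(rows, cols):
    """Training accuracy of a majority-vote lookup classifier on the given columns."""
    votes = defaultdict(Counter)
    for r in rows:
        votes[tuple(r[c] for c in cols)][r[-1]] += 1
    # Each distinct value combination predicts its most frequent label
    return sum(max(v.values()) for v in votes.values()) / len(rows)

def forward_selection(rows, n_attrs):
    """Add the attribute that improves accuracy most; stop when nothing improves."""
    selected, best = [], 0.0
    while len(selected) < n_attrs:
        candidates = [c for c in range(n_attrs) if c not in selected]
        scored = [(accuracy(rows, selected + [c]), c) for c in candidates]
        top_score, top_col = max(scored, key=lambda s: s[0])
        if top_score <= best:
            break
        selected.append(top_col)
        best = top_score
    return [ATTRS[c] for c in selected], best

single = {attr: accuracy(ROWS, [i]) for i, attr in enumerate(ATTRS)}
chosen, chosen_accuracy = forward_selection(ROWS, len(ATTRS))
print(single, chosen, chosen_accuracy)
```

With this toy classifier, Free or Won alone classifies 8 of the 10 rows correctly, Cash alone 7 of 10, and Free together with Won all 10, so forward selection stops after adding those two.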
