2. 1. Introduction
2. Decision trees
3. From a database to an identification tree
4. From trees to rules
5. Remarks
6. Summary
7. References
12/9/2009 2
3. Machine Learning
Supervised Learning
- Learns a function from training data
- The training data consists of (input, output) pairs
- Task: classification (the learned function is applied to unseen data to determine the output for that data)
Unsupervised Learning
- Learns how the data set is distributed
- The training data has no labels (outputs)
- Task: clustering
4. Machine Learning
Supervised Learning
Methods for solving the classification problem:
- Decision trees
- Rules
- Naïve Bayes
- Neural networks
- SVM
- …
Unsupervised Learning
- Learns how the data set is distributed
- The training data has no labels (outputs)
- Task: clustering
5. A decision tree is a hierarchical structure of nodes and branches.
Each node is associated with a set of possible answers.
Each non-leaf node is associated with a test that splits the set of answers into subsets, one per test outcome.
Each branch carries one test outcome from the set of answers down to the corresponding subset.
7. Observed data -> Decision tree -> Decision rules -> Prediction on unseen data
8. There are many algorithms for building decision trees:
Hunt's Algorithm
CART
ID3, C4.5
Identification trees
9. • Database
• Occam's razor
• Building an identification tree
• Average disorder
• Sprouting procedure (Sprouter)
10. The database is a set of samples observed in practice.
The last attribute of each sample is the one used for classification (predicting or identifying the object).
The database is used to train the decision tree.
11. Name   Hair color  Height   Weight   Lotion  Result
    Sarah  blonde      average  light    no      sunburned
    Dana   blonde      tall     average  yes     none
    Alex   brown       short    average  yes     none
    Annie  blonde      short    average  no      sunburned
    Emily  red         average  heavy    no      sunburned
    Pete   brown       tall     heavy    no      none
    John   brown       average  heavy    no      none
    Katie  blonde      short    light    yes     none
12. 3 hair colors x 3 heights x 3 weights x 2 lotion values = 54 possible combinations
Probability that a new sample matches one of the observed samples: 8/54 ≈ 15%
In practice: the number of attributes and attribute values is very large, so a new sample cannot be classified by matching it against the table
Approach:
Devise a procedure that classifies every observed sample correctly
A procedure that is correct on a sufficiently large number of samples is likely to be correct on samples whose class label is unknown
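The counting argument on this slide can be checked with a few lines of Python. This is a minimal sketch: the value lists are read off the sample table, and the match probability assumes that the eight observed samples are distinct and that a new sample is equally likely to be any combination.

```python
from itertools import product

hair = ["blonde", "red", "brown"]
height = ["short", "average", "tall"]
weight = ["light", "average", "heavy"]
lotion = ["no", "yes"]

# Every possible attribute combination.
combinations = list(product(hair, height, weight, lotion))
observed = 8  # number of rows in the sample table

print(len(combinations))                               # 54
print(round(100 * observed / len(combinations), 1))    # 14.8 (percent)
```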
13. Identification tree (1):
Hair color
  blonde: Lotion used
    No: Sarah, Annie
    Yes: Dana, Katie
  red: Emily
  brown: Alex, Pete, John
14. [Figure: identification tree (2), a larger tree that combines tests on Height, Hair color, and Weight to separate the same eight samples]
15. The world is inherently simple [*].
The smallest identification tree that is consistent with the samples is the one most likely to classify unknown objects correctly
Identification tree (1) is better than (2)
How can the smallest identification tree be built?
[*] William Ockham (1285–1349): "Entities should not be multiplied unnecessarily."
16. Only a "small" tree can be built, with no guarantee that it is the "smallest"
Procedure: divide and conquer
Find a test for the root node: split the database into subsets in which as many samples as possible belong to the same class
For each subset that contains more than one class, choose another test that splits the inhomogeneous subset into homogeneous subsets
Stopping conditions:
Every leaf node contains only homogeneous samples
No attribute remains to split on
=> This procedure minimizes the disorder of the data
17. Candidate tests for the root node and the subsets they produce:
Hair color
  blonde: Sarah, Dana, Annie, Katie
  red: Emily
  brown: Alex, Pete, John
Lotion used
  No: Sarah, Annie, Emily, Pete, John
  Yes: Dana, Alex, Katie
Height
  short: Alex, Annie, Katie
  average: Sarah, Emily, John
  tall: Dana, Pete
Weight
  light: Sarah, Katie
  average: Dana, Alex, Annie
  heavy: Emily, Pete, John
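The four candidate splits can be reproduced by grouping the samples on one attribute at a time. This is a minimal sketch: the field names and the `partition` helper are assumptions, and the height and weight values for Emily and Pete follow the groupings shown for the Height and Weight tests.

```python
from collections import defaultdict

FIELDS = ("name", "hair", "height", "weight", "lotion", "result")
ROWS = [
    ("Sarah", "blonde", "average", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "average", "yes", "none"),
    ("Alex", "brown", "short", "average", "yes", "none"),
    ("Annie", "blonde", "short", "average", "no", "sunburned"),
    # Emily and Pete use the groupings shown for the Height and Weight tests.
    ("Emily", "red", "average", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "average", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]
SAMPLES = [dict(zip(FIELDS, row)) for row in ROWS]


def partition(samples, attribute):
    """Group sample names by the value they take on one attribute."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[attribute]].append(sample["name"])
    return dict(groups)


for attribute in ("hair", "lotion", "height", "weight"):
    print(attribute, partition(SAMPLES, attribute))
```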
18. Height: short (Alex, Annie, Katie), average (Sarah, Emily, John), tall (Dana, Pete)
Weight: light (Sarah, Katie), average (Dana, Alex, Annie), heavy (Emily, Pete, John)
Lotion used: No (Sarah, Annie, Emily, Pete, John), Yes (Dana, Alex, Katie)
Hair color: blonde (Sarah, Dana, Annie, Katie), red (Emily), brown (Alex, Pete, John)
20. Hair color: blonde (Sarah, Dana, Annie, Katie), red (Emily), brown (Alex, Pete, John)
Candidate tests for the blonde subset:
Height: short (Annie, Katie), average (Sarah), tall (Dana)
Weight: light (Sarah, Katie), average (Dana, Annie), heavy (no samples)
Lotion used: No (Sarah, Annie), Yes (Dana, Katie)
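21. Average disorder:
$$\text{Average disorder} = \sum_{b} \frac{n_b}{n_t} \times \left( \sum_{c} -\frac{n_{bc}}{n_b} \log_2 \frac{n_{bc}}{n_b} \right)$$
where:
• $n_b$ is the number of samples in branch $b$
• $n_t$ is the total number of samples in all branches
• $n_{bc}$ is the number of samples in branch $b$ that belong to class $c$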
22. Disorder describes how mixed a data set is:
Disorder = 0 if the data set is homogeneous (contains only one class)
Disorder = 1 if the data set is split evenly between two classes
$$\text{Disorder} = \sum_{c} -\frac{n_{bc}}{n_b} \log_2 \frac{n_{bc}}{n_b}$$
[Figure: Disorder (0 to 1.0) plotted against the fraction of samples in one class (0 to 1.0)]
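The two boundary cases on this slide can be checked numerically. This is a minimal sketch of the disorder of a single set; the function name `disorder` and the label strings are assumptions.

```python
from collections import Counter
from math import log2


def disorder(labels):
    """Sum over classes of -(n_bc / n_b) * log2(n_bc / n_b) for one set."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


print(disorder(["none"] * 4))                             # 0.0, homogeneous set
print(disorder(["sunburned", "none"] * 2))                # 1.0, even two-class split
print(round(disorder(["sunburned"] + ["none"] * 3), 3))   # 0.811, in between
```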
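23. Hair color: blonde (Sarah, Dana, Annie, Katie), red (Emily), brown (Alex, Pete, John)
Disorder in the blonde branch set:
$$-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$$
Average disorder of the Hair color test:
$$\frac{4}{8}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{1}{8} \times 0 + \frac{3}{8} \times 0 = 0.5$$
Test   Disorder
Hair   0.5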
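The same calculation can be run for every candidate test. This is a sketch under assumptions: the field names and helper functions are illustrative, and the height and weight values for Emily and Pete follow the groupings shown for the Height and Weight tests.

```python
from collections import Counter, defaultdict
from math import log2

FIELDS = ("name", "hair", "height", "weight", "lotion", "result")
ROWS = [
    ("Sarah", "blonde", "average", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "average", "yes", "none"),
    ("Alex", "brown", "short", "average", "yes", "none"),
    ("Annie", "blonde", "short", "average", "no", "sunburned"),
    # Emily and Pete use the groupings shown for the Height and Weight tests.
    ("Emily", "red", "average", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "average", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]
SAMPLES = [dict(zip(FIELDS, row)) for row in ROWS]


def disorder(labels):
    """Disorder of one branch set."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def average_disorder(samples, attribute):
    """Branch disorders weighted by n_b / n_t."""
    branches = defaultdict(list)
    for sample in samples:
        branches[sample[attribute]].append(sample["result"])
    n_t = len(samples)
    return sum(len(b) / n_t * disorder(b) for b in branches.values())


for attribute in ("hair", "height", "weight", "lotion"):
    print(attribute, round(average_disorder(SAMPLES, attribute), 2))
# The hair test prints 0.5, matching the table on this slide.
```

The test with the lowest average disorder is the one chosen for the node.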
28. Repeat until every leaf node contains only homogeneous samples or no attribute remains to split on:
Select a leaf node whose sample set is inhomogeneous
Replace that leaf node with a test node that splits the inhomogeneous sample set into subsets that are as homogeneous as possible, according to the disorder measure
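A recursive version of this procedure can be sketched as follows. The slide describes a loop over inhomogeneous leaves; the recursion below does the same work depth-first. The tree representation (a label string for a leaf, an `(attribute, branches)` tuple for a test node) and all function names are assumptions.

```python
from collections import Counter, defaultdict
from math import log2

FIELDS = ("name", "hair", "height", "weight", "lotion", "result")
ROWS = [
    ("Sarah", "blonde", "average", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "average", "yes", "none"),
    ("Alex", "brown", "short", "average", "yes", "none"),
    ("Annie", "blonde", "short", "average", "no", "sunburned"),
    # Emily and Pete use the groupings shown for the Height and Weight tests.
    ("Emily", "red", "average", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "average", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]
SAMPLES = [dict(zip(FIELDS, row)) for row in ROWS]


def disorder(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def split(samples, attribute):
    """Group whole samples by their value on one attribute."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[attribute]].append(sample)
    return groups


def average_disorder(samples, attribute):
    return sum(len(g) / len(samples) * disorder([s["result"] for s in g])
               for g in split(samples, attribute).values())


def build_tree(samples, attributes):
    """Return a label for a leaf or (attribute, {value: subtree}) for a test."""
    labels = [s["result"] for s in samples]
    # Stop: the set is homogeneous, or no attribute remains to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the test with the lowest average disorder.
    best = min(attributes, key=lambda a: average_disorder(samples, a))
    rest = [a for a in attributes if a != best]
    return (best, {value: build_tree(group, rest)
                   for value, group in split(samples, best).items()})


tree = build_tree(SAMPLES, ["hair", "height", "weight", "lotion"])
print(tree)
```

On the eight samples this yields a Hair color test at the root and a Lotion test under the blonde branch.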
29. • Extracting rules from the tree
• Eliminating unnecessary rule antecedents
• Eliminating redundant rules
• The Pruner procedure
30. Trace each path from the root to a leaf
Take the tests as antecedents
Take the leaf nodes as conclusions
1. If hair is blonde and lotion is used, then no sunburn
2. If hair is blonde and lotion is not used, then sunburned
3. If hair is red, then sunburned
4. If hair is brown, then no sunburn
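Rule extraction can be sketched as a walk over every root-to-leaf path. The nested representation of the tree (a label string for a leaf, an `(attribute, branches)` tuple for a test node) and the function name `extract_rules` are assumptions.

```python
# The Hair color / Lotion tree, written as nested (attribute, branches) tuples.
TREE = ("hair", {
    "blonde": ("lotion", {"yes": "none", "no": "sunburned"}),
    "red": "sunburned",
    "brown": "none",
})


def extract_rules(node, antecedents=()):
    """Return one (antecedents, conclusion) pair per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the path so far becomes a rule
        return [(list(antecedents), node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, antecedents + ((attribute, value),)))
    return rules


for conditions, conclusion in extract_rules(TREE):
    tests = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {tests} THEN {conclusion}")
```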
31. Everyone who uses lotion avoids sunburn, so the hair-color antecedent of rule 1 is unnecessary
1. If hair is blonde and lotion is used, then no sunburn
2. If hair is blonde and lotion is not used, then sunburned
3. If hair is red, then sunburned
4. If hair is brown, then no sunburn
32. Two rules conclude "sunburned" and two rules conclude "no sunburn"
Two rules can be dropped and replaced by one default rule (applied when no other rule matches)
1. If hair is blonde and lotion is not used, then sunburned
2. If hair is red, then sunburned
3. If no other rule matches, then no sunburn
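The reduced rule set with a default rule can be sketched as an ordered list that is scanned until a rule matches. The attribute names, the `none` label for "no sunburn", and the `classify` helper are assumptions.

```python
# Rules 1 and 2; rule 3 is the default conclusion.
RULES = [
    ({"hair": "blonde", "lotion": "no"}, "sunburned"),
    ({"hair": "red"}, "sunburned"),
]
DEFAULT = "none"


def classify(sample, rules=RULES, default=DEFAULT):
    """Return the conclusion of the first matching rule, else the default."""
    for conditions, conclusion in rules:
        if all(sample.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return default


print(classify({"hair": "blonde", "lotion": "no"}))   # sunburned
print(classify({"hair": "brown", "lotion": "no"}))    # none
```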
33. Create one rule for each path from the root to a leaf of the decision tree
Simplify each rule by removing the antecedents that do not affect the conclusion the tree reaches
Replace the rules that share a common conclusion with one default rule, which fires when no other rule fires
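The simplification step can be sketched with a simple consistency check: an antecedent is dropped if the shortened rule still never contradicts a training sample. The slides do not specify the test to use, so this check, the greedy left-to-right order, and the function names are assumptions. Only the hair, lotion, and result columns are needed here.

```python
# (hair, lotion, result) for the eight samples.
ROWS = [
    ("blonde", "no", "sunburned"), ("blonde", "yes", "none"),
    ("brown", "yes", "none"), ("blonde", "no", "sunburned"),
    ("red", "no", "sunburned"), ("brown", "no", "none"),
    ("brown", "no", "none"), ("blonde", "yes", "none"),
]
SAMPLES = [dict(zip(("hair", "lotion", "result"), row)) for row in ROWS]


def consistent(conditions, conclusion, samples):
    """True if every sample matching the conditions has the given conclusion."""
    return all(s["result"] == conclusion for s in samples
               if all(s[a] == v for a, v in conditions.items()))


def simplify(conditions, conclusion, samples):
    """Greedily drop antecedents whose removal keeps the rule consistent."""
    kept = dict(conditions)
    for attribute in list(conditions):
        trial = {a: v for a, v in kept.items() if a != attribute}
        if consistent(trial, conclusion, samples):
            kept = trial
    return kept


print(simplify({"hair": "blonde", "lotion": "yes"}, "none", SAMPLES))
# {'lotion': 'yes'}
print(simplify({"hair": "blonde", "lotion": "no"}, "sunburned", SAMPLES))
# {'hair': 'blonde', 'lotion': 'no'}
```

With this check, only rule 1 loses an antecedent: lotion use alone is enough to conclude no sunburn on these samples.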
34. Advantages of decision trees:
Easy to build
Classify new samples quickly
Easy to interpret when the tree is small
Disadvantage:
Cannot learn incrementally
35. Problem encountered: overfitting to the training data, caused by:
Noisy data
Missing data
=> Lower accuracy when classifying new samples
36. Handling overfitting:
Pre-pruning: stop adding a branch as soon as it yields a measure below some threshold
▪ How should the threshold be chosen?
Post-pruning: remove branches from the fully grown tree (bottom-up)
▪ Use independent data to test and prune (evaluate with a percentage split or cross-validation)
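Pre-pruning can be sketched by adding a threshold to a recursive tree builder. The slide does not say which measure is compared with the threshold; the reduction in disorder achieved by the best test is assumed here, and `min_gain` is a hypothetical parameter name. The height and weight values for Emily and Pete follow the groupings shown for the Height and Weight tests.

```python
from collections import Counter, defaultdict
from math import log2

FIELDS = ("name", "hair", "height", "weight", "lotion", "result")
ROWS = [
    ("Sarah", "blonde", "average", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "average", "yes", "none"),
    ("Alex", "brown", "short", "average", "yes", "none"),
    ("Annie", "blonde", "short", "average", "no", "sunburned"),
    ("Emily", "red", "average", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "average", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]
SAMPLES = [dict(zip(FIELDS, row)) for row in ROWS]
ATTRIBUTES = ["hair", "height", "weight", "lotion"]


def disorder(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def split(samples, attribute):
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[attribute]].append(sample)
    return groups


def average_disorder(samples, attribute):
    return sum(len(g) / len(samples) * disorder([s["result"] for s in g])
               for g in split(samples, attribute).values())


def build_tree(samples, attributes, min_gain=0.0):
    """Grow a tree, but keep a leaf when the best split gains less than min_gain."""
    labels = [s["result"] for s in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = min(attributes, key=lambda a: average_disorder(samples, a))
    gain = disorder(labels) - average_disorder(samples, best)
    if gain < min_gain:  # pre-pruning: the split is not worth a new branch
        return majority
    rest = [a for a in attributes if a != best]
    return (best, {value: build_tree(group, rest, min_gain)
                   for value, group in split(samples, best).items()})


print(build_tree(SAMPLES, ATTRIBUTES, min_gain=0.0))  # full tree
print(build_tree(SAMPLES, ATTRIBUTES, min_gain=0.5))  # pruned to a single leaf
```

A threshold that is too high prunes useful tests, as the second call shows, which is why the slide raises the question of how to choose it.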
37. According to Occam's razor, the world is simple. Therefore, the simplest explanation that covers all the data is the most effective one
One way to identify/classify data is to use a decision tree, in which each step along the path to a node is a test
Use measures to decide which attribute of the data set to test
After the tree is built, extract rules by traversing the paths from the root to the leaves
Simplify the rule set by removing redundant antecedents and redundant rules
38. 1. Winston P. H., "Artificial Intelligence", Third edition, 1992
2. Russell S., "Artificial Intelligence: A Modern Approach", Second edition, 2003
3. Han J., "Data Mining: Concepts and Techniques", Second edition, 2006
4. Quinlan J. R., "Induction of Decision Trees", Machine Learning 1, 1 (Mar. 1986), 81-106
5. Leech W. J., "A Rule Based Process Control Method with Feedback," Proceedings of the International Conference and Exhibit, Instrument Society of America, 1986