This document provides an overview of machine learning, including definitions, common applications, and examples of companies using machine learning. It discusses how BuildFax, a company that provides building permit data and services to industries like insurance, used Amazon Machine Learning to build more accurate predictive models for roof age and job cost estimates. By leveraging Amazon ML, BuildFax was able to build models much faster and provide more precise, property-specific predictions to customers through APIs.
What is Feature Engineering?
Feature engineering is the process of selecting, extracting, and
transforming the most relevant features from raw data so that machine
learning models become more accurate and efficient.
In the context of machine learning, features are individual measurable
properties or characteristics of the data that serve as inputs to the
learning algorithms. The goal of feature engineering is to transform the
raw data into a format that captures the underlying patterns
and relationships in the data, thereby enabling the machine learning
model to make accurate predictions or classifications.
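The transformation described above can be illustrated with a short sketch. The record fields (`signup_date`, `purchases`, `total_spent`) and the derived features are hypothetical, invented for this example rather than taken from any particular dataset:

```python
from datetime import date

# Hypothetical raw records; field names and values are invented.
raw_records = [
    {"signup_date": "2023-01-15", "purchases": 3, "total_spent": 89.97},
    {"signup_date": "2021-06-02", "purchases": 12, "total_spent": 540.00},
]

def engineer_features(record, today=date(2023, 6, 1)):
    """Turn one raw record into a list of numeric model inputs."""
    y, m, d = map(int, record["signup_date"].split("-"))
    tenure_days = (today - date(y, m, d)).days               # derived feature: account age
    avg_order = record["total_spent"] / record["purchases"]  # ratio feature
    return [tenure_days, record["purchases"], avg_order]

features = [engineer_features(r) for r in raw_records]
# Each record is now a fixed-length numeric vector a model can consume.
```

The point is not these specific features but the shape of the step: raw, heterogeneous fields go in, and a fixed-length numeric vector that exposes the relevant structure comes out.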
Data Science For Beginners | Who Is A Data Scientist? | Data Science Tutorial... | Edureka!
This document outlines an agenda for a data science training presentation. The agenda includes sections on why data science, what data science is, who a data scientist is, what they do, how to solve problems in data science, data science tools, and a demo. Key points are that data science uses tools, algorithms and machine learning to discover patterns in raw data and gain insights. It involves tasks like processing, cleaning, mining and modeling data, as well as communicating results. The problem solving process involves discovery, preparation, planning, building, operationalizing and communicating models.
Machine learning is a branch of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" with data, without being explicitly programmed. The goal of machine learning is to build programs that can teach themselves to grow and change when exposed to new data. There are supervised, unsupervised, and reinforcement learning techniques used in machine learning applications across many fields including computer vision, speech recognition, robotics, healthcare, and finance.
Unlocking the Power of Generative AI: An Executive's Guide | PremNaraindas1
Generative AI is here, and it can revolutionize your business. With its powerful capabilities, this technology can help companies create more efficient processes, unlock new insights from data, and drive innovation. But how do you make the most of these opportunities?
This guide will provide you with the information and resources needed to understand the ins and outs of Generative AI, so you can make informed decisions and capitalize on the potential. It covers important topics such as strategies for leveraging large language models, optimizing MLOps processes, and best practices for building with Generative AI.
Data Science Training | Data Science Tutorial | Data Science Certification | ... | Edureka!
This Edureka Data Science Training will help you understand what Data Science is, and you will learn about the different Data Science components and concepts. This tutorial is ideal both for beginners and for professionals who want to learn or brush up on their Data Science concepts. Below are the topics covered in this tutorial:
1. What is Data Science?
2. Job Roles in Data Science
3. Components of Data Science
4. Concepts of Statistics
5. Power of Data Visualization
6. Introduction to Machine Learning using R
7. Supervised & Unsupervised Learning
8. Classification, Clustering & Recommenders
9. Text Mining & Time Series
10. Deep Learning
To take a structured training on Data Science, you can check complete details of our Data Science Certification Training course here: https://goo.gl/OCfxP2
Artificial Intelligence in Practice - Part 1 | GMR Group
The summary is in five parts; this is Part 1.
Cyber-solutions to real-world business problems: Artificial Intelligence in Practice is a fascinating look into how companies use AI and machine learning to solve problems. Presenting 50 case studies of actual situations, this book demonstrates practical applications to issues faced by businesses around the globe.
• The rapidly evolving field of artificial intelligence has expanded beyond research labs and computer science departments and made its way into the mainstream business environment.
• Artificial intelligence and machine learning are cited as the most important modern business trends to drive success.
• It is used in areas ranging from banking and finance to social media and marketing.
• This technology continues to provide innovative solutions to businesses of all sizes, sectors and industries.
• This engaging and topical book explores a wide range of cases illustrating how businesses use AI to boost performance, drive efficiency, analyse market preferences and many others.
• This detailed examination provides an overview of each company, describes the specific problem and explains how AI facilitates resolution.
• Each case study provides a comprehensive overview, including some technical details as well as key learning summaries:
o Understand how specific business problems are addressed by innovative machine learning methods
o Explore how current artificial intelligence applications improve performance and increase efficiency in various situations
o Expand your knowledge of recent AI advancements in technology
o Gain insight into the future of AI and its increasing role in business and industry
o Artificial Intelligence in Practice: How 50 Successful Companies Used Artificial Intelligence to Solve Problems is an insightful and informative exploration of the transformative power of technology in 21st-century commerce
The term Machine Learning was coined in 1959 by Arthur Samuel, an American pioneer in the fields of computer gaming and artificial intelligence, who stated that it “gives computers the ability to learn without being explicitly programmed”. In 1997, Tom Mitchell gave a well-posed mathematical and relational definition: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E”.
Machine learning is needed for tasks that are too complex for humans to code directly. So instead, we provide a large amount of data to a machine learning algorithm and let the algorithm work it out by exploring that data and searching for a model that will achieve what the programmers have set it out to achieve.
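Mitchell's E/T/P definition can be made concrete with a toy sketch: the task T is predicting y from x, the experience E is a set of observed (x, y) pairs, and the performance P, here how close the fitted slope is to the truth, improves with more experience. The data-generating rule (y = 2x plus noise) and the one-parameter model are invented for illustration:

```python
import random

random.seed(0)  # deterministic toy data

# Experience E: observed (x, y) pairs drawn from y = 2x + noise.
def sample(n):
    pairs = []
    for _ in range(n):
        x = random.uniform(0, 10)
        pairs.append((x, 2 * x + random.gauss(0, 1)))
    return pairs

# Task T: predict y from x with a one-parameter model y = slope * x.
def fit_slope(pairs):
    # Least-squares slope through the origin.
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den

small_model = fit_slope(sample(5))    # little experience
large_model = fit_slope(sample(500))  # much experience
# Performance P: the estimate tightens around the true slope (2) as E grows.
```

No rule for predicting y was ever written down by the programmer; the algorithm recovers it by searching for the model that best fits the data it was given.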
Machine learning is a branch of artificial intelligence in which computers study algorithms. In simple terms, machine learning is the study of computer algorithms that allow programs to learn from their experience. The next question, then, is what an algorithm is.
https://www.viewofpeoples.xyz/2020/08/What-is-machine-learning.html
The world today is evolving and so are the needs and requirements of people. Furthermore, we are witnessing a fourth industrial revolution of data.
Machine Learning has revolutionized industries like medicine, healthcare, manufacturing, banking, and several other industries. Therefore, Machine Learning has become an essential part of modern industry.
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth... | Edureka!
Machine Learning Training with Python: https://www.edureka.co/python
This Edureka Machine Learning tutorial (Machine Learning Tutorial with Python Blog: https://goo.gl/fe7ykh ) on "AI vs Machine Learning vs Deep Learning" talks about the differences and the relationship between AI, Machine Learning, and Deep Learning. Below are the topics covered in this tutorial:
1. AI vs Machine Learning vs Deep Learning
2. What is Artificial Intelligence?
3. Example of Artificial Intelligence
4. What is Machine Learning?
5. Example of Machine Learning
6. What is Deep Learning?
7. Example of Deep Learning
8. Machine Learning vs Deep Learning
Machine Learning Tutorial Playlist: https://goo.gl/UxjTxm
The document describes an upcoming two-day workshop on how artificial intelligence is transforming business, hosted by the Confederation of Indian Industry and facilitated by DataMites. The workshop will take place on February 20-21, 2019 in Chennai, India and will be presented by AI expert Ashok Kumar Adinarayanan. The document provides an agenda and overview of topics that will be covered, including the evolution and history of AI, machine learning applications, case studies of AI transforming industries like banking and healthcare.
The State of Artificial Intelligence in 2018: A Good Old Fashioned Report | Nathan Benaich
This document provides a summary of the state of artificial intelligence (AI) research and developments over the past year. It covers key areas like research breakthroughs, talent, industries utilizing AI, and public policy issues related to AI. The document is produced by two authors in East London as a way to capture the progress of AI and spark discussion about its implications. It includes sections on research breakthroughs in areas like transfer learning, advances in hardware that have enabled progress, and the use of video datasets to help machines understand scenes and actions to gain a level of common sense.
Artificial Intelligence for Product Managers by former Yahoo! PM | Product School
The number of jobs requiring artificial intelligence skills in the US has grown 450% in the last five years. Corporations are relentlessly seeking product leaders who can apply AI technologies in their products and services to improve the company's bottom line or top line. It's called the Fourth Industrial Revolution, and it is happening right here, right now.
However, as a Product Manager, how do you gain the necessary knowledge to analyze, understand, plan, and design products based on artificial intelligence technologies? Since you cannot get a college degree in AI Product Management, how do you adapt to this rapid change? In this talk, Adnan helped to answer these questions.
The AI Index Report 2023 provides the following key highlights from its research and development chapter:
1. The US and China have the most cross-country AI research collaborations, though the rate of growth has slowed in recent years.
2. Global AI research output has more than doubled since 2010, led by areas like machine learning, computer vision and pattern recognition.
3. China now leads in total AI research publications, while the US still leads in conference and repository citations but these leads are decreasing.
4. Industry now produces far more significant AI models than academia, as building state-of-the-art systems requires greater resources that industry can provide.
5. Large language models
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. They are used in applications such as email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers, and the study of mathematical optimization delivers methods, theory, and application domains to the field. Data mining is a field of study within machine learning that focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is the study of computer systems that learn from data and experience. It is applied in an incredibly wide variety of areas, from medicine to advertising and from military to pedestrian uses; any area in which you need to make sense of data is a potential customer of machine learning.
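The email-filtering application mentioned above can be sketched as a minimal Naive-Bayes-style word-count classifier built from training data. The four messages below are invented for illustration, not drawn from any real corpus:

```python
from collections import Counter

# Toy "training data" for spam filtering (invented messages).
train = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to noon", "ham"),
    ("lunch at noon today", "ham"),
]

# The "model" is just per-class word counts learned from the data.
counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    counts[label].update(text.split())

def classify(text):
    """Score each class by Laplace-smoothed word frequencies."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        score = 1.0
        for word in text.split():
            score *= (c[word] + 1) / (total + len(c))
        scores[label] = score
    return max(scores, key=scores.get)
```

Nothing in `classify` mentions spam explicitly; its behaviour comes entirely from the training data, which is the sense in which the program was not "explicitly programmed" for the task.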
A Framework for Navigating Generative Artificial Intelligence for Enterprise | RocketSource
Generative AI offers both opportunities and risks for enterprises. While it could drive significant ROI through personalized experiences, thought leadership, and faster processes, there are also concerns about job losses, overreliance on automation without oversight, and inaccurate information. Effective adoption of generative AI requires experience management strategies like understanding emotional and logical customer triggers, aligning products and services to experience channels, and building a business model around a compelling brand story. A people-first approach is important to maximize benefits and mitigate risks.
Impact of Artificial Intelligence in the IT Industry | Anand SFJ
https://sfjbstraining.com/product/artificial-intelligence-course
Artificial Intelligence not only transforms traditional computing methods but also has an impact on various industries, and it is software that makes Artificial Intelligence especially relevant to the IT sector.
An expert in prompt engineering provides guidelines on designing effective prompts for natural language models. The document discusses prompt engineering principles, what makes a good prompt, and various prompt frameworks including priming, focused prompts, and practical everyday prompts. Effective prompts are clear, concise, unambiguous, and provide the necessary context and task to generate a desired response from a model. Iteration and adapting the prompt based on the response is important.
Machine learning involves using algorithms and large datasets to allow systems to learn from data and improve their performance. There are several types of machine learning including supervised learning for classification and prediction tasks using labeled examples, unsupervised learning like clustering to find hidden patterns in unlabeled data, and reinforcement learning where an agent learns from delayed rewards. Applications of machine learning span many domains like retail for customer segmentation, finance for credit scoring, medicine for diagnosis, and web mining for search engines. The field is growing rapidly due to increased data and computing power enabling complex models to be learned from data rather than being explicitly programmed.
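The unsupervised case mentioned above, clustering, can be sketched with a tiny one-dimensional k-means. The six data points and the choice of k = 2 are illustrative assumptions:

```python
# Toy 1-D k-means with k = 2; the data points are invented and clearly
# fall into a "low" group and a "high" group.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]

def kmeans_1d(points, iters=10):
    centers = [min(points), max(points)]  # simple initialisation
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[], []]
        for p in points:
            nearest = min((0, 1), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster's mean.
        # (This toy assumes neither cluster ever becomes empty.)
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

centers, clusters = kmeans_1d(points)
# No labels were provided; the grouping was found in the data itself.
```

Contrast this with supervised learning, where every training point would already carry a label; here the algorithm discovers the two groups on its own.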
Active learning is a machine learning technique where the learner is able to interactively query the oracle (e.g. a human) to obtain labels for new data points in an effort to learn more accurately from fewer labeled examples. The learner selects the most informative samples to be labeled by the oracle, such as samples closest to the decision boundary or where models disagree most. This allows the learner to minimize the number of labeled samples needed, thus reducing the cost of training an accurate model. Suggested improvements include querying batches of samples instead of single samples and accounting for varying labeling costs.
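The "closest to the decision boundary" strategy described above (uncertainty sampling) can be sketched with a toy one-dimensional probabilistic classifier. The threshold model and the candidate pool are invented for illustration:

```python
import math

def predict_proba(x, threshold):
    """Toy model: probability of the positive class, soft around the threshold."""
    return 1 / (1 + math.exp(-(x - threshold)))

def most_uncertain(pool, threshold):
    """Pick the unlabeled point whose predicted probability is closest to 0.5,
    i.e. the point closest to the decision boundary."""
    return min(pool, key=lambda x: abs(predict_proba(x, threshold) - 0.5))

pool = [0.1, 2.0, 4.9, 9.5]                  # unlabeled candidate points
query = most_uncertain(pool, threshold=5.0)  # ask the oracle to label this one
```

Here the learner would send `query` (4.9, the point nearest the boundary at 5.0) to the oracle, fold the returned label into its training set, and repeat; the batch-mode improvement mentioned above would select several such points per round instead of one.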
An Introduction to Supervised Machine Learning and Pattern Classification: Th... | Sebastian Raschka
The document provides an introduction to supervised machine learning and pattern classification. It begins with an overview of the speaker's background and research interests. Key concepts covered include definitions of machine learning, examples of machine learning applications, and the differences between supervised, unsupervised, and reinforcement learning. The rest of the document outlines the typical workflow for a supervised learning problem, including data collection and preprocessing, model training and evaluation, and model selection. Common classification algorithms like decision trees, naive Bayes, and support vector machines are briefly explained. The presentation concludes with discussions around choosing the right algorithm and avoiding overfitting.
This document provides an overview of machine learning through a case study submitted by computer science students. It discusses the history and evolution of machine learning from its early development in the 1940s-50s to major advances in the 21st century. The document also defines key machine learning terms, describes the typical machine learning process and steps involved, and lists different types of machine learning problems and algorithms. It aims to give readers a comprehensive introduction to the field of machine learning.
GPT-4 is the newest version of OpenAI's language model that can understand and generate natural language. It shows improvements over GPT-3.5 in its ability to take visual inputs, be steered more precisely by the user, refuse unsafe requests, and score higher on factual benchmarks. Potential applications of GPT-4 include customer service, translation, content creation, and research. However, its adoption may displace some jobs and raises ethical issues that need addressing through education, job retraining, and responsible development of the technology.
This document provides an overview of data mining concepts and techniques. It summarizes a book on data mining that covers topics such as data preprocessing, data warehousing, online analytical processing (OLAP), and data cube computation. The book is in its third edition and is intended to help readers understand fundamental data mining concepts and how to apply different techniques.
Unlocking the Power of Generative AI An Executive's Guide.pdfPremNaraindas1
Generative AI is here, and it can revolutionize your business. With its powerful capabilities, this technology can help companies create more efficient processes, unlock new insights from data, and drive innovation. But how do you make the most of these opportunities?
This guide will provide you with the information and resources needed to understand the ins and outs of Generative AI, so you can make informed decisions and capitalize on the potential. It covers important topics such as strategies for leveraging large language models, optimizing MLOps processes, and best practices for building with Generative AI.
Data Science Training | Data Science Tutorial | Data Science Certification | ...Edureka!
This Edureka Data Science Training will help you understand what is Data Science and you will learn about different Data Science components and concepts. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Data Science concepts. Below are the topics covered in this tutorial:
1. What is Data Science?
2. Job Roles in Data Science
3. Components of Data Science
4. Concepts of Statistics
5. Power of Data Visualization
6. Introduction to Machine Learning using R
7. Supervised & Unsupervised Learning
8. Classification, Clustering & Recommenders
9. Text Mining & Time Series
10. Deep Learning
To take a structured training on Data Science, you can check complete details of our Data Science Certification Training course here: https://goo.gl/OCfxP2
Artificial intelligence in practice- part-1GMR Group
Summary is made in 5 parts-
This is Part -1
Cyber-solutions to real-world business problems Artificial Intelligence in Practice is a fascinating look into how companies use AI and machine learning to solve problems. Presenting 50 case studies of actual situations, this book demonstrates practical applications to issues faced by businesses around the globe.
• The rapidly evolving field of artificial intelligence has expanded beyond research labs and computer science departments and made its way into the mainstream business environment.
• Artificial intelligence and machine learning are cited as the most important modern business trends to drive success.
• It is used in areas ranging from banking and finance to social media and marketing.
• This technology continues to provide innovative solutions to businesses of all sizes, sectors and industries.
• This engaging and topical book explores a wide range of cases illustrating how businesses use AI to boost performance, drive efficiency, analyse market preferences and many others.
• This detailed examination provides an overview of each company, describes the specific problem and explains how AI facilitates resolution.
• Each case study provides a comprehensive overview, including some technical details as well as key learning summaries:
o Understand how specific business problems are addressed by innovative machine learning methods Explore how current artificial intelligence applications improve performance and increase efficiency in various situations
o Expand your knowledge of recent AI advancements in technology
o Gain insight on the future of AI and its increasing role in business and industry
o Artificial Intelligence in Practice: How 50 Successful Companies Used Artificial Intelligence to Solve Problems is an insightful and informative exploration of the trans-formative power of technology in 21st century commerce
The term Machine Learning was coined by Arthur Samuel in 1959, an american pioneer in the field of computer gaming and artificial intelligence and stated that “ it gives computers the ability to learn without being explicitly programmed” And in 1997, Tom Mitchell gave a “ well-Posed” mathematical and relational definition that “ A Computer Program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E”.
Machine learning is needed for tasks that are too complex for humans to code directly. So instead, we provide a large amount of data to a machine learning algorithm and let the algorithm work it out by exploring that data and searching for a model that will achieve what the programmers have set it out to achieve.
Machine learning is a branch of artificial intelligence. In which computers study algorithms. If I say in simple terms, machine learning is a computer algorithm study method that allows computer programs to learn from their experience. Now the question arises what is the algorithm.
https://www.viewofpeoples.xyz/2020/08/What-is-machine-learning.html
The world today is evolving and so are the needs and requirements of people. Furthermore, we are witnessing a fourth industrial revolution of data.
Machine Learning has revolutionized industries like medicine, healthcare, manufacturing, banking, and several other industries. Therefore, Machine Learning has become an essential part of modern industry.
AI vs Machine Learning vs Deep Learning | Machine Learning Training with Pyth...Edureka!
Machine Learning Training with Python: https://www.edureka.co/python )
This Edureka Machine Learning tutorial (Machine Learning Tutorial with Python Blog: https://goo.gl/fe7ykh ) on "AI vs Machine Learning vs Deep Learning" talks about the differences and relationship between AL, Machine Learning and Deep Learning. Below are the topics covered in this tutorial:
1. AI vs Machine Learning vs Deep Learning
2. What is Artificial Intelligence?
3. Example of Artificial Intelligence
4. What is Machine Learning?
5. Example of Machine Learning
6. What is Deep Learning?
7. Example of Deep Learning
8. Machine Learning vs Deep Learning
Machine Learning Tutorial Playlist: https://goo.gl/UxjTxm
The document describes an upcoming two-day workshop on how artificial intelligence is transforming business, hosted by the Confederation of Indian Industry and facilitated by DataMites. The workshop will take place on February 20-21, 2019 in Chennai, India and will be presented by AI expert Ashok Kumar Adinarayanan. The document provides an agenda and overview of topics that will be covered, including the evolution and history of AI, machine learning applications, case studies of AI transforming industries like banking and healthcare.
The State of Artificial Intelligence in 2018: A Good Old Fashioned ReportNathan Benaich
This document provides a summary of the state of artificial intelligence (AI) research and developments over the past year. It covers key areas like research breakthroughs, talent, industries utilizing AI, and public policy issues related to AI. The document is produced by two authors in East London as a way to capture the progress of AI and spark discussion about its implications. It includes sections on research breakthroughs in areas like transfer learning, advances in hardware that have enabled progress, and the use of video datasets to help machines understand scenes and actions to gain a level of common sense.
Artificial Intelligence for Product Managers by former Yahoo! PMProduct School
Jobs requiring artificial intelligence skills in the US has grown 450% in the last five years. Corporations are seeking relentlessly for product leaders who can utilize AI technologies on their products and services to improve the company’s bottom line or top line. It's called the Fourth Industrial Revolution, and it is happening right here, right now.
However, as a Product Manager, how do you gain the necessary knowledge to analyze, understand, plan, and design products based on artificial intelligence technologies? Since you cannot get a college degree in AI Product Management, how do you adapt to this rapid change? In this talk, Adnan helped to answer these questions.
The AI Index Report 2023 provides the following key highlights from its research and development chapter:
1. The US and China have the most cross-country AI research collaborations, though the rate of growth has slowed in recent years.
2. Global AI research output has more than doubled since 2010, led by areas like machine learning, computer vision and pattern recognition.
3. China now leads in total AI research publications, while the US still leads in conference and repository citations but these leads are decreasing.
4. Industry now produces far more significant AI models than academia, as building state-of-the-art systems requires greater resources that industry can provide.
5. Large language models
Machine learning(ML) is the scientific study of algorithms and statistical models that computer systems used to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “Training Data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, Machine learning is the study of computer systems that learn from data and experience. It is applied in an incredibly wide variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need to make sense of data is a potential customer of machine learning.
A Framework for Navigating Generative Artificial Intelligence for EnterpriseRocketSource
Generative AI offers both opportunities and risks for enterprises. While it could drive significant ROI through personalized experiences, thought leadership, and faster processes, there are also concerns about job losses, overreliance on automation without oversight, and inaccurate information. Effective adoption of generative AI requires experience management strategies like understanding emotional and logical customer triggers, aligning products and services to experience channels, and building a business model around a compelling brand story. A people-first approach is important to maximize benefits and mitigate risks.
Impact of Artificial Intelligence in IT IndustryAnand SFJ
https://sfjbstraining.com/product/artificial-intelligence-course
Artificial Intelligence transforms traditional computer methods but also has an impact on various industries. Software which makes Artificial Intelligence relatively more important in this sector.
An expert in prompt engineering provides guidelines on designing effective prompts for natural language models. The document discusses prompt engineering principles, what makes a good prompt, and various prompt frameworks including priming, focused prompts, and practical everyday prompts. Effective prompts are clear, concise, unambiguous, and provide the necessary context and task to generate a desired response from a model. Iteration and adapting the prompt based on the response is important.
Machine learning involves using algorithms and large datasets to allow systems to learn from data and improve their performance. There are several types of machine learning including supervised learning for classification and prediction tasks using labeled examples, unsupervised learning like clustering to find hidden patterns in unlabeled data, and reinforcement learning where an agent learns from delayed rewards. Applications of machine learning span many domains like retail for customer segmentation, finance for credit scoring, medicine for diagnosis, and web mining for search engines. The field is growing rapidly due to increased data and computing power enabling complex models to be learned from data rather than being explicitly programmed.
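The supervised case described above can be sketched in a few lines: a 1-nearest-neighbour classifier predicts the label of the closest labelled training example. The animal measurements here are invented toy data.

```python
# Supervised learning in miniature: a 1-nearest-neighbour classifier
# predicts the label of the nearest labelled training example.
import math

def nearest_neighbor(train, query):
    """train: list of (features, label) pairs; returns the label of the
    training point with the smallest Euclidean distance to query."""
    best = min(train, key=lambda fl: math.dist(fl[0], query))
    return best[1]

# Labelled examples: (height_cm, weight_kg) -> species (made-up data)
train = [((30, 4), "cat"), ((28, 5), "cat"),
         ((60, 25), "dog"), ((55, 22), "dog")]
print(nearest_neighbor(train, (58, 24)))   # query lies in the dog cluster
```

With more data and better distance functions this simple idea scales into the k-nearest-neighbour methods used in real classification tasks.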
Active learning is a machine learning technique where the learner is able to interactively query the oracle (e.g. a human) to obtain labels for new data points in an effort to learn more accurately from fewer labeled examples. The learner selects the most informative samples to be labeled by the oracle, such as samples closest to the decision boundary or where models disagree most. This allows the learner to minimize the number of labeled samples needed, thus reducing the cost of training an accurate model. Suggested improvements include querying batches of samples instead of single samples and accounting for varying labeling costs.
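A minimal sketch of the uncertainty-sampling strategy described above, assuming a logistic scoring model whose weights are already given: the learner queries the oracle for the pool point whose predicted probability is closest to 0.5, i.e. the point nearest the decision boundary.

```python
# Active learning by uncertainty sampling: pick the unlabelled point
# whose predicted probability is closest to the 0.5 decision boundary.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def most_uncertain(unlabeled, w, b):
    """Return the pool sample whose probability under a given logistic
    model sigmoid(w.x + b) is closest to 0.5 (maximum uncertainty)."""
    def uncertainty(x):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return abs(p - 0.5)
    return min(unlabeled, key=uncertainty)

# Assumed model weights and an unlabelled pool of 2-D points
w, b = [1.0, -1.0], 0.0
pool = [[3.0, 0.5], [0.2, 0.1], [-2.0, 1.0]]
print(most_uncertain(pool, w, b))   # the near-boundary point is chosen
```

Only this chosen point would be sent to the human oracle for labelling, which is how active learning reduces labelling cost.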
An Introduction to Supervised Machine Learning and Pattern Classification: Th... (Sebastian Raschka)
The document provides an introduction to supervised machine learning and pattern classification. It begins with an overview of the speaker's background and research interests. Key concepts covered include definitions of machine learning, examples of machine learning applications, and the differences between supervised, unsupervised, and reinforcement learning. The rest of the document outlines the typical workflow for a supervised learning problem, including data collection and preprocessing, model training and evaluation, and model selection. Common classification algorithms like decision trees, naive Bayes, and support vector machines are briefly explained. The presentation concludes with discussions around choosing the right algorithm and avoiding overfitting.
This document provides an overview of machine learning through a case study submitted by computer science students. It discusses the history and evolution of machine learning from its early development in the 1940s-50s to major advances in the 21st century. The document also defines key machine learning terms, describes the typical machine learning process and steps involved, and lists different types of machine learning problems and algorithms. It aims to give readers a comprehensive introduction to the field of machine learning.
GPT-4 is the newest version of OpenAI's language model that can understand and generate natural language. It shows improvements over GPT-3.5 in its ability to take visual inputs, be steered more precisely by the user, refuse unsafe requests, and score higher on factual benchmarks. Potential applications of GPT-4 include customer service, translation, content creation, and research. However, its adoption may displace some jobs and raises ethical issues that need addressing through education, job retraining, and responsible development of the technology.
This document provides an overview of data mining concepts and techniques. It summarizes a book on data mining that covers topics such as data preprocessing, data warehousing, online analytical processing (OLAP), and data cube computation. The book is in its third edition and is intended to help readers understand fundamental data mining concepts and how to apply different techniques.
This document provides an overview of data mining concepts and techniques. It summarizes a book on data mining that covers topics such as data preprocessing, data warehousing, online analytical processing (OLAP), and data cube computation. The book is in its third edition and is intended to help readers understand fundamental data mining concepts and how to apply different techniques.
This document provides an introduction to data mining concepts and techniques. It discusses why data mining is needed due to the massive growth of data. It defines data mining as the extraction of interesting patterns from large datasets. The document outlines the key steps in the knowledge discovery process and how data mining fits within business intelligence applications. It also describes different types of data that can be mined and popular data mining algorithms.
The document summarizes the discussions and outcomes of a Dagstuhl Perspectives Workshop on applying tensor computing methods to problems in the Internet of Things (IoT). At the workshop, researchers from both industry and academia presented on challenges involving analyzing large, multi-dimensional streaming data from IoT devices and cyber-physical systems. Tensors provide a natural way to represent such data and can enable more efficient information extraction than alternative methods. However, further work is needed to develop benchmark challenges, datasets, and frameworks to make tensor methods more accessible and applicable to industrial IoT problems. The group discussed forming a knowledge hub and collaborating on data challenges to help establish tensor computing as a solution for machine learning on cyber-physical systems.
This document outlines the course structure and content for a Data Science course. The 5 modules cover: 1) introductions to data science concepts and statistical inference using R; 2) exploratory data analysis and machine learning algorithms; 3) feature generation/selection and additional machine learning algorithms; 4) recommendation systems and dimensionality reduction; 5) mining social network graphs and data visualization. The course aims to teach students to define data science fundamentals, demonstrate the data science process, explain necessary machine learning algorithms, illustrate data analysis techniques, and follow ethics in data visualization.
This document provides information about the course "CS690L: Semantic Web and Knowledge Discovery" taught in winter 2003. The course introduces concepts of the semantic web and knowledge discovery, including ontology, XML, RDF, OWL and machine learning algorithms. Students will complete hands-on exercises, presentations, reading assignments and a research paper. Assessment will be based on individual work including papers and presentations, as well as group projects. The goal is for students to gain an understanding of semantic web paradigms and knowledge discovery techniques and tools.
The document provides an overview of the data mining concepts and techniques course offered at the University of Illinois at Urbana-Champaign. It discusses the motivation for data mining due to abundant data collection and the need for knowledge discovery. It also describes common data mining functionalities like classification, clustering, association rule mining and the most popular algorithms used.
The document provides the syllabus for the 3rd year 1st semester courses for the Computer Science and Engineering department at Jawaharlal Nehru Technological University Kakinada, including courses on Data Warehousing and Data Mining, Computer Networks, Compiler Design, Artificial Intelligence, and Professional Electives, along with information on class schedules, course objectives, and reading materials.
This document outlines the course units for several management information systems and computer science courses. The management information systems course covers topics such as the foundation of information systems, management information systems, concepts of planning and control, and managing information technology. The object oriented systems course covers object modeling, dynamic modeling, functional modeling, Java programming, and software development using Java. The e-commerce course covers topics such as introduction to e-commerce, network infrastructure, web security, encryption, and electronic payments. The computer networks course covers networking concepts, medium access control, network layer, transport layer, and application layer protocols. The compiler design course covers compiler structure, lexical analysis, syntactic analysis, intermediate code generation, run-time memory management, code optimization
"From Big Data to Smart data"
Jie (Jack) Yang, Associate Research Fellow, SMART Infrastructure Facility, presented a summary of his research as part of the SMART Seminar Series on 28 April 2016.
For more information, visit the event page at: http://smart.uow.edu.au/events/UOW212890.html.
A Guide to Wireless Communication, Addison Wesley, 1995.
3. J. F. Kurose and K. W. Ross: Computer Networking: A Top-Down Approach Featuring the Internet, Addison Wesley, 2002.
4. C. Siva Ram Murthy and B. S. Manoj: Ad Hoc Wireless Networks: Architectures and Protocols, Prentice Hall, 2004.
5. William Stallings: Wireless Communications and Networks, Prentice Hall, 2002.
Image Processing
Introduction to Digital Image Processing, Image acquisition and digitization, Image enhancement, Image restoration, Image compression, Image segmentation, Image representation and description, Object recognition
A number of recent milestones in AI have rekindled the faith that human-grade computer intelligence can fuel the next technological revolution. In parallel, and almost independently, the job role of Data Scientist has risen to one of the hottest tickets in the technology sector. Despite the obvious overlap between the domains of Data Science and Artificial Intelligence, the two approaches are sufficiently distinct that choosing the wrong one can cause a product to fail or a hiring process to go wrong. This presentation offers some clarity and best practices for understanding what data analysis requirements you really have, as opposed to what you think you have.
This document outlines several courses that are part of a degree program. It provides details for 14 different courses, including information on course credits, instructor, department, topics covered, schedule, and required literature. The courses cover a range of technical subjects within computer science and informatics, such as ad hoc networks, data mining, software engineering, cryptography, and web development. Students would take a combination of obligatory and optional courses over the course of their degree program to earn a total of 120 credits.
A modified k-means algorithm for big data clustering (SK Ahammad Fahad)
The amount of data grows every moment, and this data comes from everywhere: social media, sensors, search engines, GPS signals, transaction records, satellites, financial markets, e-commerce sites, and more. This large volume of data may be structured, semi-structured, or unstructured, so it is important to derive meaningful information from such huge data sets. Clustering is the process of categorizing data so that items are grouped in the same cluster when they are similar according to specific metrics. In this paper, we work on the k-means clustering technique to cluster big data. Several methods have been proposed for improving the performance of the k-means clustering algorithm. We propose a method that makes the algorithm less time-consuming and more effective and efficient, yielding better clustering with reduced complexity. According to our observation, the quality of the resulting clusters depends heavily on the selection of the initial centroids and on how data points change clusters in subsequent iterations. As we know, after a certain number of iterations only a small part of the data points change their clusters. Our proposed method therefore first finds the initial centroids and then separates the data elements that will not change their cluster from those that may change their cluster in subsequent iterations, which reduces the workload significantly for very large data sets. We evaluate our method on different data sets and compare it with other methods as well.
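The paper's specific modification is not reproduced here; as the baseline it refines, this is a minimal sketch of the standard Lloyd's k-means iteration (assignment step, then centroid update) on made-up 2-D points.

```python
# Baseline Lloyd's k-means: alternate between assigning points to the
# nearest centroid and recomputing each centroid as its cluster mean.
import math

def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                        # assignment step
            i = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):  # update step
            if members:
                centroids[i] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids, clusters

points = [[1, 1], [1.5, 2], [8, 8], [9, 9]]
cents, clusters = kmeans(points, [[0, 0], [10, 10]])
print(cents)
```

The cost the paper targets is visible here: every iteration re-examines every point, even the many points whose cluster assignment has long since stabilized.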
The document contains summaries of 5 courses:
1. E-Commerce discusses key concepts like business frameworks, history of e-commerce, and business models. It also covers network infrastructure and security topics.
2. Information Systems discusses foundations of information systems including problem solving and project management. It also covers managing technology, applications, and advanced concepts.
3. Web Technology covers HTML, XML, scripting, Java technologies, databases, and frameworks for web development.
4. Design and Analysis of Algorithms includes sorting, advanced data structures, techniques like dynamic programming, and graph algorithms.
5. Industrial Economics and Management covers basic economic concepts, money and banking, and principles of management including human behavior.
1. Developing a unifying theory of data mining that connects different tasks and approaches could help advance the field by providing a theoretical framework.
2. Scaling data mining methods to handle high dimensional and streaming data at massive scales is challenging due to limitations in current approaches for problems like concept drift.
3. Efficiently mining sequential, time series, and noisy time series data remains an important open problem, particularly for applications like financial and seismic predictions.
This document provides an overview of a course on Predictive Modeling using IBM SPSS Statistics. The course is divided into 5 units that cover topics such as reading, organizing, and transforming data in SPSS; conducting descriptive and inferential statistics; creating graphical displays; and performing statistical analyses like t-tests, ANOVA, correlation, regression, and predictive analysis. Students will learn how to import, manage, and analyze data in SPSS through illustrative problems and projects involving both parametric and non-parametric statistical tests. The goal is for students to gain experience in using SPSS to conduct statistical analyses and predictive modeling on data.
This document introduces data mining concepts and techniques. It defines data mining as the process of discovering interesting patterns from large amounts of data. The document outlines several data mining functionalities including classification, clustering, association rule mining, and outlier detection. It also discusses popular data mining algorithms, major issues in data mining, and provides a brief history of the data mining field and community.
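Association rule mining, one of the functionalities listed above, reduces at its core to counting support and confidence over a transaction database; the toy market-basket data below is invented for illustration.

```python
# Association rules in miniature: support and confidence of the rule
# {bread} -> {butter} over a toy transaction database.

def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

tx = [{"bread", "butter"}, {"bread", "milk"},
      {"bread", "butter", "milk"}, {"milk"}]
print(support(tx, {"bread"}))                  # 0.75
print(confidence(tx, {"bread"}, {"butter"}))   # 2/3
```

Algorithms such as Apriori are essentially efficient ways of finding all itemsets whose support and confidence clear user-chosen thresholds without enumerating every candidate.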
The Morgan Kaufmann Series in Data Management Systems
Series Editor: Jim Gray, Microsoft Research

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition
Ian H. Witten and Eibe Frank

Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration
Earl Cox

Data Modeling Essentials, Third Edition
Graeme C. Simsion and Graham C. Witt

Location-Based Services
Jochen Schiller and Agnès Voisard

Database Modeling with Microsoft® Visio for Enterprise Architects
Terry Halpin, Ken Evans, Patrick Hallock, and Bill Maclean

Designing Data-Intensive Web Applications
Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, and Maristella Matera

Mining the Web: Discovering Knowledge from Hypertext Data
Soumen Chakrabarti

Advanced SQL: 1999—Understanding Object-Relational and Other Advanced Features
Jim Melton

Database Tuning: Principles, Experiments, and Troubleshooting Techniques
Dennis Shasha and Philippe Bonnet

SQL: 1999—Understanding Relational Language Components
Jim Melton and Alan R. Simon

Information Visualization in Data Mining and Knowledge Discovery
Edited by Usama Fayyad, Georges G. Grinstein, and Andreas Wierse

Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery
Gerhard Weikum and Gottfried Vossen

Spatial Databases: With Application to GIS
Philippe Rigaux, Michel Scholl, and Agnès Voisard

Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design
Terry Halpin

Component Database Systems
Edited by Klaus R. Dittrich and Andreas Geppert

Managing Reference Data in Enterprise Databases: Binding Corporate Data to the Wider World
Malcolm Chisholm

Data Mining: Concepts and Techniques
Jiawei Han and Micheline Kamber

Understanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related Technologies
Jim Melton and Andrew Eisenberg

Database: Principles, Programming, and Performance, Second Edition
Patrick O’Neil and Elizabeth O’Neil

The Object Data Standard: ODMG 3.0
Edited by R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman, David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and Fernando Velez

Data on the Web: From Relations to Semistructured Data and XML
Serge Abiteboul, Peter Buneman, and Dan Suciu

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
Ian H. Witten and Eibe Frank

Joe Celko’s SQL for Smarties: Advanced SQL Programming, Second Edition
Joe Celko

Joe Celko’s Data and Databases: Concepts in Practice
Joe Celko

Developing Time-Oriented Database Applications in SQL
Richard T. Snodgrass

Web Farming for the Data Warehouse
Richard D. Hackathorn

Database Modeling & Design, Third Edition
Toby J. Teorey

Management of Heterogeneous and Autonomous Database Systems
Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, and Amit Sheth

Object-Relational DBMSs: Tracking the Next Great Wave, Second Edition
Michael Stonebraker and Paul Brown, with Dorothy Moore

A Complete Guide to DB2 Universal Database
Don Chamberlin

Universal Database Management: A Guide to Object/Relational Technology
Cynthia Maro Saracco

Readings in Database Systems, Third Edition
Edited by Michael Stonebraker and Joseph M. Hellerstein

Understanding SQL’s Stored Procedures: A Complete Guide to SQL/PSM
Jim Melton

Principles of Multimedia Database Systems
V. S. Subrahmanian

Principles of Database Query Processing for Advanced Applications
Clement T. Yu and Weiyi Meng

Advanced Database Systems
Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari

Principles of Transaction Processing for the Systems Professional
Philip A. Bernstein and Eric Newcomer

Using the New DB2: IBM’s Object-Relational Database System
Don Chamberlin

Distributed Algorithms
Nancy A. Lynch

Active Database Systems: Triggers and Rules For Advanced Database Processing
Edited by Jennifer Widom and Stefano Ceri

Migrating Legacy Systems: Gateways, Interfaces & the Incremental Approach
Michael L. Brodie and Michael Stonebraker

Atomic Transactions
Nancy Lynch, Michael Merritt, William Weihl, and Alan Fekete

Query Processing For Advanced Database Systems
Edited by Johann Christoph Freytag, David Maier, and Gottfried Vossen

Transaction Processing: Concepts and Techniques
Jim Gray and Andreas Reuter

Building an Object-Oriented Database System: The Story of O2
Edited by François Bancilhon, Claude Delobel, and Paris Kanellakis

Database Transaction Models For Advanced Applications
Edited by Ahmed K. Elmagarmid

A Guide to Developing Client/Server SQL Applications
Setrag Khoshafian, Arvola Chan, Anna Wong, and Harry K. T. Wong

The Benchmark Handbook For Database and Transaction Processing Systems, Second Edition
Edited by Jim Gray

Camelot and Avalon: A Distributed Transaction Facility
Edited by Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. Spector

Readings in Object-Oriented Database Systems
Edited by Stanley B. Zdonik and David Maier
Data Mining
Practical Machine Learning Tools and Techniques, Second Edition
Ian H. Witten
Department of Computer Science
University of Waikato
Eibe Frank
Department of Computer Science
University of Waikato
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
MORGAN KAUFMANN PUBLISHERS IS AN IMPRINT OF ELSEVIER
Foreword
Jim Gray, Series Editor
Microsoft Research
Technology now allows us to capture and store vast quantities of data. Finding
patterns, trends, and anomalies in these datasets, and summarizing them
with simple quantitative models, is one of the grand challenges of the infor-
mation age—turning data into information and turning information into
knowledge.
There has been stunning progress in data mining and machine learning. The
synthesis of statistics, machine learning, information theory, and computing has
created a solid science, with a firm mathematical base, and with very powerful
tools. Witten and Frank present much of this progress in this book and in the
companion implementation of the key algorithms. As such, this is a milestone
in the synthesis of data mining, data analysis, information theory, and machine
learning. If you have not been following this field for the last decade, this is a
great way to catch up on this exciting progress. If you have, then Witten and
Frank’s presentation and the companion open-source workbench, called Weka,
will be a useful addition to your toolkit.
They present the basic theory of automatically extracting models from data,
and then validating those models. The book does an excellent job of explaining
the various models (decision trees, association rules, linear models, clustering,
Bayes nets, neural nets) and how to apply them in practice. With this basis, they
then walk through the steps and pitfalls of various approaches. They describe
how to safely scrub datasets, how to build models, and how to evaluate a model’s
predictive quality. Most of the book is tutorial, but Part II broadly describes how
commercial systems work and gives a tour of the publicly available data mining
workbench that the authors provide through a website. This Weka workbench
has a graphical user interface that leads you through data mining tasks and has
excellent data visualization tools that help understand the models. It is a great
companion to the text and a useful and popular tool in its own right.
This book presents this new discipline in a very accessible form: as a text
both to train the next generation of practitioners and researchers and to inform
lifelong learners like myself. Witten and Frank have a passion for simple and
elegant solutions. They approach each topic with this mindset, grounding all
concepts in concrete examples, and urging the reader to consider the simple
techniques first, and then progress to the more sophisticated ones if the simple
ones prove inadequate.
If you are interested in databases, and have not been following the machine
learning field, this book is a great way to catch up on this exciting progress. If
you have data that you want to analyze and understand, this book and the asso-
ciated Weka toolkit are an excellent way to start.
Contents
Foreword v
Preface xxiii
Updated and revised content xxvii
Acknowledgments xxix
Part I Machine learning tools and techniques 1
1 What’s it all about? 3
1.1 Data mining and machine learning 4
Describing structural patterns 6
Machine learning 7
Data mining 9
1.2 Simple examples: The weather problem and others 9
The weather problem 10
Contact lenses: An idealized problem 13
Irises: A classic numeric dataset 15
CPU performance: Introducing numeric prediction 16
Labor negotiations: A more realistic example 17
Soybean classification: A classic machine learning success 18
1.3 Fielded applications 22
Decisions involving judgment 22
Screening images 23
Load forecasting 24
Diagnosis 25
Marketing and sales 26
Other applications 28
1.4 Machine learning and statistics 29
1.5 Generalization as search 30
Enumerating the concept space 31
Bias 32
1.6 Data mining and ethics 35
1.7 Further reading 37
2 Input: Concepts, instances, and attributes 41
2.1 What’s a concept? 42
2.2 What’s in an example? 45
2.3 What’s in an attribute? 49
2.4 Preparing the input 52
Gathering the data together 52
ARFF format 53
Sparse data 55
Attribute types 56
Missing values 58
Inaccurate values 59
Getting to know your data 60
2.5 Further reading 60
3 Output: Knowledge representation 61
3.1 Decision tables 62
3.2 Decision trees 62
3.3 Classification rules 65
3.4 Association rules 69
3.5 Rules with exceptions 70
3.6 Rules involving relations 73
3.7 Trees for numeric prediction 76
3.8 Instance-based representation 76
3.9 Clusters 81
3.10 Further reading 82
4 Algorithms: The basic methods 83
4.1 Inferring rudimentary rules 84
Missing values and numeric attributes 86
Discussion 88
4.2 Statistical modeling 88
Missing values and numeric attributes 92
Bayesian models for document classification 94
Discussion 96
4.3 Divide-and-conquer: Constructing decision trees 97
Calculating information 100
Highly branching attributes 102
Discussion 105
4.4 Covering algorithms: Constructing rules 105
Rules versus trees 107
A simple covering algorithm 107
Rules versus decision lists 111
4.5 Mining association rules 112
Item sets 113
Association rules 113
Generating rules efficiently 117
Discussion 118
4.6 Linear models 119
Numeric prediction: Linear regression 119
Linear classification: Logistic regression 121
Linear classification using the perceptron 124
Linear classification using Winnow 126
4.7 Instance-based learning 128
The distance function 128
Finding nearest neighbors efficiently 129
Discussion 135
4.8 Clustering 136
Iterative distance-based clustering 137
Faster distance calculations 138
Discussion 139
4.9 Further reading 139
5 Credibility: Evaluating what’s been learned 143
5.1 Training and testing 144
5.2 Predicting performance 146
5.3 Cross-validation 149
5.4 Other estimates 151
Leave-one-out 151
The bootstrap 152
5.5 Comparing data mining methods 153
5.6 Predicting probabilities 157
Quadratic loss function 158
Informational loss function 159
Discussion 160
5.7 Counting the cost 161
Cost-sensitive classification 164
Cost-sensitive learning 165
Lift charts 166
ROC curves 168
Recall–precision curves 171
Discussion 172
Cost curves 173
5.8 Evaluating numeric prediction 176
5.9 The minimum description length principle 179
5.10 Applying the MDL principle to clustering 183
5.11 Further reading 184
6 Implementations: Real machine learning schemes 187
6.1 Decision trees 189
Numeric attributes 189
Missing values 191
Pruning 192
Estimating error rates 193
Complexity of decision tree induction 196
From trees to rules 198
C4.5: Choices and options 198
Discussion 199
6.2 Classification rules 200
Criteria for choosing tests 200
Missing values, numeric attributes 201
Generating good rules 202
Using global optimization 205
Obtaining rules from partial decision trees 207
Rules with exceptions 210
Discussion 213
6.3 Extending linear models 214
The maximum margin hyperplane 215
Nonlinear class boundaries 217
Support vector regression 219
The kernel perceptron 222
Multilayer perceptrons 223
Discussion 235
6.4 Instance-based learning 235
Reducing the number of exemplars 236
Pruning noisy exemplars 236
Weighting attributes 237
Generalizing exemplars 238
Distance functions for generalized exemplars 239
Generalized distance functions 241
Discussion 242
6.5 Numeric prediction 243
Model trees 244
Building the tree 245
Pruning the tree 245
Nominal attributes 246
Missing values 246
Pseudocode for model tree induction 247
Rules from model trees 250
Locally weighted linear regression 251
Discussion 253
6.6 Clustering 254
Choosing the number of clusters 254
Incremental clustering 255
Category utility 260
Probability-based clustering 262
The EM algorithm 265
Extending the mixture model 266
Bayesian clustering 268
Discussion 270
6.7 Bayesian networks 271
Making predictions 272
Learning Bayesian networks 276
Specific algorithms 278
Data structures for fast learning 280
Discussion 283
7 Transformations: Engineering the input and output 285
7.1 Attribute selection 288
Scheme-independent selection 290
Searching the attribute space 292
Scheme-specific selection 294
7.2 Discretizing numeric attributes 296
Unsupervised discretization 297
Entropy-based discretization 298
Other discretization methods 302
Entropy-based versus error-based discretization 302
Converting discrete to numeric attributes 304
7.3 Some useful transformations 305
Principal components analysis 306
Random projections 309
Text to attribute vectors 309
Time series 311
7.4 Automatic data cleansing 312
Improving decision trees 312
Robust regression 313
Detecting anomalies 314
7.5 Combining multiple models 315
Bagging 316
Bagging with costs 319
Randomization 320
Boosting 321
Additive regression 325
Additive logistic regression 327
Option trees 328
Logistic model trees 331
Stacking 332
Error-correcting output codes 334
7.6 Using unlabeled data 337
Clustering for classification 337
Co-training 339
EM and co-training 340
7.7 Further reading 341
8 Moving on: Extensions and applications 345
8.1 Learning from massive datasets 346
8.2 Incorporating domain knowledge 349
8.3 Text and Web mining 351
8.4 Adversarial situations 356
8.5 Ubiquitous data mining 358
8.6 Further reading 361
Part II The Weka machine learning workbench 363
9 Introduction to Weka 365
9.1 What’s in Weka? 366
9.2 How do you use it? 367
9.3 What else can you do? 368
9.4 How do you get it? 368
10 The Explorer 369
10.1 Getting started 369
Preparing the data 370
Loading the data into the Explorer 370
Building a decision tree 373
Examining the output 373
Doing it again 377
Working with models 377
When things go wrong 378
10.2 Exploring the Explorer 380
Loading and filtering files 380
Training and testing learning schemes 384
Do it yourself: The User Classifier 388
Using a metalearner 389
Clustering and association rules 391
Attribute selection 392
Visualization 393
10.3 Filtering algorithms 393
Unsupervised attribute filters 395
Unsupervised instance filters 400
Supervised filters 401
10.4 Learning algorithms 403
Bayesian classifiers 403
Trees 406
Rules 408
Functions 409
Lazy classifiers 413
Miscellaneous classifiers 414
10.5 Metalearning algorithms 414
Bagging and randomization 414
Boosting 416
Combining classifiers 417
Cost-sensitive learning 417
Optimizing performance 417
Retargeting classifiers for different tasks 418
10.6 Clustering algorithms 418
10.7 Association-rule learners 419
10.8 Attribute selection 420
Attribute subset evaluators 422
Single-attribute evaluators 422
Search methods 423
11 The Knowledge Flow interface 427
11.1 Getting started 427
11.2 The Knowledge Flow components 430
11.3 Configuring and connecting the components 431
11.4 Incremental learning 433
12 The Experimenter 437
12.1 Getting started 438
Running an experiment 439
Analyzing the results 440
12.2 Simple setup 441
12.3 Advanced setup 442
12.4 The Analyze panel 443
12.5 Distributing processing over several machines 445
13 The command-line interface 449
13.1 Getting started 449
13.2 The structure of Weka 450
Classes, instances, and packages 450
The weka.core package 451
The weka.classifiers package 453
Other packages 455
Javadoc indices 456
13.3 Command-line options 456
Generic options 456
Scheme-specific options 458
14 Embedded machine learning 461
14.1 A simple data mining application 461
14.2 Going through the code 462
main() 462
MessageClassifier() 462
updateData() 468
classifyMessage() 468
15 Writing new learning schemes 471
15.1 An example classifier 471
buildClassifier() 472
makeTree() 472
computeInfoGain() 480
classifyInstance() 480
main() 481
15.2 Conventions for implementing classifiers 483
References 485
Index 505
About the authors 525
List of Figures
Figure 1.1 Rules for the contact lens data. 13
Figure 1.2 Decision tree for the contact lens data. 14
Figure 1.3 Decision trees for the labor negotiations data. 19
Figure 2.1 A family tree and two ways of expressing the sister-of
relation. 46
Figure 2.2 ARFF file for the weather data. 54
Figure 3.1 Constructing a decision tree interactively: (a) creating a
rectangular test involving petallength and petalwidth and (b)
the resulting (unfinished) decision tree. 64
Figure 3.2 Decision tree for a simple disjunction. 66
Figure 3.3 The exclusive-or problem. 67
Figure 3.4 Decision tree with a replicated subtree. 68
Figure 3.5 Rules for the Iris data. 72
Figure 3.6 The shapes problem. 73
Figure 3.7 Models for the CPU performance data: (a) linear regression,
(b) regression tree, and (c) model tree. 77
Figure 3.8 Different ways of partitioning the instance space. 79
Figure 3.9 Different ways of representing clusters. 81
Figure 4.1 Pseudocode for 1R. 85
Figure 4.2 Tree stumps for the weather data. 98
Figure 4.3 Expanded tree stumps for the weather data. 100
Figure 4.4 Decision tree for the weather data. 101
Figure 4.5 Tree stump for the ID code attribute. 103
Figure 4.6 Covering algorithm: (a) covering the instances and (b) the
decision tree for the same problem. 106
Figure 4.7 The instance space during operation of a covering
algorithm. 108
Figure 4.8 Pseudocode for a basic rule learner. 111
Figure 4.9 Logistic regression: (a) the logit transform and (b) an example
logistic regression function. 122
Figure 4.10 The perceptron: (a) learning rule and (b) representation as
a neural network. 125
Figure 4.11 The Winnow algorithm: (a) the unbalanced version and (b)
the balanced version. 127
Figure 4.12 A kD-tree for four training instances: (a) the tree and (b)
instances and splits. 130
Figure 4.13 Using a kD-tree to find the nearest neighbor of the
star. 131
Figure 4.14 Ball tree for 16 training instances: (a) instances and balls and
(b) the tree. 134
Figure 4.15 Ruling out an entire ball (gray) based on a target point (star)
and its current nearest neighbor. 135
Figure 4.16 A ball tree: (a) two cluster centers and their dividing line and
(b) the corresponding tree. 140
Figure 5.1 A hypothetical lift chart. 168
Figure 5.2 A sample ROC curve. 169
Figure 5.3 ROC curves for two learning methods. 170
Figure 5.4 Effects of varying the probability threshold: (a) the error curve
and (b) the cost curve. 174
Figure 6.1 Example of subtree raising, where node C is “raised” to
subsume node B. 194
Figure 6.2 Pruning the labor negotiations decision tree. 196
Figure 6.3 Algorithm for forming rules by incremental reduced-error
pruning. 205
Figure 6.4 RIPPER: (a) algorithm for rule learning and (b) meaning of
symbols. 206
Figure 6.5 Algorithm for expanding examples into a partial
tree. 208
Figure 6.6 Example of building a partial tree. 209
Figure 6.7 Rules with exceptions for the iris data. 211
Figure 6.8 A maximum margin hyperplane. 216
Figure 6.9 Support vector regression: (a) ε = 1, (b) ε = 2, and (c)
ε = 0.5. 221
Figure 6.10 Example datasets and corresponding perceptrons. 225
Figure 6.11 Step versus sigmoid: (a) step function and (b) sigmoid
function. 228
Figure 6.12 Gradient descent using the error function x² + 1. 229
Figure 6.13 Multilayer perceptron with a hidden layer. 231
Figure 6.14 A boundary between two rectangular classes. 240
Figure 6.15 Pseudocode for model tree induction. 248
Figure 6.16 Model tree for a dataset with nominal attributes. 250
Figure 6.17 Clustering the weather data. 256
Figure 6.18 Hierarchical clusterings of the iris data. 259
Figure 6.19 A two-class mixture model. 264
Figure 6.20 A simple Bayesian network for the weather data. 273
Figure 6.21 Another Bayesian network for the weather data. 274
Figure 6.22 The weather data: (a) reduced version and (b) corresponding
AD tree. 281
Figure 7.1 Attribute space for the weather dataset. 293
Figure 7.2 Discretizing the temperature attribute using the entropy
method. 299
Figure 7.3 The result of discretizing the temperature attribute. 300
Figure 7.4 Class distribution for a two-class, two-attribute
problem. 303
Figure 7.5 Principal components transform of a dataset: (a) variance of
each component and (b) variance plot. 308
Figure 7.6 Number of international phone calls from Belgium,
1950–1973. 314
Figure 7.7 Algorithm for bagging. 319
Figure 7.8 Algorithm for boosting. 322
Figure 7.9 Algorithm for additive logistic regression. 327
Figure 7.10 Simple option tree for the weather data. 329
Figure 7.11 Alternating decision tree for the weather data. 330
Figure 10.1 The Explorer interface. 370
Figure 10.2 Weather data: (a) spreadsheet, (b) CSV format, and
(c) ARFF. 371
Figure 10.3 The Weka Explorer: (a) choosing the Explorer interface and
(b) reading in the weather data. 372
Figure 10.4 Using J4.8: (a) finding it in the classifiers list and (b) the
Classify tab. 374
Figure 10.5 Output from the J4.8 decision tree learner. 375
Figure 10.6 Visualizing the result of J4.8 on the iris dataset: (a) the tree
and (b) the classifier errors. 379
Figure 10.7 Generic object editor: (a) the editor, (b) more information
(click More), and (c) choosing a converter
(click Choose). 381
Figure 10.8 Choosing a filter: (a) the filters menu, (b) an object editor, and
(c) more information (click More). 383
Figure 10.9 The weather data with two attributes removed. 384
Figure 10.10 Processing the CPU performance data with M5′. 385
Figure 10.11 Output from the M5′ program for numeric
prediction. 386
Figure 10.12 Visualizing the errors: (a) from M5′ and (b) from linear
regression. 388
Figure 10.13 Working on the segmentation data with the User Classifier:
(a) the data visualizer and (b) the tree visualizer. 390
Figure 10.14 Configuring a metalearner for boosting decision
stumps. 391
Figure 10.15 Output from the Apriori program for association rules. 392
Figure 10.16 Visualizing the Iris dataset. 394
Figure 10.17 Using Weka’s metalearner for discretization: (a) configuring
FilteredClassifier, and (b) the menu of filters. 402
Figure 10.18 Visualizing a Bayesian network for the weather data (nominal
version): (a) default output, (b) a version with the
maximum number of parents set to 3 in the search
algorithm, and (c) probability distribution table for the
windy node in (b). 406
Figure 10.19 Changing the parameters for J4.8. 407
Figure 10.20 Using Weka’s neural-network graphical user
interface. 411
Figure 10.21 Attribute selection: specifying an evaluator and a search
method. 420
Figure 11.1 The Knowledge Flow interface. 428
Figure 11.2 Configuring a data source: (a) the right-click menu and
(b) the file browser obtained from the Configure menu
item. 429
Figure 11.3 Operations on the Knowledge Flow components. 432
Figure 11.4 A Knowledge Flow that operates incrementally: (a) the
configuration and (b) the strip chart output. 434
Figure 12.1 An experiment: (a) setting it up, (b) the results file, and
(c) a spreadsheet with the results. 438
Figure 12.2 Statistical test results for the experiment in
Figure 12.1. 440
Figure 12.3 Setting up an experiment in advanced mode. 442
Figure 12.4 Rows and columns of Figure 12.2: (a) row field, (b) column
field, (c) result of swapping the row and column selections,
and (d) substituting Run for Dataset as rows. 444
Figure 13.1 Using Javadoc: (a) the front page and (b) the weka.core
package. 452
Figure 13.2 DecisionStump: A class of the weka.classifiers.trees
package. 454
Figure 14.1 Source code for the message classifier. 463
Figure 15.1 Source code for the ID3 decision tree learner. 473
List of Tables
Table 1.1 The contact lens data. 6
Table 1.2 The weather data. 11
Table 1.3 Weather data with some numeric attributes. 12
Table 1.4 The iris data. 15
Table 1.5 The CPU performance data. 16
Table 1.6 The labor negotiations data. 18
Table 1.7 The soybean data. 21
Table 2.1 Iris data as a clustering problem. 44
Table 2.2 Weather data with a numeric class. 44
Table 2.3 Family tree represented as a table. 47
Table 2.4 The sister-of relation represented in a table. 47
Table 2.5 Another relation represented as a table. 49
Table 3.1 A new iris flower. 70
Table 3.2 Training data for the shapes problem. 74
Table 4.1 Evaluating the attributes in the weather data. 85
Table 4.2 The weather data with counts and probabilities. 89
Table 4.3 A new day. 89
Table 4.4 The numeric weather data with summary statistics. 93
Table 4.5 Another new day. 94
Table 4.6 The weather data with identification codes. 103
Table 4.7 Gain ratio calculations for the tree stumps of Figure 4.2. 104
Table 4.8 Part of the contact lens data for which astigmatism = yes. 109
Table 4.9 Part of the contact lens data for which astigmatism = yes and
tear production rate = normal. 110
Table 4.10 Item sets for the weather data with coverage 2 or
greater. 114
Table 4.11 Association rules for the weather data. 116
Table 5.1 Confidence limits for the normal distribution. 148
Table 5.2 Confidence limits for Student’s distribution with 9 degrees
of freedom. 155
Table 5.3 Different outcomes of a two-class prediction. 162
Table 5.4 Different outcomes of a three-class prediction: (a) actual and
(b) expected. 163
Table 5.5 Default cost matrixes: (a) a two-class case and (b) a three-class
case. 164
Table 5.6 Data for a lift chart. 167
Table 5.7 Different measures used to evaluate the false positive versus the
false negative tradeoff. 172
Table 5.8 Performance measures for numeric prediction. 178
Table 5.9 Performance measures for four numeric prediction
models. 179
Table 6.1 Linear models in the model tree. 250
Table 7.1 Transforming a multiclass problem into a two-class one:
(a) standard method and (b) error-correcting code. 335
Table 10.1 Unsupervised attribute filters. 396
Table 10.2 Unsupervised instance filters. 400
Table 10.3 Supervised attribute filters. 402
Table 10.4 Supervised instance filters. 402
Table 10.5 Classifier algorithms in Weka. 404
Table 10.6 Metalearning algorithms in Weka. 415
Table 10.7 Clustering algorithms. 419
Table 10.8 Association-rule learners. 419
Table 10.9 Attribute evaluation methods for attribute selection. 421
Table 10.10 Search methods for attribute selection. 421
Table 11.1 Visualization and evaluation components. 430
Table 13.1 Generic options for learning schemes in Weka. 457
Table 13.2 Scheme-specific options for the J4.8 decision tree
learner. 458
Table 15.1 Simple learning schemes in Weka. 472
Preface
The convergence of computing and communication has produced a society that
feeds on information. Yet most of the information is in its raw form: data. If
data is characterized as recorded facts, then information is the set of patterns,
or expectations, that underlie the data. There is a huge amount of information
locked up in databases—information that is potentially important but has not
yet been discovered or articulated. Our mission is to bring it forth.
Data mining is the extraction of implicit, previously unknown, and poten-
tially useful information from data. The idea is to build computer programs that
sift through databases automatically, seeking regularities or patterns. Strong pat-
terns, if found, will likely generalize to make accurate predictions on future data.
Of course, there will be problems. Many patterns will be banal and uninterest-
ing. Others will be spurious, contingent on accidental coincidences in the par-
ticular dataset used. In addition real data is imperfect: Some parts will be
garbled, and some will be missing. Anything discovered will be inexact: There
will be exceptions to every rule and cases not covered by any rule. Algorithms
need to be robust enough to cope with imperfect data and to extract regulari-
ties that are inexact but useful.
Machine learning provides the technical basis of data mining. It is used to
extract information from the raw data in databases—information that is
expressed in a comprehensible form and can be used for a variety of purposes.
The process is one of abstraction: taking the data, warts and all, and inferring
whatever structure underlies it. This book is about the tools and techniques of
machine learning used in practical data mining for finding, and describing,
structural patterns in data.
As with any burgeoning new technology that enjoys intense commercial
attention, the use of data mining is surrounded by a great deal of hype in the
technical—and sometimes the popular—press. Exaggerated reports appear of
the secrets that can be uncovered by setting learning algorithms loose on oceans
of data. But there is no magic in machine learning, no hidden power, no
alchemy. Instead, there is an identifiable body of simple and practical techniques
that can often extract useful information from raw data. This book describes
these techniques and shows how they work.
We interpret machine learning as the acquisition of structural descriptions
from examples. The kind of descriptions found can be used for prediction,
explanation, and understanding. Some data mining applications focus on pre-
diction: forecasting what will happen in new situations from data that describe
what happened in the past, often by guessing the classification of new examples.
But we are equally—perhaps more—interested in applications in which the
result of “learning” is an actual description of a structure that can be used to
classify examples. This structural description supports explanation, under-
standing, and prediction. In our experience, insights gained by the applications’
users are of most interest in the majority of practical data mining applications;
indeed, this is one of machine learning’s major advantages over classical statis-
tical modeling.
The book explains a variety of machine learning methods. Some are peda-
gogically motivated: simple schemes designed to explain clearly how the basic
ideas work. Others are practical: real systems used in applications today. Many
are contemporary and have been developed only in the last few years.
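To give a flavor of the pedagogically simple end of that spectrum, here is a minimal sketch of the 1R scheme that opens Chapter 4 (covering rules over a single attribute). This is not the book's or Weka's code; the class name `OneR` and the abridged weather-style data are illustrative, assuming nominal attributes with the class label in the last column.

```java
import java.util.*;

/** Minimal 1R sketch: score each attribute by the errors made when every
 *  attribute value predicts its majority class; keep the best attribute. */
public class OneR {
    // Each row holds attribute values, with the class label in the last position.
    static int bestAttribute(String[][] data, int numAttrs) {
        int best = -1, bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < numAttrs; a++) {
            // Count class frequencies for each distinct value of attribute a.
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (String[] row : data) {
                counts.computeIfAbsent(row[a], v -> new HashMap<>())
                      .merge(row[row.length - 1], 1, Integer::sum);
            }
            // Errors = instances not covered by the majority class of their value.
            int errors = 0;
            for (Map<String, Integer> byClass : counts.values()) {
                int total = 0, max = 0;
                for (int c : byClass.values()) { total += c; max = Math.max(max, c); }
                errors += total - max;
            }
            if (errors < bestErrors) { bestErrors = errors; best = a; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Abridged toy data in the style of the weather problem: outlook, windy -> play.
        String[][] data = {
            {"sunny", "false", "no"},    {"sunny", "true", "no"},
            {"overcast", "false", "yes"}, {"rainy", "false", "yes"},
            {"rainy", "true", "no"},     {"overcast", "true", "yes"},
        };
        System.out.println("1R chooses attribute " + bestAttribute(data, 2));
    }
}
```

On this toy data, outlook (attribute 0) misclassifies only one rainy day while windy misclassifies two, so 1R keeps the outlook rule, which is the kind of result Section 4.1 walks through on the full weather dataset.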
A comprehensive software resource, written in the Java language, has been
created to illustrate the ideas in the book. Called the Waikato Environment for
Knowledge Analysis, or Weka¹ for short, it is available as source code on the
World Wide Web at http://www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-
strength implementation of essentially all the techniques covered in this book.
It includes illustrative code and working implementations of machine learning
methods. It offers clean, spare implementations of the simplest techniques,
designed to aid understanding of the mechanisms involved. It also provides a
workbench that includes full, working, state-of-the-art implementations of
many popular learning schemes that can be used for practical data mining or
for research. Finally, it contains a framework, in the form of a Java class library,
that supports applications that use embedded machine learning and even the
implementation of new learning schemes.
The objective of this book is to introduce the tools and techniques for
machine learning that are used in data mining. After reading it, you will under-
stand what these techniques are and appreciate their strengths and applicabil-
ity. If you wish to experiment with your own data, you will be able to do this
easily with the Weka software.
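Experimenting with your own data mostly means getting it into Weka's native ARFF format, described in Chapter 2. As a hypothetical sketch (the attribute names echo the weather problem; the values are illustrative), a minimal ARFF file looks like:

```
% Toy weather data in ARFF
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,FALSE,no
overcast,TRUE,yes
rainy,FALSE,yes
```

Each `@attribute` line declares a column (here all nominal, with the allowed values in braces), and each line after `@data` is one instance.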
¹ Found only on the islands of New Zealand, the weka (pronounced to rhyme with Mecca)
is a flightless bird with an inquisitive nature.
The book spans the gulf between the intensely practical approach taken by
trade books that provide case studies on data mining and the more theoretical,
principle-driven exposition found in current textbooks on machine learning.
(A brief description of these books appears in the Further reading section at the
end of Chapter 1.) This gulf is rather wide. To apply machine learning tech-
niques productively, you need to understand something about how they work;
this is not a technology that you can apply blindly and expect to get good results.
Different problems yield to different techniques, but it is rarely obvious which
techniques are suitable for a given situation: you need to know something about
the range of possible solutions. We cover an extremely wide range of techniques.
We can do this because, unlike many trade books, this volume does not promote
any particular commercial software or approach. We include a large number of
examples, but they use illustrative datasets that are small enough to allow you
to follow what is going on. Real datasets are far too large to show this (and in
any case are usually company confidential). Our datasets are chosen not to
illustrate actual large-scale practical problems but to help you understand what
the different techniques do, how they work, and what their range of application
is.
The book is aimed at the technically aware general reader interested in the
principles and ideas underlying the current practice of data mining. It will
also be of interest to information professionals who need to become acquainted
with this new technology and to all those who wish to gain a detailed technical
understanding of what machine learning involves. It is written for an eclectic
audience of information systems practitioners, programmers, consultants,
developers, information technology managers, specification writers, patent
examiners, and curious laypeople—as well as students and professors—who
need an easy-to-read book with lots of illustrations that describes what the
major machine learning techniques are, what they do, how they are used, and
how they work. It is practically oriented, with a strong “how to” flavor, and
includes algorithms, code, and implementations. All those involved in practical
data mining will benefit directly from the techniques described. The book is
aimed at people who want to cut through to the reality that underlies the hype
about machine learning and who seek a practical, nonacademic, unpretentious
approach. We have avoided requiring any specific theoretical or mathematical
knowledge except in some sections marked by a light gray bar in the margin.
These contain optional material, often for the more technical or theoretically
inclined reader, and may be skipped without loss of continuity.
The book is organized in layers that make the ideas accessible to readers who
are interested in grasping the basics and to those who would like more depth of
treatment, along with full details on the techniques covered. We believe that
consumers of machine learning need to have some idea of how the algorithms they
use work. It is often observed that data models are only as good as the person
who interprets them, and that person needs to know something about how the
models are produced to appreciate the strengths, and limitations, of the tech-
nology. However, it is not necessary for all data model users to have a deep
understanding of the finer details of the algorithms.
We address this situation by describing machine learning methods at succes-
sive levels of detail. You will learn the basic ideas, the topmost level, by reading
the first three chapters. Chapter 1 describes, through examples, what machine
learning is and where it can be used; it also provides actual practical applica-
tions. Chapters 2 and 3 cover the kinds of input and output—or knowledge
representation—involved. Different kinds of output dictate different styles
of algorithm, and at the next level Chapter 4 describes the basic methods of
machine learning, simplified to make them easy to comprehend. Here the prin-
ciples involved are conveyed in a variety of algorithms without getting into
intricate details or tricky implementation issues. To make progress in the appli-
cation of machine learning techniques to particular data mining problems, it is
essential to be able to measure how well you are doing. Chapter 5, which can be
read out of sequence, equips you to evaluate the results obtained from machine
learning, addressing the sometimes complex issues involved in performance
evaluation.
At the lowest and most detailed level, Chapter 6 exposes in naked detail the
nitty-gritty issues of implementing a spectrum of machine learning algorithms,
including the complexities necessary for them to work well in practice. Although
many readers may want to ignore this detailed information, it is at this level that
the full, working, tested implementations of machine learning schemes in Weka
are written. Chapter 7 describes practical topics involved with engineering the
input to machine learning—for example, selecting and discretizing attributes—
and covers several more advanced techniques for refining and combining the
output from different learning techniques. The final chapter of Part I looks to
the future.
The book describes most methods used in practical machine learning.
However, it does not cover reinforcement learning, because it is rarely applied
in practical data mining; genetic algorithm approaches, because these are just
an optimization technique; or relational learning and inductive logic program-
ming, because they are rarely used in mainstream data mining applications.
The data mining system that illustrates the ideas in the book is described in
Part II to clearly separate conceptual material from the practical aspects of how
to use it. You can skip to Part II directly from Chapter 4 if you are in a hurry to
analyze your data and don’t want to be bothered with the technical details.
Java has been chosen for the implementations of machine learning tech-
niques that accompany this book because, as an object-oriented programming
language, it allows a uniform interface to learning schemes and methods for pre-
and postprocessing. We have chosen Java instead of C++, Smalltalk, or other
object-oriented languages because programs written in Java can be run on
almost any computer without having to be recompiled, having to undergo com-
plicated installation procedures, or—worst of all—having to change the code.
A Java program is compiled into byte-code that can be executed on any com-
puter equipped with an appropriate interpreter. This interpreter is called the
Java virtual machine. Java virtual machines—and, for that matter, Java compil-
ers—are freely available for all important platforms.
Like all widely used programming languages, Java has received its share of
criticism. Although this is not the place to elaborate on such issues, in several
cases the critics are clearly right. However, of all currently available program-
ming languages that are widely supported, standardized, and extensively docu-
mented, Java seems to be the best choice for the purpose of this book. Its main
disadvantage is speed of execution—or lack of it. Executing a Java program is
several times slower than running a corresponding program written in C lan-
guage because the virtual machine has to translate the byte-code into machine
code before it can be executed. In our experience the difference is a factor of
three to five if the virtual machine uses a just-in-time compiler. Instead of trans-
lating each byte-code individually, a just-in-time compiler translates whole
chunks of byte-code into machine code, thereby achieving significant speedup.
However, if this is still too slow for your application, there are compilers that
translate Java programs directly into machine code, bypassing the byte-code
step. This code cannot be executed on other platforms, thereby sacrificing one
of Java’s most important advantages.
Updated and revised content
We finished writing the first edition of this book in 1999 and now, in April 2005,
are just polishing this second edition. The areas of data mining and machine
learning have matured in the intervening years. Although the core of material
in this edition remains the same, we have made the most of our opportunity to
update it to reflect the changes that have taken place over 5 years. There have
been errors to fix, errors that we had accumulated in our publicly available errata
file. Surprisingly few were found, and we hope there are even fewer in this
second edition. (The errata for the second edition may be found through the
book’s home page at http://www.cs.waikato.ac.nz/ml/weka/book.html.) We have
thoroughly edited the material and brought it up to date, and we practically
doubled the number of references. The most enjoyable part has been adding
new material. Here are the highlights.
Bowing to popular demand, we have added comprehensive information on
neural networks: the perceptron and closely related Winnow algorithm in
Section 4.6 and the multilayer perceptron and backpropagation algorithm
in Section 6.3. We have included more recent material on implementing
nonlinear decision boundaries using both the kernel perceptron and radial basis
function networks. There is a new section on Bayesian networks, again in
response to readers’ requests, with a description of how to learn classifiers based
on these networks and how to implement them efficiently using all-dimensions
trees.
The Weka machine learning workbench that accompanies the book, a widely
used and popular feature of the first edition, has acquired a radical new look in
the form of an interactive interface—or rather, three separate interactive inter-
faces—that make it far easier to use. The primary one is the Explorer, which
gives access to all of Weka’s facilities using menu selection and form filling. The
others are the Knowledge Flow interface, which allows you to design configu-
rations for streamed data processing, and the Experimenter, with which you set
up automated experiments that run selected machine learning algorithms with
different parameter settings on a corpus of datasets, collect performance statis-
tics, and perform significance tests on the results. These interfaces lower the bar
for becoming a practicing data miner, and we include a full description of how
to use them. However, the book continues to stand alone, independent of Weka,
and to underline this we have moved all material on the workbench into a sep-
arate Part II at the end of the book.
In addition to becoming far easier to use, Weka has grown over the last 5
years and matured enormously in its data mining capabilities. It now includes
an unparalleled range of machine learning algorithms and related techniques.
The growth has been partly stimulated by recent developments in the field and
partly led by Weka users and driven by demand. This puts us in a position in
which we know a great deal about what actual users of data mining want, and
we have capitalized on this experience when deciding what to include in this
new edition.
The earlier chapters, containing more general and foundational material,
have suffered relatively little change. We have added more examples of fielded
applications to Chapter 1, a new subsection on sparse data and a little on string
attributes and date attributes to Chapter 2, and a description of interactive deci-
sion tree construction, a useful and revealing technique to help you grapple with
your data using manually built decision trees, to Chapter 3.
In addition to introducing linear decision boundaries for classification, the
infrastructure for neural networks, Chapter 4 includes new material on multi-
nomial Bayes models for document classification and on logistic regression. The
last 5 years have seen great interest in data mining for text, and this is reflected
in our introduction to string attributes in Chapter 2, multinomial Bayes for doc-
ument classification in Chapter 4, and text transformations in Chapter 7.
Chapter 4 includes a great deal of new material on efficient data structures for
searching the instance space: kD-trees and the recently invented ball trees. These
are used to find nearest neighbors efficiently and to accelerate distance-based
clustering.
Chapter 5 describes the principles of statistical evaluation of machine learn-
ing, which have not changed. The main addition, apart from a note on the Kappa
statistic for measuring the success of a predictor, is a more detailed treatment
of cost-sensitive learning. We describe how to use a classifier, built without
taking costs into consideration, to make predictions that are sensitive to cost;
alternatively, we explain how to take costs into account during the training
process to build a cost-sensitive model. We also cover the popular new tech-
nique of cost curves.
There are several additions to Chapter 6, apart from the previously men-
tioned material on neural networks and Bayesian network classifiers. More
details—gory details—are given of the heuristics used in the successful RIPPER
rule learner. We describe how to use model trees to generate rules for numeric
prediction. We show how to apply locally weighted regression to classification
problems. Finally, we describe the X-means clustering algorithm, which is a big
improvement on traditional k-means.
Chapter 7 on engineering the input and output has changed most, because
this is where recent developments in practical machine learning have been con-
centrated. We describe new attribute selection schemes such as race search and
the use of support vector machines and new methods for combining models
such as additive regression, additive logistic regression, logistic model trees, and
option trees. We give a full account of LogitBoost (which was mentioned in the
first edition but not described). There is a new section on useful transforma-
tions, including principal components analysis and transformations for text
mining and time series. We also cover recent developments in using unlabeled
data to improve classification, including the co-training and co-EM methods.
The final chapter of Part I on new directions and different perspectives has
been reworked to keep up with the times and now includes contemporary chal-
lenges such as adversarial learning and ubiquitous data mining.
Acknowledgments
Writing the acknowledgments is always the nicest part! A lot of people have
helped us, and we relish this opportunity to thank them. This book has arisen
out of the machine learning research project in the Computer Science Depart-
ment at the University of Waikato, New Zealand. We have received generous
encouragement and assistance from the academic staff members on that project:
John Cleary, Sally Jo Cunningham, Matt Humphrey, Lyn Hunt, Bob McQueen,
Lloyd Smith, and Tony Smith. Special thanks go to Mark Hall, Bernhard
Pfahringer, and above all Geoff Holmes, the project leader and source of
inspiration. All who have worked on the machine learning project here have
contributed to our thinking: we would particularly like to mention Steve Garner,
Stuart Inglis, and Craig Nevill-Manning for helping us to get the project off the
ground in the beginning when success was less certain and things were more
difficult.
The Weka system that illustrates the ideas in this book forms a crucial com-
ponent of it. It was conceived by the authors and designed and implemented by
Eibe Frank, along with Len Trigg and Mark Hall. Many people in the machine
learning laboratory at Waikato made significant contributions. Since the first
edition of the book the Weka team has expanded considerably: so many people
have contributed that it is impossible to acknowledge everyone properly. We are
grateful to Remco Bouckaert for his implementation of Bayesian networks, Dale
Fletcher for many database-related aspects, Ashraf Kibriya and Richard Kirkby
for contributions far too numerous to list, Niels Landwehr for logistic model
trees, Abdelaziz Mahoui for the implementation of K*, Stefan Mutter for asso-
ciation rule mining, Gabi Schmidberger and Malcolm Ware for numerous mis-
cellaneous contributions, Tony Voyle for least-median-of-squares regression,
Yong Wang for Pace regression and the implementation of M5′, and Xin Xu for
JRip, logistic regression, and many other contributions. Our sincere thanks go
to all these people for their dedicated work and to the many contributors to
Weka from outside our group at Waikato.
Tucked away as we are in a remote (but very pretty) corner of the Southern
Hemisphere, we greatly appreciate the visitors to our department who play
a crucial role in acting as sounding boards and helping us to develop our
thinking. We would like to mention in particular Rob Holte, Carl Gutwin, and
Russell Beale, each of whom visited us for several months; David Aha, who
although he only came for a few days did so at an early and fragile stage of the
project and performed a great service by his enthusiasm and encouragement;
and Kai Ming Ting, who worked with us for 2 years on many of the topics
described in Chapter 7 and helped to bring us into the mainstream of machine
learning.
Students at Waikato have played a significant role in the development of the
project. Jamie Littin worked on ripple-down rules and relational learning. Brent
Martin explored instance-based learning and nested instance-based representa-
tions. Murray Fife slaved over relational learning, and Nadeeka Madapathage
investigated the use of functional languages for expressing machine learning
algorithms. Other graduate students have influenced us in numerous ways, par-
ticularly Gordon Paynter, YingYing Wen, and Zane Bray, who have worked with
us on text mining. Colleagues Steve Jones and Malika Mahoui have also made
far-reaching contributions to these and other machine learning projects. More
recently we have learned much from our many visiting students from Freiburg,
including Peter Reutemann and Nils Weidmann.
Ian Witten would like to acknowledge the formative role of his former
students at Calgary, particularly Brent Krawchuk, Dave Maulsby, Thong Phan, and
Tanja Mitrovic, all of whom helped him develop his early ideas in machine
learning, as did faculty members Bruce MacDonald, Brian Gaines, and David
Hill at Calgary and John Andreae at the University of Canterbury.
Eibe Frank is indebted to his former supervisor at the University of
Karlsruhe, Klaus-Peter Huber (now with SAS Institute), who infected him with
the fascination of machines that learn. On his travels Eibe has benefited from
interactions with Peter Turney, Joel Martin, and Berry de Bruijn in Canada and
with Luc de Raedt, Christoph Helma, Kristian Kersting, Stefan Kramer, Ulrich
Rückert, and Ashwin Srinivasan in Germany.
Diane Cerra and Asma Stephan of Morgan Kaufmann have worked hard to
shape this book, and Lisa Royse, our production editor, has made the process
go smoothly. Bronwyn Webster has provided excellent support at the Waikato
end.
We gratefully acknowledge the unsung efforts of the anonymous reviewers,
one of whom in particular made a great number of pertinent and constructive
comments that helped us to improve this book significantly. In addition, we
would like to thank the librarians of the Repository of Machine Learning Data-
bases at the University of California, Irvine, whose carefully collected datasets
have been invaluable in our research.
Our research has been funded by the New Zealand Foundation for Research,
Science and Technology and the Royal Society of New Zealand Marsden Fund.
The Department of Computer Science at the University of Waikato has gener-
ously supported us in all sorts of ways, and we owe a particular debt of
gratitude to Mark Apperley for his enlightened leadership and warm encour-
agement. Part of the first edition was written while both authors were visiting
the University of Calgary, Canada, and the support of the Computer Science
department there is gratefully acknowledged—as well as the positive and helpful
attitude of the long-suffering students in the machine learning course on whom
we experimented.
In producing the second edition Ian was generously supported by Canada’s
Informatics Circle of Research Excellence and by the University of Lethbridge
in southern Alberta, which gave him what all authors yearn for—a quiet space
in pleasant and convivial surroundings in which to work.
Last, and most of all, we are grateful to our families and partners. Pam, Anna,
and Nikki were all too well aware of the implications of having an author in the
house (“not again!”) but let Ian go ahead and write the book anyway. Julie was
always supportive, even when Eibe had to burn the midnight oil in the machine
learning lab, and Immo and Ollig provided exciting diversions. Between us we
hail from Canada, England, Germany, Ireland, and Samoa: New Zealand has
brought us together and provided an ideal, even idyllic, place to do this work.
Chapter 1
What's It All About?

Human in vitro fertilization involves collecting several eggs from a woman's
ovaries, which, after fertilization with partner or donor sperm, produce several
embryos. Some of these are selected and transferred to the woman’s uterus. The
problem is to select the “best” embryos to use—the ones that are most likely to
survive. Selection is based on around 60 recorded features of the embryos—
characterizing their morphology, oocyte, follicle, and the sperm sample. The
number of features is sufficiently large that it is difficult for an embryologist to
assess them all simultaneously and correlate historical data with the crucial
outcome of whether that embryo did or did not result in a live child. In a
research project in England, machine learning is being investigated as a tech-
nique for making the selection, using as training data historical records of
embryos and their outcome.
Every year, dairy farmers in New Zealand have to make a tough business deci-
sion: which cows to retain in their herd and which to sell off to an abattoir. Typi-
cally, one-fifth of the cows in a dairy herd are culled each year near the end of
the milking season as feed reserves dwindle. Each cow's breeding and milk
production history influences this decision. Other factors include age (a cow is
nearing the end of its productive life at 8 years), health problems, history of dif-
ficult calving, undesirable temperament traits (kicking or jumping fences), and
not being in calf for the following season. About 700 attributes for each of
several million cows have been recorded over the years. Machine learning is
being investigated as a way of ascertaining what factors are taken into account
by successful farmers—not to automate the decision but to propagate their skills
and experience to others.
Life and death. From Europe to the antipodes. Family and business. Machine
learning is a burgeoning new technology for mining knowledge from data, a
technology that a lot of people are starting to take seriously.
1.1 Data mining and machine learning
We are overwhelmed with data. The amount of data in the world, in our lives,
seems to go on and on increasing—and there’s no end in sight. Omnipresent
personal computers make it too easy to save things that previously we would
have trashed. Inexpensive multigigabyte disks make it too easy to postpone deci-
sions about what to do with all this stuff—we simply buy another disk and keep
it all. Ubiquitous electronics record our decisions, our choices in the super-
market, our financial habits, our comings and goings. We swipe our way through
the world, every swipe a record in a database. The World Wide Web overwhelms
us with information; meanwhile, every choice we make is recorded. And all these
are just personal choices: they have countless counterparts in the world of com-
merce and industry. We would all testify to the growing gap between the gener-
ation of data and our understanding of it. As the volume of data increases,
inexorably, the proportion of it that people understand decreases, alarmingly.
Lying hidden in all this data is information, potentially useful information, that
is rarely made explicit or taken advantage of.
This book is about looking for patterns in data. There is nothing new about
this. People have been seeking patterns in data since human life began. Hunters
seek patterns in animal migration behavior, farmers seek patterns in crop
growth, politicians seek patterns in voter opinion, and lovers seek patterns in
their partners’ responses. A scientist’s job (like a baby’s) is to make sense of data,
to discover the patterns that govern how the physical world works and encap-
sulate them in theories that can be used for predicting what will happen in new
situations. The entrepreneur’s job is to identify opportunities, that is, patterns
in behavior that can be turned into a profitable business, and exploit them.
In data mining, the data is stored electronically and the search is automated—
or at least augmented—by computer. Even this is not particularly new. Econo-
mists, statisticians, forecasters, and communication engineers have long worked
with the idea that patterns in data can be sought automatically, identified,
validated, and used for prediction. What is new is the staggering increase in
opportunities for finding patterns in data. The unbridled growth of databases
in recent years, databases on such everyday activities as customer choices, brings
data mining to the forefront of new business technologies. It has been estimated
that the amount of data stored in the world’s databases doubles every 20
months, and although it would surely be difficult to justify this figure in any
quantitative sense, we can all relate to the pace of growth qualitatively. As the
flood of data swells and machines that can undertake the searching become
commonplace, the opportunities for data mining increase. As the world grows
in complexity, overwhelming us with the data it generates, data mining becomes
our only hope for elucidating the patterns that underlie it. Intelligently analyzed
data is a valuable resource. It can lead to new insights and, in commercial set-
tings, to competitive advantages.
Data mining is about solving problems by analyzing data already present in
databases. Suppose, to take a well-worn example, the problem is fickle customer
loyalty in a highly competitive marketplace. A database of customer choices,
along with customer profiles, holds the key to this problem. Patterns of
behavior of former customers can be analyzed to identify distinguishing charac-
teristics of those likely to switch products and those likely to remain loyal. Once
such characteristics are found, they can be put to work to identify present cus-
tomers who are likely to jump ship. This group can be targeted for special treat-
ment, treatment too costly to apply to the customer base as a whole. More
positively, the same techniques can be used to identify customers who might be
attracted to another service the enterprise provides, one they are not presently
enjoying, to target them for special offers that promote this service. In today’s
highly competitive, customer-centered, service-oriented economy, data is the
raw material that fuels business growth—if only it can be mined.
Data mining is defined as the process of discovering patterns in data. The
process must be automatic or (more usually) semiautomatic. The patterns
discovered must be meaningful in that they lead to some advantage, usually
an economic advantage. The data is invariably present in substantial
quantities.
How are the patterns expressed? Useful patterns allow us to make nontrivial
predictions on new data. There are two extremes for the expression of a pattern:
as a black box whose innards are effectively incomprehensible and as a trans-
parent box whose construction reveals the structure of the pattern. Both, we are
assuming, make good predictions. The difference is whether or not the patterns
that are mined are represented in terms of a structure that can be examined,
reasoned about, and used to inform future decisions. Such patterns we call struc-
tural because they capture the decision structure in an explicit way. In other
words, they help to explain something about the data.
Now, finally, we can say what this book is about. It is about techniques for
finding and describing structural patterns in data. Most of the techniques that
we cover have developed within a field known as machine learning. But first let
us look at what structural patterns are.
Describing structural patterns
What is meant by structural patterns? How do you describe them? And what
form does the input take? We will answer these questions by way of illustration
rather than by attempting formal, and ultimately sterile, definitions. There will
be plenty of examples later in this chapter, but let’s examine one right now to
get a feeling for what we’re talking about.
Look at the contact lens data in Table 1.1. This gives the conditions under
which an optician might want to prescribe soft contact lenses, hard contact
lenses, or no contact lenses at all; we will say more about what the individual
Table 1.1 The contact lens data.
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
young           myope                    no            reduced                none
young           myope                    no            normal                 soft
young           myope                    yes           reduced                none
young           myope                    yes           normal                 hard
young           hypermetrope             no            reduced                none
young           hypermetrope             no            normal                 soft
young           hypermetrope             yes           reduced                none
young           hypermetrope             yes           normal                 hard
pre-presbyopic  myope                    no            reduced                none
pre-presbyopic  myope                    no            normal                 soft
pre-presbyopic  myope                    yes           reduced                none
pre-presbyopic  myope                    yes           normal                 hard
pre-presbyopic  hypermetrope             no            reduced                none
pre-presbyopic  hypermetrope             no            normal                 soft
pre-presbyopic  hypermetrope             yes           reduced                none
pre-presbyopic  hypermetrope             yes           normal                 none
presbyopic      myope                    no            reduced                none
presbyopic      myope                    no            normal                 none
presbyopic      myope                    yes           reduced                none
presbyopic      myope                    yes           normal                 hard
presbyopic      hypermetrope             no            reduced                none
presbyopic      hypermetrope             no            normal                 soft
presbyopic      hypermetrope             yes           reduced                none
presbyopic      hypermetrope             yes           normal                 none
features mean later. Each line of the table is one of the examples. Part of a
structural description of this information might be as follows:
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft
Structural descriptions need not necessarily be couched as rules such as these.
Decision trees, which specify the sequences of decisions that need to be made
and the resulting recommendation, are another popular means of expression.
This example is a very simplistic one. First, all combinations of possible
values are represented in the table. There are 24 rows, representing three possi-
ble values of age and two values each for spectacle prescription, astigmatism,
and tear production rate (3 × 2 × 2 × 2 = 24). The rules do not really
generalize from the data; they merely summarize it. In most learning situations, the set
of examples given as input is far from complete, and part of the job is to gen-
eralize to other, new examples. You can imagine omitting some of the rows in
the table for which tear production rate is reduced and still coming up with the
rule
If tear production rate = reduced then recommendation = none
which would generalize to the missing rows and fill them in correctly. Second,
values are specified for all the features in all the examples. Real-life datasets
invariably contain examples in which the values of some features, for some
reason or other, are unknown—for example, measurements were not taken or
were lost. Third, the preceding rules classify the examples correctly, whereas
often, because of errors or noise in the data, misclassifications occur even on the
data that is used to train the classifier.
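To make the rule representation concrete, the two rules quoted above can be written directly as code. The following is a minimal hand-written sketch in plain Java (it is not Weka code, and the class and method names are our own for illustration); it applies the rules to an example described by the four attribute values of Table 1.1, and its main method enumerates the full instance space of 3 × 2 × 2 × 2 = 24 combinations.

```java
// A hand-written classifier implementing the two rules quoted above
// (illustrative only, not Weka code and not a learned model): it
// recommends a contact lens type from the four attribute values of
// Table 1.1, given as strings.
public class ContactLensRules {

    static String recommend(String age, String prescription,
                            String astigmatism, String tearRate) {
        // Rule 1: if tear production rate = reduced then recommendation = none
        if (tearRate.equals("reduced")) {
            return "none";
        }
        // Rule 2: otherwise, if age = young and astigmatic = no then soft
        if (age.equals("young") && astigmatism.equals("no")) {
            return "soft";
        }
        // The two rules cover only part of the instance space; further
        // rules would be needed for the remaining cases.
        return "unknown";
    }

    public static void main(String[] args) {
        // Enumerate the full instance space: 3 * 2 * 2 * 2 = 24 examples,
        // exactly the 24 rows of Table 1.1.
        String[] ages = {"young", "pre-presbyopic", "presbyopic"};
        String[] prescriptions = {"myope", "hypermetrope"};
        String[] astigmatisms = {"no", "yes"};
        String[] tearRates = {"reduced", "normal"};
        int count = 0;
        for (String a : ages)
            for (String p : prescriptions)
                for (String s : astigmatisms)
                    for (String t : tearRates) {
                        System.out.println(a + " " + p + " " + s + " " + t
                                + " -> " + recommend(a, p, s, t));
                        count++;
                    }
        System.out.println(count + " examples");  // prints "24 examples"
    }
}
```

Comparing the printed recommendations against Table 1.1 shows exactly the gap discussed above: the rules reproduce every row where tear production is reduced, but leave many of the remaining rows unclassified.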
Machine learning
Now that we have some idea about the inputs and outputs, let’s turn to machine
learning. What is learning, anyway? What is machine learning? These are philo-
sophic questions, and we will not be much concerned with philosophy in this
book; our emphasis is firmly on the practical. However, it is worth spending a
few moments at the outset on fundamental issues, just to see how tricky they
are, before rolling up our sleeves and looking at machine learning in practice.
Our dictionary defines “to learn” as follows:
To get knowledge of by study, experience, or being taught;
To become aware by information or from observation;
To commit to memory;
To be informed of, ascertain;
To receive instruction.
These meanings have some shortcomings when it comes to talking about
computers. For the first two, it is virtually impossible to test whether learning has
been achieved or not. How do you know whether a machine has got knowledge
of something? You probably can’t just ask it questions; even if you could, you
wouldn’t be testing its ability to learn but would be testing its ability to answer
questions. How do you know whether it has become aware of something? The
whole question of whether computers can be aware, or conscious, is a burning
philosophic issue. As for the last three meanings, although we can see what they
denote in human terms, merely “committing to memory” and “receiving
instruction” seem to fall far short of what we might mean by machine learning.
They are too passive, and we know that computers find these tasks trivial.
Instead, we are interested in improvements in performance, or at least in the
potential for performance, in new situations. You can “commit something to
memory” or “be informed of something” by rote learning without being able to
apply the new knowledge to new situations. You can receive instruction without
benefiting from it at all.
Earlier we defined data mining operationally as the process of discovering
patterns, automatically or semiautomatically, in large quantities of data—and
the patterns must be useful. An operational definition can be formulated in the
same way for learning:
Things learn when they change their behavior in a way that makes them
perform better in the future.
This ties learning to performance rather than knowledge. You can test learning
by observing the behavior and comparing it with past behavior. This is a much
more objective kind of definition and appears to be far more satisfactory.
But there’s still a problem. Learning is a rather slippery concept. Lots of things
change their behavior in ways that make them perform better in the future, yet
we wouldn’t want to say that they have actually learned. A good example is a
comfortable slipper. Has it learned the shape of your foot? It has certainly
changed its behavior to make it perform better as a slipper! Yet we would hardly
want to call this learning. In everyday language, we often use the word “train-
ing” to denote a mindless kind of learning. We train animals and even plants,
although it would be stretching the word a bit to talk of training objects such
as slippers that are not in any sense alive. But learning is different. Learning
implies thinking. Learning implies purpose. Something that learns has to do so
intentionally. That is why we wouldn’t say that a vine has learned to grow round
a trellis in a vineyard—we’d say it has been trained. Learning without purpose
is merely training. Or, more to the point, in learning the purpose is the learner’s,
whereas in training it is the teacher’s.
Thus on closer examination the second definition of learning, in operational,
performance-oriented terms, has its own problems when it comes to talking about
computers. To decide whether something has actually learned, you need to see
whether it intended to or whether there was any purpose involved. That makes
the concept moot when applied to machines because whether artifacts can behave
purposefully is unclear. Philosophic discussions of what is really meant by “learn-
ing,” like discussions of what is really meant by “intention” or “purpose,” are
fraught with difficulty. Even courts of law find intention hard to grapple with.
Data mining
Fortunately, the kind of learning techniques explained in this book do not
present these conceptual problems—they are called machine learning without
really presupposing any particular philosophic stance about what learning actu-
ally is. Data mining is a practical topic and involves learning in a practical, not
a theoretical, sense. We are interested in techniques for finding and describing
structural patterns in data as a tool for helping to explain that data and make
predictions from it. The data will take the form of a set of examples—examples
of customers who have switched loyalties, for instance, or situations in which
certain kinds of contact lenses can be prescribed. The output takes the form of
predictions about new examples—a prediction of whether a particular customer
will switch or a prediction of what kind of lens will be prescribed under given
circumstances. But because this book is about finding and describing patterns
in data, the output may also include an actual description of a structure that
can be used to classify unknown examples to explain the decision. As well as
performance, it is helpful to supply an explicit representation of the knowledge
that is acquired. In essence, this reflects both definitions of learning considered
previously: the acquisition of knowledge and the ability to use it.
Many learning techniques look for structural descriptions of what is learned,
descriptions that can become fairly complex and are typically expressed as sets
of rules such as the ones described previously or the decision trees described
later in this chapter. Because they can be understood by people, these descrip-
tions serve to explain what has been learned and explain the basis for new pre-
dictions. Experience shows that in many applications of machine learning to
data mining, the explicit knowledge structures that are acquired, the structural
descriptions, are at least as important, and often very much more important,
than the ability to perform well on new examples. People frequently use data
mining to gain knowledge, not just predictions. Gaining knowledge from data
certainly sounds like a good idea if you can do it. To find out how, read on!
1.2 Simple examples: The weather problem and others
We use a lot of examples in this book, which seems particularly appropriate con-
sidering that the book is all about learning from examples! There are several
1.2 SIMPLE EXAMPLES: THE WEATHER PROBLEM AND OTHERS 9
standard datasets that we will come back to repeatedly. Different datasets tend
to expose new issues and challenges, and it is interesting and instructive to have
in mind a variety of problems when considering learning methods. In fact, the
need to work with different datasets is so important that a corpus containing
around 100 example problems has been gathered together so that different algo-
rithms can be tested and compared on the same set of problems.
The illustrations in this section are all unrealistically simple. Serious appli-
cation of data mining involves thousands, hundreds of thousands, or even mil-
lions of individual cases. But when explaining what algorithms do and how they
work, we need simple examples that capture the essence of the problem but are
small enough to be comprehensible in every detail. We will be working with the
illustrations in this section throughout the book, and they are intended to be
“academic” in the sense that they will help us to understand what is going on.
Some actual fielded applications of learning techniques are discussed in Section
1.3, and many more are covered in the books mentioned in the Further reading
section at the end of the chapter.
Another problem with actual real-life datasets is that they are often propri-
etary. No one is going to share their customer and product choice database with
you so that you can understand the details of their data mining application and
how it works. Corporate data is a valuable asset, one whose value has increased
enormously with the development of data mining techniques such as those
described in this book. Yet we are concerned here with understanding how the
methods used for data mining work and understanding the details of these
methods so that we can trace their operation on actual data. That is why our
illustrations are simple ones. But they are not simplistic: they exhibit the fea-
tures of real datasets.
The weather problem
The weather problem is a tiny dataset that we will use repeatedly to illustrate
machine learning methods. Entirely fictitious, it supposedly concerns the con-
ditions that are suitable for playing some unspecified game. In general, instances
in a dataset are characterized by the values of features, or attributes, that measure
different aspects of the instance. In this case there are four attributes: outlook,
temperature, humidity, and windy. The outcome is whether to play or not.
In its simplest form, shown in Table 1.2, all four attributes have values that
are symbolic categories rather than numbers. Outlook can be sunny, overcast, or
rainy; temperature can be hot, mild, or cool; humidity can be high or normal;
and windy can be true or false. This creates 36 possible combinations (3 × 3 ×
2 × 2 = 36), of which 14 are present in the set of input examples.
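The combination count is easy to check mechanically. A minimal sketch in Python, with the attribute values as listed above:

```python
from itertools import product

# Possible values of the four nominal attributes.
outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = [True, False]

# Enumerate every combination of attribute values: 3 * 3 * 2 * 2 = 36.
combinations = list(product(outlook, temperature, humidity, windy))
print(len(combinations))  # 36
```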
A set of rules learned from this information—not necessarily a very good
one—might look as follows:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
These rules are meant to be interpreted in order: the first one, then if it doesn’t
apply the second, and so on. A set of rules that are intended to be interpreted
in sequence is called a decision list. Interpreted as a decision list, the rules
correctly classify all of the examples in the table, whereas taken individually, out
of context, some of the rules are incorrect. For example, the rule if humidity =
normal then play = yes gets one of the examples wrong (check which one).
The meaning of a set of rules depends on how it is interpreted—not
surprisingly!
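The difference between the two readings can be checked directly. A sketch in Python, with the 14 examples transcribed from Table 1.2 and the rules applied in order:

```python
# The 14 weather examples of Table 1.2: (outlook, temperature, humidity, windy, play).
data = [
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

def decision_list(outlook, temperature, humidity, windy):
    """Apply the rules in order; the first one that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule

# Interpreted in sequence, the list classifies every example correctly.
assert all(decision_list(*x[:4]) == x[4] for x in data)

# Taken out of context, the humidity rule alone gets one example wrong:
wrong = [x for x in data if x[2] == "normal" and x[4] != "yes"]
print(wrong)  # [('rainy', 'cool', 'normal', True, 'no')]
```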
In the slightly more complex form shown in Table 1.3, two of the attributes—
temperature and humidity—have numeric values. This means that any learn-
ing method must create inequalities involving these attributes rather than
simple equality tests, as in the former case. This is called a numeric-attribute
problem—in this case, a mixed-attribute problem because not all attributes are
numeric.
Now the first rule given earlier might take the following form:
If outlook = sunny and humidity > 83 then play = no
A slightly more complex process is required to come up with rules that involve
numeric tests.
Table 1.2 The weather data.
Outlook Temperature Humidity Windy Play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
The rules we have seen so far are classification rules: they predict the
classification of the example in terms of whether to play or not. It is equally possible
to disregard the classification and just look for any rules that strongly associate
different attribute values. These are called association rules. Many association
rules can be derived from the weather data in Table 1.2. Some good ones are as
follows:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny
and humidity = high.
All these rules are 100% correct on the given data; they make no false predic-
tions. The first two apply to four examples in the dataset, the third to three
examples, and the fourth to two examples. There are many other rules: in fact,
nearly 60 association rules can be found that apply to two or more examples of
the weather data and are completely correct on this data. If you look for rules
that are less than 100% correct, then you will find many more. There are so
many because unlike classification rules, association rules can “predict” any of
the attributes, not just a specified class, and can even predict more than one
thing. For example, the fourth rule predicts both that outlook will be sunny and
that humidity will be high.
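The coverage and accuracy of such rules can be verified directly on the data. A sketch, using the same tuple encoding of the 14 examples of Table 1.2:

```python
# The 14 weather examples of Table 1.2: (outlook, temperature, humidity, windy, play).
data = [
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

def check(antecedent, consequent):
    """Return (coverage, confidence) of the rule antecedent => consequent."""
    covered = [x for x in data if antecedent(x)]
    correct = [x for x in covered if consequent(x)]
    return len(covered), len(correct) / len(covered)

# If temperature = cool then humidity = normal
print(check(lambda x: x[1] == "cool", lambda x: x[2] == "normal"))  # (4, 1.0)

# If windy = false and play = no then outlook = sunny and humidity = high
print(check(lambda x: not x[3] and x[4] == "no",
            lambda x: x[0] == "sunny" and x[2] == "high"))          # (2, 1.0)
```

Both rules come out 100% correct, applying to four and two examples respectively, as stated above.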
Table 1.3 Weather data with some numeric attributes.
Outlook Temperature Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
Contact lenses: An idealized problem
The contact lens data introduced earlier tells you the kind of contact lens to pre-
scribe, given certain information about a patient. Note that this example is
intended for illustration only: it grossly oversimplifies the problem and should
certainly not be used for diagnostic purposes!
The first column of Table 1.1 gives the age of the patient. In case you’re won-
dering, presbyopia is a form of longsightedness that accompanies the onset of
middle age. The second gives the spectacle prescription: myope means short-
sighted and hypermetrope means longsighted. The third shows whether the
patient is astigmatic, and the fourth relates to the rate of tear production, which
is important in this context because tears lubricate contact lenses. The final
column shows which kind of lenses to prescribe: hard, soft, or none. All possi-
ble combinations of the attribute values are represented in the table.
A sample set of rules learned from this information is shown in Figure 1.1.
This is a rather large set of rules, but they do correctly classify all the examples.
These rules are complete and deterministic: they give a unique prescription for
every conceivable example. Generally, this is not the case. Sometimes there are
situations in which no rule applies; other times more than one rule may apply,
resulting in conflicting recommendations. Sometimes probabilities or weights
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and
tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and
tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and
tear production rate = normal then recommendation = hard
If age = pre-presbyopic and
spectacle prescription = hypermetrope and astigmatic = yes
then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
Figure 1.1 Rules for the contact lens data.
may be associated with the rules themselves to indicate that some are more
important, or more reliable, than others.
You might be wondering whether there is a smaller rule set that performs as
well. If so, would you be better off using the smaller rule set and, if so, why?
These are exactly the kinds of questions that will occupy us in this book. Because
the examples form a complete set for the problem space, the rules do no more
than summarize all the information that is given, expressing it in a different and
more concise way. Even though it involves no generalization, this is often a very
useful thing to do! People frequently use machine learning techniques to gain
insight into the structure of their data rather than to make predictions for new
cases. In fact, a prominent and successful line of research in machine learning
began as an attempt to compress a huge database of possible chess endgames
and their outcomes into a data structure of reasonable size. The data structure
chosen for this enterprise was not a set of rules but a decision tree.
Figure 1.2 shows a structural description for the contact lens data in the form
of a decision tree, which for many purposes is a more concise and perspicuous
representation of the rules and has the advantage that it can be visualized more
easily. (However, this decision tree—in contrast to the rule set given in Figure
1.1—classifies two examples incorrectly.) The tree calls first for a test on tear
production rate, and the first two branches correspond to the two possible out-
comes. If tear production rate is reduced (the left branch), the outcome is none.
If it is normal (the right branch), a second test is made, this time on astigma-
tism. Eventually, whatever the outcome of the tests, a leaf of the tree is reached
tear production rate
├─ reduced → none
└─ normal → astigmatism
   ├─ no → soft
   └─ yes → spectacle prescription
      ├─ myope → hard
      └─ hypermetrope → none

Figure 1.2 Decision tree for the contact lens data.
that dictates the contact lens recommendation for that case. The question of
what is the most natural and easily understood format for the output from a
machine learning scheme is one that we will return to in Chapter 3.
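A decision tree like this reads directly as nested tests. A minimal sketch in Python (attribute values as strings, names as in Figure 1.2):

```python
def recommend(tear_production_rate, astigmatism, spectacle_prescription):
    """The contact lens decision tree of Figure 1.2 as nested conditionals."""
    if tear_production_rate == "reduced":
        return "none"
    # tear production rate is normal: test astigmatism next
    if astigmatism == "no":
        return "soft"
    # astigmatic: the spectacle prescription decides
    if spectacle_prescription == "myope":
        return "hard"
    return "none"  # hypermetrope

print(recommend("reduced", "no", "myope"))   # none
print(recommend("normal", "yes", "myope"))   # hard
```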
Irises: A classic numeric dataset
The iris dataset, which dates back to seminal work by the eminent statistician
R.A. Fisher in the mid-1930s and is arguably the most famous dataset used in
data mining, contains 50 examples each of three types of plant: Iris setosa, Iris
versicolor, and Iris virginica. It is excerpted in Table 1.4. There are four attrib-
utes: sepal length, sepal width, petal length, and petal width (all measured in cen-
timeters). Unlike previous datasets, all attributes have values that are numeric.
The following set of rules might be learned from this dataset:
If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
If sepal width < 2.45 and petal length < 4.55 then Iris versicolor
If sepal width < 2.95 and petal width < 1.35 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.45 then Iris versicolor
If sepal length ≥ 5.85 and petal length < 4.75 then Iris versicolor
Table 1.4 The iris data.
Sepal length (cm)  Sepal width (cm)  Petal length (cm)  Petal width (cm)  Type
1 5.1 3.5 1.4 0.2 Iris setosa
2 4.9 3.0 1.4 0.2 Iris setosa
3 4.7 3.2 1.3 0.2 Iris setosa
4 4.6 3.1 1.5 0.2 Iris setosa
5 5.0 3.6 1.4 0.2 Iris setosa
. . .
51 7.0 3.2 4.7 1.4 Iris versicolor
52 6.4 3.2 4.5 1.5 Iris versicolor
53 6.9 3.1 4.9 1.5 Iris versicolor
54 5.5 2.3 4.0 1.3 Iris versicolor
55 6.5 2.8 4.6 1.5 Iris versicolor
. . .
101 6.3 3.3 6.0 2.5 Iris virginica
102 5.8 2.7 5.1 1.9 Iris virginica
103 7.1 3.0 5.9 2.1 Iris virginica
104 6.3 2.9 5.6 1.8 Iris virginica
105 6.5 3.0 5.8 2.2 Iris virginica
. . .
If sepal width < 2.55 and petal length < 4.95 and
petal width < 1.55 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.95 and
petal width < 1.55 then Iris versicolor
If sepal length ≥ 6.55 and petal length < 5.05 then Iris versicolor
If sepal width < 2.75 and petal width < 1.65 and
sepal length < 6.05 then Iris versicolor
If sepal length ≥ 5.85 and sepal length < 5.95 and
petal length < 4.85 then Iris versicolor
If petal length ≥ 5.15 then Iris virginica
If petal width ≥ 1.85 then Iris virginica
If petal width ≥ 1.75 and sepal width < 3.05 then Iris virginica
If petal length ≥ 4.95 and petal width < 1.55 then Iris virginica
These rules are very cumbersome, and we will see in Chapter 3 how more
compact rules can be expressed that convey the same information.
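Whether these rules are meant to be tried in order is not spelled out above; reading them top to bottom as a decision list, they can be applied to the sample rows of Table 1.4. A sketch:

```python
# The 15 rules above as (condition, class) pairs, tried top to bottom.
# Arguments: sepal length, sepal width, petal length, petal width (cm).
rules = [
    (lambda sl, sw, pl, pw: pl < 2.45, "Iris setosa"),
    (lambda sl, sw, pl, pw: sw < 2.10, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.45 and pl < 4.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.95 and pw < 1.35, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.45, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sl >= 5.85 and pl < 4.75, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.55 and pl < 4.95 and pw < 1.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.95 and pw < 1.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sl >= 6.55 and pl < 5.05, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.75 and pw < 1.65 and sl < 6.05, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 5.85 <= sl < 5.95 and pl < 4.85, "Iris versicolor"),
    (lambda sl, sw, pl, pw: pl >= 5.15, "Iris virginica"),
    (lambda sl, sw, pl, pw: pw >= 1.85, "Iris virginica"),
    (lambda sl, sw, pl, pw: pw >= 1.75 and sw < 3.05, "Iris virginica"),
    (lambda sl, sw, pl, pw: pl >= 4.95 and pw < 1.55, "Iris virginica"),
]

def classify(sl, sw, pl, pw):
    for condition, label in rules:
        if condition(sl, sw, pl, pw):
            return label
    return None  # no rule fires

# Rows 1, 51 and 101 of Table 1.4:
print(classify(5.1, 3.5, 1.4, 0.2))  # Iris setosa
print(classify(7.0, 3.2, 4.7, 1.4))  # Iris versicolor
print(classify(6.3, 3.3, 6.0, 2.5))  # Iris virginica
```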
CPU performance: Introducing numeric prediction
Although the iris dataset involves numeric attributes, the outcome—the type of
iris—is a category, not a numeric value. Table 1.5 shows some data for which
the outcome and the attributes are numeric. It concerns the relative perform-
ance of computer processing power on the basis of a number of relevant
attributes; each row represents 1 of 209 different computer configurations.
The classic way of dealing with continuous prediction is to write the outcome
as a linear sum of the attribute values with appropriate weights, for example:
Table 1.5 The CPU performance data.
Cycle time (ns)  Main memory min (KB)  Main memory max (KB)  Cache (KB)  Channels min  Channels max  Performance
MYCT  MMIN  MMAX  CACH  CHMIN  CHMAX  PRP
1 125 256 6000 256 16 128 198
2 29 8000 32000 32 8 32 269
3 29 8000 32000 32 8 32 220
4 29 8000 32000 32 8 32 172
5 29 8000 16000 32 8 16 132
. . .
207 125 2000 8000 0 2 14 52
208 480 512 8000 32 0 0 67
209 480 1000 4000 0 0 0 45
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX.
(The abbreviated variable names are given in the second row of the table.) This
is called a regression equation, and the process of determining the weights is
called regression, a well-known procedure in statistics that we will review in
Chapter 4. However, the basic regression method is incapable of discovering
nonlinear relationships (although variants do exist—indeed, one will be
described in Section 6.3), and in Chapter 3 we will examine different represen-
tations that can be used for predicting numeric quantities.
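Evaluating the regression equation is just a weighted sum. A sketch, applied to the first configuration of Table 1.5 (the equation is fit to all 209 rows at once, so an individual prediction can be well off the actual value):

```python
# Weights of the regression equation, keyed by attribute name.
weights = {"MYCT": 0.0489, "MMIN": 0.0153, "MMAX": 0.0056,
           "CACH": 0.6410, "CHMIN": -0.2700, "CHMAX": 1.480}
intercept = -55.9

def predict(example):
    """PRP as a linear sum of weighted attribute values plus the intercept."""
    return intercept + sum(w * example[name] for name, w in weights.items())

# First configuration of Table 1.5 (actual PRP is 198).
row1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
        "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict(row1), 2))  # 336.95
```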
In the iris and central processing unit (CPU) performance data, all the
attributes have numeric values. Practical situations frequently present a mixture
of numeric and nonnumeric attributes.
Labor negotiations: A more realistic example
The labor negotiations dataset in Table 1.6 summarizes the outcome of Cana-
dian contract negotiations in 1987 and 1988. It includes all collective agreements
reached in the business and personal services sector for organizations with at
least 500 members (teachers, nurses, university staff, police, etc.). Each case con-
cerns one contract, and the outcome is whether the contract is deemed accept-
able or unacceptable. The acceptable contracts are ones in which agreements
were accepted by both labor and management. The unacceptable ones are either
known offers that fell through because one party would not accept them or
acceptable contracts that had been significantly perturbed to the extent that, in
the view of experts, they would not have been accepted.
There are 40 examples in the dataset (plus another 17 which are normally
reserved for test purposes). Unlike the other tables here, Table 1.6 presents the
examples as columns rather than as rows; otherwise, it would have to be
stretched over several pages. Many of the values are unknown or missing, as
indicated by question marks.
This is a much more realistic dataset than the others we have seen. It con-
tains many missing values, and it seems unlikely that an exact classification can
be obtained.
Figure 1.3 shows two decision trees that represent the dataset. Figure 1.3(a)
is simple and approximate: it doesn’t represent the data exactly. For example, it
will predict bad for some contracts that are actually marked good. But it does
make intuitive sense: a contract is bad (for the employee!) if the wage increase
in the first year is too small (less than 2.5%). If the first-year wage increase is
larger than this, it is good if there are lots of statutory holidays (more than 10
days). Even if there are fewer statutory holidays, it is good if the first-year wage
increase is large enough (more than 4%).
Figure 1.3(b) is a more complex decision tree that represents the same
dataset. In fact, this is a more accurate representation of the actual dataset that
was used to create the tree. But it is not necessarily a more accurate representa-
tion of the underlying concept of good versus bad contracts. Look down the left
branch. It doesn’t seem to make sense intuitively that, if the working hours
exceed 36, a contract is bad if there is no health-plan contribution or a full
health-plan contribution but is good if there is a half health-plan contribution.
It is certainly reasonable that the health-plan contribution plays a role in the
decision but not if half is good and both full and none are bad. It seems likely
that this is an artifact of the particular values used to create the decision tree
rather than a genuine feature of the good versus bad distinction.
The tree in Figure 1.3(b) is more accurate on the data that was used to train
the classifier but will probably perform less well on an independent set of test
data. It is “overfitted” to the training data—it follows it too slavishly. The tree
in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of
pruning, which we will learn more about in Chapter 6.
Soybean classification: A classic machine learning success
An often-quoted early success story in the application of machine learning to
practical problems is the identification of rules for diagnosing soybean diseases.
The data is taken from questionnaires describing plant diseases. There are about
Table 1.6 The labor negotiations data.
Attribute Type 1 2 3 . . . 40
duration years 1 2 3 2
wage increase 1st year percentage 2% 4% 4.3% 4.5
wage increase 2nd year percentage ? 5% 4.4% 4.0
wage increase 3rd year percentage ? ? ? ?
cost of living adjustment {none, tcf, tc} none tcf ? none
working hours per week hours 28 35 38 40
pension {none, ret-allw, empl-cntr} none ? ? ?
standby pay percentage ? 13% ? ?
shift-work supplement percentage ? 5% 4% 4
education allowance {yes, no} yes ? ? ?
statutory holidays days 11 15 12 12
vacation {below-avg, avg, gen} avg gen gen avg
long-term disability assistance {yes, no} no ? ? yes
dental plan contribution {none, half, full} none ? full full
bereavement assistance {yes, no} no ? ? yes
health plan contribution {none, half, full} none ? full half
acceptability of contract {good, bad} bad good good good
(a)
wage increase first year
├─ ≤ 2.5 → bad
└─ > 2.5 → statutory holidays
   ├─ > 10 → good
   └─ ≤ 10 → wage increase first year
      ├─ ≤ 4 → bad
      └─ > 4 → good

(b)
wage increase first year
├─ ≤ 2.5 → working hours per week
│  ├─ ≤ 36 → bad
│  └─ > 36 → health plan contribution
│     ├─ none → bad
│     ├─ half → good
│     └─ full → bad
└─ > 2.5 → statutory holidays
   ├─ > 10 → good
   └─ ≤ 10 → wage increase first year
      ├─ ≤ 4 → bad
      └─ > 4 → good

Figure 1.3 Decision trees for the labor negotiations data.
680 examples, each representing a diseased plant. Plants were measured on 35
attributes, each one having a small set of possible values. Examples are labeled
with the diagnosis of an expert in plant biology: there are 19 disease categories
altogether—horrible-sounding diseases such as diaporthe stem canker, rhizoc-
tonia root rot, and bacterial blight, to mention just a few.
Table 1.7 gives the attributes, the number of different values that each can
have, and a sample record for one particular plant. The attributes are placed into
different categories just to make them easier to read.
Here are two example rules, learned from this data:
If [leaf condition is normal and
stem condition is abnormal and
stem cankers is below soil line and
canker lesion color is brown]
then
diagnosis is rhizoctonia root rot
If [leaf malformation is absent and
stem condition is abnormal and
stem cankers is below soil line and
canker lesion color is brown]
then
diagnosis is rhizoctonia root rot
These rules nicely illustrate the potential role of prior knowledge—often called
domain knowledge—in machine learning, because the only difference between
the two descriptions is leaf condition is normal versus leaf malformation is
absent. Now, in this domain, if the leaf condition is normal then leaf malfor-
mation is necessarily absent, so one of these conditions happens to be a special
case of the other. Thus if the first rule is true, the second is necessarily true as
well. The only time the second rule comes into play is when leaf malformation
is absent but leaf condition is not normal, that is, when something other than
malformation is wrong with the leaf. This is certainly not apparent from a casual
reading of the rules.
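The subsumption can be made explicit by enumerating the leaf states (the leaf-state encoding here is invented for illustration; the stem tests, identical in both rules, are factored out):

```python
from itertools import product

# Possible leaf states. Domain knowledge excludes a normal leaf that is
# malformed: a normal leaf has, among other things, no malformation.
leaf_states = [(condition, malformation)
               for condition, malformation in product(["normal", "abnormal"],
                                                      ["absent", "present"])
               if not (condition == "normal" and malformation == "present")]

separating = []
for condition, malformation in leaf_states:
    rule1 = condition == "normal"       # "leaf condition is normal"
    rule2 = malformation == "absent"    # "leaf malformation is absent"
    assert not rule1 or rule2           # whenever rule 1 fires, so does rule 2
    if rule2 and not rule1:
        separating.append((condition, malformation))

# Only one leaf state distinguishes the two rules: malformation absent,
# but the leaf abnormal in some other way.
print(separating)  # [('abnormal', 'absent')]
```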
Research on this problem in the late 1970s found that these diagnostic rules
could be generated by a machine learning algorithm, along with rules for every
other disease category, from about 300 training examples. These training
examples were carefully selected from the corpus of cases as being quite differ-
ent from one another—“far apart” in the example space. At the same time, the
plant pathologist who had produced the diagnoses was interviewed, and his
expertise was translated into diagnostic rules. Surprisingly, the computer-
generated rules outperformed the expert-derived rules on the remaining test
examples. They gave the correct disease top ranking 97.5% of the time com-
pared with only 72% for the expert-derived rules. Furthermore, not only did
Table 1.7 The soybean data.
Attribute  Number of values  Sample value
Environment time of occurrence 7 July
precipitation 3 above normal
temperature 3 normal
cropping history 4 same as last year
hail damage 2 yes
damaged area 4 scattered
severity 3 severe
plant height 2 normal
plant growth 2 abnormal
seed treatment 3 fungicide
germination 3 less than 80%
Seed condition 2 normal
mold growth 2 absent
discoloration 2 absent
size 2 normal
shriveling 2 absent
Fruit condition of fruit pods 3 normal
fruit spots 5 —
Leaf condition 2 abnormal
leaf spot size 3 —
yellow leaf spot halo 3 absent
leaf spot margins 3 —
shredding 2 absent
leaf malformation 2 absent
leaf mildew growth 3 absent
Stem condition 2 abnormal
stem lodging 2 yes
stem cankers 4 above soil line
canker lesion color 3 —
fruiting bodies on stems 2 present
external decay of stem 3 firm and dry
mycelium on stem 2 absent
internal discoloration 3 none
sclerotia 2 absent
Root condition 3 normal
Diagnosis  19  diaporthe stem canker