Introduction to Data Mining



AlphaMiner: Open Source Data Mining Platform
Whitepaper: Introduction to Data Mining
© Copyright E-Business Technology Institute, The University of Hong Kong, 2005. All rights reserved.
Preface

Business Intelligence (BI) refers to the process of transforming company business data into information and business knowledge that support business decision making. Many companies regard business intelligence as a necessary means to improve competitive capability, enhance customer relationships, increase market share and reduce business costs. With timely and accurate business information discovered from company business operation data, BI enables company managers to make effective business decisions, resulting in better business outcomes.

The core technologies for enabling BI include data warehousing, OLAP and data mining. Data warehousing technology is used to integrate business data from separate operational systems for analysis. Without good use of the integrated data, a data warehouse itself cannot generate business benefits. OLAP and data mining use the data to generate information and business knowledge for decision making.

OLAP tools are used by business managers and analysts to generate various business reports, such as sales reports on different product categories, inventory reports or reports comparing sales during different time periods. OLAP tools feature functions that enable users to create ad hoc reports and, as such, are more flexible than the traditional MIS systems.

Data mining tools are used to build predictive models that can predict likely business events, such as which customers are likely to defect and which products are likely to sell more easily. Data mining is a useful technology in marketing, market intelligence, customer relationship management, customer retention, credit risk management, business fraud detection and many other business applications.

Building a data mining model from historical business data involves a number of operations, including:

  • Accessing data stored in a data source such as a database or data warehouse
  • Cleansing data to remove errors and treat missing values
  • Transforming data into the right formats for model building
  • Building models using data mining algorithms
  • Assessing models to choose the best one
  • Using models to score data to generate results, such as selecting the right customers as marketing targets

A given business problem is solved by a data mining process that consists of a series of the data mining operations mentioned above. Solving different business problems requires different data mining processes. Therefore, data mining processes represent knowledge on how to solve business problems using data mining technology. A data mining tool contains data operations that enable users to create data mining processes.

What is Data Mining?

Data mining refers to the process of discovering interesting business information and knowledge from large databases. Such information and knowledge come in different forms. It could be a set of rules, which can show, among other things, associations of products frequently bought at the same time by customers at a supermarket, or a classification model that replicates the classification scheme hidden in a classified dataset. It could be shown as a cluster of data that differs significantly from the rest of the data. It could also be shown as a graph that reveals a strong correlation between selected fields in the data.

A data mining process consists of the entire set of steps, including problem definition, data selection, data preprocessing, model building, model validation and
verification, and model deployment. The actual data mining part involves running an algorithm to build a model.

Data Mining Project

A data mining project involves a variety of activities and requires the following components.

Defining Business Problem

A data mining initiative begins with the definition of a business problem. A well-defined business problem sets up a clear goal, which helps in the selection of data sources and data mining techniques to be used. A clear goal also helps identify the people who best understand the business and the data and can contribute to the project.

A business problem is often defined after a series of discussions among different business units. The people involved should come from the areas of business management, data analysis and data management. In a typical scenario, a general problem is initially posed by business managers and, after a series of discussions, narrowed down to a specific problem. For example, a direct marketing manager may ask: “Can we use data mining to increase our customer response rate in a direct mail marketing campaign?” This is a general problem with a goal of building a data mining model to select customers for promoting a particular product through mail. But a business problem has to be specific so that a clear goal can be set and achieved within a short time period.

Gathering Data Sources

Reliable and accessible data sources are the precondition for a successful data mining project. Lack of reliability and accessibility of data sources can delay data mining initiatives and even lead to their termination or failure without any concrete results. Unfortunately, such problems happen all the time in enterprises, especially for beginners.

Data sources have to be identified in the planning stage of a data mining project, and the administrators and users of the data sources consulted in order to understand the data they contain. Data for solving a particular data mining problem may also come from multiple sources. Data selection, exploration and preprocessing are important steps in the data mining process whose importance cannot be overstated; in fact, they consume the most time and resources in a data mining project.

Utilizing Data Mining Tool

Data mining tools are crucial to conducting a data mining project smoothly and getting the best out of the data. A generic data mining tool with a wide range of functionality will benefit large organizations with many data mining tasks that require different functions. However, even though functionality is important to business users, it is more critical for them to have a tool that is easy to use. AlphaMiner is such a tool.

Forming Data Mining Team

Data mining relies on teamwork, and the people involved play active roles in conducting a data mining project. A competent data miner should have a good understanding of the business problem, the data, and data mining principles, and should be capable of using the data mining tools to perform the various steps of the data mining process.

Adopting Data Mining Methodology

Methodology is an important element in the success of a data mining project. A sound methodology can guide the data mining work smoothly and quickly toward the best outcome. A bad methodology, however, results in unsatisfactory outcomes and frequent repetition of the same steps.

There is no unified methodology that can govern all data mining practices. Data mining practitioners need to
gradually develop their own methodologies from their data mining experience.

Why Data Mining?

Data mining technology is needed in business for numerous reasons, including the following primary ones.

Competitive Business Environment

Since the 1990s, the business environment has become ever more competitive with easy access to information and increasing globalization of the world economy. There is almost no isolated market that allows a company to operate without competition. Indeed, some previously highly regulated industries, such as health insurance and telecommunications, have become hot battlefields for competition in recent years. In such a competitive environment, many companies operate on marginal profits and some are continuously struggling with their competitors for survival.

To survive and prosper, a business has to maintain sound business practices, reduce operating costs and find new business opportunities. Besides options such as organization downsizing and reengineering, investment in new technology can increase the competitiveness of a company. For many industries, especially in the service sector, the fastest growing areas for such investment include business intelligence systems.

Data mining, together with data warehousing and OLAP, forms the core technology in business intelligence. Data mining technology, although still in its infancy, has demonstrated great potential in many business applications. It can help a company to improve its marketing performance, detect business fraud and maintain good management of customer relations.

Bulk of Data Unexplored

The volume of data has always been a big issue for data analysis in business. Almost every big company or government organization maintains one or more large databases. These large databases are growing at a tremendous speed each day. Traditional analysis techniques cannot cope with such volumes of data. Consequently, a large part of the data in a company’s databases is left unexplored. Clearly, data is a valuable asset to the business, but it cannot generate profit by itself. However, the information and knowledge dug out of the data can be turned into money. Business data can be a gold mine in which nuggets lie hidden, and data mining provides powerful machinery to mine them.

Better Data Accessibility

Business data in large companies are usually stored in different databases and information systems distributed across different divisions and departments, which can be located in different cities. Such a distributed information environment used to be a big barrier to accessing different data sources within an organization. However, the situation changed in the late ’90s due to advances in information technology. Under the new intranet and internet infrastructure and with client/server technology, business analysts can query any remote database in the network and quickly transfer a large volume of data between two systems located in different places. Centralized data warehouses provide integrated data sources for comprehensive data analysis and mining. New security measures can guarantee safe access to confidential business data. Using a user-friendly query tool, business analysts can easily extract and manipulate data in relational database systems without the need to write complex SQL statements.

Proliferation of Data Mining Tools

In the past few years, data mining tool vendors have put a lot of effort into improving the usability of their data mining products. The improvements include graphical user interfaces, consolidation of functionality, integration of various data mining algorithms, easy access to
different data sources, and representation of mining results through visualization. As a result of this effort, data mining techniques are no longer accessible only to academics, Ph.D. students and skilled data mining consultants, but have become a useful weapon for business analysts.

After a short period of training, most business analysts are able to use one of the leading data mining tools to conduct their own data mining projects. They no longer need to spend a lot of time struggling with the technical difficulties of using the tools. Instead, they can focus on solving their problems.

Strategic Decision Support

Modern organizations consider building an enterprise-wide decision support environment to be a strategic investment for the near future. In an enterprise-wide decision support environment, the data warehouse plays the central role of providing the integrated and time-dependent data source for making decisions. End-users, such as branch managers, business analysts, and marketing and sales staff, use various front-end tools to analyze data in, and extract information from, the data warehouse. Widely used tools include ad hoc querying tools, OLAP tools and data mining tools. Top level managers can also use EIS systems to obtain the up-to-date reports they need for their various meetings.

Here are the reasons why data mining should be integrated into the decision support environment:

  • The data warehouse provides an abundant, integrated and comparatively clean and reliable data source for data mining
  • Many data mining results cannot be produced with other tools
  • Data mining can deal with large volumes of data
  • Data mining models are often deployed in this environment
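
As a rough illustration of the model-building operations listed in the Preface (accessing, cleansing, transforming, building, assessing and scoring), the following Python sketch chains them into one small data mining process for the direct mail example. It is not AlphaMiner code: the sample data, the function names and the choice of a one-feature threshold rule (a decision stump) as the "algorithm" are all assumptions made for illustration, and a real project would use far more data and a proper validation scheme.

```python
import csv
import io
from statistics import median

# Hypothetical historical campaign data; an empty field is a missing value.
RAW = """customer_id,age,past_purchases,responded
1,25,0,0
2,,1,0
3,41,4,1
4,52,6,1
5,33,2,0
6,47,5,1
7,29,1,0
8,58,3,1
"""

def access(text):
    """Access: read records from the data source (here an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleanse: treat missing ages by filling in the median of the known ages."""
    fill = median(int(r["age"]) for r in rows if r["age"])
    return [{**r, "age": r["age"] or str(fill)} for r in rows]

def transform(rows):
    """Transform: convert text fields into numeric features and a 0/1 label."""
    return [({"age": float(r["age"]),
              "past_purchases": float(r["past_purchases"])},
             int(r["responded"])) for r in rows]

def build_stump(data, feature):
    """Build: fit a threshold rule on one feature (predict 1 if value >= threshold)."""
    best = None
    for threshold in sorted({x[feature] for x, _ in data}):
        acc = sum((x[feature] >= threshold) == bool(y) for x, y in data) / len(data)
        if best is None or acc > best[0]:
            best = (acc, threshold)
    return {"feature": feature, "threshold": best[1]}

def predict(model, x):
    """Apply a fitted threshold rule to one customer's features."""
    return int(x[model["feature"]] >= model["threshold"])

def assess(model, data):
    """Assess: accuracy of a model on labelled data."""
    return sum(predict(model, x) == y for x, y in data) / len(data)

# Run the process: first six customers for building, last two held out for assessment.
data = transform(cleanse(access(RAW)))
train, holdout = data[:6], data[6:]
candidates = [build_stump(train, f) for f in ("age", "past_purchases")]
best_model = max(candidates, key=lambda m: assess(m, holdout))

# Score: select new (hypothetical) customers the chosen model predicts will respond.
new_customers = {
    "A": {"age": 45.0, "past_purchases": 0.0},
    "B": {"age": 30.0, "past_purchases": 5.0},
    "C": {"age": 41.0, "past_purchases": 2.0},
}
targets = [cid for cid, x in new_customers.items() if predict(best_model, x)]
print(best_model, targets)
```

Each function corresponds to one operation, so a different business problem could reuse the same steps in a different arrangement, which is the sense in which a data mining process captures knowledge about how to solve a problem.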