74 - Understanding Data Mining



Management: Overview

Understanding Data Mining

Data mining has become one of the latest trends in using data. Rod Newing explains that it is a complex process which has been around for a long time.

PC Network Advisor, Issue 74, Page 13. File: M0481.1

Organisations world-wide are accumulating vast quantities of electronic data as databases become ever more pervasive. The recent trend to implement a data warehouse architecture is increasing the quality and accessibility of data. This is all being done at great cost, but the information is only valuable if used effectively.

Users have been using query tools, OLAP servers, Business Intelligence tools, Enterprise Information Systems and a wide range of other packaged software to examine their data. However, these tools either work with summarised data or answer users' specific questions. The more numerate analysts have recognised that there are hidden patterns, relationships and rules in their data which cannot be found by using these traditional methods.

The answer is to use specialist software which harnesses advanced mathematics to examine large volumes of detailed data. This specialist group of software has become known as "data mining" or "knowledge discovery". Data mining is defined as the process of extracting valid, previously unknown and ultimately comprehensible information from large databases and using it to make critical business decisions.

The name is derived from the process of sifting large amounts of ore to discover nuggets of gold, just as the software is able to sift large volumes of data to find nuggets of information which yield gold in the form of competitive advantage. The extracted information can be used to do one or more of the following:

  • Provide an understanding of data relationships to end users.
  • Form a prediction or classification model.
  • Allow prediction of future trends based on past experience.
  • Identify relationships between database records.
  • Provide a summary of the database being mined.

With a query, the user knows what is in the database and knows what information to ask for, so they must know what patterns exist. With data mining, the software establishes the patterns and relationships. It is possible to carry out data mining operations using a query tool, but the process is extremely complex and would require a prohibitive amount of manual effort. Data mining software uses algorithms which have automated most of the work involved.

Data mining differs from statistical analysis in that the latter is used to verify existing knowledge in order to prove a known relationship. Most data mining involves carrying out several different operations using more than one technology, so it should be thought of as an operation, rather than a product.

Data mining can be carried out on any data file, from a spreadsheet to a data warehouse. Transaction processing systems can be mined, and the exercise can be used to generate benefits which can help to justify the considerable investment required to implement a data warehouse architecture.

Figure 1 outlines the major milestones in the evolution of Data Mining.

Figure 1 - Milestones in the evolution of Data Mining.

Time | Evolutionary step | Business question | Enabling technologies | Characteristics
1960s | Data collection | "What was my total revenue in each of the last five years?" | Computers, tapes, disks. | Retrospective static data delivery.
1980s | Data access | "What were unit sales in New England in March?" | Relational databases, SQL, ODBC. | Retrospective dynamic data delivery at record level.
1990s | Data warehousing and decision support | "What were unit sales in New England in March? Drill down to Boston." | On-Line Analytical Processing, data warehouses. | Retrospective dynamic data delivery at multiple levels.
Now | Data mining | "What is likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multi-processor computers, massive databases. | Prospective proactive information delivery.

Objectives

Data mining can achieve a number of different objectives, using one or more different technologies.

Prediction And Classification

This approach uses the historical data in the database to predict future behaviour. It creates a generalised description which characterises the contents of the database by generating
an understandable model. It enables the model to be applied to new data sets in order to predict the behaviour hidden in that data. For example, a predictive model of existing customers can be applied to potential customers in order to identify those most likely to purchase a particular product or service. It has traditionally used statistical techniques, but many automatic model development techniques are being developed, often based on supervised induction.

Analysing Links

Data mining can be used to establish relationships between the records in the database which would otherwise be impossible to find because they cannot be predicted and so cannot be found other than by accident. It is a relatively recent technique, which has become well known through shopping basket analysis, which indicates popular combinations purchased by retail customers.

Segmenting Databases

This is a form of sophisticated query to identify common groups of records within a database. It may be a technique in its own right or may be used to prepare data for further processing.

Detecting Deviations

This identifies unusual values which do not conform to the expected pattern. It is often a source of new knowledge since the results defy known logic. It is also used in fraud detection, where unusual values may represent an unauthorised transaction.

The Process

There are four basic steps which need to be carried out in order to complete a data mining exercise.

Data Selection

The objective determines the type of information and the way it is organised. Only part of the data available from the source data file will be needed, so the relevant data must be identified. Noise and missing values may need to be addressed. It may also be preferable to sample the data required and mine the sample.

Data Transformation

Once it has been selected, the data may need to be transformed. For instance, neural networks require nominal values to be converted to numeric ones. Alternatively, derived attributes may need to be created by applying mathematical or logical operators, such as a ratio or logarithmic value.

Applying Algorithms

One or more data mining techniques are carried out to try to extract the required information or meet the required objective. Some of the algorithms used are described in Figure 2.

Results Interpretation

The result of applying data mining algorithms will be tables of values or relationships. The user will have to look for interesting groupings of data and establish if there is any business value in them. They need to be analysed using a data visualisation (see Figure 3) or decision support tool. Visualisation helps the user to understand the data and identify patterns.

If the objective is to produce a model, it must be validated and tested. It may be necessary to refine the data, repeating the sequence again. This process is often referred to as "data refining".

Figure 2 - Data Mining Technologies.

Neural Networks: Software which learns from training to identify patterns and construct a model. This model is then applied to larger data sets to predict their structures. It can also identify changes, which then become a notifiable event.

Decision Trees: Decision trees are tree-shaped structures which represent sets of decisions. They generate rules for classifying the data set, using algorithms such as ID3, Classification and Regression Trees ("CART") and Chi-squared Automatic Interaction Detection ("CHAID").

Clustering Methods: In this method, artificial intelligence search techniques are used to identify subsets in a cluster. It uses software such as AQ11, UNIMEM and COBWEB.

Rule Induction: Rule induction involves the extraction of "if ... then ..." rules from data based on statistical significance. Examples are IBM's RMINI, and FOIL, which are in the public domain.

Genetic Algorithms: This is an optimisation technique which uses processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution.

Techniques

There are a number of techniques for carrying out the data mining exercise.

Supervised Induction

Supervised induction automatically creates a classification model from a set of records, known as a "training set", which may be the whole database or a sample of data from it. The induced model consists of generalised patterns which can be used to classify new records. It can use neural networks or decision trees, but the latter do not work well with noisy data.

It produces high quality models, even when data in the training set is poor or incomplete. The result is more accurate than that obtained using statistical methods, because it checks for local patterns, whereas the latter work across the entire database. The models are easy for the user to understand. An example would be a credit card analysis to discover the attributes of a good
credit risk in order to predict creditworthiness of applicants.

Association Discovery

This is a technique which identifies the affinities which exist among records. The output might find that 67% of records containing A, B and C also contain Y and Z. The percentage is known as the "confidence factor". An example of association discovery is market basket analysis.

Sequence Discovery

This is similar to association discovery, but works over time. It is frequently directed towards individual customers as a means of identifying their preferences. It detects buying patterns which occur in a sequence of related transactions. It is used for targeting direct mail.

Clustering

This technique is used to segment a database into subsets of mutually exclusive groups. The members of each group should be as close to each other as possible and as far apart from other groups as possible. The members of each cluster should possess properties which are interesting to the user. Data visualisation techniques are then used to examine each cluster to establish which are useful or interesting.

It is less precise than other techniques because of redundant or irrelevant data. The solution is for the user to direct the software to ignore subsets of attributes, assign weightings to them or apply filters to the information. The importance of the attributes themselves can be established using statistical methods.

Clustering can also be used to provide data for other techniques, such as supervised induction. Clusters can be created using statistics, neural networks or unsupervised induction. However, using statistical methods makes it difficult to assign new records to existing clusters, because of the difficulty of measuring and handling their deviation from those clusters.

Figure 3 - Data visualisation.

Data visualisation provides the user with visual summaries of the results of the data mining algorithms. This helps them to understand the results by communicating relationships in a way that rows and columns cannot. It is interactive, allowing the user to filter or change the information displayed. The user can also change the presentation method used, such as from a histogram to a scatter chart.

Visualisation allows users to browse the data looking for unusual features. It is good at identifying small meaningful sub-sets of data which defy conventional wisdom. These "outliers" are anomalies which may be errors, or genuine and valuable exceptions to established wisdom.

A wide range of advanced chart types can be used:

  • Geographical maps, combined with histograms, colour coding, pie charts etc.
  • Tree maps showing the hierarchy of a classified database.
  • Rule visualisation.
  • Trends.
  • Scatter graphs.
  • Heat maps.

These chart types are very advanced when compared with traditional graphing tools and need powerful workstations. For instance, a five dimensional chart can be created by representing clusters on a three dimensional scatter chart as a sphere. The size and colour of the sphere represent the fourth and fifth dimensions.

The time dimension can be incorporated by "playing" the chart like a video. The user can watch the movements in a multi-dimensional chart as it changes with the elapsed time.

Applications

The importance of data mining has been recognised by information intensive industries which have large databases of customer transactions, such as banking, health care, insurance, marketing, retail and telecommunications.

One of the most well-known data mining applications is market/shopping basket analysis. This involves running an association discovery operation over Electronic Point Of Sale (EPOS) data. It analyses the combinations of products purchased by individual buyers to find dependencies. Until the recent arrival of loyalty cards, it was the only way supermarkets and high street stores had to understand who their customers are and how they behave.

Other common applications are for promotion effectiveness, customer vulnerability analysis, cross-selling, portfolio creation and fraud detection. It is also used in healthcare, where it can find relationships between patient histories, illnesses and surgical operations. It is also used in manufacturing processes to monitor quality and spot machine wear.

In marketing, if an organisation wants to cross-sell one product to purchasers of another, it cannot target all customers, because the volume may be too large. Therefore it is necessary to mine the database of existing customers to identify patterns which describe the characteristics of purchasers of the product. These patterns can then be applied to the database of customers who have not purchased the product to segment and predict those who are more likely to purchase the product. These are then targeted in a very specific marketing campaign.

Data mining is often used to predict and identify people most likely to respond to direct mail. This reduces the cost of mailing without affecting the response rate. Organisations have found up to a twenty-fold decrease in costs over conventional approaches. The data mining operation can also be taken a step further by identifying clusters of the most profitable likely customers, which may be different to those most likely to respond.

Identifying exceptions can be just as important as finding hidden patterns. In fraud detection, credit card transactions are often analysed by a neural network to identify unusual transactions which may indicate that the card is not being used by its holder, even before the loss is reported.

It is important to understand that a particular data mining exercise may use more than one stage and use several algorithms by passing the results from one analysis to another. For instance, the user might produce associations using a decision tree and then pass the result to a neural network to identify changes over time. Mining elements can be combined in an almost unlimited variety of ways.

Software

For most organisations, the software needs to be scalable from a stand-alone PC to a parallel-processing server. This allows data mining operations to be carried out on desktop databases, relational or multi-dimensional data marts, transaction processing systems or enterprise data warehouses.

Because of the different techniques and technologies, the software needs to integrate various different algorithms into one product. Most vendors use several different ones and are writing further modules to expand the scope of their products. The software must present the data in an easy to understand manner so that users can assess its significance to the business. It may incorporate its own visualisation tools or work with third-party packages.

The software must incorporate filters to remove "noise", which is incorrect information or spurious relationships. For instance, the software shouldn't waste the user's time by reporting that 99.9% of married people have a spouse of the opposite gender!

Software for data mining is available either direct from the authors or through decision support vendors who have embedded it into their own applications. IBM and the other vendors have open Application Programming Interfaces so that application builders can add value to their decision support software by driving a data mining engine from their own tools.

Figure 4 - The Main Data Mining Products.

Supplier | Product | Contact Details
Angoss | Knowledge Seeker | http://www.angoss.com
Attar | XpertRule | http://www.attar.com
Brann Software | Viper | http://www.brannsoftware.co.uk
DataMind Corporation | Mine Your Own Business | http://www.datamindcorp.com
EDS | Dbintellect | http://www.dbintellect.com
IBM | Intelligent Miner, Intelligent Decision Server | http://www.software.ibm.com
Integral Solutions | Clementine | http://www.isl.co.uk
Right Information Systems | 4Thought | http://www.4thought.com
The SAS Institute | Neural Network Application, Insight, Spectraview, GIS | http://www.sas.com
Silicon Graphics | MineSet | http://www.sgi.com
SPSS | SPSS CHAID, Neural Connection, Professional Statistics etc | http://www.spss.com

Figure 5 - Products incorporating data mining.

Supplier | Product | Tool | Contact Details
Cognos | PowerPlay | 4Thought, Knowledge Seeker | http://www.cognos.com
Comshare | Commander Decision | Own | http://www.comshare.com
NCR | Knowledge Discovery Workbench | Clementine | http://www.ncr.com
Holistic Systems | Holos | Own | http://www.holossys.com
Oracle | Express | Partners' | http://www.oracle.com
Pilot Software | Pilot Discovery Server | Own, based on the Thinking Machine | http://www.pilotsw.com
Planning Sciences | Gentia | Own, plus Intelligent Miner | http://www.gentium.com
Red Brick Systems | Red Brick Data Mine | Mine Your Own Business | http://www.redbrick.com

The Author

Rod Newing MBA FCA FInstD is a specialist writer on Executive Computing. He can be contacted via email as rnewing@cix.compulink.co.uk.
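The "confidence factor" mentioned under association discovery and market basket analysis can be made concrete with a small calculation. The sketch below is only an illustration and not the method used by any product named in the article: the function name rule_stats, the basket contents and the rule "bread and butter implies jam" are all hypothetical, chosen so that the result matches the 67% figure quoted in the text. It also reports "support" (the share of all baskets that satisfy the whole rule), a companion measure that the article does not discuss.

```python
from typing import Iterable, Set, Tuple


def rule_stats(
    baskets: Iterable[Set[str]],
    antecedent: Set[str],
    consequent: Set[str],
) -> Tuple[float, float]:
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = baskets containing antecedent AND consequent / all baskets
    confidence = baskets containing antecedent AND consequent
                 / baskets containing the antecedent
    """
    lhs = frozenset(antecedent)
    whole_rule = lhs | frozenset(consequent)
    total = with_lhs = with_rule = 0
    for basket in baskets:
        items = frozenset(basket)
        total += 1
        if lhs <= items:  # basket contains every antecedent item
            with_lhs += 1
            if whole_rule <= items:  # ...and every consequent item too
                with_rule += 1
    if total == 0 or with_lhs == 0:
        return 0.0, 0.0
    return with_rule / total, with_rule / with_lhs


# Hypothetical point-of-sale baskets (illustrative data only).
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter", "jam", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"jam", "eggs"},
]

support, confidence = rule_stats(BASKETS, {"bread", "butter"}, {"jam"})
print(f"support={support:.0%} confidence={confidence:.0%}")
# prints: support=33% confidence=67%
```

Three of the six baskets contain both bread and butter, and two of those three also contain jam, giving a confidence of two in three. A real association discovery tool would search over many candidate rules rather than score a single hand-picked one.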