Mis2013 chapter 12 business intelligence and knowledge management


Published in: Technology, Business

  1. 1. David Kroenke Business Intelligence and Knowledge Management Chapter 9 © 2007 Prentice Hall, Inc. 1
  2. 2. • Understand the need for business intelligence systems. • Know the characteristics of reporting systems. • Know the purpose and role of data warehouses and data marts. • Understand fundamental data-mining techniques. • Know the purpose, features, and functions of knowledge management systems.
  3. 3. • According to a study done at the University of California at Berkeley, a total of 403 petabytes of new data were created. • 403 petabytes is roughly the amount of all printed material ever written. ◦ The printed collection of the Library of Congress is 0.01 petabytes. ◦ 400 petabytes equals 40,000 copies of the print collection of the Library of Congress.
  4. 4. • The generation of all these data has much to do with Moore’s Law. • The capacity of storage devices increases as their costs decrease. • Today, storage capacity is nearly unlimited. • We are drowning in data and starving for information.
  5. 5. Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
  6. 6. Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
  7. 7. • Tools for searching business data in an attempt to find patterns are called business intelligence (BI) tools. • Reporting tools are programs that read data from a variety of sources, process that data, produce formatted reports, and deliver those reports to the users who need them.
  8. 8. • The processing of data is simple: ◦ Data are sorted and grouped. ◦ Simple totals and averages are calculated. • Reporting tools are used primarily for assessment. ◦ They are used to address questions like: What has happened in the past? What is the current situation? How does the current situation compare to the past?
  9. 9. • Data-mining tools process data using statistical techniques, many of which are sophisticated and mathematically complex. • Data mining involves searching for patterns and relationships among data. • In most cases, data-mining tools are used to make predictions. • For example, we can use one form of analysis to compute the probability that a customer will default on a loan.
  10. 10. • Another way to distinguish reporting tools from data-mining tools: ◦ Reporting tools use simple operations like sorting, grouping, and summing. ◦ Data-mining tools use sophisticated techniques.
  11. 11. • An information system is a collection of hardware, software, data, procedures, and people. • The purpose of a business intelligence (BI) system is to provide the right information, to the right user, at the right time. • BI systems help users accomplish their goals and objectives by producing insights that lead to actions.
  12. 12. • A reporting tool can generate a report that shows a customer has canceled an important order. • A reporting system, however, alerts that customer’s salesperson with this unwanted news, and does so in time for the salesperson to try to alter the customer’s decision. • A data-mining tool can create an equation that computes the probability that a customer will default on a loan.
  13. 13. • A data-mining system uses that equation to enable banking personnel to assess new loan applications.
  14. 14. • The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper user on a timely basis. • Reporting systems generate information from data as a result of four operations: ◦ Filtering data ◦ Sorting data ◦ Grouping data ◦ Making simple calculations on the data
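The four reporting operations listed on this slide can be illustrated with a short Python sketch. The order records, field names, and the cutoff of 100 are made up for illustration and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical order records; a real reporting tool would read these
# from one or more operational data sources.
orders = [
    {"region": "West", "customer": "A", "amount": 250.0},
    {"region": "East", "customer": "B", "amount": 75.0},
    {"region": "West", "customer": "C", "amount": 400.0},
    {"region": "East", "customer": "D", "amount": 125.0},
]

# 1. Filter: keep only orders of at least 100 (an arbitrary cutoff).
large = [o for o in orders if o["amount"] >= 100]

# 2. Sort: largest orders first.
large.sort(key=lambda o: o["amount"], reverse=True)

# 3. Group: collect order amounts by region.
by_region = defaultdict(list)
for o in large:
    by_region[o["region"]].append(o["amount"])

# 4. Simple calculations: total and average per region.
report = {
    region: {"total": sum(amounts), "average": sum(amounts) / len(amounts)}
    for region, amounts in by_region.items()
}

for region, stats in sorted(report.items()):
    print(f"{region}: total={stats['total']:.2f} average={stats['average']:.2f}")
```

Nothing here goes beyond sorting, grouping, and simple arithmetic, which is the point the slides make about reporting tools.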
  17. 17. • A reporting system maintains a database of reporting metadata. • The metadata describes the reports, users, groups, roles, events, and other entities involved in the reporting activity. • The reporting system uses the metadata to prepare and deliver reports to the proper users on a timely basis.
  20. 20. • In terms of report type, reports can be static or dynamic. • Static reports are prepared once from the underlying data, and they do not change. ◦ Example: a report of the past year’s sales. • Dynamic reports: the reporting system reads the most current data and generates the report using that fresh data. ◦ Examples: a report on sales today and a report on current stock prices.
  21. 21. • Query reports are prepared in response to data entered by users. • Online analytical processing (OLAP) reports allow the user to dynamically change the report grouping structures.
  22. 22. • Reports are delivered via many different report media or channels. • Some reports are printed on paper, and others are created in a format like PDF whereby they can be printed or viewed electronically. • Other reports are delivered to computer screens. • Companies sometimes place reports on internal corporate Web sites for employees to access.
  23. 23. • Another report medium is a digital dashboard, which is an electronic display customized for a particular user. ◦ Vendors like Yahoo! and MSN provide common examples. ◦ Users of these services can define the content they want, say, a local weather forecast, a list of stock prices, or a list of news sources. ◦ The vendor constructs the display customized for each user.
  24. 24. • Other dashboards are particular to an organization. ◦ The organization might have a dashboard that shows up-to-the-minute production and sales activities. • Alerts are another form of report. ◦ Users can declare that they wish to receive notifications of events, say, via email or on their cell phones. • Reports can be published via a Web service. ◦ The Web service produces the report in response to requests from the service-consuming application.
  26. 26. • The report mode can be either push report or pull report. • Organizations send a push report to users according to a preset schedule. ◦ Users receive the report without any activity on their part. • Users must request a pull report. ◦ To obtain a pull report, a user goes to a Web portal or digital dashboard and clicks a link or button to cause the reporting system to produce and deliver the report.
  27. 27. • Three functions of reporting systems are: ◦ Authoring ◦ Management ◦ Delivery • Report authoring involves connecting to data sources, creating the reporting structure, and formatting the report.
  28. 28. Source: Microsoft product screen shot reprinted with permission from Microsoft Corporation.
  29. 29. Source: Microsoft product screen shot reprinted with permission from Microsoft Corporation.
  30. 30. • The purpose of report management is to define who receives what reports, when, and by what means. • Most report-management systems allow the report administrator to define user accounts and user groups and to assign particular users to particular groups. • Reports that have been created using the report-authoring system are assigned to groups and users.
  31. 31. • Assigning reports to groups saves the administrator work. ◦ When a report is created, changed, or removed, the administrator need only change the report assignments to the group. ◦ All of the users in the group will inherit the changes. • Metadata also indicates what channel is to be used and whether the report is to be pushed or pulled. ◦ If the report is to be pushed, the administrator declares whether the report is to be generated on a regular schedule or as an alert.
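A minimal sketch of how report-management metadata might be represented, assuming plain Python dictionaries. The group names, report names, users, and channels are all hypothetical; a real reporting system would keep this in its metadata database.

```python
# Hypothetical reporting metadata: group membership and report assignments.
group_members = {
    "sales_managers": {"alice", "bob"},
    "executives": {"carol"},
}

report_assignments = {
    "weekly_sales": {
        "groups": {"sales_managers", "executives"},
        "mode": "push",
        "channel": "email",
    },
    "inventory_query": {
        "groups": {"sales_managers"},
        "mode": "pull",
        "channel": "web_portal",
    },
}

def recipients(report_name):
    """Return every user who inherits the report through group membership."""
    users = set()
    for group in report_assignments[report_name]["groups"]:
        users |= group_members[group]
    return users

def reports_for(user, mode):
    """List the reports a user may receive in the given mode ('push' or 'pull')."""
    return sorted(
        name
        for name, meta in report_assignments.items()
        if meta["mode"] == mode and user in recipients(name)
    )

# Adding a user to a group is the only change needed for that user
# to inherit every report assigned to the group.
group_members["sales_managers"].add("dave")
```

Because reports are assigned to groups rather than to individuals, the single `add` call at the end is enough to give the new user both reports.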
  32. 32. • The report-delivery function of a reporting system pushes reports or allows them to be pulled according to report-management metadata. • Reports can be delivered via an email server, Web site, XML Web services, or by other program-specific means. • The report-delivery system uses the operating system and other program security components to ensure that only authorized users receive authorized reports.
  33. 33. • The report-delivery system also ensures that push reports are produced at appropriate times. • For query reports, the report-delivery system serves as an intermediary between the user and the report generator. ◦ It receives user query data, such as item numbers in an inventory query, passes the query data to the report generator, receives the resulting report, and delivers the report to the user.
  34. 34. • RFM analysis is a way of analyzing and ranking customers according to their purchasing patterns. • It is a simple technique that considers how recently (R) a customer has ordered, how frequently (F) a customer orders, and how much money (M) the customer spends per order. • To produce an RFM score, the program first sorts customer purchase records by the date of their most recent (R) purchase.
  35. 35. • In a common form of this analysis, the program then divides the customers into five groups and gives customers in each group a score of 1 to 5. ◦ The top 20% of the customers having the most recent orders are given an R score of 1 (highest). • The program then re-sorts the customers on the basis of how frequently they order. ◦ The top 20% of the customers who order most frequently are given an F score of 1 (highest). • Finally, the program sorts the customers again according to the amount spent on their orders. ◦ The 20% who have ordered the most expensive items are given an M score of 1 (highest).
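The scoring procedure described above can be sketched in a few lines of Python. The customer names and figures are invented, and the slides do not say how ties are handled, so this sketch simply keeps the input order for equal values.

```python
from datetime import date

def quintile_scores(values):
    """Score each value from 1 to 5, where 1 marks the top 20% (largest values)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * 5 // len(values) + 1
    return scores

# Hypothetical customers:
# (name, most recent order date, orders per year, average order amount)
customers = [
    ("A", date(2007, 3, 1), 12, 500.0),
    ("B", date(2006, 1, 15), 2, 9000.0),
    ("C", date(2007, 2, 10), 30, 50.0),
    ("D", date(2005, 6, 30), 1, 20.0),
    ("E", date(2006, 11, 5), 8, 1200.0),
]

r = quintile_scores([c[1] for c in customers])  # latest date scores 1
f = quintile_scores([c[2] for c in customers])  # most frequent scores 1
m = quintile_scores([c[3] for c in customers])  # largest amount scores 1

rfm = {c[0]: (r[i], f[i], m[i]) for i, c in enumerate(customers)}
for name, score in rfm.items():
    print(name, score)
```

A customer such as B, with scores (4, 4, 1), has not ordered recently or often but spends heavily per order, which is the kind of pattern an RFM report is meant to surface.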
  36. 36. • A reporting system can generate the RFM data and deliver it in many ways: ◦ A report of RFM scores for all customers can be pushed to the vice president of sales. ◦ Reports with scores for particular regions can be pushed to regional sales managers. ◦ Reports of scores for particular accounts can be pushed to the account salespeople. ◦ All of this reporting can be automated.
  38. 38. • Online analytical processing (OLAP) provides the ability to sum, count, average, and perform other simple arithmetic operations on groups of data. • The remarkable characteristic of OLAP reports is that they are dynamic. • The viewer of the report can change the report’s format; hence the term online.
  39. 39. • An OLAP report has measures and dimensions. • A measure is the data item of interest. ◦ It is the item that is to be summed or averaged or otherwise processed in the OLAP report. • A dimension is a characteristic of a measure. ◦ Purchase date, customer type, customer location, and sales region are all examples of dimensions.
  40. 40. • With an OLAP report, it is possible to drill down into the data. ◦ This term means to further divide the data into more detail. • Special-purpose products called OLAP servers have been developed to perform OLAP analysis. • An OLAP server reads data from an operational database, performs preliminary calculations, and stores the results of those operations in an OLAP database.
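As a rough illustration of measures, dimensions, and drilling down, the following Python sketch sums one measure over a changeable list of dimensions. The sales facts and field names are hypothetical, and an actual OLAP server would precompute and store such totals rather than recalculating them on each request.

```python
from collections import defaultdict

# Hypothetical sales facts: "sales" is the measure; the other fields are dimensions.
facts = [
    {"region": "West", "customer_type": "Retail", "sales": 100},
    {"region": "West", "customer_type": "Wholesale", "sales": 300},
    {"region": "East", "customer_type": "Retail", "sales": 200},
    {"region": "East", "customer_type": "Retail", "sales": 50},
]

def olap_sum(rows, dimensions, measure="sales"):
    """Sum the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Top-level view: sales by region.
by_region = olap_sum(facts, ["region"])

# Drill down: divide each region further by customer type.
drill_down = olap_sum(facts, ["region", "customer_type"])

print(by_region)
print(drill_down)
```

Changing the `dimensions` argument is the code analogue of a viewer rearranging the report's grouping structure.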
  45. 45. • Basic reports and simple OLAP analyses can be made directly from operational data. • For the most part, such reports display the current state of the business, and if there are a few missing values or small inconsistencies in the data, no one is too concerned. • Operational data are unsuited to more sophisticated analyses, particularly data-mining analyses that require high-quality input for accurate and useful results.
  46. 46. • Many organizations choose to extract operational data into facilities called data warehouses and data marts, both of which are facilities that prepare, store, and manage data specifically for data mining and other analyses. • Programs read operational data and extract, clean, and prepare that data for BI processing. • The prepared data are stored in a data-warehouse database using a data-warehouse DBMS, which can be different from the organization’s operational DBMS.
  47. 47. • Data warehouses often include data that are purchased from outside sources. • Metadata concerning the data, its source, its format, its assumptions and constraints, and other facts about the data is kept in a data-warehouse metadata database. • The data-warehouse DBMS extracts and provides data to business intelligence tools such as data-mining programs.
  50. 50. • Most operational and purchased data have problems that inhibit their usefulness for business intelligence. • Problematic data are termed dirty data. ◦ Examples are values of B for customer gender and of 213 for customer age. • Purchased data often contain missing elements. ◦ Most data vendors state the percentage of missing values for each attribute in the data they sell. ◦ An organization buys such data because for some uses, some data is better than no data at all.
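A data-preparation program might flag dirty and missing values along these lines. This is only a sketch: the records, the list of valid gender codes, and the age limit of 120 are assumptions chosen to match the slide's examples.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"gender": "F", "age": 34},
    {"gender": "B", "age": 41},    # invalid gender code
    {"gender": "M", "age": 213},   # impossible age
    {"gender": None, "age": 58},   # missing gender
]

VALID_GENDERS = {"F", "M"}  # assumed code list
MAX_AGE = 120               # assumed plausibility limit

def problems(record):
    """Return a list of data-quality problems found in one record."""
    found = []
    if record["gender"] is None:
        found.append("missing gender")
    elif record["gender"] not in VALID_GENDERS:
        found.append("invalid gender")
    if record["age"] is None:
        found.append("missing age")
    elif not 0 <= record["age"] <= MAX_AGE:
        found.append("implausible age")
    return found

dirty = [r for r in records if problems(r)]

# Percentage of missing values for one attribute, the figure data vendors report.
pct_missing_gender = 100 * sum(r["gender"] is None for r in records) / len(records)

print(len(dirty), "of", len(records), "records have problems")
print(f"{pct_missing_gender:.0f}% of gender values are missing")
```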
  51. 51. • Inconsistent data are particularly common for data that have been gathered over time. ◦ When an area code changes, for example, the phone number for a given customer before the change will not match the customer’s number after the change. • Some data inconsistencies occur from the nature of the business activity. • Nonintegrated data can cause problems when data comes from different management information systems.
  52. 52. • Data can be too fine or too coarse. ◦ It is possible to capture the customer’s clicking behavior in what is termed clickstream data, which includes everything a customer does at a Web site. • Data that are too fine or too coarse are said to have the wrong granularity. • Because of a phenomenon called the curse of dimensionality, the more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor.
  54. 54. • The data warehouse takes data from the data manufacturers (operational systems and purchased data), cleans and processes the data, and locates the data on the shelves, so to speak, of the data warehouse. • A data mart is a data collection, smaller than the data warehouse, that addresses a particular component or functional area of the business.
  55. 55. • The data warehouse is like the distributor in the supply chain and the data mart is like the retail store in the supply chain. • Users in the data mart obtain data that pertain to a particular business function from the data warehouse. • It is expensive to create, staff, and operate data warehouses and data marts.
  57. 57. • Data mining is the application of statistical techniques to find patterns and relationships among data and to classify and predict. • Data mining represents a convergence of disciplines. • Data-mining techniques emerged from statistics and mathematics and from artificial intelligence and machine-learning fields in computer science.
  59. 59. • With unsupervised data mining, analysts do not create a model or hypothesis before running the analysis. • Instead, they apply the data-mining technique to the data and observe the results. • Analysts create hypotheses after the analysis to explain the patterns found.
  60. 60. • One common unsupervised technique is cluster analysis. ◦ A common use for cluster analysis is to find groups of similar customers from customer order and demographic data.
  61. 61. • With supervised data mining, data miners develop a model prior to the analysis and apply statistical techniques to data to estimate parameters of the model. • One such analysis, which measures the impact of a set of variables on another variable, is called a regression analysis. • Neural networks are another popular supervised data-mining technique used to predict values and make classifications such as “good prospect” or “poor prospect” customers.
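To make "estimate parameters of the model" concrete, here is a minimal sketch of the simplest case: a one-variable linear regression fitted by ordinary least squares. The model form (a straight line) is chosen before the analysis, and the data only supply the slope and intercept. The advertising and sales figures are invented.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (x) versus units sold (y).
ads = [1, 2, 3, 4]
units = [3, 5, 7, 9]

slope, intercept = fit_line(ads, units)

# Use the fitted model to predict units sold at a new spending level.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```

Real regression analyses usually involve several explanatory variables and a statistics package, but the supervised pattern is the same: model first, then parameter estimates from data.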
  62. 62. • A market-basket analysis is a data-mining technique for determining sales patterns. • A market-basket analysis shows the products that customers tend to buy together. • In market-basket terminology, support is the probability that two items will be purchased together. • You can expect market-basket analysis to become a standard CRM analysis during your career.
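Support, as defined on this slide, can be computed directly from transaction data. The sketch below also computes confidence and lift, which appear in the key terms but are not defined in the slides; the formulas used are the standard market-basket definitions. The baskets are made up.

```python
# Hypothetical transactions: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
    {"eggs"},
]

def support(*items):
    """Fraction of baskets that contain all of the given items."""
    wanted = set(items)
    return sum(wanted <= basket for basket in baskets) / len(baskets)

def confidence(a, b):
    """Estimated probability of buying b given that a was bought (standard definition)."""
    return support(a, b) / support(a)

def lift(a, b):
    """Confidence of a -> b divided by the base rate of b (standard definition)."""
    return confidence(a, b) / support(b)

print("support(bread, milk)   =", support("bread", "milk"))
print("confidence(bread→milk) =", round(confidence("bread", "milk"), 3))
print("lift(bread→milk)       =", round(lift("bread", "milk"), 3))
```

A lift above 1 suggests that buying the first item makes the second more likely than its base rate; a lift near 1 suggests the two purchases are unrelated.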
  64. 64. • A decision tree is a hierarchical arrangement of criteria that predict a classification or a value. • In this chapter’s sense, decision tree analyses are an unsupervised data-mining technique: the analyst sets up the computer program and provides the data to analyze, and the decision tree program produces the tree. (Machine-learning texts usually call decision trees supervised, because they are trained on records with known outcomes.)
  66. 66. • A common business application of decision trees is to classify loans by likelihood of default. • Organizations analyze data from past loans to produce a decision tree that can be converted to loan-decision rules. ◦ A financial institution could use such a tree to assess the default risk on a new loan.
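Once a tree has been produced, converting it to loan-decision rules amounts to writing each path from root to leaf as an if-then rule. The sketch below shows what such rules might look like in Python; the attributes, thresholds, and risk labels are entirely hypothetical and are not taken from any real tree or from the slides.

```python
def classify_loan(percent_past_due, credit_score, loan_to_value):
    """Apply hypothetical if-then rules, as if read off a decision tree."""
    # Rule 1: a history of late payments dominates everything else.
    if percent_past_due >= 0.5:
        return "high risk"
    # Rule 2: otherwise, a low credit score still signals high risk.
    if credit_score < 600:
        return "high risk"
    # Rule 3: a good payer with little equity in the asset is medium risk.
    if loan_to_value > 0.9:
        return "medium risk"
    # Rule 4: everything else.
    return "low risk"

# Assess a few new loan applications (made-up values).
print(classify_loan(0.6, 720, 0.50))
print(classify_loan(0.1, 720, 0.95))
print(classify_loan(0.1, 720, 0.50))
```

In practice the decision tree program chooses the splitting attributes and thresholds from past loan data; here they are hard-coded only to show the shape of the resulting rules.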
  67. 67. Source: Used with permission of Insightful Corporation. Copyright © 1999-2005 Insightful Corporation. All Rights Reserved.
  68. 68. • Knowledge management systems concern the sharing of knowledge that is already known to exist, either in libraries of documents, in the heads of employees, or in other known sources. • Knowledge management (KM) is the process of creating value from intellectual capital and sharing that knowledge with employees, managers, suppliers, customers, and others who need that capital.
  69. 69. • Knowledge management is a process that is supported by the five components of an information system. ◦ Its emphasis is on people, their knowledge, and effective means for sharing that knowledge with others. • The benefits of KM concern the application of knowledge to enable employees and others to leverage organizational knowledge to work smarter. • KM preserves organizational memory by capturing and storing the lessons learned and best practices of key employees.
  70. 70. • Content management systems are information systems that track organizational documents, Web pages, graphics, and related materials. • Such systems differ from operational document systems in that they do not directly support business operations. • KM content management systems are concerned with the creation, management, and delivery of documents that exist for the purpose of imparting knowledge.
  71. 71. • Typical users of content management systems are companies that sell complicated products and want to share their knowledge of those products with employees and customers. • The basic functions of content management systems are the same as for report management systems: author, manage, and deliver. • The only requirement that content managers place on document authoring is that the document has been created in a standardized format.
  72. 72. • Content management functions are, however, exceedingly complicated. • Most content databases are huge; some have thousands of individual documents, pages, and graphics.
  73. 73. • Documents may refer to one another, or multiple documents may refer to the same product or procedure. ◦ When one of them changes, others must change as well. ◦ Some content management systems keep semantic linkages among documents so that content dependencies can be known and used to maintain document consistency.
  74. 74. • Document contents are perishable. ◦ Documents become obsolete and need to be altered, removed, or replaced. • Multinational companies have to ensure document language translations.
  75. 75. Source: microsoft.com/backstage/inside.htm (accessed February 2004). © 2003 Microsoft Corporation. All rights reserved.
  76. 76. Source: Used with permission of Tom Rizzo of Microsoft Corporation.
  77. 77. Source: Used with permission of Tom Rizzo of Microsoft Corporation.
  78. 78. • Almost all users of content management systems pull the contents. • Users cannot pull content if they do not know it exists. ◦ The content must be arranged and indexed, and a facility for searching the content devised. • Documents that reside behind a corporate firewall, however, are not publicly accessible and will not be reachable by Google or other search engines. ◦ Organizations must index their own proprietary documents and provide their own search capability for them.
  79. 79. • Web browsers and other programs can readily format content expressed in HTML, PDF, or another standard format. • XML documents often contain their own formatting rules that browsers can interpret. ◦ The content management system will have to determine an appropriate format for content expressed in other ways.
  80. 80. • Nothing is more frustrating for a manager to contemplate than the situation in which one employee struggles with a problem that another employee knows how to solve easily. • KM systems are concerned not only with the sharing of content but also with the sharing of knowledge among humans. ◦ How can one person share her knowledge with another? ◦ How can one person learn of another person’s great idea?
  81. 81. • Three forms of technology are used for knowledge-sharing among humans: ◦ Portals, discussion groups, and email ◦ Collaboration systems ◦ Expert systems Portals ◦ Employees can share ideas by posting knowledge on a Web portal, from which managers and employees can pull the knowledge.
  83. 83. Discussion Groups ◦ Discussion groups allow employees or customers to post questions and queries seeking solutions to problems they have. ◦ Oracle, IBM, PeopleSoft, and other vendors support product discussion groups where users can post questions and where employees, vendors, and other users can answer them. ◦ Later, the organization can edit and summarize the questions from such discussion groups into frequently asked questions (FAQs).
  84. 84. Discussion Groups (continued) ◦ Basic email can also be used for knowledge-sharing, especially if email lists have been constructed with KM in mind. ◦ Two human factors inhibit knowledge-sharing: employees can be reluctant to exhibit their ignorance, and competition exists between employees. ◦ A KM application may be ill-suited to a competitive group; the company may be able to restructure rewards and incentives to foster sharing of ideas among employees.
  85. 85. Collaboration Systems ◦ Collaboration systems are information systems that enable people to work together more effectively. ◦ The Internet can be used as a broadcast medium for speeches, panel discussions, and other types of meetings. ◦ Web broadcasts, because they are digital, can be readily saved and replayed at the viewer’s convenience. ◦ Web broadcasts can also be made interactive by combining them with discussion group bulletin boards that are live during the broadcast. ◦ Video conferencing is another popular form of IT-supported meeting; video-conferencing equipment is expensive and normally is located in selected sites in the organization.
  86. 86. Collaboration Systems (continued) ◦ Net meetings are a means by which individuals can participate in remote meetings without leaving their desks; with a speaker and a Web camera, virtual meetings can be conducted among employees who sit in their own offices.
  88. 88. Expert Systems ◦ Expert systems are created by interviewing experts in a given business domain and codifying the rules stated by those experts. ◦ Many expert systems were created in the late 1980s and 1990s, and some of them have been successful. ◦ Expert systems suffer from three major disadvantages: they are difficult and expensive to develop, they are difficult to maintain, and they were unable to live up to the high expectations set by their name.
  89. 89. • Enormous amounts of data are generated each year. • Business intelligence (BI) tools search these increasing amounts of data for useful information. • Reporting tools, which tend to be used for assessment, process data using simple calculations such as sums and averages.
  90. 90. • Data-mining tools, which tend to be used for prediction, process data using sophisticated statistical and mathematical techniques. • Reporting systems create meaningful information from disparate data sources and deliver that information to the proper user on a timely basis. • RFM and OLAP are two examples of report applications.
  91. 91. • Data warehouses and data marts are facilities that prepare, store, and manage data for data mining and other analyses. • Market-basket analysis determines groups of products that customers tend to purchase together. • Decision trees are used to construct “If…Then…” rules for predicting classifications.
  92. 92. • Knowledge management is the process of creating value from intellectual capital and sharing that knowledge with employees, managers, suppliers, customers, and others who need that capital. • Human knowledge-sharing systems use portals, bulletin boards, and email to facilitate knowledge interchange. • Collaboration systems include net conferencing and video conferencing. • Expert systems codify the rules stated by experts in a business domain.
  93. 93. Business intelligence (BI) systems; Business intelligence (BI) tools; Clickstream data; Cluster analysis; Collaboration systems; Confidence; Content management systems; Curse of dimensionality; Data mart; Data mining; Data-mining tools; Data warehouse; Decision trees; Digital dashboard; Dimension; Dirty data; Discussion groups; Drill down; Dynamic report; Exabyte; Expert systems; Frequently asked questions (FAQs)
  94. 94. Granularity; If…then… rules; Knowledge management (KM); Lift; Market-basket analysis; Measure; Neural networks; OLAP cube; OLAP server; Online analytical processing (OLAP); Petabyte; Portals; Pull report; Push report; Query report; Regression analysis; Report media; Report mode; Report type; Reporting systems; Reporting tools; RFM analysis; Semantic security
  95. 95. Static report; Supervised data mining; Support; Unsupervised data mining
  96. 96. Security is a very difficult problem, and it gets worse every year. Physical security is hard enough: How do we know that the person (or program) that signs on as Megan Cho is really Megan Cho? • We use passwords, but files of passwords can be stolen. Suppose Megan works in the HR department, so she has access to personal and private data of other employees.
  97. 97. We need to design the reporting system so that Megan can access all of the data she needs to do her job, and no more. A reporting server is an obvious and juicy target for any would-be intruder. • Someone can break in and change access permissions. • Or, a hacker could pose as someone else to obtain reports.
  98. 98. Semantic security concerns the unintended release of a combination of reports or documents that are independently not protected. Megan was given just two reports to do her job. • Yet she combined the information in those reports with publicly available information and was able to deduce salaries for at least some employees. • These salaries are much more than she is supposed to know. • This is a semantic security problem.
  99. 99. The product managers wanted the data miners to analyze customer clicks on a Web page to determine customer preferences for particular product lines. • The products were competing with one another for resources. • “Sampling?” asked the product managers in a chorus. • “Sampling? No way. We want all the data. This is important, and we don’t want a guess.”
  100. 100. There’s nothing wrong with sampling. • Properly done, the results from a sample are just as accurate as results from the complete data set. • Studies done from samples are also cheaper and faster. • Sampling is a great way to save time and money. In truth, skill is required to develop a good sample. • The product managers should have listened to the data miners’ sampling plan and ensured that the sample would be appropriate, given the goals of the study. • Understanding this concept will save you and your organization substantial money!
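The sampling argument can be checked with a small simulation. This sketch assumes a made-up clickstream in which exactly 30% of 100,000 page views include a click on a given product line, and compares the click rate in a simple random sample of 1,000 views with the rate in the full data set. The seed is fixed only so the run is repeatable.

```python
import random

# Hypothetical population: 100,000 page views, 1 = clicked, 0 = did not click.
# By construction, exactly 30% are clicks.
population = [1 if i % 10 < 3 else 0 for i in range(100_000)]
full_rate = sum(population) / len(population)

# Simple random sample of 1,000 views (1% of the data).
rng = random.Random(42)
sample = rng.sample(population, 1_000)
sample_rate = sum(sample) / len(sample)

print(f"click rate, full data set: {full_rate:.3f}")
print(f"click rate, 1% sample:     {sample_rate:.3f}")
print(f"difference:                {abs(sample_rate - full_rate):.3f}")
```

With a sample of this size, the estimate typically lands within a percentage point or two of the true rate, at a fraction of the processing cost. The skill the slide mentions lies in making sure the sample is drawn so that it represents the population relevant to the study's goals.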
  101. 101. Classification is a useful human skill. Sorting and classifying are necessary, important, and essential activities. • But those activities can also be dangerous. Serious ethical issues arise when we classify people. • What makes someone a good or bad “prospect”? • If we’re talking about classifying customers in order to prioritize our sales calls, then the ethical issue may not be too serious. • What about classifying applicants for college?
  102. 102. I’m not really a contrarian about data mining. • I believe in it. • But data mining in the real world is a lot different from the way it’s described in textbooks. • One problem is that data are always dirty, with missing values, values way out of the range of possibility, and time values that make no sense. Another problem is that you know the least when you start the study. • So you work for a few months and learn that if you had another variable, say the customer’s zip code, or age, or something else, you could do a much better analysis.
  103. 103. Overfitting is another problem, a huge one. • With neural networks, you can create a model of any level of complexity you want, except that none of those equations will predict new cases with any accuracy at all. • When using neural nets, you have to be very careful not to overfit the data. Another problem is seasonality: • Say all your training data are from the summer. Will your model be valid for the winter?
  104. 104. “When you start a data-mining project, you never know how it will turn out.” • Some were bad and a waste of time. • Some were good: they found interesting and important patterns and information and produced very accurate predictive models. It’s not easy, though; you have to be very careful and lucky.
  105. 105. Computer simulation of World War III, a project at the Pentagon, 1971-1973. Analysis process: • Run the simulation and obtain a set of results. • The military analysts and weapons experts would examine the results, and if the results weren’t quite what was expected or wanted, the analysts would ask to change some of the inputs or a portion of the model. • Over time, an accumulated set of results was approved. • The accumulated results were presented to the four-star generals and other senior Pentagon managers. • Sometimes these senior people would see problems in the analyses and give instructions to discard some of the results.
  106. 106. Observation • I do not believe that anyone thought they were deceiving anyone else. • The top managers didn’t realize that the results they saw left out a substantial portion of the unfavorable simulations. • They never knew about the other results. • The analysts who were filtering the outcomes by throwing out the numbers did not believe they were being dishonest. • They simply thought that those results were wrong or unrealistic. • I do not think they realized they were using the computer to promulgate their prior ideas about military needs.
  107. 107. Questions to think about • Why perform the analysis? • What are you going to do with the results? • What is it that you want to know or to decide? Answer the questions above before you begin the analysis. • Then, pay attention to the results. • Don’t argue with the data. • If the results don’t conform to your expectations, think long and hard before changing the model, adjusting the data, or modifying the answers.