Transcript

  • 1. Business Intelligence: Competitive Advantage
Data Warehousing and Data Mining
R.K. Gupta*

“Knowledge [no more Information] is not only power, but also has significant competitive advantage”

The globalization of business, the liberalization of the economy and rapid strides in technology make strategic and operational (micro-level) plans virtually outdated by the time they are generated and ready for implementation. The changing economic scenario, with the ongoing process of liberalization, globalization and privatization, poses new challenges of multiple levels of complexity for planners and decision-makers. Planning has been transformed into a dynamic, ongoing process and thus needs a "Business Intelligence Solution".

In the present economic scenario, for a large number of organizations, making unbiased and faster decisions can make the difference between surviving and thriving, more so in an increasingly competitive market. Buried in the huge databases assembled by large organizations is information useful for generating new facts and relationships that can provide significant competitive advantage. Organizations have lately realized that just processing transactions and/or information faster and more efficiently no longer gives them a competitive advantage over their competitors in achieving business excellence. Information technology (IT) tools oriented towards knowledge processing can provide the edge that organizations need to survive and thrive in the current era of fierce competition. Enterprises are no longer satisfied with business information system(s); they require business intelligence system(s). Increasing competitive pressures and the desire to leverage information technology have led many organizations to explore the benefits of a newly emerging technology, viz. "Data Warehousing and Data Mining".
What is needed today is not just the latest information, updated to the nanosecond, but cross-functional information that can support decision making as an "on-line" process. Data mining technology is analogous to gold extraction: it is based on the filtration and assaying of a mountain of data "ore" in order to get data "nuggets". It is designed to help corporate organization(s) discover hidden patterns and delve deeper to establish hidden connections in the organization's data: patterns that can help planners and decision-makers understand the behavior of key users, detect likely trends and growth patterns, predict change(s) in the financial sector, etc., and thus manage the business effectively and gain a competitive edge.

* Senior Technical Director & Head, Analytics and Modelling Division, National Informatics Centre, Planning Commission (GOI), CGO Complex, Lodhi Road, New Delhi-110003. Tel. No.: (O) 011-4362530. E-Mail: rkg@hub.nic.in OR gupta@amdiv.delhi.nic.in
  • 2. Evolution of Information Technology Tools:
The first information technology tools used in the managerial world were for data collection and storage (Transaction Processing Systems): tools for systemizing the collection and storage of the data needed to manage the transactions that the enterprise entered into with its trading partner(s). This was followed by the realization that the organization creates an enormous amount of data (either internally, from the interaction of its functional areas, or from its interaction with business partners), so a system that manages these data in an integrated manner was needed. To address this need the Management Information System (MIS) concept was created. In theory, such a system allowed management to query the transactional database for summary as well as detailed reports on any matter of interest, on a routine or exception basis. Unfortunately, the MIS concept could not be implemented at the organizational level as an integrated system; it was implemented within functionally distinct areas such as finance, purchase, audit and human resource management. The organization-wide MIS has only recently been accepted and is being implemented in many organizations in a new incarnation: Enterprise (wide) Resource Planning (ERP) systems.

Together with the MIS concept, a need was felt for systems that support management in decision making by providing analytical and/or heuristic model(s): the Decision Support Systems (DSS). The evolution of information systems is thus an evolution from data maintenance systems to systems that transform data into "information" for use in the decision-making process. Organizations had realized that information was a competitive tool that allowed them to perform better in a dynamic environment.
These systems supported the acquisition of information from the database of transactional data. The managerial knowledge acquisition function was not directly supported by them (Fig. 1). They could not directly reveal new patterns emerging in a changing scenario; the planner was expected to do this from experience. For example, a shift in consumers' buying behavior was not directly reported; the manager had to analyze the sales figures over time personally to identify the reasons.

[Fig. 1: The Transformation of Data into Knowledge and associated tools. Data is processed into Information, which is processed into Knowledge. Associated tools: Transaction Processing Systems; Management Information Systems; Data Mining Tools and On-Line Analytical Processing Tools.]
  • 3. To answer questions that require analysis of historical data, a repository of the relevant data is needed. With advances in On-Line Transaction Processing (OLTP) systems, organizations have gathered data for decades: customer names, addresses, credit lines, purchasing preferences, product sales history, pricing elasticity, seasonality. The data exists somewhere, in a departmental database or in the organization's database. The problem lies not in the data itself but in data access, consistency, timeliness, accuracy and granularity (the level of detail), and in turn in the tools needed to access the data and the types of queries those tools can handle.

On-Line Analytical Processing (OLAP) tools are decision support tools that can access data in the operational database across multiple dimensions and apply statistical analysis tools (like cluster analysis, factor analysis etc.) to them. To use them, however, the user must have domain knowledge and must define the objective(s) of the analysis. Data mining tools, by contrast, allow the user to run more general queries: they allow the discovery of previously unknown or obscure patterns and relationships in a very large database, with the aim of arriving at comprehensible and meaningful results from extensive analysis of information.

Data mining tools can also run on the database management system (DBMS) that supports transaction/operational data processing (whose configuration may range from a centralized structure to a totally distributed system running disparate DBMSs), but they impose a large overhead on that DBMS (OLAP tools impose similar overheads). Therefore, to support data mining, huge repositories of data known as "data warehouses" are used. Data warehousing provides an architecture that gives different data mining technologies structured access to the data.
Data Warehouse:
The data warehouse makes an attempt to figure out "what we need" before we know we need it. Data warehousing is the process of integrating enterprise-wide corporate data into a single repository, from which end-users can easily run complex queries, generate multi-dimensional reports, and perform analyses. In general a database is not a data warehouse unless it has the following two features:
• It collects information from a number of disparate sources and is the place where this disparity is reconciled, and
• It allows several different applications to make use of the same information.

Table 1. Similarities and Differences between OLTP and Data Warehouse Systems
                  | OLTP                        | DATA WAREHOUSE
Purpose           | Run day-to-day operation    | Information retrieval and analysis
Structure         | RDBMS                       | RDBMS/Multi-dimensional DBMS
Data Model        | Normalized                  | Denormalized and/or Multi-dimensional
Access            | SQL                         | SQL plus data analysis extensions
Type of data      | Data that runs the business | Data to analyze the business
Condition of data | Changing, incomplete        | Historical, descriptive
  • 4. Together with the data collected from various sources (Fig. 2), the data warehouse stores a kind of data called metadata, which is "data about data". Metadata may record when the data was created, which system it came from, and which tools have handled it on its way from where it originated to where it is now. It is everything that surrounds the actual content of the data and gives a person an understanding of how the data was created and how it is maintained. Metadata serves as:
• A directory to help locate the contents of the data warehouse.
• A guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment.
• A guide to the algorithms used for summarization between the current data and the summarized data, etc.

In a typical implementation, the data warehouse application is coupled to the warehouse via the metadata, allowing changes to the data warehouse to be immediately reflected in the end-user data-access application. For example, if a corporation restructures to eliminate a layer of management, then as soon as the data corresponding to the new organizational hierarchy is added to the warehouse, the application should "reconfigure" itself using the metadata to reflect the new hierarchy.

Data warehouses also differ from conventional databases in that they are denormalized, i.e. the same data may appear several times in different places. Denormalization allows data to be combined into larger tables (the structures in which RDBMSs hold data) and reduces the number of input/output operations that have to be made, thereby speeding system operation.

W.H. Inmon (1993), in his landmark work Building the Data Warehouse, offers the following definition of a data warehouse: “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision making process.”

Subject-oriented means that the data warehouse focuses on the high-level entities of the business; in the case of marketing, subjects such as consumers, their income, their addresses, sales figures etc. This is in contrast to operational systems, which deal with processes such as bill payment. Integrated means that the data is stored in a consistent format (i.e., naming conventions, domain constraints, physical attributes, and measurements). For example, production systems may have several unique coding schemes for parts; in the data warehouse there is only one coding scheme. Time-variant means that the data is associated with a point in time (e.g. fiscal year, pay period etc.). Lastly, non-volatile means that the data does not change once it gets into the data warehouse.
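The four properties in Inmon's definition can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the two source systems, their part-coding schemes, the `CODE_MAP` table and all class and function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical mapping from (source system, local part code) to the single
# warehouse-wide part code. This is the "integrated" property: the warehouse
# holds exactly one coding scheme.
CODE_MAP: Dict[Tuple[str, str], str] = {
    ("plant_a", "P-001"): "PART0001",
    ("plant_b", "1"): "PART0001",
    ("plant_b", "2"): "PART0002",
}


@dataclass(frozen=True)  # frozen rows mirror "non-volatile": never edited in place
class WarehouseRow:
    part_code: str  # unified code; rows are organised around the subject "part"
    quantity: int
    period: str     # "time-variant": every row is tied to a period
    source: str     # metadata: which operational system the row came from


class Warehouse:
    def __init__(self) -> None:
        self._rows: List[WarehouseRow] = []

    def load(self, source: str, period: str,
             records: List[Tuple[str, int]]) -> None:
        """Append records from one source; existing rows are never updated."""
        for local_code, quantity in records:
            unified = CODE_MAP[(source, local_code)]
            self._rows.append(WarehouseRow(unified, quantity, period, source))

    def total(self, part_code: str, period: str) -> int:
        """Sum the quantity of one part in one period across all sources."""
        return sum(row.quantity for row in self._rows
                   if row.part_code == part_code and row.period == period)


wh = Warehouse()
wh.load("plant_a", "1998-06", [("P-001", 40)])
wh.load("plant_b", "1998-06", [("1", 10), ("2", 5)])
print(wh.total("PART0001", "1998-06"))  # 50
```

Both plants report the same part under different local codes, and the query on the unified code sees them as one subject. A real warehouse would also have to handle unmapped codes, which this sketch simply treats as an error.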
  • 5. [Fig. 2: Data Warehouse Architecture. Labels: Legacy Database, Operational Database, External Data Source; Extract, Transform, Maintain; Metadata; Data Warehouse.]

While designing and building a data warehouse, several important points must be kept in mind, among them:
• To support accelerated decision making, the right information should be available and easily accessible at the right time.
• The effort needed to create the infrastructure to support the data warehouse should not be underestimated.
• The requirement definition is more difficult because a data warehouse requires developing a system to support undefined requests.
• The data warehouse is not an operational system that people have to use to do their jobs. It has value, however, only if used.

Need for a Data Warehouse:
The data warehouse (DW) concept sprang from the growing competitive need to analyse information quickly. Existing operational systems cannot meet this need because:
• They lack on-line historical data.
• The data required for analysis resides in different operational systems.
• The query performance is extremely poor, which in turn impacts the performance of operational systems.
• The operational DBMS designs are inadequate for decision support.
As a result, information stored in operational systems is inaccessible to planners and decision-makers. A data warehouse eliminates these problems by storing, in a single consolidated system, the current and historical data from disparate operational systems that decision-makers need. This makes data readily accessible to everyone in the organisation who needs it, without interrupting on-line operational workloads.

Types of Data Warehouses:
There are two major approaches that differ greatly in scale and complexity: the data mart and the data warehouse. A "data mart" is a department-oriented or function-oriented data warehouse.
It is a scaled-down version of a data warehouse that focuses on the local needs of a specific department like finance or purchase.
  • 6. A data mart contains a subset of the data that would be in an organisation's data warehouse, since it is department-oriented. An organization may have many data marts, each focused on a distinct organizational activity (distinct functional domains such as finance, purchase, planning, human resource, etc.).

Table 2. Differences between Data Marts and Data Warehouses
Attribute                     | Data Mart                                                  | Data Warehouse
• Effort
Scope                         | Department                                                 | Enterprise
Time to build                 | ~Months                                                    | ~Years
Cost to build                 | Few Lakh(s) of Rs.                                         | Crore(s) of Rs.
Complexity to build           | Low to Medium                                              | High
• Data
Requirements for sharing      | Shared (within business area)                              | Common (across enterprise)
Sources                       | Few operational and external systems                       | Multiple operational and external systems
Size                          | Megabyte to low gigabyte                                   | Gigabytes to terabytes
Time horizon                  | Near-current and historical data                           | Historical data
Amount of data transformation | Low to Medium                                              | High
Frequency of updates          | Daily, weekly                                              | Weekly, monthly
• Technology
Hardware                      | Intel-based (or compatible) computers and minicomputers    | Minicomputers and mainframe computers
Operating system              | NT                                                         | UNIX, MVS, and others
Database                      | Workgroup database servers                                 | Large (enterprise) database servers
• Usage
Number of concurrent users    | Tens                                                       | Hundreds
Types of users                | Business area analysts and key managers                    | Enterprise analysts and senior executives
Business focus                | Optimising activities within the department/business area  | Cross-functional optimisation and decision making

A data warehouse is an orderly and accessible repository of known facts or things from many subject areas, used for decision making. In contrast to the data mart approach, the data warehouse is generally organization-wide in scope. Its goal is to provide a single, integrated view of the enterprise's data, spanning all the enterprise's activities. The data warehouse consolidates the various departmental perspectives into a single enterprise perspective.
Because the data mart is more focused than the data warehouse, the complexities involved in its creation and maintenance are smaller than those of the data warehouse. The major differences between data marts and data warehouses are listed in Table 2.

  • 7. Data Warehouse Functions:

[Fig. 3: Data Warehouse Functions / Components. Data flows from operational and external data through six functions:
Access: Operational and External Data
Transform: Cleanse, Reconcile, Enhance, Summarize, Aggregate
Distribute: Stage, Join Multiple Sources, Populate on demand
Store: Relational Data, Specialized Cache, Multiple Platforms and Hardware
Find: Information Catalogue, Business Views, Models
Display & Analyze: Query and Reporting, Multi-dimensional Analysis, Data (label truncated in source)]

Fig. 3 represents the flow of data from its original source to the user, and includes management and implementation capabilities. For example, access mechanisms are required to retrieve data from heterogeneous operational databases. That data is then transformed and delivered to the “data warehouse store” based on a selected model (or mapping definition). The metadata defines this model and the transformation of the original data. The data transformation and movement processes are executed whenever an update to the warehouse data is desired. The data warehouse management software has the capability to manage and automate the processes required to execute these functions.

Data Warehouse Architecture:
Each implementation of a data warehouse is different in its detailed design (a schematic high-level view of the architecture and its components is given in Fig. 4), but all are characterised by a handful of key components:
• A data model to define the warehouse contents. This model is different for every implementation, but the utility and success of the data warehouse depend to a large extent on how well the data model reflects the type of processing that will be done on the stored data and on how well the warehouse reflects the business process.
Data warehouse modelling (the warehouse being a subject-oriented database, as detailed above) must take the following issues into consideration: a) what business process is being modelled, b) what are the measures or facts (information) to be stored, c) at what level of detail (granularity) is “active” analysis conducted, d) what do the measures have in common (the “dimensions”), e) what are the dimensions’ attributes, and f) are the attributes stable or variable over time, and is their “cardinality” bounded or unbounded.
  • 8. • A carefully designed warehouse database, whether hierarchical, relational, or multidimensional. While choosing a DBMS it must be kept in view that the database management system should be powerful enough to handle huge amounts of data running up to terabytes. Well-known relational DBMS products include DB2, Informix, Oracle, Sybase, etc.; multidimensional DBMSs are offered by Kenan Systems Corp. (Acumate ES), Dimensional Insight (CrossTarget), Informix (MetaCube) etc.
• A front end for Decision Support System (DSS) for reporting and for structured and unstructured analysis. The data warehouse allows the storage of data in a format that facilitates its access, but if the tools for deriving information and/or knowledge and presenting them in a format useful for decision making are not provided, the whole rationale for the existence of the warehouse disappears. Various technologies for extracting new insight from the data warehouse have come up, which we classify loosely as "Data Mining Techniques".

[Fig. 4: Schematic view of the Data Warehouse Architecture. Labels: Legacy Database, Operational Database, External Data Source; Extract, Transform, Maintain; Metadata; Data Warehouse; front-end tools: Query and reporting, Multi-dimensional analysis tools, Other OLAP tools, Data mining tools.]

Data Mining:
The process of extracting new information from the repository of historical data (the data warehouse) using advanced statistical and artificial intelligence techniques is known as data mining. Data mining requires prospecting: the exploration that constantly guides mining operations. To get the best out of the data mining process, home-grown data is not sufficient; one may also have to overlay outsourced data on it, viz. demographics, geographical information, weather and climate patterns, economic and social indicators etc.
  • 9. On-Line Analytical Processing (OLAP), though strictly speaking not a data mining technique, is an efficient architecture for performing complex analysis from a business perspective while hiding the complexity of the underlying data structures. OLAP is also called multidimensional analysis, as it typically involves analysis of trends and comparisons across business dimensions such as product, sales region, or distribution channel, via analytical operations such as data consolidation, drill-down, and slicing and dicing. OLAP tools allow the user to analyse complex data relationships quickly and easily, using historical, projected, and derived data to provide detailed reports. OLAP relies on the user to provide the path or route through the data for the analysis, but databases in data warehouses are often so large and complex that they cannot be analysed adequately with repetitive queries and reports. In such situations, data mining tools can be used to automate the decision-support process and find facts hidden in databases (Fig. 5). Using a combination of machine learning and database technology, data mining tools find patterns in data and infer rules about the patterns; essentially, they find answers to questions that users do not know to ask. Techniques such as multidimensional analysis are then employed to evaluate the implications of the inferred rules, and the information is presented in a suitable form with graphics, reports, text, and hypertext.

[Fig. 5: The Information Value Chain and Information Value Pyramid. Labels: Operational Database Management Systems, Data Warehouse, Data Analysis, Data Mining, Data Presentation, Business Decision(s); value chain: Data, Information, Knowledge.]

On-Line Analytical Processing (OLAP):
On-line analytical processing is the next logical step beyond query and reporting. OLAP software tools deliver the technological means for complex business analysis by enabling end-users to analyse data in a multidimensional environment.
With OLAP tools, users can analyse and navigate through data to discover trends, spot exceptions, and get underlying details to better understand the ongoing process. One example of an OLAP tool is the Plato OLAP tool bundled with SQL Server 7.xx by Microsoft. A user’s view of the enterprise is typically multidimensional in nature. Fertiliser consumption, for example, can be viewed in three dimensions: type (N, P, and K), time and region.
  • 10. To be effective, therefore, OLAP tools must allow multidimensional "visualisation" and analysis of data.

Analysis requirements span a spectrum from statistics to simulation. The two forms of analysis most relevant in this context are commonly known as “slice and dice” and “drill-down”. “Slice” means the facility to view data along any dimension. For example, if fertiliser consumption data is available across the three dimensions of type, region and time, the user can fix any one of the dimensions, say time as 1998, and view the data distributed across type and region. This allows the user to view the data in a more specific context. Dicing is the facility to rotate the data about any particular dimension. “Drill-down” is the technique by which data presented in summarised form is expanded to show more detail. This allows the user to navigate or “drill” through information to get more detail.

During data analysis, a user can spot exceptions. Using OLAP data navigation, the user can drill down through levels of data to get more details that help answer “why” questions about the exceptions. For example: why was the consumption of phosphate fertiliser in UP state low in June 1998 compared with January 1998?

Although OLAP users need not formulate their questions in advance, OLAP still requires the user to select paths through the data, which limits the findings to the areas pursued. The results of OLAP analysis help organisations answer specific questions like, “Who are my best customers for this product in this region?” But it cannot answer general questions like “How can I segment my customer base for better targeting of products manufactured by us?” Data mining helps the user to ask more general questions, ones that do not automatically limit the results.
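The slice and drill-down operations described above can be sketched in a few lines of Python. This is a minimal illustration, not an OLAP engine: the consumption figures in `FACTS` are invented, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical fact rows keyed by the three dimensions of the fertiliser
# example, with time split into year and month:
# (type, region, year, month) -> consumption. All figures are invented.
FACTS: Dict[Tuple[str, str, int, int], float] = {
    ("N", "UP", 1998, 1): 120.0,
    ("N", "UP", 1998, 6): 90.0,
    ("P", "UP", 1998, 1): 60.0,
    ("P", "UP", 1998, 6): 20.0,
    ("P", "Punjab", 1998, 1): 50.0,
    ("P", "Punjab", 1998, 6): 45.0,
    ("K", "UP", 1997, 6): 30.0,
}


def slice_by_year(year: int) -> Dict[Tuple[str, str], float]:
    """Slice: fix the time dimension and total the data by (type, region)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for (ftype, region, y, _month), value in FACTS.items():
        if y == year:
            totals[(ftype, region)] += value
    return dict(totals)


def drill_down(ftype: str, region: str, year: int) -> Dict[int, float]:
    """Drill-down: expand one summarised cell into its monthly detail."""
    return {
        month: value
        for (t, r, y, month), value in FACTS.items()
        if (t, r, y) == (ftype, region, year)
    }


summary_1998 = slice_by_year(1998)
print(summary_1998[("P", "UP")])     # 80.0, the summarised cell
print(drill_down("P", "UP", 1998))   # {1: 60.0, 6: 20.0}, the detail behind it
```

The drill-down on the phosphate cell for UP exposes the low June figure that the yearly total hides, which is the kind of exception the "why" question above starts from. Note that the user still chooses which cell to expand.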
Examples of the general questions that data mining can address are: “What sort of customers should we be targeting?”, “Which airline passengers flew to Bombay last month and might be invited to respond to special pricing on tickets for this month?” or “Which customers bought computers but no printers last month, so that we can entice them with a discount on a new printer?”

Data Mining:
Database mining or data mining (DM) (formally termed Knowledge Discovery in Databases, KDD) is a process that aims to use existing data to derive new facts and to uncover new relationships previously unknown even to experts thoroughly familiar with the data. It is like extracting precious metal (say gold) and/or gems, hence the term “mining”. It is based on the filtration and assaying of a mountain of data “ore” in order to get “nuggets” of knowledge. The data mining process is shown diagrammatically in Fig. 6.
  • 11. [Fig. 6: The Data Mining Process. Data Sources 1, 2, ..., N feed the Data Warehouse; steps and their outputs: Select (Selected Data), Transform (Transformed Data), Mine (Extracted Information), Assimilate (Assimilated Information).]

Humans are especially adept at both of these tasks (filtration and assaying of data), but the brain makes such advances slowly and sporadically. Computer databases pose additional, unique problems:
• Database structures are highly complex. They contain numerous tables connected through abstract linkages that the mind finds difficult to trace. As a result, very efficient data gathering mechanisms have drowned the decision-making capabilities of management in a sea of information.
• Digitised databases are hidden from sight, so the details in the records are unseen and unanalysed.
• The size and nature of the databases make it impossible for the mind to detect hidden patterns and ill-formed relationships.

Data mining includes the identification of relationships that would have gone undetected without the application of specialised approaches. For example, one application determined that certain bank customers with occasional overdrafts and characteristic deposit histories were especially good candidates for home equity loan advertising. In another, a fraud detection system identified a fraudulent mortgage unit that changed names frequently and defrauded many different banks, duplicating in minutes the findings of a team of investigators who had worked with the same data for two years.

Data mining helps the user to discover the right questions to ask, based on the patterns found in the data. It is a strategic tool that can uncover patterns in data already owned by the enterprise, allowing the enterprise to build a more effective customer relationship, which is now recognised to be a very powerful competitive tool.
  • 12. It automates the process of knowledge discovery, helping to pinpoint particular areas of interest, predict outcomes, and extend the capabilities offered by other business intelligence tools such as OLAP tools.

Whereas predefined and ad-hoc access tools provide top-down, query-driven data analysis, data mining provides bottom-up, discovery-driven data analysis (also known as “knowledge discovery”). The predefined and ad-hoc access tools allow users to test their theories or hypotheses repeatedly by exploring the data. In contrast, data mining identifies facts or conclusions by sifting through the data to discover patterns or anomalies. Data mining tools typically access more granular data than ad-hoc query tools. Data mining complements predefined and ad-hoc access tools by enabling users to discover new relationships in the data that they may have overlooked, such as those that help explain consumer behaviour. Unlike analytical tools, it can be automated to run continuously in the background, saving significant time.

Data mining does not necessarily require a data warehouse to be effective, but the presence of a data warehouse makes the data mining operation easier. A data mining project requires a well-formulated strategy to design the warehouse and the tools to mine it, and then to follow up with actions on the basis of the mined knowledge (Fig. 7).

[Fig. 7: The Data Mining Project. The Select, Transform, Mine and Assimilate steps of Fig. 6, preceded by: Define the Problem, Scope the Project, Identify Data Sources; and followed by: Take Action, Measure Results, Assess (label garbled in source).]
Data mining derives business intelligence from the data warehouse by using advanced analytic techniques such as neural networks, logic (heuristics, inductive reasoning, and fuzzy logic), tree-based models, and advanced statistical techniques (cluster analysis, discriminant analysis, logistic regression, survival testing, hypothesis testing, perceptual mapping and conjoint analysis; these are appropriate for answering “why” and “how” questions).
  • 13. Neural networks have been used to forecast electronic network and component failure, identify loan applicants who are likely to default, carry out image recognition, and perceive stock and bond market fluctuations. Surprisingly accurate predictions and identifications have been made by neural networks in areas in which human experts have had difficulty defining and programming traditional systems for these tasks.

Logical inference theory has been used to locate relationships, and examples of relationships, that may be suspected but unverified. For example, suppose one had a database of persons who contracted a variety of diseases, and another database that contained genealogical relationships among the people whose disease histories are found in the first database. Some pattern-matching rule-based algorithms are capable of determining which diseases might be genetic, because they exhibit characteristics found predominantly in males or commonly found both in parents and in children. In business, these pattern-matching systems can be used to identify airline passengers who travel on particular routes, or potentially fraudulent credit requests that differ from an individual’s normal buying behaviour, etc.

The four major operations of data mining are predictive modelling, database segmentation, link analysis, and deviation detection.
• Predictive Modelling: A form of inductive reasoning that uses neural networks and inductive reasoning algorithms (rule-based models) to create software models that can be used to predict future situations, such as which customers are likely to leave for the competition.
• Database Segmentation: Partitions the data into clusters using statistical cluster analysis techniques.
• Link Analysis: Identifies connections between records, based on association discovery and sequential patterns.
• Deviation Detection: Detects and explains why certain records cannot be put into specific segments.
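Of the four operations, link analysis by association discovery is the easiest to show in miniature. The Python sketch below is a brute-force illustration over invented purchase baskets; the support and confidence thresholds, the data in `BASKETS` and the function names are all assumptions made for the example, not a method taken from the paper.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Invented purchase baskets, one per customer, purely for illustration.
BASKETS: List[Set[str]] = [
    {"computer", "printer"},
    {"computer", "printer", "paper"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "paper"},
]


def association_rules(
    baskets: List[Set[str]], min_support: float, min_confidence: float
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return rules (a -> b) with their (support, confidence).

    support    = share of baskets containing both a and b
    confidence = share of baskets containing a that also contain b
    """
    n = len(baskets)
    items = sorted(set().union(*baskets))
    rules: Dict[Tuple[str, str], Tuple[float, float]] = {}
    for a, b in permutations(items, 2):
        has_a = sum(1 for basket in baskets if a in basket)
        both = sum(1 for basket in baskets if a in basket and b in basket)
        support = both / n
        confidence = both / has_a if has_a else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


def bought_without(baskets: List[Set[str]], a: str, b: str) -> List[int]:
    """Indices of customers who bought a but not b (candidates for an offer)."""
    return [i for i, basket in enumerate(baskets) if a in basket and b not in basket]


rules = association_rules(BASKETS, min_support=0.3, min_confidence=0.6)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
print(bought_without(BASKETS, "computer", "printer"))  # [2, 4]
```

Checking every ordered pair is only workable for a handful of items; it is used here to keep the definitions of support and confidence visible. The `bought_without` helper corresponds to the "computers but no printers" question quoted earlier in the text.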
Text Mining: Organisations generate, collect and hold large volumes of data, which they use in day-to-day operations. These data are mostly numeric or text. A large number of tools and products are available to analyse and generate valuable information from the stored numeric data. The text-based databases take various forms, such as:
a) Electronic mail from customers, containing feedback about products and services.
b) Internal documents such as memos and presentations, which embody corporate expertise.
c) Technical reports which describe new technology - Patent Information Systems.
d) News wires carrying information about the business environment and the activities of competitors.
  • 14. Many organisations are unable to capitalise fully on the value of this data because the information implicit in it is not easy to discern. The need for tools to deal with such databases is already large. This implies an opportunity to make more effective use of repositories of business communications, and other unstructured data, by using computer analysis. Since these databases contain only implicit and not explicit information, they need a specialised software tool. Text mining is a new and emerging technology that promises to discover hidden patterns and extract valuable information from text data.
Features: Unlike most search engines, text mining has built-in intelligence in addition to searching, and also trains the system, in a manner parallel to neural network methodology. Typically it performs the following:
• Feature extraction - finding the key single or multi-word concepts in a document or a collection of documents.
• Clustering - discovering predominant themes in a document collection. The added feature of clustering is that it can learn by itself and act intelligently. The clustering algorithm can take a query and determine which areas and data sources need to be searched and eliminate the others, drastically reducing information query and retrieval times.
Functions: Much of the benefit of text mining lies in the combination of its various functions. Some of these functions include:
• Language Identification - To identify the language in which a document is written.
• Feature Extraction - To recognise significant vocabulary items in a document, such as names, abbreviations, dates and currency amounts.
• Clustering.
• Categorisation - To assign documents to pre-existing categories.
• Visualisation - To present the information in a way which is easy to understand.
Both data mining and text mining are well established and widely used tools. As regards the mining of GIS/video documents, the technology is still evolving, and it may take some more time before it stabilises.
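The feature extraction and categorisation functions described above can be sketched in a few lines of Python. This is a deliberately simple, word-counting illustration and not how any particular text mining product works: the stop-word list, the category names and their keyword sets are hypothetical, and real tools rely on linguistic analysis and trained models rather than fixed keyword lists.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a much fuller one.
STOPWORDS = {"the", "a", "an", "is", "and", "to", "of", "my", "was",
             "for", "it", "in", "i", "on", "has", "not", "been"}

# Hypothetical pre-existing categories, each described by a few keywords.
CATEGORIES = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "delivery": {"delivery", "courier", "late", "shipping"},
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def extract_features(text, top_n=3):
    """Feature extraction: the most frequent non-stop-words in a document."""
    counts = Counter(w for w in tokenize(text) if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

def categorise(text):
    """Categorisation: pick the category whose keywords overlap the document most."""
    words = set(tokenize(text))
    scores = {name: len(words & keywords) for name, keywords in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorised"

# Made-up customer e-mail used only to exercise the functions.
email = "The courier was late and the delivery box was damaged. Late again!"
print(extract_features(email))
print(categorise(email))
```

Keeping extraction and categorisation as separate functions mirrors the point made above that much of the benefit comes from combining functions: the extracted features of one step can feed clustering or visualisation in another.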
  • 15. Data Warehouses - the `Ideal' Solution v/s Data Marts:
ADVANTAGES OF DATA WAREHOUSING:
• Centralized storage of information, reducing redundancy.
• Ensures data integrity.
• Common understanding of data across the enterprise.
• Effort of data extraction and loading is done once.
• More efficient use of hardware and networking resources.
ADVANTAGES OF DATA MARTS:
• Quicker return on investment.
• Less costly (time, money, personnel).
• Fewer variables to control and hence lower risk.
• Less chance of scope escalation.
• Possible to implement even when a common understanding of business across departments does not exist.
• Less data, fewer conflicts, and simpler models.
• Less complex in terms of the types of users, and hence in the design of summary tables.
• More focussed on specific business problems.
DISADVANTAGES OF DATA WAREHOUSING:
• Long term, longer time for ROI.
• A high number of users have to agree to spend time and money for the process to succeed.
• One person can stop the process.
• Differences among persons handling functional areas of LoBs can bring the project to a halt.
• Different understanding of data between departments.
• More types of users means complex summaries and applications need to be built.
• Scope escalation, as requirements change through the length of the project.
• Higher risk as the number of variables increases.
DISADVANTAGES OF DATA MARTS:
• Duplication of effort in data extraction and cleaning.
• Duplication of data at times, with integrity problems.
• Common metadata cannot be achieved when a common understanding of business does not exist.
• Possibility of different standards.
Reasons for Failure: A data warehouse can only be successful if carried out meticulously. The main reasons for failure, to be kept in mind before going for a data warehouse, are:
• Cost overruns, caused by wrong estimation of hardware/network resources.
• Time overruns, and changing priorities.
• Scope escalation.
• Lack of focus on the main issue(s).
• Non-co-operation by one or more user groups.
Myths about Data warehouses:
Myth: Data warehouse is a repository of all historical data.
  • 16. Reality: Data warehouse is historical data required for decision support, arranged by subject area.
Myth: Data warehouse is always a very large database.
Reality: Data warehouse could be small. It depends on the kind of business, and the amount of information required for solving business problems.
Myth: Complexity of data warehouse comes from the size of data and number of users.
Reality: Complexity comes from the multiplicity of data sources, and the different types of users/subject areas.
Myth: Data marts are smaller data warehouses.
Reality: Data marts are focused, subject-specific data warehouses. Actual database size does not determine whether it is a warehouse or a mart.
Myth: Data marts have fewer users.
Reality: Data marts have fewer types of users. If a solution has to be used by 100 sales managers, it is still a data mart. If it is used by 5 sales managers, 5 production managers and the CEO, it turns into a warehouse.
Data Mining Tools: Numerous tools specifically designed for data mining are available. The tools differ substantially in the types of problems they are designed to address and in the ways in which they work.
Table 3: Data Mining Tools
Product                         Company               URL
Intelligent Miner               IBM                   www.ibm.com
Data Mart Suite/Express         Oracle                www.oracle.com
Data Mine                       Red Brick Systems     www.redbrick.com
Discovery Server                Pilot Software        www.pilotsw.com
Enterprise Miner                SAS Institute         www.sas.com
Express                         IRI Express           www.express.com
Business Miner                  Business Objects      www.businessobject.com
Meta Cube/Informix/Red Bricks   Informix              www.informix.com
MineSet                         Silicon Graphics      www.sgi.com
Scenario                        Cognos Corporation    www.cognos.com
Seagate Holos                   Seagate Software      www.seagate.com
SPSS -Clementine                SPSS Inc.             www.spss.com
XpertRule                       Attar Software        www.attar.com
Tools that use advanced statistical techniques, neural networks or genetic algorithms are also available (Table 3 provides information about some other data mining tools available).
  • 17. Data Mining Applications
Typical data mining applications could include:
• Consumer segmentation on similar buying behaviour.
• Profiling customers for individual relationship management.
• Increasing response rate from mailshots.
• Identifying the most profitable customers and the underlying reasons.
• Understanding why customers are leaving for the competitors. It can provide information such as: customer dissatisfaction peaks during holidays, when the company’s on-line staff runs a skeletal service. Using such information, the management can plan future strategies or tactics.
• Uncovering factors affecting purchasing patterns, payments, and response rates.
• Detection of fraudulent credit card transactions or insurance claims.
• Preparing for utility demand (telecom sector, transport, energy and water).
• Anticipating a customer’s future actions based on current histories and characteristics.
• Other application areas can be mass customisation, cross-selling, demand forecasting, inventory control, machine (part) maintenance, risk analysis, multi-product campaigning, etc.
Data mining enables organisations to take full advantage of the investments they have made, and are currently making, in building data stores. The decision-maker can tap the unique opportunities that data mining offers; thus, large corporate houses are capitalising on their databases and becoming sole proprietors of competitive advantage. Due to the strategic nature of these applications, organisations normally do not discuss their use of data mining and data warehousing technology with outsiders. Even so, the number of reported instances of data mining being used effectively is growing at a very fast pace.
More specifically, the following are the reported applications (or planned uses) of data mining techniques in the business environment (both Indian and foreign):
• A medical supplier company increased its return on advertisement by targeting doctors who were most likely to make a second purchase.
• A collection agency improved its ability to determine which delinquent accounts were most likely to be collectable.
• A bank initiated an automobile loan campaign by predicting which customers were likely to buy a new car.
• A researcher discovered the conditions under which it was most likely that companies would take corporate write-downs.
• A health insurance company discovered that understaffed medical units were sending patients for tests as a means of warehousing them until staff could deal with them.
• A telecom company found out how cricket telecasts affect the telecommunication services.
• A leading bank discovered how punctuality at the bank’s teller counters determines daily cash outflows.
  • 18.
• An insurance company searched its data for patterns indicating fraud(s).
• A telephone company began predicting which of its new customers were likely to turn over (to the competition) in a short period of time, limited its advertising to them, and increased its customer retention ratio by evaluating their payment patterns.
• A life insurance company discovered the patterns that lead to early cancellation of insurance policies.
• A cosmetic company found that the sales of moisturising lotion were below average in northern India last year (1998), as humidity levels were above normal that year.
• A production manager came to know that the mean time between failures of the industrial refrigeration systems sold has a direct correlation with the age of the customer.
• The BPL group is deploying statistical analysis tools to extract patterns from its sales and component-inventory database.
• Hutchison Max and Modi-Telstra are both constructing integrated data warehouses, so as to plumb them for insights into customer behaviour.
• Reliance Industries is setting up an enterprise-wide data warehouse and data mining system.
• Citibank (India) is using data mining to manage customer relationships.
• Godrej-GE Appliances has built a data warehouse to use data mining to understand its distribution chain better.
• The National Stock Exchange (NSE) is putting up a data warehouse as a prelude to using data mining tools to manage its clearing house operations, followed by capital market operations, and then derivatives trading.
• The State Bank of India plans to use data mining as a means for better customer account management.
• MCI Communications (a $20 billion organisation) discovered complex patterns in the usage of telecommunication services by different sets of customers, and is using the findings to revise its tariffs in a way that benefits users and optimises its revenues.
Applications in the Government: "Data warehouse and Data mining" is a perfect means of preparing the government to face the challenges of the next millennium. Government departments are in the process of a paradigm shift - a transformation in how to better govern at centre, state and district level. Officials are often forced to do more in less time, compete with the private sector, operate with tighter budgets and smaller staffs, and provide better service to the people. As a result, they are being forced to evaluate their core strengths and weaknesses and find new strategies for carrying out development activities. Information technology has served a vital role in the drive to meet these new challenges. It is no surprise that data warehousing, one of the hottest developments in the IT industry, is quickly becoming an integral part of IT strategy. Warehousing and mining could be the most significant advancement in government computing in the near future. The proven benefits of data warehousing for commercial organisations are clear. Recent surveys indicate that a large percentage of Fortune 2000 corporations are either planning or have already built large-scale data warehouse initiatives, as a means to increase
  • 19. sales, reduce costs and maximise profits. These initiatives will enable sophisticated decision support systems to deliver necessary information throughout organisations and beyond. But the question remains: how will public organisations benefit from jumping on the data warehousing bandwagon? The answer lies in the realisation that information is still the government's largest asset, and that its full potential remains untapped. Both private and public organisations face the reality that resources are limited. Capital assets are scarce and will only continue to be so, and rightsizing and downsizing result in limited human resources. As these resources decline, organisations continue to amass large amounts of data: information that often holds the key to more efficient organisational operation. However, government organisations are realising that the means of accessing this information still has unresolved issues. The information exists, but creating smooth enterprise-wide access to these data stores is another matter. To understand the in-depth relevance of "Data warehouse and Data mining" in the government sector, one must understand the major difference between the objective of a government/public sector undertaking (enterprise) and that of a private sector enterprise. The objective of a government/public sector enterprise is not solely maximisation of profit, but also the economic development of the nation (as a long-term goal) and the welfare of society, whereas a private sector enterprise is oriented towards the sole objective of maximisation of profit. But even though the objectives of these two categories of enterprise are entirely different, they share some features:
• To generate and process the latest, timely and up-to-date information to create an information/knowledge base.
• Allocation of limited resources (of the nation and/or enterprise) to meet the above objective.
Typically, the environment in the government is such that all development sectors have a direct or indirect impact on each other and are inter-linked. For example, health has implications for productivity. Investment in education eventually leads to higher standards of nutrition and family planning. Investment in the fertiliser sector increases agricultural productivity. However, resources are limited, so investment in one sector may result in lower productivity in other sectors. One needs to study and describe these links to achieve the common objective of national development. Moreover, to evaluate any scenario in advance, for planning and decision making, one typically needs to develop a data warehouse covering economics, production, national accounts, demography, agriculture, energy, health, education, nutrition, environment etc. In short, one needs to develop local and global data warehouses, depending upon the needs, to strengthen decision making. Broadly, this provides the capability of moving from "Planning in isolation" to "Planning following an Integrated Approach". Although DW & DM are being used effectively by large sales, services and marketing organisations for activities such as database marketing, segmentation and consumer management, there are also a large number of applications in the government, at the centre and in the states as well as in attached organisations. Some of the major application areas include the development of local and global data warehouses from the following databases/data marts, depending upon the key objective(s).
  • 20. A. Data Warehouse and "Data mining"
• Ministry of Agriculture: Production; Consumption; Agricultural Marketing; Fertiliser Consumption; Seeds; Prices (wholesale & retail); Technology; Agricultural census; Marketing region(s); Livestock; Crops; Agricultural credit; Plant Protection; Watershed; Area under Productions yields; Land use statistics; Finance & Budget etc.
• Ministry of Petroleum & Natural Gas: Marketing; Finance; Personnel; Pricing; Import; Crude & Product Production; Sale - from Oil Corporations (IOC, BPCL, HPCL, & others); Marketing Division of Ministry of P&NG.
• Department of Tourism: Foreign Tourist Arrival System (FTAS); Customer preference/behaviour Data base; Tourism & Product Development Information; Foreign Exchange earnings; Employment Opportunities; Manpower & Training; Marketing Research; Publicity; Hotel Classification System; Travel & Tour Operators data base.
• Ministry of Rural Development: Below Poverty line; DRDA; Drinking Water; Rural Population - census; Rural Development schemes of the State & Central govt.; per capita income - Rural, Urban.
• Ministry of Health & Family Welfare: Health & Family Welfare MIS; Community Needs Assessment Approach; Immunisation (mother & child health); External Aid monitoring; National Programme for Control of Blindness (NPCB); National Leprosy Eradication (NLEP); National Malaria Eradication (NMEP); National AIDS Control; Indian System of Medicine & Homeopathy (plants, herbs, medicines etc.); Drug policy; Health law; Morbidity & Mortality pattern; Medical Record System (Hospital); Stores; Medical & Paramedical manpower; NGO data base; Emergency Medical Relief; Health Education; census etc.
• Ministry of Energy: CEA; MOC; MOP&NG; DC&PC; Power Plants; Non-conventional energy.
• Planning Commission: State Plans (All sectors); Labour; Health; Education; Trade; Industry; Annual Budget; Five Year Plans; State Plan Project; Rural Development; Energy including Non-conventional.
• Department of Programme Implementation: Central Sector Projects costing Rs.20 crores & above.
  • 21.
• Ministry of Commerce: Import & Export (Trade); E-Commerce; Exports & Imports data bank (8 digit HSCODE); Foreign Trade of India (Principal Commodities and countries); Trade Policy; Balance of Payment; World Price monitoring system; Provisional Estimates of Import & Export.
• Deptt. of Revenue: Customs & Central Excise; Income Tax; Commercial Taxes.
• Deptt. of Economic Affairs: External Assisted Projects in the various central govt. ministries/deptts.; Budget Expenditure; Annual Economic Survey(s).
• Ministry of Welfare: Welfare Schemes data bases; Programmes for Weaker sections of society; NGO - for Welfare Projects/Schemes.
• Ministry of Shipping & Transport: Shipping Information System; Shipping Tonnage Information System; Chartering Information System; Transport Statistics.
• Audit & Accounts: Govt. Accounts Data base.
• Ministry of Railways; Deptt. of Coal; Department of Posts; Department of Telecommunication; Ministry of Labour; Ministry of Civil Supplies; Ministry of Education.
• Public Utilities Departments: CPWD; State Vidyut Board; State Development Authority; MTNL; Police Department; Sales Tax; Accident Tribunals; Hospitals.
B. Data Warehouse and "Text Mining"
• Human Rights Commission
• MRTPC
• Supreme Court & High Court Judgement Cases
• Parliamentary question - answers and Debate Information
• Patent Information
• Public Grievances
• Department of Welfare
  • 22.
• Land Records
• Public Interface Departments: Passport; Licensing Authority; Ration Card
In general, for the government departments, the data base generated, updated and maintained by the corresponding public sector undertaking or attached organisation will act as a Data Mart (or a Local warehouse) to the global warehouse to be generated and maintained in the corresponding central govt. ministry/deptt. The same is true for the state govt. deptts. Typically, for any state planning deptt., or even for the Planning Commission, the global data warehouse in the central ministry will act as a local data warehouse, and the data warehouse to be generated and maintained at the Planning Commission will act more as a "Super data warehouse".
Conclusion: Organisations today are suffering from a malaise of data overflow. Developments in transaction processing technology have given rise to a situation where the amount and rate of data capture is very high, but the processing of this data into information that can be utilised for decision making is not developing at the same pace. Data warehousing and data mining (both data & text) provide a technology that enables the decision-maker in the corporate sector/govt. to process this huge amount of data in a reasonable amount of time, and to extract intelligence/knowledge in near real time. A data warehouse takes the organisation's operational data, historical data and external data (Fig. 2); consolidates it into a separately designed database (which can be either relational or multi-dimensional in nature); and manages it in a format that is optimised for end users to access and analyse. When a data warehouse has been constructed, it provides a complete picture of the enterprise, maybe for the first time (similarly, a data mart provides a full representation of the business area it is designed to serve). It provides an unparalleled opportunity for the management to learn about their customers.
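The consolidation step described above, in which operational, historical and external data are loaded into one separately designed database that end users can query, can be sketched with Python's built-in sqlite3 module. The table name, column names and all figures below are made up for illustration; a real warehouse would add cleaning, dimension tables and incremental loading.

```python
import sqlite3

# In-memory database standing in for the separately designed warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (source TEXT, region TEXT, year INTEGER, amount REAL)"
)

# Hypothetical extracts from three kinds of source system.
operational = [("North", 1999, 120.0), ("South", 1999, 80.0)]
historical = [("North", 1998, 100.0), ("South", 1998, 90.0)]
external = [("North", 1999, 15.0)]  # e.g. figures bought from a market research agency

# Load every extract into the same subject-oriented table, tagging its origin.
for source, rows in (("operational", operational),
                     ("historical", historical),
                     ("external", external)):
    conn.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
        [(source, region, year, amount) for region, year, amount in rows],
    )

# An end-user style query across all sources: total amount per region.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
)
print(totals)
```

Keeping the origin of each row in a `source` column is one simple way to let analysts see the combined picture while still being able to trace a figure back to the system it came from.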
The data warehouse technology, together with online transaction processing and data mining, allows the management to provide better customer service, create greater customer loyalty and activity, focus customer acquisition and retention on the most profitable customers, increase revenue, and reduce operating cost. It provides tools that facilitate sounder decision making; improves worker/management knowledge and productivity; spares the operational database from ad hoc queries and the resulting performance degradation; and clears the legacy database system, while moving the corporate system architecture forward. All this has become possible due to developments on two fronts: a) on the hardware front, the emergence of faster processors (which can also work in parallel configurations) with greater computational power than processors of even a year ago, together with reduced data storage costs and larger and faster secondary storage devices that further decrease processing time and provide online data in amounts that were impossible earlier; and b) on the software front, the emergence of new technologies from artificial intelligence and innovative constructs for carrying out intelligent (and optimised) data mining. With the incorporation of new data delivery and presentation techniques, such as Hypertext Markup Language (HTML) and Open Database Connectivity (ODBC), the database mining (data & text) operation has gained widespread recognition as a viable tool for business intelligence gathering. Advances in the document mining technology (database
  • 23. mining of free form text/data, in contrast to the “classical” approach to data mining of fixed length records) are making the data mining technology more powerful. Last but not least, the Internet has emerged as the largest data warehouse of unstructured and free form data. The new technologies are geared towards mining this great data warehouse. ****
