Data Mining Jargon



  1. Data Mining Jargon
     Bob Muenchen, The Statistical Consulting Center
     9/2/2003

     Data mining is the automated search for useful patterns in data. It uses tools from many different disciplines, each of which uses its own technical jargon. This document defines the jargon that is most widely used. A similar document, which translates neural networking jargon into statistical terms, can be found at .

     If you need assistance, call the Helpdesk at 974-9900, send email to, or stop by the SCC walk-in support area at 200 Stokely Management Center. All UT students, faculty, and staff researchers can get up to 10 hours of free assistance for their statistical computing needs each semester. See for details. We also offer training each semester. See for details.

     Analytics – the tools of data mining. The major categories of analytics are cluster analysis, decision trees, neural networks, statistical models and association analysis. Analytics that deal with the future are called Predictive Analytics. Since accurate information about the future is so valuable, some view predictive analytics as the core mission of data mining.
     Artificial Intelligence – the field of science that studies how to make computers “intelligent”. It consists mainly of the fields of Machine Learning (neural networks and decision trees) and expert systems.
     Artificial Neural Network (ANN) – see Neural Networks.
     Association Analysis – a data mining tool that discovers combinations of options and their frequency of co-occurrence. For example, 80% of people who buy paint also buy brushes. It is essentially a method of rule induction in which all variables are viewed as targets. When applied to products purchased, it is called market basket analysis.
     Back Office – the part of the company that customers don’t see, which is run using data stored in an Enterprise Resource Planning system and a Supply Chain Management system.
     Black Box – a term used to describe a model which, although it may work well, is too complex for people to understand. It is usually expressible only as a long series of incomprehensible equations, as in neural network models.
     Business Intelligence (BI) – making better decisions through the use of objective analysis. The four main BI tools are Report & Query, Online Analytical Processing (OLAP), Visualization and Data Mining.
     Champion Model – the model that best solves the data mining problem.
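The "80% of people who buy paint also buy brushes" statement in the Association Analysis entry can be illustrated with a small calculation. The sketch below is only an illustration: the baskets are made-up data, and the names `support` and `confidence` are the terms commonly used for these two frequencies, not terms defined in this glossary.

```python
# Hypothetical market baskets (made-up data for illustration only).
baskets = [
    {"paint", "brushes"},
    {"paint", "brushes", "tape"},
    {"paint", "brushes"},
    {"paint", "brushes", "rollers"},
    {"paint", "tape"},
    {"brushes"},
    {"tape", "rollers"},
    {"rollers"},
]


def support(items, baskets):
    """Share of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Four of the five baskets with paint also contain brushes.
rule_confidence = confidence({"paint"}, {"brushes"}, baskets)
print(f"paint -> brushes: {rule_confidence:.0%}")
```

A real association analysis tool would search over all item combinations rather than checking one hand-picked rule.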
  2. Classify – developing a model to place records into known categories, e.g. defaulted on loan or not.
     Class Discovery – a term used in biological data mining to refer to unsupervised training or cluster analysis.
     Class Prediction – a term used in biological data mining to refer to supervised training with a categorical target variable.
     Cluster Analysis – developing a model that discovers categories of similar records. Usually performed as a prelude to other analyses. Also called unsupervised training.
     Concatenate – combining datasets or marts so that their columns are aligned and new rows are added. The columns are aligned either by using the same column headings or by manual column-by-column matching; for example, ID in one table might be called SSN in another.
     CRM – see Customer Relationship Management.
     Curse of Dimensionality – refers to the fact that the more variables you study, the larger your dataset needs to be to have a chance at modeling the larger space. The relationship between variables and observations is exponential. This means that to model 10 variables at once, 100 observations may be barely sufficient, but to model 10 times as many variables would require 100 times as many observations, and so on. Removing irrelevant variables and removing redundant variables are the two easiest ways to fight the “curse”.
     Customer Relationship Management (CRM) – the process of studying and interacting with customers to maximize profits. Luckily, ensuring customer satisfaction is a key way to maximize profitability, but cutting service to unprofitable customers is also involved. This is such a popular use of analysis in business that companies such as SPSS that once said their business was statistical analysis now say it is CRM. CRM is one of the three main areas to which data mining is applied: supply chain management (SCM), enterprise resource planning (ERP) and customer relationship management (CRM).
     Data Access – consists of reading the data for analysis. This may include inputting the data from a flat file, translating a copy of some data from a database or warehouse so that the data mining software can analyze it, or defining a way to read the data directly using a standard such as Open Database Connectivity (ODBC).
     Data Conversion – performing a one-time translation of data from its original format (perhaps stored in a database) into the format used by a data mining package. Example conversion tools are DBMScopy, StatTransfer and Data Junction. See also Data Access and ETL.
     Data Cube – a data structure of aggregated values summarized for a combination of pre-selected categorical variables, e.g. number of items sold and their total cost for each time period, region and product. This structure is required for the high-speed analysis of summaries that is done in Online Analytical Processing (OLAP). Also called a Multidimensional Database or MDDB.
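To make the Data Cube entry concrete, here is a minimal sketch of how such a structure might be built. It assumes a cube can be modeled as a dictionary keyed by (time period, region, product); the records and the `build_cube` helper are hypothetical, and real OLAP products use far more elaborate storage.

```python
from collections import defaultdict

# Hypothetical sales records: (period, region, product, items_sold, total_cost).
sales = [
    ("2003-Q1", "East", "paint", 10, 150.0),
    ("2003-Q1", "East", "paint", 5, 75.0),
    ("2003-Q1", "West", "paint", 8, 120.0),
    ("2003-Q2", "East", "brushes", 20, 60.0),
    ("2003-Q2", "West", "brushes", 4, 12.0),
]


def build_cube(records):
    """Aggregate items sold and total cost for each (period, region, product) cell."""
    cube = defaultdict(lambda: [0, 0.0])
    for period, region, product, items, cost in records:
        cell = cube[(period, region, product)]
        cell[0] += items   # running count of items sold
        cell[1] += cost    # running total cost
    return dict(cube)


cube = build_cube(sales)
print(cube[("2003-Q1", "East", "paint")])  # [15, 225.0]
```

Because every cell is computed in advance, looking up a summary is a single dictionary access, which is the idea behind the speed of OLAP.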
  3. Data Management – all of the tasks required to manage data, such as correcting data entry errors, estimating values of missing data, and subsetting or combining sets of data.
     Data Mart – a small data warehouse that is focused on a single area such as a research project or a single department such as sales or accounting. Ideally, all marts in an organization should be compatible, but they often differ in structure and file format.
     Data Model – has several very different meanings. To a data miner, it can mean two different things. It can refer to the structure a database administrator chooses for storing the data in a database or data warehouse, that is, how the tables relate to each other. It can also refer to the way a given database program requires data to be stored, for example in relational or hierarchical form. If a data analyst uses this term, it refers to the rules or formulas that describe relationships among the variables (see Modeling).
     Data Quality (DQ) – addresses the issues of getting the right measures, ensuring that the measures are timely and accurate, that editing is done with controls to prevent errors, that manipulations such as formulas are accurate and documented, and that the data is accurately described. Also known as Information Quality or IQ.
     Data Table – a collection of data measurements organized into a rectangle of columns and rows. Each column contains a single measure, such as blood pressure, for all sampling units. Columns are also called fields, variables, vectors or attributes. Each row contains the measures for a given sampling unit, such as all medical information for a person. Rows are also called observations, cases, records or instances.
     Data Visualization – see Visualization.
     Data Warehouse – a static copy of a database that has been optimized for analysis or “denormalized.” For example, storing the address of a customer with each purchase he makes may waste computer space, but it makes it very easy to find the mean sales for any given geographic region without knowing the location of the address table. Doing analysis on a data warehouse also prevents analyses from interfering with ongoing data collection.
     Database – a collection of data organized for efficient use in a continuously updated situation, such as frequent sales or reservations. Far and away the most common type of database is a Relational database, in which a collection of data tables is stored and related or linked by a common key. A key is a column or collection of columns that uniquely identifies a row, such as Social Security Number. A database is optimized for online transaction processing (e.g. selling products, entering patient information), not for analysis. The optimized state of the database is called “normalized” or third-normal form. Briefly, it removes redundant information, such as the full customer address on every sale, by storing it in a separate table.
     Database Administrator (DBA) – the person responsible for organizing the data for an organization. Tasks include changing the structure of databases to optimize speed, and of data warehouses to optimize data mining efficiency. In any large organization, this is the person you will need to work with to gain access to the data. He or she will be using many of the terms in this handout when you meet!
     Decision List – see Decision Trees.
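The contrast between a normalized database and a denormalized warehouse, described in the Database and Data Warehouse entries, can be sketched with Python's built-in `sqlite3` module. The table names, column names and values below are hypothetical, chosen only to mirror the customer-address example; a region column stands in for the full address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (database) form: each customer's region is stored once.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 30.0);

    -- Denormalized (warehouse) form: the region is repeated on every sale.
    CREATE TABLE warehouse_sales AS
        SELECT s.sale_id, s.amount, c.region
        FROM sales AS s
        JOIN customers AS c ON s.customer_id = c.customer_id;
""")

# Mean sales by region now needs no join and no knowledge of the customers table.
mean_by_region = dict(conn.execute(
    "SELECT region, AVG(amount) FROM warehouse_sales GROUP BY region ORDER BY region"
))
print(mean_by_region)
conn.close()
```

The warehouse table wastes space by repeating the region, but the analysis query touches only one table.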
  4. Decision Support Software (DSS) – any software that uses analysis to improve decision-making. Also called Decision Support System.
     Decision Trees – a method of finding rules (rule induction) that divide the data into subgroups that are as similar as possible with regard to a target variable. See the example below for a tree that predicts survival rates for heart attack victims in an emergency room setting (made up for simplicity’s sake). The whole training dataset of 100 patients is called the “root node.” It is divided logically into subgroups called “branches” that are further subdivided into other branches or, finally, leaves. The process of continuing to subdivide the groups is called “recursive partitioning.” Decision trees are the most popular method of displaying rules. If this sequence of rules is written out in English (or a computer language), it is called a “decision list.” If the complete set of steps required to reach each decision is written out so that the rules no longer need to be read in sequence, it is called a “rule set.” If the decision tree predicts a categorical outcome such as purchase or not, it is called a “classification tree”. If it predicts a continuous variable such as dollar amount purchased, it is called a “regression tree.” The most popular decision tree models are called Chi-squared Automatic Interaction Detection (CHAID), Classification and Regression Trees (CART) and C4.5/C5.

     100 Patients: 40% Died, 60% Lived
       Blood Pressure < 160: 30% Died, 70% Lived
         Cholesterol < 400: 10% Died, 90% Lived
         Cholesterol > 400: 20% Died, 80% Lived
       Blood Pressure > 160: 70% Died, 30% Lived
         Cholesterol < 400: 80% Died, 20% Lived
         Cholesterol > 400: 90% Died, 10% Lived

     Dependent Variable – see Target Variable.
     Drill-Down – a request for more detailed information, usually made by double-clicking on a number or a part of a graph. For example, a table may show average salaries of professors broken down by department and gender. Drilling down on a cell of that table might display that relationship for each campus. The opposite of roll-up.
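The difference between a “decision list” and a “rule set” in the Decision Trees entry may be easier to see in code. The sketch below writes the made-up heart attack tree both ways. The function and variable names are hypothetical, and because the tree does not say where values exactly equal to a cutoff belong, sending them to the “>” branch is an assumption.

```python
def death_rate_decision_list(blood_pressure, cholesterol):
    """Decision list: rules are read in sequence, so later rules rely on earlier ones."""
    if blood_pressure < 160:
        if cholesterol < 400:
            return 0.10
        return 0.20
    # Assumption: readings of exactly 160 (or cholesterol of exactly 400)
    # fall into the ">" branches, which the original tree leaves undefined.
    if cholesterol < 400:
        return 0.80
    return 0.90


# Rule set: each rule states every condition it needs, so order no longer matters.
RULE_SET = [
    (lambda bp, chol: bp < 160 and chol < 400, 0.10),
    (lambda bp, chol: bp < 160 and chol >= 400, 0.20),
    (lambda bp, chol: bp >= 160 and chol < 400, 0.80),
    (lambda bp, chol: bp >= 160 and chol >= 400, 0.90),
]


def death_rate_rule_set(blood_pressure, cholesterol):
    """Return the death rate of the single rule whose full condition matches."""
    for condition, rate in RULE_SET:
        if condition(blood_pressure, cholesterol):
            return rate
    raise ValueError("no rule matched")


print(death_rate_decision_list(150, 300), death_rate_rule_set(170, 450))
```

Both forms return the same leaf for any patient; only the way the rules are written differs.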
  5. DSS – see Decision Support Software.
     Ensemble Model – a model that combines the results of several types of models. For example, a prediction could use the average of the estimates from a decision tree, a neural network and a statistical model.
     Enterprise Resource Planning (ERP) – software that stores the core operational data of a business, such as sales, receivables and payables, in a database. One of the three main areas to which data mining is applied: supply chain management (SCM), enterprise resource planning (ERP) and customer relationship management (CRM).
     ERP – see Enterprise Resource Planning.
     Estimate – develop a model to find an approximate value for a continuous variable, e.g. sales, blood pressure.
     ETL or ETML – stands for Extract, Transform, Move and Load, the steps required to gain access to data for analysis. Since the Move step is the easiest, the “M” is often left out. ETL is an important subset of Data Management.
     Executive Information System (EIS) – a system any decision maker can use with little training to do ad-hoc analyses, often using OLAP.
     Expert Systems – systems that can solve a problem by incorporating rules manually obtained from human experts. You describe the problem and let the system choose how best to solve it. Examples in the area of analysis include SigmaStat, DecisionTime and the SPSS Statistics Coach. Decisions are rather flaky at the moment, but improving.
     Flat File – data stored in a standard format used to move data from one program to another. Windows and Macintosh call this a “Text Only” or “Text With Line Breaks” file. It may also be called an ASCII, EBCDIC (on large IBM computers) or UNICODE file.
     Front Office – the part of a company that customers interact with. Customer data is critical to business profitability, so it is frequently mined.
     Heuristic – see Modeling.
     IQ – see Data Quality.
     Imputation – the process of estimating the values of missing data prior to analysis.
     Independent Variable – see Input Variables.
     Information Quality – see Data Quality.
     Input Variables – the variables thought to be related to, predict or cause the target variable. In data mining, almost any variable that is not the target variable is a candidate for an input variable.
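The Imputation entry on this slide does not name a method, so the following is only one simple possibility: replacing each missing value with the mean of the observed values. The `impute_mean` helper and the blood pressure readings are hypothetical.

```python
from statistics import mean


def impute_mean(values, missing=None):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v != missing]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v == missing else v for v in values]


# Hypothetical blood pressure readings; None marks a missing measurement.
blood_pressure = [120, None, 140, 160, None]
imputed = impute_mean(blood_pressure)
print(imputed)  # [120, 140, 140, 160, 140]
```

If the data uses a missing value code such as 999 instead of a blank, it can be passed as `missing=999`.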
  6. Join – a database procedure that pools the information stored in different tables so that it can be better analyzed.
     Key Performance Indicator (KPI) – a very important variable. In business, it is a measure that is critically important to the overall functioning of the organization.
     KPI – see Key Performance Indicator.
     Lift – a measure used to compare different data mining models. Essentially it is a measure of how much better off you are with the model than without. For example, if 2% of the customers you mail a catalog to would make a purchase, but 10% would make a purchase when the model is used to select the catalog recipients, then lift is 10/2 or 5.
     Machine Learning – models that enable the computer to improve its performance through experience, especially rule induction. The definition of learning is so loose that, although rarely mentioned in this context, statistical estimation could also be considered “learning”. Modeling is roughly synonymous with machine learning.
     Market Basket Analysis – see Association Analysis.
     Mart – see Data Mart.
     MDDB – see Data Cube.
     Measurement Scale – the level of detail in a variable. The measurement scale helps determine the role of the variable in an analysis. Types include:
       • Single-valued variables, or constants, which result from selected subsets.
       • Binary variables, which have only two values such as male/female or purchased/didn't purchase.
       • Nominal variables, which contain category memberships such as political party. Also called categorical, class, group, symbolic or qualitative variables.
       • Ordinal variables, which contain values that have order such as small, medium, large.
       • Interval or continuous variables, which have meaningful intervals, such as a weight interval of 110 pounds to 120 being the same as 120 to 130. Interval-level variables are also called numeric or scale variables.
     Merge – combining datasets or marts so that their rows are aligned and new columns are added. Row alignment is often done using a key such as an ID number.
     Metadata – data about the data. Examples are:
       • Column names such as gender, height.
       • Column labels containing descriptors to embellish output. Entire questions from questionnaires are common labels.
       • Formats that describe what values mean, such as 0=Female, 1=Male. Also called “value labels” or codes.
       • Missing value codes, if other than blank, e.g. 999.
       • Scale of each column: nominal, ordinal, interval.
       • Formulas or recoding steps that were followed.
       • Documentation such as who, what, where, when and why the data were collected.
     MOLAP – see Online Analytical Processing.
     Modeling – generally refers to the process of developing rules which can classify or predict with an estimated level of precision. The rules may be in the form of a series of logical statements or mathematical formula(s).
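The arithmetic in the Lift entry on this slide can be written as a one-line function. This is a minimal sketch using the catalog numbers from that entry; the `lift` helper is hypothetical, and the rates are passed as percentages to keep the arithmetic exact.

```python
def lift(response_rate_with_model, baseline_response_rate):
    """Ratio of the response rate with the model to the rate without it."""
    if baseline_response_rate <= 0:
        raise ValueError("baseline response rate must be positive")
    return response_rate_with_model / baseline_response_rate


# 10% of model-selected recipients buy, versus 2% of all customers mailed.
catalog_lift = lift(10, 2)
print(catalog_lift)  # 5.0
```

A lift of 1 would mean the model selects recipients no better than mailing everyone.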
  7. Modeling (continued) – Statistical models are equations that have been mathematically derived to provide the best or optimal description of relationships that involve straight lines, smooth curves, group membership or clusters of similar cases. The solution to these equations usually requires simplifying assumptions about the nature of the data that will not fit every dataset. Heuristic models use methods that have been empirically shown to work well, but which have not been shown to be the best or optimal solution. Heuristic models usually make comparatively few assumptions about the nature of the data. Decision trees are an example of an analysis based on heuristics, while discriminant analysis is based on an optimal method (which assumes the data follows a multivariate normal distribution).
     Multidimensional Database – see Data Cube.
     Neural Networks – models that mimic the brain through systems of equations. They “learn” by being “trained” with a dataset. Unfortunately, what they learn is conveyed by a series of complex mathematical formulas. These formulas may work well but not explain much about the process they model. See Black Box.
     ODBC – see Open Database Connectivity.
     OLAP – see Online Analytical Processing.
     OLE DB – an open standard for gaining access to the data stored in a multidimensional database. Most OLAP products use OLE DB to access the data.
     OLTP – see Online Transaction Processing.
     Online Analytical Processing (OLAP) – software that quickly displays interactive tables or graphs of pre-selected variables such as sales aggregated by time, region, state, store and product line. A two-dimensional “slice” of the data might show mean sales broken down by region and product line. Clicking on region might drill down to further divide those numbers by state. OLAP is often not considered data mining since it involves only simple tables and graphs that display what is happening rather than the analytics that can help determine why it is happening or what may happen in the future. However, it is very widely discussed in the data mining literature. OLAP usually displays only sums, counts and means. This is because means at any level of breakdown can be calculated from sums of sums and sums of counts. However, the median of an aggregate is not the aggregate of the medians, which is why statistics such as medians and percentiles cannot be used with the data structure OLAP requires (a multidimensional database). This is a major limitation of the method. OLAP is occasionally referred to as MOLAP because it runs very quickly using a Multidimensional database. When OLAP is used with a standard relational database, it is called ROLAP. ROLAP is usually hundreds of times slower than MOLAP.
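The OLAP entry's claim that means can be rolled up from sums and counts, while medians cannot be rolled up from medians, can be checked with a few lines of Python. The store names and sales figures below are made up for illustration.

```python
from statistics import mean, median

# Hypothetical sales amounts for two stores in one region.
stores = {
    "store_a": [1, 2, 3],
    "store_b": [10, 20, 30, 40, 1000],
}
all_sales = [amount for amounts in stores.values() for amount in amounts]

# What a cube would store per cell: the sum and the count only.
cells = {name: (sum(amounts), len(amounts)) for name, amounts in stores.items()}

# The regional mean is recovered exactly from the sum of sums and sum of counts.
rolled_up_mean = (
    sum(cell_sum for cell_sum, _ in cells.values())
    / sum(cell_count for _, cell_count in cells.values())
)
true_mean = mean(all_sales)

# The median of the store medians is generally not the regional median.
median_of_medians = median(median(amounts) for amounts in stores.values())
true_median = median(all_sales)

print(rolled_up_mean, true_mean)       # the two means agree
print(median_of_medians, true_median)  # the two medians differ
```

Computing the true median requires the individual sales values, which the aggregated cells no longer contain.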
  8. Online Transaction Processing (OLTP) – involves the efficient execution of frequent database transactions used to collect data or run a business. These transactions are recorded in a database rather than a data warehouse and are not suitable for analysis until they have been transferred to a warehouse and restructured for efficient analysis.
     Open Database Connectivity (ODBC) – a widely used standard to extract data from a data warehouse to use for analysis.
     Over-fitted Model – a model which has become so complex that it applies only to the dataset upon which it was developed. Another term for this is “over-parameterized.”
     Predictive Analytics – see Analytics.
     Profit – the amount of profit made in a specific modeling situation, calculated by estimating the cost of each type of error: the model concluding something is true when it is not, and vice versa. It can be used in business for obvious reasons but also in other areas. For example, in medicine you could assign a cost to concluding a patient has a treatable disease when they do not (antibiotic treatment = $35) versus concluding they do not have it when they actually do (treated after complications set in = $2,300 hospital stay). The point of maximum profit would show you the best way to use the model.
     Qualitative Data – depending on the context, this term may refer to text data such as email messages or to categorical variables such as gender.
     Query – the process of asking a database questions. Often done in an ad-hoc, interactive way.
     Regression Analysis – a family of statistical models that includes fitting straight lines, called linear regression; smoothly curving lines, called polynomial regression; more sharply curving lines, called nonlinear regression; and models that predict group membership, called logistic regression. The main type of data mining that these “generalized linear models” do not do is clustering.
     Relational Database – see Database.
     Report – a basic listing of database information, which may consist of individual values, sums, counts or means. Often done in a pre-planned, static way.
     Return on Investment (ROI) – the money you saved by doing data mining. It is above and beyond the usually high expenses of purchasing a data mining package, learning to use it and then using it to solve a problem.
     ROI – see Return on Investment.
     ROLAP – see Online Analytical Processing.
     Roll-Up – the process of aggregating numeric data. For example, a series of tables may show professor salaries broken down by department and gender at each campus. The campuses could be “rolled up” to create a single table of salaries broken down by department and gender for all campuses combined. The opposite of drill-down.
     Rule Induction – see Decision Trees.
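The medical example in the Profit entry can be turned into a small cost comparison. The two dollar costs come from that entry; the error counts for the two cutoffs are hypothetical numbers chosen only to show how the comparison works, and `total_error_cost` is an illustrative helper.

```python
# Costs of each type of error, taken from the medical example in the Profit entry.
COST_FALSE_POSITIVE = 35      # antibiotics for a patient who does not have the disease
COST_FALSE_NEGATIVE = 2300    # hospital stay for a patient who was missed


def total_error_cost(false_positives, false_negatives):
    """Total dollar cost of a model's errors at one decision cutoff."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# Hypothetical error counts for the same model used with two different cutoffs.
cautious_cutoff = total_error_cost(false_positives=40, false_negatives=1)
strict_cutoff = total_error_cost(false_positives=5, false_negatives=10)

print(cautious_cutoff)  # 3700
print(strict_cutoff)    # 23175
```

With these made-up counts, the cutoff that treats more patients unnecessarily is far cheaper overall, because a missed case costs so much more than an unneeded treatment.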
  9. Rule Set – see Decision Trees.
     SCM – see Supply Chain Management.
     Scoring – applying a model to new data, usually to predict values of continuous variables such as amount of purchase, or group memberships such as survive/die.
     Sequel – see Structured Query Language.
     SRM – stands for Supplier Relationship Management. See Supply Chain Management.
     Statistical Models – see Modeling.
     Structured Query Language (SQL) – the basic language used in almost all databases. It allows you to search a database for basic information, such as listing certain records or totaling sums or counts. It also lets you select subsets or samples, or perform joins. The basic form is SELECT vars FROM tablename WHERE logical condition is true. Often pronounced “sequel”.
     Supervised Learning – the process of developing a model that has a target variable, such as sales or survival, to “supervise” it. This is opposed to unsupervised learning, which is developing a model without a target, or finding clusters or groups of similar records in your data. When the target variable is categorical, biologists would call this Class Prediction and statisticians would call it Discriminant Analysis or Logistic Regression (two different methods of achieving a similar result).
     Supplier Relationship Management (SRM) – see Supply Chain Management.
     Supply Chain Management (SCM) – the process of studying and interacting with suppliers to maximize profits. Also called supplier relationship management (SRM). One of the three main areas to which data mining is applied: supply chain management (SCM), enterprise resource planning (ERP) and customer relationship management (CRM).
     Target Variable – the main variable of interest in a data mining project. A business example is the amount of each sale; a medical example is cure/no cure. Also called the predicted, supervisor or dependent variable.
     Test Data – data used only once, at the end of the data mining project, to see if the best or champion model generalizes to completely new data.
     Text Data – written descriptions, e.g. open-ended survey questions, interviews, customer complaints; also called qualitative data.
     Text File – see Flat File.
     Text Mining – the process of automatically finding the key concepts contained in text data. It may also find clusters of similar documents. The numeric output containing presence/absence of each concept and cluster membership is often passed on to a data mining step where it is combined with other numeric data for further analysis.
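The basic SELECT ... FROM ... WHERE form given in the Structured Query Language entry can be tried out with Python's built-in `sqlite3` module. The `sales` table, its columns and its rows are hypothetical; the sketch lists certain records and then totals a sum and a count, the two uses that entry mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 100.0), ("East", 50.0), ("West", 30.0)],
)

# Listing certain records: SELECT vars FROM tablename WHERE condition.
east_rows = conn.execute(
    "SELECT id, amount FROM sales WHERE region = ? ORDER BY id", ("East",)
).fetchall()

# Totaling a sum and a count for the same subset.
east_total, east_count = conn.execute(
    "SELECT SUM(amount), COUNT(*) FROM sales WHERE region = ?", ("East",)
).fetchone()

print(east_rows)               # [(1, 100.0), (2, 50.0)]
print(east_total, east_count)  # 150.0 2
conn.close()
```

The `?` placeholders are how `sqlite3` passes values into a query safely; other database interfaces may use a different placeholder style.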
  10. Training Data – the data used to develop models that will, if done properly, work well on new sets of data.
     Unsupervised Training/Learning – the process of developing a model that does not have a target variable. This boils down to finding clusters of similar cases within the data. It would usually be followed by another analysis that does involve a target variable. Biologists would call this class discovery. Statisticians would call it cluster analysis.
     Validation Data – during data mining, each step of each model developed using the training data is tested using this data to discover the point at which the model becomes overly specific to that single set of data, that is, an over-fitted model.
     Variable – see Data Table.
     Visualization – the use of dynamic, interactive graphical displays to search for useful patterns in data.
     Warehouse – see Data Warehouse.
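The three roles described in the Training Data, Validation Data and Test Data entries imply splitting one dataset into three non-overlapping parts. The sketch below shows one way this might be done. The 60/20/20 proportions, the fixed random seed and the `split_records` helper are assumptions for illustration; the glossary does not prescribe any of them.

```python
import random


def split_records(records, train_share=0.6, validation_share=0.2, seed=0):
    """Shuffle the records and split them into training, validation and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_share)
    n_validation = round(len(shuffled) * validation_share)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains is held back
    return training, validation, test


# 100 hypothetical record IDs.
training, validation, test = split_records(range(100))
print(len(training), len(validation), len(test))  # 60 20 20
```

The test part is set aside until the champion model has been chosen, so that it remains completely new data.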