Data Mining Jargon
The Statistical Consulting Center
Data mining is the automated search for useful patterns in data. It uses tools from many
different disciplines, each of which uses its own technical jargon. This document defines the
jargon that is most widely used.
A similar document, which translates neural networking jargon into statistical terms, can be
found at ftp.sas.com/neural/jargon .
If you need assistance, call the Helpdesk at 974-9900, send email to firstname.lastname@example.org, or stop
by the SCC walk-in support area at 200 Stokely Management Center. All UT students, faculty,
and staff researchers can get up to 10 hours of free assistance for their statistical computing
needs each semester. See oit.utk.edu/scc for details. We also offer training each semester. See
web.utk.edu/~training for details.
Analytics – the tools of data mining. The major categories of analytics are cluster analysis,
decision trees, neural networks, statistical models and association analysis. Analytics that
deal with the future are called Predictive Analytics. Since accurate information about the
future is so valuable, some view predictive analytics as the core mission of data mining.
Artificial Intelligence – the field of science that studies how to make computers “intelligent”. It
consists mainly of the fields of Machine Learning (neural networks and decision trees) and
Expert Systems.
Artificial Neural Network (ANN) – see Neural Network.
Association Analysis – a data mining tool that discovers combinations of options and their
frequency of co-occurrence. For example, 80% of people who buy paint also buy brushes.
Essentially a method of rule induction in which all variables are viewed as targets. When
applied to products purchased, it is called market basket analysis.
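A minimal sketch of the idea in Python (the baskets and the rule below are invented for illustration): count how often items co-occur and report the confidence of a rule such as “paint, therefore brushes.”

    # Hypothetical market baskets; each set is one customer's purchases.
    baskets = [
        {"paint", "brushes", "tape"},
        {"paint", "brushes"},
        {"paint", "rollers"},
        {"brushes"},
        {"paint", "brushes", "rollers"},
    ]

    def confidence(antecedent, consequent):
        # Of the baskets containing the antecedent, what fraction
        # also contain the consequent?
        has_antecedent = [b for b in baskets if antecedent <= b]
        has_both = [b for b in has_antecedent if consequent <= b]
        return len(has_both) / len(has_antecedent)

    print(confidence({"paint"}, {"brushes"}))  # 3 of 4 paint buyers: 0.75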
Back Office – the part of the company that customers don’t see, which is run using data stored in
an Enterprise Resource Planning system and a Supply Chain Management system.
Black Box – a term used to describe a model which, although it may work well, is too complex
for people to understand. Usually expressible as a long series of incomprehensible equations
as in neural networking models.
Business Intelligence – making better decisions through the use of objective analysis. The four
main BI tools are Report & Query, Online Analytical Processing (OLAP), Visualization and
Data Mining.
Champion Model – the model that best solves the data mining problem.
Classify – developing a model to place records into known categories, e.g. defaulted on
loan or not.
Class Discovery – a term used in biological data mining to refer to unsupervised training
or cluster analysis.
Class Prediction – a term used in biological data mining to refer to supervised training
with a categorical target variable.
Cluster Analysis – developing a model that discovers categories of similar records.
Usually performed as a prelude to other analyses. Also called unsupervised training.
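A minimal sketch of unsupervised training using scikit-learn's KMeans (the records and the choice of two clusters are invented for illustration):

    from sklearn.cluster import KMeans

    # Made-up records: [age, income in $1000s]; note there is no target variable.
    records = [[25, 30], [27, 32], [24, 28], [55, 90], [60, 95], [58, 88]]

    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
    print(model.labels_)  # the cluster discovered for each record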
Concatenate – combining datasets or marts so that their columns are aligned and new rows are
added. Columns are aligned by matching column headings, or by matching them manually when
the names differ, e.g. a column called ID in one table might be called SSN in another.
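A minimal sketch with pandas (the tables are invented for illustration); pd.concat stacks rows and aligns columns by their headings:

    import pandas as pd

    january = pd.DataFrame({"id": [1, 2], "sales": [100, 150]})
    february = pd.DataFrame({"id": [3, 4], "sales": [200, 120]})

    # Columns are matched by name; the rows of the second table are appended.
    combined = pd.concat([january, february], ignore_index=True)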
CRM – see Customer Relationship Management.
Curse of Dimensionality – refers to the fact that the more variables you study, the larger
your dataset needs to be to have a chance at modeling the larger space. The number of
observations required grows much faster than the number of variables. This means that if
100 observations are barely sufficient to model 10 variables at once, modeling 10 times as
many variables may require 100 times as many observations, and so on. Removing irrelevant
variables and removing redundant variables are the two easiest ways to fight the curse of
dimensionality.
Customer Relationship Management (CRM) – is the process of studying and interacting with
customers to maximize profits. Luckily, ensuring customer satisfaction is a key way to
maximize profitability, but cutting service to unprofitable customers is also involved. This is
such a popular use of analysis in business that companies such as SPSS, which once said their
business was statistical analysis, now say it is CRM. CRM is one of the three main areas to
which data mining is applied: supply chain management (SCM), enterprise resource
planning (ERP) and customer relationship management (CRM).
Data Access – consists of reading the data for analysis. This may include inputting the data from
a flat file, translating a copy of some data from a database or warehouse so that the data
mining software can analyze it, or defining a method to read the data directly using a method
such as the Open Database Connectivity standard.
Data Conversion – performing a one-time translation of data from its original format (perhaps
stored in a database) into the format used by a data mining package. Example conversion
tools are DBMScopy, StatTransfer, Data Junction. See also Data Access and Extract,
Transform and Load.
Data Cube – A data structure of aggregated values summarized for a combination of pre-selected
categorical variables e.g. number of items sold and their total cost for each time period,
region and product. This structure is required for high-speed analysis of the summaries that is
done in Online Analytical Processing (OLAP). Also called a Multidimensional Database or
MDDB.
Data Management – all of the tasks required to manage data such as correcting data entry errors,
estimating values of missing data, subsetting or combining sets of data.
Data Mart – A small data warehouse that is focused on a single area such as a research project
or a single department such as sales or accounting. Ideally, all marts in an organization should
be compatible, but they often differ in structure and file format.
Data Model – Has several very different meanings. To a data miner, it can mean two different
things. It can refer to the structure of how a database administrator chooses to store the data
in a database or data warehouse, how the tables relate to each other. It can also refer to the
way a given database program requires storing data, for example in relational or hierarchical
form. If a data analyst uses this term, it refers to the rules or formulas that describe
relationships among the variables (see Modeling).
Data Quality (DQ) – addresses the issues of getting the right measures, ensuring the measures
are timely and accurate, that editing is done with controls to prevent errors, that
manipulations such as formulas are accurate and documented, that the data is accurately
described. Also known as Information Quality or IQ.
Data Table – A collection of data measurements organized into rectangular columns called fields
and rows. Columns contain a single measure, such as blood pressure, for all sampling units.
Also called variables, vectors or attributes. Rows contain the measures for a given sampling
unit, such as all medical information for a person. Also called observations, cases, or records.
Data Visualization – see Visualization.
Data Warehouse – A static copy of a database that has been optimized for analysis or
“denormalized.” For example, the address of a customer stored for each purchase he makes
may waste computer space, but it makes it very easy to find the mean sales for any given
geographic region, without knowing the location of the address table. Doing analysis on a
data warehouse also prevents analyses from interfering with ongoing data collection.
Database – A collection of data organized for efficient use in a continuously updated situation,
such as frequent sales or reservations. Far and away the most common type of database is a
relational database, in which a collection of data tables is stored and related or linked by a
common key. A key is a column or collection of columns that uniquely identifies a row, such as
Social Security Number. A database is optimized for online transaction processing (e.g.
selling products, entering patient information), not for analysis. The optimized state of the
database is called “normalized” or third-normal form. Briefly, it removes redundant
information such as the full customer address of every sale, storing it in a separate table.
Database Administrator (DBA) – the person responsible for organizing the data for an
organization. Tasks include changing structure of databases to optimize speed, and of data
warehouses to optimize data mining efficiency. In any large organization, this is the person
you will need to work with to gain access to the data. He or she will be using many of the
terms in this handout when you meet!
Decision List – see Decision Trees.
Decision Support Software (DSS) – any software that uses analysis to improve decision-
making. Also called Decision Support System.
Decision Trees – a method of finding rules (rule induction) that divide the data into
subgroups that are as similar as possible with regard to a target variable. See the
example below for a tree that predicts survival rates for heart attack victims in an
emergency room setting (made up for simplicity’s sake).
The whole training dataset of 100 patients is called the “root node.” It is divided
logically into subgroups called “branches” that are further subdivided into other
branches or finally, leaves. The process of continuing to subdivide the groups is
called “recursive partitioning.”
Decision trees are the most popular method of displaying rules. If this sequence of
rules is written out in English (or a computer language) it is called a “decision list.” If
the complete set of steps required to reach each decision is written out so that it no
longer needs to be read in sequence, it is called a “rule set.”
If the decision tree predicts a categorical outcome such as purchase or not, it is called
a “classification tree”. If it predicts a continuous variable such as dollar amount
purchased, it is called a “regression tree.”
The most popular decision tree models are called Chi-squared Automatic Interaction
Detection (CHAID), Classification and Regression Trees (CART), and C4.5/C5.
    Blood Pressure < 160: 30% Died, 70% Lived
        Cholesterol < 400: 10% Died, 90% Lived
        Cholesterol > 400: 20% Died, 80% Lived
    Blood Pressure > 160: 70% Died, 30% Lived
        Cholesterol < 400: 80% Died, 20% Lived
        Cholesterol > 400: 90% Died, 10% Lived
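A minimal sketch of recursive partitioning with scikit-learn (the tiny dataset is invented and will not reproduce the example tree above exactly):

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Made-up patients: [blood pressure, cholesterol]; target: 1 = died, 0 = lived.
    X = [[150, 300], [140, 420], [170, 350], [180, 450], [155, 390], [175, 410]]
    y = [0, 0, 1, 1, 0, 1]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["blood_pressure", "cholesterol"]))

    # Scoring: apply the fitted tree to a new patient.
    print(tree.predict([[160, 430]]))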
Dependent Variable – see Target Variable
Drill-Down – a request for more detailed information, usually by double-clicking on a number or
a part of a graph. For example, a table may show average salaries of professors broken down
by department and gender. Drilling down on a cell of that table might display that
relationship for each campus. The opposite of roll-up.
DSS – see Decision Support Software.
Ensemble Model – a model that combines the results of several types of models. For example, a
prediction could use the average of the estimates from a decision tree, a neural network and a
statistical model.
Enterprise Resource Planning (ERP) – software that stores the core operational data of a
business, such as sales, receivables and payables, in a database. One of the
three main areas to which data mining is applied: supply chain management (SCM),
enterprise resource planning (ERP) and customer relationship management (CRM).
ERP – see Enterprise Resource Planning.
Estimate – develop a model to find an approximate value for a continuous variable, e.g.
sales, blood pressure.
ETL or ETML – stands for Extract, Transform, Move and Load, the steps required to gain access
to data for analysis. Since the Move step is the easiest, the “M” is often left out. ETL is an
important subset of Data Management.
Executive Information System (EIS) – a system any decision maker can use with little training to
do ad-hoc analyses, often using OLAP.
Expert Systems – a system that can solve a problem by incorporating the rules manually
obtained from human experts. You describe the problem and let it choose how best to
solve it. Examples in the area of analysis include SigmaStat, DecisionTime and the
SPSS Statistics Coach. Decisions are rather flaky at the moment, but improving.
Flat File - data stored in a standard format used to move data from one program to
another. Windows and Macintosh call this a “Text Only” or “Text With Line Breaks”
file. It may also be called an ASCII, EBCDIC (on large IBM computers) or text file.
Front Office – the part of a company that customers interact with. Customer data is critical to
business profitability, so it is frequently mined.
Heuristic – see Modeling.
IQ – see Data Quality.
Imputation – the process of estimating the values of missing data prior to analysis.
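A minimal sketch with pandas (the columns are invented for illustration): replace each missing value with the mean of its column before modeling.

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40], "income": [50, 60, None]})

    # Fill every missing value with its column's mean.
    imputed = df.fillna(df.mean())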
Independent Variable – see Input Variables.
Information Quality – see Data Quality.
Input Variables – the variables thought to be related to, predict or cause the target variable. In
data mining, almost any variable that is not the target variable is a candidate for an input
variable. Also called independent variables.
Join – a database procedure that pools the information stored in different tables so that they can
be better analyzed.
Key Performance Indicator (KPI) – a very important variable. In business, it is a measure that
is critically important to the overall functioning of the organization.
KPI – See Key Performance Indicator.
Lift – a measure used to compare different data mining models. Essentially it is a measure of how
much better you are with the model than without. For example, if 2% of the customers you
mail a catalog to would make a purchase but using the model to select catalog recipients 10%
would make a purchase, then lift is 10/2 or 5.
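The catalog arithmetic, sketched in Python:

    baseline_rate = 0.02   # 2% respond if the catalog is mailed to everyone
    model_rate = 0.10      # 10% respond among customers the model selects

    lift = model_rate / baseline_rate
    print(lift)  # 5.0 -- the model finds buyers five times as efficiently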
Machine Learning – models that enable the computer to improve its performance through
experience, especially rule induction. The definition of learning is so loose that, although
rarely mentioned in this context, statistical estimation could also be considered “learning”.
Modeling is roughly synonymous with machine learning.
Market Basket Analysis – see Association Analysis.
Mart – see Data Mart.
MDDB – See Data Cube.
Measurement Scale – the level of detail in a variable. The measurement scale helps determine the
role of the variable in an analysis. Types include:
Single-valued variables, or constants, which often result from selecting subsets.
Binary variables have only two values such as male/female, purchased/didn't purchase.
Nominal variables contain category memberships such as political party. Also called categorical,
class, group, symbolic or qualitative variables.
Ordinal variables contain values that have order such as small, medium, large.
Interval or continuous variables have meaningful intervals, such as the interval from 110 to 120
pounds being the same size as the interval from 120 to 130. Interval-level variables are also
called numeric or scale variables.
Merge – combining datasets or marts so that their rows are aligned and new columns are added.
Row alignment is often done using a key such as an ID number.
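A minimal sketch with pandas (the tables and key are invented for illustration); pd.merge aligns rows on a key and adds the columns side by side:

    import pandas as pd

    demographics = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})
    purchases = pd.DataFrame({"id": [1, 2, 3], "total": [120, 80, 200]})

    # Rows are aligned by the shared key "id", producing one wider table.
    merged = pd.merge(demographics, purchases, on="id")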
Metadata – Data about the data. Examples include column names such as gender or height;
column labels containing descriptors to embellish output (entire questions from questionnaires
are common labels); formats that describe what values mean, such as 0=Female, 1=Male (also
called “value labels” or codes); missing value codes, if other than blank, e.g. 999; the
measurement scale of each column (nominal, ordinal, interval); formulas or recoding steps that
were followed; and documentation such as who, what, where, when and why the data were
collected.
MOLAP – see Online Analytical Processing.
Modeling – generally refers to the process of developing rules which can classify or predict with
an estimated level of precision. The rules may be in the form of a series of logical statements
or mathematical formula(s).
Statistical models are equations that have been mathematically derived to provide the best or
optimal description of relationships that involve straight lines, smooth curves, group
membership or clusters of similar cases. The solution to these equations usually requires
simplifying assumptions about the nature of the data that will not fit every dataset.
Heuristic models use methods that have been empirically shown to work well, but which have
not been shown to be the best or optimal solution. Heuristic models usually make comparatively
few assumptions about the nature of the data.
Decision trees are an example of an analysis based on heuristics, while discriminant analysis
is based on an optimal method (which assumes the data follow a multivariate normal
distribution).
Multidimensional Database – See Data Cube.
Neural Networks – models that mimic the brain through systems of equations. They “learn” by
being “trained” with a dataset. Unfortunately, what they learn is conveyed by a series of
complex mathematical formulas. These formulas may work well but not explain much about
the process they model. See Black Box.
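A minimal sketch of training a small neural network with scikit-learn (the data and network size are invented for illustration); the fitted weights illustrate the “black box” problem, since they say little about why a prediction is made:

    from sklearn.neural_network import MLPClassifier

    # Made-up inputs and a binary target.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 0]

    net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000,
                        random_state=0).fit(X, y)
    print(net.predict([[1, 0]]))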
ODBC – see Open Database Connectivity.
OLAP – see Online Analytical Processing.
OLE DB – an open standard for gaining access to the data stored in a multidimensional database.
Most OLAP products use OLE DB to access the data.
OLTP – see Online Transaction Processing.
Online Analytical Processing (OLAP) – software that quickly displays interactive tables or
graphs of pre-selected variables such as sales aggregated by time, region, state, store and
product line. A two-dimensional “slice” of the data might show mean sales broken down by
region and product line. Clicking on region might drill down to further divide those numbers
by state. Some statistics such as medians and percentiles cannot be used in OLAP due to the
data structure OLAP requires (a multidimensional database).
OLAP is often not considered data mining since it involves only simple tables and graphs that
display what is happening rather than the analytics that can help determine why it is
happening or what may happen in the future. However, it is very widely discussed in the data
mining literature.
OLAP usually displays only sums, counts and means. This is because means at any level of
breakdown can be calculated from sums of sums and sums of counts. However, the median
of an aggregate is not the aggregate of the medians, which is why medians and
percentiles cannot be used in OLAP. This is a major limitation in the method.
OLAP is occasionally referred to as MOLAP because it runs very quickly using a
Multidimensional database. When OLAP is used with a standard relational database,
it is called ROLAP. ROLAP is usually hundreds of times slower than MOLAP.
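A rough illustration of the idea with a pandas pivot table (the data are invented for illustration); a real OLAP tool precomputes such aggregates in a data cube so they display instantly:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West"],
        "product": ["paint", "brushes", "paint", "brushes"],
        "amount":  [100, 40, 90, 30],
    })

    # A two-dimensional "slice": mean sales broken down by region and product.
    print(pd.pivot_table(sales, values="amount", index="region",
                         columns="product", aggfunc="mean"))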
Online Transaction Processing (OLTP) – involves the efficient execution of frequent
database transactions used to collect data or run a business. These transactions are
recorded in a database rather than a data warehouse and are not suitable for analysis
until they have been transferred to a warehouse and restructured for efficient analysis.
Open Database Connectivity (ODBC) – a widely used standard to extract data from a data
warehouse to use for analysis.
Over-fitted Model – a model which has become so complex that it applies only to the dataset
upon which it was developed. Another term for this is “over-parameterized.”
Predictive Analytics – see Analytics.
Profit – the amount of profit made in a specific modeling situation, calculated by estimating the
cost of each type of error: assuming it is right when it is wrong, and vice versa. It can be used
in business for obvious reasons but also in other areas. For example in medicine, you could
assign a cost of concluding a patient has a treatable disease when they do not (antibiotic
treatment = $35) versus concluding they do not have it when they actually do (treated after
complications set in=$2,300 hospital stay). The point of maximum profit would show you the
best way to use the model.
Qualitative Data – depending on the context this term may refer to text data such as email
messages or to categorical variables such as gender.
Query – the process of asking a database questions. Often done in an ad-hoc, interactive way.
Regression Analysis – a family of statistical models that includes fitting straight lines, called
linear regression; smoothly curving lines, called polynomial regression; more sharply curving
lines, called nonlinear regression; or models that predict group membership, called logistic
regression. The main type of data mining that these "generalized linear models" do not do is
cluster analysis (unsupervised learning).
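A minimal sketch of linear and logistic regression with scikit-learn (the data are invented for illustration):

    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = [[1], [2], [3], [4]]   # a single input variable

    # Linear regression: estimate a continuous target such as sales.
    linear = LinearRegression().fit(X, [10, 20, 31, 39])

    # Logistic regression: predict group membership such as purchase or not.
    logistic = LogisticRegression().fit(X, [0, 0, 1, 1])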
Relational Database – See Database.
Report – A basic listing of database information, which may consist of individual values, sums,
counts or means. Often done in a pre-planned, static way.
Return on Investment (ROI) – the money you save by doing data mining, above and beyond the
usually high expenses of purchasing a data mining package, learning to use it and then using it
to solve a problem.
ROI – see Return on Investment.
ROLAP – see Online Analytical Processing.
Roll-Up – the process of aggregating numeric data. For example, a series of tables may show
professor salaries broken down by department and gender at each campus. The campuses
could be “rolled up” to create a single table of salaries broken down by department and
gender for all campuses combined. The opposite of drill-down.
Rule Induction – see Decision Trees.
Rule Set – see Decision Trees.
SCM – see Supply Chain Management.
Scoring – applying a model to new data usually to predict values of continuous variables such as
amount of purchase, or group memberships such as survive/die.
Sequel – see Structured Query Language.
SRM – stands for Supplier Relationship Management. See Supply Chain Management.
Statistical Models – see Modeling.
Structured Query Language (SQL) – the basic language used in almost all databases. It allows
you to search a database for basic information such as listing certain records or totaling sums or
counts. It also lets you select subsets or samples, or perform joins. The basic form is
SELECT vars FROM tablename WHERE logical condition is true. Often pronounced “sequel.”
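A minimal sketch of the SELECT ... FROM ... WHERE form, run through Python's built-in sqlite3 module (the table and values are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customers (id INTEGER, state TEXT, total REAL)")
    con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                    [(1, "TN", 120.0), (2, "GA", 80.0), (3, "TN", 45.0)])

    # SELECT columns FROM a table WHERE a logical condition is true.
    rows = con.execute(
        "SELECT id, total FROM customers WHERE state = 'TN'").fetchall()
    print(rows)  # [(1, 120.0), (3, 45.0)]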
Supervised Learning – the process of developing a model that has a target variable, such as sales
or survival, to “supervise” it. This is opposed to unsupervised learning, which develops a
model without a target by finding clusters or groups of similar records in your data. When the
target variable is categorical, biologists would call this Class Prediction and statisticians
would call it Discriminant Analysis or Logistic Regression (two different methods of
achieving a similar result).
Supplier Relationship Management (SRM) – see supply chain management.
Supply Chain Management (SCM) – is the process of studying and interacting with suppliers to
maximize profits. Also called supplier relationship management (SRM). One of the three
main areas to which data mining is applied: supply chain management (SCM), enterprise
resource planning (ERP) and customer relationship management (CRM).
Target Variable – the main variable of interest in a data mining project. A business example is
the amount of each sale; a medical example is cure/no cure. Also called the predicted,
supervisor or dependent variable.
Test Data – used only once at the end of the data mining project to see if the best or
champion model generalizes to completely new data.
Text Data – written descriptions e.g. open ended survey questions, interviews, customer
complaints; also called qualitative data.
Text File – see flat file.
Text Mining – the process of automatically finding the key concepts contained in text data. It
may also find clusters of similar documents. The numeric output containing presence/absence
of each concept and cluster membership is often passed on to a data mining step where it is
combined with other numeric data for further analysis.
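A minimal sketch of turning text into numeric data with scikit-learn (the documents are invented for illustration); the resulting word counts could then be passed on to a data mining step:

    from sklearn.feature_extraction.text import CountVectorizer

    complaints = ["late delivery", "item arrived broken", "delivery was late again"]

    # Convert each document into counts of the words it contains.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(complaints)
    print(vectorizer.get_feature_names_out())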
Training Data – the data which is used to develop models that will, if done properly, work well
on new sets of data.
Unsupervised Training/Learning – the process of developing a model that does not have a
target variable. This boils down to finding clusters of similar cases within the data. It would
usually be followed by another analysis that does involve a target variable. Biologists would
call this class discovery. Statisticians would call it cluster analysis.
Validation Data – during data mining, each step of each model developed using the
training data is tested using this data to discover the point at which the model
becomes overly specific to that single set of data, an over-fitted model.
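A minimal sketch of the training/validation/test split with scikit-learn (the data and proportions are invented for illustration):

    from sklearn.model_selection import train_test_split

    X = [[i] for i in range(100)]    # stand-in records
    y = [i % 2 for i in range(100)]  # stand-in target

    # Hold out 20% as the test set, to be touched only once at the very end.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Split the remainder into training and validation sets.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)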
Variable – see Data Table.
Visualization – the use of dynamic, interactive graphical displays to search for useful
patterns in data.
Warehouse – see Data Warehouse.