Unit-1
INTRODUCTION TO
DATA MINING
By,
MRUTYUNJAYA S Y
Assistant Professor,
CSE dept,
CMREC, Hyderabad
HISTORY
• Data mining is commonly regarded as a recent field because it is
associated with new technology.
• However, the analysis of data is a discipline with a long history.
• It starts with early methods such as Bayes’ theorem (1700s) and
regression analysis (1800s), which were mostly used to identify
patterns in data.
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/Credit Card
transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw
data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
– Alternative names: knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis.
– Data mining is a process used by companies to turn
raw data into useful information. By using software to look
for patterns in large batches of data, businesses can learn
more about their customers and develop more effective
marketing strategies as well as increase sales and
decrease costs.
????Questions????
• Where exactly can the data be stored?
• How, and from where, can the data be extracted?
• Where can the data be mined from?
• The answer is the DATA WAREHOUSE.
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
• Must address:
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining shown drawing on AI / Machine Learning, Statistics, and Database systems]
Database Processing vs. Data Mining Processing
• Database processing
– Query: well defined, expressed in SQL
– Output: precise, a subset of the database
• Data mining processing
– Query: poorly defined, no precise query language
– Output: fuzzy, not a subset of the database
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
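The milk examples can be contrasted in a short sketch. The transactions, the function names, and the 0.5 confidence threshold below are illustrative assumptions, not part of the slides. The first function is an exact filter that returns a subset of the data; the second summarizes co-occurrence, so its output is not a subset of the database.

```python
from collections import Counter

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "cereal"},
]


def transactions_with(item, baskets):
    """Database-style query: an exact filter returning a subset of the data."""
    return [b for b in baskets if item in b]


def frequent_companions(item, baskets, min_confidence=0.5):
    """Mining-style question: which items often appear together with `item`?

    Returns {other_item: confidence}, where confidence is the fraction of
    baskets containing `item` that also contain `other_item`.
    """
    with_item = transactions_with(item, baskets)
    if not with_item:
        return {}
    counts = Counter(other for b in with_item for other in b if other != item)
    return {
        other: n / len(with_item)
        for other, n in counts.items()
        if n / len(with_item) >= min_confidence
    }


milk_baskets = transactions_with("milk", transactions)
rules = frequent_companions("milk", transactions)
print(len(milk_baskets), "transactions contain milk")
print("often bought with milk:", rules)
```

A full association rule miner would also apply a minimum support threshold and consider item sets larger than pairs; this sketch only measures pairwise confidence.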
Data Mining Models and Tasks
Data Mining: Classification Schemes
• Decisions in data mining
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
• Data mining tasks
– Descriptive data mining
– Predictive data mining
Decisions in Data Mining
• Databases to be mined
– Relational, transactional, object-oriented, object-relational,
time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.
Common data mining tasks
– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is
the class.
• Find a model for the class attribute as a function of
the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
– Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
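The train/test procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the records are synthetic, the attributes (income, age) and the labeling rule are invented for the example, and a 1-nearest-neighbour rule stands in for whatever classifier is actually used.

```python
import math
import random


def make_records(n, seed=0):
    """Generate hypothetical records: (income, age) attributes plus a class."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        income = rng.uniform(20, 120)   # thousands per year (assumed unit)
        age = rng.uniform(18, 70)
        label = "buy" if income > 70 else "don't buy"  # invented rule
        records.append(((income, age), label))
    return records


def split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and divide it into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict(train, attributes):
    """1-nearest-neighbour model: copy the class of the closest training record."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(predict(train, attrs) == label for attrs, label in test)
    return correct / len(test)


records = make_records(200)
train_set, test_set = split(records)
acc = accuracy(train_set, test_set)
print(f"test accuracy: {acc:.2f}")
```

The test set is never used to build the model, so the reported accuracy estimates how well previously unseen records are classified.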
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-
interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and information about the
account holder as attributes.
– When the customer buys, what they buy, how often they
pay on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3
• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars,
some of the farthest objects that are difficult to find!
Classifying Galaxies
Class:
• Stages of formation (Early, Intermediate, Late)
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD process diagram with the labels Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Preprocessing, Data Mining, Pattern Evaluation]
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
Layers, top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
November 13, 2022 Data Mining: Concepts and
Techniques
24
Architecture of a Typical Data Mining System
[Figure: layered architecture]
Components, top to bottom:
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning & data integration, filtering
• Databases, Data Warehouse
A Knowledge-base is shown alongside these components.
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Examples of Large Datasets
• Government: IRS, NGA, …
• Large corporations
– WALMART: 20M transactions per day
– MOBIL: 100 TB geological databases
– AT&T: 300M calls per day
– Credit card companies
• Scientific
– NASA, EOS project: 50 GB per hour
– Environmental datasets
DATA MINING APPLICATIONS
• Areas of Use (heavily used in all fields)
– Internet – Discover needs of customers
– Economics – Predict stock prices
– Science – Predict environmental change
– Medicine – Match patients with similar problems → cure
• A credit card company wants to discover information about its
clients from its databases. It wants to find:
– Clients who respond to promotions in “Junk Mail”
– Clients who are likely to switch to a competitor
November 13, 2022 Data Preprocessing 28
Data Preprocessing
• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
• e.g., Age=“42”, Birthday=“03/07/1997”
• e.g., was rating “1, 2, 3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
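The three kinds of dirty data can be checked mechanically. The sketch below uses the example values from the slide; the field names, the `find_problems` helper, the MM/DD/YYYY reading of the birthday, and the reference date passed as `as_of` are all assumptions made for illustration.

```python
from datetime import date, datetime


def find_problems(record, as_of):
    """Return a list of data-quality problems found in one record."""
    problems = []

    # Incomplete: the attribute value is missing or blank.
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation is missing")

    # Noisy: the value is outside the plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: the stored age disagrees with the stored birthday.
    # Assumes the birthday is written as MM/DD/YYYY.
    birthday = datetime.strptime(record["birthday"], "%m/%d/%Y").date()
    had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
    expected_age = as_of.year - birthday.year - (0 if had_birthday else 1)
    if record["age"] != expected_age:
        problems.append(
            f"inconsistent: age {record['age']} but birthday implies {expected_age}"
        )
    return problems


record = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/1997"}
problems = find_problems(record, as_of=date(2022, 11, 13))
for p in problems:
    print(p)
```

Detection is only the first step; deciding how to repair each problem is the job of the data cleaning tasks.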
Why Is Data Dirty?
• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was
collected and when it is analyzed
– Human/hardware/software problems
• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
• Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics
– A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation make up
the majority of the work of building a data warehouse
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a reduced representation in volume but produces the same
or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially
for numerical data
Forms of Data Preprocessing
Data Cleaning
• Importance
– “Data cleaning is one of the three biggest problems
in data warehousing” (Ralph Kimball)
– “Data cleaning is the number one problem in data
warehousing” (DCI survey)
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data that was inconsistent with other recorded data and was therefore deleted
– data not entered due to misunderstanding
– certain data not being considered important at the time of
entry
– history or changes of the data not being registered
• Missing data may need to be inferred.
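One simple way to infer missing values is to fill them with the attribute mean. The slide does not prescribe a method, so this is only one possible choice; the sales rows and the helper name are hypothetical.

```python
import statistics


def fill_missing_with_mean(rows, attribute):
    """Return copies of the rows with None in `attribute` replaced by the mean.

    The mean is computed over the rows where the attribute is present.
    The input rows are left unchanged.
    """
    known = [r[attribute] for r in rows if r[attribute] is not None]
    if not known:
        return [dict(r) for r in rows]  # nothing to infer from
    mean = statistics.fmean(known)
    return [
        {**r, attribute: mean if r[attribute] is None else r[attribute]}
        for r in rows
    ]


sales = [
    {"customer": "A", "income": 40000},
    {"customer": "B", "income": None},
    {"customer": "C", "income": 60000},
    {"customer": "D", "income": None},
    {"customer": "E", "income": 50000},
]
filled = fill_missing_with_mean(sales, "income")
for row in filled:
    print(row)
```

Mean filling is easy to apply but reduces the variance of the attribute; filling with a class-specific mean or a value predicted from other attributes are common alternatives.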
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitations
– inconsistency in naming conventions
• Other data problems that require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning
– first sort the data and partition it into (equal-frequency) bins
– then smooth by bin means, by bin medians, by bin boundaries, etc.
• Regression
– smooth by fitting the data to regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values automatically and have a human check
them (e.g., deal with possible outliers)
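Binning can be sketched as below. The price values and function names are illustrative assumptions, and the sketch assumes there are at least as many values as bins. Smoothing by boundaries replaces each value with the closer of its bin's minimum and maximum (ties go to the minimum here, which is an arbitrary choice).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and partition them into bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer bin boundary (bin min or bin max)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sorted prices
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by bin medians works the same way, with the median of each bin in place of the mean.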
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check the uniqueness rule, consecutive rule, and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., correlation and
clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheel)
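The three rules named under discrepancy detection can be checked with a few lines of code. This is a sketch under assumptions: the order rows, the attribute names, and the set of null markers are invented, and the consecutive rule is read as "no gaps between the lowest and highest value of an integer attribute".

```python
from collections import Counter


def uniqueness_violations(rows, attribute):
    """Uniqueness rule: return the values that occur in more than one row."""
    counts = Counter(r[attribute] for r in rows)
    return sorted(v for v, n in counts.items() if n > 1)


def consecutive_gaps(rows, attribute):
    """Consecutive rule: return integers missing between the min and max value."""
    present = {r[attribute] for r in rows}
    if not present:
        return []
    return [v for v in range(min(present), max(present) + 1) if v not in present]


def null_violations(rows, attribute, null_markers=(None, "", "?")):
    """Null rule: return the indexes of rows whose value is a null marker."""
    return [i for i, r in enumerate(rows) if r.get(attribute) in null_markers]


orders = [
    {"order_id": 101, "postal_code": "500001"},
    {"order_id": 102, "postal_code": ""},
    {"order_id": 102, "postal_code": "500003"},
    {"order_id": 105, "postal_code": "?"},
]
duplicates = uniqueness_violations(orders, "order_id")
gaps = consecutive_gaps(orders, "order_id")
nulls = null_violations(orders, "postal_code")
print("duplicate ids:", duplicates)
print("missing ids:", gaps)
print("rows with null postal code:", nulls)
```

In practice the rules and null markers would come from the metadata for each attribute rather than being hard-coded.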
Thank You
