Introduction to Data Mining

DM Intro

Integrated Knowledge Solutions

iksinc@yahoo.com

iksinc.wordpress.com

What is Data?

©IKSINC
Data is a set of facts, observations, or measurements about objects, events, or processes of interest.

What is Information?

Information is processed data that is useful in some way, for example for decision making or communication.
While the data is fixed, the information drawn from it can differ based on needs.

What is Knowledge?

Knowledge consists of patterns of relationships in data and information that exhibit a high degree of certainty.

What Is Data Mining?

Data mining is essentially a process of data-driven extraction of not-so-obvious but useful information from large databases.

The entire process is interactive and iterative.

Data mining also goes under various other names, such as knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data analytics, and business intelligence.

Why Data Mining?

•  The Data Glut
    •  Data-rich but information-poor businesses
    •  Estimates of data doubling every 20 months
    •  The average Fortune 500 company manages over a terabyte of data every day
•  Convergence of Technologies
•  Competitive Edge
    •  Mass marketing versus targeted marketing

Changing Business Direction

Typical Business Applications of Data Mining

•  Market Segmentation
•  Customer Targeting and Retention
•  Product Design and Placement
•  Credit Card Fraud Detection
•  Web Advertising
•  Recommendation Systems

Other Applications of Data Mining

•  Stock Market Trends
•  Text and Multimedia Data Mining
•  Sports Scouting
•  Medical Outcomes Analysis
•  Scientific Data Mining

Data Classification

•  Structured Data
    •  Data consisting of well-defined fields of numeric or alphanumeric values

Data Classification

•  Unstructured Data
    •  No well-defined fields of information
    •  Requires extensive processing to extract content information
    •  Examples include blogs, news reports, images, videos, tweets, etc.
    •  Fastest growing data segment

Data Classification

•  Semi-Structured Data
    •  Data with partial structure (medical reports, executive summaries, interview scripts, web documents, etc.)

Core Technologies for Data Mining

[Diagram relating Data Mining to Machine Learning, Database Technology, Statistics, Visualization, and Information Retrieval]

Data Mining and Business Intelligence

[Pyramid diagram; the potential to support business decisions increases toward the top. Roles shown alongside, from top to bottom: End User, Business Analyst, Data Analyst, DBA.]

Layers, from top to bottom:
•  Making Decisions
•  Data Presentation: Visualization Techniques
•  Data Mining: Information Discovery
•  Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting
•  Data Warehouses / Data Marts
•  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Architecture of a Typical Data Mining System

[Architecture diagram. Components shown: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Databases and Data Warehouse (with data cleaning, data integration, and filtering); Knowledge base.]

Data Mining and Data Warehousing

[Diagram: the enterprise “database” (Customers, Vendors, Orders, Transactions, etc.) is copied, organized, and summarized into a Data Warehouse, which data miners use for data mining.]

A data warehouse is a data repository set up to support strategic decision making.

Data Mart

•  A Data Mart is a smaller, more focused Data Warehouse: a mini-warehouse.
•  A Data Mart typically reflects the business rules of a specific business unit within an enterprise.

Accessing Information in a Data Warehouse

•  Structured Query Language (SQL)-based Reporting and Query Tools
    •  Good for extracting shallow, non-dimensional information. For example, “Find all credit card customers holding a balance of $500 or more.”
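
The kind of single-constraint query quoted above can be sketched with Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and invented purely for illustration; they are not from the presentation.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE card_customers (name TEXT, region TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO card_customers VALUES (?, ?, ?)",
    [("Ann", "Midwest", 750.0), ("Bob", "South", 120.0), ("Cid", "West", 500.0)],
)

# A constraint on a single field: a balance of $500 or more.
rows = conn.execute(
    "SELECT name FROM card_customers WHERE balance >= 500 ORDER BY name"
).fetchall()
high_balance = [name for (name,) in rows]
print(high_balance)
conn.close()
```

The query states exactly what is wanted, which is why this style of access suits questions whose answer is already fully specified.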

Accessing Information in a Data Warehouse

•  On-Line Analytical Processing (OLAP)-based Reporting and Query Tools
    •  Good for extracting shallow, dimensional information. For example, “Find all credit card customers who live in the Midwest, drive a luxury car, and hold a balance of $500 or more.”

On-Line Analytical Processing (OLAP)

•  OLAP tools organize data in a dimensional representation, called a cube. This permits data to be viewed from any user-specified angle to allow slicing and dicing of the data.
•  SQL can easily answer ‘who?’ and ‘what?’ questions; however, the ability to answer ‘what if?’ and ‘why?’ questions distinguishes OLAP from general-purpose query tools.

[Figure: OLAP data cube with dimensions Time (08, 09, 10, 11), Regions (Northeast, Midwest, South, West), and Products (A, B, C, D)]

OLAP Operations

[Figure: cube views illustrating Single Cell, Multiple Cells, Slice, Dice, Roll Up, and Drill Down]
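
As a rough illustration of these operations, the sketch below stores a tiny cube as a Python dictionary keyed by (region, product, year). The figures and the helper names `slice_cube`, `dice_cube`, and `roll_up` are assumptions made for this example, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales figures keyed by (region, product, year); values are invented.
cube = {
    ("Midwest", "A", "08"): 10, ("Midwest", "A", "09"): 12,
    ("Midwest", "B", "08"): 7,  ("West", "A", "08"): 5,
    ("West", "B", "09"): 9,     ("South", "A", "09"): 4,
}

def slice_cube(cube, year):
    """Slice: fix one dimension (here, time) to a single value."""
    return {k: v for k, v in cube.items() if k[2] == year}

def dice_cube(cube, regions, products):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return {k: v for k, v in cube.items() if k[0] in regions and k[1] in products}

def roll_up(cube, axis):
    """Roll up: aggregate away every dimension except the one at `axis`."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[key[axis]] += value
    return dict(totals)

by_region = roll_up(cube, axis=0)
print(by_region)
```

Drill down is the reverse of roll up: starting from `by_region`, one returns to the finer-grained cells of `cube` for a chosen region.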

Different OLAP Tools

•  Desktop OLAP (DOLAP)
    •  Limited-capability PC-based tools
•  Relational OLAP (ROLAP)
    •  Server-based tools that let a user assemble into a cube a subset of data from a relational database
•  Multidimensional OLAP (MOLAP)
    •  Server-based tools that use pre-computed cubes of data for faster response

Accessing Information in a Data Warehouse

•  Data Mining Tools
    •  Good for extracting hidden or not-so-obvious information. For example, “Find all credit card customers who are likely to declare bankruptcy.”

DBMS, OLAP, and Data Mining

|                  | DBMS | OLAP | Data Mining |
|------------------|------|------|-------------|
| Task             | Extraction of detailed and summary data | Summaries, trends and forecasts | Knowledge discovery of hidden patterns and insights |
| Type of result   | Information | Analysis | Insight and Prediction |
| Method           | Deduction (Ask the question, verify with data) | Multidimensional data modeling, Aggregation, Statistics | Induction (Build the model, apply it to new data, get the result) |
| Example question | Who purchased mutual funds in the last 3 years? | What is the average income of mutual fund buyers by region by year? | Who will buy a mutual fund in the next 6 months and why? |

Data Mining and SQL

•  SQL is good for queries that impose a constraint on data to extract an answer
•  SQL extracts shallow knowledge from a database
•  Use SQL when we know exactly what we are looking for
•  Data mining is good for exploratory queries
•  Data mining extracts hidden information from a database
•  Use data mining when we are in a fishing mode

Data Mining and Data Warehousing: A Mutually Reinforcing Relationship

•  Data mining provides a good ROI for data warehousing
•  A data warehouse or data mart provides clean, well-formatted historical data for mining

Data Mining Process

[Process diagram. Stages: Domain Understanding, Data Selection, Cleaning and Preprocessing, Discovering Patterns, Interpretation, Reporting]

Domain Understanding Stage

•  Learning the business goals
•  Gathering relevant prior knowledge
•  Best executed by a team of business and IT persons
•  A good understanding of the domain helps avoid discovering irrelevant patterns and minimizes the chance of garbage in, garbage out

Example Data Sources

•  Point-of-sale data
•  Credit card charge records
•  Warranty claims
•  Medical insurance claims
•  Direct mail response data
•  Telephone call records
•  Web activity data
•  Economic data
•  Utility charges
•  Census returns
•  Magazine subscription records

Data Cleaning and Preprocessing Stage

•  Data comes from many sources, internal and external
•  Data comes in many forms and formats
    •  Hierarchical databases, flat files, COBOL data sets
•  Data is never clean
•  Most important stage. Typically consumes about 60-80% of the total data mining effort

Business Data Corruption Examples

•  Duplication: a common problem with direct mailers and credit card companies
•  Missing and Confusing Data Fields
•  Outliers: generally present due to incorrect entry/coding of a data field
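
A minimal sketch of how these three problems might be flagged, using only the Python standard library. The records, the field names, and the choice of a median-absolute-deviation test for outliers are assumptions for illustration; the presentation does not prescribe a particular method.

```python
import statistics

# Hypothetical customer records, invented for illustration.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "48201", "age": 34},
    {"id": 2, "name": "ann lee ", "zip": "48201", "age": 34},   # duplicate of id 1
    {"id": 3, "name": "Bob Ray", "zip": None, "age": 41},       # missing field
    {"id": 4, "name": "Cid Orr", "zip": "60601", "age": 340},   # likely keying error
    {"id": 5, "name": "Dee Fox", "zip": "73301", "age": 29},
    {"id": 6, "name": "Eli Kim", "zip": "10001", "age": 52},
]

def find_duplicates(rows):
    """Return ids whose normalized (name, zip) was already seen."""
    seen, dupes = set(), []
    for row in rows:
        key = (row["name"].strip().lower(), row["zip"])
        if key in seen:
            dupes.append(row["id"])
        else:
            seen.add(key)
    return dupes

def find_missing(rows):
    """Return ids of records with at least one empty field."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]

def find_outliers(rows, field, k=3.0):
    """Flag values more than k median absolute deviations from the median."""
    values = [r[field] for r in rows]
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [r["id"] for r in rows if mad and abs(r[field] - med) > k * mad]

print(find_duplicates(records), find_missing(records), find_outliers(records, "age"))
```

Whether a flagged record is corrected, merged, or dropped remains a business decision that the code does not make.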

Data Preprocessing

This step is also known as data transformation. The aim here is to map data fields into representations suitable for the pattern discovery stage.

Examples:

Month/Date/Year ===> Age Groups

Customer Address ===> Geographic Zone Code
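
The two mappings above could be sketched as follows. The age bands and the ZIP-prefix-to-zone table are hypothetical placeholders; a real project would take both from the business rather than from this example.

```python
from datetime import date

# Hypothetical age bands and ZIP-prefix zones, invented for illustration.
AGE_BANDS = [(0, 17, "under 18"), (18, 34, "18-34"), (35, 54, "35-54"), (55, 200, "55+")]
ZONE_BY_ZIP_PREFIX = {"0": "Northeast", "1": "Northeast", "4": "Midwest", "6": "Midwest", "9": "West"}

def age_group(birth_date, today):
    """Map a birth date to an age band as of `today`."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "unknown"

def zone_code(zip_code):
    """Map a ZIP code to a coarse zone using its first digit."""
    return ZONE_BY_ZIP_PREFIX.get(zip_code[:1], "Other")

print(age_group(date(1980, 1, 1), today=date(2024, 1, 1)), zone_code("48201"))
```

Passing `today` explicitly keeps the transformation reproducible when the same data is reprocessed later.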

Pattern Discovery Stage

•  Discovery Model?
•  Discovery Methodology?

Discovery Models

•  Association Model
•  Classification Model
•  Clustering Model
•  Regression Model
•  Sequential Model
•  Visual Model

Association Model

90% of customers who subscribe to at least three premium channels also subscribe to pay-per-view events.

Also known as Market Basket Analysis.
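
A percentage like the one in this rule is usually called the rule's confidence. The sketch below computes support and confidence over a handful of invented subscriber baskets; the item names and the data are assumptions, so the resulting number is not the 90% quoted on the slide.

```python
# Hypothetical subscriber baskets, invented for illustration.
baskets = [
    {"three_plus_premium", "pay_per_view"},
    {"three_plus_premium", "pay_per_view", "sports"},
    {"three_plus_premium"},
    {"sports"},
    {"three_plus_premium", "pay_per_view"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain `itemset`."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

rule_conf = confidence({"three_plus_premium"}, {"pay_per_view"}, baskets)
print(rule_conf)
```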

Association Model: Application 1

•  Marketing and Sales Promotion:
    •  Let the rule discovered be {Bagels, … } --> {Potato Chips}
    •  Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
    •  Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
    •  Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Association Model: Application 2

•  Supermarket shelf management.
    •  Goal: To identify items that are bought together by sufficiently many customers.
    •  Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
    •  A classic rule:
        •  If a customer buys diapers and milk, then he is very likely to buy beer.
        •  So, don’t be surprised if you find six-packs stacked next to diapers!
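
The approach described here, scanning point-of-sale transactions for items that co-occur often enough, can be sketched by counting item pairs. The transactions and the `min_count` threshold are hypothetical; practical tools use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions, invented for illustration.
transactions = [
    ["diapers", "milk", "beer"],
    ["diapers", "beer"],
    ["milk", "bread"],
    ["diapers", "milk", "beer", "bread"],
    ["bread", "beer"],
]

def frequent_pairs(transactions, min_count):
    """Count every unordered item pair and keep those seen at least min_count times."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = frequent_pairs(transactions, min_count=3)
print(pairs)
```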

Classification Model

•  If Annual_Income > 40,000 AND Home-Owner, Then Credit-Risk → Medium
•  If [expression not recoverable from the slide; it involves Annual Income, Avg Monthly CreditCardBalance, and Mortgage] > 25, Then Loan-Approval → Yes

Classification: Application 1

•  Direct Marketing
    •  Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
    •  Approach:
        •  Use the data for a similar product introduced before.
        •  We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
        •  Collect various demographic, lifestyle, and company-interaction related information about all such customers.
            •  Type of business, where they stay, how much they earn, etc.
        •  Use this information as input attributes to learn a classifier model.
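
To make the idea of learning a classifier from labeled history concrete, the sketch below fits a one-attribute threshold rule (a decision stump). Using a single income attribute is a deliberate simplification, and the data is invented; the presentation does not say which learning algorithm to use.

```python
# Hypothetical past campaign: (annual income in $1000s, bought the earlier product?)
history = [(25, False), (32, False), (41, False), (48, True), (55, True), (63, True), (38, True)]

def learn_stump(rows):
    """Pick the income threshold that misclassifies the fewest past customers."""
    best_threshold, best_errors = None, len(rows) + 1
    for threshold in sorted({income for income, _ in rows}):
        # Predict "buy" when income >= threshold and count disagreements with the label.
        errors = sum((income >= threshold) != bought for income, bought in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

threshold, errors = learn_stump(history)

def predict(income):
    """Apply the learned rule to a new prospect."""
    return income >= threshold

print(threshold, errors, predict(60), predict(30))
```

The {buy, don't buy} label plays the role of the class attribute, and income stands in for the larger set of input attributes a real model would use.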

Classification: Application 2

•  Fraud Detection
    •  Goal: Predict fraudulent cases in credit card transactions.
    •  Approach:
        •  Use credit card transactions and the information on the account holder as attributes.
            •  When a customer buys, what they buy, how often they pay on time, etc.
        •  Label past transactions as fraud or fair transactions. This forms the class attribute.
        •  Learn a model for the class of the transactions.
        •  Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3

•  Customer Attrition/Churn:
    •  Goal: To predict whether a customer is likely to be lost to a competitor.
    •  Approach:
        •  Use detailed records of transactions with each of the past and present customers to find attributes.
            •  How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
        •  Label the customers as loyal or disloyal.
        •  Find a model for loyalty.

Clustering Model

Clustering models are similar to classification models except that no a priori information is available for classes.

Clustering: Application 1

•  Market Segmentation:
    •  Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
    •  Approach:
        •  Collect different attributes of customers based on their geographical and lifestyle related information.
        •  Find clusters of similar customers.
        •  Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
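
One common way to find clusters of similar customers is k-means, sketched below in plain Python. The choice of k-means, the two attributes, the starting centroids, and the data are all assumptions for illustration; the attributes are also left unscaled, which a real analysis would normally address.

```python
# Hypothetical customers described by (age, annual spend in $1000s).
customers = [(22, 1.0), (25, 1.5), (24, 1.2), (51, 8.0), (55, 9.5), (49, 7.5)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids

clusters, centroids = k_means(customers, centroids=[customers[0], customers[3]])
print(clusters)
```

No labels are supplied anywhere, which is the difference from the classification examples: the segments emerge from the attributes alone.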

Clustering: Application 2

•  Document Clustering:
    •  Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
    •  Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
    •  Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
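
The similarity measure in the approach above is often taken to be cosine similarity over term-frequency vectors. The sketch below assumes that choice and uses three invented one-line documents; it shows only the similarity step, not the clustering itself.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents, invented for illustration.
docs = {
    "d1": "stock market trends and stock prices",
    "d2": "market prices follow stock trends",
    "d3": "medical outcomes analysis for patients",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_12, sim_13)
```

Documents d1 and d2 share most of their terms and score high, while d1 and d3 share none, so a clustering step built on this measure would group the first two together.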

Regression Model

Log(Peak_Load) = 300 + 1.6(|Temp - 65|)**2 + 2.4(Rel_Humidity - 80)

Unlike classification models that produce only discrete outcomes, a regression model generates a numerical score as its output.
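
The example formula can be evaluated directly. The function below is a plain transcription of it; the input values are arbitrary, and since the slide does not state units or the base of the logarithm, the sketch returns Log(Peak_Load) rather than the load itself.

```python
def log_peak_load(temp, rel_humidity):
    """Evaluate the slide's example model; returns Log(Peak_Load), not the load."""
    return 300 + 1.6 * abs(temp - 65) ** 2 + 2.4 * (rel_humidity - 80)

# Arbitrary illustrative inputs.
score = log_peak_load(temp=75, rel_humidity=85)
print(score)
```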

Sequential Model

Similar to association models except that sequences of events are considered. For example:

“80% of customers who buy a product X are likely to buy product Y in the next six months”
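
A figure like the one in this example can be estimated from a dated purchase log. In the sketch below the log is invented, six months is approximated as 183 days, and the window starts at each customer's first purchase of X; all three are assumptions.

```python
from datetime import date, timedelta

# Hypothetical purchase log: customer -> list of (product, purchase date).
purchases = {
    "c1": [("X", date(2023, 1, 10)), ("Y", date(2023, 3, 2))],
    "c2": [("X", date(2023, 2, 1)), ("Y", date(2023, 12, 20))],  # Y bought too late
    "c3": [("Y", date(2023, 1, 5)), ("X", date(2023, 4, 1))],    # Y bought before X
    "c4": [("X", date(2023, 5, 5)), ("Y", date(2023, 6, 1))],
    "c5": [("Z", date(2023, 5, 5))],                              # never bought X
}

def sequence_confidence(purchases, first, then, window=timedelta(days=183)):
    """Fraction of buyers of `first` who buy `then` within `window` afterwards."""
    bought_first, followed = 0, 0
    for events in purchases.values():
        first_dates = [d for product, d in events if product == first]
        if not first_dates:
            continue
        bought_first += 1
        start = min(first_dates)
        if any(product == then and start < d <= start + window for product, d in events):
            followed += 1
    return followed / bought_first if bought_first else 0.0

conf_xy = sequence_confidence(purchases, "X", "Y")
print(conf_xy)
```

Unlike the association example, the order and timing of purchases matter here: customer c3 bought both products but does not count, because Y came first.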

Interpretation Stage

•  Evaluate the quality of the discovered patterns
•  Determine the value of the discovery to the business

Value of Mined Information

•  Perceptive gap

[Diagram: the gap between the organization’s view of the business and the data-mined view of the business]

Value of Mined Information

•  Dollar gap

[Diagram: the gap between the organization’s view of the business and the data-mined view of the business, with Action 1 and Action 2 shown]

Reporting Stage

•  Reporting the discovery to higher management
•  Transforming the discovery to new actions or products

Challenges of Data Mining

•  Scalability
•  Dimensionality
•  Complex and Heterogeneous Data
•  Data Quality
•  Data Ownership and Distribution
•  Privacy Preservation

Thank you for Viewing my Presentation

For questions, you can contact me at iksinc@yahoo.com

Also visit my blog “From Data to Decisions” at iksinc.wordpress.com
