Data Mining – analyse Bank Marketing Data Set
1. Data Mining – analyse
Bank Marketing Data Set
by WEKA.
EXPLORATORY PROJECT BY
MATEUSZ BRZOSKA
MIDDLESEX UNIVERSITY 2015
2. Abstract / Aims / Objectives
Aims
To study techniques and
methodologies in data mining
To analyse a data set of interest for
clustering, classification, learning
dependencies and prediction
To process the data and reach a
satisfactory final result
Objectives
To study Knowledge Discovery in
Databases (KDD)
To understand the need for analysing
large, complex, information-rich
data sets
To provide essential information and
demonstrate the relevant algorithms
and techniques
3. Bank Marketing
Data Set
“The data come from marketing
campaigns of a Portuguese banking
institution. The marketing campaigns
were based on phone calls. Often, more
than one contact to the same client was
required, in order to assess if the product
(bank term deposit) would be ('yes') or
not ('no') subscribed.”
41188 instances / 11 inputs
Goal: predict whether the client will subscribe (yes/no) to a term deposit
4. Knowledge Discovery
in Databases
The KDD process consists of the
following steps (see the picture):
- Selection of the data that are relevant to the analysis task
- Preprocessing of these data, including tasks like data cleaning and data integration
- Transformation of the data into forms appropriate for mining
- Application of data mining algorithms for the extraction of patterns
- Interpretation/evaluation of the generated patterns, so as to identify those patterns that represent real knowledge, based on some interestingness measures
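The five KDD steps above can be sketched on a handful of toy records. This is only a minimal illustration in Python, not the WEKA workflow used in this project; the records, the field names and the function names (`select`, `preprocess`, `transform`, `mine`, `evaluate`) are hypothetical.

```python
from collections import Counter

# Hypothetical toy records; NOT taken from the Bank Marketing Data Set.
RAW = [
    {"id": 1, "age": "34", "contact": "cellular", "y": "no"},
    {"id": 2, "age": "71", "contact": "cellular", "y": "yes"},
    {"id": 3, "age": None, "contact": "telephone", "y": "no"},
    {"id": 4, "age": "45", "contact": "telephone", "y": "no"},
]

def select(rows):
    # Selection: keep only the fields relevant to the analysis task.
    return [{k: r[k] for k in ("age", "contact", "y")} for r in rows]

def preprocess(rows):
    # Preprocessing: simple cleaning, drop records with missing values.
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    # Transformation: turn the numeric age into a nominal band.
    return [{**r, "age": "60+" if int(r["age"]) >= 60 else "<60"} for r in rows]

def mine(rows):
    # Data mining: count how often each (attribute=value, class) pair occurs.
    counts = Counter()
    for r in rows:
        for attr in ("age", "contact"):
            counts[(f"{attr}={r[attr]}", r["y"])] += 1
    return counts

def evaluate(counts, min_count=1):
    # Interpretation/evaluation: keep patterns that meet a simple
    # interestingness threshold (here, a minimum count).
    return {p: c for p, c in counts.items() if c >= min_count}

patterns = evaluate(mine(transform(preprocess(select(RAW)))))
```

Each function stands for one step, so the final line reads as the whole process from the inside out.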
5. Data Mining Overview
we are "sinking" in electronic data
data mining technology can extract knowledge from it
and make efficient, rational use of the collected data
"a process of automatic discovery of non-trivial, previously unknown,
potentially useful rules, dependencies, patterns, similarities and trends in
large data repositories."
6. Data Mining Methods
Discovering
association rules
methods of discovering interesting
relationships or correlations
Classification
and prediction
includes methods for discovering
models (classifiers)
Grouping (cluster
analysis, clustering)
finding a finite set of classes of
objects with similar characteristics
7. WEKA Software
automatically make predictions
help people make decisions faster and
more accurately
freely available for download
one of the most widely used data mining
systems
its tools can be used for many different
data mining tasks
discovering knowledge from Bank
Marketing Data Set through:
- classification
- clustering
- association rules
8. Visualization of Data Set and Examining Data
You can visualize the attributes based on the selected class.
9. Data Mining – Classification
(OneR, J48, Naive Bayes)
a method of data analysis
assigns an object (data) to one of the
predefined classes based on a set of
attributes that describe the object
the purpose of classification is
prediction
the most popular classification
algorithms: Decision Trees (J48), Naive
Bayes, Bayesian Networks, OneR
10. Discovering potentially useful patterns
from a data set
- classification algorithms
OneR
OneR generates a one-level
decision tree. The rules are simple
to understand but less accurate than
those of more complex classifiers.
Deposit = YES (AGE)
If 64.5 – 66.5
If 75.5 – 80.5
If more than 88.5
Deposit = NO (AGE)
If less than 64.5
If 66.5 – 75.5
If 80.5 – 88.5
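The idea behind OneR can be sketched as follows: for every attribute, map each value to its majority class, then keep the single attribute whose rule makes the fewest errors. This is a minimal Python sketch for nominal attributes, not WEKA's implementation (which also discretizes numeric attributes such as AGE into the ranges shown above); the function name `one_r` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (best_attribute, rule, errors); rule maps value -> majority class."""
    best = None
    for attr in attributes:
        # Class counts for every value of this attribute.
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[attr]][r[target]] += 1
        # One-level rule: each value predicts its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # Errors: instances that do not belong to the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical toy rows; NOT taken from the Bank Marketing Data Set.
ROWS = [
    {"age": "65+", "contact": "cellular", "y": "yes"},
    {"age": "65+", "contact": "telephone", "y": "yes"},
    {"age": "<65", "contact": "cellular", "y": "no"},
    {"age": "<65", "contact": "telephone", "y": "no"},
    {"age": "<65", "contact": "cellular", "y": "yes"},
]

attr, rule, errors = one_r(ROWS, ["age", "contact"], "y")
```

On these toy rows the age attribute misclassifies one instance and the contact attribute two, so the age rule is kept.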
J48
Splits the original data set on
each attribute in turn. Evaluates
many candidate splits to build
a decision tree.
Deposit = YES
Age > 60
Job = retired
Education = basic.4y
Marital = married
Loan = no
Housing = yes
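J48 is WEKA's implementation of the C4.5 decision tree learner, which chooses the attribute to split on with an entropy-based measure. The Python sketch below shows plain information gain as a simplified stand-in (C4.5 itself uses the gain ratio); the function names and the toy rows are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    # Entropy of the whole set minus the weighted entropy after splitting on attr.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy rows; NOT taken from the Bank Marketing Data Set.
ROWS = [
    {"job": "retired", "housing": "yes", "y": "yes"},
    {"job": "retired", "housing": "no", "y": "yes"},
    {"job": "admin", "housing": "yes", "y": "no"},
    {"job": "admin", "housing": "no", "y": "no"},
]

gain_job = information_gain(ROWS, "job", "y")          # job separates the classes perfectly
gain_housing = information_gain(ROWS, "housing", "y")  # housing carries no information here
```

A tree learner of this kind would split on the attribute with the highest score first and then repeat the procedure inside each branch.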
Naïve Bayes
Assigns a new case to the most
probable class.
Attribute      NO                   YES
AGE            40                   41
JOB            Admin
MARITAL        Married
EDUCATION      University degree
DEFAULT        No
HOUSING        Yes
LOAN           No
CONTACT        Cellular
MONTH          May
DAY OF WEEK    Monday               Thursday
(rows showing a single value presumably apply to both classes)
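How Naive Bayes assigns a new case can be sketched as: multiply the prior probability of each class by the probability of every attribute value given that class, then pick the class with the highest score. This is a minimal Python sketch for nominal attributes with Laplace smoothing, not WEKA's NaiveBayes class; the function names and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def train_nb(rows, attributes, target):
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> counts of values
    values = defaultdict(set)            # attribute -> distinct values seen
    for r in rows:
        for a in attributes:
            value_counts[(a, r[target])][r[a]] += 1
            values[a].add(r[a])
    return class_counts, value_counts, values

def predict_nb(model, case):
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for a, v in case.items():
            # Conditional probability with Laplace (add-one) smoothing.
            score *= (value_counts[(a, cls)][v] + 1) / (n + len(values[a]))
        scores[cls] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy rows; NOT taken from the Bank Marketing Data Set.
ROWS = [
    {"contact": "cellular", "day": "thu", "y": "yes"},
    {"contact": "cellular", "day": "thu", "y": "yes"},
    {"contact": "telephone", "day": "mon", "y": "no"},
    {"contact": "telephone", "day": "mon", "y": "no"},
    {"contact": "cellular", "day": "mon", "y": "no"},
]

model = train_nb(ROWS, ["contact", "day"], "y")
label, scores = predict_nb(model, {"contact": "cellular", "day": "thu"})
```

The scores are unnormalized; dividing each by their sum would give class probabilities.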
11. Data Mining – Clustering
(SimpleKMeans)
a process of grouping objects into
classes called clusters
definitions of the concept of a
cluster:
- a set of objects that are "similar"
- a set of objects such that the
distance between any two objects
belonging to the cluster is less
than the distance between any
object in the cluster and any object
outside it
the SimpleKMeans algorithm is used as
an example in WEKA
12. Discovering potentially useful patterns
from a data set
- clustering algorithm
SimpleKMeans
Each group is represented by its centroid, computed from the instances that belong to the group.
Membership in a group is determined by finding the most similar (nearest) group centroid for each
instance.
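The centroid idea can be sketched in a few lines: assign every instance to its nearest centroid, recompute each centroid as the mean of its members, and repeat. This is a minimal Python sketch for numeric features with a deterministic start, not WEKA's SimpleKMeans (which also handles nominal attributes and uses seeded random initialization); the function name `kmeans` and the points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    # Deterministic start for the sketch: the first k points are the initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Membership: index of the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical two-feature points; NOT taken from the Bank Marketing Data Set.
POINTS = [(25, 1), (30, 2), (70, 1), (75, 2)]
centroids, clusters = kmeans(POINTS, k=2)
```

With these points the two younger and the two older instances end up in separate groups after a couple of iterations.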
13. Data Mining - Association
(Rules Function|Apriori)
Association rule mining is an unsupervised
data mining function
It finds rules associated with frequently
co-occurring items
It gives rules that explain how items or
events are associated with each other
The Apriori algorithm is used to discover
co-occurring items.
14. Discovering potentially useful patterns
from a data set
- association algorithm
Apriori
Apriori finds rules with support greater than a specified minimum support and confidence greater
than a specified minimum confidence.
1. marital=married contact=telephone month=may 5454 ==> y=no 5283 conf:(0.97)
2. marital=married loan=no contact=telephone month=may 4511 ==> y=no 4367 conf:(0.97)
3. contact=telephone month=may 8251 ==> y=no 7979 conf:(0.97)
4. loan=no contact=telephone month=may 6819 ==> y=no 6593 conf:(0.97)
5. default=no contact=telephone month=may 5726 ==> y=no 5533 conf:(0.97)
6. default=no loan=no contact=telephone month=may 4749 ==> y=no 4587 conf:(0.97)
7. month=aug y=no 5523 ==> contact=cellular 5290 conf:(0.96)
8. month=aug 6178 ==> contact=cellular 5909 conf:(0.96)
9. loan=no month=aug y=no 4562 ==> contact=cellular 4362 conf:(0.96)
10. loan=no month=aug 5120 ==> contact=cellular 4890 conf:(0.96)
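In each rule above, the first number is the count of instances matching the left-hand side and the second is the count matching both sides; the confidence is their ratio. The short Python sketch below recomputes it for rules 1 and 8 from the counts listed, and derives a support value using the 41188 instances stated earlier; the function names are hypothetical.

```python
N_INSTANCES = 41188  # data set size stated earlier in this document

def confidence(antecedent_count, rule_count):
    # Share of instances matching the left-hand side that also match the right-hand side.
    return rule_count / antecedent_count

def support(rule_count, n=N_INSTANCES):
    # Share of all instances that match both sides of the rule.
    return rule_count / n

# Rule 1: marital=married contact=telephone month=may 5454 ==> y=no 5283
rule1_conf = confidence(5454, 5283)
# Rule 8: month=aug 6178 ==> contact=cellular 5909
rule8_conf = confidence(6178, 5909)
rule1_support = support(5283)

print(round(rule1_conf, 2), round(rule8_conf, 2), round(rule1_support, 3))
```

Rounded to two decimals, the recomputed confidences match the `conf` values reported for those rules.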
15. Conclusion
Analysis
presents techniques and methodologies
in data mining and Knowledge Discovery
in Databases
analyses a large data set
provides essential information and
demonstrates the relevant algorithms
and techniques
Results
the extracted knowledge is potentially useful;
computer search engines already
provide the best results for achieving
specific goals;
WEKA helped to collect certain rules;
the data were processed and a
satisfactory final result was reached
16. Results
Will subscribe to a term deposit: YES
AGE >65
JOB: services, blue-collar, technician, entrepreneur
MARITAL: married
EDUCATION: basic.9y, basic.6y, high.school
DEFAULT: unknown (has credit in default)
HOUSING: no (has housing loan)
LOAN: there is no big difference (has personal loan)
CONTACT: telephone
MONTH: may, jun, jul, aug, nov
DAY OF WEEK: mon, fri
Will subscribe to a term deposit: NO
AGE <65
JOB: admin, student, unemployed, retired
MARITAL: single
EDUCATION: university degree, unknown
DEFAULT: no (has credit in default)
HOUSING: yes (has housing loan)
LOAN: there is no big difference (has personal loan)
CONTACT: cellular
MONTH: oct, sep, dec, mar, apr
DAY OF WEEK: tue, wed, thu
Who wants this data?
marketing companies / banking institutions