Data Mining: Analysing the
Bank Marketing Data Set
with WEKA
EXPLORATORY PROJECT BY
MATEUSZ BRZOSKA
MIDDLESEX UNIVERSITY 2015
Abstract / Aims / Objectives
Aims
• To study techniques and methodologies in data mining
• To analyse a data set of interest for clustering, classification, learning dependencies and prediction
• To process the data and achieve a satisfactory final result
Objectives
• To study Knowledge Discovery in Databases (KDD)
• To understand the need for analysis of large, complex, information-rich data sets
• To provide essential information and demonstrate relevant algorithms and techniques
Bank Marketing Data Set
The data comes from the marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
41188 instances / 11 inputs
• Goal: predict whether the client will subscribe to a term deposit (yes/no)
Knowledge Discovery in Databases
The KDD process consists of the following steps (see the picture):
1. Selection of the data relevant to the analysis task
2. Preprocessing of these data, including tasks such as data cleaning and data integration
3. Transformation of the data into forms appropriate for mining
4. Application of data mining algorithms to extract patterns
5. Interpretation/evaluation of the generated patterns, to identify those that represent real knowledge according to some interestingness measures
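As a rough illustration of how the five KDD steps chain together, here is a minimal Python sketch. The toy records, the function names, the age threshold of 60 and the minimum count are all assumptions made for this example; they are not taken from the project or from WEKA.

```python
from collections import Counter

# Hypothetical toy records; field names mirror Bank Marketing attributes.
raw = [
    {"age": "58", "job": "retired", "contact": "cellular", "y": "yes"},
    {"age": "34", "job": "admin.", "contact": "telephone", "y": "no"},
    {"age": "", "job": "services", "contact": "telephone", "y": "no"},  # missing age
    {"age": "41", "job": "admin.", "contact": "cellular", "y": "no"},
    {"age": "67", "job": "retired", "contact": "cellular", "y": "yes"},
]

def select(records, fields):
    """Selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def preprocess(records):
    """Preprocessing: drop records with missing values (simple cleaning)."""
    return [r for r in records if all(v != "" for v in r.values())]

def transform(records):
    """Transformation: discretize age into two bands (threshold is an assumption)."""
    return [
        {"age_band": "60+" if int(r["age"]) >= 60 else "under60", "y": r["y"]}
        for r in records
    ]

def mine(records):
    """Data mining: count how often each age band co-occurs with each class."""
    return Counter((r["age_band"], r["y"]) for r in records)

def evaluate(patterns, min_count=2):
    """Interpretation/evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

patterns = evaluate(mine(transform(preprocess(select(raw, ["age", "y"])))))
print(patterns)
```

Each function corresponds to one step, so a step can be swapped out (for example, a different discretization) without touching the others.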
Data Mining Overview
• we are "sinking" in electronic data
• data mining technology can extract knowledge from it
• it lets us use the collected data efficiently and rationally, as knowledge
• "a process of automatic discovery of non-trivial, previously unknown, potentially useful rules, dependencies, patterns, similarities and trends in large data repositories."
Data Mining Methods
• Discovering association rules: methods of discovering interesting relationships or correlations
• Classification and prediction: methods for discovering models (classifiers)
• Grouping (cluster analysis, clustering): finding classes of objects with similar characteristics within finite sets
WEKA Software
• automatically makes predictions
• helps people make decisions faster and more accurately
• freely available for download
• one of the most widely used data mining systems
• its tools can be used in many different data mining tasks
• discovering knowledge from the Bank Marketing Data Set through:
- classification
- clustering
- association rules
Visualization of Data Set and Examining Data
You can visualize the attributes based on the selected class.
Data Mining – Classification
(OneR, J48, Naive Bayes)
• a method of data analysis
• assigns an object (data) to one of the predefined classes based on a set of attributes that describe the object
• the purpose of classification is prediction
• the most popular classification algorithms: Decision Trees (J48), Naive Bayes, Bayesian Networks, OneR
Discovering potentially useful patterns from a data set: classification algorithms
OneR
OneR generates a one-level decision tree. The rules are simple to understand but also less accurate.
Deposit = YES (AGE)
If 64.5 – 66.5
If 75.5 – 80.5
If more than 88.5
Deposit = NO (AGE)
If less than 64.5
If 66.5 – 75.5
If 80.5 – 88.5
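The OneR idea can be sketched in a few lines of Python: for each attribute, map every value to its majority class, count the misclassified rows, and keep the attribute with the fewest errors. This is a simplified sketch for nominal attributes only (the age rule above relies on numeric discretization, which is not shown); the toy rows and the `age_band` attribute are assumptions made for the example.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, rule, errors) for the single best attribute."""
    best = None
    for attr in attributes:
        # Count class labels for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One-level rule: each value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical toy data, not taken from the Bank Marketing Data Set.
rows = [
    {"age_band": "senior", "contact": "cellular", "y": "yes"},
    {"age_band": "senior", "contact": "telephone", "y": "yes"},
    {"age_band": "adult", "contact": "cellular", "y": "no"},
    {"age_band": "adult", "contact": "telephone", "y": "no"},
    {"age_band": "adult", "contact": "cellular", "y": "yes"},
    {"age_band": "adult", "contact": "telephone", "y": "no"},
]

attribute, rule, errors = one_r(rows, ["age_band", "contact"], "y")
print(attribute, rule, errors)
```

On this toy data the age band misclassifies one row and the contact type two, so the age band is chosen as the single rule attribute.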
J48
Divides the original data set according to each variable and creates many variants of the division.
Deposit = YES
Age > 60
Job = retired
Education = basic.4y
Marital = married
Loan = no
Housing = yes
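To make "dividing the data set according to each variable" more concrete, the sketch below scores candidate splits by information gain, the entropy-based measure behind C4.5-style trees. J48 is WEKA's implementation of C4.5, which uses the related gain ratio plus pruning; this sketch deliberately stops at plain information gain, and the four toy rows are an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy reduction obtained by splitting rows on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy data: job separates the classes perfectly, housing does not.
rows = [
    {"job": "retired", "housing": "yes", "y": "yes"},
    {"job": "retired", "housing": "no", "y": "yes"},
    {"job": "admin.", "housing": "yes", "y": "no"},
    {"job": "admin.", "housing": "no", "y": "no"},
]

gains = {a: information_gain(rows, a, "y") for a in ("job", "housing")}
best = max(gains, key=gains.get)
print(gains, best)
```

A tree builder would split on the best attribute and repeat the same scoring inside each branch.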
Naïve Bayes
Assigns a new case to one of the classes.
Attribute NO YES
AGE 40 41
JOB Admin
MARITAL Married
EDUCATION University degree
DEFAULT No
HOUSING Yes
LOAN No
CONTACT Cellular
MONTH May
DAY OF WEEK Monday Thursday
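How Naïve Bayes assigns a new case to a class can be sketched as follows: multiply the class prior by the per-attribute conditional probabilities and pick the class with the larger product. This is a minimal sketch for nominal attributes with add-one (Laplace) smoothing; it is not WEKA's implementation, and the five training rows are an assumption.

```python
from collections import Counter, defaultdict

def train(rows, attributes, target):
    """Collect the counts needed for a categorical Naive Bayes model."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values = defaultdict(set)            # attribute -> distinct values seen
    for r in rows:
        for a in attributes:
            value_counts[(a, r[target])][r[a]] += 1
            values[a].add(r[a])
    return class_counts, value_counts, values

def predict(model, case):
    """Return (best_class, scores) using add-one smoothing."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # class prior
        for a, v in case.items():
            score *= (value_counts[(a, cls)][v] + 1) / (n + len(values[a]))
        scores[cls] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data, not taken from the Bank Marketing Data Set.
rows = [
    {"contact": "cellular", "month": "may", "y": "no"},
    {"contact": "telephone", "month": "may", "y": "no"},
    {"contact": "telephone", "month": "may", "y": "no"},
    {"contact": "cellular", "month": "oct", "y": "yes"},
    {"contact": "cellular", "month": "mar", "y": "yes"},
]

model = train(rows, ["contact", "month"], "y")
label, scores = predict(model, {"contact": "cellular", "month": "oct"})
print(label, scores)
```

The scores are unnormalized; dividing each by their sum would turn them into class probabilities.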
Data Mining – Clustering
(SimpleKMeans)
• a process of grouping objects into classes called clusters
• definitions of the concept of a cluster:
- a set of objects that are "similar"
- a set of objects such that the distance between any two objects belonging to the cluster is less than the distance between any object in the cluster and any object outside it
• the SimpleKMeans algorithm as an example in WEKA
Discovering potentially useful patterns from a data set: clustering algorithm
SimpleKMeans
Each group is represented by the centroid of the documents that belong to it. Membership in a group is determined by finding the most similar group centroid for each document.
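The centroid-and-membership loop described here can be sketched as a basic k-means in Python. This is a simplified illustration for numeric attributes using squared Euclidean distance, not WEKA's SimpleKMeans; the initialization from the first k points and the (age, number of contacts) toy points are assumptions.

```python
def kmeans(points, k, iterations=10):
    """Basic k-means: returns (centroids, clusters)."""
    # Naive deterministic start: the first k points (an assumption for this sketch).
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Membership: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Representation: recompute each centroid as the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids, clusters

# Hypothetical (age, number of contacts) points.
points = [(25, 1), (30, 2), (28, 1), (62, 3), (65, 4), (70, 3)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

On this toy data the loop settles on one younger and one older group of three points each.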
Data Mining - Association
(Rules Function|Apriori)
• Association rule mining is an unsupervised data mining function
• It finds rules associated with frequently co-occurring items
• It gives rules that explain how items or events are associated with each other
• The Apriori algorithm is used to discover co-occurring items
Discovering potentially useful patterns from a data set: association algorithm
Apriori
Apriori finds rules with support greater than a specified minimum support and confidence greater
than a specified minimum confidence.
1. marital=married contact=telephone month=may 5454 ==> y=no 5283 conf:(0.97)
2. marital=married loan=no contact=telephone month=may 4511 ==> y=no 4367 conf:(0.97)
3. contact=telephone month=may 8251 ==> y=no 7979 conf:(0.97)
4. loan=no contact=telephone month=may 6819 ==> y=no 6593 conf:(0.97)
5. default=no contact=telephone month=may 5726 ==> y=no 5533 conf:(0.97)
6. default=no loan=no contact=telephone month=may 4749 ==> y=no 4587 conf:(0.97)
7. month=aug y=no 5523 ==> contact=cellular 5290 conf:(0.96)
8. month=aug 6178 ==> contact=cellular 5909 conf:(0.96)
9. loan=no month=aug y=no 4562 ==> contact=cellular 4362 conf:(0.96)
10. loan=no month=aug 5120 ==> contact=cellular 4890 conf:(0.96)
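The two numbers in each rule are instance counts: the first is how many instances match the left-hand side, the second how many of those also match the right-hand side. Confidence is their ratio, and support is the second count divided by the total number of instances. The short Python check below recomputes both for rules 3 and 8, assuming the 41188 instances stated earlier; the function names are chosen for this example.

```python
N_INSTANCES = 41188  # data set size stated on the data set slide

# (rule text, count matching the left-hand side, count matching both sides)
rules = [
    ("contact=telephone month=may ==> y=no", 8251, 7979),  # rule 3
    ("month=aug ==> contact=cellular", 6178, 5909),        # rule 8
]

def confidence(lhs_count, both_count):
    """Fraction of left-hand-side matches that also satisfy the right-hand side."""
    return both_count / lhs_count

def support(both_count, n_instances):
    """Fraction of all instances that satisfy the whole rule."""
    return both_count / n_instances

for text, lhs_count, both_count in rules:
    conf = confidence(lhs_count, both_count)
    supp = support(both_count, N_INSTANCES)
    print(f"{text}: conf={conf:.2f} support={supp:.3f}")
```

The recomputed confidences round to the 0.97 and 0.96 values reported in the list.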
Conclusion
Analysis
• presents techniques and methodologies in data mining, as well as Knowledge Discovery in Databases (KDD)
• analyses a large data set
• provides essential information and demonstrates relevant algorithms and techniques
Results
• knowledge which is potentially useful;
• computer search engines already provide the best results for achieving specific goals;
• WEKA helped to collect certain rules;
• the data was processed and a satisfactory final result achieved
Results
Will subscribe to a term deposit: YES
AGE: >65
JOB: services, blue-collar, technician, entrepreneur
MARITAL: married
EDUCATION: basic.9y, basic.6y, high.school
DEFAULT (has credit in default): unknown
HOUSING (has housing loan): no
LOAN (has personal loan): no big difference
CONTACT: telephone
MONTH: may, jun, jul, aug, nov
DAY OF WEEK: mon, fri
Will subscribe to a term deposit: NO
AGE: <65
JOB: admin, student, unemployed, retired
MARITAL: single
EDUCATION: university degree, unknown
DEFAULT (has credit in default): no
HOUSING (has housing loan): yes
LOAN (has personal loan): no big difference
CONTACT: cellular
MONTH: oct, sep, dec, mar, apr
DAY OF WEEK: tue, wed, thu
Who wants this data?
marketing companies / banking institutions
Thank you for listening