IT & BUSINESS INTELLIGENCE




     DATA MINING
            ON

         WEKA




       SATYAM KHATRI
          (10BM60081)
         MBA, VGSOM
       IIT KHARAGPUR
WEKA

WEKA is an open source collection of data mining and machine learning algorithms. It is a Java-based
tool created by researchers at the University of Waikato in New Zealand. WEKA is used for data
pre-processing, classification, clustering and association rule extraction.
Its main features are as follows:


    49 data preprocessing tools
    76 classification/regression algorithms
    8 clustering algorithms
    15 attribute/subset evaluators + 10 search algorithms for feature selection.
    3 algorithms for finding association rules
    3 graphical user interfaces
        “The Explorer” (exploratory data analysis)
        “The Experimenter” (experimental environment)
        “The Knowledge Flow” (new process model inspired interface)




WEKA FUNCTIONS AND TOOLS
        Preprocessing Filters
        Attribute selection
        Classification/Regression
        Clustering
        Association discovery
        Visualization


DOWNLOAD INSTRUCTIONS


    Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/
    Choose a self-extracting executable (including Java VM)
    If you are interested in modifying/extending WEKA, there is a developer version that includes the
    source code




WEKA DATA FORMATS
Data can be imported from a file in various formats such as ARFF, CSV and C4.5. Data can also be read
from a URL or from an SQL database (using JDBC).
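
To make the ARFF format concrete, here is a minimal sketch of a small ARFF file and a reader for it. The relation name, attribute names and the two data rows are made up for illustration, and `parse_simple_arff` is a hypothetical helper that handles only a small subset of the format (no quoting, no sparse rows, no missing values). It is not WEKA's own loader.

```python
# Illustrative ARFF content; names and values are invented for this sketch.
ARFF_TEXT = """\
% comment lines start with a percent sign
@relation investors
@attribute age numeric
@attribute marital_status {single,married}
@attribute monthly_income numeric
@data
44,single,30000
49,married,39000
"""


def parse_simple_arff(text):
    """Parse a minimal ARFF subset into (attributes, rows)."""
    attributes = []  # (name, declared type) pairs in declaration order
    rows = []        # one dict per data line
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                # numeric attributes become floats, nominal ones stay strings
                row[name] = float(value) if kind.lower() == "numeric" else value
            rows.append(row)
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


attributes, rows = parse_simple_arff(ARFF_TEXT)
print(attributes)
print(rows)
```

The header declares each attribute and its type (numeric, or a set of nominal values in braces), and every line after `@data` is one instance with values in the declared order.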
CLUSTERING
A cluster, by definition, is a group of similar objects. There could be clusters of people, brands or other
objects. If clusters are formed of customers similar to one another, then cluster analysis can help
marketers identify segments (clusters). If clusters of brands are formed, this can be used to gain insights
into brands that are perceived as similar to each other on a set of attributes. Cluster analysis is hence
used for customer segmentation. Cluster analysis is best performed when the variables are interval or
ratio-scaled.


There are two major classes of cluster analysis techniques
        hierarchical
        non-hierarchical


HIERARCHICAL CLUSTERING
Some measure of distance is used to identify distances between all pairs of objects to be clustered. One
of the popular distance measures is Euclidean distance. Another is squared Euclidean distance. We begin
with all objects in separate clusters. Say we have ten objects in separate clusters. The two closest
objects are joined to form a cluster. The remaining 8 objects remain separate. This is stage 1 of
hierarchical clustering. Later stages repeat the same merge step on the clusters that remain.
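
Stage 1 as described above can be sketched as follows. The ten 2-D points are made up for illustration, and `first_merge` is a hypothetical helper name. The sketch performs only the first merge using Euclidean distance and does not implement a full hierarchical clustering.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_merge(points):
    """Start with every point in its own cluster and join the closest pair."""
    clusters = [[p] for p in points]
    # Find the pair of indices with the smallest distance between them.
    i, j = min(
        combinations(range(len(points)), 2),
        key=lambda ij: euclidean(points[ij[0]], points[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    rest = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
    return [merged] + rest


# Ten illustrative objects described by two numeric attributes.
points = [(0, 0), (5, 5), (0.5, 0), (9, 1), (3, 8),
          (7, 7), (2, 4), (8, 3), (1, 9), (6, 0)]
clusters = first_merge(points)
print(len(clusters))   # one merged cluster plus eight singletons
print(clusters[0])     # the two closest objects
```

After this step there are nine clusters: one holding the two closest objects and eight singletons, matching the description of stage 1.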


NON HIERARCHICAL CLUSTERING
These are also known as k-means clustering methods. Here we need to specify the number of clusters we
want the objects to be clustered into. This can be done if we have a hypothesis that the objects will group
into a certain number of clusters. Alternatively, we can first do a hierarchical clustering on the data, find
the approximate number of clusters, and then perform a k-means clustering.


IMPLEMENTATION METHODS
        k - Means
        EM
        Cobweb
        X-means
        Farthest First
CLUSTERING ON WEKA


PROBLEM CASE
An Asset Management Company (AMC) wants to launch a new mutual fund scheme. The AMC wants to
segment the target market so that it can raise funds more easily by using different marketing strategies
for different segments.
The AMC segments the target market on the basis of the following parameters:
    1. Investor’s Age
    2. Marital status
    3. Investor’s Monthly income
    4. Region of Residence
    5. Investment in Derivatives
    6. Investment in Equities
    7. Investment in Fixed deposits
    8. Investment in Gold
    9. Existing number of Mutual fund schemes
    10. Existing loans
Data is collected from the public on the above parameters, and clustering is performed on it.



[Screenshots of the WEKA Explorer interface, one per step:]
    WEKA Explorer interface
    Processing on parameter "Investment in Gold"
    Processing on parameter "Existing number of Mutual fund schemes"
    Processing on parameter "Existing loans"
    Processing on parameter "Investor's Age"
    Processing on parameter "Investment in Fixed deposits"
    Processing on parameter "Marital status"
    Processing on parameter "Region of Residence"
    Processing on parameter "Investor's Monthly income"
    Processing on parameter "Investment in Derivatives"
    Visualization of the entire dataset

To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This
results in a drop-down list of available clustering algorithms. In this case we select "SimpleKMeans".
Next, click on the text box to the right of the "Choose" button to get the pop-up window with the
algorithm's options. Here, k-means clustering is done by dividing the data into 4 cluster groups.


The WEKA SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between
instances and clusters. In the pop-up window we enter 4 as the number of clusters (instead of the default
value of 2) and we leave the value of "seed" as is. The seed value is used in generating a random
number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in
general, k-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try
different seed values and evaluate the results.
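
The role of the seed can be illustrated with a small, self-contained k-means sketch. This is not WEKA's SimpleKMeans code. It assumes the common variant in which the seed selects k data points as the starting centroids, and the function name `kmeans` and the sample points are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, seed, max_iter=100):
    """Plain k-means: seed-dependent start, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random instances
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two well-separated illustrative groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for seed in (1, 2, 3):
    labels, centroids = kmeans(points, k=2, seed=seed)
    print(seed, labels)
```

On data this cleanly separated, every seed ends with the same two groups, although the cluster numbering may differ. On real data with overlapping groups, different seeds can give different final clusters, which is why trying several seeds is worthwhile.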


Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the
"Cluster Mode" panel, the "Use training set" option is selected, and we click "Start". We can right click the
result set in the "Result list" panel and view the results of clustering in a separate window.
CLUSTERING RESULTS
Clusters can be visualized as shown below




CLUSTER 1
It consists of people with an average age of 44 years, mostly male, who live in towns, have an average
monthly income of 30000 and are mostly single. They invest in equities, fixed deposits and gold, do not
invest in derivatives, and have existing loans.


CLUSTER 2
It consists of people with an average age of 49 years, mostly male, who live in towns, have an average
monthly income of 39000 and are mostly married. They invest in equities, fixed deposits and gold, do not
invest in derivatives, and have existing loans.


CLUSTER 3
It consists of people with an average age of 39 years, mostly male, who live in cities, have an average
monthly income of 24000 and are mostly married. They invest in gold and derivatives, do not invest in
equities or fixed deposits, and have existing loans.


CLUSTER 4
It consists of people with an average age of 40 years, mostly female, who live in cities, have an average
monthly income of 25000 and are mostly married. They invest in equities and fixed deposits, do not invest
in derivatives or gold, and have existing loans.
CLASSIFICATION VIA DECISION TREES IN WEKA


PROBLEM CASE
A market research firm wants to model people's investment decisions in various types of securities on the
basis of the following parameters: Investor's Age, Marital status, Investor's Monthly income, Region of
Residence, Investment in Derivatives, Investment in Equities, Investment in Fixed deposits, Investment in
Gold, Investment in Mutual funds, Existing loans. Based on this model, an investment decision by an
entity in a particular type of security can be predicted if the other parameters about that entity are known.


Data is collected from the public on the above parameters, and classification is performed on it.




Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier. Note that J48
(an implementation of the C4.5 algorithm) does not require discretization of numeric attributes, in contrast
to the ID3 algorithm from which C4.5 has evolved. Now we can specify the various parameters. These can
be specified by clicking in the text box to the right of the "Choose" button. In this example we accept the
default values. The default version does perform some pruning (using the subtree raising approach), but
does not perform reduced-error pruning.
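
One way a tree learner can use a numeric attribute without discretizing it first is to test binary splits of the form "value <= threshold" and keep the most informative one. The sketch below illustrates that idea with plain information gain and midpoints between sorted values as candidate thresholds. It is a simplified illustration, not J48's actual procedure (C4.5 uses the gain ratio and other refinements), and the ages and labels are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split on a numeric attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between equal values
        t = (v1 + v2) / 2  # candidate: midpoint between neighbours
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


# Illustrative data: investor age and whether they invest in equities.
ages = [25, 30, 35, 40, 45, 50]
invests = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(ages, invests))
```

For this toy data the best split falls between 35 and 40, which separates the two classes completely.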
Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach.
Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the
accuracy of the generated model. We now click "Start" to generate the model. The ASCII version of the
tree as well as evaluation statistics will appear in the right panel when the model construction is
completed. We can view this information in a separate window by right clicking the last result set (inside
the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.
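
The idea behind 10-fold cross-validation can be sketched as follows: the instances are divided into 10 folds, and each fold is held out once for testing while the other nine are used for training. The helper name `k_fold_indices` is hypothetical, and the sketch assigns instances to folds in a simple round-robin order. It omits the shuffling and stratification that a real tool would normally apply.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    # Round-robin assignment: instance i goes to fold i % k.
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


# Example with 25 instances: every instance is tested exactly once.
for train, test in k_fold_indices(25, k=10):
    print(len(train), test)
```

The accuracy reported for cross-validation is then computed over the predictions made on the held-out folds, so every instance contributes exactly one test prediction.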
We can also use our model to classify new instances. In the main panel, under "Test options", click the
"Supplied test set" radio button, and then click the "Set..." button. This will pop up a window which allows
you to open the file containing the test instances.
This once again generates the model from our training data, but this time it applies the model to the new
unclassified instances in order to predict the value of an attribute. Note that the summary of the results in
the right panel does not show any statistics.


WEKA also lets us view a graphical rendition of the classification tree. This can be done by right clicking
the last result set (as before) and selecting "Visualize tree" from the pop-up menu.




Note that by resizing the window and selecting various menu items from inside the tree view (using the
right mouse button), we can adjust the tree view to make it more readable.

Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 

DATA MINING on WEKA

  • 1. IT & BUSINESS INTELLIGENCE DATA MINING ON WEKA SATYAM KHATRI (10BM60081) MBA, VGSOM IIT KHARAGPUR
  • 2. WEKA
       WEKA is a collection of open source data mining and machine learning algorithms. It was created by researchers at the University of Waikato in New Zealand and is a Java-based, open source tool. WEKA is used for data pre-processing, classification, clustering and association rule extraction. Its main features are as follows:
         49 data preprocessing tools
         76 classification/regression algorithms
         8 clustering algorithms
         15 attribute/subset evaluators + 10 search algorithms for feature selection
         3 algorithms for finding association rules
         3 graphical user interfaces:
           “The Explorer” (exploratory data analysis)
           “The Experimenter” (experimental environment)
           “The Knowledge Flow” (new process model inspired interface)
       WEKA FUNCTIONS AND TOOLS
         Preprocessing filters
         Attribute selection
         Classification/Regression
         Clustering
         Association discovery
         Visualization
       DOWNLOAD INSTRUCTIONS
         Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/
         Choose a self-extracting executable (including Java VM)
         If you are interested in modifying or extending Weka, there is a developer version that includes the source code
       WEKA DATA FORMATS
       Data can be imported from a file in various formats such as ARFF, CSV and C4.5. Data can also be read from a URL or from an SQL database (using JDBC).
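As an illustration of the ARFF format mentioned above, here is a minimal sketch that writes a small table as ARFF text. The helper name `to_arff`, the relation name and the sample rows are hypothetical and chosen only for this example; the sketch handles numeric and nominal attributes only and does no quoting of special characters.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of allowed nominal values.
    rows: list of tuples, one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # Nominal attributes list their allowed values inside braces.
        spec = "numeric" if kind == "numeric" else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical sample data (not taken from the slides' dataset).
arff_text = to_arff(
    "investors",
    [("age", "numeric"), ("marital_status", ["single", "married"])],
    [(44, "single"), (49, "married")],
)
print(arff_text)
```

The header declares each attribute and its type, and every line after `@data` is one comma-separated instance.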
  • 3. CLUSTERING
       A cluster, by definition, is a group of similar objects. There could be clusters of people, brands or other objects. If clusters are formed of customers similar to one another, then cluster analysis can help marketers identify segments (clusters). If clusters of brands are formed, this can be used to gain insights into brands that are perceived as similar to each other on a set of attributes. Cluster analysis is hence used for customer segmentation. It is best performed when the variables are interval- or ratio-scaled.
       There are two major classes of cluster analysis techniques:
         hierarchical
         non-hierarchical
       HIERARCHICAL CLUSTERING
       Some measure of distance is used to identify distances between all pairs of objects to be clustered. One of the popular distance measures is Euclidean distance. Another is squared Euclidean distance. We begin with all objects in separate clusters. Say we have ten objects in separate clusters. The two closest objects are joined to form a cluster, and the remaining 8 objects remain separate. This is stage 1 of hierarchical clustering.
       NON HIERARCHICAL CLUSTERING
       Non-hierarchical methods are commonly known as k-means clustering methods. Here we need to specify the number of clusters we want the objects to be clustered into. This can be done if we have a hypothesis that the objects will group into a certain number of clusters. Alternatively, we can first do a hierarchical clustering on the data, find the approximate number of clusters, and then perform a k-means clustering.
       IMPLEMENTATION METHODS
         k-Means
         EM
         Cobweb
         X-means
         Farthest First
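Stage 1 of hierarchical clustering, as described above, can be sketched in a few lines. This is an illustration only, not WEKA's implementation; the function names and the five sample investors (age, monthly income in thousands) are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_pair(points):
    """Return the indices (i, j) of the two points with the smallest distance."""
    return min(
        combinations(range(len(points)), 2),
        key=lambda ij: euclidean(points[ij[0]], points[ij[1]]),
    )


def first_merge(points):
    """Start with every object in its own cluster and join the two closest."""
    clusters = [[i] for i in range(len(points))]
    i, j = closest_pair(points)
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Hypothetical data: (age, monthly income in thousands).
investors = [(44, 30), (49, 39), (39, 24), (40, 25), (60, 80)]
clusters = first_merge(investors)
print(clusters)
```

Using squared Euclidean distance instead would select the same pair here, because squaring does not change the ordering of non-negative distances. Repeating the merge step on the resulting clusters (with a rule for cluster-to-cluster distance) gives the later stages.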
  • 4. CLUSTERING ON WEKA
       PROBLEM CASE
       An Asset Management Company (AMC) wants to launch a new mutual fund scheme. The AMC wants to segment the target market so that it can raise funds more easily by using different marketing strategies for different segments. The AMC segments the target market on the basis of the following parameters:
         1. Investor’s age
         2. Marital status
         3. Investor’s monthly income
         4. Region of residence
         5. Investment in derivatives
         6. Investment in equities
         7. Investment in fixed deposits
         8. Investment in gold
         9. Existing number of mutual fund schemes
         10. Existing loans
       Data is collected from the public based on the above parameters, and clustering is performed on it.
       WEKA Explorer interface
  • 5. Processing on parameter “Investment in gold”. Processing on parameter “Existing number of mutual fund schemes”.
  • 6. Processing on parameter “Existing loans”. Processing on parameter “Age of investor”.
  • 7. Processing on parameter “Investment in fixed deposits”. Processing on parameter “Investor’s marital status”.
  • 8. Processing on parameter “Investor’s region of residence”. Processing on parameter “Investor’s monthly income”.
  • 9. Processing on parameter “Investment in derivatives”. Visualization of the entire dataset.
  • 10. To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This results in a drop-down list of available clustering algorithms. In this case we select "Simple K Means". Next, click on the text box to the right of the "Choose" button to get the pop-up window with the algorithm's options.
       K-means clustering is done here by dividing the data into 4 cluster groups. The WEKA Simple K Means algorithm uses the Euclidean distance measure to compute distances between instances and clusters. In the pop-up window we enter 4 as the number of clusters (instead of the default value of 2) and we leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different seed values and evaluate the results.
       Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window.
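To make the role of the seed concrete, the following is a minimal k-means sketch using Euclidean distance. It is not WEKA's Simple K Means code: the function names are hypothetical, the sample points are invented, and picking k random points as the starting centroids is an assumption of this sketch rather than a statement about WEKA's internals.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of numeric tuples; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the seed decides the starting centroids
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


def within_cluster_sse(points, centroids, assignment):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))


# Hypothetical data: (age, monthly income in thousands).
points = [(44, 30), (45, 31), (49, 39), (50, 40), (39, 24), (40, 25)]

# Try several seeds and keep the run with the lowest within-cluster error.
runs = {seed: kmeans(points, k=3, seed=seed) for seed in range(5)}
errors = {seed: within_cluster_sse(points, *run) for seed, run in runs.items()}
best_seed = min(errors, key=errors.get)
print(best_seed, errors[best_seed])
```

Comparing the within-cluster error across seeds is one simple way to "try different values and evaluate the results".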
  • 12. The clusters can be visualized; they are described below.
       CLUSTER 1
       It consists of people with an average age of 44 years, mostly male, who live in towns, have an average monthly income of 30000, are mostly single, invest in equities, fixed deposits and gold, do not invest in derivatives, and have existing loans.
       CLUSTER 2
       It consists of people with an average age of 49 years, mostly male, who live in towns, have an average monthly income of 39000, are mostly married, invest in equities, fixed deposits and gold, do not invest in derivatives, and have existing loans.
       CLUSTER 3
       It consists of people with an average age of 39 years, mostly male, who live in cities, have an average monthly income of 24000, are mostly married, invest in gold and derivatives, do not invest in equities or fixed deposits, and have existing loans.
       CLUSTER 4
       It consists of people with an average age of 40 years, mostly female, who live in cities, have an average monthly income of 25000, are mostly married, invest in equities and fixed deposits, do not invest in derivatives or gold, and have existing loans.
  • 13. CLASSIFICATION VIA DECISION TREES IN WEKA
       PROBLEM CASE
       A market research firm wants to model people's investment decisions in various types of securities on the basis of the following parameters: investor’s age, marital status, investor’s monthly income, region of residence, investment in derivatives, investment in equities, investment in fixed deposits, investment in gold, investment in mutual funds, and existing loans. Based on this model, an entity's investment decision in a particular type of security can be predicted if the other parameters about that entity are known.
       Data is collected from the public on the above parameters and classification is done.
       Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier. Note that J48 (an implementation of the C4.5 algorithm) does not require discretization of numeric attributes, in contrast to the ID3 algorithm from which C4.5 has evolved.
       Now we can specify the various parameters by clicking in the text box to the right of the "Choose" button. In this example we accept the default values. The default version does perform some pruning (using the subtree raising approach), but does not perform reduced-error pruning.
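The reason a C4.5-style learner needs no prior discretization is that it searches for a binary threshold on each numeric attribute while building the tree. The sketch below illustrates that idea only. It is not J48's code: it scores candidate splits with plain information gain and uses midpoints between adjacent values, whereas C4.5 itself uses the gain ratio criterion. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weighted entropy remaining after the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical data: investor age and whether the investor holds equities.
ages = [25, 32, 38, 41, 47, 55]
invests_in_equities = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(ages, invests_in_equities)
print(threshold, gain)
```

A tree node built from this result would test `age <= threshold` and send instances down one of two branches.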
  • 14.
  • 15. Under "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model. We now click "Start" to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is completed. We can view this information in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.
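The idea behind 10-fold cross-validation can be sketched as follows: the data is split into 10 parts, and each part is used once for testing while the other 9 are used for training. This is a simplified illustration, not WEKA's evaluation code. The function names, the majority-class baseline and the sample labels are hypothetical, and the sketch does not stratify folds by class.

```python
import random
from collections import Counter


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(instances, labels, train_fn, predict_fn, k=10, seed=1):
    """Return the fraction of instances classified correctly across all folds."""
    correct = 0
    for train, test in k_fold_indices(len(instances), k, seed):
        model = train_fn([instances[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_fn(model, instances[i]) == labels[i] for i in test)
    return correct / len(instances)


# Hypothetical baseline "classifier": always predict the most common training label.
def train_majority(train_instances, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, instance):
    return model


instances = list(range(20))
labels = ["yes"] * 14 + ["no"] * 6
accuracy = cross_validate(instances, labels, train_majority, predict_majority)
print(accuracy)
```

Because every instance is tested exactly once by a model that never saw it during training, the resulting accuracy is a fairer estimate than evaluating on the training data itself.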
  • 16. We can also use our model to classify the new instances. In the main panel, under "Test options" click the "Supplied test set" radio button, and then click the "Set..." button. This will pop up a window which allows you to open the file containing test instances.
  • 17. This once again generates the model from our training data, but this time it applies the model to the new unclassified instances in order to predict the value of an attribute. Note that the summary of the results in the right panel does not show any statistics. WEKA also lets us view a graphical rendition of the classification tree. This can be done by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu. Note that by resizing the window and selecting various menu items from inside the tree view (using the right mouse button), we can adjust the tree view to make it more readable.