WEKA is an open-source data mining and machine learning workbench written in Java, developed at the University of Waikato. It provides tools for data pre-processing, classification, regression, clustering, association rules, and visualization. This document demonstrates WEKA's clustering and decision-tree classification algorithms on sample investment data, first to segment investors and then to predict their investment choices.
DATA MINING on WEKA
1. IT & BUSINESS INTELLIGENCE
DATA MINING
ON
WEKA
SATYAM KHATRI
(10BM60081)
MBA, VGSOM
IIT KHARAGPUR
2. WEKA
WEKA is a collection of open-source data mining and machine learning algorithms. Created by researchers at the University of Waikato in New Zealand, it is a Java-based, open-source tool. WEKA is used for data pre-processing, classification, clustering, and association rule extraction.
Its main features are as follows:
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection.
3 algorithms for finding association rules
3 graphical user interfaces
“The Explorer” (exploratory data analysis)
“The Experimenter” (experimental environment)
“The Knowledge Flow” (new process model inspired interface)
WEKA FUNCTIONS AND TOOLS
Preprocessing Filters
Attribute selection
Classification/Regression
Clustering
Association discovery
Visualization
DOWNLOAD INSTRUCTIONS
Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/
Choose a self-extracting executable (including Java VM)
If you are interested in modifying or extending Weka, there is a developer version that includes the
source code
WEKA DATA FORMATS
Data can be imported from files in various formats such as ARFF, CSV, and C4.5. Data can also be read
from a URL or from an SQL database (using JDBC)
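As a concrete illustration of the ARFF format, here is a minimal sketch in Python. The relation and attribute names are hypothetical, chosen only to echo the investor data used later in this document; they are not the actual dataset.

```python
# A minimal sketch of WEKA's native ARFF format. The relation and attribute
# names below are hypothetical illustrations, not the document's real data.
arff_text = """\
@relation investors

@attribute age            numeric
@attribute marital_status {single,married}
@attribute gold           {yes,no}

@data
44,single,yes
39,married,no
"""

# The header declares the relation name and each attribute's type (numeric,
# or a nominal value set); rows after @data list values in attribute order,
# comma-separated like a CSV file.
for line in arff_text.splitlines():
    if line.startswith("@attribute"):
        print(line.split()[1])
```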
3. CLUSTERING
A cluster, by definition, is a group of similar objects. There could be clusters of people, brands, or other
objects. If clusters are formed of customers similar to one another, cluster analysis can help
marketers identify segments (clusters). If clusters of brands are formed, this can be used to gain insight
into which brands are perceived as similar to each other on a set of attributes. Cluster analysis is hence
used for customer segmentation. It is best performed when the variables are interval- or
ratio-scaled.
There are two major classes of cluster analysis techniques:
hierarchical
non-hierarchical
HIERARCHICAL CLUSTERING
Some measure of distance is used to compute distances between all pairs of objects to be clustered. One
popular distance measure is Euclidean distance; another is squared Euclidean distance. We begin with
all objects in separate clusters. Say we have ten objects, each in its own cluster. The two closest objects
are joined to form a cluster, while the remaining eight objects stay separate. This is
stage 1 of hierarchical clustering.
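Stage 1 described above can be sketched in a few lines of Python: compute the Euclidean distance between every pair and merge the closest pair. The ten 2-D points below are invented for illustration.

```python
import math

# A sketch of stage 1 of hierarchical clustering: compute Euclidean distances
# between all pairs of objects and merge the closest pair. The ten 2-D points
# are made-up illustrations, not the document's data.
points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0),
          (8.8, 1.2), (2.0, 6.0), (2.1, 6.2), (7.0, 7.5), (0.5, 9.0)]

def euclidean(p, q):
    # sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Find the closest pair -- these two objects form the first cluster,
# while the remaining eight objects stay in their own clusters.
pairs = [(i, j) for i in range(len(points)) for j in range(i + 1, len(points))]
i, j = min(pairs, key=lambda ij: euclidean(points[ij[0]], points[ij[1]]))
print(f"first merge: objects {i} and {j}, "
      f"distance {euclidean(points[i], points[j]):.3f}")
```

Squared Euclidean distance is the same computation without the final square root; it preserves the ordering of distances, so the same pair would be merged first.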
NON-HIERARCHICAL CLUSTERING
These are also known as k-means clustering methods. We need to specify the number of clusters we want
the objects to be grouped into, which is possible if we have a hypothesis that the objects will form a
certain number of clusters. Alternatively, we can first do a hierarchical clustering on the data, find the
approximate number of clusters, and then perform a k-means clustering.
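As a rough sketch of what a k-means method does (this toy pure-Python code is not WEKA's implementation, and the 1-D incomes are invented): pick k initial centroids, assign each object to its nearest centroid, recompute each centroid as its cluster's mean, and repeat.

```python
import random

# A toy pure-Python k-means sketch (not WEKA's implementation).
def kmeans(values, k, seed=1, iterations=20):
    rng = random.Random(seed)            # the seed fixes the initial pick,
    centroids = rng.sample(values, k)    # so results can vary with the seed
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

incomes = [24000, 25000, 30000, 39000, 24500, 38000, 29500, 25500]
centroids, clusters = kmeans(incomes, k=2)
print(sorted(round(c) for c in centroids))
```

Because the initial centroids depend on the seed, different seeds can converge to different clusterings, which is why it is worth trying several seed values and comparing the results.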
IMPLEMENTATION METHODS
k-Means
EM
Cobweb
X-means
Farthest First
4. CLUSTERING ON WEKA
PROBLEM CASE
An Asset Management Company (AMC) wants to launch a new mutual fund scheme. The AMC wants to
segment the target market so that it can raise funds easily by using different marketing strategies for
different segments of the target market.
The AMC segments the target market on the basis of the following parameters:
1. Investor’s Age
2. Marital status
3. Investor’s Monthly income
4. Region of Residence
5. Investment in Derivatives
6. Investment in Equities
7. Investment in Fixed deposits
8. Investment in Gold
9. Existing number of Mutual fund schemes
10. Existing loans
Data is collected from the public on the above parameters, and the clustering function is performed on it.
WEKA Explorer interface
5. Processing on parameter Investment in Gold
Processing on parameter Existing Number of Mutual fund schemes
10. To perform clustering, select the "Cluster" tab in the Explorer and click the "Choose" button. This
brings up a drop-down list of available clustering algorithms; in this case we select "SimpleKMeans".
Next, click on the text box to the right of the "Choose" button to get the options pop-up window. Here,
k-means clustering is done by dividing the data into 4 cluster groups.
The WEKA SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between
instances and clusters. In the pop-up window we enter 4 as the number of clusters (instead of the default
value of 2) and leave the value of "seed" as is. The seed value is used to generate a random
number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in
general, k-means is quite sensitive to how clusters are initially assigned, so it is often necessary to try
different seed values and evaluate the results.
Once the options have been specified, we can run the clustering algorithm. Here we make sure that the
"Use training set" option is selected in the "Cluster mode" panel, and we click "Start". We can then
right-click the result set in the "Result list" panel and view the results of clustering in a separate window.
12. Clusters can be visualized as shown below
CLUSTER 1
It consists of people with an average age of 44 years, mostly male, who stay in towns, have an average
monthly income of 30,000, are mostly single, invest in equities, fixed deposits, and gold, do not invest in
derivatives, and have existing loans.
CLUSTER 2
It consists of people with an average age of 49 years, mostly male, who stay in towns, have an average
monthly income of 39,000, are mostly married, invest in equities, fixed deposits, and gold, do not invest
in derivatives, and have existing loans.
CLUSTER 3
It consists of people with an average age of 39 years, mostly male, who stay in cities, have an average
monthly income of 24,000, are mostly married, invest in gold and derivatives, do not invest in equities
and fixed deposits, and have existing loans.
CLUSTER 4
It consists of people with an average age of 40 years, mostly female, who stay in cities, have an average
monthly income of 25,000, are mostly married, invest in equities and fixed deposits, do not invest in
derivatives and gold, and have existing loans.
13. CLASSIFICATION VIA DECISION TREES IN WEKA
PROBLEM CASE
A market research firm wants to model the investment decisions made by people in various types of
securities on the basis of the following parameters: investor's age, marital status, investor's monthly
income, region of residence, investment in derivatives, investment in equities, investment in fixed
deposits, investment in gold, investment in mutual funds, and existing loans. Based on this model, the
investment decision by an entity in a particular type of security can be predicted if the other parameters
for that entity are known.
Data is collected from the public on the above parameters and classification is done.
Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier. Note that J48
(an implementation of the C4.5 algorithm) does not require discretization of numeric attributes, in
contrast to the ID3 algorithm from which C4.5 evolved. Now we can specify the various parameters by
clicking in the text box to the right of the "Choose" button; in this example we accept the default values.
The default version does perform some pruning (using the subtree raising approach), but
does not perform error pruning.
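The split criterion driving this family of decision-tree learners can be illustrated with a small Python sketch of information gain (C4.5 itself uses gain ratio, a normalized variant). The investor labels and the "married" split below are hypothetical.

```python
import math
from collections import Counter

# A sketch of the split criterion behind decision-tree learners like
# ID3/C4.5: information gain. (C4.5 actually uses gain ratio, a normalized
# variant.) The labels and the "married" split are hypothetical.
def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, subsets):
    # Gain = entropy before the split minus the size-weighted entropy after.
    total = len(parent)
    return entropy(parent) - sum(len(s) / total * entropy(s) for s in subsets)

# Eight investors labelled by whether they invest in equities ("yes"/"no"),
# split on a hypothetical "married" attribute into two subsets.
parent  = ["yes"] * 5 + ["no"] * 3
married = ["yes", "yes", "yes", "yes", "no"]   # married = true
single  = ["yes", "no", "no"]                  # married = false

gain = information_gain(parent, [married, single])
print(f"information gain of the split: {gain:.3f}")
```

The tree builder evaluates this gain for every candidate attribute and splits on the one with the highest value, then recurses on each subset.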
14.
15. Under "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach.
Since we do not have a separate evaluation data set, this is necessary to get a reasonable estimate of the
accuracy of the generated model. We now click "Start" to generate the model. The ASCII version of the
tree as well as the evaluation statistics will appear in the right panel when the model construction is
completed. We can view this information in a separate window by right-clicking the last result set (inside
the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.
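The idea behind 10-fold cross-validation can be sketched as follows: split the instances into k folds, train on k-1 of them, test on the held-out fold, and average the k accuracy estimates. This simplified Python illustration only builds the index splits; real implementations typically also shuffle and, for classification, stratify the folds.

```python
# A sketch of k-fold cross-validation index construction: every instance
# appears in exactly one test fold and in the training set of all others.
# (Simplified: no shuffling or stratification.)
def kfold_indices(n_instances, k):
    folds = []
    for i in range(k):
        test = list(range(i, n_instances, k))               # every k-th instance
        train = [j for j in range(n_instances) if j % k != i]
        folds.append((train, test))
    return folds

folds = kfold_indices(20, 10)
print(len(folds), len(folds[0][1]))   # 10 folds, 2 test instances each
```

For each of the 10 (train, test) pairs, a tree would be built on the training indices and evaluated on the test indices; the mean of the 10 accuracies is the figure WEKA reports.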
16. We can also use our model to classify new instances. In the main panel, under "Test options", click the
"Supplied test set" radio button, and then click the "Set..." button. This pops up a window which allows
you to open the file containing the test instances.
17. This once again generates the model from our training data, but this time applies the model to the new
unclassified instances in order to predict the value of an attribute. Note that the summary of the results in
the right panel does not show any statistics.
WEKA also lets us view a graphical rendition of the classification tree. This can be done by right-clicking
the last result set (as before) and selecting "Visualize tree" from the pop-up menu.
Note that by resizing the window and selecting various menu items from inside the tree view (using the
right mouse button), we can adjust the tree view to make it more readable.