This document provides an overview of using the WEKA data mining tool to perform two common techniques: clustering and linear regression. It first introduces WEKA and its interfaces. It then provides details on k-means clustering, including how to implement it in WEKA on a sample BMW customer dataset. This identifies five distinct customer clusters. The document also explains linear regression and uses a house pricing dataset in WEKA to build a regression model to predict house value based on features.
1. WEKA Manual (IT for Business Intelligence) Sagar (10BM60075)
IT for Business Intelligence (BM61080)
Data Mining Techniques using WEKA
Sagar (10BM60075)
This term paper contains a brief introduction to WEKA – a powerful data mining tool along with a
guide to two data mining techniques - Clustering (k-means) and Linear Regression, using WEKA tool.
Vinod Gupta School of Management
Data Mining Techniques using WEKA:
WEKA: Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software
written in Java, developed at the University of Waikato, New Zealand. The Weka workbench contains a collection
of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user
interfaces for easy access to this functionality. Weka supports several standard data mining tasks, more
specifically, data preprocessing, clustering, classification, regression, visualization, and feature selection. Weka's
main user interface is the Explorer, but essentially the same functionality can be accessed through the component-
based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the
systematic comparison of the predictive performance of Weka's machine learning algorithms on a collection of
datasets.
Interfaces –
Command Line Interface (CLI)
Graphical User Interface (GUI)
The WEKA GUI Chooser –
Fig. 1
The buttons can be used to start the following applications –
Explorer – this is the environment for exploring data with WEKA and gives access to all the facilities using
menu selection and form filling.
Experimenter – Helps answer the question: which methods and parameter values work best for
the given problem?
Knowledge Flow – Supports incremental learning and allows designing configurations for streamed data
processing. Incremental algorithms can be used to process very large datasets.
Simple CLI – A simple Command Line Interface for executing WEKA commands directly.
The Explorer interface features several panels providing access to the main components of the
workbench:
The preprocess panel has facilities for importing data from a database, a CSV file, etc., and for
preprocessing this data using a filtering algorithm. It is possible to transform the data and delete
instances and attributes according to specific criteria.
The classify panel lets the user apply classification and regression algorithms, estimate the accuracy of the
resulting predictive model, and visualize erroneous predictions, ROC curves, etc.
The associate panel provides access to association rule learners that attempt to identify all important
interrelationships between attributes in the data.
The cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm.
There is also an implementation of the expectation maximization algorithm.
The select attributes panel provides algorithms for identifying the most predictive attributes in a dataset.
The visualize panel shows a scatter plot matrix, where individual scatter plots can be selected and
enlarged, and analyzed further using various selection operators.
This paper will demonstrate the following two data mining techniques in WEKA:
Clustering (Simple K Means)
Linear regression
Clustering in WEKA
Clustering: Clustering can be loosely defined as: The process of organizing objects into groups whose members
are similar in some way. Clustering is the task of assigning a set of objects into groups (called clusters) so that the
objects in the same cluster are more similar to each other than to those in other clusters. The clusters found by
different algorithms vary significantly in their properties, and understanding these "cluster models" is key to
understanding the differences between the various algorithms. Typical cluster models include:
Connectivity models: Models based on distance connectivity.
Centroid models: The k-means algorithm represents each cluster by a single mean vector.
Distribution models: Clusters are modeled using statistical distributions, such as multivariate normal
distributions.
Density models: Clusters are seen as connected dense regions in the data space.
Subspace models: Clusters are modeled with both cluster members and relevant attributes.
Group models: Do not provide a refined model, just the grouping information.
Graph-based models: A clique (a subset of nodes in a graph such that every two nodes in the subset are
connected by an edge) can be considered as a prototypical form of cluster.
Clustering algorithms may be classified as listed below:
Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering
Four of the most used clustering algorithms are:
K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians
K-means is an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm, Hierarchical
clustering is, as its name says, hierarchical, and Mixture of Gaussians is a probabilistic clustering algorithm.
K-Means Clustering: K-means (MacQueen, 1967) is one of the simplest clustering algorithms. The procedure is a simple
way to partition a given data set into a certain number of clusters (assume k clusters) fixed a priori.
The main idea is to define k centroids, one for each cluster. The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent
initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into
groups from which the metric to be minimized can be calculated.
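The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not WEKA's SimpleKMeans implementation: the function names, the toy points, the choice of squared Euclidean distance and the rule of stopping when assignments no longer change (equivalent to the centroids no longer moving) are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignments, sse) for a list of numeric points."""
    rng = random.Random(seed)
    # Step 1: use k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments (and so the centroids) are stable.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # The metric being minimized: within-cluster sum of squared distances.
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, sse


# Toy data (made up for illustration): two well-separated groups.
toy_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
toy_centroids, toy_assignments, toy_sse = k_means(toy_points, k=2)
print(toy_assignments, round(toy_sse, 3))
```

Running the function with several seeds and several values of k, and comparing the returned sum of squared distances, is one simple way to compare candidate clusterings.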
The k-means algorithm does not necessarily find the optimal configuration. The algorithm is also
sensitive to the initial randomly selected cluster centres. There is no general theoretical solution for finding the
optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs
with different values of k and choose the best one.
Why do Clustering? (Business Applications)
Market Segmentation
Identifying market needs
To better understand the relationships between different groups of consumers/potential customers
Product positioning
New product opportunities
Selecting test markets
Clustering can be used to group all the shopping items available on the web into a set of unique products
K-Means Clustering in WEKA:
The data set we'll use for our clustering example focuses on a fictional BMW dealership. The dealership has kept track
of how people walk through the dealership and the showroom, what cars they look at, and how often they
ultimately make purchases. They are hoping to mine this data by finding patterns in the data and by using clusters
to determine if certain behaviors in their customers emerge. There are 100 rows of data in this sample, and each
column describes the steps that the customers reached in their BMW experience, with a column having a 1 (they
made it to this step or looked at this car), or 0 (they didn't reach this step).
The ARFF data we'll be using with WEKA is:
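The original listing of the file is not reproduced in this text. As a rough guide to what an ARFF file looks like, the sketch below builds one from a list of attribute names and 0/1 rows. The attribute names are assumptions based on the steps mentioned in this paper, and the two data rows are made up; neither is taken from the actual bmw-browsers file.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: a @relation line, one @attribute line per column, then @data."""
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines)


# Hypothetical column names and made-up rows, for illustration only.
columns = ["Dealership", "Showroom", "ComputerSearch", "M5",
           "ThreeSeries", "Z4", "Financing", "Purchase"]
sample_rows = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 0],
]
arff_text = to_arff("bmw-browsers", columns, sample_rows)
print(arff_text)
```

Each row under @data holds one customer, with the values in the same order as the @attribute declarations.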
Steps to be followed for doing K-Means Clustering in WEKA:
Step 1: Select Explorer in the Weka GUI Chooser window (Fig.1)
Step 2: The following window appears:
Fig. 2
Step 3: Select “Open File” and load the ARFF data file bmw-browsers. After loading the file, the interface will be
like this –
Fig. 3
Step 4: With this data set, we are looking to create clusters, so click on the Cluster tab. Click Choose and
select SimpleKMeans, set the numClusters value to 5 and click ok:
Fig. 4
Step 5: For viewing the distribution of all variables in the population, we can click on “Visualize All”:
Fig. 5
Step 6: Now we are ready to run the clustering algorithm. Clustering 100 rows of data into five clusters would likely
take a few hours of work in a spreadsheet, but WEKA can produce the answer in less than a second.
Select "Use training set" in the Cluster mode panel and then click the Start button to begin the clustering process.
Fig. 6
Step 7: To display the result in a separate window, right-click the result in the Result list panel and select View
in a separate window. The following result is displayed:
Fig. 7
Results Interpretation:
The output tells how each cluster comes together: with a "1" meaning everyone in that cluster shares the same
value of one, and a "0" meaning everyone in that cluster has a value of zero for that attribute. Each cluster shows
a type of behavior in customers, from which we can draw conclusions:
Cluster 0:
o The "Dreamers"
o Wander around the dealership
o Don't purchase anything
Cluster 1:
o The "M5 Lovers"
o Not a high purchase rate: only 52 percent
o A potential problem that could be a focus for improvement
o More salespeople could be sent to the M5 section
Cluster 2:
o The "Throw-Aways"
o No good conclusions from their behavior
Cluster 3:
o The "BMW Babies"
o Always purchase a car and finance it
o They walk around the lot looking at cars and then go to the computer search available at the
dealership
o Suggestion: make the search computers more prominent around the lot
o Tend to buy M5s or Z4s
Cluster 4:
o The "Starting out with BMW"
o These look at the 3-series and never at the much more expensive M5
o Do not walk around the lot and ignore the computer search terminals
o While 50 percent get to the financing stage, only 32 percent ultimately finish the transaction
o These know exactly what kind of car they want (the 3-series entry-level model)
o Sales to this group can be increased by relaxing financing standards or by reducing the 3-series
prices
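The percentages quoted in the cluster descriptions above can be read straight off the centroids: for an attribute that only takes the values 0 and 1, the cluster mean is simply the share of cluster members with a 1. The short sketch below shows this on a made-up cluster of four customers; the column meanings and the values are assumptions for illustration, not rows from the BMW data.

```python
def attribute_shares(rows):
    """Mean of each 0/1 column, i.e. the share of rows with a 1 in that column."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


# Made-up cluster; hypothetical columns: M5, Financing, Purchase.
example_cluster = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
shares = attribute_shares(example_cluster)
print(shares)  # [1.0, 0.75, 0.5]
```

Here every member looked at the M5, three out of four reached financing, and half made a purchase.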
The data in these clusters can also be inspected visually. To do this:
Right click the result in the Result list panel
Select Visualize cluster assignments
Setting the X-axis variable to M5 and the Y-axis variable to Purchase gives the following output:
Fig. 8
This figure shows in a chart how the clusters are grouped in terms of who looked at the M5 and who purchased
one. Also, turn up the "Jitter" to about three-fourths of the way maxed out, which will artificially scatter the plot
points to allow us to see them more easily.
The visual results match the conclusions we drew above. We can see in the X=1, Y=1 point (those who looked at
M5s and made a purchase) that the only clusters represented here are 1 and 3. We also see that the only clusters
at point X=0, Y=0 are 4 and 0. Clusters 1 and 3 were buying the M5s, while cluster 0 wasn't buying anything, and
cluster 4 was only looking at the 3-series. By changing X and Y axes, we can identify other trends and patterns.
Other clustering methods can also be used to group the data into clusters. WEKA is particularly useful in the clustering
process when the data set is large, since it can generate clusters quickly even then. Because clustering has
many business applications, WEKA is a practical tool for clustering data in real business scenarios.
Linear Regression using WEKA
Regression: Regression is the easiest technique to use, but is also probably the least powerful. This model can
be as easy as one input variable and one output variable (Scatter diagram in Excel, or an XY Diagram in
OpenOffice.org). It can get more complex than that, including dozens of input variables. Regression models all fit
the same general pattern: there are a number of independent variables, which, when taken together, produce a
result — a dependent variable. The regression model is then used to predict the result of an unknown dependent
variable, given the values of the independent variables. Correlation analysis can be applied to determine the
degree to which variables are related. Broadly, regression can be classified into two types:
Simple linear regression (one dependent variable and one independent variable)
Multiple regression (one dependent variable and many independent variables)
The process of Multiple regression in WEKA is described with an example in this term paper.
Business applications of Regression
Pricing decisions
Trend Line Analysis
Risk Analysis for Investments
Sales or Market Forecasts
To predict the demographics and types of future work forces for large companies.
Total quality control
Regression in WEKA:
The price of the house (the dependent variable) is the result of many independent variables — the square footage
of the house, the size of the lot, whether granite is in the kitchen, bathrooms are upgraded, etc. Let's continue this
example of a house price-based regression model, and create some real data to examine. These are actual
numbers from houses for sale, and we will try to find the value for a house.
House values for regression model:
House size (sq ft)   Lot size   Bedrooms   Granite   Upgraded bathroom?   Selling price
3529                 9191       6          0         0                    $205,000
3247                 10061      5          1         1                    $224,900
4032                 10150      5          0         1                    $197,900
2397                 14156      4          1         0                    $189,900
2200                 9600       4          0         1                    $195,000
3536                 19994      6          1         1                    $325,000
2983                 9365       5          0         1                    $230,000
3198                 9669       5          1         1                    ????
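To make the idea of fitting a regression model concrete, the sketch below fits an ordinary least-squares model to the seven priced houses in the table and estimates the price of the last house. It is a minimal illustration, not WEKA's LinearRegression: the function names are hypothetical, the granite column is left out on the assumption that it does not contribute to the model, and the normal equations are solved with plain Gaussian elimination. WEKA applies its own options (for example attribute selection), so its numbers need not match exactly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(features, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept term
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve(xtx, xty)


def predict(coeffs, feature_row):
    """Apply the fitted equation to one row of independent variables."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], feature_row))


# Columns: house size, lot size, bedrooms, upgraded bathroom (granite omitted).
houses = [
    (3529, 9191, 6, 0),
    (3247, 10061, 5, 1),
    (4032, 10150, 5, 1),
    (2397, 14156, 4, 0),
    (2200, 9600, 4, 1),
    (3536, 19994, 6, 1),
    (2983, 9365, 5, 1),
]
prices = [205000, 224900, 197900, 189900, 195000, 325000, 230000]

coeffs = fit_linear(houses, prices)
estimate = predict(coeffs, (3198, 9669, 5, 1))
print("coefficients:", [round(c, 4) for c in coeffs])
print("estimated selling price:", round(estimate))
```

The first coefficient is the intercept; the others multiply house size, lot size, bedrooms and upgraded bathroom in that order.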
Steps to be followed:
Step 1: Select Explorer in the WEKA GUI Chooser window and load the file houses. The following screen appears:
Fig. 9
Step 2: Click the Classify tab in the Explorer window and click the Choose button in the Classifier panel. Then select
LinearRegression from functions:
Fig. 10
WEKA automatically identifies the dependent variable as Selling Price. If it does not, we can select the
dependent variable manually.
Step 3: Press the Start button and the following output is generated:
Fig. 11
As described earlier in the clustering example, the output can also be viewed in a separate window.
We can also visualize the classifier errors, i.e. the instances that are wrongly predicted by the regression equation,
by right-clicking the result set in the Result list panel and selecting Visualize classifier errors.
Fig. 12
Interpreting the regression model:
Let us interpret the patterns and conclusions that our model tells us, besides just a strict house value:
Granite doesn't matter: WEKA will only use columns that statistically contribute to the accuracy of the
model. This regression model is telling us that granite in your kitchen doesn't affect the house's value.
Bathrooms do matter: We can use the coefficient from the regression model to determine the value of
an upgraded bathroom on the house value.
Bigger houses reduce the value: WEKA is telling us that the bigger our house is, the lower the selling
price. This can be seen by the negative coefficient in front of the houseSize variable. The house size,
unfortunately, isn't an independent variable because it's related to the bedrooms variable, which makes
sense, since bigger houses tend to have more bedrooms.
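The relationship between house size and bedrooms mentioned above can be checked with a correlation coefficient. The sketch below computes the Pearson correlation between the two columns for the seven priced houses in the table. The function name is hypothetical, and Pearson correlation is one common choice of measure, not necessarily the one WEKA reports.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# House size and bedroom columns of the seven priced houses in the table.
house_size = [3529, 3247, 4032, 2397, 2200, 3536, 2983]
bedrooms = [6, 5, 5, 4, 4, 6, 5]

size_bedroom_corr = pearson(house_size, bedrooms)
print(f"correlation(house size, bedrooms) = {size_bedroom_corr:.2f}")
```

A value near +1 means the two variables rise together, which is why the sign of an individual coefficient such as the one for house size should be interpreted with care.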
Other applications of WEKA in data mining:
WEKA can be used for various other data mining techniques:
Classification (using decision trees)
Collaborative filtering (Nearest Neighbor)
Association
References:
a) Data Mining by Ian H. Witten, Eibe Frank and Mark A. Hall (3rd edition, Morgan Kaufmann publisher)
b) www.wikipedia.org
c) http://www2.cs.uregina.ca/~dbd/cs831/notes/clustering/clustering.html
d) http://www.ibm.com/developerworks/opensource/library/os-weka2/index.html